Note that there were four benchmark results in the cloud set with percentage differences greater than 10,000% which I’ve removed as outliers. Those were not included in the calculations above; if they were included the cloud numbers would be substantially worse. I opted to remove them after inspecting them and finding inconsistencies in those benchmark results which led me to suspect that the logs were damaged. For example, one benchmark shows the time for each iteration increased by more than 200x but the throughput for the same benchmark appears to have increased slightly, rather than decreased as one would expect.
... many of the comparisons show shifts of +-2%, roughly similar to the noise observed in local benchmarks. However, differences of as much as 50% are fairly common with no change in the code at all, which makes it very difficult to know if a change in benchmarking results is due to a change in the true performance of the code being benchmarked, or if it is simply noise. Hence, unreliable.
This is only one case study, on Travis-CI, but I think the same holds for any CI service: there is enough noise to undermine the reliability of benchmarks.
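The first fix that comes to mind is to increase the number of benchmark runs and use their median as the metric. For example, the benchmark job runs the benchmark script 10 times and reports the median. However, if the CPU of the virtual machine running that job happens to be thermally throttled, all 10 measurements degrade and the median is dragged down with them. You can split the measurements across multiple jobs to reduce that effect, but it takes extra work and a certain amount of noise still remains.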
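As a minimal sketch of the median approach and its limit, the Python below times a function several times and takes the median. The helper names (`measure_once`, `median_runtime`) and the sample timings are hypothetical, made up only for illustration. It shows that a median absorbs a single spike but not a slowdown that affects every run, which is the thermal throttling case.

```python
import statistics
import time


def measure_once(fn):
    """Return the wall-clock duration of a single call to fn, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def median_runtime(fn, runs=10):
    """Run fn `runs` times and return the median duration in seconds."""
    return statistics.median(measure_once(fn) for _ in range(runs))


# Hypothetical timings (seconds), not real measurements.
# One run out of ten is hit by a noisy neighbour:
one_spike = [1.00, 1.01, 0.99, 1.02, 5.00, 1.00, 0.98, 1.01, 1.00, 0.99]
# The whole job runs on a throttled machine, so every run is 1.5x slower:
throttled = [t * 1.5 for t in one_spike]

median_one_spike = statistics.median(one_spike)   # the spike is ignored
median_throttled = statistics.median(throttled)   # the slowdown is not
```

With these made-up numbers the median of `one_spike` stays at the typical value, while the median of `throttled` shifts by the full slowdown factor even though the code under test did not change.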
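The next fix that comes to mind is to set up your own CI runners and run the benchmarks there; in GitHub Actions terms, self-hosted runners. Because this gives you exclusive use of a physical machine, it eliminates a good deal of the noise, but it raises the issues of the effort needed to set it up and the cost of maintaining it.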
Like the approach introduced in this article, CodSpeed also appears to use a CPU simulator and to benchmark based on CPU instruction counts and CPU cache hit rates.
A benchmark will be run only once and the CPU behavior will be simulated. This ensures that the measurement is as accurate as possible, taking into account not only the instructions executed but also the cache and memory access patterns. The simulation gives us an equivalent of the CPU cycles that includes cache and memory access.
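To make the idea of "an equivalent of the CPU cycles" concrete, here is a rough Python sketch of how simulated counts could be folded into one number. The quoted text does not give CodSpeed's actual formula, so the `SimulatedCounts` type and the penalty weights below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedCounts:
    """Counts a CPU simulator might report for one benchmark run (hypothetical)."""
    instructions: int  # instructions executed
    l1_misses: int     # memory accesses that missed the first-level cache
    ll_misses: int     # memory accesses that also missed the last-level cache


# Assumed penalties in cycles; placeholders, not CodSpeed's real weights.
L1_MISS_PENALTY = 10
LL_MISS_PENALTY = 100


def estimated_cycles(counts: SimulatedCounts) -> int:
    """Weight instructions and cache misses into a single cycle estimate."""
    return (
        counts.instructions
        + L1_MISS_PENALTY * counts.l1_misses
        + LL_MISS_PENALTY * counts.ll_misses
    )


# Same instruction count, different memory access patterns (made-up numbers).
cache_friendly = SimulatedCounts(instructions=1_000_000, l1_misses=2_000, ll_misses=100)
cache_hostile = SimulatedCounts(instructions=1_000_000, l1_misses=20_000, ll_misses=5_000)

friendly_cycles = estimated_cycles(cache_friendly)
hostile_cycles = estimated_cycles(cache_hostile)
```

Because the inputs are deterministic counts rather than wall-clock time, the same code yields the same estimate on every run, and two versions with identical instruction counts can still be told apart by their cache behavior.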
Due to their nature, system calls introduce variability in execution time. This variability is influenced by several factors, including system load, network latency, and disk I/O performance. As a result, the execution time of system calls can fluctuate significantly, making them the most inconsistent part of a program's execution time.
To ensure that our execution speed measurements are as stable and reliable as possible, CodSpeed does not include system calls in the measurement. Instead, we focus solely on the code executed within user space (the code you wrote), excluding any time spent in system calls. This approach allows us to provide a clear and consistent metric for the execution speed of your code, independent of your hardware and all variability that it can create.
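The user space versus system call split can be observed with ordinary tools too. The sketch below is not how CodSpeed measures anything (it simulates the CPU rather than reading timers); it only uses Python's `os.times()` to separate user CPU time from kernel CPU time for a function call. The helper name `cpu_times` is hypothetical, and the timer resolution is coarse, so very short functions may report zero.

```python
import os


def cpu_times(fn):
    """Return (user_seconds, system_seconds) of CPU time spent while running fn.

    User time is spent in the program's own code; system time is spent
    in the kernel on the program's behalf, for example in system calls.
    """
    before = os.times()
    fn()
    after = os.times()
    return after.user - before.user, after.system - before.system


def pure_compute():
    """CPU-bound work that makes no system calls in its inner loop."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user_s, system_s = cpu_times(pure_compute)
```

A metric built only from the user portion ignores the part of the run that depends most on system load, disk, and network, which is the motivation given in the quote above.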