Hypothesis Testing - The Auto-pilot Benchmarking Suite

6.3 Hypothesis Testing with Getstats

Getstats supports simple hypothesis testing using a two-sample t-test. If you have two samples (i.e., configurations) and want to determine whether one configuration's results are larger than, smaller than, or equal to the other configuration's results, use the --twosamplet transform. The --twosamplet transform operates like the overhead transform: the first file on the command line is compared to each subsequent file on the command line. Before executing the command, pick your null hypothesis, which is what you assume to be true (and would like to disprove). For example, if you just spent time optimizing a function, assume that your new software is no faster than the existing software, and seek to prove otherwise. You can also assume that two samples are equal and then seek to differentiate them; if you fail, the results are statistically indistinguishable.

For a primer on hypothesis testing, I suggest reading a statistics book such as Ott and Longnecker's "An Introduction to Statistical Methods and Data Analysis", or the articles at MathWorld (http://mathworld.wolfram.com/HypothesisTesting.html) or Wikipedia (http://en.wikipedia.org/wiki/Hypothesis_testing).

For example, to compare grep:reboot.res with grep:noreboot.res, run getstats --twosamplet grep:noreboot.res grep:reboot.res. This command produces the basic tabular report and then compares each quantity, as follows:

     grep:noreboot.res: High z-score of 2.33972893857958 for elapsed in epoch 7.
     grep:noreboot.res: Linear regression slope for sys is: 1.856%.
     grep:reboot.res: High z-score of 2.82303417219122 for elapsed in epoch 1.
     grep:reboot.res: High z-score of 2.33133550896239 for sys in epoch 3.
     grep:reboot.res: High z-score of 2.47125762323635 for io in epoch 1.
     grep:noreboot.res
     NAME    COUNT MEAN   MEDIAN LOW    HIGH   MIN    MAX    SDEV% HW%
     Elapsed 10    38.751 38.699 38.580 38.921 38.465 39.307 0.614 0.439
     System  10    1.796  1.790  1.677  1.915  1.580  2.080  9.255 6.620
     User    10    23.806 23.730 23.614 23.998 23.430 24.330 1.130 0.808
     Wait    10    13.149 13.158 12.912 13.386 12.725 13.797 2.519 1.802
     CPU%    10    66.071 66.019 65.556 66.586 64.899 67.075 1.090 0.779
     
     grep:reboot.res
     NAME    COUNT MEAN   MEDIAN LOW    HIGH   MIN    MAX    SDEV% HW%   O/H
     Elapsed 10    40.422 40.661 39.885 40.960 38.301 40.788 1.859 1.330 4.314
     System  10    1.693  1.700  1.620  1.766  1.560  1.930  6.005 4.296 -5.735
     User    10    23.718 23.745 23.451 23.985 23.180 24.220 1.572 1.124 -0.370
     Wait    10    15.011 15.102 14.569 15.454 13.481 15.632 4.124 2.950 14.168
     CPU%    10    62.875 62.764 62.166 63.584 61.561 64.802 1.576 1.127 -4.837
     
     Comparing grep:reboot.res (Sample 1) to grep:noreboot.res (Sample 2).
     Elapsed: 95%CI for grep:reboot.res - grep:noreboot.res = (1.148, 2.195)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.000   REJECT H_0
     u1 >= u2  u1 <  u2  1.000   ACCEPT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0
     
     System: 95%CI for grep:reboot.res - grep:noreboot.res = (-0.232, 0.026)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.944   ACCEPT H_0
     u1 >= u2  u1 <  u2  0.056   ACCEPT H_0
     u1 == u2  u1 != u2  0.112   ACCEPT H_0
     
     User: 95%CI for grep:reboot.res - grep:noreboot.res = (-0.393, 0.217)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.724   ACCEPT H_0
     u1 >= u2  u1 <  u2  0.276   ACCEPT H_0
     u1 == u2  u1 != u2  0.552   ACCEPT H_0
     
     Wait: 95%CI for grep:reboot.res - grep:noreboot.res = (1.396, 2.329)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.000   REJECT H_0
     u1 >= u2  u1 <  u2  1.000   ACCEPT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0
     
     CPU%: 95%CI for grep:reboot.res - grep:noreboot.res = (-4.009, -2.382)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  1.000   ACCEPT H_0
     u1 >= u2  u1 <  u2  0       REJECT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0

From this report, we can see that grep with an intervening reboot runs for a longer period of time than without the intervening reboot (because we reject the null hypothesis of u1 <= u2 for Elapsed time). We also see that System and User times are indistinguishable for the two tests, because none of their null hypotheses is rejected. Wait time and CPU utilization, however, are distinguishable: the reboot configuration has higher Wait and lower CPU%.
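
To make the confidence interval less of a black box, the following Python sketch rebuilds the Elapsed interval from the summary rows of the two tables. It rests on assumptions that are not stated in this manual: that SDEV% is the sample standard deviation expressed as a percentage of the mean, and that the interval comes from a pooled-variance two-sample t-test with 10 + 10 - 2 = 18 degrees of freedom. These were chosen because they come close to the reported numbers, so treat this as an illustration rather than a description of how Getstats computes the interval.

```python
import math

# Summary statistics copied from the Elapsed rows of the two tables.
# Assumption: SDEV% is the sample standard deviation as a percentage of the mean.
reboot = {"n": 10, "mean": 40.422, "sdev_pct": 1.859}
noreboot = {"n": 10, "mean": 38.751, "sdev_pct": 0.614}

# Two-sided 95% critical value of Student's t for 18 degrees of freedom,
# taken from a standard t table.
T_CRIT_95_DF18 = 2.101


def variance(sample):
    """Recover the sample variance from the mean and SDEV%."""
    sd = sample["mean"] * sample["sdev_pct"] / 100.0
    return sd * sd


def pooled_interval(s1, s2, t_crit):
    """Confidence interval for mean(s1) - mean(s2) using a pooled variance."""
    df = s1["n"] + s2["n"] - 2
    pooled_var = ((s1["n"] - 1) * variance(s1) + (s2["n"] - 1) * variance(s2)) / df
    std_err = math.sqrt(pooled_var * (1.0 / s1["n"] + 1.0 / s2["n"]))
    diff = s1["mean"] - s2["mean"]
    return diff - t_crit * std_err, diff + t_crit * std_err


low, high = pooled_interval(reboot, noreboot, T_CRIT_95_DF18)
print(f"Elapsed: 95%CI = ({low:.3f}, {high:.3f})")
```

The result lands within a few thousandths of the reported (1.148, 2.195); the remaining difference is consistent with the table values having been rounded to three decimal places. Because the whole interval lies above zero, the hypothesis u1 == u2 is rejected for Elapsed.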

If you want a quieter version of the t-test, pass --set rejectonly=1 so that only rejected hypotheses are displayed. For example, running getstats --set warn=0 --set rejectonly=1 --twosamplet on the same two files produces the following:

     grep:noreboot.res
     NAME    COUNT MEAN   MEDIAN LOW    HIGH   MIN    MAX    SDEV% HW%
     Elapsed 10    38.751 38.699 38.580 38.921 38.465 39.307 0.614 0.439
     System  10    1.796  1.790  1.677  1.915  1.580  2.080  9.255 6.620
     User    10    23.806 23.730 23.614 23.998 23.430 24.330 1.130 0.808
     Wait    10    13.149 13.158 12.912 13.386 12.725 13.797 2.519 1.802
     CPU%    10    66.071 66.019 65.556 66.586 64.899 67.075 1.090 0.779
     
     grep:reboot.res
     NAME    COUNT MEAN   MEDIAN LOW    HIGH   MIN    MAX    SDEV% HW%   O/H
     Elapsed 10    40.422 40.661 39.885 40.960 38.301 40.788 1.859 1.330 4.314
     System  10    1.693  1.700  1.620  1.766  1.560  1.930  6.005 4.296 -5.735
     User    10    23.718 23.745 23.451 23.985 23.180 24.220 1.572 1.124 -0.370
     Wait    10    15.011 15.102 14.569 15.454 13.481 15.632 4.124 2.950 14.168
     CPU%    10    62.875 62.764 62.166 63.584 61.561 64.802 1.576 1.127 -4.837
     
     Comparing grep:reboot.res (Sample 1) to grep:noreboot.res (Sample 2).
     Elapsed: 95%CI for grep:reboot.res - grep:noreboot.res = (1.148, 2.195)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.000   REJECT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0
     
     Wait: 95%CI for grep:reboot.res - grep:noreboot.res = (1.396, 2.329)
     Null Hyp. Alt. Hyp. P-value Result
     u1 <= u2  u1 >  u2  0.000   REJECT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0
     
     CPU%: 95%CI for grep:reboot.res - grep:noreboot.res = (-4.009, -2.382)
     Null Hyp. Alt. Hyp. P-value Result
     u1 >= u2  u1 <  u2  0       REJECT H_0
     u1 == u2  u1 != u2  0.000   REJECT H_0
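
The effect of rejectonly=1 can be sketched as a filter over the rows of the full report. The Python below uses the System and CPU% rows from the first report and assumes that a hypothesis is rejected when its P-value falls below 0.05 (one minus the 95% confidence level shown in the output). The threshold rule and the function name are illustrative assumptions, not taken from the Getstats source.

```python
ALPHA = 0.05  # assumed: 1 - the 95% confidence level shown in the report

# (null hypothesis, alternative hypothesis, P-value) rows copied from the full report.
report = {
    "System": [
        ("u1 <= u2", "u1 >  u2", 0.944),
        ("u1 >= u2", "u1 <  u2", 0.056),
        ("u1 == u2", "u1 != u2", 0.112),
    ],
    "CPU%": [
        ("u1 <= u2", "u1 >  u2", 1.000),
        ("u1 >= u2", "u1 <  u2", 0.0),
        ("u1 == u2", "u1 != u2", 0.000),
    ],
}


def rejected_only(rows_by_quantity, alpha):
    """Keep only rejected rows, and drop quantities with no rejected rows."""
    kept = {}
    for quantity, rows in rows_by_quantity.items():
        rejected = [row for row in rows if row[2] < alpha]
        if rejected:
            kept[quantity] = rejected
    return kept


quiet = rejected_only(report, ALPHA)
for quantity, rows in quiet.items():
    print(quantity)
    for null_hyp, alt_hyp, p_value in rows:
        print(f"  {null_hyp}  {alt_hyp}  {p_value:.3f}  REJECT H_0")
```

Under that assumption, System disappears entirely (its smallest P-value, 0.056, is just above the threshold), while CPU% keeps its two rejected rows, which matches the quieter report.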

There are several other variables that control the test. To replace u1 and u2 with their test names (e.g., u1 would be replaced with grep:reboot.res), pass --set verbosettest=1. The confidence level can be adjusted with --set confidencelevel. To determine whether two samples differ by a given delta, use --set twosampledelta=delta.
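
One way to reason about a delta is through the confidence interval. By the usual duality between a two-sided test and its interval, the hypothesis u1 - u2 == delta is rejected at the 5% level exactly when delta lies outside the 95% interval for u1 - u2. The Python sketch below applies that rule to the Elapsed interval from the report. It assumes that twosampledelta shifts the hypothesized difference in this way, which this section does not spell out, and the function name is hypothetical.

```python
# 95% confidence interval for Elapsed (reboot - noreboot), copied from the report.
ELAPSED_CI_95 = (1.148, 2.195)


def rejects_equal_with_delta(ci, delta):
    """Two-sided test of u1 - u2 == delta: reject when delta is outside the interval."""
    low, high = ci
    return not (low <= delta <= high)


for delta in (0.0, 1.5, 3.0):
    rejected = rejects_equal_with_delta(ELAPSED_CI_95, delta)
    verdict = "REJECT H_0" if rejected else "ACCEPT H_0"
    print(f"u1 - u2 == {delta}: {verdict}")
```

With delta set to 0, this reduces to the plain u1 == u2 test, which the report rejects for Elapsed. A delta of 1.5 seconds falls inside the interval, so the data cannot rule out a difference of that size.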

Finally, if you want to compare each sample to every other sample in a pair-wise manner, pass --pairwiset instead of --twosamplet.
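
The difference between the two transforms is only in which pairs of files get compared. The following Python sketch lists the pairs for three hypothetical result files (a.res, b.res, and c.res are placeholders, not files from this manual). It assumes that the pair-wise mode compares each unordered pair once, and it does not try to mirror which file Getstats labels Sample 1 or Sample 2.

```python
from itertools import combinations

# Hypothetical result files, in command-line order.
files = ["a.res", "b.res", "c.res"]


def first_versus_rest(names):
    """Pairs compared when the first file is the baseline for each later file."""
    return [(names[0], other) for other in names[1:]]


def all_pairs(names):
    """Pairs compared when every sample is compared to every other sample."""
    return list(combinations(names, 2))


print("first versus rest:", first_versus_rest(files))
print("pair-wise:        ", all_pairs(files))
```

With three files, the first mode yields two comparisons and the pair-wise mode yields three; the gap widens quickly as more files are added, since n files produce n(n-1)/2 pairs.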