When conducting cutting-edge scientific experiments we are almost exclusively dealing with small effects and intricate instruments. Because we are not measuring tangible objects with physical scales or rulers, our common-sense notions about measurement no longer apply. Therefore we cannot simply take a measuring device, make one or two measurements, and proclaim a result. No, we absolutely cannot, because there is no way of telling whether such a result is accurate or meaningful.
Statistical Approach
In a scientific experiment, the significance of a result (i.e. our degree of certainty about its veracity) is entirely determined by the quality of the measurement. Generally, the measurement process is a complicated and often unpredictable interaction between two complex systems: the experimental setup and the measurement equipment. Fortunately, we can make sense of this mess by relying on statistics and statistical models.
Because of the statistical nature of measurements, no single measurement is meaningful. Our statistical models tell us that measurement results are drawn from random distributions. Therefore a single measurement can never be relied upon: the value that we happened to get could be arbitrarily small or arbitrarily large. In other words, we can have no confidence in a result based on a single measurement.
To give an example, suppose you are measuring a neutron flux with a neutron detector. You take a background measurement for one minute and record 5 counts (i.e. the detector reports a count rate of 5 CPM). Then you conduct an experiment and record 10 counts in one minute (10 CPM). Although the experiment count is twice the background, based on only these two measurements one cannot conclude that there was indeed a neutron excess associated with the experiment!
Here is why we cannot draw a conclusion:
- We cannot base our conclusion on just two samples (see the sketch after this list);
- We know nothing about systematic errors of the measurement process.
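Here is a minimal numerical sketch of the first point (the 5 CPM and 10 CPM figures come from the example above; counts are modeled as Poisson, the standard assumption for detector counting statistics). Even if the true background rate really is 5 CPM, a single one-minute reading of 10 or more counts occurs by pure chance a few percent of the time:

```python
from scipy.stats import poisson

# Figures from the example above: an assumed true background rate of
# 5 counts per minute (CPM) and a single one-minute reading of 10 counts.
background_rate = 5
observed_counts = 10

# Probability of recording 10 or more counts in one minute from
# background alone, modeling detector counts as a Poisson process.
p_chance = poisson.sf(observed_counts - 1, background_rate)
print(f"P(X >= {observed_counts} | {background_rate} CPM) = {p_chance:.3f}")
# ~0.032: roughly one background-only minute in thirty reads 10+ counts,
# so a single pair of readings cannot settle the question.
```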
Random Processes
To illustrate the first point further, consider the natural background neutron flux. This flux is thought to originate in the upper atmosphere, where cosmic rays interact with air molecules and produce showers of elementary particles, including neutrons. The flux varies because the intensity of cosmic rays and the solar wind varies as well. Also, we have recently learned that thunderstorms generate neutrons, positrons, and gamma rays! In other words, although we have some understanding of where the background neutrons come from, we cannot predict the flux; it can vary wildly because of weather, solar activity, or cosmic events (e.g. supernova explosions).
Therefore, the background neutron flux is essentially random. This means that if we measure it, we will get random values, such as 5 CPM, 4 CPM, 0 CPM or even 100 CPM. If we measure the flux for a long period of time we will not be getting values of 5 CPM all the time, and we will not be getting values between 4 and 6 CPM all the time. We will be getting all sorts of random values, including some rather crazy ones, such as 1,000 CPM or even 1,000,000 CPM.
Fortunately, the vast majority of random processes follow a normal (Gaussian-shaped) distribution and can therefore be characterized by a mean and a standard deviation.
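As a quick sketch (the readings below are invented for illustration), characterizing a series of one-minute background readings by their sample mean and standard deviation takes only a few lines:

```python
import numpy as np

# Hypothetical series of one-minute background readings (CPM),
# invented for illustration only.
background = np.array([5, 4, 0, 7, 3, 6, 5, 2, 8, 4, 5, 6])

mean = background.mean()
std = background.std(ddof=1)  # sample standard deviation (n - 1 denominator)
print(f"background: {mean:.1f} +/- {std:.1f} CPM")
```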
Statistical Significance
So, as long as our counts belong to a normal distribution, we can record enough samples to estimate both the mean and the standard deviation. We must do this for the background and for the experiment measurements. Once we do, we can determine whether the two sets of samples belong to the same distribution or not: if they do, the two sets are equivalent and there is no difference between the background and the experiment measurements, regardless of how similar or dissimilar the actual count figures are.
In statistics, the above-mentioned procedure is called Student’s t-test. The process begins by acquiring a sufficient number of background and experiment samples. Then, by crunching the numbers, we determine the p-value: a number between 0 and 1 that tells us how likely we would be to see a difference at least as large as the one we observed if both sets of samples were in fact drawn from the same distribution. A p-value close to 1 means the two sets look entirely consistent with one another; a p-value close to 0 means such a difference would be very unlikely to arise by chance alone.
To derive a formal conclusion, at the beginning of the statistical analysis we choose an arbitrary cut-off value called the significance level (0.05 is often used). If the computed p-value is below the chosen significance level, we formally pronounce the difference between the background and the experiment measurements statistically significant. If the p-value is above the chosen significance level, then the experiment measurements are not that different from the background (not different enough for us to be sure).
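In Python the whole procedure takes only a few lines; here is a sketch using scipy (the count arrays are invented for illustration, and Welch’s variant of the t-test is used because it does not assume the two sets have equal variances):

```python
import numpy as np
from scipy.stats import ttest_ind

# Invented one-minute readings (CPM) for illustration.
background = np.array([5, 4, 0, 7, 3, 6, 5, 2, 8, 4, 5, 6])
experiment = np.array([10, 8, 12, 7, 11, 9, 13, 8, 10, 12, 9, 11])

alpha = 0.05  # significance level, chosen before looking at the data

# Welch's t-test: does not assume equal variances in the two sets.
t_stat, p_value = ttest_ind(experiment, background, equal_var=False)
print(f"p-value = {p_value:.4g}")
if p_value < alpha:
    print("Statistically significant difference -- worth investigating further.")
else:
    print("Not different enough from background to be sure.")
```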
So, what does this ‘statistical significance’ mean? It means that we probably have something. It does not mean that we definitely have something. But establishing statistical significance gives us a reason to undertake the additional work and analysis necessary to arrive at a definite answer.
Systematic Errors
Now that we have established statistical significance, we must research and analyze sources of systematic errors. No measurement device is perfect, and every device can be affected by a variety of factors, such as:
- Intrinsic noise;
- Electromagnetic (EM) noise / interference;
- Calibration errors / calibration drift;
- Temperature (usually impacts calibration);
- Noise, vibration;
- Humidity, pressure, etc.
To eliminate systematic errors one must understand how the measurement device works; without this knowledge, their elimination is impossible.
Raw Signal
Once we familiarize ourselves with the details of the device’s operation, we should take a look at the actual raw signal acquired by the device. E.g. if the device counts pulses, do the pulses look proper, or do they look like EM noise? If we acquire a spectrum, does the spectrum look right, or does it appear unusual? If the measurement device does not provide access to the raw signal, then there is no way to eliminate systematic errors originating from signal distortions, and therefore we will not be able to make a definite determination about the meaning of our measurement. This is why the vast majority of commercially available measurement tools are not suitable for scientific experimentation: they do not provide access to the raw signal.
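What such a raw-signal sanity check looks like depends entirely on the instrument. Purely as a hypothetical sketch (every threshold below is invented), one could flag digitized pulses whose amplitude or width falls outside the range expected for genuine detector events:

```python
import numpy as np

def looks_like_real_pulse(pulse, dt_us=0.1,
                          min_width_us=0.5, max_width_us=5.0,
                          min_amplitude=0.2):
    """Crude sanity check on one digitized pulse; all thresholds are invented.

    Genuine detector pulses have a characteristic amplitude and width;
    EM spikes are typically much narrower, and baseline noise much smaller.
    """
    pulse = np.asarray(pulse, dtype=float)
    amplitude = pulse.max() - pulse.min()
    if amplitude < min_amplitude:
        return False  # too small: likely baseline noise
    above_half = pulse > pulse.min() + amplitude / 2
    width_us = above_half.sum() * dt_us  # rough full width at half maximum
    return min_width_us <= width_us <= max_width_us
```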
False Data
If we have looked at the signal and it does not appear to be contaminated with EM noise, we must analyze the signal itself: is it genuine? This determination is much harder to make, and it requires deep knowledge of the measurement device’s principle of operation and calibration. For example, temperature affects the calibration of NaI gamma scintillators; x-rays produce false counts in gas-filled neutron counters; mechanical vibrations produce false bubbles in detectors employing superheated liquids.
In other words, we must understand our device well enough to know what can fool it into producing false data. Because the sources of false data are far less obvious than EM noise, their elimination can be a lengthy and laborious process, but a necessary one.
Edge of Sensitivity
One easy way to avoid false data is not to push the operational range of your measurement device. Measurements taken near the limits of sensitivity are particularly prone to producing false data. E.g. suppose you count low-energy gammas using a FIDLER detector and you are getting a peak towards 10 keV, which is very close to the low-end sensitivity limit of a FIDLER system. Instead of spending time on ‘fixing’ the FIDLER, you should switch to a different type of measurement device, such as a solid-state x-ray detector, for which this 10 keV peak will be right in the middle of its useful range. Continuing to work on the edge of sensitivity will likely result in more false data.
Secondary Confirmation
Assuming that we have 1) established statistical significance of our results, and 2) examined, ruled out or eliminated known sources of systematic errors, what do we do about unknown sources of systematic errors?
There is but a single way to deal with the ‘unknown unknowns’: employ a secondary means of confirmation. This is where you deploy a second, principally different measurement device (or resort to a principally different measurement technique) to record additional background and experiment measurements, which once again have to be subjected to statistical significance analysis and elimination of systematic errors. Do the results obtained using the second measurement technique match and corroborate the original data? If they do, then you are much closer to getting a definite answer. If not, you have an unresolved source of systematic error that you must find and eliminate.
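If each technique yields its own p-value, one simple way to quantify how well two independent measurements corroborate each other is Fisher’s method for combining p-values (a sketch; the p-values below are hypothetical, and the method assumes the two tests are statistically independent):

```python
from scipy.stats import combine_pvalues

# Hypothetical p-values from two principally different techniques,
# e.g. a helium-3 proportional counter and a foil activation measurement.
p_primary = 0.012
p_secondary = 0.034

stat, p_combined = combine_pvalues([p_primary, p_secondary], method='fisher')
print(f"combined p-value = {p_combined:.4g}")
```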
In most cases a secondary confirmation is enough to establish the ‘definiteness’ of the results. However, in some cases a tertiary confirmation may be necessary. Because “extraordinary claims require extraordinary evidence”, peers will expect a tertiary means of confirmation before they consider accepting your ‘extraordinary’ results.
Note that it is absolutely necessary for the secondary means of confirmation to rely on a principally different measurement technique, one with a different source of systematic errors. This is the whole point. You cannot use another device of the same kind as a secondary means of confirmation. For example, if you count neutrons with a helium-3 filled proportional counter, you should also count them using a bubble detector or a foil activation technique.
A secondary confirmation is usually much simpler to arrange than a tertiary one, although each layer of confirmation adds cost and time to the project.
Critical Analysis
So, if someone reports an interesting result, make sure the following boxes are checked before you consider trusting it:
- Did the authors perform statistical analysis to establish the significance of the results?
- Did the authors investigate, rule out, or eliminate systematic errors?
- Did the authors undertake a secondary means of confirmation?
If any of these boxes is left unchecked, then the claim is poorly supported. Such a claim is not warranted and is therefore likely to be in error.
So take the data with a grain of salt and share the wisdom!