Quote: Originally Posted by chrisj
Good! We make tiny amounts of progress.
Excellent, then let's make even more progress.
Quote:
Yes I have, but you're omitting the small but important detail that 51% in an INFINITE DBT shows (indeed, establishes) significance. I think that only in an infinite DBT would a confidence level as low as 51% ever count for much, because defining an infinite sample set as good as says there will be no possible outcome other than the exact statistical basis of the test.
I'm not a statistician, but I know just enough to be dangerous (always the worst kind of knowledge!). Let's go over some basics. In an ABX test we are trying to determine whether one thing sounds different from the other; our hypothesis is that there is indeed an audible difference. Consequently, the null hypothesis is that there is no difference: the listener is just guessing and has a 50-50 chance of getting each answer right. The point of an ABX is to see if we can reject the null hypothesis, because we can never, ever prove the hypothesis. But to have enough confidence to reject the null -- i.e., to say that the listener's choices were unlikely to be due to chance -- we must agree, before the test, on what we will accept as a statistically significant result.
How do we do that? Well, if A and B are indeed the same and the listener is guessing, then his answers will take the shape of a binomial distribution with p (the probability of success) equal to 0.5. This has the shape of a discrete bell curve, centered at the most probable number of correct guesses (which depends on the number of trials). In other words, the height of the curve corresponds to the probability of that particular count occurring: the peak of the curve, in the middle, is much more probable than the ends, which taper off to zero. Because the properties of the binomial distribution are so well understood, given enough trials we can draw statistically sound conclusions from the results of the listener's tests.
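To make this concrete, here's a small sketch of the guessing distribution. This is my own illustration, not from any standard ABX tool, and the 16-trial count is an arbitrary example:

```python
# The binomial distribution for a purely guessing listener (p = 0.5).
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k correct answers out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 16  # hypothetical number of ABX trials
pmf = [binom_pmf(k, n) for k in range(n + 1)]

# The curve peaks at the most probable count, n/2 = 8 correct...
peak = max(range(n + 1), key=lambda k: pmf[k])
print(peak, round(pmf[peak], 4))   # 8 correct is about 20% likely on its own
# ...and tapers toward zero at the tails (all-wrong is ~0.0000153).
print(round(pmf[0], 7))
```

Note that even the peak of the curve is only about a 20% probability: no single exact count is ever very likely, which matters for the discussion below.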
Chris, you mention infinite trials, but I'm going to focus on finite trials because, well, who has time for infinite trials these days?
Seriously though, infinity is a weird thing in stats, well beyond my pay grade, and we're talking about real-world tests where such things are impossible anyway.
There are a few really important properties of the binomial distribution that we need to know:
1. Foremost, its shape will approach that of a bell curve only as the number of trials increases. This is easy to visualize if we think of flipping a fair coin: if you toss it ten times, while we'd expect to see five heads come up, it wouldn't at all be surprising to see four or six heads, instead.
2. As the number of tosses approaches infinity, the proportion of heads will approach the expected value of half the tosses. But here is a crucial point: as the number of tosses increases, the variance must also increase. In other words, even though the proportion of heads will approach 50%, the more tosses you make, the less likely you are to land on exactly 50%. This is more intuitive than it may seem. If we flip a coin 10 times, we're unlikely to see anything crazy, such as a run of 10 heads. But if you flip a coin a trillion times, you're all but guaranteed to see locally improbable things, such as runs of around 40 heads in a row (the longest run you should expect grows roughly as the base-2 logarithm of the number of flips, and log2 of a trillion is about 40). Said more formally, as the number of trials increases the sample proportion approaches the mean of the distribution, but the probability of seeing exactly the expected value (i.e. getting exactly half right) decreases. This is the magic (or demon) of variance at work.
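Property 2 is easy to check numerically. A quick sketch of my own (the trial counts are arbitrary):

```python
# The chance of landing on *exactly* half heads shrinks as n grows.
from math import comb

def p_exactly_half(n):
    """Probability of exactly n/2 heads in n fair coin flips (n even)."""
    return comb(n, n // 2) / 2**n

for n in (10, 100, 1000):
    print(n, round(p_exactly_half(n), 4))
# 10   -> 0.2461
# 100  -> 0.0796
# 1000 -> 0.0252
# The proportion of heads converges to 50%, yet the probability of an
# exact 50% count falls off (roughly like 1/sqrt(n)).
```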
Using these two fundamental properties we can state our first objection to Chris's well-meaning notion that a non-50% result, even with a huge number of trials, indicates significance. As the number of trials grows, the stats guarantee that an exact 50% result becomes less and less likely; we expect that any large set of trials will almost never land on exactly 50%. This removes any a priori significance from a non-50% result.
So how do we determine what is significant? A handy tool called the CDF -- the cumulative distribution function -- shows us the way. Pick a number in a distribution and the CDF adds up its probability and all the probabilities of the numbers before it. Picture a bell curve, pick a spot somewhere along the right tail, and shade everything to the left of it; the CDF tells us the probability of the shaded part, or equivalently (when we subtract it from one), the likelihood of the non-shaded part. For example, if I flip 10 coins, what is the probability that I will see 8 or more heads? The CDF (of 7 or fewer heads, subtracted from one) tells me it's 5.47%.
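You can verify that 5.47% figure in a few lines (binom_cdf here is a hand-rolled helper of mine, not a library call):

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k): the summed probabilities of 0 through k successes."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# P(8 or more heads in 10 flips) = 1 - P(7 or fewer)
p_8_plus = 1 - binom_cdf(7, 10)
print(round(p_8_plus * 100, 2))  # 5.47 (percent)
```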
With the CDF, all you and I have to do is agree on the p-value for the test. This represents the probability threshold that we will accept: any result that is more probable (greater) than p will be considered inconclusive; any result less than or equal to p and we will reject the null hypothesis. Notice what these things mean: we want a low enough p-value to give us confidence that the results were not due to chance alone (though we can never rule that out). A typical p-value is 0.05, which means that if the CDF of the test's result shows that it was less than 5% likely to occur by random chance, then we're going to reject the null hypothesis: the results are too unlikely to have occurred by chance. Notice that we're not saying that the hypothesis is true -- we have not proven that the sounds are different, only that it is improbable (though not impossible) that the listener guessed.
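The whole decision rule fits in a few lines. This is a sketch of the procedure just described; the 16-trial scores are hypothetical examples of mine:

```python
from math import comb

def p_value(correct, n):
    """One-sided p-value: the chance of scoring >= correct by pure guessing."""
    return sum(comb(n, k) for k in range(correct, n + 1)) / 2**n

ALPHA = 0.05  # the significance threshold agreed on *before* the test

def verdict(correct, n):
    return "reject null" if p_value(correct, n) <= ALPHA else "inconclusive"

print(round(p_value(12, 16), 3), verdict(12, 16))  # 0.038 reject null
print(round(p_value(11, 16), 3), verdict(11, 16))  # 0.105 inconclusive
```

Note how sharp the cutoff is: 12 of 16 clears the bar while 11 of 16, a single answer fewer, tells us nothing at all.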
Now we may state our second objection to Chris's theory of ABX: any result greater than p, i.e. anything less than 95% confidence, must be thrown away. It neither allows us to accept the hypothesis nor reject the null hypothesis; it is an inconclusive and therefore meaningless result. We cannot infer a difference between, for example, a 70% result and an 80% result, because the stats assure us that even with random guessing all possible results will occur; was the difference between 70% and 80% due to ephemeral perception, or was it just random? We cannot say. When neither score clears our threshold, the difference between 70% and 80% is entirely within the realm of chance, so we cannot draw a conclusion. This is why we must agree on a p-value that is insensitive (or at least much less sensitive) to coin-flip probabilities.
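To see why two sub-threshold scores can't be ranked, consider a hypothetical 10-trial ABX (the numbers are mine): both 70% and 80% correct fail to clear a 0.05 threshold, so neither outranks the other.

```python
from math import comb

def p_value(correct, n):
    """Chance of scoring >= correct out of n by pure guessing."""
    return sum(comb(n, k) for k in range(correct, n + 1)) / 2**n

print(round(p_value(7, 10), 3))  # 0.172 -- 70% correct: inconclusive
print(round(p_value(8, 10), 3))  # 0.055 -- 80% correct: still above 0.05
```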
Anyway, I've run out of energy. I fear I haven't been clear enough on this admittedly eye-glazing subject. Hopefully there's enough to at least get us on the same page.