Some statistical aspects of choosing sample size

Intro

In many introductory statistics textbooks you'll find statements about sample sizes, for example that a t-test needs “at least” 20 observations, that linear regression needs 50, that logistic regression needs double that, and so on.

In these introductory books the problem of choosing an appropriate sample size is usually discussed in the context of null hypothesis significance testing. To contrast this, I will (very briefly) discuss choosing sample size in the context of Bayesian statistics. However, I hope the text will also give useful insights to those not familiar with it.

My main argument in this text is that instead of raw numbers we should be considering how much information a single observation, or a set of observations, gives. Bayesian statistics offers a well-defined and easy-to-use set of tools to do just this.

I'll discuss this idea in the context of real statistical models, but first, as an informal example, consider an eye exam: one of those in which the participant has to correctly identify the orientation (up, down, left, right) of the letter E. On the wall, a couple of meters from the participant, hangs a board with a bunch of these letters in descending size, each in a random orientation.

Now imagine that a doctor is administering one of these exams right now. The participant has correctly identified, say, the first three E's. The goal is to find the point at which the participant is no longer able to reliably tell the orientations of the letters. But we already know that the participant can identify the first three E's reliably, so presenting these same stimuli over and over again would not increase our information about the quantity of interest, that is, their detection threshold.

So, what's the appropriate sample size for such an eye exam? It should be clear that this can't be discussed separately from the task, from how the “information content” of observations is defined. What's also important is to notice that this is conditional on the observations: it is exactly because we have observed that the participant can detect the first letters with ease that they don't offer any additional information.

Let us look at this in a bit more detail, first in the context of an intercept-only logistic regression, which expands this informal example a bit, and then we will take a quick look at how the range of the predictor variable affects the slope estimates of an ordinary linear model. If the logistic regression example seems too tedious, take a peek at least at the latter linear regression example.

Intercept-only logistic regression

What does it mean that we know something about a parameter? In Bayesian statistics information is formalized as prior and posterior distributions. The word distribution here refers to the fact that a range of values are usually plausible. Prior refers to what we know before we make an observation and posterior refers to what we know after we've made an observation. Let us say that we want to formalize our prior knowledge about the location of a detection threshold in decibels. The figure below shows two possible scenarios:

In the left panel our knowledge of the threshold is less accurate than in the right panel, where the distribution is more concentrated around 50 decibels. In other words, the standard deviation (colloquially: width) of the distribution for the threshold is smaller in the right panel. Generally speaking, we would want the width of the posterior distribution to be as small as possible: this can be achieved through observations that provide a lot of information about plausible values.

An important part of Bayesian statistics is updating information with what has been learned from observations. Let us say that we have observed a correct response to a 50 dB stimulus. We would weight each point of the prior distribution with the probability of observing that data point given that value. In this case, a prior value of 60 dB is unlikely, since if their threshold was 10 dB higher than the stimulus (which they identified correctly), we would expect an incorrect response. So in light of the observed response, threshold values around 60 dB (and greater) can be ruled out.
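The weighting step just described can be sketched on a grid of candidate threshold values. This is only a minimal illustration in Python, not the R code behind the figures: the prior (centered at 40 dB with a 15 dB standard deviation), the logistic shape of the likelihood and the steepness `SLOPE` are all assumptions I picked for the example.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

# Candidate threshold values in dB (a grid approximation of the distribution).
grid = list(range(0, 101))

# Assumed prior: bell-shaped around 40 dB with a standard deviation of 15 dB.
prior = normalize([math.exp(-0.5 * ((t - 40) / 15.0) ** 2) for t in grid])

# Probability of a correct response to a 50 dB stimulus, for every candidate
# threshold. SLOPE is an assumed steepness, not a value from the text.
SLOPE = 0.5
stimulus = 50
likelihood = [logistic(SLOPE * (stimulus - t)) for t in grid]

# Bayes' rule on the grid: weight each prior value by the likelihood, renormalize.
posterior = normalize([p * l for p, l in zip(prior, likelihood)])

def mass_at_or_above(dist, cutoff):
    """Total probability assigned to thresholds at or above `cutoff` dB."""
    return sum(p for t, p in zip(grid, dist) if t >= cutoff)

print("P(threshold >= 60 dB) before:", mass_at_or_above(prior, 60))
print("P(threshold >= 60 dB) after: ", mass_at_or_above(posterior, 60))
```

Under these assumptions the probability of a threshold at or above 60 dB drops to almost nothing after a single correct response at 50 dB, which is the "ruling out" described above.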

The updating procedure outlined above can be formalized using intercept-only logistic regression. Logistic regression is used to model the probability of some event happening as a function of a predictor: for example, the probability of correctly detecting a stimulus as a function of the strength of the stimulus. In an intercept-only version we are only interested in where the s-shaped part of the logistic curve lies on the x-axis. The steepness of the sigmoid curve is not of interest, and the intercept can be conceptualized as the detection threshold[1]. See the slides below for a couple of examples of how changing the intercept affects the sigmoid curve:

So we are interested in estimating the intercept, which is conceptualized as the detection threshold. The slideshow below will show this step-by-step (notice also the narration below the picture in every slide). We begin from an initial prior distribution, and as new observations arrive, it becomes more concentrated – we learn about the location of the threshold. We can also think of the process of updating prior to posterior sequentially: after each data point the prior distribution is updated to the posterior distribution, following the procedure just outlined, and then this posterior distribution acts as the prior distribution for the next observation. This sequential linking of priors and posteriors is a point I'll return to later.

In the first step, we have the prior distribution, which shows that we believe the threshold to be somewhere around 40 decibels, but it can be as high as 70 dB or as low as 10 dB.

The red line shows what we learn when the participant correctly identifies a 50 decibel stimulus: it is likely that their threshold is lower than this, so values above that are ruled out using the logistic model. We multiply these two lines together to update the prior distribution into the posterior distribution. The result is in the next slide.

As we can see, part of the prior distribution has been sliced off: since we learned that the threshold is below 50 dB, the higher values have been discarded. The prior distribution from step 1 has been updated to the posterior distribution. This will, in turn, act as the prior for the next step: the current distribution quantifies what we know before the next observation.

Here, the participant has failed to detect a 40 dB stimulus. Again, the red line shows what we learn from this observation (we learn that the threshold must be higher than this), and as before, by multiplying the lines together, we get the posterior distribution, which will be shown in the next step.

Compared to the first observation, the 2nd observation sliced off a sizeable part of the prior. Our posterior distribution (which again will act as the prior for the next observation) is starting to get more concentrated around a single spot. See the next step.

Here, the participant has correctly identified a 60 dB stimulus. What will happen? What values will be sliced off?

As we can see, the correct identification of the 60 dB stimulus didn't really change the distribution at all. This is because, in light of the previous observations, we already knew that the threshold is lower than that.

(R code for the figures in the slideshow: here)

The last observation did not change the prior distribution at all: the width of the posterior distribution for the intercept stayed the same. As in the informal eye exam example, we do not learn anything from observing that the participant can detect a stimulus that we know to be above their detection threshold. Here we can see it formally: no matter how many stimuli we present to the participant, if the observations have high prior probability (we present them with a stimulus we know to be above their threshold and they do detect it, and vice versa), they don't give additional information.
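The three steps of the slideshow can be replayed numerically. The sketch below is in Python rather than the R used for the figures, and it rests on assumptions of mine: a grid from 0 to 100 dB, a prior centered at 40 dB with a 15 dB standard deviation, and a fixed steepness (`slope=0.5`) for the logistic curve. It tracks the standard deviation of the threshold distribution after each observation.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def update(grid, prior, stimulus, correct, slope=0.5):
    """One Bayesian update of a gridded threshold distribution.

    `slope` is an assumed, fixed steepness of the logistic curve.
    """
    weights = []
    for t, p in zip(grid, prior):
        p_correct = logistic(slope * (stimulus - t))
        weights.append(p * (p_correct if correct else 1.0 - p_correct))
    total = sum(weights)
    return [w / total for w in weights]

def std(grid, dist):
    """Standard deviation of a discrete distribution on `grid`."""
    mean = sum(t * p for t, p in zip(grid, dist))
    return math.sqrt(sum(p * (t - mean) ** 2 for t, p in zip(grid, dist)))

# Grid of candidate thresholds (0 to 100 dB in 0.5 dB steps) and an assumed prior.
grid = [0.5 * i for i in range(201)]
dist = [math.exp(-0.5 * ((t - 40) / 15.0) ** 2) for t in grid]
total = sum(dist)
dist = [d / total for d in dist]

# The observations from the slideshow: (stimulus in dB, detected?).
observations = [(50, True), (40, False), (60, True)]

sds = [std(grid, dist)]
for stimulus, correct in observations:
    # The posterior of one step becomes the prior of the next.
    dist = update(grid, dist, stimulus, correct)
    sds.append(std(grid, dist))

for step, sd in enumerate(sds):
    print("after", step, "observations: SD =", round(sd, 2), "dB")
```

With these settings the first two observations shrink the standard deviation considerably, while the third one (the 60 dB stimulus) changes it only marginally. The exact numbers depend on the assumed prior and steepness.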

Range of predictor in linear regression

Let us now look at how sample size and the range of the predictor affect the posterior standard deviation of slope estimates in linear regression.

I drew simulated data sets with different sample sizes and for two different ranges of the predictor. After each simulation the standard deviation of the posterior of the slope estimate was approximated.

In the figure below posterior standard deviations of slope estimates are plotted against sample size. The upper line (dots) plots simulations in which the range of the predictor is narrow whereas the lower line (triangles) plots simulations in which the range of the predictor is wide.

(R code for the simulation here)
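For readers who prefer something self-contained, here is a rough Python sketch of a simulation in the same spirit. It is not the R code linked above, and it swaps in a shortcut: instead of approximating the posterior numerically, it uses the fact that under a standard noninformative prior the posterior of the slope is centered at the least-squares estimate with scale s / sqrt(Sxx), where s is the residual standard deviation estimate and Sxx is the sum of squared deviations of the predictor. The two ranges (0 to 1 and 0 to 10), the residual standard deviation of 1 and the sample sizes are assumptions chosen for illustration.

```python
import math
import random

def slope_and_scale(xs, ys):
    """Least-squares slope and its scale s / sqrt(Sxx).

    Under a standard noninformative prior this scale is (for moderate n)
    close to the posterior standard deviation of the slope.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))  # residual standard deviation estimate
    return slope, s / math.sqrt(sxx)

rng = random.Random(1)       # fixed seed so the run is repeatable
TRUE_SLOPE = 0.5             # the slope used in the text's example
SIGMA = 1.0                  # assumed residual standard deviation
RANGES = {"narrow": (0.0, 1.0), "wide": (0.0, 10.0)}  # assumed predictor ranges
SAMPLE_SIZES = (10, 20, 40, 80, 160)

results = {}
for label, (low, high) in RANGES.items():
    for n in SAMPLE_SIZES:
        xs = [rng.uniform(low, high) for _ in range(n)]
        ys = [TRUE_SLOPE * x + rng.gauss(0.0, SIGMA) for x in xs]
        _, scale = slope_and_scale(xs, ys)
        results[(label, n)] = scale
        print(label, n, round(scale, 3))
```

The scale s / sqrt(Sxx) makes the two ingredients visible: Sxx grows both with the number of observations and with the spread of the predictor, and s measures the noise the slope has to stand out from.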

In both cases the standard deviation of the slope estimate does go down as the number of observations grows, but it's clear that when the range of the predictor is wide, the standard deviations are much lower. Why is that? Well, it's not really about the range of the predictor per se, it's about the range of the predictor in relation to the residual standard deviation (variability in the response variable). Look at the figure below. In both panels the slope of the underlying linear function is 0.5. When the range of the predictor is too small (left panel), the slope is obscured by the noise in the response variable. When the range of the predictor is widened (right panel), the slope becomes clearer:

Conclusion

I hope it is clear from these examples that sample size considerations have to take into account more than just the raw number of observations. The same sample size can lead to different standard deviations of the estimates if the information content of the observations is different. At least in these simple cases the "information content" can be understood intuitively/informally – I hope the selected examples show this – but using Bayesian statistics one can be a bit more formal about it, and generalize this concept to a wider range of applications (which will be discussed in some later text).

Moreover, as priors and posteriors can be thought to be sequentially linked (as I mentioned when discussing the logistic regression example), we can optimize data collection: during each step of data collection, it's possible to look at the current estimates and their standard deviations and use them to deduce what would be the best way to collect more data. In psychophysics (the study of perception, sensory thresholds etc.) this is done formally by looking one step forward and calculating what stimulus would reduce the standard deviations the most[2]. This is a little bit like what the imaginary doctor was doing in the eye exam example: they don't get stuck presenting stimuli that are low in their information content.
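The one-step look-ahead idea can be sketched as follows. This is a simplified illustration in Python, not any particular published procedure: for each candidate stimulus it averages the posterior standard deviation over the two possible responses, weighted by how probable each response currently is, and picks the stimulus with the smallest expected value. The grid, the prior (centered at 40 dB, standard deviation 15 dB), the steepness `slope=0.5` and the candidate stimuli are all assumptions made for the example.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def std(grid, dist):
    """Standard deviation of a discrete distribution on `grid`."""
    mean = sum(t * p for t, p in zip(grid, dist))
    return math.sqrt(sum(p * (t - mean) ** 2 for t, p in zip(grid, dist)))

def expected_posterior_sd(grid, dist, stimulus, slope=0.5):
    """Posterior SD averaged over the two possible responses to `stimulus`."""
    expected = 0.0
    for correct in (True, False):
        weights = []
        for t, p in zip(grid, dist):
            p_correct = logistic(slope * (stimulus - t))
            weights.append(p * (p_correct if correct else 1.0 - p_correct))
        p_response = sum(weights)  # predictive probability of this response
        if p_response > 0.0:
            posterior = [w / p_response for w in weights]
            expected += p_response * std(grid, posterior)
    return expected

# Current knowledge about the threshold (assumed prior on a 0..100 dB grid).
grid = list(range(0, 101))
dist = [math.exp(-0.5 * ((t - 40) / 15.0) ** 2) for t in grid]
total = sum(dist)
dist = [d / total for d in dist]

# Candidate stimuli in dB (hypothetical set: every 5 dB).
candidates = list(range(0, 101, 5))
expected = {s: expected_posterior_sd(grid, dist, s) for s in candidates}
best = min(expected, key=expected.get)

print("current SD:", round(std(grid, dist), 2))
print("best stimulus:", best, "dB, expected SD:", round(expected[best], 2))
print("expected SD after a 90 dB stimulus:", round(expected[90], 2))
```

Under these assumptions the chosen stimulus sits near the middle of the current distribution, whereas a stimulus far above the plausible thresholds (such as 90 dB) is expected to leave the standard deviation almost unchanged.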

This kind of adaptive data collection can be hard to incorporate into Null Hypothesis Significance Testing, since p-values depend on prior intentions, such as pre-selected number of observations. Certainly, it is not part of the standard toolbox.

A useful tool for exploring the effect of sample size is to simulate data, as was done here, and see what kind of standard deviations you would get.

Notes

[1] This kind of intercept-only model is used e.g. by Watson and Pelli in their 1983 article "QUEST: A Bayesian adaptive psychometric method"; however, they do not specifically use the logistic distribution, and they incorporate a non-zero probability of a correct response by guessing (e.g. in the eye exam example there's a 1/4 chance that the participant will guess the correct orientation).

[2] Modern methods look at the reduction of the expected entropy of the posterior distribution, since this is less dependent on the scale of the parameters than standard deviation, when dealing with multi-dimensional posterior distributions.