Monday, April 5, 2010

Power hunger


There is debate as to when to use non-parametric statistics, and the debate permeates my life because I am an archaeologist.  We (archaeologists and other researchers) attempt to use statistics to bolster the inferences that we make, but one must question not only what kind of statistics are used but whether or not their use is appropriate in the first place.  Because I spend a lot of time thinking about how to communicate this to students and colleagues, I thought I’d write about it today.  Be aware that for many of you, this entry may be a nice sleep aid!

The first rule of inferential statistics is that, by design, sampling must ensure representation of the population for a particular variable (e.g., height).  Researchers increase their confidence in ‘representativeness’ by using a random mechanism, perhaps accompanied by a few other design elements, to select the individuals in a sample.  This random design helps the researcher avoid some forms of sampling bias.  Mathematically, this works well because a random draw is most likely to pick individuals exhibiting the traits most common in the population, and those individuals are its best representatives.  Archaeologists study artifacts, but we do not design sampling of artifacts directly.  Instead we carefully carve up geographic space and either survey or excavate it in hopes that we will recover a representative sample of artifacts.  As a result, our samples may not be representative, and we may quite often violate the first rule of inference.
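
For readers who prefer to see this in code, here is a minimal sketch of the difference between a random sample and a sample taken from wherever is convenient.  Everything in it is an assumption made for illustration: the population of 1,000 artifact lengths is simulated, and sorting the list is only a stand-in for artifacts of similar size clustering in one part of a site.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 simulated artifact lengths (cm).
# Sorting is an illustrative stand-in for spatial clustering on a site.
population = sorted(random.gauss(10.0, 2.0) for _ in range(1000))

def random_sample(pop, n):
    """Simple random sample without replacement."""
    return random.sample(pop, n)

def convenience_sample(pop, n):
    """Take whatever comes first, e.g. a single corner of the site."""
    return pop[:n]

pop_mean = statistics.mean(population)
rand_mean = statistics.mean(random_sample(population, 50))
conv_mean = statistics.mean(convenience_sample(population, 50))

print(f"population mean:         {pop_mean:.2f}")
print(f"random sample mean:      {rand_mean:.2f}")
print(f"convenience sample mean: {conv_mean:.2f}")
```

In this toy setup the random sample lands close to the population mean while the convenience sample does not.  Real excavation units are not this extreme, but the direction of the problem is the same.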

The second rule of inference is to use the correct tool, and this involves the assumptions we make about the underlying structure of data.  An important assumption we must choose whether or not to make is that of normality.  Normality, in my opinion, is poorly understood and difficult to teach.  But here goes: we do not really care about the shape of the population or the sample.  Instead, at a particular sample size, if samples are random, the distribution of error for a statistic we calculate (a mean, say) approximates a normal curve.  By convention that sample size is n = 30, and with increasing sample size above 30, error distributions become more precisely normal.  What this means is that, regardless of the shape of the sample or population, at large enough sample sizes (drawn at random), we are likely to produce a sufficient estimate of the population parameter, such as a mean or a standard deviation.  If the population is normally shaped to begin with, which we rarely know, then normality of error distributions can be assumed at any sample size, because a sample is likely to represent the population mean, for example, when most of the population sits in the middle of the distribution.  This relaxes the sample size constraint.
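
A small simulation can make this less abstract.  The sketch below assumes a strongly skewed (exponential) population, which is my choice for illustration and not a claim about any real artifact assemblage.  It draws many random samples of size 5 and of size 30 and measures how lopsided the resulting sample means are.  Skewness is 0 for a perfectly symmetric curve such as the normal.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Skewness of a list of numbers: 0 for a symmetric distribution."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, reps=5000):
    """Means of `reps` random samples of size n from a skewed population."""
    return [
        statistics.mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

skew_5 = skewness(sample_means(5))
skew_30 = skewness(sample_means(30))

print(f"skewness of sample means, n = 5:  {skew_5:.2f}")
print(f"skewness of sample means, n = 30: {skew_30:.2f}")
```

The means of the larger samples should come out noticeably closer to symmetric, though not perfectly so, which is why n = 30 is better read as a working threshold than as a guarantee.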

Punch line: samples should be random and designed to ensure representativeness, samples should be of n = 30 or greater, or populations must be known to be normal.  In archaeology we rarely have any of these three factors working to our advantage.  Thus, our inferences based on statistics may be problematic.

Assuming representativeness of artifact populations from archaeological samples (itself a difficult challenge), we can avoid the assumption of normality altogether by reducing the power of the inferential tests we use when samples are smaller than 30.  The original data can be converted to ranks, and a whole slew of non-parametric tests can be used in the place of common parametric ones.  By converting to ranks, one becomes more conservative because adjacent data points in the ranked sample are equidistant from one another, and the extra precision of the original measurements is not used.  That precision relates directly to the power of normality, and if one is willing to forfeit such power there is an important gain to be had: one can bolster confidence.
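
To show what converting to ranks does to a data set, here is a minimal sketch.  The function name `to_ranks` and the five lengths are hypothetical, made up for this example.  Tied values share the average of the ranks they would otherwise occupy, which is the usual convention.

```python
def to_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover every value tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Hypothetical artifact lengths (cm), with one extreme value.
lengths = [4.1, 4.3, 4.3, 5.0, 19.7]
print(to_ranks(lengths))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

Notice that the extreme value of 19.7 simply becomes rank 5, one step above its neighbor.  That is the forfeited precision in action.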

Inferential statistics centers on hypothesis testing, and significant results allow the conclusion that ‘something special has happened,’ such as a difference between two samples that does not relate to chance.  It is more difficult to say that ‘something special has happened’ with non-parametric tests (if applied correctly, which requires appropriate rounding of the original data).  If one does not have confidence from a rigorous sampling design, from a large sample size, or from knowledge about the population itself, then confidence must be gained in some other way.  A significant result with a lower-power (non-parametric) test is one way to do this.  If you do this, you will hear the complaint that you have compromised precision and variability inherent in your data by converting to ranks; however, you had little (or should not have had) confidence that such precision was representative in the first place.
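
As one concrete example of a rank-based test, here is a sketch of the Mann-Whitney U test with an exact p-value, which is workable for the small samples discussed here.  The choice of this particular test is mine, and the two sets of measurements are hypothetical.  The exact p-value is found by enumerating every way the pooled data could have been split into two groups of the observed sizes.

```python
from itertools import combinations

def u_statistic(a, b):
    """Mann-Whitney U for sample a: pairs where a beats b, ties count as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def exact_p_value(a, b):
    """Two-sided exact p-value from every possible split of the pooled data."""
    pooled = a + b
    n_a = len(a)
    centre = n_a * len(b) / 2  # expected U when the groups do not differ
    observed = abs(u_statistic(a, b) - centre)
    hits = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count splits at least as extreme as the one actually observed.
        if abs(u_statistic(group_a, group_b) - centre) >= observed:
            hits += 1
    return hits / total

# Hypothetical measurements (cm) from two sites, five artifacts each.
site_a = [3.1, 3.4, 3.6, 3.9, 4.0]
site_b = [4.2, 4.5, 4.8, 5.1, 9.9]

u = u_statistic(site_a, site_b)
p = exact_p_value(site_a, site_b)
print(f"U = {u}, p = {p:.4f}")  # U = 0.0, p = 0.0079
```

Because only the ordering of the values matters, the result would be identical if the 9.9 were 5.2 instead.  The test ignores exactly the precision we had little reason to trust.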

The use of statistics in archaeology is like fine woodworking; the wood is delicate, so use hand tools to ensure a high-quality product.  Power tools can do a lot of damage when misapplied.  Do not let your desire for significant results turn into power hunger; to do so will lead to false confidence.
