Friday 24 July 2015

What are statistical significance tests?


Introduction

Psychological researchers make extensive use of statistical methods in the analysis of data gathered in their research. Statistical methods serve two primary functions: descriptive, to provide a summary of large sets of data so that the important features are readily apparent, and inferential, to evaluate the extent to which the data support the hypothesis being studied as well as the extent to which the findings can be generalized to the population as a whole. It is this second function that makes use of statistical significance tests.




Researchers may employ these tests either to ascertain whether there is a significant difference in the performance of different groups being studied or to determine whether different variables (characteristics) of subjects have a strong relationship to one another. For example, in conducting an experiment to test the effect of a particular treatment on behavior, the experimenter would be interested in testing the differences in performance between the treatment group and a control group. Another researcher might be interested in looking at the strength of the relationship between two variables—such as scores on the SAT Reasoning Test and college grade-point average. In both cases, statistical significance tests would be employed to find out whether the difference between groups or the strength of the relationship between variables was statistically significant.




Laws of Probability

The term statistically significant has a specific meaning based on the outcome of certain statistical procedures. Statistical significance tests have their basis in the laws of probability, specifically in the law of large numbers and conditional probability. In essence, the law of large numbers states that as the number of events with a certain probabilistic outcome increases, the observed frequencies of occurrence should come closer and closer to matching the frequencies that would be expected based on the probabilities associated with those events. For example, with a coin flip, the probability associated with heads is .50 (50 percent), as is the probability associated with tails. If a person flipped a coin ten times, it would not be too startling to get eight heads; if the coin were flipped ten thousand times, however, the observed frequencies of heads and tails would each be very close to 50 percent. Thus, with large numbers of probabilistic events, the expected outcomes can be predicted with great precision.
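To see the law of large numbers in action, the following minimal sketch (assuming Python with NumPy; the code is an illustration added here, not part of the original discussion) flips a simulated fair coin in ever larger batches and prints the observed proportion of heads:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is reproducible

for n_flips in (10, 100, 10_000):
    heads = rng.integers(0, 2, size=n_flips).sum()  # 1 = heads, 0 = tails
    print(f"{n_flips:>6} flips: {heads / n_flips:.3f} heads")
```

With ten flips the proportion can easily stray far from .50, but by ten thousand flips it reliably settles very close to the expected value.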


Conditional probability refers to the probability of a second event, given that a certain first event has occurred. For example, if someone has already pulled one ace from a deck of cards, what is the probability of pulling a second ace on that person’s next attempt (without replacing the first card)? The probability of pulling the first ace was four (the number of aces in the deck) out of fifty-two (the number of cards in the deck). The second draw has a conditional probability created by what happened on the first pick. Because an ace was drawn first, there are now three left in the deck, and since the card was not replaced, there are now fifty-one cards left in the deck. Therefore, the probability of pulling the second ace would be three out of fifty-one.
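For readers who want to check the arithmetic, here is the card example in a few lines of Python (an illustration added here, using exact fractions rather than decimals):

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)           # 4 aces among 52 cards
p_second_given_first = Fraction(3, 51)  # 3 aces left among 51 remaining cards

print(p_first_ace)                         # 1/13
print(p_second_given_first)                # 1/17
print(p_first_ace * p_second_given_first)  # 1/221, both draws being aces
```

Multiplying the two probabilities together gives the chance of the compound event, drawing two aces in a row without replacement.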




Establishing and Testing Hypotheses

Armed with these two concepts, it is now possible to understand how statistical significance tests work. Researchers investigate hypothesized relationships between variables through experiments or other methodologies. These hypotheses can be about how strongly related two variables are or about differences in average performance between groups (for example, an experimental and a control group). One possible hypothesis is that the variables have no relationship, or that the groups do not differ in their performance. This is referred to as the null hypothesis, and it plays an important role in establishing statistical significance. A second possible hypothesis is that there is a relationship between the variables, or that there is a difference between the mean (the average value of a group of scores) performances of the groups. This is referred to as the alternative hypothesis, and it is the hypothesis truly of interest to the researcher. Although it is not possible to test this hypothesis directly, it is possible to test the null hypothesis. Because these hypotheses are both mutually exclusive and exhaustive (only one can be true, but one must be true), if it can be shown from the data gathered that the null hypothesis is highly unlikely, then researchers are willing to accept the alternative hypothesis.


This works through a conditional probability strategy. First, the researcher assumes that the null hypothesis is true. Then the researcher looks at the data gathered during the research and asks, “How likely is it that we would have gotten this particular set of sample data if the null hypothesis were true?” In other words, researchers evaluate the probability of the data given the null hypothesis. If they find, after evaluation, that the data would be very unlikely if the null hypothesis were true, then they are able to reject the null hypothesis and accept the alternative hypothesis. In such a case, it can be said that the results were statistically significant.
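In code, this strategy can be made concrete by returning to the coin example. The sketch below (assuming Python with SciPy, an illustration rather than part of the original discussion) asks how probable a result as extreme as eight heads in ten flips would be if the coin were actually fair; that probability is the familiar p value:

```python
from scipy.stats import binom

n, k, p_null = 10, 8, 0.5  # ten flips, eight heads, fair-coin null hypothesis

# Two-sided: count outcomes at least as extreme in either direction,
# that is, 8 or more heads, or 2 or fewer heads.
p_value = binom.sf(k - 1, n, p_null) + binom.cdf(n - k, n, p_null)
print(round(p_value, 3))  # 0.109
```

Because .109 is not below the conventional .05 cutoff, eight heads in ten flips would not justify rejecting the hypothesis that the coin is fair.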


By convention, the standard for statistical significance is usually a conditional probability of less than .05 (5 percent). This probability value required to reject the null hypothesis is referred to as the significance level. The criterion is set at a stringent level because science tends to be conservative; it does not want to reject old ideas and accept new ones too easily. The significance level actually represents the probability of making a certain type of error: rejecting the null hypothesis when it is in fact true. The lower the significance level, the greater the confidence that the data would have been very unlikely if the null hypothesis were true, and hence that the observed effects are reliable.
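A small simulation shows what the .05 level means in practice: if the null hypothesis is actually true and many studies are run, about 5 percent of them will reject it in error. The sketch below (assuming Python with NumPy and SciPy; the setup is hypothetical) draws both groups from the same population, so every rejection is a false alarm:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
n_experiments, rejections = 2_000, 0

for _ in range(n_experiments):
    # Both groups come from the SAME population, so the null is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    if ttest_ind(group_a, group_b).pvalue < 0.05:
        rejections += 1

print(rejections / n_experiments)  # close to 0.05
```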




Evaluation of Conditional Probability

Statistical significance tests are the procedures that allow researchers to evaluate the conditional probability of the data, given the null hypothesis. The data from a study are used to compute a test statistic, a number whose size reflects the degree to which the data differ from what would be expected if the null hypothesis were true. Some commonly encountered test statistics are the t-ratio, the F-ratio, and the critical ratio (z). The probability associated with the test statistic can be established by consulting published tables, which give the probability of obtaining a particular value of the test statistic if the null hypothesis is true. The null hypothesis is rejected if the probability associated with the test statistic is less than a predetermined critical level (usually .05). If the probability turns out to be greater than that level, then one fails to reject the null hypothesis; that does not mean, however, that the null hypothesis is true. It could simply be that the research design was not powerful enough to detect real effects that were present (for example, the sample size may have been too small, as in flipping the coin only ten times), much as a microscope may lack the power to reveal a small object that is nevertheless there.
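Today, statistical software computes both the critical values once published in tables and the exact probability of the test statistic. A brief sketch (assuming Python with SciPy; the test statistic of 2.40 is a hypothetical value, not from any study described here):

```python
from scipy.stats import t

df = 98                          # degrees of freedom
critical = t.ppf(0.95, df)       # one-tailed critical value at the .05 level
print(round(critical, 3))        # about 1.661

t_statistic = 2.40               # hypothetical computed value
p_value = t.sf(t_statistic, df)  # probability of a t this large under the null
print(round(p_value, 3))         # about 0.009, so the null would be rejected
```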




Differences in Significance

There is sometimes a difference between statistical significance and practical significance. The size of most test statistics can be increased simply by increasing the number of subjects in the sample that is studied. If samples are large enough, any effect will be statistically significant, no matter how small that effect may be. Statistical significance tests tell researchers how reliable an effect is, but not whether that effect has any practical importance. For example, a researcher might be investigating the effectiveness of two diet plans, using groups of one thousand subjects for each diet. On analyzing the data, the researcher finds that subjects following diet A show a statistically significant advantage in weight loss over subjects following diet B. If, however, the average difference between the two groups was only one-tenth of a pound, the difference would be statistically significant but would have no practical significance whatsoever.
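The diet example can be reproduced in a short simulation (assuming Python with NumPy and SciPy; the weight-loss figures are invented for illustration). A true difference of one-tenth of a pound is often declared statistically significant with a thousand subjects per group, even though the effect size remains trivial:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=2)
diet_a = rng.normal(loc=10.1, scale=1.0, size=1_000)  # mean loss 10.1 pounds
diet_b = rng.normal(loc=10.0, scale=1.0, size=1_000)  # mean loss 10.0 pounds

p_value = ttest_ind(diet_a, diet_b).pvalue
pooled_sd = np.sqrt((diet_a.var(ddof=1) + diet_b.var(ddof=1)) / 2)
cohens_d = (diet_a.mean() - diet_b.mean()) / pooled_sd

print(p_value)   # a tiny true difference, yet often significant at this n
print(cohens_d)  # roughly 0.1, a trivially small effect
```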




T-Test

Statistical significance tests provide a measure of how likely it would be to obtain the results of a particular research study by chance alone. They accomplish this by putting a precise value on the probability of obtaining results at least as extreme as those observed if chance alone were operating. A specific test, the t-test, can provide an example of how this works in practice. The t-test is used to test the significance of the difference between the mean performance of two groups on some measure of behavior. It is one of the most widely used tests of significance in psychological research.


Suppose a professor of psychology is interested in whether the more “serious” students tend to choose the early morning sections of classes. To test this hypothesis, the professor compares the performance on the final examination of two sections of an introductory psychology course, one that met at 8:00 a.m. and one that met at 2:00 p.m. In this example, the null hypothesis would state that there is no difference in the average examination scores for the two groups. The alternative hypothesis would state that the average score for the morning group will be higher. In calculating the mean scores for each of the two groups, the professor finds that the early morning class had an average score of 82, while the afternoon class had an average score of 77. Before reaching any conclusion, however, the professor would have to find out how likely it is that this difference could be attributable to chance, so a t-test would be employed.




Influential Factors

There are three factors that influence a test of significance such as the t-test. One is the size of the difference between the means. In general, the larger the measured difference, the more likely it is that the difference reflects an actual difference in performance and not chance factors. A second factor is the size of the sample, or the number of measurements being tested. In general, differences based on large numbers of observations are more likely to be significant than the same differences based on fewer observations (as in the coin-flipping example). This is because with larger samples, random factors within a group (such as the presence of a “hotshot” student, or some students who were particularly sleepy on exam day) tend to be canceled out across groups. The third factor is the variability of the data, or how spread out the scores are from one another. If there is considerable variability in the scores, then a difference in the group means is more likely to be attributable to chance. The variability in the scores is usually measured by a statistic called the standard deviation, which can loosely be thought of as the average distance of a typical score from the mean of the group. As the standard deviations of the groups get smaller, the value of the test statistic gets larger.


Knowing these three things (the size of the difference between the two group means, the number of scores in each group, and the standard deviations of the test scores of the two groups), the professor can calculate a t-statistic and then draw conclusions. With a difference in mean test scores of five points, fifty students in each class, and standard deviations of 3.5 in the first class and 2.2 in the second, the value of t works out to about 8.55. To determine whether this t is significant, the professor would consult a published statistical table that contains the minimum values of t required for significance, based on the number of subjects in the calculation (more technically, the degrees of freedom, which is the total number of subjects minus the number of groups; in this case, 100 minus 2, or 98). Because the computed value of t is far larger than the critical value published in the table (about 1.66 for a one-tailed test at the .05 level), the professor can reject the null hypothesis and conclude that the performance of the early morning class was significantly better than that of the afternoon class.
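The professor’s computation can be written out directly from the summary statistics (a sketch in Python, added here as an illustration; with equal group sizes, the pooled-variance and unpooled forms of the two-sample t statistic give the same value):

```python
import math

m1, s1, n1 = 82, 3.5, 50  # early morning class: mean, SD, group size
m2, s2, n2 = 77, 2.2, 50  # afternoon class: mean, SD, group size

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
t = (m1 - m2) / se
print(round(t, 2))  # 8.55, well above the critical value of about 1.66
```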




Complex Designs

Many research studies in psychology involve more complex designs than simple comparisons between two groups. They may contain three or more groups and evaluate the effects of more than one treatment or condition. This more complex evaluation of statistical significance is usually carried out through a procedure known as the analysis of variance (or F-test). Like the t-test, the F-test is calculated based on the size of the group differences, the sizes of the groups, and the standard deviations of the groups.
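Software packages make the analysis of variance just as routine as the t-test. A minimal sketch (assuming Python with SciPy; the scores are invented for illustration) compares three groups at once:

```python
from scipy.stats import f_oneway

group_1 = [82, 79, 85, 81, 78]
group_2 = [77, 74, 80, 76, 75]
group_3 = [70, 73, 69, 72, 74]

f_statistic, p_value = f_oneway(group_1, group_2, group_3)
print(f_statistic, p_value)  # a small p value suggests the group means
                             # are not all equal
```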


Other tests of statistical significance are available, and the choice of the appropriate technique is determined by such factors as the kind of scale on which the data are measured, the number of groups, and whether one is interested in assessing a difference in performance or the relationship between subject characteristics. One should bear in mind, however, that statistical significance tests are only tools and that numbers can be deceptive. Even the best statistical analysis means nothing if a research study is designed poorly.




Evolution of Practice

Tests of statistical significance have been important in psychological research since the early 1900s. Pioneers such as Sir Francis Galton, Karl Pearson, and Sir Ronald A. Fisher were instrumental in both developing and popularizing these methods. Galton was one of the first to recognize the importance of the normal distribution (the bell-shaped curve) for organizing psychological data. The properties of the normal distribution are the basis for many of the probabilistic judgments underlying inferential statistics. Pearson, strongly influenced by Galton’s work, developed the chi-squared goodness-of-fit test around 1900. This was the first test that made it possible to determine the probability of discrepancies between the observed numbers of occurrences in categories of phenomena and the numbers that would be expected by chance.
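Pearson’s test survives essentially unchanged in modern software. As a sketch (assuming Python with SciPy; the die-roll counts are invented), one can ask whether 120 rolls of a die depart from the 20-per-face frequencies expected by chance:

```python
from scipy.stats import chisquare

observed = [25, 17, 19, 22, 16, 21]  # counts for faces 1 through 6 (sum 120)
result = chisquare(observed)         # expected counts default to equal (20)
print(result.statistic, result.pvalue)  # chi-squared 2.8, p about 0.73
```

The large p value indicates that discrepancies of this size are well within what chance alone would produce.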


It was the publication of Fisher’s book Statistical Methods for Research Workers in 1925, however, that popularized the method of hypothesis testing and the use of statistical significance tests. Through his book, Fisher was able to establish the .05 level of significance as the standard for scientific research. Fisher’s second book, The Design of Experiments (1935), brought his theory of hypothesis testing to a wider audience, and he believed that he had developed the “perfectly rigorous” method of inductive inference. Among Fisher’s accomplishments was the development of the method of analysis of variance for use with complex experimental designs (the F-test was named for Fisher). Before Fisher’s work, the evaluation of whether the results of research were “significant” was based either on a simple “eyeballing” of the data or on an informal comparison of mean differences with standard deviations. Through the efforts of Fisher and some of his followers, hypothesis testing using statistical significance tests soon became an indispensable part of most scientific research. In particular, between 1940 and 1955, statistical methods became institutionalized in psychology during a period that has been called the inference revolution. Many researchers believed that these techniques provided scientific legitimacy to the study of otherwise abstract psychological constructs.




Problems with Approach

Some problems, however, have arisen with the statistical revolution in psychological research. For example, many researchers routinely misinterpret the meaning of rejecting the null hypothesis. Statistical significance tests can tell only the probability of the data given the null hypothesis; they cannot tell the probability of a hypothesis (either the null or the alternative) given the data. These are two different conditional probabilities. Yet many researchers, and even some textbooks in psychology, claim that the level of significance specifies the probability that the null hypothesis is correct or the probability that the alternative hypothesis is wrong. Often the quality of research is judged by the level of significance attained, and researchers are often reluctant to submit, and journal editors reluctant to publish, research reports in which there was a failure to reject the null hypothesis. Over the years, this tendency has led to the publication of many statistically significant research results that have no practical significance and to the withholding of reports of worthwhile studies (which might have had practical significance) because of a failure to reject the null hypothesis.


Statistical significance tests are valuable techniques for the analysis of research results, but they must be applied correctly and analyzed properly to serve their intended function.




