Needing help in order to infer the statistical hypothesis tests performed in an old paper


I am in great need of help in order to infer the statistical hypothesis tests performed in an old paper. However, I need to make some reasonable guesses from the abstract alone, since the original text is in Chinese, which I cannot understand (after an extensive search I was also unable to find the original text, even in Chinese).

The title of the paper is: "Detection of sister chromatic exchange in workers exposed to coal tar pitch and to coke oven volatiles".

The abstract of the paper (which was published in 1998) is the following:

In order to know the changes of genetic toxicological effects on workers occupationally exposed to polycyclic aromatic hydrocarbons (PAHs), sister chromatic exchange(SCE) was detected by the methods of peripheral lymphocyte culture in 23 workers exposed to coal tar pitch (CTP) and in 19 workers exposed to coke oven volatiles (COV) and 12 normal controls. The results suggested that the SCE in occupational workers was significantly higher than that in controls (11.31 vs 6.37, P < 0.001). The SCE in workers exposed to CTP and to COV was higher than that of control (10.27 and 12.58 vs 6.37) respectively. In workers exposed to CTP and COV, there were no differences of SCE for smokers and nonsmokers (P > 0.05). It is indicated that CTP and COV caused strong genetic toxicity and injury to chromosome.

In your opinion, how do you think the above research was organized?

For example: a) What types of statistical hypothesis tests were performed by the researchers? b) What kind of data was collected and used for each statistical hypothesis test? c) What methods were employed for each hypothesis test?


Someone else can probably elaborate on how "peripheral lymphocyte culture" works, but this is what the statistical results suggest to me:

The results suggested that the SCE in occupational workers was significantly higher than that in controls (11.31 vs 6.37, P < 0.001).

This is a two-sample t-test. Occupational workers were pooled into one group and the mean value of SCE for those workers (11.31) was compared to controls (6.37). The null hypothesis is that the means are not different. They reject the null hypothesis. It would be nice if they included the standard errors for the means, so that you would have a sense for the amount of variation within the groups.
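As a rough sketch, that comparison could be reproduced from summary statistics along the following lines. Only the means (11.31 vs 6.37) and the group sizes (23 + 19 exposed, 12 controls) appear in the abstract, so the standard deviations below are hypothetical placeholders and the printed P-value is purely illustrative.

```python
# Sketch of the pooled exposed-workers vs. controls comparison described above.
# Means and group sizes come from the abstract; the standard deviations are
# assumed, so the resulting P-value is for illustration only.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=11.31, std1=2.5, nobs1=42,   # 23 + 19 exposed workers (std1 assumed)
    mean2=6.37,  std2=1.5, nobs2=12,   # 12 controls (std2 assumed)
    equal_var=True,                    # classical Student's two-sample t-test
)
print(f"t = {t_stat:.2f}, two-sided P = {p_value:.4g}")
```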

The SCE in workers exposed to CTP and to COV was higher than that of control (10.27 and 12.58 vs 6.37) respectively.

This is likely an ANOVA with three groups (CTP, COV, and control). This test is somewhat redundant with the first test, since an appropriate post-hoc test will tell you that CTP and COV are both significantly higher than control. But since they report only one P-value, this is probably the overall F-test for the ANOVA. So all they can say with this test is "at least one group is different." You don't know if, for example, CTP and COV are different from one another. It's not clear from the abstract whether they did a post-hoc test (Tukey's HSD, for example), but I doubt it.
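A sketch of what that three-group analysis might look like. The per-worker SCE values are simulated around the reported group means (10.27, 12.58, 6.37) because the raw data are not available; the Tukey post-hoc step at the end is the kind of follow-up the answer suspects was not actually done.

```python
# One-way ANOVA sketch for the CTP / COV / control comparison, with simulated
# data standing in for the unavailable per-worker SCE counts.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
ctp     = rng.normal(10.27, 2.0, size=23)   # 23 coal-tar-pitch workers
cov     = rng.normal(12.58, 2.0, size=19)   # 19 coke-oven-volatiles workers
control = rng.normal(6.37, 1.5, size=12)    # 12 controls

# Overall F-test: "at least one group mean differs."
f_stat, p_value = stats.f_oneway(ctp, cov, control)
print(f"F = {f_stat:.2f}, P = {p_value:.4g}")

# Post-hoc Tukey HSD identifies which pairs differ (e.g. CTP vs COV).
values = np.concatenate([ctp, cov, control])
groups = ["CTP"] * 23 + ["COV"] * 19 + ["control"] * 12
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```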

In workers exposed to CTP and COV, there were no differences of SCE for smokers and nonsmokers (P > 0.05).

Considering only the occupational groups, the sample is divided into smokers and non-smokers. There was no significant difference in mean SCE between the groups. This is a two-sample t-test like the first one. The null hypothesis is that the means are not different. They fail to reject the null hypothesis.

It's also possible (but impossible to determine from the abstract alone) that they did a single, larger multiple regression. Properly coded, they would be able to, at once, test for occupational vs. control, CTP vs. COV vs. control, and smoker vs. non-smoker. It would be pretty dicey with such a small sample, so they probably didn't take that approach.
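If one did take the regression route, it might look roughly like the sketch below. The data frame, the column names and the smoking indicator are all made up for illustration; only the group sizes and group means come from the abstract.

```python
# Sketch of a single multiple-regression model covering exposure group and
# smoking status at once. All data and column names (sce, group, smoker)
# are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sce": np.concatenate([rng.normal(10.27, 2.0, 23),
                           rng.normal(12.58, 2.0, 19),
                           rng.normal(6.37, 1.5, 12)]),
    "group": ["CTP"] * 23 + ["COV"] * 19 + ["control"] * 12,
    "smoker": rng.integers(0, 2, 54),          # hypothetical smoking status
})

# Exposure group (with control as the reference level) and smoking status
# are tested together in one model.
model = smf.ols("sce ~ C(group, Treatment('control')) + smoker", data=df).fit()
print(model.summary())
```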

Why do we perform t-tests?

The null hypothesis of a two-sample t-test is that the means of two groups are not different from one another.

How do we know that our data follow a normal distribution? Shouldn't we perform tests in order to decide whether our data follow a normal distribution and have equal variance?

Assumptions of t-tests include normal distributions within groups and equal variance between groups. These should be checked prior to carrying out the test. We can assume that the authors did these tests, but they are very rarely reported.
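A minimal sketch of those checks, again with hypothetical stand-ins for the unavailable per-worker SCE counts:

```python
# Assumption checks before a two-sample t-test; the arrays are placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
exposed  = rng.normal(11.31, 2.5, 42)    # placeholder exposed-worker values
controls = rng.normal(6.37, 1.5, 12)     # placeholder control values

# Shapiro-Wilk test: null hypothesis is that the sample is drawn from a
# normal distribution.
print(stats.shapiro(exposed))
print(stats.shapiro(controls))

# Levene's test: null hypothesis is that the groups have equal variances.
print(stats.levene(exposed, controls))
```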

In case the criteria needed to perform a t-test are not fulfilled should we choose a non parametric equivalent?

Non-parametric alternatives should be considered when the assumptions are not met. That being said, t-tests are pretty robust to violations of these assumptions.
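And the corresponding non-parametric substitutes, as a sketch under the same kind of hypothetical data:

```python
# Rank-based alternatives when the t-test / ANOVA assumptions look doubtful.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
exposed  = rng.normal(11.31, 2.5, 42)    # placeholder exposed-worker values
controls = rng.normal(6.37, 1.5, 12)     # placeholder control values

# Mann-Whitney U test replaces the two-sample t-test.
print(stats.mannwhitneyu(exposed, controls, alternative="two-sided"))

# Kruskal-Wallis test replaces the one-way ANOVA for three or more groups.
ctp, cov = exposed[:23], exposed[23:]    # split purely for illustration
print(stats.kruskal(ctp, cov, controls))
```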

Also, is the ANOVA mentioned in the answer above an N-way ANOVA?

ANOVA in general is a test of equality of means among N groups. So you could think of a t-test as just a special kind of ANOVA on two groups (indeed, the two are numerically equivalent: the ANOVA F statistic equals the square of the t statistic, and the P-values coincide), as the short numerical check below illustrates.
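A quick numerical check of that equivalence on two arbitrary simulated groups:

```python
# Two-group ANOVA and the equal-variance two-sample t-test are the same test:
# F equals t squared and the P-values match.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, 20)
b = rng.normal(8.0, 2.0, 20)

t_stat, p_t = stats.ttest_ind(a, b)      # equal-variance t-test
f_stat, p_f = stats.f_oneway(a, b)       # one-way ANOVA on the same groups

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")   # identical
print(f"P(t) = {p_t:.4f}, P(F) = {p_f:.4f}")        # identical
```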


A step-by-step guide to hypothesis testing

Published on November 8, 2019 by Rebecca Bevans. Revised on February 15, 2021.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  1. State your research hypothesis as a null (H0) and alternate (Ha) hypothesis.
  2. Collect data in a way designed to test the hypothesis.
  3. Perform an appropriate statistical test.
  4. Decide whether the null hypothesis is supported or refuted.
  5. Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.


An Overview of Statistics in Education

Statistical Inference

Statistical inference consists of the use of statistics to draw conclusions about some unknown aspect of a population based on a random sample from that population. Some preliminary conclusions may be drawn by the use of EDA or by the computation of summary statistics as well, but formal statistical inference uses calculations based on probability theory to substantiate those conclusions. Statistical inference can be divided into two areas: estimation and hypothesis testing. In estimation, the goal is to describe an unknown aspect of a population, for example, the average scholastic aptitude test (SAT) writing score of all examinees in the State of California in the USA. Estimation can be of two types, point estimation and interval estimation, depending on the goal of the application. The goal of hypothesis testing is to decide which of two complementary statements about a population is true. Two such complementary statements may be: (1) the students of California score higher on average on SAT writing than the students of Texas, and (2) the students of California do not score higher on average on SAT writing than the students of Texas. Point estimation is discussed in the statistics section of the encyclopedia. Details on interval estimation and hypothesis testing, and on power analysis, which plays a key role in hypothesis testing, are also discussed in the statistics section of the encyclopedia. Often, an investigator has to perform several hypothesis tests simultaneously. For example, one may want to compare the SAT critical reading scores of several pairs of schools belonging to a geographical region. The article on multiple comparisons in the statistics section of the encyclopedia discusses how to handle such a situation in an appropriate manner.


Step 1: Define the Hypothesis

Usually, the reported value (or the claimed statistic) is stated as the hypothesis and presumed to be true. For the above examples, the hypothesis will be:

  • Example A: Students in the school score an average of 7 out of 10 in exams.
  • Example B: The annual return of the mutual fund is 8% per annum.

This stated description constitutes the "Null Hypothesis (H0)" and is assumed to be true, the way a defendant in a jury trial is presumed innocent until proven guilty by the evidence presented in court. Similarly, hypothesis testing starts by stating and assuming a "null hypothesis," and then the process determines whether the assumption is likely to be true or false.

The important point to note is that we are testing the null hypothesis because there is an element of doubt about its validity. Whatever information that is against the stated null hypothesis is captured in the Alternative Hypothesis (H1). For the above examples, the alternative hypothesis will be:

  • Students score an average that is not equal to 7.
  • The annual return of the mutual fund is not equal to 8% per annum.

In other words, the alternative hypothesis is a direct contradiction of the null hypothesis.

As in a trial, the jury assumes the defendant's innocence (null hypothesis). The prosecutor has to prove otherwise (alternative hypothesis). Similarly, the researcher has to gather evidence against the null hypothesis. If the prosecutor fails to prove the alternative hypothesis, the jury has to let the defendant go (basing the decision on the null hypothesis). Similarly, if the researcher fails to support the alternative hypothesis (or simply does nothing), then the null hypothesis is taken to stand.

The decision-making criteria have to be based on certain parameters of datasets.


INTRODUCTION

Medical journals are replete with P values and tests of hypotheses. It is a common practice among medical researchers to quote whether the test of hypothesis they carried out is significant or non-significant, and many researchers get very excited when they discover a "statistically significant" finding without really understanding what it means. Additionally, while medical journals are full of statements such as "statistically significant", "unlikely due to chance", "not significant", "due to chance", or notations such as "P > 0.05" or "P < 0.05", the decision on whether a test of hypothesis is significant or not based on the P value has generated an intense debate among statisticians. It began among the founders of statistical inference more than 60 years ago 1-3. One contributing factor is that the medical literature shows a strong tendency to accentuate positive findings; many researchers would like to report positive findings, since "non-significant results should not take up" journal space 4-7.

The idea of significance testing was introduced by R.A. Fisher, but over the past six decades its utility, understanding and interpretation have been misunderstood, and this has generated much scholarly writing aimed at remedying the situation 3. Alongside the statistical test of hypothesis is the P value, whose meaning and interpretation have similarly been misused. To delve well into the subject matter, a short history of the evolution of the statistical test of hypothesis is warranted to clear up some of the misunderstanding.

A Brief History of P Value and Significance Testing

Significance testing evolved from the idea and practice of the eminent statistician R.A. Fisher in the 1930s. His idea is simple: suppose we found an association between poverty level and malnutrition among children under the age of five years. This is a finding, but could it be a chance finding? Or perhaps we want to evaluate whether a new nutrition therapy improves the nutritional status of malnourished children. We study a group of malnourished children treated with the new therapy and a comparable group treated with the old nutritional therapy, and find in the new therapy group an improvement of nutritional status by 2 units over the old therapy group. This finding will obviously be welcomed, but it is also possible that it is purely due to chance. Thus, Fisher saw the P value as an index measuring the strength of evidence against the null hypothesis (in our examples, the hypothesis that there is no association between poverty level and malnutrition, or that the new therapy does not improve nutritional status). To quantify the strength of evidence against the null hypothesis "he advocated P < 0.05 (5% significance) as a standard level for concluding that there is evidence against the hypothesis tested, though not as an absolute rule" 8. Fisher did not stop there but graded the strength of evidence against the null hypothesis. He proposed "if P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05" 9. Since Fisher made this statement over 60 years ago, the 0.05 cut-off point has been used by medical researchers worldwide, and it has become ritualistic to use the 0.05 cut-off mark as if other cut-off points cannot be used. Through the 1960s it was standard practice in many fields to report P values with one star attached to indicate P < 0.05 and two stars to indicate P < 0.01. Occasionally three stars were used to indicate P < 0.001. While Fisher developed this practice of quantifying the strength of evidence against the null hypothesis, some eminent statisticians were not accustomed to the subjective interpretation inherent in the method 7. This led Jerzy Neyman and Egon Pearson to propose a new approach which they called "hypothesis tests". They argued that there were two types of error that could be made in interpreting the results of an experiment, as shown in Table 1.

Table 1.

Errors associated with results of experiment.

                           The truth
Result of experiment       Null hypothesis true      Null hypothesis false
Reject null hypothesis     Type I error rate (α)     Power = 1 − β
Accept null hypothesis     Correct decision          Type II error rate (β)

The outcome of the hypothesis test is one of two: to reject one hypothesis and to accept the other. Adopting this practice exposes one to two types of error: rejecting the null hypothesis when it should be accepted (i.e., concluding the two therapies differ when they are actually the same, also known as a false-positive result, a type I error or an alpha error) or accepting the null hypothesis when it should have been rejected (i.e., concluding that they are the same when in fact they differ, also known as a false-negative result, a type II error or a beta error).
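The α in Table 1 has a concrete operational meaning: if the null hypothesis is true and the test is run at α = 0.05, about 5% of repeated experiments will reject it anyway. A small simulation sketch of that fact:

```python
# Simulate many experiments in which the null hypothesis is true and count
# how often a test at alpha = 0.05 rejects it (type I errors).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, n_sim = 0.05, 10_000
false_positives = 0

for _ in range(n_sim):
    a = rng.normal(0, 1, 30)           # two samples from the SAME population,
    b = rng.normal(0, 1, 30)           # so the null hypothesis is true
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1           # rejecting here is a type I error

print(f"Observed type I error rate: {false_positives / n_sim:.3f}")  # ~0.05
```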

What does P value Mean?

The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The P stands for probability and measures how likely it is that any observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1 . Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests no difference between the groups other than due to chance. Thus, it is common in medical journals to see adjectives such as “highly significant” or “very significant” after quoting the P value depending on how close to zero the value is.

Before the advent of computers and statistical software, researchers depended on tabulated values of P to make decisions. This practice is now obsolete and the use of the exact P value is much preferred. Statistical software can give the exact P value and allows appreciation of the range of values that P can take between 0 and 1. Briefly, for example, the weights of 18 subjects were taken from a community to determine if their body weight is ideal (i.e. 100 kg). Using Student's t test, t turned out to be 3.76 at 17 degrees of freedom. Comparing this with the tabulated values, t = 3.76 is more than the critical value of 2.11 at P = 0.05 and therefore falls in the rejection zone. Thus we reject the null hypothesis that μ = 100 and conclude that the difference is significant. But using SPSS (a statistical software package), the following information came out when the data were entered: t = 3.758, P = 0.0016, mean difference = 12.78 and the confidence interval runs from 5.60 to 19.95. Methodologists are now increasingly recommending that researchers should report the precise P value. For example, P = 0.023 rather than P < 0.05 10. Further, to use P = 0.05 "is an anachronism. It was settled on when P values were hard to compute and so some specific values needed to be provided in tables. Now calculating exact P values is easy (i.e., the computer does it) and so the investigator can report (P = 0.04) and leave it to the reader to (determine its significance)" 11.
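The SPSS numbers quoted above can in fact be reproduced from the reported summaries alone; a sketch (the raw weights of the 18 subjects are not given in the text, so everything is derived from t = 3.758, df = 17 and the mean difference of 12.78):

```python
# Recover the exact P-value and 95% CI from the reported one-sample t-test summary.
from scipy import stats

t_stat, df, mean_diff = 3.758, 17, 12.78
se = mean_diff / t_stat                       # implied standard error of the mean difference

p_two_sided = 2 * stats.t.sf(t_stat, df)      # exact two-sided P-value
margin = stats.t.ppf(0.975, df) * se          # 95% confidence margin

print(f"P = {p_two_sided:.4f}")                                         # ~0.0016
print(f"95% CI: {mean_diff - margin:.2f} to {mean_diff + margin:.2f}")  # ~5.60 to 19.95
```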

Hypothesis Tests

A statistical test provides a mechanism for making quantitative decisions about a process or processes. The purpose is to make inferences about a population parameter by analyzing differences between an observed sample statistic and the results one would expect to obtain if some underlying assumption were true. This comparison may be a single observed value versus some hypothesized quantity, or it may be between two or more related or unrelated groups. The choice of statistical test depends on the nature of the data and the study design.

Neyman and Pearson proposed this process to circumvent Fisher's subjective practice of assessing the strength of evidence against the null effect. In its usual form, two hypotheses are put forward: a null hypothesis (usually a statement of null effect) and an alternative hypothesis (usually the opposite of the null hypothesis). Based on the outcome of the hypothesis test, one hypothesis is rejected and the other accepted, based on a previously predetermined arbitrary benchmark: the significance level against which the P value is compared. However, one runs the risk of making an error: one may reject a hypothesis when in fact it should be accepted, and vice versa. There is the type I error or α error (i.e., concluding there was a difference when really there was none) and the type II error or β error (i.e., concluding there was no difference when actually there was one). In its simple format, testing a hypothesis involves the following steps:

Identify null and alternative hypotheses.

Determine the appropriate test statistic and its distribution under the assumption that the null hypothesis is true.

Specify the significance level and determine the corresponding critical value of the test statistic under the assumption that null hypothesis is true.

Calculate the test statistic from the data and compare it with the critical value (or compute the exact P value) to reach a decision.

Having discussed the P value and hypothesis testing, the fallacies of hypothesis testing and of the P value are now looked into.

Fallacies of Hypothesis Testing

In a paper I submitted for publication in one of the widely read medical journals in Nigeria, one of the reviewers commented on the age-sex distribution of the participants: "Is there any difference in sex distribution, subject to chi square statistics?" Statistically, this question does not convey any meaningful query, and this is one of many instances among medical researchers (and postgraduate supervisors alike) in which the test of hypothesis is quickly and spontaneously resorted to without due consideration of its appropriate application. The aim of my research was to determine the prevalence of diabetes mellitus in a rural community; it was not part of my objectives to determine any association between sex and the prevalence of diabetes mellitus. To the inexperienced, this comment will definitely prompt conducting a test of hypothesis simply to satisfy the editor and reviewer so that the article will sail through. However, the results of such statistical tests become difficult to understand and interpret in the light of the data. (It turned out that all participants with elevated fasting blood glucose were female.) There are several fallacies associated with hypothesis testing. Below is a small list that will help you avoid them.

Failure to reject the null hypothesis leads to its acceptance. (No. When you fail to reject the null hypothesis, it means there is insufficient evidence to reject it.)

The use of α = 0.05 is a standard with an objective basis. (No. α = 0.05 is merely a convention that evolved from the practice of R.A. Fisher. There is no sharp distinction between "significant" and "not significant" results, only increasingly strong evidence against the null hypothesis as P becomes smaller; P = 0.02 is stronger than P = 0.04.)

A small P value indicates a large effect. (No. The P value tells you nothing about the size of an effect.)

Statistical significance implies clinical importance. (No. Statistical significance says very little about the clinical importance of a relation. There is a big gulf between statistical significance and clinical significance. By definition, at α = 0.05, 1 in 20 comparisons in which the null hypothesis is true will result in P < 0.05.)

Finally, with these and many other fallacies of hypothesis testing, it is rather sad to read in journals how significance testing has become insignificance testing.

Fallacies of P Value

Just as the test of hypothesis is associated with some fallacies, so also is the P value, and they share common root causes. "It comes to be seen as natural that any finding worth its salt should have a P value less than 0.05 flashing like a divinely appointed stamp of approval" 12. The inherent subjectivity of Fisher's P value approach, and the subsequent poor understanding of this approach by the medical community, could be the reason why the P value is associated with a myriad of fallacies. In addition, P values produced by researchers as mere "passports to publication" aggravated the situation 13. We were awakened early on to the inadequacy of the P value in clinical trials by Feinstein 14:

"The method of making statistical decisions about 'significance' creates one of the most devastating ironies in modern biologic science. To avoid usual categorical data, a critical investigator will usually go to enormous efforts in mensuration. He will get special machines and elaborate technologic devices to supplement his old categorical statement with new measurements of 'continuous' dimensional data. After all this work in getting 'continuous' data, however, and after calculating all the statistical tests of the data, the investigator then makes the final decision about his results on the basis of a completely arbitrary pair of dichotomous categories. These categories, which are called 'significant' and 'nonsignificant', are usually demarcated by a P value of either 0.05 or 0.01, chosen according to the capricious dictates of the statistician, the editor, the reviewer or the granting agency. If the level demanded for 'significant' is 0.05 or lower and the P value that emerge is 0.06, the investigator may be ready to discard a well-designed, excellently conducted, thoughtfully analyzed, and scientifically important experiment because it failed to cross the Procrustean boundary demanded for statistical approbation."

We should try to understand that Fisher wanted to have an index of measurement that would help him decide the strength of evidence against the null effect. But, as has been said earlier, his idea was poorly understood and criticized, and this led Neyman and Pearson to develop hypothesis testing in order to get round the problem. But this is the result of their attempt: "accept" or "reject" the null hypothesis, or alternatively "significant" or "non-significant". The inadequacy of the P value in decision making pervades all epidemiological study designs. This head-or-tail approach to the test of hypothesis has pushed the stakeholders in the field (statistician, editor, reviewer or granting agency) into ever increasing confusion and difficulty. It is an accepted fact among statisticians that the P value is inadequate as a sole standard of judgment in the analysis of clinical trials 15. Just as hypothesis testing is not devoid of caveats, so also are P values. Some of these are exposed below.

The threshold value, P < 0.05, is arbitrary. As has been said earlier, it was the practice of Fisher to assign P the value of 0.05 as a measure of evidence against the null effect. One can make the "significance test" more stringent by moving to 0.01 (1%) or less stringent by moving the borderline to 0.10 (10%). By dichotomizing P values into "significant" and "non-significant", one loses information, in the same way as demarcating laboratory findings into "normal" and "abnormal"; one may ask, what is the difference between a fasting blood glucose of 25 mmol/L and 15 mmol/L?

Statistically significant (P < 0.05) findings are assumed to result from real treatment effects, ignoring the fact that 1 in 20 comparisons in which the null hypothesis is true will result in a significant finding (P < 0.05). This problem is more serious when several tests of hypothesis involving several variables are carried out without using the appropriate statistical test (e.g., ANOVA rather than repeated t-tests).

A statistically significant result does not translate into clinical importance. A large study can detect a small, clinically unimportant finding.

Chance is rarely the most important issue. Remember that when conducting research, a questionnaire is usually administered to participants. This questionnaire in most instances collects a large amount of information on the several variables it includes. The manner in which the questions were asked and the manner in which they were answered are important sources of error (systematic error) which are difficult to measure.

What Influences P Value?

Generally, the following factors influence the P value.

Effect size. It is a usual research objective to detect a difference between two drugs, procedures or programmes. Several statistics are employed to measure the magnitude of the effect produced by these interventions. They include r², η², ω², R², Q², Cohen's d, and Hedges' g. Two problems are encountered: the choice of an appropriate index for measuring the effect, and the size of the effect itself. A 7-kg or 10-mmHg difference will have a lower P value (and is more likely to be significant) than a 2-kg or 4-mmHg difference.

Size of sample. The larger the sample, the more likely a difference is to be detected. Further, a 7-kg difference in a study with 500 participants in each group will give a lower P value than a 7-kg difference observed in a study involving 250 participants in each group.

Spread of the data. The spread of observations in a data set is measured commonly with the standard deviation. The bigger the standard deviation, the more the spread of observations and the higher the P value.
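A short sketch illustrating all three factors with a t-test computed from summary statistics; every number below is invented purely to show the direction of the effect on P:

```python
# How effect size, sample size and spread each move the P-value.
from scipy import stats

def p_value(diff, sd, n):
    """Two-sided P for a difference `diff` between two groups of size n with SD sd."""
    return stats.ttest_ind_from_stats(diff, sd, n, 0.0, sd, n).pvalue

print(p_value(diff=7, sd=15, n=250))   # baseline: 7 kg difference
print(p_value(diff=2, sd=15, n=250))   # smaller effect  -> larger P
print(p_value(diff=7, sd=15, n=500))   # larger sample   -> smaller P
print(p_value(diff=7, sd=30, n=250))   # bigger spread   -> larger P
```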

P Value and Statistical Significance: An Uncommon Ground

Neither the Fisherian nor the Neyman-Pearson (N-P) school upheld the practice of stating, "P values of less than 0.05 were regarded as statistically significant" or "the P-value was 0.02 and therefore there was a statistically significant difference." These statements and many similar statements have criss-crossed medical journals and standard textbooks of statistics and provided an uncommon ground for marrying the two schools. This marriage of inconvenience further deepened the confusion and misunderstanding of the Fisherian and Neyman-Pearson schools. The combination of Fisherian and N-P thoughts (as exemplified in the above statements) did not shed light on the correct interpretation of the statistical test of hypothesis and the p-value. The hybrid of the two schools, as often read in medical journals and textbooks of statistics, makes it seem as if the two schools were and are compatible as a single coherent method of statistical inference 4, 23, 24. This confusion, perpetuated by medical journals, textbooks of statistics, reviewers and editors, has almost made it impossible for a research report to be published without statements or notations such as "statistically significant", "statistically insignificant", "P < 0.05" or "P > 0.05". Sterne then asked, "Can we get rid of P-values?" His answer was "practical experience says no - why?" 21.

However, the next section, "P-value and confidence interval: a common ground", provides one of the possible ways out of the seemingly insoluble problem. Goodman commented on the P-value and confidence interval approach in statistical inference and its ability to solve the problem: "The few efforts to eliminate P values from journals in favor of confidence intervals have not generally been successful, indicating that the researchers' need for a measure of evidence remains strong and that they often feel lost without one" 6.

P Value and Confidence Interval: A Common Ground

Thus, so far this paper has examined the historical evolution of 'significance' testing as initially proposed by R.A. Fisher. Neyman and Pearson were not accustomed to his subjective approach and therefore proposed 'hypothesis testing' involving binary outcomes: "accept" or "reject" the null hypothesis. This, as we saw, did not "solve" the problem completely. Thus, a common ground was needed, and the combination of P value and confidence intervals provided the much needed common ground.

Before proceeding, we should briefly understand what confidence intervals (CIs) mean, having gone through what p-values and hypothesis testing mean. Suppose that we have two diets, A and B, given to two groups of malnourished children. An 8-kg increase in body weight was observed among children on diet A while a 3-kg increase was observed on diet B. The difference in weight gain is therefore 5 kg on average. But it is obvious that the true difference might be smaller or larger than this, so a range of plausible values, together with the chance associated with that range, can be reported: the confidence interval. A 95% confidence interval in this example means that if the study were repeated 100 times, then about 95 out of 100 times the computed CI would contain the true difference in weight gain. Formally, a 95% CI is "the interval computed from the sample data which, when the study is repeated multiple times, would contain the true effect 95% of the time."
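A small simulation sketch of that repeated-sampling interpretation, with an arbitrary true mean and sample size chosen only for illustration:

```python
# Across many repeated studies, roughly 95% of the computed 95% CIs contain
# the true mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_mean, n, n_sim = 5.0, 30, 10_000
covered = 0

for _ in range(n_sim):
    sample = rng.normal(true_mean, 2.0, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf(0.975, n - 1) * se
    lo, hi = sample.mean() - margin, sample.mean() + margin
    covered += (lo <= true_mean <= hi)

print(f"Coverage: {covered / n_sim:.3f}")   # close to 0.95
```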

In the 1980s, a number of British statisticians tried to promote the use of this common-ground approach in presenting statistical analysis 16, 17, 18. They encouraged the combined presentation of P values and confidence intervals. The use of confidence intervals in addressing hypothesis testing is one of the four popular methods for which journal editors and eminent statisticians have issued statements supporting its use 19. In line with this, the American Psychological Association's Board of Scientific Affairs commissioned a white paper, "Task Force on Statistical Inference". The Task Force suggested,

“When reporting inferential statistics (e.g. t - tests, F - tests, and chi-square) include information about the obtained ….. value of the test statistic, the degree of freedom, the probability of obtaining a value as extreme as or more extreme than the one obtained [i.e., the P value]…. Be sure to include sufficient descriptive statistics [e.g. per-cell sample size, means, correlations, standard deviations]…. The reporting of confidence intervals [for estimates of parameters, for functions of parameter such as differences in means, and for effect sizes] can be an extremely effective way of reporting results… because confidence intervals combine information on location and precision and can often be directly used to infer significance levels” 20 .

Jonathan Sterne and Davey Smith came up with their suggested guidelines for reporting statistical analysis as shown in the box 21 :

Box 1: Suggested guidance for the reporting of results of statistical analyses in medical journals.

The description of differences as statistically significant is not acceptable.

Confidence intervals for the main results should always be included, but 90% rather than 95% levels should be used. Confidence intervals should not be used as a surrogate means of examining significance at the conventional 5% level. Interpretation of confidence intervals should focus on the implication (clinical importance) of the range of values in the interval.

When there is a meaningful null hypothesis, the strength of evidence against it should be indexed by the P value. The smaller the P value, the stronger is the evidence.

While it is impossible to reduce substantially the amount of data dredging that is carried out, authors should take a very skeptical view of subgroup analyses in clinical trials and observational studies. The strength of the evidence for interaction (that effects really differ between subgroups) should always be presented. Claims made on the basis of subgroup findings should be even more tempered than claims made about main effects.

In observational studies it should be remembered that considerations of confounding and bias are at least as important as the issues discussed in this paper.

Since the 1980s, when British statisticians championed the use of confidence intervals, journal after journal has issued statements regarding their use. An editorial in Clinical Chemistry read as follows:

“There is no question that a confidence interval for the difference between two true (i.e., population) means or proportions, based on the observed difference between sample estimate, provides more useful information than a P value, no matter how exact, for the probability that the true difference is zero. The confidence interval reflects the precision of the sample values in terms of their standard deviation and the sample size …..’’ 22

On a final note, it is important to know why it is statistically superior to use P values with confidence intervals rather than P values with hypothesis testing alone:

Confidence intervals emphasize the importance of estimation over hypothesis testing. It is more informative to quote the magnitude of the effect than to adopt the significant/non-significant dichotomy of hypothesis testing.

The width of the CI provides a measure of the reliability or precision of the estimate.

Confidence intervals make it far easier to determine whether a finding has any substantive (e.g. clinical) importance, as opposed to mere statistical significance.

While statistical significance tests are vulnerable to type I error, CIs are not.

Confidence intervals can be used as a significance test. The simple rule is that if the 95% CI does not include the null value (usually zero for a difference in means or proportions, and one for a relative risk or odds ratio), the null hypothesis is rejected at the 0.05 level.

Finally, the use of CIs promotes cumulative knowledge development by obligating researchers to think meta-analytically about estimation, replication and comparing intervals across studies 25. For example, a meta-analysis of trials dealing with intravenous nitrates in acute myocardial infarction found a reduction in mortality of somewhere between one quarter and two-thirds. Meanwhile, the six previous trials 26 showed conflicting results: some suggested that it was dangerous to give intravenous nitrates while others suggested that it actually reduced mortality. For the six trials, the odds ratios, 95% CIs and P-values are: OR = 0.33 (CI 0.09 to 1.13, P = 0.08); OR = 0.24 (CI 0.08 to 0.74, P = 0.01); OR = 0.83 (CI 0.33 to 2.12, P = 0.07); OR = 2.04 (CI 0.39 to 10.71, P = 0.04); OR = 0.58 (CI 0.19 to 1.65, P = 0.29); and OR = 0.48 (CI 0.28 to 0.82, P = 0.007). The fourth trial appears harmful and the first, third and fifth are inconclusive (their CIs include 1), while the second and the sixth appear useful (in reducing mortality).
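Applying the CI-as-significance-test rule from the previous section to the six quoted trials is a one-line check per trial; a sketch using only the intervals listed above:

```python
# Check which of the six quoted odds ratios are significant at the 0.05 level:
# an OR whose 95% CI excludes 1 is significant.
trials = [
    (0.33, 0.09, 1.13),   # OR, lower CI, upper CI
    (0.24, 0.08, 0.74),
    (0.83, 0.33, 2.12),
    (2.04, 0.39, 10.71),
    (0.58, 0.19, 1.65),
    (0.48, 0.28, 0.82),
]

for i, (odds_ratio, lo, hi) in enumerate(trials, start=1):
    significant = not (lo <= 1.0 <= hi)     # null value for an odds ratio is 1
    print(f"Trial {i}: OR = {odds_ratio:.2f}, CI ({lo}, {hi}), "
          f"{'significant' if significant else 'not significant'} at 0.05")
```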

What is to be done?

While it is possible to make a change and improve the practice, as Cohen warns, "Don't look for a magic alternative … It does not exist" 27.

The foundation for change in this practice should be laid where statistics is taught: the classroom. The curriculum and classroom teaching should clearly differentiate between the two schools. The historical evolution should be clearly explained, as should the meaning of "statistical significance". The classroom teaching of the correct concepts should begin at the undergraduate level and move up to graduate instruction, even if it means this teaching would be at an introductory level.

We should promote and encourage the use of confidence intervals around sample statistics and effect sizes. This duty lies in the hands of statistics teachers, medical journal editors, reviewers and any granting agency.

Generally, researchers preparing a study are encouraged to consult a statistician at the initial stage to avoid misinterpreting the P value, especially if they are using statistical software for their data analysis.


What is Statistical Analysis?

First, let's clarify that "statistical analysis" is just another way of saying "statistics." Now, the official definition:

Statistical analysis is a study, a science of collecting, organizing, exploring, interpreting, and presenting data and uncovering patterns and trends.

Many businesses rely on statistical analysis and it is becoming more and more important. One of the main reasons is that statistical data is used to predict future trends and to minimize risks.

Furthermore, if you look around you, you will see a huge number of products (your mobile phone for example) that have been improved thanks to the results of the statistical research and analysis.

Here are some of the fields where statistics play an important role:

  • Data collection methods and analysis
  • Business intelligence
  • Data analysis
  • SEO and optimization for user search intent
  • Financial analysis and many others.

Statistics allows businesses to dig deeper into specific information to see the current situation and future trends, and to make the most appropriate decisions.

There are two key types of statistical analysis: descriptive and inferential.

The Two Main Types of Statistical Analysis

In the real world of analysis, when analyzing information, it is normal to use both descriptive and inferential types of statistics.

Commonly, research conducted on groups of people (such as marketing research for defining market segments) uses both descriptive and inferential statistics to analyze results and come up with conclusions.

What is descriptive and inferential statistics? What is the difference between them?

Descriptive Type of Statistical Analysis

As the name suggests, descriptive statistics is used to describe! It describes the basic features of the information and shows or summarizes data in a rational way. Descriptive statistics is the practice of quantitatively describing data.

This type of statistics draws in all of the data from a certain population (a population is a whole group; every member of that group) or a sample of it. Descriptive statistics can include numbers, charts, tables, graphs, or other data visualization types to present raw data.

However, descriptive statistics do not allow you to draw conclusions. You cannot make generalizations that extend beyond the data at hand. With descriptive statistics, you can simply describe what is and what the data present.

For example, if you have a data population that includes 30 workers in a business department, you can find the average of that data set for those 30 workers. However, you can't determine the average for all the workers in the whole company using just that data. Imagine that this company has 10,000 workers.

Despite that, this type of statistics is very important because it allows us to show data in a meaningful way. It also can give us the ability to make a simple interpretation of the data.

In addition, it helps us to simplify large amounts of data in a reasonable way.

Inferential Type of Statistical Analysis

As you saw above, the main limitation of descriptive statistics is that it only allows you to make summaries about the objects or people that you have actually measured.

It is a serious limitation. This is where inferential statistics comes in.

Inferential statistics relies on more complicated mathematical estimation and allows us to infer trends about a larger population based on samples of "subjects" taken from it.

This type of statistical analysis is used to study the relationships between variables within a sample, and from it you can make conclusions, generalizations or predictions about a bigger population; this only works insofar as the sample accurately represents the population.

Moreover, inferential statistics allows businesses and other organizations to test a hypothesis and come up with conclusions about the data.

One of the key reasons for the existence of inferential statistics is that it is usually too costly to study an entire population of people or objects.

To sum up the two main types of statistical analysis: descriptive statistics are used to describe data, while inferential statistics goes further and is used to infer conclusions and test hypotheses.

Other Types of Statistics

While the above two types of statistical analysis are the main, there are also other important types every scientist who works with data should know.

Predictive Analytics

If you want to make predictions about future events, predictive analysis is what you need. This analysis is based on current and historical facts.

Predictive analytics uses statistical algorithms and machine learning techniques to define the likelihood of future results, behavior, and trends based on both new and historical data.

Data-driven marketing, financial services, online services providers, and insurance companies are among the main users of predictive analytics.

More and more businesses are starting to implement predictive analytics to increase competitive advantage and to minimize the risk associated with an unpredictable future.

Predictive analytics can use a variety of techniques such as data mining, modeling, artificial intelligence, and machine learning to make important predictions about the future.

It is important to note that no statistical method can "predict" the future with 100% certainty. Businesses use these statistics to answer the question "What might happen?". Remember that predictive analytics is based on probabilities.

Prescriptive Analytics

Prescriptive analytics is a study that examines data to answer the question "What should be done?" It is a common area of business analysis dedicated to identifying the best move or action for a specific situation.

Prescriptive analytics aims to find the optimal recommendations for a decision making process. It is all about providing advice.

Prescriptive analytics is related to descriptive and predictive analytics. While descriptive analytics describe what has happened and predictive analytics helps to predict what might happen, prescriptive statistics aims to find the best options among available choices.

Prescriptive analytics uses techniques such as simulation, graph analysis, business rules, algorithms, complex event processing, recommendation engines, and machine learning.

Causal Analysis

When you would like to understand and identify the reasons why things are as they are, causal analysis comes to help. This type of analysis answers the question "Why?"

The business world is full of events that lead to failure. Causal analysis seeks to identify the reasons why. It is better to find the causes and treat them instead of treating the symptoms.

Causal analysis searches for the root cause – the basic reason why something happens.

Causal analysis is a common practice in industries that address major disasters. However, it is becoming more popular in business, especially in the IT field. For example, causal analysis is a common practice in quality assurance in the software industry.

So, let's sum up the goals of causal analysis:

  • To identify key problem areas.
  • To investigate and determine the root cause.
  • To understand what happens to a given variable if you change another.

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is a complement to inferential statistics. It is used mostly by data scientists.

EDA is an analysis approach that focuses on identifying general patterns in the data and on finding previously unknown relationships.

The purpose of exploratory data analysis is to:

  • Check for mistakes or missing data.
  • Discover new connections.
  • Collect maximum insight into the data set.
  • Check assumptions and hypotheses.

EDA alone should not be used for generalizing or predicting. EDA is used for taking a bird's eye view of the data and trying to make sense of it. Commonly, it is the first step in data analysis, performed before other formal statistical techniques.

Mechanistic Analysis

Mechanistic analysis is not a common type of statistical analysis. However, it is worth mentioning here because, in some fields such as big data analysis, it has an important role.

Mechanistic analysis is about understanding the exact changes in given variables that lead to changes in other variables. However, it does not consider external influences. The assumption is that a given system is affected only by the interaction of its own components.

It is useful for systems that have very clear definitions; biological science, for example, can make use of it.


“Context and Calories”

Does the company you keep impact what you eat? This example comes from an article titled “Impact of Group Settings and Gender on Meals Purchased by College Students” (Allen-O’Donnell, M., T. C. Nowak, K. A. Snyder, and M. D. Cottingham, Journal of Applied Social Psychology 49(9), 2011, onlinelibrary.wiley.com/doi/10.1111/j.1559-1816.2011.00804.x/full) . In this study, researchers examined this issue in the context of gender-related theories in their field. For our purposes, we look at this research more narrowly.

Step 1: Stating the hypotheses.

In the article, the authors make the following hypothesis. “The attempt to appear feminine will be empirically demonstrated by the purchase of fewer calories by women in mixed-gender groups than by women in same-gender groups.” We translate this into a simpler and narrower research question: Do women purchase fewer calories when they eat with men compared to when they eat with women?

Here the two populations are “women eating with women” (population 1) and “women eating with men” (population 2). The variable is the calories in the meal. We test the following hypotheses at the 5% level of significance.

The null hypothesis is always H0: μ1 – μ2 = 0, which is the same as H0: μ1 = μ2.

Here μ1 represents the mean number of calories ordered by women when they were eating with other women, and μ2 represents the mean number of calories ordered by women when they were eating with men.

Note: It does not matter which population we label as 1 or 2, but once we decide, we have to stay consistent throughout the hypothesis test. Since we expect the number of calories to be greater for the women eating with other women, the difference is positive if “women eating with women” is population 1. If you prefer to work with positive numbers, choose the group with the larger expected mean as population 1. This is a good general tip.

Step 2: Collect Data.

As usual, there are two major things to keep in mind when considering the collection of data.

  • Samples need to be representative of the population in question.
  • Samples need to be random in order to remove or minimize bias.

The researchers state their hypothesis in terms of “women.” We did the same. But the researchers gathered data by watching people eat at the HUB Rock Café II on the campus of Indiana University of Pennsylvania during the Spring semester of 2006. Almost all of the women in the data set were white undergraduates between the ages of 18 and 24, so there are some definite limitations on the scope of this study. These limitations will affect our conclusion (and the specific definition of the population means in our hypotheses.)

The observations were collected on February 13, 2006, through February 22, 2006, between 11 a.m. and 7 p.m. We can see that the researchers included both lunch and dinner. They also made observations on all days of the week to ensure that weekly customer patterns did not confound their findings. The authors state that “since the time period for observations and the place where [they] observed students were limited, the sample was a convenience sample.” Despite these limitations, the researchers conducted inference procedures with the data, and the results were published in a reputable journal. We will also conduct inference with this data, but we also include a discussion of the limitations of the study with our conclusion. The authors did this, also.

Did the data meet the conditions for use of a t-test?

The researchers reported the following sample statistics.

  • In a sample of 45 women dining with other women, the average number of calories ordered was 850, and the standard deviation was 252.
  • In a sample of 27 women dining with men, the average number of calories ordered was 719, and the standard deviation was 322.

One of the samples has fewer than 30 women. We need to make sure the distribution of calories in this sample is not heavily skewed and has no outliers, but we do not have access to a spreadsheet of the actual data. Since the researchers conducted a t-test with this data, we will assume that the conditions are met. This includes the assumption that the samples are independent.

Step 3: Assess the evidence.

As noted previously, the researchers reported the following sample statistics.

  • In a sample of 45 women dining with other women, the average number of calories ordered was 850, and the standard deviation was 252.
  • In a sample of 27 women dining with men, the average number of calories ordered was 719, and the standard deviation was 322.

To compute the t-test statistic, make sure sample 1 corresponds to population 1. Here our population 1 is "women eating with other women." So x̄1 = 850, s1 = 252, n1 = 45, and so on.

Using technology, we determined that the degrees of freedom are about 45 for this data. To find the P-value, we use our familiar simulation of the t-distribution. Since the alternative hypothesis is a “greater than” statement, we look for the area to the right of T = 1.81. The P-value is 0.0385.
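The reported values can be checked directly from the published summary statistics with a Welch (unequal-variance) two-sample t-test; a sketch:

```python
# Reproduce t, the approximate degrees of freedom and the one-sided P-value
# from the reported sample means, standard deviations and sizes.
import numpy as np
from scipy import stats

m1, s1, n1 = 850, 252, 45    # women eating with women
m2, s2, n2 = 719, 322, 27    # women eating with men

se = np.sqrt(s1**2 / n1 + s2**2 / n2)
t_stat = (m1 - m2) / se

# Welch-Satterthwaite degrees of freedom (about 45 for these data)
df = (s1**2 / n1 + s2**2 / n2)**2 / (
    (s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))

p_one_sided = stats.t.sf(t_stat, df)   # area to the right of t
print(f"t = {t_stat:.2f}, df = {df:.1f}, P = {p_one_sided:.4f}")   # ~1.81, ~45, ~0.0385
```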

Step 4: State a conclusion.

The hypotheses for this test are H0: μ1 – μ2 = 0 and Ha: μ1 – μ2 > 0. Since the P-value is less than the significance level (0.0385 < 0.05), we reject H0 and accept Ha.

At Indiana University of Pennsylvania, the mean number of calories ordered by undergraduate women eating with other women is greater than the mean number of calories ordered by undergraduate women eating with men (P-value = 0.0385).

A Comment about Conclusions

In the conclusion above, we did not generalize the findings to all women. Since the samples included only undergraduate women at one university, we included this information in our conclusion. But our conclusion is a cautious statement of the findings. The authors see the results more broadly in the context of theories in the field of social psychology. In the context of these theories, they write, “Our findings support the assertion that meal size is a tool for influencing the impressions of others. For traditional-age, predominantly White college women, diminished meal size appears to be an attempt to assert femininity in groups that include men.” This viewpoint is echoed in the following summary of the study for the general public on National Public Radio (npr.org).

  • Both men and women appear to choose larger portions when they eat with women, and both men and women choose smaller portions when they eat in the company of men, according to new research published in the Journal of Applied Social Psychology. The study, conducted among a sample of 127 college students, suggests that both men and women are influenced by unconscious scripts about how to behave in each other’s company. And these scripts change the way men and women eat when they eat together and when they eat apart.

Should we be concerned that the findings of this study are generalized in this way? Perhaps. But the authors of the article address this concern by including the following disclaimer with their findings: “While the results of our research are suggestive, they should be replicated with larger, representative samples. Studies should be done not only with primarily White, middle-class college students, but also with students who differ in terms of race/ethnicity, social class, age, sexual orientation, and so forth.” This is an example of good statistical practice. It is often very difficult to select truly random samples from the populations of interest. Researchers therefore discuss the limitations of their sampling design when they discuss their conclusions.

In the following activities, you will have the opportunity to practice parts of the hypothesis test for a difference in two population means. On the next page, the activities focus on the entire process and also incorporate technology.


S.3.3 Hypothesis Testing Examples

An engineer measured the Brinell hardness of 25 pieces of ductile iron that were subcritically annealed. The resulting data were:

170 167 174 179 179 187 179 183 179
156 163 156 187 156 167 156 174 170
183 179 174 179 170 159 187

The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is greater than 170. Therefore, he was interested in testing the hypotheses H0: μ = 170 versus HA: μ > 170.

The engineer entered his data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. He obtained the following output:

Descriptive Statistics

N     Mean     StDev    SE Mean
25    172.52   10.31    2.06

Null hypothesis          H₀: μ = 170
Alternative hypothesis   H₁: μ > 170

T-Value    P-Value
1.22       0.117

The output tells us that the average Brinell hardness of the n = 25 pieces of ductile iron was 172.52 with a standard deviation of 10.31. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 10.31 by the square root of n = 25, is 2.06). The test statistic t* is 1.22, and the P-value is 0.117.

If the engineer set his significance level α at 0.05 and used the critical value approach to conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were greater than 1.7109 (determined using statistical software or a t-table).

Since the engineer's test statistic, t* = 1.22, is not greater than 1.7109, the engineer fails to reject the null hypothesis. That is, the test statistic does not fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.

If the engineer used the P-value approach to conduct his hypothesis test, he would determine the area under a t(n − 1) = t(24) curve and to the right of the test statistic t* = 1.22.

In the output above, Minitab reports that the P-value is 0.117. Since the P-value, 0.117, is greater than α = 0.05, the engineer fails to reject the null hypothesis. There is insufficient evidence, at the α = 0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.

Note that the engineer obtains the same scientific conclusion regardless of the approach used. This will always be the case.
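For readers without Minitab, the same numbers can be reproduced from the 25 listed measurements; a sketch using SciPy (assuming SciPy 1.6 or later for the one-sided `alternative` option):

```python
# One-sample t-test on the Brinell hardness data against a mean of 170.
from scipy import stats

hardness = [170, 167, 174, 179, 179, 187, 179, 183, 179,
            156, 163, 156, 187, 156, 167, 156, 174, 170,
            183, 179, 174, 179, 170, 159, 187]

result = stats.ttest_1samp(hardness, popmean=170, alternative="greater")
print(f"t = {result.statistic:.2f}, P = {result.pvalue:.3f}")   # ~1.22, ~0.117

# Critical value bounding the alpha = 0.05 rejection region
print(f"critical t = {stats.t.ppf(0.95, df=len(hardness) - 1):.4f}")   # ~1.7109
```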

A biologist was interested in determining whether sunflower seedlings treated with an extract from Vinca minor roots had a lower average height than the standard height of 15.7 cm. The biologist treated a random sample of n = 33 seedlings with the extract and subsequently obtained the following heights:

11.5 11.8 15.7 16.1 14.1 10.5 9.3 15.0 11.1
15.2 19.0 12.8 12.4 19.2 13.5 12.2 13.3
16.5 13.5 14.4 16.7 10.9 13.0 10.3 15.8
15.1 17.1 13.3 12.4 8.5 14.3 12.9 13.5

The biologist's hypotheses are H0: μ = 15.7 versus HA: μ < 15.7.

The biologist entered her data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. She obtained the following output:

Descriptive Statistics

N     Mean     StDev   SE Mean
33    13.664   2.544   0.443

Null hypothesis          H₀: μ = 15.7
Alternative hypothesis   H₁: μ < 15.7

T-Value    P-Value
-4.60      0.000

The output tells us that the average height of the n = 33 sunflower seedlings was 13.664 with a standard deviation of 2.544. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 2.544 by the square root of n = 33, is 0.443). The test statistic t* is -4.60, and the P-value, reported to three decimal places, is 0.000.

Minitab Note. Minitab will always report P-values to only 3 decimal places. If Minitab reports the P-value as 0.000, it really means that the P-value is 0.000-something: greater than zero but too small to show at three decimal places. Throughout this course (and your future research!), when you see that Minitab reports the P-value as 0.000, you should report the P-value as being "< 0.001."

If the biologist set her significance level α at 0.05 and used the critical value approach to conduct her hypothesis test, she would reject the null hypothesis if her test statistic t* were less than -1.6939 (determined using statistical software or a t-table):

Since the biologist's test statistic, t* = -4.60, is less than -1.6939, the biologist rejects the null hypothesis. That is, the test statistic falls in the "critical region." There is sufficient evidence, at the α = 0.05 level, to conclude that the mean height of all such sunflower seedlings is less than 15.7 cm.

If the biologist used the P-value approach to conduct her hypothesis test, she would determine the area under a t distribution with n - 1 = 32 degrees of freedom to the left of the test statistic t* = -4.60:

In the output above, Minitab reports that the P-value is 0.000, which we take to mean < 0.001. Since the P-value is less than 0.001, it is clearly less than α = 0.05, and the biologist rejects the null hypothesis. There is sufficient evidence, at the α = 0.05 level, to conclude that the mean height of all such sunflower seedlings is less than 15.7 cm.
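Because the 33 raw heights are listed above, this one can also be rerun directly on the data. A minimal SciPy sketch (the alternative= keyword of scipy.stats.ttest_1samp requires SciPy 1.6 or later) should reproduce t* ≈ -4.60 and a P-value well below 0.001:

```python
from scipy import stats

heights = [
    11.5, 11.8, 15.7, 16.1, 14.1, 10.5,  9.3, 15.0, 11.1,
    15.2, 19.0, 12.8, 12.4, 19.2, 13.5, 12.2, 13.3,
    16.5, 13.5, 14.4, 16.7, 10.9, 13.0, 10.3, 15.8,
    15.1, 17.1, 13.3, 12.4,  8.5, 14.3, 12.9, 13.5,
]

# One-sample t-test of H0: mu = 15.7 against H1: mu < 15.7
result = stats.ttest_1samp(heights, popmean=15.7, alternative="less")
t_crit = stats.t.ppf(0.05, df=len(heights) - 1)   # lower critical value, about -1.6939

print(f"t* = {result.statistic:.2f}, P = {result.pvalue:.6f}")
print(f"critical value = {t_crit:.4f}")
# Expected: t* of about -4.60 and P well below 0.001, so H0 is rejected at alpha = 0.05
```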

Note again that the biologist obtains the same scientific conclusion regardless of the approach used. This will always be the case.

A manufacturer claims that the thickness of the spearmint gum it produces is 7.5 one-hundredths of an inch. A quality control specialist regularly checks this claim. On one production run, he took a random sample of n = 10 pieces of gum and measured their thickness. He obtained:

7.65 7.60 7.65 7.70 7.55
7.55 7.40 7.40 7.50 7.50

The quality control specialist's hypotheses are:

H₀: μ = 7.5 against H₁: μ ≠ 7.5

The quality control specialist entered his data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. He obtained the following output:

Descriptive Statistics

 N    Mean    StDev   SE Mean
10    7.55   0.1027    0.0325

Null hypothesis         H₀: μ = 7.5
Alternative hypothesis  H₁: μ ≠ 7.5

T-Value   P-Value
   1.54     0.158

The output tells us that the average thickness of the n = 10 pieces of gum was 7.55 one-hundredths of an inch with a standard deviation of 0.1027. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 0.1027 by the square root of n = 10, is 0.0325). The test statistic t* is 1.54, and the P-value is 0.158.

If the quality control specialist set his significance level α at 0.05 and used the critical value approach to conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were less than -2.2616 or greater than 2.2616 (determined using statistical software or a t-table):

Since the quality control specialist's test statistic, t* = 1.54, is not less than -2.2616 nor greater than 2.2616, the quality control specialist fails to reject the null hypothesis. That is, the test statistic does not fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean thickness of all of the manufacturer's spearmint gum differs from 7.5 one-hundredths of an inch.

If the quality control specialist used the P-value approach to conduct his hypothesis test, he would determine the area under a t distribution with n - 1 = 9 degrees of freedom to the right of 1.54 and to the left of -1.54:

In the output above, Minitab reports that the P-value is 0.158. Since the P-value, 0.158, is greater than α = 0.05, the quality control specialist fails to reject the null hypothesis. There is insufficient evidence, at the α = 0.05 level, to conclude that the mean thickness of all pieces of spearmint gum differs from 7.5 one-hundredths of an inch.
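The same kind of cross-check can be run on the 10 thickness measurements listed above. Again, this is a SciPy sketch rather than the original Minitab analysis:

```python
from scipy import stats

thickness = [7.65, 7.60, 7.65, 7.70, 7.55,
             7.55, 7.40, 7.40, 7.50, 7.50]

# Two-sided one-sample t-test of H0: mu = 7.5 against H1: mu != 7.5
result = stats.ttest_1samp(thickness, popmean=7.5)     # two-sided is the default
t_crit = stats.t.ppf(0.975, df=len(thickness) - 1)     # about 2.2616

print(f"t* = {result.statistic:.2f}, P = {result.pvalue:.3f}")
print(f"critical values = -{t_crit:.4f} and +{t_crit:.4f}")
# Expected: t* of about 1.54 and P of about 0.158, so H0 is not rejected at alpha = 0.05
```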

Note that the quality control specialist obtains the same scientific conclusion regardless of the approach used. This will always be the case.

In closing

In our review of hypothesis tests, we have focused on just one particular hypothesis test, namely that concerning the population mean μ. The important thing to recognize is that the topics discussed here (the general idea of hypothesis tests, errors in hypothesis testing, the critical value approach, and the P-value approach) generally extend to all of the hypothesis tests you will encounter.




Supplementary material

One supplementary item shows how the choice of n influences conclusions; another contains the raw data used to generate Figs. S4 and S5.

Data & Figures

The importance of displaying reproducibility. Drastically different experimental outcomes can result in the same plots and statistics unless experiment-to-experiment variability is considered. (A) Problematic plots treat n as the number of cells, resulting in tiny error bars and P values. These plots also conceal any systematic run-to-run error, mixing it with cell-to-cell variability. (B–D) To illustrate this, we simulated three different scenarios that all have identical underlying cell-level values but are clustered differently by experiment: B shows highly repeatable, unclustered data, C shows day-to-day variability but a consistent trend in each experiment, and D is dominated by one random run. Note that the plots in A that treat each cell as its own n fail to distinguish the three scenarios, claiming a significant difference after drug treatment, even when the experiments are not actually repeatable. To correct that, “SuperPlots” superimpose summary statistics from biological replicates consisting of independent experiments on top of data from all cells, and P values were calculated using an n of three, not 300. In this case, the cell-level values were separately pooled for each biological replicate and the mean calculated for each pool; those three means were then used to calculate the average (horizontal bar), standard error of the mean (error bars), and P value. While the dot plots in the “OK” column ensure that the P values are calculated correctly, they still fail to convey the experiment-to-experiment differences. In the SuperPlots, each biological replicate is color-coded: the averages from one experimental run are yellow dots, another independent experiment is represented by gray triangles, and a third experiment is shown as blue squares. This helps convey whether the trend is observed within each experimental run, as well as for the dataset as a whole. The beeswarm SuperPlots in the rightmost column represent each cell with a dot that is color-coded according to the biological replicate it came from. The P values represent an unpaired two-tailed t test (A) and a paired two-tailed t test (B–D). For tutorials on making SuperPlots in Prism, R, Python, and Excel, see the supporting information.
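The replicate-level calculation described in the legend can be sketched in a few lines of Python. The data below are hypothetical and simulated; scipy.stats.ttest_rel stands in for the paired two-tailed t test, with n equal to the number of biological replicates rather than the number of cells:

```python
import numpy as np
from scipy import stats

# Hypothetical cell-level data: 3 biological replicates, 100 cells each,
# measured in a control and a drug-treated condition (values are made up).
rng = np.random.default_rng(0)
control   = [rng.normal(10 + shift, 2, size=100) for shift in range(3)]
treatment = [rng.normal(12 + shift, 2, size=100) for shift in range(3)]

# Pool cell-level values within each replicate and keep only the replicate means,
# so the paired test uses n = 3 biological replicates, not n = 300 cells.
control_means   = [cells.mean() for cells in control]
treatment_means = [cells.mean() for cells in treatment]

t_stat, p_value = stats.ttest_rel(control_means, treatment_means)  # paired two-tailed t-test
print(f"paired t = {t_stat:.2f}, P = {p_value:.3f} (n = 3 biological replicates)")
```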


Other plotting examples. Bar plots can be enhanced even without using beeswarm plots. (A) Bar plots that calculate P and standard error of the mean using the number of cells as n are unhelpful. (B) A bar graph can be corrected by using biological replicates to calculate P value and standard error of the mean. (C) Dot plots reveal more than a simple bar graph. (D and E) Linking each pair by the replicate conveys important information about the trend in each experiment. (F) A SuperPlot not only shows information about each replicate and the trends, but also superimposes the distribution of the cell-level data, here using a violin plot.
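For readers working in Python rather than Prism or Excel, a minimal matplotlib sketch of the same idea (hypothetical data; cell-level points color-coded by biological replicate, with each replicate's mean superimposed) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
rep_colors = ["#E6B800", "#808080", "#4C72B0"]   # one color per biological replicate
conditions = {"Control": 10.0, "Drug": 12.0}     # hypothetical condition means

fig, ax = plt.subplots(figsize=(4, 4))
for x, (name, mu) in enumerate(conditions.items()):
    for rep, color in enumerate(rep_colors):
        cells = rng.normal(mu + 0.5 * rep, 1.5, size=50)        # cell-level values
        jitter = rng.uniform(-0.15, 0.15, size=cells.size)      # spread points sideways
        ax.scatter(x + jitter, cells, s=10, color=color, alpha=0.4)
        ax.scatter(x, cells.mean(), s=120, color=color,         # replicate mean on top
                   edgecolor="black", zorder=3)

ax.set_xticks(range(len(conditions)))
ax.set_xticklabels(conditions.keys())
ax.set_ylabel("Measurement (arbitrary units)")
plt.tight_layout()
plt.show()
```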


Tutorial for making SuperPlots in Prism. We describe how to make SuperPlots in GraphPad Prism 8 (version 8.1.0) graphing software. If using other graphing software, one may create a separate, different colored plot for each replicate, then overlay those plots in software like Adobe Illustrator. (A) When adding data to the table, leave a blank row between replicates. (B) Create a new graph of this existing data; under type of graph, select “Column” and “Individual values,” and select “No line or error bar.” (C) After formatting the universal features of the plot from B (e.g., symbol size, font, axes), go back to the data table and highlight the data values that correspond to one of the replicates. Under the “Change” menu, select “Format Points” and change the color, shape, etc. of the subset of points that correspond to that replicate. (D) Repeat for the other replicates to produce a graph with each trial color coded. (E and F) To display summary statistics, take the average of the technical replicates in each biological replicate (so you will have one value for each condition from each biological replicate), and enter those averages into another data table and graph. Use this data sheet that contains only the averages to run statistical tests. (G) To make a plot that combines the full dataset with the correct summary statistics, format this graph and overlay it with the above scatter SuperPlots (in Prism, this can be done on a “Layout”). This process could be tweaked to display other overlaid, color-coded plots (e.g., violin).


Tutorial for making SuperPlots in Excel. (A) To make a SuperPlot using Excel (Microsoft Office 365 ProPlus for Windows version 1912 Build 12325.20172), enter the values for the first replicate for the first condition into column B (highlighted in yellow), the second condition into column D (highlighted in yellow), and continue to skip columns between datasets for the remaining conditions and replicates (in this example, replicate 2 is highlighted in green and replicate 3 is in orange). For example, “Treatment A” could be control cells and “Treatment B” could be drug-treated cells. Label the empty columns as “x” and, starting with column A, enter random values to generate the scatter effect by using the formula “=RANDBETWEEN(25, 100)”. To create a gap between the datasets A and B, use larger X values for treatment B by entering the formula “=RANDBETWEEN(225, 300)”. (B) Highlight all the data and headings. In the insert menu, expand the charts menu to open the “Insert Chart” dialog box. Select “All Charts,” and choose “X Y Scatter.” Select the option that has Y values corresponding to your datasets. (In Excel for Mac, there is not a separate dialog box. Instead, make a scatter plot, right click on the plot and select “Select Data,” remove the “x” columns from the list, then manually select the corresponding “X values =” for each dataset.) (C) Change the general properties of the graph to your liking. In this example, we removed the chart title and the gridlines, added a black outline to the chart area, resized the graph, adjusted the x axis range to 0–325, removed the x axis labels, added a y axis title and tick marks, changed the font to Arial, and changed the font color to black. This style can be saved as a template for future use by right clicking. We recommend keeping the figure legend until the next step. (D) Next, double click the graph to open the “Format Plot Area” panel. Under “Chart Options,” select your first dataset, “Series Treatment A (replicate 1).” (On a Mac, click on a datapoint from one of the replicates, right click and select “Format Data Series.”) Select “Marker” and change the color and style of the data points. Repeat with the remaining datasets so that the colors, shapes, etc. correspond to the biological replicate the data points came from. Delete the chart legend and add axis labels with the text tool if desired. (E) Calculate the average for each replicate for each condition, and pair this value with the X coordinate of 62.5 for the first treatment, and 262.5 for the second treatment to center the values in the scatterplot. Then, click the graph, and under the “Chart Design” menu, click “Select Data.” Under “Legend Entries (Series),” select “Add” and under series name, select the three trial names, then select all three X and Y values for first treatment condition for “Series X Values” and “Series Y Values,” respectively. Repeat for the second treatment condition, and hit “OK.” (F) On the chart, select the data point corresponding to the first average and double click to isolate the data point. Format the size, color, etc. and repeat for remaining data points. (G) Optional: To add an average and error bars, either generate a second graph and overlay the data, or calculate the average and standard deviation using Excel and add the data series to the graph as was done in E and F, using the “-” symbol for the data point.
