P-Value Calculator — Z / T / Chi-Square / F Tests
Drop a test statistic and pick the test type (Z, T, Chi-square, or F). Get the exact p-value plus significance verdicts at α=0.05 and α=0.01. One- or two-tailed for Z and T; right-tail for Chi and F. NIST-style numerical methods.
- Instant result
- Private — nothing saved
- Works on any device
- AI insight included
What is a P-Value?
A p-value is the probability of observing data at least as extreme as the data you actually collected, assuming the null hypothesis is true. That condition, “assuming the null is true,” is the single most important phrase in the definition, and it is the part that most non-statisticians get wrong. The p-value does not tell you the probability that the null hypothesis is true. It does not tell you the probability that your alternative hypothesis is true. It does not tell you the size of an effect or whether it matters in the real world. It only answers one narrow question: if nothing were going on, how surprising would this dataset be?
Small p-values mean the data is unlikely under the null: surprising enough that we suspect the null is wrong. Large p-values mean the data is consistent with the null, so we have no compelling reason to abandon it. Convention is to call p < 0.05 “statistically significant,” but that threshold is arbitrary, dates back to R. A. Fisher’s 1925 book Statistical Methods for Research Workers, and has been challenged repeatedly, most prominently by the American Statistical Association’s 2016 Statement on P-Values.
The p-value framework underlies a huge slice of empirical science: clinical trials, A/B tests, psychology experiments, agricultural studies, manufacturing quality control, economics. Understanding what it actually means — and what it definitely does not mean — is the most consequential statistical skill outside of just computing means and standard deviations.
The Methodology — Four Common Test Families
Every p-value comes from a test statistic plus the distribution that test statistic follows under the null. The four most common families this calculator supports:
- Z-test. Test statistic follows the standard normal distribution N(0, 1). Used when you know the population standard deviation, or when sample size is large enough (n ≥ 30) that sample SD is a good approximation. Common in A/B testing on conversion rates, large-sample proportion tests.
- T-test (Student’s t). Test statistic follows the t-distribution, indexed by degrees of freedom (df = n − 1 for a one-sample test). Has fatter tails than the normal for small df; converges to normal as df grows large. Use when you estimated σ from the data and the sample is small.
- Chi-square (χ²) test. Test statistic follows the chi-square distribution with df = (rows − 1) × (cols − 1) for contingency tables, or df = (categories − 1) for goodness-of-fit. Strictly right-tailed: only large values count as evidence against the null. Used for categorical data: goodness-of-fit (do observed counts match expected?), independence (are two categorical variables related?).
- F-test. Test statistic follows the F-distribution with two degrees-of-freedom parameters (df1 = numerator, df2 = denominator). Right-tailed by convention. Used for ANOVA (comparing means across 3+ groups), comparing variances of two distributions, and testing overall significance of a regression model.
All four are computed via cumulative distribution functions (CDFs): the normal CDF Φ(z) for z-tests, the regularized incomplete beta function for t and F, and the regularized incomplete gamma function for chi-square. Numerical accuracy is an absolute error of about 10⁻⁷ for the normal CDF and about 10⁻¹⁰ for the others, far beyond what any practical decision needs.
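As a rough illustration of the CDF machinery described above, here is a minimal standard-library Python sketch. It assumes the Abramowitz & Stegun 26.2.17 polynomial for the normal CDF and a series plus continued-fraction evaluation of the regularized incomplete gamma function for the chi-square tail. The function names are hypothetical and this is not the calculator's actual source; the t and F cases (incomplete beta) are left out for brevity.

```python
import math

# Abramowitz & Stegun 26.2.17 coefficients (absolute error below 7.5e-8).
_P = 0.2316419
_B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def normal_cdf(z: float) -> float:
    """Approximate Phi(z) with the A&S 26.2.17 polynomial."""
    x = abs(z)
    t = 1.0 / (1.0 + _P * x)
    poly = sum(b * t ** (k + 1) for k, b in enumerate(_B))
    upper_tail = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) * poly
    return 1.0 - upper_tail if z >= 0 else upper_tail


def gamma_q(a: float, x: float, eps: float = 1e-14, max_iter: int = 500) -> float:
    """Regularized upper incomplete gamma function Q(a, x)."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series for P(a, x); converges quickly when x < a + 1.
        term = total = 1.0 / a
        for n in range(1, max_iter):
            term *= x / (a + n)
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x) (modified Lentz), used when x >= a + 1.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)


def chi_square_p(stat: float, df: int) -> float:
    """Right-tail p-value for a chi-square statistic with df degrees of freedom."""
    return gamma_q(df / 2.0, stat / 2.0)
```

One sanity check on a sketch like this: chi_square_p(4.0, 1) should agree with the two-tailed z p-value for z = 2.0, because the square of a standard normal variable follows a chi-square distribution with 1 degree of freedom.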
One-Tailed vs Two-Tailed
For z-tests and t-tests, you have to choose between a one-tailed and a two-tailed p-value. The distinction matters and gets misused constantly.
- Two-tailed asks: is the test statistic significantly far from zero in either direction? p = P(|Z| ≥ |observed|). This is the conservative, default-safe choice when you have no pre-specified directional hypothesis. Most published research and most A/B tests use two-tailed.
- One-tailed asks: is the test statistic significantly far from zero in a pre-specified direction? p = P(Z ≥ observed) for an upper-tailed test. When the effect lies in the hypothesized direction, this is half the two-tailed p-value, so it is easier to clear the α = 0.05 threshold, which is exactly why it gets abused.
The honest use of one-tailed is when the directional hypothesis was registered before any data was collected, and a result in the opposite direction would be considered no evidence at all. The dishonest use is peeking at the data, noticing the effect is in one direction, and then picking one-tailed to halve the p-value. That is a form of p-hacking and is one of the practices the ASA statement explicitly warns against.
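To make the tail choice concrete, the sketch below computes all three variants for a z statistic using only Python's standard library. The function name `z_p_value` and its `tail` argument are illustrative, not part of the calculator.

```python
import math


def z_p_value(z: float, tail: str = "two") -> float:
    """P-value for a z statistic; tail is 'two', 'upper', or 'lower'."""
    upper = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    if tail == "upper":
        return upper
    if tail == "lower":
        return 1.0 - upper  # P(Z <= z)
    if tail == "two":
        return math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| >= |z|)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")


print(z_p_value(2.0, "two"))    # about 0.0455
print(z_p_value(2.0, "upper"))  # about 0.0228
print(z_p_value(2.0, "lower"))  # about 0.9772
```

Note how the one-tailed p-value in the wrong direction is large: a pre-specified lower-tailed test treats z = 2.0 as no evidence at all.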
Worked Example — Is This Coin Fair?
A classic. You flip a coin 100 times and get 60 heads. Is the coin biased, or could this happen by chance with a fair coin?
Step 1 — State the null hypothesis.
H₀: the coin is fair (p = 0.5). H₁: the coin is not fair (p ≠ 0.5, two-tailed).
Step 2 — Compute the expected outcome under the null.
For n = 100 flips with p = 0.5, the expected number of heads is np = 50, and the standard deviation is √(np(1−p)) = √25 = 5.
Step 3 — Compute the z-statistic.
z = (60 − 50) / 5 = 2.0
Sixty heads is 2 standard deviations above the expected count.
Step 4 — Look up the two-tailed p-value.
For z = 2.0, the upper-tail probability is 1 − Φ(2.0) ≈ 0.0228. The two-tailed p-value is twice that: p = 2 × 0.0228 ≈ 0.0455.
Step 5 — Apply the decision rule.
At α = 0.05, p = 0.0455 is just below the threshold, so we reject the null at the 5% significance level. At α = 0.01 we would fail to reject. The honest report is “p ≈ 0.045, weakly suggestive of bias but well within the range where replication is needed.” A 60/100 split is roughly the edge of what you can call significant in a single experiment.
Drop z = 2.0, two-tailed, into the calculator above and the same p ≈ 0.0455 will fall out.
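The five steps can also be reproduced in a few lines of Python. This is a sketch of the normal-approximation calculation in the worked example (no continuity correction), with variable names chosen for illustration.

```python
import math

n, heads, p0 = 100, 60, 0.5

expected = n * p0                    # Step 2: 50 heads expected under H0
sd = math.sqrt(n * p0 * (1 - p0))    # Step 2: sqrt(25) = 5
z = (heads - expected) / sd          # Step 3: (60 - 50) / 5

upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # Step 4: 1 - Phi(z)
p_two_tailed = 2.0 * upper_tail

print(f"z = {z:.1f}")
print(f"upper tail = {upper_tail:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
print("reject at alpha = 0.05:", p_two_tailed < 0.05)  # Step 5
print("reject at alpha = 0.01:", p_two_tailed < 0.01)
```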
Significance Thresholds — What α Actually Buys You
The α level (significance threshold) is the false-positive rate you are willing to tolerate — the probability of incorrectly rejecting a true null. Common choices:
- α = 0.05 (5% false-positive rate). Standard in most social and behavioral science, A/B testing, basic clinical research. About 1 in 20 true nulls will be wrongly rejected.
- α = 0.01 (1% false-positive rate). Used in safety-critical domains and high-stakes clinical work where false positives are expensive.
- α = 0.10 (10% false-positive rate). Used in exploratory work where you want to spot interesting patterns at the cost of more false alarms, knowing you will replicate before publishing.
- α = 5 × 10⁻⁸. Genome-wide association studies. Each study runs on the order of a million effectively independent tests across the genome, so the threshold is set ferociously low to control the family-wise error rate: 0.05 / 10⁶ = 5 × 10⁻⁸ under a Bonferroni correction.
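A small sketch of how these thresholds turn a p-value into verdicts. The helper names are hypothetical, and the figure of one million tests for the genome-wide case is an assumption (it is the usual Bonferroni rationale for 5 × 10⁻⁸).

```python
def verdicts(p: float, alphas=(0.10, 0.05, 0.01)) -> dict:
    """Map each significance level to True if p falls below it."""
    return {alpha: p < alpha for alpha in alphas}


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


print(verdicts(0.0455))
print(bonferroni_threshold(0.05, 1_000_000))  # assumed test count for a GWAS
```

For p = 0.0455 the verdict is significant at 0.10 and 0.05 but not at 0.01, which is why reporting the p-value itself is more informative than reporting a single yes/no.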
Common Mistakes & Misinterpretations
- “p = 0.04 means there is a 96% chance the alternative is true.” Wrong: that is a Bayesian posterior, not a frequentist p-value. The p-value is computed assuming the null is true; it says nothing about the probability of any hypothesis being true.
- p-hacking. Running many tests, transformations, and analyses, and only reporting the ones that hit p < 0.05. With enough flexibility you can squeeze p < 0.05 out of pure noise. The remedy is preregistration of the analysis plan before seeing data.
- HARKing (Hypothesizing After Results are Known). Forming your hypothesis after looking at the data and then reporting the test as if it had been planned. This inflates false-positive rates dramatically. The remedy is to clearly separate exploratory from confirmatory analyses.
- Multiple comparisons without correction. If you run 20 independent tests, all under true nulls, you expect about 1 to come up at p < 0.05 by pure luck. Bonferroni (multiply each p by the number of tests), Benjamini-Hochberg (false-discovery-rate control), or other corrections are required when running families of tests.
- Confusing statistical significance with practical importance. With a large enough sample, a meaningless 0.1% difference can be highly statistically significant. Always report the effect size alongside the p-value — the size of the effect, not just whether it crossed a threshold.
- Optional stopping. Checking your A/B test every day and stopping the moment p < 0.05. This dramatically inflates false-positive rates. The remedy is either preregistering a fixed sample size, or using sequential testing methods designed for repeated peeks.
- Treating p just above and just below 0.05 as fundamentally different. p = 0.049 and p = 0.051 are essentially indistinguishable evidence-wise. The threshold is a convention, not a phase transition in nature.
- Publication bias. Studies with p < 0.05 get published; studies with p ≥ 0.05 sit in file drawers. The published literature is systematically biased toward false positives, especially in fields with small sample sizes and many possible analyses.
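The multiple-comparisons arithmetic in the list above is easy to check. The sketch below assumes the tests are independent and every null is true; the function names are illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests that cross alpha by chance alone."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent true-null tests crosses alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


print(expected_false_positives(20))         # about 1
print(round(familywise_error_rate(20), 4))  # about 0.64
```

So with 20 uncorrected tests, at least one spurious "significant" result is more likely than not.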
How to Use a P-Value Responsibly
- Preregister. Specify your hypothesis, analysis plan, and stopping rule before looking at data. Even a private timestamped note to yourself is better than nothing.
- Report effect size and confidence interval. A 95% CI that excludes the null value tells you essentially the same thing as p < 0.05, but in addition tells you how big the effect is and how precisely it is estimated. CIs are strictly more informative than bare p-values.
- Correct for multiple comparisons when running families of tests. Bonferroni for strict control; Benjamini-Hochberg for false-discovery rate; permutation tests for tightly correlated test families.
- Plan sample size in advance via power analysis. Running with a sample size big enough to detect the smallest effect that would matter to you, at a reasonable power (typically 80%), is the cleanest way to avoid both false positives and false negatives.
- Replicate before publishing. A single significant result in a single sample is weak evidence. The two-experiments rule (require replication in an independent sample before drawing conclusions) handles most p-hacking issues automatically.
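For the multiple-comparison corrections mentioned in the list above, here is a minimal sketch of Bonferroni and Benjamini-Hochberg adjusted p-values in plain Python. The function names and the sample p-values are made up for illustration; established statistics packages provide tested versions of both procedures.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false-discovery-rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values from four tests
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With these four hypothetical values, Bonferroni leaves two tests below 0.05 while Benjamini-Hochberg leaves all four, which reflects the stricter guarantee Bonferroni is buying.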
Confidence Intervals Are Usually Better
A 95% confidence interval and a two-tailed p-value at α = 0.05 give you exactly the same reject / fail-to-reject decision: the null is rejected if and only if the null value falls outside the 95% CI. But the CI gives you something extra: the size of the effect and the precision of the estimate, in the original units of the problem. “Mean difference = $230 [95% CI: $40, $420]” tells you more than “p = 0.018.” Whenever possible, report both.
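The equivalence can be checked numerically. The sketch below assumes a normal-based (z) interval and a standard error of 96.94, which is not given in the text but is back-solved from the quoted interval; the function name is hypothetical.

```python
from statistics import NormalDist


def z_interval_and_p(estimate, std_error, null_value=0.0, level=0.95):
    """Return a normal-based confidence interval and the two-tailed p-value."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    half_width = z_crit * std_error
    ci = (estimate - half_width, estimate + half_width)
    z = (estimate - null_value) / std_error
    p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return ci, p


# Assumed standard error, chosen so the interval matches the quoted example.
ci, p = z_interval_and_p(230.0, 96.94)
print(f"95% CI: ({ci[0]:.0f}, {ci[1]:.0f}), p = {p:.3f}")
print("CI excludes 0:", not (ci[0] <= 0.0 <= ci[1]), "| p < 0.05:", p < 0.05)
```

The two booleans always agree for this kind of interval, because both are statements about whether the estimate sits more than the critical number of standard errors from the null value.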
Bayesian Alternatives
The Bayesian alternative to the p-value is the Bayes factor (BF) or the posterior probability of the hypothesis given the data. Bayes factors do answer the “how strongly does the data favor one hypothesis over the other?” question that p-values are routinely misread as answering. They require specifying a prior, which is sometimes a feature (you encode what you already knew) and sometimes a bug (the prior is doing a lot of the work). Packages like R’s BayesFactor make this practical.
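As a small illustration of a Bayes factor, the sketch below revisits the coin from the worked example. It assumes a Uniform(0, 1) prior on the heads probability under the alternative, which is one choice among many and drives the result; the function name is hypothetical.

```python
import math


def bayes_factor_coin(heads: int, n: int) -> float:
    """BF10: uniform prior on the heads probability versus a fair coin."""
    # Marginal likelihood under H1: the binomial pmf integrated over a
    # Uniform(0, 1) prior equals 1 / (n + 1) for any head count.
    marginal_h1 = 1.0 / (n + 1)
    # Likelihood under H0: binomial pmf at p = 0.5.
    likelihood_h0 = math.comb(n, heads) / 2 ** n
    return marginal_h1 / likelihood_h0


print(bayes_factor_coin(60, 100))
```

Under these assumptions the Bayes factor for 60 heads in 100 flips comes out slightly below 1, meaning the data barely discriminate between the two hypotheses even though the two-tailed p-value is below 0.05. A more concentrated prior would give a different number, which is the "prior is doing a lot of the work" caveat in action.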
Related Calculators
- Z-Score Calculator: convert a raw observation into the z-statistic that feeds a z-test.
- Standard Deviation Calculator: compute the σ or s that the test statistic depends on.
- Percentage Calculator: quick percent-change and percent-of-total work for everyday data summaries.
- Ratio Calculator: compare two quantities directly.
- Average Calculator: for the mean step that goes into computing the test statistic.
Frequently Asked Questions
The most common questions we get about this calculator — each answer is kept under 60 words so you can scan.
What is a p-value?
The probability of observing a test statistic at least as extreme as the one you computed, ASSUMING the null hypothesis is true. Small p = the observed data is unlikely under H₀ → reject H₀. Large p = the observed data is consistent with H₀ → fail to reject. P-value is NOT 'probability that H₀ is true', a common misinterpretation flagged by the ASA Statement on P-Values (2016).
What significance level should I use?
Convention is α = 0.05 (5% false-positive rate) but this is just a convention. Stricter standards (α = 0.01) for medical/safety-critical decisions. Looser standards (α = 0.10) for exploratory work. The right α depends on the cost of false positives vs false negatives in your domain. The calculator shows results at both 0.05 and 0.01 for context.
When do I use a Z-test vs T-test?
Z-test: population σ is known, or the sample is large (n ≥ 30) so the normal approximation is adequate. T-test: σ is unknown and estimated from the sample. The T-test is more conservative for small samples; for n ≥ 50 the two give nearly identical results. Strictly, whenever you use the sample SD as an estimate of σ, the T-test is the correct choice at any sample size, and the large-sample Z-test is a convenient approximation.
What's the difference between one-tailed and two-tailed?
Two-tailed: tests whether the observed statistic is significantly DIFFERENT (in either direction) from H₀. One-tailed: tests whether it's significantly LARGER (or SMALLER) than H₀, i.e. directional. Two-tailed p ≈ 2× one-tailed p. Use one-tailed only with a pre-registered directional hypothesis; otherwise default to two-tailed (more conservative).
What's a chi-square test for?
Categorical data. Two common uses: (1) Goodness-of-fit: do observed counts match expected? (rolling a die 60 times, expecting 10 of each face) (2) Independence: are two categorical variables related? (gender vs voting preference). The test statistic = Σ(observed − expected)² / expected. Always right-tailed (large values = bad fit / dependence).
What's an F-test for?
Three main uses: (1) ANOVA: comparing means across 3+ groups. F = between-group variance / within-group variance. (2) Comparing variances of two distributions. (3) Testing overall significance of a regression model. F is always positive and right-tailed; small F = little evidence of group differences; large F = significant group differences. Requires both df1 (numerator) and df2 (denominator).
Is p < 0.05 'statistically significant'?
By the conventional definition, yes, but the ASA Statement on P-Values (2016) explicitly warns against treating p < 0.05 as a 'discovery' or p ≥ 0.05 as 'no effect'. P-value combined with effect size, confidence intervals, prior plausibility, and replication is what determines real significance. A p of 0.049 and 0.051 are essentially identical evidence-wise.
What is p-hacking and how do I avoid it?
P-hacking = manipulating data analysis (selectively reporting tests, transforming variables, removing outliers, adjusting sample size) to get p < 0.05. To avoid: (1) pre-register your analysis plan before seeing the data, (2) report ALL tests run, not just significant ones, (3) report effect sizes + confidence intervals alongside p-values, (4) replicate findings in independent samples before concluding.
How accurate is this calculator's CDF math?
Normal CDF: Abramowitz & Stegun 26.2.17 polynomial approximation, absolute error below ~7.5e-8. T-CDF and F-CDF: regularized incomplete beta function via continued fractions (Numerical Recipes betacf), accurate to ~1e-10. Chi-square CDF: regularized incomplete gamma function via series + continued fractions, accurate to ~1e-10. Sufficient for any practical hypothesis test.
What are the assumptions of these tests?
Z-test: known σ OR n ≥ 30. T-test: roughly normal distribution + independent observations. Chi-square: expected counts ≥ 5 per cell (rule of thumb). F-test: normally-distributed residuals + equal variances. If assumptions are badly violated (heavy outliers, strong skew), the p-value's interpretation is unreliable; consider non-parametric alternatives (Wilcoxon, Mann-Whitney, Kruskal-Wallis).
Can I use this for non-parametric tests?
Some: the Mann-Whitney U test statistic can be converted to a z-score; Kruskal-Wallis converts to chi-square. For Wilcoxon signed-rank, the W statistic has its own table, and this calculator doesn't support it directly. For Bayesian alternatives (Bayes factors, posterior probabilities), use dedicated tools like R's `BayesFactor` package.
What's the relationship between p-value and confidence interval?
Direct. A 95% confidence interval that excludes the null hypothesis value corresponds to p < 0.05. A 99% CI that excludes the null value corresponds to p < 0.01. CIs are generally more informative than p-values because they show the effect size AND its uncertainty in a single interval. When publishing, report both.