P-Value Calculator — Z / T / Chi-Square / F Tests

Enter a test statistic and pick the test type (Z, T, Chi-square, or F). Get the p-value plus significance verdicts at α = 0.05 and α = 0.01. One- or two-tailed for Z and T; right-tailed for Chi-square and F. NIST-style numerical methods.

Reviewed by CalcBold Editorial · Sources: NIST SEMATECH §7 (Product and Process Comparisons) + ASA Statement on P-Values (2016) + Numerical Recipes · Last verified May 15, 2026 · Methodology

Frequently Asked Questions

The most common questions we get about this calculator — each answer is kept under 60 words so you can scan.

What is a p-value?
The probability of observing a test statistic at least as extreme as the one you computed, ASSUMING the null hypothesis is true. Small p (below your chosen α) = the observed data is unlikely under H₀ → reject H₀. Large p = the observed data is consistent with H₀ → fail to reject. The p-value is NOT the probability that H₀ is true; that is a common misinterpretation flagged by the ASA Statement on P-Values (2016).
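As a rough illustration of the definition above, here is a minimal Python sketch that turns a Z statistic into a two-tailed p-value. The function names are made up for this example, and it relies on the standard library's `math.erfc` rather than on this calculator's own approximation.

```python
import math


def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_tailed_p_from_z(z: float) -> float:
    """Probability, under H0, of a statistic at least as extreme as z in either direction."""
    return 2.0 * normal_upper_tail(abs(z))


if __name__ == "__main__":
    for z in (0.5, 1.96, 3.0):
        print(f"z = {z:.2f} -> two-tailed p = {two_tailed_p_from_z(z):.4f}")
```

A statistic of 1.96 lands almost exactly on p = 0.05, which is why that value shows up so often as a critical value.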
What significance level should I use?
Convention is α = 0.05 (5% false-positive rate) but this is just a convention. Stricter standards (α = 0.01) for medical/safety-critical decisions. Looser standards (α = 0.10) for exploratory work. The right α depends on the cost of false positives vs false negatives in your domain. The calculator shows results at both 0.05 and 0.01 for context.
When do I use a Z-test vs T-test?
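Z-test: the population σ is known, or the sample is large (n ≥ 30) so the normal approximation is adequate. T-test: σ is unknown and estimated from the sample, which matters most for small samples (n < 30). The T-test is more conservative for small samples; as n grows the t distribution approaches the normal, and by about n ≥ 50 the two give nearly identical p-values. If you're using the sample SD as an estimate of σ, the T-test is the correct choice regardless of sample size.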
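To see the two tests converge as the sample grows, the sketch below compares two-tailed p-values for the same statistic (2.0) at a few degrees of freedom. It integrates the t density with Simpson's rule purely for illustration. That is an assumption of this example, not the method the calculator uses, and the function names are hypothetical.

```python
import math


def t_pdf(t: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def t_two_tailed_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3.0  # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central


def z_two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    stat = 2.0
    print(f"Z-test:           p = {z_two_tailed_p(stat):.4f}")
    for df in (5, 30, 1000):
        print(f"T-test (df={df:>4}): p = {t_two_tailed_p(stat, df):.4f}")
```

The T-test p-value is always the larger of the two for the same statistic, and the gap shrinks as the degrees of freedom increase.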
What's the difference between one-tailed and two-tailed?
Two-tailed: tests whether the observed statistic is significantly DIFFERENT (in either direction) from H₀. One-tailed: tests whether it's significantly LARGER (or SMALLER) than H₀, i.e. a directional test. For the symmetric Z and T distributions, the two-tailed p is exactly 2× the smaller one-tailed p. Use one-tailed only with a pre-registered directional hypothesis; otherwise default to two-tailed (more conservative).
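The relationship between the three tail options can be sketched for a Z statistic as follows. This is a minimal example with a hypothetical function name, using `math.erfc` from the standard library.

```python
import math


def z_tail_p_values(z: float) -> dict:
    """Upper-tail, lower-tail and two-tailed p-values for a Z statistic."""
    upper = 0.5 * math.erfc(z / math.sqrt(2.0))   # P(Z >= z): H1 says "larger"
    lower = 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(Z <= z): H1 says "smaller"
    two_tailed = 2.0 * min(upper, lower)          # H1 says "different"
    return {"upper": upper, "lower": lower, "two_tailed": two_tailed}


if __name__ == "__main__":
    for name, p in z_tail_p_values(1.645).items():
        print(f"{name:>10}: {p:.4f}")
```

Note that picking the tail after looking at the data effectively halves the p-value, which is why the direction should be fixed in advance.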
What's a chi-square test for?
Categorical data. Two common uses: (1) Goodness-of-fit: do observed counts match expected? (rolling a die 60 times, expecting 10 of each face) (2) Independence: are two categorical variables related? (gender vs voting preference). The test statistic is χ² = Σ [(observed − expected)² / expected], summed over all categories or cells. Always right-tailed (large values = bad fit / dependence).
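Here is a small worked version of the die example. The observed counts are invented for illustration, and the p-value comes from a plain series expansion of the regularized incomplete gamma function, which is a simplification chosen for this sketch (it converges slowly for very large statistics) and not necessarily the calculator's exact routine.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def regularized_gamma_p(a: float, x: float, max_terms: int = 500) -> float:
    """Lower regularized incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    denom = a
    term = 1.0 / a
    total = term
    for _ in range(max_terms):
        denom += 1.0
        term *= x / denom
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi_square_p_value(statistic: float, df: int) -> float:
    """Right-tail p-value: P(chi-square with df degrees of freedom >= statistic)."""
    return 1.0 - regularized_gamma_p(df / 2.0, statistic / 2.0)


if __name__ == "__main__":
    observed = [8, 12, 9, 11, 6, 14]   # hypothetical counts from 60 rolls
    expected = [10] * 6                # fair die: 10 of each face
    stat = chi_square_statistic(observed, expected)
    df = len(observed) - 1
    print(f"chi-square = {stat:.2f}, df = {df}, p = {chi_square_p_value(stat, df):.3f}")
```

For a goodness-of-fit test the degrees of freedom are the number of categories minus one, so the six faces give df = 5.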
What's an F-test for?
Three main uses: (1) ANOVA: comparing means across 3+ groups. F = between-group variance / within-group variance. (2) Comparing variances of two distributions. (3) Testing overall significance of a regression model. F is never negative and the test is right-tailed; an F near 1 or below is consistent with no group differences, while a large F points to significant group differences. Requires both df1 (numerator) and df2 (denominator).
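For the ANOVA case, the F statistic and both degrees of freedom can be computed by hand before entering them into a calculator. The sketch below assumes a one-way layout; the data and the function name are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA over a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their own group mean
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in g)

    df1 = k - 1        # numerator degrees of freedom
    df2 = n_total - k  # denominator degrees of freedom
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, df1, df2


if __name__ == "__main__":
    samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data, three groups
    f_stat, df1, df2 = one_way_anova_f(samples)
    print(f"F = {f_stat:.2f}, df1 = {df1}, df2 = {df2}")
```

With three groups of three observations, df1 = 2 and df2 = 6, and those are the two values the F option asks for.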
Is p < 0.05 'statistically significant'?
By the conventional definition, yes, but the ASA Statement on P-Values (2016) explicitly warns against treating p < 0.05 as a 'discovery' or p ≥ 0.05 as 'no effect'. The p-value combined with effect size, confidence intervals, prior plausibility, and replication is what determines real significance. P-values of 0.049 and 0.051 are essentially identical as evidence.
What is p-hacking and how do I avoid it?
P-hacking = manipulating data analysis (selectively reporting tests, transforming variables, removing outliers, adjusting sample size) to get p < 0.05. To avoid: (1) pre-register your analysis plan before seeing the data, (2) report ALL tests run, not just significant ones, (3) report effect sizes + confidence intervals alongside p-values, (4) replicate findings in independent samples before concluding.
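One way to see why reporting only the significant tests is misleading: if every null hypothesis is true and the tests are independent (both are simplifying assumptions of this sketch), the chance of at least one p < α grows quickly with the number of tests run. The function name is hypothetical.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) across independent tests when every H0 is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20):
        rate = chance_of_any_false_positive(k)
        print(f"{k:>2} tests -> chance of at least one false positive = {rate:.1%}")
```

With 20 tests the chance of at least one spurious "significant" result is roughly 64%, so a single reported hit means little unless the other 19 are disclosed too.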
How accurate is this calculator's CDF math?
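Normal CDF: Abramowitz & Stegun 26.2.17 polynomial approximation, absolute error below 7.5e-8. T-CDF and F-CDF: regularized incomplete beta function via continued fractions (Numerical Recipes betacf), accurate to ~1e-10. Chi-square CDF: regularized incomplete gamma function via series + continued fractions, accurate to ~1e-10. Sufficient for practical hypothesis tests at conventional α levels; because the normal error bound is absolute, Z-test p-values far below 1e-7 should be read as "very small" rather than as exact figures.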
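For reference, the Abramowitz & Stegun 26.2.17 formula can be written in a few lines. The sketch below uses the published coefficients and checks the result against `math.erfc`; it is an independent illustration, not a copy of this calculator's source, and the function names are assumptions.

```python
import math

# Coefficients of Abramowitz & Stegun formula 26.2.17
_P = 0.2316419
_B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def normal_cdf_as(x: float) -> float:
    """Standard normal CDF via the A&S 26.2.17 polynomial approximation."""
    t = 1.0 / (1.0 + _P * abs(x))
    # Horner evaluation of b1*t + b2*t^2 + ... + b5*t^5
    poly = 0.0
    for coeff in reversed(_B):
        poly = t * (coeff + poly)
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    upper_tail = density * poly  # approximates P(Z >= |x|)
    return 1.0 - upper_tail if x >= 0 else upper_tail


def normal_cdf_exact(x: float) -> float:
    """Reference value from the standard library's complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


if __name__ == "__main__":
    grid = [i / 100 for i in range(-600, 601)]
    worst = max(abs(normal_cdf_as(x) - normal_cdf_exact(x)) for x in grid)
    print(f"largest absolute error on [-6, 6]: {worst:.2e}")
```

The error is absolute rather than relative, so it matters only when the p-value itself is of a similar size to the error bound.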
What are the assumptions of these tests?
Z-test: known σ OR n ≥ 30. T-test: roughly normal distribution + independent observations. Chi-square: expected counts ≥ 5 per cell (rule of thumb). F-test: normally-distributed residuals + equal variances. If assumptions are badly violated (heavy outliers, strong skew), the p-value's interpretation is unreliable; consider non-parametric alternatives (Wilcoxon, Mann-Whitney, Kruskal-Wallis).
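The chi-square rule of thumb is easy to check before running a test of independence. This sketch computes the expected count for each cell of a contingency table (row total × column total / grand total) and lists the cells that fall below 5; the table and the function names are invented for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cells_below_minimum(expected, minimum=5.0):
    """(row, column) indices of cells whose expected count is below the minimum."""
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, value in enumerate(row)
        if value < minimum
    ]


if __name__ == "__main__":
    observed = [[10, 5], [22, 3]]  # hypothetical 2x2 table
    expected = expected_counts(observed)
    print("expected counts:", expected)
    print("cells below 5:  ", cells_below_minimum(expected))
```

The check applies to the expected counts, not the observed ones, so a table can fail it even when every observed cell looks reasonably full.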
Can I use this for non-parametric tests?
Some. The Mann-Whitney U statistic can be converted to a z-score using its large-sample normal approximation, and the Kruskal-Wallis H statistic is approximately chi-square with (number of groups − 1) degrees of freedom. For Wilcoxon signed-rank, the W statistic has its own table, so this calculator doesn't support it directly. For Bayesian alternatives (Bayes factors, posterior probabilities), use dedicated tools like R's `BayesFactor` package.
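The Mann-Whitney conversion can be sketched as below. It uses the standard large-sample normal approximation with no tie or continuity correction, which is a simplifying assumption of this example; the function name and the input values are hypothetical.

```python
import math


def mann_whitney_u_to_z(u: float, n1: int, n2: int) -> float:
    """Convert a Mann-Whitney U statistic to a z-score (normal approximation, no tie correction)."""
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mean_u) / sd_u


if __name__ == "__main__":
    z = mann_whitney_u_to_z(u=20, n1=10, n2=10)
    print(f"z = {z:.4f}")  # enter this value with the Z option
```

The approximation is usually considered reasonable only when both samples are not too small; for very small samples an exact U table is the safer route.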
What's the relationship between p-value and confidence interval?
Direct. A 95% confidence interval that excludes the null hypothesis value corresponds to a two-tailed p < 0.05. A 99% CI that excludes the null value corresponds to p < 0.01. CIs are generally more informative than p-values because they show the effect size AND its uncertainty in a single interval. When publishing, report both.
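The correspondence can be checked numerically for a one-sample Z-test with known σ. This is a minimal sketch: the numbers are made up, the function name is hypothetical, and it uses `statistics.NormalDist` from the standard library (Python 3.8+).

```python
import math
from statistics import NormalDist


def z_test_with_ci(sample_mean, null_mean, sigma, n, confidence=0.95):
    """Return (z, two-tailed p, (ci_low, ci_high)) for a one-sample Z-test."""
    std_error = sigma / math.sqrt(n)
    z = (sample_mean - null_mean) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))            # two-tailed
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # e.g. about 1.96 for 95%
    margin = critical * std_error
    return z, p_value, (sample_mean - margin, sample_mean + margin)


if __name__ == "__main__":
    for level in (0.95, 0.99):
        z, p, (low, high) = z_test_with_ci(103.0, 100.0, 10.0, 50, level)
        excluded = not (low <= 100.0 <= high)
        print(f"{level:.0%} CI = ({low:.2f}, {high:.2f}), "
              f"excludes null: {excluded}, p = {p:.4f}")
```

In this example the 95% interval excludes the null value of 100 while the 99% interval contains it, matching a p-value that sits between 0.01 and 0.05.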
