Probability Symbols



Contents: Probability and Statistics · Probability Notations · Random Variables and Distributions · Common Distributions · Statistical Measures · Hypothesis Testing · Information Theory · Moment Generating Functions · Probability Inequalities · Bayesian Statistics · Regression and Correlation

Each entry below gives the symbol, its LaTeX code, and an explanation.

Probability and Statistics
P(A ∩ B)
P(A \cap B)
The joint probability of events A and B both occurring. For independent events, P(A ∩ B) = P(A) · P(B).
P(A ∪ B)
P(A \cup B)
Probability of A or B (or both) occurring. Calculated as P(A) + P(B) − P(A ∩ B) by the addition rule.
P(A | B)
P(A \mid B)
Conditional probability of A given that B has occurred. Defined as P(A ∩ B) / P(B) when P(B) > 0.
E(X)
E(X)
The expected value (mean) of random variable X — the long-run average value over many trials.
Var(X)
\text{Var}(X)
The variance of X, measuring how spread out values are from the mean. Equals E[(X − μ)²].
Cov(X, Y)
\text{Cov}(X, Y)
The covariance of X and Y, measuring how two variables change together. Positive covariance means they tend to move in the same direction.
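The identities above are easy to sanity-check by simulation. A minimal Python sketch (the event probabilities 0.3 and 0.5 are arbitrary illustrative choices) that compares empirical frequencies against the product and addition rules:

```python
import random

random.seed(0)
n = 100_000
p_a, p_b = 0.3, 0.5        # assumed probabilities of two independent events

both = either = 0
for _ in range(n):
    a = random.random() < p_a
    b = random.random() < p_b
    both += a and b        # A ∩ B occurred
    either += a or b       # A ∪ B occurred

print(both / n, p_a * p_b)                # P(A ∩ B) ≈ P(A) · P(B) for independent events
print(either / n, p_a + p_b - p_a * p_b)  # P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
```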
Probability Notations

P(A)
P(A)
Probability of event A occurring. Always between 0 and 1, where 0 means impossible and 1 means certain.
P(¬A)
P(\neg A)
Probability of A not occurring (complement). Equals 1 − P(A) by the complement rule.
P(A ∩ B)
P(A \cap B)
Joint probability of A and B. For independent events: P(A ∩ B) = P(A) · P(B).
P(A ∪ B)
P(A \cup B)
Probability of A or B. Use Venn diagrams to visualize the union of events.
P(A | B)
P(A \mid B)
Conditional probability — the probability of A when we know B happened. Central to Bayes' theorem.
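On a small finite sample space these definitions can be computed exactly. A short sketch using a fair six-sided die (the two events are arbitrary choices for illustration):

```python
from fractions import Fraction

omega = set(range(1, 7))                 # sample space: a fair six-sided die
A = {x for x in omega if x % 2 == 0}     # event A: the roll is even
B = {x for x in omega if x > 3}          # event B: the roll exceeds 3

def P(E):
    return Fraction(len(E), len(omega))  # equally likely outcomes

print(P(A))              # P(A) = 1/2
print(1 - P(A))          # P(¬A) = 1 − P(A) = 1/2 (complement rule)
print(P(A & B))          # P(A ∩ B) = 1/3
print(P(A | B))          # P(A ∪ B) = 1/2 + 1/2 − 1/3 = 2/3
print(P(A & B) / P(B))   # P(A | B) = P(A ∩ B) / P(B) = 2/3
```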
Random Variables and Distributions

X
X
A random variable — a function that assigns numerical values to outcomes in a sample space.
f_X(x)
f_X(x)
The probability mass function of a discrete X, giving P(X = x), or the probability density function of a continuous X, which integrates to give probabilities.
F_X(x)
F_X(x)
The cumulative distribution function (CDF). Gives P(X ≤ x) and is non-decreasing from 0 to 1.
μ
\mu
Population mean — the expected value of a distribution. The center of mass of the probability distribution.
σ²
\sigma^2
Population variance — measures the average squared deviation from the mean.
σ
\sigma
Standard deviation — the square root of variance. Same units as the original data, easier to interpret than variance.
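For a concrete discrete example, here is a sketch computing f_X, F_X, μ, σ², and σ for X = the sum of two fair dice (an illustrative choice):

```python
from collections import Counter
from itertools import product
import math

# Distribution of X = sum of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in counts.items()}           # f_X(x) = P(X = x)

def cdf(x):                                            # F_X(x) = P(X ≤ x)
    return sum(p for v, p in pmf.items() if v <= x)

mu = sum(x * p for x, p in pmf.items())                # μ = E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # σ² = E[(X − μ)²]
sigma = math.sqrt(var)                                 # σ

print(mu, var, sigma)    # 7.0, 5.833..., 2.415...
print(cdf(7))            # P(X ≤ 7) = 7/12
```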
Common Distributions

Bin(n, p)
\text{Bin}(n, p)
Binomial distribution — models the number of successes in n independent trials, each with success probability p.
Poisson(λ)
\text{Poisson}(\lambda)
Poisson distribution — models the number of events occurring in a fixed interval at average rate λ. Mean and variance both equal λ.
N(μ, σ²)
\mathcal{N}(\mu, \sigma^2)
Normal distribution — the bell curve. Central to statistics due to the central limit theorem.
Exp(λ)
\text{Exp}(\lambda)
Exponential distribution — models waiting times between Poisson events. It is memoryless: the remaining wait does not depend on how long you have already waited.
U(a, b)
\text{U}(a, b)
Uniform distribution — all values in [a, b] equally likely. Mean is (a + b)/2.
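A quick way to get a feel for these distributions is to sample them and check the stated means and variances. A sketch assuming NumPy is available (note NumPy's exponential sampler is parameterized by the scale 1/λ, not the rate λ):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

samples = {
    "Bin(10, 0.3)": rng.binomial(10, 0.3, n),   # mean np = 3, var np(1−p) = 2.1
    "Poisson(4)":   rng.poisson(4, n),          # mean = var = λ = 4
    "N(0, 1)":      rng.normal(0, 1, n),        # mean 0, var 1
    "Exp(2)":       rng.exponential(1 / 2, n),  # scale = 1/λ, so mean 0.5
    "U(0, 1)":      rng.uniform(0, 1, n),       # mean (a + b)/2 = 0.5
}
for name, x in samples.items():
    print(f"{name}: mean={x.mean():.3f}, var={x.var():.3f}")
```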
Statistical Measures

E(X)
E(X)
Expected value — weighted average of all possible values, weighted by their probabilities.
Var(X)
\text{Var}(X)
Variance — E[(X − μ)²] or equivalently E(X²) − [E(X)]². Measures dispersion.
SD(X)
\text{SD}(X)
Standard deviation — √Var(X). In the same units as X, unlike variance which is in squared units.
Cov(X, Y)
\text{Cov}(X, Y)
Covariance — E(XY) − E(X)E(Y). Zero for independent variables, but zero covariance doesn't imply independence.
Corr(X, Y)
\text{Corr}(X, Y)
Correlation coefficient — Cov(X,Y)/(σ_X · σ_Y). Standardized to range from −1 to 1.
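These estimators are one-liners in NumPy. A sketch (assuming NumPy; the linear relationship between x and y is constructed so Cov and Corr have known targets):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100_000)
y = 2 * x + rng.normal(0, 1, 100_000)   # Var(Y) = 4 + 1 = 5

print(x.mean())                 # E(X) ≈ 0
print(x.var(), x.std())         # Var(X) ≈ 1, SD(X) ≈ 1
print(np.cov(x, y)[0, 1])       # Cov(X, Y) = 2 · Var(X) ≈ 2
print(np.corrcoef(x, y)[0, 1])  # Corr(X, Y) = 2/√5 ≈ 0.894
```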
Hypothesis Testing

H₀
H_0
Null hypothesis — the default assumption to be tested. Typically states "no effect" or "no difference".
H₁
H_1
Alternative hypothesis — the claim favored when H₀ is rejected. Can be one-sided or two-sided.
α
\alpha
Significance level — the probability of rejecting H₀ when it's actually true (Type I error). Common values: 0.05, 0.01.
p-value
\text{p-value}
Probability of observing data at least as extreme as what was observed, assuming H₀ is true. Reject H₀ if p-value < α.
z
z
Z-score — number of standard deviations from the mean. Uses the normal distribution for large samples.
t
t
T-statistic — like a z-score, but accounts for the extra uncertainty from estimating the standard deviation. Follows the t-distribution, which matters most for small samples.
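Putting these pieces together, a one-sample t-test takes a few lines with SciPy (assuming SciPy is installed; the simulated data deliberately violate H₀, so the test usually rejects):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(0.2, 1.0, 50)    # true mean is 0.2, so H0 below is false

# H0: μ = 0 versus two-sided H1: μ ≠ 0, at significance level α = 0.05
t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
alpha = 0.05
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```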
Information Theory

H(X)
H(X)
Entropy — measures uncertainty or information content of X. Higher entropy means more unpredictable outcomes.
I(X; Y)
I(X; Y)
Mutual information — measures how much knowing X reduces uncertainty about Y. Zero if X and Y are independent.
D(P || Q)
D(P \| Q)
Kullback–Leibler divergence — measures how distribution P differs from reference distribution Q. Not symmetric.
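All three quantities reduce to sums over a finite distribution. A self-contained sketch (the 2×2 joint distribution is an illustrative choice) that computes H(X) and obtains I(X; Y) as the KL divergence between the joint and the product of marginals:

```python
import math

def H(p):                # entropy in bits
    return -sum(q * math.log2(q) for q in p if q > 0)

def kl(p, q):            # D(P || Q); assumes q > 0 wherever p > 0
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# Joint distribution of (X, Y) on a 2×2 grid, row-major order.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]            # marginal of X: (0.5, 0.5)
py = [sum(col) for col in zip(*joint)]      # marginal of Y: (0.5, 0.5)

flat_joint = [p for row in joint for p in row]
flat_prod = [a * b for a in px for b in py]  # same row-major order

print(H(px))                      # H(X) = 1 bit
print(kl(flat_joint, flat_prod))  # I(X; Y) ≈ 0.278 bits
```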
Moment Generating Functions

M_X(t)
M_X(t)
Moment generating function of X — encodes all moments of the distribution. Useful for proving distribution properties.
M_X(t) = E(e^(tX))
M_X(t) = E(e^{tX})
Definition of MGF — the expected value of e^(tX). Exists when this expectation is finite near t = 0.
M'(0) = E(X)
M'(0) = E(X)
First derivative of MGF at t = 0 gives the mean. This is why it's called "moment generating".
M''(0) = E(X²)
M''(0) = E(X^2)
Second derivative at t = 0 gives E(X²). Combined with M'(0), we can find variance: Var(X) = E(X²) − [E(X)]².
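These derivative identities can be verified symbolically. A SymPy sketch (assuming SymPy is installed) for an Exp(λ) variable with the illustrative rate λ = 3, whose MGF is λ/(λ − t) for t < λ:

```python
import sympy as sp

t, x = sp.symbols("t x")
lam = sp.Integer(3)                        # rate of an Exp(λ) variable, λ = 3

# M_X(t) = E(e^{tX}) = ∫ e^{tx} · λ e^{−λx} dx over [0, ∞), valid for t < λ
pdf = lam * sp.exp(-lam * x)
M = sp.integrate(sp.exp(t * x) * pdf, (x, 0, sp.oo), conds="none")
M = sp.simplify(M)                         # λ/(λ − t)

m1 = sp.diff(M, t).subs(t, 0)              # M'(0) = E(X) = 1/λ
m2 = sp.diff(M, t, 2).subs(t, 0)           # M''(0) = E(X²) = 2/λ²
print(m1, m2, m2 - m1**2)                  # mean 1/3, E(X²) 2/9, Var(X) 1/9
```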
Probability Inequalities

P(X ≥ a) ≤ E(X)/a
P(X \geq a) \leq \frac{E(X)}{a}
Markov's inequality — bounds tail probability using only the mean. Requires X ≥ 0 and a > 0.
P(|X − μ| ≥ kσ) ≤ 1/k²
P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}
Chebyshev's inequality — for any distribution with finite variance, at most a fraction 1/k² of the probability mass lies k or more standard deviations from the mean.
P(Sₙ/n − μ ≥ ε) ≤ e^(−2nε²/(b−a)²)
P\left(\frac{S_n}{n} - \mu \geq \epsilon\right) \leq e^{-\frac{2n\epsilon^2}{(b-a)^2}}
Hoeffding's inequality — for independent Xᵢ bounded in [a, b], deviations of the sample mean decay exponentially in n. Much tighter than Chebyshev for large n.
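Markov and Chebyshev are simple to check empirically; Hoeffding additionally requires bounded variables, so the sketch below (assuming NumPy) checks only the first two, using Exp(1) samples, which are nonnegative with mean and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, 1_000_000)   # X ≥ 0 with E(X) = 1, SD(X) = 1

a = 3.0
print((x >= a).mean(), x.mean() / a)  # true tail ≈ e⁻³ ≈ 0.050 vs Markov bound 1/3

k = 3.0
tail = (np.abs(x - x.mean()) >= k * x.std()).mean()
print(tail, 1 / k**2)                 # true tail vs Chebyshev bound 1/9; both hold, both loose
```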
Bayesian Statistics

P(A | B)
P(A \mid B)
Posterior probability — our updated belief in A after observing evidence B. The output of Bayes' theorem.
P(A | B) = P(B | A)P(A) / P(B)
P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}
Bayes' theorem — relates posterior to prior P(A) and likelihood P(B|A). Foundation of Bayesian inference.
P(A, B)
P(A, B)
Joint probability of A and B — same as P(A ∩ B). The probability both events occur together.
P(A ∩ B) = P(A)P(B | A)
P(A \cap B) = P(A) P(B \mid A)
Multiplication rule — expresses joint probability in terms of conditional probability. Basis for tree diagrams.
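The classic diagnostic-test calculation shows Bayes' theorem in action. All the numbers below are illustrative assumptions:

```python
p_disease = 0.01            # prior P(A): base rate of the disease
p_pos_given_disease = 0.95  # likelihood P(B | A): test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate P(B | ¬A)

# Total probability: P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)   # ≈ 0.161: even after a positive test, disease is still unlikely
```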
Regression and Correlation

Y = β₀ + β₁X + ε
Y = \beta_0 + \beta_1 X + \epsilon
Simple linear regression model — β₀ is intercept, β₁ is slope, ε is random error with E(ε) = 0.
R²
R^2
Coefficient of determination — proportion of the variance in Y explained by the model. Ranges from 0 to 1; in simple linear regression it equals the squared Pearson correlation.
ρ(X, Y)
\rho(X, Y)
Pearson correlation — measures linear relationship strength. Equals ±1 for perfect linear relationship, 0 for no linear correlation.
Cov(X, Y) = E(XY) − E(X)E(Y)
\text{Cov}(X, Y) = E(XY) - E(X)E(Y)
Covariance formula — alternative computation using expected values. Useful for theoretical derivations.
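A short NumPy sketch tying the regression entries together (assuming NumPy; the coefficients β₀ = 1.5, β₁ = 2 and the noise level are illustrative). It fits the model by least squares and confirms that R² equals the squared Pearson correlation in simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(0, 2, 200)   # Y = β₀ + β₁X + ε

# Least-squares estimates via a degree-1 polynomial fit
b1, b0 = np.polyfit(x, y, 1)                # returns slope first, then intercept

y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1 - ss_res / ss_tot                    # R²: fraction of variance explained

print(b0, b1, r2)                           # ≈ 1.5, ≈ 2.0, R² close to 1
print(np.corrcoef(x, y)[0, 1] ** 2)         # equals R² in simple linear regression
```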