Memoryless property: given that no event has occurred by time s, the remaining wait has the same distribution as a fresh start. The exponential is the unique continuous distribution with this property.
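The memoryless property above can be checked numerically from the exponential survival function P(X>t)=e^{-λt}. This is a minimal sketch; the function names and the values of rate, s, and t are illustrative assumptions, not part of the reference.

```python
import math

def exp_survival(t: float, rate: float) -> float:
    """P(X > t) for X ~ Exponential(rate), with t >= 0."""
    return math.exp(-rate * t)

def conditional_survival(s: float, t: float, rate: float) -> float:
    """P(X > s + t | X > s), straight from the definition of conditional probability."""
    return exp_survival(s + t, rate) / exp_survival(s, rate)

# Illustrative values (assumptions chosen for the example).
rate, s, t = 0.5, 3.0, 2.0

remaining_wait = conditional_survival(s, t, rate)  # wait beyond s, given survival to s
fresh_start = exp_survival(t, rate)                # wait from time zero
```

The two quantities agree up to floating-point rounding for any non-negative s and t.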
Subtracting the mean and dividing by the standard deviation transforms any normal into a standard normal. This is the key step for using Z-tables to compute normal probabilities.
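The standardization step can be sketched as follows, with the error function standing in for a Z-table lookup. The helper names and the parameters (mean 100, standard deviation 15, cutoff 130) are hypothetical, and `statistics.NormalDist` from the standard library is used only as a cross-check.

```python
import math
from statistics import NormalDist

def std_normal_cdf(z: float) -> float:
    """Phi(z), the standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2): standardize, then evaluate Phi."""
    z = (x - mu) / sigma  # subtract the mean, divide by the standard deviation
    return std_normal_cdf(z)

# Hypothetical parameters for illustration.
mu, sigma, x = 100.0, 15.0, 130.0

p_standardized = normal_cdf(x, mu, sigma)    # via Z = (X - mu) / sigma
p_reference = NormalDist(mu, sigma).cdf(x)   # library value for comparison
```

Here the cutoff sits exactly two standard deviations above the mean, so the Z-value passed to Phi is 2.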
The normal family is closed under sums of independent variables. Means add and variances add. This stability under addition underlies the central limit theorem.
Change-of-variables formula for Y=g(X) when g is monotone and differentiable. The derivative factor accounts for how g stretches or compresses regions of the input.
General method for finding the distribution of Y=g(X). Express the event {g(X)≤y} as an event in X, evaluate it using the distribution of X to obtain the CDF of Y, then differentiate to obtain the density f_Y if needed.
Special case of the change-of-variables formula for g(x)=ax+b. The factor 1/∣a∣ rescales the density when the linear map stretches or compresses the input axis.
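The CDF method and the linear change-of-variables formula can be compared on one example. The sketch below assumes X is exponential with rate 1 and takes Y=aX+b with a negative slope; all names and the values of a, b, and y are illustrative choices.

```python
import math

def pdf_x(x: float) -> float:
    """Density of X ~ Exponential(1)."""
    return math.exp(-x) if x >= 0 else 0.0

def cdf_x(x: float) -> float:
    """CDF of X ~ Exponential(1)."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def pdf_y_formula(y: float, a: float, b: float) -> float:
    """Density of Y = aX + b from the linear formula: f_X((y - b) / a) / |a|."""
    return pdf_x((y - b) / a) / abs(a)

def cdf_y(y: float, a: float, b: float) -> float:
    """CDF method: rewrite {aX + b <= y} as an event in X."""
    x = (y - b) / a
    if a > 0:
        return cdf_x(x)        # {X <= x}
    return 1.0 - cdf_x(x)      # a < 0 flips the inequality: {X >= x}

a, b, y = -2.0, 3.0, 1.0       # illustrative values
h = 1e-6
pdf_from_cdf = (cdf_y(y + h, a, b) - cdf_y(y - h, a, b)) / (2 * h)  # numerical derivative
pdf_from_formula = pdf_y_formula(y, a, b)
```

Differentiating the CDF numerically and applying the 1/∣a∣ formula give the same density value, which illustrates that the formula is just the CDF method carried out once in general.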
The moment generating function of X is the expectation of e^{tX} as a function of t, for those t where the expectation exists. It encodes the distribution: if two random variables have MGFs that are finite and agree in a neighborhood of zero, their distributions agree.
MGF Moments
M_X^{(k)}(0) = E[X^k]
The k-th derivative of the MGF at zero gives the k-th raw moment. Differentiating the Taylor series of E[e^{tX}] term by term and setting t=0 extracts the moments one at a time.
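The moment-extraction rule can be illustrated numerically. This sketch assumes an exponential random variable with rate 2, whose MGF is rate/(rate−t) for t<rate, and approximates the first two derivatives at zero with central finite differences; the function name, the rate, and the step size h are assumptions made for the example.

```python
def mgf_exponential(t: float, rate: float) -> float:
    """MGF of Exponential(rate): rate / (rate - t), finite only for t < rate."""
    if t >= rate:
        raise ValueError("MGF is finite only for t < rate")
    return rate / (rate - t)

rate = 2.0   # illustrative rate
h = 1e-3     # finite-difference step (an assumption; small but not tiny)

m_plus = mgf_exponential(h, rate)
m_zero = mgf_exponential(0.0, rate)
m_minus = mgf_exponential(-h, rate)

# Central differences approximate M'(0) and M''(0).
first_moment = (m_plus - m_minus) / (2 * h)
second_moment = (m_plus - 2 * m_zero + m_minus) / h**2
```

For this distribution the exact values are E[X]=1/rate and E[X^2]=2/rate^2, so both approximations should land close to 0.5.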
The MGF of a sum of independent random variables factors into the product of the individual MGFs. This converts convolution of distributions into multiplication of functions, simplifying many sum-of-independent-variables calculations.
The cumulative distribution function tracks how much probability has accumulated up to and including the value x. Defined for any random variable, discrete or continuous.
The CDF is continuous from the right. At any point of discontinuity (a value with positive probability mass), the function jumps up and is evaluated at the top of the jump.
For a discrete random variable, the CDF at x is the sum of probability masses at all values up to and including x. The result is a step function with jumps at each value in the support.
The probability that a continuous random variable falls in [a,b] equals the area under the PDF over that interval. For continuous distributions the endpoints contribute zero probability, so ≤ and < are interchangeable.
The probability that X lands in the half-open interval (a,b] is the difference of CDF values at the endpoints. Works for any random variable, discrete or continuous.
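The discrete CDF and the half-open interval rule can be sketched with exact fractions. The fair six-sided die and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative example).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x, pmf):
    """F(x) = P(X <= x): sum the masses at all support points up to and including x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

def prob_half_open(a, b, pmf):
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b, pmf) - cdf(a, pmf)

p_interval = prob_half_open(2, 5, pmf)   # values 3, 4, 5
f_between_jumps = cdf(3.5, pmf)          # the CDF is flat between support points
```

Because the left endpoint is excluded, the mass at a is subtracted out; for a discrete variable this matters, whereas for a continuous one it does not.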
The indicator of an event A is a random variable that equals one when A occurs and zero otherwise. It converts qualitative event-occurrence into a numerical quantity that can be summed and averaged.
The expected value of an indicator equals the probability of the event it indicates. This identity is the bridge between counting events and computing probabilities.
The indicator of an intersection equals the product of indicators. Both factors must be one for the product to be one, matching the requirement that both events occur.
For a discrete random variable, the expected value is the sum of every possible value weighted by its probability mass. This is the long-run average over many independent realizations.
For a continuous random variable, the expected value is the integral of x weighted by the density. The density takes the role of the probability mass in the discrete formula.
Expectation distributes over linear combinations of random variables. Critically, this holds whether or not X and Y are independent — one of the most useful properties in probability.
Law of the Unconscious Statistician for discrete random variables. To find the expectation of a function g(X), weight values of g(x) by the PMF — no need to first derive the distribution of g(X).
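The discrete expectation formula, LOTUS, and linearity can be checked together on a small example. The fair die and the function name `expectation` are assumptions chosen for illustration.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative example).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] computed directly from the PMF of X, without deriving the PMF of g(X)."""
    return sum((g(x) * p for x, p in pmf.items()), Fraction(0))

mean = expectation(lambda x: x, pmf)                # E[X]
second_moment = expectation(lambda x: x * x, pmf)   # E[X^2]
linear = expectation(lambda x: 2 * x + 3, pmf)      # E[2X + 3]
```

The last value equals 2·E[X]+3, which is linearity of expectation in its simplest form.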
The variance is the expected squared deviation from the mean μ=E[X]. It measures the spread of the distribution; small variance means values cluster tightly around the mean, large variance means they scatter widely.
Adding a constant shifts the distribution but does not change its spread. Multiplying by a stretches the distribution and scales the variance by a^2; the square appears because variance is measured in squared units.
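The scaling rule for variance can be verified exactly on a small PMF. The die example, the helper names, and the constants a=3 and b=5 are illustrative assumptions.

```python
from fractions import Fraction

def variance(pmf):
    """Var(X) = E[(X - mu)^2], computed from a PMF given as {value: probability}."""
    mean = sum((x * p for x, p in pmf.items()), Fraction(0))
    return sum(((x - mean) ** 2 * p for x, p in pmf.items()), Fraction(0))

def affine_pmf(pmf, a, b):
    """PMF of aX + b for a != 0; the map is one-to-one, so each mass carries over."""
    return {a * x + b: p for x, p in pmf.items()}

die = {k: Fraction(1, 6) for k in range(1, 7)}
a, b = 3, 5  # illustrative constants

var_x = variance(die)
var_y = variance(affine_pmf(die, a, b))  # variance of 3X + 5
```

The shift b drops out entirely, and the stretch a enters squared.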
Variances add for independent random variables. Unlike linearity of expectation, this does not hold in general: when X and Y are correlated, the term 2Cov(X,Y) must be added.
Covariance measures how two random variables move together. Positive when both tend to be above (or below) their means simultaneously; negative when one tends to be above when the other is below.
The correlation coefficient is covariance normalized by the product of standard deviations. The result is a unitless measure of linear association on the scale [−1,1].
Correlation is bounded between −1 and 1. Equality ∣ρ_{XY}∣=1 holds if and only if Y is, with probability one, a linear function of X with non-zero slope. The bound is a consequence of the Cauchy-Schwarz inequality applied to centered random variables.
The conditional expectation of X given Y=y is the expected value computed under the conditional distribution. It updates the unconditional expectation by incorporating the information that Y took the specific value y.
Also known as the tower property. Averaging the conditional expectations over the distribution of Y recovers the unconditional expectation. Conditioning, then averaging out, returns to the original.
Total variance decomposes into within-group variance plus between-group variance. The expected conditional variance captures average residual spread after conditioning; the variance of the conditional mean captures how much E[X∣Y] itself varies as Y varies.
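The tower property and the variance decomposition can both be confirmed on a small joint distribution. The joint PMF below is made up purely for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 8),
    (3, 1): Fraction(3, 8),
}

def marginal_y(joint):
    """P(Y = y), obtained by summing the joint PMF over x."""
    out = {}
    for (_, y), p in joint.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out

def cond_moment(joint, y, k):
    """E[X^k | Y = y], computed under the conditional distribution of X given Y = y."""
    p_y = marginal_y(joint)[y]
    return sum((x ** k * p for (x, yy), p in joint.items() if yy == y), Fraction(0)) / p_y

p_y = marginal_y(joint)
cond_mean = {y: cond_moment(joint, y, 1) for y in p_y}
cond_var = {y: cond_moment(joint, y, 2) - cond_mean[y] ** 2 for y in p_y}

mean_x = sum((x * p for (x, _), p in joint.items()), Fraction(0))
var_x = sum((x ** 2 * p for (x, _), p in joint.items()), Fraction(0)) - mean_x ** 2

tower = sum((cond_mean[y] * p_y[y] for y in p_y), Fraction(0))                    # E[E[X|Y]]
within = sum((cond_var[y] * p_y[y] for y in p_y), Fraction(0))                    # E[Var(X|Y)]
between = sum(((cond_mean[y] - mean_x) ** 2 * p_y[y] for y in p_y), Fraction(0))  # Var(E[X|Y])
```

Exact fractions make the identities hold with equality rather than up to rounding.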
The k-th moment of X about the origin. The first moment is the mean. Higher raw moments encode information about the shape and tail behavior of the distribution.
The k-th central moment measures deviation from the mean. The second central moment is the variance; the third, divided by σ^3, gives the skewness; the fourth, divided by σ^4, gives the kurtosis (tail weight).
The first Kolmogorov axiom: every event is assigned a non-negative probability. This rules out negative likelihoods and is one of three building blocks for any probability measure.
The second Kolmogorov axiom: the entire sample space has probability one. Something in the sample space must occur, so the total probability mass equals certainty.
The third Kolmogorov axiom: for a countable sequence of pairwise disjoint events, the probability of the union is the sum of the individual probabilities. For finitely many disjoint events the same identity holds with a finite sum.
Every event has a probability between zero and one. The lower bound comes directly from non-negativity; the upper bound follows because A⊆Ω together with monotonicity and normalization forces P(A)≤P(Ω)=1.
The impossible event has zero probability. This is a direct consequence of the additivity axiom applied to a disjoint decomposition that includes the empty set.
The probability that an event does not occur equals one minus the probability that it does. Often the easiest path to a probability is through its complement.
The probability that at least one of two events occurs equals the sum of their individual probabilities minus the probability that both occur. Subtracting the intersection prevents double-counting outcomes that lie in both events.
The probability of a union of events is the sum of single-event probabilities, minus the pairwise intersections, plus the triple intersections, and so on with alternating signs. Generalizes the addition rule to any number of events.
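Inclusion-exclusion can be sketched on a finite, equally likely sample space, where every probability reduces to counting. The sample space {1,…,12} and the three events are arbitrary choices made for the example.

```python
from fractions import Fraction
from itertools import combinations

omega = set(range(1, 13))  # equally likely outcomes 1..12 (illustrative)
events = [
    {n for n in omega if n % 2 == 0},  # even
    {n for n in omega if n % 3 == 0},  # multiple of 3
    {n for n in omega if n > 8},       # greater than 8
]

def prob(event):
    """P(event) under equally likely outcomes: favorable count over total count."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events):
    """Alternating sum over all non-empty sub-collections of the events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1  # plus for odd-sized groups, minus for even
        for group in combinations(events, r):
            total += sign * prob(set.intersection(*group))
    return total

p_union_direct = prob(set.union(*events))
p_union_formula = inclusion_exclusion(events)
p_sum_bound = sum((prob(e) for e in events), Fraction(0))  # plain sum of probabilities
```

The plain sum of the three probabilities overshoots the union probability here because the events overlap.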
Also known as the union bound. The probability that at least one of several events occurs is at most the sum of their individual probabilities. Equality holds when the events are pairwise disjoint or, more generally, when every pairwise intersection has probability zero.
When the sample space is finite and every outcome is equally likely, the probability of an event reduces to counting: number of favorable outcomes divided by total outcomes.
The probability of A given that B has occurred, defined when P(B)>0. Conditioning rescales the original probability measure to live entirely within B, with the joint probability P(A∩B) as the numerator and P(B) as the normalizing denominator.
Algebraic rearrangement of the conditional probability definition. The probability that two events both occur is the probability of one times the conditional probability of the other given the first.
Generalizes the multiplication rule to any finite number of events. The joint probability factors into a product of conditional probabilities, each conditioned on the events preceding it in the chain.
Two events are independent precisely when their joint probability factors into the product of their individual probabilities. Knowing that one occurred provides no information about the other.
Two events are conditionally independent given a third when the factorization of independence holds inside the conditional probability with respect to that third event. Conditional independence neither implies nor is implied by unconditional independence.
When the sample space is partitioned into mutually exclusive and exhaustive cases A_1,…,A_n, the unconditional probability of any event B is the weighted average of its conditional probabilities, with the weights being the probabilities of the cases.
Inverts the direction of conditioning. Given the conditional probability of B given A, plus the prior probabilities of A and B, recover the conditional probability of A given B. Foundational for updating beliefs from evidence.
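Bayes' rule, with the law of total probability supplying the denominator, can be sketched on a screening-test scenario. The prior, sensitivity, and false-positive rate below are made-up numbers for illustration only.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions, not data).
prior = Fraction(1, 100)                # P(A): condition present
sensitivity = Fraction(95, 100)         # P(B | A): positive test given condition
false_positive_rate = Fraction(5, 100)  # P(B | not A)

# Law of total probability over the partition {A, not A}.
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B).
posterior = sensitivity * prior / p_positive
```

With a rare condition, the posterior stays well below the sensitivity because false positives from the large unaffected group dominate the positive results.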
The probability mass function of a Bernoulli random variable. The trial yields success (k=1) with probability p and failure (k=0) with probability 1−p.
Probability of exactly k successes in n independent Bernoulli trials with success probability p. The binomial coefficient (n choose k) counts the arrangements of k successes among n trials.
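The binomial PMF can be written directly with `math.comb`. The parameters n=10 and p=1/2 are illustrative, and exact fractions are used so that the checks hold without rounding.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): (n choose k) arrangements, each with probability p^k (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, Fraction(1, 2)  # illustrative parameters

probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(probs)                               # should be 1 over the full support
mean = sum(k * q for k, q in enumerate(probs))   # should equal n * p
```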
Memoryless property: given that the first success has not yet occurred after m trials, the remaining wait has the same distribution as a fresh start. The geometric is the unique discrete distribution with this property.
Probability that the r-th success occurs on the k-th trial. The k-th trial is the r-th success, so among the first k−1 trials there must be exactly r−1 successes.
Expected number of trials to achieve r successes. Equals r times the mean 1/p of a single geometric waiting time, since the negative binomial is a sum of r independent geometric waiting times.
Probability of drawing exactly k successes in n draws without replacement from a population of N containing K successes. Choose k successes from K available and n−k failures from N−K available.
The first three factors give the binomial variance with p=K/N. The fourth factor (N−n)/(N−1) is the finite population correction, accounting for sampling without replacement.
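The hypergeometric PMF and its variance formula can be cross-checked by computing the variance directly from the PMF. The population values N=20, K=7, and n=5 are arbitrary, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): choose k of the K successes and n - k of the N - K failures."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, K, n = 20, 7, 5  # illustrative population size, successes, draws

support = range(max(0, n - (N - K)), min(n, K) + 1)
pmf = {k: hypergeom_pmf(k, N, K, n) for k in support}

mean = sum((k * q for k, q in pmf.items()), Fraction(0))
var_from_pmf = sum(((k - mean) ** 2 * q for k, q in pmf.items()), Fraction(0))

p = Fraction(K, N)
fpc = Fraction(N - n, N - 1)               # finite population correction
var_from_formula = n * p * (1 - p) * fpc   # binomial variance times the correction
```

The correction factor is below one whenever more than one item is drawn, so sampling without replacement gives a smaller variance than the binomial with the same p.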
Variance equals the mean for the Poisson — a distinguishing feature. If sample variance differs significantly from the sample mean, the Poisson model is suspect.
Independent Poissons sum to a Poisson with rate equal to the sum of rates. Combining two independent event streams produces a Poisson stream at the combined rate.
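The additivity of independent Poissons can be checked by convolving two PMFs and comparing against a single Poisson at the combined rate. The rates 1.5 and 2.5 and the function names are assumptions for the example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def convolved_pmf(n: int, lam1: float, lam2: float) -> float:
    """P(X + Y = n) for independent X ~ Poisson(lam1), Y ~ Poisson(lam2)."""
    return sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) for k in range(n + 1))

lam1, lam2 = 1.5, 2.5  # illustrative rates

# Pairs of (convolution value, Poisson(lam1 + lam2) value) for n = 0..9.
comparison = [
    (convolved_pmf(n, lam1, lam2), poisson_pmf(n, lam1 + lam2))
    for n in range(10)
]
```

Each pair agrees up to floating-point rounding, which is the convolution identity behind the statement.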
Each of the b−a+1 integer values in {a,a+1,…,b} has equal probability. The discrete uniform models any finite equally-likely-outcomes scenario such as rolling a fair die.
Variance grows with the square of the range size. The form parallels the continuous uniform variance with the discrete count (b−a+1) replacing the continuous length (b−a).
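The discrete uniform variance ((b−a+1)^2−1)/12 can be confirmed against a direct computation from the PMF. The endpoints a=3 and b=10 are arbitrary illustrative values.

```python
from fractions import Fraction

a, b = 3, 10               # illustrative endpoints
count = b - a + 1          # number of equally likely integer values

pmf = {k: Fraction(1, count) for k in range(a, b + 1)}

mean = sum((k * p for k, p in pmf.items()), Fraction(0))
var_from_pmf = sum(((k - mean) ** 2 * p for k, p in pmf.items()), Fraction(0))

var_from_formula = Fraction(count ** 2 - 1, 12)
```

Replacing count by the interval length b−a and dropping the −1 gives the continuous uniform variance (b−a)^2/12, which is the parallel the text describes.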