Cumulative distribution functions
PMF paradox
Oversampled data and selection bias
Percentile rank
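def PercentileRank(scores, your_score):
    # Count the scores that are less than or equal to your_score.
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    # Express that count as a percentage of all scores.
    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank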
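A small worked example may help here. The scores below are made up for illustration, and percentile_rank_of is a hypothetical compact equivalent of the function above, not a replacement for it.

```python
def percentile_rank_of(scores, your_score):
    """Percentage of scores that are less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100.0 * count / len(scores)

# Illustrative scores: four of the five are <= 88.
scores = [55, 66, 77, 88, 99]
print(percentile_rank_of(scores, 88))  # 80.0
```

Four of the five scores are at or below 88, so the percentile rank is 100 * 4 / 5 = 80.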
Percentiles
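def Percentile(scores, percentile_rank):
    # Iterate over a sorted copy so the caller's list is not reordered.
    for score in sorted(scores):
        # Return the smallest score whose rank reaches the target.
        if PercentileRank(scores, score) >= percentile_rank:
            return score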
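The loop above recomputes a percentile rank for every score, so its cost grows quadratically with the number of scores. One possible alternative, sketched here under the assumption that rounding down to the nearest position is acceptable, is to use the rank to index into the sorted scores directly. The name percentile_by_index is hypothetical, and for some ranks the result can differ from the loop version.

```python
def percentile_by_index(scores, percentile_rank):
    """Hypothetical faster variant: index into the sorted scores directly."""
    ordered = sorted(scores)
    # Map a rank in 0..100 onto a position in 0..len-1, rounding down.
    index = int(percentile_rank * (len(ordered) - 1) // 100)
    return ordered[index]

scores = [55, 66, 77, 88, 99]
print(percentile_by_index(scores, 50))  # 77
```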
Cumulative distribution functions
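def Cdf(t, x):
    # Count the values in t that are less than or equal to x.
    count = 0.0
    for value in t:
        if value <= x:
            count += 1.0
    # The CDF at x is the fraction of values at or below x.
    prob = count / len(t)
    return prob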
Prob(x): Given a value x, computes the probability p = CDF(x).
Value(p): Given a probability p, computes the corresponding value, x; that is, the inverse CDF of p.
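To make the two operations concrete, here is a minimal sketch of an object that offers both. EmpiricalCdf is a hypothetical name and this is not the original Cdf class; it assumes the CDF is built from a non-empty sample and that Value should return the smallest sample value whose cumulative probability reaches p.

```python
import bisect
import math

class EmpiricalCdf:
    """Hypothetical empirical CDF built from a non-empty sample."""

    def __init__(self, sample):
        self.xs = sorted(sample)

    def Prob(self, x):
        # bisect_right gives the number of sorted values that are <= x.
        return bisect.bisect_right(self.xs, x) / len(self.xs)

    def Value(self, p):
        # Inverse CDF: smallest sample value whose probability reaches p.
        if not 0 <= p <= 1:
            raise ValueError("p must be between 0 and 1")
        index = max(0, math.ceil(p * len(self.xs)) - 1)
        return self.xs[index]

cdf = EmpiricalCdf([2, 1, 3, 2, 5])
print(cdf.Prob(2))     # 0.6
print(cdf.Value(0.5))  # 2
```

In the sample [2, 1, 3, 2, 5], three of the five values are at or below 2, so Prob(2) is 0.6, and 2 is the smallest value whose cumulative probability reaches 0.5.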
Conditional distributions
A conditional distribution is the distribution of a subset of the data which is selected according to a condition.
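As an illustration, the sketch below selects a subset with a condition and compares its CDF with the CDF of the full data. The records, the group labels, and the helper name cdf_at are all invented for this example.

```python
def cdf_at(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Made-up records: (group label, finishing time in minutes).
records = [("M40", 42), ("M40", 51), ("F30", 47), ("M40", 45), ("F30", 55)]

all_times = [time for _, time in records]
# The condition: keep only the records whose group is "M40".
m40_times = [time for group, time in records if group == "M40"]

print(cdf_at(all_times, 45))  # CDF at 45 for the whole data set
print(cdf_at(m40_times, 45))  # CDF at 45 for the conditional distribution
```

Two of the five times overall are at or below 45, but two of the three in the selected group are, so the same value has a different probability in the conditional distribution.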
Random numbers
CDFs are useful for generating random numbers with a given distribution. Here’s how: Choose a random probability in the range 0–1. Use Cdf.Value to find the value in the distribution that corresponds to the probability you chose.
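The two steps above can be sketched as follows. This assumes the distribution is an empirical one built from a sample; value_at and random_from_cdf are hypothetical names standing in for Cdf.Value and the generator built on it.

```python
import math
import random

def value_at(sorted_sample, p):
    """Inverse empirical CDF: smallest value whose probability reaches p."""
    index = max(0, math.ceil(p * len(sorted_sample)) - 1)
    return sorted_sample[index]

def random_from_cdf(sample, n, rng=random):
    """Draw n values that follow the empirical distribution of sample."""
    xs = sorted(sample)
    # Step 1: choose a random probability; step 2: look up its value.
    return [value_at(xs, rng.random()) for _ in range(n)]

print(random_from_cdf([2, 1, 3, 2, 5], 10))
```

Because 2 appears twice in the sample, it should come up about twice as often as each of the other values over many draws.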
Resampling
Resampling is the process of generating a random sample from a distribution that was computed from a sample.
In Python, sampling with replacement can be implemented with random.random to choose a percentile rank, or with random.choice to choose an element from a sequence.
Sampling without replacement is provided by random.sample.
The numbers generated by random.random are supposed to be uniform between 0 and 1.
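A short sketch of both kinds of resampling, using only random.choice and random.sample from the standard library. The helper names and the sample values are made up for illustration.

```python
import random

def resample_with_replacement(sample, n, rng=random):
    """Each draw picks from the full sample, so elements can repeat."""
    return [rng.choice(sample) for _ in range(n)]

def resample_without_replacement(sample, n, rng=random):
    """Each element of the sample is used at most once; n <= len(sample)."""
    return rng.sample(sample, n)

sample = [1, 2, 2, 3, 5]
print(resample_with_replacement(sample, 5))
print(resample_without_replacement(sample, 3))
```

Note that random.sample selects by position, so a value that occurs twice in the sample can still appear twice in the result.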
Summary statistics
The median is just the 50th percentile.
The 25th and 75th percentiles are often used to check whether a distribution is symmetric.
Their difference, which is called the interquartile range, measures the spread.
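These statistics can be computed from percentiles alone. The sketch below uses made-up scores and hypothetical lowercase helpers that mirror the percentile functions earlier in these notes.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are less than or equal to value."""
    return 100.0 * sum(1 for s in scores if s <= value) / len(scores)

def percentile(scores, rank):
    """Smallest score whose percentile rank reaches the given rank."""
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score

scores = [55, 66, 77, 88, 99]
median = percentile(scores, 50)
q1 = percentile(scores, 25)
q3 = percentile(scores, 75)

print(median)                   # 77
print(q3 - q1)                  # interquartile range: 22
print(median - q1, q3 - median) # 11 11
```

Here the quartiles sit the same distance from the median, which is what a symmetric distribution would show.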