import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import itertools
import warnings
warnings.filterwarnings("ignore")
plt.style.use('ggplot')
All probabilities lie between zero and one.
$$Pr(A) \in [0,1]$$
If $Pr(A) = 0$, then $A$ cannot happen. If $Pr(A) = 1$, then $A$ is certain.
The sample space is the set of all possible outcomes, that is, each possible event we might observe.
$$Pr(A) = \frac{\text{No. of outcomes in event A}}{\text{No. of outcomes in the sample space}}$$
If $S$ is the sample space containing all events that might happen, then
$$Pr(S) = 1$$
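A minimal sketch illustrating both ideas with a fair six-sided die (the die example is mine, not from the lecture):
# A fair die: each outcome in the sample space is equally likely
sample_space = {1, 2, 3, 4, 5, 6}
event_even = {2, 4, 6}
print(len(event_even)/len(sample_space))    # Pr(even) = 0.5
print(len(sample_space)/len(sample_space))  # Pr(S) = 1.0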
Two events are considered independent if the probability that one event occurs does not change the probability that the other event occurs.
$$Pr(A | B) = Pr(A)$$
$$Pr(B | A) = Pr(B)$$
The probability of two or more independent events occurring.
The probability that $A$ and $B$ happen.
For example, say the probability that you'll check Facebook 5 times today is .33 and the probability that I'll say "right" more than 4 times in this lecture is .56. Then the joint probability of these two independent events is $.33 \times .56 \approx .185$.
$$Pr(A \cap B) = Pr(A)Pr(B)$$
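A quick sketch of the multiplication rule, using the probabilities from the example above:
# Joint probability of two independent events
p_facebook = .33  # Pr(check Facebook 5 times today)
p_right = .56     # Pr(say "right" more than 4 times)
p_facebook * p_right  # ≈ 0.185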
Two events are considered mutually exclusive when one event cannot occur if the other event has occurred.
For example, consider the education levels for respondents in a survey: (1) some high school, (2) high school graduate, (3) some college, (4) college graduate, (5) some graduate training, (6) graduated from graduate school. Respondents can only be located in one of these categories.
$$Pr(A \cap B) = 0$$
Two events are dependent when the probability that one event occurs is affected by whether the other event occurs.
The probability of $A$ given $B$.
$$Pr(A | B) = \frac{Pr(A \cap B)}{Pr(B)}$$
The outcomes that are common to both $A$ and $B$.
When $A$ and $B$ are independent,
$$Pr(A \cap B) = Pr(A)Pr(B)$$
When $A$ and $B$ are dependent,
$$Pr(A \cap B) = Pr(B | A)Pr(A) = Pr(A|B)Pr(B)$$
# Treating events as sets of outcomes
A = set([1,2,5,8])
B = set([2,3,8,9])
A.intersection(B)  # outcomes in both A and B: {2, 8}
Compound event where either $A$ or $B$ happens, or both.
$$Pr(A \cup B) = Pr(A) + Pr(B) - Pr(A \cap B)$$
A = set([1,2,5,8])
B = set([2,3,8,9])
A.union(B)  # outcomes in A, in B, or in both: {1, 2, 3, 5, 8, 9}
A combination is a way of choosing $k$ objects from $n$ objects when the order in which one chooses those objects is irrelevant.
$$\begin{pmatrix} n \\ k \end{pmatrix} = \frac{n!}{k!(n-k)!}$$
# How many ways can we choose 2 items from 4 when order doesn't matter?
options = "abcd"
n = len(options)
k = 2
# Possible combinations
math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
list(itertools.combinations(options, 2))
A permutation is a way of choosing $k$ objects from $n$ objects in which the order one chooses those objects matters.
$$\frac{n!}{(n-k)!}$$
math.factorial(n)/math.factorial(n-k)
list(itertools.permutations(options, 2))
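As an aside, if you're running Python 3.8 or later (an assumption about your environment), the standard library computes these counts directly:
math.comb(n, k)  # combinations of 4 items taken 2: 6
math.perm(n, k)  # permutations of 4 items taken 2: 12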
A random variable is a variable that can take some array of values, with the probability that it takes any particular value defined according to some random process.
Up to this point, we've talked about variables deterministically. That is, $x$ or $y$ are values that we can solve for given some equation.
$$ x + 4 = 6 $$ $$ x = 2 $$
A random variable, on the other hand, is a mapping from some random process (like a coin flip) to a numerical value.
$$X = \begin{cases} 1 \;\; \text{Heads}\\ 0 \;\; \text{Tails}\end{cases}$$
A random variable can potentially take on many values; thus, it no longer makes sense to solve for a specific value, but rather to specify the probability that $X$ will take on a specific value.
$$Pr(X=1) = .5$$
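A quick sketch simulating this coin-flip random variable (a single Bernoulli trial via `np.random.binomial` with `n=1`; the seed is arbitrary):
# Ten flips of a fair coin: 1 = Heads, 0 = Tails
np.random.seed(123)
np.random.binomial(n=1, p=.5, size=10)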
A discrete random variable is a variable whose range is countable.
$$ A = \text{the year the 55th student accepted into DSPP was born}$$
A continuous random variable is a variable whose range is uncountable, or put differently, whose range can take on any value in an interval.
$$ B = \text{the height of the 55th student accepted into DSPP}$$
The set of values a random variable might take. The function that defines the probability that each value occurs is known as a probability mass function (PMF) for discrete variables or a probability density function (PDF) for continuous variables. (More on this below.)
Probability distributions are theoretical constructs. When dealing with actual data, we encounter a "sample distribution": the number of cases that take each of the values in the sample space for a given portion of the population of cases. We already encountered this when calculating "frequencies" (or counts) while summarizing empirical data.
# Discrete Distribution
D = pd.DataFrame(dict(x=np.array([1,1,3,2,1,2,2,3,2,1,1,2,1,2,3,2])))
D.groupby("x").size()
plt.figure(figsize=(10,3))
sns.countplot(D.x,color="steelblue")
plt.show()
We can display the joint frequency distribution for two discrete random variables as a matrix. Representing discrete variables as a table offers a straightforward way to calculate probabilities from observed data. These tables are also known as cross-tabs (or "cross-tabulation tables").
We can calculate the "marginal probabilities" by summing across the rows and columns and reporting those sums in the margins of the table (hence the name "marginal").
| _ | Previous Drug Use | No Previous Drug Use | Total |
|---|---|---|---|
| Voted (2016) | 344 | 456 | 800 |
| Did Not Vote (2016) | 566 | 132 | 698 |
| Total | 910 | 588 | 1498 |
Questions:
$$ Pr(\text{Voted}) = \frac{800}{1498} \approx .53$$
$$ Pr(\text{Voted} \cap \text{Drug Use}) = \frac{344}{1498} \approx .23 $$
$$ Pr(\text{Voted} \cup \text{No Drug Use}) = \frac{800}{1498} + \frac{588}{1498} - \frac{456}{1498} \approx .62 $$
$$ Pr(\text{Voted} | \text{Drug Use}) = \frac{344}{910} \approx .38 $$
$$ Pr(\text{Voted} | \text{Drug Use}) = \frac{Pr(\text{Voted} \cap \text{Drug Use})}{Pr(\text{Drug Use})} = \frac{\frac{344}{1498}}{\frac{910}{1498}} \approx .38 $$
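As a sketch, the same quantities can be computed in pandas by rebuilding the table above as a DataFrame (the variable names are illustrative):
tab = pd.DataFrame({"Drug Use": [344, 566], "No Drug Use": [456, 132]},
                   index=["Voted", "Did Not Vote"])
total = tab.values.sum()                                   # 1498
print(tab.loc["Voted"].sum()/total)                        # Pr(Voted) ≈ .53
print(tab.loc["Voted", "Drug Use"]/total)                  # Pr(Voted and Drug Use) ≈ .23
print(tab.loc["Voted", "Drug Use"]/tab["Drug Use"].sum())  # Pr(Voted | Drug Use) ≈ .38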
Say $X$ is a random variable equal to the number of people, out of 4 randomly sampled from the population, who voted in the last election. What is the probability that three out of the four people voted, $Pr(X = 3)$?
To calculate this, we need to know something about all the potential outcomes of $X$; that is, we need to understand every potential permutation that we could draw. This forms the possible sample space.
voted = ["V"]*4
not_voted = ["NV"]*4
options = voted + not_voted
possible_outcomes = set(itertools.permutations(options,4))
possible_outcomes
Now let's calculate the probability that 3 out of the 4 randomly selected individuals voted in the last election.
cnt_3_voted = sum([1 for entry in possible_outcomes if entry.count("V") == 3])
total = len(possible_outcomes)
pr_3_voted = cnt_3_voted/total
pr_3_voted
We can calculate the probability of every possible voting outcome (i.e. $Pr(X = 0)$, $Pr(X = 1)$, ...). This is known as the probability mass function.
pmf = []
for i in range(4+1):
    pr = sum([1 for entry in possible_outcomes if entry.count("V") == i])/len(possible_outcomes)
    pmf.append([i,pr])
pmf = pd.DataFrame(pmf,columns=["X","probability"])
pmf
plt.figure(figsize=(12,3))
sns.barplot("X","probability",color="steelblue",data=pmf)
plt.show()
The PMF (or the PDF in the continuous case) is denoted by $f(x)$.
Note that the probability for each outcome of $X$ (computed above) sums to 1.
pmf.probability.sum()
The cumulative distribution function (CDF) gives the probability that a random variable takes on a value at or below a specific value. It can be defined for both discrete and continuous distributions.
$$ Pr(X \le x) = \sum_{i\le x} p(i)$$
pmf.loc[:,"cdf"] = pmf.probability.cumsum()
pmf
plt.figure(figsize=(12,3))
sns.barplot("X","cdf",color="steelblue",data=pmf,alpha=.5)
sns.lineplot("X","cdf",color="orange",data=pmf)
sns.scatterplot("X","cdf",color="orange",s=250,data=pmf)
plt.show()
We can query the CDF to find the probability that $X$ is less than or equal to some value.
$$ Pr(X \le 2) = Pr(X = 0) + Pr(X = 1) + Pr(X = 2)$$
pmf.query("X == 2").cdf
The CDF is denoted by $F(x)$.
The expectation (or the "expected value") of a random variable $X$, denoted simply as $E[X]$, is the sum of the possible values the variable might take weighted by the probability that each value will turn up.
For discrete distributions,
$$ E[X] = \sum_{i=1}^N x_i Pr(X=x_i) = \mu_x$$
For continuous distributions,
$$ E[X] = \int_{-\infty }^\infty x f(x) dx = \mu_x$$
where $f(x)$ denotes the probability density function.
Let's work through this definition using the example from above.
x = pmf.X.values; print(x)
p = pmf.probability.values; print(p)
mean_x = sum(x*p)
mean_x
The variance measures the spread of a random variable's possible values around its expected value.
For discrete distributions,
$$Var(X) = \sum_{i=1}^N (x_i - \mu)^2 Pr(X=x_i) = \sigma^2_x$$
For continuous distributions,
$$ Var(X) = \int_{-\infty }^\infty (x - \mu)^2 f(x) dx = \sigma^2_x$$
where $f(x)$ denotes the probability density function (PDF).
# Again, keeping with the working example from above...
sum(((x - mean_x)**2)*p)
To get the standard deviation, we just take the square root of the variance.
$$ \sigma_x = \sqrt{\sigma^2_x} $$
# Standard deviation (using the mean computed above)
math.sqrt(sum(((x - mean_x)**2)*p))
The Bernoulli Distribution describes the probability of two outcomes for a single trial (e.g. one coin flip). The distribution is built on the assumption that repeated trials are independent of one another (e.g. the outcome of one coin flip doesn't affect the outcome of the subsequent flip). The Bernoulli Distribution is the building block for other discrete distributions.
The PMF for the Bernoulli Distribution
$$Pr(Y = y \;|\; p) = \begin{cases} 1 - p & \text{for } y = 0\\ p & \text{for } y = 1 \end{cases}$$
$$ Pr(Y = y \;|\; p) = p^y (1 - p) ^{1-y}$$
where $y = 0$ or $y = 1$.
We denote a variable that is distributed as a Bernoulli distribution as
$$Y \sim Bern(p)$$
where the parameter $p$ corresponds to the probability of "success".
Recall our equation for the mean.
$$ E[X] = \sum_{i=1}^N x_i Pr(X=x_i)$$
$$ E[Y] = (1)p + (0)(1-p)$$
$$ E[Y] = p$$
Recall our equation for the variance.
$$Var(X) = \sum_{i=1}^N (x_i - \mu)^2 Pr(X=x_i) $$
$$ Var(Y) = ((1)-p)^2 p + ((0)-p)^2 (1-p) $$
$$ Var(Y) = p^3 - 2p^2 + p + p^2 - p^3 $$
$$ Var(Y) = -p^2 + p $$
$$ Var(Y) = p(1-p) $$
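A quick simulation check of these moments (the choice of $p = .6$ is arbitrary):
np.random.seed(1234)
draws = np.random.binomial(n=1, p=.6, size=100000)
print(draws.mean())  # ≈ p = .6
print(draws.var())   # ≈ p(1-p) = .24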
np.random.seed(1234)
Y = np.random.binomial(n = 1, p = .6,size = 100)
# Plot
plt.figure(figsize=(12,4))
ax = sns.countplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
The binomial distribution can describe any discrete distribution arising from repeated observations where (1) each observation has exactly two possible outcomes, (2) the probability of "success" is constant across observations, and (3) the observations are independent of one another.
In other words, it is the discrete probability distribution of the number of successes in a sequence of $n$ independent "experiments" or "trials", where $p$ is the probability of success of each trial. Note that $p$ is assumed to be constant and $n$ finite.
The PMF for the Binomial Distribution
$$Pr(Y = y |n,p) = \begin{pmatrix} n \\ y \end{pmatrix} p^y(1-p)^{n-y}$$
where $n \ge y$, $n$ and $y$ are non-negative integers, and $0 \le p \le 1$.
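To connect this back to the voting example, a sketch evaluating the PMF by hand for $n = 4$, $y = 3$, $p = .5$ (the probability implicit in our enumeration above):
n, y, p = 4, 3, .5
n_choose_y = math.factorial(n)/(math.factorial(y)*math.factorial(n-y))
n_choose_y * p**y * (1-p)**(n-y)  # 0.25, matching pr_3_voted above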
We denote a variable that is distributed as a Binomial distribution as
$$Y \sim Bin(n,p)$$
where the parameter $n$ denotes the number of trials and $p$ corresponds to the probability of "success".
Recall that the expected value for any single trial is determined by the Bernoulli Distribution.
$$\mu_{bernoulli} = p $$
For the binomial distribution, we are repeating a Bernoulli trial $n$ times. Given that the outcomes (0 or 1) and the probability of success ($p$) are constant, we can multiply the expected value of a single trial by our $n$ trials.
$$ E[Y] = n((1)p + (0)(1-p)) $$
$$ E[Y] = np $$
Likewise, we are multiplying the variance of the Bernoulli distribution by the $n$ trials.
$$ Var(Y) = np(1-p)$$
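And a quick simulation check (with $n = 10$, $p = .5$, matching the draw below):
np.random.seed(1234)
draws = np.random.binomial(n=10, p=.5, size=100000)
print(draws.mean())  # ≈ np = 5
print(draws.var())   # ≈ np(1-p) = 2.5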
np.random.seed(1234)
Y = np.random.binomial(n = 10, p = .5,size = 100)
# Plot
plt.figure(figsize=(12,5))
ax = sns.countplot(Y,color="steelblue")
ax = ax.set(xlabel='X', ylabel='Frequency')
An interesting thing to note: when $n$ is large and $p$ is close to .5, the binomial distribution begins to approximate a Normal Distribution (see below), where $Y \sim Normal(np,np(1-p))$.
# Altering values of n and p
np.random.seed(1234)
Y = np.random.binomial(n = 1000, p = .5,size = 5000)
# Plot
plt.figure(figsize=(12,5))
ax = sns.distplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
"The multinomial distribution is an extension of the binomial distribution to cases where more than two mutually exclusive outcomes can occur (i.e. think about an election where we have more than two candidates in a race). The multinomial distribution counts the number of times each one of $k$ different outcomes happens, where each outcome happens with probability $p_i$, $i \in \{1,\dots,k\}$. Since the outcomes are mutually exclusive and colelctively exhaustive, all these probabilities sum to one." (224)
The PMF of the Multinomial Distribution is
$$Pr( (Y_1 = y_1) \cap (Y_2 = y_2) \cap \dots \cap (Y_k = y_k) ) = \begin{cases} \frac{n!}{y_1!\cdots y_k!} \prod_{i=1}^k p_i^{y_i} & \text{when } \sum_{i=1}^k y_i = n\\ 0 & \text{otherwise} \end{cases}$$
We denote a variable that is distributed as a Multinomial distribution as
$$Y \sim Multinomial(n,p_1,\dots,p_k)$$
where the parameter $n$ denotes the number of trials and $p_i$ corresponds to the probability of "success" for each outcome, with $\sum_{i=1}^k p_i = 1$.
$$ E[Y_i] = np_i $$
where $i \in \{1,\dots,k\}$ and $k$ is the total number of possible outcomes.
$$ Var[Y_i] = np_i(1-p_i) $$
where $i \in \{1,\dots,k\}$ and $k$ is the total number of possible outcomes.
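A quick simulation check of $E[Y_i] = np_i$ (parameters match the draw below):
np.random.seed(1234)
draws = np.random.multinomial(n=3, pvals=[1/3]*3, size=100000)
draws.mean(axis=0)  # each outcome count ≈ n * p_i = 1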
np.random.seed(1234)
# 10 draws: each row reports how the n = 3 trials fell across the k = 3 outcomes
Y = np.random.multinomial(n=3, pvals=[1/3]*3, size=10)
Y
The Poisson distribution describes event counts produced by a process that (1) can be specified as a non-negative integer, (2) is independent from the count that came before it, and (3) has a known mean (or "rate"). That is, the expected number of events per interval is known. The Poisson distribution is often used when modeling some type of count process: e.g. the number of cars that drive by Georgetown's front gate, the number of violent events in a specific geospatial region, etc.
The PMF of the Poisson Distribution can be written as
$$ P(Y = k) = \frac{e^{-\lambda} \lambda^k}{k!} $$
where $k \in \{0,1,2,\dots \}$ (i.e. $k$ is a non-negative integer) and $\lambda$ denotes the rate/mean (i.e. the expected count).
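A sketch evaluating the PMF by hand (the values $\lambda = 1$ and $k = 2$ are arbitrary):
lam, k = 1, 2
(math.exp(-lam) * lam**k) / math.factorial(k)  # Pr(Y = 2) ≈ 0.184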
We denote a variable that is distributed as a Poisson distribution as
$$Y \sim Pois(\lambda)$$
where the $\lambda$ parameter captures both the expected count and the variance of the distribution.
$$ E[Y] = \lambda $$
Note that the Poisson distribution has the interesting property that the variance is equal to the mean. When this property doesn't hold in the empirical data, we usually use another distribution called the negative binomial distribution.
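A quick check of this mean-equals-variance property by simulation (the choice of $\lambda = 5$ is arbitrary):
np.random.seed(1234)
draws = np.random.poisson(lam=5, size=100000)
print(draws.mean(), draws.var())  # both ≈ 5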
np.random.seed(1234)
Y = np.random.poisson(lam=1,size = 100)
# Plot
plt.figure(figsize=(12,5))
ax = sns.countplot(Y,color="steelblue")
ax = ax.set(xlabel='X', ylabel='Frequency')
Again note that as the value of $\lambda$ gets larger, the Poisson distribution begins to approximate the normal distribution, where $Y \sim Normal(\lambda,\lambda)$.
# Increasing the value of lambda
np.random.seed(1234)
Y = np.random.poisson(lam=500,size = 1000)
# Plot
plt.figure(figsize=(12,5))
ax = sns.distplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
$$ Pr( X \in [a,b]) = \int^b_a f(x) dx$$
where
$$ Pr( - \infty < X < \infty) = \int_{-\infty}^{\infty} f(x) dx = 1$$
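As a sketch, we can check numerically that a density integrates to (approximately) one, here using the standard normal density, which is defined formally below:
# Numerically integrate the standard normal PDF over a wide interval
x = np.linspace(-10, 10, 100001)
f = np.exp(-x**2/2) / np.sqrt(2*np.pi)
np.trapz(f, x)  # ≈ 1.0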
The uniform distribution is a continuous distribution that assigns equal probability to all possible values/events in the sample space (i.e. a given range).
The PDF of the uniform distribution is
$$ f(x) = \frac{1}{b-a} $$
where $x \in (a,b)$
We denote a variable that is distributed as a Uniform distribution as
$$Y \sim Unif(a,b)$$
where the parameters $a$ and $b$ correspond to the bounds of the range.
$$ E[Y] = \frac{a+b}{2}$$
$$ Var[Y] = \frac{(b-a)^2}{12}$$
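A quick simulation check of these moments (with $a = -10$, $b = 10$, matching the draw below):
np.random.seed(1234)
draws = np.random.uniform(low=-10, high=10, size=100000)
print(draws.mean())  # ≈ (a+b)/2 = 0
print(draws.var())   # ≈ (b-a)^2/12 ≈ 33.3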
np.random.seed(1234)
Y = np.random.uniform(low=-10,high=10,size=5000)
# Plot
plt.figure(figsize=(12,5))
ax = sns.distplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
The Normal distribution is easily the most famous of the distributions. It forms the theoretical underpinnings of the linear model. I'll remind you of its properties here, given that you've had extensive exposure to it in your stats sequence.
The PDF of the Normal distribution is
$$ f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2 \pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}} $$
We denote a variable that is distributed as a Normal distribution as
$$Y \sim N(\mu,\sigma^2)$$
where the parameter $\mu$ corresponds to the expectation (mean) and $\sigma^2$ corresponds to the variance.
$$ E[Y] = \mu $$
$$ Var[Y] = \sigma^2 $$
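One familiar property worth checking by simulation: roughly 68% of draws fall within one standard deviation of the mean (parameters match the draw below):
np.random.seed(1234)
draws = np.random.normal(loc=5, scale=10, size=100000)
np.mean(np.abs(draws - 5) <= 10)  # ≈ 0.68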
np.random.seed(1234)
Y = np.random.normal(loc=5,scale=10,size=5000)
# Plot
plt.figure(figsize=(12,5))
ax = sns.distplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
There are many more distributions (both continuous and discrete) covered elsewhere. It's useful to keep in mind that they're all interconnected: some distributions are special cases of other distributions.
Let's finish off by exploring the central limit theorem from a different angle using simulation.
The central limit theorem states that the distribution of the average of independent draws from (nearly) any random variable approaches a normal distribution as the sample size grows large. In short, no matter how odd the distribution you're sampling from, the sample means will stack up normal.
Let's simulate a distribution that is clearly not normal and see this in action.
# First let's start with a Poisson distribution
np.random.seed(1234)
Y = np.random.poisson(1,size=1000)
plt.figure(figsize=(12,5))
ax = sns.distplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
'''
The following code takes repeated simulated draws from a Poisson distribution,
calculates the average of each sample, and then stores those averages.
'''
n_samples = 500
averages = [0]*n_samples
for i in range(n_samples):
    Y = np.random.poisson(1,size=1000)
    averages[i] = Y.mean()
print(averages[:5], end=" ...")  # preview the first few sample means
plt.figure(figsize=(12,5))
ax = sns.distplot(averages,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
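As a sanity check, the CLT predicts these averages behave like draws from $Normal(\lambda, \lambda/n)$, i.e. with mean 1 and standard deviation $\sqrt{1/1000} \approx .032$:
print(np.mean(averages), np.std(averages))  # ≈ 1 and ≈ .032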
Again!
# First let's start with a Bernoulli distribution
np.random.seed(1234)
Y = np.random.binomial(1,p=.2,size=1000)
plt.figure(figsize=(12,3))
ax = sns.countplot(Y,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
n_samples = 500
averages = [0]*n_samples
for i in range(n_samples):
    Y = np.random.binomial(1,p=.2,size=1000)
    averages[i] = Y.mean()
# Plot the averages
plt.figure(figsize=(12,5))
ax = sns.distplot(averages,color="steelblue")
ax = ax.set(xlabel='Y', ylabel='Frequency')
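Here the CLT predicts mean $p = .2$ and standard deviation $\sqrt{p(1-p)/n} = \sqrt{.2 \times .8 / 1000} \approx .013$:
print(np.mean(averages), np.std(averages))  # ≈ .2 and ≈ .013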