How to compute and optimize a simple likelihood function
Using a completely automated routine
In simple cases, where you know the distribution name and only need to estimate the parameters, existing R functions will perform all the math for you. The two most common functions are:
MASS::fitdistr — comes pre-installed (but not loaded) in all base distributions of R, handles many basic cases
fitdistrplus::fitdist — requires installation of the fitdistrplus package, handles more edge cases and estimation methods
For simplicity we will use MASS::fitdistr; however, the same outputs can be reached with either function.1
The code below generates some exponential data with true parameter \(\lambda=5\), gives you the documentation for the MASS::fitdistr function, and then uses that function to estimate the parameter:
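A minimal sketch of that code (the seed and the sample size of 1,000 are assumptions, so the estimate will vary slightly from draw to draw):

library(MASS)                     # installed with R, but not loaded by default
set.seed(1)                       # assumed seed, chosen arbitrarily
sim.data <- rexp(n=1000, rate=5)  # assumed sample size; true lambda = 5
help(fitdistr)                    # for your reference
fitdistr(sim.data, densfun='exponential')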
The MLE guess of \(\hat{\lambda} \approx 4.893\) is close to the true value of 5, but not exactly right. In general, we will always be wrong with our estimates, and that’s fine!
Of course, in the real world we would not know the true value of 5; we would only see the estimate of 4.893, with no idea how close or far we are from the truth.
The estimate returned by the fitdistr() function is exactly equal to \(1/\bar{x}\).2 You could solve this by hand and arrive at the same conclusion! For more complicated distributions, the function will instead approximate the MLE with iterative algorithms.
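A quick check of that identity, using the simulated sample from the sketch above:

1/mean(sim.data) # matches the fitdistr() estimate up to printing precision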
Sometimes your work will suggest a distribution not handled by basic R functions. You can still use R to help you maximize a custom likelihood function using stats::optim, which comes pre-loaded in all R versions.
Let’s consider a truncated Poisson distribution of speeding tickets. Perhaps you have records of all the speeding tickets issued to drivers in one locality. Some drivers have multiple speeding tickets. Presumably, some drivers have no speeding tickets, but these drivers don’t leave a record in the ticket database. If all drivers receive tickets through a Poisson process, and we only observe the positive counts (not the zeroes), can we reconstruct the original rate?
First we will make a set of data for 50,000 drivers using a rate of \(\lambda = 0.07\) tickets per driver, then we’ll drop the zeroes (truncation) to mimic a real dataset. Next we’ll create a custom likelihood function to represent such data (essentially just re-weighting the original Poisson PMF), and finally we’ll optimize this function using optim():
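The data-generation step might look like the sketch below (the seed is an arbitrary assumption):

set.seed(2)                                   # assumed seed, chosen arbitrarily
all.drivers <- rpois(n=50000, lambda=0.07)    # tickets per driver, zeroes included
observed.data <- all.drivers[all.drivers > 0] # only ticketed drivers appear in the database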
ll.trunc.pois <- function(lambda, x){
  # log-likelihood of a zero-truncated Poisson sample: the ordinary Poisson
  # log-likelihood, re-weighted by the probability of observing a positive count
  sum(x)*log(lambda) - lambda*length(x) - sum(lfactorial(x)) -
    length(x)*log(1 - exp(-lambda))
}
help(optim) # for your reference
opt.sol <- optim(par=0.1, fn=ll.trunc.pois, x=observed.data,
                 control=list(fnscale=-1),   # maximize rather than minimize
                 method='L-BFGS-B', lower=0.01, upper=0.5)
opt.sol$par
[1] 0.08437934
This call to optim() gives a starting value for lambda, specifies the function being maximized, passes data to the likelihood function, re-scales the algorithm to maximize (rather than minimize), chooses an optimization method, and puts bounds on the values of lambda. The optim() algorithm converges on an MLE of 0.084, which is near the true value of 0.07.
How to perform simple hypothesis tests
Almost all parametric inference includes hypothesis tests of some kind, so this is a broad category. We’ll start from the basics, and in future chapters we will add more bells and whistles. I present several scenarios below; you should be able to work your way through many more once you master these building blocks.
Scenario 1: Two-sided binomial distribution test
John Edmund Kerrich (1903–1985) was a British statistician who became a prisoner of war during WWII. During his internment he flipped a coin 10,000 times and recorded 5067 heads, 4933 tails. Using a two-sided test at a 10% significance threshold, was his coin fair?
help(pbinom) # for your reference
(p.value <- 2*pbinom(q=4933, size=10000, prob=0.5))
[1] 0.1835155
p.value <=0.1
[1] FALSE
With a p-value of 18% we cannot reject the null hypothesis, and continue to assume that the coin is fair.
This simple calculation took advantage of the fact that the binomial distribution with prob=0.5 is symmetric, so that the left tail region (0–4933 heads) and the right tail region (5067–10000 heads) have the same probability. We could have computed this many ways:
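For instance, each of the following lines returns the same two-sided p-value (a sketch of a few equivalent routes):

2*pbinom(q=4933, size=10000, prob=0.5)                # twice the left tail
2*(1 - pbinom(q=5066, size=10000, prob=0.5))          # twice the right tail, P(X >= 5067)
pbinom(4933,10000,0.5) + (1 - pbinom(5066,10000,0.5)) # left tail plus right tail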
Scenario 2: One-sided exponential distribution test
The bus stop sign outside your house claims that buses arrive, on average, every 12 minutes, but the first time you waited, it took 18 minutes, and the next time, 30 minutes. Assume bus arrival times are exponentially distributed — using a one-sided test at a 5% significance threshold, do you still believe the sign?
help(pexp) # for your reference
(p.value <- (1 - pexp(18/12))*(1 - pexp(30/12)))
[1] 0.01831564
p.value <=0.05
[1] TRUE
With a p-value of 1.8% we reject the null hypothesis, and believe that buses arrive less often than the sign says.
This calculation took advantage of the fact that the exponential distribution is memoryless: the chance of seeing no bus across two independent waits totaling 48 minutes is the same as the chance of seeing no bus in one 48-minute wait. We could have also used the Poisson distribution to calculate the probability that no buses would arrive during our waiting periods. The idea of “equally extreme or more extreme” outcomes from our two bus waiting periods can be calculated in a number of ways, all of which are numerically equal:
(1-pexp(q=18/12,rate=1))*(1-pexp(q=30/12,rate=1))
[1] 0.01831564
1-pexp(q=48/12)
[1] 0.01831564
1-pexp(q=48,rate=1/12)
[1] 0.01831564
dpois(0,18/12)*dpois(0,30/12)
[1] 0.01831564
dpois(0,48/12)
[1] 0.01831564
Scenario 3: One-sided permutation test
In 1919 the phycologist Muriel Bristol worked alongside the statistician Ronald Fisher and one day told him that she preferred her tea poured first and then milk added, rather than the other way around. Fisher, skeptical, resolved to test her. He prepared eight cups of tea: four with the milk added first, four with the milk added second. He told Bristol to guess the four cups with milk added first, and she did, without any errors. At a 5% significance level, can we conclude that Bristol indeed could tell the cups apart?
help(choose) # for your reference
(p.value <- 1/choose(8,4))
[1] 0.01428571
p.value <=0.05
[1] TRUE
With a p-value of 1.4% we reject the null hypothesis, and believe that Muriel Bristol could truly taste a difference between the two methods.
This calculation was a simple combinatoric one: under the null hypothesis that Bristol was guessing by blind luck, she was equally likely to choose any of the \(\binom{8}{4}=70\) distinct sets of four teacups. So under the null, there was only a \(1/70 \approx 1.4\%\) chance that she would pick the correct four cups, which she did.
How to compute ‘exact’ confidence intervals for distributional parameters
For binomial probabilities p
Statisticians debate the merits of several different “exact” solutions for this problem.3 In this document we’ll outline the Clopper-Pearson interval, which is conservative (meaning its actual coverage probability is at least as large as its nominal confidence level).
Let’s say that 1000 likely United States voters are polled and 65 of them claim they will vote for a third-party candidate in the upcoming presidential election. What would be a 90% “exact” two-sided confidence interval for the true proportion of voters supporting the third-party candidate?4
The good news: this is a simple one-line calculation using stats::binom.test!
The better news: you could replicate this using the theory you were taught, and the root-finding helper function stats::uniroot!
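The call is a single line; only the confidence interval in its output matters here, since the test against the default null of p = 0.5 is incidental:

binom.test(x=65, n=1000, conf.level=0.90)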
Exact binomial test
data: 65 and 1000
number of successes = 65, number of trials = 1000, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
90 percent confidence interval:
0.05264301 0.07930468
sample estimates:
probability of success
0.065
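And here is a sketch of the uniroot replication: each Clopper-Pearson endpoint is the value of p at which one tail of the binomial distribution equals \(\alpha/2\) (the variable names below are illustrative):

x <- 65; n <- 1000; alpha <- 0.10
# lower bound: the p at which P(X >= 65) equals alpha/2
lower <- uniroot(function(p) 1 - pbinom(x-1, n, p) - alpha/2, interval=c(1e-6, 1-1e-6))$root
# upper bound: the p at which P(X <= 65) equals alpha/2
upper <- uniroot(function(p) pbinom(x, n, p) - alpha/2, interval=c(1e-6, 1-1e-6))$root
c(lower, upper) # should match binom.test: roughly 0.0526 and 0.0793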
Both methods would also work for one-sided intervals. For example, say that you sample 500 “penny stocks” (companies with very low share prices; typically very speculative or risky investments) and find that 89 of those companies file for bankruptcy within six months. What’s a one-sided 80% confidence upper bound on the proportion of penny stocks which quickly go bankrupt?
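Again the call is one line; the one-sided alternative makes binom.test report an upper confidence bound:

binom.test(x=89, n=500, alternative='less', conf.level=0.80)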
Exact binomial test
data: 89 and 500
number of successes = 89, number of trials = 500, p-value < 2.2e-16
alternative hypothesis: true probability of success is less than 0.5
80 percent confidence interval:
0.0000000 0.1939393
sample estimates:
probability of success
0.178
This scenario does not arise often: working data scientists rarely see exact confidence intervals outside of binomial proportions. With modern sample sizes, the approximate methods discussed in Handy rules of thumb, The Central Limit Theorem, and especially The Student’s t distribution are much more common.
How to identify a distribution and its likely parameter values
Testing whether a sample could come from a specific distribution with known parameters
This is a canonical use of the Kolmogorov-Smirnov test. The function stats::ks.test comes pre-loaded with every distribution of R. Although a version of the test exists for discrete-valued samples (where “ties” may occur within the ranks of \(\boldsymbol{x}\)), the base R implementation does not perform this correction. Still, for large sample sizes, you may ignore the warnings in peace.
Suppose we sell “25 pound” bags of dog food which historically have had a mean weight of 25.6 pounds and a standard deviation of 0.3 pounds (we don’t want to shortchange the customer.) After a change to the assembly line, the most recent ten bag weights were:
\[\boldsymbol{x}=\{25.3, 25.1, 25.1, 25.5, 25.2, 24.7, 24.8, 24.9, 24.8, 25.5\}\] Does the new assembly line preserve our old bag weights?
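In R this is a one-sample call against the fully-specified historical distribution:

x <- c(25.3, 25.1, 25.1, 25.5, 25.2, 24.7, 24.8, 24.9, 24.8, 25.5)
ks.test(x, pnorm, mean=25.6, sd=0.3) # compare the sample to Normal(25.6, 0.3^2)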
Warning in ks.test.default(x, pnorm, mean = 25.6, sd = 0.3): ties should not be
present for the one-sample Kolmogorov-Smirnov test
Asymptotic one-sample Kolmogorov-Smirnov test
data: x
D = 0.64134, p-value = 0.000535
alternative hypothesis: two-sided
No: with \(p \approx 0.05\%\), we reject the null at any conventional significance level and conclude that our bags no longer fit the historical distribution. Notice that we have not made a specific claim that the bags weigh more or less, or that their spread is wider or narrower, simply that they do not belong to the Normal(25.6, 0.3²) distribution.
Testing whether two samples could have come from the same distribution
This is the second canonical use of the Kolmogorov-Smirnov test. We don’t even need to know the distribution of either sample. We can still very easily run a test to see if it’s possible they share the same CDF.
Let’s say we tweak the assembly line again and get a new sample of ten bags.
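A sketch of the call (footnote 5 gives the generating distribution; the seed and the rounding below are assumptions, so a fresh draw will not reproduce the output shown):

set.seed(3)                                   # assumed seed, chosen arbitrarily
y <- round(rnorm(n=10, mean=25.7, sd=0.5), 1) # per footnote 5; rounding to 0.1 lb is assumed
ks.test(x, y)                                 # two-sample test: could x and y share a CDF?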
Exact two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.7, p-value = 0.01234
alternative hypothesis: two-sided
ks.test(y,pnorm,mean=25.6,sd=0.3)
Warning in ks.test.default(y, pnorm, mean = 25.6, sd = 0.3): ties should not be
present for the one-sample Kolmogorov-Smirnov test
Asymptotic one-sample Kolmogorov-Smirnov test
data: y
D = 0.30879, p-value = 0.2961
alternative hypothesis: two-sided
Using a 5% significance threshold, we may say that the latest tweak has shifted our assembly line away from the first sample, and that we have no evidence against the idea that we have returned to the old standard of Normal(25.6, 0.3²).5
Identifying and estimating an unknown sample
This is not a canonical use of the Kolmogorov-Smirnov test. Above, we tested whether a sample of data belonged to a known, fully-specified distribution. If we instead are estimating the distribution parameters from the data itself, the test is no longer fully accurate.
However… this remains common in exploratory data analysis (EDA). We don’t need to convince an outside audience; we just need to convince ourselves. For example, let’s take the second sample of dog food bag weights:
Say we had no prior idea which distribution these came from, and we wanted a guess. The data seem continuous and don’t seem to have natural bounds. Two candidate distributions might be the Cauchy and the Normal. We’ll find the best parameters for both (using MASS::fitdistr) and see if they’re any good:
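A sketch of that workflow (the names used to pull parameters out of each fit follow fitdistr’s estimate vector):

library(MASS) # for fitdistr
fit.cauchy <- fitdistr(y, densfun='cauchy')
fit.normal <- fitdistr(y, densfun='normal')
ks.test(y, pcauchy, location=fit.cauchy$estimate['location'], scale=fit.cauchy$estimate['scale'])
ks.test(y, pnorm, mean=fit.normal$estimate['mean'], sd=fit.normal$estimate['sd'])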
Asymptotic one-sample Kolmogorov-Smirnov test
data: y
D = 0.13322, p-value = 0.9943
alternative hypothesis: two-sided
From this we conclude that the small sample of ten dog food weights could easily be from a Cauchy distribution or from a Normal distribution. More observations would help us to determine whether one of these distributions is a better fit to the dog food bag weights.
If using fitdistrplus::fitdist you should be sure to specify method='mle' or else you may not match the results of MASS::fitdistr.↩︎
Any printed difference in the code output is simply a matter of significant digits.↩︎
Spoiler: none of them offer exact \(1 - \alpha\) coverage of the true parameter in all situations.↩︎
This scenario assumes that the sample is representative of the electorate and that the respondents are honest about their preferences.↩︎
I generated the first sample from a Normal(25.2, 0.3²) distribution and the second sample from a Normal(25.7, 0.5²) distribution… you may draw your own conclusions.↩︎