We have already examined R’s MLE-fitting procedures in Coding it up in R, specifically the functions MASS::fitdistr and fitdistrplus::fitdist, along with the more general workhorse function stats::optim. These functions replicate (or even improve upon) the gradient descent method described previously.
You can also use the function stats::uniroot to find a one-dimensional root of an equation, as opposed to hand-coding a Newton-Raphson algorithm. As we know from Univariate likelihood, the parameter that maximizes the likelihood function is usually a root of the derivative of the log-likelihood function.
For example, let’s imagine that we are loosing arrows at a target and counting the number of misses until we score a bullseye. After five trials the five counts for the missed shots are: \(\boldsymbol{x}=\{4,6,10,5,5\}\). If we assume these are drawn from a geometric distribution with unknown probability \(p\), we can use uniroot() to find the zero of the derivative of the log-likelihood:
The geometric PMF is: \(P(X=k \mid p) = (1-p)^k p\)
The likelihood function would be: \(\mathcal{L}(p \mid \boldsymbol{x}) = p^n \prod_i (1-p)^{x_i}\)
The log-likelihood would be: \(n \log p + \sum_i x_i \log (1-p)\)
Its partial derivative with respect to \(p\) would be: \(\frac{n}{p}-\frac{\sum_i x_i}{1-p}\). Setting this derivative to zero gives the closed-form solution \(\hat{p} = \frac{n}{n + \sum_i x_i}\), which we can verify numerically.
Switching from equations to R, we solve for the MLE of \(p\):
X <- c(4,6,10,5,5)
dll.geom <- function(p) length(X)/p - sum(X)/(1-p)
uniroot(dll.geom, c(0.01, 0.99))[[1]]
[1] 0.1428571
1/7
[1] 0.1428571
This should seem intuitively correct: if on average we miss six times \((\bar{x}=6)\) before making a bullseye on the seventh shot, then our probability of making a bullseye should be 1/7.
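As a quick cross-check (a sketch, not part of the analysis above), MASS::fitdistr can fit the same geometric model directly, and since the geometric MLE has the closed form \(\hat{p} = 1/(1+\bar{x})\), its estimate should again be 0.1428571:
library(MASS)
# fit a geometric distribution to X by maximum likelihood;
# the 'prob' estimate should match the uniroot result of 1/7
fitdistr(X, densfun="geometric")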
I do not really recommend this strategy in an industry or production context. The automated routines found in R (or any other language) converge faster, better, and more surely than what you are likely to build on your own. Optimization algorithms are an art and science of their own, and take years of training to perform at an enterprise-grade level. However, it can be quite helpful to prove to yourself that you know what you’re doing. The sections The Newton-Raphson method and Gradient descent contain illustrations of simple manual routines you can implement.
For example, we can use the Newton-Raphson method instead of uniroot() to find the root of the derivative of the log-likelihood function in the archery example above:
# helper function to approximate derivatives
ddx <- function(value, func, delta=1e-6){
  (func(value+delta) - func(value-delta))/(2*delta)
}
# core newton-raphson algorithm, plus some output
newton <- function(f, x0, tol=1e-6, max.iter=20){
  x <- x0
  i <- 0
  results <- data.frame(iter=i, x=x, fx=f(x))
  while (abs(f(x)) > tol & i < max.iter) {
    i <- i + 1
    x <- x - f(x)/ddx(x, f)
    results <- rbind(results, data.frame(iter=i, x=x, fx=f(x)))
  }
  return(results)
}
newton(f=dll.geom, x0=0.05, tol=1e-10)
  iter          x           fx
1 0 0.05000000 6.842105e+01
2 1 0.08365123 2.703336e+01
3 2 0.11968296 7.698410e+00
4 3 0.13953568 9.682306e-01
5 4 0.14279220 1.857091e-02
6 5 0.14285712 7.034290e-06
7 6 0.14285714 1.023182e-12
Or we could use gradient descent to maximize the likelihood function more directly:
# gradient descent function with backtracking line search
graddesc <- function(f, g, x0, b=0.8, tol=1e-2, max.iter=40){
  i <- 0
  x <- x0
  results <- t(c(0, x, -f(x)))
  while (sqrt(g(x) %*% g(x)) > tol & i < max.iter) {
    i <- i + 1
    t <- 0.01
    while ((f(x + t*g(x)) < f(x)) & t > tol) {
      t <- t*b
    }
    x <- x + t*g(x)
    results <- rbind(results, t(c(i, x, -f(x))))
  }
  colnames(results) <- c('iter', paste0('x', 1:length(x)), 'f')
  return(results)
}
ll.geom <- function(p) length(X)*log(p) + sum(X)*log(1-p)
graddesc(f=ll.geom, g=dll.geom, x0=0.1, tol=1e-10)
     iter        x1        f
[1,] 0 0.1000000 14.67374
[2,] 1 0.1853333 14.57728
[3,] 2 0.1065617 14.57550
[4,] 3 0.1748781 14.48505
[5,] 4 0.1251698 14.40218
[6,] 5 0.1613513 14.39976
[7,] 6 0.1307366 14.37613
[8,] 7 0.1546269 14.37302
[9,] 8 0.1344580 14.36450
[10,] 9 0.1506242 14.36244
[11,] 10 0.1370251 14.35905
[12,] 11 0.1480728 14.35788
[13,] 12 0.1388114 14.35645
[14,] 13 0.1463924 14.35583
[15,] 14 0.1400553 14.35521
[16,] 15 0.1452663 14.35489
[17,] 16 0.1409200 14.35461
[18,] 17 0.1445043 14.35446
[19,] 18 0.1415197 14.35433
[20,] 19 0.1439856 14.35425
[21,] 20 0.1419348 14.35419
[22,] 21 0.1436312 14.35416
[23,] 22 0.1422215 14.35413
[24,] 23 0.1433886 14.35411
[25,] 24 0.1424194 14.35410
[26,] 25 0.1432222 14.35409
[27,] 26 0.1425558 14.35408
[28,] 27 0.1431081 14.35408
[29,] 28 0.1426497 14.35408
[30,] 29 0.1430296 14.35408
[31,] 30 0.1427144 14.35407
[32,] 31 0.1429757 14.35407
[33,] 32 0.1427589 14.35407
[34,] 33 0.1429387 14.35407
[35,] 34 0.1427896 14.35407
[36,] 35 0.1429132 14.35407
[37,] 36 0.1428107 14.35407
[38,] 37 0.1428957 14.35407
[39,] 38 0.1428252 14.35407
[40,] 39 0.1428837 14.35407
[41,] 40 0.1428352 14.35407
Notice that the Newton-Raphson algorithm converged to the eighth decimal place within six iterations, while the gradient descent method had found only the first four decimal places after 40 iterations. Better programmers with more experience could wring much more power from gradient descent than I did.
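For comparison, here is a sketch of the production-grade route using stats::optim with the one-dimensional Brent method (the method optim's documentation recommends for single-parameter problems); it should land on the same estimate of roughly 0.1428571:
# maximize the log-likelihood directly; fnscale=-1 turns
# optim's minimizer into a maximizer, and Brent requires bounds
optim(par=0.5, fn=ll.geom, method="Brent", lower=0.01, upper=0.99,
      control=list(fnscale=-1))$par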
The MASS::anorexia dataset will help to illustrate some basic t-testing use cases. This dataset describes an experimental trial in which 72 anorexic young women were assigned to three groups: a control group receiving no treatment (“Cont”), a group receiving cognitive behavioral therapy (“CBT”), and a group receiving family therapy (“FT”).
library(MASS)
head(anorexia)
  Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1
90% confidence interval for the mean weight of incoming patients (automated and manual solutions):
t.test(x=anorexia$Prewt, conf.level=0.90)$conf.int
[1] 81.39044 83.42622
attr(,"conf.level")
[1] 0.9
n.pre <- length(anorexia$Prewt)
muhat.pre <- mean(anorexia$Prewt)
se.pre <- sd(anorexia$Prewt)/sqrt(n.pre)
muhat.pre + c(-1,1)*qt(p=0.95, df=n.pre-1)*se.pre
[1] 81.39044 83.42622
One-sided test of the null hypothesis that the mean weight of the incoming patients is as high as 84 pounds, against the alternative that it is lower:
t.test(x=anorexia$Prewt, alternative='less', mu=84)$p.value
[1] 0.005576362
pt(q=(muhat.pre - 84)/se.pre, df=n.pre-1)
[1] 0.005576362
Two-sample test of whether the mean post-treatment weight of the CBT group differed from that of the control group (assuming equal variances):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       alternative='two.sided', var.equal=TRUE)$p.value
[1] 0.01692976
n.cont <- sum(anorexia$Treat=='Cont')
muhat.cont <- mean(anorexia$Postwt[anorexia$Treat=='Cont'])
n.cbt <- sum(anorexia$Treat=='CBT')
muhat.cbt <- mean(anorexia$Postwt[anorexia$Treat=='CBT'])
var.pooled <- (var(anorexia$Postwt[anorexia$Treat=='Cont'])*(n.cont-1) +
               var(anorexia$Postwt[anorexia$Treat=='CBT'])*(n.cbt-1)) /
              (n.cont + n.cbt - 2)
se.pooled <- sqrt(var.pooled/n.cont + var.pooled/n.cbt)
2*(1-pt(q=(muhat.cbt - muhat.cont)/se.pooled, df=n.cont + n.cbt - 2))
[1] 0.01692976
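Note that t.test defaults to var.equal=FALSE, which runs the Welch test instead. As a sketch, the same comparison without the equal-variance assumption is one argument away (the p-value will differ slightly because Welch adjusts the degrees of freedom):
# Welch two-sample test: R's default, no equal-variance assumption
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       alternative='two.sided')$p.value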
Two-sample 90% confidence interval for how much more the CBT group weighed after treatment than the control group (assuming equal variances):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       conf.level=0.90, var.equal=TRUE)$conf.int
[1] 1.473675 7.704044
attr(,"conf.level")
[1] 0.9
(muhat.cbt - muhat.cont) + c(-1,1)*qt(p=0.95, df=n.cbt+n.cont-2)*se.pooled
[1] 1.473675 7.704044
Matched-pairs 90% confidence interval measuring how much the CBT group gained during the treatment process. Note that this test requires each observation in the first group to be matched to a corresponding observation in the second group (as is true in a before-and-after study):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Prewt[anorexia$Treat=='CBT'],
       paired=TRUE, conf.level=0.90)$conf.int
[1] 0.6981979 5.3155952
attr(,"conf.level")
[1] 0.9
muhat.paired <- mean(anorexia$Postwt[anorexia$Treat=='CBT'] -
                     anorexia$Prewt[anorexia$Treat=='CBT'])
se.paired <- sd(anorexia$Postwt[anorexia$Treat=='CBT'] -
                anorexia$Prewt[anorexia$Treat=='CBT'])/sqrt(n.cbt)
muhat.paired + c(-1,1)*qt(p=0.95, df=n.cbt-1)*se.paired
[1] 0.6981979 5.3155952
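Equivalently, the paired interval is just a one-sample t-interval on the within-patient differences. This sketch (the gain.cbt name is mine, for illustration) should reproduce the interval above:
# one-sample t-interval on the post-minus-pre differences
gain.cbt <- anorexia$Postwt[anorexia$Treat=='CBT'] -
            anorexia$Prewt[anorexia$Treat=='CBT']
t.test(x=gain.cbt, conf.level=0.90)$conf.int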
Sticking with the MASS::anorexia dataset, let’s define a “success” as a patient who weighed more post-treatment than pre-treatment.1
anorexia$Success <- 1*(anorexia$Postwt > anorexia$Prewt)
Now we can examine the proportions of success within and between the different treatment groups. For example:
Two-sided 80% confidence interval for the one-sample proportion of successes in the family therapy group:
x.ft <- sum(anorexia$Success[anorexia$Treat=='FT'])
n.ft <- length(anorexia$Success[anorexia$Treat=='FT'])
prop.test(x=x.ft, n=n.ft, conf.level=0.80)$conf.int
[1] 0.5819867 0.8909870
attr(,"conf.level")
[1] 0.8
One-sided, one-sample proportion test of whether the family therapy group achieved at least a 60% success rate. Note that R by default applies a “continuity correction,” which improves upon the CLT approximation for small sample sizes. By disabling this continuity correction we can manually match R’s results using the CLT:
prop.test(x=x.ft, n=n.ft, p=0.60, alternative='greater')$p.value
[1] 0.1274205
prop.test(x=x.ft, n=n.ft, p=0.60, alternative='greater', correct=FALSE)$p.value
[1] 0.08284192
1-pnorm((x.ft/n.ft - 0.6)/sqrt(0.6*0.4/n.ft))
[1] 0.08284192
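For a sample this small you could also sidestep the CLT entirely: stats::binom.test computes an exact p-value from the binomial distribution. A one-line sketch:
# exact one-sided binomial test; no normal approximation involved
binom.test(x=x.ft, n=n.ft, p=0.60, alternative='greater')$p.value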
Two-sided, two-proportion test of whether the family therapy group and the CBT group have different success rates:
x.cbt <- sum(anorexia$Success[anorexia$Treat=='CBT'])
prop.test(x=c(x.ft,x.cbt), n=c(n.ft,n.cbt),
          alternative='two.sided')$p.value
[1] 0.4965428
prop.test(x=c(x.ft,x.cbt), n=c(n.ft,n.cbt),
          alternative='two.sided', correct=FALSE)$p.value
[1] 0.3145389
p.combined <- (x.ft + x.cbt)/(n.ft + n.cbt)
se.combined <- sqrt(p.combined*(1-p.combined)*(1/n.ft+1/n.cbt))
2*(1-pnorm((x.ft/n.ft - x.cbt/n.cbt)/se.combined))
[1] 0.3145389
1. This is a low bar; we might normally want to set a minimum threshold for weight gain.