We have already examined R’s MLE-fitting procedures in Coding it up in R, specifically the functions MASS::fitdistr and fitdistrplus::fitdist, along with the more general workhorse function stats::optim. These functions replicate (or even improve upon) the gradient descent method described previously.
You can also use the function stats::uniroot to find a one-dimensional root of an equation, as opposed to hand-coding a Newton-Raphson algorithm. As we know from Univariate likelihood, the parameter that maximizes the likelihood function is usually a root of the derivative of the log-likelihood function.
For example, let’s imagine that we are loosing arrows at a target and counting the number of misses until we score a bullseye. After five trials the five counts for the missed shots are: \(\boldsymbol{x}=\{4,6,10,5,5\}\). If we assume these are drawn from a geometric distribution with unknown probability \(p\), we can use uniroot() to find the zero of the derivative of the log-likelihood:
The geometric PMF is: \(P(X=k \mid p) = (1-p)^k p\)
The likelihood function would be: \(\mathcal{L}(p \mid \boldsymbol{x}) = p^n \prod_i (1-p)^{x_i}\)
The log-likelihood would be: \(n \log p + \sum_i x_i \log (1-p)\)
Its partial derivative with respect to \(p\) would be: \(\frac{n}{p}-\frac{\sum_i x_i}{1-p}\). Setting this derivative to zero gives the closed-form solution \(\hat{p} = \frac{n}{n + \sum_i x_i}\), which we can verify numerically.
Switching from equations to R, we solve for the MLE of \(p\):
X <- c(4,6,10,5,5)
dll.geom <- function(p) length(X)/p - sum(X)/(1-p)
uniroot(dll.geom, c(0.01, 0.99))[[1]]
[1] 0.1428571
1/7
[1] 0.1428571
This should seem intuitively correct: if on average we miss six times \((\bar{x}=6)\) before making a bullseye on the seventh shot, then our probability of making a bullseye should be 1/7.
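As a quick cross-check (a sketch, not part of the analysis above), MASS::fitdistr can fit the same geometric model directly, and since the geometric MLE has the closed form \(\hat{p} = 1/(1+\bar{x})\), its estimate should again be 0.1428571:
library(MASS)
# fit a geometric distribution to X by maximum likelihood;
# the 'prob' estimate should match the uniroot result of 1/7
fitdistr(X, densfun="geometric")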
I do not really recommend this strategy in an industry or production context. The automated routines found in R (or any other language) converge faster, better, and more surely than what you are likely to build on your own. Optimization algorithms are an art and science of their own, and take years of training to perform at an enterprise-grade level. However, it can be quite helpful to prove to yourself that you know what you’re doing. The sections The Newton-Raphson method and Gradient descent contain illustrations of simple manual routines you can implement.
For example, we can use the Newton-Raphson method instead of uniroot() to find the root of the derivative of the log-likelihood function in the archery example above:
# helper function to approximate derivatives
ddx <- function(value, func, delta=1e-6){
  (func(value+delta) - func(value-delta))/(2*delta)
}
# core newton-raphson algorithm, plus some output
newton <- function(f, x0, tol=1e-6, max.iter=20){
  x <- x0
  i <- 0
  results <- data.frame(iter=i, x=x, fx=f(x))
  while (abs(f(x)) > tol & i < max.iter) {
    i <- i + 1
    x <- x - f(x)/ddx(x, f)
    results <- rbind(results, data.frame(iter=i, x=x, fx=f(x)))
  }
  return(results)
}
newton(f=dll.geom, x0=0.05, tol=1e-10)
  iter          x           fx
1 0 0.05000000 6.842105e+01
2 1 0.08365123 2.703336e+01
3 2 0.11968296 7.698410e+00
4 3 0.13953568 9.682306e-01
5 4 0.14279220 1.857091e-02
6 5 0.14285712 7.034290e-06
7 6 0.14285714 1.023182e-12
Or we could use gradient descent to maximize the likelihood function more directly:
# gradient descent function with backtracking line search
graddesc <- function(f, g, x0, b=0.8, tol=1e-2, max.iter=40){
  i <- 0
  x <- x0
  results <- t(c(0, x, -f(x)))
  while (sqrt(g(x) %*% g(x)) > tol & i < max.iter) {
    i <- i + 1
    t <- 0.01
    while ((f(x + t*g(x)) < f(x)) & t > tol) {
      t <- t*b
    }
    x <- x + t*g(x)
    results <- rbind(results, t(c(i, x, -f(x))))
  }
  colnames(results) <- c('iter', paste0('x', 1:length(x)), 'f')
  return(results)
}
ll.geom <- function(p) length(X)*log(p) + sum(X)*log(1-p)
graddesc(f=ll.geom, g=dll.geom, x0=0.1, tol=1e-10)
     iter        x1        f
[1,] 0 0.1000000 14.67374
[2,] 1 0.1853333 14.57728
[3,] 2 0.1065617 14.57550
[4,] 3 0.1748781 14.48505
[5,] 4 0.1251698 14.40218
[6,] 5 0.1613513 14.39976
[7,] 6 0.1307366 14.37613
[8,] 7 0.1546269 14.37302
[9,] 8 0.1344580 14.36450
[10,] 9 0.1506242 14.36244
[11,] 10 0.1370251 14.35905
[12,] 11 0.1480728 14.35788
[13,] 12 0.1388114 14.35645
[14,] 13 0.1463924 14.35583
[15,] 14 0.1400553 14.35521
[16,] 15 0.1452663 14.35489
[17,] 16 0.1409200 14.35461
[18,] 17 0.1445043 14.35446
[19,] 18 0.1415197 14.35433
[20,] 19 0.1439856 14.35425
[21,] 20 0.1419348 14.35419
[22,] 21 0.1436312 14.35416
[23,] 22 0.1422215 14.35413
[24,] 23 0.1433886 14.35411
[25,] 24 0.1424194 14.35410
[26,] 25 0.1432222 14.35409
[27,] 26 0.1425558 14.35408
[28,] 27 0.1431081 14.35408
[29,] 28 0.1426497 14.35408
[30,] 29 0.1430296 14.35408
[31,] 30 0.1427144 14.35407
[32,] 31 0.1429757 14.35407
[33,] 32 0.1427589 14.35407
[34,] 33 0.1429387 14.35407
[35,] 34 0.1427896 14.35407
[36,] 35 0.1429132 14.35407
[37,] 36 0.1428107 14.35407
[38,] 37 0.1428957 14.35407
[39,] 38 0.1428252 14.35407
[40,] 39 0.1428837 14.35407
[41,] 40 0.1428352 14.35407
Notice that the Newton-Raphson algorithm converged to the eighth decimal place within six iterations, while the gradient descent method had found only the first four decimal places after 40 iterations. Better programmers with more experience could wring much more power from gradient descent than I did.
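For comparison, here is a sketch of the production-grade route using stats::optim with the one-dimensional Brent method (the method optim's documentation recommends for single-parameter problems); it should land on the same estimate of roughly 0.1428571:
# maximize the log-likelihood directly; fnscale=-1 turns
# optim's minimizer into a maximizer, and Brent requires bounds
optim(par=0.5, fn=ll.geom, method="Brent", lower=0.01, upper=0.99,
      control=list(fnscale=-1))$par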
The MASS::anorexia dataset will help to illustrate some basic t-testing use cases. This dataset describes an experimental trial in which 72 anorexic young women were assigned to three groups: a control group receiving no treatment (“Cont”), a group receiving cognitive behavioral therapy (“CBT”), and a group receiving family therapy (“FT”).
library(MASS)
head(anorexia)
  Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1
90% confidence interval for the mean weight of incoming patients (automated and manual solutions):
t.test(x=anorexia$Prewt, conf.level=0.90)$conf.int
[1] 81.39044 83.42622
attr(,"conf.level")
[1] 0.9
n.pre <- length(anorexia$Prewt)
muhat.pre <- mean(anorexia$Prewt)
se.pre <- sd(anorexia$Prewt)/sqrt(n.pre)
muhat.pre + c(-1,1)*qt(p=0.95, df=n.pre-1)*se.pre
[1] 81.39044 83.42622
One-sided test of the null hypothesis that the mean weight of the incoming patients is as high as 84 pounds, against the alternative that it is lower:
t.test(x=anorexia$Prewt, alternative='less', mu=84)$p.value
[1] 0.005576362
pt(q=(muhat.pre - 84)/se.pre, df=n.pre-1)
[1] 0.005576362
Two-sample test of whether the mean post-treatment weight of the CBT group differed from that of the control group (assuming equal variances):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       alternative='two.sided', var.equal=TRUE)$p.value
[1] 0.01692976
n.cont <- sum(anorexia$Treat=='Cont')
muhat.cont <- mean(anorexia$Postwt[anorexia$Treat=='Cont'])
n.cbt <- sum(anorexia$Treat=='CBT')
muhat.cbt <- mean(anorexia$Postwt[anorexia$Treat=='CBT'])
var.pooled <- (var(anorexia$Postwt[anorexia$Treat=='Cont'])*(n.cont-1) +
               var(anorexia$Postwt[anorexia$Treat=='CBT'])*(n.cbt-1)) /
              (n.cont + n.cbt - 2)
se.pooled <- sqrt(var.pooled/n.cont + var.pooled/n.cbt)
2*(1-pt(q=(muhat.cbt - muhat.cont)/se.pooled, df=n.cont + n.cbt - 2))
[1] 0.01692976
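Note that t.test defaults to var.equal=FALSE, which runs the Welch test instead. As a sketch, the same comparison without the equal-variance assumption is one argument away (the p-value will differ slightly because Welch adjusts the degrees of freedom):
# Welch two-sample test: R's default, no equal-variance assumption
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       alternative='two.sided')$p.value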
Two-sample 90% confidence interval for how much more the CBT group weighed after treatment than the control group (assuming equal variances):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Postwt[anorexia$Treat=='Cont'],
       conf.level=0.90, var.equal=TRUE)$conf.int
[1] 1.473675 7.704044
attr(,"conf.level")
[1] 0.9
(muhat.cbt - muhat.cont) + c(-1,1)*qt(p=0.95, df=n.cbt+n.cont-2)*se.pooled
[1] 1.473675 7.704044
Matched-pairs 90% confidence interval measuring how much the CBT group gained during the treatment process. Note that this test requires each observation in the first group to be matched to a corresponding observation in the second group (as is true in a before-and-after study):
t.test(x=anorexia$Postwt[anorexia$Treat=='CBT'],
       y=anorexia$Prewt[anorexia$Treat=='CBT'],
       paired=TRUE, conf.level=0.90)$conf.int
[1] 0.6981979 5.3155952
attr(,"conf.level")
[1] 0.9
muhat.paired <- mean(anorexia$Postwt[anorexia$Treat=='CBT'] -
                     anorexia$Prewt[anorexia$Treat=='CBT'])
se.paired <- sd(anorexia$Postwt[anorexia$Treat=='CBT'] -
                anorexia$Prewt[anorexia$Treat=='CBT'])/sqrt(n.cbt)
muhat.paired + c(-1,1)*qt(p=0.95, df=n.cbt-1)*se.paired
[1] 0.6981979 5.3155952
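Equivalently, the paired interval is just a one-sample t-interval on the within-patient differences. This sketch (the gain.cbt name is mine, for illustration) should reproduce the interval above:
# one-sample t-interval on the post-minus-pre differences
gain.cbt <- anorexia$Postwt[anorexia$Treat=='CBT'] -
            anorexia$Prewt[anorexia$Treat=='CBT']
t.test(x=gain.cbt, conf.level=0.90)$conf.int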
Sticking with the MASS::anorexia dataset, let’s define a “success” as a patient who weighed more post-treatment than pre-treatment.1
anorexia$Success <- 1*(anorexia$Postwt > anorexia$Prewt)
Now we can examine the proportions of success within and between the different treatment groups. For example:
Two-sided 80% confidence interval for the one-sample proportion of successes in the family therapy group:
x.ft <- sum(anorexia$Success[anorexia$Treat=='FT'])
n.ft <- length(anorexia$Success[anorexia$Treat=='FT'])
prop.test(x=x.ft, n=n.ft, conf.level=0.80)$conf.int
[1] 0.5819867 0.8909870
attr(,"conf.level")
[1] 0.8
One-sided, one-sample proportion test of whether the family therapy group achieved at least a 60% success rate. Note that R by default applies a “continuity correction,” which improves upon the CLT approximation for small sample sizes. By disabling this continuity correction we can manually match R’s results using the CLT:
prop.test(x=x.ft, n=n.ft, p=0.60, alternative='greater')$p.value
[1] 0.1274205
prop.test(x=x.ft, n=n.ft, p=0.60, alternative='greater', correct=FALSE)$p.value
[1] 0.08284192
1-pnorm((x.ft/n.ft - 0.6)/sqrt(0.6*0.4/n.ft))
[1] 0.08284192
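For a sample this small you could also sidestep the CLT entirely: stats::binom.test computes an exact p-value from the binomial distribution. A one-line sketch:
# exact one-sided binomial test; no normal approximation involved
binom.test(x=x.ft, n=n.ft, p=0.60, alternative='greater')$p.value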
Two-sided, two-proportion test of whether the family therapy group and the CBT group have different success rates:
x.cbt <- sum(anorexia$Success[anorexia$Treat=='CBT'])
prop.test(x=c(x.ft,x.cbt), n=c(n.ft,n.cbt),
          alternative='two.sided')$p.value
[1] 0.4965428
prop.test(x=c(x.ft,x.cbt), n=c(n.ft,n.cbt),
          alternative='two.sided', correct=FALSE)$p.value
[1] 0.3145389
p.combined <- (x.ft + x.cbt)/(n.ft + n.cbt)
se.combined <- sqrt(p.combined*(1-p.combined)*(1/n.ft+1/n.cbt))
2*(1-pnorm((x.ft/n.ft - x.cbt/n.cbt)/se.combined))
[1] 0.3145389
1. This is a low bar; we might normally want to set a minimum threshold for weight gain.