Logistic Regression Sample Size Calculator

statistics

shiny

blog

Published

February 20, 2025

A logistic regression has the output of Odds Ratios (OR) and is usually used when the outcome is binary (e.g. yes/no, success/failure). We usually use a logistic regression model to look at the associations between a binary outcome and a set of explanatory variables ( ie. predictors, covariates, independent variables).

To inform your sample size for logistic regression, you need some guesswork ideally guided by relevant literature. In this context, you first need an estimate of the outcome’s prevalence in your study population.

For example, imagine you plan to study factors associated with HIV acquisition among substance abusers. Suppose that based on prior studies you estimate the prevalence of HIV infection in this group to be about 10%—that is, 10% of substance abusers are HIV positive.

Now, assume you want to examine five predictors:

Number of sexual partners
Intravenous drug use
Age
Sex
Educational status (e.g., whether the individual has completed matriculation)

A common rule-of-thumb (Peduzzi et al., 1996) is that there should be at least 10 outcomes (events) per predictor. This “10 Events Per Predictor (EPV)” guideline is used to help ensure that your logistic regression model is stable, less than 10 events per predictor may result in biased estimates in alternating directions.

The sample size calculation is based on the formula:

\[ n=EPV×kp \] \[ n = \frac{\text{EPV} \times k}{p} \] \[ n=pEPV×k \] Where:

\(n\) = Total required sample size
\(EPV\)= 10 (the recommended minimum number of events per predictor)
\(k\) = Number of predictors (here, 5)
\(p\) = Outcome prevalence (here, 0.10 which is a prevalence of 10% ie. if 10 people in your sample of 100 people have HIV, the prevalence of HIV in your sample is 10%)

Plugging in these values:

\[ n=10×50.10=50 \] \[ n = \frac{10 \times 5}{0.10} = 500 \] \[ n=0.1010×5=500 \]

This calculation suggests you would need approximately 500 participants to expect around 50 HIV-positive cases—meeting the 10 events per predictor guideline.

Reference:
Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., & Feinstein, A.R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373–1379.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| echo: false 
#| messages: false
#| viewerHeight: 800

library(shiny)
library(shinythemes)
library(shinyjs)

ui <- fluidPage(
  theme = shinytheme("cyborg"),
  useShinyjs(),
  tags$style(HTML("
    #explanationText {
      font-family: 'Times New Roman', Times, serif;
      white-space: pre-wrap;
    }
    .copy-button {
      margin-bottom: 10px;
    }
  ")),
  tags$script(HTML("
    function copyExplanation() {
      var explanation = document.getElementById('explanationText').innerText;
      var tempInput = document.createElement('textarea');
      tempInput.value = explanation;
      document.body.appendChild(tempInput);
      tempInput.select();
      document.execCommand('copy');
      document.body.removeChild(tempInput);
      alert('Explanation copied to clipboard!');
    }
  ")),
  
  titlePanel("Sample Size Calculator for Logistic Regression"),
  sidebarLayout(
    sidebarPanel(
      numericInput("predictors", "Number of Predictors (k):", value = 5, min = 1, step = 1),
      numericInput("p", "Anticipated Outcome Prevalence (p):", value = 0.1, min = 0.001, max = 1, step = 0.01),
      helpText("Note: Enter the prevalence as a decimal (e.g., 0.1 = 10%, 0.25 = 25%, 0.5 = 50%)."),
      numericInput("epv", "Events per Predictor (EPV):", value = 10, min = 1, step = 1)
    ),
    mainPanel(
      h3(textOutput("sampleSize")),
      tags$button("Copy Explanation", class = "copy-button", onclick = "copyExplanation()"),
      uiOutput("explanationUI"),
      plotOutput("distPlot")
    )
  )
)

server <- function(input, output, session) {
  
  output$sampleSize <- renderText({
    k <- input$predictors
    p <- input$p
    epv <- input$epv
    
    # Required number of events = EPV * k, so total sample size n = (EPV * k) / p
    n <- (epv * k) / p
    
    paste("Required total sample size (approx):", round(n))
  })
  
  output$explanationUI <- renderUI({
    k <- input$predictors
    p <- input$p
    epv <- input$epv
    n <- (epv * k) / p
    
    explanation <- paste0(
      "Sample Size Calculation for Logistic Regression:\n\n",
      "    n = (EPV × k) / p\n\n",
      "Where:\n",
      "  EPV = Events per Predictor = ", epv, "\n",
      "  k   = Number of Predictors = ", k, "\n",
      "  p   = Anticipated outcome prevalence = ", p, "\n\n",
      "Thus, to ensure at least ", epv, " events per predictor,\n",
      "a total sample size of approximately ", round(n), " is required."
    )
    
    tags$pre(explanation, id = "explanationText")
  })
  
  output$distPlot <- renderPlot({
    p <- input$p
    k <- input$predictors
    epv <- input$epv
    n <- round((epv * k) / p)
    
    # Simulate the number of events in a study with sample size n
    set.seed(123)
    events <- rbinom(1000, size = n, prob = p)
    
    hist(events, breaks = 30, col = "lightgreen",
         main = "Simulated Distribution of Events",
         xlab = "Number of Events", border = "white")
  })
}

shinyApp(ui = ui, server = server)