Prevalence Study sample size calculator

statistics

shiny

blog

Published

February 20, 2025

We often want to estimate the prevalence in a population (i.e., the proportion of individuals with a certain condition). For example, if we want to estimate the prevalence of a rare disease in a population, we need to determine how many individuals we need to sample to obtain a reliable estimate.

The required sample size is largely influenced by how precise we want our estimate. The more accurate we want our estimate to be, the more individuals we need to sample. We report precision as how wide the 95% confidence interval will be from our prevalence estimate. If the precision is 0.05, (5%), the confidence interval will be 5% above and below the prevalence estimate. For instance, for a disease with 50% prevalence, and a precision of 5%, we need a sample size of ~ 385. Our estimated prevalence will be 50% (95% CI 45% to 55%).

The required sample size for estimating a proportion (prevalence) is given by:

\[ n = \frac{z^2 \; p \; (1-p)}{d^2} \]

where:

\(p\) is the anticipated prevalence,
\(d\) is the desired margin of error,
\(z\) is the z-score corresponding to a confidence level of \((1-\alpha/2)\) for a significance level \(\alpha\).

The estimated 95% confidence interval for evenly distributed (CI) is given by:

\[ \text{CI} = p \pm z \times \sqrt{\frac{p(1-p)}{n}} \]

When a disease is rare, this formula does not work well as the sample size recommende becomes to small. For instance, collecting info on n=50 patients with an expected prevalence of 1% means we may not find any patients with this rare disease. The calculator therefore uses an iterative approach to find the sample size required to achieve the desired margin of error when a prevalence is less than 10%.

The Copper-Pearson approximation for a confidence interval is used for rare diseases:

\[ \text{CI} = \left( \frac{x}{n} - \frac{z^2}{2n^2}, \frac{x}{n} + \frac{z^2}{2n^2} \right) \]

where \(x\) is the number of observed events.

This formula is repeated for differnt smaple sizes until the desired precision is achieved.

Desirable Prevalence and Precision Estimates:
A good starting point for a new researcher is to use prevalence estimates based on previous studies or pilot data. For instance, if studying a rare condition, a prevalence of around 1% (i.e., \(p = 0.01\)) might be expected, while more common conditions may have prevalences closer to 10% or even 50%.

The margin of error, \(d\), reflects how precise you want your prevalence estimate to be. A margin of error of ±5% (i.e., \(d = 0.05\)) is common when a high degree of precision is needed. However, if resources are limited, a wider margin (e.g., ±10% or \(d = 0.10\)) may be acceptable.

Keep in mind: - Effect of Prevalence on Sample Size: The required sample size is largest when \(p = 0.5\) (i.e., 50%), because \(p(1-p)\) is maximized at this value. If the anticipated prevalence is lower or higher than 50%, the sample size required for the same margin of error decreases. - Effect of Precision on Sample Size: A smaller margin of error (i.e., a more precise estimate) increases the required sample size. Conversely, if you are willing to accept a larger margin of error, the sample size can be reduced.

Practical Tip:
When searching through literature, pay particular attention to the methods of data collection int he study. A study that reports a low prevalence may used very broad measures, if you are sampling from a population that has been screened or referred, your prevalence may be higher than the general population. Sometimes, prevalence estimates may be in subgroup analyses in a study. If there is absolutely no data available , consider a small pilot study or some educated guesses.

Use this calculator and play with the parameters to help you find a realistic sample size.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| echo: false
#| messages: false
#| viewerHeight: 800
library(shiny)
library(shinythemes)
library(shinyjs)

# Precision-based sample size calculation (for estimation)
precision_prevalence <- function(prevalence, alpha = 0.05, d = 0.05, threshold = 0.09) {
    # For larger prevalences, use the normal approximation
    if (prevalence >= threshold) {
        z <- qnorm(1 - alpha / 2)
        n <- (z^2 * prevalence * (1 - prevalence)) / (d^2)
        return(ceiling(n))
    } else {
        # For very small prevalences, use an iterative approach based on exact binomial CI (Clopper-Pearson)
        # Start with n at least large enough so that we expect at least one event.
        n <- ceiling(1 / prevalence)
        found <- FALSE
        while (!found) {
            # Ensure at least one observed event
            x <- max(1, round(n * prevalence))
            ci <- binom.test(x, n, conf.level = 1 - alpha)$conf.int
            # Compute the half-width of the confidence interval (we use the larger deviation from p)
            half_width <- max(ci[2] - prevalence, prevalence - ci[1])
            if (half_width <= d) {
                found <- TRUE
            } else {
                n <- n + 1
            }
        }
        return(n)
    }
}

ui <- fluidPage(
    theme = shinytheme("cyborg"),
    useShinyjs(),
    tags$style(HTML("
    #protocolText {
      font-family: 'Times New Roman', Times, serif;
      white-space: pre-wrap;
    }
    .copy-button {
      margin-bottom: 10px;
    }
  ")),
    tags$script(HTML("
    function copyProtocol() {
      var protocol = document.getElementById('protocolText').innerText;
      var tempInput = document.createElement('textarea');
      tempInput.value = protocol;
      document.body.appendChild(tempInput);
      tempInput.select();
      document.execCommand('copy');
      document.body.removeChild(tempInput);
      alert('Protocol text copied to clipboard!');
    }
  ")),
    titlePanel("Prevalence Study Sample Size & Confidence Interval Calculator"),
    sidebarLayout(
        sidebarPanel(
            numericInput("prev", "Anticipated Prevalence (p):", value = 0.01, min = 0.001, max = 1, step = 0.001),
            helpText("Enter prevalence as a decimal (e.g., 0.01 = 1%, 0.10 = 10%)."),
            numericInput("alpha", "Significance Level (α):", value = 0.05, min = 0.001, max = 0.1, step = 0.005),
            numericInput("d", "Desired Margin of Error (d):", value = 0.05, min = 0.001, max = 0.1, step = 0.001),
            hr(),
            helpText("The precision is how wide the confidence interval should be around the prevalence estimate. (e.g. 0.05 is ±5%)"),
            hr(),
            numericInput("sims", "Number of Simulations:", value = 100, min = 10, max = 10000, step = 10),
            actionButton("runSim", "Run Simulations")
        ),
        mainPanel(
            h3(textOutput("resultTitle")),
            uiOutput("resultsUI"),
            h4("Simulation Results:"),
            plotOutput("simPlot"),
            hr(),
            tags$button("Copy Protocol Text", class = "copy-button", onclick = "copyProtocol()"),
            h4("Protocol Text (Simplified)"),
            uiOutput("protocolUI")
        )
    )
)

server <- function(input, output, session) {
    # Calculate sample size using the precision-based formula
    calc_size <- reactive({
        precision_prevalence(input$prev, alpha = input$alpha, d = input$d)
    })

    # Calculate the Clopper-Pearson (exact binomial) confidence interval
    calc_binom_CI <- reactive({
        n <- calc_size()
        # Calculate the expected number of events; ensure at least one event is considered.
        x <- max(1, round(n * input$prev))
        binom.test(x, n, conf.level = 1 - input$alpha)$conf.int-> conf_intervals
        c(lower = conf_intervals[1], upper = conf_intervals[2])
    })

    # Calculate the normal approximation confidence interval
    calc_normal_CI <- reactive({
        n <- calc_size()
        p <- input$prev
        z <- qnorm(1 - input$alpha / 2)
        se <- sqrt(p * (1 - p) / n)
        lower <- p - z * se
        upper <- p + z * se
        c(lower = lower, upper = upper)
    })

    output$resultTitle <- renderText({
        "Your Estimated Sample Size & 95% Confidence Intervals"
    })

    output$resultsUI <- renderUI({
        n_precision <- calc_size()
        binom_CI <- calc_binom_CI()
        normal_CI <- calc_normal_CI()
        tagList(
            tags$p(paste(
                "For an anticipated prevalence of", input$prev,
                "with a desired margin of error of ±", input$d * 100, "% at a significance level of", input$alpha,
                "the required sample size is:", n_precision
            )),
            tags$p(paste(
                "The Clopper-Pearson (exact binomial) 95% CI is approximately:",
                paste0(round(binom_CI["lower"] * 100, 2), "% to ", round(binom_CI["upper"] * 100, 2), "%")
            )),
            tags$p(paste(
                "The normal approximation 95% CI is approximately:",
                paste0(round(normal_CI["lower"] * 100, 2), "% to ", round(normal_CI["upper"] * 100, 2), "%")
            ))
        )
    })

    # Simulation: simulate binomial samples using the computed sample size and record the half-width of the CI
    simData <- eventReactive(input$runSim, {
        n <- calc_size()
        p <- input$prev
        alpha <- input$alpha
        sims <- input$sims
        # Storage for half-widths of CIs from the binom test
        half_widths <- numeric(sims)

        withProgress(message = "Running simulations...", value = 0, {
            for (i in 1:sims) {
                # Simulate observed events using a binomial draw
                x <- rbinom(1, n, p)
                # Compute the exact Clopper-Pearson confidence interval
                ci <- binom.test(x, n, conf.level = 1 - alpha)$conf.int
                # Compute the half-width (take the maximum deviation from p)
                half_widths[i] <- max(p - ci[1], ci[2] - p)
                incProgress(1 / sims)
            }
        })

        return(half_widths)
    })

    output$simPlot <- renderPlot({
        req(simData())
        hist(simData(),
            main = "Distribution of CI Half-Widths",
            xlab = "Half-Width of 95% CI", col = "skyblue", border = "white"
        )
    })

    # Simple protocol text (for copying) in plain text with simple math notation
    output$protocolUI <- renderUI({
        n_val <- calc_size()
        binom_CI <- calc_binom_CI()
        normal_CI <- calc_normal_CI()
        simple_text <- paste(
            "Based on an anticipated prevalence (p) of ", input$prev, " and a desired margin of error (d) of ±", input$d * 100, "%,",
            "with a significance level (α) of ", input$alpha, ", the required sample size (n) is calculated using the formula:\n",
            "  n = (z^2 * p * (1-p)) / d^2\n",
            "where z is the z-score corresponding to a confidence level of (1 - α/2).\n",
            "For these parameters, the calculated sample size is ", n_val, ".\n",
            "This sample size is expected to yield approximate 95% confidence intervals for the prevalence as follows:\n",
            "  - Clopper-Pearson (exact): ", paste0(round(binom_CI["lower"] * 100, 2), "% to ", round(binom_CI["upper"] * 100, 2), "%\n"),
            "  - Normal approximation: ", paste0(round(normal_CI["lower"] * 100, 2), "% to ", round(normal_CI["upper"] * 100, 2), "%\n")
        )
        tags$pre(simple_text, id = "protocolText")
    })
}

shinyApp(ui = ui, server = server)