
A/B Testing Email Campaigns: A Statistical Guide

Alex Kim
December 15, 2024
14 min read

A/B Testing Fundamentals

A/B testing in email compares two variants of an email to determine which performs better against a specific metric. Unlike web A/B testing, email tests have unique constraints: you get one send per subscriber, sample sizes are fixed by your list size, and external factors (time of day, day of week) can skew results.

A proper email A/B test requires:

  • A clear hypothesis ("Subject line with a number will increase open rates by 10%")
  • A single variable changed between variants (never test multiple things at once)
  • A predetermined success metric (open rate, click rate, conversion rate)
  • A sufficient sample size for statistical significance
  • A predetermined test duration
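Random assignment is the unstated prerequisite behind all of these: each subscriber should land in a variant by chance and stay there for the whole test. A minimal sketch of deterministic, hash-based assignment (the helper name and 50/50 scheme are illustrative, not from any particular ESP):

```python
import hashlib

# Hypothetical helper: the function name and 50/50 split are illustrative.
def assign_variant(subscriber_id, test_name, variants=("A", "B")):
    """Deterministically bucket a subscriber for a given test.

    Hashing the (test, subscriber) pair makes assignment effectively
    random across the list but stable: the same subscriber always sees
    the same variant, and different tests split the list independently.
    """
    key = f"{test_name}:{subscriber_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

print(assign_variant("user123", "subject_line_test"))
```

Stable assignment also matters if a send is paused and resumed, or if you later join opens back to variants for analysis.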

What to Test (and What Not To)

High-impact tests (test these first):

  • Subject lines — highest impact on open rates, easiest to test
  • Send time — can shift open rates by 10-30%
  • CTA text and placement — directly affects click-through rates
  • From name — "Jane at Company" vs "Company" vs "The Company Team"

Medium-impact tests:

  • Email length (concise vs detailed)
  • Personalization (name in subject, dynamic content blocks)
  • Preheader text
  • Number of links/CTAs

Low-impact tests (usually not worth the effort):

  • Font choices or colors within acceptable ranges
  • Minor copy tweaks that do not change the core message
  • Image placement when the core layout is the same

Getting Your Sample Size Right

The most common A/B testing mistake is declaring a winner too early. To detect a meaningful difference, you need sufficient sample size.

For email open-rate tests, here are rough minimums per variant (assuming a ~25% baseline open rate, 95% confidence, and 80% power):

  • Detect 5% relative difference: ~19,000 subscribers per variant
  • Detect 10% relative difference: ~4,900 subscribers per variant
  • Detect 20% relative difference: ~1,300 subscribers per variant

Small relative lifts demand surprisingly large lists. Use a sample size calculator with your own baseline metric and desired power; the numbers shift quickly as the baseline changes.
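The per-variant numbers come from the standard two-proportion z-test formula; a minimal sketch (exact results depend on the power and confidence you choose, assumed here to be 80% and 95%):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided, two-proportion z-test.

    baseline: control open rate (e.g. 0.25)
    relative_lift: smallest lift worth detecting (e.g. 0.10 for +10%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    delta = abs(p2 - p1)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return ceil(n)

for lift in (0.05, 0.10, 0.20):
    n = sample_size_per_variant(0.25, lift)
    print(f"{lift:.0%} relative lift: ~{n:,} subscribers per variant")
```

Note that halving the detectable lift roughly quadruples the required sample, which is why small lists should test big swings rather than subtle tweaks.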

If your list is too small for statistical significance, consider:

  • Testing larger changes that produce bigger effects
  • Accumulating results across multiple sends (meta-analysis)
  • Using Bayesian methods that work better with small samples
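The Bayesian option can be sketched with a simple Beta-Binomial model: put a uniform Beta(1, 1) prior on each variant's open rate, then estimate the probability that B beats A by sampling from the two posteriors (the helper and the uniform prior are assumptions for illustration, not a library API):

```python
import random

# Hypothetical helper; the Beta(1, 1) uniform priors are an assumption.
def prob_b_beats_a(opens_a, sends_a, opens_b, sends_b,
                   draws=20000, seed=42):
    """Monte Carlo estimate of P(open rate of B > open rate of A).

    With a Beta(1, 1) prior and binomial opens, each variant's
    posterior is Beta(1 + opens, 1 + sends - opens); we sample both
    posteriors and count how often B comes out ahead.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + opens_a, 1 + sends_a - opens_a)
        rate_b = rng.betavariate(1 + opens_b, 1 + sends_b - opens_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

print(prob_b_beats_a(opens_a=1250, sends_a=5000, opens_b=1400, sends_b=5000))
```

The output reads as "probability B is better", which is often easier to act on with small lists than a pass/fail p-value: you can ship B at, say, 90% probability rather than waiting for p < 0.05.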

Interpreting Results

Key metrics to evaluate:

  • Statistical significance — is p < 0.05? If not, the observed difference may be due to chance.
  • Effect size — a statistically significant 0.1% improvement may be real but is rarely worth acting on. Focus on differences large enough to matter.
  • Confidence interval — look at the range of plausible true values, not just the point estimate.
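All three metrics fall out of a standard two-proportion z-test. A minimal sketch (normal approximation, so it assumes reasonably large sends; the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(opens_a, sends_a, opens_b, sends_b, alpha=0.05):
    """Two-sided z-test for a difference in open rates.

    Returns (p_value, observed difference, confidence interval).
    Uses the normal approximation, so it assumes large-ish sends.
    """
    p_a, p_b = opens_a / sends_a, opens_b / sends_b
    diff = p_b - p_a
    # Pooled standard error for the hypothesis test.
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference.
    se = sqrt(p_a * (1 - p_a) / sends_a + p_b * (1 - p_b) / sends_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return p_value, diff, ci

p_value, diff, ci = two_proportion_test(1250, 5000, 1400, 5000)
print(f"p={p_value:.4f}, diff={diff:.1%}, 95% CI=({ci[0]:.1%}, {ci[1]:.1%})")
```

Reporting the interval alongside the p-value keeps the effect-size question front and center: a significant result whose interval includes only tiny lifts is not worth rolling out.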

Be cautious about:

  • Novelty effects — a new approach may spike initially but normalize over time
  • Segment differences — a winning variant for one segment may lose for another
  • Downstream metrics — higher open rates do not always translate to more conversions

Common Pitfalls

  • Peeking at results early — checking daily and stopping when you see a winner inflates false positive rates dramatically
  • Testing too many things at once — each added variable multiplies the number of variants, and every variant needs its own full sample
  • Ignoring seasonality — results from Black Friday week do not generalize to normal periods
  • Not documenting learnings — keep a testing log so you build institutional knowledge
  • Over-optimizing for opens — clickbait subject lines win open rate tests but hurt long-term engagement
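The peeking pitfall is easy to demonstrate by simulation: run A/A tests in which no true difference exists, check significance after every daily batch, and stop at the first "winner". The false positive rate climbs well above the nominal 5% (a sketch; the parameters are illustrative):

```python
import random
from math import sqrt

# Illustrative simulation: both arms share the same true open rate,
# so every "significant" result is a false positive.
def peeking_false_positive_rate(n_tests=300, daily_sends=500, days=10,
                                true_rate=0.25, seed=7):
    """Fraction of A/A tests wrongly declared significant when we
    check after every daily batch and stop at the first p < 0.05."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        opens_a = opens_b = sends = 0
        for _ in range(days):
            opens_a += sum(rng.random() < true_rate for _ in range(daily_sends))
            opens_b += sum(rng.random() < true_rate for _ in range(daily_sends))
            sends += daily_sends
            p_a, p_b = opens_a / sends, opens_b / sends
            pooled = (opens_a + opens_b) / (2 * sends)
            se = sqrt(pooled * (1 - pooled) * 2 / sends)
            if se > 0 and abs(p_b - p_a) / se > 1.96:  # |z| test at alpha = 0.05
                false_positives += 1
                break
    return false_positives / n_tests

print(peeking_false_positive_rate())
```

With ten daily looks, the stop-at-first-significance rate typically lands in the 15-25% range rather than 5%, which is why the test duration should be fixed before the send goes out.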