
Cross-Validation Sample Size

Minimum sample sizes for reliable k-fold cross-validation, accounting for stratification and class imbalance. Based on Varoquaux & Colliot 2023, Kohavi 1995, and scikit-learn best practices.

Concept Fundamentals
k-Fold CV: cross-validation method that splits the data into k partitions
Rule of Thumb: minimum sample size of 10 × features × classes
Stratified: balanced folds that preserve the class ratio in each fold
Application: model evaluation and generalization estimation

Sample Size for Reliable CV: rule of thumb n ≥ 10 × features × classes

Why This ML Metric Matters

Why: Proper sample size ensures stable CV estimates and meaningful confidence intervals. Too few samples lead to high variance and unreliable model selection.

How: The calculator applies the 10×f×c rule, adjusts for model complexity, enforces stratification requirements for imbalanced data, and estimates confidence interval width.
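A minimal sketch of that logic in Python: the low/medium/high multipliers follow the values listed under How It Works, while the stratification floor and the normal-approximation confidence interval are assumptions chosen to mirror the example output shown further down, not the calculator's exact code.

```python
import math

# Assumed formulas: a rough re-implementation of the steps described on this page.
COMPLEXITY_MULTIPLIER = {"low": 1.0, "medium": 1.5, "high": 2.0}

def cv_sample_size(features, classes, k=5, accuracy=0.85,
                   complexity="medium", imbalance_ratio=1, confidence=0.95):
    """Estimate the minimum sample size for reliable k-fold cross-validation."""
    rule_of_thumb = 10 * features * classes                    # n >= 10 * f * c
    recommended = math.ceil(rule_of_thumb * COMPLEXITY_MULTIPLIER[complexity])

    # Stratification floor (assumption): at least 2 minority samples per fold,
    # where the minority class makes up 1/(1 + ratio) of the data.
    minority_fraction = 1 / (1 + imbalance_ratio)
    stratification_floor = math.ceil(2 * k / minority_fraction)
    recommended = max(recommended, stratification_floor)

    per_fold = recommended // k                                 # samples in each test fold

    # Normal-approximation CI half-width for an accuracy estimate on n samples.
    z = 2.576 if confidence >= 0.99 else 1.96
    ci_half_width = z * math.sqrt(accuracy * (1 - accuracy) / recommended)

    return {"rule_of_thumb": rule_of_thumb, "recommended": recommended,
            "per_fold": per_fold, "ci_pct": round(100 * ci_half_width, 2)}

print(cv_sample_size(features=50, classes=2, k=5))
# {'rule_of_thumb': 1000, 'recommended': 1500, 'per_fold': 300, 'ci_pct': 1.81}
```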



Inputs

Features (f in the rule n ≥ 10×f×c)
Expected accuracy (e.g., 0.85 for 85%)
Classes (c in the rule n ≥ 10×f×c)
Folds (k in k-fold cross-validation)
Class imbalance (1:ratio, e.g., 10 for 1:10)
Example Output (5-fold CV, 50 features, 2 classes)
Recommended Min Samples: 1500
Rule of Thumb (10×f×c): 1000
Min per Fold: 300
Stratification Min/Class: 60
CI width @ 95%: ±1.81%

Sample Size vs Accuracy Confidence Band

Stratification Visualization (Samples per Fold)

For educational and informational purposes only. Verify with a qualified professional.


📋 Key Takeaways

  • Rule of thumb: n ≥ 10 × features × classes for reliable CV (Varoquaux & Colliot 2023)
  • Kohavi 1995: stratified 10–20 fold CV recommended for accuracy estimation
  • Stratification requires enough samples per class in each fold, at least 2 per class per fold
  • Class imbalance: minority class limits stratification; oversample or use stratified sampling
  • Small samples → large error bars; Varoquaux: ±10% error bars common with n=100

💡 Did You Know

📐 Varoquaux 2017: with 100 samples, CV error bars can be ±10% — standard error across folds underestimates true variance
🎯 Kohavi 1995: 10-fold stratified CV often outperforms leave-one-out for model selection on real datasets
🧬 High-dimensional genomics (p>>n): rule of thumb breaks down; use regularization, nested CV, or holdout
⚖️ scikit-learn StratifiedKFold preserves class proportions in each fold — requires min samples per class
🎲 Repeated k-fold CV (e.g., 5×5) reduces variance compared to single 5-fold — at cost of more compute
🩺 Medical ML: rare disease (1:100) needs thousands of samples for minority class representation in CV
📊 Confidence level 99% vs 95%: widens CI by ~30% — requires more samples for same precision
🤖 Complex models (deep nets) need more samples than simple models (logistic regression) for stable CV

📖 How It Works

1. Rule of thumb (10×f×c)

Minimum samples ≈ 10 × number of features × number of classes. Ensures enough data per dimension and per class.

2. Min per fold

Each test fold gets n/k samples and each training fold gets n(k−1)/k. The test fold needs enough samples for a meaningful metric; the training fold needs enough data for learning.

3. Stratification

Stratified CV preserves class proportions. Requires ≥2 samples per class per fold. With imbalance, minority class limits feasibility.
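A quick sketch with scikit-learn's StratifiedKFold on a hypothetical 1:10 dataset, checking the per-fold class counts:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical 1:10 imbalanced dataset: 50 positives, 500 negatives.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 500)
X = rng.normal(size=(len(y), 5))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps the ~1:10 ratio: about 10 minority samples out of 110.
    print(f"fold {fold}: class counts = {np.bincount(y[test_idx])}")
```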

4. Confidence intervals

Accuracy is a proportion; CI width ∝ 1/√n. Higher confidence (99% vs 95%) widens the interval.
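A sketch of the normal-approximation interval this step likely refers to (Wilson or Clopper-Pearson intervals would differ slightly for small n):

```python
import math

# Half-width of a confidence interval for an accuracy estimate on n samples.
def ci_half_width(accuracy, n, z=1.96):
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

print(round(100 * ci_half_width(0.85, 1500), 2))  # ~1.81 (%), the example output above
print(round(100 * ci_half_width(0.85, 100), 2))   # ~7.0 (%): small n, wide interval
```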

5. Model complexity

Complex models (high capacity) need more samples for stable estimates. Low=1×, medium=1.5×, high=2× multiplier.

🎯 Expert Tips

Use stratification when possible

Preserves class balance across folds. Essential for imbalanced classification.

Repeated CV for stability

5×5 or 10×10 repeated k-fold reduces variance in small-sample settings.
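For example, with scikit-learn's RepeatedStratifiedKFold (the dataset and model here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative small dataset; 5x5 repeated stratified CV yields 25 scores instead of 5.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} splits")
```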

Rare classes need more data

With 1:100 imbalance the minority class is only ~1% of the data, so stratification needs roughly 50× more total samples than the balanced case for the same per-fold minority count.

Nested CV for model selection

Outer loop for evaluation, inner for hyperparameters — avoids optimistic bias.
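A minimal nested-CV sketch with scikit-learn (the estimator and parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop tunes C; outer loop evaluates the tuned pipeline on unseen folds.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```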

⚖️ Sample Size by Scenario

Scenario | Features | Classes | Min n (rule) | Notes
Binary, low-dim | 20 | 2 | 400 | Standard ML benchmark
Multi-class (10) | 100 | 10 | 10,000 | Image classification
Genomics p>>n | 5,000 | 2 | 100,000 | Use regularization, holdout
Rare disease 1:20 | 30 | 2 | 600+ | Stratification limits
NLP sentiment | 500 | 3 | 15,000 | Medium complexity

❓ Frequently Asked Questions

Why n ≥ 10 × features × classes?

Rule of thumb from ML practice and Varoquaux: ensures enough samples per dimension and per class for stable estimates. For high-dimensional data (p>>n), this often cannot be met — use regularization and holdout.

When to use stratification?

Always for classification when classes are imbalanced. StratifiedKFold preserves class proportions. Requires at least 2 samples per class per fold.

How many folds (k) should I use?

Kohavi 1995 recommends 10–20 folds for accuracy estimation; 5-fold is common for model selection. More folds mean larger training sets (less pessimistic bias) but higher variance and more compute; fewer folds give larger, less noisy test folds and are cheaper, at the cost of a more pessimistic bias.
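A quick way to see the trade-off on your own data (illustrative dataset and model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Compare the spread of per-fold scores for different k on the same data.
for k in (5, 10, 20):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(f"k={k:2d}: mean {scores.mean():.3f}, std across folds {scores.std():.3f}")
```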

What if I have very few samples?

Consider leave-one-out (LOO) for regression, or repeated k-fold. Be aware: LOO has high variance. Bootstrap or nested CV can help. Report confidence intervals.

How does class imbalance affect CV?

Minority class limits stratification: every fold needs enough minority samples. With 1:100 imbalance the minority class is only ~1% of the data (versus 50% when balanced), so roughly 50× more total samples are needed for the same per-fold minority count.
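A back-of-the-envelope sketch of that arithmetic (a simplification that ignores rounding across folds):

```python
import math

# Total samples needed so the minority class contributes at least
# `min_per_fold` samples to each of k test folds, given a 1:ratio imbalance.
def total_needed(min_per_fold, k, ratio):
    minority_total = min_per_fold * k
    return math.ceil(minority_total * (1 + ratio))

print(total_needed(min_per_fold=2, k=5, ratio=1))    # balanced: 20 samples
print(total_needed(min_per_fold=2, k=5, ratio=100))  # 1:100: 1010 samples (~50x more)
```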

What is nested cross-validation?

Outer loop evaluates model; inner loop selects hyperparameters. Prevents optimistic bias from tuning on same data used for evaluation. Computationally expensive.

When is holdout better than CV?

Very large datasets (millions), or when compute is limited. Single train/test split is simpler. CV preferred for small/medium data.

How does model complexity affect sample size?

Complex models (deep nets, high-capacity) need more samples for stable CV estimates. Simple models (logistic regression) can work with fewer samples.

📊 Cross-Validation by the Numbers

10×f×c
Rule of Thumb
10–20
Kohavi Recommended Folds
±10%
Error Bars @ n=100
≥2
Min Samples per Class per Fold (Stratification)

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. The 10×f×c rule is a heuristic; actual requirements depend on problem difficulty, model choice, and noise. For high-dimensional (p>>n) or highly imbalanced data, consult domain-specific guidelines. Always report confidence intervals and consider nested CV for model selection. Verify with scikit-learn or your ML framework.


Related Calculators