Cross-Validation Sample Size
Minimum sample sizes for reliable k-fold cross-validation with stratification and class imbalance. Based on Varoquaux & Colliot 2023, Kohavi 1995, and scikit-learn best practices.
Why This ML Metric Matters
Why: Proper sample size ensures stable CV estimates and meaningful confidence intervals. Too few samples lead to high variance and unreliable model selection.
How: The calculator applies the 10×f×c rule, adjusts for model complexity, enforces stratification requirements for imbalanced data, and estimates confidence interval width.
Cross-Validation Sample Size Calculator
Minimum samples for reliable k-fold CV, with stratification and class-imbalance checks. Based on Varoquaux & Colliot 2023, Kohavi 1995, and scikit-learn guidance.
Inputs
Sample Size vs Accuracy Confidence Band
Stratification Visualization (Samples per Fold)
For educational and informational purposes only. Verify with a qualified professional.
🤖 AI & ML Facts
Varoquaux 2017: with 100 samples, CV error bars can be ±10% — standard error across folds underestimates true variance
— Varoquaux
Kohavi 1995: 10-fold stratified CV often outperforms leave-one-out for model selection on real datasets
— Kohavi
High-dimensional genomics (p>>n): rule of thumb breaks down; use regularization, nested CV, or holdout
— ML practice
scikit-learn StratifiedKFold preserves class proportions in each fold — requires min samples per class
— scikit-learn
📋 Key Takeaways
- Rule of thumb: n ≥ 10 × features × classes for reliable CV (Varoquaux & Colliot 2023)
- Kohavi 1995: stratified 10–20 fold CV recommended for accuracy estimation
- Stratification requires enough samples per class in each fold — at least 2 per class per fold
- Class imbalance: minority class limits stratification; oversample or use stratified sampling
- Small samples → large error bars; Varoquaux: ±10% error bars common with n=100
📖 How It Works
1. Rule of thumb (10×f×c)
Minimum samples ≈ 10 × number of features × number of classes. Ensures enough data per dimension and per class.
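As a minimal sketch, the rule is a one-line computation (the function name here is illustrative, not part of any library):

```python
def min_samples_rule(n_features: int, n_classes: int, factor: int = 10) -> int:
    """Heuristic minimum n: factor x features x classes (Varoquaux & Colliot 2023)."""
    return factor * n_features * n_classes

# Binary, low-dimensional case from the scenario table: 20 features, 2 classes.
print(min_samples_rule(20, 2))  # 400
```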
2. Min per fold
Each fold gets n/k samples. Test fold needs enough for meaningful metric; train fold needs enough for learning.
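A quick sketch of the per-fold arithmetic (pure Python; the helper name is ours):

```python
def fold_split_sizes(n: int, k: int) -> tuple[int, int]:
    """Approximate (train, test) sizes for one fold of k-fold CV on n samples."""
    test = n // k
    return n - test, test

# 400 samples with 10-fold CV: each model trains on 360 and is scored on 40.
print(fold_split_sizes(400, 10))  # (360, 40)
```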
3. Stratification
Stratified CV preserves class proportions. Requires ≥2 samples per class per fold. With imbalance, minority class limits feasibility.
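The per-class feasibility check can be sketched without scikit-learn (function and threshold names are illustrative; StratifiedKFold itself raises a warning when a class has fewer members than folds):

```python
def stratified_cv_feasible(class_counts: dict[str, int], k: int,
                           min_per_class_per_fold: int = 2) -> bool:
    """True if every class can supply min_per_class_per_fold samples to each of k folds."""
    return all(count >= k * min_per_class_per_fold
               for count in class_counts.values())

# 10-fold CV: 30 minority samples covers 10 folds x 2 per fold, so it is feasible.
print(stratified_cv_feasible({"pos": 30, "neg": 570}, k=10))  # True
print(stratified_cv_feasible({"pos": 15, "neg": 585}, k=10))  # False
```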
4. Confidence intervals
Accuracy is a proportion; CI width ∝ 1/√n. Higher confidence (99% vs 95%) widens the interval.
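A sketch of the normal-approximation interval the calculator estimates (one of several valid proportion intervals; Wilson is preferable for small n or extreme p):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a CI for accuracy p measured on n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# n = 100 at 80% accuracy: roughly +/- 8 points, consistent with
# Varoquaux's observation that +/-10% error bars are common at n = 100.
print(round(ci_half_width(0.80, 100), 3))  # 0.078
```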
5. Model complexity
Complex models (high capacity) need more samples for stable estimates. Low=1×, medium=1.5×, high=2× multiplier.
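Putting the rule and the complexity adjustment together (the multiplier table mirrors the text; it is a heuristic, not a library constant):

```python
import math

# Illustrative multipliers matching the text above.
COMPLEXITY_MULTIPLIER = {"low": 1.0, "medium": 1.5, "high": 2.0}

def min_samples_adjusted(n_features: int, n_classes: int,
                         complexity: str = "medium") -> int:
    """10 x features x classes, scaled up for higher-capacity models."""
    base = 10 * n_features * n_classes
    return math.ceil(base * COMPLEXITY_MULTIPLIER[complexity])

print(min_samples_adjusted(20, 2, "high"))  # 800
```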
🎯 Expert Tips
Use stratification when possible
Preserves class balance across folds. Essential for imbalanced classification.
Repeated CV for stability
5×5 or 10×10 repeated k-fold reduces variance in small-sample settings.
Rare classes need more data
With 1:100 imbalance the minority class is about 1% of the data, so matching a balanced dataset's per-fold minority count takes roughly 50× the total samples.
Nested CV for model selection
Outer loop for evaluation, inner for hyperparameters — avoids optimistic bias.
⚖️ Sample Size by Scenario
| Scenario | Features | Classes | Min n (rule) | Notes |
|---|---|---|---|---|
| Binary, low-dim | 20 | 2 | 400 | Standard ML benchmark |
| Multi-class (10) | 100 | 10 | 10,000 | Image classification |
| Genomics p>>n | 5000 | 2 | 100,000 | Use regularization, holdout |
| Rare disease 1:20 | 30 | 2 | 600+ | Stratification limits |
| NLP sentiment | 500 | 3 | 15,000 | Medium complexity |
❓ Frequently Asked Questions
Why n ≥ 10 × features × classes?
Rule of thumb from ML practice and Varoquaux: ensures enough samples per dimension and per class for stable estimates. For high-dimensional data (p>>n), this often cannot be met — use regularization and holdout.
When to use stratification?
Always for classification when classes are imbalanced. StratifiedKFold preserves class proportions. Requires at least 2 samples per class per fold.
How many folds (k) should I use?
Kohavi 1995: 10–20 folds for accuracy estimation; 5-fold is common for model selection. More folds give each model more training data per split (less pessimistic bias) but cost more compute, and the extreme case, leave-one-out, can have high variance. Fewer folds are cheaper but each model trains on a smaller share of the data.
What if I have very few samples?
Consider leave-one-out (LOO) for regression, or repeated k-fold. Be aware: LOO has high variance. Bootstrap or nested CV can help. Report confidence intervals.
How does class imbalance affect CV?
Minority class limits stratification: every fold needs enough minority samples. With 1:100 imbalance the minority class is about 1% of the data, so the total dataset must be roughly 100× the minority count you need in absolute terms, which works out to about 50× more total samples than a balanced dataset for the same per-fold minority count.
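The arithmetic can be sketched directly (function name is illustrative; `ratio` is the majority:minority ratio):

```python
import math

def total_n_for_minority(minority_per_fold: int, k: int, ratio: float) -> int:
    """Total samples needed so each of k folds holds minority_per_fold minority samples.

    ratio: majority-to-minority ratio, e.g. 100.0 for 1:100 imbalance.
    """
    minority_needed = minority_per_fold * k
    return math.ceil(minority_needed * (1 + ratio))

balanced = total_n_for_minority(2, 10, ratio=1)      # 40 total samples
imbalanced = total_n_for_minority(2, 10, ratio=100)  # 2020: about 50x the balanced case
print(balanced, imbalanced)
```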
What is nested cross-validation?
Outer loop evaluates model; inner loop selects hyperparameters. Prevents optimistic bias from tuning on same data used for evaluation. Computationally expensive.
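The structure can be sketched without any ML library; `evaluate(candidate, train_idx, test_idx)` is a placeholder callback standing in for fitting and scoring a model (in practice, scikit-learn's GridSearchCV inside cross_val_score plays this role):

```python
import random

def kfold_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_score(n, candidates, evaluate, outer_k=5, inner_k=3):
    """Outer loop estimates performance; inner loop picks a hyperparameter."""
    outer_scores = []
    for outer_test in kfold_indices(n, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in range(n) if i not in held_out]

        def inner_score(cand):
            total = 0.0
            for fold in kfold_indices(len(outer_train), inner_k, seed=1):
                test = [outer_train[j] for j in fold]
                in_test = set(test)
                train = [i for i in outer_train if i not in in_test]
                total += evaluate(cand, train, test)
            return total / inner_k

        best = max(candidates, key=inner_score)  # tuned on inner folds only
        # Score the tuned candidate on data the inner loop never saw.
        outer_scores.append(evaluate(best, outer_train, outer_test))
    return sum(outer_scores) / outer_k
```

Because the outer test fold never participates in hyperparameter selection, the averaged outer score avoids the optimistic bias described above.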
When is holdout better than CV?
Very large datasets (millions), or when compute is limited. Single train/test split is simpler. CV preferred for small/medium data.
How does model complexity affect sample size?
Complex models (deep nets, high-capacity) need more samples for stable CV estimates. Simple models (logistic regression) can work with fewer samples.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. The 10×f×c rule is a heuristic; actual requirements depend on problem difficulty, model choice, and noise. For high-dimensional (p>>n) or highly imbalanced data, consult domain-specific guidelines. Always report confidence intervals and consider nested CV for model selection. Verify with scikit-learn or your ML framework.
Related Calculators
Embedding Dimension Calculator
Determine optimal embedding dimensions for LLMs, RAG, classification, and search. Balance memory vs expressiveness.
Machine Learning
RAG Optimizer Calculator
Calculate chunk sizes, vector store memory, and token budgets for retrieval-augmented generation pipelines.
Machine Learning
Training Data Size Estimator
How much training data do you need? Chinchilla ratios, fine-tuning guidelines, and LIMA insights for optimal data requirements.
Machine Learning
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
Machine Learning
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Machine Learning
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.
Machine Learning