Every machine learning algorithm has knobs you must set before training begins: the learning rate of a neural network, the number of trees in a random forest, the regularization strength in logistic regression. These are hyperparameters, and their values can mean the difference between a mediocre model and a state-of-the-art one. Finding the right combination is the art and science of hyperparameter tuning.
Parameters vs. Hyperparameters
It is important to distinguish between the two:
- Parameters are learned from the data during training. Neural network weights and biases, regression coefficients, and decision tree split points are all parameters.
- Hyperparameters are set before training and control the learning process itself. Learning rate, batch size, number of hidden layers, regularization strength, and the number of estimators in an ensemble are all hyperparameters.
You cannot learn hyperparameters from the training data the same way you learn parameters. Instead, you evaluate different hyperparameter settings by measuring model performance on a validation set and selecting the configuration that works best.
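This select-on-validation loop can be sketched in a few lines. A minimal example, assuming scikit-learn and using its bundled breast-cancer dataset as a stand-in for your own data; the candidate `C` values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization strengths
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    model.fit(X_train, y_train)            # parameters learned here
    score = model.score(X_val, y_val)      # hyperparameter judged here
    if score > best_score:
        best_score, best_C = score, C

print(f"best C: {best_C}, validation accuracy: {best_score:.3f}")
```

Note the division of labor: `fit` learns parameters from the training split, while the loop selects the hyperparameter using the held-out validation split.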
"Hyperparameter tuning is not a dark art. It is a systematic search through a configuration space, guided by evaluation metrics and, ideally, by principled optimization strategies."
Grid Search
Grid search is the most straightforward approach. You define a discrete set of values for each hyperparameter and evaluate every possible combination.
For example, if you are tuning a random forest with n_estimators in [100, 200, 500] and max_depth in [5, 10, 20, None], grid search evaluates all 12 combinations.
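The same search can be run with scikit-learn's `GridSearchCV`; a sketch using a synthetic dataset in place of real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, 20, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,        # 3-fold cross-validation per combination
    n_jobs=-1,   # combinations are independent, so parallelize
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 3 * 4 = 12 combinations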
Advantages
- Simple to implement and understand.
- Exhaustive: guarantees that the best combination within the grid is found.
- Easily parallelizable: each combination can be evaluated independently.
Limitations
- Computational explosion: With d hyperparameters and k values each, the number of combinations is k^d. Adding a few more hyperparameters or values per hyperparameter quickly makes the search infeasible.
- Wastes evaluations: Grid search spends equal effort on important and unimportant hyperparameters. If only one of five hyperparameters matters, four-fifths of the evaluations are wasted.
Random Search
Random search samples hyperparameter configurations randomly from specified distributions. Instead of evaluating every point on a grid, it evaluates a fixed number of random configurations.
Key Takeaway
Bergstra and Bengio (2012) showed that random search is more efficient than grid search when only a few hyperparameters truly matter, which is usually the case. Random search explores more unique values along each dimension, increasing the chances of finding a good configuration.
Why Random Beats Grid
Consider tuning two hyperparameters where only one matters. Grid search with 9 evaluations gives you only 3 distinct values of the important parameter. Random search with 9 evaluations gives you 9 distinct values of the important parameter. The more unique values you sample, the more likely you are to find a good one.
Practical Tips for Random Search
- Use log-uniform distributions for parameters that span orders of magnitude (learning rate from 0.0001 to 0.1).
- Use uniform distributions for bounded parameters (dropout rate from 0.0 to 0.5).
- Start with a broad range and narrow down after initial results.
- Budget 50-100 random evaluations for a good starting point.
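These tips map directly onto scikit-learn's `RandomizedSearchCV` with SciPy distributions. A sketch, assuming scikit-learn and SciPy; the model and ranges are illustrative:

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_distributions = {
    # log-uniform for a parameter spanning orders of magnitude
    "alpha": loguniform(1e-5, 1e-1),
    # uniform for a bounded parameter
    "l1_ratio": uniform(0.0, 1.0),
}
search = RandomizedSearchCV(
    SGDClassifier(penalty="elasticnet", random_state=0),
    param_distributions,
    n_iter=20,       # number of randomly sampled configurations
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Sampling `alpha` log-uniformly means a value near 1e-4 is as likely as one near 1e-2, which is what you want when you have no idea of the right order of magnitude.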
Bayesian Optimization
Bayesian optimization is typically the most sample-efficient of these approaches. Instead of evaluating configurations blindly, it builds a probabilistic model of the objective function (validation performance as a function of hyperparameters) and uses that model to choose the next configuration to evaluate.
How It Works
- Evaluate a few initial configurations (often randomly).
- Fit a surrogate model (typically a Gaussian Process or Tree-structured Parzen Estimator) to the observed results.
- Use an acquisition function (like Expected Improvement) to select the next configuration that balances exploring unknown regions with exploiting promising areas.
- Evaluate the selected configuration and update the surrogate model.
- Repeat until the budget is exhausted.
Advantages
- Sample-efficient: Finds good configurations with fewer evaluations than grid or random search.
- Adapts over time: Each evaluation informs the next, focusing effort where it is most likely to improve.
- Handles complex spaces: Can navigate high-dimensional, non-convex hyperparameter spaces.
Limitations
- Sequential bottleneck: Bayesian optimization is inherently sequential, as each evaluation depends on previous results. This limits parallelism.
- Surrogate model overhead: Fitting the surrogate model adds computation, though this is usually negligible compared to training the ML model.
- Complexity: More complex to implement than grid or random search, though libraries like Optuna, Hyperopt, and Ray Tune handle the heavy lifting.
Hyperband and BOHB
Hyperband
Hyperband is a bandit-based approach that speeds up tuning dramatically through early stopping. It trains many configurations for a small number of epochs, keeps the most promising ones, and allocates more epochs to the survivors. This successive-halving process eliminates poor configurations quickly.
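The successive-halving core can be sketched in pure Python. Here `train_for` is a stand-in for real training, with a made-up score that improves with epochs up to each configuration's intrinsic quality; the keep-top-third rule and tripling budget mirror Hyperband's typical elimination factor of 3:

```python
import random

random.seed(0)

def train_for(config, epochs):
    # stand-in for real training: the score improves with more
    # epochs, capped by the configuration's intrinsic quality
    return config["quality"] * (1 - 0.5 ** epochs)

# start with many random configurations
initial = [{"id": i, "quality": random.random()} for i in range(27)]
configs = list(initial)

budget = 1  # epochs per configuration in the first round
while len(configs) > 1:
    scored = [(train_for(c, budget), c) for c in configs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # keep the top third, drop the rest early
    configs = [c for _, c in scored[: max(1, len(configs) // 3)]]
    budget *= 3  # survivors earn a larger training budget

print(f"winner: config {configs[0]['id']}")
```

Starting from 27 configurations, three rounds of halving (27 → 9 → 3 → 1) cost far fewer total epochs than training all 27 to completion. Real Hyperband additionally runs several such brackets with different starting budgets to hedge against stopping slow starters too early.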
BOHB
BOHB (Bayesian Optimization and Hyperband) combines the best of both worlds: it uses Bayesian optimization to generate promising configurations and Hyperband's early stopping to evaluate them efficiently. BOHB is often the best choice when training is expensive.
Key Takeaway
For most practical scenarios, start with random search for a quick survey, then switch to Bayesian optimization (via Optuna or similar tools) for fine-tuning. If training is very expensive, use Hyperband or BOHB for early stopping.
Popular Tools
- Scikit-learn GridSearchCV / RandomizedSearchCV: Simple and integrated with scikit-learn pipelines.
- Optuna: Powerful Bayesian optimization with pruning (early stopping), visualization, and support for complex search spaces.
- Ray Tune: Distributed tuning with support for grid, random, Bayesian, Hyperband, and BOHB. Scales across clusters.
- Hyperopt: One of the earliest Bayesian optimization libraries for ML. Uses Tree-structured Parzen Estimators.
- Keras Tuner: Designed specifically for tuning Keras deep learning models.
Best Practices
- Tune the most impactful hyperparameters first. Learning rate, regularization strength, and model capacity usually matter more than minor parameters.
- Use proper validation. Always tune on a validation set, never on the test set. Use cross-validation for small datasets. See our guide on model evaluation.
- Log everything. Record every configuration and its result in an experiment tracker. You will thank yourself when debugging or writing reports.
- Set a budget. Decide how many evaluations or hours you can afford before starting. Tuning can consume infinite resources without a budget.
- Start wide, then narrow. Begin with a broad search range to find promising regions, then zoom in with finer resolution.
- Consider AutoML if you want to automate the entire process including model selection.
Hyperparameter tuning is where marginal gains turn into meaningful performance improvements. With the right strategy, a modest tuning budget, and disciplined experiment tracking, you can reliably push your models from good to great.
