Not all features are created equal. Some carry strong signals about the target variable, others are redundant or noisy, and a few can actively harm model performance. Feature selection is the process of identifying and keeping only the most useful features, reducing complexity while improving or maintaining model accuracy.
Feature selection differs from dimensionality reduction. While techniques like PCA transform features into new ones, feature selection retains the original features, making the model more interpretable and the pipeline simpler.
Why Feature Selection Matters
- Reduces overfitting: Fewer features mean fewer opportunities for the model to memorize noise in the training data.
- Speeds up training: Fewer features reduce computational cost, sometimes dramatically.
- Improves interpretability: A model with 10 features is far easier to explain than one with 1,000.
- Reduces data collection costs: If you know which features matter, you can stop collecting the ones that do not.
- Simplifies deployment: Fewer features mean a lighter ML pipeline with fewer potential failure points.
"The best feature set is not the largest one. It is the smallest set that captures the essential patterns your model needs to make accurate predictions."
Filter Methods
Filter methods evaluate features independently of any machine learning model. They use statistical measures to score each feature and select the top ones. They are fast and model-agnostic but may miss feature interactions.
Common Filter Techniques
- Correlation coefficient: Measures the linear relationship between each feature and the target. Remove features with near-zero correlation to the target, and drop one feature from any highly correlated pair to reduce redundancy.
- Mutual information: Measures the amount of information a feature provides about the target, capturing both linear and nonlinear relationships. More general than correlation.
- Chi-squared test: For categorical features, tests whether the feature and target are independent. Higher chi-squared values indicate stronger association.
- ANOVA F-test: For numerical features with categorical targets, tests whether the feature means differ across classes.
- Variance threshold: Remove features with variance below a threshold. A feature that is nearly constant carries no information.
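The last two techniques can be chained into a quick screening pass. Here is a minimal sketch using scikit-learn on a synthetic dataset; the feature counts and `k=10` cutoff are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Synthetic data: 20 features, only 5 informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

# Step 1: drop constant (zero-variance) features
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: keep the 10 features sharing the most information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_var, y)

print(X_selected.shape)  # (500, 10)
```

Swapping `mutual_info_classif` for `f_classif` gives the ANOVA F-test variant instead.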
Key Takeaway
Filter methods are fast and scalable, making them ideal for initial feature screening on high-dimensional datasets. Use mutual information instead of correlation when you suspect nonlinear relationships.
Wrapper Methods
Wrapper methods use a machine learning model to evaluate subsets of features. They search through combinations of features, training and evaluating a model for each subset. This makes them more powerful than filter methods but also more computationally expensive.
Forward Selection
Start with no features. At each step, add the feature that improves model performance the most. Stop when adding more features no longer helps (or when you reach a desired number).
Backward Elimination
Start with all features. At each step, remove the feature whose removal hurts performance the least. Stop when removing any feature significantly degrades performance.
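Both search strategies are available through scikit-learn's `SequentialFeatureSelector`. A minimal sketch, with the target of 5 features chosen arbitrarily:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# direction="forward" adds one feature per step;
# direction="backward" starts from all features and removes one per step
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support().nonzero()[0])  # indices of the selected features
```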
Recursive Feature Elimination (RFE)
Train a model, rank features by importance, remove the least important ones, and repeat. This is essentially backward elimination guided by the model's own feature importance scores. Scikit-learn provides a convenient RFECV implementation that uses cross-validation to find the optimal number of features.
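A minimal RFECV sketch on synthetic data; the estimator and fold count are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)

# RFECV removes the least important feature(s) each step and uses
# cross-validation to pick the feature count with the best score
rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0),
              step=1, cv=3)
rfecv.fit(X, y)
print(rfecv.n_features_)  # optimal number of features found
print(rfecv.support_)     # boolean mask of the kept features
```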
Embedded Methods
Embedded methods perform feature selection as part of the model training process. They are faster than wrapper methods and can capture feature interactions.
LASSO Regularization (L1)
LASSO adds an L1 penalty to the loss function that drives some feature coefficients to exactly zero, effectively removing those features from the model. The regularization strength controls how aggressively features are eliminated. Stronger regularization removes more features.
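A minimal sketch of LASSO's sparsity in action; `alpha=1.0` is an arbitrary regularization strength, and features are standardized first so the penalty treats them equally:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 10 informative features out of 50
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0)  # alpha = regularization strength
lasso.fit(X, y)

# Coefficients driven to exactly zero are effectively removed features
kept = np.flatnonzero(lasso.coef_)
print(f"{len(kept)} of 50 coefficients are nonzero")
```

Increasing `alpha` shrinks more coefficients to zero; in practice `LassoCV` can pick `alpha` by cross-validation.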
Tree-Based Feature Importance
Decision trees, random forests, and gradient boosting models naturally provide feature importance scores based on how much each feature improves the purity of splits (Gini importance) or reduces the training loss (gain importance). Features with consistently low importance are candidates for removal.
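A minimal sketch using a random forest's impurity-based scores together with `SelectFromModel`; the `threshold="mean"` cutoff is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_.round(3))  # impurity-based score per feature

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                           threshold="mean")
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)
```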
Permutation Importance
Shuffle the values of a single feature and measure how much model performance drops. A large drop means the feature is important; no drop means the feature can be removed. This method is model-agnostic and more reliable than impurity-based importance for correlated features.
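Scikit-learn implements this as `permutation_importance`. A minimal sketch, computed on held-out data so the scores reflect generalization rather than training fit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times and record the average score drop
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```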
Elastic Net
Elastic Net combines L1 (LASSO) and L2 (Ridge) regularization. The L1 component drives some coefficients to zero for feature selection, while the L2 component handles correlated features better than LASSO alone.
Key Takeaway
Embedded methods like LASSO and tree-based importance are often the best balance of accuracy and efficiency. They consider feature interactions that filter methods miss while being much faster than wrapper methods.
Advanced Techniques
Boruta Algorithm
Boruta creates "shadow features" by shuffling each original feature. It then uses a random forest to compare the importance of each original feature against the maximum importance of any shadow feature. Features that are consistently more important than the best shadow feature are selected. This provides a rigorous statistical test for feature relevance.
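The core shadow-feature comparison can be sketched in a few lines. This is a single round only; the full algorithm repeats the comparison over many iterations with a statistical test, and ready-made implementations such as BorutaPy exist:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)

# Shadow features: each column shuffled, destroying any link to the target
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_combined = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_combined, y)
real_imp = forest.feature_importances_[:10]
shadow_imp = forest.feature_importances_[10:]

# A feature "beats the shadows" if it outranks the best shuffled column
selected = np.flatnonzero(real_imp > shadow_imp.max())
print("selected features:", selected)
```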
SHAP-Based Selection
SHAP (SHapley Additive exPlanations) values provide a principled measure of each feature's contribution to individual predictions. Aggregating SHAP values across the dataset gives a nuanced view of feature importance that accounts for interactions and nonlinearities. Features with consistently low SHAP values can be removed.
Feature Selection for Deep Learning
Deep learning models, especially neural networks with many layers, can perform implicit feature selection through learned representations. However, explicit feature selection can still help:
- Reduce training time by removing clearly irrelevant features.
- Prevent overfitting on small datasets.
- Improve interpretability when it matters.
Separately, attention mechanisms in Transformers act as a form of dynamic feature selection, learning which inputs to focus on for each prediction.
Common Mistakes
- Selecting features on the full dataset: Always perform feature selection within each cross-validation fold. Selecting on the full dataset and then cross-validating creates data leakage.
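The leakage-safe pattern is to put the selector inside a scikit-learn `Pipeline`, so every cross-validation fold selects features from its own training split only. A minimal sketch with illustrative parameter choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Selection lives inside the pipeline, so each CV fold
# picks its features using only that fold's training data
pipe = make_pipeline(SelectKBest(mutual_info_classif, k=10),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```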
- Ignoring multicollinearity: Highly correlated features can inflate importance scores. Remove redundant features before interpreting importance.
- Using a single method: Different methods capture different aspects of feature relevance. Use multiple methods and look for consensus.
- Over-selecting: Being too aggressive with feature removal can discard useful information. Use cross-validation to find the right balance.
- Ignoring domain knowledge: Statistical methods cannot tell you if a feature is causally relevant or just a spurious correlation. Domain expertise should guide final decisions.
Practical Workflow
- Start with domain knowledge to identify obviously irrelevant features.
- Apply variance threshold to remove near-constant features.
- Use filter methods (mutual information, correlation) for initial screening.
- Train a model with embedded selection (LASSO or tree importance) to refine the feature set.
- Validate with cross-validation to confirm that the reduced feature set maintains or improves performance.
- Iterate based on results and domain feedback.
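Steps 2 through 5 of this workflow can be composed as a single leakage-safe pipeline. A minimal sketch on synthetic data; the `k=20` cutoff and choice of estimators are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       VarianceThreshold, mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=400, n_features=40, n_informative=6,
                           random_state=0)

pipe = make_pipeline(
    VarianceThreshold(threshold=0.0),                # drop constant features
    SelectKBest(mutual_info_classif, k=20),          # filter screening
    SelectFromModel(GradientBoostingClassifier(random_state=0)),  # embedded refinement
    LogisticRegression(max_iter=1000),               # final model
)
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```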
Feature selection is not a one-time step but an iterative process that improves as you understand your data better. Combined with proper model evaluation and hyperparameter tuning, it is a key ingredient in building models that are accurate, fast, and interpretable.
