3.2.4 Feature Engineering and Selection
Feature engineering is the process of transforming raw data into meaningful inputs for machine learning models. It often involves creating, encoding, and scaling features so that models can learn effectively. Good features can significantly improve model accuracy and reduce training time. Feature selection, by contrast, is choosing a subset of those features that are most relevant. This reduces dimensionality, avoids overfitting, and simplifies models. In practice, feature engineering and selection are iterative and domain-specific; they combine automated transforms with human insight.
Both traditional machine learning (ML) and deep learning (DL) benefit from careful feature handling, but their emphasis differs. Classical ML usually requires substantial manual feature design (e.g. creating aggregated statistics, one-hot encodings, or engineered combinations), whereas DL's strength is that it can automatically derive meaningful features from raw data (e.g. convolutional filters that detect edges in images), as illustrated by object-detection networks that identify and localize people in images (Figure below). However, even deep models rely on preprocessing steps like normalization or encoding of categorical inputs. Traditional ML pipelines, in contrast, explicitly compute features (one-hot vectors, PCA projections, etc.) and then feed them into a model. Deep nets thus trade the labor of manual feature crafting for the need to design and tune network architectures.
Figure: Example of feature extraction in computer vision. A deep model has detected multiple people (“person 99%”, etc.) in an image by learning high-level features from raw pixels.
Categorical Encoding and Feature Construction
Real-world data often contains categorical or text features (e.g. user IDs, color names, country codes). A common encoding is one-hot encoding, which creates a binary indicator for each category. For example, encoding a "color" column with values {red, green, blue} yields three new columns (Color_Red, Color_Green, Color_Blue) with 1/0 entries. Scikit-learn's OneHotEncoder implements this by mapping each category to a sparse one-of-K vector. One-hot encoding ensures that categorical values are treated without imposing an arbitrary ordinal scale. In the code below, we show one-hot encoding of a small pandas DataFrame:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample categorical data
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# sparse_output=False returns a dense array (older scikit-learn versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['color']])
print(encoder.get_feature_names_out())  # e.g. ['color_blue' 'color_green' 'color_red']
print(encoded)
Similarly, numeric features sometimes require normalization or standardization so that all inputs are on a similar scale. Techniques like Min-Max scaling (rescaling to [0, 1]) or Z-score standardization (zero mean, unit variance) are widely used. For instance, scikit-learn's StandardScaler subtracts the mean and divides by the standard deviation of each feature, transforming data to have mean ≈ 0 and variance ≈ 1. This is important because many learning algorithms (e.g. gradient descent, KNN, SVM) perform better or converge faster when features are scaled. The code below demonstrates standardization:
from sklearn.preprocessing import StandardScaler
X = [[10, 0.5], [20, 0.2], [30, 0.9]]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled) # Columns now have zero mean and unit variance
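The text above also mentions Min-Max scaling; a parallel sketch using scikit-learn's MinMaxScaler (on the same arbitrary toy data) looks like this:
from sklearn.preprocessing import MinMaxScaler

X = [[10, 0.5], [20, 0.2], [30, 0.9]]
scaler = MinMaxScaler()              # rescales each column to the [0, 1] range
X_minmax = scaler.fit_transform(X)
print(X_minmax)                      # column minimums map to 0, maximums to 1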
Beyond encoding and scaling, feature construction is often useful. This includes mathematical transforms (log, square root), creating interaction terms, or aggregating data (e.g. count of a user’s past actions). Automated libraries like Featuretools can generate new features from relational data using deep feature synthesis. For example, Featuretools can take transactional tables (customers, orders, products) and automatically produce summary features (total spend per customer, frequency of purchases, etc.). In code:
import featuretools as ft

# Load a sample of the demo retail dataset (an EntitySet of linked tables)
es = ft.demo.load_retail(nrows=1000)

# Automatically generate aggregate/transform features for the 'orders' table
# (Featuretools >= 1.0 uses target_dataframe_name; older releases used target_entity)
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="orders")
print(feature_matrix.head())
This automates the creation of dozens of features, illustrating how domain-specific features (e.g. number of products per order) can be engineered programmatically.
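Of course, many useful features can also be constructed by hand before reaching for an automated library. The snippet below is an illustrative sketch with made-up column names, showing a log transform, an interaction term, and a per-user aggregate in pandas:
import numpy as np
import pandas as pd

# Hypothetical transaction data (column names are illustrative only)
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'amount': [20.0, 180.0, 5.0, 60.0, 75.0],
    'quantity': [1, 3, 1, 2, 2],
})

df['log_amount'] = np.log1p(df['amount'])                 # mathematical transform
df['amount_x_quantity'] = df['amount'] * df['quantity']   # interaction term
df['user_tx_count'] = df.groupby('user_id')['amount'].transform('count')  # per-user aggregate
print(df)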
Dimensionality Reduction (PCA and Others)
High-dimensional data can be reduced while preserving most of its information. Principal Component Analysis (PCA) is a classic linear method: it computes orthogonal directions of maximal variance and projects the data onto the top k components. Scikit-learn's PCA class implements this via singular value decomposition. By retaining only the first few components, one obtains a lower-dimensional representation that captures most of the variance. PCA is especially useful when features are correlated or when the number of features exceeds the number of samples. For example, to project 3-D data onto its first 2 principal components:
from sklearn.decomposition import PCA
X = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca) # Transformed data in 2 principal components
PCA and related methods (ICA, LDA, TruncatedSVD) are part of feature engineering workflows. They help mitigate the “curse of dimensionality” and improve computational efficiency by compressing features.
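In practice, the number of components is often chosen by inspecting the explained variance. The sketch below (using scikit-learn's digits dataset purely as an example) keeps enough components to retain roughly 95% of the variance:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel features per sample
pca = PCA(n_components=0.95)          # keep components explaining ~95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                               # far fewer than 64 columns
print(np.cumsum(pca.explained_variance_ratio_))      # cumulative variance explained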
Feature Selection Techniques
Rather than creating features, feature selection picks the most relevant ones from the existing set. In ML, selecting a subset of features can simplify the model, reduce overfitting, and speed up learning. Feature selection methods fall into three categories:
- Filter methods score each feature with a statistic computed independently of any model. For example, mutual information can quantify the dependency between a feature and the target; a high MI score indicates the feature carries informative signal. Scikit-learn's mutual_info_classif can be used to rank features by their mutual information with the label, and the filter step is typically followed by keeping the top k features (see the sketch after this list).
- Wrapper methods use a predictive model to evaluate feature subsets. A common wrapper is Recursive Feature Elimination (RFE). RFE works by repeatedly fitting a model and removing the least-important feature(s) at each step. For example, using logistic regression as the estimator, RFE drops the feature with the smallest coefficient, refits the model, and continues until a target number of features remains. This process naturally accounts for feature interactions. In code:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=200)
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)
print("Selected features:", selector.support_)  # boolean mask of kept features
- Embedded methods perform selection as part of model training. For instance, L1-regularized (lasso) models tend to zero out the coefficients of irrelevant features, implicitly selecting a subset. Tree-based models (random forest, XGBoost) provide feature_importances_ scores that can be thresholded to select features. These methods are "embedded" because feature selection is built into the learning algorithm.
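As a complement to the RFE example, the snippet below is a minimal sketch of the filter and embedded approaches on the same iris data (the choice of k=2 and the median importance threshold are arbitrary illustrations):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest mutual information with the label
filter_selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_filtered = filter_selector.fit_transform(X, y)
print("MI scores:", filter_selector.scores_)

# Embedded: threshold random-forest importances to pick features
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold="median")
X_embedded = embedded_selector.fit_transform(X, y)
print("Embedded mask:", embedded_selector.get_support())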
Feature selection is highly useful in domains like genomics (many gene expressions, few samples) and fraud detection (large feature sets). For instance, in fraud detection, selecting variables with high mutual information (e.g. unusual transaction amounts, risky merchant categories) helps the model focus on predictive factors. In high-dimensional medical data, techniques like PCA and RFE help isolate the most important biomarkers while discarding noisy or redundant measurements.
Practical Tools and Frameworks
The Python ecosystem offers many libraries for feature engineering and selection. Scikit-learn provides transformers and selectors for almost every need:
- OneHotEncoder, OrdinalEncoder, etc., for categorical encoding.
- StandardScaler, MinMaxScaler, Normalizer, etc., for normalization and scaling.
- PCA, TruncatedSVD, and other decomposition classes for dimensionality reduction.
- Feature selection modules: SelectKBest, mutual_info_classif, RFE, VarianceThreshold, etc. For example, SelectKBest(score_func=mutual_info_classif, k=5) filters the top 5 features by mutual information.
- Pipeline and ColumnTransformer combine encoding, scaling, and model fitting into a coherent workflow (see the sketch after this list).
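As a minimal sketch of such a workflow (the column names, toy data, and choice of classifier are illustrative only), a ColumnTransformer can scale the numeric column and one-hot encode the categorical one before a model is fit:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one numeric and one categorical column
df = pd.DataFrame({
    'amount': [12.5, 250.0, 7.8, 99.0],
    'country': ['US', 'DE', 'US', 'FR'],
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['amount']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country']),
])
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression()),
])
model.fit(df, y)
print(model.predict(df))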
For automated feature creation, Featuretools (Python) excels at relational and time-series data. It uses Deep Feature Synthesis to combine tables and generate new features, as shown above. Although we gave only a brief code example, Featuretools has full support for grouping, aggregating, and transforming features across multiple linked tables, which is invaluable in e-commerce or IoT analytics.
In the TensorFlow ecosystem, TensorFlow Transform (TFT) is used within TFX pipelines for large-scale data preprocessing. TFT lets you define transformations (scaling, bucketizing, vocabulary lookups) once and applies them consistently during training and serving, avoiding training/serving skew. For example, one might write a preprocessing_fn for TFT:
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Transformations applied identically at training and serving time."""
    outputs = {}
    outputs['age_scaled'] = tft.scale_to_z_score(inputs['age'])
    outputs['income_scaled'] = tft.scale_to_z_score(inputs['income'])
    # compute_and_apply_vocabulary maps each color string to an integer index
    outputs['color_id'] = tft.compute_and_apply_vocabulary(inputs['color'])
    return outputs
This function is executed over the training dataset to compute parameters (means, vocabularies) and then exported so that at serving time new examples can be transformed identically. As noted in the TFX tutorial, “the resulting transforms will be consistent between training and serving”, ensuring robust production pipelines.
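Outside a full TFX pipeline, the same preprocessing_fn can also be run directly with the tft_beam API. The sketch below follows the pattern of the TFT "simple example" and uses made-up in-memory records; it assumes Apache Beam and TensorFlow Transform are installed:
import tempfile
import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [
    {'age': 34.0, 'income': 52000.0, 'color': 'red'},
    {'age': 45.0, 'income': 61000.0, 'color': 'blue'},
]
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'age': tf.io.FixedLenFeature([], tf.float32),
        'income': tf.io.FixedLenFeature([], tf.float32),
        'color': tf.io.FixedLenFeature([], tf.string),
    }))

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

# transform_fn holds the learned parameters (means, vocabularies) and can be
# written out (e.g. with tft_beam.WriteTransformFn) for reuse at serving time.
print(transformed_data)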
Real-World Examples
- Fraud Detection: Financial fraud models often use engineered features like transaction amount, time since last transaction, device ID, and location. Encoding categorical fields (merchant, country) and normalizing numerical fields (amount) are crucial. Feature selection (e.g. via mutual information) helps pick the signals that best distinguish fraudulent from legitimate cases. Tree-based models or logistic regression on these engineered features are common in fraud detection (a small pandas sketch of such features follows this list).
- Recommendation Systems: Here, features may include user/item IDs, demographics, and content metadata. One-hot encoding of categorical IDs can blow up dimensionality, so DL-based recommenders often learn embeddings automatically. However, hybrid recommenders still benefit from engineered features: e.g., count of previous purchases, time since last login, or PCA on item attributes. Tools like TensorFlow's feature columns or embeddings can handle sparse ID features, while one-hot or ordinal encodings are used for side information.
- Medical Diagnosis: Predicting diseases (cancer, cardiovascular risk, etc.) often involves many clinical measurements or genomic markers. Feature engineering might include calculating risk scores, ratios, or binning continuous values. Dimensionality reduction (PCA) and feature selection (RFE, Lasso) are widely used to condense high-dimensional gene expression data. For instance, a study might use PCA to compress gene data to a few components, then logistic regression to predict cancer, emphasizing that selecting informative genetic features improves generalization.
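As a small illustration of the hand-crafted fraud features mentioned above (the column names are hypothetical), recency and frequency signals can be derived from a transaction log with pandas:
import pandas as pd

# Hypothetical transaction log
tx = pd.DataFrame({
    'card_id': ['A', 'A', 'B', 'A', 'B'],
    'timestamp': pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:05',
                                 '2024-01-02 09:00', '2024-01-03 14:00',
                                 '2024-01-02 09:01']),
    'amount': [12.0, 950.0, 30.0, 15.0, 32.0],
}).sort_values(['card_id', 'timestamp'])

# Recency: seconds since the card's previous transaction
tx['secs_since_last'] = tx.groupby('card_id')['timestamp'].diff().dt.total_seconds()
# Frequency: running count of the card's prior transactions
tx['prior_tx_count'] = tx.groupby('card_id').cumcount()
print(tx)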
In all these domains, the balance between handcrafted features and automated learning must be tuned. As one analysis notes, deep networks “automatically derive meaningful features from raw data”, but simpler models built on expert features can be more interpretable. Hence, a hybrid approach—using both engineered features (domain knowledge) and representation learning—is common.
Key Takeaways: Feature engineering and selection are fundamental to building effective ML/DL models. Proper encoding and scaling prepare the data (e.g. one-hot encoding for categories, normalization for numeric features). Dimensionality reduction (like PCA) can compress data, and selection methods (mutual information, RFE, etc.) prune irrelevant inputs. Modern libraries (scikit-learn, Featuretools, TensorFlow Transform) provide out-of-the-box tools to implement these steps. By combining theory with practical coding tools, data scientists can construct clean, informative feature sets that power real-world applications from fraud detection to recommendation systems.
References
[1] J. Murel and E. Kavlakoglu, “What is feature engineering?”, IBM Think AI (Jan. 2024). [Online]. Available: https://www.ibm.com/think/topics/feature-engineering. [Accessed 2025].
[2] Wikipedia, “Feature engineering”, Feb. 2024. [Online]. Available: https://en.wikipedia.org/wiki/Feature_engineering. [Accessed 2025].
[3] D. Jiang et al., “Expert Feature-Engineering vs. Deep Neural Networks: Which is Better for Sensor-Free Affect Detection?”, AIED 2018, pp. 1–7.
[4] P. Bhandari, “What is Feature Scaling and Why is it Important?”, Analytics Vidhya, Apr. 2025. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/. [Accessed 2025].
[5] Scikit-learn developers, "OneHotEncoder", scikit-learn 0.16.1 documentation. [Online]. Available: https://scikit-learn.org/0.16/modules/generated/sklearn.preprocessing.OneHotEncoder.html. [Accessed 2025].
[6] Scikit-learn developers, "StandardScaler", scikit-learn 0.24.2 documentation. [Online]. Available: https://scikit-learn.org/0.24/modules/generated/sklearn.preprocessing.StandardScaler.html. [Accessed 2025].
[7] Scikit-learn developers, "PCA", scikit-learn 1.6.1 documentation. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html. [Accessed 2025].
[8] Scikit-learn developers, "RFE", scikit-learn 1.6.1 documentation. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. [Accessed 2025].
[9] Wikipedia, “Feature selection”, 2025. [Online]. Available: https://en.wikipedia.org/wiki/Feature_selection. [Accessed 2025].
[10] TensorFlow, “Feature Engineering using TFX Pipeline and TensorFlow Transform.” [Online]. Available: https://www.tensorflow.org/tfx/tutorials/tfx/penguin_tft. [Accessed 2025].