Top Automated Feature Engineering Frameworks in Python: 2022 Insights
Introduction to Feature Engineering
Feature engineering is the process of transforming raw data into features that predictive machine learning models can use. The goal is to improve model performance by producing a dataset suited to the algorithm being trained.
Data scientists treat feature engineering as essential because the features fed to a model largely determine what it can learn. Automating the process speeds up the extraction of variables and widens the pool of candidate features, which can help both businesses and data scientists build more accurate models.
Automated Feature Engineering
Traditionally, feature engineering has been a manual task, heavily reliant on domain knowledge and subjective decision-making, which can be labor-intensive and time-consuming. The aim of automated feature engineering is to assist data scientists by generating numerous candidate features from datasets automatically. The most relevant features can then be selected for further training, thus streamlining the process.
Automated feature engineering frameworks allow data scientists to focus on other aspects of machine learning, enhancing productivity. This approach also enables citizen data scientists to perform feature engineering using structured frameworks.
In this article, we will examine the most notable automated feature engineering frameworks in Python that every data scientist should be aware of in 2022.
Featuretools
Featuretools is an open-source library designed for automated feature engineering. It significantly accelerates the feature creation process, enabling more time to be dedicated to other facets of building machine learning models. Essentially, it prepares your data for machine learning applications.
Key components of Featuretools include:
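- EntitySet: A collection of DataFrames together with the relationships between them. Releases before Featuretools 1.0 called each table an Entity; from 1.0 onward the API refers to them simply as DataFrames, which is the terminology the code below uses.
- Deep Feature Synthesis (DFS): The core algorithm of Featuretools. It builds new features for a target DataFrame by following the relationships between DataFrames and applying primitives along the way.
- Feature Primitives: The basic operations that DFS stacks to build features. Aggregation primitives (such as a sum or mean) summarize child rows for each parent row across a relationship, while transform primitives (such as extracting the month from a timestamp) operate on columns within a single DataFrame.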
To illustrate, consider this example:
# install featuretools first (shell command): pip install featuretools
import featuretools as ft
# load the bundled mock customer dataset as a dict of DataFrames
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
sessions_df = data["sessions"]
transactions_df = data["transactions"]
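The above code loads sample customer data and sets up the necessary DataFrames. Next, each DataFrame is described by a tuple giving its index column (and its time index, where one exists), and each relationship is listed as (parent DataFrame, parent column, child DataFrame, child column).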
dataframes = {
    "customers": (customers_df, "customer_id"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
}
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]
To run DFS, you need a dictionary of DataFrames, a list of relationships, and the target DataFrame's name. This process generates a feature matrix along with the corresponding feature definitions.
feature_matrix_customers, features_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="customers",
)
DFS's power lies in its ability to create a feature matrix for any DataFrame within the EntitySet. For example, to generate session-specific features:
feature_matrix_sessions, features_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="sessions",
)
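To make the stacking idea behind DFS more concrete, here is a minimal pure-Python sketch. It is not how Featuretools is implemented; the toy rows, the `aggregate` helper, and the feature names are assumptions made purely for illustration. It applies one aggregation primitive from transactions up to sessions, then a second one from sessions up to customers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables, loosely mirroring the mock customer data.
sessions = [
    {"session_id": 1, "customer_id": "a"},
    {"session_id": 2, "customer_id": "a"},
    {"session_id": 3, "customer_id": "b"},
]
transactions = [
    {"transaction_id": 10, "session_id": 1, "amount": 20.0},
    {"transaction_id": 11, "session_id": 1, "amount": 40.0},
    {"transaction_id": 12, "session_id": 2, "amount": 10.0},
    {"transaction_id": 13, "session_id": 3, "amount": 50.0},
]


def aggregate(child_rows, key, value, func):
    """Group child rows by a foreign key and apply one aggregation primitive."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: summarize transactions for each parent session.
sum_per_session = aggregate(transactions, "session_id", "amount", sum)

# Attach the new feature to the sessions table under a descriptive name.
session_features = [
    {**s, "SUM(transactions.amount)": sum_per_session.get(s["session_id"], 0.0)}
    for s in sessions
]

# Depth 2: summarize the depth-1 feature for each parent customer.
# The result plays the role of MEAN(sessions.SUM(transactions.amount)).
customer_features = aggregate(
    session_features, "customer_id", "SUM(transactions.amount)", mean
)
```

Each extra level of stacking multiplies the number of candidate features, which is why a selection step usually follows feature generation.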
Understanding Feature Output
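Featuretools can also explain the features it generates. For example, ft.describe_feature returns a plain-English description of a feature definition, and ft.graph_feature draws the chain of primitives and columns the feature was built from.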
TSFresh
TSFresh is another open-source Python library that integrates established algorithms from statistics, time-series analysis, and signal processing for systematic feature extraction from time series data. It automatically extracts hundreds of features that describe characteristics of time series, such as peak counts and average values.
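As a rough illustration of what per-series feature extraction means, the sketch below computes a mean, a maximum, and a simple peak count for each series in pure Python. It is not TSFresh code: the `rows` data, the `count_peaks` and `extract_series_features` helpers, and the peak definition (a point strictly greater than both neighbours) are assumptions chosen for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format data: (series id, time step, value).
rows = [
    ("s1", 0, 1.0), ("s1", 1, 3.0), ("s1", 2, 2.0), ("s1", 3, 5.0), ("s1", 4, 4.0),
    ("s2", 0, 2.0), ("s2", 1, 2.0), ("s2", 2, 2.0), ("s2", 3, 2.0), ("s2", 4, 2.0),
]


def count_peaks(values):
    """Count points strictly greater than both immediate neighbours."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


def extract_series_features(rows):
    """Return one dict of summary features per series id."""
    series = defaultdict(list)
    # Sort by (id, time) so each series is assembled in chronological order.
    for series_id, _, value in sorted(rows, key=lambda r: (r[0], r[1])):
        series[series_id].append(value)
    return {
        sid: {"mean": mean(vals), "max": max(vals), "peak_count": count_peaks(vals)}
        for sid, vals in series.items()
    }


features = extract_series_features(rows)
```

In TSFresh itself, the comparable entry point is `extract_features`, which takes a long-format DataFrame together with `column_id` and `column_sort` arguments and returns one row of features per series.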
Featurewiz
Featurewiz is a relatively new open-source library that efficiently identifies significant features from datasets. It employs two primary techniques:
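- SULOV (Searching for Uncorrelated List of Variables): This method finds every pair of highly correlated variables, computes each variable's Mutual Information Score (MIS) with respect to the target, and drops the lower-scoring variable from each pair. What remains is a set of variables that are informative about the target and only weakly correlated with one another.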
- Recursive XGBoost: The variables identified by SULOV are passed to XGBoost, which recursively trains models to identify optimal features based on the target variable.
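The pair-pruning idea behind SULOV can be sketched in a few lines of pure Python. This is not Featurewiz's code. To stay dependency-free it swaps the Mutual Information Score for the absolute Pearson correlation with the target as the relevance score, and the 0.7 threshold, the helper names, and the toy data are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_uncorrelated(features, target, threshold=0.7):
    """Drop the less relevant variable from every highly correlated pair."""
    # Stand-in relevance score: |Pearson r| with the target (not the MIS).
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    dropped = set()
    for a, b in combinations(features, 2):
        if abs(pearson(features[a], features[b])) > threshold:
            dropped.add(a if relevance[a] < relevance[b] else b)
    return [name for name in features if name not in dropped]


# Hypothetical data: x2 is nearly a rescaled copy of x1; x3 is unrelated noise.
target = [1, 2, 3, 4, 5, 6]
features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],
    "x3": [1, -1, 1, -1, 1, -1],
}
selected = select_uncorrelated(features, target)
```

Here `x1` and `x2` form the only highly correlated pair, and `x2` is dropped because it is slightly less related to the target. In the Featurewiz workflow described above, the surviving variables would then go on to the recursive XGBoost step.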
PyCaret
PyCaret is a low-code, open-source machine learning library in Python that automates the machine learning workflow. It significantly accelerates the experimental cycle, allowing users to replace extensive code with minimal lines, thereby enhancing productivity.
Although PyCaret is not solely dedicated to automated feature engineering, it includes functionalities for automatic feature generation prior to model training and selection.
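# install pycaret first (shell command): pip install pycaret
from pycaret.datasets import get_data
from pycaret.regression import setup
# load the bundled insurance dataset
insurance = get_data('insurance')
# feature_interaction and feature_ratio are setup() options in PyCaret 2.x;
# PyCaret 3.x removed them, so this call needs a 2.x release to run as written
reg1 = setup(data=insurance, target='charges', feature_interaction=True, feature_ratio=True)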
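For intuition about what interaction and ratio features are, the following pure-Python sketch multiplies and divides every pair of numeric columns. It is an illustration under stated assumptions rather than PyCaret's implementation: the `add_interactions_and_ratios` helper, the generated column names, and the two sample rows are hypothetical.

```python
from itertools import combinations


def add_interactions_and_ratios(rows, numeric_columns):
    """Return copies of the rows with pairwise product and ratio columns added."""
    expanded_rows = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(numeric_columns, 2):
            new_row[f"{a}_multiply_{b}"] = row[a] * row[b]
            # A zero denominator has no defined ratio, so store None instead.
            new_row[f"{a}_divide_{b}"] = row[a] / row[b] if row[b] != 0 else None
        expanded_rows.append(new_row)
    return expanded_rows


# Hypothetical rows using a few numeric columns from an insurance-style dataset.
rows = [
    {"age": 30, "bmi": 25.0, "children": 2},
    {"age": 40, "bmi": 20.0, "children": 0},
]
expanded = add_interactions_and_ratios(rows, ["age", "bmi", "children"])
```

Three numeric columns already yield six new columns, so the number of generated features grows quickly with the column count. That growth is the reason automatic generation is normally paired with a feature selection step.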
Conclusion
By leveraging these automated feature engineering frameworks, data scientists can optimize their workflow, enhance model performance, and focus on more strategic aspects of their projects.
Stay informed about advancements in data science, machine learning, and tools like PyCaret by following me on Medium, LinkedIn, and Twitter.