Building and Deploying a Machine Learning Pipeline with Python
Introduction to Machine Learning Pipelines
This guide will walk you through creating a complete machine learning pipeline and deploying it as a web API using the FastAPI framework in Python.
Learning Objectives
By the end of this tutorial, you will be able to:
- Construct an end-to-end machine learning pipeline using PyCaret.
- Understand what model deployment entails.
- Develop an API with FastAPI to generate predictions for unseen data.
Understanding PyCaret
PyCaret is an open-source, low-code machine learning library written in Python that automates machine learning workflows. Its popularity stems from its user-friendly interface and the speed with which it lets you build and deploy end-to-end ML prototypes.
To get started with PyCaret, you can install it using the following command:
pip install pycaret
Exploring FastAPI
FastAPI is a modern and efficient web framework for building APIs with Python 3.6 or newer, relying on standard Python type hints. Its key attributes include:
- Speed: High performance comparable to NodeJS and Go (thanks to Starlette and Pydantic).
- Efficiency: Accelerates feature development by approximately 200-300%.
- Simplicity: Designed to be user-friendly, minimizing the time spent on documentation.
To install FastAPI, run:
pip install fastapi
Workflow Overview: PyCaret and FastAPI
Business Scenario
In this tutorial, we will reference a popular case study from the Darden School of Business. The story revolves around Greg, who aims to propose to Sarah with a diamond ring. To ensure Sarah's satisfaction, Greg gathers data on 6,000 diamonds, including attributes like price, cut, and color.
Dataset
The objective is to predict diamond prices based on various features such as carat weight, cut, and color. The dataset can be accessed from the PyCaret library.
from pycaret.datasets import get_data
data = get_data('diamond')
Exploratory Data Analysis
To visualize the relationship between independent features (like weight, cut, color, clarity) and the target variable (Price), we can create scatter plots and histograms.
import plotly.express as px
fig = px.scatter(data, x='Carat Weight', y='Price', facet_col='Cut', opacity=0.25, template='plotly_dark', trendline='ols', title='SARAH GETS A DIAMOND - A CASE STUDY')
fig.show()
Analyzing Price Distribution
To examine the target variable's distribution, we can plot histograms.
fig = px.histogram(data, x='Price', template='plotly_dark', title='Histogram of Price')
fig.show()
Given that the price distribution is right-skewed, applying a log transformation may help in achieving a more normal distribution.
import numpy as np
data_copy = data.copy()
data_copy['Log_Price'] = np.log(data['Price'])
fig = px.histogram(data_copy, x='Log_Price', title='Histogram of Log Price', template='plotly_dark')
fig.show()
Data Preparation
The setup function in PyCaret initializes the experiment and establishes the transformation pipeline based on parameters provided. This function must be executed prior to any other function calls.
from pycaret.regression import *
s = setup(data, target='Price', transform_target=True)
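Conceptually, transform_target fits the model against a transformed version of Price and maps predictions back to the original scale. The sketch below illustrates the idea with a plain log/exp pair on hypothetical prices; PyCaret's actual transformer (Box-Cox by default) differs in detail:

```python
import numpy as np

# Hypothetical prices on the original (right-skewed) scale
prices = np.array([500.0, 1200.0, 11600.0, 25000.0])

# The model is trained against the transformed target
log_prices = np.log(prices)

# Pretend the model predicted the transformed target perfectly,
# then invert the transform to get predictions back in dollars
preds = np.exp(log_prices)
print(preds)  # recovers the original prices
```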
Model Training and Evaluation
Once the data is prepared, we can start training with the compare_models function, which trains all available estimators, evaluates them with cross-validation, and ranks them by performance.
best = compare_models()
The CatBoost Regressor emerged as the best model based on Mean Absolute Error (MAE), achieving an MAE of $543 against an average diamond value of $11,600. That is an impressive result for the small amount of code involved.
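MAE is simply the average absolute difference between predicted and actual prices. A quick check with hypothetical numbers:

```python
import numpy as np

# Hypothetical actual and predicted diamond prices
y_true = np.array([10000.0, 12000.0, 11500.0])
y_pred = np.array([10500.0, 11400.0, 11800.0])

# Mean Absolute Error: average of the absolute errors
mae = np.abs(y_true - y_pred).mean()
print(mae)  # (500 + 600 + 300) / 3 ≈ 466.67
```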
plot_model(best, plot='residuals_interactive')
plot_model(best, plot='feature')
Finalizing and Saving the Pipeline
Next, we will finalize the best model by training it on the complete dataset and saving the pipeline as a pickle file.
final_best = finalize_model(best)
save_model(final_best, 'diamond-pipeline')
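Under the hood, save_model serializes the whole pipeline to disk with pickle (as diamond-pipeline.pkl) so it can be restored later with load_model. A minimal stand-in showing the round trip with the standard library, using a plain dict in place of the real fitted pipeline object:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline (the real object is a scikit-learn-style Pipeline)
pipeline = {"model": "CatBoostRegressor", "steps": ["impute", "encode", "estimate"]}

path = os.path.join(tempfile.gettempdir(), "diamond-pipeline.pkl")

# Serialize the object to disk
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# Restore it, as load_model would at serving time
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == pipeline)  # True
```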
Model Deployment
Deploying machine learning models involves making them accessible in production environments where web applications and APIs can utilize them. Predictions can be generated through batch processing or in real-time.
This section will demonstrate how to create an API using the FastAPI framework.
The initial lines of code involve basic imports, followed by initializing an app with FastAPI and loading the trained model.
from fastapi import FastAPI
import pandas as pd
from pycaret.regression import load_model, predict_model

app = FastAPI()

# load_model restores the saved pipeline (save_model appends the .pkl extension)
model = load_model('diamond-pipeline')

@app.post("/predict")
def predict(data: dict):
    # Turn the incoming JSON payload into a one-row DataFrame and score it with PyCaret
    input_df = pd.DataFrame([data])
    predictions = predict_model(model, data=input_df)
    # The prediction column is named 'prediction_label' in PyCaret 3.x ('Label' in 2.x)
    return {"prediction": float(predictions["prediction_label"].iloc[0])}
You can run this script using the following command in your command prompt (ensuring your script is in the same directory as the model):
uvicorn main:app --reload
This will start an API service on localhost, by default at http://localhost:8000. FastAPI also serves interactive documentation for the endpoint at http://localhost:8000/docs, which you can open in your web browser.
Utilizing the API
To make predictions from the API, you can use the requests library in Python. In under 25 lines of code, we have trained multiple models and deployed a machine learning pipeline as an API, which showcases how straightforward both PyCaret and FastAPI are.
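A client sketch using requests. The feature names below match the diamond dataset's columns; the values are illustrative, and the URL assumes the uvicorn server from the previous section is running locally:

```python
import requests

def get_prediction(payload, url="http://localhost:8000/predict"):
    # POST the feature payload as JSON and return the decoded response
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()

# Example payload; values are illustrative
payload = {
    "Carat Weight": 1.1,
    "Cut": "Ideal",
    "Color": "H",
    "Clarity": "SI1",
    "Polish": "VG",
    "Symmetry": "EX",
    "Report": "GIA",
}

# With the API running: print(get_prediction(payload))
```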
About the Author
I write about data science, machine learning, and PyCaret. For updates, feel free to follow me on social media platforms.
Chapter 2: Video Resources
Deploying ML Models in Production: An Overview
This video provides an insightful overview of deploying machine learning models in production environments.
How to Deploy Machine Learning Models into Production
In this video, you will learn practical steps to deploy your machine learning models effectively.