Scikit-learn Library

Scikit-learn is a Python library for machine learning. Although learning machine learning can be difficult, the scikit-learn library can ease the process. The Scikit-learn provides powerful datasets, machine learning models, preprocessing tools, feature selection techniques, and dimensionality reduction methods for data analysis. You can easily evaluate your machine learning model using scikit-learn's built-in metrics and evaluation tools. It is often used with other Python libraries. Seaborn, Matplotlib, Numpy, and Pandas libraries will be used with the scikit-learn library in the tutorial below, but you don't need to know them. If you want to learn about the Seaborn library, please visit the Seaborn tutorial. If you want to learn about the Pandas library, please visit the Pandas tutorial. If you want to learn about the Numpy library, please visit the Numpy tutorial. If you want to learn about the Matplotlib library, please visit the Matplotlib tutorial. If you are interested in a specific topic, you can just jump into that topic. You can use your editor to test your code, Visual Studio Code is used for the examples below. The scikit-learn 1.6.0 will be used for the tutorial below. You should be familiar with Python.

Scikit-learn Library Installation

You need to set up a virtual environment in Python. You need to install virtualenv. If you are using pip, run the command below:

pip install virtualenv

If you are using pip3 or pipx, use pip3 or pipx instead of pip.

You need to create a virtual environment in your Python project folder. If you are using pip, run the command below:

python -m venv new_env

If you are using python3, use python3 instead of python. We named the virtual environment "new_env" but you can choose another name.

You can activate the environment:

source new_env/bin/activate

If you are using pip, run the command below:

pip install -U scikit-learn

To install scikit-learn using conda, check the official website.

After the installation, you may need to close and reopen your folder.

To check the version of the scikit-learn library:

import sklearn
print(sklearn.__version__)

We will use the matplotlib, seaborn, pandas, and numpy libraries. If you are using pip, run the commands below:

pip install -U matplotlib

pip install seaborn

pip install pandas

pip install numpy

If you are using conda, run the commands below:

conda install matplotlib

conda install seaborn

conda install pandas

conda install numpy

Import the pandas and other libraries:

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

If you want to run the codes below without any installation, you can use Google Colab as well.

Scikit-learn Datasets

sklearn.datasets package offers different datasets. There are four types of sklearn.datasets: Toy datasets, Real world datasets, generated datasets, and other datasets. We will use some of them in the tutorial below. You need the dataset loaders to load Toy datasets and the dataset fetchers to load Real world datasets. For example, you need to use the load_iris() function to load the toy data, iris while you need to use the fetch_california_housing() function to load world real data, California Housing dataset. Both functions return a scikit-learn Bunch object, which includes key attributes to help you explore the dataset. When you load the dataset using the return_X_y parameter, it separates the feature variables and target values. The DESCR attribute provides a detailed description of the dataset. To learn more about using scikit-learn's datasets and understanding their key attributes, check out the tutorial below.

Supervised Learning

Supervised learning is a type of machine learning that trains models using labeled data, where input and output pairs are used to teach the model.

Linear Regression

Linear Regression fits a linear model with coefficients to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. The linear regression aims to find a linear relationship between variables. We try to predict the value of unknown variables using the linear model. The California Housing dataset from sklearn's Real world dataset will be used. You need to load the California Housing dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

housing = fetch_california_housing()

You can use feature_names and target_names to learn features and targets:

print(housing.feature_names)
print(housing.target_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
['MedHouseVal']

If you want to learn more about data, you can use the DESCR (print(housing.DESCR)) method. We will use only one feature, "MedInc" for the feature values (X). If you want to use more than one feature, you need to use a multiple linear regression model.

X = housing.data[:, 0]
y = housing.target

You can split data for training and testing. However, we will skip it in linear regression. You can learn how to split data and make predictions in the multiple linear regression section. You need to create the linear Regression model and fit the data:

regr = LinearRegression()
X = X.reshape(-1,1)
regr.fit(X, y)

The fit() method fits and trains the linear model. You can use the coef_ attribute for coefficients.

regr.coef_

Multiple Linear Regression

The multiple linear regression model is similar to the linear regression model, but you can use more than one variable to predict the target values. Scikit-learn's California Housing dataset will be used. We only chose one variable for the linear regression model. You can learn coefficients like in the simple linear regression example. We will use three variables for the multiple linear model: 'MedInc', 'HouseAge', 'AveBedrms'.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

housing = fetch_california_housing()
X = pd.DataFrame(housing.data)
X = X.loc[:, [0, 1, 3]]
y = housing.target

We use the pandas DataFrame and the loc property to extract the 'MedInc', 'HouseAge', and 'AveBedrms' columns. We will use these variables because they have the strongest correlations with the target variable.

train_test_split

We will split the data using the train_test_split function. train_test_split splits arrays or matrices into random train and test subsets. train_test_split is important for the evaluation of the model. test_size parameter in the example below represents the proportion of the dataset to include in the test split. As the test_size is 0.2, train_size is (1 - 0.2 =) 0.8 in the example below. random_state controls the shuffling applied to the data before applying the split.

x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state=6)

Let's create a Linear Regression model and fit the data. You can predict the data using test data:

regr = LinearRegression()
regr.fit(x_train, y_train)
pred = regr.predict(x_test)

The equation for the multiple linear regression is y = m₁x₁ + m₂x₂ + ... + m_nx_n + b . m refers to the coefficients, and b refers to the intercept. You can calculate the coefficients and the intercept:

print(regr.coef_)
print(regr.intercept_)

[0.43303067 0.01791794 0.03162695]
-0.1564276101990738

You can use the regr.score(x_test, y_test) method in scikit-learn to calculate the coefficient of determination (R²), which measures how well your regression model predicts the target values.

Logistic Regression

Logistic regression is another supervised machine learning algorithm that predicts the probability of a datapoint belonging to a specific category. The regularization is applied by default. The result ranges from 0 to 1. The scikit-learn's Wine recognition dataset and load_wine() -to load and return data- will be used.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

You can use the DESCR function to learn more about the data. When using return_X_y, you can easily access the feature values and target labels. Simply print the variable X to view the features and y to see the target values. Target values range from 0 to 2. They represent different classes. We will split the data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27)

We will create a Logistic Regression model.

lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

The syntax is very similar to Linear Regression models. solver algorithm is used (by default in Logistic Regression) to optimize the problem and the default value is "lbgfs". Although our code is right, we get the following error: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. We can try other solver options or we can standardize data:

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 20)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

Our model is now working and the accuracy score is 0.977. There are different scikit-learn metrics to evaluate our model. The accuracy score shows the percentage of correct predictions made by the model. You can see the other evaluation metrics for classification. The choice of the algorithm (solver) depends on the penalty chosen and on multiclass support. You can use "lbfgs", "newton-cg", "sag", and "saga" for multiclass problems and "liblinear" and "newton-cholesky" for binary classification. For more detailed information, you can visit scikit-learn page. You can also limit the number of iterations by using the max_iter parameter. You can find the classification_report of the model below:

print(classification_report(y_test, y_pred))

You can check the scikit-learn metrics to learn about metrics.

K-Nearest Neighbors

K-Nearest Neighbors Classifier

K-Nearest Neighbors Classifier is a classification algorithm. Classification is computed from a simple majority vote of the nearest neighbors of each point. You need to specify k, the number of neighbors and the model finds the closest k neighbors. We will use the Wine recognition dataset like in the previous model.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

We will split the data and create a KNeighborsClassifier:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27) classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)

The n_neighbors parameter is used to select the number of neighbors to classify the unknown data. The default value for k is 5. We fit the data and predict X_test. Let's evaluate the model:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

The result is 0.644. We can scale the feature data and change the number of neighbors to improve the model. We can manually change k and calculate the score. However, there's a better way to test different k values:

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

no_of_neighbor = []
scores = []
for k in range(1, 20):
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = accuracy_score(y_test, pred)
no_of_neighbor.append(k)
scores.append(score)

plt.plot(no_of_neighbor, scores)
plt.xlabel("number of neighbors")
plt.ylabel("accuracy score")
plt.show()

accuracy score for different k values in k nearest neighbors

The x-axis in the graph shows the number of neighbors, and y-axis shows the accuracy score. We can use another metric to evaluate our model. We need to select 5 neighbors to get the highest score.

We can use StandardScaler() to scale features. StandardScaler standardizes features by removing the mean and scaling to unit variance. Let's scale the data and change the n_neighbors:

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors = 5)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

The new accuracy score is 1.0.

K-Nearest Neighbors Regressor

The K-Nearest Neighbors Regressor is similar to the K-Nearest Neighbors Classifier but used for regression problems. The prediction of the algorithm is the average of the values of k nearest neighbors. We will use fetch_california_housing to fetch scikit-learn's California housing dataset. The default value for the weights parameter is uniform. We chose distance for the weight function to predict. It means closer neighbors of a query point will have a greater influence than neighbors which are further away.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 20)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

regressor = KNeighborsRegressor(n_neighbors = 7, weights = "distance")
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

We can evaluate the model using r2_score, mean_squared_error and mean_absolute_error. mean_squared_error is a risk metric corresponding to the expected value of the squared (quadratic) error or loss. mean_absolute_error is a risk metric corresponding to the expected value of the absolute error loss or -norm loss.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

r2 = r2_score(y_test, y_pred)
print(r2)

mse = mean_squared_error(y_test, y_pred)
print(f"mse: {mse}")

mae = mean_absolute_error(y_test, y_pred)
print(f"mae: {mae}")

0.7085820192410783
mse: 0.40138825542054024
mae: 0.42615182059800666

See scikit-learn metrics for more information.

Decision Tree Classifier

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the class (or category) of a target variable by learning simple decision rules inferred from the data features. The scikit-learn's DecisionTreeClassifier module helps to create, train, predict, and visualize a decision tree classifier. We will create a simple pandas DataFrame. The feature columns are "study", "sleep", and "class". "Pass" column has the target values: 1 (Pass) and 0 (Fail).

from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

pass_or_not = pd.DataFrame([
[3, 6, 0, 0],
[5, 8, 1, 1],
[4, 5, 0, 0],
[8, 7, 1, 1],
[10, 8, 1, 1],
[6, 4, 0, 0],
[7, 3, 0, 1],
[6, 3, 1, 0],
[5, 8, 0, 0],
[7, 3, 1, 1],
[5, 5, 0, 0],
[4, 7, 1, 0],
[6, 5, 1, 1],
], columns= ["study", "sleep", "class", "Pass"])

X = pass_or_not.iloc[:, :3 ]
y = pass_or_not["Pass"]

x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.2)

"study" shows studying hours, "sleep" shows sleeping hours, and "class" shows if a student attends classes (1 -yes, 0 -no). If the "Pass" value of the student is 1, it means the student passed. If the "Pass" value is 0, the student failed. You can use both numerical and categorical data. However, scikit-learn implementation doesn't support categorical data. Therefore, if you have categorical data, you need to transform the data.

dtree = DecisionTreeClassifier(random_state=12, max_depth=5)
dtree.fit(x_train, y_train)
y_pred = dtree.predict(x_test)

The syntax of DecisionTreeClassifier is similar to previous models. The max_depth parameter shows the maximum depth of the tree. We will plot the tree:

plt.figure(figsize=(10,6))
tree.plot_tree(decision_tree=dtree, feature_names= pass_or_not.columns, class_names={1: "Pass", 0:"Fail"}, filled=True)
plt.tight_layout()
plt.show()

How to interpret the decision tree

A node with no children, which represents the final classification (such as "Pass" or "Fail" in our example), is called a leaf node. The node with no children and the final classification (like "Pass" or "Fail" in our example) is called the leaf node. The first row of the node shows the feature. For example, the splitting feature of the root node is "class <= 0.5". If the feature is true for the sample, you should follow the "True" arrow. Otherwise, you should follow the "False" arrow. The gini shows the impurity at node. If the gini is equal to 0, it's 100% pure (samples belong to the same label/class). The samples value shows the number of samples. The class shows the label/classification.

We can create a DecisionTreeClassifier with scikit-learn's Wine recognition dataset as well:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

X, y = load_wine(return_X_y=True)
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 10)

classifier = DecisionTreeClassifier(max_depth=6, max_features=7)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
plt.figure(figsize=(10,6))
tree.plot_tree(decision_tree=classifier, feature_names= wine.feature_names, class_names=wine.target_names, filled=True)
plt.tight_layout()
plt.show()

The model is similar to the previous Decision Tree Classifier model. The max_feature parameter limits the number of features. You can check all the DecisionTreeClassifier parameters. You can calculate the accuracy_score of your decision tree.

Interpreting a decision tree can be complicated, you can use the node_ids parameter:

tree.plot_tree(decision_tree=classifier, feature_names= wine.feature_names, class_names=wine.target_names, filled=True, node_ids=True)

The node_ids parameter shows the ID number on each node. If you want to learn more about the decision tree, you can use the tree_ attribute. The tree_ attribute can be used to learn left children, right children, features, threshold, number of training samples reaching node, impurity, weighted number of training samples reaching node and the values:

classifier.tree_.children_left[i]
classifier.tree_.children_right[i]
classifier.tree_.feature[i]
classifier.tree_.threshold[i]
classifier.tree_.n_node_samples[i]
classifier.tree_.impurity[i]
classifier.tree_.weighted_n_node_samples[i]
classifier.tree_.value[i]

If you want to learn the attributes of a specific node, you need to replace i with the node ID. n_node_samples is the number of training samples reaching node i. impurity is the impurity at node i. weighted_n_node_samples is the weighted number of training samples reaching node i. If we train the first model using node_ids parameter, we get the following tree:

scikit-learn's decision tree with node ids

We can check the tree_ attributes of the decision tree above:

print(dtree.tree_.children_left[0])

1

print(dtree.tree_.children_right[1])

-1

print(dtree.tree_.feature[2])

1

print(dtree.tree_.threshold[3])

6.5

print(dtree.tree_.n_node_samples[4])

1

print(dtree.tree_.impurity[5])

0.0

print(dtree.tree_.weighted_n_node_samples[6])

3.0

print(dtree.tree_.value[3])

[[0.5 0.5]]

If you don't specify any node ID ( dtree.tree_.children_left), you can traverse the tree structure and get the left children of the tree.

Normalization (sklearn.preprocessing)

The sklearn.preprocessing package provides different functions to prepare and transform data for a machine learning model. Normalization is an important process to improve accuracy and performance. For example, features of your data can be on drastically different scales. If you don't normalize them, some features will dominate others. The normalization aims to transform features to be on a similar scale. We used StandardScaler in two different models for two different problems. It was applied to improve convergence in the Logistic Regression model and to enhance accuracy in the K-Nearest Neighbors (KNN) model. Normalization can help models better learn the relative importance (or weights) of each feature. There are different types of scaling methods. We will explore StandardScaler and MinMaxScaler.

StandardScaler

StandardScaler() standardizes features by removing the mean and scaling to unit variance. As mentioned earlier, we used StandardScaler in two different models. It solved the convergence problem in the logistic regression model and improved accuracy in the k-nearest neighbors classification model.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

MinMaxScaler

MinMaxScaler rescales the data set such that all feature values are in the range [0, 1]. We can use MinMaxScaler for the Wine recognition dataset to improve accuracy and get the same result.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

There are other scalers to normalize data. You can test them and choose the most suitable scaler for your data. See the full list of scikit-learn normalization functions.

Scikit-learn's evaluation metrics

Scikit-learn's evaluation metrics for classification

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

You can use the sklearn.metrics module to measure the performance of your module. The sklearn.metrics module offer different metrics to evaluate a classification model. You can check the full list of scikit-learn metrics.

Accuracy score

The accuracy score is one of the most important metrics for classification models to measure the accuracy of the model, and we already used it in the example.

Accuracy Score = (TP + TN) / (TP + TN + FP + FN)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

The best performance is 1.

Recall score

Scikit-learn's another evaluation metric for classification is recall.

Recall = TP / (TP + FN)

from sklearn.metrics import recall_score
recall_score(y_test, y_pred)

The best performance is 1, and the worst performance is 0.

Precision score

You can also use precision to measure the performance of your model.

Precision = TP / (TP + FP)

from sklearn.metrics import precision_score
precision_score(y_test, y_pred)

The best performance of precision is 1, and the worst performance is 0.

F1 Score

The F1 score (F-score or F-measure) combines both precision and recall into a single statistic.

F1-Score = (2 x precision x recall) / (precision + recall)

from sklearn.metrics import f1_score
f1_score(y_test, y_pred)

Classification report

The classification_report is used to build a text report showing the main classification metrics.

from sklearn.metrics import classification_report
classification_report(y_true, y_pred, target_names=target_names)

Using target_names parameter is optional.

You can read the classification report of the logistic regression model that we created.

A classification report displays precision, recall, f1score, accuracy, and support scores of a model. Support refers to the number of actual occurrences of the class in the dataset.

Confusion matrix

The confusion matrix computes confusion matrix to evaluate the accuracy of a classification.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_true, y_pred))

Confusion matrix can be difficult to understand. You can learn how to interpret a confusion matrix with an example.

Scikit-learn's evaluation metrics for regression

sklearn.metrics module has evaluation metrics for regression models as well. We already tested our models using mean_squared_error and mean_absolute_error. mean_squared_error is a non-negative floating point value and the best score is 0.0.
mean_absolute_error is a non-negative floating point value and the best score is 0.0.
Another important evaluation metric is r2_score. r2_score is the regression score function and the best value is 1.0. It can be negative as well.

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

We imported mean-squared_error, mean_absolute_error, and r2_score separately to show how to import the metrics, but you can import them together: from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score. If you want to learn other metrics for regression models, read the scikit-learn's metrics page.

Feature selection methods

Feature selection can be helpful to improve efficiency and accuracy. You can remove redundant features and improve your model's performance. There are different feature selection methods. In addition to Scikit-learn's sklearn.feature_selection module, we will use mlxtend library to compare different feature selection methods in the tutorial. Filter methods are provided by the sklearn.feature_selection module. You can use mlxtend's wrapper methods. Scikit-learn also provides embedded methods for feature selection.

Filter Methods

The sklearn.feature_selection module provides different methods for selecting features. You can test the methods below and choose the best feature selection method for your model.

Variance Threshold

VarianceThreshold is a feature selector that removes all low-variance features. Only features are used in VarianceThreshold; target values are not involved. Therefore, it can be used for unsupervised learning. The method can be used only on numerical values. The threshold parameter represents the threshold that removes the features that have the same or lower value in all samples. The default value is 0. We will be using Scikit-learn's wine dataset.

from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

wine = load_wine()
df = pd.DataFrame(wine.data, columns= wine.feature_names)
print(df.columns)

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'], dtype='object')

X = wine.data
selector = VarianceThreshold(threshold=0.2)
selected = selector.fit_transform(X)
col_names = df.columns[selector.get_support(indices=True)].to_list()
print(col_names)

['alcohol', 'malic_acid', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'proanthocyanins', 'color_intensity', 'od280/od315_of_diluted_wines', 'proline']

We printed the features to see the difference before and after the feature selection. "ash", 'nonflavanoid_phenols', and 'hue' features are removed.

Pearson's Correlation Method

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataset. You can use Pandas corr() method for the correlation among features. For example, if two features are highly correlated, you can remove one of them. You can use the corr() method for the correlation between features and targets. We will use the California housing dataset to see the correlation between features and the target:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
housing = fetch_california_housing()

X, y = fetch_california_housing(return_X_y=True)
df = pd.DataFrame(housing.data, columns= housing.feature_names)
df['MedHouseVal'] = y

corr_matrix = df.corr()
corr_target = corr_matrix[['MedHouseVal']].drop(labels=['MedHouseVal'])

sns.heatmap(corr_target, annot=True, cmap='RdBu_r')
plt.title("Housing dataset")
plt.show()

pearson correlation method for feature selection

Pearson is the default correlation method. You can use the corr() method only for numerical values. If you have categorical features, you can make the numeric_only parameter True. It will ignore non-numeric values. The dataset we used in the example above has only numerical values.

The heatmap above shows us the correlation between features and target variable. If you want to learn how to make a heatmap, check the seaborn heatmaps. "MedInc" has the highest correlation with the target. "AveOccup" has the lowest correlation with the target. If you want to learn more about the features and the target, you can also calculate f_regression:

from sklearn.feature_selection import f_regression
f_statistic, p_values = f_regression(X, y)

f_regression returns F-statistic and p-values. r_regression is used to calculate f_regression. For more information about r_regression, see r_regression. F-statistics and p-values should be consistent with the Pearson correlation matrix. "MedInc" should have the highest F-statistic and lowest p-value.

SelectKBest

SelectKBest selects features based on the top k highest scores. It takes two main parameters: score_func and k.
score_func is a scoring function that takes two arrays (features and targets) and returns either a single array of scores or a pair of arrays (scores, p-values). The default is f_classif, which is suitable only for classification tasks. You can choose different scoring functions depending on the type of problem. Refer to the available scoring functions in the documentation to choose the right one for your task. k specifies the number of top features to select. The default value is 10. We used the Logistic Regression model with Scikit-learn's wine dataset in the example above. We will use SelectKBest with the same model and the same dataset:

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, mutual_info_classif
import pandas as pd

X, y = load_wine(return_X_y=True)
wine = load_wine()

df = pd.DataFrame(wine.data, columns= wine.feature_names)
select = SelectKBest(f_classif, k=6)
select.fit_transform(X, y)

We selected (top) 6 features and f_classif for func_score. You can use mutual_info_classif for func_score instead of f_classif. You can see the Logistic Regression example to learn how to calculate the accuracy score.

You can use get_support to print the names of selected features:

features = df[df.columns[select.get_support(indices=True)]]
print(features.head())

Wrapper Methods

Wrapper methods select features by evaluating the performance of models on different subsets of features. In other words, wrapper methods focus on the performance of the model in selecting features. Filter methods do not evaluate the model's performance, they measure the relevance of features. We will use Python's mlxtend library to implement the following wrapper methods: Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), Sequential Forward Floating Selection (SFFS), Sequential Backward Floating Selection (SBFS). You need to install mlxtned library. If you are using pip, run the following command:

pip install mlxtend

If you are using conda, run the following command:

conda install mlxtend

Sequential Feature Selection algorithms aim to select the most relevant features and reduce the number of features. It can improve computational efficiency. Removing irrelevant features can prevent generalization errors as well. Sequential Feature Selection can be used for both classification and regression tasks. We will be using Scikit-learn's wine recognition dataset. The features of the wine recognition dataset are: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'].

Sequential Feature Selection Methods

Sequential Forward Selection (SFS) Method

The Sequential Forward Selection (SFS) method starts with no feature and adds one feature at a time. We will be using the Logistic Regression example with wine data. Let's see how Sequential Forward Selection works:

from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

wine = load_wine()
X = wine.data
y = wine.target

scaler = StandardScaler()
X = scaler.fit_transform(X)
df = pd.DataFrame(X, columns=wine.feature_names)
lr = LogisticRegression(max_iter=300)
sfs = SFS(lr,
k_features=5,
forward=True,
floating=False,
scoring="accuracy",
cv=2)
sfs.fit(df, y)
print(sfs.subsets_)

{'feature_idx': (6,), 'cv_scores': array([0.74157303, 0.84269663]), 'avg_score': np.float64(0.7921348314606742), 'feature_names': ('flavanoids',)}
{'feature_idx': (0, 6), 'cv_scores': array([0.88764045, 0.96629213]), 'avg_score': np.float64(0.9269662921348314), 'feature_names': ('alcohol', 'flavanoids')}
{'feature_idx': (0, 6, 12), 'cv_scores': array([0.8988764 , 0.97752809]), 'avg_score': np.float64(0.9382022471910112), 'feature_names': ('alcohol', 'flavanoids', 'proline')}
{'feature_idx': (0, 6, 9, 12), 'cv_scores': array([0.94382022, 0.98876404]), 'avg_score': np.float64(0.9662921348314606), 'feature_names': ('alcohol', 'flavanoids', 'color_intensity', 'proline')}
{'feature_idx': (0, 6, 9, 10, 12), 'cv_scores': array([0.97752809, 0.98876404]), 'avg_score': np.float64(0.9831460674157303), 'feature_names': ('alcohol', 'flavanoids', 'color_intensity', 'hue', 'proline')}

You can follow the selected feature subsets above to understand how Sequential Forward Selection (SFS) works. k_features represents the number of features. As we want to select features by SFS, the forward parameter should be True. The scoring parameter is accuracy. cv parameter is used for cross-validation. If you choose cv=0, you will get the following error: SmallSampleWarning: One or more sample arguments is too small; all returned values will be NaN. See documentation for sample size requirements. Therefore, we chose cv=2.

feature_idx shows the index of feature(s). You can find them in the features list of the wine dataset. feature_names shows the name of the features. cv_score and avg_score represent the cross-validation and average score. We chose 5 features to see the subsets. However, if you want to find the best score, your number of features should be the total number of features for SFS. We can make k=13 and plot a graph for features and their score:

plot_sfs(sfs.get_metric_dict())
plt.show()

sequential forward feature selection graph

We imported plot_sequential_feature_selection from mlxtend.plotting to plot the results. The get_metric_dict() method is used to display the results of feature selection in a pandas DataFrame format for easier visualization and analysis.

The best values for k are 9 and 10 (the avg_score is 0.98) in the example above. After finding the k value for the best score, you can change the k value.

Sequential Forward Floating Selection (SFFS) Method

Sequential Forward Floating Selection (SFFS) is an extension of Sequential Forward Selection (SFS). After each feature is added, SFFS checks whether removing or adding a feature can further improve performance. If performance improves, the change is kept; otherwise, the current feature subset remains unchanged. If you want to use SFFS for feature selection, floating parameter should be True.

Sequential Backward Selection (SBS) Method

Sequential Backward Selection (SBS) method is similar to Sequential Forward Selection. However, Sequential Backward Selection starts with all available features and removes one feature at a time. We will use the previous example to compare Sequential Backward Selection with Sequential Forward Selection.

from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

wine = load_wine()
scaler = StandardScaler()
X = wine.data
X = scaler.fit_transform(X)
y = wine.target
df = pd.DataFrame(X, columns=wine.feature_names)
lr = LogisticRegression(max_iter=100)
sbs = SFS(lr,
k_features=5,
forward=False,
floating=False,
scoring='accuracy',
cv=2)
sbs.fit(df, y)
print(sbs.subsets_)

{13: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 'cv_scores': array([0.96629213, 0.98876404]), 'avg_score': np.float64(0.9775280898876404), 'feature_names': ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline')},
12: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12), 'cv_scores': array([0.97752809, 0.98876404]), 'avg_score': np.float64(0.9831460674157303), 'feature_names': ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline')},
11: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 12), 'cv_scores': array([0.97752809, 0.98876404]), 'avg_score': np.float64(0.9831460674157303), 'feature_names': ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'proanthocyanins', 'color_intensity', 'hue', 'proline')},
10: {'feature_idx': (0, 1, 2, 3, 4, 6, 8, 9, 10, 12), 'cv_scores': array([0.97752809, 1. ]), 'avg_score': np.float64(0.9887640449438202), 'feature_names': ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'proanthocyanins', 'color_intensity', 'hue', 'proline')},
9: {'feature_idx': (0, 2, 3, 4, 6, 8, 9, 10, 12), 'cv_scores': array([0.97752809, 0.98876404]), 'avg_score': np.float64(0.9831460674157303), 'feature_names': ('alcohol', 'ash', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'proanthocyanins', 'color_intensity', 'hue', 'proline')},
8: {'feature_idx': (0, 2, 3, 4, 6, 9, 10, 12), 'cv_scores': array([0.98876404, 0.97752809]), 'avg_score': np.float64(0.9831460674157303), 'feature_names': ('alcohol', 'ash', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'color_intensity', 'hue', 'proline')},
7: {'feature_idx': (0, 2, 3, 4, 6, 10, 12), 'cv_scores': array([0.98876404, 0.98876404]), 'avg_score': np.float64(0.9887640449438202), 'feature_names': ('alcohol', 'ash', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'hue', 'proline')},
6: {'feature_idx': (0, 2, 3, 6, 10, 12), 'cv_scores': array([0.98876404, 0.98876404]), 'avg_score': np.float64(0.9887640449438202), 'feature_names': ('alcohol', 'ash', 'alcalinity_of_ash', 'flavanoids', 'hue', 'proline')},
5: {'feature_idx': (0, 2, 6, 10, 12), 'cv_scores': array([0.96629213, 0.96629213]), 'avg_score': np.float64(0.9662921348314607), 'feature_names': ('alcohol', 'ash', 'flavanoids', 'hue', 'proline')}}

The selected feature subsets above show how Sequentail Backward Selection (SBS) works. It removes one feature at a time until it reaches (5 in the example above) the desired number of feature.

We imported plot_sequential_feature_selection from mlxtend.plotting to plot the results. The get_metric_dict() method is used to display the results of feature selection in a pandas DataFrame format for easier visualization and analysis.

plot_sfs(sbs.get_metric_dict())
plt.show()

sequential backward feature selection graph

The best feature selection:

{'feature_idx': (0, 2, 3, 6, 10, 12), 'cv_scores': array([0.98876404, 0.98876404]), 'avg_score': np.float64(0.9887640449438202), 'feature_names': ('alcohol', 'ash', 'alcalinity_of_ash', 'flavanoids', 'hue', 'proline')}

Sequential Backward Floating Selection (SBFS) Method

Sequential Backward Floating Selection (SBFS) is just like Sequential Backward Selection, but after each addition, it checks to see if it can improve performance by removing (or adding) a feature. If the result is better, it removes or adds a feature. Otherwise, the feature subset remains the same. If you want to use SBFS for feature selection, floating parameter should be True.

Recursive Feature Elimination (RFE) Method

Recursive Feature Elimination (RFE) starts by training a model with all available features. RFE ranks all features based on feature importance or model coefficients, then removes the least important feature. This process repeats until the desired number of features—specified by the n_features_to_select parameter—is reached. While RFE is similar to Sequential Backward Selection (SBS), the key difference is that SBS evaluates multiple subsets at each step, whereas RFE trains the model on only a single subset during each iteration. Therefore, RFE is computationally less complex and faster than SBS. You need to import RFE from sklearn.feature_selection module.

from sklearn.feature_selection import RFE
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
wine = load_wine()
X = wine.data
y = wine.target
names = wine.feature_names
print(wine.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

lr = LogisticRegression(max_iter=300)
rfe = RFE(lr, n_features_to_select=6)
rfe.fit(X, y)
print(rfe.ranking_)
print(rfe.support_)
rfe_features = [f for (f, support) in zip(names, rfe.support_) if support]
print(rfe_features)

[1 5 2 3 8 6 1 7 4 1 1 1 1]
[ True False False False False False True False False True True True True]
['alcohol', 'flavanoids', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

rfe.ranking_ shows the feature rankings. The features with index 1 are kept in the same index. The feature ranking corresponds to the ranking position of the i-th feature. For example, magnesium (ranking feature - 8) is the least important feature and the first feature to be removed.
rfe.support_ shows if the features are selected or removed. If the feature is selected, it returns True. It is consistent with the results of rfe.ranking_ in our example above.
rfe_features shows the selected features after recursive elimination.

If the "ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT." error occurs, you'd better scale your data using StandardScaler instead of increasing max_iter.

Unsupervised Learning

Unsupervised Learning is a machine learning model that analyzes unlabeled data. It is used to find patterns and relationships in the data. Clustering and dimensionality reduction are some of the unsupervised machine learning techniques.

K-Means Clustering

Clustering is a common unsupervised learning technique. K-Means Clustering is an unsupervised machine learning technique that organizes unlabeled data into groups/clusters based on their similarity. K representss the number of clusters in a clustering algorithm; it also refers to the number of centroids to be generated. Means refers to the average distance of data to each cluster center; you should try to minimize it. We'll use Scikit-learn's iris dataset. As there are 3 classes ['setosa', 'versicolor', 'virginica'] in the iris dataset, we will use 3 clusters. We will use the first two features of the iris dataset in the graph. You can use another dataset. If you want to use another dataset and your dataset has many features, you can use one of the feature selection methods we learned.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris()
X,y= load_iris(return_X_y=True)
df = pd.DataFrame(iris.data, columns= iris.feature_names)
target = iris.target

model = KMeans(n_clusters=3, random_state=0)
model.fit(X)
labels = model.predict(X)
plt.scatter(X[:,0], X[:,1], c=target)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()

Elbow method

If you're unsure about the number of clusters, you can use the elbow method. The elbow method is a graphical method for finding the optimal K value in a k-means clustering algorithm. This method relies on the Within-Cluster Sum of Squares (WCSS), which measures the total squared distance between each point in a cluster and its corresponding centroid. Let's try the elbow method for Scikit-learn's iris dataset:

num_clusters = [1,2,3,4,5,6,7,8]
inertias = []
features = iris.data

for i in num_clusters:
model = KMeans(n_clusters=i)
model.fit(features)
inertias.append(model.inertia_)

plt.plot(num_clusters, inertias, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

Inertia is the distance from each sample to the centroid of its cluster. The optimal K value for our example is 3 because the rate of decrease in inertia starts to slow down at 3.

Limitations of K-Means Clustering

K-Means clustering has some limitations and may not always be the best choice for your data. For example, the accuracy score in the previous example above was very low, and your model might face similar issues. Therefore, it's important to keep these limitations in mind:

K-Means is sensitive to the initial placement of cluster centers.
Choosing the optimal value of K is crucial.
It assumes clusters are spherical (round) and roughly the same size.
Outliers can negatively impact the results.

If your data meets the assumptions, K-Means is likely to perform well.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that uses Singular Value Decomposition (SVD) to project data into a lower-dimensional space. It's a common unsupervised learning technique for large datasets. Despite dimensionality reduction, it retains the essential data. The input data is centered but not scaled for each feature before applying LogisticRegression. We will use Scikit-learn's iris dataset to compare the results with K-Means Clustering.

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 20)

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(accuracy_score(y_test, y_pred))

The accuracy score is 0.94. You can learn singular values:

print(pca.singular_values_)

[21.94910388 5.31925179]

We can also visualize the PCA algorithm:

plt.figure(figsize=(8, 6))
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Scikit Learn's iris dataset with Principal Component Analysis

Supervised Learning II

Naive Bayes Classifier

The Naive Bayes is a supervised learning method based on applying the theorem of Bayes with strong feature independence assumptions to make predictions and classifications. There are different Naive Bayes methods. We will explore Scikit-learn's MultinomialNB classifier and GaussianNB algorithm. If you want to learn Scikit-learn's other Naive Bayes classifiers, see the sklearn.naive_bayes page. We will have a quick review of the Bayes Theorem:

P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A when event B occurs
P(B|A) is the probability of event B when event A occurs
P(A∩B)=P(A) * P(B)
P(A|B)= P(B|A) * P(A) | P(B)

The Naive Bayes Theorem assumes that features are independent.

Gaussian Naive Bayes is suitable for continuous data, especially when the features follow a Gaussian (normal) distribution. We will apply GaussianNB on iris data:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

The accuracy score is 0.96.

You can find the model's predicted target values (PTV) and the estimated target values (ETV) below. We can use sklearn's confusion_matrix and ConfusionMatrixDisplay to calculate and display the confusion matrix:

print(f"PTV: {y_test}")
print(f"ETV: {y_pred}")

PTV: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0]
ETV: [2 1 0 2 0 2 0 1 1 1 1 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0]

[[ 11 0 0]
[ 0 13 0]
[ 0 1 5]]

multiclass classification confusion matrix

If you want to learn how to interpret the confusion matrix, see the Scikit-learn's confusion matrix.

MultinomialNB

MultinomialNB is a Naive Bayes classifier for multinomial models. The multinomial Naive Bayes classifier is suitable for classification with discrete features. We will use Scikit-learn's MultinomialNB for text classification. We also need to use CountVectorizer to convert text data into numerical features for text classification tasks.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

We will use positive and negative hotel reviews for text classification. Our data below have text and review features. Positive reviews are represented by 1; negative reviews are represented by 0.

reviews = [{"text": "Loved everything about this hotel!", "review": 1},
{"text": "Beds were super comfy", "review": 1},
{"text": "Room came with complimentary snacks", "review": 1},
{"text": "Perfect", "review": 1},
{"text": "Very good", "review": 1},
{"text": "Superb", "review": 1},
{"text": "Breakfast was excellent", "review": 1},
{"text": "Highly recommended", "review": 1},
{"text": "Great breakfast", "review": 1},
{"text": "My room was dirty", "review": 0},
{"text": "Too expensive", "review": 0},
{"text": "Rooms are old", "review": 0},
{"text": "The air condition wasn't working", "review": 0},
{"text": "The hotel was too noisy", "review": 0},
{"text": "Location was great", "review": 1},
{"text": "Rude staff", "review": 0},
{"text": "Unpleasant Room smell", "review": 0},
{"text": "Uncomfortable bed", "review": 0},
{"text": "Overpriced room", "review": 0},
{"text": "Exceptional", "review": 1},
]

We will create two separate lists: one for positive adjectives and another for negative adjectives used in hotel reviews. The model will use positive and negative lists to decide whether reviews are negative or positive.

negative = ["terrible", "horrible", "painful", "disgusting", "boring", "disappointing", "bad", "dirty", "old", "rude", "uncomfortable", "overpriced"]
positive = ["classic", "great", "awesome", "hilarious", "sensitive", "powerful", "heartwarming", "breathtaking", "epic", "thrilling", "best", "good", "exceptional"]

list = pd.DataFrame(reviews)
comments = list.get("text")
comments = np.array(comments)
labels = list.get("review")
labels = np.array(labels)

counter = CountVectorizer()
counter.fit(negative + positive)
review_counts = counter.transform(comments)
training_counts = counter.transform(negative + positive)
training_labels = [0] * 12 + [1] * 13

classifier = MultinomialNB()
classifier.fit(training_counts, training_labels)
rev_counts = counter.transform(comments)

y_pred = classifier.predict(rev_counts)
y_proba = classifier.predict_proba(rev_counts)

from sklearn.metrics import accuracy_score
print(accuracy_score(labels, y_pred))
print(y_proba)

The accuracy score is 0.8.

[[0.48 0.52 ]
[0.48 0.52 ]
[0.48 0.52 ]
[0.48 0.52 ]
[0.32157969 0.67842031]
[0.48 0.52 ]
[0.48 0.52 ]
[0.48 0.52 ]
[0.32157969 0.67842031]
[0.65470208 0.34529792]
[0.48 0.52 ]
[0.65470208 0.34529792]
[0.48 0.52 ]
[0.48 0.52 ]
[0.32157969 0.67842031]
[0.65470208 0.34529792]
[0.48 0.52 ]
[0.65470208 0.34529792]
[0.65470208 0.34529792]
[0.32157969 0.67842031]]

The predict_proba method returns probability estimates for the test vector X. For example, if the probabilities for the first review are [0.48, 0.52], the first number (0.48) represents the probability that the review is class 0 (negative), and the second number (0.52) represents the probability that it is class 1 (positive). Since 0.52 is higher, the model predicts the review as positive.
In summary, predict_proba returns class probabilities, while predict returns the final predicted class based on those probabilities.

We can evaluate our model with another important classification metric: the confusion matrix. The confusion matrix computes the confusion matrix to evaluate the accuracy of a classification.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(labels, y_pred))

[[ 5 4]
[ 0 11]]

The interpretation of the Scikit-learn's confusion matrix can be confusing. The Scikit-learn's confusion matrix in binary classification is formulated as follows: the count of true negatives is C_0,0, false negatives is C_1,0, true positives _1,1, and false positives C_0,1. If you need to review the terms, you can check the table. Let's remember the estimated target values (ETV) and correct target values (CTV) to understand the confusion matrix above:

ETV: [1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 0 1]
CTV: [1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1]

You should keep in mind that scikit-learn's confusion_matrix is different from the classification_report in the evaluation metrics. In our example, we labeled negative reviews as 0 and positive reviews as 1, meaning that 0 represents negative sentiment and 1 represents positive sentiment. The total number of true negatives—cases where both the actual and predicted labels are 0—is 5 in this case. This value appears in the confusion matrix at position [0, 0], which corresponds to row 0 and column 0.
If your model is performing multiclass classification, the structure of the confusion matrix follows a consistent pattern: the rows represent the true (actual) target values in ascending order, and the columns represent the predicted target values, also in ascending order from left to right.
You can visualize the confusion matrix using ConfusionMatrixDisplay, as shown in the example above.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(labels, y_pred, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)
disp.plot()
plt.show()

You can also see the confusion matrix of the previous (GaussianNB) example using iris data for multiclass classification.

If the number of True Positives and False Negatives (TP + FN) is equal to 0, you will get the following warning: "UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior." You need to use the zero_division parameter to set a value to return when there is a zero division.

classification_report(labels, y_pred, zero_division=0)

Support Vector Machines (SVM)

The Support Vector Machine (SVM) is another important supervised machine learning model used for classification and regression tasks. A SVM defines a decision boundary using labeled data to classify unlabeled points. The distance between the support vectors and the decision boundary is called the margin. The support vectors are the points in the training set closest to the decision boundary. If there are n number of features, there should be n+1 support vectors. Scikit-learn's sklearn.svm offers support vector machine algorithms, and we will use SVC, Support Vector Classifier (C-Support Vector Classification). SVC is a powerful algorithm and it's easy to use. The implementation is based on libsvm. Let's see how to make a model with SVC:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X,y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = SVC(kernel="rbf", gamma =0.1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

The accuracy_score is 1.

We split and scale the data in the Support Vector Machine example above. We create an SVC object. We used kernel and gamma parameters. The kernel parameter specifies the kernel type to be used in the algorithm. The default value is "rbf". We mentioned earlier that the number of features (and the distribution of data) is important in SVM. SVM aims to find the optimal hyperplane to separate data and maximize the margin to prevent false classification. Therefore, the kernel parameter offers the following kernel types: "linear", "poly", "rbf", "sigmoid", and "precomputed". The kernel transforms the data into a higher dimension. The polynomial kernel uses the polynomial function, while the rbf kernel uses the Radial Basis Function to transform data into a higher dimension. The rbf kernels are used when the data is unevenly distributed. Before explaining gamma, we'll explore the C parameter. The C parameter is the regularization parameter of SVM and is used to control misclassification. In other words, it is the cost of misclassification. The strength of the regularization is inversely proportional to C. The default value is 1.0, and we didn't change it. If C is small, the SVM has a soft margin. Although the margin is large, there will be classification errors. The gamma parameter is similar to the C parameter. The gamma parameter is the kernel coefficient for "rbf", "poly", and "sigmoid". It can be "scale", "auto", or "float" (but not negative). If the gamma is large, the model is more sensitive to the changes in training data.

You can learn the support vectors using the support_vectors attribute:

print(classifier.support_vectors_)

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) is a dimensionality reduction technique used in classification problems. It is similar to Principal Component Analysis (PCA), but LDA is a supervised learning machine technique. LDA maximizes the distance between classes and minimizes the distance within each class. We will use Scikit-learn's sklearn.discriminant_analysis module. According to Scikit-learn, sklearn.discriminant_analysis is a classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule. It assumes that all classes share the same covariance matrix, and the model fits a Gaussian density to each class. Let's see how to use sklearn.discriminant_analysis:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 20)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lda = LinearDiscriminantAnalysis(n_components= 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

The score is 0.98. The n_components is 2 because there are 3 classes in the wine dataset, and it should be n_classes - 1. As mentioned earlier, LDA has some assumptions. The number of the smallest class should be greater than the number of variables as well. LDA is a powerful dimensionality reduction technique, but your data should meet the assumptions.

Regularization and Hyperparameters

Regularization

Regularization is a technique used in Machine Learning to minimize overfitting. A good model should perform well both on training and test data. Overfitting occurs when a model performs well on training data but fails to perform well on unseen data. There are different techniques to prevent/minimize overfitting. Lasso, Ridge, and ElasticNet are the most common regularization techniques. Lasso (L1, the least absolute shrinkage) regularization is a linear regression technique used for regularization. Both Lasso and Ridge regression penalize overfitting. L1 penalty refers to the sum of the absolute values of the coefficients multiplied by alpha (parameter for regularization). Ridge regression uses the L2 penalty for regularization. L2 penalty is the sum of the squared coefficient values multiplied by alpha. The magnitude of coefficients shrinks during the process. Coefficients can be zero in Lasso, while coefficients do not shrink to zero in Ridge regression. ElasticNet is a Linear regression with combined L1 and L2 priors as regularizers. We will create a model with Lasso using sklearn.linear_model:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

X,y = fetch_california_housing(return_X_y=True)
housing = fetch_california_housing()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

lasso = Lasso(alpha=0.04)
lasso.fit(X_train, y_train)
train_pred = lasso.predict(X_train)
test_pred = lasso.predict(X_test)

from sklearn.metrics import mean_squared_error
training_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)

We used sklearn's California housing dataset for the example above. alpha is the constant that multiplies L1. You can test different values for alpha to improve performance. The performance of the example above is not as good as the previous examples. However, it's a good example to see how Lasso works. You can use the same syntax for Ridge regression and ElasticNet. The performances of Ridge and ElasticNet for the example above are slightly better. The mean squared error of training data is close to the mean squared error of test data, and they are both high. The model does not learn enough from the data, and it does not make correct predictions.

Regularization is applied by default in Logistic Regression. Logistic Regression implements regularized logistic regression using the "liblinear" library, "newton-cg", "sag", "saga", and "lbfgs" solvers. The default penalty used in the Logistic regression is l2. However, some penalties may not work with some solvers. For more information, see the Scikit-learn page. You can see the syntax of Logistic regression with an l1 penalty:

lr = LogisticRegression(penalty = 'l1', solver = 'liblinear')

You can use the liblinear solver with both l1 and l2 penalties. You can change the solver and the penalty. As mentioned earlier, some penalties may not work with some solvers. For example, the default solver, "lbfgs" can only work with l2. See the Logistic Regression tutorial for more information about Logistic Regression parameters.

Hyperparameter tuning

Hyperparameters are Machine Learning parameters selected before the training process. They are used to control the learning process. K in the K nearest neighbor, and alpha in Lasso regression are good examples of hyperparameters. The bias occurs when the model can't make successful predictions. The bias represents the difference between a model's predictions and correct values. The variance refers to the dependence on training data. High variance occurs when the model performs well on training data but performs poorly on unseen data. High variance leads to overfitting. The bias-variance tradeoff refers to the relationship between variance and bias. We should minimize both variance and bias. Hyperparameter tuning is the process of finding optimal hyperparameters for an algorithm. You can search for the optimal hyperparameters manually. You can also use sklearn's GridSearchCV method.

GridSearchCV

GridSearchCV is a technique in machine learning that performs an exhaustive search over a specified parameter grid to find the best hyperparameters for a given estimator. It uses cross-validation to evaluate different combinations of parameters and select the one that optimizes model performance. param_grid is a dictionary where the keys are parameter names, and the values are lists of settings to try. The scoring parameter defines the evaluation metric used to measure model performance on the validation folds during the grid search. You can find the full list of score names. You should choose the correct scoring parameter for the estimator. GridSearchCV can be hard to understand. Let’s explore how GridSearchCV functions through the example below.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression()
model = GridSearchCV(estimator = classifier, param_grid=[{'solver': ["lbfgs", "newton-cg"], "penalty": ["l2", None]}], scoring="accuracy", cv = 5)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(model.best_params_)
print(model.best_score_)

The best_params_ is the parameter setting that gave the best results on the hold out data. The best_params_ in the example above are {'penalty': 'l2', 'solver': 'lbfgs'}. The best_score_ parameter is the mean cross-validated score of the best_estimator. The best_score_ is 0.96. To get the best_score_ in the example above, the solver should be "lbfgs", and the penalty should be "l2". As the scoring is accuracy, the GridSearchCV looks for the most accurate results. The param_grid in the example above has solvers and penalties. Options for solvers are "lbfgs" and "newton-cg". The penalty can be either "l2" or none because "lbfgs" and "newton-cg" don't work with "l1".

Let's see another GridSearchCV example using SVC:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=40)

model = GridSearchCV(estimator = SVC(), param_grid={'C': [1, 5, 10, 20], 'kernel': ( 'rbf', 'linear','poly', 'sigmoid')}, scoring = 'accuracy', cv = 5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.best_params_)
print(model.best_score_)

The best_params_ are {'C': 5, 'kernel': 'rbf'}. The best_score_ is 0.99. The param_grid has "C" and "kernel" parameters. Kernel values are "rbf", "linear", "poly", and "sigmoid". You can read the sklearn's parameters and attributes for GridSearchCV.

If you want to learn more about SVC, see the Support Vector Machines tutorial.

Ensemble Methods

Ensemble Methods combine multiple learning algorithms to create a better model. They can be used to improve accuracy and minimize overfitting. Boosting, bagging, and stacking are the most common techniques. The random forest is the most popular ensemble method (and bagging) example. There are different definitions and techniques for ensemble methods. We will refer to Scikit-learn definitions and explanations for classifiers and regressors. The scores/results may change in ensemble learning methods each time.

Bagging Method

If samples are drawn with replacement, the ensemble method is called Bagging.

Random Forests

Random forest is one of the most widely used ensemble learning techniques in machine learning. It is a prime example of the bagging (Bootstrap Aggregating) method, where multiple decision trees are trained on random subsets of the data and combined to improve accuracy and reduce overfitting. It's a supervised learning model, and it can be used for both classification and regression tasks. Random forest can prevent overfitting and improve accuracy. On the other hand, random forests have some limitations. They can be computationally expensive. They are hard to interpret, and their training may take a long time.

Random Forest Classifier

Scikit-learn's RandomForestClassifier is used for classification problems. You can use the parameters of RandomForestClassifier to optimize your model's performance.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

random = RandomForestClassifier(n_estimators=10, oob_score=True)
random.fit(X_train, y_train)
y_pred = random.predict(X_test)
accuracy_score = accuracy_score(y_test, y_pred)
oob_score = print(1 - random.oob_score_)

The accuracy score is around 0.97. The syntax of RandomForestClassifier is very similar to the other sklearn models. The n_estimators parameter is the number of trees in the forest. The default value of n_estimators is 100. (The default value of n_estimators changed from 10 to 100 in Scikit-learn version 0.22.) Random forests have a slower performance, and increasing the number of estimators does not improve the performance significantly. Therefore, we decreased the n_estimators to 10. oob_score is the score of the training dataset obtained using an out-of-bag estimate. In other words, the out-of-bag error evaluates the prediction error of random forests using out-of-bag samples (the unselected data during training). If you want to use the oob_score attribute, the oob_score parameter should be True. The default value of oob_score is False. If n_estimators is not high enough, you may get the following error: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." You just need to increase the number of trees (n_estimators).
Scores (both accuracy and oob_score) of the model vary slightly because subsets of data and features are randomly selected. Changing parameters may improve your model's performance. You can change the criterion (default="gini"), max_depth, max_features (default="sqrt"), bootstrap (default=True), and class_weight parameters. You can also use classification_report to evaluate your model.

Random Forest Regressor

You can use Random Forest for regression problems as well.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

random = RandomForestRegressor(n_estimators=50, oob_score=True, max_features=4)
random.fit(X_train, y_train)
y_pred = random.predict(X_test)

from sklearn.metrics import mean_squared_error, r2_score
print(r2_score(y_test, y_pred))

The r2 score of the example above is around 0.81. It's a very high score for a regression task. The oob_score is around 0.19. See RandomForestClassifier for more information about oob-score. As mentioned earlier, Random forest scores can vary slightly because the algorithm randomly selects subsets of data and features during training. We chose 50 for n_estimators. The default value is 100. If you are not satisfied with your model's performance, you can change the parameters of RandomForestRegressor. The default value of the criterion is "squared_error". The default value of the max_feature is "sqrt". RandomForestRegressor attributes are also important. n_features_in is the number of features during the fit method. feature_names_in shows the names of features during the fit method. You should define the feature names to use the features_name_in attitude that's why the as_frame parameter of fetch_california_housing is True.

print(random.feature_names_in_)
print(random.feature_importances_)

['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup' 'Latitude' 'Longitude']
[0.42599551 0.05372833 0.09817814 0.03635227 0.03122995 0.13390691 0.11115228 0.10945661]

The feature_importances_ attribute shows the impurity-based feature importances. It can be used for feature selection.

Boosting Method

Boosting is an ensemble learning method used for models with high bias. Weak learners are used as base estimators. It can be used for both classification and regression. There are 2 algorithms for boosting methods: Adaptive Boosting and Gradient Boosting.

Adaptive Boosting

Adaptive Boosting (or AdaBoost) is a sequential ensemble learning method. It can be used both for classification and regression. Adaptive Boosting aims to reduce bias and improve the performance of the model.

AdaBoostClassifier

AdaBoostClassifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

decision_stump = DecisionTreeClassifier(max_depth=1)
classifier = AdaBoostClassifier(estimator=decision_stump, n_estimators=10)
print(classifier.get_params())
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

The get_params parameter gets the parameters for the estimator above: {'algorithm': 'deprecated', 'estimator__ccp_alpha': 0.0, 'estimator__class_weight': None, 'estimator__criterion': 'gini', 'estimator__max_depth': 1, 'estimator__max_features': None, 'estimator__max_leaf_nodes': None, 'estimator__min_impurity_decrease': 0.0, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2, 'estimator__min_weight_fraction_leaf': 0.0, 'estimator__monotonic_cst': None, 'estimator__random_state': None, 'estimator__splitter': 'best', 'estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 1.0, 'n_estimators': 10, 'random_state': None}. If a model consists of a one-level decision tree, it's called a decision stump.
The accuracy score is around 0.93. AdaBoostClassifier has very few parameters.

AdaBoostRegressor

AdaBoostRegressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import fetch_california_housing

X,y = fetch_california_housing(return_X_y=True, as_frame=True)
housing = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
regressor = AdaBoostRegressor(n_estimators=90)
print(regressor.get_params(deep=True))
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print(r2_score(y_test, y_pred))

The result of get_params for the estimator above: {'estimator': None, 'learning_rate': 1.0, 'loss': 'linear', 'n_estimators': 90, 'random_state': None}.
The r2 score is around 0.46. AdaBoostRegressor has very few parameters.

Gradient Boosting

Gradient Boosting is a sequential ensembling learning method. It can be used for both classification and regression.

GradientBoostingClassifier

GradientBoostingClassifier builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

grad_classifier = GradientBoostingClassifier(n_estimators=15)
print(grad_classifier.get_params())
grad_classifier.fit(X_train, y_train)
y_pred = grad_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

The parameters of the GradientBoostingClassifier above: {'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 15, 'n_iter_no_change': None, 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
The accuracy score is around 0.95.

GradientBoostingRegressor

The GradientBoostingRegressor estimator builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X,y = fetch_california_housing(return_X_y=True, as_frame=True)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

random = GradientBoostingRegressor()
random.fit(X_train, y_train)
y_pred = random.predict(X_test)

from sklearn.metrics import mean_squared_error, r2_score
print(r2_score(y_test, y_pred))

The r2 score is around 0.77. You can alternatively calculate the mean squared error to evaluate your model. The score is high for a regression task. It performs better than the AdaBoostRegressor model above.

Please keep in mind that the scores/results of the models above may be different each time.

Stacking Method

Stacking is an ensemble machine learning technique that combines multiple models.

StackingClassifier

StackingClassifier uses the results of different classifiers to compute the final prediction. The syntax of the StackingClassifier is different from other models. You should specify base estimators using the estimators parameter. We used RandomForestClassifier and LogisticRegression models. You should also choose a final estimator that will be used to combine the base estimators. The final_estimator is LogisticRegression in the example below. The default value is LogisticRegression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

estimators = [ ("rf", RandomForestClassifier()), ("lr", LogisticRegression()) ]
classifier = StackingClassifier( estimators=estimators, final_estimator=LogisticRegression() )
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

The accuracy score is around 0.97-1.0. You can use get_params() to learn the parameters of the StackingClassifier. Input options for cv are int, cross-validation generator, iterable, "prefit", and (default=) None. For integer/None inputs, StratifiedKFold is used for both binary and multiclass classification. StratifiedKFold provides train/test indices to split data into train/test sets. The folds are made by preserving the percentage of samples for each class. It's the difference between StratifiedKFold and KFold cross-validation.

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=2, random_state=20, shuffle=True)
classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=kfold )

You can see how StratifiedKFold splits the data:

for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
print(f"Fold {i}:")
print(len(train_index))
print(f" Train: index={train_index}")
print(f" Test: index={test_index}")

The total number of training samples is 133 (66 + 67). The first row shows the fold number (Fold 0, Fold 1). The n_splits parameter is 2. Therefore, there are 2 folds. The second row shows the total number of samples (66 for the first fold and 67 for the second fold) in a fold. The third row shows the index of training samples (Train: index= [...]), and the fourth row shows the index of test samples (Test: index= [...]).

StackingRegressor

StackingRegressor uses the results of different regressor models to compute the final prediction. The syntax of the StackingRegressor is similar to the StackingClassifier but different from the previous models. You should specify base estimators using the estimators parameter. We used RandomForestRegressor and LinearRegression models. You should also choose a final estimator that will be used to combine the base estimators using the final_estimator parameter. The final_estimator is RandomForestRegressor in the example below. The default value is RidgeCV.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X,y = fetch_california_housing(return_X_y=True, as_frame=True)

housing = fetch_california_housing()
estimators = [ ('rfr', RandomForestRegressor()), ('lr', LinearRegression()) ]
reg = StackingRegressor(
estimators=estimators, final_estimator = RandomForestRegressor(n_estimators=10))

X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=42)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))

The score is around 0.75. StackingRegressor has a high performance. On the other hand, it has a slow performance. The cv parameter determines the cross-validation splitting strategy used in cross_val_predict to train the final_estimator. Possible inputs for cv are: int, cross-validation generator, iterable, "prefit", and (default=) None. For integer/None inputs, if the estimator is not a classifier, you should use KFold. KFold provides train/test indices to split data into train/test sets and splits the dataset into k consecutive folds (without shuffling by default).

from sklearn.model_selection import KFold

kfold = KFold(n_splits=2)
reg = StackingRegressor( estimators=estimators, final_estimator = RandomForestRegressor(n_estimators=10 ), cv=kfold )

You can see how KFold splits the data:

for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
print(f"Fold {i}:")
print(len(train_index))
print(f" Train: index={train_index}")
print(f" Test: index={test_index}")

The total number of samples is 15480. The first row shows the fold number (Fold 0 and Fold 1). The second row shows the total number (7740 for each) of samples in a fold. The third row shows the index of training samples (Train: index= [...]), and the fourth row shows the index of test samples (Test: index= [...]).

Pipeline

A pipeline is used to apply a sequence of data preprocessing steps (transformers), followed by an optional final estimator (predictive model). The main purpose of a pipeline is to streamline the workflow by combining multiple steps into a single object. This makes it easier to perform tasks like cross-validation and hyperparameter tuning consistently across all stages. We will have different examples to explore Scikit-learn's Pipelines. Let's see a Pipeline with an estimator:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([("rfc", RandomForestClassifier())])

Pipeline with GridSearchCV

We can use a pipeline combined with GridSearchCV to optimize our model:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y= True, as_frame= True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

pipeline = Pipeline([ ("rfc", RandomForestClassifier())])

grid = [{'rfc': [RandomForestClassifier()], "rfc__n_estimators": [10, 20]}]
gs = GridSearchCV(pipeline, param_grid= grid, scoring="accuracy", cv=5)
gs.fit(X_train, y_train)

You need to specify the parameters of the GridSearchCV before the GridSearchCV. If you want to set a parameter, you need to use '__' as a separator in the parameter names. rfc__n_estimators parameter in the example above sets n_estimators parameter for the GridSearchCV. If you want to learn GridSearchCV, see the GridSearchCV.

Pipeline with StandardScaler

You can use StandardScaler or other scalers with Pipeline to scale data:

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 20)

scaled_pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
scaled_pipe.fit(X_train, y_train)
y_pred = scaled_pipe.predict(X_test)

Pipeline with OneHotEncoder, SimpleImputer, ColumnTransformer

We will use Seaborn Library's Titanic dataset for the example below. Our model will predict the "survived" column. The Titanic dataset has both categorical and numerical features. We will use Logistic Regression as an estimator. Therefore, we need to convert categorical data to numerical data. We will use Scikit-learn's OneHotEncoder to encode categorical features. ColumnTransformer allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer will be concatenated to form a single feature space. SimpleImputer univariates imputer for completing missing values with simple strategies. You need different strategies for categorical and numerical data. We will use the "most_frequent" strategy for categorical data and the "mean" strategy for numerical data. The "most_frequent" imputation strategy replaces missing values using the most frequent value along each column. The "most_frequent" strategy can be used for numerical data as well. The "mean" imputation strategy fills in missing values using the mean of each column. If you want to learn if your data has missing values, you can print df.isnull().sum().

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
import numpy as np

titanic = sns.load_dataset('titanic')
titanic = pd.DataFrame(titanic)
X = titanic.drop(columns="survived")
y = titanic["survived"]
num_cols = X.select_dtypes(include=np.number).columns

cat_cols = X.select_dtypes(include=["object","bool", "category"]).columns
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)

num_vals = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scale", StandardScaler())])
cat_vals = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("onehotencoder", OneHotEncoder())])

preprocess = ColumnTransformer( transformers=[ ("num_preprocess", num_vals, num_cols), ("cat_preprocess", cat_vals, cat_cols)])
pipeline = Pipeline([("preprocess", preprocess), ("regr", LogisticRegression())])
pipeline.fit(x_train, y_train)
pred = pipeline.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

The cat_cols variable represents the categorical columns, while the num_cols variable represents the numerical columns. SimpleImputer and StandardScaler will be applied to numerical data. SimpleImputer and OneHotEncoder will be applied to categorical data. We need to declare ColumnTransformer before the final pipeline.

Scikit-learn Tutorial 2025

Scikit-learn Library

Scikit-learn Library Installation

Scikit-learn Datasets

Supervised Learning

Linear Regression

Multiple Linear Regression

train_test_split

Logistic Regression

K-Nearest Neighbors

K-Nearest Neighbors Classifier

K-Nearest Neighbors Regressor

Decision Tree Classifier

How to interpret the decision tree

Normalization (sklearn.preprocessing)

StandardScaler

MinMaxScaler

Scikit-learn's evaluation metrics

Scikit-learn's evaluation metrics for classification

Accuracy score

Recall score

Precision score

F1 Score

Classification report

Confusion matrix

Scikit-learn's evaluation metrics for regression

Feature selection methods

Filter Methods

Variance Threshold

Pearson's Correlation Method

SelectKBest

Wrapper Methods

Sequential Feature Selection Methods

Sequential Forward Selection (SFS) Method

Sequential Forward Floating Selection (SFFS) Method

Sequential Backward Selection (SBS) Method

Sequential Backward Floating Selection (SBFS) Method

Recursive Feature Elimination (RFE) Method

Unsupervised Learning

K-Means Clustering

Elbow method

Limitations of K-Means Clustering

Principal Component Analysis (PCA)

Supervised Learning II

Naive Bayes Classifier

MultinomialNB

Support Vector Machines (SVM)

Linear Discriminant Analysis (LDA)

Regularization and Hyperparameters

Regularization

Hyperparameter tuning

GridSearchCV

Ensemble Methods

Bagging Method

Random Forests

Random Forest Classifier

Random Forest Regressor

Boosting Method

Adaptive Boosting

AdaBoostClassifier

AdaBoostRegressor

Gradient Boosting

GradientBoostingClassifier

GradientBoostingRegressor

Stacking Method

StackingClassifier

StackingRegressor

Pipeline

Pipeline with GridSearchCV

Pipeline with StandardScaler

Pipeline with OneHotEncoder, SimpleImputer, ColumnTransformer