
Sklearn core code snippet

A core code snippet for scikit-learn machine learning applications, using the iris dataset and the k-nearest neighbors (k-NN) classifier.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


iris_dataset = load_iris()
# get the dataset using the load_iris function
#   type(iris_dataset) is sklearn.utils.Bunch, which is
#   similar to a Python dictionary e.g. iris_dataset.keys()


X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
# by default 75% train, 25% test

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

score = knn.score(X_test, y_test)
# measures the accuracy, i.e. the fraction of test flowers
#   for which the correct species was predicted
#   same as:
#       y_pred = knn.predict(X_test)
#       np.mean(y_pred == y_test)

print(f"Model test set score: {score:.2f}")

Applying the model to new data, e.g.

import numpy as np

# new data must be a 2D numpy.ndarray of shape
#   (n_samples, n_features), hence the double brackets
X_new = np.array([[5, 2.9, 1, 0.2]])

prediction = knn.predict(X_new)
# returns the predicted class index, e.g. array([0])

X_new_label = iris_dataset['target_names'][prediction]
print(f"Predicted species of new sample: {X_new_label}")
# with random_state=0 above, this prints ['setosa']

Other notes

  1. k-NN (or any algorithm that generates a predictive model) can be used to detect anomalies in the data; see the distance-based sketch after this list.
  2. k-NN may not be a good choice for large datasets, as the algorithm gets slower as the number of samples and/or independent variables increases. It also tends to perform poorly on imbalanced datasets, where larger classes overshadow smaller ones; in that case, weighted voting instead of majority voting may improve accuracy (see the second sketch below).
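
A minimal sketch of the anomaly-detection idea in note 1, assuming anomalies are samples that sit unusually far from their nearest neighbors; the choice of 5 neighbors and the 3-standard-deviation threshold are illustrative assumptions, not part of the original snippet.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris()['data']

# fit an unsupervised nearest-neighbor index
#   (request 6 neighbors: each sample is its own nearest
#   neighbor at distance 0, which the slice below drops)
nn = NearestNeighbors(n_neighbors=6)
nn.fit(X)

distances, _ = nn.kneighbors(X)
mean_dist = distances[:, 1:].mean(axis=1)

# flag samples whose mean distance to their 5 nearest
#   neighbors is more than 3 standard deviations above average
threshold = mean_dist.mean() + 3 * mean_dist.std()
anomalies = np.where(mean_dist > threshold)[0]
print(f"Anomalous sample indices: {anomalies}")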
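
For the weighted-voting remedy in note 2, KNeighborsClassifier supports this directly through its weights parameter; a minimal sketch (n_neighbors=5 is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# weights='distance' makes closer neighbors count more,
#   versus the default weights='uniform' (plain majority voting)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train, y_train)
print(f"Weighted k-NN test set score: {knn_weighted.score(X_test, y_test):.2f}")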

Reference:

Müller, A. C., & Guido, S. (2017). Introduction to machine learning with Python: A guide for data scientists. Sebastopol, CA: O'Reilly Media, Inc.

This post is licensed under CC BY 4.0 by the author.