Classification using scikit-learn

The goal of this tutorial is to introduce you to the scikit-learn library for classification. We will also cover feature normalization, feature selection, and evaluation.

In [2]:
import numpy as np
import scipy.sparse as sp_sparse

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
from sklearn import preprocessing

import seaborn as sns

%matplotlib inline

Feature normalization

scikit-learn provides functionality for normalizing and standardizing data. Be careful though: some operations work only with dense data.

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

Use the function preprocessing.scale to normalize by removing the mean and dividing by the standard deviation. This is done per feature, that is, per column of the dataset.

In [2]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  1.],
              [ 0.,  1., -1.]])
print("column means: ",X.mean(axis = 0))
print("column std: ",X.std(axis = 0))
X_scaled = preprocessing.scale(X)
print("after feature normalization")
print(X_scaled)
print("normalized column means: ",X_scaled.mean(axis=0))
print("normalized column std: ",X_scaled.std(axis = 0))
column means:  [1.         0.         0.66666667]
column std:  [0.81649658 0.81649658 1.24721913]
after feature normalization
[[ 0.         -1.22474487  1.06904497]
 [ 1.22474487  0.          0.26726124]
 [-1.22474487  1.22474487 -1.33630621]]
normalized column means:  [0.00000000e+00 0.00000000e+00 1.48029737e-16]
normalized column std:  [1. 1. 1.]
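
To make the computation explicit, the following sketch reproduces the result of preprocessing.scale with plain NumPy. It assumes the population standard deviation (ddof=0), which is what X.std uses by default. The names X_demo and X_manual are introduced only for this illustration.

```python
import numpy as np

# Same toy matrix as above (named X_demo so that X is not overwritten).
X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  1.],
                   [0.,  1., -1.]])

col_mean = X_demo.mean(axis=0)   # one mean per column (feature)
col_std = X_demo.std(axis=0)     # population std (ddof=0) per column

# Standardize: subtract the column mean, then divide by the column std.
X_manual = (X_demo - col_mean) / col_std
print(X_manual)
```

Each column of X_manual should then have mean 0 and standard deviation 1, up to floating-point error.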

Centering does not work with sparse data: subtracting the mean would turn the zeros into non-zero values, so the sparse matrix would become dense. For this reason preprocessing.scale raises an error on sparse input unless with_mean=False is passed.

In [3]:
import scipy.sparse
cX = scipy.sparse.csc_matrix(X)
cX_scaled = preprocessing.scale(cX)
print(cX_scaled)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-961e7864f1cd> in <module>
      1 import scipy.sparse
      2 cX = scipy.sparse.csc_matrix(X)
----> 3 cX_scaled = preprocessing.scale(cX)
      4 print(cX_scaled)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in scale(X, axis, with_mean, with_std, copy)
    143         if with_mean:
    144             raise ValueError(
--> 145                 "Cannot center sparse matrices: pass `with_mean=False` instead"
    146                 " See docstring for motivation and alternatives.")
    147         if axis != 0:

ValueError: Cannot center sparse matrices: pass `with_mean=False` instead See docstring for motivation and alternatives.
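
As the error message suggests, one way around this is to skip the centering step. The sketch below passes with_mean=False, so each column is only divided by its standard deviation and the zeros stay zeros. The names X_dense, cX_demo and cX_unit are chosen here for illustration.

```python
import numpy as np
import scipy.sparse
from sklearn import preprocessing

X_dense = np.array([[1., -1.,  2.],
                    [2.,  0.,  1.],
                    [0.,  1., -1.]])
cX_demo = scipy.sparse.csc_matrix(X_dense)

# No centering: every column is divided by its standard deviation only,
# so the result is still a sparse matrix with the same zero pattern.
cX_unit = preprocessing.scale(cX_demo, with_mean=False)
print(scipy.sparse.issparse(cX_unit))
print(cX_unit.toarray())
```

Note that the result has unit variance per column but not zero mean, so it is not the same as full standardization.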

The same standardization can be done with the StandardScaler class from the preprocessing module of sklearn.

The method fit() computes the parameters for scaling, and transform() applies the scaling.

In [4]:
from sklearn import preprocessing
std_scaler = preprocessing.StandardScaler()
std_scaler.fit(X)
print(std_scaler.mean_)
print(std_scaler.scale_)
X_std = std_scaler.transform(X)
print("scaled data:")
print(X_std)
[1.         0.         0.66666667]
[0.81649658 0.81649658 1.24721913]
scaled data:
[[ 0.         -1.22474487  1.06904497]
 [ 1.22474487  0.          0.26726124]
 [-1.22474487  1.22474487 -1.33630621]]

The advantage is that we can now apply the same transform to new data.

For example, we compute the parameters for the training data and we apply the scaling to the test data.

In [5]:
y = np.array([[2.,3.,1.],
              [1.,2.,1.]])
print(std_scaler.transform(y))
[[1.22474487 3.67423461 0.26726124]
 [0.         2.44948974 0.26726124]]

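The MinMaxScaler subtracts from each column its minimum and then divides by the range (max - min), so the training data falls in [0, 1]. New data can fall outside this interval, as the second output below shows.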

In [6]:
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X)
print(X_minmax)
print(min_max_scaler.transform(y))
[[0.5        0.         1.        ]
 [1.         0.5        0.66666667]
 [0.         1.         0.        ]]
[[1.         2.         0.66666667]
 [0.5        1.5        0.66666667]]
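
The min-max computation can be written out by hand. The following is a minimal pure-Python sketch; the helper names fit_min_max and apply_min_max are hypothetical and not part of scikit-learn, and the code assumes that no column is constant (otherwise max - min would be zero).

```python
def fit_min_max(rows):
    """Return the per-column minima and maxima of a list of rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(rows, col_min, col_max):
    """Map each value to (x - min) / (max - min), column by column."""
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, col_min, col_max)]
            for row in rows]

train = [[1., -1., 2.], [2., 0., 1.], [0., 1., -1.]]
new = [[2., 3., 1.], [1., 2., 1.]]

# The minima and maxima come from the training rows only.
col_min, col_max = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_max)
new_scaled = apply_min_max(new, col_min, col_max)
print(train_scaled)
print(new_scaled)
```

Because the parameters are computed on the training rows only, values in the new rows that exceed the training maximum are mapped above 1.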

The MaxAbsScaler divides each column by its maximum absolute value.

The MaxAbsScaler can work with sparse data, since dividing by a constant keeps the zeros and so preserves the sparseness. For the other scalers, removing the mean (or min) can destroy the sparseness of the data.

Sometimes we may choose to normalize only the non-zero values. This should be done manually.

In [7]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X)
X_maxabs
Out[7]:
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0.5],
       [ 0. ,  1. , -0.5]])
In [8]:
# works with sparse data
cX_scaled = max_abs_scaler.transform(cX)
print(cX_scaled)
  (0, 0)	0.5
  (1, 0)	1.0
  (0, 1)	-1.0
  (2, 1)	1.0
  (0, 2)	1.0
  (1, 2)	0.5
  (2, 2)	-0.5
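
Normalizing only the non-zero values, as mentioned above, has to be coded by hand. One possible sketch is shown below: it standardizes the stored entries of each column of a CSC matrix and leaves the implicit zeros untouched. The function name standardize_nonzeros is hypothetical, and standardizing per column is an assumption about what "normalize" should mean here.

```python
import numpy as np
import scipy.sparse

def standardize_nonzeros(matrix):
    """Standardize the stored (non-zero) entries of each column in place on a copy."""
    out = scipy.sparse.csc_matrix(matrix, dtype=float, copy=True)
    for j in range(out.shape[1]):
        # In CSC format the entries of column j are data[indptr[j]:indptr[j+1]].
        start, end = out.indptr[j], out.indptr[j + 1]
        vals = out.data[start:end]
        if vals.size == 0:
            continue
        std = vals.std()
        if std > 0:  # skip columns whose stored values are all identical
            out.data[start:end] = (vals - vals.mean()) / std
    return out

X_sparse_demo = scipy.sparse.csc_matrix(np.array([[1., -1.,  2.],
                                                  [2.,  0.,  1.],
                                                  [0.,  1., -1.]]))
Z = standardize_nonzeros(X_sparse_demo)
print(Z.toarray())
```

A stored value that happens to equal the mean of its column becomes an explicitly stored zero; the sparsity pattern itself does not change.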

The normalize function normalizes the rows so that they become unit vectors in some norm that we specify. It can be applied to sparse matrices without destroying the sparsity.

In [9]:
# dense input here; normalize also accepts sparse matrices (see the next cell)
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized
Out[9]:
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 0.89442719,  0.        ,  0.4472136 ],
       [ 0.        ,  0.70710678, -0.70710678]])
In [10]:
crX = scipy.sparse.csr_matrix(X)
crX_scaled = preprocessing.normalize(crX,norm='l1')
print(crX_scaled)
  (0, 0)	0.25
  (0, 1)	-0.25
  (0, 2)	0.5
  (1, 0)	0.6666666666666666
  (1, 2)	0.3333333333333333
  (2, 1)	0.5
  (2, 2)	-0.5
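
Row normalization amounts to dividing every row by its own norm. The sketch below does this with NumPy for the l2 and l1 norms; it assumes that no row is entirely zero, and the variable names are illustrative.

```python
import numpy as np

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  1.],
                   [0.,  1., -1.]])

# One norm per row; keepdims=True keeps shape (3, 1) so the division broadcasts.
l2_norms = np.sqrt((X_demo ** 2).sum(axis=1, keepdims=True))
l1_norms = np.abs(X_demo).sum(axis=1, keepdims=True)

X_l2 = X_demo / l2_norms   # rows have Euclidean length 1
X_l1 = X_demo / l1_norms   # absolute values in each row sum to 1
print(X_l2)
print(X_l1)
```

Since zeros divided by a constant remain zeros, the same operation keeps a sparse matrix sparse.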

OneHotEncoder

The OneHotEncoder transforms categorical features into binary ones: every value of a categorical attribute becomes a separate 0/1 feature that indicates whether this value appears in the feature vector. It works with numerical categorical values.

In [11]:
X = [[0,1,2],
     [1,2,3],
     [0,1,4]]
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc.fit(X)
enc.transform([[0,2,4],[1,1,2]]).toarray()
Out[11]:
array([[1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 1., 0., 1., 0., 0.]])

In this example every distinct value in each column defines a separate feature.

In [12]:
enc.categories_
Out[12]:
[array([0, 1]), array([1, 2]), array([2, 3, 4])]
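
The encoding can be reproduced by hand to see where each 0 and 1 comes from. This is a minimal pure-Python sketch with hypothetical helper names (fit_categories, one_hot); it mimics handle_unknown='ignore' by emitting all zeros for a value that was not seen during fitting.

```python
def fit_categories(rows):
    """Collect the sorted distinct values seen in each column."""
    return [sorted(set(col)) for col in zip(*rows)]

def one_hot(row, categories):
    """Encode one row; an unseen value yields all zeros for its column."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1.0 if value == c else 0.0 for c in cats)
    return encoded

train = [[0, 1, 2], [1, 2, 3], [0, 1, 4]]
categories = fit_categories(train)
print(categories)
print(one_hot([0, 2, 4], categories))
print(one_hot([1, 1, 2], categories))
```

The first two output positions correspond to the values 0 and 1 of the first column, the next two to the values 1 and 2 of the second column, and the last three to the values 2, 3 and 4 of the third column.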

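We can also apply it selectively to some columns of the data. Note that the categorical_features keyword used below is deprecated since version 0.20 and, according to the warning in the output, is removed in 0.22 in favor of the ColumnTransformer.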

In [13]:
#works with sparse data

X = np.array([[0, 10, 45100],
     [1, 20, 45221],
     [0, 20, 45212]])
enc = preprocessing.OneHotEncoder(categorical_features=[2]) #only the third column is categorical
enc.fit(X)
enc.transform([[5,13,45212],[4,12,45221]]).toarray()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  warnings.warn(msg, FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:451: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Out[13]:
array([[ 0.,  1.,  0.,  5., 13.],
       [ 0.,  0.,  1.,  4., 12.]])
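
The deprecation warning above points to the ColumnTransformer as the replacement. The following sketch shows one way to encode only the third column with it; the transformer name "onehot" and the variable names are arbitrary, and sparse_threshold=0 is set only to get a dense array back.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_cols = np.array([[0, 10, 45100],
                   [1, 20, 45221],
                   [0, 20, 45212]])

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [2])],  # encode column 2 only
    remainder="passthrough",   # keep columns 0 and 1 unchanged
    sparse_threshold=0,        # always return a dense array
)
ct.fit(X_cols)
encoded = ct.transform(np.array([[5, 13, 45212], [4, 12, 45221]]))
print(encoded)
```

The encoded columns come first and the passed-through columns follow, which is the same layout as in the output above.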

Feature Selection

Feature selection is about finding the best features for your classifier. This may be important if you do not have enough training data. The idea is to find metrics that either characterize the features by themselves, or with respect to the class we want to predict, or with respect to other features.

http://scikit-learn.org/stable/modules/feature_selection.html

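The VarianceThreshold selector drops features whose variance does not exceed some threshold. A binary (0/1) feature that takes the value 1 with probability p has variance p(1 - p), so the threshold can be set exactly: the threshold .8 * (1 - .8) used below removes the features that have the same value in more than 80% of the samples.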

In [14]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(np.array(X))
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]
Out[14]:
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
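
The selection above can be checked by computing the variance p(1 - p) of each binary column directly. This is a small pure-Python sketch; the variable names are illustrative.

```python
rows = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
threshold = 0.8 * (1 - 0.8)

variances = []
for col in zip(*rows):
    p = sum(col) / len(col)         # fraction of ones in this column
    variances.append(p * (1 - p))   # variance of a 0/1 feature

# Keep the columns whose variance is above the threshold.
kept = [j for j, v in enumerate(variances) if v > threshold]
print(variances)
print(kept)
```

The first column is 1 in only one of six samples, so its variance (5/36, about 0.139) falls below the threshold of 0.16 and it is the one that gets dropped.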

A more sophisticated feature selection technique uses the chi-square test to determine if a feature and the class label are independent.

https://en.wikipedia.org/wiki/Chi-squared_test

In this case we keep the features with a high chi-square score and a low p-value.

The features with the lowest scores and the highest p-values are rejected.

The chi-square test is usually applied on categorical data.

In [15]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
print('Features:')
print(X[1:10,:])
print('Labels:')
print(y[1:10])
sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
print('Selected Features:')
print(X_new[1:10])
print('Chi2 values')
print(sel.scores_)
c,p = sk.feature_selection.chi2(X, y)
print('Chi2 values')
print(c) #The chi-square value between X columns and y
print('p-values')
print(p) #The p-value for the test
(150, 4)
Features:
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
Labels:
[0 0 0 0 0 0 0 0 0]
Selected Features:
[[1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]]
Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]
Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]
p-values
[4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
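
To see what the chi-square score measures here, the sketch below recomputes it with NumPy. It follows my understanding of how the score is formed for non-negative features: the "observed" table holds the sum of each feature within each class, and the "expected" table assumes that the feature total is spread over the classes in proportion to the class frequencies. Treat it as an illustration rather than the library's exact code; the variable names are chosen here.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

classes = np.unique(y_demo)
# Indicator matrix: Y[i, c] = 1 if sample i belongs to class c.
Y = (y_demo[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X_demo                  # per-class sum of each feature
feature_total = X_demo.sum(axis=0)       # total of each feature over all samples
class_prob = Y.mean(axis=0)              # fraction of samples in each class
expected = np.outer(class_prob, feature_total)

chi2_stat = ((observed - expected) ** 2 / expected).sum(axis=0)
# Degrees of freedom: number of classes minus one.
p_values = chi2_dist.sf(chi2_stat, df=len(classes) - 1)
print(chi2_stat)
print(p_values)
```

The petal features (third and fourth columns) get by far the largest scores, which is why SelectKBest with k=2 keeps them.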

Supervised Learning

scikit-learn has several classes for implementing different supervised learning techniques such as regression and classification.

Regardless of the model, every estimator provides the following methods:

The method fit() takes the training data and the labels/values, and trains the model.

The method predict() takes as input the test data and applies the model.

Linear Regression

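Linear Regression is implemented in the class sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. For a regressor, the score() method returns the coefficient of determination R^2, so a value of 1.0 means a perfect fit.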

In [3]:
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])

# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3

reg = LinearRegression().fit(X, y)
reg.score(X, y)
Out[3]:
1.0
In [4]:
#Obtain the function coefficients
print(reg.coef_)
#and the intercept
print(reg.intercept_)
[1. 2.]
3.0000000000000018
In [5]:
#Predict for a new point
reg.predict(np.array([[3, 5]]))
Out[5]:
array([16.])
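
The same coefficients can be recovered with an ordinary least-squares solve, which makes the prediction easy to verify by hand. This sketch uses only NumPy; appending a column of ones to model the intercept is a standard trick, and the variable names are illustrative.

```python
import numpy as np

X_demo = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y_demo = X_demo @ np.array([1.0, 2.0]) + 3.0   # y = 1 * x_0 + 2 * x_1 + 3

# Append a column of ones so that the last weight acts as the intercept.
A = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef, intercept = weights[:2], weights[2]

# Prediction for the new point (3, 5): 1 * 3 + 2 * 5 + 3
prediction = np.array([3.0, 5.0]) @ coef + intercept
print(coef, intercept, prediction)
```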

Classification models

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

scikit-learn has classes that implement the different classification techniques that we described in class.

Preparing the data

Load the iris dataset

In [20]:
from sklearn.datasets import load_iris
import sklearn.utils as utils

iris = load_iris()
print("sample of data")
print(iris.data[:5,:])
print("the class labels vector")
print(iris.target)
print("the names of the classes:",iris.target_names)
print(iris.feature_names)
sample of data
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
the class labels vector
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
the names of the classes: ['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

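Randomly shuffle the data. This is useful because the dataset is sorted by class (see the label vector above), so without shuffling the first 100 rows would contain no examples of class 2.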

In [21]:
X, y = utils.shuffle(iris.data, iris.target, random_state=1) #shuffle the data
print(X.shape)
print(y.shape)
print(y)
(150, 4)
(150,)
[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
 0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2
 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1
 2 0]

Select a subset for training and a subset for testing

In [22]:
train_set_size = 100
X_train = X[:train_set_size]  # selects first 100 rows (examples) for train set
y_train = y[:train_set_size]
X_test = X[train_set_size:]   # selects from row 100 until the last one for test set
y_test = y[train_set_size:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(100, 4) (100,)
(50, 4) (50,)

We can also use the train_test_split function of scikit-learn to split the data into train and test sets. It shuffles the data by default, so the manual shuffling above is not needed (but it does not hurt). Here 40% of the data (60 examples) is held out for testing.

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
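
A variation that may be useful on small datasets is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below uses the stratify parameter of train_test_split; this is an addition to the tutorial's setup, not what the cell above does, and the variable names are chosen to avoid overwriting X_train and X_test.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target,
    test_size=0.4, random_state=0,
    stratify=iris_demo.target)   # preserve the class proportions in both parts

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))   # examples per class
```

With 50 examples per class and a 60/40 split, each class should contribute 30 training and 20 test examples.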

Decision Trees

http://scikit-learn.org/stable/modules/tree.html

Train and apply a decision tree classifier. The default score computed in the classifier object is the accuracy.

In [26]:
from sklearn import tree

dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)

print("classifier accuracy:",dtree.score(X_test,y_test))

y_pred = dtree.predict(X_test)
y_prob = dtree.predict_proba(X_test)
print("classifier predictions:",y_pred[:10])
print("ground truth labels   :",y_test[:10])
print(y_prob[:10])
classifier accuracy: 0.95
classifier predictions: [2 2 2 0 0 0 2 2 2 2]
ground truth labels   : [1 2 2 0 0 0 2 2 2 2]
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]

Compute some more metrics

In [27]:
print("accuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
accuracy: 0.95

Confusion matrix
[[20  0  0]
 [ 0 17  2]
 [ 0  1 20]]

Precision Score per class
[1.         0.94444444 0.90909091]

Average Precision Score
0.9505892255892257

Recall Score per class
[1.         0.89473684 0.95238095]

Average Recall Score
0.95

F1-score Score per class
[1.         0.91891892 0.93023256]

Average F1 Score
0.9499057196731615
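
All of the metrics above can be derived from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following pure-Python sketch recomputes them from the matrix printed above; the helper name weighted is hypothetical.

```python
cm = [[20, 0, 0],
      [0, 17, 2],
      [0, 1, 20]]   # rows: true class, columns: predicted class

n_classes = len(cm)
total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]   # number of true examples per class
predicted = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]

accuracy = sum(cm[k][k] for k in range(n_classes)) / total
precision = [cm[k][k] / predicted[k] for k in range(n_classes)]   # diagonal / column sum
recall = [cm[k][k] / support[k] for k in range(n_classes)]        # diagonal / row sum
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

def weighted(values):
    """Average the per-class values, weighted by the class support."""
    return sum(v * s for v, s in zip(values, support)) / total

print(accuracy)
print(precision, weighted(precision))
print(recall, weighted(recall))
print(f1, weighted(f1))
```

The weighted recall always equals the accuracy, since both reduce to the number of correct predictions divided by the number of examples.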

Visualize the decision tree.

For this you will need to install the package python-graphviz (for example with conda install python-graphviz).

In [28]:
#conda install python-graphviz
import graphviz 
print(iris.feature_names)
dot_data = tree.export_graphviz(dtree,out_file=None)
graph = graphviz.Source(dot_data)
graph
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Out[28]:
Decision tree (text rendering of the Graphviz output; X[2] is petal length, X[3] is petal width, and at every split the True branch is listed first):

node 0: X[3] <= 0.75, gini = 0.666, samples = 90, value = [30, 31, 29]
  True:  node 1: gini = 0.0, samples = 30, value = [30, 0, 0]
  False: node 2: X[2] <= 4.75, gini = 0.499, samples = 60, value = [0, 31, 29]
    True:  node 3: gini = 0.0, samples = 29, value = [0, 29, 0]
    False: node 4: X[3] <= 1.7, gini = 0.121, samples = 31, value = [0, 2, 29]
      True:  node 5: X[2] <= 4.95, gini = 0.48, samples = 5, value = [0, 2, 3]
        True:  node 6: gini = 0.0, samples = 1, value = [0, 1, 0]
        False: node 7: X[3] <= 1.55, gini = 0.375, samples = 4, value = [0, 1, 3]
          True:  node 8: gini = 0.0, samples = 2, value = [0, 0, 2]
          False: node 9: X[2] <= 5.45, gini = 0.5, samples = 2, value = [0, 1, 1]
            True:  node 10: gini = 0.0, samples = 1, value = [0, 1, 0]
            False: node 11: gini = 0.0, samples = 1, value = [0, 0, 1]
      False: node 12: gini = 0.0, samples = 26, value = [0, 0, 26]
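
The gini value reported in each node of the tree is the Gini impurity of the class counts in that node, that is, 1 minus the sum of the squared class proportions. A short pure-Python sketch (the function name gini is chosen here) reproduces some of the values shown in the tree:

```python
def gini(class_counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(round(gini([30, 31, 29]), 3))   # root node
print(round(gini([0, 31, 29]), 3))    # node after removing class 0
print(round(gini([0, 2, 29]), 3))     # almost pure node
print(round(gini([30, 0, 0]), 3))     # pure leaf
```

A pure node has impurity 0, and the tree keeps splitting until every leaf is pure unless a limit such as max_depth stops it earlier.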
In [30]:
dtree2 = tree.DecisionTreeClassifier(max_depth=2)
dtree2 = dtree2.fit(X_train, y_train)
print(dtree2.score(X_test,y_test))
dot_data2 = tree.export_graphviz(dtree2,out_file=None)
graph2 = graphviz.Source(dot_data2)
graph2
0.9166666666666666
Out[30]:
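Decision tree with max_depth=2 (text rendering of the Graphviz output; at every split the True branch is listed first):

node 0: X[2] <= 2.35, gini = 0.666, samples = 90, value = [30, 31, 29]
  True:  node 1: gini = 0.0, samples = 30, value = [30, 0, 0]
  False: node 2: X[2] <= 4.75, gini = 0.499, samples = 60, value = [0, 31, 29]
    True:  node 3: gini = 0.0, samples = 29, value = [0, 29, 0]
    False: node 4: gini = 0.121, samples = 31, value = [0, 2, 29]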
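
k-Nearest Neighbors

Train a k-nearest-neighbors classifier with k = 3 and compute the same metrics.

In [31]: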
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
print("classifier score:", knn.score(X_test,y_test))

y_pred = knn.predict(X_test)

print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.9333333333333333

accuracy: 0.9333333333333333

Confusion matrix
[[20  0  0]
 [ 0 16  3]
 [ 0  1 20]]

Precision Score per class
[1.         0.94117647 0.86956522]

Average Precision Score
0.9357203751065644

Recall Score per class
[1.         0.84210526 0.95238095]

Average Recall Score
0.9333333333333333

F1-score Score per class
[1.         0.88888889 0.90909091]

Average F1 Score
0.9329966329966328
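
Support Vector Machines

Train a support vector classifier with the default settings of svm.SVC and compute the same metrics.

In [32]: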
from sklearn import svm

#svm_clf = svm.LinearSVC()
#svm_clf = svm.SVC(kernel = 'poly')
svm_clf = svm.SVC()
svm_clf.fit(X_train,y_train)
print("classifier score:",svm_clf.score(X_test,y_test))
y_pred = svm_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.9666666666666667

accuracy: 0.9666666666666667

Confusion matrix
[[20  0  0]
 [ 0 18  1]
 [ 0  1 20]]

Precision Score per class
[1.         0.94736842 0.95238095]

Average Precision Score
0.9666666666666667

Recall Score per class
[1.         0.94736842 0.95238095]

Average Recall Score
0.9666666666666667

F1-score Score per class
[1.         0.94736842 0.95238095]

Average F1 Score
0.9666666666666667
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

Logistic Regression

Train a logistic regression classifier and compute the same metrics.

In [45]:
import sklearn.linear_model as linear_model

lr_clf = linear_model.LogisticRegression()
lr_clf.fit(X_train, y_train)
print("classifier score:",lr_clf.score(X_test,y_test))
y_pred = lr_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.9833333333333333

accuracy: 0.9833333333333333

Confusion matrix
[[20  0  0]
 [ 0 18  1]
 [ 0  0 21]]

Precision Score per class
[1.         1.         0.95454545]

Average Precision Score
0.9840909090909091

Recall Score per class
[1.         0.94736842 1.        ]

Average Recall Score
0.9833333333333333

F1-score Score per class
[1.         0.97297297 0.97674419]

Average F1 Score
0.9833019065577204
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)

For Logistic Regression we can also obtain the probabilities for the different classes. The predicted class is the one with the highest probability.

In [46]:
probs = lr_clf.predict_proba(X_test)
print("Class Probabilities (first 10):")
print (probs[:10])
print(probs.argmax(axis = 1)[:10])
print(probs.max(axis = 1)[:10])
Class Probabilities (first 10):
[[8.97030460e-03 3.32653685e-01 6.58376010e-01]
 [3.65818540e-03 4.12481405e-01 5.83860409e-01]
 [6.12425133e-04 3.31379557e-01 6.68008018e-01]
 [9.06929006e-01 9.26073940e-02 4.63599597e-04]
 [8.98809388e-01 1.00868455e-01 3.22156817e-04]
 [9.57598497e-01 4.23682210e-02 3.32819743e-05]
 [1.32310636e-03 3.27816831e-01 6.70860062e-01]
 [1.27558143e-03 3.77948164e-01 6.20776255e-01]
 [1.50692477e-03 3.85667745e-01 6.12825330e-01]
 [8.56351814e-04 2.05563299e-01 7.93580350e-01]]
[2 2 2 0 0 0 2 2 2 2]
[0.65837601 0.58386041 0.66800802 0.90692901 0.89880939 0.9575985
 0.67086006 0.62077626 0.61282533 0.79358035]
In [47]:
print(lr_clf.coef_)
[[ 0.40967244  1.25382589 -2.05048616 -0.94328782]
 [ 0.08015593 -1.18508674  0.71601843 -1.18588825]
 [-1.31822513 -1.2196971   1.90603048  2.14869569]]

Computing Scores

In [35]:
p,r,f,s = metrics.precision_recall_fscore_support(y_test,y_pred)
print(p)
print(r)
print(f)
[1.         1.         0.95454545]
[1.         0.94736842 1.        ]
[1.         0.97297297 0.97674419]
In [36]:
report = metrics.classification_report(y_test,y_pred)
print(report)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      0.95      0.97        19
           2       0.95      1.00      0.98        21

    accuracy                           0.98        60
   macro avg       0.98      0.98      0.98        60
weighted avg       0.98      0.98      0.98        60

In [37]:
# Binary problem: class 2 is the positive class, classes 0 and 1 are negative
y_true = np.array(y_test)   # np.array copies, so y_test is not modified
print(y_true)
print(y_test)
y_true[y_true != 2] = 0
y_true[y_true==2] = 1
# Score of the positive class: predicted probability of class 2
y_scores = probs[:,2]
# Precision and recall for every threshold over the scores
precision, recall, thresholds = metrics.precision_recall_curve(y_true,y_scores)
plt.scatter(recall,precision)
print(recall)
print(precision)
print(thresholds)
# Points of the ROC curve and the area under it
fpr, tpr, ths = metrics.roc_curve(y_true,y_scores)
print(metrics.roc_auc_score(y_true,y_scores))
[1 2 2 0 0 0 2 2 2 2 1 0 0 2 0 0 2 1 2 1 2 0 2 0 1 1 1 2 1 0 1 0 0 0 1 1 0
 2 2 2 0 1 2 2 1 0 2 1 2 0 1 1 2 0 1 0 1 2 0 1]
[1 2 2 0 0 0 2 2 2 2 1 0 0 2 0 0 2 1 2 1 2 0 2 0 1 1 1 2 1 0 1 0 0 0 1 1 0
 2 2 2 0 1 2 2 1 0 2 1 2 0 1 1 2 0 1 0 1 2 0 1]
[1.         0.95238095 0.9047619  0.85714286 0.80952381 0.76190476
 0.71428571 0.66666667 0.61904762 0.57142857 0.52380952 0.47619048
 0.42857143 0.42857143 0.38095238 0.33333333 0.28571429 0.23809524
 0.19047619 0.14285714 0.0952381  0.04761905 0.        ]
[0.95454545 0.95238095 0.95       0.94736842 0.94444444 0.94117647
 0.9375     0.93333333 0.92857143 0.92307692 0.91666667 0.90909091
 0.9        1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.        ]
[0.53249179 0.5629576  0.56738458 0.57631424 0.58375136 0.58386041
 0.59335927 0.61282533 0.62077626 0.62133329 0.62979571 0.65821261
 0.65837601 0.66098423 0.66747926 0.66800802 0.67086006 0.70778921
 0.77223785 0.78197133 0.79358035 0.80107601]
0.9853479853479853
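
To make the threshold-based metrics concrete, the following pure-Python sketch computes precision and recall at a given threshold, and the ROC AUC as the fraction of (positive, negative) pairs that the scores rank correctly. It uses a four-point toy example; the function names are hypothetical, and the threshold is assumed not to exceed the largest score (otherwise no example is predicted positive and precision is undefined).

```python
y_true_toy = [0, 0, 1, 1]
y_scores_toy = [0.1, 0.4, 0.35, 0.8]

def precision_recall_at(threshold, labels, scores):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

for t in (0.35, 0.4, 0.8):
    print(t, precision_recall_at(t, y_true_toy, y_scores_toy))
print(auc_by_ranking(y_true_toy, y_scores_toy))
```

Raising the threshold trades recall for precision, which is the trade-off that the precision-recall curve traces out.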
In [38]:
# Generate a synthetic binary classification problem and split it 800/200
(Xtoy,y_toy)=sk_data.make_classification(n_samples=1000)
Xttrain = Xtoy[:800,:]
Xttest = Xtoy[800:,:]
yttrain = y_toy[:800]
yttest = y_toy[800:]

# Refit the logistic regression model on the synthetic training data
lr_clf.fit(Xttrain, yttrain)
tprobs = lr_clf.predict_proba(Xttest)
print (tprobs)

# Score of the positive class (column 1), used for the precision-recall curve
y_tscores = tprobs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(yttest,y_tscores)
plt.scatter(recall,precision)
[[3.77370472e-01 6.22629528e-01]
 [2.45764525e-01 7.54235475e-01]
 [9.55732028e-01 4.42679720e-02]
 [6.49618888e-02 9.35038111e-01]
 [9.96761172e-01 3.23882757e-03]
 [8.48084973e-01 1.51915027e-01]
 [1.37370823e-02 9.86262918e-01]
 [9.91410858e-01 8.58914193e-03]
 [1.07285120e-02 9.89271488e-01]
 [1.32956477e-01 8.67043523e-01]
 [9.00114032e-01 9.98859677e-02]
 [5.81855715e-01 4.18144285e-01]
 [4.66472885e-01 5.33527115e-01]
 [9.49946726e-02 9.05005327e-01]
 [7.90837701e-02 9.20916230e-01]
 [1.21474243e-02 9.87852576e-01]
 [9.45920522e-01 5.40794779e-02]
 [8.77669120e-01 1.22330880e-01]
 [8.23113230e-01 1.76886770e-01]
 [1.34727130e-01 8.65272870e-01]
 [9.16936132e-01 8.30638684e-02]
 [9.97031231e-01 2.96876947e-03]
 [1.11306173e-01 8.88693827e-01]
 [8.34451469e-01 1.65548531e-01]
 [1.16554534e-01 8.83445466e-01]
 [2.06443705e-01 7.93556295e-01]
 [1.84548962e-02 9.81545104e-01]
 [8.86146769e-01 1.13853231e-01]
 [1.80703525e-02 9.81929647e-01]
 [1.57229910e-01 8.42770090e-01]
 [8.25639200e-01 1.74360800e-01]
 [9.94417399e-01 5.58260104e-03]
 [5.35037854e-02 9.46496215e-01]
 [9.82158260e-01 1.78417399e-02]
 [2.56012641e-02 9.74398736e-01]
 [8.51285897e-01 1.48714103e-01]
 [4.12185887e-02 9.58781411e-01]
 [6.44871843e-01 3.55128157e-01]
 [9.93947319e-01 6.05268127e-03]
 [6.35073594e-03 9.93649264e-01]
 [8.30584275e-01 1.69415725e-01]
 [9.19015803e-01 8.09841974e-02]
 [7.11241105e-03 9.92887589e-01]
 [2.53384528e-02 9.74661547e-01]
 [9.30506699e-01 6.94933007e-02]
 [5.22321551e-01 4.77678449e-01]
 [1.12213156e-01 8.87786844e-01]
 [2.58153236e-03 9.97418468e-01]
 [2.92422129e-01 7.07577871e-01]
 [1.57162197e-03 9.98428378e-01]
 [9.90643373e-01 9.35662671e-03]
 [9.83760027e-01 1.62399733e-02]
 [5.83866882e-03 9.94161331e-01]
 [1.68014986e-01 8.31985014e-01]
 [6.33009550e-01 3.66990450e-01]
 [1.40406060e-02 9.85959394e-01]
 [5.38282370e-01 4.61717630e-01]
 [1.13908841e-01 8.86091159e-01]
 [1.27235649e-03 9.98727644e-01]
 [5.21367180e-01 4.78632820e-01]
 [9.85552823e-01 1.44471774e-02]
 [9.75818715e-01 2.41812847e-02]
 [9.92519747e-01 7.48025311e-03]
 [9.32643771e-03 9.90673562e-01]
 [3.41747254e-01 6.58252746e-01]
 [6.85185205e-02 9.31481480e-01]
 [6.90522972e-01 3.09477028e-01]
 [9.95913891e-01 4.08610922e-03]
 [9.06541092e-01 9.34589083e-02]
 [9.60238146e-01 3.97618539e-02]
 [9.95282522e-01 4.71747800e-03]
 [3.45473680e-01 6.54526320e-01]
 [7.52612009e-01 2.47387991e-01]
 [4.98177068e-02 9.50182293e-01]
 [8.83084231e-01 1.16915769e-01]
 [8.07561896e-01 1.92438104e-01]
 [1.70355769e-02 9.82964423e-01]
 [1.05090324e-01 8.94909676e-01]
 [5.26142699e-01 4.73857301e-01]
 [9.89149359e-01 1.08506410e-02]
 [9.72727975e-01 2.72720249e-02]
 [9.43414680e-01 5.65853203e-02]
 [9.99622941e-01 3.77058732e-04]
 [3.21400371e-01 6.78599629e-01]
 [8.18880779e-01 1.81119221e-01]
 [5.09933775e-01 4.90066225e-01]
 [9.86571667e-01 1.34283326e-02]
 [8.37150444e-01 1.62849556e-01]
 [9.96504476e-01 3.49552383e-03]
 [2.83535359e-01 7.16464641e-01]
 [8.19589687e-01 1.80410313e-01]
 [8.82655056e-01 1.17344944e-01]
 [9.40697457e-01 5.93025430e-02]
 [8.50214048e-01 1.49785952e-01]
 [3.35051309e-01 6.64948691e-01]
 [9.94640524e-01 5.35947643e-03]
 [1.49265906e-01 8.50734094e-01]
 [2.31367156e-01 7.68632844e-01]
 [1.25224401e-01 8.74775599e-01]
 [7.17594117e-03 9.92824059e-01]
 [3.14374261e-01 6.85625739e-01]
 [9.86872164e-01 1.31278360e-02]
 [9.90888098e-01 9.11190235e-03]
 [8.35111750e-02 9.16488825e-01]
 [6.18622152e-03 9.93813778e-01]
 [9.36960832e-01 6.30391679e-02]
 [9.17370641e-03 9.90826294e-01]
 [7.75041254e-02 9.22495875e-01]
 [3.74059421e-03 9.96259406e-01]
 [6.53544191e-01 3.46455809e-01]
 [9.54876267e-01 4.51237335e-02]
 [7.62534199e-03 9.92374658e-01]
 [5.92055502e-02 9.40794450e-01]
 [7.13178431e-01 2.86821569e-01]
 [3.01683806e-02 9.69831619e-01]
 [7.38627052e-01 2.61372948e-01]
 [1.25020525e-02 9.87497948e-01]
 [8.67139790e-01 1.32860210e-01]
 [3.43271283e-01 6.56728717e-01]
 [7.18628914e-02 9.28137109e-01]
 [2.15505124e-02 9.78449488e-01]
 [3.87854860e-02 9.61214514e-01]
 [2.21094634e-02 9.77890537e-01]
 [5.81959463e-01 4.18040537e-01]
 [9.16635597e-01 8.33644030e-02]
 [8.72558198e-01 1.27441802e-01]
 [2.36744974e-01 7.63255026e-01]
 [9.77593717e-02 9.02240628e-01]
 [4.24234200e-01 5.75765800e-01]
 [6.10423874e-02 9.38957613e-01]
 [9.59595900e-01 4.04041004e-02]
 [9.66475681e-01 3.35243195e-02]
 [6.59235424e-03 9.93407646e-01]
 [9.56031672e-01 4.39683281e-02]
 [2.57348167e-01 7.42651833e-01]
 [1.18321757e-01 8.81678243e-01]
 [8.11709343e-01 1.88290657e-01]
 [1.71418570e-01 8.28581430e-01]
 [9.73239612e-01 2.67603876e-02]
 [5.21573098e-03 9.94784269e-01]
 [9.17626409e-01 8.23735908e-02]
 [4.30203076e-01 5.69796924e-01]
 [1.04695936e-02 9.89530406e-01]
 [1.64352912e-02 9.83564709e-01]
 [3.17121575e-02 9.68287842e-01]
 [4.04032047e-02 9.59596795e-01]
 [9.94041759e-01 5.95824078e-03]
 [8.81770818e-01 1.18229182e-01]
 [8.56107244e-01 1.43892756e-01]
 [8.21130184e-02 9.17886982e-01]
 [4.59960108e-03 9.95400399e-01]
 [9.84996763e-01 1.50032367e-02]
 [6.23307588e-01 3.76692412e-01]
 [7.49692242e-01 2.50307758e-01]
 [9.25515727e-01 7.44842731e-02]
 [9.84410382e-03 9.90155896e-01]
 [1.13276889e-01 8.86723111e-01]
 [1.31635250e-01 8.68364750e-01]
 [4.53766691e-01 5.46233309e-01]
 [2.17541050e-02 9.78245895e-01]
 [9.69687163e-01 3.03128366e-02]
 [9.96791274e-01 3.20872617e-03]
 [5.73708810e-01 4.26291190e-01]
 [1.66838179e-02 9.83316182e-01]
 [8.28296115e-01 1.71703885e-01]
 [9.55957557e-01 4.40424431e-02]
 [3.88388712e-03 9.96116113e-01]
 [9.70998497e-01 2.90015032e-02]
 [9.37150198e-01 6.28498020e-02]
 [9.37611142e-01 6.23888582e-02]
 [1.62430991e-01 8.37569009e-01]
 [4.45361514e-02 9.55463849e-01]
 [9.70340611e-01 2.96593892e-02]
 [5.41606201e-02 9.45839380e-01]
 [9.93249910e-01 6.75008963e-03]
 [8.28900219e-01 1.71099781e-01]
 [7.46682746e-03 9.92533173e-01]
 [8.47877330e-03 9.91521227e-01]
 [8.78854199e-01 1.21145801e-01]
 [7.79968638e-02 9.22003136e-01]
 [3.69693191e-01 6.30306809e-01]
 [2.88137553e-02 9.71186245e-01]
 [9.50615712e-02 9.04938429e-01]
 [4.74132528e-01 5.25867472e-01]
 [3.01713560e-01 6.98286440e-01]
 [9.53837782e-01 4.61622183e-02]
 [2.15739565e-03 9.97842604e-01]
 [9.70641005e-01 2.93589948e-02]
 [2.72224249e-01 7.27775751e-01]
 [9.67527143e-01 3.24728569e-02]
 [2.52520922e-04 9.99747479e-01]
 [7.73364801e-01 2.26635199e-01]
 [8.19113084e-01 1.80886916e-01]
 [4.23851655e-02 9.57614835e-01]
 [3.54292973e-01 6.45707027e-01]
 [1.01368353e-01 8.98631647e-01]
 [1.48287913e-01 8.51712087e-01]
 [9.74625996e-01 2.53740036e-02]
 [9.96202573e-01 3.79742721e-03]
 [9.60116728e-01 3.98832719e-02]]
Out[38]:
<matplotlib.collections.PathCollection at 0x1d23b264898>

k-fold cross validation

In k-fold cross validation the data is split into k equal parts; k-1 parts are used for training and the remaining one for testing. k models are trained, each time leaving out a different part for testing.

https://scikit-learn.org/stable/modules/cross_validation.html
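
The splitting scheme itself is simple enough to write down. The following pure-Python sketch generates the train and test indices for k consecutive folds; the function name k_fold_indices is hypothetical, and unlike the library utilities it neither shuffles nor stratifies the data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample,
    # so the fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each sample is used for testing once and for training k-1 times.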

There are two functions for k-fold cross-validation in the sklearn.model_selection module: cross_val_score and cross_validate. The latter allows multiple metrics to be considered together.

In [41]:
import sklearn.model_selection as model_selection

scores = model_selection.cross_val_score(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring='f1_weighted',
                                          cv=5)
print (scores)
print (scores.mean())
[1.         0.93333333 0.96658312 0.96658312 0.86111111]
0.9455221386800334
In [42]:
scores = model_selection.cross_validate(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring=['precision_weighted','recall_weighted'],
                                          cv=3)
print (scores)
print (scores['test_precision_weighted'].mean(),scores['test_recall_weighted'].mean())
{'fit_time': array([0.00129271, 0.00107265, 0.00052547]), 'score_time': array([0.00336623, 0.0021081 , 0.00261068]), 'test_precision_weighted': array([0.96078431, 0.90952381, 0.90418354]), 'test_recall_weighted': array([0.96078431, 0.90196078, 0.89583333])}
0.9248305530039276 0.9195261437908497