Supervised learning using scikit-learn

The goal of this tutorial is to introduce you to the scikit-learn library for classification. We will also cover feature selection and evaluation.

Feature Selection

Feature selection is about finding the best features for your classifier. This may be important if you do not have enough training data. The idea is to find metrics that characterize the features, either on their own, with respect to the class we want to predict, or with respect to other features.

http://scikit-learn.org/stable/modules/feature_selection.html

Variance Threshold

The VarianceThreshold selector drops features whose variance is below some threshold. If we have binary features we can set the threshold exactly so as to guarantee a specific ratio of 0's and 1's.
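
A minimal sketch with a toy binary matrix (the data is ours for illustration): for Bernoulli features the variance is p(1-p), so a threshold of 0.8 * (1 - 0.8) removes features that are constant in more than 80% of the samples.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary data: the first column is 0 in 5 of the 6 samples
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# For Bernoulli features Var[X] = p(1-p); to drop features that are
# constant in more than 80% of the samples, set the threshold at p = 0.8
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
print(selector.fit_transform(X))  # the first column is removed
```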

Univariate Feature Selection

A more sophisticated feature selection technique uses a statistical test to determine whether a feature and the class label are independent. An example of such a test is the chi-square test (there are others).

In this case we keep the features with high chi-square scores and low p-values; the features with the lowest scores and highest p-values are rejected.

The chi-square test is usually applied to categorical data.

Below we compute the chi-square values and the p-values between the features and the target variable (the columns of X and y).
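
A minimal sketch using SelectKBest with the chi2 score function; the Iris data and k=2 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square scores
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)    # chi-square statistic per feature
print(selector.pvalues_)   # p-value per feature
```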

Supervised Learning

scikit-learn has several classes and objects for implementing different supervised learning techniques, such as regression and classification.

Regardless of the specific model, the following methods are implemented:

The method fit() takes the training data and labels/values, and trains the model.

The method predict() takes as input the test data and applies the model.
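
A minimal sketch of this interface; the Iris data and the decision tree are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier()  # any estimator follows the same pattern
model.fit(X, y)                   # fit() trains the model
print(model.predict(X[:5]))       # predict() applies it to data
```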

Preparing the data

To perform classification we first need to prepare the data into train and test datasets.

Randomly shuffle the data. This ensures that the train/test split does not depend on the original ordering of the data.

Select a subset for training and a subset for testing

We can also use the train_test_split function of scikit-learn for splitting the data into train and test sets. In this case you do not need the random shuffling step, since train_test_split shuffles by default (but it does not hurt).
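
A sketch of both approaches; the Iris data and the 70/30 split are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Manual approach: shuffle, then slice off train and test subsets
rng = np.random.default_rng(42)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
n_train = int(0.7 * len(X))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Equivalent one-liner; train_test_split shuffles by default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```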

Classification models

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

scikit-learn has classes and objects that implement the different classification techniques that we described in class.

Decision Trees

http://scikit-learn.org/stable/modules/tree.html

Train and apply a decision tree classifier. The default score computed by the classifier object is the accuracy. Decision trees can also give us class probabilities.
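
A minimal sketch; the Iris data and max_depth=3 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtree = DecisionTreeClassifier(max_depth=3)
dtree.fit(X_train, y_train)

print(dtree.score(X_test, y_test))      # default score: accuracy
print(dtree.predict_proba(X_test[:5]))  # per-class probabilities
```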

Compute some more metrics
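
For example, the confusion matrix and the per-class precision, recall, and F1 scores from sklearn.metrics (a self-contained sketch with the same illustrative choices as above):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = DecisionTreeClassifier().fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))       # counts of (true, predicted) pairs
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```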

Visualize the decision tree.

For this you will need to install the package python-graphviz
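
A sketch using export_graphviz, assuming a fitted classifier dtree as in the example above:

```python
import graphviz
from sklearn.tree import export_graphviz

# dtree is a fitted DecisionTreeClassifier (see the sketch above)
dot_data = export_graphviz(dtree, out_file=None, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # writes decision_tree.pdf
```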

k-NN Classification

https://scikit-learn.org/stable/modules/neighbors.html#classification
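
A minimal sketch with k=5 neighbors on the Iris data (both illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```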

SVM Classification

http://scikit-learn.org/stable/modules/svm.html

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
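
A minimal sketch using SVC; the RBF kernel and C=1.0 are illustrative choices (they are also the defaults):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```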

Logistic Regression

http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

For Logistic Regression we can also obtain the probabilities for the different classes.

And the coefficients of the logistic regression model.
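
A sketch on the breast cancer data (an illustrative choice; max_iter is raised because the unscaled features make the solver converge slowly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)

print(logreg.predict_proba(X_test[:5]))  # probability of each class
print(logreg.coef_, logreg.intercept_)   # coefficients of the linear model
```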

Linear Regression

Linear Regression is implemented by the sklearn.linear_model.LinearRegression class: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

The $R^2$ score (the "coefficient of determination") measures the fraction of the variance of the target that is explained by the model:

$R^2 = 1-\frac{\sum_i (y_i -\hat y_i)^2}{\sum_i (y_i -\bar y)^2}$

where $\hat y_i$ is the prediction for point $x_i$ and $\bar y$ is the mean value of the target variable
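
A quick check of the formula on synthetic data, where the true relationship y = 3x + 2 is our choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=1.0, size=100)  # y = 3x + 2 plus noise

lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)  # should be close to 3 and 2
print(lr.score(X, y))           # the R^2 score
```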

A more complex example with the diabetes dataset
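
A minimal sketch of such an example (the train/test split parameters are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print(lr.score(X_test, y_test))  # R^2 on held-out data
```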

More Evaluation

http://scikit-learn.org/stable/model_selection.html#model-selection

Computing Scores

The breast cancer dataset
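
A sketch computing several scores on the breast cancer data; the decision tree is an illustrative choice of classifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
```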

k-fold cross validation

In k-fold cross validation the data is split into k equal parts; k-1 parts are used for training and the remaining part for testing. This is repeated k times, so k models are trained, each time leaving out a different part for testing.

https://scikit-learn.org/stable/modules/cross_validation.html

There are two methods for implementing k-fold cross-validation in the model_selection module: cross_val_score and cross_validate. The latter allows multiple metrics to be computed together.
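
A sketch of both functions, with a decision tree and 5 folds as illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cross_val_score: one metric, one score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())

# cross_validate: several metrics at once
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1'])
print(results['test_accuracy'], results['test_f1'])
```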

Creating a pipeline

If the same steps are often repeated, you can create a pipeline to perform them all at once:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
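
A sketch chaining the feature selection and classification steps from earlier; the steps and parameters are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature selection + classification behave as a single estimator
pipe = Pipeline([
    ('select', SelectKBest(chi2, k=10)),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```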

Text classification Example

We will use the 20 Newsgroups dataset for a text classification example.
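
A minimal sketch restricted to two categories so that it runs quickly; the categories, the TF-IDF vectorizer, and the classifier are all illustrative choices.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

categories = ['sci.space', 'rec.autos']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# Turn raw text into TF-IDF feature vectors
vec = TfidfVectorizer(stop_words='english')
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

clf = LogisticRegression(max_iter=1000).fit(X_train, train.target)
print(accuracy_score(test.target, clf.predict(X_test)))
```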

Word embeddings and Text classification

We will now see how we can train and use word embeddings. We will also see the NLTK library.

The NLTK library allows for sophisticated text processing. It can do stemming, create parse trees, do PoS (Part-of-Speech) tagging, and find noun phrases, named entities, and more.
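
A small sketch of tokenization, PoS tagging, and stemming; note that the names of the resources to download can vary between NLTK versions.

```python
import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # PoS tagger

text = "The quick brown foxes are jumping over the lazy dogs"
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))                  # (word, PoS tag) pairs

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])     # stemmed tokens
```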

The Gensim library

The Gensim library implements several NLP models.

You can use existing modules to train a word2vec model: https://radimrehurek.com/gensim/models/word2vec.html
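
A minimal sketch on a toy corpus; the corpus and the parameters are illustrative (sg=0 selects CBOW, sg=1 SkipGram).

```python
from gensim.models import Word2Vec

# A tiny toy corpus: a list of tokenized sentences
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning'],
    ['we', 'use', 'scikit', 'learn', 'for', 'machine', 'learning'],
]

# min_count=1 only because the corpus is tiny
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
print(model.wv['learning'][:5])          # start of one word vector
print(model.wv.most_similar('learning'))
```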

Train a CBOW embedding on the training data corpus

Transform the train and test data

Train a classifier on the embeddings
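
A combined sketch of the three steps above, on a toy corpus with placeholder labels. Each document is represented by the average of its word vectors, which is one common choice.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Placeholder data: tokenized documents with binary labels
train_texts = [['good', 'movie'], ['bad', 'plot'],
               ['great', 'acting'], ['awful', 'film']]
train_y = [1, 0, 1, 0]
test_texts = [['good', 'acting'], ['bad', 'film']]

# 1. Train a CBOW embedding (sg=0) on the training corpus
w2v = Word2Vec(train_texts, vector_size=50, window=5, min_count=1, sg=0)

# 2. Transform each document into the average of its word vectors
def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train = np.array([doc_vector(t, w2v) for t in train_texts])
X_test = np.array([doc_vector(t, w2v) for t in test_texts])

# 3. Train a classifier on the embeddings
clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)
print(clf.predict(X_test))
```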

Train a SkipGram embedding on the training data corpus

Transform the train and test data

Train a classifier on the embeddings
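
The pipeline is identical to the CBOW sketch above; the only change is the sg flag when training the embedding.

```python
# Same pipeline as the CBOW sketch above; sg=1 selects SkipGram
w2v_sg = Word2Vec(train_texts, vector_size=50, window=5, min_count=1, sg=1)
```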

You can also download the Google word2vec model trained over millions of documents

Transform the train and test data

Train a classifier on the embeddings
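
A sketch using the gensim downloader API; the model is large (about 1.6 GB), so the first call takes a while. The loaded object is a KeyedVectors, so words are indexed directly rather than through .wv; otherwise the transform-and-classify steps mirror the sketches above.

```python
import gensim.downloader as api

# Download (on first use) the 300-dimensional Google News vectors
w2v_google = api.load('word2vec-google-news-300')

print(w2v_google.most_similar('king')[:3])
```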