Introduction to Clustering

In this tutorial we will look at some of the clustering functionality that scikit-learn provides. You can read more about clustering in scikit-learn here:

http://scikit-learn.org/stable/modules/clustering.html

In [3]:
import numpy as np
import scipy as sp
import scipy.sparse as sp_sparse
import scipy.spatial.distance as sp_dist

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
from sklearn import preprocessing
import sklearn.cluster as sk_cluster
import sklearn.feature_extraction.text as sk_text


import scipy.cluster.hierarchy as hr

import time
import seaborn as sns

%matplotlib inline

Generate data from Gaussian distributions.

More on data generation here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

In [4]:
centers = [[1,1], [-1, -1], [1, -1]]
X, true_labels = sk_data.make_blobs(n_samples=500, centers=3, n_features=2,
                                    center_box=(-10.0, 10.0),random_state=0)
#X, true_labels = sk_data.make_blobs(n_samples=500, centers=centers, n_features=2,center_box=(-10.0, 10.0),cluster_std = 0.4, random_state=0)
plt.scatter(X[:,0], X[:,1])
Out[4]:
<matplotlib.collections.PathCollection at 0xc4535c0>
In [5]:
print(type(X))
print(true_labels)
print(len(true_labels[true_labels==0]),len(true_labels[true_labels==1]),len(true_labels[true_labels==2]))
<class 'numpy.ndarray'>
[0 2 2 0 0 0 0 2 2 0 2 2 2 2 0 0 1 0 2 0 1 1 1 2 0 2 1 2 0 2 2 0 0 0 1 1 1
 1 1 1 2 0 0 1 1 0 1 0 2 2 0 0 0 0 2 2 1 1 0 2 1 2 1 1 1 2 0 1 2 0 2 0 0 2
 1 2 1 1 1 0 0 0 2 0 2 1 2 2 2 2 0 0 1 0 2 1 2 2 2 0 1 2 2 0 0 1 0 1 0 1 0
 0 1 2 2 1 2 1 1 2 2 0 2 2 2 0 2 0 2 0 1 0 2 1 0 1 2 2 1 0 0 2 2 1 0 0 0 1
 1 0 1 0 0 0 1 0 2 1 2 0 1 1 2 2 2 1 0 1 0 1 2 2 1 0 0 2 2 1 1 1 1 1 2 2 1
 1 0 1 0 2 2 0 1 1 0 2 1 0 2 1 2 1 0 0 2 1 1 1 2 2 0 1 1 2 2 0 2 0 2 2 1 2
 1 1 0 0 0 1 2 0 0 2 2 1 2 2 0 1 0 0 0 1 1 1 0 2 2 2 2 2 1 2 2 0 2 2 0 1 2
 1 1 0 0 1 1 0 2 1 2 1 1 2 1 0 0 1 0 1 1 1 1 2 1 0 0 0 0 2 2 1 1 2 2 0 0 1
 2 0 2 1 0 1 2 1 0 2 0 1 0 2 1 2 2 0 0 0 1 2 2 0 0 1 2 1 0 0 1 1 0 2 1 0 1
 2 1 1 0 2 0 2 1 2 1 0 0 0 1 0 0 2 1 0 2 2 2 0 1 1 1 2 0 1 2 0 0 0 2 0 2 0
 2 2 0 2 2 2 2 1 1 2 1 2 2 2 2 0 0 0 1 2 0 1 0 1 0 1 2 2 0 2 1 0 1 2 2 0 1
 2 1 2 0 0 0 1 2 0 0 1 2 2 0 2 1 0 2 0 1 0 2 0 0 1 0 0 0 0 1 0 1 2 1 1 0 2
 1 2 1 2 1 0 2 1 1 1 1 1 0 2 1 2 0 0 1 2 2 0 2 1 0 0 1 1 2 1 2 1 1 1 1 2 0
 1 1 0 1 2 2 0 1 1 2 0 2 0 0 1 1 1 0 1]
167 167 166
In [6]:
# successive scatter calls draw on the same axes, so the three classes are overlaid
plt.scatter(X[true_labels==1,0], X[true_labels==1,1], c='r')
plt.scatter(X[true_labels==2,0], X[true_labels==2,1], c='b')
plt.scatter(X[true_labels==0,0], X[true_labels==0,1], c='g')
Out[6]:
<matplotlib.collections.PathCollection at 0xc4fca90>
In [7]:
euclidean_dists = metrics.euclidean_distances(X)
plt.pcolor(euclidean_dists,cmap=plt.cm.coolwarm)
Out[7]:
<matplotlib.collections.PolyCollection at 0xce9b908>

Clustering

scikit-learn has a huge set of tools for unsupervised learning generally, and clustering specifically. These are in sklearn.cluster. http://scikit-learn.org/stable/modules/clustering.html

All the clustering classes provide three methods:

  • fit(),
  • predict(),
  • fit_predict().

fit() builds the model from the training data (e.g., for k-means it finds the centroids),

predict() assigns labels to data using the model that has been built, and

fit_predict() does both on the same data (e.g., in k-means it finds the centroids and assigns the labels to the dataset).
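
A minimal sketch of the difference (not part of the original notebook; it reuses the data X generated above and the np/sk_cluster aliases from the imports):

km = sk_cluster.KMeans(n_clusters=3, n_init=10)
km.fit(X)                                        # fit(): builds the model, i.e. finds the 3 centroids
new_points = np.array([[0.0, 0.0], [2.0, 4.0]])  # hypothetical new points
print(km.predict(new_points))                    # predict(): assigns each point to its nearest centroid
print(km.fit_predict(X)[:10])                    # fit_predict(): fits and labels the same data in one call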

In [14]:
import sklearn.cluster as sk_cluster

kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
kmeans_labels = kmeans.labels_
error = kmeans.inertia_

print ("The total error of the clustering is: ", error)
print ('\nCluster labels')
print(kmeans_labels)
print ('\n Cluster Centroids')
print (centroids)
The total error of the clustering is:  881.748277487

Cluster labels
[1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 2 1 0 1 1 2 2 0 1 0 2 0 1 0 0 1 1 0 2 2 2
 2 2 2 0 1 1 2 2 1 2 1 0 0 1 1 1 1 0 0 2 2 1 2 2 0 2 2 2 0 2 2 0 1 0 1 1 0
 2 0 2 2 2 1 1 1 0 1 0 2 0 0 0 0 1 1 2 1 0 2 0 1 0 1 2 0 0 1 1 2 1 2 1 2 1
 1 2 0 0 2 0 2 2 0 0 2 1 0 0 1 0 1 0 1 2 1 0 2 1 2 0 0 2 1 1 1 0 2 1 1 1 2
 2 1 2 1 1 1 2 1 0 2 0 1 2 2 0 0 0 2 1 2 1 2 0 0 2 1 1 0 0 2 2 2 2 2 0 0 2
 2 1 2 0 0 0 1 2 2 1 0 2 1 0 2 0 2 1 1 0 2 2 2 0 0 1 2 2 0 0 1 0 1 0 0 2 0
 2 2 1 1 1 2 0 1 1 0 0 2 0 0 1 2 1 1 1 2 2 2 1 0 0 0 0 0 2 0 0 1 0 0 1 2 0
 2 2 1 1 2 2 1 0 2 0 2 2 0 2 1 0 2 1 2 2 2 2 0 2 1 1 1 1 0 0 0 2 0 0 1 1 2
 0 1 0 2 1 2 0 2 1 0 1 2 1 0 2 0 0 1 0 1 2 0 0 1 1 2 1 2 1 1 1 2 1 0 2 2 2
 0 2 2 1 0 0 0 2 0 2 1 1 1 2 1 0 0 2 1 0 0 0 2 2 2 2 0 1 2 0 1 1 1 2 1 0 1
 0 0 1 0 0 0 0 2 2 0 2 1 0 1 0 1 1 0 2 0 0 2 1 2 1 2 0 0 1 1 2 1 2 0 0 1 2
 0 2 0 1 1 0 2 2 1 1 2 0 0 1 0 2 1 0 1 2 1 0 1 1 2 1 1 1 1 2 1 2 0 2 0 1 0
 2 0 2 0 2 1 0 2 2 2 2 2 1 0 2 0 1 1 2 0 0 0 0 2 1 1 2 2 1 2 0 2 2 0 2 1 1
 2 2 1 2 0 0 1 2 2 0 1 0 1 1 2 2 2 1 2]

 Cluster Centroids
[[-1.52371332  2.92068825]
 [ 0.87564159  4.45514163]
 [ 1.96167358  0.73752985]]
In [9]:
idx = np.argsort(kmeans_labels) # returns the indices in sorted order
rX = X[idx,:]
r_euclid = metrics.euclidean_distances(rX)
#r_euclid = euclidean_dists[idx,:][:,idx]
plt.pcolor(r_euclid,cmap=plt.cm.coolwarm)
Out[9]:
<matplotlib.collections.PolyCollection at 0x7396860>

Confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Important: In the produced confusion matrix, the first argument defines the rows and the second the columns. The matrix is always square, regardless of whether the number of classes and the number of clusters are the same; any extra rows or columns are filled with zeros.
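
For example, a small sketch (not part of the original notebook) with two true classes but three cluster labels; the row for the class value that never appears in the true labels is all zeros:

# two true classes (0,1) but three cluster labels (0,1,2): the matrix is 3x3
print(metrics.confusion_matrix([0, 0, 1, 1], [0, 1, 2, 2]))
# [[1 1 0]
#  [0 0 2]
#  [0 0 0]]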

Homogeneity and completeness: http://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness

Precision: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score

Recall: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score

Silhouette score: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

In [15]:
C= metrics.confusion_matrix(true_labels,kmeans_labels)
print (C)
#plt.pcolor(C,cmap=plt.cm.coolwarm)
plt.pcolor(C,cmap=plt.cm.Reds)
[[ 12 151   4]
 [  3   2 162]
 [154   9   3]]
Out[15]:
<matplotlib.collections.PolyCollection at 0x196717b8>
In [16]:
#The metrics assume that cluster i is mapped to class i
p = metrics.precision_score(true_labels,kmeans_labels, average=None)
print(p)
r = metrics.recall_score(true_labels,kmeans_labels, average = None)
print(r)
[ 0.07100592  0.01234568  0.01775148]
[ 0.07185629  0.01197605  0.01807229]
In [17]:
# map each cluster to the class with the larger number of points, and compute the new confusion matrix
# You need to be careful in the case that many clusters map to the same class
def cluster_class_mapping(kmeans_labels,true_labels):
    C= metrics.confusion_matrix(true_labels,kmeans_labels)
    mapping = list(np.argmax(C,axis=0)) #for each column (cluster) find the best class in the confusion matrix
    mapped_kmeans_labels = [mapping[l] for l in kmeans_labels]
    C2= metrics.confusion_matrix(true_labels,mapped_kmeans_labels)
    return mapped_kmeans_labels,C2

mapped_kmeans_labels,C = cluster_class_mapping(kmeans_labels,true_labels)
print(C)
[[151   4  12]
 [  2 162   3]
 [  9   3 154]]
In [18]:
h = metrics.homogeneity_score(true_labels,kmeans_labels)
print(h)
c = metrics.completeness_score(true_labels,kmeans_labels)
print(c)
v = metrics.v_measure_score(true_labels,kmeans_labels)
print(v)
p = metrics.precision_score(true_labels,mapped_kmeans_labels, average=None)
print(p)
r = metrics.recall_score(true_labels,mapped_kmeans_labels, average = None)
print(r)
p = metrics.precision_score(true_labels,mapped_kmeans_labels, average='weighted')
print(p)
r = metrics.recall_score(true_labels,mapped_kmeans_labels, average = 'weighted')
print(r)
0.749703757499
0.749835439427
0.749769592681
[ 0.93209877  0.95857988  0.9112426 ]
[ 0.90419162  0.97005988  0.92771084]
0.934019212506
0.934
In [19]:
error = np.zeros(11)
sh_score = np.zeros(11)
for k in range(1,11):
    kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=k, n_init=10)
    kmeans.fit_predict(X)
    error[k] = kmeans.inertia_  # sum of squared distances of the points to their closest centroid
    if k>1: sh_score[k] = metrics.silhouette_score(X, kmeans.labels_)  # silhouette is undefined for a single cluster

plt.plot(range(1,len(error)),error[1:])
plt.xlabel('Number of clusters')
plt.ylabel('Error')
Out[19]:
<matplotlib.text.Text at 0x19694320>
In [20]:
plt.plot(range(2,len(sh_score)),sh_score[2:])
plt.xlabel('Number of clusters')
plt.ylabel('silhouette score')
Out[20]:
<matplotlib.text.Text at 0x19656780>
In [21]:
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
plt.scatter(X[:, 0], X[:, 1], color=colors[kmeans_labels].tolist(), s=10, alpha=0.8)
Out[21]:
<matplotlib.collections.PathCollection at 0x197b4ac8>

Agglomerative Clustering

More on Agglomerative Clustering here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

In [22]:
agglo = sk_cluster.AgglomerativeClustering(linkage = 'complete', n_clusters = 3)
agglo_labels = agglo.fit_predict(X)

C_agglo= metrics.confusion_matrix(true_labels,agglo_labels)
print (C_agglo)
#plt.pcolor(C_agglo,cmap=plt.cm.coolwarm)
plt.pcolor(C_agglo,cmap=plt.cm.Reds)

mapped_agglo_labels,C_agglo = cluster_class_mapping(agglo_labels,true_labels)
print(C_agglo)
p = metrics.precision_score(true_labels,mapped_agglo_labels, average='weighted')
print(p)
r = metrics.recall_score(true_labels,mapped_agglo_labels, average = 'weighted')
print(r)
[[ 12  23 132]
 [159   7   1]
 [  3 149  14]]
[[132  12  23]
 [  1 159   7]
 [ 14   3 149]]
0.881482805798
0.88

Another way to do agglomerative clustering using SciPy:

https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

In [23]:
import scipy.cluster.hierarchy as hr

Z = hr.linkage(X, method='complete', metric='euclidean')

print (Z.shape, X.shape)
(499, 4) (500, 2)
In [24]:
import scipy.spatial.distance as sp_dist
D = sp_dist.pdist(X, 'euclidean') 
Z = hr.linkage(D, method='complete')
print (Z.shape, X.shape)
(499, 4) (500, 2)

Hierarchical clustering returns an (n-1) by 4 matrix Z. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
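
To obtain flat cluster labels from the linkage matrix you can cut the hierarchy, for example with SciPy's fcluster; a small sketch (not part of the original notebook):

flat_labels = hr.fcluster(Z, t=3, criterion='maxclust')   # cut the tree into (at most) 3 flat clusters
print(flat_labels[:20])                                   # fcluster numbers the clusters 1..3
print(metrics.confusion_matrix(true_labels, flat_labels - 1))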

In [25]:
fig = plt.figure(figsize=(10,10))
T = hr.dendrogram(Z, color_threshold=0.4, leaf_font_size=4)

Another way to do agglomerative clustering (and visualizing it): http://seaborn.pydata.org/generated/seaborn.clustermap.html

In [26]:
distances = metrics.euclidean_distances(X)
cg = sns.clustermap(distances, method="complete", figsize=(13,13), xticklabels=False)
print (cg.dendrogram_col.reordered_ind)
[56, 199, 237, 179, 233, 358, 303, 452, 143, 478, 147, 440, 74, 223, 16, 388, 467, 267, 451, 44, 406, 290, 201, 422, 92, 100, 60, 107, 342, 356, 269, 458, 327, 393, 275, 165, 277, 135, 272, 117, 301, 227, 95, 172, 193, 37, 444, 310, 414, 279, 332, 57, 484, 21, 26, 46, 350, 184, 488, 192, 264, 340, 78, 489, 66, 178, 448, 220, 408, 160, 180, 499, 473, 130, 36, 22, 150, 34, 133, 177, 118, 181, 436, 109, 307, 380, 76, 241, 263, 455, 260, 167, 295, 446, 470, 169, 316, 400, 462, 38, 62, 242, 334, 476, 250, 417, 243, 402, 77, 321, 196, 475, 257, 85, 361, 67, 377, 378, 413, 481, 115, 323, 431, 63, 357, 206, 112, 471, 335, 105, 482, 212, 280, 205, 211, 438, 183, 96, 443, 447, 2, 55, 329, 291, 490, 379, 249, 374, 114, 120, 450, 87, 163, 268, 30, 287, 25, 119, 128, 441, 219, 311, 333, 13, 88, 302, 337, 348, 93, 479, 141, 286, 483, 363, 411, 234, 486, 170, 200, 309, 4, 457, 463, 382, 124, 373, 82, 421, 176, 228, 164, 248, 48, 288, 162, 318, 235, 459, 126, 370, 182, 190, 213, 466, 396, 11, 116, 7, 75, 376, 445, 246, 70, 353, 18, 198, 8, 341, 368, 305, 292, 298, 485, 29, 247, 339, 175, 389, 27, 254, 407, 424, 49, 216, 439, 281, 464, 102, 271, 371, 54, 123, 492, 10, 195, 23, 171, 397, 245, 428, 89, 101, 296, 137, 204, 142, 255, 354, 384, 73, 317, 474, 1, 258, 189, 251, 132, 362, 12, 218, 231, 418, 209, 40, 86, 136, 404, 352, 349, 65, 84, 208, 47, 324, 430, 149, 104, 229, 351, 45, 394, 285, 293, 224, 31, 225, 140, 313, 42, 17, 217, 381, 359, 383, 33, 412, 221, 312, 325, 442, 152, 72, 240, 108, 174, 308, 80, 367, 194, 5, 186, 0, 106, 365, 423, 344, 449, 433, 41, 91, 215, 210, 173, 491, 58, 127, 369, 297, 315, 392, 153, 437, 145, 304, 71, 168, 111, 110, 364, 410, 52, 435, 203, 32, 498, 336, 306, 469, 256, 416, 461, 81, 155, 386, 429, 129, 415, 43, 496, 187, 330, 355, 154, 497, 395, 495, 282, 331, 121, 494, 79, 300, 372, 276, 244, 487, 425, 460, 19, 103, 345, 273, 434, 283, 480, 166, 262, 253, 328, 50, 83, 69, 420, 493, 53, 226, 347, 360, 146, 265, 398, 432, 326, 24, 236, 125, 238, 51, 239, 131, 385, 151, 320, 15, 139, 159, 401, 456, 6, 322, 468, 188, 61, 98, 68, 387, 472, 294, 232, 390, 261, 319, 28, 90, 97, 427, 122, 375, 14, 314, 99, 191, 230, 134, 197, 252, 158, 214, 156, 266, 94, 274, 338, 409, 113, 465, 20, 64, 426, 343, 3, 144, 9, 202, 284, 405, 138, 453, 161, 399, 299, 148, 270, 185, 222, 259, 391, 59, 278, 157, 366, 346, 454, 289, 477, 35, 39, 419, 207, 403]
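
DBSCAN

DBSCAN groups together points that lie in dense regions; points that do not belong to any dense region are labeled -1 and treated as noise. More on DBSCAN here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
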
In [27]:
dbscan = sk_cluster.DBSCAN(eps=0.3)
dbscan_labels = dbscan.fit_predict(X)
print(dbscan_labels) #label -1 corresponds to noise
renamed_dbscan_labels = [x+1 for x in dbscan_labels] # shift labels so that noise (-1) becomes 0 and all labels are non-negative
C = metrics.confusion_matrix(true_labels,renamed_dbscan_labels)
print (C[:max(true_labels)+1,:]) # keep only the rows that correspond to the true classes
#print(metrics.confusion_matrix(true_labels,renamed_dbscan_labels))
[ 0  3  1 -1  2  0 -1  2  3  6  3  2 -1  4  0 -1 -1 -1  3  0 -1  5  5  3  6
  7  5 -1  0  3  7 -1  9 -1  5 -1  5  5  5 -1  3  0 -1  8 -1 -1  5 -1  3 -1
  0  0  0  0 -1  1 -1  5  0 -1 -1  0  5  5 -1 -1 -1  5  0  0  3 -1 -1  3 -1
  2  5  5  5 -1 -1 -1  3  0 -1 -1  3 -1  4  3  0  0 -1 -1  0  5 -1  0  0  0
 -1  3 -1  0 -1 -1  0 -1  0  5  0  0 -1  0  1  5  2  5  5  7  1 -1  0 -1  3
  6  3  0 -1 -1  5  0  3  5  0  5  3  3 -1 -1 -1  2  3 -1 -1  0  0 -1 -1 -1
  5 -1 -1  0  8 -1  0 -1  0 -1  5 -1  1  1  3  5  0  5  2  5  2  3  5  0  0
  2  3  5  5 -1  5  5  3 -1  5 -1  0  8  0  3  3  0  5  5  0  3  5  0  3  5
  2 -1  6  9  3 -1 -1 -1 -1  3  0 -1 -1  3  3  0 -1 -1  1  4  5 -1 -1 -1 -1
 -1  0  5  3 -1  0  3  0 -1  2  3  6  5  0  0 -1  5  5  5  0  3  3  3  3  1
  5  3  3  0 -1  3  9 -1  3 -1  5  0  0  5  5  0  0 -1  7  5 -1 -1  5  0  0
  5  0  5 -1  5 -1 -1 -1  0 -1 -1  2  7  3 -1 -1  1  3 -1  0  5  3  0  3 -1
 -1  5  4 -1 -1  3  9  5  0  2  5  4 -1 -1  0  0  5  3  3  0 -1  5  0  5 -1
 -1  6  5  0  1  8 -1  5  4  5 -1  9  4  0  2  5  3  5 -1  0  0 -1  0 -1 -1
  5 -1 -1  3  3 -1  5  5 -1 -1  0 -1  3  2  0  0 -1  0  3  0  3 -1 -1  3  1
  0  2  5  5  1  5 -1  3 -1  3  0 -1  0 -1  2  0 -1  0  5 -1 -1  2 -1  6 -1
  5 -1  5 -1  3 -1  5 -1  5  0  9  2 -1  5  5  0  9  5  3 -1  0  3 -1  0 -1
  0 -1  0 -1 -1 -1  5  6  0  0  0  5  0 -1 -1 -1 -1 -1 -1  5  2  5 -1 -1  0
 -1 -1 -1 -1 -1  5  0  2  5  3  0 -1 -1  2 -1  0  3 -1 -1  9  5 -1  0  5  3
 -1  5 -1 -1 -1  0  5 -1  2  5  3  2  0  5  5  1  0 -1  0 -1 -1  8  8  9  5]
[[58 87  0  6  0  0  0  7  0  0  9]
 [68  0  0  0  0  0 92  1  0  6  0]
 [41 15 13 18 66  7  1  0  5  0  0]]
In [28]:
#colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
#colors = np.hstack([colors] * 20)
colors = np.array([x for x in 'bgrcmywk'*10])
plt.scatter(X[:, 0], X[:, 1], color=colors[dbscan_labels].tolist(), s=10, alpha=0.8)
Out[28]:
<matplotlib.collections.PathCollection at 0x19cde390>

Processing Complex Data

So far we have assumed that the input is in the form of numerical vectors to which we can directly apply our algorithms. Often the data will be more complex. For example, what if we want to cluster categorical data, itemsets, or text? Python provides libraries for processing such data and transforming it into a format that we can use.

scikit-learn offers a set of tools for feature extraction: http://scikit-learn.org/stable/modules/feature_extraction.html

DictVectorizer

More on DictVectorizer feature extraction here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer

The DictVectorizer takes dictionaries of attribute-value pairs and transforms them into numerical vectors. Real values are preserved, while categorical attributes are transformed into binary (one-hot) features. The vectorizer produces a sparse representation.

In [29]:
measurements = [
{'city': 'Dubai', 'temperature': 45},
{'city': 'London', 'temperature': 12},
{'city': 'San Fransisco', 'temperature': 23},
]
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
print(type(vec.fit_transform(measurements)))
print(vec.fit_transform(measurements).toarray())
vec.get_feature_names()
<class 'scipy.sparse.csr.csr_matrix'>
[[  1.   0.   0.  45.]
 [  0.   1.   0.  12.]
 [  0.   0.   1.  23.]]
Out[29]:
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
In [30]:
measurements = [
    {'refund' : 'No','marital_status': 'married', 'income' : 100},
    {'refund' : 'Yes','marital_status': 'single', 'income' : 120},
    {'refund' : 'No','marital_status':'divorced', 'income' : 80},
]
vec = DictVectorizer()
print(vec.fit_transform(measurements))
vec.get_feature_names()
  (0, 0)	100.0
  (0, 2)	1.0
  (0, 4)	1.0
  (1, 0)	120.0
  (1, 3)	1.0
  (1, 5)	1.0
  (2, 0)	80.0
  (2, 1)	1.0
  (2, 4)	1.0
Out[30]:
['income',
 'marital_status=divorced',
 'marital_status=married',
 'marital_status=single',
 'refund=No',
 'refund=Yes']

SciKit datasets: http://scikit-learn.org/stable/datasets/

We will use the 20-newsgroups dataset, which consists of postings on 20 different newsgroups.

More information here: http://scikit-learn.org/stable/datasets/#the-20-newsgroups-text-dataset

In [31]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
#categories = ['alt.atheism', 'sci.space','rec.sport.baseball']
news_data = sk_data.fetch_20newsgroups(subset='train', 
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
#print (news_data.target, len(news_data.target))
print (news_data.target_names)
['comp.os.ms-windows.misc', 'rec.sport.baseball', 'sci.space']
In [32]:
print (type(news_data))
print (news_data.filenames)
print (news_data.target[:10])
print (news_data.data[0])
print (len(news_data.data))
<class 'sklearn.datasets.base.Bunch'>
[ 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60940'
 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.os.ms-windows.misc\\9955'
 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.os.ms-windows.misc\\9846'
 ...,
 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60891'
 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\104484'
 'C:\\Users\\Panayiotis\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\61110']
[2 0 0 2 0 0 1 2 2 1]


Early to mid June.


If they think the public wants to see it they will carry it. Why not
write them and ask? You can reach them at:


                          F: NATIONAL NEWS MEDIA


ABC "World News Tonight"                 "Face the Nation"
7 West 66th Street                       CBS News
New York, NY 10023                       2020 M Street, NW
212/887-4040                             Washington, DC 20036
                                         202/457-4321

Associated Press                         "Good Morning America"
50 Rockefeller Plaza                     ABC News
New York, NY 10020                       1965 Broadway
National Desk (212/621-1600)             New York, NY 10023
Foreign Desk (212/621-1663)              212/496-4800
Washington Bureau (202/828-6400)
                                         Larry King Live TV
"CBS Evening News"                       CNN
524 W. 57th Street                       111 Massachusetts Avenue, NW
New York, NY 10019                       Washington, DC 20001
212/975-3693                             202/898-7900

"CBS This Morning"                       Larry King Show--Radio
524 W. 57th Street                       Mutual Broadcasting
New York, NY 10019                       1755 So. Jefferson Davis Highway
212/975-2824                             Arlington, VA 22202
                                         703/685-2175
"Christian Science Monitor"
CSM Publishing Society                   "Los Angeles Times"
One Norway Street                        Times-Mirror Square
Boston, MA 02115                         Los Angeles, CA 90053
800/225-7090                             800/528-4637

CNN                                      "MacNeil/Lehrer NewsHour"
One CNN Center                           P.O. Box 2626
Box 105366                               Washington, DC 20013
Atlanta, GA 30348                        703/998-2870
404/827-1500
                                         "MacNeil/Lehrer NewsHour"
CNN                                      WNET-TV
Washington Bureau                        356 W. 58th Street
111 Massachusetts Avenue, NW             New York, NY 10019
Washington, DC 20001                     212/560-3113
202/898-7900

"Crossfire"                              NBC News
CNN                                      4001 Nebraska Avenue, NW
111 Massachusetts Avenue, NW             Washington, DC 20036
Washington, DC 20001                     202/885-4200
202/898-7951                             202/362-2009 (fax)

"Morning Edition/All Things Considered"  
National Public Radio                    
2025 M Street, NW                        
Washington, DC 20036                     
202/822-2000                             

United Press International
1400 Eye Street, NW
Washington, DC 20006
202/898-8000

"New York Times"                         "U.S. News & World Report"
229 W. 43rd Street                       2400 N Street, NW
New York, NY 10036                       Washington, DC 20037
212/556-1234                             202/955-2000
212/556-7415

"New York Times"                         "USA Today"
Washington Bureau                        1000 Wilson Boulevard
1627 Eye Street, NW, 7th Floor           Arlington, VA 22229
Washington, DC 20006                     703/276-3400
202/862-0300

"Newsweek"                               "Wall Street Journal"
444 Madison Avenue                       200 Liberty Street
New York, NY 10022                       New York, NY 10281
212/350-4000                             212/416-2000

"Nightline"                              "Washington Post"
ABC News                                 1150 15th Street, NW
47 W. 66th Street                        Washington, DC 20071
New York, NY 10023                       202/344-6000
212/887-4995

"Nightline"                              "Washington Week In Review"
Ted Koppel                               WETA-TV
ABC News                                 P.O. Box 2626
1717 DeSales, NW                         Washington, DC 20013
Washington, DC 20036                     703/998-2626
202/887-7364

"This Week With David Brinkley"
ABC News
1717 DeSales, NW
Washington, DC 20036
202/887-7777

"Time" magazine
Time Warner, Inc.
Time & Life Building
Rockefeller Center
New York, NY 10020
212/522-1212

1781

CountVectorizer

The CountVectorizer can be used to extract features in the form of bags of words. It is typically used for text, but you could also use it to represent a collection of itemsets (where each itemset becomes a 'document' and each item becomes a 'word'); see the sketch below.
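
For example, a hypothetical sketch (not part of the original notebook) that encodes market-basket itemsets, writing each transaction as a space-separated string of item names:

# each transaction becomes a "document" and each item a "word";
# the result is a sparse item-count matrix
baskets = ['bread milk', 'bread diapers beer eggs', 'milk diapers beer cola']
basket_vectorizer = sk_text.CountVectorizer()
B = basket_vectorizer.fit_transform(baskets)
print(basket_vectorizer.get_feature_names())
print(B.toarray())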

In [33]:
import sklearn.feature_extraction.text as sk_text
vectorizer = sk_text.CountVectorizer(min_df=1)
#vectorizer = sk_text.CountVectorizer(min_df=1,stop_words = 'english')

corpus = ['This is the first document.',
           'this is the second second document.',
           'And the third one.',
           'Is this the first document?',
          ]
X = vectorizer.fit_transform(corpus)
print(X.toarray())  
vectorizer.get_feature_names()
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
Out[33]:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

TfidfVectorizer

TfidfVectorizer transforms text into a sparse matrix where rows are documents, columns are words, and values are the tf-idf weights. It performs tokenization and normalization, and can also remove stop-words. More here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

In [34]:
vectorizer = sk_text.TfidfVectorizer(
                            #stop_words='english',
                             #max_features = 1000,
                             min_df=1)
X = vectorizer.fit_transform(corpus)
print(X.toarray())  
print (vectorizer.get_feature_names())
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574
   0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
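
As a sanity check, the values above can be reproduced by hand from the raw term counts, assuming the TfidfVectorizer defaults smooth_idf=True and norm='l2' (a sketch, not part of the original notebook; it reuses corpus, sk_text and np from the cells above):

counts = sk_text.CountVectorizer().fit_transform(corpus).toarray()  # raw term counts
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                                       # document frequencies
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0                     # smoothed idf
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)        # l2-normalize each row
print(tfidf[0])   # should match the first row printed above
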
In [35]:
vectorizer = sk_text.TfidfVectorizer(stop_words='english',
                             #max_features = 1000,
                             min_df=4, max_df=0.8)
data = vectorizer.fit_transform(news_data.data)
print(type(data))
<class 'scipy.sparse.csr.csr_matrix'>

Clustering text data

An example of what we want to do: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In [36]:
import sklearn.cluster as sk_cluster
k=3
kmeans = sk_cluster.KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
kmeans.fit_predict(data)
Out[36]:
array([0, 2, 2, ..., 0, 0, 0])
In [37]:
print("Top terms per cluster:")
asc_order_centroids = kmeans.cluster_centers_.argsort()#[:, ::-1]
order_centroids = asc_order_centroids[:,::-1]
terms = vectorizer.get_feature_names()
for i in range(k):
    print ("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print (' %s' % terms[ind])
Top terms per cluster:
Cluster 0:
 space
 like
 just
 nasa
 think
 know
 don
 thanks
 does
 people
Cluster 1:
 year
 team
 game
 games
 think
 baseball
 runs
 good
 hit
 players
Cluster 2:
 windows
 file
 dos
 files
 use
 thanks
 drivers
 card
 driver
 problem
In [38]:
C = metrics.confusion_matrix(news_data.target,kmeans.labels_)

mapped_kmeans_labels,C = cluster_class_mapping(kmeans.labels_,news_data.target)
print (C)
p = metrics.precision_score(news_data.target,mapped_kmeans_labels, average=None)
print(p)
r = metrics.recall_score(news_data.target,mapped_kmeans_labels, average = None)
print(r)
[[380   1 210]
 [  0 422 175]
 [  1   3 589]]
[ 0.99737533  0.99061033  0.60472279]
[ 0.642978    0.70686767  0.99325464]
In [39]:
agglo = sk_cluster.AgglomerativeClustering(linkage = 'complete', n_clusters = 3,)
dense = data.todense()
agglo_labels = agglo.fit_predict(dense) # agglomerative needs dense data

C_agglo= metrics.confusion_matrix(news_data.target,agglo_labels)
print (C_agglo)
[[171 400  20]
 [156 438   3]
 [250 340   3]]
In [40]:
dbscan = sk_cluster.DBSCAN(eps=0.1)
dbscan_labels = dbscan.fit_predict(data)
C = metrics.confusion_matrix(news_data.target,dbscan.labels_)
print (C)
[[  0   0   0   0]
 [556   9  26   0]
 [567   0  30   0]
 [576   0  17   0]]