CS059 – Data Mining
Fall 2013
Material
Books and Slides
·
Mining of Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec.
Free online book. Slides from the course.
·
Material from the book “Data Mining: Concepts and Techniques” by Jiawei Han
and Micheline Kamber.
·
Material from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
·
Material from the book “Introduction to Information Retrieval” by C. Manning,
P. Raghavan, H. Schütze.
·
Material from the book “Networks, Crowds, and Markets” by D. Easley,
J. Kleinberg.
Useful Unix Commands
You may find the following Unix commands useful when pre-processing data:
·
cut: extracts specific columns from delimited data.
·
sort: sorts the rows of a file in lexicographic order; use -n for numeric order.
·
uniq: merges consecutive identical rows of a file (sort the file first to
merge all duplicates).
·
grep: finds a string within a file.
Software
·
WEKA Data Mining Software: a software package that implements multiple data
mining tools.
·
FIMI (Frequent Itemset Mining Implementations): a repository of
implementations for frequent itemset mining. All implementations assume the
input format of the example datasets: a text file where each row is a basket
consisting of space-separated integers that represent the items.
·
LIBLINEAR: a software package for classification. Implements the logistic
regression and SVM classifiers.
Datasets
·
The Yelp Academic Challenge dataset
·
UCI Machine Learning Repository
o Data for Assignments 2, 3:
§ The Iris dataset (ARFF file). The link to the UCI repository.
§ The SpamBase dataset (ARFF file). The link to the UCI repository.
§ The Mushroom dataset (ARFF file). The link to the UCI repository.
·
MovieLens Datasets by GroupLens Research
·
FourSquare tips with
categories: a collection of FourSquare
tips on restaurants in New York (thanks to Yiannis Kotrotsios).
·
FourSquare tips
with categories: a collection of FourSquare
tips with the category of the corresponding venue for restaurants, nightlife venues,
and shops in New York (thanks to Yiannis Kotrotsios).
·
FourSquare users and venues: a collection of pairs of user ids and venue
names in New York, where the user with the given id has left a tip for the
venue with the given name on Foursquare (thanks to Yiannis Kotrotsios).
·
Twitter data from the paper “What is Twitter, a Social Network or a News
Media?” by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. For the
first assignment, you need the Restricted User Profiles data file. The fields
in the file are explained on the page; you are interested in the eleventh
field, which is the profile description.
·
English Stopwords. Text file with a list of English stopwords.
·
SpamAssassin.
·
Stanford Network Analysis Project Datasets.
·
Movie-Actor Graph. Each line in the file is a tab-separated movie-actor pair, i.e., it
corresponds to one edge in the graph.
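As an illustration of the edge-list format described for the Movie-Actor Graph, the following is a minimal Python sketch that reads tab-separated pairs into two adjacency maps, one per side of the bipartite graph. It assumes the movie comes first on each line, followed by the actor; the function name load_movie_actor_graph and the sample rows are made up for this example and do not come from the actual file.

```python
from collections import defaultdict
import io


def load_movie_actor_graph(lines):
    """Build both sides of the bipartite graph from tab-separated lines.

    Assumption: each line is "<movie>\t<actor>" (movie first).
    """
    movie_to_actors = defaultdict(set)
    actor_to_movies = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only: one line is one edge.
        movie, actor = line.split("\t", 1)
        movie_to_actors[movie].add(actor)
        actor_to_movies[actor].add(movie)
    return movie_to_actors, actor_to_movies


# Hypothetical sample standing in for the real file contents.
sample = io.StringIO(
    "Movie A\tActor X\n"
    "Movie A\tActor Y\n"
    "Movie B\tActor X\n"
)
movie_to_actors, actor_to_movies = load_movie_actor_graph(sample)
print(sorted(actor_to_movies["Actor X"]))
```

To read the real file, pass an open file handle in place of the sample, for example open(path, encoding="utf-8"), where the path and the encoding are assumptions to check against the downloaded data.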