CSE012/CS059 Data Mining
Winter 2025
Material
Books and Slides
· Material from the book Introduction to Data Mining by Tan, Steinbach, Kumar.
· Mining Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book. Includes slides from the course.
· All of Statistics by Larry A. Wasserman
· Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schütze. Free online book.
· Networks, Crowds, and Markets by D. Easley, J. Kleinberg. Free online book.
· Social Media Mining by R. Zafarani, M. Ali Abbasi, H. Liu. Free online book.
· Material from the book Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber.
· The Data Science Design Manual by Steven Skiena.
Springer Online Books
Recently, Springer announced a list of free online books on Machine Learning and Data Mining. Some of the most interesting and relevant books:
· The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, Jerome Friedman
· Data Mining by Charu C. Aggarwal
· The Data Science Design Manual by Steven S. Skiena
· The Python Workbook by Ben Stephenson
Python
· Notes from the course Computational Tools for Data Science at BU
Useful Unix Commands
You may find the following Unix commands useful when pre-processing data:
· cut: extracts specific columns from delimited data
· sort: sorts the rows of a file in lexicographic order; use -n for numeric order
· uniq: merges consecutive identical rows of a file into one
· grep: finds a string within a file
Run man <command> in a Unix/Linux shell to get more information.
Software
· WEKA Data Mining Software: A software package that implements multiple data mining tools.
· FIMI: Frequent Itemsets Mining Implementation: A repository of implementations for frequent itemset mining. All implementations assume the input format of the example datasets: a text file where each row is a basket consisting of space-separated integers that represent the items.
· Liblinear: Software package for classification. Implements the Logistic Regression and SVM classifiers.
Datasets
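The basket format assumed by the FIMI implementations (one basket per row, items as space-separated integers) can be read with a few lines of Python. The following is a minimal sketch rather than part of any FIMI tool: the function names read_baskets and item_support and the three-row sample input are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def read_baskets(lines: Iterable[str]) -> list[list[int]]:
    """Parse one basket per row; items are space-separated integers."""
    baskets = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank rows
            baskets.append([int(f) for f in fields])
    return baskets


def item_support(baskets: list[list[int]]) -> Counter:
    """Count, for each item, the number of baskets that contain it."""
    support = Counter()
    for basket in baskets:
        support.update(set(basket))  # count an item at most once per basket
    return support


# Hypothetical three-basket input, made up for illustration.
sample = ["1 2 3", "2 3", "2 4"]
baskets = read_baskets(sample)
support = item_support(baskets)
print(support.most_common(2))  # [(2, 3), (3, 2)]
```

Because read_baskets accepts any iterable of lines, an open file object can be passed in place of the sample list.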
· The Yelp Academic Challenge dataset
· UCI Machine Learning Repository
  o The Iris dataset (ARFF file). The link to UCI repository.
  o The SpamBase dataset (ARFF file). The link to UCI repository.
  o The Mushroom dataset (ARFF file). The link to UCI repository.
· Movie Lens Datasets by GroupLens Research
· FourSquare tips with categories: a collection of FourSquare tips on restaurants in New York (thanks to Yiannis Kotrotsios).
· FourSquare tips with categories: a collection of FourSquare tips with the category of the corresponding venue for restaurants, nightlife venues, and shops in New York (thanks to Yiannis Kotrotsios).
· FourSquare users and venues: a collection of pairs of user ids and venue names in New York, where the user with the specific id has left a tip for the venue with the specific name on Foursquare (thanks to Yiannis Kotrotsios).
· Twitter data from the paper "What is Twitter, a Social Network or a News Media?" by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
· English Stopwords. Text file with a list of English stopwords.
· SpamAssassin.
· Stanford Network Analysis Project Datasets.
· Movie-Actor Graph. Each line in the file is a tab-separated movie-actor pair, i.e., it corresponds to one edge in the graph.
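An edge list like the Movie-Actor Graph file can be loaded into adjacency sets in Python. The sketch below assumes that the movie is the first column and the actor the second, which the phrase "movie-actor pair" suggests but the description does not state outright. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def read_movie_actor_edges(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Parse tab-separated rows into (movie, actor) edges."""
    edges = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank rows
        # Assumption: column 1 is the movie, column 2 is the actor.
        movie, actor = line.split("\t", 1)
        edges.append((movie, actor))
    return edges


def build_adjacency(edges: list[tuple[str, str]]):
    """Build both sides of the bipartite graph as dictionaries of sets."""
    movie_to_actors = defaultdict(set)
    actor_to_movies = defaultdict(set)
    for movie, actor in edges:
        movie_to_actors[movie].add(actor)
        actor_to_movies[actor].add(movie)
    return movie_to_actors, actor_to_movies


# Hypothetical rows, made up for illustration.
sample = ["Movie A\tActor X\n", "Movie A\tActor Y\n", "Movie B\tActor X\n"]
edges = read_movie_actor_edges(sample)
movie_to_actors, actor_to_movies = build_adjacency(edges)
print(sorted(actor_to_movies["Actor X"]))  # ['Movie A', 'Movie B']
```

Keeping both directions of the mapping makes it cheap to ask either "who appears in this movie" or "which movies share this actor".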