CSE012/CS059 Data Mining
Winter 2024
Material
Books and Slides
·
Material from the book Introduction to Data Mining by Tan, Steinbach, Kumar.
·
Mining Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec.
Free online book. Includes slides from the course.
·
All of Statistics by Larry A. Wasserman.
·
Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze.
Free online book.
·
Networks, Crowds, and Markets by D. Easley, J. Kleinberg. Free online book.
·
Social Media Mining by R. Zafarani, M. Ali Abbasi, H. Liu. Free online book.
·
Material from the book Data Mining: Concepts and Techniques, by Jiawei Han and
Micheline Kamber.
·
The Data Science Design Manual by Steven Skiena.
Springer Online Books
Recently, Springer announced a list of free online books on Machine Learning
and Data Mining. Some of the most interesting and relevant books:
·
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani,
Jerome Friedman
·
Data Mining by Charu C. Aggarwal
·
The Data Science Design Manual by Steven S. Skiena
·
The Python Workbook by Ben Stephenson
Python
·
Notes from the course Computational Tools for Data Science at BU
·
Cookbooks: Includes examples of the use of Iron Python, code, and data.
Useful Unix Commands
You may find the following Unix commands useful when pre-processing data:
·
cut: extracts specific columns from delimited data
·
sort: sorts the rows of a file in lexicographic order (-n for numeric order)
·
uniq: merges consecutive rows of a file that are identical
·
grep: finds a string within a file
Run man <command> in a Unix/Linux shell to get more information.
Software
·
WEKA Data Mining Software: A software package that implements multiple data
mining tools.
·
FIMI: Frequent Itemsets Mining Implementation: A repository of implementations
for frequent itemset mining. All implementations assume the input format of the
example datasets: text file where each row is a basket consisting of
space-separated integers that represent the items.
·
Liblinear: Software package for classification. Implements the Logistic
Regression and SVM classifiers.
Datasets
·
The Yelp Academic Challenge dataset
·
UCI Machine Learning Repository
o The Iris dataset (ARFF file). The link to the UCI repository.
o The SpamBase dataset (ARFF file). The link to the UCI repository.
o The Mushroom dataset (ARFF file). The link to the UCI repository.
·
MovieLens Datasets by GroupLens Research
·
FourSquare tips with categories: a collection of
FourSquare tips on restaurants in New York (thanks to Yiannis Kotrotsios).
·
FourSquare tips with categories: a collection
of FourSquare tips with the category of the corresponding venue for
restaurants, nightlife venues, and shops in New York (thanks to Yiannis
Kotrotsios).
·
FourSquare users and venues: a collection
of pairs of user ids and venue names in New York, where the user with the
specific id has left a tip to the venue with the specific name on Foursquare
(thanks to Yiannis Kotrotsios).
·
Twitter data from the paper What is Twitter, a Social Network, or a News Media?
by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. For the first
assignment, you need the Restricted User Profiles data file. The fields in the
file are explained on the page; you are interested in the eleventh field, which
is the profile description.
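Pulling out the eleventh field can be sketched in a few lines of Python. This is only an illustration: the field delimiter is not stated here, so the sketch assumes tab-separated records (check the dataset page), and the function name and sample record are hypothetical. Under the same assumption, it does roughly what cut -f11 does on the command line.

```python
PROFILE_FIELD = 10  # eleventh field, counted from zero


def profile_descriptions(lines, delimiter="\t"):
    """Yield the eleventh field of every record that has at least eleven fields.

    The tab delimiter is an assumption, not something the dataset page
    is quoted as saying here; pass a different one if needed.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > PROFILE_FIELD:
            yield fields[PROFILE_FIELD]


# Hypothetical record with eleven fields named f1..f11.
sample = ["\t".join(f"f{i}" for i in range(1, 12)) + "\n"]
print(list(profile_descriptions(sample)))  # ['f11']
```

With the real file, an open file handle can be passed in place of the sample list. Records with fewer than eleven fields are skipped rather than raising an error.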
·
English Stopwords.
Text file with a list of English stopwords.
·
SpamAssassin.
·
Stanford Network Analysis Project Datasets.
·
Movie-Actor Graph. Each line in the file is a tab-separated movie-actor pair, i.e., it
corresponds to one edge in the graph.
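As a minimal sketch of reading this edge list, the Python below builds both directions of the bipartite graph. It assumes the movie comes first and the actor second on each line, as the phrase "movie-actor pair" suggests; the function name and the sample lines are hypothetical.

```python
from collections import defaultdict


def load_movie_actor_graph(lines):
    """Build adjacency sets from tab-separated movie-actor lines.

    Assumes each line is "<movie>\t<actor>"; lines without a tab are skipped.
    """
    actors_of = defaultdict(set)  # movie -> set of actors
    movies_of = defaultdict(set)  # actor -> set of movies
    for line in lines:
        movie, sep, actor = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # blank or malformed line
        actors_of[movie].add(actor)
        movies_of[actor].add(movie)
    return actors_of, movies_of


# Hypothetical sample edges, not taken from the actual file.
sample = ["Movie A\tActor X\n", "Movie A\tActor Y\n", "Movie B\tActor X\n"]
actors_of, movies_of = load_movie_actor_graph(sample)
print(sorted(movies_of["Actor X"]))  # ['Movie A', 'Movie B']
```

Keeping both dictionaries makes it cheap to go from a movie to its cast and from an actor to their movies, at the cost of storing each edge twice.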