CSE012/CS059 – Data Mining

Winter 2025

 

Material

Books and Slides

·        Material from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.

·        Mining of Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book. Includes slides from the course.

·        All of Statistics by Larry A. Wasserman

·        Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schütze. Free online book.

·        Networks, Crowds, and Markets by D. Easley, J. Kleinberg. Free online book.

·        Social Media Mining by R. Zafarani, M. Ali Abbasi, H. Liu. Free online book.

·        Material from the book “Data Mining: Concepts and Techniques”, by Jiawei Han and Micheline Kamber.

·        The Data Science Design Manual by Steven Skiena.

 

Springer Online Books

Recently, Springer announced a list of free online books on Machine Learning and Data Mining.

Some of the most interesting and relevant books:

·        The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, Jerome Friedman

·        Data Mining by Charu C. Aggarwal

·        The Data Science Design Manual by Steven S. Skiena

·        The Python Workbook by Ben Stephenson

 

Python

·        Notes from the course Computational Tools for Data Science at BU

 

Useful Unix Commands

You may find the following Unix commands useful when pre-processing data:

·        cut: allows you to get specific columns from delimited data

·        sort: sorts the rows of a file in lexicographic order; use -n for numeric order

·        uniq: merges consecutive rows of a file that are identical.

·        grep: prints the rows of a file that contain a given string or match a pattern

Run “man <command>” in a Unix/Linux shell to get more information.
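
To make the behaviour of these commands concrete, here is a minimal Python sketch that mimics cut, sort, uniq and grep on a few in-memory rows. The sample data and the helper names (cut, uniq_count, grep) are made up for illustration and are not part of the course material; the real commands have many more options.

```python
from itertools import groupby

# Hypothetical tab-delimited sample rows: user id, city, number of visits.
lines = [
    "u1\tAthens\t12",
    "u2\tIoannina\t3",
    "u3\tAthens\t100",
    "u4\tPatras\t7",
    "u5\tAthens\t25",
]


def cut(rows, field, delimiter="\t"):
    """Roughly `cut -f<field>`: keep one column (1-based) of each row."""
    return [row.split(delimiter)[field - 1] for row in rows]


def uniq_count(rows):
    """Roughly `uniq -c`: collapse runs of identical consecutive rows."""
    return [(len(list(group)), value) for value, group in groupby(rows)]


def grep(rows, text):
    """Roughly `grep -F <text>`: keep the rows that contain the string."""
    return [row for row in rows if text in row]


cities = cut(lines, 2)

# uniq only merges consecutive duplicates, so the input is usually sorted first.
unsorted_counts = uniq_count(cities)
sorted_counts = uniq_count(sorted(cities))  # like: cut -f2 | sort | uniq -c

visits = cut(lines, 3)
lexicographic = sorted(visits)       # like `sort`: "100" comes before "12"
numeric = sorted(visits, key=int)    # like `sort -n`

athens_rows = grep(lines, "Athens")

print(sorted_counts)
print(lexicographic, numeric)
```

The two uniq_count calls show why sort usually comes before uniq in a pipeline: without sorting, the three Athens rows are not adjacent and are not merged.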

 

Software

·        WEKA Data Mining Software: A software package that implements multiple data mining tools.

·        FIMI: Frequent Itemsets Mining Implementation: A repository of implementations for frequent itemset mining. All implementations assume the input format of the example datasets: a text file where each row is a basket consisting of space-separated integers that represent the items.

·        Liblinear: Software package for classification. Implements the Logistic Regression and SVM classifiers.
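
The basket format described for FIMI above can be read with a few lines of Python. The following is only a sketch under stated assumptions: the function names (read_baskets, item_support), the sample baskets and the support threshold of 2 are hypothetical, and each basket is treated as a set, so an item repeated within a row counts once.

```python
from collections import Counter


def read_baskets(lines):
    """Parse basket rows: one basket per row, items are space-separated integers."""
    baskets = []
    for line in lines:
        line = line.strip()
        if line:  # skip empty rows
            # Assumption: a basket is a set, so repeated items collapse.
            baskets.append({int(token) for token in line.split()})
    return baskets


def item_support(baskets):
    """Count in how many baskets each single item appears."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts


# Hypothetical sample in the basket format (not one of the FIMI datasets).
sample = ["1 2 3", "2 3", "1 3 4", "3"]

baskets = read_baskets(sample)
support = item_support(baskets)

# Items that appear in at least 2 baskets (illustrative threshold).
frequent = sorted(item for item, count in support.items() if count >= 2)
print(frequent)
```

Because read_baskets accepts any iterable of rows, an open file object can be passed in place of the sample list. Counting single items is only the first pass of a frequent itemset algorithm; the implementations in the repository go well beyond it.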

 

Datasets

·        The Yelp Academic Challenge dataset

·        UCI Machine Learning Repository

o   The Iris dataset (ARFF file). The link to UCI repository.

o   The SpamBase dataset (ARFF file). The link to UCI repository.

o   The Mushroom dataset (ARFF file). The link to UCI repository.

·        MovieLens Datasets by GroupLens Research

·        FourSquare tips with categories: a collection of FourSquare tips on restaurants in New York (thanks to Yiannis Kotrotsios).

·        FourSquare tips with categories: a collection of FourSquare tips with the category of the corresponding venue for restaurants, nightlife venues, and shops in New York (thanks to Yiannis Kotrotsios).

·        FourSquare users and venues: a collection of pairs of user ids and venue names in New York, where the user with the specific id has left a tip for the venue with the specific name on Foursquare (thanks to Yiannis Kotrotsios).

·        Twitter data from the paper “What is Twitter, a Social Network or a News Media?” by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.

·        English Stopwords. Text file with a list of English stopwords.

·        SpamAssassin.

·        Stanford Network Analysis Project Datasets.

·        Movie-Actor Graph. Each line in the file is a tab-separated movie-actor pair, i.e., it corresponds to one edge in the graph.
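
As an illustration of the edge-list format of the Movie-Actor Graph, the sketch below builds the two adjacency maps of the bipartite graph from tab-separated rows. It assumes the movie comes first and the actor second in each row; the function name read_movie_actor_edges and the sample rows are hypothetical and do not come from the actual file.

```python
from collections import defaultdict


def read_movie_actor_edges(lines):
    """Build adjacency maps from tab-separated (movie, actor) rows, one edge per row."""
    movie_to_actors = defaultdict(set)
    actor_to_movies = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty rows
        # Assumption: column order is movie, then actor.
        movie, actor = line.split("\t", 1)
        movie_to_actors[movie].add(actor)
        actor_to_movies[actor].add(movie)
    return movie_to_actors, actor_to_movies


# Made-up sample rows in the same tab-separated layout.
sample = [
    "Movie A\tActor X\n",
    "Movie A\tActor Y\n",
    "Movie B\tActor Y\n",
]

movie_to_actors, actor_to_movies = read_movie_actor_edges(sample)

# Co-stars of an actor: everyone who shares at least one movie with them.
co_stars = set()
for movie in actor_to_movies["Actor Y"]:
    co_stars |= movie_to_actors[movie]
co_stars.discard("Actor Y")
print(co_stars)
```

Keeping both directions of the mapping makes it cheap to walk from an actor to their movies and back, which is the basic step for projecting the bipartite graph onto actors.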