CSE012/CS059 – Data Mining

Winter 2023

Material

Books and Slides

· Material from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.

· Mining of Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book. Includes slides from the course.

· All of Statistics by Larry A. Wasserman

· Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schütze. Free online book.

· Networks, Crowds, and Markets by D. Easley, J. Kleinberg. Free online book.

· Social Media Mining by R. Zafarani, M. Ali Abbasi, H. Liu. Free online book.

· Material from the book “Data Mining: Concepts and Techniques”, by Jiawei Han and Micheline Kamber.

· The Data Science Design Manual by Steven Skiena.

Springer Online Books

Recently, Springer announced a list of free online books on Machine Learning and Data Mining.

Some of the most interesting and relevant books:

· The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, Jerome Friedman

· Data Mining by Charu C. Aggarwal

· The Data Science Design Manual by Steven S. Skiena

· The Python Workbook by Ben Stephenson

Python

· Notes from the course Computational Tools for Data Science in BU

· Cookbooks: include examples of the use of IronPython, with code and data.

o IronPython Cookbook

o IPython Cookbook

Useful Unix Commands

You may find the following Unix commands useful when pre-processing data:

· cut: allows you to get specific columns from delimited data

· sort: sorts the rows of a file in lexicographic order; use -n for numeric order

· uniq: merges consecutive rows of a file that are identical.

· grep: finds the lines of a file that contain a given string or pattern

Run “man <command>” in a Unix/Linux shell to get more information.
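To make the behavior of these commands concrete, here is a minimal Python sketch that mimics each of them on a few in-memory rows. The sample rows and the helper names (cut, uniq, grep) are made up for illustration, and the helpers only approximate the real tools (for example, the real sort depends on the locale). Note that uniq only merges identical rows that are next to each other, which is why it is usually run after sort.

```python
from itertools import groupby

# Hypothetical tab-delimited rows: user id, count, venue type.
rows = [
    "u3\t10\tcafe",
    "u1\t9\tbar",
    "u1\t9\tbar",
    "u2\t2\tcafe",
    "u1\t9\tbar",
]

def cut(lines, field, delimiter="\t"):
    """Roughly `cut -f<field>`: keep one column (1-indexed) of delimited data."""
    return [line.split(delimiter)[field - 1] for line in lines]

def uniq(lines):
    """Roughly `uniq`: merge runs of identical consecutive rows."""
    return [key for key, _ in groupby(lines)]

def grep(lines, pattern):
    """Roughly `grep -F`: keep the rows that contain a fixed string."""
    return [line for line in lines if pattern in line]

counts = cut(rows, 2)               # second column, as strings
lexicographic = sorted(counts)      # like `sort`: "10" comes before "2"
numeric = sorted(counts, key=int)   # like `sort -n`: 2 comes before 10
merged_only = uniq(rows)            # the last "u1" row survives, it is not adjacent
distinct = uniq(sorted(rows))       # like `sort | uniq`: truly distinct rows
cafes = grep(rows, "cafe")          # rows mentioning "cafe"
```

On the command line the same steps are typically chained with pipes, so that the output of one command becomes the input of the next.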

Software

· WEKA Data Mining Software: A software package that implements multiple data mining tools.

· FIMI (Frequent Itemset Mining Implementations): A repository of implementations for frequent itemset mining. All implementations assume the input format of the example datasets: a text file in which each row is a basket of space-separated integers that represent the items.

· Liblinear: Software package for classification. Implements the Logistic Regression and SVM classifiers.
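The basket format used by the FIMI datasets (one basket per row, items as space-separated integers) is easy to read directly. The following is a minimal sketch, not part of any FIMI implementation; the function names read_baskets and item_support and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def read_baskets(lines):
    """Parse rows of space-separated integers into a list of baskets (sets of item ids)."""
    baskets = []
    for line in lines:
        items = line.split()
        if items:  # skip empty rows
            baskets.append({int(item) for item in items})
    return baskets

def item_support(baskets):
    """Count in how many baskets each single item appears."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts

# Hypothetical rows in the basket format; a real file can be passed as open(path).
sample = ["1 2 3", "2 3", "1 3 4", ""]
baskets = read_baskets(sample)
support = item_support(baskets)
```

Since a file object iterates over its lines, read_baskets can be applied to an opened dataset file in the same way as to the sample list.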

Datasets

· The Yelp Academic Challenge dataset

· UCI Machine Learning Repository

o The Iris dataset (ARFF file). The link to UCI repository.

o The SpamBase dataset (ARFF file). The link to UCI repository.

o The Mushroom dataset (ARFF file). The link to UCI repository.

· MovieLens datasets by GroupLens Research

· FourSquare tips with categories: a collection of FourSquare tips on restaurants in New York (thanks to Yiannis Kotrotsios).

· FourSquare tips with categories: a collection of FourSquare tips with the category of the corresponding venue for restaurants, nightlife venues, and shops in New York (thanks to Yiannis Kotrotsios).

· FourSquare users and venues: a collection of pairs of user ids and venue names in New York, where the user with the given id has left a tip for the venue with the given name on FourSquare (thanks to Yiannis Kotrotsios).

· Twitter data from the paper “What is Twitter, a Social Network or a News Media?” by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. For the first assignment, you need the Restricted User Profiles data file. The fields in the file are explained on the page; the one you need is the eleventh field, which is the profile description.
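Extracting the eleventh field of each row can be sketched in a few lines of Python. This sketch assumes that the fields are tab-separated; check the dataset page for the actual delimiter. The function name profile_descriptions and the sample row are hypothetical.

```python
def profile_descriptions(lines, delimiter="\t", field=11):
    """Yield the given field (1-indexed) of every row that has enough columns."""
    for line in lines:
        columns = line.rstrip("\n").split(delimiter)
        if len(columns) >= field:
            yield columns[field - 1]

# Hypothetical rows: one with twelve placeholder fields, one that is too short.
sample = [
    "\t".join(f"f{i}" for i in range(1, 13)) + "\n",
    "too\tshort\n",
]
descriptions = list(profile_descriptions(sample))
```

Rows with fewer than eleven fields are skipped rather than raising an error, which is one possible way to handle malformed lines.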