Online Social Networks and Media -- Homework

network

Online Social Networks and Media
Homework

Home

Homework

Slides & References

Reading Material

Resources

Project Report Guidelines

You can find some guidelines for the project report here. Make sure that you start the report early!

Paper Presentation Guidelines

The presentations will be evaluated based on the quality of the presentation, and the comprehension of the material covered. The following are some guideline, tips and advice for preparing your presentation.

· You have 20 minutes for the presentation (1 student group) and 25 minutes (2 students group). We will enforce the time limit and cut you off if you have not completed on time. 10 more minutes will be allocated for questions. We may randomly pick someone from the audience to ask a question, so everyone should pay attention.

· You should prepare around 20-25 slides, given that a slide takes around a minute to talk about on average.

· Break you presentation into thematic units. The following flow is very common:

Motivate why the problem is important and give a high level idea;
Define clearly the problem;
Present the main idea and the fundamental algorithms;
Present the results (experimental or theoretical or both);
Conclusions.

· The talk should be self-contained. Do not assume that the audience has read the paper, or some previous work that you consider known. Define all the concepts you need and all the notation that you use. Refer only to related work that you know.

· Since the time for the talk is short, you will need to focus on the important parts of the paper and avoid going through all the details. The goal is to give a summary of the paper and have a clear message. Just because you read all the paper it does not mean that you should present everything. At the same time, you should not skip important information. Focusing on the right part to present is important since it shows that you understood the paper well.

· Prepare the slides carefully. Do not add too much text, and only the math symbols necessary. Do not use full sentences, but rather keywords and short phrases. Make sure the slides are readable and not too loaded. Never ever project parts of the paper pdf.

· Practice! Good talks are the result of a lot of practice even if they seem spontaneous and fun to the audience. Practice the talk several times, and time yourself to make sure you are within the time bounds.

Some fun advice on how to give a bad talk (and more) here.

Project Assignment

Project assignment:

· Topic 1: Structural diversity based on network embeddings. Στέλλα Μπουρλή, Παναγιώτης Κουζουγλίδης

· Topic 2: Content homophily in a real social network. Σωτήριος Ζώγος, Δήμητρα Τριανταλή

· Topic 3: Fairness in a real social network. Σέχαϊ Φατιόν, Γεώργιος-Θεόδωρος Καλαμπόκης

· Topic 7: Polarization and extremism on Reddit. Χρύσα Τερίζη

· Topic 6: Link Recommendations for reducing polarization. Γιάννης Μπάκος, Λένα

Projects

The list of projects is available here. The assignment is First-Come-First-Serve.

The timeline for the projects is as follows:

Week before Christmas: Submit a project proposal outlining what you plan to do. This should include the topic of your presentation. Present the proposed project in class. Set up a web page for the project.
January 15: Presentations.
End of January: Submit full project.

Assignment 2

Due December 4, 2019 in class

The goal of this assignment is for you to experiment with clustering algorithms.

You can work in teams of up to 2 members.

Question 1

For this question, you will use the DBLP10 dataset that includes publications from computer science conferences between 2006 and 2015. Nodes correspond to authors. There is an edge between two authors if they have written an article together. The following information is available:

Co-authorship: Data in the form (id1, id2) meaning that author with id1 co-authored an article with author with id2.

Authors: Data in the form (id, n) indicating that the author (node) with identifier id has name n.

Label: Data in the form (id, c) indicating that author with identifier id wrote a paper at conference c. Hence, the label of each author (node of your graph) is a set of conferences.

(a) Find communities in these graphs using a modularity-based algorithm. Report the number of clusters, the size and modularity of each cluster. If necessary, experiment with different number of clusters to improve the quality of the clusters.

(b) Use the labels of the users to evaluate the homogeneity of the clusters. For each pair of clusters C_i and C_j, compute the average similarity between the labels of a_i and a_j, where a_i is an author in C_i and a_j is an author in C_j. Use the Jaccard index to measure similarity (https://en.wikipedia.org/wiki/Jaccard_index). Report your findings using m × m matrix where m is the number of clusters.

Question 2

For this question, you will propose an algorithm of your own for clustering a graph that uses the PageRank algorithm. Explain the rationale of your algorithm, and provide a high-level pseudocode.

For the evaluation, you will use the email-EU-core graph from the SNAP dataset repository, also used in Assignment 1. This is the email correspondence graph for a European research institute. The nodes are individuals that work in the organization. There are 42 departments in the organization and each node belongs to a single department. Use the department assignment as ground truth and provide an evaluation of your algorithm.

Question 3

Consider a graph that is a binary tree of depth . Compute the expected spread for the Independent Cascade model if the root of the tree is the seed node, and each edge in the tree has transmission probability p. Give a closed formula for the expectation.

Hint: Consider separately the cases and .

Assignment 1

Due Nov 13, 2019 in class

The goal of this assignment is for you to familiarize yourself with network measurements and network generation models, and experiment with the PageRank algorithm

You can work in teams of up to 2 members.

You can either write your own code or use implementations provided by SNAP, NetworkX, or other sources. Specify this in your report.

Question 1

Consider the following graphs:

(1) The Wiki-Vote graph from the SNAP dataset repository.

(2) An (undirected) Erdos-Renyi random graph.

(3) An (undirected) graph generated using preferential attachment.

(4) A graph generated using the forest fire model.

The number of nodes of the generated graphs and (when possible) the (expected) number of edges of each of the synthetically generated graphs should be the same to one of the Wiki-Vote graph.

For these graphs:

a. Plot the degree distributions for each graph. Produce 5 plots (simple distribution, bins of exponential size, cumulative, zipf). All plots should be in log-log scale.

b. Report the effective diameter for all graphs.

c. Report the clustering co-efficient for all graphs.

In addition, for the Wiki-Vote and the Forest Fire graphs compute the PageRank values and produce again 5 plots as before for the distribution of the PageRank values for each of the two graphs.

Submit a short report where you describe the experimental set-up (including the values of the tuning parameters for each of the models), and your results. Specifically, for each of the measurements, for each of the graphs, comment on the results that you obtain, similar to the analysis presented in class.

Question 2

Let be the pagerank vector, when the jump vector is the vector , and let be the personalized pagerank vector for node (i.e., when the jump vector has all probability on the node ). Prove that . (Hint: It will be helpful to consider the expression for pagerank vector in slide 48).

Question 3 [optional]

For this question you will use the email-EU-core graph from the SNAP dataset repository. This is the email correspondence graph for a European research institute. The nodes are individuals that work in the organization. There are 42 departments in the organization and each node belongs to a single department.

The goal of this assignment is to rank the departments using the PageRak algorithm.

First, design a variation of the PageRank algorithm, or modify the network appropriately so as to produce an appropriate ranking of the departments.

Then, compare the ranking of the departments produced by your method with the ranking produced by using the following baseline that first computes the PageRank value for each node in the graph and then the average PageRank value of each department (as the average PageRank of the nodes (individuals) belonging to the corresponding department).