{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TP9 (Students version): Pagerank\n",
"\n",
"We can use the following libraries."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.8.5 (default, Jul 28 2020, 12:59:40) \n",
"[GCC 9.3.0]\n"
]
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import math\n",
"import sys\n",
"import random\n",
"import time\n",
"import copy\n",
"print(sys.version)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1: preliminary questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this TP, we will use the graph http://lioneltabourier.fr/documents/wiki_2009.txt. It is a subpart of the English language Wikipedia collected in 2009. A link represent a hyperlink from a page to another.\n",
"\n",
"**Warning:** it is a directed graph, so in this case a line\n",
"\n",
"12 126\n",
"\n",
"means that there is a directed link from node 12 to node 126, but not necessarily in the other direction!\n",
"\n",
"For your information, we indicate that this network has\n",
"\n",
"- 50988 nodes\n",
"\n",
"- 1778125 directed links"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1\n",
"\n",
"Load the graph in memory, in the adjacency list format, **for both the list of predecessors and the list of successors**. \n",
"\n",
"Check its number of nodes and of directed links. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We remind you that the transition matrix $T$ is defined like this: if there is a link from $u$ to $v$ then $T[u][v] = \\frac{1}{d_{out}(u)}$ where $d_{out}(u)$ is the out-degree of $u$ and otherwise $T[u][v] = 0$.\n",
"\n",
"Note that it is not possible to store $T$ in memory as a $ n \\times n $ matrix, it would take too much memory.\n",
"\n",
"So instead of creating $T$, we proceed in this way:\n",
"\n",
"- we use the adjacency lists of the graph (lists of predecessors, lists of successors), \n",
"\n",
"- we store the outdegree of the nodes in a dedicated dictionary,\n",
"\n",
"- with these information, we have all the information contained in $T$.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2\n",
"\n",
"Create a dictionary that contains the out-degree of each node, you should give the out-degree of all the nodes if the network, even if it is $0$. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3\n",
"\n",
"Using the previous questions, implement a basic power iteration. \n",
"\n",
"The principle is to iterate $t$ times the matrix product:\n",
"\n",
"$$ X \\leftarrow T.X $$\n",
"\n",
"$X$ is a vector initialized to $ [\\frac{1}{n} \\ldots \\frac{1}{n}]$ and $T$ is the transition matrix. We strongly advise you to store $X$ as a dictionary of float.\n",
"\n",
"Run the power iteration for $ t=10 $ steps and measure at each step the value of $ ||X||_1 = \\sum _i |X[i]| $.\n",
"What do you observe? Can you explain in one sentence?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Pagerank with teleportation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 4\n",
"\n",
"Now, we improve the basic power iteration process to make a real Pagerank program, by adding a renormalization and a teleportation process.\n",
"\n",
"So now each iteration is described by the equation:\n",
"\n",
"$$ X = (1-s).T.X + s.I $$\n",
"\n",
"where $I$ is the vector $ [\\frac{1}{n} \\ldots \\frac{1}{n}]$.\n",
"\n",
"Implement the Pagerank alogorithm (this is Algorithm 3 in the course). Don't forget to renormalize $X$ after each step, that is to say do: $ X[i] \\leftarrow \\frac{X[i]}{||X||_1}$.\n",
"\n",
"Run the Pagerank for $t=10$ steps, $s=0.2$.\n",
"\n",
"Observe the nodes which have the top-5 pagerank values, then go and look in this document http://lioneltabourier.fr/documents/index_wiki_2009.txt to see to what Wikipedia pages they correspond. Do you think it is surprising?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}