{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TP10 (Students version): recommender systems\n", "\n", "We can use the following libraries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.8.10 (default, Nov 26 2021, 20:14:08) \n", "[GCC 9.3.0]\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import math\n", "import sys\n", "import random\n", "import time\n", "import copy\n", "print(sys.version)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: a basic recommender system\n", "\n", "The purpose of this practical work is to make a basic recommender system, and use it on a Movielens dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: data preparation\n", "\n", "Download the rating data extracted from MovieLens http://lioneltabourier.fr/documents/new_rating_list.txt\n", "\n", "This file is organised as follows:\n", "\n", "
\n",
"user_id   movie_id   rating\n",
"
\n", "\n", "It contains 100836 ratings of 9724 movies by 610 different users. Ratings on MovieLens go from 0.5 to 5.\n", "\n", "The corresponding movie index is available at http://lioneltabourier.fr/documents/movies.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1\n", "\n", "Select **randomly** 1% of the ratings (so 1008 ratings). This will be your test set for all of this lab: these ratings are considered unknown, and we aim to predict them using the learning set, which consists of the remaining 99% of the ratings.\n", "\n", "Create two files, one containing the learning ratings and another containing the test ratings (please attach them to the .ipynb file when sending your work)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: benchmark recommender\n", "\n", "The benchmark recommender that you will create works as follows: for a user $u$ and an item $i$, the predicted score is\n", "\n", "$$r^*(u,i) = \\overline{r} + ( \\overline{r(u)} - \\overline{r}) + ( \\overline{r(i)} - \\overline{r})$$\n", "\n", "$\\overline{r}$ is the average rating over the whole learning dataset.\n", "\n", "$\\overline{r(u)}$ is the average rating of user $u$ over the learning dataset. In case $u$ is not present in the learning set, consider that $\\overline{r(u)} = \\overline{r}$.\n", "\n", "$\\overline{r(i)}$ is the average rating of item $i$ over the learning dataset. In case $i$ is not present in the learning set, consider that $\\overline{r(i)} = \\overline{r}$."
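, "\n", "The formula above can be sketched in Python as follows. This is only a minimal illustration, not an imposed implementation: `user_ratings` and `item_ratings` are assumed to be the two dictionaries of lists built in Question 2 below (ids to lists of ratings from the learning set), and `r_mean` the global average rating.\n", "\n", "```python\n", "def predict_benchmark(u, i, user_ratings, item_ratings, r_mean):\n", "    # Fall back to the global mean when u or i is absent from the learning set\n", "    r_u = sum(user_ratings[u]) / len(user_ratings[u]) if u in user_ratings else r_mean\n", "    r_i = sum(item_ratings[i]) / len(item_ratings[i]) if i in item_ratings else r_mean\n", "    return r_mean + (r_u - r_mean) + (r_i - r_mean)\n", "```"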
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2\n", "\n", "Load the learning data in memory.\n", "\n", "Hint: an adequate format for the rest of Part 1 is to create two dictionaries of lists (warning: a dictionary of sets won't work): \n", "\n", "1) keys = user ids, values = list of ratings \n", "\n", "2) keys = item ids, values = list of ratings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 3\n", "\n", "Create a function which, given a user $u$ and an item $i$, returns the value of $r^*(u,i)$ computed on the learning set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3: evaluation\n", "\n", "Now that we have a prediction process, we evaluate its performance on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 4\n", "\n", "For each rating in the test set, compute the rating predicted by the function defined above and compare it to the actual score. \n", "\n", "If the item has not been rated in the learning set or the user has made no rating in the learning set, skip this rating.\n", "\n", "To present your results, you can print them in the form:\n", "\n", "<pre>
\n",
"user_id item_id real_rating predicted_rating\n",
"
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 5\n", "\n", "Using the previous question, compute the _Root Mean Square Error_, as defined in the course for the whole set of predictions:\n", "\n", "$$RMSE = \\sqrt{\\frac{\\sum _{k=1} ^K (r^*(k) - r(k))^2 }{K}}$$\n", "\n", "Here $K$ is the number of predictions, $r^*(k)$ the predicted rating, $r(k)$ the real rating." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: user-based collaborative filtering (UBCF)\n", "\n", "Using the same learning and testing files as in Part1, we aim at building a collaborative filtering method to improve the results. \n", "\n", "For this purpose, we define a distance between users: $u_1$ and $u_2$ will be close if they rate movies similarly and far away if they rate movies differently.\n", "\n", "When predicting a score $r^*_{CF}(u,i)$, we take into account this distance such that close users have more influence than distant users." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: loading data\n", "\n", "### Question 1\n", "\n", "To make a collaborative filtering recommender system, we need more information than in Part1. 
\n", "\n", "So for Part 2, create two dictionaries of lists from the learning file:\n", "\n", "1) keys = user ids, values = list of pairs (item, rating) \n", "\n", "2) keys = item ids, values = list of pairs (user, rating)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: computing distance\n", "\n", "The distance between users is defined as follows:\n", "\n", "$$d(u_1,u_2) = \\frac{1}{|I_1 \\cap I_2|} \\sum _{i \\in I_1 \\cap I_2} | r(u_1,i) - r(u_2,i)|$$\n", "\n", "where $I_1$ is the set of items rated by $u_1$ and $I_2$ is the set of items rated by $u_2$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2\n", "\n", "The goal of this question is to compute the two elements of the distance between $u_1$ and $u_2$: its numerator $\\sum _{i \\in I_1 \\cap I_2} | r(u_1,i) - r(u_2,i)|$ and its denominator $|I_1 \\cap I_2|$.\n", "\n", "**Warning:** This is the difficult part of the lab work, as you need to write relatively efficient code. \n", "\n", "1) Create a first matrix $m_{num}$ of size $611 \\times 611$ filled with $0$s; it will contain $m_{num}(u_1,u_2) = \\sum _{i \\in I_1 \\cap I_2} | r(u_1,i) - r(u_2,i)|$. \n", "For this purpose, you can use either a list of lists or a 2-dimensional numpy array if you are familiar with it.\n", "\n", "2) Create a second matrix $m_{denom}$ of size $611 \\times 611$ filled with $0$s; it will contain $m_{denom}(u_1,u_2) = |I_1 \\cap I_2|$.\n", "\n", "3) Go through every item $i$; for every pair of users $(u_1,u_2)$ that have rated $i$, update the values in both matrices $m_{num}$ and $m_{denom}$.\n", "\n", "The third step can take some time, but not too long; I advise you to check (with a counter, for instance) that it does not take more than a few minutes." 
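, "\n", "The three steps above can be sketched as follows. This is one possible approach, not an imposed one: `item_ratings` is assumed to be the dictionary from Question 1 mapping each item to its list of (user, rating) pairs, and user ids are assumed to be integers smaller than 611.\n", "\n", "```python\n", "def build_distance_matrices(item_ratings, n_users=611):\n", "    # Steps 1 and 2: two matrices full of 0\n", "    m_num = [[0.0] * n_users for _ in range(n_users)]\n", "    m_denom = [[0] * n_users for _ in range(n_users)]\n", "    # Step 3: for every item, update both matrices for every pair of raters\n", "    for pairs in item_ratings.values():\n", "        for a in range(len(pairs)):\n", "            u1, r1 = pairs[a]\n", "            for b in range(a + 1, len(pairs)):\n", "                u2, r2 = pairs[b]\n", "                diff = abs(r1 - r2)\n", "                m_num[u1][u2] += diff\n", "                m_num[u2][u1] += diff\n", "                m_denom[u1][u2] += 1\n", "                m_denom[u2][u1] += 1\n", "    return m_num, m_denom\n", "```"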
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3: evaluation\n", "\n", "The predicted score is computed as follows:\n", "\n", "$$r^*_{CF}(u,i) = \\overline{r} + ( \\overline{r(u)} - \\overline{r}) + ( \\overline{r_u(i)} - \\overline{r})$$\n", "\n", "You can observe that this score is similar to the benchmark except for the term $\\overline{r_u(i)}$, which is\n", "\n", "$$\\overline{r_u(i)} = \\frac{\\sum _{v \\in U} \\frac{r(v,i)}{d(u,v)}}{\\sum _{v \\in U} \\frac{1}{d(u,v)}}$$\n", "\n", "$U$ is the set of users that have rated $i$.\n", "\n", "It is a weighted average of the ratings of the other users who have rated item $i$; the weight is $\\frac{1}{d(u,v)}$.\n", "\n", "Note that if user $u$ has no common ratings with other users in the network, we simply consider that $\\overline{r_u(i)} = \\overline{r(i)}$, like with the benchmark. Beware also that $d(u,v)$ can be equal to $0$ when $u$ and $v$ agree on all their common items: think about how to handle this case before dividing by $d(u,v)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 3\n", "\n", "For each rating in the test set, compute the rating predicted by the function defined above and compare it to the actual score. If an item has not been rated in the learning set or a user has made no rating in the learning set, do not make any prediction.\n", "\n", "To present your results, you can print them in the form:\n", "\n", "<pre>
\n",
"user_id item_id real_rating predicted_rating\n",
"
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 4\n", "\n", "Using the previous question, compute the _Root Mean Square Error_, as defined in the course for the whole set of predictions:\n", "\n", "$$RMSE = \\sqrt{\\frac{\\sum _{k=1} ^K (r^*_{CF}(k) - r(k))^2 }{K}}$$\n", "\n", "Here $K$ is the number of predictions, $r^*_{CF}(k)$ the predicted rating, $r(k)$ the real rating.\n", "\n", "You should observe only a very slight improvement to the RMSE that you have computed in Part1 with the benchmark. Do you have an idea why it is the case?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }