- Content information
- Active learning method (user rating information)
- movielens 20M
- movielens 1M
- IMDb crawler:
-
Crawl movies tag information from IMDb by IMDb number provided by movielens.
-
Clean and generate different kinds of data used by later steps. (All of the following information contains 1M and 20M movielens data, only use 20M movielens as example. Detailed data information can be found here data overview.)
1) Movies year information:
dict { movieid(str) : year(int) };
2) Movies title information:
dict { movieid(str) : movie title(str) };
3) Movies genre information:
In movielens 20M, 70 movies miss genre(can't found on IMDb either). In movielens 1M, 0 movies miss genre.dict { movieid(str) : genre(lst) };
4) Movies tag information:
In movielens 20M, 594 movies miss tags(can't found on IMDb either). In movielens 1M, 12 movies miss tags.dict { movieid(str) : tags(str) };
-
- Step 3 - Item Similarity Information:
- Year information:
{ movieid : year(int) }
- Genre information:
{ movieid : array() }
- Title information:
sparse matrix: 3883x3940 (for 1M movielens) sparse matrix: 45843x25632 (for 20M movielens)
- Tag information:
sparse matrix: 3883x16935 (for 1M movielens) sparse matrix: 45843x42721 (for 20M movielens)
- Step 4 - Contruct Content-based Decision Tree
- Matlab R2013a
- Python 3.5/3.6
- Jupytor Notebook