clamli / Content-based-decision-tree

Content-based + Active learning, paper implementation

Content-based decision tree

Basic ideas

  • Content information
  • Active learning method (user rating information)

Dataset

  • MovieLens 20M
  • MovieLens 1M

Basic steps

  • IMDb crawler:
    1. Crawl movie tag information from IMDb using the IMDb IDs provided by MovieLens.

    2. Clean the crawled data and generate the data used by later steps. (All of the following data exists for both MovieLens 1M and 20M; only MovieLens 20M is used as the example here. Detailed data information can be found in the data overview.)

      1) Movie year information:

      dict { 
       	movieid(str) : year(int) 
      };
      

      2) Movie title information:

      dict { 
       	movieid(str) : movie title(str) 
      };
      

      3) Movie genre information:
      In MovieLens 20M, 70 movies have no genre (it cannot be found on IMDb either). In MovieLens 1M, no movies are missing a genre.

      dict { 
       	movieid(str) : genre(lst) 
      };
      

      4) Movie tag information:
      In MovieLens 20M, 594 movies have no tags (they cannot be found on IMDb either). In MovieLens 1M, 12 movies have no tags.

      dict { 
       	movieid(str) : tags(str) 
      };
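
The dictionaries above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the MovieLens 20M `movies.csv` and `links.csv` layouts (title with a trailing `(year)`, pipe-separated genres, `imdbId` stored without the `tt` prefix), and the names `build_movie_dicts` and `imdb_title_id` are hypothetical. The sample rows are small in-memory stand-ins for the real files.

```python
import csv
import io
import re

# Tiny in-memory stand-ins for MovieLens 20M movies.csv and links.csv (assumed layout).
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Some Untitled Film,(no genres listed)
"""
LINKS_CSV = """movieId,imdbId,tmdbId
1,0114709,862
2,0113497,8844
"""

# Matches a trailing four-digit year in parentheses, e.g. " (1995)".
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def imdb_title_id(imdb_number):
    """Turn a MovieLens imdbId into an IMDb title ID such as 'tt0114709'."""
    return "tt" + str(imdb_number).zfill(7)


def build_movie_dicts(movies_file):
    """Return (years, titles, genres) dicts keyed by movie ID strings."""
    years, titles, genres = {}, {}, {}
    for row in csv.DictReader(movies_file):
        movie_id = row["movieId"]
        match = YEAR_SUFFIX.search(row["title"])
        if match:
            years[movie_id] = int(match.group(1))
            titles[movie_id] = row["title"][:match.start()]
        else:
            # No year in the title: keep the title, leave the year missing.
            titles[movie_id] = row["title"].strip()
        # Movies without genres stay absent from the genre dict.
        if row["genres"] != "(no genres listed)":
            genres[movie_id] = row["genres"].split("|")
    return years, titles, genres


years, titles, genres = build_movie_dicts(io.StringIO(MOVIES_CSV))
imdb_ids = {
    row["movieId"]: imdb_title_id(row["imdbId"])
    for row in csv.DictReader(io.StringIO(LINKS_CSV))
}
```

Movies with missing information are simply left out of the corresponding dict here, which is one possible way to represent the missing genres and tags mentioned above.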
      
  • Step 3 - Item Similarity Information:
    • Year information:
     { movieid : year(int) }
    
    • Genre information:
     { movieid : array() }
    
    • Title information:
     sparse matrix: 3883x3940 (for MovieLens 1M)
     sparse matrix: 45843x25632 (for MovieLens 20M)
    
    • Tag information:
     sparse matrix: 3883x16935 (for MovieLens 1M)
     sparse matrix: 45843x42721 (for MovieLens 20M)
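
To illustrate how a movies-by-terms sparse matrix like the ones above could be built and compared, here is a minimal sketch using only the standard library. It assumes plain term counts and cosine similarity, which may differ from the weighting and similarity measure the repository actually uses; the sample tags and the names `build_sparse_rows` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tag strings keyed by movie ID, shaped like the tag dict from step 2.
tags = {
    "1": "toy friendship rivalry cowboy",
    "2": "board game jungle friendship",
    "3": "toy cowboy rescue",
}


def build_sparse_rows(text_by_movie):
    """Build one sparse row per movie as {column_index: term_count}."""
    vocabulary = {}  # term -> column index, assigned in order of first appearance
    rows = {}
    for movie_id, text in text_by_movie.items():
        counts = Counter(text.lower().split())
        row = {}
        for term, count in counts.items():
            column = vocabulary.setdefault(term, len(vocabulary))
            row[column] = count
        rows[movie_id] = row
    return rows, vocabulary


def cosine_similarity(row_a, row_b):
    """Cosine similarity between two sparse rows; 0.0 if either row is empty."""
    dot = sum(value * row_b.get(column, 0) for column, value in row_a.items())
    norm_a = math.sqrt(sum(value * value for value in row_a.values()))
    norm_b = math.sqrt(sum(value * value for value in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


rows, vocabulary = build_sparse_rows(tags)
shape = (len(rows), len(vocabulary))  # movies x distinct terms
```

With real data, `shape` would correspond to dimensions such as 3883x16935: one row per movie and one column per distinct term. For matrices of that size, a dedicated sparse matrix library would likely be a better fit than nested dicts.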
    
  • Step 4 - Construct Content-based Decision Tree

Dependencies

  • MATLAB R2013a
  • Python 3.5/3.6
  • Jupyter Notebook

Languages

  • Jupyter Notebook 97.0%
  • MATLAB 2.9%
  • Python 0.1%