DISCONTINUED
The problem is that the model does produce results, but unreliable ones. Its metrics are supposed to improve during training, and although the model's architecture and code appear free of syntax and logic errors, the lack of improvement points to a bug we have not been able to solve, which makes the results dubious at best.
Second, the knowledge graph will be an even bigger challenge, because as students our knowledge is limited. We do not know how to curate our own knowledge graph so that it strictly follows the paper our thesis is based on, and so that the model gives reliable results when the graph is fed to it.
For these reasons we believe our initially proposed title will not come to fruition. What we can submit are the models we have built so far, which are less ideal than the initially proposed model and were originally meant only to be compared against it.
This is what we truly have right now
Description:
This is the second phase of my undergraduate thesis, which will recommend jurisprudence documents to legal practitioners specializing in the labor sector. It is based on the paper of Wang, H. et al.
Usage:
- clone repository with
git clone https://github.com/08Aristodemus24/phil-jurisprudence-recsys.git
- navigate to the directory containing the readme.md and requirements.txt files
- run the command
conda create -n <name of env e.g. phil-jurisprudence-recsys> python=3.10.11
Note that 3.10.11 must be the Python version, otherwise the packages to be installed would not be compatible with a different Python version
- once the environment is created, activate it by running the command
conda activate <name of env>
- for example, run
conda activate phil-jurisprudence-recsys
- check if pip is installed by running
conda list -e
and checking the list
- if it is there, skip to the last step; if not, install pip by typing
conda install pip
- if pip exists or the install is done, run
pip install -r requirements.txt
in the directory you are currently in
Recommender System Building
Model building:
Prerequisites to do: 1.
To do:
- see shape of user input in DeepFM model
- test run
- label each line of execution in Recommender-System repository, particularly in the part using the DeepFM model
- create data loader for movie ratings dataset (priority)
a. item_index2entity_id.txt
- item_index2entity_id.txt actually goes hand in hand with the knowledge graph dataset ml1m-kg1m and ml1m-kg20k
- mean adder to the predicted ratings
- adder of a new user to the user-item rating matrix and user-item interaction matrix
- being able to update a single rated item-rating by a single user in the user-item rating matrix and the user-item interaction matrix
a. $Y_{i, j}$ is 0.5 and the user turns it to 3.5; $R_{i, j}$ is 1 initially and after the update $R_{i, j}$ is still 1
- being able to update a single unrated item-rating by a single user in the user-item rating matrix and the user-item interaction matrix
a. $Y_{i, j}$ is 0 and the user turns it to 5; $R_{i, j}$ is 0 initially and after the update $R_{i, j}$ is now 1
- confine ratings to only 0 or any number in the set {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5}
a. predictions like 0.75 should be rounded up to 1, 1.75 to 2, 2.8 to 3, 3.9 to 4, 4.85 to 5
b. predictions like 0.25 should be rounded down to 0, 0.499999 to 0. Basically any fractional part below 0.5 must be rounded down
c. prediction 3.25 -> 3.25 - 3 = 0.25 < 0.5 therefore round 3.25 down to 3.0
d. prediction 3.5 -> 3.5 - 3 = 0.5 >= 0.5 therefore round 3.5 up to 4.0
- because an initial factorization machine (FM) architecture of the collaborative filtering model already exists, built with the Functional API of tensorflow and its built-in Model class, I need to build a more tailored version of this Model class that implements this FM architecture
- build DeepFM architecture
- write MKR architecture
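The rating update and rounding items above can be sketched roughly as follows. This is only an illustration: the function names `update_rating` and `round_rating` are hypothetical and not part of the repository, and treating `i` as the row and `j` as the column of both matrices is an assumption.

```python
import numpy as np

def update_rating(Y, R, i, j, new_rating):
    """Write a rating into Y[i, j] and mark the pair as rated in R[i, j].

    Assumes Y (ratings) and R (0/1 interactions) share the same shape.
    Works for both cases in the notes: re-rating an already rated item
    (R stays 1) and rating a previously unrated item (R goes 0 -> 1).
    """
    Y[i, j] = new_rating
    R[i, j] = 1
    return Y, R

def round_rating(pred):
    """Round predictions with the rule described in the notes:
    a fractional part >= 0.5 rounds up, anything below rounds down.
    Predictions are first clipped to the valid range [0, 5]."""
    clipped = np.clip(np.asarray(pred, dtype=float), 0.0, 5.0)
    return np.floor(clipped + 0.5)

Y = np.array([[0.5, 0.0], [0.0, 4.0]])
R = (Y > 0).astype(int)
update_rating(Y, R, 0, 0, 3.5)  # rated item: 0.5 -> 3.5, R stays 1
update_rating(Y, R, 0, 1, 5.0)  # unrated item: 0 -> 5, R becomes 1
print(round_rating([0.75, 0.25, 3.25, 3.5]))
```

Note that this rule, as written in the notes, only ever produces whole numbers; reaching the half-step values in the rating set would need a different rule (for example rounding to the nearest 0.5).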
Questions:
- how do I split the data into training, validation, and testing sets?
- how do I batch train the model?
- how do I make predictions with the model
- how do I implement the AUC, Precision@k, Recall@k, and F1-score in this model?
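As a starting point for the first two questions, here is a minimal sketch of splitting row indices into training, validation, and testing sets and then iterating over mini-batches. The 60/20/20 fractions, the seed, and the helper names `split_indices` and `iter_batches` are assumptions for illustration, not what `train_model.py` does.

```python
import random

def split_indices(n_rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle row indices once and cut them into train/val/test parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # whatever remains
    return train, val, test

def iter_batches(indices, batch_size):
    """Yield consecutive slices of at most batch_size indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

train_idx, val_idx, test_idx = split_indices(10)
for batch in iter_batches(train_idx, batch_size=4):
    print(batch)  # each batch would be used for one gradient step
```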
Insights:
- A user will always have some negatively rated items, so to get all items not rated by a user, subtract both the negative rating set and the positive rating set from the set of all unique items. When sampling with replacement, the sample size can be greater than the population size, e.g. drawing 10 samples from [1, 2, 3] is permitted if replace is true, that is, if we return each value we took out to our "bag". Also, the population mean is a parameter; the sample mean is a statistic
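The set subtraction and the with/without replacement distinction above can be sketched like this. The helper name `sample_unrated` is hypothetical, and falling back to sampling with replacement when there are too few unrated items is only one possible way to handle that case.

```python
import random

def sample_unrated(all_items, pos_items, neg_items, seed=0):
    """Draw as many unrated items as the user has positive items.

    unrated = all items - positive items - negative items.
    If enough unrated items exist, sample without replacement;
    otherwise sample with replacement (an assumption, see lead-in).
    """
    rng = random.Random(seed)
    unrated = sorted(set(all_items) - set(pos_items) - set(neg_items))
    k = len(pos_items)
    if not unrated:
        return []
    if k <= len(unrated):
        return rng.sample(unrated, k)   # without replacement
    return rng.choices(unrated, k=k)    # with replacement

# 20 items, 10 rated positively, 3 negatively -> only 7 unrated items
items = range(20)
pos = list(range(10))
neg = [10, 11, 12]
print(sample_unrated(items, pos, neg))
```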
Conclusions 1.
Articles:
- building a matrix factorization model and normalizing ratings: https://www.kaggle.com/code/colinmorris/matrix-factorization
Side notes:
Preprocessing data:
Prerequisites to do:
- fix file structure of Recommender-System repository (priority)
- translate Chinese characters in Recommender-System repository (priority)
a. data_loader.py
b. kg_load.py
c. evaluation.py
d. decorator.py
e. competition.py
To do:
- create rating preprocessing to keep only positive interactions as 1 and unrated items as 0
- the unwatched item set of a user (all unique values) must be equal to or greater in length than the positive item set of that user. This is a constraint we must add to avoid future errors when sampling. If the unwatched items are fewer than the positive ones, e.g. a user rated 10 items positively and 3 items negatively out of all 20 items, the unwatched items would number 7, and this would be an error since we are sampling without replacement the same number as the positive items, which is 10, from an unrated set of only 7 items. Since this is a recommendation system where users usually do not rate most items and the user-item interaction and rating matrix is sparse, such an error can be avoided for the meantime, but we still need to find a way to handle it
- need to find the end result of entity_id2index after the convert_kg function
- add logs to get_length__build_value_to_index
- pipeline of separate_pos_neg_ratings() function:
a. pass ratings df, item_id string
b. return values will have var names n_items, old_item_idx2new_item_idx
c. for users the same thing... n_users, old_user_idx2new_user_idx
d. use the built lookup dictionaries to renew the user id and item id columns using the .apply() method of the dataframe
e. pass the new dataframe to the separate_pos_neg_ratings() func
f. separate pos and neg ratings based on a threshold by vectorization
>>> import pandas as pd
>>>
>>> ratings = pd.DataFrame({'user_id': [2, 2, 5, 16, 16, 16, 20, 1, 1, 3, 56, 32], 'item_id': [9, 2, 4, 99, 9, 4, 9, 1, 2, 50, 21, 100], 'rating': [4, 5, 4, 5, 4, 5, 5, 5, 5, 4, 5, 5]})
>>>
>>> ratings.groupby('user_id')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A9759E7AF0>
>>> ratings.groupby('user_id').agg(set)
            item_id  rating
user_id
1            {1, 2}     {5}
2            {9, 2}  {4, 5}
3              {50}     {4}
5               {4}     {4}
16       {9, 99, 4}  {4, 5}
20              {9}     {5}
32            {100}     {5}
56             {21}     {5}
>>>
or
>>> user_pos_ratings = ratings.groupby('user_id', as_index=False).agg(set)
>>> user_pos_ratings
   user_id     item_id  rating
0        1      {1, 2}     {5}
1        2      {9, 2}  {4, 5}
2        3        {50}     {4}
3        5         {4}     {4}
4       16  {9, 99, 4}  {4, 5}
5       20         {9}     {5}
6       32       {100}     {5}
7       56        {21}     {5}
>>> user_pos_ratings.to_dict()
{'user_id': {0: 1, 1: 2, 2: 3, 3: 5, 4: 16, 5: 20, 6: 32, 7: 56}, 'item_id': {0: {1, 2}, 1: {9, 2}, 2: {50}, 3: {4}, 4: {9, 99, 4}, 5: {9}, 6: {100}, 7: {21}}, 'rating': {0: {5}, 1: {4, 5}, 2: {4}, 3: {4}, 4: {4, 5}, 5: {5}, 6: {5}, 7: {5}}}
>>>
>>> user_pos_ratings_2 = ratings.groupby('user_id').agg(set)
>>> user_pos_ratings_2.to_dict()
{'item_id': {1: {1, 2}, 2: {9, 2}, 3: {50}, 5: {4}, 16: {9, 99, 4}, 20: {9}, 32: {100}, 56: {21}}, 'rating': {1: {5}, 2: {4, 5}, 3: {4}, 5: {4}, 16: {4, 5}, 20: {5}, 32: {5}, 56: {5}}}
>>> user_pos_ratings_2 = ratings.groupby('user_id')[['item_id']].agg(set)
>>> user_pos_ratings_2.to_dict()
{'item_id': {1: {1, 2}, 2: {9, 2}, 3: {50}, 5: {4}, 16: {9, 99, 4}, 20: {9}, 32: {100}, 56: {21}}}
>>>
>>> temp = ratings.groupby('user_id')['item_id'].agg(set)
>>> temp
user_id
1         {1, 2}
2         {9, 2}
3           {50}
5            {4}
16    {9, 99, 4}
20           {9}
32         {100}
56          {21}
Name: item_id, dtype: object
>>> temp.to_dict()
{1: {1, 2}, 2: {9, 2}, 3: {50}, 5: {4}, 16: {9, 99, 4}, 20: {9}, 32: {100}, 56: {21}}
- make a function that saves the preprocessed and split data for easier access, along with its metadata like n_users and n_items for model arguments
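The re-indexing pipeline in the to-do list above (steps a to f) can be sketched as below. This is a rough illustration under assumptions: `build_value_to_index` and `reindex_and_label` are hypothetical stand-ins for the repository's functions, the positive-rating threshold of 4 is assumed, and `.map()` is used in place of `.apply()` because it performs the same dictionary lookup column-wise.

```python
import pandas as pd

def build_value_to_index(ratings, column):
    """Return (n_unique, lookup) where lookup maps each raw id
    in `column` to a new contiguous index 0..n_unique-1."""
    unique_vals = ratings[column].unique()  # order of first appearance
    lookup = {old: new for new, old in enumerate(unique_vals)}
    return len(lookup), lookup

def reindex_and_label(ratings, threshold=4):
    """Renew user/item ids and add a 0/1 `interaction` column."""
    n_users, user_lookup = build_value_to_index(ratings, "user_id")
    n_items, item_lookup = build_value_to_index(ratings, "item_id")
    out = ratings.copy()
    out["user_id"] = out["user_id"].map(user_lookup)
    out["item_id"] = out["item_id"].map(item_lookup)
    # vectorized split: 1 for positive ratings, 0 for negative ones
    out["interaction"] = (out["rating"] >= threshold).astype(int)
    return out, n_users, n_items

ratings = pd.DataFrame({
    "user_id": [2, 2, 5, 16],
    "item_id": [9, 2, 4, 9],
    "rating": [4, 5, 3, 5],
})
new_ratings, n_users, n_items = reindex_and_label(ratings)
# sanity check: re-indexing must not change the number of unique ids
assert new_ratings["user_id"].nunique() == ratings["user_id"].nunique()
assert new_ratings["item_id"].nunique() == ratings["item_id"].nunique()
print(new_ratings, n_users, n_items)
```

The two assertions at the end are the kind of check that can catch a mismatch in the number of unique users or items after preprocessing.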
Problems:
A big problem is that after using refactor_raw_ratings() to get only positive ratings and sample unwatched ratings, the number of item_ids unusually grows from 3706 items to 6040 items. So why is that?
Questions:
Insights:
Conclusions:
Articles:
- somehow convert each row that has a unique user and their corresponding rated item to a dictionary with each value as a set in a vectorized way: https://stackoverflow.com/questions/65436865/how-to-convert-dataframe-into-dictionary-of-sets
Side notes:
Tuning Model
Prerequisites to do:
- raise problem on Stack Overflow: RecSys model performance stalling at 47% AUC and F1-Score. Is the problem due to the ratio of users to items in my dataset?
I'm having trouble making my validation binary_crossentropy go down and my validation F1-score and AUC go up. I've tried tuning my hyperparameters, such as the number of latent features of the model (8), the learning rate (0.000075), the lambda in the regularization term (1.2), the dropout rate (0.4), and the batch size (16384), which seem to be the most extreme values I can use to prevent overfitting, but to no avail: my validation F1-Score and AUC always stall at around 47%, or 52% at the highest if I increase my epochs to 500. A higher batch size even gave me RAM problems, since I only use my CPU for this ML task (unfortunately I have no graphics card).
Here is my model architecture, which first uses an embedding layer and then feeds the output of this embedding layer into two branches, so to speak: one that flattens the output of the embedding layer, and one that concatenates the output of the embedding layer and feeds it into a fully connected network.
So my question is: could it be that my architecture is too complex, or is my dataset the problem?
- use github URL of larj-corpus dataset instead of local path in retrieving the rating data
To do:
- try
python train_model.py -d juris-600k --protocol A --model_name DFM --n_features 8 --n_epochs 100 --rec_alpha 0.000075 --rec_lambda 1 --rec_keep_prob 0.8 --batch_size 8192
python train_model.py -d juris-600k --protocol A --model_name FM --n_epochs 100 --rec_lambda 1 --rec_keep_prob 0.8 --batch_size 8192
- in both models try bigger batch sizes of 65536, 32768, 16384
- in DFM try keep probability of 0.6, alpha of 0.0001, lambda of 1.2
- commands to use:
-
python train_model.py -d ml-1m --protocol A --model_name FM --n_features 32 --n_epochs 100 --rec_alpha 0.0003 --rec_lambda 0.9 --batch_size 8192
-
python train_model.py -d juris-300k --protocol A --model_name FM --n_features 32 --n_epochs 100 --rec_alpha 0.0003 --rec_lambda 0.9 --rec_keep_prob 0.1 --batch_size 8192
-
python train_model.py -d juris-300k --protocol A --model_name DFM --n_features 32 --layers_dims 32 16 16 16 8 8 4 4 3 1 --n_epochs 100 --rec_alpha 0.0003 --rec_lambda 0.9 --rec_keep_prob 0.1 --batch_size 8192
-
python train_model.py -d juris-600k --protocol A --model_name FM --n_features 32 --n_epochs 100 --rec_alpha 0.0001 --rec_lambda 0.9 --batch_size 8192
-
python train_model.py -d juris-600k --protocol A --model_name DFM --n_features 32 --layers_dims 32 16 16 16 8 8 4 4 3 1 --n_epochs 100 --rec_alpha 0.0001 --rec_lambda 0.9 --rec_keep_prob 0.7 --batch_size 8192
-
python train_model.py -d juris-3m --protocol A --model_name FM --n_features 32 --n_epochs 100 --rec_alpha 0.0001 --rec_lambda 0.9 --batch_size 32768
-
python train_model.py -d juris-3m --protocol A --model_name DFM --n_features 32 --layers_dims 32 16 16 16 8 8 4 4 3 1 --n_epochs 100 --rec_alpha 0.0001 --rec_lambda 0.9 --rec_keep_prob 0.7 --batch_size 32768
-
just generate top N/K predictions and evaluate them with precision@k, recall@k, and f1@k even with seemingly poor performance
-
what if I train on the raw outputs without the sigmoid and only apply the sigmoid during prediction
-
try pytorch
-
try flattening the embeddings before pushing into dot layer
-
try using a different initializer
-
try doing it on Google Colab
-
rework model by building a non eager model by using gradient tape
-
will consist of turning juris-600k into a rating matrix using csr_matrix()
-
try implementing a model that only trains and see the performance improvement there; once decent, implement the model with the validation data split
-
implement batch training in gradient tape
-
then implement splitting the training dataset by batches
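One of the ideas above is to train on raw outputs (logits) and apply the sigmoid only at prediction time. A minimal sketch of what that looks like numerically is given below, in plain Python rather than TensorFlow; the function names are illustrative. In Keras the equivalent is usually a final layer without activation combined with `tf.keras.losses.BinaryCrossentropy(from_logits=True)`.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid, used only at prediction time."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce_with_logits(y_true, logit):
    """Binary cross-entropy computed directly from the logit.

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    but avoids log(0) when the logit is very large or very small.
    """
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))

logit = 0.3
print(bce_with_logits(1, logit))   # training loss uses the logit
print(sigmoid(logit))              # probability only when predicting
```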
Problems:
There is something wrong with split data or refactor raw ratings, because there seems to be a mismatch in the original number of user ids and item ids. I suspect the user ids and item ids are reduced because negative ratings are removed. Nevertheless the following models and datasets produce the following results:
- FM with juris_300k is ok
- DFM with juris_300k causes
Allocation of 268435456 exceeds 10% of free system memory.
and
OP_REQUIRES failed at segment_reduction_ops_impl.h:478 : INVALID_ARGUMENT: data.shape = [8192] does not start with segment_ids.shape = [67108864]
. I suspect this has something to do with batch size and the model architecture itself
- FM with juris_600k is not ok to begin with (even if it runs fine, albeit with abysmal AUC, F1-Score, and Binary Accuracy results), since there is one user missing in the final refactored juris_600k dataset, where instead of 12034 users all in all there are now only 12033 users
- DFM with juris_600k causes
Allocation of 268435456 exceeds 10% of free system memory.
and
OP_REQUIRES failed at segment_reduction_ops_impl.h:478 : INVALID_ARGUMENT: data.shape = [8192] does not start with segment_ids.shape = [67108864]
. Again I suspect this has something to do with batch size and the model architecture itself. Found the problem: if I remove the deep neural network part, the model works fine. Resolved by adding the flatten layer after the concatenation layer, which I had forgotten
- there seems to be overfitting due to the dataset itself because of the rat
- somehow the MKR model seems to be working fine giving out the ff. results
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\phil-jurisprudence-recsys\lib\site-packages\tensorflow\python\util\dispatch.py:1176: calling expand_dims (from tensorflow.python.ops.array_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
epoch 0 train auc: 0.8884 acc: 0.8032 eval auc: 0.8833 acc: 0.8001 test auc: 0.8836 acc: 0.8003
epoch 1 train auc: 0.8945 acc: 0.8123 eval auc: 0.8862 acc: 0.8051 test auc: 0.8867 acc: 0.8060
epoch 2 train auc: 0.9007 acc: 0.8163 eval auc: 0.8873 acc: 0.8063 test auc: 0.8878 acc: 0.8059
epoch 3 train auc: 0.9102 acc: 0.8260 eval auc: 0.8919 acc: 0.8101 test auc: 0.8919 acc: 0.8110
epoch 4 train auc: 0.9210 acc: 0.8395 eval auc: 0.8987 acc: 0.8210 test auc: 0.8990 acc: 0.8203
epoch 5 train auc: 0.9287 acc: 0.8498 eval auc: 0.9035 acc: 0.8265 test auc: 0.9038 acc: 0.8271
epoch 6 train auc: 0.9331 acc: 0.8558 eval auc: 0.9056 acc: 0.8299 test auc: 0.9054 acc: 0.8298
epoch 7 train auc: 0.9383 acc: 0.8620 eval auc: 0.9076 acc: 0.8328 test auc: 0.9076 acc: 0.8321
epoch 8 train auc: 0.9404 acc: 0.8643 eval auc: 0.9097 acc: 0.8345 test auc: 0.9096 acc: 0.8339
epoch 9 train auc: 0.9427 acc: 0.8675 eval auc: 0.9105 acc: 0.8347 test auc: 0.9103 acc: 0.8345
epoch 10 train auc: 0.9453 acc: 0.8706 eval auc: 0.9112 acc: 0.8372 test auc: 0.9108 acc: 0.8358
epoch 11 train auc: 0.9460 acc: 0.8714 eval auc: 0.9123 acc: 0.8376 test auc: 0.9117 acc: 0.8370
epoch 12 train auc: 0.9474 acc: 0.8731 eval auc: 0.9118 acc: 0.8369 test auc: 0.9117 acc: 0.8370
epoch 13 train auc: 0.9481 acc: 0.8744 eval auc: 0.9134 acc: 0.8390 test auc: 0.9125 acc: 0.8361
epoch 14 train auc: 0.9491 acc: 0.8750 eval auc: 0.9133 acc: 0.8386 test auc: 0.9126 acc: 0.8374
epoch 15 train auc: 0.9483 acc: 0.8740 eval auc: 0.9131 acc: 0.8380 test auc: 0.9122 acc: 0.8367
epoch 16 train auc: 0.9502 acc: 0.8770 eval auc: 0.9134 acc: 0.8389 test auc: 0.9128 acc: 0.8384
epoch 17 train auc: 0.9505 acc: 0.8776 eval auc: 0.9129 acc: 0.8397 test auc: 0.9123 acc: 0.8381
epoch 18 train auc: 0.9509 acc: 0.8777 eval auc: 0.9140 acc: 0.8401 test auc: 0.9134 acc: 0.8392
epoch 19 train auc: 0.9516 acc: 0.8787 eval auc: 0.9136 acc: 0.8398 test auc: 0.9129 acc: 0.8386
So my problem lies in how I preprocessed the data, because in Wang's paper the user ids themselves were re-ordered so that the first user (even if it had user id 1000, for instance) was set to a new id 0, representing a user with a positive item set.
This hypothesis has been rejected because even ordering the data has no effect on performance
-
The next hypothesis is: do the embeddings have to do with it? When Wang preprocessed the MovieLens dataset, did both the train and cross-validation data splits still preserve the number of unique users and unique items?
-
Determining whether the density of ml-1m and juris-300k/juris-600k affected the training did not work, since both had virtually the same sparsity and, inversely, density.
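For reference, the density comparison mentioned above comes down to one ratio: the number of observed (user, item) pairs divided by the number of possible pairs. A minimal sketch, with a hypothetical helper name and toy data rather than the real ml-1m or juris files:

```python
def matrix_density(pairs):
    """Density of a user-item matrix given its observed (user, item) pairs.

    density  = observed pairs / (n_users * n_items)
    sparsity = 1 - density
    """
    pairs = set(pairs)                      # ignore duplicate ratings
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(pairs) / (len(users) * len(items))

# toy example: 3 users x 3 items with 4 observed interactions
toy = [(0, 0), (0, 1), (1, 1), (2, 2)]
density = matrix_density(toy)
print(f"density={density:.3f} sparsity={1 - density:.3f}")
```

Running the same function on the (user_id, item_id) columns of each dataset gives numbers that can be compared directly.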
Questions:
Insights:
- the higher the AUC, the better the model is at ranking the 1 class above the 0 class in a binary classification task. The closer it is to 1, the better the separation; if it is close to 0.5, the model has no class separation capacity whatsoever; the closer it is to 0, the more consistently its predictions are inverted
- My hypothesis for why precision@k, recall@k, accuracy@k, and f1@k are used in binary-framed recommender systems is that the positively interacted items are labeled 1, and the unrated items of users who have rated at least 1 positive item are labeled 0
For example, say the user has watched 6 movies; in the first recommendation list 2 of them are relevant, and in the second list 1 of them is relevant. The two relevant movies in the first list are items the user has had, or will have, a positive interaction with
______|item 1|item 2|item 3|item 4|item 5|
user 1| 1 | 0 | 0 | 0 | 1 |
------|------|------|------|------|------|
user 2| 0 | 1 | 1 | 1 | 0 |
------|------|------|------|------|------|
user 3| 1 | 1 | 0 | 1 | 0 |
------|------|------|------|------|------|
user 4| 1 | 0 | 1 | 1 | 1 |
------|------|------|------|------|------|
user 5| 0 | 1 | 0 | 1 | 0 |
In training, say for user 1, we learned to correctly predict the interactions between this user and items 1, 2, and 4 as part of our training set, which are 1, 0, and 0. We then want to predict the rest of user 1's items, items 3 and 5, which have interactions 0 and 1 respectively. Should the model not overfit, then if in our cross validation data we correctly predict 1 as the interaction between item 5 and user 1, we would have recommended an item that they may potentially like
- It could be that even though MovieLens had 6000+ users and 3000+ items, the reason our models did not stall on it was because its user-item matrix was not sparse, and that the reason our model was stalling on our data was because our user-item rating matrix was too sparse. An experiment I could run is to compare whether the MovieLens dataset is indeed dense and whether the juris-300k or juris-600k dataset is sparse.
Should such expectations come to fruition, it would mean that my hypothesis (the model performs well on MovieLens because it is not sparse, and poorly on juris-300k/600k because they are sparse) is correct, leading to the key conclusion that our datasets juris-300k and juris-600k need resynthesizing for the final time.
-
I understand now that item-to-item and user-to-user based recommendation systems simply use, from the user-item rating matrix, the row vector of each user's rated items or the column vector of each item across every user that has rated it. We take these rows or columns and feed them to, perhaps, KNN or K-means to find neighbours or clusters of these datapoints: every row of the user-item rating matrix is a user, i.e. a vector of features we can cluster (there is no target variable to be predicted, so this is unsupervised), and every column is an item we can also cluster
-
the task was to learn the embedding matrices of users and items, right, such that these matrices, together with the bias vectors, approximate the user-item rating matrix. To get the missing ratings, I theorize, we just multiply the user embedding matrix by the item embedding matrix and recommend the highest predicted ratings to the user
so the task is to generate a list of recommendations of length K and
- A known problem in recommendation, sometimes called the "Harry Potter" effect: (almost) everybody likes Harry Potter. So most automated procedures will find out which items are generally popular and recommend those to the users. You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is. Ah, so this is why those popular items were removed in that professor's notebook demo
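To make the AUC and precision@k/recall@k insights above concrete, here is a small sketch applied to user 1 from the interaction table (items 1 and 5 are the positives). The predicted scores are made up for illustration, and the function names are hypothetical rather than taken from the repository.

```python
def precision_recall_f1_at_k(scores, relevant, k):
    """scores: {item: predicted score}; relevant: set of positive items."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs where the
    positive is scored higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# user 1 interactions from the table: items 1..5 -> 1, 0, 0, 0, 1
labels = [1, 0, 0, 0, 1]
made_up_scores = {1: 0.9, 2: 0.2, 3: 0.8, 4: 0.1, 5: 0.7}
print(precision_recall_f1_at_k(made_up_scores, {1, 5}, k=2))
print(auc(labels, [made_up_scores[i] for i in range(1, 6)]))
```

With these scores the top-2 list is items 1 and 3, so only one of the two recommendations is relevant, while the AUC counts how often a positive item outranks a negative one across all pairs.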
Articles:
- Evaluating recommender systems
- https://neptune.ai/blog/how-to-test-recommender-system
- https://www.shaped.ai/blog/evaluating-recommendation-systems-part-1
- precision and recall at k
- https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
- https://www.jillcates.com/pydata-workshop/html/tutorial.html?fbclid=IwAR0QF6gMZpj7Fs_yboKO9_3gpqqRctbUEplwwhjDvp5kDQKPK0O8f5IiG14
- https://github.com/ogbinar/upstat280/blob/main/01-recommender-systems-main/recommender-v2.ipynb
- saving model weights and model architecture
- https://colab.research.google.com/drive/172D4jishSgE3N7AO6U2OKAA_0wNnrMOq#scrollTo=OOSGiSkHTERy
- https://stackoverflow.com/questions/51806852/cant-save-custom-subclassed-model
- https://www.tensorflow.org/tutorials/keras/save_and_load#manually_save_weights
- https://www.tensorflow.org/guide/keras/serialization_and_saving#part_ii_saving_and_loading_of_subclassed_models
- why validation accuracy and auc increase and loss increases too
- https://stats.stackexchange.com/questions/548010/why-would-auc-on-a-validation-set-increase-while-loss-increases
- https://datamachines.com/blog/auc-vs-log-loss
- https://stats.stackexchange.com/questions/282160/how-is-it-possible-that-validation-loss-is-increasing-while-validation-accuracy
- https://ai.stackexchange.com/questions/24407/why-does-the-accuracy-drop-while-the-loss-decrease-as-the-number-of-epochs-incr
- training a model by batch in non-eager execution mode
- https://youtu.be/Rx7pPuosoLk?si=WxEha38gBhMkMkmm
- http://sujitpal.blogspot.com/2019/04/matrix-factorization-as-gradient.html
- https://stackoverflow.com/questions/64605596/simple-way-of-performing-matrix-factorization-with-tensorflow-2
Analyzing embeddings
Prerequisites to do:
To do:
Questions:
Insights:
Articles:
- An algorithm similar to (or based on) K-means that do not require the 'k' number of clusters
- Introduction to Embedding, Clustering, and Similarity
- K Means Clustering on High Dimensional Data.
- How to transform a 3d arrays into a dataframe in python
- https://stackoverflow.com/questions/35525028/how-to-transform-a-3d-arrays-into-a-dataframe-in-python
Named Entity Recognition Model building for creating knowledge graph
Annotation:
Prerequisites to do:
- use each created and properly separated .txt file with the NER annotator at https://tecoholic.github.io/ner-annotator/ to annotate the organized text files manually
- annotate manually and save which will result in a .json file with format: { "classes":["CITATION", ... ,"ORGANIZATION"], "annotations":[ ["" LABOR CIRCULAR ON-LINE No. 61 Series of 1998 TOPIC At a Glance PETITIONS FOR CERTIORARI UNDER RULE 65 OF THE RULES OF COURT\r", {"entities":[[2,31,"CIRCULAR"],[32,46,"SERIES"],[65,89,"PETITION"],[96,103,"RULE"]]}], ... ["xxx"" "\r",{"entities":[]}], ["",{"entities":[]}] ] }
- create a parser that will take the annotations arrays of each text file, extract each element, and place it into one final data file e.g. [ ["sentence/line/string 1", {"entities":[(, , ""), ..., (, , "")]}], ["sentence/line/string 2", {"entities":[(, , ""), ..., (, , "")]}], ..., ["sentence/line/string n", {"entities":[(, , ""), ..., (, , "")]}], ]
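The parser described in the last item could look roughly like the sketch below. It assumes each annotator export is a JSON document with an "annotations" array shaped like the sample above; the function name `merge_annotations` is hypothetical, and skipping blank lines and empty entries is a design choice of this sketch, not something the notes specify.

```python
import json

def merge_annotations(json_texts):
    """Combine the "annotations" arrays of several annotator exports
    into one list of [text, {"entities": [(start, end, label), ...]}]."""
    merged = []
    for raw in json_texts:
        doc = json.loads(raw)
        for item in doc.get("annotations", []):
            if not item:                 # tolerate null/empty entries
                continue
            line, meta = item
            if not line.strip():         # assumption: drop blank lines
                continue
            entities = [tuple(e) for e in meta.get("entities", [])]
            merged.append([line, {"entities": entities}])
    return merged

file_a = json.dumps({"classes": ["COURT"], "annotations": [
    ["THE SUPREME COURT\r", {"entities": [[4, 17, "COURT"]]}],
    ["", {"entities": []}],
]})
file_b = json.dumps({"classes": ["ORGANIZATION"], "annotations": [
    ["NLRC ruling\r", {"entities": [[0, 4, "ORGANIZATION"]]}],
]})
print(merge_annotations([file_a, file_b]))
```

In practice the strings would come from reading each exported .json file; the line text is kept unchanged so the character offsets stay valid.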
To do: 1.
Questions: 1.
Insights: 1.
Conclusions: 1.
Side notes: 1.
Training Model:
Prerequisites to do:
- sample data for named entity recognition
// TRAIN_DATA = [
// ("Apple is looking at buying U.K. startup for $1 billion", {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]}),
// ("John lives in New York City and works for IBM", {"entities": [(0, 4, "PERSON"), (14, 27, "GPE"), (42, 45, "ORG")]}),
// ("The Mona Lisa is a painting by Leonardo da Vinci", {"entities": [(4, 13, "WORK_OF_ART"), (31, 48, "PERSON")]}),
// ("President Biden visited Detroit to talk about job opportunities", {"entities": [(10, 15, "PERSON"), (24, 31, "GPE")]}),
// ("The Great Barrier Reef is located off the coast of Australia", {"entities": [(4, 22, "LOC"), (51, 60, "GPE")]}),
// ]
// you'd have to create a dataset like the above so...
// [
// ["" LABOR CIRCULAR ON-LINE No. 61 Series of 1998 TOPIC At a Glance PETITIONS FOR CERTIORARI UNDER RULE 65 OF THE RULES OF COURT\r", {"entities":[[2,31,"CIRCULAR"],[32,46,"SERIES"],[65,89,"PETITION"],[96,103,"RULE"]]}],
// ["FROM DECISIONS OF THE NLRC NOW TO BE INITIALLY FILED WITH THE COURT OF APPEALS AND NO LONGER DIRECTLY WITH THE SUPREME COURT\r", {"entities":[[22,26,"ORGANIZATION"],[62,78,"COURT"],[111,124,"COURT"]]}]
// ]
{ "classes":["CITATION","AMOUNT","COMPANY","CONSTRAINT","COPYRIGHT","COURT","DATE","DEFINITION","DISTANCE","DURATION","GEOENTITY","PERCENT","REGULATION","TRADEMARK","JUDGEMENT","GAZETTE","PROCEEDINGS","ARTICLE","SECTION","CLAUSE","PARAGRAPH","DEFENDANT","PROSECUTOR","APPEAL","APPELANT","PLAINTIFF","INVOLVED ENTITY","ADVOCATE","LEARNED COUNSEL","ROLE","JUDGE","OFFENCE","ACCUSATION","OBJECTION","JURISDICTION","PENALTY","COMPENSATION","EVIDENCE","EVIDENCE DESCRIPTION","ACT","CIRCULAR","SERIES","CASE","GENERAL REGISTRY NUMBER","PETITION","RULE","ORGANIZATION"], "annotations":[ ["" LABOR CIRCULAR ON-LINE No. 61 Series of 1998 TOPIC At a Glance PETITIONS FOR CERTIORARI UNDER RULE 65 OF THE RULES OF COURT\r", {"entities":[[2,31,"CIRCULAR"],[32,46,"SERIES"],[65,89,"PETITION"],[96,103,"RULE"]]}], ["FROM DECISIONS OF THE NLRC NOW TO BE INITIALLY FILED WITH THE COURT OF APPEALS AND NO LONGER DIRECTLY WITH THE SUPREME COURT\r", {"entities":[[22,26,"ORGANIZATION"],[62,78,"COURT"],[111,124,"COURT"]]}], ["[en banc]\r",{"entities":[]}], ["[New Interpretation of ""Appeals"" from NLRC Decisions]\r",{"entities":[]}], ["Case Title:\r",{"entities":[]}], ["ST. MARTIN FUNERAL HOME VS. NATIONAL LABOR RELATIONS COMMISSION, ET AL.\r",{"entities":[[0,23,"PLAINTIFF"],[28,70,"ORGANIZATION"]]}], ["[G. R. No. 130866, September 16, 1998]\r",{"entities":[[0,38,"GENERAL REGISTRY NUMBER"]]}], ["[en banc]\r",{"entities":[]}], ["FACTS & RULING OF THE COURT:\r",{"entities":[]}], ["The Supreme Court [en banc] did not rule on the factual issues of the case but instead re-examined, inter alia, Section 9 of Batas Pambansa Bilang 129, as amended by Republic Act No. 
7902 [effective March 18, 1995] on the issue of where to elevate on appeal the decisions of the National Labor Relations Commission [NLRC].\r",{"entities":[[0,17,"COURT"],[112,121,"SECTION"],[125,150,"ACT"],[166,187,"ACT"],[199,213,"DATE"],[279,321,"ORGANIZATION"]]}],["The High Court remanded the case to the Court of Appeals consistent with the new ruling enunciated therein that the ""appeals"" contemplated under the law from the decisions of the National Labor Relations Commission to the Supreme Court should be interpreted to mean ""petitions for certiorari under Rule 65"" and consequently, should no longer be brought directly to the Supreme Court but initially to the Court of Appeals.\r",{"entities":[[0,14,"COURT"],[40,56,"COURT"],[181,216,"ORGANIZATION"],[224,237,"COURT"],[268,294,"PETITION"],[301,310,"RULE"],[373,386,"COURT"],[408,424,"COURT"]]}],["Before this new en banc ruling, the Supreme Court has consistently held that decisions of the NLRC may be elevated directly to the Supreme Court only by way of a special civil action for certiorari under Rule 65. 
There was no ruling allowing resort to the Court of Appeals.\r",{"entities":[[36,49,"COURT"],[94,98,"ORGANIZATION"],[131,144,"COURT"],[204,212,"RULE"],[256,272,"COURT"]]}],["In support of this new view, the Supreme Court ratiocinated, insofar as pertinent, as follows: ""While we do not wish to intrude into the Congressional sphere on the matter of the wisdom of a law, on this score we add the further observations that there is a growing number of labor cases being elevated to this Court which, not being a trier of fact, has at times been constrained to remand the case to the NLRC for resolution of unclear or ambiguous factual findings; that the Court of Appeals is procedurally equipped for that purpose, aside from the increased number of its competent divisions; and that there is undeniably an imperative need for expeditious action on labor cases as a major aspect of constitutional protection to labor.\r",{"entities":[[33,46,"COURT"],[408,412,"ORGANIZATION"],[479,495,"COURT"]]}], ["""Therefore, all references in the amended Section 9 of B. P. No. 129 to supposed appeals from the NLRC to the Supreme Court are interpreted and hereby declared to mean and refer to petitions for certiorari under Rule 65. Consequently, all such petitions should henceforth be initially filed in the Court of Appeals in strict observance of the doctrine on the hierarchy of courts as the appropriate forum for the relief desired.\r",{"entities":[[43,52,"SECTION"],[56,69,"ACT"],[99,103,"ORGANIZATION"],[111,124,"COURT"],[182,206,"PETITION"],[213,221,"RULE"],[299,315,"COURT"]]}], ["xxx"" "\r",{"entities":[]}], ["",{"entities":[]}] ] }
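Because the (start, end, label) offsets are character positions, a span that is off by a few characters silently labels the wrong text. A small checker like the sketch below can help spot some of these before training. The name `check_spans` is hypothetical, and the check is only a heuristic: it flags out-of-range spans and spans that begin or end on whitespace, and it cannot tell whether a clean-looking span is the intended entity.

```python
def check_spans(dataset):
    """Return a list of suspicious (text, start, end, label, span) entries.

    A span is flagged when it falls outside the text or when the
    extracted substring starts or ends with whitespace.
    """
    problems = []
    for text, meta in dataset:
        for start, end, label in meta["entities"]:
            span = text[start:end]
            in_range = 0 <= start < end <= len(text)
            if not in_range or span != span.strip():
                problems.append((text, start, end, label, span))
    return problems

sentence = "Apple is looking at buying U.K. startup for $1 billion"
good = [(sentence, {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]})]
shifted = [(sentence, {"entities": [(0, 5, "ORG"), (29, 32, "GPE")]})]

print(check_spans(good))      # nothing flagged
for _, start, end, label, span in check_spans(shifted):
    print(label, (start, end), repr(span))  # shows the misaligned text
```

Printing `text[start:end]` next to each label is also a quick way to review annotations by eye.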
To do: 1.
Questions: 1.
Insights: 1.
Conclusions: 1.
Side notes: 1.
Things I Learned:
- batch size can affect performance of model on validation set
- hyper-parameters with n_features of 32, n_epochs of 200, rec_alpha of 0.0003, --rec_lambda of 1, --regularization of "L2", and --batch_size of 8192 seem to give good baseline results for the FM (factorization machine) model
References:
LINK_TO_PAPER, LINK_TO_PAPERS_GITHUB, CITATION
-
https://www.researchgate.net/publication/333072348_Multi-Task_Feature_Learning_for_Knowledge_Graph_Enhanced_Recommendation/stats, https://github.com/hwwang55/MKR, Wang, Hongwei & Zhang, Fuzheng & Zhao, Miao & Li, Wenjie & Xie, Xing & Guo, Minyi. (2019). Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. WWW '19: The World Wide Web Conference. 2000-2010. 10.1145/3308558.3313411.
-
https://www.researchgate.net/publication/358851413_DFM-GCN_A_Multi-Task_Learning_Recommendation_Based_on_a_Deep_Graph_Neural_Network, https://github.com/SSSxCCC/Recommender-System, Xiao, Yan & Li, Congdong & Liu, Vincenzo. (2022). DFM-GCN: A Multi-Task Learning Recommendation Based on a Deep Graph Neural Network. Mathematics. 10. 721. 10.3390/math10050721.
-
https://www.researchgate.net/publication/332750505_Knowledge_Graph_Convolutional_Networks_for_Recommender_Systems, https://github.com/hwwang55/KGCN, Wang, Hongwei & Zhao, Miao & Xie, Xing & Li, Wenjie & Guo, Minyi. (2019). Knowledge Graph Convolutional Networks for Recommender Systems. 10.1145/3308558.3313417.