ahlag / Kaggle-Mercari

32/2384 Solution to Kaggle Mercari Competition (solo silver medal winner)

Home Page: https://www.kaggle.com/c/mercari-price-suggestion-challenge/


Kaggle-Mercari

./src/PlanB.py scores 0.40957 on the private leaderboard, ranking 32nd among 2384 teams.
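The leaderboard score is RMSLE (root mean squared logarithmic error), the competition's metric; a minimal sketch:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE on log1p-transformed values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Perfect predictions score 0; errors on cheap items weigh as much as on expensive ones.
score = rmsle(np.array([10.0, 100.0]), np.array([12.0, 90.0]))
```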


Ideas/things to do


Versions:

  • v8:

    • embedding for cat, brand, cond ...
  • v7:

    • adam lr
  • v6

    • TF: more epochs (v)
    • FM: 18 ?
    • FTRL: beta=0.01 (v)
    • WB: (no)
  • v5

    • delete stemmer and add two epochs (v)
    • log price (v)
    • REMEMBER to add fc back (v)
  • v4

    • elu instead of relu (v)
    • delete one fc layer (v)
    • delete dropout after fc
  • v3

    • cnn
    • dropout after cnn (v)
    • 2gram for cnn
    • rnn
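The v5 "log price" item refers to training on a log-transformed target so the model's squared error lines up with the RMSLE metric; a sketch of the transform and its inverse:

```python
import numpy as np

# Train on log1p(price); minimizing MSE on this target is equivalent to
# minimizing RMSLE on the raw price. Invert predictions with expm1.
prices = np.array([10.0, 35.0, 120.0])
y = np.log1p(prices)      # target fed to the model
preds = np.expm1(y)       # map model output back to dollars
```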

Worth a read:

Top players


Useful features

  • Len of text
  • Mean price of each category
  • Mean of brand/shipping
  • Average of word embeddings: Lookup all words in Word2vec and take the average of them. paper, Github Quora
  • Better way to remove stop word cached
  • Reduce TF time
  • Drop price = 0 or < 3 (link, link)
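The "average of word embeddings" feature looks each word up in an embedding table (Word2vec in the notes; a toy two-word dict here, purely illustrative) and averages the vectors, skipping out-of-vocabulary words:

```python
import numpy as np

# Toy stand-in for a Word2vec lookup table.
emb = {"red": np.array([1.0, 0.0]), "shirt": np.array([0.0, 1.0])}

def avg_embedding(text, table, dim=2):
    """Mean of the embedding vectors of in-vocabulary words; zeros if none match."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```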

Tricks

  • Stage 2: 1, 2, Mine

  • Rewrite the code:

    • "without merge(fitting on train and transforming on test) my CV and LB loss increased by 0.009. I can't figure out the reason." Link
    • Test set into batches. link
    • Better val set for TF
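The quoted "merge" remark is about vocabulary coverage: a text transformer fitted only on train never sees test-only tokens, so its test features differ from one fitted on the merged data. A toy vocabulary builder makes the difference visible (real code would use a vectorizer; this is just the fitting step):

```python
# Fit a vocabulary from a list of texts.
def fit_vocab(texts):
    return sorted({w for t in texts for w in t.split()})

train = ["red shirt", "blue jeans"]
test = ["green shirt"]

vocab_train_only = fit_vocab(train)        # "green" is missing
vocab_merged = fit_vocab(train + test)     # covers test-only words
```

Transforming the test set in batches (the second link) is orthogonal: it only bounds memory use and must not change the fitted vocabulary.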

Tried:

  • Combine (condition and shipping)
  • Concatenation of brand, item description and product name
  • One dimension for item_condition: https://www.kaggle.com/nvhbk16k53/associated-model-rnn-ridge/versions#base=2256015&new=2410057
  • Other features for TF: Quora solutions
    • No 1: Number of capital letters, question marks etc...
    • No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
    • No 8 -> a lot
    • https://www.kaggle.com/shujian/naive-xgboost-v2/versions
    • Tune FM: Compare 1 and 2. topic, kernel.
  • Ridge tuning:
    • RDizzl3: Try playing with these parameters and see if you can get similar results to Ridge: alpha, eta0, power_t and max_iter. I have been able to get within 0.002 of my ridge predictions (validation) and it is faster.
  • Text cleaning
    • RDizzl3: I have created rules for replacing text and even missing brand names that do bring some improvement to my score.
    • Darragh: I didn't do too much on hand built feature engineering, but have got some boost with working on improving the tokenization. Still looking for what the top guys have done :) - nltk - ToktokTokenizer
    • Text norm
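RDizzl3's Ridge tip above can be sketched as follows: `SGDRegressor` with an L2 penalty is a faster stochastic approximation of `Ridge`, and the parameters named in the notes (alpha, eta0, power_t, max_iter) control the regularization strength and learning-rate schedule. The values below are illustrative, not the tuned ones:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Synthetic, standardized data (SGD needs scaled inputs to converge well).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=500)

ridge = Ridge(alpha=1.0).fit(X, y)
sgd = SGDRegressor(penalty="l2", alpha=1e-4, eta0=0.01,
                   power_t=0.25, max_iter=1000, random_state=0).fit(X, y)
```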

Models:



Languages

Language: Python 100.0%