sanshibayuan / Sohu-2018-4th-place-solution

2018 Sohu Content Recognition Algorithm Competition: Solution (4th place)

Sohu-2018-4th-place-solution

2018 Sohu Content Recognition Algorithm Competition

Overview

Preprocessing

  • Html filter
  • Segmentation
  • Extra-features
  • Data Augmentation
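
As an illustration of the HTML filter step, here is a minimal sketch using only the Python standard library. The helper name `strip_html` and the choice to drop `<script>` and `<style>` contents are assumptions made for this example; the repository's actual filter may work differently.

```python
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def strip_html(raw):
    """Hypothetical helper: return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", parser.text()).strip()
```

The cleaned string would then be passed to the segmentation step.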

Task 1: Label Classification

EDA

  • Word_tfidf
  • Char_tfidf
  • Word2vec
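
To make the word-level and character-level TF-IDF features concrete, the following is a small pure-Python sketch. The smoothed IDF formula, the L2 normalization, and the function names `char_ngrams` and `tfidf` are assumptions chosen for illustration, not details taken from the repository.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split a string into overlapping character n-grams (spaces removed)."""
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_tokens):
    """Return one L2-normalized {token: weight} dict per tokenized document."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # document frequency: count each token once per doc
    # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else {})
    return vectors
```

Word-level features would pass segmented words to `tfidf`, while character-level features would pass the output of `char_ngrams`.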

Models

  • NBSVM
  • LGBM
  • TextCNN
  • RCNN
  • Bi-LSTM
  • Bi-GRU
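
Of the models listed, NBSVM is the least self-explanatory, so here is a sketch of its core ingredient: the Naive Bayes log-count ratio that rescales features before a linear classifier is fitted. This covers only the ratio for a binary problem with binarized token counts; the smoothing value `alpha` and the function name `log_count_ratio` are assumptions, and the linear classifier itself is omitted.

```python
import math
from collections import Counter


def log_count_ratio(docs_tokens, labels, alpha=1.0):
    """Return {token: r} where r > 0 means the token favors the positive class."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    pos = Counter()
    neg = Counter()
    for tokens, y in zip(docs_tokens, labels):
        # Binarized features: each token counts at most once per document.
        (pos if y == 1 else neg).update(set(tokens))
    # Smoothed, L1-normalized class count vectors.
    pos_total = sum(pos[t] + alpha for t in vocab)
    neg_total = sum(neg[t] + alpha for t in vocab)
    return {
        t: math.log(((pos[t] + alpha) / pos_total) / ((neg[t] + alpha) / neg_total))
        for t in vocab
    }
```

In an NBSVM-style model, each document's feature vector is multiplied elementwise by these ratios before training an SVM or logistic regression.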

Ensemble

  • Word2vec dimensions
  • Embedding layer
  • Two-stage classification: {0, 1} vs. 2, then 0 vs. 1
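
One way to read the ensemble items above is: average the predictions of models that differ in Word2vec dimension or embedding layer, then decide the label in two binary stages. The sketch below follows that reading. Both the interpretation and the names `average_probs` and `two_stage_predict` are assumptions, as is the 0.5 threshold.

```python
def average_probs(prob_lists):
    """Average per-sample probabilities across several models."""
    n_models = len(prob_lists)
    return [sum(col) / n_models for col in zip(*prob_lists)]


def two_stage_predict(p_is_2, p_is_1_given_not_2, threshold=0.5):
    """Stage 1 separates class 2 from {0, 1}; stage 2 separates 1 from 0."""
    labels = []
    for p2, p1 in zip(p_is_2, p_is_1_given_not_2):
        if p2 >= threshold:
            labels.append(2)
        else:
            labels.append(1 if p1 >= threshold else 0)
    return labels
```

For example, the stage-1 input could be `average_probs` applied to the class-2 probabilities of several TextCNN and Bi-GRU variants.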

Task 2: Text Extraction

  • Keywords
  • Extract text
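
A keyword-driven extraction step could look roughly like the sketch below: split the article into sentence-like fragments and keep those containing any keyword. The keyword list, the punctuation set used for splitting, and the function name `extract_fragments` are all hypothetical; the outline does not say how the keywords were chosen.

```python
import re


def extract_fragments(text, keywords):
    """Return the fragments of `text` that contain at least one keyword."""
    # Split on common Chinese and Western sentence-ending punctuation.
    fragments = [f.strip() for f in re.split(r"[。！？!?；;\n]+", text) if f.strip()]
    return [f for f in fragments if any(k in f for k in keywords)]
```

For instance, calling it with keywords such as "微信" and "优惠" returns only the fragments that mention them.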

Task 3: Image Classification

  • Text Recognition
  • Text Classification
  • Area Filtering (CTPN)
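
For the area filtering item, a plausible sketch is to discard detected text boxes that cover too small a share of the image. The box format `(x1, y1, x2, y2)`, the `min_ratio` threshold, and the function name `filter_boxes_by_area` are assumptions; the boxes are presumed to come from a text detector such as CTPN, which is not invoked here.

```python
def filter_boxes_by_area(boxes, image_size, min_ratio=0.01):
    """Keep boxes whose area is at least `min_ratio` of the image area."""
    width, height = image_size
    image_area = width * height
    kept = []
    for x1, y1, x2, y2 in boxes:
        # Degenerate boxes get zero area and are dropped when min_ratio > 0.
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area / image_area >= min_ratio:
            kept.append((x1, y1, x2, y2))
    return kept
```

The surviving boxes would then go through text recognition and the text classifier.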

See more details on my blog: https://sanshibayuan.github.io/

Languages

Python 97.9%, CUDA 2.0%, C++ 0.1%, Shell 0.1%