zzpmiracle / news_track


DataSet

Initial structure

595,037 documents
article : id | url | title | kicker | author | published_date | contents | type | source
- contents
  - (None)
  - kicker : a section header indicating the publication category; the article is irrelevant if the kicker is one of "Opinion", "Letters to the Editor", "The Post's View"
  - title
  - image : image URL and full caption
  - byline : by + author(s)
  - paragraph : plain (text) | html (with html style < …… >)
  - (author_info)
- type : 'article' / 'blog'
- source : 'The Washington Post'

Data cleaning

remove irrelevant articles according to kicker
remove ['type'], ['source'] from article
remove [byline], [title], [author_info], [image] from contents if they exist
remove empty items from contents
remove HTML code from contents
group contents into a single article (plain text)
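
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes each item in `contents` is a dict with `type` and `content` keys (or `None`), and the names `clean_article`, `find_kicker`, and the constants are hypothetical. It also assumes the kicker is kept as its own field rather than merged into the body text.

```python
import re
from html import unescape

# Kickers treated as irrelevant, as listed in the dataset notes.
IRRELEVANT_KICKERS = {"Opinion", "Letters to the Editor", "The Post's View"}
# Content item types dropped from the body text (kicker is an assumption here).
DROPPED_TYPES = {"byline", "title", "author_info", "image", "kicker"}
# Simple tag stripper; a real HTML parser may be more robust.
TAG_RE = re.compile(r"<[^>]+>")


def find_kicker(contents):
    """Return the kicker text from the contents list, or None if absent."""
    for item in contents:
        if item and item.get("type") == "kicker":
            return item.get("content")
    return None


def clean_article(article):
    """Return a cleaned copy of the article, or None if it is irrelevant."""
    contents = article.get("contents") or []
    kicker = find_kicker(contents)
    if kicker in IRRELEVANT_KICKERS:
        return None

    paragraphs = []
    for item in contents:
        # Skip empty (None) items and the dropped item types.
        if not item or item.get("type") in DROPPED_TYPES:
            continue
        text = item.get("content")
        if not isinstance(text, str):
            continue
        # Strip HTML tags, decode entities, and drop empty results.
        text = unescape(TAG_RE.sub("", text)).strip()
        if text:
            paragraphs.append(text)

    # 'type' and 'source' are intentionally not copied over.
    return {
        "id": article.get("id"),
        "url": article.get("url"),
        "title": article.get("title"),
        "kicker": kicker,
        "author": article.get("author"),
        "date": article.get("published_date"),
        "contents": " ".join(paragraphs),
    }
```

Returning `None` for irrelevant articles lets the caller filter the whole collection with a single pass.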

After cleaning

571,963 documents remain
article : id | url | title | kicker | author | date | contents(long string)

ElasticSearch

Steps

insert : id + other parts
topics : parsed with BeautifulSoup --> id + num
source article : looked up by id in ES
search : title and 10 keywords (Rake), each as a separate query
sort : score normalization --> weighted (from Rake) sum
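
The sort step could look roughly like the sketch below. The notes do not say which normalization is used, so this assumes max-normalization per query, and it assumes each query's hits are already available as a `{doc_id: score}` dict. The function names `normalize` and `fuse` and the weight given to the title query are hypothetical.

```python
from collections import defaultdict


def normalize(scores):
    """Max-normalize one query's scores so the top hit scores 1.0."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: s / top for doc_id, s in scores.items()}


def fuse(results, weights):
    """Combine per-query results into one ranking.

    results: {query: {doc_id: raw_score}}
    weights: {query: weight}, e.g. the Rake score of each keyword
    """
    total = sum(weights.values()) or 1.0
    fused = defaultdict(float)
    for query, scores in results.items():
        # Queries without a weight contribute nothing.
        w = weights.get(query, 0.0) / total
        for doc_id, s in normalize(scores).items():
            fused[doc_id] += w * s
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing before summing keeps a single query with unusually large raw scores from dominating the final ranking.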

Result

retrieved results are nearly irrelevant

BERT

Dataset
- relevant : 2018 relevance judgments, labels : 0-16
- irrelevant : 10,000 random samples added, labels : -1
Processor
- id --> text
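
One way the dataset and processor might be assembled is sketched below. It assumes the judgments come in the usual four-column TREC qrels layout (`topic iteration doc_id label`) and that a mapping from ids to cleaned article text is available. All function and parameter names (`parse_qrels`, `add_negatives`, `to_text_pairs`, `topic_doc`, `id_to_text`) are hypothetical, and the sampling strategy is an assumption.

```python
import random


def parse_qrels(lines):
    """Parse 'topic iteration doc_id label' lines into (topic, doc_id, label)."""
    examples = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        topic, _, doc_id, label = parts
        examples.append((topic, doc_id, int(label)))
    return examples


def add_negatives(examples, all_doc_ids, n_neg=10000, seed=0):
    """Append up to n_neg random unjudged (topic, doc_id) pairs with label -1."""
    topics = sorted({t for t, _, _ in examples})
    pool = list(all_doc_ids)
    if not topics or not pool:
        return list(examples)
    rng = random.Random(seed)
    seen = {(t, d) for t, d, _ in examples}
    negatives = []
    attempts = 0
    # Bounded loop so a small pool cannot cause an infinite search.
    while len(negatives) < n_neg and attempts < 20 * n_neg:
        attempts += 1
        pair = (rng.choice(topics), rng.choice(pool))
        if pair not in seen:
            seen.add(pair)
            negatives.append((pair[0], pair[1], -1))
    return list(examples) + negatives


def to_text_pairs(examples, topic_doc, id_to_text):
    """Processor step: map ids to text, giving (source_text, candidate_text, label)."""
    pairs = []
    for topic, doc_id, label in examples:
        src = id_to_text.get(topic_doc.get(topic))
        cand = id_to_text.get(doc_id)
        if src and cand:
            pairs.append((src, cand, label))
    return pairs
```

Fixing the random seed keeps the sampled negatives reproducible between runs.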
