allentran / Newstitles

Script to get all titles found in www.nytimes.com and others

Newstitles

The task is to write a function that produces a list of article titles from The NY Times (www.nytimes.com) and sorts them by weight. A title's weight is the sum of the frequencies of its tokens across all titles, where a token is a word delimited by whitespace. The output is the sorted list together with the weight of each title.

For example, if the titles are:

  • Trump goes to Japan
  • Shooting in Texas
  • Japan and North Korea in talks
  • Trump offends Australian people
  • Judges visit Trump

Let's calculate each title's weight, starting with the first one. Across all titles, the token Trump appears 3 times, goes 1 time, to 1 time, and Japan 2 times. Adding up these token frequencies gives the title's weight: 3 + 1 + 1 + 2 = 7.

Sorted by descending weight, the list would be:

  • Japan and North Korea in talks (8)
  • Trump goes to Japan (7)
  • Trump offends Australian people (6)
  • Judges visit Trump (5)
  • Shooting in Texas (4)
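
The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script's actual code: the function name score_titles is hypothetical, tokens are compared case-sensitively, and no character stripping is applied.

```python
from collections import Counter


def score_titles(titles):
    """Return (title, weight) pairs sorted by descending weight."""
    # Split each title on whitespace; tokens are compared case-sensitively.
    tokenized = [title.split() for title in titles]
    # Count how often each token occurs across all titles.
    frequencies = Counter(token for tokens in tokenized for token in tokens)
    # A title's weight is the sum of the frequencies of its tokens.
    scored = [
        (title, sum(frequencies[token] for token in tokens))
        for title, tokens in zip(titles, tokenized)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


titles = [
    "Trump goes to Japan",
    "Shooting in Texas",
    "Japan and North Korea in talks",
    "Trump offends Australian people",
    "Judges visit Trump",
]

for title, weight in score_titles(titles):
    print("{} ({})".format(title, weight))
```

Run on the five example titles, this prints them in the order and with the weights listed above.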

Implementation details

Requires:

  • Python 2.7.10
  • requests 2.18.4
  • lxml 4.1.1

This script has some settings hardcoded to run the base case: titles from The NY Times. To run it, just do:

$ python newtitles.py

To run it with different settings, modify these two variables:

  • TARGET_URL: URL of the site to analyze (e.g. https://www.nytimes.com)
  • TARGET_SELECTOR: XPath selector of the expected location of each title (e.g. /html/body//article/@title)
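
As a rough illustration of how these two settings might be used with lxml, the sketch below applies an XPath selector to a page source and collects the matching strings. The helper extract_titles and the sample HTML are hypothetical, and the selector is written with an explicit attribute step (/@title) so that it is valid XPath. It assumes the selector matches attribute or text values rather than elements.

```python
import lxml.html

# Example values taken from the settings described above; adjust as needed.
TARGET_URL = "https://www.nytimes.com"
TARGET_SELECTOR = "/html/body//article/@title"


def extract_titles(page_source, selector=TARGET_SELECTOR):
    """Return the stripped, non-empty strings matched by an XPath selector."""
    tree = lxml.html.fromstring(page_source)
    # Assumes the selector yields strings (attribute or text nodes).
    matches = tree.xpath(selector)
    return [match.strip() for match in matches if match.strip()]


# Hypothetical page source standing in for a downloaded page.
sample_page = """
<html><body>
  <article title="Trump goes to Japan"><p>...</p></article>
  <article title=" Shooting in Texas "><p>...</p></article>
</body></html>
"""

sample_titles = extract_titles(sample_page)
print(sample_titles)
```

For a live site, the page source could be fetched with requests, for example requests.get(TARGET_URL).text, and passed to extract_titles in place of the sample string.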

Notes:

  • Titles are assumed to be in English
  • For simplicity, all non-alphanumeric characters are stripped and single-character tokens are ignored
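
One possible reading of the second note is sketched below. It is an assumption about the order of operations, not the script's actual code: the title is split on whitespace first, non-alphanumeric characters are then removed from each token, and tokens left with fewer than two characters are dropped. The name tokenize is hypothetical, and "alphanumeric" is taken to mean ASCII letters and digits, in line with the English-only assumption.

```python
import re

# Matches every character that is not an ASCII letter or digit.
NON_ALPHANUMERIC = re.compile(r"[^A-Za-z0-9]")


def tokenize(title):
    """Split a title into cleaned tokens, dropping those shorter than two characters."""
    tokens = []
    for raw_token in title.split():
        # Strip punctuation and other non-alphanumeric characters.
        cleaned = NON_ALPHANUMERIC.sub("", raw_token)
        # Ignore single characters and tokens that became empty.
        if len(cleaned) > 1:
            tokens.append(cleaned)
    return tokens


example_tokens = tokenize("Trump's visit: a U.S. first!")
print(example_tokens)
```

Letter case is left untouched in this sketch, so Trump and trump would count as different tokens.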
