Project for my Communication and Culture class that analyzes how articles, and their relationship to their headlines, have changed over time.
I used the nytimesarticle package, which wraps the NYT Article Search API, to scrape articles from 1980 to 2015. To analyze the relevance between headlines and their articles, I used the Natural Language Toolkit's part-of-speech tagger.
To measure relevance, I stored each article's headline with its 'lead paragraph', falling back to its 'abstract' if the 'lead paragraph' wasn't available, or to its 'snippet' if the 'abstract' wasn't available either. I then picked out the proper nouns and noun phrases from the headline to find what the headline implied the topics of the article were. Finally, I counted the occurrences of each topic in the article text and divided that count by the number of topics multiplied by an importance value for that kind of topic.
Some excerpts from the paper:
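The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `pick_article_text` is hypothetical, and the field names `lead_paragraph`, `abstract`, and `snippet` are assumed to be the keys of each article record returned by the API.

```python
def pick_article_text(doc):
    """Return the first non-empty text among lead paragraph, abstract, snippet.

    `doc` is assumed to be a dict for one article; the key names are an
    assumption about the response layout, not taken from the project code.
    """
    for field in ("lead_paragraph", "abstract", "snippet"):
        text = doc.get(field)
        if text and text.strip():
            return text.strip()
    return None  # nothing usable, so the article would be skipped


# Made-up record: the lead paragraph and abstract are missing or empty.
example_doc = {"lead_paragraph": None, "abstract": "", "snippet": "A short snippet."}
print(pick_article_text(example_doc))
```

Returning `None` when all three fields are empty is one possible design choice; it lets the caller decide whether to drop the article or retry.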
We conducted research unique to the New York Times by collecting the word counts of a thousand articles from each year from 1980 to the present, using the New York Times Article Search API. We calculated an average word count for each year. We expected to see a steady decrease in word count, which would have supported our hypothesis that journalists are adapting to the shorter attention spans of readers. However, we did not see a decreasing trend; instead, the yearly averages vary widely, from 422 words in 2008 to 665 words in 1988, with an overall average of 585 across all years.
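For reference, the per-year averaging could look something like the sketch below. It assumes the articles have already been reduced to (year, word count) pairs; the function name and the sample values are made up for illustration and are not the study's data.

```python
from collections import defaultdict


def average_word_count_by_year(records):
    """Map each year to the mean word count of its articles.

    `records` is an iterable of (year, word_count) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of counts, number of articles]
    for year, word_count in records:
        totals[year][0] += word_count
        totals[year][1] += 1
    return {year: total / n for year, (total, n) in sorted(totals.items())}


# Toy values only, not real measurements.
sample = [(1980, 500), (1980, 700), (1981, 400)]
print(average_word_count_by_year(sample))
```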
...
As the second part of our New York Times research, we gathered headlines and lead paragraphs for each of a thousand random articles from 1980 to 2015. We wanted to calculate a relevance score between articles and their headlines, for which we designed our own formula. For each headline, a list of topics was extracted by assuming the topics would be the proper nouns and noun phrases (classified by the Natural Language Toolkit’s part-of-speech tagger) in the headline. For example, the topic extractor would find “Turkey”, “Stab in the Back”, “Russian Social Networks”, and “Airwaves” to be the topics of the headline “Anger at Turkey’s ‘Stab in the Back’ Fills Russian Social Networks and Airwaves”. After the topics are extracted, the algorithm counts the occurrences of each topic in the lead paragraph of the article, and divides each topic’s occurrences by the number of topics multiplied by its importance of presence score (1.0 for proper nouns and 0.5 for noun phrases - because noun phrases are more likely to be gathered incorrectly, a noun phrase found fewer times in the lead paragraph incurs a smaller “punishment”). The per-topic relevances are then averaged over the number of topics to give the total relevance score of the headline.
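One way to read the formula above is sketched below. This is an interpretation under stated assumptions rather than the original implementation: topics are assumed to be already extracted and labeled as `proper_noun` or `noun_phrase`, occurrences are counted as case-insensitive whole-word matches, and the function names, the kind labels given to the example topics, and the example lead paragraph are all hypothetical.

```python
import re

# Importance-of-presence scores quoted in the excerpt.
IMPORTANCE = {"proper_noun": 1.0, "noun_phrase": 0.5}


def count_occurrences(topic, text):
    """Count case-insensitive, whole-word occurrences of `topic` in `text`."""
    pattern = r"\b" + re.escape(topic) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def relevance_score(topics, lead_paragraph):
    """Average the per-topic relevances for one headline.

    `topics` is a list of (topic_text, kind) pairs, where kind is a key of
    IMPORTANCE. Each topic's relevance is count / (number_of_topics * importance).
    """
    if not topics:
        return 0.0  # assumption: a headline with no topics scores zero
    n = len(topics)
    per_topic = [
        count_occurrences(topic, lead_paragraph) / (n * IMPORTANCE[kind])
        for topic, kind in topics
    ]
    return sum(per_topic) / n


# Made-up lead paragraph and assumed kind labels, for illustration only.
topics = [("Turkey", "proper_noun"), ("Airwaves", "noun_phrase")]
lead = "Turkey shot down a jet. Turkey's move filled the airwaves."
print(relevance_score(topics, lead))
```

Under this reading, dividing by the smaller 0.5 score gives each noun-phrase match more weight, which is how a noun phrase that appears less often is "punished" less.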
As the graph shows, relevance generally decreases between 2001 and 2015, with peaks in the early 1980s and in 1991. (It is important to note that NYTimes.com, the New York Times’ digital branch, was first proposed in 1995 and launched soon after, and the Wayback Machine Internet Archive dates the first screenshot of the website to November 12, 1996.)
In 2001, the company’s revenue declined after the attack on September 11th, as did the success of the newspaper business in general, because of the poor state of the economy and because companies lacked the incentive to spend on newspaper advertising. It marked the beginning of a decline in newspaper circulation and revenue - specifically revenue gained from advertising. While print ad revenue still makes up the majority of newspaper revenue, one should note its stagnation and decline since 2003, alongside a steady increase in digital ad revenue and an exponential increase in mobile ad revenue.
With this data, we can theorize that the decline in headline relevance reflects an effort to attract more website visitors with attention-grabbing headlines, rather than headlines that are relevant to the actual content of the article. This new style of headline writing would then produce more digital and mobile ad revenue, in an effort to make up for a continuing decrease in print ad revenue.