karussell / snacktory

Readability clone in Java

Detect publish date

bejean opened this issue · comments

A great feature would be to detect the publication date of the web page.
This information is often located near the top or the bottom of the main text.

Any ideas of 'how'?

Or even better some code :) ?

BTW: at the moment the date is guessed from the URL only
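
For readers wondering what guessing a date from the URL can look like, here is a minimal sketch. It is not snacktory's actual implementation; the class name `UrlDateGuesser`, the regular expression, and the `yyyy/MM/dd` output format are all assumptions made for illustration. It only recognises `/yyyy/mm/` or `/yyyy/mm/dd/` path segments.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of snacktory. */
class UrlDateGuesser {

    // Matches /yyyy/mm or /yyyy/mm/dd when followed by another slash or the end of the URL.
    private static final Pattern URL_DATE = Pattern.compile(
        "/((?:19|20)\\d{2})/(0?[1-9]|1[0-2])(?:/(0?[1-9]|[12]\\d|3[01]))?(?=/|$)");

    /** Returns "yyyy/MM/dd" or "yyyy/MM" if the URL path contains a date, else empty. */
    static Optional<String> guess(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        String date = m.group(1) + "/" + pad(m.group(2));
        if (m.group(3) != null) {
            date += "/" + pad(m.group(3)); // the day segment is optional
        }
        return Optional.of(date);
    }

    // Left-pads single-digit months and days with a zero.
    private static String pad(String s) {
        return s.length() == 1 ? "0" + s : s;
    }
}
```

Calling `UrlDateGuesser.guess(url)` on a URL without such a path returns an empty `Optional`, which is exactly the case where looking inside the page text becomes necessary.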

Hi, I tested this and it is a good first step.
I hadn't really thought about how to do this. Maybe create an array of regular expressions and apply them to the extracted text.
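
The regex-array idea could be sketched roughly as follows. This is only an illustration: `PublishDateGuesser`, its three patterns, and the `window` parameter are hypothetical and not part of snacktory. Since the date tends to sit near the top or the bottom of the main text, the sketch scans only the first and last `window` characters, and returns the raw matched string without parsing it.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of snacktory. */
class PublishDateGuesser {

    private static final String MONTH =
        "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\.?";

    // Candidate patterns, tried in order. Extend the list for other locales and formats.
    private static final List<Pattern> DATE_PATTERNS = List.of(
        // Numeric, year first: 2012-03-15 or 2012/03/15
        Pattern.compile("\\b\\d{4}[-/]\\d{1,2}[-/]\\d{1,2}\\b"),
        // English, month first: March 15, 2012 or Mar. 15 2012
        Pattern.compile("\\b" + MONTH + " \\d{1,2},? \\d{4}\\b"),
        // English, day first: 15 March 2012
        Pattern.compile("\\b\\d{1,2} " + MONTH + " \\d{4}\\b"));

    /** Returns the first date-like string found in the head or tail of the text. */
    static Optional<String> guess(String text, int window) {
        String head = text.substring(0, Math.min(window, text.length()));
        String tail = text.substring(Math.max(0, text.length() - window));
        for (String region : List.of(head, tail)) {
            for (Pattern p : DATE_PATTERNS) {
                Matcher m = p.matcher(region);
                if (m.find()) {
                    return Optional.of(m.group());
                }
            }
        }
        return Optional.empty();
    }
}
```

Something like `PublishDateGuesser.guess(res.getText(), 300)` would be the call site. Expect false positives (any date mentioned early in the article will match), so a URL-derived date, when available, is probably the safer first choice.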

Anyway, as of today it is not possible to get the date directly from an ArticleTextExtractor object; the only way is to use the SHelper class:

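ArticleTextExtractor extractor = new ArticleTextExtractor();
// rawData holds the page's HTML; url is the address it was fetched from
JResult res = extractor.extractContent(rawData);
String text = res.getText();
String title = res.getTitle();
// The date does not come from the extraction result; it is estimated from the URL
String date = SHelper.completeDate(SHelper.estimateDate(url));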