hylong / cx-extractor

Automatically exported from code.google.com/p/cx-extractor

A time-consuming step

GoogleCodeExporter opened this issue

source = links.matcher(source).replaceAll("");

Example: http://news.itxinwen.com/2013/0802/515691.shtml

This step alone takes 90+ seconds.

Suggestion: could we simply remove all tags with source = source.replaceAll("<[^>]+>", ""); instead?
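
A minimal sketch of the suggested tag removal, for illustration only. The class name `TagStripper`, the constant `ANY_TAG`, and the sample HTML are assumptions and are not part of cx-extractor. The pattern is compiled once because `String.replaceAll` recompiles its regex on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from cx-extractor.
final class TagStripper {
    // Same regex as in the suggestion, compiled once and reused.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Removes every tag-like span and keeps the text between the tags. */
    static String stripTags(String source) {
        return ANY_TAG.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <a href=\"x.html\">world</a>!</p>";
        System.out.println(stripTags(html)); // prints "Hello world!"
    }
}
```

Note that this keeps the anchor text ("world") and drops only the tags themselves, so the result differs from removing whole links.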



Original issue reported on code.google.com by ywq1...@gmail.com on 2 Aug 2013 at 8:01

private static Pattern links = Pattern.compile("<[^>]+>.*?</[aA]>");

Taking cases like <a>contents<a> into account, this pattern works better.

The only drawback is that any passage in the main text that contains hyperlinks will also be removed.
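
The drawback can be reproduced with a small sketch. The pattern is copied verbatim from the comment above. The class name `LinkPatternDemo`, the method `removeLinks`, and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from cx-extractor.
final class LinkPatternDemo {
    // Pattern proposed in the comment above, copied verbatim.
    private static final Pattern LINKS = Pattern.compile("<[^>]+>.*?</[aA]>");

    static String removeLinks(String source) {
        return LINKS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // The linked word "report" is deleted together with its tags.
        System.out.println(removeLinks("Read the <a href=\"r.html\">report</a> today."));

        // The leading <[^>]+> accepts any tag, so here the match starts at <p>
        // and runs to the first </a>, which also deletes "Read the ".
        System.out.println(removeLinks("<p>Read the <a href=\"r.html\">report</a> today.</p>"));
    }
}
```

In the second call only ` today.</p>` is left, so with this pattern the text between any earlier tag and the next `</a>` on the same line is lost as well, not just the linked words.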

Original comment by ywq1...@gmail.com on 2 Aug 2013 at 9:57