CommNewsExtractor

I. Version and environment:

A. Version:

version_1.5

B. Environment:

python >= 3.7

II. Features:

General-purpose article extraction: body text, title, publish time, author, images, audio/video, contact information, etc.

III. API reference:

    File: AutoExtractors.py (a usage sketch follows this list)
    1. Article title extraction: get_title()
    2. Article author extraction: get_author()
    3. Article publish time extraction: get_public_time()
    4. Email address extraction: get_email()
    5. Mobile phone number extraction: get_phone()
    6. Landline number extraction: get_telephone()
    7. URL extraction: get_url()
    8. IP address extraction from the HTML: get_ip()
    9. ID card number extraction: get_idcard()
    10. File extraction: get_file()
    11. Image extraction: get_image()
    12. Video extraction: get_video()
    13. Article body text extraction: get_content()
    14. General news extraction (all fields): get_all()
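
A minimal usage sketch, assuming each per-field extractor takes the raw HTML string as its only argument; only SupperAutoExtract().get_all(html, tag=...) is confirmed by the project's own example in section IV.B.

```python
# Minimal sketch. Only SupperAutoExtract().get_all(html, tag=0) appears in the
# project's own example (section IV.B); the per-field call signatures below,
# taking the raw HTML string as the only argument, are assumptions.
from extractors.AutoExtractors import *

html = "<html><head><title>Example</title></head><body>...</body></html>"
extractor = SupperAutoExtract()

print(extractor.get_title(html))        # article title (assumed signature)
print(extractor.get_public_time(html))  # publish time (assumed signature)
print(extractor.get_all(html, tag=0))   # tag=0: body text without HTML tags
```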

IV. Usage:

A. Download the source and install the dependencies

git clone https://github.com/kingking888/CommNewsExtractor.git

cd into the CommNewsExtractor directory, to the same level as requirements.txt

pip install -r requirements.txt

B. Your crawler code

```python

import json, urllib3
import requests
from extractors.AutoExtractors import *
from Jsonabstract.Json_abstract import Json_abstract
from tools.automatic_detect import automatic_detect
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


class NewsExtract(object):
    def __init__(self):
        """
        Tested sites
        """
        # http://www.chinanews.com/
        self.url = "http://www.chinanews.com/gn/2020/05-23/9192614.shtml"
        # https://www.hao123.com/
        # self.url = "https://www.hao123.com/mid?from=shoubai&key=9412104569053922697&type=rec"
        # https://news.sina.com.cn/
        # self.url = "https://news.sina.com.cn/c/2019-12-10/doc-iihnzahi6563166.shtml"
        # http://www.sohu.com/
        # self.url = "http://www.sohu.com/a/359537759_114988?_f=index_chan08news_6"
        # https://news.qq.com/
        # self.url = "https://new.qq.com/omn/PEG20191/PEG2019121000975400.html"
        # https://news.163.com/
        # self.url = "https://news.163.com/19/1210/20/F02GT6270001899O.html"
        # http://news.ifeng.com/
        # self.url = "https://news.ifeng.com/c/7sIDPyRWK36"
        # http://www.xinhuanet.com/
        # self.url = "http://www.xinhuanet.com/gangao/2019-12/10/c_1125331339.htm"
        # http://env.people.com.cn
        # self.url = "http://env.people.com.cn/n1/2019/1210/c1010-31498499.html"
        # http://news.cctv.com/
        # self.url = "http://news.cctv.com/2019/12/10/ARTIOqy9gwiHXKlH5NWHCPPo191210.shtml?spm=C94212.P4YnMod9m2uD.ENPMkWvfnaiV.16"
        # https://www.huanqiu.com/
        # self.url = "https://opinion.huanqiu.com/article/9CaKrnKog6X"
        # http://news.baidu.com/
        # self.url = "http://news.cctv.com/2019/12/10/ARTISVPiNSCB6N4xWZjRPkNV191210.shtml"
        # http://www.cankaoxiaoxi.com
        # self.url = "http://www.cankaoxiaoxi.com/china/20191210/2397383.shtml"
        # https://www.thepaper.cn/
        # self.url = "https://www.thepaper.cn/newsDetail_forward_5198726"
        # http://www.takungpao.com/
        # self.url = "http://www.takungpao.com/news/232109/2019/1209/387560.html"
        # http://www.stnn.cc/
        # self.url = "http://news.stnn.cc/hongkong/2019/1210/697580.shtml"
        # Json---url
        # self.url = "http://www.zhwhg.com/api/article/read?id=1099"

        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
        }
        self.key_dict = js_key_dict
        self.Json_abstract = Json_abstract()

    def start_requests(self):
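        # Fetch self.url with the detected encoding and return the HTML text (None on failure).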
        html_encoding = automatic_detect(self.url)
        print(html_encoding)
        res = requests.get(url=self.url, headers=self.headers, timeout=5, verify=False)
        if res.status_code == 200:
            try:
                if "www.chinanews.com" in self.url:
                    res.encoding = "utf-8"
                else:
                    res.encoding = html_encoding
                html_content = res.text
                if html_content:
                    return html_content
            except Exception as e:
                # print(e)
                pass

    def judge_if_json(self, html_content):
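        # Return the parsed object if the response body is valid JSON, otherwise None.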
        if html_content:
            try:
                json_html = json.loads(html_content)
                print("This is Json Html")
                return json_html
            except Exception as e:
                # print(e)
                pass

    def html_extract(self, html_content):
        # tag=0: return the body text without HTML tags; tag=1: keep the HTML tags
        result = SupperAutoExtract().get_all(html_content, tag=0)
        return result

    def run(self):
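        # Fetch the page, then use JSON extraction for API responses and HTML extraction for ordinary pages.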
        html_content = self.start_requests()
        json_html = self.judge_if_json(html_content)
        if json_html:
            result = self.Json_abstract.all_abstract(json_html, key_exp_dict=self.key_dict)
        else:
            result = self.html_extract(html_content)
        print(result)


if __name__ == '__main__':
    ne = NewsExtract()
    ne.run()

```
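
Save the example as a Python 3.7+ script and run it directly; run() prints the extraction result for whichever self.url is left uncommented. With tag=0 the extracted body text is returned without HTML tags, with tag=1 the tags are kept.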

V. Projects referenced and drawn on:

1.https://github.com/kingname/GeneralNewsExtractor

2.https://github.com/striver-ing/distributed-spider

3.https://github.com/codelucas/newspaper

4.https://github.com/mozilla/readability

5.https://github.com/aaronsw/html2text

Sincere thanks to the open-source authors of the projects above. Salute!

VI. Contributions are welcome: improve the project, star it, and open issues

![image](https://github.com/kingking888/CommNewsExtractor/blob/master/WechatIMG124.png)

VII. Disclaimer:

All resources involved are collected from the internet or provided by netizens. They must not be used for any commercial or for-profit purpose; you alone bear all consequences of such use.

The resources in this project are for personal study, exchange, and testing only.

Please delete all content within 24 hours of downloading. Illegal or malicious redistribution is prohibited, and no legal liability is accepted for any harm caused by downloaders or reposters.

Do not use this project's resources for any other purpose; we take no responsibility for the consequences.

If any content in this project harms you or your interests, please message us immediately and it will be removed as soon as possible.

License: MIT License

