
CrawlGPT

⚡ Fully automated web crawler. Crawl any information you want on the Internet with GPT-3.5. Built with 🦜️🔗LangChain👍👍⚡

Simple Demo

demo.mp4

What can it do?

  • Fully automated web crawler that simulates, as closely as possible, how a human searches for data.
  • Automatically collects every specified detail about a given theme across the entire internet or a given web domain.
  • Automatically searches the internet for answers to fill in missing details while crawling.
  • ✍️👇A simple example👇✍️
    • Input:
      • the theme you want to crawl: Cases of mergers and acquisitions of fast food industry enterprises in America after 2010
      • 0-th specific detail: When the merger occurred
      • 1-th specific detail: Acquirer
      • 2-th specific detail: Acquired party
      • 3-th specific detail: The CEO of acquirer
      • 4-th specific detail: The CEO of acquired party
      • (Optional) Limited web domain: ["nytimes.com", "cnn.com"]
    • Output: JSON containing all specified details about the theme. The output format is:
      {
          "events_num": N,
          "details":  ### The length of this list is N.
          [ 
              {
                  "When the merger occurred": <answer>,
                  "Acquirer": <answer>,
                  "Acquired party": <answer>,
                  "The CEO of acquirer": <answer>,
                  "The CEO of acquired party": <answer>,
                  "source_url": <url>
              },
              {
                  "When the merger occurred": <answer>,
                  "Acquirer": <answer>,
                  "Acquired party": <answer>,
                  "The CEO of acquirer": <answer>,
                  "The CEO of acquired party": <answer>,
                  "source_url": <url>
              },
              ..............
          ]
      }
      

Why does a web crawler need GPT?

  • GPT can extract the necessary information by directly understanding each webpage's content, instead of relying on hand-written crawling rules (see the sketch below).
  • GPT can connect to the internet to verify the accuracy of crawl results or to supplement missing information.
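
As a rough illustration of the first point, a single prompt replaces page-specific parsing rules. Below is a minimal sketch, assuming the pre-1.0 openai Python package; the extract_details helper is a hypothetical name for illustration, and the actual prompts in pipeline.py may differ.

    import json
    import openai  # assumes the pre-1.0 openai package interface

    def extract_details(page_text, theme, details):
        """Ask GPT-3.5 to pull the requested details out of raw page text."""
        prompt = (
            f"Theme: {theme}\n"
            f"From the webpage text below, extract these details: {details}.\n"
            "Reply with a JSON object mapping each detail to its value, "
            "or null if the page does not mention it.\n\n"
            f"Webpage text:\n{page_text[:3000]}"  # truncate to limit token cost
        )
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return json.loads(resp["choices"][0]["message"]["content"])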

How does it work?

  1. Generate suitable Google search queries from the theme with GPT-3.5.
  2. Simulate a Google search across the entire Internet, or within the given web domains (if any), using each query.
  3. Browse every returned website.
  4. Extract the specified details of the theme from each website's content with GPT-3.5.
  5. Similar to Auto-GPT, independently search the Internet for missing details, based on the LangChain implementation of MRKL and ReAct.
  6. Encapsulate all results into a JSON file (a sketch of the loop follows this list).
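
For concreteness, here is a minimal sketch of steps 1-4 and 6, assuming the Serper search API and the extract_details helper sketched above. generate_queries and fetch_page are simplified stand-ins for the GPT-based query generation and page browsing in pipeline.py, and the step-5 agent is omitted.

    import os
    import requests

    def generate_queries(theme, n=2):
        # Step 1 uses GPT-3.5 in the real pipeline; stubbed here for brevity.
        return [theme, f"{theme} news"][:n]

    def google_search(query, num=4):
        # Step 2: one Google search via the Serper API, returning result URLs.
        resp = requests.post(
            "https://google.serper.dev/search",
            headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
            json={"q": query, "num": num},
            timeout=30,
        )
        return [item["link"] for item in resp.json().get("organic", [])]

    def fetch_page(url):
        # Step 3: fetch the raw HTML; the real pipeline extracts readable text.
        return requests.get(url, timeout=30).text

    def run_pipeline(theme, detail_list):
        events = []
        for query in generate_queries(theme):
            for url in google_search(query):
                record = extract_details(fetch_page(url), theme, detail_list)  # step 4
                if record:
                    record["source_url"] = url
                    events.append(record)
        return {"events_num": len(events), "details": events}  # step 6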

Quick Install

  • OPENAI_API_KEY: You must have an OpenAI API key and set os.environ["OPENAI_API_KEY"] in pipeline.py.
  • SERPER_API_KEY: To search for accurate, real-time information, you need a Google Serper API key. Registration takes only a short time and gives you 1,000 free queries per month. Set os.environ["SERPER_API_KEY"] in pipeline.py.
  • Hyperparameters (see the configuration sketch after this list):
    • QUERY_NUM: The number of Google searches, each with a different query. Default: 2.
    • QUERY_RESULTS_NUM: The number of results returned per search. Default: 4.
    • THEME: The theme of the crawl.
    • DETAIL_LIST: The specific details to collect for the theme.
    • (Optional) URL_DOMAIN_LIST: The valid web domains or URL prefixes.
  • Install Python 3.11.
  • Install the necessary dependencies: pip install -r requirements.txt
  • Run it: python pipeline.py > output.txt
  • Read the results from final_dict.json.
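
For reference, a configured pipeline.py might start roughly like this, using the demo theme from above (all key values are placeholders, and the exact variable layout in pipeline.py may differ):

    import os

    os.environ["OPENAI_API_KEY"] = "sk-..."           # your OpenAI API key
    os.environ["SERPER_API_KEY"] = "your-serper-key"  # your Google Serper API key

    QUERY_NUM = 2          # number of Google searches with different queries
    QUERY_RESULTS_NUM = 4  # results returned per search
    THEME = ("Cases of mergers and acquisitions of fast food industry "
             "enterprises in America after 2010")
    DETAIL_LIST = [
        "When the merger occurred",
        "Acquirer",
        "Acquired party",
        "The CEO of acquirer",
        "The CEO of acquired party",
    ]
    URL_DOMAIN_LIST = ["nytimes.com", "cnn.com"]  # optional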

TODO

  • Support crawling within a given list of web domains.
  • The LangChain implementation of MRKL and ReAct carries a risk of divergent output; that is, the response content may exceed our constraints.
  • Automatically write research reports based on crawl results.
  • GPT consumes a huge number of tokens while browsing webpages 😢. Reduce this consumption.
  • Browse PDF files linked from websites.
  • Make the entire pipeline registration-free (except for OpenAI).

Contact me and communication

I am currently working as an AI engineer at Alibaba in Beijing, China. I believe communication can eliminate information gaps.

I am interested in LLM applications such as conversational search, AI agents, and external data enhancement. Feel free to reach out via email (hanxyz1818@gmail.com) or WeChat (if you have one).

I am also preparing to build an actual product. My recent (still immature) ideas involve LLM+crypto and LLM+code. If you are interested in these, or have other ideas of your own, you are also welcome to contact me.

About

License: MIT

