There are 2 repositories under the web-extraction topic.
Fully automated, hands-free, and accurate extraction and understanding of web content, powered by machine learning agents.
A distributed topical web crawler based on Scala and Akka
Automatic extraction of local event information from a webpage using machine learning
Serverless AI browser agent
Predicting a product recommendation score from the data available on the client's website
A powerful and lightweight web scraping library with LLM extraction capabilities. This library combines web scraping with AI-powered content extraction using either OpenAI or OpenRouter APIs.
Programming assignments for Web Information Extraction and Retrieval, FRI UL, 2021. PA1: a standalone web crawler for .gov.si websites; PA2: approaches to structured web data extraction; PA3: data processing, indexing, and retrieval.
This project is a command-line tool that extracts text from web pages and PDF files, including scanned documents. It supports various extraction methods. This tool is ideal for data scraping, NLP preprocessing, and content analysis.