There are 11 repositories under the article-extractor topic.
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Extract the main article from a given URL with Node.js
SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
Parse a Markdown article, download its images, and replace the image URLs with local paths
Laravel wrapper for common NLP tasks
Extract an article or news item from a URL or HTML, parse the title and content, and output it in Markdown format.
Involution King Fun Book (IKFB, Chinese: 快卷, 卷王快乐本) is an integrated management system for papers and literature. Powered by Electron.
[Spring Boot in Practice] Quickly build your own technical article blog in 10 minutes
This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.
This is a small and easy-to-use desktop application that allows exporting Web of Science API Expanded and InCites API data in Excel/CSV/JSON/XML with a configurable and flexible data export structure.
A web page content extractor
📚 A collection of useful Natural Language Processing tools: detecting the language of a text, splitting text into sentences, extracting the main content from an HTML document
A client-side post search tool for DCInside (디시인사이드).
Extract the main text from HTML, intended for news web pages
This program scrapes the content of articles from the web, given either a single URL or a text file containing a set of URLs. The project uses the newspaper3k and python-docx libraries. The output is a neatly formatted Word document in '.docx' format with the contents of the article.
A Python script to scrape articles from Prothom Alo with the headline, category, URL, and summary
Combines Apify's crawling system with article parsing by the unfluff library.
Simple HTTP API endpoint that takes the URL of any article and returns a JSON object containing information about the article.
Extract articles and blog posts from websites like medium.com, inc42.com, etc. 💯
A modern, Pythonic library to extract data from news pages
A Google Docs HTML Cleaner: This program transforms messy HTML from Google Docs into clean code primarily using LXML with a modular mixin design pattern.
Modern OpenAI GPT-4 Article Summarizer
🔥The bold new archive that can’t be burned, bulldozed or battering-rammed #PoweredByArweave
Scrape Yılmaz Özdil articles and build a Markov model to generate newspaper articles in the style of Yılmaz Özdil. Turkish text dataset creator for data science and NLP projects.
Automatic Extractive Text Summarization using TF-IDF Frequency Analysis. This is a Node.js web application using Express.js on the server side.
toe backend code
Nebula Expired Article Hunter is a marketing tool for retrieving expired content from www.archive.org (a.k.a. the Wayback Machine). You can use this content to grow your blog with evergreen information, improve your marketing campaigns without investing in writing services, or for whatever else you find useful.
Simplify your reading with Summarizer, an open-source article summarizer that transforms lengthy articles into clear and concise summaries
Crawling articles from websites
DCInside (디시인사이드) image crawler