There are 3 repositories under the content-extraction topic.
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
Readability2 converts HTML to plain text.
Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
Web content extraction using machine learning
Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Via Text Density Simple Web Crawler With Go
Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!
Mobile First Indexing Tool
This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.
DOM Based Content Extraction via Text Density
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Simple Node.js server to extract relevant content from website source code using Mozilla's Readability.js
FileGazer - deep file analysis and categorisation
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.
Recommending Relevant Sections from a Webpage About Programming Errors and Exceptions
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
A Python content extraction library for the structured extraction of Terms and Conditions from German and English online shops