Repositories under the html2text topic:
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
A Python-based HTML to text conversion library, command-line client, and web service.
An ESM-native HTML to Markdown library with useful utilities; simple, lightweight, and high quality.
An extremely configurable Markdown reverser for Python 3.
RxNLP APIs for clustering sentences, extracting topics, counting words & n-grams, extracting text from html or URL, computing similarity between texts and more.
Python library for converting HTML to markup or plain text
inscriptis - HTML to text conversion library for Java
Article title, authors, date and body extraction dataset.
AI chat app that returns responses in Markdown format with text and images. Tutorial from: https://youtu.be/qKtM2AlDTs8
A simple Rust project that scrapes and collects news using RSS and an LLM API.
Go package that cleans an HTML page for better readability.
A Python-based RAG chatbot leveraging GPT-4o and Bright Data's SERP API to deliver contextually rich and up-to-date AI responses using real-time search engine data.
My Python Projects.
This project builds a robust classifier that determines, from its abstract, whether a document belongs to the cancer class.
Receive Packt Publishing Ltd. Free Learning updates in Telegram every day
Microservice for collecting text and images for data science purposes.
A PHP package to convert HTML into a plain text format
A web scraping project that uses Streamlit, BeautifulSoup, and html2text to extract the content of all pages linked from a given URL, convert it to Markdown, and display it. It provides an interactive summary of the visited URLs and shows the extracted content in an easy-to-read format.
A CLI tool that fetches a webpage's main content and prints it as Markdown.
A solution that crawls articles from a news website (The Guardian), cleans the responses, stores them in a hosted MongoDB database (MongoDB Atlas), and makes them searchable via an API.
Converts every .html file in a specified folder into a .txt file, then combines the individual .txt files into one large text file.
Code and data for SORE (ACL 2025), a semantic boilerplate remover.
An automated Python scraper that extracts content from Wikipedia pages and builds a clean dataset from it.