There are 26 repositories under the robots-txt topic.
advertools - online marketing productivity and analysis tools
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
A simple but powerful web crawler library for .NET
A set of reusable Java components that implement functionality common to any web crawler
Determine whether a page may be crawled, based on robots.txt, robots meta tags, and X-Robots-Tag headers
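The robots.txt half of that check can be done with Python's standard library alone. A minimal sketch, using `urllib.robotparser` with a hypothetical rule set (the user-agent name and URLs are illustrative):

```python
# Check whether a URL may be fetched, using Python's standard-library
# robots.txt parser. The rules below are a hypothetical example.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

In a real crawler you would call `parser.set_url(".../robots.txt")` and `parser.read()` instead of parsing an inline string; meta tags and headers still need a separate HTML/HTTP check.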
Ultimate Website Sitemap Parser
NodeJS robots.txt parser with support for wildcard (*) matching.
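Wildcard support in robots.txt rules is commonly implemented by translating each pattern into a regular expression: `*` matches any character sequence and a trailing `$` anchors the rule to the end of the path. A small Python sketch of that translation (the function name is illustrative, not from any of the libraries above):

```python
# Sketch of robots.txt wildcard matching: '*' -> '.*', trailing '$'
# anchors the pattern to the end of the URL path.
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt rule pattern matches a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/private/*", "/private/data.html"))  # True
print(rule_matches("/*.pdf$", "/docs/report.pdf"))       # True
print(rule_matches("/*.pdf$", "/docs/report.pdfx"))      # False
```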
Gatsby plugin that automatically creates robots.txt for your site
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allow) legitimate user-agents. Useful for all websites.
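A minimal template along those lines might look like the following (an illustrative sketch, not the repository's actual file): everything is disallowed by default, and known-good crawlers are explicitly permitted with an empty `Disallow` rule.

```
# Allow known-good crawlers (an empty Disallow permits everything).
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block everyone else by default.
User-agent: *
Disallow: /
```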
robots.txt generator for Node.js
🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs
Privacy-focused web search engine (not a metasearch engine; it runs its own crawler)
Dark Web: an information-gathering, footprinting, scanning, and recon tool written in Python 3. It needs only a domain or IP address to run, and works on any Linux distribution that supports Python 3. Author: AKASHBLACKHAT (for ethical hackers)
List of useful links, tools and resources
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot uses Retrieval-Augmented Generation and web scraping to return natural-language answers to the user's queries.
Java sitemap generator. This library generates a web sitemap, can ping Google, and can also generate an RSS feed, robots.txt, and more, in a friendly, easy-to-use Java 8 functional style.
An Astro project template for decent projects: auth, i18next, Bootstrap, sitemap, webworker, robots.txt, preact, react, endpoints, endpoint clients, OAuth, various Astro features and data loading preconfigured
Known tags and settings suggested to opt out of having your content used for AI training.
Python-based web crawling script with randomized intervals, user-agent rotation, and proxy-server IP rotation to outsmart websites' bot detection and prevent blocking.
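The core of that evasion pattern is simple to sketch: sleep a random interval between requests and rotate through a pool of User-Agent strings. A minimal Python sketch (URLs and agent strings are illustrative placeholders; proxy rotation and the actual HTTP fetch are left to the caller):

```python
# Sketch: randomized delays between requests plus User-Agent rotation.
import random
import time
from itertools import cycle

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def crawl(urls, min_delay=1.0, max_delay=5.0):
    """Yield (url, headers) pairs, sleeping a random interval between them."""
    agents = cycle(USER_AGENTS)
    for url in urls:
        headers = {"User-Agent": next(agents)}
        yield url, headers
        # Randomized pause so request timing does not look machine-regular.
        time.sleep(random.uniform(min_delay, max_delay))
```

Note that this kind of evasion is exactly what the robots.txt-respecting crawlers elsewhere in this list deliberately avoid; use it only where you have permission to crawl.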
A webpack plugin to generate a robots.txt file
Enumerate old versions of robots.txt paths using Wayback Machine for content discovery
🧑🏻👩🏻 "We are people, not machines": an initiative to identify the creators of a website. A Nuxt module to statically integrate and generate a humans.txt author file containing information about the humans behind the web build, based on the HumansTxt Project.
.NET Core plugin manager: extend web applications using plugin technology, enabling true SOLID and DRY principles when developing applications
A "robots.txt" parsing and querying library for .NET
An extensible robots.txt parser and client library, with full support for every directive and specification.