There are 26 repositories under the robots-txt topic.
advertools - online marketing productivity and analysis tools
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
A simple but powerful web crawler library for .NET
A set of reusable Java components that implement functionality common to any web crawler
Determine whether a page may be crawled, based on robots.txt, robots meta tags, and X-Robots-Tag headers
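The robots.txt half of that check can be done with Python's standard library alone. A minimal sketch, using `urllib.robotparser` with a hypothetical rule set (the user-agent name and URLs are illustrative):

```python
# Check whether a URL may be fetched, using Python's standard-library
# robots.txt parser. The rules below are a hypothetical example.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

In a real crawler you would call `parser.set_url(".../robots.txt")` and `parser.read()` instead of parsing an inline string; meta tags and headers still need a separate HTML/HTTP check.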
Ultimate Website Sitemap Parser
NodeJS robots.txt parser with support for wildcard (*) matching.
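Wildcard support in robots.txt rules is commonly implemented by translating each pattern into a regular expression: `*` matches any character sequence and a trailing `$` anchors the rule to the end of the path. A small Python sketch of that translation (the function name is illustrative, not from any of the libraries above):

```python
# Sketch of robots.txt wildcard matching: '*' -> '.*', trailing '$'
# anchors the pattern to the end of the URL path.
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt rule pattern matches a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/private/*", "/private/data.html"))  # True
print(rule_matches("/*.pdf$", "/docs/report.pdf"))       # True
print(rule_matches("/*.pdf$", "/docs/report.pdfx"))      # False
```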
Gatsby plugin that automatically creates robots.txt for your site
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allow) legitimate user-agents. Useful for all websites.
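A minimal template along those lines might look like the following (an illustrative sketch, not the repository's actual file): everything is disallowed by default, and known-good crawlers are explicitly permitted with an empty `Disallow` rule.

```
# Allow known-good crawlers (an empty Disallow permits everything).
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block everyone else by default.
User-agent: *
Disallow: /
```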
robots.txt generator for Node.js
🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs
Privacy-focused web search engine (not a metasearch engine; it runs its own crawler)
Dark Web: an information-gathering, footprinting, scanning, and recon tool written in Python 3. It needs only a domain or IP address to run, and works on any Linux distribution that supports Python 3. Author: AKASHBLACKHAT (for ethical hackers)
List of useful links, tools and resources
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot uses Retrieval-Augmented Generation and web scraping to return natural-language answers to the user's queries.
Java sitemap generator. This library generates a web sitemap, can ping Google, and can also generate an RSS feed, robots.txt, and more, in a friendly, easy-to-use Java 8 functional style.
An Astro project template for decent projects: auth, i18next, Bootstrap, sitemap, webworker, robots.txt, preact, react, endpoints, endpoint clients, OAuth, various Astro features and data loading preconfigured
Known tags and settings suggested to opt out of having your content used for AI training.
Python-based web crawling script with randomized intervals, user-agent rotation, and proxy-server IP rotation to outsmart websites' bot detection and prevent blocking.
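The core of that evasion pattern is simple to sketch: sleep a random interval between requests and rotate through a pool of User-Agent strings. A minimal Python sketch (URLs and agent strings are illustrative placeholders; proxy rotation and the actual HTTP fetch are left to the caller):

```python
# Sketch: randomized delays between requests plus User-Agent rotation.
import random
import time
from itertools import cycle

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def crawl(urls, min_delay=1.0, max_delay=5.0):
    """Yield (url, headers) pairs, sleeping a random interval between them."""
    agents = cycle(USER_AGENTS)
    for url in urls:
        headers = {"User-Agent": next(agents)}
        yield url, headers
        # Randomized pause so request timing does not look machine-regular.
        time.sleep(random.uniform(min_delay, max_delay))
```

Note that this kind of evasion is exactly what the robots.txt-respecting crawlers elsewhere in this list deliberately avoid; use it only where you have permission to crawl.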
A webpack plugin to generate a robots.txt file
Enumerate old versions of robots.txt paths using Wayback Machine for content discovery
🧑🏻👩🏻 "We are people, not machines": an initiative to identify the creators of a website. A Nuxt module to statically integrate and generate a humans.txt author file containing information about the humans behind the web build, based on the HumansTxt Project.
.NET Core plugin manager: extend web applications using plugin technology, enabling true SOLID and DRY principles when developing applications
A "robots.txt" parsing and querying library for .NET
An extensible robots.txt parser and client library, with full support for every directive and specification.