robotstxt
robotstxt is a Go implementation of the robots.txt exclusion protocol. It ships with a utility, robots.txt-check, for checking robots.txt compliance.
Description
The robotstxt package provides an implementation of the robots.txt exclusion protocol in Go (golang). This protocol is used by websites to instruct web crawlers and other bots on which parts of the site they may or may not access. The package is available in two forms: a development package containing the Go library, and a runtime package containing the compiled binary.
The primary binary, robots.txt-check, allows users to verify how a specific bot would be treated by a site's robots.txt file. This is useful for web developers, SEO specialists, and security researchers to understand crawler permissions and potential information disclosure issues.
Use cases include testing crawler access restrictions, auditing website configurations for unintended exposures, and ensuring compliance with robots.txt directives during web reconnaissance or penetration testing.
How It Works
The tool parses robots.txt files according to the exclusion protocol specification, interpreting User-agent, Allow, and Disallow directives to determine what a given bot is permitted to access. The robots.txt-check command fetches a remote robots.txt file and evaluates its rules as if the specified bot had requested access.
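The matching logic described above can be sketched in Go. This is a simplified illustration, not the package's actual implementation: it handles only User-agent grouping, Allow/Disallow prefix rules, longest-match precedence, and the default-allow behavior, and ignores wildcards, Crawl-delay, and Sitemap directives.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a single Allow or Disallow directive.
type rule struct {
	allow bool
	path  string
}

// parse collects the rules that apply to bot, falling back to the
// "*" group when no group names the bot explicitly.
func parse(robotsTxt, bot string) []rule {
	var specific, wildcard []rule
	var agents []string
	inGroup := false
	for _, line := range strings.Split(robotsTxt, "\n") {
		// Strip comments and surrounding whitespace.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		field := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		switch field {
		case "user-agent":
			if inGroup { // a User-agent line after rules starts a new group
				agents = nil
				inGroup = false
			}
			agents = append(agents, strings.ToLower(value))
		case "allow", "disallow":
			inGroup = true
			r := rule{allow: field == "allow", path: value}
			for _, a := range agents {
				if a == strings.ToLower(bot) {
					specific = append(specific, r)
				} else if a == "*" {
					wildcard = append(wildcard, r)
				}
			}
		}
	}
	if len(specific) > 0 {
		return specific
	}
	return wildcard
}

// allowed reports whether bot may fetch path: the longest matching
// rule wins, and no match at all means the path is allowed.
func allowed(robotsTxt, bot, path string) bool {
	best := rule{allow: true}
	for _, r := range parse(robotsTxt, bot) {
		if r.path != "" && strings.HasPrefix(path, r.path) && len(r.path) > len(best.path) {
			best = r
		}
	}
	return best.allow
}

func main() {
	robots := `User-agent: *
Disallow: /private/

User-agent: GoogleBot
Allow: /private/reports/
Disallow: /private/`
	// GoogleBot's specific group wins; the longer Allow overrides Disallow.
	fmt.Println(allowed(robots, "GoogleBot", "/private/reports/q1")) // true
	fmt.Println(allowed(robots, "GoogleBot", "/private/keys"))       // false
	// BingBot has no group of its own, so the "*" rules apply.
	fmt.Println(allowed(robots, "BingBot", "/private/keys")) // false
	fmt.Println(allowed(robots, "BingBot", "/index.html"))   // true
}
```

Note the precedence choice: when both an Allow and a Disallow rule match a path, the rule with the longer path prefix wins, which is why GoogleBot may read /private/reports/ while the rest of /private/ stays off limits.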
Installation
sudo apt install robotstxt
Flags
-h            Show usage information.
-bot          Name of the bot (User-agent) to check.
-robots-url   URL of the robots.txt file to test against.
Examples
robots.txt-check -h
robots.txt-check -bot GoogleBot
robots.txt-check -robots-url https://example.com/robots.txt
robots.txt-check -bot BingBot -robots-url https://example.com/robots.txt
robots.txt-check -bot '*'
robots.txt-check -robots-url https://target.com/robots.txt -bot CustomBot