A Ruby gem providing native bindings to Google's official C++ robots.txt parser and matcher. Enables fast, standards-compliant robots.txt parsing and URL access checking directly from Ruby.
- Fast Performance: Native C++ implementation via FFI bindings
- Standards Compliant: Wraps Google's official robots.txt C++ parser
- Cross-Platform: Supports macOS and Linux (ARM64 and x86_64)
- Simple API: Easy-to-use Ruby interface
- RFC 9309 Compliant: Follows the latest robots.txt specification
```
gem install robotstxt-rb
```

Add this line to your application's Gemfile:

```ruby
gem 'robotstxt-rb', git: 'https://github.com/jacksontrieu/robotstxt-rb.git'
```

And then execute:

```
bundle install
```

```ruby
require 'robotstxt-rb'

# Check if a URL is allowed for a specific user agent
robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /admin
  Allow: /public
ROBOTS

# Check if a URL is allowed
RobotstxtRb.allowed?(
  robots_txt: robots_txt,
  user_agent: "MyBot",
  url: "https://example.com/public"
)
# => true

RobotstxtRb.allowed?(
  robots_txt: robots_txt,
  user_agent: "MyBot",
  url: "https://example.com/admin"
)
# => false

# Validate robots.txt content
RobotstxtRb.valid?(robots_txt: robots_txt)
# => true
```

`RobotstxtRb.allowed?` checks whether a specific URL may be crawled by a given user agent according to the robots.txt rules.
Parameters:
- `robots_txt` (String): The robots.txt content to parse
- `user_agent` (String): The user agent string to check
- `url` (String): The URL to check (can be a full URL or a path)
Returns: Boolean - true if the URL is allowed, false if disallowed
Raises: ArgumentError if any required parameter is nil
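When both an Allow and a Disallow rule match a path, RFC 9309 gives precedence to the rule with the longest matching pattern. The following is a rough pure-Ruby sketch of that precedence logic for intuition only; the gem itself delegates to Google's native matcher, and this sketch ignores user-agent grouping and the allow-wins tie-break:

```ruby
# Illustrative sketch of RFC 9309 longest-match precedence.
# Not the gem's implementation.
Rule = Struct.new(:type, :pattern)

def pattern_to_regexp(pattern)
  # '*' matches any character sequence; a trailing '$' anchors at the end.
  source = Regexp.escape(pattern).gsub('\*', '.*').sub(/\\\$\z/, '$')
  /\A#{source}/
end

def sketch_allowed?(rules, path)
  matching = rules.select { |rule| path.match?(pattern_to_regexp(rule.pattern)) }
  best = matching.max_by { |rule| rule.pattern.length }
  best.nil? || best.type == :allow # no matching rule means the path is allowed
end

rules = [
  Rule.new(:disallow, "/temp*"),
  Rule.new(:allow, "/temp/public"),
  Rule.new(:disallow, "/*.pdf$")
]

sketch_allowed?(rules, "/temp/file")   # => false ("/temp*" is the longest match)
sketch_allowed?(rules, "/temp/public") # => true  ("/temp/public" outranks "/temp*")
sketch_allowed?(rules, "/about")       # => true  (no rule matches)
```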
`RobotstxtRb.valid?` validates whether the given robots.txt content is well-formed.
Parameters:
- `robots_txt` (String): The robots.txt content to validate
Returns: Boolean - true if valid, false if invalid or nil
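As a rough illustration of what well-formedness means here, a validator might check that every non-blank, non-comment line is a `key: value` pair with a recognized directive name. This is a hypothetical pure-Ruby sketch, not the native parser's actual logic, and the directive list is an assumption:

```ruby
# Hypothetical directive list for illustration only; the native
# parser's actual acceptance rules may differ.
KNOWN_DIRECTIVES = %w[user-agent allow disallow sitemap crawl-delay].freeze

def roughly_valid?(robots_txt)
  return false if robots_txt.nil?

  robots_txt.each_line.all? do |line|
    line = line.sub(/#.*/, "").strip # drop comments and surrounding whitespace
    next true if line.empty?         # blank lines are always fine
    key = line[/\A([^:]+):/, 1]
    !key.nil? && KNOWN_DIRECTIVES.include?(key.strip.downcase)
  end
end

roughly_valid?("User-agent: *\nDisallow: /admin") # => true
roughly_valid?("Invalid-directive: value")        # => false
roughly_valid?("")                                # => true
```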
- macOS: ARM64 (Apple Silicon) and x86_64 (Intel)
- Linux: ARM64 and x86_64
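At load time, an FFI-backed gem typically selects the prebuilt shared library matching the host OS and CPU. The sketch below is hypothetical; the helper and file names are invented for illustration and are not this gem's actual loader or artifact names:

```ruby
require "rbconfig"

# Hypothetical platform-selection helper (illustration only).
def native_library_name(os: RbConfig::CONFIG["host_os"], cpu: RbConfig::CONFIG["host_cpu"])
  ext  = os.include?("darwin") ? "dylib" : "so"          # macOS vs Linux
  arch = cpu.match?(/\A(arm64|aarch64)/) ? "arm64" : "x86_64"
  "librobotstxt-#{arch}.#{ext}"
end

native_library_name(os: "darwin23", cpu: "arm64")    # => "librobotstxt-arm64.dylib"
native_library_name(os: "linux-gnu", cpu: "x86_64")  # => "librobotstxt-x86_64.so"
```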
```ruby
require 'robotstxt-rb'

# Simple robots.txt
robots_txt = "User-agent: *\nDisallow: /private"

# Check various URLs
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/public")   # true
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/private")  # false
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/private/") # false
```

```ruby
robots_txt = <<~ROBOTS
  User-agent: Googlebot
  Disallow: /search
  Allow: /

  User-agent: *
  Disallow: /
ROBOTS

# Googlebot can access most URLs
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Googlebot", url: "/")       # true
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Googlebot", url: "/search") # false

# Other bots are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "OtherBot", url: "/") # false
```

```ruby
robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /*.pdf$
  Disallow: /temp*
  Allow: /temp/public
ROBOTS

# PDF files are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/document.pdf") # false

# Temp files are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/temp/file") # false

# Exception to the temp rule
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/temp/public") # true
```

```ruby
# Valid robots.txt
RobotstxtRb.valid?(robots_txt: "User-agent: *\nDisallow: /admin") # true

# Invalid robots.txt
RobotstxtRb.valid?(robots_txt: "Invalid-directive: value") # false

# Empty robots.txt is valid
RobotstxtRb.valid?(robots_txt: "") # true
```

- Clone the repository:
```
git clone https://github.com/jacksontrieu/robotstxt-rb.git
cd robotstxt-rb
```

- Install dependencies:

```
bundle install
```

```
# Run all tests
bundle exec rspec

# Run with detailed per-example output
bundle exec rspec --format documentation
```

This project uses RuboCop for code style enforcement. To check and fix style issues:
```
# Check for style violations
bundle exec rubocop

# Auto-correct safe violations
bundle exec rubocop -a

# Auto-correct all violations (including unsafe ones)
bundle exec rubocop -A
```

To build the gem:

```
gem build robotstxt-rb.gemspec
```

We welcome contributions! Please see our Contributing Guide for details on how to get started.
This project adheres to a Code of Conduct. By participating, you are expected to uphold this code.
Please see our Security Policy for information on reporting security vulnerabilities.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
See CHANGELOG.md for a list of changes and version history.