Configuring Crawl Scope Using DecideRules
The crawl scope defines the set of possible URIs that can be captured by a crawl. These URIs are determined by DecideRules, which work in combination to limit or expand the set of crawled URIs. Each DecideRule, when presented with an object (most often a URI of some form) responds with one of three decisions:
- ACCEPT: the object is ruled in
- REJECT: the object is ruled out
- PASS: the rule has no opinion; retain the previous decision
A URI under consideration begins with no assumed status. Each rule is applied in turn to the candidate URI. If the rule decides ACCEPT or REJECT, the URI's status is set accordingly. After all rules have been applied, the URI is determined to be "in scope" if its status is ACCEPT. If its status is REJECT it is discarded.
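The evaluation order described above can be sketched in a few lines of Java. This is only an illustration of the "last non-PASS decision wins" logic, not the Heritrix implementation: the class `ScopeSketch`, its `Decision` enum, and the three example rules are hypothetical names invented for this sketch, and treating a URI that no rule has an opinion on as out of scope is an assumption drawn from the statement that only ACCEPT is "in scope".

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative only: none of these types are Heritrix classes.
class ScopeSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    // Apply every rule in order; a PASS leaves the running status unchanged.
    static Decision evaluate(List<Function<String, Decision>> rules, String uri) {
        Decision status = Decision.PASS; // the URI starts with no assumed status
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.PASS) {
                status = d; // a later ACCEPT or REJECT overrides an earlier one
            }
        }
        return status;
    }

    // Assumption: only a final ACCEPT counts as "in scope".
    static boolean inScope(List<Function<String, Decision>> rules, String uri) {
        return evaluate(rules, uri) == Decision.ACCEPT;
    }

    // Hypothetical three-rule sequence: reject all, accept one site, reject .iso files.
    static List<Function<String, Decision>> exampleRules() {
        return Arrays.asList(
            uri -> Decision.REJECT,
            uri -> uri.startsWith("http://example.org/") ? Decision.ACCEPT : Decision.PASS,
            uri -> uri.endsWith(".iso") ? Decision.REJECT : Decision.PASS);
    }

    public static void main(String[] args) {
        List<Function<String, Decision>> rules = exampleRules();
        System.out.println(inScope(rules, "http://example.org/index.html"));
        System.out.println(inScope(rules, "http://example.org/big.iso"));
        System.out.println(inScope(rules, "http://other.example.com/"));
    }
}
```

Because a later rule can overturn an earlier one, reordering the same three rules can change the outcome, which is why the ordering of the sequence matters as much as its contents.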
We suggest starting with the rules in our recommended default configurations and performing small test crawls with those rules. Understand why certain URIs are ruled in or ruled out under those rules. Then make small individual changes to the scope to achieve non-default desired effects. Creating a new ruleset from scratch can be difficult and can easily result in crawls that can't make the usual minimal progress that other parts of the crawler expect. Similarly, making many changes at once can obscure the importance of the interplay and ordering of the rules.
The following table lists the available DecideRules.
| Decide Rule | Description |
|---|---|
| AcceptDecideRule | This DecideRule accepts any URI. |
| ContentLengthDecideRule | This DecideRule accepts a URI if the content-length is less than the threshold. The default threshold is 2^63, meaning any document will be accepted. |
| PathologicalPathDecideRule | This DecideRule rejects any URI that contains an excessive number of identical, consecutive path-segments. For example, |
| PredicatedDecideRule | This DecideRule applies a configured decision only if a test evaluates to true. |
| ExternalGeoLocationDecideRule | This DecideRule accepts a URI if it is located in a particular country. |
| FetchStatusDecideRule | This DecideRule applies the configured decision to any URI that has a fetch status equal to the "target-status" setting. |
| HasViaDecideRule | This DecideRule applies the configured decision to any URI that has a "via." A via is any URI that is a seed or some kind of mid-crawl addition. |
| HopCrossesAssignmentLevelDomainDecideRule | This DecideRule applies the configured decision to any URI that differs in the portion of its hostname/domain that is assigned/sold by registrars. The portion is referred to as the "assignment-level-domain" (ALD). |
| IdenticalDigestDecideRule | This DecideRule applies the configured decision to any URI whose prior-history content-digest matches the latest fetch. |
| MatchesListRegexDecideRule | This DecideRule applies the configured decision to any URI that matches the supplied regular expressions. |
| NotMatchesListRegexDecideRule | This DecideRule applies the configured decision to any URI that does not match the supplied regular expressions. |
| MatchesRegexDecideRule | This DecideRule applies the configured decision to any URI that matches the supplied regular expression. |
| ClassKeyMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI class key that matches the supplied regular expression. A URI class key is a string that specifies the name of the Frontier queue into which a URI should be placed. |
| ContentTypeMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI whose content-type is present and matches the supplied regular expression. |
| ContentTypeNotMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI whose content-type does not match the supplied regular expression. |
| FetchStatusMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI that has a fetch status that matches the supplied regular expression. |
| FetchStatusNotMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI that has a fetch status that does not match the supplied regular expression. |
| HopsPathMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI whose "hops-path" matches the supplied regular expression. The hops-path is a string that consists of characters representing the path that was taken to access the URI. An example of a hops-path is "LLXE". |
| MatchesFilePatternDecideRule | This DecideRule applies the configured decision to any URI whose suffix matches the supplied regular expression. |
| NotMatchesFilePatternDecideRule | This DecideRule applies the configured decision to any URI whose suffix does not match the supplied regular expression. |
| NotMatchesRegexDecideRule | This DecideRule applies the configured decision to any URI that does not match the supplied regular expression. |
| NotExceedsDocumentLengthThresholdDecideRule | This DecideRule applies the configured decision to any URI whose content-length does not exceed the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceNoLongerThanDecideRule. |
| ExceedsDocumentLengthThresholdDecideRule | This DecideRule applies the configured decision to any URI whose content-length exceeds the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceLongerThanDecideRule. |
| SurtPrefixedDecideRule | This DecideRule applies the configured decision to any URI (expressed in SURT form) that begins with one of the prefixes in the configured set. This DecideRule returns true when the prefix of a given URI matches any of the listed SURTs. The list of SURTs may be configured in different ways: the surtsSourceFile parameter specifies a file to read the SURTs list from. If the seedsAsSurtPrefixes parameter is set to true, SurtPrefixedDecideRule adds all seeds to the SURTs list. If the alsoCheckVia property is set to true (default false), SurtPrefixedDecideRule will also consider Via URIs in the match. |
| NotSurtPrefixedDecideRule | This DecideRule applies the configured decision to any URI (expressed in SURT form) that does not begin with one of the prefixes in the configured set. |
| OnDomainsDecideRule | This DecideRule applies the configured decision to any URI that is in one of the domains of the configured set. |
| NotOnDomainsDecideRule | This DecideRule applies the configured decision to any URI that is not in one of the domains of the configured set. |
| OnHostsDecideRule | This DecideRule applies the configured decision to any URI that is in one of the hosts of the configured set. |
| NotOnHostsDecideRule | This DecideRule applies the configured decision to any URI that is not in one of the hosts of the configured set. |
| ScopePlusOneDecideRule | This DecideRule applies the configured decision to any URI that is one level beyond the configured scope. |
| TooManyHopsDecideRule | This DecideRule rejects any URI whose total number of hops is over the configured threshold. |
| TooManyPathSegmentsDecideRule | This DecideRule rejects any URI whose total number of path-segments is over the configured threshold. A path-segment is a string in the URI separated by a "/" character, not including the first "//". |
| TransclusionDecideRule | This DecideRule accepts any URI whose path-from-seed ends in at least one non-navlink hop. A navlink hop is represented by an "L". Also, the number of non-navlink hops in the path-from-seed cannot exceed the configured value. |
| PrerequisiteAcceptDecideRule | This DecideRule accepts all "prerequisite" URIs. Prerequisite URIs are those whose hops-path has a "P" in the last position. |
| RejectDecideRule | This DecideRule rejects any URI. |
| ScriptedDecideRule | This DecideRule applies the configured decision to any URI that passes the rules test of a JSR-223 script. The script source must be a one-argument function called |
| SeedAcceptDecideRule | This DecideRule accepts all "seed" URIs (those for which isSeed is true). |
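Several rules in the table inspect either the hops-path or the path-segments of a URI. The sketch below shows one way those quantities could be computed. It is not how Heritrix implements the rules: the class `RuleInputSketch` and its method names are hypothetical, the reading of "non-navlink hops" as the trailing run of non-"L" characters is an assumption, and the exact way Heritrix counts segments and compares them with its thresholds may differ.

```java
import java.net.URI;

// Illustrative helpers only; these are NOT the Heritrix implementations.
class RuleInputSketch {
    // Count the hops at the end of a hops-path that are not navlinks ("L").
    static int trailingNonNavlinkHops(String hopsPath) {
        int count = 0;
        for (int i = hopsPath.length() - 1; i >= 0 && hopsPath.charAt(i) != 'L'; i--) {
            count++;
        }
        return count;
    }

    // Assumed reading of the TransclusionDecideRule description:
    // at least one trailing non-navlink hop, and no more than the limit.
    static boolean endsInTransclusion(String hopsPath, int maxTransHops) {
        int n = trailingNonNavlinkHops(hopsPath);
        return n >= 1 && n <= maxTransHops;
    }

    // Count the non-empty "/"-separated segments of the URI path.
    static int pathSegmentCount(String uri) {
        String path = URI.create(uri).getPath();
        int n = 0;
        if (path != null) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) n++;
            }
        }
        return n;
    }

    // Length of the longest run of identical, consecutive path-segments.
    static int longestSegmentRun(String uri) {
        String path = URI.create(uri).getPath();
        int best = 0;
        int run = 0;
        String previous = null;
        if (path != null) {
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) continue;
                run = segment.equals(previous) ? run + 1 : 1;
                previous = segment;
                best = Math.max(best, run);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(trailingNonNavlinkHops("LLXE"));
        System.out.println(pathSegmentCount("http://example.org/a/b/c"));
        System.out.println(longestSegmentRun("http://example.org/images/images/images/x.gif"));
    }
}
```

For the hops-path "LLXE" from the table, the last two hops ("X" and "E") are non-navlink hops, so under this reading the URI would qualify as a transclusion when the limit is 2 but not when it is 1.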
DecideRules are configured by the bean with id "scope" under the property named "rules."
Enable FINEST logging on the class org.archive.modules.deciderules.DecideRuleSequence to watch each DecideRule's evaluation of the processed URI. To do so, add the following line to the logging.properties file:
org.archive.modules.deciderules.DecideRuleSequence.level = FINEST
Then point the JVM at that file with a -D system property VM argument:
-Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties
The scope bean section of the crawler-beans.cxml file is reproduced below.
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- Begin by REJECTing all... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule">
      </bean>
      <!--
        ...then ACCEPT those within configured/seed-implied SURT prefixes...
      -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <!--
          <property name="seedsAsSurtPrefixes" value="true" />
        -->
        <!-- <property name="alsoCheckVia" value="true" /> -->
        <!-- <property name="surtsSourceFile" value="" /> -->
        <!--
          <property name="surtsDumpFile" value="surts.dump" />
        -->
      </bean>
      <!--
        ...but REJECT those more than a configured link-hop-count from start...
      -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <!-- <property name="maxHops" value="20" /> -->
      </bean>
      <!--
        ...but ACCEPT those reached by a limited number of non-navlink (transclusion) hops...
      -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
        <!-- <property name="maxTransHops" value="2" /> -->
        <!-- <property name="maxSpeculativeHops" value="1" /> -->
      </bean>
      <!--
        ...but REJECT those from a configurable (initially empty) set of REJECT SURTs...
      -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="seedsAsSurtPrefixes" value="false"/>
        <property name="surtsDumpFile" value="negative-surts.dump"/>
        <!-- <property name="surtsSourceFile" value="" /> -->
      </bean>
      <!--
        ...and REJECT those from a configurable (initially empty) set of URI regexes...
      -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <!-- <property name="listLogicalOr" value="true" /> -->
        <!--
          <property name="regexList">
            <list>
            </list>
          </property>
        -->
      </bean>
      <!--
        ...and REJECT those with suspicious repeating path-segments...
      -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
        <!-- <property name="maxRepetitions" value="2" /> -->
      </bean>
      <!--
        ...and REJECT those with more than threshold number of path-segments...
      -->
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
        <!-- <property name="maxPathDepth" value="20" /> -->
      </bean>
      <!--
        ...but always ACCEPT those marked as prerequisite for another URI...
      -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
    </list>
  </property>
</bean>
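The two SurtPrefixedDecideRule beans in the configuration compare each URI, expressed in SURT form, against a set of prefixes. The following sketch illustrates the idea with a deliberately simplified conversion: the host labels are reversed, separated by commas, and wrapped in parentheses, so that a prefix can cover a whole domain. The class `SurtSketch` and its methods are hypothetical, and the conversion ignores ports, user info, query strings, and the canonicalization that the real SURT handling in Heritrix performs.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative only: a simplified SURT conversion, NOT the Heritrix implementation.
class SurtSketch {
    // Convert "http://www.example.org/docs/" to "http://(org,example,www,)/docs/".
    // Assumes the URI has a scheme and a host; ignores port, user info, and query.
    static String toSurt(String uri) {
        URI u = URI.create(uri);
        List<String> labels = Arrays.asList(u.getHost().toLowerCase(Locale.ROOT).split("\\."));
        Collections.reverse(labels); // most significant label first
        String path = u.getRawPath() == null ? "" : u.getRawPath();
        return u.getScheme() + "://(" + String.join(",", labels) + ",)" + path;
    }

    // True when the SURT form of the URI starts with any of the given prefixes.
    static boolean matchesAnyPrefix(String uri, List<String> surtPrefixes) {
        String surt = toSurt(uri);
        for (String prefix : surtPrefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = Arrays.asList("http://(org,example,");
        System.out.println(toSurt("http://www.example.org/docs/index.html"));
        System.out.println(matchesAnyPrefix("http://blog.example.org/a", prefixes));
        System.out.println(matchesAnyPrefix("http://example.com/", prefixes));
    }
}
```

Because the most significant host label comes first, the single prefix `http://(org,example,` matches every host under example.org, which is what makes a plain string prefix test sufficient for domain-wide scoping.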