Developing Alternative Frontier Implementations
The current BdbFrontier implementation is the only widely used Frontier implementation. It works, but it has some limitations:
- Checkpoint-only crawl state management: the crawl can only be resumed from a checkpoint, rather than it being possible to directly restart from a crawl that could not be stopped 'neatly' (e.g. VM death, system failure, etc.)
- Opaque: the state is stored as key-values where the values are Kryo-serialised CrawlURI instance blobs. This means only the right versions of the H3 codebase are able to look inside the frontier, and upgrades to H3 can render the state unusable. The Frontier cannot be analysed from other languages.
- Single-process locked: the state can only be inspected or modified from the running H3 instance itself. We can't use external tools to manage the frontier, or run multiple crawlers over the same Frontier (even if they are written in Java, given the opacity problem above).
- Bloated: the state can take up a LOT of space, as can the regular checkpoints you need to take and manage because of the checkpoint-only state management above. The space used is partly down to BDB not being that efficient, but also down to the opaque serialisation, which makes it possible to end up with large Kryo blobs (because of all the ways additional data or objects can be embedded inside the H3 CrawlURI object).
- Complex: the BdbFrontier does a lot of different-but-related things and is very difficult to understand and reason about. It also implements a sophisticated caching system where the in-memory Frontier is updated 'live' and occasionally flushed to disk.
- Concurrency: the BdbFrontier is intended to be run with many ToeThreads, but the complexities of state management mean a number of locks and synchronisations have been put in place to try to keep the state valid and consistent. These appear to have left the codebase overly cautious in places, which means that as the number of ToeThreads is increased, BdbFrontier lock contention becomes a significant bottleneck. This gets worse for larger crawls, due to the interactions with the in-memory caching/flushing system. (ANJ: I haven't quite got to the bottom of all this, but have observed that crawlers with ample memory and CPU available do not manage to use those resources, despite there being no observable I/O wait. This is consistent with some kind of lock contention, but I've not proven exactly where the problem arises).
The problem with the Heritrix3 frontier is that it does a lot of related things. The base Frontier interface looks simple enough: it accepts CrawlURIs with a QueueKey and a priority, stores them, and then releases them at crawl time. In other words, as well as storing the URIs in queues, it is aware of the crawl delays and thus of the time when each queue is due. However, as hinted at by the AbstractFrontier, there are other things going on. In fact, there are four layers of inheritance, each bringing additional functionality:
- Frontier: stores CrawlURIs and releases them in time order. Also aware of the Spring Lifecycle so knows when it's being started up/shut down.
- AbstractFrontier: Listens for new seeds, interacts with the UriUniqFilter, ...
- WorkQueueFrontier: Adds the concept of Work Queues which allows queue rotation/budgets.
- BdbFrontier: Brings all the functionality together with the BDB/checkpoint state management for the BdbMultipleWorkQueues.
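To make the base-level behaviour concrete, here is a minimal in-memory sketch of "store URIs per queue, release them when the queue is due". It is a toy model only and does not use the Heritrix API: ToyFrontier, Entry, KeyedQueue and the fixed per-queue crawlDelayMillis are all hypothetical names, and the rule that lower priority numbers are released first is an assumption made for the example.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Toy model of the behaviour described above; NOT the Heritrix Frontier API. */
class ToyFrontier {

    /** Hypothetical stand-in for a CrawlURI: a URI, its queue key and a priority. */
    static final class Entry {
        final String uri;
        final String queueKey;
        final int priority; // assumption: lower numbers are released first

        Entry(String uri, String queueKey, int priority) {
            this.uri = uri;
            this.queueKey = queueKey;
            this.priority = priority;
        }
    }

    /** One queue per key, plus the earliest time it may release its next URI. */
    private static final class KeyedQueue {
        final PriorityQueue<Entry> uris =
                new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.priority));
        long dueAtMillis = 0L;
    }

    // LinkedHashMap keeps iteration order stable, so ties between queues are deterministic.
    private final Map<String, KeyedQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMillis;

    ToyFrontier(long crawlDelayMillis) {
        this.crawlDelayMillis = crawlDelayMillis;
    }

    /** Stores the entry in the queue named by its queue key. */
    void schedule(Entry entry) {
        queues.computeIfAbsent(entry.queueKey, k -> new KeyedQueue()).uris.add(entry);
    }

    /** Returns the next URI that is due at nowMillis, or null if nothing is due yet. */
    Entry next(long nowMillis) {
        KeyedQueue best = null;
        for (KeyedQueue q : queues.values()) {
            if (q.uris.isEmpty() || q.dueAtMillis > nowMillis) {
                continue; // nothing to release, or the crawl delay has not elapsed
            }
            if (best == null || q.dueAtMillis < best.dueAtMillis) {
                best = q; // prefer the queue that has been due the longest
            }
        }
        if (best == null) {
            return null;
        }
        best.dueAtMillis = nowMillis + crawlDelayMillis; // enforce the delay for this queue
        return best.uris.poll();
    }
}
```

Even this toy version has to combine storage, ordering and politeness timing in one place, which gives a sense of why the real class hierarchy accumulates so much responsibility.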
In general, much of the complexity arises from implementation details relating to the BdbFrontier design, so it probably makes more sense to implement any new frontier at the highest level (implements Frontier) and only implement the functionality that is needed. However, the BdbFrontier has been around a long time and it's possible other parts of H3 have become somewhat hard-coded against its capabilities. If this is true, it will likely only become clear during testing. This is because using things like Spring means lots of dependencies are dynamically pulled in at runtime rather than being declared at compile time.
Looking at the AbstractFrontier we see:
public abstract class AbstractFrontier
implements Frontier,
SeedListener,
HasKeyedProperties,
ExtractorParameters,
CrawlUriReceiver,
ApplicationListener<ApplicationEvent> {
For basic crawl functionality, it is likely that any implementation will need to implement SeedListener (if H3 is being used to manage seeds), and implement CrawlUriReceiver (if the URI uniqueness is not part of the Frontier implementation). However, for an externally-managed Frontier, it would be possible to handle seeds and uniqueness entirely outside of H3 and just use H3 as a crawl pool that gets what it's told to get and writes the WARCs/logs. This depends on how crawl scope is managed (which is often handled via additional SeedListeners that are part of the Scope DecideRules).
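The split of responsibilities between a uniqueness filter and a receiving frontier can be sketched roughly as follows. This is a simplified illustration with hypothetical stand-in types (UriReceiver, SeenSetFilter, ReceivingFrontier) that mirror the roles described above. It does not reproduce the actual signatures of CrawlUriReceiver or UriUniqFilter, and it uses plain strings in place of CrawlURIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the receiver role: accepts a URI that passed the uniqueness check. */
interface UriReceiver {
    void receive(String uri);
}

/** Hypothetical stand-in for a uniqueness filter: forwards a URI only the first time it is seen. */
class SeenSetFilter {
    private final Set<String> seen = new HashSet<>();
    private final UriReceiver destination;

    SeenSetFilter(UriReceiver destination) {
        this.destination = destination;
    }

    void add(String uri) {
        // Set.add returns false for duplicates, so repeats are silently dropped.
        if (seen.add(uri)) {
            destination.receive(uri);
        }
    }
}

/** A frontier that leaves uniqueness to the filter, so it only enqueues what it receives. */
class ReceivingFrontier implements UriReceiver {
    final List<String> pending = new ArrayList<>();

    @Override
    public void receive(String uri) {
        pending.add(uri);
    }
}
```

If uniqueness were instead handled inside the frontier, or entirely outside H3, the filter and the receiver callback would drop out and the frontier would simply enqueue whatever it is given.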
The other features would likely only be implemented if necessary. For example, ApplicationListener could be used to track crawl events, and this could be used to drop/reconnect to an external database on pause/unpause (rather than the connection having to be there all the time). HasKeyedProperties refers to H3's 'sheets' configuration system, which allows some Frontier-level configuration to be controlled, e.g. the ExtractorParameters, which cover things like maximum outlink extraction.
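The drop/reconnect idea could look something like the sketch below. It is deliberately independent of Spring and Heritrix: CrawlState, StoreConnection and ConnectionLifecycle are hypothetical names, and in a real implementation the onStateChange logic would presumably be driven from an ApplicationListener callback reacting to crawl state events.

```java
/** Hypothetical crawl states; Heritrix's own event and state types are not reproduced here. */
enum CrawlState { RUNNING, PAUSED, FINISHED }

/** Hypothetical handle on an external frontier store (e.g. a database connection). */
class StoreConnection {
    private boolean open = false;

    void connect() { open = true; }

    void disconnect() { open = false; }

    boolean isOpen() { return open; }
}

/** Holds the external connection only while the crawl is actually running. */
class ConnectionLifecycle {
    private final StoreConnection connection;

    ConnectionLifecycle(StoreConnection connection) {
        this.connection = connection;
    }

    void onStateChange(CrawlState state) {
        switch (state) {
            case RUNNING:
                // Reconnect on start or unpause.
                if (!connection.isOpen()) {
                    connection.connect();
                }
                break;
            case PAUSED:
            case FINISHED:
                // Release the connection while nothing is being crawled.
                if (connection.isOpen()) {
                    connection.disconnect();
                }
                break;
        }
    }
}
```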
Assuming we have a Frontier implementation that covers seed injection and URI uniqueness, it should be possible to design a set of Crawler Beans that use it and can run a test crawl.