collecting streaming content
By default, Heritrix collects only material available over HTTP/HTTPS, DNS, and FTP. Some 'streaming' media is actually delivered over plain HTTP; if the direct URI to the content is discoverable by Heritrix's link-extraction modules, Heritrix can retrieve the audio/video content over HTTP. (Where a site obscures the direct link with JavaScript or Flash, manual intervention or extra configuration may be required to help Heritrix discover the direct URI.)
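As a rough illustration of the distinction above, the following Python sketch sorts candidate media URIs by whether their scheme is one of the default-collected ones. It is not part of Heritrix; the function name, the scheme set, and the example URIs are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Schemes the text above lists as collected by default.
# (Hypothetical constant for this sketch; not a Heritrix setting.)
DEFAULT_SCHEMES = {"http", "https", "dns", "ftp"}


def fetchable_by_default(uri: str) -> bool:
    """Return True if the URI's scheme is one of the default-collected ones."""
    return urlsplit(uri).scheme in DEFAULT_SCHEMES


# Made-up example URIs: two plain-HTTP/DNS cases and two streaming-protocol cases.
examples = [
    "http://example.org/media/clip.mp3",
    "dns:example.org",
    "rtsp://example.org/live",
    "mms://example.org/stream.wmv",
]

for uri in examples:
    label = "default protocols" if fetchable_by_default(uri) else "needs other tooling"
    print(f"{uri}: {label}")
```

A check like this only says which protocol a URI uses; it does not help discover a direct URI that a page hides behind scripting.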
A paper at IWAW'06 by Nicolas Baly, "Archiving Streaming Media on the Web, Proof of Concept and First Results", described extending Heritrix to use MPlayer on Linux to collect content delivered over a number of streaming protocols. That extension is not included in official Heritrix releases, but some Heritrix users have adapted the work to recent releases to collect streaming media.
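For orientation only, this Python sketch builds an MPlayer command line that dumps a stream to a file using MPlayer's `-dumpstream` and `-dumpfile` options. It is not the integration described in the paper and is not tied to Heritrix; the helper name, the stream URI, and the output path are hypothetical, and the command is only printed, not executed.

```python
import shlex


def mplayer_dump_command(stream_uri: str, out_path: str) -> list[str]:
    """Build an argv list asking MPlayer to save the raw stream to out_path."""
    # -dumpstream: write the raw stream instead of playing it.
    # -dumpfile:   name of the file to write the dump to.
    return ["mplayer", "-dumpstream", "-dumpfile", out_path, stream_uri]


# Hypothetical stream URI and output file name.
cmd = mplayer_dump_command("rtsp://example.org/live", "live.dump")

# Print a shell-quoted form for inspection; running it would require MPlayer
# to be installed and the stream to be reachable.
print(shlex.join(cmd))
```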