Docker
Build the default Heritrix docker image for version 3.4.0-20210923 as follows:
docker build --build-arg version=3.4.0-20210923 -t iipc/heritrix .
To use the heritrix-contrib release, build the image with the following command:
docker build --build-arg version=3.4.0-20210923 --build-arg java=8-jre -t iipc/heritrix:contrib -f Dockerfile.contrib .
Note that iipc/heritrix:contrib currently only runs with Java 8, not Java 11 (JRE/JDK).
Build arguments, to be supplied with --build-arg key=value:
Name | Description |
---|---|
version | Heritrix Maven release version |
java | Java Docker base image: 11-jre for heritrix, 8-jre for heritrix-contrib |
user | Custom user that runs Heritrix in the container, default: heritrix |
userid | User ID of the Heritrix user, default: 1000 |
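As a rough sketch of how these arguments combine, the snippet below assembles a build command that overrides the container user. The values `crawler` and `1001` are hypothetical; only the argument names come from the table above. The command is printed rather than executed, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical example values; only the build-arg names are taken from the table.
version="3.4.0-20210923"
user="crawler"   # assumed custom user name
userid="1001"    # assumed custom user id

# Assemble the docker build command as a string (dry run, nothing is built here).
build_cmd="docker build --build-arg version=${version} --build-arg user=${user} --build-arg userid=${userid} -t iipc/heritrix ."
echo "${build_cmd}"
```

Running the printed command from the directory that contains the Dockerfile should then build the image with the custom user.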
The docker/ folder contains a Makefile that wraps common build steps.
Build the images with:
make (image|image-contrib|image-all) [version=3.4.0-20210923]
# e.g. basic latest release image:
make image
Supply a specific Heritrix version with version=3.4.0-20210923.
Publish the built images for version 3.4.0-20210923 to the iipc user with the following command:
make (publish|publish-contrib|publish-all) version=3.4.0-20210923 repo=iipc/
Build multiple releases with:
make image-all-version
Test run your images with:
make (run|run-contrib) [version=3.4.0-20210923] [repo=iipc/]
See the Heritrix Documentation about running the Docker images.
Basic run commands:
# run it in the foreground
# -it is required for clean stopping with Ctrl+C
# --rm removes the container and its anonymous volumes afterwards
docker run --rm -it iipc/heritrix
# run it in the background
# --name is optional but makes the container easier to find for stopping
docker run -d --name heritrix_container iipc/heritrix
# logs
docker logs heritrix_container
# stop it
docker stop heritrix_container
Configuring it for real™ usage:
# --init : use tini init wrapper
# --rm : remove container after exit
# -it : runs docker interactively (pseudo TTY)
# -d : detach, run container in background
# -p : map public api port of 8443 (host) to 8443 (container port)
# -e : set environment variables for user/pass of REST API
# JAVA_OPTS=-Xmx1024M (to restrict heritrix memory usage)
# -v : mount local folder into container (to persist job results)
# on windows/WSL[2] volume mounts might not work (container files are not in local folder?)
# heritrix is installed at /opt/heritrix
# heritrix jobs are at /opt/heritrix/jobs
docker run --init --rm -d \
--name heritrix_container \
-p 8443:8443 \
-e "USERNAME=admin" -e "PASSWORD=admin" \
-v "$(pwd)/jobs:/opt/heritrix/jobs" \
iipc/heritrix
# or mount a credentials file into the container (docker secrets?)
echo "admin:admin" > "$(pwd)/creds"
docker run --init --rm -d \
--name heritrix_container \
-p 8443:8443 \
-e "CREDSFILE=/opt/heritrix/creds.txt" \
-v "$(pwd)/creds:/opt/heritrix/creds.txt" \
-v "$(pwd)/jobs:/opt/heritrix/jobs" \
iipc/heritrix
# replace `-d` with `-it` to run it interactively (see log, quit with Ctrl+C)
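Once a container is running with port 8443 mapped as above, the REST API can be probed from the host. The helper below is a minimal sketch that only prints a suitable curl command. It assumes the API is served under `/engine`, uses a self-signed certificate (hence `-k`) and digest authentication (hence `--anyauth`); none of this is stated in this section, so verify it against the REST API documentation. The function name `heritrix_status_cmd` is hypothetical.

```shell
#!/bin/sh
# Print a curl command that queries the Heritrix engine endpoint.
# Assumptions (not stated in this section): the API lives at /engine,
# the certificate is self-signed (-k), and digest auth is used (--anyauth).
heritrix_status_cmd() {
    user="$1"
    pass="$2"
    host="${3:-localhost}"   # defaults to the local Docker host
    port="${4:-8443}"        # defaults to the port mapped with -p
    printf 'curl -k -u %s:%s --anyauth --location https://%s:%s/engine\n' \
        "$user" "$pass" "$host" "$port"
}

# Dry run: show the command for the example credentials used above.
heritrix_status_cmd admin admin
```

Printing the command instead of running it keeps the sketch safe to source in a script; copy the output into a terminal to perform the actual request.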
Run a single job:
# docker options are the same as above
# * specify the job to run with -e "JOBNAME=<jobname>"
# * mount the job folder containing the crawler-beans.cxml to the <jobname>
# folder within the container
# * the crawl will start immediately
#
docker run --init --rm -d \
--name heritrix_container \
-p 8443:8443 \
-e "USERNAME=admin" -e "PASSWORD=admin" -e "JOBNAME=myjob" \
-v "$(pwd)/myjob:/opt/heritrix/jobs/myjob" \
iipc/heritrix
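Because the crawl starts immediately, it can help to check the job folder before launching the container. The following is a minimal sketch, assuming the only precondition worth checking is that the folder contains a crawler-beans.cxml; the function name `check_job_dir` is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given directory looks like a Heritrix job folder,
# i.e. it contains a crawler-beans.cxml file; return 1 otherwise.
check_job_dir() {
    job_dir="$1"
    if [ -f "${job_dir}/crawler-beans.cxml" ]; then
        echo "ok: ${job_dir}/crawler-beans.cxml found"
        return 0
    fi
    echo "error: ${job_dir}/crawler-beans.cxml is missing" >&2
    return 1
}
```

A possible use is to chain it in front of the run command, for example `check_job_dir "$(pwd)/myjob" && docker run ...`, so that a missing configuration is reported before a container is created.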
Run other stuff, e.g. the hoppath.pl script:
# the last two lines are the relative path to the job log
# (in the container) and the URI_PREFIX
docker run -it --rm \
-v "$(pwd)/myjob:/opt/heritrix/jobs/myjob" \
--entrypoint bin/hoppath.pl \
iipc/heritrix \
jobs/myjob/latest/logs/crawl.log \
https://