Release Notes - Heritrix 3.4.0-20220727

Alex Osborne edited this page Jul 29, 2022 · 15 revisions

Summary of changes since Release Notes - Heritrix 3.4.0-20210923. See the full changelog for more details.

To Be Completed...

Additions

  • JDK 18 support
  • New robotsTxtOnly robots policy obeys robots.txt rules but ignores robots meta tags in HTML #489 (ato)
  • CandidatesProcessor gained a seedsRedirectNewSeedsAllowTLDs option to prevent the addition of TLDs as additional seeds when a seed redirects to them #461
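The two new options above would typically be set in a job's crawler-beans.cxml. The fragment below is a minimal sketch, not a complete configuration. It assumes the robots policy is selected by name through the robotsPolicyName property of the metadata bean, and that setting seedsRedirectNewSeedsAllowTLDs to false is what blocks TLDs from becoming seeds. The bean ids follow the default job profile and should be checked against your own configuration.

```xml
<!-- Fragment of a job's crawler-beans.cxml; unrelated properties are omitted. -->

<!-- Assumption: the policy is chosen by name via robotsPolicyName. -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- ... keep existing properties such as operatorContactUrl ... -->
  <!-- Obey robots.txt rules, ignore robots meta tags in HTML (#489). -->
  <property name="robotsPolicyName" value="robotsTxtOnly"/>
</bean>

<bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
  <!-- Redirects from a seed may still add the target as a new seed... -->
  <property name="seedsRedirectNewSeeds" value="true"/>
  <!-- ...but, assuming false is the restrictive value, not when the target is a TLD (#461). -->
  <property name="seedsRedirectNewSeedsAllowTLDs" value="false"/>
</bean>
```

Leaving both properties unset should keep the behaviour of the previous release.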

Changes

none

Removals

none

Bugfixes

  • Fixed issue where the HTML srcset attribute was only matched when written in lowercase #477
  • Sitemap links (M) are no longer treated as transclusions when limiting hops #469
  • Fixed "java.lang.NoClassDefFoundError: Could not initialize class org.archive.util.CLibrary" on Apple Silicon #467
  • Fixed Heritrix crashing on unexpected characters in the Content-Length header #449
  • Fixed StringIndexOutOfBoundsException when the Java version is an exact major version, as with the first JDK 18 release #439
  • Fixed dnsjava NIO selector thread consuming 100% CPU after a job is terminated #425
  • Fixed <link href=...> tags being treated as embed (E) links for rel values where they shouldn't be #263
  • Fixed "RIS already open for ToeThread..." exception when crawling https pages via a proxy #191
