ACHE Crawler Change Log

Version 0.16.0-SNAPSHOT (Unreleased)

Add support for the search UI when using Elasticsearch 7.x and 8.x. (PR #341)
Drop search UI support for older versions of Elasticsearch <7.x.x. (PR #341)

Version 0.15.0

Bump okhttp from 3.14.0 to 4.9.3
Bump jackson-* libraries from 2.13.1 to 2.13.3
Bump logback-classic from 1.2.9 to 1.2.11
Bump slf4j-api from 1.7.32 to 1.7.36
Bump RoaringBitmap from 0.9.23 to 0.9.27
Bump metrics-* libraries from 4.2.7 to 4.2.17
Bump aws-java-sdk-s3 from 1.12.131 to 1.12.225
Remove aws-java-sdk-s3 dependency from main project
Add support for Elasticsearch 7.x and 8.x indexing (#282)
Bump jetty-server from 9.4.44.v20210927 to 9.4.48.v20220622
Bump kryo-serializers from 0.42 to 0.43
Bump RoaringBitmap from 0.9.27 to 0.9.39
Bump tika-parsers from 1.18 to 1.28.4
Bump gradle-node-plugin to version 3.5.1 and node.js to 18.14.2
Migrate tests from jUnit 4 to 5
Migrate test assertions from Hamcrest to AssertJ
Bump org.apache.httpcomponents:httpclient from 4.5.13 to 4.5.14
Bump ch.qos.logback:logback-classic from 1.2.+ to 1.4.5
Fix robots.txt serialization bug
Bump jackson-* libraries from 2.13.3 to 2.14.2
Bump org.apache.commons:commons-lang3 from 3.4 to 3.12.0
Bump org.apache.commons:commons-compress from 1.21 to 1.22
Bump org.apache.kafka:kafka-clients from 3.2.0 to 3.4.0
Bump com.squareup.okhttp3:okhttp from 4.9.3 to 4.10.0

Version 0.14.0 (2022-02-06)

Remove support for CDR 3.1 format in Kafka target repository
Move tools and memex packages to the ache-tools sub-project
Moved forked crawler-commons classes to a separate sub-project
Remove tika dependency from ache and crawler-commons sub-project
Synchronize crawler-commons/http-fetcher with the upstream library
Setup gradle build using GitHub Actions
Build docker image with multi-arch support (amd64, arm64)
Upgrade build to Gradle 7.3.3
Upgrade gradle-node-plugin to version 3.0.1
Upgrade ache-dashboard npm dependencies
Pin slf4j-api version to 1.7.32
Bump airline from 0.8 to 0.9
Bump aws-java-sdk-s3 from 1.12.129 to 1.12.131
Bump crawler-commons from 1.1 to 1.2
Bump com.github.kt3k.coveralls from 2.10.2 to 2.12.0
Bump commons-codec from 1.10 to 1.15
Bump commons-compress from 1.12 to 1.21
Bump commons-lang3 from 3.4 to 3.12.0
Bump commons-validator from 1.6 to 1.7
Bump guava from 20.0 to 23.0
Bump jetty-server from 9.3.6.v20151106 to 9.4.44.v20210927
Bump kryo from 4.0.0 to 4.0.2
Bump kafka-clients from 0.11.0.1 to 3.0.0
Bump logback-classic from 1.1.+ to 1.2.9
Bump mockito-core from 1.10.+ to 4.2.0
Bump npm from 6.14.10 to 8.3.0
Bump rocksdbjni from 6.2.2 to 6.25.3
Bump RoaringBitmap from 0.7.8 to 0.9.23
Bump smile-core from 1.5.0 to 1.5.3
Bump lucene-analyzers-common from 7.3.1 to 8.10.1
Bump webarchive-commons from 1.1.8 to 1.1.9
Bump jsoup from 1.10.3 to 1.14.3
Bump junit from 4.12 to 4.13.2
Bump jackson-* libraries from 2.8.5 to 2.13.1
Bump metrics-* libraries from 3.1.3 to 4.2.7
Replace SparkJava framework (unmaintained) by Javalin 4.2.0
Add timeout configurations for the TOR fetcher
Update and improve the documentation
Change documentation theme to sphinx_material
Add support to HTTP BASIC auth for Elasticsearch data format

Version 0.13.0 (2021-01-07)

Upgrade gradle-node-plugin to version 2.2.4
Upgrade gradle wrapper to version 6.6.1
Upgrade crawler-commons to version 1.1
Reorganized gradle module directory structure
Rename root package to 'achecrawler'
Use multi-stage build to reduce Docker image size
Refactor Elasticsearch repository and make it wait until the server ready
Upgrade npm dependencies

Version 0.12.0 (2020-01-18)

Upgrade crawler-commons dependency to version 0.9
Removed Elasticsearch transport-client-based repository
Removed Elasticsearch 1.4.4 binaries dependency
Added DumpDataFromElasticsearch tool for dumping documents from Elasticsearch repositories
Added a configuration for minimum relevance in link selectors
Added a configuration for selecting whether should re-crawl sitemaps and robots.txt links
Added documentation about relevance_threshold parameters to the target page classifiers documentation page
Added support for crawling via HTTP proxy in okhttp3 fetcher (by @maqzi)
Added tracking of more HTTP error messages (301, 302, 3xx, 402) (by @maqzi)
Upgrade crawler-commons library to version 1.0
Upgrade commons-validator library to version 1.6
Upgrade okhttp3 library to version 3.14.0
Fix issue #177: Links from recent TLDs are considered invalid
Upgrade RocksDB dependency (rocksdbjni) to version 6.2.2
Added error code details to RocksDB exception logs
Upgrade gradle-node-plugin to version 1.3.1
Upgrade npm version to 6.10.2
Upgrade ache-dashboard npm dependencies
Upgrade gradle wrapper to version 5.6.1
Update Dockerfile to use openjdk:11-jdk (Java 11)
Added content_type field to RegexTargetClassifier
Change default link classifier to LinkClassifierBreadthSearch
Update io.airlift:airline dependency to version 0.8
Update gradle build script to use new plugins DSL
Update coveralls gradle plugin to version 2.9.0
Update searchkit to version ^2.4.0

Version 0.11.0 (2018-06-01)

Removed dependency on Weka and reimplemented all machine-learning code using SMILE.
Added option to skip cross-validation on ache buildModel command
Added option to configure max number of features on ache buildModel command
Changed license from GNU GPL to Apache 2.0
Added tool (ache run ReplayCrawl) to replay old crawls using a new configuration file
Added near-duplicate page detection using min-hashing and LSH
Support ELASTIC format in Kafka data format (issue #155)
Upgrade react-scripts to get rid of vulnerable transitive dependency (hoek:4.2.0)
Upgrade npm version to 5.8.0 on gradle build script
Changed smile target page classifier to use Platt's scaling only when the parameter 'relevance_threshold' is provided in the pageclassifier.yml file.
Added Ansible scripts for automatic deployment
Added RocksDB-based target repository (RocksDBTargetRepository)
Fixed bug in ache-dashboard that prevented reloading search page on browser page refresh (issue #163)
Support Elasticsearch 6.x (issue #158)

Version 0.10.0 (2018-01-16)

We are pleased to announce version 0.10.0 of ACHE Crawler! This release contains very important changes, which include support for running multiple crawlers in a single server (multi-tenancy), and the start of our migration to Apache License 2 (APLv2).

Following is a detailed log of the major changes since last version:

Upgraded gradle-node plugin to version 1.2.0
Removed BerkeleyDB dependency (issue #143)
Allow for running multiple crawlers in a single server (issue #103)
REST API endpoints modified to support multiple crawlers (issue #103)
Web interface modified to support multiple crawlers (issue #103)
Display more metrics in crawler monitoring page
Upgrade RocksDB (org.rocksdb:rocksdbjni) to version 5.8.7 (issue #142)
Upgraded build script plugin "gradle-node" to version 1.2.0
Upgraded javascript dependencies from crawler web-interface:
- react to version 16.2.0
- react-vis to version 1.7.9
- searchkit to version 2.3.0
- npm to version 5.6.0
Allow cookies to be modified dynamically via REST API endpoint (issue #114)
Added crawlerId field to JSON output of target repositories to track provenance of crawled pages

Version 0.9.0 (2017-11-07)

We are pleased to announce version 0.9.0 of ACHE Focused Crawler! We also recently reached the milestone of 100+ starts on GitHub, 55+ forks, and 1000+ commits in the current git repository. We would like to thank all users for the feedback we have received in the past year.

This is a large release, and it brings many improvements to the documentation and several new features. Following is a detailed log of major changes since last version:

Fixed multiple bugs and handling of exceptions
Several improvements made to ACHE documentation
Allow use of multiple data formats simultaneously (issue #92)
Added new data storage format using the standard WARC format (issue #64)
Added new data storage format using Apache Kafka (issue #123)
Re-crawling of sitemaps.xml files using fixed time intervals (issue #73)
Allow configuration of cookies in ache.yml (issue #81)
Allow configuration of full User-Agent string
Fixed memory issues that would cause OutOfMemoryError (issue #63)
Support for robots exclusion protocol a.k.a. robots.txt (issue #46)
Added new HTTP fetcher implementation using okhttp3 library with support to multiple SSL cipher suites
Non-HTML pages are no longer parsed as HTML
Training of new link classifiers (Online Learning) in a background thread (issue #76)
Added REST API endpoint to stop crawler
Added REST API endpoint to add new seeds to the crawl
Added documentation for the REST API
Persist run-time crawl metrics across crawler restarts (issue #101)
Added support to per-domain wildcard link filters (issue #121)
Add more detailed metrics for HTTP response codes (issue #120)
Changed referrer policies in the search dashboard for better security
Added various configuration options for timeouts in both fetcher implementations (issue #122)
Added support for Basic HTTP authentication in the web interface (issue #129)
Added REST API endpoints to supporting monitoring using Prometheus.io (issue #128)
Add page relevance metrics for better monitoring (issue #119)
Add parameters for elasticsearch index and type names through the /startCrawl REST API (issue #107)
Support for serving web interface from non-root path (issue #137)
Added button to stop crawler in web user interface (issue #139)
Upgraded searchkit library to 2.2.0 which supports Elasticsearch 5.x
Upgrade crawler-commons library to version 0.8

Notice: that there were breaking changes in some data formats:

Repositories for relevant and irrelevant pages are now stored in the same folder (or same Elasticsearch index) and page entries include new properties to identify pages as relevant or irrelevant according to the target page classifier output. Double check the data formats documentation page and make sure you make appropriate changes if needed.

Version 0.8.0 (2017-04-27)

We are pleased to announce version 0.8.0 of ACHE Focused Crawler.

This release includes a more complete and reorganized documentation (available at http://ache.readthedocs.io/en/latest/) and a new REST API for real-time crawler monitoring.

Following is the detailed log of major changes since last version.

Added frontier load time metrics (issue #59)
Update some library versions on build.gradle
Update gradle wrapper to version 3.2.1
Added Dockerfile
Added connection timeouts to BingSearchAzureAPI
Changed seed finder to use SimpleHttpFetcher
Added option to configure a custom user agent string
Added option of not starting console reporter in MetricsManager
Change set_version script to work on MacOS
Updated test dependency (Jetty) to version 9.3.6
Rewrite all CLI programs using only airline library
Shutdown crawler and log errors on any error (any Throwable)
Simple WekaTargetClassifier refactoring
Added argument --seedsPath to specify the directory to store the seed file in SeedFinder command
Replaced the deprecated installApp by installDist gradle command in conda.recipe
Fixed type of links extracted from sitemaps
REST API for real-time metrics monitoring (issue #67)
Remove dependency on linkclassifier.features file from LinkClassifierBreadthSearch (issue #65)
Create an initial version of web-based crawler dashboard for visualization of system metrics (issue #68)
Avoid creating empty files when not necessary in FilesTargetRepository
Added Memex CDRv3 support
Added Elasticsearch indexer to AcheToCdrFileExporter and rename it to AcheToCdrExporter
Capture exceptions and retry on failures during ElasticSearch bulk indexing
Refactoring of TargetClassifierFactory
Added command annotation to MigrateToFilesTargetRepository tool
Added a simple in-memory duplicate detection tool
Added a new regex-based target classifier that matches multiple fields (issue #69)
Created an initial version of documentation using the documentation generation system Sphinx and published documentation online at http://ache.readthedocs.io/ (issue #66)
Added additional system descriptions and a scaffold for missing documentation (issue #66)
Added badge with link to documentation in README.md (issue #66)
Added an index to page-classifiers documentation page
Improved documentation on page classifiers
Added a tool to run a classifier over a file content
Adjusted regex matcher to use DOTALL mode (issue #69)
Rename test file correctly
Write a CSV with queries, classification result, and URLs (issue #71)
Moved SeedFinder documentation from wiki to Sphinx documentation

Version 0.7.0 (2016-11-27)

There were more than 100 commits since the last release 0.6.0 in July 8. Following are some of the improvements.

ACHE is now simpler to use and to configure:

Added more specific configuration samples for focused crawling and in-depth website crawling
Stopwords are now an optional parameter, and a embedded stopword list is used by default
Added utility tools for working with CDR (Common Data Repository) files
Added utility to print frontier links along with relevance scores
Added configuration for HTTP connection pool size

ACHE is faster: we fixed synchronization and parallelism issues that led to improvements of crawler efficiency of 980% (a simple benchmark available at #56).

ACHE is more resilient due fix of bugs related to:

Extraction of malformed URLs during HTML parsing
Failures due to handling of URLs with IPv4 addresses
Failure to train the linking classifier for certain configuration values
Corruption of binary data improperly stored in strings

URL normalization added for links extracted from web pages, so less duplicate content will be fetched

Cleaned log messages and added logging of structured data in CSV files regarding:

Download requests
Links selected to be downloaded

Added detailed software metrics that allows better monitoring and detection of problems. Added metrics include shows counts, 1, 5 and 15-minute rates, mean, median, and 75%, 95%, 98% and 99% percentiles for

URL fetch time
Download page processing time
Current download queue size
Current processing and pending downloads in queue

ACHE has an improved data management:

Added new page repository that stores multiple pages in rolling compressed files
Added a new alternative database backend based on Facebook's RocksDB key-value store that improves efficiency and JVM memory management.

Some stability problems were solved, such as:

Limiting size of downloader thread-pool queue sizes
Properly close repository files during crawler shutdown
Avoid start crawler shutdown multiple times

Other minor improvement such as:

Migrated code base to Java 8
More refactoring, code cleaning, and tests (coverage 44%)

Version 0.6.0 (2016-07-08)

We are pleased to announce version 0.6.0 of ACHE Focused Crawler. Here we list the major changes since last version.

New features, improvements and bug fixes:

Implementation of SeedFinder algorithm, which leverages search engine's APIs to automatically create a large and diverse seed URL set to start to bootstrap the crawler.
Added flexible way to different handlers for different types of links, which will allow to have different extractors for each content type such as HTML, media files, XML sitemaps, etc.
Support for sitemap.xml protocol, which allows the crawler automatically discover all links along with some metadata specified by webmasters.
More bug fixes and code refactoring.
More unit tests and integration tests (coverage raised to 42%)

Version 0.5.0 (2016-04-20)

We are pleased to announce version 0.5.0 of ACHE Focused Crawler. Here we list the major changes since last version.

New features, improvements and bug fixes:

New simplified configuration based on a single YML file (ache.yml)
Fixed "backlink crawling" using Mozcape API to get backlinks
Complete rewrite of Crawler Manager module with some threading bug fixes and new thread managing model
Allow HTTP Fetcher cancel download of undesired mime-types (valid mime-types configuration)
Added ability to crawl .onion links from TOR network using HTTP proxies such as Privoxy
Added more unit tests for several components (test coverage raised to 31% of codebase)
More code cleaning and refactorings

Version 0.4.0 (2016-01-28)

We are pleased to announce version 0.4.0 of ACHE Crawler. Here we list the major changes since last version.

New features, improvements and bug fixes:

Improved and updated ACHE documentation
Added configuration to disable English language detection
Configured a service to measure test code coverage (https://coveralls.io/github/ViDA-NYU/ache)
Added more unit tests for several components (test coverage raised to 24% of codebase)
Refactor RegexBasedDetector into a new type of Page Classifier that uses regular expressions
Refactor Link Storage to abtract a new component called LinkSelector
Extract headers from HTTP responses
Add support for relative redirect URLs
Add support for redirected urls, mime type, reorganize code
Fixed a number of small issues and minor bugs
Removed legacy code and more code formatting
Fixed of some memory leaks and memory usage waste
Removed LinkMonitor and ability to print frontier pages that caused memory leaks
Added better caching policy with limited memory usage in Frontier
Added link selector with politeness restrictions (access URLs from the same domain only after minimum time interval)
Added link selector that maximizes number of websites downloaded
Added link selector to allow only crawl web pages within a max depth from the seed URLs
Changed default JVM garbage collector used in ACHE
Added command line option to train a Random Forest page classifier
Refactoring of page repositories to reuse code and allow improvements
Added configuration to hash file name when using FILESYTEM data formats
Added new JSON data format
Store fetch time of downloaded pages in JSON data format
Store HTTP request headers in JSON data format
Added deflate compression for pages repositories
Improved command line help messages
Updated Gradle wrapper version to 2.8
Updated Weka version to 3.6.13
Fixed other minor bugs
Removed lots of unused code and code cleaning

Version 0.3.1 (2015-07-22)

We are pleased to announce version 0.3.1 of ACHE Crawler. This is a minor release with some changes:

Added config files to final package distribution
Added version to command line interface
Some code refactorings

Version 0.3.0 (2015-07-14)

We are pleased to announce version 0.3.0 of ACHE Crawler. Here we list the major changes since version 0.2.0 (note that some changes break compatibility with previous releases).

New features:

New command-line interface using named parameters
Integration with ElasticSearch with configurable index names
Added new way to configure different types of classifiers using YAML files (this will allow new types of classifiers be added later as well as "meta classifiers" which can combine any type of classifier, using votting or machine learning ensembles for example)
Implemented a new type of page classifier based on simple URL regular expressions
Added filtering for extracted links using "white" and "black" lists of regular expressions
Added tool for compression of CBOR files in GZIP using CCA format
Added tool for off-line indexing data into ElasticSearch from crawler files in disk

Improvements:

Improved documentation in GitHub
Started writing automated unit tests for new features
Configuration of a continuous integration pipeline using TravisCI (compiles and runs the tests for each new commit in the repository)
Embedded language detection into crawler package to ease configuration for end user (before, the user needed to download external language profiles files and specify them in command line)
Converted bash scripts to build SVM model to a single command written in cross-platform Java code
Don't automatically remove data from existing crawl, just resume previous crawls.

Bug fixes:

Escaping HTML entities from extracted links (this was causing wrong links to be extracted and the crawler waste resources trying to download unexisting pages)
Checking for empty strings in frontier and seed file
Fixed computation of CCA key
Insert URLs from the seed file only when they are not already inserted
Added shutdown hook to close LinkStorage database properly
Removed URL fragment (#) from extracted links (this was causing duplicated URLs to be downloaded)

Refactorings:

Refactored tens of classes in the crawler

Version 0.2.0 (2015-04-01)

First version release on GitHub.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHANGELOG.md

CHANGELOG.md

ACHE Crawler Change Log

Version 0.16.0-SNAPSHOT (Unreleased)

Version 0.15.0

Version 0.14.0 (2022-02-06)

Version 0.13.0 (2021-01-07)

Version 0.12.0 (2020-01-18)

Version 0.11.0 (2018-06-01)

Version 0.10.0 (2018-01-16)

Version 0.9.0 (2017-11-07)

Version 0.8.0 (2017-04-27)

Version 0.7.0 (2016-11-27)

Version 0.6.0 (2016-07-08)

New features, improvements and bug fixes:

Version 0.5.0 (2016-04-20)

New features, improvements and bug fixes:

Version 0.4.0 (2016-01-28)

New features, improvements and bug fixes:

Version 0.3.1 (2015-07-22)

Version 0.3.0 (2015-07-14)

New features:

Improvements:

Bug fixes:

Refactorings:

Version 0.2.0 (2015-04-01)

Files

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

ACHE Crawler Change Log

Version 0.16.0-SNAPSHOT (Unreleased)

Version 0.15.0

Version 0.14.0 (2022-02-06)

Version 0.13.0 (2021-01-07)

Version 0.12.0 (2020-01-18)

Version 0.11.0 (2018-06-01)

Version 0.10.0 (2018-01-16)

Version 0.9.0 (2017-11-07)

Version 0.8.0 (2017-04-27)

Version 0.7.0 (2016-11-27)

Version 0.6.0 (2016-07-08)

New features, improvements and bug fixes:

Version 0.5.0 (2016-04-20)

New features, improvements and bug fixes:

Version 0.4.0 (2016-01-28)

New features, improvements and bug fixes:

Version 0.3.1 (2015-07-22)

Version 0.3.0 (2015-07-14)

New features:

Improvements:

Bug fixes:

Refactorings:

Version 0.2.0 (2015-04-01)