Releases: hplt-project/warc2text-runner

v3.0.0-alpha.1

21 Mar 00:00
Pre-release

HTML2text updates:

  1. Moved to Trafilatura 2.0.0
  2. Additional extraction of text with markup, using Trafilatura's XML output
  3. Extraction of HTML language tags
  4. Streaming input HTML directly from LUMI-O
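
As an illustration of item 3, here is a minimal sketch of how language tags declared in an HTML page could be collected with Python's standard `html.parser`. It is not the project's implementation: the names `LangTagParser` and `extract_lang_tags` are hypothetical, and the choice to read only `<html lang>` and `<meta http-equiv="content-language">` is an assumption.

```python
from html.parser import HTMLParser


class LangTagParser(HTMLParser):
    """Hypothetical helper: collects language hints declared in an HTML document."""

    def __init__(self):
        super().__init__()
        self.html_lang = None  # value of <html lang="..."> (or xml:lang)
        self.meta_lang = None  # value of <meta http-equiv="content-language" content="...">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta" and self.meta_lang is None:
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")


def extract_lang_tags(html):
    """Return the declared language tags, or None where a tag is absent."""
    parser = LangTagParser()
    parser.feed(html)
    parser.close()
    return {"html_lang": parser.html_lang, "meta_lang": parser.meta_lang}
```

Declared tags are often missing or wrong, so a value extracted this way is usually best treated as a hint alongside the output of a language identification model rather than a replacement for it.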

Code running stage2 on LUMI for the second data release

13 May 16:26

See two/README.MD to learn how to reproduce stage2 for the second data release.

updated langid

12 May 20:05
fb881ca
Pre-release

langid update: preprocessing, new model
better-chosen block size for Trafilatura

v2.0.0-alpha.2

06 May 22:13
Pre-release

Full Changelog: v2.0.0-alpha.1...v2.0.0-alpha.2

Now pip installable.

v2.0.0-alpha.1

23 Apr 17:54
Pre-release
traf.py: timeout