Releases: hplt-project/warc2text-runner
v3.0.0-alpha.1
HTML2text updates:
- Moved to Trafilatura 2.0.0
- Additional extraction of text with markup, using the XML output from Trafilatura
- Extraction of HTML language tags
- Streaming input HTML directly from LUMI-O
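
To illustrate the "extraction of HTML language tags" item above, here is a minimal sketch of what reading language declarations from a page can look like. It uses only the Python standard library and is not taken from the warc2text-runner code; the names `LangAttrCollector` and `extract_lang_tags` are hypothetical, and the choice to look at the `lang` attribute of `<html>` and the `content-language` meta tag is an assumption about which tags are of interest.

```python
from html.parser import HTMLParser


class LangAttrCollector(HTMLParser):
    """Hypothetical helper: records language declarations found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.html_lang = None  # lang (or xml:lang) attribute on the <html> element
        self.meta_lang = None  # content of <meta http-equiv="content-language">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta" and self.meta_lang is None:
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")


def extract_lang_tags(html: str) -> dict:
    """Return the declared languages, or None where nothing is declared."""
    parser = LangAttrCollector()
    parser.feed(html)
    parser.close()
    return {"html_lang": parser.html_lang, "meta_lang": parser.meta_lang}
```

Declared language tags are often missing or wrong on real pages, so a sketch like this would typically complement a language identifier rather than replace it.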
Code running stage2 on LUMI for the second data release
Updated langid
langid update: preprocessing, new model
Improved blocksize selection for Trafilatura
v2.0.0-alpha.2
Full Changelog: v2.0.0-alpha.1...v2.0.0-alpha.2
Now pip installable.
v2.0.0-alpha.1
traf.py: timeout