every index.html code brocken since today #286

kundaliniserp · 2024-04-15T08:00:09Z

Hello. I have downloaded a lot of websites before, but since today wayback-machine-downloader is no longer working for me. I tried different timestamps and different domains, with and without a proxy. Whatever I try, instead of working code, every index.html contains something like:

<jv �Óìª3ž*="" lny‘="" !µ*&¹˜ØÁ!,Æ”9ˆÀÎËñ´œëºªÇË«ãbr<<ØÝ‡±�—åy‘wª”þ©þ™wª�dªd5ö�e~="" “ü$[ow�)="" Þ¨üd�wù§�="" d6^i†¡áé`óÕe¡èeß�&t”tÿÌòuÖ�ŸeË}wëÝ="" Ï;�·z÷�þË¿üË_ÿµÓi�Œ–Ê\Á°w:¦Ê<›åßÝ�¤�år¥8˜Æô»[—ÅduöÝdá;Î;øÑn�óbudÓn5Î¦ùw}Õ<a�;^�‹uuÿÉÛ·l="">i].p€[gŠ�a[w–å¨\UwlKwŠù$ÿÔn�”ÓiyÙnÍ²O�b–�æ†ZGÓlyšSF5/�‹|uÔéÓ÷…�2KV•z�úÌ{6™Î:jx«¢œ³Îít÷»ûC…ÿÿBü�9+ªVU¬ò–ú·\¬ŠYñY1£¢õYku–·>”Yµj½}öSk1]Ÿ�óÖÅ ×íµ:jÌV‹êèþý+(Ð�—³û—år¢Ð©ªûT´º_åå}$ÂÿúëªXMó‡ýƒÖÇl®�^¶�=õÑiý¼Î§ñò�þÏ¤Xµ&ÿ×íÞððA1[üÏÿ½j-Êõ²uQVÕ2»ÈÖŸZŠoUÁyy‘

Could someone please give advice? Thank you.

gingerbeardman · 2024-04-20T16:17:14Z

My index files are downloaded and look as I expect them to.

Feel free to post a sample URL where you are seeing this and I'll check.

ckought · 2024-05-16T01:34:58Z

I'm getting this for about a third of the pages that I'm grabbing. No rhyme or reason for which ones are corrupted. It's happening to txt and html files, and probably other file types too (I know some jpg files are getting corrupted, but I'm not sure about css, js, or others).

It looks to me like whatever code is being used to strip off the archive.org code at the top is causing it. It's like the code is downloading the page, stripping off the archive.org code, something goes wrong, and then it writes the garbage file to disk thinking the file is just fine.

ckought · 2024-05-16T13:02:14Z

Update:

Been doing some digging, and it looks like every corrupted file, whether it's html, txt, jpeg, or css, starts with one of these three hex byte sequences:

1f 8b 08 00 00 00 00 00 00 03
1f 8b 08 00 3f 3f 3f 00
1f 8b 08 00 4f 3f 3f 00

Still not sure what's actually causing the files to corrupt though. It does seem to get worse the larger the website is, and therefore the longer the script is running, so that may have something to do with it.
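
For anyone checking their own downloads: the shared `1f 8b 08` prefix is the gzip signature (two ID bytes plus the deflate method byte), which suggests the affected files may have been written to disk still compressed. Below is a minimal sketch in Python, not part of wayback_machine_downloader, that scans a download folder for that prefix and tries to decompress matching files in place. The function names and the choice to skip `.gz` files are assumptions made for illustration. Files whose bytes were altered after download (for example, bytes replaced by `3f`) will likely fail to decompress and are left untouched.

```python
import gzip
import zlib
from pathlib import Path
from typing import List, Optional

# gzip ID bytes (1f 8b) followed by the deflate method byte (08)
GZIP_MAGIC = b"\x1f\x8b\x08"


def looks_gzipped(data: bytes) -> bool:
    """Return True if the data starts with the gzip signature."""
    return data.startswith(GZIP_MAGIC)


def try_gunzip(data: bytes) -> Optional[bytes]:
    """Return the decompressed bytes, or None if the data is not valid gzip."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        return None


def repair_tree(root: str) -> List[Path]:
    """Decompress gzip-wrapped files under root in place; return repaired paths."""
    repaired = []
    for path in Path(root).rglob("*"):
        # Assumption: real .gz archives are meant to stay compressed.
        if not path.is_file() or path.suffix == ".gz":
            continue
        data = path.read_bytes()
        if not looks_gzipped(data):
            continue
        restored = try_gunzip(data)
        if restored is not None:
            path.write_bytes(restored)
            repaired.append(path)
    return repaired
```

Calling `repair_tree("websites/example.com")` (a hypothetical folder name) returns the list of files it rewrote. Run it on a copy of the downloads first, since it overwrites files in place.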

gingerbeardman · 2024-05-16T15:08:50Z

@ckought I do not see this, so I wonder which version of wayback_machine_downloader you are using, and whether it has been modified in any way?

Use the one in #280 or see this comment #265 (comment)

ckought · 2024-05-18T17:04:29Z

@gingerbeardman That fixed the issue. I've had it running for about 48 hours, with 25K downloads with no corrupted downloads.

The version of wayback_machine_downloader.rb I had been using was the version from the automatic install using "gem install wayback_machine_downloader".
