Bad desktop data dumps for Jan. #74

ebidel · 2017-01-30T05:31:06Z

Both desktop data dumps for this month (2017-01-01 and 2017-01-15) are showing malformed data. Known issue?

igrigorik · 2017-01-30T06:10:41Z

Hmm, no, that's something we need to investigate. Thanks for the heads up, Eric.

ronancremin · 2017-02-03T15:01:49Z

December 2016 looks anomalous also: a sudden dramatic drop in overall weight vs. the previous month (if only it were true!).

pmeenan · 2017-02-03T16:23:37Z

We had an issue with the requests database where the primary key ran out of 32-bit numbers - doh. It should be fixed for the 2/1 crawl, and we're looking at backfilling the December and January crawl stats from the HARs in BigQuery.
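For readers unfamiliar with this failure mode, here is a minimal sketch of how an auto-incrementing 32-bit key runs out. It is an illustration only, not the project's code: the thread does not say whether the key was signed or unsigned, and `last_id`, `rows_per_crawl`, and both helper functions are hypothetical names and numbers chosen for the example.

```python
# Illustrative only: the thread does not state the key's signedness, and the
# row counts used below are made-up numbers, not HTTP Archive figures.
INT32_SIGNED_MAX = 2**31 - 1      # 2,147,483,647
INT32_UNSIGNED_MAX = 2**32 - 1    # 4,294,967,295
INT64_SIGNED_MAX = 2**63 - 1      # what a 64-bit key would allow


def remaining_ids(last_id: int, max_id: int) -> int:
    """Number of key values still available above last_id."""
    return max(max_id - last_id, 0)


def crawls_until_exhaustion(last_id: int, rows_per_crawl: int, max_id: int) -> int:
    """Whole crawls that still fit before the key space runs out."""
    if rows_per_crawl <= 0:
        raise ValueError("rows_per_crawl must be positive")
    return remaining_ids(last_id, max_id) // rows_per_crawl


if __name__ == "__main__":
    last_id = 2_100_000_000        # hypothetical current maximum request ID
    rows_per_crawl = 50_000_000    # hypothetical requests recorded per crawl
    limits = [
        ("signed 32-bit", INT32_SIGNED_MAX),
        ("unsigned 32-bit", INT32_UNSIGNED_MAX),
        ("signed 64-bit", INT64_SIGNED_MAX),
    ]
    for name, limit in limits:
        left = crawls_until_exhaustion(last_id, rows_per_crawl, limit)
        print(f"{name}: {left} full crawls left")
```

With these made-up numbers, a signed 32-bit key cannot hold even one more full crawl, which is the kind of situation in which inserts start failing partway through a crawl.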

ronancremin · 2017-02-03T16:36:02Z

Thanks Patrick! I was looking for evidence of responsive images in WordPress 4.4 hopefully pulling down the average size as it rolls out.

Themanwithoutaplan · 2017-02-13T11:06:22Z

How come the errors for numDomains weren't being flagged as constraint violations?

rviscomi · 2017-03-28T01:53:04Z

I need to learn more about the BigQuery -> MySQL pipeline, but I hope to get this fixed soon.

rviscomi · 2019-02-04T23:22:12Z

See also this comment from #116:

The downloads page lists January 2017 but the links are broken. The thing is that the dumps were available at the time and contained valid data. Can they be recreated?

pmeenan · 2019-02-04T23:42:53Z

Which links specifically? The desktop links to the archived dumps on archive.org are all working for me.

http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.csv.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.csv.gz

Is the problem that an automated script is trying to use the pre-archive location and it moved once the archiving completed? For the pipeline, would it be easier if a copy of the dumps was also archived to the cloud storage bucket?

rviscomi · 2019-02-04T23:56:55Z

Sorry, this is an old issue from 2017 that I updated. I was triaging old issues.

pmeenan · 2019-02-05T00:02:43Z

Whoops. My bad. Dumps from 2 years ago? If the links don't work they're gone.


igrigorik mentioned this issue Feb 13, 2017

Bad data in Feb 1st #76

Closed

rviscomi added bug P1 labels Apr 11, 2017

rviscomi mentioned this issue Jun 14, 2017

Desktop Data from 2017-01 missing #116

Closed

rviscomi added P3 and removed P1 labels Jun 14, 2017
