This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Bad desktop data dumps for Jan. #74

Open
ebidel opened this issue Jan 30, 2017 · 10 comments

@ebidel

ebidel commented Jan 30, 2017

Both desktop data dumps for this month (2017-01-01 and 2017-01-15) are showing malformed data. Known issue?

http://httparchive.org/interesting.php

@igrigorik
Contributor

Hmm, no, that's something we need to investigate. Thanks for the heads up, Eric.

/cc @pmeenan @rviscomi

@ronancremin

December 2016 also looks anomalous: a sudden, dramatic drop in overall weight vs. the previous month (if only it were true!).

@pmeenan
Member

pmeenan commented Feb 3, 2017

We had an issue with the requests database where the primary key ran out of 32-bit numbers - doh. It should be fixed for the 2/1 crawl, and we're looking at backfilling the December and January crawl stats from the HARs in BigQuery.
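
For anyone wanting to guard against the same kind of failure, here is a minimal sketch of a headroom check for a 32-bit key. It is not the HTTP Archive code: the function names are hypothetical, and the thread does not say whether the key was signed or unsigned, so both limits are covered.

```python
# Illustrative sketch only. The issue does not state whether the requests
# table's primary key was signed or unsigned, so both limits are shown.
INT32_SIGNED_MAX = 2**31 - 1    # 2,147,483,647
INT32_UNSIGNED_MAX = 2**32 - 1  # 4,294,967,295


def remaining_ids(current_max_id: int, unsigned: bool = False) -> int:
    """Return how many more IDs fit before the 32-bit key is exhausted."""
    limit = INT32_UNSIGNED_MAX if unsigned else INT32_SIGNED_MAX
    return max(limit - current_max_id, 0)


def crawl_fits(current_max_id: int, expected_new_rows: int,
               unsigned: bool = False) -> bool:
    """Return True if a crawl adding expected_new_rows rows still fits."""
    return expected_new_rows <= remaining_ids(current_max_id, unsigned)
```

Running something like `crawl_fits(max_id, estimated_rows)` before each crawl would presumably flag the problem ahead of time rather than after the data is already malformed.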

@ronancremin

Thanks Patrick! I was looking for evidence of responsive images in WordPress 4.4 hopefully pulling down the average size as it rolls out.

@Themanwithoutaplan

How come the errors for numDomains weren't being flagged as constraint violations?

@rviscomi
Member

I need to learn more about the BigQuery -> MySQL pipeline, but I hope to get this fixed soon.

@rviscomi
Member

rviscomi commented Feb 4, 2019

See also this comment from #116:

The downloads page lists January 2017 but the links are broken. The thing is that the dumps were available at the time and contained valid data. Can they be recreated?

@pmeenan
Member

pmeenan commented Feb 4, 2019

Which links specifically? The desktop links to the archived dumps on archive.org are all working for me.

http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.csv.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.csv.gz

Is the problem that an automated script is trying to use the pre-archive location and it moved once the archiving completed? For the pipeline, would it be easier if a copy of the dumps was also archived to the cloud storage bucket?

@rviscomi
Member

rviscomi commented Feb 4, 2019

Sorry, this is an old issue from 2017 that I updated. I was triaging old issues.

@pmeenan
Member

pmeenan commented Feb 5, 2019 via email


6 participants