GitHub connectivity: Problem accessing `media.githubusercontent.com` from England: `ERR_CONNECTION_CLOSED` #19

amotl · 2024-11-14T23:39:09Z

Dear GitHub SREs,

we are observing a connectivity issue when accessing media.githubusercontent.com from England.

URL: devrel/uk-offshore-wind-farm-data/wind_farms.json
Error: HTTP client does not establish connection, tripping ERR_CONNECTION_CLOSED or SSL errors.
Location: Nottingham, England using Virgin Media as the ISP and Cloudflare DNS.

You can find more details from a colleague's report below. Does it make any sense to you?

With kind regards,
Andreas.

Report

Tuesday

Sanity check: Anyone else seeing issues with GitHub links for downloading raw files / using as source for COPY FROM statements? I tried this from an existing Jupyter Notebook that used to work up until recently. I think the issue is on GitHub's side - SSL issue with media.githubusercontent.com perhaps. The resource in question is wind_farms.json.

I am going to take a break, have a walk, and see if this is OK later. I'm just writing a new Jupyter notebook, which is how I found it.

Interesting: I turned off following redirects in Postman and now get a 302 which redirects to the media.github... URL, which then fails with a mix of SSL issues or connection resets. I'm assuming it will just start working again at some point, so it is more of an annoyance than a blocker right now (so long as it's working for people taking the courses etc).
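The manual check described above (redirects off, then inspecting each hop) can also be scripted. The following is a minimal sketch using only the Python standard library; the function names, the default user-agent string, and the hop limit are illustrative assumptions and not part of the original report.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_once(url, timeout=10.0, user_agent="ExampleDiag/0.1 (+https://example.com)"):
    """Send one HEAD request; return (status, Location header or None).

    The user-agent default is a hypothetical placeholder.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx responses surface as HTTPError.
        return err.code, err.headers.get("Location")


def trace_redirects(url, fetch=fetch_once, max_hops=5):
    """Follow redirects by hand, recording (url, status or error) per hop."""
    hops = []
    for _ in range(max_hops):
        try:
            status, location = fetch(url)
        except OSError as exc:
            # Covers URLError, ssl.SSLError, and connection resets.
            hops.append((url, f"failed: {exc!r}"))
            break
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)
    return hops
```

Calling `trace_redirects()` with the raw file URL should list the 302 hop first, and then either a final status or the TLS/connection error for the hop that fails.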

Thursday

Just for interest: That GitHub issue I had still persists. I wonder if it is my ISP having some issue with an updated SSL cert or something. It fails on every raw link on GitHub on 3 machines on my local network. I can work around it to get stuff done, I was just worried there was a wider issue with using the raw files option now. I suspect it's "just me" until proven otherwise.

Screen.Recording.2024-11-13.at.13.25.48.mov

% dig media.githubusercontent.com

; <<>> DiG 9.10.6 <<>> media.githubusercontent.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24444
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;media.githubusercontent.com.	IN	A

;; ANSWER SECTION:
media.githubusercontent.com. 0	IN	A	81.99.162.48

;; Query time: 18 msec
;; SERVER: 194.168.4.100#53(194.168.4.100)
;; WHEN: Wed Nov 13 13:32:41 GMT 2024
;; MSG SIZE  rcvd: 72
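To compare the answer above with what another resolver returns (for example by also running `dig @1.1.1.1 media.githubusercontent.com`), a small parser for the `dig` output can help. This is only a sketch: `parse_dig` is a hypothetical helper name, and it assumes the default `dig` output layout shown above.

```python
def parse_dig(output):
    """Extract the answering resolver and the A records from dig output.

    Returns (server, records), where records is a list of
    (name, ttl, address) tuples.
    """
    server = None
    records = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(";; SERVER:"):
            # e.g. ";; SERVER: 194.168.4.100#53(194.168.4.100)"
            server = line.split(":", 1)[1].strip().split("#", 1)[0]
        elif line and not line.startswith(";"):
            # Answer lines look like: name, TTL, class, type, address
            fields = line.split()
            if len(fields) == 5 and fields[2] == "IN" and fields[3] == "A":
                records.append((fields[0].rstrip("."), int(fields[1]), fields[4]))
    return server, records
```

Running it over the output of two resolvers and comparing the address sets shows quickly whether they disagree about the hostname.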

/cc @github, @github-support, @simonprickett


simonprickett · 2024-12-02T10:28:02Z

It's now December 2nd and I continue to see this issue.

amotl · 2025-01-08T11:25:24Z

Guy from GitHub support responded:

Hi Andreas,

Thanks for contacting GitHub Support!

Apologies for the time it has taken us to respond - your patience is appreciated!

I believe this behavior is a result of using the GitHub raw paths in a manner they are not designed for.

I'm coincidentally working on an update to our public docs that clarifies this at the moment, so I can share that content with you now!

The raw file endpoints are not suitable for any sort of large-scale automated use, and are subject to dynamic rate-limiting or blocking if overused. For automated use, using the API or a shallow clone are recommended, as rate limits can be anticipated and worked within.

Occasional use of the raw file endpoints, e.g. to download a script for use at the command line, is acceptable. Always use a unique, identifying user-agent that includes the name of your software in the standard format, plus, in parentheses, an email address or a URL that shows the owner (e.g., MonaApp/1.0 (+https://example.com)).

So, if a lot of traffic was hitting GitHub's raw endpoints using the same user-agent (people were all using the default settings of a popular tool, for example, or someone with the same user-agent was attempting to mass scrape the site) then that traffic might well be blocked.

Hmmm, although I notice in your colleague's video they are attempting to access the raw version of a file in the UI, so that makes me less certain that this is the same issue. And also that file is stored in LFS storage, which is a further possible complication.

Could you please ask the person experiencing this issue to join this ticket and talk to us directly by forwarding them the email version of this message and asking them to reply to it with an email address associated with their GitHub account?

Is the problem still occurring? If they could give us more details about the HTTP client they were using, including the user-agent, that would also be useful. And does the problem occur on files stored in regular Git, or just those in LFS storage?

Cheers,
Guy
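For reference, the user-agent format recommended in the reply above could be applied like this from Python. It is a minimal sketch with the standard library; `build_user_agent` and `make_request` are hypothetical helper names, and the values shown are the placeholders from the support message, not a real application.

```python
import urllib.request


def build_user_agent(name, version, contact):
    """Format 'Name/Version (+contact)', where contact is a URL or email."""
    if not name or not version or not contact:
        raise ValueError("name, version, and contact are all required")
    return f"{name}/{version} (+{contact})"


def make_request(url, user_agent):
    """Prepare a request that identifies the client explicitly."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder values taken from the example in the support reply.
ua = build_user_agent("MonaApp", "1.0", "https://example.com")
request = make_request("https://example.com/", ua)
```

The prepared request can then be passed to `urllib.request.urlopen()`. Other HTTP clients accept the same header value.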

amotl · 2025-01-08T11:25:44Z

Hi again. @hlcianfagna reported a problem, which might be related, or actually the same.

Hi, a partner reached out saying that apparently we cannot download the LFS files for chicago-data in the way described in fundamentals_handson_your_first_cratedb_cluster.html. I tried the "media.githubusercontent" links from the original version of the file but those do not seem to work either. Maybe we should move these files to an S3 bucket?

@simonprickett confirmed:

My issue turned out to be somehow DNS related (my eero mesh had cached some DNS thing). However, based on the ticket @amotl created with GitHub, we did learn that they don't really like this sort of thing.

I can confirm I'm seeing the same issue here again now, both using a local instance of CrateDB, and CrateDB Cloud.

simonprickett · 2025-01-08T11:42:06Z

Note - this also applies to the City Tour content and a couple of other developer relations demos that use Chicago and Wind Farm datasets stored in this repo. And we can't restrict downloads from S3 to just CrateDB Cloud IP address ranges, as folks may also be running CrateDB in Docker locally.

amotl · 2025-01-08T11:46:18Z

Thanks Simon. Let's move off using GitHub as a CDN, as advised:

Using raw download URLs in web pages or otherwise using those direct links as a form of CDN is discouraged.
-- https://stackoverflow.com/a/58227912

The current idea and consensus is to keep using the cratedb-datasets repository as a source of truth, but distribute its content to a public S3 bucket called cratedb-datasets, by adding a little GHA workflow using Rclone to sync up the content on changes.

Please share your objections if you see any.

An alternative would be to sync up the repository content to the web space served at https://cdn.crate.io/, but we'd then need to add a minimal authentication mechanism for uploads, possibly using MS SSO, WebDAV, or SSH, as Jenkins does.

amotl · 2025-01-08T11:59:28Z

@hlcianfagna added:

A potential risk is that if the links are circulated outside of our intended audience we may find ourselves with a big bill.
An alternative could be to host the files in our website.

Thanks. I've heard about those hidden egress traffic costs that S3 can incur, if I get the jargon right? While the other benefits of S3 enumerated above are nice, I don't think they justify walking into that trap.

In this spirit, I'd like to propose exploring the second option, using https://cdn.crate.io/ to serve the content. Please respond with 👍 if you agree, or otherwise share your opinion about it.

/cc @kneth, @ckurze

amotl · 2025-01-10T02:04:20Z

Those patches support publishing the repository content to https://cdn.crate.io/downloads/datasets/cratedb-datasets/.

Those patches adjust relevant URLs on downstream repositories. Thanks, @simonprickett!
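For anyone adjusting URLs in their own material, the mapping can be sketched as follows. It assumes that the CDN mirrors the repository's directory layout, that the Git ref is a single path segment such as `main`, and that the repository owner is `crate`; none of this is confirmed in this thread. `to_cdn_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

CDN_BASE = "https://cdn.crate.io/downloads/datasets/cratedb-datasets/"


def to_cdn_url(url, repo="cratedb-datasets"):
    """Map a GitHub raw/media URL for `repo` to the CDN; else return it unchanged.

    Assumes a single-segment ref (e.g. "main"); query strings are dropped.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]

    if host == "github.com":
        # /<owner>/<repo>/raw/<ref>/<path...>
        if len(segments) < 5 or segments[1] != repo or segments[2] != "raw":
            return url
        path = segments[4:]
    elif host == "raw.githubusercontent.com":
        # /<owner>/<repo>/<ref>/<path...>
        if len(segments) < 4 or segments[1] != repo:
            return url
        path = segments[3:]
    elif host == "media.githubusercontent.com":
        # /media/<owner>/<repo>/<ref>/<path...>
        if len(segments) < 5 or segments[0] != "media" or segments[2] != repo:
            return url
        path = segments[4:]
    else:
        return url

    return CDN_BASE + "/".join(path)
```

URLs for other hosts or repositories are returned untouched, so the function can be applied to every link in a document without filtering first.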

amotl · 2025-03-17T15:52:04Z

The datasets are being served from https://cdn.crate.io/downloads/datasets/cratedb-datasets/ now.
Thanks again for all those updates, @simonprickett!

amotl added the bug label Jan 8, 2025

amotl mentioned this issue Jan 8, 2025

Publish content to web server, GitHub is not a CDN #21

Closed

amotl closed this as completed Mar 17, 2025

amotl mentioned this issue Mar 17, 2025

Automate publishing of repository contents to web server #28

Closed
