GitHub connectivity: Problem accessing media.githubusercontent.com from England: ERR_CONNECTION_CLOSED #19

Open
amotl opened this issue Nov 14, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@amotl
Member

amotl commented Nov 14, 2024

Dear GitHub SREs,

we are observing a connectivity issue when accessing media.githubusercontent.com from England.

You can find more details from a colleague's report below. Does it make any sense to you?

With kind regards,
Andreas.

Report

Tuesday

Sanity check: Anyone else seeing issues with GitHub links for downloading raw files / using as source for COPY FROM statements? I tried this from an existing Jupyter Notebook that used to work up until recently. I think the issue is on GitHub's side - SSL issue with media.githubusercontent.com perhaps. The resource in question is wind_farms.json.

I am going to take a break, have a walk, and see if this is OK later. I'm just writing a new Jupyter notebook, which is how I found it.

Interesting: I turned off following redirects in Postman and now get a 302 which redirects to the media.github... URL, which then fails with a mix of SSL issues or connection resets. I'm assuming it will just start working again at some point, so it is more of an annoyance than a blocker right now (so long as it's working for people taking the courses etc).
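The redirect observation above can be reproduced outside Postman. The following is a minimal sketch using only the Python standard library; the helper name `first_hop` is hypothetical, and the URL to probe has to be supplied by the reader, since the exact raw link is not given in this thread. It reports the status code and `Location` header of the first response without following the redirect.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib surface the 3xx response as an HTTPError.
        return None


def first_hop(url, timeout=10.0):
    """Return (status_code, location_header) for the first response only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # 3xx responses land here because the redirect was not followed.
        return err.code, err.headers.get("Location")


if __name__ == "__main__":
    import sys

    # Usage: python first_hop.py <raw-file-url>
    status, location = first_hop(sys.argv[1])
    print(status, location)
```

If the first hop returns a 302 and only the second hop fails, the problem sits with the redirect target rather than with the original link.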

Thursday

Just for interest: That GitHub issue I had still persists. I wonder if it is my ISP having some issue with an updated SSL cert or something. It fails on every raw link on GitHub on 3 machines on my local network. I can work around it to get stuff done, I was just worried there was a wider issue with using the raw files option now. I suspect it's "just me" until proven otherwise.

Screen.Recording.2024-11-13.at.13.25.48.mov
% dig media.githubusercontent.com

; <<>> DiG 9.10.6 <<>> media.githubusercontent.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24444
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;media.githubusercontent.com.	IN	A

;; ANSWER SECTION:
media.githubusercontent.com. 0	IN	A	81.99.162.48

;; Query time: 18 msec
;; SERVER: 194.168.4.100#53(194.168.4.100)
;; WHEN: Wed Nov 13 13:32:41 GMT 2024
;; MSG SIZE  rcvd: 72
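To compare what different machines or networks resolve for the same hostname, a small script can stand in for `dig`. This is a sketch using the Python standard library; the function names are hypothetical, and the list of expected addresses has to come from a lookup on a network that is known to work, since no reference addresses are given in this thread.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def unexpected_addresses(observed, expected):
    """Return the observed addresses that are missing from the expected list."""
    return sorted(set(observed) - set(expected))


if __name__ == "__main__":
    # Prints whatever the local resolver returns; compare it by hand or
    # with unexpected_addresses() against a lookup from another network.
    print(resolve_ipv4("media.githubusercontent.com"))
```

An address that only shows up on the affected network points towards the local resolver or a cache rather than towards GitHub.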

/cc @github, @github-support, @simonprickett

@simonprickett
Contributor

It's now December 2nd and I continue to see this issue.

@amotl
Member Author

amotl commented Jan 8, 2025

Guy from GitHub support responded:

Hi Andreas,

Thanks for contacting GitHub Support!

Apologies for the time it has taken us to respond - your patience is appreciated!

I believe this behavior is a result of using the GitHub raw paths in a manner they are not designed for.

Coincidentally, I'm working on an update to our public docs that clarifies this at the moment, so I can share that content with you now!

The raw file endpoints are not suitable for any sort of large-scale automated use, and are subject to dynamic rate-limiting or blocking if overused. For automated use, using the API or a shallow clone are recommended, as rate limits can be anticipated and worked within.

Occasional use of the raw file endpoints, e.g. to download a script for use at the command line, is acceptable. Always use a unique, identifying user-agent that includes the name of your software in the standard format, plus, in parentheses, an email address or a URL that shows the owner (e.g., MonaApp/1.0 (+https://example.com)).

So, if a lot of traffic was hitting GitHub's raw endpoints using the same user-agent (people were all using the default settings of a popular tool, for example, or someone with the same user-agent was attempting to mass scrape the site) then that traffic might well be blocked.

Hmmm, although I notice in your colleague's video they are attempting to access the raw version of a file in the UI, so that makes me less certain that this is the same issue. And also that file is stored in LFS storage, which is a further possible complication.

Could you please ask the person experiencing this issue to join this ticket and talk to us directly by forwarding them the email version of this message and asking them to reply to it with an email address associated with their GitHub account?

Is the problem still occurring? If they could give us more details about the HTTP client they were using, including the user-agent, that would also be useful. And does the problem occur on files stored in regular Git, or just those in LFS storage?

Cheers,
Guy
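The user-agent format recommended in the reply above could be applied like this. It is a minimal sketch using the Python standard library; the function names are hypothetical, and the software name, version, and contact URL are the placeholder values from the support reply, not anything CrateDB actually sends.

```python
import urllib.request


def build_user_agent(name, version, contact):
    """Format a user-agent as 'Name/Version (+contact)'."""
    return f"{name}/{version} (+{contact})"


def build_request(url, user_agent):
    """Create a request object that carries the identifying user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    ua = build_user_agent("MonaApp", "1.0", "https://example.com")
    # The URL is a placeholder; no request is sent here.
    req = build_request("https://example.com/data.json", ua)
    print(req.get_header("User-agent"))
```

An identifying user-agent does not lift any rate limits; it only makes the traffic attributable, as the reply describes.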

@amotl
Member Author

amotl commented Jan 8, 2025

Hi again. @hlcianfagna reported a problem, which might be related, or actually the same.

Hi, a partner reached out saying that apparently we cannot download the LFS files for chicago-data in the way described in fundamentals_handson_your_first_cratedb_cluster.html. I tried the "media.githubusercontent" links from the original version of the file but those do not seem to work either. Maybe we should move these files to an S3 bucket?

@simonprickett confirmed:

My issue turned out to be somehow DNS related (my eero mesh had cached some DNS thing). However, based on the ticket @amotl created with GitHub, we did learn that they don't really like this sort of thing.

I can confirm I'm seeing the same issue here again now, both using a local instance of CrateDB, and CrateDB Cloud.

@simonprickett
Contributor

Note - this also applies to the City Tour content and a couple of other developer relations demos that use the Chicago and Wind Farm datasets stored in this repo. And we can't restrict downloads from S3 to just CrateDB Cloud IP address ranges, as folks may also be running CrateDB in Docker locally.

@amotl added the bug (Something isn't working) label on Jan 8, 2025
@amotl
Member Author

amotl commented Jan 8, 2025

Thanks Simon. Let's move off using GitHub as a CDN, as advised:

Using raw download URLs in web pages or otherwise using those direct links as a form of CDN is discouraged.
-- https://stackoverflow.com/a/58227912

The current idea and consensus is to keep using the cratedb-datasets repository as a source of truth, but distribute its content to a public S3 bucket called cratedb-datasets, by adding a little GHA workflow using Rclone to sync up the content on changes.
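The sync step of such a workflow could boil down to a single `rclone sync` invocation. The following sketch only builds and optionally runs that command from Python; the function name, the remote name `s3remote`, and the local checkout path are hypothetical, and the Rclone remote would have to be configured separately. It defaults to `--dry-run`, so nothing is changed unless that is switched off.

```python
import subprocess


def build_sync_command(source_dir, remote, bucket, dry_run=True):
    """Build an 'rclone sync' command that mirrors source_dir into remote:bucket."""
    cmd = ["rclone", "sync", source_dir, f"{remote}:{bucket}"]
    if dry_run:
        # Only report what would be transferred or deleted.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # 's3remote' and './cratedb-datasets' are assumed names for illustration.
    command = build_sync_command("./cratedb-datasets", "s3remote", "cratedb-datasets")
    print(" ".join(command))
    # Requires rclone to be installed and the remote to exist:
    # subprocess.run(command, check=True)
```

Note that `rclone sync` makes the destination match the source, which includes deleting files that no longer exist in the repository.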

Please share your objections if you see any.

An alternative would be to sync the repository content to the web space served at https://cdn.crate.io/, but we would then need to add a minimal authentication mechanism for uploads, possibly using MS SSO, WebDAV, or SSH, as Jenkins does.

@amotl
Member Author

amotl commented Jan 8, 2025

@hlcianfagna added:

A potential risk is that if the links are circulated outside of our intended audience we may find ourselves with a big bill.
An alternative could be to host the files on our website.

Thanks. I've heard about those hidden egress traffic costs that come with S3, if I get the jargon right. While the other benefits of S3 enumerated above are nice, I don't think they justify walking into that trap.

In this spirit, I'd like to propose exploring the second option, using https://cdn.crate.io/ to serve the content. Please respond with 👍 if you agree, or otherwise share your opinion about it.

/cc @kneth, @ckurze


2 participants