Fetch more data from GitHub #443

Open
benmccann opened this issue Aug 18, 2023 · 13 comments

Comments

@benmccann
Collaborator

benmccann commented Aug 18, 2023

@CaptainCodeman suggested it's too difficult to list a new component at the moment. It made me think about how we could simplify things and I do think things could be quite a bit easier. A lot of the info we need is already available on GitHub, so we should just fetch it from there rather than making the user enter it all.

We could make it so that the user only has to enter the title, npm package name, GitHub URL, category, and tags

Once we have the repo URL we can then fetch the homepage URL, description, and star count from GitHub. @MacFJA just added a script to fetch the star count, so hopefully it wouldn't be too hard to extend. We could also fetch the last commit date and whether the repo is archived from GitHub, which would help a lot in identifying stale repositories.
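A minimal sketch of the GitHub side, assuming the GraphQL API is used. The fields `description`, `homepageUrl`, `stargazerCount`, `isArchived`, and `pushedAt` exist on GitHub's `Repository` type; `parseGitHubUrl` and `buildRepoRequest` are hypothetical helper names, not code from the site. `pushedAt` is the time of the last push, used here as a stand-in for the last commit date.

```javascript
// Sketch only: helper names are hypothetical, not part of sveltesociety.dev.

/** Extract { owner, name } from a GitHub repository URL, or null if it is not one. */
function parseGitHubUrl(url) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return null; // not a URL at all
  }
  if (parsed.hostname !== 'github.com') return null;
  const [owner, name] = parsed.pathname.split('/').filter(Boolean);
  if (!owner || !name) return null;
  return { owner, name: name.replace(/\.git$/, '') };
}

/** Build the GraphQL request body for the fields discussed above. */
function buildRepoRequest(owner, name) {
  const query = `
    query ($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        description
        homepageUrl
        stargazerCount
        isArchived
        pushedAt
      }
    }`;
  return { query, variables: { owner, name } };
}
```

The returned object could be sent as the JSON body of a POST to https://api.github.com/graphql with a token in the Authorization header.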

I think we should get rid of the addedOn field. If you have a library that's 5 years old, who cares if it was added to sveltesociety.dev 3 years ago or last week?

@CaptainCodeman

I had a play around last evening doing exactly this and it works well - all you need to enter though is the npm package name and everything else can be retrieved from that.

This is a list of components (in no particular order) rendered doing that, with images coming from the open-graph / twitter cards taken from the project homepages where available:

[Image: Screenshot 2023-08-18 at 10 25 02 AM]

The readme can be rendered GFM-style on a detail page, and the original creation date and the last published version date are both available from npm (and are probably more useful than the date listed on the site). I'd suggest some vector similarity indexing to group related components, and/or a convention of keywords to decide which categories things go into, and showing similar or other component cards based on what is being viewed (this was storing things in Redis, which can provide the vector similarity indexing):

[Image: Screenshot 2023-08-18 at 10 36 24 AM]

You would probably want to have submissions require authentication, and an approval process before they showed up to protect against abuse, but those would both be fairly simple and you could then auto-approve based on previous approvals for the same account to cut down on work (maybe also just allow people to view verified / un-verified entries as a toggle).

@benmccann
Collaborator Author

benmccann commented Aug 18, 2023

Wow, that looks great!

The npm field is currently not required. I just pushed a commit to make sure that everything on https://sveltesociety.dev/components has it. https://sveltesociety.dev/templates doesn't have it at all. Some need it and some don't on https://sveltesociety.dev/tools, so we'll have to leave it optional for tools. I think that's fine though because tools don't currently have tags except for "official"

I'd suggest making the rendering a little more consistent, i.e. make them all look like the first one, "svelte-marquee", and not like "svelte-toolbox", which comes two later.

I agree we'll need an approval process. I think that we could leave it as GitHub PRs, which has some advantages: it requires no infrastructure, everyone can immediately pull down and run the site, admins can turn email notifications on or off, etc. There are a couple of big annoyances with GitHub PRs today, but they can be remedied:

  • friction of creating them - this will be a lot better because there will be so many fewer fields. In particular, "tags" has been a troublesome field: depending on the length of the line, prettier will or won't split it into multiple lines, so it frequently fails linting. This problem will disappear entirely.
  • merge conflicts - I think this happens because everyone inserts at the end. We could just tell people to insert in a random spot in the list.

The one issue we have right now is that if you add a new item, we need to regenerate the data from GitHub to grab the star count. We could change it so that the file with this data isn't checked in, but is fetched on build. If that causes us to send too much traffic, we could change deployments to happen once a week or upon request instead of with each commit.

@MacFJA
Collaborator

MacFJA commented Aug 19, 2023

> • friction of creating them - [...] in particular "tags" has been a troublesome field [...] this problem will disappear entirely

I'm not sure it's a good idea to blindly use keywords from GitHub or npm: since anyone can put whatever they want, we will end up with lots of different tags (some very similar to others), and filtering the data will be a nightmare (imagine scrolling through a list of hundreds of values).


> • merge conflicts - I think this happens because everyone inserts at the end. We could just tell people to insert in a random spot in the list

IMO, it will be hard/weird to ask people to put their data at a random place 🤔

A discussion on this subject was started by @stephane-vanraes here: #318


> The one issue we have right now is that if you add a new item, then we need to regenerate the data from github to grab the star count. We could change it so that the file with this data isn't checked in, but is fetched on build. If that's causing us to send too much traffic we could change deployments to happen once a week or upon request instead of with each commit

It's not too hard to run the star update script when something is merged on main.
The workflow needs to change a bit: instead of opening a Pull Request, a commit is added directly on main.


> We could make it so that the user only has to enter the title, npm package name, GitHub URL, category, and tags

If we only ask that, maybe an Issue template and a bot could do most of the job

The bot would check that the required data is provided and conforms to the spec and, on a trigger, create a commit that appends the data to the end of the file. There is no risk of conflict because the edit is atomic.
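A rough sketch of the validation step such a bot might run on the issue-template data. The field names (`title`, `npm`, `repository`, `category`, `tags`), the rules, and the function name `validateSubmission` are all assumptions, since the spec itself is not defined in this thread.

```javascript
// Sketch only: field names and rules are assumptions, not the site's actual spec.
function validateSubmission(entry, allowedCategories) {
  const errors = [];

  if (typeof entry.title !== 'string' || entry.title.trim() === '') {
    errors.push('title is required');
  }

  // Simplified approximation of npm naming rules: optional @scope/, lowercase only.
  const npmName = /^(@[a-z0-9][a-z0-9._-]*\/)?[a-z0-9][a-z0-9._-]*$/;
  if (typeof entry.npm !== 'string' || !npmName.test(entry.npm)) {
    errors.push('npm must be a valid package name');
  }

  // Expect exactly https://github.com/<owner>/<repo>.
  const repoUrl = /^https:\/\/github\.com\/[^/\s]+\/[^/\s]+$/;
  if (typeof entry.repository !== 'string' || !repoUrl.test(entry.repository)) {
    errors.push('repository must be a GitHub repository URL');
  }

  if (!allowedCategories.includes(entry.category)) {
    errors.push('category must be one of: ' + allowedCategories.join(', '));
  }

  const tagsOk =
    Array.isArray(entry.tags) &&
    entry.tags.every((tag) => typeof tag === 'string' && tag.trim() !== '');
  if (!tagsOk) {
    errors.push('tags must be an array of non-empty strings');
  }

  return errors; // an empty array means the entry passed every check
}
```

Returning a list of messages rather than throwing would let the bot post every problem back to the issue in one comment.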

@benmccann
Collaborator Author

benmccann commented Aug 21, 2023

> I'm not sure it's a good idea to blindly use keywords from GitHub or npm: since anyone can put whatever they want, we will end up with lots of different tags (some very similar to others), and filtering the data will be a nightmare (imagine scrolling through a list of hundreds of values).

The current tags are pretty awful as well, honestly. There are quite a lot that either duplicate the category or have only a couple of entries. I think we could clean up the categories and get rid of the tags. If we loaded the tags from npm, I'd get rid of the tags dropdown and just have the tags included via the free text search field.

> IMO, it will be hard/weird to ask people to put their data at a random place

Maybe. We could try it and see how it goes since it's basically no effort - just have to update the PR template. Even if people still stuck entries at the end I think it gets way easier to deal with merge conflicts if we have fewer fields.

@benmccann
Collaborator Author

I just updated the categories so that every component has a category now. I think we can get rid of the existing tags at this point. There's not much they help with that we can't get from just doing ctrl+f or using the search box on the page, and they pretty much just duplicate the category or have only a couple of items in them.

@CaptainCodeman

Categories are problematic because it's often difficult to decide which one a component should go into (which also means someone searching may be picking the wrong thing).

Tags potentially allow for many more categories and cater for the fact that many components will always overlap more than one. The issue of tags not always being perfect matches can be solved with a mix of word stemming and synonyms. "headless" and "renderless" are just different terms for the same thing, so a search for either should find components tagged with either. Likewise you might have "tailwind", "tailwind-css", "tailwind-ui", and every combination of space- and dash-separated or concatenated strings - they can all be normalized so a search for one is a search for all. IMO this works, whereas trying to come up with a restrictive set of tags that people have to adhere to (and select, in addition to the keywords they already chose when they published their component) is just too difficult. Other quick examples: "grid" and "masonry", "graph" and "chart", "forms" and "validation", then all the different ways to signify accessibility: "a11y", "aria", "wai-aria" etc...

It's a little effort to think up suitable synonyms, but like many things you quickly cover 80% of the common cases leaving just the obscure things.
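To make the synonym idea concrete, here is a small sketch of the normalization step. The synonym table is a hypothetical starting point built from the examples above, `normalizeTag` and `tagsMatch` are made-up names, and word stemming is left out entirely.

```javascript
// Hypothetical synonym table: each separator-free variant maps to one canonical tag.
const SYNONYMS = new Map([
  ['renderless', 'headless'],
  ['tailwindcss', 'tailwind'],
  ['tailwindui', 'tailwind'],
  ['aria', 'a11y'],
  ['waiaria', 'a11y'],
  ['accessibility', 'a11y'],
]);

/** Collapse case and separators, then map known synonyms to a canonical tag. */
function normalizeTag(tag) {
  // "Tailwind CSS", "tailwind-css" and "tailwindcss" all become "tailwindcss".
  const key = tag.toLowerCase().replace(/[\s_-]+/g, '');
  return SYNONYMS.get(key) || key;
}

/** True if any of the component's tags is equivalent to the search term. */
function tagsMatch(query, tags) {
  const wanted = normalizeTag(query);
  return tags.some((tag) => normalizeTag(tag) === wanted);
}
```

With this, a search for "renderless" would find a component tagged "headless", and "wai-aria" would find one tagged "a11y".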

Ultimately, they are just one thing to potentially search on or group things by. Combined with a full-text search on title and description (which I think most people would use before touching long drop-downs) that's usually all that is needed.

Once you find something close to what you're after, grouping and similarity should show other things you might be interested in.

Here's a working example: https://svelte.tips (you can submit new things, they'll go into a pending queue that is just a one-button approval).

@stephane-vanraes
Collaborator

> merge conflicts - I think this happens because everyone inserts at the end. We could just tell people to insert in a random spot in the list

Just an idea, but this could also be solved by having one file per package instead. Then adding a package is as simple as creating a new file, copying the template, and filling in the data.

@benmccann
Collaborator Author

I like that svelte.tips uses an npm icon rather than "npm install package" to show the npm package, and that it includes the last modified date.

The display of tags on svelte.tips looks pretty inconsistent, so I think I'd show max 10 and then a more button or something like that. E.g. svelte-particles has 50 while lots have none or just a few.

@MacFJA
Collaborator

MacFJA commented Sep 17, 2023

> The display of tags on svelte.tips looks pretty inconsistent, so I think I'd show max 10 and then a more button or something like that. E.g. svelte-particles has 50 while lots have none or just a few.

Maybe we could collect all the tags and keep only those that appear on between 20% and 80% of packages.

  • 20%: remove tags that are too unique, too precise, isolated
  • 80%: remove tags that are too common (like svelte, on NPM it makes sense, but on a dedicated website for Svelte it's obvious)

Those bounds can be tweaked to get better coverage or a better total number of tags (we could even try to adjust them automatically to match a fixed number of tags or a ratio with the number of packages).
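A minimal sketch of that frequency filter, assuming each package is an object with a `tags` array. `filterTagsByCoverage` is a hypothetical name, and the 0.2 and 0.8 defaults are just the bounds suggested above.

```javascript
// Sketch only: keeps tags whose share of packages falls within [min, max].
function filterTagsByCoverage(packages, min = 0.2, max = 0.8) {
  const counts = new Map();
  for (const pkg of packages) {
    // Count each tag once per package, even if it is listed twice.
    for (const tag of new Set(pkg.tags)) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }

  const kept = [];
  for (const [tag, count] of counts) {
    const share = count / packages.length;
    if (share >= min && share <= max) kept.push(tag);
  }
  return kept.sort();
}
```

Adjusting the bounds automatically could then be a loop that widens or narrows `min` and `max` until the returned list reaches the target size.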

@benmccann
Collaborator Author

I don't think there's any harm in having unique tags in the dataset. It may help people find what they're looking for. I just don't know that we need to display them all in the UI. Removing "svelte" could make sense as a special case. I don't know if there are others that would make sense not to show though
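One way to keep every tag in the dataset while trimming what the UI shows could look like the sketch below. The function name, the default limit of 10, and the hidden list are assumptions taken from the suggestions in this thread, not existing site code.

```javascript
// Sketch only: the full tag list stays untouched for search; this only
// decides what a card displays.
function splitTagsForDisplay(tags, limit = 10, hidden = ['svelte']) {
  const shown = tags.filter((tag) => !hidden.includes(tag.toLowerCase()));
  return {
    visible: shown.slice(0, limit),
    // Number of tags that would sit behind a "more" button.
    moreCount: Math.max(0, shown.length - limit),
  };
}
```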

@MacFJA
Collaborator

MacFJA commented Nov 4, 2023

Meaningful data that can be pulled from:

  • GitHub (GraphQL [1]):
    • description
    • homepageUrl
    • isLocked
    • isTemplate
    • languages
    • lockReason
    • name
    • nameWithOwner
    • openGraphImageUrl
    • releases
    • repositoryTopics
    • shortDescriptionHTML
    • stargazerCount
  • GitLab (GraphQL [2]):
    • archived
    • description
    • languages
    • name
    • nameWithNamespace
    • starCount
    • topics
    • webUrl
    • releases
  • NPMS [3] ("REST"):
    • name
    • keywords
    • version
    • description
    • date
    • repository
    • homepage [4]
    • starsCount [5]
    • score (quality, popularity, maintenance, global)
    • deprecated

| GitHub | GitLab | NPMS | Description |
| --- | --- | --- | --- |
| description | description | description | Project description |
| isLocked | archived | deprecated | Is the project active |
| homepageUrl | webUrl | homepage | Public URL of the project |
| isTemplate | - | - | Is the repository a repository template |
| languages | languages | - | Languages in which source files are written |
| lockReason | - | deprecated | The reason why the project is not active |
| name | name | name | The name of the repository |
| nameWithOwner | nameWithNamespace | name | The complete name of the repository |
| openGraphImageUrl | - | - | The image that describes the project (share image) |
| repositoryTopics | topics | keywords | The list of tags/keywords of the project |
| shortDescriptionHTML | - | - | The description without any links |
| stargazerCount | starCount | starsCount | The number of stars on the repository (already used for GitHub and GitLab) |
| Calculated from releases | Calculated from releases | version | The last version of the package |
| Calculated from releases | Calculated from releases | date | Date of the last version |
| - | - | repository | URL of the repository |
| - | - | score | Score of the package |

Warning

A lot of this information is optional.
For example:

  • The release feature on GitHub and GitLab is not mandatory, but the version and date from NPMS are always present
  • deprecated on NPMS is used less often than archiving/locking a repository (⚠️ personal observation, no statistics on it)
  • etc.
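As a sketch of how the field mapping above could be applied, the snippet below renames source-specific fields to one common shape. It assumes each API response has already been flattened into an object keyed by the field names listed above; `FIELD_MAP`, `toCommonShape`, and the common key names are hypothetical. Because much of the data is optional, missing fields become `null`.

```javascript
// Sketch only: maps the per-source field names listed above to one common shape.
const FIELD_MAP = {
  github: {
    description: 'description',
    homepage: 'homepageUrl',
    name: 'name',
    fullName: 'nameWithOwner',
    tags: 'repositoryTopics',
    stars: 'stargazerCount',
  },
  gitlab: {
    description: 'description',
    homepage: 'webUrl',
    name: 'name',
    fullName: 'nameWithNamespace',
    tags: 'topics',
    stars: 'starCount',
  },
  npms: {
    description: 'description',
    homepage: 'homepage',
    name: 'name',
    fullName: 'name',
    tags: 'keywords',
    stars: 'starsCount',
  },
};

function toCommonShape(source, record) {
  const map = FIELD_MAP[source];
  if (!map) throw new Error('Unknown source: ' + source);

  const out = {};
  for (const [common, original] of Object.entries(map)) {
    // Optional upstream fields fall back to null so consumers can test for them.
    out[common] = record[original] !== undefined ? record[original] : null;
  }
  return out;
}
```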

Footnotes

  1. https://docs.github.com/en/graphql

  2. https://docs.gitlab.com/ee/api/graphql/reference/#queryproject

  3. https://npms.io/

  4. Multiple values exist: one from the package.json and one from the repository

  5. Multiple values exist: one from the repository and the others from somewhere else 😱

@lachlancollins
Collaborator

lachlancollins commented Nov 7, 2023

> Data that can be pull (and meaningful) from:

@MacFJA I've started working on this sort of idea in my fork! I've written updateNpm.js which pulls the latest release's version and date, as well as updatePublint.js which checks if the package passes Publint. They take ~20 seconds in total to check all packages in components.json. I'll work on it more once the Zod PR is sorted out since the scripts depend on that for validation.

EDIT: I've expanded the npm script to also get the supported Svelte versions from peerDependencies (support.json) - we can then use a semver util to work out which major versions are supported. Something like this will probably be helpful with the Svelte 5 upgrade!
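For readers who have not seen the fork, here is a rough sketch of the kind of extraction described. It is not the actual updateNpm.js; `summarizePackument` is a hypothetical name. It assumes `packument` is the JSON document the npm registry returns for a package (for example from https://registry.npmjs.org/<name>), which carries `dist-tags`, `time`, and `versions` keys.

```javascript
// Sketch only: pulls the latest version, its publish date, and the svelte
// peer dependency range out of an npm registry package document.
function summarizePackument(packument) {
  const latest = packument['dist-tags'] && packument['dist-tags'].latest;
  if (!latest) return null; // nothing published under the "latest" tag

  const manifest = (packument.versions && packument.versions[latest]) || {};
  const peers = manifest.peerDependencies || {};

  return {
    version: latest,
    date: (packument.time && packument.time[latest]) || null,
    // Raw range string such as "^3.0.0 || ^4.0.0"; null if svelte is not a peer.
    sveltePeerRange: peers.svelte || null,
  };
}
```

Turning the raw range into a list of supported major versions is left to a semver utility, as the comment suggests.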

@lachlancollins
Collaborator

Hi @benmccann and @MacFJA, I've submitted a draft PR (#499) which makes a start on automatically fetching data from the npm API - I would appreciate some feedback on the approach!
