
Summarize the extracted Content #33

Open
Sarahkhan20 opened this issue Dec 17, 2024 · 10 comments
Labels
suggestion New feature or request

Comments

@Sarahkhan20

Sometimes, to understand what exactly a codebase does, we need to summarize the whole codebase, so add a feature for summarization. Shall I work on this feature, using a Meta or Gemini model?

@cyclotruc
Owner

This is something I have in mind, but I'm not convinced yet that we should start using LLMs in gitingest.
Currently the extraction logic doesn't involve any LLM, and that has a lot of benefits.
I think I would like it to stay that way for now; we already have a lot that we can improve in "classic" code before we start integrating LLMs.

That being said, if you want to start a PoC of this, I'm very interested to see how it turns out.

@Sarahkhan20
Author

Sure, thanks!

@PylotLight

I agree that summarisation should be done properly with other tools, which will likely require a full RAG/vector-store setup due to the size of the context required. Unless you use Google's 1M+ token context window, it's not a simple task to just pull everything in here.
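
To make the idea concrete, here is a rough sketch of the retrieval step such a setup might involve. The `embed(text)` helper and the function name are hypothetical placeholders for whatever embedding model you'd plug in, not anything that exists in gitingest today:

```python
import numpy as np

def top_k_chunks(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity.

    `embed` is assumed to map a string to a 1-D numpy vector (any embedding
    model would do); only the retrieved chunks are then passed to an LLM,
    keeping the prompt well under the model's context window.
    """
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = []
    for chunk in chunks:
        v = embed(chunk)
        scores.append(float(np.dot(q, v / np.linalg.norm(v))))
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```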

@cyclotruc cyclotruc added the suggestion New feature or request label Dec 19, 2024
@cyclotruc
Owner

cyclotruc commented Dec 19, 2024

You both have a valid point here: in the long term, LLMs and vectorisation techniques will be mandatory to achieve the best summary possible.

So let me rephrase what I said earlier:
This step will eventually come for gitingest, but I want to stay focused for now on improving the simple "declarative code" ingestion.

The idea behind this is:

  • Less overhead, less complexity:
    The project is still young, and any added complexity (dependencies) should be carefully considered. Right now it's very easy to contribute, but even a simple local Ollama running on CPU would make it harder for some people to get onboarded onto the codebase.

  • Performance:
    A summarization step using LLMs or vectorisation would come with a tradeoff in speed, and gitingest is focused on bringing a smooth user experience, so that work would be better approached once we have a proper "profiling & optimisation" workflow in place.

  • There's lower-hanging fruit to pick for now:
    I think we can already push the quality of the digest with simple ingestion logic:
    we know what codebases look like on average for popular languages,
    and it's certainly possible to improve based on known patterns before having to rely on models to make finer-grained choices.

In the meantime, feel free to either:

  • Draft a PoC around this idea; maybe I need to change my mind
  • Start gathering resources or ideas that could help us once we start working on this milestone

@argishh

argishh commented Dec 24, 2024

I totally agree with you.
Before using a local LLM, we need to consider the perspectives of all types of users. Let's work on the PoC first; then it'll be easier to decide the next steps.

A few suggestions from my end,

  1. Make this feature optional; local LLMs or Gemini don't have to be a dependency.

  2. We can use API-based models by letting users supply their own API keys (see the sketch below).
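
A minimal sketch of what that opt-in, user-supplied-key flow could look like, using the google-generativeai client purely as an example; the function name, prompt, and model choice are assumptions for illustration, not part of gitingest:

```python
# Sketch only: summarization is opt-in and runs solely when the user
# passes their own API key; without a key, nothing LLM-related happens.
import google.generativeai as genai

def summarize_digest(digest: str, api_key: str | None = None) -> str | None:
    """Return an LLM-generated summary of the digest, or None if no key is given."""
    if not api_key:
        return None  # feature stays optional: no key, no LLM call, no extra dependency in use
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    response = model.generate_content(
        "Summarize what this codebase does, based on the digest below:\n\n" + digest
    )
    return response.text
```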

@cyclotruc
Owner

Very good point: making it optional is a good approach to this transition, with the option to use API models as well.

@Sarahkhan20
Author

Hey, I'm glad this post is really engaging! I'm currently working on a PoC for this. As suggested by @argishh, letting users supply their own API key is very feasible, since the target audience is developers who likely know how to get an API key; alternatively, we can write instructions for doing so. However, large repositories can exhaust 22K tokens or even more, which is why Gemini, with around 1M tokens per minute, is preferable to Groq or other LLM providers. We can also add other optimizations, like reducing the input or a bucketing algorithm; let's see how it goes. Right now I'm busy with ICPC, but I have a prototype ready for small repositories, so feel free to check it out and create issues. Once I'm done with the ICPC regionals on 5th January, I'll implement the ideas I've shared here for large repositories. In the meantime, check this out: https://github.com/Sarahkhan20/GitZen (it works for small repositories for now).
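
To illustrate the bucketing idea mentioned above, here is a minimal map-reduce style sketch, assuming a generic summarize(text) callable; the chunk size and helper name are illustrative and not taken from GitZen:

```python
from typing import Callable

def summarize_large_digest(
    digest: str,
    summarize: Callable[[str], str],
    max_chars: int = 80_000,  # rough stand-in for a per-request token budget
) -> str:
    """Summarize a digest too large for one context window by chunking, then merging."""
    # Map step: split the digest into buckets and summarize each one independently.
    chunks = [digest[i : i + max_chars] for i in range(0, len(digest), max_chars)]
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: merge the partial summaries into a single overview.
    return summarize(
        "Combine these partial summaries into one overview:\n\n" + "\n\n".join(partials)
    )
```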

@argishh

argishh commented Dec 24, 2024

@Sarahkhan20 nice work! I've gone through the code, and so far it looks great. I'll try it out next. I'm also interested in knowing how you're planning to optimize it for larger repos. Don't hesitate to reach out if you need any help ideating or implementing.

@joydeep049
Contributor

I agree with @cyclotruc here.

Before we move on to adding more features, and trust me, an LLM summarisation feature would be a huge one that will probably need a lot of testing and time, we should make sure that the current version of gitingest works nicely and robustly, and that we employ the best coding practices moving forward.

Being a Machine Learning Engineer myself, I cannot help but want to work on such features!

@argishh

argishh commented Jan 1, 2025

Definitely, @joydeep049. That's why @Sarahkhan20 started working on a separate PoC first. It'll take a while before the PoC is functional for larger repos, which need summarization the most, so there's no rush as of now.
