
Summarize the extracted Content #33

Open
Sarahkhan20 opened this issue Dec 17, 2024 · 10 comments
Labels
suggestion New feature or request

Comments

@Sarahkhan20

Sometimes, to understand what exactly a codebase does, we need to summarize the whole codebase, so add a feature for summarization. Shall I work on this feature, using a Meta or Gemini model?

@cyclotruc
Owner

This is something I have in mind, but I'm not convinced yet that we should start using LLMs in gitingest.
Currently the extraction logic doesn't involve any LLM, and that has a lot of benefits.
I think I would like it to stay that way for now; we already have a lot that we can improve in "classic" code before we start integrating LLMs.

That being said, if you want to start a PoC of this, I'm very interested to see how it turns out.

@Sarahkhan20
Author

Sure, thanks!

@PylotLight

I agree that summarisation should be done properly with other tools, which will likely require a full RAG/vector-store setup due to the size of the context required. Unless you use Google's 1M+ token context window, it's not a simple task to just pull everything in here.
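
To make the idea concrete, here is a rough sketch of the retrieval step such a setup might involve. The `embed(text)` helper and the function name are hypothetical placeholders for whatever embedding model you'd plug in, not anything that exists in gitingest today:

```python
import numpy as np

def top_k_chunks(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity.

    `embed` is assumed to map a string to a 1-D numpy vector (any embedding
    model would do); only the retrieved chunks are then passed to an LLM,
    keeping the prompt well under the model's context window.
    """
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = []
    for chunk in chunks:
        v = embed(chunk)
        scores.append(float(np.dot(q, v / np.linalg.norm(v))))
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```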

@cyclotruc cyclotruc added the suggestion New feature or request label Dec 19, 2024
@cyclotruc
Owner

cyclotruc commented Dec 19, 2024

You both have a valid point here: in the long term, LLMs and vectorisation techniques will be mandatory to achieve the best summary possible.

So let me rephrase what I said earlier:
This step will eventually come for gitingest, but I want to stay focused for now on improving the simple "declarative code" ingestion.

The idea behind this is:

  • Less overhead, less complexity:
    The project is still young, and any added complexity (dependencies) should be carefully considered. Right now it's very easy to contribute, but even a simple local Ollama running on CPU would make it harder for some people to get onboarded onto the codebase.

  • Performance:
    A summarization step using LLMs or vectorisation would come with a tradeoff in speed, and gitingest is focused on bringing a smooth user experience, so that work would be better approached once we have a proper "profiling & optimisation" workflow in place.

  • There's lower-hanging fruit to pick for now:
    I think we can already push the quality of the digest with simple ingestion logic:
    we know what codebases look like on average for popular languages,
    and it's certainly possible to improve based on known patterns before having to rely on models to make finer-grained choices.

In the meantime, feel free to either:

  • Draft a PoC around this idea; maybe I need to change my mind
  • Start gathering resources or ideas that could help us once we start working on this milestone

@argishh

argishh commented Dec 24, 2024

I totally agree with you.
Before using a local LLM, we need to consider the perspectives of all types of users. Let's work on the PoC first; then it'll be easier to decide the next steps.

A few suggestions from my end,

  1. Make this feature optional; local LLMs or Gemini don't have to be a dependency.

  2. We can use API-based models by letting users supply their own API keys (see the sketch below).
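
A minimal sketch of what that opt-in, user-supplied-key flow could look like, using the google-generativeai client purely as an example; the function name, prompt, and model choice are assumptions for illustration, not part of gitingest:

```python
# Sketch only: summarization is opt-in and runs solely when the user
# passes their own API key; without a key, nothing LLM-related happens.
import google.generativeai as genai

def summarize_digest(digest: str, api_key: str | None = None) -> str | None:
    """Return an LLM-generated summary of the digest, or None if no key is given."""
    if not api_key:
        return None  # feature stays optional: no key, no LLM call, no extra dependency in use
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    response = model.generate_content(
        "Summarize what this codebase does, based on the digest below:\n\n" + digest
    )
    return response.text
```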

@cyclotruc
Owner

Very good point: making it optional is a good approach to this transition, with the option to use API models as well.

@Sarahkhan20
Author

Hey, I'm glad this post is really engaging! I'm currently working on a PoC for this. As suggested by @argishh, letting users supply their own API key is very feasible, since the target audience is developers who likely know how to get an API key; alternatively, we can write instructions for doing so. However, large repositories can exhaust 22K tokens or even more, which is why Gemini, with around 1M tokens per minute, is preferable to Groq or other LLM providers. We can also add other optimizations, like reducing the input or a bucketing algorithm; let's see how it goes. Right now I'm busy with ICPC, but I have a prototype ready for small repositories, so feel free to check it out and create issues. Once I'm done with the ICPC regionals on 5th January, I'll implement the ideas I've shared here for large repositories. In the meantime, check this out: https://github.com/Sarahkhan20/GitZen (it works for small repositories for now).
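
To illustrate the bucketing idea mentioned above, here is a minimal map-reduce style sketch, assuming a generic summarize(text) callable; the chunk size and helper name are illustrative and not taken from GitZen:

```python
from typing import Callable

def summarize_large_digest(
    digest: str,
    summarize: Callable[[str], str],
    max_chars: int = 80_000,  # rough stand-in for a per-request token budget
) -> str:
    """Summarize a digest too large for one context window by chunking, then merging."""
    # Map step: split the digest into buckets and summarize each one independently.
    chunks = [digest[i : i + max_chars] for i in range(0, len(digest), max_chars)]
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: merge the partial summaries into a single overview.
    return summarize(
        "Combine these partial summaries into one overview:\n\n" + "\n\n".join(partials)
    )
```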

@argishh

argishh commented Dec 24, 2024

@Sarahkhan20 nice work! I've gone through the code, and so far it looks great. I'll try it out next. I'm also interested in knowing how you're planning to optimize it for larger repos. Don't hesitate to reach out if you need any help ideating or implementing.

@joydeep049
Contributor

I agree with @cyclotruc here.

Before we move on to adding more features, and trust me, an LLM summarisation feature would be a huge one that will probably need a lot of testing and time, we should make sure that the current version of gitingest works nicely and robustly, and that we employ the best coding practices moving forward.

Being a Machine Learning Engineer myself, I cannot help but want to work on such features!

@argishh

argishh commented Jan 1, 2025

Definitely, @joydeep049. That's why @Sarahkhan20 started working on a separate PoC first. It'll take a while before the PoC is functional for larger repos, which need summarization the most, so there's no rush as of now.
