Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LLM analyzers #508

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

rbs-jacob
Copy link
Member

One sentence summary of this PR (This should go in the CHANGELOG!)

Add AI-enhanced LLM analyzers.

Anyone you think should look at this, specifically?

@whyitfor

@rbs-jacob rbs-jacob requested a review from whyitfor October 30, 2024 21:53
@rbs-jacob
Copy link
Member Author

rbs-jacob commented Oct 30, 2024

Outstanding to do items:

  • Push LLM-enhanced program/function analyzers
  • Update changelog

# The data summary is un-informative and verbose
if DataSummary in model.attributes:
del model.attributes[DataSummary]
return serialize(make_serializable(model.attributes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this essentially a re-implementation of what the PJSONSerializationService does? Is there a reason to not use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PJSON has types alongside the values. This is useful for deserializing PJSON values, but would make this code overly complicated, since we never need to deserialize the resulting text and thus would end up adding logic to drop the types anyway.

Moreover, PJSON is nice as a wire format, but is not optimized for human-readable text "serialization." We want to deviate from its representation of some values to make the resulting text easier for the LLM to understand.

I also (personally) just don't like the PJSON serialization service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that the serialized PJSON would hinder the LLM analysis? This sounds like a hypothesis and not something that was tested.

We should, whenever possible, avoid writing ~80 lines of code if it can be avoided.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could surely get something working with PJSON. The (admittedly untested) hypothesis is that the code to do so would be as long as this code, less clear than this code, and dependent on undocumented serialized PJSON structure, which may change in the future, necessitating changes here.

If you still think I should rewrite this to use PJSON, I will.

Copy link
Contributor

@whyitfor whyitfor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commenda

Comment on lines +80 to +81
# Install Ollama
RUN curl -L "https://ollama.com/download/ollama-linux-""$TARGETARCH"".tgz" | tar -C /usr/ -xzv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to add this to the Docker image?

This feels like an incomplete install step -- we're adding a dependency that requires the user to download/install more things to actually use it. The impact is a larger image size for all users, plus more work for users who actually want to use the LLM feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you propose doing it instead?

It would be ideal to have this whole thing in a separate package so that users could conditionally include the AI components. Then we could install Ollama and pull a model without worrying about unnecessarily inflating the core image. But one of the original constraints was that this should be in OFRAK core.

If we don't have Ollama installed, then we can't run the tests without having the tests themselves pull down and install the binary. As far as I know, in other OFRAK tests, there is no precedent for installing a dependency like this. It would be doable, but seems a little weird.

On the other hand, we could pull a model in the OFRAK core Dockerstub after installing Ollama, but this seems like a bad idea. We don't want to add another several-hundred-MB dependency to the OFRAK core images. We especially don't want to add models this large to the base image if the user is just going to use OpenAI anyway.

Installing Ollama, but not pulling a model, seemed like a reasonable tradeoff. If it seems wrong to you, I would appreciate guidance on a better way to do it.

ofrak_core/Makefile Show resolved Hide resolved
ofrak_core/ofrak/core/llm.py Show resolved Hide resolved
ofrak_core/ofrak/core/llm.py Outdated Show resolved Hide resolved
ofrak_injector.discover(ofrak_ghidra)


async def test_llm_component(ofrak_context: OFRAKContext, model: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the https://github.com/redballoonsecurity/ofrak/blob/master/ofrak_core/test_ofrak/unit/component/analyzer/analyzer_test_case.py#L34 to run these tests?

Right now these tests are not validating that the analyzer successfully generates the needed attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that validation in these tests. It does not make sense to use AnalyzerTests subclass here since the analyzer must be run with resource.run. It requires a config, and thus cannot be called with resource.analyze, as happens in AnalyzerTests methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the tests do call get_attributes at the end, so they're validating that the attributes are added.

# The data summary is un-informative and verbose
if DataSummary in model.attributes:
del model.attributes[DataSummary]
return serialize(make_serializable(model.attributes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that the serialized PJSON would hinder the LLM analysis? This sounds like a hypothesis and not something that was tested.

We should, whenever possible, avoid writing ~80 lines of code if it can be avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants