This project uses LLMs to analyze Java source code and determine method purity, which indicates whether a method is suitable for memoization.
The `phase1/` directory contains all the data related to the first phase of analysis.
- `java_source_code_files/`: Contains the raw Java source code files that are being analyzed.
- `csv_files/`: Holds CSV files that drive the analysis pipeline, such as `java_filepaths.csv`, which lists the source code filepaths to be processed.
- `java_llm_analysis_files/`: Stores the raw JSON output from each individual pass of the LLM analysis. Each sub-directory (e.g., `gemini-2.5-pro-pass-1`) corresponds to a single run.
- `aggregated_purity_analyses/`: Contains the final, aggregated analysis results. The JSON files here are created by combining the results from the multiple passes in `java_llm_analysis_files/`. `gemini-2.5-pro-three-passes` is based on `gemini-2.5-pro-pass-1`, `gemini-2.5-pro-pass-2`, and `gemini-2.5-pro-pass-3`.
This directory contains all the Python scripts used to run the analysis pipeline.
- `fetch_code.py`: Fetches Java source code from URLs provided either individually or in a CSV file and saves the files into the `phase1/java_source_code_files/` directory, ready for analysis.
- `create_filepaths_csv.py`: A utility script that generates a CSV file of file paths from a list of source code URLs. This CSV is used as input for the labeling script.
- `label_llm.py`: The main script for LLM analysis. It reads the Java files specified in a CSV, sends the code to an LLM for purity analysis, and saves the results as structured JSON files in the `phase1/java_llm_analysis_files/` directory.
- `aggregate_purity_results.py`: Aggregates the results from multiple LLM analysis passes. It combines the JSON outputs for each file, calculates a consensus on method purity (pure, impure, or mixed), and saves the final aggregated JSON in the `phase1/aggregated_purity_analyses/` directory.
- The remaining scripts are for experimenting with logic before new functionality is added to the scripts above. They can be ignored.
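The consensus step described for `aggregate_purity_results.py` could look roughly like the sketch below. It is an illustration, not the script's actual code. It assumes each pass yields a mapping from method name to a `"pure"` or `"impure"` label, and that a method is `"mixed"` whenever the passes disagree. The function names `consensus_label` and `aggregate_passes` and the sample data are hypothetical.

```python
from collections import defaultdict


def consensus_label(labels):
    """Return 'pure' or 'impure' when every pass agrees, otherwise 'mixed'."""
    unique = set(labels)
    if unique == {"pure"}:
        return "pure"
    if unique == {"impure"}:
        return "impure"
    return "mixed"


def aggregate_passes(passes):
    """Combine per-pass {method_name: label} mappings into one consensus mapping."""
    votes = defaultdict(list)
    for pass_result in passes:
        for method, label in pass_result.items():
            votes[method].append(label)
    return {method: consensus_label(labels) for method, labels in votes.items()}


# Hypothetical labels for the methods of one Java file across three passes.
passes = [
    {"add": "pure", "log": "impure", "cache": "pure"},
    {"add": "pure", "log": "impure", "cache": "impure"},
    {"add": "pure", "log": "impure", "cache": "pure"},
]
result = aggregate_passes(passes)
print(result)  # {'add': 'pure', 'log': 'impure', 'cache': 'mixed'}
```

Under this assumed rule, only methods labeled pure in every pass would be treated as memoization candidates. A majority-vote rule would be an equally plausible alternative.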
MIT License
Copyright (c) 2025 Muhammad Musa Khan. Attribution required, no warranty provided.