This project explores whether large language models (LLMs), baseline methods, and traditional classification algorithms can identify code clones across different programming languages.
This work aims to:
- Evaluate the effectiveness of LLMs in cross-lingual code clone detection.
- Compare LLM performance to baselines and traditional methods.
- Determine which approach best identifies similar code across languages.
This repository contains the following files and directories:

- `data_selection`: all the code used to select the dataset subsets.
- `data`: the subsets of XLCoST and CodeNet used in the experiments.
- `classifier`: all the code and data for the classification part (a baseline sketch follows this list).
- `results`: all the results for each dataset and each LLM.
- `get_embeddings.py`: generates the embedding vector for each code snippet (see the sketch after this list).
- `p`, `se`, `sct`: run the gpt-3.5-turbo experiments; which script to use depends on the experiment (a prompting sketch follows this list).
- `llama2_inf`, `falcon_inf`, `starchat_inf`, `starcoder_inf`: run the experiments with llama-2-7b-chat-hf, falcon-7b-instruct, starchat-beta, and starcoder2-15b-instruct, respectively.
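As an illustration of the embedding step, the sketch below shows one common way to turn a code snippet into a vector with a HuggingFace encoder. The model name (`microsoft/codebert-base`) and the mean-pooling strategy are assumptions for illustration; `get_embeddings.py` may use a different model or pooling.

```python
# Hypothetical sketch of the embedding step: encode a code snippet with a
# HuggingFace model and mean-pool the token states into one fixed-size vector.
# The model name and pooling choice are assumptions, not the repo's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/codebert-base"  # assumed encoder; swap for the one used here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(snippet: str) -> torch.Tensor:
    """Return a single embedding vector for one code snippet."""
    inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    # Mean-pool over non-padding tokens to get one vector per snippet.
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

vec = embed("def add(a, b):\n    return a + b")
print(vec.shape)  # torch.Size([1, 768]) for codebert-base
```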
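For the classification part, a traditional baseline can be trained directly on pairs of snippet embeddings. The sketch below fits a scikit-learn logistic regression on the absolute difference of the two vectors in each pair; the feature construction and classifier choice are assumptions, and the random arrays merely stand in for real embeddings produced by `get_embeddings.py`.

```python
# Illustrative sketch of a traditional baseline for the classifier directory:
# train a binary classifier on pairs of snippet embeddings. The pair featurization
# (absolute difference of the two vectors) is an assumption for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Placeholder data: in the real pipeline these would come from get_embeddings.py.
emb_a = rng.normal(size=(1000, 768))    # embeddings of the first snippet in each pair
emb_b = rng.normal(size=(1000, 768))    # embeddings of the second snippet
labels = rng.integers(0, 2, size=1000)  # 1 = clone pair, 0 = non-clone pair

features = np.abs(emb_a - emb_b)        # one common way to featurize a pair
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```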
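The LLM experiments amount to asking a chat model whether two snippets are clones. The sketch below shows one possible prompt and answer-parsing approach for gpt-3.5-turbo using the official `openai` client; the exact prompt wording and parsing used in `p`, `se`, and `sct` may differ.

```python
# Hedged sketch of the kind of query the gpt-3.5-turbo scripts could send;
# the prompt wording and answer parsing here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def are_clones(snippet_a: str, snippet_b: str) -> bool:
    """Ask the model whether two snippets implement the same functionality."""
    prompt = (
        "Do these two code snippets solve the same problem? Answer yes or no.\n\n"
        f"Snippet 1:\n{snippet_a}\n\nSnippet 2:\n{snippet_b}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```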