An embedding-based approach to detect similar software projects.
Initially, we created a novel approach to representing code as part of a source code topic modeling project (not finished yet). Along the way, we found that the new representation lets us explicitly measure the distance between arbitrary pieces of code (and even whole projects!), and that at the project level it works reasonably well. We decided to implement it as a stand-alone tool to verify the feasibility of our approach and share it with the community.
Also, Sosed means "neighbor" in Russian.
First, we took a large corpus of sub-token sequences and trained their embeddings with fastText. Then, we clustered the embeddings with spherical K-means into 256 groups of semantically similar sub-tokens. These clusters reflect topics that occur at the sub-token level in a large corpus of source code.
We represent code as a distribution of these clusters over its sub-tokens. We hypothesize that fragments of code with similar distributions are also similar in a broader sense.
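To make the representation concrete, here is a minimal sketch of how a fragment's sub-tokens could be turned into a cluster distribution. It is an illustration rather than Sosed's actual code: the file names (`subtokens.bin`, `centroids.npy`) are hypothetical placeholders, and we assume the centroids are L2-normalized so that the nearest cluster can be found via an inner product.

```python
import numpy as np
import fasttext

NUM_CLUSTERS = 256

# Placeholders: a pre-trained fastText model for sub-tokens and pre-computed
# spherical K-means centroids of shape (256, embedding_dim), L2-normalized.
model = fasttext.load_model("subtokens.bin")
centroids = np.load("centroids.npy")

def cluster_distribution(subtokens):
    """Assign each sub-token to its nearest cluster and normalize the counts."""
    counts = np.zeros(NUM_CLUSTERS)
    for token in subtokens:
        vec = model.get_word_vector(token)
        vec = vec / (np.linalg.norm(vec) + 1e-12)
        counts[np.argmax(centroids @ vec)] += 1
    return counts / counts.sum()  # probability of each of the 256 clusters

# Sub-tokens extracted from a small (hypothetical) code fragment:
print(cluster_distribution(["parse", "json", "request", "response"]))
```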
We took a dataset of 9 million GitHub repositories from the paper by Markovtsev et al. It contains all the repositories on GitHub as of the end of 2016, excluding explicit and implicit forks (see the paper for details).
We computed the aforementioned representations for all the repositories. A cluster distribution can be seen as a 256-dimensional vector, where the coordinate along dimension C is the probability of cluster C appearing among the project's sub-tokens. We propose two ways to measure the similarity of the distributions: either KL-divergence or cosine similarity. While the first option has a clearer motivation from the mathematical perspective, our experience with the tool shows that both options produce similar results. It is worth noting that in the case of discrete distributions, both maximization of cosine similarity and minimization of KL-divergence can be reduced to the maximization of an inner product of two vectors.
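To illustrate both options and the inner-product reduction, here is a short sketch (not Sosed's implementation) that scores a query distribution `p` against a matrix `Q` whose rows are the cluster distributions of the reference projects:

```python
import numpy as np

def cosine_scores(p, Q):
    # After L2-normalizing both sides, cosine similarity is just an inner product.
    p = p / np.linalg.norm(p)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return Qn @ p

def kl_scores(p, Q, eps=1e-10):
    # KL(p || q) = sum(p * log p) - sum(p * log q); the first term is constant for
    # a fixed query p, so minimizing KL is equivalent to maximizing <p, log q>.
    return np.log(Q + eps) @ p

def nearest(p, Q, k=10, metric="kl"):
    scores = kl_scores(p, Q) if metric == "kl" else cosine_scores(p, Q)
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar projects
```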
In the end, for a given project, the tool outputs its nearest neighbors among these 9 million projects, sorted by similarity.
In order for Sosed not to act as a black-box, we manually labeled all 256 clusters with short descriptions of their topics.
If you pass the `--explain` flag to Sosed, it will identify the clusters that contributed the most to the similarity and print their descriptions.
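One way such an explanation can be computed is by ranking clusters by their per-dimension contribution to the similarity score. The sketch below only illustrates the idea and is not necessarily the attribution scheme Sosed uses; `cluster_labels` stands for the 256 manually written descriptions.

```python
import numpy as np

def explain(p, q, cluster_labels, top=5):
    """Print the clusters contributing the most to the KL-style score <p, log q>."""
    contributions = p * np.log(q + 1e-10)
    for c in np.argsort(contributions)[::-1][:top]:
        print(f"cluster {c}: {cluster_labels[c]} (contribution {contributions[c]:.4f})")
```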
We recorded a short video demonstrating the features of Sosed: finding similar projects, filtering them by language and GitHub stars, and producing explainable output. The video is available on YouTube.
To run the tool, clone this project:
git clone https://github.com/JetBrains-Research/sosed.git
- Install required dependencies:
pip install cython
pip install -r requirements.txt
- Alternatively, create a conda environment with the required dependencies:
conda env create --file conda_env.yml
conda activate sosed-env
- Download and set up the tokenizer (approx. 75 MB of grammars for tree-sitter; also available as a stand-alone tool):
python3 -m sosed.setup_tokenizer
- Create an input file with links to GitHub repositories or paths to local projects (see input_examples for examples; a sample input file is shown after these steps).
- Run the tool. On the first run, it will download several data files (approx. 300 MB archived, 960 MB unpacked):
python3 -m sosed.run -i input_examples/input.txt -o output/output_example
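For reference, an input file with GitHub links might look like the hypothetical example below, assuming one link per line (the files in input_examples are the authoritative reference); with the `--local` flag, the lines would instead be paths to local directories.

```
https://github.com/JetBrains-Research/sosed
https://github.com/pytorch/pytorch
https://github.com/golang/go
```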
Alternatively, you can run Sosed with Docker:
- Pull the Docker image:
docker pull egorbogomolov/sosed:latest
- Map the `input`, `output`, and `data` directories from inside the container to the local filesystem to cache both auxiliary and output data. This also lets you inspect the output afterwards (e.g., check the vocabulary for analyzed projects) outside the container:
docker run \
    -v "$(pwd)"/input_examples:/input_examples/ \
    -v "$(pwd)"/output:/output/ \
    -v "$(pwd)"/data:/data \
    egorbogomolov/sosed:latest -i input_examples/input.txt -o output/examples_output/
- Sosed cannot be run from source on Windows because some of its dependencies do not support Windows. Consider using WSL or Docker for Windows.
- Clusters were labeled manually, and even though we tried our best, the range of topics is very broad, from "CPU fans" to "Large Hadron Collider". Some descriptions may be imprecise, so feel free to open an issue if you find any debatable ones!
Sosed supports several command-line options:
- `-l`, `--local` – if passed, the input file will be treated as a list of paths to local directories.
- `-f`, `--force` – if passed, all stages will be re-run; otherwise, stored intermediate data will be used.
- `-s`, `--min_stars` – search for similar projects among those with at least `min_stars` stars. Available options are 0, 1, 10, 50, 100. Default is 100. Warning: for the 0+ and 1+ options, a large archive will be downloaded (0+ stars: 1 GB compressed, 9 GB decompressed; 1+ stars: 250 MB compressed, 2 GB decompressed).
- `-k`, `--closest` – number of closest repositories to print. Default is 10.
- `-b`, `--batches` – number of projects that are tokenized together and stored in one file. Default is 100.
- `-m`, `--metric` – method to compute project similarity, either `kl` or `cosine`. Default is `kl`.
- `-e`, `--explain` – if passed, Sosed will explain the similarity.
- `--lang` – language name. If passed, Sosed will output only the projects in the given language.

An example run that combines these options:
python3 -m sosed.run \
-i input_examples/input.txt -o output_example \
--local --force --min_stars 10 --closest 20 --batches 1000 --metric cosine --explain --lang Go
To verify the tool, we identified the most similar projects for a set of 94 popular GitHub repositories (the list is available here). We filtered the output to include only projects with 100+ stars.
From our point of view, the results look promising, but we still need to conduct a thorough evaluation. We have made the full output available for both KL-divergence and cosine similarity.