The size of a fragment is expressed in kgrams (k consecutive tokens in the syntax tree), and this number is shown for each fragment. It means that on either side of the fragment there are X kgrams with the same hash. How many actual tokens match is difficult to say, because we do not compare all kgrams in a file (we compare roughly only one kgram per window). If you want more information about the algorithm, there is a description in our publication (let me know if you don't have access). There are multiple reasons why matches can be off or look weird:
If you pass the
Where each token is separated by a comma. Parentheses mean descending in the tree to a child.
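To make the "one kgram per window" idea above more concrete, here is a minimal sketch of kgram hashing followed by window-based selection. It is an illustration under assumptions, not the tool's actual code: the function names `kgram_hashes`, `winnow`, and `shared_kgrams` are hypothetical, CRC32 stands in for whatever hash the tool uses, and the input is a plain list of strings rather than tokens taken from a syntax tree.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens; returns (hash, start_index) pairs."""
    return [
        (zlib.crc32(" ".join(tokens[i:i + k]).encode()), i)
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes, w):
    """Keep one kgram per window of w consecutive kgrams: the one with the smallest hash."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Ties are broken in favor of the rightmost kgram (an assumption of this sketch).
        selected.add(min(window, key=lambda h: (h[0], -h[1])))
    return selected


def shared_kgrams(tokens_a, tokens_b, k, w):
    """Return (position_in_a, position_in_b) for each selected hash found in both inputs."""
    # If a hash occurs at several positions, only one position is kept (simplification).
    fp_a = dict(winnow(kgram_hashes(tokens_a, k), w))
    fp_b = dict(winnow(kgram_hashes(tokens_b, k), w))
    return sorted((fp_a[h], fp_b[h]) for h in fp_a.keys() & fp_b.keys())


# Hypothetical token streams: the right one contains the left one after a short prefix.
left = "def f ( x ) : return x + 1".split()
right = "y = 0 ; def f ( x ) : return x + 1".split()
print(len(shared_kgrams(left, right, k=3, w=1)), "shared kgrams with w=1")
print(len(shared_kgrams(left, right, k=3, w=4)), "shared kgrams with w=4")
```

With `w=1` every kgram is kept, so the count reflects all matching kgrams. With a larger `w` only a sample of the kgrams is compared, so the reported number of shared kgrams says little about how many tokens actually match.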
-
Could I please ask for clarification of the colored matching fragments shown in the console?
My intuition was that these are selected fragments with some overlap (shown in green) between the two sources under comparison.
In some cases, the overlaps are excellent and directly identify a shared piece of code. In other cases, however, the overlap is poor. I came across several weird cases where a comment seemed to match some real code (I would expect comments to be filtered out?), or where a large piece of code (400 lines) matched just a line or two! (This often happens when using small values of the `w` parameter, but that could be just a coincidence.)

Also, what does the `X kgrams` in the heading of each fragment refer to? Is it the number of all kgrams in the fragment, or the number of shared kgrams? And what is the relation between this number and the highlighted code (are they supposed to relate to each other)?

Consider, e.g., the following example with `1 kgrams` (I'm using the default `-k` here), where part of lines 393-396 is highlighted in green on the left and lines 40-43 are highlighted on the right. Here I'm totally confused: what does `1 kgrams` mean and what does it relate to? Why does the highlighted code not match?

Consider also another example with `10 kgrams`: here only lines 423 and 98 match (this makes sense), but what is the meaning of `10 kgrams` here?

Thanks for the clarification.