This repository contains LaTeX source code for the "Code Clone Detection in Clang Static Analyzer" poster.
The poster was prepared for LLVM Developer's Meeting 2015. LLVM Developers' Meeting is the largest conference for compiler specialists from all over the world held every year. Dozens of engineers working at Google, Apple and Intel are attending the conference and exchanging valuable experience.
This research resulted in alpha.clone.CloneChecker
in Clang Static
Analyzer. This check is capable of
detecting a part of what the original implementation was able to detect, but
is more stable and production-ready.
Unit
tests
give a good overview of code pieces, which can be detected by current upstream
implementation. CloneChecker
is a part of Clang Static Analyzer, which is
shipped with Clang binary. To run this check on a custom file just install
Clang and type the following command.
$ clang++ -cc1 -analyze -analyzer-checker=alpha.clone.CloneChecker source.cpp
The work described in this poster was done in terms of Google Summer of Code 2015 and supported by Google.
I was working with LLVM Community under mentorship of Vassil Vassilev (CERN/FNAL). Vassil is a known compiler specialist and the creator of Cling, an interactive C++ interpreter used in CERN.
The GSoC project page sadly doesn't contain much information due to my lack of knowledge that a large part of the summary I wrote won't be accessible from there. This poster, though, contains an extensive overview of the work done during Summer 2015.
Despite Code Clone Detection being quite popular topic over the past years there was no good solution, which was extensible, open, easy-to-use and would actually do its job really good. While some solutions existed most of them didn't take advantage of modern compiler technologies and very naive attempts lead to detecting a small part of widely existing code clones.
Numerous research papers proposed text-based approach, which is both not scalable and inefficient. Only few attempts (most notably, this one) focused on AST analysis, which gives significantly better results. Taking an advantage of reusing Clang infrastructure, which provides a rich AST for C, C++ and Objective-C, leads to even better solution.
Reusing Clang infrastructure allows to parse the up-to-date C and C++ dialects and detecting very sophisticated code clone instances.
The poster shows that the proposed implementation outperforms existing solutions (those I am aware of) in performance, range of detected code clones and usability. Many approaches, which are aiming for speed, are not able most Type II and Type III clones (please refer to Code Clone Taxonomy in Notes). Those trying to detect more types of similarity have serious performance issue. My work combines performance efficiency while not limiting detection capabilities.
The code used for this paper is available in my fork of Clang repository.
The following table shows that even a naive implementation of Code Clone check is able to process huge open-source projects and find many similar pieces of code:
Project | Normal build time | Build with BasicCloneCheck time | Clones found |
---|---|---|---|
OpenSSL | 1m26s | 9m27s | 180 |
Git | 0m26s | 2m46s | 34 |
SDL | 0m26s | 1m59s | 170 |
During Summer 2016 I was an intern in Google Munich, where I introduced major improvements to clang-rename and started clang-refactor (see design doc for reference). Therefore I was unable to continue my work on coding side and only participated in few discussions. Raphael Isemann under mentorship of Vassil Vassilev and with the help of Apple Static Anlysis team engineers did a great job improving current infrastructure (see GSoC project page) and finally pushing the code to the Clang repository.
Clang Static Analyzer isn't able to pass information between translation units and this, unfortunately, is a huge limitation for Code Clone Detection because of its nature: most clones end up in different translation units and are not reported by the check. If a proper solution is to be made, there is a need to overcome described limitation. My work on clang-refactor might become useful for an efficient solution.
I would like to thank Vassil Vassilev for guidance and support, LLVM Community for great suggestions and all the work done towards supporting new contributors and, of course, Google - for creating a great opportunity for students from all over the world and funding.
Compared to C++ projects, projects written in C suffer from code duplication issues significantly more.
The following pieces of code can be easily wrapped into functions to prevent potential errors, such as fixing a bug in one of the clone instances and ignoring the others.
A little more throughout analysis was able to identify around 500 big enough (see following examples) code clones.
Project commit used for analysis: b77b6127e8de38726f37697bbbc736ced7b49771.
if (encrypt) {
while (l--) {
if (n == 0) {
n2l(iv, v0);
ti[0] = v0;
n2l(iv, v1);
ti[1] = v1;
BF_encrypt((BF_LONG *)ti, schedule);
iv = (unsigned char *)ivec;
t = ti[0];
l2n(t, iv);
t = ti[1];
l2n(t, iv);
iv = (unsigned char *)ivec;
}
c = *(in++) ^ iv[n];
*(out++) = c;
iv[n] = c;
n = (n + 1) & 0x07;
}
} else {
while (l--) {
if (n == 0) {
n2l(iv, v0);
ti[0] = v0;
n2l(iv, v1);
ti[1] = v1;
BF_encrypt((BF_LONG *)ti, schedule);
iv = (unsigned char *)ivec;
t = ti[0];
l2n(t, iv);
t = ti[1];
l2n(t, iv);
iv = (unsigned char *)ivec;
}
cc = *(in++);
c = iv[n];
iv[n] = cc;
*(out++) = c ^ cc;
n = (n + 1) & 0x07;
}
}
if (!b->Z_is_one) {
if (!field_sqr(group, Zb23, b->Z, ctx))
goto end;
if (!field_mul(group, tmp1, a->X, Zb23, ctx))
goto end;
tmp1_ = tmp1;
} else
tmp1_ = a->X;
if (!a->Z_is_one) {
if (!field_sqr(group, Za23, a->Z, ctx))
goto end;
if (!field_mul(group, tmp2, b->X, Za23, ctx))
goto end;
tmp2_ = tmp2;
} else
tmp2_ = b->X;
/* compare X_a*Z_b^2 with X_b*Z_a^2 */
if (BN_cmp(tmp1_, tmp2_) != 0) {
ret = 1; /* points differ */
goto end;
}
if (!b->Z_is_one) {
if (!field_mul(group, Zb23, Zb23, b->Z, ctx))
goto end;
if (!field_mul(group, tmp1, a->Y, Zb23, ctx))
goto end;
/* tmp1_ = tmp1 */
} else
tmp1_ = a->Y;
if (!a->Z_is_one) {
if (!field_mul(group, Za23, Za23, a->Z, ctx))
goto end;
if (!field_mul(group, tmp2, b->Y, Za23, ctx))
goto end;
/* tmp2_ = tmp2 */
} else
tmp2_ = b->Y;
if (Xp && rsa->p == NULL) {
rsa->p = BN_new();
if (rsa->p == NULL)
goto err;
if (!BN_X931_derive_prime_ex(rsa->p, p1, p2,
Xp, Xp1, Xp2, e, ctx, cb))
goto err;
}
if (Xq && rsa->q == NULL) {
rsa->q = BN_new();
if (rsa->q == NULL)
goto err;
if (!BN_X931_derive_prime_ex(rsa->q, q1, q2,
Xq, Xq1, Xq2, e, ctx, cb))
goto err;
}
Patch 8.0.0071.
Around 300 similar code pieces in total found in Vim.
// Chunk 1.
switch (gchar_cursor())
{
case ALEF:
tempc = ALEF_;
break;
case ALEF_U_H:
tempc = ALEF_U_H_;
break;
case _AYN:
tempc = _AYN_;
break;
case AYN:
tempc = AYN_;
break;
case _GHAYN:
tempc = _GHAYN_;
break;
case GHAYN:
tempc = GHAYN_;
break;
case _HE:
tempc = _HE_;
break;
case YE:
tempc = YE_;
break;
case IE:
tempc = IE_;
break;
case TEE:
tempc = TEE_;
break;
case YEE:
tempc = YEE_;
break;
default:
tempc = 0;
}
...
// Chunk 2.
switch (gchar_cursor())
{
case ALEF:
tempc = ALEF_;
break;
case ALEF_U_H:
tempc = ALEF_U_H_;
break;
case _AYN:
tempc = _AYN_;
break;
case AYN:
tempc = AYN_;
break;
case _GHAYN:
tempc = _GHAYN_;
break;
case GHAYN:
tempc = GHAYN_;
break;
case _HE:
tempc = _HE_;
break;
case YE:
tempc = YE_;
break;
case IE:
tempc = IE_;
break;
case TEE:
tempc = TEE_;
break;
case YEE:
tempc = YEE_;
break;
default:
tempc = 0;
}
// Code chunk 1.
/* Optional arguments: line number to stop searching and timeout. */
if (argvars[1].v_type != VAR_UNKNOWN && argvars[2].v_type != VAR_UNKNOWN)
{
lnum_stop = (long)get_tv_number_chk(&argvars[2], NULL);
if (lnum_stop < 0)
goto theend;
#ifdef FEAT_RELTIME
if (argvars[3].v_type != VAR_UNKNOWN)
{
time_limit = (long)get_tv_number_chk(&argvars[3], NULL);
if (time_limit < 0)
goto theend;
}
#endif
}
...
// Code chunk 2.
if (argvars[5].v_type != VAR_UNKNOWN)
{
lnum_stop = (long)get_tv_number_chk(&argvars[5], NULL);
if (lnum_stop < 0)
goto theend;
#ifdef FEAT_RELTIME
if (argvars[6].v_type != VAR_UNKNOWN)
{
time_limit = (long)get_tv_number_chk(&argvars[6], NULL);
if (time_limit < 0)
goto theend;
}
#endif
}
Commit be5a750939c212bc0781ffa04fabcfd2b2bd744e.
} else if (!strcmp(name, "cloning")) {
if (!strcmp(value, "true"))
options.cloning = 1;
else if (!strcmp(value, "false"))
options.cloning = 0;
else
return -1;
return 0;
} else if (!strcmp(name, "update-shallow")) {
if (!strcmp(value, "true"))
options.update_shallow = 1;
else if (!strcmp(value, "false"))
options.update_shallow = 0;
else
return -1;
return 0;
This one looks especially weird.
if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
mlim = XDL_MAX_EQLIMIT;
for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
rcrec = cf->rcrecs[(*recs)->ha];
nm = rcrec ? rcrec->len2 : 0;
dis1[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
}
if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
mlim = XDL_MAX_EQLIMIT;
for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
rcrec = cf->rcrecs[(*recs)->ha];
nm = rcrec ? rcrec->len1 : 0;
dis2[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
}
[0] For reference on Code Clone Taxonomy see fairly recent paper by Roy et al.