-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unefficient creation of repulsion sets #17
Comments
Hey there, thanks for the report, and ouch. This indeed looks like it would cause troubles. The converse (e.g. only using the function name) may cause different issues, though -- not for the repulsion sets, but for the attraction sets (two different functions with same name but different arguments, for example). So I guess the "best" solution would be to use the prototypes for attraction pairs, and not use them for the repulsion sets? Cheers, |
Hey there,
one thing I was wondering -- stemsymbols is normally responsible for trying
to "canonicalize" symbols from different compilers,
thinking about the problem in question, it seems like the best place to fix
this would be to 'fix' stemsymbols?
Do you happen to have a few example strings that are incorrectly viewed as
different?
Cheers,
Thomas
Am Fr., 28. Dez. 2018 um 01:37 Uhr schrieb Mohamad Mansouri <
[email protected]>:
… Hey,
Thanks for the fast reply, I would like to add that creating symbols from
function prototypes results in finding 1176 distinct function in the unrar
dataset (ELF + PE) while creating symbols from function name will give 879
distinct functions this means that near 297 are repeated functions which
they represent 25% of the dataset. On the other hand and after thinking
twice about it, its low chance that this may cause a big problem since, for
a problem to occur the probability is
[image: eq]
<https://camo.githubusercontent.com/1e1b1d5abc24fc1318b2b14130f0d6c3ee9f7b3b/68747470733a2f2f6c617465782e636f6465636f67732e636f6d2f6769662e6c617465783f282535436672616325374232393725374425374231313736253744292825354366726163253742312537442537423131373525374429>
the number of problamatic pairs are 0.00021 * N where N is the number of
repulsion pairs
for 500000 repulsion pair we may have 107 pairs of function that are
declared as repulsion pairs but they belong to the same function
I agree to what you said regarding the solution
I will create a pull request base on what you said. Please have a look.
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#17 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEYBvJ02QCO4PQoPSSw_2qaZntJcIHDRks5u9WfRgaJpZM4ZjDlA>
.
|
running this on the output of the generated training data of Unrar. Regards, |
Hey there,
interesting, thanks for checking. I will ponder the next two days how to
best deal with this...
Cheers,
Thomas
Am So., 30. Dez. 2018 um 02:58 Uhr schrieb Mohamad Mansouri <
[email protected]>:
… running this on the output of the generated training data of Unrar.
cat extracted_symbols_* | awk '{system("echo "$4 "| base64 -d; echo")}' |
sort | uniq
This what you get
func_uniq.txt
<https://github.com/googleprojectzero/functionsimsearch/files/2716851/func_uniq.txt>
Regarding to what you thought I am afraid you are not completely right as
if you only take into account the ELF file (which are not undergoing the
stemsymbol thing) you find the same problem.
This can be proved by running this command
cat extracted_symbols_* | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d;
echo ")}' | sort | uniq
This what you get
func_ELF_uniq.txt
<https://github.com/googleprojectzero/functionsimsearch/files/2716873/func_ELF_uniq.txt>
Regards,
Mansouri
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#17 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEYBvMiaiv0L2lLEL2vaGyuq6ihvbpFFks5u-B3jgaJpZM4ZjDlA>
.
|
When extracting the symbols of the binary files of the dataset, base64 of the function prototype is used to build a ground truth of same functions.
but with different compilers, and platforms the function prototype does not remain the same.
Thus making the algorithm possibly put the same function (with different prototype) in the repulsion file in the training and validation sets.
Since this is indeed a frequent case, I believe this may have affected the evaluation significantly.
For example:
WideToChar(wchar_t const*, char*, unsigned long)
BASE64 : V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp
WideToChar(wchar_t const*, char*, unsigned long)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=
these are 2 functions each exists in 22 distinct files of the dataset. these refere to the same function but the training will try to make them look different
QuickOpen::ReadRaw(RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
QuickOpen::ReadRaw( RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
The first function appeared in 64 files while the second appeared in 41 different files. Some of the compilers have just put a space before the parameter and this will make troubles in the training.
There are many cases as this issue.
Suggestion: Use the name of the function as a symbol (without the parameters)
The text was updated successfully, but these errors were encountered: