Unefficient creation of repulsion sets #17

MohamadMansouri · 2018-12-27T18:14:24Z

When extracting the symbols of the binary files of the dataset, base64 of the function prototype is used to build a ground truth of same functions.
but with different compilers, and platforms the function prototype does not remain the same.
Thus making the algorithm possibly put the same function (with different prototype) in the repulsion file in the training and validation sets.
Since this is indeed a frequent case, I believe this may have affected the evaluation significantly.

For example:
WideToChar(wchar_t const*, char*, unsigned long)
BASE64 : V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp
WideToChar(wchar_t const*, char*, unsigned long)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=
these are 2 functions each exists in 22 distinct files of the dataset. these refere to the same function but the training will try to make them look different

QuickOpen::ReadRaw(RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
QuickOpen::ReadRaw( RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
The first function appeared in 64 files while the second appeared in 41 different files. Some of the compilers have just put a space before the parameter and this will make troubles in the training.

There are many cases as this issue.

Suggestion: Use the name of the function as a symbol (without the parameters)

thomasdullien · 2018-12-27T19:32:40Z

Hey there,

thanks for the report, and ouch. This indeed looks like it would cause troubles.

The converse (e.g. only using the function name) may cause different issues, though -- not for the repulsion sets, but for the attraction sets (two different functions with same name but different arguments, for example). So I guess the "best" solution would be to use the prototypes for attraction pairs, and not use them for the repulsion sets?

Cheers,
Thomas

MohamadMansouri · 2018-12-28T00:37:37Z

Hey,
Thanks for the fast reply, I would like to add that creating symbols from function prototypes results in finding 1176 distinct function in the unrar dataset (ELF + PE) while creating symbols from function name will give 879 distinct functions this means that near 297 are repeated functions which they represent 25% of the dataset. On the other hand and after thinking twice about it, its low chance that this may cause a big problem since, for a problem to occur the probability is

the number of problamatic pairs are 0.00021 * N where N is the number of repulsion pairs
for 500000 repulsion pair we may have 107 pairs of function that are declared as repulsion pairs but they belong to the same function

I agree to what you said regarding the solution

I will create a pull request base on what you said. Please have a look.

thomasdullien · 2018-12-28T15:25:51Z

Hey there, one thing I was wondering -- stemsymbols is normally responsible for trying to "canonicalize" symbols from different compilers, thinking about the problem in question, it seems like the best place to fix this would be to 'fix' stemsymbols? Do you happen to have a few example strings that are incorrectly viewed as different? Cheers, Thomas Am Fr., 28. Dez. 2018 um 01:37 Uhr schrieb Mohamad Mansouri < [email protected]>:

…

Hey, Thanks for the fast reply, I would like to add that creating symbols from function prototypes results in finding 1176 distinct function in the unrar dataset (ELF + PE) while creating symbols from function name will give 879 distinct functions this means that near 297 are repeated functions which they represent 25% of the dataset. On the other hand and after thinking twice about it, its low chance that this may cause a big problem since, for a problem to occur the probability is [image: eq] <https://camo.githubusercontent.com/1e1b1d5abc24fc1318b2b14130f0d6c3ee9f7b3b/68747470733a2f2f6c617465782e636f6465636f67732e636f6d2f6769662e6c617465783f282535436672616325374232393725374425374231313736253744292825354366726163253742312537442537423131373525374429> the number of problamatic pairs are 0.00021 * N where N is the number of repulsion pairs for 500000 repulsion pair we may have 107 pairs of function that are declared as repulsion pairs but they belong to the same function I agree to what you said regarding the solution I will create a pull request base on what you said. Please have a look. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#17 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEYBvJ02QCO4PQoPSSw_2qaZntJcIHDRks5u9WfRgaJpZM4ZjDlA> .

MohamadMansouri · 2018-12-30T01:58:58Z

running this on the output of the generated training data of Unrar.
cat extracted_symbols_* | awk '{system("echo "$4 "| base64 -d; echo")}' | sort | uniq
This what you get
func_uniq.txt
Regarding to what you thought I am afraid you are not completely right as if you only take into account the ELF file (which are not undergoing the stemsymbol thing) you find the same problem.
This can be proved by running this command
cat extracted_symbols_* | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d; echo ")}' | sort | uniq
This what you get
func_ELF_uniq.txt

Regards,
Mansouri

thomasdullien · 2018-12-30T14:42:10Z

Hey there, interesting, thanks for checking. I will ponder the next two days how to best deal with this... Cheers, Thomas Am So., 30. Dez. 2018 um 02:58 Uhr schrieb Mohamad Mansouri < [email protected]>:

…

running this on the output of the generated training data of Unrar. cat extracted_symbols_* | awk '{system("echo "$4 "| base64 -d; echo")}' | sort | uniq This what you get func_uniq.txt <https://github.com/googleprojectzero/functionsimsearch/files/2716851/func_uniq.txt> Regarding to what you thought I am afraid you are not completely right as if you only take into account the ELF file (which are not undergoing the stemsymbol thing) you find the same problem. This can be proved by running this command cat extracted_symbols_* | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d; echo ")}' | sort | uniq This what you get func_ELF_uniq.txt <https://github.com/googleprojectzero/functionsimsearch/files/2716873/func_ELF_uniq.txt> Regards, Mansouri — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#17 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEYBvMiaiv0L2lLEL2vaGyuq6ihvbpFFks5u-B3jgaJpZM4ZjDlA> .

thomasdullien self-assigned this Dec 27, 2018

MohamadMansouri mentioned this issue Dec 28, 2018

Avoid adding functions of same name to the repulsion dataset #18

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unefficient creation of repulsion sets #17

Unefficient creation of repulsion sets #17

MohamadMansouri commented Dec 27, 2018

thomasdullien commented Dec 27, 2018

MohamadMansouri commented Dec 28, 2018

thomasdullien commented Dec 28, 2018 via email

MohamadMansouri commented Dec 30, 2018

thomasdullien commented Dec 30, 2018 via email

Unefficient creation of repulsion sets #17

Unefficient creation of repulsion sets #17

Comments

MohamadMansouri commented Dec 27, 2018

thomasdullien commented Dec 27, 2018

MohamadMansouri commented Dec 28, 2018

thomasdullien commented Dec 28, 2018 via email

MohamadMansouri commented Dec 30, 2018

thomasdullien commented Dec 30, 2018 via email