Unefficient creation of repulsion sets #17

Open
MohamadMansouri opened this issue Dec 27, 2018 · 5 comments
@MohamadMansouri

When the symbols are extracted from the dataset's binaries, the base64 encoding of the function prototype is used to build the ground truth of identical functions.
However, the function prototype does not stay the same across compilers and platforms.
As a result, the algorithm can place the same function (under two different prototypes) into the repulsion file of the training and validation sets.
Since this case is frequent, I believe it may have affected the evaluation significantly.

For example:
WideToChar(wchar_t const*, char*, unsigned long)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp
WideToChar(wchar_t const*, char*, unsigned int)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=
These are two functions, each of which exists in 22 distinct files of the dataset. They refer to the same function, but the training will try to make them look different.

QuickOpen::ReadRaw(RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
QuickOpen::ReadRaw( RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
The first function appears in 64 files, while the second appears in 41 different files. Some compilers simply put a space before the parameter, and this will cause trouble in the training.

There are many more cases like these.

Suggestion: Use the name of the function as a symbol (without the parameters)
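As a rough illustration of this suggestion, the sketch below decodes the four base64 symbols quoted above and strips the trailing parameter list. The helper `function_name` is hypothetical and not part of the project. It assumes the prototype ends with its parameter list, so a trailing qualifier such as `const` is not handled.

```python
import base64


def function_name(prototype: str) -> str:
    """Return the prototype without its trailing parameter list.

    Hypothetical helper: walks backwards from the final ')' to the
    matching '(' so that names such as 'operator()' survive.
    """
    text = prototype.strip()
    if not text.endswith(")"):
        return text  # no trailing parameter list found; leave unchanged
    depth = 0
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                return text[:i].rstrip()
    return text  # unbalanced parentheses; leave unchanged


# The four base64 symbols quoted in the report above.
symbols = [
    "V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp",
    "V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=",
    "UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==",
    "UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=",
]

prototypes = [base64.b64decode(s).decode("utf-8") for s in symbols]
names = {function_name(p) for p in prototypes}

for p in prototypes:
    print(f"{p!r} -> {function_name(p)!r}")
```

With this kind of normalization the four distinct prototype strings collapse to two names.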

@thomasdullien
Contributor

Hey there,

thanks for the report, and ouch. This indeed looks like it would cause trouble.

The converse (e.g. only using the function name) may cause different issues, though -- not for the repulsion sets, but for the attraction sets (two different functions with same name but different arguments, for example). So I guess the "best" solution would be to use the prototypes for attraction pairs, and not use them for the repulsion sets?
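One possible reading of this proposal, sketched under assumptions: treat two symbols as an attraction pair only when the full prototypes match, treat them as a repulsion pair only when the bare function names differ, and skip the ambiguous middle case. The names `base_name` and `classify_pair` are hypothetical, and cutting at the first `(` is a simplification that mishandles names like `operator()`.

```python
def base_name(prototype: str) -> str:
    """Hypothetical helper: everything before the first '(' (simplified)."""
    return prototype.split("(", 1)[0].strip()


def classify_pair(proto_a: str, proto_b: str) -> str:
    """Classify a pair of demangled prototypes.

    'attract' - identical prototypes, treated as the same function
    'skip'    - same name but different prototype: could be an overload or
                just a compiler formatting difference, so use for neither set
    'repulse' - different names, treated as different functions
    """
    if proto_a == proto_b:
        return "attract"
    if base_name(proto_a) == base_name(proto_b):
        return "skip"
    return "repulse"


print(classify_pair("QuickOpen::ReadRaw(RawRead&)", "QuickOpen::ReadRaw(RawRead&)"))
print(classify_pair("QuickOpen::ReadRaw(RawRead&)", "QuickOpen::ReadRaw( RawRead&)"))
print(classify_pair("QuickOpen::ReadRaw(RawRead&)",
                    "WideToChar(wchar_t const*, char*, unsigned long)"))
```

Skipping same-name pairs gives up some potential repulsion pairs between genuine overloads in exchange for not pushing apart two spellings of the same function.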

Cheers,
Thomas

@thomasdullien thomasdullien self-assigned this Dec 27, 2018
@MohamadMansouri
Author

Hey,
Thanks for the fast reply. I would like to add that creating symbols from function prototypes yields 1176 distinct functions in the unrar dataset (ELF + PE), while creating symbols from function names yields 879 distinct functions. This means that 297 are repeated functions, which represent about 25% of the dataset. On the other hand, after thinking twice about it, there is a low chance that this causes a big problem, since the probability for a problem to occur is
eq
so the number of problematic pairs is 0.00021 * N, where N is the number of repulsion pairs.
For 500000 repulsion pairs we may have 107 pairs of functions that are declared as repulsion pairs but belong to the same function.
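The arithmetic in this comment can be re-checked with a few lines. This only recomputes the figures quoted above; the probability formula itself was posted as an image and is not reproduced here. All variable names are illustrative.

```python
# Figures quoted in the comment above.
distinct_by_prototype = 1176
distinct_by_name = 879
rate = 0.00021            # quoted fraction of problematic repulsion pairs
n_repulsion_pairs = 500_000
quoted_bad_pairs = 107

# Prototype-level symbols that collapse when only the name is used.
repeated = distinct_by_prototype - distinct_by_name
repeated_share = repeated / distinct_by_prototype

# Expected number of mislabeled pairs with the rounded rate.
expected_bad_pairs = rate * n_repulsion_pairs

# Rate implied by the quoted count of 107 pairs.
implied_rate = quoted_bad_pairs / n_repulsion_pairs

print(repeated, round(repeated_share, 3))
print(round(expected_bad_pairs), implied_rate)
```

The rounded rate gives about 105 pairs rather than 107, which is consistent with the quoted 0.00021 being a rounding of roughly 0.000214.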

I agree with what you said regarding the solution.

I will create a pull request based on what you said. Please have a look.

@thomasdullien
Contributor

thomasdullien commented Dec 28, 2018 via email

@MohamadMansouri
Author

Running this on the generated training data for Unrar:
cat extracted_symbols_* | awk '{system("echo "$4 "| base64 -d; echo")}' | sort | uniq
This is what you get:
func_uniq.txt
Regarding what you thought, I am afraid you are not completely right: if you take into account only the ELF files (which do not undergo the stemsymbol step), you find the same problem.
This can be shown by running this command:
cat extracted_symbols_* | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d; echo ")}' | sort | uniq
This is what you get:
func_ELF_uniq.txt
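For readers who prefer not to shell out once per line, the two pipelines above can be approximated in Python. This is a sketch under the same assumptions the awk commands make: fields are whitespace-separated, field 2 holds the file format, and field 4 holds the base64-encoded symbol. The function names are hypothetical.

```python
import base64
import glob
import re


def unique_symbols(lines, format_pattern=None):
    """Decode field 4 of each line and return the sorted unique symbols.

    If format_pattern is given, only lines whose field 2 matches it are
    kept (mirrors the awk test `$2 ~ /ELF/`).
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or empty line
        if format_pattern and not re.search(format_pattern, fields[1]):
            continue
        decoded = base64.b64decode(fields[3]).decode("utf-8", errors="replace")
        seen.add(decoded)
    return sorted(seen)


def read_lines(pattern="extracted_symbols_*"):
    """Yield all lines from the files matching the glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            yield from handle


# Usage, run from the directory holding the extracted symbol files:
#   all_symbols = unique_symbols(read_lines())
#   elf_symbols = unique_symbols(read_lines(), format_pattern="ELF")
```

Comparing the length of `elf_symbols` against the number of distinct bare function names would show whether the duplication also occurs within ELF files alone.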

Regards,
Mansouri

@thomasdullien
Contributor

thomasdullien commented Dec 30, 2018 via email
