
Commit 749771c

structure changes, update README
1 parent 4c8dd68 commit 749771c


4,609 files changed: +2401 / -319 lines changed


__init__.py

Lines changed: 0 additions & 7 deletions
This file was deleted.

doc/README.md

Lines changed: 60 additions & 53 deletions
@@ -14,34 +14,34 @@
```


-# :books: Documentation
-
-1. [ Architecture ](#page_with_curl-architecture)
-1. [ Toolchain architecture ](#toolchain-architecture)
-
-2. [ Recommended installation and usage ](#page_with_curl-recommended-installation-and-usage)
-
-3. [ Dockerhub installation and usage ](#page_with_curl-dockerhub-installation-and-usage)
-
-4. [ Pypi installation and usage ](#page_with_curl-Pypi-installation-and-usage)
-
-5. [ Credentials ](#page_with_curl-credentials)
+# Table of Contents
+1. [Architecture](#architecture)
+- [Toolchain Architecture](#toolchain-architecture)
+2. [Recommended Installation and Usage](#page_with_curl-recommended-installation-and-usage)
+3. [Dockerhub Installation and Usage](#page_with_curl-dockerhub-installation)
+4. [Pypi Installation and Usage](#page_with_curl-pypi-installation-and-usage)
+5. [Credentials](#page_with_curl-credentials)

:page_with_curl: Architecture
====
-<a name="arch"></a>
+<a name="architecture"></a>

### Toolchain architecture
-<a name="arch_std"></a>
+<a name="toolchain-architecture"></a>

-Our toolchain is represented in the next figure and works as follow. A collection of labelled binaries of different malwares families is collected and used as the input of the toolchain. **Angr**, a framework for symbolic execution, is used to execute symbolically binaries and extract execution traces. For this purpose, different heuristics have been developped to optimize symbolic execution. Several execution traces (i.e : API calls used and their arguments) corresponding to one binary are extracted with Angr and gather together thanks to several graph heuristics to construct a SCDG. These resulting SCDGs are then used as input to graph mining to extract common graph between SCDG of the same family and create a signature. Finally when a new sample has to be classified, its SCDG is build and compared with SCDG of known families (thanks to a simple similarity metric).
+Our toolchain is represented in the following figure and works as follows:

-![GitHub Logo](./images/SEMA_illustration.png)
+- A collection of labelled binaries from different malware families is collected and used as the input of the toolchain.
+- **Angr**, a framework for symbolic execution, is used to execute binaries symbolically and extract execution traces. For this purpose, different heuristics have been developed to optimize symbolic execution.
+- Several execution traces (i.e., API calls used and their arguments) corresponding to one binary are extracted with Angr and gathered together using several graph heuristics to construct a SCDG.
+- These resulting SCDGs are then used as input to graph mining to extract common graphs between SCDGs of the same family and create a signature.
+- Finally, when a new sample has to be classified, its SCDG is built and compared with SCDGs of known families using a simple similarity metric.

-This repository contains a first version of a SCDG extractor.
-During symbolic analysis of a binary, all system calls and their arguments found are recorded. After some stop conditions for symbolic analysis, a graph is build as follow : Nodes are systems Calls recorded, edges show that some arguments are shared between calls.
+![Toolchain Illustration](./images/SEMA_illustration.png)

-When a new sample has to be evaluated, its SCDG is first build as described previously. Then, `gspan` is applied to extract the biggest common subgraph and a similarity score is evaluated to decide if the graph is considered as part of the family or not. The similarity score `S` between graph `G'` and `G''` is computed as follow:
+This repository contains a first version of a SCDG extractor. During the symbolic analysis of a binary, all system calls and their arguments found are recorded. After some stop conditions for symbolic analysis, a graph is built as follows: Nodes are system calls recorded, edges show that some arguments are shared between calls.
+
+When a new sample has to be evaluated, its SCDG is first built as described previously. Then, `gspan` is applied to extract the biggest common subgraph and a similarity score is evaluated to decide if the graph is considered as part of the family or not. The similarity score `S` between graph `G'` and `G''` is computed as follows:
Since `G''` is a subgraph of `G'`, this is calculating how much `G'` appears in `G''`.
Another classifier we use is the Support Vector Machine (`SVM`) with INRIA graph kernel or the Weisfeiler-Lehman extension graph kernel.
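Note: the formula for `S` referenced above is not captured in this diff (in the README it is likely an image or an omitted line). Purely as a hedged sketch, one reading consistent with the surrounding text is a size ratio between the common subgraph and the sample graph; this is an assumption, not the repository's confirmed definition.

```latex
% Hedged sketch only: one possible form of the similarity score, assuming
% |.| counts the nodes (or edges) of a graph and G'' is the common
% subgraph extracted by gspan from G'.
\[
  S(G', G'') = \frac{|G''|}{|G'|}
\]
```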

@@ -59,33 +59,36 @@ A web application is available and is called SemaWebApp. It allows to manage the

#### Interesting links

-* https://angr.io/
-
-* https://bazaar.abuse.ch/

-* https://docs.docker.com/engine/install/ubuntu/
+- [Angr](https://angr.io/)
+- [Bazaar Abuse](https://bazaar.abuse.ch/)
+- [Docker Installation on Ubuntu](https://docs.docker.com/engine/install/ubuntu/)


-#### For extracting database
+#### Extracting database

+To extract the database, use the following commands:
```bash
-cd databases/Binaries; bash extract_deploy_db.sh
+cd databases/Binaries
+./extract_deploy_db.sh
```

Password for archive is "infected". Warning : it contains real samples of malwares.

-#### For compressing database
+#### Compressing database

+To compress the database, use the following commands:
```bash
#To zip back the test database
-cd databases/Binaries; bash compress_db.sh
+cd databases/Binaries
+./compress_db.sh
```

:page_with_curl: **Recommended installation and usage**
====
-<a name="install"></a>
+<a name="recommended-installation-and-usage"></a>

-Install the entire toolchain
+To install the entire toolchain, use the following command:

```bash
# Full installation (ubuntu)
@@ -101,11 +104,9 @@ First launch the containers :
make run-toolchain
```

-It will start the scdg, the classifier and the web app services. If you wish to use only the scdg or only the classifier, refer to the next sections.
-
-Wait for the containers to be up
+This will start the SCDG, the classifier, and the web app services. If you wish to use only the SCDG or only the classifier, refer to the specific sections below.

-Then visit 127.0.0.1:5000 on your browser
+Wait for the containers to be up, then visit 127.0.0.1:5000 on your browser

See next sections for details about the different parameters.
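As a quick sanity check that the services started before opening the browser, the sketch below uses standard `docker` and `curl` commands; the exact container names depend on how the Makefile launches them, so treat this as an assumption rather than a documented toolchain command.

```bash
# List running containers; the SCDG, classifier and web app services should appear
docker ps

# Confirm the web app answers on port 5000 (HEAD request only)
curl -I http://127.0.0.1:5000
```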

@@ -127,13 +128,17 @@ docker rmi sema-classifier

### Use only SemaSCDG

-First run the SCDG container with volumes like this :
+To use only the SemaSCDG, first run the SCDG container with volumes like this:
```bash
docker run --rm --name="sema-scdg" -v ${PWD}/OutputFolder:/sema-scdg/application/database/SCDG -v ${PWD}/ConfigFolder:/sema-scdg/application/configs -v ${PWD}/InputFolder:/sema-scdg/application/database/Binaries -it sema-scdg bash
```
-Where the first volume corresponds to the output folder where the results will be put. The second volume corresponds to the folder containing the configuration files that will be passed to the docker. And the third matches the folder containing the binaries that are going to be passed to the container.
+In this command:

-Inside the container just run :
+- The first volume corresponds to the output folder where the results will be put.
+- The second volume corresponds to the folder containing the configuration files that will be passed to the docker.
+- The third matches the folder containing the binaries that are going to be passed to the container.
+
+To run experiments, run inside the container :
```bash
python3 SemaSCDG.py configs/config.ini
```
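If the three host folders mapped by the `docker run` command above do not exist yet, they can be prepared first as sketched below; `OutputFolder`, `ConfigFolder` and `InputFolder` are only the placeholder names used in that command, and the source paths in the `cp` lines are hypothetical.

```bash
# Create the host-side folders that will be mounted into the container
mkdir -p OutputFolder ConfigFolder InputFolder

# Place a configuration file and the binaries to analyse in the mapped folders
# (source paths below are examples, not paths shipped with the toolchain)
cp path/to/config.ini ConfigFolder/
cp path/to/binaries/* InputFolder/
```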
@@ -144,17 +149,9 @@ pypy3 SemaSCDG.py configs/config.ini

#### Configuration files

-The parameters are put in a configuration file : `configs/config.ini`
-Feel free to modify it or create new configuration files to run different experiments.
+The parameters are put in a configuration file : `configs/config.ini`. Feel free to modify it or create new configuration files to run different experiments.

-To restore the default values of `config.ini` do :
-```bash
-python3 restore_defaults.py
-```
-The default parameters are stored in the file `default_config.ini`
-Do not modify `config_tutorial.ini` and `config_test.ini` as they are designed to fit the Tutorial and the tests needs respectively.
-
-The output of the SCDG are put into `database/SCDG/runs/` by default. If you are not using volumes and want to save some runs from the container to your host machine :
+The output of the SCDG are put into `database/SCDG/runs/` by default. If you are not using volumes and want to save some runs from the container to your host machine, use :
```bash
make save-scdg-runs ARGS=PATH
```
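For example, a new experiment can start from a copy of the existing configuration; `my_experiment.ini` below is a hypothetical file name used only for illustration.

```bash
# Copy the default configuration and edit the parameters you want to change
cp configs/config.ini configs/my_experiment.ini

# Run the SCDG extraction with the new configuration (inside the container)
python3 SemaSCDG.py configs/my_experiment.ini
```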
@@ -243,14 +240,14 @@ If you wish to run multiple experiments with different configuration files, the
./multiple_experiments.sh -h

# Run example
-./multiple_experiments.sh -m python3 -c configs/config configs/default_configs
+./multiple_experiments.sh -m python3 -c configs/config1 configs/config2
```

#### Tests

To run the test, inside the docker container :
```bash
-python3 scdg_tests.py configs/config_test.ini
+python3 scdg_tests.py test_data/config_test.ini
```

#### Tutorial
@@ -345,7 +342,6 @@ python3 SemaClassifier.py --train output/save-SCDG/
```

This will classify input dataset based on previously computed models
-
```bash
python3 SemaClassifier.py output/test-set/
```
@@ -359,7 +355,7 @@ python3 classifier_tests.py configs/config_test.ini

:page_with_curl: **Dockerhub installation**
====
-<a name="dockerhub"></a>
+<a name="dockerhub-installation-and-usage"></a>

## SemaSCDG

@@ -383,14 +379,20 @@ Then use the same commands than in the recommended usage section

:page_with_curl: **Pypi installation and usage**
====
-<a name="pypi"></a>
+<a name="pypi-installation-and-usage"></a>

-If you wish to install the toolchain python dependencies on your system, use :
+It is also possible to use the toolchain without docker container by using the Pypi package to install dependencies.

```bash
pip install sema-toolchain
```

+After cloning the git you can then use the toolchain without docker
+Example :
+```bash
+python3 sema_scdg/application/SemaSCDG.py sema_scdg/application/configs/config.ini
+```
+
#### Pypy3 usage

By default, pypy3 can be used to launch experiments inside the SCDG's docker container. If you wish to use it outside the container, make sure to install pypy3 :
@@ -407,9 +409,14 @@ Then install the dependecies on pypy3 :
pypy3 -m pip install -r /sema_scdg/requirements_pypy.txt
```

+Then use pypy3 instead of python3 to launch experiments:
+```bash
+pypy3 sema_scdg/application/SemaSCDG.py sema_scdg/application/configs/config.ini
+```
+
:page_with_curl: Credentials
====
-<a name="credit"></a>
+<a name="credentials"></a>

Main authors of the projects:
