Paimon

Paimon: Patch Identification Monster

Notes: This is the instruction of old-version Paimon.

1. Dependencies

$ conda create -n paimon python=3.9
$ conda activate paimon
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
$ conda install pyg -c pyg
$ conda install transformer

2. Data Preprocessing

All commands are executed under the root folder: <PATH_TO_FOLDER>/Paimon/, which is refered as <root> in the following instructions.

Step 1: Generate raw dataset from Joern results.

$ python ./src/get_dataset.py

This step is to verify, copy, rename, and label the original files.

According to the labeled patch list in <root>/_PatchDB/, we verify if each file in <root>/_GraphLogs/ is valid, i.e., there is corresponding patch is in PatchDB.

We copy and rename the valid files to <root>/data_raw/[positives|negatives]/*.log and determine the positives or negatives subfolder based on the PatchDB label.

This step only moves the files and does not change the graph format:

(nodeid_out, nodeid_in, edge_type, edge_version)
===========================    <-- delimiter between edges and nodes.
(nodeid, node_version, node_type, dist, line_num, code_line)

For example, a graph with 2 nodes and 1 edge:

(-3, -37, 'DDGcbb', 1)                          
===========================
(-3, 0, 'D', 8, '82', 'CBB *cbb')               
(-37, 1, '-', 0, '+90', 'cbb->child = NULL')

Step 2: Parse the graph data from text to numpy object.

$ python ./src/parse_graphs.py

This step is to parse graphs in the text records.

We parse the nodes and edges in the raw text data and then construct the graph data in numpy object format.

By this step, the graph file in <root>/data_raw/*/*.log will be converted into <root>/data_mid/*/*.npz.

The mid-point graph data structure can be shown by loading the .npz files:

>>> gdata = np.load('gdata_mid.npz', allow_pickle=True)
>>> list(gdata.keys()) # the items regarding a graph.
['nodesP', 'edgesP', 'nodesA', 'edgesA', 'nodesB', 'edgesB', 'label', 'dtype']
# nodes and edges of patch (P), pre-patch (A) code, and post-patch (B) code.
>>> gdata['nodesP']
array([['-3', '0', 'D', '8', '82', 'CBB *cbb'],
       ['-37', '1', '-', '0', '+90', 'cbb->child = NULL']], dtype=object)
# each node contains nodeid, node_version, node_type, dist, line_num, code_tokens.
>>> gdata['edgesP']
array([['-3', '-37', 'DDG', '1']], dtype=object)
# each edge contains nodeid_out, nodeid_in, edge_type, edge_version.
>>> gdata['label']
array([1]) # label.

Step 3: Embed the graph data.

$ python ./src/embed_graphs.py

This step is for ID mapping, tokenization, and embedding.

(1) ID Mapping

We map the valid node IDs to the 0-indexed consecutive integers, i.e., from 0, 1 ... to #valid_nodes-1. Here, the valid node means the node exist in the node list and is used in edge connection.

There are two other situations.

The node exists in the node list, but is not used in edge connection; it is an isolated point and it will be dropped. The graph will be still processed and saved.
The node is used in edge connection, but we cannot find its node id in the node list; the generated graph information is imcompleted, it will raise an error [Error] <ProcNodes> Node [node_id] does not in node list. In this situation, the broken graph will be dropped and nothing is saved.

Also, there is two special cases.

If the graph is void (i.e., no node and no edge), the graph will be processed as a graph with two nodes connected to each other. All the node/edge features will be zeros with the normal dimensions. This case usally happens in one of the twin graphs.
If the graph is a N-node graph without any edge, the graph will also be processed as a graph only with two nodes connected to each other. The edge features will be zeros with the normal dimensions. The two node features will come from the first 2 nodes if applicable. If the graph only contains 1 node, both 2 features come from this node.

(2) Tokenization

The tokenization step is processed using RobertaTokenizer, while the code statement will be truncated into 512-token length if it is too long.

(3) Embedding

The embedding step is processed by RobertaModel. you can use pooler_output or last_hidden_state with aggregation.

This saved graph contains 4 mandatory information:

edgeIndex is a 2 * #edges matrix recording the edge connection and graph topology. It is a sparse version of adjency matrix.
edgeAttr is a #edges * 5 matrix recording the edge embedding.
nodeAttr is a #nodes * 768 matrix recording the node embedding.
label is a 1 * 1 vector recording the graph label.

We can see the structure of saved patch graph:

>>> g_patch = np.load('gdata_patch.npz', allow_pickle=True)
>>> list(g_patch.keys())
['edgeIndex', 'edgeAttr', 'nodeAttr', 'label', 'nodeDict']
>>> g_patch['edgeIndex']
array([[0],
       [1]])
>>> g_patch['edgeAttr']
array([[0, 1, 0, 1, 0]])
>>> g_patch['nodeAttr']
array([[ 0.3733095 , -0.38387725, -0.4817224 , ..., -0.03157633,
        -0.02150492, -0.0046486 ],
       [ 0.41430753, -0.41784966, -0.5413571 , ..., -0.04296428,
         0.01779909,  0.02394774]], dtype=float32)
>>> g_patch['label']
array([1])
>>> g_patch['nodeDict'] # debug only, no need to use.
array({'-3': 0, '-37': 1}, dtype=object)

We also construct a twin graph structure to record the pre-patch and post-patch information.

The edgeIndex, edgeAttr, nodeAttr uses the 0 and 1 suffix to indicate pre-patch and post-patch info.

>>> g_twin = np.load('gdata_twin.npz', allow_pickle=True)
>>> list(g_twin.keys())
['edgeIndex0', 'edgeAttr0', 'nodeAttr0', 'edgeIndex1', 'edgeAttr1', 'nodeAttr1', 'label']

3. Model Training.

Perform Patch-GNN:

python src/gnn_patch.py

or Twin-GNN:

python src/gnn_twin.py

A. Standard Formats

A.1. Text format of graph.

The parser follows the text format to get the graph information. The file is end with .log.

edge list of patch graph
===========================    <-- delimiter between edges and nodes.
node list of patch graph
---------------------------    <-- delimiter between graphs.
edge list of pre-patch graph
===========================
node list of pre-patch graph
---------------------------
edge list of post-patch graph
===========================
node list of post-patch graph

Important Notes: Use only ONE blank line when there is no edge or node in the list.

Edge format:

(-id1, -id2, ['CDG'|'DDG'], [-1|0|1])

id1 and id2 are intergers.
['CDG'|'DDG'] means can be either 'CDG' or 'DDG'.
[-1|0|1] means can be interger -1, 0, or 1.

Node format:

(-id, free_num, 'free_str', free_num, 'free_str', 'code_statement')

id is an interger.
free_num are reserved intergers that have not been used.
'free_str' are reserved strings that have not been used.
'code_statement' is a string recording the code line.

A.2. NumPy object format of graph.

The mid-point graph data structure follows the following numpy format. The file is end with .npz.

# node information.
nodeid = 8 # can be an integer or a string.
code = 'if (a > 1)' # must be a string.
node1 = [nodeid, code] # len(node)>=2; node[0]->nodeid, node[-1]->code, node[1:-1]->any.
node2 = [14, 'a += 1;']
node3 = [5, 'int *p = NULL;']
nodes = np.array([node1, node2, node3], dtype=object)
# edge information.
edgetype = 'CDG' # must be a string, can be 'CDG', 'DDG', or 'AST'.
version = '1' # must be a string, can be '-1', '0', or '1'.
edge1 = [8, 5, edgetype, version]
edge2 = [14, 8, 'DDG', '0']
edges = np.array([edge1, edge2], dtype=object)
# label information.
label = np.array([1]) # [0]:non-security or [1]:security.
# save as numpy object 
np.savez('commit_name.npz', nodesP=nodes, edgesP=edges, nodesA=nodes, edgesA=edges, nodesB=nodes, edgesB=edges, label=label, dtype=object)
# save nodes/edges into corresponding graph (P:patch, A:pre-patch, B:post-patch)

We can see the graph information by reading the .npz file.

>>> gdata = np.load('commit_name.npz', allow_pickle=True)
>>> list(gdata.keys()) # the keys in the numpy object.
['nodesP', 'edgesP', 'nodesA', 'edgesA', 'nodesB', 'edgesB', 'label', 'dtype']
>>> gdata['nodesP'] # same structure in nodesA and nodesB.
array([[8, 'if (a > 1)'],
       [14, 'a += 1;'],
       [5, 'int *p = NULL;']], dtype=object)
# each node contains nodeid, codeline.
>>> gdata['edgesP'] # same structure in edgesA and edgesB.
array([[8, 5, 'CDG', '1'],
       [14, 8, 'DDG', '0']], dtype=object)
# each edge contains nodeid_out, nodeid_in, edge_type, edge_version.
>>> gdata['label']
array([1]) # label.

After executing ./src/embed_graphs.py, we can see this graph data structure can be processed.

>>> g_patch = np.load('commit_name.npz', allow_pickle=True)
>>> list(g_patch.keys())
['edgeIndex', 'edgeAttr', 'nodeAttr', 'label', 'nodeDict']
>>> g_patch['edgeIndex']
array([[0, 1],
       [2, 0]])
>>> g_patch['edgeAttr']
array([[0, 1, 1, 0, 0],
       [1, 1, 0, 1, 0]])
>>> g_patch['nodeAttr']
array([[ 0.38118038, -0.42532718, -0.6148339 , ..., -0.03758701,
         0.01594985,  0.08632272],
       [ 0.34696624, -0.45462358, -0.5188762 , ..., -0.04286851,
         0.02313088,  0.04798017],
       [ 0.41223133, -0.45855263, -0.5493048 , ..., -0.08573418,
        -0.01763462,  0.1002164 ]], dtype=float32)
>>> g_patch['label']
array([1])
>>> g_patch['nodeDict']
array({8: 0, 14: 1, 5: 2}, dtype=object)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Paimon

1. Dependencies

2. Data Preprocessing

Step 1: Generate raw dataset from Joern results.

Step 2: Parse the graph data from text to numpy object.

Step 3: Embed the graph data.

3. Model Training.

A. Standard Formats

A.1. Text format of graph.

A.2. NumPy object format of graph.

Files

README.md

Latest commit

History

README.md

File metadata and controls

Paimon

1. Dependencies

2. Data Preprocessing

Step 1: Generate raw dataset from Joern results.

Step 2: Parse the graph data from text to numpy object.

Step 3: Embed the graph data.

3. Model Training.

A. Standard Formats

A.1. Text format of graph.

A.2. NumPy object format of graph.