A MapReduce framework for network anomaly detection
Hashdoop requires the following:
- A Hadoop cluster with Hadoop Streaming installed
- NumPy installed on all Hadoop nodes
- ipsumdump
Note: to avoid the burden of installing Hadoop, you can also try Hashdoop with the Matatabi Docker image.
The analysis of traffic traces with Hashdoop consists of four main steps:
- Convert traffic trace to textual format
- Configure Hashdoop
- Hash the trace
- Detect anomalies
Assuming the pcap trace 200704121400.dump.gz is in the ~/mawi/ directory, convert it to a text file with the following command:
ipsumdump -tsSdDlpF -r ~/mawi/200704121400.dump.gz > ~/mawi/200704121400.ipsum
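For reference, the -tsSdDlpF flags ask ipsumdump to print one line per packet containing, in the order the flags are given, the timestamp, source address, source port, destination address, destination port, packet length, protocol, and TCP flags. A made-up line, purely to illustrate the shape of the output:

1176386400.000123 192.0.2.1 80 198.51.100.7 34567 1500 T PA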
Then copy the converted trace to HDFS; the destination directory should match the tracesHdfsPath variable in hashdoop.conf:
hadoop fs -mkdir -p /user/hashdoop/data/
hadoop fs -put ~/mawi/200704121400.ipsum /user/hashdoop/data/
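You can check that the trace was copied correctly with:

hadoop fs -ls /user/hashdoop/data/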
The hashdoop.conf file is set by default for the trace and directories used in this README. Make sure the variables in this file meet your needs:
- tracesHdfsPath: HDFS directory where traffic traces are located
- sketchesHdfsPath: HDFS directory where hashed traffic will be stored
- streamingLib: jar file of your Hadoop Streaming installation

Note that trace names are assumed to follow the naming convention of the MAWI archive.
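For illustration only, a configuration matching the paths used in this README might look like the sketch below. The exact syntax and default values are defined by the hashdoop.conf file shipped with Hashdoop; the sketches path, the jar location, and the hashSize value here are assumptions (hashSize is described in the next step):

tracesHdfsPath=/user/hashdoop/data/
sketchesHdfsPath=/user/hashdoop/sketches/
streamingLib=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar
hashSize=16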
Set the hashSize parameter in hashdoop.conf. This parameter controls the number of sub-traces created with one hash key. Hashdoop uses two hash keys (the source and the destination address), so it generates 2*hashSize sub-traces in total.
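To make the splitting concrete, here is a minimal Python sketch of the idea, not Hashdoop's actual mapper; the hash function, bucket count, and example addresses are all illustrative:

import zlib

HASH_SIZE = 16  # example value; should match hashSize in hashdoop.conf

def bucket(ip_address, hash_size=HASH_SIZE):
    # Map an IP address to one of hash_size buckets (illustrative hash).
    return zlib.crc32(ip_address.encode()) % hash_size

# Each packet contributes to two sub-traces: one keyed on its source
# address and one keyed on its destination address, which is why
# there are 2*hashSize possible sub-traces overall.
src, dst = "192.0.2.1", "198.51.100.7"
print(("src", bucket(src)), ("dst", bucket(dst)))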
Execute the (MapReduce) hashing code with the runHashing.py script:
python runHashing.py
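Once the job completes, the hashed sub-traces should appear under the directory pointed to by sketchesHdfsPath, e.g. (with the example path above):

hadoop fs -ls /user/hashdoop/sketches/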
For the simple detector: set the detection threshold and the output path in the configuration file (hashdoop.conf), then run:
python runSimpleDetector.py
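For intuition, a volume-threshold check over per-bin packet counts could look like the following generic Python sketch; it illustrates the principle only and is not the code in runSimpleDetector.py:

def detect(counts, threshold):
    # Flag the indices of time bins whose packet count exceeds the
    # threshold; counts holds one packet count per time bin.
    return [i for i, c in enumerate(counts) if c > threshold]

print(detect([120, 95, 4300, 110], threshold=1000))  # -> [2]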
For the ASTUTE detector: set the detection threshold, the time bin, and the output path in the configuration file (hashdoop.conf), then run:
python runAstute.py
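For background, ASTUTE (Silveira et al.) flags a pair of consecutive time bins when the per-flow volume changes between them deviate from the expected traffic equilibrium. The following Python sketch is a from-memory rendering of the published statistic, not the computation in runAstute.py:

import numpy as np

def astute_assessment(deltas):
    # deltas: per-flow volume changes between two consecutive bins.
    # The assessment value mean(deltas) * sqrt(F) / std(deltas), with
    # F the number of flows, is compared against the detection
    # threshold; large absolute values indicate an anomaly.
    deltas = np.asarray(deltas, dtype=float)
    return deltas.mean() * np.sqrt(len(deltas)) / deltas.std()

K = astute_assessment([5.0, -3.0, 8.0, -6.0, 4.0, -2.0])
print(abs(K) > 2.0)  # 2.0 stands in for the configured threshold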