-- Semantic LOG ExtRaction Templating (SLOGERT) --
SLOGERT aims to automatically extract and enrich low-level log data into an RDF Knowledge Graph that conforms to our LOG Ontology. It integrates
- LogPAI for event pattern detection and parameter extraction from log lines,
- Stanford NLP for parameter type detection and keyword extraction,
- OTTR Engine for RDF generation, and
- Apache Jena for RDF data manipulation.
We have tested our approach on text-based logs produced by Unix OSs, in particular:
- Apache,
- Kernel,
- Syslog,
- Auth, and
- FTP logs.
In our latest evaluation, we are testing our approach with the AIT log dataset, which contains additional logs from non-standard applications, such as suricata and exim4. In this repository, we include a small excerpt of the AIT log dataset in the `input` folder as example log sources.
**Figure 1**. SLOGERT KG generation workflow.
The SLOGERT pipeline consists of several steps. Its main parts are shown in Figure 1 above and described below:
- Load `config-io` and `config.yaml`.
- Collect target `log files` from the `input folder` as defined in `config-io`. We assume that each top-level folder within the input folder represents a single log source.
- Aggregate the collected log files into a single file.
- Add log-source information to each log line.
- If the number of log lines exceeds the configured limit (e.g., 100k), split the aggregated log file into a set of `log-files`.

Example results of this step are available in the `output/auth.log/1-init/` folder.
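
The initialization steps above can be sketched roughly as follows. This is an illustrative Python sketch, not SLOGERT's Java implementation: the function names, the `<source> <line>` prefix format, and the file-matching rule (any file whose name starts with the source name) are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def collect_lines(input_dir: Path, source_name: str) -> Iterator[str]:
    """Yield every non-empty line of the matching log files, tagged with its log source.

    Assumption: each top-level folder under input_dir is one log source,
    and its name is prepended to every line as the log-source information.
    """
    for top in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        for log_file in sorted(top.rglob(source_name + "*")):
            if not log_file.is_file():
                continue
            with log_file.open(encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line:
                        yield f"{top.name} {line}"


def split_into_chunks(lines: Iterable[str], limit: int) -> List[List[str]]:
    """Split the aggregated lines into chunks of at most `limit` lines each."""
    chunks: List[List[str]] = [[]]
    for line in lines:
        if len(chunks[-1]) >= limit:
            chunks.append([])
        chunks[-1].append(line)
    # Drop the initial empty chunk when there was no input at all.
    return [chunk for chunk in chunks if chunk]
```

Calling `split_into_chunks(collect_lines(Path("input"), "auth.log"), 100_000)` would then give one list of tagged lines per intermediate log file.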
- Initialize `extraction_template_generator` with `config-io` to register extraction patterns.
- For each `log-file` from `log-files`:
  - Generate a list of `<extraction-template, raw-result>` pairs using `extraction_template_generator`.

NOTE: We use LogPAI as `extraction_template_generator`.

Example results of this step are available in the `output/auth.log/2-logpai/` folder.
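
To make the idea of `<extraction-template, raw-result>` pairs concrete, here is a deliberately naive Python sketch. It is not LogPAI's algorithm: the masking rule (any token containing a digit is a parameter), the `<*>` wildcard marker, and all function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical masking rule: a token that contains a digit is treated as a parameter.
_HAS_DIGIT = re.compile(r"\d")


def to_template(line: str) -> Tuple[str, List[str]]:
    """Return (extraction template, extracted parameter values) for one log line."""
    tokens = line.split()
    params = [tok for tok in tokens if _HAS_DIGIT.search(tok)]
    template = " ".join("<*>" if _HAS_DIGIT.search(tok) else tok for tok in tokens)
    return template, params


def extract_templates(lines: List[str]) -> Dict[str, List[List[str]]]:
    """Group the raw parameter values of all lines by their extraction template."""
    grouped: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        template, params = to_template(line)
        grouped[template].append(params)
    return dict(grouped)
```

With this rule, two lines that differ only in IP address and port collapse into one template, while lines that differ in a purely alphabetic token (such as a user name) do not. Handling such cases is what a dedicated log parser is for.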
- Load existing `RDF_templates` list
- Load `regex_patterns` from `config` list for parameter recognition
- Initialize `NLP_engine` engine
- For each `extraction-template` from the list of `<extraction-template, raw-result>` pairs
  - Transform `extraction-template` into an `RDF_template_candidate`
  - If `RDF_templates` does not contain `RDF_template_candidate`
    - [A2.1 - RDF template generation]
      - For each `parameter` from `RDF_template_candidate`
        - If `parameter` is `unknown`
          - [A2.2 - Template parameter recognition]
            - Load `sample-raw-results` from `raw-results`
            - Recognize `parameter` from `sample-raw-results` using `NLP_engine` and `regex_patterns` as `parameter_type`
            - Save `parameter_type` in `RDF_template_candidate`
          - [A2.2 - end]
      - [A2.3 - Keyword extraction]
        - Extract `template_pattern` from `RDF_template_candidate`
        - Execute `NLP_engine` engine on the `template_pattern` to retrieve `template_keywords`
        - Add `template_keywords` as keywords in `RDF_template_candidate`
      - [A2.3 - end]
      - [A2.4 - Concept annotation]
        - Load `concept_model` containing relevant concepts in the domain
        - For each `keyword` from `template_keywords`
          - For each `concept` in `concept_model`
            - If `keyword` contains `concept`
              - Add `concept` as concept annotation in `RDF_template_candidate`
      - [A2.4 - end]
      - Add `RDF_template_candidate` to `RDF_templates` list
    - [A2.1 - end]

NOTE: We use Stanford NLP as our `NLP_engine`.

Example results (i.e., `RDF_templates`) of this step are available as `output/auth.log/auth.log-template.ttl`.
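
Steps A2.2 and A2.4 can be illustrated with a small Python sketch. It covers only the regex part of parameter recognition (the Stanford NLP part is left out), and everything named here is hypothetical: the two patterns, the majority-vote rule over the sample values, and the case-insensitive substring test used for "keyword contains concept".

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical patterns; SLOGERT's real patterns are registered in config.yaml.
REGEX_PATTERNS: Dict[str, re.Pattern] = {
    "ipv4": re.compile(r"\d{1,3}(\.\d{1,3}){3}"),
    "port": re.compile(r"\d{1,5}"),
}


def recognize_parameter(samples: Iterable[str],
                        patterns: Dict[str, re.Pattern] = REGEX_PATTERNS) -> str:
    """Return the type whose pattern fully matches the most samples, else 'unknown'."""
    votes: Counter = Counter()
    for value in samples:
        for type_name, pattern in patterns.items():
            if pattern.fullmatch(value):
                votes[type_name] += 1
                break  # first matching pattern wins for this sample
    if not votes:
        return "unknown"
    return votes.most_common(1)[0][0]


def annotate_concepts(keywords: Iterable[str], concept_model: Iterable[str]) -> List[str]:
    """Collect every concept that occurs inside at least one keyword."""
    concepts = list(concept_model)
    found: List[str] = []
    for keyword in keywords:
        for concept in concepts:
            if concept.lower() in keyword.lower() and concept not in found:
                found.append(concept)
    return found
```

Sampling only a few raw results per parameter keeps recognition cheap, at the cost of possibly mistyping a parameter whose sampled values are unrepresentative.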
- Initialize `RDFizer_engine`
- Generate `RDF_generation_template` from `RDF_templates` list
- For each `raw_result` from `raw_results` list
  - Generate `RDF_generation_instances` from `RDF_generation_template` and `raw_result`
  - Generate `RDF_graph` from `RDF_generation_instances` and `RDF_generation_template` using `RDFizer_engine`

NOTE: We use LUTRA as our `RDFizer_engine`.

Example `RDF_generation_template` and `RDF_generation_instances` are available in the `output/auth.log/3-ottr/` folder.

Example results of this step are available in the `output/auth.log/4-ttl/` folder.
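
Conceptually, the RDFization step expands a template once per raw result. The following Python sketch shows that idea with plain tuples. It does not use Lutra or OTTR syntax, and the `ex:` prefix, the property names, and the `?placeholder` convention are all made up for the example.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, ...]

# Hypothetical generation template: triple patterns with ?placeholders.
TEMPLATE: List[Triple] = [
    ("?event", "rdf:type", "ex:LogEvent"),
    ("?event", "ex:hasSourceIp", "?ip"),
    ("?event", "ex:hasPort", "?port"),
]


def expand(template: Sequence[Triple], bindings: Dict[str, str]) -> List[Triple]:
    """Replace every bound ?placeholder in the template with its value."""
    return [tuple(bindings.get(term, term) for term in triple) for triple in template]


def to_graph(template: Sequence[Triple],
             param_names: Sequence[str],
             raw_results: Sequence[Sequence[str]]) -> List[Triple]:
    """Build one template instance per raw result and expand it into triples."""
    graph: List[Triple] = []
    for index, raw_result in enumerate(raw_results):
        # The instance: a fresh event identifier plus the extracted values as literals.
        bindings = {"?event": f"ex:event{index}"}
        for name, value in zip(param_names, raw_result):
            bindings[name] = f'"{value}"'
        graph.extend(expand(template, bindings))
    return graph
```

Keeping the template separate from the instances means the triple structure is defined once, and each log line contributes only its parameter values.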
**Figure 2**. SLOGERT KG generation algorithms.
For those who are interested, we also provide an explanation of the KG generation in the form of an algorithm, as shown in Figure 2 above.
Prerequisites for running SLOGERT
- `Java 11` (for Lutra)
- `Apache Maven`
- `Python 2` with `pandas` and `python-scipy` installed (for LogPAI)
  - The default setting is to use the `python` command to invoke Python 2.
  - If this is not the case, `LogIntializer.java` needs to be modified.
We have tested SLOGERT on Mac OSX and Ubuntu with the following steps:
- Compile this project (`mvn clean install`, or `mvn clean install -DskipTests` if you want to skip the tests).
- You can set properties for extraction in the config file (e.g., number of loglines produced per file). Examples of config and template files are available in the `src/test/resources` folder (e.g., `auth-config.yaml` for auth log data).
- Transform the CSVs into OTTR format using the config file. By default, the following command should work on the example file: `java -jar target/slogert-<SLOGERT-VERSION>-jar-with-dependencies.jar -c src/test/resources/auth-config.yaml`
- The results are produced in the `output/` folder.
The SLOGERT configuration is divided into two parts: the main configuration `config.yaml` and the input parameter configuration `config-io.yaml`.
Several configuration options can be adapted in the main configuration file `src/main/resources/config.yaml`. We briefly describe the most important ones here.
- logFormats to describe the information that you want to extract from a log source. This is important because of the many existing logline formats and variants. Each logFormat contains references to the ottrTemplate used to build the `RDF_generation_template` for the RDFization step.
- nerParameters to register patterns that will be used by Stanford NLP for recognizing log template parameter types.
- nonNerParameters to register standard regex patterns for template parameter types that can't be easily detected using Stanford NLP. Both nerParameters and nonNerParameters contain references for OTTR template generation.
- ottrTemplates to register the `RDF_generation_template` building blocks necessary for the RDFization process.
The I/O configuration describes log-source-specific information that is not suitable for `config.yaml`. An example of this I/O configuration is `src/test/resources/auth-config.yaml` for the auth log. We describe the most important configuration options below:
- source: the name of the source file to be searched for in the input folder.
- format: the basic format of the log file, which will be used by `extraction_template_generator` in process A1.
- logFormat: the type of the log file. The value of this property must be registered in the `logFormats` within `config.yaml` for SLOGERT to work.
- isOverrideExisting: whether SLOGERT should load the existing `RDF_templates` or override them.
- paramExtractAttempt: how many log lines should be processed to determine the `parameter_type` of an `RDF_template_candidate`.
- logEventsPerExtraction: how many log lines should be processed in a single batch of execution.
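
The relationship between the two configuration files can be checked before a run. The Python sketch below is only an illustration: it assumes both files have already been parsed into dictionaries, that `logFormats` is available as a plain list of registered format names (the real structure in `config.yaml` may differ), and that all six options listed above are expected to be present.

```python
from typing import Dict, List

# The I/O options described above; treating all of them as expected is an assumption.
EXPECTED_IO_KEYS = [
    "source",
    "format",
    "logFormat",
    "isOverrideExisting",
    "paramExtractAttempt",
    "logEventsPerExtraction",
]


def check_io_config(io_config: Dict[str, object],
                    main_config: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = [f"missing option: {key}" for key in EXPECTED_IO_KEYS if key not in io_config]

    # Assumption: main_config["logFormats"] is a list of registered format names.
    registered = main_config.get("logFormats", [])
    log_format = io_config.get("logFormat")
    if log_format is not None and log_format not in registered:
        problems.append(f"logFormat '{log_format}' is not registered in logFormats")
    return problems
```

Running such a check first turns a missing or misspelled `logFormat` into an explicit message, so the problem does not first surface as a failure later in the pipeline.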