[skip release] Updated README.md with instructions for ONNX conversion. #55

Merged: 1 commit, Sep 5, 2024
1 change: 0 additions & 1 deletion .github/workflows/ci.yml
@@ -118,4 +118,3 @@ jobs:
run: ./gradlew githubRelease
env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

85 changes: 81 additions & 4 deletions README.md
@@ -9,11 +9,15 @@ This is a Scala/Spark implementation of the Isolation Forest unsupervised outlie
algorithm. This library was created by [James Verbus](https://www.linkedin.com/in/jamesverbus/) from
the LinkedIn Anti-Abuse AI team.

-This library supports distributed training and scoring using Spark data structures. It inherits from
-the `Estimator` and `Model` classes in [Spark's ML library](https://spark.apache.org/docs/2.3.0/ml-guide.html)
+The `isolation-forest` module supports distributed training and scoring in Scala using Spark data structures.
+It inherits from the `Estimator` and `Model` classes in [Spark's ML library](https://spark.apache.org/docs/2.3.0/ml-guide.html)
in order to take advantage of machinery such as `Pipeline`s. Model persistence on HDFS is
supported.

The `isolation-forest-onnx` module provides a Python-based converter to convert a trained model to ONNX format for broad
portability across platforms and languages. [ONNX](https://onnx.ai/) is an open format built to represent machine
learning models.

## Copyright

Copyright 2019 LinkedIn Corporation
@@ -66,7 +70,7 @@ spark scala version combination.

```
dependencies {
-  compile 'com.linkedin.isolation-forest:isolation-forest_3.5.1_2.13:3.1.0'
+  compile 'com.linkedin.isolation-forest:isolation-forest_3.5.1_2.13:3.2.3'
}
```

@@ -79,7 +83,7 @@ Here is an example for a recent Spark/Scala version combination.
<dependency>
<groupId>com.linkedin.isolation-forest</groupId>
<artifactId>isolation-forest_3.5.1_2.13</artifactId>
-  <version>3.1.0</version>
+  <version>3.2.3</version>
</dependency>
```

@@ -197,6 +201,79 @@ isolationForestModel.write.overwrite.save(path)
val isolationForestModel2 = IsolationForestModel.load(path)
```

## ONNX model conversion and inference

### Converting a trained model to ONNX

The artifacts associated with the `isolation-forest-onnx` module are [available](https://pypi.org/project/isolation-forest-onnx/) on PyPI.

The ONNX converter can be installed using `pip`. It is recommended to use the same version of the converter as the
version of the `isolation-forest` library used to train the model.

```bash
pip install isolation-forest-onnx==3.2.3
```
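
Since matching versions are recommended, a quick check of the installed converter version can catch a mismatch before conversion. This is a minimal sketch using the standard library's `importlib.metadata`; the function name `converter_version_matches` is hypothetical and not part of `isolation-forest-onnx`, and the expected version string is whatever `isolation-forest` version trained your model.

```python
from importlib.metadata import PackageNotFoundError, version


def converter_version_matches(expected, package='isolation-forest-onnx'):
    """Return True if the installed package version equals `expected`.

    Hypothetical helper: returns False when the package is not installed.
    """
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return installed == expected


# Example: warn if the converter does not match the training library version.
if not converter_version_matches('3.2.3'):
    print('Warning: isolation-forest-onnx 3.2.3 is not installed')
```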

You can then import and use the converter in Python.

```python
import os
from isolationforestonnx.isolation_forest_converter import IsolationForestConverter

# This is the same path used in the Scala example above that saves the model.
path = '/user/testuser/isolationForestWriteTest'

# Get model data path
data_dir_path = path + '/data'
avro_model_file = os.listdir(data_dir_path)
model_file_path = data_dir_path + '/' + avro_model_file[0]

# Get model metadata file path
metadata_dir_path = path + '/metadata'
metadata_file = os.listdir(metadata_dir_path)
metadata_file_path = metadata_dir_path + '/' + metadata_file[0]

# Convert the model to ONNX format (this will return the ONNX model in memory)
converter = IsolationForestConverter(model_file_path, metadata_file_path)
onnx_model = converter.convert()

# Convert and save the model in ONNX format (this will save the ONNX model to disk)
onnx_model_path = '/user/testuser/isolationForestWriteTest.onnx'
converter.convert_and_save(onnx_model_path)
```
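
The example above takes the first entry returned by `os.listdir`, which depends on directory ordering. Spark output directories typically also contain marker files such as `_SUCCESS` and, on a local filesystem, hidden `.crc` checksum files, so the first entry is not guaranteed to be the model file. Below is a minimal sketch of a more defensive lookup. It assumes the data directory holds a single `part-*.avro` file, the metadata directory holds a single `part-*` file, and both are on a local filesystem. The helper name `find_single_part_file` is hypothetical and not part of `isolation-forest-onnx`.

```python
import glob
import os


def find_single_part_file(dir_path, suffix=''):
    """Return the only Spark part file in dir_path (hypothetical helper).

    Assumes Spark wrote exactly one `part-*` file. Marker files such as
    `_SUCCESS` and hidden `.crc` checksums do not match the pattern.
    """
    matches = sorted(glob.glob(os.path.join(dir_path, 'part-*' + suffix)))
    if len(matches) != 1:
        raise FileNotFoundError(
            f'Expected exactly one part file in {dir_path}, found {len(matches)}')
    return matches[0]


# Usage with the `path` from the example above (assumed to be a local directory):
# model_file_path = find_single_part_file(path + '/data', suffix='.avro')
# metadata_file_path = find_single_part_file(path + '/metadata')
```

Raising on zero or multiple matches makes a surprising directory layout fail loudly instead of silently passing the wrong file to the converter.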

### Using the ONNX model for inference (example in Python)

```python
import numpy as np
import onnx
from onnxruntime import InferenceSession

# `onnx_model_path` is the same path used above in the convert and save operation
onnx_model_path = '/user/testuser/isolationForestWriteTest.onnx'
dataset_path = 'isolation-forest-onnx/test/resources/shuttle.csv'

# Load data
input_data = np.loadtxt(dataset_path, delimiter=',')
num_features = input_data.shape[1] - 1
last_col_index = num_features
print(f'Number of features for {dataset_path}: {num_features}')

# The last column is the label column
input_dict = {'features': np.delete(input_data, last_col_index, 1).astype(dtype=np.float32)}
actual_labels = input_data[:, last_col_index]

# Load the ONNX model from local disk and do inference
onx = onnx.load(onnx_model_path)
sess = InferenceSession(onx.SerializeToString())
res = sess.run(None, input_dict)

# Print the first few scores
num_examples_to_print = 10
actual_outlier_scores = res[0]
print('ONNX Converter outlier scores:')
print(np.transpose(actual_outlier_scores[:num_examples_to_print])[0])
```
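
The inference example loads `actual_labels` but does not use them. One way to sanity-check the converted model is to compute the AUROC of the ONNX scores against those labels. The following is a minimal pure-Python sketch using the rank-sum (Mann-Whitney U) formulation. The `auroc` function is illustrative and not part of the library; it assumes that a label of `1` marks an outlier and that higher scores mean more anomalous.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum formulation; tied scores get average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    num_pos = sum(1 for _, label in pairs if label == 1)
    num_neg = n - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError('AUROC needs at least one outlier and one inlier')
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - num_pos * (num_pos + 1) / 2.0) / (num_pos * num_neg)


# Usage with the arrays from the example above (assumed label convention):
# print(auroc(actual_labels.tolist(), actual_outlier_scores.ravel().tolist()))
```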

## Validation

The original 2008 "Isolation forest" paper by Liu et al. published the AUROC results obtained by
2 changes: 2 additions & 0 deletions isolation-forest-onnx/setup.cfg
@@ -1,5 +1,7 @@
[metadata]
name = isolation-forest-onnx
author = James Verbus
author_email = [email protected]
description = A converter for the LinkedIn Spark/Scala isolation forest model format to ONNX format.
url = https://github.com/linkedin/isolation-forest
license = BSD 2-Clause License