Anomalous prediction score results in HDBscan #219

fsanna13 · 2022-03-03T10:07:24Z

Ask the question
I'm training a model with Tribuo's HDBSCAN implementation and then using that model to predict on new data, in order to find anomalies. However, the prediction scores I get back look wrong.
To be specific, I train the model on segments of summary statistics computed from my data (standard deviation, mean and so on). Each segment covers one timeframe (10 AM to 11 AM, 11 AM to 12 PM and so on).
At prediction time I do the same with the test set: I group it into timeframes and compute the same statistics that were used during training.
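For reference, the per-timeframe feature extraction described above could look roughly like the sketch below. It is not the code from this project: the class name `SegmentStats`, the method `features` and the feature names are hypothetical, and the population standard deviation is an assumption. A `LinkedHashMap` is used so the feature order is the same for training and test segments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns the raw readings of one timeframe into named features.
final class SegmentStats {
    private SegmentStats() {}

    static Map<String, Double> features(double[] readings) {
        if (readings.length == 0) {
            throw new IllegalArgumentException("empty segment");
        }
        double sum = 0.0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double r : readings) {
            sum += r;
            min = Math.min(min, r);
            max = Math.max(max, r);
        }
        double mean = sum / readings.length;

        // Population variance: average squared deviation from the mean.
        double squared = 0.0;
        for (double r : readings) {
            squared += (r - mean) * (r - mean);
        }
        double stdDev = Math.sqrt(squared / readings.length);

        // Insertion order is preserved, so every segment yields the same key order.
        Map<String, Double> f = new LinkedHashMap<>();
        f.put("mean", mean);
        f.put("stdDev", stdDev);
        f.put("min", min);
        f.put("max", max);
        return f;
    }
}
```

The same function has to be applied to both the training and the test segments. If the test features are normalised differently from the training features, distances between them stop being meaningful.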

Still, after calling the predict method, the scores do not reflect the distance between the data, even though the test dataset has a much higher magnitude than the data used to train the model. I would have expected these test values to get the maximum outlier score, since they are so far from the training data.
Is there something wrong with my approach?

This code shows how we create the DataSource for the training and test sets.

public class TrainingDatasource implements ConfigurableDataSource<ClusterID> {

    private static final ClusteringFactory factory = new ClusteringFactory();
    private List<Example<ClusterID>> examples;

    public TrainingDatasource(List<ProcessedSegment> segments) {
        this.initDataSource(segments);
    }

    private void initDataSource(List<ProcessedSegment> segments) {

        examples = new ArrayList<>(segments.size());

        for (int i = 0; i < segments.size(); i++) {

            // Copy feature names and values together so they stay aligned.
            Map<String, Double> features = segments.get(i).getFeatures();
            String[] keys = new String[features.size()];
            double[] valueList = new double[features.size()];
            int k = 0;
            for (Map.Entry<String, Double> entry : features.entrySet()) {
                keys[k] = entry.getKey();
                valueList[k] = entry.getValue();
                k++;
            }

            examples.add(new ArrayExample<>(new ClusterID(i), keys, valueList));
        }
    }
....

To define our custom Datasource, we took this class as an example: https://github.com/oracle/tribuo/blob/407af05654dabdeed06c4439333db89bae6cc9d9/Clustering/Core/src/main/java/org/tribuo/clustering/example/GaussianClusterDataSource.java
One of our doubts is how the ClusterID should be assigned to each segment.

Here is an image showing what the Datasource instance (before training) contains in debugger mode:

Is your question about a specific ML algorithm or approach?
I'm using the HDBScan algorithm

Is your question about a specific Tribuo class?
HdbscanModel and Dataset<ClusterID>

System details

Tribuo version: 4.2.0
Java version: 11


Craigacp · 2022-03-03T18:22:38Z

Is the issue that at prediction time some examples which are far from the training data are being assigned to a non-noise cluster? Or are they being assigned to the noise cluster, but with strange outlier scores? I think the outlier scores are fixed and based on the largest MST edge weight, so they might not be too useful currently.
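One way to check whether the test examples really are far from the training data in feature space, independently of Tribuo, is a plain nearest-neighbour distance comparison. The sketch below is not the HDBSCAN outlier score and uses no Tribuo API. The class `DistanceCheck` and its methods are hypothetical, and Euclidean distance is an assumption. It flags a point as far when its nearest training point is further away than any training point is from its own nearest neighbour.

```java
// Hypothetical sanity check, unrelated to how Tribuo computes its scores.
final class DistanceCheck {
    private DistanceCheck() {}

    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from the point to its closest training example.
    static double nearestDistance(double[] point, double[][] training) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] t : training) {
            best = Math.min(best, euclidean(point, t));
        }
        return best;
    }

    // Largest nearest-neighbour distance inside the training set (needs >= 2 rows).
    static double maxTrainingNearestDistance(double[][] training) {
        if (training.length < 2) {
            throw new IllegalArgumentException("need at least two training examples");
        }
        double worst = 0.0;
        for (int i = 0; i < training.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < training.length; j++) {
                if (i != j) {
                    best = Math.min(best, euclidean(training[i], training[j]));
                }
            }
            worst = Math.max(worst, best);
        }
        return worst;
    }

    static boolean isFarFromTraining(double[] point, double[][] training) {
        return nearestDistance(point, training) > maxTrainingNearestDistance(training);
    }
}
```

If a test segment is not flagged by a check like this, the summary statistics themselves are probably on a similar scale to the training ones, and the problem lies in the feature extraction rather than in the model.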

fsanna13 added the question General question label Mar 3, 2022

fsanna13 changed the title ~~Wrong prediction score results in HDBscan~~ Anomalous prediction score results in HDBscan Mar 3, 2022
