Anomalous prediction score results in HDBscan #219

Open
fsanna13 opened this issue Mar 3, 2022 · 1 comment
Labels
question General question

fsanna13 commented Mar 3, 2022

Ask the question
I'm training a model with the Tribuo HDBSCAN algorithm and then using it to predict on new data in order to find anomalies. However, the prediction scores I get back look badly wrong.
To be specific, I train the model on segments containing summary statistics of my data (standard deviation, mean, and so on). Each segment covers a fixed timeframe (10 AM to 11 AM, 11 AM to 12 PM, and so on).
At prediction time I do the same with the test set: I group it into timeframes and compute the same statistics that were used during training.
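
For illustration, the per-segment feature extraction described above could look roughly like the sketch below. It is plain Java with no Tribuo dependency, and the class name `SegmentFeatures`, the method `summarize`, and the feature names `mean` and `stddev` are hypothetical, not taken from the actual project. The point is only that training and test segments go through the same function, so both produce the same feature names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: turns the raw readings of one time segment into named summary features. */
final class SegmentFeatures {

    private SegmentFeatures() {}

    /**
     * Computes the mean and the population standard deviation of one segment.
     * A LinkedHashMap keeps the feature order identical for every segment.
     */
    static Map<String, Double> summarize(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("segment must contain at least one value");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        // Sum of squared deviations from the mean.
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }

        Map<String, Double> features = new LinkedHashMap<>();
        features.put("mean", mean);
        features.put("stddev", Math.sqrt(squaredDiffs / values.length));
        return features;
    }
}
```

The resulting map has the same shape as the `Map<String, Double>` that `getFeatures()` returns in the data source code further down.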

Still, after calling the predict method, the scores did not reflect the distance between the two datasets, even though the test data has a much higher magnitude than the data used to train the model. I would have expected these test values to receive the maximum outlier score, since they are so far from the training data.
Is there something wrong with my approach?

The code below shows how we create the DataSource for the training and test sets.

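import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.tribuo.ConfigurableDataSource;
import org.tribuo.Example;
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.impl.ArrayExample;

public class TrainingDatasource implements ConfigurableDataSource<ClusterID> {

    private static final ClusteringFactory factory = new ClusteringFactory();
    private List<Example<ClusterID>> examples;

    public TrainingDatasource(List<ProcessedSegment> segments) {
        this.initDataSource(segments);
    }

    // Builds one Example per segment from that segment's feature map.
    private void initDataSource(List<ProcessedSegment> segments) {
        examples = new ArrayList<>(segments.size());

        for (int i = 0; i < segments.size(); i++) {
            Map<String, Double> features = segments.get(i).getFeatures();

            // Fill names and values in one pass so each value stays paired with its name.
            String[] keys = new String[features.size()];
            double[] valueList = new double[features.size()];
            int k = 0;
            for (Map.Entry<String, Double> entry : features.entrySet()) {
                keys[k] = entry.getKey();
                valueList[k] = entry.getValue();
                k++;
            }

            // The segment index is used as the ClusterID of each example.
            examples.add(new ArrayExample<>(new ClusterID(i), keys, valueList));
        }
    }
....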

To define our custom DataSource, we used this class as an example: https://github.com/oracle/tribuo/blob/407af05654dabdeed06c4439333db89bae6cc9d9/Clustering/Core/src/main/java/org/tribuo/clustering/example/GaussianClusterDataSource.java
One thing we are unsure about is how the ClusterID should be assigned for each segment.

Here is a screenshot of what the DataSource instance contains in the debugger, before training:
[Image: Immagine 2022-03-03 110442]

Is your question about a specific ML algorithm or approach?
I'm using the HDBSCAN algorithm.

Is your question about a specific Tribuo class?
HdbscanModel and Dataset<ClusterID>

System details

  • Tribuo version: 4.2.0
  • Java version: 11
fsanna13 added the "question" (General question) label on Mar 3, 2022
fsanna13 changed the title from "Wrong prediction score results in HDBscan" to "Anomalous prediction score results in HDBscan" on Mar 3, 2022
Member

Craigacp commented Mar 3, 2022

Is the issue that at prediction time some examples which are far from the training data are being assigned to a non-noise cluster? Or are they being assigned to the noise cluster, but with strange outlier scores? I think the outlier scores are fixed and based on the largest minimum spanning tree (MST) edge weight, so they might not be too useful currently.
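
One way to check, independently of the model, whether the test segments really are far from the training data in feature space is to compute each test point's Euclidean distance to its nearest training point. The sketch below is plain Java and is not how Tribuo computes its outlier scores. The class `NearestTrainingDistance` and its methods are hypothetical, and it assumes every feature vector lists the same features in the same order.

```java
/** Hypothetical diagnostic, independent of Tribuo: distance from a query point to its nearest training point. */
final class NearestTrainingDistance {

    private NearestTrainingDistance() {}

    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Smallest Euclidean distance from the query to any row of the training matrix. */
    static double nearestDistance(double[][] training, double[] query) {
        if (training.length == 0) {
            throw new IllegalArgumentException("training set must not be empty");
        }
        double best = Double.POSITIVE_INFINITY;
        for (double[] row : training) {
            best = Math.min(best, euclidean(row, query));
        }
        return best;
    }
}
```

If the nearest-neighbor distances of the test segments are much larger than the typical distances between training segments, the data is far away as expected, and the question narrows to which cluster label and score the model returns for those points.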
