
Simpler and clearer grouping configuration for AttractorsViaFeaturizing #84

Open · KalelR opened this issue Jul 4, 2023 · 5 comments
Labels: clarification-needed (Needs more clarification from the OP), enhancement (New feature or request), good first issue (Good for newcomers), grouping (grouping configurations)

KalelR (Contributor) commented Jul 4, 2023

The standard way to group features in the AttractorsViaFeaturizing method is to cluster them using DBSCAN. As we've known for some time, the algorithm is more or less a black box, which can make it quite hard to debug. Non-optimal parameters can lead to (i) finding the same attractor more than once (the algorithm may separate two trajectories on the same attractor whose feature values differ slightly), or (ii) missing one or more attractors that are there (i.e., grouping 2+ attractors together). Sometimes both occur for the same parametrization. In the projects I've been working on in particular, figuring out an optimal parametrization has been very frustrating, as you really need to look deep into all the details of the computations.

As I understand it, DBSCAN is great if you have noisy data, where it excels at identifying the clouds (clusters) of higher density. But in a wide range of applications I have in mind, this doesn't seem necessary, since

  1. The systems are deterministic
  2. I know good features to separate the attractors
  3. I can run the trajectories for long enough that features on the same attractor differ very little.

As a consequence, feature space consists of very small, dense clouds that are widely separated from each other. DBSCAN is thus not really needed; we can use something much simpler and more intuitive. I've been having success with a simple comparison: just comparing distances between features, and separating them into different clusters if the distance exceeds a threshold.

# Note: `evaluate` comes from Distances.jl
using Distances: evaluate

function _cluster_features_into_labels_comparing(features, config, distance_threshold)
    labels_features = Vector{Int64}(undef, length(features)) # labels of all features
    labels_features[1] = 1
    cluster_idxs = [1] # indices of the features that define each cluster
    cluster_labels = [1] # labels of the clusters, going from 1:num_clusters
    next_cluster_label = 2
    metric = config.clust_distance_metric
    for idx_feature in 2:length(features)
        feature = features[idx_feature]
        # distance of the current feature to the defining feature of each cluster
        dists = [evaluate(metric, feature, features[i]) for i in cluster_idxs]
        min_dist, idx_min = findmin(dists)
        if min_dist > distance_threshold # larger than threshold => new cluster (attractor)
            push!(cluster_idxs, idx_feature)
            push!(cluster_labels, next_cluster_label)
            feature_label = next_cluster_label
            next_cluster_label += 1
        else # smaller => assign to the closest cluster
            feature_label = cluster_labels[idx_min]
        end
        labels_features[idx_feature] = feature_label
    end
    return labels_features
end

So far I'm naming this grouping config GroupingViaComparing. It has been working much better than DBSCAN: I know exactly what it is doing at all times, so it is easier to find a good parameter (there is just one: the maximum distance between in-cluster features), and there are no unexpected behaviors.

Note also that it has some nice advantages:

  1. The features could very well just be the attractors themselves; the distance metric could then be Hausdorff() for instance, meaning we would distinguish attractors directly by their distances in state space, skipping the need for features entirely. This would of course be slower, but it can be parallelized, and the distances may not need to be computed over many time points.
  2. By skipping the computation of the pairwise distance matrix, we also avoid the memory problems of DBSCAN.
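To illustrate point 1, distinguishing attractors directly by a set distance in state space can be sketched in a few lines. This is a Python toy with made-up point clouds, not Attractors.jl code, and a plain-numpy stand-in for a Hausdorff metric:

```python
# Sketch: comparing two "attractors" (point clouds) directly by the symmetric
# Hausdorff distance. All names and data here are illustrative.
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point clouds a (n, dim) and b (m, dim)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Two toy "attractors": point clouds separated by 10 along the second axis.
att1 = np.array([[0.0, 0.0], [1.0, 0.0]])
att2 = np.array([[0.0, 10.0], [1.0, 10.0]])
print(hausdorff(att1, att2))  # 10.0, well above any sensible in-cluster threshold
```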

It needs more work and definitely more testing but I'm so far quite happy. What do you think?

Datseris (Member) commented Jul 4, 2023

I've been having success with a simple comparison: just comparing the distances between each feature, and separating them if that distance is higher than a threshold.

Sure but do you compare each feature with each other feature? This means N^2 comparisons. Or do you just compare each feature with the previous feature? Your code has only one loop which makes me think that you are doing the latter, but N^2 comparisons would be the correct thing to do...?

Also, this is somewhat related to GroupViaNearestFeature but you don't have the nearest features predefined.

In any case, yes, I think this should become a new grouping config. Thankfully the API for adding a new grouping configuration is well declared, so it shouldn't be hard to put together a PR! https://juliadynamics.github.io/Attractors.jl/dev/attractors/#Grouping-types

Datseris added the grouping, enhancement, good first issue, and clarification-needed labels on Jul 4, 2023
Datseris (Member) commented Jul 4, 2023

BTW why are you making these functions in the first place instead of just going ahead and using group_features(features, config) with a new type?

I think this is a major advantage of using interfaces: you don't need to hack things together but rather follow the interface. It would probably save you lots of hacking and editing of the package's source code to get end results.

Datseris (Member) commented Jul 4, 2023

Also, how does GroupViaHistogram fare in your example case? Since you separate by threshold this practically means that you can make a histogram with the same bin size as your threshold and you'll get approximately the same results.

(but for histogram you need to know the end points of the histogram in advance)
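The rough equivalence can be sketched as follows: binning each feature with a bin side length equal to the threshold groups nearby features together. This is a hypothetical Python toy, not the actual GroupViaHistogram API:

```python
# Sketch of the histogram idea: features falling into the same bin of side
# `binsize` receive the same label. Function name and data are illustrative.
import math

def group_by_histogram(features, binsize):
    labels, bins = [], {}
    for f in features:
        key = tuple(math.floor(x / binsize) for x in f)  # which bin the feature falls in
        labels.append(bins.setdefault(key, len(bins) + 1))
    return labels

print(group_by_histogram([(0.1, 0.1), (0.12, 0.11), (5.0, 5.0)], 1.0))  # [1, 1, 2]
```

Note that, unlike the threshold comparison, two features closer than the threshold can still land in adjacent bins, which is why the results are only approximately the same.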

KalelR (Contributor, Author) commented Jul 4, 2023

Sure but do you compare each feature with each other feature? This means N^2 comparisons. Or do you just compare each feature with the previous feature? Your code has only one loop which makes me think that you are doing the latter, but N^2 comparisons would be the correct thing to do...?

I compare each feature with the already-found clusters, so it's ~ N * number_of_attractors comparisons. Since I'm assuming that features belonging to the same cluster are very close to each other, there's no reason to compare all of them pairwise. Or am I missing something?
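As a sanity check on that count, here is a 1-D Python toy of the same incremental scheme (illustrative, not package code) with a comparison counter; each feature is compared only against one representative per existing cluster:

```python
# Toy 1-D version of the incremental grouping, counting distance evaluations.
def label_features(features, threshold):
    reps = [features[0]]          # one representative feature per cluster
    labels = [1]
    n_comparisons = 0
    for f in features[1:]:
        dists = [abs(f - r) for r in reps]  # compare only to cluster representatives
        n_comparisons += len(dists)
        d_min = min(dists)
        if d_min > threshold:     # far from all clusters => new cluster
            reps.append(f)
            labels.append(len(reps))
        else:                     # otherwise assign to the closest cluster
            labels.append(dists.index(d_min) + 1)
    return labels, n_comparisons

labels, n = label_features([0.0, 0.05, 1.0, 1.02, 0.01], 0.5)
print(labels, n)  # [1, 1, 2, 2, 1] 6  (vs. 10 for all-pairs with N = 5)
```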

Also, this is somewhat related to GroupViaNearestFeature but you don't have the nearest features predefined.

But there you need to already know the attractors.

BTW why are you making these functions in the first place instead of just going ahead and using group_features(features, config) with a new type?

Yup, you're right. I implemented it like this because it was quicker, but I'll work on the API.

Also, how does GroupViaHistogram fare in your example case? Since you separate by threshold this practically means that you can make a histogram with the same bin size as your threshold and you'll get approximately the same results.

Oh yeah, good point, I'd forgotten about GroupViaHistogram to be honest. Does the histogram algorithm work well in high dimensions? If so, it's probably a better alternative than what I proposed, at least for features.
If you wanted to directly compare attractors rather than their features, maybe this approach of directly computing distances would work better.

Datseris (Member) commented Jul 4, 2023

Does the histogram algorithm work well for high dimensions?

Yes, but it currently has the same limitation that CartesianIndices has, because it uses cartesian indices to access the histogram. So, same as in #76. A problem to solve before this cartesian indexing issue is JuliaDynamics/StateSpaceSets.jl#18
