
Optimize feature selection for clustering (or in general finding attractors) #85

Open
Datseris opened this issue Jul 11, 2023 · 6 comments
Labels
enhancement (New feature or request), grouping (grouping configurations), hard

Comments

@Datseris
Member

I have an idea of how we could make a function to optimize feature selection for clustering attractors. I will be creating this function while working on a new research project. But I think it is useful to discuss possibilities of how to make it happen.

My idea is as follows:

  • We have some "function" that returns many features, say 20 or so.
  • From these features, we want to find the 3 or 4 that are most effective at clustering the data into separated attractors.
  • We could perform an optimization loop: pick 3 features at random, cluster the data with these features, and then compute a clustering quality metric. Keep the three features that yield the highest quality.
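A minimal sketch of that loop in Python (hypothetical helper names; single-linkage stands in for DBSCAN and the silhouette coefficient stands in for the quality metric, purely for illustration):

```python
import itertools
import math

def single_linkage_clusters(points, eps):
    """Union-find clustering: points within eps of each other share a cluster."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in itertools.combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    n = len(points)
    total = 0.0
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            continue  # singleton cluster
        a = sum(math.dist(points[i], points[j]) for j in same) / len(same)
        b = min(
            sum(math.dist(points[i], points[j]) for j in group) / len(group)
            for lab in set(labels) if lab != labels[i]
            for group in [[j for j in range(n) if labels[j] == lab]]
        )
        total += (b - a) / max(a, b)
    return total / n

def best_feature_subset(rows, k, eps):
    """Exhaustively score every k-feature subset; return (score, subset)."""
    best = (float("-inf"), None)
    for subset in itertools.combinations(range(len(rows[0])), k):
        pts = [tuple(row[c] for c in subset) for row in rows]
        labels = single_linkage_clusters(pts, eps)
        if len(set(labels)) < 2:
            continue  # silhouette needs at least two clusters
        best = max(best, (silhouette(pts, labels), subset))
    return best

# Toy data: columns 0 and 1 separate three "attractors"; column 2 is noise.
offsets = [(-0.3, -0.2), (0.1, 0.4), (0.3, -0.1), (-0.1, 0.2), (0.2, 0.3)]
noise = iter([0.0, 3.0, 6.0, 9.0, 1.0, 4.0, 7.0, 2.0, 5.0, 8.0,
              0.5, 3.5, 6.5, 9.5, 1.5])
rows = [(cx + dx, cy + dy, next(noise))
        for cx, cy in [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
        for dx, dy in offsets]

score, subset = best_feature_subset(rows, 2, 2.0)  # selects columns (0, 1) here
```

On this toy data the noisy column drags the silhouette down, so the exhaustive search recovers the two informative columns.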

The main question therefore is how we define a "cluster quality metric". On one hand, this metric needs to have a contribution from cluster separation: the more separated the clusters are, the clearer the clustering. This could be done by computing the pairwise distribution distance between the pdfs fitted to every cluster, and keeping the median of this distribution as the metric.
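A hedged sketch of that separation metric, assuming one-dimensional features and equal-sized clusters (the 1-D earth mover's distance between sorted empirical samples stands in for the distance between fitted pdfs):

```python
import itertools
import statistics

def wasserstein_1d(xs, ys):
    """1-D earth mover's distance between equal-size empirical samples."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def separation_metric(clusters):
    """Median pairwise distribution distance; clusters is a list of sample lists."""
    dists = [wasserstein_1d(a, b) for a, b in itertools.combinations(clusters, 2)]
    return statistics.median(dists)

separation_metric([[0, 0, 0], [1, 1, 1], [5, 5, 5]])  # pairwise 1, 5, 4 → median 4.0
```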

However, this cluster quality metric should also have some sort of contribution from the number of clusters found. The more clusters the better...?

cc also @KalelR I think this is relevant for you too.

Datseris added the enhancement (New feature or request), hard, and grouping (grouping configurations) labels on Jul 11, 2023
@KalelR
Contributor

KalelR commented Jul 20, 2023

Hey, great! I think it's a cool project, which could extend to finding relevant features for time series in general.

Our experience from using DBSCAN to find attractors is that it can be quite unclear when the clustering works. We mainly had to rely on the silhouette index, which is also far from perfect, and quite unintuitive. Another complication is that it would be very expensive to find the most efficient set of features by sampling subsets from lots of features and clustering each one. So maybe it would be nice to come up with a technique that doesn't rely on clustering like DBSCAN.

In the systems I've been studying, it hasn't been tremendously difficult to find nice features that are constant, or at least stationary, on the attractor, so that their time averages are well behaved and initial conditions on the same attractor yield basically the same values. In this scenario, feature space is composed of very densely populated clouds (almost single points) that are all well separated from each other, and a simple histogram (or the similar idea I implemented in PR #84) works nicely. Maybe a similar idea could work here: one could look at the histograms of individual features and from there decide the best ones. Not sure what the best metric would be, though. Also not sure in general how robust this would be to noisy data.
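One possible (hypothetical) way to score a single feature from its histogram: bin the values and count contiguous runs of occupied bins, which in the dense-cloud scenario described above approximates the number of separated clusters that feature alone can resolve:

```python
def occupied_bin_runs(values, nbins=20):
    """Count contiguous runs of occupied histogram bins (≈ separated modes)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # guard against constant features
    occupied = [False] * nbins
    for v in values:
        occupied[min(int((v - lo) / width), nbins - 1)] = True
    runs, prev = 0, False
    for filled in occupied:
        if filled and not prev:
            runs += 1
        prev = filled
    return runs

occupied_bin_runs([0.0, 0.05, 0.1, 5.0, 5.1, 10.0, 10.2])  # → 3
```

This would be sensitive to the bin width and to noise, matching the caveat above.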

The main question therefore is how do we define a "cluster quality metric"?

I wonder how well the clustering would work if applied on all the available features simultaneously, i.e. on this high-dimensional feature space. If it works well (the histogram would, I guess), then it could be taken as a ground truth to then compare the single-feature analysis to.

Some additional thoughts:

Be aware of features spanning different orders of magnitude, especially if you are computing distances. Features going in the interval (0, 1e3) will over-contribute compared to features in (0, 1e-1) for instance. A problem we haven't yet solved is how to properly deal with this. The current rescale_to_01 isn't very good (cf. PR #59).

A useful property of the features that could enter the metric is if they are constant, or at least stationary in time.

Would be happy to follow this and discuss more at some point!

PS: sorry for the delay, had a trip this week.

@awage
Contributor

awage commented Jul 20, 2023

From these features we want to find the 3 or 4 ones that are the most efficient into clustering data into separated attractors.

I guess you want to use some integer programming technique. The setup is as follows: for 20 features you have a vector of 20 values in {0, 1}, where 1 means "use this feature" and 0 means "discard this one". The program searches for the {0, 1} vector that optimizes the fitness function (the clustering metric). It will select the features that optimize the metric.
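The brute-force version of that setup is tiny (a Python sketch with a made-up fitness function; in practice the fitness would be the clustering quality metric):

```python
import itertools

def best_mask(n_features, fitness):
    """Enumerate all {0,1} inclusion vectors and keep the fittest (2^n - 1 candidates)."""
    best_score, best = float("-inf"), None
    for mask in itertools.product((0, 1), repeat=n_features):
        if not any(mask):
            continue  # at least one feature must be selected
        score = fitness(mask)
        if score > best_score:
            best_score, best = score, mask
    return best_score, best

# Made-up fitness: features 0 and 2 are "informative", every extra feature costs 0.1
score, mask = best_mask(4, lambda m: m[0] + m[2] - 0.1 * sum(m))  # mask (1, 0, 1, 0)
```

Exhaustive enumeration is only feasible for small feature counts; for 20 features that is already 2^20 - 1 candidate masks, which is where the integer programming machinery would come in.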

The hard stuff is to define the right clustering quality metric.

Another complication is that it would be very expensive to find the most efficient set of features from the clustering by doing this sampling from lots of features. So maybe it would be nice to come up with a technique that doesn't rely on clustering like DBSCAN.

The optimization to select the best features can be done with DBSCAN on a small random sample of vectors. The optimization shouldn't take long if the problem is convex. If it is not convex, that is the worst-case scenario and you may end up with a suboptimal solution.

@Datseris
Member Author

Datseris commented Aug 2, 2023

Another complication is that it would be very expensive to find the most efficient set of features from the clustering by doing this sampling from lots of features.

In my application this is not a problem. I have about 100 trajectories in total.

The current rescale_to_01 isn't very good (cf. PR #59).

What do you mean? Why?

Maybe a similar idea could work here, and one could look at the histograms on individual features and from there decide the best ones. Not sure what the best metric would be, though. Also not sure in general how robust this would be to noisy data.
I wonder how well the clustering would work if applied on all the available features simultaneously, i.e. on this high-dimensional feature space. If it works well (the histogram would, I guess), then it could be taken as a ground truth to then compare the single-feature analysis to.

But I haven't really understood from your answer how to judge the quality of a clustering. So there is no way to compare whether some features are good or not. You say "compare" above, but how? Technically, how? As I said in my original post, I need a way to numerically estimate the "goodness" of a clustering output. I guess I'll open a post on Discourse and see if someone from statistics has an idea of how to do this. Alex put it very well:

The hard stuff is to define the right clustering quality metric.


I guess you want to use some integer programming technique. The setup is as follows: for 20 features you have a vector of 20 values in {0, 1}, where 1 means "use this feature" and 0 means "discard this one". The program searches for the {0, 1} vector that optimizes the fitness function (the clustering metric). It will select the features that optimize the metric.

Yes, but this is the easy part. I can just use Combinatorics.jl to make an iterator for me.
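For reference, the Python analogue of that Combinatorics.jl iterator (choosing 3 out of 20 features) is a one-liner:

```python
import itertools

subsets = list(itertools.combinations(range(20), 3))
len(subsets)  # C(20, 3) = 1140
```

1140 candidate subsets is small enough to evaluate exhaustively, which supports the point that the search itself is the easy part.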

@Datseris
Member Author

Datseris commented Aug 2, 2023

@Datseris
Member Author

Datseris commented Aug 2, 2023

@KalelR apparently the metric I am searching for is just the silhouette (see the post above). But I remember that here we optimize differently, using Optim.jl.

So I guess the numerical quantity I am searching for is just the minimum of the optimizer here: https://github.com/JuliaDynamics/Attractors.jl/blob/main/src/mapping/grouping/cluster_optimal_%CF%B5.jl#L116

So I guess I should just do a simple PR that makes it so that this information is stored somewhere in the GroupViaClustering struct.

@Datseris
Member Author

Datseris commented Aug 7, 2023

I did some analysis here. I am keeping track of the -Optim.minimum(opt) from this line: https://github.com/JuliaDynamics/Attractors.jl/blob/main/src/mapping/grouping/cluster_optimal_%CF%B5.jl#L116 .

It appears to me that just this by itself is not enough. Here are some examples from the data I am using:

(Left are the timeseries. The features are always the standard deviation of each timeseries for now. The dataset has 3 attractors; we know this a priori. We want to see which observables (i.e., "features") separate the attractors best.)

optimal val = 0.84

[image]

optimal val = 0.87

[image]

optimal val = 0.95

[image]

This image is pretty interesting because it shows that only one feature is sufficient to cluster this data into 3 attractors (you will notice that these 3 features are almost perfectly linearly correlated).

My results are that the optimal silhouette value indeed correlates with "cluster separation quality"; however, this metric by itself is probably not enough. E.g., in the following example I have the lowest optimal value, yet still find 3 clusters:

optimal value = 0.8

[image]

My main difficulty is that the silhouette quality metric has so little variability. I need to process this. Ideally I want a metric that rewards both cluster quality and cluster count, and that also rewards using fewer feature dimensions. I want to test whether just using the formula "optimal value * number of attractors / number of feature dimensions" is enough.
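That candidate formula as a one-liner for experimenting (hypothetical helper name; the numbers below reuse the optimal values reported in this comment):

```python
def composite_score(optimal_value, n_attractors, n_feature_dims):
    # Candidate metric from the discussion: reward silhouette quality and
    # attractor count, penalize the number of feature dimensions used.
    return optimal_value * n_attractors / n_feature_dims

composite_score(0.95, 3, 1)  # single informative feature, ≈ 2.85
composite_score(0.84, 3, 3)  # three features, lower silhouette, ≈ 0.84
```

On these numbers the score strongly favors the single-feature solution, which is the behavior asked for above; whether the linear trade-off is the right one would need testing.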
