Correct PCA component counting logic #11134
Conversation
CodSpeed Performance Report: Merging #11134 will not alter performance.
# Create Group A: `nr_obs_group_a` perfectly correlated
# responses based on `params_a`
for i in range(nr_obs_group_a):
What if these are "perfectly negatively correlated" instead? I suspect one will not necessarily get 2 groups then, but what is reasonable in that case?
I remember you saying something about this, but don't recall the details.
Why do you suspect different results with negative correlations?
If you have observations a, b, c with correlations [values not reproduced here], you would with our method get distances ab: 1.01, ac: 2.8, and bc: 2.2. Whereas if you were to consider positive and negative correlations as equal (take the absolute value), you would get ab: 0.812, ac: 0.812, and bc: 1.27. Now, whether we want to consider positive and negative correlations as equal is up for question, I guess?
Ah I see. Nice example.
Here's a trivial example that shows that perfectly negatively correlated responses are treated as being far apart, while perfectly positively correlated responses are treated as being close.
import numpy as np
from scipy.cluster.hierarchy import linkage

corr_matrix = np.array([
    [1.0, -1.0],
    [-1.0, 1.0],
])
Z = linkage(corr_matrix, "average", "euclidean")
print("Distance perfect negative correlation:", Z[0, 2])

corr_matrix = np.array([
    [1.0, 1.0],
    [1.0, 1.0],
])
Z = linkage(corr_matrix, "average", "euclidean")
print("Distance perfect positive correlation:", Z[0, 2])
Distance perfect negative correlation: 2.8284271247461903
Distance perfect positive correlation: 0.0
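For symmetry, here is a minimal sketch of the absolute-value variant mentioned earlier in the thread: taking `np.abs` of the correlation matrix before linkage collapses the distinction between perfect positive and perfect negative correlation. This is an illustration of that alternative, not what the PR implements.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

corr_matrix = np.array([
    [1.0, -1.0],
    [-1.0, 1.0],
])
# Taking the absolute value makes perfectly negative correlation
# indistinguishable from perfectly positive correlation: the two
# rows become identical, so the merge distance drops to zero.
Z = linkage(np.abs(corr_matrix), "average", "euclidean")
print("Distance after abs():", Z[0, 2])  # → 0.0
```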
@@ -45,7 +54,10 @@ def get_nr_primary_components(
     # sum to get the cumulative proportion of variance explained by each successive
     # component.
     variance_ratio = np.cumsum(singulars**2) / np.sum(singulars**2)
-    return max(len([1 for i in variance_ratio[:-1] if i < threshold]), 1)
+    num_components = np.searchsorted(variance_ratio, threshold, side="left") + 1
nice one!
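To make the new counting concrete, here is a hedged, self-contained sketch of the function from the diff. The signature and the assumption that `singulars` holds singular values are taken from context; the example inputs are made up.

```python
import numpy as np

def get_nr_primary_components(singulars, threshold):
    # Cumulative proportion of variance explained by each
    # successive component (squared singular values).
    variance_ratio = np.cumsum(singulars**2) / np.sum(singulars**2)
    # Index of the first component at which the cumulative ratio
    # reaches the threshold, converted to a 1-based count.
    return int(np.searchsorted(variance_ratio, threshold, side="left") + 1)

singulars = np.array([3.0, 2.0, 1.0])
# variance_ratio = [9/14, 13/14, 1.0] ≈ [0.643, 0.929, 1.0]
print(get_nr_primary_components(singulars, 0.5))  # → 1
print(get_nr_primary_components(singulars, 0.9))  # → 2
```

`side="left"` means a cumulative ratio exactly equal to the threshold counts as reaching it.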
# Create Group B: (b+j)*(-1)^j*X_2 pattern - checkerboard correlation
b = 1  # base scaling factor for group B
for j in range(nr_obs_group_b):
    sign = (-1) ** j  # alternates: +1, -1, +1, -1, ...
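As a hedged illustration of why this construction yields a checkerboard correlation pattern: scaled copies of one signal correlate at exactly ±1, with the sign given by the parity of `j`. Here `X_2` and `nr_obs_group_b` are assumed (a random base signal and a small group size), since their definitions are outside the snippet.

```python
import numpy as np

rng = np.random.default_rng(0)
X_2 = rng.normal(size=100)  # assumed base signal for group B
nr_obs_group_b = 4          # assumed small group size
b = 1                       # base scaling factor for group B

responses = []
for j in range(nr_obs_group_b):
    sign = (-1) ** j  # alternates: +1, -1, +1, -1, ...
    responses.append((b + j) * sign * X_2)

# Every pair is perfectly correlated up to sign: +1 within
# same-parity pairs, -1 across parities — a checkerboard.
corr = np.corrcoef(responses)
print(np.round(corr).astype(int))
```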
maybe more efficient: sign = -1 if j % 2 else 1
The modulo version is approximately 1 second faster over 10 million iterations. The number of groups in this test is at most 6.
import time

n = 10_000_000

# Test exponentiation
start = time.perf_counter()
for j in range(n):
    sign = (-1) ** j
time_exp = time.perf_counter() - start

# Test modulo
start = time.perf_counter()
for j in range(n):
    sign = -1 if j % 2 else 1
time_mod = time.perf_counter() - start

print(f"Exponentiation: {time_exp:.4f}s")
print(f"Modulo: {time_mod:.4f}s")
print(f"Winner: {'Exponentiation' if time_exp < time_mod else 'Modulo'}")
Exponentiation: 1.5748s
Modulo: 0.6303s
Winner: Modulo
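For completeness: if alternating signs were ever needed in bulk (they are not here, with at most 6 groups), a vectorized sketch avoids the per-element Python loop entirely. This is an illustrative alternative, not part of the PR.

```python
import numpy as np

n = 8
# Vectorized alternating signs: +1 for even indices, -1 for odd.
signs = np.where(np.arange(n) % 2 == 0, 1, -1)
print(signs)  # → [ 1 -1  1 -1  1 -1  1 -1]
```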
These tests time out:
BREAKING CHANGE: The change in component counting directly affects the number of clusters requested in the `main` auto-scaling function. For the same input data, this version may produce a different number of clusters and therefore different final scaling factors compared to previous versions. The new behavior is considered more accurate.
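A hedged, self-contained illustration of how the old and new counting can disagree for the same input (both expressions are reproduced from the diff; the singular values are made up):

```python
import numpy as np

singulars = np.array([2.0, 2.0, 1.0])
threshold = 0.5
variance_ratio = np.cumsum(singulars**2) / np.sum(singulars**2)
# variance_ratio = [4/9, 8/9, 1.0] ≈ [0.444, 0.889, 1.0]

# Old: count ratios strictly below the threshold (excluding the last),
# floored at 1 — here only 0.444 qualifies.
old = max(len([1 for i in variance_ratio[:-1] if i < threshold]), 1)
# New: first component whose cumulative ratio reaches the threshold.
new = int(np.searchsorted(variance_ratio, threshold, side="left") + 1)
print(old, new)  # → 1 2
```

With the same data, the two versions request different component counts, which is what makes this a breaking change downstream.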
I have updated the snapshots.
Not sure if @larsevj has any more comments, but it's 🚀 for my part
Nothing to add from my side either. Looks good.
Results from running ES on Drogon using the old and new way of counting principal components:

git rebase -i main --exec 'just rapid-tests'