I've noticed a discrepancy in the features computed and in the number of clusters in the embedded space when a single csv is split into multiple csvs and later recombined in the BSOID app/algorithm. I think this is due to two reasons.
First, the adaptive filtering in bsoid_utilities/likelihoodprocessing.py estimates its likelihood threshold from the distribution of likelihoods within each file, so the threshold depends on the file's length and contents. If the csv is split into multiple csvs, each piece gets its own threshold, and the features calculated for a given data point may differ from those calculated on the unsplit file.
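To make this concrete, here is a minimal sketch of a per-file, histogram-based threshold in the spirit of the adaptive filter; the rise-detection rule and the synthetic beta-distributed likelihoods are assumptions for illustration, not the actual adp_filt code:

```python
import numpy as np

def adaptive_threshold(likelihoods, n_bins=10):
    """Simplified stand-in for the adaptive filter's threshold choice:
    take the likelihood at the first histogram bin where the counts stop
    falling. Not the real adp_filt logic, just the per-file idea."""
    counts, edges = np.histogram(likelihoods, bins=n_bins, range=(0.0, 1.0))
    rises = np.where(np.diff(counts) >= 0)[0]
    return edges[rises[0] + 1] if rises.size else edges[-1]

rng = np.random.default_rng(0)
# Hypothetical likelihoods: heavy occlusion early in the video, clean tracking later.
first_half = np.concatenate([rng.beta(1, 15, 1500), rng.beta(5, 1, 3500)])
second_half = rng.beta(40, 1, 5000)
whole = np.concatenate([first_half, second_half])

for name, lh in [("whole", whole), ("first half", first_half), ("second half", second_half)]:
    print(name, adaptive_threshold(lh))  # threshold is re-estimated per file
```

Because the threshold is re-estimated from each file's own likelihood histogram, frames near the cutoff can be kept in one run and forward-filled in another, and that difference propagates into the extracted features.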
Second, the StandardScaler() in extract_features.py is fit to each csv separately. If the csv is split into multiple csvs, each split is scaled with its own mean and variance before the data is recombined, so the features calculated for a given data point may again differ from the unsplit run.
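A small sketch of the scaling issue (the feature matrix here is synthetic; only the per-file use of StandardScaler mirrors what I see in extract_features.py):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic "features" for one session (rows = frames), with the behaviour
# drifting over time so the two halves have different means and variances.
whole = np.vstack([rng.normal(0.0, 1.0, (5_000, 4)),
                   rng.normal(2.0, 0.5, (5_000, 4))])

# Scaling the unsplit csv once...
scaled_whole = StandardScaler().fit_transform(whole)

# ...versus scaling each split csv separately and then recombining.
halves = np.array_split(whole, 2)
scaled_split = np.vstack([StandardScaler().fit_transform(h) for h in halves])

# The same frame ends up with different scaled values depending on the split,
# because each split is standardised against its own statistics.
print(np.max(np.abs(scaled_whole - scaled_split)))
```

Unless the scaler statistics are computed jointly over all files (or fixed from the unsplit data), the recombined features will not match the unsplit ones.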
Therefore, even if the input data is the same and is merely split across files (one csv divided into multiple csvs that together capture the same pose data from the original mp4), the number of clusters used to train the random forest classifier will not match. What should I make of this? Does the embedded space still carry meaning about the behavior if it changes with factors such as file length and how the data is combined? Is there anything I could be doing wrong that causes this discrepancy?
Note that this discrepancy arises before the UMAP embedding or HDBSCAN clustering: the computed features already differ, which yields a different embedded space and therefore a different number of clusters used to train the random forest classifier.
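This is how I confirmed the divergence is upstream of the embedding; the file names below are hypothetical placeholders for the per-frame feature matrices exported from the two runs, not paths the app actually writes:

```python
import numpy as np

# Hypothetical exports of the per-frame feature matrices from the whole-file
# run and the split-file run (placeholder paths).
features_whole = np.load("features_whole_run.npy")
features_split = np.load("features_split_run.npy")

print(features_whole.shape, features_split.shape)
print(np.allclose(features_whole, features_split))      # False once they diverge
print(np.max(np.abs(features_whole - features_split)))  # size of the divergence
```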