Hello, could you tell me how to calculate this value, std_split ? #1

Open
Mr-Wu-H opened this issue Apr 13, 2022 · 10 comments
Comments

@Mr-Wu-H

Mr-Wu-H commented Apr 13, 2022

std_split : float
the standard deviation from the mean to subdivide the time series of a class into subclasses.

@benibaeumle
Owner

Hello,
in their paper the authors just write: “Next, we calculate the standard deviation value of these adjacent discrepancies. At last, we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation.”
Unfortunately, I have no well-founded recommendation on how best to choose a value for std_split. The obvious effect is that with higher values you sample fewer time series, because more and more distinct time series are packed into the same subclass.
Maybe visualizing the adjacent discrepancies along with different split values might give you an indication.
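
As a rough sketch of that idea (not taken from the repository), the snippet below counts how many subclasses would result for several candidate split factors, given a list of adjacent discrepancies. The function name `count_subclasses`, the sample values, and the use of the population standard deviation are assumptions made for illustration only.

```python
import statistics


def count_subclasses(adjacent_diffs, std_split):
    """Count the subclasses produced by a given split factor.

    adjacent_diffs: differences between neighboring sorted distances.
    std_split: factor applied to the standard deviation of those differences
               (hypothetical stand-in for the parameter discussed here).
    """
    if not adjacent_diffs:
        return 1
    # Assumption: population standard deviation (ddof = 0).
    threshold = std_split * statistics.pstdev(adjacent_diffs)
    # Every difference above the threshold opens a new subclass.
    n_splits = sum(1 for d in adjacent_diffs if d > threshold)
    return n_splits + 1


# Made-up discrepancies: mostly small gaps with two larger jumps.
example_diffs = [0.1, 0.1, 2.0, 0.1, 5.0, 0.1]

for factor in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(factor, count_subclasses(example_diffs, factor))
```

Plotting the discrepancies together with a horizontal line at each candidate threshold would give the same information visually; the count above is just the textual version of that check.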

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 13, 2022 via email

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 25, 2022

Hello, does the sentence “we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation” mean that we remove the part greater than the standard deviation and use the remaining time series as the sample time series? Am I right?

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 25, 2022

Hello, the method FastShapeletCandidates gets the shapelet candidates of one class, right?

@benibaeumle
Owner

Hello, does the sentence “we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation” mean that we remove the part greater than the standard deviation and use the remaining time series as the sample time series? Am I right?

I am not sure I understand you correctly. Please see chapter 3.1 of the paper for how this particular step is computed (I do not have LaTeX support when answering here, so looking at the paper should be more comfortable for you). In words, what is computed is:

  1. Calculate the sum of the time steps of each time series
  2. Calculate the mean over the sums
  3. Select the time series whose sum is closest to the mean over the sums
  4. Calculate the Euclidean distance of each time series to the time series selected in 3. and sort the resulting list of distances
  5. Calculate the standard deviation of the differences between each pair of neighboring distances
  6. Now, for each pair of neighboring distances in the sorted list, check whether their difference is larger than 1.5x the standard deviation calculated in 5.
  7. If the difference is larger than this threshold, we consider the time series between the last split point and the current split point as a subclass.
  8. Repeat 6 and 7 until we have iterated over all neighboring distance pairs

The result after computing the 8 steps above is the set of subclasses.
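
The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the repository's implementation: the function name `split_into_subclasses` is hypothetical, the 1.5 factor from step 6 is assumed to be what `std_split` controls, the population standard deviation is assumed in step 5, and the series left over after the last split point are assumed to form a final subclass.

```python
import math
import statistics


def split_into_subclasses(series, std_split=1.5):
    """Split the (non-empty) time series of one class into subclasses.

    series: list of equal-length sequences of numbers, all from one class.
    std_split: factor on the standard deviation of adjacent differences
               (assumption: this corresponds to the 1.5 in step 6).
    Returns a list of subclasses, each a list of indices into `series`.
    """
    # Steps 1-2: sum of each series, then the mean over those sums.
    sums = [sum(ts) for ts in series]
    mean_sum = statistics.fmean(sums)

    # Step 3: reference series = the one whose sum is closest to the mean.
    ref_idx = min(range(len(series)), key=lambda i: abs(sums[i] - mean_sum))
    ref = series[ref_idx]

    # Step 4: Euclidean distance of every series to the reference, sorted.
    dists = [math.dist(ts, ref) for ts in series]
    order = sorted(range(len(series)), key=lambda i: dists[i])

    # Step 5: standard deviation of differences between neighboring distances.
    diffs = [dists[order[k + 1]] - dists[order[k]] for k in range(len(order) - 1)]
    if not diffs:
        return [order]
    threshold = std_split * statistics.pstdev(diffs)

    # Steps 6-8: walk over neighboring pairs and cut where the gap is large.
    subclasses = []
    last_split = 0
    for k, gap in enumerate(diffs):
        if gap > threshold:
            current_split = k + 1
            subclasses.append(order[last_split:current_split])
            last_split = current_split

    # Assumption: whatever follows the last split point is the final subclass.
    subclasses.append(order[last_split:])
    return subclasses


# Made-up data: three series near zero and two series near five.
toy_series = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [5.0, 5.0], [5.1, 5.1]]
toy_subclasses = split_into_subclasses(toy_series, std_split=1.5)
print(toy_subclasses)
```

In this sketch, `last_split` and `current_split` are positions in the sorted list of distances, so each subclass is the run of series lying between two consecutive large gaps.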

benibaeumle reopened this Apr 25, 2022
@benibaeumle
Owner

Hello, the method FastShapeletCandidates gets the shapelet candidates of one class, right?

Yes.

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 26, 2022

Thanks a lot.

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 26, 2022

Hello, how should I understand “the last split point” and “the current split point” in step 7?

@benibaeumle
Owner

See here.

@Mr-Wu-H
Author

Mr-Wu-H commented Apr 27, 2022

Thanks a lot. In your demo, the data set fordA_sample generates 6300 shapelets. Do you know how to remove shapelets that may overlap, in order to reduce the time complexity?
