Skip to content

Latest commit

 

History

History
93 lines (82 loc) · 6.18 KB

eval_cpt.md

File metadata and controls

93 lines (82 loc) · 6.18 KB

Evaluate changepoint method for RTT time series

The documents explains:

  • How we evaluate the performance of different changepoint methods on real RTT time series;
  • The scoring method used in evaluation.

The big picture

In order to tell which changepoint method with what configuration works the best for RTT time series, we need an evaluation framework composed of two parts:

  • a dataset of ground truth;
  • a scoring method.

The ground truth dataset is obtained throw manually labelling. In order to ensure the quality of labelled dataset, we performed capability test on labellers using an synthetic dataset. For sythetic RTT time seires, the moments when the generative model changes are known as ground truth. Usage:

$ python labeller_evaluation.py -h
usage: labeller_evaluation.py [-h] [-f FACT] [-l LABELLER] [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -f FACT, --fact FACT  directory storing ground fact.
  -l LABELLER, --labeller LABELLER
                        directory storing the detections of human labeller.
  -o OUTPUT, --output OUTPUT
                        filename storing the output result.

The synthetic dataset with known moments of change is stored in dataset/artificial_trace_ground_fact. There are 20 traces in the repository. These traces are generated by https://github.com/WenqinSHAO/rtt_gen.git. In average, each trace contain 8646 datapoints. There are in all 935 moments of change to be detected. Each trace is stored in a .csv file containing three columns: epoch for timestamp, rtt for RTT value, cp for change.

The labelled changes (by human labllers) are stored in dataset/artificial_trace_labelled. The .csv file follow the same format of ground truth. The labelling is done with the help of these tools: https://github.com/WenqinSHAO/rtt_visual.git.

The output directory is data/.

The capability test confirmed that human labllers are highly capable of detecting changes in synthetic RTT traces. Then we proceed with real traces. 50 real RTT traces of various forms are collected from RIPE Atlas. Each trace have 8162 datapoints on average. In all, they represent more than 34008 hours, i.e. 1417 days, of RTT measurements. 1047 moments of change are manually labelled and stored in dataset/real_trace_labelled. These labelled moments are used as ground truth in the evaluation of changepoint methods.

$ python cpt_evaluation.py -h
usage: cpt_evaluation.py [-h] [-d DIRECTORY] [-f FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        benchmark changepoint methods using the traces from
                        the specified directory.
  -f FILENAME, --filename FILENAME
                        file name for output.

The output directory is data/. The output of labeller_evaluation.py and cpt_evaluation.py are each a .csv file containing following columns:

file       len    changes  tp  fp    fn  precision         recall           score            dis              method             penalty
11119.csv  10001  33       33  1092  0   0.0293333333333   1.0              1.0              0.151515151515   cpt_normal         AIC
11119.csv  10001  33       32  104   1   0.235294117647    0.969696969697   1.0              0.03125          cpt_normal         BIC
11119.csv  10001  33       32  97    1   0.248062015504    0.969696969697   1.0              0.03125          cpt_normal         MBIC
11119.csv  10001  33       33  183   0   0.152777777778    1.0              1.0              0.121212121212   cpt_normal         Hannan-Quinn
11119.csv  10001  33       30  17    3   0.63829787234     0.909090909091   0.986828872629   0.166666666667   cpt_poisson        AIC
11119.csv  10001  33       18  14    15  0.5625            0.545454545455   0.684655151504   0.222222222222   cpt_poisson        BIC
11119.csv  10001  33       18  13    15  0.58064516129     0.545454545455   0.684655151504   0.222222222222   cpt_poisson        MBIC
11119.csv  10001  33       26  14    7   0.65              0.787878787879   0.751922379465   0.192307692308   cpt_poisson        Hannan-Quinn
11119.csv  10001  33       12  11    21  0.521739130435    0.363636363636   0.580483609138   0.333333333333   cpt_poisson_naive  AIC
11119.csv  10001  33       7   11    26  0.388888888889    0.212121212121   0.479300106511   0.571428571429   cpt_poisson_naive  BIC
11119.csv  10001  33       7   11    26  0.388888888889    0.212121212121   0.501398809509   0.714285714286   cpt_poisson_naive  MBIC

Scoring method

Four scoring methods are provided in localutils/benchmark.py. evaluation_window_weighted() is used in human labeller capability test and changepoint methods evaluation. It has a window option thus allows approximate matching between fact and detection. Each RTT change in ground truth is weighted according to its practical importance to produce a weighted version of recall, noted as score in the output file.

More precisely, a bipartite graph is first constructed using moments of ground truth events and detection events as vertices. Edges are composed of ground truth and detection event paris, the distance between which is no greater than the window size. The cost of each edge is then the distance between ground truth and detection. Minimum cost maximum-cardinality matching is calculated using Hungarian algorithm for the constructed graph. The cardinality of the matching is then the number of the True Positive. All the unmatched ground truth events contribute to False Negative. All the unmatched detection events are regarded as False Positive. Precision and recall are well calculated. score is a weighted version of recall, where each RTT change is weighted by the weighting() function. dis column in the output is the average cost of the matching, that is the average distance between matched ground truth and detection events.