-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathREADME
183 lines (152 loc) · 5.92 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
Python-DP-Means-Clustering
==========================
Comparing DP-means and K-means clustering algorithms
"cluster.py" has implementations of k-means and dp-means clustering algorithms.
Implementations were intended to be straight-forward, understandable and give
full output for diagnostics, rather than optimized implmentations.
For more information on the dp-means, see Revisiting k-means: New Algorithms
via Bayesian Nonparametrics at http://arxiv.org/abs/1111.0352/
CLUSTERING
==========
> ./cluster.py -h
Usage: cluster.py [options]
Options:
-h, --help show this help message and exit
-k CLUSTERS, --kmeans-clusters=CLUSTERS
If present, use kmeans with number of clusters
specified
-l LAM, --lamda=LAM If preset, use dpmeans with lambda parameters
specified
-x XVAL, --cross-validate=XVAL
Number of records to hold out for cross validations.
Data will be random-ordered for you.
-s, --cross-validate-stop
Stop when cross-validation error rises.
> cat input/c3_s20_f2.csv | ./cluster.py -k2
Tolerance reached at step 8
Iterations completed: 8
Final error: 2.994711
elapsed time: 6.582022 ms
> cat input/c3_s20_f2.csv | ./cluster.py -k2
Tolerance reached at step 3
Iterations completed: 3
Final error: 2.994711
elapsed time: 2.923965 ms
> cat input/c3_s20_f2.csv | ./cluster.py -l2
Tolerance reached at step 4
Iterations completed: 4
Final error: 0.520926
elapsed time: 5.388021 ms
> cat input/c3_s20_f2.csv | ./cluster.py -l5
Tolerance reached at step 2
Iterations completed: 2
Final error: 2.994711
elapsed time: 2.490044 ms
To plot errors for the last run (dp-means in this case), use "plotResult.r"
This script reads ./output/results.csv and ./output/error.csv.
> ./plotResult.r
Loading required package: methods
Loading required package: grid
V1 V2 V3 V4 cluster
Min. :-4.997 Min. :-4.11117 Min. :0.000 Iter-0 :65 0:103
1st Qu.:-3.413 1st Qu.:-3.01148 1st Qu.:0.000 Iter-1 :65 1: 90
Median :-2.562 Median :-2.34123 Median :2.000 Iter-2 :65 2: 59
Mean :-1.606 Mean :-1.68030 Mean :1.918 Iter-3 :65 3: 12
3rd Qu.: 1.425 3rd Qu.:-0.01388 3rd Qu.:4.000 Iter-4 :65 4:126
Max. : 2.224 Max. : 2.00472 Max. :4.000 Iter-Final:65
V1 V2
Min. :0 Min. :0.5209
1st Qu.:1 1st Qu.:0.5209
Median :2 Median :0.5388
Mean :2 Mean :0.6489
3rd Qu.:3 3rd Qu.:0.5777
Max. :4 Max. :1.0860
See training output images created in ./img/iters.png and ./img/error.png
OPTIMAL DP-MEANS
================
Finds the optimal value of lambda only from data.
cat input/c4_s300_f2.csv | ./DPopt.py
...
Final error: 18.510049
Final cross-validation error: 18.223702
Tolerance reached at step 6
Iterations completed: 6
Final error: 14.329098
Final cross-validation error: 14.262425
lambda: 5.48775
with error: 14.26242
Code holds back 20% of data for training optimization.
There are no parameters to set unless you anticipate more than the default max
number of clusters (set in code).
CREATE TEST DATA
================
> ./createTestData.py -h
Usage: createTestData.py [options]
Options:
-h, --help show this help message and exit
-s SAMPLE, --sample-size=SAMPLE
Sample size per cluster
-f FEATURES, --features=FEATURES
Number of features
-c CLUSTERS, --clusters=CLUSTERS
Sample size
-o OVERLAP, --overlap=OVERLAP
0 - distinct, 1 - scale = sig
> ./createTestData.py -s6 -c1 -f3
2.80484810546906,-5.107337369680055,1.7687444192348534
4.045291632153071,-4.955840347993885,1.5936351799326172
3.503220395140305,-5.008280722637208,1.5695863487866264
3.2134872837791812,-4.809839458886229,1.3158740999089755
3.8383496901618197,-4.745260338782687,1.74511375801971
3.3736868708580805,-5.2559718245077045,1.4113521104252063
CLUSTERING TESTS
================
Example test run on data set with 3 features, 100 points per cluster, with 4 clusters.
> ./test.py | tee output/test.all.csv | grep -v Inter > output/test.csv
...
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
Tolerance reached at step 1
Iterations completed: 1
Final error: 0.091798
...
> ./test.py -h
Usage: test.py [options]
Options:
-h, --help show this help message and exit
-f FILE, --file=FILE Input file name
-i ITER, --iterations=ITER
Iterations to use in searching for min error. Default
20.
Plot the test results,
> ./plotTest.r
Loading required package: methods
Loading required package: grid
V1 V2 V3 V4
dp-means:12 Min. : 0.5774 Min. : 1.046 Min. : 345.6
k-means :12 1st Qu.: 2.7424 1st Qu.: 2.722 1st Qu.: 1438.7
Median : 4.8094 Median : 3.758 Median : 4243.7
Mean : 5.1264 Mean : 7.159 Mean : 4779.7
3rd Qu.: 6.9462 3rd Qu.: 6.242 3rd Qu.: 5946.4
Max. :12.0000 Max. :33.695 Max. :12496.4
method
dp-means:12
k-means :12
See ./img/test_errors.png and ./img/test_times.png for comparative error and times
for k-means and dp-means.
NOTE: lambda is chosen based on relevant scale of the data. In this example, the data set
was created to fall between -5 and 5, so the range is 10. The maximum lambda is there 10,
while the smallest lambda could be chosen as the smallest expected cluster size.