-
Notifications
You must be signed in to change notification settings - Fork 32
/
dataforge.tex
933 lines (783 loc) · 26.7 KB
/
dataforge.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
\chapter{Using Data-Forge}\label{s:dataforge}
We have now seen how to do everything a data scientist would want to do in JavaScript
except actual data science.
This is unfortunately one of the areas where the language still lags behind R and Python,
but statistical libraries are now appearing,
and if the last twenty-five years have taught us anything,
it's not to underestimate JavaScript.
In this chapter we will look at a library called Data-Forge\index{Data-Forge}
that is designed for working with tabular data.
Data-Forge was inspired by Python's Pandas library,\index{Pandas}
but should be familiar to anyone who has worked with the tidyverse in R as well.\index{tidyverse}
Its \grefdex{g:dataframe}{\texttt{DataFrame}}{dataframe} class represents a table made up of named columns
and any number of rows.
Dataframes are \gref{g:immutable}{immutable}:
once a dataframe has been constructed,
its contents cannot be changed.
Instead,
every operation produces a new dataframe.
(Some clever behind-the-scenes data recycling makes this much more efficient than it sounds.)
Like Pandas and the tidyverse,
Data-Forge is designed to work on \gref{g:tidy-data}{tidy data}.
As defined in \cite{Wick2014}, tabular data is tidy if:
\begin{itemize}
\item
Each column contains one statistical variable
(i.e., one property that was measured or observed).
\item
Each different observation is in a different row.
\item
There is one table for each set of observations.
\item
If there are multiple tables,
each table has a column containing a unique key
so that related data can be linked.
\end{itemize}
For example,
this data is not tidy:
\begin{longtable}{|l|l|l|l|l|}
\hline
\multicolumn{5}{|c|}{Rodent Pleurisy Rates} \\
\hline
& \multicolumn{2}{c|}{Female} & \multicolumn{2}{c|}{Male} \\
\hline
& 2018 & 2019 & 2018 & 2019 \\
\hline
Jan & 0.05 & 0.07 & 0.03 & 0.06 \\
\hline
Feb & 0.05 & 0.08 & 0.04 & 0.07 \\
\hline
Mar & 0.05 & 0.11 & 0.04 & 0.10 \\
\hline
\end{longtable}
\noindent
but this data is:
\begin{longtable}{|l|l|l|l|}
\hline
Year & Month & Sex & Rate \\
\hline
2018 & Jan & Female & 0.05 \\
\hline
2018 & Feb & Female & 0.05 \\
\hline
2018 & Mar & Female & 0.05 \\
\hline
2018 & Jan & Male & 0.03 \\
\hline
2018 & Feb & Male & 0.04 \\
\hline
2018 & Mar & Male & 0.04 \\
\hline
2019 & Jan & Female & 0.07 \\
\hline
2019 & Feb & Female & 0.08 \\
\hline
2019 & Mar & Female & 0.11 \\
\hline
2019 & Jan & Male & 0.06 \\
\hline
2019 & Feb & Male & 0.07 \\
\hline
2019 & Mar & Male & 0.10 \\
\hline
\end{longtable}
\section{Basic Operations}\label{s:dataforge-basics}
To get started,
we install Data-Forge using \texttt{npm install data-forge}
and then load the library and create a dataframe from a list of objects.
Each of these objects must use the same keys,
which become the names of the dataframe's columns:
\begin{minted}{js}
const DF = require('data-forge')
const fromObjects = new DF.DataFrame([
{ones: 1, tens: 10},
{ones: 2, tens: 20},
{ones: 3, tens: 30}
])
console.log('fromObjects:\n', fromObjects)
\end{minted}
When we print the dataframe,
we see this rather complex structure:
\begin{minted}{text}
fromObjects:
DataFrame {
configFn: null,
content:
{ index: CountIterable {},
values: [ [Object], [Object], [Object] ],
pairs: MultiIterable { iterables: [Array] },
isBaked: true,
columnNames: [ 'ones', 'tens' ] } }
\end{minted}
Each column is stored as a \texttt{Series} object;
once in a while,
we will need to work with these objects directly
instead of with the dataframe as a whole.
(\figref{f:dataforge-rows-and-columns}).
If we want to see the actual data,
we need to convert the dataframe back to an array of objects:
\figpdf{figures/dataforge-rows-and-columns.pdf}{How Tables Are Stored}{f:dataforge-rows-and-columns}
\begin{minted}{js}
console.log('fromObjects as array:\n', fromObjects.toArray())
\end{minted}
\begin{minted}{text}
fromObjects as array:
[ { ones: 1, tens: 10 },
{ ones: 2, tens: 20 },
{ ones: 3, tens: 30 } ]
\end{minted}
We can instead create a dataframe by providing the names of the columns in one list
and the rows' values in another:
\begin{minted}{js}
const fromSpec = new DF.DataFrame({
columnNames: ['ones', 'tens'],
rows: [
[4, 40],
[5, 50],
[6, 60]
]
})
console.log('fromSpec as array:\n', fromSpec.toArray())
\end{minted}
\begin{minted}{text}
fromSpec as array:
[ { ones: 4, tens: 40 },
{ ones: 5, tens: 50 },
{ ones: 6, tens: 60 } ]
\end{minted}
However,
we usually won't create a dataframe directly like this;
instead,
we will read data from a file or a database.
Data-Forge provides a function called \texttt{fromCSV} for doing this:
\begin{minted}{js}
const text = `ones,tens
7,70
8,80
9,90`
const fromText = DF.fromCSV(text)
console.log('fromText as array:\n', fromText.toArray())
\end{minted}
\begin{minted}{text}
fromText as array:
[ { ones: '7', tens: '70' },
{ ones: '8', tens: '80' },
{ ones: '9', tens: '90' } ]
\end{minted}
However we create our dataframe,
we can ask it for its columns' names:
\begin{minted}{js}
console.log(data.getColumnNames())
\end{minted}
\begin{minted}{text}
[ 'ones', 'tens' ]
\end{minted}
\noindent
or get its content as a list of lists (rather than as a list of objects):
\begin{minted}{js}
console.log(data.toRows())
\end{minted}
\begin{minted}{text}
[ [ 1, 10 ], [ 2, 20 ], [ 3, 30 ] ]
\end{minted}
We can also process the rows using a \texttt{for} loop:
\begin{minted}{js}
for (let row of data) {
console.log(row)
}
\end{minted}
\begin{minted}{text}
{ ones: 1, tens: 10 }
{ ones: 2, tens: 20 }
{ ones: 3, tens: 30 }
\end{minted}
\noindent
or its \texttt{forEach} method:
\begin{minted}{js}
data.forEach(row => {
console.log(row)
})
\end{minted}
\begin{minted}{text}
{ ones: 1, tens: 10 }
{ ones: 2, tens: 20 }
{ ones: 3, tens: 30 }
\end{minted}
\noindent
However,
a good rule of thumb is that if you're using a loop on a dataframe,
you're doing the wrong thing:
you should instead use the methods described below.
\section{Doing Calculations}\label{s:dataforge-calc}
Suppose we want to add a new column to a dataframe---or rather,
create a new dataframe with an extra column,
since we can't modify a dataframe in place.
To do this,
we create a new \texttt{Series} object to represent that column,
then use \texttt{withSeries} to construct our result:
\begin{minted}{js}
const double_oh = new DF.Series([100, 200, 300])
const withHundreds = data.withSeries({hundreds: double_oh})
console.log(withHundreds.toArray())
\end{minted}
\begin{minted}{text}
[ { ones: 1, tens: 10, hundreds: 100 },
{ ones: 2, tens: 20, hundreds: 200 },
{ ones: 3, tens: 30, hundreds: 300 } ]
\end{minted}
Just as we usually create dataframes by reading data from external sources,
we will usually create new columns from existing values.
As you probably won't be surprised to learn,
we tell Data-Forge how to do this by writing callback functions.
Since we often want to create several new columns at once,
we give the \texttt{generateSeries} method an object
whose keys are the names of the new columns
and whose values are callbacks taking a row as input and producing a new value as output:
\begin{minted}{js}
const sumsAndProducts = data.generateSeries({
sum: row => row.ones + row.tens,
product: row => row.ones * row.tens
})
console.log(sumsAndProducts.toArray())
\end{minted}
\begin{minted}{text}
[ { ones: 1, tens: 10, sum: 11, product: 10 },
{ ones: 2, tens: 20, sum: 22, product: 40 },
{ ones: 3, tens: 30, sum: 33, product: 90 } ]
\end{minted}
We can also get rid of columns entirely using \texttt{dropSeries}:
\begin{minted}{js}
const justResults = sumsAndProducts.dropSeries(["ones", "tens"])
console.log(justResults.toArray())
\end{minted}
\begin{minted}{text}
[ { sum: 11, product: 10 },
{ sum: 22, product: 40 },
{ sum: 33, product: 90 } ]
\end{minted}
Since every dataframe method returns a dataframe,
we can use \gref{g:method-chaining}{method chaining} to combine operations (\figref{f:dataforge-method-chaining}).
We have seen this technique before with chains of \texttt{.then} calls on promises;
here,
it is used like pipes in the Unix command line
or the pipe operator \texttt{{\%}{\textgreater}{\%}} in modern R code:
\begin{minted}{js}
const result = data
.withSeries({hundreds: double_oh})
.generateSeries({
sum: row => row.ones + row.tens + row.hundreds
})
.dropSeries(["ones", "tens", "hundreds"])
.toArray()
\end{minted}
\figpdf{figures/dataforge-method-chaining.pdf}{A More Complicated Pipeline}{f:dataforge-method-chaining}
To make results easier to understand,
we will often want to sort our data.
Suppose we have a file containing the red-green-blue values for several colors:
\begin{minted}{text}
name,red,green,blue
maroon,128,0,0
lime,0,255,0
navy,0,0,128
yellow,255,255,0
fuchsia,255,0,255
aqua,0,255,255
\end{minted}
\noindent
We can pass the name of this file to our program as a command-line argument,
read it (remembering to set the encoding to UTF-8 so that we get characters rather than raw bytes),
and then display it:
\begin{minted}{js}
const fs = require('fs')
const DF = require('data-forge')
const text = fs.readFileSync(process.argv[2], 'utf-8')
const colors = DF.fromCSV(text)
console.log(colors.toArray())
\end{minted}
\begin{minted}{text}
[ { name: 'maroon', red: '128', green: '0', blue: '0' },
{ name: 'lime', red: '0', green: '255', blue: '0' },
{ name: 'navy', red: '0', green: '0', blue: '128' },
{ name: 'yellow', red: '255', green: '255', blue: '0' },
{ name: 'fuchsia', red: '255', green: '0', blue: '255' },
{ name: 'aqua', red: '0', green: '255', blue: '255' } ]
\end{minted}
If we want to see the colors in alphabetical order,
we call \texttt{orderBy} with a callback that gives Data-Forge the value to sort by:
\begin{minted}{js}
const sorted = colors.orderBy(row => row.name)
console.log(sorted.toArray())
\end{minted}
\begin{minted}{text}
[ { name: 'aqua', red: '0', green: '255', blue: '255' },
{ name: 'fuchsia', red: '255', green: '0', blue: '255' },
{ name: 'lime', red: '0', green: '255', blue: '0' },
{ name: 'maroon', red: '128', green: '0', blue: '0' },
{ name: 'navy', red: '0', green: '0', blue: '128' },
{ name: 'yellow', red: '255', green: '255', blue: '0' } ]
\end{minted}
\noindent
To sub-sort (\figref{f:dataforge-sorting}) by another column we use \texttt{thenBy}:
\begin{minted}{js}
const doubleSorted = colors
.orderBy(row => row.green)
.thenBy(row => row.blue)
.dropSeries(['name', 'red'])
console.log(doubleSorted.toArray())
\end{minted}
\begin{minted}{text}
[ { green: '0', blue: '0' },
{ green: '0', blue: '128' },
{ green: '0', blue: '255' },
{ green: '255', blue: '0' },
{ green: '255', blue: '0' },
{ green: '255', blue: '255' } ]
\end{minted}
\figpdf{figures/dataforge-sorting.pdf}{Sorting and Sub-sorting}{f:dataforge-sorting}
We can remove duplicates with \texttt{distinct}:
\begin{minted}{js}
const notTheSame = colors.distinct(row => row.red)
console.log(notTheSame.toArray())
\end{minted}
\begin{minted}{text}
[ { name: 'maroon', red: '128', green: '0', blue: '0' },
{ name: 'lime', red: '0', green: '255', blue: '0' },
{ name: 'yellow', red: '255', green: '255', blue: '0' } ]
\end{minted}
\noindent
but this is trickier than it appears.
Each row does indeed have a distinct \texttt{red} value,
but Data-Forge gets to decide which row to keep from each subset.
What's more surprising is that multi-column \texttt{distinct} \emph{doesn't} work:
\begin{minted}{js}
const multiColumn = colors
.distinct(row => [row.red, row.green])
.orderBy(row => row.red)
.thenBy(row => row.green)
console.log(multiColumn.toArray())
\end{minted}
\begin{minted}{text}
[ { name: 'navy', red: '0', green: '0', blue: '128' },
{ name: 'lime', red: '0', green: '255', blue: '0' },
{ name: 'aqua', red: '0', green: '255', blue: '255' },
{ name: 'maroon', red: '128', green: '0', blue: '0' },
{ name: 'fuchsia', red: '255', green: '0', blue: '255' },
{ name: 'yellow', red: '255', green: '255', blue: '0' } ]
\end{minted}
This isn't Data-Forge's fault.
In JavaScript,
two arrays are only equal if they're the same object,
i.e., \texttt{[0] === [0]} is \texttt{false}.
We will explore ways of doing multi-column \texttt{distinct} in the exercises.
\section{Subsets}\label{s:dataforge-subset}
You may have noticed that the color values in the table above are strings rather than numbers.
If we want Data-Forge to convert values to more useful types,\index{dataframe!type conversion}
we can use the methods \texttt{parseDates}, \texttt{parseFloats}, and so on.
The program below does this for a subset of
\hreffoot{https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php}{USGS data about earthquakes in August 2016}:
\begin{minted}{js}
const fs = require('fs')
const DF = require('data-forge')
const text = fs.readFileSync('earthquakes.csv', 'utf-8')
const earthquakes = DF
.fromCSV(text)
.parseDates('Time')
.parseFloats(['Latitude', 'Longitude', 'Depth_Km', 'Magnitude'])
console.log('Data has', earthquakes.count(), 'rows')
\end{minted}
\begin{minted}{text}
Data has 798 rows
\end{minted}
Whether we convert it or not,
we will often want to work with subsets of data.
We can select values by position using \texttt{head} and \texttt{tail}\index{dataframe!select by location}
(which are named after classic Unix commands):
\begin{minted}{js}
console.log(earthquakes.head(3).toArray())
\end{minted}
\begin{minted}{text}
[ { Time: 2016-08-24T07:36:32.000Z,
Latitude: 42.6983,
Longitude: 13.2335,
Depth_Km: 8.1,
Magnitude: 6 },
{ Time: 2016-08-24T07:37:26.580Z,
Latitude: 42.7123,
Longitude: 13.2533,
Depth_Km: 9,
Magnitude: 4.5 },
{ Time: 2016-08-24T07:40:46.590Z,
Latitude: 42.7647,
Longitude: 13.1723,
Depth_Km: 9.7,
Magnitude: 3.8 } ]
\end{minted}
\begin{minted}{js}
console.log(earthquakes.tail(3).toArray())
\end{minted}
\begin{minted}{text}
[ { Time: 2016-08-26T10:09:45.380Z,
Latitude: 42.6953,
Longitude: 13.2363,
Depth_Km: 9.5,
Magnitude: 2.3 },
{ Time: 2016-08-26T10:11:55.960Z,
Latitude: 42.6163,
Longitude: 13.2985,
Depth_Km: 11,
Magnitude: 2.1 },
{ Time: 2016-08-26T10:21:09.870Z,
Latitude: 42.6153,
Longitude: 13.2952,
Depth_Km: 7.5,
Magnitude: 3 } ]
\end{minted}
If we want data from the middle of the table,
we can skip a few rows and then take as many as we want (\figref{f:dataforge-positional}):
\begin{minted}{js}
console.log(earthquakes.skip(10).take(3).toArray())
\end{minted}
\begin{minted}{text}
[ { Time: 2016-08-24T07:47:51.540Z,
Latitude: 42.6675,
Longitude: 13.3238,
Depth_Km: 6.5,
Magnitude: 3.3 },
{ Time: 2016-08-24T07:52:25.710Z,
Latitude: 42.7447,
Longitude: 13.2827,
Depth_Km: 7.9,
Magnitude: 2.9 },
{ Time: 2016-08-24T07:52:43.210Z,
Latitude: 42.6378,
Longitude: 13.2313,
Depth_Km: 10.9,
Magnitude: 3.1 } ]
\end{minted}
\figpdf{figures/dataforge-positional.pdf}{Selecting by Position}{f:dataforge-positional}
However,
it's far more common to select rows by the values they contain rather than by their position.\index{dataframe!select by value}
Just like the \texttt{Array.filter} method we met way back in \secref{s:callbacks-functional},
we do this by giving Data-Forge a callback function that tells it whether a given row should be kept or not.
This text can be as complex as desired,
but must work row by row:
we cannot make a decision about one row based on the values in the rows before it or after it.
\begin{minted}{js}
const large = earthquakes.where(row => (row.Magnitude >= 5.0))
console.log(large.toArray())
\end{minted}
\begin{minted}{text}
[ { Time: 2016-08-24T07:36:32.000Z,
Latitude: 42.6983,
Longitude: 13.2335,
Depth_Km: 8.1,
Magnitude: 6 },
{ Time: 2016-08-24T08:33:28.890Z,
Latitude: 42.7922,
Longitude: 13.1507,
Depth_Km: 8,
Magnitude: 5.4 } ]
\end{minted}
\section{Aggregation}\label{s:dataforge-aggregate}
Working with individual observations is all very well,
but if we want to understand our data,
we need to look at its aggregate properties.\index{dataframe!aggregation}
If,
for example,
we want to know the average depth and magnitude of our earthquakes,
we use the \texttt{summarize} method:
\begin{minted}{js}
const averageValues = earthquakes.summarize({
Depth_Km: series => series.average(),
Magnitude: series => series.average()
})
console.log(averageValues)
\end{minted}
\begin{minted}{text}
{ Depth_Km: 9.545614035087722, Magnitude: 2.5397243107769376 }
\end{minted}
\noindent
As with \texttt{filter},
we can do many calculations at once by giving \texttt{summarize} several callbacks.
The keys of the object we pass in specify the columns we want to aggregate;
the callbacks invoke methods of the \texttt{Series} class that Data-Forge uses to store individual columns.
Instead of producing a dataframe with a single row,
\texttt{summarize} produces an object whose keys match the names of the columns in our original dataframe.
Aggregation is often combined with grouping:
for example,
we may want to calculate the average weight of rats of different breeds
or the distribution of votes by province.
The first step is to group the data:
\begin{minted}{js}
const groupedByMagnitude = earthquakes.groupBy(row => row.Magnitude)
console.log(`${groupedByMagnitude.count()} groups`)
console.log(groupedByMagnitude.head(2).toArray())
\end{minted}
\begin{minted}{text}
28 groups
[ DataFrame {
configFn: null,
content:
{ index: [ExtractElementIterable],
values: [ExtractElementIterable],
pairs: [Array],
isBaked: false,
columnNames: [ColumnNamesIterable] } },
DataFrame {
configFn: null,
content:
{ index: [ExtractElementIterable],
values: [ExtractElementIterable],
pairs: [Array],
isBaked: false,
columnNames: [ColumnNamesIterable] } } ]
\end{minted}
As the output shows,
\texttt{groupBy} returns an array containing one new dataframe for each group in our original data.
Here's how we find the average depth of earthquakes according to magnitude:
\begin{minted}{js}
const averaged = earthquakes
.groupBy(row => row.Magnitude)
.select(group => ({
Magnitude: group.first().Magnitude,
Depth_Km: group.deflate(row => row.Depth_Km).average()
}))
.inflate()
.orderBy(row => row.Magnitude)
console.log(averaged.toArray())
\end{minted}
\begin{minted}{text}
[ { Magnitude: 2, Depth_Km: 9.901052631578946 },
{ Magnitude: 2.1, Depth_Km: 9.702083333333333 },
{ Magnitude: 2.2, Depth_Km: 9.843037974683545 },
...
{ Magnitude: 4.5, Depth_Km: 9.4 },
{ Magnitude: 5.4, Depth_Km: 8 },
{ Magnitude: 6, Depth_Km: 8.1 } ]
\end{minted}
Going through this step by step:
\begin{enumerate}
\item
The \texttt{groupBy} call produces a list of 28 dataframes,
one for each distinct value of \texttt{Magnitude}.
\item
\texttt{select} then converts each of these dataframes into an object
whose \texttt{Magnitude} is equal to the magnitude of the group's first element
and whose \texttt{Depth\_Km} is the average of the depths.
\begin{itemize}
\item
We can use the magnitude of the group's first element because all of the magnitudes in the group are the same.
\item
We must also remember to use \texttt{deflate} to turn a column of a dataframe into a \texttt{Series}
so that we can then call \texttt{average}.
\end{itemize}
\item
The output of \texttt{select} is a \texttt{Series} of two-valued objects,
so we must call \texttt{inflate} to convert it back to a \texttt{DataFrame}.
\item
Finally,
we order by the magnitude of the earthquakes to produce our output.
\end{enumerate}
\figref{f:dataforge-summarizing} shows these steps graphically.
It's easy to forget the \texttt{inflate} and \texttt{deflate} steps at first
(we did when writing this example),
but they quickly become habitual.
\figpdf{figures/dataforge-summarizing.pdf}{Summarizing Groups}{f:dataforge-summarizing}
\section{In Real Life}\label{s:dataforge-real}
To wrap up our exploration of Data-Forge,
we will explore data from \url{http://dataisplural/data.world}
to find the annual recreational visits to national parks in the United States for the last ten years.
Our strategy is to:
\begin{enumerate}
\item
import the data,
\item
inspect the columns to make sure the data is clean,
\item
fix any problems we notice,
\item
group the data by year and summarize the total visitors, and then
\item
filter to keep only the years that are greater than 2009.
\end{enumerate}
We'll start by reading the data and looking at the first couple of rows:
\begin{minted}{js}
const fs = require('fs')
const DF = require('data-forge')
const text = fs.readFileSync('../../data/national_parks.csv', 'utf-8')
const raw = DF.fromCSV(text)
console.log(raw.head(2).toArray())
\end{minted}
\begin{minted}{text}
[ { year: '1904',
gnis_id: '1163670',
geometry: 'POLYGON',
metadata: 'NA',
number_of_records: '1',
parkname: 'Crater Lake',
region: 'PW',
state: 'OR',
unit_code: 'CRLA',
unit_name: 'Crater Lake National Park',
unit_type: 'National Park',
visitors: '1500' },
{ year: '1941',
gnis_id: '1531834',
geometry: 'MULTIPOLYGON',
metadata: 'NA',
number_of_records: '1',
parkname: 'Lake Roosevelt',
region: 'PW',
state: 'WA',
unit_code: 'LARO',
unit_name: 'Lake Roosevelt National Recreation Area',
unit_type: 'National Recreation Area',
visitors: '0' } ]
\end{minted}
It's always good to look at the structure of the data
before we dive into any analysis.
The dataframe method \texttt{detectTypes} shows us the frequency of data types in our dataframe:
\begin{minted}{js}
const typesDf = raw.detectTypes()
console.log(typesDf.toString())
\end{minted}
\begin{minted}{text}
__index__ Type Frequency Column
--------- ------ --------- -----------------
0 string 100 year
1 string 100 gnis_id
2 string 100 geometry
3 string 100 metadata
4 string 100 number_of_records
5 string 100 parkname
6 string 100 region
7 string 100 state
8 string 100 unit_code
9 string 100 unit_name
10 string 100 unit_type
11 string 100 visitors
\end{minted}
We want numbers instead of strings for \texttt{year} and \texttt{visitors},
but before we transform them,
let's check and see if any values in those columns are \texttt{NA},
meaning ``not available'':
\begin{minted}{js}
const cleaned = raw
.where(row => ((row.year != 'NA') && (row.visitors != 'NA')))
console.log(`${raw.count()} original rows and ${cleaned.count()} cleaned rows`)
\end{minted}
\begin{minted}{text}
21560 original rows and 21556 cleaned rows
\end{minted}
Four rows contain missing values.
We probably wouldn't spot them if we scrolled through the data ourselves,
so it's good that we let the computer do the work.
Let's remove those four rows and convert the two columns of interest from strings to numbers:
\begin{minted}{js}
const data = raw
.where(row => ((row.year != 'NA') && (row.visitors != 'NA')))
.parseFloats(['year', 'visitors'])
console.log(`${data.count()} rows`)
console.log(data.detectTypes().toString())
\end{minted}
\begin{minted}{text}
21556 rows
__index__ Type Frequency Column
--------- ------ --------- -----------------
0 number 100 year
... ... ... ...
11 number 100 visitors
\end{minted}
This looks good:
we have dropped the four offending rows and everything else is a number.
With clean data,
we can now group by year
and find the total number of visitors in each year:
\begin{minted}{js}
const annual = data
.groupBy(row => row.year)
.select(group => ({
year: group.first().year,
visitors: group.deflate(row => row.visitors).sum()
}))
.inflate()
.orderBy(row => row.year)
console.log(annual.toString())
\end{minted}
\begin{minted}{text}
__index__ year visitors
--------- ---- -----------
0 1904 120690
112 1905 140954
88 NaN 13764633135
55 1906 30569
113 1907 32935
61 1908 42768
111 1909 60899
110 1910 173416
... ... ...
23 2011 276799292
91 2012 281392715
70 2013 271305455
89 2014 290105230
78 2015 304730566
75 2016 328483428
\end{minted}
Uh oh.
What's that \texttt{NaN} doing there in the third row?
Are there still some missing values in the \texttt{year} column?
\begin{minted}{js}
const numNan = data
.where(row => (isNaN(row.year) || isNaN(row.visitors)))
.count()
console.log(`${numNan} rows have NaN`)
\end{minted}
\begin{minted}{text}
382 rows have NaN
\end{minted}
The exercises will look at where these non-numbers have come from and how we should handle them.
Even if you don't live close to a national park,
we encourage you to take a break and step outside before tackling the exercises below.
\section{Exercises}\label{s:dataforge-exercises}
\exercise{Other Data Formats}
Write a short program that reads data from \texttt{colors.csv},
converts the red-green-blue values to numbers,
and saves the result as JSON.
Once that is working,
write another program that reads the JSON file and converts it back to CSV.
Are there any differences between your second program's output
and your original CSV file?
\exercise{Directional Sorting}
Modify the program in \secref{s:dataforge-calc} to sort color values
by \emph{increasing} red and \emph{decreasing} green.
\exercise{Multi-Column Distinct}
\secref{s:dataforge-calc} pointed out that we cannot use \texttt{distinct}
to find rows with distinct combinations of values.
What \emph{can} we do?
To show that your idea works,
write a program that gets distinct combinations of red and green values from the color data:
\begin{minted}{text}
red,green
0,0
0,255
128,0
255,0
255,255
\end{minted}
\exercise{Revisiting Data Manipulation}
Back in the chapter on data manipulation,
we aggregated \texttt{surveys.csv} to find the minimum, maximum, and average values
for \texttt{year}, \texttt{hindfoot\_length}, and \texttt{weight}.
Repeat this exercise using the methods of \texttt{data-forge}.
\exercise{Grouping and Aggregating}
Retrace the steps in the example that calculated
the average depth of earthquakes of different magnitudes
and show the data structure after each method call.
Are they all dataframes,
or are other data structures created or manipulated as well?
\exercise{Not a Number}
Write a short pipeline that prints out the values in the \texttt{year} and \texttt{visitors} columns
that wind up being converted to \texttt{NaN}.
(Hint: copy the columns, convert the copies to numbers, filter to keep those rows, then print the original values.)
\section*{Key Points}
\input{keypoints/dataforge}