# Digitizing data from Tribolium competition experiments of Thomas Park

*Michael Wade
*Dan Leehr
*Karen A. Cranston

Department of Biology, Indiana University, Bloomington, IN 47405

National Evolutionary Synthesis Center, Durham, NC, 27705

# Abstract

(140 words max)

MikeW - add some text introducing Park

The raw data from both the published and unpublished experiments has never been published and exists only as handwritten data sheets in binders. We describe the digitization of data from two published Park manuscripts. We scanned the binder pages, used Amazon Mechanical Turk and Google Spreadsheets to transcribe the scanned tables, and developed methods for computational validation of the data entry process.

# Background & Summary
MikeW - short introduction to Park and to the experiments
MikeW - description of reuse potential

*(700 words maximum) An overview of the study design, the assay(s)
performed, and the created data, including any background information
needed to put this study in the context of previous work and the literature.
The section should also briefly outline the broader goals that motivated
the creation of this dataset and the potential reuse value. We also
encourage authors to include a figure that provides a schematic overview
of the study and assay(s) design. This section and the other main
body sections of the manuscript should include citations to the literature
as needed citecite1, cite2.*

# Methods

Details of the original laboratory experiments and raw data collection are described in previous publications (ref Park).

MikeW - can you add a summary of the experiments behind the data?

*This section should include detailed text describing the methods used
in the study and assay(s), and the processing steps leading to the
production of the data files, including any computational analyses
(e.g. normalization, image feature extraction). These methods should
associated publications. In principle, fundamentally new methods should
not be presented for the first time in Data Descriptors, and we may
choose to decline publication of a Data Descriptor until any novel
methods or techniques are published in an appropriate research or
methods-focused journal article.*

## Scanned images and data entry spreadsheets
MikeAdamo - details of scanning process

We first scanned the pages using a {need information about scanner from Duke} and saved them as high-resolution JPG images. An initial pilot project scanned 66 pages, followed by the full set of 1093 pages. We visually sorted the 1093 images into those that contained consistent tabular data and those that did not. There were 1046 images containing tabular data in two distinct tabular formats. We further sorted these 1046 tabular images into full and partial pages depending on whether the majority of lines on the binder sheet contained data. Figure 1 contains a sample full page of tabular data that represents the majority of the images.

The remaining 47 pages either contained a non-standard tabular format or did not contain tabular data (for example, documentation). These pages were not considered for bulk transcription.

At this point, the usable images were sorted into 6 directories or batches. Each batch contained images of pages from the same experiment, in the same format, and with roughly the same amount of data. For each image, we provided a publicly accessible URL. We then created a template online Google spreadsheet that contained only a single row with the same columns as the tabular data on the scanned pages, and set its permissions to "Anyone with the link can edit". For each image that included a table, we used the Google Docs API to copy the template and create a spreadsheet for data entry, saving its URL. We recorded the mapping of image URLs to spreadsheet URLs in a CSV file.
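
The copy-and-share step can be scripted; the sketch below is only a minimal illustration using the current Google Drive v3 API (via google-api-python-client) rather than the older Google Docs API used at the time, so the template file ID, credentials file, image URLs, and output filename are all placeholders rather than the project's actual values.

```python
# Illustrative sketch only: copy a template spreadsheet once per scanned image,
# make each copy editable by anyone with the link, and record the
# image URL -> spreadsheet URL mapping in a CSV file.
import csv

from google.oauth2 import service_account
from googleapiclient.discovery import build

TEMPLATE_ID = "TEMPLATE_SPREADSHEET_FILE_ID"             # placeholder
IMAGE_URLS = ["https://example.org/scans/page0001.jpg"]  # placeholder list

creds = service_account.Credentials.from_service_account_file(
    "service_account.json",  # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)

rows = [("image_url", "spreadsheet_url")]
for url in IMAGE_URLS:
    # Name the spreadsheet after the image file so the two stay matched
    name = url.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    copy = drive.files().copy(fileId=TEMPLATE_ID, body={"name": name}).execute()
    drive.permissions().create(  # "Anyone with the link can edit"
        fileId=copy["id"], body={"type": "anyone", "role": "writer"}
    ).execute()
    rows.append((url, f"https://docs.google.com/spreadsheets/d/{copy['id']}/edit"))

with open("image_to_spreadsheet.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```
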
## Digitization with Mechanical Turk

Mechanical Turk (https://www.mturk.com) is an online marketplace that pairs workers with online tasks (Human Intelligence Tasks, or HITs). We created a HIT template that included the image, a link to the associated online spreadsheet, and instructions for doing the data entry. See Figure 2 for the HIT template. By including the image location and spreadsheet location as variables on the HIT template, we could use the bulk creation process to generate a separate HIT for each combination of image and associated spreadsheet. We then published the HITs, asking workers to enter the numbers from the image into the cells of the spreadsheet. For full pages, we allowed 60 minutes and paid $1.00 USD. For half pages, we allowed 45 minutes and paid $0.60 USD. The average completion time was 23 minutes for a full page and 12 minutes for a half page. For each submitted batch, all HITs were taken and completed within 1 hour of submission.
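
As an illustration of how the per-HIT variables can be supplied to the bulk creation process, the sketch below builds a batch input CSV from the image-to-spreadsheet mapping; in the Mechanical Turk requester interface, each row of such a file becomes one HIT and the column headers must match the ${...} placeholders in the HIT template. The variable names image_url and spreadsheet_url are illustrative, not necessarily those used in the original template.

```python
# Illustrative sketch: turn the image -> spreadsheet mapping into a Mechanical
# Turk batch input file. Column headers must match the ${image_url} and
# ${spreadsheet_url} placeholders assumed in the HIT template.
import csv

with open("image_to_spreadsheet.csv", newline="") as src, \
        open("mturk_batch_input.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["image_url", "spreadsheet_url"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"image_url": row["image_url"],
                         "spreadsheet_url": row["spreadsheet_url"]})
```
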
We did an initial pilot project to test the workflow, then the full set of pages from the two experiments, with slight modifications to the initial protocol.

# Data Records

All data associated with this manuscript is deposited in the Dryad data repository {link to dryad data package}. All scripts used for data processing, as well as additional documentation, are on GitHub at http://github.com/nescent/parknotebooks. All code is licensed under the GPL v3 license. All data is released with a CC0 waiver.

The data includes both the scanned images from the notebooks and the comma-separated (CSV) files containing the digitized data from those images that follow the standard data format. The two *digguide.csv files detail the naming scheme of the image files. The majority (1025) of the pages follow a very consistent structure with the following columns (a minimal reading example follows the list):

* Date: the date of the observation, in format MM-DD-YY, where month and day can be either one digit or two.
* Age: the age of the population in days
* Obsr.: the name of the person making the observation
* Larvae (multiple columns): the number of individuals in larval stages, separated by size. Sometimes three columns (small, medium, large) and sometimes two (small med, large)
* Sum: the sum of the counts in the Larvae columns
* Pupae: the number of individuals in pupal stage
* Imago: the number of individuals in imago stage
* Total: the sum of the Sum, Pupae and Imago columns
* Dead Imago: number of dead individuals in imago stage
* wt. in grams: total weight of ?? the population?
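
As a minimal illustration of this first format, the sketch below reads one digitized page and splits the Date field into month, day, and two-digit year components (the century is not recorded on the pages). The filename and exact header strings are placeholders based on the column descriptions above, not necessarily the headers used in the released CSV files.

```python
# Illustrative sketch: read one page in the dated-observation format.
# Filename and header names are placeholders based on the column list above.
import csv

def parse_date(raw):
    # Dates are written M-D-YY or MM-DD-YY; keep the two-digit year as-is
    # rather than guessing the century.
    month, day, year = (int(part) for part in raw.strip().split("-"))
    return month, day, year

with open("example_page.csv", newline="") as f:
    for row in csv.DictReader(f):
        month, day, year = parse_date(row["Date"])
        print(year, month, day, row["Obsr."], row["Total"])
```
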
A smaller set of pages (21) follows a different structure, recording the mean values per vial and per gram:

__MikeW: Need detail on these columns__

* PER VIAL Age
* PER VIAL larvae and pupae Mean
* PER VIAL larvae and pupae %
* PER VIAL imagoes Mean
* PER VIAL imagoes %
* PER VIAL total Mean
* PER GRAM L & P Mean
* PER GRAM Imag. Mean
* PER GRAM Total Mean
* n

# Technical Validation

To validate the accuracy of the data entry, we did not use double entry but instead relied on features of the data that allowed for internal verification. In both tabular formats there were columns that were sums of other columns, providing an intrinsic check: the 'Sum' and 'Total' columns in each row of the dated observation data, and the 'PER VIAL total Mean' and 'PER GRAM Total Mean' columns in the mean value data. We wrote a Python script that checked that each entered sum was equal to the actual sum of the corresponding columns. The script also checked that each file contained the expected number of columns and reported the number of rows that passed and failed the sum test. In most cases where the sum test failed, the error was in the original data entry on the binder page, not in the transcribed data. The script generated per-page summary reports in CSV format as well as detailed per-row error listings.
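
The validation scripts themselves are in the GitHub repository noted above; the following is only a simplified sketch of the core sum check for the dated-observation format. The column headers are placeholders taken from the column descriptions, and the real script additionally handles the mean value format and the column-count check.

```python
# Simplified sketch of the per-row sum check for the dated-observation format.
# Header names are illustrative; see the project repository for the real script.
import csv

LARVAE_COLS = ["Larvae small", "Larvae medium", "Larvae large"]  # placeholder headers

def check_page(path):
    passed, failed = 0, 0
    with open(path, newline="") as f:
        # Data rows start on line 2, after the header row
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            larvae = sum(int(row[c] or 0) for c in LARVAE_COLS)
            entered_sum = int(row["Sum"] or 0)
            entered_total = int(row["Total"] or 0)
            expected_total = entered_sum + int(row["Pupae"] or 0) + int(row["Imago"] or 0)
            if entered_sum == larvae and entered_total == expected_total:
                passed += 1
            else:
                failed += 1
                print(f"{path}, line {line_no}: sum check failed")
    return passed, failed

print(check_page("example_page.csv"))
```
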
KarenC: we are going to provide a second set of tables where the errors have been corrected
# Usage Notes
MikeW - anything for here?

*Brief instructions that may help other researchers reuse this dataset.
This is an optional section, but strongly encouraged when helpful
to readers. This may include discussion of software packages that
are suitable for analyzing the assay data files, suggested downstream
processing steps (e.g. normalization, etc.), or tips for integrating
or comparing this with other datasets. If needed, authors are encouraged
to upload code, programs, or data processing workflows as Supplementary
Information, when they may help others analyse the data.*
# Acknowledgements
This work was supported by the National Evolutionary Synthesis Center (NESCent), NSF #EF-0905606
Author contributions: MW provided the notebooks and expert knowledge of the data and experimental lab conditions. KC designed and implemented the pilot experiment. MA scanned the images and provided metadata matching images to notebook pages. DL implemented the full experiment, making improvements to validation protocols.
# Competing financial interests
The author(s) declare no competing financial interests.
# Figure Legends
Figure 1: Sample scanned notebook page. This page is representative of the standard tabular format submitted to Mechanical Turk for digitization.

Figure 2: Human Intelligence Task (HIT) template. The template used to generate individual HITs for submission to Amazon Mechanical Turk.