Commit 39300c2

source commit: 3a30cbc
0 parents  commit 39300c2

44 files changed

+2405
-0
lines changed

01-getting-started.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
1+
---
2+
title: Introduction
3+
teaching: 10
4+
exercises: 0
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Describe OpenRefine’s uses and applications.
10+
- Differentiate data cleaning from data organization.
11+
- Experiment with OpenRefine’s user interface.
12+
- Locate helpful resources to learn more about OpenRefine.
13+
14+
::::::::::::::::::::::::::::::::::::::::::::::::::
15+
16+
:::::::::::::::::::::::::::::::::::::::: questions
17+
18+
- How is OpenRefine useful?
19+
20+
::::::::::::::::::::::::::::::::::::::::::::::::::
21+
22+
## Lesson
23+
24+
## Motivations for the OpenRefine Lesson
25+
26+
- Data is often very messy, and this tool saves a lot of time on cleaning
27+
headaches.
28+
29+
- Data cleaning steps often need repeating with multiple files. Knowing exactly what you did to your data makes it possible to repeat those steps on similarly structured data. OpenRefine is
30+
perfect for speeding up repetitive tasks by replaying previous actions on
31+
multiple datasets.
32+
33+
- Additionally, journals, granting agencies, and other institutions are requiring documentation of the
34+
steps you took when working with your data. With OpenRefine, you can capture
35+
all actions applied to your raw data and share them with your publication as
36+
supplemental material.
37+
38+
- Any operation that changes the data in OpenRefine can be reversed or
39+
undone.
40+
41+
- Some concepts such as clustering algorithms are quite complex, but with OpenRefine
42+
we can introduce them, use them, and show their power.
43+
44+
> **Note:** You must export your modified dataset to a new file: OpenRefine does not save over the original source file. All changes are stored in the OpenRefine project.
45+
46+
## Before we get started
47+
48+
Some setup is necessary before we can get started (see the [instructions here](../learners/setup.md)).
49+
50+
## What is OpenRefine?
51+
52+
- OpenRefine is a Java program that runs on your machine (not in the cloud): it is a desktop application that uses your web browser as a graphical interface. No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
53+
- OpenRefine does not modify your original dataset. All actions can be reversed in OpenRefine and you can capture all the actions applied to your data and share this documentation with your publication as supplemental material.
54+
- OpenRefine saves as you go. You can return to the project at any time to pick up where you left off or export your data to a new file.
55+
- OpenRefine can be used to standardise and clean data across your file.
56+
57+
### It can also help you
58+
59+
- Get an overview of a data set
60+
- Resolve inconsistencies in a data set
61+
- Split data up into more granular parts
62+
- Match local data up to other data sets
63+
- Enhance a data set with data from other sources
64+
- Save a set of data cleaning steps to replay on multiple files
65+
66+
OpenRefine is a powerful, free, and open source tool with a large, growing community of practice. More help can be found at [https://openrefine.org](https://openrefine.org).
67+
68+
### Features
69+
70+
- Open source ([source on GitHub](https://github.com/OpenRefine/OpenRefine)).
71+
- A large, growing community, from novice to expert, ready to help.
72+
73+
### More Information on OpenRefine
74+
75+
You can find out a lot more about OpenRefine in the official user manual at [docs.openrefine.org](https://docs.openrefine.org/). There is a [user forum](https://forum.openrefine.org) that can answer a lot of beginner questions and problems. [Recipes](https://github.com/OpenRefine/OpenRefine/wiki/Recipes), scripts, projects, and extensions are available to add functionality to OpenRefine. These can be copied into your OpenRefine instance to run on your dataset.
76+
77+
:::::::::::::::::::::::::::::::::::::::: keypoints
78+
79+
- OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
80+
- OpenRefine will automatically track any steps you take in working with your data.
81+
82+
::::::::::::::::::::::::::::::::::::::::::::::::::
83+
84+

02-importing-data.md

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
---
2+
title: Importing Data to OpenRefine
3+
teaching: 10
4+
exercises: 0
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Create a new OpenRefine project from a CSV file
10+
11+
::::::::::::::::::::::::::::::::::::::::::::::::::
12+
13+
:::::::::::::::::::::::::::::::::::::::: questions
14+
15+
- How can we import our data into OpenRefine?
16+
17+
::::::::::::::::::::::::::::::::::::::::::::::::::
18+
19+
## Importing data
20+
21+
::::::::::::::::::::::::::::::::::::::::: callout
22+
23+
## What kinds of data files can I import?
24+
25+
There are several options for getting your data set into OpenRefine. You can import files in a variety of formats including:
26+
27+
- Comma-separated values (CSV) or tab-separated values (TSV)
28+
- Text files
29+
- Fixed-width columns
30+
- JSON
31+
- XML
32+
- OpenDocument spreadsheet (ODS)
33+
- Excel spreadsheet (XLS or XLSX)
34+
- RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML)
35+
- Wikitext
36+
37+
See the [Create a project by importing data](https://docs.openrefine.org/manual/starting#create-a-project-by-importing-data) page in the OpenRefine manual for more information.
38+
39+
40+
::::::::::::::::::::::::::::::::::::::::::::::::::
41+
42+
## Create your first OpenRefine project (using provided data)
43+
44+
Start OpenRefine, which will open in your browser (at the address [http://127.0.0.1:3333](http://127.0.0.1:3333)). Once OpenRefine is launched in your
45+
browser, the left margin has options to:
46+
47+
- `Create Project`
48+
- `Open Project`
49+
- `Import Project`
50+
- `Language Settings`
51+
52+
1. Click `Create Project` from the left margin and then select `This Computer` (because you're uploading data from your computer).
53+
54+
2. Click `Choose Files` and browse to where you stored the file `Portal_rodents_19772002_simplified.csv`. Select the
55+
file and click `Open`, or just double-click on the filename.
56+
57+
![](fig/or372-create-project.png){alt='Menu to create a new project'}
58+
59+
3. Click `Next>>` under the browse button to upload the data into OpenRefine.
60+
61+
4. On the next screen, OpenRefine will present you with a preview of your data. You can check here for obvious errors. If, for example, your file was tab-delimited rather than comma-delimited, the preview would look strange, and you could correct it by choosing the correct separator and clicking the `Update Preview` button on the right. If you selected the wrong file, click `<<Start Over` at the top left.
62+
63+
5. In the middle of the page there will be a set of options (`Character encoding`, etc.). Make sure the tick box next to `Trim leading & trailing whitespace from strings` is not ticked. (We're going to need the leading whitespace in one of our examples.)
64+
65+
![](fig/or372-data-import.png){alt='Menu to import data'}
66+
67+
6. If all looks well, click `Create Project>>` in the top right. You will be presented with a view onto your data. This is OpenRefine!
68+
69+
The columns are all imported as text, even the columns with numbers. We will see how to format the numeric columns in the next episode.
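The effect of choosing the wrong separator, and the fact that parsed values start out as text, can be illustrated outside OpenRefine. The following is a minimal Python sketch using the standard `csv` module; the sample rows and the `preview` helper are hypothetical stand-ins, not the lesson's data file or OpenRefine's own importer.

```python
import csv
import io

# Hypothetical sample standing in for a tab-delimited file (not the lesson's data).
sample = "recordID\tyr\tplot\n1\t1977\t2\n2\t1977\t3\n"


def preview(text, delimiter):
    """Parse text with the given delimiter and return rows as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


wrong = preview(sample, ",")   # wrong separator: each row collapses into one column
right = preview(sample, "\t")  # correct separator: three columns per row

print(wrong[0])
print(right[0])
# Every parsed value is a string, even when it looks like a number.
print(type(right[1][1]))
```

A preview that shows a single wide column is usually the sign that the separator setting needs changing.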
70+
71+
::::::::::::::::::::::::::::::::::::::::: callout
72+
73+
## OpenRefine does not modify your original dataset
74+
75+
Once your data is imported into a project, OpenRefine leaves your raw data intact and works on a copy which it creates
76+
inside the newly created project. All the data transformation and cleaning steps you apply will be performed on this copy
77+
and you can undo any changes too.
78+
79+
80+
::::::::::::::::::::::::::::::::::::::::::::::::::
81+
82+
:::::::::::::::::::::::::::::::::::::::: keypoints
83+
84+
- Use the Create Project option to import data
85+
- You can control how data imports using options on the import screen
86+
- Several file types may be imported into OpenRefine
87+
88+
::::::::::::::::::::::::::::::::::::::::::::::::::
89+
90+

03-exploring-data.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
1+
---
2+
title: Exploring Data with OpenRefine
3+
teaching: 15
4+
exercises: 20
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Learn about different types of facets and how they can be used to summarise data of different data types
10+
11+
::::::::::::::::::::::::::::::::::::::::::::::::::
12+
13+
:::::::::::::::::::::::::::::::::::::::: questions
14+
15+
- How can we summarise our data?
16+
- How can we find errors in our data?
17+
- How can we edit data to fix errors?
18+
- How can we convert column data from one data type to another?
19+
20+
::::::::::::::::::::::::::::::::::::::::::::::::::
21+
22+
## Exploring data with facets
23+
24+
Facets are one of the most useful features of OpenRefine. Data faceting is a process of exploring data by applying multiple filters to investigate its composition. It also allows you to identify a subset of data that you wish to change in bulk.
25+
26+
A `facet` groups all the like values that appear in a column, and allows you to filter the data by those values. It also allows you to edit values across many records at the same time.
27+
28+
### Exploring text columns
29+
30+
One type of facet is called a 'Text facet'. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.
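Conceptually, a text facet is a count of identical values. The Python sketch below imitates that idea with `collections.Counter`; the list of names is made up for illustration (including a deliberate misspelling and a trailing space), and this is only an analogy, not how OpenRefine implements facets.

```python
from collections import Counter

# Hypothetical values for illustration; the misspelling and trailing space are deliberate.
names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys merriami ",
    "Dipodomys meriami",
    "Chaetodipus penicillatus",
]

# Group identical values and count how often each one occurs.
facet = Counter(names)

by_name = sorted(facet.items())  # comparable to sorting the facet by name
by_count = facet.most_common()   # comparable to sorting the facet by count

for value, count in by_count:
    # repr() makes stray whitespace visible.
    print(repr(value), count)
```

Values that differ only by a typo or a stray space show up as separate entries, which is exactly what makes a facet useful for spotting data entry errors.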
31+
32+
Here we will use faceting to look for potential errors in data entry in the `scientificName` column.
33+
34+
1. Scroll over to the `scientificName` column.
35+
36+
2. Click the down arrow and choose `Facet` > `Text facet`.
37+
38+
![](fig/or362-facet-menu.png){alt='Facet menu of a column'}
39+
40+
3. In the left panel, you'll now see a box containing every unique value in the `scientificName` column
41+
along with a number representing how many times that value occurs in the column.
42+
43+
![](fig/or362-faceted-scientificname.png){alt='Faceting results on the column scientificName'}
44+
45+
4. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
46+
47+
5. Hover the mouse over one of the names in the `facet` list. You should see that you have an `edit` function available.
48+
49+
6. You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we'll use instead. We'll learn about these when we talk about clustering.
50+
51+
::::::::::::::::::::::::::::::::::::::::: callout
52+
53+
## Facets and large datasets
54+
55+
Facets are intended to group together common values, and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a 'book title' column in a data set that has one row per book), the facet will be very large: it may slow down the application, or OpenRefine may not create it at all.
56+
57+
58+
::::::::::::::::::::::::::::::::::::::::::::::::::
59+
60+
::::::::::::::::::::::::::::::::::::::: challenge
61+
62+
## Exercise
63+
64+
1. Using faceting, find out how many years are represented in the census.
65+
2. Which years have the most and least observations?
66+
3. Is the column formatted as Number, Date, or Text?
67+
68+
::::::::::::::: solution
69+
70+
## Solution
71+
72+
1. For the column `yr` do `Facet` > `Text facet`. A box will appear in the left panel showing that there are 16 unique entries in this column.
73+
2. After creating a facet, click `Sort by count` in the facet box. The year with the most observations is 1978. The least is 1993.
74+
3. By default, the column `yr` is formatted as Text.
75+
76+
:::::::::::::::::::::::::
77+
78+
::::::::::::::::::::::::::::::::::::::::::::::::::
79+
80+
### Exploring numeric columns
81+
82+
When a table is imported into OpenRefine, all columns are treated as having text values. We can transform columns to other data types (e.g. number or date) using the `Edit cells` > `Common transforms` feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.
83+
84+
#### Numeric facet
85+
86+
Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a `Numeric facet`.
87+
88+
Create a `numeric facet` for the column `yr`. The facet will be empty because OpenRefine sees all the values as text.
89+
90+
To transform cells in the `yr` column to numbers, click the down arrow for that column, then `Edit cells` > `Common transforms…` > `To number`. You will notice the `yr` values change from left-justified to right-justified, and black to green color.
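As a rough analogy for what a text-to-number transform does, the Python sketch below converts cells that parse as numbers and leaves everything else untouched. The `to_number` helper and the sample column are hypothetical, and the parsing rules here are Python's, which may differ from OpenRefine's.

```python
def to_number(cell):
    """Return an int or float if the text parses as a number, else the original cell."""
    text = cell.strip()
    for parse in (int, float):
        try:
            return parse(text)
        except ValueError:
            continue
    return cell


# Hypothetical column: two year-like strings, one non-number, one blank.
yr_column = ["1977", "1978", "abc", ""]
converted = [to_number(cell) for cell in yr_column]

print(converted)
```

Cells that cannot be parsed stay as text, so a later check for non-numeric values can still find them.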
91+
92+
::::::::::::::::::::::::::::::::::::::: challenge
93+
94+
## Exercise
95+
96+
The dataset includes other numeric columns that we will explore in this exercise:
97+
98+
- `period` - Unique number assigned to each survey period
99+
- `plot` - Plot number animal was caught on, from 1 to 24
100+
- `recordID` - Unique record ID number to facilitate quick reference to particular entry
101+
102+
Transform the columns `period`, `plot`, and `recordID` from text to numbers.
103+
104+
1. How does changing the format change the faceting display for the `yr` column?
105+
2. Can all columns be transformed to numbers?
106+
107+
::::::::::::::: solution
108+
109+
## Solution
110+
111+
Displaying a `Numeric facet` of `yr` shows a histogram of the number of
112+
entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.
113+
114+
Only cells whose values consist solely of numerals (0-9) can be transformed to numbers. If you apply a number transformation to a column that doesn't meet this criterion and then click the `Undo / Redo` tab, you will see a step that starts with `Text transform on 0 cells`. This means that the data in that column was not transformed.
115+
116+
:::::::::::::::::::::::::
117+
118+
::::::::::::::::::::::::::::::::::::::::::::::::::
119+
120+
The next exercise will explore what happens when a numeric column contains values that are not numbers.
121+
122+
::::::::::::::::::::::::::::::::::::::: challenge
123+
124+
## Exercise
125+
126+
1. For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as `abc`) or blank (no number or text).
127+
2. Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
128+
3. Notice that there are several checkboxes in this facet: `Numeric`, `Non-numeric`, `Blank`, and `Error`. Below these are counts of the number of cells in each category. You should see checks for `Non-numeric` and `Blank` if you changed some values.
129+
4. Experiment with checking or unchecking these boxes to select subsets of your data.
130+
131+
132+
::::::::::::::::::::::::::::::::::::::::::::::::::
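The checkbox categories described in the exercise above can be approximated in a few lines of Python. This is a simplified sketch: the `category` helper and the sample column are hypothetical, and the `Error` category is left out because it depends on OpenRefine's own expression evaluation.

```python
from collections import Counter


def category(cell):
    """Label a cell using the names of the numeric facet checkboxes."""
    if cell is None or cell == "":
        return "Blank"
    if isinstance(cell, (int, float)):
        return "Numeric"
    return "Non-numeric"


# Hypothetical column after a number transform, with two cells edited by hand.
column = [1, 2, "abc", "", 5]

counts = Counter(category(cell) for cell in column)

# Comparable to ticking only the Non-numeric and Blank checkboxes.
selected = [cell for cell in column if category(cell) in {"Non-numeric", "Blank"}]

print(dict(counts))
print(selected)
```

Selecting only the problem categories narrows the rows down to the ones that need attention.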
133+
134+
When done examining the numeric data, remove this facet by clicking the `x` in the upper left corner of its panel. Note that this does not undo the edits you made to the cells in this column.
135+
136+
#### Examine a pair of numeric columns using scatterplots
137+
138+
Now that we have multiple columns representing numbers, we can see how they relate to one another using the scatterplot facet. Select a numeric column, for example `recordID`, and use the pulldown menu to choose `Facet` > `Scatterplot facet`. A new window called `Scatterplot Matrix` will appear. There are squares for each pair of numeric columns organized in an upper right triangle. Each square has little dots for the cell values from each row.
139+
140+
![](fig/or372-scatterplots.png){alt='Scatterplots between numeric columns'}
141+
142+
Click the image of the scatterplot between `recordID` and `yr` to select this one for the facet.
143+
144+
::::::::::::::::::::::::::::::::::::::: challenge
145+
146+
## Exercise
147+
148+
Click in the scatterplot facet in the lower left margin and drag to highlight a rectangle. How does this change the data rows displayed?
149+
150+
151+
::::::::::::::::::::::::::::::::::::::::::::::::::
152+
153+
::::::::::::::::::::::::::::::::::::::::: callout
154+
155+
## More Details on Faceting
156+
157+
Full documentation on faceting can be found at [Exploring facets: Faceting](https://docs.openrefine.org/manual/facets).
158+
159+
160+
::::::::::::::::::::::::::::::::::::::::::::::::::
161+
162+
:::::::::::::::::::::::::::::::::::::::: keypoints
163+
164+
- Faceting can identify errors or outliers in data
165+
166+
::::::::::::::::::::::::::::::::::::::::::::::::::
167+
168+
