Summary and Schedule
Part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting is made consistent. This step must be taken with the same care and attention to reproducibility as the analysis.
OpenRefine is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another.
Learning objectives
By the end of this lesson, you will be able to:
- create, export and import a project in OpenRefine
- view and work on subsets of rows using facets and text filters
- reduce variations in data through clustering, bulk editing and transformations
- undo and redo actions and export the history of actions
- save cleaned data in a widely supported file format
This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them months of work they would otherwise spend making these edits by hand.
Importantly, this lesson does not cover all of OpenRefine’s functionalities. It also does not correct all errors in the provided dataset.
Getting Started
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow.
These lessons assume no prior knowledge of the skills or tools.
To most effectively use these materials, please make sure to install everything before working through this lesson.
| | Setup Instructions | Download files required for the lesson |
| Duration: 00h 00m | 1. Introduction | What is OpenRefine useful for? |
| Duration: 00h 10m | 2. Working with OpenRefine | How can we bring our data into OpenRefine? How can we sort and summarize our data? How can we find and correct errors in our raw data? |
| Duration: 00h 45m | 3. Filtering and Sorting with OpenRefine | How can we select only a subset of our data to work with? How can we sort our data? |
| Duration: 01h 05m | 4. Examining Numbers in OpenRefine | How can we convert a column from one data type to another? How can we find non-numeric values in a column that should contain numbers? |
| Duration: 01h 25m | 5. Using scripts | How can we document the data-cleaning steps we’ve applied to our data? How can we apply these steps to additional data sets? |
| Duration: 01h 45m | 6. Exporting and Saving Data from OpenRefine | How can we get our cleaned data out of OpenRefine? How can we save the whole project with all history as a file? |
| Duration: 02h 00m | 7. Other Resources in OpenRefine | What other resources are available for working with OpenRefine? |
| Duration: 02h 10m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Participants should install and run OpenRefine before the workshop, so that any problems reveal themselves early.
Data
The data for this lesson is a part of the Data Carpentry Social Sciences workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database. The SAFI dataset represents interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock).
The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
Download the data file to your computer.
A general description of the dataset used in the Social Sciences lessons can be found in the workshop data home page.
Instead of downloading the data to your computer, you can import the data directly from the URL when you create the project. This can be a useful workaround when learners have trouble finding the file on their computer, so that nobody has to wait.
Software
For this lesson you will need OpenRefine and a web browser. Basic installation steps are provided on this page. The OpenRefine installation manual provides more details about installation, upgrades and configuration.
Note: OpenRefine is a Java program that runs on your machine (not in the cloud). Its interface opens in your browser, but no internet connection is needed for this lesson.
Administrator rights
You do not need administrative rights on the computer to install OpenRefine. However, if anti-malware software blocks OpenRefine when you try to start it, you may need administrative rights to allow OpenRefine to run. OpenRefine is safe to run.
Starting OpenRefine may take a few minutes, even on some modern computers. Learners may wonder whether it is actually working; if there are no error messages, it is probably still starting up and you should wait a little longer.
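One way to tell "still starting" apart from "not running" is to check whether anything is accepting connections at the address OpenRefine serves from (http://127.0.0.1:3333/, as given in the installation steps). The following is a minimal Python sketch, not part of OpenRefine. The function name `wait_for_openrefine` and its timeout values are illustrative assumptions, and an open port only shows that some program is listening there, not that OpenRefine has finished loading.

```python
import socket
import time


def wait_for_openrefine(host="127.0.0.1", port=3333, timeout_s=180.0, poll_s=2.0):
    """Poll host:port until a TCP connection succeeds.

    Returns True as soon as something accepts a connection,
    or False if nothing answers within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a server is listening on that port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Connection refused or timed out: nothing is listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_openrefine()` with no arguments polls the default address for up to three minutes. If it returns `True`, try opening http://127.0.0.1:3333/ in a supported browser; if it returns `False`, look in the command window for error messages.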
Windows
- Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer. 
- Download the software from openrefine.org. 
- Unzip the downloaded file into a directory by right-clicking and selecting “Extract…”. Name that directory something like OpenRefine.
  Long paths: The path to the directory you extract the application files into should be short, because some of OpenRefine’s files have very long names. If the path is too long, OpenRefine cannot start.
- Go to your newly created OpenRefine directory. 
- Launch OpenRefine by opening openrefine.exe. This will launch a command prompt window, but you can ignore that and wait for the browser to launch.
- If you see Internet Explorer start, or OpenRefine does not automatically open for you, point one of the supported browsers at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program. 
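The long-path caveat in the Windows steps can be checked before launching. The sketch below is a hypothetical helper, not an OpenRefine tool: it finds the longest full path under the extraction directory and compares it with the traditional Windows limit of 260 characters (`MAX_PATH`). Whether that is the exact limit OpenRefine runs into is an assumption; the lesson only says that the path should be short.

```python
import os

# Assumption: the classic Windows MAX_PATH limit, which includes the
# terminating null character, so at most 259 characters are usable.
WINDOWS_MAX_PATH = 260


def longest_path(root):
    """Return the longest absolute path found under root (root itself if empty)."""
    root = os.path.abspath(root)
    longest = root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidate = os.path.join(dirpath, name)
            if len(candidate) > len(longest):
                longest = candidate
    return longest


def headroom(root, limit=WINDOWS_MAX_PATH):
    """Characters to spare before the longest path under root hits the limit.

    A negative result means at least one path is already too long.
    """
    return limit - 1 - len(longest_path(root))
```

For example, `headroom(r"C:\OpenRefine")` (an illustrative location) returns how many characters are left. A small or negative number suggests extracting into a directory closer to the drive root.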
Mac
- Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
- Download the software from openrefine.org.
- Unzip the downloaded file into a directory by double-clicking it. Name that directory something like OpenRefine.
- Go to your newly created OpenRefine directory.
- Drag the OpenRefine app into the Applications folder.
- Launch OpenRefine: Control-click the app icon, then choose “Open” from the shortcut menu. For troubleshooting help, see the Apple support page.
- If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Linux
- Check that you have Firefox or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser.
- Download the software from openrefine.org.
- Unzip the downloaded file into a directory. Name that directory something like OpenRefine.
- Go to your newly created OpenRefine directory.
- Launch OpenRefine by typing ./refine into the terminal within the OpenRefine directory.
- If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Exiting OpenRefine
To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down [control] and press [c] on your keyboard. This will save all changes to your projects.
Remember, it’s important to close the browser window or tab first to ensure you’re not actively using OpenRefine before stopping the server. This prevents any unsaved changes from being lost. After stopping the server, you can safely exit the terminal or command prompt window.