Summary and Schedule
Good data organization is the foundation of any research project. Most researchers have data in spreadsheets, so it’s the place that many research projects start.
Typically we organize data in spreadsheets in ways that we as humans want to work with the data. However computers require data to be organized in particular ways. In order to use tools that make computation more efficient, such as programming languages like R or Python, we need to structure our data the way that computers need the data. Since this is where most research projects start, this is where we want to start too!
In this lesson, you will learn:
- Good data entry practices - formatting data tables in spreadsheets
- How to avoid common formatting mistakes
- Approaches for handling dates in spreadsheets
- Basic quality control and data manipulation in spreadsheets
- Exporting data from spreadsheets
In this lesson, however, you will not learn about data analysis with spreadsheets. Much of your time as a researcher will be spent in the initial ‘data wrangling’ stage, where you need to organize the data to perform a proper analysis later. It’s not the most fun, but it is necessary. In this lesson you will learn how to think about data organization and some practices for more effective data wrangling. With this approach you can better format current data and plan new data collection so less data wrangling is needed.
Getting Started
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow. These lessons assume no prior knowledge of the skills or tools.
To ensure a smooth start to the workshop, please prepare the required data and software by following the Setup instructions provided below.
For Instructors
If you are teaching this lesson in a workshop, please see the Instructor notes.
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Introduction | What are spreadsheets useful for in a research project? |
| Duration: 00h 18m | 2. Formatting Data Tables in Spreadsheets | How do we format data in spreadsheets for effective data use? |
| Duration: 00h 48m | 3. Formatting Problems | What common mistakes are made when formatting spreadsheets? |
| Duration: 01h 08m | 4. Dates as Data | What are good approaches for handling dates in spreadsheets? |
| Duration: 01h 28m | 5. Quality Assurance | How can we carry out basic quality assurance in spreadsheets? |
| Duration: 01h 53m | 6. Exporting Data | How can we export data from spreadsheets in a way that is useful for downstream applications? |
| Duration: 02h 08m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Setup instructions
Data
You need to download some files to follow this lesson:
- Download the following three files:
- Place these 3 files in a folder you can easily find and access on
your computer (for instance in a
datacarpentry-spreadsheetsfolder on your Desktop or within your Home folder).
For more information about the dataset and to download it from the original source on Figshare, check out the Social Sciences workshop data page.
Software
Google Sheets is a popular and free spreadsheet program that can be easy for your learners to use, especially if they don’t have other software installed.
However, please note the following before choosing to use Google Sheets in your workshop:
- Missing Instructions: The current “Quality Assurance” episode provides step-by-step instructions for data validation using Microsoft Excel and LibreOffice Calc, but not Google Sheets. The layout and options in Sheets are different, while learners can use Excel or Calc interchangeably without significant problems.
- Data Privacy Warning: For working with sensitive data (such as personal or health information), there are some caveats to using Google Sheets or Google Drive. According to Google’s Terms of Service, Google has the right to access content on standard consumer accounts for the purposes of operating and improving their services. Sensitive data should only be stored on cloud services if your home institution has officially approved its use for such purposes.
To work through this tutorial you will need access to a spreadsheet program. For this you have many options: Microsoft Excel, LibreOffice Calc, Apple Numbers, Google Sheets, Gnumeric, Onlyoffice, WPS office, among others. Commands may differ a bit between programs, but the general ideas for thinking about spreadsheets are the same.
For this lesson, we encourage you to use LibreOffice Calc or Microsoft Excel, as the step-by-step instructions are provided for those programs. If you don’t have Microsoft Excel, you can use LibreOffice. It’s a free, open source spreadsheet program. Here are the instructions to install it: