Wide and long data formats
Last updated on 2023-05-04 | Edit this page
Estimated time: 35 minutes
Overview
Questions
- What are long and Wide formats?
- Why would I want to change between them?
Objectives
- Explain difference between long and wide formats and why each might be used
- Illustrate how to change between formats using the
melt()
andpivot()
methods
Wide and long data formats
In the SN7577 dataset that we have been using there is a group of columns which record which daily newspapers each respondent reads. Despite the un-informative names like ‘daily1’ each column refers to a current UK daily national or local newspaper.
Whether the paper is read or not is recorded using the values of 0 or 1 as a boolean indicator. The advantage of using a column for each paper means that should a respondent read multiple newspapers, all of the required information can still be recorded in a single record.
Recording information in this wide format is not always beneficial when trying to analyse the data.
Pandas provides methods for converting data from wide to long format and from long to wide format
The SN7577 dataset does not contain a variable that can be used to uniquely identify a row. This is often referred to as a ‘primary key’ field (or column).
A dataset doesn’t need to have such a key. None of the work we have done so far has required it.
When we create a pandas Dataframe by importing a csv file, we have seen that pandas will create an index for the rows. This index can be used a bit like a key field, but as we have seen there can be problems with the index when we concatenate two Dataframes together.
In the version of SN7577 that we are going to use to demonstrate long and wide formats we will add a new variable with the name of ‘Id’ and we will restrict the other columns to those starting with the word ‘daily’.
We will create a new Dataframe with a single column of ‘Id’.
PYTHON
# create an 'Id' column
df_papers1 = pd.DataFrame(pd.Series(range(1,1287)), index=None, columns=['Id'])
Using the range function, we can create values of Id
starting with 1 and going up to 1286 (remember the second parameter to
range is one past the last value used.) We have explicitly coded this
value because we knew how many rows were in the dataset. If we didn’t,
we could have used
1287
{: output}
We will create a 2nd Dataframe, based on SN7577 but containing only the columns starting with the word ‘daily’.
There are several ways of doing this, we’ll cover the way that we
have covered all of the prerequisites for. We will use the
filter
method of pandas
with its
like
parameter.
The value supplied to like
can occur anywhere in the
column name to be matched (and therefore selected).
To create the Dataframe that we will use, we will concatenate the two Dataframes we have created.
PYTHON
df_papers = pd.concat([df_papers1, df_papers2], axis = 1)
print(df_papers.index)
print(df_papers.columns)
RangeIndex(start=0, stop=1286, step=1)
Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
'daily25'],
dtype='object')
{: output}
We use axis = 1
because we are joining by columns, the
default is joining by rows (axis=0
).
From ‘wide’ to ‘long’
To make the displays more manageable we will use only the first eight ‘daily’ columns
PYTHON
## using df_papers
daily_list = df_papers.columns[1:8]
df_daily_papers_long = pd.melt(df_papers, id_vars = ['Id'], value_vars = daily_list)
# by default, the new columns created will be called 'variable' which is the name of the 'daily'
# and 'value' which is the value of that 'daily' for that 'Id'. So, we will rename the columns
df_daily_papers_long.columns = ['Id','Daily_paper','Value']
df_daily_papers_long
We now have a Dataframe that we can groupby
.
We want to groupby
the Daily_paper
and then
sum the Value
.
Daily_paper
daily1 0
daily2 26
daily3 52
{: output}
From Long to Wide
The process can be reversed by using the pivot()
method.
Here we need to indicate which column (or columns) remain fixed (this
will become an index in the new Dataframe), which column contains the
values which are to become column names and which column contains the
values for the columns.
In our case we want to use the Id
column as the fixed
column, the Daily_paper
column contains the column names
and the Value
column contains the values.
PYTHON
df_daily_papers_wide = df_daily_papers_long.pivot(index = 'Id', columns = 'Daily_paper', values = 'Value')
We can change our Id
index back to an ordinary column
with
Exercise
- Find out how many people take each of the daily newspapers by Title.
- Which titles don’t appear to be read by anyone?
There is a file called Newspapers.csv which lists all of the newspapers Titles along with the corresponding ‘daily’ value
Hint: Newspapers.csv contains both daily and Sunday newspapers you can filter out the Sunday papers with the following code:
- Read in Newspapers.csv file and keep only the dailies.
PYTHON
df_newspapers = pd.read_csv("Newspapers.csv")
df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
df_newspapers
- Create the df_papers Dataframe as we did before.
PYTHON
import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
#create an 'Id' column
df_papers1 = pd.DataFrame(pd.Series(range(1,1287)),index=None,columns=['Id'])
df_papers2 = df_SN7577.filter(like= 'daily')
df_papers = pd.concat([df_papers1, df_papers2], axis = 1)
df_papers
- Create a list of all of the dailies, one way would be
- Pass the list as the
value_vars
parameter to themelt()
method
PYTHON
#use melt to create df_daily_papers_long
df_daily_papers_long = pd.melt(df_papers, id_vars = ['Id'], value_vars = daily_list )
#Change the column names
df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
-
merge
the two Dataframes with a left join, because we want all of the Newspaper Titles to be included.
PYTHON
df_papers_taken = pd.merge(df_newspapers, df_daily_papers_long, how='left', left_on = 'Column_name',right_on = 'Daily_paper')
- Then
groupby
the ‘Title’ and sum the ‘Value’
Key Points
- The
melt()
method can be used to change from wide to long format - The
pivot()
method can be used to change from the long to wide format - Aggregations are best done from data in the long format.