Data Aggregation using Pandas
Last updated on 2023-05-04
Overview
Questions
- How can I summarise the data in a data frame?
Objectives
- Access and summarise data stored in a Data Frame
- Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame
- Understand missing data
- Changing to and from ‘NaN’ values
Using Pandas functions to summarise data in a Data Frame
For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique value are present in the dataset.
We shall use the SAFI_results.csv dataset to demonstrate how we can obtain these pieces of information.
For numeric variables we can obtain a variety of basic statistical information by using the describe() method. This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as NaN, you will get a warning message.
You can also use describe() on a single variable basis.
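For example (a minimal sketch, assuming the dataset has been read into a Dataframe called df_SAFI, as it is elsewhere in this episode):
PYTHON
import pandas as pd

df_SAFI = pd.read_csv("SAFI_results.csv")

# summary statistics for all of the numeric variables in the Dataframe
print(df_SAFI.describe())

# summary statistics for a single variable
print(df_SAFI['B_no_membrs'].describe())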
There are also a set of methods which allow us to obtain individual values.
PYTHON
print(df_SAFI['B_no_membrs'].min())
print(df_SAFI['B_no_membrs'].max())
print(df_SAFI['B_no_membrs'].mean())
print(df_SAFI['B_no_membrs'].std())
print(df_SAFI['B_no_membrs'].count())
print(df_SAFI['B_no_membrs'].sum())
OUTPUT
2
19
7.190839694656488
3.1722704895263734
131
942
Unlike the describe() method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.
We can do the same thing for the E19_period_use variable:
PYTHON
print(df_SAFI['E19_period_use'].min())
print(df_SAFI['E19_period_use'].max())
print(df_SAFI['E19_period_use'].mean())
print(df_SAFI['E19_period_use'].std())
print(df_SAFI['E19_period_use'].count())
print(df_SAFI['E19_period_use'].sum())
OUTPUT
1.0
45.0
12.043478260869565
8.583030848015385
92
1108.0
Exercise
Compare the count values returned for the B_no_membrs and the E19_period_use variables.
- Why do you think they are different?
- How does this affect the calculation of the mean values?
- We know from when we originally displayed the contents of the df_SAFI Dataframe that there are 131 rows in it. This matches the value for the B_no_membrs count. The count for E19_period_use however is only 92. If you look at the values in the E19_period_use column (for example by displaying df_SAFI['E19_period_use']) you will see that there are several NaN values. They also occurred when we used describe() on the full Dataframe. NaN stands for Not a Number, i.e. the value is missing. There are only 92 non-missing values and this is what is reported by the count() method. This value is also used in the calculation of the mean and std values.
Dealing with missing values
We can find out how many NaN values each of the variables in our Dataframe contains with code along the following lines:
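PYTHON
df_SAFI.isnull().sum()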
OUTPUT
Column1               0
A01_interview_date    0
A03_quest_no          0
A04_start             0
...
or for a specific variable, for example E19_period_use:
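PYTHON
print(df_SAFI['E19_period_use'].isnull().sum())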
OUTPUT
39
Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.
We have been using data from two very different sources.
The SN7577 dataset is provided by the UK Data Service. Datasets from the UK Data Service have already been ‘cleaned’ and it is unlikely that there will be any genuinely missing data. However, you may find that data which was missing has been replaced with a value such as ‘-1’ or ‘Not Specified’. In cases like these it may be appropriate to replace these values with ‘NaN’ before you try to process the data further.
The SAFI dataset we have been using comes from a project called ‘Studying African Farmer-led Irrigation’. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of ‘NaN’ values in a record (one complete questionnaire) where you would still consider the record to be complete.
We have already seen how we can check for missing values. There are three other actions we need to be able to do:
- Remove complete rows which contain NaN
- Replace NaN with a value of our choice
- Replace specific values with NaN
With these options we can ensure that the data is suitable for the further processing we have planned.
Completely remove rows with NaNs
The dropna() method will delete a complete row if any of the variables in that row contain a NaN. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.
PYTHON
df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI.dropna(inplace=True)
print(df_SAFI.shape)
OUTPUT
(131, 55)
(0, 55)
Because there are variables in the SAFI dataset which are all NaN, using the dropna() method effectively deletes all of the rows from the Dataframe, which is probably not what you wanted. Instead we can use the notnull() method as a row selection criterion and delete the rows where a specific variable has NaN values.
PYTHON
df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
print(df_SAFI.shape)
OUTPUT
(131, 55)
(39, 55)
Replace NaN with a value of our choice
The E19_period_use variable answers the question: “For how many years have you been irrigating the land?”. In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run df_SAFI['E19_period_use'].describe() we get a count value of 92 and all of the other statistics are based on this count value.
Now suppose that instead of NaN the interviewer had entered a value of 0, to indicate that land which is not irrigated has been irrigated for 0 years, which is technically correct.
To see what happens we can convert all of the NaN values in the E19_period_use column to 0, for example with the fillna() method:
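PYTHON
# replace every NaN in the E19_period_use column with 0
df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].fillna(0)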
If we now run describe() again you can see that all of the statistics have been changed because the calculations are now based on a count of 131. This is probably not what we would have wanted. Conveniently, this allows us to demonstrate our third action.
Replace specific values with NaN
Although we can recognise NaN with methods like isnull() or dropna(), actually creating a NaN value and putting it into a Dataframe requires the numpy module. Code along the following lines will replace our 0 values with NaN. We can demonstrate that this has occurred by running describe() again and seeing that we now have our original values back.
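PYTHON
import numpy as np

# replace the 0 values with NaN (np.nan), then check the summary statistics again
df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].replace(0, np.nan)
df_SAFI['E19_period_use'].describe()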
Categorical variables
For categorical variables, numerical statistics don’t make any sense.
For a categorical variable we can obtain a list of the unique values used by the variable with the unique() method, for example for the C01_respondent_roof_type variable:
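PYTHON
df_SAFI['C01_respondent_roof_type'].unique()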
OUTPUT
array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the groupby() method. Having performed the groupby() we can then describe() the results. The format is similar to that which we have seen before, except that the ‘grouped by’ variable appears to the left and there is a set of statistics for each unique value of the variable.
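A minimal sketch, grouping by the C01_respondent_roof_type variable from the previous example:
PYTHON
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()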
You can group by more than one variable at a time by providing them as a list.
PYTHON
grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
grouped_data.describe()
You can also obtain individual statistics if you want.
PYTHON
A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
A11_years_farm
OUTPUT
C01_respondent_roof_type  C02_respondent_wall_type
grass                     burntbricks                 22
                          muddaub                     42
                          sunbricks                    9
mabatipitched             burntbricks                  6
                          muddaub                      3
...
Exercise
- Read in the SAFI_results.csv dataset.
- Get a list of the different C01_respondent_roof_type values.
- Groupby C01_respondent_roof_type and describe the results.
- Remove rows with NULL values for E_no_group_count.
- Repeat steps 2 & 3 and compare the results.
PYTHON
# Steps 1 and 2
import numpy as np
import pandas as pd

df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))
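One possible solution for step 3:
PYTHON
# Step 3
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()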
PYTHON
# Steps 4 and 5
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))
grouped_data.describe()
E_no_group_count is related to whether or not farm plots are irrigated. It has no obvious connection to farm buildings. By restricting the data to non-irrigated plots we have (accidentally?) removed one of the roof_types completely.
Key Points
- Summarising numerical and categorical variables is a very common requirement
- Missing data can interfere with how statistical summaries are calculated
- Missing data can be replaced or created depending on your requirements
- Summarising or aggregation can be done over single or multiple variables at the same time