Extracting row and columns
Last updated on 2023-05-04 | Edit this page
Overview
Questions
- How can I extract specific rows and columns from a Dataframe?
- How can I add or delete columns from a Dataframe?
- How can I find and change missing values in a Dataframe?
Objectives
- Define indexing as it relates to data structures
- Select specific columns from a data frame
- Select specific rows from a data frame based on conditional expressions
- Using indexes to access rows and columns
- Copy a data frame
- Add columns to a data frame
- Analyse datasets having missing/null values
We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.
Selecting rows and columns from a pandas Dataframe
If we know which columns we want before we read the data from the
file we can tell read_csv()
to only import those columns by
specifying columns either by their index number (starting at 0) as a
list to the usecols
parameter. Alternatively we can also
provide a list of column names.
PYTHON
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= [0,1,2,173,174,175])
print(df_SN7577_some_cols.shape)
print(df_SN7577_some_cols.columns)
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= ['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'])
print(df_SN7577_some_cols.columns)
OUTPUT
(1286, 6)
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Let us assume for now that we read in the complete file which is now
in the Dataframe df_SN7577
, how can we now refer to
specific columns?
There are two ways of doing this using the column names (or labels):
OUTPUT
0 1
1 3
2 10
3 9
...
If we are interested in more than one column, the 2nd method above
cannot be used. However in the first, although we used a string with the
value of 'Q1'
we could also have provided a list of
strings. Remember that lists are enclosed in []
.
Exercise
What happens if you:
- List the columns you want out of order from the way they appear in the file?
- Put the same column name in twice?
- Put in a non-existing column name? (a.k.a Typo)
Filtering by Rows
You can filter the Dataframe by rows by specifying a range in the
form of a:b
. a
is the first row and
b
is one beyond the last row required.
PYTHON
# select row with index of 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe)
df_SN7577_some_rows = df_SN7577[1:4]
df_SN7577_some_rows
Exercise
What happens if we ask for a single row instead of a range?
Using criteria to filter rows
It is more likely that you will want to select rows from the Dataframe based on some criteria, such as “all rows where the value for Q2 is -1”.
The criteria can be more complex and isn’t limited to a single column’s values:
PYTHON
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)]
df_SN7577_some_rows
We can combine the row selection with column selection:
PYTHON
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2','numage']]
df_SN7577_some_rows
Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows.
There is however another way of selecting rows using the row index:
Using the iloc
method gives the same results as our
previous example.
However, now we can specify a single value and more importantly we
can use the range()
function to indicate the records that
we want. This can be useful for making pseudo-random selections of rows
from across the Dataframe.
PYTHON
# Select the first row from the Dataframe
df_SN7577_some_rows = df_SN7577.iloc[0]
df_SN7577_some_rows
# select every 100th record from the Dataframe.
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)]
df_SN7577_some_rows
You can also specify column ranges using the iloc
method
again using the column index numbers:
PYTHON
# columns 0,1,2 and 3
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),0:4]
df_SN7577_some_rows
# columns 0,1,2,78 and 95
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),[0,1,2,78,95]]
df_SN7577_some_rows
There is also a loc
method which allows you to use the
column names.
Sampling
Pandas does have a sample
method which allows you to
extract a sample of the records from the Dataframe.
PYTHON
df_SN7577.sample(10, replace=False) # ten records, do not select same record twice (this is the default)
df_SN7577.sample(frac=0.05, random_state=1) # 5% of records , same records if run again
Key Points
- Import specific columns when reading in a .csv with the
usecols
parameter - We easily can chain boolean conditions when filtering rows of a pandas dataframe
- The
loc
andiloc
methods allow us to get rows with particular labels and at particular integer locations respectively - pandas has a handy
sample
method which allows us to extract a sample of rows from a dataframe