Introduction to Python
- Python is an interpreted language
- The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
- Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared
- The Jupyter Notebook application is a complete IDE (Integrated Development Environment)
Python basics
- The Jupyter environment can be used to write code segments and display results
- Data types in Python are implicit: they are inferred from the values assigned to variables
- Basic data types are Integer, Float, String and Boolean
- Lists and Dictionaries are structured data types
- Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets
- Help is available for builtin functions using the `help()` function; further help and code examples are available online
- In Jupyter you can get help on function parameters using shift+tab
- Many functions are in fact methods associated with specific object types
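
The points above can be illustrated with a short sketch. All variable names and values are invented for illustration.

```python
# Types are inferred from the values assigned (names and values are illustrative).
count = 3            # int
price = 2.5          # float
name = "apple"       # str
in_stock = True      # bool

type_names = [type(v).__name__ for v in (count, price, name, in_stock)]
print(type_names)

# Structured types: a list is an ordered sequence, a dictionary maps keys to values.
prices = [2.5, 1.0, 4.25]
stock = {"apple": 3, "pear": 0}

# Brackets change the order of evaluation.
without_brackets = 2 + 3 * 4     # multiplication is done first
with_brackets = (2 + 3) * 4      # addition is done first

# Many functions are methods attached to a particular object type.
shouted = name.upper()           # a str method
prices.append(0.75)              # a list method

# help(len) would print the builtin documentation for len().
```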
Python control structures
- Most programs will require ‘Loops’ and ‘Branching’ constructs.
- The `if`, `elif`, and `else` statements allow for branching in code.
- The `for` and `while` statements allow for looping through sections of code
- The programmer must provide a condition to end a `while` loop.
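
A minimal sketch of branching and looping, using made-up values:

```python
# A for loop visits each item of a sequence in turn;
# if / elif / else chooses one branch per item.
labels = []
for value in [-2, 0, 5]:
    if value < 0:
        labels.append("negative")
    elif value == 0:
        labels.append("zero")
    else:
        labels.append("positive")

# A while loop repeats until its condition becomes False, so the loop body
# must change something that the condition depends on.
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1   # without this line the loop would never end
    steps += 1
```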
Creating re-usable code
- Functions are used to create re-usable sections of code
- Using parameters with functions makes them more flexible
- You can use functions written by others by importing the libraries containing them into your code
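
As a small sketch of these ideas, the hypothetical function `circle_area` below takes parameters (one with a default value) and relies on the standard `math` library for the value of pi:

```python
import math  # a standard library containing functions and constants written by others

def circle_area(radius, precision=2):
    """Return the area of a circle, rounded to `precision` decimal places."""
    return round(math.pi * radius ** 2, precision)

area_default = circle_area(1)        # uses the default precision
area_precise = circle_area(1, 4)     # overrides the default
```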
Processing data from a file
- Reading data from files is far more common than program ‘input’ requests or hard coding values
- Python provides simple means of reading from a text file and writing to a text file
- Tabular data is commonly recorded in a ‘csv’ file
- Text files like csv files can be thought of as being a list of strings. Each string is a complete record
- You can read and write a file one record at a time
- Python has builtin functions to parse (split up) records into individual tokens
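
The sketch below writes a small csv file and then reads it back one record at a time. The file name and contents are invented, and a temporary directory is used only so that the example is self-contained.

```python
import os
import tempfile

# Write a small csv file so that there is something to read.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as f_out:
    f_out.write("id,name,age\n")
    f_out.write("1,Ana,34\n")
    f_out.write("2,Ben,27\n")

# Read the file back one record (line) at a time.
ages = []
with open(path, "r") as f_in:
    header = f_in.readline().rstrip("\n").split(",")
    for line in f_in:
        tokens = line.rstrip("\n").split(",")   # split the record into tokens
        ages.append(int(tokens[header.index("age")]))
```

Every token is read as a string, which is why the age is converted with `int()` before use.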
Dates and Time
- Date and Time functions in Python come from the datetime library, which needs to be imported
- You can use format strings to have dates/times displayed in any representation you like
- Internally, dates and times are stored in special data structures which allow you to access their component parts
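
A minimal sketch using an arbitrary date: a format string is used both to parse text into a `datetime` object and to display it again.

```python
from datetime import datetime

# Parse a string into a datetime object using a format string.
moment = datetime.strptime("2017-03-25 14:30:00", "%Y-%m-%d %H:%M:%S")

# Component parts are available as attributes.
parts = (moment.year, moment.month, moment.day, moment.hour)

# Format strings control how the value is displayed.
as_day_first = moment.strftime("%d/%m/%Y")
as_time = moment.strftime("%H:%M")
```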
Processing JSON data
- JSON is a popular format for transferring data, used by a great many web-based APIs
- The JSON data format is very similar to the Python Dictionary structure.
- The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data
- We can use Python code to extract values of interest and place them in a csv file
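
One way to sketch this is shown below. The JSON document is made up, and `io.StringIO` stands in for a real csv file so that nothing is written to disk.

```python
import csv
import io
import json

# A made-up JSON document with a nested structure.
text = '''
[
  {"id": 1, "user": {"name": "Ana", "location": {"city": "Leeds"}}},
  {"id": 2, "user": {"name": "Ben", "location": {"city": "York"}}}
]
'''

records = json.loads(text)   # JSON arrays become lists, JSON objects become dictionaries

# Pick out only the values of interest and write them as csv rows.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "name", "city"])
for rec in records:
    writer.writerow([rec["id"], rec["user"]["name"], rec["user"]["location"]["city"]])

csv_text = buffer.getvalue()
```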
Reading data from a file using Pandas
- pandas is a Python library containing functions and data structures to assist in data analysis
- pandas data structures are the Series (like a vector) and the Dataframe (like a table)
- The pandas `read_csv` function allows you to read an entire `csv` file into a Dataframe
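
A minimal sketch, assuming pandas is installed. The data are invented, and `io.StringIO` stands in for a csv file on disk.

```python
import io
import pandas as pd

csv_data = io.StringIO("id,name,age\n1,Ana,34\n2,Ben,27\n3,Cat,41\n")

df = pd.read_csv(csv_data)      # the whole file becomes a Dataframe
ages = df["age"]                # a single column is a Series
```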
Extracting rows and columns
- Import specific columns when reading in a .csv with the `usecols` parameter
- We can easily chain boolean conditions when filtering rows of a pandas dataframe
- The `loc` and `iloc` indexers allow us to get rows with particular labels and at particular integer locations respectively
- pandas has a handy `sample` method which allows us to extract a sample of rows from a dataframe
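
The following sketch walks through each point in turn. The data and column names are invented, and `random_state` is set only to make the sample repeatable.

```python
import io
import pandas as pd

csv_text = "id,name,age,city\n1,Ana,34,Leeds\n2,Ben,27,York\n3,Cat,41,Leeds\n4,Dan,19,York\n"

# Import only some of the columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "age", "city"])

# Chain boolean conditions with & (and) or | (or); each condition needs its own brackets.
leeds_over_30 = df[(df["city"] == "Leeds") & (df["age"] > 30)]

# loc selects by label, iloc by integer position.
by_id = df.set_index("id")
row_label_3 = by_id.loc[3]       # the row whose index label is 3
row_position_0 = by_id.iloc[0]   # the first row, whatever its label

# sample returns randomly chosen rows.
two_rows = df.sample(n=2, random_state=0)
```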
Data Aggregation using Pandas
- Summarising numerical and categorical variables is a very common requirement
- Missing data can interfere with how statistical summaries are calculated
- Missing values can be replaced, or deliberately introduced, depending on the requirement
- Summarising or aggregating can be done over single or multiple variables at the same time
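
A small sketch with invented data, showing how a missing value affects a mean and how several variables can be aggregated by group in one call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Leeds", "Leeds", "York", "York"],
    "age": [34.0, np.nan, 27.0, 19.0],
    "income": [100.0, 200.0, 300.0, 500.0],
})

mean_skipping_nan = df["age"].mean()        # missing values are skipped by default
n_missing = int(df["age"].isnull().sum())   # how many values are missing
mean_after_fill = df["age"].fillna(0).mean()  # replacing them changes the result

city_counts = df["city"].value_counts()     # summarising a categorical variable

# Aggregate more than one variable over groups at the same time.
by_city = df.groupby("city")[["age", "income"]].mean()
```

Filling with 0 is used here only to show that the choice of replacement value alters the summary; whether it is appropriate depends on the data.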
Joining Pandas Dataframes
- You can join pandas Dataframes in much the same way as you join tables in SQL
- The `concat()` function can be used to concatenate two Dataframes by adding the rows of one to the other.
- `concat()` can also combine Dataframes by columns, but the `merge()` function is the preferred way
- The `merge()` function is equivalent to the SQL JOIN clause. ‘left’, ‘right’ and ‘inner’ joins are all possible.
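
A minimal sketch using three invented Dataframes, joined on a shared `id` column:

```python
import pandas as pd

people_a = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Ben"]})
people_b = pd.DataFrame({"id": [3], "name": ["Cat"]})
visits = pd.DataFrame({"id": [1, 1, 3, 4], "place": ["Leeds", "York", "Leeds", "Hull"]})

# Add the rows of one Dataframe to the other.
people = pd.concat([people_a, people_b], ignore_index=True)

# merge behaves like a SQL JOIN on the key column.
inner = pd.merge(people, visits, on="id", how="inner")   # only ids present in both
left = pd.merge(people, visits, on="id", how="left")     # every person is kept
right = pd.merge(people, visits, on="id", how="right")   # every visit is kept
```

Rows without a match on the other side are kept by the ‘left’ and ‘right’ joins, with missing values in the unmatched columns.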
Wide and long data formats
- The `melt()` method can be used to change from wide to long format
- The `pivot()` method can be used to change from the long to wide format
- Aggregations are best done from data in the long format.
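
The round trip can be sketched as follows; the column names and scores are invented.

```python
import pandas as pd

wide = pd.DataFrame({
    "id": [1, 2],
    "score_2016": [10, 20],
    "score_2017": [30, 40],
})

# Wide to long: one row per (id, variable) pair.
long_df = wide.melt(id_vars="id", var_name="year", value_name="score")

# Aggregation is straightforward in the long format.
mean_by_year = long_df.groupby("year")["score"].mean()

# Long back to wide.
wide_again = long_df.pivot(index="id", columns="year", values="score")
```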
Data visualisation using Matplotlib
- Graphs can be drawn directly from pandas, but pandas still uses Matplotlib to draw them
- Different graph types have different data requirements
- Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all
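
A minimal sketch with invented data. The `Agg` backend is an assumption made here so that the figure is drawn off-screen and the code runs without a display; in a notebook it would not be needed.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: draw off-screen, no display required
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"year": [2015, 2016, 2017], "responses": [120, 150, 90]})

# Build the graph from individual components placed on a Matplotlib 'canvas'.
fig, ax = plt.subplots()
ax.bar(df["year"], df["responses"])
ax.set_title("Responses per year")
ax.set_xlabel("Year")
ax.set_ylabel("Responses")

# Plotting directly from pandas also returns a Matplotlib Axes object.
ax2 = df.plot(x="year", y="responses", kind="line")
```

The title and axis labels are optional components; the bar chart would still be drawn without them.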
Accessing SQLite Databases
- The SQLite database system is directly available from within Python
- A database table and a pandas Dataframe can be considered similar structures
- Using pandas to return all of the results from a query is simpler than using sqlite3 alone
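
The comparison can be sketched as below. An in-memory database is used so that the example is self-contained; the table name and rows are invented.

```python
import sqlite3
import pandas as pd

# Create a small in-memory database to query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)",
                [(1, "Ana", 34), (2, "Ben", 27), (3, "Cat", 41)])
con.commit()

query = "SELECT name, age FROM people WHERE age > 30 ORDER BY id"

# With sqlite3 alone: run the query through a cursor and fetch the rows as tuples.
cur = con.cursor()
cur.execute(query)
rows = cur.fetchall()

# With pandas: the complete result arrives as a Dataframe in one call.
df = pd.read_sql_query(query, con)

con.close()
```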