Introduction to Python


  • Python is an interpreted language
  • The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
  • Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared
  • The Jupyter notebook environment can be used as a complete IDE (Integrated Development Environment)

Python basics


  • The Jupyter environment can be used to write code segments and display results
  • Data types in Python are implicit: the type of a variable is determined by the value assigned to it
  • Basic data types are Integer, Float, String and Boolean
  • Lists and Dictionaries are structured data types
  • Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets
  • Help is available for built-in functions using the help() function; further help and code examples are available online
  • In Jupyter you can get help on function parameters using Shift+Tab
  • Many functions are in fact methods associated with specific object types
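
The points above can be illustrated with a minimal sketch. All variable names and values here are made up for illustration.

```python
# The type of each variable is inferred from the value assigned to it.
count = 3            # int
price = 2.5          # float
name = "survey"      # str
done = False         # bool

# Structured types: a list and a dictionary.
scores = [4, 8, 15]
person = {"name": "Ada", "age": 36}

# Brackets change the order in which an expression is evaluated.
without_brackets = 2 + 3 * 4      # multiplication is done first
with_brackets = (2 + 3) * 4       # the bracketed addition is done first

# Many functions are methods that belong to a particular type.
shouted = name.upper()            # a str method
scores.append(16)                 # a list method

print(type(count).__name__, type(price).__name__,
      type(name).__name__, type(done).__name__)
print(without_brackets, with_brackets, shouted, scores)

# help(len) would print the built-in documentation for len().
```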

Python control structures


  • Most programs will require ‘Loops’ and ‘Branching’ constructs.
  • The if, elif, else statements allow for branching in code.
  • The for and while statements allow for looping through sections of code
  • The programmer must provide a condition to end a while loop.
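
As a small sketch of these constructs, the example below uses a for loop with if, elif and else, followed by a while loop whose condition eventually becomes false. The values are arbitrary.

```python
# Branching inside a for loop: each number takes exactly one branch.
labels = []
for n in [-2, 0, 5]:
    if n < 0:
        labels.append("negative")
    elif n == 0:
        labels.append("zero")
    else:
        labels.append("positive")

# A while loop needs a condition that will eventually be False.
# Here countdown is reduced on every pass, so the loop ends at 0.
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1

print(labels)
print(steps)
```

If the line that reduces countdown were left out, the condition would never change and the loop would run forever.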

Creating re-usable code


  • Functions are used to create re-usable sections of code
  • Using parameters with functions makes them more flexible
  • You can use functions written by others by importing the libraries containing them into your code
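
A short sketch of these ideas follows. The function circle_area and its precision parameter are hypothetical names chosen for this example; math is a standard library module that supplies the value of pi.

```python
import math  # a library written by others, imported into our code


def circle_area(radius, precision=2):
    """Return the area of a circle, rounded to `precision` decimal places."""
    return round(math.pi * radius ** 2, precision)


# The same function is reused with different arguments.
small = circle_area(1)        # uses the default precision of 2
large = circle_area(2, 1)     # overrides the precision

print(small, large)
```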

Processing data from a file


  • Reading data from files is far more common than program ‘input’ requests or hard coding values
  • Python provides simple means of reading from a text file and writing to a text file
  • Tabular data is commonly recorded in a ‘csv’ file
  • Text files such as csv files can be thought of as a list of strings, where each string is a complete record
  • You can read and write a file one record at a time
  • Python has built-in string methods, such as split(), to parse (split up) records into individual tokens
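
The sketch below writes a few made-up records to a temporary csv file and then reads them back one record at a time. The file name people.csv and its contents are assumptions for illustration only.

```python
import os
import tempfile

# Made-up records: a header line followed by two data lines.
records = ["id,name,age", "1,Ada,36", "2,Grace,45"]

# Write to a temporary folder so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), "people.csv")

# Write the file one record at a time.
with open(path, "w") as out_file:
    for record in records:
        out_file.write(record + "\n")

# Read the file one record at a time and split each record into tokens.
ages = []
with open(path, "r") as in_file:
    header = in_file.readline().rstrip("\n").split(",")
    for line in in_file:
        tokens = line.rstrip("\n").split(",")
        ages.append(int(tokens[2]))   # tokens are strings until converted

print(header)
print(ages)
```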

Dates and Time


  • Date and Time functions in Python come from the datetime library, which needs to be imported
  • You can use format strings to have dates/times displayed in any representation you like
  • Internally, dates and times are stored in special data structures which allow you to access their component parts
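
A minimal sketch, using an arbitrary timestamp: strptime parses text into a datetime object, the component parts are available as attributes, and strftime formats the value with a format string.

```python
from datetime import datetime

# Parse a string into a datetime object using a format string.
moment = datetime.strptime("2017-03-15 14:30:00", "%Y-%m-%d %H:%M:%S")

# The component parts are available as attributes.
year = moment.year
month = moment.month
hour = moment.hour

# Format strings control how the value is displayed.
day_first = moment.strftime("%d/%m/%Y")
time_only = moment.strftime("%H:%M")

print(year, month, hour)
print(day_first, time_only)
```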

Processing JSON data


  • JSON is a popular format for transferring data and is used by a great many web-based APIs
  • The JSON data format is very similar to the Python Dictionary structure.
  • The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data
  • We can use Python code to extract values of interest and place them in a csv file
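
As a sketch of this approach, the example below parses a small, made-up JSON document, picks out a few values of interest and writes them as csv text. The field names (id, user, name, langs) are hypothetical; an in-memory buffer stands in for a real csv file.

```python
import csv
import io
import json

# A made-up JSON document with nested structure.
raw = """
[
  {"id": 1, "user": {"name": "Ada", "langs": ["en", "fr"]}},
  {"id": 2, "user": {"name": "Grace", "langs": ["en"]}}
]
"""

# json.loads turns the text into Python lists and dictionaries.
docs = json.loads(raw)

# Write selected values as csv; io.StringIO stands in for an open file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "name", "n_langs"])
for doc in docs:
    writer.writerow([doc["id"], doc["user"]["name"], len(doc["user"]["langs"])])

csv_text = buffer.getvalue()
print(csv_text)
```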

Reading data from a file using Pandas


  • pandas is a Python library containing functions and data structures to assist in data analysis
  • pandas data structures are the Series (like a vector) and the DataFrame (like a table)
  • The pandas read_csv function allows you to read an entire csv file into a DataFrame
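
A minimal sketch, assuming pandas is installed. The csv content is made up, and io.StringIO is used in place of a file name so that the example is self-contained; read_csv accepts either.

```python
import io

import pandas as pd

# Made-up csv content; a path such as "data.csv" could be passed instead.
csv_text = """id,village,members
1,Alpha,3
2,Beta,7
3,Alpha,10
"""

# Read the whole csv into a DataFrame (a table).
df = pd.read_csv(io.StringIO(csv_text))

# A single column of a DataFrame is a Series (like a vector).
members = df["members"]

print(df.shape)
print(type(members).__name__, members.sum())
```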

Extracting rows and columns


  • Import specific columns when reading in a .csv with the usecols parameter
  • We can easily chain boolean conditions when filtering rows of a pandas DataFrame
  • The loc and iloc indexers allow us to get rows with particular labels and at particular integer locations respectively
  • pandas has a handy sample method which allows us to extract a sample of rows from a DataFrame
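
The following sketch shows each of these on a small, made-up table, assuming pandas is installed. Column names and values are hypothetical.

```python
import io

import pandas as pd

csv_text = """id,village,members,rooms
1,Alpha,3,1
2,Beta,7,3
3,Alpha,10,2
4,Gamma,7,1
"""

# usecols: import only the named columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "village", "members"])

# Chain boolean conditions with & (and) or | (or); each needs brackets.
big_alpha = df[(df["village"] == "Alpha") & (df["members"] > 5)]

# loc selects by label; here the id column is used as the row label.
by_id = df.set_index("id")
members_of_3 = by_id.loc[3, "members"]

# iloc selects by integer position: second row, second column.
second_village = df.iloc[1, 1]

# sample extracts random rows; random_state makes the draw repeatable.
two_rows = df.sample(n=2, random_state=1)

print(big_alpha)
print(members_of_3, second_village, len(two_rows))
```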

Data Aggregation using Pandas


  • Summarising numerical and categorical variables is a very common requirement
  • Missing data can interfere with how statistical summaries are calculated
  • Missing data can be replaced or created depending on requirements
  • Summarising or aggregation can be done over single or multiple variables at the same time
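
A minimal sketch of these points, assuming pandas and NumPy are installed. The data is made up, and one income value is deliberately created as missing with np.nan.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "village": ["Alpha", "Alpha", "Beta", "Beta"],
    "members": [3, 7, 4, 6],
    "income": [100.0, np.nan, 50.0, 70.0],   # np.nan creates a missing value
})

# By default the mean skips missing values: (100 + 50 + 70) / 3.
mean_skipping = df["income"].mean()

# Replacing missing values with 0 changes the result: 220 / 4.
mean_filled = df["income"].fillna(0).mean()

# Summarise a categorical variable by counting each category.
counts = df["village"].value_counts()

# Aggregate several numerical variables at once, per village.
by_village = df.groupby("village")[["members", "income"]].mean()

print(mean_skipping, mean_filled)
print(counts)
print(by_village)
```

Whether to skip missing values or replace them depends on what the missing entries mean in the dataset at hand.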

Joining Pandas Dataframes


  • You can join pandas DataFrames in much the same way as you join tables in SQL
  • The concat() function can be used to concatenate two DataFrames by adding the rows of one to the other.
  • concat() can also combine DataFrames by columns, but the merge() function is the preferred way
  • The merge() function is equivalent to the SQL JOIN clause. ‘left’, ‘right’ and ‘inner’ joins are all possible.
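
The sketch below, which assumes pandas is installed and uses made-up tables, adds rows with concat() and then joins on a shared id column with merge().

```python
import pandas as pd

first = pd.DataFrame({"id": [1, 2], "village": ["Alpha", "Beta"]})
second = pd.DataFrame({"id": [3], "village": ["Gamma"]})

# concat adds the rows of one DataFrame to the other.
stacked = pd.concat([first, second], ignore_index=True)

incomes = pd.DataFrame({"id": [2, 3, 4], "income": [50, 70, 90]})

# An inner join keeps only the ids present in both tables.
inner = pd.merge(stacked, incomes, on="id", how="inner")

# A left join keeps every row of the left table; unmatched rows get NaN.
left = pd.merge(stacked, incomes, on="id", how="left")

print(stacked)
print(inner)
print(left)
```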

Wide and long data formats


  • The melt() method can be used to change from wide to long format
  • The pivot() method can be used to change from the long to wide format
  • Aggregations are best done from data in the long format.
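
A small sketch of the round trip, assuming pandas is installed. The columns q1 and q2 are hypothetical survey questions.

```python
import pandas as pd

# Wide format: one row per respondent, one column per question.
wide = pd.DataFrame({"id": [1, 2], "q1": [5, 3], "q2": [4, 2]})

# melt: wide to long, giving one row per respondent per question.
long = wide.melt(id_vars="id", var_name="question", value_name="score")

# pivot: long back to wide.
back = long.pivot(index="id", columns="question", values="score")

# Aggregating is straightforward in the long format.
mean_by_question = long.groupby("question")["score"].mean()

print(long)
print(back)
print(mean_by_question)
```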

Data visualisation using Matplotlib


  • Graphs can be drawn directly from pandas, but pandas still uses Matplotlib to do the drawing
  • Different graph types have different data requirements
  • Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all
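
As a rough sketch, assuming pandas and Matplotlib are installed, the example below draws a bar chart from a made-up DataFrame and then sets a few of the optional components. The "Agg" backend is chosen here only so that the code runs without a display; in a notebook this line is not normally needed.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: draw off-screen, no window required

import pandas as pd

df = pd.DataFrame({
    "village": ["Alpha", "Beta", "Gamma"],
    "members": [3, 7, 5],
})

# pandas draws the chart using Matplotlib and returns a Matplotlib Axes.
ax = df.plot(kind="bar", x="village", y="members", legend=False)

# Optional components: a title and axis labels.
ax.set_title("Household members by village")
ax.set_xlabel("Village")
ax.set_ylabel("Members")

# The Axes sits on a Figure (the 'canvas'), which could be saved to a file.
fig = ax.get_figure()
print(type(ax).__module__)
```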

Accessing SQLite Databases


  • The SQLite database system is directly available from within Python through the sqlite3 module
  • A database table and a pandas DataFrame can be considered similar structures
  • Using pandas to return all of the results from a query is simpler than using sqlite3 alone
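
The comparison can be sketched as follows, assuming pandas is installed. The farms table and its contents are made up, and an in-memory database is used so that no file is created.

```python
import sqlite3

import pandas as pd

# An in-memory database; a file name could be given instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE farms (id INTEGER, village TEXT, plots INTEGER)")
con.executemany(
    "INSERT INTO farms VALUES (?, ?, ?)",
    [(1, "Alpha", 2), (2, "Beta", 5), (3, "Alpha", 1)],
)
con.commit()

query = "SELECT village, plots FROM farms WHERE plots > 1 ORDER BY id"

# With sqlite3 alone: create a cursor, execute, then fetch a list of tuples.
cur = con.cursor()
cur.execute(query)
rows = cur.fetchall()

# With pandas: one call returns a DataFrame with named columns.
df = pd.read_sql_query(query, con)

con.close()

print(rows)
print(df)
```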