Data Analysis and Visualization with Python for Social Scientists *alpha*: Key Points

Introduction to Python

Python is an interpreted language
The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared
Jupyter notebooks provide a complete IDE (Integrated Development Environment)

The Jupyter environment can be used to write code segments and display results
Data types in Python are implicit: they are inferred from the values assigned to variables
Basic data types are Integer, Float, String and Boolean
Lists and Dictionaries are structured data types
Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets
Help is available for built-in functions using the help() function; further help and code examples are available online
In Jupyter you can get help on function parameters using Shift+Tab
Many functions are in fact methods associated with specific object types
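
As a minimal sketch of the points above (all variable names and values are made up for illustration), the following shows implicit typing, the two structured types, the effect of brackets on precedence, and methods attached to particular object types:

```python
# Types are never declared; Python infers them from the assigned values.
count = 3          # Integer
ratio = 2.5        # Float
name = "survey"    # String
done = True        # Boolean
type_names = [type(value).__name__ for value in (count, ratio, name, done)]

# Structured types: a list (ordered values) and a dictionary (key/value pairs).
scores = [4, 8, 15]
respondent = {"id": 101, "village": "A"}  # hypothetical record

# Brackets change the order in which operators are applied.
without_brackets = 2 + 3 * 4    # multiplication happens first
with_brackets = (2 + 3) * 4     # addition happens first

# Methods are functions that belong to a specific object type.
shouted = name.upper()          # a string method
scores.append(16)               # a list method

print(type_names, without_brackets, with_brackets, shouted, scores)
```

Calling help(str.upper) or help(list.append) prints the documentation for the two methods used here.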

Functions are used to create re-usable sections of code
Using parameters with functions makes them more flexible
You can use functions written by others by importing the libraries containing them into your code
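
A small sketch of these three points, assuming a hypothetical function name (circle_area) and parameter (precision) chosen only for illustration. It reuses math.pi from the standard library's math module rather than hard coding the value:

```python
import math  # a library written by others, imported so we can reuse it


def circle_area(radius, precision=2):
    """Return the area of a circle, rounded to `precision` decimal places."""
    return round(math.pi * radius ** 2, precision)


# The same function serves different needs because of its parameters.
area_default = circle_area(1)
area_precise = circle_area(1, precision=4)
print(area_default, area_precise)
```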

Reading data from files is far more common than asking the user for input or hard-coding values
Python provides simple means of reading from a text file and writing to a text file
Tabular data is commonly recorded in a ‘csv’ file
Text files such as csv files can be thought of as a list of strings, where each string is a complete record
You can read and write a file one record at a time
Python has built-in string methods, such as split(), to parse (split up) records into individual tokens
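
The record-at-a-time pattern described above might look like the sketch below. The file name, column names and values are invented for illustration, and the file is written to a temporary directory so the example is self-contained:

```python
import os
import tempfile

# Hypothetical records: a header line followed by two data lines.
records = ["id,village,age", "1,A,34", "2,B,51"]
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# Write the file one record at a time.
with open(path, "w") as outfile:
    for record in records:
        outfile.write(record + "\n")

# Read it back one record at a time, splitting each record into tokens.
ages = []
with open(path) as infile:
    header = infile.readline().rstrip("\n").split(",")
    for line in infile:
        tokens = line.rstrip("\n").split(",")
        ages.append(int(tokens[2]))  # tokens are strings until converted

print(header, ages)
```

Splitting on a comma is enough for simple files like this one. Files whose fields contain quoted commas are better handled by the csv module in the standard library.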

Date and time functions in Python come from the datetime module in the standard library, which needs to be imported
You can use format strings to have dates/times displayed in any representation you like
Internally, dates and times are stored in special data structures which allow you to access their component parts
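
A short sketch of parsing, inspecting and re-formatting a timestamp. The timestamp string and the output formats are arbitrary choices made for this example:

```python
from datetime import datetime

# Parse a string into a datetime object using a format string.
moment = datetime.strptime("2017-03-23 09:49:57", "%Y-%m-%d %H:%M:%S")

# The component parts are available as attributes.
year, month, hour = moment.year, moment.month, moment.hour

# Format strings control how the same moment is displayed.
day_first = moment.strftime("%d/%m/%Y")
time_first = moment.strftime("%H:%M on %d/%m/%Y")

print(year, month, hour, day_first, time_first)
```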

JSON is a popular format for transferring data, used by a great many Web-based APIs
The JSON data format is very similar to the Python Dictionary structure
The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data
We can use Python code to extract values of interest and place them in a csv file
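
The sketch below illustrates extracting values of interest from nested JSON and writing them as csv. The JSON document, its field names and the chosen output columns are all invented for this example, and an in-memory buffer stands in for a real output file:

```python
import csv
import io
import json

# A hypothetical nested JSON document, as might be returned by a Web API.
raw = """[
  {"id": 1, "user": {"name": "ana", "location": {"city": "Leeds"}}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "ben", "location": {}}, "tags": []}
]"""

documents = json.loads(raw)  # JSON objects become Python dictionaries

buffer = io.StringIO()  # stands in for a file opened for writing
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "name", "city", "n_tags"])

for doc in documents:
    user = doc["user"]
    city = user["location"].get("city", "")  # the key may be absent
    writer.writerow([doc["id"], user["name"], city, len(doc["tags"])])

csv_text = buffer.getvalue()
print(csv_text)
```

Only the chosen values reach the csv file. The nesting and the variable-length list of tags are deliberately reduced to single columns.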

pandas is a Python library containing functions and data structures to assist in data analysis
The main pandas data structures are the Series (like a vector) and the DataFrame (like a table)
The pandas read_csv function allows you to read an entire csv file into a DataFrame
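
As an illustration, the sketch below reads csv text into a DataFrame and pulls one column out as a Series. The data are made up, and an in-memory buffer is used in place of a file name so that the example runs on its own:

```python
import io

import pandas as pd

# Hypothetical csv content; read_csv also accepts a file name or path.
csv_text = "id,village,age\n1,A,34\n2,B,51\n3,A,27\n"

df = pd.read_csv(io.StringIO(csv_text))  # the whole file becomes a DataFrame
ages = df["age"]                         # a single column is a Series
shape = df.shape                         # (number of rows, number of columns)

print(df)
print(shape)
```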

Use the usecols parameter of read_csv to import only specific columns from a csv file
We can easily chain boolean conditions when filtering rows of a pandas DataFrame
The loc and iloc indexers allow us to get rows with particular labels and at particular integer locations respectively
pandas has a handy sample method which allows us to extract a sample of rows from a DataFrame
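
The four techniques above could be combined as in the following sketch. The column names and values are hypothetical, and random_state is set only so that the sample is repeatable:

```python
import io

import pandas as pd

csv_text = "id,village,age,rooms\n10,A,34,2\n11,B,51,4\n12,A,27,1\n13,C,45,3\n"

# usecols: read only the named columns; here id also becomes the row label.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "village", "age"],
    index_col="id",
)

# Chained boolean conditions: each one in brackets, joined with & (and) or | (or).
older_in_a = df[(df["village"] == "A") & (df["age"] > 30)]

by_label = df.loc[12]      # the row whose label (id) is 12
by_position = df.iloc[0]   # the first row, whatever its label

subset = df.sample(n=2, random_state=1)  # two rows chosen at random

print(older_in_a)
print(subset)
```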

Summarising numerical and categorical variables is a very common requirement
Missing data can interfere with how statistical summaries are calculated
Missing data can be replaced or created, depending on requirements
Summarising or aggregating can be done over single or multiple variables at the same time
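
The sketch below shows how missing values affect a summary and how they can be created or replaced. The data are invented, and treating -1 as a "no answer" code is an assumption made for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "village": ["A", "A", "B", "B"],
    "age": [30, -1, 50, 40],      # -1 is assumed to be a "no answer" code
    "rooms": [2, 3, 4, np.nan],   # one value is genuinely missing
})

# Creating missing data: turn the placeholder code into NaN.
df["age"] = df["age"].replace(-1, np.nan)
mean_age = df["age"].mean()  # NaN values are skipped by default

# Replacing missing data changes the summary.
mean_rooms_skipped = df["rooms"].mean()
mean_rooms_filled = df["rooms"].fillna(0).mean()

# Summarising a categorical variable.
village_counts = df["village"].value_counts()

# Aggregating several variables with several statistics at once.
by_village = df.groupby("village")[["age", "rooms"]].agg(["mean", "count"])

print(mean_age, mean_rooms_skipped, mean_rooms_filled)
print(by_village)
```

Whether to skip missing values or fill them is an analysis decision. The two means for rooms differ only because of that choice.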

You can join pandas DataFrames in much the same way as you join tables in SQL
The concat() function can be used to concatenate two DataFrames by adding the rows of one to the other
concat() can also combine DataFrames by columns, but the merge() function is the preferred way
The merge() function is equivalent to the SQL JOIN clause; ‘left’, ‘right’ and ‘inner’ joins are all possible
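
A minimal sketch of both operations, using three small made-up DataFrames that share a hypothetical id column:

```python
import pandas as pd

wave1 = pd.DataFrame({"id": [1, 2], "age": [34, 51]})
wave2 = pd.DataFrame({"id": [3], "age": [27]})
villages = pd.DataFrame({"id": [1, 3, 4], "village": ["A", "B", "C"]})

# concat: add the rows of one DataFrame to the other.
people = pd.concat([wave1, wave2], ignore_index=True)

# merge: match rows on a key column, like an SQL JOIN.
inner = pd.merge(people, villages, on="id", how="inner")  # ids present in both
left = pd.merge(people, villages, on="id", how="left")    # every row of people

print(inner)
print(left)
```

In the left join, the person with id 2 has no matching village, so that value is missing in the result.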

Graphs can be drawn directly from pandas, but pandas still uses Matplotlib to do the drawing
Different graph types have different data requirements
Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all
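
The sketch below draws a bar chart straight from a DataFrame and then adjusts two of its components through Matplotlib. The data and labels are made up, and the non-interactive "Agg" backend is selected only so that the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available, so draw off-screen

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"village": ["A", "B", "C"], "households": [12, 7, 9]})

# pandas draws the graph, but what it returns is a Matplotlib Axes object.
ax = df.plot(kind="bar", x="village", y="households", legend=False)

# Components such as the title and axis labels are optional extras.
ax.set_title("Households per village")
ax.set_ylabel("Households")

n_bars = len(ax.patches)  # one rectangle per bar
fig = ax.get_figure()
plt.close(fig)            # release the figure once finished with it
```

A bar chart needs one categorical and one numerical column, as here. Other graph types, such as scatter plots, need two numerical columns.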

The SQLite database system is directly available from within Python
A database table and a pandas DataFrame can be considered similar structures
Using pandas to return all of the results from a query is simpler than using sqlite3 alone
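
To compare the two approaches, the sketch below runs the same query with sqlite3 alone and then through pandas. The table name, columns and rows are invented, and an in-memory database is used so that no file is needed:

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database with a small hypothetical table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE farms (id INTEGER, village TEXT, plots INTEGER)")
con.executemany(
    "INSERT INTO farms VALUES (?, ?, ?)",
    [(1, "A", 2), (2, "B", 5), (3, "A", 1)],
)
con.commit()

query = "SELECT village, plots FROM farms WHERE plots > 1 ORDER BY id"

# sqlite3 alone: create a cursor, execute, then fetch a list of tuples.
cur = con.cursor()
cur.execute(query)
rows = cur.fetchall()

# pandas: a single call returns the results as a DataFrame with column names.
df = pd.read_sql_query(query, con)

con.close()
print(rows)
print(df)
```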