Python Basic Data Analysis 20180412
WHAT IS PYTHON?
• Python is an open-source programming language
• It is relatively easy to learn
• It is a powerful tool with many modules (libraries) that can be
imported to extend its functionality
• Python can be used to automate tasks and process large
amounts of data
• Python can be used on Macs, PCs, and Linux, as well as in a
high-performance computing environment (Polaris, Andes, and Discovery
machines here at Dartmouth)
WHY PYTHON FOR DATA ANALYSIS?
• Python can be used to import datasets quickly
• Python can import and export common data formats such as CSV files
Reference: Python for Data Analysis, Wes McKinney, 2012, O’Reilly Media
DEVELOPMENT ENVIRONMENTS (I)
• Python can be run in a variety of environments with various tools
• From the command line (most Macs have Python installed by default)
• From a Windows terminal
• From a Linux terminal
• Using an Integrated Development Environment such as Eclipse or PyCharm IDE
• Using a web-hosted “sandbox” environment
DEVELOPMENT ENVIRONMENTS (II)
• Browser-based sandbox
DEVELOPMENT ENVIRONMENTS (III)
• Mac Terminal
DEVELOPMENT ENVIRONMENTS (IV)
Entering Python code:
Command line or Optional IDE
• Preliminary Steps
• Download data from Dartgo link (www.dartgo.org/pyii)
• Get the dataset to either:
• A familiar location on your desktop (e.g. desktop/python-novice-inflammation/data)
• Or upload it to the Sandstorm sandbox web environment
• Opening Python
• Open your browser to https://oasis.sandstorm.io/ (create an account or
sign in with an existing account)
• Or, open a terminal on your Mac or PC
HANDS ON PRACTICE:
GETTING STARTED
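numbervar = 5
print(numbervar)
if 0 == 0:
    print("true")
# mytuple is not defined in these slides; an example tuple is assumed here
mytuple = (10, 20, 30, 40)
mytuple[3]  # the fourth element, because indexing starts at 0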
BASIC DATA STRUCTURES: DICTIONARIES
# Create a dictionary or look-up table
# The leading elements are known as "keys" and the
# trailing elements are known as "values"
lookuptable = {'Dave': 4076, 'Jen': 4327, 'Joanne': 4211}
lookuptable['Dave']
# show the keys
lookuptable.keys()
# show the values
lookuptable.values()
# check to see if an element exists
'Jen' in lookuptable
# output: True
BASIC DATA STRUCTURES: DICTIONARIES
Use the key for error-checking to see if a value exists
# check to see if an element exists
if 'Jen' in lookuptable:
    print("Jen's extension is: " + str(lookuptable['Jen']))
else:
    print("No telephone number listed")
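The same check can also be sketched with the dictionary's get method, which returns None when a key is missing. This is only an illustration: the function name describe_extension and the name 'Sam' are hypothetical and do not come from the slides.

```python
# Look-up table from the slides above
lookuptable = {'Dave': 4076, 'Jen': 4327, 'Joanne': 4211}

def describe_extension(name):
    """Return a message for name, or a fallback if the name is not listed."""
    extension = lookuptable.get(name)  # None when the key is absent
    if extension is None:
        return "No telephone number listed"
    return name + "'s extension is: " + str(extension)

print(describe_extension('Jen'))  # Jen's extension is: 4327
print(describe_extension('Sam'))  # 'Sam' is a hypothetical missing key
```

Using get avoids a KeyError when the name is not in the table.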
DATA STRUCTURES: LOOPING
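a, b = 0, 1
i = 0
fibonacci = str(a)  # start the sequence text with the first term, 0
while i < 7:
    print(b)
    fibonacci = fibonacci + ', ' + str(b)
    a, b = b, a + b  # update both values together so a is not overwritten first
    i = i + 1  # increment the loop counter
print(fibonacci)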
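As a possible alternative to the while loop, the sketch below collects the terms in a list with a for loop and joins them at the end. The function name fibonacci_terms is hypothetical, and the sequence is assumed to start at 1, 1.

```python
def fibonacci_terms(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1, 2, ..."""
    terms = []
    a, b = 0, 1
    for _ in range(count):
        terms.append(b)
        a, b = b, a + b  # simultaneous update keeps the old value of a
    return terms

terms = fibonacci_terms(7)
print(', '.join(str(t) for t in terms))  # 1, 1, 2, 3, 5, 8, 13
```

A for loop over range removes the need for a separate loop counter.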
IMPORTING AND USING MODULES
Modules greatly extend the power and functionality of Python,
much like libraries in R, JavaScript and other languages
import sys
# check the version of Python that is installed
sys.version
# output in this sandbox: '3.4.2 (default, Oct 8 2014, 10:45:20) \n[GCC 4.9.1]'
# check the working directory
import os
os.getcwd()
# output: '/var/home'
# This is less applicable in the sandbox; on a laptop or a Linux server it is
# essential to know the working directory.
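IMPORTING AND USING MODULES
# multiply some consecutive numbers
1*2*3*4*5*6*7
# output: 5040
# xsquared is not defined in these slides; a simple squaring function is assumed here
def xsquared(x):
    return x * x
y = xsquared(5)
print(str(y))
# Output: 25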
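The product 1*2*3*4*5*6*7 is the factorial of 7, so it makes a small example of what an imported module can do. The sketch below computes it once with a loop and once with the standard library's math.factorial; the variable name product is only illustrative.

```python
import math

# Multiply the numbers 1 through 7 by hand
product = 1
for n in range(1, 8):
    product = product * n

print(product)            # 5040
print(math.factorial(7))  # 5040, the same result from the math module
```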
WITH AND FOR COMMANDS
We’ll use the WITH and FOR commands to help us read in and
loop over the rows in a CSV file; here’s some pseudo-code of
what we’d like to do:
import numpy
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')  # load csv to variable
print(data)
print(type(data))
print(data.dtype)
print(data.shape)
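A minimal sketch of reading a CSV file row by row with with and for, using the built-in csv module. So that it runs without the workshop data, it first writes a small hypothetical sample file to a temporary folder; the file name sample.csv and its contents are assumptions, not the inflammation dataset.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the workshop CSV file
sample_text = "0,0,1,3\n0,1,2,1\n"

# Write the sample to a temporary file so the example is self-contained
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'sample.csv')
with open(path, 'w', newline='') as f:
    f.write(sample_text)

rows = []
with open(path, newline='') as f:  # the file is closed automatically on exit
    for row in csv.reader(f):      # each row is a list of strings
        rows.append(row)
        print(row)
```

The with statement closes the file even if an error occurs inside the block.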
DATA ANALYSIS – INFLAMMATION DATASET
• View data elements with matrix addressing
print(data[30,20])
maxval = numpy.max(data)
print('maximum inflammation: ', maxval)
stdval = numpy.std(data)
print('standard deviation: ', stdval)
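The same functions can also summarize one direction of the table at a time through the axis argument. The sketch below uses a small hypothetical array of 3 patients by 4 days in place of the real inflammation file, so the values shown are illustrative only.

```python
import numpy

# Hypothetical 3 patients x 4 days array standing in for the inflammation data
data = numpy.array([[0.0, 1.0, 2.0, 1.0],
                    [0.0, 2.0, 4.0, 2.0],
                    [0.0, 3.0, 6.0, 3.0]])

print(data[1, 2])  # row 1, column 2 (zero-based indexing) -> 4.0

mean_per_day = numpy.mean(data, axis=0)    # average down the rows, one value per day
max_per_patient = numpy.max(data, axis=1)  # maximum across each row, one value per patient
print('mean per day: ', mean_per_day)
print('max per patient: ', max_per_patient)
```

Here axis=0 collapses the rows (patients) and axis=1 collapses the columns (days).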
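DATA ANALYSIS – INFLAMMATION DATASET
• Next, let’s examine a dataset of patients (rows) and forty days of inflammation values (columns)
import matplotlib.pyplot
# %matplotlib inline is a Jupyter/IPython magic; omit it in a plain Python script
%matplotlib inline
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()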
DATA ANALYSIS – INFLAMMATION DATASET
• Next, plot the average inflammation across all patients for each day
# average over the rows (patients) to get one value per day
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()
SCRIPTS AND PARAMETERS
• Use an IDE or friendly text-editor
#!/usr/bin/python
#--------------------------------
# my first script!
import sys
print('My first script!')
print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))
#--------------------------------
READING MULTIPLE FILES
• Programming for speed, reusability
• Data analysis over many files
strfiles = ['inflammation-01.csv', 'inflammation-02.csv']
for f in strfiles:
    print(f)
    #data = numpy.loadtxt(fname=f, delimiter=',')
    #print('mean ', f, numpy.mean(data, axis=0))
import csv
with open('inflammation-01.csv') as f:
    reader2 = csv.reader(f)
    row1 = next(reader2)  # gets the first line
    row2 = next(reader2)  # gets the second line
print("CSV column headers:" + str(row1))
print("CSV first line: " + str(row2))
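Rather than typing each file name into the list, the names could be collected with the standard glob module. The sketch below is an assumption about how that might look: it creates two small placeholder files in a temporary folder so it can run anywhere, and the file contents are hypothetical.

```python
import glob
import os
import tempfile

# Create two small hypothetical CSV files in a temporary directory
tmpdir = tempfile.mkdtemp()
for name in ['inflammation-01.csv', 'inflammation-02.csv']:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write('0,1,2\n')

# Find every matching file; sorted() gives a predictable order
pattern = os.path.join(tmpdir, 'inflammation-*.csv')
strfiles = sorted(glob.glob(pattern))
for f in strfiles:
    print(os.path.basename(f))
```

With the real data, the pattern would point at the folder holding the downloaded files.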
SCRIPTS AND PARAMETERS
• Use an IDE or friendly text-editor
#!/usr/bin/python
#--------------------------------
# program name: python_add_parameters.py
import sys
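i = 1  # start at 1 to skip sys.argv[0], the script name
total = 0
while i < len(sys.argv):
    total = total + float(sys.argv[i])  # assumes every parameter is numeric
    i = i + 1  # increment the loop counter
print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))
print('Total:', total)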
#--------------------------------
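One way to make a parameter-adding script more forgiving is to skip values that are not numbers. The sketch below is only an illustration of that idea; the function name sum_arguments is hypothetical and not part of the original script.

```python
import sys

def sum_arguments(argv):
    """Add up the numeric command-line parameters, skipping the script name."""
    total = 0.0
    for arg in argv[1:]:  # argv[0] is the script name
        try:
            total += float(arg)
        except ValueError:
            print('Skipping non-numeric argument:', arg)
    return total

if __name__ == '__main__':
    print('Total:', sum_arguments(sys.argv))
```

Saved as a .py file, it could be run from a terminal with a few numbers after the script name.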
CSV LIBRARY
• The csv library is built in to Python
import csv
with open('inflammation-01.csv') as f:
    reader2 = csv.reader(f)
    row1 = next(reader2)
print(str(row1))
• Output: ['0', '0', '1', '3', '1', '2', '4', '7', '8', '3', '3', '3', …
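The output above shows that csv.reader returns every value as a string. A minimal sketch of converting one row to numbers before doing arithmetic follows; it reads from an in-memory sample (hypothetical values) instead of the workshop file so that it runs on its own.

```python
import csv
import io

# Hypothetical two-row sample in the same comma-separated layout
sample = io.StringIO("0,0,1,3\n0,1,2,1\n")

reader = csv.reader(sample)
first_row = next(reader)  # a list of strings
numbers = [float(value) for value in first_row]

print(first_row)                    # ['0', '0', '1', '3']
print(numbers)                      # [0.0, 0.0, 1.0, 3.0]
print(sum(numbers) / len(numbers))  # mean of the first row
```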
IMPORTING A DATASET INTO PYTHON:
USING THE OS AND CSV MODULES
Find out where you are in the directory structure by importing the operating system library (os)
# Reference: https://docs.python.org/2/library/csv.html section 13.1
import os
cwd = os.getcwd()
print("Working Directory is: " + cwd)
os.chdir('c:\\temp')
os.getcwd()
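Because c:\temp exists only on some Windows machines, a more portable sketch can use the system's temporary folder as a stand-in for a data folder and build file paths with os.path.join. The choice of the temporary folder is an assumption made for illustration.

```python
import os
import tempfile

start_dir = os.getcwd()             # remember where we started
target_dir = tempfile.gettempdir()  # stand-in for a data folder

os.chdir(target_dir)
print("Working Directory is: " + os.getcwd())

# Build a path without hard-coding / or \ separators
data_path = os.path.join(os.getcwd(), 'inflammation-01.csv')
print(data_path, os.path.exists(data_path))

os.chdir(start_dir)                 # return to the original directory
```

os.path.exists reports whether the data file is actually in that folder before any attempt to open it.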
PC - https://sourceforge.net/projects/numpy/files/NumPy/1.8.0/
Click on the dmg file. You may need to change Mac security
preference (Sys Pref > Security > ) to allow the DMG installer to
run
STATISTICAL OPERATIONS
NUMPY FOR PYTHON 2.7
# Reference: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html
numpy.median
.average
.std
.var
.corrcoef (Pearson product-moment correlation)
.correlate
.cov (estimate of covariance matrix)
.histogram
.amin
.amax
.percentile
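A short sketch of several of the functions listed above, applied to two small hypothetical samples. The values of x and y are made up for illustration and are not from the inflammation dataset.

```python
import numpy

# Hypothetical small samples used only for illustration
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([2.0, 4.0, 6.0, 8.0, 10.0])

print(numpy.median(x))               # 3.0
print(numpy.average(x))              # 3.0
print(numpy.var(x))                  # 2.0 (population variance)
print(numpy.amin(x), numpy.amax(x))  # smallest and largest values
print(numpy.percentile(x, 50))       # 3.0, the 50th percentile

r = numpy.corrcoef(x, y)[0, 1]       # Pearson correlation between x and y
print(r)

counts, edges = numpy.histogram(x, bins=2)  # counts per bin and the bin edges
print(counts)
```

numpy.corrcoef returns a matrix, so the correlation between the two samples is read from row 0, column 1.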
SAVING PYTHON SCRIPTS
• Python files can be written
in a simple text editor, or
using an IDE editor.
• The file extension is .py
A MODULE FOR BASIC STATISTICAL ANALYSIS:
USING THE NUMPY LIBRARY
• Python Foundation
• Online tutorials
• Web forums
• Stack Overflow:
http://stackoverflow.com/questions/tagged/python
LEARNING MORE…
• Python Tutorials
• Python 2.7.13 https://docs.python.org/2/tutorial/
• Python 3.6 https://docs.python.org/3.6/tutorial/
• Numpy, Scipy tutorials
• https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
• http://cs231n.github.io/python-numpy-tutorial/
• Python CSV library tutorial
• https://docs.python.org/2/library/csv.html
• Lynda, YouTube online tutorials
• Lynda, log in with Dartmouth credentials:
www.lynda.com/portal/dartmouth
• Search for Python Programming, Numpy, Scipy
QUESTIONS?