Python Basic Data Analysis 20180412

This document provides an overview of Python for data analysis. It introduces Python, explains why it is useful for data analysis due to libraries like NumPy, Pandas, and Matplotlib. It also covers developing in Python, importing and working with datasets, defining functions, and analyzing a sample medical dataset to calculate statistics and create visualizations.

Uploaded by

rncster

PYTHON II: INTRODUCTION TO DATA ANALYSIS WITH PYTHON


Dartmouth College | Research Computing
OVERVIEW
• What is Python?
• Why Python for data analysis?
• Development Environments
• Hands-on: Basic Data Structures in Python, Looping
• Defining a function in Python
• Importing a dataset into a Python data structure, using modules
• Python scripts and parameters
• Questions, Resources & Links
RC.DARTMOUTH.EDU

Software
Hardware
Consulting
Training
WHAT IS PYTHON?
• Python is an open-source programming language
• It is relatively easy to learn
• It is a powerful tool with many modules (libraries) that can be imported to extend its functionality
• Python can be used to automate tasks and process large amounts of data
• Python can be used on Macs, PCs, and Linux, as well as in a high-performance computing environment (Polaris, Andes, Discovery machines here at Dartmouth)
WHY PYTHON FOR DATA ANALYSIS?
• Python can be used to import datasets quickly

• Python’s importable libraries make it an attractive language for data analysis


• NumPy
• SciPy
• Statsmodels
• Pandas
• Matplotlib
• Natural Language Toolkit (NLTK)

• Python can import and export common data formats such as CSV files
Reference: Python for Data Analysis, Wes McKinney, 2012, O’Reilly Media
DEVELOPMENT ENVIRONMENTS (I)
• Python can be run in a variety of environments with various tools
• From the command line (most Macs have Python installed by default)
• From a Windows terminal
• From a Linux terminal
• Using an Integrated Development Environment such as Eclipse or PyCharm IDE
• Using a web-hosted “sandbox” environment
DEVELOPMENT ENVIRONMENTS (II)
• Browser-based sandbox
DEVELOPMENT ENVIRONMENTS (III)
• Mac Terminal
DEVELOPMENT ENVIRONMENTS (IV)
Entering Python code:
Command line or Optional IDE

Python Integrated Development Environment


PYTHON SOFTWARE FOUNDATION AND
MATERIALS FOR THIS TUTORIAL
• Materials download: www.dartgo.org/pyii
• Material reference and basis, Python Software Foundation at Python.org:
https://docs.python.org/3/tutorial/
• Note about Python 2.x and Python 3.x:
• There are a variety of differences between the versions.
• Some include:
• print "hi world" in 2.x is now print("hi world") in 3.x
• Division with integers can now yield a floating point number
• In 2.x, 11/2=5, whereas in 3.x, 11/2=5.5
• More at https://wiki.python.org/moin/Python2orPython3
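The two differences listed above can be checked in any Python 3 interpreter. This is a minimal sketch; the variable names are illustrative and not part of the workshop materials.

```python
# Python 3: print is a function, so parentheses are required
print("hi world")

# "/" always performs true division in Python 3
true_div = 11 / 2      # 5.5

# "//" performs floor division, which matches the Python 2 result for integers
floor_div = 11 // 2    # 5

print(true_div, floor_div)
```

Using // makes the intent explicit when an integer result is wanted, whichever version runs the code.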
HANDS ON PRACTICE:
GETTING STARTED

• Preliminary Steps
• Download data from Dartgo link (www.dartgo.org/pyii)
• Get the dataset to either:
• A familiar location on your desktop (e.g. desktop/python-novice-inflammation/data)
• Or uploaded into the sandstorm sandbox web environment

• Opening Python
• Open your browser to https://oasis.sandstorm.io/ (create an account or sign in with an existing account)
• Or, open a terminal on your Mac or PC
HANDS ON PRACTICE:
GETTING STARTED

• Open a web browser


• Navigate to oasis.sandstorm.io
HANDS ON: DIVING IN
Using a Python sandbox, interpreter or IDE:

Note: after typing a line, press Alt+Enter to run the line and go to the next line

# this is a comment

textvar = 'hello world!'


print(textvar)

# This creates our first variable. It is a string or text variable.


#Next, we’ll define a variable that contains a numerical value:

numbervar = 5
print(numbervar)

Materials reference: https://docs.python.org/3/tutorial/


BASIC DATA STRUCTURES IN PYTHON: LISTS
# Create a list
# A list in Python is a basic sequence type


squares = [1, 4, 9, 16, 25]
print(squares[2])
# Basic list functions: retrieve a value, append, insert
print(squares[1])

squares.append(35) # add a value to end of list


print(squares)
squares[5] = 36 # ... and then fix our error, 6*6=36!
print(squares)
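The comment on this slide mentions inserting, which the code does not show. The sketch below illustrates insert, slicing and negative indexing on the same kind of list; the names first_three and last_value are illustrative only.

```python
squares = [1, 4, 9, 16, 25]

squares.insert(0, 0)        # insert the value 0 at position 0
first_three = squares[:3]   # slicing returns a new list
last_value = squares[-1]    # negative indexes count from the end

print(squares)       # [0, 1, 4, 9, 16, 25]
print(first_three)   # [0, 1, 4]
print(last_value)    # 25
print(len(squares))  # 6
```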
BASIC DATA STRUCTURES IN PYTHON: LISTS
WITH CONDITIONALS
This is where the sandbox environment, or an IDE, becomes very useful

# a basic conditional structure
if 0 == 0:
    print("true")

# used with a list element
if squares[1] == (2*2):
    print('correct!')
else:
    print('wrong!')

squares[:] = [] # clear out the list


LOOPING OVER A BASIC DATA STRUCTURE
# Loop over a data structure
berries = ['raspberry','blueberry','strawberry']
for i in berries:
    print("Today's pies: " + i)

# sort the structure and then loop over it
for i in sorted(berries):
    print("Today's pies(alphabetical): " + i)
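When the position of each item is also needed, the built-in enumerate function can supply a counter alongside the loop variable. A small sketch, assuming the same berries list; the menu list and its text format are made up for illustration.

```python
berries = ['raspberry', 'blueberry', 'strawberry']

menu = []
# enumerate yields (counter, item) pairs; start=1 begins the count at 1
for position, berry in enumerate(sorted(berries), start=1):
    line = str(position) + '. ' + berry + ' pie'
    menu.append(line)
    print(line)
```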
BASIC DATA STRUCTURES: TUPLES AND SETS
A “tuple” is a type of sequence that can contain a variety of data types
# Create a tuple
mytuple = ('Bill', 'Jackson', 'id', 5)
print(mytuple)

# Use indexing to access a tuple element. Note: tuple elements start counting at 0, not 1
mytuple[3]
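The slide title also names sets, and tuples differ from lists in that they cannot be changed after creation. The following sketch illustrates both points; berry_set and the changed flag are illustrative names, not part of the workshop materials.

```python
mytuple = ('Bill', 'Jackson', 'id', 5)

# Tuples are immutable: assigning to an element raises a TypeError
try:
    mytuple[3] = 6
    changed = True
except TypeError:
    changed = False
print(changed)                    # False

# A set keeps only unique values, so the repeated entry is stored once
berry_set = set(['raspberry', 'blueberry', 'raspberry'])
print(len(berry_set))             # 2
print('blueberry' in berry_set)   # True
```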
BASIC DATA STRUCTURES: DICTIONARIES
# Create a dictionary or look-up table
# The leading elements are known as "keys" and the trailing elements are known as "values"
lookuptable = {'Dave': 4076, 'Jen': 4327, 'Joanne': 4211}
lookuptable['Dave']
# show the keys
lookuptable.keys()
# show the values
lookuptable.values()
# check to see if an element exists
'Jen' in lookuptable
# output: True
BASIC DATA STRUCTURES: DICTIONARIES
Use the key for error-checking to see if a value exists
# check to see if an element exists
if 'Jen' in lookuptable:
    print("Jen's extension is: " + str(lookuptable['Jen']))
else:
    print("No telephone number listed")
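A similar check can be written with the dictionary's get method, which returns a default value when the key is missing. This is a sketch of that alternative, not the approach used on the slide; the key 'Sam' is a hypothetical name that is not in the table.

```python
lookuptable = {'Dave': 4076, 'Jen': 4327, 'Joanne': 4211}

# get() returns the stored value, or the second argument if the key is absent
jen_ext = lookuptable.get('Jen', 'No telephone number listed')
sam_ext = lookuptable.get('Sam', 'No telephone number listed')  # 'Sam' is hypothetical

print(jen_ext)   # 4327
print(sam_ext)   # No telephone number listed
```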
DATA STRUCTURES: LOOPING

# Loop over a dictionary data structure
# print the whole dictionary
# (in Python 3, items() replaces the Python 2 iteritems())
for i, j in lookuptable.items():
    print(i, j)
WHILE LOOPS AND LOOP COUNTERS
• Use a “while” loop to generate a Fibonacci series

a, b = 0, 1
i = 0
fibonacci = '0'  # the series starts at 0
while i < 7:
    print(b)
    fibonacci = fibonacci + ', ' + str(b)
    a, b = b, a + b  # update both values at once
    i = i + 1  # increment the loop counter
print(fibonacci)
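A Fibonacci series can also be collected in a list rather than a string, which makes the values easier to reuse. The sketch below wraps a while loop in a function; the name fibonacci_list and its count parameter are illustrative.

```python
def fibonacci_list(count):
    """Return the first `count` Fibonacci numbers, starting from 0."""
    series = []
    a, b = 0, 1
    while len(series) < count:
        series.append(a)
        # the right-hand side is evaluated first, then both names are updated
        a, b = b, a + b
    return series

result = fibonacci_list(8)
print(result)  # [0, 1, 1, 2, 3, 5, 8, 13]
```

Updating a and b in a single tuple assignment matters here: assigning a = b first and then b = a + b would use the already-changed a and double b on every pass.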
IMPORTING AND USING MODULES
Modules greatly extend the power and functionality of Python,
much like libraries in R, JavaScript and other languages
import sys
# check the version of Python that is installed
sys.version
# in this sandbox: '3.4.2 (default, Oct 8 2014, 10:45:20) \n[GCC 4.9.1]'
# check the working directory
import os
os.getcwd()
# '/var/home'
# This is less applicable in the sandbox; on a laptop or a Linux server it is
# essential to know the working directory
IMPORTING AND USING MODULES
# multiply some consecutive numbers
1*2*3*4*5*6*7
5040

# save time and labor by using modules effectively


import math
math.factorial(7)
MODULES
# Modules
from math import pi
print(pi)
round(pi)
round(pi,5)
DEFINING A FUNCTION IN PYTHON
Functions save time by storing repeatable processes
Defining a function is easy:
use the 'def' keyword in Python
def xsquared(x):
    # find the square of x
    x2 = x * x
    # the 'return' statement returns the function value
    return x2

# call the function
y = xsquared(5)
print(str(y))
# Output: 25
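Functions can also take more than one parameter, and a parameter can be given a default value. A minimal sketch, assuming a hypothetical function named power that generalizes the squaring example:

```python
def power(x, exponent=2):
    """Raise x to exponent; exponent defaults to 2 (squaring)."""
    return x ** exponent

squared = power(5)      # uses the default exponent
cubed = power(5, 3)     # overrides the default

print(squared)  # 25
print(cubed)    # 125
```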
WITH AND FOR COMMANDS
We’ll use the WITH and FOR commands to help us read in and
loop over the rows in a CSV file; here’s some pseudo-code of
what we’d like to do:

WITH open(file.extension) AS fileobject:
    {get data in file}
    FOR rows in file:
        {do something with data elements in the rows}
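One way the pseudo-code above might look as runnable Python is sketched below. It writes a tiny sample file first so that it runs anywhere; the file name sample.csv, its contents and the row_sums list are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Create a small sample CSV so the example is self-contained
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', newline='') as fileobject:
    fileobject.write('0,1,2\n3,4,5\n')

row_sums = []
# "with" closes the file automatically when the block ends
with open(sample_path, newline='') as fileobject:
    reader = csv.reader(fileobject)
    # each row is a list of strings, one per column
    for row in reader:
        row_sums.append(sum(int(value) for value in row))

print(row_sums)  # [3, 12]
```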
UPLOAD DATA
• To upload data into the hosted Python instance, click the “jupyter” title to go back to the upload screen
• Use the “Files” tab to upload
• Upload > Browse
• The hosted environment supports the upload of reasonably-sized csv files
DATA ANALYSIS – INFLAMMATION
DATASET
• Next, let’s examine a dataset of patients (rows) and forty days of
inflammation values (columns)
import os
os.listdir()
f = open('inflammation-01.csv')
filecontent = f.read()
f.close()
print(filecontent)

# load with numpy
import numpy
numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') # load csv

# load into a variable
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') # load csv to variable

print(data)
print(type(data))
print(data.dtype)
print(data.shape)
DATA ANALYSIS – INFLAMMATION
DATASET
• View data elements with matrix addressing

print('first value in data:', data [0,0])

print(data[30,20])

maxval = numpy.max(data)
print('maximum inflammation: ', maxval)

stdval = numpy.std(data)
print( 'standard deviation: ', stdval)
DATA ANALYSIS – INFLAMMATION
DATASET
import matplotlib.pyplot
%matplotlib inline
image = matplotlib.pyplot.imshow(data)
DATA ANALYSIS – INFLAMMATION
DATASET
ave_inflammation = numpy.mean(data, axis=0)

ave_plot = matplotlib.pyplot.plot(ave_inflammation)

matplotlib.pyplot.show()
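The axis argument used above controls which direction a statistic is computed along. The sketch below uses a small made-up array in place of the inflammation data (3 patients as rows, 4 days as columns) to show the difference between axis=0 and axis=1; the array values and variable names are assumptions for illustration.

```python
import numpy

# Hypothetical stand-in for the inflammation data: 3 patients (rows), 4 days (columns)
sample = numpy.array([[0.0, 1.0, 2.0, 1.0],
                      [0.0, 2.0, 4.0, 2.0],
                      [0.0, 3.0, 6.0, 3.0]])

max_per_day = numpy.max(sample, axis=0)         # one value per column (day)
mean_per_patient = numpy.mean(sample, axis=1)   # one value per row (patient)

print(sample.shape)        # (3, 4)
print(max_per_day)
print(mean_per_patient)
```

With axis=0 the result has one entry per day, which is why the plot on this slide shows the average across patients for each day.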
SCRIPTS AND PARAMETERS
• Use an IDE or friendly text-editor

#!/usr/bin/python
#--------------------------------
# my first script!

import sys
print('My first script!')
print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))
#--------------------------------
READING MULTIPLE FILES
• Programming for speed, reusability
• Data analysis over many files

strfiles = ['inflammation-01.csv','inflammation-02.csv']
for f in strfiles:
    print(f)
    #data = numpy.loadtxt(fname=f, delimiter=',')
    #print('mean ',f, numpy.mean(data, axis=0))

Got lots of files?


This is where RC systems like Polaris or
Discovery can be very useful
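When there are many files, typing each name into a list becomes impractical. One possible approach, sketched here with the standard glob and csv modules rather than numpy, is to match files by a name pattern. The two tiny files written at the start are made-up stand-ins so the sketch runs anywhere.

```python
import csv
import glob
import os
import tempfile

# Write two tiny stand-in files (hypothetical contents) into a temporary folder
folder = tempfile.mkdtemp()
for name, text in [('inflammation-01.csv', '0,2\n4,6\n'),
                   ('inflammation-02.csv', '1,1\n1,1\n')]:
    with open(os.path.join(folder, name), 'w', newline='') as f:
        f.write(text)

file_means = {}
# glob returns every path that matches the pattern; sorted gives a stable order
for path in sorted(glob.glob(os.path.join(folder, 'inflammation-*.csv'))):
    with open(path, newline='') as f:
        values = [float(v) for row in csv.reader(f) for v in row]
    file_means[os.path.basename(path)] = sum(values) / len(values)

print(file_means)
```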
WRITE TO CSV!

import csv

with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})
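To confirm what was written, the file can be read back with csv.DictReader, which uses the header row as the dictionary keys. This is a self-contained sketch that writes one row to a temporary location (an assumption, so it does not touch the working directory) and reads it back.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'names.csv')
fieldnames = ['first_name', 'last_name']

# Write a header and one row
with open(path, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})

# Read the rows back; each row maps column name to value
with open(path, newline='') as csvfile:
    rows = list(csv.DictReader(csvfile))

print(rows[0]['first_name'], rows[0]['last_name'])  # Baked Beans
```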


CSV HEADER ROW AND FIRST DATA ROW
• Read first rows:

import csv
with open('inflammation-01.csv') as f:
    reader2 = csv.reader(f)
    row1 = next(reader2) # gets the first line
    row2 = next(reader2) # gets the second line
print("CSV column headers:" + str(row1))
print("CSV first line: " + str(row2))
SCRIPTS AND PARAMETERS
• Use an IDE or friendly text-editor
#!/usr/bin/python
#--------------------------------
# program name: python_add_parameters.py

import sys
i = 1  # sys.argv[0] is the script name, so start at the first argument
total = 0
while i < len(sys.argv):
    total = total + int(sys.argv[i])
    i = i + 1
print('sum: ' + str(total))

print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))
#--------------------------------
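A script like the one above stops with an error if any argument is not a whole number. The sketch below shows one possible way to guard against that; the function name sum_arguments and the sample argument list are illustrative, not part of the workshop script.

```python
def sum_arguments(argv):
    """Add up the numeric command-line arguments, skipping the script name."""
    total = 0
    for text in argv[1:]:          # argv[0] is the script name
        try:
            total += int(text)
        except ValueError:
            print('Skipping non-numeric argument: ' + text)
    return total

# Example call with a hand-written argument list in place of sys.argv
example_total = sum_arguments(['python_add_parameters.py', '2', '3', 'x'])
print('sum: ' + str(example_total))  # sum: 5
```

In a real script the call would be sum_arguments(sys.argv) after import sys.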
CSV LIBRARY
• The csv library is built in to Python
import csv
with open('inflammation-01.csv') as f:
    reader2 = csv.reader(f)
    row1 = next(reader2)
    print(str(row1))
• Output: ['0', '0', '1', '3', '1', '2', '4', '7', '8', '3', '3', '3'….
IMPORTING A DATASET IN TO PYTHON:
USING THE OS AND CSV MODULES
Find out where you are in the directory structure by importing the operating system library (os)
# Reference: https://docs.python.org/2/library/csv.html section 13.1
import os
cwd = os.getcwd()
print("Working Directory is: " + cwd)
os.chdir('c:\\temp')
os.getcwd()

Import the CSV file into a reader object

# Download the CSV and copy it to the working directory
# Note: the CSV module's reader and writer objects read and write sequences
import csv
with open('HawaiiEmergencyShelters.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['NAME'], row['ADDRESS'])
STATISTICS FROM CSV COLUMNS
# Loop through column, find average
import csv
with open('HawaiiEmergencyShelters.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    x_sum = 0
    x_length = 0
    for row in reader:
        try:
            x = row['NUMCOTS']
            x_sum += int(x)
            x_length += 1
        except ValueError:
            print("Error converting: {0:s}".format(x))
    x_average = x_sum / x_length
    print('Average: ')
    print(x_average)
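The same column-averaging pattern can be tried without the shelter file by reading from an in-memory string. In this sketch the sample rows are invented; only the NUMCOTS column name comes from the slide above. It also guards against dividing by zero when no value converts.

```python
import csv
import io

# Hypothetical sample rows; only the NUMCOTS column name comes from the slide
sample = io.StringIO('NAME,NUMCOTS\nShelter A,100\nShelter B,n/a\nShelter C,50\n')

cots = []
skipped = 0
for row in csv.DictReader(sample):
    try:
        cots.append(int(row['NUMCOTS']))
    except ValueError:
        skipped += 1          # count values that are not whole numbers

# Avoid dividing by zero when every value was skipped
average = sum(cots) / len(cots) if cots else None
print(average, skipped)  # 75.0 1
```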
NUMERICAL FUNCTIONS
# Float and Int
x = 3.453
xint = int(x)      # 3
yfloat = float(2)  # 2.0
xround = round(x)  # 3
INSTALLING NUMPY FOR PYTHON 2.7
“Numpy” is a helper module in Python for numerical processing

To get the NumPy installer:

Mac - https://sourceforge.net/projects/numpy/files/NumPy/1.8.0/numpy-1.8.0-py2.7-python.org-macosx10.6.dmg/download

PC - https://sourceforge.net/projects/numpy/files/NumPy/1.8.0/

On a Mac, click on the dmg file. You may need to change the Mac security preference (Sys Pref > Security > ) to allow the DMG installer to run
STATISTICAL OPERATIONS
NUMPY FOR PYTHON 2.7
# Reference: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html

numpy.median
numpy.average
numpy.std
numpy.var
numpy.corrcoef (Pearson product-moment correlation)
numpy.correlate
numpy.cov (estimate of covariance matrix)
numpy.histogram
numpy.amin
numpy.amax
numpy.percentile
SAVING PYTHON SCRIPTS
• Python files can be written
in a simple text editor, or
using an IDE editor.
• The file extension is .py
A MODULE FOR BASIC STATISTICAL ANALYSIS:
USING THE NUMPY LIBRARY

# importing the library
>>> import numpy

# running basic functions (numpy.mean takes a list or array, not separate numbers)
>>> numpy.mean([3,6,9])
6.0
>>> numpy.std([2,4,6,8])
2.2360679774997898

# Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html and
# https://docs.scipy.org/doc/numpy/reference/routines.statistics.html
THE OS MODULE: SOME USEFUL OS
COMMANDS
• More OS library tasks:
• os.path.realpath(path) canonical path
• os.path.dirname(path) directory
• os.getcwd() get working directory (as string)
• os.chdir(path) change the working directory
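The os tasks listed above can be combined to build and inspect file paths in a way that works on Mac, Windows and Linux. A short sketch follows; the data folder and file name are a hypothetical location, and the file does not need to exist for the path functions to work.

```python
import os

cwd = os.getcwd()                                   # current working directory
# os.path.join uses the correct separator for the operating system
data_path = os.path.join(cwd, 'data', 'inflammation-01.csv')  # hypothetical location

folder = os.path.dirname(data_path)                 # directory part of the path
filename = os.path.basename(data_path)              # file name part of the path

print(folder)
print(filename)
print(os.path.exists(data_path))                    # True only if the file is really there
```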
PYTHON ON DARTMOUTH RESEARCH
COMPUTING MACHINES
• Research Computing shared Linux resources include Polaris and Andes, as
well as the high-performance computing platform Discovery.
• These machines have several versions of Python installed, along with commonly used modules. Additional modules can be installed upon request
• Polaris currently has Python 2.6.6 as the default, and Numpy and Scipy
libraries are installed.
• Andes currently has Python 2.7.5 as the default, with Numpy, Scipy and the
Pandas modules installed. Pandas is another commonly used data analysis
library.
PYTHON SOFTWARE FOUNDATION AND
MATERIALS FOR THIS TUTORIAL
• Materials download: www.dartgo.org/workshopsg and download
IntroDataAnalysisPython
• Material reference and basis, Python Software Foundation at Python.org:
https://docs.python.org/2/tutorial/
RESOURCES & LINKS
• Research Computing
[email protected]
• http://rc.dartmouth.edu

• Python Foundation
• Online tutorials
• Web forums
• Stack overflow:
http://stackoverflow.com/questions/tagged/python
LEARNING MORE…
• Python Tutorials
• Python 2.7.13 https://docs.python.org/2/tutorial/
• Python 3.6 https://docs.python.org/3.6/tutorial/
• Numpy, Scipy tutorials
• https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
• http://cs231n.github.io/python-numpy-tutorial/
• Python CSV library tutorial
• https://docs.python.org/2/library/csv.html
• Lynda, Youtube Online tutorials
• Lynda, log in with Dartmouth credentials:
www.lynda.com/portal/dartmouth
• Search for Python Programming, Numpy, Scipy
QUESTIONS?
