
Data Wrangling & Visualization

Dr Tilottama Goswami
Professor
Department of Artificial Intelligence, Anurag University
tilottamagoswami.co.in
TEXT BOOK
Wes McKinney. Python for Data Analysis: Data Wrangling with pandas, NumPy and IPython, O'Reilly, 2017, 2nd Edition

TOTAL 10 CLASSES & 3 LABS


Agenda – Part 1 (Refer Ch5 from TB)
1. Series Data Structure
2. DataFrame Data Structure
3. Index Objects
4. Why PANDAS for Data Analysis

Projects for Assignments


Viva & Interview Questions
PANDAS DATA STRUCTURE
• SERIES
• A Series is a one-dimensional array-like object containing a
sequence of values

• DATA FRAME
• A DataFrame represents a rectangular table of data and
contains an ordered collection of columns, each of which can
be a different value type (numeric, string, boolean, etc.).
To install pandas, the command is:

pip install pandas


SERIES
DEFINITION
Create Series Data Structure with customized label/index name
SELECT VALUE(S) USING LABELS
Filter/ Apply Math Function
SERIES AS ORDERED DICTIONARY & Vice Versa
How to Override the Sorted Order in Dictionary
How to Check the Missing Data? Method 1

How to Check the Missing Data? Method 2

How to Check the Missing Data? Method 3


Arithmetic Operations with Series Data – Automatic Index alignment
Series & Index – NAME attribute
INDEX modification using Assignment
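A minimal sketch tying together the Series operations listed above; the student names and marks below are made-up illustration data, not taken from the slides:

import pandas as pd
import numpy as np

# Create a Series with customized labels/index names
marks = pd.Series([5, 3, 8, np.nan], index=['Ram', 'Abir', 'Anaya', 'Zoya'])

# Select value(s) using labels
print(marks['Ram'])              # single label
print(marks[['Ram', 'Anaya']])   # list of labels

# Filter / apply a math function
print(marks[marks > 4])          # boolean filtering
print(np.sqrt(marks))            # NumPy functions work element-wise

# Series as ordered dictionary and vice versa
d = marks.to_dict()              # Series -> dict
s = pd.Series(d)                 # dict -> Series
s = pd.Series(d, index=['Zoya', 'Anaya', 'Abir', 'Ram'])  # override the sorted order

# Check for missing data (three ways)
print(marks.isnull())
print(pd.isnull(marks))
print(marks.notnull())

# Arithmetic with automatic index alignment
bonus = pd.Series([1, 1], index=['Ram', 'Anaya'])
print(marks + bonus)             # labels missing on either side become NaN

# name attribute and index modification by assignment
marks.name = 'mid_marks'
marks.index.name = 'student'
marks.index = ['R1', 'R2', 'R3', 'R4']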
Q/A Part- A

• Q1. What do you mean by Pandas in Python?


• Q2. Name three data structures available in Pandas.
• Q3. What do you mean by Series in Python?
• Q4. Create an Example of a series containing names of students
• Q5. Write a program in Python to create series of vowels.
• Q6. T/F
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Q7. Write the output of the following:
import pandas as pd
S1 = pd.Series(12, index=[4, 6, 8])
print(S1)
Q/A – Part A

Q1. The pandas name itself is derived from panel data, an econometrics term for multidimensional structured datasets.

Q4.
import pandas as pd
S1 = pd.Series(["Ram", "Abir", "Anaya"])
print(S1)

Q5.
import pandas as pd
S1 = pd.Series(["a", "e", "i", "o", "u"])
print(S1)

Q6.
A Pandas Series is like a column in a table. (T)
It is a one-dimensional array holding data of any type. (T)
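Q7. (Answer not shown on the slide.) Passing a scalar with an index broadcasts the value 12 across every label, so the output is:

4    12
6    12
8    12
dtype: int64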
References
Text Books

1. Wes McKinney. Python for Data Analysis: Data Wrangling with pandas, NumPy and IPython, O'Reilly, 2017, 2nd Edition
2. Jacqueline Kazil and Katharine Jarmul, Data Wrangling with Python,
O'Reilly, 2016

VIVA QUESTIONS
https://csiplearninghub.com/pandas-series-class-12-ip-important-questions/

Exercises
https://www.w3schools.com/python/pandas/pandas_series.asp
https://towardsdatascience.com/a-practical-introduction-to-pandas-series-
9915521cdc69
DATA FRAME
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.).

The DataFrame has both a row and column index;

The DataFrame can be thought of as a dict of Series all sharing the same index.

The data is stored as one or more two-dimensional blocks rather than a list, dict, or some
other collection of one-dimensional arrays.
How to construct a Data Frame?
If a Data Frame is huge, how to view parts?
What happens if the column name is not in the dictionary?
Can we assign a column that doesn’t exist?
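A minimal sketch of these points, based on the textbook's state/year/pop example data (as shown below, columns not in the dict appear as NaN, and assigning to a new column name creates it):

import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2001, 2002],
        'pop': [1.5, 1.7, 2.4, 2.9]}
frame = pd.DataFrame(data)

frame.head()   # view the first rows of a huge DataFrame
frame.tail()   # view the last rows

# A column name that is not in the dict appears filled with NaN
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

# Assigning to a column that doesn't exist creates it
frame2['debt'] = 16.5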

Add a new column "eastern" of boolean values where the state column equals 'Ohio'.

New columns cannot be created with the frame2.eastern syntax.
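For example (continuing with the hypothetical frame2 defined above):

# Works: bracket indexing creates the new boolean column
frame2['eastern'] = frame2.state == 'Ohio'

# Does NOT create a column, it only sets an attribute on the object:
# frame2.eastern = frame2.state == 'Ohio'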


How to retrieve a Column from Data Frame?

Retrieved as Series

frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.
Retrieved as Attribute
How to retrieve a Row from Data Frame?

Rows can also be retrieved by position or name with the special loc attribute
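For example (again with the hypothetical frame2 above, which has a default integer index):

frame2['state']   # retrieve a column as a Series (works for any column name)
frame2.state      # attribute access; only for valid Python variable names

frame2.loc[1]     # retrieve a row by index label
frame2.iloc[1]    # retrieve a row by integer position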
How to modify column values ?

Assign a scalar value

Assign an array of values


What care must be taken to assign a series or an array to a column in a DataFrame?

I) When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame.

II) If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes.
The column returned from indexing a DataFrame is a view on the underlying data, not a
copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The
del keyword can then be used to remove this column.
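A short sketch of these rules (frame2 as above; the values are illustrative):

import numpy as np

frame2['debt'] = 16.5                        # scalar: broadcast to every row
frame2['debt'] = np.arange(4.)               # array: length must match the DataFrame

val = pd.Series([-1.2, -1.5], index=[0, 2])  # Series: aligned on index labels,
frame2['debt'] = val                         # missing labels become NaN

frame2['eastern'] = frame2.state == 'Ohio'
del frame2['eastern']                        # remove the column again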
Create DataFrame from nested dict of dicts
If the nested dict is passed to the DataFrame,
pandas will interpret
the outer dict keys as the columns
and
the inner keys as the row indices:

Transpose a DataFrame
values attribute returns the data contained in the DataFrame as a two-dimensional ndarray
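A sketch based on the textbook's nested-dict example:

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)   # outer keys -> columns, inner keys -> row index

frame3.T        # transpose (swap rows and columns)
frame3.values   # data as a two-dimensional ndarray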
INDEX OBJECTS
INDEX DEFINITION
Properties of Index objects

Index objects are immutable and thus can’t be modified by the user:

Unlike Python sets, a pandas Index can contain duplicate labels:

Selections with duplicate labels will select all occurrences of that label.
How to create index objects and assign them to Series
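A small sketch of these Index properties (the labels are chosen for illustration):

labels = pd.Index(['a', 'b', 'c'])
obj = pd.Series([1.5, -2.5, 0], index=labels)

# Index objects are immutable
# labels[1] = 'd'   # would raise TypeError

# Unlike Python sets, an Index can contain duplicate labels
dup = pd.Index(['foo', 'foo', 'bar'])
print(dup.is_unique)        # False

obj2 = pd.Series(range(3), index=dup)
print(obj2['foo'])          # selects all occurrences of the duplicated label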
WHY PANDAS FOR DATA ANALYSIS
Hence pandas is popularly used for Data Analysis
Exercise
• https://csiplearninghub.com/important-pandas-dataframe-questions-12-ip/
LAB PROJECT 1-W1,W2
(TWO TASKS )
PROJECT- Part 1
Attendance
MID 1 Marks

Residence and Gender
PROJECT- Part 1
All Mid Marks
Residence and Gender

Grp - Roll
G1: 1-10
G2: 11-20
G3: 21-30
G4: 31-40
G5: 41-50
G6: 51-60
---------
61-G1, 62-G2, 63-G3, 64-G4, 65-G5, 66-G6
PROJECT- Part 1
Residence and Gender
TASKS- PART 1 [10 Marks]
Write Roll number, Grp Number, Section, comments for
questions and upload the .ipynb in Google Classroom
1. Create Series for each column
2. Create Dictionary for each column and make a series from it
3. Customize the index to AU Roll numbers
4. Select the values of Mid greater than 4 / Select the values of assignment greater than 2
5. Select students from Villages / Select the DayScholars who are girls
6. Check if there is any missing data
7. Create a Data frame for the given input using arrays/ dictionaries
8. Add a Name Column for the given input
9. Create an index object of your choice and customize the data structure given to you
10. Check if the index has any duplicate values
Agenda –Part2 (Refer Ch6 from TB)
Data Loading, Storage and File Formats
1. Read Text File into DataFrame
2. Read Text Files in Pieces
3. Write Data to Text Format
4. Working with Delimited Formats
5. JSON Data

Projects for Assignments


Viva & Interview Questions
DATA LOADING
STORAGE
FILE FORMATS https://realpython.com/pandas-read-write-files/#write-a-csv-file
Pandas provides many options for reading data into a DataFrame.

XML and JSON are considered semi-structured file formats because both represent data in a hierarchical (tree-like) structure, whereas CSV and Excel files hold tabular data.

Accessing data is a necessary first step for using most of the tools in data analysis.

Data input and output using pandas is the focus here; numerous tools in other libraries also help with reading and writing data in various formats.

Input and output typically fall into a few main categories:


Reading text files and other more efficient on-disk formats
Loading data from databases
Interacting with network sources like web APIs.
Why around 50 parameters?
Real-world data is messy, so the reading functions take roughly 50 optional parameters to cope with it.
One key feature is type inference.
CONVERT TEXT DATA INTO DATAFRAME

Categories of optional arguments for the functions mentioned on the previous page
Hierarchical Data Formats
The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data.

Feather
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. Feather provides binary columnar serialization for data frames.

JavaScript Object Notation (JSON) – a computer data interchange format
Message Pack (MsgPack)
Example : Read .csv file
Contents

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
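A minimal sketch, assuming the contents above are saved as ex1.csv (the file name is illustrative):

import pandas as pd
df = pd.read_csv('ex1.csv')              # comma-delimited, first row used as header
# equivalently, with the more general function used in the textbook:
df = pd.read_table('ex1.csv', sep=',')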
Customize File Header = Column Names

Allow pandas to assign default column names

Customize the column Names


Customize the Index Column using another column (index_col option)
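For example (ex2.csv is assumed to be a headerless version of the file above):

# Let pandas assign default column names (0, 1, 2, ...)
pd.read_csv('ex2.csv', header=None)

# Supply your own column names
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ex2.csv', names=names)

# Use one of the columns as the row index
pd.read_csv('ex2.csv', names=names, index_col='message')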
Hierarchical Index from Multiple Columns
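For example (assuming a file csv_mindex.csv with columns key1 and key2, as in the textbook):

# Passing a list of column names to index_col builds a hierarchical (MultiIndex) index
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])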
How to read a table with a variable amount of whitespace as the delimiter? Use a REGULAR EXPRESSION
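For example (ex3.txt is an assumed file name):

# sep can be a regular expression; \s+ matches one or more whitespace characters
result = pd.read_table('ex3.txt', sep=r'\s+')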
SKIP ROWS WHILE READING A CSV FILE

Skip the first, third, and fourth rows of a file with skiprows
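For example (ex4.csv is an assumed file name):

pd.read_csv('ex4.csv', skiprows=[0, 2, 3])   # skip the first, third and fourth rows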
How to Handle Missing Values in a CSV file?

What is missing data?
1) Not present (empty string), ''
2) Marked by some sentinel value, e.g. NA

By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL.
The na_values option can take either a list or set of strings to consider as missing values.
Different NA sentinels can be specified for each column by passing a dictionary to na_values; a few more related options exist as well.
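A sketch of both forms (the file ex5.csv and its column names follow the textbook example and are assumptions here):

# Treat these strings as missing, in addition to the default sentinels
result = pd.read_csv('ex5.csv', na_values=['NULL'])

# Different sentinels per column, via a dict
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('ex5.csv', na_values=sentinels)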
Processing Large Text Files (e.g., 10000 rows x 5 columns)

1) Read in only a small piece of the file, or
2) Iterate through the file in smaller chunks.

One-time pandas settings (e.g., limiting how many rows are displayed) make the output easier to inspect.

TextParser is also equipped with a get_chunk method that enables you to read pieces of an arbitrary size.

The object returned by a chunked read is not a DataFrame but an iterator; to get the data, you need to iterate through this object.
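A sketch of both approaches (the file ex6.csv and its key column follow the textbook example and are assumptions here):

# 1) Read in only a small piece of the file
pd.read_csv('ex6.csv', nrows=5)

# 2) Iterate in chunks of 1000 rows; the result is a TextParser iterator
chunker = pd.read_csv('ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)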

Exercise
https://www.geeksforgeeks.org/how-to-load-a-massive-file-as-small-chunks-in-pandas/
TEXTPARSER
Writing Data to Text Format Using DataFrame's to_csv method

Delimiters: comma-separated by default, or another character such as |

Represent missing values by empty strings or other sentinel values.

*Use the cat or type command to display the file, depending on Linux/Windows.


Writing Text Data to the Console (sys.stdout) Using DataFrame's to_csv method

Delimiters: comma-separated by default, or another character such as |

Represent missing values by empty strings or other sentinel values.

Row and Column Labels Can Be Disabled


Choose Subset of Columns from a DataFrame

Write only a subset of the columns,


and in an order of your choosing
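A sketch of these to_csv options (ex5.csv and the column names a, b, c are assumptions carried over from the earlier example):

import sys
data = pd.read_csv('ex5.csv')

data.to_csv('out.csv')                                        # comma-separated by default
data.to_csv(sys.stdout, sep='|')                              # another delimiter, written to the console
data.to_csv(sys.stdout, na_rep='NULL')                        # represent missing values by a sentinel
data.to_csv(sys.stdout, index=False, header=False)            # disable row and column labels
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c']) # subset of columns, in chosen order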
Series also has a to_csv method
Use csv.Dialect

read_table() may fail when a file has one or more malformed lines. CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

reader = csv.reader(f, dialect=my_dialect)

# Individual format options can also be given directly as keyword arguments,
# without defining a dialect:
reader = csv.reader(f, delimiter='|')

To write delimited files manually, you can use csv.writer. It accepts an open,
writable file object and the same dialect and format options as csv.reader:

with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
JSON
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by
HTTP request between web browsers and other applications. It is a much more free-form data format than a
tabular text form like CSV.

•JSON stands for JavaScript Object Notation


•JSON is a lightweight data-interchange format
•JSON is plain text written in JavaScript object notation
•JSON is used to send data between computers
•JSON is language independent *
•JSON is a text format for storing and transporting data

The file type for JSON files is ".json"


The MIME type for JSON text is "application/json"
JSON syntax is derived from JavaScript object notation
syntax:
•Data is in name/value pairs
•Data is separated by commas
•Curly braces hold objects
•Square brackets hold arrays
JSON defines only two data structures: objects and arrays.

An object is a set of name-value pairs,


and
An array is a list of values.

JSON defines seven value types:


string, number, object, array, true, false, and null.

Application

Commonly used for transmitting data in web applications (e.g., sending some
data from the server to the client, so it can be displayed on a web page, or
vice versa)

JavaScript Object Notation (JSON) is unstructured, flexible, and readable


by humans. Basically, you can dump data into the database however it
comes, without having to adapt it to any specialized database language (like
SQL)
IMPORTANT: import json

1) Convert a JSON string to a Python dictionary: json.loads(); json.load() takes a file object and returns the JSON object
2) Convert a Python object to JSON: json.dumps()
3) Convert JSON into a Series or DataFrame: pandas.read_json(); convert a JSON object or list of objects to a DataFrame: pd.DataFrame(...)
4) Export data from a pandas DataFrame (df) to JSON: df.to_json() or df.to_json(orient='records')
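A minimal sketch of these conversions (the JSON string and DataFrame contents are made up for illustration):

import json
import pandas as pd

obj = '{"name": "Wes", "places_lived": ["United States", "Spain"]}'

result = json.loads(obj)            # JSON string  -> Python dict
back_to_json = json.dumps(result)   # Python object -> JSON string

# JSON file -> Series/DataFrame (assumes a suitable layout such as an array of records)
# data = pd.read_json('example.json')

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_json()                        # column-wise (the default orientation)
df.to_json(orient='records')        # record-wise (one object per row)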
CONVERT CSVDATAFRAME JSON IN TWO WAYS – COLUMNWISE & RECORDWISE
df=pd.read_csv(r"C:\Users\Tilottama\OneDrive\DataWrangling\Lab\tg1.csv",header=None)
tg1.csv df

df.to_json(r'C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1json.json') STORE JSON COLUMNWISE


tg1json.json

df.to_json(r'C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1Recordjson.json', orient="records")


STORE JSON RECORDWISE
tg1Recordjson.json
CONVERT JSON file DATAFRAME CSV
Convert JSON String to Python Form : json.loads()
json.load() takes a file object and returns the json object
import json
res=open(r'C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1json.json')
print(type(res))
ogdata = json.load(res)
print(ogdata)
print(type(ogdata))

<class '_io.TextIOWrapper'>
{'0': {'0': 1, '1': 2}, '1': {'0': ' Pushpa', '1': ' Flower'}, '2': {'0': ' H1', '1': ' H2'}, '3': {'0': ' HRN1',
'1': ' HRN2'}}
<class 'dict'>

import pandas as pd
dforg=pd.DataFrame(ogdata)
print(dforg)
print(type(dforg))
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSV.csv")
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSVNoHeader.csv",header=None)
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSVNoHeaderNoIndex.csv",header=None,index=None)
JSON LOAD() Vs LOADS()
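In short (tg1json.json is the file created above):

import json

# json.loads() parses a JSON *string*
d = json.loads('{"x": 1}')

# json.load() parses JSON from an open *file object*
with open('tg1json.json') as f:
    d = json.load(f)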
Further Reading
• http://www.datasciencelovers.com/tag/read-file/
Exercises
• Python JSON Exercise with Solution (pynative.com)
Q/A Part 2
Q1) List any 5 functions a Data Analyst should know, to read and save data in a
particular format
Q2) Why Pandas for Reading file formats?
Q3) What are the issues with dot notation – df.name? Instead we can use df['name']
Q4) What is a file format?
Q5) Why should a data scientist understand different file formats?
Q6) Compare and Contrast CSV and JSON file formats
Q7) Write JSON for:
Q8) a)Write a csv file for :

b) Write JSON file for the above table.


Answer 1: (not shown on the slide; typical examples are pd.read_csv(), df.to_csv(), pd.read_json(), df.to_json(), pd.read_excel())
Answer 2:
• Pandas provides many options for reading data into a DataFrame
• Pandas comes with 18 readers for different sources of data. They include readers for CSV,
JSON, Parquet files and ones that support reading from SQL databases or even HTML
documents.

https://gretel.ai/blog/a-guide-to-load-almost-anything-into-a-dataframe

Answer 3:
Issues with the dot notation
There are three issues with using dot notation. It doesn’t work in the following situations:
• When there are spaces in the column name, e.g. df.favorite food; instead use df['favorite food']
• When the column name is the same as a DataFrame method, e.g. df.count; use df['count']
• When the column name is stored in a variable, e.g.
  col = 'height'
  df[col]
Ans 4) A file format is a standard way in which information is encoded for storage in a file.
First, the file format specifies whether the file is a binary or ASCII file. Second, it shows
how the information is organized. For example, comma-separated values (CSV) file format
stores tabular data in plain text

Ans 5) Usually, the files you will come across will depend on the application you are
building. As a data scientist, you need to understand the underlying structure of various file
formats, their advantages and disadvantages. Unless you understand the underlying
structure of the data, you will not be able to explore it. Also, at times you need to make
decisions about how to store data. Choosing the optimal file format for storing data can
improve the performance of your models in data processing. For example, in an image
processing system, you need image files as input and output. So you will mostly see files in
jpeg, gif or png format.
Ans 6)
SR.NO | JSON | CSV
1. | JSON stands for JavaScript Object Notation. | CSV stands for Comma Separated Values.
2. | It is used as the syntax for storing and exchanging data. | It is a plain text format with a series of values separated by commas.
3. | A JSON file is saved with the extension .json. | A CSV file is saved with the extension .csv.
4. | It is more versatile. | It is less versatile.
5. | It is used for key-value storage and supports arrays and objects as values. | It is a standard for saving tabular information in a delimited text file.
6. | It mainly uses the JavaScript data types. | It does not have any data types.
7. | It is less secure. | It is more secure.
8. | It consumes more memory as compared to CSV. | It consumes less memory.
9. | It supports a lot of scalability in terms of adding and editing content. | It does not support a lot in terms of scalability.
10. | It is less compact as compared to a CSV file. | It is more compact than other file formats.
Ans 7) JSON
Data Wrangling & Visualization
PROJECT-2
Dr Tilottama Goswami
Professor
Department of Artificial Intelligence, Anurag University
tilottamagoswami.co.in
Agenda
PART 1
1. Read CSV File
2. Clean Data – Missing Values
3. Filter Data – Relevant information to be captured
4. Write CSV File
5. Convert the given CSV File to JSON File
Rules

• Sets of Tasks given each week
• According to Roll Number the student is assigned to the Task Set
• Each Week – assigned 10 marks
• Upload the code file to Google Classroom within the due date
• Viva will be conducted at the end of each Project Part

PROJECT GROUPS (Grp - Roll)
G1: 1-10
G2: 11-20
G3: 21-30
G4: 31-40
G5: 41-50
G6: 51-60
---------
61-G1, 62-G2, 63-G3, 64-G4, 65-G5, 66-G6
PROJECTS II – PART A

Project 1: Residence and Gender – RESIDENCE-SET-1.csv (Groups G1, G2)
Project 2: Residence and Gender – RESIDENCE-SET-2.csv (Groups G3, G4)
Project 3: Residence and Gender – RESIDENCE-SET-3.csv (Groups G5, G6)

Snapshot of the csv file. The file(s) are uploaded in Google Classroom.
TASKS- for All the Projects[1,2,3][10 Marks]
Write Roll number, Grp Number, Project number, Section, comments
for questions and upload the .ipynb in Google Classroom
PROJECT 1 : RESIDENCE-SET-1.csv
PROJECT 2 : RESIDENCE-SET-2.csv
PROJECT 3 : RESIDENCE-SET-3.csv
1. Read the csv file
2. Data Clean- Remove the rows with missing data
3. Store the clean data in CleanDataResidence.json file
4. Consider the clean data and create a csv files based on gender basis
5. Consider the clean data and create a csv file for girls students who are from villages
6. Find the count of girls and boys from villages, and save it in json file
PROJECTS II – PART B
All Mid Marks

Project 4: All Mid Marks – MIDMARKS-SET-1.csv (Groups G1, G2)
Project 5: All Mid Marks – MIDMARKS-SET-2.csv (Groups G3, G4)
Project 6: All Mid Marks – MIDMARKS-SET-3.csv (Groups G5, G6)

Snapshot of the csv file. The file(s) are uploaded in Google Classroom.
TASKS- for All the Projects[4,5,6][10 Marks]
Write Roll number, Grp Number, Project number, Section, comments
for questions and upload the .ipynb in Google Classroom
PROJECT 4 : MIDMARKS-SET-1.csv
PROJECT 5 : MIDMARKS-SET-2.csv
PROJECT 6 : MIDMARKS-SET-3.csv
1. Read the csv file
2. Data Clean- Remove the rows with missing data
3. Store the clean data in CleanDataResidence.json file
4. Consider the clean data and create csv files based on subject wise for all mids.
5. Find the number of students who got less than 10 in all subjects, and also more than 14 in all subjects, in Mid 1; save the result in MidDataAnalysis.json file

Count | CS1 | DM | DS | PP | JP | P&S
<10   |     |    |    |    |    |
>14   |     |    |    |    |    |
If you don’t reveal some insights soon, I’m
going to be forced to slice, dice and drill !!
