DWV Unit1
DWV Unit1
Dr Tilottama Goswami
Professor
Department of Artificial Intelligence, Anurag University
tilottamagoswami.co.in
TEXT BOOK
Wes McKinney. Python for Data Analysis: Data Wrangling with pandas,
NumPy and I Python, O'Reilly, 2017, 2nd Edition
• DATA FRAME
• A DataFrame represents a rectangular table of data and
contains an ordered collection of columns, each of which can
be a different value type (numeric, string, boolean, etc.).
To install pandas, command is
The pandas name itself is derived from panel data, an econometrics term for multidimensional
structured datasets
import pandas as pd
S1 = pd.Series([“Ram”,”Abir”,”Anaya”])
print(S1)
Q6
import pandas as pd
S1 = pd.Series(["a","e","i","o","u"])
print(S1)
Q7
A Pandas Series is like a column in a table. (T)
It is a one-dimensional array holding data of any type.(T)
References
Text Books
VIVA QUESTIONS
https://csiplearninghub.com/pandas-series-class-12-ip-important-questions/
Exercises
https://www.w3schools.com/python/pandas/pandas_series.asp
https://towardsdatascience.com/a-practical-introduction-to-pandas-series-
9915521cdc69
D A T
A F R
A M E
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.).
The DataFrame can be thought of as a dict of Series all sharing the same index.
The data is stored as one or more two-dimensional blocks rather than a list, dict, or some
other collection of one-dimensional arrays.
How to construct a Data Frame?
If a Data Frame is huge, how to view parts?
What happens if the column name is not in the dictionary?
Can we assign a column that doesn’t exist?
Retrieved as Series
Rows can also be retrieved by position or name with the special loc attribute
How to modify column values ?
Transpose a DataFrame
values attribute returns the data contained in the DataFrame as a two-dimensional ndarray
I
INDEX
OBJECTS
N
D
E
X
INDEX DEFINITION
Properties of Index objects
Index objects are immutable and thus can’t be modified by the user:
Selections with duplicate labels will select all occurrences of that label.
How to create index objects and assign them to Series
WHY PANDAS FOR DATA ANALYSIS
Hence PANDAS are popularly used for Data Analysis
Exercise
• https://csiplearninghub.com/important-pandas-dataframe-questions-12-ip/
LAB PROJECT 1-W1,W2
(TWO TASKS )
PROJECT- Part 1
Attendance
MID 1 Marks
Residence
and
Gender
PROJECT- Part 1
All Mid Marks
Grp- Roll
G1. 1-10
G2. 11-20
G3. 21-30
G4. 31-40
G5. 41-50
G6. 51-60
---------
61-G1
62-G2
Residence and 63-G3
Gender 64-G4
65-G5
66-G6
PROJECT- Part 1
Residence
and
Gender
TASKS- PART 1 [10 Marks]
Write Roll number, Grp Number, Section, comments for
questions and upload the .ipynb in Google Classroom
1. Create Series for each column
2. Create Dictionary for each column and make a series from it
3. Customize the index to AU Roll numbers
4. Select the values of Mid greater than 4 / Select the values of assignment greater than 2
5. Select students from Villages / Select the DayScholars who are girls
6. Check if there is any missing data
7. Create a Data frame for the given input using arrays/ dictionaries
8. Add a Name Column for the given input
9. Create an index object of your choice and customize the data structure given to you
10. Check if the index has any duplicate values
Agenda –Part2 (Refer Ch6 from TB)
Data Loading, Storage and File Formats
1. Read Text File into DataFrame
2. Read Text Files in Pieces
3. Write Data to Text Format
4. Working with Delimited Formats
5. JSON Data
Data stored in XML and JSON documents, CSV files, and Excel files is all unstructured. XML and JSON are also considered file
formats that represent semi-structured data, because both of them represent data in a hierarchical (tree-like) structure
Accessing data is a necessary first step for using most of the tools in data analysis.
50 parameters
But WHY ?
Type Inference
CONVERT TEXT DATA INTO DATAFRAME
Categories for Optional Arguments for the functions mentioned in the last page
Feather
Hierarchical Data Formats is a fast, lightweight, and easy-
The Hierarchical Data Format version 5 (HDF5), to-use binary file format for
is an open source file format that supports large, storing data frames. Feather
complex, heterogeneous data provides binary columnar
serialization for data frames
Java Script Object Notation (JSON) computer data interchange format Message Pack (MsgPack)
Example : Read .csv file
Contents
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Customize File Header = Column Names
Skip the first, third, and fourth rows of a file with skiprows
How to Handle Missing Values in a csv file?
Dictionary
Dictionary
Few more are there..
Processing Large Text Files 10000 rows x 5 columns
Exercise
https://www.geeksforgeeks.org/how-to-load-a-massive-file-as-small-chunks-in-pandas/
T
E
X
T
P
A
R
S
E
R
Writing Data to Text Format Using DataFrame’s to_csv method
read_table()may fail in case a file with one or more malinformed lines are present. CSV files come in many different
flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a
simple subclass of csv.Dialect
To write delimited files manually, you can use csv.writer. It accepts an open,
writable file object and the same dialect and format options as csv.reader:
Application
Commonly used for transmitting data in web applications (e.g., sending some
data from the server to the client, so it can be displayed on a web page, or
vice versa)
<class '_io.TextIOWrapper'>
{'0': {'0': 1, '1': 2}, '1': {'0': ' Pushpa', '1': ' Flower'}, '2': {'0': ' H1', '1': ' H2'}, '3': {'0': ' HRN1',
'1': ' HRN2'}}
<class 'dict'>
import pandas as pd
dforg=pd.DataFrame(ogdata)
print(dforg)
print(type(dforg))
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSV.csv")
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSVNoHeader.csv",header=None)
dforg.to_csv(r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\tg1backtoCSVNoHeaderNoIndex.csv",header=None,index=None)
JSON LOAD() Vs LOADS()
Further Reading
• http://www.datasciencelovers.com/tag/read-file/
Exercises
• Python JSON Exercise with Solution (pynative.com)
Q/A Part 2
Q1) List any 5 functions a Data Analyst should know, to read and save data in a
particular format
Q2) Why Pandas for Reading file formats?
Q3) Whats are the Issues with dot notation – df.name? Instead we can use df[‘name’]
Q4) What is a file format?
Q5) Why should a data scientist understand different file formats?
Q6) Compare and Contrast CSV and JSON file formats
Q7) Write JSON for:
Q8) a)Write a csv file for :
https://gretel.ai/blog/a-guide-to-load-almost-anything-into-a-dataframe
Answer 3:
Issues with the dot notation
There are three issues with using dot notation. It doesn’t work in the following situations:
•When there are spaces in the column name, eg df.favorite food
insead use df['favorite food']
•When the column name is the same as a DataFrame method, eg df.count, use df['count']
•When the column name is a variable
Eg > col = 'height'
> df[col]
Ans 4) A file format is a standard way in which information is encoded for storage in a file.
First, the file format specifies whether the file is a binary or ASCII file. Second, it shows
how the information is organized. For example, comma-separated values (CSV) file format
stores tabular data in plain text
Ans 5) Usually, the files you will come across will depend on the application you are
building. As a data scientist, you need to understand the underlying structure of various file
formats, their advantages and dis-advantages. Unless you understand the underlying
structure of the data, you will not be able to explore it. Also, at times you need to make
decisions about how to store data. Choosing the optimal file format for storing data can
improve the performance of your models in data processing. For example, in an image
processing system, you need image files as input and output. So you will mostly see files in
jpeg, gif or png format.
SR.NO JSON CSV
Ans 6
1. JSON stands for JavaScript Object Notation. CSV stands for Comma separated value.
It is used as the syntax for storing and exchanging the It is a plain text format with a series of
2.
data. values separated by commas.
3. JSON file saved with extension as .json. CSV file saved with extension as .csv.
It is used for for key, value store and supports arrays, It is a standard of saving tabular
5.
objects as values. information into a delimited text file.
6. It mainly uses the JavaScript data types. It does not have any data types.
It support a lot of scalability in terms of adding and It does not support a lot in terms of
9.
editing the content. scalability.
It is more compact than other file
10. It is less compact as compared to CSV file .
formats
Ans 7) JSON
Data Wrangling & Visualization
PROJECT-2
Dr Tilottama Goswami
Professor
Department of Artificial Intelligence, Anurag University
tilottamagoswami.co.in
Agenda
PART 1
1. Read CSV File
2. Clean Data – Missing Values
3. Filter Data – Relevant information to be captured
4. Write CSV File
5. Convert the given CSV File to JSON File
Rules
PROJECT
GROUPS
Grp- Roll
• Sets of Tasks given each week G1. 1-10
G2. 11-20
• According to Roll Number the student is assigned to the Task Set G3. 21-30
G4. 31-40
• Each Week – assigned 10 marks. G5. 41-50
• Upload the code file to google classroom within the due date G6. 51-60
---------
• Viva will be conducted at the end of each Project Part 61-G1
62-G2
63-G3
64-G4
65-G5
66-G6
PROJECTS II – PART A
Project1: Residence and Gender Project2 : Residence and Gender Project3 : Residence and Gender
RESIDENCE-SET-1.csv RESIDENCE-SET-2.csv RESIDENCE-SET-3.csv
G1,G2
PROJECT 1 G3,G4
PROJECT 2
G5,G6
PROJECT 3
Snapshot of the csv file. The File(s) are uploaded in GoogleClassroom
TASKS- for All the Projects[1,2,3][10 Marks]
Write Roll number, Grp Number, Project number, Section, comments
for questions and upload the .ipynb in Google Classroom
PROJECT 1 : RESIDENCE-SET-1.csv
PROJECT 2 : RESIDENCE-SET-2.csv
PROJECT 3 : RESIDENCE-SET-3.csv
1. Read the csv file
2. Data Clean- Remove the rows with missing data
3. Store the clean data in CleanDataResidence.json file
4. Consider the clean data and create a csv files based on gender basis
5. Consider the clean data and create a csv file for girls students who are from villages
6. Find the count of girls and boys from villages, and save it in json file
PROJECTS II – PART B
Project4: All Mid Marks Project5 : All Mid Marks Project6 : All Mid Marks
RESIDENCE-SET-1.csv RESIDENCE-SET-2.csv RESIDENCE-SET-3.csv
G1,G2
PROJECT 1 G3,G4
PROJECT 2
G5,G6
PROJECT 3
Snapshot of the csv file. The File(s) are uploaded in GoogleClassroom
TASKS- for All the Projects[4,5,6][10 Marks]
Write Roll number, Grp Number, Project number, Section, comments
for questions and upload the .ipynb in Google Classroom
PROJECT 1 : MIDMARKS-SET-1.csv
PROJECT 2 : MIDMARKS-SET-2.csv
PROJECT 3 : MIDMARKS-SET-3.csv
1. Read the csv file
2. Data Clean- Remove the rows with missing data
3. Store the clean data in CleanDataResidence.json file
4. Consider the clean data and create csv files based on subject wise for all mids.
5. Find the number of students who got less than 10 in all subjects and also >14 in all subjects in Mid1 ,
and save it in MidDataAnalysis.json file
Count CS1 DM DS PP JP P&S
<10
>14
If you don’t reveal some insights soon, I’m
going to be forced to slice, dice and drill !!