
CH-6 Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

These parsing functions (such as pandas.read_csv) take optional arguments that fall into a few categories:
● Indexing: Can treat one or more columns as the returned DataFrame's index, and controls whether to get column names from the file, from the user, or not at all.
● Type inference and data conversion: Includes user-defined value conversions and custom lists of missing value markers.
● Datetime parsing: Includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
● Iterating: Support for iterating over chunks of very large files.
● Unclean data issues: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Opening a CSV file

In UNIX
A file will not always have a header row

Opening a comma-delimited file

Opening a file using sep

Assigning column names


Assigning index and column name

Hierarchical index for multiple columns

When there is no fixed separator or white space


Using sep="\s+" (a regular expression)

Skipping rows

NaN and NULL values


Using na_values
Reading text files in pieces
Setting the maximum displayed rows: pd.options.display.max_rows = 10 sets the maximum number of rows displayed when printing a DataFrame to 10. This is a pandas display option that helps control the amount of output shown; only the first five and the last five rows of the DataFrame will be shown.

Output

To read only a small number of rows


Specifying chunksize as the number of the rows

Explanation:

● pd.read_csv with chunksize: The pd.read_csv function reads the CSV file ('ex6.csv') in chunks. The chunksize parameter is set to 1000, so the file is read 1,000 rows at a time.
● TextFileReader object: The result of this operation is a TextFileReader object (<pandas.io.parsers.TextFileReader at 0x7f6b1e2672e8>). This object is an iterator that lets you iterate over the file in chunks with a for loop.
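For example, a minimal sketch of chunked reading (the file name 'ex6.csv' and its 'key' column follow the book's example and are assumptions here):

import pandas as pd

# Read the file 1,000 rows at a time; chunker is an iterator of DataFrames.
chunker = pd.read_csv('ex6.csv', chunksize=1000)

# Aggregate value counts of the 'key' column across all chunks.
tot = pd.Series([], dtype=float)
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)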

Iterating output of above file


Writing data to text format

Opening a csv file

Copying this csv file ‘data’ to another csv file ‘out.csv’

Writing data using separator

● Other delimiters can be used, of course (writing to sys.stdout so it prints the text
result to the console):

Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:
Both index and header are disabled

Using column names

Using the to_csv method of a Series


Working with delimited formats

To display the above file as a dictionary

To define a file with a different delimiter, string quoting convention, or line terminator, define a subclass of csv.Dialect:

We can also give individual CSV dialect parameters as keywords to csv.reader without having to define a subclass:

To write delimited files manually, you can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:
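A short sketch of both patterns (the file names are placeholders):

import csv

class MyDialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# Reading with the custom dialect (or pass the options as keywords instead).
with open('mydata.csv') as f:
    rows = list(csv.reader(f, dialect=MyDialect))

# Writing a delimited file manually with the same options.
with open('mydata_out.csv', 'w') as f:
    writer = csv.writer(f, dialect=MyDialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))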
BINARY DATA FORMATS

Reading Microsoft Excel files

Opening the Excel file by passing the file name directly.

Writing to an Excel file by creating an ExcelWriter and using the to_excel method:

Writing to an Excel file without an ExcelWriter:
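A sketch of both approaches (assumes a file 'ex1.xlsx' exists and an Excel engine such as openpyxl is installed):

import pandas as pd

# Reading: pass the file name directly.
frame = pd.read_excel('ex1.xlsx', sheet_name='Sheet1')

# Writing with an explicit ExcelWriter (useful for multiple sheets).
with pd.ExcelWriter('ex2.xlsx') as writer:
    frame.to_excel(writer, sheet_name='Sheet1')

# Writing without an ExcelWriter: pass the path to to_excel directly.
frame.to_excel('ex2.xlsx')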

INTERACTING WITH WEB APIs


● The requests.get(url) method sends a GET request to the specified URL
('https://api.github.com/repos/pandas-dev/pandas/issues'). The response is
stored in the resp variable.
● Printing resp shows the HTTP response object, which includes a status code. In
this case, <Response [200]> indicates a successful response with status code
200.

● The json() method of the response object is used to parse the JSON content of
the response into native Python objects. In this case, it returns a list of
dictionaries, where each dictionary represents an issue on the GitHub repository.

● Printing the title of the first issue (index 0) in the data list

Creating a DataFrame
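The whole sequence, sketched (requests is a third-party package; the GitHub API's response fields can change over time):

import pandas as pd
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
print(resp)                  # <Response [200]> on success

data = resp.json()           # list of dicts, one per issue
print(data[0]['title'])

# Keep a few fields of interest in a DataFrame.
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])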


INTERACTING WITH THE DATABASE
SQLite database using Python’s built-in sqlite3 driver
You can pass the list of tuples to the DataFrame constructor, but you also need the
column names, contained in the cursor’s description attribute:
The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of
the common differences between SQL databases. pandas has a read_sql function that
enables you to read data easily from a general SQLAlchemy connection. Here, we’ll
connect to the same SQLite database with SQLAlchemy and read data from the table
created before:
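A minimal round-trip sketch (the database file and test table are made up for illustration):

import sqlite3
import pandas as pd
import sqlalchemy as sqla

con = sqlite3.connect('mydata.sqlite')
con.execute('CREATE TABLE IF NOT EXISTS test (a VARCHAR(20), b REAL)')
con.executemany('INSERT INTO test VALUES (?, ?)',
                [('Atlanta', 1.25), ('Tallahassee', 2.6)])
con.commit()

cursor = con.execute('SELECT * FROM test')
rows = cursor.fetchall()
# Column names live in the cursor's description attribute.
frame = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

# The same read via SQLAlchemy and pandas.read_sql:
db = sqla.create_engine('sqlite:///mydata.sqlite')
frame2 = pd.read_sql('SELECT * FROM test', db)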

Ch-7 Data Cleaning and Preparation


Handling Missing Data
● Missing data occurs commonly in many data analysis applications. One of the
goals of pandas is to make working with missing data as painless as possible.
For example, all of the descriptive statistics on pandas objects exclude missing
data by default.

● The built-in Python None value is also treated as NA in object arrays:


Filtering Out Missing Data
● Using dropna, isnull, and boolean indexing

● In a DataFrame: dropna by default drops any row containing a missing value

● A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument (see the sketch below)
Filling In Missing Data
● Using the fillna method (fillna creates a copy and does not change the original data)
● Filling the mean value via the fillna method
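A compact sketch of the dropna/fillna patterns above (the data is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0], [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]])

df.dropna()                    # drop any row containing a missing value
df.dropna(how='all')           # drop only rows that are all NA
df.dropna(thresh=2)            # keep rows with at least 2 observations

filled = df.fillna(0)          # returns a new object; original unchanged
filled = df.fillna(df.mean())  # fill each column with its column mean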
Data Transformation
● Removing Duplicates
➢ Searching for duplicate data

➢ drop_duplicates returns a DataFrame where the duplicated array is False

➢ Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column

➢ duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:
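A sketch with toy data:

import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data.duplicated()            # boolean Series marking duplicate rows
data.drop_duplicates()       # drop fully duplicated rows

data['v1'] = range(7)
data.drop_duplicates(['k1'])                     # detect on 'k1' only
data.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last duplicate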

● Transforming Data Using a Function or Mapping

➢ To add the following column

➢ The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:
➢ We could also have passed a function that does all the work (see the sketch below)
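The book's meat-to-animal mapping, sketched (the dict is abbreviated):

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'Bacon', 'Pastrami'],
                     'ounces': [4, 3, 12, 6]})
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow'}

# Normalize the case first, then map through the dict.
data['animal'] = data['food'].str.lower().map(meat_to_animal)

# Or pass a function that does all the work:
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])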

● Replacing Values
➢ Filling in missing data with the fillna method is a special case of more
general value replacement.

➢ The -999 values might be sentinel values for missing data. To replace
these with NA values that pandas understands, we can use replace,
producing a new Series (unless you pass inplace=True)

➢ If you want to replace multiple values at once, you instead pass a list and
then the substitute value:
➢ To use a different replacement for each value, pass a list of substitutes:

➢ The argument passed can also be a dict:
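All four call styles, sketched:

import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

data.replace(-999, np.nan)                # single value
data.replace([-999, -1000], np.nan)       # list, one substitute
data.replace([-999, -1000], [np.nan, 0])  # a substitute per value
data.replace({-999: np.nan, -1000: 0})    # dict form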

● Renaming Axis Indexes

➢ Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure.
➢ Like a Series, the axis indexes have a map method:

➢ You can assign to index, modifying the DataFrame in-place:

➢ If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

➢ rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels

➢ To modify a dataset in-place, pass inplace=True:
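A sketch of the rename patterns (the data is arbitrary):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data.index = data.index.map(lambda x: x[:4].upper())  # in-place assignment
data.rename(index=str.title, columns=str.upper)       # transformed copy
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})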

● Discretization and Binning


➢ Continuous data is often discretized or otherwise separated into “bins” for
analysis. Suppose you have data about a group of people in a study, and
you want to group them into discrete age buckets:

➢ To divide in groups

➢ cat.codes returns the bin index (code) assigned to each data point, based on how the data falls into the bins in cats

➢ Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

➢ You can also pass your own bin names by passing a list or array to the
labels option:
➢ If you pass an integer number of bins to cut instead of explicit bin edges, it
will compute equal-length bins based on the minimum and maximum
values in the data. Consider the case of some uniformly distributed data
chopped into fourths:

➢ A closely related function, qcut, bins the data based on sample quantiles.
Depending on the distribution of the data, using cut will not usually result
in each bin having the same number of data points. Since qcut uses
sample quantiles instead, by definition you will obtain roughly equal-size
bins
➢ Quartiles are values that divide your data into four equal parts. The three quartiles are Q1 (25th percentile), Q2 (50th percentile, or the median), and Q3 (75th percentile). In the context of qcut, it means that the data is divided into four intervals, each containing approximately 25% of the data points
➢ Similar to cut you can pass your own quantiles (numbers between 0 and 1,
inclusive):
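The cut/qcut ideas above in one sketch (the ages and bin edges follow the book's example):

import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)          # intervals like (18, 25]
cats.codes                         # bin code per data point
pd.cut(ages, bins, right=False)    # close the left side instead
pd.cut(ages, bins, labels=['Youth', 'YoungAdult', 'MiddleAged', 'Senior'])
pd.cut(np.random.rand(20), 4, precision=2)  # 4 equal-length bins

data = np.random.randn(1000)
pd.qcut(data, 4)                        # quartiles: equal-size bins
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])  # custom quantiles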

● Detecting and Filtering Outliers


➢ Filtering or transforming outliers is largely a matter of applying array
operations
➢ To select all rows having a value exceeding 3 or –3, you can use the any method
on a boolean DataFrame

Here any(1) (any along axis=1) checks, row by row, whether there is at least one value whose absolute value exceeds 3, and keeps those rows.

➢ Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:

❖ np.sign(data) returns an array where each element is:


➢ 1 if the original element is greater than 0,
➢ 0 if the original element is 0, and
➢ -1 if the original element is less than 0.
❖ np.sign(data) * 3 then scales each element by multiplying it by 3. This means
that the resulting array will have elements:
➢ 3 if the original element is greater than 0,
➢ 0 if the original element is 0, and
➢ -3 if the original element is less than 0.
➢ The statement np.sign(data) produces 1 and –1 values based on whether the
values in data are positive or negative:
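Detection and capping together, sketched:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(1000, 4))

data[(np.abs(data) > 3).any(axis=1)]  # rows with at least one |value| > 3

# Cap values outside the interval [-3, 3]:
data[np.abs(data) > 3] = np.sign(data) * 3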

String Manipulation

● String Object Methods


➢ Splitting comma-separated values

➢ To remove whitespace

➢ These substrings could be concatenated together with a two-colon delimiter using addition:
➢ Using the .join method

➢ Using in, index, and find

find returns -1 if the value is not present in the string, whereas index raises an exception (ValueError) instead

➢ Using count

➢ Using replace
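The string methods above, sketched on the book's example value:

val = 'a,b,  guido'

pieces = [x.strip() for x in val.split(',')]  # split, then trim whitespace
first, second, third = pieces
first + '::' + second + '::' + third          # concatenation by addition
'::'.join(pieces)                             # the more idiomatic join

'guido' in val         # True
val.index(',')         # 1; raises ValueError if not found
val.find(':')          # -1; does not raise
val.count(',')         # 2
val.replace(',', '::')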
CH-8 Data Wrangling: Join, Combine, and Reshape

Hierarchical Indexing

● Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.
In the above labels (the MultiIndex codes):

● [0, 0, 0] means the first three values in data belong to 'a'

● [1, 1] means the next two values belong to 'b'
● [2, 2] means the next two values belong to 'c'
● [3, 3] means the last two values belong to 'd'

Also, for the inner level:

● [0, 1, 2] -> for 'a' we have all three inner labels [1, 2, 3], whose codes are [0, 1, 2] respectively
● [0, 2] -> for 'b' we have [1, 3], whose codes are [0, 2] respectively
● [0, 1] -> for 'c' we have [1, 2], whose codes are [0, 1] respectively
● [1, 2] -> for 'd' we have [2, 3], whose codes are [1, 2] respectively
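The Series these codes describe, sketched (the values are random):

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

data.index      # MultiIndex; the codes are the arrays discussed above
data['b']       # partial indexing on the outer level
data.loc[:, 2]  # selection from the 'inner' level
data.unstack()  # rearrange into a DataFrame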
➢ With a hierarchically indexed object, so-called partial indexing is possible,
enabling you to concisely select subsets of the data:
➢ Selection is even possible from an “inner” level:

➢ Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:
➢ With a DataFrame, either axis can have a hierarchical index:

➢ The hierarchical levels can have names (as strings or any Python objects).
● Reordering and Sorting
➢ The swaplevel takes two level numbers or names and returns a new object
with the levels interchanged

➢ sort_index, on the other hand, sorts the data using only the values in a
single level.
● Summary Statistics by Level

Original frame

Applying sum function at different levels


● Indexing with a DataFrame’s columns

➢ DataFrame’s set_index function will create a new DataFrame using one or more
of its columns as the index
➢ By default the columns are removed from the DataFrame, though you can leave
them in:

➢ reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns:
Combining and Merging Datasets

Data contained in pandas objects can be combined together in a number of ways:

● pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
● pandas.concat concatenates or “stacks” together objects along an axis.
● The combine_first instance method enables splicing together overlapping data to
fill in missing values in one object with values from another.

Database-Style DataFrame Joins

➢ Merge or join operations combine datasets by linking rows using one or more
keys.
➢ many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2
has only one row for each value in the key column
➢ If the join key is not specified, merge uses the overlapping column names as the keys. It's good practice to specify the key explicitly, though:

➢ If the column names are different in each object, you can specify them separately:
➢ You may notice that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:
➢ A last issue to consider in merge operations is the treatment of overlapping
column names. While you can address the overlap manually (see the earlier
section on renaming axis labels), merge has a suffixes option for specifying
strings to append to overlapping names in the left and right DataFrame objects:
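A sketch of these join variants (toy frames):

import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

pd.merge(df1, df2, on='key')     # explicit key; inner join by default
pd.merge(df1, df2, how='outer')  # union of the keys

df3 = df1.rename(columns={'key': 'lkey'})
df4 = df2.rename(columns={'key': 'rkey'})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')  # differing key names

pd.merge(df1, df1, on='key', suffixes=('_left', '_right'))  # overlap names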
● Merging on Index
➢ you can pass left_index=True or right_index=True (or both) to indicate that
the index should be used as the merge key:
➢ Since the default merge method is to intersect the join keys, you can instead
form the union of them with an outer join:
➢ Using the indexes of both sides of the merge is also possible:
➢ DataFrame has a convenient join instance for merging by index. It can also be
used to combine together many DataFrame objects having the same or similar
indexes but non-overlapping columns. In the prior example, we could have written

➢ DataFrame’s join method performs a left join on the join keys, exactly preserving
the left frame’s row index. It also supports joining the index of the passed
DataFrame on one of the columns of the calling DataFrame:
● Concatenating Along an Axis
➢ Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking
➢ By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns):
➢ In this case there is no overlap on the other axis, which as you can see is the
sorted union (the 'outer' join) of the indexes. You can instead intersect them by
passing join='inner':

➢ You can even specify the axes to be used on the other axes with join_axes
➢ In the case of combining Series along axis=1, the keys become the DataFrame
column headers:

➢ For dataframe
➢ If you pass a dict of objects instead of a list, the dict’s keys will be used for the
keys option:

➢ we can name the created axis levels with the names argument
➢ A last consideration concerns DataFrames in which the row index does not
contain any relevant data:
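A sketch of the concat variants (note that the join_axes argument mentioned above was removed in pandas 1.0; reindexing the result achieves the same effect):

import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

pd.concat([s1, s2, s3])          # stack along axis=0 -> Series
pd.concat([s1, s2, s3], axis=1)  # -> DataFrame, outer join on the index

s4 = pd.concat([s1, s3])
pd.concat([s1, s4], axis=1, join='inner')  # intersect the indexes

pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])  # hierarchical index
pd.concat([s1, s2], axis=1, keys=['one', 'two'])       # keys -> column headers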
Reshaping and Pivoting

● There are a number of basic operations for rearranging tabular data. These are
alternatingly referred to as reshape or pivot operations.
● Reshaping with Hierarchical Indexing
➢ Hierarchical indexing provides a consistent way to rearrange data in a
DataFrame. There are two primary actions:
❖ stack This “rotates” or pivots from the columns in the data to the
rows
❖ unstack This pivots from the rows into the columns
➢ Using the stack method on this data pivots the columns into the rows, producing
a Series:

➢ From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

➢ By default the innermost level is unstacked (same with stack). You can unstack a
different level by passing a level number or name:
➢ Unstacking might introduce missing data if all of the values in the level aren’t
found in each of the subgroups:

➢ Stacking filters out missing data by default, so the operation is more easily
invertible
➢ When you unstack in a DataFrame, the level unstacked becomes the lowest level
in the result:
➢ When calling stack, we can indicate the name of the axis to stack:
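A sketch of stack/unstack on a small frame:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

result = data.stack()    # pivot columns into rows -> Series
result.unstack()         # back to a DataFrame (innermost level by default)
result.unstack(0)        # unstack a different level by number
result.unstack('state')  # or by name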

CH-9 Plotting and Visualization

A Brief matplotlib API Primer


● Figures and Subplots
➢ Plots in matplotlib reside within a Figure object
➢ To create a new figure :

➢ In IPython, an empty plot window will appear, but in Jupyter nothing will be
shown until we use a few more commands. plt.figure has a number of
options; notably, figsize will guarantee the figure has a certain size and
aspect ratio if saved to disk. You can’t make a plot with a blank figure. You
have to create one or more subplots using add_subplot:
➢ This means that the figure should be 2 × 2 (so up to four plots in total),
and we’re selecting the first of four subplots (numbered from 1).

➢ When you issue a plotting command like plt.plot([1.5, 3.5, -2, 1.6]),
matplotlib draws on the last figure and subplot used (creating one if
necessary), thus hiding the figure and subplot creation.
The 'k--' is a style option instructing matplotlib to plot a black dashed line

➢ Below,

In histogram :

➔ color='k': This parameter sets the color of the bars in the histogram. In
this case, 'k' stands for black. Matplotlib recognizes a variety of color
representations, and 'k' is a shorthand for black.
➔ alpha=0.3: This parameter controls the transparency of the bars in the
histogram. The alpha parameter takes values between 0 (completely
transparent) and 1 (completely opaque). In this case, alpha=0.3 means
that the bars will be somewhat transparent, allowing underlying elements
to show through.

In scatter graph :

➔ np.arange(30): This generates an array of 30 evenly spaced values


ranging from 0 to 29. This array represents the x-coordinates of the points
in the scatter plot.
➔ np.arange(30) + 3 * np.random.randn(30): This generates the y-coordinates for the scatter plot. It starts from the same values as the x-coordinates (np.arange(30)) and adds normally distributed noise with mean 0 and standard deviation 3 (standard normal draws scaled by 3). The result is the x-values with some random noise added.
● Adjusting the spacing around subplots
➢ By default matplotlib leaves a certain amount of padding around the outside of the subplots and spacing between subplots. This spacing is all specified relative to the height and width of the plot, so that if you resize the plot either programmatically or manually using the GUI window, the plot will dynamically adjust itself. To adjust the padding, use the subplots_adjust method:

➢ wspace and hspace control the percentage of the figure width and figure height, respectively, to use as spacing between subplots.
You may notice that the axis labels overlap. matplotlib doesn't check whether the labels overlap, so in a case like this you would need to fix the labels yourself by specifying explicit tick locations and tick labels.

● Colors, Markers, and Line Styles


➢ to plot x versus y with green dashes, you would execute:

ax.plot(x, y, 'g--')

ax.plot(x, y, linestyle='--', color='g')

➢ Line plots can additionally have markers to highlight the actual data
points. Since matplotlib creates a continuous line plot, interpolating
between points, it can occasionally be unclear where the points lie. The
marker can be part of the style string, which must have color followed by
marker type and line style.
➢ drawstyle option
● Ticks, Labels, and Legends
➢ The pyplot interface, designed for interactive use, consists of methods like
xlim, xticks, and xticklabels
➢ These control the plot range, tick locations, and tick labels, respectively.
They can be used in two ways:
❖ Called with no arguments returns the current parameter value (e.g.,
plt.xlim() returns the current x-axis plotting range)
❖ Called with parameters sets the parameter value (e.g., plt.xlim([0,
10]), sets the x-axis range to 0 to 10)
➢ Setting the title, axis labels, ticks, and ticklabels
The rotation option sets the x tick labels at a 30-degree rotation.

➢ In above example we could even use:
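A sketch of these calls together (the data is random):

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.randn(1000).cumsum())

ax.set_xticks([0, 250, 500, 750, 1000])
ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'],
                   rotation=30, fontsize='small')
ax.set_title('My first matplotlib plot')
ax.set_xlabel('Stages')

# The batch form referred to above:
ax.set(title='My first matplotlib plot', xlabel='Stages')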

● Adding legends
➢ Legends are another critical element for identifying plot elements. In matplotlib, a legend is a box containing a label or labels explaining the meaning of the elements in a plot. Legends are used to identify different data series or elements in a chart, making it easier for viewers to interpret the information presented.

Once you’ve done this, you can either call ax.legend() or plt.legend() to
automatically create a legend.
To exclude one or more elements from the legend, pass no label or
label='_nolegend_'.

● Annotations and Drawing on a Subplot


➢ In addition to the standard plot types, you may wish to draw your own plot annotations, which could consist of text, arrows, or other shapes. To do so:
We use the set_xlim and set_ylim methods to manually set the start and end boundaries for the plot rather than using matplotlib's defaults. Lastly, ax.set_title adds a main title to the plot
➢ Drawing shapes requires some more care. matplotlib has objects that represent
many common shapes, referred to as patches. Some of these, like Rectangle and
Circle, are found in matplotlib.pyplot, but the full set is located in
matplotlib.patches
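A sketch of an annotation plus two patches:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Rectangle and Circle are reachable via plt; the full set lives in
# matplotlib.patches.
rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
ax.add_patch(rect)
ax.add_patch(circ)

ax.annotate('a point of interest', xy=(0.5, 0.5), xytext=(0.1, 0.1),
            arrowprops=dict(arrowstyle='->'))
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Annotations and patches')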

● Saving Plots to File


➢ The file type is inferred from the file extension. So if you used .pdf instead,
you would get a PDF. There are a couple of important options that I use
frequently for publishing graphics: dpi, which controls the dots-per-inch
resolution, and bbox_inches, which can trim the whitespace around the
actual figure. To get the same plot as a PNG with minimal whitespace
around the plot and at 400 DPI

➢ To save the figure to a BytesIO buffer
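Both save targets, sketched:

from io import BytesIO
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1.5, 3.5, -2, 1.6])

# PNG at 400 DPI with surrounding whitespace trimmed.
fig.savefig('figpath.png', dpi=400, bbox_inches='tight')

# Write to an in-memory buffer instead of a file.
buffer = BytesIO()
fig.savefig(buffer)
plot_data = buffer.getvalue()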


● Matplotlib Configuration
➢ Matplotlib configuration refers to the ability to customize the default
settings and behavior of Matplotlib's plotting parameters. This allows you
to set global configurations for various components such as figure size,
font styles, color schemes, and more. You can modify these configurations
to suit your preferences and ensure consistency across your plots.
➢ The plt.rc method in Matplotlib is a way to modify these configurations
programmatically from within Python. It takes two main arguments: the
component you want to customize (e.g., 'figure', 'axes', 'xtick', 'ytick', 'grid',
'legend', etc.) and a set of keyword arguments specifying the new
parameters.

➢ This means that all subsequent figures created with Matplotlib will have a
default size of 10x10 inches.
➢ You can also use a dictionary to specify multiple configuration options at
once:
➢ This code sets the default font family, weight, and size for text in
Matplotlib plots.
➢ Matplotlib allows for extensive customization, and you can find a
comprehensive list of configuration options in the matplotlibrc file.
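The two plt.rc styles described above, sketched:

import matplotlib.pyplot as plt

# All subsequent figures default to 10 x 10 inches.
plt.rc('figure', figsize=(10, 10))

# A dict can set several options for one component at once.
font_options = {'family': 'monospace', 'weight': 'bold', 'size': 8}
plt.rc('font', **font_options)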

Plotting with pandas and seaborn

Series Plotting:
● A Series is a one-dimensional labeled array in pandas.
● In the example, a Series s is created with random data and plotted using the
plot() method.
● By default, plot() generates a line plot where the x-axis is taken from the Series
index, and the y-axis is the Series values.

● You can disable using the index for the x-axis by passing use_index=False to the
plot() method.
● Options such as xticks, yticks, xlim, and ylim can be used to customize the
appearance of the plot.
DataFrame Plotting:
● A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous
tabular data structure with labeled axes (rows and columns).
● In the example, a DataFrame df is created with random data and plotted using
the plot() method.
● The plot() method for a DataFrame plots each column as a different line on the
same subplot, and it automatically generates a legend.
● The plot attribute of a DataFrame contains a family of methods for different plot
types. For example, df.plot() is equivalent to df.plot.line().

● Pandas plotting methods accept various optional parameters, and you can
customize the appearance of the plots.
● Options for adjusting x-axis and y-axis ticks, limits, and subplot placement are
mentioned.
● Pandas plots are built on top of Matplotlib, and you can provide a Matplotlib
subplot object to the ax parameter for more flexible placement of subplots in a
grid layout.
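A sketch of both (random data):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
s.plot()   # line plot; x-axis comes from the index

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()  # one line per column, legend added automatically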
Bar Plots
● The plot.bar() and plot.barh() make vertical and horizontal bar plots, respectively
● We create stacked bar plots from a DataFrame by passing stacked=True,
resulting in the value in each row being stacked together

Output of above:
And the corresponding row sums:

Dividing each element of a row by that row's sum (e.g., 16/18) normalizes the rows so that each sums to 1.
The black lines drawn on the bars represent the 95% confidence interval (this can be configured through optional arguments).
➢ seaborn.barplot has a hue option that enables us to split by an additional
categorical value.
➢ You can switch between different plot appearances using seaborn.set:

Histograms and Density Plots


● A histogram is a kind of bar plot that gives a discretized display of value
frequency. The data points are split into discrete, evenly spaced bins, and the
number of data points in each bin is plotted.
● Using previous typing data
● A related plot type is a density plot, which is formed by computing an estimate of
a continuous probability distribution that might have generated the observed
data. The usual procedure is to approximate this distribution as a mixture of
“kernels”—that is, simpler distributions like the normal distribution. Thus, density
plots are also known as kernel density estimate (KDE) plots.
● Seaborn makes histograms and density plots even easier through its distplot
method, which can plot both a histogram and a continuous density estimate
simultaneously
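A sketch (the tip data below is a synthetic stand-in for the book's tips dataset; also note that seaborn's distplot was deprecated in favor of histplot/displot in seaborn 0.11):

import numpy as np
import pandas as pd
import seaborn as sns

tip_pct = pd.Series(np.random.gamma(2, 0.05, 244))  # stand-in data

tip_pct.plot.hist(bins=50)  # histogram
tip_pct.plot.density()      # kernel density estimate (requires SciPy)

sns.histplot(tip_pct, bins=100, kde=True)  # histogram + density in one plot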

Scatter or Point Plots


● Point plots or scatter plots can be a useful way of examining the relationship
between two one-dimensional data series.
● In exploratory data analysis it’s helpful to be able to look at all the scatter plots
among a group of variables; this is known as a pairs plot or scatter plot matrix.
Making such a plot from scratch is a bit of work, so seaborn has a convenient
pairplot function, which supports placing histograms or density estimates of
each variable along the diagonal
CH-10 Data Aggregation and Group Operations

GroupBy Mechanics

● Hadley Wickham coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object.

● Example: to compute the mean of the data1 column using the labels from key1.
● Series now has a hierarchical index consisting of the unique pairs of keys
observed:
● You may have noticed in the first case df.groupby('key1').mean() that there is no
key2 column in the result. Because df['key2'] is not numeric data, it is said to be a
nuisance column, which is therefore excluded from the result.
● .size() returns the size (count) of each group
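A sketch of the mechanics described above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df.groupby('key1')['data1'].mean()   # split by key1, apply mean, combine
means = df.groupby(['key1', 'key2'])['data1'].mean()
means                                # hierarchical index of key pairs
means.unstack()
df.groupby(['key1', 'key2']).size()  # group sizes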

Iterating Over Groups

● The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
● By default groupby groups on axis=0, but you can group on any of the other axes.
For example, we could group the columns of our example df here by dtype like
so:
Selecting a Column or Subset of Columns

● Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:
Grouping with Dicts and Series
Grouping with Functions

Grouping by Index Levels


Data Aggregation

● Aggregations refer to any data transformation that produces scalar values from
arrays. The preceding examples have used several of them, including mean,
count, min, and sum.
Column-Wise and Multiple Function Application
Returning Aggregated Data Without Row Indexes

● In all of the examples up until now, the aggregated data comes back with an
index, potentially hierarchical, composed from the unique group key
combinations. Since this isn’t always desirable, you can disable this behavior in
most cases by passing as_index=False to groupby:
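A sketch of the aggregation options (toy data; peak_to_peak is a made-up custom aggregate):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': np.random.randn(5)})

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped = df.groupby('key1')
grouped['data1'].agg(['mean', 'std', peak_to_peak])  # several functions
grouped['data1'].agg([('average', 'mean')])          # (name, function) pairs
df.groupby('key1', as_index=False).mean()            # no group-key index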
Apply: General split-apply-combine
Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for:
Suppressing the Group Keys

● In the preceding examples, you see that the resulting object has a hierarchical
index formed from the group keys along with the indexes of each piece of the
original object. You can disable this by passing group_keys=False to groupby:

Quantile and Bucket Analysis


● These were equal-length buckets; to compute equal-size buckets based on
sample quantiles, use qcut. I’ll pass labels=False to just get quantile numbers

Example: Filling Missing Values with Group-Specific Values


Example: Random Sampling and Permutation
● Suppose you wanted two random cards from each suit. Because the suit is the
last character of each card name, we can group based on this and use apply:
Example: Group Weighted Average and Correlation
Output

Weighted average for category 'X' is calculated as (2×0.2 + 3×0.8)/(0.2 + 0.8) = 2.8.

Weighted average for category 'Y' is calculated as (1×0.5 + 4×0.5)/(0.5 + 0.5) = 2.5.
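The calculation above, sketched:

import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['X', 'X', 'Y', 'Y'],
                   'data': [2, 3, 1, 4],
                   'weights': [0.2, 0.8, 0.5, 0.5]})

get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
df.groupby('category').apply(get_wavg)
# X    2.8
# Y    2.5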

● create a function that computes the pairwise correlation of each column with the
'SPX' column:

● compute percent change on close_px using pct_change:


Pivot Tables and Cross-Tabulation

● A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter combined with reshape operations utilizing hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
● We could augment this table to include partial totals by passing margins=True.
This has the effect of adding All row and column labels, with corresponding
values being the group statistics for all the data within a single tier:

Here, the All values are means without taking into account smoker versus
nonsmoker (the All columns) or any of the two levels of grouping on the rows
(the All row).
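A sketch with synthetic stand-in data for the book's tips dataset:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tips = pd.DataFrame({'day': rng.choice(['Thur', 'Fri', 'Sat', 'Sun'], 100),
                     'smoker': rng.choice(['Yes', 'No'], 100),
                     'tip_pct': rng.uniform(0.05, 0.25, 100)})

# Group means: day down the rows, smoker across the columns.
tips.pivot_table('tip_pct', index='day', columns='smoker', aggfunc='mean')

# margins=True adds the 'All' partial totals.
tips.pivot_table('tip_pct', index='day', columns='smoker',
                 aggfunc='mean', margins=True)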
Cross-Tabulations: Crosstab
Ch-11 Time Series

Time series data is an important form of structured data in many different fields, such
as finance, economics, ecology, neuroscience, and physics. Anything that is observed or
measured at many points in time forms a time series. Many time series are fixed
frequency, which is to say that data points occur at regular intervals according to some
rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units.

How you mark and refer to time series data depends on the application, and you may
have one of the following:

● Timestamps, specific instants in time


● Fixed periods, such as the month January 2007 or the full year 2010
● Intervals of time, indicated by a start and end timestamp. Periods can be thought
of as special cases of intervals
● Experiment or elapsed time; each timestamp is a measure of time relative to a
particular start time (e.g., the diameter of a cookie baking each second since
being placed in the oven)

Date and Time Data Types and Tools

● The datetime, time, and calendar modules are the main places to start. The
datetime.datetime type, or simply datetime, is widely used:

● datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects:

● You can add (or subtract) a timedelta or multiple thereof to a datetime object to
yield a new shifted object:
● Converting Between String and Datetime
➢ Format datetime objects as strings using str or the strftime method

➢ To convert strings to dates, use datetime.strptime

➢ datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas)

➢ The to_datetime method parses many different kinds of date representations.
NaT (Not a Time) is pandas's null value for timestamp data.
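The conversions above, sketched:

from datetime import datetime, timedelta
from dateutil.parser import parse
import pandas as pd

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)  # a timedelta
datetime(2011, 1, 7) + timedelta(12)         # shift forward 12 days

stamp = datetime(2011, 1, 3)
str(stamp)                                   # '2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')                   # datetime -> string
datetime.strptime('2011-01-03', '%Y-%m-%d')  # string -> datetime, known format
parse('Jan 31, 1997 10:45 PM')               # dateutil guesses the format

pd.to_datetime(['2011-07-06 12:00:00', None])  # second element becomes NaT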

Time Series Basics

● A basic kind of time series object in pandas is a Series indexed by timestamps, which is often represented external to pandas as Python strings or datetime objects:
Recall that ts[::2] selects every second element in ts

➢ pandas stores timestamps using NumPy's datetime64 data type at the nanosecond resolution:

● Indexing, Selection, Subsetting


➢ Time series behaves like any other pandas.Series when you are indexing and selecting data based on label:

➢ For longer time series, a year or only a year and month can be passed to
easily select slices of data:
Here, the string '2001' is interpreted as a year and selects that time period. This also
works if you specify the month:
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

➢ This means that no data is copied and modifications on the slice will be reflected in the original data.
➢ There is an equivalent instance method, truncate, that slices a Series between
two dates:
● Time Series with Duplicate Indices
➢ Suppose you wanted to aggregate the data having non-unique timestamps. One
way to do this is to use groupby and pass level=0

Date Ranges, Frequencies, and Shifting

● Tools for resampling, inferring frequencies, and generating fixed-frequency date ranges

● Generating Date Ranges


➢ The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month)
➢ Sometimes you will have start or end dates with time information but want to
generate a set of timestamps normalized to midnight as a convention. To do this,
there is a normalize option:
● Frequencies and Date Offsets
➢ Frequencies in pandas are composed of a base frequency and a multiplier.
Base frequencies are typically referred to by a string alias, like 'M' for
monthly or 'H' for hourly. For each base frequency, there is an object
defined generally referred to as a date offset.

➢ In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

Many offsets can be combined together by addition:


Similarly, you can pass frequency strings, like '1h30min', that will
effectively be parsed to the same expression:

➢ Some frequencies describe points in time that are not evenly spaced. For
example, 'M' (calendar month end) and 'BM' (last business/weekday of month)
depend on the number of days in a month and, in the latter case, whether the
month ends on a weekend or not. We refer to these as anchored offsets.
➢ Week of month dates

One useful frequency class is “week of month,” starting with WOM. This enables
you to get dates like the third Friday of each month:
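A sketch of these frequency tools (the string aliases follow the book's pandas version; newer releases prefer lowercase aliases like 'h'):

import pandas as pd
from pandas.tseries.offsets import Hour, Minute

pd.date_range('2012-04-01', '2012-06-01')             # daily by default
pd.date_range('2000-01-01', '2000-12-01', freq='BM')  # last business day
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)

Hour(4)               # <4 * Hours>, same as freq='4H'
Hour(2) + Minute(30)  # offsets combined by addition
pd.date_range('2000-01-01', periods=4, freq='1h30min')
pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')  # 3rd Friday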

● Shifting (Leading and Lagging) Data


➢ “Shifting” refers to moving data backward and forward through time. Both
Series and DataFrame have a shift method for doing naive shifts forward
or backward, leaving the index unmodified:
➢ A common use of shift is computing percent changes in a time series or multiple
time series as DataFrame columns. This is expressed as:

➢ Because naive shifts leave the index unmodified, some data is discarded. Thus if
the frequency is known, it can be passed to shift to advance the timestamps
instead of simply the data:

➢ Other frequencies can be passed, too, giving you some flexibility in how to lead
and lag the data:
Here T stands for minutes
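A sketch of the shift variants:

import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(4),
               index=pd.date_range('2000-01-01', periods=4, freq='M'))

ts.shift(2)              # naive shift; index unchanged, some data discarded
ts / ts.shift(1) - 1     # percent change between periods
ts.shift(2, freq='M')    # advance the timestamps instead of the data
ts.shift(1, freq='90T')  # T stands for minutes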

➢ Shifting dates with offsets

The pandas date offsets can also be used with datetime or Timestamp objects:

If you add an anchored offset like MonthEnd, the first increment will “roll forward”
a date to the next date according to the frequency rule:

Anchored offsets can explicitly “roll” dates forward or backward by simply using
their rollforward and rollback methods, respectively:
A creative use of date offsets is to use these methods with groupby:

Here the offset is MonthEnd

an easier and faster way to do this is using resample:
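A sketch of the offset arithmetic and the groupby/resample equivalence:

from datetime import datetime
import numpy as np
import pandas as pd
from pandas.tseries.offsets import Day, MonthEnd

now = datetime(2011, 11, 17)
now + 3 * Day()          # 2011-11-20
now + MonthEnd()         # anchored offset rolls forward to 2011-11-30

offset = MonthEnd()
offset.rollforward(now)  # 2011-11-30
offset.rollback(now)     # 2011-10-31

ts = pd.Series(np.random.randn(20),
               index=pd.date_range('2000-01-15', periods=20, freq='4d'))
ts.groupby(offset.rollforward).mean()  # group by rolled-forward month end
ts.resample('M').mean()                # the easier, faster equivalent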


Time Zone Handling

● Working with time zones is generally considered one of the most unpleasant
parts of time series manipulation. As a result, many time series users choose to
work with time series in coordinated universal time or UTC, which is the
successor to Greenwich Mean Time and is the current international standard.
Time zones are expressed as offsets from UTC; for example, New York is four
hours behind UTC during daylight saving time and five hours behind the rest of
the year.
● In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments. In the United States, the DST transition times have been changed many times since 1900!

Periods and Period Arithmetic

● Periods represent timespans, like days, months, quarters, or years


● Regular ranges of periods can be constructed with the period_range function:

The PeriodIndex class stores a sequence of periods and can serve as an axis
index in any pandas data structure:

If you have an array of strings, you can also use the PeriodIndex class:

● Period Frequency Conversion


➢ Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward:

➢ You can think of Period('2007', 'A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. See Figure 11-1 for an illustration of this. For a fiscal year ending on a month other than December, the corresponding monthly subperiods are different:

➢ When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod "belongs." For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:

➢ Whole PeriodIndex objects or time series can be similarly converted with the same semantics:
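The conversions above, sketched:

import pandas as pd

p = pd.Period('2007', freq='A-DEC')  # annual period ending in December
p.asfreq('M', how='start')           # Period('2007-01', 'M')
p.asfreq('M', how='end')             # Period('2007-12', 'M')

pd.Period('Aug-2007', 'M').asfreq('A-JUN')  # belongs to the 2008 period

rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = pd.Series(range(len(rng)), index=rng)
ts.asfreq('M', how='start')          # whole series, same semantics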

Resampling and Frequency Conversion

● Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling

resample is a flexible and high-performance method that can be used to process very large time series
● Downsampling
➢ Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you're aggregating doesn't need to have a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame.
➢ There are a couple things to think about when using resample to
downsample data:
➔ Which side of each interval is closed
➔ How to label each aggregated bin, either with the start of the
interval or the end
➢ Suppose you wanted to aggregate this data into five-minute chunks or
bars by taking the sum of each group:

The frequency you pass defines bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval. Passing closed='right' changes the interval to be closed on the right

➢ The resulting time series is labeled by the timestamps from the left side of
each bin. By passing label='right' you can label them with the right bin
edge:
➢ Lastly, you might want to shift the result index by some amount, say subtracting
one second from the right edge to make it more clear which interval the
timestamp refers to. To do this, pass a string or date offset to loffset:

You also could have accomplished the effect of loffset by calling the shift
method on the result without the loffset.

➢ Open-High-Low-Close (OHLC) resampling

In finance, a popular way to aggregate a time series is to compute four values for
each bucket: the first (open), last (close), maximum (high), and minimal (low)
values. By using the ohlc aggregate function you will obtain a DataFrame having
columns containing these four aggregates, which are efficiently computed in a
single sweep of the data:
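The downsampling options above, sketched on twelve minutes of per-minute data:

import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)

ts.resample('5min').sum()                  # left-closed bins, left labels
ts.resample('5min', closed='right').sum()  # right-inclusive bins
ts.resample('5min', closed='right', label='right').sum()
ts.resample('5min').ohlc()                 # open-high-low-close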

● Upsampling and Interpolation


➢ When converting from a low frequency to a higher frequency, no
aggregation is needed

➢ When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:

➢ Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:
● Resampling with Periods
Upsampling is more nuanced, as you must make a decision about which
end of the timespan in the new frequency to place the values before
resampling, just like the asfreq method. The convention argument defaults
to 'start' but can also be 'end':
Since periods refer to timespans, the rules about upsampling and downsampling are
more rigid:

● In downsampling, the target frequency must be a subperiod of the source frequency.
● In upsampling, the target frequency must be a superperiod of the source frequency.

If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the timespans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:

Moving Window Functions

● An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data. I call these moving window functions, even though they include functions without a fixed-length window, like the exponentially weighted moving average. Like other statistical functions, these also automatically exclude missing data.
The expression rolling(250) is similar in behavior to groupby, but instead of grouping it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple's stock price.

● By default rolling functions require all of the values in the window to be non-NA.
This behavior can be changed to account for missing data and, in particular, the
fact that you will have fewer than window periods of data at the beginning of the
time series
● In order to compute an expanding window mean, use the expanding operator
instead of rolling. The expanding mean starts the time window from the
beginning of the time series and increases the size of the window until it
encompasses the whole series
The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that you can pass to resample. For example, we could compute a 20-day rolling mean like so:
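A sketch of the window variants (random data standing in for a price series):

import numpy as np
import pandas as pd

px = pd.Series(np.random.randn(1000),
               index=pd.date_range('2000-01-01', periods=1000)).cumsum()

px.rolling(250).mean()                  # 250-day moving average
px.rolling(250, min_periods=10).mean()  # allow partial windows at the start
px.expanding().mean()                   # window grows from the beginning
px.rolling('20D').mean()                # fixed time offset; irregular-friendly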
CH-12 Advanced pandas
Categorical Data

● Background and Motivation


This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. The integer values that reference the categories are called the category codes or simply codes.

● Categorical Type in pandas


➢ pandas has a special Categorical type for holding data that uses the
integer-based categorical representation or encoding.
➢ If you have obtained categorical encoded data from another source, you
can use the alternative from_codes constructor:

● Categorical Methods
Creating dummy variables for modeling

When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0 otherwise.

The pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables.
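A sketch of the categorical machinery:

import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
c = pd.Categorical(fruits)
c.categories  # the distinct values (the 'dictionary' or 'levels')
c.codes       # integer codes referencing the categories

# Reconstructing from codes obtained elsewhere:
pd.Categorical.from_codes([0, 1, 2, 0, 0, 1], ['foo', 'bar', 'baz'])

# One-hot encoding (dummy variables):
pd.get_dummies(pd.Series(fruits, dtype='category'))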
Advanced GroupBy Use

Group Transforms and “Unwrapped” GroupBys

There is another built-in method called transform, which is similar to apply but imposes
more constraints on the kind of function you can use:

● It can produce a scalar value to be broadcast to the shape of the group


● It can produce an object of the same shape as the input group
● It must not mutate its input
Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function. These also have a "fast path" when used with transform. This allows us to perform a so-called unwrapped group operation:
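A sketch of transform and the unwrapped form:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': range(12)})
g = df.groupby('key')['value']

g.transform('mean')           # broadcast each group's mean to its rows
g.transform(lambda x: x * 2)  # same-shape output is also allowed

# 'Unwrapped' group operation: normalize within groups via fast built-ins.
normalized = (df['value'] - g.transform('mean')) / g.transform('std')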

Techniques for Method Chaining

When applying a sequence of transformations to a dataset, you may find yourself creating numerous temporary variables that are never used in your analysis.
First, the DataFrame.assign method is a functional alternative to column assignments of the form df[k] = v. Rather than modifying the object in-place, it returns a new DataFrame with the indicated modifications. So these statements are equivalent:

Assigning in-place may execute faster than using assign, but assign enables easier
method chaining:

One thing to keep in mind when doing method chaining is that you may need to refer to
temporary objects. In the preceding example, we cannot refer to the result of load_data
until it has been assigned to the temporary variable df.

To help with this, assign and many other pandas functions accept function-like
arguments, also known as callables. To show callables in action, consider a fragment of
the example from before:

Here, the result of load_data is not assigned to a variable, so the function passed into []
is then bound to the object at that stage of the method chain.
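A hedged sketch of the pattern; load_data, the column names, and the filter condition are placeholders standing in for the book's fragment:

import pandas as pd

def load_data() -> pd.DataFrame:
    # Placeholder for whatever loads the raw data.
    return pd.DataFrame({'col1': [1, 2, 3], 'col2': [4.0, 5.0, 6.0]})

# Temporary-variable style:
df = load_data()
df2 = df[df['col2'] < 6].copy()
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()

# Method-chaining style with callables; each lambda receives the object
# at that stage of the chain, so no temporary name is needed.
result = (load_data()
          [lambda x: x['col2'] < 6]
          .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean()))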
