CH-6 Data Loading, Storage, and File Formats
These reading functions (such as pandas.read_csv) take optional arguments that fall into a few categories:
● Indexing: can treat one or more columns as the index of the returned DataFrame, and choose whether to get column names from the file, from the user, or not at all.
● Type inference and data conversion: includes user-defined value conversions and custom lists of missing-value markers.
● Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
● Iterating: support for iterating over chunks of very large files.
● Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
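A minimal sketch combining several of these options on an in-memory CSV (the file contents and column names here are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Illustrative file: a comment line, a thousands separator, and a sentinel value
raw = '# generated file\nname,score,count\nalice,"1,200",5\nbob,NULL,7\n'

df = pd.read_csv(
    StringIO(raw),
    comment="#",          # unclean data: skip comment lines
    thousands=",",        # parse "1,200" as the number 1200
    na_values=["NULL"],   # custom missing-value marker
    index_col="name",     # indexing: use a column as the index
)
print(df)
```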
In UNIX, you can inspect a raw text file from the shell (for example with cat). A file will not always have a header row; in that case pass header=None or supply column names with the names option. Individual rows can be skipped with skiprows.
Reading text files in pieces:
Explanation:
● pd.read_csv with chunksize: The pd.read_csv function is used to read the CSV
file ('ex6.csv') in chunks. The chunksize parameter is set to 1000, indicating that
the file will be read in chunks of 1000 rows at a time.
● TextFileReader Object: The result of this operation is a TextFileReader object
(<pandas.io.parsers.TextFileReader at 0x7f6b1e2672e8>). This object is an
iterator that allows you to iterate over the file in chunks using a for loop.
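A sketch of chunk iteration, using an in-memory buffer in place of 'ex6.csv' (the data and the key column are made up):

```python
import pandas as pd
from io import StringIO

# Ten rows standing in for a large file
raw = "key,value\n" + "".join(f"{i % 3},{i}\n" for i in range(10))

# chunksize returns an iterator (a TextFileReader) yielding DataFrames
chunker = pd.read_csv(StringIO(raw), chunksize=4)

total = pd.Series(dtype=float)
for piece in chunker:
    # accumulate value counts of 'key' across the chunks
    total = total.add(piece["key"].value_counts(), fill_value=0)

print(total.sort_index())
```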
● Other delimiters can be used, of course (writing to sys.stdout so it prints the text
result to the console):
Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:
Both index and header are disabled
To write delimited files manually, you can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:
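A small sketch of manual writing with csv.writer, using a StringIO buffer as the writable file object:

```python
import csv
from io import StringIO

buf = StringIO()  # stands in for an open, writable file object
writer = csv.writer(buf, delimiter=";", quoting=csv.QUOTE_MINIMAL)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2", "3"))

print(buf.getvalue())
```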
BINARY DATA FORMATS
Opening an Excel file by passing the file name directly (to pandas.ExcelFile or pandas.read_excel).
Writing to an Excel file by creating an ExcelWriter and using the to_excel method:
● The json() method of the response object is used to parse the JSON content of
the response into native Python objects. In this case, it returns a list of
dictionaries, where each dictionary represents an issue on the GitHub repository.
● Replacing Values
➢ Filling in missing data with the fillna method is a special case of more
general value replacement.
➢ The -999 values might be sentinel values for missing data. To replace
these with NA values that pandas understands, we can use replace,
producing a new Series (unless you pass inplace=True)
➢ If you want to replace multiple values at once, you instead pass a list and
then the substitute value:
➢ To use a different replacement for each value, pass a list of substitutes:
➢ The argument passed can also be a dict mapping old values to replacements. To modify the Series in place rather than producing a new one, pass inplace=True:
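A sketch of these replace variants on a Series with -999/-1000 sentinels (the values mirror the classic sentinel example):

```python
import pandas as pd
import numpy as np

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# single sentinel -> NA
s1 = data.replace(-999, np.nan)
# multiple values at once -> one substitute
s2 = data.replace([-999, -1000], np.nan)
# a different replacement for each value (list of substitutes)
s3 = data.replace([-999, -1000], [np.nan, 0])
# the same mapping expressed as a dict
s4 = data.replace({-999: np.nan, -1000: 0})
```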
➢ Discretization and binning: to divide continuous data into groups (bins), use pandas.cut:
➢ You can also pass your own bin names by passing a list or array to the
labels option:
➢ If you pass an integer number of bins to cut instead of explicit bin edges, it
will compute equal-length bins based on the minimum and maximum
values in the data. Consider the case of some uniformly distributed data
chopped into fourths:
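A sketch of cut with explicit edges, custom labels, and an integer bin count (the ages and bin edges are illustrative):

```python
import pandas as pd
import numpy as np

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)  # explicit bin edges
named = pd.cut(ages, bins, labels=["Youth", "YoungAdult", "MiddleAged", "Senior"])

# integer number of bins: equal-width bins from the min to the max of the data
data = np.random.uniform(size=20)
quarters = pd.cut(data, 4, precision=2)
```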
➢ A closely related function, qcut, bins the data based on sample quantiles.
Depending on the distribution of the data, using cut will not usually result
in each bin having the same number of data points. Since qcut uses
sample quantiles instead, by definition you will obtain roughly equal-size
bins
➢ Quartiles are values that divide your data into four equal parts. The three quartiles are Q1 (25th percentile), Q2 (50th percentile, or the median), and Q3 (75th percentile). In the context of qcut, it means that the data is divided into four intervals, each containing approximately 25% of the data points.
➢ Similar to cut you can pass your own quantiles (numbers between 0 and 1,
inclusive):
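A sketch of qcut on normally distributed data (the seed and sample size are arbitrary):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

quartiles = pd.qcut(data, 4)                   # four quantile-based bins
counts = pd.Series(quartiles).value_counts()   # each bin gets 250 of 1000 points

# custom quantiles (numbers between 0 and 1, inclusive)
custom = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
```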
Here any(1) (that is, any(axis=1)) checks whether there is at least one value in the given row that is greater than 3.
➢ Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:
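A sketch of both steps — selecting rows with an outlier and capping with np.sign (random data, arbitrary seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# select rows containing at least one value exceeding 3 in absolute value
outlier_rows = data[(np.abs(data) > 3).any(axis=1)]

# cap values outside the interval [-3, 3]; np.sign gives -1 or 1 per element
data[np.abs(data) > 3] = np.sign(data) * 3
```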
String Manipulation
find will return -1 if the value is not present in the string, but index will raise an exception:
➢ Using count
➢ Using replace
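A short sketch of these string methods (the sample string is illustrative):

```python
val = "a,b,  guido"

# find returns -1 when the substring is absent; index raises ValueError instead
assert val.find(":") == -1
try:
    val.index(":")
except ValueError:
    pass  # raised because ":" is not in the string

print(val.count(","))          # number of occurrences of ","
print(val.replace(",", "::"))  # substitute every "," with "::"
```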
CH-8 Data Wrangling: Join, Combine, and Reshape
Hierarchical Indexing
Also,
● [0, 1, 2] -> because under 'a' we have all three inner values [1, 2, 3], whose codes are [0, 1, 2] respectively
● [0, 2] -> because under 'b' we have [1, 3], whose codes are [0, 2] respectively
● [0, 1] -> because under 'c' we have [1, 2], whose codes are [0, 1] respectively
● [1, 2] -> because under 'd' we have [2, 3], whose codes are [1, 2] respectively
➢ With a hierarchically indexed object, so-called partial indexing is possible,
enabling you to concisely select subsets of the data:
➢ Selection is even possible from an “inner” level:
➢ The hierarchical levels can have names (as strings or any Python objects).
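A sketch of a hierarchically indexed Series matching the a/b/c/d layout above, with partial and inner-level selection:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9, dtype=float),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 3, 1, 2, 2, 3],
    ],
)

# partial indexing: everything under outer label 'b'
sub = data["b"]

# selection from an "inner" level: all outer labels with inner value 2
inner = data.loc[:, 2]
```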
● Reordering and Sorting
➢ The swaplevel takes two level numbers or names and returns a new object
with the levels interchanged
➢ sort_index, on the other hand, sorts the data using only the values in a
single level.
● Summary Statistics by Level
Original frame
➢ DataFrame’s set_index function will create a new DataFrame using one or more
of its columns as the index
➢ By default the columns are removed from the DataFrame, though you can leave
them in:
➢ reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns:
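A sketch of set_index and reset_index (the frame below is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one"] * 3 + ["two"] * 4,
    "d": [0, 1, 2, 0, 1, 2, 3],
})

frame2 = frame.set_index(["c", "d"])            # columns become a hierarchical index
kept = frame.set_index(["c", "d"], drop=False)  # leave the columns in place too
back = frame2.reset_index()                     # index levels move back into columns
```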
Combining and Merging Datasets
➢ Merge or join operations combine datasets by linking rows using one or more
keys.
➢ This is an example of a many-to-one join: the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column.
➢ If the column to join on is not specified, merge uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though:
➢ If the column names are different in each object, you can specify them
separately:
➢ You may notice that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:
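A sketch of inner versus outer merges (df1/df2 follow the familiar many-to-one example):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

inner = pd.merge(df1, df2, on="key")     # default: inner join; 'c' and 'd' drop out
outer = pd.merge(df1, df2, how="outer")  # union of keys; missing data becomes NaN
```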
➢ A last issue to consider in merge operations is the treatment of overlapping
column names. While you can address the overlap manually (see the earlier
section on renaming axis labels), merge has a suffixes option for specifying
strings to append to overlapping names in the left and right DataFrame objects:
● Merging on Index
➢ you can pass left_index=True or right_index=True (or both) to indicate that
the index should be used as the merge key:
➢ Since the default merge method is to intersect the join keys, you can instead
form the union of them with an outer join:
➢ Using the indexes of both sides of the merge is also possible:
➢ DataFrame has a convenient join instance method for merging by index. It can also be used to combine many DataFrame objects having the same or similar indexes but non-overlapping columns. In the prior example, we could have written:
➢ DataFrame’s join method performs a left join on the join keys, exactly preserving
the left frame’s row index. It also supports joining the index of the passed
DataFrame on one of the columns of the calling DataFrame:
● Concatenating Along an Axis
➢ Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking.
➢ By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns):
➢ In this case there is no overlap on the other axis, which as you can see is the
sorted union (the 'outer' join) of the indexes. You can instead intersect them by
passing join='inner':
➢ You can even specify the axes to be used on the other axes with join_axes (removed in newer pandas versions; reindex the result instead):
➢ In the case of combining Series along axis=1, the keys become the DataFrame
column headers:
➢ For dataframe
➢ If you pass a dict of objects instead of a list, the dict’s keys will be used for the
keys option:
➢ we can name the created axis levels with the names argument
➢ A last consideration concerns DataFrames in which the row index does not
contain any relevant data:
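A sketch of concat along both axes, with keys and ignore_index (the Series are illustrative):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])

stacked = pd.concat([s1, s2])                     # axis=0 -> one longer Series
wide = pd.concat([s1, s2], axis=1)                # axis=1 -> DataFrame of columns
keyed = pd.concat([s1, s2], keys=["one", "two"])  # keys -> hierarchical index

# discard a row index that carries no relevant data
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})
fresh = pd.concat([df1, df2], ignore_index=True)
```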
Reshaping and Pivoting
● There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.
● Reshaping with Hierarchical Indexing
➢ Hierarchical indexing provides a consistent way to rearrange data in a
DataFrame. There are two primary actions:
❖ stack: this “rotates” or pivots from the columns in the data to the rows
❖ unstack: this pivots from the rows into the columns
➢ Using the stack method on this data pivots the columns into the rows, producing
a Series:
➢ From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
➢ By default the innermost level is unstacked (same with stack). You can unstack a
different level by passing a level number or name:
➢ Unstacking might introduce missing data if all of the values in the level aren’t
found in each of the subgroups:
➢ Stacking filters out missing data by default, so the operation is more easily
invertible
➢ When you unstack in a DataFrame, the level unstacked becomes the lowest level
in the result:
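A sketch of stack and unstack on a small labeled frame (the state/number names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)

result = data.stack()                     # columns pivot into the rows -> Series
restored = result.unstack()               # innermost level pivots back to columns
by_state = result.unstack(level="state")  # unstack a different level by name
```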
➢ When calling stack, we can indicate the name of the axis to stack:
CH-9 Plotting and Visualization
➢ In IPython, an empty plot window will appear, but in Jupyter nothing will be
shown until we use a few more commands. plt.figure has a number of
options; notably, figsize will guarantee the figure has a certain size and
aspect ratio if saved to disk. You can’t make a plot with a blank figure. You
have to create one or more subplots using add_subplot:
➢ This means that the figure should be 2 × 2 (so up to four plots in total),
and we’re selecting the first of four subplots (numbered from 1).
➢ When you issue a plotting command like plt.plot([1.5, 3.5, -2, 1.6]),
matplotlib draws on the last figure and subplot used (creating one if
necessary), thus hiding the figure and subplot creation.
The 'k--' is a style option instructing matplotlib to plot a black dashed line
➢ Below, in the histogram:
➔ color='k': This parameter sets the color of the bars in the histogram. In
this case, 'k' stands for black. Matplotlib recognizes a variety of color
representations, and 'k' is a shorthand for black.
➔ alpha=0.3: This parameter controls the transparency of the bars in the
histogram. The alpha parameter takes values between 0 (completely
transparent) and 1 (completely opaque). In this case, alpha=0.3 means
that the bars will be somewhat transparent, allowing underlying elements
to show through.
In the scatter plot:
➢ wspace and hspace controls the percent of the figure width and figure
height, respectively, to use as spacing between subplots
You may notice that the axis labels overlap. matplotlib doesn’t check
whether the labels overlap, so in a case like this you would need to fix the
labels yourself by specifying explicit tick locations and tick labels.
ax.plot(x, y, 'g--')
➢ Line plots can additionally have markers to highlight the actual data
points. Since matplotlib creates a continuous line plot, interpolating
between points, it can occasionally be unclear where the points lie. The
marker can be part of the style string, which must have color followed by
marker type and line style.
➢ The drawstyle option controls how consecutive points are connected, e.g., drawstyle='steps-post' draws a step plot instead of linear interpolation:
● Ticks, Labels, and Legends
➢ The pyplot interface, designed for interactive use, consists of methods like
xlim, xticks, and xticklabels
➢ These control the plot range, tick locations, and tick labels, respectively.
They can be used in two ways:
❖ Called with no arguments returns the current parameter value (e.g.,
plt.xlim() returns the current x-axis plotting range)
❖ Called with parameters sets the parameter value (e.g., plt.xlim([0,
10]), sets the x-axis range to 0 to 10)
➢ Setting the title, axis labels, ticks, and ticklabels
The rotation option sets the x tick labels at a 30-degree rotation.
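A sketch of setting the title, ticks, and tick labels on an Axes (the data and label text are invented; the Agg backend renders off-screen so no window is needed):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(100).cumsum(), "k--")  # 'k--': black dashed line

ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xticklabels(["one", "two", "three", "four", "five"], rotation=30)
ax.set_title("My first matplotlib plot")
ax.set_xlabel("Stages")
```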
● Adding legends
➢ Legends are another critical element for identifying plot elements. In Matplotlib, a legend is a box containing a label or labels explaining the meaning of the elements in a plot. Legends are used to identify different data series or elements in a chart, making it easier for viewers to interpret the information presented.
Once you’ve done this, you can either call ax.legend() or plt.legend() to
automatically create a legend.
To exclude one or more elements from the legend, pass no label or
label='_nolegend_'.
➢ This means that all subsequent figures created with Matplotlib will have a
default size of 10x10 inches.
➢ You can also use a dictionary to specify multiple configuration options at
once:
➢ This code sets the default font family, weight, and size for text in
Matplotlib plots.
➢ Matplotlib allows for extensive customization, and you can find a
comprehensive list of configuration options in the matplotlibrc file.
Series Plotting:
● A Series is a one-dimensional labeled array in pandas.
● In the example, a Series s is created with random data and plotted using the
plot() method.
● By default, plot() generates a line plot where the x-axis is taken from the Series
index, and the y-axis is the Series values.
● You can disable using the index for the x-axis by passing use_index=False to the
plot() method.
● Options such as xticks, yticks, xlim, and ylim can be used to customize the
appearance of the plot.
DataFrame Plotting:
● A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous
tabular data structure with labeled axes (rows and columns).
● In the example, a DataFrame df is created with random data and plotted using
the plot() method.
● The plot() method for a DataFrame plots each column as a different line on the
same subplot, and it automatically generates a legend.
● The plot attribute of a DataFrame contains a family of methods for different plot
types. For example, df.plot() is equivalent to df.plot.line().
● Pandas plotting methods accept various optional parameters, and you can
customize the appearance of the plots.
● Options for adjusting x-axis and y-axis ticks, limits, and subplot placement are
mentioned.
● Pandas plots are built on top of Matplotlib, and you can provide a Matplotlib
subplot object to the ax parameter for more flexible placement of subplots in a
grid layout.
Bar Plots
● The plot.bar() and plot.barh() make vertical and horizontal bar plots, respectively
● We create stacked bar plots from a DataFrame by passing stacked=True,
resulting in the value in each row being stacked together
Output of above:
And the corresponding row sums:
Normalizing so that each row sums to 1: divide each element by its row sum (e.g., 16/18).
The black lines drawn on the bars represent the 95% confidence interval (this can
be configured through optional arguments).
➢ seaborn.barplot has a hue option that enables us to split by an additional
categorical value.
➢ You can switch between different plot appearances using seaborn.set:
GroupBy Mechanics
● Example: to compute the mean of the data1 column using the labels from key1.
● Series now has a hierarchical index consisting of the unique pairs of keys
observed:
● You may have noticed in the first case df.groupby('key1').mean() that there is no
key2 column in the result. Because df['key2'] is not numeric data, it is said to be a
nuisance column, which is therefore excluded from the result.
● size() returns a Series giving the number of rows in each group.
● Aggregations refer to any data transformation that produces scalar values from
arrays. The preceding examples have used several of them, including mean,
count, min, and sum.
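A sketch of these groupby operations (the data1 values are made deterministic for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# mean of data1 using the labels from key1
means = df.groupby("key1")["data1"].mean()

# two keys -> hierarchical index of the unique key pairs observed
pair_means = df.groupby(["key1", "key2"])["data1"].mean()

sizes = df.groupby("key1").size()  # number of rows in each group
```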
Column-Wise and Multiple Function Application
Returning Aggregated Data Without Row Indexes
● In all of the examples up until now, the aggregated data comes back with an
index, potentially hierarchical, composed from the unique group key
combinations. Since this isn’t always desirable, you can disable this behavior in
most cases by passing as_index=False to groupby:
Apply: General split-apply-combine
Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for:
● In the preceding examples, you see that the resulting object has a hierarchical
index formed from the group keys along with the indexes of each piece of the
original object. You can disable this by passing group_keys=False to groupby:
Weighted average for the first category is calculated as
● (2×0.2 + 3×0.8) / (0.2 + 0.8) = 2.8
Weighted average for category 'Y' is calculated as
● (1×0.5 + 4×0.5) / (0.5 + 0.5) = 2.5
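The arithmetic above can be reproduced with groupby and apply (the column and category names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["X", "X", "Y", "Y"],
    "data":     [2, 3, 1, 4],
    "weights":  [0.2, 0.8, 0.5, 0.5],
})

# per-group weighted average of 'data' using 'weights'
wavg = df.groupby("category").apply(
    lambda g: np.average(g["data"], weights=g["weights"])
)
```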
● create a function that computes the pairwise correlation of each column with the
'SPX' column:
Here, the All values are means without taking into account smoker versus
nonsmoker (the All columns) or any of the two levels of grouping on the rows
(the All row).
Cross-Tabulations: Crosstab
Ch-11 Time Series
Time series data is an important form of structured data in many different fields, such
as finance, economics, ecology, neuroscience, and physics. Anything that is observed or
measured at many points in time forms a time series. Many time series are fixed
frequency, which is to say that data points occur at regular intervals according to some
rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units.
How you mark and refer to time series data depends on the application, and you may have one of the following:
● Timestamps: specific instants in time
● Fixed periods: such as the month January 2017 or the full year 2020
● Intervals of time: indicated by a start and end timestamp
● Experiment or elapsed time: each timestamp is a measure of time relative to a particular start time
● The datetime, time, and calendar modules are the main places to start. The
datetime.datetime type, or simply datetime, is widely used:
● datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects:
● You can add (or subtract) a timedelta or multiple thereof to a datetime object to
yield a new shifted object:
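A sketch of datetime and timedelta arithmetic (the dates follow the familiar examples):

```python
from datetime import datetime, timedelta

# difference between two datetimes is a timedelta
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
print(delta)  # days and seconds

# shifting a datetime by a timedelta, or a multiple of one
start = datetime(2011, 1, 7)
later = start + timedelta(12)
earlier = start - 2 * timedelta(12)
```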
● Converting Between String and Datetime
➢ You can format datetime objects as strings using str or the strftime method; datetime.strptime and pandas.to_datetime convert strings back to datetimes:
➢ For longer time series, a year or only a year and month can be passed to
easily select slices of data:
Here, the string '2001' is interpreted as a year and selects that time period. This also
works if you specify the month:
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
➢ This means that no data is copied and modifications on the slice will be reflected in the original data.
➢ There is an equivalent instance method, truncate, that slices a Series between
two dates:
● Time Series with Duplicate Indices
➢ Suppose you wanted to aggregate the data having non-unique timestamps. One
way to do this is to use groupby and pass level=0
● pandas provides tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
➢ Some frequencies describe points in time that are not evenly spaced. For
example, 'M' (calendar month end) and 'BM' (last business/weekday of month)
depend on the number of days in a month and, in the latter case, whether the
month ends on a weekend or not. We refer to these as anchored offsets.
➢ Week of month dates
One useful frequency class is “week of month,” starting with WOM. This enables
you to get dates like the third Friday of each month:
➢ Because naive shifts leave the index unmodified, some data is discarded. Thus if
the frequency is known, it can be passed to shift to advance the timestamps
instead of simply the data:
➢ Other frequencies can be passed, too, giving you some flexibility in how to lead
and lag the data:
Here T stands for minutes
The pandas date offsets can also be used with datetime or Timestamp objects:
If you add an anchored offset like MonthEnd, the first increment will “roll forward”
a date to the next date according to the frequency rule:
Anchored offsets can explicitly “roll” dates forward or backward by simply using
their rollforward and rollback methods, respectively:
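A sketch of date offsets, anchored rolling, and rollforward/rollback (the date is illustrative):

```python
from datetime import datetime
from pandas.tseries.offsets import Day, MonthEnd

now = datetime(2011, 11, 17)
print(now + 3 * Day())   # plain offset: three days later
print(now + MonthEnd())  # anchored offset: rolls forward to month end

# anchored offsets can also roll a date explicitly in either direction
offset = MonthEnd()
forward = offset.rollforward(now)  # 2011-11-30
back = offset.rollback(now)        # 2011-10-31
```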
A creative use of date offsets is to use these methods with groupby:
● Working with time zones is generally considered one of the most unpleasant
parts of time series manipulation. As a result, many time series users choose to
work with time series in coordinated universal time or UTC, which is the
successor to Greenwich Mean Time and is the current international standard.
Time zones are expressed as offsets from UTC; for example, New York is four
hours behind UTC during daylight saving time and five hours behind the rest of
the year.
● In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments. In the United States, the DST transition times have been changed many times since 1900!
The PeriodIndex class stores a sequence of periods and can serve as an axis
index in any pandas data structure:
If you have an array of strings, you can also use the PeriodIndex class:
➢ When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod “belongs.” For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:
● Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.
➢ The resulting time series is labeled by the timestamps from the left side of
each bin. By passing label='right' you can label them with the right bin
edge:
➢ Lastly, you might want to shift the result index by some amount, say subtracting
one second from the right edge to make it more clear which interval the
timestamp refers to. To do this, pass a string or date offset to loffset:
You also could have accomplished the effect of loffset by calling the shift
method on the result without the loffset.
In finance, a popular way to aggregate a time series is to compute four values for
each bucket: the first (open), last (close), maximum (high), and minimal (low)
values. By using the ohlc aggregate function you will obtain a DataFrame having
columns containing these four aggregates, which are efficiently computed in a
single sweep of the data:
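A sketch of ohlc on a minute-frequency series (the data is a simple ramp for illustration):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# open/high/low/close for each 5-minute bucket, computed in one sweep
bars = ts.resample("5min").ohlc()
```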
➢ When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:
If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the timespans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:
● An important class of array transformations used for time series operations is statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data. These are called moving window functions, even though the class includes functions without a fixed-length window, such as the exponentially weighted moving average. Like other statistical functions, these also automatically exclude missing data.
The expression rolling(250) is similar in behavior to groupby, but instead of grouping it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple’s stock price.
● By default rolling functions require all of the values in the window to be non-NA.
This behavior can be changed to account for missing data and, in particular, the
fact that you will have fewer than window periods of data at the beginning of the
time series
● In order to compute an expanding window mean, use the expanding operator
instead of rolling. The expanding mean starts the time window from the
beginning of the time series and increases the size of the window until it
encompasses the whole series
The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that you can pass to resample. For example, we could compute a 20-day rolling mean like so:
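A sketch of rolling, min_periods, expanding, and a time-offset window (the series values are a simple ramp):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=10)
ts = pd.Series(np.arange(10, dtype=float), index=idx)

roll = ts.rolling(3).mean()                    # NaN until 3 observations exist
partial = ts.rolling(3, min_periods=1).mean()  # allow fewer values at the start
expand = ts.expanding().mean()                 # window grows from the beginning
timed = ts.rolling("3D").mean()                # fixed-size time offset window
```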
CH-12 Advanced pandas
Categorical Data
● Categorical Methods
Creating dummy variables for modeling
When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0 otherwise.
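A sketch of one-hot encoding with pandas.get_dummies (the categories are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

# one column per distinct category; prefix names the dummy columns
dummies = pd.get_dummies(s, prefix="cat")
print(dummies)
```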
There is another built-in method called transform, which is similar to apply but imposes
more constraints on the kind of function you can use:
Assigning in-place may execute faster than using assign, but assign enables easier
method chaining:
One thing to keep in mind when doing method chaining is that you may need to refer to
temporary objects. In the preceding example, we cannot refer to the result of load_data
until it has been assigned to the temporary variable df.
To help with this, assign and many other pandas functions accept function-like
arguments, also known as callables. To show callables in action, consider a fragment of
the example from before:
Here, the result of load_data is not assigned to a variable, so the function passed into []
is then bound to the object at that stage of the method chain.
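A sketch of a callable-based chain; load_data here is a stand-in for the function mentioned in the notes:

```python
import pandas as pd

def load_data():
    # placeholder for the load_data() referenced in the notes
    return pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# the callables receive the intermediate object at that stage of the chain,
# so no temporary variable is needed for the filter or the new column
result = (load_data()
          [lambda df: df["x"] > 1]
          .assign(z=lambda df: df["x"] + df["y"]))
```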