NumPy enables efficient storage and vectorized computation on large numerical datasets in RAM by leveraging contiguous memory allocation and low-level C/Fortran libraries, drastically reducing memory footprint compared to native Python lists. Pandas, built on top of NumPy, introduces labelled, flexible tabular data manipulation, facilitating intuitive row and column operations, powerful indexing, and seamless handling of missing data through tools like alignment, reindexing, and imputation.
You are listening to Machine Learning Applied. In this episode, we're gonna talk about NumPy and Pandas in a little more depth than I got into in the MLG Languages and Frameworks episode. In Python, when you're doing machine learning or exploratory data analysis or anything with data, you're gonna be using NumPy.
NumPy, which stands for Numerical Python, is a very old library. It's been around for a very long time, and its goal is to store big arrays of numbers in RAM and to operate over those arrays really fast. So NumPy is just arrays of numbers, or arrays of arrays of numbers: vectors, or matrices, or tensors of any dimension.
Remember, tensor is the word for an n-dimensional array. And in fact, in NumPy we have a construct called ndarray, for n-dimensional array, which is synonymous with a tensor. But in NumPy they call it an ndarray, and that's the main construct you'll be using when working with NumPy. What an ndarray does is store a list of numbers, or a list of lists of numbers, and so on.
Those numbers can be integers, or booleans saved as zeros and ones, or very high precision floating point numbers like a float64. And of course you'll be using these tensors, these ndarrays, in your TensorFlow code or your Keras code or your data analysis. Everything in machine learning is just crunching over tensors of numbers, so that's what NumPy is.
It's for keeping these tensors of numbers on hand. Now, if the buck stopped there, why would we even need NumPy? We could just use Python lists of numbers, right? Open bracket, number, number, number, close bracket. Well, what NumPy does that's very unique is make storing and operating on these numbers extremely efficient: very low memory footprint, high speed operations.
And the way it does this is that NumPy arrays are aware of the data type that's inside of them. They could be float64s or integers, unlike your traditional Python lists, which are heterogeneous and can have anything inside of them. That awareness of what type of data occupies every cell of your ndarray allows NumPy to dictate the way those arrays are stored in RAM.
So it will use its knowledge of whether we're storing float64s or float32s to set aside specifically sized contiguous blocks in RAM. It's very good at allocating space in RAM to store these ndarrays most efficiently. And what you'll find is if you take a traditional Python n-dimensional list of numbers
over here and convert it to a NumPy ndarray over there, their memory footprints are drastically different. The ndarray's footprint will be substantially smaller than the Python list's, and this can be orders of magnitude. There was one time when I was running out of RAM on my 32 gigabyte PC for some data processing I was doing in Python, and when I switched it over to NumPy, it didn't even bother my system at all; I think it was taking something like six gigs of RAM to operate on the data at that point. So it can be an enormous savings in space, and in time too: NumPy will perform operations over these memory blocks all at once.
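To make the memory claim concrete, here's a rough sketch comparing footprints; the exact byte counts vary by Python version, but the shape of the result doesn't:

```python
import sys
import numpy as np

# A plain Python list of a million floats: every element is a full
# Python float object plus an 8-byte pointer held by the list itself.
py_list = [float(i) for i in range(1_000_000)]
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The same numbers as a NumPy float64 array: one contiguous block,
# exactly 8 bytes per element.
arr = np.arange(1_000_000, dtype=np.float64)
arr_bytes = arr.nbytes  # 8_000_000 bytes, nothing more

print(list_bytes, arr_bytes)  # the list comes out several times larger
```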
It's an amazing feature called vectorized operations. To give you an example, if you wanted to square every number in your Python list, you'd have to do a for loop over every outer list and a for loop over every inner list and so on. You'd have to unpack your list and square every one of those cells individually, and that could take a long time.
In NumPy, you would literally just type your NumPy array arr ** 2 (star star means raised to the power of in Python), and NumPy converts that operation to a NumPy function. It's an operator overload, and it'll snap its fingers and square all the elements of your array in the blink of an eye. In Big O terms it's still linear work, but the iteration doesn't happen element by element in slow Python; it happens all at once in fast compiled code. These are called vectorized operations. And what's interesting is that NumPy doesn't typically execute these operations in Python.
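A minimal sketch of that vectorized squaring; the array contents here are arbitrary:

```python
import numpy as np

nested = [[1, -2, 3], [4, 5, -6]]  # a plain Python list of lists

# The pure-Python way: nested loops, squaring each cell individually.
squared_slow = [[x ** 2 for x in row] for row in nested]

# The NumPy way: ** is overloaded on ndarray, so one expression squares
# every element in compiled code, with no Python-level loop.
arr = np.array(nested)
squared_fast = arr ** 2

print(squared_fast)
# [[ 1  4  9]
#  [16 25 36]]
```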
It will send an operation down to a lower level library written in C or Fortran. These are very old libraries that have been around since the big scientific projects, back when everybody was using C or Fortran: super highly optimized scientific numerical calculating systems. NumPy is kind of an intermediary between Python and those old systems. It'll send the operation down to one of these low level libraries, and they'll do the calculation for you extremely fast. So NumPy is all about efficient storage in space and fast computation in time. Now, this computation thing gets really cool not just when you're performing mathematical operations on matrices, like dot products and squaring things and L2 norms and whatever have you.
But when you're combining matrices in very specific ways. For example, you can mask a matrix based on a boolean condition. You can say: for every cell in this ndarray, if its value is less than zero, make it zero. In other words, every value becomes the maximum of zero and that value. You're cutting off negative values, preventing anything negative. This might look like something you'd do when building a ReLU, a rectified linear unit activation function, for example. So you can make an ndarray mask, which is this boolean condition, plus a value to set all the matching elements to.
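A sketch of that masking idea on a made-up array:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Boolean mask: True wherever the value is negative.
mask = x < 0

# Option 1: use the mask to overwrite just the matching elements.
relu_masked = x.copy()
relu_masked[mask] = 0.0

# Option 2: np.maximum, the usual one-liner for a ReLU.
relu = np.maximum(x, 0.0)

print(relu)  # all negatives clipped to zero, positives untouched
```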
So NumPy is real slick. If you haven't already been playing with it in your machine learning career, you will be before too long. Now, Pandas is written on top of NumPy, and Pandas is real slick too. Let's talk about what it adds to NumPy. First off, any numerical storage or computation in Pandas is performed via NumPy.
So if you're doing numerical stuff, Pandas will defer it down to NumPy, and NumPy will pop it over to Fortran or C. Pandas also lets you add string columns, which you know you're gonna have in your CSVs or your databases. And it'll allow you to do some real interesting stuff.
First off, it allows you to operate in any direction. Let's say you're working with a 2D matrix, like a typical spreadsheet: you can operate either row wise or column wise. What I mean by that is, if you're used to a traditional SQL database like Postgres or MySQL, you're used to manipulating data row wise: fetching all rows where A is less than one, B equals false, and C equals "hello". Now, there's a thing out there called columnar databases, databases that operate in the other direction: I want all the values of column B, for example, stacked in an array, filtered on conditions over A, B, and C.
So instead of fetching an array of rows, you're fetching an array of values for a column, or for multiple columns. It takes a little while to get used to, but once you start manipulating data through Pandas, you'll start to see the value of columnar data processing. Now, typically a traditional SQL database is row oriented and a columnar database is the opposite, but Pandas lets you do both. And once you get used to it, you'll find yourself just flying through data: what in Pandas is a one-liner would've been 10 to a hundred lines of Python code for you before. Before you get used to doing things column wise and row wise in Pandas fashion, you won't think you're doing hard work manipulating data; that's just the way to do it. Then you switch over to doing things in Pandas and, holy cow, you can do it all in one line of code. Another really powerful aspect of Pandas is a concept called indexing.
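The row-wise versus column-wise distinction can be sketched on a made-up frame (column names and values are invented for illustration):

```python
import pandas as pd

# A tiny fabricated frame: rows are observations, columns are fields.
df = pd.DataFrame({
    "a": [0.5, 2.0, 0.1],
    "b": [False, True, False],
    "c": ["hello", "world", "hello"],
})

# Row-wise, SQL-style: all rows WHERE a < 1 AND b = false AND c = 'hello'.
rows = df[(df["a"] < 1) & (~df["b"]) & (df["c"] == "hello")]

# Column-wise: treat column "a" as one stacked array and reduce over it.
col_mean = df["a"].mean()

print(len(rows), col_mean)
```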
You can add an index to all your rows in a Pandas data frame. You're like, well, okay, you can have an index on all your rows in a Postgres table too. Yes, but Pandas takes it a step up. So let's dive into a stock trading example to showcase how powerful this indexing feature is. Let's say you're going to build a stock trading machine learning model, maybe a recurrent neural network, that's going to forecast tomorrow's price and decide whether to buy or sell a specific stock.
Now, in this case we're only using Apple stock and Google stock. So we go to some service and pull down a spreadsheet of Apple's prices for the last so many days, maybe one year. Then we go to the same service and pull down a spreadsheet, a CSV, of Google's prices, and when you open it up, the days don't match.
The Apple spreadsheet has maybe one year's worth of prices per day, and the Google spreadsheet has a year and a half, so they don't match up exactly. What you'll do is import these into Pandas in your Python code using a function called pandas.read_csv. So automatically you can pull a CSV spreadsheet into Pandas and get a data frame with one line of code; that's real nice. You'll get a data frame for your Apple stock and a data frame for your Google stock. You'll see that the days don't align perfectly, and what you'll do then is make the dates in both of these data frames your index. The days will be the index, because you'll see in Pandas, it's these indexes that let you join things together, just like a standard SQL join.
And then you'll just combine these data frames together, and you'll see that Pandas automatically created two columns, one called apple and one called google, under a single index of all the days. It automatically aligned all of your prices, Apple on the left and Google on the right, to the right days. What you get is a single data frame with two columns, apple and google. The apple column will have a whole bunch of blank entries at the beginning, because Google simply had more data in the spreadsheets we downloaded. Pandas added the google column, saw that Apple didn't have those days, and just inserted blanks.
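A hedged sketch of that alignment; the dates and prices here are made up, and in real code the frames would come out of pandas.read_csv with the date column set as the index:

```python
import pandas as pd

# Google has a longer history than Apple (all values fabricated).
google = pd.DataFrame(
    {"google": [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)
apple = pd.DataFrame(
    {"apple": [100.0, 101.0]},
    index=pd.to_datetime(["2020-01-03", "2020-01-04"]),
)

# Joining on the date index aligns the rows automatically;
# the days Apple lacks become NaN in the apple column.
combined = google.join(apple)

print(combined["apple"].isna().sum())  # 2 missing days for Apple
```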
And that's very useful, because you can figure out where your blanks are and decide what to do about them. Pandas merged these two data frames together, aligned them on the date index, and filled in the empties with nulls. Additionally, you might find holes throughout your data frame past that initial chunk of emptiness in the apple column. Let's say Apple didn't trade on some specific day that Google did, maybe six months ago. Well, Pandas will fill that empty Apple slot with a null. So again, it date-aligned all of your stocks together. Now, what are we gonna do about those nulls, those empty holes? That's another thing Pandas brings to bear: it gives you tools for filling the holes.
Some very sophisticated tooling, in effect, with all sorts of options. Let's say there's a missing day sandwiched between two days that have a price, so there's only one missing day. Well, you can take the day before and the day after the missing day and average them: take the mean and plop it into the empty cell. That's one option. You can forward fill, just taking yesterday's price and plopping it into today. You can backfill, taking tomorrow's price and putting it in today. Pandas gives you all sorts of options for handling missing values. Now, we start to build our model and see that it's having some problems, so we go back to our Pandas data frame, the joined frame of the Google and Apple stocks, and ask: what's missing here?
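Those fill strategies might look like this, sketched on a made-up series with a single hole:

```python
import numpy as np
import pandas as pd

# A fabricated price series with one missing day in the middle.
prices = pd.Series([10.0, np.nan, 14.0])

filled_forward = prices.ffill()        # the hole becomes yesterday's 10.0
filled_backward = prices.bfill()       # the hole becomes tomorrow's 14.0
filled_mean = prices.interpolate()     # the hole becomes the average, 12.0

print(filled_mean.tolist())  # [10.0, 12.0, 14.0]
```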
Look, I've aligned all the days together and I've filled in all the missing values. That's called imputing, by the way: impute, I-M-P-U-T-E. I've imputed the missing values. So what's still missing? And you'll notice that lots of days are absent from both Apple and Google: the dates themselves are missing. And you'll be like, well, crap.
How am I gonna determine what days are valid days to trade on, but on which neither of these stocks traded? Pandas offers another tool called re-indexing. You can tell this data frame to re-index onto all valid dates, or all valid trading dates, or all valid whatever dates. These are all tools inside of Pandas: it has functions and constants that represent every day of the year, every business day of the year, every holiday, and so on. And you can, as they call it, re-index your data frame onto those dates.
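A hedged sketch of that re-indexing, using pandas.bdate_range to generate a business-day (Monday through Friday) index; the price series is made up:

```python
import pandas as pd

# A fabricated series missing one business day (Jan 7, 2020, a Tuesday).
prices = pd.DataFrame(
    {"apple": [100.0, 101.0, 103.0]},
    index=pd.to_datetime(["2020-01-06", "2020-01-08", "2020-01-09"]),
)

# Every business day in the window: Mon Jan 6 through Fri Jan 10.
bdays = pd.bdate_range("2020-01-06", "2020-01-10")

# Re-index: dates not in bdays would be thrown away, and business days
# we lacked appear as NaN, ready for imputing.
aligned = prices.reindex(bdays)

print(aligned["apple"].tolist())  # [100.0, nan, 101.0, 103.0, nan]
```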
And what that'll do is throw away anything that doesn't match and create empties for the dates it didn't have before. Then, again, you can do your imputing: forward fill or backfill or whatever. Super, super powerful. So NumPy is the underlying, fundamental library for storing your data in RAM.
But if you really want to do SQL-style manipulation of that NumPy data, you'll take it up a level, wrapping your NumPy array in a Pandas data frame and working with Pandas. Then we go up the stack to TensorFlow or Keras, and both of those libraries use NumPy under the hood as well. They don't really need to do this slicing-and-dicing, SQL-style stuff, so they don't wrap your data in Pandas; they just handle everything in NumPy. But TensorFlow and Keras are both NumPy and Pandas aware, so you can pass in your Pandas data frame if that's your fancy, and they'll convert it back to the underlying NumPy data and work with it from there.
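As a small sketch of that handoff, a data frame can be dropped back down to its NumPy representation; the column names here are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"apple": [100.0, 101.0], "google": [10.0, 11.0]})

# DataFrame.to_numpy() exposes the underlying ndarray, which is the
# kind of object libraries like TensorFlow and Keras ultimately crunch on.
arr = df.to_numpy()

print(arr.shape, arr.dtype)  # (2, 2) float64
```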
That's NumPy and Pandas. Next up, we're gonna talk about how we store these NumPy arrays on disk, whether it's an HDF5 file, a pickle file, Postgres, what have you. See you then.