#!/usr/bin/env python # coding: utf-8 # # How do I merge DataFrames in pandas? ([video](https://www.youtube.com/watch?v=iYWKfUOtGaw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=32)) # # - [My pandas video series](https://www.dataschool.io/easier-data-analysis-with-pandas/) # - [GitHub repository](https://github.com/justmarkham/pandas-videos) # ## Table of contents # # 1. Selecting a Function # 2. Joining (Merging) DataFrames # 3. What if...? # 4. Four Types of Joins # # Part 1: Selecting a Function # Taken from [Merging DataFrames with pandas](https://www.datacamp.com/courses/merging-dataframes-with-pandas?tap_a=5644-dce66f&tap_s=280411-a25fc8) (DataCamp course): # # - `df1.append(df2)`: stacking vertically # - `pd.concat([df1, df2])`: # - stacking many horizontally or vertically # - simple inner/outer joins on Indexes # - `df1.join(df2)`: inner/outer/left/right joins on Indexes # - `pd.merge(df1, df2)`: many joins on multiple columns # # Part 2: Joining (Merging) DataFrames # Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames: # # - **movies**: shows information about movies, namely a unique **movie_id** and its **title** # - **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp** # In[1]: import pandas as pd # ## Movies # In[2]: movie_cols = ['movie_id', 'title'] movies = pd.read_table('data/u.item', sep='|', header=None, names=movie_cols, usecols=[0, 1]) movies.head() # In[3]: movies.shape # In[4]: movies.movie_id.nunique() # ## Ratings # In[5]: rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp'] ratings = pd.read_table('data/u.data', sep='\t', header=None, names=rating_cols) ratings.head() # In[6]: ratings.shape # In[7]: ratings.movie_id.nunique() # In[8]: ratings.loc[ratings.movie_id == 1, :].head() # ## Merging Movies and Ratings # Let's pretend that you want to examine the ratings DataFrame, but you want to know the **title** of each movie rather than its **movie_id**. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas `merge` function: # In[9]: movies.columns # In[10]: ratings.columns # In[11]: movie_ratings = pd.merge(movies, ratings) movie_ratings.columns # In[12]: movie_ratings.head() # In[13]: movie_ratings.shape # Here's what just happened: # # - Pandas noticed that movies and ratings had one column in common, namely **movie_id**. This is the "key" on which the DataFrames will be joined. # - The first **movie_id** in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the **user_id**, **rating**, and **timestamp** listed in that row. In this case, it found 452 matching rows. # - The second **movie_id** in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows. # - This process was repeated for all of the remaining rows in movies. # # At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (**movie_id** and **title**) and the three other colums from ratings (**user_id**, **rating**, and **timestamp**). # # - **movie_id** 1 and its **title** are listed 452 times, next to the **user_id**, **rating**, and **timestamp** for each of the 452 matching ratings. # - **movie_id** 2 and its **title** are listed 131 times, next to the **user_id**, **rating**, and **timestamp** for each of the 131 matching ratings. # - And so on, for every movie in the dataset. # In[14]: print(movies.shape) print(ratings.shape) print(movie_ratings.shape) # Notice the shapes of the three DataFrames: # # - There are 1682 rows in the movies DataFrame. # - There are 100000 rows in the ratings DataFrame. # - The `merge` function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies. # - The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common. # # By default, the `merge` function joins the DataFrames using all column names that are in common (**movie_id**, in this case). The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html) explains how you can override this behavior. # # Part 3: What if...? # ## What if the columns you want to join on don't have the same name? # In[15]: movies.columns = ['m_id', 'title'] movies.columns # In[16]: ratings.columns # In[17]: pd.merge(movies, ratings, left_on='m_id', right_on='movie_id').head() # ## What if you want to join on one index? # In[18]: movies = movies.set_index('m_id') movies.head() # In[19]: pd.merge(movies, ratings, left_index=True, right_on='movie_id').head() # ## What if you want to join on two indexes? # In[20]: ratings = ratings.set_index('movie_id') ratings.head() # In[21]: pd.merge(movies, ratings, left_index=True, right_index=True).head() # # Part 4: Four Types of Joins # There are actually four types of joins supported by the Pandas `merge` function. Here's how they are described by the documentation: # # - **inner:** use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys # - **outer:** use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically # - **left:** use only keys from left frame, similar to a SQL left outer join; preserve key order # - **right:** use only keys from right frame, similar to a SQL right outer join; preserve key order # # The default is the "inner join", which was used when creating the movie_ratings DataFrame. # # It's easiest to understand the different types by looking at some simple examples: # ## Example DataFrames A and B # In[22]: A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]}) A # In[23]: B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']}) B # ## Inner join # # Only include observations found in both A and B: # In[24]: pd.merge(A, B, how='inner') # ## Outer join # # Include observations found in either A or B: # In[25]: pd.merge(A, B, how='outer') # ## Left join # # Include all observations found in A: # In[26]: pd.merge(A, B, how='left') # ## Right join # # Include all observations found in B: # In[27]: pd.merge(A, B, how='right')