#!/usr/bin/env python
# coding: utf-8

# # How do I merge DataFrames in pandas? ([video](https://www.youtube.com/watch?v=iYWKfUOtGaw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=32))
# 
# - [My pandas video series](https://www.dataschool.io/easier-data-analysis-with-pandas/)
# - [GitHub repository](https://github.com/justmarkham/pandas-videos)

# ## Table of contents
# 
# 1. Selecting a Function
# 2. Joining (Merging) DataFrames
# 3. What if...?
# 4. Four Types of Joins

# # Part 1: Selecting a Function

# Taken from [Merging DataFrames with pandas](https://www.datacamp.com/courses/merging-dataframes-with-pandas?tap_a=5644-dce66f&tap_s=280411-a25fc8) (DataCamp course):
# 
# - `df1.append(df2)`: stacking vertically
# - `pd.concat([df1, df2])`:
#   - stacking many horizontally or vertically
#   - simple inner/outer joins on Indexes
# - `df1.join(df2)`: inner/outer/left/right joins on Indexes
# - `pd.merge(df1, df2)`: many joins on multiple columns

# # Part 2: Joining (Merging) DataFrames

# Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames:
# 
# - **movies**: shows information about movies, namely a unique **movie_id** and its **title**
# - **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp**

# In[1]:


import pandas as pd


# ## Movies

# In[2]:


movie_cols = ['movie_id', 'title']
movies = pd.read_table('data/u.item', sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()


# In[3]:


movies.shape


# In[4]:


movies.movie_id.nunique()


# ## Ratings

# In[5]:


rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/u.data', sep='\t', header=None, names=rating_cols)
ratings.head()


# In[6]:


ratings.shape


# In[7]:


ratings.movie_id.nunique()


# In[8]:


ratings.loc[ratings.movie_id == 1, :].head()


# ## Merging Movies and Ratings

# Let's pretend that you want to examine the ratings DataFrame, but you want to know the **title** of each movie rather than its **movie_id**. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas `merge` function:

# In[9]:


movies.columns


# In[10]:


ratings.columns


# In[11]:


movie_ratings = pd.merge(movies, ratings)
movie_ratings.columns


# In[12]:


movie_ratings.head()


# In[13]:


movie_ratings.shape


# Here's what just happened:
# 
# - Pandas noticed that movies and ratings had one column in common, namely **movie_id**. This is the "key" on which the DataFrames will be joined.
# - The first **movie_id** in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the **user_id**, **rating**, and **timestamp** listed in that row. In this case, it found 452 matching rows.
# - The second **movie_id** in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows.
# - This process was repeated for all of the remaining rows in movies.
# 
# At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (**movie_id** and **title**) and the three other colums from ratings (**user_id**, **rating**, and **timestamp**).
# 
# - **movie_id** 1 and its **title** are listed 452 times, next to the **user_id**, **rating**, and **timestamp** for each of the 452 matching ratings.
# - **movie_id** 2 and its **title** are listed 131 times, next to the **user_id**, **rating**, and **timestamp** for each of the 131 matching ratings.
# - And so on, for every movie in the dataset.

# In[14]:


print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)


# Notice the shapes of the three DataFrames:
# 
# - There are 1682 rows in the movies DataFrame.
# - There are 100000 rows in the ratings DataFrame.
# - The `merge` function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.
# - The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.
# 
# By default, the `merge` function joins the DataFrames using all column names that are in common (**movie_id**, in this case). The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html) explains how you can override this behavior.

# # Part 3: What if...?

# ## What if the columns you want to join on don't have the same name?

# In[15]:


movies.columns = ['m_id', 'title']
movies.columns


# In[16]:


ratings.columns


# In[17]:


pd.merge(movies, ratings, left_on='m_id', right_on='movie_id').head()


# ## What if you want to join on one index?

# In[18]:


movies = movies.set_index('m_id')
movies.head()


# In[19]:


pd.merge(movies, ratings, left_index=True, right_on='movie_id').head()


# ## What if you want to join on two indexes?

# In[20]:


ratings = ratings.set_index('movie_id')
ratings.head()


# In[21]:


pd.merge(movies, ratings, left_index=True, right_index=True).head()


# # Part 4: Four Types of Joins

# There are actually four types of joins supported by the Pandas `merge` function. Here's how they are described by the documentation:
# 
# - **inner:** use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
# - **outer:** use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
# - **left:** use only keys from left frame, similar to a SQL left outer join; preserve key order
# - **right:** use only keys from right frame, similar to a SQL right outer join; preserve key order
# 
# The default is the "inner join", which was used when creating the movie_ratings DataFrame.
# 
# It's easiest to understand the different types by looking at some simple examples:

# ## Example DataFrames A and B

# In[22]:


A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})
A


# In[23]:


B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})
B


# ## Inner join
# 
# Only include observations found in both A and B:

# In[24]:


pd.merge(A, B, how='inner')


# ## Outer join
# 
# Include observations found in either A or B:

# In[25]:


pd.merge(A, B, how='outer')


# ## Left join
# 
# Include all observations found in A:

# In[26]:


pd.merge(A, B, how='left')


# ## Right join
# 
# Include all observations found in B:

# In[27]:


pd.merge(A, B, how='right')