Imagine being hungry in an unfamiliar part of town and getting restaurant recommendations served up, based on your personal preferences, at just the right moment. The recommendation comes with an attached discount from your credit card provider for a local place around the corner!
C company has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key.
C has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in.
In this project, I will develop algorithms to identify and serve the most relevant opportunities to individuals by uncovering signal in customer loyalty. This will improve customers' lives and help C reduce unwanted campaigns, creating the right experience for customers.
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format; contains all `card_id`s you are expected to predict for
- historical_transactions.csv - up to 3 months' worth of historical transactions for each `card_id`
- merchants.csv - additional information about all merchants / `merchant_id`s in the dataset
- new_merchant_transactions.csv - two months' worth of data for each `card_id`, containing ALL purchases that `card_id` made at `merchant_id`s that were not visited in the historical data
Root Mean Squared Error (RMSE)
Submissions are scored on the root mean squared error. RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ is the predicted loyalty score for each `card_id`, and $y_i$ is the actual loyalty score assigned to a `card_id`.
- Partition the fields into discrete and continuous.
- Label-encode the discrete features.
- Fill missing values with -1 or the column mean.
- Replace infinite values with the maximum value of the column.
- Drop duplicates.
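A minimal sketch of these preprocessing steps, assuming a pandas DataFrame `df` and hypothetical `discrete_cols` / `continuous_cols` name lists:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

def preprocess(df, discrete_cols, continuous_cols):
    for col in discrete_cols:
        # Fill missing discrete values with -1, then label-encode
        df[col] = df[col].fillna(-1)
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    for col in continuous_cols:
        # Replace infinite values with the column's finite maximum
        finite_max = df.loc[np.isfinite(df[col]), col].max()
        df[col] = df[col].replace([np.inf, -np.inf], finite_max)
        # Fill missing continuous values with the column mean
        df[col] = df[col].fillna(df[col].mean())
    # Drop duplicate rows
    return df.drop_duplicates()
```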
This project contains many tables, such as the transactions and merchants tables. To unify all the data, it's important to merge the different tables on the unique `card_id`.
In this project, I'll apply two methods to integrate data.
The original table:

card_id | A | B | C | D |
---|---|---|---|---|
-- | Categorical | Categorical | Numeric | Numeric |
1 | 1 | 2 | 4 | 7 |
2 | 2 | 1 | 5 | 5 |
1 | 1 | 2 | 1 | 4 |
3 | 2 | 2 | 5 | 8 |
The new table with cross features generated from each categorical value and numeric column:

card_id | A1 & C | A2 & C | A1 & D | A2 & D | B1 & C | B2 & C | B1 & D | B2 & D |
---|---|---|---|---|---|---|---|---|
1 | 5 | 0 | 11 | 0 | 0 | 5 | 0 | 11 |
2 | 0 | 5 | 0 | 5 | 5 | 0 | 5 | 0 |
3 | 0 | 5 | 0 | 8 | 0 | 5 | 0 | 8 |
This method creates an enormous number of new columns, most of them filled with zeros.
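A minimal sketch of this cross-feature construction on the toy table above:

```python
import pandas as pd

# The original toy table
df = pd.DataFrame({
    'card_id': [1, 2, 1, 3],
    'A': [1, 2, 1, 2],
    'B': [2, 1, 2, 2],
    'C': [4, 5, 1, 5],
    'D': [7, 5, 4, 8],
})

parts = []
for cat in ['A', 'B']:        # categorical columns
    for num in ['C', 'D']:    # numeric columns
        # Sum the numeric column per card_id, split by categorical value
        p = df.pivot_table(index='card_id', columns=cat, values=num,
                           aggfunc='sum', fill_value=0)
        p.columns = [f'{cat}{v} & {num}' for v in p.columns]
        parts.append(p)

cross = pd.concat(parts, axis=1).reset_index()
```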
The new table with aggregated statistics generated per `card_id` (shown for column A):

card_id | A_sum | A_mean | A_max | A_var | A_unique |
---|---|---|---|---|---|
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 2 | 2 | 0 | 1 |
3 | 2 | 2 | 2 | 0 | 1 |
This method produces far fewer new columns and zero values.
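A minimal sketch of the aggregation method, reusing the toy `df` from the previous sketch and assuming a `train` table with one row per `card_id`:

```python
# Aggregate statistics per card_id, shown here for column A
agg = df.groupby('card_id')['A'].agg(
    A_sum='sum', A_mean='mean', A_max='max', A_var='var', A_unique='nunique',
).reset_index()
# var is NaN for cards with a single transaction; fill with 0
agg['A_var'] = agg['A_var'].fillna(0)

# Merge the aggregated features onto the training table by card_id
train = train.merge(agg, on='card_id', how='left')
```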
During this project, I've implemented several approaches, from basic to advanced, to improve the metric.
The baseline RMSE on the test set is 3.65455:
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
Select the top 300 features by correlation with the target for training (a sketch follows the table below):
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
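A minimal sketch of the correlation-based selection, assuming `train` is a numeric feature table containing the `target` column:

```python
# Absolute Pearson correlation of every feature with the target
corr = train.corr()['target'].abs().drop('target')

# Keep the 300 most correlated features for training
top300 = corr.sort_values(ascending=False).head(300).index.tolist()
X, y = train[top300], train['target']
```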
I found many columns related to IDs, such as `merchant_id`, `merchant_category_id`, `state_id`, `subsector_id`, and `city_id`; these values reflect customer behavior. For example, if a certain merchant id appears frequently in customer A's transaction records, it means A likes this merchant. Furthermore, if this merchant appears in many customers' records, the merchant is popular, and customer A is similar to other customers; if not, A has a distinctive preference.
To mine this information, I will apply CountVectorizer and TF-IDF. Specifically, CountVectorizer can extract the merchant profile of a single customer, and TF-IDF can capture whether many customers like the same merchant at the same time.
Applying NLP approaches raises an issue we should consider: they create too many new features, and most of them are sparse. So I'll use the associated `sparse` module from `scipy`.
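A minimal sketch of this idea, assuming a hypothetical `transactions` DataFrame with `card_id` and `merchant_id` columns:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Treat each card's transaction history as a "document" of merchant ids
docs = (transactions.groupby('card_id')['merchant_id']
        .agg(lambda ids: ' '.join(map(str, ids))))

count_vec = CountVectorizer()    # raw per-customer merchant counts
tfidf_vec = TfidfVectorizer()    # down-weights merchants popular everywhere
X_count = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

# Stack the blocks as scipy sparse matrices instead of dense arrays
X_text = sparse.hstack([X_count, X_tfidf]).tocsr()
```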
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
- Voting with average
Average the predictions of the three models (LightGBM, RF, XGBoost).
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
Average Voting | 3.6365 |
A simple average alone does not yield better results.
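A minimal sketch of the simple average, plus the weighted variant used in the next experiment, where `pred_lgb`, `pred_rf`, and `pred_xgb` are hypothetical test-set prediction arrays:

```python
import numpy as np

# Simple average of the three models' predictions
pred_avg = (pred_lgb + pred_rf + pred_xgb) / 3

# Weighted average: stronger single models get larger weights (summing to 1)
weights = np.array([0.25, 0.15, 0.60])   # illustrative values, tuned on validation
pred_weighted = np.average(
    np.vstack([pred_lgb, pred_rf, pred_xgb]), axis=0, weights=weights)
```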
- Voting with weighted average
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
Average Voting | 3.6365 |
Weighted Voting | 3.633307 |
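- Stacking

Feed the base models' predictions into a meta-learner. A minimal sketch with a linear meta-model, where `oof_lgb`, `oof_rf`, and `oof_xgb` are hypothetical out-of-fold predictions on the training set and `y_train` holds the targets:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Base-model predictions become the meta-features
meta_train = np.column_stack([oof_lgb, oof_rf, oof_xgb])
meta_test = np.column_stack([pred_lgb, pred_rf, pred_xgb])

# A simple linear meta-learner combines the base models
stacker = Ridge(alpha=1.0)
stacker.fit(meta_train, y_train)
pred_stack = stacker.predict(meta_test)
```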
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
Average Voting | 3.6365 |
Weighted Voting | 3.633307 |
Stacking | 3.62798 |
Generally speaking, there are two approaches to training a better model. The first is to improve the data quality through feature engineering; the other is to improve the model itself. Of the two, feature engineering sets the upper limit of model accuracy, and model building only approaches that limit. Therefore, during optimization we need both methods to get a better result.
I've already completed several feature engineering tasks, e.g. aggregating data by `card_id`, generating more features, selecting features, and extracting information with NLP. There is still room for improvement in feature engineering.
During the previous process, I ignored an important class of customer behavior features: time features. We could create new features from this information, such as the time difference between the first and the latest trade, or the average time interval between a customer's transactions.

In a word, customer behavior features are important for extracting information and making better predictions efficiently. Additionally, the closer a behavior is to the current point in time, the more valuable it is, so I focus on the customer behavior of the latest two months.
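A minimal sketch of such time features, assuming the historical transactions table has `card_id` and `purchase_date` columns:

```python
import pandas as pd

hist = historical_transactions.copy()
hist['purchase_date'] = pd.to_datetime(hist['purchase_date'])

g = hist.groupby('card_id')['purchase_date']
time_feats = pd.DataFrame({
    # Days between the first and the latest trade
    'active_span_days': (g.max() - g.min()).dt.days,
    # Average interval between consecutive transactions, in days
    'avg_interval_days': g.apply(
        lambda s: s.sort_values().diff().dt.days.mean()),
}).reset_index()
```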
I've already generated new features through cross derivation, such as combining two features. This time I'll apply higher-order crosses to generate new features. Of course, higher orders make the matrix even sparser, so I'll only multiply features related to customer behavior.
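A minimal sketch of the higher-order crosses, where the behavior column names are hypothetical aggregates built earlier:

```python
# Second-order crosses between behavior-related features
behavior_cols = ['purchase_amount_sum', 'merchant_id_nunique', 'month_lag_mean']
for i, a in enumerate(behavior_cols):
    for b in behavior_cols[i + 1:]:
        train[f'{a}_x_{b}'] = train[a] * train[b]
```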
Earlier I treated values below -30 as abnormal and removed them. However, according to the data description, these abnormal values were deliberately set by the company and represent a certain type of customer. Therefore I will no longer remove them, but treat them as their own group of customers.
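A minimal sketch of flagging this group instead of dropping it:

```python
# Flag the manipulated targets (below -30) as their own customer group
train['outlier'] = (train['target'] < -30).astype(int)
print(train['outlier'].value_counts())
```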
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
Average Voting | 3.6365 |
Weighted Voting | 3.633307 |
Stacking | 3.62798 |
LightGBM (new features) | 3.61177 |
XGBoost (new features) | 3.61048 |
CatBoost (new features) | 3.61154 |
Approach | RMSE |
---|---|
Baseline | 3.65455 |
RF + CV | 3.65173 |
LightGBM | 3.69732 |
LightGBM + CV | 3.64403 |
XGBoost | 3.62832 |
Average Voting | 3.6365 |
Weighted Voting | 3.633307 |
Stacking | 3.62798 |
LightGBM (new features) | 3.61177 |
XGBoost (new features) | 3.61048 |
CatBoost (new features) | 3.61154 |
Voting (new features) | 3.60640 |
Stacking (new features) | 3.60683 |