Update README.md

shawlu95 · web-flow · commit 2138f2e822d5 · 2018-09-18T11:39:57.000-07:00
diff --git a/README.md b/README.md
@@ -1,11 +1,10 @@
 # Lookalike Model
 ## Introduction
-Facebook Lookalike Audience service requires three inputs: Source, Location and Audience Size. Source refers to source audience, which Facebook recommends using "1,000 to 50,000 of your best customers." Location restricts the lookalike audience to a specific geographic region. Audience Size specifies the desired size of lookalike audience to be generated, measured in millions or percentage of total population of selected location. It is the responsibility of Coupang to select the source audience, based on relevant metrics e.g. lifetime value, transaction value, total order size and engagement.
-
 The traditional segmentation approach applies hard filters to the population, such as age range, account monetary value, and purchase frequency, to create segments. Because all filters must be satisfied, only the intersection of all filters remains. This methodology has two disadvantages.
 
-On one hand, hard filters may introduce human bias. For example, when promoting a baby product, requiring seed audience to be all females will leave out male customers who tend to their babies on their wives' behalf, or female customers who did not declare their gender on Coupang. If such bias is feed into Facebook Lookalike Audience, lookalike audience will all be females too.
-On the other hand, hard filters reduce the population so fast that only a few hundred customers remain after applying just a few filters. For example, among the 12.9 million active customers in 2018, there are only 140 customers who registered account on Aug 21, placed 3 (or more) items in cart within 7 days, but had not made a single purchase by Aug 29. Such small segment size is not enough to paint a meaningful persona to generate lookalike audience.
+1. On one hand, hard filters may introduce human bias. For example, when promoting a baby product, requiring the seed audience to be all female will leave out male customers who tend to their babies on their wives' behalf, or female customers who did not declare their gender on Coupang. If such bias is fed into Facebook Lookalike Audience, the lookalike audience will be all female too.
+
+2. On the other hand, hard filters reduce the population so fast that only a few hundred customers remain after applying just a few filters. As the number of filters increases, the size of the segment shrinks exponentially. When a segment becomes too small, it represents an overly specialized set of customers who may not generalize to a meaningful persona for building a lookalike audience.
 
 <img src="fig/overview.jpg" width="700">
 
@@ -14,31 +13,35 @@ The objective of this project is to build a lookalike model that "softens" the f
 Still, human bias is often helpful. Since the initial segments must be defined by the business team, in accordance with a specific business goal, such bias often aligns (albeit imperfectly) with the business goal. The role of the lookalike model is to take the initial segments as input, inspect its underlying structure, assign weights to features, build a larger segment using unsupervised machine learning, and output the enlarged segment to be used as source audience for Facebook Lookalike Audience service. For example, knowing that most diaper buyers are women is a meaningful piece of information, but the lookalike model will override the human bias that "diaper buyers must be female" if some male customers were found to conform to most of other features of the initial segments.
 
 ## Workflow
-Input source audience: an initial segment as a list of member_srls. The initial segment is defined by some hard filters chosen by the business team. Filters are selected for a specific business goal (e.g. promoting a particular product line). Notice: the lookalike model does NOT take filters as input.
+Input source audience: an initial segment as a list of member_srls. The initial segment is defined by some hard filters chosen by the business team. Filters are selected for a specific business goal (e.g. promoting a particular product line). Notice: the lookalike model does __NOT__ take filters as input.
 
 1. Feature extraction: for each customer in the initial input segment, a full list of features is retrieved.
 2. Feature weighting: for each feature, Jensen-Shannon divergence is computed and assigned as weight. 
 3. Subset selection: the top 20 features with highest weight are used for the next step.
-4. Near Neighbor Ranking: compute average of the initial segment as centroid, rank its nearest neighbor by a distance metric 5. (Euclidean, cosine, Mahalanobis, etc.).
-Output lookalike audience: a set of lookalike audience larger than the initial input. The sized of lookalike audience can be arbitrarily set to n. The top n customers most similar to the initial segment are returned.
+4. Nearest Neighbor Ranking: compute the average of the initial segment as the centroid, then rank its nearest neighbors by a distance metric (Euclidean, cosine, Mahalanobis, etc.).
+5. Output lookalike audience: a lookalike audience larger than the initial input. The size of the lookalike audience can be set to an arbitrary n. The top n customers most similar to the initial segment are returned.
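
Steps 4 and 5 above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `rank_lookalikes`, the dict-based input, and the choice of weighted Euclidean distance are all assumptions made for the example. Whether source customers themselves appear in the output is also an assumption (here they are ranked like everyone else).

```python
import math

def rank_lookalikes(population, source_ids, weights, n):
    """Rank customers by weighted Euclidean distance to the source centroid.

    population: dict mapping customer id -> list of feature values (hypothetical layout)
    source_ids: ids of the initial segment
    weights:    one non-negative weight per feature
    n:          desired lookalike audience size
    """
    dims = len(weights)
    source_rows = [population[cid] for cid in source_ids]
    # Step 4a: the centroid is the per-feature average of the initial segment.
    centroid = [sum(row[j] for row in source_rows) / len(source_rows)
                for j in range(dims)]

    def distance(row):
        # Weighted Euclidean distance; cosine or Mahalanobis could be swapped in.
        return math.sqrt(sum(w * (x - c) ** 2
                             for w, x, c in zip(weights, row, centroid)))

    # Step 4b: rank every customer by distance (ties broken by customer id).
    ranked = sorted(population, key=lambda cid: (distance(population[cid]), cid))
    # Step 5: return the n customers most similar to the initial segment.
    return ranked[:n]

population = {101: [0, 0], 102: [0, 2], 103: [10, 10], 104: [1, 1], 105: [5, 5]}
audience = rank_lookalikes(population, source_ids=[101, 102], weights=[0.5, 0.5], n=4)
```

In this toy example the centroid is `[0, 1]`, so customer 104 ties with the two source customers and customer 103 is ranked last.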
 
-<img src="fig/workflow.jpg" width="450">
+<p align="center">
+    <img src="fig/workflow.jpg" width="450">
+</p>
 
 ## Feature Extraction
 
 The full model includes 1805 features, which are computed primarily from bimart.cs_sales and indexing_platform.user_behavior_log. Because the indexing_platform.user_behavior_log table is huge, it is impossible to compute 1805 features for the entire population of approximately 11 million customers (after 12 hours, the task is killed). Instead, the SQL must be run on a private cluster using Zeppelin Sandbox.
 
 The partial model includes 957 features, which are computed from bimart.cs_sales in Redshift. The 957 features (9 * 106 + 3) are highlighted in red in the table above. It takes about 30 minutes to finish the entire feature extraction process on Redshift.
 
-The redshift SQL commands are syntactically different from Hive SQL commands, both versions are (hopefully) bug-free. Refer to README.txt in the sql_hive and sql_redshift directories for detailed explanations on how to use the files.
+The Redshift SQL commands are syntactically different from the Hive SQL commands; both versions are (hopefully) bug-free. Refer to README.txt in the [sql_hive](sql_hive/README.txt) and [sql_redshift](sql_redshift/README.txt) directories for detailed explanations of how to use the files.
 
 * Need: annualized aggregate spending (aas), annualized order count (cto), annualized quantity count (ctq), days per order (dpo), days per quantity (dpq) quantify a customer's need for each product category.
 * Habit: annualized view count (ctv) indirectly measures how much interest a customer has in each product category.
 * Engagement: view per order (vpo), view per quantity (vpq) measure how much attention a customer spends on looking for the best offer in each product category.
 * Spending power: GMV per order (gpo), per quantity (gpq), per day (gpd) measure how much a customer is willing to spend (per unit order / quantity) in each product category.
 * Churn: days since last order (dal), views since last order (vsl) measure how likely a user is to place the next order in each product category.
 
-<img src="fig/sql.jpg" width="800">
+<p align="center">
+    <img src="fig/sql.jpg" width="800">
+</p>
 
 ### Naming Convention
 Each feature is referred to as symbol + category code, in which the category code is left-padded with zeros to a consistent 3-digit code. For example, "cto078" refers to annualized order count in category 78 (Overseas Travel - Ticket/Pass). Use the script here to generate names for all features. An exhaustive list of 1805 feature names can be found [here](data/feature_code.csv) as a csv file.
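
As a quick illustration of the naming convention (not the project's own script), the helper below builds a feature name from a symbol and a numeric category code. The function names and the sample inputs are hypothetical; only the "symbol + zero-padded 3-digit code" rule comes from the text above.

```python
def feature_name(symbol, category_code):
    """Join a feature symbol with a category code zero-padded to 3 digits."""
    return f"{symbol}{category_code:03d}"

def feature_names(symbols, category_codes):
    """Generate names for every (symbol, category) pair, symbol-major order."""
    return [feature_name(s, c) for s in symbols for c in category_codes]

# Example inputs chosen for illustration only.
names = feature_names(["cto", "ctv"], [1, 78])
```

Here `feature_name("cto", 78)` gives `"cto078"`, matching the example in the paragraph above.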
@@ -48,7 +51,9 @@ The model uses Jensen-Shannon divergence (JS divergence) to measure the importan
 
 The model computes JS divergence for every feature of the source audience, compared against the population sample. The features with highest JS divergence are selected. In the python class, parameter max_features specifies the number of top features with highest JS divergence; parameter threshold specifies the minimum JS divergence required to be considered as meaningful divergence. If none of the features exceeds minimum threshold, no feature is selected. If more than max_features number of features exceed the minimum threshold, the top max_features features are selected. The final feature weights are computed by normalizing JS divergence of valid candidate features.
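
The selection rule described above can be sketched as follows. This is a hedged illustration rather than the python class referred to in the text: the function names, the histogram-style inputs, the base-2 logarithm, and normalizing the weights so they sum to 1 are assumptions. The defaults `max_features=20` and `threshold=0.05` mirror the defaults mentioned in this README.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    p and q are lists of probabilities over the same bins, each summing to 1.
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_features(divergences, max_features=20, threshold=0.05):
    """Keep features above the threshold, cap at max_features, normalize weights.

    divergences: dict mapping feature name -> JS divergence (source vs. population)
    Returns a dict of feature name -> weight; empty if nothing exceeds the threshold.
    """
    valid = {name: d for name, d in divergences.items() if d > threshold}
    top = sorted(valid.items(), key=lambda kv: kv[1], reverse=True)[:max_features]
    total = sum(d for _, d in top)
    return {name: d / total for name, d in top}

weights = select_features({"cto078": 0.3, "aas001": 0.1, "ctv002": 0.01},
                          max_features=2)
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (disjoint distributions), which makes a fixed threshold easier to interpret.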
 
-<img src="fig/js_div.jpg" width="600">
+<p align="center">
+    <img src="fig/js_div.jpg" width="600">
+</p>
 
 Warning: you must have at least 1 feature with meaningful divergence for the lookalike model to proceed. Empirically, a good number of features is 5 to 15. The default max_features is set to 20, beyond which successively joining tables (10 million rows, one per customer) takes exponentially more time. The default threshold is set to 0.05, below which divergence is more likely caused by noise than by source audience characteristics.
 
@@ -60,7 +65,7 @@ The full feature table does not have to be updated frequently, as the features
 
 ## How to use
 ### Step 1
-Connect to pang-SFO.
+Connect to pang-SFO. You cannot use Coupang's guest wifi or home wifi. Connect via VPN if necessary.
 
 ### Step 2
 If using Redshift, run the SQL commands sequentially in the sql_redshift folder. For detailed instructions, refer to this [guide](sql_redshift/README.txt).
@@ -142,7 +147,9 @@ Generally, when the population distribution is normal, lookalike audience's shif
 
 Note that the shape of the histogram depends on the lookalike audience size. The larger the lookalike audience, the more its shape converges to that of the population (if the lookalike audience is chosen to be the same size as the population, the two become identical).
 
-![Result](output/mom_src_srls/lookalike.png)
+<p align="center">
+    <img src="output/mom_src_srls/lookalike.png" width="800">
+</p>
 
 ## Caution
 The model is most useful when the goal is to understand the source audience's interest in a specific product category or range of product categories. Simply selecting the source audience based on account value, gender, platform, or registration time will not produce any meaningful insights from the source audience.