Skip to content

Commit e85731c

Browse files
authored
Update README.md
1 parent 0eddcf1 commit e85731c

File tree

1 file changed

+1
-1
lines changed

1 file changed

+1
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -165,6 +165,6 @@ Rerun selectTopFeatures() with lower js_div threshold or rebuild source audience
165165

166166
The recommended size of source audience is between 5,000 and 50,000. Too few customers will not likely provide sufficient data to estimate probability distribution for each feature. Too many customers will result in distributions that are generally similar to the population distribution for most features. Note that the recommended source audience size is the effective source audience size, not the input size: it is not helpful to randomly select 5,000 customers out of a large segment of 500,000 customers. The effective audience siz would be 500,000 in this case, which is too general (~4.5% of all Coupang customers).
167167

168-
If source audience is generated using SQL query, be careful about how the rows of the table you're querying from are indexed and sorted. The sorting method can introduce unexpected signal pattern which the lookalike model WILL pick up. For example, if the table is ordered by most recent active date, then a "LIMIT" clause will return the most recent (or distant) active customers. Such signal will cause the distribution of "day since last order" of the source audience to be very different from the rest of the population, because customers who are recently active are likely to have smaller values in "day since last order" feature. To avoid such implicit filter effect, you may want to add "ORDER BY RANDOM()" to randomize the indexing order.
168+
If source audience is generated using SQL query, be careful about how the rows of the table you're querying from are indexed and sorted. The sorting method can introduce unexpected signal pattern which the lookalike model WILL pick up. For example, if the table is ordered by most recent active date, then a "LIMIT" clause will return the most recent (or distant) active customers. Such signal will cause the distribution of "day since last order" of the source audience to be very different from the rest of the population, because customers who are recently active are likely to have smaller values in "day since last order" feature. To avoid such implicit filter effect, you may want to use the "TABLESAMPLE" SQL function.
169169

170170
Finally, before using the lookalike audience for any purpose, remember to examine the what top features are selected, what weights are assigned and histograms plot to see if the results violate commonsense.

0 commit comments

Comments
 (0)