The first challenge lies in selecting a suitable criterion for quantifying a datapoint's value. Most criteria aim to measure the gain in model performance attributable to including the datapoint in the training dataset. A common [approach](https://conservancy.umn.edu/handle/11299/37076), dubbed "leave-one-out", simply computes the difference in performance between a model trained on the full dataset and one trained on the full dataset minus one example. Recently, [Ghorbani _et al._](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf) and [Jia _et al._](http://proceedings.mlr.press/v89/jia19a/jia19a.pdf) proposed a data valuation scheme based on the [Shapley value](https://en.wikipedia.org/wiki/Shapley_value), a classic solution in game theory for distributing rewards in cooperative games: a datapoint's value is its marginal contribution to model performance, averaged over all possible subsets of the remaining training data. Empirically, Data Shapley valuations are more effective in downstream applications (_e.g._ active learning) than leave-one-out valuations, and they satisfy several intuitive properties that other criteria lack.

Computing exact Shapley values, however, is often prohibitively expensive. One line of research develops [polynomial-time Shapley algorithms for simpler models and uses them as a proxy](http://www.vldb.org/pvldb/vol12/p1610-jia.pdf), which can be [effective in many scenarios](https://arxiv.org/pdf/1911.07128.pdf). [DataScope](https://github.com/easeml/datascope/) extends this functionality to end-to-end ML pipelines consisting of both feature extractors and ML models. A minimal sketch of the approximation idea follows.
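To make the definitions concrete, here is a minimal sketch of Data Shapley estimated by permutation sampling: for each random ordering of the training set, every point is credited with the change in validation accuracy it causes when added. This is an illustration under our own assumptions, not the exact algorithm from the papers above (names such as `utility`, `monte_carlo_shapley`, and `n_permutations` are ours, and we omit the truncation heuristic Ghorbani _et al._ use to speed things up). Note that leave-one-out is recovered as the single marginal contribution measured at the full dataset.

```python
# Sketch: Monte Carlo Data Shapley via permutation sampling.
# Assumes a scikit-learn-style classifier; all names here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def utility(X_tr, y_tr, X_val, y_val):
    """Performance (validation accuracy) of a model trained on (X_tr, y_tr)."""
    if len(np.unique(y_tr)) < 2:  # can't fit a classifier on a single class
        return 0.0
    model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
    return model.score(X_val, y_val)

def monte_carlo_shapley(X_tr, y_tr, X_val, y_val, n_permutations=50, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal contribution over random orderings of the training set."""
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev = 0.0  # utility of the empty prefix
        for k in range(1, n + 1):
            u = utility(X_tr[perm[:k]], y_tr[perm[:k]], X_val, y_val)
            values[perm[k - 1]] += u - prev  # marginal contribution of point k
            prev = u
    return values / n_permutations

X, y = make_classification(n_samples=60, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
print(monte_carlo_shapley(X_tr, y_tr, X_val, y_val)[:5])
```

Even this small example retrains the model thousands of times, which is exactly why the polynomial-time proxies for simpler models mentioned above are attractive in practice.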