Commit 9b5b281

Add section on data valuation
1 parent 4523fb1 commit 9b5b281

File tree

3 files changed: +24 -11 lines


augmentation.md

Lines changed: 7 additions & 7 deletions
@@ -34,7 +34,7 @@ an integral part of text applications such as machine translation.
 A large body of work utilizes hand-crafted data augmentation primitives in order to improve
 model performance. These hand-crafted primitives are designed based on domain knowledge
 about data properties, e.g. rotating an image preserves the content of the image, and should
-typically not change the class label.
+typically not change the class label.

 The next few sections provide a sampling of work across several different
 modalities (images, text, audio) that take this approach.
@@ -44,12 +44,12 @@ modalities (images, text, audio) that take this approach.
 Heuristic transformations are commonly used in image augmentations, such as rotations, flips or crops
 (e.g. [AlexNet](https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf), [Inception](https://arxiv.org/abs/1409.4842.pdf)).

-Recent work has hand-crafted more sophisticated primitives, such as
+Recent work has proposed more sophisticated hand-crafted primitives:

-- [Cutout](https://arxiv.org/abs/1708.04552)
-- [Mixup](https://arxiv.org/pdf/1710.09412.pdf)
-- [CutMix](https://arxiv.org/abs/1905.04899.pdf)
-- [MixMatch](https://arxiv.org/pdf/1905.02249.pdf) and [ReMixMatch](https://arxiv.org/abs/1911.09785.pdf)
+- [Cutout](https://arxiv.org/abs/1708.04552) randomly masks patches of the input image during training.
+- [Mixup](https://arxiv.org/pdf/1710.09412.pdf) augments a training dataset with convex combinations of training examples. There is substantial empirical [evidence](https://papers.nips.cc/paper/2019/file/36ad8b5f42db492827016448975cc22d-Paper.pdf) that Mixup can improve generalization and adversarial robustness. A recent [theoretical analysis](https://arxiv.org/abs/2010.04819) helps explain these gains, showing that the Mixup loss can be approximated by a standard ERM loss plus regularization terms.
+- [CutMix](https://arxiv.org/abs/1905.04899.pdf) combines the two approaches above: instead of summing two input images (like Mixup), CutMix pastes a random patch from one image onto the other and updates the label to a weighted sum of the two image labels, proportional to the area of the pasted patch.
+- [MixMatch](https://arxiv.org/pdf/1905.02249.pdf) and [ReMixMatch](https://arxiv.org/abs/1911.09785.pdf) extend these techniques to semi-supervised settings.

 While these primitives have yielded compelling performance gains, they can often produce unnatural images and distort image semantics. However, data augmentation techniques such as [AugMix](https://arxiv.org/abs/1912.02781) can mix together various unnatural augmentations to produce images that appear more natural.

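The convex-combination idea behind Mixup is compact enough to sketch. The NumPy snippet below is an illustrative sketch, not the authors' reference implementation; the batch shapes and the `alpha` value are arbitrary choices:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, seed=None):
    """Mix a batch with a shuffled copy of itself: convex combinations
    of both the inputs and their one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))         # a partner example for each row
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]  # soft labels
    return x_mix, y_mix

# Toy batch: 8 random "images" with 10 one-hot classes.
rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))
y = np.eye(10)[rng.integers(0, 10, size=8)]
x_mix, y_mix = mixup_batch(x, y, seed=0)
```

Training then proceeds on `(x_mix, y_mix)` with the usual cross-entropy loss; because the labels are mixed with the same coefficient as the inputs, the loss is a weighted sum of the losses on the two original labels.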
@@ -135,5 +135,5 @@ Several open questions remain in data augmentation and synthetic data generation
 <h2 id="augmentation-evenmore">Further Reading</h2>

-- the ["Automating the Art of Data Augmentation"](https://hazyresearch.stanford.edu/data-aug-part-1)
+- The ["Automating the Art of Data Augmentation"](https://hazyresearch.stanford.edu/data-aug-part-1)
 series of blog posts by [Sharon Li](http://pages.cs.wisc.edu/~sharonli/) provides an overview of data augmentation.

data-selection.md

Lines changed: 14 additions & 0 deletions
@@ -1,2 +1,16 @@
 # Data Selection
 _This area is a stub, you can help by improving it._
+
+
+## Data Valuation
+Quantifying the contribution of each training datapoint to an end model is useful in a number of settings:
+1. In __active learning__, knowing the value of our training examples can guide us in collecting more data.
+2. When __compensating__ individuals for the data they contribute to a training dataset (_e.g._ search engine users contributing their browsing data or patients contributing their medical data).
+3. For __explaining__ a model's predictions and __debugging__ its behavior.
+
+However, data valuation can be quite tricky.
+The first challenge lies in selecting a suitable criterion for quantifying a datapoint's value. Most criteria aim to measure the gain in model performance attributable to including the datapoint in the training dataset. A common [approach](https://conservancy.umn.edu/handle/11299/37076), dubbed "leave-one-out", simply computes the difference in performance between a model trained on the full dataset and one trained on the full dataset minus one example. Recently, [Ghorbani _et al._](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf) proposed a data valuation scheme based on the [Shapley value](https://en.wikipedia.org/wiki/Shapley_value), a classic solution in game theory for distributing rewards in cooperative games. Empirically, Data Shapley valuations are more effective in downstream applications (_e.g._ active learning) than "leave-one-out" valuations. Moreover, they have several intuitive properties not shared by other criteria.
+
+Computing exact valuations according to either of these criteria requires retraining the model from scratch many times, which can be prohibitively expensive for large models. Thus, a second challenge lies in finding a good approximation for these measures. [Influence functions](https://arxiv.org/pdf/1703.04730.pdf) provide an efficient estimate of the "leave-one-out" measure that requires only access to the model's gradients and Hessian-vector products. Shapley values can be estimated with Monte Carlo sampling or, for models trained via stochastic gradient descent, a simple gradient-based [approach](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf).
+

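To make the two criteria concrete, here is a toy sketch. Everything in it is illustrative: the "model" is a majority-vote classifier, "training" is just remembering labels, and utility is validation accuracy, standing in for the expensive retraining described above:

```python
import random

def utility(train_labels, val_labels):
    """Validation accuracy of a toy majority-vote 'model' (an illustrative
    stand-in for retraining a real model on a subset of the data)."""
    if not train_labels:
        return 0.0
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(y == majority for y in val_labels) / len(val_labels)

def leave_one_out(train_labels, val_labels):
    """Value of point i = U(full dataset) - U(dataset without point i)."""
    full = utility(train_labels, val_labels)
    return [full - utility(train_labels[:i] + train_labels[i + 1:], val_labels)
            for i in range(len(train_labels))]

def monte_carlo_shapley(train_labels, val_labels, n_perms=200, seed=0):
    """Estimate Shapley values by averaging each point's marginal
    contribution over random permutations of the dataset."""
    rng = random.Random(seed)
    n = len(train_labels)
    values = [0.0] * n
    for _ in range(n_perms):
        order = rng.sample(range(n), n)       # a random permutation
        subset, prev_u = [], utility([], val_labels)
        for i in order:
            subset.append(train_labels[i])
            u = utility(subset, val_labels)
            values[i] += u - prev_u           # marginal contribution of point i
            prev_u = u
    return [v / n_perms for v in values]

# Three correctly labeled points and one mislabeled point.
train, val = [1, 1, 1, 0], [1, 1, 1, 1]
loo = leave_one_out(train, val)
shap = monte_carlo_shapley(train, val)
```

On this toy dataset leave-one-out assigns every point a value of zero (removing any single point never flips the majority vote), while the Monte Carlo Shapley estimate gives the mislabeled point the lowest (negative) value, illustrating why Shapley-based valuations can be more informative.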
evaluation.md

Lines changed: 3 additions & 4 deletions
@@ -37,10 +37,9 @@ Automated methods for slice discovery include,
 - [SliceFinder](https://research.google/pubs/pub47966/) is an interactive framework
 for finding interpretable slices of data.
-- [SliceLine](https://mboehm7.github.io/resources/sigmod2021b_sliceline.pdf) uses a fast slice-enumeration
-method to make the process of slice discovery efficient and parallelizable.
-- [GEORGE](https://arxiv.org/pdf/2011.12945.pdf), which uses standard approaches to cluster representations
-of a deep model in order to discover underperforming subgroups of data.
+- [SliceLine](https://mboehm7.github.io/resources/sigmod2021b_sliceline.pdf) uses a fast slice-enumeration method to make the process of slice discovery efficient and parallelizable.
+- [GEORGE](https://arxiv.org/pdf/2011.12945.pdf) uses standard approaches to cluster representations of a deep model in order to discover underperforming subgroups of data.
+- [Multiaccuracy Audit](https://arxiv.org/abs/1805.12317) is a model-agnostic approach that searches for slices on which the model performs poorly by training a simple "auditor" model to predict the full model's residual from input features. This idea of fitting a simple model to predict the predictions of the full model is also used in the context of [explainable ML](https://arxiv.org/pdf/1910.07969.pdf).

 Future directions for slice discovery will continue to improve our understanding of how to find slices
 that are interpretable, task-relevant, error-prone and susceptible to distribution shift.
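The residual-auditing idea can be sketched in a few lines. Everything below is a hypothetical toy: the "full model" is a hard-coded predictor that fails only on the slice where the second feature exceeds 1, and the auditor is a plain least-squares linear model rather than the algorithm from the Multiaccuracy paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # two input features
y = np.sin(X[:, 0])                              # ground-truth targets
# Hypothetical "full model": accurate except on the slice X[:, 1] > 1.
full_model_pred = np.where(X[:, 1] > 1, 0.0, y)
residual = (y - full_model_pred) ** 2            # per-example squared error

# Linear "auditor": fit residual ~ X by least squares, then flag the
# examples where the auditor predicts the largest error as a candidate slice.
A = np.c_[X, np.ones(len(X))]                    # add an intercept column
w, *_ = np.linalg.lstsq(A, residual, rcond=None)
predicted_err = A @ w
flagged = predicted_err > np.quantile(predicted_err, 0.9)
```

Because the auditor only sees input features, the flagged examples define an interpretable candidate slice (here, large values of the second feature) that can then be inspected, re-labeled, or upweighted.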
