Commit 9b5b281

Add section on data valuation
1 parent 4523fb1 commit 9b5b281

File tree

3 files changed: +24 -11 lines


augmentation.md

Lines changed: 7 additions & 7 deletions
@@ -34,7 +34,7 @@ an integral part of text applications such as machine translation.
 A large body of work utilizes hand-crafted data augmentation primitives in order to improve
 model performance. These hand-crafted primitives are designed based on domain knowledge
 about data properties, e.g. rotating an image preserves the content of the image, and should
-typically not change the class label.
+typically not change the class label.

 The next few sections provide a sampling of work across several different
 modalities (images, text, audio) that take this approach.
@@ -44,12 +44,12 @@ modalities (images, text, audio) that take this approach.
 Heuristic transformations are commonly used in image augmentations, such as rotations, flips or crops
 (e.g. [AlexNet](https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf), [Inception](https://arxiv.org/abs/1409.4842.pdf)).

-Recent work has hand-crafted more sophisticated primitives, such as
+Recent work has proposed more sophisticated hand-crafted primitives:

-- [Cutout](https://arxiv.org/abs/1708.04552)
-- [Mixup](https://arxiv.org/pdf/1710.09412.pdf)
-- [CutMix](https://arxiv.org/abs/1905.04899.pdf)
-- [MixMatch](https://arxiv.org/pdf/1905.02249.pdf) and [ReMixMatch](https://arxiv.org/abs/1911.09785.pdf)
+- [Cutout](https://arxiv.org/abs/1708.04552) randomly masks patches of the input image during training.
+- [Mixup](https://arxiv.org/pdf/1710.09412.pdf) augments a training dataset with convex combinations of training examples. There is substantial empirical [evidence](https://papers.nips.cc/paper/2019/file/36ad8b5f42db492827016448975cc22d-Paper.pdf) that Mixup can improve generalization and adversarial robustness. A recent [theoretical analysis](https://arxiv.org/abs/2010.04819) helps explain these gains, showing that the Mixup loss can be approximated by a standard ERM loss plus regularization terms.
+- [CutMix](https://arxiv.org/abs/1905.04899.pdf) combines the two approaches above: instead of summing two input images (like Mixup), CutMix pastes a random patch from one image onto the other and updates the label to a weighted sum of the two image labels, proportional to the area of the pasted patch.
+- [MixMatch](https://arxiv.org/pdf/1905.02249.pdf) and [ReMixMatch](https://arxiv.org/abs/1911.09785.pdf) extend these techniques to semi-supervised settings.

 While these primitives have yielded compelling performance gains, they can often produce unnatural images and distort image semantics. However, data augmentation techniques such as [AugMix](https://arxiv.org/abs/1912.02781) can mix together various unnatural augmentations to produce images that appear more natural.

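The convex-combination idea behind Mixup is compact enough to sketch. The NumPy snippet below is an illustrative sketch, not the authors' reference implementation; the batch shapes and the `alpha` value are arbitrary choices:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, seed=None):
    """Mix a batch with a shuffled copy of itself: convex combinations
    of both the inputs and their one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))         # a partner example for each row
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]  # soft labels
    return x_mix, y_mix

# Toy batch: 8 random "images" with 10 one-hot classes.
rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))
y = np.eye(10)[rng.integers(0, 10, size=8)]
x_mix, y_mix = mixup_batch(x, y, seed=0)
```

Training then proceeds on `(x_mix, y_mix)` with the usual cross-entropy loss; because the labels are mixed with the same coefficient as the inputs, the loss is a weighted sum of the losses on the two original labels.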
@@ -135,5 +135,5 @@ Several open questions remain in data augmentation and synthetic data generation
 <h2 id="augmentation-evenmore">Further Reading</h2>

-- the ["Automating the Art of Data Augmentation"](https://hazyresearch.stanford.edu/data-aug-part-1)
+- The ["Automating the Art of Data Augmentation"](https://hazyresearch.stanford.edu/data-aug-part-1)
 series of blog posts by [Sharon Li](http://pages.cs.wisc.edu/~sharonli/) provides an overview of data augmentation.

data-selection.md

Lines changed: 14 additions & 0 deletions
@@ -1,2 +1,16 @@
 # Data Selection
 _This area is a stub, you can help by improving it._
+
+
+## Data Valuation
+Quantifying the contribution of each training datapoint to an end model is useful in a number of settings:
+1. In __active learning__, knowing the value of our training examples can guide us in collecting more data.
+2. When __compensating__ individuals for the data they contribute to a training dataset (_e.g._ search engine users contributing their browsing data or patients contributing their medical data).
+3. For __explaining__ a model's predictions and __debugging__ its behavior.
+
+However, data valuation can be quite tricky.
+The first challenge lies in selecting a suitable criterion for quantifying a datapoint's value. Most criteria aim to measure the gain in model performance attributable to including the datapoint in the training dataset. A common [approach](https://conservancy.umn.edu/handle/11299/37076), dubbed "leave-one-out", simply computes the difference in performance between a model trained on the full dataset and one trained on the full dataset minus one example. Recently, [Ghorbani _et al._](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf) proposed a data valuation scheme based on the [Shapley value](https://en.wikipedia.org/wiki/Shapley_value), a classic solution in game theory for distributing rewards in cooperative games. Empirically, Data Shapley valuations are more effective in downstream applications (_e.g._ active learning) than "leave-one-out" valuations. Moreover, they have several intuitive properties not shared by other criteria.
+
+Computing exact valuations according to either of these criteria requires retraining the model from scratch many times, which can be prohibitively expensive for large models. Thus, a second challenge lies in finding a good approximation for these measures. [Influence functions](https://arxiv.org/pdf/1703.04730.pdf) provide an efficient estimate of the "leave-one-out" measure that requires only access to the model's gradients and Hessian-vector products. Shapley values can be estimated with Monte Carlo sampling or, for models trained via stochastic gradient descent, a simple gradient-based [approach](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf).
+

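To make the two criteria concrete, here is a toy sketch. Everything in it is illustrative: the "model" is a majority-vote classifier, "training" is just remembering labels, and utility is validation accuracy, standing in for the expensive retraining described above:

```python
import random

def utility(train_labels, val_labels):
    """Validation accuracy of a toy majority-vote 'model' (an illustrative
    stand-in for retraining a real model on a subset of the data)."""
    if not train_labels:
        return 0.0
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(y == majority for y in val_labels) / len(val_labels)

def leave_one_out(train_labels, val_labels):
    """Value of point i = U(full dataset) - U(dataset without point i)."""
    full = utility(train_labels, val_labels)
    return [full - utility(train_labels[:i] + train_labels[i + 1:], val_labels)
            for i in range(len(train_labels))]

def monte_carlo_shapley(train_labels, val_labels, n_perms=200, seed=0):
    """Estimate Shapley values by averaging each point's marginal
    contribution over random permutations of the dataset."""
    rng = random.Random(seed)
    n = len(train_labels)
    values = [0.0] * n
    for _ in range(n_perms):
        order = rng.sample(range(n), n)       # a random permutation
        subset, prev_u = [], utility([], val_labels)
        for i in order:
            subset.append(train_labels[i])
            u = utility(subset, val_labels)
            values[i] += u - prev_u           # marginal contribution of point i
            prev_u = u
    return [v / n_perms for v in values]

# Three correctly labeled points and one mislabeled point.
train, val = [1, 1, 1, 0], [1, 1, 1, 1]
loo = leave_one_out(train, val)
shap = monte_carlo_shapley(train, val)
```

On this toy dataset leave-one-out assigns every point a value of zero (removing any single point never flips the majority vote), while the Monte Carlo Shapley estimate gives the mislabeled point the lowest (negative) value, illustrating why Shapley-based valuations can be more informative.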
evaluation.md

Lines changed: 3 additions & 4 deletions
@@ -37,10 +37,9 @@ Automated methods for slice discovery include,
 - [SliceFinder](https://research.google/pubs/pub47966/) is an interactive framework
 for finding interpretable slices of data.
-- [SliceLine](https://mboehm7.github.io/resources/sigmod2021b_sliceline.pdf) uses a fast slice-enumeration
-method to make the process of slice discovery efficient and parallelizable.
-- [GEORGE](https://arxiv.org/pdf/2011.12945.pdf), which uses standard approaches to cluster representations
-of a deep model in order to discover underperforming subgroups of data.
+- [SliceLine](https://mboehm7.github.io/resources/sigmod2021b_sliceline.pdf) uses a fast slice-enumeration method to make the process of slice discovery efficient and parallelizable.
+- [GEORGE](https://arxiv.org/pdf/2011.12945.pdf) uses standard approaches to cluster representations of a deep model in order to discover underperforming subgroups of data.
+- [Multiaccuracy Audit](https://arxiv.org/abs/1805.12317) is a model-agnostic approach that searches for slices on which the model performs poorly by training a simple "auditor" model to predict the full model's residual from input features. This idea of fitting a simple model to predict the predictions of the full model is also used in the context of [explainable ML](https://arxiv.org/pdf/1910.07969.pdf).

 Future directions for slice discovery will continue to improve our understanding of how to find slices
 that are interpretable, task-relevant, error-prone and susceptible to distribution shift.
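The residual-auditing idea can be sketched in a few lines. Everything below is a hypothetical toy: the "full model" is a hard-coded predictor that fails only on the slice where the second feature exceeds 1, and the auditor is a plain least-squares linear model rather than the algorithm from the Multiaccuracy paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # two input features
y = np.sin(X[:, 0])                              # ground-truth targets
# Hypothetical "full model": accurate except on the slice X[:, 1] > 1.
full_model_pred = np.where(X[:, 1] > 1, 0.0, y)
residual = (y - full_model_pred) ** 2            # per-example squared error

# Linear "auditor": fit residual ~ X by least squares, then flag the
# examples where the auditor predicts the largest error as a candidate slice.
A = np.c_[X, np.ones(len(X))]                    # add an intercept column
w, *_ = np.linalg.lstsq(A, residual, rcond=None)
predicted_err = A @ w
flagged = predicted_err > np.quantile(predicted_err, 0.9)
```

Because the auditor only sees input features, the flagged examples define an interpretable candidate slice (here, large values of the second feature) that can then be inspected, re-labeled, or upweighted.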
