Commit cdea3d4

updated blog posts with current Snorkel literature
1 parent ff33e19 commit cdea3d4

1 file changed (+9, -9 lines changed)

README.md

Lines changed: 9 additions & 9 deletions
@@ -4,7 +4,7 @@
 </div>
 
 We're collecting (an admittedly opinionated) list of resources and progress made
-in data-centric AI, with exciting directions past, present and future.
+in data-centric AI, with exciting directions past, present and future.
 [This blog talks about our journey to data-centric AI](https://hazyresearch.stanford.edu/data-centric-ai) and
 we articulate [why we're excited about data as a viewpoint for AI in this blog](https://hazyresearch.stanford.edu/what-data-centric-ai-is-not).
 
@@ -28,7 +28,7 @@ If you have ideas on how we can make this repository better, feel free to submit
 
 ### Contributing
 
-We want this resource to grow with contributions from readers and data enthusiasts.
+We want this resource to grow with contributions from readers and data enthusiasts.
 If you'd like to make contributions to this Github repository, please read our [contributing guidelines](CONTRIBUTING.md).
 
 
@@ -57,9 +57,9 @@ If you'd like to make contributions to this Github repository, please read our [
 
 _This area is a stub, you can help by improving it._
 
-There's a lot of excitement around understanding how to put machine learning to work on real use-cases.
-Data-Centric AI embodies a particular point of view around how this progress can happen: by focusing on making it easier for
-practitioners to understand, program and iterate on datasets, instead of spending time on models.
+There's a lot of excitement around understanding how to put machine learning to work on real use-cases.
+Data-Centric AI embodies a particular point of view around how this progress can happen: by focusing on making it easier for
+practitioners to understand, program and iterate on datasets, instead of spending time on models.
 
 
 <h1 id="data-programming">Data Programming & Weak Supervision</h1>
@@ -68,11 +68,11 @@ practitioners to understand, program and iterate on datasets, instead of spendin
 
 Many modern machine learning systems require large, labeled datasets to be successful, but producing such datasets is time-consuming and expensive. Instead, weaker sources of supervision, such as [crowdsourcing](https://papers.nips.cc/paper/2011/file/c667d53acd899a97a85de0c201ba99be-Paper.pdf), [distant supervision](https://www.aclweb.org/anthology/P09-1113.pdf), and domain experts' heuristics like [Hearst Patterns](https://people.ischool.berkeley.edu/~hearst/papers/coling92.pdf) have been used since the 90s.
 
-However, these were largely regarded by AI and AI/ML folks as ad hoc or isolated techniques. The effort to unify and combine these into a data centric viewpoint started in earnest with [data programming](https://arxiv.org/abs/1605.07723), embodied in [Snorkel](https://snorkel.ai/how-to-use-snorkel-to-build-ai-applications/), now an [open-source project](http://snorkel.org) and [thriving company](http://snorkel.ai). In the Snorkel approach, users specify multiple labeling functions that each represent a noisy estimate of the ground-truth label. Because these labeling functions vary in accuracy and coverage of the dataset, and may even be correlated, they are combined and denoised via a latent variable graphical model. The technical challenge is thus to learn accuracy and correlation parameters in this model, and to use them to infer the true label to be used for downstream tasks.
+However, these were largely regarded by AI and AI/ML folks as ad hoc or isolated techniques. The effort to unify and combine these into a data centric viewpoint started in earnest with [data programming](https://arxiv.org/abs/1605.07723) AKA [programmatic labeling](https://snorkel.ai/programmatic-labeling/), embodied in [Snorkel](https://snorkel.ai/how-to-use-snorkel-to-build-ai-applications/), now an [open-source project](http://snorkel.org) and [thriving company](http://snorkel.ai). In Snorkel's [data-centric AI](https://snorkel.ai/data-centric-ai-primer/) approach, users specify multiple labeling functions that each represent a noisy estimate of the ground-truth label. Because these labeling functions vary in accuracy and coverage of the dataset, and may even be correlated, they are combined and denoised via a latent variable graphical model. The technical challenge is thus to learn accuracy and correlation parameters in this model, and to use them to infer the true label to be used for downstream tasks.
 
 Data programming builds on a long line of work on parameter estimation in latent variable graphical models. Concretely, a generative model for the joint distribution of labeling functions and the unobserved (latent) true label is learned. This label model permits aggregation of diverse sources of signal, while allowing them to have varying accuracies and potential correlations.
 
-An overview of the weak supervision landscape can be found in this [Snorkel blog post](https://www.snorkel.org/blog/weak-supervision), including how it compares to other approaches to get more labeled data and the technical modeling challenges.
+This Snorkel blog post contains an overview of [weak supervision](https://snorkel.ai/weak-supervision/), including how it compares to other approaches to get more labeled data and the technical modeling challenges.
 These [Stanford CS229 lecture notes](https://mayeechen.github.io/files/wslecturenotes.pdf) provide a theoretical summary of how graphical models are used in weak supervision.
 
 <h1 id="augmentation">Data Augmentation</h1>
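
Reviewer note on the hunk above: the Data Programming text describes labeling functions as noisy votes that differ in accuracy and coverage and are then combined. A minimal sketch of that workflow follows, assuming a toy spam task and plain Python rather than the Snorkel library. The aggregation here is an unweighted majority vote, which is a deliberately simpler stand-in for the latent variable label model the README describes (that model learns accuracy and correlation parameters instead of counting votes). All names (`lf_contains_link`, `lf_mentions_prize`, `lf_short_message`, `majority_vote`, `coverage`) are hypothetical and do not come from the README or from Snorkel.

```python
from collections import Counter

# Label values for a toy spam task (assumed for illustration only).
ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function is a noisy heuristic that may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def apply_lfs(texts):
    """Build the label matrix: one row per example, one column per function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain when nothing fires or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return ABSTAIN
    return top


def coverage(label_matrix):
    """Fraction of examples on which each labeling function does not abstain."""
    n = len(label_matrix)
    return [
        sum(row[j] != ABSTAIN for row in label_matrix) / n
        for j in range(len(LABELING_FUNCTIONS))
    ]


if __name__ == "__main__":
    texts = [
        "Win a prize at http://example.com now today friends",
        "see you soon",
        "Lunch at noon tomorrow with the whole team?",
    ]
    matrix = apply_lfs(texts)
    print(matrix)
    print([majority_vote(row) for row in matrix])
    print(coverage(matrix))
```

Majority vote treats every labeling function as equally reliable and independent, which is exactly the assumption the label model in data programming is meant to relax.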
@@ -138,14 +138,14 @@ Beyond subpopulation shift, robustness also features domain shift and adversaria
 
 [Data Cleaning Area Page](data-cleaning.md)
 
-Another way to improve data quality for ML/AI applications is via data cleaning. There is a diverse range of exciting work along this line to jointly understand data cleaning and machine learning.
+Another way to improve data quality for ML/AI applications is via data cleaning. There is a diverse range of exciting work along this line to jointly understand data cleaning and machine learning.
 
 
 <h1 id="mlops">MLOps</h1>
 
 [MLOps Area Page](mlops.md)
 
-The central role of data makes the development and deployment of ML/AI applications an human-in-the-loop process.
+The central role of data makes the development and deployment of ML/AI applications an human-in-the-loop process.
 This is a complex process in which human engineers could make mistakes, require guidance, or need to be warned when something unexpected happens. The goal of MLOps is to provide principled ways for lifecycle management, monitoring, and validation.
 
 Researchers have started tackling these challenges by developing new techniques and building systems such as [TFX](https://arxiv.org/pdf/2010.02013.pdf), [Ease.ML](http://cidrdb.org/cidr2021/papers/cidr2021_paper26.pdf) or [Overton](https://www.cs.stanford.edu/~chrismre/papers/overton-tr.pdf) designed to handle the entire lifecycle of a machine learning model both during development and in production. These systems typically consist of distinct components in charge of handling specific stages (e.g., pre- or post-training) or aspects (e.g., monitoring or debugging) of MLOps.
