Many modern machine learning systems require large, labeled datasets to be successful, but producing such datasets is time-consuming and expensive. Instead, weaker sources of supervision, such as [crowdsourcing](https://papers.nips.cc/paper/2011/file/c667d53acd899a97a85de0c201ba99be-Paper.pdf), [distant supervision](https://www.aclweb.org/anthology/P09-1113.pdf), and domain experts' heuristics like [Hearst Patterns](https://people.ischool.berkeley.edu/~hearst/papers/coling92.pdf) have been used since the 90s.
However, these were largely regarded in the AI and ML communities as ad hoc or isolated techniques. The effort to unify and combine them into a data-centric viewpoint started in earnest with [data programming](https://arxiv.org/abs/1605.07723), also known as [programmatic labeling](https://snorkel.ai/programmatic-labeling/), embodied in [Snorkel](https://snorkel.ai/how-to-use-snorkel-to-build-ai-applications/), now an [open-source project](http://snorkel.org) and a [thriving company](http://snorkel.ai). In Snorkel's [data-centric AI](https://snorkel.ai/data-centric-ai-primer/) approach, users specify multiple labeling functions, each of which provides a noisy estimate of the ground-truth label. Because these labeling functions vary in accuracy and coverage of the dataset, and may even be correlated, they are combined and denoised via a latent variable graphical model. The technical challenge is to learn the accuracy and correlation parameters of this model and use them to infer the true labels for downstream tasks.
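As a concrete (but hypothetical) illustration of what labeling functions look like in practice, below is a minimal sketch using the open-source Snorkel API (v0.9-style imports; exact module paths may differ across versions). The spam/ham task, the toy dataframe, and the keyword heuristics are made up for illustration, not taken from the Snorkel tutorials verbatim.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Each labeling function votes SPAM, HAM, or abstains on a given example.
ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN  # URLs often indicate spam

@labeling_function()
def lf_mentions_prize(x):
    return SPAM if "prize" in x.text.lower() else ABSTAIN  # "you won a prize" pattern

@labeling_function()
def lf_short_reply(x):
    return HAM if len(x.text.split()) <= 4 else ABSTAIN  # short replies tend to be legitimate

df_train = pd.DataFrame({"text": [
    "Claim your prize at http://spam.example now",
    "thanks, see you tomorrow",
    "ok sounds good",
    "You have won a prize! Reply to claim it today",
]})

# Apply every labeling function to every example -> an n x m matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_mentions_prize, lf_short_reply])
L_train = applier.apply(df=df_train)

# Fit the generative label model to estimate each source's accuracy and
# produce denoised probabilistic labels for training a downstream classifier.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)
```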
Data programming builds on a long line of work on parameter estimation in latent variable graphical models. Concretely, a generative model for the joint distribution of labeling functions and the unobserved (latent) true label is learned. This label model permits aggregation of diverse sources of signal, while allowing them to have varying accuracies and potential correlations.
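As a hedged sketch of that model (the notation here is ours, not copied verbatim from the paper): with $m$ labeling functions $\lambda_1,\dots,\lambda_m$ and an unobserved true label $Y$, a simple accuracy-only version of the label model and its estimation objective can be written as:

```latex
% Accuracy-only factor-graph label model; correlation factors can be added analogously.
p_\theta(\lambda_1, \dots, \lambda_m, Y)
  = \frac{1}{Z_\theta} \exp\!\Big( \sum_{j=1}^{m} \theta_j \, \mathbf{1}\{\lambda_j = Y\} \Big),
\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{y} p_\theta\big(\lambda_1^{(i)}, \dots, \lambda_m^{(i)}, \, Y = y\big).
```

The estimated parameters then yield probabilistic labels $p_{\hat\theta}(Y \mid \lambda_1, \dots, \lambda_m)$ that are used to train the downstream end model.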
This Snorkel blog post gives an overview of [weak supervision](https://snorkel.ai/weak-supervision/), including how it compares to other approaches for obtaining more labeled data and the technical modeling challenges involved.
These [Stanford CS229 lecture notes](https://mayeechen.github.io/files/wslecturenotes.pdf) provide a theoretical summary of how graphical models are used in weak supervision.
<h1id="augmentation">Data Augmentation</h1>
[Data Cleaning Area Page](data-cleaning.md)
Another way to improve data quality for ML/AI applications is data cleaning. There is a diverse range of exciting work along this line that studies data cleaning and machine learning jointly.
<h1id="mlops">MLOps</h1>
[MLOps Area Page](mlops.md)
The central role of data makes the development and deployment of ML/AI applications a human-in-the-loop process.
This is a complex process in which human engineers can make mistakes, require guidance, or need to be warned when something unexpected happens. The goal of MLOps is to provide principled ways for lifecycle management, monitoring, and validation.
Researchers have started tackling these challenges by developing new techniques and building systems such as [TFX](https://arxiv.org/pdf/2010.02013.pdf), [Ease.ML](http://cidrdb.org/cidr2021/papers/cidr2021_paper26.pdf) or [Overton](https://www.cs.stanford.edu/~chrismre/papers/overton-tr.pdf) designed to handle the entire lifecycle of a machine learning model both during development and in production. These systems typically consist of distinct components in charge of handling specific stages (e.g., pre- or post-training) or aspects (e.g., monitoring or debugging) of MLOps.
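As a toy, hedged illustration of the monitoring aspect only (this is not the API of TFX, Ease.ML, or Overton; every name and threshold below is made up), a minimal check might compare per-feature statistics of recent serving data against the training data and warn engineers when they diverge:

```python
import numpy as np

def feature_drift_report(train: np.ndarray, serving: np.ndarray, threshold: float = 3.0) -> dict:
    """Flag features whose serving-time mean drifts away from the training mean.

    Uses a simple z-score of the serving mean under the training statistics;
    production systems rely on richer statistics, schemas, and alerting.
    """
    report = {}
    for j in range(train.shape[1]):
        mu, sigma = train[:, j].mean(), train[:, j].std() + 1e-8
        z = abs(serving[:, j].mean() - mu) / (sigma / np.sqrt(len(serving)))
        report[j] = {"z_score": float(z), "drifted": bool(z > threshold)}
    return report

# Hypothetical usage: the second feature of the serving batch has shifted upward.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))
serving = np.column_stack([rng.normal(0.0, 1.0, 200), rng.normal(0.8, 1.0, 200)])
for feat, stats in feature_drift_report(train, serving).items():
    if stats["drifted"]:
        print(f"WARNING: feature {feat} drifted (z = {stats['z_score']:.1f})")
```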