As self-supervised data is often curated from large, public sources (e.g., Wikipedia), it can contain popularity bias: the long tail of rare entities is poorly represented in the training data. As [Orr et al.](https://arxiv.org/pdf/2010.10363.pdf) show, popular models such as BERT rely on memorizing contextual patterns and struggle to resolve this long tail because they never see a rare entity enough times to memorize the diverse set of patterns associated with it. The long-tail problem even propagates to downstream tasks, such as the retrieval tasks in [AmbER](https://arxiv.org/pdf/2106.06830.pdf). One exciting direction for addressing the long tail, sitting at the intersection of AI and years of research from the data management community, is integrating structured knowledge into the model. Structured knowledge is the core idea behind the tail success of [Bootleg](https://arxiv.org/pdf/2010.10363.pdf), a system for Named Entity Disambiguation.
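To make the popularity-bias intuition concrete, here is a minimal sketch (with a hypothetical toy corpus and an arbitrary rarity threshold, not data from Bootleg or Wikipedia) of how one might measure what fraction of entities fall in the tail:

```python
from collections import Counter

# Hypothetical mention counts over a toy corpus: one popular entity
# dominates, while several rare ("tail") entities appear only a few times.
mentions = (
    ["Lincoln (president)"] * 50
    + ["Lincoln (car)"] * 3
    + ["Lincoln, Nebraska"] * 2
    + ["Lincoln (film)"] * 1
)

counts = Counter(mentions)

# Call an entity "tail" if it is seen fewer than `threshold` times --
# too few occurrences for a model to memorize its contextual patterns.
threshold = 10
tail = [entity for entity, count in counts.items() if count < threshold]

tail_entity_fraction = len(tail) / len(counts)
tail_mention_fraction = sum(counts[e] for e in tail) / sum(counts.values())

print(f"{len(tail)}/{len(counts)} entities are in the tail "
      f"({tail_entity_fraction:.0%} of entities, but only "
      f"{tail_mention_fraction:.0%} of mentions)")
```

The skew this sketch exhibits, where most *entities* are rare but most *mentions* belong to a few popular entities, is exactly why memorization-based models do well on average accuracy while failing on the tail.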