Commit b56af3a

Merge pull request #2 from didi/master: update
2 parents afa140d + d0c02af

14 files changed: +288 −57 lines

docs/co-reference_resolution.md

Lines changed: 32 additions & 5 deletions

@@ -24,13 +24,12 @@ Average of F1-scores returned by these three precision/recall metrics:
- B-cubed.
- Entity-based CEAF.
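The averaging rule above can be sketched in a few lines. This is only an illustration of how the final number is formed; actual evaluations use the official reference scorer:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_score(muc, b_cubed, ceaf_e):
    """Average of the F1 scores of MUC, B-cubed, and entity-based CEAF.
    Each argument is a (precision, recall) pair."""
    return sum(f1(p, r) for p, r in (muc, b_cubed, ceaf_e)) / 3.0
```

For example, `conll_score((0.8, 0.7), (0.75, 0.65), (0.7, 0.6))` combines the three metric results into the single reported number.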
## <span class="t">CoNLL 2012 Co-reference task</span>.

CoNLL 2012 introduced a co-reference task in Chinese.
- http://conll.cemantix.org/2012/introduction.html

Data for this evaluation is part of OntoNotes, distributed by the Linguistic Data Consortium (LDC).
- https://catalog.ldc.upenn.edu/LDC2013T19
| Test set | # of co-referring mentions | Genre |

@@ -47,16 +46,44 @@ Scoring code: https://github.com/conll/reference-coreference-scorers

| System | Average F1 of MUC, B-cubed, CEAF |
| --- | --- |
| [Kong & Jian (2019)](https://www.ijcai.org/Proceedings/2019/700) | 63.85 |
| [Clark & Manning (2016b)](https://nlp.stanford.edu/static/pubs/clark2016deep.pdf) | 63.88 |
| [Clark & Manning (2016a)](https://nlp.stanford.edu/static/pubs/clark2016improving.pdf) | 63.66 |
### Resources

Data for this evaluation is part of OntoNotes, distributed by the Linguistic Data Consortium (LDC).
- https://catalog.ldc.upenn.edu/LDC2013T19

---
## <span class="t">Subtask: zero pronoun resolution (CoNLL 2012 / OntoNotes 5.0)</span>.

### Metrics

F1 score computed on resolution hits ([Zhao & Ng 2007](https://www.aclweb.org/anthology/D07-1057.pdf)).
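Following the Zhao & Ng (2007) definition, the metric can be sketched as below (a minimal illustration, not the official evaluation script):

```python
def resolution_f1(hits, num_predicted, num_gold):
    """F1 over resolution hits: precision = hits / predicted zero pronouns,
    recall = hits / gold anaphoric zero pronouns."""
    precision = hits / num_predicted if num_predicted else 0.0
    recall = hits / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```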
### Results

| System | Overall F1 (w/ gold syntactic info) | Overall F1 (w/o gold syntactic info) |
| --- | --- | --- |
| [Aloraini & Poesio (2020)](https://www.aclweb.org/anthology/2020.lrec-1.11/) | 63.5 | |
| [Song et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.482/) | 58.5 | 26.1 |
| [Yang et al. (2019)](https://www.aclweb.org/anthology/W19-4108/) | 58.1 | |
| [Yin et al. (2018)](https://www.aclweb.org/anthology/C18-1002/) | 57.3 | |
| [Liu et al. (2017)](https://www.aclweb.org/anthology/P17-1010/) | 55.3 | |
| [Yin et al. (2017)](https://www.aclweb.org/anthology/D17-1135/) | 54.9 | 22.7 |

### Resources

Training and testing are performed on the train and dev splits of OntoNotes 5.0, respectively (statistics reported by [Yin et al. (2018)](https://www.aclweb.org/anthology/C18-1002/)).

| Split | Documents | Sentences | Words | Anaphoric Zero Pronouns |
| --- | --- | --- | --- | --- |
| Train | 1,391 | 36,487 | 756K | 12,111 |
| Dev | 172 | 6,083 | 110K | 1,713 |
**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**

docs/dialogue_state_management.md

Lines changed: 38 additions & 2 deletions

@@ -110,8 +110,44 @@ This task aims at tracking the dialog state defined as a frame structure filled
| Train | English | 35 | 31,304 |
| Dev | Chinese | 2 | 3,130 |
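The frame structure being tracked can be pictured as a nested mapping of domain → slot → value. A toy sketch follows; the domain and slot names here are hypothetical and not any dataset's actual schema:

```python
# Toy dialog state: domain -> slot -> value (None = unfilled).
state = {
    "hotel": {"price range": None, "rating": None},
    "taxi": {"destination": None},
}

def update_state(state, domain, slot, value):
    """Return a new state with one slot filled after a user turn."""
    new_state = {d: dict(slots) for d, slots in state.items()}
    new_state[domain][slot] = value
    return new_state

# After the user says something like "a cheap hotel, please":
state = update_state(state, "hotel", "price range", "cheap")
```

A tracker is scored on how often this full frame matches the gold annotation after each turn.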
## <span class="t">CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset</span>.
The first large-scale Chinese cross-domain Wizard-of-Oz task-oriented dataset.
* 5 domains: hotel, restaurant, attraction, metro, taxi
* Annotation of dialog states and dialog acts on both user and system sides
* About 60% of dialogs have cross-domain user goals
* A rule-based user simulator is also provided for evaluation
### Links
* [GitHub](https://github.com/thu-coai/CrossWOZ)
* [arXiv](https://arxiv.org/abs/2002.11893)

### Overview
| CrossWOZ | |
| --- | --- |
| Language | Chinese with English translations |
| Speakers | Human-to-Human |
| \# Domains | 5 |
| \# Slots | 72 |
| \# Values | 7,871 |
### Datasets
| Split | Train | Valid | Test |
| --- | --- | --- | --- |
| \# Dialogues | 5,012 | 500 | 500 |
| \# Turns (utterances) | 84,692 | 8,458 | 8,476 |
| Vocab | 12,502 | 5,202 | 5,143 |
| Avg. user sub-goals | 3.24 | 3.26 | 3.26 |
| Avg. turns | 16.9 | 16.9 | 17.0 |
| Avg. tokens per turn | 16.3 | 16.3 | 16.2 |

### Example data
A sample dialogue (names of hotels are replaced by A, B, and C for simplicity):

![example](https://github.com/thu-coai/CrossWOZ/blob/master/example.png)

### Benchmark Model Evaluations
![result](https://github.com/thu-coai/CrossWOZ/blob/master/result.png)
---

**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**
docs/entity_linking.md

Lines changed: 4 additions & 1 deletion

@@ -55,7 +55,10 @@ NERC F-score

| System | TAC-KBP / EDL 2015<br/>Names | TAC-KBP / EDL 2016<br/>Names and nominals | TAC-KBP / EDL 2017<br/>Names and nominals |
| --- | --- | --- | --- |
| [Sil et al. (2018)](https://arxiv.org/abs/1712.01813) | 84.4 | | |
| [Pan et al. (2020)](https://www.aclweb.org/anthology/D19-6107.pdf) | 84.2 | | |
| [Pan et al. (2020)](https://www.aclweb.org/anthology/D19-6107.pdf) | 81.2 (unsup) | | |
| Best anonymous system in shared task writeup | 76.9 | 76.2 | 67.8 |

### Resources
docs/entity_tagging.md

Lines changed: 13 additions & 4 deletions

@@ -82,7 +82,9 @@ A standard train/dev/test split does not seem to be available. Authors frequent
### Results

| System | F-score |
| --- | --- |
| [Wang et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.525/) | 81.7 |
| [Huang et al. (2020)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0235796) | 81.7 |
| [Wang & Lu (2018)](https://arxiv.org/pdf/1810.01808.pdf) | 73.00 |
| [Ju et al. (2018)](http://www.aclweb.org/anthology/N18-1131) | 72.25 |
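The F-scores in these tables are conventionally computed over exact entity spans. A minimal sketch, assuming entities are represented as (start, end, type) triples:

```python
def entity_f1(gold_entities, predicted_entities):
    """Span-level precision/recall/F1: a predicted entity counts as correct
    only if its (start, end, type) triple exactly matches a gold entity."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A wrong type or a boundary off by one character counts as both a false positive and a false negative under this convention.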
@@ -103,7 +105,11 @@ Paper summarizing the bakeoff:

| System | F-score |
| --- | --- |
| [Liu et al. (2020)](https://arxiv.org/pdf/1909.07606.pdf) | 95.7 |
| [Meng et al. (2019)](https://arxiv.org/abs/1901.10125) | 95.5 |
| [Ma et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.528.pdf) | 95.4 |
| [Sun et al. (2020)](https://arxiv.org/pdf/1907.12412.pdf) | 95.0 |
| [Yan et al. (2020)](https://ieeexplore.ieee.org/abstract/document/9141551) | 94.1 |
| [Liu et al. (2019)](https://www.aclweb.org/anthology/N19-1247) | 93.74 |
| [Sui et al. (2019)](https://www.aclweb.org/anthology/D19-1396/) | 93.47 |
| [Gui et al. (2019)](https://www.aclweb.org/anthology/D19-1096/) | 93.46 |
@@ -135,11 +141,14 @@ Using the test split by http://www.aclweb.org/anthology/E17-2113:

| System | F-score (name mentions) | F-score (nominal mentions) | F-score (Overall) |
| --- | --- | --- | --- |
| [Ma et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.528.pdf) | 70.9 | 67.0 | 70.5 |
| [Meng et al. (2019)](https://arxiv.org/abs/1901.10125) | 67.6 | | |
| [Hu and Zheng (2020)](https://www.jstage.jst.go.jp/article/transinf/E103.D/7/E103.D_2019EDP7253/_pdf/-char/ja) | 56.4 | | |
| [Sui et al. (2019)](https://www.aclweb.org/anthology/D19-1396/) | 56.45 | 68.32 | 63.09 |
| [Gui et al. (2019)](https://www.aclweb.org/anthology/D19-1096/) | 55.34 | 64.98 | 60.21 |
| [Liu et al. (2019)](https://www.aclweb.org/anthology/N19-1247) | 52.55 | 67.41 | 59.84 |
| [Zhu (2019)](https://www.aclweb.org/anthology/N19-1342) | 55.38 | 62.98 | 59.31 |
| [Zhang & Yang (2018)](http://aclweb.org/anthology/P18-1144) | 53.04 | 62.25 | 58.79 |
| [Peng & Dredze (2015)](https://www.cs.jhu.edu/~npeng/papers/golden_horse_supplement.pdf) | 55.28 | 62.97 | 58.99 |

### Resources

docs/machine_translation.md

Lines changed: 56 additions & 9 deletions

@@ -35,8 +35,9 @@ The United States and China may soon reach a trade agreement.
* BLEU-SBP ([Chiang et al 08](http://aclweb.org/anthology/D08-1064)). Addresses decomposability problems with BLEU, proposing a cross between BLEU and word error rate.
* HTER. Returns the number of edits performed by a human post-editor to get an automatic translation into good shape.
## <span class="t">ZH-EN</span>.

### <span class="t">WMT</span>.

The Second Conference on Machine Translation (WMT17) has a Chinese/English MT component, done in cooperation with CWMT 2017.
* [Website](http://www.statmt.org/wmt17)

@@ -78,7 +79,7 @@ English - Chinese (WMT17)

There are many parallel English/Chinese text resources to train MT systems on. These are publicly available:

| Dataset | Size (words on English side) | Genre |
| --- | --- | --- |
| UN | 327m | Political |
| News Commentary v12 | 5m | News opinions |
@@ -90,7 +91,7 @@ The Linguistic Data Consortium has additional resources, such as FBIS and NIST t

### <span class="t">NIST</span>.

NIST has a long history of supporting Chinese-English translation by creating annual test sets and running annual NIST OpenMT evaluations during the 2000s. Many sites have reported results on NIST test sets.
@@ -122,6 +123,7 @@ Note that this [paper](http://www.lrec-conf.org/proceedings/lrec2018/pdf/678.pdf
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [[Zhang et al 2019]](https://arxiv.org/abs/1906.02448) | 1.25m | mteval-v11b | | 48.31 | 49.40 | 48.72 | 48.45 | | 48.72 |
| [[Hadiwinoto & Ng, 2018]](http://www.lrec-conf.org/proceedings/lrec2018/pdf/678.pdf) | 7.65m | mteval-v13a | 46.94 | 47.58 | 49.13 | 47.78 | 49.37 | 41.48 | 47.05 |
| [[Yang et al, 2020]](https://www.aclweb.org/anthology/2020.acl-main.531/) | 1.2m | unspecified | | 46.56 | | 46.04 | | 37.53 | |
| [[Meng et al 2019]](https://arxiv.org/pdf/1901.10125.pdf) | 1.25m | unspecified | 40.56 (dev) | 39.93 | 41.54 | 38.01 | 37.45 | 29.07 | 37.76 |
| [[Ma et al 2018c]](https://arxiv.org/abs/1805.04871) | 1.25m | unspecified | 39.77 (dev) | 38.91 | 40.02 | 36.82 | 35.93 | 27.61 | 36.51 |
| [[Chen et al 2017]](http://aclweb.org/anthology/P17-1177) | 1.6m | multibleu | 36.57 | 35.64 | 36.63 | 34.35 | 30.57 | | |
@@ -132,13 +134,13 @@ The Linguistic Data Consortium provides training materials typically used for NI

### <span class="t">IWSLT 2015</span>.

* Translation of TED talks
* Chinese-to-English track
* [Shared task overview](https://cris.fbk.eu/retrieve/handle/11582/303031/9811/main.pdf)

| Dataset | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| tst2014 | 1,068 | 12 | TED talks |
| tst2015 | 1,080 | 12 | TED talks |

@@ -165,7 +167,7 @@ English to Chinese (tst2015)

### Resources

| Dataset | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| Train | 210k | 1,718 | TED talks |
@@ -201,8 +203,9 @@ English to Chinese
[The Multitarget TED Talks Task (MTTT)](http://cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/)

## <span class="t">ZH-JA</span>.

### <span class="t">Workshop on Asian Translation</span>.

[The Workshop on Asian Translation](http://lotus.kuee.kyoto-u.ac.jp/WAT/) has run since 2014. Here, we include the 2018 Chinese/Japanese evaluations.
@@ -245,15 +248,59 @@ Participants must get data from [here](http://lotus.kuee.kyoto-u.ac.jp/WAT/paten

### Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Japanese-Chinese train | 250,000 | Patents |
| Japanese-Chinese dev | 2,000 | Patents |
| Japanese-Chinese devtest | 2,000 | Patents |
### <span class="t">IWSLT 2020 ZH-JA Open Domain Translation</span>.

[The shared task](http://iwslt.org/doku.php?id=open_domain_translation) aims to promote research on translation between Asian languages, the exploitation of noisy parallel web corpora for MT, and smart processing of data and provenance.

### Metrics
* 4-gram character BLEU.
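A minimal sentence-level sketch of character BLEU follows. It is illustrative only: the shared task uses an official scorer, corpus-level BLEU aggregates n-gram counts over all segments before combining, and the smoothing here is a simplification:

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_bleu(hypothesis, reference, max_n=4):
    """Sentence-level 4-gram character BLEU with brevity penalty,
    smoothed so that a missing n-gram order does not zero the score."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = char_ngrams(hypothesis, n)
        ref_counts = char_ngrams(reference, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        log_precision_sum += math.log(max(overlap, 1e-9) / total)
    brevity = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1.0 - len(reference) / max(len(hypothesis), 1))
    return brevity * math.exp(log_precision_sum / max_n)
```

Working over characters rather than words sidesteps segmentation differences between Chinese and Japanese tokenizers.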
A (secret) mixed-genre test set was intended to cover a variety of topics. The test data was selected from high-quality (human-translated) parallel web content authored between January and March 2020.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| Chinese-Japanese | 875 | mixed-genre |
| Japanese-Chinese | 875 | mixed-genre |
### Results

Chinese to Japanese

| System | BLEU |
| --- | --- |
| [CASIA*](https://www.aclweb.org/anthology/2020.iwslt-1.15/) | 43.0 |
| [Xiaomi](https://www.aclweb.org/anthology/2020.iwslt-1.18/) | 34.3 |
| [TSUKUBA](https://www.aclweb.org/anthology/2020.iwslt-1.17/) | 33.0 |

Japanese to Chinese

| System | BLEU |
| --- | --- |
| [CASIA*](https://www.aclweb.org/anthology/2020.iwslt-1.15/) | 55.8 |
| [Samsung Research China](https://www.aclweb.org/anthology/2020.iwslt-1.12/) | 34.0 |
| [OPPO](https://www.aclweb.org/anthology/2020.iwslt-1.13/) | 32.9 |

\* indicates a system that collected external parallel training data which inadvertently overlapped with the blind test set.

### Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Web crawled | 18,966,595 | mixed-genre |
| Existing parallel sources | 1,963,238 | mixed-genre |

## <span class="t">Others</span>.

### <span class="t">CWMT</span>.

[CWMT 2017](http://ee.dlut.edu.cn/CWMT2017/index_en.html)
and [2018](http://www.cipsc.org.cn/cwmt/2018/english/)

docs/multi-task_learning.md

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
# Chinese Multi-task Learning

## Background

Multi-task learning aims to learn multiple tasks simultaneously while maximizing performance on one or all of the tasks.

## Standard metrics

- Accuracy of classification.
- Exact Match.
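These two metrics can be sketched as follows (a simple illustration; leaderboards use their own evaluation scripts, which may normalize answers differently):

```python
def accuracy(predictions, labels):
    """Classification accuracy: fraction of predictions equal to the label."""
    assert len(predictions) == len(labels)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

def exact_match(predictions, references):
    """Exact Match: fraction of predictions identical to the reference
    after trimming surrounding whitespace."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```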
## <span class="t">CLUE</span>.

The [Chinese Language Understanding Evaluation Benchmark](https://arxiv.org/abs/2004.05986) (CLUE)
is a benchmark for evaluating model performance across a diverse range of existing natural language understanding tasks. Models are ranked by their average score across all tasks.
* [GitHub](https://github.com/CLUEbenchmark/CLUE)
* [Website](https://www.cluebenchmarks.com/)

The state-of-the-art results can be seen on the public [CLUE leaderboard](https://www.cluebenchmarks.com/rank.html).

---

**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**
docs/pos_tagging.md

Lines changed: 6 additions & 2 deletions

@@ -45,7 +45,9 @@ F1 score calculated from word-level precision and word-level recall computed fro

| System | F1 score |
| --- | --- |
| [Tian et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.735.pdf) | 96.92 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | 96.61 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (BERT) | 96.06 |
| [Shao et al. (2017)](http://www.aclweb.org/anthology/I17-1018) | 94.38 |

### Resources
@@ -75,7 +77,9 @@ F1 score calculated from word-level precision and word-level recall computed fro

| System | F1 score |
| --- | --- |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | 96.14 |
| [Tian et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.735.pdf) | 95.69 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (BERT) | 94.79 |
| [Shao et al. (2017)](http://www.aclweb.org/anthology/I17-1018) | 89.75 |

### Resources

docs/question_answering.md

Lines changed: 27 additions & 2 deletions

@@ -87,8 +87,10 @@ NLPCC DBQA 2016

| System | MRR | F1 |
| --- | --- | --- |
| [ERNIE 2.0](https://arxiv.org/pdf/1907.12412.pdf) | 95.8 | 85.8 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | - | 83.4 |
| [ERNIE (Baidu)](https://arxiv.org/pdf/1904.09223.pdf) | 95.1 | 82.7 |
| [BERT](https://arxiv.org/pdf/1810.04805.pdf) | 94.6 | 80.8 |
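MRR, used in the DBQA rankings above, averages the reciprocal rank of the first correct answer for each question. A minimal sketch:

```python
def mean_reciprocal_rank(questions):
    """Each element of `questions` is a ranked list of booleans marking
    whether the candidate answer at that rank is correct."""
    if not questions:
        return 0.0
    total = 0.0
    for labels in questions:
        for rank, correct in enumerate(labels, start=1):
            if correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(questions)
```

For example, one question answered correctly at rank 2 and one at rank 1 gives (1/2 + 1/1) / 2 = 0.75.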
NLPCC DBQA 2017
@@ -104,6 +106,29 @@ NLPCC DBQA 2017

## <span class="t">Machine Reading Comprehension (MRC) tasks from the CLUE benchmark</span>.

[CLUE](https://github.com/CLUEbenchmark/CLUE) is a Chinese Language Understanding Evaluation benchmark.
Machine Reading Comprehension (MRC) is a task in which a machine reads and understands unstructured text and then answers questions about it.
The MRC corpus in CLUE consists of three datasets: **CMRC 2018** [(Cui et al.)](https://www.aclweb.org/anthology/D19-1600.pdf), **ChID** [(Zheng et al.)](https://www.aclweb.org/anthology/P19-1075.pdf), and **C<sup>3</sup>** [(Sun et al.)](https://arxiv.org/pdf/1904.09679.pdf).

### Metrics
* Exact Match (CMRC 2018)
* Accuracy (ChID and C<sup>3</sup>)

### Results
| System | CMRC 2018 | ChID | C<sup>3</sup> |
| --- | --- | --- | --- |
| [HUMAN (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 92.40 | 87.10 | 96.00 |
| [RoBERTa-wwm-ext-large (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 76.58 | 85.37 | 72.32 |
| [BERT-base (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 69.72 | 82.04 | 64.50 |

### Resources

[CLUE benchmark](https://www.cluebenchmarks.com/)
## Other resources.

* WebQA (Baidu) has 42k questions and answers. [link](https://arxiv.org/pdf/1607.06275.pdf)
