Commit b56af3a

Merge pull request #2 from didi/master: update
2 parents afa140d + d0c02af

14 files changed: +288 −57 lines

docs/co-reference_resolution.md

Lines changed: 32 additions & 5 deletions

@@ -24,13 +24,12 @@ Average of F1-scores returned by these three precision/recall metrics:
- B-cubed.
- Entity-based CEAF.
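The averaging rule above can be sketched in a few lines. This is only an illustration of how the final number is formed; actual evaluations use the official reference scorer:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_score(muc, b_cubed, ceaf_e):
    """Average of the F1 scores of MUC, B-cubed, and entity-based CEAF.
    Each argument is a (precision, recall) pair."""
    return sum(f1(p, r) for p, r in (muc, b_cubed, ceaf_e)) / 3.0
```

For example, `conll_score((0.8, 0.7), (0.75, 0.65), (0.7, 0.6))` combines the three metric results into the single reported number.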
## <span class="t">CoNLL 2012 Co-reference task</span>.

CoNLL 2012 introduced a co-reference task in Chinese.
- http://conll.cemantix.org/2012/introduction.html

Data for this evaluation is part of OntoNotes, distributed by the Linguistic Data Consortium (LDC).
- https://catalog.ldc.upenn.edu/LDC2013T19
| Test set | # of co-referring mentions | Genre |

@@ -47,16 +46,44 @@ Scoring code: https://github.com/conll/reference-coreference-scorers

| System | Average F1 of MUC, B-cubed, CEAF |
| --- | --- |
| [Kong & Jian (2019)](https://www.ijcai.org/Proceedings/2019/700) | 63.85 |
| [Clark & Manning (2016b)](https://nlp.stanford.edu/static/pubs/clark2016deep.pdf) | 63.88 |
| [Clark & Manning (2016a)](https://nlp.stanford.edu/static/pubs/clark2016improving.pdf) | 63.66 |
### Resources

Data for this evaluation is part of OntoNotes, distributed by the Linguistic Data Consortium (LDC).
- https://catalog.ldc.upenn.edu/LDC2013T19

---
## <span class="t">Subtask: zero pronoun resolution (CoNLL 2012 / OntoNotes 5.0)</span>.

### Metrics

F1 score computed on resolution hits ([Zhao & Ng 2007](https://www.aclweb.org/anthology/D07-1057.pdf)).
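Following the Zhao & Ng (2007) definition, the metric can be sketched as below (a minimal illustration, not the official evaluation script):

```python
def resolution_f1(hits, num_predicted, num_gold):
    """F1 over resolution hits: precision = hits / predicted zero pronouns,
    recall = hits / gold anaphoric zero pronouns."""
    precision = hits / num_predicted if num_predicted else 0.0
    recall = hits / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```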
### Results

| System | Overall F1 (w/ gold syntactic info) | Overall F1 (w/o gold syntactic info) |
| --- | --- | --- |
| [Aloraini & Poesio (2020)](https://www.aclweb.org/anthology/2020.lrec-1.11/) | 63.5 | |
| [Song et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.482/) | 58.5 | 26.1 |
| [Yang et al. (2019)](https://www.aclweb.org/anthology/W19-4108/) | 58.1 | |
| [Yin et al. (2018)](https://www.aclweb.org/anthology/C18-1002/) | 57.3 | |
| [Liu et al. (2017)](https://www.aclweb.org/anthology/P17-1010/) | 55.3 | |
| [Yin et al. (2017)](https://www.aclweb.org/anthology/D17-1135/) | 54.9 | 22.7 |

### Resources

Training and testing are performed on the train and dev splits of OntoNotes 5.0, respectively (statistics reported by [Yin et al. (2018)](https://www.aclweb.org/anthology/C18-1002/)).

| Split | Documents | Sentences | Words | Anaphoric Zero Pronouns |
| --- | --- | --- | --- | --- |
| Train | 1,391 | 36,487 | 756K | 12,111 |
| Dev | 172 | 6,083 | 110K | 1,713 |
**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**

docs/dialogue_state_management.md

Lines changed: 38 additions & 2 deletions

@@ -110,8 +110,44 @@ This task aims at tracking the dialog state defined as a frame structure filled
| Train | English | 35 | 31,304 |
| Dev | Chinese | 2 | 3,130 |
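The frame structure being tracked can be pictured as a nested mapping of domain → slot → value. A toy sketch follows; the domain and slot names here are hypothetical and not any dataset's actual schema:

```python
# Toy dialog state: domain -> slot -> value (None = unfilled).
state = {
    "hotel": {"price range": None, "rating": None},
    "taxi": {"destination": None},
}

def update_state(state, domain, slot, value):
    """Return a new state with one slot filled after a user turn."""
    new_state = {d: dict(slots) for d, slots in state.items()}
    new_state[domain][slot] = value
    return new_state

# After the user says something like "a cheap hotel, please":
state = update_state(state, "hotel", "price range", "cheap")
```

A tracker is scored on how often this full frame matches the gold annotation after each turn.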
## <span class="t">CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset</span>.
The first large-scale Chinese cross-domain Wizard-of-Oz task-oriented dataset.
* 5 domains: hotel, restaurant, attraction, metro, taxi
* Annotation of dialog states and dialog acts on both user and system sides
* About 60% of dialogs have cross-domain user goals
* A rule-based user simulator is also provided for evaluation
### Links
* [GitHub](https://github.com/thu-coai/CrossWOZ)
* [arXiv](https://arxiv.org/abs/2002.11893)

### Overview
| CrossWOZ | |
| --- | --- |
| Language | Chinese with English translations |
| Speakers | Human-to-Human |
| \# Domains | 5 |
| \# Slots | 72 |
| \# Values | 7,871 |
### Datasets
| Split | Train | Valid | Test |
| --- | --- | --- | --- |
| \# Dialogues | 5,012 | 500 | 500 |
| \# Turns (utterances) | 84,692 | 8,458 | 8,476 |
| Vocab | 12,502 | 5,202 | 5,143 |
| Avg. user sub-goals | 3.24 | 3.26 | 3.26 |
| Avg. turns | 16.9 | 16.9 | 17.0 |
| Avg. tokens per turn | 16.3 | 16.3 | 16.2 |

### Example data
A sample dialogue (names of hotels are replaced by A, B, and C for simplicity):

![example](https://github.com/thu-coai/CrossWOZ/blob/master/example.png)

### Benchmark Model Evaluations
![result](https://github.com/thu-coai/CrossWOZ/blob/master/result.png)
---

**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**
docs/entity_linking.md

Lines changed: 4 additions & 1 deletion

@@ -55,7 +55,10 @@ NERC F-score

| System | TAC-KBP / EDL 2015<br/>Names | TAC-KBP / EDL 2016<br/>Names and nominals | TAC-KBP / EDL 2017<br/>Names and nominals |
| --- | --- | --- | --- |
| [Sil et al. (2018)](https://arxiv.org/abs/1712.01813) | 84.4 | | |
| [Pan et al. (2020)](https://www.aclweb.org/anthology/D19-6107.pdf) | 84.2 | | |
| [Pan et al. (2020)](https://www.aclweb.org/anthology/D19-6107.pdf) | 81.2 (unsup) | | |
| Best anonymous system in shared task writeup | 76.9 | 76.2 | 67.8 |

### Resources
docs/entity_tagging.md

Lines changed: 13 additions & 4 deletions

@@ -82,7 +82,9 @@ A standard train/dev/test split does not seem to be available. Authors frequent
### Results

| System | F-score |
| --- | --- |
| [Wang et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.525/) | 81.7 |
| [Huang et al. (2020)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0235796) | 81.7 |
| [Wang & Lu (2018)](https://arxiv.org/pdf/1810.01808.pdf) | 73.00 |
| [Ju et al. (2018)](http://www.aclweb.org/anthology/N18-1131) | 72.25 |
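The F-scores in these tables are conventionally computed over exact entity spans. A minimal sketch, assuming entities are represented as (start, end, type) triples:

```python
def entity_f1(gold_entities, predicted_entities):
    """Span-level precision/recall/F1: a predicted entity counts as correct
    only if its (start, end, type) triple exactly matches a gold entity."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A wrong type or a boundary off by one character counts as both a false positive and a false negative under this convention.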
@@ -103,7 +105,11 @@ Paper summarizing the bakeoff:

| System | F-score |
| --- | --- |
| [Liu et al. (2020)](https://arxiv.org/pdf/1909.07606.pdf) | 95.7 |
| [Meng et al. (2019)](https://arxiv.org/abs/1901.10125) | 95.5 |
| [Ma et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.528.pdf) | 95.4 |
| [Sun et al. (2020)](https://arxiv.org/pdf/1907.12412.pdf) | 95.0 |
| [Yan et al. (2020)](https://ieeexplore.ieee.org/abstract/document/9141551) | 94.1 |
| [Liu et al. (2019)](https://www.aclweb.org/anthology/N19-1247) | 93.74 |
| [Sui et al. (2019)](https://www.aclweb.org/anthology/D19-1396/) | 93.47 |
| [Gui et al. (2019)](https://www.aclweb.org/anthology/D19-1096/) | 93.46 |
@@ -135,11 +141,14 @@ Using the test split by http://www.aclweb.org/anthology/E17-2113:

| System | F-score (name mentions) | F-score (nominal mentions) | F-score (Overall) |
| --- | --- | --- | --- |
| [Ma et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.528.pdf) | 70.9 | 67.0 | 70.5 |
| [Meng et al. (2019)](https://arxiv.org/abs/1901.10125) | 67.6 | | |
| [Hu and Zheng (2020)](https://www.jstage.jst.go.jp/article/transinf/E103.D/7/E103.D_2019EDP7253/_pdf/-char/ja) | 56.4 | | |
| [Sui et al. (2019)](https://www.aclweb.org/anthology/D19-1396/) | 56.45 | 68.32 | 63.09 |
| [Gui et al. (2019)](https://www.aclweb.org/anthology/D19-1096/) | 55.34 | 64.98 | 60.21 |
| [Liu et al. (2019)](https://www.aclweb.org/anthology/N19-1247) | 52.55 | 67.41 | 59.84 |
| [Zhu (2019)](https://www.aclweb.org/anthology/N19-1342) | 55.38 | 62.98 | 59.31 |
| [Zhang & Yang (2018)](http://aclweb.org/anthology/P18-1144) | 53.04 | 62.25 | 58.79 |
| [Peng & Dredze (2015)](https://www.cs.jhu.edu/~npeng/papers/golden_horse_supplement.pdf) | 55.28 | 62.97 | 58.99 |

### Resources

docs/machine_translation.md

Lines changed: 56 additions & 9 deletions

@@ -35,8 +35,9 @@ The United States and China may soon reach a trade agreement.
* BLEU-SBP ([Chiang et al 08](http://aclweb.org/anthology/D08-1064)). Addresses decomposability problems with BLEU, proposing a cross between BLEU and word error rate.
* HTER. Returns the number of edits performed by a human post-editor to get an automatic translation into good shape.
## <span class="t">ZH-EN</span>.

### <span class="t">WMT</span>.

The Second Conference on Machine Translation (WMT17) has a Chinese/English MT component, done in cooperation with CWMT 2017.
* [Website](http://www.statmt.org/wmt17)

@@ -78,7 +79,7 @@ English - Chinese (WMT17)

There are many parallel English/Chinese text resources to train MT systems on. These are publicly available:

| Dataset | Size (words on English side) | Genre |
| --- | --- | --- |
| UN | 327m | Political |
| News Commentary v12 | 5m | News opinions |
@@ -90,7 +91,7 @@ The Linguistic Data Consortium has additional resources, such as FBIS and NIST t

### <span class="t">NIST</span>.

NIST has a long history of supporting Chinese-English translation by creating annual test sets and running annual NIST OpenMT evaluations during the 2000s. Many sites have reported results on NIST test sets.
@@ -122,6 +123,7 @@ Note that this [paper](http://www.lrec-conf.org/proceedings/lrec2018/pdf/678.pdf
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [[Zhang et al 2019]](https://arxiv.org/abs/1906.02448) | 1.25m | mteval-v11b | | 48.31 | 49.40 | 48.72 | 48.45 | | 48.72 |
| [[Hadiwinoto & Ng, 2018]](http://www.lrec-conf.org/proceedings/lrec2018/pdf/678.pdf) | 7.65m | mteval-v13a | 46.94 | 47.58 | 49.13 | 47.78 | 49.37 | 41.48 | 47.05 |
| [[Yang et al, 2020]](https://www.aclweb.org/anthology/2020.acl-main.531/) | 1.2m | unspecified | | 46.56 | | 46.04 | | 37.53 | |
| [[Meng et al 2019]](https://arxiv.org/pdf/1901.10125.pdf) | 1.25m | unspecified | 40.56 (dev) | 39.93 | 41.54 | 38.01 | 37.45 | 29.07 | 37.76 |
| [[Ma et al 2018c]](https://arxiv.org/abs/1805.04871) | 1.25m | unspecified | 39.77 (dev) | 38.91 | 40.02 | 36.82 | 35.93 | 27.61 | 36.51 |
| [[Chen et al 2017]](http://aclweb.org/anthology/P17-1177) | 1.6m | multibleu | 36.57 | 35.64 | 36.63 | 34.35 | 30.57 | | |
@@ -132,13 +134,13 @@ The Linguistic Data Consortium provides training materials typically used for NI

### <span class="t">IWSLT 2015</span>.

* Translation of TED talks
* Chinese-to-English track
* [Shared task overview](https://cris.fbk.eu/retrieve/handle/11582/303031/9811/main.pdf)

| Dataset | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| tst2014 | 1,068 | 12 | TED talks |
| tst2015 | 1,080 | 12 | TED talks |

@@ -165,7 +167,7 @@ English to Chinese (tst2015)

### Resources

| Dataset | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| Train | 210k | 1,718 | TED talks |
@@ -201,8 +203,9 @@ English to Chinese
[The Multitarget TED Talks Task (MTTT)](http://cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/)

## <span class="t">ZH-JA</span>.

### <span class="t">Workshop on Asian Translation</span>.

[The Workshop on Asian Translation](http://lotus.kuee.kyoto-u.ac.jp/WAT/) has run since 2014. Here, we include the 2018 Chinese/Japanese evaluations.
@@ -245,15 +248,59 @@ Participants must get data from [here](http://lotus.kuee.kyoto-u.ac.jp/WAT/paten

### Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Japanese-Chinese train | 250,000 | Patents |
| Japanese-Chinese dev | 2,000 | Patents |
| Japanese-Chinese devtest | 2,000 | Patents |
### <span class="t">IWSLT 2020 ZH-JA Open Domain Translation</span>.

[The shared task](http://iwslt.org/doku.php?id=open_domain_translation) aims to promote research on translation between Asian languages, the exploitation of noisy parallel web corpora for MT, and smart processing of data and provenance.

### Metrics
* 4-gram character BLEU.
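A minimal sentence-level sketch of character BLEU follows. It is illustrative only: the shared task uses an official scorer, corpus-level BLEU aggregates n-gram counts over all segments before combining, and the smoothing here is a simplification:

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_bleu(hypothesis, reference, max_n=4):
    """Sentence-level 4-gram character BLEU with brevity penalty,
    smoothed so that a missing n-gram order does not zero the score."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = char_ngrams(hypothesis, n)
        ref_counts = char_ngrams(reference, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        log_precision_sum += math.log(max(overlap, 1e-9) / total)
    brevity = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1.0 - len(reference) / max(len(hypothesis), 1))
    return brevity * math.exp(log_precision_sum / max_n)
```

Working over characters rather than words sidesteps segmentation differences between Chinese and Japanese tokenizers.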
A (secret) mixed-genre test set was intended to cover a variety of topics. The test data was selected from high-quality (human-translated) parallel web content authored between January and March 2020.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| Chinese-Japanese | 875 | mixed-genre |
| Japanese-Chinese | 875 | mixed-genre |
### Results

Chinese to Japanese

| System | BLEU |
| --- | --- |
| [CASIA*](https://www.aclweb.org/anthology/2020.iwslt-1.15/) | 43.0 |
| [Xiaomi](https://www.aclweb.org/anthology/2020.iwslt-1.18/) | 34.3 |
| [TSUKUBA](https://www.aclweb.org/anthology/2020.iwslt-1.17/) | 33.0 |

Japanese to Chinese

| System | BLEU |
| --- | --- |
| [CASIA*](https://www.aclweb.org/anthology/2020.iwslt-1.15/) | 55.8 |
| [Samsung Research China](https://www.aclweb.org/anthology/2020.iwslt-1.12/) | 34.0 |
| [OPPO](https://www.aclweb.org/anthology/2020.iwslt-1.13/) | 32.9 |

\* indicates a system that collected external parallel training data which inadvertently overlapped with the blind test set.

### Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Web crawled | 18,966,595 | mixed-genre |
| Existing parallel sources | 1,963,238 | mixed-genre |

## <span class="t">Others</span>.

### <span class="t">CWMT</span>.

[CWMT 2017](http://ee.dlut.edu.cn/CWMT2017/index_en.html)
and [2018](http://www.cipsc.org.cn/cwmt/2018/english/)

docs/multi-task_learning.md

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
# Chinese Multi-task Learning

## Background

Multi-task learning aims to learn multiple tasks simultaneously while maximizing performance on one or all of the tasks.

## Standard metrics

- Accuracy of classification.
- Exact Match.
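These two metrics can be sketched as follows (a simple illustration; leaderboards use their own evaluation scripts, which may normalize answers differently):

```python
def accuracy(predictions, labels):
    """Classification accuracy: fraction of predictions equal to the label."""
    assert len(predictions) == len(labels)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

def exact_match(predictions, references):
    """Exact Match: fraction of predictions identical to the reference
    after trimming surrounding whitespace."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```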
## <span class="t">CLUE</span>.

The [Chinese Language Understanding Evaluation Benchmark](https://arxiv.org/abs/2004.05986) (CLUE)
is a benchmark for evaluating model performance across a diverse range of existing natural language understanding tasks. Models are ranked by their average score across all tasks.
* [GitHub](https://github.com/CLUEbenchmark/CLUE)
* [Website](https://www.cluebenchmarks.com/)

The state-of-the-art results can be seen on the public [CLUE leaderboard](https://www.cluebenchmarks.com/rank.html).

---

**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**
docs/pos_tagging.md

Lines changed: 6 additions & 2 deletions

@@ -45,7 +45,9 @@ F1 score calculated from word-level precision and word-level recall computed fro

| System | F1 score |
| --- | --- |
| [Tian et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.735.pdf) | 96.92 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | 96.61 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (BERT) | 96.06 |
| [Shao et al. (2017)](http://www.aclweb.org/anthology/I17-1018) | 94.38 |

### Resources
@@ -75,7 +77,9 @@ F1 score calculated from word-level precision and word-level recall computed fro

| System | F1 score |
| --- | --- |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | 96.14 |
| [Tian et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.735.pdf) | 95.69 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (BERT) | 94.79 |
| [Shao et al. (2017)](http://www.aclweb.org/anthology/I17-1018) | 89.75 |

### Resources

docs/question_answering.md

Lines changed: 27 additions & 2 deletions

@@ -87,8 +87,10 @@ NLPCC DBQA 2016

| System | MRR | F1 |
| --- | --- | --- |
| [ERNIE 2.0](https://arxiv.org/pdf/1907.12412.pdf) | 95.8 | 85.8 |
| [Meng et al. (2019)](https://arxiv.org/pdf/1901.10125.pdf) (Glyce + BERT) | - | 83.4 |
| [ERNIE (Baidu)](https://arxiv.org/pdf/1904.09223.pdf) | 95.1 | 82.7 |
| [BERT](https://arxiv.org/pdf/1810.04805.pdf) | 94.6 | 80.8 |
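MRR, used in the DBQA rankings above, averages the reciprocal rank of the first correct answer for each question. A minimal sketch:

```python
def mean_reciprocal_rank(questions):
    """Each element of `questions` is a ranked list of booleans marking
    whether the candidate answer at that rank is correct."""
    if not questions:
        return 0.0
    total = 0.0
    for labels in questions:
        for rank, correct in enumerate(labels, start=1):
            if correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(questions)
```

For example, one question answered correctly at rank 2 and one at rank 1 gives (1/2 + 1/1) / 2 = 0.75.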
NLPCC DBQA 2017
@@ -104,6 +106,29 @@ NLPCC DBQA 2017

## <span class="t">Machine Reading Comprehension (MRC) tasks from the CLUE benchmark</span>.

[CLUE](https://github.com/CLUEbenchmark/CLUE) is a Chinese Language Understanding Evaluation benchmark.
Machine Reading Comprehension (MRC) is a task in which a machine reads and understands unstructured text and then answers questions about it.
The MRC corpus in CLUE consists of three datasets: **CMRC 2018** [(Cui et al.)](https://www.aclweb.org/anthology/D19-1600.pdf), **ChID** [(Zheng et al.)](https://www.aclweb.org/anthology/P19-1075.pdf), and **C<sup>3</sup>** [(Sun et al.)](https://arxiv.org/pdf/1904.09679.pdf).

### Metrics
* Exact Match (CMRC 2018)
* Accuracy (ChID and C<sup>3</sup>)

### Results
| System | CMRC 2018 | ChID | C<sup>3</sup> |
| --- | --- | --- | --- |
| [HUMAN (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 92.40 | 87.10 | 96.00 |
| [RoBERTa-wwm-ext-large (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 76.58 | 85.37 | 72.32 |
| [BERT-base (CLUE origin)](https://github.com/CLUEbenchmark/CLUE) | 69.72 | 82.04 | 64.50 |

### Resources

[CLUE benchmark](https://www.cluebenchmarks.com/)
## Other resources.

* WebQA (Baidu) has 42k questions and answers. [link](https://arxiv.org/pdf/1607.06275.pdf)
