docs/language_modeling.md: 21 additions & 0 deletions
@@ -77,6 +77,27 @@ These numbers are not comparable, given different training conditions.
|[Huang et al, 2010 [GW v2]](http://www.imaging.org/site/PDFS/Reporter/Articles/2010_25/Rep25_2_EI2010_HUANG.pdf)| -- | 220.6 | 610m chars, random 11m for test. MSR segmenter. |
| Neural Lattice Models [v5][Buckman+Neubig, 2018](https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00036)| 32.19 | -- |*Guangming Daily subset, top 10k chars + UNK, length <150. 934k lines train, 30k line test. Data [here](https://github.com/jbuckman/neural-lattice-language-models). |

### Other Resources

## <span class="t">Common Crawl Data</span>

[Common Crawl](https://commoncrawl.org) has released enormous quantities of web-crawled data that can be mined for Chinese text. Several groups have built their own pipelines to extract and filter it.

The CLUE Organization extracted "CLUECorpus2020" (also called "C5") from the Common Crawl data. It contains 100 GB of raw text with 35 billion Chinese characters and is intended as a large-scale corpus for pre-training Chinese language models. See the preprint by [Xu, Zhang, and Dong](https://arxiv.org/abs/2003.01355v2).
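
As a rough illustration of the kind of filtering such a pipeline might apply, the sketch below keeps only lines that are mostly Chinese characters. It is not the CLUE pipeline: the function names, the 0.5 threshold, and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are all assumptions made for this example.

```python
def is_chinese_char(ch: str) -> bool:
    """True if ch is in the basic CJK Unified Ideographs block (U+4E00-U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Chinese characters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_chinese_char(ch) for ch in chars) / len(chars)


def filter_chinese_lines(lines, min_ratio=0.5):
    """Yield stripped, non-empty lines whose Chinese ratio is at least min_ratio.

    min_ratio is a hypothetical threshold chosen for illustration only.
    """
    for line in lines:
        line = line.strip()
        if line and chinese_ratio(line) >= min_ratio:
            yield line


sample = ["今天天气很好。", "Hello, world!", "价格 price 100 元"]
kept = list(filter_chinese_lines(sample))
```

With this sample, only the first line passes the threshold; the mixed-language line is dropped because most of its characters are Latin letters or digits. A real pipeline would likely add deduplication and quality filters on top of a language check like this one.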

## <span class="t">CLUECorpusSmall</span>

Publicly available data, collected at https://github.com/CLUEbenchmark/CLUECorpus2020 and https://github.com/brightmart/nlp_chinese_corpus.
Includes:

1. Wikipedia (wiki2019zh): 1 million well-structured Chinese entries
2. News corpus (news2016zh): 2.5 million news articles, including keywords and descriptions
3. Baike Q&A (baike2018qa): 1.5 million question-and-answer pairs
4. Community Q&A, JSON version (webtext2019zh): 4.1 million high-quality community questions and answers, suitable for training large models
5. Translation corpus (translation2019zh): 5.2 million Chinese-English sentence pairs