Commit d0c02af

Merge pull request #19 from amittai/master
Language Modeling update with a new Common-Crawl derived open source corpus for future use.
2 parents cddb958 + 49e2132 commit d0c02af

1 file changed

+21
-0
lines changed

docs/language_modeling.md

Lines changed: 21 additions & 0 deletions
@@ -77,6 +77,27 @@ These numbers are not comparable, given different training conditions.
| [Huang et al, 2010 [GW v2]](http://www.imaging.org/site/PDFS/Reporter/Articles/2010_25/Rep25_2_EI2010_HUANG.pdf) | -- | 220.6 | 610m chars, random 11m for test. MSR segmenter. |
| Neural Lattice Models [v5] [Buckman+Neubig, 2018](https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00036) | 32.19 | -- | *Guangming Daily subset, top 10k chars + UNK, length <150. 934k lines train, 30k line test. Data [here](https://github.com/jbuckman/neural-lattice-language-models). |

### Other Resources

## <span class="t">Common Crawl Data</span>

[Common Crawl](https://commoncrawl.org) has released enormous quantities of web-crawled data that can be mined for Chinese text. Several groups have built their own pipelines to do the extraction and filtering.

The CLUE Organization extracted "CLUECorpus2020" (also called "C5") from the Common Crawl data. It contains 100 GB of raw text with 35 billion Chinese characters.
It is intended as a large-scale corpus for pre-training Chinese language models.
See the preprint by [Xu, Zhang, and Dong](https://arxiv.org/abs/2003.01355v2).
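
The extraction and filtering pipelines mentioned above are not described in this document. Purely as an illustration of what one filtering step might look like, the sketch below keeps lines whose non-whitespace characters are mostly Chinese. The function names, the 0.5 ratio threshold, the minimum length, and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are all assumptions made for this example; this is not the method used by CLUE or any other group.

```python
def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in U+4E00..U+9FFF.

    Assumption: only the basic CJK Unified Ideographs block counts as Chinese;
    punctuation and extension blocks are ignored for simplicity.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return han / len(chars)


def filter_chinese_lines(lines, min_ratio=0.5, min_chars=5):
    """Keep stripped lines that are long enough and mostly Chinese.

    min_ratio and min_chars are hypothetical defaults, not published values.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if len(line) >= min_chars and chinese_ratio(line) >= min_ratio:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "这是一个中文句子。",
        "This is English only.",
        "中文 mixed with English text here",
        "",
    ]
    for line in filter_chinese_lines(sample):
        print(line)
```

A real pipeline would typically add deduplication, language identification, and quality filtering on top of a simple character-ratio check like this one.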

## <span class="t">CLUECorpusSmall</span>

This is publicly available data, collected at https://github.com/CLUEbenchmark/CLUECorpus2020 and https://github.com/brightmart/nlp_chinese_corpus.
It includes:

1. Wikipedia (wiki2019zh), 1 million well-structured Chinese entries
2. News corpus (news2016zh), 2.5 million news articles, including keywords and descriptions
3. Baike 2018qa (baike2018qa), 1.5 million question-and-answer pairs
4. Community Q&A json version (webtext2019zh), 4.1 million high-quality community questions and answers, suitable for training large models
5. Translation corpus (translation2019zh), 5.2 million Chinese-English sentence pairs

---

**Suggestions? Changes? Please send email to [[email protected]](mailto:[email protected])**
