Merge pull request #19 from amittai/master

Huang17 · web-flow · commit d0c02afa98ee · 2020-07-27T18:52:23.000-07:00
Language Modeling update with a new Common-Crawl derived open source corpus for future use.
diff --git a/docs/language_modeling.md b/docs/language_modeling.md
@@ -77,6 +77,27 @@ These numbers are not comparable, given different training conditions.
 |  [Huang et al, 2010 [GW v2]](http://www.imaging.org/site/PDFS/Reporter/Articles/2010_25/Rep25_2_EI2010_HUANG.pdf) | -- | 220.6 | 610m chars, random 11m for test. MSR segmenter. |
 |  Neural Lattice Models [v5] [Buckman+Neubig, 2018](https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00036) | 32.19 | -- | *Guangming Daily subset, top 10k chars + UNK, length <150. 934k lines train, 30k line test. Data [here](https://github.com/jbuckman/neural-lattice-language-models). |
 
+### Other Resources
+
+## <span class="t">Common Crawl Data</span>
+
+[Common Crawl](https://commoncrawl.org) has released enormous quantities of web-crawled data that can be mined for Chinese text. Several groups have built their own pipelines to do the extraction and filtering.
+
+The CLUE Organization extracted "CLUECorpus2020" (also called "C5") from the Common Crawl data. It contains 100 GB of raw text with 35 billion Chinese characters.
+It is intended as a large-scale corpus for pre-training Chinese language models.
+It is described in a preprint by [Xu, Zhang, and Dong](https://arxiv.org/abs/2003.01355v2).
+
+## <span class="t">CLUECorpusSmall</span>
+
+CLUECorpusSmall is publicly available data collected at [CLUEbenchmark/CLUECorpus2020](https://github.com/CLUEbenchmark/CLUECorpus2020) and [brightmart/nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus).
+It includes:
+
+1. Wikipedia (wiki2019zh), 1 million well-structured Chinese entries
+2. News corpus (news2016zh), 2.5 million pieces of news, including keywords and descriptions
+3. Baike Q&A (baike2018qa), 1.5 million question-and-answer pairs
+4. Community Q&A, JSON version (webtext2019zh), 4.1 million high-quality community questions and answers, suitable for training large models
+5. Translation corpus (translation2019zh), 5.2 million Chinese-English sentence pairs
+
 ---
 
 **Suggestions? Changes? Please send email to [chinesenlp.xyz@gmail.com](mailto:chinesenlp.xyz@gmail.com)**