@@ -69,7 +69,7 @@ Note that the English result is worse than the 84.2 MultiNLI baseline because
 this training used Multilingual BERT rather than English-only BERT. This implies
 that for high-resource languages, the Multilingual model is somewhat worse than
 a single-language model. However, it is not feasible for us to train and
-maintain dozens of single-language model . Therefore, if your goal is to maximize
+maintain dozens of single-language models. Therefore, if your goal is to maximize
 performance with a language other than English or Chinese, you might find it
 beneficial to run pre-training for additional steps starting from our
 Multilingual model on data from your language of interest.
@@ -152,11 +152,9 @@ taken as the training data for each language
 However, the size of the Wikipedia for a given language varies greatly, and
 therefore low-resource languages may be "under-represented" in terms of the
 neural network model (under the assumption that languages are "competing" for
-limited model capacity to some extent).
-
-However, the size of a Wikipedia also correlates with the number of speakers of
-a language, and we also don't want to overfit the model by performing thousands
-of epochs over a tiny Wikipedia for a particular language.
+limited model capacity to some extent). At the same time, we also don't want
+to overfit the model by performing thousands of epochs over a tiny Wikipedia
+for a particular language.

 To balance these two factors, we performed exponentially smoothed weighting of
 the data during pre-training data creation (and WordPiece vocab creation). In
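
For context on the "exponentially smoothed weighting" this hunk refers to: the idea is to raise each language's raw data proportion to a power S < 1 and re-normalize, which boosts low-resource languages relative to raw Wikipedia size. The sketch below is illustrative only; the exponent value 0.7 and the corpus sizes are assumptions for demonstration, not values taken from this change.

```python
# Sketch: exponentially smoothed sampling weights for multilingual pre-training data.
# Raw proportions P(L) are exponentiated by a factor S < 1 and re-normalized,
# which shifts probability mass from high-resource to low-resource languages.
# The corpus sizes and S = 0.7 below are illustrative assumptions.

def smoothed_sampling_probs(corpus_sizes, s=0.7):
    """Map raw corpus sizes to exponentially smoothed sampling probabilities."""
    total = sum(corpus_sizes.values())
    raw = {lang: size / total for lang, size in corpus_sizes.items()}  # P(L)
    powered = {lang: p ** s for lang, p in raw.items()}                # P(L)^S
    norm = sum(powered.values())
    return {lang: p / norm for lang, p in powered.items()}             # re-normalize

if __name__ == "__main__":
    # Hypothetical Wikipedia sizes (millions of tokens) for a few languages.
    sizes = {"en": 2500, "ru": 500, "th": 30, "is": 5}
    total = sum(sizes.values())
    for lang, p in smoothed_sampling_probs(sizes).items():
        print(f"{lang}: raw {sizes[lang] / total:.3f} -> smoothed {p:.3f}")
```

In this toy example English drops from roughly 82% of the raw data to about 72% of the sampled data, while the smallest language is sampled several times more often than its raw share, without thousands of epochs over its tiny corpus.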