Summary:
We first examined the representation of the 20 languages in M-BERT by deriving language
identity representations from 1000 labeled corpora. The high performance of the language
identification model in distinguishing the languages in M-BERT (mean F1 score 0.999)
indicated that BERT models encode strong language-specific information during
pretraining. We then tested M-BERT's capability of differentiating between pairs of
languages. By feeding the model prompts that pair the name of a language with a token
drawn from one of the two languages, we used the model's output probability to
determine which language the input token belonged to. This is effectively a language
disambiguation task, and it provides a measure of the model's ability to
differentiate between, and understand, pairs of languages. This simple disambiguation setup,
combined with the model's probability judgments, could serve as a test revealing which
distinctions the model is capable of making for any given pair of languages.
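The disambiguation decision rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log-probability table is a hypothetical stand-in for the scores M-BERT would assign when given a prompt naming a language together with a candidate token, and the function names (`model_logprob`, `disambiguate`) are invented for this sketch.

```python
import math

# Hypothetical stand-in for M-BERT output probabilities. In the actual setup,
# each entry would come from feeding the model a prompt containing the language
# name and the token, then reading off the model's output probability.
TOY_LOGPROBS = {
    ("English", "house"): math.log(0.08),
    ("German",  "house"): math.log(0.002),
    ("English", "Haus"):  math.log(0.001),
    ("German",  "Haus"):  math.log(0.05),
}

def model_logprob(language: str, token: str) -> float:
    """Log-probability the model assigns to `token` under a prompt naming `language`."""
    return TOY_LOGPROBS[(language, token)]

def disambiguate(token: str, lang_a: str, lang_b: str) -> str:
    """Assign `token` to whichever of the two languages the model scores higher."""
    if model_logprob(lang_a, token) >= model_logprob(lang_b, token):
        return lang_a
    return lang_b

print(disambiguate("Haus", "English", "German"))   # → German
print(disambiguate("house", "English", "German"))  # → English
```

Aggregating the accuracy of such binary decisions over many tokens for a language pair yields the pairwise differentiation measure the task is meant to capture.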