Reveal Help Center

MBERT Explained

MBERT

Google uses BERT and MBERT to better understand the language context of user searches.

What is BERT and MBERT?

BERT is the recent Deep Learning technology that was developed at Google to work specifically with natural language data. BERT stands for Bidirectional Encoder Representations from Transformers. This is the first Deep Learning technology that essentially learns a language. BERT can understand the meaning of words from their context, similar to how humans understand language. Google uses BERT to go beyond key words and better understand the context of user searches.

MBERT stands for multilingual BERT and is the next step in creating models that understand the meaning of words in the context. MBERT is a deep learning model that was trained on 104 languages simultaneously and encodes the knowledge of all 104 languages together. MBERT allows us to build models that can be applied out-of-the-box to data in 104 languages.

Why is BERT and MBERT Important for eDiscovery?

BERT can understand the meaning of words from their context, similar to how humans understand language. BERT achieves state-of-the-art performance on over 11 natural language understanding tasks, further confirming the power of its level of language understanding.

AI models that are based on MBERT need to be trained in one language to be able to score documents in any of the supported languages. The underlying MBERT model was trained on large amounts of data in 104 languages simultaneously and encodes the combined knowledge of these languages.

BERT and MBERT are specifically exciting for legal and compliance teams because it has outperformed other classification techniques on a variety of tasks. Many of those tasks are similar to the model building process in eDiscovery and in proactive solutions. The Reveal AI Data Science team confirmed that BERT-based models achieve an improvement in the F1 score over traditional classification in a variety of tests.

Languages Supported by MBERT
  • Afrikaans

  • Albanian

  • Arabic

  • Aragonese

  • Armenian

  • Asturian

  • Azerbaijani

  • Bashkir

  • Basque

  • Bavarian

  • Belarusian

  • Bengali

  • Bishnupriya Manipuri

  • Bosnian

  • Breton

  • Bulgarian

  • Burmese

  • Catalan

  • Cebuano

  • Chechen

  • Chinese (Simplified)

  • Chinese (Traditional)

  • Chuvash

  • Croatian

  • Czech

  • Danish

  • Dutch

  • English

  • Estonian

  • Finnish

  • French

  • Galician

  • Georgian

  • German

  • Greek

  • Gujarati

  • Haitian

  • Hebrew

  • Hindi

  • Hungarian

  • Icelandic

  • Ido

  • Indonesian

  • Irish

  • Italian

  • Japanese

  • Javanese

  • Kannada

  • Kazakh

  • Kirghiz

  • Korean

  • Latin

  • Latvian

  • Lithuanian

  • Lombard

  • Low Saxon

  • Luxembourgish

  • Macedonian

  • Malagasy

  • Malay

  • Malayalam

  • Marathi

  • Minangkabau

  • Mongolian

  • Nepali

  • Newar

  • Norwegian (Bokmal)

  • Norwegian (Nynorsk)

  • Occitan

  • Persian (Farsi)

  • Piedmontese

  • Polish

  • Portuguese

  • Punjabi

  • Romanian

  • Russian

  • Scots

  • Serbian

  • Serbo-Croatian

  • Sicilian

  • Slovak

  • Slovenian

  • South Azerbaijani

  • Spanish

  • Sundanese

  • Swahili

  • Swedish

  • Tagalog

  • Tajik

  • Tamil

  • Tatar

  • Telugu

  • Thai

  • Turkish

  • Ukrainian

  • Urdu

  • Uzbek

  • Vietnamese

  • Volapük

  • Waray-Waray

  • Welsh

  • West Frisian

  • Western Punjabi

  • Yoruba