Multilingual Support in SearchAI
SearchAI's multilingual capabilities let users work with the platform in their preferred language, which makes interactions more intuitive and accessible. With this feature, users can:
- Add and manage content in multiple languages.
- Submit queries and receive responses in supported languages.
- Get search results and answers in the same language as the query.
Note: SearchAI supports 100+ languages, enabling global accessibility. It can work with any language supported by the underlying LLM and vector generation model, provided you use the Text Extraction Strategy and Vector Retrieval method.
Configuration Guidance by Language
Certain modules within SearchAI are language-sensitive and may require different strategies or models depending on the language used. The sections below provide current support details for some of the most widely used languages across key components:
- Extraction Strategies
- Vector Configuration Models
- Retrieval Strategies
- Answer Generation Models
Use this guidance to ensure your multilingual setup is aligned with the most effective techniques for each language or model.
Language-Specific Extraction Capabilities
The table below outlines the supported content extraction methods for several widely used languages, helping you choose the most effective approach for processing multilingual content.
Language | Text Extraction | Layout Aware Extraction | Image Extraction | Advanced HTML Extraction | Markdown Extraction |
English | ✅ | ✅ | ✅ | ✅ | ✅ |
Ukrainian | ✅ | ✅ | ✅ | ✅ | ✅ |
Japanese | ✅ | ❌ | ❌ | ❌ | ❌ |
Spanish | ✅ | ❌ | ✅ | ❌ | ✅ |
Russian | ✅ | ❌ | ✅ | ❌ | ✅ |
Afrikaans | ✅ | ❌ | ❌ | ❌ | ❌ |
Albanian | ✅ | ❌ | ❌ | ❌ | ❌ |
Amharic | ✅ | ❌ | ❌ | ❌ | ❌ |
Arabic | ✅ | ❌ | ❌ | ❌ | ❌ |
Armenian | ✅ | ❌ | ❌ | ❌ | ❌ |
Assamese | ✅ | ❌ | ❌ | ❌ | ❌ |
Azerbaijani | ✅ | ❌ | ❌ | ❌ | ❌ |
Basque | ✅ | ❌ | ❌ | ❌ | ❌ |
Belarusian | ✅ | ❌ | ❌ | ❌ | ❌ |
Bengali | ✅ | ❌ | ❌ | ❌ | ❌ |
Bosnian | ✅ | ❌ | ❌ | ❌ | ❌ |
Bulgarian | ✅ | ❌ | ❌ | ❌ | ❌ |
Burmese | ✅ | ❌ | ❌ | ❌ | ❌ |
Catalan | ✅ | ❌ | ❌ | ❌ | ❌ |
Cebuano | ✅ | ❌ | ❌ | ❌ | ❌ |
Chinese | ✅ | ❌ | ❌ | ❌ | ✅ |
Corsican | ✅ | ❌ | ❌ | ❌ | ❌ |
Croatian | ✅ | ❌ | ❌ | ❌ | ❌ |
Czech | ✅ | ❌ | ❌ | ❌ | ❌ |
Danish | ✅ | ❌ | ❌ | ❌ | ❌ |
Dutch | ✅ | ❌ | ❌ | ❌ | ❌ |
Esperanto | ✅ | ❌ | ❌ | ❌ | ❌ |
Estonian | ✅ | ❌ | ❌ | ❌ | ❌ |
Finnish | ✅ | ❌ | ❌ | ❌ | ❌ |
French | ✅ | ❌ | ❌ | ❌ | ❌ |
Frisian | ✅ | ❌ | ❌ | ❌ | ❌ |
Galician | ✅ | ❌ | ❌ | ❌ | ❌ |
Georgian | ✅ | ❌ | ❌ | ❌ | ❌ |
German | ✅ | ❌ | ✅ | ✅ | ✅ |
Greek | ✅ | ❌ | ❌ | ❌ | ❌ |
Gujarati | ✅ | ❌ | ❌ | ❌ | ❌ |
Haitian Creole | ✅ | ❌ | ❌ | ❌ | ❌ |
Hebrew | ✅ | ❌ | ❌ | ❌ | ❌ |
Hindi | ✅ | ❌ | ❌ | ❌ | ❌ |
Hungarian | ✅ | ❌ | ❌ | ❌ | ✅ |
Icelandic | ✅ | ❌ | ❌ | ❌ | ❌ |
Indonesian | ✅ | ❌ | ❌ | ❌ | ❌ |
Irish | ✅ | ❌ | ❌ | ❌ | ❌ |
Italian | ✅ | ❌ | ❌ | ❌ | ❌ |
Javanese | ✅ | ❌ | ❌ | ❌ | ❌ |
Kannada | ✅ | ❌ | ❌ | ❌ | ❌ |
Kazakh | ✅ | ❌ | ❌ | ❌ | ❌ |
Khmer | ✅ | ❌ | ❌ | ❌ | ❌ |
Korean | ✅ | ❌ | ❌ | ❌ | ❌ |
Kurdish | ✅ | ❌ | ❌ | ❌ | ❌ |
Kyrgyz | ✅ | ❌ | ❌ | ❌ | ❌ |
Latin | ✅ | ❌ | ❌ | ❌ | ❌ |
Latvian | ✅ | ❌ | ❌ | ❌ | ❌ |
Lithuanian | ✅ | ❌ | ❌ | ❌ | ❌ |
Luxembourgish | ✅ | ❌ | ❌ | ❌ | ❌ |
Macedonian | ✅ | ❌ | ❌ | ❌ | ❌ |
Malagasy | ✅ | ❌ | ❌ | ❌ | ❌ |
Malay | ✅ | ❌ | ❌ | ❌ | ❌ |
Malayalam | ✅ | ❌ | ❌ | ❌ | ❌ |
Maltese | ✅ | ❌ | ❌ | ❌ | ❌ |
Marathi | ✅ | ❌ | ❌ | ❌ | ❌ |
Mongolian | ✅ | ❌ | ❌ | ❌ | ❌ |
Nepali | ✅ | ❌ | ❌ | ❌ | ❌ |
Norwegian | ✅ | ❌ | ❌ | ❌ | ❌ |
Odia (Oriya) | ✅ | ❌ | ❌ | ❌ | ❌ |
Persian (Farsi) | ✅ | ❌ | ❌ | ❌ | ❌ |
Polish | ✅ | ❌ | ❌ | ❌ | ❌ |
Portuguese | ✅ | ❌ | ❌ | ❌ | ❌ |
Punjabi | ✅ | ❌ | ❌ | ❌ | ❌ |
Romanian | ✅ | ❌ | ❌ | ❌ | ❌ |
Scots Gaelic | ✅ | ❌ | ❌ | ❌ | ❌ |
Serbian | ✅ | ❌ | ❌ | ❌ | ❌ |
Sinhala | ✅ | ❌ | ❌ | ❌ | ❌ |
Slovak | ✅ | ❌ | ❌ | ❌ | ❌ |
Slovenian | ✅ | ❌ | ❌ | ❌ | ❌ |
Somali | ✅ | ❌ | ❌ | ❌ | ❌ |
Sundanese | ✅ | ❌ | ❌ | ❌ | ❌ |
Swahili | ✅ | ❌ | ❌ | ❌ | ❌ |
Swedish | ✅ | ❌ | ❌ | ❌ | ❌ |
Tagalog | ✅ | ❌ | ❌ | ❌ | ❌ |
Tajik | ✅ | ❌ | ❌ | ❌ | ❌ |
Tamil | ✅ | ❌ | ❌ | ❌ | ❌ |
Telugu | ✅ | ❌ | ❌ | ❌ | ❌ |
Thai | ✅ | ❌ | ❌ | ❌ | ❌ |
Tibetan | ✅ | ❌ | ❌ | ❌ | ❌ |
Turkish | ✅ | ❌ | ❌ | ❌ | ❌ |
Turkmen | ✅ | ❌ | ❌ | ❌ | ❌ |
Uyghur | ✅ | ❌ | ❌ | ❌ | ❌ |
Urdu | ✅ | ❌ | ❌ | ❌ | ❌ |
Uzbek | ✅ | ❌ | ❌ | ❌ | ❌ |
Vietnamese | ✅ | ❌ | ❌ | ❌ | ❌ |
Welsh | ✅ | ❌ | ❌ | ❌ | ❌ |
Yiddish | ✅ | ❌ | ❌ | ❌ | ❌ |
Yoruba | ✅ | ❌ | ❌ | ❌ | ❌ |
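The table above can be turned into a small lookup when you want to check a language's extraction options programmatically. The following is a minimal sketch, not part of SearchAI: the names `EXTRA_SUPPORT` and `supported_extraction` are hypothetical, and the sketch assumes that any language without an entry supports Text Extraction only, which matches the rows above.

```python
# Strategy names mirror the column headers of the table.
TEXT = "Text Extraction"
LAYOUT = "Layout Aware Extraction"
IMAGE = "Image Extraction"
HTML = "Advanced HTML Extraction"
MARKDOWN = "Markdown Extraction"

# Hypothetical lookup: only the rows that support more than Text Extraction.
EXTRA_SUPPORT = {
    "English": {LAYOUT, IMAGE, HTML, MARKDOWN},
    "Ukrainian": {LAYOUT, IMAGE, HTML, MARKDOWN},
    "German": {IMAGE, HTML, MARKDOWN},
    "Spanish": {IMAGE, MARKDOWN},
    "Russian": {IMAGE, MARKDOWN},
    "Chinese": {MARKDOWN},
    "Hungarian": {MARKDOWN},
}


def supported_extraction(language: str) -> set[str]:
    """Return the extraction strategies marked as supported for a language.

    Assumption: languages without an entry fall back to Text Extraction only.
    """
    return {TEXT} | EXTRA_SUPPORT.get(language, set())


if __name__ == "__main__":
    for lang in ("German", "Japanese"):
        print(lang, sorted(supported_extraction(lang)))
```

Keeping only the exceptions in the lookup keeps it short, since most rows in the table support Text Extraction alone.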
Language-Specific Vector Generation Support
Vector generation model support varies by language. Use the following models for optimal performance:
- English: MPNet, E5, BGE-M3, LaBSE
- Non-English Languages: BGE-M3 and LaBSE (recommended)
Note: BGE-M3 supports a wide range of languages. Its training data includes many commonly spoken languages; however, performance may be lower for low-resource or underrepresented languages.
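The model guidance above can be expressed as a simple selection rule. This is an illustrative sketch only: the function names are hypothetical, and `models_for_corpus` assumes that a single vector model must serve every language in the same index, which the guidance above does not state explicitly.

```python
# Model lists taken from the guidance above.
ENGLISH_MODELS = ["MPNet", "E5", "BGE-M3", "LaBSE"]
MULTILINGUAL_MODELS = ["BGE-M3", "LaBSE"]


def candidate_vector_models(language: str) -> list[str]:
    """Return the recommended vector generation models for one language."""
    if language.strip().lower() == "english":
        return list(ENGLISH_MODELS)
    return list(MULTILINGUAL_MODELS)


def models_for_corpus(languages: list[str]) -> list[str]:
    """Return models recommended for every language in a mixed corpus.

    Assumption: one model is shared by all content in the index.
    """
    shared = list(ENGLISH_MODELS)
    for language in languages:
        allowed = candidate_vector_models(language)
        shared = [model for model in shared if model in allowed]
    return shared


if __name__ == "__main__":
    print(models_for_corpus(["English", "German"]))
```

For a corpus that mixes English with any other language, the rule narrows the choice to the two multilingual models.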
Language-Specific Retrieval Strategy Support
- English: Vector Retrieval and Hybrid Retrieval
- Non-English: Vector Retrieval
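A configuration check based on the two rules above might look like the following. It is a hedged sketch rather than SearchAI's own validation: `retrieval_strategies` and `check_retrieval` are hypothetical helper names, and the strategy strings simply reuse the labels from this section.

```python
VECTOR = "Vector Retrieval"
HYBRID = "Hybrid Retrieval"


def retrieval_strategies(language: str) -> list[str]:
    """Return the retrieval strategies listed as supported for a language."""
    if language.strip().lower() == "english":
        return [VECTOR, HYBRID]
    return [VECTOR]


def check_retrieval(language: str, strategy: str) -> None:
    """Raise ValueError if the strategy is not listed for the language."""
    allowed = retrieval_strategies(language)
    if strategy not in allowed:
        raise ValueError(
            f"{strategy!r} is not listed for {language}; use one of {allowed}"
        )


if __name__ == "__main__":
    check_retrieval("English", HYBRID)  # passes silently
    try:
        check_retrieval("Japanese", HYBRID)
    except ValueError as err:
        print(err)
```

Running a check like this before deployment catches a Hybrid Retrieval setting on non-English content early.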
Supported Answer Generation Models
Answer generation quality depends on the language capabilities of the underlying LLM. While many languages are technically supported, performance may vary.
Recommendation
Always check the list of languages that your LLM supports. If the LLM supports your language but response quality is inconsistent, define custom prompts to improve accuracy and relevance, or evaluate other LLMs that better support your target language.