Cohere has launched two open-weight AI models under its Aya project, aiming to narrow the language gap in foundation models. Named Aya Expanse 8B and Aya Expanse 32B, the models are available on Hugging Face. The 8B model is designed to make multilingual AI research more accessible to researchers worldwide, while the 32B model offers leading multilingual support across 23 languages.
The Aya initiative, led by Cohere for AI, the company’s research arm, began last year to broaden access to AI in languages beyond English. In February, it debuted Aya 101, a 13-billion-parameter language model supporting 101 languages, accompanied by the Aya dataset to bolster language diversity in model training. The new Aya Expanse models build on Aya 101’s foundation and incorporate further advances aimed at stronger multilingual performance.
According to Cohere’s blog, the Aya Expanse models excelled on multilingual benchmarks, outperforming similarly sized models from competitors such as Google, Mistral, and Meta. Specifically, the Aya Expanse 32B model surpassed Gemma 2 27B, Mixtral 8x22B, and even Meta’s much larger Llama 3.1 70B in multilingual assessments, while the smaller 8B model outperformed Gemma 2 9B, Llama 3.1 8B, and Ministral 8B.
Cohere credits the success of Aya Expanse to data arbitrage, a data sampling method aimed at reducing reliance on synthetic data, since such reliance often yields nonsensical outputs in low-resource languages. Unlike models trained primarily on English-centric data, Aya Expanse focuses on “global preferences,” incorporating a range of cultural and linguistic contexts to improve model accuracy and safety. Cohere highlighted the difficulty of extending preference training to multilingual settings, where safety protocols built on Western-centric data tend to overlook non-English languages. Cohere says its approach accounts for cultural nuances, which improves performance across languages.
The Aya initiative prioritizes research in non-English LLMs to address the data scarcity in languages other than English. Cohere’s work joins efforts by others in the field, such as OpenAI, which recently launched its Multilingual Massive Multitask Language Understanding Dataset on Hugging Face. This dataset evaluates LLMs across 14 languages, including Arabic, German, Swahili, and Bengali, aiming to enhance non-English LLM capabilities.
Cohere has been active in expanding its AI offerings. Alongside Aya Expanse, the company recently introduced image search capabilities to its Embed 3 model, used in retrieval-augmented generation (RAG) systems, and improved fine-tuning for its Command R 08-2024 model.
Featured image courtesy of Reuters