Introducing a Massive New Multilingual Dataset for Machine Translation

6 / 5 / 2024

A recent paper by researchers from several European universities introduced the HPLT (High Performance Language Technologies) dataset for language modeling and machine translation. The dataset combines monolingual and bilingual corpora derived from web crawls by the Internet Archive and CommonCrawl, a first at this scale.

According to the team, these resources are among the largest open text corpora ever released, providing invaluable material for language modeling and machine translation training.

The researchers describe their acquisition and processing methods, which rely on open-source software tools and high-performance computing, and have made everything publicly available on GitHub as a model for others.

The monolingual collection covers 75 languages, many of them low- to medium-resourced, and comprises 5.25 billion documents. The parallel corpus is mainly English-centric: it offers 18 language pairs with over 96 million aligned sentence pairs, with a specific emphasis on low-resource languages to bolster MT development.

The researchers also created a synthetic dataset by pivoting existing parallel datasets through English, yielding 171 language pairs and 157 million sentence pairs. All datasets include metadata that users can filter on. In addition to the datasets, the researchers have released initial MT models and large language models (LLMs) along with their training pipelines.
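
To make the pivoting idea concrete, here is a minimal sketch of how two English-centric corpora can be joined on their shared English side to produce a new, non-English language pair. It assumes exact matching on the English sentence, which is a simplification and not necessarily how the HPLT team built their synthetic corpus. The function name `pivot_through_english` and the toy sentences are invented for illustration.

```python
from collections import defaultdict


def pivot_through_english(en_to_a, en_to_b):
    """Join two English-centric corpora on their shared English side.

    en_to_a, en_to_b: iterables of (english, translation) pairs.
    Returns a sorted list of unique (a, b) pairs whose English
    sentences match exactly (after stripping surrounding whitespace).
    """
    # Index corpus B by its English sentence.
    b_by_english = defaultdict(set)
    for english, b_sentence in en_to_b:
        b_by_english[english.strip()].add(b_sentence)

    # For every sentence in corpus A, look up B-side translations
    # that share the same English pivot sentence.
    pivoted = set()
    for english, a_sentence in en_to_a:
        for b_sentence in b_by_english.get(english.strip(), ()):
            pivoted.add((a_sentence, b_sentence))
    return sorted(pivoted)


# Toy data, invented for illustration only (not taken from HPLT).
en_fi = [("Good morning.", "Hyvää huomenta."), ("Thank you.", "Kiitos.")]
en_is = [("Good morning.", "Góðan daginn."), ("See you later.", "Sjáumst.")]

fi_is = pivot_through_english(en_fi, en_is)
print(fi_is)
```

Only the sentence that appears on the English side of both corpora produces a pivoted pair, which is why a pivoted dataset depends on how much the English sides overlap.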

This release reflects how central machine learning has become to translation, both in improving existing translation services and in shaping where the field goes next.
