
Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages
24.9.2024 21:00:00 CEST | Business Wire | Press release
Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.
This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240924609911/en/
(Photo: Business Wire)
GigaSpeech 2 Overview
GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, and multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains 30,000 hours of automatically transcribed audio, covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The test sets labeled by Dataocean AI for Thai and Indonesian, each consist of 10 hours, while the development sets are 10 hours for Thai and Indonesian. The team have also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, achieving performance comparable to commercial speech recognition services.
Dataset Construction
The construction process of GigaSpeech 2 has also been open-sourced. This is an automated process for building large-scale speech recognition datasets from vast amounts of unlabeled audio available on the internet. The automated process involves data crawling, transcription, alignment, and refinement. Initially, Whisper is used for preliminary transcription, followed by forced alignment with TorchAudio to produce GigaSpeech 2 raw through multi-dimensional filtering. The dataset is then refined iteratively using an improved Noisy Student Training (NST) method, enhancing the quality of pseudo-labels through repeated iterations, ultimately resulting in GigaSpeech 2 refined.
GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.
Training Set Details
GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:
- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
Development and Test Set Details
Dataocean AI’s COO - Ke Li, who is also one of the paper's authors, has led GigaSpeech 2 test sets project. With nearly 20 years of project experience, the team has contributed in Thai and Indonesian with word accuracy of over 97%. Besides those two East Asian languages, Dataocean AI’s team can also cover over 200 languages and dialects around the world. The company offer 1600+ high-quality off-the-shelf datasets are applicable for multiple scenarios such as Generative AI, Autonomous driving, Smart home, Customer services and etc., fulfilling the evolving needs of the AI industry.
Experimental Results
We conducted a comparative evaluation of speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison was carried out in Thai, Indonesian, and Vietnamese languages. Performance evaluation was based on three test sets: GigaSpeech 2, Common Voice 17.0, and FLEURS, using Character Error Rate (CER) or Word Error Rate (WER) as metrics. The results indicate:
Thai: Our model demonstrated exceptional performance, surpassing all competitors, including commercial interfaces from Microsoft and Google. Notably, our model achieved this significant result while having only one-tenth the number of parameters compared to Whisper large-v3.
Indonesian and Vietnamese: Our system exhibited competitive performance compared to existing baseline models in both Indonesian and Vietnamese languages.
Resource Links
The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2
The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2
The preprint paper is available at:
https://arxiv.org/pdf/2406.11546
Dataocean AI website:
https://www.dataoceanai.com
View source version on businesswire.com: https://www.businesswire.com/news/home/20240924609911/en/
Contacts
contact@dataoceanai.com
About Business Wire
Business Wire
24 Martin Lane
EC4R 0DR London
+44 20 7626 1982http://www.businesswire.co.uk
(c) 2018 Business Wire, Inc., All rights reserved.
Business Wire, a Berkshire Hathaway company, is the global leader in multiplatform press release distribution.

Subscribe to releases from Business Wire
Subscribe to all the latest releases from Business Wire by registering your e-mail address below. You can unsubscribe at any time.
Latest releases from Business Wire
H.I.G. Capital Announces the Sale of DGS S.p.A.11.6.2024 12:00:00 CEST | Press release
H.I.G. Capital (“H.I.G.”), a leading global alternative investment firm with $62 billion of capital under management, is pleased to announce that an affiliate has signed a definitive agreement to sell its portfolio company, DGS S.p.A. (“DGS” or the “Group”), a leading firm in the Italian Information Technology market, to DGS Co-Founders and management team in partnership with ICG, a global alternative asset manager. Since its inception in 1997, DGShas supported blue-chip customers in the design, integration, and maintenance of complex IT systems, with a specialization in digital transformation and cybersecurity services. The Group currently has over 1,900 employees, revenues of approximately €300 million, and maintains a group of highly loyal clientele. During H.I.G.’s ownership, DGS has tripled in size and consolidated its position as a leading Italian firm in cybersecurity services and digital transformation. DGS offers its clients sophisticated and proprietary digital transformation
Evertas Names Nick Selby Head of European Underwriting11.6.2024 12:00:00 CEST | Press release
Evertas, the world’s first crypto insurance company, has named Nick Selby as its new Head of European Underwriting. This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240611141887/en/ Nick Selby, Executive Vice President and Head of European Underwriting at Evertas (Photo: Business Wire) Selby, an accomplished information and physical security professional, brings two decades of expertise in public and private sector information security, physical security, and complex incident handling, as well as seven years of experience leading teams securing billions of dollars in cryptoassets. Previously, his roles included VP of the Software Assurance Practice at Trail of Bits, Chief Security Officer at Paxos Trust Company, and Director of Cyber Intelligence and Investigations at the NYPD Intelligence Bureau. “Nick is an extremely valuable addition to our European team,” said Evertas CEO and Co-Founder J. Gdanski. “His public and private
Owlet utvider globalt fotavtrykk med lanseringen av medisinsk-sertifisert Dream Sock™ i Storbritannia og over hele Europa11.6.2024 11:00:00 CEST | Pressemelding
Owlet, Inc. («Owlet» or the «Company») (NYSE:OWLT), pioneren innen smart spedbarnsovervåking, kunngjør i dag den britiske og europeiske lanseringen av Dream Sock. Dette er en smart babymonitor med levende helseavlesninger og varsler for friske spedbarn mellom 0-18 måneder og 2,5-13,6 kg. Dette innovative medisinske utstyret gir foreldre helse og viktig informasjon i sanntid, noe som gir uovertruffen trygghet. Denne pressemeldingen inneholder multimedia. Se hele pressemeldingen her: https://www.businesswire.com/news/home/20240611820341/no/ (Photo: Business Wire) «Vi er svært stolte over å lansere Dream Sock til omsorgspersoner over hele Storbritannia og Europa og gi millioner av foreldre mer trygghet mens babyen sover,» sa Kurt Workman, Owlets administrerende direktør og medgründer. «Dream Sock er nå et globalt produkt som er anerkjent som medisinsk nøyaktig og trygt, etter å ha gjennomgått regulatoriske autorisasjoner og sertifiseringer innenfor flere geografier. I dag er misjonen vår
V-Nova Surpasses 1000 Patent Milestone in Media Technology Innovation11.6.2024 10:00:00 CEST | Press release
V-Nova, a leading provider of data compression solutions, video compression technology, XR technology, AI acceleration and parallel processing for a multitude of industries including media and entertainment, today announced its milestone achievement of 1000 active technology patents. This accomplishment underscores V-Nova’s dedication to research and development and its commitment to protecting its intellectual property globally. This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240611724561/en/ V-Nova’s patent portfolio spans more than 50 different jurisdictions. Including over 400 patents in Europe, over 200 in the Americas, over 100 in the United States specifically, and over 200 in Asia. V-Nova forged new directions in data processing to enhance digital experiences, maximize efficiency, reduce costs, and increase sustainability. The company leads the way with key international data compression standards for the video indust
Alipay+ Reveals Top Scorer Trophy Design for UEFA EURO 2024™11.6.2024 09:24:00 CEST | Press release
Alipay+, a suite of cross-border mobile payment and digitalization technology solutions operated by Ant International and an Official Partner of UEFA EURO 2024™, today revealed the trophy that will be awarded to the most prolific marksman at the UEFA EURO 2024™ finale on July 14 in Berlin, Germany. This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240610328619/en/ The UEFA Top Scorer Trophy presented by Alipay+ is unveiled for UEFA EURO 2024™ (Photo: Business Wire) Sculpted in the shape of the Chinese character “支” (pronounced zhi, and meaning payment as well as support), the trophy reflects Alipay+’s dedication to supporting consumers to enjoy seamless payment and a broad choice of deals using their preferred payment methods while traveling abroad. The character also resembles the fleeting moment of a barefooted striker poised to shoot, evoking the original beauty and power of football – a game that united people across the wo