The AI Today
Microsoft AI Team Unveils NaturalSpeech 2: A Cutting-Edge TTS System with Latent Diffusion Models for Powerful Zero-Shot Voice Synthesis and Enhanced Expressive Prosodies

July 27, 2023 (Updated: July 27, 2023)
The goal of text-to-speech (TTS) is to generate high-quality, diverse speech that sounds as if real people spoke it. Prosody, speaker identity (such as gender, accent, and timbre), speaking and singing styles, and more all contribute to the richness of human speech. TTS systems have improved tremendously in intelligibility and naturalness as neural networks and deep learning have progressed; some systems (such as NaturalSpeech) have even reached human-level voice quality on single-speaker recording-studio benchmark datasets.

Because they lack diversity, earlier speaker-limited recording-studio datasets were insufficient to capture the wide variety of speaker identities, prosodies, and styles in human speech. However, TTS models can be trained on a large corpus to learn these variations and then generalize to unlimited unseen scenarios using few-shot or zero-shot techniques. Quantizing the continuous speech waveform into discrete tokens and modeling these tokens with autoregressive language models is common in today’s large-scale TTS systems.

New research by Microsoft introduces NaturalSpeech 2, a TTS system that uses latent diffusion models to deliver expressive prosody, good robustness, and, most crucially, strong zero-shot capability for voice synthesis. The researchers began by training a neural audio codec that uses a codec encoder to transform a speech waveform into a sequence of latent vectors and a codec decoder to restore the original waveform. They then use a diffusion model to generate these latent vectors, conditioned on prior vectors obtained from a phoneme encoder, a duration predictor, and a pitch predictor.
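
How phoneme-level outputs might be turned into frame-level conditioning for the diffusion model can be sketched as follows. This is a minimal illustration, assuming a FastSpeech-style length regulator in which each phoneme vector is repeated for its predicted number of frames and a per-frame pitch value is appended. The function names `expand_by_duration` and `attach_pitch`, the vector layout, and all numeric values are hypothetical; the article does not describe the paper's exact conditioning mechanism.

```python
from typing import List, Sequence


def expand_by_duration(phoneme_states: Sequence[List[float]],
                       durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme-level vector for its predicted number of frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


def attach_pitch(frames: Sequence[List[float]],
                 pitch: Sequence[float]) -> List[List[float]]:
    """Append one pitch value to each frame-level vector (assumed layout)."""
    if len(frames) != len(pitch):
        raise ValueError("need exactly one pitch value per frame")
    return [frame + [f0] for frame, f0 in zip(frames, pitch)]


# Toy inputs: three phonemes with 2-dimensional hidden states.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # hypothetical duration-predictor output, in frames
pitch = [110.0, 112.0, 0.0, 220.0, 218.0, 215.0]  # hypothetical pitch, one per frame

frames = expand_by_duration(phoneme_states, durations)
condition = attach_pitch(frames, pitch)
print(len(condition), condition[2])
```

In a full system, a sequence like `condition` would be the prior that guides the diffusion model toward latent vectors of the same length, which the codec decoder then turns back into a waveform.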

The following are some of the design choices discussed in the paper:

  • Continuous vectors instead of discrete tokens. In prior work, speech is typically quantized with many residual quantizers to guarantee the quality of the neural codec’s speech reconstruction. This burdens the acoustic model (an autoregressive language model) heavily because the resulting discrete token sequence is very long. The team instead uses continuous vectors, which shorten the sequence and carry more information for accurate, fine-grained speech reconstruction.
  • Replacing autoregressive models with diffusion models.
  • Learning in context through speech prompting mechanisms. The team developed speech prompting mechanisms to promote in-context learning in the diffusion model and the pitch/duration predictors, improving zero-shot capability by encouraging the diffusion model to follow the characteristics of the speech prompt.
  • NaturalSpeech 2 is more robust and stable than its autoregressive predecessors, and it requires only a single acoustic model (the diffusion model) instead of two-stage token prediction. In addition, its duration/pitch prediction and non-autoregressive generation allow it to extend to styles other than speech (such as a singing voice).
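
The trade-off behind the first design choice can be illustrated with a toy residual quantizer. This is a sketch under simplifying assumptions, not the codec used in the paper: real codecs quantize vectors with learned codebooks, whereas here each stage quantizes a single scalar with a small hand-written codebook. The names `rvq_encode` and `token_count`, the codebook values, and the frame count are all hypothetical.

```python
from typing import List, Sequence, Tuple


def nearest(codebook: Sequence[float], x: float) -> int:
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def rvq_encode(x: float,
               codebooks: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    tokens: List[int] = []
    residual = x
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual -= codebook[idx]
    return tokens, residual


def token_count(num_frames: int, num_quantizers: int) -> int:
    """Discrete tokens an acoustic model must predict for one utterance."""
    return num_frames * num_quantizers


# Toy codebooks: each stage covers a finer range than the one before it.
codebooks = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]

for stages in range(1, len(codebooks) + 1):
    tokens, residual = rvq_encode(0.64, codebooks[:stages])
    print(stages, tokens, abs(residual))

# Hypothetical utterance of 100 frames encoded with 8 quantizers.
print(token_count(100, 8))
```

Each added quantizer shrinks the reconstruction error but also adds one more token per frame, so the token sequence grows linearly with the number of quantizers. A continuous latent vector per frame avoids that multiplication.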

To demonstrate the efficacy of these design choices, the researchers trained NaturalSpeech 2 with 400M model parameters on 44K hours of speech data. They then used it to synthesize speech in zero-shot scenarios (with only a few seconds of speech prompt) across diverse speaker identities, prosodies, and styles (e.g., singing). The findings show that NaturalSpeech 2 outperforms prior strong TTS systems and generates natural speech in zero-shot conditions. Its prosody is more similar to both the speech prompt and the ground-truth speech. It also achieves comparable or better naturalness (in terms of CMOS) than the ground-truth speech on the LibriTTS and VCTK test sets. The experimental results also show that it can generate singing voices in a novel timbre from a short singing prompt or, interestingly, from only a speech prompt, unlocking truly zero-shot singing synthesis.
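
For readers unfamiliar with the metric, a comparative mean opinion score (CMOS) can be computed roughly as sketched below. The sketch assumes the common convention in which each listener compares a synthesized sample against a reference and rates it on a 7-point scale from -3 (much worse) to +3 (much better); the article does not state the exact protocol used in the paper. The function name `cmos` and the ratings are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def cmos(ratings: Sequence[int]) -> float:
    """Average of comparative ratings, each an integer in [-3, 3]."""
    if not ratings:
        raise ValueError("need at least one rating")
    for rating in ratings:
        if not -3 <= rating <= 3:
            raise ValueError(f"rating {rating} is outside the -3..3 scale")
    return sum(ratings) / len(ratings)


# Hypothetical listener ratings of synthesized speech versus ground truth.
ratings = [1, 0, -1, 2, 0]
score = cmos(ratings)
print(score)
```

Under this convention, a score near zero means listeners judged the two systems about equally natural, and a positive score favors the system under test.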

In the future, the team plans to investigate efficient strategies, such as consistency models, to speed up the diffusion model, and to explore extensive speaking and singing voice training to enable stronger mixed speaking/singing capabilities.


Check out the Paper and Project Page.




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.

