From Sound to Sight: Meet AudioToken for Audio-to-Picture Synthesis

June 22, 2023


Neural generative models have transformed the way we consume digital content. They can generate high-quality images, maintain coherence over long spans of text, and even produce speech and audio. Among the different approaches, diffusion-based generative models have gained prominence and have shown promising results across a variety of tasks.

During the diffusion process, the model learns to map a predefined noise distribution to the target data distribution. At each step, the model predicts the noise and uses that prediction to move the signal toward the target distribution. Diffusion models can operate on different forms of data representation, such as raw inputs or latent representations.
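The reverse process described above can be sketched as a simple DDPM-style sampling loop. This is a toy illustration only: `predict_noise` is a hypothetical placeholder standing in for the trained denoising network, and the linear noise schedule is an assumption, not the paper's configuration.

```python
import numpy as np

def predict_noise(x_t, t):
    # Placeholder for the trained denoising network (hypothetical):
    # a real model would predict the noise added at step t.
    return np.zeros_like(x_t)

def reverse_diffusion(shape, timesteps=50, seed=0):
    """Toy DDPM-style sampling: start from pure Gaussian noise and
    iteratively remove the predicted noise at each step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, timesteps)      # assumed noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)                  # x_T ~ N(0, I)
    for t in reversed(range(timesteps)):
        eps = predict_noise(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])   # posterior mean estimate
        if t > 0:                                   # no noise at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

sample = reverse_diffusion((4, 4))
print(sample.shape)  # (4, 4)
```

The same loop applies whether `x` lives in pixel space or in the latent space of an autoencoder, which is the distinction between raw-input and latent diffusion.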

State-of-the-art models such as Stable Diffusion, DALL-E, and Midjourney have been developed for text-to-image synthesis. Although interest in X-to-Y generation has grown recently, audio-to-image models have not yet been deeply explored.


The motivation for using audio signals rather than text prompts lies in the natural interconnection between images and audio in video. By contrast, although text-based generative models can produce remarkable images, textual descriptions are not inherently linked to the image, meaning they typically have to be added manually. Audio signals can additionally represent complex scenes and objects, such as different variations of the same instrument (e.g., classic guitar, acoustic guitar, electric guitar, etc.) or different views of the same object (e.g., a classic guitar recorded in a studio versus at a live show). Manually annotating such detailed information for distinct objects is labor-intensive, which makes scalability challenging.

Earlier studies have proposed several methods for generating images from audio inputs, primarily using a Generative Adversarial Network (GAN) to generate images conditioned on audio recordings. However, there are notable distinctions between those works and the proposed method. Some focused exclusively on generating MNIST digits and did not extend their approach to general audio sounds. Others did generate images from general audio but produced low-quality results.

To overcome the limitations of these studies, a deep learning model for audio-to-image generation has been proposed. An overview is depicted in the figure below.

The approach leverages a pre-trained text-to-image generation model and a pre-trained audio representation model, and learns an adaptation layer that maps between their outputs and inputs. Drawing on recent work on textual inversion, a dedicated audio token is introduced to map the audio representation into an embedding vector. This vector is then fed into the network as a continuous representation, acting as a new word embedding.
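Conceptually, the adaptation layer splices one learned "audio word" into the frozen text encoder's embedding sequence. The sketch below illustrates that idea under stated assumptions: the dimensions (`AUDIO_DIM`, `TEXT_DIM`), the single linear map, and all names are hypothetical, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (hypothetical): the audio encoder emits 768-d
# features; the frozen text encoder uses 1024-d token embeddings.
AUDIO_DIM, TEXT_DIM = 768, 1024

# The adaptation layer is the only trained component here: a linear
# map from audio-embedding space into text token-embedding space.
W = rng.standard_normal((AUDIO_DIM, TEXT_DIM)) * 0.02

def audio_token(audio_embedding):
    """Map a pooled audio representation to one pseudo word embedding."""
    return audio_embedding @ W

def build_prompt_embeddings(prompt_tokens, audio_embedding):
    """Append the audio token to the frozen text embeddings, as if it
    were the embedding of a new word in the prompt."""
    tok = audio_token(audio_embedding)
    return np.vstack([prompt_tokens, tok[None, :]])

prompt = rng.standard_normal((5, TEXT_DIM))  # e.g. embeddings of "a photo of a"
audio = rng.standard_normal(AUDIO_DIM)
seq = build_prompt_embeddings(prompt, audio)
print(seq.shape)  # (6, 1024)
```

Because both the text-to-image model and the audio encoder stay frozen, only the small adaptation map needs training, which keeps the method lightweight.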

The Audio Embedder uses a pre-trained audio classification network to capture the audio's representation. Typically, the last layer of such a discriminative network is used for classification, but it often discards important audio details that are irrelevant to the discriminative task. To address this, the approach combines earlier layers with the last hidden layer, resulting in a temporal embedding of the audio signal.
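One simple way to realize this layer fusion is to concatenate per-frame features from an earlier layer with those of the last hidden layer and project them back down. The sketch below is an illustrative assumption (shapes, concatenation strategy, and the placeholder projection weights are all hypothetical), not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: the pre-trained audio classifier yields per-frame
# hidden states of size 512 at every layer, over T = 100 frames.
T, H = 100, 512

# Placeholder projection weights (in practice these would be learned).
W_proj = rng.standard_normal((2 * H, H)) * 0.02

def temporal_audio_embedding(early_hidden, last_hidden):
    """Fuse an earlier layer (fine acoustic detail) with the last
    hidden layer (discriminative features) into a temporal embedding."""
    fused = np.concatenate([early_hidden, last_hidden], axis=-1)  # (T, 2H)
    return fused @ W_proj                                         # (T, H)

early = rng.standard_normal((T, H))
last = rng.standard_normal((T, H))
emb = temporal_audio_embedding(early, last)
print(emb.shape)  # (100, 512)
```

Keeping the time axis intact is the point: the embedding stays temporal, so details the classifier head would have discarded remain available downstream.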

Sample results produced by the presented model are reported below.

This was a summary of AudioToken, a novel audio-to-image (A2I) synthesis model. If you are interested, you can learn more about this technique in the links below.


Check out the Paper. Don't forget to join our 24k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com





Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.

