From Words to Worlds: Exploring Video Narration with AI Multi-Modal Fine-Grained Video Description

By Daniele Lorenzi · August 24, 2023 · 5 Mins Read


Language is the predominant mode of human interaction, offering more than just supplementary details to other faculties like sight and sound. It also serves as a proficient channel for transmitting information, such as using voice-guided navigation to steer us to a specific location. Visually impaired individuals, for instance, can experience a film by listening to its descriptive audio. The former demonstrates how language can enhance other sensory modes, while the latter highlights language's capacity to convey maximal information across modalities.

Modern efforts in multi-modal modeling attempt to establish connections between language and various other senses, encompassing tasks like captioning images or videos, generating textual representations from images or videos, manipulating visual content guided by text, and more.

Nonetheless, in these undertakings, language predominantly supplements information about other sensory inputs. Consequently, these endeavors often fail to comprehensively depict the intricate exchange of information between different sensory modes. They focus primarily on simplistic linguistic elements, such as one-sentence captions.

Given the brevity of these captions, they only manage to describe prominent entities and actions. As a result, the information conveyed through such captions is significantly restricted compared to the wealth of information present in other sensory modalities. This discrepancy leads to a notable loss of information when attempting to translate content from other sensory realms into language.

In this study, the researchers treat language as a medium for sharing information in multi-modal modeling. They introduce a new task called "Fine-grained Audible Video Description" (FAVD), which differs from conventional video captioning. Whereas short video captions typically refer only to the main elements, FAVD asks models to describe videos more the way people would: starting with a quick summary and then adding progressively more detailed information. This approach retains a substantially larger portion of the video's information within the language framework.

Since videos contain both visual and auditory signals, the FAVD task also incorporates audio descriptions to make the depiction comprehensive. To support this task, a new benchmark named the Fine-grained Audible Video Description Benchmark (FAVDBench) has been constructed for supervised training. FAVDBench is a collection of over 11,000 video clips from YouTube, curated across more than 70 real-life categories. Annotations include a concise one-sentence summary, followed by 4-6 detailed sentences about visual aspects and 1-2 sentences about audio, yielding a comprehensive dataset; a sketch of what one such annotation record might look like follows below.
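
The exact FAVDBench schema is not reproduced in this article, but based on the annotation structure described above, a single record plausibly resembles the following (all field names and values here are illustrative assumptions, not the published format):

```python
# Illustrative only: field names and structure are assumptions,
# not the published FAVDBench schema.
favd_annotation = {
    "video_id": "yt_abc123",           # hypothetical YouTube clip identifier
    "category": "street performance",  # one of the 70+ real-life categories
    "summary": "A violinist plays for a crowd in a subway station.",
    "visual_details": [                # 4-6 sentences on visual aspects
        "The violinist wears a red scarf and a long black coat.",
        "Commuters pause in a semicircle, some filming on phones.",
        "Fluorescent lights reflect off the tiled walls behind her.",
        "An open violin case with coins sits at her feet.",
    ],
    "audio_details": [                 # 1-2 sentences on the soundtrack
        "A lively violin melody rises over the low rumble of a train.",
    ],
}
```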

To effectively evaluate the FAVD task, two novel metrics have been devised. The first, termed EntityScore, evaluates the transfer of information from videos to descriptions by measuring the comprehensiveness of the entities mentioned in the visual descriptions. The second, AudioScore, quantifies the quality of the audio descriptions within the feature space of a pre-trained audio-visual-language model.
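
To make the entity-coverage idea behind EntityScore concrete, here is a minimal sketch. It is not the paper's exact formulation; it simply measures what fraction of reference entities reappear in a predicted description, using spaCy noun chunks as a stand-in for entity extraction:

```python
# A minimal sketch of an EntityScore-style coverage metric (assumed, not
# the paper's formulation). spaCy noun chunks stand in for entity extraction.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> set[str]:
    """Collect lower-cased noun-chunk head lemmas as a crude entity set."""
    return {chunk.root.lemma_.lower() for chunk in nlp(text).noun_chunks}

def entity_coverage(reference: str, prediction: str) -> float:
    """Fraction of reference entities recovered by the predicted description."""
    ref_entities = extract_entities(reference)
    if not ref_entities:
        return 0.0
    pred_entities = extract_entities(prediction)
    return len(ref_entities & pred_entities) / len(ref_entities)

reference = "A violinist in a red scarf plays beside an open violin case."
prediction = "A musician plays the violin; a case with coins sits nearby."
print(f"coverage = {entity_coverage(reference, prediction):.2f}")
```

A real metric would also need to handle synonyms and paraphrases (e.g., "violinist" vs. "musician"), which simple lexical overlap misses.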

The researchers also provide a baseline model for the newly introduced task. It builds upon an established end-to-end video captioning framework, supplemented by an additional audio branch. Furthermore, a visual-language transformer is expanded into an audio-visual-language transformer (AVLFormer). AVLFormer takes the form of an encoder-decoder structure, as depicted below.

Figure: AVLFormer architecture (source: https://arxiv.org/abs/2303.15616)

Visual and audio encoders are adapted to process the video clips and audio, respectively, enabling the fusion of multi-modal tokens. The visual encoder relies on the Video Swin Transformer, while the audio encoder exploits the patchout audio transformer. These components extract visual and audio features from the video frames and audio data. Other components, such as masked language modeling and auto-regressive language modeling, are incorporated during training. Taking inspiration from previous video captioning models, AVLFormer also employs textual descriptions as input, using a word tokenizer and a linear embedding to convert the text into a suitable format. The transformer processes this multi-modal information and outputs a finely detailed description of the input video.
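
As a rough illustration of how such a fusion could be wired, consider the PyTorch-style sketch below. This is not the authors' implementation: the Video Swin and patchout audio encoders are replaced by stand-in linear projections, and all dimensions are arbitrary placeholders.

```python
# Hypothetical sketch of AVLFormer-style multi-modal fusion, assuming
# pre-extracted visual/audio features; all dimensions are placeholders.
import torch
import torch.nn as nn

class ToyAVLFormer(nn.Module):
    def __init__(self, d_model=512, vocab_size=30522):
        super().__init__()
        # Stand-ins for the Video Swin and patchout audio encoders:
        self.visual_proj = nn.Linear(1024, d_model)  # frame features -> tokens
        self.audio_proj = nn.Linear(768, d_model)    # audio features -> tokens
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_feats, text_ids):
        # Concatenate visual and audio tokens into one multi-modal sequence.
        memory = torch.cat(
            [self.visual_proj(visual_feats), self.audio_proj(audio_feats)], dim=1
        )
        # Decode the description against the fused audio-visual memory.
        tgt = self.text_embed(text_ids)
        hidden = self.transformer(src=memory, tgt=tgt)
        return self.lm_head(hidden)  # per-token vocabulary logits

model = ToyAVLFormer()
logits = model(
    torch.randn(2, 32, 1024),          # 32 visual tokens per clip
    torch.randn(2, 16, 768),           # 16 audio tokens per clip
    torch.randint(0, 30522, (2, 20)),  # 20 description tokens so far
)
print(logits.shape)  # torch.Size([2, 20, 30522])
```

In training, the logits would feed the masked and auto-regressive language-modeling objectives mentioned above.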

Some examples of qualitative results and a comparison with state-of-the-art approaches are reported below.

Figure: qualitative results and comparison with state-of-the-art approaches (source: https://arxiv.org/abs/2303.15616)

In conclusion, the researchers propose FAVD, a new video captioning task for fine-grained audible video description, and FAVDBench, a novel benchmark for supervised training. Moreover, they design a new transformer-based baseline model, AVLFormer, to tackle the FAVD task. If you are interested and would like to learn more, please feel free to refer to the links cited below.


Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don't forget to join our 29k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter.



Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.




