paper review : Deep Audio-Visual learning: A Survey

Deep Audio-Visual learning: A Survey

Summary

Deep audio-visual learning can be divided into four directions: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. The paper surveys these directions, analyzes the methods in each, and collects the datasets used in this area.

1. Research Introduction

1.1 Background

Audio-visual learning (AVL), which explores the relationship between audio and visual information and uses both modalities together, has been introduced to overcome the limitations of perception tasks in each single modality.

1.2 Main research areas

1. Audio-visual separation and localization
It aims to separate specific sounds emanating from the corresponding objects and to localize each sound in the visual context, as illustrated in Fig. 1 (a). Most studies in this area focus on unsupervised learning due to the lack of training labels.

2. Audio-visual correspondence learning
It focuses on discovering the global semantic relation between the audio and visual modalities, as shown in Fig. 1 (b). It consists of audio-visual retrieval and audio-visual speech recognition tasks.

3. Audio-visual generation
It tries to synthesize one modality from the other, unlike the two tasks above, which take both audio and visual modalities as inputs.

4. Audio-visual representation learning
It aims to automatically discover a suitable representation from raw data.


2. Direction 1 : Audio-visual Separation and Localization

The objective of audio-visual separation is to separate different sounds from the corresponding objects, while audio-visual localization mainly focuses on localizing a sound in a visual context.


2.1 Sub Direction : Speaker Separation

It aims to isolate a single speech signal in a noisy scene.

Method(s) :

Some studies tried to solve the problem of audio separation with only the audio modality and achieved exciting results [14, 15].

Advanced approaches [5, 7] tried to utilize visual information to aid the speaker separation task and significantly surpassed single modality-based methods.

Subsequently, several methods focused on analyzing videos containing salient motion signals and the corresponding audio events [18, 19].
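
As a concrete illustration of how visual information can aid separation, here is a minimal, hypothetical sketch of the common visually conditioned masking formulation (my own simplification, not a model from the cited works): a network takes the mixture spectrogram and per-frame visual features of the target speaker and predicts a time-frequency mask. All layer sizes are placeholders.

```python
# Hypothetical sketch: visually conditioned time-frequency masking for speaker separation.
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.mask_head = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_spec, visual_feat):
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the noisy mixture
        # visual_feat: (B, T, visual_dim) per-frame features of the target speaker's face
        a, _ = self.audio_rnn(mix_spec)
        v = self.visual_proj(visual_feat)
        mask = torch.sigmoid(self.mask_head(torch.cat([a, v], dim=-1)))  # values in (0, 1)
        return mask * mix_spec  # estimated spectrogram of the target speaker

model = MaskSeparator()
mix = torch.rand(2, 100, 257)   # batch of 2 clips, 100 spectrogram frames
vis = torch.rand(2, 100, 512)   # placeholder visual features
est = model(mix, vis)           # (2, 100, 257)
```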

2.2 Sub Direction : Separating and Localizing Objects’ Sounds

Separation: refer to the previous subsection.
Localization: the localization of acoustic signals is strongly influenced by the synchronicity of their visual signals.
Simultaneous Separation and Localization:

Method(s) :

omit

brief summary


Problem Statement

The main challenges are distinguishing the timbre of different objects and exploring ways of generating the sounds of different objects. Addressing these challenges requires carefully designing the models or ideas (e.g., the attention mechanism) used to deal with different objects.

3. Direction 2 : Audio-visual Correspondence Learning

Audio-visual correspondence learning consists of

  1. the audio-visual matching task and
  2. the audio-visual speech recognition task.

3.1 Sub Direction : Audio-visual Matching Task

Voice-Facial Matching: Given facial images of different identities and the corresponding audio sequences, voice-facial matching aims to identify the face that the audio belongs to (the V2F task), or vice versa (the F2V task), as shown in Fig. 4.
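
To make the decision rule concrete, a tiny hypothetical example (the embedding networks are assumed to exist already; nothing here is from the paper): V2F matching reduces to picking the candidate face whose embedding is closest to the voice embedding.

```python
# Hypothetical V2F decision rule over precomputed embeddings.
import torch
import torch.nn.functional as F

def v2f_match(voice_emb, face_embs):
    # voice_emb: (D,) embedding of the audio clip; face_embs: (K, D) candidate face embeddings
    sims = F.cosine_similarity(voice_emb.unsqueeze(0), face_embs, dim=-1)  # (K,)
    return sims.argmax().item()  # index of the most similar face

voice_emb = torch.randn(128)
face_embs = torch.randn(5, 128)      # five candidate identities
print(v2f_match(voice_emb, face_embs))
```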

**Audio-image Retrieval:** Unlike other retrieval tasks such as the text-image task [48, 49, 50] or the sound-text task [51], the audio-visual retrieval task mainly focuses on subspace learning, such as learning a joint embedding space.
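
A minimal sketch of such joint-embedding subspace learning, assuming simple linear projection heads and a symmetric contrastive loss; these choices are mine for illustration and are not necessarily what the cited works use.

```python
# Hypothetical joint audio-visual embedding trained with a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, audio_dim=128, image_dim=2048, emb_dim=256):
        super().__init__()
        self.audio_net = nn.Linear(audio_dim, emb_dim)   # projection head for audio features
        self.image_net = nn.Linear(image_dim, emb_dim)   # projection head for image features

    def forward(self, audio, image):
        a = F.normalize(self.audio_net(audio), dim=-1)
        v = F.normalize(self.image_net(image), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    logits = a @ v.t() / temperature            # (B, B) pairwise similarities
    targets = torch.arange(a.size(0))           # matching audio-image pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

model = JointEmbedding()
a, v = model(torch.randn(8, 128), torch.randn(8, 2048))
loss = contrastive_loss(a, v)                   # retrieval = nearest neighbour in this space
```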

Method(s) :

omit

3.2 Sub Direction : Audio-visual Speech Recognition

Due to the correlation between audio and vision, combining the two modalities tends to offer more prior information.

Method(s) :

  1. Earlier efforts on audio-visual fusion models worked in two steps: 1) extracting features from the image and audio signals, and 2) combining the features for joint classification [62, 63, 64]. (Note: an advantage of joint learning is that the shared subspace tends to be semantically invariant, which helps transfer knowledge from one modality to another in a machine learning model; a drawback is that the semantic completeness of each single modality is not easy to detect and handle at an early stage. With coordinated representations, by contrast, any modality may be missing.) Common fusion strategies are:

Additive fusion: $z = f(w_1^T v_1 + \cdots + w_n^T v_n)$

Multiplicative (tensor) fusion: $z = \begin{bmatrix} \mathbf{v}^{1} \\ \mathbf{1} \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} \mathbf{v}^{n} \\ \mathbf{1} \end{bmatrix}$

Concatenation fusion: $z = \begin{pmatrix} \mathbf{v}^{1} \\ \mathbf{0} \\ \vdots \end{pmatrix} + \begin{pmatrix} \mathbf{0} \\ \mathbf{v}^{2} \\ \vdots \end{pmatrix} + \cdots$
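
For concreteness, the three fusion strategies above written out with plain tensors; the dimensions and the tanh nonlinearity are arbitrary illustrative choices.

```python
# The three fusion strategies on toy feature vectors.
import torch

v1, v2 = torch.randn(4), torch.randn(3)          # two unimodal feature vectors
w1, w2 = torch.randn(4, 8), torch.randn(3, 8)    # projections to a shared size of 8

# Additive fusion: z = f(w1^T v1 + w2^T v2)
z_add = torch.tanh(v1 @ w1 + v2 @ w2)            # (8,)

# Multiplicative (tensor) fusion: outer product of the [v; 1] augmented vectors
v1_aug = torch.cat([v1, torch.ones(1)])          # (5,)
v2_aug = torch.cat([v2, torch.ones(1)])          # (4,)
z_mul = torch.outer(v1_aug, v2_aug)              # (5, 4) tensor, usually flattened downstream

# Concatenation fusion: stack the vectors into one long vector
z_cat = torch.cat([v1, v2])                      # (7,)
```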

  2. Later, taking advantage of deep learning, feature extraction was replaced with a neural network encoder [65, 66, 67].

  3. Several recent studies use an end-to-end approach to visual speech recognition:
    (1) Some use fully connected layers and LSTMs to extract features and capture temporal information [56, 57], or a 3D convolutional layer followed by a combination of CNNs and LSTMs [58, 68].
    (2) Petridis et al. [56] introduced an audio-visual fusion model that simultaneously extracted features directly from pixels and spectrograms.
    (3) Wand et al. [57] presented a word-level lip-reading system using an LSTM.
    (4) Assael et al. [58] proposed the end-to-end LipNet model for sentence-level sequence prediction, consisting of spatiotemporal convolutions and a recurrent network trained via the connectionist temporal classification (CTC) loss (a minimal sketch of this recipe is given after this list).
    However, it remains challenging to combine audio and visual information across various scenes, especially under noisy conditions.

    (5) Trigeorgis et al. [60] introduced an end-to-end model to obtain a ‘context-aware’ feature from the raw temporal representation.
    (6) Chung et al. took advantage of a dual attention mechanism and could operate on a single or combined modality.
    (7) For an “in-the-wild” dataset, Cui et al. [69] proposed another model based on residual networks and a bidirectional GRU [38].
    (8) Afouras et al. [61] proposed a model to deal with noisy conditions.
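
Below is a minimal sketch of the end-to-end recipe in item (4): spatiotemporal (3D) convolutions, a recurrent network, and the CTC loss. The layer sizes and vocabulary are placeholders, not LipNet's actual configuration.

```python
# Hypothetical end-to-end lip-reading model trained with CTC.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_classes=28, hidden=256):       # e.g. blank + 26 letters + space
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 4, 4)),                     # pool only spatially, keep time
        )
        self.gru = nn.GRU(32 * 16 * 16, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):
        # frames: (B, 1, T, 64, 64) grayscale mouth crops
        x = self.conv3d(frames)                          # (B, 32, T, 16, 16)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        x, _ = self.gru(x)
        return self.fc(x).log_softmax(-1)                # (B, T, n_classes)

model = LipReader()
logp = model(torch.randn(2, 1, 75, 64, 64))              # two 75-frame clips
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 28, (2, 20))                  # dummy character targets
loss = ctc(logp.permute(1, 0, 2), targets,
           input_lengths=torch.full((2,), 75), target_lengths=torch.full((2,), 20))
```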

brief summary


Problem Statement

Many studies have tried to map different modalities into the shared feature space. However, it is challenging to obtain satisfactory results since extracting clear and effective information from ambiguous input and target modalities remains difficult. Therefore, sufficient prior information (the specific patterns people usually focus on) has a significant impact on obtaining more accurate results.

4. Direction 3 : Audio and Visual Generation

Following the invention and advances of generative adversarial networks (GANs) [70], image and video generation has emerged as an active topic. It involves several subtasks, including generating images or video from a latent space [71], cross-modality generation [72, 73], etc.

4.1 Sub Direction : Vision-to-Audio Generation

Lip Sequence to Speech: There is a natural relationship between speech and lip movements; apart from understanding the content of speech by observing the lips, the speech itself can also be reconstructed from the lip sequence.

**General Video to Audio:** When sound hits the surface of a small object, the object vibrates slightly. Davis et al. [79] utilized this effect to recover sound from the vibrations observed passively with a high-speed camera.


Method(s) :

omit

brief summary


4.2 Sub Direction : Audio to Vision

**Audio-to-Image Generation**

**Body Motion Generation**

**Talking Face Generation**

Method(s) :

**Audio-to-Image Generation**
Wan et al. [84]: a conditional GAN model.
Chen et al. [72]: conditional GANs.
Duarte et al. [87]: synthesized facial images containing expressions and poses with a GAN model.

**Body Motion Generation**
omit
**Talking Face Generation**
One person:

  1. Kumar et al. [94] attempted to generate key points synced to audio by utilizing a time-delayed LSTM [110] and then generated the video frames conditioned on the key points by another network.
  2. Combining an RNN and a GAN [70], Jalalifar et al. [97] produced a sequence of realistic faces synchronized with the input audio using two networks: an LSTM that creates lip landmarks from the audio input, and a conditional GAN (cGAN) that generates the resulting faces conditioned on a given set of lip landmarks (a rough sketch of this two-stage pipeline is given after this list).

Arbitrary identities:

omit
A frontal face photo:

omit
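
A rough sketch of the two-stage pipeline from item 2 above: an LSTM maps audio features to lip landmarks, and a conditional generator maps landmarks (plus noise) to frames. The adversarial discriminator and all layer sizes are omitted or hypothetical here.

```python
# Hypothetical two-stage talking-face pipeline: audio -> landmarks -> frame.
import torch
import torch.nn as nn

class Audio2Landmarks(nn.Module):
    def __init__(self, audio_dim=80, n_landmarks=20, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_landmarks * 2)     # (x, y) per lip landmark

    def forward(self, mel):                               # mel: (B, T, audio_dim)
        h, _ = self.lstm(mel)
        return self.out(h)                                # (B, T, n_landmarks * 2)

class Landmarks2Frame(nn.Module):
    """Conditional generator: landmark vector plus noise upsampled into a small image."""
    def __init__(self, n_landmarks=20, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2 + noise_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, landmarks, noise):                  # (B, n_landmarks * 2), (B, noise_dim)
        return self.net(torch.cat([landmarks, noise], dim=-1))  # (B, 3, 32, 32)

mel = torch.randn(2, 100, 80)                             # dummy mel-spectrogram frames
lm_seq = Audio2Landmarks()(mel)                           # landmarks per audio frame
frame = Landmarks2Frame()(lm_seq[:, 0], torch.randn(2, 64))
```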

brief summary


Problem Statement

In contrast to the conventional discriminative problem, the task of cross-modality generation is to fit a mapping between probability distributions. Therefore, it is usually a many-to-many mapping problem that is difficult to learn.
Moreover, despite the large difference between audio and visual modalities, humans are sensitive to the difference between real-world and generated results, and subtle artifacts can be easily noticed, which makes this task more challenging.

5. Direction 4 : Audio-visual Representation Learning

Representation learning aims to discover the pattern representation from data automatically. It is motivated by the fact that the choice of data representation usually greatly impacts performance of machine learning [11]. However, real-world data such as images, videos and audio are not amenable to defining specific features algorithmically.

5.1 Sub Direction : Single-Modality Representation Learning

omit

Method(s) :

omit

5.2 Sub Direction : Learning an Audio-visual Representation

omit

Method(s) :

omit

Problem Statement

omit

6. Datasets


6.1 Sub Dataset : Audio-visual Speech Datasets

1. Lab-controlled Environment : these datasets include GRID [119], TCD-TIMIT [121], and VidTIMIT [122]. Such datasets can be used for lip reading, talking face generation, and speech reconstruction.

2. In-the-wild Environment : including LRW and its variants [129, 59, 130], VoxCeleb and its variants [127, 128], AVA-ActiveSpeaker [131], and AVSpeech [7].
The LRW dataset consists of 500 sentences [129], while its variants contain 1000 sentences [59, 130], all of which were spoken by hundreds of different speakers.
VoxCeleb and its variants contain over 100,000 utterances from 1,251 celebrities [127] and over a million utterances from 6,112 identities [128], respectively.
AVA-ActiveSpeaker [131] and AVSpeech [7]: not detailed here; both are very large.

6.2 Sub Dataset : Audio-visual Event Datasets

1. Music-related Datasets :
ENST-Drums [133] merely contains drum videos of three professional drummers specializing in different music genres.
The C4S dataset [132] consists of 54 videos of 9 distinct clarinetists, each performing 3 different classical music
pieces twice (4.5h in total).
The URMP [134] dataset contains a number of multi-instrument musical pieces. However, these videos were recorded separately and then combined.

2. Real Events-related Datasets :
Kinetics-400 [137], Kinetics-600 [138] and Kinetics-700 [139] contain 400, 600 and 700 human action classes with at least 400, 600, and 700 video clips for each action, respectively.
The AVA-Actions dataset [140] densely annotated 80 atomic visual actions in 43015 minutes of movie clips.
AudioSet [136], a more general dataset, consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips.
YouTube-8M [135] is a large-scale labeled video dataset that consists of millions of YouTube video IDs with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.


Conclusion and Direction

  1. We think that mimicking the human learning process, e.g., by following the ideas of the attention mechanism and a memory bank, may improve the performance of learning this mapping.

  2. Consider fully utilizing prior information and constructing a knowledge graph. Building a comprehensive knowledge graph and leveraging it properly in specific areas may help machines reason.

  3. Collecting a dataset is labor- and time-intensive. Small-sample learning would also benefit the application of AVL.

  4. Many studies focus on building more complex networks to improve performance, and the resulting networks generally entail unexplainable mechanisms. To make a model or an algorithm more robust and explainable, it is necessary to learn the essence of the earlier explainable algorithms to advance AVL.

Notes(optional)

Before choosing a final direction, pay more attention to finding a suitable dataset.

Reference(optional)

List the most relevant references here so that they can be tracked later.

Question

  1. Can this be combined with 陈新's direction? Or should I switch to a direction that is less resource-hungry? It feels like if I push on and it turns out not to work, it will be a dead end.
  2. For this paper, I contacted the author on Zhihu. He thinks this direction has little competition and reviewers in the field tend to support related research, but it cannot be done without GPUs (actually, the author feels the GPU requirement is not high, much lower than in the image generation field). The minimum configuration he suggested is 4x 1080 Ti or 2x 2080 Ti, with a single experiment taking three days to produce results.