1. Introduction
Accordingly, in this paper, we propose leveraging existing real-time talking portrait generation techniques to create an interactive avatar and evaluate its potential for improving perceived realism and interaction quality.
Our main contributions are as follows:
- Integrating a complete AI-based avatar system that combines GPT-3 for conversation, Speech-To-Text (STT), Text-To-Speech (TTS), AFE, and talking portrait synthesis (Figure 1) for full end-to-end experiments.
- Evaluating and comparing talking-head synthesis models.
- Evaluating and comparing four AFE models using two open-source talking-head frameworks across three datasets.
- Modifying Whisper for efficient and accelerated AFE in talking portrait systems.
- Assessing and discussing the best combinations of talking portrait synthesis systems and AFE systems.
2. Related Work
The research and development of our avatar touch upon two main areas: virtual avatars for interviewer training and real-time talking portrait synthesis. Notably, avatar systems provide the interactive framework necessary for dynamic user engagement, while talking portrait synthesis improves visual and auditory realism. Together, these technologies address both interaction logic and expressive fidelity, forming a synergistic foundation for our solution. This integration is vital for achieving an immersive user experience.
2.1. Virtual Avatar for Interview Training
2.2. Talking Portrait Synthesis
In recent years, real-time audio-driven talking portrait synthesis has gained significant attention, driven by applications in digital humans, virtual avatars, and video conferencing. Several approaches have been proposed to balance visual quality, synchronization, and computational efficiency.
In the following, we discuss recent advancements in real-time audio-driven talking portrait synthesis, which build on these foundational methods to improve computational efficiency while maintaining synchronization and visual quality.
3. Methodology
This section introduces the AFE models Deep-Speech 2, Wav2Vec 2.0, HuBERT, and Whisper, which we compare in this study, and outlines the interactive avatar’s system architecture.
3.1. Audio Feature Extraction
3.1.1. Deep-Speech 2
3.1.2. Wav2Vec 2.0
3.1.3. HuBERT
3.1.4. Whisper
Whisper's design yields a substantial 80–90% reduction in processing time compared to alternative models such as Deep-Speech, Wav2Vec, and HuBERT, particularly for longer audio clips.
where $E$ has shape $T \times D$, with $T$ representing the number of time steps aligned to the visual frames and $D$ the dimensionality of the feature embedding space. Synchronization with the 25 FPS visual frame rate is achieved by applying a sliding window (with fixed window size, stride, and padding) over the encoder output, yielding a final feature matrix whose temporal dimension matches the video frames. This setup ensures precise temporal alignment across 750 frames over 30 s, enhancing the real-time accuracy and fluidity of interactions in talking-head applications.
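As an illustration of this alignment, the sketch below uses the openai-whisper package to extract encoder features and pool them down to the 25 FPS frame rate; the file name and the pooling window, stride, and padding values are illustrative assumptions rather than the exact parameters of our implementation.

```python
# Minimal sketch of aligning Whisper encoder features to a 25 FPS renderer.
# Assumes the openai-whisper package; window/stride/padding values below are
# illustrative, not the exact parameters used in our system.
import torch
import whisper

model = whisper.load_model("base")              # encoder width is 512 for "base"

audio = whisper.load_audio("speech.wav")        # 16 kHz mono waveform (illustrative file name)
audio = whisper.pad_or_trim(audio)              # pad or clip to 30 s
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    feats = model.encoder(mel.unsqueeze(0))     # (1, 1500, 512): 50 feature frames per second

# Pool the 50 Hz encoder output down to the 25 FPS video frame rate.
# kernel_size=4, stride=2, padding=1 maps 1500 steps -> 750 frames for a 30 s clip.
feats = feats.transpose(1, 2)                   # (1, 512, 1500)
pooled = torch.nn.functional.avg_pool1d(feats, kernel_size=4, stride=2, padding=1)
E = pooled.transpose(1, 2).squeeze(0)           # (750, 512): one feature row per video frame
print(E.shape)
```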
3.2. System Architecture
The Listening module is the entry point of the system, where user speech is captured through a button-based recording process. Users click to start recording their voice input and click again to stop, saving the recorded audio file. During this recording phase, the avatar remains in a listening state, using a pre-rendered video based on an empty audio file in which it exhibits natural, non-verbal behaviors such as blinking and subtle movements while remaining silent. Once recording stops, the module initiates the conversion process by passing the recorded audio data to the speech-to-text module.
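To make the hand-off between the modules in Figure 1 concrete, the following is a minimal orchestration sketch; the six callables (STT, language model, TTS, AFE, renderer, audio overlay) are placeholders to be bound to concrete components, not the released implementation.

```python
# Minimal orchestration sketch of the Figure 1 pipeline. The six module callables
# are placeholders supplied by the surrounding system; this only illustrates how
# the stages hand data to each other for one conversational turn.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AvatarPipeline:
    stt: Callable[[str], str]           # (2) recorded audio path -> transcript
    language: Callable[[str], str]      # (3) transcript -> avatar reply text (GPT-3 in our setup)
    tts: Callable[[str], str]           # (4) reply text -> synthesized speech file
    afe: Callable[[str], Any]           # (5) speech file -> audio features (Whisper here)
    render: Callable[[Any], str]        # (6) features -> silent talking-head frames
    overlay: Callable[[str, str], str]  # (7) frames + speech file -> final video clip

    def run_turn(self, recorded_audio: str) -> str:
        """Process one conversational turn and return the path of the rendered answer."""
        transcript = self.stt(recorded_audio)
        reply_text = self.language(transcript)
        reply_audio = self.tts(reply_text)
        features = self.afe(reply_audio)
        frames = self.render(features)
        return self.overlay(frames, reply_audio)
```

In a concrete system, the recorded file saved by the Listening module is passed as `recorded_audio`, and each callable wraps the corresponding component of the avatar.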
4. Experiments
This section rigorously evaluates various AFE models across two frameworks, focusing on model efficiency, synchronization accuracy, and responsiveness.
4.1. Experimental Setup
This subsection outlines the datasets, hardware, and configurations used to evaluate AFE model performance in real-time talking portrait synthesis.
4.1.1. Dataset
4.1.2. System Configuration
Experiments are conducted on a machine with a 12th Gen Intel(R) Core(TM) i9-12900F CPU, 31 GiB of RAM, and an NVIDIA RTX 4090 GPU with 24 GiB of VRAM, running CUDA 12.4 on an Ubuntu operating system.
4.2. Real-Time Talking-Head Speed Analysis
4.3. AFE Analysis
4.3.1. AFE Speed Analysis
4.3.2. AFE Quality Analysis
To further validate the use of Whisper for AFE, we evaluate the system's rendering quality across various settings. In the self-driven setting, where the ground truth data correspond to the same identity as the generated output, we use widely recognized quantitative metrics to assess the quality of portrait reconstruction. These metrics are standard in the field and provide a comprehensive assessment of both visual fidelity and synchronization, ensuring alignment with established evaluation practices (a short computation sketch for several of these metrics follows the list):
- Peak Signal-to-Noise Ratio (PSNR): This metric measures the fidelity of the reconstructed image relative to the ground truth. The PSNR is calculated as
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$$
where $\mathrm{MAX}$ is the maximum possible pixel value of the image, and MSE is the Mean Squared Error between the reconstructed and ground truth images.
- Structural Similarity Index Measure (SSIM): SSIM evaluates the structural similarity by considering the luminance, contrast, and structure. The formula is
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $\mu_x$ and $\mu_y$ are the average intensities, $\sigma_x^2$ and $\sigma_y^2$ are the variances, and $\sigma_{xy}$ is the covariance between images $x$ and $y$. $C_1$ and $C_2$ are constants for numerical stability.
- Learned Perceptual Image Patch Similarity (LPIPS) [47]: Measures the perceptual similarity between the generated and ground truth images by calculating the distance between feature representations extracted from a deep neural network. This metric captures differences in visual features that align more closely with human perception than simple pixel-wise comparisons, making it useful for assessing image quality in terms of perceptual fidelity:
$$\mathrm{LPIPS}(\hat{x}, x) = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \phi(\hat{x})_i - \phi(x)_i \right\rVert_2^2,$$
where $\phi(\cdot)$ represents the feature representation extracted by a neural network (e.g., AlexNet), $\hat{x}$ and $x$ are the predicted and ground truth images, respectively, and $N$ is the number of patches or feature points compared.
- Landmark Distance (LMD) [48]: This metric measures the geometric accuracy of facial landmarks by calculating the Euclidean distance between corresponding landmark points in the generated and ground truth images:
$$\mathrm{LMD} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \hat{p}_i - p_i \right\rVert_2,$$
where $\hat{p}_i$ and $p_i$ are the coordinates of the $i$-th landmark in the predicted and ground truth images, respectively, and $N$ is the total number of landmarks.
- Fréchet Inception Distance (FID) [49]: FID assesses the similarity between distributions of real and generated images. The formula is
$$\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$
where $\mu_r$ and $\Sigma_r$ represent the mean and covariance of features for real images, and $\mu_g$ and $\Sigma_g$ represent those for generated images.
- Action Units Error (AUE) [50]: AUE measures the accuracy of lower facial muscle movements by calculating the squared differences in action unit intensities between the generated and ground truth images. In this study, we specifically evaluate the lower face region, which is relevant for expressions related to speech and emotion:
$$\mathrm{AUE} = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{a}_i - a_i\right)^2,$$
where $\hat{a}_i$ and $a_i$ are the intensities of the $i$-th action unit in the lower face region for the predicted and ground truth data, respectively, and $N$ is the total number of lower-face action units evaluated.
- SyncNet Confidence Score (Sync) [51]: Measures the lip-sync accuracy by evaluating the alignment between audio and lip movements in generated talking-head videos. This metric utilizes SyncNet, which calculates a confidence score based on the similarity between the embeddings of audio and video frames:
$$\mathrm{Sync} = \frac{1}{N}\sum_{i=1}^{N} \frac{v_i \cdot a_i}{\left\lVert v_i \right\rVert \, \left\lVert a_i \right\rVert},$$
where $v_i$ and $a_i$ are the embeddings of the $i$-th video and audio frames, respectively, computed by SyncNet. The formula averages the cosine similarity between the embeddings of each frame pair, $N$ is the total number of frame pairs evaluated, and higher values indicate better lip-sync quality.
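As referenced above, the following is a minimal NumPy sketch of how three of these metrics (PSNR, LMD, and an averaged cosine score in the spirit of the SyncNet confidence) can be computed; the input shapes and toy data are assumptions for illustration, and the released SyncNet tool aggregates its confidence score differently.

```python
# Minimal NumPy sketch of PSNR, LMD, and an averaged cosine lip-sync score.
# Inputs are assumed to be pre-aligned arrays; this is illustrative only.
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two images of identical shape."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def lmd(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding (N, 2) facial landmark sets."""
    return float(np.mean(np.linalg.norm(pred_pts - gt_pts, axis=1)))

def sync_cosine(video_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Mean cosine similarity between per-frame video and audio embeddings of shape (N, D)."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(v * a, axis=1)))

# Toy usage with random data, only to show the expected input shapes.
rng = np.random.default_rng(0)
gt_img = rng.integers(0, 256, (256, 256, 3)).astype(np.float64)
pred_img = np.clip(gt_img + rng.normal(0, 5, gt_img.shape), 0, 255)
print(psnr(pred_img, gt_img), lmd(rng.random((68, 2)), rng.random((68, 2))))
```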
Our experimental results reveal that HuBERT consistently underperforms in lip synchronization, with the lowest Sync scores among all the models tested. Given these results, particularly the poor performance in cross-driven settings (e.g., Sync scores as low as 0.564), HuBERT appears inadequate for applications requiring precise lip-sync. Future work should consider incorporating subjective evaluations to further assess the perceptual quality and user satisfaction, which may provide a more thorough understanding of its limitations.
ER-NeRF and RAD-NeRF demonstrate varying degrees of fidelity across the AFEs, particularly in achieving accurate lip-sync with the audio input. Generally, Whisper achieves the closest approximation to ground truth across both ER-NeRF and RAD-NeRF, exhibiting strong performance in synchronizing lip movements with audio, especially during pronounced lip articulations, such as the wide-open mouth movements observed when pronouncing sounds like “wa.” This precise alignment enhances the realism of the output, particularly in sequences requiring dynamic mouth shapes.
Wav2Vec and HuBERT provide reasonable approximations but show slight misalignments in lip-sync. Deep-Speech, while effective, displays the greatest variance from the ground truth, with notable lip-sync discrepancies, indicating less robust performance in these NeRF-based reconstructions.
4.4. System Responsiveness Analysis
To evaluate the system’s responsiveness, we test various lengths of avatar answers. This approach allows us to assess how the system handles different interaction complexities, ranging from brief exchanges like “Hi” and “I’m OK” to more extended dialogues, such as “Um… at Jenny’s house, we play a lot. It’s nice there… but, well, I don’t really like talking about the pool. It wasn’t very fun last time”. The results indicate that while the overall system performs efficiently, the frame rendering stages are identified as the most time-consuming, due to the computational demands of generating high-quality, real-time visual output that matches the synchronized audio input.
Conversely, using the Whisper model as the AFE component proves much faster than the conventional AFE models. Whisper's improved processing speed substantially reduces the time required to extract and synchronize audio features, contributing to faster overall system performance.
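For reference, the sketch below shows one way such a per-stage timing breakdown can be collected; the stage names follow Figure 1, while the `profile_turn` helper and the dummy stages are illustrative placeholders rather than part of the actual system.

```python
# Minimal sketch of collecting a per-stage timing breakdown for one conversational turn.
# `stages` maps Figure 1 module names to callables supplied by the system; the dummy
# lambdas below are placeholders, not real modules.
import time
from typing import Any, Callable, Dict

def profile_turn(stages: Dict[str, Callable[[Any], Any]], first_input: Any) -> Dict[str, float]:
    """Run the stages in order, feeding each output to the next, and record seconds per stage."""
    timings: Dict[str, float] = {}
    value = first_input
    for name, stage in stages.items():
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    timings["SUM"] = sum(timings.values())
    return timings

# Example with dummy stages; in practice these are the STT, language, TTS, AFE,
# frame rendering, and audio-overlay modules of the avatar system.
dummy_stages = {name: (lambda x: x) for name in
                ["STT", "Language", "TTS", "AFE", "Frame Rendering", "Audio Overlay"]}
report = profile_turn(dummy_stages, "question.wav")
total = report["SUM"] or 1.0
for name, seconds in report.items():
    print(f"{name:>15}: {seconds:8.4f} s ({100 * seconds / total:5.1f}%)")
```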
5. Discussion and Future Work
However, despite Whisper’s efficiency, the overall system still experiences latency due to other computationally intensive components, such as frame rendering. This issue becomes more pronounced when changes in hardware specifications significantly affect the frame rendering time of talking-head generation models. Addressing these bottlenecks and further exploring system scalability across diverse hardware configurations are essential steps toward improving real-world applicability and robustness.
Moreover, although Whisper has demonstrated significant advantages in two real-time talking-head synthesis models, its adoption should not be presumed universally effective across all models or contexts. Each talking-head generation model leverages unique strategies and architectures tailored to specific applications, which may necessitate alternative feature extraction approaches. The diversity in model designs underscores the importance of contextual considerations when selecting feature extraction methods. Nevertheless, Whisper’s robust real-time capabilities suggest potential for future integration as a preprocessing step in emerging systems, particularly those prioritizing real-time performance. Such integration could streamline audio processing pipelines, enhancing efficiency without compromising synchronization or quality. This outlook provides a promising direction for optimizing real-time systems while acknowledging the necessity for model-specific customization.
Another limitation of the current work is the absence of a robust framework for handling errors or implementing fallback mechanisms across the modular pipeline components. In modular systems, individual component failures can propagate through the pipeline, potentially compromising both system reliability and real-time performance. To address these challenges, future research should prioritize the development of dynamic error detection and mitigation strategies.
Furthermore, conducting broader user studies with professionals in the field could provide a deeper understanding of the system’s impact on practical outcomes, such as user training and engagement. Expanding this research to include real-world evaluations with subjective user feedback will offer valuable insights into perceived realism, usability, and overall user satisfaction. Incorporating user-centric metrics will provide a more comprehensive view of the system’s effectiveness in high-engagement applications, such as investigative interview training. Through these advancements, AI-driven avatars may achieve greater utility and realism, enhancing their role as effective tools in training applications.
Ethical Considerations
Despite these challenges, AI avatars offer unique opportunities for societal good. They provide CPS workers with realistic, interactive training tools that avoid involving real children, thus eliminating ethical dilemmas and ensuring victim privacy. Additionally, they can anonymize sensitive victim data, preserving dignity and confidentiality. To balance the benefits and risks, a multidisciplinary approach is important. Collaboration among technologists, legal experts, child protection professionals, and ethicists can ensure AI technologies are leveraged responsibly. Developing robust detection systems to identify misuse, alongside legal frameworks to regulate the deployment of such technology, is essential for protecting vulnerable populations.
6. Conclusions
This study addressed the latency challenges associated with AFE, which have hindered the practical deployment of real-time talking portrait systems in real-world applications. By integrating the Whisper model, a high-performance ASR system, into our framework, we achieved notable reductions in processing delays. These optimizations not only increased the overall responsiveness of the interactive avatars but also improved the accuracy of lip-syncing, making them more applicable for immersive training applications.
Our findings affirm Whisper’s capability to meet real-time demands, particularly in applications requiring responsive interactions and minimal delay. This efficiency is important for training environments such as CPS, where timely and realistic interactions can greatly impact training efficacy. By achieving these improvements, Whisper-integrated systems emerge as promising solutions for a variety of real-time applications, including virtual assistants, remote education, and digital customer service platforms.
Author Contributions
Conceptualization, P.S., S.A.S., V.T., S.S.S., M.A.R. and P.H.; methodology, P.S., S.A.S., V.T. and P.H.; software, P.S., S.A.S. and S.G.; validation, P.S., S.A.S. and S.G.; formal analysis, P.S.; investigation, P.S.; resources, P.H. and M.A.R.; data curation, P.S. and S.A.S.; writing—original draft preparation, P.S., S.A.S. and P.H.; writing—review and editing, P.S., S.A.S., V.T., S.G., S.S.S., D.J., M.A.R. and P.H.; visualization, P.S. and S.A.S.; supervision, V.T., S.S.S., D.J., M.A.R. and P.H.; project administration, P.S. and P.H.; funding acquisition, M.A.R. and P.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Research Council of Norway grant number #314690.
Institutional Review Board Statement
Ethical approval has been obtained from The Norwegian Agency for Shared Services in Education and Research.
Informed Consent Statement
Informed consent for participation was obtained from all subjects involved in the study.
Data Availability Statement
Conflicts of Interest
Author Saeed S. Sabet was employed by the company Forzasys. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Chernikova, O.; Heitzmann, N.; Stadler, M.; Holzberger, D.; Seidel, T.; Fischer, F. Simulation-based learning in higher education: A meta-analysis. Rev. Educ. Res. 2020, 90, 499–541. [Google Scholar] [CrossRef]
- Crompton, H.; Bernacki, M.; Greene, J.A. Psychological foundations of emerging technologies for teaching and learning in higher education. Curr. Opin. Psychol. 2020, 36, 101–105. [Google Scholar] [CrossRef] [PubMed]
- Lamb, M.E. Difficulties translating research on forensic interview practices to practitioners: Finding water, leading horses, but can we get them to drink? Am. Psychol. 2016, 71, 710. [Google Scholar] [CrossRef] [PubMed]
- Powell, M.B. Designing effective training programs for investigative interviewers of children. Curr. Issues Crim. Justice 2008, 20, 189–208. [Google Scholar] [CrossRef]
- Lamb, M.E.; Orbach, Y.; Hershkowitz, I.; Esplin, P.W.; Horowitz, D. A structured forensic interview protocol improves the quality and informativeness of investigative interviews with children: A review of research using the NICHD Investigative Interview Protocol. Child Abuse Negl. 2007, 31, 1201–1231. [Google Scholar] [CrossRef] [PubMed]
- Lyon, T.D. Interviewing children. Annu. Rev. Law Soc. Sci. 2014, 10, 73–89. [Google Scholar] [CrossRef]
- Lamb, M.E.; Brown, D.A.; Hershkowitz, I.; Orbach, Y.; Esplin, P.W. Tell Me What Happened: Questioning Children About Abuse; John Wiley & Sons: Hoboken, NJ, USA, 2018. [Google Scholar]
- Powell, M.B.; Brubacher, S.P.; Baugerud, G.A. An overview of mock interviews as a training tool for interviewers of children. Child Abuse Negl. 2022, 129, 105685. [Google Scholar] [CrossRef] [PubMed]
- Salehi, P.; Hassan, S.Z.; Baugerud, G.A.; Powell, M.; Johnson, M.S.; Johansen, D.; Sabet, S.S.; Riegler, M.A.; Halvorsen, P. A theoretical and empirical analysis of 2D and 3D Virtual Environments in Training for Child Interview Skills. IEEE Access 2024, 12, 131842–131864. [Google Scholar] [CrossRef]
- Salehi, P.; Hassan, S.Z.; Shafiee Sabet, S.; Astrid Baugerud, G.; Sinkerud Johnson, M.; Halvorsen, P.; Riegler, M.A. Is more realistic better? A comparison of game engine and gan-based avatars for investigative interviews of children. In Proceedings of the 3rd ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 41–49. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Hsu, W.N.; Tsai, Y.H.H.; Bolte, B.; Salakhutdinov, R.; Mohamed, A. HuBERT: How much can a bad teacher benefit ASR pre-training? In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6533–6537. [Google Scholar]
- Chakhtouna, A.; Sekkate, S.; Abdellah, A. Unveiling embedded features in Wav2vec2 and HuBERT models for Speech Emotion Recognition. Procedia Comput. Sci. 2024, 232, 2560–2569. [Google Scholar] [CrossRef]
- Schubert, M.E.; Langerman, D.; George, A.D. Benchmarking Inference of Transformer-Based Transcription Models with Clustering on Embedded GPUs. IEEE Access 2024, 12, 123276–123293. [Google Scholar] [CrossRef]
- Chakravarty, A. Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment. arXiv 2024, arXiv:2405.01004. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
- Salehi, P. Whisper AFE for Talking Heads Generation, Version 1.0.0. 2024. Available online: https://github.com/pegahs1993/Whisper-AFE-TalkingHeadsGen (accessed on 20 February 2025).
- Dalli, K.C. Technological Acceptance of an Avatar Based Interview Training Application: The Development and Technological Acceptance Study of the AvBIT Application. 2021. Available online: http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-107108 (accessed on 20 January 2025).
- Røed, R.K.; Powell, M.B.; Riegler, M.A.; Baugerud, G.A. A field assessment of child abuse investigators’ engagement with a child-avatar to develop interviewing skills. Child Abuse Negl. 2023, 143, 106324. [Google Scholar] [CrossRef]
- Benson, M.S.; Powell, M.B. Evaluation of a comprehensive interactive training system for investigative interviewers of children. Psychol. Public Policy Law 2015, 21, 309. [Google Scholar] [CrossRef]
- Pompedda, F. Training in Investigative Interviews of Children: Serious Gaming Paired with Feedback Improves Interview Quality. Doctoral Dissertation, Åbo Akademi University, Turku, Finland, 2018. [Google Scholar]
- Krause, N.; Gewehr, E.; Barbe, H.; Merschhemke, M.; Mensing, F.; Siegel, B.; Müller, J.L.; Volbert, R.; Fromberger, P.; Tamm, A.; et al. How to prepare for conversations with children about suspicions of sexual abuse? Evaluation of an interactive virtual reality training for student teachers. Child Abuse Negl. 2024, 149, 106677. [Google Scholar] [CrossRef] [PubMed]
- Meng, M.; Zhao, Y.; Zhang, B.; Zhu, Y.; Shi, W.; Wen, M.; Fan, Z. A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing. arXiv 2024, arXiv:2406.10553. [Google Scholar]
- Lu, Y.; Chai, J.; Cao, X. Live speech portraits: Real-time photorealistic talking-head animation. ACM Trans. Graph. (ToG) 2021, 40, 1–17. [Google Scholar] [CrossRef]
- Chung, Y.A.; Glass, J. Generative pre-training for speech with autoregressive predictive coding. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3497–3501. [Google Scholar]
- Ji, X.; Lin, C.; Ding, Z.; Tai, Y.; Yang, J.; Zhu, J.; Hu, X.; Zhang, J.; Luo, D.; Wang, C. RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network. arXiv 2024, arXiv:2406.18284. [Google Scholar]
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–25. [Google Scholar] [CrossRef]
- Chen, B.; Hu, S.; Chen, Q.; Du, C.; Yi, R.; Qian, Y.; Chen, X. GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting. arXiv 2024, arXiv:2404.19040. [Google Scholar]
- Cho, K.; Lee, J.; Yoon, H.; Hong, Y.; Ko, J.; Ahn, S.; Kim, S. GaussianTalker: Real-Time Talking Head Synthesis with 3D Gaussian Splatting. In Proceedings of the ACM Multimedia 2024, Melbourne, Australia, 28 October–1 November 2024. [Google Scholar]
- Tang, J.; Wang, K.; Zhou, H.; Chen, X.; He, D.; Hu, T.; Liu, J.; Zeng, G.; Wang, J. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition. arXiv 2022, arXiv:2211.12368. [Google Scholar]
- Guo, Y.; Chen, K.; Liang, S.; Liu, Y.J.; Bao, H.; Zhang, J. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5784–5794. [Google Scholar]
- Li, J.; Zhang, J.; Bai, X.; Zhou, J.; Gu, L. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7568–7578. [Google Scholar]
- Ye, Z.; He, J.; Jiang, Z.; Huang, R.; Huang, J.; Liu, J.; Ren, Y.; Yin, X.; Ma, Z.; Zhao, Z. Geneface++: Generalized and stable real-time audio-driven 3d talking face generation. arXiv 2023, arXiv:2305.00787. [Google Scholar]
- Ye, Z.; Zhang, L.; Zeng, D.; Lu, Q.; Jiang, N. R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning. arXiv 2023, arXiv:2312.05572. [Google Scholar]
- Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
- Thies, J.; Elgharib, M.; Tewari, A.; Theobalt, C.; Nießner, M. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 716–731. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
- Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Kahn, J.; Riviere, M.; Zheng, W.; Kharitonov, E.; Xu, Q.; Mazaré, P.E.; Karadayi, J.; Liptchinsky, V.; Collobert, R.; Fuegen, C.; et al. Libri-light: A benchmark for asr with limited or no supervision. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7669–7673. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Ye, Z.; Jiang, Z.; Ren, Y.; Liu, J.; He, J.; Zhao, Z. Geneface: Generalized and high-fidelity audio-driven 3D talking face synthesis. arXiv 2023, arXiv:2301.13430. [Google Scholar]
- Amazon Web Services. AWS Polly. Web Services. Amazon. 2024. Available online: https://aws.amazon.com/polly/ (accessed on 28 September 2024).
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
- Chen, L.; Li, Z.; Maddox, R.K.; Duan, Z.; Xu, C. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 520–535. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6626–6637. [Google Scholar]
- Baltrušaitis, T.; Robinson, P.; Morency, L.P. Openface: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
- Chung, J.S.; Zisserman, A. Out of time: Automated lip sync in the wild. In Proceedings of the Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, 20–24 November 2016; pp. 251–263. [Google Scholar]
- Attig, C.; Rauh, N.; Franke, T.; Krems, J.F. System latency guidelines then and now–is zero latency really considered necessary? In Proceedings of the Engineering Psychology and Cognitive Ergonomics: Cognition and Design: 14th International Conference, EPCE 2017, Vancouver, BC, Canada, 9–14 July 2017; pp. 3–14. [Google Scholar]
- Forch, V.; Franke, T.; Rauh, N.; Krems, J.F. Are 100 ms fast enough? Characterizing latency perception thresholds in mouse-based interaction. In Proceedings of the Engineering Psychology and Cognitive Ergonomics: Cognition and Design: 14th International Conference, EPCE 2017, Vancouver, BC, Canada, 9–14 July 2017; pp. 45–56. [Google Scholar]
- Meskys, E.; Kalpokiene, J.; Jurcys, P.; Liaudanskas, A. Regulating deep fakes: Legal and ethical considerations. J. Intellect. Prop. Law Pract. 2020, 15, 24–31. [Google Scholar] [CrossRef]
- Hu, H. Privacy Attacks and Protection in Generative Models. 2023. Available online: https://hdl.handle.net/10993/58928 (accessed on 20 February 2025).
- Kugler, M.B.; Pace, C. Deepfake privacy: Attitudes and regulation. Northwestern Univ. Law Rev. 2021, 116, 611. [Google Scholar] [CrossRef]
Figure 1.
(a) System architecture of the interactive child avatar, detailing the integration of key modules: (1) Listening, (2) STT, (3) Language Processing, (4) TTS, (5) AFE, (6) Frames Rendering, and (7) Audio Overlay. This setup simulates natural conversation, allowing the user to interact with the avatar as if communicating with a real person. (b) User interaction with the child avatar system.
Execution time comparison of open-source real-time talking-head generation models, including RAD-NeRF [32], ER-NeRF [34], Gaussian Talker [31] and GeneFace++ [35]. The solid lines represent execution times excluding AFE, while the dashed lines indicate execution times that include AFE.

Execution time comparison of different AFE models, including Deep-Speech [18], Wav2Vec [12], HuBERT [13], and Whisper [17].

Quality comparison: Examples of visualizations of RAD-NeRF [32] under the self-driven setting, based on two frames extracted from each video illustrating typical challenges. Yellow boxes highlight areas of noisy image quality, while red boxes indicate regions with inaccurate lip synchronization.

Quality comparison: Examples of visualizations of ER-NeRF [34] methods under the self-driven setting, based on two frames extracted from each video illustrating typical challenges. Yellow boxes highlight areas of noisy image quality, while red boxes indicate regions with inaccurate lip synchronization.

Table 1.
Quantitative comparison of face reconstruction quality under self-driven synthesis on the same identity’s test set. The arrows in the metric names indicate the preferred trend for each metric: ↑ denotes that higher values are better, ↓ signifies that lower values are preferable.
| Methods | AFE | Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | LMD ↓ | FID ↓ | AUE ↓ | Sync_conf ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAD-NeRF [32] | Deep-Speech | Obama | 27.14 | 0.9304 | 0.0738 | 2.675 | 31.29 | 1.995 | 7.171 |
| | | Donya | 27.79 | 0.9045 | 0.0917 | 2.750 | 12.82 | 1.911 | 4.720 |
| | | Shaheen | 30.13 | 0.9314 | 0.0697 | 3.199 | 33.05 | 2.837 | 7.330 |
| | | Mean | 28.35 | 0.9221 | 0.0784 | 2.874 | 25.72 | 2.247 | 6.407 |
| | HuBERT | Obama | 26.58 | 0.9261 | 0.0769 | 2.762 | 28.78 | 2.006 | 0.563 |
| | | Donya | 28.05 | 0.9071 | 0.0868 | 2.518 | 14.25 | 2.511 | 0.365 |
| | | Shaheen | 30.45 | 0.9332 | 0.0729 | 3.050 | 35.96 | 3.229 | 0.494 |
| | | Mean | 28.36 | 0.9221 | 0.0788 | 2.776 | 26.33 | 2.582 | 0.474 |
| | Wav2Vec | Obama | 26.59 | 0.9268 | 0.0785 | 2.696 | 15.15 | 1.707 | 6.744 |
| | | Donya | 27.12 | 0.8972 | 0.0845 | 2.726 | 24.58 | 1.531 | 4.820 |
| | | Shaheen | 30.08 | 0.9306 | 0.0698 | 3.221 | 34.77 | 2.966 | 7.946 |
| | | Mean | 27.93 | 0.9182 | 0.0776 | 2.881 | 24.83 | 2.068 | 6.503 |
| | Whisper | Obama | 26.10 | 0.9231 | 0.0723 | 2.573 | 12.67 | 1.693 | 7.143 |
| | | Donya | 28.65 | 0.9138 | 0.0844 | 2.640 | 26.85 | 1.504 | 5.269 |
| | | Shaheen | 30.05 | 0.9303 | 0.0660 | 3.045 | 29.44 | 2.696 | 8.488 |
| | | Mean | 28.07 | 0.9224 | 0.0761 | 2.826 | 24.04 | 1.964 | 6.966 |
| ER-NeRF [34] | Deep-Speech | Obama | 26.44 | 0.9339 | 0.0441 | 2.561 | 7.14 | 1.923 | 7.201 |
| | | Donya | 28.91 | 0.9165 | 0.0605 | 2.647 | 14.59 | 1.874 | 4.722 |
| | | Shaheen | 29.92 | 0.9267 | 0.0450 | 2.900 | 16.10 | 2.668 | 8.215 |
| | | Mean | 28.16 | 0.9257 | 0.0689 | 2.7932 | 20.92 | 2.155 | 6.712 |
| | HuBERT | Obama | 26.30 | 0.9297 | 0.0473 | 2.758 | 8.33 | 1.711 | 0.300 |
| | | Donya | 24.20 | 0.7826 | 0.1255 | 2.545 | 49.81 | 2.284 | 0.408 |
| | | Shaheen | 30.45 | 0.9322 | 0.0420 | 2.852 | 16.56 | 3.172 | 0.434 |
| | | Mean | 26.98 | 0.8815 | 0.0716 | 2.718 | 24.90 | 2.389 | 0.380 |
| | Wav2Vec | Obama | 25.59 | 0.9268 | 0.0497 | 2.645 | 8.83 | 1.704 | 6.616 |
| | | Donya | 24.21 | 0.7777 | 0.1509 | 2.754 | 68.20 | 1.730 | 4.403 |
| | | Shaheen | 29.81 | 0.9245 | 0.0470 | 3.003 | 15.59 | 2.948 | 7.917 |
| | | Mean | 26.53 | 0.8763 | 0.0825 | 2.800 | 30.87 | 2.127 | 6.312 |
| | Whisper | Obama | 26.30 | 0.9314 | 0.0462 | 2.501 | 8.06 | 1.797 | 7.647 |
| | | Donya | 27.36 | 0.9020 | 0.0641 | 2.516 | 14.67 | 1.852 | 5.704 |
| | | Shaheen | 30.20 | 0.9305 | 0.0434 | 2.935 | 15.61 | 3.030 | 8.575 |
| | | Mean | 28.12 | 0.9213 | 0.0654 | 2.7640 | 19.29 | 2.226 | 7.308 |
Table 2.
Quantitative comparison of audio–lip synchronization under the cross-driven setting. The arrows in the metric names indicate the preferred trend for each metric: ↑ denotes that higher values are better, ↓ signifies that lower values are preferable.
| Methods | AFE | Dataset | Sync_conf ↑ (Synthetic) | Sync_conf ↑ (Natural) |
|---|---|---|---|---|
| RAD-NeRF [32] | Deep-Speech | Obama | 6.581 | 6.489 |
| | | Donya | 4.208 | 3.576 |
| | | Shaheen | 6.319 | 5.840 |
| | | Mean | 5.702 | 5.301 |
| | HuBERT | Obama | 5.109 | 0.548 |
| | | Donya | 4.750 | 0.680 |
| | | Shaheen | 6.670 | 0.741 |
| | | Mean | 5.509 | 0.656 |
| | Wav2Vec | Obama | 6.851 | 7.388 |
| | | Donya | 4.593 | 5.017 |
| | | Shaheen | 6.837 | 7.303 |
| | | Mean | 6.093 | 6.569 |
| | Whisper | Obama | 6.884 | 7.137 |
| | | Donya | 4.628 | 5.742 |
| | | Shaheen | 6.347 | 7.052 |
| | | Mean | 5.953 | 6.643 |
| ER-NeRF [34] | Deep-Speech | Obama | 7.306 | 7.224 |
| | | Donya | 5.054 | 4.851 |
| | | Shaheen | 6.505 | 6.074 |
| | | Mean | 6.323 | 6.050 |
| | HuBERT | Obama | 5.542 | 0.614 |
| | | Donya | 4.822 | 0.480 |
| | | Shaheen | 7.068 | 0.581 |
| | | Mean | 5.810 | 0.558 |
| | Wav2Vec | Obama | 7.219 | 7.428 |
| | | Donya | 4.115 | 4.155 |
| | | Shaheen | 7.145 | 7.242 |
| | | Mean | 6.159 | 6.275 |
| | Whisper | Obama | 7.399 | 8.247 |
| | | Donya | 4.651 | 6.486 |
| | | Shaheen | 6.148 | 7.093 |
| | | Mean | 6.066 | 7.275 |
Execution time analysis of each component in the system architecture as depicted in Figure 1. AA Tokens: number of tokens in the avatar’s answer. AA Duration (s): length of the avatar’s answer in seconds.
| AA Tokens | AA Duration (s) | | STT | Language | TTS | AFE | Frame Rendering | Audio Overlay | SUM |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.41 | Exe. Time (s) | 0.06 | 0.80 | 0.22 | 0.29 | 4.05 | 0.14 | 5.56 |
| | | % of Total | 1.08% | 14.39% | 3.96% | 5.22% | 72.84% | 2.52% | 100% |
| 8 | 1.69 | Exe. Time (s) | 0.07 | 0.96 | 0.33 | 0.28 | 4.75 | 0.16 | 6.55 |
| | | % of Total | 1.07% | 14.66% | 5.04% | 4.27% | 72.52% | 2.44% | 100% |
| 14 | 3.63 | Exe. Time (s) | 0.07 | 2.45 | 0.44 | 0.28 | 5.45 | 0.14 | 8.83 |
| | | % of Total | 0.79% | 27.74% | 4.98% | 3.17% | 61.71% | 1.59% | 100% |
| 21 | 5.08 | Exe. Time (s) | 0.1 | 2.27 | 0.44 | 0.28 | 5.88 | 0.17 | 9.14 |
| | | % of Total | 1.09% | 24.83% | 4.81% | 3.06% | 64.36% | 1.86% | 100% |
| 30 | 6.55 | Exe. Time (s) | 0.06 | 2.76 | 0.49 | 0.27 | 6.58 | 0.15 | 10.31 |
| | | % of Total | 0.58% | 26.77% | 4.75% | 2.62% | 63.84% | 1.45% | 100% |
| 39 | 9.55 | Exe. Time (s) | 0.09 | 1.95 | 0.78 | 0.28 | 7.52 | 0.18 | 10.8 |
| | | % of Total | 0.83% | 18.06% | 7.22% | 2.59% | 69.63% | 1.67% | 100% |
| 50 | 12.16 | Exe. Time (s) | 0.07 | 2.08 | 0.55 | 0.28 | 8.65 | 0.18 | 11.81 |
| | | % of Total | 0.59% | 17.61% | 4.66% | 2.37% | 73.24% | 1.52% | 100% |