Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Beijing University of Posts and Telecommunications, Beijing Normal University, Didi Chuxing
*Indicates Equal Contribution
Teaser Image

DigiFakeAV is the first large-scale multimodal deepfake dataset based on digital human synthesis, containing 60,000 videos (8.4 million frames) with diverse identities across nationalities, skin tones, and genders.

Abstract

In recent years, deepfake technology has advanced rapidly, and its misuse poses serious threats to information security and public safety. Existing datasets focus mainly on traditional face-swapping techniques and fail to reflect the emerging trend of digital human generation. These diffusion-based approaches can generate highly realistic videos from speech and target images, offering greater flexibility, stealthiness, and multimodal coherence, and thus challenge current detection strategies. To address this gap, we introduce DigiFakeAV, the first large-scale multimodal deepfake dataset based on digital human synthesis. It contains 60,000 videos (8.4 million frames) with diverse identities across nationalities, skin tones, and genders. Experimental results show that state-of-the-art detection models suffer a performance drop of over 30% on DigiFakeAV, and user studies confirm that the fake videos are nearly indistinguishable from real ones. Furthermore, we propose AVTSF, an audio-visual fusion detection model that achieves state-of-the-art performance on DF-TIMIT and establishes a benchmark for DigiFakeAV. This work presents the first systematic effort to construct and evaluate a dataset tailored to diffusion-based digital human forgery, highlighting new directions for robust deepfake detection research.

DigiFakeAV Pipeline

DigiFakeAV Dataset Generation Workflow

We generate the DigiFakeAV dataset using state-of-the-art digital human generation and speech synthesis techniques. Specifically, we adopt five recent digital human generation methods (Sonic, Hallo1, Hallo2, Echomimic, and V-Express) and one advanced audio synthesis approach (CosyVoice 2). Compared to the face-swapping pipelines behind prior deepfake datasets, this selection yields forgeries with greater flexibility, realism, and multimodal coherence.
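A minimal batch-driver sketch of this workflow is shown below. The inference scripts, paths, and flag names are placeholders standing in for each method's official open-source release; consult the respective repositories for the exact entry points.

import subprocess
from pathlib import Path

# Placeholder entry points; each stands in for the official inference
# script of the corresponding open-source release.
METHODS = {
    "sonic":     ["python", "Sonic/inference.py"],
    "hallo1":    ["python", "hallo/inference.py"],
    "hallo2":    ["python", "hallo2/inference.py"],
    "echomimic": ["python", "echomimic/inference.py"],
    "v_express": ["python", "V-Express/inference.py"],
}

def synthesize_all(face_img: Path, audio_wav: Path, out_dir: Path) -> None:
    """Drive every generator with one (reference face, driving audio) pair."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, cmd in METHODS.items():
        # Flag names are illustrative; each repository defines its own.
        subprocess.run(cmd + ["--ref_image", str(face_img),
                              "--audio", str(audio_wav),
                              "--output", str(out_dir / f"{name}.mp4")],
                       check=True)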

Diversity

Diversity Examples

Concrete examples demonstrating the diversity of DigiFakeAV, including multiple nationalities, diverse skin tones, various genders, and a range of scenarios.

Visualizations

Fake Video–Real Audio (FV-RA)

This category consists of fake videos created by synthesizing visual content conditioned on authentic audio. We begin by extracting audio from real videos and converting it into WAV format. RetinaFace is then employed to detect and crop representative facial frames. Using five digital human generation techniques—Sonic, Hallo1, Hallo2, Echomimic, and V-Express—we produce a total of 25,000 forged videos. Modern adversaries can obtain both voice and facial data of targets, enabling highly convincing impersonation attacks. Compared to traditional face-swapping methods, these forgeries pose a more credible and serious threat. This subset provides researchers with valuable data to explore advanced identity deception scenarios.
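The following sketch illustrates the two preprocessing steps described above, assuming ffmpeg is installed and using the retina-face Python package; the detect_faces call and its return format are assumptions based on that package's public interface, and any RetinaFace implementation that returns bounding boxes would serve equally well.

import subprocess
import cv2
from retinaface import RetinaFace  # pip install retina-face (API assumed)

def extract_wav(video_path: str, wav_path: str) -> None:
    """Demux the authentic audio track to 16 kHz mono WAV via ffmpeg."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vn", "-ac", "1", "-ar", "16000", wav_path],
                   check=True)

def crop_reference_face(video_path: str, face_path: str) -> None:
    """Read the first frame and crop the highest-confidence detected face."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    faces = RetinaFace.detect_faces(frame)  # assumed: dict of detections
    best = max(faces.values(), key=lambda f: f["score"])
    x1, y1, x2, y2 = best["facial_area"]
    cv2.imwrite(face_path, frame[y1:y2, x1:x2])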

Fake Video–Fake Audio (FV-FA)

This category encompasses both manipulated audio and manipulated video. We employ a state-of-the-art voice cloning method, CosyVoice 2. Specifically, we first generate manipulated text with a large language model. Then, given authentic audio paired with the manipulated text, CosyVoice 2 synthesizes fake audio that exhibits the vocal characteristics of the target speaker. Finally, using Sonic, Hallo1, Hallo2, and Echomimic, we generate 25,000 forged videos driven by this fake audio. This setting reflects a more flexible and realistic means of deception available to fraudsters.
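Below is a minimal sketch of the voice-cloning step. The CosyVoice2 class, load_wav helper, and inference_zero_shot signature follow the public FunAudioLLM/CosyVoice repository's README at the time of writing and should be verified against the current release; the checkpoint path and text strings are illustrative.

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")  # illustrative path

# Authentic reference speech of the target speaker (16 kHz), its transcript,
# and the manipulated text produced by a large language model.
prompt_speech = load_wav("target_speaker.wav", 16000)
prompt_text = "Transcript of the authentic reference clip."
fake_text = "Manipulated text generated by an LLM."

for i, out in enumerate(cosyvoice.inference_zero_shot(
        fake_text, prompt_text, prompt_speech, stream=False)):
    torchaudio.save(f"fake_audio_{i}.wav", out["tts_speech"],
                    cosyvoice.sample_rate)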

Comparison

Dataset Comparison

Future Works

Recent advances in digital human generation continue to yield forgery techniques even more realistic than the methods used here, such as Sonic. These developments pose growing threats to multimedia security, social trust, and information integrity, and call for continued research into robust detection.

To address these evolving challenges, we plan to continuously update and expand the DigiFakeAV dataset by incorporating emerging forgery techniques and diverse real-world scenarios. This effort aims to enhance the dataset's comprehensiveness and representativeness, fostering novel research directions and ensuring that detection models remain effective against increasingly sophisticated deepfake attacks.

BibTeX

@misc{liu2025faceswappingdiffusionbaseddigital,
      title={Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection}, 
      author={Jiaxin Liu and Jia Wang and Saihui Hou and Min Ren and Huijia Wu and Long Ma and Renwang Pei and Zhaofeng He},
      year={2025},
      eprint={2505.16512},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16512}, 
}