The Deepfake Detection Challenge (DFDC) Preview Dataset
Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, Cristian Canton Ferrer
AI Red Team, Facebook AI
Figure 1. Some example face swaps from the dataset.
tor from other existing datasets is that actors have agreed to participate in the creation of the dataset which uses and modifies their likeness. The rough approximation of the general distribution of gender and race across this preview dataset is 74% female and 26% male; and 68% Caucasian, 20% African-American, 9% east-Asian, and 3% south-Asian. As we continue our crowdsourced data capture campaign, we will keep working on improving the diversity towards the publication of the final DFDC dataset. No publicly available data or data from social media sites was used to create this dataset.

For this first version of the DFDC dataset, a small set of 66 individuals was chosen from the pool of crowdsourced actors and split into a training and a testing set. This was done to avoid cross-set face swaps. Two methods were selected to generate face swaps (noted as methods A and B in the dataset); with the intention of representing the real adversarial space of facial manipulation, no further details of the employed methods are disclosed to the participants. A number of state-of-the-art methods will be applied to generate these videos, exploring the whole gamut of manipulation techniques available to generate such tamperings.

A number of face swaps were computed across subjects with similar appearances, where each appearance was inferred from facial attributes (skin tone, facial hair, glasses, etc.). After a given pairwise model was trained on two identities, we swapped each identity onto the other's videos. Hereafter, we refer to the identity in the base video as the "target" identity, and the identity of the face swapped onto the video as the "swapped" identity. In this preview DFDC dataset, all base and target videos are provided as part of the training corpus.
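The pairwise swapping procedure described above can be sketched as follows. This is an illustrative reconstruction, not the released dataset tooling: the function name, the dictionary-based data layout, and the job tuples are all hypothetical.

```python
def enumerate_swap_jobs(identity_videos, similar_pairs):
    """For each pair of similar-looking identities (a, b), swap each
    identity onto all of the other's videos, as described in the text.

    identity_videos: dict mapping identity ID -> list of video paths
    similar_pairs:   list of (a, b) identity-ID pairs judged similar
                     by facial attributes (skin tone, facial hair, ...)
    Returns a list of (swapped_id, target_id, target_video) jobs.
    """
    jobs = []
    for a, b in similar_pairs:
        # A pairwise model trained on identities (a, b) supports
        # swapping in both directions.
        for swapped_id, target_id in ((a, b), (b, a)):
            for video in identity_videos[target_id]:
                jobs.append((swapped_id, target_id, video))
    return jobs
```

With one similar pair and three videos between the two identities, this yields one swap job per (direction, target video) combination, mirroring the "swapped each identity onto the other's videos" step.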
Dataset                     Ratio tampered:original   Total videos   Source    Consent
Celeb-DF [4]                1 : 0.51                  1203           YouTube   N
FaceForensics [8]           1 : 1.00                  2008           YouTube   N
FaceForensics++ [9]         1 : 0.25                  5000           YouTube   N
DeepFakeDetection [11]
(part of FaceForensics++)   1 : 0.12                  3363           Actors    Y
DFDC Preview Dataset        1 : 0.28                  5214           Actors    Y

Table 1. Comparison of the DFDC Preview Dataset with existing deepfake detection datasets.
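The tampered:original ratios in the table above can be converted back into approximate per-class counts. This is a minimal sketch; the counts are approximate because the published ratios are rounded, and the helper name is my own.

```python
def split_by_ratio(total_videos, originals_per_tampered):
    """Split a dataset's total video count into (tampered, original)
    counts, given a tampered:original ratio of 1 : originals_per_tampered.
    Counts are approximate because the published ratios are rounded."""
    tampered = round(total_videos / (1 + originals_per_tampered))
    original = total_videos - tampered
    return tampered, original

# DFDC Preview Dataset row: 5214 total videos at a 1:0.28 ratio.
tampered, original = split_by_ratio(5214, 0.28)  # -> (4073, 1141)
```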
The initial algorithm used to produce this dataset (method A) does not produce sharp or believable face swaps if the subject's face is too close to the camera, so selfie videos or other close-up recordings resulted in easy-to-spot fakes. Therefore, we initially filtered all videos by their average face size ratio, which measures the ratio of the maximum dimension of the face bounding boxes from a sampled set of frames to the minimum dimension of the video. All swaps were performed on videos where this measure was less than 0.25; in order to enrich the original training set, we included all original videos of the identities in this dataset where this measure was less than 0.3 (regardless of whether or not the video appeared in a swap). We additionally provide a second method (method B) that generally produces lower-quality swaps, but is similar to other off-the-shelf face swap algorithms.

After some rough filtering, for each original or swapped video, we removed the first five seconds of the video, as subjects were often seen setting up their cameras during this time. From the remaining length of the video, we extracted multiple 15-second clips - generally three clips per video if the length was over 50 seconds. All of the clips that comprised the training set were left at their original resolution and quality, so deriving appropriate augmentations of the training set is left as an exercise to the researcher. However, for each target video in the test set, we randomly selected two clips out of three and applied augmentations that approximate actual degradations seen in real-life video distributions. Specifically, these augmentations were (1) reduce the FPS of the video to 15; (2) reduce the resolution of the video to 1/4 of its original size; and (3) reduce the overall encoding quality. In this dataset, no video was subjected to more than one augmentation. The third remaining test clip did not undergo any augmentation. After adding these original video clips and their augmentations, we end up with a total of 4,464 unique training clips and 780 unique test clips.

All information regarding the swapped and target identities for each video, along with the train or test set assignment and any augmentations applied to a video, is listed in the file dataset.json, located in the dataset root directory. Swapped video filenames contain two IDs - the first is the swapped ID, and the second is the target ID. The final identifiers in the video refer to which target video the swap was produced from, and the clip within the target video that was used. Finally, the dataset can be downloaded at:

deepfakedetectionchallenge.ai

3. Evaluation metrics

All relevant datasets addressing the task of Deepfake detection (see Table 1) produce metrics that are strongly influenced by the distribution of positive and negative examples present in the test set. As a result, it is difficult to quantitatively measure how any of the methods evaluated on those datasets would perform when facing the real production traffic that any social media company would ingest. This is particularly relevant when evaluating the impact of false positives (FP) and their associated actions; hence the need to capture this effect in the evaluation metrics for this competition.

Current prevalence of deepfakes (compared to unaltered videos) in organic traffic is much lower than that corresponding to the ratios for the datasets in Table 1. If we assume that the ratio between deepfake and unaltered videos is 1 : x in organic traffic and 1 : y in a deepfakes dataset, it is likely that x >> y. Although it is not practical to construct a dataset that mimics the statistics of organic traffic, it is critical to define metrics that capture these differences. We can define a weighted precision for a deepfakes dataset as a very rough approximation of the precision that would be computed by evaluating on a dataset representative of organic traffic. Assuming the ratios of unaltered to tampered videos differ between a test dataset and organic traffic by a factor of α = x/y, we define weighted precision wP and (standard) recall R as

    wP = TP / (TP + αFP),    R = TP / (TP + FN),    (1)

where TP, FP, and FN signify true positives, false positives, and false negatives. Although a realistic value of the prevalence of deepfake videos may lead to a large x = 10^7
(α = 4·10^7), because this preview dataset has few true negatives, any false positives will lead to large variations in the wP metric, so we report wP values with α = 100 (note that this constant is subject to change for the final dataset). Since false positives are heavily weighted, wP is typically small, so we report log(wP) in our results; a value of 0 is the maximum achievable value, but generally the log-weighted-precision is negative.

Finally, although precision is paramount for very-rare detection problems (where detected content may undergo some form of human review or verification), recall must also be considered, as it is important to detect as many items of interest as possible. Therefore, we report the log(wP) for three levels of recall: 0.1, 0.5, and 0.9. The weighted precision for each recall level can give a rough approximation of the cost of labeling videos if one wished to detect half, most, or nearly-all deepfakes in some real distribution of videos.

4. Baseline

To derive an initial baseline, we measured the performance of three simple detection models. The first model was a frame-based model which we denote as TamperNet. TamperNet is a small DNN (6 convolutional layers plus a fully connected layer) trained to detect low-level image manipulations, such as cut-and-pasted objects or the addition of digital text to an image, and although it was not trained only on deepfake images, it performs well in identifying digitally-altered images in general (including face swaps). The other two models are the XceptionNet face detection and full-image models, trained on the FaceForensics data set [9], and evaluated as implemented in [12]. For these models, one frame was sampled per second of video.

When using frame-based models for detection, there are two thresholds to tune - the per-frame detection threshold and a threshold that specifies how many frames must exceed the per-frame threshold in order to identify a video as fake (or the frames-per-video threshold). These thresholds must be tuned in tandem - for good performance, a low per-frame threshold will probably result in a high frames-per-video threshold, and vice-versa. To normalize for video length, we only evaluated the frames-per-video threshold over frames that contained a detectable face. During cross-validation on the train set, we found the optimal frame- and video-thresholds that maximized the log-WP over each fold, while still maintaining the desired level of recall. We then used these thresholds tuned on the training set to compute the metrics in Table 3.

Method               Precision   Recall   log-WP
TamperNet            0.833       0.033    -3.044
XceptionNet (Face)   0.930       0.084    -2.140
XceptionNet (Full)   0.784       0.268    -3.352

Table 2. Video-level test metrics when optimizing for log(wP).

Method               R=0.1    R=0.5    R=0.9
TamperNet            -2.796   -3.864   -4.041
XceptionNet (Face)   -1.999   -3.012   -4.081
XceptionNet (Full)   -3.293   -3.835   -4.081

Table 3. Video-level log(wP) for various recall values.

5. Closing remarks

This document introduces a preview of the DFDC dataset that will be released later in the year, with the intention of encouraging researchers to get familiar with the data, provide early results, and compare them with the proposed baselines.

References

[1] L. Floridi, "Artificial intelligence, deepfakes and a future of ectypes," Philos. Technol., 31:317, 2018.
[2] R. Chesney and D. K. Citron, "Deep Fakes: A looming challenge for privacy, democracy, and national security," 107 California Law Review (2019, Forthcoming), 2018.
[3] P. M. Barrett. (2019) Disinformation and the 2020 election: How the social media industry should prepare. [Online]. Available: https://bhr.stern.nyu.edu/tech-disinfo-and-2020-election
[4] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A new dataset for deepfake forensics," 2019.
[5] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses."
[6] P. Korshunov and S. Marcel, "Deepfakes: a new threat to face recognition? Assessment and detection."
[7] O. I. Al-Sanjary, A. A. Ahmed, and G. Sulong, "Development of a video tampering dataset for forensic investigation," Forensic Science International, Volume 266, pp. 565-572, 2016.
[8] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," arXiv, 2018.
[9] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to detect manipulated facial images," in ICCV, 2019.
[10] M. Schroepfer. (2019) Creating a data set and a challenge for deepfakes, Facebook AI blog. [Online]. Available: https://ai.facebook.com/blog/deepfake-detection-challenge
[11] N. Dufour and A. Gully. (2019) Contributing data to Deepfake detection research, Google AI blog. [Online]. Available: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html
[12] FaceForensics, ondyari/FaceForensics GitHub repository. (2019) [Online]. Available: https://github.com/ondyari/FaceForensics/tree/master/classification