Deepfake Detection: Humans vs. Machines

Pavel Korshunov
Idiap Research Institute
Martigny, Switzerland
Email: [email protected]

Sébastien Marcel
Idiap Research Institute
Martigny, Switzerland
Email: [email protected]
Fig. 2: Cropped faces from different categories of deepfake videos from Facebook database (top row) and the corresponding
original versions (bottom row).
challenging is the deepfake detection for the scientific and industrial communities. This abundance of deepfake video data allowed researchers to train and test detection approaches based on very deep neural networks, such as Xception [3], capsule networks [5], ResNet-50 [6], and EfficientNet [7], which were shown to outperform methods based on naïve shallow CNNs, facial physical characteristics [8], [9], [10], or distortion features [11], [12], [13].

However, despite the public and media uneasiness with deepfake videos and the surge of automated methods for their detection, little is known about how 'good' deepfakes actually are at 'fooling' human perception. Most of the public perception that deepfakes are realistic comes from personal experience of watching video examples on YouTube, from alarming media reports, and from the understanding that deepfake generation technology will become more realistic in the near future. There is a lack of scientific studies on how realistic the currently available deepfakes are and whether they pose a threat to human perception of video. The only study [3] that asked human subjects to evaluate 60 images (30 were fake, but the number of deepfakes was not reported) demonstrated that almost 80% of deepfake images were successfully recognized as fake.

In this paper, we conducted a more comprehensive subjective evaluation (of deepfake videos instead of images), using the web-based framework for crowdsourcing experiments QualityCrowd 2 [14]. We want to understand how easily an average human observer can be spoofed by different types of deepfake videos. For that purpose, we selected 120 videos (60 original and 60 deepfakes) from the Facebook dataset2, because it is the largest and one of the most recent databases, and it has many different variants of deepfakes, ranging from the most obvious ones to those that look very realistic. We defined five categories of deepfakes (12 videos each) by judging how easy it is to spot their visual artifacts: 'very easy', 'easy', 'moderate', 'difficult', and 'very difficult' (see Figure 2 for some examples). For each video, on average 20 subjects (including PhD students, senior scientists, and people in administration) had to answer whether they think it is fake or not.

Understanding how well people recognize deepfakes is important, but so is understanding how detection algorithms recognize them. Policy decisions as well as people's perceptions are often based on the assumption that automated detection algorithms perceive videos in a way similar to humans4, which can be dangerous when it comes to such impactful technology as deepfake detection.

Therefore, in this paper, we also assess how two state-of-the-art algorithms, based on the Xception model [15] and EfficientNet variant B4 [7], both of which showed great performance on several deepfake databases [3], pre-trained on two other large databases from Google [3] (a subset of FaceForensics++) and Celeb-DF [4], perform on the same videos and categories of deepfakes that we used in our subjective evaluation. This comparison provides scientific insight into the differences between human and machine perception of deepfake videos. To allow researchers to verify, reproduce, and extend our work, we provide the pre-trained models, subjective scores, and the scripts used to analyze the data as an open-source package5.

4 https://www.forbes.com/sites/fernandezelizabeth/2019/11/30/ai-is-not-similar-to-human-intelligence-thinking-so-could-be-dangerous/
5 Source code: https://gitlab.idiap.ch/bob/bob.paper.wifs2020
This paper has the following main contributions:
• A comprehensive subjective evaluation and the analysis of human perception of different types of deepfake videos;
• Assessment of Xception and EfficientNet based models on the same videos to compare their performance with human subjects;
• Models, subjective data, and analysis scripts are provided as open source.
II. DATA AND SUBJECTIVE EVALUATION
Since the resulting videos produced by automated deepfake generation algorithms vary drastically visually, depending on many factors (training data, the quality of the video chosen for manipulation, and the algorithm itself), we cannot place all deepfakes into one visual category. Therefore, we manually looked through many videos of the Facebook database2 and pre-selected 60 deepfake videos, split into five categories depending on how clearly fake they look, together with the corresponding 60 original videos (see examples in Figure 2).

Fig. 3: Screenshot of one step of subjective evaluation (the video is courtesy of Facebook database).

The evaluation was conducted using the QualityCrowd 2 framework [14] designed for crowdsourcing-based evaluations (Figure 3 shows a screenshot of a typical evaluation step). This framework allows us to make sure subjects watch each video fully at least once and are not able to skip any question. Prior to the evaluation itself, a display brightness test was performed using a method similar to that described in [16].

Since deepfake detection algorithms typically evaluate only the face regions cropped using a face detector, to have a comparable scenario, we also showed the human subjects the cropped face regions next to the original video (see Figure 3).

Each of the 60 naïve subjects who participated in the evaluation had to answer the following question after watching a given video: "Is the face of the person in the video real or fake?" with the options "Fake", "Real", and "I do not know." Prior to the evaluation, an explanation of the test was given to the subjects, with several test video examples of different fake categories and real videos. The 120 videos were also split into random batches of 40 each to reduce the total evaluation time for one subject, so the average time per evaluation was about 16 minutes, which is consistent with the standard recommendations.

Due to privacy concerns, we did not collect any personal information from our subjects, such as age or gender. Also, the licensing conditions of the Facebook database2 restricted the evaluation to the premises of the Idiap Research Institute, which signed a license agreement not to distribute the data outside. Therefore, the subjects consisted of PhD students, scientists, administration, and management of Idiap. Hence, the age can be estimated to be between 20 and 65 years old and the gender distribution to be that of a typical scientific community.

Unlike laboratory-based subjective experiments, where all subjects can be observed by operators and the test environment can be controlled, the major shortcoming of crowdsourcing-based subjective experiments is the inability to supervise participants' behavior and to restrict their test conditions. When using crowdsourcing for evaluation, there is a risk of including untrusted data in the analysis due to wrong test conditions or the unreliable behavior of some subjects who try to submit low-quality work in order to reduce their effort. For this reason, detecting unreliable workers is an inevitable part of crowdsourcing-based subjective experiments. There are several methods for identifying the 'trustworthiness' of a subject, but since our evaluation was conducted within the premises of a scientific institute, we only used the so-called 'honeypot' method [16], [17] to filter out scores from people who did not pay attention at all. A honeypot is a very easy question that refers to the video the subject just watched in the previous step, e.g., "what was visible in the previous video?", with obvious answers that test whether a person even looked at the video. Using this question, we filtered out the scores of 5 people from our final results; hence, we ended up with 18.66 answers on average for each video, which is the number of subjects commonly considered in subjective evaluations.

III. SUBJECTIVE EVALUATION RESULTS

For each deepfake or original video, we computed the percentage of answers that were 'certain & correct', when people selected 'Real' for an original or 'Fake' for a deepfake, 'certain & incorrect' (selected 'Real' for a deepfake or 'Fake' for an original), and 'uncertain', when the selection was 'I do not know'. We averaged those percentages across the videos in each category to obtain the final percentages, which are shown in Figure 4. From the figure, we can note that the pre-selected deepfake categories, on average, reflect the difficulty level of recognizing them. An interesting result is the low number of uncertain answers, which means people tend to be sure when it comes to judging the realism of a video. It also means people can be easily spoofed by a good-quality deepfake video, since only in 24.5% of cases were 'well done' deepfake videos perceived as fakes, even though the subjects already knew they were looking for fakes. In a scenario where such a deepfake is distributed to an unsuspecting audience (e.g., via social media), we can expect
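The honeypot filtering and per-category aggregation of subjective answers can be sketched as follows. This is a minimal illustration, not the paper's released analysis scripts: the record layout, the set of subjects who failed the honeypot, and all names are our own assumptions.

```python
from collections import defaultdict

def aggregate_answers(records, failed_honeypot):
    """Drop answers from subjects who failed the honeypot question,
    compute per-video percentages of 'certain & correct',
    'certain & incorrect', and 'uncertain' answers, then average
    those percentages across the videos of each category.
    Each record is (subject, video, category, is_fake, answer)."""
    per_video = defaultdict(lambda: {"correct": 0, "incorrect": 0, "uncertain": 0})
    video_category = {}
    for subject, video, category, is_fake, answer in records:
        if subject in failed_honeypot:
            continue  # unreliable worker, excluded from the analysis
        video_category[video] = category
        if answer == "I do not know":
            per_video[video]["uncertain"] += 1
        elif (answer == "Fake") == is_fake:
            per_video[video]["correct"] += 1
        else:
            per_video[video]["incorrect"] += 1
    # percentage per video, then average across videos in each category
    by_category = defaultdict(list)
    for video, counts in per_video.items():
        total = sum(counts.values())
        by_category[video_category[video]].append(
            {k: 100.0 * v / total for k, v in counts.items()})
    return {cat: {k: sum(d[k] for d in lst) / len(lst)
                  for k in ("correct", "incorrect", "uncertain")}
            for cat, lst in by_category.items()}
```

With this aggregation, the bars of Figure 4 correspond to the three averaged percentages for each of the five deepfake categories and the originals.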
Fig. 4: Subjective answers for each category of deepfakes and original unaltered videos.

Fig. 6: Average scores with confidence intervals for each video in every video category.
TABLE I: Area under the curve (AUC) value on the test sets of Google and Celeb-DF databases of Xception and EfficientNet models.
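AUC values such as those in Table I can be computed directly from per-video detection scores. A minimal pure-Python sketch of the underlying quantity (the probability that a fake video scores higher than a real one, with ties counted as one half); the score lists are illustrative, not the paper's data:

```python
def auc(scores_fake, scores_real):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: fraction of (fake, real) pairs in which the fake
    video receives the higher score, counting ties as 0.5."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))
```

An AUC of 1.0 means the scores separate fakes from originals perfectly, 0.5 means chance-level separation; this pairwise definition is threshold-free, which is why AUC complements the FAR-based accuracy reported below.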
Fig. 7: The detection accuracy (the threshold corresponds to FAR 10% on the development set of the respective database) for each video category from the subjective test, by Xception and EfficientNet models pre-trained on the Google and Celeb-DF databases.
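The FAR-based operating point mentioned in the caption above can be sketched as follows. This is a simplified illustration under our own assumptions (higher score means 'more likely fake'; function names are ours), not the evaluation code released with the paper:

```python
def threshold_at_far(real_dev_scores, target_far=0.10):
    """Select a decision threshold on the development set so that the
    false accept rate (fraction of real videos scored above the
    threshold, i.e. wrongly flagged as fake) stays at or below
    target_far."""
    ranked = sorted(real_dev_scores, reverse=True)
    k = int(target_far * len(ranked))  # number of tolerated false accepts
    return ranked[k] if k < len(ranked) else ranked[-1]

def detection_accuracy(scores, is_fake_labels, threshold):
    """Fraction of videos whose decision (score > threshold => fake)
    matches the ground-truth label."""
    correct = sum((s > threshold) == y for s, y in zip(scores, is_fake_labels))
    return correct / len(scores)
```

The threshold is fixed once on the development set of the training database and then applied unchanged to the 120 subjective-test videos, which is why the per-category accuracies in Figure 7 depend so strongly on the choice of training data.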
compute performance accuracy, we need to select a threshold. We chose the threshold corresponding to a false accept rate (FAR) of 10%, selected on the development set of the respective database. We selected the threshold based on the FAR value, as opposed to the equal error rate (EER) commonly used in biometrics, because many practical deepfake detection or anti-spoofing systems have a lower-bound requirement on the FAR value. In our case, a FAR of 10% is quite generous.

Figure 7 demonstrates the evaluation results of the pre-trained Xception and EfficientNet models on the videos from the subjective test, averaged for each deepfake category and the originals (when using the threshold corresponding to FAR = 10%). In the figure, the blue bar corresponds to the percentage of correctly detected videos in the given category, and the orange bar corresponds to the percentage of incorrectly detected ones. The results for the algorithms are very different from the results of the subjective test (see Figure 4 for the evaluation results by human subjects). The accuracy of the algorithms has no correlation with the visual appearance of the deepfakes. The algorithms 'see' these videos very differently from how humans perceive them. To a human observer the result may even appear random. We can even notice that all algorithms struggle the most with the deepfake videos that were easy for human subjects.

It is evident that the choice of threshold and the training data have a major impact on the evaluation accuracy. However, when selecting a deepfake detection system for a practical scenario, one cannot assume an algorithm's perception will have any relation to the way we think the videos look.

If we remove the choice of the threshold and the pre-selected video categories and simply evaluate the models on the 120 videos from the subjective tests, the receiver operating characteristic (ROC) curves and the corresponding AUC values are presented in Figure 8. From this figure, we can note that the ROC curves look 'normal', as typical curves for classifiers that do not generalize well on unseen data, especially taking into account the excellent performance on the test sets shown in Table I. Figure 8 also shows that human subjects were more accurate at assessing this set of videos, since the corresponding ROC curve is consistently higher, with the highest AUC value of 87.47%.

V. CONCLUSION

In this paper, we presented the results of a subjective evaluation of different categories of deepfake videos, ranging from obviously fake to easily confused with real videos. The videos were manually pre-selected from the Facebook database and evaluated by 60 human subjects. The same videos were also used in the evaluation of two state-of-the-art deepfake detection algorithms based on Xception and EfficientNet models, which were separately pre-trained on the Google and Celeb-DF deepfake databases.

The subjective evaluation demonstrated that people are consistent in the way they perceive different types of deepfakes. Also, the results show that people are confused by good-quality deepfakes in 75.5% of cases. On the other hand, the algorithms have a totally different perception of deepfakes compared to human subjects. The algorithms struggle to detect many deepfakes that look obviously fake to humans, while some of the algorithms (depending on the training data and the selected threshold) can accurately detect videos that are difficult for human subjects.
[Figure 8: ROC curves; vertical axis: True Positive Rate (1 - FRR)]
This paper shows that deepfake generation is already at a level of realism that would confuse the majority of the public, especially in the browser-based viewing scenario. The paper also shows that it is important to clearly understand how a given algorithm evaluates data and what conditions impact its performance and in which way. What is even more important is to not confuse and not anthropomorphize machine vision with human vision, because they are very different and do not correlate with each other.

ACKNOWLEDGEMENTS

This work was funded by the Hasler Foundation's VERIFAKE project and the Swiss Center for Biometrics Research and Testing.

REFERENCES

[1] P. Korshunov and S. Marcel, "Vulnerability assessment and detection of Deepfake videos," in International Conference on Biometrics (ICB 2019), Crete, Greece, Jun. 2019.
[2] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," arXiv preprint, 2018. [Online]. Available: http://arxiv.org/abs/1803.09179
[3] ——, "FaceForensics++: Learning to detect manipulated facial images," in International Conference on Computer Vision (ICCV), 2019.
[4] Y. Li, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A large-scale challenging dataset for DeepFake forensics," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-forensics: Using capsule networks to detect forged images and videos," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 2307–2311.
[6] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are surprisingly easy to spot... for now," in CVPR, 2020.
[7] D. M. Montserrat, H. Hao, S. K. Yarlagadda, S. Baireddy, R. Shao, J. Horvath, E. Bartusiak, J. Yang, D. Güera, F. Zhu, and E. J. Delp, "Deepfakes detection with automatic face weighting," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 2851–2859.
[8] Y. Li, M. Chang, and S. Lyu, "In ictu oculi: Exposing AI created fake videos by detecting eye blinking," in IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[9] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head pose," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8261–8265.
[10] S. Agarwal, T. El-Gaaly, H. Farid, and S. Lim, "Detecting deep-fake videos from appearance and behavior," arXiv preprint, 2020. [Online]. Available: https://arxiv.org/abs/2004.14491
[11] Y. Zhang, L. Zheng, and V. L. L. Thing, "Automated face swapping and its detection," in IEEE International Conference on Signal and Image Processing (ICSIP), Aug 2017, pp. 15–19.
[12] A. Agarwal, R. Singh, M. Vatsa, and A. Noore, "Swapped! Digital face presentation attack detection via weighted local magnitude pattern," in IEEE International Joint Conference on Biometrics (IJCB), Oct 2017, pp. 659–665.
[13] P. Korshunov and S. Marcel, "Deepfakes: A new threat to face recognition? Assessment and detection," arXiv preprint, 2018. [Online]. Available: https://arxiv.org/abs/1812.08685
[14] C. Keimel, J. Habigt, C. Horch, and K. Diepold, "QualityCrowd – a framework for crowd-based quality evaluation," in Picture Coding Symposium (PCS), May 2012.
[15] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807.
[16] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, and P. Tran-Gia, "Best practices for QoE crowdtesting: QoE assessment with crowdsourcing," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 541–558, 2014.
[17] P. Korshunov, H. Nemoto, A. Skodras, and T. Ebrahimi, "Crowdsourcing-based evaluation of privacy in HDR images," in Optics, Photonics, and Digital Technologies for Multimedia Applications III, vol. 9138. SPIE, 2014, pp. 1–11.
[18] P. Hanhart, M. Rerabek, P. Korshunov, and T. Ebrahimi, "Subjective evaluation of HEVC intra coding for still image compression," in International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2013.
[19] R. Tolosana, S. Romero-Tapiador, J. Fierrez, and R. Vera-Rodriguez, "Deepfakes evolution: Analysis of facial regions and fake detection performance," arXiv preprint, 2020. [Online]. Available: https://arxiv.org/abs/2004.07532