
Deepfake detection: humans vs. machines

Pavel Korshunov, Idiap Research Institute, Martigny, Switzerland (Email: [email protected])
Sébastien Marcel, Idiap Research Institute, Martigny, Switzerland (Email: [email protected])

arXiv:2009.03155v1 [cs.CV] 7 Sep 2020

Abstract—Deepfake videos, where a person's face is automatically swapped with the face of someone else, are becoming easier to generate with more realistic results. In response to the threat such manipulations can pose to our trust in video evidence, several large datasets of deepfake videos and many methods to detect them were proposed recently. However, it is still unclear how realistic deepfake videos are for an average person and whether the algorithms are significantly better than humans at detecting them. In this paper, we present a subjective study conducted in a crowdsourcing-like scenario, which systematically evaluates how hard it is for humans to see whether a video is a deepfake or not. For the evaluation, we used 120 different videos (60 deepfakes and 60 originals) manually pre-selected from the Facebook deepfake database, which was provided in the Kaggle Deepfake Detection Challenge 2020. For each video, a simple question, "Is the face of the person in the video real or fake?", was answered on average by 19 naïve subjects. The results of the subjective evaluation were compared with the performance of two different state-of-the-art deepfake detection methods, based on Xception and EfficientNet (B4 variant) neural networks, which were pre-trained on two other large public databases: the Google subset of FaceForensics++ and the recent Celeb-DF dataset. The evaluation demonstrates that while human perception is very different from the perception of a machine, both are successfully fooled by deepfakes, although in different ways. Specifically, algorithms struggle to detect those deepfake videos which human subjects found to be very easy to spot.

Fig. 1: Examples of deepfakes (faces cropped from videos) in different databases: (a) by Google, (b) DeepfakeTIMIT, (c) by Facebook, (d) Celeb-DF.

I. INTRODUCTION
Autoencoders and generative adversarial networks (GANs) have significantly improved the quality and realism of automated image generation and face swapping, leading to the deepfake phenomenon. Many are starting to believe that the proverb 'seeing is believing' is losing its meaning when it comes to digital video¹. The concern about the impact of widespread deepfake videos on our trust in video recordings is growing. This public unease prompted researchers to propose various datasets of deepfakes and methods to detect them. Some of the latest approaches demonstrate encouraging accuracy, especially if they are trained and evaluated on the same datasets.

Many databases with deepfake videos were created to help develop and train deepfake detection methods. One of the first freely available databases was based on VidTIMIT [1], followed by the FaceForensics database, which 'deepfaked' 10 000 Youtube videos [2] and which was later extended with a larger set of high-resolution videos provided by Google [3]. Another recently proposed database of 50 000 deepfake videos generated from Youtube videos is Celeb-DF [4]. But the most extensive and the largest database to date, with more than 100K videos (80% of which are deepfakes), is the dataset from Facebook, which appeared in the recent Deepfake Detection Challenge hosted by Kaggle². Figure 1 shows examples of faces cropped from deepfake videos in various databases.

These datasets were generated using either the popular open source code³, typically for deepfakes from Youtube videos, or the latest methods by Google and Facebook for creating deepfakes. The fact that even Google and Facebook, private companies that are typically very frugal with making large datasets publicly available, provided some of the most extensive datasets for research shows how important and challenging deepfake detection is for the scientific and industrial communities.

¹ https://edition.cnn.com/interactive/2019/01/business/pentagons-race-against-deepfakes/
² https://www.kaggle.com/c/deepfake-detection-challenge
³ https://www.kaggle.com/c/deepfake-detection-challenge/discussion/121313
Fig. 2: Cropped faces from different categories of deepfake videos ('very easy', 'easy', 'moderate', 'difficult', 'very difficult') from the Facebook database (top row) and the corresponding original versions (bottom row).

This abundance of deepfake video data allowed researchers to train and test detection approaches based on very deep neural networks, such as Xception [3], capsule networks [5], ResNet-50 [6], and EfficientNet [7], which were shown to outperform methods based on shallow CNNs, facial physical characteristics [8], [9], [10], or distortion features [11], [12], [13].

However, despite the public and media uneasiness with deepfake videos and the surge of automated methods for their detection, little is known about how 'good' the deepfakes actually are at 'fooling' human perception. Most of the public perception that deepfakes are realistic comes from personal experience of watching some video examples on Youtube, the alarming media reports, and the understanding that deepfake generation technology will become more realistic in the near future. There is a lack of scientific studies on how realistic the currently available deepfakes are and whether they can pose a threat to human perception of video. The only study [3] that asked human subjects to evaluate 60 images (30 were fake, but the number of deepfakes was not reported) demonstrated that almost 80% of deepfake images were successfully recognized as fake.

In this paper, we conducted a more comprehensive subjective evaluation (of deepfake videos instead of images), using the web-based framework for crowdsourcing experiments QualityCrowd 2 [14]. We want to understand how easily an average human observer can be spoofed by different types of deepfake videos. For that purpose, we selected 120 videos (60 original and 60 deepfakes) from the Facebook dataset², because it is the largest and one of the most recent databases, and it has many different variants of deepfakes, ranging from the most obvious ones to those that look very realistic. We defined five categories of deepfakes (12 of each) by judging how easy it is to spot their visual artifacts: 'very easy', 'easy', 'moderate', 'difficult', and 'very difficult' (see Figure 2 for some examples). For each video, on average 20 naïve subjects (including PhD students, senior scientists, and people in administration) had to answer whether they think it is fake or not.

Understanding how well people recognize deepfakes is important, but so is understanding how detection algorithms recognize them. Policy decisions as well as people's perceptions are often based on the assumption that automated detection algorithms perceive videos in a way that is similar to humans⁴, which can be even dangerous when it comes to such impactful technology as deepfake detection.

Therefore, in this paper, we also assess how two state-of-the-art algorithms, based on the Xception model [15] and EfficientNet variant B4 [7], both of which showed great performance on several deepfake databases [3] and were pre-trained on two other large databases, from Google [3] (a subset of FaceForensics++) and Celeb-DF [4], perform on the same videos and categories of deepfakes that we used in our subjective evaluation. This comparison provides a scientific insight into the differences between human and machine perception of deepfake videos. To allow researchers to verify, reproduce, and extend our work, we provide the pre-trained models, subjective scores, and the scripts used to analyze the data as an open source package⁵.

⁴ https://www.forbes.com/sites/fernandezelizabeth/2019/11/30/ai-is-not-similar-to-human-intelligence-thinking-so-could-be-dangerous/
⁵ Source code: https://gitlab.idiap.ch/bob/bob.paper.wifs2020

This paper has the following main contributions:
• A comprehensive subjective evaluation and the analysis of human perception of different types of deepfake videos;
• Assessment of Xception and EfficientNet based models on the same videos to compare their performance with human subjects;
• Models, subjective data, and analysis scripts provided as open source.
II. DATA AND SUBJECTIVE EVALUATION
Since the resulting videos produced by automated deepfake generation algorithms vary drastically in visual quality, depending on many factors (training data, the quality of the video to be manipulated, and the algorithm itself), we cannot put all deepfakes into one visual category. Therefore, we manually looked through many videos of the Facebook database² and pre-selected 60 deepfake videos, split into five categories depending on how clearly fake they look, together with the corresponding 60 original videos (see examples in Figure 2).

The evaluation was conducted using the QualityCrowd 2 framework [14] designed for crowdsourcing-based evaluations (Figure 3 shows a screenshot of a typical evaluation step). This framework allows us to make sure subjects watch each video fully at least once and are not able to skip any question. Prior to the evaluation itself, a display brightness test was performed using a method similar to that described in [16]. Since deepfake detection algorithms typically evaluate only the face regions cropped using a face detector, to have a comparable scenario, we also showed the human subjects cropped face regions next to the original video (see Figure 3).

Fig. 3: Screenshot of one step of the subjective evaluation (the video is courtesy of the Facebook database).

Each of the 60 naïve subjects who participated in the evaluation had to answer the following question after watching a given video: "Is the face of the person in the video real or fake?" with the following options: "Fake", "Real", and "I do not know." Prior to the evaluation, an explanation of the test was given to the subjects with several test video examples of different fake categories and real videos. The 120 videos were also split into random batches of 40 each to reduce the total evaluation time for one subject, so the average time per evaluation session was about 16 minutes, which is consistent with the standard recommendations.

Due to privacy concerns, we did not collect any personal information from our subjects such as age or gender. Also, the licensing conditions of the Facebook database² restricted the evaluation to the premises of the Idiap Research Institute, which signed a license agreement not to distribute the data outside. Therefore, the subjects consisted of PhD students, scientists, administration, and management of Idiap. Hence, the age can be estimated to be between 20 and 65 years old and the gender distribution to be that of a typical scientific community.

Unlike laboratory-based subjective experiments, where all subjects can be observed by operators and the test environment can be controlled, the major shortcoming of crowdsourcing-based subjective experiments is the inability to supervise participants' behavior and to restrict their test conditions. When using crowdsourcing for evaluation, there is a risk of including untrusted data into the analysis due to wrong test conditions or unreliable behavior of some subjects who try to submit low quality work in order to reduce their effort. For this reason, detection of unreliable workers is an inevitable part of crowdsourcing-based subjective experiments. There are several methods for identifying the 'trustworthiness' of a subject, but since our evaluation was conducted within the premises of a scientific institute, we only used the so-called 'honeypot' method [16], [17] to filter out scores from people who did not pay attention at all. A honeypot is a very easy question that refers to the video the subject just watched in the previous step, e.g., "what was visible in the previous video?", with obvious answers that test whether a person even looked at the video. Using this question, we filtered out the scores from 5 people from our final results, hence we ended up with 18.66 answers on average for each video, which is the number of subjects commonly considered in subjective evaluations.
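For illustration, the honeypot filtering and the per-video answer counts described above can be computed in a few lines. The sketch below assumes a hypothetical answers table with columns subject_id, video_id, answer, and honeypot_passed, and an assumed file name; it is not the exact script released with the paper.

```python
import pandas as pd

# Hypothetical raw answers: one row per (subject, video) with the selected
# option ("Fake", "Real", "I do not know") and a flag for the honeypot check.
answers = pd.read_csv("subjective_answers.csv")  # assumed file name

# 1. Drop every answer from subjects who failed any honeypot question.
reliable = answers[answers.groupby("subject_id")["honeypot_passed"].transform("all")]

# 2. Count how many valid answers remain per video
#    (the paper reports 18.66 on average after filtering out 5 subjects).
per_video_counts = reliable.groupby("video_id")["answer"].count()
print("average answers per video:", per_video_counts.mean())
```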
III. SUBJECTIVE EVALUATION RESULTS

For each deepfake or original video, we computed the percentage of answers that were 'certain & correct', when people selected 'Real' for an original or 'Fake' for a deepfake, 'certain & incorrect' (selected 'Real' for a deepfake or 'Fake' for an original), and 'uncertain', when the selection was 'I do not know'. We averaged those percentages across the videos in each category to obtain the final percentages, which are shown in Figure 4. From the figure, we can note that the pre-selected deepfake categories, on average, reflect the difficulty level of recognizing them. An interesting result is the low number of uncertain answers, which means people tend to be sure when judging the realism of a video. It also means people can be easily spoofed by a good quality deepfake video, since only in 24.5% of cases are 'well done' deepfake videos perceived as fakes, even though these subjects already knew they were looking for fakes. In a scenario where such a deepfake would be distributed to an unsuspecting audience (e.g., via social media), we can expect the number of people noticing it to be significantly lower. Also, it is interesting to note that even videos from the 'easy' category were not as easy to spot (71.1% correct answers) compared to the original videos (82.2%). Overall, we can see that people are better at recognizing very obvious examples of deepfakes or real unaltered videos.

Fig. 4: Subjective answers for each category of deepfakes and original unaltered videos.
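The per-category percentages summarized in Figure 4 could be derived from the filtered answers along the following lines. The table layout (columns video_id, category, is_deepfake, answer) and the file name are assumptions for illustration, not the released analysis scripts.

```python
import pandas as pd

# Hypothetical filtered answers joined with per-video metadata:
# columns: video_id, category, is_deepfake (bool), answer in {"Fake", "Real", "I do not know"}
df = pd.read_csv("filtered_answers_with_metadata.csv")  # assumed file name

def outcome(row):
    # Classify each answer as in the paper: certain & correct, certain & incorrect, or uncertain.
    if row["answer"] == "I do not know":
        return "uncertain"
    correct = (row["answer"] == "Fake") == row["is_deepfake"]
    return "certain & correct" if correct else "certain & incorrect"

df["outcome"] = df.apply(outcome, axis=1)

# Percentage of each outcome per video, then averaged over the videos in each category.
per_video = (df.groupby(["category", "video_id"])["outcome"]
               .value_counts(normalize=True).unstack(fill_value=0))
per_category = per_video.groupby("category").mean() * 100
print(per_category.round(1))
```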

To check whether the difference between videos from the five deepfake categories is statistically significant based on the subjective scores, we performed an ANOVA test, with the corresponding box plot shown in Figure 5. The scores were computed for each video (and per category when applicable) by averaging the answers from all corresponding observers. For each correct answer the score is 1, and for both wrong and uncertain answers the score is 0. Please note that the red lines in Figure 5 correspond to median values, not the averages that we plotted in Figure 4. The p-value of the ANOVA test is below 4.7e−11, which means the deepfake categories are significantly different on average. However, Figure 5 shows that the 'easy', 'moderate', and 'difficult' categories have large score variations and overlap, which means some of the videos from these categories are perceived similarly. It means some of the deepfake videos could be moved to another category. This observation is also supported by Figure 6, which plots the average scores with confidence intervals (computed using Student's t-distribution [18]) for each video in the deepfake categories (12 videos each) and the originals (60 videos).

Fig. 5: Median values with error bars from the ANOVA test performed on subjective scores from the five deepfake categories.

Fig. 6: Average scores with confidence intervals for each video in every video category.
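The ANOVA test and the t-based confidence intervals described above can be reproduced with SciPy. The sketch below uses placeholder score arrays purely to show the shape of the computation; it is illustrative and not the exact released analysis code.

```python
import numpy as np
from scipy import stats

# Hypothetical per-video average scores grouped by deepfake category
# (1 = correct, 0 = wrong/uncertain); placeholder values only.
scores_by_category = {
    "very_easy": np.array([0.95, 0.90, 0.88]),
    "easy": np.array([0.75, 0.68, 0.71]),
    "moderate": np.array([0.55, 0.60, 0.52]),
    "difficult": np.array([0.40, 0.35, 0.45]),
    "very_difficult": np.array([0.20, 0.30, 0.25]),
}

# One-way ANOVA across the five deepfake categories (the paper reports p < 4.7e-11).
f_stat, p_value = stats.f_oneway(*scores_by_category.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")

# 95% confidence interval for one video using Student's t-distribution,
# as used for the error bars in Figure 6.
def confidence_interval(per_subject_scores, confidence=0.95):
    n = len(per_subject_scores)
    mean = np.mean(per_subject_scores)
    sem = stats.sem(per_subject_scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean, half_width

mean, hw = confidence_interval(np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1]))
print(f"mean = {mean:.2f} ± {hw:.2f}")
```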
IV. EVALUATION OF ALGORITHMS

As examples of machine vision, we took two state-of-the-art algorithms, based on the Xception model [15] and EfficientNet variant B4 [7], shown to perform very well on different deepfake datasets and benchmarks [3]. We pre-trained these models for 20 epochs each on the Google subset of the FaceForensics++ database [3] and on Celeb-DF [4] to demonstrate the impact of different training conditions on the evaluation results. If evaluated on the test sets of the same databases they were trained on, both Xception and EfficientNet classifiers demonstrate a great performance, as shown in Table I. We can see that the area under the curve (AUC), which is the common metric used to compare the performance of deepfake detection algorithms, is almost at 100% in all cases.

TABLE I: Area under the curve (AUC) value on the test sets of the Google and Celeb-DF databases for the Xception and EfficientNet models.

Model          Trained on           AUC (%) on Test set
Xception       Google database      100.00
Xception       Celeb-DF database    100.0
EfficientNet   Google database      99.99
EfficientNet   Celeb-DF database    100.0
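For illustration, fine-tuning such a detector amounts to binary classification on cropped faces. The sketch below uses PyTorch with the timm model zoo as one possible way to instantiate an EfficientNet-B4 (or Xception) backbone; the data loader, learning rate, and loss are assumptions and not necessarily the authors' exact training setup, which is available in the released package.

```python
import timm
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

# One possible backbone; "xception" or "tf_efficientnet_b4" are timm model names.
model = timm.create_model("tf_efficientnet_b4", pretrained=True, num_classes=1)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

criterion = nn.BCEWithLogitsLoss()                    # real (0) vs. deepfake (1)
optimizer = optim.Adam(model.parameters(), lr=1e-4)   # assumed learning rate

def train(model, train_loader: DataLoader, epochs: int = 20):
    """Fine-tune on cropped face images; labels are 1 for deepfake, 0 for original."""
    model.train()
    for epoch in range(epochs):
        for faces, labels in train_loader:   # faces: (B, 3, H, W) tensors of face crops
            faces, labels = faces.to(device), labels.float().to(device)
            optimizer.zero_grad()
            logits = model(faces).squeeze(1)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```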
We evaluated these models on the 120 videos we used in the subjective test. Since these videos come from the Facebook database, they can be considered as unseen data, which is still an obstacle for many DNN classifiers, as they do not generalize well on unseen data, a fact also highlighted in the recent Facebook Deepfake Detection Challenge [19]. To compute the performance accuracy, we need to select a threshold. We chose the threshold corresponding to a false accept rate (FAR) of 10%, selected on the development set of the respective database. We selected the threshold based on the FAR value, as opposed to the equal error rate (EER) commonly used in biometrics, because many practical deepfake detection or anti-spoofing systems are required to keep the FAR below a given bound. In our case, a FAR of 10% is quite generous.
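To make the threshold selection concrete, one way to pick the operating point at FAR = 10% from development-set scores, and to compute the per-category accuracy plotted in Figure 7, is sketched below. Scores are assumed to be higher for deepfakes, and all variable names are hypothetical; this is not the paper's exact evaluation code.

```python
import numpy as np

def threshold_at_far(dev_scores_real: np.ndarray, far: float = 0.10) -> float:
    """Pick the threshold so that `far` of the real (bona fide) development
    videos would be falsely accepted as deepfakes (score >= threshold)."""
    # The (1 - far) quantile of the real-video scores leaves `far` of them above it.
    return float(np.quantile(dev_scores_real, 1.0 - far))

def per_category_accuracy(scores, labels, categories, threshold):
    """Fraction of correctly classified videos in each category.
    labels: 1 for deepfake, 0 for original; scores: higher = more likely fake."""
    predictions = (np.asarray(scores) >= threshold).astype(int)
    correct = predictions == np.asarray(labels)
    return {c: correct[np.asarray(categories) == c].mean()
            for c in set(categories)}

# Hypothetical usage: dev_real_scores come from the development set of the
# training database, then evaluation is done on the 120 subjective-test videos.
# thr = threshold_at_far(dev_real_scores, far=0.10)
# acc = per_category_accuracy(test_scores, test_labels, test_categories, thr)
```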

Fig. 7: The detection accuracy (the threshold corresponds to FAR 10% on the development set of the respective database) for each video category from the subjective test by the Xception and EfficientNet models pre-trained on the Google and Celeb-DF databases: (a) EfficientNet trained on Google, (b) EfficientNet trained on Celeb-DF, (c) Xception trained on Google, (d) Xception trained on Celeb-DF.

Figure 7 demonstrates the evaluation results of the pre-trained Xception and EfficientNet models on the videos from the subjective test, averaged for each deepfake category and the originals (when using the threshold corresponding to FAR = 10%). In the figure, the blue bars correspond to the percentage of correctly detected videos in the given category, and the orange bars correspond to the percentage of incorrectly detected ones. The results for the algorithms are very different from the results of the subjective test (see Figure 4 for the evaluation results by human subjects). The accuracy of the algorithms has no correlation with the visual appearance of the deepfakes. The algorithms 'see' these videos very differently from how humans perceive them. To a human observer the result may even appear random. We can even notice that all algorithms struggle the most with the deepfake videos that were easy for human subjects.

It is evident that the choice of threshold and the training data have a major impact on the evaluation accuracy. However, when selecting a deepfake detection system to use in a practical scenario, one cannot assume an algorithm's perception will have any relation to the way we think the videos look.

If we remove the choice of the threshold and the pre-selected video categories and simply evaluate the models on the 120 videos from the subjective tests, we obtain the receiver operating characteristic (ROC) curves and the corresponding AUC values presented in Figure 8. From this figure, we can note that the ROC curves look 'normal', as typical curves for classifiers that do not generalize well on unseen data, especially taking into account the excellent performance on the test sets shown in Table I. Figure 8 also shows that human subjects were more accurate at assessing this set of videos, since the corresponding ROC curve is consistently higher, with the highest AUC value of 87.47%.

Fig. 8: ROC curves with the corresponding AUC values of the Xception and EfficientNet models pre-trained on the Google and Celeb-DF databases, evaluated on all the videos from the subjective test: EfficientNet-trained-on-Google (AUC = 73.97%), Xception-trained-on-Google (AUC = 72.69%), EfficientNet-trained-on-CelebDF (AUC = 70.71%), Xception-trained-on-CelebDF (AUC = 73.36%), Subjective-Scores (AUC = 87.47%).
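The ROC curves and AUC values of Figure 8 can be obtained from per-video scores with scikit-learn. The snippet below is a sketch with assumed score and label arrays rather than the released evaluation scripts.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_and_auc(scores: np.ndarray, labels: np.ndarray):
    """labels: 1 for deepfake, 0 for original; scores: higher = more likely fake."""
    fpr, tpr, _ = roc_curve(labels, scores)   # FPR corresponds to FAR, TPR to 1 - FRR
    return fpr, tpr, auc(fpr, tpr)

# Subjective scores can be treated the same way: for each video, use the fraction
# of subjects who answered "Fake" as the score, and compare against the same labels.
# fpr, tpr, roc_auc = roc_and_auc(model_scores, labels)
# print(f"AUC = {100 * roc_auc:.2f}%")
```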
V. CONCLUSION

In this paper, we presented the results of a subjective evaluation of different categories of deepfake videos, ranging from obviously fake to easily confused with real videos. The videos were manually pre-selected from the Facebook database and evaluated by 60 human subjects. The same videos were also used in the evaluation of two state-of-the-art deepfake detection algorithms based on the Xception and EfficientNet models, which were separately pre-trained on the Google and Celeb-DF deepfake databases.

The subjective evaluation demonstrated that people are consistent in the way they perceive different types of deepfakes. Also, the results show that people are confused by good quality deepfakes in 75.5% of cases. On the other hand, the algorithms have a totally different perception of deepfakes compared to human subjects. The algorithms struggle to detect many deepfakes which look obviously fake to humans, while some of the algorithms (depending on the training data and the selected threshold) can accurately detect videos that are difficult for human subjects.

This paper shows that deepfake generation is already at a level of realism that would confuse the majority of the public, especially in a browser-based viewing scenario. The paper also shows that it is important to clearly understand how a given algorithm evaluates data, and what conditions impact its performance and in which way. What is even more important is to not confuse and to not anthropomorphize machine vision with human vision, because they are very different and do not correlate with each other.

ACKNOWLEDGEMENTS

This work was funded by the Hasler Foundation's VERIFAKE project and the Swiss Center for Biometrics Research and Testing.

REFERENCES

[1] P. Korshunov and S. Marcel, "Vulnerability assessment and detection of Deepfake videos," in International Conference on Biometrics (ICB 2019), Crete, Greece, Jun. 2019.
[2] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," arXiv preprint, 2018. [Online]. Available: https://arxiv.org/abs/1803.09179
[3] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to detect manipulated facial images," in International Conference on Computer Vision (ICCV), 2019.
[4] Y. Li, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A large-scale challenging dataset for DeepFake forensics," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-forensics: Using capsule networks to detect forged images and videos," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 2307–2311.
[6] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are surprisingly easy to spot... for now," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[7] D. M. Montserrat, H. Hao, S. K. Yarlagadda, S. Baireddy, R. Shao, J. Horvath, E. Bartusiak, J. Yang, D. Güera, F. Zhu, and E. J. Delp, "Deepfakes detection with automatic face weighting," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 2851–2859.
[8] Y. Li, M. Chang, and S. Lyu, "In ictu oculi: Exposing AI created fake videos by detecting eye blinking," in IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[9] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head pose," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8261–8265.
[10] S. Agarwal, T. El-Gaaly, H. Farid, and S. Lim, "Detecting deep-fake videos from appearance and behavior," arXiv preprint, 2020. [Online]. Available: https://arxiv.org/abs/2004.14491
[11] Y. Zhang, L. Zheng, and V. L. L. Thing, "Automated face swapping and its detection," in IEEE International Conference on Signal and Image Processing (ICSIP), Aug 2017, pp. 15–19.
[12] A. Agarwal, R. Singh, M. Vatsa, and A. Noore, "Swapped! Digital face presentation attack detection via weighted local magnitude pattern," in IEEE International Joint Conference on Biometrics (IJCB), Oct 2017, pp. 659–665.
[13] P. Korshunov and S. Marcel, "Deepfakes: A new threat to face recognition? Assessment and detection," arXiv preprint, 2018. [Online]. Available: https://arxiv.org/abs/1812.08685
[14] C. Keimel, J. Habigt, C. Horch, and K. Diepold, "QualityCrowd – A framework for crowd-based quality evaluation," in Picture Coding Symposium (PCS), May 2012.
[15] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807.
[16] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, and P. Tran-Gia, "Best practices for QoE crowdtesting: QoE assessment with crowdsourcing," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 541–558, 2014.
[17] P. Korshunov, H. Nemoto, A. Skodras, and T. Ebrahimi, "Crowdsourcing-based evaluation of privacy in HDR images," in Optics, Photonics, and Digital Technologies for Multimedia Applications III, vol. 9138. SPIE, 2014, pp. 1–11.
[18] P. Hanhart, M. Rerabek, P. Korshunov, and T. Ebrahimi, "Subjective evaluation of HEVC intra coding for still image compression," in International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2013.
[19] R. Tolosana, S. Romero-Tapiador, J. Fierrez, and R. Vera-Rodriguez, "Deepfakes evolution: Analysis of facial regions and fake detection performance," arXiv preprint, 2020. [Online]. Available: https://arxiv.org/abs/2004.07532
