Written by Tarek El-Gaaly, Reed Meyerson and Gaurav Bharaj
Introduction
With the rise of AI (Deep Learning)-powered solutions over the last decade, the subfield of Explainability (xAI) has risen in tandem, as scientists, engineers, and enthusiasts alike try to understand what complex black-box neural networks are really doing under the hood. xAI gives insight into the inner workings of deep learning models and allows humans to comprehend their outputs, building trust in the model’s decision-making process.
At Reality Defender, the vision team focuses on detecting cutting-edge AI-generated (GenAI) deepfake images (e.g. Midjourney1, Stable Diffusion2, DALL-E 33, StyleGAN-34) and getting to the core of which image artifacts are characteristic of the generative process. At the highest level of abstraction, our machine learning models are exposed to both real, authentic images and AI-generated images, and are trained to identify features and artifacts appearing only in deepfake AI-generated images. We observe that such artifacts can occur at any scale: they can globally affect the visual structure of an image (e.g. blurring along face boundaries) or appear as fine-scale, pixel-level patterns distinctive of particular GenAI methods. We curated a large and diverse dataset of images and trained an ensemble of Deep Learning models on this data.
While our detectors perform well, we realize that to improve and adapt to new threats, we need to understand what the neural network machinery models and why it gives certain outputs for given images. As deepfake creation technologies improve rapidly and their outputs become harder for humans to discern from authentic content, the need for xAI becomes more central and imperative. We need forensic insights into why our models make the decisions they make. In addition, customers using our platform need more than just a real-or-fake decision to build trust in and comprehension of the results; they need to know why an image was detected as fake and which region of the image caused the model to output this decision.
Methodology
xAI approaches can be model-agnostic, where the model is an unknown black box, or model-specific, where the model architecture is known a priori5. At Reality Defender, since we develop our detection models in-house, we have full access to them and naturally take a model-specific approach to xAI. To this end, xAI methods can be broadly grouped into the following categories6:
Gradient-based approaches look at the gradient flowing backward through the model, from the output decision to intermediate feature layers, to understand which neurons contribute the most to that decision. The higher the gradient, the more a neural pathway contributes to the final output score. Thus, given an input image, one can extract a heat map (also referred to as a saliency map) showing the regions that contribute the most to the final decision of the model. When we overlay this heat map onto the original input image, we can see where and by how much these regions influence the final decision (see Figure 3). A minimal input-gradient sketch appears after this list.
Perturbation-based approaches operate by changing the input or a feature map at an intermediate layer and observing the corresponding change in the output of the model. A large change in output implies that the perturbed features have a large influence on the model’s decision. This approach can be computationally expensive, as many perturbations may need to be evaluated (see the occlusion sketch after this list).
Contrastive methods aim to explain the behavior of the model by comparing and contrasting selected data samples and their corresponding model outputs. The aim is to observe and analyze the model’s behavior across different data samples. These methods can also be computationally expensive, as careful data selection is necessary.
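To make the gradient-based idea concrete, here is a minimal sketch of an input-gradient saliency map in PyTorch. The backbone, input tensor, and class index are placeholders for illustration, not our production detector.

```python
import torch
import torchvision.models as models

# Placeholder classifier and input; any differentiable image classifier works the same way.
model = models.resnet18(weights=None).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a preprocessed image
target_class = 1                                        # hypothetical "manipulated" class index

# Backpropagate the score of the class of interest down to the input pixels.
score = model(image)[0, target_class]
score.backward()

# The per-pixel gradient magnitude serves as a simple saliency (heat) map.
saliency = image.grad.abs().max(dim=1)[0]                               # (1, 224, 224)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```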
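And a minimal perturbation-based (occlusion) sketch under the same placeholder assumptions: each patch of the image is masked in turn, and the drop in the class score indicates how influential that region is.

```python
import torch
import torchvision.models as models

# Same placeholder model and class index as above; the patch size is illustrative.
model = models.resnet18(weights=None).eval()
image = torch.rand(1, 3, 224, 224)
target_class = 1
patch = 32

with torch.no_grad():
    baseline = model(image)[0, target_class].item()
    heatmap = torch.zeros(224 // patch, 224 // patch)
    for i in range(0, 224, patch):
        for j in range(0, 224, patch):
            occluded = image.clone()
            occluded[:, :, i:i + patch, j:j + patch] = 0.0   # mask one patch
            drop = baseline - model(occluded)[0, target_class].item()
            heatmap[i // patch, j // patch] = drop           # large drop => influential region
```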
xAI for CNNs
We select gradient-based approaches, as we found them to be more robust and computationally inexpensive than the alternatives, and more general (easily configurable to multiple neural network architectures). Gradient-weighted Class Activation Mapping (GradCAM7) is among the most well-known gradient-based methods. In GradCAM, the gradients are propagated backwards through the network from the output category we are interested in (in this case the manipulated category) back to a specific layer of the model. Refer to Figure 1 for a visual illustration of GradCAM for a Convolutional Neural Network (CNN). The partial derivatives of a specific output category y^c with respect to the neuron activations of a feature map A^k (of size W x H) are computed. We refer to these W x H matrices of partial derivatives as gradient maps (blue in Figure 1). There is one gradient map for each of the k feature maps at the layer we are interested in. Equation 1.1 shows the global averaging of one gradient map; this yields k weights that are used to weight the corresponding activation feature maps A^k in the summation in Equation 1.2.
Figure 1: Visual illustration of GradCAM
The weighted sum is then passed through a ReLU function that removes any negative values from the result (Equation 1.2). This is a standard step in gradient-based methods, as we are only interested in features with a positive influence on the final output of the model.
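Reconstructed in standard notation (this is the original Grad-CAM formulation the text describes, where y^c is the score for class c and A^k is the k-th activation map of the chosen layer), Equations 1.1 and 1.2 are:

```latex
% Equation 1.1: global average of the gradient map gives one weight per feature map k
\alpha_k^{c} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \frac{\partial y^{c}}{\partial A_{ij}^{k}}

% Equation 1.2: ReLU of the weighted sum of activation maps gives the heat map
L^{c}_{\mathrm{GradCAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} \, A^{k} \right)
```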
The result of the GradCAM computation is a heatmap showing the regions that influence the decision of the model the most, similar to Figure 1 and Figure 3. A plethora of methods extend GradCAM and improve upon it. These derivative approaches improve how saliency is distributed across the image for cases where there are multiple objects or regions of interest. This case is relevant to our deepfake detection task, where forensic fingerprints of the generative method are distributed throughout the entire image. These extensions typically remove the global averaging, which can drown out fine-grained details in the gradient maps, and instead compute pixel-wise weightings. Other approaches8 argue that the gradient is intrinsically noisy and propose using an Eigen decomposition to reduce the variance in the gradient signal.
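Putting Equations 1.1 and 1.2 together, here is a minimal GradCAM sketch using PyTorch hooks. The backbone, target layer, and class index are illustrative assumptions, not our production detector.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Assumed backbone and target layer; in practice these come from the detector's own architecture.
model = models.resnet18(weights=None).eval()
target_layer = model.layer4[-1]
activations, gradients = {}, {}

def capture(module, inputs, output):
    # Keep the layer's activation maps A^k and register a hook for their gradients.
    activations["a"] = output
    output.register_hook(lambda grad: gradients.update(g=grad))

target_layer.register_forward_hook(capture)

image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed input
target_class = 1                    # hypothetical "manipulated" class index

score = model(image)[0, target_class]
model.zero_grad()
score.backward()

# Equation 1.1: globally average each gradient map to obtain one weight per feature map k.
weights = gradients["g"].mean(dim=(2, 3), keepdim=True)              # (1, K, 1, 1)
# Equation 1.2: ReLU of the weighted sum of activation maps, upsampled for overlay.
cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))  # (1, 1, h, w)
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)             # normalized to [0, 1]
```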
Analysis
At Reality Defender, we tested the robustness and reliability of several of these GradCAM-based methods. We subjected images to synthetic transformations and observed how the methods performed; equivariance across the transformations is desired. If an image with a human face is scaled, rotated, translated, or flipped, the resulting heat map should transform in the same way. When performing these synthetic transformations, vacated regions are filled with black pixels in order to keep the original dimensions - see Figure 2. Some of the GradCAM-based methods produced heat maps that highlighted regions outside the image content, in the black border areas, while others were less prone to this kind of instability. Some methods produced degenerate heat maps, as shown in Figure 4, where the corners of the image are highlighted. The pixel-wise GradCAM variants, which avoid the global averaging of the gradient in Equation 1.1 and instead multiply the activation and gradient maps element-wise, were less prone to such instability and better able to highlight multiple regions that influence the model’s output equally. In addition, we searched for a GradCAM-based method that is more in line with human intuition. In other words, given a face-swap “cheapfake” where artifacts are quite obvious, would the method highlight the regions that we as humans see as clear artifacts of a deepfake? For example, Figure 3 shows clear artifacts in a Reface9 face-swap deepfake image (red highlights). In face-swap deepfakes, a human face from a source image is blended into an image containing a target human face. In Figure 3, the glasses from the source face are clearly blended into the target face: on the subject’s left temple there is visible misalignment and blurring left over from the face-swap process.
Figure 2: Synthetic shifting of an image to test the robustness of GradCAM-based methods.
Figure 3: Reality Defender’s heatmap example on a Reface9 face-swap deepfake
Figure 4: Degenerate example of GradCAM when run on a model trained to detect diffusion deepfake images. The corners of the image are highlighted (Jet colormap). The image is from the Microsoft Celeb dataset.
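As an illustration of the equivariance test described above, here is a sketch of a shift-equivariance check. The shift amount and the compute_cam function (any GradCAM-style method returning a heatmap) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def shift_with_black_border(img: torch.Tensor, dx: int, dy: int) -> torch.Tensor:
    """Translate an (N, C, H, W) tensor, filling vacated regions with black pixels."""
    out = torch.zeros_like(img)
    h, w = img.shape[-2:]
    out[..., max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[..., max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
    return out

def equivariance_error(compute_cam, img: torch.Tensor, dx: int = 32, dy: int = 0) -> float:
    """compute_cam is any heatmap function (e.g. a GradCAM variant) returning (N, 1, H, W)."""
    cam_original = compute_cam(img)
    cam_shifted = compute_cam(shift_with_black_border(img, dx, dy))
    # For an equivariant method, shifting the original heatmap should match the
    # heatmap computed on the shifted image.
    expected = shift_with_black_border(cam_original, dx, dy)
    return F.l1_loss(cam_shifted, expected).item()
```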
Figure 5 shows the heatmaps presented on our platform for a diffusion deepfake image generated using Meta’s Emu image generator10. It is, in many cases, hard for a human to tell whether a diffusion-generated image is authentic or not. One of our detectors picks up on frequency patterns in the eye and nose regions (highlighted in red in Figure 5). If one looks very closely at the image, there is a lack of symmetry between the eyes. Yet in this case, we believe the detector is responding to frequency patterns that are peculiar to diffusion-generated images.
Discussion
We note here that, while the outputs of models may sometimes deviate from human intuition, this is not necessarily suboptimal, since the models can detect subtle artifacts that are not visible to the naked human eye. For this reason, our xAI method may highlight regions of an image that, to a human observer, do not exhibit any extraordinary characteristics. Conveying this to users is a challenge, as Deep Learning models may or may not mimic human perception, and we as humans search for meaning within the explanations returned by the xAI approaches.
Figure 5: Example showing our xAI heatmap for a diffusion deepfake generated by Meta’s Emu generator10
xAI for Vision Transformers
The above approaches work on Convolutional Neural Networks (CNNs) and can be leveraged for Vision Transformers (ViTs) as well. Yet for ViT architectures, one can also leverage the built-in self-attention mechanisms, which naturally lend themselves to xAI. The self-attention mechanism captures spatial saliency within the network, which is the essence of what we are after in xAI for Computer Vision. A caveat here is that Vision Transformers have sometimes been found to use tokens as information storage to summarize information from other tokens, which results in meaningless xAI heatmaps with bright spots in random locations across the image11.
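One common way to turn self-attention into a heatmap is attention rollout (composing the attention matrices across layers); this is a sketch of the idea, not necessarily the method we deploy. How the per-layer attention matrices are collected (e.g. via forward hooks) is model-specific and assumed here.

```python
import torch

def attention_rollout(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """
    Aggregate per-layer ViT self-attention into a single spatial saliency map.
    attn_maps: per-layer (num_tokens, num_tokens) attention matrices, averaged over
    heads, with token 0 assumed to be the [CLS] token.
    """
    num_tokens = attn_maps[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attn_maps:
        # Account for the residual connection, then renormalize before composing layers.
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout
    # The [CLS] row's attention over patch tokens reshapes into a patch-grid heatmap.
    cls_to_patches = rollout[0, 1:]
    side = int(cls_to_patches.numel() ** 0.5)
    return cls_to_patches.reshape(side, side)
```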
At Reality Defender, we are actively working on improving our understanding of deepfakes/GenAI and providing human-interpretable explanations for the outputs of our deepfake detector models. xAI is a crucial tool in our research into deepfake detection, and it provides our customers with an intuitive, visual means of understanding what our models are doing.
1 www.midjourney.com
2 High-Resolution Image Synthesis with Latent Diffusion Models, R Rombach et al., arXiv preprint arXiv:2112.10752 (2021).
3 Improving image generation with better captions, J Betker et al., https://cdn.openai.com/papers/dall-e-3.pdf (2023)
4 Alias-free generative adversarial networks, T Karras et al., Advances in Neural Information Processing Systems 34 (2021)
5 Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey, A Das et al., arXiv preprint arXiv:2006.11371 (2020).
6 Attribution-based XAI methods in computer vision: A review, A Kumar et al., arXiv preprint arXiv:2211.14736 (2022).
7 Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, RR Selvaraju et al. 2017 IEEE International Conference on Computer Vision (ICCV).
8 Eigen-CAM: Class Activation Map using Principal Components, MB Muhammad et al. 2020 IEEE International Joint Conference on Neural Networks (IJCNN)
9 www.reface.ai
10 Emu: Enhancing image generation models using photogenic needles in a haystack, X Dai et al., arXiv preprint arXiv:2309.15807 (2023).
11 Vision Transformers Need Registers, T Darcet et al., ICLR 2024.