Published in the scientific journal «Студенческий», No. 6(344)
Journal section: Information Technologies
ADVERSARIAL ATTACKS ON AI SYSTEMS AND TECHNIQUES FOR ENHANCING MODEL ROBUSTNESS
ABSTRACT
This study explores the susceptibility of deep neural networks (DNNs) to adversarial manipulations. It outlines key techniques for crafting adversarial inputs, including FGSM and PGD, and reviews contemporary defense strategies such as defensive distillation, ensemble-based adversarial training, and selective pruning of neural connections. Special emphasis is placed on the balance between maintaining high predictive performance on clean inputs and ensuring resilience against adversarial perturbations.
Keywords: deep learning; adversarial manipulation; model robustness; defensive distillation; adversarial training; neural architectures.
Introduction
Deep learning (DL) has demonstrated outstanding performance across various domains, including image recognition, semantic segmentation, and object detection. Modern deep neural networks are trained on large-scale datasets and achieve impressive predictive accuracy in real-world environments, from self-driving vehicles to Earth observation systems.
Despite these achievements, the integration of DNNs into safety-sensitive applications has exposed a critical weakness: their vulnerability to adversarially crafted inputs.
Adversarial inputs are carefully modified samples containing slight, often imperceptible perturbations that cause a neural network to misclassify data with high confidence. The impact of such manipulations can be severe, ranging from incorrect interpretation of road signs by autonomous vehicles to successful evasion of biometric security systems.
1. Types of Adversarial Attacks
The adversary’s objective is to determine a minimal perturbation $\delta$ added to a clean input $x$ such that

$$f(x + \delta) = y_t, \qquad (1)$$

where $y_t$ represents a target label that differs from the correct class.
Attack strategies can be categorized based on the attacker’s access to the model:
- White-box attacks: The adversary possesses complete knowledge of the network structure, parameters, and gradient information.
- Black-box attacks: The adversary has no internal access and can only obtain output predictions through queries. These attacks frequently exploit the transferability phenomenon, where adversarial examples generated for one model remain effective against other models.
Common Adversarial Generation Techniques
- FGSM (Fast Gradient Sign Method): A one-step method that perturbs input data in the direction determined by the sign of the loss gradient.
- PGD (Projected Gradient Descent): A multi-step extension of FGSM widely regarded as a strong first-order adversary.
- R+FGSM: A modified variant that introduces an initial random perturbation before applying gradient-based optimization to avoid highly curved regions of the loss landscape.
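To make the update rules concrete, the FGSM and PGD steps can be sketched on a minimal binary logistic-regression model, where the input gradient has a closed form. The weights, bias, and input point below are hypothetical toy values chosen for illustration, not taken from any referenced experiment:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_grad(x, y, w, b):
    """Gradient of the logistic cross-entropy loss w.r.t. the input x."""
    return (sigmoid(np.dot(w, x) + b) - y) * w

def fgsm(x, y, w, b, eps):
    """One-step FGSM: move eps along the sign of the loss gradient."""
    return x + eps * np.sign(input_grad(x, y, w, b))

def pgd(x, y, w, b, eps, step, n_iter):
    """Multi-step PGD: repeated signed steps, each projected back onto
    the L-infinity ball of radius eps around the clean input."""
    x_adv = x.copy()
    for _ in range(n_iter):
        x_adv = x_adv + step * np.sign(input_grad(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Hypothetical linear model and a correctly classified point (true label 1)
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 0.5]), 1.0

x_fgsm = fgsm(x, y, w, b, eps=0.9)
x_pgd = pgd(x, y, w, b, eps=0.9, step=0.3, n_iter=10)
```

On this toy model both attacks flip the prediction while keeping the perturbation bounded in the L-infinity norm, which is exactly the behavior the constraint in (1) formalizes.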
2. Defensive Distillation as a Robustness Enhancement Strategy
Among the proposed countermeasures, defensive distillation has received considerable attention. Unlike traditional knowledge distillation used to compress models, this technique aims to improve robustness by reducing the sensitivity of gradients within the network.
The approach involves training a secondary (“distilled”) model using softened probability distributions generated by a previously trained model. These soft labels are obtained by applying a higher temperature parameter $T$ within the softmax layer. Elevated temperature values encourage the model to capture inter-class relationships and prevent excessively sharp decision boundaries.
Empirical evaluations conducted on MNIST and CIFAR-10 datasets indicate that this strategy can significantly decrease the effectiveness of adversarial attacks, lowering their success rate from approximately 95% to below 0.5%.
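The temperature-softened softmax at the core of this procedure can be sketched as follows; the teacher logits are hypothetical illustrative numbers:

```python
import numpy as np

def softmax_t(logits, t=1.0):
    """Softmax with temperature t; larger t yields a softer distribution."""
    z = logits / t
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([8.0, 2.0, 1.0])   # hypothetical teacher outputs
hard = softmax_t(teacher_logits, t=1.0)      # near one-hot distribution
soft = softmax_t(teacher_logits, t=20.0)     # soft labels used for distillation
```

At `t=1` the dominant class absorbs almost all probability mass; at a high temperature the secondary classes retain visible mass, which is what exposes inter-class structure to the distilled student.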
3. Ensemble-Based Adversarial Training
Conventional adversarial training, which augments training data with adversarial samples, may lead to gradient masking. In such cases, the model artificially distorts the local geometry of the loss surface, misleading gradient-based attacks without achieving genuine robustness against transferred adversarial inputs.
To mitigate this limitation, ensemble adversarial training was introduced. This method incorporates adversarial examples generated from multiple pre-trained models into the training process. By decoupling attack generation from the target model’s parameters, the approach enhances generalization and strengthens resistance against diverse black-box attacks.
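The decoupling idea can be sketched with a toy logistic-regression target and static weight vectors standing in for independent pre-trained source models; all data and weights below are synthetic assumptions, not the paper’s setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, y, w, eps):
    """Batch FGSM against a fixed linear model with weights w."""
    grad = (sigmoid(X @ w) - y)[:, None] * w    # dLoss/dx for each sample
    return X + eps * np.sign(grad)

# Toy data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Hypothetical static "pre-trained" source models: adversarial example
# generation is decoupled from the target model's own parameters
source_models = [np.array([1.0, 0.2]), np.array([0.8, -0.1])]
X_aug = np.vstack([X] + [fgsm_batch(X, y, ws, eps=0.5) for ws in source_models])
y_aug = np.concatenate([y] * (1 + len(source_models)))

# Train the target model on clean plus transferred adversarial samples
w = np.zeros(2)
for _ in range(300):
    w -= 0.1 * X_aug.T @ (sigmoid(X_aug @ w) - y_aug) / len(y_aug)

clean_acc = float(((sigmoid(X @ w) > 0.5) == y).mean())
```

Because the perturbations are crafted against the frozen source models, the target never sees attacks tuned to its own current gradients, which is the mechanism that avoids the gradient-masking failure mode described above.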
4. Advanced Approaches: Few2Decide and SaliencyMix
Recent studies have introduced novel techniques aimed at modifying the internal decision-making dynamics of neural architectures.
Few2Decide: Empirical analysis indicates that adversarial perturbations disproportionately influence specific connections within fully connected layers, while other connections remain comparatively stable. The Few2Decide framework suggests deactivating vulnerable connections and relying exclusively on a stable subset (roughly one-third of the total) during the final prediction stage. This strategy preserves strong performance on unperturbed data while substantially enhancing robustness under white-box attack conditions.
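The connection-selection idea can be sketched as masking a final fully connected layer. The stability scores below are random placeholders; the actual method derives them from how the network behaves under perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical final fully connected layer: 10 classes x 64 features
W = rng.normal(size=(10, 64))

# Hypothetical per-connection stability scores (placeholder values; the
# real criterion measures how little each connection shifts under attack)
stability = rng.random(W.shape)

# Keep only the most stable third of connections, zero out the rest
k = W.size // 3
threshold = np.sort(stability.ravel())[-k]
mask = (stability >= threshold).astype(W.dtype)

features = rng.normal(size=64)          # hypothetical penultimate features
logits = (W * mask) @ features          # prediction uses the stable subset only
```

Zeroing the masked weights removes the vulnerable connections from the final decision while leaving the layer’s shape, and thus the rest of the network, unchanged.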
Explainable AI (XAI) and SaliencyMix: In remote sensing applications, SaliencyMix-based data augmentation has demonstrated promising results by combining images according to saliency-based importance regions. Integrating this augmentation with a training objective that minimizes discrepancies in model interpretation between clean and adversarial inputs enables neural networks to retain stable explanatory patterns (such as Class Activation Maps, CAM) even when exposed to adversarial interference.
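A toy sketch of saliency-guided mixing is shown below, with a crude intensity-deviation proxy standing in for the dedicated saliency detector used by the published method; patch size and images are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def saliency_mix(src, dst, patch=8):
    """Simplified SaliencyMix sketch: paste the most salient patch of src
    into dst. Saliency is approximated here by deviation from the mean
    intensity; the original method uses a real saliency detector."""
    sal = np.abs(src - src.mean())
    i, j = np.unravel_index(int(sal.argmax()), sal.shape)
    h, w = src.shape
    # center an aligned patch on the saliency peak, clipped to the image
    i0 = min(max(i - patch // 2, 0), h - patch)
    j0 = min(max(j - patch // 2, 0), w - patch)
    mixed = dst.copy()
    mixed[i0:i0 + patch, j0:j0 + patch] = src[i0:i0 + patch, j0:j0 + patch]
    lam = 1.0 - (patch * patch) / (h * w)   # weight of dst's label in the mix
    return mixed, lam

src, dst = rng.random((32, 32)), rng.random((32, 32))
mixed, lam = saliency_mix(src, dst)
```

The mixing ratio `lam` weights the two labels in proportion to the pasted area, so the training target stays consistent with how much of each image survives in the augmented sample.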
Conclusion
Adversarial threats continue to represent one of the most critical barriers to the secure deployment of artificial intelligence technologies. As demonstrated in this analysis, protective strategies including distillation and ensemble-based training markedly decrease model susceptibility to minor perturbations by suppressing gradient magnitudes, in some cases by several orders of magnitude.
However, a consistent trade-off persists between enhanced robustness and predictive accuracy on unaltered data. Future investigations should concentrate on designing generalized architectures capable of defending against both established and emerging adversarial techniques without compromising overall system performance.
References:
- Goodfellow I.J., Shlens J., Szegedy C. Explaining and Harnessing Adversarial Examples // International Conference on Learning Representations. 2015.
- Madry A., Makelov A., Schmidt L., Tsipras D., Vladu A. Towards Deep Learning Models Resistant to Adversarial Attacks // International Conference on Learning Representations. 2018.
- Carlini N., Wagner D. Towards Evaluating the Robustness of Neural Networks // IEEE Symposium on Security and Privacy. 2017. P. 39–57.
- Papernot N., McDaniel P., Wu X., Jha S., Swami A. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks // IEEE Symposium on Security and Privacy. 2016. P. 582–597.
- Moosavi-Dezfooli S.M., Fawzi A., Frossard P. Universal Adversarial Perturbations // IEEE Conference on Computer Vision and Pattern Recognition. 2017. P. 1765–1773.
- Tramèr F., Kurakin A., Papernot N. et al. Ensemble Adversarial Training: Attacks and Defenses // International Conference on Learning Representations. 2018.

