Approximating Black-Box Deep Neural Networks using Active Learning as a Proxy Measurement for Robustness

Machine learning models are part of our everyday lives, whether in software that controls self-driving cars or in medical imaging techniques such as magnetic resonance imaging (MRI). In terms of visibility, these models fall into two main categories. The first is the "white-box" model, whose internal workings are accessible to the public. The second is the "black-box" model, which the public can access only through its limited interfaces. The general expectation is that these models behave reliably in all situations, regardless of the input they are given. In other words, these models should behave robustly. However, that is often not the case for inputs that have been altered using adversarial crafting techniques. Furthermore, verifying the robustness of black-box models is an especially hard problem because their internal workings are not accessible. This research concentrates on approximating black-box machine learning models using five different active learning techniques. The approximated model may not share the same architecture as the original model, but active learning is used to ensure that its behavior is similar enough to make assumptions about the robustness of the original black-box model. The experiments include five different deep neural networks of various complexities that have been trained on three different datasets, namely MNIST, CIFAR-10, and GTSRB. The VGG16 architecture has been used as the main architecture to approximate the other network architectures with five different active learning strategies: random, uncertainty, K-center, DFAL, and a combination of the DFAL and K-center strategies. The approximation of the different network architectures has been evaluated using two different metrics.
The first is the overall agreement on a hold-out test set, while the second is a transferability score that evaluates whether adversarial crafting techniques based on the FGSM attack are also applicable to the original black-box model. This research shows that it is possible to approximate black-box deep neural networks efficiently by using a combination of the DFAL and K-center strategies, but only if the original black-box model has been trained to a high accuracy.
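The two evaluation metrics can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the linear softmax classifiers stand in for the deep networks, the function names (`agreement`, `fgsm`, `transferability`) are hypothetical, and the exact definition of the transferability score (here, the fraction of adversarial inputs that fool the substitute and also fool the black-box model) is an assumption, since the abstract does not spell it out.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict(W, b, x):
    # Toy linear classifier standing in for a deep network (assumption).
    return (x @ W + b).argmax(axis=1)

def agreement(labels_a, labels_b):
    # Fraction of hold-out samples on which both models give the same label.
    return float(np.mean(labels_a == labels_b))

def fgsm(W, b, x, y, eps):
    # FGSM step: x + eps * sign(gradient of the cross-entropy loss w.r.t. x).
    # For a linear softmax model this gradient is (p - onehot(y)) @ W.T.
    p = softmax(x @ W + b)
    p[np.arange(len(y)), y] -= 1.0
    grad = p @ W.T
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def transferability(substitute, black_box, x, eps):
    # Assumed definition: among adversarial inputs that change the
    # substitute's prediction, the share that also changes the
    # black-box model's prediction.
    y_bb = predict(*black_box, x)  # black-box labels act as ground truth
    x_adv = fgsm(*substitute, x, y_bb, eps)
    fooled_sub = predict(*substitute, x_adv) != y_bb
    if not fooled_sub.any():
        return 0.0
    fooled_bb = predict(*black_box, x_adv) != y_bb
    return float(np.mean(fooled_bb[fooled_sub]))

# Usage with random toy models and inputs scaled to [0, 1].
rng = np.random.default_rng(0)
black_box = (rng.normal(size=(5, 3)), rng.normal(size=3))
substitute = (black_box[0] + 0.1 * rng.normal(size=(5, 3)), black_box[1])
x_test = rng.uniform(size=(200, 5))

print(agreement(predict(*substitute, x_test), predict(*black_box, x_test)))
print(transferability(substitute, black_box, x_test, eps=0.1))
```

Only the black-box model's predicted labels are used here, which mirrors the setting in which its internal workings are not accessible; the gradient for the attack comes solely from the substitute.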
Faculteit der Sociale Wetenschappen