<h2>LAMP: Extracting text from gradients with language model priors (2022-11-28)</h2>
<h3 id="data-privacy-and-federated-learning">Data privacy and federated learning</h3>
<p>Machine learning algorithms have set the state of the art on most tasks where large amounts of training data are available. While the improvements brought by these algorithms are impressive, their application to settings involving private data remains limited due to the privacy concerns posed by the large centralized datasets required for training. Recently, the federated learning framework has emerged as a promising alternative to collecting data and training on it centrally.
In federated learning, multiple data owners (clients) collaboratively optimize a loss function $l$ w.r.t. the parameters $\theta$ of a global model $h$ on their own dataset $\mathcal{D}_i$ <strong>without</strong> sharing the data in $\mathcal{D}_i$ with the other participants:</p>
\[\begin{equation*}
\min_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x, y) \sim \mathcal{D}_i} \left[ l(h_\theta(x), y) \right].
\end{equation*}\]
<p>To this end, the optimization is carried out in communication rounds. In particular, given the global parameters $\theta_t$ at round $t$, multiple clients compute model updates $g$ on their own data and then share them with a central server, which aggregates them and produces the new global parameters $\theta_{t+1}$. After several communication rounds the model parameters converge to an optimum.
<img src="/assets/blog/lamp/server.svg" class="blogpost-img100" style="height: 500pt;margin: -50pt 0pt -50pt 0pt;" />
One common implementation of this generic framework is the FedSGD algorithm, where updates are given by the gradient of $l$ w.r.t. $\theta_t$ on a single batch $\{(x^b_i, y^b_i)\}\sim \mathcal{D}_i$ of client data of size $B$:</p>
\[\begin{equation*}
g(\theta_t,\mathcal{D}_i) = \frac{1}{B} \sum_{b=1}^B \nabla_\theta \left[ l(h_{\theta_t}(x^b_i), y^b_i) \right].
\end{equation*}\]
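<p>To make the setup concrete, here is a minimal FedSGD round for a toy linear regression model; the model, data, and learning rate below are our own illustrative choices, not from the post:</p>

```python
import numpy as np

def client_update(theta, X, y):
    """FedSGD client update: the average gradient of the squared loss
    l(h_theta(x), y) = 0.5 * (theta^T x - y)^2 over one batch of size B."""
    residual = X @ theta - y          # shape (B,)
    return X.T @ residual / len(y)    # (1/B) * sum_b grad_theta l

def server_round(theta, clients, lr=0.1):
    """One communication round: aggregate client updates, take a global step."""
    grads = [client_update(theta, X, y) for X, y in clients]
    return theta - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
clients = [(X, X @ true_theta)
           for X in (rng.uniform(0, 1, size=(8, 2)) for _ in range(3))]

theta = np.zeros(2)
for _ in range(50):                   # communication rounds
    theta = server_round(theta, clients)
```

<p>The global loss decreases over rounds even though no client ever shares its raw data, only the updates $g$, and it is exactly these updates that the attacks discussed next exploit.</p>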
<p>In theory, federated learning improves data privacy, as client data never leaves the individual clients. Unfortunately, several recent works have shown that the updates $g$ computed by common federated algorithms such as FedSGD can be used by a malicious server during the aggregation phase to approximately reconstruct the clients’ data. So far, work has focused on exposing this issue in the image domain, where strong image priors help the reconstruction. In this work, we show that such approximate reconstruction is also possible in the text domain, where federated learning is very commonly applied.</p>
<h3 id="gradient-leakage">Gradient leakage</h3>
<p>In order to obtain approximate input reconstructions $\{\tilde{x}^b_i\}$ from the FedSGD update of some client $i$, with updates as described above, prior works typically solve the following optimization problem at some communication round $t$:</p>
\[\begin{equation}
\min_{\{\tilde{x}^b_i\}} \mathcal{L}_{rec}\left( \frac{1}{B} \sum_{b=1}^B \nabla_\theta\, l(h_{\theta_t}(\tilde{x}^b_i), y^b_i), \; g(\theta_t, \mathcal{D}_i) \right) + \alpha_{rec}\,R(\{\tilde{x}^b_i\}),
\end{equation}\]
<p>where $\mathcal{L}$<sub>rec</sub> is a distance metric (e.g., $L_1$, $L_2$, or cosine distance) that measures the gradient reconstruction error, $R(\{\tilde{x}^b_i\})$ is a domain-specific prior (e.g., Total Variation (TV) in the image domain) that assesses the quality of the reconstructed inputs, and $\alpha_{rec}$ is a hyperparameter balancing the two. Note that $\theta_t$ and $g(\theta_t, \mathcal{D}_i)$ are known to the malicious server, as the former is computed by it and the latter is sent to it by client $i$ at the end of the round. Often the batch labels $\{y^b_i\}$ can be obtained by the server using dedicated label reconstruction attacks, which are beyond the scope of this blog post, or simply guessed by running the reconstruction with all possible labels, which is feasible due to their discrete nature. Throughout the post we therefore focus only on reconstructing $\{\tilde{x}^b_i\}$. In <a href="https://www.sri.inf.ethz.ch/blog/bayesian">our previous blog post</a>, we have shown that solving the optimization problem above is equivalent to finding the Bayesian optimal adversary in this setting.</p>
<p>In the image domain, the optimization problem is typically solved using gradient descent on a batch of randomly initialized images $\{\tilde{x}^b_i\}$ with an image-specific prior $R$. In the next section, we first discuss why such a solution is not well suited to language data, and we then present our method, <strong>LAMP</strong>, which combines a text-specific prior with a new way of solving the optimization problem above, alternating discrete and continuous optimization steps, to obtain our state-of-the-art gradient leakage framework for text.</p>
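<p>For intuition, the gradient-matching objective can be written out for a toy linear model with batch size 1 and no prior $R$. This is a simplified sketch, not the attack implementation, and all concrete values are invented:</p>

```python
import numpy as np

theta = np.array([1.0, 0.0])              # model parameters, known to the server
x_true, y = np.array([0.5, 0.3]), 0.0     # the client's private input and label

def model_grad(x):
    """Gradient w.r.t. theta of l = 0.5 * (theta^T x - y)^2."""
    return (theta @ x - y) * x

g = model_grad(x_true)                    # the update the client shares

def rec_loss(x):
    """Gradient reconstruction error L_rec, here the squared L2 distance."""
    return float(np.sum((model_grad(x) - g) ** 2))

# The server reconstructs the input by gradient descent on rec_loss;
# grad_x is the analytic gradient of rec_loss w.r.t. the dummy input.
x_rec = np.array([0.1, 0.1])
for _ in range(500):
    r = model_grad(x_rec) - g
    grad_x = 2 * ((theta @ x_rec - y) * r + (r @ x_rec) * theta)
    x_rec = x_rec - 0.1 * grad_x
```

<p>For a real transformer $h_\theta$ this objective is far less well-behaved, which is what motivates the discrete steps and priors introduced next.</p>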
<h3 id="lamp-gradient-leakage-for-text">LAMP: Gradient leakage for text</h3>
<p>In this work, we focus our attention on transformer-based models $h_\theta$, as they are the state of the art for modeling text across various language tasks. As these models operate on continuous vectors, they typically assume a fixed vocabulary of size $V$ and embed each word into a different vector in $\mathbb{R}^d$. For a word sequence of length $n$, we refer to the individual words as $t_1,\ldots,t_n$ and to their corresponding embeddings as $x_1,\ldots,x_n$.</p>
<p>In order to solve the gradient leakage optimization problem from the previous section, we choose to optimize directly over the embeddings $x_i$, as they, similarly to images, are continuous values we can optimize over. However, uniquely to the text domain, only a finite subset of vectors in $\mathbb{R}^d$ are valid word embeddings. Therefore, once we obtain the reconstructed embeddings $\tilde{x}_i$, we select for each of them the vocabulary token with the highest cosine similarity, creating a reconstruction of the word sequence $\tilde{t}_1,\ldots,\tilde{t}_n$.</p>
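<p>The projection step amounts to a nearest-neighbor lookup under cosine similarity; a sketch with a made-up three-token vocabulary:</p>

```python
import numpy as np

# Made-up vocabulary of V=3 token embeddings in R^2.
vocab = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.7, 0.7]])

def project_to_tokens(x_rec):
    """Map each reconstructed embedding to the vocabulary token
    with the highest cosine similarity."""
    v = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    x = x_rec / np.linalg.norm(x_rec, axis=1, keepdims=True)
    return np.argmax(x @ v.T, axis=1)

tokens = project_to_tokens(np.array([[0.9, 0.1], [0.1, 0.8]]))  # -> [0, 1]
```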
<p>An additional issue specific to the text domain and, in particular, the transformer architecture is that the transformer outputs depend on word order only through positional embeddings. Therefore, the gradient reconstruction loss $\mathcal{L}$<sub>rec</sub> is not as affected by a wrongly reconstructed word order as it is by wrongly reconstructed word embeddings themselves. In practice, this results in the continuous optimization process often getting stuck in local minima caused by an embedding that reconstructs the correct word at the wrong position. Such local minima are hard to escape via continuous optimization alone. To this end, we introduce a discrete optimization step that periodically reorders the sentence, allowing the optimization to escape them. The discrete step first proposes several word order changes, such as swapping the positions of two words or moving a sentence prefix to the end of the sentence. The proposed changes are then assessed based on a combination of the gradient reconstruction loss $\mathcal{L}$<sub>rec</sub> and the perplexity $\mathcal{L}$<sub>lm</sub> of the sentence, computed by an auxiliary language model such as GPT-2 on the projected words $\tilde{t}_i$:</p>
\[\begin{equation}
\mathcal{L}_{rec}(\{\tilde{x}_i\}) + \alpha_{lm}\,\mathcal{L}_{lm}(\{\tilde{t}_i\}),
\end{equation}\]
<p>where $\alpha_{lm}$ is a hyperparameter balancing the two parts. The resulting end-to-end alternating optimization is demonstrated in the image below:
<img src="/assets/blog/lamp/lamp.svg" class="blogpost-img100" style="height: 500pt;margin: -100pt 0pt -100pt 0pt;" />
where green boxes show the discrete optimization steps and the blue boxes demonstrate the continuous gradient descent optimization steps of the gradient leakage objective presented in the previous section.</p>
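<p>Schematically, the discrete step enumerates candidate reorderings and keeps the best-scoring one. The scoring function below is a trivial stand-in for the combined $\mathcal{L}_{rec} + \alpha_{lm}\,\mathcal{L}_{lm}$ objective:</p>

```python
def propose_orders(order):
    """Candidate reorderings: all pairwise swaps and all prefix moves."""
    n, cands = len(order), []
    for i in range(n):
        for j in range(i + 1, n):
            c = list(order)
            c[i], c[j] = c[j], c[i]
            cands.append(c)
    for k in range(1, n):              # move the length-k prefix to the end
        cands.append(list(order[k:]) + list(order[:k]))
    return cands

def discrete_step(order, score):
    """Keep the best candidate (lower score is better), including the
    current order itself, so the step never makes things worse."""
    return min(propose_orders(order) + [list(order)], key=score)

# Stand-in score: how far each token sits from its true position.
score = lambda o: sum(abs(t - i) for i, t in enumerate(o))
new_order = discrete_step([1, 0, 2, 3], score)  # -> [0, 1, 2, 3]
```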
<p>Finally, similarly to the image domain, we introduce a new text-specific prior that improves our reconstruction. We empirically observed that during optimization the embedding vectors $\tilde{x}_i$ often grow in length even when their direction does not change much. Motivated by this, we regularize the average length of the embeddings in a sequence to be close to the average embedding length $l_e$ in the vocabulary:</p>
\[\begin{equation}
R(\{\tilde{x}_i\}) = \left(\frac{1}{n}\sum_{i=1}^n \| \tilde{x}_i \|_2 - l_e\right)^2.
\end{equation}\]
<p>This allows our embeddings to remain in the correct range of values, which in turn results in a more stable and accurate reconstruction of the embeddings $\tilde{x}_i$.</p>
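<p>The prior itself is a one-liner over the batch of embeddings; a sketch with illustrative values:</p>

```python
import numpy as np

def length_prior(x_rec, l_e):
    """Penalize deviation of the mean embedding norm from the mean
    vocabulary embedding norm l_e."""
    mean_len = np.mean(np.linalg.norm(x_rec, axis=1))
    return float((mean_len - l_e) ** 2)

x_rec = np.array([[3.0, 4.0], [0.0, 6.0]])    # norms 5 and 6, mean 5.5
penalty = length_prior(x_rec, l_e=5.5)        # 0.0, already at the target
```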
<h3 id="experimental-evaluation">Experimental evaluation</h3>
<p>We evaluated LAMP on several standard sentiment classification datasets and architectures based on the BERT family of language models. As is typically the case with language models, we assume our models are pretrained to make word predictions on large text corpora and that federated learning is used only to fine-tune them on the classification task at hand. We consider two versions of LAMP: one where $\mathcal{L}$<sub>rec</sub> is a weighted sum of the L1 and L2 distances (denoted LAMP<sub>L1+L2</sub>), and one where $\mathcal{L}$<sub>rec</sub> is based on cosine similarity (denoted LAMP<sub>cos</sub>). We compare them to the state-of-the-art attacks TAG, based on the same L1+L2 distance, and DLG, based on the L2 distance alone. We evaluate the methods in terms of the ROUGE-1 metric (R1), which measures the percentage of correctly reconstructed words, and the ROUGE-2 metric (R2), which measures the percentage of correctly reconstructed bigrams. Note that R2 can be interpreted as a proxy for how well the word order of the sentence has been reconstructed. We present a subset of the results from our paper, on the CoLA dataset with batch size 1, below:
<br /><br /></p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> $\text{TinyBERT}_6$ </span> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> $\text{BERT}_{BASE}$ </span> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> $\text{BERT}_{LARGE}$ </span> </th>
</tr>
<tr>
<th>Method</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
</tr>
</thead>
<tbody>
<tr>
<td>DLG</td>
<td style="text-align: center">37.7</td>
<td style="text-align: center">3.0</td>
<td style="text-align: center">59.3</td>
<td style="text-align: center">7.7</td>
<td style="text-align: center">82.7</td>
<td style="text-align: center">10.5</td>
</tr>
<tr>
<td>TAG</td>
<td style="text-align: center">43.9</td>
<td style="text-align: center">3.8</td>
<td style="text-align: center">78.9</td>
<td style="text-align: center">10.2</td>
<td style="text-align: center">82.9</td>
<td style="text-align: center">14.6</td>
</tr>
<tr>
<td>$\text{LAMP}_{\cos}$</td>
<td style="text-align: center">93.9</td>
<td style="text-align: center"><strong>59.3</strong></td>
<td style="text-align: center"><strong>89.6</strong></td>
<td style="text-align: center"><strong>51.9</strong></td>
<td style="text-align: center"><strong>92.0</strong></td>
<td style="text-align: center"><strong>56.0</strong></td>
</tr>
<tr>
<td>$\text{LAMP}_{\text{L1}+\text{L2}}$</td>
<td style="text-align: center"><strong>94.5</strong></td>
<td style="text-align: center">52.1</td>
<td style="text-align: center">87.5</td>
<td style="text-align: center">47.5</td>
<td style="text-align: center">91.2</td>
<td style="text-align: center">47.8</td>
</tr>
</tbody>
</table>
<p><br /><br />
We see that LAMP<sub>cos</sub> consistently recovers more words than the alternatives, with LAMP<sub>L1+L2</sub> close behind. Further, LAMP recovers the sentence order substantially better. It is worth noting that the improvement over the baselines in both R1 and R2 is most pronounced on the smallest model, $\text{TinyBERT}_6$, where recovery is the hardest. We also experimented with recovering text when the batch size is bigger than 1. We are the first to present results in this setting, shown below for the CoLA dataset:
<br /><br /></p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> B=1 </span> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> B=2 </span> </th>
<th style="text-align: center;" colspan="2"> <span style="border-bottom:1px solid black"> B=4 </span> </th>
</tr>
<tr>
<th>Method</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
<th style="text-align: center">R1</th>
<th style="text-align: center">R2</th>
</tr>
</thead>
<tbody>
<tr>
<td>DLG</td>
<td style="text-align: center">59.3</td>
<td style="text-align: center">7.7</td>
<td style="text-align: center">49.7</td>
<td style="text-align: center">5.7</td>
<td style="text-align: center">37.6</td>
<td style="text-align: center">1.7</td>
</tr>
<tr>
<td>TAG</td>
<td style="text-align: center">78.9</td>
<td style="text-align: center">10.2</td>
<td style="text-align: center">68.8</td>
<td style="text-align: center">7.6</td>
<td style="text-align: center">56.2</td>
<td style="text-align: center">6.7</td>
</tr>
<tr>
<td>$\text{LAMP}_{\cos}$</td>
<td style="text-align: center"><strong>89.6</strong></td>
<td style="text-align: center"><strong>51.9</strong></td>
<td style="text-align: center">74.4</td>
<td style="text-align: center">29.5</td>
<td style="text-align: center">55.2</td>
<td style="text-align: center">14.5</td>
</tr>
<tr>
<td>$\text{LAMP}_{\text{L1}+\text{L2}}$</td>
<td style="text-align: center">87.5</td>
<td style="text-align: center">47.5</td>
<td style="text-align: center"><strong>78.0</strong></td>
<td style="text-align: center"><strong>31.4</strong></td>
<td style="text-align: center"><strong>66.2</strong></td>
<td style="text-align: center"><strong>21.8</strong></td>
</tr>
</tbody>
</table>
<p><br /><br />
We see that despite the lower reconstruction quality, even a batch size of 4 still leaks a substantial amount of data. Further, we observe that for bigger batch sizes LAMP<sub>L1+L2</sub> performs better than LAMP<sub>cos</sub>. Both LAMP variants, however, substantially improve upon the baselines.
Finally, we show an example sentence reconstruction from LAMP and TAG on multiple datasets below:
<img src="/assets/blog/lamp/text_rec.png" class="blogpost-img100" style="margin: 30pt 0pt 30pt 0pt;" />
Here, yellow signifies a single correctly reconstructed word and green signifies a tuple of correctly recovered words. We see that LAMP recovers the word order drastically better and is often even able to reconstruct it perfectly. In addition, LAMP recovers more individual words. This qualitatively confirms the effectiveness of our attack.</p>
<h3 id="summary">Summary</h3>
<p>In this blog post, we introduced LAMP, a new framework for leaking text data from gradient updates in federated learning. Our key ideas are the alternation of continuous and discrete optimization steps and the introduction of an auxiliary language model, used in the discrete part of our optimization to judge how well a piece of text is reconstructed. Thanks to these elements, our attack produces substantially better text reconstructions than state-of-the-art attacks, both quantitatively and qualitatively. We thus show that many practical federated learning systems operating on text are vulnerable, and that better mitigations should be developed.
For more details please see our <a href="https://files.sri.inf.ethz.ch/website/papers/balunovic2022lamp.pdf">NeurIPS 2022 paper</a>.</p>
<h2>Reliability guarantees on private data (2022-11-07)</h2>
<p>In this post we present our ACM CCS 2022 publication, <a href="https://www.sri.inf.ethz.ch/publications/jovanovic2022phoenix">Private and Reliable Neural Network Inference</a>, where we introduced Phoenix, a tool for NN inference that both protects client data privacy and enables important reliability guarantees.</p>
<p>We focus on the common <em>ML as a service</em> scenario, a two-party setting where a client offloads intensive computation (commonly NN inference) to the server.
The client data is of sensitive nature in many of the applications (e.g., financial, judicial), which motivated work in the field of <em>privacy-preserving NN inference</em>, aiming to build methods that perform the computation without the server learning the client data.
One of the most common techniques for this is <em>fully-homomorphic encryption</em> (FHE), which is rapidly becoming more practical.</p>
<p>Orthogonal to privacy, a long line of work focuses on enabling <em>NN inference with reliability guarantees</em>.
For example, in a loan prediction setting, augmenting predictions with <em>fairness</em> guarantees is in the interest of both parties, as it increases trust in the system and may be essential to ensure regulatory compliance.
Some of the latest works in this direction are <a href="https://www.sri.inf.ethz.ch/publications/peychev2022latent">LASSI</a> and <a href="https://www.sri.inf.ethz.ch/publications/jovanovic2022fare">FARE</a>, focusing on two aspects of the fairness problem.
Another common example is <em>robustness</em>, where for example, a medical image analysis system should be able to prove to clients that the diagnosis is robust to naturally-occurring measurement errors (see e.g., our latest work <a href="https://openreview.net/forum?id=7oFuxtJtUMH">SABR</a>).</p>
<p><img src="/assets/blog/phoenix/mlaas.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>ML as a service.</strong> Phoenix achieves both client data privacy (via FHE) and fairness/robustness guarantees.</em></p>
<p>While the problems of privacy-preserving and reliable inference are both well-established, there was no prior attempt to consolidate the work in these two fields.
Thus, service providers who offer reliability guarantees currently have no simple way to transition their service to a privacy-preserving setting, a requirement which is becoming increasingly relevant.
This is the problem we address in Phoenix, proposing a system that supports both: client data privacy and reliability guarantees.
To this end, we lift the key building blocks of <a href="https://arxiv.org/abs/1902.02918">randomized smoothing</a>, a technique for augmenting predictions with reliability guarantees, to the popular <a href="https://eprint.iacr.org/2016/421.pdf">RNS-CKKS</a> FHE scheme.
The key challenges that Phoenix overcomes stem from the missing native support for control flow and evaluation of non-polynomial functions in the FHE scheme.</p>
<p>We now recall randomized smoothing on a high level.</p>
<p><img src="/assets/blog/phoenix/smoothing.png" alt="" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Randomized smoothing.</strong> A high-level overview of the procedure in the non-private setting.</em></p>
<p>As the service provider, we receive an input $x$ (in the illustration above, a cat image), and we aim to return a prediction $y$ (for some classification task) augmented with a reliability guarantee, for a property such as robustness.
We duplicate the input $n$ times, add independently sampled Gaussian noise to each copy, and perform batched NN inference to obtain the logit vectors, i.e., unnormalized probabilities.
Next, we apply the Argmax function to transform logits to predictions, and aggregate those predictions across $n$ samples to get the <em>counts</em>, indicating how many times each output class was predicted.
Finally, we perform a statistical test on the counts, which, if successful, produces a probabilistic reliability guarantee, ensuring that the prediction $y$ is robust with known high probability.</p>
<p>The key question is how this procedure needs to change if we attempt to execute it while protecting client data privacy, i.e., if the data is encrypted using FHE by the client.
The key steps are illustrated below.</p>
<p><img src="/assets/blog/phoenix/phoenix.png" alt="" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Overview of Phoenix.</strong> The main challenges in lifting randomized smoothing to FHE.</em></p>
<p>For the batched NN inference (dashed line) we directly utilize prior work, which offers efficient algorithms for the RNS-CKKS scheme.
Further, the addition of noise is simple, as the noise can be directly added as a plaintext due to the homomorphic property.
However, computing Argmax is a key challenge due to the difficulty of computing non-polynomial functions—we elaborate on this shortly.
In the aggregation step we combine several methods from prior work with scheme-specific optimizations, and develop a novel heuristic for randomized smoothing, necessary to obtain a computationally feasible procedure.
Finally, we perform a rewrite of the one-sided binomial test applied to counts to make it FHE-suitable.
The output of Phoenix is a single ciphertext, which when decrypted with the secret key of the client, reveals both the prediction and the computed reliability guarantee.
We next discuss the Argmax approximation in more detail, and refer to our paper for details regarding all other steps.</p>
<p class="blogpost-wrap"><img src="/assets/blog/phoenix/argmax.png" alt="" class="blogpost-img20" />
<span>
To efficiently approximate Argmax, we use the recent paper of <a href="https://eprint.iacr.org/2019/1234">Cheon et al. (ASIACRYPT ‘20)</a>, which proposes <em>SgnHE</em>, a sign function approximation for FHE built as a composition of low-degree polynomials, illustrated below.
Our approximate Argmax is built on several applications of <em>SgnHE</em> (see the paper for the full algorithm).
Most importantly, in our case it is crucial to have guarantees on the approximation quality of <em>SgnHE</em>—otherwise, the randomized smoothing reliability guarantee returned to the clients may in some cases be invalid, fundamentally compromising the protocol.
</span></p>
<p><img src="/assets/blog/phoenix/sgn.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>Sign function approximation.</strong> Repeated applications of the polynomial $f_0$ increase approximation quality.</em></p>
<p>The <em>SgnHE</em> function is parametrized such that for desired parameters $a,b \in \mathbb{R}$, we can obtain an $(a,b)$-close approximation, meaning that for inputs $x \in [a, 1]$ the output is guaranteed to be in $[1 - 2^{-b}, 1]$ (similarly for the negative case).
However, as the server cannot directly observe intermediate values due to encryption, it is hard to ensure that the above precondition is satisfied for the logit values, which are the input to Argmax and to the first of the SgnHE applications it utilizes.</p>
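<p>The composition idea is easy to see in plaintext; a sketch using the basic degree-3 odd polynomial from Cheon et al. (the FHE version evaluates the same recurrence on ciphertexts):</p>

```python
def sgn_approx(x, iters=8):
    """Approximate sign(x) for x in [-1, 1] by repeatedly applying the
    odd polynomial f(x) = (3x - x^3) / 2, which contracts toward +/-1."""
    for _ in range(iters):
        x = (3 * x - x ** 3) / 2
    return x
```

<p>Each application pushes nonzero inputs closer to $\pm 1$ while keeping them in $[-1, 1]$; the $(a,b)$-closeness parameters discussed above correspond to choosing the polynomial degrees and the number of compositions.</p>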
<p>To overcome this we impose two <em>conditions</em> on the logit vectors, constraining the range and differences of their values, allowing us to appropriately rescale them and reason about the approximation quality.
As we cannot prove for an arbitrary NN that such conditions on the logits always hold (e.g., that they lie in some range), we use confidence intervals on a finite sample to upper-bound the probability of a condition violation.
When reporting the guarantee to the client, the computed violation probability (approximation error) is added to the usual error probability of randomized smoothing as a probabilistic procedure (algorithmic error).
The resulting value represents the total error probability of our guarantee, maintaining the behavior of the non-private case.</p>
<p>In our extensive experiments across multiple scenarios we observe values for the total error probability of around 1%, confirming that our procedure leads to viable high-probability guarantees.
We further observe non-restrictive latencies and communication costs, and high <em>consistency</em>, i.e., the results obtained with the FHE version of randomized smoothing are in almost 100% of the cases identical to those of the non-private baseline, confirming that transitioning a service to FHE using Phoenix does not sacrifice the key metrics.
Our Microsoft SEAL implementation is publicly available on <a href="https://github.com/eth-sri/phoenix">GitHub</a>.</p>
<p>We believe Phoenix is an important first step towards merging the worlds of reliable and privacy-preserving machine learning.
For more details of the Argmax approximation, omitted parts of the protocol, as well as detailed experimental results including microbenchmarks, please refer to our paper.</p>
<h2>Why tighter convex relaxations harm certified training (2022-10-27)</h2>
<p>This blog post summarizes the key findings of our paper <a href="https://www.sri.inf.ethz.ch/publications/jovanovic2022paradox">On the Paradox Of Certified Training</a>, which was recently published in TMLR.</p>
<p>We attempt to explain the phenomenon where most state-of-the-art methods for certified training based on convex relaxations (such as <a href="https://proceedings.neurips.cc/paper/2021/hash/988f9153ac4fd966ea302dd9ab9bae15-Abstract.html">FastIBP</a> or the latest breakthrough <a href="https://openreview.net/forum?id=7oFuxtJtUMH">SABR</a>) rely on the loose interval propagation (IBP/Box), even though, intuitively, tighter relaxations (i.e., those that more tightly overapproximate the non-linearities in the network) should lead to better results.
This was <a href="https://arxiv.org/abs/1810.12715">already</a> <a href="https://www.ijcai.org/proceedings/2019/854">observed</a> <a href="https://openreview.net/forum?id=Skxuk1rFwB">in</a> <a href="https://www.sri.inf.ethz.ch/publications/balunovic2020bridging">many</a> <a href="https://arxiv.org/abs/2104.00447">prior</a> <a href="https://openreview.net/forum?id=52weXyh2yh">works</a>, which proposed several hypotheses. However, the paradox was never investigated in a principled way.</p>
<p>We start by proposing a way to quantify tightness (see the paper for details), and thoroughly reproducing the paradox: Across a wide range of settings, tighter relaxations consistently lead to lower certified robustness (in %) than the loose IBP relaxation. An example is shown in the following table, grouping equivalent methods from prior work (below we will refer to each group using the name in bold):</p>
<style>
.good {
background-color: #aaffaa;
padding: 1px;
width: 40px;
display: inline-block;
}
.bad {
background-color: #ffaaaa;
padding: 1px;
width: 40px;
display: inline-block;
}
</style>
<table>
<thead>
<tr>
<th>Relaxation</th>
<th style="text-align: center">Tightness</th>
<th style="text-align: center">Certified (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>IBP</strong> / Box</td>
<td style="text-align: center"><span class="bad"> 0.73 </span></td>
<td style="text-align: center"><span class="good"> 86.8 </span></td>
</tr>
</tbody>
<tbody>
<tr>
<td><strong>hBox</strong> / Symbolic Intervals</td>
<td style="text-align: center"><span class="good"> 1.76 </span></td>
<td style="text-align: center"><span class="bad"> 83.7 </span></td>
</tr>
<tr>
<td><strong>CROWN</strong> / DeepPoly</td>
<td style="text-align: center"><span class="good"> 3.36 </span></td>
<td style="text-align: center"><span class="bad"> 70.2 </span></td>
</tr>
<tr>
<td><strong>DeepZ</strong> / CAP / FastLin / Neurify</td>
<td style="text-align: center"><span class="good"> 3.00 </span></td>
<td style="text-align: center"><span class="bad"> 69.8 </span></td>
</tr>
<tr>
<td><strong>CROWN-IBP (R)</strong></td>
<td style="text-align: center"><span class="good"> 2.15 </span></td>
<td style="text-align: center"><span class="bad"> 75.4 </span></td>
</tr>
</tbody>
</table>
<p>Our key observation is that there are other latent properties of relaxations, besides tightness, that affect success when relaxations are used in a training procedure.
More concretely, each of the tighter relaxations has either unfavorable <em>continuity</em> (i.e., the corresponding loss function is discontinuous with respect to network weights) or unfavorable <em>sensitivity</em> (i.e., the corresponding loss function is highly sensitive to small perturbations of network weights), both preventing successful optimization. By observing all three properties jointly, we can more easily interpret the seemingly counterintuitive results.</p>
<p>The plot below shows the relaxation of the ReLU non-linearity used by CROWN, for the example input range defined by $l=-5$ and $u=8$. By reducing $u$ (using the bottom slider), we can directly observe the discontinuity of CROWN, when its heuristic choice of the lower linear bound changes at $|l|=|u|$. Using the same plot we can observe the discontinuities of hBox at $l=0$.
These observations imply discontinuities in the loss when a relaxation is used in training, which we further empirically observe in real scenarios.
Finally, we can use the plot below to observe that no discontinuities can be found for IBP and DeepZ—a formal proof of their continuity in the general case is given in the paper.</p>
<p><a class="iframe-link" href="/assets/blog/paradox/continuity.html"> Open Interactive Plot</a></p>
<iframe class="iframe-full" src="/assets/blog/paradox/continuity.html" height="780px"></iframe>
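<p>For contrast with the plot, the IBP/Box relaxation itself is plain interval arithmetic, and both of its building blocks are continuous in the network weights by construction; a minimal sketch:</p>

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """Propagate the box [l, u] through x -> W @ x + b."""
    c, r = (l + u) / 2, (u - l) / 2      # center / radius form
    c2, r2 = W @ c + b, np.abs(W) @ r
    return c2 - r2, c2 + r2

def ibp_relu(l, u):
    """ReLU acts elementwise and monotonically on the box."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)

W = np.array([[1.0, -1.0], [2.0, 1.0]])
l, u = ibp_relu(*ibp_affine(np.zeros(2), np.ones(2), W, np.zeros(2)))
# l = [0, 0], u = [1, 3]
```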
<p>While the sensitivity of the loss functions is harder to illustrate on a toy example as above, our derivation (Section 4.3 of the paper) demonstrates that the bounds used by CROWN, CROWN-IBP (R) and DeepZ lead to certified training losses highly sensitive to small changes in network weights, while the losses of IBP and hBox are not sensitive and induce more favorable loss landscapes. With these observations, we expand the table shown earlier to include all three relaxation properties: tightness, continuity and sensitivity.
This illustrates that attempts to use tighter relaxations in certified training have introduced unfavorable properties of the loss, which resulted in the failure to outperform the continuous and non-sensitive IBP.</p>
<table>
<thead>
<tr>
<th>Relaxation</th>
<th style="text-align: center">Tightness</th>
<th style="text-align: center">Continuity</th>
<th style="text-align: center">Sensitivity</th>
<th style="text-align: center">Certified (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>IBP</strong> / Box</td>
<td style="text-align: center"><span class="bad"> 0.73 </span></td>
<td style="text-align: center"><span class="good"> $\checkmark$ </span></td>
<td style="text-align: center"><span class="good"> $\checkmark$ </span></td>
<td style="text-align: center"><span class="good"> 86.8 </span></td>
</tr>
</tbody>
<tbody>
<tr>
<td><strong>hBox</strong> / Symbolic Intervals</td>
<td style="text-align: center"><span class="good"> 1.76 </span></td>
<td style="text-align: center"><span class="bad"> $\times$ </span></td>
<td style="text-align: center"><span class="good"> $\checkmark$ </span></td>
<td style="text-align: center"><span class="bad"> 83.7 </span></td>
</tr>
<tr>
<td><strong>CROWN</strong> / DeepPoly</td>
<td style="text-align: center"><span class="good"> 3.36 </span></td>
<td style="text-align: center"><span class="bad"> $\times$ </span></td>
<td style="text-align: center"><span class="bad"> $\times$ </span></td>
<td style="text-align: center"><span class="bad"> 70.2 </span></td>
</tr>
<tr>
<td><strong>DeepZ</strong> / CAP / FastLin / Neurify</td>
<td style="text-align: center"><span class="good"> 3.00 </span></td>
<td style="text-align: center"><span class="good"> $\checkmark$ </span></td>
<td style="text-align: center"><span class="bad"> $\times$ </span></td>
<td style="text-align: center"><span class="bad"> 69.8 </span></td>
</tr>
<tr>
<td><strong>CROWN-IBP (R)</strong></td>
<td style="text-align: center"><span class="good"> 2.15 </span></td>
<td style="text-align: center"><span class="bad"> $\times$ </span></td>
<td style="text-align: center"><span class="bad"> $\times$</span></td>
<td style="text-align: center"><span class="bad"> 75.4 </span></td>
</tr>
</tbody>
</table>
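<p>To make the Box/IBP relaxation from the first row of the table concrete: it propagates an elementwise interval through the network, mapping each affine layer via its center and radius and clamping the interval ends at each ReLU. A minimal sketch with illustrative (untrained) weights:</p>

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """Propagate an elementwise box [l, u] through x -> W x + b."""
    c, r = (u + l) / 2, (u - l) / 2      # center / radius of the box
    c2 = W @ c + b
    r2 = np.abs(W) @ r                   # radius grows with |W|
    return c2 - r2, c2 + r2

def ibp_relu(l, u):
    """ReLU is monotone, so the box maps to [max(l, 0), max(u, 0)]."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Tiny 2-layer ReLU network (illustrative weights, not from a trained model).
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

l, u = np.array([-0.1, 0.9]), np.array([0.1, 1.1])   # input box
l, u = ibp_relu(*ibp_affine(l, u, W1, b1))
l, u = ibp_affine(l, u, W2, b2)

# Soundness check: every concrete input in the box yields an output in [l, u].
rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.uniform([-0.1, 0.9], [0.1, 1.1])
    y = W2 @ np.maximum(W1 @ x + b1, 0) + b2
    assert l[0] <= y[0] <= u[0]
```

The same two primitives suffice for the certified loss: the lower bound of the (margin) output box is what IBP-based training maximizes.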
<h3 id="next-steps">Next steps</h3>
<p>A natural question is whether the unfavorable properties of existing relaxations can be improved to make them more suitable for certified training.
In the paper, we systematically investigate modifications of existing relaxations and find that designing a relaxation with all favorable properties is difficult, as the properties induce complex tradeoffs that depend on the setting.
Still, such relaxations may exist, and future work might be able to utilize our findings to discover them.</p>
<p>A more promising approach seems to be the use of existing convex relaxations with modified training procedures designed to best utilize the benefits of each relaxation. Recent successful examples of this approach include <a href="https://www.sri.inf.ethz.ch/publications/balunovic2020bridging">COLT</a>, which includes the counterexample search in training, <a href="https://openreview.net/forum?id=Skxuk1rFwB">CROWN-IBP</a>, which heuristically combines the losses of two relaxations in training, and two recent methods which focus on IBP, aiming to improve its training procedure via better initialization and regularization (<a href="https://proceedings.neurips.cc/paper/2021/hash/988f9153ac4fd966ea302dd9ab9bae15-Abstract.html">FastIBP</a>) or the propagation of
smaller input regions in training (<a href="https://openreview.net/forum?id=7oFuxtJtUMH">SABR</a>).</p>
<p>Finally, it is worth noting that there are other promising approaches for neural network certification that do not use convex relaxations and are thus not affected by tradeoffs between relaxation properties. Examples in this direction include <a href="https://arxiv.org/abs/1902.02918">Randomized Smoothing</a>, offering high-probability robustness certificates, and custom certification-friendly model architectures such as <a href="https://arxiv.org/abs/2102.05363">$l_\infty$-distance nets</a>.
While not affected by limitations of convex relaxations, these approaches introduce other challenges such as optimization difficulties and additional inference-time work.</p>We investigate a long-standing paradox in the field of certified training, identifying previously overlooked properties of convex relaxations which affect training success.SRI Lab at ICLR 20222022-04-25T00:00:00+00:002022-04-25T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/iclr2022SRI Lab will present five works at ICLR 2022! In this meta post we aggregate all content related to our ICLR papers, including links to the conference portal and individual blogposts where you can learn more about the topics we currently focus on.Generating provably robust adversarial examples2022-04-21T00:00:00+00:002022-04-21T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/parade<p><img src="/assets/blog/parade/motivation.svg" class="blogpost-img100" style="height: 200pt;" />
The image above shows the digit $0$ from MNIST ($x_\text{orig}$) and, in red, a region around it depicting the set of geometrically perturbed images for which we expect a given neural network to be robust.
In green, we depict a subregion where the neural network is not robust. Traditionally, to assess the robustness of the network one uses adversarial attacks to generate examples such as $x_1$ and $x_2$.
While robustness can be assessed that way, the information that the whole green region is adversarial is lost. This in turn might lead to unexpected network behaviour in the future.
One advantage of the classical approach of assessing robustness, however, is that generating $x_1$ and $x_2$ is fast. In contrast, computing the green region exactly is computationally infeasible.
In this work, we present an algorithm called <strong>PARADE</strong> that exploits classical adversarial attacks to generate regions that are as large as possible while being provably adversarial. Like the green region in the figure, these regions summarize many individual adversarial attacks while also being practical to compute.
We call them provably robust adversarial examples.</p>
<h3 id="algorithm-overview">Algorithm overview</h3>
<p>There are three main steps to <strong>PARADE</strong>. First, we use off-the-shelf adversarial attacks to generate an initial box region that might still contain non-adversarial points.
We refer to this region as the overapproximation box $\mathcal{O}$. Then, we use a black-box verifier to shrink this overapproximation box to a smaller box that provably contains only adversarial points. We call this region the underapproximation box $\mathcal{U}$.
Finally, we use $\mathcal{O}$ and $\mathcal{U}$ to generate a polyhedral region $\mathcal{U}\subseteq\mathcal{P}\subseteq\mathcal{O}$ that we also prove, using the same black-box verifier, to contain only adversarial points. Both $\mathcal{U}$ and $\mathcal{P}$ fit our definition of
provably robust adversarial examples but differ in shape and precision. The generation of $\mathcal{P}$ is thus an optional step that makes our provably robust adversarial examples more precise. Next, we present the <strong>PARADE</strong> steps in detail.</p>
<h3 id="generating-the-overapproximation-box--mathcalo">Generating the overapproximation box $\mathcal{O}$</h3>
<iframe src="/assets/blog/parade/over.svg" style="border: none;;width: 100%;height: 200pt;"></iframe>
<p>To generate the overapproximation box $\mathcal{O}$, we sample attacks from an adversarial attack algorithm, such as <strong>PGD</strong>. Then, we fit a box around them. The process is illustrated in the animation above.
We note that, depending on the success of the attack algorithm, a small part of the ground-truth adversarial region $\mathcal{T}$ might be excluded from $\mathcal{O}$.</p>
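<p>A minimal sketch of this box-fitting step, with random points standing in for the concrete adversarial inputs an attack such as <strong>PGD</strong> would produce:</p>

```python
import numpy as np

def fit_overapprox_box(attacks):
    """Fit the tightest axis-aligned box O around a set of attack points."""
    attacks = np.asarray(attacks)
    return attacks.min(axis=0), attacks.max(axis=0)

# Stand-in for adversarial inputs found by an attack such as PGD; here just
# random points, since the attack itself is not the focus of this step.
rng = np.random.default_rng(0)
attack_samples = rng.uniform(0.3, 0.7, size=(100, 784))
box_lo, box_hi = fit_overapprox_box(attack_samples)

# Every sampled attack lies inside the fitted box O by construction.
assert np.all(box_lo <= attack_samples) and np.all(attack_samples <= box_hi)
```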
<h3 id="generating-the-underapproximation-box--mathcalu">Generating the underapproximation box $\mathcal{U}$</h3>
<iframe src="/assets/blog/parade/under.svg" style="border: none;;width: 100%;height: 264pt;"></iframe>
<p>We aim to generate the underapproximation box $\mathcal{U}$ such that it can be proven to contain only adversarial examples while also being as large as possible. Due to the complexity of this objective, we proceed heuristically. In particular, we start by initializing $\mathcal{U}$
to the overapproximation box $\mathcal{O}$. At each iteration $i$, we execute a black-box verification procedure. If the procedure verifies that the box from the previous iteration, $\mathcal{U}_{i-1}$, contains only adversarial examples, we return it.
Otherwise, we obtain from the verifier a linear constraint which can be added to $\mathcal{U}_{i-1}$ in order to make it provably robust. Unfortunately, this constraint is usually too conservative, as the black-box verifier relies on an overapproximation of the set of possible network outputs. We therefore relax the constraint by adjusting its bias term,
while making sure that it is not relaxed so much that it becomes meaningless. We use the bias-adjusted constraint to shrink $\mathcal{U}_{i-1}$ such that the constraint is no longer violated while the smallest possible amount of volume is lost. The procedure is repeated until the verification succeeds, as depicted in the animation above.</p>
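<p>One plausible greedy realization of a single shrinking step (a sketch, not the paper's exact procedure): given the current box and a bias-adjusted linear constraint $a^\top x \leq b$, tighten the one coordinate that restores feasibility while keeping the largest fraction of its width:</p>

```python
import numpy as np

def shrink_box(l, u, a, b):
    """Shrink the box [l, u] until a @ x <= b holds for every x inside it.

    Greedy sketch (not the paper's exact procedure): tighten a single
    coordinate, chosen to keep the largest fraction of its width.
    """
    l, u = l.astype(float).copy(), u.astype(float).copy()
    worst = np.sum(np.where(a > 0, a * u, a * l))  # max of a @ x over the box
    excess = worst - b
    if excess <= 0:
        return l, u                                # constraint already holds
    best_j, best_kept = None, -1.0
    for j in np.nonzero(a)[0]:
        width = u[j] - l[j]
        shift = excess / abs(a[j])                 # required movement of face j
        if shift > width:
            continue                               # coordinate j alone cannot fix it
        kept = 1.0 - shift / width
        if kept > best_kept:
            best_j, best_kept = j, kept
    if best_j is None:
        raise ValueError("no single-coordinate shrink suffices")
    if a[best_j] > 0:
        u[best_j] -= excess / a[best_j]
    else:
        l[best_j] += excess / -a[best_j]
    return l, u

# Unit box with constraint x0 + x1 <= 1.5: one upper face is pulled in.
l1, u1 = shrink_box(np.zeros(2), np.ones(2), np.array([1.0, 1.0]), 1.5)
```

In the actual algorithm the constraint comes from the verifier and the step is iterated until verification succeeds; the sketch only shows how a single constraint can shrink a box with minimal volume loss.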
<h3 id="generating-the-polyhedral-region--mathcalp">Generating the polyhedral region $\mathcal{P}$</h3>
<iframe src="/assets/blog/parade/poly.svg" style="border: none;;width: 100%;height: 240pt;"></iframe>
<p>Finally, we present the optional step of generating a polyhedral provably robust adversarial example $\mathcal{P}$ from the box provably robust adversarial example $\mathcal{U}$.
The additional flexibility of the polyhedral shape allows for larger regions $\mathcal{P}$ compared to $\mathcal{U}$, in exchange for computational complexity. As generating polyhedral regions is hard, we again proceed heuristically.
Starting with the overapproximation box $\mathcal{O}$, we iteratively add linear constraints to it until we arrive at a polyhedron $\mathcal{P}$ that the black-box verifier can prove to contain only adversarial examples.
Similarly to the generation process of $\mathcal{U}$, we use the verification at iteration $i$ to generate linear constraints.
Unlike the generation process of $\mathcal{U}$, we use not only linear constraints from the final verification objective but also linear constraints that fix the phases of the <em>ReLU</em> activation neurons in the network.
Unfortunately, the resulting constraints might produce a polyhedron $\mathcal{P}$ that cuts into $\mathcal{U}$. To prevent that, we leverage the fact that $\mathcal{U}$ is itself provably robust and bias-adjust the constraints in such a way that they do not remove volume from $\mathcal{U}$.
The procedure concludes when the verifier succeeds. We outline the procedure in the animation above.</p>
<h3 id="experiments">Experiments</h3>
<p>We experimented with two different types of provably robust adversarial examples - robust to pixel intensity changes ($\ell_\infty$ changes) and to geometric changes. We show the pixel intensity experiment below:</p>
<table>
<thead>
<tr>
<th>Network</th>
<th style="text-align: right">$\epsilon$</th>
<th style="text-align: right">PARADE<br />Box<br /># Regions</th>
<th style="text-align: right">PARADE<br />Box<br />Time</th>
<th style="text-align: right">PARADE<br />Box<br /># Attacks</th>
<th style="text-align: right">PARADE<br />Poly<br /># Regions</th>
<th style="text-align: right">PARADE<br />Poly<br /> Time</th>
<th style="text-align: right">PARADE<br />Poly<br /># Attacks</th>
</tr>
</thead>
<tbody>
<tr>
<td>MNIST<br />8x200</td>
<td style="text-align: right">0.045</td>
<td style="text-align: right">53/53</td>
<td style="text-align: right">114s</td>
<td style="text-align: right">$10^{121}$</td>
<td style="text-align: right">53/53</td>
<td style="text-align: right">1556s</td>
<td style="text-align: right">$10^{121} < \cdot < 10^{191}$</td>
</tr>
<tr>
<td>MNIST<br />ConvSmall</td>
<td style="text-align: right">0.12</td>
<td style="text-align: right">32/32</td>
<td style="text-align: right">74s</td>
<td style="text-align: right">$10^{494}$</td>
<td style="text-align: right">32/32</td>
<td style="text-align: right">141s</td>
<td style="text-align: right">$10^{494} < \cdot < 10^{561}$</td>
</tr>
<tr>
<td>MNIST<br />ConvBig</td>
<td style="text-align: right">0.05</td>
<td style="text-align: right">28/29</td>
<td style="text-align: right">880s</td>
<td style="text-align: right">$10^{137}$</td>
<td style="text-align: right">28/29</td>
<td style="text-align: right">5636s</td>
<td style="text-align: right">$10^{137} < \cdot < 10^{173}$</td>
</tr>
<tr>
<td>CIFAR-10<br />ConvSmall</td>
<td style="text-align: right">0.006</td>
<td style="text-align: right">44/44</td>
<td style="text-align: right">113s</td>
<td style="text-align: right">$10^{486}$</td>
<td style="text-align: right">44/44</td>
<td style="text-align: right">264s</td>
<td style="text-align: right">$10^{486} < \cdot < 10^{543}$</td>
</tr>
<tr>
<td>CIFAR-10<br />ConvBig</td>
<td style="text-align: right">0.008</td>
<td style="text-align: right">36/36</td>
<td style="text-align: right">404s</td>
<td style="text-align: right">$10^{573}$</td>
<td style="text-align: right">36/36</td>
<td style="text-align: right">610s</td>
<td style="text-align: right">$10^{573} < \cdot < 10^{654}$</td>
</tr>
</tbody>
</table>
<p>We note that <strong>PARADE</strong> is highly effective: it successfully generates regions for all but $1$ image on which the classical adversarial attacks succeeded. Further, the generated regions contain a very large set of adversarial examples that would be infeasible to generate individually.
Finally, we note that the polyhedral adversarial examples take more time to generate but contain more examples. Calculating the exact number of concrete attacks within the polyhedral regions is computationally hard, so instead we bound the number as precisely as possible from above and below using boxes.</p>
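<p>Attack counts such as $10^{494}$ in the table can be reproduced in spirit by counting the distinct 8-bit images inside a box: each pixel's interval width determines how many representable intensity levels it admits, and the per-pixel counts multiply, so they are accumulated in log-space (the widths below are purely illustrative):</p>

```python
import math

def log10_num_attacks(widths, levels=256):
    """log10 of the number of distinct discretized images inside a box.

    `widths` are per-pixel interval widths in [0, 1]; each pixel contributes
    floor(width * (levels - 1)) + 1 representable 8-bit intensity values,
    and the counts multiply across pixels, hence the sum of logs.
    """
    return sum(math.log10(math.floor(w * (levels - 1)) + 1) for w in widths)

# Purely illustrative: 784 MNIST pixels, each varying over a width-0.24 interval.
n = log10_num_attacks([0.24] * 784)
```

Even modest per-pixel widths compound into astronomically many concrete attacks, which is why the table reports counts only up to their order of magnitude.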
<p>Next, we show the results for adversarial examples provably robust to geometric changes:</p>
<table>
<thead>
<tr>
<th>Network</th>
<th style="text-align: right">Transform</th>
<th style="text-align: right">PARADE<br />Box<br /># Regions</th>
<th style="text-align: right">PARADE<br />Box<br />Time</th>
<th style="text-align: right">PARADE<br />Box<br /># Attacks</th>
</tr>
</thead>
<tbody>
<tr>
<td>MNIST<br />ConvSmall</td>
<td style="text-align: right">Rotate + Scale + Shear</td>
<td style="text-align: right">51/54</td>
<td style="text-align: right">774s</td>
<td style="text-align: right">$10^{96} < \cdot < 10^{195}$</td>
</tr>
<tr>
<td>MNIST<br />ConvSmall</td>
<td style="text-align: right">Scale + Translate2D</td>
<td style="text-align: right">51/56</td>
<td style="text-align: right">521s</td>
<td style="text-align: right">$10^{71} < \cdot < 10^{160}$</td>
</tr>
<tr>
<td>MNIST<br />ConvSmall</td>
<td style="text-align: right">Scale + Rotate + Brightness</td>
<td style="text-align: right">40/48</td>
<td style="text-align: right">370s</td>
<td style="text-align: right">$10^{70} < \cdot < 10^{455}$</td>
</tr>
<tr>
<td>MNIST<br />ConvBig</td>
<td style="text-align: right">Rotate + Scale + Shear</td>
<td style="text-align: right">44/50</td>
<td style="text-align: right">835s</td>
<td style="text-align: right">$10^{77} < \cdot < 10^{205}$</td>
</tr>
<tr>
<td>MNIST<br />ConvBig</td>
<td style="text-align: right">Scale + Translate2D</td>
<td style="text-align: right">42/46</td>
<td style="text-align: right">441s</td>
<td style="text-align: right">$10^{64} < \cdot < 10^{174}$</td>
</tr>
<tr>
<td>MNIST<br />ConvBig</td>
<td style="text-align: right">Scale + Rotate + Brightness</td>
<td style="text-align: right">46/52</td>
<td style="text-align: right">537s</td>
<td style="text-align: right">$10^{119} < \cdot < 10^{545}$</td>
</tr>
<tr>
<td>CIFAR-10<br />ConvSmall</td>
<td style="text-align: right">Rotate + Scale + Shear</td>
<td style="text-align: right">29/29</td>
<td style="text-align: right">1369s</td>
<td style="text-align: right">$10^{599} < \cdot < 10^{1173}$</td>
</tr>
<tr>
<td>CIFAR-10<br />ConvSmall</td>
<td style="text-align: right">Scale + Translate2D</td>
<td style="text-align: right">32/32</td>
<td style="text-align: right">954s</td>
<td style="text-align: right">$10^{66} < \cdot < 10^{174}$</td>
</tr>
<tr>
<td>CIFAR-10<br />ConvSmall</td>
<td style="text-align: right">Scale + Rotate + Brightness</td>
<td style="text-align: right">21/25</td>
<td style="text-align: right">1481s</td>
<td style="text-align: right">$10^{513} < \cdot < 10^{2187}$</td>
</tr>
</tbody>
</table>
<p>We see that, again, <strong>PARADE</strong> is capable of generating examples for most images where classical adversarial attacks succeeded. We note that we use <a href="https://www.sri.inf.ethz.ch/publications/balunovic2019geometric"><em>DeepG</em></a> for verification.
Since DeepG generates image polyhedra, we have to approximate the number of concrete attacks similarly to <strong>PARADE Poly</strong> above. We also note that DeepG is more computationally expensive, resulting in longer runtimes for our algorithm as well.</p>
<h3 id="visualizing-parade-regions">Visualizing PARADE regions</h3>
<p><img src="/assets/blog/parade/visualize.svg" alt="" class="blogpost-img100" /></p>
<p>Above we visualize some of the provably robust adversarial examples generated by <strong>PARADE</strong> for both pixel and geometric transformations. Each pixel’s color represents the number of possible values that pixel can have within our box regions.</p>
<h3 id="summary">Summary</h3>
<p>We introduced and motivated the concept of provably robust adversarial examples. We further outlined our algorithm, <strong>PARADE</strong>, which generates such examples in the shape of boxes or polyhedra.
Empirically, we demonstrated that the regions produced by <strong>PARADE</strong> can summarize a very large number of individual adversarial examples, making them a useful tool to assess the robustness of neural networks.
We hope that we have piqued your interest in our work. For further details and experiments, please refer to our <a href="https://files.sri.inf.ethz.ch/website/papers/symex.pdf"><em>ICLR 2022 paper</em></a>.</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>I would like to thank all of my collaborators for contributing to this paper. In particular, I want to thank <a href="https://ggndpsngh.github.io/"><em>Gagandeep Singh</em></a>, who supervised me on the project and is now professor at UIUC, for his help and mentorship.</p>We introduce the concept of provably robust adversarial examples. These are adversarial examples that are generated together with a region around them that can be proven robust to perturbations. We also show a method for generating large such regions in a scalable manner.Multi-neuron relaxation guided branch-and-bound2022-04-21T00:00:00+00:002022-04-21T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/mnbab<p>This blog post explains the high-level concepts and intuitions behind our most recent neural network verifier <a href="https://files.sri.inf.ethz.ch/website/papers/ferrari2022complete.pdf">MN-BaB</a>. First, we introduce the neural network verification problem. Then, we present the so-called Branch-and-Bound approach for solving it and outline the main ideas behind multi-neuron constraints, before combining the two in our new verifier MN-BaB. We conclude with some experimental results and insights on why using multi-neuron constraints is key for the verification of challenging networks with high natural accuracy.</p>
<h3 id="neural-network-verification">Neural Network Verification</h3>
<p>In a nutshell, the neural network verification problem can be stated as follows:</p>
<p><em>Given a network and an input, prove that all points in a small region around that input are classified correctly, i.e., that no <a href="https://openai.com/blog/adversarial-example-research/">adversarial example</a> exists.</em></p>
<p>To formalize this a bit, we consider a network $f: \mathcal{X} \to \mathcal{Y}$, an input region $\mathcal{D} \subseteq \mathcal{X}$, and a linear property $\mathcal{P}\subseteq \mathcal{Y}$ over the output neurons $y\in\mathcal{Y}$, and we try to prove that</p>
\[f(x) \in \mathcal{P}, \forall x \in \mathcal{D}.\]
<p>For the sake of explanation, we consider a fully connected $L$-layer network with ReLU activations but note that we can handle all common architectures. <!--as the Branch-and-Bound framework only yields complete verifiers for piecewise linear activation functions but remark that our approach applies to a wide class of activations including ReLU, Sigmoid, Tanh, MaxPool, and others.--> We denote the weights and biases of neurons in the $i$-th layer as $W^{(i)}$ and $b^{(i)}$ and define the neural network as</p>
\[f(x) := \hat{z}^{(L)}(x), \qquad \hat{z}^{(i)}(x) := W^{(i)}z^{(i-1)}(x) + b^{(i)}, \qquad z^{(i)}(x) := \max(0, \hat{z}^{(i)}(x)).\]
<p>Where $z^{(0)}(x) = x$ denotes the input, $\hat{z}$ are the pre-activation values, and $z$ the post-activation values. For readability, we omit the dependency of intermediate activations on $x$ from here on.</p>
<p>Let $\mathcal{D}$ be an $\ell_\infty$ ball around an input point $x_0$ of radius $\epsilon$:
\(\mathcal{D}_\epsilon(x_0) = \left\{ x \in \mathcal{X} \mid \lVert x - x_0\rVert _{\infty} \leq \epsilon \right\}.\)</p>
<p>Since we can encode any linear property over output neurons into an additional affine layer, we can simplify the general formulation of $f(x) \in \mathcal{P}$ to $f(x) > {0}$. Any such property can now be verified by proving that a lower bound to the following optimization problem is greater than $0$:</p>
\[\begin{align*}
\min_{x \in \mathcal{D}_\epsilon(x_0)} \qquad &f(x) = \hat{z}^{(L)} \tag{1} \\
s.t. \quad & \hat{z}^{(i)} = W^{(i)}z^{(i-1)} + b^{(i)}\\
& z^{(i)} = \max({0}, \hat{z}^{(i)})\\
\end{align*}\]
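<p>Concretely, for a classifier with target class $t$, the property "class $t$ has the highest logit" can be folded into an extra affine layer whose rows compute the margins $y_t - y_j$ for $j \neq t$; verifying $f(x) > 0$ then means every margin stays positive. A minimal sketch (toy logits, not from a real network):</p>

```python
import numpy as np

def margin_layer(num_classes, target):
    """Affine layer with rows e_t - e_j (j != t): all outputs are positive
    exactly when class `target` has the strictly highest logit."""
    W = np.zeros((num_classes - 1, num_classes))
    row = 0
    for j in range(num_classes):
        if j == target:
            continue
        W[row, target], W[row, j] = 1.0, -1.0
        row += 1
    return W

logits = np.array([0.2, 3.1, 1.0])           # toy network output
margins = margin_layer(3, target=1) @ logits
assert np.all(margins > 0)                   # class 1 beats classes 0 and 2
```

Appending this layer to the network is what lets the general property $f(x) \in \mathcal{P}$ be reduced to the single bound $f(x) > 0$.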
<h3 id="background-branch-and-bound-for-neural-network-verification">Background: Branch-and-Bound for Neural Network Verification</h3>
<p>Recently, the <em>Branch-and-Bound</em> (<strong>BaB</strong>) approach, first described for this task in <a href="https://arxiv.org/pdf/1909.06588.pdf">Branch and Bound for Piecewise Linear Neural Network Verification</a>, has been popularized. At a high level, it is based on splitting the hard optimization problem of Eq. 1 into multiple easier subproblems by adding additional constraints until we can show the desired bound of $f(x) > 0$ on them.</p>
<p>The high-level motivation is the following: the optimization problem in Eq. 1 would be efficiently solvable if not for the non-linearity of the ReLU function. Since a ReLU function is piecewise linear and composed of only two linear regions, we can make a case distinction between a single ReLU node being “active” (input $\geq 0$) or inactive (input $< 0$) and prove the property on the resulting cases where the ReLU behaves linearly.</p>
<p>In the limit where all ReLU nodes are split, the verification problem becomes fully linear and can be solved efficiently. However, the number of subproblems to be solved in the resulting Branch-and-Bound tree is exponential in the number of ReLU neurons on which we split. Therefore, splitting all ReLU nodes is computationally intractable for all interesting verification problems. To tackle this problem, we prune this Branch-and-Bound tree using the insight that we do not have to split a subproblem further, once we find a lower bound that is $>0$.</p>
<p>In pseudo-code, the Branch-and-Bound algorithm looks as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="k">def</span> <span class="nf">verify_with_branch_and_bound</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">input_region</span><span class="p">,</span> <span class="n">output_property</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bool</span><span class="p">:</span>
<span class="n">problem_instance</span> <span class="o">=</span> <span class="p">(</span><span class="n">input_region</span><span class="p">,</span> <span class="n">output_property</span><span class="p">)</span>
<span class="n">global_lb</span><span class="p">,</span> <span class="n">global_ub</span> <span class="o">=</span> <span class="n">bound</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">problem_instance</span><span class="p">)</span>
<span class="n">unsolved_subproblems</span> <span class="o">=</span> <span class="p">[(</span><span class="n">global_lb</span><span class="p">,</span> <span class="n">problem_instance</span><span class="p">)]</span>
<span class="k">while</span> <span class="n">global_lb</span> <span class="o"><</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">global_ub</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">_</span><span class="p">,</span> <span class="n">current_subproblem</span> <span class="o">=</span> <span class="n">unsolved_subproblems</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
<span class="n">current_lb</span><span class="p">,</span> <span class="n">current_ub</span> <span class="o">=</span> <span class="n">bound</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">current_subproblem</span><span class="p">)</span>
<span class="k">if</span> <span class="n">current_ub</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="k">if</span> <span class="n">current_lb</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">subproblem_left</span><span class="p">,</span> <span class="n">subproblem_right</span> <span class="o">=</span> <span class="n">branch</span><span class="p">(</span><span class="n">current_subproblem</span><span class="p">)</span>
<span class="n">unsolved_subproblems</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">subproblem_left</span><span class="p">,</span> <span class="n">subproblem_right</span><span class="p">)</span>
<span class="n">global_lb</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">lb</span> <span class="k">for</span> <span class="n">lb</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">unsolved_subproblems</span><span class="p">)</span>
<span class="k">return</span> <span class="n">global_lb</span> <span class="o">></span> <span class="mi">0</span></code></pre></figure>
<p>To define one particular verification method that follows the Branch-and-Bound approach, such as MN-BaB, all we have to do is instantiate the <strong>branch()</strong> and <strong>bound()</strong> functions.</p>
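<p>To make the loop above concrete, here is a runnable toy instantiation (note that the pseudo-code's <code>append(subproblem_left, subproblem_right)</code> should be an <code>extend</code> over two <code>(lb, subproblem)</code> pairs). We verify $f(x) = \max(0, x) - x + c > 0$ over an interval, with <strong>bound()</strong> given by plain interval arithmetic, which is deliberately loose because it treats $\max(0, x)$ and $-x$ independently, and <strong>branch()</strong> given by bisection; the concrete upper bound is simplified to a midpoint evaluation:</p>

```python
def f(x, c):
    """Toy 'network': f(x) = relu(x) - x + c, with true minimum c on [-1, 1]."""
    return max(0.0, x) - x + c

def bound(lo, hi, c):
    """Interval-arithmetic bounds of f on [lo, hi]; loose, since the
    correlation between relu(x) and -x is lost."""
    return max(0.0, lo) - hi + c, max(0.0, hi) - lo + c

def branch(lo, hi):
    """Bisect the input interval into two subproblems."""
    mid = (lo + hi) / 2.0
    return (lo, mid), (mid, hi)

def verify_with_branch_and_bound(lo, hi, c):
    """Prove f(x, c) > 0 for all x in [lo, hi], or find a counterexample."""
    lb, _ = bound(lo, hi, c)
    unsolved = [(lb, (lo, hi))]
    while unsolved:
        _, (a, b) = unsolved.pop()
        lb, _ = bound(a, b, c)
        if f((a + b) / 2.0, c) <= 0:       # concrete counterexample found
            return False
        if lb <= 0:                        # inconclusive: split further
            left, right = branch(a, b)
            unsolved.extend([(lb, left), (lb, right)])
    return True                            # every subproblem verified
```

Once an interval no longer contains the kink of the ReLU (or is narrow enough), its interval bound becomes conclusive and the subtree is pruned, mirroring how splitting ReLU nodes makes subproblems linear.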
<h3 id="background-multi-neuron-constraints">Background: Multi-Neuron Constraints</h3>
<p>Before we do that, we need to understand <em>multi-neuron constraints</em> (<strong>MNCs</strong>), the second key building block of MN-BaB.</p>
<p>To bound the optimization problem in Eq. 1 efficiently, we want to replace the non-linear constraint $z = \max({0}, \hat{z})$ with its so-called linear relaxation, i.e., a set of linear constraints that is satisfied for all points satisfying the original non-linear constraint. If we consider just a single neuron, the tightest such linear relaxation is the convex hull of the function in its input-output space:</p>
<p><img src="/assets/blog/mn-bab/ConvexHull.png" alt="Convex hull ReLU abstraction" class="blogpost-img50" /></p>
<p>However, considering one neuron at a time comes with a fundamental precision limit, called the <a href="https://proceedings.neurips.cc/paper/2019/hash/246a3c5544feb054f3ea718f61adfa16-Abstract.html">(single-neuron) convex relaxation barrier</a>. It has since been <a href="https://www.sri.inf.ethz.ch/publications/singh2019krelu">shown</a> that this limit can be overcome by considering multiple neurons jointly, thereby capturing interactions between these neurons and obtaining tighter bounds. We illustrate this improvement below, showing a projection of the 4d input-output space of two neurons.</p>
<p><img src="/assets/blog/mn-bab/PRIMA.png" alt="PRIMA ReLU abstraction" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em>The difference in tightness between the tightest single-neuron, and a multi-neuron relaxation.</em></p>
<p>We use the efficiently computable <em>multi-neuron constraints</em> from <a href="https://www.sri.inf.ethz.ch/publications/mueller2021precise">PRIMA</a>, which can be expressed as a conjunction of linear constraints over the joint input-output space.</p>
<h3 id="mn-bab-bounding">MN-BaB: Bounding</h3>
<p>The goal of the <strong>bound()</strong> method is to derive a lower bound to Eq. 1 that’s as tight as possible. The tighter it is, the earlier the Branch-and-Bound process can be terminated.</p>
<p>Following <a href="https://files.sri.inf.ethz.ch/website/papers/DeepPoly.pdf">previous</a> <a href="https://arxiv.org/abs/2103.06624">works</a>, we derive a lower bound of the network’s output as a linear function of the inputs:</p>
\[\min_{x \in \mathcal{D}} f(x) \geq \min_{x \in \mathcal{D}} a_{inp}x + c_{inp}\]
<p>There, the minimization over $x \in \mathcal{D}$ has a closed-form solution given by <a href="https://en.wikipedia.org/wiki/Hölder%27s_inequality">Hölder’s inequality</a>:</p>
\[\min_{x \in \mathcal{D}} a_{inp}x + c_{inp} \geq a_{inp}x_0 - \lVert a_{inp} \rVert_1 \epsilon + c_{inp}\]
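<p>Since the objective is linear in $x$, the minimum over the $\ell_\infty$ ball is attained at the corner $x_i = x_{0,i} - \epsilon\,\mathrm{sign}(a_i)$, which is exactly what the Hölder bound evaluates to. A quick numerical sanity check (arbitrary illustrative values):</p>

```python
import itertools
import numpy as np

a = np.array([1.5, -2.0, 0.3])
c = 0.7
x0 = np.array([0.1, 0.2, -0.4])
eps = 0.05

# Hölder closed form for a linear objective over an l_inf ball.
closed_form = a @ x0 - np.abs(a).sum() * eps + c

# Brute force over all 2^3 corners of the ball.
brute = min(a @ (x0 + eps * np.array(s)) + c
            for s in itertools.product([-1.0, 1.0], repeat=3))
assert np.isclose(closed_form, brute)
```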
<p>To arrive at such a linear lower bound of the output in terms of the input, we start from the exact expression $f(x) = W^{(L)}z^{(L-1)} + b^{(L)}$ and replace $z^{(L-1)}$ with symbolic, linear bounds depending only on the previous layer’s values $z^{(L-2)}$. We proceed in this manner recursively until we obtain an expression only in terms of the inputs of the network.</p>
<p>These so-called linear relaxations of the different layer types determine the precision of the obtained bounding method. While affine layers (e.g., fully connected, convolutional, avg. pooling, normalization) can be captured exactly, non-linear activation layers remain challenging and their encoding is what differentiates MN-BaB. Most importantly, MN-BaB enforces MNCs in an efficiently optimizable fashion. The full details are given in the <a href="https://files.sri.inf.ethz.ch/website/papers/ferrari2022complete.pdf">paper</a> but are rather technical and notation-heavy, so we will skip them here.</p>
<p>To derive the linear relaxations for activation layers, we need bounds on the inputs of those layers ($l_x$ and $u_x$ in the illustrations). In order to compute these lower and upper bounds on every neuron, we apply the procedure described above to every neuron in the network, starting from the first activation layer.</p>
<p>Note that if those input bounds for a ReLU node are either both negative or both positive, the corresponding activation function becomes linear and we do not have to split this node during the Branch-and-Bound process. We call such nodes “stable” and correspondingly nodes where the input bounds contain zero “unstable”.</p>
<p><img src="/assets/blog/mn-bab/stable_vs_unstable.png" alt="Alt Text" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em>From Left to Right: stable inactive, stable active, unstable.</em></p>
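<p>This case split is trivial to implement once the pre-activation bounds are available; a minimal helper:</p>

```python
def relu_status(lb, ub):
    """Classify a ReLU node from its pre-activation bounds [lb, ub]."""
    if ub <= 0:
        return "stable inactive"   # output is identically 0
    if lb >= 0:
        return "stable active"     # output equals the input
    return "unstable"              # must be relaxed or split

assert relu_status(-2.0, -0.5) == "stable inactive"
assert relu_status(0.1, 3.0) == "stable active"
assert relu_status(-1.0, 1.0) == "unstable"
```

Only nodes classified as unstable are candidates for the branching step described next.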
<h3 id="mn-bab-branching">MN-BaB: Branching</h3>
<p>The <strong>branch()</strong> method takes a problem instance and splits it into two subproblems. This means deciding which unstable ReLU node to split and adding an additional constraint to each resulting subproblem, enforcing $\hat{z}<0$ or $\hat{z}\geq0$ on the input of the split neuron.</p>
<p><img src="/assets/blog/mn-bab/split_constraints.png" alt="Alt Text" class="blogpost-img80" /></p>
<p class="blogpost-caption"><em>Illustration of the split constraints that are added to the generated subproblems.</em></p>
<p>The choice of which node to split has a significant impact on how many subproblems we have to consider during the Branch-and-Bound process until we can prove a property. Therefore, we aim to choose a neuron that minimizes the total number of problems we have to consider. To do this, we define a proxy score trying to capture the bound improvement resulting from any particular split. Note that the optimal branching decision depends on the bounding method that is used, as different bounding methods might profit differently from additional constraints resulting from the split.</p>
<p>As our bounding method relies on MNCs, we design a proxy score that is specifically tailored to them, called the <em>Active Constraint Score</em> (<strong>ACS</strong>). ACS determines the sensitivity of the final optimization objective with respect to the MNCs and then, for each node, computes the cumulative sensitivity of all constraints containing that node. We then split the node with the highest cumulative sensitivity.</p>
<p>We further propose <em>Cost Adjusted Branching</em> (<strong>CAB</strong>) to scale this branching score by the expected cost of performing a particular split. This cost can differ significantly, as only the intermediate bounds after the split layer have to be recomputed, making splits in later layers computationally cheaper.</p>
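<p>Putting the two ideas together, a hypothetical sketch of the branching step (the problem representation and the per-node scores are illustrative placeholders; MN-BaB derives the ACS scores from the sensitivities of the multi-neuron constraints):</p>

```python
def branch(problem, acs_scores, split_costs):
    """Split the unstable node with the best cost-adjusted score into two
    subproblems, one per added sign constraint on the node's input."""
    unstable = [n for n, (lb, ub) in problem["bounds"].items() if lb < 0 < ub]
    # Cost Adjusted Branching: divide the Active Constraint Score by the
    # expected cost of recomputing intermediate bounds after the split.
    node = max(unstable, key=lambda n: acs_scores[n] / split_costs[n])
    neg = {**problem, "constraints": problem["constraints"] + [(node, "z<0")]}
    pos = {**problem, "constraints": problem["constraints"] + [(node, "z>=0")]}
    return neg, pos

problem = {"bounds": {"a": (-1.0, 2.0), "b": (0.5, 3.0), "c": (-0.2, 0.1)},
           "constraints": []}
neg, pos = branch(problem, {"a": 0.9, "b": 0.8, "c": 0.3},
                  {"a": 2.0, "b": 1.0, "c": 1.0})
print(neg["constraints"], pos["constraints"])
```

<p>Here node <code>a</code> is chosen because its score-to-cost ratio ($0.9 / 2.0 = 0.45$) beats that of the other unstable node <code>c</code> ($0.3$); the stable node <code>b</code> is never considered.</p>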
<h3 id="why-use-multi-neuron-constraints">Why use multi-neuron constraints?</h3>
<p>Using MNCs for bounding, while making the bounds more precise, is computationally costly. The intuitive argument why it still helps verification performance is that the number of subproblems solved during Branch-and-Bound grows exponentially with the depth of the subproblem tree. A more precise bounding method that can verify subproblems earlier (at a smaller depth), can therefore save us exponentially many subproblems that we do not need to solve, which more than compensates for the increased computational cost.</p>
<p>This benefit is more pronounced the larger the considered network and the more dependencies there are between neurons in the same layer.
Most established benchmarks (e.g., from <a href="https://sites.google.com/view/vnn2021">VNNComp</a>) are based on very small networks or use training methods designed for ease of verification at the cost of natural accuracy. While this makes their certification tractable, they are less representative of networks used in practice. Therefore, we suggest focusing on larger, more challenging networks with higher natural accuracy (and more intra-layer dependencies) for the evaluation of the next generation of verifiers. There, the benefits of MNCs are particularly pronounced, leading us to believe that they represent a promising direction.</p>
<h3 id="experiments">Experiments</h3>
<p>We study the effect of MN-BaB’s components in an ablation study on the first 100 test images of the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a> dataset. We aim to prove that there is no adversarial example within an $\ell_\infty$ ball of radius $\epsilon=1/255$ and report the number of verified samples (within a timeout of 600 seconds) and the corresponding average runtime.
We consider two networks of identical architecture that only differ in the strength of their adversarial training method. ResNet6-A is weakly regularized while ResNet6-B is more strongly regularized, i.e., it employs stronger adversarial training.</p>
<p><img src="/assets/blog/mn-bab/ablation_study.png" alt="Alt Text" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em>Evaluating the effect of MNCs, Active Constraint Score (ACS) branching, and Cost Adjusted Branching (CAB) on MN-BaB. BaBSR is another branching method that is used as a baseline.</em></p>
<p>As expected, we see that both MNCs and Active Constraint Score branching are much more effective on the weakly regularized ResNet6-A. There, we verify 31% more samples while being around 31% faster, whereas on ResNet6-B we only verify 10% more samples.</p>
<p>As a more fine-grained measure of performance, we analyze the ratio of runtimes and number of subproblems required for verification on a per-property level on ResNet6-A.</p>
<p class="blogpost-wrap"><span><strong>Effectiveness of Multi-Neuron Constraints</strong>: We plot the ratio of the number of subproblems required to prove a property during Branch-and-Bound without vs. with MNCs. Using MNCs reduces the number of subproblems by two orders of magnitude on average.</span>
<img src="/assets/blog/mn-bab/ResNet6-A_n_subproblems_ratio_w_wo_MNC.png" alt="Alt Text" class="blogpost-img40" /></p>
<p class="blogpost-wrap"><span><strong>Effectiveness of Active Constraint Score Branching</strong>: We plot the ratio of the number of subproblems solved during Branch-and-Bound with BaBSR vs. ACS. Using ACS reduces the number of subproblems by an additional order of magnitude.</span>
<img src="/assets/blog/mn-bab/ResNet6-A_n_subproblems_ratio_branching.png" alt="Alt Text" class="blogpost-img40" /></p>
<p class="blogpost-wrap"><span><strong>Effectiveness of Cost Adjusted Branching</strong>: Finally, we investigate the effect of Cost Adjusted Branching on mean verification time with ACS. Using Cost Adjusted Branching further reduces the verification time by 50%. It is particularly effective in combination with the ACS scores and multi-neuron constraints, where bounding costs vary more significantly.</span>
<img src="/assets/blog/mn-bab/ResNet6-A_cab_runtime_comparison_p4c_acs.png" alt="Alt Text" class="blogpost-img40" /></p>
<h3 id="recap">Recap</h3>
<p>MN-BaB combines precise multi-neuron constraints with the Branch-and-Bound paradigm and an efficient GPU-based implementation to become a new state-of-the-art verifier, especially for less regularized networks. For a full breakdown of all technical details and detailed experimental evaluations, we recommend reading our <a href="https://www.sri.inf.ethz.ch/publications/ferrari2022complete">paper</a>. If you want to play around with MN-BaB yourself, please check out our <a href="https://github.com/eth-sri/mn-bab">code</a>.</p>Learn more about how multi-neuron constraints can be used in a Branch-and-Bound framework to build a state-of-the-art complete neural network verifier.Encoding sensitive data with guarantees2022-04-20T00:00:00+00:002022-04-20T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/fnf<p>As machine learning is being increasingly used in scenarios that affect humans, such as credit scoring or crime risk assessment, it has become clear that these predictive models often discriminate and can be unfair.
It is thus increasingly important to design methods that help models make fair decisions, either by pre-processing the input data, in-processing the model or post-processing the predictions.
In our work, we propose a new pre-processing approach to encode existing data into new, unbiased representations that have high utility, but do not allow for reconstructing sensitive attributes such as race or gender.
Our approach, named Fair Normalizing Flows (FNF) is based on learning a bijective encoder for each group (where groups are determined based on the sensitive attribute).
Using bijective encoders enables us to obtain guarantees on maximum accuracy that <em>any</em> adversary can have when predicting the sensitive attribute.</p>
<h3 id="fair-representations">Fair representations</h3>
<p>Consider a case of a company with several teams that would like to build ML models for different products using the same data.
One option would be for each team to train their own model and enforce fairness of the model by themselves.
However, the teams might not have the same definition of fairness or they might even lack expertise to train fair models.
<a href="https://sanmi.cs.illinois.edu/documents/Representation_Learning_Fairness_NeurIPS19_Tutorial.pdf"><em>Fair representation learning</em></a> is a data pre-processing technique that transforms data into a new representation such that any classifier trained on top of this representation is fair.
Using representation learning enables us to pre-process data only once, and then give processed data to each team so that they can train their own model on this new data, while knowing that the model is fair, according to a single, pre-defined fairness definition.
The key question here is how to ensure that sensitive attributes cannot be recovered from the learned representations.
Typically, <a href="https://arxiv.org/abs/1802.06309">prior work</a> has checked that this is the case by jointly learning representations and an auxiliary adversarial model which is trying to predict the sensitive attribute from the representations.
However, while these representations protect against the adversaries considered during training, several <a href="https://arxiv.org/abs/1808.06640">recent</a> <a href="https://arxiv.org/abs/2101.04108">papers</a> have shown that stronger adversaries can in fact still recover the sensitive attributes.
Our work tackles this issue by proposing a non-adversarial fair representation learning approach based on normalizing flows, which can in certain cases <em>guarantee</em> that no adversary can reconstruct the sensitive attributes.</p>
<h3 id="motivation">Motivation</h3>
<p>To motivate our fair representation learning approach, we introduce a small example of a population consisting of a mixture of 4 Gaussians.
Consider a distribution of samples $x = (x_1, x_2)$ divided into two groups, shown as blue and orange in the figure below, with color and shape denoting sensitive attribute and label, respectively.</p>
<p><img src="/assets/blog/fnf/gauss.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>Example of a population.</strong> Sensitive attribute is determined using color and shape denotes the label.</em></p>
<p>The first group with a sensitive attribute $a = 0$ has a distribution $(x_1, x_2) \sim p_0$, where $p_0$ is a mixture of two Gaussians at the top half.
The second group with a sensitive attribute $a = 1$ has a distribution $(x_1, x_2) \sim p_1$, where $p_1$ is a mixture of two Gaussians at the bottom half.
The label of a point $(x_1, x_2)$ is defined by $y = 1$ if $x_1$ and $x_2$ have the same sign, and $y = 0$ otherwise.
Our goal is to learn a data representation $z = f(x, a)$ such that it is <em>impossible</em> to recover $a$ from $z$, but still possible to predict target $y$ from $z$.
Note that such a representation exists for our task: simply setting $z = f(x, a) = (-1)^ax$ makes it impossible to predict whether a particular $z$ corresponds to $a = 0$ or $a = 1$, while still allowing us to train a classifier $h$ with essentially perfect accuracy (e.g. $h(z) = 1$ if and only if $z_1 > 0$).
This example also motivates our general approach: can we somehow map distributions corresponding to the two groups in the population to new distributions which are guaranteed to be difficult to distinguish?</p>
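<p>This construction is easy to verify numerically. The following simulation is our own illustration; we assume Gaussian components with standard deviation 0.5 at centers $(\pm 2, \pm 2)$, which the figure does not specify exactly. It confirms that after encoding, the classifier $h(z) = 1$ iff $z_1 > 0$ recovers the label almost perfectly while the two groups become nearly indistinguishable:</p>

```python
import random

random.seed(0)

def sample(a):
    """Draw (x1, x2) from group a of the 4-Gaussian mixture (std 0.5)."""
    sign = random.choice([-1, 1])          # pick one of the group's two modes
    cy = 2.0 if a == 0 else -2.0           # group 0: top half, group 1: bottom
    return (random.gauss(2.0 * sign, 0.5), random.gauss(cy, 0.5))

def f(x, a):
    """Bijective per-group encoder z = f(x, a) = (-1)^a * x."""
    s = -1.0 if a == 1 else 1.0
    return (s * x[0], s * x[1])

n = 2000
correct = 0
mean_z2 = {0: 0.0, 1: 0.0}
for a in (0, 1):
    for _ in range(n):
        x = sample(a)
        y = 1 if x[0] * x[1] > 0 else 0    # label: x1 and x2 have the same sign
        z = f(x, a)
        correct += int((1 if z[0] > 0 else 0) == y)
        mean_z2[a] += z[1] / n
acc = correct / (2 * n)
print(f"accuracy of h(z) = 1[z1 > 0]: {acc:.3f}")                 # close to 1
print(f"mean z2 per group: {mean_z2[0]:.2f}, {mean_z2[1]:.2f}")   # both near +2
```

<p>Both encoded groups concentrate on the same two top-half modes, so no classifier can do better than chance at predicting $a$ from $z$, while the label remains recoverable.</p>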
<h3 id="fair-normalizing-flows">Fair Normalizing Flows</h3>
<p><img src="/assets/blog/fnf/fnf_overview.png" alt="" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Overview of FNF.</strong> The key idea is to transform the distributions corresponding to different groups using bijective encoders. After training, the two distributions are aligned, and an adversary cannot reconstruct the sensitive attribute $a$ from the latent representation $z$.</em></p>
<p>As shown in the figure above, original features are useful for solving some downstream prediction task: we can train a classifier $h$ which predicts task label from the original features $x$ with reasonable accuracy.
However, at the same time, an adversary $g$ can recover the sensitive attribute from $x$ and use it to potentially discriminate in a downstream task.
Motivated by the previous example, our goal is now to learn a function $f$ which transforms a pair of features and sensitive attribute $(x, a)$ into a new representation $z$ from which it is difficult to recover the sensitive attribute $a$.
As in the previous example, we are going to encode both distributions corresponding to $a = 0$ and $a = 1$ using a bijective transformation.
Our approach learns two bijective functions $f_0(x)$ and $f_1(x)$, and we denote the transformation as $f(x, a) = f_a(x)$.</p>
<p>One important quantity we are interested in computing is the <em>statistical distance</em>, which measures how well an adversary can distinguish between the distributions corresponding to the two groups in the population.
Importantly, <a href="https://arxiv.org/abs/1802.06309">Madras et al.</a> have shown that bounding statistical distance also bounds other fairness measures such as demographic parity or equalized odds.
The statistical distance between the two encoded distributions, denoted \(\mathcal{Z}_0\) and \(\mathcal{Z}_1\), is defined as:</p>
\[\begin{equation}
\Delta(p_{Z_0}, p_{Z_1}) \triangleq \sup_{\mu \in \mathcal{B}} \lvert \mathbb{E}_{z \sim \mathcal{Z}_0} [\mu(z)] - \mathbb{E}_{z \sim \mathcal{Z}_1} [\mu(z)] \rvert,
\end{equation}\]
<p>where \(\mu\colon \mathbb{R}^d \rightarrow \{0, 1\}\) is a function from the set of all binary classifiers $\mathcal{B}$, trying to discriminate between $\mathcal{Z}_0$ and $\mathcal{Z}_1$.
We can show that the supremum is attained for $\mu^*$ which, for some $z$, evaluates to \(1\) if and only if \(p_{Z_0}(z) \leq p_{Z_1}(z)\).
This shows that to compute the statistical distance we need to be able to evaluate probability densities in the latent space, which is difficult for standard neural architectures
because any $z$ can correspond to several different inputs $x$.
However, our approach uses a bijective transformation as the encoder, making it easy to compute the latent probability density using the change of variables formula:</p>
\[\begin{equation}
\log p_{Z_a}(z) = \log p_a(f^{-1}_a(z)) + \log \left | \det \frac{\partial f^{-1}_a(z)}{\partial z} \right |
\end{equation}\]
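<p>As a minimal illustration of this formula (not the actual FNF architecture), consider a one-dimensional affine flow $z = s x + t$, whose inverse and Jacobian are available in closed form:</p>

```python
import math

def gauss_logpdf(x, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def latent_logpdf(z, s, t, mu, sigma):
    """log p_Z(z) for the flow z = s*x + t via the change of variables formula."""
    x = (z - t) / s                   # f^{-1}(z)
    log_det = -math.log(abs(s))       # log |d f^{-1} / dz|
    return gauss_logpdf(x, mu, sigma) + log_det

# Sanity check: with x ~ N(0, 1) and z = 2x + 1, we have z ~ N(1, 2^2),
# so the flow density must match the closed-form Gaussian density of z.
lhs = latent_logpdf(0.3, s=2.0, t=1.0, mu=0.0, sigma=1.0)
rhs = gauss_logpdf(0.3, mu=1.0, sigma=2.0)
print(abs(lhs - rhs) < 1e-12)
```
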
<p>We train the two encoders $f_0$ and $f_1$ to minimize the statistical distance $\Delta(p_{Z_0}, p_{Z_1})$, while at the same time training an auxiliary classifier which helps the learned representations stay informative for downstream prediction tasks.
One issue with training this way is that the statistical distance is non-differentiable, as the optimal adversary $\mu^*$ makes a discrete thresholding decision, so we instead minimize a loss which is a smooth approximation of the statistical distance.
After training is finished, we can evaluate the statistical distance exactly, without any approximation.
Our guarantees assume that we know the input distributions $p_0$ and $p_1$, which in practice most often have to be estimated.
You can find full details of how our guarantees change when input distributions are estimated in <a href="https://arxiv.org/abs/2106.05937">our paper</a>.</p>
<h3 id="experimental-evaluation">Experimental evaluation</h3>
<p>We evaluated FNF on several standard datasets from the fairness literature.
For each dataset, we take one of the features to be a sensitive attribute (e.g. race), and we train a model which balances between accuracy, measuring how well it can predict the task label, and fairness, measuring
how well it can debias the learned representations from the sensitive attribute.</p>
<p><img src="/assets/blog/fnf/fnf_cont_results.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>Tradeoff between accuracy and fairness with FNF.</strong> For each dataset, we measure classification accuracy and statistical distance, and show that FNF achieves a favorable tradeoff between these quantities.</em></p>
<p>The above figure shows results on Law School, Crime and Health Heritage Prize datasets, with each point representing a single model with different fairness-accuracy tradeoff.
As a fairness metric, we measure statistical distance introduced earlier.
We can observe that for all datasets FNF can effectively balance fairness and accuracy.
In general, the drop in accuracy is steeper for datasets where the task label is more correlated with the sensitive attribute.
We provide more experimental results in our paper, including experiments with discrete datasets and comparison with prior work.</p>
<p><img src="/assets/blog/fnf/fnf_bound.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>Bounding adversarial accuracy.</strong> We show that FNF can reliably bound maximum accuracy that any adversary can have when predicting sensitive attribute from the encoded representations.</em></p>
<p>As mentioned earlier, FNF provides a provable upper bound on the maximum accuracy of any adversary trying to recover the sensitive attribute, computed for the estimated probability densities of the input distributions.
We show our upper bound on the adversarial accuracy computed from the statistical distance using the estimated densities (diagonal dashed line), together with the adversarial accuracies obtained by training an adversary, a multilayer perceptron (MLP) with two hidden layers of 50 neurons, for each model from the previous figure.
We can observe that FNF successfully bounds the accuracy of strong adversaries, even though the guarantees were computed on the estimated distributions.
In our paper, we also experiment with using FNF for other tasks such as algorithmic recourse and transfer learning.</p>
<h3 id="summary">Summary</h3>
<p>In this blog post, we introduced Fair Normalizing Flows (FNF), a new method for learning representations which ensure, at the cost of a small decrease in accuracy, that no adversary can predict the sensitive attributes.
Our key idea was to use an encoder based on normalizing flows which allows computing the exact likelihood in the latent space, given an estimate of the input density.
Our experimental evaluation on several datasets showed that FNF effectively enforces fairness without significantly sacrificing utility, while simultaneously allowing interpretation of the representations and transferring to unseen tasks.
For more details please see our <a href="https://arxiv.org/abs/2106.05937">ICLR 2022 paper</a>.</p>Fair Normalizing Flows (FNF) are a new approach for encoding data into a new representation in order to ensure fairness and utility in downstream tasks.The optimal privacy attack on federated learning2022-04-20T00:00:00+00:002022-04-20T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/bayesian<p>Federated learning has become the most widely used approach to collaboratively train machine learning models without requiring training data to leave devices of the individual users.
In this setting, clients compute updates on their own devices, send the updates to a central server which aggregates them and updates the global model. Because user data is not shared with the server or other users, this framework should, in principle, offer more privacy than simply uploading the data to a server.
There have been several works which showed that data can in fact be reconstructed from the gradient updates sent to the server.
In our work, we investigate the optimal attack for reconstructing data from gradients.</p>
<h3 id="privacy-in-federated-learning">Privacy in federated learning</h3>
<p>The goal of federated learning is to train a model $h_\theta$ through a collaborative procedure involving different clients, without data leaving the individual client devices.
Typically $h_\theta$ is a neural network with parameters $\theta$, classifying an input $x$ to a label $y$.
We assume that pairs $(x, y)$ are coming from a distribution $\mathcal{D}$.
In the standard federated learning setting, there are $n$ clients with loss functions $l_1, \dots, l_n$, who are trying to jointly solve the optimization problem and find parameters $\theta$ which minimize their average loss:</p>
\[\begin{equation*}
\min_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ l_i(h_\theta(x), y) \right].
\end{equation*}\]
<p>In a single training step, each client $i$ first computes $\nabla_{\theta} l_i(h_\theta(x_i), y_i)$ on a batch of data $(x_i, y_i)$, then sends these to the central server that performs a gradient descent step to obtain the new parameters $\theta' = \theta - \frac{\alpha}{n} \sum_{i=1}^n \nabla_{\theta} l_i(h_\theta(x_i), y_i)$, where $\alpha$ is a learning rate.
We will consider a scenario where each client reports, instead of the true gradient $\nabla_{\theta} l_i(h_\theta(x_i), y_i)$, a noisy gradient $g$ sampled from a distribution $p(g|x)$, which we call a defense mechanism.
This setup is fairly general and captures common defenses such as <a href="https://arxiv.org/abs/1607.00133">DP-SGD</a>.
In this post, we are interested in the privacy issue of federated learning: can the input $x$ be recovered from gradient update $g$?
More specifically, we are interested in analyzing the Bayes optimal attack and connecting it to the attacks from prior work.</p>
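<p>As a concrete, hypothetical instance of such a defense mechanism $p(g|x)$, a DP-SGD-style sketch that clips the gradient and then adds Gaussian noise might look as follows (function and parameter names are our own):</p>

```python
import math
import random

random.seed(0)

def dp_defense(grad, clip_norm=1.0, sigma=0.5):
    """A DP-SGD-style defense p(g|x): clip the gradient to L2 norm
    clip_norm, then add i.i.d. Gaussian noise with std sigma * clip_norm."""
    norm = math.sqrt(sum(v * v for v in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [v * scale for v in grad]
    return [v + random.gauss(0.0, sigma * clip_norm) for v in clipped]

g_true = [3.0, 4.0]              # true gradient, L2 norm 5
g_shared = dp_defense(g_true)    # the noisy gradient the client actually reports
print(g_shared)
```

<p>Because the server only ever sees samples from $p(g|x)$, any reconstruction attack has to reason about this distribution rather than about the exact gradient.</p>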
<h3 id="bayesian-framework-for-gradient-leakage">Bayesian framework for gradient leakage</h3>
<p>To measure how well an adversary can reconstruct user data, we introduce the notion of <em>adversarial risk</em> for gradient leakage, and then derive the Bayes optimal adversary that minimizes this risk.
The adversary can only observe the gradient $g$ and tries to reconstruct the input $x$ that produced $g$.
Formally, the adversary is a function $f: \mathbb{R}^k \rightarrow \mathcal{X}$ mapping gradients to inputs.
Given some $(x, g)$ sampled from the joint distribution $p(x, g)$, the adversary outputs the reconstruction $f(g)$ and incurs loss $\mathcal{L}(x, f(g))$, which is a function $\mathcal{L}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$.</p>
<p>The loss measures whether the adversary was able to reconstruct the original data.
Typically, we will consider a binary loss that evaluates to 0 if the adversary’s output is close to the original input, and 1 otherwise.
If the adversary only wants to get to some $\delta$-neighbourhood of the input $x$ in the input space, an appropriate definition of the loss is $\mathcal{L}(x, x') := 1_{||x - x'||_2 > \delta}$.
This definition is well suited for image data, where $\ell_2$ distance captures our perception of visual similarity.
We can now define the risk $R(f)$ of the adversary $f$ as</p>
\[\begin{equation}
R(f) := \mathbb{E}_{x, g} \left[ \mathcal{L}(x, f(g)) \right] = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{g \sim p(g|x)} \left[ \mathcal{L}(x, f(g)) \right].
\end{equation}\]
<p>We can then manipulate this expression and show that \(R(f) = 1 - \mathbb{E}_g \int_{B(f(g), \delta)} p(x|g) \,dx\).
This allows us to derive the optimal adversary $f$ which minimizes the adversarial risk $R(f)$:</p>
\[\begin{align}
f(g) &= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} p(x|g) \,dx \nonumber \\
&= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} \frac{p(g|x)p(x)}{p(g)} \nonumber \,dx \\
&= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \nonumber \\
&= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \left[ \log \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \right] \label{eq:optadv}
\end{align}\]
<p>While the above provides a formula for the optimal adversary in the form of an optimization problem, using this adversary for practical reconstruction is computationally difficult,
so we approximate it by applying Jensen’s inequality: \(\log C \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \geq C \int_{B(x_0, \delta)} (\log p(g|x) + \log p(x)) \,dx.\)</p>
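<p>To make this concrete, here is a toy sketch (not the attack from the paper) for a linear model with squared loss, where a Gaussian conditional $p(g|x)$ and a flat prior reduce the objective to gradient matching, i.e., minimizing $\lVert g - \nabla_\theta l(x') \rVert^2$ over candidate inputs $x'$:</p>

```python
def sim_grad(w, x, y):
    """Gradient of l = 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def reconstruct(g_obs, w, y, steps=4000, lr=0.05):
    """Recover an input by gradient descent on ||g_obs - sim_grad(w, x, y)||^2."""
    x = [0.1] * len(g_obs)                               # initialization
    for _ in range(steps):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        r = [gi - go for gi, go in zip(sim_grad(w, x, y), g_obs)]
        dot_rx = sum(ri * xi for ri, xi in zip(r, x))
        # hand-derived gradient of the matching objective w.r.t. x
        grad = [2 * r[i] * err + 2 * w[i] * dot_rx for i in range(len(x))]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

w, y = [0.5, 0.3], 1.0
x_true = [1.0, -2.0]                      # the client's private input
g_obs = sim_grad(w, x_true, y)            # the gradient shared with the server
x_rec = reconstruct(g_obs, w, y)
print([round(v, 3) for v in x_rec])       # an input reproducing g_obs
```

<p>Even for this tiny model, plain gradient descent on the matching objective finds an input whose gradient reproduces the shared update almost exactly.</p>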
<p><img src="/assets/blog/bayesian/bayes_attack.png" alt="" class="blogpost-img50" /></p>
<p class="blogpost-caption"><em><strong>Gradient leakage attack.</strong> Bayes optimal adversary randomly initializes the image and then optimizes for the image which has highest likelihood of being close to the original image which produced the gradient.</em></p>
<p>We can then approximate the integral by Monte Carlo sampling and optimize the objective using gradient descent.
As shown in the figure above, the adversary can randomly initialize the input and then optimize for the input with the highest likelihood of being close to the original input which produced the update gradient $g$.
One interesting consequence is that we can now recover attacks from prior work by using different priors $p(x)$ and conditionals $p(g|x)$, meaning that attacks from prior work are different approximations of the Bayes optimal adversary. For example, <a href="https://arxiv.org/abs/1906.08935">DLG</a> is recovered by using a uniform prior and a Gaussian conditional, <a href="https://arxiv.org/abs/2003.14053">another attack</a> by using a total variation prior and a cosine conditional, while <a href="https://arxiv.org/abs/2104.07586">GradInversion</a> uses a combination of total variation and DeepInversion priors with a Gaussian conditional.</p>
<h3 id="attacking-existing-defenses">Attacking existing defenses</h3>
<p>In this experiment, we evaluate the three recently proposed defenses <a href="https://openaccess.thecvf.com/content/CVPR2021/papers/Sun_Soteria_Provable_Defense_Against_Privacy_Leakage_in_Federated_Learning_From_CVPR_2021_paper.pdf">Soteria</a>, <a href="https://arxiv.org/abs/2011.12505">ATS</a> and <a href="https://arxiv.org/abs/2108.04725">PRECODE</a> against strong gradient leakage attacks.
While these defenses can protect privacy against weaker attackers (as shown in the respective papers), we show that they are not actually successful against strong attacks.
Below we show our reconstructions for each defense, evaluated on the CIFAR-10 dataset.</p>
<p><img src="/assets/blog/bayesian/img_reconstructions.png" alt="" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Reconstructions on defended networks.</strong> We attack models defended by Soteria, ATS and PRECODE using strong attacks. Our reconstructions are very close to the original images, showing that these defenses do not reliably protect privacy, especially early in training.</em></p>
<p>Each defense introduces a different $p(g|x)$ which we use to derive an approximation of Bayes optimal adversary.
Our results indicate that these defenses do not reliably protect privacy under gradient leakage, especially in the early stages of the training.
This suggests that creating effective defenses and strong evaluation methods remains a key challenge.
We provide full description of these defenses and our attacks in <a href="https://arxiv.org/abs/2111.04706">our paper</a>.</p>
<h3 id="comparing-different-attacks">Comparing different attacks</h3>
<p>In the next set of experiments, we compare the Bayes optimal attack with several other, non-optimal attacks.
We consider several different defenses all of which produce different distributions $p(g|x)$.
One defense adds Gaussian noise, another randomly prunes out some elements of the gradient and then adds Gaussian noise, and the third one adds Laplacian noise after pruning.
We measure the <a href="https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">PSNR</a> of the reconstructions, where a higher value means that the reconstruction is closer to the original image.</p>
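<p>For reference, PSNR compares the mean squared error of a reconstruction to the peak signal value; a small self-contained sketch, treating images as flat lists of pixel values in $[0, 1]$:</p>

```python
import math

def psnr(original, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / len(original)
    if mse == 0.0:
        return float("inf")           # perfect reconstruction
    return 10.0 * math.log10(max_val ** 2 / mse)

x  = [0.2, 0.8, 0.5, 0.1]             # toy "image"
x1 = [0.21, 0.79, 0.52, 0.12]         # close reconstruction
x2 = [0.9, 0.1, 0.2, 0.7]             # poor reconstruction
print(round(psnr(x, x1), 1), ">", round(psnr(x, x2), 1))
```
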
<p><img src="/assets/blog/bayesian/attacks_barplot.png" alt="" class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Bayes optimal attack compared to other attacks.</strong> We consider several different defenses with adding Gaussian or Laplacian noise or pruning and show that Bayes optimal attack is typically the best.</em></p>
<p>We observe that the Bayes optimal adversary generally performs best, showing that the optimal attack needs to leverage the structure of the gradient distribution induced by the defense.
Note that, in the case of the Gaussian defense, the $\ell_2$ and Bayes attacks are equivalent up to a constant factor, so it is expected that they achieve a similar result.
In all other cases, the Bayes optimal adversary outperforms the other attacks.
Overall, this experiment provides empirical support for our theory, confirming the practical utility of the Bayes optimal adversary.</p>
<h3 id="summary">Summary</h3>
<p>In this blog post, we considered the problem of privacy in federated learning and investigated the Bayes optimal adversary, which tries to reconstruct the original data from gradient updates.
We derived the form of this adversary and showed that attacks proposed in prior work are different approximations of this optimal adversary.
Experimentally, we showed that existing defenses do not protect against strong attackers, and that designing a good defense remains an open challenge.
Furthermore, we showed that the Bayes optimal adversary is stronger than other attacks when it can exploit the structure of the gradient distribution, confirming our theoretical results.
For more details, please check out our <a href="https://arxiv.org/abs/2111.04706">ICLR 2022 paper</a>.</p>We derive the Bayes optimal adversary for reconstructing user data from gradient updates in federated learning, and show that prior attacks are approximations of it.Boosting randomized smoothing with variance reduced classifiers2022-04-15T00:00:00+00:002022-04-15T00:00:00+00:00https://www.sri.inf.ethz.ch/blog/smoothens<p>Deep neural networks often achieve excellent accuracy on data $x$ from the distribution they were trained on. However, they have been shown to be very sensitive to slightly perturbed inputs $x + \delta$, called adversarial examples. This severely limits their applicability to safety-critical domains.
Further, heuristic defenses against this vulnerability have been shown to be breakable, highlighting the need for provable robustness guarantees.</p>
<p>A promising method providing such guarantees for large networks is <a href="https://arxiv.org/abs/1902.02918">Randomized Smoothing</a> (RS). The core idea is to obtain probabilistic robustness guarantees with arbitrarily high confidence by adding noise to the input of a base classifier and computing the majority vote of the classification over a large number of perturbed inputs using Monte Carlo sampling.</p>
<p>In this blogpost, we consider <em>applying RS to ensembles</em> as base classifiers and explain why they are a particularly suitable choice. For this, we will first give a short recap on Randomized Smoothing, before explaining our approach and discussing our theoretical results. Finally, we show that our approach yields a new state-of-the-art in most settings, often even while using less compute than current methods.</p>
<h3 id="background-randomized-smoothing-rs">Background: Randomized Smoothing (RS)</h3>
<p>We consider a (soft) base classifier $f \colon \mathbb{R}^d \mapsto \mathbb{R}^{n}$ predicting a numerical score for each class and let \(F(x) := \text{arg max}_{i} \, f_{i}(x)\) denote the corresponding hard classifier $\mathbb{R}^d \mapsto [1, \dots, n]$. Randomized Smoothing (RS) takes such a base classifier, evaluates it on a large number of slightly perturbed versions of an input, and then predicts the majority classification over the resulting predictions. The bigger the difference between the probability of the most likely and second most likely class, the more robust the resulting smoothed classifier.</p>
<p>Formally, we write \(G(x) := \text{arg max}_c \, \mathcal{P}_{\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon}^2 I)}(F(x + \epsilon) = c)\) for the smoothed classifier.
This smoothed classifier is guaranteed to be robust, i.e., predict $G(x + \delta) = c_A$, under all perturbations $\delta$ satisfying $\lVert \delta \rVert_2 < R$ with $R := \sigma_{\epsilon}\Phi^{-1}(\underline{p_A})$, where $c_A$ is the majority class, $\underline{p_A}$ the lower bound to its success probability $\mathcal{P}_{\epsilon}(F(x + \epsilon) = c_A) \geq \underline{p_A}$ and $\Phi^{-1}$ the inverse <a href="https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_functions">Gaussian CDF</a>. As $\underline{p_A}$ increases, so does $R$.</p>
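<p>The certificate itself is a one-liner once $\underline{p_A}$ is known; in practice $\underline{p_A}$ is a high-confidence lower bound obtained from Monte Carlo samples (e.g., via a Clopper–Pearson interval), an estimation step this sketch omits:</p>

```python
from statistics import NormalDist

def certified_radius(p_a_lower: float, sigma: float) -> float:
    """R = sigma * Phi^{-1}(p_A_lower); only valid when p_A_lower > 1/2."""
    if p_a_lower <= 0.5:
        return 0.0                    # abstain: no robustness certificate
    return sigma * NormalDist().inv_cdf(p_a_lower)

print(round(certified_radius(0.99, 0.5), 3))   # a larger p_A gives a larger radius
print(round(certified_radius(0.70, 0.5), 3))
```
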
<h3 id="ensembles">Ensembles</h3>
<p>Instead of a single model $f$, we now consider a soft ensemble of $k$ models $\{ f^l \}_{l=1}^k$:</p>
\[\bar{f}(x) = \frac{1}{k} \sum_{l=1}^{k} f^l(x)\]
<p>We obtain different models $f^l$ by varying only the random seed for training.</p>
<h3 id="variance-reduction-via-ensembles-for-randomized-smoothing">Variance Reduction via Ensembles for Randomized Smoothing</h3>
<p>Now, we will show theoretically why ensembles are particularly suitable base models.
As shown in the illustration below, ensembling reduces the prediction’s variance over the noise introduced in RS, leading to a larger certification radius $R$.</p>
<p><img src="/assets/blog/smooth_ens/main.png" alt="Prediciton landscape of two individual models and an ensemble." class="blogpost-img100" /></p>
<p class="blogpost-caption"><em><strong>Illustration of the prediction landscape</strong> of base models $f$ where colors represent classes. The bars show the class probabilities of the corresponding smoothed classifiers. The individual models (left, middle) predict the same class for $x$ as their ensemble (right). However, the ensemble’s lower bound on the majority class’ probability $\underline{p_A}$ is increased, leading to improved robustness through RS.</em></p>
<p>Formally, we introduce a random variable $z_i$ for the logit difference between the majority class $c_A$ and each other class $c_i$. We call this the &ldquo;classification margin&rdquo; and can compute its variance as a function of the number $k$ of ensembled classifiers, $\sigma^2(k)$, normalized by that of a single classifier, $\sigma^2(1)$:</p>
\[\frac{\sigma^2(k)}{\sigma^2(1)} = \frac{1 + \zeta (k-1)}{k} \xrightarrow{k \to \infty} \zeta.\]
<p>Here, $\zeta$ is a small constant describing the degree of correlation between the ensembled classifiers. We observe that for weakly correlated classifiers, the variance is significantly reduced.
Using Chebyshev&rsquo;s inequality, we can translate this reduction in variance into an increase in the lower bound on the success probability of the majority class $c_A$:</p>
\[\underline{p_{A}} \geq 1 - \sum_{i \neq A} \frac{\sigma_i(k)^2}{\bar{z}_i^{\,2}}.\]
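<p>The variance-ratio formula can be checked numerically. The sketch below (our own illustration, under the assumption of equicorrelated Gaussian margins) constructs $k$ unit-variance margins with pairwise correlation $\zeta$ from a shared and an independent component, and compares the empirical variance of their mean against $(1 + \zeta(k-1))/k$:</p>

```python
import numpy as np

def margin_variance_ratio(k: int, zeta: float) -> float:
    """Theoretical ratio sigma^2(k) / sigma^2(1) = (1 + zeta * (k - 1)) / k."""
    return (1.0 + zeta * (k - 1)) / k

def empirical_ratio(k: int, zeta: float, n: int = 200_000, seed: int = 0) -> float:
    """Variance of the mean of k equicorrelated unit-variance Gaussians."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n, 1))   # common component -> correlation zeta
    indep = rng.standard_normal((n, k))    # per-model component
    z = np.sqrt(zeta) * shared + np.sqrt(1 - zeta) * indep  # unit variance, corr zeta
    return z.mean(axis=1).var()

k, zeta = 10, 0.3
assert abs(empirical_ratio(k, zeta) - margin_variance_ratio(k, zeta)) < 0.01
```

<p>For $k = 10$ weakly correlated models ($\zeta = 0.3$), the margin variance drops to roughly $37\%$ of a single model&rsquo;s, and by Chebyshev&rsquo;s inequality the lower bound $\underline{p_A}$ improves accordingly.</p>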
<p>We see that the success probability goes towards 1 quadratically as the variance is reduced. Assuming Gaussian margin distributions and estimating their parameters from a real ResNet20, we obtain the following distribution over the classification margin to the runner-up class $c_i$:</p>
<p><img src="/assets/blog/smooth_ens/runner_up_margin.png" alt="Illustration of classification margin variance reduction with increased number of ensembled classifiers" class="blogpost-img30" /></p>
<p>Here, the success probability $p_A$ of the model corresponds directly to the portion of the area under the curve (the probability mass) to the right of the black line. While we see that the mean classification margin remains unchanged, this portion and thus the success probability increase significantly as we ensemble more models.</p>
<p><strong>Certified Radius</strong> Having computed the success probability as a function of the number $k$ of ensembled models, we can now derive the probability distribution over the $\ell_2$-radius we can certify using RS.</p>
<p><img src="/assets/blog/smooth_ens/cert_rad_distr.png" alt="Illustration of increase in certifiable radius number of ensembled classifiers" class="blogpost-img30" /></p>
<p>As we increase the number of models we ensemble, the whole distribution shifts to larger radii. In contrast, simply increasing the number of samples used for the Monte Carlo estimation mostly concentrates the distribution and yields a much smaller increase in certified radius.</p>
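<p>The contrast between more Monte Carlo samples and more ensembled models can be sketched as follows. Standard RS implementations use a Clopper&ndash;Pearson confidence bound on $p_A$; here we substitute a looser Hoeffding bound to stay dependency-free, and the function name and parameters are illustrative, not from the paper:</p>

```python
from math import log, sqrt
from statistics import NormalDist

def radius_lower_bound(p_true: float, n: int, sigma: float, alpha: float = 0.001) -> float:
    """Certified radius from n Monte Carlo samples of a smoothed classifier.

    Uses a Hoeffding lower confidence bound on p_A (standard RS code uses
    Clopper-Pearson; Hoeffding is a looser stand-in), assuming the empirical
    estimate equals the true success probability p_true.
    """
    p_lower = p_true - sqrt(log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

sigma = 0.5
# More samples: the bound merely saturates towards sigma * Phi^{-1}(p_true)
r_10k = radius_lower_bound(0.90, n=10_000, sigma=sigma)
r_1m = radius_lower_bound(0.90, n=1_000_000, sigma=sigma)
# A variance-reduced ensemble raises p_true itself, moving the cap
r_ens = radius_lower_bound(0.97, n=10_000, sigma=sigma)
assert r_ens - r_10k > r_1m - r_10k
```

<p>Taking 100x more samples only tightens the confidence bound towards a fixed ceiling, whereas ensembling raises the underlying success probability and hence the ceiling itself.</p>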
<p>For a deeper dive and a full explanation and validation of all our assumptions, please check out our <a href="https://www.sri.inf.ethz.ch/publications/horvath2022boosting">ICLR’22 Spotlight paper</a>.</p>
<!-- > **TLDR**: <a name="tldrensemble"></a> Ensembling $k$ classifiers, differing only in the random seed used for training, yields a classifier with significantly reduced variance over random perturbations. Without (necessarily) changing the natural accuracy, this increases the certified radius and thereby certified accuracy significantly, even when correcting for the increased compute. -->
<h3 id="experimental-evaluation">Experimental Evaluation</h3>
<p>We conduct experiments on ImageNet and CIFAR-10 using a wide range of training methods, network architectures, and noise magnitudes and consistently observe that ensembles outperform their best constituting models.</p>
<p><strong>CIFAR-10</strong></p>
<p><img src="/assets/blog/smooth_ens/acr_cifar_050_blog.png" alt="Illustration of increase in ACR with number of ensembled classifiers" class="blogpost-img25" /></p>
<p>An ensemble of up to ten ResNet110&rsquo;s clearly outperforms the best constituting model (currently SOTA). Even an ensemble of just three ResNet20&rsquo;s outperforms a single ResNet110, despite requiring significantly less compute for training and inference.</p>
<p><img src="/assets/blog/smooth_ens/sample_experiment_cifar.png" alt="Comparison of certified radius of using more samples instead of ensembling." class="blogpost-img25" /></p>
<p>Using more samples with just a single network barely improves the certified radius, unless additional samples are mathematically necessary to reach a sufficiently high confidence level. Note that this is only the case for very large radii (here 2.0) and, in contrast to our method, does not actually make the model more robust.</p>
<p><strong>ImageNet</strong></p>
<p><img src="/assets/blog/smooth_ens/acr_in_100_blog.png" alt="Illustration of increase in ACR with number of ensembled classifiers" class="blogpost-img25" /></p>
<p class="blogpost-wrap">On ImageNet, an ensemble of just three ResNet50’s improves over the current state-of-the-art by more than 10%.</p>
<h3 id="summary">Summary</h3>
<p>We propose a theoretically motivated and statistically sound approach to constructing low-variance base classifiers for Randomized Smoothing via ensembling. We show theoretically and empirically why this approach significantly increases certified accuracy, yielding state-of-the-art results.
If you are interested in more details, please check out our <a href="https://www.sri.inf.ethz.ch/publications/horvath2022boosting">ICLR 2022 paper</a>.</p>We theoretically motivate and empirically show that ensembles are particularly suitable base models for Randomized Smoothing, due to the variance reduction across the perturbations introduced during smoothing.