SRI Lab | Blogposts

Probing Google DeepMind’s SynthID-Text Watermark

2024-12-20T00:00:00+00:00


SynthID-Text is the first publicized large-scale deployment of an LLM watermarking algorithm. It was recently deployed in Gemini App and Web and open-sourced following a Nature publication. While this is a major milestone for watermarking research, the behavior of SynthID-Text in adversarial scenarios remains largely unexplored. In this blog post, we aim to fill this gap by providing a more thorough evaluation, directly leveraging recent work from our group. In the following 4 sections, we investigate if the presence of SynthID-Text can be detected, whether stealing attacks can enable watermark spoofing and removal, and further analyze those spoofing attempts. In each section, we highlight interesting questions that could be explored in future research.

1. The presence of SynthID-Text can be detected

In “Black-Box Detection of Language Model Watermarks” we showed that hiding the fact that an LLM watermark is deployed is not feasible, as watermark presence can be cheaply detected using only black-box queries, for all 3 of the most popular watermarking scheme families. Extending the results on Gemini 1.0 from the paper, we found no reliable evidence of a watermark on the Gemini 1.5 API. This matches the official claims, which state that the watermark is only present in the Gemini App and Web. As those deployments are not suitable for querying with thousands of similar prompts, we ran our tests on a local deployment of a model watermarked with SynthID-Text. While the first two (as expected) fail, the Red-Green test consistently passes, detecting the watermark presence ($p \approx 0$). This also shows that our tests can be directly applied to newly proposed schemes. To understand why the Red-Green test passes, let’s decompose SynthID into its building blocks as follows:

SynthID-Text = LeftHash h=3 + Increased context size + Tournament sampling + Caching

From the perspective of a Red-Green scheme, SynthID starts from the LeftHash h=3 variant, increases its context size, and uses tournament sampling to effectively generalize the boosting of green tokens to variable logit biases. These biases are still consistent for a fixed preceding context, which is the key property our detection test relies on, so the test remains effective. The next change is caching, i.e., “Repeated context masking” in the original paper. As in the default SynthID-Text implementation, we set $K=1$ to achieve single-sequence non-distortion. Such caching does not fundamentally affect our test, but introduces a new constraint: instead of being upper-bounded, the context size needs to be estimated correctly, as an overestimation would trigger the cache, preventing our queries from extracting any information. Increasing $K$ would make detection harder but, as discussed in the SynthID-Text supplementary text (G.3), may reduce the watermark effectiveness and increase computational complexity.
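The consistency property mentioned above can be illustrated with a toy sketch. It assumes a simplified Red-Green scheme in which a keyed hash of the previous `h` tokens seeds the green/red split of the vocabulary. The function name `green_tokens`, the key, the hash construction, and the green fraction `gamma` are all hypothetical choices for illustration, not the ones used by SynthID-Text.

```python
import hashlib
import random
from typing import Sequence, Set


def green_tokens(context: Sequence[int], vocab_size: int, h: int = 3,
                 gamma: float = 0.5, key: str = "hypothetical-key") -> Set[int]:
    """Return the green token ids determined by the last h tokens of `context`."""
    # Only the last h tokens matter: this is the "context size" of the scheme.
    window = ",".join(str(t) for t in context[-h:])
    digest = hashlib.sha256(f"{key}|{window}".encode()).digest()
    # Seed a private RNG with the keyed hash so the split is reproducible.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


# Two prompts that differ only outside the last 3 tokens share a green list.
same_a = green_tokens([7, 1, 2, 3], vocab_size=50)
same_b = green_tokens([9, 1, 2, 3], vocab_size=50)
```

Because the split depends only on the last `h` tokens, repeated queries with the same preceding context see the same bias, which is the kind of regularity a black-box test can pick up.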

Summary: Despite several novel building blocks, the presence of SynthID-Text can be detected in a black-box way using the test for Red-Green schemes, as long as context size estimation succeeds.

Future Work: Can SynthID-Text be tweaked to evade detection (e.g., by increasing $K$) without significantly sacrificing its effectiveness? Are viable undetectable schemes possible?

2. SynthID-Text is hard to spoof

In “Watermark Stealing in Large Language Models”, we showed that practical spoofing attacks are possible on SOTA schemes: an attacker can use a set of black-box queries to generate a corpus of watermarked text, and use it to learn to forge the watermark, creating arbitrarily many high-quality watermarked texts. Malicious users could, for instance, generate harmful content and attribute it to the model provider. As spoofing risk was not studied for SynthID-Text, we directly applied our stealing attack to test it, without attempting to optimize the algorithm specifically for this case. Our Spoofing Success metric is FPR*@1e-3: the ratio of spoofing attempts that are both high-quality (rated by GPT4) and fool the watermark detector at a realistic false positive rate of $10^{-3}$.


The Role of Red Teaming in PETs

2023-05-19T00:00:00+00:00

In February, our team won the Red Teaming category of the U.S. PETs Prize Challenge, securing a prize of 60,000 USD. In this blog post, we will provide a brief overview of the significance of Red Teaming in the field of Privacy Enhancing Technologies (PETs) research in the context of the competition. By outlining our methodology and highlighting the comprehensive objectives of a red team, we intend to showcase the essential role of Red Teaming in ensuring the development of robust and privacy-centric implementations grounded in solid theoretical foundations. Please note that due to a non-disclosure agreement (NDA), we can only discuss our general approach and are unable to delve into specific details.

The Challenge

The U.S. PETs Prize Challenge is a contest aimed at discovering innovative solutions to two critical issues where privacy plays a crucial role.

The first challenge, Financial Crime Prevention, focuses on enhancing collaboration between banks and SWIFT in detecting fraudulent transactions without disclosing private information among the parties involved. The proposed solutions encompass a range of advanced and highly tailored cryptographic protocols designed to jointly flag fraudulent transactions by the participants.

The second challenge, Pandemic Response, seeks to improve the prediction of infection probabilities for individuals, a highly relevant issue that raises numerous privacy concerns when data needs to be shared across governments or jurisdictions. The solutions presented in this challenge employ a wide array of techniques, ranging from slightly modified standard federated learning setups to highly tailored probabilistic models for pandemic forecasting.

Red Teaming In The Challenge

As a Red Team, we were tasked with evaluating various solutions from each track within a limited timeframe. To effectively analyze and attack all of them, we adopted a multi-step plan illustrated in the figure below:

First, we dedicated a substantial amount of time to understanding each task and solution individually, extracting several key pieces of information relevant to Red Teaming. In particular, we investigated the privacy and utility requirements of the tasks and the data supplied, as well as the parties involved and their interactions. Further, we examined the privacy assumptions and claims of the proposed solutions in detail, as well as their proposed theoretical and software techniques. Although some aspects are closely linked to privacy, it’s worth noting that we did not solely concentrate on that aspect of these components. We believe that Red Teams should adopt a more comprehensive perspective on the entire problem since all solution components are vital to the end-to-end process. Moreover, several submission issues, even when not directly related to privacy, can emerge naturally during a thorough analysis of the solution and may give rise to additional privacy concerns.

Secondly, we leveraged the gained in-depth understanding of the tasks and solutions to identify potential attack vectors. These are issues discovered within the given solution that could potentially lead to privacy concerns or related problems. It is worth noting that some of these attack vectors are fairly generic, such as our newest attack TabLeak that will be presented at ICML’23, and, thus, they should be incorporated as part of a proper “standard” evaluation of a given solution. We consider testing the success of generic attacks a critical subcomponent of an effective Red Teaming report. We provide several examples of attack vectors in the figure above, but it is important to note that most of these stem from discrepancies between the components identified in step 1 and/or an unsoundness in one of them. In fact, we were surprised to find how many solutions were already flawed due to misunderstandings regarding the precise requirements or the interactions between parties.

In Step 3, we start the process of attacking the given solutions, identifying four primary categories of uncovered issues that should be included in a Red Team report. Privacy breaches are the most critical set of issues but not the sole focus of our Red Teaming Report. We argue that a comprehensive analysis of the Privacy-Utility Trade-Off is equally vital to include in a report, as trivially private solutions are easily obtainable if utility is not required. Indeed, some of our reports were centered on this trade-off, as we demonstrated that while the implemented solutions were (relatively) secure, they either behaved so similarly to random models that this was anticipated, or that simplifications of the proposed mechanisms resulted in a superior privacy-utility trade-off. In addition to these essential components, we also incorporate the conceptual and theoretical flaws identified in the second step even when they didn’t directly lead to privacy attacks. We observed that addressing these flaws could either enhance the performance of the given solution or could lead to undesirable consequences for privacy.

Finally, we consolidate all uncovered issues into a single report including recommendations to correct the vulnerabilities discovered in the solution. This report should be precise and provide actionable suggestions for implementing patches to address the identified issues, or in cases where the privacy issue is inherent to the solution, recommend against using the system altogether in practice. It is also crucial to acknowledge when a solution is simply effective: the primary objective of a Red Team should not be to dismantle a system, but rather to rigorously evaluate it under stressful conditions and pinpoint problems. If breaking the system becomes the sole focus of the Red Team, we encounter the same issue as to why Blue Teams cannot conduct this analysis independently: this bias in the report skews the results, with less regard for the origins of the numbers presented.

Value of Red Teaming

The above explanation should offer a clear understanding of why Red Teaming for PETs is crucial, but we would like to emphasize this point further. A Red Teaming report provides a comprehensive evaluation of the solution across various dimensions, with privacy being the primary focus. A Red Team can objectively assess the solution’s performance, which might be more challenging for Blue Teams who directly benefit from a successful solution. Moreover, the complexity of a Red Team’s task is often inherently more difficult than that of the Blue Team due to the interactions between all the critical components identified earlier. Finally, it is worth noting that we managed to significantly compromise all the systems we evaluated, demonstrating that even when solutions are deemed good enough to progress to the final phase of a prestigious PETs challenge, issues can still arise and persist.

LAMP: Extracting Text from Gradients with Language Model Priors

2022-11-28T00:00:00+00:00

Data privacy and Federated learning

Machine learning algorithms have set the state-of-the-art on most tasks where large amounts of training data are available. While the improvements brought by these algorithms are impressive, their applications to settings where private data is used remain limited due to the privacy concerns posed by the large centralized datasets required by the training procedures. Recently, the federated learning framework has emerged as a promising alternative to collecting and training with centralized datasets. In federated learning, multiple data owners (clients) collaboratively optimize a loss function $l$ w.r.t. the parameters $\theta$ of a global model $h$ on their own dataset $\mathcal{D}_i$ without sharing the data in $\mathcal{D}_i$ with the other participants:

\[\begin{equation*} \min_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x, y) \sim \mathcal{D}_i} \left[ l(h_\theta(x), y) \right]. \end{equation*}\]

To this end, the optimization is carried out in communication rounds. In particular, given the global parameters $\theta_t$ at round $t$, multiple clients compute model updates $g$ on their own data and then share them with a central server that aggregates them and produces new global parameters $\theta_{t+1}$. After several communication rounds the model parameters converge to an optimum. One common implementation of this generic framework is the FedSGD algorithm, where updates are given by the gradient of $l$ w.r.t. $\theta_t$ on a single batch $\{(x^b_i, y^b_i)\}\sim \mathcal{D}_i$ of client data of size $B$:

\[\begin{equation*} g(\theta_t,\mathcal{D}_i) = \frac{1}{B} \sum_{b=1}^B \nabla_\theta \left[ l(h_{\theta_t}(x^b_i), y^b_i) \right]. \end{equation*}\]
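As a minimal sketch of one FedSGD round, the snippet below assumes a linear model $h_\theta(x) = \theta^\top x$ with squared loss, and a server that averages the client updates and takes a gradient step with learning rate `lr`. The model, the loss, the aggregation rule, and all function names are illustrative assumptions and are not prescribed by the text above.

```python
from typing import List, Tuple

Vector = List[float]
Batch = List[Tuple[Vector, float]]


def fedsgd_update(theta: Vector, batch: Batch) -> Vector:
    """Client side: average gradient of the squared loss over one batch."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        # Residual of the linear model h_theta(x) = theta . x
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            # d/d theta_j of (theta . x - y)^2 is 2 * residual * x_j
            grad[j] += 2.0 * residual * xj / len(batch)
    return grad


def server_step(theta: Vector, updates: List[Vector], lr: float = 0.1) -> Vector:
    """Server side: average the client updates and take one gradient step."""
    avg = [sum(u[j] for u in updates) / len(updates) for j in range(len(theta))]
    return [t - lr * a for t, a in zip(theta, avg)]


theta_t = [0.0, 0.0]
g_client = fedsgd_update(theta_t, [([1.0, 0.0], 1.0)])
theta_next = server_step(theta_t, [g_client])
```

Note that the server sees `g_client` in the clear, which is exactly the quantity a gradient leakage attack starts from.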

Federated learning in theory allows for improved data privacy, as the client data does not leave the individual clients. Unfortunately, several recent works have shown that updates $g$ computed by common federated algorithms such as FedSGD can be used by a malicious server during the aggregation phase to approximately reconstruct the client’s data. So far, prior work has focused on exposing this issue in the image domain where strong image priors help the reconstruction. In this work, we show that such approximate reconstruction is also possible in the text domain, where federated learning is very commonly applied.

Gradient leakage

In order to obtain approximate input reconstructions $\{\tilde{x}^b_i\}$ from the FedSGD update of some client $i$, with updates as described above, prior works typically solve the following optimization problem at some communication round $t$:

\[\begin{equation} \min_{\{\tilde{x}^b_i\}} \sum_{i=1}^n \mathcal{L}_{rec}\left( \left(\frac{1}{B} \sum_{b=1}^B \nabla_\theta l(h_\theta(\tilde{x}^b_i), y^b_i)\right), g(\theta_t, \mathcal{D}_i) \right) + \alpha_{rec}\,R(\{\tilde{x}^b_i\}), \end{equation}\]

where $\mathcal{L}_{rec}$ is a distance metric (e.g., $L_1$, $L_2$, or cosine) that measures the gradient reconstruction error, $R(\{\tilde{x}^b_i\})$ is a domain-specific prior (e.g., Total Variation (TV) in the image domain) that assesses the quality of the reconstructed inputs, and $\alpha_{rec}$ is a hyperparameter balancing the two. Note that $\theta_t$ and $g(\theta_t, \mathcal{D}_i)$ are known by the malicious server as the former is computed by it and the latter is sent to it by client $i$ at the end of the round. Often the batch labels $\{y^b_i\}$ can be obtained by the server using specific label reconstruction attacks, which are beyond the scope of this blog post, or simply guessed by running the reconstruction with all possible labels, which is feasible due to their discrete nature. Throughout the post we therefore focus only on reconstructing $\{\tilde{x}^b_i\}$. In our previous blog post, we have shown that solving the optimization problem above is equivalent to finding the Bayesian optimal adversary in this setting.
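To make the objective above concrete, here is a small sketch of candidate choices for the gradient distance and of how it is combined with the prior. It assumes gradients are flattened into plain lists of floats, that the $L_2$ variant is the squared Euclidean distance, and that the cosine variant is one minus the cosine similarity. These conventions and the function names are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def l1_distance(g_hat: Vector, g: Vector) -> float:
    """Sum of absolute differences between two flattened gradients."""
    return sum(abs(a - b) for a, b in zip(g_hat, g))


def l2_distance(g_hat: Vector, g: Vector) -> float:
    """Squared Euclidean distance between two flattened gradients."""
    return sum((a - b) ** 2 for a, b in zip(g_hat, g))


def cosine_distance(g_hat: Vector, g: Vector) -> float:
    """One minus the cosine similarity; 0 when the gradients are aligned."""
    dot = sum(a * b for a, b in zip(g_hat, g))
    norm = math.sqrt(sum(a * a for a in g_hat)) * math.sqrt(sum(b * b for b in g))
    return 1.0 - dot / norm


def reconstruction_objective(rec_error: float, prior: float, alpha_rec: float) -> float:
    """Gradient reconstruction error plus the weighted prior term."""
    return rec_error + alpha_rec * prior
```

In an actual attack, `g_hat` would be recomputed from the candidate inputs at every optimization step, and the objective would be minimized with respect to those inputs.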

In the image domain, the optimization problem is typically solved using gradient descent on the batch of randomly initialized images $\{x^b_i\}$ using an image-specific prior $R$. In the next section, we first discuss why such a solution is not well suited to language data and we then discuss our method, LAMP, that combines a text-specific prior with a new way to solve the optimization problem above by alternating discrete and continuous optimization steps to obtain our state-of-the-art gradient leakage framework for text.

LAMP: Gradient leakage for text

In this work, we focus our attention on transformer-based models $h_\theta$, as they are the state-of-the-art for modeling text across various language tasks. As these models operate on continuous vectors, they typically assume a fixed-size vocabulary of size $V$ and embed each word into a different $\mathbb{R}^d$ vector. For a sequence of $n$ words, we refer to the individual words in it with $t_1,\ldots,t_n$ and to their corresponding embeddings with $x_1,\ldots,x_n$.

In order to solve the gradient leakage optimization problem from the previous section, we choose to optimize directly over the embeddings $x_i$ as they, similarly to images, are represented by continuous values we can optimize over. However, uniquely to the text domain, only a finite subset of vectors in $\mathbb{R}^d$ are valid word embeddings. Therefore, once we obtain our reconstructed embeddings $\tilde{x}_i$, we map each of them to the vocabulary token whose embedding is most similar in terms of cosine similarity, creating a reconstruction of the sequence of words $\tilde{t}_1,\ldots,\tilde{t}_n$.
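The projection from a reconstructed embedding to a token can be sketched as a nearest-neighbor search under cosine similarity. The toy two-dimensional vocabulary and the function names below are hypothetical; a real model would use its own embedding matrix.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def project_to_tokens(embeddings: List[Vector], vocab: Dict[str, Vector]) -> List[str]:
    """Map each reconstructed embedding to the most similar vocabulary token."""
    return [
        max(vocab, key=lambda tok: cosine_similarity(emb, vocab[tok]))
        for emb in embeddings
    ]


# Toy vocabulary with made-up embeddings, for illustration only.
toy_vocab = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "bird": [-1.0, 0.0]}
tokens = project_to_tokens([[0.9, 0.2], [0.1, 3.0]], toy_vocab)
```

Because cosine similarity ignores vector length, the second embedding is still mapped to "dog" even though it is three times longer than the vocabulary entry.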

An additional issue that is specific to the text domain and, in particular, the transformer architecture is that the transformer outputs depend on word order only through positional embeddings. Therefore, the gradient reconstruction loss $\mathcal{L}_{rec}$ is not as affected by wrongly reconstructed word order as it is by wrongly reconstructed word embeddings themselves. In practice, this results in the continuous optimization process often getting stuck in local minima, where an embedding reconstructs the correct word at the wrong position. These local minima are hard to escape with continuous optimization alone. To this end, we introduce a discrete optimization step that periodically reorders the sentence, allowing the optimization to escape such local minima. The discrete step works by first proposing several word order changes, such as swapping the positions of two words or moving a sentence prefix to the end of the sentence. The different order changes are then assessed based on a combination of the gradient reconstruction loss $\mathcal{L}_{rec}$ and the perplexity of the sentence $\mathcal{L}_{lm}$, computed by an auxiliary language model such as GPT-2 on the projected words $\tilde{t}_i$:

\[\begin{equation} \mathcal{L}_{rec}(\{\tilde{x}_i\}) + \alpha_{lm}\,\mathcal{L}_{lm}(\{\tilde{t}_i\}), \end{equation}\]

where $\alpha_{lm}$ is a hyperparameter balancing the two parts. The resulting end-to-end alternating optimization is demonstrated in the image below: where green boxes show the discrete optimization steps and the blue boxes demonstrate the continuous gradient descent optimization steps of the gradient leakage objective presented in the previous section.
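The discrete reordering step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it operates on a single list of tokens, takes the two loss terms as caller-supplied functions (`rec_loss` and `lm_loss` are hypothetical stand-ins for the gradient reconstruction loss and the language model perplexity), and picks proposals uniformly at random.

```python
import random
from typing import Callable, List, Sequence

Loss = Callable[[Sequence[str]], float]


def propose_orders(tokens: Sequence[str], num_candidates: int,
                   rng: random.Random) -> List[List[str]]:
    """Propose reorderings: random swaps and prefix-to-end moves."""
    candidates = [list(tokens)]  # always keep the current order as an option
    n = len(tokens)
    if n < 2:
        return candidates
    for _ in range(num_candidates):
        cand = list(tokens)
        if rng.random() < 0.5:
            i, j = rng.sample(range(n), 2)  # swap two positions
            cand[i], cand[j] = cand[j], cand[i]
        else:
            k = rng.randrange(1, n)  # move the prefix of length k to the end
            cand = cand[k:] + cand[:k]
        candidates.append(cand)
    return candidates


def discrete_step(tokens: Sequence[str], rec_loss: Loss, lm_loss: Loss,
                  alpha_lm: float, num_candidates: int = 20, seed: int = 0) -> List[str]:
    """Return the candidate order with the lowest combined score."""
    rng = random.Random(seed)
    return min(propose_orders(tokens, num_candidates, rng),
               key=lambda c: rec_loss(c) + alpha_lm * lm_loss(c))
```

Since the current order is always among the candidates, a step in this sketch can never increase the combined score.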

Finally, similarly to the image domain, we introduce a new prior specific to text that improves our reconstruction. We made the empirical observation that during optimization the embedding vectors $x_i$ often grow in length even when their direction does not change much. To counteract this, we regularize the average vector length of the embeddings in a sequence $\tilde{x}_i$ to be close to the average embedding length in the vocabulary $l_e$:

\[\begin{equation} R(\tilde{x}_i) = \left(\frac{1}{n}\sum_{i=1}^n \| \tilde{x}_i \|_2 - l_e\right)^2 \end{equation}\]

This allows our embeddings to remain in the correct range of values, which in turn results in a more stable and accurate reconstruction of the embeddings $\tilde{x}_i$.
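The regularizer above translates directly into a few lines of code. The sketch below assumes embeddings are plain lists of floats; the helper that computes $l_e$ from a vocabulary is included for completeness, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def average_length(embeddings: List[Vector]) -> float:
    """Mean Euclidean length of a set of embeddings (used for l_e)."""
    return sum(l2_norm(e) for e in embeddings) / len(embeddings)


def length_prior(reconstructed: List[Vector], l_e: float) -> float:
    """Squared gap between the mean reconstructed length and l_e."""
    return (average_length(reconstructed) - l_e) ** 2


# Both toy embeddings have length 5, so the prior is (5 - 4)^2.
prior_value = length_prior([[3.0, 4.0], [0.0, 5.0]], l_e=4.0)
```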

Experimental evaluation

We evaluated LAMP on several standard sentiment classification datasets and architectures based on the BERT language models. As is typically the case with language models, we assume our models are pretrained to make word predictions on large text corpora and that federated learning is used only to fine-tune the models on the classification task at hand. We consider two versions of LAMP: one where $\mathcal{L}_{rec}$ is a weighted sum of $L_1$ and $L_2$ distances (denoted $\text{LAMP}_{\text{L1}+\text{L2}}$), and another one where $\mathcal{L}_{rec}$ is based on the cosine similarity (denoted $\text{LAMP}_{\cos}$). We compare them to the state-of-the-art attacks: TAG, based on the same $L_1$+$L_2$ distance, and DLG, based on the $L_2$ distance alone. We evaluate the methods in terms of the Rouge-1 metric (R1), which measures the percentage of correctly reconstructed words, and the Rouge-2 metric (R2), which measures the percentage of correctly reconstructed bigrams. We note that one can interpret R2 as a proxy measurement of how well the order of the sentence has been reconstructed. We present a subset of the results shown in our paper on the CoLA dataset and batch size of 1 below:

	$\text{TinyBERT}_6$		$\text{BERT}_{BASE}$		$\text{BERT}_{LARGE}$
Method	R1	R2	R1	R2	R1	R2
DLG	37.7	3.0	59.3	7.7	82.7	10.5
TAG	43.9	3.8	78.9	10.2	82.9	14.6
$\text{LAMP}_{\cos}$	93.9	59.3	89.6	51.9	92.0	56.0
$\text{LAMP}_{\text{L1}+\text{L2}}$	94.5	52.1	87.5	47.5	91.2	47.8

We see that $\text{LAMP}_{\cos}$ is consistently recovering more words compared to the alternatives with $\text{LAMP}_{\text{L1}+\text{L2}}$ close behind. Further, LAMP recovers substantially better sentence ordering. It is worth noting that the improvement over the baselines for both R1 and R2 is most pronounced on the smallest model $\text{TinyBERT}_6$ where recovery is the hardest. Additionally, we also experimented with recovering text in the setting where the batch size is bigger than 1. We are the first to present results in this setting and we show them below for the CoLA dataset:

	B=1		B=2		B=4
Method	R1	R2	R1	R2	R1	R2
DLG	59.3	7.7	49.7	5.7	37.6	1.7
TAG	78.9	10.2	68.8	7.6	56.2	6.7
$\text{LAMP}_{\cos}$	89.6	51.9	74.4	29.5	55.2	14.5
$\text{LAMP}_{\text{L1}+\text{L2}}$	87.5	47.5	78.0	31.4	66.2	21.8

We see that despite the worse quality of reconstruction, even batch size of 4 still leaks a substantial amount of data. Further, we observe that for bigger batch sizes $\text{LAMP}_{\text{L1}+\text{L2}}$ performs better than $\text{LAMP}_{\cos}$. Both LAMP methods, however, substantially improve upon the results of the baselines. Finally, we show an example sentence reconstruction from LAMP and TAG on multiple datasets below. Here, yellow signifies a single correctly reconstructed word and green signifies a tuple of correctly recovered words. We see that LAMP recovers the word order drastically better and often is even able to reconstruct it perfectly. In addition, LAMP recovers more individual words. This confirms qualitatively the effectiveness of our attack.
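To give some intuition for the R1 and R2 scores reported above, the snippet below computes a simplified n-gram overlap between a reference sentence and its reconstruction. It assumes whitespace tokenization and recall-style counting; actual ROUGE implementations may differ in tokenization, stemming, and in whether precision, recall, or an F-score is reported, so this is only an approximation of the metric.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(reference: str, reconstruction: str, n: int) -> float:
    """Percentage of reference n-grams that also occur in the reconstruction."""
    ref = ngrams(reference.split(), n)
    rec = ngrams(reconstruction.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & rec).values())  # clipped counts of shared n-grams
    return 100.0 * overlap / total


r1 = ngram_recall("the cat sat", "sat the cat", n=1)
r2 = ngram_recall("the cat sat", "sat the cat", n=2)
```

In this toy example all words are recovered but only one of the two reference bigrams survives, which shows why the bigram score is sensitive to word order.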

Summary

In this blog post, we introduced LAMP, a new framework for gradient leakage of text data from gradient updates in federated learning. Our key ideas are the alternating of continuous and discrete optimization steps and the introduction of an auxiliary text model which we use in the discrete part of our optimization to judge how well a piece of text is reconstructed. Thanks to these elements, our attack is able to produce substantially better text reconstructions compared to the state-of-the-art attacks both quantitatively and qualitatively. We, thus, show that many practical federated learning systems based on text are vulnerable and better mitigation steps should be taken. For more details please see our NeurIPS 2022 paper.

Reliability Guarantees on Private Data

2022-11-07T00:00:00+00:00

In this post we present our ACM CCS 2022 publication, Private and Reliable Neural Network Inference, where we introduced Phoenix, a tool for NN inference that both protects client data privacy and enables important reliability guarantees.

We focus on the common ML as a service scenario, a two-party setting where a client offloads intensive computation (commonly NN inference) to the server. The client data is of sensitive nature in many of the applications (e.g., financial, judicial), which motivated work in the field of privacy-preserving NN inference, aiming to build methods that perform the computation without the server learning the client data. One of the most common techniques for this is fully-homomorphic encryption (FHE), which is rapidly becoming more practical.

Orthogonal to privacy, a long line of work focuses on enabling NN inference with reliability guarantees. For example, in a loan prediction setting, augmenting predictions with fairness guarantees is in the interest of both parties, as it increases trust in the system and may be essential to ensure regulatory compliance. Some of the latest works in this direction are LASSI and FARE, focusing on two aspects of the fairness problem. Another common example is robustness, where for example, a medical image analysis system should be able to prove to clients that the diagnosis is robust to naturally-occurring measurement errors (see e.g., our latest work SABR).

ML as a service. Phoenix achieves both client data privacy (via FHE) and fairness/robustness guarantees.

While the problems of privacy-preserving and reliable inference are both well-established, there was no prior attempt to consolidate the work in these two fields. Thus, service providers who offer reliability guarantees currently have no simple way to transition their service to a privacy-preserving setting, a requirement which is becoming increasingly relevant. This is the problem we address in Phoenix, proposing a system that supports both: client data privacy and reliability guarantees. To this end, we lift the key building blocks of randomized smoothing, a technique for augmenting predictions with reliability guarantees, to the popular RNS-CKKS FHE scheme. The key challenges that Phoenix overcomes stem from the missing native support for control flow and evaluation of non-polynomial functions in the FHE scheme.

We now recall randomized smoothing on a high level.

Randomized smoothing. A high-level overview of the procedure in the non-private setting.

As the service provider, we receive an input $x$ (in the illustration above, a cat image), and we aim to return a prediction $y$ (for some classification task) augmented with a reliability guarantee, for a property such as robustness. We duplicate the input $n$ times, add independently sampled Gaussian noise to each copy, and perform batched NN inference to obtain the logit vectors, i.e., unnormalized probabilities. Next, we apply the Argmax function to transform logits to predictions, and aggregate those predictions across $n$ samples to get the counts, indicating how many times each output class was predicted. Finally, we perform a statistical test on the counts, which, if successful, produces a probabilistic reliability guarantee, ensuring that the prediction $y$ is robust with known high probability.
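The non-private procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the classifier is any function returning a list of logits, the statistical test is an exact one-sided binomial test of the top class count against a threshold `tau`, and the defaults for `n`, `sigma`, `tau`, and `alpha` are hypothetical values chosen for illustration rather than those used in the paper.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Tuple

Vector = List[float]


def binomial_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def smoothed_predict(classifier: Callable[[Vector], Vector], x: Vector,
                     n: int = 100, sigma: float = 0.5, tau: float = 0.5,
                     alpha: float = 0.001, seed: int = 0) -> Tuple[int, bool]:
    """Return (predicted class, whether the one-sided test succeeded)."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n):
        # Add independent Gaussian noise to a copy of the input.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        logits = classifier(noisy)
        # Argmax turns the logit vector into a class prediction.
        counts[max(range(len(logits)), key=logits.__getitem__)] += 1
    top_class, top_count = counts.most_common(1)[0]
    # Reject "top class probability <= tau" if such a count is unlikely under it.
    p_value = binomial_tail(n, top_count, tau)
    return top_class, p_value <= alpha
```

For a classifier whose prediction is unaffected by the noise, all `n` votes agree and the test succeeds. Of these steps, the Argmax and the test are not directly expressible as low-degree polynomials, which is what makes the encrypted setting harder.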

The key question is how this procedure needs to change if we attempt to execute it while protecting client data privacy, i.e., if the data is encrypted using FHE by the client. The key steps are illustrated below.

Overview of Phoenix. The main challenges in lifting randomized smoothing to FHE.

For the batched NN inference (dashed line) we directly utilize prior work, which offers efficient algorithms for the RNS-CKKS scheme. Further, the addition of noise is simple, as the noise can be directly added as a plaintext due to the homomorphic property. However, computing Argmax is a key challenge due to the difficulty of computing non-polynomial functions—we elaborate on this shortly. In the aggregation step we combine several methods from prior work with scheme-specific optimizations, and develop a novel heuristic for randomized smoothing, necessary to obtain a computationally feasible procedure. Finally, we perform a rewrite of the one-sided binomial test applied to counts to make it FHE-suitable. The output of Phoenix is a single ciphertext, which when decrypted with the secret key of the client, reveals both the prediction and the computed reliability guarantee. We next discuss the Argmax approximation in more detail, and refer to our paper for details regarding all other steps.

To efficiently approximate Argmax, we build on the recent paper of Cheon et al. (ASIACRYPT ‘20), which proposes SgnHE, a sign function approximation for FHE as a composition of low-degree polynomials, illustrated below. Our approximate Argmax is built on several applications of SgnHE (see the paper for the full algorithm). Most importantly, in our case it is crucial to have guarantees on the approximation quality of SgnHE; otherwise, the randomized smoothing reliability guarantee returned to the clients may in some cases be invalid, fundamentally compromising the protocol.

Sign function approximation. Repeated applications of the polynomial $f_0$ increase approximation quality.
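The idea of approximating the sign function by composing a low-degree polynomial can be tried out numerically. The sketch below assumes the odd cubic $f(x) = (3x - x^3)/2$ as the building block and inputs in $[-1, 1]$. This is an illustrative choice and not necessarily the exact composition used by SgnHE or Phoenix.

```python
def f(x: float) -> float:
    """Odd cubic with fixed points at -1, 0, and 1; maps [-1, 1] into [-1, 1]."""
    return (3.0 * x - x ** 3) / 2.0


def approx_sign(x: float, depth: int) -> float:
    """Apply f repeatedly; for x in [-1, 1] \\ {0} the result approaches sign(x)."""
    for _ in range(depth):
        x = f(x)
    return x


# The approximation sharpens as the composition depth grows.
shallow = approx_sign(0.1, depth=5)
deep = approx_sign(0.1, depth=10)
```

Only additions and multiplications are used, which is what makes such compositions suitable for evaluation under FHE. Inputs close to zero need more compositions to get near $\pm 1$, which is why a guarantee of the form "inputs in $[a, 1]$ map into $[1 - 2^{-b}, 1]$" depends on a lower bound $a$ on the input magnitude.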

The SgnHE function is parametrized such that for desired parameters $a,b \in \mathbb{R}$, we can obtain an $(a,b)$-close approximation, meaning that for inputs $x \in [a, 1]$ the output is guaranteed to be in $[1 - 2^{-b}, 1]$ (similarly for the negative case). However, as the server is unable to directly observe the intermediate values due to encryption, it is hard to ensure the above precondition is satisfied for logit values which are the input to the Argmax, and the first of the sign function applications it utilizes.

To overcome this we impose two conditions on the logit vectors, constraining the range and differences of their values, allowing us to appropriately rescale them and reason about the approximation quality. As we cannot prove for an arbitrary NN that such conditions on logits will always hold (e.g., that they lie in some range), we use confidence intervals and a finite sample to upper bound the condition violation probability. When reporting the guarantee to the client, the computed probability (approximation error) is added to the usual error probability of randomized smoothing as a probabilistic procedure (algorithmic error). The resulting value represents the total error probability of our guarantee, maintaining the behavior of the non-private case.

In our extensive experiments across multiple scenarios we observe values for the total error probability of around 1%, confirming that our procedure leads to viable high-probability guarantees. We further observe non-restrictive latencies and communication cost and high consistency, i.e., the results obtained with the FHE version of randomized smoothing are in almost 100% of the cases identical to those of the non-private baseline, confirming that transitioning a service to FHE using Phoenix does not sacrifice the key metrics. Our Microsoft SEAL implementation is publicly available on GitHub.

We believe Phoenix is an important first step towards merging the worlds of reliable and privacy-preserving machine learning. For more details of the Argmax approximation, omitted parts of the protocol, as well as detailed experimental results including microbenchmarks, please refer to our paper.

Why Tighter Convex Relaxations Harm Certified Training

2022-10-27T00:00:00+00:00

This blog post summarizes the key findings of our paper On the Paradox Of Certified Training, which was recently published in TMLR.

We attempt to explain the phenomenon where most state-of-the-art methods for certified training based on convex relaxations (such as FastIBP or the latest breakthrough SABR) focus on the loose interval propagation (IBP/Box), while intuitively, tighter relaxations (i.e., the ones that more tightly overapproximate the non-linearities in the network) should lead to better results. This was already observed in many prior works, which proposed several hypotheses. However, the paradox was never investigated in a principled way.

We start by proposing a way to quantify tightness (see the paper for details), and thoroughly reproducing the paradox: Across a wide range of settings, tighter relaxations consistently lead to lower certified robustness (in %) than the loose IBP relaxation. An example is shown in the following table, grouping equivalent methods from prior work (below we will refer to each group using the name in bold):

Relaxation	Tightness	Certified (%)
IBP / Box	0.73	86.8
hBox / Symbolic Intervals	1.76	83.7
CROWN / DeepPoly	3.36	70.2
DeepZ / CAP / FastLin / Neurify	3.00	69.8
CROWN-IBP (R)	2.15	75.4

Our key observation is that there are other latent properties of relaxations, besides tightness, that affect success when relaxations are used in a training procedure. More concretely, each of the tighter relaxations has either unfavorable continuity (i.e., the corresponding loss function is discontinuous with respect to network weights) or unfavorable sensitivity (i.e., the corresponding loss function is highly sensitive to small perturbations of network weights), both preventing successful optimization. By observing all three properties jointly, we can more easily interpret the seemingly counterintuitive results.

The plot below shows the relaxation of the ReLU non-linearity used by CROWN, for the example input range defined by $l=-5$ and $u=8$. By reducing $u$ (using the bottom slider), we can directly observe the discontinuity of CROWN, when its heuristic choice of the lower linear bound changes at $|l|=|u|$. Using the same plot we can observe the discontinuities of hBox at $l=0$. These observations imply discontinuities in the loss when a relaxation is used in training, which we further empirically observe in real scenarios. Finally, we can use the plot below to observe that no discontinuities can be found for IBP and DeepZ—a formal proof of their continuity in the general case is given in the paper.

Open Interactive Plot
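For readers without access to the interactive plot, the discontinuity can be reproduced numerically. The snippet below is a minimal sketch that assumes the commonly used heuristic for the lower linear bound of an unstable ReLU (slope $1$ if $u > |l|$, slope $0$ otherwise); the function names are illustrative and not taken from the paper's code.

```python
def crown_relu_lower_slope(l: float, u: float) -> float:
    """Slope of the lower bound z >= slope * x for an unstable ReLU (l < 0 < u).

    Assumed heuristic: pick slope 1 when u > |l|, and slope 0 otherwise.
    """
    return 1.0 if u > -l else 0.0


def lower_bound_at(x: float, l: float, u: float) -> float:
    """Value of the lower linear bound at input x."""
    return crown_relu_lower_slope(l, u) * x


l = -5.0
# Moving u across |l| = 5 flips the slope, so the bound at x = l jumps by 5.
just_above = lower_bound_at(l, l, 5.01)   # slope 1
just_below = lower_bound_at(l, l, 4.99)   # slope 0
print(just_above, just_below)
```

A change of $0.02$ in $u$ moves the lower bound at $x = l$ by $5$, which is the kind of jump that carries over to the training loss.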

While the sensitivity of the loss functions is harder to illustrate on a toy example as above, our derivation (Section 4.3 of the paper) demonstrates that the bounds used by CROWN, CROWN-IBP (R) and DeepZ lead to certified training losses highly sensitive to small changes in network weights, while the losses of IBP and hBox are not sensitive and induce more favorable loss landscapes. With these observations, we expand the table shown earlier to include all three relaxation properties: tightness, continuity and sensitivity. This illustrates that attempts to use tighter relaxations in certified training have introduced unfavorable properties of the loss, which resulted in the failure to outperform the continuous and non-sensitive IBP.

Relaxation	Tightness	Continuity	Sensitivity	Certified (%)
IBP / Box	0.73	$\checkmark$	$\checkmark$	86.8
hBox / Symbolic Intervals	1.76	$\times$	$\checkmark$	83.7
CROWN / DeepPoly	3.36	$\times$	$\times$	70.2
DeepZ / CAP / FastLin / Neurify	3.00	$\checkmark$	$\times$	69.8
CROWN-IBP (R)	2.15	$\times$	$\times$	75.4

Next steps

A natural question that arises is the one of improving unfavorable properties of relaxations to make them more suitable for certified training. In the paper we systematically investigate modifications of existing relaxations and find that designing a relaxation with all favorable properties is difficult, as the properties induce complex tradeoffs that depend on the setting. Still, such relaxations may exist, and future work might be able to utilize our findings to discover them.

A more promising approach seems to be the use of existing convex relaxations with modified training procedures designed to best utilize the benefits of each relaxation. Recent successful examples of this approach include COLT, which includes the counterexample search in training, CROWN-IBP, which heuristically combines the losses of two relaxations in training, and two recent methods which focus on IBP, aiming to improve its training procedure via better initialization and regularization (FastIBP) or the propagation of smaller input regions in training (SABR).

Finally, it is worth noting that there are other promising approaches for neural network certification that do not use convex relaxations and are thus not affected by tradeoffs between relaxation properties. Examples in this direction include Randomized Smoothing, offering high-probability robustness certificates, and custom certification-friendly model architectures such as $l_\infty$-distance nets. While not affected by limitations of convex relaxations, these approaches introduce other challenges such as optimization difficulties and additional inference-time work.

SRI Lab at ICLR 2022

2022-04-25T00:00:00+00:00

SRI Lab will present five works at ICLR 2022! In this meta post we aggregate all content related to our ICLR papers, including links to the conference portal and individual blogposts where you can learn more about the topics we currently focus on.

Generating Provably Robust Adversarial Examples

2022-04-21T00:00:00+00:00

In the image above we show an image of the digit $0$ from MNIST ($x_\text{orig}$) and a region around it in red that depicts the set of geometrically perturbed images for which we expect a given neural network to be robust. Further, in green we depict a subregion where the neural network is not robust. Traditionally, in order to assess the robustness of the network one uses adversarial attacks to generate the examples $x_1$ and $x_2$. While robustness can be assessed that way, the information that the whole green region is adversarial is lost. This in turn might lead to never-seen-before network behaviour in the future. One advantage of the classical approach of assessing robustness, however, is that generating $x_1$ and $x_2$ is fast. In contrast, generating the green region is computationally infeasible. In this work, we present an algorithm called PARADE that exploits classical adversarial attacks to generate as large as possible regions that are provably adversarial. Similarly to the green region in the figure, these regions summarize many individual adversarial attacks while also being practical to compute. We call them provably robust adversarial examples.

Algorithm overview

There are three main steps to PARADE. First, we generate an initial box region that might contain non-adversarial points using off-the-shelf adversarial attacks. We refer to this region as the overapproximation box $\mathcal{O}$. Then, we use a black-box verifier to shrink this overapproximation box to a smaller box that provably contains only adversarial points. We call this region the underapproximation box $\mathcal{U}$. Finally, we use $\mathcal{O}$ and $\mathcal{U}$ to generate a polyhedral region $\mathcal{U}\subseteq\mathcal{P}\subseteq\mathcal{O}$ that we also prove only contains adversarial points using the same black-box verifier. Both $\mathcal{U}$ and $\mathcal{P}$ fit our definition of provably robust adversarial examples but differ in terms of shape and precision. The generation of $\mathcal{P}$ is therefore only an optional way to make our provably robust adversarial examples more precise. Next, we present the PARADE steps in detail.

Generating the overapproximation box $\mathcal{O}$

To generate the overapproximation box $\mathcal{O}$, we sample attacks from an adversarial attack algorithm, such as PGD. Then, we fit a box around them. The process is illustrated in the animation above. We note that depending on the success of the attack algorithm, a small part of the ground truth adversarial region $\mathcal{T}$ might be excluded from $\mathcal{O}$.
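The box-fitting step can be sketched in a few lines. This is a minimal illustration, assuming the sampled attacks are given as flat lists of pixel values; `fit_box` is a hypothetical helper name, and the attack sampling itself (e.g., PGD) is not shown.

```python
from typing import List, Sequence, Tuple


def fit_box(points: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the tightest axis-aligned box (lower, upper) containing all points."""
    per_dimension = list(zip(*points))  # regroup the values by coordinate
    lower = [min(values) for values in per_dimension]
    upper = [max(values) for values in per_dimension]
    return lower, upper


# Three hypothetical 2-pixel adversarial examples.
attacks = [[0.1, 0.5], [0.3, 0.2], [0.2, 0.9]]
lower, upper = fit_box(attacks)
print(lower, upper)
```

Any point between `lower` and `upper` lies in the resulting box, including points that no attack produced, which is why this box may still contain non-adversarial points.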

Generating the underapproximation box $\mathcal{U}$

We aim to generate the underapproximation box $\mathcal{U}$ in a way that it can be proven to only contain adversarial examples while also being as large as possible. Due to the complexity of this objective, we do this heuristically. In particular, we start by initializing $\mathcal{U}$ to the overapproximation box $\mathcal{O}$. At each iteration $i$, we execute a black-box verification procedure. If the procedure verifies that the box from the previous iteration, $\mathcal{U}_{i-1}$, contains only adversarial examples, we return it. Otherwise, we obtain from the verifier a linear constraint, which can be added to $\mathcal{U}_{i-1}$ in order to make it provably robust. Unfortunately, the constraint is usually too conservative, as the black-box verifier relies on overapproximation of the set of possible network outputs. To this end, we relax the constraint by bias-adjusting it. We make sure that we do not relax the constraint so much that it becomes meaningless. We use the bias-adjusted constraint to shrink $\mathcal{U}_{i-1}$ such that the constraint is not violated but the smallest possible amount of volume is lost. The procedure is repeated until the verification succeeds. The process is depicted in the animation above.

Generating the polyhedral region $\mathcal{P}$

Finally, we present the optional step of generating the polyhedral provably robust adversarial example $\mathcal{P}$ from the box provably robust adversarial example $\mathcal{U}$. The additional flexibility of the polyhedral shape allows for larger regions $\mathcal{P}$ compared to $\mathcal{U}$ in exchange for computational complexity. As generating polyhedral regions is hard, we again do this heuristically. Starting with the overapproximation box $\mathcal{O}$, we iteratively add linear constraints to it until we arrive at a polyhedron $\mathcal{P}$ that can be proved to only contain adversarial examples by the black-box verifier. Similarly to the generation process of $\mathcal{U}$, we use the verification at iteration $i$ to generate linear constraints. Unlike the generation process of $\mathcal{U}$, we use not only linear constraints from the final verification objective but also linear constraints that make the ReLU activation neurons in the network decided. Unfortunately, the resulting constraints might generate a polyhedron $\mathcal{P}$ that is smaller than $\mathcal{U}$. To prevent that, we leverage the fact that $\mathcal{U}$ is itself provably robust and we bias-adjust the constraints in such a way that they do not remove volume from $\mathcal{U}$. The procedure concludes when the verifier succeeds. We outline the procedure in the animation above.

Experiments

We experimented with two different types of provably robust adversarial examples - robust to pixel intensity changes ($\ell_\infty$ changes) and to geometric changes. We show the pixel intensity experiment below:

Network	$\epsilon$	PARADE Box # Regions	PARADE Box Time	PARADE Box # Attacks	PARADE Poly # Regions	PARADE Poly Time	PARADE Poly # Attacks
MNIST 8x200	0.045	53/53	114s	$10^{121}$	53/53	1556s	$10^{121} < \cdot < 10^{191}$
MNIST ConvSmall	0.12	32/32	74s	$10^{494}$	32/32	141s	$10^{494} < \cdot < 10^{561}$
MNIST ConvBig	0.05	28/29	880s	$10^{137}$	28/29	5636s	$10^{137} < \cdot < 10^{173}$
CIFAR-10 ConvSmall	0.006	44/44	113s	$10^{486}$	44/44	264s	$10^{486} < \cdot < 10^{543}$
CIFAR-10 ConvBig	0.008	36/36	404s	$10^{573}$	36/36	610s	$10^{573} < \cdot < 10^{654}$

We note PARADE is highly effective - it generates regions successfully for all but $1$ image for which the classical adversarial attacks succeeded. Further, the regions generated contain a very large set of adversarial examples that are infeasible to generate individually. Finally, we note that the polyhedral adversarial examples take more time to generate but contain more examples. Calculating the exact number of concrete attacks within the polyhedral regions is computationally hard so instead we approximate the number as precisely as possible from above and below using boxes.
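To give a feeling for where numbers such as $10^{121}$ come from, the sketch below counts the distinct images inside a box. It assumes that images are quantized to integer pixel values and that the count is the product of the number of admissible values per pixel; this is an illustrative assumption, and the exact counting procedure is described in the paper. The helper name is hypothetical.

```python
import math
from typing import Sequence


def log10_num_images(lower_px: Sequence[int], upper_px: Sequence[int]) -> float:
    """log10 of the number of integer-valued images inside a box.

    lower_px and upper_px hold inclusive integer bounds for every pixel.
    Working in log space avoids overflowing floats for large images.
    """
    return sum(math.log10(u - l + 1) for l, u in zip(lower_px, upper_px))


# A 28x28 image where every pixel may take one of two values.
num_pixels = 28 * 28
exponent = log10_num_images([100] * num_pixels, [101] * num_pixels)
print(f"about 10^{exponent:.0f} images")
```

Even a box that leaves only two choices per pixel contains $2^{784}$ images, which is why enumerating the attacks individually is infeasible.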

Next, we show the results for adversarial examples provably robust to geometric changes:

Network	Transform	PARADE Box # Regions	PARADE Box Time	PARADE Box # Attacks
MNIST ConvSmall	Rotate + Scale + Shear	51/54	774s	$10^{96} < \cdot < 10^{195}$
MNIST ConvSmall	Scale + Translate2D	51/56	521s	$10^{71} < \cdot < 10^{160}$
MNIST ConvSmall	Scale + Rotate + Brightness	40/48	370s	$10^{70} < \cdot < 10^{455}$
MNIST ConvBig	Rotate + Scale + Shear	44/50	835s	$10^{77} < \cdot < 10^{205}$
MNIST ConvBig	Scale + Translate2D	42/46	441s	$10^{64} < \cdot < 10^{174}$
MNIST ConvBig	Scale + Rotate + Brightness	46/52	537s	$10^{119} < \cdot < 10^{545}$
CIFAR-10 ConvSmall	Rotate + Scale + Shear	29/29	1369s	$10^{599} < \cdot < 10^{1173}$
CIFAR-10 ConvSmall	Scale + Translate2D	32/32	954s	$10^{66} < \cdot < 10^{174}$
CIFAR-10 ConvSmall	Scale + Rotate + Brightness	21/25	1481s	$10^{513} < \cdot < 10^{2187}$

We see that PARADE is again capable of generating examples for most images where classical adversarial attacks succeeded. We note that we use DeepG for verification. Since DeepG generates image polyhedra, we have to approximate the number of concrete attacks similarly to PARADE Poly above. We also note that DeepG is more computationally expensive, resulting in longer runtimes for our algorithm as well.

Visualizing PARADE regions

Above we visualize some of the provably robust adversarial examples generated by PARADE for both pixel and geometric transformations. Each pixel’s color represents the number of possible values that pixel can have within our box regions.

Summary

We introduced and motivated the concept of provably robust adversarial examples. We further showed an outline of our algorithm, PARADE, that generates such examples in the shape of boxes or polyhedra. Empirically, we demonstrated that regions produced by PARADE can summarize a very large number of individual adversarial examples, making them a useful tool to assess the robustness of neural networks. We hope that we piqued your interest in our work. For further details and experiments, please refer to our ICLR 2022 paper.

Acknowledgments

I would like to thank all of my collaborators for contributing to this paper. In particular, I want to thank Gagandeep Singh, who supervised me on the project and is now a professor at UIUC, for his help and mentorship.

Multi-neuron Relaxation Guided Branch-and-bound

2022-04-21T00:00:00+00:00

This blog post explains the high-level concepts and intuitions behind our most recent neural network verifier MN-BaB. First, we introduce the neural network verification problem. Then, we present the so-called Branch-and-Bound approach for solving it and outline the main ideas behind multi-neuron constraints, before combining the two in our new verifier MN-BaB. We conclude with some experimental results and insights on why using multi-neuron constraints is key for the verification of challenging networks with high natural accuracy.

Neural Network Verification

In a nutshell, the neural network verification problem can be stated as follows:

Given a network and an input, prove that all points in a small region around that input are classified correctly, i.e., that no adversarial example exists.

To formalize this a bit, we consider a network $f: \mathcal{X} \to \mathcal{Y}$, an input region $\mathcal{D} \subseteq \mathcal{X}$, and a linear property $\mathcal{P}\subseteq \mathcal{Y}$ over the output neurons $y\in\mathcal{Y}$, and we try to prove that

\[f(x) \in \mathcal{P}, \forall x \in \mathcal{D}.\]

For the sake of explanation, we consider a fully connected $L$-layer network with ReLU activations but note that we can handle all common architectures. We denote the weights and biases of neurons in the $i$-th layer as $W^{(i)}$ and $b^{(i)}$ and define the neural network as

\[f(x) := \hat{z}^{(L)}(x), \qquad \hat{z}^{(i)}(x) := W^{(i)}z^{(i-1)}(x) + b^{(i)}, \qquad z^{(i)}(x) := \max(0, \hat{z}^{(i)}(x)).\]

Where $z^{(0)}(x) = x$ denotes the input, $\hat{z}$ are the pre-activation values, and $z$ the post-activation values. For readability, we omit the dependency of intermediate activations on $x$ from here on.

Let $\mathcal{D}$ be an $\ell_\infty$ ball around an input point $x_0$ of radius $\epsilon$: $\mathcal{D}_\epsilon(x_0) = \left\{ x \in \mathcal{X} \mid \lVert x - x_0\rVert _{\infty} \leq \epsilon \right\}.$

Since we can encode any linear property over output neurons into an additional affine layer, we can simplify the general formulation of $f(x) \in \mathcal{P}$ to $f(x) > {0}$. Any such property can now be verified by proving that a lower bound to the following optimization problem is greater than $0$:

\[\begin{align*} \min_{x \in \mathcal{D}_\epsilon(x_0)} \qquad &f(x) = \hat{z}^{(L)} \tag{1} \\ s.t. \quad & \hat{z}^{(i)} = W^{(i)}z^{(i-1)} + b^{(i)}\\ & z^{(i)} = \max({0}, \hat{z}^{(i)})\\ \end{align*}\]

Background: Branch-and-Bound for Neural Network Verification

Recently, the Branch-and-Bound (BaB) approach, first described for this task in Branch and Bound for Piecewise Linear Neural Network Verification, has been popularized. At a high level, it is based on splitting the hard optimization problem of Eq. 1 into multiple easier subproblems by adding additional constraints until we can show the desired bound of $f(x) > 0$ on them.

The high-level motivation is the following: the optimization problem in Eq. 1 would be efficiently solvable if not for the non-linearity of the ReLU function. Since a ReLU function is piecewise linear and composed of only two linear regions, we can make a case distinction between a single ReLU node being “active” (input $\geq 0$) or inactive (input $< 0$) and prove the property on the resulting cases where the ReLU behaves linearly.

In the limit where all ReLU nodes are split, the verification problem becomes fully linear and can be solved efficiently. However, the number of subproblems to be solved in the resulting Branch-and-Bound tree is exponential in the number of ReLU neurons on which we split. Therefore, splitting all ReLU nodes is computationally intractable for all interesting verification problems. To tackle this problem, we prune this Branch-and-Bound tree using the insight that we do not have to split a subproblem further, once we find a lower bound that is $>0$.

In pseudo-code, the Branch-and-Bound algorithm looks as follows:

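    def verify_with_branch_and_bound(network, input_region, output_property) -> bool:
        # bound() and branch() are instantiated by the concrete verifier (e.g., MN-BaB).
        problem_instance = (input_region, output_property)

        global_lb, global_ub = bound(network, problem_instance)
        unsolved_subproblems = [(global_lb, problem_instance)]

        # Continue while the property is neither proven (lb > 0) nor refuted (ub < 0).
        while global_lb <= 0 and global_ub >= 0:
            _, current_subproblem = unsolved_subproblems.pop()
            current_lb, current_ub = bound(network, current_subproblem)

            if current_ub < 0:
                return False  # the property is violated on this subproblem
            if current_lb <= 0:
                # Not proven yet: split, and let both children inherit the parent's bound.
                subproblem_left, subproblem_right = branch(current_subproblem)
                unsolved_subproblems.append((current_lb, subproblem_left))
                unsolved_subproblems.append((current_lb, subproblem_right))

            # Once no unsolved subproblems remain, every branch has been proven.
            global_lb = min((lb for lb, _ in unsolved_subproblems), default=float("inf"))
        return global_lb > 0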

To define one particular verification method that follows the Branch-and-Bound approach, such as MN-BaB, all we have to do is instantiate the branch() and bound() functions.
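To make the roles of the two functions concrete, here is a small self-contained toy instantiation. It is only a sketch under strong simplifying assumptions and is not how MN-BaB works: the network is a hypothetical one-input ReLU network, `bound()` uses plain interval arithmetic, and `branch()` splits the input interval in half instead of splitting ReLU nodes.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

# Hypothetical toy network (not from the post): f(x) = 1.5 - relu(x) - relu(-x) = 1.5 - |x|.
W1, B1 = [1.0, -1.0], [0.0, 0.0]   # hidden layer: two ReLU neurons
W2, B2 = [-1.0, -1.0], 1.5         # output layer


def f(x: float) -> float:
    """Evaluate the toy network at a single input."""
    return B2 + sum(w2 * max(0.0, w1 * x + b1) for w1, b1, w2 in zip(W1, B1, W2))


def bound(region: Interval) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the minimum of f over the region.

    Lower bound: interval arithmetic through the network (sound but loose).
    Upper bound: the value at the midpoint, a concrete point of the region.
    """
    lo, hi = region
    lower = B2
    for w1, b1, w2 in zip(W1, B1, W2):
        pre_lo, pre_hi = sorted((w1 * lo + b1, w1 * hi + b1))
        post_lo, post_hi = max(0.0, pre_lo), max(0.0, pre_hi)
        lower += w2 * post_lo if w2 >= 0 else w2 * post_hi
    return lower, f((lo + hi) / 2)


def branch(region: Interval) -> Tuple[Interval, Interval]:
    """Split the input interval in half (input splitting, not ReLU splitting)."""
    lo, hi = region
    mid = (lo + hi) / 2
    return (lo, mid), (mid, hi)


def verify(region: Interval, max_subproblems: int = 10_000) -> bool:
    """Try to prove f(x) > 0 on the region; False means refuted or budget exhausted."""
    unsolved: List[Interval] = [region]
    solved = 0
    while unsolved:
        current = unsolved.pop()
        solved += 1
        lower, upper = bound(current)
        if upper < 0 or solved > max_subproblems:
            return False  # counterexample found, or we gave up
        if lower <= 0:
            unsolved.extend(branch(current))  # not proven yet: split further
    return True  # every leaf has a positive lower bound


print(verify((-1.0, 1.0)))  # the minimum on this region is 0.5
print(verify((-2.0, 2.0)))  # f is negative near x = 2
```

On $[-1, 1]$ the root bound is too loose to prove the property, but a single split suffices. On $[-2, 2]$ the search eventually evaluates a point where $f$ is negative and stops.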

Background: Multi-Neuron Constraints

Before we do that, we need to understand multi-neuron constraints (MNCs), the second key building block of MN-BaB.

To bound the optimization problem in Eq. 1 efficiently, we want to replace the non-linear constraint $z = \max({0}, \hat{z})$ with its so-called linear relaxation, i.e., a set of linear constraints that is satisfied for all points satisfying the original non-linear constraint. If we consider just a single neuron, the tightest such linear relaxation is the convex hull of the function in its input-output space:
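For reference, this convex hull can be written down explicitly. For an unstable neuron with pre-activation bounds $l_x < 0 < u_x$ (the notation is ours, matching the input bounds used later in this post), it is the well-known triangle relaxation:

\[z \geq 0, \qquad z \geq \hat{z}, \qquad z \leq \frac{u_x}{u_x - l_x}\left(\hat{z} - l_x\right).\]

The first two constraints are the two linear pieces of the ReLU, and the third is the line connecting the points $(l_x, 0)$ and $(u_x, u_x)$.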

However, considering one neuron at a time comes with a fundamental precision limit, called the (single-neuron) convex relaxation barrier. It has since been shown that this limit can be overcome by considering multiple neurons jointly, thereby capturing interactions between these neurons and obtaining tighter bounds. We illustrate this improvement, showing a projection of the 4d input-output space of two neurons, below.

The difference in tightness between the tightest single-neuron, and a multi-neuron relaxation.

We use the efficiently computable multi-neuron constraints from PRIMA, which can be expressed as a conjunction of linear constraints over the joint input-output space.

MN-BaB: Bounding

The goal of the bound() method is to derive a lower bound to Eq. 1 that’s as tight as possible. The tighter it is, the earlier the Branch-and-Bound process can be terminated.

Following previous works, we derive a lower bound of the network’s output as a linear function of the inputs:

\[\min_{x \in \mathcal{D}} f(x) \geq \min_{x \in \mathcal{D}} a_{inp}x + c_{inp}\]

There, the minimization over $x \in \mathcal{D}$ has a closed-form solution given by Hölder’s inequality:

\[\min_{x \in \mathcal{D}} a_{inp}x + c_{inp} \geq a_{inp}x_0 - \lVert a_{inp} \rVert_1 \epsilon + c_{inp}\]
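The closed form is easy to check numerically. The sketch below compares it against a brute-force minimum over the corners of the $\ell_\infty$ ball (a linear function over a box attains its minimum at a corner); the function names and the example coefficients are made up for illustration.

```python
from itertools import product
from typing import Sequence


def closed_form_min(a: Sequence[float], c: float, x0: Sequence[float], eps: float) -> float:
    """Evaluate a.x0 - eps * ||a||_1 + c, the bound from the formula above."""
    dot = sum(ai * xi for ai, xi in zip(a, x0))
    l1_norm = sum(abs(ai) for ai in a)
    return dot - eps * l1_norm + c


def brute_force_min(a: Sequence[float], c: float, x0: Sequence[float], eps: float) -> float:
    """Minimize a.x + c over all corners of the l_inf ball around x0."""
    return min(
        sum(ai * (xi + s * eps) for ai, xi, s in zip(a, x0, signs)) + c
        for signs in product((-1.0, 1.0), repeat=len(a))
    )


a, c = [2.0, -1.0, 0.5], 0.25
x0, eps = [0.5, 0.5, 0.5], 0.1
print(closed_form_min(a, c, x0, eps), brute_force_min(a, c, x0, eps))
```

Both values agree up to floating-point error, and the closed form needs a single pass over the coefficients instead of $2^d$ corner evaluations.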

To arrive at such a linear lower bound of the output in terms of the input, we start with the trivial lower bound $f(x) \geq W^{(L)}z^{(L-1)} + b^{(L)}$ and replace $z^{(L-1)}$ with symbolic, linear bounds depending only on the previous layer's values $z^{(L-2)}$. We proceed in this manner recursively until we obtain an expression only in terms of the inputs of the network.

These so-called linear relaxations of the different layer types determine the precision of the obtained bounding method. While affine layers (e.g., fully connected, convolutional, avg. pooling, normalization) can be captured exactly, non-linear activation layers remain challenging and their encoding is what differentiates MN-BaB. Most importantly, MN-BaB enforces MNCs in an efficiently optimizable fashion. The full details are given in the paper but are rather technical and notation-heavy, so we will skip them here.

To derive the linear relaxations for activation layers, we need bounds on the inputs of those layers ($l_x$ and $u_x$ in the illustrations). In order to compute these lower and upper bounds on every neuron, we apply the procedure described above to every neuron in the network, starting from the first activation layer.

Note that if those input bounds for a ReLU node are either both negative or both positive, the corresponding activation function becomes linear and we do not have to split this node during the Branch-and-Bound process. We call such nodes “stable” and correspondingly nodes where the input bounds contain zero “unstable”.

From Left to Right: stable inactive, stable active, unstable.
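The three cases can be expressed directly in terms of the input bounds. A minimal sketch, with illustrative function names and made-up bounds:

```python
from typing import List, Sequence


def classify_relu(lower: float, upper: float) -> str:
    """Classify a ReLU node from the bounds on its input."""
    if upper <= 0:
        return "stable inactive"  # the output is always 0
    if lower >= 0:
        return "stable active"    # the output always equals the input
    return "unstable"             # the input range contains zero


def unstable_nodes(lowers: Sequence[float], uppers: Sequence[float]) -> List[int]:
    """Indices of the nodes that may need to be split during Branch-and-Bound."""
    return [
        i for i, (l, u) in enumerate(zip(lowers, uppers))
        if classify_relu(l, u) == "unstable"
    ]


print(unstable_nodes([-2.0, 0.5, -1.0], [-0.5, 3.0, 2.0]))
```

Only the unstable nodes are candidates for splitting, so tighter input bounds directly shrink the Branch-and-Bound search space.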

MN-BaB: Branching

The branch() method takes a problem instance and splits it into two subproblems. This means deciding which unstable ReLU node to split and adding additional constraints to both resulting subproblems enforcing $\hat{z}<0$ or $\hat{z}\geq0$, on the input of the split neuron.

Illustration of the split constraints that are added to the generated subproblems.

The choice of which node to split has a significant impact on how many subproblems we have to consider during the Branch-and-Bound process until we can prove a property. Therefore, we aim to choose a neuron that minimizes the total number of problems we have to consider. To do this, we define a proxy score trying to capture the bound improvement resulting from any particular split. Note that the optimal branching decision depends on the bounding method that is used, as different bounding methods might profit differently from additional constraints resulting from the split.

As our bounding method relies on MNCs, we design a proxy score that is specifically tailored to them, called the Active Constraint Score (ACS). ACS determines the sensitivity of the final optimization objective with respect to the MNCs and then, for each node, computes the cumulative sensitivity of all constraints containing that node. We then split the node with the highest cumulative sensitivity.
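The aggregation step of ACS can be sketched as follows. This is only an illustration of the "sum over all constraints containing a node, then take the maximum" logic: the per-constraint sensitivities are assumed to be given as non-negative numbers, the data layout and function names are hypothetical, and how the sensitivities are actually derived is described in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# A constraint is represented by the nodes it involves and its sensitivity.
Constraint = Tuple[Tuple[int, ...], float]


def cumulative_sensitivity(constraints: Iterable[Constraint]) -> Dict[int, float]:
    """Sum, for each node, the sensitivities of all constraints containing it."""
    scores: Dict[int, float] = defaultdict(float)
    for nodes, sensitivity in constraints:
        for node in nodes:
            scores[node] += sensitivity
    return dict(scores)


def select_split_node(constraints: Iterable[Constraint], unstable: Sequence[int]) -> int:
    """Pick the unstable node with the highest cumulative sensitivity."""
    scores = cumulative_sensitivity(constraints)
    return max(unstable, key=lambda node: scores.get(node, 0.0))


# Three made-up multi-neuron constraints over nodes 0, 1, and 2.
constraints = [((0, 1), 0.5), ((1, 2), 0.25), ((2,), 0.1)]
print(select_split_node(constraints, unstable=[0, 1, 2]))
```

Here node 1 is chosen because it appears in the two most sensitive constraints.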

We further propose Cost Adjusted Branching (CAB) to scale this branching score by the expected cost of performing a particular split. This cost can differ significantly, as only the intermediate bounds after the split layer have to be recomputed, making splits in later layers computationally cheaper.

Why use multi-neuron constraints?

Using MNCs for bounding, while making the bounds more precise, is computationally costly. The intuitive argument why it still helps verification performance is that the number of subproblems solved during Branch-and-Bound grows exponentially with the depth of the subproblem tree. A more precise bounding method that can verify subproblems earlier (at a smaller depth), can therefore save us exponentially many subproblems that we do not need to solve, which more than compensates for the increased computational cost.

This benefit is more pronounced the larger the considered network and the more dependencies there are between neurons in the same layer. Most established benchmarks (e.g., from VNNComp) are based on very small networks or use training methods designed for ease of verification at the cost of natural accuracy. While this makes their certification tractable, they are less representative of networks used in practice. Therefore, we suggest focusing on larger, more challenging networks with higher natural accuracy (and more intra-layer dependencies) for the evaluation of the next generation of verifiers. There, the benefits of MNCs are particularly pronounced, leading us to believe that they represent a promising direction.

Experiments

We study the effect of MN-BaB’s components in an ablation study on the first 100 test images of the CIFAR-10 dataset. We aim to prove that there is no adversarial example within an $\ell_\infty$ ball of radius $\epsilon=1/255$ and report the number of verified samples (within a timeout of 600 seconds) and the corresponding average runtime. We consider two networks of identical architecture that only differ in the strength of their adversarial training method. ResNet6-A is weakly regularized while ResNet6-B is more strongly regularized, i.e. employs stronger adversarial training.

Evaluating the effect of MNCs, Active Constraint Score (ACS) branching, and Cost Adjusted Branching (CAB) on MN-BaB. BaBSR is another branching method that is used as a baseline.

As expected, we see that both MNCs and Active Constraint Score branching are much more effective on the weakly regularized ResNet6-A. There, we verify 31% more samples while being around 31% faster, while on ResNet6-B we only verify 10% more samples.

As a more fine-grained measure of performance, we analyze the ratio of runtimes and number of subproblems required for verification on a per-property level on ResNet6-A.

Effectiveness of Multi-Neuron Constraints: We plot the ratio of the number of subproblems required to prove a property during Branch-and-Bound without vs. with MNCs. Using MNCs reduces the number of subproblems by two orders of magnitude on average.

Effectiveness of Active Constraint Score Branching: We plot the ratio of the number of subproblems solved during Branch-and-Bound with BaBSR vs. ACS. Using ACS reduces the number of subproblems by an additional order of magnitude.

Effectiveness of Cost Adjusted Branching: Finally, we investigate the effect of Cost Adjusted Branching on mean verification time with ACS. Using Cost Adjusted Branching further reduces the verification time by 50%. It is particularly effective in combination with the ACS scores and multi-neuron constraints, where bounding costs vary more significantly.

Recap

MN-BaB combines precise multi-neuron constraints with the Branch-and-Bound paradigm and an efficient GPU-based implementation to become a new state-of-the-art verifier, especially for less regularized networks. For a full breakdown of all technical details and detailed experimental evaluations, we recommend reading our paper. If you want to play around with MN-BaB yourself, please check out our code.

Encoding Sensitive Data with Guarantees

2022-04-20T00:00:00+00:00

As machine learning is being increasingly used in scenarios that affect humans, such as credit scoring or crime risk assessment, it has become clear that these predictive models often discriminate and can be unfair. It is thus increasingly important to design methods that help models make fair decisions, either by pre-processing the input data, in-processing the model or post-processing the predictions. In our work, we propose a new pre-processing approach to encode existing data into new, unbiased representations that have high utility, but do not allow for reconstructing sensitive attributes such as race or gender. Our approach, named Fair Normalizing Flows (FNF), is based on learning a bijective encoder for each group (where groups are determined based on the sensitive attribute). Using bijective encoders enables us to obtain guarantees on the maximum accuracy that any adversary can have when predicting the sensitive attribute.

Fair representations

Consider a case of a company with several teams that would like to build ML models for different products using the same data. One option would be for each team to train their own model and enforce fairness of the model by themselves. However, the teams might not have the same definition of fairness or they might even lack expertise to train fair models. Fair representation learning is a data pre-processing technique that transforms data into a new representation such that any classifier trained on top of this representation is fair. Using representation learning enables us to pre-process data only once, and then give processed data to each team so that they can train their own model on this new data, while knowing that the model is fair, according to a single, pre-defined fairness definition. The key question here is how to ensure that sensitive attributes cannot be recovered from the learned representations. Typically, prior work has checked that this is the case by jointly learning representations and an auxiliary adversarial model which is trying to predict the sensitive attribute from the representations. However, while these representations protect against adversaries considered during training, several recent papers have shown that stronger adversaries can often in fact still recover the sensitive attributes. Our work tackles this issue by proposing a non-adversarial fair representation learning approach based on normalizing flows, which can in certain cases guarantee that no adversary can reconstruct the sensitive attributes.

Motivation

To motivate our fair representation learning approach, we introduce a small example of a population consisting of a mixture of 4 Gaussians. Consider a distribution of samples $x = (x_1, x_2)$ divided into two groups, shown as blue and orange in the figure below, with color and shape denoting sensitive attribute and label, respectively.

Example of a population. Sensitive attribute is determined using color and shape denotes the label.

The first group with a sensitive attribute $a = 0$ has a distribution $(x_1, x_2) \sim p_0$, where $p_0$ is a mixture of two Gaussians at the top half. The second group with a sensitive attribute $a = 1$ has a distribution $(x_1, x_2) \sim p_1$, where $p_1$ is a mixture of two Gaussians at the bottom half. The label of a point $(x_1, x_2)$ is defined by $y = 1$ if $x_1$ and $x_2$ have the same sign, and $y = 0$ otherwise. Our goal is to learn a data representation $z = f(x, a)$ such that it is impossible to recover $a$ from $z$, but still possible to predict target $y$ from $z$. Note that such a representation exists for our task: simply setting $z = f(x, a) = (-1)^ax$ makes it impossible to predict whether a particular $z$ corresponds to $a = 0$ or $a = 1$, while still allowing us to train a classifier $h$ with essentially perfect accuracy (e.g. $h(z) = 1$ if and only if $z_1 > 0$). This example also motivates our general approach: can we somehow map distributions corresponding to the two groups in the population to new distributions which are guaranteed to be difficult to distinguish?
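The toy example can be simulated in a few lines. The sketch below assumes concrete means of $(\pm 2, \pm 2)$ and a standard deviation of $0.3$ for the four Gaussians; these values are our own choice for illustration, since the post does not specify them.

```python
import random
from typing import List, Tuple

Sample = Tuple[Tuple[float, float], int, int]  # (x, sensitive attribute a, label y)


def sample_population(n: int, seed: int = 0) -> List[Sample]:
    """Draw n points: group a=0 lives in the top half, group a=1 in the bottom half."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = rng.randint(0, 1)
        x1_mean = rng.choice((-2.0, 2.0))      # left or right mixture component
        x2_mean = 2.0 if a == 0 else -2.0      # top half for a=0, bottom half for a=1
        x = (rng.gauss(x1_mean, 0.3), rng.gauss(x2_mean, 0.3))
        y = 1 if (x[0] > 0) == (x[1] > 0) else 0  # same sign means label 1
        data.append((x, a, y))
    return data


def encode(x: Tuple[float, float], a: int) -> Tuple[float, float]:
    """The encoder from the text: z = (-1)^a * x."""
    sign = -1.0 if a == 1 else 1.0
    return (sign * x[0], sign * x[1])


def h(z: Tuple[float, float]) -> int:
    """The classifier from the text: predict 1 if and only if z_1 > 0."""
    return 1 if z[0] > 0 else 0


data = sample_population(2000)
accuracy = sum(h(encode(x, a)) == y for x, a, y in data) / len(data)
print(f"accuracy of h on the encoded data: {accuracy:.3f}")
```

After encoding, both groups end up in the top half of the plane, so $z$ no longer reveals the group, while the label remains readable from the sign of $z_1$.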

Fair Normalizing Flows

Overview of FNF. The key idea is to transform distributions corresponding to different groups using bijective encoders. After training, the two distributions are aligned, and an adversary cannot reconstruct the sensitive attribute $a$ from the latent representation $z$.

As shown in the figure above, original features are useful for solving some downstream prediction task: we can train a classifier $h$ which predicts task label from the original features $x$ with reasonable accuracy. However, at the same time, an adversary $g$ can recover sensitive attribute from $x$, and use it to potentially discriminate in a downstream task. Motivated by the previous example, our goal is now to learn a function $f$ which transforms a pair of features and sensitive attribute $(x, a)$ into a new representation $z$ from which it is difficult to recover the sensitive attribute $a$. As in the previous example, we are going to encode both distributions corresponding to $a = 0$ and $a = 1$ using a bijective transformation. Our approach learns two bijective functions $f_0(x)$ and $f_1(x)$, and we denote the transformation as $f(x, a) = f_a(x)$.

One important quantity we are interested in computing is the statistical distance, which measures how well an adversary can distinguish between the distributions corresponding to the two groups in the population. Importantly, Madras et al. have shown that bounding the statistical distance also bounds other fairness measures such as demographic parity or equalized odds. The statistical distance between the two encoded distributions, denoted as $\mathcal{Z}_0$ and $\mathcal{Z}_1$, is defined as:

\[\begin{equation} \Delta(p_{Z_0}, p_{Z_1}) \triangleq \sup_{\mu \in \mathcal{B}} \lvert \mathbb{E}_{z \sim \mathcal{Z}_0} [\mu(z)] - \mathbb{E}_{z \sim \mathcal{Z}_1} [\mu(z)] \rvert, \end{equation}\]

where $\mu\colon \mathbb{R}^d \rightarrow \{0, 1\}$ is a function from the set of all binary classifiers $\mathcal{B}$, trying to discriminate between $\mathcal{Z}_0$ and $\mathcal{Z}_1$. We can show that the supremum is attained for $\mu^*$ which, for a given $z$, evaluates to $1$ if and only if $p_{Z_0}(z) \leq p_{Z_1}(z)$. This shows that computing the statistical distance requires evaluating probability densities in the latent space, which is difficult for standard neural architectures because a given $z$ can correspond to several different inputs $x$. However, our approach uses a bijective transformation as the encoder, so it is easy to compute the latent probability density using the change of variables formula:

\[\begin{equation} \log p_{Z_a}(z) = \log p_a(f^{-1}_a(z)) + \log \left | \det \frac{\partial f^{-1}_a(z)}{\partial z} \right | \end{equation}\]
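As a concrete instance of the change of variables formula, the sketch below uses an elementwise affine encoder $f(x) = s \odot x + t$ and a diagonal Gaussian input density. Both choices, and all function names, are assumptions made for illustration; FNF itself uses learned normalizing flows.

```python
import numpy as np

def gaussian_log_density(x, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(
        -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi),
        axis=-1,
    )

def latent_log_density(z, scale, shift, mean, std):
    """log p_Z(z) for the (hypothetical) encoder f(x) = scale * x + shift."""
    x = (z - shift) / scale                       # f^{-1}(z)
    log_det_inv = -np.sum(np.log(np.abs(scale)))  # log |det d f^{-1}(z) / dz|
    return gaussian_log_density(x, mean, std) + log_det_inv

rng = np.random.default_rng(0)
mean, std = np.array([0.5, -1.0]), np.array([1.0, 2.0])
scale, shift = np.array([2.0, -0.5]), np.array([1.0, 3.0])
z = scale * rng.normal(mean, std, size=(5, 2)) + shift

via_formula = latent_log_density(z, scale, shift, mean, std)
# An affine map of a Gaussian is again Gaussian, which gives a closed form to compare against.
closed_form = gaussian_log_density(z, scale * mean + shift, np.abs(scale) * std)
print(np.allclose(via_formula, closed_form))
```

For this affine case the latent density is known in closed form, which makes it a convenient sanity check; for a learned flow only the formula-based route is available.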

We train the two encoders $f_0$ and $f_1$ to minimize the statistical distance $\Delta(p_{Z_0}, p_{Z_1})$, while at the same time training an auxiliary classifier which helps the learned representations remain informative for downstream prediction tasks. One issue with training this way is that the statistical distance is non-differentiable, as the optimal adversary $\mu^*$ makes a discrete thresholding decision, so we instead minimize a loss which is a smooth approximation of the statistical distance. After training is finished, we can evaluate the statistical distance exactly, without any approximation. Our guarantees assume that we know the input distributions $p_0$ and $p_1$, which in practice we usually have to estimate. You can find full details of how our guarantees change when the input distributions are estimated in our paper.
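To make the role of the optimal adversary $\mu^*$ concrete, here is a minimal sketch that evaluates the statistical distance when both latent densities can be computed. The two one-dimensional Gaussians stand in for $p_{Z_0}$ and $p_{Z_1}$, and the expectations are estimated from samples; these choices and the helper names are assumptions. The conversion to a maximum adversary accuracy, $(1 + \Delta)/2$, is the standard identity for two equally likely groups and is not spelled out in the post.

```python
import math
import numpy as np

def log_normal(z, mean, std):
    """Log density of a 1D Gaussian."""
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def statistical_distance(samples0, samples1, log_p0, log_p1):
    """Estimate Delta with the optimal adversary mu*(z) = 1 iff p0(z) <= p1(z)."""
    mu_on_z0 = (log_p0(samples0) <= log_p1(samples0)).mean()
    mu_on_z1 = (log_p0(samples1) <= log_p1(samples1)).mean()
    return abs(mu_on_z0 - mu_on_z1)

def max_balanced_accuracy(delta):
    """Best achievable adversary accuracy, assuming both groups are equally likely."""
    return 0.5 * (1.0 + delta)

rng = np.random.default_rng(0)
z0 = rng.normal(0.0, 1.0, size=200_000)
z1 = rng.normal(1.0, 1.0, size=200_000)
estimate = statistical_distance(
    z0, z1,
    lambda z: log_normal(z, 0.0, 1.0),
    lambda z: log_normal(z, 1.0, 1.0),
)
# For N(0, 1) versus N(1, 1) the distance is 2 * Phi(1/2) - 1 = erf(0.5 / sqrt(2)).
closed_form = math.erf(0.5 / math.sqrt(2.0))
print(estimate, closed_form, max_balanced_accuracy(estimate))
```

The key point is that $\mu^*$ only needs pointwise density evaluations, which is exactly what the bijective encoders provide.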

Experimental evaluation

We evaluated FNF on several standard datasets from the fairness literature. For each dataset, we take one of the features to be a sensitive attribute (e.g. race), and we train a model which balances accuracy, measuring how well it can predict a task label, against fairness, measuring how well it can remove the sensitive attribute from the learned representations.

Tradeoff between accuracy and fairness with FNF. For each dataset we measure classification accuracy and statistical distance, and show that FNF achieves a favorable tradeoff between these quantities.

The above figure shows results on the Law School, Crime and Heritage Health Prize datasets, with each point representing a single model with a different fairness-accuracy tradeoff. As a fairness metric, we measure the statistical distance introduced earlier. We can observe that for all datasets FNF can effectively balance fairness and accuracy. In general, the drop in accuracy is steeper for datasets where the task label is more correlated with the sensitive attribute. We provide more experimental results in our paper, including experiments with discrete datasets and a comparison with prior work.

Bounding adversarial accuracy. We show that FNF can reliably bound the maximum accuracy that any adversary can achieve when predicting the sensitive attribute from the encoded representations.

As mentioned earlier, FNF provides a provable upper bound on the maximum accuracy of an adversary trying to recover the sensitive attribute, for the estimated probability densities of the input distribution. We show our upper bound on the adversarial accuracy computed from the statistical distance using the estimated densities (diagonal dashed line), together with the adversarial accuracies obtained by training an adversary, a multilayer perceptron (MLP) with two hidden layers of 50 neurons, for each model from the figure. We can observe that FNF successfully bounds the accuracy of strong adversaries, even though the guarantees were computed on the estimated distributions. In our paper, we also experiment with using FNF for other tasks such as algorithmic recourse and transfer learning.

Summary

In this blog post, we introduced Fair Normalizing Flows (FNF), a new method for learning representations ensuring that no adversary can predict sensitive attributes at the cost of a small decrease in accuracy. Our key idea was to use an encoder based on normalizing flows which allows computing the exact likelihood in the latent space, given an estimate of the input density. Our experimental evaluation on several datasets showed that FNF effectively enforces fairness without significantly sacrificing utility, while simultaneously allowing interpretation of the representations and transferring to unseen tasks. For more details please see our ICLR 2022 paper.

The Optimal Privacy Attack on Federated Learning

2022-04-20T00:00:00+00:00

Federated learning has become the most widely used approach to collaboratively train machine learning models without requiring training data to leave the devices of individual users. In this setting, clients compute updates on their own devices and send them to a central server, which aggregates them and updates the global model. Because user data is not shared with the server or other users, this framework should, in principle, offer more privacy than simply uploading the data to a server. However, several works have shown that data can in fact be reconstructed from the gradient updates sent to the server. In our work, we investigate what the optimal attack for reconstructing data from gradients is.

Privacy in federated learning

The goal of federated learning is to train a model $h_\theta$ through a collaborative procedure involving different clients, without data leaving individual client devices. Typically $h_\theta$ is a neural network with parameters $\theta$, classifying an input $x$ to a label $y$. We assume that pairs $(x, y)$ come from a distribution $\mathcal{D}$. In the standard federated learning setting, there are $n$ clients with loss functions $l_1, \dots, l_n$, who try to jointly solve the optimization problem of finding parameters $\theta$ which minimize their average loss:

\[\begin{equation*} \min_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ l_i(h_\theta(x), y) \right]. \end{equation*}\]

In a single training step, each client $i$ first computes $\nabla_{\theta} l_i(h_\theta(x_i), y_i)$ on a batch of data $(x_i, y_i)$ and sends it to the central server, which performs a gradient descent step to obtain the new parameters $\theta' = \theta - \frac{\alpha}{n} \sum_{i=1}^n \nabla_{\theta} l_i(h_\theta(x_i), y_i)$, where $\alpha$ is the learning rate. We consider a scenario where each client reports, instead of the true gradient $\nabla_{\theta} l_i(h_\theta(x_i), y_i)$, a noisy gradient $g$ sampled from a distribution $p(g|x)$, which we call a defense mechanism. This setup is fairly general and captures common defenses such as DP-SGD. In this post, we are interested in the privacy issue of federated learning: can the input $x$ be recovered from the gradient update $g$? More specifically, we are interested in analyzing the Bayes optimal attack and connecting it to the attacks from prior work.
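The training step with a defense mechanism can be sketched as follows. This is a toy setup under stated assumptions: the model is linear regression with a squared loss, the defense $p(g|x)$ is additive Gaussian noise with standard deviation `sigma`, and all function names are hypothetical.

```python
import numpy as np

def client_gradient(theta, x, y):
    """Gradient of the batch-averaged squared loss 0.5 * (x @ theta - y)^2 w.r.t. theta."""
    residual = x @ theta - y
    return x.T @ residual / len(y)

def defended_gradient(theta, x, y, sigma, rng):
    """Sample g ~ p(g|x): here the true gradient plus Gaussian noise (an assumed defense)."""
    return client_gradient(theta, x, y) + sigma * rng.standard_normal(theta.shape)

def server_step(theta, reported, alpha):
    """theta' = theta - (alpha / n) * sum of the n reported gradients."""
    return theta - alpha * np.mean(reported, axis=0)

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    x = rng.standard_normal((32, 3))
    clients.append((x, x @ true_theta))

theta = np.zeros(3)
for _ in range(300):
    reported = [defended_gradient(theta, x, y, sigma=0.01, rng=rng) for x, y in clients]
    theta = server_step(theta, reported, alpha=0.1)
print(theta)
```

The server only ever sees the vectors in `reported`, which is exactly the information available to the adversary studied in the rest of the post.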

Bayesian framework for gradient leakage

To measure how well an adversary can reconstruct user data, we introduce the notion of adversarial risk for gradient leakage, and then derive the Bayes optimal adversary that minimizes this risk. The adversary can only observe the gradient $g$ and tries to reconstruct the input $x$ that produced $g$. Formally, the adversary is a function $f: \mathbb{R}^k \rightarrow \mathcal{X}$ mapping gradients to inputs. Given some $(x, g)$ sampled from the joint distribution $p(x, g)$, the adversary outputs the reconstruction $f(g)$ and incurs loss $\mathcal{L}(x, f(g))$, which is a function $\mathcal{L}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$.

The loss measures whether the adversary was able to reconstruct the original data. Typically, we will consider a binary loss that evaluates to 0 if the adversary’s output is close to the original input, and 1 otherwise. If the adversary only wants to get to some $\delta$-neighbourhood of the input $x$ in the input space, an appropriate definition of the loss is $\mathcal{L}(x, x') := 1_{||x - x'||_2 > \delta}$. This definition is well suited for image data, where $\ell_2$ distance captures our perception of visual similarity. We can now define the risk $R(f)$ of the adversary $f$ as

\[\begin{equation} R(f) := \mathbb{E}_{x, g} \left[ \mathcal{L}(x, f(g)) \right] = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{g \sim p(g|x)} \left[ \mathcal{L}(x, f(g)) \right]. \end{equation}\]

We can then manipulate this expression and show that $R(f) = 1 - \mathbb{E}_g \int_{B(f(g), \delta)} p(x|g) \,dx$. This allows us to work out the optimal adversary $f$ which minimizes the adversarial risk $R(f)$:

\[\begin{align} f(g) &= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} p(x|g) \,dx \nonumber \\ &= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} \frac{p(g|x)p(x)}{p(g)} \,dx \nonumber \\ &= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \nonumber \\ &= \underset{x_0 \in \mathcal{X}}{\operatorname{argmax}} \left[ \log \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \right] \label{eq:optadv} \end{align}\]
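For completeness, the expression for $R(f)$ used in this derivation follows by conditioning on $g$ and noting that the binary loss is one minus the indicator of the ball $B(f(g), \delta)$. A short derivation, assuming $p(x|g)$ is a density on $\mathcal{X}$:

\[\begin{align*} R(f) &= \mathbb{E}_{g} \, \mathbb{E}_{x \sim p(x|g)} \left[ 1_{||x - f(g)||_2 > \delta} \right] \\ &= \mathbb{E}_{g} \left[ 1 - \int_{B(f(g), \delta)} p(x|g) \,dx \right] \\ &= 1 - \mathbb{E}_{g} \int_{B(f(g), \delta)} p(x|g) \,dx. \end{align*}\]

Minimizing the risk therefore amounts to choosing, for each $g$ separately, the center $x_0 = f(g)$ whose $\delta$-ball carries the most posterior mass.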

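While the above provides a formula for the optimal adversary in the form of an optimization problem, using this adversary for practical reconstruction is computationally difficult. We therefore approximate it by applying Jensen’s inequality: $\log C \int_{B(x_0, \delta)} p(g|x)p(x) \,dx \geq C \int_{B(x_0, \delta)} (\log p(g|x) + \log p(x)) \,dx$, where $C = 1 / \mathrm{vol}(B(x_0, \delta))$ is the normalizing constant that turns the integral into an average over the ball. Since $C$ does not depend on $x_0$, it does not change the maximizer.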

Gradient leakage attack. The Bayes optimal adversary randomly initializes the image and then optimizes for the image which has the highest likelihood of being close to the original image that produced the gradient.

We can then approximate the integral by Monte Carlo sampling and optimize the objective using gradient descent. As shown in the figure above, the adversary can randomly initialize the input, and then optimize for the input which has the highest likelihood of being close to the original input that produced the gradient update $g$. Interestingly, we can now recover attacks from prior work by using different priors $p(x)$ and conditionals $p(g|x)$, meaning that attacks from prior work are different approximations of the Bayes optimal adversary. For example, DLG is recovered by using a uniform prior and a Gaussian conditional, another attack is recovered by using a total variation prior and a cosine conditional, while GradInversion uses a combination of total variation and DeepInversion priors together with a Gaussian conditional.
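The procedure can be sketched on a toy problem. The code below is a heavily simplified illustration, not the attack from the paper: the gradient map is replaced by a fixed linear map `A @ x` (a hypothetical stand-in, whereas a real attack would differentiate through the network), the prior is uniform so $\log p(x)$ is dropped, and the conditional is Gaussian with standard deviation `sigma`. The integral over the ball is estimated with `k` uniform samples per step.

```python
import numpy as np

def sample_ball(center, delta, k, rng):
    """Draw k points uniformly from the l2 ball B(center, delta)."""
    d = center.shape[0]
    directions = rng.standard_normal((k, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = delta * rng.uniform(size=(k, 1)) ** (1.0 / d)
    return center + radii * directions

def bayes_attack(g, A, delta, sigma, steps, k, rng):
    """Gradient ascent on the Monte Carlo estimate of the average of log p(g|x) over the ball."""
    x0 = rng.standard_normal(A.shape[1])           # random initialization
    lr = sigma ** 2 / np.linalg.norm(A, 2) ** 2    # step size that keeps the iteration stable
    for _ in range(steps):
        xs = sample_ball(x0, delta, k, rng)
        residuals = xs @ A.T - g                   # toy gradient map minus observed gradient
        # d/dx0 of mean_k [-||A x_k - g||^2 / (2 sigma^2)] with x_k = x0 + eps_k
        grad = -(residuals @ A).mean(axis=0) / sigma ** 2
        x0 = x0 + lr * grad
    return x0

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))                   # hypothetical linear "gradient map"
x_true = rng.standard_normal(5)
sigma, delta = 0.01, 0.05
g = A @ x_true + sigma * rng.standard_normal(20)   # observed defended gradient
x_rec = bayes_attack(g, A, delta, sigma, steps=500, k=16, rng=rng)
error = np.linalg.norm(x_rec - x_true)
print(f"reconstruction error: {error:.4f}")
```

Swapping the Gaussian log-likelihood for the one induced by another defense, or adding a $\log p(x)$ term, changes only the line that computes `grad`, which mirrors how different attacks arise from different choices of prior and conditional.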

Attacking existing defenses

In this experiment, we evaluate three recently proposed defenses, Soteria, ATS and PRECODE, against strong gradient leakage attacks. While these defenses can protect privacy against weaker attackers (as shown in the respective papers), we show that they do not hold up against strong attacks. Below we show our reconstructions for each defense, evaluated on the CIFAR-10 dataset.

Reconstructions on defended networks. We attack models defended by Soteria, ATS and PRECODE using strong attacks. Our reconstructions are very close to the original images, showing that these defenses do not reliably protect privacy, especially early in training.

Each defense introduces a different $p(g|x)$, which we use to derive an approximation of the Bayes optimal adversary. Our results indicate that these defenses do not reliably protect privacy under gradient leakage, especially in the early stages of training. This suggests that creating effective defenses and strong evaluation methods remains a key challenge. We provide a full description of these defenses and our attacks in our paper.

Comparing different attacks

In the next set of experiments we compare the Bayes optimal attack with several other, non-optimal attacks. We consider several different defenses, each of which induces a different distribution $p(g|x)$. One defense adds Gaussian noise, another randomly prunes some elements of the gradient and then adds Gaussian noise, and the third adds Laplacian noise after pruning. We measure the PSNR of the reconstructions, where a higher value means that the reconstruction is closer to the original image.
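The three defenses and the PSNR metric can be sketched as below. The pruning probability, the noise scales, applying noise to every entry after pruning, and the function names are all assumptions for illustration; the exact configurations are described in the paper. PSNR is computed with the usual definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$.

```python
import numpy as np

def gaussian_defense(grad, sigma, rng):
    """Add Gaussian noise to the gradient."""
    return grad + sigma * rng.standard_normal(grad.shape)

def prune(grad, prune_prob, rng):
    """Zero out each gradient entry independently with probability prune_prob."""
    keep = rng.uniform(size=grad.shape) >= prune_prob
    return grad * keep

def prune_gaussian_defense(grad, prune_prob, sigma, rng):
    """Randomly prune, then add Gaussian noise."""
    return gaussian_defense(prune(grad, prune_prob, rng), sigma, rng)

def prune_laplace_defense(grad, prune_prob, scale, rng):
    """Randomly prune, then add Laplacian noise."""
    return prune(grad, prune_prob, rng) + rng.laplace(0.0, scale, size=grad.shape)

def psnr(original, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB; undefined (infinite) for identical inputs."""
    mse = np.mean((original - reconstruction) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

rng = np.random.default_rng(0)
grad = np.ones(10_000)
pruned_fraction = np.mean(prune(grad, 0.5, rng) == 0.0)
noisy = prune_laplace_defense(grad, 0.5, 0.1, rng)
image = np.zeros((8, 8))
score = psnr(image, image + 0.1)
print(pruned_fraction, noisy.shape, score)
```

Each defense corresponds to a different $p(g|x)$: Gaussian noise leads to a squared $\ell_2$ log-likelihood, while Laplacian noise leads to an $\ell_1$ one, which is the structure an attack can exploit.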

Bayes optimal attack compared to other attacks. We consider several different defenses based on Gaussian noise, Laplacian noise, and pruning, and show that the Bayes optimal attack is typically the best.

We observe that the Bayes optimal adversary generally performs best, showing that the optimal attack needs to leverage the structure of the probability distribution of the gradients induced by the defense. Note that, in the case of the Gaussian defense, the $\ell_2$ and Bayes attacks are equivalent up to a constant factor, so they are expected to achieve similar results. In all other cases, the Bayes optimal adversary outperforms the other attacks. Overall, this experiment provides empirical support for our theory, confirming the practical utility of the Bayes optimal adversary.

Summary

In this blog post we considered the problem of privacy in federated learning and investigated the Bayes optimal adversary, which tries to reconstruct the original data from gradient updates. We derived the form of this adversary and showed that attacks proposed in prior work are different approximations of it. Experimentally, we showed that existing defenses do not protect against strong attackers, and that designing a good defense remains an open challenge. Furthermore, we showed that the Bayes optimal adversary is stronger than other attacks when it can exploit structure in the probability distributions, confirming our theoretical results. For more details, please check out our ICLR 2022 paper.