An Introduction to Adversarial Perturbation [Math Mondays]
The most undervalued trend in AI right now
Hey, it’s your favorite cult leader here 🦹♂️🦹♂️
Mondays are dedicated to theoretical concepts. We’ll cover ideas in Computer Science💻💻, math, software engineering, and much more. Use these days to strengthen your grip on the fundamentals and 10x your skills 🚀🚀.
I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction.
PS- We follow a “pay what you can” model, which allows you to support within your means, and support my mission of providing high-quality technical education to everyone for less than the price of a cup of coffee. Check out this post for more details and to find a plan that works for you.
It was recently revealed that Meta used pirated data from LibGen to train their Llama Language model-
This got me thinking about a few things-
In my article, “6 AI Trends that will Define 2025”, I selected Adversarial Perturbation as one of the major trends that people should be focusing on. This development makes me more bullish on that pick, since publishers will now also start looking to protect their IP.
1. AP has naturally good economics: the labor capable of doing AP is significantly underpriced, and it can be priced efficiently.
2. It has natural affinities with scale: a lot of the conversation around OG AP is about implementing it on edge devices like smartphones, which would both reduce inference costs for providers (since it runs locally) and force it to run on lower resources (it’s on a smartphone, not a cloud).
3. As a field, almost no one knows it, meaning it’s not crowded (valuations aren’t jacked up).
4. It has appeal to people traditionally not deeply entrenched in AI (even technical publishers like O’Reilly don’t have deep investments in AI themselves).
3 and 4 mean that we can open up a whole new market.
5. It has positive synergies with the Data Market (both real and Synthetic), which is primed to explode.
6. It picks a fight with the operating model of Silicon Valley. There’s no better way for an entrant to establish themselves than picking a fight with established players. Resource discrepancies aren’t as much of a problem with AP, so the fights are a lot more even than you’d assume.
I have to do more market research before a proper deep dive into the AP market and its factors, but given the combination of all these factors, AP is possibly the most undervalued sub-field in AI (if there’s a more undervalued field, I can’t think of it). Small investments in the right teams will yield massive returns.
Based on these thoughts, I thought it would be good to do a little introduction to Adversarial Perturbations. This will serve as a primer for the more comprehensive deep dive that we will do later. If you are someone interested in this space, please do reach out to me.
TLDR on Adversarial Perturbations
Adversarial Perturbations are a special kind of attack on Machine Learning Models. They add small changes to the input that seem harmless to humans (they’re often imperceptible), but they cause problems in Deep Learning Models.
This field of research has been around for a few years, originally created to break Deep Learning Classifiers for images (as a means of testing their robustness to the noisy nature of real-world data)-
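The canonical example is the Fast Gradient Sign Method (FGSM): take the gradient of the loss with respect to the input pixels and nudge every pixel slightly in the direction that increases the loss. Here is a minimal sketch, assuming a pretrained torchvision classifier and an input tensor scaled to [0, 1]; the epsilon value is illustrative-

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal FGSM-style sketch. Assumes `x` is an image tensor of shape
# [1, 3, 224, 224] with values in [0, 1]; the model and epsilon are
# illustrative stand-ins, not a specific paper's setup.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm_perturb(x, true_label, epsilon=0.03):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), torch.tensor([true_label]))
    loss.backward()
    # Nudge every pixel a tiny step in the direction that increases the loss.
    # The change is imperceptible to a human but can flip the prediction.
    return (x + epsilon * x.grad.sign()).detach().clamp(0, 1)

x = torch.rand(1, 3, 224, 224)          # stand-in for a real image
x_adv = fgsm_perturb(x, true_label=42)
```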
People have recently started exploring the impact of similar approaches for Language Models (both text-only and multimodal LMs) to see if users can undo their guardrails/safety training. This subfield of AP is often referred to as jailbreaking.
More practically, variants of these attacks can be used to induce hallucinations in models. For example, asking Gemini to describe the following image leads to the output- “The image you provided is a digital illustration featuring a dynamic and powerful female character, possibly a warrior or superheroine, with vibrant blue hair and striking red eyes. She is depicted in a battle-ready stance, wielding a sword with a determined expression. The background is a blend of abstract shapes and colors, adding to the overall energy and intensity of the scene. The style is reminiscent of anime and manga, with its exaggerated features and bold lines.”
Other times Gemini claims this image has airplanes, is an image of two people sparring, etc etc.
This is the style of attack that is most promising for large-scale adoption, since it means that a few corrupted samples can ruin a whole dataset (including data from other sources). The risk of this would deter the large-scale scraping currently done to train LLMs.
I think AP will explode in popularity b/c it is particularly relevant to a few different groups, each of which has very wide distribution-
Artists/Creators- They have been extremely critical of AI Companies taking their data for training w/o any compensation. More and more people are starting to look for ways to fight back and have started learning about AP. One of the groups I’m talking to is a coalition of artists with millions of followers (and hundreds of millions of impressions) working on building an app that will make attacks easily accessible to everyone. They found me, which is a clear indication of how strong the desire for such a technology is.
News Media + Reddit- NYT and other media groups would benefit from AP as a tool to prevent scraping and training on their data without payment first.
Policy Makers- Given the often attention-grabbing nature of the attacks, I think a lot of political decision-makers will start mandating AP testing into AI systems. I can’t divulge too much right now, but I’ve had conversations with some very high-profile decision-makers exploring this.
Publishers- Publication Houses can’t be too happy about their data being used to train models with no compensation to them. A publishing house could corrupt all their material to make it unusable for scraping, and then make a separate clean set available for training after payment.
Counter Perspective- The biggest problem holding back AP is that it is very model/architecture-specific, and attacks don’t usually carry over well (especially the subtle ones). They are also difficult to discover regularly, which hurts its overall ranking. However, I’ve found some promising attacks that work well, and the interest in this outside of tech is growing rapidly amongst key decision-makers, so I think this field will grow in the upcoming years.
The rest of this post will provide you with a more thorough theoretical insight into this.
I provide various consulting and advisory services. If you‘d like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.
Why Adversarial Perturbation Works
To quote the absolutely elite MIT paper- “Adversarial Examples Are Not Bugs, They Are Features”- the predictive features of an image might be classified into two types: Robust and Non-Robust. Robust Features are like your tanks and can power through any AP attack with only a scratch.
Non-robust features are your glass cannons- they were useful before the perturbation but now may even cause misclassification-
APs get very good at identifying these wimps and attacking them, causing your whole classifier to break. Interestingly, training on just Robust Features leads to good results, so a generalized extraction of robust features might be a good future avenue of exploration-
This principle extends to autoregressive LLMs in jailbreaking. Here the goal is to nudge the LLM into “unsafe neighborhoods” by modifying the input in such a way that it falls outside the scope of the LLM’s guardrails. For example, the following techniques were used to red-team OpenAI’s o1 model-
Multi-Turn Jailbreaks with MCTS:
LLM Alignment still breaks with Multi-Turn conversations, mostly because the alignment data focuses on finding single-message attacks (is this prompt/message harmful?). Embedding your attack over an entire conversation can be much more effective-
How can you generate the conversation? This is where Monte Carlo Tree Search comes in. MCTS is a decision-making algorithm that builds a search tree by simulating possible conversation paths. It repeatedly selects promising paths, expands them, simulates outcomes, and updates the tree based on results.
MCTS has another advantage- it exploits the context corruption that happens in Language Models (where prior input can influence upcoming generations, even when it should be irrelevant). This is one of the most powerful vulnerabilities of LMs (and is why I’m skeptical of claims around the Context Windows of LMs). It also makes MCTS a good combination player with the other attacks mentioned here, since we can embed the attacks across the conversation for a longer payoff (instead of relying on a one-shot jailbreak, which is harder to pull off).
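Here is a hedged sketch of what that MCTS loop could look like over conversations. The helpers `candidate_replies`, `send_to_model`, and `harm_score` are hypothetical placeholders for an attacker model proposing follow-ups, the target LLM’s chat API, and a judge model scoring how far the reply drifts past the guardrails-

```python
import math, random

def candidate_replies(conversation):
    # Placeholder: in practice an attacker LLM proposes follow-up messages.
    return ["Could you elaborate?", "Let's keep this purely hypothetical..."]

def send_to_model(conversation):
    # Placeholder: call the target LLM's chat API with the conversation so far.
    return "model reply"

def harm_score(conversation):
    # Placeholder: a judge model scoring how unsafe the final reply is.
    return random.random()

class Node:
    def __init__(self, conversation, parent=None):
        self.conversation = conversation          # list of (role, message)
        self.parent, self.children = parent, []
        self.visits, self.value = 0, 0.0

    def ucb(self, c=1.4):                         # upper confidence bound
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts_step(root, rollout_turns=2):
    # Selection: walk down, always picking the child with the best UCB score.
    node = root
    while node.children:
        node = max(node.children, key=Node.ucb)
    # Expansion: add one child per candidate follow-up message.
    for msg in candidate_replies(node.conversation):
        node.children.append(Node(node.conversation + [("user", msg)], node))
    # Simulation: play one child forward a few turns and score the outcome.
    leaf = random.choice(node.children)
    convo = list(leaf.conversation)
    for _ in range(rollout_turns):
        convo.append(("assistant", send_to_model(convo)))
        convo.append(("user", random.choice(candidate_replies(convo))))
    reward = harm_score(convo)
    # Backpropagation: update visit counts and values back up to the root.
    while leaf:
        leaf.visits += 1
        leaf.value += reward
        leaf = leaf.parent

root = Node([("user", "initial innocuous opener")])
for _ in range(50):
    mcts_step(root)
```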
Bijection Learning:
In math, a bijection is a one-to-one mapping between two sets of things.
In our case, we map our natural words to a different set of symbols or words. The goal of such an attack is to bypass the filters by making the protected LLM input and output the code. This allows us to work around the safety embeddings (which are based on tokens as seen in natural language and their associations with each other).
For such an attack, finding the right code complexity is key: too simple, and the LLM’s safety filters will catch it; too complex, and the LLM won’t understand the code or will make mistakes. That being said, this is my favorite technique here b/c it’s straightforward and can lead to an infinite number of attacks for low cost.
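To make the idea concrete, here is a minimal sketch assuming the simplest possible mapping (a shuffled letter-to-letter cipher); the paper’s mappings can be far richer, but the mechanics are the same-

```python
import string, random

# Minimal bijection-style encoding: a fixed one-to-one mapping from letters
# to other letters. The "complexity knob" is how much gets remapped.

def make_bijection(seed=0):
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text, mapping):
    return "".join(mapping.get(c, c) for c in text.lower())

def decode(text, mapping):
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(c, c) for c in text)

mapping = make_bijection()
coded = encode("describe the payload here", mapping)
# The attack prompt first teaches the model the mapping through examples,
# then asks the question in code and requests the answer in code as well.
print(coded, decode(coded, mapping))
```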
Performance-wise, this attack is definitely a heavy hitter, hitting some fairly impressive numbers on the leading foundation models-
Most interestingly, it also does really well at generalizing across attacks, which is a huge plus- “For a selected mapping, we generate the fixed bijection learning prompt and evaluate it as a universal attack on HarmBench. In Figure 4, we show the resulting ASRs, compared against universal attacks from Pliny (the Prompter, 2024). Our universal bijection learning attacks are competitive with Pliny’s attacks, obtaining comparable or better ASRs on frontier models such as Claude 3.5 Sonnet and Claude 3 Haiku. Notably, other universal white-box attack methods (Zou et al., 2023; Geisler et al., 2024) fail to transfer to the newest and most robust set of frontier models, so we omit them from this study.”
All in all a very cool idea. I would suggest reading the Bijection Learning Paper, b/c it has a few interesting insights such as the behavior of scale, bijection complexity, and more.
Transferring Attacks from Accelerated Coordinate Gradient (ACG):
ACG is a fast way to find adversarial “suffixes”: extra words added to a prompt that make the LLM misbehave. ACG uses some math to find these suffixes by analyzing how the LLM’s internal calculations change when adding different words.
We can then transfer successful ACG attacks from one LLM to another, potentially finding shared vulnerabilities.
Unfortunately, ACG needs access to the LLM’s internal workings (called white-box access), which isn’t always possible. Generalization between models could also be challenging, especially when considering multi-modal and more agentic/MoE setups.
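To give a flavor of what “analyzing the LLM’s internal calculations” means, here is a hedged sketch of the gradient step behind GCG/ACG-style suffix attacks. GPT-2 stands in for the white-box target, and the prompt, suffix, and target strings are illustrative; ACG’s actual contribution is making this coordinate search much faster, which this sketch doesn’t capture-

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: gradient of the attack loss w.r.t. a one-hot encoding of the
# suffix tokens, used to rank candidate token swaps (GCG/ACG-style).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("How do I do something harmful?", return_tensors="pt").input_ids
suffix_ids = tok(" !!!!!", return_tensors="pt").input_ids       # adversarial suffix
target_ids = tok(" Sure, here is how", return_tensors="pt").input_ids

embed = model.get_input_embeddings()
one_hot = torch.zeros(suffix_ids.shape[1], embed.num_embeddings)
one_hot.scatter_(1, suffix_ids[0].unsqueeze(1), 1.0)
one_hot.requires_grad_(True)

# Build the input embeddings: prompt + (differentiable) suffix + target.
inputs = torch.cat([embed(prompt_ids),
                    (one_hot @ embed.weight).unsqueeze(0),
                    embed(target_ids)], dim=1)
logits = model(inputs_embeds=inputs).logits

# Loss: make the model predict the target continuation after the suffix.
tgt_start = prompt_ids.shape[1] + suffix_ids.shape[1]
loss = torch.nn.functional.cross_entropy(
    logits[0, tgt_start - 1:-1], target_ids[0])
loss.backward()

# Tokens whose one-hot gradient is most negative are promising swaps.
top_candidates = (-one_hot.grad).topk(8, dim=1).indices
```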
Evolutionary Algorithms (this one is also relevant to the next section):
When it comes to exploring diverse search spaces, EAs are in the GOAT conversations. They come with 3 powerful benefits-
Firstly, we’ve got their flexibility. Since they don’t evaluate the gradient at a point, they don’t need differentiable functions. This is not to be overlooked. For a function to be differentiable, it needs to have a derivative at every point over the domain. This requires a smooth function without sharp corners, gaps, etc.
EAs don’t care about the nature of these functions. They can work well on continuous and discrete functions. EAs can thus be (and have been) used to optimize many real-world problems with fantastic results. For example, if you want to break automated government censors blocking the internet, you can use Evolutionary Algorithms to find attacks. Gradient-based techniques like Neural Networks fail here since attacks have to chain 4 basic commands (and thus the search space is discrete)-
This is backed by some very powerful performance. The authors of the One Pixel Attack paper fool Deep Neural Networks trained to classify images by changing only one pixel in the image. The team uses Differential Evolution to optimize since DE “Can attack more types of DNNs (e.g. networks that are not differentiable or when the gradient calculation is difficult).” And the results speak for themselves. “On Kaggle CIFAR-10 dataset, being able to launch non-targeted attacks by only modifying one pixel on three common deep neural network structures with 68.71%, 71.66% and 63.53% success rates.”
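For a feel of how this works without gradients, here is a minimal sketch of a one-pixel attack driven by SciPy’s differential evolution; the `classifier` below is a hypothetical placeholder for any model that returns class probabilities-

```python
import numpy as np
from scipy.optimize import differential_evolution

def classifier(image):
    # Placeholder: swap in a real model's predict function (10 class probs).
    rng = np.random.default_rng(int(image.sum() * 1e6) % (2**32))
    return rng.dirichlet(np.ones(10))

def one_pixel_loss(candidate, image, true_label):
    x, y, r, g, b = candidate
    perturbed = image.copy()
    perturbed[int(x), int(y)] = [r, g, b]
    # Minimizing the true-class probability == a non-targeted attack.
    return classifier(perturbed)[true_label]

image = np.random.rand(32, 32, 3)                    # stand-in for a CIFAR-10 image
bounds = [(0, 31), (0, 31), (0, 1), (0, 1), (0, 1)]  # pixel coords + RGB values
result = differential_evolution(
    one_pixel_loss, bounds, args=(image, 0), maxiter=20, popsize=15, seed=0)
print("best pixel perturbation:", result.x, "true-class prob:", result.fun)
```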
Google’s AI blog, “AutoML-Zero: Evolving Code that Learns”, uses EAs to create Machine Learning algorithms. The way EAs chain together simple components is art-
Another Google publication, “Using Evolutionary AutoML to Discover Neural Network Architectures” shows us that EAs can even outperform Reinforcement Learning on search-
For our purposes, we can configure EAs to work in the following way-
Create a Population of Prompts: We start with a random set of prompts (here, the prompt could be a single message OR a multi-turn conversation and can be multimodal).
Evaluate Fitness: We test each prompt on the LLM and see how well it does at jailbreaking it. This evaluator could probably be transferred from the MCTS set-up, saving some costs.
Select and Reproduce: We keep the best-performing prompts and “breed” them to create new prompts (by combining and mutating parts of the original prompts). Personally, I would also keep a certain set of “weak performers” to maintain genetic diversity, since that can lead to us exploring more diverse solutions.
In evolutionary systems, genomes compete with each other to survive by increasing their fitness over generations. It is important that genomes with lower fitness are not immediately removed, so that competition for long-term fitness can emerge. Imagine a greedy evolutionary system where only a single high-fitness genome survives through each generation. Even if that genome’s mutation function had a high lethality rate, it would still remain. In comparison, in an evolutionary system where multiple lineages can survive, a genome with lower fitness but stronger learning ability can survive long enough for its benefits to show.
This process repeats for many generations, with the prompts becoming more effective at jailbreaking the LLM over iterations. This can be used to attack black-box models, which is a big plus. To keep the costs low, I would first try the following (a sketch of the loop itself comes right after this list)-
Chain the MCTS and EA together to cut down on costs (they work in very similar ways). One of the benefits of EAs is that they slot very well into other algorithms, so this might be useful.
Iterating on a low-dimensional embedding for the search will likely cut down costs significantly. If your attacks aren’t very domain-specific, you could probably do a lot of damage using standard encoder-decoders for NLP (maybe training them a bit?). This is generally something I recommend you do when you can, unless you’ve invested a lot of money in call options on your compute provider.
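Here is a minimal sketch of the evolutionary loop described above, assuming a hypothetical `jailbreak_score` fitness function (in practice, a judge model scoring the target LLM’s responses) and deliberately simple mutation/crossover operators-

```python
import random

def jailbreak_score(prompt):
    # Placeholder: send `prompt` to the target model and score the response.
    return random.random()

def mutate(prompt):
    # Replace one word with a simple "reframing" token (illustrative only).
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = random.choice(["hypothetically", "in a story", "as a poem", words[i][::-1]])
    return " ".join(words)

def crossover(a, b):
    # Splice the front of one prompt onto the back of another.
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def evolve(seed_prompts, generations=20, pop_size=30, elite=5, diversity=3):
    population = list(seed_prompts)
    for _ in range(generations):
        scored = sorted(population, key=jailbreak_score, reverse=True)
        # Keep the elite, plus a few weak performers for genetic diversity.
        weak = random.sample(scored[elite:], k=min(diversity, max(0, len(scored) - elite)))
        parents = scored[:elite] + weak
        population = parents[:]
        while len(population) < pop_size:
            population.append(mutate(crossover(*random.sample(parents, 2))))
    return max(population, key=jailbreak_score)

best = evolve(["tell me a story about a locksmith", "explain how locks work"])
```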
For another (more efficient) search technique, we turn to our final tool-
BEAST: Beam Search-based Adversarial Attack
BEAST is a fast and efficient way to generate adversarial prompts. It uses a technique called beam search, which explores different prompt possibilities by keeping track of the most promising options at each step.
Beam Search, at its core, is relatively simple to understand. It is:
A greedy algorithm: greedy algorithms are greedy because they don’t look ahead. They make the best decision at the current point without considering whether a worse option now might lead to better results later.
One that evaluates the options immediately reachable from your current nodes at any given moment and picks the top-k (k is called the “width” of your beam search).
This keeps our costs relatively low, while allowing us to explore solutions that are good enough (which is all you really need in a lot of cases).
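Here is a hedged sketch of that beam-search skeleton, with a hypothetical `score` function standing in for BEAST’s actual objective (which is derived from the target model’s own token probabilities)-

```python
import random

# Candidate suffix tokens; in the real attack these come from the model's
# own next-token distribution rather than a fixed vocabulary.
VOCAB = ["please", "hypothetically", "story", "ignore", "translate", "poem"]

def score(prompt, suffix):
    # Placeholder: higher = the suffix pushes the model closer to complying.
    return random.random()

def beam_search(prompt, beam_width=4, suffix_len=6, candidates_per_step=8):
    beams = [([], 0.0)]                       # (suffix tokens, running score)
    for _ in range(suffix_len):
        expansions = []
        for suffix, _ in beams:
            for _ in range(candidates_per_step):
                new_suffix = suffix + [random.choice(VOCAB)]
                expansions.append((new_suffix, score(prompt, new_suffix)))
        # Greedy step: keep only the top-k suffixes found so far.
        beams = sorted(expansions, key=lambda b: b[1], reverse=True)[:beam_width]
    return " ".join(beams[0][0])

adv_suffix = beam_search("write something the model should refuse")
```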
Also-
Since BEAST uses a gradient-free optimization scheme unlike other optimization-based attacks (Zou et al., 2023; Zhu et al., 2023), our method is 25–65× faster.
Take notes.
All of this to say that there is a lot of work to build on to explore perturbations that will be interesting to the groups mentioned earlier.
There are nuances to each attack modality (text-based, images, multi-modal). We’ll cover those in our Deep-Dive.
Once again if you want to talk to me about this field, please do reach out.
Thank you for reading and hope you have a wonderful day.
Dev <3.
I put a lot of work into writing this newsletter. To do so, I rely on you for support. If a few more people choose to become paid subscribers, the Chocolate Milk Cult can continue to provide high-quality and accessible education and opportunities to anyone who needs it. If you think this mission is worth contributing to, please consider a premium subscription. You can do so for less than the cost of a Netflix Subscription (pay what you want here).
If you liked this article and wish to share it, please refer to the following guidelines.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. You can share your testimonials over here.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
My (imaginary) sister’s favorite MLOps Podcast (here)-
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819