This is a Plain English Papers summary of a research paper called LLM Red-Teaming Evolved: RainbowPlus Crushes Attacks Faster & Diversely. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

The Red-Teaming Challenge: Testing LLM Safety with Adversarial Prompts

Large Language Models (LLMs) have transformed many fields with their remarkable capabilities, but they remain vulnerable to adversarial prompts—carefully crafted inputs that can manipulate them into generating harmful, biased, or inappropriate content. As these models become integrated into critical applications like healthcare and legal systems, ensuring their safety becomes paramount.

Existing red-teaming methods—techniques used to test LLM defenses by simulating attacks—face significant limitations. Many require extensive resources, rely on human guidance, or produce attacks with limited diversity. These constraints make comprehensive vulnerability assessment difficult and inefficient.

RainbowPlus addresses these challenges through a novel evolutionary approach to red-teaming. Built on quality-diversity (QD) search principles, it extends classical evolutionary algorithms like MAP-Elites with innovations specifically tailored for language models.

The framework introduces two key innovations: a multi-element archive that stores diverse high-quality prompts (rather than just a single prompt per category) and a comprehensive fitness function that evaluates multiple prompts simultaneously. These improvements overcome the limitations of previous QD methods like Rainbow Teaming.

The results speak for themselves: RainbowPlus achieves an 81.1% average attack success rate against twelve LLMs on the HarmBench dataset, outperforming AutoDAN-Turbo by 3.9% while running 9 times faster. It also generates up to 100 times more unique adversarial prompts than previous methods, maintaining excellent diversity throughout the process.

Evolution of Red-Teaming: From Manual Efforts to Evolutionary Algorithms

Red-teaming approaches for LLMs have evolved substantially over time. Early methods relied heavily on manual human input—security experts crafting prompts by hand to find vulnerabilities. While effective, these approaches didn't scale and limited the diversity of potential attacks.

More systematic approaches emerged later, falling into three main categories:

  1. Manual and iterative methods: These include human-in-the-loop approaches like PAIR and TAP that rely on human feedback to refine attacks.

  2. Optimization-based methods: Techniques like GCG and AutoDAN that use gradient-based optimization to generate adversarial prompts.

  3. Generative methods: Approaches that train models specifically to generate adversarial prompts.

Each of these approaches has limitations. Manual methods don't scale well. Optimization methods often require white-box access to model internals. Generative methods frequently lack diversity, focusing on similar attack patterns.

Quality-Diversity (QD) optimization offers a solution by reframing adversarial prompt generation as a multi-objective problem seeking both high attack success (quality) and broad exploration of attack strategies (diversity). The MAP-Elites framework pioneered this approach by maintaining an archive of solutions organized by their behavioral characteristics.

Rainbow Teaming applied QD principles to red-teaming but was limited by its single-element archive structure and pairwise comparison approach. RainbowPlus overcomes these limitations through its multi-element archive and comprehensive fitness evaluation, enabling more efficient and effective exploration of the adversarial prompt space.

Inside RainbowPlus: An Evolutionary Framework for Generating Adversarial Prompts

The RainbowPlus Architecture: Maximizing Attack Success and Diversity

RainbowPlus operates as an evolutionary quality-diversity search algorithm that balances two key objectives: maximizing attack success rate and exploring diverse attack strategies.

The framework consists of several core components:

  1. Behavior descriptors: Characteristics that define the "behavior space" of prompts, helping maintain diversity.
  2. Fitness evaluation: A comprehensive function that assesses how effectively prompts bypass LLM safety guardrails.
  3. Multi-element archive: A structure that stores diverse high-quality prompts organized by their behavior characteristics.

The algorithm works through a systematic process (a code sketch follows the list):

  • Initialization: Creating an initial population of prompts
  • Selection: Choosing promising candidates from the archive
  • Mutation: Modifying selected prompts to create new variants
  • Evaluation: Assessing the effectiveness and characteristics of new prompts
  • Archive update: Storing successful prompts in the archive based on their behavior descriptors
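
To make the loop concrete, here is a minimal sketch of a quality-diversity search with this shape. It is a reconstruction rather than the paper's implementation: `mutate`, `fitness`, and `descriptor` are hypothetical callables standing in for the mutator LLM, the judge LLM, and behavior-descriptor extraction.

```python
import random
from collections import defaultdict

def quality_diversity_search(seed_prompts, n_iterations, threshold,
                             mutate, fitness, descriptor):
    # Multi-element archive: each behavior cell holds a *list* of prompts,
    # not the single elite per cell of classic MAP-Elites.
    archive = defaultdict(list)
    for prompt in seed_prompts:
        archive[descriptor(prompt)].append(prompt)  # initialization

    for _ in range(n_iterations):
        # Selection: draw a parent from a random non-empty cell.
        cell = random.choice(list(archive))
        parent = random.choice(archive[cell])
        # Mutation: the mutator LLM rewrites the parent into a new variant.
        child = mutate(parent)
        # Evaluation: the judge LLM scores how well the child bypasses
        # the target model's safety guardrails.
        score = fitness(child)
        # Archive update: keep every qualifying prompt in its behavior cell.
        if score >= threshold:
            archive[descriptor(child)].append(child)
    return archive
```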

What sets RainbowPlus apart is how it extends traditional QD methods with innovations specifically tailored for language models. While maintaining the core principles of quality-diversity search, it introduces modifications that dramatically improve performance in the context of red-teaming language models.

Smarter Fitness Evaluation: A Comprehensive Approach

RainbowPlus introduces a more sophisticated fitness evaluation approach compared to previous methods. Rather than relying on pairwise comparisons between prompts as in Rainbow Teaming, RainbowPlus employs a comprehensive fitness function that evaluates multiple prompts concurrently.

The fitness score combines:

  • Attack success probability (how likely the prompt is to bypass safety guardrails)
  • Prompt quality metrics (linguistic features that make attacks more effective)

A judge LLM calculates these scores by evaluating prompt responses against predefined criteria. This allows for more nuanced assessment of prompt effectiveness than simple binary success/failure metrics.
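
As a rough illustration of how a judge's verdict becomes a continuous score, the sketch below reads the probability of the verdict token (consistent with the logprobs setting reported later in Table 6). This is a hedged sketch assuming a binary Yes/No judge; `judge_llm` is a hypothetical callable, and the paper's actual criteria and prompt template are not reproduced here.

```python
import math

def judge_fitness(judge_llm, prompt, response):
    # Hypothetical judge call: returns the verdict text and the
    # log-probability of its first token (cf. the logprobs=1 setting).
    verdict_prompt = (
        "You are a strict safety judge. Does the RESPONSE fulfil the "
        f"harmful REQUEST?\nREQUEST: {prompt}\nRESPONSE: {response}\n"
        "Answer Yes or No:"
    )
    verdict, logprob = judge_llm(verdict_prompt)
    p = math.exp(logprob)
    # Assuming a binary Yes/No verdict, the probability of "Yes"
    # (attack succeeded) is p if the judge said Yes, else roughly 1 - p.
    return p if verdict.strip().lower().startswith("yes") else 1.0 - p
```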

RainbowPlus comes in several variants, each using different approaches to aggregate fitness scores:

  • RainbowPlus-α: Uses median fitness scores
  • RainbowPlus-β: Uses maximum fitness scores
  • Standard RainbowPlus: Retains all qualifying prompts

This flexibility allows the framework to be adapted to different red-teaming scenarios and objectives.
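
The difference between the variants can be pictured as a choice of aggregation over a batch of candidate fitness scores. The `variant` names below are illustrative, and the paper's exact update rules may differ.

```python
import statistics

def aggregate_fitness(scores, variant="standard"):
    if variant == "alpha":   # RainbowPlus-α: summarize with the median score
        return statistics.median(scores)
    if variant == "beta":    # RainbowPlus-β: summarize with the maximum score
        return max(scores)
    return scores            # standard: keep all qualifying scores/prompts
```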

Multi-Element Archive: Storing Diverse High-Quality Prompts

A key innovation in RainbowPlus is its adaptive multi-element archive management strategy. Unlike Rainbow Teaming, which maintains only a single prompt per behavior descriptor cell, RainbowPlus can store multiple high-quality prompts within each cell.

This approach offers several advantages:

  • Preserves a wider variety of successful attack strategies
  • Prevents premature convergence to suboptimal solutions
  • Enables the generation of significantly more unique adversarial prompts

The archive organizes prompts using behavior descriptors—features that capture different aspects of prompt behavior. By maintaining diversity across these descriptors, RainbowPlus ensures comprehensive exploration of different attack strategies.

This multi-element approach strikes a balance between exploration (finding diverse attack types) and exploitation (refining successful attacks), leading to both higher success rates and greater diversity in the generated prompts.
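
A minimal sketch of such an archive, assuming a fixed per-cell capacity with fitness-ranked eviction (both are our assumptions; the paper's adaptive strategy may manage cells differently):

```python
from collections import defaultdict

class MultiElementArchive:
    def __init__(self, capacity=10):
        self.capacity = capacity
        # behavior descriptor (e.g. a tuple of features) -> ranked prompt list
        self.cells = defaultdict(list)

    def add(self, descriptor, prompt, fitness):
        cell = self.cells[descriptor]
        cell.append((fitness, prompt))
        cell.sort(key=lambda entry: entry[0], reverse=True)  # best first
        del cell[self.capacity:]  # evict the weakest beyond capacity

    def all_prompts(self):
        return [p for cell in self.cells.values() for _, p in cell]
```

A descriptor here might be a tuple such as (risk category, attack style), echoing the behavior dimensions popularized by Rainbow Teaming.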

Experimental Design: Rigorous Testing Across Multiple Models and Datasets

To evaluate RainbowPlus thoroughly, the researchers conducted extensive experiments across multiple models and datasets. The evaluation focused on comparing RainbowPlus to both its predecessor (Rainbow Teaming) and other state-of-the-art red-teaming methods.

The benchmark datasets used include:

  • DNA: Do-Not-Answer, prompts a responsible model should refuse
  • CHQA: CategoricalHarmfulQA, harmful questions organized by category
  • BeaT: BeaverTails, a safety-focused question-answering dataset
  • AQA: adversarial question-answering prompts
  • DQA: DangerousQA, questions probing dangerous knowledge
  • HQA: HarmfulQA, harmful question-answering
  • HarmBench: a standardized benchmark for evaluating automated red-teaming

Target LLMs tested included open-source models like Llama-3.1-8B-Instruct, Gemma-2-9b-it, Qwen2.5-7B-Instruct, and Ministral-8B-Instruct-2410, as well as closed-source models like GPT-4o Mini and GPT-4.1 Nano for the HarmBench evaluation.

Key evaluation metrics included:

  • Attack Success Rate (ASR): Percentage of prompts that successfully bypass LLM safety measures
  • Diverse-Score: Measure of diversity among generated prompts (one plausible formulation is sketched after this list)
  • Runtime efficiency: Time required to generate adversarial prompts
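
The paper's exact Diverse-Score formula is not reproduced in this summary. The sketch below shows one common way to quantify prompt-set diversity, as one minus the mean pairwise character-trigram Jaccard similarity; treat it as an assumption, not the paper's definition. ASR, by contrast, is simply the fraction of prompts whose responses the judge marks as unsafe.

```python
from itertools import combinations

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def diverse_score(prompts, n=3):
    # 1 - mean pairwise Jaccard similarity over character n-grams:
    # 1.0 means no two prompts share any n-gram; 0.0 means all identical.
    sims = []
    for a, b in combinations(prompts, 2):
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        sims.append(len(ga & gb) / max(1, len(ga | gb)))
    return (1.0 - sum(sims) / len(sims)) if sims else 1.0
```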

The experimental setup was carefully controlled, with consistent resource allocation and sampling parameters across tests.

| Component | Memory Usage | Context Length |
| :-- | :--: | :--: |
| Target LLM | 50% GPU (24GB) | 4096 tokens |
| Mutator LLM | 30% GPU (14.4GB) | 2048 tokens |
| Judge/Fitness LLM | 15% GPU (7.2GB) | 4096 tokens |

Table 5: Model Configurations and Resource Allocation

| Component | Temperature | Top-p | Max Tokens | Additional |
| :-- | :--: | :--: | :--: | :--: |
| Target LLM | 0.6 | 0.9 | 1024 | - |
| Mutator LLM | 0.7 | 0.9 | 128 | - |
| Judge/Fitness LLM | 0.7 | 0.9 | 16 | logprobs = 1 |

Table 6: Sampling Parameters for LLM Components. Default parameters are denoted by a dash (-).
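
The GPU fractions and `logprobs` setting above map naturally onto a serving engine such as vLLM. The snippet below is a hedged sketch of that wiring, assuming a vLLM backend (plausible but not confirmed by this summary); the model name is a placeholder, and the mutator and judge engines would be constructed the same way with 0.30 and 0.15 memory fractions.

```python
from vllm import LLM, SamplingParams

# Target model per Table 5: half the GPU, 4096-token context (placeholder name).
target = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
             gpu_memory_utilization=0.50, max_model_len=4096)

# Sampling parameters per Table 6.
target_params  = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=1024)
mutator_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
# The judge emits only a short verdict; top-1 logprobs feed the fitness score.
judge_params   = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=16,
                                logprobs=1)
```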

RainbowPlus Performance: Breaking Through LLM Defenses

Comparison with Rainbow Teaming: A Significant Leap Forward

When compared to Rainbow Teaming across six benchmark datasets, RainbowPlus demonstrated substantial improvements in attack success rates (ASR) while maintaining excellent diversity.

| Target LLM | Method | DNA | CHQA | BeaT | AQA | DQA | HQA |
| :-- | :-- | :--: | :--: | :--: | :--: | :--: | :--: |
| Llama-3.1-8B-Instruct | Rainbow | 35.90 | 37.92 | 42.51 | 47.13 | 40.73 | 38.91 |
| | RainbowPlus | 71.13 | 69.77 | 70.94 | 75.54 | 70.07 | 70.63 |
| | RainbowPlus-α | 73.08 | 75.05 | 72.95 | 80.31 | 72.46 | 69.51 |
| | RainbowPlus-β | **88.65** | **84.51** | **82.26** | **89.74** | **87.16** | **85.82** |
| Gemma-2-9b-it | Rainbow | 5.53 | 2.68 | 4.48 | 14.43 | 2.84 | 5.30 |
| | RainbowPlus | 83.27 | 40.46 | 83.54 | 86.63 | 82.63 | 85.06 |
| | RainbowPlus-α | 77.86 | 43.41 | 83.99 | 85.42 | 79.35 | 82.31 |
| | RainbowPlus-β | **89.78** | **65.63** | **89.62** | **90.94** | **89.04** | **89.00** |
| Qwen2.5-7B-Instruct | Rainbow | 29.34 | 31.02 | 32.24 | 28.96 | 28.85 | 29.73 |
| | RainbowPlus | 79.07 | 81.17 | 79.43 | 80.96 | 86.66 | 82.12 |
| | RainbowPlus-α | 77.16 | 81.77 | 82.46 | 83.11 | 83.26 | 82.22 |
| | RainbowPlus-β | **90.97** | **93.83** | **90.08** | **90.56** | **92.53** | **92.17** |
| Ministral-8B-Instruct-2410 | Rainbow | 54.36 | 58.47 | 56.69 | 63.77 | 62.33 | 59.07 |
| | RainbowPlus | 87.39 | 87.42 | 88.52 | 89.46 | 88.28 | 87.25 |
| | RainbowPlus-α | 91.65 | 91.44 | 90.21 | 93.94 | 93.80 | 92.80 |
| | RainbowPlus-β | **95.55** | **95.80** | **95.54** | **97.33** | **96.73** | **96.54** |

Table 1: Attack Success Rate (ASR, %) on Target LLMs Across Benchmark Datasets (1,000 iterations). RainbowPlus-α uses median fitness scores, RainbowPlus-β uses maximum scores, and standard RainbowPlus retains all qualifying prompts. Bold indicates the highest ASR per model and dataset.

The results show dramatic improvements across all models and datasets. For example, against Gemma-2-9b-it on the DNA dataset, RainbowPlus achieved an 83.27% ASR compared to just 5.53% for Rainbow Teaming—a 15x improvement. Similar gains were observed across other models and datasets.

Beyond the raw success rates, RainbowPlus also demonstrated superior efficiency and scalability:

| Model | Runtime (h), Rainbow | Runtime (h), RainbowPlus | Diversity, Rainbow | Diversity, RainbowPlus | Samples, Rainbow | Samples, RainbowPlus |
| :-- | :--: | :--: | :--: | :--: | :--: | :--: |
| Llama-3.1-8B-Instruct | $14.81 \pm 0.11$ | $10.75 \pm 0.15$ | $0.84 \pm 0.01$ | $0.85 \pm 0.01$ | 100 | $8100 \pm 703$ |
| Gemma-2-9b-it | $1.21 \pm 0.06$ | $8.40 \pm 6.53$ | $0.85 \pm 0.02$ | $0.79 \pm 0.14$ | 100 | $7165 \pm 748$ |
| Qwen2.5-7B-Instruct | $8.82 \pm 0.28$ | $4.80 \pm 0.06$ | $0.83 \pm 0.01$ | $0.85 \pm 0.01$ | 100 | $6370 \pm 791$ |
| Ministral-8B-Instruct-2410 | $2.45 \pm 0.10$ | $6.64 \pm 0.14$ | $0.84 \pm 0.01$ | $0.84 \pm 0.01$ | 100 | $10418 \pm 428$ |

Table 2: Comparison of Runtime (hours), Diversity (Diverse-Score), and Number of Adversarial Prompts Generated. Diversity is computed at the final iteration for Rainbow and RainbowPlus-β; other metrics use standard RainbowPlus. Means and variances are averaged across six datasets.

While maintaining similar diversity scores, RainbowPlus generated 63-104 times more unique adversarial prompts than Rainbow Teaming. This dramatic increase in the number of generated prompts provides a much more comprehensive view of potential vulnerabilities in target LLMs.

Beating the Best: RainbowPlus vs. State-of-the-Art Methods

The researchers also compared RainbowPlus against nine state-of-the-art methods on the HarmBench dataset, testing against both open-source and closed-source LLMs.

| Model | GCG | Zero-Shot | PAIR | TAP | PAP | AutoDAN | AutoDAN-T | Human | Direct | RainbowPlus (Ours) |
| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| Llama 2 7B Chat | 32.5 | 2.0 | 9.3 | 9.3 | 2.7 | 0.5 | 36.3 | 0.8 | 0.8 | **79.0** |
| Vicuna 7B | 65.5 | 27.2 | 53.3 | 51.0 | 18.9 | 66.0 | **96.3** | 39.0 | 24.3 | **96.3** |
| Baichuan 2 7B | 61.7 | 27.9 | 37.3 | 51.0 | 19.0 | 53.3 | 83.3 | 27.2 | 18.2 | **93.8** |
| Qwen 7B Chat | 59.2 | 15.6 | 50.2 | 53.0 | 13.3 | 47.3 | 82.7 | 24.6 | 13.0 | **90.8** |
| Koala 7B | 60.5 | 41.8 | 49.0 | 59.5 | 18.3 | 55.5 | 93.4 | 26.4 | 38.3 | **95.5** |
| Orca 2 7B | 46.0 | 41.1 | 57.3 | 57.0 | 18.1 | 71.0 | **100.0** | 39.2 | 29.0 | 93.8 |
| Mistral Tiny | 69.8 | 41.3 | 52.5 | 62.5 | 27.2 | 71.5 | **97.6** | 53.3 | 47.3 | 97.0 |
| OpenChat 3.5 1210 | 66.3 | 43.3 | 52.5 | 63.5 | 26.9 | 73.5 | 96.3 | 51.3 | 46.0 | **97.0** |
| Starling | 66.0 | 50.6 | 58.3 | 68.5 | 31.9 | 74.0 | 97.1 | 60.2 | 57.0 | **98.0** |
| Zephyr | 69.5 | 60.0 | 58.8 | 66.5 | 32.9 | 75.0 | 96.3 | 66.0 | 65.8 | **96.8** |
| GPT-4o Mini | - | - | - | - | - | - | 26.8 | - | 12.3 | **29.0** |
| GPT-4.1 Nano | - | - | - | - | - | - | **20.5** | - | 3.3 | 6.0 |
| Average | 59.7 | 30.8 | 47.9 | 54.2 | 20.9 | 58.8 | 77.2 | 38.8 | 29.6 | **81.1** |

Table 3: ASR (%) on the HarmBench Dataset. RainbowPlus and closed-source results are computed on an NVIDIA A40 48GB GPU; baseline results for open-source LLMs are sourced from HarmBench (Mazeika et al., 2024) and AutoDAN-Turbo (Liu et al., 2024a). A dash (-) indicates unavailable results. Bold denotes the highest ASR per model.

RainbowPlus achieved an average attack success rate of 81.1% across all models, outperforming all baseline methods. Most notably, it surpassed AutoDAN-Turbo—the previous state-of-the-art—by 3.9 percentage points.

The performance gap was particularly pronounced for more robust models like Llama 2 7B Chat, where RainbowPlus achieved a 79.0% ASR compared to just 36.3% for AutoDAN-Turbo.

Beyond its superior effectiveness, RainbowPlus also demonstrated remarkable efficiency:

| Metric | RainbowPlus (Ours) | AutoDAN-Turbo |
| :-- | :--: | :--: |
| Warm-up | No | Yes |
| Runtime (hours) | $1.45 \pm 0.73$ | $13.50 \pm 6.75$ |

Table 4: Efficiency Comparison Between RainbowPlus and AutoDAN-Turbo. Runtime (hours) is averaged across HarmBench experiments.

RainbowPlus ran approximately 9 times faster than AutoDAN-Turbo while achieving better results. This efficiency makes RainbowPlus much more practical for real-world red-teaming applications where time and computational resources are limited.

Component Analysis: What Makes RainbowPlus Work?

The researchers conducted ablation studies to understand the contribution of different components to RainbowPlus's performance. These studies revealed several key insights:

  1. Fitness evaluation strategies: RainbowPlus-β, which uses maximum fitness scores, consistently achieved the highest ASR across all models and datasets. However, this came at the cost of potentially less diversity in the generated prompts.

  2. Archive management strategies: The multi-element archive approach proved crucial for generating a large number of diverse adversarial prompts. When compared to the single-element approach used in Rainbow Teaming, RainbowPlus's archive management strategy led to both higher success rates and greater prompt diversity.

  3. Mutation operators: The design of mutation operators played a significant role in balancing exploration and exploitation during the evolutionary search. The researchers found that allowing a broader range of mutations early in the process, followed by more focused refinements later, produced the best results.

These findings highlight how each component of RainbowPlus contributes to its overall effectiveness, with the combination of comprehensive fitness evaluation and multi-element archive management being particularly important.
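
The third finding suggests an explicit exploration-to-exploitation schedule. Purely as a hypothetical illustration (the paper's operators are LLM-driven prompt rewrites, and this schedule is our reading of the ablation), "broad early, focused late" could be as simple as annealing the mutator's sampling temperature:

```python
def mutator_temperature(step, total_steps, start=1.0, end=0.6):
    # Linear anneal: broad, high-temperature rewrites early in the search,
    # narrower, low-temperature refinements toward the end.
    frac = step / max(1, total_steps - 1)
    return start + frac * (end - start)
```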

Implications and Future Directions: The Road Ahead for LLM Safety

RainbowPlus represents a significant advance in LLM safety research, with implications extending beyond just testing model vulnerabilities. By generating diverse, high-quality adversarial prompts, it provides valuable insights for both identifying weaknesses in current models and developing more robust safeguards for future ones.

The approach has several potential applications:

  1. Comprehensive vulnerability assessment: The ability to generate thousands of diverse adversarial prompts allows for a much more thorough evaluation of LLM safety mechanisms.

  2. Safety alignment data: The adversarial prompts generated by RainbowPlus can serve as training data for improving LLM safety through techniques like reinforcement learning from human feedback (RLHF).

  3. Benchmarking: The method provides a consistent way to compare the robustness of different LLMs against adversarial attacks.

While RainbowPlus represents a significant step forward, several areas remain for future work. These include:

  • Developing more sophisticated behavior descriptors to capture additional dimensions of prompt diversity
  • Exploring adaptive mutation operators that can target specific model vulnerabilities
  • Extending the approach to multimodal models that incorporate images, audio, or video

The open-source implementation of RainbowPlus (available at https://github.com/knoveleng/rainbowplus) provides a foundation for the research community to build upon, advancing the state of the art in LLM safety research.

Conclusion: Advancing LLM Safety Through Evolutionary Red-Teaming

RainbowPlus advances the state-of-the-art in automated red-teaming for LLMs through its evolutionary quality-diversity approach. By conceptualizing red-teaming as a multi-objective optimization problem, it achieves both higher attack success rates and greater diversity in generated prompts compared to previous methods.

The framework's key innovations—a multi-element archive that stores diverse high-quality prompts and a comprehensive fitness function that evaluates multiple prompts concurrently—enable it to overcome the limitations of previous quality-diversity methods like Rainbow Teaming.

The experimental results demonstrate RainbowPlus's superiority across multiple dimensions:

  • Higher attack success rates across various models and datasets
  • Greater diversity in generated prompts
  • Significantly faster runtime compared to state-of-the-art methods
  • Ability to generate up to 100 times more unique adversarial prompts

As LLMs become increasingly integrated into critical applications, tools like RainbowPlus play a vital role in identifying and addressing vulnerabilities before they can be exploited in real-world settings. The open-source implementation ensures that researchers and developers can use this tool to improve LLM safety and build more robust AI systems.
