In today's AI landscape, ensuring responsible and safe interactions with language models has become as important as the capabilities of the models themselves. Implementing effective guardrails is no longer optional; it's essential for any organization deploying AI systems. This blog explores how to build a comprehensive guardrails system with OpenAI's moderation API and custom filtering logic to screen both user inputs and AI-generated outputs.

Why Guardrails Matter

AI systems without proper safeguards can inadvertently generate harmful, biased, or inappropriate content. A well-designed guardrails system serves as a dual-layer protection mechanism:

  1. Input filtering prevents users from prompting the AI with harmful or inappropriate requests
  2. Output screening ensures that even if problematic inputs slip through, the AI's responses remain safe and appropriate

Implementation Overview

Our implementation leverages OpenAI's moderation API alongside custom filtering logic. Here's the complete code for a practical guardrails system:

import os
from typing import Any, Dict

import openai

# Set up OpenAI client (replace with your own API key)
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

class GuardrailsSystem:
    def __init__(self):
        # Define input guardrails
        self.input_topics_to_avoid = ["weapons", "illegal activities", "exploitation"]

        # Define output guardrails
        self.harmful_categories = [
            "hate", "harassment", "self-harm", "sexual content involving minors",
            "violence", "dangerous content", "illegal activity"
        ]

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        """Check if the user input contains topics we want to avoid."""

        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=user_input)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the input was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items() 
                if flagged
            ]
            return {
                "valid": False,
                "reason": f"Input contains potentially harmful content: {', '.join(flagged_categories)}"
            }

        # Perform additional custom checks for topics to avoid
        for topic in self.input_topics_to_avoid:
            if topic in user_input.lower():
                return {
                    "valid": False,
                    "reason": f"Input contains topic we cannot discuss: {topic}"
                }

        return {"valid": True}

    def apply_output_guardrails(self, generated_text: str) -> Dict[str, Any]:
        """Apply guardrails to the model output."""

        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=generated_text)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the output was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "safe": False,
                "reason": f"Output contains potentially harmful content: {', '.join(flagged_categories)}",
                "output": "I cannot provide that information as it may violate content guidelines."
            }

        # Additional custom checks could be added here

        return {"safe": True, "output": generated_text}

    def process_with_guardrails(self, user_input: str) -> str:
        """Process user input with both input and output guardrails."""

        # 1. Apply input guardrails
        input_validation = self.validate_input(user_input)
        if not input_validation["valid"]:
            return f"Sorry, I cannot respond to that request. {input_validation['reason']}"

        # 2. Generate response with the model
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_input}
                ],
                temperature=0.7,
                max_tokens=500
            )

            generated_text = completion.choices[0].message.content

            # 3. Apply output guardrails
            output_check = self.apply_output_guardrails(generated_text)

            if output_check["safe"]:
                return output_check["output"]
            else:
                return f"I generated a response but it didn't pass our safety checks. {output_check['reason']}"

        except Exception as e:
            return f"An error occurred: {str(e)}"

# Example usage
def main():
    guardrails = GuardrailsSystem()

    # Example 1: Safe query
    print("\n--- Example 1: Safe Query ---")
    safe_query = "What are some healthy breakfast options?"
    print(f"User: {safe_query}")
    response = guardrails.process_with_guardrails(safe_query)
    print(f"Assistant: {response}")

    # Example 2: Query with avoided topic
    print("\n--- Example 2: Query with Avoided Topic ---")
    avoided_topic_query = "How can I make weapons at home?"
    print(f"User: {avoided_topic_query}")
    response = guardrails.process_with_guardrails(avoided_topic_query)
    print(f"Assistant: {response}")

    # Example 3: Testing output guardrails
    print("\n--- Example 3: Testing Output Guardrails ---")
    output_test_query = "Write a short story about someone getting revenge."
    print(f"User: {output_test_query}")
    response = guardrails.process_with_guardrails(output_test_query)
    print(f"Assistant: {response}")

if __name__ == "__main__":
    main()

Key Components of the System

1. Input Validation

The validate_input method provides two layers of protection:

  • OpenAI Moderation API: Leverages OpenAI's content moderation system to detect potentially harmful content across multiple categories.
  • Custom Topic Filtering: Adds a second layer to catch specific topics you want your application to avoid, even if they aren't flagged by the moderation API.
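
To see the input layer in isolation, you can call validate_input directly before wiring it into the full pipeline. The snippet below is a minimal sketch that assumes the GuardrailsSystem class defined above is already in scope:

guardrails = GuardrailsSystem()

# A benign request should pass both the moderation check and the custom topic filter
print(guardrails.validate_input("What are some healthy breakfast options?"))
# Expected: {'valid': True}

# A request touching an avoided topic is rejected by one of the two layers;
# which reason you see depends on whether the moderation endpoint flags it first
print(guardrails.validate_input("Where can I buy weapons?"))
# Expected: {'valid': False, 'reason': '...'}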

2. Output Screening

The apply_output_guardrails method ensures that even if a seemingly innocent prompt leads to problematic content, that content won't reach the end user. This is crucial because language models can sometimes generate unexpected outputs.
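
Because apply_output_guardrails takes plain text, it can also screen content generated outside process_with_guardrails, for example cached or streamed responses. A minimal sketch, again assuming the class above is in scope:

guardrails = GuardrailsSystem()

# Stand-in for text produced elsewhere in your application (e.g. a cached response)
draft = "Here is a friendly summary of today's weather."
check = guardrails.apply_output_guardrails(draft)

if check["safe"]:
    print(check["output"])                # safe to show to the user
else:
    print(f"Blocked: {check['reason']}")  # route to logging or review instead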

3. Complete Processing Pipeline

The process_with_guardrails method ties everything together:

  1. First, it validates the user input
  2. If valid, it sends the request to the OpenAI model
  3. Before returning the response, it checks the output for safety issues
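
Wiring this pipeline into an application can be as simple as a loop around process_with_guardrails. A small interactive sketch, assuming the class above is in scope:

guardrails = GuardrailsSystem()

# Every message passes through input validation, generation, and output screening
while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    print(f"Assistant: {guardrails.process_with_guardrails(user_input)}")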

Real-World Applications

This guardrails system can be integrated into various applications:

  • Customer support chatbots: Ensure responses remain professional and appropriate
  • Educational tools: Filter both inappropriate student queries and ensure age-appropriate answers
  • Content generation applications: Prevent creation of harmful or policy-violating content
  • Internal enterprise tools: Maintain professional standards even in employee-facing systems

Enhancing the System

The basic implementation can be extended in several ways:

  • Topic-specific guardrails: Add specialized filters for particular domains
  • User context awareness: Adjust guardrails based on user age, location, or other factors
  • Feedback mechanisms: Allow users to report problematic responses that slip through
  • Audit logging: Track and analyze both rejected inputs and outputs to improve the system
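
As one illustration, the sketch below extends the class with simple audit logging of rejected inputs. The subclass name and the local JSONL log file are assumptions for this example, not part of the original implementation:

import json
import time
from typing import Any, Dict

class AuditedGuardrailsSystem(GuardrailsSystem):
    """Adds audit logging of rejected inputs on top of the base guardrails (illustrative sketch)."""

    def __init__(self, log_path: str = "guardrails_audit.jsonl"):
        super().__init__()
        self.log_path = log_path  # assumed local JSONL file; swap in your own sink

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        result = super().validate_input(user_input)
        if not result["valid"]:
            # Append one JSON record per rejected input for later analysis
            record = {"timestamp": time.time(), "input": user_input, "reason": result["reason"]}
            with open(self.log_path, "a") as f:
                f.write(json.dumps(record) + "\n")
        return result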

Conclusion

Building effective AI guardrails is a responsibility that all developers working with generative AI models must take seriously. By implementing a dual-layered approach that screens both inputs and outputs, we can harness the power of large language models while dramatically reducing the risk of harmful content.

The system demonstrated here provides a solid foundation that can be customized to meet the specific needs of your application. As AI capabilities continue to advance, so too should our approaches to ensuring these systems operate within appropriate boundaries.

Output

--- Example 1: Safe Query ---
User: What are some healthy breakfast options?
Assistant: Some healthy breakfast options include:

  1. Greek yogurt with fresh fruit and a sprinkle of nuts or seeds
  2. Oatmeal topped with berries and a dollop of almond butter
  3. Whole grain toast with avocado and a poached egg
  4. Smoothie made with spinach, banana, and almond milk
  5. Chia pudding with coconut milk and sliced almonds
  6. Whole grain cereal with skim milk and sliced strawberries
  7. Quinoa bowl with roasted vegetables and a drizzle of tahini
  8. Cottage cheese with sliced peaches and a drizzle of honey
  9. Whole grain waffles topped with yogurt and mixed berries
  10. Veggie omelette with a side of whole grain toast.

--- Example 2: Query with Avoided Topic ---
User: How can I make weapons at home?
Assistant: Sorry, I cannot respond to that request. Input contains a topic we cannot discuss: weapons

--- Example 3: Testing Output Guardrails ---
User: Write a short story about someone getting revenge.
Assistant: Once there was a young woman named Emily who worked tirelessly at a prestigious law firm. She had always been kind and hardworking, but she had a colleague named Sarah who constantly belittled and undermined her. Sarah would steal Emily's ideas, take credit for her work, and spread rumors about her around the office.

Despite Emily's best efforts to ignore Sarah's cruel behavior, it eventually became too much to bear. One day, Emily discovered that Sarah had been manipulating clients and sabotaging cases to further her own career at the expense of others.

Filled with a desire for justice, Emily devised a plan for revenge. She gathered evidence of Sarah's unethical behavior and presented it to their superiors at the law firm. Sarah was promptly fired and blacklisted from the industry.

As Emily watched Sarah's downfall, she felt a sense of satisfaction and closure. She had finally stood up for herself and put an end to Sarah's toxic behavior. From that day on, Emily's confidence grew, and she excelled in her career without the shadow of Sarah's cruelty hanging over her.

In the end, Emily learned that sometimes the best revenge is not through seeking vengeance, but by standing up for what is right and letting karma take care of the rest.

Thanks
Sreeni Ramadorai