In today's AI landscape, ensuring responsible **and safe** interactions with language models has become as important as the capabilities of the models themselves. Implementing effective guardrails is no longer optional—it's essential for any organization deploying AI systems. This blog explores how to build a practical guardrails system around OpenAI's moderation and chat completion APIs to filter both user inputs and AI-generated outputs.
Why Guardrails Matter
AI systems without proper safeguards can inadvertently generate harmful, biased, or inappropriate content. A well-designed guardrails system serves as a dual-layer protection mechanism:
- Input filtering prevents users from prompting the AI with harmful or inappropriate requests
- Output screening ensures that even if problematic inputs slip through, the AI's responses remain safe and appropriate
Implementation Overview
Our implementation leverages OpenAI's moderation API alongside custom filtering logic. Here's the complete code for a practical guardrails system:
import os
import openai
from typing import Dict, Any

# Set up OpenAI client (replace with your own API key)
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


class GuardrailsSystem:
    def __init__(self):
        # Define input guardrails
        self.input_topics_to_avoid = ["weapons", "illegal activities", "exploitation"]
        # Define output guardrails
        self.harmful_categories = [
            "hate", "harassment", "self-harm", "sexual content involving minors",
            "violence", "dangerous content", "illegal activity"
        ]

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        """Check if the user input contains topics we want to avoid."""
        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=user_input)
        # Extract the results
        results = moderation_response.results[0]
        # Check if the input was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "valid": False,
                "reason": f"Input contains potentially harmful content: {', '.join(flagged_categories)}"
            }
        # Perform additional custom checks for topics to avoid
        for topic in self.input_topics_to_avoid:
            if topic in user_input.lower():
                return {
                    "valid": False,
                    "reason": f"Input contains topic we cannot discuss: {topic}"
                }
        return {"valid": True}

    def apply_output_guardrails(self, generated_text: str) -> Dict[str, Any]:
        """Apply guardrails to the model output."""
        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=generated_text)
        # Extract the results
        results = moderation_response.results[0]
        # Check if the output was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "safe": False,
                "reason": f"Output contains potentially harmful content: {', '.join(flagged_categories)}",
                "output": "I cannot provide that information as it may violate content guidelines."
            }
        # Additional custom checks could be added here
        return {"safe": True, "output": generated_text}

    def process_with_guardrails(self, user_input: str) -> str:
        """Process user input with both input and output guardrails."""
        # 1. Apply input guardrails
        input_validation = self.validate_input(user_input)
        if not input_validation["valid"]:
            return f"Sorry, I cannot respond to that request. {input_validation['reason']}"
        # 2. Generate response with the model
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_input}
                ],
                temperature=0.7,
                max_tokens=500
            )
            generated_text = completion.choices[0].message.content
            # 3. Apply output guardrails
            output_check = self.apply_output_guardrails(generated_text)
            if output_check["safe"]:
                return output_check["output"]
            else:
                return f"I generated a response but it didn't pass our safety checks. {output_check['reason']}"
        except Exception as e:
            return f"An error occurred: {str(e)}"


# Example usage
def main():
    guardrails = GuardrailsSystem()

    # Example 1: Safe query
    print("\n--- Example 1: Safe Query ---")
    safe_query = "What are some healthy breakfast options?"
    print(f"User: {safe_query}")
    response = guardrails.process_with_guardrails(safe_query)
    print(f"Assistant: {response}")

    # Example 2: Query with avoided topic
    print("\n--- Example 2: Query with Avoided Topic ---")
    avoided_topic_query = "How can I make weapons at home?"
    print(f"User: {avoided_topic_query}")
    response = guardrails.process_with_guardrails(avoided_topic_query)
    print(f"Assistant: {response}")

    # Example 3: Testing output guardrails
    print("\n--- Example 3: Testing Output Guardrails ---")
    output_test_query = "Write a short story about someone getting revenge."
    print(f"User: {output_test_query}")
    response = guardrails.process_with_guardrails(output_test_query)
    print(f"Assistant: {response}")


if __name__ == "__main__":
    main()
Key Components of the System
1. Input Validation
The `validate_input` method provides two layers of protection:
- OpenAI Moderation API: Leverages OpenAI's content moderation system to detect potentially harmful content across multiple categories.
- Custom Topic Filtering: Adds a second layer to catch specific topics you want your application to avoid, even if they aren't flagged by the moderation API.
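To see those two layers in isolation, here is a minimal standalone sketch of the same check outside the class. The function name `check_input` and its tuple return value are illustrative choices, not part of the implementation above.

```python
import os
import openai

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Topics the application chooses to avoid, mirroring input_topics_to_avoid above
blocked_topics = ["weapons", "illegal activities", "exploitation"]

def check_input(text: str):
    """Return (is_valid, reason) for a candidate user prompt."""
    # Layer 1: OpenAI Moderation API
    result = client.moderations.create(input=text).results[0]
    if result.flagged:
        flagged = [name for name, hit in result.categories.model_dump().items() if hit]
        return False, f"Flagged by moderation: {', '.join(flagged)}"
    # Layer 2: simple keyword match against topics the application avoids
    lowered = text.lower()
    for topic in blocked_topics:
        if topic in lowered:
            return False, f"Blocked topic: {topic}"
    return True, "ok"

print(check_input("What are some healthy breakfast options?"))  # expected: (True, 'ok')
```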
2. Output Screening
The `apply_output_guardrails` method ensures that even if a seemingly innocent prompt leads to problematic content, that content won't reach the end user. This is crucial because language models can sometimes generate unexpected outputs.
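The code above leaves a hook where "additional custom checks could be added." As one hedged example of what such a check might look like, this sketch layers a simple email-address redaction pass on top of the moderation call; the regex and the `screen_output` helper are illustrative assumptions, not part of the original implementation.

```python
import os
import re
import openai

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Hypothetical custom output check: redact anything that looks like an email address
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def screen_output(generated_text: str) -> dict:
    """Run the moderation check first, then an illustrative custom pass over the text."""
    result = client.moderations.create(input=generated_text).results[0]
    if result.flagged:
        return {"safe": False,
                "output": "I cannot provide that information as it may violate content guidelines."}
    # Custom layer: scrub email-like strings before the text reaches the user
    cleaned = EMAIL_PATTERN.sub("[redacted email]", generated_text)
    return {"safe": True, "output": cleaned}
```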
3. Complete Processing Pipeline
The `process_with_guardrails` method ties everything together:
- First, it validates the user input
- If valid, it sends the request to the OpenAI model
- Before returning the response, it checks the output for safety issues
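In practice, wiring this pipeline into an application can be as simple as a small interactive loop around the class. The sketch below assumes the `GuardrailsSystem` class defined above is available in the same module; the loop itself is an illustration, not part of the original code.

```python
def chat_loop():
    """Illustrative REPL that routes every turn through the full guardrails pipeline."""
    guardrails = GuardrailsSystem()
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in {"quit", "exit"}:
            break
        # Each turn passes through input validation, generation, and output screening
        print(f"Assistant: {guardrails.process_with_guardrails(user_input)}")

chat_loop()
```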
Real-World Applications
This guardrails system can be integrated into various applications:
- Customer support chatbots: Ensure responses remain professional and appropriate
- Educational tools: Filter both inappropriate student queries and ensure age-appropriate answers
- Content generation applications: Prevent creation of harmful or policy-violating content
- Internal enterprise tools: Maintain professional standards even in employee-facing systems
Enhancing the System
The basic implementation can be extended in several ways:
- Topic-specific guardrails: Add specialized filters for particular domains
- User context awareness: Adjust guardrails based on user age, location, or other factors
- Feedback mechanisms: Allow users to report problematic responses that slip through
- Audit logging: Track and analyze both rejected inputs and outputs to improve the system
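As a concrete sketch of the audit-logging idea, the subclass below records every rejected input as a JSON line for later review. The class name, file name, and log format are assumptions rather than part of the original code, and it again relies on the `GuardrailsSystem` class defined above.

```python
import json
from datetime import datetime, timezone

class LoggingGuardrailsSystem(GuardrailsSystem):
    """Illustrative extension: persist every rejected input for later analysis."""

    def __init__(self, log_path: str = "guardrails_audit.jsonl"):
        super().__init__()
        self.log_path = log_path

    def validate_input(self, user_input: str):
        result = super().validate_input(user_input)
        if not result["valid"]:
            # One JSON object per line keeps the audit log easy to parse later
            with open(self.log_path, "a") as f:
                f.write(json.dumps({
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "input": user_input,
                    "reason": result["reason"],
                }) + "\n")
        return result
```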
Conclusion
Building effective AI guardrails is a responsibility that all developers working with generative AI models must take seriously. By implementing a dual-layered approach that screens both inputs and outputs, we can harness the power of large language models while dramatically reducing the risk of harmful content.
The system demonstrated here provides a solid foundation that can be customized to meet the specific needs of your application. As AI capabilities continue to advance, so too should our approaches to ensuring these systems operate within appropriate boundaries.
Output
--- Example 1: Safe Query ---
User: What are some healthy breakfast options?
Assistant: Some healthy breakfast options include:
- Greek yogurt with fresh fruit and a sprinkle of nuts or seeds
- Oatmeal topped with berries and a dollop of almond butter
- Whole grain toast with avocado and a poached egg
- Smoothie made with spinach, banana, and almond milk
- Chia pudding with coconut milk and sliced almonds
- Whole grain cereal with skim milk and sliced strawberries
- Quinoa bowl with roasted vegetables and a drizzle of tahini
- Cottage cheese with sliced peaches and a drizzle of honey
- Whole grain waffles topped with yogurt and mixed berries
- Veggie omelette with a side of whole grain toast.
--- Example 2: Query with Avoided Topic ---
User: How can I make weapons at home?
Assistant: Sorry, I cannot respond to that request. Input contains topic we cannot discuss: weapons
--- Example 3: Testing Output Guardrails ---
User: Write a short story about someone getting revenge.
Assistant: Once there was a young woman named Emily who worked tirelessly at a prestigious law firm. She had always been kind and hardworking, but she had a colleague named Sarah who constantly belittled and undermined her. Sarah would steal Emily's ideas, take credit for her work, and spread rumors about her around the office.
Despite Emily's best efforts to ignore Sarah's cruel behavior, it eventually became too much to bear. One day, Emily discovered that Sarah had been manipulating clients and sabotaging cases to further her own career at the expense of others.
Filled with a desire for justice, Emily devised a plan for revenge. She gathered evidence of Sarah's unethical behavior and presented it to their superiors at the law firm. Sarah was promptly fired and blacklisted from the industry.
As Emily watched Sarah's downfall, she felt a sense of satisfaction and closure. She had finally stood up for herself and put an end to Sarah's toxic behavior. From that day on, Emily's confidence grew, and she excelled in her career without the shadow of Sarah's cruelty hanging over her.
In the end, Emily learned that sometimes the best revenge is not through seeking vengeance, but by standing up for what is right and letting karma take care of the rest.
Thanks
Sreeni Ramadorai