When you're building with LLMs like Amazon Bedrock's Nova Lite, there's a tough question you'll eventually have to face:

"How do I know my model's answers aren't basically AI Slop?"

We're past the point where eyeballing responses are good enough — we need automated validation at runtime, ideally one that fits neatly into a serverless, production-friendly stack.

So let's wire up a 100% AWS-native solution:

  • ✍️ Nova Lite (via Amazon Bedrock) generates a response
  • 📊 Claude or custom Lambda logic evaluates the response
  • 🤖 Step Functions + Lambda orchestrate it
  • 💾 DynamoDB stores evaluation results
  • 🔔 SNS handles notifications
  • 🧱 CDK deploys the whole thing
  • ✅ And we track responses over time to improve

Let's go full-stack, eval-style baby

🧩 High level architecture

Arch

We'll keep things TypeScript down xD

🛠 CDK stack

Here's a bare-bones CDK setup (in TypeScript) to deploy everything:

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as snsSubscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';

export class LlmEvaluationStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // DynamoDB to store evaluation results
    const evaluationTable = new dynamodb.Table(this, 'EvaluationResults', {
      partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'timestamp', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    // SNS Topic for notifications
    const evalNotificationTopic = new sns.Topic(this, 'EvalNotificationTopic');
    evalNotificationTopic.addSubscription(
      new snsSubscriptions.EmailSubscription('your-email@example.com')
    );

    // Lambda to get Nova response
    const getNovaResponse = new NodejsFunction(this, 'NovaResponseFunction', {
      entry: 'lambda/getNovaResponse.ts',
      handler: 'handler',
      runtime: lambda.Runtime.NODEJS_18_X,
      environment: {
        REGION: 'us-east-1',
      },
    });

    // Lambda to evaluate the response using Claude or custom logic
    const evaluateResponse = new NodejsFunction(this, 'EvaluateResponseFunction', {
      entry: 'lambda/evaluateResponse.ts',
      handler: 'handler',
      runtime: lambda.Runtime.NODEJS_18_X,
      environment: {
        EVALUATION_TABLE: evaluationTable.tableName,
      },
    });

    // Lambda to store results in DynamoDB
    const storeResults = new NodejsFunction(this, 'StoreResultsFunction', {
      entry: 'lambda/storeResults.ts',
      handler: 'handler',
      runtime: lambda.Runtime.NODEJS_18_X,
      environment: {
        EVALUATION_TABLE: evaluationTable.tableName,
      },
    });

    // Grant permissions
    evaluationTable.grantReadWriteData(evaluateResponse);
    evaluationTable.grantWriteData(storeResults);

    // Step Function to orchestrate the flow
    const workflow = new stepfunctions.StateMachine(this, 'EvaluationWorkflow', {
      definition: new stepfunctions.Chain()
        .start(new tasks.LambdaInvoke(this, 'GetNovaResponse', {
          lambdaFunction: getNovaResponse,
          outputPath: '$.Payload',
        }))
        .next(new tasks.LambdaInvoke(this, 'EvaluateResponse', {
          lambdaFunction: evaluateResponse,
          outputPath: '$.Payload',
        }))
        .next(new tasks.LambdaInvoke(this, 'StoreResults', {
          lambdaFunction: storeResults,
          outputPath: '$.Payload',
        }))
        .next(new tasks.SnsPublish(this, 'SendNotification', {
          topic: evalNotificationTopic,
          message: stepfunctions.TaskInput.fromJsonPathAt('$.evaluationSummary'),
        }))
        .next(new stepfunctions.Succeed(this, 'EvaluationComplete')),
    });
  }
}

What does this CDK stack drop?

  • 3 Lambdas
  • A DynamoDB table
  • A Step Function workflow
  • A SNS topic

Let's break it down 👇

📦 Lambda 1 — Nova Response (getNovaResponse.ts)

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';
import { Handler } from 'aws-lambda';

const client = new BedrockRuntimeClient({ region: process.env.REGION || 'us-east-1' });

export const handler: Handler = async (event) => {
  const prompt = event.prompt || 'What is the capital of France?';
  const expectedAnswer = event.expectedAnswer; // Optional

  try {
    const params = {
      modelId: 'amazon.nova-lite-v1:0',
      contentType: 'application/json',
      accept: 'application/json',
      body: JSON.stringify({
        inputText: prompt,
        textGenerationConfig: {
          temperature: 0.2,
          maxTokenCount: 300,
        }
      }),
    };

    const command = new InvokeModelCommand(params);
    const response = await client.send(command);

    // Process the response
    const responseBody = JSON.parse(new TextDecoder().decode(response.body));
    const completion = responseBody.results[0].outputText;

    return {
      prompt,
      completion,
      expectedAnswer,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    console.error('Error invoking Nova:', error);
    throw error;
  }
};

Returns something like:

{
  "prompt": "What is the capital of France?",
  "completion": "The capital of France is Paris.",
  "expectedAnswer": "Paris",
  "timestamp": "2025-04-11T17:56:28.000Z"
}

🔍 Lambda 2 — Evaluate with Claude, cuz it's a more opinionated and used model (evaluateResponse.ts)

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';
import { Handler } from 'aws-lambda';

const client = new BedrockRuntimeClient({ region: process.env.REGION || 'us-east-1' });

export const handler: Handler = async (event) => {
  const { prompt, completion, expectedAnswer } = event;

  // Build the evaluation prompt for Claude
  let evaluationPrompt = `
  Evaluate the quality and accuracy of the following AI response:

  Question: ${prompt}
  Response: ${completion}
  `;

  if (expectedAnswer) {
    evaluationPrompt += `\nExpected answer: ${expectedAnswer}`;
  }

  evaluationPrompt += `

  Evaluation criteria:
  1. Factual accuracy (if an expected answer was provided)
  2. Relevance to the question
  3. Clarity and conciseness
  4. Completeness

  Provide a score from 0 to 1 (where 1 is perfect) and a brief justification.
  Respond only in the following JSON format:
  {
    "score": [score between 0 and 1],
    "passed": [true/false based on score >= 0.7],
    "justification": "[brief explanation]"
  }
  `;

  try {
    const params = {
      modelId: 'anthropic.claude-3-sonnet-20240229-v1:0',
      contentType: 'application/json',
      accept: 'application/json',
      body: JSON.stringify({
        anthropic_version: "bedrock-2023-05-31",
        max_tokens: 1000,
        messages: [
          {
            role: "user",
            content: evaluationPrompt
          }
        ]
      }),
    };

    const command = new InvokeModelCommand(params);
    const response = await client.send(command);

    // Process the response
    const responseBody = JSON.parse(new TextDecoder().decode(response.body));
    const evaluationText = responseBody.content[0].text;

    // Extract JSON from response
    const jsonMatch = evaluationText.match(/\{[\s\S]*\}/);
    const evaluationResult = jsonMatch ? JSON.parse(jsonMatch[0]) : { score: 0, passed: false, justification: "Failed to parse response" };

    return {
      ...event,
      evaluation: {
        ...evaluationResult,
        method: 'claude-evaluation',
      },
      evaluationSummary: `Evaluation for prompt "${prompt}": ${evaluationResult.passed ? 'PASSED' : 'FAILED'} (Score: ${evaluationResult.score})`
    };
  } catch (error) {
    console.error('Error evaluating with Claude:', error);
    throw error;
  }
};

💾 Lambda 3 — Store Results (storeResults.ts)

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';
import { Handler } from 'aws-lambda';
import { v4 as uuidv4 } from 'uuid';

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler: Handler = async (event) => {
  const { prompt, completion, expectedAnswer, evaluation, timestamp } = event;

  const item = {
    id: uuidv4(),
    timestamp,
    prompt,
    completion,
    expectedAnswer,
    evaluation,
    modelId: 'amazon.nova-lite-v1:0',
  };

  try {
    await docClient.send(
      new PutCommand({
        TableName: process.env.EVALUATION_TABLE,
        Item: item,
      })
    );

    return {
      ...event,
      storedId: item.id,
      message: 'Evaluation stored successfully',
    };
  } catch (error) {
    console.error('Error storing evaluation:', error);
    throw error;
  }
};

📊 Simulated Scenarios

Let's run some examples with more complex situations where Nova might struggle with factual accuracy:

Prompt Nova Output Expected Eval Score
What was the GDP growth rate of Brazil in Q3 2024? Brazil's GDP grew by 3.2% in Q3 2024. Brazil's GDP grew by 1.8% in Q3 2024 ❌ FAIL
Name the 5 permanent members of the UN Security Council The 5 permanent members are USA, UK, France, Russia, and China. USA, UK, France, Russia, China ✅ PASS
Who won the Nobel Prize in Physics in 2023? The 2023 Nobel Prize in Physics was awarded to Peter Higgs for his work on the Higgs boson. Anne L'Huillier, Pierre Agostini, and Ferenc Krausz ❌ FAIL
What's the most efficient algorithm for finding the shortest path in a weighted graph with negative edges? Dijkstra's algorithm is the most efficient for finding shortest paths in weighted graphs. Bellman-Ford algorithm (Dijkstra's doesn't work with negative edges) ❌ FAIL
What medications are contraindicated with Warfarin? Aspirin, ibuprofen, and some antibiotics are contraindicated with Warfarin. Partial list but missing key contraindications like certain antibiotics, antifungals, and herbal supplements ⚠️ PARTIAL
Explain the difference between quantum computing and classical computing in one sentence. Quantum computing uses quantum bits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1. Reasonable explanation mentioning superposition ✅ PASS
What was the primary cause of the 2008 financial crisis? The 2008 financial crisis was primarily caused by predatory lending practices and the housing bubble collapse. Complex answer requiring multiple factors including subprime mortgages, derivatives, lack of regulation, etc. ⚠️ PARTIAL
Write a SQL query to find the second highest salary in a 'employees' table. SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees) Correct SQL query that finds second highest salary ✅ PASS

🔁 How to improve When it fails

When your eval fails, here's what you can do:

1. Tune inference settings

  • Lower temperature for deterministic answers
  • Increase topP for broader creativity (only if needed)

2. Improve prompt design

- What's 17 + 26?
+ What's 17 + 26? Answer only with the number

3. Upgrade your eval rubric

Use more specific evaluation criteria or customize Claude's evaluation prompt to reflect your domain needs

4. Chain a retry state

In Step Functions, wrap failed evals with a retry flow using a different prompt phrasing or fallback model

That's all folks xD