Keeping your AI applications online and running smoothly, even when lots of people use them at once, is super important.
Good news! Bedrock has a cool feature called cross-region inference that makes building resilient and highly available GenAI applications much easier.
So, What Exactly is Cross-Region Inference?
Imagine you have a popular AI app powered by Bedrock.
Normally, all the AI thinking happens in one specific AWS Region. But what if that region gets really busy (since its capacity is shared with other customers)? Or what if there's a temporary issue? Your app might slow down or stop working for some users.
Amazon Bedrock's cross-region inference is like having backup locations ready to help. It automatically sends your requests for AI processing (like asking the model a question or asking it to generate text) to other available AWS Regions within the same general area (like the US or Europe).
This means your application doesn't just rely on one spot. It can tap into resources from other spots nearby, helping ensure your users get a consistent and speedy experience.
Why This Makes Your AI App Resilient
The main reason cross-region inference is a game-changer is the resiliency it adds. By spreading the work across multiple AWS Regions, it gives you several key benefits that make your AI applications much more robust:
- Handles Traffic Jumps Easily: When your app suddenly gets popular, cross-region inference helps handle the rush. It automatically sends requests to regions with available capacity. You don't have to guess how much traffic you'll get or build complex systems to manage it yourself. Bedrock checks your main region first and, if needed, smartly sends the request to another region that can handle it.
- Better Model Availability: If one region is facing a temporary capacity crunch, your requests can be sent to another. This greatly increases the chances that your request will be completed successfully, helping keep your service running continuously for your users.
- Increased Throughput: By using compute power from different regions, your application can process more requests overall. This means your app can handle a higher volume of activity without performance dropping.
- Automatic Failover: If processing a request in the initial region fails for some reason, Bedrock will automatically try to send it to another working region within the group. This built-in safety net makes your AI applications much more reliable.
- No Extra Cost for Resiliency: This is fantastic! There's no additional charge for using cross-region inference. You pay the same price per token (the units used for processing text) as you would if the request was handled only in your original region.
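Even with Bedrock routing requests across regions for you, it's still good practice to retry throttled calls on the client side. Here's a minimal, generic sketch of exponential backoff with jitter — the helper name and parameters are illustrative, not part of the Bedrock API:

```python
import random
import time


def call_with_backoff(invoke, max_attempts=4, base_delay=0.5):
    """Retry a callable with exponential backoff and jitter.

    `invoke` is any zero-argument callable that raises on throttling.
    This is a generic pattern, not a Bedrock-specific API.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Wait base_delay * 2^attempt, plus jitter to avoid retry storms
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In real code you would catch botocore's `ClientError` and check for a `ThrottlingException` error code rather than a bare `Exception`.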
Getting Started: It's Not Automatic, But It Is Simple!
So, is this cross-region magic automatic as soon as you use Bedrock? Not quite.
The routing of the request across regions is automatic once you tell Bedrock you want to use this feature, but you need to make a small change in your code to enable it.
The key to enabling cross-region inference is using something called inference profile IDs. Instead of telling Bedrock the specific, single-region address (ARN) of an AI model, you use a special ID that represents the model and tells Bedrock it can use the cross-region capability.
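In practice, a system-defined inference profile ID is just the foundation model ID with a geography prefix in front. A tiny illustrative helper (the function name is made up — always confirm the real IDs via the console or CLI, as shown below):

```python
def cross_region_profile_id(model_id: str, geo: str = "us") -> str:
    """Prefix a foundation model ID with a geography group (e.g. "us", "eu").

    Illustrative only: verify actual profile IDs with list-inference-profiles.
    """
    return f"{geo}.{model_id}"


print(cross_region_profile_id("anthropic.claude-3-5-sonnet-20240620-v1:0"))
# → us.anthropic.claude-3-5-sonnet-20240620-v1:0
```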
Here’s how you get started specifically:
Discover the Right Inference Profile IDs:
Amazon Bedrock provides special IDs called "system-defined" inference profiles for models that support cross-region inference. These IDs cover specific geographical areas (like a group of US regions or a group of EU regions).
How to find them:
- AWS Console:
  - Go to the Amazon Bedrock console.
  - Look for "Cross-region inference" in the left menu. Here, you can browse the available profiles for your regions and easily copy the IDs.
- AWS CLI:
  - Run `aws bedrock list-inference-profiles`.
  - Look for profiles listed with the type "SYSTEM_DEFINED". These IDs start with a geography prefix like `us.` or `eu.`.
- Boto3 (Python SDK):
  - If you're coding in Python, you can use the AWS SDK:

```python
import boto3

# Replace "your-aws-region" with the region you're working from
bedrock_client = boto3.client("bedrock", region_name="your-aws-region")
response = bedrock_client.list_inference_profiles()
print(response)
```

  - This will list the available profiles, including the system-defined cross-region ones.
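To pull out just the cross-region IDs from that response, you can filter on the profile type. A sketch assuming the response shape returned by `list_inference_profiles` (a top-level `inferenceProfileSummaries` list — double-check the field names against the current API docs):

```python
def system_defined_profile_ids(response: dict) -> list[str]:
    """Return the IDs of SYSTEM_DEFINED (cross-region) inference profiles."""
    return [
        summary["inferenceProfileId"]
        for summary in response.get("inferenceProfileSummaries", [])
        if summary.get("type") == "SYSTEM_DEFINED"
    ]
```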
Use the Inference Profile ID in Your Code:
Once you have the inference profile ID (it will look something like `us.anthropic.claude-3-5-sonnet-20240620-v1:0`), you simply use this ID instead of the regular model ARN when you make your requests to the Bedrock API (using `InvokeModel` or `Converse`).
Example:
```python
import boto3

# Instead of a single-region ARN like:
# model_id = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
# ... call bedrock_runtime.invoke_model(modelId=model_id, ...)
# you use the cross-region inference profile ID:
inference_profile_id = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"  # Example ID

# Replace "your-source-region" with the region you are making the request from
bedrock_runtime = boto3.client("bedrock-runtime", region_name="your-source-region")

response = bedrock_runtime.converse(
    modelId=inference_profile_id,  # Use the profile ID here!
    system=[{"text": "You are a helpful AI assistant."}],
    messages=[{"role": "user", "content": [{"text": "Tell me something interesting."}]}],
)

# The content field is a list of blocks; print the text of the first one
print(response["output"]["message"]["content"][0]["text"])
```
Bedrock sees the inference profile ID and knows it has the flexibility to route your request to any available region within that profile's set.
So, while you don't manage the traffic routing yourself (Bedrock does that automatically!), you do need to make the initial step of changing your code to use the inference profile ID instead of the standard model ARN. It's a straightforward change that unlocks powerful resiliency.
Monitoring Your Applications
Amazon Bedrock gives you visibility into how cross-region inference is working. If your request gets re-routed to another region, this information is recorded in your AWS CloudTrail logs and Amazon Bedrock Model Invocation Logs.
You'll find details like an `inferenceRegion` key, which tells you where the request was actually processed. By looking at these logs (for example, in Amazon CloudWatch), you can see when requests are being served from different regions. This helps you understand your application's traffic flow and how effectively cross-region inference is handling demand spikes.
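If you export those invocation logs, you can tally how often each region actually served your traffic. A sketch that assumes each log line is a JSON object carrying an `inferenceRegion` field, as described above (the exact field placement may differ in your logging configuration):

```python
import json
from collections import Counter


def count_inference_regions(log_lines):
    """Tally the inferenceRegion field across JSON-formatted log lines."""
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        region = entry.get("inferenceRegion")
        if region:
            counts[region] += 1
    return counts
```

A spike in counts for a region other than your source region is a good sign that cross-region routing is absorbing demand.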
Important Things to Keep in Mind
While cross-region inference is great for resiliency, here are a few points to be aware of:
- Latency: There might be a slight delay if a request needs to be re-routed to another region (typically small, on the order of double-digit milliseconds in testing).
- Data Residency: Your main data stays in your source region. However, the input prompts and output results for a specific inference request might be processed in another region within the same geographical group (like US or EU). Make sure this fits with any data location rules or compliance requirements you have.
Rest assured, all data transfer between regions happens over AWS's secure network.
- Supported Models and Regions: This feature works with a specific list of models and within defined geographical sets of regions (US, EU, etc.). Check the Bedrock documentation to confirm that the models you want to use and the regions you operate in are supported.