Deploy Llama 3 on AWS Bedrock — Production Guide 2026
AWS Bedrock now supports Meta's Llama 3 models. Here's how to deploy, call, and optimize Llama 3 on Bedrock for production use cases without managing GPU infrastructure.
Running Llama 3 on your own infrastructure means managing GPU instances, CUDA drivers, model downloads, and a serving stack. AWS Bedrock eliminates all of that: you call an API and AWS handles the rest.
Here's how to use Llama 3 on Bedrock in production.
Why Bedrock for Llama 3
Advantages:
- No GPU management — AWS handles infrastructure
- Pay per token — no idle compute costs
- Same IAM security model as all AWS services
- Auto-scales with demand
- Prompts and completions stay within AWS (no data sent to Meta)
Trade-offs:
- More expensive per token than self-hosted at high volume
- Less control over model serving parameters
- Limited to Bedrock-supported model versions
Enable Llama 3 in AWS Bedrock
# Request model access (one-time setup)
# If your AWS CLI version does not offer this subcommand, use the Console path below
aws bedrock put-foundation-model-entitlement \
  --model-id meta.llama3-70b-instruct-v1:0 \
  --region us-east-1
# Or via Console:
# AWS Console → Bedrock → Model Access → Request Access → Meta Llama 3
# Approval is usually instant for most accounts

Available models:
- meta.llama3-8b-instruct-v1:0 (fast, cheap)
- meta.llama3-70b-instruct-v1:0 (most capable Llama 3 model)
- meta.llama3-1-405b-instruct-v1:0 (Llama 3.1 frontier model, where available)
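Which of these model IDs is offered varies by region. A minimal sketch for checking from code, assuming the boto3 control-plane client (`bedrock`, not `bedrock-runtime`) and its `list_foundation_models` call. The helper name `list_meta_model_ids` is hypothetical, and a model showing up in the list does not by itself mean access has been granted to your account:

```python
from typing import Any


def list_meta_model_ids(bedrock_client: Any) -> list[str]:
    """Return the IDs of the Meta models Bedrock lists in the client's region."""
    response = bedrock_client.list_foundation_models(byProvider="Meta")
    return sorted(m["modelId"] for m in response.get("modelSummaries", []))


def main() -> None:
    # Imported here so the helper above can be tested without boto3 installed.
    import boto3

    client = boto3.client("bedrock", region_name="us-east-1")
    for model_id in list_meta_model_ids(client):
        print(model_id)
```

Calling `main()` with valid AWS credentials prints one model ID per line. The client is passed in as a parameter so the helper can be exercised with a stub in tests.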
Basic API Call
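Llama 3 instruct models on Bedrock take a raw `prompt` string, so the chat template has to be assembled by hand. A minimal sketch of a reusable builder, assuming the standard Llama 3 special tokens with a blank line after each header. The function name `build_llama3_prompt` and the message dictionary shape are illustrative choices, not part of any Bedrock API:

```python
def build_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Render chat messages into the Llama 3 instruct template.

    Each message is {"role": "system" | "user" | "assistant", "content": str}.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        role = message["role"]
        if role not in {"system", "user", "assistant"}:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Leave an open assistant header so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

A builder like this makes multi-turn conversations easier to keep well-formed. The basic call that follows inlines the same template for a single system and user turn: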
import json

import boto3

# Runtime client used for inference calls
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)


def call_llama3(
    prompt: str,
    model_id: str = "meta.llama3-70b-instruct-v1:0",
    max_gen_len: int = 512,
    temperature: float = 0.7,
) -> str:
    # Llama 3 uses a specific prompt format; each header is followed by
    # a blank line before the message text
    formatted_prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "You are a helpful DevOps assistant.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    body = json.dumps({
        "prompt": formatted_prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": 0.9,
    })
    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream; read it once and parse the JSON
    response_body = json.loads(response.get("body").read())
    return response_body["generation"]


# Test it
result = call_llama3("Write a Kubernetes liveness probe for a Node.js app")
print(result)

Streaming Responses
For chat interfaces where you want real-time output:
from collections.abc import Iterator


def stream_llama3(
    prompt: str,
    model_id: str = "meta.llama3-70b-instruct-v1:0",
) -> Iterator[str]:
    # Reuses the `bedrock` runtime client and `json` import from the basic example
    formatted_prompt = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    body = json.dumps({
        "prompt": formatted_prompt,
        "max_gen_len": 1024,
        "temperature": 0.7,
    })
    response = bedrock.invoke_model_with_response_stream(
        body=body,
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    # Each event carries a JSON chunk holding the next piece of the generation
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("generation"):
            yield chunk["generation"]
        if chunk.get("stop_reason"):
            break


# Stream output
for token in stream_llama3("Explain Kubernetes networking in detail"):
    print(token, end="", flush=True)

Production Setup: FastAPI Wrapper
# app.py
import json
import os

import boto3
from botocore.exceptions import ClientError
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
bedrock = boto3.client(
    "bedrock-runtime",
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
)


class ChatRequest(BaseModel):
    message: str
    model: str = "meta.llama3-70b-instruct-v1:0"
    max_tokens: int = 1024
    stream: bool = False


# Declared with plain `def` so FastAPI runs the blocking boto3 calls in its
# threadpool instead of on the event loop
@app.post("/chat")
def chat(request: ChatRequest):
    formatted = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "You are a helpful assistant.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{request.message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    body = json.dumps({
        "prompt": formatted,
        "max_gen_len": request.max_tokens,
        "temperature": 0.7,
    })

    if request.stream:
        def generate():
            response = bedrock.invoke_model_with_response_stream(
                body=body, modelId=request.model,
                accept="application/json", contentType="application/json",
            )
            # Forward each chunk as a server-sent event
            for event in response["body"]:
                chunk = json.loads(event["chunk"]["bytes"])
                if chunk.get("generation"):
                    yield f"data: {json.dumps({'text': chunk['generation']})}\n\n"

        return StreamingResponse(generate(), media_type="text/event-stream")

    try:
        response = bedrock.invoke_model(
            body=body, modelId=request.model,
            accept="application/json", contentType="application/json",
        )
    except ClientError as exc:
        # Surface Bedrock failures (throttling, access denied) as a gateway error
        raise HTTPException(status_code=502, detail="Bedrock request failed") from exc
    result = json.loads(response.get("body").read())
    return {"response": result["generation"]}

IAM Permissions
Your Lambda, ECS task, or EC2 role needs:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/meta.llama3-70b-instruct-v1:0"
    }
  ]
}

Cost Comparison: Llama 3 Options
| Option | Cost | Latency | Setup |
|---|---|---|---|
| Bedrock Llama 3 8B | ~$0.0003/1K tokens | Low | Zero |
| Bedrock Llama 3 70B | ~$0.00265/1K tokens | Medium | Zero |
| Self-hosted g4dn.xlarge | ~$0.000009/1K tokens | Low | High |
| Groq (fastest) | ~$0.00008/1K tokens | Ultra-low | Zero |
Bedrock wins on operational simplicity. Self-hosted wins at high volume. Groq wins on speed.
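To see what the per-token prices in the table mean for a monthly bill, the arithmetic can be sketched as below. The prices are copied from the table. The traffic figure of 10 million tokens per day and the names `Option` and `monthly_cost_usd` are illustrative assumptions, and the sketch treats every option as purely pay-per-token, which ignores idle time on a self-hosted instance:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    usd_per_1k_tokens: float  # approximate price taken from the table


OPTIONS = [
    Option("Bedrock Llama 3 8B", 0.0003),
    Option("Bedrock Llama 3 70B", 0.00265),
    Option("Self-hosted g4dn.xlarge", 0.000009),
    Option("Groq", 0.00008),
]


def monthly_cost_usd(option: Option, tokens_per_day: int, days: int = 30) -> float:
    """Estimate the monthly spend for a steady daily token volume."""
    return option.usd_per_1k_tokens * (tokens_per_day / 1000) * days


# Assumed workload: 10 million tokens per day
for option in OPTIONS:
    cost = monthly_cost_usd(option, tokens_per_day=10_000_000)
    print(f"{option.name:<26} ${cost:,.2f}/month")
```

At that assumed volume the estimate works out to about $795 a month on the 70B model and about $90 on the 8B model, which shows how quickly the gap to self-hosting widens as traffic grows.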
Monitoring with CloudWatch
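Before anything can be published, the token counts and latency have to be measured. A minimal sketch, assuming the Llama response body on Bedrock carries `prompt_token_count` and `generation_token_count` next to `generation`. The helper name `invoke_with_metrics` is hypothetical, and the client is passed in so the function can be exercised without AWS credentials:

```python
import json
import time
from typing import Any


def invoke_with_metrics(bedrock_client: Any, model_id: str, body: dict) -> dict:
    """Call invoke_model and return the generation together with usage numbers."""
    started = time.perf_counter()
    response = bedrock_client.invoke_model(
        body=json.dumps(body),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    payload = json.loads(response["body"].read())
    latency_ms = (time.perf_counter() - started) * 1000.0
    return {
        "generation": payload["generation"],
        # Fall back to 0 if the counts are missing from the response
        "input_tokens": payload.get("prompt_token_count", 0),
        "output_tokens": payload.get("generation_token_count", 0),
        "latency_ms": latency_ms,
    }
```

The returned `input_tokens`, `output_tokens`, and `latency_ms` map directly onto the arguments of `track_llm_usage`, which publishes them to CloudWatch: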
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def track_llm_usage(model_id: str, input_tokens: int, output_tokens: int, latency_ms: float):
    # Publish per-model token counts and latency as custom metrics
    dimensions = [{"Name": "Model", "Value": model_id}]
    cloudwatch.put_metric_data(
        Namespace="LLM/Bedrock",
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens, "Unit": "Count",
             "Dimensions": dimensions},
            {"MetricName": "OutputTokens", "Value": output_tokens, "Unit": "Count",
             "Dimensions": dimensions},
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds",
             "Dimensions": dimensions},
        ],
    )

AWS Bedrock's Llama 3 is the fastest path to production LLM deployment if you're already on AWS. No GPU management, no model serving infrastructure, pay only for what you use.
To go deeper on AWS infrastructure management, the AWS Solutions Architect certification via KodeKloud covers Bedrock and other AWS AI/ML services.