
Deploy Llama 3 on AWS Bedrock — Production Guide 2026

AWS Bedrock now supports Meta's Llama 3 models. Here's how to deploy, call, and optimize Llama 3 on Bedrock for production use cases without managing GPU infrastructure.

DevOpsBoys · Jun 2, 2026 · 3 min read

Running Llama 3 on your own infrastructure means managing GPU instances, CUDA drivers, model downloads, and a serving stack. AWS Bedrock eliminates all of that: you call an API and AWS handles the rest.

Here's how to use Llama 3 on Bedrock in production.


Why Bedrock for Llama 3

Advantages:

  • No GPU management — AWS handles infrastructure
  • Pay per token — no idle compute costs
  • Same IAM security model as all AWS services
  • Auto-scales with demand
  • Data stays in your AWS account (no data sent to Meta)

Trade-offs:

  • More expensive per token than self-hosted at high volume
  • Less control over model serving parameters
  • Limited to Bedrock-supported model versions

Enable Llama 3 in AWS Bedrock

bash
# Request model access (one-time setup) via the Console:
# AWS Console → Bedrock → Model Access → Request Access → Meta Llama 3
# Approval is usually instant for most accounts
 
# Then confirm which Meta models are offered in your region
aws bedrock list-foundation-models \
  --by-provider meta \
  --region us-east-1 \
  --query "modelSummaries[].modelId"

Available models:

  • meta.llama3-8b-instruct-v1:0 — fast, cheap
  • meta.llama3-70b-instruct-v1:0 — most capable
  • meta.llama3-1-405b-instruct-v1:0 — frontier model (where available)
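
Model availability varies by region, so it can help to check the IDs above programmatically before hard-coding one. A minimal sketch, assuming boto3's `bedrock` control-plane client and its `list_foundation_models` call with the `byProvider` filter; `list_meta_model_ids` is a hypothetical helper name, not something from the Bedrock SDK.

```python
def list_meta_model_ids(bedrock_client) -> list:
    """Return the sorted model IDs of Meta models visible to this client.

    `bedrock_client` is assumed to be boto3.client("bedrock", ...), i.e. the
    control-plane client, not the "bedrock-runtime" client used for inference.
    """
    response = bedrock_client.list_foundation_models(byProvider="meta")
    return sorted(summary["modelId"] for summary in response["modelSummaries"])


# Usage (requires AWS credentials, so it is left commented out here):
#   import boto3
#   client = boto3.client("bedrock", region_name="us-east-1")
#   for model_id in list_meta_model_ids(client):
#       print(model_id)
```

Taking the client as a parameter keeps the helper easy to test with a stub and lets you reuse one client across regions.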

Basic API Call

python
import boto3
import json
 
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)
 
def call_llama3(
    prompt: str,
    model_id: str = "meta.llama3-70b-instruct-v1:0",
    max_gen_len: int = 512,
    temperature: float = 0.7,
) -> str:
    
    # Llama 3 uses a specific prompt format
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful DevOps assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
 
    body = json.dumps({
        "prompt": formatted_prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": 0.9,
    })
    
    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )
    
    response_body = json.loads(response.get("body").read())
    return response_body["generation"]
 
# Test it
result = call_llama3("Write a Kubernetes liveness probe for a Node.js app")
print(result)
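
The hand-written f-string above works for a single question, but chat use cases need to replay earlier turns. Here is a rough sketch of a prompt builder that follows the same token layout as `call_llama3`; the function name `build_llama3_prompt` and the message dict shape are assumptions made for illustration.

```python
def build_llama3_prompt(messages: list, system: str = "") -> str:
    """Render chat turns into the Llama 3 instruct prompt layout used above.

    Each message is a dict like {"role": "user" | "assistant", "content": "..."}.
    """
    parts = ["<|begin_of_text|>"]
    if system:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>"
        )
    for message in messages:
        role, content = message["role"], message["content"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n{content}<|eot_id|>"
        )
    # End with an open assistant header so the model writes the next reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "".join(parts)


prompt = build_llama3_prompt(
    [
        {"role": "user", "content": "What is a liveness probe?"},
        {"role": "assistant", "content": "A check that restarts unhealthy containers."},
        {"role": "user", "content": "Show one for a Node.js app."},
    ],
    system="You are a helpful DevOps assistant.",
)
```

The resulting string can be passed as the `prompt` field of the request body in place of `formatted_prompt`.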

Streaming Responses

For chat interfaces where you want real-time output:

python
def stream_llama3(prompt: str, model_id: str = "meta.llama3-70b-instruct-v1:0"):
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
 
    body = json.dumps({
        "prompt": formatted_prompt,
        "max_gen_len": 1024,
        "temperature": 0.7,
    })
    
    response = bedrock.invoke_model_with_response_stream(
        body=body,
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )
    
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("generation"):
            yield chunk["generation"]
        if chunk.get("stop_reason"):
            break
 
# Stream output
for token in stream_llama3("Explain Kubernetes networking in detail"):
    print(token, end="", flush=True)
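
The generator above drops the `stop_reason`, but callers often want to know whether a reply ended naturally or was cut off by `max_gen_len`. The sketch below collects a stream into one string and returns the stop reason alongside it. `collect_stream` is a hypothetical helper, and it assumes each chunk is a JSON payload with `generation` and `stop_reason` keys, as in the loop above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(events: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Join streamed chunks into one string and report why generation stopped."""
    pieces = []
    stop_reason = None
    for event in events:
        chunk = event.get("chunk")
        if chunk is None:
            # Non-chunk events are skipped in this sketch; production code
            # should inspect them, since they may carry errors.
            continue
        payload = json.loads(chunk["bytes"])
        if payload.get("generation"):
            pieces.append(payload["generation"])
        if payload.get("stop_reason"):
            stop_reason = payload["stop_reason"]
            break
    return "".join(pieces), stop_reason
```

With a real call you would pass `response["body"]` from `invoke_model_with_response_stream` as `events`, then decide whether to retry with a larger `max_gen_len` based on the returned reason.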

Production Setup: FastAPI Wrapper

python
# app.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import boto3
import json
import os
 
app = FastAPI()
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
 
class ChatRequest(BaseModel):
    message: str
    model: str = "meta.llama3-70b-instruct-v1:0"
    max_tokens: int = 1024
    stream: bool = False
 
@app.post("/chat")
def chat(request: ChatRequest):  # sync handler: FastAPI runs it in a threadpool, so blocking boto3 calls don't stall the event loop
    formatted = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{request.message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
 
    body = json.dumps({
        "prompt": formatted,
        "max_gen_len": request.max_tokens,
        "temperature": 0.7,
    })
    
    if request.stream:
        def generate():
            response = bedrock.invoke_model_with_response_stream(
                body=body, modelId=request.model,
                accept="application/json", contentType="application/json"
            )
            for event in response["body"]:
                chunk = json.loads(event["chunk"]["bytes"])
                if chunk.get("generation"):
                    yield f"data: {json.dumps({'text': chunk['generation']})}\n\n"
        
        return StreamingResponse(generate(), media_type="text/event-stream")
    
    response = bedrock.invoke_model(
        body=body, modelId=request.model,
        accept="application/json", contentType="application/json"
    )
    result = json.loads(response.get("body").read())
    return {"response": result["generation"]}
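
On the client side, the streaming branch of `/chat` emits server-sent events of the form `data: {"text": "..."}`. A small sketch of how a consumer might decode them; `iter_sse_text` is an illustrative name, and the parser only handles the `data:` lines this wrapper produces rather than the full SSE spec.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "text" field from each 'data: {...}' line of the /chat stream."""
    prefix = "data: "
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # blank separator lines and any other SSE fields
        payload = json.loads(line[len(prefix):])
        yield payload["text"]


# Simulated stream, shaped like the output of generate() in the wrapper
sample = ['data: {"text": "Hello"}', "", 'data: {"text": ", world"}', ""]
print("".join(iter_sse_text(sample)))
```

Against a running server you could feed it the decoded lines of a streaming HTTP response, for example `requests.post(url, json={"message": "...", "stream": True}, stream=True).iter_lines(decode_unicode=True)`, assuming the `requests` library is installed.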

IAM Permissions

Your Lambda, ECS task, or EC2 role needs:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/meta.llama3-70b-instruct-v1:0"
    }
  ]
}
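
Note that the policy above only covers the 70B model, while the FastAPI wrapper lets callers pick any model. If you serve more than one, each needs its own ARN. A minimal sketch that generates the policy from a list of model IDs, reusing the ARN pattern shown above; `bedrock_invoke_policy` is a hypothetical helper name.

```python
import json


def bedrock_invoke_policy(model_ids: list, region: str = "us-east-1") -> dict:
    """Build a least-privilege invoke policy for the given foundation models."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }


policy = bedrock_invoke_policy(
    ["meta.llama3-8b-instruct-v1:0", "meta.llama3-70b-instruct-v1:0"]
)
print(json.dumps(policy, indent=2))
```

Generating the document from the same model list your application allows keeps the IAM policy and the API's accepted models from drifting apart.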

Cost Comparison: Llama 3 Options

Option | Cost | Latency | Setup
Bedrock Llama 3 8B | ~$0.0003/1K tokens | Low | Zero
Bedrock Llama 3 70B | ~$0.00265/1K tokens | Medium | Zero
Self-hosted g4dn.xlarge | ~$0.000009/1K tokens | Low | High
Groq (fastest) | ~$0.00008/1K tokens | Ultra-low | Zero

Bedrock wins on operational simplicity. Self-hosted wins at high volume. Groq wins on speed.
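
To see what "high volume" means for your workload, you can turn the per-1K-token figures from the table into a monthly estimate. This is a back-of-the-envelope sketch: the prices are the approximate values quoted above, a single blended rate is assumed for all tokens, and the option keys and `monthly_cost` function are made up for illustration.

```python
# Approximate USD prices per 1K tokens, copied from the comparison table
PRICE_PER_1K_TOKENS = {
    "bedrock-llama3-8b": 0.0003,
    "bedrock-llama3-70b": 0.00265,
    "self-hosted-g4dn.xlarge": 0.000009,
    "groq": 0.00008,
}


def monthly_cost(option: str, tokens_per_day: int, days: int = 30) -> float:
    """Estimate spend in USD for a steady daily token volume."""
    return PRICE_PER_1K_TOKENS[option] * tokens_per_day / 1000 * days


# Compare every option at 10 million tokens per day
for option in PRICE_PER_1K_TOKENS:
    print(f"{option:26s} ${monthly_cost(option, 10_000_000):,.2f}/month")
```

At 10 million tokens per day, the 70B figure works out to roughly $795 per month and the 8B figure to roughly $90, which gives a baseline to weigh against the engineering time self-hosting would need.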


Monitoring with CloudWatch

python
import boto3
from datetime import datetime
 
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
 
def track_llm_usage(model_id: str, input_tokens: int, output_tokens: int, latency_ms: float):
    cloudwatch.put_metric_data(
        Namespace="LLM/Bedrock",
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens, 
             "Dimensions": [{"Name": "Model", "Value": model_id}]},
            {"MetricName": "OutputTokens", "Value": output_tokens,
             "Dimensions": [{"Name": "Model", "Value": model_id}]},
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "Model", "Value": model_id}]},
        ]
    )
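
`track_llm_usage` needs token counts and latency from somewhere. One way to get them, sketched below, is to time the call yourself and read the counts from the response body. This assumes the Llama response on Bedrock includes `prompt_token_count` and `generation_token_count` next to `generation`, which is worth confirming against the Bedrock documentation for your model version. Both helper names are hypothetical.

```python
import time
from typing import Any, Callable, Tuple


def usage_from_response(response_body: dict) -> Tuple[int, int]:
    """Extract (input_tokens, output_tokens), defaulting to 0 if absent."""
    return (
        response_body.get("prompt_token_count", 0),
        response_body.get("generation_token_count", 0),
    )


def timed_call(func: Callable[..., Any], *args, **kwargs) -> Tuple[Any, float]:
    """Run func and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms


# Usage with the functions defined earlier (needs AWS credentials):
#   response, latency_ms = timed_call(
#       bedrock.invoke_model, body=body, modelId=model_id,
#       accept="application/json", contentType="application/json")
#   response_body = json.loads(response["body"].read())
#   input_tokens, output_tokens = usage_from_response(response_body)
#   track_llm_usage(model_id, input_tokens, output_tokens, latency_ms)
```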

AWS Bedrock's Llama 3 is the fastest path to production LLM deployment if you're already on AWS. No GPU management, no model serving infrastructure, pay only for what you use.

For AWS infrastructure management, AWS Solutions Architect certification via KodeKloud covers Bedrock and other AI/ML AWS services.
