Bedrock Inference

With Docker enabled (SIMFRA_DOCKER=true), Bedrock Runtime runs real LLM inference via Ollama. The InvokeModel and Converse APIs translate Bedrock-formatted requests into Ollama chat completions, so you can test Bedrock integrations against real model responses.

Prerequisites

  • SIMFRA_DOCKER=true
  • An Ollama-compatible container image (default ollama/ollama:latest)
  • Sufficient RAM for the models you plan to use (7B parameter models typically need 4-8 GB)

How It Works

Simfra starts an Ollama container on first use and maps AWS Bedrock model IDs to Ollama model names. When you call InvokeModel or Converse, Simfra:

  1. Maps the Bedrock model ID to an Ollama model name.
  2. Translates the request format (Bedrock messages to Ollama chat format).
  3. Forwards the request to the Ollama container.
  4. Translates the response back to Bedrock format.

The default Ollama model is llama3.2. Models are pulled automatically on first use; with SIMFRA_BEDROCK_CACHE_MODELS=true (the default) they are pre-pulled on startup instead.
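
Steps 1 and 2 above can be sketched in a few lines of Python. This is a simplified illustration, not Simfra's actual implementation: the prefix table and default mirror the Model Families section, and the message shape follows the Converse API examples in this document.

```python
# Hypothetical sketch of Bedrock-to-Ollama request translation.
# Prefixes without an explicit entry fall through to the default model.
PREFIX_MAP = {
    "mistral.": "mistral",  # all other documented prefixes use the default
}
DEFAULT_MODEL = "llama3.2"  # SIMFRA_BEDROCK_DEFAULT_MODEL


def resolve_ollama_model(bedrock_model_id: str) -> str:
    """Step 1: map a Bedrock model ID to an Ollama model name."""
    for prefix, ollama_model in PREFIX_MAP.items():
        if bedrock_model_id.startswith(prefix):
            return ollama_model
    return DEFAULT_MODEL


def to_ollama_messages(bedrock_messages: list[dict]) -> list[dict]:
    """Step 2: flatten Bedrock content blocks into Ollama chat messages."""
    return [
        {"role": m["role"], "content": "".join(b.get("text", "") for b in m["content"])}
        for m in bedrock_messages
    ]
```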

Supported APIs

API                                 Streaming  Description
Converse                            No         Multi-turn conversation with structured messages
ConverseStream                      Yes        Streaming version of Converse (event-stream framing)
InvokeModel                         No         Raw model invocation with provider-specific request/response format
InvokeModelWithResponseStream       Yes        Streaming version of InvokeModel
InvokeModelWithBidirectionalStream  Yes        Bidirectional streaming
ApplyGuardrail                      No         Evaluate text against guardrail policies
CountTokens                         No         Estimate token count for input

Model Families

Simfra accepts model IDs from seven Bedrock model families. All are mapped to Ollama models:

Bedrock Model Prefix  Default Ollama Model  Example Model ID
anthropic.*           llama3.2              anthropic.claude-3-5-sonnet-20241022-v2:0
meta.*                llama3.2              meta.llama3-2-90b-instruct-v1:0
amazon.titan-text*    llama3.2              amazon.titan-text-express-v1
mistral.*             mistral               mistral.mistral-large-2407-v1:0
cohere.*              llama3.2              cohere.command-r-plus-v1:0
ai21.*                llama3.2              ai21.jamba-1-5-large-v1:0
stability.*           llama3.2              stability.stable-diffusion-xl-v1

The model actually serving each of these IDs is the Ollama model it maps to; responses reflect the capabilities of that Ollama model, not of the original Bedrock model.

Custom Model Mapping

Override the default mappings with SIMFRA_BEDROCK_MODEL_MAP:

export SIMFRA_BEDROCK_MODEL_MAP="anthropic.claude-3-5-sonnet-20241022-v2:0=llama3.2:latest,mistral.mistral-large-2407-v1:0=mistral:latest"

Format: comma-separated bedrock_model_id=ollama_model_name pairs.
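
The pair syntax is easy to parse; note that only the first = in each pair separates the Bedrock ID from the Ollama name, since Ollama model names can contain tags like :latest. An illustrative sketch (not Simfra's actual parser):

```python
import os


def parse_model_map(raw: str) -> dict[str, str]:
    """Parse a SIMFRA_BEDROCK_MODEL_MAP value into {bedrock_id: ollama_model}."""
    if not raw:
        return {}
    # split each pair on the FIRST '=' only, so "llama3.2:latest" stays intact
    return dict(pair.split("=", 1) for pair in raw.split(",") if pair)


mapping = parse_model_map(os.environ.get("SIMFRA_BEDROCK_MODEL_MAP", ""))
```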

Change the default fallback model:

export SIMFRA_BEDROCK_DEFAULT_MODEL=llama3.1

Using the Converse API

aws --endpoint-url http://localhost:4599 bedrock-runtime converse \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --messages '[{"role":"user","content":[{"text":"What is 2+2?"}]}]'

Example response:

{
  "output": {
    "message": {
      "role": "assistant",
      "content": [{"text": "2 + 2 = 4."}]
    }
  },
  "stopReason": "end_turn",
  "usage": {
    "inputTokens": 12,
    "outputTokens": 8,
    "totalTokens": 20
  }
}

Using the InvokeModel API

aws --endpoint-url http://localhost:4599 bedrock-runtime invoke-model \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --content-type application/json \
  --body '{"anthropic_version":"bedrock-2023-05-31","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}' \
  output.json
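
output.json holds the raw provider-format response body. For an Anthropic-format request like the one above, the body follows the Anthropic messages response shape, so the generated text can be extracted with the standard library (a sketch; the field names assume the Anthropic response format):

```python
import json


def extract_text(path: str) -> str:
    """Pull the assistant text out of an Anthropic-format InvokeModel response."""
    with open(path) as f:
        body = json.load(f)
    # Anthropic responses carry a list of content blocks, e.g.
    # {"content": [{"type": "text", "text": "..."}], "stop_reason": "end_turn"}
    return "".join(b["text"] for b in body["content"] if b.get("type") == "text")
```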

Streaming

ConverseStream and InvokeModelWithResponseStream return responses as they are generated, using AWS event-stream framing. SDK clients handle this automatically:

import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')
response = client.converse_stream(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    messages=[{'role': 'user', 'content': [{'text': 'Write a haiku about clouds.'}]}]
)

for event in response['stream']:
    if 'contentBlockDelta' in event:
        print(event['contentBlockDelta']['delta']['text'], end='')
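
Under the hood, each streamed event is wrapped in a binary envelope: an 8-byte prelude (total length, headers length), a CRC32 of the prelude, the headers, the JSON payload, and a trailing CRC32 of the whole message. A minimal encoder sketch of that framing (string-valued headers only; real clients and servers use the SDK's implementation):

```python
import struct
import zlib


def encode_event(headers: dict[str, str], payload: bytes) -> bytes:
    """Frame one event in AWS event-stream format."""
    header_bytes = b""
    for name, value in headers.items():
        n, v = name.encode(), value.encode()
        # name length (1 byte), name, value type 7 = string, value length (2 bytes), value
        header_bytes += struct.pack(">B", len(n)) + n + b"\x07" + struct.pack(">H", len(v)) + v
    total_len = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack(">II", total_len, len(header_bytes))
    message = prelude + struct.pack(">I", zlib.crc32(prelude)) + header_bytes + payload
    return message + struct.pack(">I", zlib.crc32(message))
```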

Guardrails

Bedrock guardrails perform rule-based evaluation of input and output text. Create a guardrail and apply it to conversations:

aws --endpoint-url http://localhost:4599 bedrock create-guardrail \
  --name my-guardrail \
  --blocked-input-messaging "Input blocked by guardrail." \
  --blocked-outputs-messaging "Output blocked by guardrail." \
  --word-policy-config blockedWordList=[{text=forbidden}] \
  --sensitive-information-policy-config piiEntitiesConfig=[{type=EMAIL,action=ANONYMIZE}]

Guardrail evaluation supports:

  • Word policies: blocked words and phrases (case-insensitive substring match).
  • PII detection: regex-based detection of email addresses, phone numbers, SSNs, credit card numbers, and other PII types. Actions: BLOCK or ANONYMIZE.
  • Content filters: keyword-based detection for categories like hate speech, insults, sexual content, violence, and misconduct.
  • Topic policies: denied topics with keyword triggers.

Apply a guardrail to a conversation by passing guardrailConfig in the Converse or InvokeModel request, or evaluate text directly with ApplyGuardrail.
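
The rule-based checks above can be sketched as a small evaluator. This is hypothetical and heavily simplified: blocked words use case-insensitive substring matching as documented, and PII anonymization is shown as a single email-address regex substitution.

```python
import re

# Simplified stand-in for one PII entity type (EMAIL)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def evaluate(text: str, blocked_words: list[str]) -> tuple[str, str]:
    """Return (action, output_text); action is BLOCKED, ANONYMIZED, or NONE."""
    lowered = text.lower()
    if any(w.lower() in lowered for w in blocked_words):
        return "BLOCKED", "Input blocked by guardrail."
    if EMAIL_RE.search(text):
        return "ANONYMIZED", EMAIL_RE.sub("{EMAIL}", text)
    return "NONE", text
```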

GPU Acceleration

For faster inference, pass GPU access to the Ollama container:

export SIMFRA_BEDROCK_OLLAMA_GPU=nvidia

This passes the --gpus flag to the Docker container and requires the NVIDIA Container Toolkit on the host.

Image Generation Backend

For image generation models (Stability, Titan Image), Simfra supports an alternative backend using stable-diffusion.cpp:

export SIMFRA_BEDROCK_IMAGE_BACKEND=sdcpp
export SIMFRA_BEDROCK_SDCPP_IMAGE=my-sdcpp-image:latest

When not configured, image generation requests return a placeholder image.

Configuration Reference

Variable                      Default             Description
SIMFRA_BEDROCK_IMAGE_BACKEND  sdcpp               Image-generation backend: ollama or sdcpp
SIMFRA_BEDROCK_OLLAMA_IMAGE   (sidecar registry)  Ollama container image (simfra-ollama with a pre-baked model)
SIMFRA_BEDROCK_SDCPP_IMAGE    (empty)             stable-diffusion.cpp container image
SIMFRA_BEDROCK_OLLAMA_GPU     (empty)             GPU device for Ollama (nvidia, all)
SIMFRA_BEDROCK_DEFAULT_MODEL  llama3.2            Default Ollama model for unmapped Bedrock model IDs
SIMFRA_BEDROCK_MODEL_MAP      (empty)             Custom model mapping (comma-separated bedrock=ollama pairs)
SIMFRA_BEDROCK_CACHE_MODELS   true                Pre-pull models on startup

Next Steps