Bedrock Inference

With Docker enabled (SIMFRA_DOCKER=true), Bedrock Runtime runs real LLM inference via Ollama. The InvokeModel and Converse APIs translate Bedrock-formatted requests into Ollama chat completions, so you can test Bedrock integrations against real model responses.

Prerequisites

  • SIMFRA_DOCKER=true
  • An Ollama-compatible container image (default ollama/ollama:latest)
  • Sufficient RAM for the models you plan to use (7B parameter models typically need 4-8 GB)

How It Works

Simfra starts an Ollama container on first use and maps AWS Bedrock model IDs to Ollama model names. When you call InvokeModel or Converse, Simfra:

  1. Maps the Bedrock model ID to an Ollama model name.
  2. Translates the request format (Bedrock messages to Ollama chat format).
  3. Forwards the request to the Ollama container.
  4. Translates the response back to Bedrock format.

The default Ollama model is llama3.2. Models are pulled automatically on first use; with SIMFRA_BEDROCK_CACHE_MODELS=true (the default) they are pre-pulled on startup instead.
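
Steps 1 and 2 above can be sketched in a few lines of Python. This is a simplified illustration, not Simfra's actual implementation: the prefix table and default mirror the Model Families section, and the message shape follows the Converse API examples in this document.

```python
# Hypothetical sketch of Bedrock-to-Ollama request translation.
# Prefixes without an explicit entry fall through to the default model.
PREFIX_MAP = {
    "mistral.": "mistral",  # all other documented prefixes use the default
}
DEFAULT_MODEL = "llama3.2"  # SIMFRA_BEDROCK_DEFAULT_MODEL


def resolve_ollama_model(bedrock_model_id: str) -> str:
    """Step 1: map a Bedrock model ID to an Ollama model name."""
    for prefix, ollama_model in PREFIX_MAP.items():
        if bedrock_model_id.startswith(prefix):
            return ollama_model
    return DEFAULT_MODEL


def to_ollama_messages(bedrock_messages: list[dict]) -> list[dict]:
    """Step 2: flatten Bedrock content blocks into Ollama chat messages."""
    return [
        {"role": m["role"], "content": "".join(b.get("text", "") for b in m["content"])}
        for m in bedrock_messages
    ]
```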

Supported APIs

API                                 Streaming  Description
Converse                            No         Multi-turn conversation with structured messages
ConverseStream                      Yes        Streaming version of Converse (event-stream framing)
InvokeModel                         No         Raw model invocation with provider-specific request/response format
InvokeModelWithResponseStream       Yes        Streaming version of InvokeModel
InvokeModelWithBidirectionalStream  Yes        Bidirectional streaming
ApplyGuardrail                      No         Evaluate text against guardrail policies
CountTokens                         No         Estimate token count for input

Model Families

Simfra accepts model IDs from seven Bedrock model families. All are mapped to Ollama models:

Bedrock Model Prefix  Default Ollama Model  Example Model ID
anthropic.*           llama3.2              anthropic.claude-3-5-sonnet-20241022-v2:0
meta.*                llama3.2              meta.llama3-2-90b-instruct-v1:0
amazon.titan-text*    llama3.2              amazon.titan-text-express-v1
mistral.*             mistral               mistral.mistral-large-2407-v1:0
cohere.*              llama3.2              cohere.command-r-plus-v1:0
ai21.*                llama3.2              ai21.jamba-1-5-large-v1:0
stability.*           llama3.2              stability.stable-diffusion-xl-v1

The model actually serving each of these IDs is the Ollama model it maps to; responses reflect the capabilities of that Ollama model, not of the original Bedrock model.

Custom Model Mapping

Override the default mappings with SIMFRA_BEDROCK_MODEL_MAP:

export SIMFRA_BEDROCK_MODEL_MAP="anthropic.claude-3-5-sonnet-20241022-v2:0=llama3.2:latest,mistral.mistral-large-2407-v1:0=mistral:latest"

Format: comma-separated bedrock_model_id=ollama_model_name pairs.
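
The pair syntax is easy to parse; note that only the first = in each pair separates the Bedrock ID from the Ollama name, since Ollama model names can contain tags like :latest. An illustrative sketch (not Simfra's actual parser):

```python
import os


def parse_model_map(raw: str) -> dict[str, str]:
    """Parse a SIMFRA_BEDROCK_MODEL_MAP value into {bedrock_id: ollama_model}."""
    if not raw:
        return {}
    # split each pair on the FIRST '=' only, so "llama3.2:latest" stays intact
    return dict(pair.split("=", 1) for pair in raw.split(",") if pair)


mapping = parse_model_map(os.environ.get("SIMFRA_BEDROCK_MODEL_MAP", ""))
```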

Change the default fallback model:

export SIMFRA_BEDROCK_DEFAULT_MODEL=llama3.1

Using the Converse API

aws --endpoint-url http://localhost:4599 bedrock-runtime converse \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --messages '[{"role":"user","content":[{"text":"What is 2+2?"}]}]'

Example response:

{
  "output": {
    "message": {
      "role": "assistant",
      "content": [{"text": "2 + 2 = 4."}]
    }
  },
  "stopReason": "end_turn",
  "usage": {
    "inputTokens": 12,
    "outputTokens": 8,
    "totalTokens": 20
  }
}

Using the InvokeModel API

aws --endpoint-url http://localhost:4599 bedrock-runtime invoke-model \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --content-type application/json \
  --body '{"anthropic_version":"bedrock-2023-05-31","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}' \
  output.json
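
output.json holds the raw provider-format response body. For an Anthropic-format request like the one above, the body follows the Anthropic messages response shape, so the generated text can be extracted with the standard library (a sketch; the field names assume the Anthropic response format):

```python
import json


def extract_text(path: str) -> str:
    """Pull the assistant text out of an Anthropic-format InvokeModel response."""
    with open(path) as f:
        body = json.load(f)
    # Anthropic responses carry a list of content blocks, e.g.
    # {"content": [{"type": "text", "text": "..."}], "stop_reason": "end_turn"}
    return "".join(b["text"] for b in body["content"] if b.get("type") == "text")
```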

Streaming

ConverseStream and InvokeModelWithResponseStream return responses as they are generated, using AWS event-stream framing. SDK clients handle this automatically:

import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')
response = client.converse_stream(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    messages=[{'role': 'user', 'content': [{'text': 'Write a haiku about clouds.'}]}]
)

for event in response['stream']:
    if 'contentBlockDelta' in event:
        print(event['contentBlockDelta']['delta']['text'], end='')
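
Under the hood, each streamed event is wrapped in a binary envelope: an 8-byte prelude (total length, headers length), a CRC32 of the prelude, the headers, the JSON payload, and a trailing CRC32 of the whole message. A minimal encoder sketch of that framing (string-valued headers only; real clients and servers use the SDK's implementation):

```python
import struct
import zlib


def encode_event(headers: dict[str, str], payload: bytes) -> bytes:
    """Frame one event in AWS event-stream format."""
    header_bytes = b""
    for name, value in headers.items():
        n, v = name.encode(), value.encode()
        # name length (1 byte), name, value type 7 = string, value length (2 bytes), value
        header_bytes += struct.pack(">B", len(n)) + n + b"\x07" + struct.pack(">H", len(v)) + v
    total_len = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack(">II", total_len, len(header_bytes))
    message = prelude + struct.pack(">I", zlib.crc32(prelude)) + header_bytes + payload
    return message + struct.pack(">I", zlib.crc32(message))
```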

Guardrails

Bedrock guardrails perform rule-based evaluation of input and output text. Create a guardrail and apply it to conversations:

aws --endpoint-url http://localhost:4599 bedrock create-guardrail \
  --name my-guardrail \
  --blocked-input-messaging "Input blocked by guardrail." \
  --blocked-outputs-messaging "Output blocked by guardrail." \
  --word-policy-config blockedWordList=[{text=forbidden}] \
  --sensitive-information-policy-config piiEntitiesConfig=[{type=EMAIL,action=ANONYMIZE}]

Guardrail evaluation supports:

  • Word policies: blocked words and phrases (case-insensitive substring match).
  • PII detection: regex-based detection of email addresses, phone numbers, SSNs, credit card numbers, and other PII types. Actions: BLOCK or ANONYMIZE.
  • Content filters: keyword-based detection for categories like hate speech, insults, sexual content, violence, and misconduct.
  • Topic policies: denied topics with keyword triggers.

Apply a guardrail to a conversation by passing guardrailConfig in the Converse or InvokeModel request, or evaluate text directly with ApplyGuardrail.
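
The rule-based checks above can be sketched as a small evaluator. This is hypothetical and heavily simplified: blocked words use case-insensitive substring matching as documented, and PII anonymization is shown as a single email-address regex substitution.

```python
import re

# Simplified stand-in for one PII entity type (EMAIL)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def evaluate(text: str, blocked_words: list[str]) -> tuple[str, str]:
    """Return (action, output_text); action is BLOCKED, ANONYMIZED, or NONE."""
    lowered = text.lower()
    if any(w.lower() in lowered for w in blocked_words):
        return "BLOCKED", "Input blocked by guardrail."
    if EMAIL_RE.search(text):
        return "ANONYMIZED", EMAIL_RE.sub("{EMAIL}", text)
    return "NONE", text
```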

GPU Acceleration

For faster inference, pass GPU access to the Ollama container:

export SIMFRA_BEDROCK_OLLAMA_GPU=nvidia

This passes the --gpus flag to the Docker container and requires the NVIDIA Container Toolkit on the host.

Image Generation Backend

For image generation models (Stability, Titan Image), Simfra supports an alternative backend using stable-diffusion.cpp:

export SIMFRA_BEDROCK_IMAGE_BACKEND=sdcpp
export SIMFRA_BEDROCK_SDCPP_IMAGE=my-sdcpp-image:latest

When not configured, image generation requests return a placeholder image.

Configuration Reference

Variable                      Default             Description
SIMFRA_BEDROCK_IMAGE_BACKEND  sdcpp               Image-generation backend: ollama or sdcpp
SIMFRA_BEDROCK_OLLAMA_IMAGE   (sidecar registry)  Ollama container image (simfra-ollama with a pre-baked model)
SIMFRA_BEDROCK_SDCPP_IMAGE    (empty)             stable-diffusion.cpp container image
SIMFRA_BEDROCK_OLLAMA_GPU     (empty)             GPU device for Ollama (nvidia, all)
SIMFRA_BEDROCK_DEFAULT_MODEL  llama3.2            Default Ollama model for unmapped Bedrock model IDs
SIMFRA_BEDROCK_MODEL_MAP      (empty)             Custom model mapping (comma-separated bedrock=ollama pairs)
SIMFRA_BEDROCK_CACHE_MODELS   true                Pre-pull models on startup

Next Steps