Bedrock Inference
With Docker enabled, Bedrock Runtime runs LLM inference via Ollama. The InvokeModel and Converse APIs translate Bedrock-formatted requests to Ollama chat completions, so you can test Bedrock integrations with real model responses.
Prerequisites
- Docker enabled (SIMFRA_DOCKER=true)
- An Ollama-compatible container image (default ollama/ollama:latest)
- Sufficient RAM for the models you plan to use (7B-parameter models typically need 4-8 GB)
How It Works
Simfra starts an Ollama container on first use and maps AWS Bedrock model IDs to Ollama model names. When you call InvokeModel or Converse, Simfra:
- Maps the Bedrock model ID to an Ollama model name.
- Translates the request format (Bedrock messages to Ollama chat format).
- Forwards the request to the Ollama container.
- Translates the response back to Bedrock format.
The default Ollama model is llama3.2. Models are pulled automatically on first use (set SIMFRA_BEDROCK_CACHE_MODELS=true to pre-pull on startup).
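Conceptually, the prefix-based lookup behaves like the short sketch below. This is illustrative only: PREFIX_MAP and resolve_ollama_model are hypothetical names for this example, not Simfra internals.

# Hypothetical sketch of the prefix-based mapping described above (not Simfra's actual code).
PREFIX_MAP = {
    "anthropic.": "llama3.2",
    "meta.": "llama3.2",
    "mistral.": "mistral",
}
DEFAULT_MODEL = "llama3.2"  # corresponds to SIMFRA_BEDROCK_DEFAULT_MODEL

def resolve_ollama_model(bedrock_model_id: str) -> str:
    """Return the Ollama model name for a given Bedrock model ID."""
    for prefix, ollama_name in PREFIX_MAP.items():
        if bedrock_model_id.startswith(prefix):
            return ollama_name
    return DEFAULT_MODEL  # unmapped IDs fall back to the default model

resolve_ollama_model("anthropic.claude-3-5-sonnet-20241022-v2:0")  # -> "llama3.2"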
Supported APIs
| API | Streaming | Description |
|---|---|---|
| Converse | No | Multi-turn conversation with structured messages |
| ConverseStream | Yes | Streaming version of Converse (event-stream framing) |
| InvokeModel | No | Raw model invocation with provider-specific request/response format |
| InvokeModelWithResponseStream | Yes | Streaming version of InvokeModel |
| InvokeModelWithBidirectionalStream | Yes | Bidirectional streaming |
| ApplyGuardrail | No | Evaluate text against guardrail policies |
| CountTokens | No | Estimate token count for input |
Model Families
Simfra accepts model IDs from seven Bedrock model families. All are mapped to Ollama models:
| Bedrock Model Prefix | Default Ollama Model | Example Model IDs |
|---|---|---|
| anthropic.* | llama3.2 | anthropic.claude-3-5-sonnet-20241022-v2:0 |
| meta.* | llama3.2 | meta.llama3-2-90b-instruct-v1:0 |
| amazon.titan-text* | llama3.2 | amazon.titan-text-express-v1 |
| mistral.* | mistral | mistral.mistral-large-2407-v1:0 |
| cohere.* | llama3.2 | cohere.command-r-plus-v1:0 |
| ai21.* | llama3.2 | ai21.jamba-1-5-large-v1:0 |
| stability.* | llama3.2 | stability.stable-diffusion-xl-v1 |
The model actually serving each of these IDs is the Ollama model it maps to; responses reflect the capabilities of that Ollama model, not of the original Bedrock model.
Custom Model Mapping
Override the default mappings with SIMFRA_BEDROCK_MODEL_MAP:
export SIMFRA_BEDROCK_MODEL_MAP="anthropic.claude-3-5-sonnet-20241022-v2:0=llama3.2:latest,mistral.mistral-large-2407-v1:0=mistral:latest"
Format: comma-separated bedrock_model_id=ollama_model_name pairs.
Change the default fallback model:
export SIMFRA_BEDROCK_DEFAULT_MODEL=llama3.1
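The mapping string is simple enough to build or inspect programmatically. A minimal sketch of parsing the documented pair format (the parse_model_map helper is illustrative, not part of Simfra):

import os

def parse_model_map(raw: str) -> dict:
    """Parse comma-separated bedrock_model_id=ollama_model_name pairs."""
    entries = (pair.split("=", 1) for pair in raw.split(",") if pair)
    return {bedrock_id.strip(): ollama.strip() for bedrock_id, ollama in entries}

# e.g. {'anthropic.claude-3-5-sonnet-20241022-v2:0': 'llama3.2:latest', ...}
print(parse_model_map(os.environ.get("SIMFRA_BEDROCK_MODEL_MAP", "")))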
Using the Converse API
aws --endpoint-url http://localhost:4599 bedrock-runtime converse \
--model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
--messages '[{"role":"user","content":[{"text":"What is 2+2?"}]}]'

A typical response:
{
  "output": {
    "message": {
      "role": "assistant",
      "content": [{"text": "2 + 2 = 4."}]
    }
  },
  "stopReason": "end_turn",
  "usage": {
    "inputTokens": 12,
    "outputTokens": 8,
    "totalTokens": 20
  }
}
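The same call through the Python SDK (this assumes the default Simfra endpoint used in the CLI examples and an AWS region and credentials already configured for boto3):

import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')

response = client.converse(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    messages=[{'role': 'user', 'content': [{'text': 'What is 2+2?'}]}]
)

# Same shape as the CLI output above: assistant message plus token usage.
print(response['output']['message']['content'][0]['text'])
print(response['usage'])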
Using the InvokeModel API
aws --endpoint-url http://localhost:4599 bedrock-runtime invoke-model \
--model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
--content-type application/json \
--cli-binary-format raw-in-base64-out \
--body '{"anthropic_version":"bedrock-2023-05-31","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}' \
output.json

The provider-formatted response body is written to output.json.
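The equivalent boto3 call; the request body is the same Anthropic-style payload, and the response body comes back as a blob that you read and parse as JSON:

import json
import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
}

response = client.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    contentType='application/json',
    accept='application/json',
    body=json.dumps(body),
)

# The body follows the provider-specific response format noted in the API table.
print(json.loads(response['body'].read()))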
Streaming
ConverseStream and InvokeModelWithResponseStream return responses as they are generated, using AWS event-stream framing. SDK clients handle this automatically:
import boto3
client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')
response = client.converse_stream(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
messages=[{'role': 'user', 'content': [{'text': 'Write a haiku about clouds.'}]}]
)
for event in response['stream']:
if 'contentBlockDelta' in event:
print(event['contentBlockDelta']['delta']['text'], end='')
Guardrails
Bedrock guardrails perform rule-based evaluation of input and output text. Create a guardrail and apply it to conversations:
aws --endpoint-url http://localhost:4599 bedrock create-guardrail \
--name my-guardrail \
--blocked-input-messaging "Input blocked by guardrail." \
--blocked-outputs-messaging "Output blocked by guardrail." \
--word-policy-config 'wordsConfig=[{text=forbidden}]' \
--sensitive-information-policy-config 'piiEntitiesConfig=[{type=EMAIL,action=ANONYMIZE}]'
Guardrail evaluation supports:
- Word policies: blocked words and phrases (case-insensitive substring match).
- PII detection: regex-based detection of email addresses, phone numbers, SSNs, credit card numbers, and other PII types. Actions: BLOCK or ANONYMIZE.
- Content filters: keyword-based detection for categories like hate speech, insults, sexual content, violence, and misconduct.
- Topic policies: denied topics with keyword triggers.
Apply a guardrail to a conversation by passing guardrailConfig in the Converse or InvokeModel request, or evaluate text directly with ApplyGuardrail.
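For example, with boto3 you can evaluate text directly with ApplyGuardrail or attach the guardrail to a Converse call. The guardrail identifier and version below are placeholders; use the values returned by create-guardrail:

import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')

# Evaluate input text directly against the guardrail policies.
evaluation = client.apply_guardrail(
    guardrailIdentifier='my-guardrail-id',  # placeholder: use the ID from create-guardrail
    guardrailVersion='DRAFT',
    source='INPUT',
    content=[{'text': {'text': 'My email is user@example.com'}}],
)
print(evaluation['action'])  # GUARDRAIL_INTERVENED when a policy matches, NONE otherwise

# Attach the guardrail to a conversation instead.
response = client.converse(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    messages=[{'role': 'user', 'content': [{'text': 'Say something forbidden.'}]}],
    guardrailConfig={'guardrailIdentifier': 'my-guardrail-id', 'guardrailVersion': 'DRAFT'},
)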
GPU Acceleration
For faster inference, pass GPU access to the Ollama container:
export SIMFRA_BEDROCK_OLLAMA_GPU=nvidia
This adds the --gpus flag when starting the Ollama container. It requires the NVIDIA Container Toolkit to be installed on the host.
Image Generation Backend
For image generation models (Stability, Titan Image), Simfra supports an alternative backend using stable-diffusion.cpp:
export SIMFRA_BEDROCK_IMAGE_BACKEND=sdcpp
export SIMFRA_BEDROCK_SDCPP_IMAGE=my-sdcpp-image:latest
When not configured, image generation requests return a placeholder image.
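A sketch of an image request through InvokeModel, assuming the Stability SDXL body shape used on Bedrock (text_prompts in, base64-encoded artifacts out); the exact fields the simulator honors may differ, and without a configured backend the returned image is the placeholder mentioned above:

import base64
import json
import boto3

client = boto3.client('bedrock-runtime', endpoint_url='http://localhost:4599')

response = client.invoke_model(
    modelId='stability.stable-diffusion-xl-v1',
    contentType='application/json',
    accept='application/json',
    body=json.dumps({'text_prompts': [{'text': 'a watercolor painting of clouds'}]}),
)

result = json.loads(response['body'].read())
# Stability-style responses carry base64-encoded images in "artifacts".
with open('output.png', 'wb') as f:
    f.write(base64.b64decode(result['artifacts'][0]['base64']))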
Configuration Reference
| Variable | Default | Description |
|---|---|---|
| SIMFRA_BEDROCK_IMAGE_BACKEND | sdcpp | Inference backend: ollama or sdcpp |
| SIMFRA_BEDROCK_OLLAMA_IMAGE | (uses sidecar registry) | Ollama container image (default: simfra-ollama with pre-baked model) |
| SIMFRA_BEDROCK_SDCPP_IMAGE | (empty) | stable-diffusion.cpp container image |
| SIMFRA_BEDROCK_OLLAMA_GPU | (empty) | GPU device for Ollama (nvidia, all) |
| SIMFRA_BEDROCK_DEFAULT_MODEL | llama3.2 | Default Ollama model for unmapped Bedrock model IDs |
| SIMFRA_BEDROCK_MODEL_MAP | (empty) | Custom model mapping (comma-separated bedrock=ollama pairs) |
| SIMFRA_BEDROCK_CACHE_MODELS | true | Pre-pull models on startup |
Next Steps
- Lambda Execution - run Lambda functions that call Bedrock
- SDK Configuration - configure AWS SDKs to call Bedrock via Simfra