Treetop Discovery is an AWS CDK-based data discovery platform that builds searchable knowledge bases from IIIF (International Image Interoperability Framework) manifests and EAD (Encoded Archival Description) files using Amazon Bedrock.
- AWS Account: Administrator permissions recommended
- AWS Regions: Deploy in a region where Amazon Bedrock is available (e.g., us-east-1, us-west-2)
- Bedrock Model Access: Enable access to embedding and foundation models in the Bedrock console
Install these tools on your local machine:
- uv (installation instructions)
- AWS CLI (installation instructions)
- AWS CDK CLI (installation instructions)
- AWS CLI configured with your credentials (aws configure)
Note
If you don't have uv installed, see the alternate Python setup below.
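To confirm everything is installed, check each tool's version (exact version numbers will vary):
# Verify the prerequisites are on your PATH
uv --version
aws --version
cdk --version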
# Clone the repository and navigate to it
git clone https://github.com/nulib/treetop-discovery.git
cd treetop-discovery
# Install Python dependencies
uv sync --all-groups
# Activate virtual environment
source .venv/bin/activate
# Verify CDK can build (while in project root)
cd osdp && cdk synth && cd ..
Important
If CDK cannot locate Python dependencies, restart your shell and re-activate the virtual environment.
Important
Choose Your Data Source Type First: Treetop Discovery supports two data source types - IIIF or EAD. You must choose one during initial deployment. Additional data sources can be added later.
Create your configuration file:
cp osdp/config.toml.example osdp/config.toml
For IIIF Data Sources (digital collections with IIIF manifests):
stack_prefix = "my-treetop" # Choose your stack name prefix
# Recommended: Cross-region inference profiles (replace 123456789012 with your account ID)
embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-multilingual-v3"
foundation_model_arn = "arn:aws:bedrock:us-east-1:123456789012:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0"
[data]
type = "iiif"
collection_url = "https://your-iiif-collection-api-url"
[tags]
project = "my-project"
# ECR configuration (optional - defaults shown below)
# Uncomment and modify only if you need to override defaults
# [ecr]
# registry = "public.ecr.aws" # Default
# repository = "nulib-staging/osdp-iiif-fetcher" # Default
# tag = "latest" # Default
For EAD Data Sources (archival XML files in S3):
stack_prefix = "my-treetop" # Choose your stack name prefix
# Recommended: Cross-region inference profiles (replace 123456789012 with your account ID)
embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-multilingual-v3"
foundation_model_arn = "arn:aws:bedrock:us-east-1:123456789012:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0"
[data]
type = "ead"
[data.s3]
bucket = "your-s3-bucket-name"
prefix = "path/to/ead/files/"
[tags]
project = "my-project"
# Note: ECR section not required for EAD workflows
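Before deploying, you can sanity-check that the file parses and contains the keys for your chosen data source. Below is a minimal sketch using Python 3.12's built-in tomllib, assuming the key names from the examples above (this script is illustrative, not part of the project):
# check_config.py - quick structural check of osdp/config.toml (illustrative only)
import tomllib

with open("osdp/config.toml", "rb") as f:
    config = tomllib.load(f)

# Top-level keys shown in the examples above
for key in ("stack_prefix", "embedding_model_arn", "foundation_model_arn"):
    assert key in config, f"missing {key}"

data = config["data"]
if data["type"] == "iiif":
    assert "collection_url" in data, "IIIF configs need data.collection_url"
elif data["type"] == "ead":
    s3 = data.get("s3", {})
    assert "bucket" in s3 and "prefix" in s3, "EAD configs need data.s3.bucket and data.s3.prefix"
else:
    raise ValueError(f"unknown data.type: {data['type']}")

print("config.toml looks structurally valid")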
Required Configuration Changes:
- stack_prefix: Choose a unique name for your deployment (e.g., "my-treetop")
- Account ID: Replace 123456789012 in the foundation_model_arn (inference profile) with your AWS account ID
- collection_url (IIIF only): Your institution's IIIF collection API endpoint
- bucket & prefix (EAD only): S3 location where your EAD XML files are stored
Get Your AWS Account ID:
aws sts get-caller-identity --query Account --output text
Data Source Requirements:
- IIIF: Your collection must expose a IIIF Collection API endpoint that lists manifest URLs
- EAD: Your EAD XML files must be uploaded to S3 before deployment
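For IIIF, a quick way to confirm your endpoint returns a collection document (assuming curl and jq are available; IIIF Presentation API 2.x uses an @type field, 3.0 uses type):
# Should print "sc:Collection" (v2) or "Collection" (v3)
curl -s https://your-iiif-collection-api-url | jq '."@type" // .type'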
If you chose EAD as your data source, you need to prepare your files before deployment:
1. Create S3 Bucket (if you don't have one):
# Note: S3 bucket names must be lowercase and globally unique
aws s3 mb s3://your-ead-bucket-name
2. Upload EAD XML Files:
# Upload individual files
aws s3 cp your-file.xml s3://your-ead-bucket-name/ead-files/
# Upload entire directory
aws s3 sync ./local-ead-directory/ s3://your-ead-bucket-name/ead-files/
3. Update Configuration: Ensure your config.toml reflects the S3 location:
[data.s3]
bucket = "your-ead-bucket-name" # Must be lowercase
prefix = "ead-files/"
Important
S3 Bucket Naming: Bucket names must be lowercase, contain no underscores, and be globally unique across all AWS accounts.
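After uploading, you can confirm the files landed under the prefix your config.toml points to:
# List the uploaded EAD files
aws s3 ls s3://your-ead-bucket-name/ead-files/ --recursive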
Enable Model Access: Follow the AWS documentation to enable model access. You'll need to enable access to:
- Embedding models: Amazon Titan Embed Text, Cohere Embed, or other embedding models
- Foundation models: Anthropic Claude, Amazon Titan Text, or other foundation models
Get Model ARNs: After enabling access, you can find ARNs using:
# List available embedding models
aws bedrock list-foundation-models --by-output-modality EMBEDDING
# List available text generation models
aws bedrock list-foundation-models --by-output-modality TEXT
# List available inference profiles (recommended for better performance)
aws bedrock list-inference-profiles
ARN Format Requirements:
For embedding_model_arn, use direct foundation model ARNs:
- Format: arn:aws:bedrock:REGION::foundation-model/MODEL_ID
- Example: arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-multilingual-v3
For foundation_model_arn, use inference profile ARNs:
- Format: arn:aws:bedrock:REGION:ACCOUNT_ID:inference-profile/PROFILE_ID
- Example: arn:aws:bedrock:us-east-1:123456789012:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0
- Benefits: Inference profiles automatically route requests across regions for higher throughput and better availability
Common Model ARNs by Region:
For embedding_model_arn (direct foundation model ARNs):
- US East 1: arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-multilingual-v3
- US East 1 (Alternative): arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0
- US West 2: arn:aws:bedrock:us-west-2::foundation-model/cohere.embed-multilingual-v3
- US West 2 (Alternative): arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v1
For foundation_model_arn (inference profile ARNs - replace 123456789012 with your AWS account ID):
- US Regions: arn:aws:bedrock:us-east-1:123456789012:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0
- US Regions (Alternative): arn:aws:bedrock:us-east-1:123456789012:inference-profile/us.anthropic.claude-3-7-sonnet-20250219-v1:0
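Rather than pasting your account ID by hand, you can assemble the inference profile ARN in your shell (using the same profile ID as the examples above):
# Build the foundation_model_arn from your current credentials
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "arn:aws:bedrock:us-east-1:${ACCOUNT_ID}:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0"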
Tip
Use Inference Profiles for Production: Inference profiles provide cross-region inference for better performance and availability. They automatically route requests to optimal regions within your geography.
# Navigate to CDK directory
cd osdp
# Activate virtual environment
source ../.venv/bin/activate
# Bootstrap CDK (required for first-time deployment in region)
cdk bootstrap
# List available stacks to confirm name
cdk ls
# Output will show: my-treetop-OSDP-Prototype
# Deploy the stack (bypass approval prompts for automated deployment)
cdk deploy my-treetop-OSDP-Prototype --require-approval never
Note
First-time Setup: The cdk bootstrap command is required only once per AWS account/region combination. It creates the S3 buckets and IAM roles that CDK needs for deployments.
CDK will deploy approximately 15-20 AWS resources including databases, compute services, storage, and AI/ML components. The deployment typically takes 10-15 minutes.
Warning
Post-Deployment Data Loading: After CDK deployment completes, the application will show CORS errors and be unusable until the initial data ingestion finishes. This process can take several hours depending on your collection size. Monitor progress using the steps below.
Required AWS Permissions: This deployment requires Administrator permissions or a custom policy with extensive permissions across S3, RDS, Lambda, Step Functions, Bedrock, Cognito, API Gateway, Amplify, ECS, and IAM.
AWS Credentials Setup:
- Ensure your AWS CLI is configured with valid credentials (aws configure or AWS SSO)
- For AWS SSO users: Refresh credentials if you get "invalid security token" errors (if you need help, see Northwestern University Development)
- The deployment process can take 15-30 minutes, so ensure your session won't expire mid-deployment
After deployment completes, check the bottom of the deploy output for the stack outputs (if you miss them, go to: AWS Console → CloudFormation → Your Stack → Outputs tab):
- Website URL: Your Amplify application URL
- ApiUrl: Your API Gateway endpoint URL
- UserPoolId: Cognito User Pool ID (needed for creating users)
- KnowledgeBaseId: Bedrock Knowledge Base ID
Tip
Save the UserPoolId and Website URL from the outputs. You'll need the UserPoolId to create Cognito users in the next step. The Website URL is your application URL, which can also be retrieved manually by following step 6 below.
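You can also retrieve the outputs from the CLI at any time (the stack name below assumes the my-treetop prefix from the earlier examples):
# Print all stack outputs as a table
aws cloudformation describe-stacks \
  --stack-name my-treetop-OSDP-Prototype \
  --query "Stacks[0].Outputs" --output table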
Initial data loading takes hours, and the UI shows CORS errors until it completes. Use the checks below to monitor progress; CLI equivalents are sketched after step 4.
1. Check Step Function Execution:
- AWS Console → Step Functions → <stack-prefix>-data-pipeline
- Look for "RUNNING" or "SUCCEEDED" status
- If failed, click the execution to see error details
- Expected runtime: 1-4 hours depending on collection size
2. Monitor Bedrock Knowledge Base Sync:
- AWS Console → Amazon Bedrock → Knowledge bases
- Select your knowledge base (named <stack-prefix>-knowledge-base)
- Click the "Data source" tab → View sync jobs
- Wait for the sync status to change from "Syncing" to "Ready"
3. Verify Data Processing:
- AWS Console → S3 → Your Data Processing Bucket (the one with the random suffix - see Finding Your S3 Buckets)
- Check for processed files in the data/ead/ folder (EAD workflows) or the data/iiif/ folder (IIIF workflows)
- Files should appear as the Step Function progresses
- Note: Your original EAD files remain unchanged in the Config/Source Bucket
4. Check for Errors:
- AWS Console → CloudWatch → Log groups
- Look for logs from these Lambda functions:
  - /aws/lambda/<stack-prefix>-get-iiif-manifest (IIIF only)
  - /aws/lambda/<stack-prefix>-process-ead (EAD only)
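If you prefer the CLI, here are rough equivalents of steps 1, 2, and 4 (replace the placeholders with values from your deployment; list-ingestion-jobs needs the data source ID, which list-data-sources returns):
# Step 1: find the pipeline state machine, then inspect recent executions
aws stepfunctions list-state-machines \
  --query "stateMachines[?contains(name, 'data-pipeline')].stateMachineArn" --output text
aws stepfunctions list-executions --state-machine-arn <state-machine-arn> --max-results 5
# Step 2: inspect Knowledge Base ingestion (sync) jobs
aws bedrock-agent list-data-sources --knowledge-base-id <KnowledgeBaseId>
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id <KnowledgeBaseId> --data-source-id <data-source-id>
# Step 4: tail recent Lambda logs (requires AWS CLI v2)
aws logs tail /aws/lambda/<stack-prefix>-process-ead --since 1h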
Before accessing the UI, create Cognito users using the UserPoolId from your stack outputs:
1. Via AWS Console:
- Go to Amazon Cognito → User Pools
- Select your pool (named <stack-prefix>-user-pool)
- Click "Create user"
- Set username, temporary password, and email
- The user must change the password on first login
2. Via CLI (using the UserPoolId from stack outputs):
aws cognito-idp admin-create-user \
  --user-pool-id <UserPoolId-from-stack-outputs> \
  --username john.doe \
  --user-attributes Name=email,[email protected] \
  --temporary-password TempPass123! \
  --message-action SUPPRESS
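Optionally, for test accounts you can skip the forced password change by setting a permanent password afterwards:
# Set a permanent password so the user isn't forced to change it on first login
aws cognito-idp admin-set-user-password \
  --user-pool-id <UserPoolId-from-stack-outputs> \
  --username john.doe \
  --password 'NewSecurePass123!' \
  --permanent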
Your application URL is available in the stack outputs as Website URL. If you need to find it manually:
- Via AWS Console: Go to AWS Amplify → Apps → <your-prefix>-ui-<suffix> → View App
- Via CLI:
aws amplify list-apps --query 'apps[?contains(name,`my-treetop-ui`)].{Name:name,Domain:defaultDomain}' --output table
The URL format is: https://main.<app-id>.amplifyapp.com
Testing Access:
- Navigate to your application URL
- Log in with the Cognito user credentials you created
- The chat interface should be available once data loading completes
After your initial deployment, you can load additional datasets by manually invoking the Step Function:
For Additional IIIF Collections:
- Go to AWS Console → Step Functions
- Select your state machine: <stack-prefix>-data-pipeline
- Click "Start execution"
- Use this JSON input (replace bucket name with your Data Processing Bucket - see Finding Your S3 Buckets below):
{
"s3": {
"Bucket": "your-s3-bucket-name",
"Key": "manifests.csv"
},
"workflowType": "iiif",
"collection_url": "https://your-new-collection-api-url"
}
For Additional EAD Files:
- Upload your EAD XML files to your S3 bucket
- Go to AWS Console → Step Functions
- Select your state machine: <stack-prefix>-data-pipeline
- Click "Start execution"
- Use this JSON input (replace bucket name with your Data Processing Bucket - see Finding Your S3 Buckets below):
{
"s3": {
"Bucket": "your-s3-bucket-name",
"Prefix": "path/to/new/ead/files/"
},
"workflowType": "ead"
}
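Either workflow can also be started from the CLI by saving the JSON input above to a file (input.json is an arbitrary local filename):
# Start a pipeline run with a saved JSON input
aws stepfunctions start-execution \
  --state-machine-arn <state-machine-arn> \
  --input file://input.json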
Note
To load EAD data after initially deploying with IIIF, you must grant S3 GetObject and ListObjects permissions to both the state machine and the EAD processing Lambda function.
Your deployment creates two S3 buckets:
- Config/Source Bucket: Stores your original EAD XML files (named from your config.toml - e.g., my-treetop-ead-bucket)
- Data Processing Bucket: Stores processed data and results (auto-generated name with a random suffix - e.g., my-treetop-12345678abcd)
For Step Function workflows, you need the Data Processing Bucket name:
Method 1 - CloudFormation Outputs:
- AWS Console → CloudFormation → Your stack → Outputs tab
- Look for output with key containing "bucket" or "s3"
Method 2 - S3 Console:
- AWS Console → S3 → Buckets
- Look for a bucket name starting with <stack-prefix>- and ending with random characters
- Example: my-treetop-12345678abcd (this is your Data Processing Bucket)
- Your Config/Source Bucket will match your config.toml bucket name exactly
Method 3 - AWS CLI:
# List buckets containing your stack prefix
aws s3 ls | grep my-treetop
# You'll see both buckets:
# my-treetop-ead-bucket <- Config/Source Bucket
# my-treetop-12345678abcd <- Data Processing Bucket (use this for Step Functions)
- Step Function: 1-4 hours (varies by collection size)
- Bedrock Sync: Additional 30-60 minutes after Step Function completes
- UI Access: Available once Bedrock sync shows "Ready" status
- Step Function fails: Check CloudWatch logs for specific Lambda errors
- Bedrock sync stuck: Verify S3 permissions and file formats
- UI still shows CORS errors: Bedrock sync may still be in progress
- No data in S3: Check your source data configuration (collection URL or S3 path)
Treetop Discovery uses AWS CDK (Cloud Development Kit) to define and deploy cloud infrastructure as code. The CDK application creates the following AWS resources:
Core Infrastructure:
- S3 bucket for data storage
- RDS Aurora PostgreSQL cluster for vector storage
- VPC with subnets and security groups
Data Processing:
- Step Functions state machine for data ingestion
- Lambda functions for IIIF/EAD processing
- ECS tasks for IIIF manifest fetching (IIIF workflows only)
AI/ML Services:
- Amazon Bedrock Knowledge Base
- IAM roles for Bedrock access
User Interface:
- Amazon Cognito User Pool for authentication
- API Gateway with Lambda backend
- AWS Amplify app for frontend hosting
The core of the data processing is an AWS Step Function that orchestrates the ingestion workflow. The process begins by checking the workflowType to determine whether to process IIIF or EAD data. For IIIF data, it fetches manifest URLs from a collection API, processes each manifest using a Lambda function, and stores the results in S3. For EAD data, it processes XML files from a specified S3 location. Both workflows conclude by initiating a Bedrock ingestion job to make the data available for search and retrieval.
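As an illustration only (a simplified sketch, not the deployed definition), the branching step looks roughly like this in Amazon States Language, with Pass states standing in for the real processing steps:
{
  "StartAt": "CheckWorkflowType",
  "States": {
    "CheckWorkflowType": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.workflowType", "StringEquals": "iiif", "Next": "ProcessIIIF" },
        { "Variable": "$.workflowType", "StringEquals": "ead", "Next": "ProcessEAD" }
      ]
    },
    "ProcessIIIF": { "Type": "Pass", "Next": "StartBedrockIngestion" },
    "ProcessEAD": { "Type": "Pass", "Next": "StartBedrockIngestion" },
    "StartBedrockIngestion": { "Type": "Pass", "End": true }
  }
}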
This application creates several IAM roles and policies to ensure that the different AWS services have the necessary permissions to interact with each other securely.
- Cognito User Pool: Manages user authentication.
- Chat Lambda Function:
  - Granted permissions to interact with Amazon Bedrock for invoking models (bedrock:InvokeModel), retrieving data (bedrock:Retrieve), and generating responses (bedrock:RetrieveAndGenerate).
  - The API Gateway uses a Cognito authorizer to protect the chat endpoint.
- RDS Cluster Security Group: Allows inbound traffic on port 5432 from within the VPC, enabling services like Bedrock to connect to the database.
- Database Initialization: A custom resource is granted permissions to execute SQL statements on the RDS cluster (rds-data:ExecuteStatement) and retrieve database credentials from AWS Secrets Manager (secretsmanager:GetSecretValue).
- ECS Task Role: The ECS task for fetching IIIF manifests is granted s3:PutObject permissions to write data to the S3 data bucket.
- Step Functions:
  - The Step Function orchestrating the data pipeline has permissions to invoke Lambda functions (fetch_iiif_manifest_function, process_ead_function) and run the ECS task.
  - The Lambda functions for processing IIIF and EAD data are granted read and write access to the S3 data bucket.
- Bedrock Knowledge Base Role: A role is created for the Bedrock Knowledge Base (a condensed policy sketch follows this list) with permissions to:
- Read data from the S3 bucket.
- Access the RDS database cluster for vector storage.
- Invoke the embedding model in Bedrock.
- Retrieve database credentials from AWS Secrets Manager.
- CodePipeline: The pipeline is configured with a source from GitHub and uses a secret from AWS Secrets Manager for authentication.
- Pipeline Stages: The linting, testing, and deployment steps in the pipeline have the necessary permissions to install dependencies, run commands, and deploy the CDK application.
- Amplify Build Function: A Lambda function for building the UI is granted permissions to create and start deployments in AWS Amplify.
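To make the Bedrock Knowledge Base Role concrete, here is a condensed sketch of what its policy covers (resource names are placeholders; the real policy is generated by the CDK):
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<data-bucket>", "arn:aws:s3:::<data-bucket>/*"] },
    { "Effect": "Allow", "Action": ["rds-data:ExecuteStatement"],
      "Resource": "arn:aws:rds:<region>:<account-id>:cluster:<cluster-id>" },
    { "Effect": "Allow", "Action": ["bedrock:InvokeModel"], "Resource": "<embedding-model-arn>" },
    { "Effect": "Allow", "Action": ["secretsmanager:GetSecretValue"], "Resource": "<db-secret-arn>" }
  ]
}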
If you prefer not to use uv, you can set up the environment with standard Python tools:
Ensure you are using Python 3.12:
python --version
# Python 3.12.x
Create a virtual environment:
python -m venv .venv
Activate it:
source .venv/bin/activate
Install uv within the virtual environment:
pip install uv
Install dependencies:
uv sync --all-groups
Common Issues:
- CDK synthesis fails: Restart your shell and re-activate the virtual environment
- Bedrock model access denied: Enable model access in the Bedrock console for your region
- UI shows CORS errors: Wait for data loading to complete (can take hours)
- Authentication errors with AWS SSO: Re-authenticate using your AWS SSO provider
- "Invalid bucket name" errors: Ensure S3 bucket names are lowercase and contain no underscores
- "CDK bootstrap required" errors: Run
cdk bootstrap
before deployment - "Security token invalid" errors: AWS credentials expired - refresh using SSO or
aws configure
- Deployment stuck on approval: Use
--require-approval never
flag for automated deployment
Note
This section is specific to Northwestern University developers and staging environments.
Step 1: Clone Repository
git clone https://github.com/nulib/treetop-discovery.git
cd treetop-discovery
Follow the main Quick Start steps 1-2 for installation and configuration.
Step 2: Authentication Setup (Required before Step 3)
- Go to aws.northwestern.edu
- Click the "GENERAL USE LOGIN" button
- Expand the name of the AWS staging account
- Hit the "Access keys" button
- Choose "Option 1: Set AWS environment variables"
- Copy and paste the commands into your terminal
Important
Complete the authentication setup above before running cdk synth or cdk deploy in Step 3, or you'll get authentication errors.
Configuration Notes:
For NU developers, the stack_prefix is automatically set using the DEV_PREFIX environment variable in AWS, so it can be omitted from config.toml.
The project includes a CI/CD pipeline for staging deployments:
- Pipeline Stack: OsdpPipelineStack
- Staging Stack: OsdpPipelineStack/staging/OSDP-Prototype
- GitHub Integration: Pipeline sources from GitHub with Secrets Manager authentication
Northwestern maintains the IIIF manifest fetcher Docker image in a public ECR repository:
- Registry: public.ecr.aws/nulib-staging
- Repository: osdp-iiif-fetcher
- Usage: Uses Northwestern's public container registry for IIIF processing
Building and Pushing ECR Images:
# Set AWS_PROFILE to the staging admin profile (it has permissions to push to the ECR repository)
export AWS_PROFILE=[your-staging-profile]
# Authenticate with ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/nulib-staging
# Build image
docker build -t public.ecr.aws/nulib-staging/osdp-iiif-fetcher:[tag] -f iiif/Dockerfile .
# Push image
docker push public.ecr.aws/nulib-staging/osdp-iiif-fetcher:[tag]
# Clear the profile variable after pushing
unset AWS_PROFILE
Image Development: See iiif/README.md for detailed development instructions.
Python Environment: Follow the Quick Start steps above.
Tip
VSCode Users: Set the Python interpreter via Command Palette (⇧⌘P) → Python: Select Interpreter → ./.venv/bin/python
Node.js Setup:
node --version # Should be v22.x
# Install Node.js dependencies for build function
cd osdp/functions/build_function && npm i && cd ../../../
Development Commands:
# Testing
pytest
# Linting
ruff check .
ruff check --fix .
# Formatting
ruff format .
# CDK commands (run from osdp/ directory)
cd osdp
cdk ls # List stacks
cdk synth # Generate CloudFormation
cdk deploy # Deploy stack
cdk diff # Compare local vs deployed