Resurrect snapshot container #403

Merged
merged 2 commits into from
Feb 15, 2024
2 changes: 1 addition & 1 deletion .github/workflows/hadolint-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,4 @@ jobs:
dockerfile: "Dockerfile*"
recursive: true
# don't pin versions in dependencies
ignore: DL3028,DL3008
ignore: DL3028,DL3008,DL3018
57 changes: 57 additions & 0 deletions .github/workflows/snapshot-service-image.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
name: Snapshot Service Image

# Cancel workflow if there is a new change to the branch.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

on:
  push:
    branches: [main]
  merge_group:
  pull_request:
    branches: [main]

jobs:
  build-and-push-docker-image:
    name: Build images and push to GHCR
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: List cached docker images
        run: docker image ls

      - name: Checkout code
        uses: actions/checkout@v4

      - name: Log in to GitHub Packages
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # This step yields the following tags:
      # - date+sha, e.g. 2023-01-19-da4692d,
      # - latest (default branch only).
      - name: Docker Meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/chainsafe/forest-snapshot-service
          tags: |
            type=raw,value={{date 'YYYY-MM-DD'}}-{{sha}}
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build image and push to GitHub Container Registry
        uses: docker/build-push-action@v5
        with:
          context: ./images/snapshot-service/
          build-contexts: |
            common=./tf-managed/scripts/
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          push: ${{ github.ref == 'refs/heads/main' }}

      - name: List docker images
        run: docker image ls
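
The `Docker Meta` step above derives a `date+sha` tag. A minimal shell sketch of the same naming scheme, for reproducing a tag locally; `image_tag` is a hypothetical helper, and both the UTC date and the 7-character short SHA are assumptions about how the action formats its output:

```bash
#!/bin/bash
# Hypothetical helper: build a tag of the form YYYY-MM-DD-<short sha>.
# Assumes the date is taken in UTC and the SHA is shortened to 7 characters.
image_tag() {
  local sha="$1"
  printf '%s-%s\n' "$(date -u +%Y-%m-%d)" "${sha:0:7}"
}

# Example: derive a tag from a full commit SHA.
image_tag "da4692d0c1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6"
```

In a checkout, `image_tag "$(git rev-parse HEAD)"` would give the tag for the current commit under the same assumptions.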
1 change: 1 addition & 0 deletions images/snapshot-service/.dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.md
26 changes: 26 additions & 0 deletions images/snapshot-service/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Snapshot service Dockerfile.
# It produces a single snapshot of the given chain in the Filecoin network and uploads it to S3-compatible storage
# (preferably Cloudflare R2; other providers should work but have not been tested).
FROM docker:24
LABEL org.opencontainers.image.description="Forest snapshot service generator and uploader for Filecoin"

RUN apk add --no-cache \
      ruby \
      ruby-dev \
      docker \
      bash && \
    gem install \
      docker-api \
      slack-ruby-client \
      activesupport

COPY ./src /opt/snapshot-service

# `common` is defined via the `--build-context` flag in the `docker build` command, e.g.,
# `docker build --build-context common=../../tf-managed/scripts/ -t ghcr.io/chainsafe/forest-snapshot-service:latest .`
# TODO: Change this once `sync-check` is fully-dockerized as well.
# hadolint ignore=DL3022
COPY --from=common ruby_common /opt/snapshot-service/ruby_common

WORKDIR /opt/snapshot-service

CMD ["bash", "run.sh"]
38 changes: 38 additions & 0 deletions images/snapshot-service/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Forest snapshot service

This service generates Filecoin snapshots and uploads them to S3-compatible storage. Supported networks are [calibnet](https://docs.filecoin.io/networks/calibration) and [mainnet](https://docs.filecoin.io/networks/mainnet). Any S3-compatible provider should work, though the service has been tested only on Cloudflare R2.

## Building the image

```bash
docker build --build-context common=../../tf-managed/scripts/ -t <name>:<tag> .
```

## Running the Forest snapshot service

The container needs additional privileges and access to the Docker socket so that it can issue other `docker` commands.

This command generates a snapshot for the given network and uploads it to an S3 bucket.

```bash
docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --rm --env-file <variable-file> --env NETWORK_CHAIN=<chain> ghcr.io/chainsafe/forest-snapshot-service:latest
```

## Variables (all required)

```bash
# Details for the snapshot upload
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_ENDPOINT=
SNAPSHOT_BUCKET=

# Details for the Slack notifications
SLACK_API_TOKEN=
SLACK_NOTIFICATION_CHANNEL=

# Network chain - can be either `mainnet` or `calibnet`
NETWORK_CHAIN=
# Forest tag to use. `latest` is the newest stable version.
# See [Forest packages](https://github.com/ChainSafe/forest/pkgs/container/forest) for more.
FOREST_TAG=
```
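
Because every variable is required, it can help to check the variable file before starting the container. A minimal sketch, assuming a plain `KEY=value` file like the one above; `missing_vars` is a hypothetical helper and not part of the service. `NETWORK_CHAIN` is left out of the check on the assumption that it is passed with `--env`, as in the run command.

```bash
#!/bin/bash
# Hypothetical helper: print the names of required variables that are
# absent from the given env file or present with an empty value.
missing_vars() {
  local env_file="$1" name
  for name in R2_ACCESS_KEY R2_SECRET_KEY R2_ENDPOINT SNAPSHOT_BUCKET \
              SLACK_API_TOKEN SLACK_NOTIFICATION_CHANNEL FOREST_TAG; do
    # A variable counts as set only if its line carries a non-empty value.
    grep -Eq "^${name}=.+" "$env_file" || echo "$name"
  done
}
```

For example, `missing_vars .forest_env` prints nothing when the file is complete, so an empty result can gate the `docker run` call.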
Original file line number Diff line number Diff line change
Expand Up @@ -8,24 +8,16 @@
require 'logger'
require 'fileutils'

BASE_FOLDER = get_and_assert_env_variable 'BASE_FOLDER'
SLACK_TOKEN = get_and_assert_env_variable 'SLACK_API_TOKEN'
CHANNEL = get_and_assert_env_variable 'SLACK_NOTIF_CHANNEL'

# Prune logs files(txt) older than 2 weeks
def prune_logs(logs_folder = 'logs')
cutoff_date = Date.today - 14 # set the cutoff date to 14 days ago

Dir.glob("#{logs_folder}/*.txt").each do |file|
File.delete(file) if File.file?(file) && File.mtime(file).to_date < cutoff_date
end
end
CHANNEL = get_and_assert_env_variable 'SLACK_NOTIFICATION_CHANNEL'

CHAIN_NAME = ARGV[0]
raise 'No chain name supplied. Please provide chain identifier, e.g. calibnet or mainnet' if ARGV.empty?

# Current datetime, to append to the log files
DATE = Time.new.strftime '%FT%H:%M:%S'

FileUtils.mkdir_p 'logs'
LOG_EXPORT_SCRIPT_RUN = "logs/#{CHAIN_NAME}_#{DATE}_script_run.txt"
LOG_EXPORT_DAEMON = "logs/#{CHAIN_NAME}_#{DATE}_daemon.txt"
LOG_EXPORT_METRICS = "logs/#{CHAIN_NAME}_#{DATE}_metrics.txt"
Expand All @@ -46,7 +38,7 @@ def prune_logs(logs_folder = 'logs')

upload_cmd = <<~CMD.chomp
set -o pipefail && \
timeout --signal=KILL 8h ./upload_snapshot.sh #{CHAIN_NAME} #{LOG_EXPORT_DAEMON} #{LOG_EXPORT_METRICS} | \
timeout -s SIGKILL 8h ./upload_snapshot.sh #{CHAIN_NAME} #{LOG_EXPORT_DAEMON} #{LOG_EXPORT_METRICS} | \
#{add_timestamps_cmd}
CMD

Expand All @@ -71,6 +63,3 @@ def prune_logs(logs_folder = 'logs')
[LOG_EXPORT_SCRIPT_RUN, LOG_EXPORT_DAEMON, LOG_EXPORT_METRICS].each do |log_file|
logger.info "Snapshot export log:\n#{File.read(log_file)}\n\n" if File.exist?(log_file)
end

# Prune logs files(txt) in the logs directory older than 2 weeks
prune_logs
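
The hunk above replaces `timeout --signal=KILL` with `timeout -s SIGKILL`. The short form is presumably needed because the script now runs in an Alpine-based image, where `timeout` is likely provided by BusyBox rather than GNU coreutils; that motivation is an assumption, not something the diff states. A small sketch of how the short form behaves when the time limit is hit:

```bash
#!/bin/bash
# Run a command that would outlast its one-second limit.
# `timeout` sends SIGKILL once the limit expires, and the call fails.
status=0
timeout -s SIGKILL 1 sleep 5 || status=$?

# A non-zero status signals that the command did not finish in time.
echo "exit status: $status"
```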
15 changes: 15 additions & 0 deletions images/snapshot-service/src/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#!/bin/bash

set -euo pipefail

# Assert that all required environment variables are set
: "${R2_ACCESS_KEY:?}"
: "${R2_SECRET_KEY:?}"
: "${R2_ENDPOINT:?}"
: "${SNAPSHOT_BUCKET:?}"
: "${SLACK_API_TOKEN:?}"
: "${SLACK_NOTIFICATION_CHANNEL:?}"
: "${NETWORK_CHAIN:?}"
: "${FOREST_TAG:?}"

ruby daily_snapshot.rb "$NETWORK_CHAIN"
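
The `: "${VAR:?}"` lines in `run.sh` rely on shell parameter expansion: if the variable is unset or empty, the expansion fails and a non-interactive shell exits. A minimal sketch of that behavior, using a hypothetical `require_var` helper that runs the check in a subshell so the caller survives a failure:

```bash
#!/bin/bash
# Hypothetical helper: succeed only if the named variable is set and non-empty.
# `${!1:?}` applies the `:?` check to the variable whose name is in $1 (bash
# indirect expansion); the subshell absorbs the exit that a failed check causes.
require_var() {
  ( : "${!1:?}" ) 2>/dev/null
}

NETWORK_CHAIN="calibnet"
unset FOREST_TAG

require_var NETWORK_CHAIN && echo "NETWORK_CHAIN is set"
require_var FOREST_TAG || echo "FOREST_TAG is missing"
```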
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,9 @@ timeout "$SYNC_TIMEOUT" forest-cli sync wait
forest-cli snapshot export -o forest_db/
forest-cli --token=\$(cat token.txt) shutdown --force

# Snapshot is exported, remove the Forest DB to limit space usage.
forest-tool db destroy --force --config config.toml --chain "$CHAIN_NAME"

# Run full checks only for calibnet, given that it takes too long for mainnet.
if [ "$CHAIN_NAME" = "calibnet" ]; then
timeout 30m forest-tool snapshot validate --check-network "$CHAIN_NAME" forest_db/forest_snapshot_*.forest.car.zst
Expand All @@ -83,7 +86,6 @@ else
timeout 30m forest-tool snapshot validate --check-links 0 --check-network "$CHAIN_NAME" --check-stateroots 5 forest_db/forest_snapshot_*.forest.car.zst
fi


# Kill the metrics writer process
kill %1

Expand All @@ -95,25 +97,31 @@ CONTAINER_NAME="forest-snapshot-upload-node-$CHAIN_NAME"
docker stop "$CONTAINER_NAME" || true
docker rm --force "$CONTAINER_NAME"

CHAIN_DB_DIR="$BASE_FOLDER/forest_db/$CHAIN_NAME"
CHAIN_LOGS_DIR="$BASE_FOLDER/logs"
CHAIN_DB_DIR="/opt/forest_db/$CHAIN_NAME"
CHAIN_LOGS_DIR="/opt/logs/$CHAIN_NAME"
mkdir -p "$CHAIN_DB_DIR"
mkdir -p "$CHAIN_LOGS_DIR"

# Delete any existing snapshot files. It may be that the previous run failed
# before deleting those.
rm "$CHAIN_DB_DIR/forest_snapshot_$CHAIN_NAME"*
# Clean up volumes from the previous run, if any.
DB_VOLUME="${CHAIN_NAME}_db"
LOG_VOLUME="${CHAIN_NAME}_logs"
docker volume rm "${DB_VOLUME}" || true
docker volume rm "${LOG_VOLUME}" || true

# Run forest and generate a snapshot in forest_db/
# Run forest and generate a snapshot in the `DB_VOLUME` volume.
docker run \
--name "$CONTAINER_NAME" \
--rm \
--user root \
-v "$CHAIN_DB_DIR:/home/forest/forest_db":z \
-v "$CHAIN_LOGS_DIR:/home/forest/logs":z \
-v "${DB_VOLUME}:/home/forest/forest_db" \
-v "${LOG_VOLUME}:/home/forest/logs" \
--entrypoint /bin/bash \
ghcr.io/chainsafe/forest:"${FOREST_TAG}" \
-c "$COMMANDS" || exit 1

aws --endpoint "$R2_ENDPOINT" s3 cp --no-progress "$CHAIN_DB_DIR/forest_snapshot_$CHAIN_NAME"*.forest.car.zst s3://"$SNAPSHOT_BUCKET"/"$CHAIN_NAME"/latest/ || exit 1
# Mount the snapshot volume and copy the snapshot to the S3 bucket.
docker run -v "${DB_VOLUME}":/opt/snapshots --rm --entrypoint /bin/bash --env AWS_ACCESS_KEY_ID="$R2_ACCESS_KEY" --env AWS_SECRET_ACCESS_KEY="$R2_SECRET_KEY" \
public.ecr.aws/aws-cli/aws-cli:2.15.18 \
-c "aws configure set default.s3.multipart_chunksize 4GB && aws --endpoint ${R2_ENDPOINT} s3 cp --no-progress /opt/snapshots/forest_snapshot_${CHAIN_NAME}*.forest.car.zst s3://${SNAPSHOT_BUCKET}/${CHAIN_NAME}/latest/" || exit 1

# Delete snapshot files
rm "$CHAIN_DB_DIR/forest_snapshot_$CHAIN_NAME"*
docker volume rm "${DB_VOLUME}" || true
38 changes: 18 additions & 20 deletions tf-managed/modules/daily-snapshot/main.tf
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

// Ugly hack because 'archive_file' cannot mix files and folders.
data "external" "sources_tar" {
program = ["bash", "${path.module}/prep_sources.sh", path.module, var.common_resources_dir]
program = ["bash", "${path.module}/prep_sources.sh", path.module]
}


Expand All @@ -30,32 +30,30 @@ data "digitalocean_ssh_keys" "keys" {
}
}

# Set required environment variables
# Required environment variables for the snapshot service itself.
locals {
env_content = templatefile("${path.module}/service/forest-env.tpl", {
R2_ACCESS_KEY = var.R2_ACCESS_KEY,
R2_SECRET_KEY = var.R2_SECRET_KEY,
r2_endpoint = var.r2_endpoint,
slack_token = var.slack_token,
slack_channel = var.slack_channel,
snapshot_bucket = var.snapshot_bucket,
snapshot_endpoint = var.snapshot_endpoint,
NEW_RELIC_API_KEY = var.new_relic_api_key,
NEW_RELIC_ACCOUNT_ID = var.new_relic_account_id,
NEW_RELIC_REGION = var.new_relic_region,
BASE_FOLDER = "/root",
forest_tag = var.forest_tag
})
env_content = <<-EOT
R2_ACCESS_KEY=${var.R2_ACCESS_KEY}
R2_SECRET_KEY=${var.R2_SECRET_KEY}
R2_ENDPOINT=${var.r2_endpoint}
SNAPSHOT_BUCKET=${var.snapshot_bucket}
SLACK_API_TOKEN=${var.slack_token}
SLACK_NOTIFICATION_CHANNEL=${var.slack_channel}
FOREST_TAG=${var.forest_tag}
EOT
}

locals {
init_commands = ["cd /root/",
"tar xf sources.tar",
# Set required environment variables
"echo '${local.env_content}' >> /root/.forest_env",
"echo '. ~/.forest_env' >> .bashrc",
". ~/.forest_env",
"nohup sh ./init.sh > init_log.txt &",
<<-EOT
export NEW_RELIC_API_KEY=${var.new_relic_api_key}
export NEW_RELIC_ACCOUNT_ID=${var.new_relic_account_id}
export NEW_RELIC_REGION=${var.new_relic_region}
nohup sh ./init.sh > init_log.txt &
EOT
,
# Exiting without a sleep sometimes kills the script :-/
"sleep 60s"
]
6 changes: 1 addition & 5 deletions tf-managed/modules/daily-snapshot/prep_sources.sh
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,8 @@
# Enable strict error handling and command tracing
set -euxo pipefail

# Copy local source files in a folder together with ruby_common and create a zip archive.

# Copy local source files in a folder, and create a tar archive.
cd "$1"
cp --archive "$2"/ruby_common service/

rm -f sources.tar
(cd service && tar cf ../sources.tar --sort=name --mtime='UTC 2019-01-01' ./* > /dev/null 2>&1)
rm -fr service/ruby_common
echo "{ \"path\": \"$1/sources.tar\" }"
5 changes: 2 additions & 3 deletions tf-managed/modules/daily-snapshot/service/calibnet_cron_job
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
#!/bin/bash

# shellcheck source=/dev/null
source ~/.forest_env
cd "$BASE_FOLDER" || exit
flock -n /tmp/calibnet.lock -c "ruby daily_snapshot.rb calibnet >> logs/calibnet_log.txt 2>&1"
cd "$HOME" || exit
flock -n /tmp/calibnet.lock -c "docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --rm --env-file .forest_env -e NETWORK_CHAIN=calibnet ghcr.io/chainsafe/forest-snapshot-service:latest >> calibnet_log.txt 2>&1"
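
The cron job wraps the container run in `flock -n`, so a run that starts while the previous one still holds the lock is skipped rather than queued. A minimal sketch of that behavior with a temporary lock file and `sleep` standing in for the snapshot job (both are stand-ins chosen for illustration):

```bash
#!/bin/bash
lock_file="$(mktemp)"

# First job: take the lock and hold it for two seconds in the background.
flock -n "$lock_file" -c "sleep 2" &
sleep 0.5  # give the background job time to acquire the lock

# Second job: with -n, flock fails immediately instead of waiting.
if flock -n "$lock_file" -c "true"; then
  second_job="ran"
else
  second_job="skipped"
fi

wait  # let the first job finish and release the lock
rm -f "$lock_file"
echo "second job: $second_job"
```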
11 changes: 0 additions & 11 deletions tf-managed/modules/daily-snapshot/service/forest-env.tpl

This file was deleted.

19 changes: 1 addition & 18 deletions tf-managed/modules/daily-snapshot/service/init.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,26 +10,9 @@ export DEBIAN_FRONTEND=noninteractive

# Using timeout to ensure the script retries if the APT servers are temporarily unavailable.
timeout 10m bash -c 'until apt-get -qqq --yes update && \
apt-get -qqq --yes install ruby ruby-dev anacron awscli; do sleep 10; \
apt-get -qqq --yes install anacron ; do sleep 10; \
done'

# Install the gems
gem install docker-api slack-ruby-client
gem install activesupport -v 7.0.8

# 1. Configure aws
# 2. Create forest_db directory
# 3. Copy scripts to /etc/cron.hourly

## Configure aws
aws configure set default.s3.multipart_chunksize 4GB
aws configure set aws_access_key_id "$R2_ACCESS_KEY"
aws configure set aws_secret_access_key "$R2_SECRET_KEY"

## Create forest data directory
mkdir forest_db logs
chmod 777 forest_db logs

# Run new_relic and fail2ban scripts
bash newrelic_fail2ban.sh

5 changes: 2 additions & 3 deletions tf-managed/modules/daily-snapshot/service/mainnet_cron_job
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
#!/bin/bash

# shellcheck source=/dev/null
source ~/.forest_env
cd "$BASE_FOLDER" || exit
flock -n /tmp/mainnet.lock -c "ruby daily_snapshot.rb mainnet > mainnet_log.txt 2>&1" || exit
cd "$HOME" || exit
flock -n /tmp/mainnet.lock -c "docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --rm --env-file .forest_env -e NETWORK_CHAIN=mainnet ghcr.io/chainsafe/forest-snapshot-service:latest >> mainnet_log.txt 2>&1"