Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions .claude/context/state-sync-development.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# State-Sync Development Context

**When to load:** Working on state-sync scripts, Dockerfile, workspace hydration, or S3 sync logic

## Quick Reference

- **Language:** Bash (POSIX-ish, requires bash for arrays and `${var//pattern/}`)
- **Base image:** Alpine 3.21
- **Tools:** rclone (S3 sync), git (repo operations), jq (JSON parsing), sqlite3 (WAL checkpoint), bash
- **Primary files:** `components/runners/state-sync/hydrate.sh`, `components/runners/state-sync/sync.sh`
- **Spec:** [components/runners/state-sync/spec/spec.md](../../components/runners/state-sync/spec/spec.md)

## Critical Rules

### Input Sanitization

All user-provided path components MUST be stripped to `[a-zA-Z0-9-]`:

```bash
NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}"
SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}"
```

Used in S3 and filesystem paths; prevents path traversal.

### Credential Handling

**NEVER log tokens.** The git credential helper writes tokens only to stdout via git credential protocol. It does not echo them.

**ALWAYS strip credentials from persisted URLs:**

```bash
remote_url=$(echo "${remote_url}" | sed 's|://[^@]*@|://|')
```

This runs before writing `metadata.json` for git backups.

**Protect rclone config:** The config file contains S3 credentials and MUST be written with `chmod 600`.

### Error Handling

- `set -e` at script start (both scripts)
- `set +e` before git clone loops — clone failures are non-fatal
- `trap 'final_sync' SIGTERM SIGINT` in sync.sh — ensures final backup on shutdown
- Individual operation failures log warnings and continue; the scripts do not exit on non-critical errors

### Permissions

The 777 permissions on workspace directories are intentional (cross-container UID mismatch, SELinux/SCC fallback). See spec Workspace Structure > Permissions for full rationale.

### S3 Operations

- All S3 access via rclone with `--config /tmp/.config/rclone/rclone.conf`
- Sync uses `--checksum` (content-based, not timestamp-based); hydrate uses `rclone copy` without checksum
- Sync passes `--max-size ${MAX_SYNC_SIZE}` — rclone skips individual files exceeding this limit
- `--copy-links` to follow symlinks
- `--fast-list` to reduce API calls
- Hydrate uses 8 transfers (download), sync uses 4 (upload)

## Testing

No automated test suite exists. Validate changes manually:

1. Deploy to a kind cluster: `make kind-up LOCAL_IMAGES=true`
2. Create a session — verify hydrate logs show workspace creation and repo cloning
3. Wait for sync cycle — verify S3 contains expected paths (`kubectl exec` into MinIO or use `mc` CLI)
4. Delete the session pod and recreate — verify state is restored from S3
5. Test ephemeral mode — remove S3 credentials, verify hydrate succeeds without persistence

Edge cases to test:
- Private repo without credentials (should warn, not fail)
- Workflow with invalid subpath (should fall back to full repo)
- Large workspace exceeding MAX_SYNC_SIZE (should warn, sync anyway)
- SIGTERM during sync (should complete final sync before exit)

## Common Tasks

### Adding a new sync path

1. Add to `SYNC_PATHS` array in both `hydrate.sh` and `sync.sh`
2. Add `mkdir -p` and permission setup in `hydrate.sh`
3. Verify the path is not covered by an exclude pattern

### Adding a new env var

1. Add to the configuration section at the top of the script
2. Apply sanitization if the value is used in filesystem or S3 paths
3. Document in `spec/spec.md` under Inputs

### Changing the base image

1. Update `Dockerfile`
2. Verify all required packages are available (`rclone`, `git`, `jq`, `bash`, `sqlite`)
3. Test that `stat -c%s` works (GNU coreutils syntax; macOS `stat` differs)

## Key Files

- `hydrate.sh` — init container entrypoint
- `sync.sh` — sidecar entrypoint
- `Dockerfile` — container definition
- `spec/spec.md` — behavioral specification
8 changes: 8 additions & 0 deletions BOOKMARKS.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,10 @@ NextJS patterns, Shadcn UI usage, React Query data fetching, component guideline

Auth flows, RBAC enforcement, token handling, container security patterns.

### [State-Sync Development Context](.claude/context/state-sync-development.md)

Shell scripting conventions, security constraints, and testing approach for state-sync.

---

## Code Patterns
Expand Down Expand Up @@ -107,6 +111,10 @@ Operator development, watch patterns, reconciliation loop.

Python runner development, Claude Code SDK integration.

### [State-Sync Spec](components/runners/state-sync/spec/spec.md)

Behavioral specification for the session state persistence sidecar.

### [Public API README](components/public-api/README.md)

Stateless gateway design, token forwarding, input validation.
Expand Down
225 changes: 225 additions & 0 deletions components/runners/state-sync/spec/spec.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,225 @@
# State-Sync Specification

Session state persistence for the Ambient Code Platform. Ensures workspace data survives pod restarts by synchronizing workspace contents to and from S3-compatible object storage.

## Operational Modes

### Init (hydrate)

Runs as a Kubernetes init container before the runner starts. Prepares the workspace:

1. **Create workspace structure** — directories for framework state, artifacts, file uploads, and repositories
2. **Set permissions** — ownership to uid 1001 (runner user), with 777 fallbacks for cross-container access
3. **Download prior session state** — if S3 is configured and prior state exists, download framework state, artifacts, and file uploads
4. **Fetch git credentials** — retrieve GitHub/GitLab tokens from the backend API using the session's bot token
5. **Install credential helper** — a shell-based git credential helper that maps host patterns to the appropriate token (GitHub or GitLab)
6. **Clone repositories** — iterate `REPOS_JSON`, clone each repo to `/workspace/repos/{name}` on the specified branch (or default branch)
7. **Clone workflow** — if `ACTIVE_WORKFLOW_GIT_URL` is set, clone the workflow repo and optionally extract a subpath
8. **Restore git state** — if S3 contains a `repo-state/` backup, restore branches from bundles, apply uncommitted/staged patches, and verify HEAD matches expectations
9. **Final permissions** — re-apply ownership and permissions after all downloads and clones

### Sidecar (sync)

Runs alongside the runner container for the lifetime of the session pod. Periodically uploads workspace state:

1. **Wait for workspace population** — 30-second initial delay after pod start
2. **Sync loop** — every `SYNC_INTERVAL` seconds (default 60):
- Check total sync size against `MAX_SYNC_SIZE`
- Checkpoint any SQLite WAL files in the framework data directory (defensive — databases are created by the framework runtime and are opaque to state-sync)
- Upload framework state, artifacts, and file uploads to S3 via rclone
- Write sync metadata (timestamp, session info, paths synced)
3. **Periodic git backup** — every `REPO_BACKUP_INTERVAL` sync cycles (default 5), back up git repo state:
- Create bundles with all refs
- Capture uncommitted and staged changes as patches
- Write metadata (remote URL with credentials stripped, branch, HEAD SHA, local branches)
- Upload to S3 under `repo-state/`
4. **Graceful shutdown** — on SIGTERM, perform one final git backup + sync before exiting

## Inputs

### Required for persistence

| Variable | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | S3 access key |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key |

If either is missing, state-sync operates in **ephemeral mode**: hydrate creates the workspace structure but skips S3; sync sleeps indefinitely.

### Session identity

| Variable | Default | Description |
|---|---|---|
| `NAMESPACE` | `default` | Kubernetes namespace (sanitized to `[a-zA-Z0-9-]`) |
| `SESSION_NAME` | `unknown` | Session identifier (sanitized to `[a-zA-Z0-9-]`) |

### S3 configuration

| Variable | Default | Description |
|---|---|---|
| `S3_ENDPOINT` | `http://minio.ambient-code.svc:9000` | S3-compatible endpoint URL |
| `S3_BUCKET` | `ambient-sessions` | Bucket name |

### Framework configuration

| Variable | Default | Description |
|---|---|---|
| `RUNNER_STATE_DIR` | `.claude` | Relative path under `/workspace/` for framework state |

### Repository configuration

| Variable | Default | Description |
|---|---|---|
| `REPOS_JSON` | (empty) | JSON array of `{url, branch, name}` objects |

### Workflow configuration

| Variable | Default | Description |
|---|---|---|
| `ACTIVE_WORKFLOW_GIT_URL` | (empty) | Git URL of the workflow repository |
| `ACTIVE_WORKFLOW_BRANCH` | `main` | Branch to clone |
| `ACTIVE_WORKFLOW_PATH` | (empty) | Subpath within the repo to extract |

### Credential sources

| Variable | Description |
|---|---|
| `GITHUB_TOKEN` | GitHub personal access token (if pre-set, skips backend fetch) |
| `GITLAB_TOKEN` | GitLab access token (if pre-set, skips backend fetch) |
| `BACKEND_API_URL` | Backend API base URL for credential fetch |
| `BOT_TOKEN` | Authentication token for backend API calls |
| `PROJECT_NAME` | Project name for credential endpoint path |

### Sync tuning (sidecar only)

| Variable | Default | Description |
|---|---|---|
| `SYNC_INTERVAL` | `60` | Seconds between sync cycles |
| `MAX_SYNC_SIZE` | `1073741824` | Maximum total sync size in bytes (1 GB) |
| `REPO_BACKUP_INTERVAL` | `5` | Back up git repos every Nth sync cycle |

## Workspace Structure

Hydration produces:

```
/workspace/
{RUNNER_STATE_DIR}/ # Framework state (e.g., .claude/)
debug/ # Debug logs (created only when RUNNER_STATE_DIR is ".claude"; excluded from sync regardless)
artifacts/ # Output files created by the agent
file-uploads/ # User-uploaded files
repos/
{repo-name}/ # Cloned repositories
workflows/
{workflow-name}/ # Cloned workflow (or extracted subpath)
```

### Permissions

The runner container runs as uid 1001 (non-root). The init container runs as root.

| Path | Permissions | Rationale |
|---|---|---|
| `{RUNNER_STATE_DIR}/` | 777 | Framework SDK requires write access; group-based permissions don't work across containers with different UIDs |
| `artifacts/` | 755 | Runner user owns, standard access |
| `file-uploads/` | 777 | Content sidecar (uid 1001) must write; init container (root) creates |
| `repos/` | 777 | Runtime repo additions via `clone_repo_at_runtime`; containers may not share groups |

Ownership is set to `1001:0` via `chown` first. The 777 fallback handles environments where `chown` fails (SELinux, OpenShift SCCs with forced fsGroup).

## S3 Storage Layout

```
s3://{bucket}/{namespace}/{session_name}/
{RUNNER_STATE_DIR}/ # Framework state files
artifacts/ # Agent output files
file-uploads/ # User-uploaded files
repo-state/
{repo-name}/
repo.bundle # Git bundle with all refs
uncommitted.patch # Uncommitted tracked changes
staged.patch # Staged changes
metadata.json # Remote URL, branch, HEAD SHA, local branches, timestamp
metadata.json # Sync metadata (last sync time, session info, paths synced)
```

### Sync exclusions

The following patterns are excluded from S3 sync:

- `repos/**` — git handles this separately via bundles
- `node_modules/**`, `.venv/**`, `__pycache__/**`, `*.pyc` — dependency artifacts
- `.cache/**`, `target/**`, `dist/**`, `build/**` — build artifacts
- `.git/**` — git internals (bundled separately)
- `debug/**` — debug logs with symlinks that break rclone

## Behavioral Invariants

1. **Repo clone failures are non-fatal.** Individual repository clone failures MUST log a warning and continue. Other repos and the rest of workspace initialization MUST proceed.

2. **S3 unavailability does not block workspace creation.** If S3 credentials are missing or the endpoint is unreachable, hydration MUST create the workspace structure and exit successfully. The session operates in ephemeral mode.

3. **Credentials never appear in logs or persisted metadata.** The git credential helper writes tokens only to stdout in git credential protocol format. `backup_git_repos` strips embedded credentials from remote URLs before writing `metadata.json` (via `sed 's|://[^@]*@|://|'`).

4. **Final sync on shutdown.** The sidecar MUST trap SIGTERM and perform a complete git backup + workspace sync before exiting. This is the primary mechanism for preserving uncommitted work.

5. **SQLite WAL checkpoint before sync.** Before uploading framework state, all `.db` files MUST be checkpointed (`PRAGMA wal_checkpoint(TRUNCATE)`) to ensure consistent backups. The `.db` files are created by the framework runtime (e.g., Claude Code CLI) and their contents are opaque to state-sync.

6. **Sync size enforcement.** Total sync size MUST be checked against `MAX_SYNC_SIZE` before each cycle. If exceeded, a warning is logged but sync proceeds. Additionally, rclone enforces `--max-size` per-file — individual files exceeding `MAX_SYNC_SIZE` are silently skipped by rclone.

7. **Input sanitization.** `NAMESPACE` and `SESSION_NAME` MUST be stripped to `[a-zA-Z0-9-]` to prevent path traversal in both S3 paths and local filesystem paths.

8. **Rclone config protection.** The rclone configuration file (which contains S3 credentials) MUST be written with mode 600.

## Failure Modes

| Scenario | Behavior |
|---|---|
| S3 not configured (missing credentials) | Hydrate: creates workspace, exits 0. Sync: sleeps forever (keeps sidecar alive). |
| S3 unreachable | Hydrate: workspace created without prior state, exits 0. Sync: logs error, retries next interval. |
| Repo clone fails (auth, network, etc.) | Warning logged, other repos continue. |
| Workflow clone fails | Warning logged, no workflow available. |
| Workflow subpath not found | Warning logged, falls back to entire cloned repo. |
| Git bundle fetch fails during restore | Warning logged, repo stays at freshly-cloned state. |
| Patch apply fails during restore | Warning logged (likely merge conflicts), repo stays at bundle state. |
| HEAD SHA mismatch after restore | Warning logged (diverged state), no corrective action taken. |
| Sync size exceeds MAX_SYNC_SIZE | Warning logged, sync proceeds anyway. |

## Interfaces

### Operator

The Kubernetes operator configures state-sync by setting environment variables on the init container and sidecar container specs. The operator controls:
- Session identity (`NAMESPACE`, `SESSION_NAME`)
- S3 credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- Repository configuration (`REPOS_JSON`)
- Workflow configuration (`ACTIVE_WORKFLOW_GIT_URL`, `ACTIVE_WORKFLOW_BRANCH`, `ACTIVE_WORKFLOW_PATH`)
- Framework selection (`RUNNER_STATE_DIR`)
- Backend API access (`BACKEND_API_URL`, `BOT_TOKEN`, `PROJECT_NAME`)

### Runner container

Reads the `/workspace/` directory structure created by hydration. Expects:
- Repos cloned to `/workspace/repos/{name}`
- Framework state directory at `/workspace/{RUNNER_STATE_DIR}`
- Artifacts directory at `/workspace/artifacts`
- File uploads at `/workspace/file-uploads`

### S3 / MinIO

All S3 operations use rclone. Configuration:
- Provider type: `Other` (S3-compatible), ACL: `private`
- Sync (upload) uses `--checksum` for content-based comparison; hydrate (download) uses `rclone copy` without checksum
- Transfers: 8 (hydrate download), 4 (sync upload)
- `--fast-list` and `--copy-links` enabled

### Backend API

The init container fetches git credentials from `{BACKEND_API_URL}/projects/{PROJECT_NAME}/agentic-sessions/{SESSION_NAME}/credentials/{provider}` using `BOT_TOKEN` for authentication. Providers: `github`, `gitlab`. Tokens are only fetched if not already present in the environment.

## Container

- **Base image:** Alpine 3.21
- **Installed packages:** rclone, git, jq, bash, sqlite
- **Entrypoint:** `/usr/local/bin/sync.sh` (sidecar mode)
- **Init container usage:** overrides entrypoint to `/usr/local/bin/hydrate.sh`