Compare commits

52 Commits

Author SHA1 Message Date
Timmy 7db0b78e88 Bump version to 0.11.1 2026-05-15 23:38:09 +01:00
dave 979492449e huskies: merge 1105 bug Freeze from Backlog stores wrong resume_to — Unfreeze restores to Coding instead of Backlog 2026-05-15 22:33:54 +00:00
Timmy 6fbe239313 fix(1102): require non-empty origin.id on create_* MCP tools
bug 1102 was created today with origin={kind:user, id:""} because
build_origin silently defaulted id to empty when the caller didn't pass
one — we couldn't tell who filed it. Bug 1088's origin field is useless
as audit if every caller can omit themselves.

Changes:
- build_origin (server/src/http/mcp/story_tools/mod.rs) now returns
  Result<String, String> and rejects missing/empty/whitespace-only id
  with an instructional error pointing at bug 1102 / story 1104.
- 5 create_* tool handlers (bug, spike, refactor, epic, story) now
  resolve origin BEFORE create_*_file so an attribution-less call
  leaves no half-state behind.
- 5 tool input schemas now advertise origin as a required object via
  a shared origin_schema() helper. The schema description gives every
  caller (coder agent, chat bot, user, system) a concrete example so
  the LLM populates the field correctly on first sight.
- Test fixtures pass origin = {kind:"test", id:"test-suite"}.

Story 1104 (signed actions) is the longer-term replacement; this is the
quick attribution win agreed for master ahead of that design work.
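
The validation rule described in the Changes list can be sketched roughly as below. This is an illustration only, not the code in server/src/http/mcp/story_tools/mod.rs: the `kind`/`id` parameter shape, the function name `build_origin_sketch`, the hand-rolled JSON formatting, and the error wording are all assumptions. Only the `Result<String, String>` return type and the rejection of missing, empty, and whitespace-only ids come from the commit message.

```rust
/// Hypothetical sketch of an origin builder that refuses anonymous callers.
/// The real `build_origin` signature is not shown in this diff; the
/// `kind` / `id` parameters are assumptions made for illustration.
fn build_origin_sketch(kind: &str, id: Option<&str>) -> Result<String, String> {
    // Treat a missing id, an empty id, and a whitespace-only id identically.
    let id = id
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .ok_or_else(|| {
            "origin.id is required and must be non-empty (see bug 1102 / story 1104)"
                .to_string()
        })?;
    // Serialise by hand to keep the sketch dependency-free. `{:?}` escapes
    // quotes and backslashes, which is enough for simple ASCII ids.
    Ok(format!("{{\"kind\":{:?},\"id\":{:?}}}", kind, id))
}
```

Because the check runs before any file is written, a caller that omits itself gets an error and no half-created item, which is the ordering the handlers adopt.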

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:54 +01:00
Timmy 26527e7dae diag(1101): log classify verdict + matched trigger on merge gate failures
Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.
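
A rough sketch of the "matched trigger plus surrounding context" extraction follows. The helper name, the trigger list, and the choice to count the window as `radius` characters on each side of the match are assumptions; the commit only states that about 90 characters of context are logged. The real classify() internals are not shown in this diff.

```rust
/// Hypothetical helper: find the first trigger (in list order) that occurs in
/// `output` and return it together with a window of surrounding text.
/// `radius` is counted in characters on each side of the match (an assumption).
fn matched_trigger_with_context<'a>(
    output: &str,
    triggers: &[&'a str],
    radius: usize,
) -> Option<(&'a str, String)> {
    for &trigger in triggers {
        if let Some(start) = output.find(trigger) {
            let end = start + trigger.len();
            // Walk outwards by whole characters so the slice never lands
            // inside a multi-byte UTF-8 sequence.
            let ctx_start = output[..start]
                .char_indices()
                .rev()
                .take(radius)
                .last()
                .map(|(i, _)| i)
                .unwrap_or(start);
            let ctx_end = output[end..]
                .char_indices()
                .nth(radius)
                .map(|(i, _)| end + i)
                .unwrap_or(output.len());
            return Some((trigger, output[ctx_start..ctx_end].to_string()));
        }
    }
    None
}
```

Logging the returned pair next to the typed verdict would show which substring steered the classifier, which is the diagnostic this commit is after.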

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:38 +01:00
dave 04a57e92c2 huskies: merge 1103 bug Rate-limit warning at session start sticks the rate_limit_exit flag, causing 1053's fast-path bypass to skip completion on clean session exits 2026-05-15 21:02:37 +00:00
dave d59efa0b5c huskies: regen source-map.json 2026-05-15 20:24:31 +00:00
dave 4216ced493 huskies: merge 1100 bug Multiple LLM agents can run concurrently on the same story (coder + mergemaster + others) — enforce one-agent-per-story invariant 2026-05-15 20:24:31 +00:00
dave 9f4f493486 huskies: regen source-map.json 2026-05-15 19:05:56 +00:00
dave 63d86f1263 huskies: merge 1096 bug Shadow drift: set_agent writes CRDT agent register without updating pipeline_items.agent 2026-05-15 19:05:56 +00:00
dave 398a5806e7 huskies: regen source-map.json 2026-05-15 18:25:25 +00:00
dave 1adc734801 huskies: merge 1098 bug Shadow drift: set_retry_count / bump_retry_count write CRDT register without updating pipeline_items.retry_count 2026-05-15 18:25:25 +00:00
dave 0ae6dfd565 huskies: regen source-map.json 2026-05-15 12:40:17 +00:00
dave 8531bac6cd huskies: merge 1097 bug Shadow drift: set_depends_on writes CRDT depends_on register without updating pipeline_items.depends_on 2026-05-15 12:40:17 +00:00
dave ce13c00ebd huskies: regen source-map.json 2026-05-15 12:27:48 +00:00
dave 2857c3b46b huskies: merge 1094 bug delete_story leaks zombie rows in pipeline_items shadow table — 176 tombstoned items still report non-terminal stages 2026-05-15 12:27:48 +00:00
dave d944885ce9 huskies: regen source-map.json 2026-05-15 12:10:11 +00:00
dave 62d1535e76 huskies: merge 1095 bug Shadow drift: set_name writes CRDT name register without updating pipeline_items.name 2026-05-15 12:10:11 +00:00
dave 46556d308a huskies: regen source-map.json 2026-05-15 12:03:09 +00:00
dave fc5481dbe4 huskies: merge 1093 bug Chat dispatcher spawns one Timmy per inbound message — needs coalesce window + per-session serial lock 2026-05-15 12:03:09 +00:00
dave 01e60a670c huskies: merge 1091 refactor Migrate the merge-gate's stale-cargo kill path to process_kill 2026-05-15 11:50:03 +00:00
dave c4010854a5 huskies: merge 1089 bug Stuck-agent detector blocks stories on legitimate exploration / debugging — uses too narrow a "progress" signal 2026-05-15 11:40:44 +00:00
dave fb1311cdae huskies: regen source-map.json 2026-05-15 11:16:16 +00:00
dave 4aa76ce673 huskies: merge 1090 refactor Migrate AgentPool::kill_all_children and kill_child_for_key to process_kill so server shutdown and stop_agent actually kill claude 2026-05-15 11:16:16 +00:00
Timmy fb82bd7bca test(tick_loop): de-flake reconcile_never_floods_broadcast_channel
The test asserted msg_count == 0 on a process-global broadcast channel
(TRANSITION_TX is a single OnceLock<Sender> shared across the test
binary), so any concurrent test calling apply_transition could land
events in our receiver between the drain and the post-reconcile check.
Observed failure: 3 stray transitions from parallel tests.

Drop the strict count check. The real "never floods" invariant is
captured by the Lagged check alone: 1000 seeded items must not overflow
the 256-slot channel, which can only hold if the reconcile path
bypasses the broadcast (AC4).  The sibling test
`reconcile_pass_scales_to_1000_items_without_lagged_divergence` already
uses this Lagged-only pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 11:13:31 +01:00
Timmy b7df5cbe4e fix(agents): kill-then-status reorder in stop_agent
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.

Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
   until verified gone — start_agent's Running/Pending whitelist
   continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
   drop the child_killers entry.

Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.
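
The three-step ordering above can be sketched as follows. This is a simplified model, not the real stop_agent: `AgentRecord`, `AgentStatus`, and the injected `kill_tree` closure (standing in for the process_kill call, returning surviving pids) are hypothetical, and returning an error while leaving the status untouched when pids survive is an assumption. The SIGHUP fallback for a missing worktree is omitted.

```rust
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum AgentStatus {
    Running,
    Failed,
}

struct AgentRecord {
    status: AgentStatus,
    worktree: Option<String>,
}

/// Hypothetical kill-then-status ordering. `kill_tree` returns the pids that
/// survived the kill (an empty vector means every pid is verified gone).
fn stop_agent_sketch(
    record: &RwLock<AgentRecord>,
    kill_tree: impl FnOnce(&str) -> Vec<u32>,
) -> Result<(), String> {
    // Step 1: read the worktree path under a read lock; nothing is mutated.
    let worktree = record.read().unwrap().worktree.clone();

    // Step 2: kill and verify while the record still says Running, so a
    // status-based duplicate-spawn check keeps rejecting new agents.
    if let Some(path) = worktree.as_deref() {
        let survivors = kill_tree(path);
        if !survivors.is_empty() {
            return Err(format!("pids still alive: {:?}", survivors));
        }
    }

    // Step 3: only now flip the status.
    record.write().unwrap().status = AgentStatus::Failed;
    Ok(())
}
```

The point of the ordering is that the record never claims the agent is gone while its process may still be running.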

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:46:02 +01:00
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.
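
A minimal sketch of the kill-and-verify idea behind `sigkill_pids_and_verify`, under stated assumptions: the function name `kill_and_verify`, the injected `send_kill` and `is_alive` closures, and the 10 ms poll interval are hypothetical and chosen to keep the example portable. The real module presumably sends SIGKILL and probes liveness directly; its signature is not shown here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical kill-and-verify loop. `send_kill` and `is_alive` are injected
/// so the sketch has no platform dependencies.
fn kill_and_verify(
    pids: &[u32],
    mut send_kill: impl FnMut(u32),
    mut is_alive: impl FnMut(u32) -> bool,
    timeout: Duration,
) -> Vec<u32> {
    // Signal every pid first, then poll until all are gone or time runs out.
    for &pid in pids {
        send_kill(pid);
    }
    let deadline = Instant::now() + timeout;
    let mut remaining: Vec<u32> = pids.to_vec();
    loop {
        remaining.retain(|&pid| is_alive(pid));
        if remaining.is_empty() || Instant::now() >= deadline {
            // Empty means every pid is verified gone; otherwise these are
            // the survivors the caller must deal with.
            return remaining;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```

Returning the survivors, rather than a bare success flag, lets the caller decide whether it is safe to mark the agent as terminated.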

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
Timmy 8446ab1c71 chore: gitignore .huskies/double_timmy_log.md
Local-only scratchpad for tracking suspected duplicate-Timmy /
duplicate-create_story incidents while we hunt the cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:06:37 +01:00
dave b5054b08d3 huskies: regen source-map.json 2026-05-15 08:47:38 +00:00
dave df32a1542b huskies: merge 1087 story Pipeline+Status split — Step D: migrate CRDT storage to (Pipeline, Status) and remove the Stage enum 2026-05-15 08:47:38 +00:00
dave e82602db77 huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status 2026-05-15 08:26:39 +00:00
Timmy 2d6105c778 fix: skip setup commands on worktree reuse so reconciler doesn't fire npm ci every 30s
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.

When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.

Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).
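
The reuse contract can be pictured with a small sketch. The `WorktreeStep` enum, the `worktree_steps` function, and the step ordering are hypothetical; the commit message only establishes that on reuse the setup commands are skipped while sparse checkout and the `.mcp.json` rewrite still run.

```rust
#[derive(Debug, PartialEq)]
enum WorktreeStep {
    GitWorktreeAdd,   // assumed name for the initial creation step
    SparseCheckout,
    RewriteMcpJson,
    RunSetupCommands, // includes the destructive `npm ci`
}

/// Hypothetical plan builder: which steps run for a fresh worktree versus a
/// reused one.
fn worktree_steps(already_exists: bool) -> Vec<WorktreeStep> {
    let mut steps = Vec::new();
    if !already_exists {
        steps.push(WorktreeStep::GitWorktreeAdd);
    }
    // Cheap and idempotent, so they run on both paths.
    steps.push(WorktreeStep::SparseCheckout);
    steps.push(WorktreeStep::RewriteMcpJson);
    if !already_exists {
        // Setup only on first creation; a 30-second reconcile pass over an
        // existing worktree must not re-trigger it.
        steps.push(WorktreeStep::RunSetupCommands);
    }
    steps
}
```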

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:57:38 +01:00
Timmy d89940e85b fix: drop source-map.json from agent orientation bundle
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:48:18 +01:00
dave 60fceee204 huskies: regen source-map.json 2026-05-15 02:03:30 +00:00
dave 13f7dab5f0 huskies: merge 1088 2026-05-15 02:03:30 +00:00
dave f7413cc711 huskies: regen source-map.json 2026-05-15 01:38:05 +00:00
dave b053f14d58 huskies: merge 1085 2026-05-15 01:38:05 +00:00
dave 56179d712e huskies: merge 1078 2026-05-15 01:32:29 +00:00
dave a06bf6778b huskies: regen source-map.json 2026-05-15 01:27:25 +00:00
dave 1506141155 huskies: merge 1072 2026-05-15 01:27:25 +00:00
dave ae69cd50b1 huskies: regen source-map.json 2026-05-15 00:58:57 +00:00
dave 0c23d209a0 huskies: merge 1077 2026-05-15 00:58:57 +00:00
dave eac5763e03 huskies: merge 1075 2026-05-15 00:48:06 +00:00
dave 6530eeab6d huskies: merge 811 2026-05-15 00:42:14 +00:00
dave 5eb8f2f8a7 huskies: regen source-map.json 2026-05-15 00:37:01 +00:00
dave f9b140add9 huskies: merge 1073 2026-05-15 00:37:01 +00:00
dave d4db96f709 huskies: merge 1070 2026-05-15 00:20:29 +00:00
dave 5f08573db8 huskies: merge 1076 2026-05-15 00:10:15 +00:00
dave da83fcb78d huskies: merge 1074 2026-05-15 00:01:58 +00:00
dave f04bdd1f14 huskies: regen source-map.json 2026-05-14 23:45:53 +00:00
dave bb6a6063e8 huskies: merge 1066 2026-05-14 23:45:53 +00:00
dave bf813d910b huskies: regen source-map.json 2026-05-14 23:29:32 +00:00
dave 374aa77f27 huskies: merge 1069 2026-05-14 23:29:32 +00:00
119 changed files with 5427 additions and 910 deletions
+23
@@ -0,0 +1,23 @@
#!/bin/sh
#
# Pre-commit hook installed by huskies.
# Runs script/check (fmt-check, clippy, cargo check, source-map-check)
# before every commit. Aborts if any gate fails.
#
# Emergency bypass: git commit --no-verify (see AGENT.md — avoid this)
REPO_ROOT="$(git rev-parse --show-toplevel)"
printf '[pre-commit] Running script/check ...\n'
OUTPUT=$("$REPO_ROOT/script/check" 2>&1)
STATUS=$?
if [ "$STATUS" -ne 0 ]; then
printf '\n=== PRE-COMMIT HOOK FAILED ===\n\n'
printf '%s\n' "$OUTPUT"
printf '\nFix the issues above, then re-validate with:\n'
printf ' script/check\n'
printf '\nEmergency bypass (see AGENT.md -- avoid this):\n'
printf ' git commit --no-verify\n\n'
exit 1
fi
+1
@@ -29,6 +29,7 @@ timers.json
# Misc
wishlist.md
double_timmy_log.md
# Database
pipeline.db
+57 -11
@@ -172,6 +172,8 @@
"interface WizardStepInfo",
"interface WizardStateData",
"interface AgentAssignment",
"type Pipeline",
"type Status",
"interface PipelineStageItem",
"interface PipelineState",
"type WsResponse",
@@ -200,6 +202,8 @@
"interface JoinedAgent",
"interface GatewayProject",
"interface GatewayInfo",
"type Pipeline",
"type Status",
"interface PipelineItem",
"interface ProjectPipelineStatus",
"interface AllProjectsPipeline",
@@ -517,6 +521,7 @@
],
"server/src/agents/merge/squash/tests_advanced.rs": [],
"server/src/agents/merge/squash/tests_basic.rs": [],
"server/src/agents/merge/squash/tests_changelog.rs": [],
"server/src/agents/mod.rs": [
"mod gates",
"mod lifecycle",
@@ -536,6 +541,7 @@
"enum TerminationReason",
"enum PipelineStage",
"fn pipeline_stage",
"fn canonical_pipeline_stage",
"fn agent_config_stage",
"struct CompletionReport",
"struct TokenUsage",
@@ -558,9 +564,11 @@
"fn assign_merge_stage"
],
"server/src/agents/pool/auto_assign/merge_failure_block_subscriber.rs": [
"fn reconcile_merge_failure_block",
"fn spawn_merge_failure_block_subscriber"
],
"server/src/agents/pool/auto_assign/merge_failure_subscriber.rs": [
"fn reconcile_merge_failure",
"fn spawn_merge_failure_subscriber"
],
"server/src/agents/pool/auto_assign/mod.rs": [
@@ -612,6 +620,7 @@
],
"server/src/agents/pool/auto_assign/watchdog/tests/orphan_tests.rs": [],
"server/src/agents/pool/cost_rollup_subscriber.rs": [
"fn reconcile_cost_rollup",
"fn spawn_cost_rollup_subscriber",
"fn on_terminal_transition"
],
@@ -670,9 +679,7 @@
"server/src/agents/pool/pipeline/mod.rs": [],
"server/src/agents/pool/process.rs": [
"fn kill_all_children",
"fn kill_child_for_key",
"fn inject_child_killer",
"fn child_killer_count"
"fn kill_child_for_key"
],
"server/src/agents/pool/query.rs": [
"fn available_agents_for_stage",
@@ -699,6 +706,7 @@
],
"server/src/agents/pool/stop.rs": [
"fn stop_agent",
"fn reconcile_canonical_agents",
"fn remove_agents_for_story"
],
"server/src/agents/pool/test_helpers.rs": [
@@ -730,6 +738,8 @@
"server/src/agents/pool/worktree_lifecycle.rs": [
"fn spawn_worktree_create_subscriber",
"fn spawn_worktree_cleanup_subscriber",
"fn reconcile_worktree_create",
"fn reconcile_worktree_cleanup",
"fn on_coding_transition",
"fn on_terminal_transition"
],
@@ -742,9 +752,7 @@
"fn run_agent_pty_streaming"
],
"server/src/agents/pty/types.rs": [
"struct PtyResult",
"fn composite_key",
"struct ChildKillerGuard"
"struct PtyResult"
],
"server/src/agents/runtime/claude_code.rs": [
"struct ClaudeCodeRuntime",
@@ -888,6 +896,13 @@
"server/src/chat/commands/unreleased.rs": [
"fn handle_unreleased"
],
"server/src/chat/dispatcher.rs": [
"type SpawnFn",
"struct ChatDispatcher",
"fn new",
"fn submit",
"fn stop"
],
"server/src/chat/history.rs": [
"type ChatConversationHistory",
"fn load_chat_history",
@@ -898,6 +913,7 @@
],
"server/src/chat/mod.rs": [
"mod commands",
"mod dispatcher",
"mod history",
"mod lookup",
"mod test_helpers",
@@ -1027,6 +1043,7 @@
"fn default_permission_timeout_secs",
"fn default_aggregated_notifications_poll_interval_secs",
"fn default_aggregated_notifications_enabled",
"fn default_coalesce_window_ms",
"fn default_transport",
"fn default_whatsapp_provider",
"struct BotConfig"
@@ -1390,6 +1407,7 @@
"fn qa_mode",
"fn item_type",
"fn epic",
"fn origin",
"fn for_test",
"type PipelineItemView",
"struct NodePresenceView",
@@ -1416,6 +1434,7 @@
"fn set_agent",
"fn set_qa_mode",
"fn set_plan_state",
"fn set_origin",
"fn write_item",
"fn write_item_str",
"fn set_retry_count",
@@ -1429,7 +1448,9 @@
"fn migrate_legacy_stage_strings",
"fn migrate_node_claims_to_agent_claims",
"fn migrate_merge_job",
"fn purge_done_stage_merge_jobs"
"fn purge_done_stage_merge_jobs",
"fn migrate_zombie_pipeline_rows",
"fn sweep_zombie_rows"
],
"server/src/crdt_state/write/mod.rs": [],
"server/src/crdt_state/write/tests.rs": [],
@@ -1538,7 +1559,11 @@
"fn named",
"fn write_item_with_content",
"fn move_item_stage",
"fn sync_item_agent",
"fn delete_item",
"fn delete_item_sync",
"fn sync_item_name",
"fn sync_item_depends_on",
"fn next_item_number"
],
"server/src/db/recover.rs": [
@@ -1548,11 +1573,15 @@
"fn recover_half_written_items"
],
"server/src/db/shadow_write.rs": [
"struct UnknownMigration",
"fn get_shared_pool",
"struct PipelineWriteMsg",
"struct PipelineDb",
"static PIPELINE_DB",
"fn init"
"static SHADOW_DB_PATH",
"fn init",
"fn backup_pre_pipeline_status",
"fn check_schema_drift"
],
"server/src/gateway/mod.rs": [
"fn build_gateway_route",
@@ -1734,7 +1763,9 @@
"fn tool_list_epics",
"fn tool_show_epic"
],
"server/src/http/mcp/story_tools/mod.rs": [],
"server/src/http/mcp/story_tools/mod.rs": [
"fn build_origin"
],
"server/src/http/mcp/story_tools/refactor.rs": [
"fn tool_create_refactor",
"fn tool_list_refactors"
@@ -2158,6 +2189,7 @@
"mod mesh",
"mod node_identity",
"mod pipeline_state",
"mod process_kill",
"mod rebuild",
"mod services",
"mod sled_uplink",
@@ -2193,7 +2225,6 @@
"server/src/pipeline_state/events.rs": [
"fn subscribe_transitions",
"fn try_broadcast",
"fn replay_current_pipeline_state",
"struct TransitionFired",
"trait TransitionSubscriber",
"struct EventBus",
@@ -2210,6 +2241,7 @@
"server/src/pipeline_state/subscribers.rs": [
"fn format_audit_entry",
"struct AuditLogSubscriber",
"fn reconcile_audit_log",
"fn spawn_audit_log_subscriber",
"struct MatrixBotSubscriber",
"struct FileRendererSubscriber",
@@ -2243,6 +2275,12 @@
"enum ArchiveReason",
"fn dir_name",
"fn from_dir",
"enum Pipeline",
"fn as_str",
"enum Status",
"fn as_str",
"fn pipeline",
"fn status",
"enum ExecutionState",
"struct PipelineItem",
"fn retry_count",
@@ -2250,6 +2288,11 @@
"fn stage_label",
"fn stage_dir_name"
],
"server/src/process_kill.rs": [
"fn sigkill_pids_and_verify",
"fn pids_matching",
"fn descendant_pids"
],
"server/src/rebuild.rs": [
"enum ShutdownReason",
"struct BotShutdownNotifier",
@@ -2579,7 +2622,9 @@
"fn format_oauth_accounts_exhausted",
"fn format_agent_started_notification",
"fn format_agent_completed_notification",
"fn merge_failure_snippet"
"fn format_new_item_notification",
"const MERGE_FAILURE_TAIL_LINES",
"fn truncate_gate_output"
],
"server/src/service/notifications/io/listener.rs": [
"fn spawn_notification_listener"
@@ -2965,6 +3010,7 @@
"fn spawn_tick_loop",
"fn spawn_gateway_relay",
"fn spawn_event_trigger_subscriber",
"fn run_reconcile_pass",
"fn spawn_startup_reconciliation"
],
"server/src/state.rs": [
Generated
+1 -1
@@ -1911,7 +1911,7 @@ checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9"
[[package]]
name = "huskies"
version = "0.11.0"
version = "0.11.1"
dependencies = [
"ammonia",
"async-stream",
+2 -2
@@ -1,12 +1,12 @@
{
"name": "huskies",
"version": "0.11.0",
"version": "0.11.1",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "huskies",
"version": "0.11.0",
"version": "0.11.1",
"dependencies": {
"@types/react-syntax-highlighter": "^15.5.13",
"react": "^19.1.0",
+1 -1
@@ -1,7 +1,7 @@
{
"name": "huskies",
"private": true,
"version": "0.11.0",
"version": "0.11.1",
"type": "module",
"scripts": {
"dev": "vite",
+29
@@ -50,6 +50,29 @@ export interface AgentAssignment {
status: string;
}
/** Display column for a work item — derived server-side from `Stage::pipeline()` (story 1085). */
export type Pipeline =
| "backlog"
| "coding"
| "qa"
| "merge"
| "done"
| "closed"
| "archived";
/** Badge/indicator for a work item — derived server-side from `Stage::status()` (story 1085). */
export type Status =
| "active"
| "frozen"
| "review-hold"
| "blocked"
| "merge-failure"
| "merge-failure-final"
| "abandoned"
| "superseded"
| "rejected"
| "done";
/** A single item in any pipeline stage (backlog, current, QA, merge, or done). */
export interface PipelineStageItem {
story_id: string;
@@ -57,6 +80,10 @@ export interface PipelineStageItem {
error: string | null;
merge_failure: string | null;
agent: AgentAssignment | null;
/** Display column (story 1085); falls back to the bucket name on legacy servers. */
pipeline?: Pipeline;
/** Display badge (story 1085); falls back to derived `blocked`/`frozen` on legacy servers. */
status?: Status;
review_hold: boolean | null;
qa: string | null;
depends_on: number[] | null;
@@ -214,6 +241,8 @@ export interface WorkItemContent {
stage: string;
name: string;
agent: string | null;
/** Origin JSON string (story 1088), or null for pre-origin items. */
origin: string | null;
}
/** Result for a single test case from the server's test runner. */
+28
@@ -24,10 +24,38 @@ export interface GatewayInfo {
projects: GatewayProject[];
}
/** Display column for a work item — derived server-side from `Stage::pipeline()` (story 1085). */
export type Pipeline =
| "backlog"
| "coding"
| "qa"
| "merge"
| "done"
| "closed"
| "archived";
/** Badge/indicator for a work item — derived server-side from `Stage::status()` (story 1085). */
export type Status =
| "active"
| "frozen"
| "review-hold"
| "blocked"
| "merge-failure"
| "merge-failure-final"
| "abandoned"
| "superseded"
| "rejected"
| "done";
export interface PipelineItem {
story_id: string;
name: string;
/** Legacy stage string (kept for back-compat); prefer `pipeline` + `status`. */
stage: string;
/** Display column (story 1085). Optional until all servers are upgraded. */
pipeline?: Pipeline;
/** Display badge (story 1085). Optional until all servers are upgraded. */
status?: Status;
agent?: { agent_name: string; model: string; status: string } | null;
blocked?: boolean;
retry_count?: number;
+44 -7
@@ -69,29 +69,34 @@ describe("StoryRow", () => {
expect(screen.getByText("awaiting-slot (#2)")).toBeInTheDocument();
});
// AC2: failure kind labels derived from merge_failure string
it("shows ConflictDetected for merge_failure with conflict text", () => {
// Story 1085: failure kind no longer derived from substring. Items in
// the merge_failure / merge_failure_final status get a generic FAILED badge;
// the kind detail is exposed via the typed `status` field for callers that
// need it (instead of being squeezed into the badge text).
it("shows ✕ FAILED badge for merge-failure status", () => {
const item: PipelineItem = {
story_id: "73_story_conflict",
name: "Conflict Story",
stage: "merge",
blocked: true,
pipeline: "merge",
status: "merge-failure",
merge_failure: "Merge conflict: conflicts detected",
};
render(<StoryRow item={item} />);
expect(screen.getByText("ConflictDetected")).toBeInTheDocument();
expect(screen.getByText("✕ FAILED")).toBeInTheDocument();
});
it("shows GatesFailed for merge_failure with quality gates text", () => {
it("shows ⛔ FAILED (FINAL) badge for merge-failure-final status", () => {
const item: PipelineItem = {
story_id: "74_story_gates",
name: "Gates Failed Story",
stage: "merge",
blocked: true,
pipeline: "merge",
status: "merge-failure-final",
merge_failure: "Quality gates failed: cargo test failed",
};
render(<StoryRow item={item} />);
expect(screen.getByText("GatesFailed")).toBeInTheDocument();
expect(screen.getByText("⛔ FAILED (FINAL)")).toBeInTheDocument();
});
it("shows RECOVERING badge for merge_failure item with running mergemaster", () => {
@@ -163,4 +168,36 @@ describe("StoryRow", () => {
render(<StoryRow item={item} />);
expect(screen.getByText("⊘ BLOCKED")).toBeInTheDocument();
});
// Story 1085 AC 4 — Frozen items remain visible in their underlying column
// with a frozen indicator. The server hands us `pipeline: "coding"` for a
// frozen-while-coding story and the badge is decorated separately.
it("shows ❄ FROZEN badge for a frozen item (column stays as underlying pipeline)", () => {
const item: PipelineItem = {
story_id: "70_story_frozen_coding",
name: "Paused Coding Story",
stage: "current",
pipeline: "coding",
status: "frozen",
};
render(<StoryRow item={item} />);
expect(screen.getByText("❄ FROZEN")).toBeInTheDocument();
});
// Story 1085 AC 4 (subsumes 1052) — Done items must never get a
// MergeFailure indicator, even if a stale `merge_failure` string is present.
it("done items render Done badge, never MergeFailure", () => {
const item: PipelineItem = {
story_id: "71_story_done",
name: "Completed Story",
stage: "done",
pipeline: "done",
status: "done",
merge_failure: "ignored stale string",
};
render(<StoryRow item={item} />);
expect(screen.getByText("Done")).toBeInTheDocument();
expect(screen.queryByText("✕ FAILED")).not.toBeInTheDocument();
expect(screen.queryByText(/FAILED/)).not.toBeInTheDocument();
});
});
+114 -64
@@ -14,9 +14,42 @@ import {
type JoinedAgent,
type GatewayProject,
type AllProjectsPipeline,
type Pipeline,
type PipelineItem,
type Status,
} from "../api/gateway";
/// Resolve an item's pipeline column. Servers running the new (story 1085)
/// backend send `pipeline`; older servers only send `stage` so we fall back to
/// mapping the bucket name onto the new column vocabulary.
function itemPipeline(item: PipelineItem): Pipeline {
if (item.pipeline) return item.pipeline;
switch (item.stage) {
case "current":
return "coding";
case "qa":
return "qa";
case "merge":
return "merge";
case "done":
return "done";
case "archived":
return "archived";
default:
return "backlog";
}
}
/// Resolve an item's badge. Falls back to `merge_failure`/`blocked` on
/// legacy servers that don't yet emit `status`.
function itemStatus(item: PipelineItem): Status {
if (item.status) return item.status;
if (item.merge_failure) return "merge-failure";
if (item.blocked) return "blocked";
if (item.stage === "done") return "done";
return "active";
}
const { useCallback, useEffect, useRef, useState } = React;
/// Seconds of silence before an agent is considered disconnected.
@@ -48,72 +81,86 @@ const STATUS_LABELS: Record<AgentStatus, string> = {
disconnected: "Disconnected",
};
const STAGE_COLORS: Record<string, string> = {
const PIPELINE_COLORS: Record<Pipeline, string> = {
backlog: "#8b949e",
current: "#3fb950",
coding: "#3fb950",
qa: "#d2a679",
merge: "#79c0ff",
done: "#6e7681",
closed: "#6e7681",
archived: "#6e7681",
};
const STAGE_LABELS: Record<string, string> = {
const PIPELINE_LABELS: Record<Pipeline, string> = {
backlog: "Backlog",
current: "In Progress",
coding: "In Progress",
qa: "QA",
merge: "Merging",
done: "Done",
closed: "Closed",
archived: "Archived",
};
/// Derive a short label from a merge failure string based on the failure kind.
function mergeFailureKindLabel(failure: string): string {
if (failure.includes("Merge conflict") || failure.includes("CONFLICT")) {
return "ConflictDetected";
}
if (failure.includes("Quality gates failed") || failure.includes("gates failed")) {
return "GatesFailed";
}
if (failure.includes("no code changes") || failure.includes("empty diff")) {
return "EmptyDiff";
}
if (failure.includes("No commits")) {
return "NoCommits";
}
return "✕ FAILED";
}
/// A single story row inside a project pipeline card.
/** Render one story row in a gateway-aggregate panel: `#<id> <name>` with stage badge. */
/** Render one story row in a gateway-aggregate panel: `#<id> <name>` with status badge. */
export function StoryRow({ item, mergeQueuePos }: { item: PipelineItem; mergeQueuePos?: number }) {
const isStuck = item.merge_failure != null || item.blocked;
const isMergeActive = item.stage === "merge" && !isStuck && item.agent?.status === "running";
const pipeline = itemPipeline(item);
const status = itemStatus(item);
const agentStatus = item.agent?.status;
let color: string;
let label: string;
let frozenPrefix = "";
if (isMergeActive) {
color = "#58a6ff";
label = "▶ MERGING";
} else if (isStuck) {
const agentStatus = item.agent?.status;
// Frozen items keep their underlying pipeline column but get a ❄️ badge.
// (AC 4 — story 1085, subsumes the freeze-hides-item bug.)
if (status === "frozen") {
color = "#79c0ff";
label = "❄ FROZEN";
frozenPrefix = "❄ ";
} else if (status === "merge-failure" || status === "merge-failure-final") {
// Done items never reach this branch — `Stage::status()` returns
// `Status::Done` for done items (AC 4).
if (agentStatus === "running") {
color = "#e3b341";
label = "⟳ RECOVERING";
} else if (agentStatus === "pending") {
color = "#e3b341";
label = "⏳ QUEUED";
} else if (item.merge_failure != null) {
} else {
color = "#f85149";
label = mergeFailureKindLabel(item.merge_failure);
label = status === "merge-failure-final" ? "⛔ FAILED (FINAL)" : "✕ FAILED";
}
} else if (status === "blocked") {
if (agentStatus === "running") {
color = "#e3b341";
label = "⟳ RECOVERING";
} else if (agentStatus === "pending") {
color = "#e3b341";
label = "⏳ QUEUED";
} else {
color = "#f85149";
label = "⊘ BLOCKED";
}
} else if (item.stage === "merge" && item.agent?.status === "pending") {
} else if (status === "review-hold") {
color = "#d2a679";
label = "REVIEW HOLD";
} else if (status === "abandoned") {
color = "#6e7681";
label = "ABANDONED";
} else if (status === "superseded") {
color = "#6e7681";
label = "SUPERSEDED";
} else if (status === "rejected") {
color = "#f85149";
label = "REJECTED";
} else if (pipeline === "merge" && agentStatus === "running") {
color = "#58a6ff";
label = "▶ MERGING";
} else if (pipeline === "merge" && agentStatus === "pending") {
color = "#e3b341";
label = "⏳ QUEUED";
} else if (item.stage === "merge") {
} else if (pipeline === "merge") {
color = "#6e7681";
if (mergeQueuePos === 1) {
label = "NEXT IN QUEUE";
@@ -123,10 +170,11 @@ export function StoryRow({ item, mergeQueuePos }: { item: PipelineItem; mergeQue
label = "awaiting-slot";
}
} else {
color = STAGE_COLORS[item.stage] ?? "#8b949e";
label = STAGE_LABELS[item.stage] ?? item.stage;
color = PIPELINE_COLORS[pipeline] ?? "#8b949e";
label = PIPELINE_LABELS[pipeline] ?? pipeline;
}
const isMergeActive = pipeline === "merge" && status === "active" && agentStatus === "running";
const idNum = item.story_id.match(/^(\d+)/)?.[1];
return (
@@ -158,7 +206,7 @@ export function StoryRow({ item, mergeQueuePos }: { item: PipelineItem; mergeQue
</span>
<span style={{ color: "#e6edf3", overflow: "hidden", textOverflow: "ellipsis", whiteSpace: "nowrap" }}>
{idNum && <span style={{ color: "#8b949e", fontFamily: "monospace" }}>#{idNum}{" "}</span>}
{item.name}
{frozenPrefix}{item.name}
</span>
</div>
);
@@ -388,6 +436,8 @@ function aggregateItems(
story_id: b.story_id,
name: b.name,
stage: "backlog",
pipeline: "backlog" as Pipeline,
status: "active" as Status,
})),
};
}
@@ -395,14 +445,14 @@ function aggregateItems(
return {
project,
items: (status.active ?? []).filter(
(i) => i.stage !== "done",
(i) => itemPipeline(i) !== "done",
),
};
}
if (tab === "done") {
return {
project,
items: (status.active ?? []).filter((i) => i.stage === "done"),
items: (status.active ?? []).filter((i) => itemPipeline(i) === "done"),
};
}
// archived
@@ -419,12 +469,12 @@ function tabCount(pipeline: AllProjectsPipeline, tab: TabKey): number {
if (tab === "in-progress") {
return (
sum +
(status.active ?? []).filter((i) => i.stage !== "done").length
(status.active ?? []).filter((i) => itemPipeline(i) !== "done").length
);
}
if (tab === "done") {
return (
sum + (status.active ?? []).filter((i) => i.stage === "done").length
sum + (status.active ?? []).filter((i) => itemPipeline(i) === "done").length
);
}
return sum + (status.archived ?? []).length;
@@ -518,13 +568,16 @@ function ProjectStoryRow({
);
}
const IN_PROGRESS_STAGE_LABELS: Record<string, string> = {
current: "Coding",
const IN_PROGRESS_PIPELINE_LABELS: Record<"coding" | "qa" | "merge", string> = {
coding: "Coding",
qa: "QA",
merge: "Merging",
};
/// In Progress tab content — items grouped by stage (coding / qa / merging).
/// In Progress tab content — items grouped by their `pipeline` column.
///
/// Frozen items appear in the column corresponding to their underlying
/// `Stage::resume_to` (server-side), so they always show up in-place.
function InProgressTabContent({
groups,
}: {
@@ -535,25 +588,22 @@ function InProgressTabContent({
);
const multiProject = new Set(allItems.map((x) => x.project)).size > 1;
const byStage = {
current: allItems.filter((x) => x.item.stage === "current"),
qa: allItems.filter((x) => x.item.stage === "qa"),
merge: allItems.filter((x) => x.item.stage === "merge"),
const byPipeline = {
coding: allItems.filter((x) => itemPipeline(x.item) === "coding"),
qa: allItems.filter((x) => itemPipeline(x.item) === "qa"),
merge: allItems.filter((x) => itemPipeline(x.item) === "merge"),
};
const stages = (["current", "qa", "merge"] as const).filter(
(s) => byStage[s].length > 0,
const pipelines = (["coding", "qa", "merge"] as const).filter(
(p) => byPipeline[p].length > 0,
);
// Compute queue position among clean awaiting merge items (Stage::Merge, no failure, no running agent).
// Compute queue position among "clean" awaiting-merge items: pipeline=merge,
// status=active, and no agent currently running.
const mergeQueuePosMap = new Map<string, number>();
let queuePos = 0;
for (const { project, item } of byStage.merge) {
if (
!item.blocked &&
!item.merge_failure &&
item.agent?.status !== "running"
) {
for (const { project, item } of byPipeline.merge) {
if (itemStatus(item) === "active" && item.agent?.status !== "running") {
queuePos += 1;
mergeQueuePosMap.set(`${project}:${item.story_id}`, queuePos);
}
@@ -569,33 +619,33 @@ function InProgressTabContent({
return (
<div>
{stages.map((stage) => (
<div key={stage} style={{ marginBottom: "20px" }}>
{pipelines.map((p) => (
<div key={p} style={{ marginBottom: "20px" }}>
<div
style={{
fontSize: "0.8em",
fontWeight: 600,
color: STAGE_COLORS[stage] ?? "#8b949e",
color: PIPELINE_COLORS[p] ?? "#8b949e",
textTransform: "uppercase",
letterSpacing: "0.06em",
marginBottom: "8px",
paddingBottom: "4px",
borderBottom: `1px solid ${STAGE_COLORS[stage] ?? "#8b949e"}33`,
borderBottom: `1px solid ${PIPELINE_COLORS[p] ?? "#8b949e"}33`,
}}
>
{IN_PROGRESS_STAGE_LABELS[stage]}{" "}
{IN_PROGRESS_PIPELINE_LABELS[p]}{" "}
<span style={{ color: "#6e7681" }}>
({byStage[stage].length})
({byPipeline[p].length})
</span>
</div>
{byStage[stage].map(({ project, item }) => (
{byPipeline[p].map(({ project, item }) => (
<ProjectStoryRow
key={`${project}:${item.story_id}`}
project={project}
item={item}
showProject={multiProject}
mergeQueuePos={
stage === "merge"
p === "merge"
? mergeQueuePosMap.get(`${project}:${item.story_id}`)
: undefined
}
@@ -43,6 +43,7 @@ const DEFAULT_CONTENT = {
stage: "current",
name: "Big Title Story",
agent: null,
origin: null,
};
beforeEach(() => {
@@ -43,6 +43,7 @@ const DEFAULT_CONTENT = {
stage: "current",
name: "Big Title Story",
agent: null,
origin: null,
};
const sampleTestResults: TestResultsResponse = {
@@ -42,6 +42,7 @@ const DEFAULT_CONTENT = {
stage: "current",
name: "Big Title Story",
agent: null,
origin: null,
};
beforeEach(() => {
@@ -127,6 +128,7 @@ describe("WorkItemDetailPanel", () => {
stage: "current",
name: "My Story Name",
agent: null,
origin: null,
});
render(
<WorkItemDetailPanel
@@ -146,6 +148,7 @@ describe("WorkItemDetailPanel", () => {
stage: "current",
name: "My Story Name",
agent: null,
origin: null,
});
render(
<WorkItemDetailPanel
@@ -164,6 +167,7 @@ describe("WorkItemDetailPanel", () => {
stage: "current",
name: "My Story Name",
agent: null,
origin: null,
});
render(
<WorkItemDetailPanel
@@ -186,6 +190,7 @@ describe("WorkItemDetailPanel", () => {
stage: "current",
name: "My Story Name",
agent: null,
origin: null,
});
render(
<WorkItemDetailPanel
@@ -20,6 +20,26 @@ import { stripDisplayContent } from "./workItemDetailPanelUtils";
const { useCallback, useEffect, useRef, useState } = React;
/** Parse and format an origin JSON string for display. */
function formatOrigin(origin: string | null): string {
if (!origin) return "unknown";
try {
const obj = JSON.parse(origin) as {
kind?: string;
id?: string;
ts?: number;
};
const kind = obj.kind ?? "unknown";
const id = obj.id ? ` (${obj.id})` : "";
const ts = obj.ts
? ` at ${new Date(obj.ts * 1000).toISOString().replace("T", " ").slice(0, 19)}Z`
: "";
return `${kind}${id}${ts}`;
} catch {
return origin;
}
}
interface WorkItemDetailPanelProps {
storyId: string;
pipelineVersion: number;
@@ -38,6 +58,7 @@ export function WorkItemDetailPanel({
const [stage, setStage] = useState<string>("");
const [name, setName] = useState<string | null>(null);
const [assignedAgent, setAssignedAgent] = useState<string | null>(null);
const [origin, setOrigin] = useState<string | null>(null);
const [loading, setLoading] = useState(true);
const [error, setError] = useState<string | null>(null);
const [agentInfo, setAgentInfo] = useState<AgentInfo | null>(null);
@@ -63,6 +84,7 @@ export function WorkItemDetailPanel({
setStage(data.stage);
setName(data.name);
setAssignedAgent(data.agent);
setOrigin(data.origin);
})
.catch((err: unknown) => {
setError(err instanceof Error ? err.message : "Failed to load content");
@@ -289,6 +311,19 @@ export function WorkItemDetailPanel({
<TestResultsSection testResults={testResults} />
{!loading && (
<div
data-testid="detail-panel-origin"
style={{
fontSize: "0.75em",
color: "#555",
fontFamily: "monospace",
}}
>
origin: {formatOrigin(origin)}
</div>
)}
<div
style={{
display: "flex",
+30 -6
@@ -124,19 +124,43 @@ else
fi
# Categorise merged work items and format names.
# Supports two subject formats (after stripping the "huskies: merge " prefix):
# New: "1063 story Human Readable Name"
# Old: "1063_story_human_readable_name"
FEATURES=""
FIXES=""
REFACTORS=""
while IFS= read -r item; do
[ -z "$item" ] && continue
# Strip the numeric prefix and type to get the human name.
name=$(echo "$item" | sed -E 's/^[0-9]+_(story|bug|refactor|spike)_//' | tr '_' ' ')
# Extract the leading numeric ID (present in both formats).
id=$(echo "$item" | grep -oE '^[0-9]+')
# Detect format and extract human name + type word.
if echo "$item" | grep -qE '^[0-9]+ (story|bug|refactor|spike|epic) '; then
# New format: "1063 story Human Name Here"
type_word=$(echo "$item" | sed -E 's/^[0-9]+ ([a-z]+) .*/\1/')
name=$(echo "$item" | sed -E 's/^[0-9]+ [a-z]+ //')
else
# Legacy slug format: "1063_story_human_name_here"
type_word=$(echo "$item" | sed -E 's/^[0-9]+_([a-z]+)_.*/\1/')
name=$(echo "$item" | sed -E 's/^[0-9]+_(story|bug|refactor|spike|epic)_//' | tr '_' ' ')
fi
# Capitalise first letter.
name="$(echo "${name:0:1}" | tr '[:lower:]' '[:upper:]')${name:1}"
case "$item" in
*_bug_*) FIXES="${FIXES}- ${name}\n" ;;
*_refactor_*) REFACTORS="${REFACTORS}- ${name}\n" ;;
*) FEATURES="${FEATURES}- ${name}\n" ;;
# Format as "Name (ID)" when a numeric ID was found, plain name otherwise.
if [ -n "$id" ]; then
entry="${name} (${id})"
else
entry="${name}"
fi
case "$type_word" in
bug) FIXES="${FIXES}- ${entry}\n" ;;
refactor) REFACTORS="${REFACTORS}- ${entry}\n" ;;
*) FEATURES="${FEATURES}- ${entry}\n" ;;
esac
done <<< "$MERGED_RAW"
+16 -1
@@ -53,7 +53,22 @@ cargo run --manifest-path "$PROJECT_ROOT/Cargo.toml" -p source-map-gen --bin sou
echo "=== Building frontend ==="
if [ -d "$PROJECT_ROOT/frontend" ]; then
cd "$PROJECT_ROOT/frontend"
npm install
# The merge gate runs in workspaces whose pre-existing `node_modules` was
# populated by an earlier `npm install --omit=dev` (or a partial install).
# In that state `npm install` reports "up to date, audited N packages"
# without actually adding the missing devDependencies, so the subsequent
# `tsc && vite build` fails with `sh: 1: tsc: not found`.
#
# Repair the install when typescript isn't reachable (story 1086 merge gate
# regression). We probe the on-disk binary rather than relying on PATH so
# this also covers the case where `node_modules/.bin/` is missing.
if [ ! -x node_modules/typescript/bin/tsc ]; then
echo "[script/test] node_modules missing typescript; performing clean install."
rm -rf node_modules
fi
npm install --include=dev
npm run build
cd "$PROJECT_ROOT"
else
+1 -1
@@ -1,6 +1,6 @@
[package]
name = "huskies"
version = "0.11.0"
version = "0.11.1"
edition = "2024"
build = "build.rs"
+14
@@ -17,6 +17,20 @@ fn run(cmd: &str, args: &[&str], dir: &Path) {
fn main() {
println!("cargo:rerun-if-changed=build.rs");
println!("cargo:rerun-if-env-changed=PROFILE");
// Embed the current git commit hash at compile time so `get_version` always
// reflects the binary that is actually running, not a potentially-stale file.
println!("cargo:rerun-if-changed=../.git/HEAD");
println!("cargo:rerun-if-changed=../.git/refs/");
let git_hash = std::process::Command::new("git")
.args(["rev-parse", "--short", "HEAD"])
.output()
.ok()
.filter(|o| o.status.success())
.and_then(|o| String::from_utf8(o.stdout).ok())
.map(|s| s.trim().to_string())
.unwrap_or_else(|| "unknown".to_string());
println!("cargo:rustc-env=BUILD_GIT_HASH={git_hash}");
println!("cargo:rerun-if-changed=../frontend/package.json");
println!("cargo:rerun-if-changed=../frontend/package-lock.json");
println!("cargo:rerun-if-changed=../frontend/vite.config.ts");
@@ -0,0 +1,56 @@
-- Story 1087: split the legacy `stage` column on `pipeline_items` into a
-- `(pipeline, status)` pair so the read side no longer needs to re-derive the
-- display column and badge from the stage string.
--
-- The migration is additive: `stage` is retained for backwards compatibility
-- while remaining Step E callers are migrated. The backup of `pipeline.db`
-- written by `shadow_write::init` immediately before this migration runs is
-- the recovery path if the backfill produces an unexpected projection.
ALTER TABLE pipeline_items ADD COLUMN pipeline TEXT NOT NULL DEFAULT '';
ALTER TABLE pipeline_items ADD COLUMN status TEXT NOT NULL DEFAULT '';
-- Backfill `pipeline` from the existing `stage` column. Every wire-form
-- stage string emitted by `stage_dir_name` maps to exactly one of the seven
-- Pipeline columns defined in `pipeline_state::types::Pipeline::as_str`.
-- Legacy directory strings (`1_backlog`, `2_current`, ...) are also handled
-- so that databases predating story 934 migrate cleanly.
UPDATE pipeline_items SET pipeline = CASE stage
WHEN 'upcoming' THEN 'backlog'
WHEN 'backlog' THEN 'backlog'
WHEN '1_backlog' THEN 'backlog'
WHEN 'coding' THEN 'coding'
WHEN 'blocked' THEN 'coding'
WHEN '2_current' THEN 'coding'
WHEN 'qa' THEN 'qa'
WHEN 'review_hold' THEN 'qa'
WHEN '3_qa' THEN 'qa'
WHEN 'merge' THEN 'merge'
WHEN 'merge_failure' THEN 'merge'
WHEN 'merge_failure_final' THEN 'merge'
WHEN '4_merge' THEN 'merge'
WHEN 'done' THEN 'done'
WHEN '5_done' THEN 'done'
WHEN 'abandoned' THEN 'closed'
WHEN 'superseded' THEN 'closed'
WHEN 'rejected' THEN 'closed'
WHEN 'archived' THEN 'archived'
WHEN '6_archived' THEN 'archived'
WHEN 'frozen' THEN 'coding'
ELSE ''
END;
-- Backfill `status` (badge) from the existing `stage` column.
UPDATE pipeline_items SET status = CASE stage
WHEN 'frozen' THEN 'frozen'
WHEN 'review_hold' THEN 'review-hold'
WHEN 'blocked' THEN 'blocked'
WHEN 'merge_failure' THEN 'merge-failure'
WHEN 'merge_failure_final' THEN 'merge-failure-final'
WHEN 'abandoned' THEN 'abandoned'
WHEN 'superseded' THEN 'superseded'
WHEN 'rejected' THEN 'rejected'
WHEN 'done' THEN 'done'
WHEN '5_done' THEN 'done'
ELSE 'active'
END;
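The two backfill statements above amount to a pure function from the legacy `stage` string to a `(pipeline, status)` pair. A minimal Rust sketch of that projection follows, assuming nothing beyond the CASE tables in the migration. The name `project_stage` is hypothetical and is not part of the codebase shown here.

```rust
/// Hypothetical helper mirroring the two backfill CASE expressions:
/// maps a legacy `stage` string to its `(pipeline, status)` pair.
pub fn project_stage(stage: &str) -> (&'static str, &'static str) {
    // Display column, as in the first UPDATE.
    let pipeline = match stage {
        "upcoming" | "backlog" | "1_backlog" => "backlog",
        "coding" | "blocked" | "2_current" | "frozen" => "coding",
        "qa" | "review_hold" | "3_qa" => "qa",
        "merge" | "merge_failure" | "merge_failure_final" | "4_merge" => "merge",
        "done" | "5_done" => "done",
        "abandoned" | "superseded" | "rejected" => "closed",
        "archived" | "6_archived" => "archived",
        // Unknown stage strings are left blank, as in the SQL `ELSE ''`.
        _ => "",
    };
    // Badge, as in the second UPDATE; anything unlisted is "active".
    let status = match stage {
        "frozen" => "frozen",
        "review_hold" => "review-hold",
        "blocked" => "blocked",
        "merge_failure" => "merge-failure",
        "merge_failure_final" => "merge-failure-final",
        "abandoned" => "abandoned",
        "superseded" => "superseded",
        "rejected" => "rejected",
        "done" | "5_done" => "done",
        _ => "active",
    };
    (pipeline, status)
}
```

Under this reading, an unrecognised stage string ends up with an empty `pipeline` but an `active` status, and a flat `frozen` row always lands in the `coding` column because the stage string alone carries no resume target.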
+1
@@ -78,6 +78,7 @@ pub(super) fn build_agent_app_context(
pending_perm_replies: Arc::new(tokio::sync::Mutex::new(std::collections::HashMap::new())),
permission_timeout_secs: 120,
status: agents.status_broadcaster(),
chat_dispatcher: Arc::new(crate::chat::dispatcher::ChatDispatcher::new(1_500)),
});
crate::http::context::AppContext {
state: Arc::new(state),
+7 -4
@@ -198,10 +198,13 @@ pub async fn run(
)
};
// Replay current pipeline state so subscribers (worktree lifecycle, merge-failure
// auto-spawn) react to any stories already in active stages, then auto-assign.
slog!("[agent-mode] Replaying current pipeline state.");
crate::pipeline_state::replay_current_pipeline_state();
// Reconcile subscriber side effects for the current CRDT state without
// flooding the broadcast channel (replaces the former replay_current_pipeline_state call).
slog!("[agent-mode] Running startup reconcile pass.");
let done_retention = crate::config::ProjectConfig::load(&project_root)
.map(|c| std::time::Duration::from_secs(c.watcher.done_retention_secs))
.unwrap_or_else(|_| std::time::Duration::from_secs(4 * 3600));
crate::startup::tick_loop::run_reconcile_pass(&project_root, &agents, done_retention).await;
// Run initial auto-assign.
slog!("[agent-mode] Initial auto-assign scan.");
+16 -142
@@ -10,10 +10,12 @@
//! - `.huskies/README.md`
//! - `.huskies/specs/00_CONTEXT.md`
//! - `.huskies/AGENT.md`
//! - `.huskies/source-map.json` (up to 200 KB; truncated with a log if larger)
//!
//! `STACK.md` is intentionally excluded — it is large and changes often; agents
//! should grep it on demand.
//! `STACK.md` and `.huskies/source-map.json` are intentionally excluded — they
//! are large and change often; agents should grep on demand instead. Earlier
//! versions of this bundle inlined the source map, which ballooned the orientation
//! to ~96 KB and drowned out the workflow rules in AGENT.md; the file is still
//! kept on disk for the merge-time `source-map-check` doc-coverage gate.
//!
//! Behaviour contract:
//! - Files that are missing or empty are skipped silently (no error, no section).
@@ -33,12 +35,6 @@ const ORIENTATION_FILES: &[&str] = &[
".huskies/AGENT.md",
];
/// Path to the source map (relative to project root), appended after AGENT.md.
const SOURCE_MAP_REL: &str = ".huskies/source-map.json";
/// Maximum bytes of source-map content to embed in the prompt.
const SOURCE_MAP_BYTE_CAP: usize = 200 * 1024;
/// Attempt to load the project-local agent prompt by concatenating orientation
/// files from the project root.
///
@@ -60,14 +56,11 @@ pub fn read_project_local_prompt(project_root: &Path) -> Option<String> {
sections.push((rel_path, trimmed.to_string()));
}
// Read source-map.json (after AGENT.md) with a byte cap.
let source_map_content = read_source_map_section(project_root);
if sections.is_empty() && source_map_content.is_none() {
if sections.is_empty() {
return None;
}
let mut included_files: Vec<&str> = sections.iter().map(|(name, _)| *name).collect();
let included_files: Vec<&str> = sections.iter().map(|(name, _)| *name).collect();
let mut bundle = String::new();
for (i, (name, content)) in sections.iter().enumerate() {
if i > 0 {
@@ -77,15 +70,6 @@ pub fn read_project_local_prompt(project_root: &Path) -> Option<String> {
bundle.push_str(content);
}
if let Some(sm) = source_map_content {
if !bundle.is_empty() {
bundle.push('\n');
}
bundle.push_str(&format!("=== {SOURCE_MAP_REL} ===\n"));
bundle.push_str(&sm);
included_files.push(SOURCE_MAP_REL);
}
crate::slog!(
"[agents] orientation bundle: {} bytes, files: [{}]",
bundle.len(),
@@ -95,39 +79,6 @@ pub fn read_project_local_prompt(project_root: &Path) -> Option<String> {
Some(bundle)
}
/// Read `.huskies/source-map.json` from `project_root`, applying a byte cap.
///
/// Returns `None` when the file is absent, unreadable, or empty.
/// When the content exceeds [`SOURCE_MAP_BYTE_CAP`], truncates at a char
/// boundary and logs the truncation.
#[allow(clippy::string_slice)] // cap is walked back to a char boundary before slicing
fn read_source_map_section(project_root: &Path) -> Option<String> {
let path = project_root.join(SOURCE_MAP_REL);
let Ok(content) = std::fs::read_to_string(&path) else {
return None;
};
let trimmed = content.trim();
if trimmed.is_empty() {
return None;
}
if trimmed.len() > SOURCE_MAP_BYTE_CAP {
let mut cap = SOURCE_MAP_BYTE_CAP;
while cap > 0 && !trimmed.is_char_boundary(cap) {
cap -= 1;
}
crate::slog!(
"[agents] source-map.json truncated: {} bytes > {} byte cap; \
including first {} bytes",
trimmed.len(),
SOURCE_MAP_BYTE_CAP,
cap
);
Some(trimmed[..cap].to_string())
} else {
Some(trimmed.to_string())
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -310,10 +261,13 @@ mod tests {
);
}
// ── source-map.json tests ────────────────────────────────────────────────
// ── source-map.json must NOT be inlined into the bundle ──────────────────
// The file is kept on disk for the merge-time source-map-check gate, but
// inlining it into every agent spawn ballooned the orientation past 96 KB
// and drowned out the workflow rules in AGENT.md.
#[test]
fn source_map_included_after_agent_md() {
fn source_map_not_included_even_when_present() {
let tmp = tempfile::tempdir().unwrap();
write_file(tmp.path(), ".huskies/AGENT.md", "agent content");
write_file(
@@ -324,92 +278,12 @@ mod tests {
let result = read_project_local_prompt(tmp.path()).unwrap();
assert!(
result.contains("=== .huskies/source-map.json ==="),
"source-map delimiter must be present: {result}"
!result.contains("=== .huskies/source-map.json ==="),
"source-map must not appear as an orientation section: {result}"
);
assert!(
result.contains(r#""src/lib.rs""#),
"source-map content must be present: {result}"
);
// source-map section must appear after AGENT.md section
let agent_pos = result.find("=== .huskies/AGENT.md ===").unwrap();
let sm_pos = result.find("=== .huskies/source-map.json ===").unwrap();
assert!(
sm_pos > agent_pos,
"source-map section must come after AGENT.md section"
);
}
#[test]
fn source_map_missing_skipped_silently() {
let tmp = tempfile::tempdir().unwrap();
write_file(tmp.path(), ".huskies/AGENT.md", "agent content");
// source-map.json intentionally absent
let result = read_project_local_prompt(tmp.path()).unwrap();
assert!(
!result.contains("source-map.json"),
"absent source-map must not create a section: {result}"
);
}
#[test]
fn source_map_empty_skipped_silently() {
let tmp = tempfile::tempdir().unwrap();
write_file(tmp.path(), ".huskies/AGENT.md", "agent content");
write_file(tmp.path(), ".huskies/source-map.json", "");
let result = read_project_local_prompt(tmp.path()).unwrap();
assert!(
!result.contains("source-map.json"),
"empty source-map must not create a section: {result}"
);
}
#[test]
fn source_map_only_returns_some() {
let tmp = tempfile::tempdir().unwrap();
// Only source-map.json present; all orientation files absent.
write_file(
tmp.path(),
".huskies/source-map.json",
r#"{"src/main.rs": {}}"#,
);
let result = read_project_local_prompt(tmp.path());
assert!(
result.is_some(),
"source-map alone must produce Some bundle"
);
assert!(
result.unwrap().contains("=== .huskies/source-map.json ==="),
"bundle must contain source-map section"
);
}
#[test]
#[allow(clippy::string_slice)] // sm_start is derived from str::find — always a char boundary
fn source_map_truncated_at_byte_cap() {
let tmp = tempfile::tempdir().unwrap();
write_file(tmp.path(), ".huskies/AGENT.md", "agent");
// Build content larger than SOURCE_MAP_BYTE_CAP (200 KB).
let big = "x".repeat(SOURCE_MAP_BYTE_CAP + 1024);
write_file(tmp.path(), ".huskies/source-map.json", &big);
let result = read_project_local_prompt(tmp.path()).unwrap();
assert!(
result.contains("=== .huskies/source-map.json ==="),
"truncated source-map must still produce a section: {result}"
);
// The content length of just the source-map section must be <= SOURCE_MAP_BYTE_CAP.
let sm_start = result.find("=== .huskies/source-map.json ===").unwrap()
+ "=== .huskies/source-map.json ===\n".len();
let sm_content = &result[sm_start..];
assert!(
sm_content.len() <= SOURCE_MAP_BYTE_CAP,
"source-map section content must be <= {} bytes, got {}",
SOURCE_MAP_BYTE_CAP,
sm_content.len()
!result.contains("src/lib.rs"),
"source-map content must not be inlined: {result}"
);
}
}
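The behaviour contract exercised by these tests (empty files skipped, each section introduced by an `=== path ===` delimiter, `None` when nothing is left) can be illustrated with a small stand-alone sketch. `build_bundle` is a hypothetical, simplified stand-in that takes file contents directly instead of reading from disk. The single newline between sections is an assumption, since the hunk above elides that line.

```rust
/// Hypothetical stand-in for the bundle assembly: takes `(relative path,
/// raw content)` pairs and returns the concatenated orientation bundle.
pub fn build_bundle(files: &[(&str, &str)]) -> Option<String> {
    let mut bundle = String::new();
    for (name, content) in files {
        let trimmed = content.trim();
        // Missing or empty files contribute no section at all.
        if trimmed.is_empty() {
            continue;
        }
        // Assumption: sections are separated by one newline.
        if !bundle.is_empty() {
            bundle.push('\n');
        }
        bundle.push_str(&format!("=== {name} ===\n"));
        bundle.push_str(trimmed);
    }
    // No sections at all means there is no bundle to hand to the agent.
    if bundle.is_empty() { None } else { Some(bundle) }
}
```

Because the source map is simply never passed in, its presence on disk cannot affect the result, which is the property `source_map_not_included_even_when_present` checks against the real function.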
+11 -1
@@ -124,7 +124,15 @@ pub(crate) fn run_squash_merge(
// ── Commit in the temporary worktree ──────────────────────────
all_output.push_str("=== git commit ===\n");
let commit_msg = format!("huskies: merge {story_id}");
// Include human-readable name and item type when the CRDT is available.
// Falls back to the bare ID when running outside the server (e.g. in tests).
let story_label = crate::crdt_state::read_item(story_id)
.map(|item| {
let type_str = item.item_type().map(|t| t.as_str()).unwrap_or("story");
format!(" {} {}", type_str, item.name())
})
.unwrap_or_default();
let commit_msg = format!("huskies: merge {story_id}{story_label}");
let commit = Command::new("git")
.args(["commit", "-m", &commit_msg])
.current_dir(&merge_wt_path)
@@ -507,3 +515,5 @@ fn run_merge_quality_gates(
mod tests_advanced;
#[cfg(test)]
mod tests_basic;
#[cfg(test)]
mod tests_changelog;
@@ -0,0 +1,142 @@
//! Regression tests for changelog entry parsing — both legacy-slug and new-format
//! merge commit subjects must resolve to a human-readable "Name (ID)" entry.
/// Parse a single merge commit subject (after stripping the `huskies: merge ` prefix)
/// into `(id, type_word, human_name)`.
///
/// Returns `None` for subjects that are not recognised merge items.
fn parse_changelog_entry(item: &str) -> Option<(String, String, String)> {
let item = item.trim();
if item.is_empty() {
return None;
}
// Extract leading numeric ID present in both formats.
let id: String = item.chars().take_while(|c| c.is_ascii_digit()).collect();
if id.is_empty() {
return None;
}
// Detect format by the character immediately following the digits.
// id contains only ASCII digits so id.len() is a valid char boundary.
let rest = item.get(id.len()..).unwrap_or("");
if let Some(space_rest) = rest.strip_prefix(' ') {
// New format: "1063 story Human Name Here"
let mut words = space_rest.splitn(2, ' ');
let type_word = words.next().unwrap_or("story").to_string();
let name = words.next().unwrap_or("").trim().to_string();
if name.is_empty() {
return None;
}
Some((id, type_word, name))
} else if let Some(slug_rest) = rest.strip_prefix('_') {
// Legacy slug format: "1063_story_human_name_here"
let mut parts = slug_rest.splitn(2, '_');
let type_word = parts.next().unwrap_or("story").to_string();
let slug = parts.next().unwrap_or("").replace('_', " ");
if slug.is_empty() {
return None;
}
Some((id, type_word, slug))
} else {
None
}
}
/// Format a parsed entry as "Human Name (ID)".
fn format_entry(id: &str, name: &str) -> String {
let mut chars = name.chars();
let capitalised = match chars.next() {
None => String::new(),
Some(c) => c.to_uppercase().collect::<String>() + chars.as_str(),
};
format!("{capitalised} ({id})")
}
#[test]
fn changelog_new_format_story_resolves_to_name_and_id() {
let item = "1063 story Tee pipeline events into gateway context";
let (id, _type_word, name) = parse_changelog_entry(item).expect("should parse new format");
assert_eq!(id, "1063");
assert_eq!(
format_entry(&id, &name),
"Tee pipeline events into gateway context (1063)"
);
}
#[test]
fn changelog_new_format_bug_resolves_to_name_and_id() {
let item = "999 bug Fix the broken auth token";
let (id, type_word, name) = parse_changelog_entry(item).expect("should parse new-format bug");
assert_eq!(id, "999");
assert_eq!(type_word, "bug");
assert_eq!(format_entry(&id, &name), "Fix the broken auth token (999)");
}
#[test]
fn changelog_new_format_refactor_resolves_to_name_and_id() {
let item = "777 refactor Extract config parsing";
let (id, type_word, name) = parse_changelog_entry(item).expect("should parse refactor");
assert_eq!(type_word, "refactor");
assert_eq!(format_entry(&id, &name), "Extract config parsing (777)");
}
#[test]
fn changelog_legacy_slug_story_resolves_to_name_and_id() {
let item = "1063_story_tee_pipeline_events_into_gateway_context";
let (id, _type_word, name) = parse_changelog_entry(item).expect("should parse legacy slug");
assert_eq!(id, "1063");
assert_eq!(
format_entry(&id, &name),
"Tee pipeline events into gateway context (1063)"
);
}
#[test]
fn changelog_legacy_slug_bug_resolves_to_name_and_id() {
let item = "999_bug_fix_the_broken_auth_token";
let (id, type_word, name) = parse_changelog_entry(item).expect("should parse legacy bug slug");
assert_eq!(id, "999");
assert_eq!(type_word, "bug");
assert_eq!(format_entry(&id, &name), "Fix the broken auth token (999)");
}
#[test]
fn changelog_mixed_fixture_all_entries_have_human_names() {
// Fixture: a mix of legacy-slug and new-format subjects (as they appear
// after stripping the "huskies: merge " prefix from the git log).
let fixture = [
// Legacy slug formats (pre-migration)
"1001_story_add_matrix_transport",
"1002_bug_fix_crdt_sync_disconnect",
"1003_refactor_extract_gateway_config",
// New format (post-story-1069)
"1050 story Add agent pool auto-assign",
"1063 story Tee pipeline events into gateway context",
"1064 bug Stop lagged handler re-emitting via same channel",
"1065 refactor Move squash merge into own module",
];
for item in &fixture {
let result = parse_changelog_entry(item);
assert!(result.is_some(), "failed to parse merge subject: {item:?}");
let (id, _type_word, name) = result.unwrap();
let entry = format_entry(&id, &name);
// Every entry must contain the numeric ID in parentheses.
assert!(
entry.contains(&format!("({id})")),
"entry missing numeric ID: {entry:?}"
);
// Name must not be empty or just whitespace.
assert!(
!name.trim().is_empty(),
"empty human name for item: {item:?}"
);
// Name must not be a raw slug: legacy slugs use underscores as word
// separators, so any underscore left in the name means it was not decoded.
// (Hyphens, as in "auto-assign", are fine.)
assert!(
!name.contains('_'),
"name still contains underscores (slug not decoded): {name:?}"
);
}
}
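The tests above cover parsing only. The release script's `case "$type_word"` step, which sorts each entry into a changelog section, could be sketched in Rust as below. Both function names are hypothetical, and the section titles are assumed from the script's `FEATURES`, `FIXES` and `REFACTORS` variables rather than taken from any rendered changelog.

```rust
/// Hypothetical mirror of the release script's `case "$type_word"`:
/// bugs are fixes, refactors are refactors, and every other type word
/// (story, spike, epic, or anything unrecognised) counts as a feature.
pub fn changelog_section(type_word: &str) -> &'static str {
    match type_word {
        "bug" => "Fixes",
        "refactor" => "Refactors",
        _ => "Features",
    }
}

/// Build the section name and the "- Name (ID)" bullet for one parsed entry.
pub fn changelog_line(id: &str, type_word: &str, name: &str) -> (&'static str, String) {
    // Capitalise the first character, as the script does with `tr`.
    let mut chars = name.chars();
    let capitalised = match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    };
    (changelog_section(type_word), format!("- {capitalised} ({id})"))
}
```

Keying the section on the extracted type word instead of a `*_bug_*` glob is what lets the new space-separated subjects and the legacy slugs share one code path.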
+36
@@ -161,6 +161,42 @@ pub fn pipeline_stage(agent_name: &str) -> PipelineStage {
}
}
/// Map a pipeline [`Stage`] to the canonical [`PipelineStage`] for LLM agent spawning.
///
/// Returns `None` for stages where no LLM agent should be active: items not yet
/// started (upcoming, backlog), terminal states, blocked, frozen, review-hold, and
/// merge failures of any other kind, which require human intervention.
/// Returns `Some(stage)` naming the single LLM-agent type that may run on this story.
/// Used by `validate_agent_stage` and `reconcile_canonical_agents` to enforce the
/// one-agent-per-story invariant (story 1100).
pub fn canonical_pipeline_stage(s: &crate::pipeline_state::Stage) -> Option<PipelineStage> {
use crate::pipeline_state::{MergeFailureKind, Stage};
match s {
Stage::Coding { .. } => Some(PipelineStage::Coder),
Stage::Qa => Some(PipelineStage::Qa),
Stage::Merge { .. } => Some(PipelineStage::Mergemaster),
Stage::MergeFailure {
kind: MergeFailureKind::ConflictDetected(_),
..
} => Some(PipelineStage::Mergemaster),
Stage::MergeFailure {
kind: MergeFailureKind::GatesFailed(_),
..
} => Some(PipelineStage::Coder),
Stage::MergeFailureFinal { .. } => Some(PipelineStage::Mergemaster),
Stage::Upcoming
| Stage::Backlog
// Catch-all for any other `MergeFailureKind`: no agent, a human must step in.
| Stage::MergeFailure { .. }
| Stage::Done { .. }
| Stage::Blocked { .. }
| Stage::Archived { .. }
| Stage::Frozen { .. }
| Stage::ReviewHold { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
| Stage::Rejected { .. } => None,
}
}
/// Determine the pipeline stage for a configured agent.
///
/// Prefers the explicit `stage` config field (added in Bug 150) over the
@@ -569,14 +569,15 @@ mod tests {
);
}
// ── AC4: startup event replay + pool reconstruction ──────────────────
// ── AC4: startup reconcile + pool reconstruction ──────────────────
/// AC4: Simulates a server restart by seeding the CRDT with a story in
/// Coding stage, calling `replay_current_pipeline_state` (the new startup
/// path), then `auto_assign_available_work`. Asserts the pool ends in the
/// expected state: exactly one agent assigned to the story.
/// Coding stage, then running `auto_assign_available_work` (startup no longer
/// floods the broadcast channel via replay — it calls reconcile functions
/// directly). Asserts the pool ends in the expected state: exactly one agent
/// assigned to the story, and a second pass does not double-spawn.
#[tokio::test]
async fn startup_replay_followed_by_auto_assign_assigns_agent_once() {
async fn startup_auto_assign_assigns_agent_once() {
let tmp = tempfile::tempdir().unwrap();
let sk = tmp.path().join(".huskies");
std::fs::create_dir_all(&sk).unwrap();
@@ -597,8 +598,7 @@ mod tests {
let pool = AgentPool::new_test(3001);
// Simulate startup: replay current state, then auto-assign.
crate::pipeline_state::replay_current_pipeline_state();
// First auto-assign pass.
pool.auto_assign_available_work(tmp.path()).await;
let count_after_first = {
@@ -612,8 +612,7 @@ mod tests {
.count()
};
// AC3 (idempotency): replaying twice must not double-spawn agents.
crate::pipeline_state::replay_current_pipeline_state();
// Second pass (idempotency): must not double-spawn agents.
pool.auto_assign_available_work(tmp.path()).await;
let count_after_second = {
@@ -629,11 +628,11 @@ mod tests {
assert!(
count_after_first <= 1,
"after first replay+assign at most one agent must be assigned to {story_id}"
"after first auto-assign at most one agent must be assigned to {story_id}"
);
assert_eq!(
count_after_first, count_after_second,
"second replay must not spawn additional agents (idempotency)"
"second auto-assign must not spawn additional agents (idempotency)"
);
}
}
+20 -10
@@ -1,29 +1,39 @@
//! Backlog promotion: scan `1_backlog/` and promote stories whose `depends_on` are all met.
//! Backlog promotion: scan items in `Pipeline::Backlog` and promote stories whose `depends_on` are all met.
use crate::pipeline_state::Stage;
use crate::pipeline_state::Pipeline;
use crate::slog;
use crate::slog_warn;
use super::super::AgentPool;
use super::scan::scan_stage_items;
use super::story_checks::{check_archived_dependencies, has_unmet_dependencies};
impl AgentPool {
/// Scan `1_backlog/` and promote any story whose `depends_on` are all met.
/// Scan items in `Pipeline::Backlog` and promote any story whose `depends_on` are all met.
///
/// A story is only promoted if it explicitly lists `depends_on` AND every
/// listed dependency has reached `5_done` or `6_archived`. Stories with no
/// `depends_on` are left in the backlog for human scheduling.
/// listed dependency has reached `Pipeline::Done` or `Pipeline::Archived`.
/// Stories with no `depends_on` are left in the backlog for human scheduling.
///
/// **Archived dep semantics:** a dep in `6_archived` counts as satisfied (since
/// stories auto-sweep from `5_done` to `6_archived` after 4 hours, and the
/// **Archived dep semantics:** a dep in `Pipeline::Archived` counts as satisfied
/// (since stories auto-sweep from `Done` to `Archived` after 4 hours, and the
/// dependent story would normally already be promoted by then). However, if a
/// dep was already in `6_archived` when the dependent story was created (e.g. it
/// dep was already archived when the dependent story was created (e.g. it
/// was abandoned/superseded before the dependent existed), a prominent warning is
/// logged so the user can see the promotion was triggered by an archived dep, not
/// a clean completion.
pub(super) fn promote_ready_backlog_stories(&self) {
let items = scan_stage_items(&Stage::Backlog);
// Story 1086: scan by Pipeline column, not Stage variant. Pipeline::Backlog
// covers Stage::Upcoming and Stage::Backlog uniformly.
let items: Vec<String> = {
use std::collections::BTreeSet;
let mut ids = BTreeSet::new();
for item in crate::pipeline_state::read_all_typed() {
if item.stage.pipeline() == Pipeline::Backlog {
ids.insert(item.story_id.0.clone());
}
}
ids.into_iter().collect()
};
for story_id in &items {
// Only promote stories that explicitly declare dependencies
// (story 929: read from the CRDT register, not YAML).
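The promotion rule described in the doc comment above can be sketched in isolation. This is a minimal illustration, not the project's code: `Column`, `should_promote`, and the lookup map are hypothetical stand-ins for `Pipeline`, the CRDT register, and `has_unmet_dependencies`. Treating an unknown dependency id as unmet is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the pipeline column of a story.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Column {
    Backlog,
    Current,
    Done,
    Archived,
}

/// Returns true when a backlog story should be promoted: it must declare at
/// least one dependency, and every dependency must sit in Done or Archived.
/// Dependency ids missing from `columns` count as unmet (sketch assumption).
fn should_promote(depends_on: &[&str], columns: &HashMap<&str, Column>) -> bool {
    !depends_on.is_empty()
        && depends_on
            .iter()
            .all(|dep| matches!(columns.get(dep), Some(Column::Done | Column::Archived)))
}
```

A story with an empty `depends_on` is never promoted by this rule, which matches the "left in the backlog for human scheduling" behaviour described above.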
@@ -13,7 +13,7 @@ use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use crate::pipeline_state::{MergeFailureKind, PipelineEvent, Stage, StoryId};
use crate::pipeline_state::{MergeFailureKind, PipelineEvent, Stage, Status, StoryId};
use crate::slog;
use crate::slog_warn;
@@ -21,6 +21,15 @@ use super::super::super::PipelineStage;
use super::super::AgentPool;
use super::scan::is_story_assigned_for_stage;
/// Reconcile: no-op for the merge-failure block subscriber.
///
/// The block subscriber maintains an in-memory per-story consecutive-failure counter
/// that cannot be reconstructed from CRDT state alone (only the current stage is
/// stored, not the history of how many times each story failed). Eventual consistency
/// is guaranteed by the live subscriber reacting to each new `MergeFailure` event;
/// the periodic reconciler cannot add value here without risking spurious blocks.
pub(crate) fn reconcile_merge_failure_block() {}
/// Spawn a background task that blocks stories after N consecutive `MergeFailure` transitions.
///
/// Subscribes to the pipeline transition broadcast channel and tracks a per-story
@@ -86,6 +95,13 @@ fn on_transition(
counters: &mut HashMap<StoryId, (u32, MergeFailureKind)>,
recovery_running: bool,
) {
// Story 1086: gate on the typed `Status` projection — `Status::MergeFailure`
// is precisely the set of stages we count toward the block threshold. We
// still need the variant pattern below to read `kind`.
if fired.after.status() != Status::MergeFailure {
counters.remove(&fired.story_id);
return;
}
match &fired.after {
Stage::MergeFailure { kind, .. } => {
if recovery_running {
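The per-story consecutive-failure counter can be sketched as a small pure function. The names `record_transition` and `BLOCK_THRESHOLD` are hypothetical; the real threshold is not visible in this hunk, and the real counter also stores the `MergeFailureKind`, which is omitted here.

```rust
use std::collections::HashMap;

/// Hypothetical threshold; the real value is not shown in this diff.
const BLOCK_THRESHOLD: u32 = 3;

/// Records one transition for `story_id`. Returns true when the story has now
/// failed `BLOCK_THRESHOLD` times in a row and should be blocked.
fn record_transition(
    counters: &mut HashMap<String, u32>,
    story_id: &str,
    is_merge_failure: bool,
) -> bool {
    if !is_merge_failure {
        // Any non-failure transition breaks the streak, mirroring the
        // `counters.remove(..)` early return in the hunk above.
        counters.remove(story_id);
        return false;
    }
    let count = counters.entry(story_id.to_string()).or_insert(0);
    *count += 1;
    *count >= BLOCK_THRESHOLD
}
```

Because the streak lives only in memory, it cannot be rebuilt from the current stage alone, which is the reason the reconcile function for this subscriber is a no-op.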
@@ -9,7 +9,7 @@
use std::path::{Path, PathBuf};
use std::sync::Arc;
use crate::pipeline_state::{MergeFailureKind, Stage};
use crate::pipeline_state::{MergeFailureKind, Stage, Status};
use crate::slog;
use crate::slog_warn;
@@ -17,6 +17,35 @@ use super::super::super::PipelineStage;
use super::super::AgentPool;
use super::scan::{find_free_agent_for_stage, is_story_assigned_for_stage};
/// Reconcile: for each story currently in `MergeFailure { kind: ConflictDetected }`,
/// ensure a mergemaster agent is running.
///
/// Idempotent — `on_merge_failure_transition` guards against double-spawning via
/// `is_story_assigned_for_stage`. Called by the periodic reconciler so that a Lagged
/// startup event never leaves a ConflictDetected story without a recovery agent.
pub(crate) async fn reconcile_merge_failure(pool: &Arc<AgentPool>, project_root: &Path) {
use crate::pipeline_state::{MergeFailureKind, PipelineEvent, Stage, TransitionFired};
for item in crate::pipeline_state::read_all_typed() {
// Story 1086: scan via the Status projection; the variant pattern is
// still needed to read `kind`.
if item.stage.status() != Status::MergeFailure {
continue;
}
if let Stage::MergeFailure { ref kind, .. } = item.stage
&& matches!(kind, MergeFailureKind::ConflictDetected(_))
{
let fired = TransitionFired {
story_id: item.story_id.clone(),
before: item.stage.clone(),
after: item.stage.clone(),
event: PipelineEvent::MergeFailed { kind: kind.clone() },
at: chrono::Utc::now(),
};
on_merge_failure_transition(pool, project_root, &fired).await;
}
}
}
/// Spawn a background task that auto-spawns mergemaster agents on
/// `Stage::MergeFailure { kind: ConflictDetected(_) }` transitions.
///
@@ -49,6 +78,11 @@ async fn on_merge_failure_transition(
project_root: &Path,
fired: &crate::pipeline_state::TransitionFired,
) {
// Story 1086: gate on the typed `Status` projection first; only the
// `MergeFailure` kind extraction needs the variant pattern.
if fired.after.status() != Status::MergeFailure {
return;
}
let Stage::MergeFailure { ref kind, .. } = fired.after else {
return;
};
@@ -17,7 +17,11 @@ pub(crate) mod watchdog;
// so that pool::lifecycle and pool::pipeline continue to access them unchanged.
pub(super) use scan::{find_free_agent_for_stage, is_agent_free};
/// Re-export for `startup::tick_loop`.
pub(crate) use merge_failure_block_subscriber::reconcile_merge_failure_block;
/// Re-export for `startup::tick_loop`.
pub(crate) use merge_failure_block_subscriber::spawn_merge_failure_block_subscriber;
/// Re-export for `startup::tick_loop`.
pub(crate) use merge_failure_subscriber::reconcile_merge_failure;
/// Re-export for `startup::tick_loop`.
pub(crate) use merge_failure_subscriber::spawn_merge_failure_subscriber;
@@ -187,13 +187,14 @@ pub(super) fn check_agent_limits(
),
};
// Mark agent as Failed with termination reason.
if let Ok(mut lock) = agents.lock()
&& let Some(agent) = lock.get_mut(key)
{
agent.status = AgentStatus::Failed;
agent.termination_reason = Some(reason.clone());
}
// NOTE: agent status is intentionally NOT updated here. Setting
// `status = Failed` before the kill (the previous behaviour)
// opened a window where the `start_agent` idempotency check
// (which whitelists Running/Pending) would let a fresh spawn
// through while the prior PTY child was still alive — directly
// causing the concurrent-agents bug we hit on story 1086
// (2026-05-15). The caller (`run_watchdog_pass`) is responsible
// for: (1) verifying the kill, (2) THEN updating the agent record.
slog!("[watchdog] Terminating agent '{key}': {reason_str}.");
@@ -9,8 +9,11 @@ mod tests;
use std::path::Path;
use crate::agents::AgentStatus;
use crate::config::ProjectConfig;
use crate::process_kill::{pids_matching, sigkill_pids_and_verify};
use crate::slog;
use crate::slog_warn;
use super::super::AgentPool;
use limits::check_agent_limits;
@@ -42,14 +45,64 @@ impl AgentPool {
if let Some(root) = project_root {
let terminated = check_agent_limits(&self.agents, root);
let config = ProjectConfig::load(root).unwrap_or_default();
for (key, _reason) in &terminated {
// Kill the PTY child and abort the task, same as stop_agent.
self.kill_child_for_key(key);
for (key, reason) in &terminated {
// Step 1: snapshot the agent's worktree path so we can find every
// process running in it (claude + any subprocesses). This must
// happen BEFORE we mutate the agent record so we can read the
// worktree info safely.
let worktree_path = self.agents.lock().ok().and_then(|lock| {
lock.get(key)
.and_then(|a| a.worktree_info.as_ref().map(|wt| wt.path.clone()))
});
// Step 2: SIGKILL every process running in the worktree and
// BLOCK until verified gone. The previous mechanism — portable_pty's
// `ChildKiller::kill()` — sends SIGHUP, which claude-code
// ignores, leaving the process alive while the agent record
// was being marked terminated; that gap let a fresh spawn race
// in alongside the surviving one. SIGKILL is uncatchable;
// [`sigkill_pids_and_verify`] only returns once the kernel has
// reaped each pid.
if let Some(wt_path) = worktree_path.as_ref() {
let pids = pids_matching(&wt_path.display().to_string());
if pids.is_empty() {
// Nothing in this worktree — agent likely already
// exited on its own before the watchdog noticed.
} else {
match sigkill_pids_and_verify(&pids) {
Ok(n) => slog!(
"[watchdog] SIGKILL'd {n} process(es) in worktree {} for '{key}'.",
wt_path.display()
),
Err(survivors) => slog_warn!(
"[watchdog] SIGKILL incomplete for '{key}': pids still alive: {survivors:?}. \
Proceeding with cleanup; concurrent spawn protection may be weakened."
),
}
}
} else {
slog_warn!(
"[watchdog] No worktree path recorded for '{key}'; cannot tree-kill, \
falling back to portable_pty SIGHUP (likely no-op for claude-code)."
);
self.kill_child_for_key(key);
}
// Step 3: NOW update the agent record. The process is verified
// gone (or we logged that SIGKILL didn't take effect, which is
// exceptional), so flipping status away from Running can no
// longer open a window for a concurrent spawn.
if let Ok(mut lock) = self.agents.lock()
&& let Some(agent) = lock.get_mut(key)
&& let Some(handle) = agent.task_handle.take()
{
handle.abort();
agent.status = AgentStatus::Failed;
agent.termination_reason = Some(reason.clone());
if let Some(handle) = agent.task_handle.take() {
// Best-effort abort of the outer tokio task. The PTY
// blocking thread already returned (claude is dead),
// so this is bookkeeping rather than load-bearing.
handle.abort();
}
}
// Use the retry mechanism: increment retry_count and only block
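The ordering that steps 1 to 3 enforce (kill and verify first, update the record afterwards) can be sketched without real processes. `AgentRecord`, `terminate_agent`, and the closure parameter are hypothetical; the `Result<usize, Vec<u32>>` shape only mirrors how `sigkill_pids_and_verify` is matched on above and is an assumption about its signature.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentStatus {
    Running,
    Failed,
}

/// Hypothetical, trimmed-down agent record.
struct AgentRecord {
    status: AgentStatus,
    termination_reason: Option<String>,
}

/// Runs `kill_and_verify` while the record still says `Running`, then flips
/// the status. A spawn check that rejects Running agents therefore keeps
/// rejecting until the kill attempt has finished.
fn terminate_agent<F>(
    record: &mut AgentRecord,
    reason: &str,
    kill_and_verify: F,
) -> Result<usize, Vec<u32>>
where
    F: FnOnce(&AgentRecord) -> Result<usize, Vec<u32>>,
{
    // Step 1: kill (and block until verified) before touching the record.
    let outcome = kill_and_verify(record);
    // Step 2: only now mark the agent as failed, even if some pids survived.
    record.status = AgentStatus::Failed;
    record.termination_reason = Some(reason.to_string());
    outcome
}
```

Passing the kill step as a closure keeps the ordering testable: the closure can assert that it still observes `Running` when it runs.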
@@ -9,10 +9,19 @@
use std::path::{Path, PathBuf};
use crate::pipeline_state::Stage;
use crate::pipeline_state::{Pipeline, Stage, Status};
use crate::slog;
use crate::slog_warn;
/// Reconcile: re-populate the CostRollup register from disk for all known stories.
///
/// Idempotent — `init_from_disk` scans all existing token-usage JSONL files and
/// overwrites the in-memory register. Called by the periodic reconciler so that
/// a Lagged event can never leave a story with a stale or absent cost entry.
pub(crate) fn reconcile_cost_rollup(project_root: &Path) {
crate::service::agents::cost_rollup::init_from_disk(project_root);
}
/// Spawn a background task that maintains the CostRollup register.
///
/// On every terminal stage transition (Done, Archived, Abandoned, Superseded,
@@ -41,17 +50,15 @@ pub(crate) fn spawn_cost_rollup_subscriber(project_root: PathBuf) {
/// Returns `true` if `stage` is a terminal pipeline stage.
///
/// Terminal stages are those from which no further work is expected:
/// Done, Archived, Abandoned, Superseded, Rejected.
/// MergeFailure variants are NOT terminal — stories can recover from them.
/// Done, Archived, Abandoned, Superseded, Rejected. Story 1086 routes the
/// classification through the [`Status`] / [`Pipeline`] projection so future
/// Stage variants automatically participate. MergeFailure variants are NOT
/// terminal — stories can recover from them.
fn is_terminal(stage: &Stage) -> bool {
matches!(
stage,
Stage::Done { .. }
| Stage::Archived { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
| Stage::Rejected { .. }
)
stage.status(),
Status::Done | Status::Abandoned | Status::Superseded | Status::Rejected
) || matches!(stage.pipeline(), Pipeline::Archived)
}
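To illustrate classifying through projections rather than raw variants, here is a minimal sketch. All three enums are hypothetical, data-free stand-ins, and the `status()` / `pipeline()` mappings are invented for illustration. In particular, giving archived stories their own `Status::Archived` is an assumption made only so that the `Pipeline` check is exercised.

```rust
/// Hypothetical, data-free stand-in for the real `Stage` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Coding,
    MergeFailure,
    Done,
    Archived,
    Abandoned,
    Superseded,
    Rejected,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Active,
    MergeFailure,
    Done,
    Archived, // assumption: not necessarily a variant of the real `Status`
    Abandoned,
    Superseded,
    Rejected,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Pipeline {
    Current,
    Done,
    Archived,
}

impl Stage {
    /// Invented mapping from stage to status.
    fn status(self) -> Status {
        match self {
            Stage::Coding => Status::Active,
            Stage::MergeFailure => Status::MergeFailure,
            Stage::Done => Status::Done,
            Stage::Archived => Status::Archived,
            Stage::Abandoned => Status::Abandoned,
            Stage::Superseded => Status::Superseded,
            Stage::Rejected => Status::Rejected,
        }
    }

    /// Invented mapping from stage to board column.
    fn pipeline(self) -> Pipeline {
        match self {
            Stage::Done => Pipeline::Done,
            Stage::Archived => Pipeline::Archived,
            _ => Pipeline::Current,
        }
    }
}

/// Same shape as the new `is_terminal`: status check OR archived column.
fn is_terminal(stage: Stage) -> bool {
    matches!(
        stage.status(),
        Status::Done | Status::Abandoned | Status::Superseded | Status::Rejected
    ) || matches!(stage.pipeline(), Pipeline::Archived)
}
```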
/// Snapshot the cost data for `fired.story_id` into the register when
@@ -18,7 +18,6 @@ mod test_helpers;
use crate::io::watcher::WatcherEvent;
use crate::service::status::StatusBroadcaster;
use portable_pty::ChildKiller;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::broadcast;
@@ -31,10 +30,6 @@ use types::{StoryAgent, composite_key};
pub struct AgentPool {
agents: Arc<Mutex<HashMap<String, StoryAgent>>>,
port: u16,
/// Registry of active PTY child process killers, keyed by "{story_id}:{agent_name}".
/// Used to terminate child processes on server shutdown or agent stop, preventing
/// orphaned Claude Code processes from running after the server exits.
child_killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
/// Broadcast channel for notifying WebSocket clients of agent state changes.
/// When an agent transitions state (Pending, Running, Completed, Failed, Stopped),
/// an `AgentStateChanged` event is emitted so the frontend can refresh the
@@ -56,7 +51,6 @@ impl AgentPool {
let pool = Self {
agents: Arc::new(Mutex::new(HashMap::new())),
port,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx: watcher_tx.clone(),
status_broadcaster: Arc::new(StatusBroadcaster::new()),
};
@@ -33,7 +33,6 @@ pub(crate) fn spawn_pipeline_advance(
let pool = AgentPool {
agents,
port,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx,
status_broadcaster: Arc::new(crate::service::status::StatusBroadcaster::new()),
};
@@ -78,21 +78,34 @@ impl AgentPool {
// The coder exited with uncommitted content but no commits
// (typical "claude-code session boundary mid-sweep" pattern).
// Use a PROGRESS-AWARE retry cap: the agent gets unlimited
// respawns as long as file edits keep growing between
// attempts; only when the worktree diff is byte-identical
// to the previous attempt do we count it as "no progress".
// After NO_PROGRESS_CAP consecutive no-progress respawns,
// block for human attention.
// respawns as long as progress is being made between attempts.
// Progress is satisfied if EITHER (a) the worktree diff grew,
// OR (b) the set of files the agent read grew. Raw tool-call
// count does NOT count — a looping agent can produce many calls.
// Only self-exited sessions with no file or read progress count
// toward the cap; forced exits (API error, network, budget
// exhaustion) are excluded (story 1089).
// After NO_PROGRESS_CAP consecutive qualifying no-progress
// respawns, block for human attention.
//
// TOTAL_ATTEMPTS_CAP is the OUTER bound: even if the agent
// keeps making file-edit progress every session, after this
// many total respawns without a commit we escalate — caught
// the "agent flaps between different edits but never
// commits" pattern that the progress-aware counter would
// never trigger.
// many total respawns without a commit we escalate — catches
// the "agent flaps between different edits but never commits"
// pattern that the progress-aware counter would never trigger.
const NO_PROGRESS_CAP: u32 = 3;
const TOTAL_ATTEMPTS_CAP: u32 = 8;
// AC1: consume the forced-exit flag written by spawn.rs when
// the agent process exited with a non-zero code.
let forced_exit = crate::db::read_content(
crate::db::ContentKey::CommitRecoveryForcedExit(story_id),
)
.is_some();
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryForcedExit(
story_id,
));
let current_fingerprint = worktree_path.as_deref().and_then(|p| {
std::process::Command::new("git")
.args(["diff", "master"])
@@ -104,18 +117,31 @@ impl AgentPool {
let stored_fingerprint = crate::db::read_content(
crate::db::ContentKey::CommitRecoveryDiffFingerprint(story_id),
);
let made_progress = current_fingerprint.is_some()
let diff_progress = current_fingerprint.is_some()
&& stored_fingerprint.as_ref() != current_fingerprint.as_ref();
let no_progress_count = if made_progress || stored_fingerprint.is_none() {
// AC2: check read-file set progress as an additional signal.
let read_progress = previous_session_id.as_deref().is_some_and(|session_id| {
collect_read_progress(&project_root, story_id, agent_name, session_id)
});
let made_progress = diff_progress || read_progress;
let prev_no_progress_count = crate::db::read_content(
crate::db::ContentKey::CommitRecoveryPending(story_id),
)
.and_then(|s| s.trim().parse::<u32>().ok())
.unwrap_or(0);
// AC1: forced exits do not increment the stuck-respawn counter.
let no_progress_count = if forced_exit {
prev_no_progress_count
} else if made_progress || stored_fingerprint.is_none() {
1
} else {
crate::db::read_content(crate::db::ContentKey::CommitRecoveryPending(
story_id,
))
.and_then(|s| s.trim().parse::<u32>().ok())
.unwrap_or(0)
+ 1
prev_no_progress_count + 1
};
let total_attempts = crate::db::read_content(
crate::db::ContentKey::CommitRecoveryTotalAttempts(story_id),
)
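The counter update in this hunk reduces to a three-way decision, sketched here as a pure function. `next_no_progress_count` and its parameter names are hypothetical; the branch values follow the hunk as shown, including the reset to `1` when progress was made or no fingerprint was stored yet.

```rust
/// Computes the next stuck-respawn counter value.
///
/// * `prev` is the previously stored counter (0 if absent).
/// * `forced_exit` is true when the session ended involuntarily.
/// * `made_progress` is true when the diff or the read set grew.
/// * `has_stored_fingerprint` is false on the first recovery attempt.
fn next_no_progress_count(
    prev: u32,
    forced_exit: bool,
    made_progress: bool,
    has_stored_fingerprint: bool,
) -> u32 {
    if forced_exit {
        // Forced exits neither increment nor reset the counter.
        prev
    } else if made_progress || !has_stored_fingerprint {
        // Progress (or a first attempt) restarts the streak.
        1
    } else {
        // A self-exit with no progress extends the streak.
        prev + 1
    }
}
```

Checking `forced_exit` first means an API or network failure leaves the streak exactly where it was, regardless of whether the diff happened to change.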
@@ -136,13 +162,17 @@ impl AgentPool {
crate::db::delete_content(
crate::db::ContentKey::CommitRecoveryTotalAttempts(story_id),
);
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(
story_id,
));
slog!(
"[pipeline] Coder '{agent_name}' for '{story_id}' hit total \
commit-recovery cap ({total_attempts}/{TOTAL_ATTEMPTS_CAP}) \
without a commit. Blocking story."
);
let reason = format!(
"agent flapped — {total_attempts} respawns without ever committing"
"commit absent after {total_attempts} respawns: \
agent kept making edits but never committed"
);
if let Err(e) =
crate::agents::lifecycle::transition_to_blocked(story_id, &reason)
@@ -167,14 +197,18 @@ impl AgentPool {
crate::db::delete_content(
crate::db::ContentKey::CommitRecoveryTotalAttempts(story_id),
);
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(
story_id,
));
slog!(
"[pipeline] Coder '{agent_name}' for '{story_id}' made no \
file-edit progress over {no_progress_count} consecutive \
commit-recovery respawns. Blocking story."
file or read progress over {no_progress_count} consecutive \
self-exit commit-recovery respawns. Blocking story."
);
// AC4: block message names the specific cause.
let reason = format!(
"agent stuck — {no_progress_count} respawns without commits or \
new file edits"
"stuck-respawn cap reached: {NO_PROGRESS_CAP} consecutive \
self-exits with no file or read progress"
);
if let Err(e) =
crate::agents::lifecycle::transition_to_blocked(story_id, &reason)
@@ -206,7 +240,8 @@ impl AgentPool {
"[pipeline] Coder '{agent_name}' exited with uncommitted work \
for '{story_id}' (no-progress {no_progress_count}/\
{NO_PROGRESS_CAP}, total {total_attempts}/\
{TOTAL_ATTEMPTS_CAP}; progress_made={made_progress}). \
{TOTAL_ATTEMPTS_CAP}; diff_progress={diff_progress}, \
read_progress={read_progress}, forced_exit={forced_exit}). \
Issuing commit-only respawn."
);
let addendum = "\n\nYou have uncommitted work in this worktree. \
@@ -302,10 +337,13 @@ impl AgentPool {
});
}
} else if completion.gates_passed {
// Clear any stale recovery key when the coder succeeds normally.
// Clear any stale recovery keys when the coder succeeds normally.
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryPending(
story_id,
));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(
story_id,
));
// Determine effective QA mode for this story.
let qa_mode = {
let item_type = crate::agents::lifecycle::item_type_from_id(story_id);
@@ -361,11 +399,14 @@ impl AgentPool {
}
}
} else {
// Clear any stale recovery key when gates fail normally (agent committed
// Clear any stale recovery keys when gates fail normally (agent committed
// but the build is broken — treat as a standard retry, not a recovery).
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryPending(
story_id,
));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(
story_id,
));
// Bug 645 / 668: Before retry/block, check if the agent left committed
// work AND the agent had a passing run_tests result captured during its
// session. An agent may crash mid-output (e.g. Claude Code CLI PTY write
@@ -724,6 +765,109 @@ mod helpers;
use helpers::{resolve_qa_mode_from_store, write_review_hold_to_store};
pub(crate) use helpers::{should_block_story, spawn_pipeline_advance};
/// Parse a huskies agent log and return the set of file paths passed to the
/// Read tool in that session. Returns an empty set if the log cannot be read.
///
/// Used by [`collect_read_progress`] to detect read-exploration progress even
/// when the worktree diff did not grow (story 1089, AC2).
fn collect_read_files_from_log(
project_root: &std::path::Path,
story_id: &str,
agent_name: &str,
session_id: &str,
) -> std::collections::HashSet<String> {
let log_path = crate::agent_log::log_file_path(project_root, story_id, agent_name, session_id);
let mut files = std::collections::HashSet::new();
let log_text = match std::fs::read_to_string(&log_path) {
Ok(t) => t,
Err(_) => return files,
};
for line in log_text.lines() {
let trimmed = line.trim();
if trimmed.is_empty() {
continue;
}
let entry: serde_json::Value = match serde_json::from_str(trimmed) {
Ok(v) => v,
Err(_) => continue,
};
// Only look at agent_json events where data.type == "assistant".
if entry.get("type").and_then(|t| t.as_str()) != Some("agent_json") {
continue;
}
let data = match entry.get("data") {
Some(d) => d,
None => continue,
};
if data.get("type").and_then(|t| t.as_str()) != Some("assistant") {
continue;
}
let content = match data.pointer("/message/content").and_then(|c| c.as_array()) {
Some(c) => c,
None => continue,
};
for item in content {
if item.get("type").and_then(|t| t.as_str()) != Some("tool_use") {
continue;
}
if item.get("name").and_then(|n| n.as_str()) != Some("Read") {
continue;
}
if let Some(path) = item.pointer("/input/file_path").and_then(|p| p.as_str()) {
files.insert(path.to_string());
}
}
}
files
}
/// Return `true` if the agent read any files in `session_id` that were not in
/// the cumulative read set for `story_id`. Updates the stored cumulative set
/// when new files are found (story 1089, AC2).
fn collect_read_progress(
project_root: &std::path::Path,
story_id: &str,
agent_name: &str,
session_id: &str,
) -> bool {
let session_files = collect_read_files_from_log(project_root, story_id, agent_name, session_id);
if session_files.is_empty() {
return false;
}
let stored_set: std::collections::HashSet<String> =
crate::db::read_content(crate::db::ContentKey::CommitRecoveryReadSet(story_id))
.map(|s| {
s.lines()
.filter(|l| !l.is_empty())
.map(str::to_string)
.collect()
})
.unwrap_or_default();
let union: std::collections::HashSet<String> =
stored_set.union(&session_files).cloned().collect();
if union.len() > stored_set.len() {
let mut sorted: Vec<&String> = union.iter().collect();
sorted.sort();
crate::db::write_content(
crate::db::ContentKey::CommitRecoveryReadSet(story_id),
&sorted
.into_iter()
.map(String::as_str)
.collect::<Vec<_>>()
.join("\n"),
);
true
} else {
false
}
}
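The read-set comparison in `collect_read_progress` can be sketched without the database layer. `merge_read_set` is a hypothetical helper that takes the stored newline-separated set and the session's paths, and returns the new serialised set only when it grew. It swaps the `HashSet` plus explicit sort for a `BTreeSet`, which keeps entries ordered on its own.

```rust
use std::collections::BTreeSet;

/// Merges `session_files` into the newline-separated `stored` set.
/// Returns `Some(serialised_union)` when at least one new path was added
/// (read progress), or `None` when the session read nothing new.
fn merge_read_set(stored: &str, session_files: &[&str]) -> Option<String> {
    let stored_set: BTreeSet<&str> = stored.lines().filter(|l| !l.is_empty()).collect();
    let mut union = stored_set.clone();
    union.extend(session_files.iter().copied());
    if union.len() > stored_set.len() {
        // BTreeSet iterates in sorted order, so the output is deterministic.
        Some(union.into_iter().collect::<Vec<_>>().join("\n"))
    } else {
        None
    }
}
```

Comparing set sizes rather than counting tool calls is what keeps a looping agent, which re-reads the same files, from registering as progress.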
#[cfg(test)]
mod tests;
#[cfg(test)]
@@ -1077,7 +1077,7 @@ stage = "coder"
"Story must be blocked after NO_PROGRESS_CAP consecutive no-progress respawns"
);
assert!(
block_reason.contains("without commits or new file edits"),
block_reason.contains("self-exits with no file or read progress"),
"Block reason should describe the no-progress condition, got: {block_reason}"
);
@@ -1193,7 +1193,7 @@ stage = "coder"
"Story must be blocked once total commit-recovery attempts hits the outer cap"
);
assert!(
block_reason.contains("flapped") && block_reason.contains("without ever committing"),
block_reason.contains("commit absent") && block_reason.contains("never committed"),
"Block reason should describe the flapping pattern, got: {block_reason}"
);
@@ -111,7 +111,6 @@ impl AgentPool {
let pool_clone = Self {
agents: Arc::clone(&self.agents),
port: self.port,
child_killers: Arc::clone(&self.child_killers),
watcher_tx: self.watcher_tx.clone(),
status_broadcaster: Arc::clone(&self.status_broadcaster),
};
@@ -74,25 +74,11 @@ pub(in crate::agents::pool) async fn run_server_owned_completion(
// Kill any in-flight cargo test processes for this worktree so they don't
// hold the build lock while gates try to run.
if let Some(wt_path) = worktree_path.as_ref()
&& let Ok(output) = std::process::Command::new("pgrep")
.args([
"-f",
&format!("--manifest-path {}/Cargo.toml", wt_path.display()),
])
.output()
{
let pids = String::from_utf8_lossy(&output.stdout);
for pid_str in pids.lines() {
if let Ok(pid) = pid_str.trim().parse::<i32>() {
crate::slog!(
"[agents] Killing stale cargo process (pid {pid}) for '{story_id}' before running gates"
);
unsafe {
libc::kill(pid, libc::SIGKILL);
}
}
}
if let Some(wt_path) = worktree_path.as_ref() {
let pattern = format!("--manifest-path {}/Cargo.toml", wt_path.display());
let _ = crate::process_kill::sigkill_pids_and_verify(&crate::process_kill::pids_matching(
&pattern,
));
}
// Run acceptance gates. Third element of the tuple is `needs_commit_recovery`:
@@ -18,7 +18,6 @@ impl AgentPool {
let pool = Arc::new(Self {
agents: Arc::clone(&self.agents),
port: self.port,
child_killers: Arc::clone(&self.child_killers),
watcher_tx: self.watcher_tx.clone(),
status_broadcaster: Arc::clone(&self.status_broadcaster),
});
@@ -186,6 +186,50 @@ impl AgentPool {
.map(|k| k.is_self_evident_fix())
.unwrap_or(false);
// Bug 1101 diagnostic: log the classified failure_kind and the
// matched classifier-trigger substring with surrounding context,
// so we can confirm whether classify() is incorrectly matching
// a passing-step stdout substring (e.g. "Diff in " inside a
// failing test's panic message) and bouncing the story to a
// fixup coder. Remove once the fix lands.
if let Ok(r) = report.as_ref()
&& let crate::agents::merge::MergeResult::GateFailure {
output: gate_output,
failure_kind: Some(k),
} = &r.result
{
const TRIGGERS: &[&str] = &[
"CONFLICT (content):",
"Merge conflict:",
"Diff in ",
"would reformat",
"missing-docs direction",
"error[clippy::",
"warning[clippy::",
"missing_doc_comments",
"error[E",
];
let matched = TRIGGERS
.iter()
.find_map(|t| gate_output.find(t).map(|i| (*t, i)));
let (trigger, context) = match matched {
Some((t, i)) => {
let start = i.saturating_sub(30);
let end = (i + t.len() + 60).min(gate_output.len());
let ctx = gate_output
.get(start..end)
.unwrap_or("<context unavailable>")
.replace('\n', " ");
(Some(t), ctx)
}
None => (None, String::from("<no trigger matched>")),
};
slog!(
"[merge] classify diagnostic for '{sid}': failure_kind={k:?} \
is_fixup={is_fixup} trigger={trigger:?} context='{context}'"
);
}
if is_no_commits {
let reason = kind.display_reason();
if let Err(e) = crate::agents::lifecycle::transition_to_blocked(&sid, &reason) {
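The context extraction around a matched trigger can be sketched as a standalone helper. `context_window` is a hypothetical name. Unlike the diagnostic above, which falls back to `<context unavailable>` when a byte offset splits a multi-byte character, this variant widens the window to the nearest character boundaries. It assumes `idx` comes from `str::find` on the same text.

```rust
/// Returns up to `before` bytes preceding and `after` bytes following the
/// match at `idx..idx + match_len`, with newlines flattened to spaces.
/// Assumes `idx + match_len <= text.len()` (e.g. `idx` came from `text.find`).
fn context_window(
    text: &str,
    idx: usize,
    match_len: usize,
    before: usize,
    after: usize,
) -> String {
    let mut start = idx.saturating_sub(before);
    let mut end = (idx + match_len + after).min(text.len());
    // Widen outward until both ends sit on UTF-8 character boundaries.
    // Index 0 and `text.len()` are always boundaries, so both loops terminate.
    while !text.is_char_boundary(start) {
        start -= 1;
    }
    while !text.is_char_boundary(end) {
        end += 1;
    }
    text[start..end].replace('\n', " ")
}
```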
@@ -1,5 +1,20 @@
//! Process management — kills orphaned PTY child processes on server shutdown.
//!
//! As of story 1090 (2026-05-15), all process termination in this module uses
//! [`crate::process_kill::sigkill_pids_and_verify`] — SIGHUP-based killing via
//! `portable_pty::ChildKiller` has been removed entirely from the server.
//!
//! ## History
//!
//! Prior to commit `fe9804b3`, the watchdog and all kill paths sent SIGHUP via
//! `portable_pty::ChildKiller::kill()`. Claude Code ignores SIGHUP, so agents
//! survived "kills" and ran concurrently with their replacements — the root cause
//! of the 2026-05-15 duplicate-spawn incident. `fe9804b3` migrated the watchdog;
//! story 1090 completes the migration by rewriting `kill_all_children` and
//! `kill_child_for_key` (this file) to use `pids_matching` + `sigkill_pids_and_verify`.
use crate::process_kill::{pids_matching, sigkill_pids_and_verify};
use crate::slog;
use crate::slog_warn;
use super::AgentPool;
@@ -7,53 +22,97 @@ impl AgentPool {
/// Kill all active PTY child processes.
///
/// Called on server shutdown to prevent orphaned Claude Code processes from
/// continuing to run after the server exits. Each registered killer is called
/// once, then the registry is cleared.
/// continuing to run after the server exits. Collects each agent's worktree
/// path, then SIGKILLs every process running inside that path and verifies
/// termination before returning.
pub fn kill_all_children(&self) {
if let Ok(mut killers) = self.child_killers.lock() {
for (key, killer) in killers.iter_mut() {
slog!("[agents] Killing child process for {key} on shutdown");
let _ = killer.kill();
let worktree_paths: Vec<(String, std::path::PathBuf)> = {
let Ok(agents) = self.agents.lock() else {
return;
};
agents
.iter()
.filter_map(|(key, agent)| {
agent
.worktree_info
.as_ref()
.map(|wt| (key.clone(), wt.path.clone()))
})
.collect()
};
for (key, path) in worktree_paths {
let pattern = path.display().to_string();
let pids = pids_matching(&pattern);
if pids.is_empty() {
slog!(
"[agents] No processes found in worktree {} for '{key}' on shutdown",
path.display()
);
continue;
}
match sigkill_pids_and_verify(&pids) {
Ok(n) => slog!(
"[agents] SIGKILL'd {n} process(es) in worktree {} for '{key}' on shutdown",
path.display()
),
Err(survivors) => slog_warn!(
"[agents] SIGKILL incomplete for '{key}' on shutdown: \
pids still alive: {survivors:?}"
),
}
killers.clear();
}
}
/// Kill and deregister the child process for a specific agent key.
///
/// Used by `stop_agent` to ensure the PTY child is terminated even though
/// aborting a `spawn_blocking` task handle does not interrupt the blocking thread.
/// Fallback used by `stop_agent` when no worktree path is recorded for the
/// agent. Also the primary kill path for any caller that has only a composite
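/// key and not a worktree path directly. If the agent record itself has no
/// worktree path, this logs a warning and returns without killing anything.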
pub(super) fn kill_child_for_key(&self, key: &str) {
if let Ok(mut killers) = self.child_killers.lock()
&& let Some(mut killer) = killers.remove(key)
{
slog!("[agents] Killing child process for {key} on stop");
let _ = killer.kill();
let worktree_path = {
let Ok(agents) = self.agents.lock() else {
return;
};
agents
.get(key)
.and_then(|a| a.worktree_info.as_ref().map(|wt| wt.path.clone()))
};
let Some(path) = worktree_path else {
slog_warn!(
"[agents] No worktree path recorded for '{key}'; \
cannot SIGKILL via process_kill (no-op)"
);
return;
};
let pattern = path.display().to_string();
let pids = pids_matching(&pattern);
if pids.is_empty() {
slog!(
"[agents] No processes found in worktree {} for '{key}' on stop",
path.display()
);
return;
}
match sigkill_pids_and_verify(&pids) {
Ok(n) => slog!(
"[agents] SIGKILL'd {n} process(es) in worktree {} for '{key}' on stop",
path.display()
),
Err(survivors) => slog_warn!(
"[agents] SIGKILL incomplete for '{key}' on stop: \
pids still alive: {survivors:?}"
),
}
}
/// Test helper: inject a child killer into the registry.
#[cfg(test)]
pub fn inject_child_killer(
&self,
key: &str,
killer: Box<dyn portable_pty::ChildKiller + Send + Sync>,
) {
let mut killers = self.child_killers.lock().unwrap();
killers.insert(key.to_string(), killer);
}
/// Test helper: return the number of registered child killers.
#[cfg(test)]
pub fn child_killer_count(&self) -> usize {
self.child_killers.lock().unwrap().len()
}
}
#[cfg(test)]
mod tests {
use super::super::AgentPool;
use portable_pty::{CommandBuilder, PtySize, native_pty_system};
use crate::agents::AgentStatus;
use std::process::Command;
/// Returns true if a process with the given PID is currently running.
@@ -68,79 +127,100 @@ mod tests {
#[test]
fn kill_all_children_is_safe_on_empty_pool() {
let pool = AgentPool::new_test(3001);
pool.kill_all_children();
assert_eq!(pool.child_killer_count(), 0);
pool.kill_all_children(); // must not panic
}
/// AC 4 — `kill_child_for_key` SIGKILLs the single agent's process and
/// verifies it is gone within 2 s. The sleeper has the worktree path in
/// its argv[0] so `pgrep -f` can locate it, mirroring how claude-code is
/// launched with `--directory <worktree>` in production.
#[test]
fn kill_all_children_kills_real_process() {
let pool = AgentPool::new_test(3001);
fn kill_child_for_key_kills_real_process() {
use std::os::unix::process::CommandExt;
let pty_system = native_pty_system();
let pair = pty_system
.openpty(PtySize {
rows: 24,
cols: 80,
pixel_width: 0,
pixel_height: 0,
})
.expect("failed to open pty");
let pool = AgentPool::new_test(3002);
let tmp = tempfile::tempdir().unwrap();
let worktree = tmp.path();
let mut cmd = CommandBuilder::new("sleep");
cmd.arg("100");
let mut child = pair
.slave
.spawn_command(cmd)
.expect("failed to spawn sleep");
let pid = child.process_id().expect("no pid");
// argv[0] = worktree path → pgrep -f <path> finds this process.
let mut child = Command::new("sleep")
.arg0(worktree.to_string_lossy().as_ref())
.arg("100")
.spawn()
.expect("spawn sleeper");
let pid = child.id();
pool.inject_child_killer("story:agent", child.clone_killer());
// Give pgrep a moment to see the new process.
std::thread::sleep(std::time::Duration::from_millis(100));
pool.inject_test_agent_with_path(
"story-1090-kill",
"coder",
AgentStatus::Running,
worktree.to_path_buf(),
);
assert!(
process_is_running(pid),
"process {pid} should be running before kill_all_children"
"sleeper pid {pid} should be running before kill_child_for_key"
);
pool.kill_all_children();
let _ = child.wait();
pool.kill_child_for_key("story-1090-kill:coder");
let _ = child.wait(); // reap zombie so ps -p returns false
assert!(
!process_is_running(pid),
"process {pid} should have been killed by kill_all_children"
"sleeper pid {pid} should be dead after kill_child_for_key"
);
}
/// AC 5 — `kill_all_children` SIGKILLs all agents' processes. Two agents
/// with distinct worktree paths are injected; both must be gone after the call.
#[test]
fn kill_all_children_clears_registry() {
let pool = AgentPool::new_test(3001);
fn kill_all_children_kills_multiple_real_processes() {
use std::os::unix::process::CommandExt;
let pty_system = native_pty_system();
let pair = pty_system
.openpty(PtySize {
rows: 24,
cols: 80,
pixel_width: 0,
pixel_height: 0,
let pool = AgentPool::new_test(3003);
let mut sleepers: Vec<(u32, std::process::Child, tempfile::TempDir)> = (0..2_u32)
.map(|i| {
let tmp = tempfile::tempdir().unwrap();
let worktree = tmp.path();
// argv[0] = worktree path for pgrep discoverability.
let child = Command::new("sleep")
.arg0(worktree.to_string_lossy().as_ref())
.arg("100")
.spawn()
.expect("spawn sleeper");
let pid = child.id();
pool.inject_test_agent_with_path(
&format!("story-1090-all-{i}"),
"coder",
AgentStatus::Running,
worktree.to_path_buf(),
);
(pid, child, tmp)
})
.expect("failed to open pty");
.collect();
let mut cmd = CommandBuilder::new("sleep");
cmd.arg("1");
let mut child = pair
.slave
.spawn_command(cmd)
.expect("failed to spawn sleep");
// Give pgrep a moment to see the new processes.
std::thread::sleep(std::time::Duration::from_millis(100));
pool.inject_child_killer("story:agent", child.clone_killer());
assert_eq!(pool.child_killer_count(), 1);
for (pid, _, _) in &sleepers {
assert!(
process_is_running(*pid),
"pid {pid} should be running before kill_all_children"
);
}
pool.kill_all_children();
let _ = child.wait();
assert_eq!(
pool.child_killer_count(),
0,
"child_killers should be cleared after kill_all_children"
);
for (pid, child, _tmp) in &mut sleepers {
let _ = child.wait(); // reap zombie
assert!(
!process_is_running(*pid),
"pid {pid} should be dead after kill_all_children"
);
}
}
}
+36 -1
@@ -271,6 +271,42 @@ impl AgentPool {
'{conflicting_name}' is already active at the same pipeline stage"
));
}
// Cross-stage LLM agent guard: reject if any Coder/Qa/Mergemaster agent
// is already Running or Pending on this story at a *different* pipeline stage.
// These are stale agents left over from a previous stage transition that has
// since advanced. The periodic reconciler (reconcile_canonical_agents) stops
// them; here we surface the conflict so the caller waits for reconciliation.
if matches!(
resolved_stage,
PipelineStage::Coder | PipelineStage::Qa | PipelineStage::Mergemaster
) && let Some(stale_name) = agents.iter().find_map(|(k, a)| {
let k_story = k.rsplit_once(':').map(|(s, _)| s).unwrap_or(k);
if k_story != story_id || a.agent_name == resolved_name {
return None;
}
if !matches!(a.status, AgentStatus::Running | AgentStatus::Pending) {
return None;
}
let a_stage = config
.find_agent(&a.agent_name)
.map(agent_config_stage)
.unwrap_or_else(|| pipeline_stage(&a.agent_name));
if matches!(
a_stage,
PipelineStage::Coder | PipelineStage::Qa | PipelineStage::Mergemaster
) && a_stage != resolved_stage
{
Some(a.agent_name.clone())
} else {
None
}
}) {
return Err(format!(
"story '{story_id}' already has an active LLM agent '{stale_name}'; \
refusing to spawn '{resolved_name}'"
));
}
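The cross-stage guard above can be read as a pure lookup over the pool map. Below is a minimal sketch under simplifying assumptions: the `Agent` record stores its stage directly (the real code resolves it from `ProjectConfig`, falling back to `pipeline_stage`), and `Agent`, `Stage`, `Status`, and `cross_stage_conflict` are illustrative names, not the crate's API.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Stage { Coder, Qa, Mergemaster, Other }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status { Running, Pending, Completed }

// Hypothetical, simplified agent record (assumption: stage is stored inline).
struct Agent {
    name: String,
    stage: Stage,
    status: Status,
}

fn is_llm(stage: Stage) -> bool {
    matches!(stage, Stage::Coder | Stage::Qa | Stage::Mergemaster)
}

/// Returns the name of an active LLM agent on `story_id` whose stage differs
/// from `new_stage`, or `None` when spawning `new_name` is not blocked.
fn cross_stage_conflict(
    agents: &HashMap<String, Agent>,
    story_id: &str,
    new_name: &str,
    new_stage: Stage,
) -> Option<String> {
    if !is_llm(new_stage) {
        return None;
    }
    agents.iter().find_map(|(key, agent)| {
        // Keys have the form `story_id:agent_name`; split on the last ':'.
        let key_story = key.rsplit_once(':').map(|(s, _)| s).unwrap_or(key.as_str());
        let active = matches!(agent.status, Status::Running | Status::Pending);
        let conflicting = key_story == story_id
            && agent.name != new_name
            && active
            && is_llm(agent.stage)
            && agent.stage != new_stage;
        conflicting.then(|| agent.name.clone())
    })
}
```

Same-stage conflicts (for example `coder-2` while `coder-1` is running) return `None` here on purpose: in the diff they are rejected by the earlier same-stage check, not by this guard.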
// Enforce single-instance concurrency for explicitly-named agents:
// if this agent is already running on any other story, reject.
// Auto-selected agents are already guaranteed idle by
@@ -392,7 +428,6 @@ impl AgentPool {
event_log.clone(),
self.port,
log_writer.clone(),
self.child_killers.clone(),
self.watcher_tx.clone(),
inactivity_timeout_secs,
prior_events,
+15 -10
@@ -8,7 +8,6 @@ use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use portable_pty::ChildKiller;
use tokio::sync::broadcast;
use crate::agent_log::AgentLogWriter;
@@ -135,7 +134,6 @@ pub(super) async fn run_agent_spawn(
event_log: Arc<Mutex<Vec<AgentEvent>>>,
port: u16,
log_writer: Option<Arc<Mutex<AgentLogWriter>>>,
child_killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
watcher_tx: broadcast::Sender<WatcherEvent>,
inactivity_timeout_secs: u64,
// Formatted `<recent-events>` block drained from the previous session's
@@ -159,7 +157,6 @@ pub(super) async fn run_agent_spawn(
let log_clone = event_log;
let port_for_task = port;
let log_writer_clone = log_writer;
let child_killers_clone = child_killers;
let watcher_tx_clone = watcher_tx;
let _ = inactivity_timeout_secs; // currently unused inside the closure body
@@ -371,8 +368,7 @@ pub(super) async fn run_agent_spawn(
let run_result = match runtime_name {
"claude-code" => {
let runtime =
ClaudeCodeRuntime::new(child_killers_clone.clone(), watcher_tx_clone.clone());
let runtime = ClaudeCodeRuntime::new(watcher_tx_clone.clone());
let ctx = RuntimeContext {
story_id: sid.clone(),
agent_name: aname.clone(),
@@ -566,7 +562,6 @@ pub(super) async fn run_agent_spawn(
let pool = AgentPool {
agents: agents_for_respawn,
port: port_r,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx: watcher_for_respawn,
status_broadcaster: Arc::new(
crate::service::status::StatusBroadcaster::new(),
@@ -654,7 +649,6 @@ pub(super) async fn run_agent_spawn(
let pool = AgentPool {
agents: agents_for_cd,
port: port_for_cd,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx: watcher_for_cd,
status_broadcaster: Arc::new(
crate::service::status::StatusBroadcaster::new(),
@@ -774,7 +768,6 @@ pub(super) async fn run_agent_spawn(
let pool = AgentPool {
agents: agents_for_cd,
port: port_for_cd,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx: watcher_for_cd,
status_broadcaster: Arc::new(
crate::service::status::StatusBroadcaster::new(),
@@ -815,6 +808,7 @@ pub(super) async fn run_agent_spawn(
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryTotalAttempts(
&sid,
));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(&sid));
// Remove agent from the pool and unblock any wait_for_agent callers.
let tx_done = {
@@ -862,7 +856,6 @@ pub(super) async fn run_agent_spawn(
let pool = AgentPool {
agents: agents_for_respawn,
port: port_r,
child_killers: Arc::new(Mutex::new(HashMap::new())),
watcher_tx: watcher_for_respawn,
status_broadcaster: Arc::new(
crate::service::status::StatusBroadcaster::new(),
@@ -881,6 +874,17 @@ pub(super) async fn run_agent_spawn(
return;
}
// AC1 (story 1089): mark forced exits so the commit-recovery
// stuck counter is not incremented for API errors, network
// failures, or Claude-API budget exhaustion. A non-zero exit
// code means the CLI was forced out, not that it chose to stop.
if !result.exit_ok {
crate::db::write_content(
crate::db::ContentKey::CommitRecoveryForcedExit(&sid),
"1",
);
}
// Server-owned completion: run acceptance gates automatically
// when the agent process exits normally.
super::super::pipeline::run_server_owned_completion(
@@ -1254,12 +1258,13 @@ mod tests {
"abc123",
);
// Rate-limit exit handler: reset all three counters (the fix).
// Rate-limit exit handler: reset all counters (the fix).
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryPending(story_id));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryDiffFingerprint(
story_id,
));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryTotalAttempts(story_id));
crate::db::delete_content(crate::db::ContentKey::CommitRecoveryReadSet(story_id));
// CommitRecoveryPending must be cleared after each rate-limit exit.
assert!(
@@ -602,6 +602,266 @@ async fn start_agent_allows_correct_stage_agent() {
}
}
// ── story-1100: cross-stage LLM agent rejection ─────────────────────────
#[tokio::test]
async fn start_agent_rejects_mergemaster_when_coder_running_same_story() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(&sk_dir).unwrap();
fs::write(
sk_dir.join("project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n\n\
[[agent]]\nname = \"mergemaster\"\nstage = \"mergemaster\"\n",
)
.unwrap();
let pool = AgentPool::new_test(3099);
pool.inject_test_agent("999_story_cross", "coder-1", AgentStatus::Running);
let result = pool
.start_agent(root, "999_story_cross", Some("mergemaster"), None, None)
.await;
assert!(
result.is_err(),
"mergemaster must be rejected when coder-1 is still running on same story"
);
let err = result.unwrap_err();
assert!(
err.contains("active LLM agent") || err.contains("stale agent"),
"error must mention active LLM agent conflict, got: '{err}'"
);
}
#[tokio::test]
async fn start_agent_rejects_coder_when_mergemaster_running_same_story() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(&sk_dir).unwrap();
fs::write(
sk_dir.join("project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n\n\
[[agent]]\nname = \"mergemaster\"\nstage = \"mergemaster\"\n",
)
.unwrap();
let pool = AgentPool::new_test(3099);
pool.inject_test_agent("888_story_cross2", "mergemaster", AgentStatus::Running);
let result = pool
.start_agent(root, "888_story_cross2", Some("coder-1"), None, None)
.await;
assert!(
result.is_err(),
"coder-1 must be rejected when mergemaster is running on same story"
);
let err = result.unwrap_err();
assert!(
err.contains("active LLM agent") || err.contains("stale agent"),
"error must mention active LLM agent conflict, got: '{err}'"
);
}
#[tokio::test]
async fn start_agent_cross_stage_does_not_block_different_stories() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(sk_dir.join("work/1_backlog")).unwrap();
fs::write(
root.join(".huskies/project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n\n\
[[agent]]\nname = \"mergemaster\"\nstage = \"mergemaster\"\n",
)
.unwrap();
fs::write(
root.join(".huskies/work/1_backlog/777_story_other.md"),
"---\nname: Other\n---\n",
)
.unwrap();
let pool = AgentPool::new_test(3099);
// mergemaster running on story-x should NOT block coder on story-y
pool.inject_test_agent("111_story_x", "mergemaster", AgentStatus::Running);
let result = pool
.start_agent(root, "777_story_other", Some("coder-1"), None, None)
.await;
if let Err(ref e) = result {
assert!(
!e.contains("active LLM agent") && !e.contains("stale agent"),
"cross-stage guard must not fire for agents on different stories, got: '{e}'"
);
}
}
#[tokio::test]
async fn reconcile_canonical_agents_stops_stale_coder_in_qa_stage() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(&sk_dir).unwrap();
fs::write(
sk_dir.join("project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n",
)
.unwrap();
// Write story to CRDT in QA stage: canonical = Qa, but coder-1 is Running.
crate::db::ensure_content_store();
crate::db::write_item_with_content(
"777_story_reconcile",
"qa",
"---\nname: Reconcile Test\n---\n",
crate::db::ItemMeta::named("Reconcile Test"),
);
let pool = AgentPool::new_test(3099);
pool.inject_test_agent("777_story_reconcile", "coder-1", AgentStatus::Running);
let before = pool.list_agents().unwrap();
assert!(
before.iter().any(|a| a.agent_name == "coder-1"
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)),
"coder-1 should be Running before reconciliation"
);
pool.reconcile_canonical_agents(root).await;
let after = pool.list_agents().unwrap();
let still_active = after.iter().any(|a| {
a.story_id == "777_story_reconcile"
&& a.agent_name == "coder-1"
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)
});
assert!(
!still_active,
"reconciler must have stopped coder-1 (CRDT stage is QA, coder is wrong stage)"
);
}
#[tokio::test]
async fn reconcile_canonical_agents_leaves_correct_stage_agent_alone() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(&sk_dir).unwrap();
fs::write(
sk_dir.join("project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n",
)
.unwrap();
// Story is in coding stage: canonical = Coder. coder-1 is correct.
crate::db::ensure_content_store();
crate::db::write_item_with_content(
"555_story_correct",
"coding",
"---\nname: Correct Stage\n---\n",
crate::db::ItemMeta::named("Correct Stage"),
);
let pool = AgentPool::new_test(3099);
pool.inject_test_agent("555_story_correct", "coder-1", AgentStatus::Running);
pool.reconcile_canonical_agents(root).await;
let after = pool.list_agents().unwrap();
let still_active = after.iter().any(|a| {
a.story_id == "555_story_correct"
&& a.agent_name == "coder-1"
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)
});
assert!(
still_active,
"reconciler must NOT stop coder-1 when it matches the canonical stage"
);
}
/// Regression test for story 1100: a stale coder left running after a stage
/// transition blocks both a same-stage coder and a cross-stage mergemaster.
/// The periodic reconciler stops the stale coder, after which the pool no
/// longer has a cross-stage conflict.
#[tokio::test]
async fn regression_1100_stale_coder_blocks_mergemaster_then_reconciler_clears() {
use std::fs;
let tmp = tempfile::tempdir().unwrap();
let root = tmp.path();
let sk_dir = root.join(".huskies");
fs::create_dir_all(&sk_dir).unwrap();
fs::write(
sk_dir.join("project.toml"),
"[[agent]]\nname = \"coder-1\"\nstage = \"coder\"\n\n\
[[agent]]\nname = \"coder-2\"\nstage = \"coder\"\n\n\
[[agent]]\nname = \"mergemaster\"\nstage = \"mergemaster\"\n",
)
.unwrap();
let pool = AgentPool::new_test(3099);
// Simulate coder-1 still Running after the story advanced past the coding stage.
pool.inject_test_agent("1100_reg", "coder-1", AgentStatus::Running);
// coder-2 blocked by same-stage check (both are Coder stage)
let r1 = pool
.start_agent(root, "1100_reg", Some("coder-2"), None, None)
.await;
assert!(r1.is_err(), "coder-2 must be rejected by same-stage guard");
assert!(
r1.unwrap_err().contains("same pipeline stage"),
"same-stage check must fire for coder-2"
);
// mergemaster blocked by cross-stage LLM guard (coder-1 is a different LLM stage)
let r2 = pool
.start_agent(root, "1100_reg", Some("mergemaster"), None, None)
.await;
assert!(
r2.is_err(),
"mergemaster must be rejected because coder-1 (different LLM stage) is still running"
);
let r2_err = r2.unwrap_err();
assert!(
r2_err.contains("active LLM agent") || r2_err.contains("stale agent"),
"cross-stage rejection expected, got: '{r2_err}'"
);
// Reconciler: story "1100_reg" has no CRDT entry → canonical = None → stop coder-1.
pool.reconcile_canonical_agents(root).await;
// coder-1 must be gone from the active pool.
let remaining = pool.list_agents().unwrap();
assert!(
!remaining.iter().any(|a| {
a.story_id == "1100_reg"
&& a.agent_name == "coder-1"
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)
}),
"reconciler must have removed stale coder-1 from the active pool"
);
}
/// Bug 502: when start_agent is called for a non-Coder agent (mergemaster
/// or qa) on a story that's in 4_merge/, the unconditional
/// move_story_to_current at the top of start_agent must NOT fire — even
+11 -12
@@ -2,11 +2,11 @@
use std::path::Path;
use crate::config::ProjectConfig;
use crate::pipeline_state::Stage;
use super::super::super::{PipelineStage, agent_config_stage, pipeline_stage};
use super::super::super::{
PipelineStage, agent_config_stage, canonical_pipeline_stage, pipeline_stage,
};
use super::super::worktree::find_active_story_stage;
use crate::config::ProjectConfig;
/// Validate that an explicit `agent_name` is allowed to attach to `story_id`'s
/// current pipeline stage.
@@ -34,16 +34,15 @@ pub(super) fn validate_agent_stage(
let Some(story_stage) = find_active_story_stage(project_root, story_id) else {
return Ok(());
};
let expected_stage = match story_stage {
Stage::Coding { .. } => PipelineStage::Coder,
Stage::Qa => PipelineStage::Qa,
Stage::Merge { .. } => PipelineStage::Mergemaster,
_ => PipelineStage::Other,
};
if expected_stage != PipelineStage::Other && expected_stage != agent_stage {
let canonical = canonical_pipeline_stage(&story_stage);
let is_llm = matches!(
agent_stage,
PipelineStage::Coder | PipelineStage::Qa | PipelineStage::Mergemaster
);
if is_llm && (canonical.is_none() || canonical.as_ref() != Some(&agent_stage)) {
return Err(format!(
"Agent '{name}' (stage: {agent_stage:?}) cannot be assigned to \
story '{story_id}' in {}/ (requires stage: {expected_stage:?})",
story '{story_id}' in {}/ (requires stage: {canonical:?})",
story_stage.dir_name()
));
}
+138 -10
@@ -1,14 +1,35 @@
//! Agent stop — terminates a running agent while preserving its worktree.
use crate::process_kill::{pids_matching, sigkill_pids_and_verify};
use crate::slog;
use crate::slog_error;
use crate::slog_warn;
use std::path::Path;
use super::super::{AgentEvent, AgentStatus};
use super::super::{
AgentEvent, AgentStatus, PipelineStage, agent_config_stage, canonical_pipeline_stage,
pipeline_stage,
};
use super::AgentPool;
use super::types::composite_key;
impl AgentPool {
/// Stop a running agent. Worktree is preserved for inspection.
///
/// **Order of operations matters here.** The naive implementation set
/// `status = Failed` before killing the process, which opened the same
/// idempotency window that produced the 2026-05-15 watchdog
/// double-spawn: the `start_agent` check whitelists Running/Pending,
/// so flipping status away from Running while the underlying claude
/// process was still alive let a fresh spawn race in alongside the
/// surviving one. The fix is:
///
/// 1. Read the worktree path (so we can find every process running
/// in it) without mutating the agent record yet.
/// 2. SIGKILL the process tree via [`crate::process_kill`] and BLOCK
/// until verified gone. While this is in progress, status stays
/// Running and `start_agent` continues to reject duplicate spawns.
/// 3. Now that the process is gone, mutate the agent record (status,
/// handle abort, removal).
pub async fn stop_agent(
&self,
_project_root: &Path,
@@ -17,27 +38,58 @@ impl AgentPool {
) -> Result<(), String> {
let key = composite_key(story_id, agent_name);
let (worktree_info, task_handle, tx) = {
// Step 1: snapshot the worktree path (no status mutation yet).
let worktree_info = {
let agents = self.agents.lock().map_err(|e| e.to_string())?;
let agent = agents
.get(&key)
.ok_or_else(|| format!("No agent '{agent_name}' for story '{story_id}'"))?;
agent.worktree_info.clone()
};
// Step 2: SIGKILL every process running in the worktree, verify gone.
// We do this BEFORE updating the agent record so the idempotency check
// in `start_agent` keeps rejecting duplicate spawns until the slot is
// legitimately free. Replaces the prior `kill_child_for_key` path,
// which sent SIGHUP via portable_pty (ignored by claude-code).
if let Some(wt) = worktree_info.as_ref() {
let pids = pids_matching(&wt.path.display().to_string());
if !pids.is_empty() {
match sigkill_pids_and_verify(&pids) {
Ok(n) => slog!(
"[stop_agent] SIGKILL'd {n} process(es) in worktree {} for '{key}'.",
wt.path.display()
),
Err(survivors) => slog_warn!(
"[stop_agent] SIGKILL incomplete for '{key}': pids still alive: {survivors:?}. \
Proceeding with record cleanup anyway; concurrent spawn protection may be weakened."
),
}
}
} else {
slog_warn!(
"[stop_agent] No worktree path recorded for '{key}'; cannot tree-kill, \
falling back to portable_pty SIGHUP (likely no-op for claude-code)."
);
self.kill_child_for_key(&key);
}
// Step 3: now safe to mutate. Status flip and handle abort.
let (task_handle, tx) = {
let mut agents = self.agents.lock().map_err(|e| e.to_string())?;
let agent = agents
.get_mut(&key)
.ok_or_else(|| format!("No agent '{agent_name}' for story '{story_id}'"))?;
let wt = agent.worktree_info.clone();
let handle = agent.task_handle.take();
let tx = agent.tx.clone();
agent.status = AgentStatus::Failed;
(wt, handle, tx)
(handle, tx)
};
// Abort the task and kill the PTY child process.
// Note: aborting a spawn_blocking task handle does not interrupt the blocking
// thread, so we must also kill the child process directly via the killer registry.
if let Some(handle) = task_handle {
handle.abort();
let _ = handle.await;
}
self.kill_child_for_key(&key);
// Preserve worktree for inspection — don't destroy agent's work on stop.
if let Some(ref wt) = worktree_info {
@@ -53,7 +105,7 @@ impl AgentPool {
status: "stopped".to_string(),
});
// Remove from map
// Remove from map.
{
let mut agents = self.agents.lock().map_err(|e| e.to_string())?;
agents.remove(&key);
@@ -65,6 +117,82 @@ impl AgentPool {
Ok(())
}
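The three-step ordering documented on `stop_agent` (snapshot, kill and verify, then mutate) can be sketched in isolation. This is a toy model, not the crate's code: `Pool`, `Record`, and the injected `kill_and_verify` closure are hypothetical stand-ins for `AgentPool`, the agent record, and `pids_matching` plus `sigkill_pids_and_verify`.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status { Running, Failed }

// Hypothetical, simplified agent record.
#[allow(dead_code)]
struct Record {
    status: Status,
    worktree: Option<String>,
}

struct Pool {
    agents: Mutex<HashMap<String, Record>>,
}

impl Pool {
    /// `kill_and_verify` is an assumed hook: it receives the worktree path and
    /// returns the number of killed processes, or the surviving pids.
    fn stop<F>(&self, key: &str, kill_and_verify: F) -> Result<(), String>
    where
        F: FnOnce(&str) -> Result<usize, Vec<u32>>,
    {
        // Step 1: snapshot the worktree path; the record is not mutated yet.
        let worktree = {
            let agents = self.agents.lock().map_err(|e| e.to_string())?;
            let record = agents.get(key).ok_or_else(|| format!("no agent '{key}'"))?;
            record.worktree.clone()
        };
        // Step 2: kill with the lock released. Status is still Running, so a
        // concurrent spawn check keeps seeing the slot as occupied.
        if let Some(path) = worktree.as_deref() {
            if let Err(survivors) = kill_and_verify(path) {
                eprintln!("kill incomplete for '{key}': {survivors:?}");
            }
        }
        // Step 3: only now flip the status and drop the record.
        let mut agents = self.agents.lock().map_err(|e| e.to_string())?;
        if let Some(record) = agents.get_mut(key) {
            record.status = Status::Failed;
        }
        agents.remove(key);
        Ok(())
    }
}
```

Passing the kill step as a closure is only a testing convenience in this sketch; it lets a caller observe that the record still reads `Running` while the kill is in progress.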
/// Stop LLM agents whose pipeline stage no longer matches the story's canonical stage.
///
/// Called periodically by the tick loop (story 1100). For each Running or Pending
/// LLM agent (Coder, Qa, or Mergemaster) whose stage does not match the canonical
/// stage derived from the story's current CRDT state, the agent is stopped via the
/// existing SIGKILL path. Idempotent: agents already at the correct stage are left
/// untouched. Also stops LLM agents on stories that have no active pipeline stage
/// (terminal, blocked, or frozen), since no LLM agent should run there.
pub async fn reconcile_canonical_agents(&self, root: &std::path::Path) {
use crate::config::ProjectConfig;
let config = match ProjectConfig::load(root) {
Ok(c) => c,
Err(e) => {
slog_warn!("[reconcile] Cannot load config for canonical reconcile: {e}");
return;
}
};
// Snapshot active LLM agents without holding the lock during async stops.
let snapshot: Vec<(String, String, PipelineStage)> = {
let Ok(agents) = self.agents.lock() else {
return;
};
agents
.iter()
.filter_map(|(key, a)| {
if !matches!(a.status, AgentStatus::Running | AgentStatus::Pending) {
return None;
}
let stage = config
.find_agent(&a.agent_name)
.map(agent_config_stage)
.unwrap_or_else(|| pipeline_stage(&a.agent_name));
if !matches!(
stage,
PipelineStage::Coder | PipelineStage::Qa | PipelineStage::Mergemaster
) {
return None;
}
let story_id = key
.rsplit_once(':')
.map(|(s, _)| s)
.unwrap_or(key)
.to_string();
Some((story_id, a.agent_name.clone(), stage))
})
.collect()
};
for (story_id, agent_name, agent_stage) in snapshot {
let canonical = crate::pipeline_state::read_typed(&story_id)
.ok()
.flatten()
.and_then(|item| canonical_pipeline_stage(&item.stage));
let should_stop = match &canonical {
None => true,
Some(c) if *c != agent_stage => true,
_ => false,
};
if !should_stop {
continue;
}
slog!(
"[reconcile] stopping '{agent_name}' on '{story_id}': \
canonical={canonical:?} actual={agent_stage:?}"
);
if let Err(e) = self.stop_agent(root, &story_id, &agent_name).await {
slog_warn!("[reconcile] failed to stop '{agent_name}' on '{story_id}': {e}");
}
}
}
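The stop decision inside the reconciler loop reduces to a small pure function. A minimal sketch, assuming a trimmed-down `PipelineStage` with only the three LLM stages; `should_stop` is an illustrative name that does not appear in the diff.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PipelineStage { Coder, Qa, Mergemaster }

/// An LLM agent is stopped when the story has no canonical LLM stage
/// (terminal, blocked, frozen, or unknown) or when the canonical stage
/// differs from the stage the agent belongs to.
fn should_stop(canonical: Option<PipelineStage>, agent_stage: PipelineStage) -> bool {
    match canonical {
        None => true,
        Some(c) => c != agent_stage,
    }
}
```

Because the function depends only on its two inputs, calling it repeatedly for an agent that is already at the correct stage keeps returning `false`, which is the idempotency property the doc comment describes.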
/// Remove all agent entries for a given story_id from the pool.
///
/// Called when a story is archived so that stale entries don't accumulate.
+2
@@ -33,6 +33,8 @@ pub(super) fn find_active_story_stage(
crate::pipeline_state::Stage::Coding { .. }
| crate::pipeline_state::Stage::Qa
| crate::pipeline_state::Stage::Merge { .. }
| crate::pipeline_state::Stage::MergeFailure { .. }
| crate::pipeline_state::Stage::MergeFailureFinal { .. }
)
{
return Some(item.stage);
+50 -9
@@ -6,10 +6,20 @@
use std::path::{Path, PathBuf};
use crate::pipeline_state::Stage;
use crate::pipeline_state::{Pipeline, Stage, Status};
use crate::slog;
use crate::slog_warn;
/// Story 1086: matches the set of terminal stages used by the worktree-cleanup
/// subscriber via the typed [`Status`] / [`Pipeline`] projections. Excludes
/// `Status::Rejected` so rejected stories keep their worktree for human review.
fn is_cleanup_terminal(stage: &Stage) -> bool {
matches!(
stage.status(),
Status::Done | Status::Abandoned | Status::Superseded
) || matches!(stage.pipeline(), Pipeline::Archived)
}
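To illustrate how the `Status` and `Pipeline` projections combine, here is a toy model. The enum variants and the `status()` / `pipeline()` mappings below are assumptions made for the sketch (the real `Stage` carries payloads and likely has more variants); only the shape of `is_cleanup_terminal` and the "pipeline first, then variant guard" check mirror the diff. `needs_worktree` is a hypothetical helper name.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status { Other, Done, Abandoned, Superseded, Rejected }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Pipeline { Coding, Archived, Other }

// Toy stage enum without payloads (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Stage { Coding, Blocked, Done, Archived, Abandoned, Superseded, Rejected }

impl Stage {
    // Assumed projection: which terminal status, if any, the stage maps to.
    fn status(self) -> Status {
        match self {
            Stage::Done => Status::Done,
            Stage::Abandoned => Status::Abandoned,
            Stage::Superseded => Status::Superseded,
            Stage::Rejected => Status::Rejected,
            Stage::Coding | Stage::Blocked | Stage::Archived => Status::Other,
        }
    }

    // Assumed projection: which board column the stage belongs to.
    fn pipeline(self) -> Pipeline {
        match self {
            Stage::Coding | Stage::Blocked => Pipeline::Coding,
            Stage::Archived => Pipeline::Archived,
            _ => Pipeline::Other,
        }
    }
}

/// Terminal for cleanup purposes; Rejected is excluded so its worktree survives.
fn is_cleanup_terminal(stage: Stage) -> bool {
    matches!(stage.status(), Status::Done | Status::Abandoned | Status::Superseded)
        || stage.pipeline() == Pipeline::Archived
}

/// Layered check: cheap column filter first, then the exact variant.
fn needs_worktree(stage: Stage) -> bool {
    stage.pipeline() == Pipeline::Coding && matches!(stage, Stage::Coding)
}
```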
/// Spawn a background task that creates a git worktree when a story enters `Stage::Coding`.
///
/// Subscribes to the pipeline transition broadcast channel. On each
@@ -22,7 +32,14 @@ pub(crate) fn spawn_worktree_create_subscriber(project_root: PathBuf, port: u16)
loop {
match rx.recv().await {
Ok(fired) => {
if matches!(fired.after, Stage::Coding { .. }) {
// Story 1086: classify by Pipeline column. `Pipeline::Coding`
// covers `Stage::Coding` and `Stage::Blocked` — but Blocked has
// no worktree to create, so we still need the Stage::Coding
// payload check. Use a layered match: pipeline first for fast
// skip, then variant guard.
if fired.after.pipeline() == Pipeline::Coding
&& matches!(fired.after, Stage::Coding { .. })
{
on_coding_transition(&project_root, port, &fired.story_id.0).await;
}
}
@@ -50,13 +67,7 @@ pub(crate) fn spawn_worktree_cleanup_subscriber(project_root: PathBuf) {
loop {
match rx.recv().await {
Ok(fired) => {
if matches!(
fired.after,
Stage::Done { .. }
| Stage::Archived { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
) {
if is_cleanup_terminal(&fired.after) {
on_terminal_transition(&project_root, &fired.story_id.0).await;
}
}
@@ -72,6 +83,36 @@ pub(crate) fn spawn_worktree_cleanup_subscriber(project_root: PathBuf) {
});
}
/// Reconcile worktree creation: for each story currently in `Stage::Coding`, ensure its worktree exists.
///
/// Idempotent — creates worktrees for Coding stories that have no worktree yet, and is
/// a no-op for stories whose worktree already exists. Called by the periodic reconciler
/// so that Lagged events on the broadcast channel never leave Coding stories without worktrees.
pub(crate) async fn reconcile_worktree_create(project_root: &Path, port: u16) {
for item in crate::pipeline_state::read_all_typed() {
// Story 1086: filter by Pipeline column then narrow to the `Coding`
// variant (Blocked is in `Pipeline::Coding` but has no worktree).
if item.stage.pipeline() == Pipeline::Coding
&& matches!(item.stage, crate::pipeline_state::Stage::Coding { .. })
{
on_coding_transition(project_root, port, &item.story_id.0).await;
}
}
}
/// Reconcile worktree cleanup: for each story in a terminal stage, ensure its worktree is removed.
///
/// Idempotent — removes worktrees for terminal stories that still have one, and is a no-op
/// for stories with no worktree. Called by the periodic reconciler so that Lagged events on
/// the broadcast channel never leave terminal stories with dangling worktrees.
pub(crate) async fn reconcile_worktree_cleanup(project_root: &Path) {
for item in crate::pipeline_state::read_all_typed() {
if is_cleanup_terminal(&item.stage) {
on_terminal_transition(project_root, &item.story_id.0).await;
}
}
}
/// Create the worktree and feature branch for `story_id` when it enters `Stage::Coding`.
pub(crate) async fn on_coding_transition(project_root: &Path, port: u16, story_id: &str) {
let config = match crate::config::ProjectConfig::load(project_root) {
+64 -18
@@ -13,7 +13,6 @@ mod tests {
use super::*;
use crate::agents::AgentEvent;
use crate::io::watcher::WatcherEvent;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::broadcast;
@@ -41,7 +40,6 @@ mod tests {
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, mut watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let child_killers = Arc::new(Mutex::new(HashMap::new()));
// sh -p "--" <script>: -p = privileged mode, "--" = end options,
// then the script path is the file operand.
@@ -56,7 +54,6 @@ mod tests {
&event_log,
None,
0,
child_killers,
watcher_tx,
None,
None,
@@ -98,7 +95,6 @@ mod tests {
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, mut watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let child_killers = Arc::new(Mutex::new(HashMap::new()));
let result = run_agent_pty_streaming(
"423_story_rate_limit",
@@ -111,7 +107,6 @@ mod tests {
&event_log,
None,
0,
child_killers,
watcher_tx,
None,
None,
@@ -160,7 +155,6 @@ mod tests {
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, mut watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let child_killers = Arc::new(Mutex::new(HashMap::new()));
let before = chrono::Utc::now();
let result = run_agent_pty_streaming(
@@ -174,7 +168,6 @@ mod tests {
&event_log,
None,
0,
child_killers,
watcher_tx,
None,
None,
@@ -229,7 +222,6 @@ mod tests {
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, _watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let child_killers = Arc::new(Mutex::new(HashMap::new()));
let result = run_agent_pty_streaming(
"916_story_rate_limit_extension",
@@ -242,7 +234,6 @@ mod tests {
&event_log,
None,
1, // inactivity_timeout_secs = 1s; would expire before the 3s sleep without the extension
child_killers,
watcher_tx,
None,
None,
@@ -407,18 +398,16 @@ mod tests {
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, _watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let child_killers: Arc<
Mutex<HashMap<String, Box<dyn portable_pty::ChildKiller + Send + Sync>>>,
> = Arc::new(Mutex::new(HashMap::new()));
let child_killers_for_kill = Arc::clone(&child_killers);
// Spawn a task to kill the child after a short delay (simulating watchdog).
// Uses pids_matching on the script path — same mechanism as the production
// watchdog after the process_kill migration (story 1090).
let script_path_for_kill = script.to_string_lossy().to_string();
tokio::spawn(async move {
tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
if let Ok(mut killers) = child_killers_for_kill.lock() {
for (_, killer) in killers.iter_mut() {
let _ = killer.kill();
}
let pids = crate::process_kill::pids_matching(&script_path_for_kill);
if !pids.is_empty() {
let _ = crate::process_kill::sigkill_pids_and_verify(&pids);
}
});
@@ -435,7 +424,6 @@ mod tests {
&event_log,
None,
0, // no inactivity timeout
child_killers,
watcher_tx,
None, // no session to resume
Some((project_root.clone(), "sonnet".to_string())),
@@ -457,4 +445,62 @@ mod tests {
the respawn's lookup_session returns it (warm), not None (cold)"
);
}
// ── bug 1103: soft rate-limit warning (status=allowed) must NOT set rate_limit_exit ──
/// Regression: a `rate_limit_event` with `status="allowed"` is a soft
/// warning — the request was permitted. The session that follows should
/// complete normally and report `rate_limit_exit == false`, not trigger the
/// rate-limit respawn path in the spawn handler.
#[tokio::test]
async fn rate_limit_allowed_status_does_not_set_rate_limit_exit() {
use std::os::unix::fs::PermissionsExt;
let tmp = tempfile::tempdir().unwrap();
let script = tmp.path().join("emit_allowed_then_exit.sh");
// Emit status="allowed" (soft warning), then exit cleanly.
std::fs::write(
&script,
"#!/bin/sh\nprintf '%s\\n' '{\"type\":\"rate_limit_event\",\"rate_limit_info\":{\"status\":\"allowed\",\"reset_at\":\"2099-01-01T12:00:00Z\"}}'\n",
)
.unwrap();
std::fs::set_permissions(&script, std::fs::Permissions::from_mode(0o755)).unwrap();
let (tx, _rx) = broadcast::channel::<AgentEvent>(64);
let (watcher_tx, mut watcher_rx) = broadcast::channel::<WatcherEvent>(16);
let event_log = Arc::new(Mutex::new(Vec::new()));
let result = run_agent_pty_streaming(
"1103_soft_warning_no_exit_flag",
"coder-1",
"sh",
&[script.to_string_lossy().to_string()],
"--",
"/tmp",
&tx,
&event_log,
None,
0,
watcher_tx,
None,
None,
)
.await;
let pty = result.expect("PTY run should succeed");
assert!(
!pty.rate_limit_exit,
"rate_limit_exit must be false for a soft 'allowed' warning; \
only genuine hard blocks (rejected) should set it"
);
// Watcher must have received RateLimitWarning, not RateLimitHardBlock.
let evt = watcher_rx
.try_recv()
.expect("Expected a RateLimitWarning watcher event");
assert!(
matches!(evt, WatcherEvent::RateLimitWarning { .. }),
"Expected RateLimitWarning for status=allowed, got: {evt:?}"
);
}
}
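The distinction the bug 1103 regression test pins down can be sketched as a small classifier. This is an illustration only: `RateLimitSignal`, `classify_rate_limit_status`, and `sets_rate_limit_exit` are hypothetical names, and treating any status other than `"allowed"` or `"rejected"` as unclassified is an assumption, since the diff does not show how other values are handled.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RateLimitSignal {
    /// Soft warning: the request was permitted (status = "allowed").
    Warning,
    /// Hard block: the request was refused (status = "rejected").
    HardBlock,
}

/// Map the `status` string of a `rate_limit_event` to a signal.
fn classify_rate_limit_status(status: &str) -> Option<RateLimitSignal> {
    match status {
        "allowed" => Some(RateLimitSignal::Warning),
        "rejected" => Some(RateLimitSignal::HardBlock),
        _ => None, // assumption: unknown statuses are left unclassified
    }
}

/// Only a hard block should mark the session as a rate-limit exit.
fn sets_rate_limit_exit(status: &str) -> bool {
    classify_rate_limit_status(status) == Some(RateLimitSignal::HardBlock)
}
```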
+7 -22
@@ -1,10 +1,9 @@
//! PTY process spawning and output loop: builds the command, drives the reader thread,
//! and dispatches parsed JSON events to the broadcast channel.
use std::collections::HashMap;
use std::io::{BufRead, BufReader};
use std::sync::{Arc, Mutex};
use portable_pty::{ChildKiller, CommandBuilder, PtySize, native_pty_system};
use portable_pty::{CommandBuilder, PtySize, native_pty_system};
use tokio::sync::broadcast;
use crate::agent_log::AgentLogWriter;
@@ -14,7 +13,7 @@ use crate::slog;
use crate::slog_warn;
use super::events::{emit_event, handle_agent_stream_event};
use super::types::{ChildKillerGuard, PtyResult, composite_key};
use super::types::PtyResult;
/// Spawn claude agent in a PTY and stream events through the broadcast channel.
///
@@ -55,7 +54,6 @@ pub(in crate::agents) async fn run_agent_pty_streaming(
event_log: &Arc<Mutex<Vec<AgentEvent>>>,
log_writer: Option<Arc<Mutex<AgentLogWriter>>>,
inactivity_timeout_secs: u64,
child_killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
watcher_tx: broadcast::Sender<WatcherEvent>,
session_id_to_resume: Option<&str>,
eager_record: Option<(std::path::PathBuf, String)>,
@@ -82,7 +80,6 @@ pub(in crate::agents) async fn run_agent_pty_streaming(
&event_log,
log_writer.as_deref(),
inactivity_timeout_secs,
&child_killers,
&watcher_tx,
resume_sid.as_deref(),
eager_record,
@@ -104,7 +101,6 @@ fn run_agent_pty_blocking(
event_log: &Mutex<Vec<AgentEvent>>,
log_writer: Option<&Mutex<AgentLogWriter>>,
inactivity_timeout_secs: u64,
child_killers: &Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
watcher_tx: &broadcast::Sender<WatcherEvent>,
session_id_to_resume: Option<&str>,
eager_record: Option<(std::path::PathBuf, String)>,
@@ -204,21 +200,6 @@ fn run_agent_pty_blocking(
.spawn_command(cmd)
.map_err(|e| format!("Failed to spawn agent for {story_id}:{agent_name}: {e}"))?;
// Register the child killer so that kill_all_children() / stop_agent() can
// terminate this process on server shutdown, even if the blocking thread
// cannot be interrupted. The ChildKillerGuard deregisters on function exit.
let killer_key = composite_key(story_id, agent_name);
{
let killer = child.clone_killer();
if let Ok(mut killers) = child_killers.lock() {
killers.insert(killer_key.clone(), killer);
}
}
let _killer_guard = ChildKillerGuard {
killers: Arc::clone(child_killers),
key: killer_key,
};
drop(pair.slave);
let reader = pair
@@ -366,7 +347,11 @@ fn run_agent_pty_blocking(
.and_then(|i| i.get("status"))
.and_then(|s| s.as_str())
.unwrap_or("");
let is_hard_block = !status.is_empty() && status != "allowed_warning";
// "allowed" and "allowed_warning" are soft warnings — the request was
// permitted; only statuses that actually block the request (e.g. "rejected")
// are genuine hard blocks that warrant a rate-limit exit respawn.
let is_hard_block =
!status.is_empty() && status != "allowed" && status != "allowed_warning";
let reset_at = rate_limit_info
.and_then(|i| i.get("reset_at"))
.and_then(|r| r.as_str())
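The hard-block rule in the hunk above can be read as a small pure predicate. The following is a minimal sketch, not the code in the PTY runner itself: the free function `is_hard_block` is a hypothetical extraction for illustration, and `"rejected"` is used only because the comment names it as an example of a blocking status.

```rust
/// Hypothetical helper mirroring the inline expression from the diff.
/// An empty status (field missing) and the two soft statuses are not hard blocks;
/// any other non-empty status is treated as a hard block.
fn is_hard_block(status: &str) -> bool {
    !status.is_empty() && status != "allowed" && status != "allowed_warning"
}

/// Returns a short label for logging, assuming the two-way split above.
fn classify(status: &str) -> &'static str {
    if is_hard_block(status) {
        "hard-block"
    } else {
        "soft-or-absent"
    }
}
```

Under this reading, `"allowed"` and `"allowed_warning"` both yield a warning event and leave `rate_limit_exit` unset, which is the behaviour the regression test for status `"allowed"` asserts.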
-22
@@ -1,9 +1,4 @@
//! Core types for the PTY runner: result container and process lifecycle helpers.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use portable_pty::ChildKiller;
use crate::agents::TokenUsage;
/// Result from a PTY agent session, containing the session ID and token usage.
@@ -23,20 +18,3 @@ pub(in crate::agents) struct PtyResult {
/// event was seen or when the `reset_at` field was absent from the event.
pub rate_limit_reset_at: Option<chrono::DateTime<chrono::Utc>>,
}
pub(super) fn composite_key(story_id: &str, agent_name: &str) -> String {
format!("{story_id}:{agent_name}")
}
pub(super) struct ChildKillerGuard {
pub killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
pub key: String,
}
impl Drop for ChildKillerGuard {
fn drop(&mut self) {
if let Ok(mut killers) = self.killers.lock() {
killers.remove(&self.key);
}
}
}
+5 -15
@@ -1,8 +1,6 @@
//! Claude Code runtime — launches Claude Code CLI sessions as agent backends.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use portable_pty::ChildKiller;
use tokio::sync::broadcast;
use crate::agent_log::AgentLogWriter;
@@ -17,20 +15,13 @@ use super::{AgentEvent, AgentRuntime, RuntimeContext, RuntimeResult, RuntimeStat
/// It wraps the existing PTY-based execution logic, preserving all streaming,
/// token tracking, and inactivity timeout behaviour.
pub struct ClaudeCodeRuntime {
child_killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
watcher_tx: broadcast::Sender<WatcherEvent>,
}
impl ClaudeCodeRuntime {
/// Create a new Claude Code runtime with shared child-killer registry and event channel.
pub fn new(
child_killers: Arc<Mutex<HashMap<String, Box<dyn ChildKiller + Send + Sync>>>>,
watcher_tx: broadcast::Sender<WatcherEvent>,
) -> Self {
Self {
child_killers,
watcher_tx,
}
/// Create a new Claude Code runtime with a shared event channel.
pub fn new(watcher_tx: broadcast::Sender<WatcherEvent>) -> Self {
Self { watcher_tx }
}
}
@@ -57,7 +48,6 @@ impl AgentRuntime for ClaudeCodeRuntime {
&event_log,
log_writer.clone(),
ctx.inactivity_timeout_secs,
Arc::clone(&self.child_killers),
self.watcher_tx.clone(),
ctx.session_id_to_resume.as_deref(),
eager_record.clone(),
@@ -69,6 +59,7 @@ impl AgentRuntime for ClaudeCodeRuntime {
// Abort+no-session: CLI crashed (e.g. SIGABRT) before emitting its
// first "system" event. Detected by: non-zero exit AND no session.
aborted_signal: !result.exit_ok && result.session_id.is_none(),
exit_ok: result.exit_ok,
session_id: result.session_id,
token_usage: result.token_usage,
rate_limit_exit: result.rate_limit_exit,
@@ -94,7 +85,6 @@ impl AgentRuntime for ClaudeCodeRuntime {
&event_log,
log_writer,
ctx.inactivity_timeout_secs,
Arc::clone(&self.child_killers),
self.watcher_tx.clone(),
None, // no --resume on fallback
eager_record,
@@ -103,6 +93,7 @@ impl AgentRuntime for ClaudeCodeRuntime {
Ok(RuntimeResult {
aborted_signal: !fallback_result.exit_ok
&& fallback_result.session_id.is_none(),
exit_ok: fallback_result.exit_ok,
session_id: fallback_result.session_id,
token_usage: fallback_result.token_usage,
rate_limit_exit: fallback_result.rate_limit_exit,
@@ -115,7 +106,6 @@ impl AgentRuntime for ClaudeCodeRuntime {
fn stop(&self) {
// Stopping is handled externally by the pool via kill_child_for_key().
// The ChildKillerGuard in pty.rs deregisters automatically on process exit.
}
fn get_status(&self) -> RuntimeStatus {
+4
@@ -135,6 +135,7 @@ impl AgentRuntime for GeminiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -151,6 +152,7 @@ impl AgentRuntime for GeminiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -254,6 +256,7 @@ impl AgentRuntime for GeminiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -339,6 +342,7 @@ impl AgentRuntime for GeminiRuntime {
Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
+10 -6
@@ -55,6 +55,12 @@ pub struct RuntimeContext {
pub struct RuntimeResult {
pub session_id: Option<String>,
pub token_usage: Option<TokenUsage>,
/// `true` when the process exited with exit code 0; `false` for non-zero exits
/// (API errors, network failures, or Claude-API-level budget exhaustion). Always
/// `true` for API-based runtimes (OpenAI, Gemini) which have no exit-code concept.
/// Used by the commit-recovery path to skip the stuck-respawn counter for forced
/// exits (story 1089, AC1).
pub exit_ok: bool,
/// `true` when the process exited with a failure AND no session was established.
///
/// This indicates the Claude Code CLI crashed (e.g. SIGABRT from an assertion
@@ -169,6 +175,7 @@ mod tests {
cache_read_input_tokens: 0,
total_cost_usd: 0.01,
}),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -186,6 +193,7 @@ mod tests {
let result = RuntimeResult {
session_id: None,
token_usage: None,
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -204,20 +212,16 @@ mod tests {
#[test]
fn claude_code_runtime_get_status_returns_idle() {
use crate::io::watcher::WatcherEvent;
use std::collections::HashMap;
let killers = Arc::new(Mutex::new(HashMap::new()));
let (watcher_tx, _) = broadcast::channel::<WatcherEvent>(16);
let runtime = ClaudeCodeRuntime::new(killers, watcher_tx);
let runtime = ClaudeCodeRuntime::new(watcher_tx);
assert_eq!(runtime.get_status(), RuntimeStatus::Idle);
}
#[test]
fn claude_code_runtime_stream_events_empty() {
use crate::io::watcher::WatcherEvent;
use std::collections::HashMap;
let killers = Arc::new(Mutex::new(HashMap::new()));
let (watcher_tx, _) = broadcast::channel::<WatcherEvent>(16);
let runtime = ClaudeCodeRuntime::new(killers, watcher_tx);
let runtime = ClaudeCodeRuntime::new(watcher_tx);
assert!(runtime.stream_events().is_empty());
}
}
+3
@@ -122,6 +122,7 @@ impl AgentRuntime for OpenAiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -138,6 +139,7 @@ impl AgentRuntime for OpenAiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
@@ -224,6 +226,7 @@ impl AgentRuntime for OpenAiRuntime {
return Ok(RuntimeResult {
session_id: None,
token_usage: Some(total_usage),
exit_ok: true,
aborted_signal: false,
rate_limit_exit: false,
rate_limit_reset_at: None,
+68 -67
@@ -2,37 +2,30 @@
use crate::agents::{AgentPool, AgentStatus};
use crate::config::ProjectConfig;
use crate::pipeline_state::{ArchiveReason, PipelineItem, Stage};
use crate::pipeline_state::{ArchiveReason, Pipeline, PipelineItem, Stage, Status};
use std::collections::{HashMap, HashSet};
/// Map a stage to its display section label, or `None` to skip it entirely.
///
/// This is the single source of truth for the "where does this item appear"
/// decision. It mirrors the bucket routing in `http/workflow/pipeline.rs`
/// so that chat output and the web UI are always consistent.
///
/// `Stage::Frozen { resume_to }` is handled recursively: a frozen story
/// appears in the same section its `resume_to` stage would land in.
/// This routes through [`Stage::pipeline`] so chat output and the web UI use
/// the same column derivation. Frozen stories appear in their underlying
/// `resume_to` column (handled inside `Stage::pipeline`) and items in
/// `Stage::Archived` (with non-Blocked reasons) stay hidden.
pub(crate) fn display_section(s: &Stage) -> Option<&'static str> {
match s {
Stage::Upcoming | Stage::Backlog => Some("Backlog"),
Stage::Coding { .. }
| Stage::Blocked { .. }
| Stage::Archived {
reason: ArchiveReason::Blocked { .. },
..
} => Some("In Progress"),
Stage::Qa | Stage::ReviewHold { .. } => Some("QA"),
Stage::Merge { .. } | Stage::MergeFailure { .. } | Stage::MergeFailureFinal { .. } => {
Some("Merge")
}
Stage::Done { .. } => Some("Done"),
Stage::Frozen { resume_to } => display_section(resume_to),
Stage::Abandoned { .. } | Stage::Superseded { .. } | Stage::Rejected { .. } => {
Some("Closed")
}
Stage::Archived { .. } => None, // Completed/MergeFailed/ReviewHeld stay hidden
// Archived items with non-Blocked reasons are hidden from chat output.
if matches!(s, Stage::Archived { reason, .. } if !matches!(reason, ArchiveReason::Blocked { .. }))
{
return None;
}
Some(match s.pipeline() {
Pipeline::Backlog => "Backlog",
Pipeline::Coding => "In Progress",
Pipeline::Qa => "QA",
Pipeline::Merge => "Merge",
Pipeline::Done => "Done",
Pipeline::Closed => "Closed",
Pipeline::Archived => return None,
})
}
/// Check which dependency numbers from `item.depends_on` are unmet.
@@ -114,10 +107,10 @@ pub(crate) fn build_status_from_items(
let config = ProjectConfig::load(project_root).ok();
// Pre-fetch working tree state for all Coding-stage items whose worktrees exist.
// Pre-fetch working tree state for all Coding-column items whose worktrees exist.
let dirty_files_by_story: HashMap<String, crate::service::git_ops::DirtyFiles> = items
.iter()
.filter(|i| matches!(i.stage, Stage::Coding { .. }))
.filter(|i| i.stage.pipeline() == Pipeline::Coding && i.stage.status() == Status::Active)
.filter_map(|i| {
let wt = crate::worktree::worktree_path(project_root, &i.story_id.0);
if wt.is_dir() {
@@ -137,10 +130,13 @@ pub(crate) fn build_status_from_items(
.into_iter()
.collect();
// Merge-failure detail now lives on the typed MergeJob CRDT entry
// (story 929 — CRDT is the sole source of metadata).
// (story 929 — CRDT is the sole source of metadata). Only items in the
// Merge column with an Active status (i.e. `Stage::Merge { .. }`) need a
// pre-fetched failure snippet; MergeFailure(Final) items render their
// own snippet from the typed kind.
let merge_failures: HashMap<String, String> = items
.iter()
.filter(|i| matches!(i.stage, Stage::Merge { .. }))
.filter(|i| i.stage.pipeline() == Pipeline::Merge && i.stage.status() == Status::Active)
.filter_map(|i| {
let job = crate::crdt_state::read_merge_job(&i.story_id.0)?;
let err = job.error?;
@@ -215,11 +211,12 @@ pub(crate) fn build_status_from_items(
out
}
/// Render the one-line working tree summary for a story with uncommitted changes.
/// Return an inline working-tree suffix for a story with uncommitted changes.
///
/// Returns an empty string when the working tree is clean. File paths are not
/// listed here; use `status N` (triage) for the per-file breakdown.
fn render_working_tree_lines(info: &crate::service::git_ops::DirtyFiles) -> String {
/// Returns an empty string when the working tree is clean. The suffix is
/// appended directly to the coder line, e.g. `, Working tree: 3 modified (uncommitted)`.
/// File paths are not listed here; use `status N` (triage) for the per-file breakdown.
fn working_tree_suffix(info: &crate::service::git_ops::DirtyFiles) -> String {
if info.is_clean() {
return String::new();
}
@@ -228,7 +225,7 @@ fn render_working_tree_lines(info: &crate::service::git_ops::DirtyFiles) -> Stri
(0, n) => format!("{n} new"),
(m, n) => format!("{m} modified, {n} new"),
};
format!(" Working tree: {summary} (uncommitted)\n")
format!(", Working tree: {summary} (uncommitted)")
}
/// Shared lookup tables passed to [`render_item_line`] to keep the argument count manageable.
@@ -259,8 +256,10 @@ fn render_item_line(
} else {
Some(item.name.as_str())
};
// Use the typed CRDT stage as the sole source of truth (story 945).
let frozen = matches!(item.stage, Stage::Frozen { .. });
// Use the new Pipeline + Status helpers (story 1085).
let pipeline = item.stage.pipeline();
let status = item.stage.status();
let frozen = status == Status::Frozen;
let base_label = super::story_short_label(story_id, name_opt);
let display = if frozen {
format!("\u{2744}\u{FE0F} {base_label}") // ❄️ prefix
@@ -281,41 +280,52 @@ fn render_item_line(
format!(" *(waiting on: {})*", nums.join(", "))
};
// Closed-stage items (abandoned / superseded / rejected) each get a
// Closed-pipeline items (abandoned / superseded / rejected) each get a
// distinct indicator and optionally display their metadata.
match &item.stage {
Stage::Abandoned { .. } => {
match status {
Status::Abandoned => {
return format!(" \u{1F5D1}\u{FE0F} {display}{cost_suffix}\n"); // 🗑️
}
Stage::Superseded { superseded_by, .. } => {
Status::Superseded => {
let superseded_by = match &item.stage {
Stage::Superseded { superseded_by, .. } => superseded_by.0.as_str(),
_ => "",
};
return format!(
" \u{1F500} {display}{cost_suffix} — superseded by {}\n", // 🔀
superseded_by.0
" \u{1F500} {display}{cost_suffix} — superseded by {superseded_by}\n", // 🔀
);
}
Stage::Rejected { reason, .. } => {
Status::Rejected => {
let reason = match &item.stage {
Stage::Rejected { reason, .. } => reason.as_str(),
_ => "",
};
let snippet = first_non_empty_snippet(reason, 120);
return format!(" \u{1F6AB} {display}{cost_suffix}{snippet}\n"); // 🚫
}
_ => {}
}
// Merge-stage items get dedicated breakdown indicators instead of the
// Merge-column items get dedicated breakdown indicators instead of the
// generic traffic-light dot. MergeFailure / MergeFailureFinal items
// now also appear in the Merge section (in-place) so they are handled
// here alongside normal Merge items.
if matches!(
item.stage,
Stage::Merge { .. } | Stage::MergeFailure { .. } | Stage::MergeFailureFinal { .. }
) {
match &item.stage {
// appear in the Merge column (in-place) and are handled by the same arm.
if pipeline == Pipeline::Merge {
match status {
// MergeFailureFinal: mergemaster already tried and gave up — always ⛔.
Stage::MergeFailureFinal { kind } => {
Status::MergeFailureFinal => {
let kind = match &item.stage {
Stage::MergeFailureFinal { kind } => kind,
_ => unreachable!(),
};
let snippet = first_non_empty_snippet(&kind.display_reason(), 120);
return format!(" \u{26D4} {display}{cost_suffix}{dep_suffix}{snippet}\n");
}
// MergeFailure: a recovery agent may be running or queued.
Stage::MergeFailure { kind, .. } => {
Status::MergeFailure => {
let kind = match &item.stage {
Stage::MergeFailure { kind, .. } => kind,
_ => unreachable!(),
};
return match agent.map(|a| &a.status) {
Some(AgentStatus::Running) => format!(
" \u{1F916} {display}{cost_suffix}{dep_suffix} — mergemaster running\n"
@@ -352,16 +362,7 @@ fn render_item_line(
}
}
let blocked = matches!(
item.stage,
Stage::Blocked { .. }
| Stage::MergeFailure { .. }
| Stage::MergeFailureFinal { .. }
| Stage::Archived {
reason: ArchiveReason::Blocked { .. },
..
}
);
let blocked = status == Status::Blocked;
// Blocked items with a recovery agent get differentiated indicators.
if blocked {
return match agent.map(|a| &a.status) {
@@ -378,9 +379,9 @@ fn render_item_line(
.and_then(|a| a.throttled)
.is_some_and(|until| until > chrono::Utc::now());
let dot = super::traffic_light_dot(blocked, throttled, agent.is_some());
let wt_lines = dirty_files_by_story
let wt_suffix = dirty_files_by_story
.get(story_id)
.map(render_working_tree_lines)
.map(working_tree_suffix)
.unwrap_or_default();
if let Some(agent) = agent {
let model_str = config
@@ -389,10 +390,10 @@ fn render_item_line(
.and_then(|ac| ac.model.as_ref().map(|m| m.as_str()))
.unwrap_or("?");
format!(
" {dot}{display}{cost_suffix}{dep_suffix} — {} ({model_str})\n{wt_lines}",
" {dot}{display}{cost_suffix}{dep_suffix} — {} ({model_str}){wt_suffix}\n",
agent.agent_name
)
} else {
format!(" {dot}{display}{cost_suffix}{dep_suffix}\n{wt_lines}")
format!(" {dot}{display}{cost_suffix}{dep_suffix}{wt_suffix}\n")
}
}
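The change from a separate working-tree line to an inline suffix can be sketched in isolation. This is an illustrative sketch under stated assumptions: `DirtyCounts` is a hypothetical stand-in for `DirtyFiles` that models only two counts, and the `(m, 0)` arm is assumed from the doc-comment example `, Working tree: 3 modified (uncommitted)` because the diff does not show it.

```rust
/// Hypothetical stand-in for `DirtyFiles`; only the two counts are modelled.
struct DirtyCounts {
    modified: usize,
    new: usize,
}

/// Build the inline suffix; a clean tree yields an empty string.
fn working_tree_suffix_sketch(info: &DirtyCounts) -> String {
    let summary = match (info.modified, info.new) {
        (0, 0) => return String::new(),
        (m, 0) => format!("{m} modified"), // assumed arm, not visible in the diff
        (0, n) => format!("{n} new"),
        (m, n) => format!("{m} modified, {n} new"),
    };
    format!(", Working tree: {summary} (uncommitted)")
}

/// Append the suffix before the newline so each item stays on one line.
fn render_line_sketch(label: &str, info: &DirtyCounts) -> String {
    format!("  {label}{}\n", working_tree_suffix_sketch(info))
}
```

Because the suffix carries no newline of its own, every rendered item ends with exactly one `\n`, whether or not the tree is dirty.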
+367
@@ -0,0 +1,367 @@
//! Protocol-agnostic chat dispatcher — coalesce window + per-session serial lock.
//!
//! Sits between every inbound transport (Matrix, Slack, WhatsApp, …) and the
//! `claude -p` spawner. Transport handlers call [`ChatDispatcher::submit`]
//! instead of spawning directly; the dispatcher enforces two invariants:
//!
//! 1. **Coalesce window**: messages arriving for the same session within
//! `coalesce_ms` of each other are concatenated and delivered to a single
//! spawn. The window is a *debounce*: each new message extends the window by
//! `coalesce_ms` from its arrival time, so bursts flush as one batch.
//!
//! 2. **Per-session serial lock**: while one `claude -p` run is active, further
//! messages for that session queue up and are dispatched as a single batch
//! once the running invocation completes.
//!
//! A [`ChatDispatcher::stop`] call cancels the active run for a session and
//! discards the pending queue.
use crate::slog;
use std::collections::HashMap;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::sync::{mpsc, watch};
/// A factory function that produces one LLM execution future per dispatch.
///
/// Arguments:
/// - `String`: the (possibly concatenated) prompt to send to `claude -p`.
/// - `watch::Receiver<bool>`: cancellation signal; the dispatcher sends `true`
///   on the paired sender to cancel the run.
///
/// Returns a boxed, pinned `Send + 'static` future that resolves when the LLM
/// session ends (whether normally or via cancellation).
pub type SpawnFn = Arc<
dyn Fn(
String,
watch::Receiver<bool>,
) -> Pin<Box<dyn std::future::Future<Output = ()> + Send + 'static>>
+ Send
+ Sync,
>;
enum SessionMsg {
UserMessage { text: String, factory: SpawnFn },
Stop,
}
struct SessionHandle {
tx: mpsc::UnboundedSender<SessionMsg>,
}
/// Coalescing, serialising dispatcher for chat-to-LLM message routing.
///
/// Construct once at startup via [`ChatDispatcher::new`] and share via `Arc`.
/// Call [`submit`](ChatDispatcher::submit) from every transport handler instead
/// of spawning `claude -p` directly.
pub struct ChatDispatcher {
sessions: Mutex<HashMap<String, SessionHandle>>,
coalesce_ms: u64,
}
impl ChatDispatcher {
/// Create a new dispatcher with the given coalesce window in milliseconds.
pub fn new(coalesce_ms: u64) -> Self {
Self {
sessions: Mutex::new(HashMap::new()),
coalesce_ms,
}
}
/// Submit a message for a chat session.
///
/// If no session task exists for `session_key`, one is created lazily.
/// The `factory` is called by the session task when the coalesce window
/// closes (or immediately after the current run finishes, for pending
/// messages).
pub fn submit(&self, session_key: String, message: String, factory: SpawnFn) {
let mut guard = self.sessions.lock().unwrap();
let coalesce_ms = self.coalesce_ms;
let handle = guard.entry(session_key.clone()).or_insert_with(|| {
let (tx, rx) = mpsc::unbounded_channel();
tokio::spawn(session_task(session_key.clone(), rx, coalesce_ms));
SessionHandle { tx }
});
let _ = handle.tx.send(SessionMsg::UserMessage {
text: message,
factory,
});
}
/// Stop the active LLM run for `session_key` and clear its pending queue.
///
/// Returns `true` if the session existed (whether or not anything was
/// actually running), `false` if no session for that key has been created.
pub fn stop(&self, session_key: &str) -> bool {
let guard = self.sessions.lock().unwrap();
if let Some(handle) = guard.get(session_key) {
let _ = handle.tx.send(SessionMsg::Stop);
true
} else {
false
}
}
}
/// Per-session background task.
///
/// Phases:
/// 1. **Wait** — blocks until the first `UserMessage` arrives.
/// 2. **Coalesce** — extends the window by `coalesce_ms` on each new message;
/// fires when no message arrives within the window.
/// 3. **Run** — calls the factory with the concatenated batch; while running,
/// collects further `UserMessage`s into a pending list and logs one line per
/// message. A `Stop` message cancels the running call and clears pending.
/// 4. **Drain** — after the run, if pending is non-empty, fires a second run
/// with the accumulated batch and loops back to step 3.
/// 5. Returns to step 1 when pending is empty.
async fn session_task(
session_key: String,
mut rx: mpsc::UnboundedReceiver<SessionMsg>,
coalesce_ms: u64,
) {
let coalesce_dur = Duration::from_millis(coalesce_ms);
loop {
// ── Phase 1: wait for the first message ─────────────────────────────
let (first_text, first_factory) = loop {
match rx.recv().await {
None => return,
Some(SessionMsg::Stop) => continue,
Some(SessionMsg::UserMessage { text, factory }) => break (text, factory),
}
};
// ── Phase 2: coalesce window (debounce) ──────────────────────────────
let mut batch: Vec<String> = vec![first_text];
let mut latest_factory: SpawnFn = first_factory;
let mut deadline = tokio::time::Instant::now() + coalesce_dur;
'coalesce: loop {
let now = tokio::time::Instant::now();
if now >= deadline {
break 'coalesce;
}
let remaining = deadline - now;
match tokio::time::timeout(remaining, rx.recv()).await {
Err(_) => break 'coalesce, // window closed
Ok(None) => return, // channel closed → exit task
Ok(Some(SessionMsg::Stop)) => {
batch.clear();
break 'coalesce;
}
Ok(Some(SessionMsg::UserMessage { text, factory })) => {
batch.push(text);
latest_factory = factory;
// Extend deadline on each new message (debounce).
deadline = tokio::time::Instant::now() + coalesce_dur;
}
}
}
if batch.is_empty() {
continue; // Stop received during coalesce — restart
}
// ── Phase 3 + 4: run → drain pending → repeat ───────────────────────
let mut prompt = batch.join("\n\n");
let mut factory = latest_factory;
loop {
let (cancel_tx, cancel_rx) = watch::channel(false);
let llm_fut = factory(prompt, cancel_rx);
let mut llm_task = tokio::spawn(llm_fut);
let mut pending_texts: Vec<String> = vec![];
let mut pending_factory: Option<SpawnFn> = None;
let mut stopped = false;
// Wait for the LLM to finish, collecting messages that arrive during the run.
loop {
tokio::select! {
_ = &mut llm_task => { break; }
msg = rx.recv() => {
match msg {
None => {
llm_task.abort();
return;
}
Some(SessionMsg::Stop) => {
let _ = cancel_tx.send(true);
let _ = llm_task.await;
pending_texts.clear();
stopped = true;
break;
}
Some(SessionMsg::UserMessage { text, factory: f }) => {
pending_texts.push(text);
let depth = pending_texts.len();
slog!(
"[chat-dispatcher] coalescing message for session={}, queue_depth={}",
session_key,
depth,
);
pending_factory = Some(f);
}
}
}
}
}
if stopped || pending_texts.is_empty() {
break; // back to Phase 1
}
// Fire the pending batch as the next run (no additional coalesce window).
prompt = pending_texts.join("\n\n");
factory = pending_factory.unwrap();
}
}
}
// ── Tests ─────────────────────────────────────────────────────────────────────
#[cfg(test)]
mod tests {
use super::*;
use std::sync::atomic::{AtomicUsize, Ordering};
fn make_factory(spawn_count: Arc<AtomicUsize>, run_ms: u64) -> SpawnFn {
Arc::new(move |_prompt: String, _cancel_rx: watch::Receiver<bool>| {
let count = Arc::clone(&spawn_count);
Box::pin(async move {
count.fetch_add(1, Ordering::SeqCst);
tokio::time::sleep(Duration::from_millis(run_ms)).await;
})
})
}
/// AC 6 regression: three messages on the same session, where the second
/// arrives inside the coalesce window and the third arrives after the first
/// run has finished, must produce at most two spawns and never three
/// concurrent processes.
///
/// Setup:
/// coalesce_ms = 50 ms (short window so test runs fast)
/// LLM "run" = 150 ms
/// msg1 @ t=0
/// msg2 @ t=20 ms — within coalesce window, merged with msg1 → 1 spawn
/// msg3 @ t=300 ms — after run completes → 2nd spawn
///
/// Expected: 2 spawns, never 3 (the assertion below accepts 1 or 2).
#[tokio::test]
async fn three_messages_never_three_concurrent_spawns() {
let spawn_count = Arc::new(AtomicUsize::new(0));
let dispatcher = Arc::new(ChatDispatcher::new(50));
let session = "room1".to_string();
// msg1 at t=0
dispatcher.submit(
session.clone(),
"msg1".to_string(),
make_factory(Arc::clone(&spawn_count), 150),
);
// msg2 at t=20 ms — inside the 50 ms coalesce window
tokio::time::sleep(Duration::from_millis(20)).await;
dispatcher.submit(
session.clone(),
"msg2".to_string(),
make_factory(Arc::clone(&spawn_count), 150),
);
// msg3 at t=300 ms — after the coalesce window fires (t≈70 ms) and the
// 150 ms run completes (t≈220 ms), so msg3 starts a second coalesce cycle.
tokio::time::sleep(Duration::from_millis(280)).await;
dispatcher.submit(
session.clone(),
"msg3".to_string(),
make_factory(Arc::clone(&spawn_count), 150),
);
// Wait long enough for both runs to finish.
tokio::time::sleep(Duration::from_millis(500)).await;
let count = spawn_count.load(Ordering::SeqCst);
assert!(
(1..=2).contains(&count),
"expected 1 or 2 spawns (msgs 1+2 coalesced, msg3 separate), got {count}"
);
}
/// Messages that arrive while the LLM is running are not lost — they are
/// delivered as a single follow-up spawn once the first run completes.
#[tokio::test]
async fn pending_messages_dispatched_after_run_completes() {
let spawn_count = Arc::new(AtomicUsize::new(0));
let dispatcher = Arc::new(ChatDispatcher::new(50));
let session = "room2".to_string();
// First message — starts a 200 ms run.
dispatcher.submit(
session.clone(),
"first".to_string(),
make_factory(Arc::clone(&spawn_count), 200),
);
// Wait for coalesce window to fire, then send two more.
tokio::time::sleep(Duration::from_millis(100)).await;
dispatcher.submit(
session.clone(),
"second".to_string(),
make_factory(Arc::clone(&spawn_count), 50),
);
dispatcher.submit(
session.clone(),
"third".to_string(),
make_factory(Arc::clone(&spawn_count), 50),
);
// Wait long enough for both runs.
tokio::time::sleep(Duration::from_millis(600)).await;
let count = spawn_count.load(Ordering::SeqCst);
assert_eq!(
count, 2,
"first run + one pending-batch run = 2 total spawns"
);
}
/// Stop cancels the running LLM and discards pending messages.
#[tokio::test]
async fn stop_cancels_run_and_clears_pending() {
let spawn_count = Arc::new(AtomicUsize::new(0));
let dispatcher = Arc::new(ChatDispatcher::new(30));
let session = "room3".to_string();
// Start a long run.
dispatcher.submit(
session.clone(),
"long-running".to_string(),
make_factory(Arc::clone(&spawn_count), 500),
);
// Wait for coalesce window to fire.
tokio::time::sleep(Duration::from_millis(80)).await;
// Queue a pending message.
dispatcher.submit(
session.clone(),
"pending".to_string(),
make_factory(Arc::clone(&spawn_count), 50),
);
// Stop immediately.
dispatcher.stop(&session);
// Wait longer than the run would have taken if not stopped.
tokio::time::sleep(Duration::from_millis(700)).await;
let count = spawn_count.load(Ordering::SeqCst);
// The first run was started before stop (spawn_count=1).
// The pending message should NOT have produced a second spawn.
assert!(
count <= 1,
"stop should discard pending; got {count} spawns"
);
}
}
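The debounce rule used in Phase 2 of the session task can be checked without a runtime. The sketch below is a simplified model, not the dispatcher itself: `coalesce_batches` is a hypothetical helper that works on arrival timestamps in milliseconds, and it ignores the run phase, where messages queue with no coalesce window until the active run ends.

```rust
/// Group ascending arrival times (ms) into debounce batches.
/// A message joins the open batch when it arrives before the current deadline;
/// every message, joined or not, moves the deadline to `arrival + coalesce_ms`.
fn coalesce_batches(arrivals_ms: &[u64], coalesce_ms: u64) -> Vec<Vec<u64>> {
    let mut batches: Vec<Vec<u64>> = Vec::new();
    let mut deadline: Option<u64> = None;
    for &t in arrivals_ms {
        match deadline {
            // Inside the window: extend the open batch.
            Some(d) if t < d => batches
                .last_mut()
                .expect("a deadline implies an open batch")
                .push(t),
            // First message, or the window already closed: start a new batch.
            _ => batches.push(vec![t]),
        }
        deadline = Some(t + coalesce_ms);
    }
    batches
}
```

With a 50 ms window, arrivals at 0, 20 and 300 ms form two batches, matching the first regression test. A steady stream with gaps shorter than the window keeps extending a single batch, which is the trade-off of debouncing over a fixed window.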
+2
@@ -6,6 +6,8 @@
/// Bot command registry and dispatch — parses and routes incoming chat messages.
pub mod commands;
/// Protocol-agnostic chat dispatcher — coalesce window and per-session serial lock.
pub mod dispatcher;
/// Chat history utilities — loading and serialising conversation history.
pub mod history;
pub(crate) mod lookup;
@@ -268,6 +268,7 @@ mod tests {
pending_perm_replies: Arc::new(TokioMutex::new(HashMap::new())),
permission_timeout_secs: 120,
status: Arc::new(crate::service::status::StatusBroadcaster::new()),
chat_dispatcher: Arc::new(crate::chat::dispatcher::ChatDispatcher::new(1_500)),
})
}
@@ -21,6 +21,7 @@ pub(in crate::chat::transport::matrix::bot) async fn handle_message(
ctx: BotContext,
sender: String,
user_message: String,
mut cancel_rx: watch::Receiver<bool>,
) {
// Look up the room's existing Claude Code session ID (if any) so we can
// resume the conversation with structured API messages instead of
@@ -41,7 +42,16 @@ pub(in crate::chat::transport::matrix::bot) async fn handle_message(
let all_lines: Vec<String> = sled_guard.drain(..).chain(gtw_guard.drain(..)).collect();
drop(sled_guard);
drop(gtw_guard);
format_drained_events(all_lines)
slog!(
"[matrix-bot] drained {} gateway audit lines for LLM context",
all_lines.len()
);
let prefix = format_drained_events(all_lines);
slog!(
"[matrix-bot] format_drained_events output: {} bytes",
prefix.len()
);
prefix
};
// The prompt is just the current message with sender attribution.
@@ -59,9 +69,6 @@ pub(in crate::chat::transport::matrix::bot) async fn handle_message(
);
let provider = ClaudeCodeProvider::new();
let (cancel_tx, mut cancel_rx) = watch::channel(false);
// Keep the sender alive for the duration of the call.
let _cancel_tx = cancel_tx;
// Channel for sending complete paragraphs to the Matrix posting task.
let (msg_tx, mut msg_rx) = tokio::sync::mpsc::unbounded_channel::<String>();
@@ -608,9 +608,56 @@ pub(in crate::chat::transport::matrix::bot) async fn on_room_message(
return;
}
// Spawn a separate task so the Matrix sync loop is not blocked while we
// wait for the LLM response (which can take several seconds).
tokio::spawn(async move {
handle_message(room_id_str, incoming_room_id, ctx, sender, user_message).await;
});
// "stop" — cancel the running LLM turn for this session and clear pending queue.
{
let stripped = crate::chat::util::strip_bot_mention(
&user_message,
&ctx.services.bot_name,
ctx.matrix_user_id.as_str(),
)
.trim()
.to_ascii_lowercase();
if stripped == "stop" {
slog!("[matrix-bot] stop command from {sender} for session {room_id_str}");
ctx.services.chat_dispatcher.stop(&room_id_str);
let msg = "Stopped.";
let html = markdown_to_html(msg);
if let Ok(msg_id) = ctx.transport.send_message(&room_id_str, msg, &html).await
&& let Ok(event_id) = msg_id.parse()
{
ctx.bot_sent_event_ids.lock().await.insert(event_id);
}
return;
}
}
// Hand the message to the protocol-agnostic dispatcher instead of spawning
// directly. The dispatcher applies a coalesce window and a per-session
// serial lock, preventing duplicate concurrent Timmy spawns.
let ctx_for_factory = ctx.clone();
let factory: crate::chat::dispatcher::SpawnFn = {
let room_id_str2 = room_id_str.clone();
std::sync::Arc::new(
move |coalesced: String, cancel_rx: tokio::sync::watch::Receiver<bool>| {
let room_id_str = room_id_str2.clone();
let incoming_room_id = incoming_room_id.clone();
let ctx = ctx_for_factory.clone();
let sender = sender.clone();
Box::pin(async move {
handle_message(
room_id_str,
incoming_room_id,
ctx,
sender,
coalesced,
cancel_rx,
)
.await;
})
},
)
};
ctx.services
.chat_dispatcher
.submit(room_id_str, user_message, factory);
}
@@ -150,6 +150,7 @@ mod tests {
pending_perm_replies: Arc::new(TokioMutex::new(HashMap::new())),
permission_timeout_secs: 120,
status: Arc::new(crate::service::status::StatusBroadcaster::new()),
chat_dispatcher: Arc::new(crate::chat::dispatcher::ChatDispatcher::new(1_500)),
});
(services, perm_tx)
}
+119 -6
@@ -326,21 +326,49 @@ pub async fn run_bot(
}
// Subscribe to gateway-side status events and buffer compact audit lines for
// the LLM context. A separate resubscribed receiver is used so both the
// buffer task and the room-forwarder task receive every event independently.
// the LLM context.
//
// Investigation log (story 1078) — hypotheses ruled out:
// (A) gateway_event_rx is None: impossible — spawn_gateway_bot always passes
// Some(state.event_tx.clone()) in gateway mode (gateway/mod.rs:130).
// (B) recv() never returns: buf task uses the ORIGINAL event_rx (subscribed
// before Matrix init) so any events buffered during init are visible;
// future events arrive normally via the shared broadcast channel.
// (C) Different Arc: buf and ctx.pending_gateway_events are both clones of
// the same Arc<TokioMutex<Vec<String>>> — writes in the buf task are
// immediately visible to handle_message.
// (D) format_drained_events empty on non-empty input: the function is
// pure/tested; the drain slog in handle_message now makes the count
// observable so we can confirm it is non-zero when events arrive.
//
// Bug fixed here: previously the buffer task held `event_rx.resubscribe()`,
// which starts at the *current tail* (next unsent message) and silently
// discards every event that arrived during the Matrix login / room-join /
// cross-signing phase (~530 s window). The forwarder now gets the
// resubscribed receiver (only needs live events going forward); the buffer
// task holds the original `event_rx` so it drains the init-window backlog
// on first poll.
let pending_gateway_events: Arc<TokioMutex<Vec<String>>> =
Arc::new(TokioMutex::new(Vec::new()));
let gateway_event_rx_for_forwarder = if let Some(event_rx) = gateway_event_rx {
// Buffer task: silently accumulate compact audit lines for Timmy's context.
// The forwarder only needs live (future) events — resubscribe is fine.
let forwarder_rx = event_rx.resubscribe();
// Buffer task: hold the *original* receiver so init-window events are
// not lost. Silently accumulate compact audit lines for Timmy's context.
{
use crate::service::gateway::polling::format_gateway_audit_line;
let buf_rx = event_rx.resubscribe();
let buf = Arc::clone(&pending_gateway_events);
slog!("[matrix-bot] subscribed to gateway events; buffer task starting");
tokio::spawn(async move {
let mut rx = buf_rx;
let mut rx = event_rx;
loop {
match rx.recv().await {
Ok(event) => {
slog!(
"[matrix-bot] buffered audit line for project={} ts={}",
event.project,
event.event.timestamp_ms()
);
let line = format_gateway_audit_line(&event.project, &event.event);
buf.lock().await.push(line);
}
@@ -352,7 +380,7 @@ pub async fn run_bot(
}
});
}
Some(event_rx)
Some(forwarder_rx)
} else {
None
};
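The resubscribe fix described in the comment above can be illustrated with a toy model. This is a sketch only: `Log` and `Cursor` are hypothetical stand-ins for a broadcast channel and its receivers, not tokio's API. It shows why a receiver created by resubscribing at the current tail misses events sent during the init window, while the original receiver still sees them.

```rust
/// Toy append-only log standing in for a broadcast channel (hypothetical, not tokio).
struct Log {
    events: Vec<String>,
}

/// A receiver modelled as a cursor into the log.
struct Cursor {
    next: usize,
}

impl Log {
    fn new() -> Self {
        Log { events: Vec::new() }
    }

    /// A fresh subscriber starts at the current tail.
    fn subscribe(&self) -> Cursor {
        Cursor { next: self.events.len() }
    }

    fn send(&mut self, event: &str) {
        self.events.push(event.to_string());
    }
}

impl Cursor {
    /// Resubscribing also starts at the current tail, so anything
    /// sent before this call is invisible to the new cursor.
    fn resubscribe(&self, log: &Log) -> Cursor {
        Cursor { next: log.events.len() }
    }

    /// Return every event this cursor has not yet seen.
    fn drain(&mut self, log: &Log) -> Vec<String> {
        let out = log.events[self.next..].to_vec();
        self.next = log.events.len();
        out
    }
}
```

In this model, a cursor taken before two init-window events and drained later yields all of them, whereas a cursor resubscribed after those events yields only what arrives afterwards. That is the reason the buffer task keeps the original receiver and the forwarder takes the resubscribed one.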
@@ -592,4 +620,89 @@ mod tests {
assert_eq!(steps[2], 20);
assert_eq!(steps[3], 40);
}
/// Regression test (story 1078): gateway broadcast events must reach
/// `pending_gateway_events` and produce an `audit ts=…` line in the
/// `format_drained_events` output that is prepended to Timmy's prompt.
///
/// The test spins up a mock `event_tx` broadcaster, sends one
/// `StageTransition` event, lets the buffer task process it, drains the
/// buffer, and asserts the result contains the expected audit prefix.
#[tokio::test]
async fn gateway_buffer_task_injects_audit_line_into_context() {
use super::super::messages::format_drained_events;
use crate::service::events::StoredEvent;
use crate::service::gateway::GatewayStatusEvent;
use crate::service::gateway::polling::format_gateway_audit_line;
let (event_tx, event_rx) = tokio::sync::broadcast::channel::<GatewayStatusEvent>(16);
// pending_gateway_events shared between buffer task and drain site.
let pending: Arc<TokioMutex<Vec<String>>> = Arc::new(TokioMutex::new(Vec::new()));
// Spawn a minimal buffer task — same logic as run_bot uses.
{
let buf = Arc::clone(&pending);
tokio::spawn(async move {
let mut rx = event_rx;
loop {
match rx.recv().await {
Ok(event) => {
let line = format_gateway_audit_line(&event.project, &event.event);
buf.lock().await.push(line);
}
Err(tokio::sync::broadcast::error::RecvError::Lagged(_)) => {}
Err(tokio::sync::broadcast::error::RecvError::Closed) => break,
}
}
});
}
// Send one stage-transition event, as a project node would.
let evt = GatewayStatusEvent {
project: "huskies".to_string(),
event: StoredEvent::StageTransition {
story_id: "42_story_feat".to_string(),
story_name: String::new(),
from_stage: "2_current".to_string(),
to_stage: "3_qa".to_string(),
timestamp_ms: 1_000_000,
},
};
let receivers = event_tx.send(evt).unwrap_or(0);
assert!(
receivers > 0,
"event must have at least one active receiver"
);
// Wait for the buffer task to process the event.
let deadline = std::time::Instant::now() + std::time::Duration::from_secs(2);
loop {
if !pending.lock().await.is_empty() {
break;
}
assert!(
std::time::Instant::now() < deadline,
"buffer task did not receive the event within 2 s"
);
tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}
// Drain and format — mirrors what handle_message does.
let lines: Vec<String> = pending.lock().await.drain(..).collect();
let prefix = format_drained_events(lines);
assert!(
prefix.contains("audit ts="),
"prompt prefix must contain 'audit ts='; got: {prefix}"
);
assert!(
prefix.contains("project=huskies"),
"prompt prefix must name the project; got: {prefix}"
);
assert!(
prefix.starts_with("<system-reminder>\n"),
"prefix must open with <system-reminder>; got: {prefix}"
);
}
}
@@ -17,6 +17,11 @@ pub(super) fn default_aggregated_notifications_enabled() -> bool {
true
}
/// Default coalesce window for the chat dispatcher (1 500 ms).
pub(super) fn default_coalesce_window_ms() -> u64 {
1_500
}
pub(super) fn default_transport() -> String {
"matrix".to_string()
}
@@ -187,4 +192,14 @@ pub struct BotConfig {
/// Defaults to `true`.
#[serde(default = "default_aggregated_notifications_enabled")]
pub aggregated_notifications_enabled: bool,
/// Duration in milliseconds of the chat dispatcher's coalesce window.
///
/// Messages for the same session arriving within this window are
/// concatenated into a single `claude -p` call. The window is a
/// debounce: each new message extends the deadline by this duration.
///
/// Defaults to 1 500 ms (1.5 s).
#[serde(default = "default_coalesce_window_ms")]
pub coalesce_window_ms: u64,
}
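The debounce semantics documented for `coalesce_window_ms` can be sketched as a pure function over arrival times. This is an illustration under assumptions, not the dispatcher's code: `coalesce` is a hypothetical helper, and joining messages with a newline is an assumed separator (the doc comment only says the messages are concatenated).

```rust
/// Group timestamped messages the way a debounce window would.
/// `arrivals` holds (arrival time in ms, text) pairs sorted by time.
/// Hypothetical helper; the newline separator is an assumption.
fn coalesce(arrivals: &[(u64, &str)], window_ms: u64) -> Vec<String> {
    let mut batches: Vec<String> = Vec::new();
    let mut deadline: Option<u64> = None;
    for &(at_ms, text) in arrivals {
        match deadline {
            // Arrived before the current deadline: join the open batch.
            Some(d) if at_ms < d => {
                let open = batches.last_mut().expect("an open batch exists");
                open.push('\n');
                open.push_str(text);
            }
            // First message, or the window already closed: start a new batch.
            _ => batches.push(text.to_string()),
        }
        // Debounce: every message pushes the deadline out by the full window.
        deadline = Some(at_ms + window_ms);
    }
    batches
}
```

With a 1500 ms window, messages arriving at 0, 1000 and 2400 ms land in one batch because each arrives before the deadline set by the previous one; a message at 4000 ms starts a new batch.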
@@ -310,6 +310,7 @@ mod tests {
perm_rx: Arc::new(tokio::sync::Mutex::new(perm_rx)),
pending_perm_replies: Arc::new(tokio::sync::Mutex::new(Default::default())),
permission_timeout_secs: 120,
chat_dispatcher: Arc::new(crate::chat::dispatcher::ChatDispatcher::new(1_500)),
});
Arc::new(WhatsAppWebhookContext {
services,
+11
@@ -161,6 +161,12 @@ pub struct WatcherConfig {
/// moved to `6_archived/`. Default: 14400 (4 hours).
#[serde(default = "default_done_retention_secs")]
pub done_retention_secs: u64,
/// How often (in seconds) the periodic reconciler runs to converge
/// subscriber side effects. The reconciler calls each subscriber's
/// `reconcile()` entry point so that Lagged events never leave persistent
/// state diverged. Default: 30 seconds.
#[serde(default = "default_reconcile_interval_secs")]
pub reconcile_interval_secs: u64,
}
impl Default for WatcherConfig {
@@ -168,6 +174,7 @@ impl Default for WatcherConfig {
Self {
sweep_interval_secs: default_sweep_interval_secs(),
done_retention_secs: default_done_retention_secs(),
reconcile_interval_secs: default_reconcile_interval_secs(),
}
}
}
@@ -180,6 +187,10 @@ fn default_done_retention_secs() -> u64 {
4 * 60 * 60 // 4 hours
}
fn default_reconcile_interval_secs() -> u64 {
30
}
fn default_qa() -> String {
"server".to_string()
}
+4 -3
@@ -54,9 +54,10 @@ pub use types::{
};
pub use write::{
bump_retry_count, migrate_legacy_stage_strings, migrate_merge_job, migrate_names_from_slugs,
migrate_node_claims_to_agent_claims, migrate_story_ids_to_numeric, name_from_story_id,
purge_done_stage_merge_jobs, set_agent, set_depends_on, set_epic, set_item_type, set_name,
set_plan_state, set_qa_mode, set_resume_to, set_resume_to_raw, set_retry_count, write_item,
migrate_node_claims_to_agent_claims, migrate_story_ids_to_numeric,
migrate_zombie_pipeline_rows, name_from_story_id, purge_done_stage_merge_jobs, set_agent,
set_depends_on, set_epic, set_item_type, set_name, set_origin, set_plan_state, set_qa_mode,
set_resume_to, set_resume_to_raw, set_retry_count, write_item,
};
#[cfg(test)]
+33 -28
@@ -29,6 +29,8 @@ pub struct CrdtItemDump {
/// Hex-encoded OpId of the list insert op — cross-reference with `crdt_ops`.
pub content_index: String,
pub is_deleted: bool,
/// Origin JSON string, or `None` for items that pre-date story 1088.
pub origin: Option<String>,
}
/// Top-level debug dump of the in-memory CRDT state.
@@ -149,6 +151,10 @@ pub fn dump_crdt_state(story_id_filter: Option<&str>) -> CrdtStateDump {
JsonValue::Number(n) if n > 0.0 => Some(n),
_ => None,
};
let origin = match item_crdt.origin.view() {
JsonValue::String(s) if !s.is_empty() => Some(s),
_ => None,
};
let content_index = op.id.iter().map(|b| format!("{b:02x}")).collect::<String>();
@@ -163,6 +169,7 @@ pub fn dump_crdt_state(story_id_filter: Option<&str>) -> CrdtStateDump {
claim_ts,
content_index,
is_deleted: op.is_deleted,
origin,
});
}
@@ -408,6 +415,11 @@ pub(super) fn extract_item_view(item: &PipelineItemCrdt) -> Option<PipelineItemV
_ => None,
};
let origin = match item.origin.view() {
JsonValue::String(s) if !s.is_empty() => Some(s),
_ => None,
};
let stage = project_stage_for_view(
&stage_str,
&story_id,
@@ -429,6 +441,7 @@ pub(super) fn extract_item_view(item: &PipelineItemCrdt) -> Option<PipelineItemV
qa_mode,
item_type,
epic,
origin,
})
}
@@ -585,56 +598,48 @@ fn project_stage_for_view(
}
}
/// Check whether a dependency (by numeric ID prefix) is in `5_done` or `6_archived`
/// according to CRDT state.
/// Check whether a dependency (by numeric ID prefix) is in `Pipeline::Done` or
/// `Pipeline::Archived` according to CRDT state.
///
/// Returns `true` if the dependency is satisfied (item found in a done stage).
/// Matches both legacy slug-form IDs (`"664_story_foo"`) and numeric-only IDs
/// (`"664"`) so the check remains correct after the slug→numeric migration.
/// See `dep_is_archived_crdt` to distinguish archive-satisfied from cleanly-done.
/// Returns `true` if the dependency is satisfied (item found in a Done or
/// Archived pipeline column). Matches both legacy slug-form IDs
/// (`"664_story_foo"`) and numeric-only IDs (`"664"`) so the check remains
/// correct after the slug→numeric migration. Story 1086 routes the check
/// through the `Pipeline` projection so that future Stage variants automatically
/// participate via [`crate::pipeline_state::Stage::pipeline`]. See
/// `dep_is_archived_crdt` to distinguish archive-satisfied from cleanly-done.
pub fn dep_is_done_crdt(dep_number: u32) -> bool {
use crate::pipeline_state::{Stage, read_all_typed};
use crate::pipeline_state::{Pipeline, read_all_typed};
let exact = dep_number.to_string();
let prefix = format!("{dep_number}_");
read_all_typed().into_iter().any(|item| {
(item.story_id.0 == exact || item.story_id.0.starts_with(&prefix))
&& matches!(
item.stage,
Stage::Done { .. }
| Stage::Archived { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
| Stage::Rejected { .. }
)
&& matches!(item.stage.pipeline(), Pipeline::Done | Pipeline::Archived)
})
}
/// Check whether a dependency (by numeric ID prefix) is specifically in `6_archived`
/// according to CRDT state.
/// Check whether a dependency (by numeric ID prefix) is specifically in
/// `Pipeline::Archived` according to CRDT state.
///
/// Used to detect when a dependency is satisfied via archive rather than via a clean
/// completion through `5_done`. Returns `false` when the CRDT layer is not initialised.
/// Matches both legacy slug-form IDs (`"664_story_foo"`) and numeric-only IDs (`"664"`).
/// completion through `Pipeline::Done`. Returns `false` when the CRDT layer is not
/// initialised. Matches both legacy slug-form IDs (`"664_story_foo"`) and
/// numeric-only IDs (`"664"`).
pub fn dep_is_archived_crdt(dep_number: u32) -> bool {
use crate::pipeline_state::{Stage, read_all_typed};
use crate::pipeline_state::{Pipeline, read_all_typed};
let exact = dep_number.to_string();
let prefix = format!("{dep_number}_");
read_all_typed().into_iter().any(|item| {
(item.story_id.0 == exact || item.story_id.0.starts_with(&prefix))
&& matches!(
item.stage,
Stage::Archived { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
| Stage::Rejected { .. }
)
&& item.stage.pipeline() == Pipeline::Archived
})
}
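The ID matching and the `Pipeline` projection used by the two dependency checks can be sketched in isolation. The enums below are trimmed-down, hypothetical stand-ins for the real `Stage` and `Pipeline` types, and the mapping of `Abandoned` to the Archived column is an assumption inferred from the removed match arms.

```rust
/// Hypothetical, reduced pipeline columns.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Pipeline {
    Backlog,
    Coding,
    Done,
    Archived,
}

/// Hypothetical, reduced stage set.
#[derive(Clone, Copy)]
enum Stage {
    Backlog,
    Coding,
    Done,
    Archived,
    Abandoned,
}

impl Stage {
    /// Project a stage onto its pipeline column.
    fn pipeline(self) -> Pipeline {
        match self {
            Stage::Backlog => Pipeline::Backlog,
            Stage::Coding => Pipeline::Coding,
            Stage::Done => Pipeline::Done,
            // Assumption: abandoned items sit in the Archived column.
            Stage::Archived | Stage::Abandoned => Pipeline::Archived,
        }
    }
}

/// Match numeric-only IDs ("664") and legacy slug IDs ("664_story_foo").
fn id_matches(story_id: &str, dep_number: u32) -> bool {
    let exact = dep_number.to_string();
    let prefix = format!("{dep_number}_");
    story_id == exact || story_id.starts_with(&prefix)
}

/// A dependency is satisfied when a matching item is in Done or Archived.
fn dep_is_done(items: &[(&str, Stage)], dep_number: u32) -> bool {
    items.iter().any(|(id, stage)| {
        id_matches(id, dep_number)
            && matches!(stage.pipeline(), Pipeline::Done | Pipeline::Archived)
    })
}
```

The trailing underscore in the prefix is what keeps `"6640_story"` from matching dependency 664. Routing through `pipeline()` means a new stage variant only needs a projection arm, not an update to every `matches!` list.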
/// Check unmet dependencies for a story by reading its `depends_on` from the
/// CRDT document and checking each dependency against CRDT state.
///
/// Returns the list of dependency numbers that are NOT in `5_done` or `6_archived`.
/// Returns the list of dependency numbers whose stage is NOT in `Pipeline::Done`
/// or `Pipeline::Archived`.
pub fn check_unmet_deps_crdt(story_id: &str) -> Vec<u32> {
let item = match read_item(story_id) {
Some(i) => i,
+30
@@ -105,6 +105,26 @@ pub struct PipelineItemCrdt {
/// means no merge task is in flight. Projected into `Stage::Merge {
/// server_start_time }` so callers never read this register directly.
pub merge_server_start: LwwRegisterCrdt<f64>,
/// Story 1086: kebab-case wire form of the [`crate::pipeline_state::Pipeline`]
/// projection of the current `stage`. Written by `write_item` alongside
/// `stage` so display/scan code on remote peers can route by pipeline column
/// without re-deriving from the stage string. Empty string means "use the
/// value derived from `stage`" (legacy items predating 1086).
pub pipeline: LwwRegisterCrdt<String>,
/// Story 1086: kebab-case wire form of the [`crate::pipeline_state::Status`]
/// projection of the current `stage`. Written alongside `stage` so badge
/// renderers can read the status directly without re-projecting from the
/// stage string. Empty string means "use the value derived from `stage`"
/// (legacy items predating 1086).
pub status: LwwRegisterCrdt<String>,
/// Story 1088: origin of the work item — who or what created it.
///
/// Stored as a compact JSON string, e.g.
/// `{"kind":"user","id":"","ts":1716768000.0}` or
/// `{"kind":"agent","id":"coder-1","ts":1716768000.0}`.
/// Empty string on older items that pre-date this register; the typed
/// read path surfaces those as `None`, which the UI renders as `"unknown"`.
pub origin: LwwRegisterCrdt<String>,
}
/// CRDT node that holds a single peer's presence entry.
@@ -203,6 +223,9 @@ pub struct WorkItem {
pub(super) item_type: Option<crate::io::story_metadata::ItemType>,
/// Epic this item belongs to. `None` when the item has no parent epic.
pub(super) epic: Option<EpicId>,
/// Origin of the work item (story 1088). `None` for items created before
/// the origin register was introduced; those display as `"unknown"`.
pub(super) origin: Option<String>,
}
impl WorkItem {
@@ -261,6 +284,12 @@ impl WorkItem {
self.epic
}
/// Origin of the work item (story 1088), or `None` for items created before
/// the origin register was introduced.
pub fn origin(&self) -> Option<&str> {
self.origin.as_deref()
}
/// Construct a `WorkItem` for use in tests outside `crdt_state::*`.
///
/// Within `crdt_state` use a struct literal directly (fields are `pub(super)`).
@@ -286,6 +315,7 @@ impl WorkItem {
qa_mode,
item_type,
epic,
origin: None,
}
}
}
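The read path for the `origin` register, where an empty string means "no origin recorded", can be sketched as follows. The function names are hypothetical; only the empty-string-to-`None` rule and the `"unknown"` display fallback come from the doc comments above.

```rust
/// Treat an empty register value as "no origin" (items that pre-date the register).
/// Hypothetical helper mirroring the typed read path.
fn origin_view(raw: &str) -> Option<&str> {
    if raw.is_empty() {
        None
    } else {
        Some(raw)
    }
}

/// What a UI might display: the stored JSON string, or "unknown" when absent.
fn origin_label(raw: &str) -> &str {
    origin_view(raw).unwrap_or("unknown")
}
```

Keeping the fallback in the display layer rather than in the register means older items never need a backfill write.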
+120 -25
@@ -21,21 +21,26 @@ use crate::pipeline_state::{AgentClaim, Stage, stage_dir_name};
///
/// Returns `true` if the item was found and the op was applied, `false` otherwise.
pub fn set_depends_on(story_id: &str, deps: &[u32]) -> bool {
let Some(state_mutex) = get_crdt() else {
return false;
};
let Ok(mut state) = state_mutex.lock() else {
return false;
};
let Some(&idx) = state.index.get(story_id) else {
return false;
};
let value = if deps.is_empty() {
String::new()
} else {
serde_json::to_string(deps).unwrap_or_default()
};
apply_and_persist(&mut state, |s| s.crdt.doc.items[idx].depends_on.set(value));
{
let Some(state_mutex) = get_crdt() else {
return false;
};
let Ok(mut state) = state_mutex.lock() else {
return false;
};
let Some(&idx) = state.index.get(story_id) else {
return false;
};
let value = if deps.is_empty() {
String::new()
} else {
serde_json::to_string(deps).unwrap_or_default()
};
apply_and_persist(&mut state, |s| s.crdt.doc.items[idx].depends_on.set(value));
}
// Drop the CRDT lock before calling sync: read_item acquires the same
// mutex and would deadlock if the lock were still held here.
crate::db::ops::sync_item_depends_on(story_id);
true
}
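The reason for the extra block scope in `set_depends_on` is that `std::sync::Mutex` is not re-entrant: calling a helper that locks the same mutex while the guard is still alive would deadlock. A minimal sketch of the pattern, using a hypothetical `Store` type rather than the real CRDT state:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical store guarded by a single mutex.
struct Store {
    items: Mutex<HashMap<String, Vec<u32>>>,
}

impl Store {
    /// Reader that takes the lock itself, like a `read_item`-style helper.
    fn read_deps(&self, id: &str) -> Option<Vec<u32>> {
        let guard = self.items.lock().ok()?;
        guard.get(id).cloned()
    }

    /// Writer that releases the lock before calling the reader.
    fn set_deps(&self, id: &str, deps: &[u32]) -> bool {
        {
            let Ok(mut guard) = self.items.lock() else {
                return false;
            };
            let Some(slot) = guard.get_mut(id) else {
                return false;
            };
            *slot = deps.to_vec();
        } // The guard is dropped here, so the mutex is free again.

        // Safe: `read_deps` can now acquire the same mutex.
        self.read_deps(id).is_some()
    }
}
```

An explicit `drop(guard)` before the follow-up call, as `set_name` does, achieves the same thing; the block scope just makes the lock's lifetime visible in the code's shape.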
@@ -155,6 +160,9 @@ pub fn set_name(story_id: &str, name: Option<&str>) -> bool {
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].name.set(value.clone())
});
// Drop the lock before the shadow write so `read_item` can acquire it.
drop(state);
crate::db::sync_item_name(story_id);
true
}
@@ -175,16 +183,21 @@ pub fn set_agent(story_id: &str, agent: Option<crate::config::AgentName>) -> boo
let Some(state_mutex) = get_crdt() else {
return false;
};
let Ok(mut state) = state_mutex.lock() else {
return false;
};
let Some(&idx) = state.index.get(story_id) else {
return false;
};
let value = agent.map(|a| a.as_str().to_string()).unwrap_or_default();
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].agent.set(value.clone())
});
{
let Ok(mut state) = state_mutex.lock() else {
return false;
};
let Some(&idx) = state.index.get(story_id) else {
return false;
};
let value = agent.map(|a| a.as_str().to_string()).unwrap_or_default();
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].agent.set(value.clone())
});
}
// Sync the updated agent to the SQLite shadow table. Must be called after
// releasing the CRDT mutex so read_item can re-acquire it without deadlock.
crate::db::ops::sync_item_agent(story_id);
true
}
@@ -235,6 +248,31 @@ pub fn set_plan_state(story_id: &str, state: crate::pipeline_state::PlanState) -
true
}
/// Set the `origin` CRDT register for a pipeline item (story 1088).
///
/// Writes a compact JSON string describing who or what created the item, e.g.
/// `{"kind":"user","id":"","ts":1716768000.0}` or
/// `{"kind":"agent","id":"coder-1","ts":1716768000.0}`.
///
/// Passing an empty string is treated as "no origin set" (equivalent to the
/// pre-1088 state for older items). Returns `true` if the item was found and
/// the op was applied, `false` otherwise.
pub fn set_origin(story_id: &str, origin: &str) -> bool {
let Some(state_mutex) = get_crdt() else {
return false;
};
let Ok(mut state) = state_mutex.lock() else {
return false;
};
let Some(&idx) = state.index.get(story_id) else {
return false;
};
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].origin.set(origin.to_string())
});
true
}
/// Write a pipeline item state through CRDT operations.
///
/// If the item exists, updates its registers. If not, inserts a new item
@@ -256,6 +294,11 @@ pub fn write_item(
merged_at: Option<f64>,
) {
let stage_str = stage_dir_name(stage);
// Story 1086: persist the typed Pipeline + Status projections alongside
// the stage register so subscribers/display code on remote peers can route
// by them without re-deriving from the stage string.
let pipeline_str = stage.pipeline().as_str();
let status_str = stage.status().as_str();
let claim: Option<&AgentClaim> = match stage {
Stage::Coding { claim, .. } => claim.as_ref(),
Stage::Merge { claim, .. } => claim.as_ref(),
@@ -311,6 +354,14 @@ pub fn write_item(
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].stage.set(stage_str.to_string())
});
// Story 1086: keep `pipeline` and `status` registers in lock-step with
// the stage write so subscribers/display can read them directly.
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].pipeline.set(pipeline_str.to_string())
});
apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].status.set(status_str.to_string())
});
if let Some(n) = name {
apply_and_persist(&mut state, |s| {
@@ -394,6 +445,10 @@ pub fn write_item(
"resume_to": "",
"plan_state": "",
"merge_server_start": merge_server_start_val,
// Story 1086: typed Pipeline + Status projections written at insert.
"pipeline": pipeline_str,
"status": status_str,
"origin": "",
})
.into();
@@ -424,6 +479,10 @@ pub fn write_item(
item.resume_to.advance_seq(floor);
item.plan_state.advance_seq(floor);
item.merge_server_start.advance_seq(floor);
// Story 1086.
item.pipeline.advance_seq(floor);
item.status.advance_seq(floor);
item.origin.advance_seq(floor);
}
// Broadcast a CrdtEvent for the new item.
@@ -510,6 +569,24 @@ pub fn set_retry_count(story_id: &str, count: i64) {
_ => return,
};
write_item(story_id, &new_stage, None, None, None, None);
if let Some(db) = crate::db::shadow_write::PIPELINE_DB.get() {
let stage = stage_dir_name(&new_stage).to_string();
let name = Some(item.name().to_string());
let agent = item.agent().map(|a| a.to_string());
let depends_on = (!item.depends_on().is_empty())
.then(|| serde_json::to_string(item.depends_on()).ok())
.flatten();
let msg = crate::db::shadow_write::PipelineWriteMsg {
story_id: story_id.to_string(),
stage,
name,
agent,
retry_count: Some(count.max(0)),
depends_on,
content: None,
};
let _ = db.tx.send(msg);
}
}
/// Increment `retries` by 1 and return the new value.
@@ -559,5 +636,23 @@ pub fn bump_retry_count(story_id: &str) -> i64 {
_ => return 0,
};
write_item(story_id, &new_stage, None, None, None, None);
if let Some(db) = crate::db::shadow_write::PIPELINE_DB.get() {
let stage = stage_dir_name(&new_stage).to_string();
let name = Some(item.name().to_string());
let agent = item.agent().map(|a| a.to_string());
let depends_on = (!item.depends_on().is_empty())
.then(|| serde_json::to_string(item.depends_on()).ok())
.flatten();
let msg = crate::db::shadow_write::PipelineWriteMsg {
story_id: story_id.to_string(),
stage,
name,
agent,
retry_count: Some(new_retries as i64),
depends_on,
content: None,
};
let _ = db.tx.send(msg);
}
new_retries as i64
}
+150
@@ -705,6 +705,59 @@ pub fn purge_done_stage_merge_jobs() {
slog!("[crdt] Purged {count} stale MergeJob entries for terminal-stage stories");
}
/// Delete `pipeline_items` rows that correspond to CRDT-tombstoned stories.
///
/// Pre-1094 code deleted pipeline_items via a fire-and-forget channel that
/// could be lost on an abrupt restart, leaving rows with non-terminal stage
/// values for stories that no longer exist in the CRDT. This migration
/// removes those zombie rows on startup.
///
/// Idempotent: rows already absent are unaffected; running twice produces the
/// same result.
pub async fn migrate_zombie_pipeline_rows() {
let pool = match crate::db::get_shared_pool() {
Some(p) => p,
None => return,
};
let tombstone_ids = crate::crdt_state::tombstoned_ids();
sweep_zombie_rows(pool, &tombstone_ids).await;
}
/// Inner sweep used by [`migrate_zombie_pipeline_rows`] and its tests.
///
/// Deletes every `pipeline_items` row whose id is listed in `ids` and whose
/// stage is not already a terminal value. Returns the number of rows deleted.
#[cfg_attr(test, allow(dead_code))]
pub(crate) async fn sweep_zombie_rows(pool: &sqlx::SqlitePool, ids: &[String]) -> u32 {
if ids.is_empty() {
return 0;
}
let mut cleaned = 0u32;
for story_id in ids {
match sqlx::query(
"DELETE FROM pipeline_items WHERE id = ?1 AND stage NOT IN \
('done','archived','abandoned','superseded','rejected')",
)
.bind(story_id)
.execute(pool)
.await
{
Ok(r) if r.rows_affected() > 0 => cleaned += 1,
Ok(_) => {}
Err(e) => {
slog!(
"[crdt] migrate_zombie_pipeline_rows: failed to delete '{}': {e}",
story_id
);
}
}
}
if cleaned > 0 {
slog!("[crdt] Swept {cleaned} zombie pipeline_items rows for tombstoned stories");
}
cleaned
}
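The sweep's rule (delete a tombstoned row only when its stage is non-terminal, and count the deletions) can be modelled without a database. This is a sketch that swaps the SQLite table for an in-memory `HashMap`; `sweep` and `TERMINAL` are hypothetical names, and the terminal stage list is copied from the SQL above.

```rust
use std::collections::HashMap;

/// Terminal stage values, as listed in the DELETE statement's NOT IN clause.
const TERMINAL: [&str; 5] = ["done", "archived", "abandoned", "superseded", "rejected"];

/// In-memory model of the sweep: `rows` maps story id to stage.
/// Removes tombstoned rows in a non-terminal stage and returns how many were removed.
fn sweep(rows: &mut HashMap<String, String>, tombstoned: &[&str]) -> u32 {
    let mut cleaned = 0u32;
    for id in tombstoned {
        let is_zombie = rows
            .get(*id)
            .map_or(false, |stage| !TERMINAL.contains(&stage.as_str()));
        if is_zombie {
            rows.remove(*id);
            cleaned += 1;
        }
    }
    cleaned
}
```

Idempotence falls out of the rule: after the first pass no tombstoned id maps to a non-terminal row, so a second pass finds nothing to delete and returns 0.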
#[cfg(test)]
mod merge_job_migration_tests {
use super::super::super::state::init_for_test;
@@ -909,3 +962,100 @@ mod merge_job_migration_tests {
migrate_merge_job(std::path::Path::new("/nonexistent/pipeline.db"));
}
}
#[cfg(test)]
mod zombie_row_migration_tests {
use super::super::super::state::init_for_test;
use super::*;
use sqlx::Row as _;
async fn make_pool() -> sqlx::SqlitePool {
let options = sqlx::sqlite::SqliteConnectOptions::new()
.filename(":memory:")
.create_if_missing(true);
let pool = sqlx::pool::PoolOptions::new()
.max_connections(1)
.connect_with(options)
.await
.unwrap();
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
pool
}
async fn insert_row(pool: &sqlx::SqlitePool, story_id: &str, stage: &str) {
let now = chrono::Utc::now().to_rfc3339();
sqlx::query(
"INSERT INTO pipeline_items \
(id, name, stage, agent, retry_count, depends_on, content, created_at, updated_at) \
VALUES (?1, ?2, ?3, NULL, 0, NULL, NULL, ?4, ?4)",
)
.bind(story_id)
.bind(story_id)
.bind(stage)
.bind(&now)
.execute(pool)
.await
.unwrap();
}
async fn row_stage(pool: &sqlx::SqlitePool, story_id: &str) -> Option<String> {
sqlx::query("SELECT stage FROM pipeline_items WHERE id = ?1")
.bind(story_id)
.fetch_optional(pool)
.await
.unwrap()
.map(|r| r.get(0))
}
/// Bug 1094 regression: delete a story in `coding` stage, assert the
/// `pipeline_items` row is gone; then re-run the sweep and confirm no
/// further changes (idempotent).
#[tokio::test]
async fn sweep_removes_zombie_coding_row_and_is_idempotent() {
init_for_test();
let pool = make_pool().await;
let story_id = "1094_zombie_regression";
// Seed: insert a pipeline_items row in the "coding" stage.
insert_row(&pool, story_id, "coding").await;
assert_eq!(row_stage(&pool, story_id).await.as_deref(), Some("coding"));
// Tombstone the story in the CRDT (simulate evict_item outcome).
crate::crdt_state::write_item_str(
story_id,
"coding",
Some("Zombie regression story"),
None,
None,
None,
);
crate::crdt_state::evict_item(story_id).ok();
// Run the sweep — row must be deleted.
let deleted = sweep_zombie_rows(&pool, &[story_id.to_string()]).await;
assert_eq!(deleted, 1, "expected one zombie row to be cleaned");
assert!(
row_stage(&pool, story_id).await.is_none(),
"pipeline_items row must be gone after sweep"
);
// Re-run is a no-op (idempotent).
let second = sweep_zombie_rows(&pool, &[story_id.to_string()]).await;
assert_eq!(second, 0, "second sweep must be a no-op");
}
/// Rows already in a terminal stage must be left alone.
#[tokio::test]
async fn sweep_skips_terminal_stage_rows() {
let pool = make_pool().await;
let story_id = "1094_terminal_skip";
insert_row(&pool, story_id, "done").await;
let deleted = sweep_zombie_rows(&pool, &[story_id.to_string()]).await;
assert_eq!(deleted, 0, "terminal-stage row must not be deleted");
assert!(
row_stage(&pool, story_id).await.is_some(),
"terminal-stage row must survive sweep"
);
}
}
+4 -4
@@ -10,14 +10,14 @@ mod migrations;
mod tests;
pub use item::{
bump_retry_count, set_agent, set_depends_on, set_epic, set_item_type, set_name, set_plan_state,
set_qa_mode, set_resume_to, set_resume_to_raw, set_retry_count, write_item,
bump_retry_count, set_agent, set_depends_on, set_epic, set_item_type, set_name, set_origin,
set_plan_state, set_qa_mode, set_resume_to, set_resume_to_raw, set_retry_count, write_item,
};
#[cfg(test)]
pub use item::write_item_str;
pub use migrations::{
migrate_legacy_stage_strings, migrate_merge_job, migrate_names_from_slugs,
migrate_node_claims_to_agent_claims, migrate_story_ids_to_numeric, name_from_story_id,
purge_done_stage_merge_jobs,
migrate_node_claims_to_agent_claims, migrate_story_ids_to_numeric,
migrate_zombie_pipeline_rows, name_from_story_id, purge_done_stage_merge_jobs,
};
+1
@@ -434,6 +434,7 @@ async fn handle_work_items_get(params: Value) -> Value {
"stage": c.stage,
"name": c.name,
"agent": c.agent,
"origin": c.origin,
}),
Err(e) => serde_json::json!({"error": e.to_string()}),
}
+15
@@ -60,6 +60,17 @@ pub enum ContentKey<'a> {
/// completion. Read by `get_merge_status` to surface gate output for the
/// "completed" state without a separate MergeJob CRDT register (story 1036).
MergeReport(&'a str),
/// Flag written by spawn.rs when a coder session exits with a non-zero exit
/// code (API error, network failure, or Claude-API-level budget exhaustion).
/// Prevents the stuck-respawn counter from incrementing for forced exits —
/// only self-exits with no file or read changes count toward the cap.
/// Consumed (read + deleted) by the commit-recovery path in pipeline advance.
CommitRecoveryForcedExit(&'a str),
/// Cumulative set of files read across all commit-recovery sessions for a
/// story, stored as a newline-separated sorted list. Used to detect whether
/// the agent made read-exploration progress even when the worktree diff did
/// not grow (story 1089, AC2). Cleared when a commit lands or the story blocks.
CommitRecoveryReadSet(&'a str),
}
impl<'a> ContentKey<'a> {
@@ -85,6 +96,10 @@ impl<'a> ContentKey<'a> {
ContentKey::MergeFailureKind(id) => format!("{id}:merge_failure_kind"),
ContentKey::MergeSuccess(id) => format!("{id}:merge_success"),
ContentKey::MergeReport(id) => format!("{id}:merge_report"),
ContentKey::CommitRecoveryForcedExit(id) => {
format!("{id}:commit_recovery_forced_exit")
}
ContentKey::CommitRecoveryReadSet(id) => format!("{id}:commit_recovery_read_set"),
}
}
}
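The `CommitRecoveryReadSet` value is described as a newline-separated sorted list that accumulates across sessions. A sketch of how such a value could be merged and checked for growth, assuming hypothetical helpers `merge_read_set` and `read_set_key` (only the key format and the encoding come from the diff):

```rust
use std::collections::BTreeSet;

/// Build the content-store key for a story's cumulative read set.
fn read_set_key(story_id: &str) -> String {
    format!("{story_id}:commit_recovery_read_set")
}

/// Merge newly read files into the stored newline-separated sorted list.
/// Returns the re-encoded value and whether the set grew (read progress).
fn merge_read_set(stored: &str, newly_read: &[&str]) -> (String, bool) {
    // A BTreeSet keeps entries unique and sorted.
    let mut set: BTreeSet<&str> = stored.lines().filter(|l| !l.is_empty()).collect();
    let before = set.len();
    set.extend(newly_read.iter().copied());
    let grew = set.len() > before;
    let encoded = set.into_iter().collect::<Vec<_>>().join("\n");
    (encoded, grew)
}
```

The `grew` flag is the interesting output here: a session that reads only files already in the set reports no progress, which is the signal a stuck-respawn counter would need.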
+11 -9
@@ -12,7 +12,7 @@
//! zombie entries left over from sessions that predate the subscriber.
use crate::db::{ContentKey, all_content_ids, delete_content};
use crate::pipeline_state::Stage;
use crate::pipeline_state::{Pipeline, Stage, Status};
use crate::slog;
use crate::slog_warn;
@@ -111,16 +111,18 @@ pub(crate) fn sweep_zombie_content_on_startup() {
}
}
/// Return `true` when `stage` is one of the five terminal pipeline stages.
/// Return `true` when `stage` is one of the terminal pipeline classifications.
///
/// Story 1086: matches via the [`Status`] projection (Done / Abandoned /
/// Superseded / Rejected) plus [`Pipeline::Archived`] for plain archived items
/// (which carry `Status::Active`). Future Stage variants automatically
/// participate by returning the appropriate Status / Pipeline from
/// [`Stage::status`] / [`Stage::pipeline`].
fn is_terminal_stage(stage: &Stage) -> bool {
matches!(
stage,
Stage::Done { .. }
| Stage::Archived { .. }
| Stage::Abandoned { .. }
| Stage::Superseded { .. }
| Stage::Rejected { .. }
)
stage.status(),
Status::Done | Status::Abandoned | Status::Superseded | Status::Rejected
) || matches!(stage.pipeline(), Pipeline::Archived)
}
#[cfg(test)]
+463 -2
@@ -28,8 +28,11 @@ pub mod recover;
pub mod shadow_write;
pub use content_store::{ContentKey, all_content_ids, delete_content, read_content, write_content};
pub use ops::{ItemMeta, delete_item, move_item_stage, next_item_number, write_item_with_content};
pub use shadow_write::{get_shared_pool, init};
pub use ops::{
ItemMeta, delete_item, delete_item_sync, move_item_stage, next_item_number, sync_item_name,
write_item_with_content,
};
pub use shadow_write::{check_schema_drift, get_shared_pool, init};
#[cfg(test)]
pub use content_store::ensure_content_store;
@@ -395,6 +398,112 @@ mod tests {
);
}
/// Regression: root cause of the 2026-05-14 21:07 production outage.
///
/// A headless agent on a feature branch (whose binary includes a new
/// sqlx migration) must NEVER apply that migration to the production
/// pipeline.db. Verify that opening an agent-local DB and running
/// migrations on it leaves the production DB's `_sqlx_migrations` table
/// unchanged.
///
/// The enforcement mechanism is in `init_subsystems(is_agent=true)`, which
/// redirects to a temp path. This test validates the SQLite isolation
/// property: migrations applied to one file are confined to that file.
#[tokio::test]
async fn agent_db_isolation_does_not_affect_production_db() {
let tmp = tempfile::tempdir().unwrap();
let prod_db_path = tmp.path().join("production.db");
let agent_db_path = tmp.path().join("agent_temp.db");
// Set up the production DB — apply the current compiled-in migrations.
let prod_opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&prod_db_path)
.create_if_missing(true);
let prod_pool = sqlx::SqlitePool::connect_with(prod_opts).await.unwrap();
sqlx::migrate!("./migrations")
.run(&prod_pool)
.await
.unwrap();
// Record the migration versions present in the production DB.
let before: Vec<(i64,)> =
sqlx::query_as("SELECT version FROM _sqlx_migrations ORDER BY version")
.fetch_all(&prod_pool)
.await
.unwrap();
// Simulate the agent opening its own isolated DB and running migrations.
let agent_opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&agent_db_path)
.create_if_missing(true);
let agent_pool = sqlx::SqlitePool::connect_with(agent_opts).await.unwrap();
sqlx::migrate!("./migrations")
.run(&agent_pool)
.await
.unwrap();
// Production DB must be completely unaffected by the agent's migration run.
let after: Vec<(i64,)> =
sqlx::query_as("SELECT version FROM _sqlx_migrations ORDER BY version")
.fetch_all(&prod_pool)
.await
.unwrap();
assert_eq!(
before, after,
"agent opening its own DB must not alter the production DB migration table"
);
}
/// Verify that `check_schema_drift` returns an empty list when all
/// migrations in the database are recognised by this binary.
#[tokio::test]
async fn check_schema_drift_empty_when_all_known() {
let tmp = tempfile::tempdir().unwrap();
let db_path = tmp.path().join("drift_test.db");
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&db_path)
.create_if_missing(true);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
let drift = super::shadow_write::check_schema_drift(&pool).await;
assert!(
drift.is_empty(),
"no drift expected when DB matches the compiled-in migration set"
);
}
/// Verify that `check_schema_drift` identifies a manually-inserted
/// migration row that is not part of the compiled-in set.
#[tokio::test]
async fn check_schema_drift_detects_unknown_migration() {
let tmp = tempfile::tempdir().unwrap();
let db_path = tmp.path().join("drift_future.db");
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&db_path)
.create_if_missing(true);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
// Inject a fake "future" migration that no binary compiled today would know.
let fake_checksum: Vec<u8> = vec![0u8; 20];
sqlx::query(
"INSERT INTO _sqlx_migrations \
(version, description, installed_on, success, checksum, execution_time) \
VALUES (99999999999999, 'future_migration', '2099-01-01T00:00:00Z', 1, ?1, 0)",
)
.bind(&fake_checksum)
.execute(&pool)
.await
.unwrap();
let drift = super::shadow_write::check_schema_drift(&pool).await;
assert_eq!(drift.len(), 1, "exactly one unknown migration expected");
assert_eq!(drift[0].version, 99999999999999_i64);
assert_eq!(drift[0].description, "future_migration");
}
/// Story 864: passing `ItemMeta::default()` against a content blob that
/// LOOKS like front-matter must NOT silently extract metadata into the
/// CRDT. The whole point of removing the implicit YAML round-trip is
@@ -482,4 +591,356 @@ mod tests {
"retry_count must reset to 0 on stage transition"
);
}
/// `shadow_write::init` spawns its background task on the calling runtime,
/// which under `#[tokio::test]` is per-test and dies when the test ends.
/// Park the init on a leaked multi-thread runtime so the bg task lives for
/// the whole test process; mirrors `db::ops::tests::ensure_shadow_db`.
#[cfg(test)]
static SHADOW_RT: std::sync::OnceLock<tokio::runtime::Runtime> = std::sync::OnceLock::new();
#[cfg(test)]
async fn ensure_shadow_db() {
static INIT: std::sync::OnceLock<()> = std::sync::OnceLock::new();
if INIT.get().is_some() {
return;
}
let rt = SHADOW_RT.get_or_init(|| {
tokio::runtime::Builder::new_multi_thread()
.worker_threads(1)
.enable_all()
.build()
.expect("shadow rt")
});
rt.spawn(async {
static INNER: std::sync::OnceLock<()> = std::sync::OnceLock::new();
if INNER.get().is_some() {
return;
}
let tmp = tempfile::tempdir().expect("tmp");
let db_path = tmp.path().join("pipeline.db");
std::mem::forget(tmp);
shadow_write::init(&db_path).await.expect("shadow init");
let _ = INNER.set(());
})
.await
.expect("shadow init task");
let _ = INIT.set(());
}
/// Regression for story 1095: `set_name` must propagate the new name to the
/// SQLite shadow table via `sync_item_name`. Before the fix, the CRDT
/// register was updated but `pipeline_items.name` stayed stale.
#[tokio::test]
async fn set_name_updates_shadow_name_column() {
crate::crdt_state::init_for_test();
ensure_content_store();
ensure_shadow_db().await;
let story_id = "9095_story_set_name_shadow";
write_item_with_content(
story_id,
"1_backlog",
"---\nname: Original Name\n---\n",
ItemMeta::named("Original Name"),
);
// Wait for the initial insert to land.
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
// Rename via the CRDT setter — now also triggers sync_item_name.
crate::crdt_state::set_name(story_id, Some("Updated Name"));
// Wait for the background write task to flush.
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
// Open a fresh pool on this test's runtime — sqlx pools are not safe
// to share across runtimes, so we can't reuse `get_shared_pool()`
// (which was created on the leaked shadow-write runtime).
let path = shadow_write::SHADOW_DB_PATH
.get()
.expect("SHADOW_DB_PATH set by init");
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(path)
.create_if_missing(false);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
let row: (Option<String>,) =
sqlx::query_as("SELECT name FROM pipeline_items WHERE id = ?1")
.bind(story_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
row.0.as_deref(),
Some("Updated Name"),
"set_name must propagate the new name to the shadow table"
);
}
/// Bug 1098: `bump_retry_count` must mirror the new value to the SQLite
/// shadow table, not only to the CRDT register.
///
/// Before the fix, calling `bump_retry_count` updated the CRDT but left
/// `pipeline_items.retry_count` stale.
#[tokio::test]
async fn bump_retry_count_updates_shadow_table() {
crate::crdt_state::init_for_test();
ensure_content_store();
ensure_shadow_db().await;
let story_id = "9899_story_retry_shadow_1098";
// Insert the story into both CRDT and the shadow table.
write_item_with_content(
story_id,
"2_current",
"# Retry shadow test\n",
ItemMeta::named("Retry Shadow Test"),
);
// Let the background write task process the initial insert.
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
// Three bumps → retry_count must reach 3 in SQLite.
crate::crdt_state::bump_retry_count(story_id);
crate::crdt_state::bump_retry_count(story_id);
crate::crdt_state::bump_retry_count(story_id);
// Let the background write task process all three updates.
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
let path = shadow_write::SHADOW_DB_PATH
.get()
.expect("SHADOW_DB_PATH set by init");
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(path)
.create_if_missing(false);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
let (count,): (i64,) =
sqlx::query_as("SELECT retry_count FROM pipeline_items WHERE id = ?1")
.bind(story_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
count, 3,
"retry_count must be 3 after three bump_retry_count calls"
);
}
/// Story 1087, AC2: the split-stage migration projects every supported
/// wire-form `stage` string into the canonical `(pipeline, status)` pair.
/// The fixture covers each Stage variant (and the legacy numeric-prefix
/// directory names retained for back-compat).
#[tokio::test]
async fn split_stage_migration_backfills_pipeline_and_status_for_every_variant() {
let tmp = tempfile::tempdir().unwrap();
let db_path = tmp.path().join("pipeline.db");
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&db_path)
.create_if_missing(true);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
// (stage written by older code, expected pipeline, expected status)
let fixture: &[(&str, &str, &str)] = &[
("upcoming", "backlog", "active"),
("backlog", "backlog", "active"),
("coding", "coding", "active"),
("blocked", "coding", "blocked"),
("qa", "qa", "active"),
("review_hold", "qa", "review-hold"),
("merge", "merge", "active"),
("merge_failure", "merge", "merge-failure"),
("merge_failure_final", "merge", "merge-failure-final"),
("done", "done", "done"),
("abandoned", "closed", "abandoned"),
("superseded", "closed", "superseded"),
("rejected", "closed", "rejected"),
("archived", "archived", "active"),
("frozen", "coding", "frozen"),
// Legacy numeric-prefix directory names.
("1_backlog", "backlog", "active"),
("2_current", "coding", "active"),
("3_qa", "qa", "active"),
("4_merge", "merge", "active"),
("5_done", "done", "done"),
("6_archived", "archived", "active"),
];
let now = chrono::Utc::now().to_rfc3339();
for (idx, (stage, _, _)) in fixture.iter().enumerate() {
let id = format!("1087_fixture_{idx}");
sqlx::query(
"INSERT INTO pipeline_items \
(id, name, stage, agent, retry_count, depends_on, content, created_at, updated_at) \
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?8)",
)
.bind(&id)
.bind("fixture")
.bind(*stage)
.bind(Option::<String>::None)
.bind(Option::<i64>::None)
.bind(Option::<String>::None)
.bind("---\nname: fixture\n---\n")
.bind(&now)
.execute(&pool)
.await
.unwrap();
}
// Force the split-stage backfill to run against the rows we just
// inserted. In production this is `sqlx::migrate!`'s job, but the
// sqlx migrator only runs migrations once per DB and they were already
// applied at the top of the test before any rows existed. Reissuing
// the backfill statements is the migration logic under test.
sqlx::query(
"UPDATE pipeline_items SET pipeline = CASE stage \
WHEN 'upcoming' THEN 'backlog' \
WHEN 'backlog' THEN 'backlog' \
WHEN '1_backlog' THEN 'backlog' \
WHEN 'coding' THEN 'coding' \
WHEN 'blocked' THEN 'coding' \
WHEN '2_current' THEN 'coding' \
WHEN 'qa' THEN 'qa' \
WHEN 'review_hold' THEN 'qa' \
WHEN '3_qa' THEN 'qa' \
WHEN 'merge' THEN 'merge' \
WHEN 'merge_failure' THEN 'merge' \
WHEN 'merge_failure_final' THEN 'merge' \
WHEN '4_merge' THEN 'merge' \
WHEN 'done' THEN 'done' \
WHEN '5_done' THEN 'done' \
WHEN 'abandoned' THEN 'closed' \
WHEN 'superseded' THEN 'closed' \
WHEN 'rejected' THEN 'closed' \
WHEN 'archived' THEN 'archived' \
WHEN '6_archived' THEN 'archived' \
WHEN 'frozen' THEN 'coding' \
ELSE '' END",
)
.execute(&pool)
.await
.unwrap();
sqlx::query(
"UPDATE pipeline_items SET status = CASE stage \
WHEN 'frozen' THEN 'frozen' \
WHEN 'review_hold' THEN 'review-hold' \
WHEN 'blocked' THEN 'blocked' \
WHEN 'merge_failure' THEN 'merge-failure' \
WHEN 'merge_failure_final' THEN 'merge-failure-final' \
WHEN 'abandoned' THEN 'abandoned' \
WHEN 'superseded' THEN 'superseded' \
WHEN 'rejected' THEN 'rejected' \
WHEN 'done' THEN 'done' \
WHEN '5_done' THEN 'done' \
ELSE 'active' END",
)
.execute(&pool)
.await
.unwrap();
for (idx, (stage_input, expect_pipeline, expect_status)) in fixture.iter().enumerate() {
let id = format!("1087_fixture_{idx}");
let row: (String, String) =
sqlx::query_as("SELECT pipeline, status FROM pipeline_items WHERE id = ?1")
.bind(&id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
row.0, *expect_pipeline,
"stage {stage_input:?} should backfill pipeline to {expect_pipeline:?}, got {:?}",
row.0
);
assert_eq!(
row.1, *expect_status,
"stage {stage_input:?} should backfill status to {expect_status:?}, got {:?}",
row.1
);
}
}
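For readers who find the two SQL `CASE` statements hard to scan, the same backfill mapping can be written as a pure Rust function. This is only a sketch: `project_stage` is a hypothetical name and is not part of the diff; the string values are copied from the fixture and the `UPDATE` statements above.

```rust
/// Project a legacy `stage` string onto `(pipeline, status)`, following the
/// same cases as the two `UPDATE ... CASE stage` statements in the test.
/// Unknown stages map to an empty pipeline and the `active` status, matching
/// the `ELSE` branches of the SQL.
pub fn project_stage(stage: &str) -> (&'static str, &'static str) {
    let pipeline = match stage {
        "upcoming" | "backlog" | "1_backlog" => "backlog",
        "coding" | "blocked" | "frozen" | "2_current" => "coding",
        "qa" | "review_hold" | "3_qa" => "qa",
        "merge" | "merge_failure" | "merge_failure_final" | "4_merge" => "merge",
        "done" | "5_done" => "done",
        "abandoned" | "superseded" | "rejected" => "closed",
        "archived" | "6_archived" => "archived",
        _ => "",
    };
    let status = match stage {
        "frozen" => "frozen",
        "review_hold" => "review-hold",
        "blocked" => "blocked",
        "merge_failure" => "merge-failure",
        "merge_failure_final" => "merge-failure-final",
        "abandoned" => "abandoned",
        "superseded" => "superseded",
        "rejected" => "rejected",
        "done" | "5_done" => "done",
        _ => "active",
    };
    (pipeline, status)
}
```

A function like this could serve as a second source of truth to cross-check the SQL against, although the test as written deliberately exercises the SQL itself.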
/// Story 1087, AC1: `shadow_write::init` writes a timestamped backup of
/// pipeline.db before the split-stage migration applies, and skips the
/// backup on subsequent restarts (after the migration is recorded).
#[tokio::test]
async fn pre_pipeline_status_backup_only_runs_once() {
let tmp = tempfile::tempdir().unwrap();
let db_path = tmp.path().join("pipeline.db");
// Seed a "pre-1087" DB, i.e. one whose `_sqlx_migrations` table lists
// every migration EXCEPT the split-stage one. Running only the legacy
// migrations is awkward here, so the simplest way to simulate that state
// is to apply the full set and then remove the split-stage row by hand.
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&db_path)
.create_if_missing(true);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
// Apply migrations the normal way, then delete the split-stage row so
// the backup branch fires on the next `init`.
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
sqlx::query("DELETE FROM _sqlx_migrations WHERE version = 20260515000000")
.execute(&pool)
.await
.unwrap();
pool.close().await;
// First call: backup branch fires, side-car file appears.
super::shadow_write::backup_pre_pipeline_status(&db_path).await;
let backups: Vec<_> = std::fs::read_dir(tmp.path())
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name()
.to_string_lossy()
.contains(".pre-pipeline-status.")
})
.collect();
assert_eq!(
backups.len(),
1,
"expected exactly one .pre-pipeline-status backup, got {}",
backups.len()
);
// Re-apply the migration so the marker row is back, simulating a
// post-migration server restart.
let opts = sqlx::sqlite::SqliteConnectOptions::new()
.filename(&db_path)
.create_if_missing(false);
let pool = sqlx::SqlitePool::connect_with(opts).await.unwrap();
let fake_checksum: Vec<u8> = vec![0u8; 20];
sqlx::query(
"INSERT INTO _sqlx_migrations \
(version, description, installed_on, success, checksum, execution_time) \
VALUES (20260515000000, 'split_stage_into_pipeline_status', '2026-05-15T00:00:00Z', 1, ?1, 0)",
)
.bind(&fake_checksum)
.execute(&pool)
.await
.unwrap();
pool.close().await;
// Second call: no new backup written.
super::shadow_write::backup_pre_pipeline_status(&db_path).await;
let backups_after: Vec<_> = std::fs::read_dir(tmp.path())
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name()
.to_string_lossy()
.contains(".pre-pipeline-status.")
})
.collect();
assert_eq!(
backups_after.len(),
1,
"post-migration init must not create another backup; got {} backups",
backups_after.len()
);
}
}
+227
@@ -176,6 +176,43 @@ pub fn move_item_stage(
}
}
/// Shadow-write the updated agent field for an existing pipeline item.
///
/// Called by [`crate::crdt_state::set_agent`] after the CRDT register is updated
/// so `pipeline_items.agent` stays in sync. Reads the full current metadata from
/// the CRDT (stage, name, depends_on, retry_count) to avoid overwriting other
/// columns with stale values — only the `agent` column carries the new data.
pub fn sync_item_agent(story_id: &str) {
let Some(db) = PIPELINE_DB.get() else {
return;
};
let Some(view) = crate::crdt_state::read_item(story_id) else {
return;
};
let stage = view.stage().dir_name().to_string();
let name = Some(view.name().to_string());
let agent = view.agent().map(|a| a.as_str().to_string());
let depends_on = {
let d = view.depends_on();
if d.is_empty() {
None
} else {
serde_json::to_string(d).ok()
}
};
let retry_count = Some(i64::from(view.retry_count()));
let msg = PipelineWriteMsg {
story_id: story_id.to_string(),
stage,
name,
agent,
retry_count,
depends_on,
content: None,
};
let _ = db.tx.send(msg);
}
/// Delete a story from the shadow table (fire-and-forget).
pub fn delete_item(story_id: &str) {
delete_content(ContentKey::Story(story_id));
@@ -198,6 +235,111 @@ pub fn delete_item(story_id: &str) {
}
}
/// Delete a story from the shadow table, awaiting the SQLite write.
///
/// Unlike [`delete_item`], this function issues a direct `DELETE FROM
/// pipeline_items` via the shared pool and awaits the result — so the row
/// is gone before this function returns. Use this from async call sites
/// where durability of the deletion matters (e.g. story deletion, startup
/// migration). Falls back to the fire-and-forget channel when the shared
/// pool is not yet initialised.
pub async fn delete_item_sync(story_id: &str) {
delete_content(ContentKey::Story(story_id));
if let Some(pool) = super::shadow_write::get_shared_pool() {
if let Err(e) = sqlx::query("DELETE FROM pipeline_items WHERE id = ?1")
.bind(story_id)
.execute(pool)
.await
{
crate::slog_warn!(
"[db] Synchronous delete from pipeline_items failed for '{}': {e}",
story_id
);
}
} else if let Some(db) = PIPELINE_DB.get() {
let msg = PipelineWriteMsg {
story_id: story_id.to_string(),
stage: "deleted".to_string(),
name: None,
agent: None,
retry_count: None,
depends_on: None,
content: None,
};
let _ = db.tx.send(msg);
}
}
/// Sync the shadow table's `name` column after a CRDT name-register write.
///
/// Reads the current item from the CRDT (which already holds the new name after
/// `apply_and_persist`) and sends a `PipelineWriteMsg` so the SQLite mirror
/// stays in sync. All other columns (stage, agent, retry_count, depends_on)
/// are preserved from the live CRDT view; `content` is left as `None` so the
/// UPSERT's `COALESCE` keeps the existing value.
///
/// No-ops if the DB is not initialised or the item is not in the CRDT.
pub fn sync_item_name(story_id: &str) {
let Some(db) = PIPELINE_DB.get() else { return };
let Some(view) = crate::crdt_state::read_item(story_id) else {
return;
};
let depends_on = {
let d = view.depends_on();
if d.is_empty() {
None
} else {
serde_json::to_string(d).ok()
}
};
let msg = PipelineWriteMsg {
story_id: story_id.to_string(),
stage: view.stage().dir_name().to_string(),
name: Some(view.name().to_string()),
agent: view.agent().map(|a| a.to_string()),
retry_count: Some(view.retry_count() as i64),
depends_on,
content: None,
};
let _ = db.tx.send(msg);
}
/// Sync the `depends_on` field of a pipeline item from the CRDT to the shadow table.
///
/// Called after [`crate::crdt_state::set_depends_on`] updates the CRDT register so
/// that the SQLite shadow table stays in lock-step. Reads the full current view from
/// the CRDT (stage, name, agent, retry_count, depends_on) and sends a
/// [`PipelineWriteMsg`] over [`PIPELINE_DB`]`.tx`, following the same pattern as
/// [`move_item_stage`]. No-op when the shadow DB is uninitialised or the
/// story_id is not found in the CRDT.
pub fn sync_item_depends_on(story_id: &str) {
let Some(db) = PIPELINE_DB.get() else {
return;
};
let Some(view) = crate::crdt_state::read_item(story_id) else {
return;
};
let depends_on = {
let d = view.depends_on();
if d.is_empty() {
None
} else {
serde_json::to_string(d).ok()
}
};
let msg = PipelineWriteMsg {
story_id: story_id.to_string(),
stage: view.stage().dir_name().to_string(),
name: Some(view.name().to_string()),
agent: view.agent().map(|a| a.to_string()),
retry_count: Some(view.retry_count() as i64),
depends_on,
content: None,
};
let _ = db.tx.send(msg);
}
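All three `sync_item_*` helpers above encode `depends_on` the same way: an empty list becomes `None`, which becomes a SQL `NULL`, and a non-empty list becomes a compact JSON array. A dependency-free sketch of that rule follows. `encode_depends_on` is a hypothetical helper, the `u32` id type is an assumption, and the real code uses `serde_json::to_string` rather than hand-formatting.

```rust
/// Encode a dependency list for the `pipeline_items.depends_on` column.
/// Empty lists are stored as NULL (`None`); non-empty lists are stored as a
/// compact JSON array such as `[1,2]`.
pub fn encode_depends_on(ids: &[u32]) -> Option<String> {
    if ids.is_empty() {
        return None;
    }
    let body = ids
        .iter()
        .map(|n| n.to_string())
        .collect::<Vec<_>>()
        .join(",");
    Some(format!("[{body}]"))
}
```

Factoring the rule out like this would also remove the block that is currently repeated in `sync_item_agent`, `sync_item_name`, and `sync_item_depends_on`.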
/// Get the next available item number by scanning the CRDT state, the
/// in-memory content store, AND the tombstone set for the highest existing
/// number.
@@ -248,3 +390,88 @@ pub fn next_item_number() -> u32 {
max_num + 1
}
#[cfg(test)]
mod tests {
use super::*;
use crate::db::shadow_write;
/// `shadow_write::init` spawns its background task on the calling runtime.
/// Under `#[tokio::test]` that runtime is per-test and drops when the test
/// ends, killing the task. This OnceLock holds a multi-thread runtime that
/// persists for the lifetime of the test binary so the write loop stays alive
/// across all tests that share `PIPELINE_DB`.
static SHADOW_RT: std::sync::OnceLock<tokio::runtime::Runtime> = std::sync::OnceLock::new();
async fn ensure_shadow_db() {
static INIT: std::sync::OnceLock<()> = std::sync::OnceLock::new();
if INIT.get().is_some() {
return;
}
let rt = SHADOW_RT.get_or_init(|| {
tokio::runtime::Builder::new_multi_thread()
.worker_threads(1)
.enable_all()
.build()
.expect("shadow rt")
});
rt.spawn(async {
static INNER: std::sync::OnceLock<()> = std::sync::OnceLock::new();
if INNER.get().is_some() {
return;
}
let tmp = tempfile::tempdir().expect("tmp");
let db_path = tmp.path().join("pipeline.db");
std::mem::forget(tmp);
shadow_write::init(&db_path).await.expect("shadow init");
let _ = INNER.set(());
})
.await
.expect("shadow init task");
let _ = INIT.set(());
}
/// Regression test for story 1097: `set_depends_on` must sync the shadow
/// table. Before the fix, the CRDT register was updated but the
/// `pipeline_items.depends_on` column was never written.
#[tokio::test]
async fn set_depends_on_syncs_shadow_table() {
crate::crdt_state::init_for_test();
ensure_content_store();
ensure_shadow_db().await;
let story_id = "1097_story_depends_on_shadow_drift";
// Insert the story so it exists in both the CRDT and the shadow table.
write_item_with_content(
story_id,
"backlog",
"---\nname: Depends On Shadow Drift\n---\n",
ItemMeta::named("Depends On Shadow Drift"),
);
// Let the initial shadow write land.
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
// This is the write under test: it must update the shadow table.
let ok = crate::crdt_state::set_depends_on(story_id, &[1, 2]);
assert!(ok, "set_depends_on must return true for an existing item");
// Let the shadow write land.
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
let pool = shadow_write::get_shared_pool().expect("pool must be initialised");
let row: (Option<String>,) =
sqlx::query_as("SELECT depends_on FROM pipeline_items WHERE id = ?1")
.bind(story_id)
.fetch_one(pool)
.await
.expect("row must exist in shadow table");
assert_eq!(
row.0.as_deref(),
Some("[1,2]"),
"pipeline_items.depends_on must reflect the set_depends_on call"
);
}
}
+126 -10
@@ -11,10 +11,23 @@ use crate::slog;
use sqlx::SqlitePool;
use sqlx::sqlite::SqliteConnectOptions;
use std::collections::HashMap;
use std::collections::HashSet;
use std::path::Path;
use std::sync::OnceLock;
use tokio::sync::mpsc;
/// One migration row in the live database that is not in the compiled-in set.
///
/// Returned by [`check_schema_drift`] for each unknown migration.
pub struct UnknownMigration {
/// sqlx migration version number (derived from the filename timestamp).
pub version: i64,
/// Human-readable description from the migration filename.
pub description: String,
/// When the migration was applied, as stored in `_sqlx_migrations.installed_on`.
pub installed_on: String,
}
/// The process-global SQLite pool, set once by [`init`].
///
/// Other modules call [`get_shared_pool`] to access the pool without needing
@@ -28,23 +41,30 @@ pub fn get_shared_pool() -> Option<&'static SqlitePool> {
}
/// A pending shadow write for one pipeline item.
pub(super) struct PipelineWriteMsg {
pub(super) story_id: String,
pub(super) stage: String,
pub(super) name: Option<String>,
pub(super) agent: Option<String>,
pub(super) retry_count: Option<i64>,
pub(super) depends_on: Option<String>,
pub(super) content: Option<String>,
pub(crate) struct PipelineWriteMsg {
pub(crate) story_id: String,
pub(crate) stage: String,
pub(crate) name: Option<String>,
pub(crate) agent: Option<String>,
pub(crate) retry_count: Option<i64>,
pub(crate) depends_on: Option<String>,
pub(crate) content: Option<String>,
}
/// Handle to the background shadow-write task.
pub struct PipelineDb {
pub(super) tx: mpsc::UnboundedSender<PipelineWriteMsg>,
pub(crate) tx: mpsc::UnboundedSender<PipelineWriteMsg>,
}
/// Process-global handle to the background shadow-write task, set once during `init`.
pub(super) static PIPELINE_DB: OnceLock<PipelineDb> = OnceLock::new();
pub(crate) static PIPELINE_DB: OnceLock<PipelineDb> = OnceLock::new();
/// Path of the SQLite file opened by [`init`], set once by the first successful caller.
///
/// Tests that need to open their own pool (because sqlx pools are not safe to
/// share across Tokio runtimes) read this to find the right file regardless of
/// which test won the `PIPELINE_DB` init race.
pub(crate) static SHADOW_DB_PATH: OnceLock<std::path::PathBuf> = OnceLock::new();
/// Initialise the pipeline database.
///
@@ -55,6 +75,17 @@ pub async fn init(db_path: &Path) -> Result<(), sqlx::Error> {
if PIPELINE_DB.get().is_some() {
return Ok(());
}
// Record the path before doing any real work so tests can always find the
// correct file even if two callers race — the OnceLock ensures only one
// path wins, and whichever wins will also win the PIPELINE_DB set below.
let _ = SHADOW_DB_PATH.set(db_path.to_path_buf());
// Story 1087: before running the migration that splits `stage` into
// (`pipeline`, `status`), take a timestamped side-car copy of the live DB
// so the pre-split state is recoverable. Skip the copy when the file does
// not yet exist (fresh installs) or when the split-stage migration has
// already been applied (subsequent restarts).
backup_pre_pipeline_status(db_path).await;
let options = SqliteConnectOptions::new()
.filename(db_path)
@@ -133,3 +164,88 @@ pub async fn init(db_path: &Path) -> Result<(), sqlx::Error> {
let _ = PIPELINE_DB.set(PipelineDb { tx });
Ok(())
}
/// Story 1087: version of the split-stage migration. This is the `i64` that
/// sqlx derives from the migration file name's numeric prefix and records in
/// the `version` column of `_sqlx_migrations`.
const SPLIT_STAGE_MIGRATION_VERSION: i64 = 20260515000000;
/// Story 1087: take a timestamped side-car copy of `pipeline.db` if and only if
/// the split-stage migration has not yet been applied. This is the AC1 backup
/// — `pipeline.db.pre-pipeline-status.<unix-ts>.bak` next to the live file.
///
/// Failures are logged but never propagated: a missing backup must not block
/// the server from starting (a corrupt source file or a read-only directory
/// will be surfaced by the migration step itself).
pub(crate) async fn backup_pre_pipeline_status(db_path: &Path) {
if !db_path.exists() {
return;
}
// Cheap pre-check: open the DB read-only and see whether the split-stage
// migration version is recorded in `_sqlx_migrations`. If it is, the
// backup has already been taken on a previous start and there is nothing
// to do.
let options = SqliteConnectOptions::new()
.filename(db_path)
.read_only(true)
.create_if_missing(false);
let probe = SqlitePool::connect_with(options).await;
if let Ok(pool) = probe {
let already_split: Result<Option<(i64,)>, _> =
sqlx::query_as("SELECT version FROM _sqlx_migrations WHERE version = ?1 LIMIT 1")
.bind(SPLIT_STAGE_MIGRATION_VERSION)
.fetch_optional(&pool)
.await;
pool.close().await;
if let Ok(Some(_)) = already_split {
return;
}
}
let ts = chrono::Utc::now().timestamp();
let mut backup = db_path.as_os_str().to_owned();
backup.push(format!(".pre-pipeline-status.{ts}.bak"));
let backup_path = std::path::PathBuf::from(backup);
match tokio::fs::copy(db_path, &backup_path).await {
Ok(_) => slog!(
"[db] Wrote pre-pipeline-status backup of {} to {}",
db_path.display(),
backup_path.display(),
),
Err(e) => slog!(
"[db] Failed to write pre-pipeline-status backup of {}: {e}",
db_path.display(),
),
}
}
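The side-car naming used by `backup_pre_pipeline_status` can be isolated into a small pure function, which makes it easy to test without touching the filesystem. This is a sketch only; `backup_path_for` is a hypothetical name that does not appear in the diff.

```rust
use std::path::{Path, PathBuf};

/// Build the side-car backup path by appending a suffix to the full file
/// name, so `pipeline.db` becomes `pipeline.db.pre-pipeline-status.<ts>.bak`.
/// Appending to the `OsString` (rather than calling `with_extension`) keeps
/// the original `.db` extension intact.
pub fn backup_path_for(db_path: &Path, unix_ts: i64) -> PathBuf {
    let mut backup = db_path.as_os_str().to_owned();
    backup.push(format!(".pre-pipeline-status.{unix_ts}.bak"));
    PathBuf::from(backup)
}
```

Because the timestamp has one-second resolution, two backups taken within the same second would share a path and the later copy would overwrite the earlier one; the migration-version pre-check makes that unlikely in practice.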
/// Compare the live `_sqlx_migrations` table against the compiled-in migration
/// set and return any rows whose version is not known to this binary.
///
/// A non-empty result means the database was previously opened by a newer
/// binary that applied additional migrations. The server must refuse to start
/// in that state because the schema may contain tables or columns that this
/// binary does not understand.
pub async fn check_schema_drift(pool: &SqlitePool) -> Vec<UnknownMigration> {
let migrator = sqlx::migrate!("./migrations");
let known: HashSet<i64> = migrator.migrations.iter().map(|m| m.version).collect();
let rows: Vec<(i64, String, String)> = sqlx::query_as(
"SELECT version, description, installed_on FROM _sqlx_migrations ORDER BY version",
)
.fetch_all(pool)
.await
.unwrap_or_default();
rows.into_iter()
.filter(|(v, _, _)| !known.contains(v))
.map(|(version, description, installed_on)| UnknownMigration {
version,
description,
installed_on,
})
.collect()
}
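At its core `check_schema_drift` is a set difference between the versions recorded in the database and the versions compiled into the binary. The sketch below shows that step on plain data, with no sqlx involved; `AppliedMigration` and `unknown_migrations` are hypothetical names used only for illustration.

```rust
use std::collections::HashSet;

/// One row of `_sqlx_migrations`, reduced to the fields needed here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AppliedMigration {
    pub version: i64,
    pub description: String,
}

/// Return every applied migration whose version is not in the compiled-in
/// set, preserving the order of `applied`.
pub fn unknown_migrations(known: &[i64], applied: &[AppliedMigration]) -> Vec<AppliedMigration> {
    let known: HashSet<i64> = known.iter().copied().collect();
    applied
        .iter()
        .filter(|m| !known.contains(&m.version))
        .cloned()
        .collect()
}
```

One design point worth noting in the real function: `unwrap_or_default()` turns a failed query into an empty row list, so a query error is indistinguishable from "no drift". Callers that must refuse to start on drift may want to treat a query failure separately.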
+1
@@ -118,6 +118,7 @@ impl AppContext {
)),
permission_timeout_secs: 120,
status: agents.status_broadcaster(),
chat_dispatcher: Arc::new(crate::chat::dispatcher::ChatDispatcher::new(1_500)),
});
Self {
state: Arc::new(state),
+43 -3
@@ -92,9 +92,20 @@ pub(crate) fn tool_dump_crdt(args: &Value) -> Result<String, String> {
.items
.into_iter()
.map(|item| {
// Story 1087: emit `pipeline` and `status` alongside `stage` so
// crdt-dump consumers can route by column/badge without re-deriving
// the projection from the stage string.
let (pipeline, status) = item
.stage
.as_deref()
.and_then(crate::pipeline_state::Stage::from_dir)
.map(|s| (s.pipeline().as_str(), s.status().as_str()))
.unwrap_or(("", ""));
json!({
"story_id": item.story_id,
"stage": item.stage,
"pipeline": pipeline,
"status": status,
"name": item.name,
"agent": item.agent,
"retry_count": item.retry_count,
@@ -103,6 +114,7 @@ pub(crate) fn tool_dump_crdt(args: &Value) -> Result<String, String> {
"claimed_at": item.claim_ts,
"content_index": item.content_index,
"is_deleted": item.is_deleted,
"origin": item.origin,
})
})
.collect();
@@ -123,11 +135,10 @@ pub(crate) fn tool_dump_crdt(args: &Value) -> Result<String, String> {
/// MCP tool: return the server version, build hash, and running port.
pub(crate) fn tool_get_version(ctx: &AppContext) -> Result<String, String> {
let build_hash =
std::fs::read_to_string(".huskies/build_hash").unwrap_or_else(|_| "unknown".to_string());
let build_hash = option_env!("BUILD_GIT_HASH").unwrap_or("unknown");
serde_json::to_string_pretty(&json!({
"version": env!("CARGO_PKG_VERSION"),
"build_hash": build_hash.trim(),
"build_hash": build_hash,
"port": ctx.services.agents.port(),
}))
.map_err(|e| format!("Serialization error: {e}"))
@@ -312,4 +323,33 @@ mod tests {
let result = tool_get_server_logs(&json!({"lines": 9999})).unwrap();
let _ = result;
}
#[test]
fn tool_get_version_ignores_build_hash_file_and_reports_compile_time_value() {
// Regression: get_version must NOT read .huskies/build_hash at runtime.
// Write a deliberately wrong value to the file and assert get_version
// returns the compile-time hash, not the file content.
let dir = tempfile::tempdir().expect("tempdir");
let huskies_dir = dir.path().join(".huskies");
std::fs::create_dir_all(&huskies_dir).unwrap();
std::fs::write(huskies_dir.join("build_hash"), "wrong_hash_sentinel_xyz").unwrap();
let ctx = crate::http::test_helpers::test_ctx(dir.path());
let result = tool_get_version(&ctx).expect("tool_get_version must not fail");
let parsed: serde_json::Value = serde_json::from_str(&result).expect("must be valid JSON");
let returned_hash = parsed["build_hash"]
.as_str()
.expect("build_hash must be a string");
assert_ne!(
returned_hash, "wrong_hash_sentinel_xyz",
"get_version must not read .huskies/build_hash; got '{returned_hash}'"
);
// The returned hash must equal the compile-time value.
let compile_time_hash = option_env!("BUILD_GIT_HASH").unwrap_or("unknown");
assert_eq!(
returned_hash, compile_time_hash,
"get_version must return compile-time BUILD_GIT_HASH"
);
}
}
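The hunk above swaps a runtime file read for a compile-time lookup. A minimal sketch of that pattern, assuming `BUILD_GIT_HASH` is exported to the compiler by the build environment (how it gets set is not shown in this diff); the `version_line` helper and its `version` parameter are hypothetical:

```rust
/// Build hash baked into the binary at compile time.
/// `option_env!` expands to `Option<&'static str>`: it is `None` when the
/// variable was not set while compiling, so the fallback is fixed at build
/// time and no file on disk can change the answer later.
fn build_hash() -> &'static str {
    option_env!("BUILD_GIT_HASH").unwrap_or("unknown")
}

/// Hypothetical helper mirroring the shape of the startup log line.
fn version_line(version: &str) -> String {
    format!("huskies v{} (build {})", version, build_hash())
}
```

Because the value is resolved by the compiler, a stale or tampered `.huskies/build_hash` file cannot affect what the running binary reports, which is the property the regression test checks.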
+3
@@ -195,6 +195,9 @@ pub(super) async fn tool_status(args: &Value, ctx: &AppContext) -> Result<String
if !deps.is_empty() {
front_matter.insert("depends_on".to_string(), json!(deps));
}
// Story 1088: origin tracking.
let origin_str = view.origin().unwrap_or("unknown");
front_matter.insert("origin".to_string(), json!(origin_str));
let stage_claim = match &typed_item.stage {
crate::pipeline_state::Stage::Coding { claim, .. } => claim.as_ref(),
crate::pipeline_state::Stage::Merge { claim, .. } => claim.as_ref(),
+20 -3
@@ -26,6 +26,10 @@ pub(crate) fn tool_create_bug(args: &Value, ctx: &AppContext) -> Result<String,
let acs = req.acceptance_criteria_strings();
let depends_on = req.depends_on_ids();
// Bug 1102: resolve and validate origin BEFORE creating the bug file so a
// missing-attribution call leaves no half-state behind.
let origin = super::build_origin(args)?;
let root = ctx.state.get_project_root()?;
let bug_id = create_bug_file(
&root,
@@ -38,6 +42,16 @@ pub(crate) fn tool_create_bug(args: &Value, ctx: &AppContext) -> Result<String,
depends_on.as_deref(),
)?;
crate::crdt_state::set_origin(&bug_id, &origin);
let _ = ctx
.watcher_tx
.send(crate::io::watcher::WatcherEvent::NewItemCreated {
item_id: bug_id.clone(),
item_type: "bug".to_string(),
name: req.name.as_ref().to_string(),
});
Ok(format!("Created bug: {bug_id}"))
}
@@ -233,7 +247,8 @@ mod tests {
"steps_to_reproduce": "1. Open app\n2. Click login",
"actual_result": "500 error",
"expected_result": "Successful login",
"acceptance_criteria": ["Login succeeds without error"]
"acceptance_criteria": ["Login succeeds without error"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
@@ -354,7 +369,8 @@ mod tests {
"steps_to_reproduce": "s",
"actual_result": "a",
"expected_result": "e",
"acceptance_criteria": ["Bug is fixed"]
"acceptance_criteria": ["Bug is fixed"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
@@ -396,7 +412,8 @@ mod tests {
"steps_to_reproduce": "s",
"actual_result": "a",
"expected_result": "e",
"acceptance_criteria": ["TODO", "Real AC"]
"acceptance_criteria": ["TODO", "Real AC"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
+17 -2
@@ -13,6 +13,10 @@ use serde_json::{Value, json};
pub(crate) fn tool_create_epic(args: &Value, ctx: &AppContext) -> Result<String, String> {
let req = CreateEpicRequest::from_json(args)?;
// Bug 1102: resolve and validate origin BEFORE creating the epic so a
// missing-attribution call leaves no half-state behind.
let origin = super::build_origin(args)?;
let root = ctx.state.get_project_root()?;
let success_criteria = req.success_criteria_strings();
@@ -29,6 +33,8 @@ pub(crate) fn tool_create_epic(args: &Value, ctx: &AppContext) -> Result<String,
},
)?;
crate::crdt_state::set_origin(&epic_id, &origin);
Ok(format!("Created epic: {epic_id}"))
}
@@ -127,10 +133,14 @@ pub(crate) fn tool_show_epic(args: &Value, _ctx: &AppContext) -> Result<String,
if matches!(item.stage, Stage::Done { .. }) {
done += 1;
}
// Story 1087: expose pipeline + status alongside the legacy
// stage name so epic-show callers can route by column/badge.
member_items.push(json!({
"story_id": sid,
"name": item.name,
"stage": stage_name,
"pipeline": item.stage.pipeline().as_str(),
"status": item.stage.status().as_str(),
}));
}
}
@@ -164,7 +174,8 @@ mod tests {
"name": "My Test Epic",
"goal": "Achieve something great",
"motivation": "Because it matters",
"success_criteria": ["All stories done", "Tests pass"]
"success_criteria": ["All stories done", "Tests pass"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
@@ -217,7 +228,11 @@ mod tests {
// Create an epic.
tool_create_epic(
&json!({"name": "List Epics Test Epic", "goal": "Testing list"}),
&json!({
"name": "List Epics Test Epic",
"goal": "Testing list",
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
.unwrap();
+48
@@ -12,6 +12,54 @@ mod refactor;
mod spike;
mod story;
/// Build a compact origin JSON string for a newly-created work item (story 1088).
///
/// `args` must contain an `"origin"` object with a non-empty `id` field and an
/// optional `kind` (defaulting to `"user"`) and `ts` (defaulting to now). The
/// `id` MUST identify the calling actor — e.g. `coder-1@story=42` for a coder
/// agent, `chat-bot:Timmy@<room_id>` for a chat-bot session, or a human's user
/// id for a CLI/MCP-direct call. Empty / whitespace-only `id` is rejected so
/// that every work item carries a usable provenance trail (bug 1102 — we lost
/// 1102's attribution because the default was `id=""`).
///
/// Returns the canonical origin JSON string on success. Returns `Err` with a
/// human-readable explanation when the caller failed to identify itself; the
/// caller (`tool_create_*` handlers) must propagate the error without creating
/// the work item, so a missing-attribution call leaves no half-state behind.
pub(super) fn build_origin(args: &serde_json::Value) -> Result<String, String> {
let ts = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap_or_default()
.as_secs_f64();
let origin_obj = args.get("origin").and_then(|v| v.as_object()).ok_or(
"Missing required argument: 'origin'. Every create_* MCP call must \
identify the calling actor. Pass origin = { \"kind\": \"agent\" | \
\"chat-bot\" | \"user\" | \"system\", \"id\": \"<your-identifier>\", \
\"ts\": <unix-seconds, optional> }. Example: { \"kind\": \"agent\", \
\"id\": \"coder-1@story=42\" }.",
)?;
let id = origin_obj
.get("id")
.and_then(|v| v.as_str())
.map(str::trim)
.filter(|s| !s.is_empty())
.ok_or(
"origin.id must be a non-empty string identifying the calling \
actor (e.g. \"coder-1@story=42\", \"chat-bot:Timmy@!room:home\", \
or a human user id). See bug 1102 / story 1104 for the rationale.",
)?;
let kind = origin_obj
.get("kind")
.and_then(|v| v.as_str())
.unwrap_or("user");
let ts_val = origin_obj.get("ts").and_then(|v| v.as_f64()).unwrap_or(ts);
Ok(serde_json::json!({"kind": kind, "id": id, "ts": ts_val}).to_string())
}
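The validate-before-create ordering in `build_origin` and its callers can be sketched without `serde_json`. This is an illustration under stated assumptions, not the server's code: `validate_origin_id`, `create_item`, and the `Vec<String>` store are hypothetical stand-ins for the real handlers and file writes.

```rust
/// Trim surrounding whitespace, then reject missing or empty ids,
/// mirroring the `.map(str::trim).filter(..).ok_or(..)` chain in the diff.
fn validate_origin_id(raw: Option<&str>) -> Result<&str, String> {
    raw.map(str::trim)
        .filter(|s| !s.is_empty())
        .ok_or_else(|| "origin.id must be a non-empty string".to_string())
}

/// Hypothetical stand-in for a create_* handler. Attribution is resolved
/// first, so a rejected call returns before anything is written.
fn create_item(
    name: &str,
    origin_id: Option<&str>,
    store: &mut Vec<String>,
) -> Result<String, String> {
    let id = validate_origin_id(origin_id)?; // fail before any write
    store.push(name.to_string());
    Ok(format!("Created {name} (origin {id})"))
}
```

Putting the `?` ahead of the write is what guarantees that a call without attribution leaves the store untouched.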
pub(crate) use bug::{tool_close_bug, tool_create_bug, tool_list_bugs};
pub(crate) use criteria::{
tool_add_criterion, tool_check_criterion, tool_edit_criterion, tool_ensure_acceptance,
+24 -2
@@ -27,6 +27,10 @@ pub(crate) fn tool_create_refactor(args: &Value, ctx: &AppContext) -> Result<Str
let description = req.description.as_ref().map(|d| d.as_str());
let depends_on = req.depends_on_ids();
// Bug 1102: resolve and validate origin BEFORE creating the refactor file
// so a missing-attribution call leaves no half-state behind.
let origin = super::build_origin(args)?;
let root = ctx.state.get_project_root()?;
let refactor_id = create_refactor_file(
&root,
@@ -36,6 +40,16 @@ pub(crate) fn tool_create_refactor(args: &Value, ctx: &AppContext) -> Result<Str
depends_on.as_deref(),
)?;
crate::crdt_state::set_origin(&refactor_id, &origin);
let _ = ctx
.watcher_tx
.send(crate::io::watcher::WatcherEvent::NewItemCreated {
item_id: refactor_id.clone(),
item_type: "refactor".to_string(),
name: req.name.as_ref().to_string(),
});
Ok(format!("Created refactor: {refactor_id}"))
}
@@ -104,7 +118,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_refactor(
&json!({"name": "Single Criterion Refactor", "acceptance_criteria": ["Code is clean"]}),
&json!({
"name": "Single Criterion Refactor",
"acceptance_criteria": ["Code is clean"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok: {result:?}");
@@ -131,7 +149,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_refactor(
&json!({"name": "Mixed Refactor", "acceptance_criteria": ["TODO", "Real AC"]}),
&json!({
"name": "Mixed Refactor",
"acceptance_criteria": ["TODO", "Real AC"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok for mixed AC: {result:?}");
+39 -6
@@ -27,6 +27,10 @@ pub(crate) fn tool_create_spike(args: &Value, ctx: &AppContext) -> Result<String
let description = req.description.as_ref().map(|d| d.as_str());
let depends_on = req.depends_on_ids();
// Bug 1102: resolve and validate origin BEFORE creating the spike file so
// a missing-attribution call leaves no half-state behind.
let origin = super::build_origin(args)?;
let root = ctx.state.get_project_root()?;
let spike_id = create_spike_file(
&root,
@@ -36,6 +40,16 @@ pub(crate) fn tool_create_spike(args: &Value, ctx: &AppContext) -> Result<String
depends_on.as_deref(),
)?;
crate::crdt_state::set_origin(&spike_id, &origin);
let _ = ctx
.watcher_tx
.send(crate::io::watcher::WatcherEvent::NewItemCreated {
item_id: spike_id.clone(),
item_type: "spike".to_string(),
name: req.name.as_ref().to_string(),
});
Ok(format!("Created spike: {spike_id}"))
}
@@ -75,8 +89,14 @@ mod tests {
fn tool_create_spike_rejects_empty_name() {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result =
tool_create_spike(&json!({"name": "!!!", "acceptance_criteria": ["AC"]}), &ctx);
let result = tool_create_spike(
&json!({
"name": "!!!",
"acceptance_criteria": ["AC"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_err());
assert!(result.unwrap_err().contains("alphanumeric"));
}
@@ -105,7 +125,8 @@ mod tests {
&json!({
"name": "Compare Encoders",
"description": "Which encoder is fastest?",
"acceptance_criteria": ["Encoder comparison is documented"]
"acceptance_criteria": ["Encoder comparison is documented"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
@@ -130,7 +151,11 @@ mod tests {
let ctx = test_ctx(tmp.path());
let result = tool_create_spike(
&json!({"name": "My Spike", "acceptance_criteria": ["Spike findings documented"]}),
&json!({
"name": "My Spike",
"acceptance_criteria": ["Spike findings documented"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
.unwrap();
@@ -180,7 +205,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_spike(
&json!({"name": "Single Criterion Spike", "acceptance_criteria": ["Findings documented"]}),
&json!({
"name": "Single Criterion Spike",
"acceptance_criteria": ["Findings documented"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok: {result:?}");
@@ -207,7 +236,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_spike(
&json!({"name": "Mixed Spike", "acceptance_criteria": ["TODO", "Real AC"]}),
&json!({
"name": "Mixed Spike",
"acceptance_criteria": ["TODO", "Real AC"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok for mixed AC: {result:?}");
@@ -15,6 +15,10 @@ use serde_json::Value;
pub(crate) fn tool_create_story(args: &Value, ctx: &AppContext) -> Result<String, String> {
let req = CreateStoryRequest::from_json(args)?;
// Bug 1102: resolve and validate origin BEFORE creating the story file so
// a missing-attribution call leaves no half-state behind.
let origin = super::super::build_origin(args)?;
let root = ctx.state.get_project_root()?;
let depends_on_ids = req.depends_on_ids();
@@ -31,6 +35,16 @@ pub(crate) fn tool_create_story(args: &Value, ctx: &AppContext) -> Result<String
false,
)?;
crate::crdt_state::set_origin(&story_id, &origin);
let _ = ctx
.watcher_tx
.send(crate::io::watcher::WatcherEvent::NewItemCreated {
item_id: story_id.clone(),
item_type: "story".to_string(),
name: req.name.as_ref().to_string(),
});
// Bug 503: warn at creation time if any depends_on points at an already-archived story.
let archived_deps: Vec<u32> = depends_on_ids
.as_deref()
@@ -245,7 +259,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_story(
&json!({"name": "Single Criterion Story", "acceptance_criteria": ["It works"]}),
&json!({
"name": "Single Criterion Story",
"acceptance_criteria": ["It works"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok: {result:?}");
@@ -268,7 +286,11 @@ mod tests {
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_create_story(
&json!({"name": "Mixed Story", "acceptance_criteria": ["TODO", "Real AC"]}),
&json!({
"name": "Mixed Story",
"acceptance_criteria": ["TODO", "Real AC"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
assert!(result.is_ok(), "expected ok for mixed AC: {result:?}");
@@ -284,7 +306,8 @@ mod tests {
&json!({
"name": "Story With Description",
"description": "This is the background context.",
"acceptance_criteria": ["Described well"]
"acceptance_criteria": ["Described well"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
@@ -351,7 +374,8 @@ mod tests {
let result = tool_create_story(
&json!({
"name": "Story with <b>bold</b> name",
"acceptance_criteria": ["AC1"]
"acceptance_criteria": ["AC1"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
);
+98 -18
@@ -39,34 +39,32 @@ pub(crate) fn tool_get_pipeline_status(ctx: &AppContext) -> Result<String, Strin
let state = load_pipeline_state(ctx)?;
let running_merges = ctx.services.agents.list_running_merges()?;
fn slim_name(name: &str) -> &str {
crate::chat::util::truncate_at_char_boundary(name, 120)
}
fn map_items(items: &[crate::http::workflow::UpcomingStory], stage: &str) -> Vec<Value> {
items
.iter()
.map(|s| {
let mut item = json!({
"story_id": s.story_id,
"name": s.name,
"name": slim_name(&s.name),
"stage": stage,
"pipeline": s.pipeline.as_str(),
"status": s.status.as_str(),
"agent": s.agent.as_ref().map(|a| json!({
"agent_name": a.agent_name,
"model": a.model,
"status": a.status,
})),
});
// Include blocked/retry_count when present so callers can
// identify stories stuck in the pipeline.
if let Some(true) = s.blocked {
item["blocked"] = json!(true);
}
if let Some(rc) = s.retry_count {
item["retry_count"] = json!(rc);
}
if let Some(ref mf) = s.merge_failure {
item["merge_failure"] = json!(mf);
}
if let Some(ref epic_id) = s.epic_id {
item["epic_id"] = json!(epic_id);
}
item
})
.collect()
@@ -81,19 +79,21 @@ pub(crate) fn tool_get_pipeline_status(ctx: &AppContext) -> Result<String, Strin
let backlog: Vec<Value> = state
.backlog
.iter()
.map(|s| {
let mut item = json!({ "story_id": s.story_id, "name": s.name });
if let Some(ref epic_id) = s.epic_id {
item["epic_id"] = json!(epic_id);
}
item
})
.map(|s| json!({ "story_id": s.story_id, "name": slim_name(&s.name) }))
.collect();
let archived: Vec<Value> = state
.archived
.iter()
.map(|s| json!({ "story_id": s.story_id, "name": s.name, "stage": "archived" }))
.map(|s| {
json!({
"story_id": s.story_id,
"name": slim_name(&s.name),
"stage": "archived",
"pipeline": s.pipeline.as_str(),
"status": s.status.as_str(),
})
})
.collect();
serde_json::to_string_pretty(&json!({
@@ -130,7 +130,11 @@ mod tests {
let ctx = test_ctx(tmp.path());
let result = super::super::tool_create_story(
&json!({"name": "Test Story", "acceptance_criteria": ["AC1", "AC2"]}),
&json!({
"name": "Test Story",
"acceptance_criteria": ["AC1", "AC2"],
"origin": {"kind": "test", "id": "test-suite"}
}),
&ctx,
)
.unwrap();
@@ -248,6 +252,82 @@ mod tests {
assert_eq!(item["valid"], true);
}
#[test]
fn pipeline_status_50_items_under_10kb() {
crate::db::ensure_content_store();
let stages = [
("1_backlog", "backlog"),
("2_current", "current"),
("3_qa", "qa"),
("4_merge", "merge"),
("5_done", "done"),
];
for (i, (dir, _)) in stages.iter().enumerate() {
for j in 0..10 {
let id = format!("99{i}{j}0_story_size_test");
let name = format!("Pipeline Size Test Story {i}-{j}");
crate::db::write_item_with_content(
&id,
dir,
&format!("---\nname: \"{name}\"\n---\n"),
crate::db::ItemMeta {
name: Some(name),
..Default::default()
},
);
}
}
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_get_pipeline_status(&ctx).unwrap();
assert!(
result.len() < 10 * 1024,
"50-item response must be under 10 KB; got {} bytes",
result.len()
);
}
#[test]
fn pipeline_status_per_item_under_500_bytes() {
crate::db::ensure_content_store();
// Insert one item per active stage with a moderately long name.
let stages = [
("2_current", "9995_story_peritem_current"),
("3_qa", "9996_story_peritem_qa"),
("4_merge", "9997_story_peritem_merge"),
("5_done", "9998_story_peritem_done"),
];
for (dir, id) in &stages {
let name = "A Reasonably Named Story For Size Testing";
crate::db::write_item_with_content(
id,
dir,
&format!("---\nname: \"{name}\"\n---\n"),
crate::db::ItemMeta {
name: Some(name.to_string()),
..Default::default()
},
);
}
let tmp = tempfile::tempdir().unwrap();
let ctx = test_ctx(tmp.path());
let result = tool_get_pipeline_status(&ctx).unwrap();
let parsed: Value = serde_json::from_str(&result).unwrap();
let active = parsed["active"].as_array().unwrap();
for item in active {
if stages.iter().any(|(_, id)| item["story_id"] == *id) {
let item_json = serde_json::to_string(item).unwrap();
assert!(
item_json.len() < 500,
"per-item payload must be under 500 bytes; story_id={} got {} bytes: {}",
item["story_id"],
item_json.len(),
item_json
);
}
}
}
#[test]
fn tool_validate_stories_with_invalid_front_matter() {
let tmp = tempfile::tempdir().unwrap();
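`slim_name` delegates to `crate::chat::util::truncate_at_char_boundary`, whose body is not part of this diff. A plausible sketch of such a helper, assuming the limit is measured in bytes and the cut is moved back to the nearest UTF-8 boundary (the real helper may count differently):

```rust
/// Return the longest prefix of `s` that is at most `max_bytes` long and
/// ends on a UTF-8 character boundary, so slicing never panics.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Returning a borrowed `&str` rather than a new `String` keeps the per-item cost low, which fits the size-budget tests added in the same file.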
+41 -11
@@ -2,6 +2,31 @@
use serde_json::{Value, json};
/// JSON schema fragment for the `origin` argument required by every `create_*`
/// tool (bug 1102). The caller MUST identify itself — empty `id` is rejected
/// server-side so every work item carries a usable provenance trail.
fn origin_schema() -> Value {
json!({
"type": "object",
"description": "Required: identifies the calling actor so every work item carries provenance. Empty/missing id is rejected (bug 1102). Examples: { \"kind\": \"agent\", \"id\": \"coder-1@story=42\" }, { \"kind\": \"chat-bot\", \"id\": \"Timmy@!room:home.local\" }, { \"kind\": \"user\", \"id\": \"dave\" }.",
"properties": {
"kind": {
"type": "string",
"description": "One of: \"agent\" (LLM coder/mergemaster/qa), \"chat-bot\" (Timmy or other chat-routed bot), \"user\" (human via CLI/MCP), \"system\" (server-automation)."
},
"id": {
"type": "string",
"description": "Non-empty identifier of the caller. For agents include the story id (e.g. \"coder-1@story=42\"); for chat-bots include the room/session (e.g. \"Timmy@!room:home.local\"); for users the user id or short name."
},
"ts": {
"type": "number",
"description": "Optional unix-seconds timestamp. Defaults to the server's clock when absent."
}
},
"required": ["kind", "id"]
})
}
/// Returns tool schemas for story/work-item lifecycle management.
pub(super) fn story_tools() -> Vec<Value> {
vec![
@@ -37,9 +62,10 @@ pub(super) fn story_tools() -> Vec<Value> {
"commit": {
"type": "boolean",
"description": "If true, git-add and git-commit the new story file to the current branch"
}
},
"origin": origin_schema()
},
"required": ["name", "acceptance_criteria"]
"required": ["name", "acceptance_criteria", "origin"]
}
}),
json!({
@@ -282,9 +308,10 @@ pub(super) fn story_tools() -> Vec<Value> {
"items": { "type": "string" },
"minItems": 1,
"description": "List of acceptance criteria (at least one required)"
}
},
"origin": origin_schema()
},
"required": ["name", "acceptance_criteria"]
"required": ["name", "acceptance_criteria", "origin"]
}
}),
json!({
@@ -323,9 +350,10 @@ pub(super) fn story_tools() -> Vec<Value> {
"type": "array",
"items": { "type": "integer" },
"description": "Optional list of story numbers this bug depends on (e.g. [42, 43]). Persisted as depends_on in YAML front matter."
}
},
"origin": origin_schema()
},
"required": ["name", "description", "steps_to_reproduce", "actual_result", "expected_result", "acceptance_criteria"]
"required": ["name", "description", "steps_to_reproduce", "actual_result", "expected_result", "acceptance_criteria", "origin"]
}
}),
json!({
@@ -360,9 +388,10 @@ pub(super) fn story_tools() -> Vec<Value> {
"type": "array",
"items": { "type": "integer" },
"description": "Optional list of story numbers this refactor depends on (e.g. [42, 43]). Persisted as depends_on in YAML front matter."
}
},
"origin": origin_schema()
},
"required": ["name", "acceptance_criteria"]
"required": ["name", "acceptance_criteria", "origin"]
}
}),
json!({
@@ -399,9 +428,10 @@ pub(super) fn story_tools() -> Vec<Value> {
"type": "array",
"items": { "type": "string" },
"description": "Optional: list of high-level success criteria for the epic"
}
},
"origin": origin_schema()
},
"required": ["name", "goal"]
"required": ["name", "goal", "origin"]
}
}),
json!({
@@ -574,7 +604,7 @@ pub(super) fn story_tools() -> Vec<Value> {
}),
json!({
"name": "get_pipeline_status",
"description": "Return a structured snapshot of the full work item pipeline. Includes all active stages (current, qa, merge, done) with each item's stage, name, and assigned agent. Also includes upcoming backlog items.",
"description": "Return a structured snapshot of the full work item pipeline. Each item includes only slim fields: story_id, name (capped at 120 chars), stage, agent (with agent_name/model/status), and optional boolean flags blocked and retry_count. Active stages (current, qa, merge, done) appear in the 'active' array; backlog items in 'backlog'. For full story details, use status(story_id) or dump_crdt.",
"inputSchema": {
"type": "object",
"properties": {}
+8
@@ -24,6 +24,10 @@ pub struct UpcomingStory {
pub merge_failure: Option<String>,
/// Active agent working on this item, if any.
pub agent: Option<AgentAssignment>,
/// Display column (story 1085) — derived from `Stage::pipeline()`.
pub pipeline: crate::pipeline_state::Pipeline,
/// Display badge/indicator (story 1085) — derived from `Stage::status()`.
pub status: crate::pipeline_state::Status,
/// True when the item is held in QA for human review.
#[serde(skip_serializing_if = "Option::is_none")]
pub review_hold: Option<bool>,
@@ -142,6 +146,8 @@ pub fn load_pipeline_state(ctx: &AppContext) -> Result<PipelineState, String> {
error: None,
merge_failure,
agent,
pipeline: item.stage.pipeline(),
status: item.stage.status(),
review_hold,
qa,
retry_count: if item.retry_count() > 0 {
@@ -278,6 +284,8 @@ pub fn load_upcoming_stories(_ctx: &AppContext) -> Result<Vec<UpcomingStory>, St
error: None,
merge_failure: None,
agent: None,
pipeline: item.stage.pipeline(),
status: item.stage.status(),
review_hold: None,
qa: None,
retry_count: if item_retry_count > 0 {
+10
@@ -90,4 +90,14 @@ pub enum WatcherEvent {
/// `true` if acceptance gates passed; `false` if they failed.
success: bool,
},
/// A new work item was successfully created and added to the backlog.
/// Triggers a creation notification to configured chat rooms.
NewItemCreated {
/// Work item ID (e.g. `"1075_refactor_split_stage_enum"`).
item_id: String,
/// Human-readable item type (`"story"`, `"bug"`, `"refactor"`, `"spike"`).
item_type: String,
/// Human-readable item name.
name: String,
},
}
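The `create_*` handlers emit `NewItemCreated` with `let _ = ctx.watcher_tx.send(..)`, that is, fire-and-forget. A minimal sketch of that pattern, assuming a standard-library `mpsc` channel in place of whatever channel type `watcher_tx` actually is (not shown in this diff); `notify_created` is a hypothetical helper:

```rust
use std::sync::mpsc::Sender;

#[derive(Debug, Clone, PartialEq)]
enum WatcherEvent {
    /// Same field names as the variant added in the diff.
    NewItemCreated {
        item_id: String,
        item_type: String,
        name: String,
    },
}

/// Send a creation event and deliberately ignore the result: a missing
/// receiver must not turn a successful create into an error.
fn notify_created(tx: &Sender<WatcherEvent>, item_id: &str, item_type: &str, name: &str) {
    let _ = tx.send(WatcherEvent::NewItemCreated {
        item_id: item_id.to_string(),
        item_type: item_type.to_string(),
        name: name.to_string(),
    });
}
```

Discarding the send result keeps notification delivery best-effort: the work item is already created by the time the event is emitted.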
-1
@@ -21,7 +21,6 @@ mod sweep;
pub use events::WatcherEvent;
pub(crate) use sweep::spawn_done_to_archived_subscriber;
#[cfg(test)]
pub(crate) use sweep::sweep_done_to_archived;
use crate::slog;
+29 -5
@@ -29,13 +29,20 @@ use std::time::Duration;
///
/// Replaces the periodic `sweep_done_to_archived` call from the tick loop.
pub(crate) fn spawn_done_to_archived_subscriber(done_retention: Duration) {
use crate::pipeline_state::{PipelineEvent, Stage, apply_transition, subscribe_transitions};
use crate::pipeline_state::{
PipelineEvent, Stage, Status, apply_transition, subscribe_transitions,
};
let mut rx = subscribe_transitions();
tokio::spawn(async move {
loop {
match rx.recv().await {
Ok(fired) => {
// Story 1086: gate on the typed `Status::Done` projection;
// the variant pattern is still required to read `merged_at`.
if fired.after.status() != Status::Done {
continue;
}
if let Stage::Done { merged_at, .. } = fired.after {
let story_id = fired.story_id.0.clone();
let retention = done_retention;
@@ -70,7 +77,7 @@ pub(crate) fn spawn_done_to_archived_subscriber(done_retention: Duration) {
});
}
/// Sweep items in `Stage::Done` whose `merged_at` timestamp exceeds the
/// Reconcile: sweep items in `Stage::Done` whose `merged_at` timestamp exceeds the
/// retention duration to `Stage::Archived` via the typed transition table.
///
/// Routes through [`crate::pipeline_state::apply_transition`] so the
@@ -78,14 +85,22 @@ pub(crate) fn spawn_done_to_archived_subscriber(done_retention: Duration) {
/// `TransitionFired` event is emitted to subscribers (worktree pruning,
/// matrix notifier, etc.).
///
/// Used in tests for direct one-shot sweeps; production code uses
/// Called at startup and by the periodic reconciler to archive Done stories
/// whose retention has elapsed, even when the `TransitionFired` subscriber
/// lagged and missed their Done event. Production reactive archiving uses
/// [`spawn_done_to_archived_subscriber`] instead.
#[cfg(test)]
///
/// Logs a summary INFO line on every call: candidates evaluated and items
/// archived, or "no items past retention" when nothing was swept.
pub(crate) fn sweep_done_to_archived(done_retention: Duration) {
use crate::pipeline_state::{PipelineEvent, Stage, apply_transition, read_all_typed};
let mut candidates: usize = 0;
let mut archived: usize = 0;
for item in read_all_typed() {
if let Stage::Done { merged_at, .. } = &item.stage {
candidates += 1;
let age = chrono::Utc::now()
.signed_duration_since(*merged_at)
.to_std()
@@ -93,7 +108,10 @@ pub(crate) fn sweep_done_to_archived(done_retention: Duration) {
if age >= done_retention {
let story_id = item.story_id.0.clone();
match apply_transition(&story_id, PipelineEvent::Accepted, None) {
Ok(_) => slog!("[watcher] sweep: promoted {story_id} → archived"),
Ok(_) => {
archived += 1;
slog!("[watcher] sweep: promoted {story_id} → archived")
}
Err(e) => {
slog!("[watcher] sweep: transition error for {story_id}: {e}")
}
@@ -101,4 +119,10 @@ pub(crate) fn sweep_done_to_archived(done_retention: Duration) {
}
}
}
if archived > 0 {
slog!("[watcher] sweep: {candidates} candidate(s) evaluated, {archived} archived");
} else {
slog!("[watcher] sweep: {candidates} candidate(s) evaluated, no items past retention");
}
}
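The retention check and the summary counters in `sweep_done_to_archived` reduce to a small pure function. A sketch using `std::time` instead of `chrono`, with the assumption that a `merged_at` in the future is treated as age zero (the real fallback is cut off by the hunk boundary); `past_retention` and `sweep` are hypothetical names:

```rust
use std::time::{Duration, SystemTime};

/// True when the item has been Done for at least `retention`.
fn past_retention(merged_at: SystemTime, now: SystemTime, retention: Duration) -> bool {
    // Assumption: clock skew (merged_at after now) counts as age zero.
    let age = now.duration_since(merged_at).unwrap_or(Duration::ZERO);
    age >= retention
}

/// Return (candidates evaluated, items that would be archived), matching
/// the two counters logged at the end of the sweep.
fn sweep(merged_ats: &[SystemTime], now: SystemTime, retention: Duration) -> (usize, usize) {
    let candidates = merged_ats.len();
    let archived = merged_ats
        .iter()
        .filter(|&&m| past_retention(m, now, retention))
        .count();
    (candidates, archived)
}
```

Because the decision depends only on the stored `merged_at` and the current time, it gives the same answer after a restart, which is the point of the reconcile path.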
+42
@@ -301,6 +301,48 @@ async fn done_to_archived_subscriber_archives_on_transition() {
);
}
/// Regression: simulates a server restart occurring between move-to-done and
/// the configured retention window expiry.
///
/// Before the fix the archive-deadline was held only in the reactive
/// subscriber's volatile sleep task; a restart would lose that task and the
/// item would never be archived. The fix is that `sweep_done_to_archived`
/// reads `merged_at` from the CRDT (durable across restarts) and archives any
/// item whose age exceeds the retention, so the next periodic reconcile tick
/// after restart picks it up regardless of whether a sleep task existed.
#[test]
fn restart_scenario_sweep_archives_past_retention_after_sweep_tick() {
crate::crdt_state::init_for_test();
crate::db::ensure_content_store();
let story_id = "9885_sweep_restart_regression";
// Simulate: item moved to Done 10 seconds before the restart.
// The reactive subscriber would have had a sleep task for the remaining
// retention time; that task is now gone (process restarted).
let ten_seconds_ago = (chrono::Utc::now() - chrono::Duration::seconds(10)).timestamp() as f64;
crate::crdt_state::write_item_str(
story_id,
"5_done",
Some("Restart regression test"),
None,
None,
Some(ten_seconds_ago),
);
// The next periodic reconcile tick after restart calls sweep_done_to_archived
// directly. With 5-second retention and merged_at 10s ago, the item must
// be archived even though no reactive subscriber sleep task exists.
sweep_done_to_archived(Duration::from_secs(5));
let items = crate::pipeline_state::read_all_typed();
let item = items.iter().find(|i| i.story_id.0 == story_id);
assert!(
item.is_some_and(|i| matches!(i.stage, crate::pipeline_state::Stage::Archived { .. })),
"item past retention must be archived on the next sweep tick after a server restart"
);
}
/// Prove that an item with merged_at NEWER than done_retention is NOT swept.
#[test]
fn sweep_keeps_item_newer_than_retention() {
+10 -4
@@ -33,6 +33,8 @@ pub mod mesh;
/// Node identity — Ed25519 keypair generation and stable node ID management.
pub mod node_identity;
pub(crate) mod pipeline_state;
/// Reliable process-termination primitives shared across the server.
pub mod process_kill;
/// Rebuild — process restart and shutdown coordination.
pub mod rebuild;
mod service;
@@ -82,12 +84,10 @@ async fn main() -> Result<(), std::io::Error> {
});
// Log version and build hash so we can verify what's running.
let build_hash =
std::fs::read_to_string(".huskies/build_hash").unwrap_or_else(|_| "unknown".to_string());
slog!(
"[startup] huskies v{} (build {})",
env!("CARGO_PKG_VERSION"),
build_hash.trim()
option_env!("BUILD_GIT_HASH").unwrap_or("unknown")
);
let app_state = Arc::new(SessionState::default());
@@ -151,7 +151,7 @@ async fn main() -> Result<(), std::io::Error> {
startup::project::open_project_root(is_init, explicit_path, &cwd, &app_state, &store, port)
.await;
startup::project::init_subsystems(&app_state, &cwd).await;
startup::project::init_subsystems(&app_state, &cwd, is_agent).await;
let crdt_join_token = cli
.join_token
@@ -238,6 +238,12 @@ async fn main() -> Result<(), std::io::Error> {
.map(|c| c.permission_timeout_secs)
.unwrap_or(120),
status: agents.status_broadcaster(),
chat_dispatcher: std::sync::Arc::new(chat::dispatcher::ChatDispatcher::new(
bot_cfg
.as_ref()
.map(|c| c.coalesce_window_ms)
.unwrap_or(1_500),
)),
});
// Sled uplink: forward permission requests to an upstream gateway when configured.
+10
@@ -78,6 +78,16 @@ pub fn apply_transition(
super::Stage::Rejected { reason, .. } | super::Stage::Blocked { reason } => {
crate::crdt_state::set_resume_to_raw(story_id, reason);
}
// Story 1105: write the resume target so read-back can reconstruct the
// correct variant. Without this, the register is stale (or empty) and
// the deserialiser falls back to Coding regardless of where the story
// was when it was frozen.
super::Stage::Frozen { resume_to } => {
crate::crdt_state::set_resume_to_raw(story_id, resume_to.dir_name());
}
super::Stage::ReviewHold { resume_to, .. } => {
crate::crdt_state::set_resume_to_raw(story_id, resume_to.dir_name());
}
_ => {}
}
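The story 1105 fix hinges on writing the resume target into a register at freeze time so read-back can rebuild the right variant. A simplified sketch of that round trip, assuming a three-variant `Stage` and a plain `Option<&str>` register; the `"frozen"` directory name, `resume_register`, and `unfreeze` are hypothetical, while `1_backlog`, `2_current`, and the fall-back to Coding come from the diff:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Stage {
    Backlog,
    Coding,
    Frozen { resume_to: Box<Stage> },
}

impl Stage {
    fn dir_name(&self) -> &'static str {
        match self {
            Stage::Backlog => "1_backlog",
            Stage::Coding => "2_current",
            Stage::Frozen { .. } => "frozen", // hypothetical name
        }
    }
}

/// Value to persist when entering `stage`; only Frozen carries a target here.
fn resume_register(stage: &Stage) -> Option<&'static str> {
    match stage {
        Stage::Frozen { resume_to } => Some(resume_to.dir_name()),
        _ => None,
    }
}

/// Reconstruct the stage to resume to. A stale or empty register falls
/// back to Coding, which is the behaviour the bug exposed.
fn unfreeze(register: Option<&str>) -> Stage {
    match register {
        Some("1_backlog") => Stage::Backlog,
        _ => Stage::Coding,
    }
}
```

With the register written, a story frozen from Backlog resumes to Backlog; with it missing, the fallback silently promotes the story to Coding.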
-80
@@ -36,32 +36,6 @@ pub(super) fn try_broadcast(fired: &TransitionFired) {
let _ = get_or_init_tx().send(fired.clone());
}
/// Replay the current CRDT pipeline state as a burst of synthetic
/// [`TransitionFired`] events at server startup.
///
/// Reads every item from the CRDT and broadcasts a self-transition
/// (`before == after`) for each one so that all existing subscribers
/// (worktree lifecycle, merge-failure auto-spawn, auto-assign) react
/// identically to a live event. This replaces the legacy scan-based
/// `reconcile_on_startup` path.
///
/// Idempotent: a second call produces another burst of events, but every
/// subscriber already guards against duplicate work (e.g.
/// `is_story_assigned_for_stage` returns true once an agent is running,
/// and worktree creation is a no-op when the worktree already exists).
pub fn replay_current_pipeline_state() {
for item in super::read_all_typed() {
let fired = TransitionFired {
story_id: item.story_id.clone(),
before: item.stage.clone(),
after: item.stage,
event: super::PipelineEvent::DepsMet,
at: chrono::Utc::now(),
};
try_broadcast(&fired);
}
}
/// Fired when a pipeline stage transition completes.
#[derive(Debug, Clone)]
pub struct TransitionFired {
@@ -183,58 +157,4 @@ mod tests {
}
// ── TransitionError Display ─────────────────────────────────────────
// ── replay_current_pipeline_state ──────────────────────────────────
/// AC1: replay broadcasts a synthetic event for every item in the CRDT.
#[test]
fn replay_broadcasts_event_for_crdt_item_in_coding_stage() {
crate::crdt_state::init_for_test();
crate::db::ensure_content_store();
let story_id = "9901_replay_coding";
crate::db::write_item_with_content(
story_id,
"2_current",
"---\nname: Replay Coding\n---\n",
crate::db::ItemMeta::named("Replay Coding"),
);
let mut rx = subscribe_transitions();
replay_current_pipeline_state();
let mut found = false;
while let Ok(fired) = rx.try_recv() {
if fired.story_id.0 == story_id && matches!(fired.after, Stage::Coding { .. }) {
found = true;
}
}
assert!(
found,
"replay must broadcast a Coding event for a story in 2_current"
);
}
/// AC3: calling replay_current_pipeline_state twice fires events both times.
///
/// Pool-state idempotency (no duplicate agents) is enforced by subscribers,
/// not by the replay function itself. This test verifies that replay is safe
/// to call multiple times without panicking.
#[test]
fn replay_twice_does_not_panic() {
crate::crdt_state::init_for_test();
crate::db::ensure_content_store();
let story_id = "9902_replay_idem";
crate::db::write_item_with_content(
story_id,
"3_qa",
"---\nname: Replay QA\n---\n",
crate::db::ItemMeta::named("Replay QA"),
);
// Two successive replays must not panic.
replay_current_pipeline_state();
replay_current_pipeline_state();
}
}
+4 -6
@@ -41,8 +41,8 @@ mod tests;
#[allow(unused_imports)]
pub use types::{
AgentClaim, AgentName, ArchiveReason, BranchName, ExecutionState, GitSha, MergeFailureKind,
NodePubkey, PipelineItem, PlanState, Stage, StoryId, TransitionError, stage_dir_name,
stage_label,
NodePubkey, Pipeline, PipelineItem, PlanState, Stage, Status, StoryId, TransitionError,
stage_dir_name, stage_label,
};
#[allow(unused_imports)]
@@ -51,10 +51,7 @@ pub use transition::{
};
#[allow(unused_imports)]
pub use events::{
EventBus, TransitionFired, TransitionSubscriber, replay_current_pipeline_state,
subscribe_transitions,
};
pub use events::{EventBus, TransitionFired, TransitionSubscriber, subscribe_transitions};
#[allow(unused_imports)]
pub use projection::ProjectionError;
@@ -66,6 +63,7 @@ pub use apply::{
transition_to_unfrozen,
};
pub(crate) use subscribers::reconcile_audit_log;
pub use subscribers::spawn_audit_log_subscriber;
#[allow(unused_imports)]
+8
@@ -35,6 +35,14 @@ impl TransitionSubscriber for AuditLogSubscriber {
}
}
/// Reconcile: no-op for the audit log subscriber.
///
/// The audit log records live transitions only. Replaying historical CRDT state at
/// reconcile time would produce misleading entries (wrong timestamps, duplicate lines).
/// Eventual consistency of the audit log is not required — missed events are simply
/// absent from the log, which is acceptable.
pub(crate) fn reconcile_audit_log() {}
/// Spawn a background task that writes a structured audit log entry for every pipeline transition.
///
/// Subscribes to the transition broadcast channel. Every `TransitionFired` event produces

Some files were not shown because too many files have changed in this diff.