Commit Graph

1059 Commits

Author SHA1 Message Date
dave 2d0387fe63 huskies: merge 1126 story Gateway event aggregator with per-session scope filters (Timmy=All, Sally=single sled) 2026-05-17 21:02:08 +00:00
dave d86cc38b2a huskies: merge 1128 story Bounded event queues + EventStreamGap sentinel + observability for context assembly 2026-05-17 20:30:02 +00:00
dave badd522d60 huskies: merge 1125 story LLM session entity + assemble_prompt_context helper, wired into Matrix bot 2026-05-17 20:09:33 +00:00
dave ecd3f600d9 huskies: merge 1130 story Adopted/launched project containers bind huskies to 127.0.0.1, unreachable from host MCP 2026-05-17 20:02:22 +00:00
dave 89058ebd49 huskies: merge 1124 story Persist TransitionFired into a per-sled CRDT event log 2026-05-17 19:37:50 +00:00
dave d8204ab7ed huskies: merge 1129 story find_free_port fallback returns unbindable port silently when range is exhausted 2026-05-17 19:24:29 +00:00
dave c1b7e12b0b huskies: merge 1122 story Chat-bot switch command reads stale gateway_projects Vec instead of live gateway_projects_store 2026-05-17 18:49:58 +00:00
dave 6331dea8b0 huskies: merge 1121 story Remove the marketing website from the huskies OSS repo (now lives in huskies-server) 2026-05-17 18:43:43 +00:00
dave 7de167b21b huskies: merge 1116 story rebuild_and_restart loses pending CRDT ops by calling exec() before persistence channel drains 2026-05-17 17:48:44 +00:00
dave 73cf1c6ff9 huskies: merge 1117 story MCP tool for adopt: expose new project --adopt as an MCP call 2026-05-17 16:42:06 +00:00
dave f8b1e14b74 huskies: merge 1118 story Automate per-project docker image builds (huskies-project-base + per-stack overlays) 2026-05-17 16:30:08 +00:00
Timmy 265e6f9a15 fix(1101): strip passing-test lines before classify() lint check; remove diagnostic
The merge gate classifier was matching trigger keywords like
`missing_doc_comments` inside passing-test name lines
(e.g. `test agents::gates::tests::classify_lint_from_missing_doc_comments ... ok`),
causing every gate failure to be mis-classified as Lint and bounced
back to a fixup coder. Strip `test … … ok` lines before scanning for
lint triggers. Also removes the temporary diagnostic block in
runner.rs that confirmed the bug.

Applied directly to master because the 1101 feature branch carried
stale work from an earlier incarnation of the story that semantically
conflicted with master's later diagnostic commit (`is_fixup` deleted
on the branch, referenced on master).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:52:26 +01:00
dave 0695ad7ae6 huskies: merge 1115 story new project: --adopt flow to wrap a container around an existing checkout 2026-05-17 15:17:12 +00:00
dave eb6b07531a huskies: merge 1114 story new project: --path flag to override default host directory 2026-05-17 14:48:49 +00:00
Timmy a5bfd40233 Bump version to 0.12.0 2026-05-17 02:10:31 +01:00
dave a40500eea9 huskies: merge 1111 bug Test isolation: init_for_test() and ensure_content_store() are once-per-thread, not once-per-test, polluting CRDT state across tests 2026-05-17 00:33:45 +00:00
dave f8212f102f huskies: merge 1109 story Chat bootstrap Phase 4: --git clones an existing repo and configures push credentials 2026-05-17 00:18:25 +00:00
dave 59302b465d huskies: merge 1108 story Chat bootstrap Phase 3: SSH-remote editor access into the project container (any editor) 2026-05-16 23:37:59 +00:00
dave efafe44db1 huskies: merge 1110 story Chat bootstrap Phase 2b: additional stack overlays (Go, Python, Ruby, JVM) 2026-05-16 23:20:31 +00:00
dave 3a43337735 huskies: merge 1107 story Chat bootstrap Phase 2a: stack-overlay framework + Rust and Node stack overlays 2026-05-16 23:01:49 +00:00
dave 10d992a7e4 huskies: merge 1106 story Chat bootstrap Phase 1: new project chat command spawns a bare project container and registers it with the gateway 2026-05-16 22:39:20 +00:00
Timmy 7db0b78e88 Bump version to 0.11.1 2026-05-15 23:38:09 +01:00
dave 979492449e huskies: merge 1105 bug Freeze from Backlog stores wrong resume_to — Unfreeze restores to Coding instead of Backlog 2026-05-15 22:33:54 +00:00
Timmy 6fbe239313 fix(1102): require non-empty origin.id on create_* MCP tools
bug 1102 was created today with origin={kind:user, id:""} because
build_origin silently defaulted id to empty when the caller didn't pass
one — we couldn't tell who filed it. Bug 1088's origin field is useless
as audit if every caller can omit themselves.

Changes:
- build_origin (server/src/http/mcp/story_tools/mod.rs) now returns
  Result<String, String> and rejects missing/empty/whitespace-only id
  with an instructional error pointing at bug 1102 / story 1104.
- 5 create_* tool handlers (bug, spike, refactor, epic, story) now
  resolve origin BEFORE create_*_file so an attribution-less call
  leaves no half-state behind.
- 5 tool input schemas now advertise origin as a required object via
  a shared origin_schema() helper. The schema description gives every
  caller (coder agent, chat bot, user, system) a concrete example so
  the LLM populates the field correctly on first sight.
- Test fixtures pass origin = {kind:"test", id:"test-suite"}.

Story 1104 (signed actions) is the longer-term replacement; this is the
quick attribution win agreed for master ahead of that design work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:54 +01:00
Timmy 26527e7dae diag(1101): log classify verdict + matched trigger on merge gate failures
Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:38 +01:00
dave 04a57e92c2 huskies: merge 1103 bug Rate-limit warning at session start sticks the rate_limit_exit flag, causing 1053's fast-path bypass to skip completion on clean session exits 2026-05-15 21:02:37 +00:00
dave 4216ced493 huskies: merge 1100 bug Multiple LLM agents can run concurrently on the same story (coder + mergemaster + others) — enforce one-agent-per-story invariant 2026-05-15 20:24:31 +00:00
dave 63d86f1263 huskies: merge 1096 bug Shadow drift: set_agent writes CRDT agent register without updating pipeline_items.agent 2026-05-15 19:05:56 +00:00
dave 1adc734801 huskies: merge 1098 bug Shadow drift: set_retry_count / bump_retry_count write CRDT register without updating pipeline_items.retry_count 2026-05-15 18:25:25 +00:00
dave 8531bac6cd huskies: merge 1097 bug Shadow drift: set_depends_on writes CRDT depends_on register without updating pipeline_items.depends_on 2026-05-15 12:40:17 +00:00
dave 2857c3b46b huskies: merge 1094 bug delete_story leaks zombie rows in pipeline_items shadow table — 176 tombstoned items still report non-terminal stages 2026-05-15 12:27:48 +00:00
dave 62d1535e76 huskies: merge 1095 bug Shadow drift: set_name writes CRDT name register without updating pipeline_items.name 2026-05-15 12:10:11 +00:00
dave fc5481dbe4 huskies: merge 1093 bug Chat dispatcher spawns one Timmy per inbound message — needs coalesce window + per-session serial lock 2026-05-15 12:03:09 +00:00
dave 01e60a670c huskies: merge 1091 refactor Migrate the merge-gate's stale-cargo kill path to process_kill 2026-05-15 11:50:03 +00:00
dave c4010854a5 huskies: merge 1089 bug Stuck-agent detector blocks stories on legitimate exploration / debugging — uses too narrow a "progress" signal 2026-05-15 11:40:44 +00:00
dave 4aa76ce673 huskies: merge 1090 refactor Migrate AgentPool::kill_all_children and kill_child_for_key to process_kill so server shutdown and stop_agent actually kill claude 2026-05-15 11:16:16 +00:00
Timmy fb82bd7bca test(tick_loop): de-flake reconcile_never_floods_broadcast_channel
The test asserted msg_count == 0 on a process-global broadcast channel
(TRANSITION_TX is a single OnceLock<Sender> shared across the test
binary), so any concurrent test calling apply_transition could land
events in our receiver between the drain and the post-reconcile check.
Observed failure: 3 stray transitions from parallel tests.

Drop the strict count check.  The real "never floods" invariant is
captured by the Lagged check alone: 1000 seeded items must not overflow
the 256-slot channel, which can only hold if the reconcile path
bypasses the broadcast (AC4).  The sibling test
`reconcile_pass_scales_to_1000_items_without_lagged_divergence` already
uses this Lagged-only pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 11:13:31 +01:00
Timmy b7df5cbe4e fix(agents): kill-then-status reorder in stop_agent
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.

Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
   until verified gone — start_agent's Running/Pending whitelist
   continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
   drop the child_killers entry.

Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:46:02 +01:00
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
dave df32a1542b huskies: merge 1087 story Pipeline+Status split — Step D: migrate CRDT storage to (Pipeline, Status) and remove the Stage enum 2026-05-15 08:47:38 +00:00
dave e82602db77 huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status 2026-05-15 08:26:39 +00:00
Timmy 2d6105c778 fix: skip setup commands on worktree reuse so reconciler doesn't fire npm ci every 30s
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.

When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.

Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:57:38 +01:00
Timmy d89940e85b fix: drop source-map.json from agent orientation bundle
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:48:18 +01:00
dave 13f7dab5f0 huskies: merge 1088 2026-05-15 02:03:30 +00:00
dave b053f14d58 huskies: merge 1085 2026-05-15 01:38:05 +00:00
dave 56179d712e huskies: merge 1078 2026-05-15 01:32:29 +00:00
dave 1506141155 huskies: merge 1072 2026-05-15 01:27:25 +00:00
dave 0c23d209a0 huskies: merge 1077 2026-05-15 00:58:57 +00:00
dave eac5763e03 huskies: merge 1075 2026-05-15 00:48:06 +00:00
dave f9b140add9 huskies: merge 1073 2026-05-15 00:37:01 +00:00