docs: fold EKG draft into README/README_EN and remove standalone draft design doc

2026-04-14 13:37:32 +08:00 · 2026-03-01 05:38:48 +00:00
parent 6608456fbf
commit 11a0005645
3 changed files with 30 additions and 97 deletions
--- a/README.md
+++ b/README.md
@@ -67,6 +67,21 @@
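 - When keys are not declared explicitly, the system infers resource keys from the task text automatically.
 - Conflicting tasks wait in `resource_lock`, retry lock acquisition after 30 seconds by default, and are scheduled with fairness weighting (the longer a task waits, the higher its priority).
 - Autonomy completion/blocked notifications no longer use `autonomy.notify_channel` / `autonomy.notify_chat_id`; by default the target is derived from the `allow_from` of enabled channels (Telegram first).
+- Inbound message dedupe: channel-level dedupe by `message_id` (default TTL: 10 minutes) to avoid duplicate replies caused by platform retries.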
+
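+### EKG (Execution Knowledge Graph)
+
+ClawGo now has a built-in execution knowledge graph (a lightweight JSONL event stream with no dependency on an external graph database):
+
+- Event store: `memory/ekg-events.jsonl`
+- Error-signature normalization (denoising of paths, numbers, and hex strings)
+- Repeated-error suppression for autonomy (`ekg_consecutive_error_threshold`)
+- Provider fallback ranked by historical outcomes (errsig-aware)
+- Provider/model observability in task audit
+- EKG stats stratified by source/channel (heartbeat separated from workload)
+
+> Why a time window is needed:
+> Full-history stats are diluted by stale data and heartbeat noise, which distorts decisions about the current state. Observing the last 24h by default (switchable to 6h/7d) is recommended so that fallback and alerts track the system's "current" state.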

 ## 🏁 Quick Start

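The error-signature normalization listed in the EKG section above (path/number/hex denoising) could look roughly like the sketch below. It is not taken from the ClawGo source: the function name `normalizeErrSig`, the regular expressions, and the replacement order are assumptions made for illustration, and the `<path>`, `<n>`, `<hex>` placeholders follow the design draft removed by this commit.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// All patterns below are illustrative assumptions, not ClawGo's actual rules.
var (
	// A run of "/segment" pieces, e.g. /tmp/run-1234/out.log.
	pathRe = regexp.MustCompile(`(?:/[\w.-]+)+`)
	// 0x-prefixed values, or bare hex runs of 8+ characters.
	hexRe = regexp.MustCompile(`\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{8,}\b`)
	// Any remaining digit run.
	numRe = regexp.MustCompile(`\d+`)
	// Any whitespace run.
	spaceRe = regexp.MustCompile(`\s+`)
)

// normalizeErrSig (hypothetical name) reduces an error message to a stable
// signature so that messages differing only in volatile details compare equal.
// Paths are replaced first so that digits inside them are not rewritten twice.
func normalizeErrSig(msg string) string {
	s := pathRe.ReplaceAllString(msg, "<path>")
	s = hexRe.ReplaceAllString(s, "<hex>")
	s = numRe.ReplaceAllString(s, "<n>")
	s = spaceRe.ReplaceAllString(s, " ")
	return strings.TrimSpace(s)
}

func main() {
	a := normalizeErrSig("open /tmp/run-1234/out.log: timeout after 30s at 0x7ffe12ab")
	b := normalizeErrSig("open /var/x/out.log:   timeout after 5s at 0xdeadbeef")
	fmt.Println(a)
	fmt.Println(a == b)
}
```

Both sample messages collapse to the same signature, which is what lets repeated failures be counted together. A real implementation would likely need stricter path detection, since this pattern also matches slash-separated tokens that are not file paths.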
--- a/README_EN.md
+++ b/README_EN.md
@@ -67,6 +67,21 @@ Autonomy now supports lock scheduling via `resource_keys`. You can explicitly de
 - Without explicit keys, the engine derives keys from task text heuristically.
 - Conflicting tasks enter `resource_lock` waiting, retry lock acquisition after 30s, and use fairness weighting (longer wait => higher scheduling priority).
 - Autonomy completion/blocked notifications no longer use `autonomy.notify_channel` / `autonomy.notify_chat_id`; target is derived from enabled channel `allow_from` (Telegram first).
+- Inbound dedupe: channel-level dedupe by `message_id` (default TTL: 10 minutes) to avoid duplicate replies from platform retries.
+
+### EKG (Execution Knowledge Graph)
+
+ClawGo now includes a built-in execution knowledge graph (lightweight JSONL event stream; no external graph DB required):
+
+- Event store: `memory/ekg-events.jsonl`
+- Normalized error signatures (path/number/hex denoise)
+- Repeated-error suppression for autonomy (`ekg_consecutive_error_threshold`)
+- Provider fallback ranking by historical outcomes (errsig-aware)
+- Task-audit visibility for provider/model
+- Source/channel-stratified EKG stats (heartbeat separated from workload)
+
+> Why time windows matter:
+> Full-history stats get diluted by stale data and heartbeat noise, which degrades current decisions. A recent window (e.g., 24h, optionally 6h/7d) keeps fallback and alerts aligned with present runtime behavior.

 ## 🏁 Quick Start

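To make the time-window note above concrete, here is a minimal sketch of counting outcomes from a JSONL event stream while ignoring stale and heartbeat events. The field names follow the `Event` struct in the removed design draft; the RFC 3339 timestamp format, the literal source value `"heartbeat"`, and the names `statsInWindow`, `WindowStats`, and `demo` are assumptions for illustration only.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
	"time"
)

// Event mirrors a subset of the fields in the design draft.
type Event struct {
	Time   string `json:"time"` // assumed to be RFC 3339
	Source string `json:"source,omitempty"`
	Status string `json:"status"` // success|error|suppressed
}

// WindowStats is a hypothetical aggregate over the recent window.
type WindowStats struct{ Success, Error int }

// statsInWindow counts success/error events newer than now-window,
// skipping heartbeat events and lines that fail to parse.
func statsInWindow(r io.Reader, now time.Time, window time.Duration) (WindowStats, error) {
	var st WindowStats
	cutoff := now.Add(-window)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev Event
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			continue // tolerate a corrupt line in an append-only log
		}
		ts, err := time.Parse(time.RFC3339, ev.Time)
		if err != nil || ts.Before(cutoff) || ev.Source == "heartbeat" {
			continue
		}
		switch ev.Status {
		case "success":
			st.Success++
		case "error":
			st.Error++
		}
	}
	return st, sc.Err()
}

const sampleLog = `{"time":"2026-02-20T00:00:00Z","source":"workload","status":"error"}
{"time":"2026-03-01T06:00:00Z","source":"heartbeat","status":"success"}
{"time":"2026-03-01T07:00:00Z","source":"workload","status":"error"}
{"time":"2026-03-01T08:00:00Z","source":"workload","status":"success"}
`

// demo evaluates the sample log with a 24h window ending at a fixed instant.
func demo() WindowStats {
	now := time.Date(2026, 3, 1, 12, 0, 0, 0, time.UTC)
	st, _ := statsInWindow(strings.NewReader(sampleLog), now, 24*time.Hour)
	return st
}

func main() {
	fmt.Printf("%+v\n", demo())
}
```

In the sample, the nine-day-old error and the heartbeat success are both dropped, so only the two recent workload events contribute to the counts. Switching to a 6h or 7d window is a matter of passing a different `window` value.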
--- a/docs/ekg-design.md
+++ b/docs/ekg-design.md
@@ -1,97 +0,0 @@
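-# EKG Design Draft (Execution Knowledge Graph)
-
-> Goal: without introducing a heavyweight graph database, give ClawGo an execution knowledge graph that is auditable, replayable, and error-reducing, with priority on reducing repeated agent errors and autonomy dead loops.
-
-## 1. Scope and Milestones
-
-### M1 (implemented in this change)
-- Record execution-result events (success/failure/suppressed) to `memory/ekg-events.jsonl`
-- Normalize error text into a signature (errsig)
-- Read advice in the autonomy engine: when the same task fails with the same errsig consecutively and reaches the threshold, block the retry directly (to avoid a dead loop)
-
-### M2 (later)
-- Success-rate advice along the provider/model/tool dimensions (preferred / banned)
-- Policy stratification along the channel/source dimensions
-
-### M3 (later)
-- WAL + snapshot
-- WebUI visualization (errsig hotspots, suppression hit rate)
-
----
-
-## 2. Data Model (interface sketch)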
-
-```go
-type Event struct {
-    Time    string `json:"time"`
-    TaskID  string `json:"task_id,omitempty"`
-    Session string `json:"session,omitempty"`
-    Channel string `json:"channel,omitempty"`
-    Source  string `json:"source,omitempty"`
-    Status  string `json:"status"` // success|error|suppressed
-    ErrSig  string `json:"errsig,omitempty"`
-    Log     string `json:"log,omitempty"`
-}
-
-type Advice struct {
-    ShouldEscalate  bool     `json:"should_escalate"`
-    RetryBackoffSec int      `json:"retry_backoff_sec"`
-    Reason          []string `json:"reason"`
-}
-
-type SignalContext struct {
-    TaskID  string
-    ErrSig  string
-    Source  string
-    Channel string
-}
-```
-
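----
-
-## 3. Storage and Performance
-
-- Storage: `memory/ekg-events.jsonl` (append-only)
-- Reads: scan only the most recent window (2000 lines by default)
-- Complexity: O(N_recent)
-- Design trade-off: M1 puts correctness first; snapshots and indexes come later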
-
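----
-
-## 4. Rules (M1)
-
-- Error-signature normalization:
-  - Paths normalized to `<path>`
-  - Numbers normalized to `<n>`
-  - Hex strings normalized to `<hex>`
-  - Whitespace collapsed
-- Threshold rule:
-  - If `task_id + errsig` hits `>=3` consecutive errors, then
-  - `ShouldEscalate=true` and the autonomy task enters `blocked:repeated_error_signature`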
-
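----
-
-## 5. Integration Points
-
-1) `pkg/agent/loop.go`
-- Write the EKG event at `appendTaskAuditEvent`, in step with task-audit
-
-2) `pkg/autonomy/engine.go`
-- Read EKG advice in the branch where the run result is error
-- When the escalation condition is hit, block the retry directly and mark the block reason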
-
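----
-
-## 6. Risks and Rollback
-
-- Risk: a threshold that is too low blocks tasks prematurely
-- Mitigation: the default threshold is 3, and it triggers only on the same task + same errsig
-- Rollback: remove the advice check to restore the original retry path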
-
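----
-
-## 7. Acceptance Criteria (M1)
-
-- `memory/ekg-events.jsonl` can be created and appended to
-- After the same task fails 3 consecutive times with the same error signature, autonomy stops looping on dispatch
-- `make test` (Docker compile) passes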
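The M1 threshold rule in the removed draft (same `task_id` + same errsig failing at least 3 times in a row leads to `ShouldEscalate=true`) can be sketched as below. This is an illustration under stated assumptions, not the code in `pkg/autonomy/engine.go`: the function name `adviseRetry`, the helper `sampleEvents`, the trimmed-down struct fields, and the choice that events from other tasks do not interrupt a streak are all hypothetical.

```go
package main

import "fmt"

// Event and Advice are trimmed copies of the draft's data model.
type Event struct {
	TaskID string
	Status string // success|error|suppressed
	ErrSig string
}

type Advice struct {
	ShouldEscalate bool
	Reason         []string
}

// adviseRetry walks the recent events newest-first and counts how many times
// in a row the given task failed with the given signature. Events belonging
// to other tasks are skipped (assumption); any other event for the same task
// ends the streak.
func adviseRetry(events []Event, taskID, errSig string, threshold int) Advice {
	streak := 0
	for i := len(events) - 1; i >= 0; i-- {
		ev := events[i]
		if ev.TaskID != taskID {
			continue
		}
		if ev.Status != "error" || ev.ErrSig != errSig {
			break
		}
		streak++
	}
	if streak >= threshold {
		return Advice{ShouldEscalate: true, Reason: []string{"repeated_error_signature"}}
	}
	return Advice{}
}

// sampleEvents returns a small, oldest-first event history for illustration.
func sampleEvents() []Event {
	return []Event{
		{TaskID: "t1", Status: "success"},
		{TaskID: "t1", Status: "error", ErrSig: "sigA"},
		{TaskID: "t2", Status: "error", ErrSig: "sigB"},
		{TaskID: "t1", Status: "error", ErrSig: "sigA"},
		{TaskID: "t1", Status: "error", ErrSig: "sigA"},
	}
}

func main() {
	adv := adviseRetry(sampleEvents(), "t1", "sigA", 3)
	fmt.Println(adv.ShouldEscalate, adv.Reason)
}
```

With the sample history, task `t1` has three consecutive `sigA` errors, so a threshold of 3 escalates while a threshold of 4 would not. Keeping the check scoped to one task and one signature matches the mitigation described in the draft's risk section.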