Mirror of https://github.com/YspCoder/clawgo.git (synced 2026-04-14 13:37:32 +08:00)
docs: fold EKG draft into README/README_EN and remove standalone draft design doc
README.md (+15)
@@ -67,6 +67,21 @@
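- If keys are not declared explicitly, the system infers resource keys automatically from the task text.
- Conflicting tasks enter `resource_lock` waiting. By default they retry lock acquisition after 30 seconds, with fairness weighting (the longer a task has waited, the higher its priority).
- Autonomy completion/blocked notifications no longer use `autonomy.notify_channel` / `autonomy.notify_chat_id`. By default, the target is derived automatically from the `allow_from` of enabled channels (Telegram first).
- Inbound message dedupe: channel-level dedupe based on `message_id` (default TTL 10 minutes) prevents duplicate replies caused by platform retries.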
### EKG (Execution Knowledge Graph)

ClawGo now has built-in execution knowledge graph support (a lightweight JSONL event stream, with no dependency on an external graph database):

- Event storage: `memory/ekg-events.jsonl`
- Error-signature normalization (paths, numbers, and hex values denoised)
- Suppression of repeated autonomy errors (`ekg_consecutive_error_threshold`)
- Provider fallback ordered by historical effectiveness (errsig-aware)
- Provider/model observability in task audits
- EKG statistics stratified by source/channel (heartbeat separated from workload)
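One way to picture source/channel stratification is to key the counters by the (source, channel) pair. That way, heartbeat events never share a bucket with workload events. This Go sketch makes three assumptions: the source values are `heartbeat` and `workload`, the channel value `internal` is used for heartbeat traffic, and the success rate ignores suppressed events. The type and function names are illustrative, not ClawGo's.

```go
package main

import "fmt"

// event holds the subset of the draft's Event fields this sketch needs.
type event struct {
	Source  string // assumed values such as "heartbeat" or "workload"
	Channel string
	Status  string // success|error|suppressed
}

// stratum is the grouping key: one bucket per (source, channel) pair.
type stratum struct{ Source, Channel string }

type counts struct{ Success, Error, Suppressed int }

// successRate ignores suppressed events and returns 0 when nothing was attempted.
func (c counts) successRate() float64 {
	attempts := c.Success + c.Error
	if attempts == 0 {
		return 0
	}
	return float64(c.Success) / float64(attempts)
}

// stratify tallies outcomes per (source, channel), so heartbeat traffic never dilutes workload stats.
func stratify(events []event) map[stratum]counts {
	out := make(map[stratum]counts)
	for _, e := range events {
		k := stratum{e.Source, e.Channel}
		c := out[k]
		switch e.Status {
		case "success":
			c.Success++
		case "error":
			c.Error++
		case "suppressed":
			c.Suppressed++
		}
		out[k] = c
	}
	return out
}

func main() {
	events := []event{
		{"heartbeat", "internal", "success"},
		{"heartbeat", "internal", "success"},
		{"workload", "telegram", "error"},
		{"workload", "telegram", "success"},
	}
	for k, c := range stratify(events) {
		fmt.Printf("%s/%s success rate: %.2f\n", k.Source, k.Channel, c.successRate())
	}
}
```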
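> Why a time window is needed:
>
> Full-history statistics are diluted by stale data and heartbeat noise, which distorts decisions about the system's present state. We recommend observing the last 24h by default (switchable to 6h/7d), so that fallback and alerting track the "current" system state more closely.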
## 🏁 Quick Start

README_EN.md (+15)
@@ -67,6 +67,21 @@ Autonomy now supports lock scheduling via `resource_keys`. You can explicitly de
- Without explicit keys, the engine derives keys from task text heuristically.
- Conflicting tasks enter `resource_lock` waiting, retry lock acquisition after 30s, and use fairness weighting (longer wait => higher scheduling priority).
- Autonomy completion/blocked notifications no longer use `autonomy.notify_channel` / `autonomy.notify_chat_id`; the target is derived from the enabled channels' `allow_from` (Telegram first).
- Inbound dedupe: channel-level dedupe by `message_id` (default TTL: 10 minutes) avoids duplicate replies caused by platform retries.
### EKG (Execution Knowledge Graph)

ClawGo now includes a built-in execution knowledge graph (lightweight JSONL event stream; no external graph DB required):

- Event store: `memory/ekg-events.jsonl`
- Normalized error signatures (paths, numbers, and hex values denoised)
- Repeated-error suppression for autonomy (`ekg_consecutive_error_threshold`)
- Provider fallback ranking by historical outcomes (errsig-aware)
- Provider/model visibility in task audits
- Source/channel-stratified EKG stats (heartbeat separated from workload)
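Errsig-aware fallback ranking could look roughly like the sketch below. The scoring rule is an assumption made for illustration. Each provider gets a Laplace-smoothed success rate. A fixed penalty is then subtracted for each past failure whose errsig matches the error currently being recovered from. The names `outcome` and `rankProviders`, and the 0.25 penalty, are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// outcome is a hypothetical per-call record: which provider ran and how it ended.
type outcome struct {
	Provider string
	OK       bool
	ErrSig   string
}

// rankProviders orders candidates best-first by (ok+1)/(total+2), minus sigPenalty for each
// past failure with the same errsig as currentErrSig. Both constants are illustrative.
func rankProviders(candidates []string, history []outcome, currentErrSig string) []string {
	const sigPenalty = 0.25
	type stat struct{ ok, total, sameSig int }
	stats := make(map[string]*stat)
	for _, p := range candidates {
		stats[p] = &stat{}
	}
	for _, o := range history {
		s, known := stats[o.Provider]
		if !known {
			continue // ignore providers that are not fallback candidates
		}
		s.total++
		if o.OK {
			s.ok++
		} else if currentErrSig != "" && o.ErrSig == currentErrSig {
			s.sameSig++
		}
	}
	score := func(p string) float64 {
		s := stats[p]
		return float64(s.ok+1)/float64(s.total+2) - sigPenalty*float64(s.sameSig)
	}
	ranked := append([]string(nil), candidates...)
	// Stable sort keeps the configured order among equally scored providers.
	sort.SliceStable(ranked, func(i, j int) bool { return score(ranked[i]) > score(ranked[j]) })
	return ranked
}

func main() {
	history := []outcome{
		{"primary", true, ""}, {"primary", true, ""}, {"primary", true, ""},
		{"primary", false, "timeout after <n>s"},
		{"backup", true, ""}, {"backup", false, "rate limited"},
	}
	candidates := []string{"primary", "backup", "spare"}
	fmt.Println(rankProviders(candidates, history, ""))                   // [primary backup spare]
	fmt.Println(rankProviders(candidates, history, "timeout after <n>s")) // [backup spare primary]
}
```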
> Why time windows matter:
>
> Full-history stats get diluted by stale data and heartbeat noise, which degrades current decisions. A recent window (e.g., 24h, optionally 6h/7d) keeps fallback and alerts aligned with present runtime behavior.
## 🏁 Quick Start
@@ -1,97 +0,0 @@
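# EKG Design Draft (Execution Knowledge Graph)

> Goal: without introducing a heavyweight graph database, give ClawGo an execution knowledge graph that is auditable, replayable, and error-reducing. The first priority is to cut down repeated agent errors and autonomy infinite loops.

## 1. Scope and phases

### M1 (this iteration)

- Record execution-result events (success/error/suppressed) in `memory/ekg-events.jsonl`
- Normalize error text into a signature (errsig)
- Read advice in the autonomy engine: when the same task fails with the same errsig consecutively and reaches the threshold, block further retries outright (avoiding infinite loops)

### M2 (later)

- Success-rate recommendations per provider/model/tool (preferred / banned)
- Strategy stratification by channel/source

### M3 (later)

- WAL + snapshots
- WebUI visualization (errsig hotspots, suppression hit rate)

---

## 2. Data model (interface sketch)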
```go
// Event is one line in memory/ekg-events.jsonl.
type Event struct {
	Time    string `json:"time"`
	TaskID  string `json:"task_id,omitempty"`
	Session string `json:"session,omitempty"`
	Channel string `json:"channel,omitempty"`
	Source  string `json:"source,omitempty"`
	Status  string `json:"status"`           // success|error|suppressed
	ErrSig  string `json:"errsig,omitempty"` // normalized error signature
	Log     string `json:"log,omitempty"`
}

// Advice is what the autonomy engine reads back in its error branch before retrying.
type Advice struct {
	ShouldEscalate  bool     `json:"should_escalate"`
	RetryBackoffSec int      `json:"retry_backoff_sec"`
	Reason          []string `json:"reason"`
}

// SignalContext identifies the task/error pair that advice is computed for.
type SignalContext struct {
	TaskID  string
	ErrSig  string
	Source  string
	Channel string
}
```
---

## 3. Storage and performance

- Storage: `memory/ekg-events.jsonl` (append-only)
- Reads: scan only the most recent window (default 2000 lines)
- Complexity: O(N_recent)
- Design trade-off: M1 prioritizes correctness; snapshots and indexes come later

---

## 4. Rules (M1)

- Error-signature normalization:
  - paths normalized to `<path>`
  - numbers normalized to `<n>`
  - hex values normalized to `<hex>`
  - whitespace collapsed
- Threshold rule:
  - if the same `task_id + errsig` records `>=3` consecutive errors, then
    - `ShouldEscalate=true`, and the autonomy task enters `blocked:repeated_error_signature`
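The normalization rules can be expressed as an ordered series of regex substitutions. The draft does not specify the patterns, so the ones below are assumptions. Paths are replaced first, then hex values, then any remaining digits. With this order, `0x7ffd1234` becomes `<hex>` instead of being split into numbers. `normalizeErrSig` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Unix- or Windows-style path segments (a deliberately rough pattern).
	pathRe = regexp.MustCompile(`(?:[/\\][\w.\-]+)+`)
	// 0x-prefixed hex, or bare hex runs of 8+ characters (addresses, hashes, IDs).
	hexRe = regexp.MustCompile(`\b0[xX][0-9a-fA-F]+\b|\b[0-9a-fA-F]{8,}\b`)
	// Any remaining run of decimal digits.
	numRe = regexp.MustCompile(`\d+`)
)

// normalizeErrSig applies path -> hex -> number replacement, then collapses whitespace.
func normalizeErrSig(msg string) string {
	s := pathRe.ReplaceAllLiteralString(msg, "<path>")
	s = hexRe.ReplaceAllLiteralString(s, "<hex>")
	s = numRe.ReplaceAllLiteralString(s, "<n>")
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	a := normalizeErrSig("open /tmp/run-17/out.json: no such file (retry 3)")
	b := normalizeErrSig("open /tmp/run-42/out.json: no such file (retry 4)")
	fmt.Println(a)      // open <path>: no such file (retry <n>)
	fmt.Println(a == b) // true
}
```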

---

## 5. Integration points

1) `pkg/agent/loop.go`
   - At `appendTaskAuditEvent`, write the EKG event synchronously, alongside the task-audit record
2) `pkg/autonomy/engine.go`
   - In the branch where the run result is an error, read the EKG advice
   - When the escalation condition is met, stop retrying immediately and record the block reason
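The threshold rule and the engine's error branch can be combined into one sketch. It reuses the `Advice` and `SignalContext` shapes from the data model, with `Event` trimmed. However, `getAdvice`, `onRunError` and the streak-counting details are assumptions. The sketch expects events oldest-first and assumes the current failure has already been logged. Any other event for the same task ends the streak.

```go
package main

import "fmt"

// Event is trimmed to the fields this sketch reads.
type Event struct {
	TaskID string
	Status string // success|error|suppressed
	ErrSig string
}

type Advice struct {
	ShouldEscalate  bool
	RetryBackoffSec int // not modeled in this sketch
	Reason          []string
}

type SignalContext struct {
	TaskID, ErrSig, Source, Channel string
}

// defaultThreshold mirrors the draft's default of 3 consecutive errors.
const defaultThreshold = 3

// getAdvice walks recent (oldest-first) from newest to oldest, counting consecutive
// errors for ctx.TaskID with ctx.ErrSig; any other event for that task ends the streak.
func getAdvice(recent []Event, ctx SignalContext, threshold int) Advice {
	streak := 0
	for i := len(recent) - 1; i >= 0; i-- {
		e := recent[i]
		if e.TaskID != ctx.TaskID {
			continue
		}
		if e.Status != "error" || e.ErrSig != ctx.ErrSig {
			break
		}
		streak++
	}
	if ctx.ErrSig == "" || streak < threshold {
		return Advice{}
	}
	return Advice{
		ShouldEscalate: true,
		Reason:         []string{fmt.Sprintf("errsig %q repeated %d times for task %s", ctx.ErrSig, streak, ctx.TaskID)},
	}
}

// onRunError is a hypothetical stand-in for the engine's error branch.
func onRunError(recent []Event, ctx SignalContext) string {
	if getAdvice(recent, ctx, defaultThreshold).ShouldEscalate {
		return "blocked:repeated_error_signature"
	}
	return "retry"
}

func main() {
	ctx := SignalContext{TaskID: "t1", ErrSig: "open <path>: no such file"}
	var recent []Event
	for attempt := 1; attempt <= 3; attempt++ {
		// Assumes the agent loop has already appended this failure to the EKG log.
		recent = append(recent, Event{TaskID: "t1", Status: "error", ErrSig: ctx.ErrSig})
		fmt.Printf("attempt %d: %s\n", attempt, onRunError(recent, ctx))
	}
}
```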
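
---

## 6. Risks and rollback

- Risk: a threshold that is too low blocks tasks prematurely
- Mitigation: the default threshold is 3, and it only triggers when both the task and the errsig match
- Rollback: removing the advice check restores the original retry path

---

## 7. Acceptance criteria (M1)

- `memory/ekg-events.jsonl` is created and appended to
- After the same task fails 3 times in a row with the same error signature, autonomy stops re-dispatching it in a loop
- `make test` (Docker compile) passes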