Apache SkyWalking – Engineering

Blog: Meet Horizon UI · 11/17: Runtime Rules & Live Debugging

Mon, 29 Jun 2026 00:00:00 +0000

This is the eleventh post in the Meet Horizon UI series, and it stays in Act 3 — operate it. The previous post was about reading what the backend already decided; this one is about changing how it decides — and then proving the change does what you meant.

Almost everything OAP computes runs through a small family of DSLs: OAL turns traces into service and endpoint metrics, MAL turns meters (OpenTelemetry, Telegraf) into metrics, LAL turns logs into tags and metrics. Traditionally you edit those as YAML on the server and restart. Horizon brings both halves into the console — edit and hot-apply the rules, and debug them against live data — two capabilities new to the SkyWalking UI that ride OAP’s admin host.

Your rules, live in the console

Operate → DSL management lists every analysis rule the cluster is running, grouped by source. Four catalogs are editable — MAL · OTel, MAL · Telegraf, LAL, and LAL → MAL (log-to-metric) — plus a read-only OAL browser. Rules group by prefix (ActiveMQ, BanyanDB, Elasticsearch, Flink…), each tagged by status, and you can filter by active / inactive / bundled / modified to see at a glance what an operator has changed versus what shipped.

Figure 1: DSL management — every OAL/MAL/LAL rule the cluster runs, grouped by source and filterable by status (active / inactive / bundled / modified). Here, the OpenTelemetry MAL catalog: 37 bundled rules.

Edit, and hot-apply safely

Open a rule and it’s a Monaco YAML editor with syntax highlighting and two diff modes — vs. server (what’s live) and vs. bundled (what shipped) — so you always see what you’re about to change. The green ▶ in the gutter beside each - name: jumps that rule straight into the Live Debugger.

Figure 2: Edit a rule as Monaco YAML — syntax-highlighted, with diffs against the live (server) and bundled versions, and a green ▶ in the gutter that jumps the rule into the Live Debugger.

Saving is where the care shows. A body- or filter-only edit applies instantly. A structural change, one that moves a metric’s scope, downsampling, or storage shape, reshapes the cluster’s storage, so Horizon runs it as a fenced rollout and tracks it on screen: Compiled → Confirming across the cluster → Committing → Done, reporting success only once the change is durable. Each failure mode is handled differently. If a node lags the fence, the apply ends DEGRADED rather than failing, and names the laggard nodes, which self-converge on their next scan. A pre-commit error is rolled back, with the reason shown inline and your edit kept in the buffer. A compile error surfaces as an inline diagnostic. A one-click Force re-apply re-runs a stuck rollout on byte-identical content to un-stick a node (it briefly pauses that one rule’s collection). Reverting a rule to its bundled default goes through the same fenced path; you can also inactivate it, delete it, or dump the whole catalog to a tarball.
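The apply outcomes described above can be read as a small decision table. The Python sketch below is illustrative only: the names classify_apply, Outcome, and ApplyResult, and the idea of a per-node acknowledgement map, are assumptions made for this example and are not Horizon or OAP APIs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Outcome(Enum):
    DONE = "DONE"
    DEGRADED = "DEGRADED"
    ROLLED_BACK = "ROLLED_BACK"
    COMPILE_ERROR = "COMPILE_ERROR"


@dataclass
class ApplyResult:
    outcome: Outcome
    laggards: List[str] = field(default_factory=list)
    reason: str = ""


def classify_apply(
    compiled: bool,
    acks: Dict[str, bool],
    precommit_error: Optional[str] = None,
) -> ApplyResult:
    """Map the observable facts of a rollout to the outcome shown on screen.

    `acks` is a hypothetical map of node name -> "confirmed the fence in time".
    """
    if not compiled:
        # Compile errors never reach the cluster; they surface inline.
        return ApplyResult(Outcome.COMPILE_ERROR, reason="rule failed to compile")
    if precommit_error:
        # Anything that fails before commit is rolled back with its reason.
        return ApplyResult(Outcome.ROLLED_BACK, reason=precommit_error)
    laggards = sorted(node for node, ok in acks.items() if not ok)
    if laggards:
        # Lagging nodes degrade the apply instead of failing it.
        return ApplyResult(Outcome.DEGRADED, laggards=laggards)
    return ApplyResult(Outcome.DONE)
```

The ordering of the checks mirrors the on-screen stages: compile first, then pre-commit, then cluster confirmation, so a later stage is only considered once the earlier ones have passed.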

The Live Debugger: see what a rule actually does

Editing a rule is the easy part. The hard part — the part that used to mean reading code and squinting at output — is knowing what a rule computes against your real data. Operate → Live debugger answers that directly: pick a rule, click start sampling, and Horizon installs a bounded capture on every reachable OAP node, grabs a handful of real records, and shows each one stepped through the rule.

Figure 3: Start a capture and it installs on every reachable OAP node (here 2/2), grabs real records, and bounds itself with a record cap and a retention window — the same shell serves all three analysis languages.

It has one tab per analysis language, because each works on a different kind of data.

OAL → traces. A captured source row — a real trace segment — flows clause by clause: from(Service.*) reads the segment (you see its latency, status, endpoint), build_metrics shapes it, cpm() aggregates it. You watch a trace become a metric.

Figure 4: OAL → traces — a real segment from agent::gateway (latency 38, status 200, /rcmd) stepped clause by clause, from(Service.*) → build_metrics → cpm(), into the service-CPM metric it feeds.

MAL → metrics. A meter sample flows input → filter → function → output. Because one metric fans out into many label-sets, the samples are grouped by metric, and a diff dims the labels every sample shares and highlights only the ones that differ.

Figure 5: MAL → metrics — samples grouped by metric, with a diff that dims the 16 labels every sample shares and lights only the two that differ (group, pod_name), so four near-identical series read apart at a glance.
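The shared-versus-differing split behind that diff is easy to sketch. The following Python is a minimal illustration, assuming each captured sample exposes its labels as a plain dict; split_labels and the sample data are hypothetical and not taken from Horizon’s code.

```python
from typing import Dict, List, Tuple


def split_labels(samples: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Return (shared, differing) label keys across a group of samples.

    A key is "shared" only if every sample has it with the same value;
    everything else is what a diff view would highlight.
    """
    keys = set().union(*(s.keys() for s in samples))
    shared = {
        k
        for k in keys
        if all(k in s for s in samples) and len({s[k] for s in samples}) == 1
    }
    return sorted(shared), sorted(keys - shared)


samples = [
    {"cluster": "demo", "namespace": "default", "group": "a", "pod_name": "songs-1"},
    {"cluster": "demo", "namespace": "default", "group": "b", "pod_name": "songs-2"},
]
shared, differing = split_labels(samples)
# shared    -> ["cluster", "namespace"]   (dimmed)
# differing -> ["group", "pod_name"]      (highlighted)
```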

LAL → logs. Each captured log record becomes a column and each DSL block (or statement) a row, so the whole capture reads as a matrix: you can see which records the filter aborted and what the extractor pulled out of the ones that passed — and click any cell to open the record in full and compare it against another.

Figure 6: LAL → logs — every captured record is a column, every DSL block a row. The filter aborts the normal logs; for each abnormal one that passes, the extractor row shows the tag it added (status.code=404) as a diff over the raw record.

Figure 7: “Show complete data” opens a record in full — the entire raw log payload (here an Envoy access log) — with a Compare with selector to diff it field-by-field against any other captured record.

Where it runs

Both surfaces are operate features: they talk to OAP’s admin host, not the query port. DSL management goes through the receiver-runtime-rule module, the Live Debugger through dsl-debugging. That admin host ships with OAP 11, so on a backend that predates it these two pages surface a clear “needs the admin host / module” banner and stay read-only, while every observe surface (dashboards, traces, logs, alarms, profiling) keeps working untouched. Access is role-gated: browsing rules and viewing captures are read permissions, while editing, structural apply, and running a capture each need their own write or execute verb. A read-only operator can therefore study captured samples all day without ever being able to change a rule or start a session. This is the slice of “operate” the open-source backend only just made possible.

Where to go next

For the field reference — every apply state, the dump format, the per-tab capture controls — see the Runtime Rules and Live Debugger docs.

Next up: Inspect — Cross-Layer Query Power-Tools — the Operate-side surfaces for running metric, trace, and log queries straight across every layer.

Blog: Meet Horizon UI · 12/17: Inspect — Cross-Layer Query Power-Tools

Mon, 29 Jun 2026 00:00:00 +0000

This is the twelfth post in the Meet Horizon UI series, still in Act 3 — operate it. The Trace Explorer and Log Explorer both start the same way: you pick a layer, then a service, then you search. That’s the right shape when you’re already looking at a service. But sometimes you aren’t — you have a trace id and no idea which layer owns it, a failing service name from an alert, or a metric you just want to chart across everything. The Inspect family under Operate is built for exactly that: three cross-layer query power-tools that drop the layer-first step — and one of them has no per-layer equivalent at all.

Metrics inspect: the metric catalog, and the rule behind every number

SkyWalking computes a lot of metrics, and until now there was no way to simply see them all. Metrics inspect is that view. Its catalog drawer lists every metric the connected OAP computes — and groups them by the rule that defines them: the OAL files and MAL rule sets you met in the previous post. Filter by source (OAL, MAL·OTel, MAL·Telegraf, LAL→MAL), search by name, and read each metric’s value type and scope at a glance.

Figure 1: Metrics inspect’s catalog — every metric the OAP computes, grouped by the rule that defines it (the OAL files and MAL rule sets from DSL management), filterable by source and scope. Pick metrics onto the board.

Pick metrics from the catalog and they land on a board of charts, where you choose the entities to plot — a paginated top-N from OAP, or hand-entered ones — and read the values back as a line, bar, or area chart. Each widget carries its rule source and scope so you never lose the thread from “this number” to “the rule that produces it.” It’s an MQE scratchpad: the time range is browser-local but sent to OAP in server time, the board persists in your browser, and metrics that live only in shared storage (not defined on the connected OAP) can be added as foreign metrics.
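To make “browser-local but sent in server time” concrete, here is a minimal Python sketch of the conversion. It assumes the server reports its timezone as an offset string such as "+0800" and that a minute-step timestamp is formatted as "yyyy-MM-dd HHmm"; both the input shape and the helper names (parse_offset, to_server_time) are assumptions for illustration, not a description of Horizon’s implementation.

```python
from datetime import datetime, timedelta, timezone


def parse_offset(offset: str) -> timezone:
    """Turn an offset string like '+0800' or '-0530' into a tzinfo."""
    sign = -1 if offset.startswith("-") else 1
    digits = offset.lstrip("+-")
    delta = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4]))
    return timezone(sign * delta)


def to_server_time(local: datetime, server_offset: str) -> str:
    """Re-express a timezone-aware local datetime in the server's offset."""
    server_tz = parse_offset(server_offset)
    return local.astimezone(server_tz).strftime("%Y-%m-%d %H%M")


# A user in UTC+2 picks 10:30; a server in UTC+8 sees the same instant as 16:30.
picked = datetime(2026, 6, 29, 10, 30, tzinfo=timezone(timedelta(hours=2)))
print(to_server_time(picked, "+0800"))  # 2026-06-29 1630
```

The point of converting at send time is that the range you see stays in your own clock while the query still addresses the instant the server stored.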

Figure 2: The board — chart any metric across entities; each widget keeps its rule source (OAL) and scope (Service), a per-widget entity paginator, and a browser-local range sent to OAP in server time.

Trace inspect: find a trace without picking a layer

Trace inspect is the Trace Explorer with the layer taken off. The Target is optional: pick a service through the layer → service → instance → endpoint cascade, type a name (with a real / conjectured flag), or leave it blank to query every service at once. Set the usual conditions — trace id, status, order, duration bounds, tags, window — and Run query. A resolved-query line spells out the exact call sent to OAP, and the results render as the same distribution scatter, trace list, and three-lens waterfall you already know — just not bound to any one layer.

Figure 3: Trace inspect — layer-less: leave Target blank to query every service (or pick/type one), then Run query. Here one trace crosses five services (agent::ui → frontend → app → gateway → songs); the resolved-query line shows the exact OAP call (native · queryTraces).

Log inspect: one query, three log sources

Log inspect does the same for logs — “query any service across layers, pick it, type its name, or leave it blank” — and folds three different log worlds into one place via a Source switch:

Raw — the stored service logs, streamed across services with tag and trace-id conditions, each row opening the same payload popout as the per-layer Log Explorer;
Browser — the BROWSER layer’s JS errors by category, with the same source-map de-obfuscation from the Browser Errors post;
Kubernetes Pod logs — an on-demand live tail of a pod’s container logs, with Start / Pause and Include/Exclude regex filters, never persisted.

Figure 4: Log inspect — “query any service across layers, or leave it blank,” across three sources (Raw stored logs, Browser JS errors, Kubernetes Pod logs). Here raw logs stream from several services at once.

Where it runs

All three live under Operate and share one permission, inspect:read. They split on the backend, though. Trace inspect and Log inspect ride OAP’s standard query protocol — the same one the dashboards and per-layer explorers use — so they’re always on and work on any OAP, including 10.x. Metrics inspect is the exception: it reads OAP’s metric catalog and entity enumerator through the admin host’s inspect module, so it needs OAP 11; when that module is absent it shows a clear “set SW_INSPECT=default” banner instead of a broken page, while the other two keep working. Think of the trio as the cross-layer, Operate-side counterparts to the per-layer Trace and Log explorers — plus the metric catalog that finally answers “what does this backend even measure, and which rule measures it?”

Where to go next

For the field reference — the metric catalog, entity enumeration, foreign metrics, and MQE execution — see the Inspect docs.

Next up: Platform & Cluster Introspection — Cluster Status, OAP configuration, and data-retention, the last stop in Act 3 before we turn to governing and securing the console.

Zh: Meet Horizon UI · 11/17: Runtime Rules & Live Debugging

Mon, 29 Jun 2026 00:00:00 +0000

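Translated from the English original: Meet Horizon UI · 11/17: Runtime Rules & Live Debugging.

This is the eleventh post in the Meet Horizon UI series, still in Act 3: operate it. The previous post covered how to read the alarms the backend has already raised; this one goes a step further and covers how to change the backend’s decision logic, then verify that the change takes effect as intended.

Most analysis in OAP passes through a small set of DSLs: OAL turns traces into service and endpoint metrics, MAL turns meters (OpenTelemetry, Telegraf) into metrics, and LAL turns logs into tags and metrics. In the past, changing these rules usually meant editing YAML on the server and restarting the service. Horizon puts both jobs into the console: editing rules and applying them live, and debugging rules against real-time data. Both are new capabilities in the SkyWalking UI, and both run over OAP’s admin host.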

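Your rules, right in the console

Operate → DSL management lists the analysis rules the cluster is currently running, grouped by source. Four catalogs are editable: MAL · OTel, MAL · Telegraf, LAL, and LAL → MAL (log-to-metric); there is also a read-only OAL browser. Rules are grouped by prefix, such as ActiveMQ, BanyanDB, Elasticsearch, and Flink. Every rule carries a status tag, and you can filter by active / inactive / bundled / modified to quickly tell which rules shipped with the release and which an operator has changed.

Figure 1: DSL management: the OAL/MAL/LAL rules the cluster runs, grouped by source and filterable by active / inactive / bundled / modified. Shown here is the OpenTelemetry MAL catalog, with 37 bundled rules.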

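Edit, and apply live safely

Open a rule and you get a Monaco YAML editor with syntax highlighting and two diffs: vs. server shows the version currently live, and vs. bundled shows the default version that shipped with the release. That way you can confirm what you changed before saving. Beside each - name: there is also a green ▶; clicking it takes that rule straight to the Live Debugger.

Figure 2: Editing a rule as Monaco YAML: syntax highlighting, diffs against the live and bundled versions, and a green ▶ to the left of the code that jumps straight into the Live Debugger.

On save, Horizon distinguishes edits by risk. An edit that touches only the rule body or the filter takes effect immediately. A structural change, such as one that alters a metric’s scope, downsampling, or storage shape, affects the cluster’s storage structure. Horizon runs it through a cluster-confirmed flow and shows progress on screen: Compiled → Confirming across the cluster → Committing → Done; the change counts as successful only after the cluster confirms it is durable.

If a node does not confirm in time, the apply result becomes DEGRADED. The UI lists the lagging nodes, which catch up on their own at the next scan, instead of failing the whole apply. If an error occurs before commit, the change is rolled back, the reason is shown in the UI, and your edit stays in the buffer. A compile error is shown as an inline diagnostic. For a stuck rollout, one click on Force re-apply reruns the apply flow with identical content so the lagging nodes resync; this briefly pauses collection for that rule. Reverting a rule to its bundled default goes through the same confirmation flow; you can also inactivate it, delete it, or export the whole catalog as a tarball.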

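Live Debugger: see what a rule actually computes

Editing a rule is only the first step. The harder part is confirming what it actually computes on real data. That used to mean reading code, comparing output, and judging from experience. Operate → Live debugger puts this directly into the UI: select a rule, click start sampling, and Horizon installs a bounded capture task on every reachable OAP node, grabs a small number of real records, and shows record by record how each one passes through the rule.

Figure 3: Once a capture starts, the task is installed on every reachable OAP node (here 2/2), grabs real records, and is bounded by a record cap and a retention window. All three analysis languages share this session shell.

The Live Debugger has one tab per analysis language, because the three kinds of rule handle different data.

OAL → traces. A captured source row is a real trace segment. It unfolds clause by clause: from(Service.*) reads the segment (you can see latency, status, endpoint), build_metrics shapes the metric structure, and cpm() aggregates. You can watch directly how a trace becomes a metric.

Figure 4: OAL → traces: a real segment from agent::gateway (latency 38, status 200, /rcmd) steps through from(Service.*) → build_metrics → cpm() and ends up in the service-CPM metric.

MAL → metrics. A meter sample flows input → filter → function → output. Because one metric often fans out into many label sets, samples are grouped by metric; the diff dims the labels all samples share and highlights only the ones that differ.

Figure 5: MAL → metrics: samples grouped by metric; the diff dims the 16 labels all samples share and highlights only the two that differ (group, pod_name), so four very similar series can be told apart at a glance.

LAL → logs. Each captured log record is a column and each DSL block (or statement) is a row, so the whole capture becomes a matrix. You can see which records the filter aborted and what the extractor pulled from the records that passed; click any cell to view the record in full and compare it field by field with another record.

Figure 6: LAL → logs: each captured record is a column, each DSL block a row. The filter drops normal logs; for the abnormal logs that pass, the extractor row shows the tag it added (here status.code=404) as a diff.

Figure 7: Clicking “show complete data” on a cell shows the raw content of the whole captured record (here an Envoy access log), and the Compare with selector diffs it field by field against any other record.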

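Where it runs

Both surfaces are operate features: they talk to OAP’s admin host, not the query port. DSL management goes through the receiver-runtime-rule module, and the Live Debugger through dsl-debugging. The admin host comes with OAP 11; on the current backend, these two pages clearly state “needs the admin host / module” and stay read-only. Meanwhile, every observe surface is unaffected: dashboards, traces, logs, alarms, and profiling all work as usual.

Permissions are also split by role: browsing rules and viewing captures are read permissions; editing a rule, running a structural apply, and starting a capture each need the corresponding write or execute permission. So a read-only operator can keep viewing captured samples but cannot change a rule or start a new debugging session. This is exactly the piece of operate capability the open-source backend only recently filled in.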

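Where to go next

For the field reference, including every apply state, the dump format, and the per-tab capture controls, see the Runtime Rules and Live Debugger docs.

Next up: Inspect, the cross-layer query tools: running metric, trace, and log queries directly across layers from the Operate surfaces.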

Zh: Meet Horizon UI · 12/17: Inspect, Cross-Layer Query Tools

Mon, 29 Jun 2026 00:00:00 +0000

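Translated from the English original: Meet Horizon UI · 12/17: Inspect — Cross-Layer Query Power-Tools.

This is the twelfth post in the Meet Horizon UI series, still in Act 3: operate it. The Trace Explorer and Log Explorer share the same entry flow: pick a layer, then a service, then search. If you already know which service to look at, that flow works well. But sometimes you do not know where to start: you hold only a trace id and do not know which layer it belongs to; an alert gives you only the name of a failing service; or you just want to take a metric and chart it across every entity. The Inspect family under Operate is built for these cases: three cross-layer query entry points that drop the “pick a layer first” step, one of which has no single-layer counterpart at all.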

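Metrics inspect: the metric catalog, and the rule behind every metric

SkyWalking computes a large number of metrics, but until now there was no single place that listed them all. Metrics inspect fills in that view. Its catalog drawer lists every metric the connected OAP computes and groups them by the rule that defines them: the OAL files and MAL rule sets covered in the previous post. You can filter by source (OAL, MAL·OTel, MAL·Telegraf, LAL→MAL), search by name, and see each metric’s value type and scope directly.

Figure 1: The Metrics inspect catalog: every metric the OAP computes, grouped by defining rule, that is, the OAL files and MAL rule sets from DSL management; filter by source and scope, then pick metrics onto the board.

Metrics picked from the catalog land on a board of charts. You choose which entities to plot, either from the paginated top-N that OAP returns or by typing entity names by hand; charts can render as line, bar, or area. Every widget carries its rule source and scope, so you can always trace “this number” back to “the rule that computed it.” The board also works as an MQE scratchpad: the time range is kept browser-local but is converted to server time when sent to OAP; the board itself is also stored in the browser; and metrics that exist only in shared storage, and are not defined by the connected OAP, can be added as foreign metrics.

Figure 2: The Inspect board: pick any metric and chart it across multiple entities; each widget keeps its rule source (OAL), scope (Service), its own entity paginator, and a time range that is stored browser-local and converted to server time when submitted to OAP.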

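Trace inspect: find a trace without picking a layer first

Trace inspect can be thought of as the Trace Explorer with the layer restriction removed. The Target is optional: you can select a service through the layer → service → instance → endpoint cascade, type a name directly (flagging whether it is real or conjectured), or leave the Target blank and query every service at once. Then set the query conditions as usual: trace id, status, order, duration range, tags, and time window, and click Run query. The UI shows a resolved-query line spelling out the actual call sent to OAP; the results are still the distribution scatter, trace list, and three-lens waterfall you know, just no longer bound to any one layer.

Figure 3: Trace inspect needs no layer up front: leave Target blank to query every service, or select or type one. Here a single trace crosses five services (agent::ui → frontend → app → gateway → songs); the resolved-query line shows the actual OAP call (native · queryTraces).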

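Log inspect: one entry point, three kinds of logs

Log inspect does the same for logs: query any service across layers by selecting it, typing its name, or simply leaving it blank. Through a Source switch it also brings three kinds of logs into one entry point:

Raw: the stored service logs, streamed across services, with tag and trace id conditions; each row opens the same payload popout as the per-layer Log Explorer;
Browser: JS errors from the BROWSER layer, queried by category, with the source-map de-obfuscation covered in the Browser Errors post;
Kubernetes Pod logs: an on-demand live tail of a pod’s container logs, with Start / Pause and Include/Exclude regex filters, never persisted.

Figure 4: Log inspect can query any service across layers, or leave the target blank; the three sources are Raw stored logs, Browser JS errors, and Kubernetes Pod logs. Shown here are raw logs emitted by several services at once.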

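Where it runs

All three entry points live under Operate and share one permission: inspect:read. They differ in how they reach the backend. Trace inspect and Log inspect use OAP’s standard query protocol, the same interface the dashboards and per-layer explorers use, so they are always available and also work with a 10.x OAP. Metrics inspect is the exception: it reads OAP’s metric catalog and entity enumeration through the admin host’s inspect module, so it needs OAP 11. If the module is absent, the page gives a clear “set SW_INSPECT=default” notice rather than just showing an unusable page; the other two entry points keep working normally. Think of the trio as the Operate-side, cross-layer versions of the Trace and Log explorers, plus a metric catalog that can finally answer “which metrics is this backend computing, and which rule defines each of them?”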

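Where to go next

For the field reference, including the metric catalog, entity enumeration, foreign metrics, and MQE execution, see the Inspect docs.

Next up is Platform & Cluster Introspection: Cluster Status, OAP configuration, and data-retention. It is the last stop in Act 3, after which the series turns to governing and securing the console.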

Blog: Meet Horizon UI · 9/17: Five Profilers, One Flame Graph

Fri, 26 Jun 2026 00:00:00 +0000

This is the ninth post in the Meet Horizon UI series. Metrics tell you what slowed down; traces tell you which hop. Profiling goes one level deeper — into the call stacks, kernel events, and process-to-process conversations of a running service — to tell you where in the code. SkyWalking has five different profilers for that, and Horizon surfaces all of them. The headline of this post: four of the five pour into one shared flame graph, and the fifth is a deliberate exception.

One renderer, four profilers

Trace, async, eBPF, and pprof profiling all produce the same fundamental thing — a tree of stack frames with sample counts — so Horizon normalizes them into one shape and renders them through one flame-graph component (a wrapper over d3-flame-graph). The payoff is that you learn the view once and it works the same everywhere:

each frame’s width is its share of the samples, and the hover card reads out the code signature, the dump count, the time spent (including and excluding children), and the frame’s % of root;
clicking a frame zooms into it and pins a highlight on it — and that selected-frame highlight is consistent across all four profilers;
a dim, per-frame color keyed off the method name keeps a thousand-frame graph legible on the dark canvas.
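The numbers on the hover card follow from the frame tree itself. Here is a minimal Python sketch, assuming a normalized frame carries a name, a sample total that includes its children, and a child list; the Frame class, percent_of_root, and the hash-to-hue mapping in frame_hue are hypothetical stand-ins, not the component’s actual code.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    name: str
    total: int  # samples in this frame, children included
    children: List["Frame"] = field(default_factory=list)

    @property
    def self_samples(self) -> int:
        """Samples spent in this frame itself, excluding children."""
        return self.total - sum(child.total for child in self.children)


def percent_of_root(frame: Frame, root: Frame) -> float:
    """A frame's width: its share of all samples under the root."""
    return 100.0 * frame.total / root.total if root.total else 0.0


def frame_hue(name: str) -> int:
    """Stable hue (0-359) keyed off the method name.

    A digest is used instead of Python's hash(), which is salted per process.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % 360


leaf = Frame("Cart.total", 10)
mid = Frame("Checkout.run", 40, [leaf])
root = Frame("root", 100, [mid])
print(percent_of_root(mid, root), mid.self_samples)  # 40.0 30
```

Keying the color off the name rather than the position means the same method keeps the same color wherever it appears in the graph, which is one plausible reason for that design.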

Figure 1: One flame graph for four profilers — frames by sample share, the selected frame pinned, the hover card with % of root.

On the Trace and eBPF tabs you can flip the same data to a Tree view instead — an indented stack table with each method’s total vs self duration and its dump count, expandable frame by frame. (Async and pprof are flame-graph-only; the toggle shows up where both views apply.)

Figure 2: The same result, one toggle away — the Tree view swaps the flame for an indented stack table carrying total vs self duration and dump count.

What each of the four catches

The four stack profilers share the renderer but answer different questions, and each has its own New Task form:

Trace Profiling samples the call stacks of slow trace segments. Scope a task to a service (and optionally one endpoint), set a slowness threshold and a dump period, and the agent snapshots thread stacks from segments that cross the threshold. Then you pick a sampled trace, drill to a profiled span, and Analyze — with a data mode that includes or excludes child-span time.
Async Profiling runs the JVM async-profiler against a live Java service with no restart. A task can target several instances and several events at once — CPU, ALLOC, LOCK, WALL, and the timer events — and an event-type selector re-draws the flame for whichever one you want to read.
eBPF Profiling captures kernel-level stacks with no in-process agent, driven by SkyWalking Rover: ON_CPU (where the process burns CPU) or OFF_CPU (where it’s blocked — on locks, I/O, scheduling). A process picker lets you expand a process’s attributes and pin the ones to profile, and an aggregate toggle counts samples or sums blocked time (the latter only makes sense off-CPU).
pprof profiles a live Go service through the standard runtime profiler — exactly one event per task, chosen from CPU, HEAP, BLOCK, MUTEX, GOROUTINE, ALLOCS, and THREADCREATE. The dialog adapts to the choice: a duration for the timed captures, a sampling rate for BLOCK/MUTEX, and a one-shot snapshot for the rest.

Figure 3: pprof takes exactly one Go event per task — GOROUTINE, MUTEX, and CPU are separate tasks, each with its own duration and sampling rate; select one and Analyze pours it into the same flame graph.

Network Profiling: the deliberate exception

The fifth profiler answers a different kind of question — not “where is one process spending time” but “which processes are talking to which, and over what” — so it renders differently on purpose. Network Profiling captures the network conversations between the processes of a service instance and draws them as a honeycomb topology: each process is a hexagon, the instance’s own processes pack into the centre under a dashed pod boundary, and external peers ring the edge. The links between them are directed and animated, and colored by protocol — HTTPS, TLS, HTTP, and plain TCP each get their own hue and a small pill.

It also runs differently: instead of a fixed duration, a network task carries sampling rules — match by URI pattern, by 4xx/5xx responses, or by a minimum duration, and choose how much of each request/response body to keep — and keeps running until you stop it. Click an edge and a Client side | Server side panel opens with that conversation’s call rate, latency, and bytes charted over the window. It’s drawn from the same process-relation data that powers the 3D Infrastructure Map — and there’s not a flame graph in sight.

Figure 4: The odd one out — process conversations as a honeycomb. In-pod processes pack inside the dashed pod boundary, external peers ring it, and every edge is colored by protocol; clicking one opens its client-vs-server metrics.

One task model, two permissions

For all the differences in what they capture, every profiling tab is the same workflow: a task list on the left, a New Task control, and a result panel on the right. Create a task and the list polls for a few rounds until OAP has dispatched it and the instances report back; select a task to analyze it.

That create-versus-read split is also a permission boundary. Starting a task needs profile:enable (an operator-and-above default): an unbounded profile could peg a production instance’s CPU, so the task forms are duration- and size-capped on the server. Reading a result needs only profile:read (part of the read-only data catalog). So a viewer can sit with a flame graph all day and never be able to launch a profile.

Which tabs you even see depends on the service: a tab appears only when OAP reports that the service supports that kind of profiling. In practice the General agent layer carries the four stack engines (trace, eBPF, async, pprof), eBPF rides wherever Rover is deployed, and network profiling lights up on the service mesh.

Where to go next

For the field reference — every task field, the eBPF aggregate modes, the network sampling rules — see the Profiling docs.

Next up: Alarms & Incident Triage — the incident-centric alarm surface, and replaying the MQE snapshot that fired a rule.

Zh: Meet Horizon UI · 9/17: Five Profilers, One Flame Graph

Fri, 26 Jun 2026 00:00:00 +0000

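Translated from the English original: Meet Horizon UI · 9/17: Five Profilers, One Flame Graph.

This is the ninth post in the Meet Horizon UI series. Metrics tell you what slowed down; traces tell you which hop was slow; profiling goes one layer further down, into the call stacks, kernel events, and inter-process communication of a running service, to tell you which piece of code the problem lands in. SkyWalking provides five profilers for this, and Horizon surfaces all of them. The thread of this post: four of the five feed the same flame graph, and the fifth is a deliberate exception.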

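One renderer, four profilers

Trace, async, eBPF, and pprof profiling all end up producing the same kind of data: a tree of stack frames with sample counts. Horizon first normalizes them into one structure, then hands them to the same flame-graph component (a wrapper over d3-flame-graph) to render. The benefit is direct: you learn the view once, and all four kinds of profiling read the same way:

each frame’s width represents its share of all samples; the hover card shows the code signature, the dump count, the time spent (including and excluding child calls), and the frame’s % of root;
clicking a frame zooms in and pins the selection highlight; this selected state is consistent across the four profilers;
each frame uses a low-saturation color determined by the method name, so a graph of over a thousand frames stays readable on the dark canvas.

Figure 1: Four profilers share one flame graph: frames sized by sample share, the selected frame pinned and highlighted, the hover card showing % of root.

On the Trace and eBPF tabs, the same analysis result can also switch to a Tree view: an indented stack table that shows, frame by frame, each method’s total and self duration and its dump count. (Async and pprof offer the flame graph only; the toggle appears only where both views are supported.)

Figure 2: The same result, one toggle from flame graph to Tree: an indented stack table showing total/self duration and dump count.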

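What each of the four stack profilers catches

The four stack profilers share the renderer but answer different questions, and each has its own New Task form:

Trace Profiling samples the call stacks of slow trace segments. When creating a task, you specify the service (optionally limited to an endpoint), a slowness threshold, and a dump period. When a segment exceeds the threshold, the agent snapshots thread stacks. Afterwards you select a sampled trace, drill down to a profiled span, and click Analyze. There is also a data mode that chooses whether child-span time counts toward the result.
Async Profiling starts the JVM async-profiler on a running Java service, with no restart needed. One task can cover several instances and several events at once: CPU, ALLOC, LOCK, WALL, and the timer events. Selecting a different event type redraws the flame graph for that event.
eBPF Profiling needs no in-process agent; SkyWalking Rover captures stacks at the kernel level: ON_CPU shows where the process spends CPU, OFF_CPU shows where it is blocked, for example on locks, I/O, or scheduling. The process picker can expand process attributes and pin the processes to profile; the aggregate toggle chooses between counting samples and summing blocked time (the latter only suits off-CPU).
pprof profiles a running Go service through the standard Go runtime profiler. Each task can select only one event, from CPU, HEAP, BLOCK, MUTEX, GOROUTINE, ALLOCS, and THREADCREATE. The dialog adapts to the event: timed captures need a duration, BLOCK/MUTEX need a sampling rate, and the rest are one-shot snapshots.

Figure 3: pprof captures one Go event per task: GOROUTINE, MUTEX, and CPU are separate tasks, each with its own duration and sampling rate; select one and Analyze, and it goes into the same flame graph.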

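Network Profiling: the deliberate exception

The fifth profiler asks a different kind of question: not “where does one process spend its time” but “which processes are communicating, and over which protocol.” So it deliberately does not use a flame graph. Network Profiling captures the network conversations between the processes of a service instance and draws them as a honeycomb topology: each process is a hexagon, the instance’s own processes cluster inside a dashed pod boundary, and external peers ring the edge. The edges between them are directed, animated, and colored by protocol: HTTPS, TLS, HTTP, and plain TCP each have their own color and small label.

It also runs differently: a network task has no fixed duration but carries sampling rules. You can match by URI pattern, 4xx/5xx responses, or minimum latency, and configure how much of the request/response body to keep. The task keeps running until you stop it manually. Clicking an edge opens a Client side | Server side panel with charts of that conversation’s call rate, latency, and bytes over the current window. It uses the same process-relation data as the 3D Infrastructure Map. There is no flame graph here, and that is by design.

Figure 4: This profiler is the exception: process communication drawn as a honeycomb topology. In-pod processes cluster inside the dashed boundary, external peers ring the outside, and every edge is colored by protocol; clicking an edge opens client-vs-server metrics.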

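One task model, two kinds of permission

Although the five profilers capture different things, every profiling tab follows the same workflow: a task list on the left, New Task at the top, and a result panel on the right. After you create a task, the list polls for a few rounds, waiting for OAP to dispatch the task and the instances to report back; select a task to analyze it.

Creating a task versus reading a result is also a permission boundary. Starting a task needs profile:enable (held by operator and above by default), because an unbounded profile could saturate a production instance’s CPU, which is why the task forms’ duration and data size are capped on the server. Reading a result needs only profile:read (part of the read-only data permissions). So a viewer can keep looking at flame graphs but cannot launch a profiling task.

Which tabs you see also depends on the current service: a tab appears only when OAP reports that the service supports that kind of profiling. In practice, the General agent layer carries the four stack engines (trace, eBPF, async, pprof); eBPF is available wherever Rover is deployed; and network profiling appears on the service mesh.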

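Where to go next

For the field reference, including every task field, the eBPF aggregate modes, and the network sampling rules, see the Profiling docs.

Next up: Alarms & Incident Triage: how Horizon folds repeated alarms into incidents and replays the MQE metric snapshot from the moment a rule fired.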

Blog: Meet Horizon UI · 8/17: Browser Errors & Source Maps

Tue, 23 Jun 2026 00:00:00 +0000

This is the eighth post in the Meet Horizon UI series. Part 7 was your services’ logs; this one is your users’ errors — the JavaScript exceptions the browser agent reports — and the one capability that turns them from noise into something you can act on.

A production JavaScript stack is unreadable. Your code shipped minified and bundled, so the browser reports an error at app.min.js:1:98412 — a position into machine-generated soup that tells you nothing. The point of this feature is to walk that stack back to your source: the original file, line, column, symbol name, and a snippet of the code around it — frame by frame — by pointing the error at the right source map.

The browser-error feed

On the BROWSER layer, the Browser Logs tab (the on-screen label — it’s specifically the JavaScript-error feed) lists what your browser agent reports. The BROWSER layer renames its slots to match its world — services become Applications, instances Versions, endpoints Pages — and the feed reads like the Log Explorer: a clickable category legend with counts and a density histogram over a stream of rows. Each row carries the time, the category, the page, the app version, and the error message — with the minified line:col shown as a chip when there is one.

You scope it with the same triage instincts as the trace and log tabs: it owns its own Time range (the global topbar is paused), and you narrow by Version, Page, or Category and hit Run query — there’s no background polling to shift the view under you, and no query language to learn, just structured controls. Click a row and it expands inline, right there in the stream.

Figure 1: The browser agent’s error feed — categorized, charted, and scoped to one app’s version and pages.

That minified line:col is the whole problem in miniature. It’s a real position — but into your built bundle, not your source. Which is where the rest of this post comes in.

From a minified stack back to your source

Expand an error and the panel splits in two: on the left, the raw stack exactly as the browser reported it (the gibberish); on the right, where you resolve it. Pick a source map from the dropdown and click Resolve, and Horizon parses the stack and maps every frame through that map:

each frame’s original file:line:column,
the original symbol name (when the map carries it), and
a few lines of the original source around the offending line, with the hit line highlighted (when the map embeds sourcesContent).

A frame the map doesn’t cover is shown honestly as unmapped. So a stack whose top frame read app.min.js:1:45 resolves to computeCartTotal at checkout.ts:2:20, with the lines of checkout.ts around it — the cart.items.reduce(...) that actually threw — sitting right there, the whole stack top to bottom, not just the first frame.

It’s careful about the details that make this either trustworthy or quietly wrong: browser stacks count columns from 1 while source maps count from 0, so the resolver shifts before each lookup — and that path is tested against real bundler output, not a hand-made fixture.
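The off-by-one is small enough to show in full. This Python sketch parses a frame position out of a stack line and shifts the column before a lookup; it assumes the lookup takes a 1-based line and a 0-based column (the convention of common source-map consumer libraries), and parse_frame and lookup_position are hypothetical helpers, not Horizon’s resolver.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "file:line:col", optionally wrapped in parentheses.
FRAME_RE = re.compile(r"(?P<file>[^\s()]+):(?P<line>\d+):(?P<col>\d+)\)?$")


def parse_frame(text: str) -> Optional[Tuple[str, int, int]]:
    """Extract (file, line, column) as the browser reported them."""
    match = FRAME_RE.search(text.strip())
    if match is None:
        return None
    return match["file"], int(match["line"]), int(match["col"])


def lookup_position(line: int, column: int) -> Tuple[int, int]:
    """Shift the browser's 1-based column to the map's 0-based column.

    The line is passed through unchanged under the assumption stated above.
    """
    return line, max(column - 1, 0)


frame = parse_frame("at t (https://shop.example/app.min.js:1:45)")
print(frame)                         # ('https://shop.example/app.min.js', 1, 45)
print(lookup_position(*frame[1:]))   # (1, 44)
```

On a minified bundle everything sits on one very long line, so a column that is off by one can land in a neighbouring mapping segment and resolve to a different symbol.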

Figure 2: The hero — point a minified stack at the right map and read it back in your own source, frame by frame.

Which errors carry a stack to resolve

Not every category has something to translate. JS, PROMISE, and VUE are real JavaScript errors whose stack points into your bundle — these resolve. AJAX and RESOURCE are network and load failures; their “stack” is an HTTP status or a failed URL, not code, so there’s simply nothing for a source map to map (Horizon doesn’t block them — there’s just no JavaScript there to walk back). Frames from code with no source map, or from eval/inline scripts, stay unmapped too. (JS is also the only category the browser reports a top-level line:col for; the others carry their position inside the stack string, which the resolver parses.)

Getting maps in: upload, or mount

A map has to be available before you can resolve against it, and there are two ways to provide one — deliberately different in durability:

Upload a .map straight from the tab. It’s held in the server’s memory only — there’s no backend storage — and it’s temporary by design: it counts against a memory budget, is evicted least-recently-used under pressure, is lost when the server restarts, and (in a multi-instance deployment) lives only on the instance that received it. This is the fast path for ad-hoc triage: drag a map in, resolve, move on.
Mount .map files into the server’s source-map directory (/app/sourcemaps in the container image, via HORIZON_SOURCEMAPS_DIR). These are validated as Source Map v3 at boot, read from disk on demand (so they never sit in the memory budget), survive restarts, reload on their own, and can’t be deleted from the UI. This is the durable, production path — bake your builds’ maps into the image and they’re always there.

The manager shows each map’s origin (an uploaded · temporary map vs a mounted · durable one) and the live memory usage against the budget; budgets (a per-file cap and a total resident-upload cap, 64 MiB and 512 MiB by default) live in a sourceMaps block in horizon.yaml.
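
To make the upload path's behaviour concrete, here is a minimal sketch of a memory-budgeted store with least-recently-used eviction. It is an assumption about how such a store could work, not Horizon's code: the class and method names are hypothetical, and only the two default caps (64 MiB per file, 512 MiB total) come from the text above.

```python
from collections import OrderedDict

MIB = 1024 * 1024


class UploadedMapStore:
    """Hypothetical in-memory store: per-file cap, total cap, LRU eviction."""

    def __init__(self, per_file_cap=64 * MIB, total_cap=512 * MIB):
        self.per_file_cap = per_file_cap
        self.total_cap = total_cap
        self._maps = OrderedDict()  # name -> bytes, least recently used first
        self.used = 0

    def put(self, name, data):
        if len(data) > self.per_file_cap:
            raise ValueError("map exceeds the per-file cap")
        if name in self._maps:  # replacing an upload frees its old bytes first
            self.used -= len(self._maps.pop(name))
        # Evict least-recently-used maps until the new one fits the budget.
        while self._maps and self.used + len(data) > self.total_cap:
            _, evicted = self._maps.popitem(last=False)
            self.used -= len(evicted)
        self._maps[name] = data
        self.used += len(data)

    def get(self, name):
        data = self._maps.get(name)
        if data is not None:
            self._maps.move_to_end(name)  # resolving against a map counts as use
        return data
```

Because the store is a plain in-process object, it also shows why an upload disappears on restart and is visible only on the instance that received it.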

Figure 3: Two ways to provide a map — upload for a quick triage, mount for the durable, production set.

You pick the map — on purpose

One thing Horizon deliberately does not do is guess. The browser agent reports an app version but no exact build fingerprint, so there’s no safe way to auto-match an error to a map — and applying a map from the wrong build gives you confidently wrong line numbers, which is worse than no answer. So the choice is yours: pick the map that matches the error’s build, and keep your maps labelled by version. (One caution worth stating plainly: a source map’s sourcesContent embeds your original source code, so treat the maps you upload or mount as sensitive, and provision them only on servers you trust.)

That manual-by-design choice also draws a clean permission line. Viewing the errors, listing the maps, and resolving a stack are all reads, gated by browser-errors:read; uploading or removing a map is a write, gated by source-map:write. So a read-only viewer can de-obfuscate stacks all day without ever being able to change what maps are loaded — reading is reading, mutating the map store is a write.

Where to go next

For the field reference — the categories, the two provisioning paths, the budgets and the matching-maps-to-builds guidance — see the Browser Logs & Source Maps docs.

Next up: Profiling — five profilers (trace, async, eBPF, Go pprof, network) rendered through one flame graph.

Zh: Meet Horizon UI · 8/17: Browser Errors & Source Maps

Tue, 23 Jun 2026 00:00:00 +0000

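Translated from the English original: Meet Horizon UI · 8/17: Browser Errors & Source Maps.

This is the eighth post in the Meet Horizon UI series. The seventh covered service logs; this one covers the errors your users hit, that is, the JavaScript exceptions reported by the browser agent, and the key step of tracing those errors back to source code.

A production JavaScript stack is essentially unreadable. Code ships minified and bundled, so the browser only reports that the error occurred at app.min.js:1:98412, a position inside machine-generated code that gives you almost no clue. Horizon's job is to take the right source map and map every frame of the stack back to source: the original file, line, column, and symbol name, plus a snippet of the code around the failing position.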

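The browser error stream

On the BROWSER layer, the Browser Logs tab (the on-screen tab name, which refers specifically to the JavaScript error stream) lists what the browser agent reports. The BROWSER layer renames the slots to its own vocabulary: services become Applications, instances become Versions, and endpoints become Pages. The list reads like the Log Explorer: a clickable category legend with counts, and a density histogram above the stream. Each row carries the time, category, page, app version, and error message; when a minified line:col is present, it is shown as a tag.

Triage works the same way as in the trace and log tabs: the tab uses its own Time range (the global topbar is paused), you narrow by Version, Page, or Category, and then click Run query. No background polling keeps pushing the view forward, and there is no query language to learn, only structured controls. Click a row and it expands in place within the stream.

Figure 1: The error stream reported by the browser agent: organized by category, charted, and scoped to an app version and page.

The minified line:col is the problem in its purest form. It is a real position, but it is a position in the built bundle, not in your source. The resolution flow that follows exists to close that gap.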

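From a minified stack to source

Expand an error and the panel splits in two: on the left is the raw stack exactly as the browser reported it, the hard-to-read production stack; on the right is the resolved output area. Pick a source map, click Resolve, and Horizon parses the stack and maps every frame through that map:

the original file:line:column for each frame;
the original symbol name (if the map carries it);
a few lines of original source around the failing line, with the hit line highlighted (if the map embeds sourcesContent).

A frame the map does not cover is explicitly marked unmapped. So a stack whose top frame is app.min.js:1:45 resolves to computeCartTotal at checkout.ts:2:20, with the surrounding lines of checkout.ts displayed. The cart.items.reduce(...) that actually threw is right there in the panel, and the whole stack is resolved top to bottom, not just the first frame.

A few details decide whether the result can be trusted: browser stacks count columns from 1 while source maps count from 0, so the resolver shifts the column before every lookup; and this path is tested against real bundler output, not a hand-written fixture.

Figure 2: The core flow: match a minified stack to the right map, then resolve it frame by frame back into your own source.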

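Which errors a source map can resolve

Not every category of error has something to resolve. JS, PROMISE, and VUE are real JavaScript errors whose stacks point into the bundle, so they resolve. AJAX and RESOURCE are network and load failures; their "stack" is an HTTP status or a failed URL, not code, so a source map has nothing to map (Horizon does not block them; there is simply no JavaScript position to map). Frames from code with no source map, or from eval or inline scripts, stay unmapped too. (JS is also the only category for which the browser reports a top-level line:col; the other categories carry their position inside the stack string, where the resolver extracts it.)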

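Providing source maps: upload or mount

A map has to be available before you can resolve against it. Horizon offers two ways, with deliberately different durability:

Upload a .map straight from the tab. It is kept only in server memory, with no backend storage, and being temporary is the design goal: it counts against the memory budget, is evicted least-recently-used under pressure, and is lost when the server restarts; in a multi-instance deployment it also exists only on the instance that received the upload. This path suits ad-hoc triage: drag a map in, resolve, and move on.
Mount .map files into the server's source-map directory (/app/sourcemaps in the container image, configurable via HORIZON_SOURCEMAPS_DIR). These files are validated as Source Map v3 at startup, read from disk on demand (so they do not count against the memory budget), survive restarts, reload automatically, and cannot be deleted from the UI. This is the production path: put the map files from your build output into the image or the mounted directory and they are always available.

The manager shows each map's origin (uploaded · temporary or mounted · durable) and the current usage against the memory budget. The budget settings, a per-file cap and a total resident-upload cap that default to 64 MiB and 512 MiB, live in the sourceMaps block of horizon.yaml.

Figure 3: Two ways to provide a map: upload for quick triage, mount for long-term production use.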

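You must pick the map by hand

Horizon deliberately does not guess. The browser agent reports an app version but not an exact build fingerprint, so there is no safe way to automatically match an error to a map. A map from the wrong build produces very confident but completely wrong line numbers, which is worse than no answer. So the choice is yours: pick the map that corresponds to the build behind the error, and name your maps clearly by version. (One caution that has to be stated plainly: a source map's sourcesContent contains your original source code, so whether you upload or mount, treat maps as sensitive and keep them only on servers you trust.)

Picking the map by hand also draws a clear permission boundary. Viewing errors, listing maps, and resolving a stack are all reads, gated by browser-errors:read; uploading or deleting a map is a write, gated by source-map:write. So a read-only user can resolve stacks over and over but has no permission to change the set of loaded maps. Reading is reading; only modifying the map store is a write.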

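Further reading

For the field reference, including the categories, the two provisioning paths, the budgets, and how to match maps to build versions, see the Browser Logs & Source Maps docs.

Next up is Profiling: how five profilers (trace, async, eBPF, Go pprof, network) share one flame-graph view, and why network profiling is the exception.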

Blog: Meet Horizon UI · 5/17: The 3D Infrastructure Map

Mon, 22 Jun 2026 00:00:00 +0000

This is the fifth post in the Meet Horizon UI series. Part 3 drew the map between services and Part 4 drew it inside one service. This post zooms all the way out: a single WebGL view of your entire deployment at once — every SkyWalking layer’s services rendered as cubes, stacked in 3D, with live traffic, alarms, and the calls between them. It’s the “stand back and look at everything” companion to the per-layer dashboards.

It’s also genuinely interactive, so rather than describe it cold, here it is — the real map running on the demo’s sample data. Drag to rotate, scroll to zoom, click a cube:

Interactive · sample data Open the 3D map

One scene for the whole estate

The 3D map is a standalone, full-screen view at /3d/map, opened from the 3D Infra pill in the topbar. It deliberately drops the rest of the console — no sidebar, no topbar, no global time picker — so the scene gets the whole viewport. The SkyWalking mark sits bottom-left; the × top-right returns you to Horizon. Everything in it is read live from the same OAP the rest of Horizon talks to: the service roster, each layer’s topology, the per-service traffic, and the active alarms.

Figure 1: The whole deployment in one scene — services as cubes, layers as colored zones, roles as stacked tiers.

Tiers are the spine

A tier is a horizontal plane that groups related layers by their role in the system. Tiers read top-to-bottom the way a request flows — from the apps a user touches down to the platform everything runs on. Horizon ships four bundled tiers:

Apps (top) — application surfaces and the dependencies as the app sees them: General agent services, Browser/RUM, mobile, and the Virtual* targets (database, cache, MQ, gateway, GenAI).
Middleware — the data and messaging services, gateways, and self-observability: MySQL, PostgreSQL, Redis, Kafka, RocketMQ, APISIX, Nginx, the SkyWalking SO11Y components, and cloud-managed data services.
Service Mesh — the mesh that fronts the apps: Istio control/data plane, Cilium, the Envoy AI Gateway.
Infra (bottom) — the platform the rest runs on: Kubernetes, hosts, VMs.

Every layer OAP reports lands on exactly one tier. A layer Horizon hasn’t classified yet — a brand-new OAP layer, say — falls to the failover tier (Middleware by default) with an “unclassified” mark, so it shows up and an operator can re-assign it rather than silently dropping off the map. The panel on the right mirrors the stack: click a tier row to fly the camera to it, use the eye toggle to show or hide a whole tier (or a single layer) at once, and read off how many of its services are currently visible.
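
The failover rule can be sketched in a few lines. This is an illustration under stated assumptions, not Horizon's classifier: the mapping below is a hypothetical excerpt, the layer keys are only examples, and `assign_tier` is an invented name.

```python
# Hypothetical excerpt of a layer-to-tier mapping; the real one is configuration.
TIER_OF_LAYER = {
    "GENERAL": "Apps",
    "BROWSER": "Apps",
    "MYSQL": "Middleware",
    "MESH": "Service Mesh",
    "K8S": "Infra",
}
FAILOVER_TIER = "Middleware"  # the default failover tier named in the text


def assign_tier(layer):
    """Return (tier, unclassified) so an unknown layer is shown, not dropped."""
    tier = TIER_OF_LAYER.get(layer)
    if tier is None:
        return FAILOVER_TIER, True
    return tier, False
```

Returning the flag alongside the tier is what lets the UI paint the "unclassified" mark instead of hiding the layer.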

Reading the map: cubes, zones, traffic

Each cube is one service. Cubes are grouped into their layer’s zone on the tier — a translucent rectangle painted in the layer’s brand color and stamped with the project’s logo (Istio’s sail, the Kubernetes helm, a database cylinder, a queue) so you can pick a zone out from any camera angle. Layers that ship a topology (General, Service Mesh, Kubernetes Service, Cilium) lay their cubes out by call dependency — upstream callers on one side, downstream services on the other, the 3D analogue of the 2D service map. Layers without a topology pack their cubes into a tidy grid.

A small traffic pill under a cube shows that service’s live headline throughput — requests per minute for app and mesh services, queries or operations per second for data services, each with its own unit. The pills appear on cubes close enough to read and fade away as you zoom out to keep the scene clean, then return as you come back in; a selected cube always shows its number.

Alarms, and Beacon mode for incidents

When a service has a currently-firing alarm (Horizon polls the last 20 minutes and counts only service-scoped, still-firing ones), its cube pulses red — a beacon you can spot from clear across the scene, while the alarm feed refreshes on its own.
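
As a rough sketch of that filter, assuming each alarm is a dictionary with hypothetical `service`, `scope`, `start`, and `recovered_at` fields (the real alarm records returned by OAP are shaped differently, and the exact window semantics are an assumption):

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=20)  # the polling window described above


def firing_services(alarms, now):
    """Names of services with a service-scoped alarm in the window that has not recovered."""
    cutoff = now - WINDOW
    return {
        alarm["service"]
        for alarm in alarms
        if alarm["scope"] == "Service"
        and alarm["start"] >= cutoff
        and alarm.get("recovered_at") is None
    }
```

The result is a set, so a service with several firing alarms still maps to exactly one pulsing cube.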

On a busy map, one more red cube among hundreds can still be hard to find — so there’s Beacon mode. Toggle it from the toolbar and every healthy cube dims to a dark wireframe ghost, leaving only the alarming services lit and glowing. The shape of the deployment stays legible, but the services that are actually on fire are the only thing with color. It turns the bird’s-eye view into an incident triage surface in one click.

Figure 2: Beacon mode dims everything healthy to a ghost, so the firing services are the only thing you see.

The lines between things

The map draws the call graph, not just the nodes:

In-layer calls — light cyan tubes between two services in the same layer, with packets animating along them. This is each layer’s internal call graph, always on.
Cross-layer calls — soft amber arrows between services in different layers on the same tier (a Browser app calling a Frontend, a Frontend calling a Virtual Database), pointing from caller to callee.
Hierarchy links — and here’s the one that makes the 3D layout earn its keep. Select a cube and thicker gray tubes connect the different faces of the same logical service across tiers — the service as its agent sees it, as the mesh sees it, as a Kubernetes service. These represent identity, not traffic, so they stay hidden until you select a cube and then show just that cube’s relatives, climbing the stack from tier to tier. It’s the Smartscape idea from Part 3, drawn in the dimension it was always meant for.

Figure 3: Select a cube and its identity links climb the tiers — one service, seen by its agent, the mesh, and Kubernetes.

Moving around

Drag to rotate, scroll to zoom, and arrow keys or WASD pan the view (hold Shift for a bigger step); a top-left toolbar offers the same gestures as buttons for trackpads. There’s one deliberate rule worth knowing: clicks inside the 3D scene never move the camera — they only select. Click a cube and it highlights, a detail card appears beside it (the service’s name, its tier and layer, and an Open dashboard button that jumps into that service’s layer dashboard in a new tab), and its hierarchy links light up. The camera-move surface is the side panel and the toolbar — click a layer row to glide the camera to its zone. Keeping those two jobs separate is what makes selecting a small cube feel reliable instead of having the cube slide out from under your cursor.

How it builds

A whole deployment is too much to fetch in one request, so the map loads in stages, and a slim timeline strip along the bottom shows the progress live: Services (the roster and their layers) → Templates (which layers carry a topology) → Topologies (each topology-bearing layer’s call graph) → Hierarchy (the cross-tier identity links) → Layout (placing the cubes) → Metrics (the per-service traffic, fetched in batches so the cubes light up progressively). Click any step for a drawer with its detail, or hit refresh to re-run the whole sequence.

Two touches make refreshes cheap. The hierarchy step is incremental — only services that are new since the last run get probed, the rest reused from cache, so a steady deployment costs nothing there. And the scene is re-keyed on a per-layer structure hash, so an unchanged refresh keeps your camera exactly where it was; only a real roster or edge change rebuilds the layout. Under the hood it’s Three.js with a thin Vue wrapper, every geometry and material shared across cubes of the same kind — the kind of detail that keeps a few hundred services rendering smoothly in a browser tab.
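
A structure hash of this kind can be sketched as follows. The function name and the choice of SHA-256 over canonical JSON are assumptions made for illustration; the post does not say how Horizon computes its hash.

```python
import hashlib
import json


def structure_hash(services, edges):
    """Hash one layer's roster and call edges, ignoring order and metrics."""
    payload = json.dumps(
        {
            "services": sorted(services),
            "edges": sorted(list(edge) for edge in edges),  # (caller, callee) pairs
        },
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Traffic numbers are deliberately not an input, so a refresh that only changes metrics yields the same key and the layout (and camera) can be kept.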

Configured, not coded

None of the above is a hard-coded “3D screen.” What the map shows is driven by a single configuration an administrator edits in a structured form editor at /admin/3d-map — tiers, layers, colors, and metrics through form controls, not raw JSON. From it you can:

Filter layers with one global regex — anything it excludes drops off the map entirely.
Arrange tiers — rename them, reorder them top-to-bottom, and pin each layer to a tier (with a nominated failover tier so nothing falls off silently).
Group layers — cluster several related layers (the SkyWalking self-observability components, say) into one labelled block, each member keeping its own color.
Color each layer and choose its traffic metric — the MQE expression, a label, and a unit, seeded by default from that layer’s dashboard template so most layers show a sensible number out of the box.

Horizon ships a bundled default, so the map is useful immediately; your edits live as a local draft until you Check diff & push them to OAP — the same draft → preview → publish model behind every dashboard, and the same Export/Import for backup or moving a configuration between deployments. The map itself is a read-only observe surface that runs against your current OAP; publishing the config that shapes it is part of the config-driven customization story a later post in this series covers end to end.

Figure 4: The map is configuration, not code — tiers, colors, and per-layer traffic metrics edited as a form, then published to OAP.

Where to go next

The 3D map is the bird’s-eye summary; the 2D per-layer pages stay the authoritative service maps. Viewing it needs only read access (infra-3d:read, held by the built-in viewer role and up); shaping it needs the same write permission as the dashboards. For the field reference — tiers, the config shape, the loading stages — see the 3D Infrastructure Map docs.

Next up: the Trace Explorer — from the bird’s-eye view of the whole deployment back down to a single request, drawn three different ways.

Zh: Meet Horizon UI · 5/17: The 3D Infrastructure Map

Mon, 22 Jun 2026 00:00:00 +0000

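Translated from the English original: Meet Horizon UI · 5/17: The 3D Infrastructure Map.

This is the fifth post in the Meet Horizon UI series. Part 3 drew the map between services, and Part 4 drew the map inside a single service. This post pulls the view all the way back: one WebGL view that takes in the whole deployment at once. The services of every SkyWalking layer are rendered as cubes, stacked in 3D space, with live traffic, alarms, and the calls between them. It supplies the global view that the per-layer dashboards lack: step back and look at the whole system.

It is also not a static screenshot, so here it is directly: below is the real map running on demo data. Drag to rotate, scroll to zoom, click a cube:

Interactive demo · sample data Open the 3D map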

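One 3D view of the whole deployment

The 3D map is a standalone full-screen page at /3d/map, opened from the 3D Infra entry in the topbar. It deliberately removes the rest of the console: no sidebar, no topbar, no global time picker. The entire browser viewport goes to the scene. The SkyWalking mark sits bottom-left, and the × at top-right returns you to Horizon. All the data in it comes from the same OAP that the other Horizon pages query: the service roster, each layer's topology, per-service traffic, and active alarms.

Figure 1: The whole deployment in one 3D view: services are cubes, layers are colored zones, and roles are stacked by tier.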

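Tiers organize the system's levels

A tier is a horizontal plane that groups layers with similar responsibilities in the system. Reading tiers from top to bottom follows the rough direction a request flows: from the applications users touch directly, all the way down to the underlying platform. Horizon ships four built-in tiers:

Apps (top): application surfaces and the dependencies as the application sees them, including General agent services, Browser/RUM, mobile, and the Virtual* targets (database, cache, MQ, gateway, GenAI).
Middleware: data and messaging services, gateways, and self-observability components, including MySQL, PostgreSQL, Redis, Kafka, RocketMQ, APISIX, Nginx, the SkyWalking SO11Y components, and cloud-managed data services.
Service Mesh: the mesh that carries application traffic, including the Istio control/data plane, Cilium, and the Envoy AI Gateway.
Infra (bottom): the platform everything else runs on, including Kubernetes, hosts, and VMs.

Every layer OAP reports is assigned to one tier. A new layer that Horizon has not classified yet, such as a layer newly added to OAP, goes to the failover tier (Middleware by default) with an "unclassified" mark. That way it shows up and an operator can re-assign it, rather than having it silently vanish from the map. The panel on the right mirrors the whole stack: click a tier row to move the camera there, use the eye toggle to show or hide an entire tier (or a single layer) at once, and see how many services are currently visible.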

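Map elements: cubes, zones, and traffic

Each service corresponds to one cube. On a tier, cubes are gathered by layer into a zone: a translucent rectangle in the layer's brand color, stamped with the project logo (Istio's sail, the Kubernetes helm, a database cylinder, a queue icon), so a zone can be recognized from any angle. Layers with a topology (General, Service Mesh, Kubernetes Service, Cilium) place their cubes by call dependency: upstream callers on one side and downstream services on the other, the 3D counterpart of the 2D service map. Layers without a topology arrange their cubes in a tidy grid.

A small traffic label under a cube shows the service's live headline throughput: requests per minute for application and mesh services, queries or operations per second for data services, each keeping its own unit. Labels appear only when the camera is close enough for the text to be legible; they fade out as you zoom away to keep the scene clean, and a selected cube always shows its number.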

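Highlighting alarming services with Beacon mode

When a service has a currently firing alarm (Horizon polls the last 20 minutes and counts only service-scoped alarms that are still firing), its cube pulses red. It is a beacon you can see from across the whole scene, and the alarm list refreshes independently.

On a busy map, one red dot among hundreds of cubes can still be hard to find, which is why there is Beacon mode. Turn it on from the toolbar and every healthy cube becomes a dark wireframe, leaving only the alarming services glowing. The structure of the deployment stays clear, but only the services that are actually in trouble have color. This mode quickly turns the bird's-eye view into an incident triage view.

Figure 2: Beacon mode dims healthy objects to ghosts, so you see only the firing services.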

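Lines show calls and hierarchy

The map draws the call graph, not just the nodes:

In-layer calls: light cyan tubes between two services in the same layer, with packets animating along them. This is each layer's own internal call graph, and it is always on.
Cross-layer calls: soft amber arrows between services in different layers on the same tier, such as a Browser app calling a Frontend or a Frontend calling a Virtual Database, pointing from caller to callee.
Hierarchy links: this is the kind of line that makes the 3D layout truly pay off. Select a cube and thick gray tubes connect the different forms of the same logical service across tiers: the service as the agent sees it, as the mesh sees it, and as a Kubernetes service. They represent identity, not traffic, so they are hidden by default; once a cube is selected, only the objects related to it are shown, climbing tier by tier. This is the Smartscape from Part 3, placed in a 3D view better suited to expressing hierarchy.

Figure 3: Select a cube and its identity links climb the tiers: the same service as seen by the agent, the mesh, and Kubernetes.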

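Moving the view and selecting

Drag to rotate, scroll to zoom, and use the arrow keys or WASD to pan (hold Shift for a larger step); the top-left toolbar also offers buttons for the same actions for trackpad users. One deliberate rule is worth knowing: clicks inside the 3D scene never move the view; they only select. Click a cube and it highlights, a detail card appears beside it (the service name, tier, layer, and an Open dashboard button that opens the service's layer dashboard in a new tab), and its hierarchy links light up. The controls that move the view are the side panel and the toolbar; click a layer row and the camera glides to its zone. Keeping "select" and "move the view" separate is what makes clicking a small cube reliable, so it does not slide out from under the cursor the moment you hit it.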

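How the map loads in stages

The data for a whole deployment is too large to pull in one request, so the map loads in stages, and a slim timeline strip at the bottom shows progress live: Services (the service roster and their layers) → Templates (which layers carry a topology) → Topologies (the call graph of each topology-bearing layer) → Hierarchy (cross-tier identity links) → Layout (placing the cubes) → Metrics (per-service traffic fetched in batches, so the cubes light up progressively). Click any stage to open a drawer with its details, or click Refresh to re-run the whole flow.

Two design choices lower the cost of a refresh. The Hierarchy stage is incremental: only services added since the last run need probing and the rest are reused from cache, so a stable deployment pays nothing extra at this step. The scene is also re-keyed by a per-layer structure hash; when the structure is unchanged a refresh keeps your camera position, and the layout is rebuilt only when the service roster or the edges really change. Underneath it is Three.js with a thin Vue wrapper, and cubes of the same kind share geometry and material. These details are what let a few hundred services render smoothly in a browser tab.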

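The map's structure comes from configuration

None of the above is a hard-coded "3D page." What the map shows is edited by an administrator in a structured form editor at /admin/3d-map: tiers, layers, colors, and metrics are all controlled through forms rather than hand-written JSON. There you can:

Filter layers with one global regex: anything it excludes disappears from the map entirely.
Arrange tiers: rename them, reorder them top to bottom, and pin each layer to a tier, while nominating a failover tier so nothing is silently lost.
Group layers: cluster several related layers, such as the SkyWalking self-observability components, into one labelled block, with each member keeping its own color.
Color each layer and choose its traffic metric: configure the MQE expression, label, and unit; the defaults are seeded from the layer's dashboard template, so most layers show a sensible number from the start.

Horizon ships with a default configuration, so the map is useful out of the box. Your edits are saved as a local draft until you click Check diff & push to publish them to OAP. It uses the same draft → preview → publish model as the dashboards, and supports the same Export/Import for backup or for moving a configuration between deployments. The map itself is a read-only observe surface that runs directly against the current OAP; publishing the configuration that shapes the map belongs to the config-driven customization that a later post covers in full.

Figure 4: The map is configuration, not code. Tiers, colors, and each layer's traffic metric are edited as a form and then published to OAP.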

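Further reading

The 3D map is the global entry point; the 2D per-layer pages remain the most complete service maps. Viewing it needs only read access (infra-3d:read, held by the built-in viewer role and above); adjusting it needs the same write permission as the dashboards. For the field reference, including tiers, the config structure, and the loading stages, see the 3D Infrastructure Map docs.

The next post returns to the single request: the Trace Explorer uses a distribution chart, a waterfall, and a call tree to help you pin down slow calls.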

Zh: Monitoring LLM Applications with SkyWalking 10.4: Insight into LLM Performance and Cost

Sun, 05 Apr 2026 00:00:00 +0000

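The problem: applications are devouring LLMs while monitoring leaves a blind spot

As generative AI (GenAI) works its way deep into enterprise business, developers face an awkward situation: we integrate powerful LLM capabilities into our applications quickly through Spring AI or the OpenAI SDK, yet we know almost nothing about how those calls actually perform.

The cost and performance "black box": is the expensive model really better value?
Faced with a steep LLM bill, we often know only that the money went to some provider, and cannot work out the return on that spend inside the application. Blind upgrades: chasing a better experience, you may have switched the business default to a more expensive flagship model. But in your specific scenario, does spending several times the token cost really buy lower latency and a faster TTFT (Time to First Token) on real requests? No realistic baseline: detached from real business requests, the benchmark on a vendor's website means little. You need to know which model from the same provider strikes the best balance between token/cost consumption and response speed under your actual prompt lengths and concurrency. Without application-side data, you cannot judge which model is the best fit for the business today.
The missing "golden timeout"
Many teams pick the timeout for LLM calls by guesswork (30s or 60s, say).
Too short: during long-form generation or peak hours at the model provider, requests are frequently cut off, and the business failure rate spikes.
Too long: if the downstream provider fails (hangs), large numbers of requests pile up in application memory and block worker threads, eventually paralysing the whole Java application or even the microservice cluster. Only when you truly know the expected end-to-end call latency (P99/P95 latency) can you set a sensible timeout policy per model from data rather than intuition.
The overlooked experience killer: TTFT
In GenAI scenarios, a user's perception of "fast" does not depend entirely on the total time the conversation takes, but on when the first line of text appears. A streaming response that takes 10 seconds in total but has a TTFT of only 500ms feels instant. A non-streaming response that takes 5 seconds in total but needs 4s for TTFT feels frozen. If your observability system sees only total duration, you miss the most important UX metric and cannot explain why users report that "the AI is slow" even when total duration looks fine.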

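SkyWalking 10.4: a "digital dashboard" from the application's point of view
The Virtual GenAI capability that Apache SkyWalking introduced in 10.4 addresses exactly this observability vacuum at the application layer. It does not depend on any external gateway; it collects the most faithful data from the client's point of view, directly through application-side agents such as the Java Agent.

Precise latency distribution (latency percentiles): P50, P90, P99, and other percentiles trace the real fluctuation of LLM calls and give a data-driven basis for setting a "dynamic timeout."
The core UX metric, TTFT monitoring: first-token latency for streaming calls is supported natively. By comparing TTFT across providers or models, you can tune your prompt strategy or switch to a faster model to keep the user experience responsive.
Multi-dimensional model profiling: token consumption, estimated cost, and performance metrics are aligned along both the Provider and Model dimensions. Instead of the provider's network-wide "ideal average," you see how your application really performs when calling a specific model, and can pick the most cost-effective option from a complex model ecosystem.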

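Virtual GenAI observability

Virtual GenAI represents the generative AI service nodes detected by agent plugins. All performance metrics for GenAI operations are measured from the GenAI client's perspective.

For example, the Spring AI plugin in the Java agent can detect the response latency of a chat completion request. SkyWalking then shows the following on the dashboard:

Traffic and success rate (CPM & SLA)
Response latency (Latency & TTFT)
Token consumption (Input/Output)
Estimated cost

As shown in the figure:

How it works

When the SkyWalking Java Agent or an OTLP agent intercepts a call to a mainstream AI framework (such as Spring AI or the OpenAI SDK), it reports trace data to SkyWalking OAP. OAP aggregates and computes over these traces automatically, producing performance metrics along two dimensions, Provider and Model, and renders them directly into the built-in Virtual-GenAI dashboard.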

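Installation and configuration

Requirements

Version requirements

● SkyWalking Java Agent: >= 9.7
● SkyWalking OAP: >= 10.4

Semantic conventions and compatibility

SkyWalking Virtual GenAI follows the OpenTelemetry GenAI semantic conventions. OAP identifies GenAI-related spans by the following criteria:

SkyWalking Java Agent

The reported span must be an Exit span, with its SpanLayer set to GENAI, and must carry the gen_ai.response.model tag.

Agents that emit OTLP / Zipkin data

The reported span carries the gen_ai.response.model tag.

For details, see the e2e configurations:
Data reported by the SkyWalking Java Agent
Agents reporting OTLP data
Agents reporting Zipkin data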

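GenAI estimated cost configuration

Overview

SkyWalking provides a built-in GenAI pricing configuration file.

This configuration defines how SkyWalking maps model names in trace data to the corresponding provider and estimates the token cost of each LLM call. The estimated cost is shown in the SkyWalking UI alongside trace and metric data, helping users see the estimated spend from GenAI usage. Important: the prices in this file are for cost estimation only and must not be treated as actual bills or invoice amounts. Verify the latest rates on each provider's official pricing page regularly.

Configuration structure

Top-level fields

Field	Type	Description
`last-updated`	`date`	The date the pricing data was last updated. All prices are based on the public rates published on each vendor's website as of that date.
`providers`	`list`	The list of GenAI provider definitions. Each provider entry contains matching rules and model pricing.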

Provider definition

Each entry under providers defines one GenAI provider:

providers:
- provider: <provider-name>
  prefix-match:
    - <prefix-1>
    - <prefix-2>
  models:
    - name: <model-name>
      aliases: [<alias-1>, <alias-2>]
      input-estimated-cost-per-m: <cost>
      output-estimated-cost-per-m: <cost>

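Field	Type	Required	Description
`provider`	`string`	Yes	Provider identifier (for example `openai`, `anthropic`, `gemini`). Shown in SkyWalking as the virtual GenAI service name.
`prefix-match`	`list[string]`	Yes	Prefixes used to match model names to this provider. If the model name in the trace data starts with any of them, it is mapped to this provider.
`models`	`list[model]`	No	Model definitions with pricing. If omitted, the provider is still recognized but no cost is estimated.

Model definition

Each entry under models defines the pricing for a specific model:

Field	Type	Required	Description
`name`	`string`	Yes	The canonical model name used for matching.
`aliases`	`list[string]`	No	Alternative names that should resolve to the same pricing entry. Useful when a provider uses different naming conventions (see "Model aliases").
`input-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) input (prompt) tokens. The default unit is USD.
`output-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) output (completion) tokens. The default unit is USD.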
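
The two per-million fields imply simple arithmetic for a single call. The function below is a sketch of that arithmetic only, with a hypothetical name; how OAP rounds or aggregates the value is not described here, and the rates in the usage note are illustrative.

```python
def estimate_cost(input_tokens, output_tokens, input_cost_per_m, output_cost_per_m):
    """Estimated cost of one call, given per-million-token rates."""
    return (
        input_tokens * input_cost_per_m + output_tokens * output_cost_per_m
    ) / 1_000_000
```

For example, with rates of 2.5 (input) and 10.0 (output) per million tokens, a call that used 200,000 input tokens and 50,000 output tokens comes to 0.5 + 0.5 = 1.0 in the configured unit.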

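Model matching

Provider-level prefix matching

When SkyWalking receives a trace containing a GenAI call, it determines the provider in the following order of priority:

gen_ai.provider.name tag: checked first. This is the tag defined by the latest OpenTelemetry semantic conventions.
gen_ai.system tag: if the tag above is missing, SkyWalking falls back to this legacy tag. Note: this tag is parsed only for OTLP or Zipkin data, mainly for compatibility with older libraries such as the Python auto-instrumentation.
Prefix matching: if neither tag is present, SkyWalking reads the prefix-match rules defined in gen-ai-config.yml and tries to identify the provider by matching the model name.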

- provider: openai
  prefix-match:
    - gpt

Any model name starting with gpt (such as gpt-4o, gpt-4.1-mini, gpt-5-nano) is mapped to the openai provider. A provider can have several prefixes:

- provider: tencent
  prefix-match:
    - hunyuan
    - Tencent
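
The three-step priority can be sketched as a single function. This is an illustration, not OAP's code: `resolve_provider`, the `protocol` argument, and the in-memory `PREFIXES` table are hypothetical, and returning the first matching provider when several prefixes could match is an assumption.

```python
# Hypothetical in-memory form of the prefix-match rules shown above.
PREFIXES = {
    "openai": ["gpt"],
    "tencent": ["hunyuan", "Tencent"],
}


def resolve_provider(tags, model, protocol):
    """Apply the documented order: new tag, legacy tag (OTLP/Zipkin only), prefix match."""
    provider = tags.get("gen_ai.provider.name")
    if provider:
        return provider
    if protocol in ("otlp", "zipkin"):
        legacy = tags.get("gen_ai.system")
        if legacy:
            return legacy
    for name, prefixes in PREFIXES.items():
        if any(model.startswith(prefix) for prefix in prefixes):
            return name
    return None  # no tag and no matching prefix
```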

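Model-level longest-prefix matching

Once the provider is determined, SkyWalking uses a trie-based longest-prefix match to find the best pricing entry for the model. This matters because the model names LLM providers return in API responses often include a version number or timestamp and so differ from the base model name in the configuration. Example: suppose the OpenAI configuration entries are: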

models:
- name: gpt-4o
  input-estimated-cost-per-m: 2.5
  output-estimated-cost-per-m: 10.0
- name: gpt-4o-mini
  input-estimated-cost-per-m: 0.15
  output-estimated-cost-per-m: 0.6

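The matching behaves as follows:

Model name in trace	Matched entry	Reason
`gpt-4o`	`gpt-4o`	Exact match
`gpt-4o-2024-08-06`	`gpt-4o`	Longest prefix is `gpt-4o`
`gpt-4o-mini`	`gpt-4o-mini`	Exact match (a prefix longer than `gpt-4o` wins)
`gpt-4o-mini-2024-07-18`	`gpt-4o-mini`	Longest prefix is `gpt-4o-mini`

This ensures that the versioned model names returned by the API map to the right price tier without maintaining exact full names in the configuration file.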

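Model aliases

Some providers use different naming conventions in API responses and in their official documentation. For example, an Anthropic model may appear in traces as claude-4-sonnet or claude-sonnet-4. The aliases field lets a single pricing entry cover both names: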

- name: claude-4-sonnet
  aliases: [claude-sonnet-4]
  input-estimated-cost-per-m: 3.0
  output-estimated-cost-per-m: 15.0

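With this configuration, claude-4-sonnet and claude-sonnet-4 (and any versioned variant such as claude-sonnet-4-20250514) all resolve to the same pricing entry.
Note: aliases also take part in longest-prefix matching. So claude-sonnet-4-20250514 matches the alias claude-sonnet-4 and resolves to the pricing of claude-4-sonnet.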
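
Longest-prefix lookup with aliases can be sketched with a small character trie. This is a minimal illustration of the technique, not SkyWalking's implementation: `PricingTrie` and `build_trie` are hypothetical names, and entries are plain dictionaries shaped like the YAML above.

```python
class PricingTrie:
    """Character trie; each configured name marks its node with a pricing entry."""

    _ENTRY = "$entry"  # multi-character key, so it cannot clash with a single char

    def __init__(self):
        self._root = {}

    def add(self, name, entry):
        node = self._root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self._ENTRY] = entry

    def longest_prefix(self, model):
        """Return the entry of the longest configured name that prefixes `model`."""
        node, best = self._root, None
        for ch in model:
            node = node.get(ch)
            if node is None:
                break
            best = node.get(self._ENTRY, best)
        return best


def build_trie(models):
    """Register every model under its name and under each alias."""
    trie = PricingTrie()
    for model in models:
        for name in [model["name"], *model.get("aliases", [])]:
            trie.add(name, model)
    return trie
```

Because an alias is inserted as just another key pointing at the same entry, a versioned alias such as claude-sonnet-4-20250514 needs no special handling: it simply walks the alias path and picks up the shared entry.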

Custom configuration

Adding a new provider: to add a provider that the default configuration does not include:

providers:
# ... existing providers ...

- provider: ollama
  prefix-match:
    - mymodel
  models:
    - name: mymodel-large
      input-estimated-cost-per-m: 1.0
      output-estimated-cost-per-m: 5.0
    - name: mymodel-small
      input-estimated-cost-per-m: 0.1
      output-estimated-cost-per-m: 0.5

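For OTLP/Zipkin data, a separate estimated tag has been added, so the UI shows the cost of the individual GenAI call.

Key metrics

1. Provider level

Metric ID	Description	Meaning
`gen_ai_provider_cpm`	Calls Per Minute	Requests per minute (throughput)
`gen_ai_provider_sla`	Success Rate	Request success rate
`gen_ai_provider_resp_time`	Avg Response Time	Average response time
`gen_ai_provider_latency_percentile`	Latency Percentiles	Response time percentiles (P50, P75, P90, P95, P99)
`gen_ai_provider_input_tokens_sum/avg`	Input Token Usage	Sum and average of input tokens
`gen_ai_provider_output_tokens_sum/avg`	Output Token Usage	Sum and average of output tokens
`gen_ai_provider_total_estimated_cost/avg`	Estimated Cost	Total estimated cost and average cost per call

2. Model level

Metric ID	Description	Meaning
`gen_ai_model_call_cpm`	Calls Per Minute	Requests per minute for the specific model
`gen_ai_model_sla`	Success Rate	Model request success rate
`gen_ai_model_latency_avg/percentile`	Latency	Average and percentiles of model response time
`gen_ai_model_ttft_avg/percentile`	TTFT	Time to first token (streaming only)
`gen_ai_model_input_tokens_sum/avg`	Input Token Usage	Input token consumption for the model
`gen_ai_model_output_tokens_sum/avg`	Output Token Usage	Output token consumption for the model
`gen_ai_model_total_estimated_cost/avg`	Estimated Cost	Total estimated cost and average cost per call for the model

Suggested use cases

Performance evaluation: use the latency and TTFT metrics to analyse model inference efficiency and the end-user interaction experience.
Token monitoring: monitor input and output token consumption in real time to analyse resource usage across business scenarios.
Cost alerting: configure alert thresholds on estimated cost or token consumption to catch abnormal calls early and prevent overspend.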

Zh: Monitoring Envoy AI Gateway with Apache SkyWalking

Thu, 02 Apr 2026 00:00:00 +0000

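The problem: LLM traffic lacks unified observability

LLM traffic is becoming a part of production infrastructure that cannot be ignored. Teams call OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, and Google Gemini at the same time, often more than one provider. Yet most organizations lack unified visibility into that traffic:

Token costs run out of control, with no idea which team, model, or provider is burning the money. One misconfigured prompt template can burn through thousands of dollars without anyone noticing.
Provider outages cascade. For the hour that OpenAI is down, your application is down with it, and you have neither visibility into failover nor a way to switch providers automatically.
No unified metrics. Latency, time to first token (TTFT), time per output token (TPOT), token usage, error rate: every provider reports them differently, and some do not report them at all. No single dashboard lets you compare.

This mirrors the observability problem microservices faced a decade ago. The answer then was the service mesh and API gateways with built-in telemetry. For AI workloads, the answer is the AI gateway.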

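Why an AI gateway

Envoy AI Gateway is an open-source AI gateway built on Envoy Proxy and Envoy Gateway. Underneath is the same Envoy already widely deployed across the cloud-native world, which brings infrastructure-grade stability and performance.

Core capabilities:

Multi-provider routing: supports 16+ AI providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, Mistral, Cohere, DeepSeek, and others) behind a unified API.
Token-based rate limiting: limits by token consumption, not just request count.
Provider failover: switches automatically when a provider is down or slow.
Model virtualization: abstracts model names so applications are decoupled from specific providers.
Two-tier architecture: the reference architecture has a central ingress gateway (Tier 1) for authentication and global routing, and per-cluster gateways (Tier 2) for inference optimization.
CNCF-ecosystem native: runs on Kubernetes and is compatible with existing Envoy filters, WASM plugins, and standard Kubernetes Gateway API resources.

Envoy AI Gateway natively supports sending GenAI metrics and access logs over OTLP, following the OpenTelemetry GenAI semantic conventions, so it can plug into any OpenTelemetry-compatible backend.

Starting with SkyWalking 10.4.0, OAP natively receives and analyses Envoy AI Gateway's OTLP metrics and access logs, with no OpenTelemetry Collector in between.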

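Data flow

The AI Gateway pushes telemetry directly to SkyWalking over OTLP gRPC:

The application sends LLM API requests through Envoy AI Gateway.
Envoy AI Gateway routes the request to an AI provider (or a local model such as Ollama) and records GenAI metrics (token usage, latency, TTFT, TPOT) and access logs.
The gateway pushes metrics and logs over OTLP gRPC directly to port 11800 of SkyWalking OAP.
SkyWalking OAP parses the metrics with MAL rules and the access logs with LAL rules, then stores everything in BanyanDB.

No OpenTelemetry Collector is needed. The OTLP receiver built into SkyWalking OAP handles all of the data directly.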

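Try it locally

This demo uses Ollama as the local LLM backend, so it runs without any API key. The Envoy AI Gateway CLI (aigw) offers a standalone mode that does not depend on Kubernetes, which makes it well suited to local testing.

Prerequisites

Docker and Docker Compose
Ollama installed on the host

Step 1: Start Ollama

Have Ollama listen on all network interfaces so the Docker containers can reach it: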

OLLAMA_HOST=0.0.0.0 ollama serve

Pull a small model for testing:

ollama pull llama3.2:1b

Step 2: Start the stack

Create docker-compose.yaml:

services:
  banyandb:
    image: apache/skywalking-banyandb:0.10.0
    container_name: banyandb
    ports:
      - "17912:17912"
    command: standalone --stream-root-path /tmp/stream-data --measure-root-path /tmp/measure-data
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:17913/api/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10

  oap:
    image: apache/skywalking-oap-server:10.4.0
    container_name: oap
    depends_on:
      banyandb:
        condition: service_healthy
    ports:
      - "11800:11800"
      - "12800:12800"
    environment:
      SW_STORAGE: banyandb
      SW_STORAGE_BANYANDB_TARGETS: banyandb:17912
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/12800' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 30
      start_period: 60s

  ui:
    image: apache/skywalking-ui:10.4.0
    container_name: ui
    depends_on:
      oap:
        condition: service_healthy
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800

  aigw:
    image: envoyproxy/ai-gateway-cli:latest
    container_name: aigw
    depends_on:
      oap:
        condition: service_healthy
    environment:
      - OPENAI_BASE_URL=http://host.docker.internal:11434/v1
      - OPENAI_API_KEY=unused
      - OTEL_SERVICE_NAME=my-ai-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://oap:11800
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_METRICS_EXPORTER=otlp
      - OTEL_LOGS_EXPORTER=otlp
      - OTEL_METRIC_EXPORT_INTERVAL=5000
      - OTEL_RESOURCE_ATTRIBUTES=job_name=envoy-ai-gateway,service.instance.id=aigw-1,service.layer=ENVOY_AI_GATEWAY
    ports:
      - "1975:1975"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command: ["run"]

Start all services:

docker compose up -d

Wait for all services to become healthy (BanyanDB starts first, then OAP, and finally the UI and the AI Gateway):

docker compose ps

Key OTLP settings for the aigw service:

Environment variable	Value	Purpose
`OTEL_SERVICE_NAME`	`my-ai-gateway`	Service name in SkyWalking
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://oap:11800`	SkyWalking OAP gRPC endpoint
`OTEL_EXPORTER_OTLP_PROTOCOL`	`grpc`	OTLP transport protocol
`OTEL_METRICS_EXPORTER`	`otlp`	Enables metrics export
`OTEL_LOGS_EXPORTER`	`otlp`	Enables access log export

OTEL_RESOURCE_ATTRIBUTES must include:

job_name=envoy-ai-gateway: the routing tag for the MAL/LAL rules
service.instance.id=<id>: the instance identifier
service.layer=ENVOY_AI_GATEWAY: routes logs to the AI Gateway LAL rules

The MAL and LAL rules are enabled by default in SkyWalking OAP; no extra configuration is needed.
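
Since a missing or misspelled resource attribute means the rules silently never match, a small pre-flight check can help. The sketch below is a hypothetical helper, not part of aigw or SkyWalking; it assumes the usual comma-separated key=value form of OTEL_RESOURCE_ATTRIBUTES.

```python
# Fixed values the MAL/LAL rules route on, per the list above.
REQUIRED = {
    "job_name": "envoy-ai-gateway",
    "service.layer": "ENVOY_AI_GATEWAY",
}


def parse_resource_attributes(raw):
    """Split 'k1=v1,k2=v2' into a dict, ignoring fragments without '='."""
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def missing_attributes(raw):
    """Return human-readable problems; an empty list means the value looks usable."""
    attrs = parse_resource_attributes(raw)
    problems = [
        f"{key} must be {value}"
        for key, value in REQUIRED.items()
        if attrs.get(key) != value
    ]
    if not attrs.get("service.instance.id"):
        problems.append("service.instance.id must be set")
    return problems
```

Running it against the value from the compose file returns an empty list; dropping service.layer or mistyping job_name yields a message naming the attribute to fix.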

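Step 3: Run the demo application

Create a simple Python application that sends requests through the AI Gateway (app.py). It mixes normal requests, streaming requests (to produce TTFT/TPOT metrics), and error requests (a non-existent model → HTTP 404, which the LAL sampling policy always captures):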

import time, random, requests

GATEWAY = "http://localhost:1975"
HEADERS = {"Authorization": "Bearer unused", "Content-Type": "application/json"}

questions = [
    "What is Apache SkyWalking? Answer in one sentence.",
    "What is Envoy Proxy used for? Answer in one sentence.",
    "What are the benefits of an AI gateway? Answer in two sentences.",
    "Explain observability in three sentences.",
]

def chat(model, question, stream=False):
    resp = requests.post(
        f"{GATEWAY}/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": question}], "stream": stream},
        headers=HEADERS, timeout=60, stream=stream,
    )
    if stream:
        chunks = []
        for line in resp.iter_lines():
            if line:
                chunks.append(line.decode())
        return resp.status_code, f"[streamed {len(chunks)} chunks]"
    return resp.status_code, resp.json()

while True:
    r = random.random()
    if r < 0.2:
        # Error request: non-existent model triggers 404
        status, body = chat("non-existent-model", "hello")
        print(f"[error] model=non-existent-model status={status}")
    elif r < 0.5:
        # Streaming request — generates TTFT and TPOT metrics
        q = random.choice(questions)
        status, info = chat("llama3.2:1b", q, stream=True)
        print(f"[stream] status={status} {info}")
    else:
        # Normal non-streaming request
        q = random.choice(questions)
        status, body = chat("llama3.2:1b", q)
        answer = body.get("choices", [{}])[0].get("message", {}).get("content", "")[:80]
        tokens = body.get("usage", {})
        print(f"[ok] status={status} tokens={tokens} answer={answer}...")
    time.sleep(random.randint(20, 30))

Run it:

pip install requests
python app.py

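The application talks to the AI Gateway on port 1975, and the AI Gateway routes the requests to Ollama. Every request produces GenAI metrics (token usage, latency, TTFT, TPOT) and an access log, which the gateway pushes to SkyWalking over OTLP.

Error requests (a non-existent model returns HTTP 404) are always captured by the access log sampling policy, so they are guaranteed to show up in SkyWalking's log view.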
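To make the two streaming metrics concrete, here is a rough client-side sketch of how TTFT (time to first token) and TPOT (time per output token) could be derived from chunk arrival times. It assumes one token per streamed chunk, which is only an approximation. The gateway computes its own values, and the function name ttft_and_tpot is hypothetical.

```python
from __future__ import annotations


def ttft_and_tpot(request_start: float, chunk_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from the request start and chunk arrival times.

    Assumption: every chunk carries exactly one output token.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    # TTFT: delay between sending the request and receiving the first chunk.
    ttft = chunk_times[0] - request_start
    if len(chunk_times) < 2:
        return ttft, 0.0
    # TPOT: average gap between consecutive chunks after the first one.
    tpot = (chunk_times[-1] - chunk_times[0]) / (len(chunk_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25]))
```

In the demo app, the arrival times could be collected by calling time.monotonic() inside the iter_lines() loop.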

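Step 4: View the Results in the SkyWalking UI

Open http://localhost:8080 and select the GenAI > Envoy AI Gateway menu.

The service list shows my-ai-gateway, with CPM, latency, and token rates at a glance:

Click into the service to see the full dashboard: request CPM, latency (average and percentiles), input/output token rates, TTFT, and TPOT:

The Providers tab breaks the metrics down by AI provider:

The Models tab shows per-model metrics, including TTFT and TPOT (streaming requests only). Note the unknown model entry: these are the error requests that used a non-existent model:

The Log tab shows the access logs. The sampling policy drops normal successful responses but always keeps errors (HTTP 404) and requests with high token consumption: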
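The sampling behavior described above can be pictured with a small sketch. This is not the LAL rule itself: the function should_keep and the HIGH_TOKEN_THRESHOLD value are assumptions chosen for illustration, and the real cutoff lives in the bundled SkyWalking rules.

```python
# Hypothetical cutoff for "high token consumption"; the real value is defined
# by the SkyWalking LAL rule, not by this sketch.
HIGH_TOKEN_THRESHOLD = 10_000


def should_keep(status_code: int, total_tokens: int) -> bool:
    """Decide whether an access log entry survives sampling."""
    if status_code >= 400:
        # Errors such as the HTTP 404 from a non-existent model are always kept.
        return True
    # Successful responses are kept only when they consumed many tokens.
    return total_tokens >= HIGH_TOKEN_THRESHOLD


if __name__ == "__main__":
    for status, tokens in [(404, 0), (200, 50), (200, 20_000)]:
        print(status, tokens, should_keep(status, tokens))
```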

Cleanup

docker compose down

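Kubernetes Production Deployment

In production, Envoy AI Gateway runs as a full Kubernetes controller, with Envoy Gateway as the control plane. See the Envoy AI Gateway getting started guide for details.

The OTLP configuration is the same: set the OTEL_* environment variables on the AI Gateway's External Processor and point them at the SkyWalking OAP gRPC port (11800). See the SkyWalking Envoy AI Gateway monitoring documentation for details.

GenAI Observability Without an AI Gateway

Not every setup needs an AI gateway. If your application calls LLM providers directly, SkyWalking 10.4.0 also offers GenAI observability through the Virtual GenAI layer.

Any application instrumented with a SkyWalking, OpenTelemetry, or Zipkin agent can use it. As long as the traces carry gen_ai.* tags (following the OpenTelemetry GenAI semantic conventions), SkyWalking derives per-provider and per-model metrics from the client's point of view: latency, token usage, success rate, and estimated cost.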
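The idea of deriving per-provider, per-model metrics from tagged spans can be sketched as a simple aggregation. This is not how OAP implements it. The span shape (a dict with tags, duration_ms, and error) is hypothetical, and the tag names below follow my reading of the OpenTelemetry GenAI conventions, so check them against the tags your agent actually emits. Cost estimation is left out.

```python
from collections import defaultdict


def summarize(spans):
    """Aggregate GenAI client spans into per-(provider, model) metrics."""
    totals = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_ms": 0,
                                  "input_tokens": 0, "output_tokens": 0})
    for span in spans:
        tags = span["tags"]
        # Assumed tag names; spans missing them fall into an "unknown" bucket.
        key = (tags.get("gen_ai.provider.name", "unknown"),
               tags.get("gen_ai.request.model", "unknown"))
        row = totals[key]
        row["calls"] += 1
        row["errors"] += 1 if span.get("error") else 0
        row["latency_ms"] += span["duration_ms"]
        row["input_tokens"] += int(tags.get("gen_ai.usage.input_tokens", 0))
        row["output_tokens"] += int(tags.get("gen_ai.usage.output_tokens", 0))
    return {
        key: {
            "avg_latency_ms": row["latency_ms"] / row["calls"],
            "success_rate": 1 - row["errors"] / row["calls"],
            "input_tokens": row["input_tokens"],
            "output_tokens": row["output_tokens"],
        }
        for key, row in totals.items()
    }
```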

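For Java applications, the SkyWalking Java Agent (9.7+) ships with a Spring AI plugin that automatically adds the correct gen_ai.* span tags to calls to 13+ providers (OpenAI, Anthropic, AWS Bedrock, Google GenAI, DeepSeek, Mistral, and others), with no code changes.

This is a different use case from the Envoy AI Gateway monitoring described above:

Envoy AI Gateway layer: infrastructure-level observability from the gateway's point of view, covering all traffic. It suits platform teams responsible for centralized AI routing.
Virtual GenAI layer: application-level observability, showing the LLM calls as each application sees them. It suits teams without a central gateway, or cases where cost must be tracked per application.

References

Envoy AI Gateway: project website and documentation
Envoy AI Gateway CLI: standalone mode for local development
SkyWalking Envoy AI Gateway monitoring: OAP configuration documentation
SkyWalking Virtual GenAI: client-side GenAI observability
OpenTelemetry GenAI semantic conventions: the metric/attribute standard both projects follow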

Blog: Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn't

Sun, 08 Mar 2026 00:00:00 +0000

Most “vibe coding” stories start with a greenfield project. This one doesn’t.

Apache SkyWalking is a 9-year-old observability platform with hundreds of production deployments, a complex DSL stack, and an external API surface that users have built dashboards, alerting rules, and automation scripts against. When I decided to replace the core scripting engine — purging the Groovy runtime from four DSL compilers — the constraint wasn’t “can AI write the code?” It was: “can AI write the code without breaking anything for existing users?”

The answer turned out to be yes — ~77,000 lines changed across 10 major PRs in about 5 weeks — but only because the AI was tightly guided by a human who understood the project’s architecture, its compatibility contracts, and its users. This post is about the methodology: what worked, what didn’t, and what mature open-source maintainers should know before handing their codebase to AI agents.

The Project in Brief

The task was to replace SkyWalking’s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, matching the architecture already proven by the OAL compiler. The internal tech stack was completely overhauled; the external interface had to remain identical.

Beyond the compiler rewrites, the scope included a new queue infrastructure (threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E test modernization. By conventional estimates, this was 5-8 months of senior engineer work.

For the full technical details on the compiler architecture, see the Groovy elimination discussion.

What is Agentic Vibe Coding?

“Vibe coding” — a term coined by Andrej Karpathy — describes a style of programming where you describe intent and let AI write the code. It’s powerful for prototyping, but on its own, it’s risky for production systems.

Agentic vibe coding takes this further: instead of a single AI autocomplete, you orchestrate multiple AI agents — each with different strengths — under your architectural direction, with automated tests as the safety net. In my workflow:

Claude Code (plan mode): Primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions — I steer the design, Claude handles the implementation.
Gemini: Code review, concurrency analysis, and verification reports. Gemini reviewed every major PR for thread-safety, feature parity, and edge cases.
Codex: Autonomous task execution for well-defined, bounded work items.

The key insight: AI writes the code, but the architect owns the design. Without deep domain knowledge of SkyWalking’s internals, no AI could have planned these changes. Without AI, I couldn’t have executed them in 5 weeks.

How TDD Made AI Coding Safe

The reason I could move this fast without breaking things comes down to one principle: never let AI code without a test harness.

My workflow for each major change:

Plan mode first: Describe the goal to Claude, review the plan, iterate on architecture before any code is written.
Write the test contract: Define what “correct” means — for the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines, asserting identical results across 1,290+ expressions.
Let AI implement: With the test contract in place, Claude can write thousands of lines of implementation code. If it’s wrong, the tests catch it immediately.
E2E as the final gate: Every PR must pass the full E2E test suite — Docker-based integration tests that boot the entire server with real storage backends.
AI code review: Gemini reviewed each PR for concurrency issues, thread-safety, and feature parity — catching things that unit tests alone wouldn’t find.
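The cross-version test contract in the second step can be pictured with a small sketch. The real tests are part of the SkyWalking Java codebase; the Python below only illustrates the shape of the idea, and compare_engines plus the two toy engines are hypothetical stand-ins.

```python
def compare_engines(expressions, old_engine, new_engine, sample_input):
    """Run every expression through both engines and collect any mismatches."""
    mismatches = []
    for expr in expressions:
        expected = old_engine(expr, sample_input)  # legacy behavior is the oracle
        actual = new_engine(expr, sample_input)
        if expected != actual:
            mismatches.append((expr, expected, actual))
    return mismatches


def old_engine(expr, data):
    # Toy stand-in for the legacy interpreter.
    ops = {"sum": sum, "max": max, "count": len}
    return ops[expr](data)


def new_engine(expr, data):
    # Toy stand-in for the rewritten engine, with a deliberate bug in "count".
    ops = {"sum": sum, "max": max, "count": lambda d: len(d) - 1}
    return ops[expr](data)


if __name__ == "__main__":
    print(compare_engines(["sum", "max", "count"], old_engine, new_engine, [1, 2, 3]))
```

Because the old engine acts as the oracle, nobody writes expected values by hand, which removes the option of setting them to match wrong behavior.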

This is the opposite of “hope it works” vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes instead of days.

Lessons Learned

AI is a force multiplier, not a replacement. Before any AI agent wrote a single line, a human had to define the replacement solution: what gets replaced, how it gets replaced, and — critically — where the boundaries are. Which APIs could break? The internal compilation pipeline was fair game for a complete overhaul. Which APIs must stay aligned? Every external-facing DSL syntax, every YAML configuration key, every metrics name and tag structure had to remain byte-for-byte identical — because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries required deep knowledge of the codebase and its users. AI executed the plan at extraordinary speed, but the plan itself — the scope, the invariants, the compatibility contract — had to come from a human who understood the blast radius of every change.

Plan mode is non-negotiable for architectural work. Letting AI jump straight to code on a compiler rewrite would be a disaster. Plan mode’s strength is that it collects code context — scanning imports, tracing call chains, mapping class hierarchies — and uses that context to help me fill in implementation details I’d otherwise have to look up manually. But it can’t tell you the design principles. That direction had to come from me, stated clearly upfront, so the AI’s planning stayed on the right track instead of optimizing toward a locally reasonable but architecturally wrong solution.

Know when to hit ESC. Claude has a clear tendency to dive deep into solution code writing once it starts — and it won’t stop on its own when it encounters something that conflicts with the original plan’s concept. Instead of pausing to flag the conflict, it will push forward, improvising around the obstacle in ways that silently violate the design intent. I had to learn to watch for this: when Claude’s output started drifting from the plan, I’d manually cancel the task (ESC), call it off, identify where the plan and reality diverged, adjust the plan, and restart. This interrupt-replan cycle was a regular part of the workflow, not an exception. The architect has to stay in the loop — not just at planning time, but during execution — because AI agents don’t yet know when to stop and ask.

Spec-driven testing is necessary but not sufficient — the logic workflow matters more. It’s tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. It doesn’t work for anything non-trivial. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass — the inputs went in, the expected outputs came out, and everything looked green. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving the specific test case through a hack rather than a general mechanism. A spec only checks what the code produces; it says nothing about how the code produces it. For a mature project, the “how” matters enormously — the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable long-term. That’s why I needed cross-version testing and human review of the implementation path, not just the results.

Testing at two levels kept the rewrite honest. Cross-version testing was part of my design plan from the start — I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project’s existing infrastructure safety net — Docker-based integration tests that boot the entire server with real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify the system actually works end-to-end. For infrastructure-level changes like queue replacement and thread model changes, E2E is the only gate that truly matters. Together, the two layers — designed-for-this-rewrite cross-version tests and pre-existing E2E infrastructure — caught different classes of bugs and made shipping with confidence possible.

Multiple AIs have different strengths. Claude excels at large-scale code generation with plan mode. Gemini is exceptional at logic review — it can mentally trace code branches with given input data, simulating execution without actually running the code. This is significant for reviewing AI-generated code: Gemini would walk through a generated compiler method step by step, flagging where a null check was missing or where a branch would produce wrong output for a specific edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make wrong assumptions and then write tests that pass by setting expected values to match the wrong behavior — effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values that happened to make tests green, masking logic errors that would have surfaced in production. Using all three as checks on each other was far more effective than relying on any single one.

The Mythical Man-Month still applies — and so does the Mythical Token-Month. Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity — these software engineering fundamentals do not disappear just because your workforce is artificial. Worse, when the direction is wrong — when there’s a conceptual error in the design or an unreasonable architectural choice — AI will not recognize it. It will charge down the wrong path at extraordinary speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, adding workarounds on top of workarounds — each iteration making the codebase look more “complete” while drifting further from correctness. AI vibe coding cannot break out of this spiral on its own. Only a human who understands the domain can recognize “this is fundamentally wrong, stop,” discard the work, and redirect. Speed without direction is just expensive chaos.

The Bigger Picture

The agentic vibe coding approach worked because it combined AI’s speed with human architectural judgment and automated test discipline. It’s not magic — it’s engineering, accelerated.

Brooks also gave us “No Silver Bullet,” and its core distinction matters more than ever: software complexity comes in two kinds. Essential complexity comes from the problem itself — the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate this; it must be understood, modeled, and reasoned about by someone who knows the domain. Accidental complexity comes from the tools and implementation — boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project work was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI demolished the accidental complexity (generating 77K lines of implementation, scaffolding test harnesses, rewriting repetitive patterns across dozens of config files). Confuse the two — let AI make essential decisions, or waste human time on accidental work — and you get the worst of both worlds.

Qian Xuesen (Tsien Hsue-shen)’s Engineering Cybernetics offers another lens that proved surprisingly relevant. His core framework (feedback, control, optimization) describes how to keep complex systems running toward their target. AI vibe coding at full speed is like a hypersonic missile: extraordinarily fast, but without a guidance system it just creates a bigger crater in the wrong place. The feedback loop in my workflow was the test harness: cross-version tests and E2E tests providing continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI to trust for which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, each Codex audit caught assumptions the coding agent had smuggled past the tests. Without all three (feedback to detect deviation, control to correct course, optimization to converge) the speed of AI coding would be not an advantage but a liability. The faster the missile, the more precise the guidance must be.

For more details or to share your own experience with agentic coding on production systems, feel free to reach me on GitHub.

Zh: Practicing Agentic Vibe Coding in a Large, Mature Open-Source Project: Software Engineering and Engineering Cybernetics Still Apply

Sun, 08 Mar 2026 00:00:00 +0000

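Most "vibe coding" stories start with a brand-new project and describe quickly building a prototype or a runnable project. This one doesn't.

Apache SkyWalking is a 9-year-old Apache top-level project with thousands of production cluster deployments, a complex internal DSL compilation stack, and an external API surface that carries the dashboards, alerting rules, and automation scripts users have built. When I decided to replace the core scripting engine, completely removing the Groovy runtime from four DSL compilers, the question was not "can AI write the code?" It was "perhaps only AI can complete a consistency-preserving iteration at this scale" and "can AI write complete, efficient code without breaking the system?"

The answer is yes: about 77,000 lines of code changed across 10 major PRs in about 5 weeks. But that held only because the AI always worked under the guidance of a person who deeply understood the project's architecture, compatibility requirements, and user scenarios. This post shares my hands-on experience over the past few months, and what maintainers of mature open-source projects should know before handing their codebase to AI agents.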

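Project Overview

The task was to replace SkyWalking's Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, aligned with the architecture already proven by the OAL compiler. The internal tech stack was rebuilt from the ground up, but the external interface had to remain exactly the same.

Beyond the compiler rewrites, the scope included a new thread management strategy (thread count dropped from 36 to 15), virtual thread support for JDK 25+, and modernization of the end-to-end tests. By conventional estimates, this was 5-8 months of work for a senior engineer (taking myself as the example).

For the full technical details of the compiler architecture, see the Groovy elimination discussion.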

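What Is Agentic Vibe Coding?

"Vibe coding", a concept proposed by Andrej Karpathy, describes a style of programming in which you express intent and let AI write the code. AI programming of this kind has mostly been used for prototyping, where it is powerful and fast, but used on its own for production systems it is risky.

Agentic vibe coding goes a step further: instead of a single AI autocomplete, you orchestrate multiple AI agents, each with its own strengths, under your architectural direction, with automated tests as the safety net. My workflow looks like this:

Claude Code (plan mode): the primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions: I steer the design, and Claude handles the implementation.
Gemini: code review, concurrency analysis, and verification reports. Every major PR was reviewed by Gemini for thread safety, feature parity, and edge cases.
Codex: autonomous execution of well-defined, clearly bounded work items.

The core insight: AI writes the code, but the architect owns the design. Without deep domain knowledge of SkyWalking's internals, no AI could have planned these changes. Without AI, I could not have executed them in 5 weeks.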

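How TDD Made AI Coding Safe

I could move at this speed without wrecking things because of one principle: never let AI write code without the protection of tests.

The workflow for each major change:

Enter plan mode first: describe the goal to Claude, review the plan, and iterate at the architecture level before any code is written.
Write the test contract: define what "correct" means. For the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines and assert identical results across 1,290+ expressions.
Let AI implement: with the test contract in place, Claude can write thousands of lines of implementation code. If the code is wrong, the tests catch it immediately.
End-to-end tests as the final gate: every PR must pass the full end-to-end suite, which consists of Docker-based integration tests that boot the entire server and connect it to real storage backends.
AI code review: Gemini reviews each PR for concurrency issues, thread safety, and feature parity, catching problems that unit tests cannot find.

This is the exact opposite of "write it and pray it runs" vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes rather than days.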

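Lessons Learned

AI is a force multiplier, not a replacement. Before any AI agent wrote its first line of code, a human had to define the replacement plan: what to replace, how to replace it, and, crucially, where the boundaries lie. Which APIs may break? The internal compilation pipeline could be rebuilt completely. Which APIs must stay aligned? Every external DSL syntax, every YAML configuration key, every metric name and tag structure had to stay byte-for-byte identical, because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries requires deep knowledge of the codebase and its users. AI executed the plan at astonishing speed, but the plan itself (the scope, the invariants, the compatibility contract) had to come from someone who understood the blast radius of every change.

For architecture-level work, plan mode is non-negotiable. Letting AI jump straight into code on a compiler rewrite would be a disaster. The value of plan mode is that it collects code context (scanning imports, tracing call chains, mapping class hierarchies) and uses that context to fill in implementation details I would otherwise have to look up by hand. But it cannot tell you the design principles. I had to state the direction clearly upfront, so that the AI's planning stayed on the right track instead of optimizing toward a solution that is locally reasonable but architecturally wrong.

Know when to press ESC. Claude has a clear tendency: once it starts writing solution code, it dives in headfirst, and when it runs into something that conflicts with the concept of the original plan, it does not stop on its own. Rather than pausing to flag the conflict, it pushes on, improvising around the obstacle and quietly violating the design intent. I had to learn to watch for this signal: when Claude's output started to drift from the plan, I manually cancelled the task (ESC), found where the plan and reality diverged, adjusted the plan, and started again. This interrupt-and-replan cycle was the norm in the workflow, not the exception. The architect must stay in the loop during execution as well as planning, because AI agents do not yet know when to stop and ask.

Spec-driven work applies more to testing than to development. It is a necessary but not sufficient condition, and the logic workflow matters more. It is tempting to think that if the input/output spec is defined clearly enough, AI can fill in the implementation and the tests will catch any mistake. I tried it. For any complex production scenario, it does not work. During the expression compiler rewrite, Claude sometimes changed code in unreasonable ways just to make the spec tests pass: the inputs went in, the expected outputs came out, and everything looked fine. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relies on, impossible to extend, or solving a specific test case through a hack (reflection, static comparison of field names, and other unacceptable engineering practices) rather than a general mechanism. A spec only checks what the code produces; it knows nothing about how the code produces it. For a mature project, the "how" matters enormously: the solution must be consistent with the existing architecture, widely adoptable by contributors, and maintainable and extensible over the long term. That is why I needed cross-version testing plus human review of the implementation path, not just a review of the results.

Two levels of testing gave the rewritten code stronger verification. Cross-version testing was part of my design from the start. I architected a dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence that no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just review by eye. On top of that, the end-to-end tests served as the project's existing infrastructure safety net: Docker/K8s-based integration tests that boot the entire server and connect it to real storage backends. Unit tests and cross-version tests verify logic in isolation; end-to-end tests verify that the system really works end to end. For infrastructure-level changes such as the queue replacement and the thread model change, the end-to-end tests are the only gate that truly matters. The two levels, cross-version tests designed specifically for this rewrite and the pre-existing end-to-end infrastructure, caught different classes of bugs and made it possible to release with confidence.

Multiple AIs have different strengths. Claude excels at large-scale code generation with plan mode. Gemini is excellent at logic review: given input data, it can trace code branches in its head, simulating execution without running the code. This matters a great deal when reviewing AI-generated code: Gemini would walk through a compiler-generated method step by step, flagging a missing null check or a branch that would produce wrong output in a specific edge case. Codex was most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make a wrong assumption and then write tests whose expected values match the wrong behavior, effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values to turn tests green, masking logic errors that would otherwise have surfaced in production. Having the three check each other was far more effective than relying on any one of them.

The Mythical Man-Month still applies, and so does a token-based Mythical AI-Month. Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw in more tokens, more agents, or more parallel sessions and expect the problem to converge faster. Communication cost, coordination overhead, requirements analysis, and conceptual integrity are fundamentals of software engineering that do not vanish because your workforce is artificial. Worse, when the direction is wrong, when the design contains a conceptual error or an unreasonable architectural choice, AI will not notice. It charges in the wrong direction at astonishing speed, burning tokens furiously while sinking into a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, stacking workarounds on workarounds. Each iteration makes the codebase look more "complete" while it actually drifts further from correct. AI vibe coding cannot break out of this spiral by itself. Only a person who understands the domain can recognize "this is fundamentally wrong, stop," discard the work, and redirect. Speed without direction is just expensive chaos.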

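The Bigger Picture

Agentic vibe coding worked because it combined the speed of AI with human architectural judgment and the discipline of automated testing. This is not magic; it is engineering, accelerated.

Brooks also gave us "No Silver Bullet," and its core distinction matters more today than ever: software complexity comes in two kinds. Essential complexity comes from the problem itself: domain semantics, behavioral contracts, concurrency invariants. No tool can eliminate it; it must be understood, modeled, and reasoned about by someone who knows the domain. Accidental complexity comes from tools and implementation: boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. The project succeeded because it was clear which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI eliminated the accidental complexity (generating 77,000 lines of implementation code, scaffolding test frameworks, rewriting repetitive patterns across dozens of configuration files). Confuse the two, by letting AI make essential decisions or by wasting human time on accidental work, and you get the worst of both worlds.

Qian Xuesen's Engineering Cybernetics offers another lens that turned out to be surprisingly relevant in practice. His core framework (feedback, control, optimization) describes how to keep a complex system running toward its target. AI vibe coding at full speed is like a hypersonic missile: astonishingly fast, but without a guidance system it only blasts a bigger crater in the wrong place. The feedback loop in my workflow was the test system: cross-version tests and end-to-end tests continuously signalled whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, pressing ESC when the direction drifted, choosing which AI handles which task. Optimization was iterative: each interrupt-and-replan cycle refined the approach, each Gemini review tightened the logic, and each Codex audit caught assumptions the coding agent had slipped past the tests. Take away any one of the three (feedback to detect deviation, control to correct course, optimization to converge) and the speed of AI coding becomes a liability rather than an advantage. The faster the missile, the more precise the guidance must be.

AI vibe coding and its successive iterations are quickly reaching every developer and are being widely adopted in both open-source and commercial software. We are all witnessing this new development model, and the merging of AI vibe coding with software engineering theory. If you would like to discuss more AI + OSS topics with me, feel free to reach me on GitHub.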

Zh: Monitoring Flink with SkyWalking

Fri, 25 Apr 2025 00:00:00 +0000

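Background

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and computes at in-memory speed and at any scale. Starting with SkyWalking OAP 10.3, a monitoring dashboard for Flink metrics is available. This article shows how to monitor Flink with SkyWalking.

Deployment

Preparation

Startup sequence

Start the jobmanager and taskmanager
Start SkyWalking OAP and UI
Start the opentelemetry-collector
Start the job

DataFlow:

Configuration

docker-compose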

version: "3"

services:
  oap:
    extends:
      file: ../../script/docker-compose/base-compose.yml
      service: oap
    ports:
      - "12800:12800"
    networks:
      - e2e

  banyandb:
    extends:
      file: ../../script/docker-compose/base-compose.yml
      service: banyandb
    ports:
      - 17912

  jobmanager:
    image: flink:2.0-preview1
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
        metrics.reporter.prom.port: 9260
    ports:
      - "8081:8081"
      - "9260:9260"
    command: jobmanager
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - e2e

  taskmanager:
    image: flink:2.0-preview1
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
        metrics.reporter.prom.port: 9261
    depends_on:
      jobmanager:
        condition: service_healthy
    ports:
      - "9261:9261"
    command: taskmanager
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9261/metrics"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - e2e

  executeJob:
    image: flink:2.0-preview1
    depends_on:
      taskmanager:
        condition: service_healthy
    command: >
      bash -c "
      ./bin/flink run -m jobmanager:8081 examples/streaming/WindowJoin.jar"
    networks:
      - e2e

  otel-collector:
    image: otel/opentelemetry-collector:${OTEL_COLLECTOR_VERSION}
    networks:
      - e2e
    command: [ "--config=/etc/otel-collector-config.yaml" ]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    expose:
      - 55678
    depends_on:
      oap:
        condition: service_healthy

networks:
  e2e:

If you expose the metrics data in pushGateway mode instead, please refer to the corresponding documentation.

OpenTelemetry-collector

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "flink-jobManager-monitoring"
          scrape_interval: 30s
          static_configs:
            - targets: ['jobmanager:9260']
              labels:
                cluster: flink-cluster
          relabel_configs:
            - source_labels: [ __address__ ]
              target_label: jobManager_node
              replacement: $$1
          metric_relabel_configs:
            - source_labels: [ job_name ]
              action: replace
              target_label: flink_job_name
              replacement: $$1
            - source_labels: [ ]
              target_label: job_name
              replacement: flink-jobManager-monitoring

        - job_name: "flink-taskManager-monitoring"
          scrape_interval: 30s
          static_configs:
            - targets: [ "taskmanager:9261" ]
              labels:
                cluster: flink-cluster
          relabel_configs:
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: taskManager_node
              replacement: $$1
          metric_relabel_configs:
            - source_labels: [ job_name ]
              action: replace
              target_label: flink_job_name
              replacement: $$1
            - source_labels: [ ]
              target_label: job_name
              replacement: flink-taskManager-monitoring

exporters:
  otlp:
    endpoint: oap:11800
    tls:
      insecure: true

processors:
  batch:
service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
      exporters:
        - otlp

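Notes:
Do not change the value of job_name; otherwise SkyWalking will not process this data.
oap is the address of the SkyWalking OAP; replace it with your own.
The raw Flink data already contains a job_name label, and SkyWalking uses the job_name label to decide which OTel job's data to process. To avoid the conflict, metric_relabel_configs copies the original job_name label to flink_job_name before job_name is overwritten.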
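The net effect of the two metric_relabel_configs rules on a single metric's labels can be sketched as follows. This models only the outcome, not the collector's relabeling engine, and the function name relabel is an illustrative assumption.

```python
def relabel(labels: dict, scrape_job: str) -> dict:
    """Model the two relabel rules applied to one metric's label set."""
    out = dict(labels)
    if "job_name" in out:
        # Rule 1: preserve Flink's own job_name under flink_job_name.
        out["flink_job_name"] = out["job_name"]
    # Rule 2: overwrite job_name with the fixed value SkyWalking routes on.
    out["job_name"] = scrape_job
    return out


if __name__ == "__main__":
    raw = {"job_name": "WindowJoin", "cluster": "flink-cluster"}
    print(relabel(raw, "flink-jobManager-monitoring"))
```

Metrics that carry no job_name label of their own simply gain the fixed job_name and no flink_job_name.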

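Monitoring Metrics

The metrics cover three dimensions: cluster, taskManager, and job.

Cluster Metrics

Cluster Metrics present cluster-level statistics and the jobManager's JVM metrics, for example:

Running Jobs: number of running jobs
TaskManagers: number of taskManagers
Task Managers Slots Total: total number of taskManager slots
Task Managers Slots Available: number of available taskManager slots
JVM CPU Load: CPU load of the jobManager's JVM

TaskManager Metrics

TaskManager Metrics present statistics from the perspective of a taskManager node, for example:

JVM Memory Heap Used: JVM heap memory used on the taskManager node.
JVM Memory Heap Available: JVM heap memory available on the taskManager node.
NumRecordsIn: number of records the taskManager receives per minute.
NumBytesInPerSecond: number of bytes the taskManager receives per second.
IsBackPressured: whether the taskManager node is back-pressured.
IdleTimeMsPerSecond: idle time of the taskManager node per second.

Job Metrics

Job Metrics present statistics from the perspective of a running job, for example:

Job RunningTime: how long the job has been running.
Job Restart Number: number of times the job has restarted.
Checkpoints Failed: number of failed checkpoints.
NumBytesInPerSecond: number of bytes the job receives per second.

The meaning of each metric is explained in the tip on its chart.

References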

Blog: How to run Apache SkyWalking on AWS EKS and RDS/Aurora

Tue, 13 Dec 2022 00:00:00 +0000

Introduction

Apache SkyWalking is an open source APM tool for monitoring and troubleshooting distributed systems, especially designed for microservices, cloud native and container-based (Docker, Kubernetes, Mesos) architectures. It provides distributed tracing, service mesh observability, metric aggregation and visualization, and alarm.

In this article, I will show how to quickly set up Apache SkyWalking on AWS EKS and RDS/Aurora, along with a couple of sample services and monitoring services for observing SkyWalking itself.

Prerequisites

AWS account
AWS CLI
Terraform
kubectl

We could use the AWS web console or CLI to create all the resources needed in this tutorial, but that is tedious and hard to debug when something goes wrong. So in this article I will use Terraform to create all AWS resources and to deploy SkyWalking, the sample services, and the load generator services (Locust).

Architecture

The demo architecture is as follows:

graph LR
    subgraph AWS
        subgraph EKS
          subgraph istio-system namespace
              direction TB
              OAP[[SkyWalking OAP]]
              UI[[SkyWalking UI]]
            Istio[[istiod]]
          end
          subgraph sample namespace
              Service0[[Service0]]
              Service1[[Service1]]
              ServiceN[[Service ...]]
          end
          subgraph locust namespace
              LocustMaster[[Locust Master]]
              LocustWorkers0[[Locust Worker 0]]
              LocustWorkers1[[Locust Worker 1]]
              LocustWorkersN[[Locust Worker ...]]
          end
        end
        RDS[[RDS/Aurora]]
    end
    OAP --> RDS
    Service0 -. telemetry data -.-> OAP
    Service1 -. telemetry data -.-> OAP
    ServiceN -. telemetry data -.-> OAP
    UI --query--> OAP
    LocustWorkers0 -- traffic --> Service0
    LocustWorkers1 -- traffic --> Service0
    LocustWorkersN -- traffic --> Service0
    Service0 --> Service1 --> ServiceN
    LocustMaster --> LocustWorkers0
    LocustMaster --> LocustWorkers1
    LocustMaster --> LocustWorkersN
    User --> LocustMaster

As shown in the architecture diagram, we need to create the following AWS resources:

EKS cluster
RDS instance or Aurora cluster

It sounds simple, but a lot goes on behind the scenes, such as the VPC, subnets, and security groups. You have to configure them correctly so that the EKS cluster can connect to the RDS instance or Aurora cluster; otherwise SkyWalking won’t work. Luckily, Terraform can create and destroy all these resources automatically.

I have created a Terraform module that creates all the AWS resources needed in this tutorial. You can find it in the GitHub repository.

Create AWS resources

First, we need to clone the GitHub repository and cd into the folder:

git clone https://github.com/kezhenxu94/oap-load-test.git

Then, we need to create a file named terraform.tfvars to specify the AWS region and other variables:

cat > terraform.tfvars <<EOF
aws_access_key = ""
aws_secret_key = ""
cluster_name   = "skywalking-on-aws"
region         = "ap-east-1"
db_type        = "rds-postgresql"
EOF

If you have already configured the AWS CLI, you can skip the aws_access_key and aws_secret_key variables. To install SkyWalking with RDS PostgreSQL, set db_type to rds-postgresql; to install SkyWalking with Aurora PostgreSQL, set db_type to aurora-postgresql.
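If you generate terraform.tfvars from a script, a small guard on db_type can catch typos before terraform apply runs. The sketch below is not part of the repository. It accepts only the two db_type values this article names (the module may support others), and render_tfvars is a hypothetical helper.

```python
# Only the two values named in this article; the Terraform module may accept more.
VALID_DB_TYPES = {"rds-postgresql", "aurora-postgresql"}


def render_tfvars(cluster_name: str, region: str, db_type: str) -> str:
    """Render the non-credential part of terraform.tfvars as text."""
    if db_type not in VALID_DB_TYPES:
        raise ValueError(f"unsupported db_type: {db_type!r}")
    values = {"cluster_name": cluster_name, "region": region, "db_type": db_type}
    return "\n".join(f'{key} = "{value}"' for key, value in values.items()) + "\n"


if __name__ == "__main__":
    print(render_tfvars("skywalking-on-aws", "ap-east-1", "aurora-postgresql"), end="")
```

The credentials are deliberately left out, on the assumption that the AWS CLI is already configured.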

There are many other variables you can configure, such as tags, the sample service count, and replicas. You can find them in variables.tf.

Then, we can run the following commands to initialize the Terraform module and download the required providers, then create all AWS resources:

terraform init
terraform apply -var-file=terraform.tfvars

Type yes to confirm the creation of all AWS resources, or add the -auto-approve flag to the terraform apply to skip the confirmation:

terraform apply -var-file=terraform.tfvars -auto-approve

Now wait for the creation of all AWS resources to complete, which may take a few minutes. You can follow the progress in the AWS web console and check the deployment progress of the services inside the EKS cluster.

Generate traffic

Besides creating necessary AWS resources, the Terraform module also deploys SkyWalking, sample services, and Locust load generator services to the EKS cluster.

You can access the Locust web UI to generate traffic to the sample services:

open http://$(kubectl get svc -n locust -l app=locust-master -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'):8089

The command opens the Locust web UI in the browser, where you can configure the number of users and the hatch rate to generate traffic.

Observe SkyWalking

You can access the SkyWalking web UI to observe the sample services.

First, forward the SkyWalking UI port to your local machine:

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=skywalking,component=ui -o name) 8080:8080

And then open the browser to http://localhost:8080 to access the SkyWalking web UI.

Observe RDS/Aurora

You can also open the RDS/Aurora web console to observe the performance of the RDS instance or Aurora cluster.

Test Results

Test 1: SkyWalking with EKS and RDS PostgreSQL

Service Traffic

RDS Performance

SkyWalking Performance

Test 2: SkyWalking with EKS and Aurora PostgreSQL

Service Traffic

RDS Performance

SkyWalking Performance

Clean up

When you are done with the demo, you can run the following command to destroy all AWS resources:

terraform destroy -var-file=terraform.tfvars -auto-approve

Zh: How to Run Apache SkyWalking on AWS EKS and RDS/Aurora

Tue, 13 Dec 2022 00:00:00 +0000

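Introduction

Apache SkyWalking is an open source APM tool for monitoring and troubleshooting distributed systems, designed especially for microservices, cloud native, and container-based (Docker, Kubernetes, Mesos) architectures. It provides distributed tracing, service mesh observability, metric aggregation and visualization, and alerting.

In this article, I will show how to quickly set up Apache SkyWalking on AWS EKS and RDS/Aurora, together with a few sample services and monitoring services for observing SkyWalking itself.

Prerequisites

AWS account
AWS CLI
Terraform
kubectl

We could create every resource in this tutorial with the AWS web console or CLI, but that is tedious and hard to debug when something goes wrong. So in this article I will use Terraform to create all AWS resources and to deploy SkyWalking, the sample services, and the load generator services (Locust).

Architecture

The demo architecture is as follows: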

graph LR
    subgraph AWS
        subgraph EKS
          subgraph istio-system namespace
              direction TB
              OAP[[SkyWalking OAP]]
              UI[[SkyWalking UI]]
            Istio[[istiod]]
          end
          subgraph sample namespace
              Service0[[Service0]]
              Service1[[Service1]]
              ServiceN[[Service ...]]
          end
          subgraph locust namespace
              LocustMaster[[Locust Master]]
              LocustWorkers0[[Locust Worker 0]]
              LocustWorkers1[[Locust Worker 1]]
              LocustWorkersN[[Locust Worker ...]]
          end
        end
        RDS[[RDS/Aurora]]
    end
    OAP --> RDS
    Service0 -. telemetry data -.-> OAP
    Service1 -. telemetry data -.-> OAP
    ServiceN -. telemetry data -.-> OAP
    UI --query--> OAP
    LocustWorkers0 -- traffic --> Service0
    LocustWorkers1 -- traffic --> Service0
    LocustWorkersN -- traffic --> Service0
    Service0 --> Service1 --> ServiceN
    LocustMaster --> LocustWorkers0
    LocustMaster --> LocustWorkers1
    LocustMaster --> LocustWorkersN
    User --> LocustMaster

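As the architecture diagram shows, we need to create the following AWS resources:

EKS cluster
RDS instance or Aurora cluster

It sounds simple, but there is a lot behind it, such as the VPC, subnets, and security groups. You have to configure them correctly so that the EKS cluster can connect to the RDS instance or Aurora cluster; otherwise SkyWalking will not work. Fortunately, Terraform can create and destroy all of these resources for us automatically.

I created a Terraform module that creates all the AWS resources this tutorial needs; you can find it in the GitHub repository.

Create the AWS resources

First, clone the GitHub repository and cd into its folder: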

git clone https://github.com/kezhenxu94/oap-load-test.git

Then, create a terraform.tfvars file to specify the AWS region and other variables:

cat > terraform.tfvars <<EOF
aws_access_key = ""
aws_secret_key = ""
cluster_name   = "skywalking-on-aws"
region         = "ap-east-1"
db_type        = "rds-postgresql"
EOF

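If you have already configured the AWS CLI, you can skip the aws_access_key and aws_secret_key variables. To install SkyWalking with RDS PostgreSQL, set db_type to rds-postgresql; to install SkyWalking with Aurora PostgreSQL, set db_type to aurora-postgresql.

There are many other variables you can configure, such as tags, the number of sample services, and replicas; you can find them in variables.tf.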
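A small helper can catch a mistyped db_type before Terraform does. The sketch below is only an illustration of the variables described above: the function name `render_tfvars` is hypothetical, and the only facts taken from the post are the variable names and the two accepted db_type values.

```python
import json

# The two db_type values named in this post.
ALLOWED_DB_TYPES = ("rds-postgresql", "aurora-postgresql")


def render_tfvars(region: str, cluster_name: str, db_type: str,
                  aws_access_key: str = "", aws_secret_key: str = "") -> str:
    """Render the content of terraform.tfvars as a string."""
    if db_type not in ALLOWED_DB_TYPES:
        raise ValueError(f"db_type must be one of {ALLOWED_DB_TYPES}, got {db_type!r}")
    values = {"cluster_name": cluster_name, "region": region, "db_type": db_type}
    # The credentials can be skipped when the AWS CLI is already configured.
    if aws_access_key and aws_secret_key:
        values["aws_access_key"] = aws_access_key
        values["aws_secret_key"] = aws_secret_key
    # json.dumps yields a double-quoted string, which suits simple values.
    return "".join(f"{key} = {json.dumps(value)}\n" for key, value in values.items())


print(render_tfvars("ap-east-1", "skywalking-on-aws", "aurora-postgresql"))
```

Writing the returned string to terraform.tfvars gives the same file as the heredoc shown earlier, minus the empty credential entries.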

Then run the following commands to initialize the Terraform module, download the required providers, and create all the AWS resources:

terraform init
terraform apply -var-file=terraform.tfvars

Type yes to confirm the creation of all AWS resources, or add the -auto-approve flag to terraform apply to skip the confirmation:

terraform apply -var-file=terraform.tfvars -auto-approve

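Now all you need to do is wait for the creation of all AWS resources to complete, which may take a few minutes. You can check the creation progress in the AWS web console, and you can also check the deployment progress of the services inside the EKS cluster.

Generate traffic

Besides creating the necessary AWS resources, the Terraform module also deploys SkyWalking, the sample services, and the Locust load generator services to the EKS cluster.

You can access the Locust web UI to generate traffic to the sample services: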

open http://$(kubectl get svc -n locust -l app=locust-master -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'):8089

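The command opens the Locust web UI in your browser, where you can configure the number of users and the hatch rate to generate traffic.

Observe SkyWalking

You can access the SkyWalking web UI to observe the sample services.

First, forward the SkyWalking UI port to your local machine: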

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=skywalking -l component=ui -o name) 8080:8080

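Then open http://localhost:8080 in your browser to access the SkyWalking web UI.

Observe RDS/Aurora

You can also open the RDS/Aurora web console to observe the performance of the RDS instance or Aurora cluster.

Test Results

Test 1: SkyWalking with EKS and RDS PostgreSQL

Service Traffic

RDS Performance

SkyWalking Performance

Test 2: SkyWalking with EKS and Aurora PostgreSQL

Service Traffic

RDS Performance

SkyWalking Performance

Clean up

When you are done with the demo, you can run the following command to destroy all AWS resources: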

terraform destroy -var-file=terraform.tfvars -auto-approve

Blog: [Video] Distributed tracing demo using Apache SkyWalking and Kong API Gateway

Thu, 11 Aug 2022 00:00:00 +0000

Observability is essential when working with distributed systems. It is built on the three pillars of metrics, logging, and tracing, and having the right tools in place to quickly identify and determine the root cause of a production issue is imperative. In this Kongcast interview, we explore the benefits of having observability and demo the use of Apache SkyWalking. We walk through the capabilities that SkyWalking offers out of the box and debug a common HTTP 500 error using the tool.

Andrew Kew is interviewed by Viktor Gamov, a developer advocate at Kong Inc.

Andrew is a highly passionate technologist with over 16 years of valuable experience in building server-side and cloud applications. Having spent the majority of his time in the Financial Services domain, his meritocratic rise to CTO of an Algorithmic Trading firm allowed him to not only steer the business from a technology standpoint, but build robust and scalable trading algorithms. His mantra is “right first time”, thus ensuring the projects or clients he is involved in are left in a better place than they were before he arrived.

He is the founder of a boutique software consultancy in the United Kingdom, QuadCorps Ltd, working in the API and Integration Ecosystem space and is currently on a residency programme at Kong Inc as a senior field engineer and technical account manager working across many of their enterprise strategic accounts.

Blog: How to use the java agent injector?

Tue, 19 Apr 2022 00:00:00 +0000

1. Introduction

1.1 What’s SWCK?

SWCK is a platform for SkyWalking users that provisions, upgrades, and maintains SkyWalking-related components and makes them work natively on Kubernetes.

In fact, SWCK is an operator built with kubebuilder that provides users with Custom Resources (CRs) and the controllers that manage them. All the CustomResourceDefinitions (CRDs) are as follows:

1.2 What’s the java agent injector?

For a java application, users need to inject the java agent into the application to collect metadata and send it to the SkyWalking backend. To make using the java agent feel more native, we propose the java agent injector, which injects the java agent sidecar into a pod. The java agent injector is actually a Kubernetes mutating webhook controller. The controller intercepts pod events and applies mutations to the pod if the relevant annotations exist in the request.

2. Features

Transparent. User applications generally run in normal containers while the java agent runs in an init container, and both belong to the same pod. Each container in the pod mounts a shared memory volume that provides a storage path for the java agent. When the pod starts, the init container runs before the application container and stores the java agent files in the shared memory volume. When the application container starts, the agent is loaded into the application through a JVM parameter set by the injector. This way, users can inject the java agent without rebuilding the container image to include it.
Configurability. The injector provides two ways to configure the java agent: global configuration and custom configuration. The default global configuration is stored in a configmap; you can update it to make it your own global configuration, for example by changing backend_service. In addition, you can set custom configuration for individual applications via annotations, such as service_name. For more information, please see java-agent-injector.
Observability. For each injected java agent, we provide a CustomResourceDefinition called JavaAgent to observe the final agent configuration. Please refer to javaagent for more details.
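The Transparent feature can be pictured as a mutation applied to the pod spec. The following is a minimal sketch, not SWCK's actual implementation: the volume name, mount path, agent image, copy command, and the use of the JAVA_TOOL_OPTIONS environment variable are all assumptions made for illustration. Only the init container name inject-skywalking-agent is taken from the pod events shown later in this post.

```python
import copy

AGENT_VOLUME = "sky-agent"                             # hypothetical volume name
AGENT_PATH = "/sky/agent"                              # hypothetical mount path
AGENT_IMAGE = "example/skywalking-java-agent:latest"   # hypothetical image


def inject_agent(pod_spec: dict) -> dict:
    """Return a mutated copy of a pod spec with the agent wired in."""
    spec = copy.deepcopy(pod_spec)
    mount = {"name": AGENT_VOLUME, "mountPath": AGENT_PATH}

    # 1. A volume shared by the init container and the application containers.
    spec.setdefault("volumes", []).append({"name": AGENT_VOLUME, "emptyDir": {}})

    # 2. An init container that copies the agent files into the shared volume.
    spec.setdefault("initContainers", []).append({
        "name": "inject-skywalking-agent",
        "image": AGENT_IMAGE,
        "command": ["sh", "-c", f"cp -r /skywalking/agent/. {AGENT_PATH}"],
        "volumeMounts": [dict(mount)],
    })

    # 3. Each application container mounts the volume and gets the JVM flag.
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(dict(mount))
        container.setdefault("env", []).append({
            "name": "JAVA_TOOL_OPTIONS",
            "value": f"-javaagent:{AGENT_PATH}/skywalking-agent.jar",
        })
    return spec


pod = {"containers": [{"name": "springboot", "image": "app:v0.0.1"}]}
mutated = inject_agent(pod)
print([c["name"] for c in mutated["initContainers"]])
```

Because the mutation only adds a volume, an init container, and an environment variable, the application image itself stays untouched, which is the point of the feature.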

3. Install SWCK

In the next steps, we will show how to build a stand-alone Kubernetes cluster and deploy the 0.6.1 version of SWCK on the platform.

3.1 Tool Preparation

Firstly, you need to install some tools as follows:

kind, which is used to create a stand-alone Kubernetes cluster.
kubectl, which is used to communicate with the Kubernetes cluster.

3.2 Install stand-alone Kubernetes cluster

After installing kind, you can use the following command to create a stand-alone Kubernetes cluster.

Notice! If your terminal is configured with a proxy, you need to disable it before creating the cluster to avoid errors.

$ kind create cluster --image=kindest/node:v1.19.1

After creating a cluster, you can get the pods as below.

$ kubectl get pod -A                          
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
kube-system          coredns-f9fd979d6-57xpc                      1/1     Running   0          7m16s
kube-system          coredns-f9fd979d6-8zj8h                      1/1     Running   0          7m16s
kube-system          etcd-kind-control-plane                      1/1     Running   0          7m23s
kube-system          kindnet-gc9gt                                1/1     Running   0          7m16s
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          7m23s
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          7m23s
kube-system          kube-proxy-6zbtb                             1/1     Running   0          7m16s
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          7m23s
local-path-storage   local-path-provisioner-78776bfc44-jwwcs      1/1     Running   0          7m16s

3.3 Install the certificate manager (cert-manager)

The certificates of SWCK are distributed and verified by the certificate manager. You need to install the cert-manager through the following command.

$ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.3.1/cert-manager.yaml

Verify whether cert-manager is installed successfully.

$ kubectl get pod -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7dd5854bb4-slcmd              1/1     Running   0          73s
cert-manager-cainjector-64c949654c-tfmt2   1/1     Running   0          73s
cert-manager-webhook-6bdffc7c9d-h8cfv      1/1     Running   0          73s

3.4 Install SWCK

The java agent injector is a component of the operator, so please follow the next steps to install the operator first.

Get the deployment yaml file of SWCK and deploy it.

$ curl -Ls https://archive.apache.org/dist/skywalking/swck/0.6.1/skywalking-swck-0.6.1-bin.tgz | tar -zxf - -O ./config/operator-bundle.yaml | kubectl apply -f -

Check SWCK as below.

$ kubectl get pod -n skywalking-swck-system
NAME                                                  READY   STATUS    RESTARTS   AGE
skywalking-swck-controller-manager-7f64f996fc-qh8s9   2/2     Running   0          94s

3.5 Install SkyWalking components: OAPServer and UI

Deploy the OAPServer and UI in the default namespace.

$ kubectl apply -f https://raw.githubusercontent.com/apache/skywalking-swck/master/operator/config/samples/default.yaml

Check the OAPServer.

$ kubectl get oapserver
NAME      INSTANCES   RUNNING   ADDRESS
default   1           1         default-oap.default

Check the UI.

$ kubectl get ui
NAME      INSTANCES   RUNNING   INTERNALADDRESS      EXTERNALIPS   PORTS
default   1           1         default-ui.default                 [80]

4. Deploy a demo application

In step 3, we installed SWCK and the related SkyWalking components. Next, we will show how to use the java agent injector in SWCK with two java application examples, configured in two ways: global configuration and custom configuration.

4.1 Set the global configuration

After SWCK is installed, the default configuration is stored in a configmap in the system namespace. We can get it as follows.

$  kubectl get configmap skywalking-swck-java-agent-configmap -n skywalking-swck-system -oyaml
apiVersion: v1
data:
  agent.config: |-
    # The service name in UI
    agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}

    # Backend service addresses.
    collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}

    # Please refer to https://skywalking.apache.org/docs/skywalking-java/latest/en/setup/service-agent/java-agent/configurations/#table-of-agent-configuration-properties to get more details.

In the cluster created by kind, the default backend_service may not be correct. We need to replace the default 127.0.0.1 with the real OAPServer address, default-oap.default, so we edit the configmap as follows.

$ kubectl edit configmap skywalking-swck-java-agent-configmap -n skywalking-swck-system
configmap/skywalking-swck-java-agent-configmap edited

$ kubectl get configmap skywalking-swck-java-agent-configmap -n skywalking-swck-system -oyaml
apiVersion: v1
data:
  agent.config: |-
    # The service name in UI
    agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}

    # Backend service addresses.
    collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:default-oap.default:11800}

    # Please refer to https://skywalking.apache.org/docs/skywalking-java/latest/en/setup/service-agent/java-agent/configurations/#table-of-agent-configuration-properties to get more details.
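The `${NAME:default}` placeholders in agent.config read as "use the environment variable NAME if it is set, otherwise fall back to the default". The sketch below illustrates that resolution rule in Python; it is not the agent's own parser, and the function name `resolve` is hypothetical.

```python
import re

# ${NAME:default} where NAME is an environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+):([^}]*)\}")


def resolve(value: str, env: dict) -> str:
    """Replace each ${NAME:default} with env[NAME] if present, else the default."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


line = "${SW_AGENT_COLLECTOR_BACKEND_SERVICES:default-oap.default:11800}"
print(resolve(line, {}))
print(resolve(line, {"SW_AGENT_COLLECTOR_BACKEND_SERVICES": "oap.example:11800"}))
```

Note that the default itself may contain colons, as in default-oap.default:11800, so only the first colon separates the name from the default.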

4.2 Set the custom configuration

In some cases, we need to use SkyWalking to monitor different java applications, so the agent configuration may differ between applications, for example the application name and the plugins the application needs. Next, we take two simple java applications, built on spring boot and spring cloud gateway, as examples for a detailed description. You can use the source code to build the images.

# build the springboot and springcloudgateway image 
$ git clone https://github.com/dashanji/swck-spring-cloud-k8s-demo 
$ cd swck-spring-cloud-k8s-demo && make

# check the image
$ docker images
REPOSITORY     TAG       IMAGE ID       CREATED          SIZE
gateway        v0.0.1    51d16251c1d5   48 minutes ago   723MB
app            v0.0.1    62f4dbcde2ed   48 minutes ago   561MB

# load the image into the cluster
$ kind load docker-image app:v0.0.1 && kind load docker-image gateway:v0.0.1

4.3 Deploy spring boot application

Create the springboot-system namespace.

$ kubectl create namespace springboot-system

Label the springboot-system namespace to enable the java agent injector.

$ kubectl label namespace springboot-system swck-injection=enabled

Create the deployment file springboot.yaml for the spring boot application, which uses annotations to override the default agent configuration, such as service_name.

Notice! Before using the annotation to override the agent configuration, you need to add strategy.skywalking.apache.org/agent.Overlay: "true" to make the override take effect.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-springboot
  namespace: springboot-system
spec:
  selector:
    matchLabels:
      app: demo-springboot
  template:
    metadata:
      labels:
        swck-java-agent-injected: "true"  # enable the java agent injector
        app: demo-springboot
      annotations:
        strategy.skywalking.apache.org/agent.Overlay: "true"  # enable the agent overlay
        agent.skywalking.apache.org/agent.service_name: "backend-service"
    spec:
      containers:
      - name: springboot
        imagePullPolicy: IfNotPresent
        image: app:v0.0.1
        command: ["java"]
        args: ["-jar","/app.jar"]
---
apiVersion: v1
kind: Service
metadata:
  name: demo
  namespace: springboot-system
spec:
  type: ClusterIP
  ports:
  - name: 8085-tcp
    port: 8085
    protocol: TCP
    targetPort: 8085
  selector:
    app: demo-springboot
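The labels and annotations in this manifest drive two decisions: whether the pod is injected at all, and which configuration the agent ends up with. The sketch below is a simplified model of the behaviour described in this post, not the injector's code; the function names are hypothetical, while the label and annotation keys are the ones used above.

```python
AGENT_PREFIX = "agent.skywalking.apache.org/"
OVERLAY_KEY = "strategy.skywalking.apache.org/agent.Overlay"


def should_inject(namespace_labels: dict, pod_labels: dict) -> bool:
    """Inject only when both the namespace and the pod opt in."""
    return (namespace_labels.get("swck-injection") == "enabled"
            and pod_labels.get("swck-java-agent-injected") == "true")


def effective_config(global_config: dict, annotations: dict) -> dict:
    """Overlay agent.* annotations on the global config when the overlay is enabled."""
    config = dict(global_config)
    if annotations.get(OVERLAY_KEY) != "true":
        return config  # without the overlay switch, annotations are ignored
    for key, value in annotations.items():
        if key.startswith(AGENT_PREFIX):
            config[key[len(AGENT_PREFIX):]] = value
    return config


global_config = {
    "agent.service_name": "Your_ApplicationName",
    "collector.backend_service": "default-oap.default:11800",
}
annotations = {
    OVERLAY_KEY: "true",
    "agent.skywalking.apache.org/agent.service_name": "backend-service",
}
print(effective_config(global_config, annotations))
```

Under this model the service name becomes backend-service while the backend address still comes from the global configmap, which matches the JavaAgent output shown later in this section.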

Deploy a spring boot application in the springboot-system namespace.

$ kubectl apply -f springboot.yaml

Check for deployment.

$ kubectl get pod -n springboot-system
NAME                               READY   STATUS    RESTARTS   AGE
demo-springboot-7c89f79885-dvk8m   1/1     Running   0          11s

Get the final injected java agent configuration through JavaAgent.

$ kubectl get javaagent -n springboot-system
NAME                            PODSELECTOR           SERVICENAME       BACKENDSERVICE
app-demo-springboot-javaagent   app=demo-springboot   backend-service   default-oap.default:11800

4.4 Deploy spring cloud gateway application

Create the gateway-system namespace.

$ kubectl create namespace gateway-system

Label the gateway-system namespace to enable the java agent injector.

$ kubectl label namespace gateway-system swck-injection=enabled

Create the deployment file springgateway.yaml for the spring cloud gateway application, which uses annotations to override the default agent configuration, such as service_name. In addition, when using spring cloud gateway, we need to add the spring cloud gateway plugin to the agent configuration.

Notice! Before using the annotation to override the agent configuration, you need to add strategy.skywalking.apache.org/agent.Overlay: "true" to make the override take effect.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-gateway
  name: demo-gateway
  namespace: gateway-system
spec:
  selector:
    matchLabels:
      app: demo-gateway
  template:
    metadata:
      labels:
        swck-java-agent-injected: "true"
        app: demo-gateway
      annotations:
        strategy.skywalking.apache.org/agent.Overlay: "true"
        agent.skywalking.apache.org/agent.service_name: "gateway-service"     
        optional.skywalking.apache.org: "cloud-gateway-3.x" # add spring cloud gateway plugin
    spec:
      containers:
      - image: gateway:v0.0.1
        name: gateway
        command: ["java"]
        args: ["-jar","/gateway.jar"]
---
apiVersion: v1
kind: Service
metadata:
  name: service-gateway
  namespace: gateway-system
spec:
  type: ClusterIP
  ports:
  - name: 9999-tcp
    port: 9999
    protocol: TCP
    targetPort: 9999
  selector:
    app: demo-gateway

Deploy a spring cloud gateway application in the gateway-system namespace.

$ kubectl apply -f springgateway.yaml

Check for deployment.

$ kubectl get pod -n gateway-system
NAME                           READY   STATUS    RESTARTS   AGE
demo-gateway-5bb77f6d85-9j7c6   1/1     Running   0          15s

Get the final injected java agent configuration through JavaAgent.

$ kubectl get javaagent -n gateway-system
NAME                         PODSELECTOR        SERVICENAME       BACKENDSERVICE
app-demo-gateway-javaagent   app=demo-gateway   gateway-service   default-oap.default:11800

5. Verify the injector

After completing the above steps, we can view the detailed state of the injected pods, such as the injected agent container.

# get all injected pod
$ kubectl get pod -A -lswck-java-agent-injected=true
NAMESPACE           NAME                               READY   STATUS    RESTARTS   AGE
gateway-system      demo-gateway-5bb77f6d85-lt4z7      1/1     Running   0          69s
springboot-system   demo-springboot-7c89f79885-lkb5j   1/1     Running   0          75s

# view detailed state of the injected pod [demo-springboot]
$ kubectl describe pod -l app=demo-springboot -n springboot-system
...
Events:
  Type   Reason   Age                From                           Message
  ----   ------  ----                ----                           -------
  ...
  Normal Created  91s  kubelet,kind-control-plane Created  container inject-skywalking-agent
  Normal Started  91s  kubelet,kind-control-plane Started  container inject-skywalking-agent
  ...
  Normal Created  90s  kubelet,kind-control-plane Created  container springboot
  Normal Started  90s  kubelet,kind-control-plane Started  container springboot

# view detailed state of the injected pod [demo-gateway] 
$ kubectl describe pod -l app=demo-gateway -n gateway-system
...
Events:
  Type   Reason   Age            From                         Message
  ----   ------  ----            ----                         -------
  ...
  Normal Created 2m20s kubelet,kind-control-plane Created container inject-skywalking-agent
  Normal Started 2m20s kubelet,kind-control-plane Started container inject-skywalking-agent
  ...
  Normal Created 2m20s kubelet,kind-control-plane Created container gateway
  Normal Started 2m20s kubelet,kind-control-plane Started container gateway

Now we can expose the services and view the data in the web UI. First, get the gateway service and the UI service as follows.

$ kubectl get service service-gateway -n gateway-system
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service-gateway   ClusterIP   10.99.181.145   <none>        9999/TCP   9m19s

$ kubectl get service default-ui
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
default-ui   ClusterIP   10.111.39.250   <none>        80/TCP    82m

Then open two terminals to expose the two services, service-gateway and default-ui.

$ kubectl port-forward service/service-gateway -n gateway-system 9999:9999
Forwarding from 127.0.0.1:9999 -> 9999
Forwarding from [::1]:9999 -> 9999

$ kubectl port-forward service/default-ui 8090:80                     
Forwarding from 127.0.0.1:8090 -> 8080
Forwarding from [::1]:8090 -> 8080

Use the following commands to access the spring boot demo 10 times through the spring cloud gateway service.

$ for i in {1..10}; do curl http://127.0.0.1:9999/gateway/hello && echo ""; done
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!

We can see the Dashboard by accessing http://127.0.0.1:8090.

All services’ topology is shown below.

We can see the trace information of gateway-service.

We can see the trace information of backend-service.

6. Concluding remarks

If your application is deployed on Kubernetes and needs SkyWalking for monitoring, SWCK can help you deploy, upgrade, and maintain the SkyWalking components in the Kubernetes cluster. In addition to this blog, you can read the SWCK documentation and the java agent injector documentation for more information. If you find this project useful, please give SWCK a star! If you have any questions, feel free to ask in Issues or Discussions.

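Zh: Integrating Apache SkyWalking with source code

Thu, 14 Apr 2022 00:00:00 +0000

Read this post in original language: English

Introduction

The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it. - Mark Weiser

Mark Weiser predicted in the late 1980s that the most far-reaching technologies are those that vanish into thin air.

"Whenever people learn something sufficiently well, they cease to be aware of it."

As Weiser noted, this disappearing act stems not only from technology but from human psychology. It is this experience that lets us leave low-level concerns behind and move to higher-level thinking. Once we are no longer held back by mundane details, we are free to focus on new goals.

As APMs (application performance management systems) become more common, this realization becomes more important. As more applications are deployed with APMs, the number of abstract representations of the underlying source code grows as well. While this provides great value to many non-development roles in an organization, it does pose additional challenges for developers: they must translate these representations into actionable concepts (that is, source code). On this point, Weiser summed it up rather concisely: "Just as an automobile mechanic should not be asked to work without looking at the engine, we should not ask programmers to work without access to source code."

Even so, APMs collect more information only to produce a plethora of new abstract representations. In this article, we introduce a new concept in Source++, the open-source live coding platform, designed to let developers monitor production applications more intuitively.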

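Live Views

Even after collecting hundreds of metrics, we still do not understand what makes programs easier to understand, modify, reuse, or borrow. I do not think we will find the answer by turning away from the programs themselves to their abstract interfaces. The answers are in the source code. - Mark Weiser

As APMs move from "nice to have" to "must have", one fundamental trait stands in the way of their adoption: they must disappear from consciousness. As developers, we should not feel the urge to open a browser to better understand the underlying source code; the answers are in the source code. Instead, we should improve our tools so that the source code intuitively tells us what we need to know. Think how much simpler life would be if failing code always indicated how and why it failed. That is the idea behind Source++.

In our last blog post, we discussed Extending Apache SkyWalking with non-breaking breakpoints. We introduced a concept called Live Instruments, which developers can use to easily debug live production applications without leaving their development environment. Today, we discuss how to integrate an existing SkyWalking deployment into your IDE through a new concept called Live Views. Unlike Live Instruments, which are designed for debugging live applications, Live Views are designed to improve the understanding and comprehension of an application. This is done through various commands entered into the Live Command Palette.

Live Command Palette

The Live Command Palette (LCP) is a contextual command prompt, included in the Source++ JetBrains plugin, that lets developers control and query live applications directly from the IDE.

Opened with a keyboard shortcut (Ctrl+Shift+S), the LCP lets developers easily see the runtime metrics related to the source code they are currently viewing.

The LCP currently supports the following Live View commands:

Command: `view` (overview/activity/traces/logs)

The view commands display a popup with live operational data tied to the current source code. They let developers view traditional SkyWalking operational data, filtered down to the relevant metrics.

Command: `watch log`

The watch log command lets developers follow, in real time, every log produced by a running application. With it, developers can find instances of a specific log statement without manually sifting through large volumes of logs.

Command: (show/hide) quick stats

The show quick stats command displays live endpoint metrics for a quick view of an endpoint's activity. With this command, developers can quickly assess the state of an endpoint and determine whether it is performing as expected.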

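Future Work

A good tool is an invisible tool. By invisible, I mean that the tool does not intrude on your consciousness; you focus on the task, not the tool. Eyeglasses are a good tool: you look at the world, not the eyeglasses. - Mark Weiser

Source++ aims to extend SkyWalking in such a way that SkyWalking itself becomes invisible. To that end, we plan to support custom developer commands. Developers will be able to build custom commands as well as commands shared with their team. These commands will be aware of context, types, and conditions, allowing for a wide range of operations. As more commands are added, developers will be able to tap into everything SkyWalking offers while staying focused on what matters most: the source code.

If you find these features useful, please consider trying Source++. You can install the plugin through the JetBrains Marketplace or directly from your JetBrains IDE. If you have any questions, please open an issue.

Feedback is always welcome!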

Blog: SourceMarker: Continuous Feedback for Developers

Tue, 16 Mar 2021 00:00:00 +0000

SourceMarker is an open-source continuous feedback IDE plugin built on top of Apache SkyWalking, a popular open-source APM system with monitoring, tracing, and diagnosing capabilities for distributed software systems. SkyWalking, a truly holistic system, provides the means for automatically producing, storing, and querying software operation metrics. It requires little to no code changes to implement and is lightweight enough to be used in production. By itself, SkyWalking is a formidable force in the realm of continuous monitoring technology.

SourceMarker, leveraging the continuous monitoring functionality provided by SkyWalking, creates continuous feedback technology by automatically linking software operation metrics to source code and displaying feedback directly inside of the IDE. While currently only supporting JetBrains-based IDEs and JVM-based programming languages, SourceMarker may be extended to support any number of programming languages and IDEs. Using SourceMarker, software developers can understand and validate software operation inside of their IDE. Instead of charts that indicate the health of the application, software developers can view the health of individual source code components and interpret software operation metrics from a much more familiar perspective. Such capabilities improve productivity as time spent continuously context switching from development to monitoring would be eliminated.

Logging

The benefits of continuous feedback technology are immediately apparent with the ability to view and search logs directly from source code. Instead of tailing log files or viewing logs through the browser, SourceMarker allows software developers to navigate production logs just as easily as they navigate source code. By using the source code as the primary perspective for navigating logs, SourceMarker allows software developers to view logs specific to any package, class, method, or line directly from the context of the source code which resulted in those logs.

Tracing

Furthermore, continuous feedback technology offers software developers a deeper understanding of software by explicitly tying the implicit software operation to source code. Instead of visualizing software traces as Gantt charts, SourceMarker allows software developers to step through trace stacks while automatically resolving trace tags and logs. With SourceMarker, software developers can navigate production software traces in much the same way one debugs local applications.

Alerting

Most importantly, continuous feedback technology keeps software developers aware of production software operation. Armed with an APM-powered IDE, every software developer can keep track of the behavior of any method, class, package, and even the entire application itself. Moreover, this allows for source code to be the medium through which production bugs are made evident, thereby creating the feasibility of source code with the ability to self-diagnose and convey its own health.

Download SourceMarker

SourceMarker aims to bridge the theoretical and empirical practices of software development through continuous feedback. The goal is to make developing software with empirical data feel natural and intuitive, creating more complete software developers that understand the entire software development cycle.

https://github.com/sourceplusplus/sourcemarker

This project is still early in its development, so if you think of any ways to improve SourceMarker, please let us know.

Blog: [Design] The Verifier of NGE2E

Mon, 01 Feb 2021 00:00:00 +0000

Background

The verifier is an important part of the next generation End-to-End Testing framework (NGE2E), which is responsible for verifying whether the actual output satisfies the expected template.

Design Thinking

We will implement the verifier with Go templates, plus some enhancements. First, users write a Go template file, using the provided functions and actions, to describe what the expected data looks like. Then the verifier renders the template with the actual data object. Finally, the verifier compares the rendered output with the actual data. If the rendered output is not the same as the actual output, the actual data is inconsistent with the expected data. Otherwise, the actual data matches the expected data. On failure, the verifier also prints the differences between the expected and actual data.
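The render-then-compare flow can be sketched in a few lines. The example below is a Python stand-in rather than the Go template implementation the design calls for: plain functions and lambdas play the role of template functions, and all names (`render`, `verify`, `not_empty`, `gt`) are illustrative. The key idea it shows is that a passing check renders to the actual value itself, so a fully matching template renders to something identical to the actual data.

```python
import difflib
import json


def not_empty(value):
    """Render to the value itself when non-empty, else to a marker that cannot match."""
    return value if value not in (None, "") else "<not empty>"


def gt(value, bound):
    """Render to the value itself when it exceeds bound, else to a marker."""
    return value if value > bound else f"<greater than {bound}>"


def render(template, actual):
    """Render the expected template against the actual data object."""
    if callable(template):
        return template(actual)
    if isinstance(template, dict):
        return {key: render(item, actual) for key, item in template.items()}
    if isinstance(template, list):
        return [render(item, actual) for item in template]
    return template


def verify(template, actual):
    """Return [] on success, or the diff lines between expected and actual."""
    expected = json.dumps(render(template, actual), indent=2, sort_keys=True).splitlines()
    got = json.dumps(actual, indent=2, sort_keys=True).splitlines()
    if expected == got:
        return []
    return list(difflib.unified_diff(expected, got, "expected", "actual", lineterm=""))


template = {
    "name": "p99",
    "id": lambda a: not_empty(a["id"]),
    "value": lambda a: gt(a["value"], 0),
}
print(verify(template, {"name": "p99", "id": "abc", "value": 3}))
print("\n".join(verify(template, {"name": "p99", "id": "abc", "value": 0})))
```

Comparing rendered text rather than evaluating boolean assertions means a failure report is simply a text diff, which is what makes the differences easy to print.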

Branches / Actions

The verifier inherits all the actions from the standard Go template, such as if, with, range, etc. In addition, we also provide some custom actions to satisfy our own needs.

List Elements Match

contains checks if the actual list contains elements that match the given template.

Examples:

metrics:
{{- contains .metrics }}
  - name: {{ notEmpty .name }}
    id: {{ notEmpty .id }}
    value: {{ gt .value 0 }}
{{- end }}

It means that the list metrics must contain an element whose name and id are not empty, and value is greater than 0.

metrics:
{{- contains .metrics }}
  - name: p95
    value: {{ gt .value 0 }}
  - name: p99
    value: {{ gt .value 0 }}
{{- end }}

This means that the list metrics must contain an element named p95 with a value greater than 0, and an element named p99 with a value greater than 0. Besides these two elements, the list metrics may or may not contain other elements.
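The matching rule for contains can be modelled as "every template element must match at least one actual element, in any order". The sketch below expresses that in Python under one simplifying assumption: the same actual element is allowed to satisfy more than one template element, a detail the design text does not specify. The function names are illustrative.

```python
def matches(template: dict, element: dict) -> bool:
    """An element matches when every templated field is present and accepted."""
    for key, expected in template.items():
        if key not in element:
            return False
        value = element[key]
        accepted = expected(value) if callable(expected) else expected == value
        if not accepted:
            return False
    return True


def contains(actual_list: list, templates: list) -> bool:
    """True when each template matches some element, regardless of order."""
    return all(any(matches(t, e) for e in actual_list) for t in templates)


templates = [
    {"name": "p95", "value": lambda v: v > 0},
    {"name": "p99", "value": lambda v: v > 0},
]
metrics = [
    {"name": "p50", "value": 0},   # extra element, ignored
    {"name": "p99", "value": 5},
    {"name": "p95", "value": 3},
]
print(contains(metrics, templates))                           # True
print(contains([{"name": "p95", "value": 3}], templates))     # False
```

The first call passes even though the elements are in a different order and an unrelated p50 entry is present; the second fails because no p99 element exists.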

Functions

Users can use these provided functions in the template to describe the expected data.

Not Empty

notEmpty checks that the string s is not empty.

Example:

id: {{ notEmpty .id }}

Regexp match

regexp checks if string s matches the regular expression pattern.

Examples:

label: {{ regexp .label "ratings.*" }}

Base64

b64enc s returns the Base64 encoded string of s.

Examples:

id: {{ b64enc "User" }}.static-suffix # this evaluates to the base64-encoded string of "User", concatenated with the static suffix ".static-suffix"

Result:

id: VXNlcg==.static-suffix
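The encoded values used throughout the examples can be reproduced with any standard Base64 encoder. A short Python check, where the helper name `b64enc` simply mirrors the template function and is not the verifier's code:

```python
import base64


def b64enc(s: str) -> str:
    """Return the standard Base64 encoding of the UTF-8 bytes of s."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


print(b64enc("User") + ".static-suffix")   # VXNlcg==.static-suffix
print(b64enc("Your_ApplicationName") + ".1")
print(b64enc("localhost:-1") + ".0")
```

The last two lines produce the node ids that appear in the full example's actual data.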

Full Example

Here is an example of expected data:

# expected.data.yaml
nodes:
  - id: {{ b64enc "User" }}.0
    name: User
    type: USER
    isReal: false
  - id: {{ b64enc "Your_ApplicationName" }}.1
    name: Your_ApplicationName
    type: Tomcat
    isReal: true
  - id: {{ $h2ID := (index .nodes 2).id }}{{ notEmpty $h2ID }} # We assert that nodes[2].id is not empty and save it to variable `h2ID` for later use
    name: localhost:-1
    type: H2
    isReal: false
calls:
  - id: {{ notEmpty (index .calls 0).id }}
    source: {{ b64enc "Your_ApplicationName" }}.1
    target: {{ $h2ID }} # We use the previously assigned variable `h2ID` to assert that the `target` is equal to the `id` of nodes[2]
    detectPoints:
      - CLIENT
  - id: {{ b64enc "User" }}.0-{{ b64enc "Your_ApplicationName" }}.1
    source: {{ b64enc "User" }}.0
    target: {{ b64enc "Your_ApplicationName" }}.1
    detectPoints:
      - SERVER

will validate this data:

# actual.data.yaml
nodes:
  - id: VXNlcg==.0
    name: User
    type: USER
    isReal: false
  - id: WW91cl9BcHBsaWNhdGlvbk5hbWU=.1
    name: Your_ApplicationName
    type: Tomcat
    isReal: true
  - id: bG9jYWxob3N0Oi0x.0
    name: localhost:-1
    type: H2
    isReal: false
calls:
  - id: WW91cl9BcHBsaWNhdGlvbk5hbWU=.1-bG9jYWxob3N0Oi0x.0
    source: WW91cl9BcHBsaWNhdGlvbk5hbWU=.1
    detectPoints:
      - CLIENT
    target: bG9jYWxob3N0Oi0x.0
  - id: VXNlcg==.0-WW91cl9BcHBsaWNhdGlvbk5hbWU=.1
    source: VXNlcg==.0
    detectPoints:
      - SERVER
    target: WW91cl9BcHBsaWNhdGlvbk5hbWU=.1

# expected.data.yaml
metrics:
{{- contains .metrics }}
  - name: {{ notEmpty .name }}
    id: {{ notEmpty .id }}
    value: {{ gt .value 0 }}
{{- end }}

will validate this data:

# actual.data.yaml
metrics:
  - name: business-zone::projectA
    id: YnVzaW5lc3Mtem9uZTo6cHJvamVjdEE=.1
    value: 1
  - name: system::load balancer1
    id: c3lzdGVtOjpsb2FkIGJhbGFuY2VyMQ==.1
    value: 0
  - name: system::load balancer2
    id: c3lzdGVtOjpsb2FkIGJhbGFuY2VyMg==.1
    value: 0

and will report an error when validating this data, because there is no element with a value greater than 0:

# actual.data.yaml
metrics:
  - name: business-zone::projectA
    id: YnVzaW5lc3Mtem9uZTo6cHJvamVjdEE=.1
    value: 0
  - name: system::load balancer1
    id: c3lzdGVtOjpsb2FkIGJhbGFuY2VyMQ==.1
    value: 0
  - name: system::load balancer2
    id: c3lzdGVtOjpsb2FkIGJhbGFuY2VyMg==.1
    value: 0

contains performs an unordered list verification. To verify a list including its order, you can simply use the basic rules like this:

# expected.data.yaml
metrics:
  - name: p99
    value: {{ gt (index .metrics 0).value 0 }}
  - name: p95
    value: {{ gt (index .metrics 1).value 0 }}

which expects the actual metrics list to be exactly ordered, with the first element named p99 and a value greater than 0, and the second element named p95 and a value greater than 0.
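
The difference between the two styles can be sketched in Go. This is only an illustration of the semantics described above, not the verifier's implementation: Metric, Rule, containsMatch and orderedMatch are hypothetical names, and requiring equal list lengths in the ordered case is an assumption.

```go
package main

import "fmt"

// Metric mirrors one entry of the metrics list in the examples above.
type Metric struct {
	Name  string
	ID    string
	Value int
}

// Rule is a hypothetical stand-in for a rendered matcher such as `gt .value 0`.
type Rule func(Metric) bool

// containsMatch passes if at least one element satisfies the rule,
// regardless of its position (the unordered "contains" semantics).
func containsMatch(actual []Metric, rule Rule) bool {
	for _, m := range actual {
		if rule(m) {
			return true
		}
	}
	return false
}

// orderedMatch requires element i to satisfy rules[i].
// Assumption: the lists must also have the same length.
func orderedMatch(actual []Metric, rules []Rule) bool {
	if len(actual) != len(rules) {
		return false
	}
	for i, rule := range rules {
		if !rule(actual[i]) {
			return false
		}
	}
	return true
}

func main() {
	positive := func(m Metric) bool { return m.Name != "" && m.ID != "" && m.Value > 0 }
	actual := []Metric{
		{Name: "business-zone::projectA", ID: "YnVzaW5lc3Mtem9uZTo6cHJvamVjdEE=.1", Value: 1},
		{Name: "system::load balancer1", ID: "c3lzdGVtOjpsb2FkIGJhbGFuY2VyMQ==.1", Value: 0},
	}
	fmt.Println(containsMatch(actual, positive)) // true: one matching element is enough

	named := func(name string) Rule {
		return func(m Metric) bool { return m.Name == name && m.Value > 0 }
	}
	percentiles := []Metric{{Name: "p95", Value: 7}, {Name: "p99", Value: 9}}
	fmt.Println(orderedMatch(percentiles, []Rule{named("p99"), named("p95")})) // false: wrong order
}
```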

Blog: [Design] NGE2E - Next Generation End-to-End Testing Framework

Mon, 14 Dec 2020 00:00:00 +0000

NGE2E is the next generation End-to-End Testing framework that aims to help developers to set up, debug, and verify E2E tests with ease. It’s built based on the lessons learnt from tens of hundreds of test cases in the SkyWalking main repo.

Goal

Keep the feature parity with the existing E2E framework in SkyWalking main repo;
Support both docker-compose and KinD to orchestrate the tested services under different environments;
Get rid of the heavy Java/Maven stack, which exists in the current E2E; be language independent as much as possible, users only need to configure YAMLs and run commands, without writing code;

Non-Goal

This framework is not involved with the build process, i.e. it won’t do something like mvn package or docker build, the artifacts (.tar, docker images) should be ready in an earlier process before this;
This project doesn’t take the plugin tests into account, at least for now;
This project doesn’t mean to add/remove any new/existing test case to/from the main repo;
This documentation won’t cover too many technical details of how to implement the framework; those should go into a separate document;

Design

Before diving into the design details, let’s take a quick look at how the end user might use NGE2E.

All the following commands are mock, and are open to debate.

To run a test case in a directory /path/to/the/case/directory

e2e run /path/to/the/case/directory

# or

cd /path/to/the/case/directory && e2e run

This runs the test case in the specified directory. The command is a wrapper that glues together all the following commands, each of which can also be executed separately, for example, to debug the case:

NOTE: because all the options can be loaded from a configuration file, as long as a configuration file (say e2e.yaml) is given in the directory, every command should be able to run in bare mode (without any option explicitly specified on the command line);

Set Up

e2e setup --env=compose --file=docker-compose.yaml --wait-for=service/health
e2e setup --env=kind --file=kind.yaml --manifests=bookinfo.yaml,gateway.yaml --wait-for=pod/ready
e2e setup # If configuration file e2e.yaml is present

--env: the environment, may be compose or kind, represents docker-compose and KinD respectively;
--file: the docker-compose.yaml or kind.yaml file that declares how to set up the environment;
--manifests: for KinD, the resources files/directories to apply (using kubectl apply -f);
--command: a command to run after the environment is started; this may be useful when users need to install some extra tools or apply resources from the command line, like istioctl install --profile=demo;
--wait-for: can be specified multiple times to give a list of conditions; the e2e setup command waits until all the given conditions are met. The most frequently used strategies should be --wait-for=service/health, --wait-for=deployments/available, etc.; other possible strategies may be something like --wait-for="log:Started Successfully", --wait-for="http:localhost:8080/healthcheck", etc., if really needed;
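
The waiting behavior could look roughly like the following Go sketch. It only illustrates "poll until every condition holds, or give up after a timeout"; the Condition type, the waitFor function and the timeout parameter are assumptions made for the example, since the design above does not define them.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Condition is a hypothetical check, e.g. "the service reports healthy".
type Condition func() bool

// waitFor polls until all conditions hold or the timeout elapses.
func waitFor(conds []Condition, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		allMet := true
		for _, cond := range conds {
			if !cond() {
				allMet = false
				break
			}
		}
		if allMet {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for conditions")
		}
		time.Sleep(interval)
	}
}

func main() {
	checks := 0
	// Simulated health check that only succeeds on the third poll.
	healthy := func() bool { checks++; return checks >= 3 }
	err := waitFor([]Condition{healthy}, 10*time.Millisecond, time.Second)
	fmt.Println("error:", err, "polls:", checks)
}
```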

Trigger Inputs

e2e trigger --interval=3s --times=0 --action=http --url="localhost:8080/users"
e2e trigger --interval=3s --times=0 --action=cmd --cmd="curl localhost:8080/users"
e2e trigger # If configuration file e2e.yaml is present

--interval=3s: trigger the action every 3 seconds;
--times=0: how many times to trigger the action, 0=infinite;
--action=http: the action of the trigger, i.e. “perform an http request as an input”;
--action=cmd: the action of the trigger, i.e. “execute the cmd as an input”;

Query Output

swctl service ls

This is a project-specific step: different projects may use different tools to query the actual output. SkyWalking uses swctl.

Verify

e2e verify --actual=actual.data.yaml --expected=expected.data.yaml
e2e verify --query="swctl service ls" --expected=expected.data.yaml
e2e verify # If configuration file e2e.yaml is present

--actual: the actual data file, only YAML file format is supported;
--expected: the expected data file, only YAML file format is supported;
--query: the query to get the actual data, the query result must have the same format as --actual and --expected;

The --query option will get the output into a temporary file and use the --actual under the hood;

Cleanup

e2e cleanup --env=compose --file=docker-compose.yaml
e2e cleanup --env=kind --file=kind.yaml --resources=bookinfo.yaml,gateway.yaml
e2e cleanup # If configuration file e2e.yaml is present

This step requires the same options as the setup step so that it can clean up everything necessary.

Summarize

To summarize, the directory structure of a test case might be

case-name
├── agent-service        # optional, an arbitrary project that is used in the docker-compose.yaml if needed
│   ├── Dockerfile
│   ├── pom.xml
│   └── src
├── docker-compose.yaml
├── e2e.yaml             # see a sample below
└── testdata
    ├── expected.endpoints.service1.yaml
    ├── expected.endpoints.service2.yaml
    └── expected.services.yaml

case-name
├── kind.yaml
├── bookinfo
│   ├── bookinfo.yaml
│   └── bookinfo-gateway.yaml
├── e2e.yaml             # see a sample below
└── testdata
    ├── expected.endpoints.service1.yaml
    ├── expected.endpoints.service2.yaml
    └── expected.services.yaml

a sample of e2e.yaml may be

setup:
  env: kind
  file: kind.yaml
  manifests:
    - path: bookinfo.yaml
      wait: # you can have multiple conditions to wait
        - namespace: bookinfo
          label-selector: app=product
          for: deployment/available
        - namespace: reviews
          label-selector: app=product
          for: deployment/available
        - namespace: ratings
          label-selector: app=product
          for: deployment/available

  run:
    - command: | # it can be a shell script or anything executable
        istioctl install --profile=demo -y
        kubectl label namespace default istio-injection=enabled
      wait:
        - namespace: istio-system
          label-selector: app=istiod
          for: deployment/available

  # OR
  # env: compose
  # file: docker-compose.yaml

trigger:
  action: http
  interval: 3s
  times: 0
  url: localhost:9090/users

verify:
  - query: swctl service ls
    expected: expected.services.yaml
  - query: swctl endpoint ls --service="YnVzaW5lc3Mtem9uZTo6cHJvamVjdEM=.1"
    expected: expected.projectC.endpoints.yaml

then a single command should do the trick.

e2e run

Modules

This project is divided into the following modules.

Controller

A controller command (e2e run) composes all the steps declared in the e2e.yaml. It should be progressive and clearly display which step is currently running. If a step fails, the error message should be as comprehensive as possible. An example of the output might be

e2e run
✔ Started Kind Cluster - Cluster Name
✔ Checked Pods Readiness - All pods are ready
? Generating Traffic - http localhost:9090/users (progress spinner)
✔ Verified Output - service ls
(progress spinner) Verifying Output - endpoint ls
✘ Failed to Verify Output Data - endpoint ls
  <the diff content>
✔ Clean Up

Compared with running the steps one by one, the controller is also responsible for cleaning up the env (by executing the cleanup command) no matter what status the other commands end in, even if they fail. The controller has the following semantics in terms of setup and cleanup.

// Java
try {
    setup();
    // trigger step
    // verify step
    // ...
} finally {
    cleanup();
}

// GoLang
func run() {
    setup()
    defer cleanup()
    // trigger step
    // verify step
    // ...
}

Initializer

The initializer is responsible for

When env==compose
- Start the docker-compose services;
- Check the services’ healthiness;
- Wait until all services are ready according to the interval, etc.;
When env==kind
- Start the KinD cluster according to the config files;
- Apply the resources files (--manifests) or/and run the custom init command (--command);
- Check the pods’ readiness;
- Wait until all pods are ready according to the interval, etc.;

Verifier

According to scenarios we have at the moment, the must-have features are:

Matchers
- Exact match
- Not null
- Not empty
- Greater than 0
- Regexp match
- At least one of list element match
Functions
- Base64 encode/decode

In order to help identify simple bugs from the GitHub Actions workflow, there are some “nice to have” features:

Printing the diff content when verification fails, which proved to be a super helpful bonus in the Python agent repo;

Logging

When a test case fails, all the necessary logs should be collected into a dedicated directory, which could be uploaded to the GitHub Artifacts for downloading and analysis;

Logs through the entire process of a test case are:

KinD clusters logs;
Containers/pods logs;
The logs from the NGE2E itself;

More Planned

Debugging

Debugging the E2E locally has been a strong requirement and a time killer that we haven’t solved to date. We have enhancements like https://github.com/apache/skywalking/pull/5198 , but in this framework we will adopt a new method to “really” support debugging locally.

The most common case when debugging is to run the E2E tests, with one or more services forwarded into the host machine, where the services are run in the IDE or in debug mode.

For example, you may run the SkyWalking OAP server in an IDE and run e2e run, expecting the other services (e.g. agent services, SkyWalking WebUI, etc.) inside the containers to connect to your local OAP, instead of the one declared in docker-compose.yaml.

On Docker Desktop for Mac/Windows, containers can access the services running on the host machine via host.docker.internal; on Linux, it’s 172.17.0.1.

One possible solution is to add an option --debug-services=oap,other-service-name that rewrites all the router rules inside the containers from oap to host.docker.internal/172.17.0.1.

CodeGen

When adding a new test case, a code generator would be of great value to eliminate repeated labor and copy-paste issues.

e2e new <case-name>

Blog: The first design of Satellite 0.1.0

Wed, 25 Nov 2020 00:00:00 +0000

Author: Jiapeng Liu. Baidu.
skywalking-satellite: The Sidecar Project of Apache SkyWalking
Nov. 25th, 2020

A lightweight collector/sidecar which can be deployed close to the target monitored system, to collect metrics, traces, and logs. It also provides advanced features, such as local cache, format transformation, and sampling.

Design Thinking

Satellite is a 2 level system to collect observability data from other core systems. So, the core element of the design is to guarantee data stability during Pod startup all the way to Pod shutdown avoiding alarm loss. All modules are designed as plugins, and if you have other ideas, you can add them yourself.

SLO

Single gatherer supports > 1000 ops (based on 0.5 Core, 50M)
At least once delivery. (Optional)
Data stability: 99.999%. (Optional)

Because they are influenced by the choice of plugins, some items in SLO are optional.

Role

Satellite would run as a Sidecar. Although DaemonSet mode would take up fewer resources, it would make forwarding from the agents more troublesome. So we prefer Sidecar mode and aim to reduce its cost. DaemonSet mode will also be supported in the future.

Core Modules

The Satellite has 3 core modules which are Gatherer, Processor, and Sender.

The Gatherer module is responsible for fetching or receiving data and pushing the data to Queue.
The Processor module is responsible for reading data from the queue and processing data by a series of filter chains.
The Sender module is responsible for asynchronously processing the data and forwarding it to external services in batch mode. After a successful send, the Sender also acknowledges the offset of the Queue in the Gatherer.

Detailed Structure

The overall design is shown in detail in the figure below. We will explain the specific components one by one.

Gatherer

Concepts

The Gatherer has 4 components to support the data collection, which are Input, Collector, Worker, and Queue. There are 2 roles in the Worker, which are Fetcher and Receiver.

The Input is an abstraction of the input source, which is usually mapped to a configuration file.
The Collector is created by the Source, and many collectors can be created by the same Source. For example, when a log path has been configured as /var/*.log in an Input, the number of collectors equals the number of files matching this path.
The Fetcher and Receiver are the real workers that collect data. The receiver interface is an abstraction with multiple implementations, such as a gRPC receiver and an HTTP receiver. Here are some specific use cases:
- Trace Receiver is a gRPC server for receiving trace data created by SkyWalking agents.
- Log Receiver is also a gRPC server, for receiving log data collected by SkyWalking agents. (In the future we want the SkyWalking agent to support log sending; RPC-based log sending is more efficient and needs fewer resources than file reading. For example, file reading brings IO pressure and a performance cost for multi-line splicing.)
- Log Fetcher is like Filebeat, which fits the common log collection scenario. This fetcher has more responsibility than any other worker because it needs to record the offset and handle multi-line splicing. This feature will be implemented in the future.
- Prometheus Fetcher supports a new way to fetch Prometheus data and push the data to the upstream.
- ……
The Queue is a buffer module to decouple collection and transmission. In the 1st release version, we will use persistent storage to ensure data stability. But the implementation is a plug-in design that can support pure memory queues later.

The data flow

We use the Trace Receiver as an example to introduce the data flow.

Queue

MmapQueue

We have simplified the design of MmapQueue to reduce the resource cost in memory and on disk.

Concepts

There are 2 core concepts in MmapQueue.

Segment: the Segment is the real data store. It provides large-space storage and, by using mmap, keeps the loss of read and write performance as small as possible. We also avoid deleting files by reusing them.
Meta: The purpose of meta is to find the data that the consumer needs.

Segment

One MmapQueue has a directory to store all of its data. The Queue directory is made up of many segments and 1 meta file. The number of segments is computed from 2 params: the max cost of the Queue and the cost of each segment. For example, if the max cost is 512M and each segment costs 256K, the directory can hold up to 2000 files. Once capacity is exceeded, an overwrite policy is adopted, which means new data wraps around and overrides the first file.

Each segment in Queue will be N times the size of the page cache and will be read and written in an appended sequence rather than randomly. These would improve the performance of Queue. For example, each Segment is a 128k file, as shown in the figure below.
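
The segment arithmetic and the ring-style reuse can be sketched as follows. This is a minimal illustration, not MmapQueue's code: segmentCount and segmentIndex are hypothetical helpers, and binary units (1M = 1024K) are assumed, which gives 2048 segments, close to the roughly 2000 quoted above.

```go
package main

import "fmt"

// segmentCount derives how many segment files fit into the queue's budget.
func segmentCount(maxBytes, segmentBytes int64) int64 {
	return maxBytes / segmentBytes
}

// segmentIndex maps an ever-growing segment sequence number onto the ring,
// so that sequence number `count` reuses (overwrites) file 0.
func segmentIndex(seq, count int64) int64 {
	return seq % count
}

func main() {
	const kib, mib = int64(1024), int64(1024 * 1024)
	count := segmentCount(512*mib, 256*kib)
	fmt.Println(count)                      // 2048 with binary units
	fmt.Println(segmentIndex(count, count)) // 0: wraps around to the first file
}
```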

Mmap Performance Test

The test verifies the efficiency of mmap at a low memory cost.

The rate of data generation: 7.5K/item 1043 item/s (Based on Aifanfan online pod.)
The test is based on Bigqueue because its structure is similar.
Test tool: Go Benchmark Test
Command: go test -bench BenchmarkEnqueue -run=none -cpu=1
Result on Mac (15-inch, 2018, 16 GB 2400 MHz DDR4, 2.2 GHz Intel Core i7, SSD):
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-8KB/MaxMem-384KB 66501 21606 ns/op 68 B/op 1 allocs/op
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-8KB/MaxMem-1.25MB 72348 16649 ns/op 67 B/op 1 allocs/op
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-16KB/MaxMem-1.25MB 39996 33199 ns/op 103 B/op 1 allocs/op
Result on Linux (INTEL Xeon E5-2450 V2 8C 2.5GHZ*2, INVENTEC PC3L-10600 16G*8, INVENTEC SATA 4T 7.2K*8):
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-8KB/MaxMem-384KB 126662 12070 ns/op 62 B/op 1 allocs/op
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-8KB/MaxMem-1.25MB 127393 12097 ns/op 62 B/op 1 allocs/op
- BenchmarkEnqueue/ArenaSize-128KB/MessageSize-16KB/MaxMem-1.25MB 63292 23806 ns/op 92 B/op 1 allocs/op
Conclusion: based on the above tests, mmap satisfies both the write-speed requirement and the low memory consumption needed when running as a sidecar.

Processor

The Processor has 3 core components, which are Consumer, Filter, and Context.

The Consumer is created by the downstream Queue. The consumer has its own read offset and committed offset, which is similar to the offset concept of Spark Streaming.
Due to the particularity of APM data preprocessing, Context is a unique concept in the Satellite filter chain. It supports storing intermediate events, because intermediate-state events sometimes also need to be sent.
The Filter is the core data processing part, similar to the processors of Beats. Because of the Context, the upstream and downstream filters are logically coupled.

Sender

BatchConverter decouples the Processor and Sender by staging the Buffer structure, providing parallelization. But if BatchBuffer is full, the downstream processors would be blocked.
Follower is the real send worker. It has a client, such as a gRPC client or Kafka client, and a fallback strategy. The fallback strategy is an interface, so we can add more strategies to handle abnormal conditions, such as network instability or an OAP cluster upgrade.
When a send succeeds, the Committed Offset in the Queue is increased by the size of this batch.

High Performance

Satellite is used to collect a large amount of APM data. We guarantee high performance in the following ways.

Shorten the transmission path: only 2 components, Queue and Processor, sit between receiving and forwarding.
High-performance Queue: MmapQueue provides a big, fast, and persistent queue based on memory-mapped files and a ring structure.
Processor keeps a linear design that can be processed functionally in one goroutine, avoiding excessive goroutine switching.

Stability

Stability is a core point in Satellite. Stability can be considered in many ways, such as stable resources cost, stable running and crash recovery.

Stable resource cost

In terms of resource cost, Memory and CPU should be a concern.

On the CPU side, we keep a sequential structure to avoid a large number of retries when facing network congestion. Satellite also avoids continuous pulling when the Queue is empty, based on the offset design of the Queue.

On the Memory side, we guarantee that there is only one data cache in Satellite, which is the Queue. For the queue structure, we also keep the size fixed, based on the ring structure, to maintain a stable Memory cost. MmapQueue is also designed to minimize memory consumption and provide persistence while staying as fast as possible. In the future, we may support strategies that dynamically control the size of MmapQueue to handle more extreme conditions.

Stable running

Network congestion has many causes, such as a network problem on the host node, an OAP cluster upgrade in progress, or an unstable Kafka cluster. In these cases, the Follower executes the fallback strategy and blocks the downstream processes. Once the fallback strategy finishes, that is, the send succeeds or the batch is given up, the Follower processes the next batch.
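
A sketch of this blocking fallback in Go might look like the following. All names (FallbackStrategy, retryN, follow, errUnavailable) are hypothetical, and retrying a fixed number of times is just one possible strategy chosen for the example; a real strategy would probably also wait between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable simulates a failed send, e.g. while OAP is being upgraded.
var errUnavailable = errors.New("upstream unavailable")

// FallbackStrategy handles a batch whose first send failed. It returns true
// if the batch was eventually delivered and false if it was given up.
type FallbackStrategy func(batch []string, send func([]string) error) bool

// retryN is one hypothetical strategy: retry up to n more times, then give up.
func retryN(n int) FallbackStrategy {
	return func(batch []string, send func([]string) error) bool {
		for i := 0; i < n; i++ {
			if send(batch) == nil {
				return true
			}
		}
		return false
	}
}

// follow sends batches strictly one after another, so a failing batch blocks
// the next one until its fallback has finished. The committed count grows by
// the batch size on success, like the Committed Offset described earlier.
func follow(batches [][]string, send func([]string) error, fallback FallbackStrategy) (committed, abandoned int) {
	for _, batch := range batches {
		if send(batch) == nil || fallback(batch, send) {
			committed += len(batch)
		} else {
			abandoned += len(batch)
		}
	}
	return committed, abandoned
}

func main() {
	failures := 2 // the first two send attempts fail
	send := func(batch []string) error {
		if failures > 0 {
			failures--
			return errUnavailable
		}
		return nil
	}
	committed, abandoned := follow([][]string{{"a", "b"}, {"c"}}, send, retryN(3))
	fmt.Println(committed, abandoned) // 3 0
}
```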

Crash Recovery

Crash recovery only works when the user selects MmapQueue in the Gatherer, because it relies on the persistent file system design. After a crash, Reading Offset is overridden by Committed Offset, which ensures at-least-once delivery. Writed Offset overrides Writing Offset, which ensures the consumer always works properly and avoids encountering uncrossable defective data blocks.
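
The recovery rule boils down to two assignments. The Go sketch below uses a hypothetical Offsets struct to show them; the field names are assumptions (Written stands for the "Writed Offset" above), and how the offsets are actually persisted in the meta file is not covered.

```go
package main

import "fmt"

// Offsets is a hypothetical view of the four offsets kept for a queue.
type Offsets struct {
	Writing   int64 // position the producer is currently writing to
	Written   int64 // last position known to be completely written
	Reading   int64 // position the consumer has read up to
	Committed int64 // last position acknowledged by the Sender
}

// recoverOffsets rewinds the in-flight offsets to the durable ones.
func recoverOffsets(o Offsets) Offsets {
	o.Reading = o.Committed // unacknowledged data is read again: at-least-once
	o.Writing = o.Written   // a half-written block is discarded
	return o
}

func main() {
	crashed := Offsets{Writing: 120, Written: 100, Reading: 80, Committed: 60}
	fmt.Printf("%+v\n", recoverOffsets(crashed))
}
```

Data between offsets 60 and 80 in this example is delivered a second time after the restart, which is the price of at-least-once delivery.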

Buffer pool

The Queue stores fixed-structure objects, so an object buffer pool would be an efficient way to reuse memory and avoid GC.

ackChan
batch convertor
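
One common way to build such a buffer pool in Go is sync.Pool from the standard library. The sketch below assumes a hypothetical Event type with an 8 KB payload buffer; it is meant to illustrate the reuse pattern only, not Satellite's actual pool.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical fixed-structure object stored in the Queue.
type Event struct {
	Offset  int64
	Payload []byte
}

// eventPool hands out reusable events instead of allocating new ones.
var eventPool = sync.Pool{
	New: func() interface{} {
		return &Event{Payload: make([]byte, 0, 8*1024)}
	},
}

// acquire takes an event from the pool (or creates one if the pool is empty).
func acquire() *Event {
	return eventPool.Get().(*Event)
}

// release resets the event and returns it to the pool for reuse.
func release(e *Event) {
	e.Offset = 0
	e.Payload = e.Payload[:0] // keep the capacity, drop the content
	eventPool.Put(e)
}

func main() {
	e := acquire()
	e.Offset = 42
	e.Payload = append(e.Payload, "span data"...)
	fmt.Println(e.Offset, len(e.Payload))
	release(e)
}
```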

Some metrics

In Satellite, we should also collect its own monitoring metrics. The following metrics are necessary for Satellite.

cpu
memory
go routine number
gatherer_writing_offset
gatherer_watermark_offset
processor_reading_count
sender_committed_offset
sender_abandoned_count
sender_retry_count

Input and Output

We will reuse this diagram to explain the input and output.

Input
- Because both push and pull modes are supported, Queue is a core component.
- Queue is designed as a ring with fixed capacity, which means the oldest data would be overridden by the latest data. If users find data loss, they should raise the ceiling of the memory Queue. MmapQueue generally doesn’t face this problem unless the Sender transport is congested.
Output
- If the BatchBuffer is full, the processor would be blocked.
- If the Channel is full, the downstream components would be blocked, such as BatchConvertor and Processor.
- When SenderWorker fails to send, the batch data goes through a failure strategy that blocks pulling data from the Channel. The strategy is a part of Sender, and its operation mode is synchronous.
- Once the failure strategy is finished, that is, the send succeeds or the batch is given up, the SenderWorker keeps pulling data from the Channel.

Questions

How to avoid continuous pulling when the Queue is empty?

If Watermark Offset is less than or equal to Reading Offset, a signal is sent to the consumer so that it stops pulling.
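
One way to realize such a signal in Go is a condition variable: the consumer parks while the watermark has not passed its reading offset and is woken by the producer. The Queue type below is a hypothetical in-memory sketch of that idea, not the MmapQueue implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue tracks only the two offsets needed for the empty check.
type Queue struct {
	mu        sync.Mutex
	notEmpty  *sync.Cond
	watermark int64 // how far the producer has written
	reading   int64 // how far the consumer has read
}

// NewQueue creates an empty queue with its condition variable wired up.
func NewQueue() *Queue {
	q := &Queue{}
	q.notEmpty = sync.NewCond(&q.mu)
	return q
}

// Push advances the watermark and wakes a parked consumer.
func (q *Queue) Push() {
	q.mu.Lock()
	q.watermark++
	q.mu.Unlock()
	q.notEmpty.Signal()
}

// Pop parks (without spinning) while watermark <= reading,
// then consumes one item and returns the new reading offset.
func (q *Queue) Pop() int64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.watermark <= q.reading {
		q.notEmpty.Wait()
	}
	q.reading++
	return q.reading
}

func main() {
	q := NewQueue()
	done := make(chan int64)
	go func() { done <- q.Pop() }() // waits if the queue is still empty
	q.Push()
	fmt.Println(<-done) // 1
}
```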

Why reuse files in the Queue?

The unified model of the Queue is a ring, which limits the resource cost in memory or on disk to a fixed amount. In MmapQueue, reusing files turns delete operations into overwrite operations, effectively reducing file creation and deletion.

What are the strategies for file creation and deletion in MmapQueue?

While Satellite is running, the number of files in MmapQueue keeps growing until it reaches the maximum capacity. After that, old files are overridden by new data to avoid file deletion. When the Pod dies, all resources are recycled.

Blog: SkyWalking alarm webhook sharing

Wed, 25 Sep 2019 00:00:00 +0000

Author: Wei Qiang
GitHub

Background

The SkyWalking backend provides an alarm function: we can define alarm rules and call a webhook after a rule is triggered. Here I share my implementation.

Demonstration

SkyWalking alarm UI

dingtalk message body

Introduction

install

go get -u github.com/weiqiang333/infra-skywalking-webhook
cd $GOPATH/src/github.com/weiqiang333/infra-skywalking-webhook/
bash build/build.sh
./bin/infra-skywalking-webhook help

Configuration

main configs file: configs/production.yml
dingtalk:
  p3: token...

Example

./bin/infra-skywalking-webhook --config configs/production.yml --address 0.0.0.0:8000

SkyWalking backend alarm settings

webhooks:
  - http://127.0.0.1:8000/dingtalk

Collaboration

I hope we can improve the webhook together.

SkyWalking alarm rules may add more metric names (e.g. a priority name), so that we can send alerts to different channels (dingtalk / SMS / phone) according to their level.

Thanks.

Blog: SkyWalking performance in Service Mesh scenario

Fri, 25 Jan 2019 00:00:00 +0000

Author: Hongtao Gao, Apache SkyWalking & ShardingSphere PMC
GitHub, Twitter, Linkedin

Service mesh receiver was first introduced in Apache SkyWalking 6.0.0-beta. It is designed to provide a common entrance for receiving telemetry data from service mesh frameworks, for instance, Istio, Linkerd, Envoy, etc. What’s a service mesh? According to Istio’s explanation:

The term service mesh is used to describe the network of microservices that make up such applications and the interactions between them.

As a PMC member of Apache SkyWalking, I tested the trace receiver and understand the performance of collectors in the trace scenario well. I would also like to figure out the performance of the service mesh receiver.

Difference between trace and service mesh

The following chart presents a typical trace map:

You can find a variety of elements in it, such as web services, local methods, databases, caches, MQ, and so on. But service mesh only collects service network telemetry data, which contains the entrance and exit data of a service for now (more elements, such as databases, will be imported soon). A smaller quantity of data is sent to the service mesh receiver than to the trace receiver.

But using a sidecar is a little different. A client requesting “A” causes a segment to be sent to the service mesh receiver from “A”’s sidecar. If “A” depends on “B”, another segment will be sent from “A”’s sidecar. But for a trace system, only one segment is received by the collector. The sidecar model splits one segment into small segments, which increases the network overhead of the service mesh receiver.

Deployment Architecture

In this test, I pick two different backend deployments. One is called the mini unit, consisting of one collector and one Elasticsearch instance. The other is a standard production cluster, containing three collectors and three Elasticsearch instances.

The mini unit is a suitable architecture for a dev or test environment. It saves your time and VM resources, and speeds up the deployment process.

The standard cluster provides good performance and HA for a production scenario. Though you will pay more money and need to take care of the cluster carefully, the reliability of the cluster will be a good reward.

I pick 8 CPU and 16GB VMs to set up the test environment. This test targets the performance of normal usage scenarios, so that choice is reasonable. The cluster is built on Google Kubernetes Engine (GKE), and the nodes are linked to each other through a VPC network. Because running the collector is a CPU-intensive task, the resource request of the collector deployment should be 8 CPU, which means every collector instance occupies a VM node.

Testing Process

The number of mesh fragments received per second (MPS) depends on the following variables.

Ingress queries per second (QPS)
The topology of a microservice cluster
Service mesh mode (proxy or sidecar)

In this test, I use the Bookinfo app as a demo cluster.

So every request will touch at most 4 nodes. With the sidecar mode (every request sends two pieces of telemetry data per node), the MPS will be QPS * 4 * 2.

There are also some important metrics that should be explained

Client Query Latency: GraphQL API query response time heatmap.
Client Mesh Sender: mesh segments sent per second. The total line represents the total amount sent and the error line is the total number of failed sends.
Mesh telemetry latency: service mesh receiver handling data heatmap.
Mesh telemetry received: received mesh telemetry data per second.

Mini Unit

You can see that the collector can process up to 25k data per second. The CPU usage is about 4 cores. Most of the query latency is less than 50ms. After logging in to the VM on which the collector instance is running, I found that the system load was reaching the limit (max is 8).

According to the previous formula, a single collector instance could process 3k QPS of Bookinfo traffic.

Standard Cluster

Compared to the mini unit, the cluster’s throughput increases linearly. Three instances provide a total processing power of 80k per second. Query latency increases slightly, but it’s still very small (less than 500ms). I also checked the system load of every collector instance; all of them reached the limit. The cluster could process 10k QPS of Bookinfo telemetry data.
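
The QPS figures follow from inverting the formula MPS = QPS * 4 * 2. A small Go sketch of that arithmetic, with hypothetical function names and the measured values from this test as inputs:

```go
package main

import "fmt"

// mps estimates the mesh fragments per second produced by a given ingress QPS.
func mps(qps, nodesPerRequest, telemetryPerNode int) int {
	return qps * nodesPerRequest * telemetryPerNode
}

// maxQPS inverts the formula: the ingress QPS a backend can absorb,
// given the MPS it was measured to process.
func maxQPS(measuredMPS, nodesPerRequest, telemetryPerNode int) int {
	return measuredMPS / (nodesPerRequest * telemetryPerNode)
}

func main() {
	fmt.Println(maxQPS(25000, 4, 2)) // 3125, roughly 3k QPS for the mini unit
	fmt.Println(maxQPS(80000, 4, 2)) // 10000, i.e. 10k QPS for the standard cluster
	fmt.Println(mps(3125, 4, 2))     // 25000, back to the measured MPS
}
```

For a different topology or mesh mode, only the two multipliers change.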

Conclusion

Let’s wrap up. There are some important things you can take away from this test.

QPS varies with the three variables above. The absolute test results in this blog are not important; users should pick values according to their own systems.
The collector cluster’s processing power can scale out.
The collector is a CPU-intensive application, so you should provide sufficient CPU resources to it.

This blog gives people a common method to evaluate the throughput of the Service Mesh Receiver. Users can use it to design their Apache SkyWalking backend deployment architecture.