From Trace to Insight: Harness Engineering in SmartPerfetto AI Agent


 2026/04/10


SmartPerfetto Cover

This article documents the key technical decisions made while building SmartPerfetto from scratch – why certain approaches were chosen over others, where pitfalls were encountered, and how the design was adjusted afterward.

Why Build This Tool

I have spent years working on Android performance optimization. A significant portion of my daily work involves analyzing Perfetto traces – Perfetto is Google’s open-source system-level tracing tool that captures frame rendering, thread scheduling, CPU frequency, Binder IPC, and other data, making it essentially the standard tool for Android performance analysis. Its trace_processor engine loads traces into an embedded SQLite database, enabling SQL-based queries.

The trace analysis process is highly repetitive: locate the problem interval, inspect frame data, examine thread states, trace the blocking chain, and correlate system metrics. The steps are similar each time, but the details differ for every trace. This pattern of “fixed workflow, varying details” is well-suited for an AI Agent – automate the data collection and initial attribution within the fixed workflow, while humans make the final judgments and confirmations.

SmartPerfetto is the product of this experiment. It adds an AI analysis panel on top of the Perfetto UI, where users ask questions in natural language (e.g., “analyze scrolling performance”), and a Claude Agent uses MCP (Model Context Protocol, Anthropic’s tool-calling protocol) behind the scenes to execute SQL queries via trace_processor, autonomously completing multiple rounds of data collection and analysis.

The purpose of this article is to document the engineering decisions and lessons learned during the construction process. From the initial “just call the API directly” approach to the current system with up to 20 MCP tools (9 always-on + 11 conditionally injected) + 164 YAML Skills + a three-layer verification system, every design choice was driven by concrete counterexamples – approaches that were tried and failed before being replaced. These hard-won lessons can be directly applied by engineers building AI Agent applications or Android performance tools.

Opening: Two Analysis Paths for the Same Trace

A scrolling trace from a 120Hz device, where the user reports occasional stuttering during list scrolling. Opening it in Perfetto reveals 18 dropped frames during the inertial scrolling phase, with 3 Full-severity frames (~60ms, where the per-frame budget on a 120Hz device is 8.33ms).

Jank (Dropped Frames): Perfetto’s frame_timeline track records the rendering duration of each frame. Exceeding the VSync period (8.33ms at 120Hz) results in user-perceptible stuttering. The jank_type field distinguishes jank types: App-side timeouts, SurfaceFlinger composition delays, Buffer Stuffing (BufferQueue backpressure), etc.

Path A: Manual Analysis

1. Open Perfetto UI, drag the timeline to locate the scrolling interval
2. Expand the frame_timeline track, inspect each frame that exceeds the VSync period
3. 18 dropped frames -- expand each one, examine thread_state slices to see what the main thread was doing
4. Frame 1: Sleeping, manually check waker_utid -> system_server (the core Android system process hosting AMS/WMS and other system services) Binder response was slow
   Frame 2: Running, but stuck in Choreographer#doFrame -> RecyclerView onBind is too heavy
   Frame 3: Sleeping + Running alternating -> dequeueBuffer waiting for SurfaceFlinger composition
   ... (15 more frames to check one by one)
5. Correlate with CPU frequency tracks to check for thermal throttling or governor frequency ramp-up delays
6. Check for GC pauses, Lock contention, Binder timeouts
7. Aggregate evidence and organize conclusions

thread_state records a thread’s scheduling state (Running / Runnable / Sleeping / Uninterruptible Sleep, etc.). Different states point to different investigation directions – Runnable typically indicates CPU scheduling issues, while Sleeping typically indicates waiting/blocking issues. The waker_utid field records the source thread ID that woke up the thread, helping trace cross-process blocking chains.
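As a rough illustration of how these states steer an investigation, the sketch below totals the time a thread spent in each state during one frame and maps the dominant state to a first direction. The type and function names are hypothetical, and the Running and Uninterruptible Sleep branches are my own assumptions rather than rules stated above.

```typescript
// Hypothetical types for illustration; not SmartPerfetto's actual API.
type ThreadState = "Running" | "Runnable" | "Sleeping" | "UninterruptibleSleep";

interface StateSlice {
  state: ThreadState;
  durMs: number;
}

// Sum the time spent in each scheduling state within one frame.
function stateDistribution(slices: StateSlice[]): Record<ThreadState, number> {
  const totals: Record<ThreadState, number> = {
    Running: 0,
    Runnable: 0,
    Sleeping: 0,
    UninterruptibleSleep: 0,
  };
  for (const s of slices) {
    totals[s.state] += s.durMs;
  }
  return totals;
}

// Map the dominant state to a first investigation direction.
function suggestDirection(slices: StateSlice[]): string {
  const totals = stateDistribution(slices);
  const states: ThreadState[] = ["Running", "Runnable", "Sleeping", "UninterruptibleSleep"];
  let dominant: ThreadState = "Running";
  for (const st of states) {
    if (totals[st] > totals[dominant]) dominant = st;
  }
  switch (dominant) {
    case "Runnable":
      return "CPU scheduling: check core placement and run-queue delay";
    case "Sleeping":
      return "waiting/blocking: follow waker_utid to the thread that woke this one";
    case "UninterruptibleSleep":
      // Assumption: this state usually points at IO or kernel-level waits.
      return "kernel wait: check IO and kernel lock activity";
    default:
      // Assumption: mostly Running means the frame's own work is too heavy.
      return "on-CPU work: inspect the slices executing in this frame";
  }
}
```

A frame that is mostly Sleeping lands in the waker_utid branch, which matches Frame 1 in the manual walkthrough above.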

Steps 3-4 represent the bulk of the work – 18 dropped frames, each requiring expanding thread_state, tracing the blocking cause, and correlating CPU scheduling. The analysis process is serial, frame by frame: the drill-down path may differ for each frame (Binder? Lock? GC? IO?), and all must be examined before drawing conclusions.

Path B: SmartPerfetto Agent

The user inputs "analyze scrolling performance", and the following shows the Agent’s actual operations (from session log session_agent-1774679540422):

classifyScene("analyze scrolling performance") -> scrolling  (<1ms, keyword matching)
buildSystemPrompt() -> 4500 tokens (scrolling.strategy.md injected)

submit_plan({
  phases: ["p0: architecture + time range", "p1: scrolling overview + jank list",
           "p1_9: root cause deep drill", "p3: comprehensive conclusion"],
  successCriteria: "WHY chain >= 2 levels, actionable optimization suggestions"
})

invoke_skill("scrolling_analysis", {process_name: "com.example..."})
  -> All 18 dropped frames extracted, each with:
    - Duration, jank_type, reason_code (App/SF/GPU)
    - Main thread state distribution (Running/Sleeping/IO)
    - Top-3 blocking functions
  -> ArtifactStore: art_1 (~440 tokens reference)

fetch_artifact("art_1", "rows", 0, 18)  -> Retrieve structured data for all 18 frames
invoke_skill("frame_blocking_calls", {frame_ids: [3, 7, 12]})
  -> Blocking chains for 3 representative frames: dequeueBuffer / Binder / GC
lookup_knowledge("rendering-pipeline")  -> Load rendering pipeline reference material

Output conclusion:
  [HIGH] 18 jank frames during inertial scrolling, 3 Full-severity ~60ms drops
  [MEDIUM] 14 Buffer Stuffing frames (App finished rendering on time, but BufferQueue was full causing presentation delay)
  [MEDIUM] Vulkan Shader first-frame compilation + CPU cold frequency (18.66ms, 2.2x over budget)

Metrics snapshot (from logs/metrics/): 16 tool calls, 0 failures, SQL average 652ms.

The following diagram shows the request lifecycle of a complete analysis – every step from user input to final conclusion:

Request Lifecycle

Both paths follow the same analysis steps – query frame data -> locate jank -> trace blocking chains -> correlate system state -> synthesize conclusions.

The difference is: manual analysis processes frames serially, with each frame requiring manual expansion and tracing; the Agent uses the scrolling_analysis Skill to batch-retrieve structured data for all 18 frames via a single SQL query, then selects representative frames for deep-drill blocking chain analysis.

The Agent’s analysis results are also rendered directly in the Perfetto UI:

Auto-Pin: Key frames and slices mentioned by the Agent are automatically marked on the timeline
Click-to-Navigate: Timestamps and frame IDs in conclusions support click-to-navigate to the corresponding position in Perfetto
Data Tables: Complete performance data for all 18 frames is rendered as sortable, filterable structured tables

Screenshot: SmartPerfetto frontend exists as a Perfetto plugin

SmartPerfetto UI Overview

Screenshot: During scrolling analysis, each dropped frame is analyzed in detail – click the arrow on the left to expand

Scrolling Analysis Jank Details

Screenshot: Scrolling analysis conclusion

Screenshot: Scrolling analysis conclusion, representative frame analysis

Representative Frame Analysis 1

Screenshot: Scrolling analysis conclusion, representative frame analysis

Representative Frame Analysis 2

Screenshot: Each analysis round has its own analysis report, with content consistent with (and more detailed than) the frontend display

Analysis Report

Analysis conclusions, data tables, and the Perfetto timeline are all on the same interface. After the Agent completes batch data collection and initial attribution, engineers confirm key findings directly in the Perfetto UI.

The current Agent still requires human judgment in complex edge cases (misdiagnosis issues are discussed in detail later). This article documents the engineering decision-making process behind building this Agent.

Part 1: Why Can’t an LLM Directly Analyze a Trace?

Before discussing the architecture, a fundamental question must be answered first: why can’t we simply send trace data to an LLM and let it analyze? The answer to this question determines the starting point of SmartPerfetto’s entire architecture.

Data Scale: Traces Don’t Fit in Context

The data scale of a real Perfetto trace looks like this:

Dimension	Typical Value
Trace file size	50MB - 500MB
Event count	Millions to tens of millions
Serialized as text	Several GB
Claude’s maximum context	~200K tokens (roughly 150K English words, under 1MB of text)

The gap is several orders of magnitude. Even a relatively small 50MB trace, with its slices (function call records), counters (CPU frequency sample points), thread_state (thread scheduling states) and other data, far exceeds the LLM’s context capacity when serialized.

This means the LLM cannot “see” the trace data directly. It must query on demand through tools – first use SQL to extract the needed subset (e.g., the state distribution of a specific thread within a certain time range), then analyze the query results. This constraint is why SmartPerfetto is a tool-driven Agent rather than a prompt stuffed with trace data.

Precise Computation: LLMs Are Poor at Numerical Processing

The daily work of performance analysis revolves around precise numerical values: P50/P90/P99 percentiles of frame durations, VSync period detection (requiring median calculation of VSYNC-sf intervals and snapping to standard refresh rates), CPU utilization percentages, and time proportions of various thread states.

LLMs frequently make errors with these numerical computations. A real example: during early testing, Claude judged a 16.7ms frame duration as “normal, not exceeding VSync period” – it was using the 60Hz (16.67ms) frame budget. But this trace was captured from a 120Hz device, where the per-frame budget is 8.33ms, so 16.7ms is actually twice the budget. Such errors may seem small, but in performance analysis they lead to completely opposite conclusions.
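The VSync detection described above (median interval, then snapping to a standard refresh rate) can be sketched in a few lines. This is a minimal illustration rather than the logic in SmartPerfetto's Skills: the list of standard rates and all function names are assumptions.

```typescript
// Assumed set of standard refresh rates to snap to.
const STANDARD_RATES_HZ: number[] = [60, 90, 120, 144];

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty array");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Estimate the refresh rate from VSync timestamps (nanoseconds) and snap it
// to the nearest standard rate, so jitter does not produce values like 119.7Hz.
function detectRefreshRateHz(vsyncTimestampsNs: number[]): number {
  const intervalsMs: number[] = [];
  for (let i = 1; i < vsyncTimestampsNs.length; i++) {
    intervalsMs.push((vsyncTimestampsNs[i] - vsyncTimestampsNs[i - 1]) / 1e6);
  }
  const measuredHz = 1000 / median(intervalsMs);
  return STANDARD_RATES_HZ.reduce((best, hz) =>
    Math.abs(hz - measuredHz) < Math.abs(best - measuredHz) ? hz : best
  );
}

// Frame duration expressed as a multiple of the per-frame budget.
function budgetRatio(frameDurMs: number, refreshHz: number): number {
  return frameDurMs / (1000 / refreshHz);
}
```

With the refresh rate detected from the trace itself, a 16.7ms frame is evaluated against the 8.33ms budget of a 120Hz device instead of an assumed 60Hz budget.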

Numerical computation must be handled by tools – SQL’s AVG(), PERCENTILE() and the predefined statistical logic in YAML Skills ensure consistent and precise results every time.

Domain Knowledge: LLMs Know It but Can’t Apply It

The complexity of Android’s rendering pipeline exceeds many developers’ expectations. The three most common rendering paths are: the standard HWUI pipeline (HWUI is Android’s default hardware-accelerated rendering engine, where View drawing instructions are generated on the main thread, submitted to the GPU by RenderThread, and finally composited to the screen by SurfaceFlinger), Flutter’s dual-thread model (1.ui -> 1.raster, bypassing RenderThread), and WebView’s Chromium pipeline (CrRendererMain thread handles rendering). Beyond these, there are also Jetpack Compose, game engines, camera pipelines, and more. SmartPerfetto’s architecture detection system currently identifies 24+ rendering pipelines, and jank analysis for different pipelines requires examining different threads and metrics – which is why architecture detection is the first step of analysis.

The root cause of stuttering may span threads (main thread blocked -> cause is in RenderThread), span processes (App waiting -> system_server’s WindowManagerService responds slowly), or even span hardware layers (CPU scheduled to a little core -> insufficient compute power -> frame timeout).

The LLM’s training data includes these concepts – it “knows” what RenderThread is, what Binder is, what SurfaceFlinger is. But when facing a specific trace, it lacks the ability to apply this knowledge in a phased, scene-specific manner. For example, when analyzing scrolling jank, one needs to first check frame-level data (which frames dropped, what type of jank), then choose different deep-drill paths based on the predominant root cause type (App-side blocking goes to blocking_chain_analysis, compositor-side delays go to SurfaceFlinger analysis). This kind of step-by-step, conditional-branching analysis workflow requires strategy injection for guidance.
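To make the conditional branch concrete, here is a small sketch that picks a drill-down Skill from the predominant jank cause. In SmartPerfetto this choice is made by Claude under strategy guidance, not by a hard-coded function. Only blocking_chain_analysis is a name taken from the text; the other two Skill names and the types are hypothetical.

```typescript
type ReasonCode = "App" | "SF" | "GPU";

interface JankFrame {
  frameId: number;
  reasonCode: ReasonCode;
}

// Pick the deep-drill path that matches the most common cause among jank frames.
function pickDrillDownSkill(frames: JankFrame[]): string | null {
  if (frames.length === 0) return null; // nothing to drill into

  const counts: Record<ReasonCode, number> = { App: 0, SF: 0, GPU: 0 };
  for (const f of frames) {
    counts[f.reasonCode] += 1;
  }

  const order: ReasonCode[] = ["App", "SF", "GPU"];
  let dominant: ReasonCode = "App";
  for (const code of order) {
    if (counts[code] > counts[dominant]) dominant = code;
  }

  const skillFor: Record<ReasonCode, string> = {
    App: "blocking_chain_analysis",
    SF: "surfaceflinger_analysis", // hypothetical Skill name
    GPU: "gpu_analysis", // hypothetical Skill name
  };
  return skillFor[dominant];
}
```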

Reliability: Error Rates Are High in Practice

Even with the data access problem solved, having an LLM directly produce performance analysis conclusions still faces reliability issues. In SmartPerfetto’s actual operation, I observed several typical output problems:

Hallucinations: Generating data or metrics that don’t exist in the trace
Omissions: Missing critical checks (e.g., not checking the impact of JIT compilation and class loading when analyzing startup performance)
Shallow attribution: Stopping at “main thread busy” without further tracing whether it’s busy with futex (lock contention), binder_wait (cross-process waiting), or GC pause
Inconsistent conclusions: Analyzing the same trace twice and getting different severity assessments

Parts 2 and 3 later in the article discuss this issue in detail – a quality audit 18 days after agentv3 went live showed that approximately 30% of Agent conclusions contained varying degrees of misjudgment.

SmartPerfetto’s Division of Labor

Based on these four problems, SmartPerfetto’s architecture divides responsibilities as follows:

LLM (Claude) is responsible for:     Tool system is responsible for:
|- Understanding user intent          |- Precise SQL queries (trace_processor)
|- Formulating analysis plans         |- Numerical computation & statistics (built into Skills)
|- Reasoning about causality          |- Rendering architecture detection (24+ pipelines)
|- Cross-domain correlation analysis  |- Layered data extraction (L1-L4)
|- Generating structured conclusions  |- Perfetto stdlib queries
|- Natural language interaction       |- Data summarization & compression (Artifact Store)

Connectivity layer: MCP Protocol -- up to 20 tools (9 always-on + 11 conditional)
Strategy layer: 12 scene-specific strategies (.strategy.md)
Quality layer: 3-layer verification + SQL error-correction learning

The LLM handles reasoning and expression; the tools handle querying and computation. MCP connects the two – through standard MCP interfaces, Claude calls trace_processor to execute SQL, invokes YAML Skills for structured analysis, and queries Perfetto stdlib modules. Analysis results are streamed to the Perfetto UI frontend in real-time via SSE (Server-Sent Events).

The engineering infrastructure supporting this division of labor includes: scene routing (injecting different analysis strategies based on the user’s question), data compression (controlling the volume of data returned to the LLM), and quality verification (intercepting the LLM’s domain misjudgments). The following sections discuss each of these in detail.

The diagram below shows the complete system architecture, illustrating the 4 stages from user request to analysis conclusion:

System Architecture Overview

Part 2: From Workflow to Agent

The Difference Between Workflow and Agent

In their December 2024 publication Building Effective Agents (authors Erik Schluntz, Barry Zhang), Anthropic classified AI systems into two categories:

Workflow: LLM and tools are orchestrated through predefined code paths. What each step does and where the next step goes are all defined by the developer in advance.
Agent: The LLM dynamically drives its own process and tool usage, autonomously deciding how to complete the task.

The practical significance of this distinction lies in the trade-off between flexibility and controllability. Workflows provide predictability, suitable for tasks with fixed steps; Agents provide flexibility, suitable for open-ended problems that require adjusting direction based on intermediate data. Andrew Ng’s description is apt: rather than making a binary judgment about whether a system is an Agent, think of it as varying degrees of Agent-ness. SmartPerfetto’s agentv2 and agentv3 correspond to opposite ends of this spectrum.

Why Performance Analysis Needs an Agent, Not a Pipeline

Performance analysis is not a fixed “input-in, output-out” process – it’s an exploratory reasoning process. Take a real scrolling analysis as an example:

1. Start with the overview -> find 47 jank frames, P90 = 23.5ms
2. Decide direction based on overview -> 40% stuck in APP phase, prioritize APP side
3. Deep-drill representative frames -> Frame #234's RenderThread blocked by Binder for 23ms
4. Form hypothesis -> "Possibly system_server's Binder response is slow"
5. Verify hypothesis -> Check Binder counterpart's thread_state, find system_server CPU scheduling delay
6. If hypothesis doesn't hold -> fall back, change direction (e.g., check GPU or GC instead)
7. Synthesize all findings into a conclusion

Each step’s decision depends on the previous step’s results – all steps cannot be determined before the analysis begins. A pipeline cannot handle requirements like “this trace’s problem might be in GPU or GC, and the drill-down direction needs to be dynamically chosen based on intermediate data.”

SmartPerfetto’s design is a hybrid of determinism and flexibility: known scenes (scrolling, startup, ANR, etc.) use Strategy files to constrain mandatory checks, ensuring nothing is missed; but the specific queries and drill-down directions within each phase are autonomously decided by Claude. Unmatched scenes are entirely handed over to Claude for autonomous exploration.

agentv2: A Typical Workflow

agentv2 used DeepSeek as the backend with a Governance Pipeline architecture: a three-phase planner / executor / synthesizer pipeline, essentially a predefined multi-step workflow (historical commit 6d80aefb: “Replace the 13-step agentv2 governance pipeline with Claude-as-orchestrator”).

This architecture worked well for scrolling analysis of standard Android applications, but encountered problems with non-standard scenarios. For example, Flutter application traces don’t have standard frame_timeline data, and the pipeline would get empty results but continue executing subsequent steps, ultimately outputting conclusions based on empty data.

agentv3: Migration to Agent Architecture

On March 2, 2026 (commit 6d80aefb), I switched to the Claude Agent SDK. After receiving tool definitions and strategies, Claude autonomously decides which tools to call, in what order, and what data to query.

An AI Agent typically possesses the following characteristics, with agentv3’s implementation mapped as follows:

Characteristic	Implementation in SmartPerfetto	Code Location
Autonomy	Agent autonomously decides which tools to call and in what order	`claudeRuntime.ts`
Reasoning	REASONING_NUDGE appended after each tool call to trigger explicit reflection	`claudeMcpServer.ts:84`
Tool Use	Up to 20 MCP tool calls to trace_processor	9 always-on + 11 conditional
Planning	submit_plan + requirePlan() gating	Disabled in lightweight mode
Reflection	3-layer Verifier + Correction Prompt (max 2 rounds)	`claudeVerifier.ts`
Error Recovery	SQL error-correction learning + cross-session misdiagnosis pattern learning	Cross-file
Memory	Short-term: Analysis Notes / Artifact Store; Long-term: Pattern Memory / SQL Fix Pairs	7 layers of memory

agentv2 (Workflow): Fixed pipeline -> each step predefined -> unexpected data = wrong conclusion
agentv3 (Agent): Dynamic plan -> autonomous tool calls -> unexpected data = adjust plan

9 Rounds of Review After Migration

From March 2 to March 20, 9 rounds of architecture review were conducted. The most impactful rounds:

Round	Date	Key Findings
Round 1	3/2	12 fixes after initial SDK integration – SQL knowledge base not wired into System Prompt, CPU core count hardcoded to 4 in jank_frame_detail
Round 3	3/12	Architecture wiring audit – 12 instances of “implemented but not connected,” e.g., verification pipeline skipped when 0 findings
Round 7	3/15	Perfetto Stdlib integration – preloaded modules 4->22, Schema Index 708->761
Round 9	3/20	Production quality audit after 18 days of real traces – 3 P0 + 4 P1 + 5 P2, which catalyzed the three-layer verification system

Cold Start 4-Layer Cascading Bug

On March 19, 2026 (commit d5a1d7b3), cold starts were found to be incorrectly classified as warm starts. Tracing revealed this was a cascading issue across 4 layers:

Layer A (Perfetto Stdlib): bindApplication's ts is ~98ms earlier than launchingActivity -> excluded by filter
Layer B (Skill logic):      startup_events_in_range's time filter incompatible with Layer A
Layer C (10 downstream Skills):  Redundant startup_type filter conditions -> returned 0 rows after reclassification
Layer D (Quality gate):         startup_analysis's filter rules and reclassification logic out of sync

Fix scope: rewrote 10 downstream Skills, added 4 new startup analysis Skills. This issue demonstrates that in a Skill dependency chain, a semantic error in a single upstream field amplifies through each layer.

Ghost MCP Query – Async Lifecycle Mismatch

Another cross-layer bug was caught on April 7, 2026 (commit a0ad63ba): after an analysis timed out, the session had already been cleaned up and the SSE stream closed, but 90 seconds later trace_processor’s stderr was still emitting errors like no such table: cpu_frequency_counters / no such column: ts – orphan logs that could not be attributed to any session. The two error-correction pairs from Error 4 mentioned earlier were traced back from this pile of stderr output.

The root cause was in the SDK Query’s async lifecycle:

Layer A (Claude Agent SDK):    SDK's internal AsyncIterator still producing messages
Layer B (claudeRuntime.ts):    Timeout triggered, break out of for-await loop
Layer C (MCP tool queue):       SDK subprocess received no close signal,
                                continued dispatching remaining invoke_skill / execute_sql from the queue
Layer D (trace_processor):     Executed these "ghost queries," errors returned as normal --
                                but the session was already cleaned up, nobody listening

AsyncGenerator.return() and break only take effect on the consumer side; they do not notify the producer-side external resources (SDK subprocess + MCP tool execution queue). The fix was to change the return type of sdkQueryWithRetry from a single AsyncIterable to a { stream, close } pair, with close() called explicitly on every exit path (timeout, exception, and finally) so the SDK actively aborts the subprocess:

// claudeRuntime.ts
const { stream, close } = sdkQueryWithRetry({ ... });
const timer = setTimeout(() => { close(); /* abort SDK subprocess */ }, AGENT_TIMEOUT);
try {
  for await (const message of stream) { ... }
} finally {
  clearTimeout(timer);
  close();   // Safety net: close even on normal exit
}

This bug was particularly insidious because the surface symptom (orphan SQL error logs) and the true root cause (async resource lifecycle mismatch) were far apart – the error messages looked like “Agent wrote bad SQL,” but the reality was “the Agent had long since stopped; it was the SDK subprocess that hadn’t.” In Agent applications, “breaking out of a loop” is often not as clean as one might imagine; any long-lived async resource needs an explicit close channel rather than relying on for-await’s automatic cleanup.

Part 3: Three Key Engineering Decisions

Decision 1: Scene Classification – From Full Injection to On-Demand Loading

Initially, I stuffed the analysis strategies for all 12 scenes (scrolling / startup / ANR / interaction / pipeline / game / memory, etc.) into the System Prompt, totaling 15000+ tokens. The logic was: Claude should know all scene analysis methods so it can handle whatever the user asks.

In practice, Claude confused terminology across scenes – when analyzing scrolling, it referenced startup-phase metrics, mixing up VSync intervals (inter-frame timing) with bindApplication (process initialization). The root cause was heavily overlapping terminology: “frame” in the scrolling context means a rendering frame, while in the startup context it means the first frame display. With all 12 strategy sets present simultaneously, the LLM cannot tell which context applies.

The solution was to implement scene classification, injecting only one strategy set at a time:

// sceneClassifier.ts -- 12 scenes, <1ms execution
export function classifyScene(query: string): SceneType {
  const lower = query.toLowerCase(); // Lowercased once for keyword matching
  const scenes = getRegisteredScenes(); // Loaded from .strategy.md frontmatter
  const sorted = scenes
    .filter(s => s.scene !== 'general')
    .sort((a, b) => a.priority - b.priority); // ANR(1) -> startup(2) -> scrolling(3)

  for (const scene of sorted) {
    // Compound regex patterns are checked before plain keywords
    if (scene.compound_patterns.some(p => p.test(query))) return scene.scene;
    if (scene.keywords.some(k => lower.includes(k))) return scene.scene;
  }
  return 'general'; // No scene matched
}

Keywords and priorities are declared in each .strategy.md’s YAML frontmatter, not hardcoded in the source:

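# scrolling.strategy.md
---
scene: scrolling
priority: 3
# Chinese keywords: 滑动 = scrolling, 掉帧 = dropped frames, 帧率 = frame rate, 卡顿 = stutter
keywords: [滑动, 掉帧, jank, scroll, fps, 帧率, 卡顿]
compound_patterns:
  # (analyze | take a look at | check) ... (scrolling | scroll | list)
  - "(?:分析|看看|检查).*(?:滑动|滚动|列表)"
---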

Adding a new scene only requires creating a new .strategy.md file. DEV mode supports hot reloading – changes take effect upon browser refresh.
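The step from frontmatter to a classifiable scene can be sketched as follows. This assumes the YAML has already been parsed into a plain object; the interfaces, registerScene, and the startup scene's keywords are hypothetical, while the scrolling entry mirrors the frontmatter above.

```typescript
// Shape of the parsed frontmatter (pattern strings, not yet compiled).
interface SceneFrontmatter {
  scene: string;
  priority: number;
  keywords: string[];
  compound_patterns: string[];
}

interface RegisteredScene {
  scene: string;
  priority: number;
  keywords: string[];
  compound_patterns: RegExp[];
}

// Compile pattern strings once at load time so classification stays cheap.
function registerScene(fm: SceneFrontmatter): RegisteredScene {
  return {
    scene: fm.scene,
    priority: fm.priority,
    keywords: fm.keywords.map(k => k.toLowerCase()),
    compound_patterns: fm.compound_patterns.map(p => new RegExp(p, "i")),
  };
}

// Same matching order as classifyScene: lowest priority number first,
// compound patterns before keywords, "general" as the fallback.
function classify(query: string, scenes: RegisteredScene[]): string {
  const lower = query.toLowerCase();
  const sorted = scenes.slice().sort((a, b) => a.priority - b.priority);
  for (const s of sorted) {
    if (s.compound_patterns.some(p => p.test(query))) return s.scene;
    if (s.keywords.some(k => lower.includes(k))) return s.scene;
  }
  return "general";
}

const exampleScenes: RegisteredScene[] = [
  registerScene({
    scene: "scrolling",
    priority: 3,
    keywords: ["滑动", "掉帧", "jank", "scroll", "fps", "帧率", "卡顿"],
    compound_patterns: ["(?:分析|看看|检查).*(?:滑动|滚动|列表)"],
  }),
  // Hypothetical second scene, to show priority ordering.
  registerScene({
    scene: "startup",
    priority: 2,
    keywords: ["startup", "launch", "启动"],
    compound_patterns: [],
  }),
];
```

Because the patterns are compiled without the global flag, calling test() repeatedly on the same RegExp stays stateless.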

After the adjustment, the System Prompt dropped from ~15000 tokens to ~4500 tokens, and the strategy confusion issue did not recur. Adding new scenes went from modifying code to simply creating a new .md file.

When multi-turn conversations accumulate substantial context (analysis notes, historical plans, pattern memory, etc.), the System Prompt may again exceed the 4500-token budget. At that point, lower-value paragraphs are progressively dropped by priority: SQL knowledge base references (Claude can query on demand via the lookup_sql_schema tool) -> historical analysis experience -> historical lessons learned -> SQL error-correction pairs -> sub-agent collaboration guidelines -> historical analysis plans. Core paragraphs (role, methodology, scene strategy, output format) are never dropped.
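The priority-based dropping can be sketched like this. It is an illustration under stated assumptions, not the real prompt builder: the chars/4 token estimate, the PromptSection shape, and fitToBudget are all hypothetical.

```typescript
interface PromptSection {
  name: string;
  text: string;
  // Position in the drop sequence (lower = dropped earlier); null marks a core section.
  dropOrder: number | null;
}

// Crude token estimate (assumption: about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Drop non-core sections in dropOrder until the prompt fits the budget.
// Core sections are never dropped, so the result can still exceed the budget.
function fitToBudget(sections: PromptSection[], budgetTokens: number): PromptSection[] {
  const droppable = sections
    .filter(s => s.dropOrder !== null)
    .sort((a, b) => (a.dropOrder as number) - (b.dropOrder as number));

  const totalTokens = (list: PromptSection[]): number =>
    list.reduce((sum, s) => sum + estimateTokens(s.text), 0);

  let kept = sections.slice();
  for (const victim of droppable) {
    if (totalTokens(kept) <= budgetTokens) break;
    kept = kept.filter(s => s !== victim);
  }
  return kept;
}
```

Keeping the original section order in the result matters: only membership changes, so the prompt layout Claude sees stays stable across turns.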

Decision 2: Artifact Store – Controlling Data Volume Returned to the LLM

Decision 1 solved the System Prompt bloat problem. But even with only one strategy set injected, each Skill call during Agent execution still produces large amounts of data (200+ rows of frame data), and putting all of it into context creates new problems.

Early versions returned complete Skill execution results (e.g., 200 rows of frame data, 487 rows of blocking analysis) to Claude. Each Skill result was approximately 3000 tokens, and with 5-8 Skill calls per analysis, Skill data alone consumed 15000-24000 tokens.

Token cost was one concern, but a more surprising discovery was: the more data Claude received, the worse its output quality became. When faced with 200 rows of frame data, it tended toward line-by-line description (“Frame 1 took 12.3ms, Frame 2 took 15.7ms…”) rather than pattern-level summarization. I suspect the reason is that when the context is flooded with large amounts of numbers, the LLM’s attention becomes scattered.

The solution was to store Skill results in the ArtifactStore, returning only compact references (~440 tokens) to Claude – row count, column names, and summary information. When details are needed, Claude retrieves them on demand via fetch_artifact with pagination. Complete data is sent to the frontend for rendering through an independent SSE (Server-Sent Events) channel, bypassing the LLM entirely.

invoke_skill("scrolling_analysis") execution result:
  |-- Frontend: Full DataEnvelope (200 rows) -> SSE -> UI table rendering
  |         (DataEnvelope: self-describing data contract, containing column names, types, interaction actions.
  |          The frontend automatically renders tables/charts based on the schema,
  |          no need to write code for each Skill)
  |-- Claude: Compact reference (~440 tokens)
              "scrolling_analysis complete. Summary: 347 frames, jank rate 10.6%
               art_1 (details: fetch_artifact('art_1', 'rows', 0, 20))"

Three granularity levels for fetch_artifact:

Level	Returns	Approx. Tokens
`summary`	Row count + column names + first row sample	~50
`rows`	Paginated data (offset/limit)	~200-500
`full`	Complete raw data	~3000
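A minimal in-memory sketch of this idea is shown below. It is not the actual ArtifactStore implementation: the class shape, the ArtifactView return type, and the default page size are assumptions; only the art_N naming and the three levels come from the description above.

```typescript
type Row = Record<string, string | number>;
type Level = "summary" | "rows" | "full";

interface ArtifactView {
  rowCount: number;
  columns: string[];
  rows: Row[];
}

class ArtifactStore {
  private readonly artifacts = new Map<string, Row[]>();
  private nextId = 1;

  // Store a Skill result and hand back a short reference such as "art_1".
  put(rows: Row[]): string {
    const id = "art_" + this.nextId;
    this.nextId += 1;
    this.artifacts.set(id, rows);
    return id;
  }

  // Return only as much data as the requested level allows.
  fetch(id: string, level: Level, offset: number = 0, limit: number = 20): ArtifactView {
    const all = this.artifacts.get(id);
    if (all === undefined) throw new Error("unknown artifact: " + id);
    const columns = all.length > 0 ? Object.keys(all[0]) : [];
    let rows: Row[];
    if (level === "summary") {
      rows = all.slice(0, 1); // first row as a sample
    } else if (level === "rows") {
      rows = all.slice(offset, offset + limit); // one page
    } else {
      rows = all; // complete raw data
    }
    return { rowCount: all.length, columns, rows };
  }
}
```

The row count and column names come back at every level, so Claude can decide whether a further paginated fetch is worth the tokens.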

After the adjustment, each Skill’s token cost dropped from ~3000 to ~440, and 8 Skills went from ~24000 to ~3520 tokens. Claude’s output shifted from line-by-line description to pattern-level summarization, while the frontend still received complete data for table rendering.

Decision 3: Three-Layer Verification – Lessons from Real Misjudgments

18 days after agentv3 went live, I conducted a systematic quality audit (March 20, 2026, commit da63eaf9). The statistics were surprising: approximately 30% of Agent conclusions contained varying degrees of misjudgment.

The following are actual misjudgment cases encountered:

[Case 1] Agent marked VSync alignment offset as CRITICAL
Reality: Modern high-refresh-rate devices (90Hz/120Hz/144Hz) inherently have
non-perfectly-fixed VSync intervals, with normal micro-offsets (on the order of +/-0.5ms).
The Agent treated this normal offset as an anomaly.

[Case 2] Agent counted Buffer Stuffing frames in the jank statistics
Reality: Buffer Stuffing indicates the App finished rendering on time, but the
BufferQueue was full, causing producer-side backpressure. This is not an App logic
problem and should not be directly counted as App-side jank.
SmartPerfetto handles this through dual-signal detection: excluded by default, but
still counted as perceived jank if the actual presentation interval > 1.5x VSync.

[Case 3] Agent marked a single-frame duration anomaly as CRITICAL
Reality: An isolated single-frame anomaly does not constitute a pattern and needs
confirmation of whether it recurs.

[Case 4] Agent marked main thread Sleeping at 35% (469ms) as MEDIUM
Reality: In the context of total startup duration, 469ms of main thread sleeping is
already a high proportion and should be marked as HIGH.
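The dual-signal rule from Case 2 can be written down directly. This is a sketch, not the Skill's actual logic: the FrameSample shape is hypothetical, and it assumes jank_type carries a single value such as "None" or "Buffer Stuffing".

```typescript
interface FrameSample {
  jankType: string; // assumed to hold one jank_type value per frame
  presentIntervalMs: number; // time since the previous frame was presented
}

// Buffer Stuffing is excluded by default, but still counts as perceived jank
// when the actual presentation interval exceeds 1.5x the VSync period.
function isPerceivedJank(frame: FrameSample, vsyncPeriodMs: number): boolean {
  if (frame.jankType === "None") return false;
  if (frame.jankType === "Buffer Stuffing") {
    return frame.presentIntervalMs > 1.5 * vsyncPeriodMs;
  }
  return true; // every other jank type counts as-is
}
```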

These misjudgments share a common characteristic: they are not logic errors, but gaps in domain expertise. VSync micro-offsets on high-refresh-rate devices being normal, Buffer Stuffing delays occurring at the pipeline queue level rather than in App logic, single-frame anomalies not constituting a pattern – these judgments depend on deep understanding of Android’s graphics stack, which Claude’s training data covers insufficiently.

Recognizing this, I established a three-layer progressive verification system:

Layer 1: Heuristic checks (no LLM calls)
  -- Regex matching of known misdiagnosis patterns (VSync offset marked CRITICAL,
     Buffer Stuffing counted as jank, single frame marked CRITICAL)

Layer 2: Plan compliance checks (no LLM calls)
  -- Cross-reference against submit_plan steps, checking whether the conclusion
     covers all planned phases

Layer 3: Independent model review (using Haiku)
  -- Use a different model to check whether each finding has data evidence support
     and whether the causal chain is complete

When verification finds severe issues, a Correction Prompt is generated for Claude to revise its conclusion (up to 2 rounds).
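Layers 1 and 2 need no LLM call, so they are easy to sketch. The code below is illustrative only: the regexes, the bracketed severity format (borrowed from the conclusion output shown earlier), and the idea of passing in a list of covered phase IDs are assumptions. Layer 3 is omitted because it requires a model call.

```typescript
interface VerificationIssue {
  layer: 1 | 2;
  message: string;
}

interface MisdiagnosisRule {
  pattern: RegExp;
  message: string;
}

// Layer 1: known misdiagnosis patterns, matched per conclusion line.
const KNOWN_MISDIAGNOSES: MisdiagnosisRule[] = [
  {
    pattern: /\[CRITICAL\][^\n]*VSync[^\n]*offset/i,
    message: "VSync micro-offsets are normal on high-refresh-rate devices",
  },
  {
    pattern: /\[CRITICAL\][^\n]*single[- ]frame/i,
    message: "An isolated single-frame anomaly does not constitute a pattern",
  },
];

function checkHeuristics(conclusion: string): VerificationIssue[] {
  return KNOWN_MISDIAGNOSES
    .filter(rule => rule.pattern.test(conclusion))
    .map((rule): VerificationIssue => ({ layer: 1, message: rule.message }));
}

// Layer 2: every planned phase must be covered by the conclusion.
function checkPlanCompliance(planned: string[], covered: string[]): VerificationIssue[] {
  return planned
    .filter(phase => covered.indexOf(phase) === -1)
    .map((phase): VerificationIssue => ({
      layer: 2,
      message: "planned phase not covered: " + phase,
    }));
}

function verify(conclusion: string, planned: string[], covered: string[]): VerificationIssue[] {
  return checkHeuristics(conclusion).concat(checkPlanCompliance(planned, covered));
}
```

Running the cheap layers first means the independent model review is only needed for what regexes and plan bookkeeping cannot judge.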

Cross-session learning: Confirmed misdiagnosis patterns are persisted to logs/learned_misdiagnosis_patterns.json and automatically injected into the System Prompt for subsequent analyses. For example, the system learned:

{
  "keywords": ["R008", "TTID", "exceeded", "LOW"],
  "message": "TTID exceedance marked as LOW, but TTID(1912ms) exceeded dur_ms(1338ms) by 43%, should be marked as MEDIUM or higher",
  "occurrences": 1
}

Note: Learned misdiagnosis patterns do not take effect immediately. The code requires occurrences >= 2 before entering the active pattern set – the first recording is just a marker, and the same pattern is only injected into subsequent analyses’ System Prompt when it appears a second time, preventing isolated events from causing over-correction.
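
The activation rule can be sketched as follows. The record shape mirrors the JSON example above; the helper names and the choice to treat two records as "the same pattern" when their keyword sets are equal are assumptions made for illustration.

```typescript
interface LearnedPattern {
  keywords: string[];
  message: string;
  occurrences: number;
}

const ACTIVATION_THRESHOLD = 2; // a pattern must be seen twice before it is injected

// Record one confirmed misdiagnosis: bump the counter if the same keyword set already exists.
function recordMisdiagnosis(patterns: LearnedPattern[], keywords: string[], message: string): void {
  const key = keywords.slice().sort().join('|');
  const existing = patterns.filter((p) => p.keywords.slice().sort().join('|') === key)[0];
  if (existing) {
    existing.occurrences += 1;
  } else {
    patterns.push({ keywords, message, occurrences: 1 });
  }
}

// Only patterns at or above the threshold are eligible for the System Prompt.
function activePatterns(patterns: LearnedPattern[]): LearnedPattern[] {
  return patterns.filter((p) => p.occurrences >= ACTIVATION_THRESHOLD);
}
```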

Part 4: Why Not Use a Standard Skill System?

The Design Choice from SOP to YAML Skills

Performance analysis teams typically have their own SOPs (Standard Operating Procedures): how to investigate scrolling jank, how to analyze slow startups, how to diagnose ANR. SOPs are usually documents or checklists – experienced engineers follow them, new team members learn from them.

Anthropic’s Claude Code has a Skills system, essentially parameterized Prompt templates – inject context and submit to the Agent for execution. A natural idea is to write performance analysis SOPs as such Prompt templates and have Claude execute them according to the SOP.

I initially went down this path. The Prompt given to Claude was: “Query the frame_timeline table, find jank frames, analyze the main thread’s state distribution during jank frames.”

Claude understood the intent without issue, but the SQL it generated was different each time. Sometimes the JOIN path was correct (slice -> thread_track -> thread), other times it wrote slice.utid directly – a column that doesn’t exist. The result format was also inconsistent, sometimes 3 columns, sometimes 5, making frontend rendering impossible.

The reason is simple: SOPs are written for humans. An engineer seeing “query frame_timeline” knows exactly what SQL to write. The LLM’s understanding of Perfetto’s SQL schema is incomplete (these schemas have limited coverage in training data), and each translation from SOP text to SQL introduces variance.

SmartPerfetto’s YAML Skills took a different approach – not Prompt templates, but declarative SQL execution units:

# YAML Skill: SQL predefined, result format fixed
steps:
  - id: thread_state_distribution
    type: atomic
    sql: |
      SELECT state, SUM(dur) as total_dur,
             ROUND(SUM(dur) * 100.0 / SUM(SUM(dur)) OVER(), 2) as pct
      FROM thread_state ts
      WHERE ts.utid = ${main_thread_utid}
        AND ts.ts BETWEEN ${frame_start} AND ${frame_end}
      GROUP BY state ORDER BY total_dur DESC
    display:
      level: detail
      columns:
        - { name: state, type: string }
        - { name: total_dur, type: duration }
        - { name: pct, type: percentage }

The core difference between the two approaches is “who writes the SQL.” Prompt templates have the LLM dynamically generate SQL each time, producing unpredictable result formats that can’t be regression-tested; YAML Skills predefine SQL and output schema, execute after parameter substitution, with fixed result formats that enable stable regression testing and frontend rendering.

Dimension	Prompt Template (SOP-style)	YAML Skill (Declarative Execution)
SQL source	LLM generates dynamically each time	Predefined in YAML, parameter substitution
Result format	May differ each time	Fixed column names and types
Regression testable	Not supported	All 6 trace regression tests pass
Frontend rendering	Requires parsing free text	Schema-driven tables/charts
Composable	Not supported	Composite skills call atomic skills
Vendor adaptation	Requires different Prompts	`.override.yaml` overrides SQL

The final division of labor: Claude is responsible for understanding intent, selecting Skills, and reasoning about attribution; YAML Skills are responsible for precise SQL queries and structured output. Claude invokes Skills via invoke_skill, Skills return structured data, and Claude makes judgments based on the data.

Why Not Expose Each Skill as an Independent MCP Tool?

A natural question: why not directly register all 87 atomic analysis capabilities as 87 MCP Tools and let Claude call them directly?

Trying this in practice reveals a problem: the token overhead of the MCP tool list grows linearly with the number of tools. With 87 tools, every API request must carry the name, parameter schema, and usage instructions for all 87, and this fixed overhead consumes significant tokens. More importantly, Claude’s selection accuracy degrades when it faces 87 tools: with that many candidates, it struggles to pick the right one.

SmartPerfetto’s design gives Claude only 2 Skill-related MCP Tools:

invoke_skill(skillId, params) – Execute a specified Skill
list_skills(category?) – Query available Skills by scene category

Capabilities are discovered on demand via list_skills(category="scrolling"), then invoked with invoke_skill. 2 MCP Tools encapsulate 160+ analysis capabilities, and the tool list’s token overhead is fixed.
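
A minimal sketch of the two-tool pattern, assuming an in-memory registry. The class and field names are hypothetical, and real Skills are loaded from YAML rather than registered as functions.

```typescript
interface SkillEntry {
  id: string;
  category: string;
  description: string;
  run: (params: Record<string, string | number>) => unknown;
}

class SkillRegistry {
  private skills = new Map<string, SkillEntry>();

  register(skill: SkillEntry): void {
    this.skills.set(skill.id, skill);
  }

  // Backs list_skills(category?): returns lightweight metadata only, never the SQL.
  listSkills(category?: string): Array<{ id: string; description: string }> {
    const out: Array<{ id: string; description: string }> = [];
    this.skills.forEach((s) => {
      if (category === undefined || s.category === category) {
        out.push({ id: s.id, description: s.description });
      }
    });
    return out;
  }

  // Backs invoke_skill(skillId, params): one generic entry point for every Skill.
  invokeSkill(skillId: string, params: Record<string, string | number>): unknown {
    const skill = this.skills.get(skillId);
    if (!skill) throw new Error(`Unknown skill: ${skillId}`);
    return skill.run(params);
  }
}
```

The tool schema sent with each request describes only these two entry points, so registering another Skill changes what list_skills returns but not the size of the tool list.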

Another benefit is that the YAML format lowers the contribution barrier. If a performance analysis expert has experience with a particular analysis scenario, they can directly write a YAML Skill defining SQL queries and output formats, without needing to know TypeScript or modify backend code. Changes take effect in development mode upon browser refresh (hot reloading), with iteration cycles measured in seconds.

Skill System Structure

The Skill count grew from a dozen in the early project to 164 now. The growth was not driven by “as many as possible,” but by continuously encountering new scenarios that needed coverage in analysis practice – initially there was only frame analysis for standard HWUI, then Flutter applications required specialized Skills, then vendor differences required overrides, then startup analysis needed independent detection logic for JIT, class loading, and Binder pool respectively.

The current Skill distribution by type:

Type	Count	Location	Description
Atomic	87	`skills/atomic/`	Single detection capabilities (VSync period, CPU topology, GPU frequency, GC events, etc.)
Composite	29	`skills/composite/`	Multi-step combined analyses (e.g., scrolling_analysis orchestrates multiple atomic Skills)
Pipeline	28	`skills/pipelines/`	Rendering pipeline detection + teaching (identifies 24+ Android rendering architectures)
Module	18	`skills/modules/`	Module-categorized analyses (app / framework / hardware / kernel)
Deep	2	`skills/deep/`	Deep analysis (CPU profiling, callstack analysis)

Additionally, skills/vendors/ contains .override.yaml for 8 vendors (Pixel / Samsung / Xiaomi / Honor / OPPO / Vivo / Qualcomm / MTK), overriding vendor-specific SQL in generic Skills.

Layered Results

Early Skill output was flat – a Skill returned one large table, with 200 rows of frame data mixed together. Users would see all the data at once upon opening, with no sense of hierarchy. In practice, engineers’ reading habits follow a pattern: first look at the summary (what’s the jank rate, what’s P90), then decide whether to expand the details, then deep-drill into specific frames.

Now Skill output is organized by layer, with progressive frontend rendering:

summary  -- "47 jank frames, P90=23.5ms, SEVERE accounts for 12%"
  |            Aggregate metrics, quick overview
  v
key      -- Key data (most important metrics and findings)
  |            Highlighted display
  v
detail   -- Complete data list (frame_id, duration, jank_type)
  |            Expandable data table
  v
hidden   -- Supporting data (intermediate calculation results, collapsed by default)
               Expand on demand

Each Skill step declares its display layer via display.level (the most frequently used in practice are detail – 240 occurrences, key – 170 occurrences, summary – 81 occurrences). The frontend automatically renders tables and charts based on column types (timestamp, duration, percentage, bytes, etc.) and interaction actions (navigate_timeline to jump to the trace position, navigate_range to select a time range, copy to copy data) from the DataEnvelope – adding a new Skill requires no additional frontend code. This is key to how 164 Skills coexist while frontend code volume remains manageable.

Step Types

Initially, all Skills had only one step type: execute a single SQL query. Later, scenarios requiring composition of multiple Skills arose (e.g., scrolling_analysis needs to first query frame data, then perform blocking analysis on each jank frame), as well as scenarios requiring iteration over data rows (per-frame diagnosis), leading to gradual expansion of step types:

Step Type	Description	Usage Frequency
`atomic`	Single SQL query, the most basic step type	Most common
`skill`	References another Skill’s result, for reusing existing capabilities in composite analysis	56 occurrences
`iterator`	Iterates over data rows, executing sub-queries for each row	Used in 5 composite Skills
`diagnostic`	Diagnostic step, generating structured diagnostic conclusions	38 occurrences
`parallel`	Execute multiple steps in parallel (code supports it, not yet used in Skills)	0
`conditional`	Branch selection based on conditions (code supports it, not yet used in Skills)	0

iterator is the core of per-frame analysis – for example, for the 8 most severe out of 18 jank frames, execute blocking_chain_analysis on each one individually, analyzing the blocking cause for each frame independently. parallel and conditional are defined in the type system but not yet used by any Skill – this is because current analysis scenarios are already covered by skill references + iterator traversal. When more complex scenarios are introduced (such as multi-path parallel data collection), these will come into play.

Domain Skill Examples

The following examples illustrate why so many specialized Skills are needed – each Skill addresses a specific problem that a generic approach can’t handle.

Consumer Jank Detection – Framework Labels != User Perception

The framework’s jank_type label does not equal user-perceived frame drops. Hidden Jank exists – the framework marks jank_type='None' but the user perceives stuttering. The reason is a discrepancy between the framework’s judgment criteria and the user’s actual perception.

SmartPerfetto uses an independent consumer_jank_detection Skill for jank determination: it estimates the actual VSync period via the median of VSYNC-sf intervals, then uses 1.5x the VSync period as the threshold, judging whether jank occurred based on the present_ts difference between adjacent frames (the timestamp when a frame is actually displayed on screen). It does not rely on framework labels.
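
The detection rule described above can be sketched outside SQL as follows. The function names are hypothetical, and the inputs are assumed to be two ascending arrays of nanosecond timestamps already extracted from the trace.

```typescript
// Median of a numeric list; used to estimate the VSync period robustly.
function medianOf(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// vsyncTs: VSYNC-sf timestamps (ns); presentTs: per-frame present timestamps (ns).
// Returns indices of frames whose gap to the previous frame exceeds 1.5x the VSync period.
function detectConsumerJank(vsyncTs: number[], presentTs: number[]): number[] {
  const intervals: number[] = [];
  for (let i = 1; i < vsyncTs.length; i++) intervals.push(vsyncTs[i] - vsyncTs[i - 1]);
  if (intervals.length === 0) return [];
  const threshold = 1.5 * medianOf(intervals);

  const jankIdx: number[] = [];
  for (let i = 1; i < presentTs.length; i++) {
    if (presentTs[i] - presentTs[i - 1] > threshold) jankIdx.push(i);
  }
  return jankIdx;
}
```

Using the median rather than the mean keeps the period estimate stable when a few VSync intervals are abnormally long.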

Blocking Chain Analysis – Cross-Thread, Cross-Process Root Cause Tracing

The root cause of a single dropped frame may involve a multi-layer causal chain:

Frame 42 took 62ms (budget 8.33ms)
  |-> Main thread blocked for 35ms
      |-> Blocked on futex_wait (lock contention)
          |-> Lock holder is a Binder thread
              |-> Binder thread waiting for system_server response

The blocking_chain_analysis Skill provides key evidence for this chain via 3 SQL steps: main thread state distribution (Running / Sleeping / IO proportions) -> waker tracing (finding who woke the main thread via waker_utid) -> blocking function aggregation (cumulative time for futex / binder_wait / io_schedule). This kind of cross-layer analysis is very difficult to implement stably by having Claude write SQL on its own with generic prompts.

Flutter Architecture Branching – Different Rendering Modes Require Different Analysis Logic

Flutter’s two rendering modes involve different threads, requiring examination of different targets during analysis:

Mode	Jank Analysis Target Threads	Goes Through Host RenderThread
TextureView (dual pipeline)	1.ui + 1.raster + RenderThread	Yes
SurfaceView (single pipeline)	1.ui + 1.raster	No

If standard HWUI analysis logic is used to analyze a Flutter SurfaceView application, 1.raster thread’s duration would be incorrectly attributed to RenderThread. SmartPerfetto automatically identifies Flutter applications through architecture detection (24+ rendering pipelines) and switches to the dedicated flutter_scrolling_analysis Skill.

But “automatically identifying Flutter” itself also hit a snag (commit 355df8ee, 4/6). The early pipeline detector scored each architecture independently, with the highest score winning – as a result, Flutter TextureView traces were frequently misclassified as STANDARD. The reason was that Flutter TextureView’s host side still goes through the HWUI pipeline (Choreographer#doFrame / DrawFrame / RenderThread), and these signals were absorbed by both the STANDARD and TEXTUREVIEW classifiers. STANDARD had broader signal coverage (traces almost always contain Choreographer frames), so its total score often exceeded the specialized TEXTUREVIEW, misclassifying Flutter apps as STANDARD. The same problem occurred with WeChat Skyline (absorbed by WEBVIEW) and game engines (absorbed by STANDARD/MIXED).

The fix was not to adjust weights, but to add exclude_if to specialized pipelines: once TEXTUREVIEW detects Flutter 1.ui / 1.raster signals, it directly blocks the STANDARD classification; STANDARD_LEGACY/MIXED/SURFACEVIEW_BLAST block each other upon detecting Game Engine signals; OPENGL_ES blocks upon detecting WebView/Game signals. 24+ pipelines cannot score independently – they need a “specialized -> generic” priority chain. This is a classic example of “once you have many pipelines, you must implement mutual exclusion” – and it’s why, after the Skill count grew to 160+, simply “routing to the correct Skill” itself became an independent engineering problem.
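
One way to read the exclude_if rule is as a veto applied before the score comparison. In this sketch the veto list sits on the candidate being excluded, which may differ from how the YAML attaches it; the type, field, and signal names are assumptions.

```typescript
interface PipelineCandidate {
  id: string;
  score: number;
  excludeIf: string[]; // signals whose presence rules this pipeline out
}

// Pick the highest-scoring pipeline that is not vetoed by any detected signal.
function selectPipeline(candidates: PipelineCandidate[], detectedSignals: string[]): string | null {
  const allowed = candidates.filter(
    (c) => !c.excludeIf.some((sig) => detectedSignals.indexOf(sig) !== -1),
  );
  if (allowed.length === 0) return null;
  return allowed.reduce((best, c) => (c.score > best.score ? c : best)).id;
}
```

With this shape, a generic pipeline can keep its broad signal coverage and high raw score, yet still lose to a specialized pipeline whenever the specialized signal is present.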

Vendor Overrides – Same Metric Has Different Field Names on Different Platforms

In traces from Qualcomm, MediaTek, and Google Tensor, the same metric has different field names (e.g., GPU frequency is called gpufreq on Qualcomm but might be gpu_freq_khz on MediaTek). .override.yaml lets the same Skill automatically adapt its SQL across different platforms, without needing to write independent Skills for each vendor.

Part 5: SQL Engineering

Everything in the Skill system above ultimately comes down to SQL: each Skill step executes predefined SQL, and all performance data retrieval goes through SQL queries to trace_processor. This section expands on several engineering aspects at the SQL level: query pattern design, official stdlib reuse, Schema indexing, result compression, and error-correction learning.

SQL Query Patterns: Time Interval JOINs and Recursive Bucketing

Perfetto trace data is essentially a stream of events with timestamps and durations. The most common operation in performance analysis is determining whether two events overlap in time – for example, whether the main thread was blocked by a Binder call during a frame’s rendering.

The core SQL pattern heavily used in YAML Skills is the time interval JOIN, which implements exactly this overlap test. The following query, for each jank frame, finds the blocking calls (such as GC, Binder, locks) that ran during the frame’s rendering and calculates how many milliseconds they overlapped:

-- Business meaning: Time overlap analysis between jank frames and blocking calls
SELECT
  jf.frame_id,
  b.name as blocking_call,
  -- Calculate precise overlap duration (nanosecond precision)
  ROUND((MIN(b.ts + b.dur, jf.ts + jf.dur) - MAX(b.ts, jf.ts)) / 1e6, 2) as overlap_ms
FROM jank_frames jf
JOIN blocking_calls b
  ON b.ts < jf.ts + jf.dur       -- Blocking call starts before frame ends
  AND b.ts + b.dur > jf.ts       -- Blocking call ends after frame starts
WHERE MIN(b.ts + b.dur, jf.ts + jf.dur) - MAX(b.ts, jf.ts) > 500000  -- Filter out micro-overlaps under 0.5ms (500,000 ns)

The MIN(end1, end2) - MAX(start1, start2) here is the standard formula for the overlap length of two intervals; a result of zero or less means they do not overlap. Perfetto timestamps are nanosecond-precise, so the overlap is computed exactly in nanoseconds, and the query rounds it to 0.01ms for output.
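
The same formula as a standalone function, as a small sketch for checking the arithmetic by hand. The function name is hypothetical.

```typescript
// Overlap length of [startA, startA + durA) and [startB, startB + durB), in the unit of the inputs.
function overlapNs(startA: number, durA: number, startB: number, durB: number): number {
  const overlap = Math.min(startA + durA, startB + durB) - Math.max(startA, startB);
  return Math.max(0, overlap); // a non-positive raw value means the intervals do not overlap
}
```

For example, a frame starting at 1,000,000,000 ns and lasting 62 ms, and a blocking call starting 20 ms later and lasting 35 ms, overlap for the blocking call's full 35,000,000 ns.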

Another common pattern is recursive CTE for time bucketing. For example, analyzing the distribution of big-core vs. little-core CPU usage during the startup process:

-- Recursively generate time buckets (max 30, preventing runaway recursion)
WITH RECURSIVE buckets AS (
  SELECT 0 as idx, ${start_ts} as bucket_start,
         MIN(${start_ts} + ${bucket_ns}, ${end_ts}) as bucket_end
  UNION ALL
  SELECT idx + 1, bucket_end, MIN(bucket_end + ${bucket_ns}, ${end_ts})
  FROM buckets WHERE bucket_end < ${end_ts} AND idx < 29
)
-- Within each time bucket, aggregate scheduling time for big cores vs. little cores.
-- main_thread_sched is defined elsewhere in the Skill; it is assumed here to expose
-- ts, dur, and core_type for each scheduling slice of the main thread.
SELECT
  b.idx as bucket_idx,
  ROUND(SUM(CASE WHEN ms.core_type IN ('prime','big','medium')
    THEN MIN(ms.ts + ms.dur, b.bucket_end) - MAX(ms.ts, b.bucket_start)
    ELSE 0 END) / 1e6, 2) as big_core_ms,
  ROUND(SUM(CASE WHEN ms.core_type = 'little'
    THEN MIN(ms.ts + ms.dur, b.bucket_end) - MAX(ms.ts, b.bucket_start)
    ELSE 0 END) / 1e6, 2) as little_core_ms
FROM buckets b
LEFT JOIN main_thread_sched ms ON ms.ts < b.bucket_end AND ms.ts + ms.dur > b.bucket_start
GROUP BY b.idx

The core_type labels (prime / big / medium / little) come from _cpu_topology, a view provided by the Perfetto stdlib that classifies CPU cores; the step that attaches it to main_thread_sched is omitted from this excerpt. The recursive CTE is capped at 30 buckets to prevent runaway recursion on extremely long traces.
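
The bucketing logic of the recursive CTE is easier to verify as a plain loop. This is an equivalent sketch with hypothetical names, not code from the project.

```typescript
interface TimeBucket {
  idx: number;
  start: number;
  end: number;
}

// Split [startTs, endTs) into buckets of bucketNs, clamping the last bucket and capping the count.
function makeTimeBuckets(startTs: number, endTs: number, bucketNs: number, maxBuckets = 30): TimeBucket[] {
  const buckets: TimeBucket[] = [];
  if (bucketNs <= 0) return buckets;
  let cursor = startTs;
  while (cursor < endTs && buckets.length < maxBuckets) {
    const end = Math.min(cursor + bucketNs, endTs);
    buckets.push({ idx: buckets.length, start: cursor, end });
    cursor = end;
  }
  return buckets;
}
```

Note the consequence of the cap: when the range needs more than 30 buckets, the tail of the range is simply not covered, so the bucket width has to be chosen with the trace duration in mind.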

These SQL patterns are encapsulated in YAML Skills, accepting parameters via ${param|default} syntax. Claude doesn’t need to write these complex time interval JOINs itself – it calls invoke_skill with the time range and process name, and the Skill handles executing the predefined SQL and returning structured results.
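
A minimal sketch of how ${param|default} substitution could work. The function name and the error behaviour for missing parameters are assumptions; the actual implementation is not shown in this article.

```typescript
type SkillParams = Record<string, string | number | undefined>;

// Replace ${name} and ${name|default} placeholders; throw when a value is missing and has no default.
function renderSkillSql(template: string, params: SkillParams): string {
  return template.replace(
    /\$\{(\w+)(?:\|([^}]*))?\}/g,
    (_match: string, key: string, fallback?: string) => {
      const value = params[key];
      if (value !== undefined) return String(value);
      if (fallback !== undefined) return fallback;
      throw new Error(`Missing required parameter: ${key}`);
    },
  );
}
```

Plain text substitution like this assumes the parameter values are trusted or validated upstream (for example, numeric ids and timestamps); string values taken from user input would need escaping before being spliced into SQL.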

Perfetto Stdlib Reuse

Perfetto officially maintains a SQL standard library (stdlib) that provides numerous predefined views and functions. For example, the android_frames view encapsulates multi-table join logic for frame rendering data, and the _android_critical_blocking_calls internal table aggregates critical blocking calls. Using these official abstractions directly is far more stable than hand-writing SQL from base tables.

SmartPerfetto’s stdlib integration went through several iterations – one rollback even completely reversed the “direction of optimization”:

Initial phase: Only 4 stdlib modules were preloaded (android.frames.timeline, android.binder, android.startup.startups, android.input), with most Skills’ SQL querying base tables directly. The advantage was fast startup; the downside was Skills everywhere hand-writing multi-table JOINs.
Round 7 (3/15): Expanded the preload set to 22 modules, including linux.cpu.utilization, android.garbage_collection, android.oom_adjuster, slices.with_context, covering CPU/GC/OOM/slice and other commonly used dimensions. The logic at the time was: batch INCLUDE all commonly used stdlib modules at trace load time, zero overhead for subsequent Skill queries.

Rollback to lazy loading (4/1, commit 0afeb60f): The eager preload of 22 modules failed in production. On 200MB+ traces, issuing 22 INCLUDEs concurrently at startup saturated trace_processor_shell’s RPC connection and triggered socket hang up errors. The root cause is that trace_processor_shell is a single-threaded SQLite engine, which makes it poorly suited to concurrent INCLUDEs, a “batch schema mutation” workload. The final fix was to roll back the eager preload, keep only 3 Tier-0 modules, and load everything else lazily and serially on first query, with a maximum of 3 retries:

// workingTraceProcessor.ts
// Tier 0: absolute minimum stdlib modules needed for any analysis to start.
const CRITICAL_STDLIB_MODULES = [
  'android.frames.timeline',    // 19 skills reference -- frame/jank foundation
  'android.startup.startups',   // 16 skills reference -- startup foundation
  'android.binder',             // 22 skills reference -- IPC/blocking foundation
];

These 3 were selected by “skill reference count” as the highest-frequency dependencies. Other stdlib modules are now declared on-demand via Skill YAML’s prerequisites section or explicit INCLUDE PERFETTO MODULE in SQL when first used.
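
The "lazy + serial + retry" idea can be sketched with a promise chain. Everything here is an assumption for illustration: the class name, the runInclude callback (standing in for whatever sends SQL to trace_processor), and the reading of "max 3 retries" as three attempts in total.

```typescript
class LazyModuleLoader {
  private loaded = new Set<string>();
  private queue: Promise<void> = Promise.resolve(); // serializes all INCLUDEs
  private runInclude: (sql: string) => Promise<void>;
  private maxAttempts: number;

  constructor(runInclude: (sql: string) => Promise<void>, maxAttempts = 3) {
    this.runInclude = runInclude;
    this.maxAttempts = maxAttempts;
  }

  // Ensure a module is loaded; concurrent callers are chained, never run in parallel.
  ensureLoaded(moduleName: string): Promise<void> {
    const task = this.queue.then(async () => {
      if (this.loaded.has(moduleName)) return;
      let lastError: unknown;
      for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
        try {
          await this.runInclude(`INCLUDE PERFETTO MODULE ${moduleName};`);
          this.loaded.add(moduleName);
          return;
        } catch (err) {
          lastError = err;
        }
      }
      throw lastError;
    });
    // Keep the chain alive even if this task fails, so later loads still run.
    this.queue = task.then(() => undefined, () => undefined);
    return task;
  }
}
```

Chaining every load onto one promise means at most one INCLUDE is in flight at a time, which matches the constraint of a single-threaded engine behind a single RPC connection.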

On-demand discovery: perfettoStdlibScanner.ts scans the Perfetto source directory to automatically discover all available modules, exposed through the list_stdlib_modules MCP tool for Claude to INCLUDE non-preloaded modules on demand.

// perfettoStdlibScanner.ts -- scans perfetto/src/trace_processor/perfetto_sql/stdlib/
function scanDirectory(dir: string, prefix: string): string[] {
  // Recursively scan .sql files, convert to module path format (e.g., "android.frames")
  // Exclude prelude directory (these are auto-loaded and don't need manual INCLUDE)
}
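
The path-to-module-name conversion that the comments above describe could look roughly like this. It is a sketch that operates on a list of relative paths rather than on the filesystem; the function name and the exact prelude rule are assumptions.

```typescript
// Convert .sql paths relative to the stdlib root into module names, skipping prelude/.
function toModuleNames(relativePaths: string[]): string[] {
  return relativePaths
    .filter((p) => /\.sql$/.test(p))
    .filter((p) => p.split('/')[0] !== 'prelude')
    .map((p) => p.replace(/\.sql$/, '').split('/').join('.'))
    .sort();
}
```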

The lesson from this rollback is the same type as the earlier “more data makes Claude’s output worse” – “preparing all potentially needed resources at system startup” is the intuitively optimal but practically worst strategy. Whether it’s context for the LLM or stdlib for trace_processor, start lazy + load on demand, and only consider prewarming when a real performance bottleneck appears – this is almost always the more stable choice.

A separate lesson: using the stdlib’s android_garbage_collection_events view is far more stable than manually JOINing slice + thread + process tables to query GC events – because GC event slice names vary across Android versions (concurrent mark sweep vs young concurrent copying vs HeapTaskDaemon), and the stdlib has already handled these compatibility issues. But stdlib views have their own pitfalls (column name prefixes, modules not auto-loaded), which are expanded upon in the SQL error-correction section below.

SQL Schema Index: Letting Claude Know What Tables Are Available

Perfetto trace_processor contains hundreds of tables and views, plus stdlib modules – Claude cannot possibly remember them all. The lookup_sql_schema MCP tool provides a search interface for Claude to look up relevant tables, views, and function definitions by keyword.

Under the hood is an index file automatically generated from the Perfetto source code (perfettoSqlIndex.light.json), containing 761 templates, each recording the name, category, type (table/view/function), column definitions, and parameters.

Queries use tokenized matching + score-based ranking:

Name/category/description containing the full search term -> high score
Multi-word queries are matched token by token -> at least 50% of tokens must match to count as relevant
Underscore-separated table name segments support prefix matching (“frame_time” matches “frame_timeline_slice”)
Returns top 30 results
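
The ranking rules above can be sketched as follows. The weights (a bonus of 10 for a full-term hit, 1 point per matched token) and all names are illustrative assumptions; only the 50% threshold, the segment prefix matching, and the top-30 cut come from the description.

```typescript
interface SchemaEntry {
  name: string;
  category: string;
  description: string;
}

// True when the underscore-separated segments of `token` prefix consecutive segments of `name`.
function segmentPrefixMatch(token: string, name: string): boolean {
  const want = token.split('_');
  const have = name.split('_');
  for (let start = 0; start + want.length <= have.length; start++) {
    if (want.every((w, i) => have[start + i].indexOf(w) === 0)) return true;
  }
  return false;
}

// Score one entry against a query; 0 means "not relevant".
function scoreEntry(entry: SchemaEntry, query: string): number {
  const q = query.toLowerCase().trim();
  const text = (entry.name + ' ' + entry.category + ' ' + entry.description).toLowerCase();
  const tokens = q.split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return 0;
  const hits = tokens.filter(
    (t) => text.indexOf(t) !== -1 || segmentPrefixMatch(t, entry.name.toLowerCase()),
  ).length;
  if (hits / tokens.length < 0.5) return 0; // under 50% of tokens matched
  const fullTermBonus = text.indexOf(q) !== -1 ? 10 : 0;
  return fullTermBonus + hits;
}

function searchSchema(entries: SchemaEntry[], query: string, limit = 30): SchemaEntry[] {
  return entries
    .map((e) => ({ entry: e, score: scoreEntry(e, query) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((r) => r.entry);
}
```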

Combined with sqlKnowledgeBase.ts’s intent mapping, bilingual queries are also supported: user input of “jank” maps to search terms like ['jank', 'frame', 'dropped'], and input of “startup” maps to ['android_startups', 'launch', 'time_to_display']. When multiple intents match, scores are additive – for example, querying “startup frame jank” triggers both the startup and jank intents, giving the highest scores to templates matching their intersection.

SQL Result Compression

When Claude queries directly via execute_sql, it can pass a summary=true parameter to trigger result compression. The compression logic is implemented in sqlSummarizer.ts:

Numeric columns: Compute min, max, avg, and percentiles (P50 / P90 / P95 / P99), letting Claude understand the data distribution without seeing raw rows.

String columns: Count the top 5 values and their occurrence frequencies, providing the data’s category distribution.

Sample row selection: Select 10 representative sample rows from the complete result. The selection strategy is: if the data contains performance-related columns like dur, latency, jank, count, sort descending by that column and take the top 10 (the most severe data is usually most analytically valuable); if there’s no clear performance metric column, use evenly-spaced sampling.

-- 200 raw result rows (~3000 tokens) compressed to:
{
  "totalRows": 200,
  "columnStats": [
    { "column": "dur_ms", "type": "numeric",
      "min": 2.1, "max": 67.3, "avg": 12.8,
      "p50": 9.2, "p90": 23.5, "p95": 35.1, "p99": 62.0 },
    { "column": "jank_type", "type": "string",
      "topValues": [
        { "value": "App Deadline Missed", "count": 87 },
        { "value": "Buffer Stuffing", "count": 45 },
        { "value": "None", "count": 68 }
      ] }
  ],
  "sampleRows": [ /* 10 rows of most severe frame data */ ]
}
// ~500 tokens, roughly an 83% reduction
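
A condensed sketch of the three compression steps. The names are hypothetical, and the nearest-rank percentile method is an assumption; sqlSummarizer.ts may compute percentiles differently.

```typescript
type ResultRow = Record<string, string | number>;

// Nearest-rank percentile on an ascending-sorted array.
function percentileOf(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function summarizeNumericColumn(rows: ResultRow[], column: string) {
  const values = rows
    .map((r) => r[column])
    .filter((v): v is number => typeof v === 'number')
    .sort((a, b) => a - b);
  if (values.length === 0) return null;
  const sum = values.reduce((acc, v) => acc + v, 0);
  return {
    column,
    min: values[0],
    max: values[values.length - 1],
    avg: sum / values.length,
    p50: percentileOf(values, 50),
    p90: percentileOf(values, 90),
    p95: percentileOf(values, 95),
    p99: percentileOf(values, 99),
  };
}

// Most frequent string values of a column, highest count first.
function topStringValues(rows: ResultRow[], column: string, limit = 5) {
  const counts: Record<string, number> = {};
  rows.forEach((r) => {
    const v = r[column];
    if (typeof v === 'string') counts[v] = (counts[v] || 0) + 1;
  });
  return Object.keys(counts)
    .map((value) => ({ value, count: counts[value] }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Sample rows: worst-first by a metric column if one is given, otherwise evenly spaced.
function pickSampleRows(rows: ResultRow[], metricColumn?: string, limit = 10): ResultRow[] {
  if (rows.length <= limit) return rows.slice();
  if (metricColumn !== undefined) {
    return rows
      .slice()
      .sort((a, b) => Number(b[metricColumn]) - Number(a[metricColumn]))
      .slice(0, limit);
  }
  const step = rows.length / limit;
  const out: ResultRow[] = [];
  for (let i = 0; i < limit; i++) out.push(rows[Math.floor(i * step)]);
  return out;
}
```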

This works in conjunction with the Artifact Store mentioned earlier – the Artifact Store compresses Skill results (data returned by invoke_skill), while the SQL Summarizer compresses results when Claude executes SQL directly. Together, two layers of compression cover both paths through which the Agent obtains data.

SQL Error-Correction Learning

Claude is not fully familiar with Perfetto’s SQL schema and sometimes writes incorrect queries. The following are typical errors actually recorded (from logs/sql_learning/error_fix_pairs.json):

Error 1: JOINed on a non-existent column

Perfetto’s slice table has no direct utid column. To associate slice with thread, the intermediate table thread_track is needed: slice.track_id -> thread_track.id -> thread_track.utid -> thread.utid.

-- Error: no such column: s.utid
SELECT s.ts, s.name FROM slice s
JOIN thread t ON s.utid = t.utid

-- Fix: go through the thread_track intermediate table
SELECT s.ts, s.name FROM slice s
JOIN thread_track tt ON s.track_id = tt.id
JOIN thread t ON tt.utid = t.utid

Error 2: Ambiguous column names

-- Error: ambiguous column name: name (both slice and process have a name column)
SELECT name, ts FROM slice s JOIN process p ON ...

-- Fix: add table name prefix
SELECT s.name, s.ts FROM slice s JOIN process p ON ...

Error 3: Incorrect understanding of the counter table’s data model

Perfetto’s counter table stores sample points (timestamp + value), not interval data, and has no dur column.

-- Error: no such column: c.dur
SELECT SUM(c.value * c.dur) FROM counter c

-- Fix: use simple average or LEAD window function
SELECT AVG(c.value) FROM counter c WHERE ...
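
The fix above shows only the simple average. The LEAD-based alternative derives each sample's duration from the timestamp of the next sample and weights values by it. The sketch below shows that computation outside SQL, with hypothetical names and an explicit end timestamp for the last sample.

```typescript
interface CounterSample {
  ts: number;
  value: number;
}

// Each sample's value holds until the next sample; the last sample holds until endTs.
function timeWeightedAverage(samples: CounterSample[], endTs: number): number {
  const sorted = samples.slice().sort((a, b) => a.ts - b.ts);
  let weighted = 0;
  let total = 0;
  for (let i = 0; i < sorted.length; i++) {
    const next = i + 1 < sorted.length ? sorted[i + 1].ts : endTs;
    const dur = Math.max(0, next - sorted[i].ts);
    weighted += sorted[i].value * dur;
    total += dur;
  }
  return total > 0 ? weighted / total : NaN;
}
```

For unevenly spaced samples the two averages can differ noticeably: three samples of 1000, 2000, 1000 held for 10, 30, and 10 time units average 1600 when time-weighted, versus about 1333 with a plain AVG.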

Error 4: stdlib table name / column name pitfalls

Both cases below (4a and 4b) were added in commit 05922e67. They were discovered through “orphan stderr with no owner”: the analysis had long since ended, but trace_processor was still spewing error logs. Tracing back revealed that the Agent was repeatedly hitting the same pitfalls in dynamic SQL.

-- Error 4a: no such table: cpu_frequency_counters
-- This is a stdlib table, and its module is not among the 3 preloaded Tier-0 modules
SELECT ts, cpu, freq FROM cpu_frequency_counters WHERE ...

-- Fix: INCLUDE the module first, then query
INCLUDE PERFETTO MODULE linux.cpu.frequency;
SELECT ts, cpu, freq FROM cpu_frequency_counters WHERE ...

-- Error 4b: no such column: ts
-- android_garbage_collection_events uses gc_-prefixed column names, not the conventional ts/dur
SELECT ts/1e6, dur/1e6 FROM android_garbage_collection_events

-- Fix: use gc_ts / gc_dur / gc_running_dur / gc_runnable_dur
SELECT gc_ts/1e6, gc_dur/1e6, gc_running_dur/1e6
FROM android_garbage_collection_events

Stdlib view authors often use domain-prefixed column names to avoid ambiguous column issues during multi-table JOINs, but Claude defaults to assuming conventional names like ts/dur. This “stdlib’s own naming conventions vs. general SQL conventions” conflict can’t be fully solved automatically through schema introspection – the lookup_sql_schema tool can tell Claude the table exists and what columns it has, but it can’t predict “this time Claude will definitely assume ts.” The only fallback is accumulating error-correction pairs.

The detection and learning mechanism for these errors works as follows: when SQL execution fails, the error message and SQL are temporarily stored; when a subsequent SQL execution succeeds, the system uses Jaccard similarity matching (excluding SQL structural keywords like SELECT/FROM/WHERE, and Perfetto common tokens like utid/dur/slice) to determine whether it’s a corrected version of the same query. The matching threshold is >30%, with a 60-second time window. A successful match generates an error->fix pair that is persisted to disk.
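
The matching step can be sketched like this. The stop-token list is an illustrative subset, and the tokenizer and function names are assumptions; only the Jaccard measure, the 30% threshold, and the 60-second window come from the description above.

```typescript
// Tokens ignored when comparing queries (illustrative subset, not the project's actual list).
const STOP_TOKENS = ['select', 'from', 'where', 'join', 'on', 'and', 'or', 'as',
  'group', 'by', 'order', 'utid', 'dur', 'slice', 'ts'];

// Distinct identifier-like tokens of a query, minus the stop tokens.
function sqlTokens(sql: string): string[] {
  const raw = sql.toLowerCase().match(/[a-z_][a-z0-9_]*/g) || [];
  return Array.from(new Set(raw.filter((t) => STOP_TOKENS.indexOf(t) === -1)));
}

// Jaccard similarity: |intersection| / |union| of two token sets.
function jaccard(a: string[], b: string[]): number {
  const inter = a.filter((t) => b.indexOf(t) !== -1).length;
  const union = a.length + b.length - inter;
  return union === 0 ? 0 : inter / union;
}

interface FailedQuery {
  sql: string;
  error: string;
  atMs: number;
}

// Returns the pending failed query that the successful SQL most plausibly fixes, or null.
function matchFix(pending: FailedQuery[], successSql: string, nowMs: number): FailedQuery | null {
  const okTokens = sqlTokens(successSql);
  let best: FailedQuery | null = null;
  let bestScore = 0.3; // similarity must exceed 30%
  for (let i = 0; i < pending.length; i++) {
    if (nowMs - pending[i].atMs > 60000) continue; // outside the 60-second window
    const score = jaccard(sqlTokens(pending[i].sql), okTokens);
    if (score > bestScore) {
      best = pending[i];
      bestScore = score;
    }
  }
  return best;
}
```

Applied to Error 1 above, the failed and fixed queries share the tokens s, name, thread, and t out of 8 distinct tokens in total, a similarity of 0.5, so they would be paired.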

When a new analysis begins, the 10 most recent error-correction pairs are loaded into the System Prompt, so Claude can see these known pitfalls before writing SQL. Error-correction pairs have a 30-day TTL and are automatically cleaned up upon expiration – Perfetto’s SQL schema changes with version updates.
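
Selecting which pairs go into the System Prompt reduces to a filter and a sort. A minimal sketch, assuming each persisted pair carries a creation timestamp in milliseconds; the field and function names are hypothetical.

```typescript
interface ErrorFixPair {
  errorSql: string;
  fixedSql: string;
  createdAtMs: number;
}

const TTL_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Drop expired pairs, then keep the newest `limit` for the System Prompt.
function pairsForPrompt(pairs: ErrorFixPair[], nowMs: number, limit = 10): ErrorFixPair[] {
  return pairs
    .filter((p) => nowMs - p.createdAtMs <= TTL_MS)
    .sort((a, b) => b.createdAtMs - a.createdAtMs)
    .slice(0, limit);
}
```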

Part 6: The Evolution of the Development Process Itself

This final section steps slightly outside the product itself to discuss the development process. SmartPerfetto was built with AI-assisted development – from the first line of code to now, Claude Code has been the primary programming tool. Looking back over these three months, the way I use AI-assisted development has itself gone through several iterations, with logic similar to SmartPerfetto’s evolution from agentv2 to agentv3.

Stages of AI-Assisted Development

First, a brief explanation of the tools and concepts involved:

Claude Code: Anthropic’s CLI tool that enables terminal conversations with Claude, where Claude can directly read/write files and execute commands. Throughout development, I ran with --dangerously-skip-permissions (dangerous mode) and bypass permissions enabled, allowing Claude to autonomously execute file edits, command runs, Git operations, etc. without per-action confirmation. This dramatically improved iteration speed – Claude can continuously execute complete “modify code -> run tests -> check results -> fix -> re-run” cycles without being interrupted by permission dialogs, at the cost of requiring the developer to have sufficient trust in Claude’s operations and perform after-the-fact review.
Claude Agent SDK: Anthropic’s Agent development framework, on which SmartPerfetto’s agentv3 backend is built. The SDK encapsulates multi-turn conversation management, MCP tool-calling loops, automatic context compression (auto-compact), and other capabilities. Developers define the tool set and System Prompt, and the SDK drives Claude to autonomously complete multi-round analysis.
Plan Mode: Claude Code’s planning mode, where the AI first outputs a structured implementation plan (which files to modify, what to change, order and dependencies), the human reviews and confirms, then the AI executes the code changes.
SuperPower: The third-party plugin ecosystem for Claude Code, which injects additional capabilities into Claude Code through MCP Servers. In SmartPerfetto development, I used the Chrome DevTools Protocol plugin (directly controlling the browser for screenshots, debugging frontend) and Playwright plugin (automated UI testing and screenshots), among others. These plugins extended Claude Code’s capabilities from code editing to browser interaction and visual verification.
Codex + Codex MCP: Codex is OpenAI’s code reasoning model. After being integrated into Claude Code via the Codex MCP Server, Claude can directly call Codex for independent review during conversations – send the implementation plan to Codex, which accesses the codebase in read-only mode and provides feedback from three angles: architectural soundness, edge cases, and missed risks. The entire process stays within the Claude Code workflow.
Agent Team: Claude Code supports launching multiple sub-Agents working in parallel, each with independent tool sets and role definitions.
Skills / Hooks: Claude Code’s extension mechanisms. Skills are reusable task templates (e.g., /commit, /simplify), and Hooks are scripts that automatically execute on specific events (e.g., before/after tool calls).

My Actual Evolution

Stage 1: Direct Conversation

The earliest development approach was describing requirements directly in Claude Code and letting the AI modify code. Similar to pair programming where one person talks and the other writes. At this stage, every change needed line-by-line human review because the AI’s understanding of project context was limited, frequently making local modifications that didn’t align with the overall architecture.

Stage 2: Plan Mode (SuperPower)

After adopting Plan Mode, the workflow became: I describe the requirement -> AI outputs a structured implementation plan (which files to modify, what to change in each, modification order and dependencies) -> I review the plan -> AI executes after confirmation. This shifted the review focus from “examining code line by line” to “reviewing the architecture plan,” noticeably improving efficiency.

Stage 3: Plan Mode (SuperPower) + Peer Review (Codex)

Relying on a single AI to generate plans is prone to blind spots. I began sending the plan to Codex for independent review after the Plan Mode plan was finalized. Codex accesses the codebase in read-only mode and provides feedback from three angles: architectural soundness, edge cases, and missed risks. This effectively introduces a code review step into the AI development workflow.

Most of the 9 rounds of architecture review mentioned earlier in the article went through this process. Taking the Perfetto Stdlib integration as an example (Round 7, March 15), Codex ran 3 review rounds and raised 36 feedback points in total, including issues I had missed in the plan such as the stdlib module preloading strategy and the Schema Index cache invalidation mechanism.

Stage 4: Harness-ified Engineering Pipeline

By the later stages, the development workflow had become:

1. I determine requirements and architectural direction
2. Claude Code outputs an implementation plan in Plan Mode
3. Codex reviews the plan in read-only mode, provides feedback
4. I evaluate Codex's feedback (not blindly followed -- ~20% of suggestions are rejected with stated reasons)
5. Claude Code executes code changes per the revised plan
6. Automated full test suite runs:
   - npx tsc --noEmit (type checking)
   - npm run test:scene-trace-regression (6-trace regression test, verifying Skill data output)
   - npm run validate:skills + validate:strategies (Skill/strategy contract validation)
   - For startup/scrolling/Flutter-related changes, E2E Agent analysis with real traces:
     Uses the verifyAgentSseScrolling.ts script to load real trace files,
     initiates a complete Agent analysis session, checks the SSE event stream,
     tool call sequence, and whether the final conclusion covers mandatory
     checks defined in the strategy.
     For example, scrolling scenarios check whether the Agent executed Phase 1.9
     root cause deep drill, and Flutter scenarios check whether the Agent correctly
     identified the TextureView/SurfaceView architecture and invoked
     flutter_scrolling_analysis rather than the standard scrolling_analysis.
     This step verifies not whether a Skill can run, but whether the Agent's
     reasoning path and conclusion quality meet expectations when facing real data.
7. Tests fail -> analyze failure cause (read session log + metrics) -> fix -> re-test
8. Tests pass -> /simplify (code cleanup)
9. Final Codex review confirmation
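
The mandatory-check part of step 6 can be sketched roughly as follows. This is a minimal illustration, not the actual verifyAgentSseScrolling.ts: the event shape (`AgentEvent`), the `MandatoryCheck` type, the `invoke_skill` tool name, and the function `findMissingChecks` are all hypothetical, and it assumes the SSE stream has already been parsed into an array of events.

```typescript
// Hypothetical shape of a parsed SSE event; the real event schema is not shown in this article.
interface AgentEvent {
  type: "tool_call" | "text" | "conclusion";
  toolName?: string; // set for tool_call events (names here are assumptions)
  skillId?: string;  // set when the tool call invokes a Skill
  text?: string;     // set for text / conclusion events
}

// A mandatory check passes if at least one event satisfies its predicate.
interface MandatoryCheck {
  name: string;
  satisfiedBy: (event: AgentEvent) => boolean;
}

// Returns the names of all checks that no event in the stream satisfied.
function findMissingChecks(events: AgentEvent[], checks: MandatoryCheck[]): string[] {
  return checks
    .filter((check) => !events.some((event) => check.satisfiedBy(event)))
    .map((check) => check.name);
}

// Example: a Flutter scrolling trace must use the Flutter-specific Skill.
const flutterChecks: MandatoryCheck[] = [
  {
    name: "invoked flutter_scrolling_analysis",
    satisfiedBy: (e) => e.type === "tool_call" && e.skillId === "flutter_scrolling_analysis",
  },
  {
    name: "produced a conclusion",
    satisfiedBy: (e) => e.type === "conclusion" && (e.text ?? "").length > 0,
  },
];

// A session that wrongly used the standard scrolling Skill on a Flutter trace.
const sampleEvents: AgentEvent[] = [
  { type: "tool_call", toolName: "invoke_skill", skillId: "scrolling_analysis" },
  { type: "conclusion", text: "Jank caused by main-thread blocking." },
];

const missing = findMissingChecks(sampleEvents, flutterChecks);
console.log(missing.length === 0 ? "PASS" : `FAIL: ${missing.join(", ")}`);
```

Expressing each check as a predicate over events keeps the verifier independent of any single scenario: a scrolling strategy and a Flutter strategy would simply supply different check lists.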

In this workflow, human involvement is concentrated at Step 1 (requirements and architectural decisions) and Step 4 (evaluating review feedback). Code details, test execution, and formatting cleanup are handled by the engineering pipeline.
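
The automated part of the pipeline (the non-E2E checks in step 6) behaves like a fail-fast command runner. The sketch below is an illustration under assumptions rather than the project's actual tooling: `runChecks`, `Exec`, and `fakeExec` are hypothetical names, the executor is injected so the example runs without spawning processes, and it assumes `validate:strategies` is an npm script like `validate:skills`.

```typescript
// Exec runs one shell command and returns its exit code.
// Assumption: in a real setup this would wrap a process-spawning API.
type Exec = (command: string) => number;

interface StepResult {
  command: string;
  exitCode: number;
}

// Runs commands in order and stops at the first failure (fail-fast),
// so a failing step can be analyzed and fixed before re-running.
function runChecks(commands: string[], exec: Exec): { passed: boolean; results: StepResult[] } {
  const results: StepResult[] = [];
  for (const command of commands) {
    const exitCode = exec(command);
    results.push({ command, exitCode });
    if (exitCode !== 0) {
      return { passed: false, results };
    }
  }
  return { passed: true, results };
}

// Commands taken from step 6 of the workflow.
const checks = [
  "npx tsc --noEmit",
  "npm run test:scene-trace-regression",
  "npm run validate:skills",
  "npm run validate:strategies",
];

// Stand-in executor for demonstration: pretends the regression test fails.
const fakeExec: Exec = (command) => (command.includes("scene-trace-regression") ? 1 : 0);

const outcome = runChecks(checks, fakeExec);
console.log(outcome.passed ? "all checks passed" : `failed at: ${outcome.results[outcome.results.length - 1].command}`);
```

Stopping at the first failure matches step 7: the failing command is the last entry in `results`, which is where the failure analysis starts.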

Correspondence with SmartPerfetto’s Architecture

Looking back, my AI-assisted development workflow and SmartPerfetto’s Agent analysis workflow are structurally similar:

Dimension	SmartPerfetto Agent Analysis	My AI-Assisted Development
Intent Understanding	Scene Classifier identifies the scene	I determine the requirement direction
Strategy Injection	.strategy.md injects analysis methodology	Plan Mode outputs implementation plan
Execution	MCP tool calls for SQL/Skills	Claude Code executes code changes
Quality Verification	3-layer Verifier (heuristic + Plan + Haiku)	Regression tests + Codex review
Correction Loop	Correction Prompt has Claude revise	Test failure -> analyze -> fix -> re-run
Cross-session Learning	Pattern Memory + SQL error-correction	CLAUDE.md rule accumulation + memory system

Both systems evolved in the same direction: human involvement shifted from the execution layer up to the decision layer. SmartPerfetto went from a fixed pipeline (humans define every step) to autonomous reasoning (humans define goals and constraints); my development approach went from line-by-line review to reviewing architecture plans.

This is not coincidental – the core of Harness Engineering is building sufficient engineering infrastructure (tests, verification, review) so that humans can trust the AI’s execution results and focus their attention on higher-level decisions.

The diagram below summarizes SmartPerfetto’s complete Harness Engineering landscape – from input routing to cross-session learning:

Harness Engineering Landscape

Conclusion

Looking back over three months of iteration, from agentv2’s 13-step fixed pipeline to agentv3’s autonomous reasoning, from a ~30% misjudgment rate to three-layer verification, from a 15,000-token System Prompt to a 4,500-token prompt with on-demand loading – every change was driven by a concrete failure.

After completing this project, I have two takeaways about AI Agent application development.

The first is: the bulk of the work is not in the LLM API calls themselves, but in the engineering infrastructure surrounding the LLM:

How should the System Prompt be organized so the LLM doesn’t confuse context? -> Scene classification + on-demand loading + token budget
How do you control the LLM’s execution order so it thinks before acting? -> Planning Gate + hypothesis submission
How much data should be returned to the LLM? -> Artifact Store: give summaries, not full data
How do you detect and intercept the LLM’s domain misjudgments? -> Three-layer verification + correction loop
How do you ensure data query precision? -> YAML Skills (declarative SQL) + SQL error-correction learning
How do you adapt to different rendering architectures and chip platforms? -> Architecture detection + vendor overrides
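
To make the "give summaries, not full data" idea concrete, here is a minimal sketch of an artifact store. It is an assumption-laden illustration, not SmartPerfetto's ArtifactStore: the class shape, the method names `put` and `getPage`, the summary fields, and the sample column names are all hypothetical.

```typescript
type Row = Record<string, string | number | null>;

// What the LLM sees instead of the full query result.
interface ArtifactSummary {
  artifactId: string;
  rowCount: number;
  columns: string[];
  preview: Row[]; // first few rows only
}

class ArtifactStore {
  private artifacts = new Map<string, Row[]>();
  private nextId = 1;

  // Stores the full result server-side and returns only a compact summary.
  put(rows: Row[], previewSize = 3): ArtifactSummary {
    const artifactId = `artifact-${this.nextId++}`;
    this.artifacts.set(artifactId, rows);
    return {
      artifactId,
      rowCount: rows.length,
      columns: Object.keys(rows[0] ?? {}),
      preview: rows.slice(0, previewSize),
    };
  }

  // Lets the agent fetch one page of the full data on demand.
  getPage(artifactId: string, offset = 0, limit = 20): Row[] {
    const rows = this.artifacts.get(artifactId);
    if (!rows) {
      throw new Error(`Unknown artifact: ${artifactId}`);
    }
    return rows.slice(offset, offset + limit);
  }
}

// A 200-row query result is reduced to a reference plus a 3-row preview.
const store = new ArtifactStore();
const fullResult: Row[] = Array.from({ length: 200 }, (_, i) => ({ frame_id: i, dur_ms: 16 + (i % 5) }));
const summary = store.put(fullResult);
console.log(summary.artifactId, summary.rowCount, summary.preview.length);
```

The model keeps the `artifactId` in context and pages through the full rows only when a hypothesis requires them, which is what keeps the token cost of each tool result bounded.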

The second takeaway is: the Agent’s “environment” matters far more than prompt wording. In agentv3’s early days, I spent considerable time adjusting System Prompt wording and formatting. I later discovered that what truly affects Agent output quality is not how the prompt is written, but what tools it’s given, what data is returned, and what constraints are imposed. Three concrete examples:

After adding submit_plan gating, Claude stopped issuing directionless SQL queries (previously it would run purposeless sequences like SELECT * FROM slice -> SELECT * FROM thread), and its analysis paths became organized
After adding ArtifactStore, the data Claude received dropped from 200 rows to summary references, and reasoning focus noticeably improved
After adding the lookup_knowledge tool, root cause analysis depth progressed from “main thread blocked” to “Binder counterpart system_server had delayed response due to CPU being scheduled to a little core”

None of these improvements were achieved through adjusting prompt text – they were achieved by changing the Agent’s tool set and data environment. If I had one piece of advice for engineers building AI Agent applications, it would be: invest your effort in tool design and data control, not prompt engineering.
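
As an illustration of changing the environment rather than the prompt, a planning gate can be sketched as a wrapper that refuses data tools until a plan exists. This is a simplified sketch under assumptions: only the `submit_plan` name comes from the article, while `PlanningGate`, `guard`, `ToolResult`, and the `execute_sql` tool name are hypothetical.

```typescript
interface ToolResult {
  ok: boolean;
  message: string;
}

class PlanningGate {
  private plan: string[] | null = null;

  // Corresponds to the submit_plan tool named in the article.
  submitPlan(steps: string[]): ToolResult {
    if (steps.length === 0) {
      return { ok: false, message: "Plan must contain at least one step." };
    }
    this.plan = steps;
    return { ok: true, message: `Plan accepted with ${steps.length} step(s).` };
  }

  // Wraps any data tool: refuses to run it until a plan has been submitted.
  guard(toolName: string, run: () => string): ToolResult {
    if (this.plan === null) {
      return { ok: false, message: `${toolName} rejected: call submit_plan first.` };
    }
    return { ok: true, message: run() };
  }
}

const gate = new PlanningGate();

// Before a plan exists, the query is rejected and the model sees why.
const before = gate.guard("execute_sql", () => "rows...");

gate.submitPlan(["Locate janky frames", "Inspect main thread states"]);

// After the plan is accepted, the same call goes through.
const after = gate.guard("execute_sql", () => "rows...");
console.log(before.ok, after.ok);
```

The rejection message is returned as an ordinary tool result, so the model is steered by feedback from its environment instead of by an instruction it may overlook in a long prompt.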

Future Directions

The current SmartPerfetto is an interactive analysis tool that is far from ready for release, so it remains closed-source for now, developed solely by me. Future engineering directions include:

Deep vendor integration – The current 8 vendors’ .override.yaml only covers core Skills. More vendor-specific metrics (Qualcomm Snapdragon Profiler data, MediaTek MAGT signals, Samsung GameOptimizing service) need to be integrated one by one.
CI integration + batch processing – Moving from interactive analysis to automatically analyzing performance regressions for each build in CI pipelines. This includes an unattended mode, result comparison against baselines, and automatic regression tagging.
E2E verification framework – The current 6-trace regression test verifies the correctness of Skill data output, but does not verify the quality of Agent conclusions. An E2E verification system is needed: given a trace + known root cause -> check whether the Agent correctly identified it.
Codebase integration – Mapping slices/functions in the trace back to source code locations, combined with git blame to locate the change that introduced the issue.
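
The E2E verification idea above (trace + known root cause -> check the Agent's conclusion) could start from something as simple as the sketch below. Everything here is hypothetical: the `GroundTruthCase` shape, the trace paths, and the `AnalyzeFn` stand-in for a full Agent session. Keyword matching is used only as a crude first proxy for "correctly identified the root cause"; a real framework would likely need a stronger judge.

```typescript
interface GroundTruthCase {
  traceFile: string;          // hypothetical path to a trace with a known root cause
  expectedKeywords: string[]; // all must appear in the Agent's conclusion
}

// Stand-in for running a complete Agent analysis session on a trace.
type AnalyzeFn = (traceFile: string) => string;

interface EvalReport {
  total: number;
  passed: number;
  failures: string[]; // trace files whose conclusions missed the root cause
}

// Case-insensitive check that every expected keyword appears in the conclusion.
function conclusionMatches(conclusion: string, expectedKeywords: string[]): boolean {
  const text = conclusion.toLowerCase();
  return expectedKeywords.every((kw) => text.includes(kw.toLowerCase()));
}

function evaluateCases(cases: GroundTruthCase[], analyze: AnalyzeFn): EvalReport {
  const failures: string[] = [];
  for (const c of cases) {
    const conclusion = analyze(c.traceFile);
    if (!conclusionMatches(conclusion, c.expectedKeywords)) {
      failures.push(c.traceFile);
    }
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}

const groundTruth: GroundTruthCase[] = [
  { traceFile: "traces/binder_block.pftrace", expectedKeywords: ["binder", "system_server"] },
  { traceFile: "traces/gc_jank.pftrace", expectedKeywords: ["garbage collection"] },
];

// Canned conclusions replace real Agent sessions for this demonstration.
const cannedConclusions: Record<string, string> = {
  "traces/binder_block.pftrace": "Main thread blocked on Binder; system_server responded late.",
  "traces/gc_jank.pftrace": "Jank caused by shader compilation on RenderThread.",
};
const fakeAnalyze: AnalyzeFn = (traceFile) => cannedConclusions[traceFile] ?? "";

const report = evaluateCases(groundTruth, fakeAnalyze);
console.log(`${report.passed}/${report.total} passed`, report.failures);
```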

This tool will be open-sourced at the appropriate time (since major companies are all building their own internally, open-sourcing it would allow everyone to share ideas and develop together). Those interested in progress updates are welcome to add me on WeChat to join the group chat or send me a direct message.