
SmartPerfetto Architecture Q&A: 8 In-Depth Technical Questions

2026/04/10


This article collects technical questions received after publishing From Trace to Insight: Harness Engineering in SmartPerfetto AI Agent and discusses them in Q&A format.


Q1: Why build a custom YAML Skill system instead of using Claude Code’s standard Skills?

Question context: Claude Code’s Skill system supports placing deterministic scripts in a scripts/ directory to avoid LLM generalization. Since you can use scripts/ to execute fixed SQL, why build a separate YAML Skill system? Isn’t a YAML Skill essentially a tool that lets performance engineers execute SQL according to predefined rules?

Key distinction: The two Skill systems operate at different layers

Claude Code Skills and SmartPerfetto YAML Skills solve problems at different stages:

Development stage (when I write code):
  Claude Code + Skills/Hooks → helps me develop SmartPerfetto

Runtime stage (when users analyze traces):
  SmartPerfetto Backend + YAML Skills → helps users analyze performance data

Claude Code’s Skills run in the developer’s terminal as CLI tool extensions. SmartPerfetto’s YAML Skills run in the Skill Engine within the Express backend, invoked by the Agent at runtime through the MCP tool invoke_skill. The execution environments, invocation methods, and data flows are completely different.

Even focusing only on “deterministic execution,” YAML Skills have several targeted designs

1. Parameterized SQL, not fixed scripts

Performance analysis SQL isn’t hardcoded – the same Skill needs to accept different parameters (process names, time ranges, frame ID lists):

steps:
  - id: thread_state_distribution
    type: atomic
    sql: |
      SELECT state, SUM(dur) as total_dur
      FROM thread_state ts
      JOIN thread_track tt ON ts.track_id = tt.id
      WHERE tt.utid = ${main_thread_utid}
        AND ts.ts BETWEEN ${start_ts} AND ${end_ts}
      GROUP BY state

${main_thread_utid} and ${start_ts} are parameters passed in when Claude calls invoke_skill. The YAML Skill Engine performs parameter substitution before executing the SQL. With scripts/, you’d either write shell scripts that accept parameters and concatenate SQL (prone to injection issues) or write full Python/Node scripts – far more complex than YAML.
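The substitution step can be sketched as follows. This is a minimal illustration, not the actual Skill Engine code; `substituteParams` and its quoting rules are assumptions:

```typescript
// Hypothetical sketch of ${param} substitution (not the real Skill Engine).
// Numbers are interpolated directly; strings are quoted with single quotes
// escaped, avoiding the injection pitfalls of naive shell-style concatenation.
function substituteParams(sql: string, params: Record<string, string | number>): string {
  return sql.replace(/\$\{(\w+)\}/g, (_match, name: string) => {
    const value = params[name];
    if (value === undefined) throw new Error(`Missing parameter: ${name}`);
    if (typeof value === 'number') return String(value);
    return `'${value.replace(/'/g, "''")}'`; // escape embedded single quotes
  });
}

const sql = substituteParams(
  'SELECT * FROM thread_state WHERE utid = ${main_thread_utid} AND ts >= ${start_ts}',
  { main_thread_utid: 42, start_ts: 1000 },
);
// sql === "SELECT * FROM thread_state WHERE utid = 42 AND ts >= 1000"
```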

2. Self-describing output format (DataEnvelope)

display:
  level: detail
  columns:
    - { name: state, type: string }
    - { name: total_dur, type: duration }

Each step declares the output columns’ names and types. The frontend automatically renders tables based on this schema – duration types are automatically formatted as ms, timestamp types support click-to-navigate to the Perfetto timeline. With scripts/, the output is free-form text that the frontend can’t automatically render.
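Schema-driven rendering can be illustrated with a small formatter. This is a hypothetical sketch, not the real frontend code; the point is that the column's declared type, not its raw value, decides presentation:

```typescript
type ColumnType = 'string' | 'duration' | 'timestamp';

// Hypothetical cell formatter: `duration` values arrive in nanoseconds
// (Perfetto's native time unit) and are rendered as milliseconds.
function formatCell(value: string | number, type: ColumnType): string {
  switch (type) {
    case 'duration':
      return `${(Number(value) / 1e6).toFixed(2)} ms`;
    case 'timestamp':
      return `@${value}`; // stand-in for a click-to-navigate timeline link
    default:
      return String(value);
  }
}

formatCell(16_700_000, 'duration'); // "16.70 ms"
```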

3. Composable (composite + iterator)

A composite Skill can reference multiple atomic Skills, and iterators can traverse data rows for per-frame analysis. This composition is declarative in YAML, with the Skill Engine handling orchestration. The scripts/ approach would require writing your own orchestration logic for the same composition.
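A composite Skill might look like the following sketch. The field names here are approximations for illustration, not the exact schema:

```yaml
# Illustrative composite Skill (field names are approximations, not the real schema)
id: per_frame_deep_drill
type: composite
steps:
  - skill: jank_frame_list          # atomic: returns rows of janky frames
    output: frames
  - type: iterator                  # traverse rows from the previous step
    over: frames
    steps:
      - skill: frame_timeline_detail
        params:
          frame_id: ${item.frame_id}
```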

4. Designed for performance engineers, not developers

The questioner got it right: YAML Skills are essentially a tool that lets performance engineers contribute analysis logic through predefined rules. Performance engineers know which SQL to query and which metrics to examine, but they don’t necessarily know TypeScript. The YAML format lets them directly define SQL queries and output formats without touching backend code. After modifications, changes take effect by simply refreshing the browser in DEV mode.

Comparison summary

| Dimension | Claude Code scripts/ | SmartPerfetto YAML Skill |
| --- | --- | --- |
| Runtime environment | Developer terminal (CLI) | Express backend (runtime) |
| Caller | Developer via /skill command | Agent via invoke_skill MCP tool |
| Parameterization | Must handle yourself | Built-in `${param}` substitution |
| Output format | Free-form text | DataEnvelope (schema-driven) |
| Frontend rendering | Not involved | Automatic tables/charts |
| Composition | Manual orchestration | composite / iterator / conditional |
| Contribution barrier | Must write scripts | Just YAML + SQL |

The two are not alternatives but solve different problems at different layers.


Q2: How exactly is “deterministic + flexible” implemented?

Question context: The article says “known scenarios use Strategy files to constrain mandatory checks, but within each phase, the specific queries and deep drill directions are autonomously decided by Claude.” Where is the boundary between constraint and autonomy? How exactly is this achieved?

Three-layer mechanism working together

This hybrid design relies on three layers working in concert: Strategy files define “what must be done,” Planning Gate enforces “plan before acting,” and Verifier performs post-hoc checks on “whether it was actually done.”

Layer 1: Strategy files – hard constraints alongside soft guidance

Taking the scrolling analysis scrolling.strategy.md as an example, it defines multiple analysis phases, but the constraint strength differs across phases:

Hard constraints (must execute; skipping triggers verification errors):

Phase 1.9 root cause deep drill is the most strictly constrained phase, with the strategy file using red circle markers and “prohibited” language:

**Phase 1.9 -- Root Cause Deep Drill (Red circle mandatory, cannot skip):**

For each reason_code accounting for >15% in `batch_frame_root_cause`,
you **must** select the most severe frame for deep drilling.
**Prohibited:** drawing conclusions solely from `batch_frame_root_cause` statistical classifications.

| Condition | Deep drill action |
| Any reason_code Q4>20% | invoke_skill("blocking_chain_analysis", ...) |
| binder_overlap >5ms | invoke_skill("binder_root_cause", ...) |
| ...

Soft guidance (suggested but skippable):

Phase 1.5 (architecture-aware branching) and Phase 1.7 (root cause branching) use suggestive language like “switch to” and “note,” allowing Claude to decide whether to execute based on actual data:

**Phase 1.5 -- Architecture-Aware Branching:**

| Architecture | Adjustment action |
| Flutter | Switch to flutter_scrolling_analysis |
| WebView | Note CrRendererMain thread |

The entire content of the Strategy file is injected verbatim into the System Prompt, with a hard constraint statement added at injection time:

Scene Strategy (must be strictly followed)
For the following common scenarios, proven analysis pipelines exist. All phases must be fully executed and cannot be skipped.

Claude sees these phase definitions, red circle markers, and “prohibited” language directly in the System Prompt.
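The injection itself is plain string assembly; a hypothetical sketch (the real prompt builder differs, but the verbatim-injection idea is the same):

```typescript
// Hypothetical sketch of verbatim strategy injection into the System Prompt.
function buildSystemPrompt(basePrompt: string, strategyMarkdown: string): string {
  return [
    basePrompt,
    '## Scene Strategy (must be strictly followed)',
    'For the following common scenarios, proven analysis pipelines exist. ' +
      'All phases must be fully executed and cannot be skipped.',
    strategyMarkdown, // full .strategy.md content, injected as-is
  ].join('\n\n');
}
```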

Layer 2: Planning Gate – forces planning first, but doesn’t limit plan content

Before executing any SQL queries or Skill invocations, Claude must first call submit_plan to submit an analysis plan. Calling execute_sql or invoke_skill without submitting a plan is directly rejected:

function requirePlan(toolName: string): string | null {
  if (analysisPlanRef.current) return null; // Plan exists, allow
  return `Must call submit_plan to submit an analysis plan before using ${toolName}`;
}

The key point is: the Gate only requires a plan to exist, not that it precisely matches the Strategy’s phases. Claude can submit any plan structure – it can merge Phase 1 and 1.5, add extra steps not mentioned in the Strategy, or adjust deep drill directions based on preliminary data.

When submitting a plan, the system performs scene-aware keyword checking (e.g., for scrolling scenarios, checking whether the plan mentions “frames,” “jank,” etc.), but this is only at the warning level – the plan is accepted even without these keywords.
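The warning-level check might look like this sketch; the function name and per-scene keyword lists are assumptions:

```typescript
// Hypothetical warning-level plan check: missing scene keywords produce
// warnings, never a rejection -- the plan is accepted either way.
const SCENE_PLAN_KEYWORDS: Record<string, string[]> = {
  scrolling: ['frame', 'jank'], // illustrative, not the shipped list
};

function checkPlanKeywords(scene: string, planText: string): string[] {
  const expected = SCENE_PLAN_KEYWORDS[scene] ?? [];
  const lower = planText.toLowerCase();
  return expected
    .filter(k => !lower.includes(k))
    .map(k => `warning: plan for "${scene}" does not mention "${k}"`);
}

checkPlanKeywords('scrolling', 'Phase 1: frame overview; Phase 1.9: jank deep drill');
// -> [] (no warnings)
```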

The purpose of this design is: force Claude to think clearly about what it wants to do before acting (planning discipline), without restricting how it thinks (planning freedom).

Layer 3: Verifier – multi-dimensional post-hoc checking

There can be gaps between planning and execution – Claude might submit a plan but actually skip a critical step. The Verifier performs multi-dimensional post-hoc checks after analysis completes, primarily using heuristic behavioral checks while supplementing with plan/hypothesis/scene completeness validation:

a) Scene completeness check – whether the analysis output covers the scene’s core content:

// Scrolling scenario: check if significant jank exists but Phase 1.9 deep drill was skipped
case 'scrolling': {
  const hasSignificantJank = /* detect if text mentions significant frame drops */;
  const hasDeepDrill = /* detect if blocking_chain / binder_root_cause etc. were called */;
  if (hasSignificantJank && !hasDeepDrill) {
    issues.push({
      severity: 'error',
      message: 'Scrolling analysis has jank but missing Phase 1.9 root cause deep drill -- reason_code is just a classification label, not the real root cause'
    });
  }
}

b) Hypothesis closure check – whether all submit_hypothesis calls have corresponding resolve_hypothesis calls.

c) Causal chain depth check – whether CRITICAL/HIGH severity findings contain sufficient causal connectors and mechanistic terminology (heuristic text matching).

d) Optional LLM review – using an independent Haiku model for evidence support verification (can be disabled).
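The hypothesis closure check (b) is the most mechanical of the four. A sketch, where the tool-call record shape is an assumption:

```typescript
// Hypothetical hypothesis-closure check: every submit_hypothesis must have
// a matching resolve_hypothesis for the same hypothesis id.
interface ToolCall { tool: string; args: { id?: string } }

function findUnresolvedHypotheses(calls: ToolCall[]): string[] {
  const submitted = calls
    .filter(c => c.tool === 'submit_hypothesis')
    .map(c => c.args.id ?? '');
  const resolved = new Set(
    calls.filter(c => c.tool === 'resolve_hypothesis').map(c => c.args.id),
  );
  return submitted.filter(id => !resolved.has(id));
}
```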

If the check finds ERROR-level issues, it triggers a Correction Prompt for Claude to fill in the gaps.

Note that the Verifier does not check whether Claude’s plan phases match the Strategy’s phase numbers – it checks whether “critical analysis actions are reflected in the output,” not whether “the plan format is correct.”

The complete constraint spectrum

Layering all three mechanisms together, different phases form a spectrum of constraint strength:

| Phase | Strategy tone | Planning Gate | Verifier check | Constraint strength |
| --- | --- | --- | --- | --- |
| Phase 1 (overview) | Suggestive | Plan required | Not individually checked | Medium |
| Phase 1.5 (architecture branch) | Suggestive | | Not checked | Low |
| Phase 1.7 (root cause branch) | Suggestive + conditional | | Not checked | Low |
| Phase 1.9 (root cause deep drill) | Must/Prohibited | | Checks if deep drill tools were called | High |
| Phase 2 (supplementary deep drill) | Optional | | Not checked | None |
| Phase 3 (comprehensive conclusion) | Must cover distribution | | Checks conclusion completeness | Medium |

Meanwhile, general.strategy.md (the fallback when no scene matches) is entirely soft guidance: it only provides a routing decision tree based on the user’s focus direction (CPU -> cpu_analysis, memory -> memory_analysis), with no mandatory phases. Claude has complete autonomy in the general scenario.

One-sentence summary

Strategy files tell Claude “analyzing scrolling issues requires at least these steps,” Planning Gate ensures it thinks before acting, and Verifier post-checks whether critical steps were actually performed. But within this framework, what specific data to query, which tools to use, and in what order – all of these are autonomously decided by Claude based on actual data.


Q3: What’s the biggest difference between Agent and Workflow? Where are the Agent’s capability boundaries and what determines them?

Question context: In building our Agent, we went from initially assuming the Agent could understand and make decisions about every Skill given to it, to now having essentially hardcoded a decision tree in the Skills. The stepping stones along the way were always: “we assumed the Agent had capability X, but it didn’t,” causing its output to deviate from our expectations, so we kept adding boundaries to the Skills until it became a hardcoded Workflow.

The essential difference: Who holds decision-making authority

Agent and Workflow aren’t two tools or two frameworks – they’re two ends of the same spectrum:

Hardcoded Workflow <------------------------------------> Fully Autonomous Agent
        |                                                          |
Developer controls every branch                          LLM decides everything
        |                                                          |
High determinism, low flexibility                 Low determinism, high flexibility

| Dimension | Workflow | Agent |
| --- | --- | --- |
| Control flow | Developer hardcodes if/else in code | LLM autonomously selects next step |
| Tool selection | Predefined execution order | LLM selects on-demand based on data |
| Branch conditions | Conditional logic in code | LLM reasoning |
| Failure handling | try/catch + retry logic | LLM self-reflection + direction change |
| Predictability | Highly deterministic | Highly uncertain |
| Adapting to new scenarios | Developer must add branches | Can explore autonomously |

But in engineering practice, almost no one operates at either extreme of the spectrum. Pure Workflows can’t handle unknown scenarios; pure Agents are unreliable on critical steps. Real-world production systems sit somewhere in the middle.

The root cause of your pitfalls: Making global assumptions about Agent capabilities

“Defaulting to ‘Agent can understand everything’ -> discovering it can’t -> constantly adding constraints -> becoming a hardcoded Workflow” – the fundamental problem with this path is: making a one-size-fits-all judgment about Agent capabilities.

But Agent capabilities vary enormously across different tasks:

| Capability dimension | LLM reliability | Who should handle it |
| --- | --- | --- |
| Intent understanding (what the user wants) | High | Agent (though simple scenarios can use keyword matching instead) |
| Plan formulation (how many steps, what order) | Medium | Needs a constraint framework: Strategy files provide structure, LLM fills in details |
| Data collection (what to query) | Medium | Semi-autonomous: Skills define what to query, Agent decides order and parameters |
| Data reasoning (attribution after seeing data) | High | Agent -- this is LLM's greatest value |
| Precise computation (numerical statistics) | Very low | Tool system (SQL / Skill Engine) |
| Self-evaluation (knowing if it's right) | Low | External Verifier; don't trust Agent self-assessment |

The correct approach isn’t “choose Agent globally or choose Workflow globally,” but assign by task:

Scene recognition     -> Workflow (deterministic logic, no LLM needed)
Data collection       -> Semi-Workflow (Skills define what to query, Agent decides order and params)
Reasoning/attribution -> Agent (this is LLM's core value; given enough data, it performs well)
Output formatting     -> Workflow (templated, deterministic)
Quality verification  -> Workflow (rule checking) + Agent (LLM review)

SmartPerfetto’s approach: A constraint strength spectrum

SmartPerfetto doesn’t choose between Agent and Workflow; instead, it sets different constraint strengths for different analysis phases (detailed in Q2). Here we re-examine this design from the “capability boundary” perspective:

High constraint (Phase 1.9 root cause deep drill) – because the Agent is unreliable at “deciding whether to deep drill”:

# Phase 1.9 in scrolling.strategy.md

**Phase 1.9 -- Root Cause Deep Drill (Red circle mandatory, cannot skip):**

For each reason_code accounting for >15% in batch_frame_root_cause,
you **must** select the most severe frame for deep drilling.
**Prohibited:** drawing conclusions solely from batch_frame_root_cause statistical classifications.

Why the hard constraint? Because we found the Agent has a systematic bias: it tends to jump straight to conclusions after getting overview data, skipping the deep drill. This isn’t because the model isn’t smart enough – Claude is perfectly capable of root cause deep drilling – but because the model exhibits “path dependency”: overview data already contains statistical classifications (reason_code), and for the model, “using classification labels directly for conclusions” has much lower cognitive cost than “spending 3 tool-call rounds doing per-frame deep drilling.”

Low constraint (Phase 1.5 architecture branch) – because the Agent is reliable enough at “selecting tools based on data”:

# Phase 1.5 in scrolling.strategy.md

**Phase 1.5 -- Architecture-Aware Branching:**

| Architecture | Adjustment action |
| Flutter | Switch to flutter_scrolling_analysis |
| WebView | Note CrRendererMain thread |

This uses suggestive language like “switch to” and “note” without enforcement. Because the architecture detection result (Flutter/WebView/Standard) has already been placed in the system prompt by deterministic code, the Agent has a high probability of selecting the correct Skill after seeing this information.

Zero constraint (general scenario) – because the Agent’s autonomous exploration is the only option:

# general.strategy.md -- only a routing decision tree, no mandatory steps

Scene: general
priority: 99

The current query did not match a specific scene strategy. Based on the user's
focus direction, use the following decision tree to select the appropriate analysis path.

The general scenario has zero hard constraints because entering general means the user’s question exceeds predefined scenarios, and a Workflow can’t handle it. At this point, the only option is to trust the Agent’s autonomous exploration capability.

What determines Agent capability boundaries

Agent capability boundaries don’t depend on model parameter count or benchmark scores, but on three engineering factors:

1. Observation capability – what data can the Agent “see”

The same model, given structured L2 per-frame data from the scrolling_analysis Skill vs. writing SQL to query raw tables itself, produces significantly different analysis quality. The Agent’s ceiling is determined by the data tools you provide. SmartPerfetto uses 164 YAML Skills to encapsulate domain experts’ query logic; the Agent gets processed, structured analysis data through invoke_skill, not raw millions of trace events.

2. Constraint framework – within what bounds does the Agent make decisions

An unconstrained Agent is like an intern without a task checklist – knowledgeable enough but unsure what to do first. Strategy files, Planning Gate, and Verifier together define the Agent’s decision boundaries: Strategy tells it “at minimum, what must be done,” Planning Gate forces it to “think before acting,” and Verifier post-checks “whether the analysis is sufficient” (heuristic checks + hypothesis closure + scene completeness + optional LLM review).

3. Feedback quality – can the Agent be corrected when wrong

A significant proportion of Agent findings have issues (shallow attribution, false positives, missing critical steps). Relying solely on model self-correction has limited effectiveness. SmartPerfetto uses multi-layer verification + external correction prompts to close the loop:

Verifier finds ERROR -> Generates Correction Prompt -> Triggers SDK retry
        ^                                                      |
Learned patterns <- Accumulated historical misjudgment patterns
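The loop above can be sketched as a small synchronous driver. This is a hypothetical illustration; the real ClaudeRuntime wiring is asynchronous and SDK-driven, and the function names are assumptions:

```typescript
// Hypothetical verification-correction loop: rerun analysis with a
// Correction Prompt while ERROR-level issues remain, up to a retry budget.
interface Issue { severity: 'error' | 'warning'; message: string }

function runWithCorrection(
  analyze: (correctionPrompt?: string) => string,
  verify: (report: string) => Issue[],
  maxRetries = 2,
): string {
  let report = analyze();
  for (let i = 0; i < maxRetries; i++) {
    const errors = verify(report).filter(iss => iss.severity === 'error');
    if (errors.length === 0) break; // passed verification
    // Build a Correction Prompt from ERROR-level issues and retry
    const correction =
      'Fix the following issues:\n' + errors.map(e => `- ${e.message}`).join('\n');
    report = analyze(correction);
  }
  return report;
}
```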

Addendum: Strategy files are SOPs, and that’s fine

Some might point out that scrolling.strategy.md reads like an SOP (Standard Operating Procedure) – with numbered Phases, condition tables, mandatory items, and even explicit invoke_skill("scrolling_analysis", {...}). How is this different from “hardcoding a decision tree in Skills”?

Let’s be direct: in the data collection phase, SmartPerfetto’s scrolling analysis is a Workflow. Strategy files are SOPs that encode domain experts’ analysis experience into deterministic steps. This is intentional.

The key is understanding what the SOP covers and what it doesn’t:

What the SOP can cover (data collection) – what Strategy files do:

scrolling.strategy.md:
Phase 1: "Invoke scrolling_analysis" <- Hardcoded what data to collect
Phase 1.5: "Flutter switches to flutter_scrolling_analysis" <- Hardcoded conditional branch
Phase 1.7: Condition table -> deep drill actions <- Hardcoded if/then table
Phase 1.9: "Red circle reason_code >15% must deep drill" <- Hardcoded mandatory items

What the SOP can’t cover (reasoning/attribution) – where the Agent’s value lies:

- Among 47 jank frames, which share the same root cause? (Data clustering)
- Among 19 workload_heavy frames, which is "most severe" and worth deep drilling? (Priority judgment)
- Deep drill reveals Binder blocking 23ms + thermal throttling simultaneously; what's the causal direction? (Causal reasoning)
- How should the final conclusion be organized? How to differentiate recommendations for App developers vs. platform engineers? (Expression decisions)

Different scenarios have different SOP levels:

| Strategy file | SOP level | Reason |
| --- | --- | --- |
| scrolling.strategy.md | High – Phase numbers + condition tables + mandatory items | Scrolling analysis methodology is most mature; optimal data collection paths are known |
| startup.strategy.md | Medium-high – Has Phase structure, but deep drill directions are more open | Startup scenarios are more diverse (cold/warm/hot, different bottlenecks) |
| anr.strategy.md | Medium – 2-skill pipeline, but root cause analysis relies entirely on reasoning | ANR root causes are highly diverse |
| general.strategy.md | Low – Only a routing decision tree, no mandatory items | Unknown scenarios, impossible to turn into an SOP |

scrolling.strategy.md has the highest SOP level because scrolling analysis methodology is the most mature. general.strategy.md has almost no SOP because user questions are completely unpredictable.

So the correct understanding is: SmartPerfetto = “SOP-driven data collection + Agent-driven reasoning/attribution.”

The SOP addresses “what data to look at, at minimum, when analyzing scrolling issues” – this question has a deterministic answer, and using an SOP is correct. The Agent addresses “how to reason about causality and organize conclusions after getting the data” – this question differs for every trace and can’t be turned into an SOP.

Back to your pitfalls: The problem isn’t “Skills becoming SOPs” – the data collection phase should use SOPs. The problem is “the SOP consuming the reasoning” – if the SOP hardcodes even the conclusions (“if you see X, output Y”), the Agent truly degrades into a Workflow. The key is letting the SOP stop at data collection and leaving reasoning to the Agent.

One-sentence summary

The difference between Agent and Workflow isn’t “intelligent vs. hardcoded,” but “how decision-making authority is allocated.” Agent capability boundaries are jointly determined by “observation capability x constraint framework x feedback quality.” The correct approach is to allocate decision-making authority by task – granting autonomy where the Agent is reliable, adding constraints where it isn’t – rather than making a one-size-fits-all choice.


Q4: Does the Agent architecture need improvement from a business perspective?

Question context: Agent architecture has gone through different designs and evolutions – from the initial ReAct architecture to LangGraph’s node-based architecture. How do different Agent architecture designs affect it? When building your own business Agent, should you consider the architecture’s impact on Agent performance? For example, SmartPerfetto has added different Skill loading modes on top of the Claude Agent SDK based on business understanding.

Essential differences of three mainstream architectures

| Architecture | Control flow model | Developer role | Best suited for |
| --- | --- | --- | --- |
| ReAct | Linear loop: Think -> Act -> Observe -> Think… | Define tools | Simple tasks with few tools and short paths |
| LangGraph | Node-based DAG graph: nodes=steps, edges=conditional jumps | Design graph structure + define nodes + write jump conditions | Deterministic processes with clear steps and limited branches |
| Native SDK | SDK manages turn loop, developer only defines tools | Define tools + inject context | Many tools, unpredictable paths, requiring LLM autonomous orchestration |

Their core difference lies in “who decides what to do next”:

  • ReAct: LLM makes complete decisions at every step (what to think, what to do, which tool to use); the framework just forwards
  • LangGraph: Developers predefine all possible paths (nodes + edges); LLM only makes local decisions within nodes
  • Native SDK: SDK manages the conversation loop, LLM autonomously selects tools, developers indirectly constrain through system prompts and tool design

Why SmartPerfetto chose Native SDK + custom constraint layers

SmartPerfetto’s architecture is Claude Agent SDK (Native SDK) + three constraint layers (Strategy/Planning Gate/Verifier):

Claude Agent SDK provides:                SmartPerfetto custom-built:
+- Turn loop (auto multi-turn mgmt)       +- Scene Classification (scene routing, <1ms)
+- Tool dispatching (MCP tool calls)      +- Strategy Injection (inject analysis strategy by scene)
+- Streaming (SSE event stream)           +- Planning Gate (force plan before execute)
+- Session resume (multi-turn recovery)   +- Verifier (post-hoc verification + correction retry)
+- Sub-agent orchestration                +- ArtifactStore (3-level cache for token compression)
                                          +- Conditional Tool Loading (inject/hide tools by scene)
                                          +- Cross-Session Memory (pattern memory + negative memory)

Why not LangGraph?

The root cause reasoning paths in performance analysis are unpredictable. The same “scrolling stuttering” could be caused by:

  • Binder blocking -> needs to trace thread state on the system_server side
  • Slow GPU rendering -> needs to check GPU frequency and fence wait
  • GC pauses -> needs to examine Java heap and GC events
  • Thermal throttling -> needs to check thermal zone and CPU frequency
  • Lock contention -> needs to check monitor contention
  • A combination of multiple causes above

With LangGraph, you’d need to predefine a DAG node and jump condition for every root cause path. With 21 reason_codes in performance analysis, each combinable with deep drills, the combinatorial explosion of paths makes the DAG graph unmaintainable.

The more fundamental problem is: before seeing the data, you don’t know which path to take. LangGraph’s DAG graph assumes the developer can predict all branch conditions in advance, but performance analysis branch conditions depend on runtime data.

Advantages of the Native SDK architecture:

LLM autonomously selects tool paths, but three constraint layers ensure critical steps aren’t skipped:

# LangGraph requires predefined DAG:
graph.add_edge("overview", "check_binder")
graph.add_edge("overview", "check_gpu")
graph.add_edge("overview", "check_gc")
# ... every new root cause requires modifying the graph structure

# SmartPerfetto Strategy only needs to declare:
# "For each reason_code accounting for >15%, must select the most severe frame for deep drill"
# Agent autonomously decides which specific deep drill tool to call

Key business-driven architectural designs

The following are architectural improvements SmartPerfetto made based on business understanding, each directly corresponding to a business problem:

1. Conditional tool loading – reduce the Agent’s decision space

SmartPerfetto has up to 20 MCP tools total (9 always-on + 11 conditionally injected), but only a subset is injected per analysis:

// claudeMcpServer.ts -- switch tool sets by mode

if (options.lightweight) {
  // Factual queries (e.g., "what's the frame rate"): only 3 tools
  toolEntries = [executeSql, invokeSkill, lookupSqlSchema];
} else {
  // Full analysis: 9 always-on (incl. recall_patterns) + conditionally inject by context
  // Comparison mode -> inject compare_skill, execute_sql_on, get_comparison_context
  // Hypothesis management -> inject submit_hypothesis, resolve_hypothesis, flag_uncertainty
  // Planning tools -> inject submit_plan, update_plan_phase, revise_plan
}

Business reason: More tools means higher probability of the Agent selecting the wrong one. A query that just needs a quick answer to “what’s the frame rate,” if presented with a dozen planning/hypothesis/comparison tools, might lead the Agent to over-analyze.

2. Sub-Agent scene gating – avoid unnecessary parallel overhead

// claudeAgentDefinitions.ts -- only enable sub-agents in complex scenarios

const ORCHESTRATOR_ONLY_TOOLS = new Set([
  'submit_plan', 'update_plan_phase', 'revise_plan',
  'submit_hypothesis', 'resolve_hypothesis', 'flag_uncertainty',
  'compare_skill', 'execute_sql_on', 'get_comparison_context',
]);

// Sub-agents only get data collection tools, not planning/hypothesis tools
// Design principle: sub-agents collect evidence, orchestrator makes diagnosis

| Scenario | Sub-Agent configuration | Reason |
| --- | --- | --- |
| scrolling | frame-expert + system-expert | Frame analysis and system analysis are suitable for splitting, coordinated by orchestrator |
| startup | startup-expert + system-expert | Startup phase analysis and resource contention analysis are suitable for splitting |
| anr | No sub-agent | ANR is a 2-skill pipeline; extra sub-agents would only add overhead |

Note: The actual parallelism of sub-agents depends on the SDK’s internal scheduling strategy. We design prompts for parallel evidence collection, but actual execution may be sequential.

3. Lightweight vs Full dual mode – quick Q&A doesn’t go through the full pipeline

When users ask “what’s the frame rate of this trace,” there’s no need to go through the full Planning -> Skill -> Verification pipeline. SmartPerfetto’s ClaudeRuntime performs complexity classification at the entry point:

analyze(query)
  |
  queryComplexity === 'quick'
    -> analyzeQuick(): 3 tools, no Planning Gate, no Verifier
    -> Directly answers factual questions

  queryComplexity === 'full'
    -> Full pipeline: Planning -> Skill -> Verification -> Correction
    -> Systematic analysis

Business reason: A significant proportion of user questions are factual queries (“what’s the frame rate,” “is there an ANR”), and running the full pipeline would unnecessarily increase latency.
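A toy version of this entry-point split is shown below. The heuristic patterns are assumptions for illustration, not the real classifier:

```typescript
// Hypothetical complexity classifier: factual lookups take the fast path,
// everything else goes through the full Planning -> Skill -> Verification pipeline.
type QueryComplexity = 'quick' | 'full';

const FACTUAL_PATTERNS = [/what('| i)s the/i, /is there (a|an)/i, /how many/i]; // illustrative

function classifyComplexity(query: string): QueryComplexity {
  return FACTUAL_PATTERNS.some(p => p.test(query)) ? 'quick' : 'full';
}

classifyComplexity("what's the frame rate of this trace"); // 'quick'
classifyComplexity('analyze why scrolling janks in this trace'); // 'full'
```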

Answering “whether architecture needs business-driven improvement”

Yes, but the improvement direction isn’t switching the underlying framework (ReAct -> LangGraph), but adding business constraint layers on top of the existing framework. Specific recommendations:

  1. Extract Strategy files from your Skill decision tree: Rewrite the hardcoded if/else decision logic in code as natural language analysis strategies (Markdown files), injected into the system prompt by scene. This way, domain experts can directly modify analysis logic without touching code.

  2. Add a Planning Gate: The requirePlan() implementation is extremely simple (under 10 lines of code), but the effect is significant – forcing the Agent to think before acting empirically reduces going off-track dramatically.

  3. Add a post-hoc Verifier: Don’t check whether the Agent’s intermediate steps are “correct” (this is hard to judge); only check whether critical steps “happened” (this is easy to judge).

  4. Dynamically adjust tool sets and constraint strength by scenario/complexity: Not all queries need the same analysis depth; give simple queries a fast path.

One-sentence summary

Architecture choice is a business problem, not a technical one. ReAct/LangGraph/SDK are just different implementations of control flow – what truly affects Agent performance is the constraint layer you build on top of the control flow. SmartPerfetto chose Native SDK not because it’s the most advanced, but because performance analysis root cause paths are unpredictable, and predefined DAGs are less effective than letting the Agent explore autonomously within a constraint framework.


Q5: How should a performance AI agent handle scene recognition?

Question context: We want to do scene recognition and route to the correct Skill. When building this, should we rely more on “user utterance” or “logs,” or is there a better approach? Several paths for scene recognition each have issues: code matching (keyword matching leads to results that are too broad or too narrow), LLM understanding (LLM understanding isn’t necessarily accurate), log reconstruction (can filter for scrolling presence, but that may not be what the user cares about).

SmartPerfetto’s approach: Three signal layers with clear division of labor

SmartPerfetto’s scene recognition doesn’t rely on a single signal source but uses three signal layers working together, each solving a different problem:

Layer 1: User utterance -- keyword matching -> scene type (scrolling / startup / anr / ...)
Layer 2: Trace data -- deterministic detection -> architecture info (Flutter / WebView / Standard / Compose)
Layer 3: Data completeness -- table existence checks -> available analysis dimensions (GPU data available? thermal throttling data available?)

Layer 1: User utterance (keyword matching, <1ms)

// sceneClassifier.ts -- 46 lines of code, handles all scene classification

export function classifyScene(query: string): SceneType {
  const scenes = getRegisteredScenes(); // Loaded from 12 .strategy.md YAML frontmatters
  const lower = query.toLowerCase();

  const sorted = scenes
    .filter(s => s.scene !== 'general')
    .sort((a, b) => a.priority - b.priority); // ANR(1) > startup(2) > scrolling(3) > ...

  for (const scene of sorted) {
    // Match compound patterns first (more specific, e.g., "startup.*slow")
    if (scene.compoundPatterns.some(p => p.test(query))) return scene.scene;
    // Then match single keywords
    if (scene.keywords.some(k => lower.includes(k))) return scene.scene;
  }
  return 'general'; // Fallback
}

Keywords are defined in each Strategy file’s YAML frontmatter, not hardcoded in TypeScript:

# scrolling.strategy.md frontmatter
keywords:
- 滑动
- 卡顿
- 掉帧
- jank
- scroll
- fps
- frame
- 列表
- 流畅
- fling
- stuttering
- dropped frame
- 不流畅
# ... 30+ keywords total

Why keyword matching instead of LLM?

  1. Cost: Scene classification executes at the entry of every analysis; keyword matching is <1ms + 0 tokens; LLM calls take ~500ms + ~500 tokens
  2. Determinism: The cost of misclassification is very high (injecting the wrong Strategy file); keyword matching behavior is fully predictable
  3. Sufficiently accurate: In the performance analysis domain, user queries are highly formatted – someone saying “scrolling stuttering” is asking about scrolling, someone saying “slow startup” is asking about startup. No LLM needed to “understand” this

What about cases where keyword matching falls short?

Keyword matching does have boundaries – when a user says “why is this app slow,” keywords can’t determine whether it’s slow startup or slow scrolling. SmartPerfetto handles this by: falling back to the general scene when no match is found, letting the Agent autonomously choose a direction within the general strategy’s routing decision tree.

# general.strategy.md -- no hard constraints, only routing suggestions

| User focus direction | Recommended path |
| CPU / scheduling / threads | invoke_skill("cpu_analysis") |
| Memory / OOM / leaks | invoke_skill("memory_analysis") |
| Unknown direction | invoke_skill("scene_reconstruction") -> route by scene |

The core idea of this design is: don’t try to achieve 100% accurate classification at the entry point; instead, let accurate cases take the fast path (keywords -> Strategy), and uncertain cases take the exploration path (general -> Agent autonomous routing).

Layer 2: Trace data – architecture detection (deterministic code)

Scene classification only resolves “what the user wants to analyze,” but the same scenario (e.g., scrolling) requires completely different analysis paths under different rendering architectures:

| Architecture | Rendering pipeline | Analysis differences |
| --- | --- | --- |
| Standard Android | UI Thread -> RenderThread -> SurfaceFlinger | Dual-track analysis of main thread + RenderThread |
| Flutter TextureView | 1.ui -> 1.raster -> JNISurfaceTexture -> RenderThread updateTexImage | Dual pipeline; need to analyze Flutter engine threads + texture bridging |
| Flutter SurfaceView | 1.ui -> 1.raster -> BufferQueue -> SurfaceFlinger | Single pipeline; doesn't go through RenderThread |
| WebView | CrRendererMain -> Viz Compositor | Chromium rendering pipeline; different thread names |
| Compose | UI Thread (Composition) -> RenderThread | Similar to Standard but with a Composition phase |

Architecture detection is delegated to the YAML skill rendering_pipeline_detection – it performs thread/Slice signal collection, pipeline scoring, and sub-variant determination at the SQL layer, supporting 24 fine-grained rendering architectures. The TypeScript side (architectureDetector.ts) is only responsible for calling the skill and mapping results; it doesn’t do direct if/else judgments:

rendering_pipeline_detection skill (SQL)
-> Collect thread signals (1.ui / 1.raster / CrRendererMain / RenderThread ...)
-> Pipeline scoring (FLUTTER_TEXTUREVIEW / WEBVIEW_BLINK / ANDROID_VIEW_STANDARD ...)
-> architectureDetector.ts maps to ArchitectureInfo type

Detection results are injected into the system prompt (via templates like arch-flutter.template.md), and the Agent selects the corresponding analysis tools after seeing the architecture information.
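The TypeScript mapping step can be sketched roughly as follows. This is a hypothetical sketch: the type names echo the article's `ArchitectureInfo`, but the score shape and the confidence formula are assumptions, not the real detector's output.

```typescript
// Hypothetical sketch: map the skill's pipeline-score rows to a typed result.
// The real architectureDetector.ts consumes the rendering_pipeline_detection
// skill output; the shapes below are illustrative assumptions.
type PipelineScore = { pipeline: string; score: number };
type ArchitectureInfo = { architecture: string; confidence: number };

function mapToArchitecture(scores: PipelineScore[]): ArchitectureInfo {
  // Highest-scoring pipeline wins; confidence is its share of the total score.
  const best = scores.reduce((a, b) => (b.score > a.score ? b : a));
  const total = scores.reduce((sum, s) => sum + s.score, 0);
  return {
    architecture: best.pipeline,
    confidence: total > 0 ? best.score / total : 0,
  };
}
```

The point of keeping this layer so thin is that all architecture knowledge lives in the SQL skill; the TypeScript side never grows its own if/else tree.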

Layer 3: Data completeness – capability register

Different trace capture configurations yield different available data dimensions. Some traces lack GPU frequency data; others lack thermal zone data. SmartPerfetto probes the availability of 18 data dimensions before analysis begins:

1
2
3
4
5
frame_rendering:    OK (456 rows)
cpu_scheduling:     OK (12000 rows)
gpu:                MISSING (no gpu_frequency counter)
thermal_throttling: OK (4 zones)
binder_ipc:         OK (890 transactions)

This information is likewise injected into the system prompt, telling the Agent which dimensions can be analyzed and which lack data. This prevents the Agent from invoking a Skill with no data backing, getting empty results, then switching directions – such trial-and-error wastes 1-2 tool calls’ worth of tokens.
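Turning probe results into that report can be sketched as a small pure function. The `ProbeResult` shape and function name below are hypothetical; the real probe issues SQL existence checks against the trace before this formatting step.

```typescript
// Hypothetical sketch of formatting data-completeness probe results.
// The dimension names and report format mirror the article's example;
// the real implementation derives `rows` from per-dimension SQL checks.
type ProbeResult = { dimension: string; rows: number | null; note?: string };

function buildCompletenessReport(results: ProbeResult[]): string[] {
  return results.map(r =>
    r.rows !== null && r.rows > 0
      ? `${r.dimension}: OK (${r.rows} rows)`
      : `${r.dimension}: MISSING${r.note ? ` (${r.note})` : ''}`
  );
}
```

Because the report distinguishes "present with N rows" from "missing," the Agent can rule out whole analysis dimensions before spending a single tool call on them.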

Evaluating the three paths from the question

1. Code matching (keywords): Viable, but needs well-designed fallback

The questioner said “keyword matching leads to results that are too broad or too narrow.” SmartPerfetto’s experience:

  • Priority ordering solves the “too broad” problem: ANR(1) > startup(2) > scrolling(3); when multiple keywords match simultaneously, take the highest priority
  • Compound patterns improve precision: /startup.*slow/ is more precise than matching “startup” alone
  • general fallback solves the “too narrow” problem: when nothing matches, don’t guess – hand it to Agent for autonomous exploration

2. LLM understanding: Not recommended for the classification entry point; can be used in the fallback path

LLM classification in SmartPerfetto isn’t Layer 1 but rather the Agent’s autonomous routing within the general scenario – at that point, the Agent already has trace data and can make more accurate judgments combined with the data.

3. Log reconstruction: Suitable as a supplementary signal for Layer 2

Logs can tell you “what’s in the trace” (whether scrolling events exist, whether ANR exists), but can’t tell you “what the user cares about.” SmartPerfetto’s data completeness probing plays exactly this role – it doesn’t participate in scene classification but provides the Agent with data availability information.

One-sentence summary

Scene recognition shouldn’t try to solve all problems with a single signal source. Use keyword matching for fast routing (accurate cases), use general fallback for Agent autonomous exploration (uncertain cases), and use trace data for architecture and completeness supplementation. Keyword matching + priority ordering + compound patterns + fallback strategy – 46 lines of code is sufficient.


Q6: How to better leverage AI autonomous exploration in “deterministic steps + AI exploration”?

Question context: We’ve found in production that when AI explores autonomously and drills into root causes, it tends to go off track and give incorrect results. For example, in the SmartPerfetto blog post example – after identifying RenderThread being blocked by Binder in the earlier steps (based on deterministic steps), are the subsequent hypothesis formations and validations pure AI, or do we give the AI some common causes as guidance for Binder blocking and let it investigate on its own?

First, answering the core question: It’s neither pure AI nor hardcoded guidance

SmartPerfetto’s approach is structured reasoning framework + on-demand knowledge injection:

Deterministic steps produce data
|
Agent forms hypothesis (autonomous, but constrained by reasoning framework)
|
Agent selects verification tool (autonomous, but guided by Strategy suggestions)
|
Verification result feedback
|
Hypothesis confirmed -> go deeper / Hypothesis rejected -> backtrack and change direction
|
Verifier post-hoc check

Three key mechanisms make AI autonomous exploration more reliable:

Mechanism 1: Hypothesis management tools – adding structure to the reasoning process

SmartPerfetto provides submit_hypothesis and resolve_hypothesis as two MCP tools. Instead of letting the Agent reason implicitly in internal monologue, it forces externalization:

Agent calls:
  submit_hypothesis({
    description: "system_server slow Binder response causing RenderThread blocking",
    expected_evidence: "system_server's corresponding Binder transaction thread_state shows prolonged Runnable/Sleeping"
  })

Agent calls:
  execute_sql("SELECT ... FROM thread_state WHERE utid = ... AND ts BETWEEN ...")

Agent calls:
  resolve_hypothesis({
    id: "h1",
    outcome: "rejected",
    evidence: "system_server Binder thread state is normal, response time <2ms,
               RenderThread blocking cause is dequeueBuffer waiting for SurfaceFlinger"
  })

Why is this effective? The hypothesis management tools force the Agent to explicitly declare “what I’m verifying” and “what I expect to see” before taking action. This has two benefits:

  1. Prevents goal drift – the Agent won’t forget what it originally wanted to verify while collecting data
  2. Auditable – every hypothesis has a complete record; the Verifier can check whether all hypotheses were resolved
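The bookkeeping behind these two tools can be sketched as a small store. The class, field names, and id scheme below are assumptions; only the submit/resolve semantics and the Verifier's closure check come from the article.

```typescript
// Hypothetical sketch of the hypothesis bookkeeping behind
// submit_hypothesis / resolve_hypothesis.
type Hypothesis = {
  id: string;
  description: string;
  expectedEvidence: string;
  outcome?: 'confirmed' | 'rejected' | 'refined';
  evidence?: string;
};

class HypothesisStore {
  private items = new Map<string, Hypothesis>();
  private seq = 0;

  submit(description: string, expectedEvidence: string): string {
    const id = `h${++this.seq}`;
    this.items.set(id, { id, description, expectedEvidence });
    return id;
  }

  resolve(id: string, outcome: Hypothesis['outcome'], evidence: string): void {
    const h = this.items.get(id);
    if (!h) throw new Error(`unknown hypothesis ${id}`);
    h.outcome = outcome;
    h.evidence = evidence;
  }

  // Used by the Verifier's closure check: every submitted hypothesis
  // must be resolved before the final answer is accepted.
  unresolved(): string[] {
    return Array.from(this.items.values())
      .filter(h => !h.outcome)
      .map(h => h.id);
  }
}
```

The `unresolved()` query is what makes the "auditable" property cheap: the Verifier never needs to judge whether a hypothesis was correct, only whether it was closed.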

Mechanism 2: Knowledge injection – not hardcoded guidance, but on-demand domain knowledge loading

“Could we give the AI some common causes as guidance for Binder blocking?”

Yes, but not hardcoded in Skills. Instead, it’s loaded on demand through the lookup_knowledge MCP tool. After discovering Binder blocking, the Agent can call:

1
invoke lookup_knowledge("binder-ipc")

This returns a Binder IPC knowledge template (knowledge-binder-ipc.template.md) containing:

  • Classification of typical Binder transaction blocking causes (server-side busy, process frozen, CPU scheduling delay, oneway queue full)
  • Investigation paths and key metrics for each cause
  • Common misdiagnosis scenarios (e.g., oneway transactions don’t block the caller)

Key design: Knowledge is actively pulled by the Agent, not force-injected by the system. There are currently 8 knowledge templates (rendering-pipeline, binder-ipc, gc-dynamics, cpu-scheduler, thermal-throttling, lock-contention, startup-root-causes, data-sources). If all templates were pre-injected into the system prompt, it would consume a large number of tokens and most would be irrelevant. Through on-demand loading via MCP tools, the Agent only retrieves the relevant domain background knowledge when needed.
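On-demand loading can be sketched as a thin lookup over the template directory. The path scheme, loader signature, and error message below are assumptions; the eight template names come from the list above.

```typescript
// Hypothetical sketch of the lookup_knowledge tool's core: resolve a topic
// to its knowledge template, or tell the Agent what IS available.
const KNOWLEDGE_TEMPLATES = new Set([
  'rendering-pipeline', 'binder-ipc', 'gc-dynamics', 'cpu-scheduler',
  'thermal-throttling', 'lock-contention', 'startup-root-causes', 'data-sources',
]);

function lookupKnowledge(topic: string, readFile: (path: string) => string): string {
  if (!KNOWLEDGE_TEMPLATES.has(topic)) {
    // Returning the valid topic list keeps a typo from wasting a turn.
    return `Unknown topic "${topic}". Available: ${Array.from(KNOWLEDGE_TEMPLATES).join(', ')}`;
  }
  return readFile(`strategies/knowledge-${topic}.template.md`);
}
```

Injecting the readFile dependency keeps the lookup testable and makes it obvious that no template content ever enters the prompt unless the Agent asks for it.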

SmartPerfetto also provides conditional deep drill suggestion tables in Strategy files, which is another form of guidance:

# scrolling.strategy.md Phase 1.9

| Condition | Deep drill action |
| Any reason_code Q4>20% | invoke_skill("blocking_chain_analysis", ...) |
| binder_overlap >5ms | invoke_skill("binder_root_cause", ...) |
| cpu_runnable_ratio >30% | invoke_skill("cpu_analysis", ...) |
| thermal_throttle detected | invoke_skill("thermal_throttling", ...) |
| gc_pause_total >10ms | invoke_skill("gc_analysis", ...) |

This table isn’t a hardcoded decision tree – it’s a lookup table for the Agent. The Agent decides which row to follow based on metric values in the data. If the data doesn’t match any row, the Agent can explore autonomously.

Mechanism 3: ReAct Reasoning Nudge – triggering reflection when tools return results

During the first few successful returns from data tools (execute_sql / invoke_skill), SmartPerfetto appends a reasoning prompt at the end of the result:

// claudeMcpServer.ts

const REASONING_NUDGE = '\n\n[REFLECT] Before executing the next step: ' +
  'What are the key findings from this data? Does it support/refute your hypothesis? ' +
  'If there are important inferences, please record them using submit_hypothesis or write_analysis_note.';

// Only appended during the first N data tool calls, then stopped (to control token cost)
const REASONING_NUDGE_MAX_CALLS = 4;

The cost is tiny (~20 tokens per call, ~80 tokens total for the first 4 calls), yet the effect is pronounced. The nudge isn't applied throughout in order to control token overhead in the latter half of analysis – the first few nudges already establish the pattern of "receive data -> reflect first -> then act." Without this nudge, the Agent tends to call tools back to back without pausing to think – collecting data 5 times without forming any intermediate conclusions, resulting in a poor final summary.
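The append logic around those two constants presumably looks something like the following sketch; the wrapper function and the module-level counter are assumptions, only the nudge text and the 4-call cap come from the quoted code.

```typescript
// Hypothetical sketch of appending the [REFLECT] nudge to the first few
// successful data-tool results (execute_sql / invoke_skill).
const REASONING_NUDGE =
  '\n\n[REFLECT] Before executing the next step: ' +
  'What are the key findings from this data? Does it support/refute your hypothesis? ' +
  'If there are important inferences, please record them using submit_hypothesis or write_analysis_note.';
const REASONING_NUDGE_MAX_CALLS = 4;

let dataToolCalls = 0; // Assumed per-session counter

function withNudge(toolResult: string): string {
  dataToolCalls++;
  // Append only during the first N data-tool calls, then stop to save tokens.
  return dataToolCalls <= REASONING_NUDGE_MAX_CALLS
    ? toolResult + REASONING_NUDGE
    : toolResult;
}
```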

Walking through the complete flow with the Binder example from the article

1. Start with overview -> discovers 47 jank frames, P90 = 23.5ms
   [Deterministic: invoke_skill("scrolling_analysis")]

2. Decide direction based on overview -> 40% stuck in APP phase, prioritize App side
   [Deterministic: data-driven, Strategy file suggestion]

3. Select representative frame for deep drill -> Frame #234 RenderThread blocked by Binder for 23ms
   [Deterministic: invoke_skill("jank_frame_detail")]

--- AI autonomous exploration begins below ---

4. Agent proactively loads knowledge: lookup_knowledge("binder-ipc")
   -> Gets Binder blocking common cause classification table
   [AI decision + knowledge guidance]

5. Agent forms hypothesis: submit_hypothesis("system_server slow Binder response")
   Expected evidence: server-side thread_state showing prolonged Sleeping/Runnable
   [AI reasoning, hypothesis tool forces externalization]

6. Agent verifies: execute_sql("query Binder counterpart thread_state")
   -> Discovers system_server CPU scheduling delay, not slow Binder response
   [AI autonomous tool call]

7. [REFLECT] nudge triggers reflection
   Agent: "Hypothesis 1 doesn't hold; server-side thread_state shows Runnable queuing,
   actual cause is CPU scheduling delay"
   -> resolve_hypothesis(outcome: "refined",
        evidence: "system_server CPU scheduling delay causing Binder thread queuing")
   [AI reasoning + investigation path from knowledge template]

8. Agent goes deeper: execute_sql("query CPU frequency + thermal zone")
   -> Discovers thermal throttling, CPU big cores frequency-limited to 50%
   [AI autonomously selects next deep drill direction]

9. Comprehensive conclusion: RenderThread Binder blocking <- system_server CPU scheduling delay <- thermal throttling
   [AI synthesis, forming WHY chain]

--- Verifier post-hoc check ---

10. Verifier heuristic checks:
    - Text pattern matching: does the conclusion reflect deep drill analysis (not just referencing overview data)
    - Hypothesis closure: do all submit_hypothesis calls have corresponding resolve_hypothesis calls
    - Scene completeness: does scrolling analysis include frame/jank related content
    - Causal chain heuristic: are there sufficient causal connectors and mechanistic terminology
    [Deterministic: heuristic rule checks, not precise verification]

Note that Steps 4-9 are all AI autonomous exploration, but constrained by three mechanisms:

  • Hypothesis tools force reasoning externalization (Steps 5, 7)
  • Knowledge injection provides domain investigation paths (Step 4)
  • REFLECT nudge triggers reflection after the first few tool returns (Step 7)

Four practical recommendations for making AI autonomous exploration more reliable

1. Give data, not conclusions

Skills should return structured data (frame durations, thread state distributions, blocking function lists), not pre-drawn conclusions (“RenderThread blocking is caused by Binder”). Having AI reason its own conclusions from data is more reliable than having it build further analysis on conclusions provided by others.

2. Give framework, not path

Strategy files should define “what must be done” (Phase 1.9 must deep drill), not “how to do it” (first query A, then B, then C). An Agent autonomously selecting paths within a framework constraint is far more reliable than free exploration without any constraints, yet far more flexible than hardcoded paths.

3. Give knowledge, not answers

Knowledge templates should contain “possible cause classifications and investigation methods,” not “if you see X, it’s Y.” The former helps the Agent build a reasoning framework; the latter turns the Agent back into a Workflow.

4. Verify behavior, not conclusions

Verifiers should use heuristic rules to check “whether the analysis output reflects critical actions” (does the conclusion show traces of deep drill analysis, are all hypotheses resolved, does the causal chain have sufficient depth), rather than trying to judge “whether the conclusion is correct” (this should be done offline with LLM Judge evaluation, not at runtime). Note that these are text pattern matching level heuristic checks, not precise tool call log audits.

One-sentence summary

AI autonomous exploration reliability isn’t guaranteed by “hardcoded guidance,” but by three mechanisms: hypothesis management tools externalize reasoning, on-demand knowledge injection provides domain investigation frameworks, and ReAct nudge prevents blind tool calling. The key principles are “give data not conclusions, give framework not path, give knowledge not answers.”


Q7: How is the Prompt assembled for each turn?

Question context: LLM Agent output quality depends heavily on system prompt design. How does SmartPerfetto construct the prompt for each analysis? How does the prompt change across different scenarios and different turns? How is the token budget controlled?

Overall design: Four-tier layered assembly + cache optimization

SmartPerfetto’s system prompt isn’t a static string but is dynamically assembled by the buildSystemPrompt() function (claudeSystemPrompt.ts:260) before each SDK query. The assembly follows a core principle:

Sort by “stability” – content that changes less frequently goes first, more dynamic content goes last.

The reason for this design is Anthropic API’s automatic caching mechanism: when the system prompt exceeds 1024 tokens, the API automatically caches the prompt prefix. By placing unchanging content at the very front, most of the prompt can hit the cache across multi-turn conversations, significantly reducing latency and cost:

Same trace + same scene:     ~4000 tokens cached (~80% savings)
Same trace + different scene: ~800 tokens cached (~18% savings)
Different trace: ~400 tokens cached (~8% savings)

Four-tier assembly structure

+-------------------------------------------------------+
| Tier 1: STATIC (unchanged within process lifetime) |
| +---------------------------------------------------+ |
| | prompt-role.template.md (~200 tokens) | |
| | -> Role definition: Android performance expert | |
| +---------------------------------------------------+ |
| | prompt-output-format.template.md (~850 tokens) | |
| | -> Output format: [CRITICAL]/[HIGH]/[MEDIUM]/[LOW] | |
| | -> Root cause reasoning chain format, Mermaid rules| |
| | -> Slice nesting rules, CPU frequency estimation | |
| +---------------------------------------------------+ |
+-------------------------------------------------------+
| Tier 2: PER-TRACE (stable within same trace) |
| +---------------------------------------------------+ |
| | Architecture info (~150 tokens) | |
| | -> "Flutter TextureView, confidence 92%" | |
| | -> + arch-flutter.template.md architecture guide | |
| +---------------------------------------------------+ |
| | Focus application (~100 tokens) | |
| | -> "com.example.app (primary focus), 456 frames" | |
| +---------------------------------------------------+ |
| | Data completeness (~200 tokens, droppable) | |
| | -> Only reports missing/insufficient dimensions | |
| +---------------------------------------------------+ |
| | SQL knowledge base ref (~300 tokens, droppable) | |
| | -> Tables/views/functions matched from stdlib index | |
| +---------------------------------------------------+ |
+-------------------------------------------------------+
| Tier 3: PER-QUERY (changes with scene/query) |
| +---------------------------------------------------+ |
| | Methodology + scene strategy (~1200 tokens) | |
| | -> prompt-methodology.template.md | |
| | -> {{sceneStrategy}} = scrolling.strategy.md | |
| | or one of 12 strategies (startup/anr/general...) | |
| +---------------------------------------------------+ |
| | Sub-agent collaboration (~200 tokens, droppable)| |
| | -> When to delegate vs direct call | |
| | -> Scrolling-specific parallel evidence collection | |
| +---------------------------------------------------+ |
+-------------------------------------------------------+
| Tier 4: PER-INTERACTION (may change every query) |
| +---------------------------------------------------+ |
| | User selection context (~300 tokens, not drop.) | |
| | -> Time range selection (selection-area.template.md)| |
| | -> Or Slice selection (selection-slice.template.md) | |
| +---------------------------------------------------+ |
| | Comparison mode context (conditionally injected) | |
| | -> Dual-trace comparison methodology + tool guide | |
| +---------------------------------------------------+ |
| | Conversation context (~500 tokens) | |
| | -> Analysis notes (<=10, sorted by priority) | |
| | -> Previous findings (<=10) | |
| | -> Known entities (for drill-down) | |
| | -> Conversation summary (cross-turn compressed, | |
| | <=2000 tokens) | |
| +---------------------------------------------------+ |
| | SQL pitfall records (<=5, droppable) | |
| | -> ERROR -> BAD SQL -> FIX SQL | |
| +---------------------------------------------------+ |
| | Historical analysis patterns (cross-session, drop.) | |
| | Historical pitfall records (cross-session, drop.) | |
| +---------------------------------------------------+ |
| | Historical analysis plans (<=3 turns, droppable) | |
| | -> Phase status: done/skipped/pending + summary | |
| +---------------------------------------------------+ |
+-------------------------------------------------------+

Template loading and variable substitution

All prompt content is defined in Markdown files (detailed in Q1); TypeScript only handles loading and variable substitution:

// strategyLoader.ts -- template system

// 1. Load template (DEV mode skips cache, browser refresh takes effect)
loadPromptTemplate('prompt-methodology') // -> strategies/prompt-methodology.template.md

// 2. Load scene strategy (extract keywords from YAML frontmatter, body as strategy content)
getStrategyContent('scrolling') // -> Markdown body of strategies/scrolling.strategy.md

// 3. Variable substitution
renderTemplate(methodologyTemplate, { sceneStrategy })
// Replaces {{sceneStrategy}} with scrolling.strategy.md content

Template file inventory:

| Category | File | Purpose |
| --- | --- | --- |
| Static templates | prompt-role.template.md | Role definition |
| Static templates | prompt-output-format.template.md | Output format rules (91 lines) |
| Static templates | prompt-quick.template.md | Quick mode streamlined prompt |
| Methodology | prompt-methodology.template.md | Analysis methodology (contains {{sceneStrategy}} placeholder) |
| Architecture guides | arch-standard.template.md | Standard Android rendering guidance |
| Architecture guides | arch-flutter.template.md | Flutter engine guidance |
| Architecture guides | arch-compose.template.md | Jetpack Compose guidance |
| Architecture guides | arch-webview.template.md | WebView guidance |
| Selection templates | selection-area.template.md | Time range selection ({{startNs}}, {{endNs}}…) |
| Selection templates | selection-slice.template.md | Slice selection ({{eventId}}, {{ts}}…) |
| Comparison mode | comparison-methodology.template.md | Dual-trace comparison methodology |
| Scene strategies | 12 *.strategy.md files | scrolling/startup/anr/memory/… |
| Knowledge templates | 8 knowledge-*.template.md files | On-demand domain knowledge (not injected into prompt) |
| Auxiliary templates | prompt-complexity-classifier.template.md | Quick/Full routing decision (not injected into the prompt, but determines which path a query takes) |

Token budget management

Budget ceiling: 4500 tokens (MAX_PROMPT_TOKENS). During correction retries, if SDK auto-compact is detected (conversation history automatically compressed), the budget is reduced to 3000 tokens to leave room; otherwise the original prompt is reused.

Token estimation method: Mixed Chinese-English estimation – Chinese characters at 1.5 tokens/character, ASCII at 0.3 tokens/character. This is a rough approximation but sufficiently accurate for budget management.
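Under those per-character weights, the estimator can be sketched as follows. Treating every non-ASCII character as CJK and rounding up at the end are simplifying assumptions; the article only specifies the 1.5 and 0.3 weights.

```typescript
// Sketch of the mixed Chinese-English token estimator: CJK characters
// count as 1.5 tokens, ASCII as 0.3. A rough approximation, sufficient
// for budget management.
function estimateTokens(text: string): number {
  let tokens = 0;
  for (const ch of text) {
    // Assumption: anything above the ASCII range is weighted like CJK.
    tokens += ch.charCodeAt(0) > 127 ? 1.5 : 0.3;
  }
  return Math.ceil(tokens);
}
```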

Progressive dropping strategy when over budget:

When the assembled prompt’s token count exceeds the budget, entire sections are dropped in order from lowest to highest priority:

Drop order (dropped first -> dropped last):
1. Perfetto SQL knowledge base reference -> Agent can use lookup_sql_schema tool instead
2. Trace data completeness -> Helpful but Agent can discover missing data at runtime
3. Historical analysis patterns -> Cross-session pattern memory, non-critical
4. Historical pitfall records -> Cross-session negative memory, droppable
5. SQL pitfall records -> Nice to have
6. Sub-agent collaboration -> Only useful when sub-agents are enabled
7. Historical analysis plans -> Supplementary context

Content that is never dropped:

  • Role definition, output format (Tier 1 static)
  • Architecture info, focus application (Tier 2 per-trace)
  • Methodology + scene strategy (Tier 3 core)
  • User selection context (user’s explicit intent)
  • Conversation context (previous findings and analysis notes)
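The progressive dropping described above can be sketched as a loop that removes whole droppable sections, lowest value first, until the estimate fits the budget. The section shape, function name, and estimator parameter are assumptions.

```typescript
// Sketch of progressive section dropping under a token budget.
// Never-droppable sections (role, strategy, user selection, ...) carry
// droppable: false and are skipped by the loop.
type PromptSection = {
  name: string;
  text: string;
  droppable: boolean;
  dropPriority: number; // lower = dropped earlier
};

function fitToBudget(
  sections: PromptSection[],
  budget: number,
  estimate: (text: string) => number,
): PromptSection[] {
  const kept = [...sections];
  const order = kept
    .filter(s => s.droppable)
    .sort((a, b) => a.dropPriority - b.dropPriority);
  for (const victim of order) {
    const total = kept.reduce((sum, s) => sum + estimate(s.text), 0);
    if (total <= budget) break; // within budget, stop dropping
    kept.splice(kept.indexOf(victim), 1); // drop the whole section
  }
  return kept;
}
```

Dropping entire sections rather than truncating text keeps every surviving section internally coherent, which matters more to the model than saving a few extra tokens.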

Complete context construction flow

In ClaudeRuntime.analyze(), prompt assembly is preceded by over twenty preparation phases to collect all context:

Phase 0:   Selection context logging
Phase 0.5: Focus app detection (3 methods: battery_stats / oom_adj / frame_timeline)
Phase 1: Skill executor initialization
Phase 2: Architecture detection (LRU cached, detected once per trace)
Phase 2.5: Vendor detection (OEM customization, LRU cached)
Phase 2.8: Comparison mode context (dual-trace mode)
Phase 2.9: Data completeness probing (18 dimensions, ~50ms)
Phase 3: Session context + conversation history
Phase 4: Entity store (drill-down references)
Phase 5: Scene classification (keyword matching, <1ms)
Phase 5.5: Cross-session pattern memory matching
Phase 6: ArtifactStore + analysis notes
Phase 6.5: Analysis plan (current + historical)
Phase 6.6: Watchdog feedback references
Phase 6.7: Hypothesis state
Phase 6.8: Uncertainty flags
Phase 7: SQL error tracking
Phase 8: MCP Server creation (inject all the above state)
Phase 9: (removed)
Phase 10: SQL knowledge base context
Phase 11: Sub-agent definitions
Phase 12: SQL error-fix pairs
Phase 13: -> buildSystemPrompt(context) -> final prompt

All Phase results feed into the ClaudeAnalysisContext object, passed to buildSystemPrompt() for final assembly.

Quick vs Full dual mode

Not all queries need the full 4500-token prompt. When users ask factual questions (e.g., “what’s the frame rate”), SmartPerfetto uses a streamlined quick prompt:

// buildQuickSystemPrompt() -- ~1500 tokens
// Loads prompt-quick.template.md
// Only injects {{architectureContext}} and {{focusAppContext}}
// No methodology, no scene strategy, no conversation context

| Dimension | Quick Mode | Full Mode |
| --- | --- | --- |
| Target tokens | ~1500 | ~4500 |
| Scene strategy | None | One of 12 |
| Methodology | None | prompt-methodology.template.md |
| Conversation context | None | findings + notes + entity + summary |
| Planning Gate | None | Yes |
| Verifier | None | Yes |
| Use case | "What's the frame rate", "Is there an ANR" | "Analyze scrolling stuttering", "Analyze startup performance" |

How the prompt changes across multi-turn conversations

In multi-turn analysis (user follow-ups or drill-downs), prompt changes depend on whether the SDK session hits resume:

Turn 1: No conversation context, no historical plans, no analysis notes

Turn 2 onward – SDK session resume hit (within 4 hours):

  • SDK internally already holds complete conversation history; previousFindings and conversationSummary are not re-injected
  • But still injected: analysis notes (<=10), entity context (drill-down references), historical plans (<=3 turns)
  • Tiers 1-3 remain unchanged, hitting ~80% cache

Turn 2 onward – SDK session expired or unavailable:

  • Previous turn’s findings are manually injected as “previous analysis findings” (<=10)
  • Conversation summary is manually injected (sessionContext.generatePromptContext(2000), <=2000 tokens)
  • Analysis notes, entity context, and historical plans same as above

Correction retry turn: If SDK auto-compact is detected (conversation history automatically compressed), token budget drops from 4500 to 3000, and progressive dropping more aggressively removes non-critical sections. If auto-compact hasn’t occurred, the original system prompt is reused.

A concrete example: Prompt assembly for scrolling analysis

User inputs "Analyze scrolling stuttering", Flutter TextureView architecture, Turn 1:

[Tier 1] prompt-role.template.md                    -> "You are an Android performance analysis expert..."
[Tier 1] prompt-output-format.template.md -> Output format rules
[Tier 2] "Architecture: Flutter TextureView, confidence 92%" -> + arch-flutter.template.md
[Tier 2] "Focus app: com.example.app (primary focus)" -> frame count + detection method
[Tier 2] "Data completeness: gpu MISSING, suspected not captured" -> only reports missing/insufficient
[Tier 2] SQL knowledge base: android_frames, slice_self_dur -> stdlib match results
[Tier 3] prompt-methodology + scrolling.strategy.md -> Phase 1->1.5->1.9->3 complete strategy
[Tier 4] (no selection, no conversation context, no historical plans)

Estimated: ~3200 tokens, within 4500 budget, no dropping needed

User follows up with "Deep dive into frame 3", Turn 2 (SDK session resume hit):

[Tier 1-3] Same as Turn 1 (hitting ~80% cache)
[Tier 4] Conversation context (SDK holds conversation history, findings not re-injected):
- Analysis notes: "Warning [Hypothesis] Primary jank cause is RenderThread Binder blocking"
- Entity context: frame#3's ID and time range
- Historical plan: "Done Phase 1 overview, Done Phase 1.9 deep drill, Pending Phase 3 conclusion"
(previousFindings and conversationSummary managed internally by SDK session, not duplicated in prompt)

Estimated: ~3400 tokens, within budget

One-sentence summary

The prompt is sorted by “stability” across four tiers (Static -> Per-Trace -> Per-Query -> Dynamic), leveraging API prefix caching to achieve ~80% token savings across multi-turn conversations. The template system lets domain experts directly edit analysis strategies without touching TypeScript. When over budget, progressive dropping by priority occurs, but role definition, scene strategy, and user selection are always preserved – these three determine the analysis direction and scope.


Q8: What Skills does SmartPerfetto have?

Question context: SmartPerfetto’s analysis capabilities are carried by YAML Skills. A complete Skill inventory helps understand the system’s analysis coverage.

Overview

| Category | Count | Description |
| --- | --- | --- |
| Atomic | 87 | Single-step detection/statistics, completed with one or a few SQL statements |
| Composite | 29 | Combines multiple atomic skills; supports iterators/conditionals |
| Deep | 2 | Deep profiling (callstack, CPU profiling) |
| Pipeline | 28 | Rendering pipeline detection + teaching (24+ architectures) |
| Module | 18 | Modular configuration: app/framework/hardware/kernel |
| Total | 164 | |

Atomic Skills (87)

Single-step data extraction and detection – the building blocks for all higher-level Skills.

Frame rendering and jank:

Skill ID One-line description
consumer_jank_detection Detect real frame drops from SF consumer perspective (per-layer buffer starvation)
frame_blocking_calls Identify blocking calls during each jank frame (GC, Binder, locks, IO)
frame_production_gap Detect frame production gaps: gaps between consecutive frames exceeding 1.5x VSync
frame_pipeline_variance Detect frame duration jitter and high-variance intervals
render_pipeline_latency Break down latency across all stages of the frame rendering pipeline
render_thread_slices Analyze RenderThread time slice distribution
app_frame_production Analyze application main thread frame production
sf_frame_consumption Analyze SurfaceFlinger frame consumption
sf_composition_in_range Analyze SurfaceFlinger composition latency
sf_layer_count_in_range Count active SF layers within a time range
present_fence_timing Analyze Present Fence timing, detecting actual display latency
game_fps_analysis Game-specific frame rate analysis, supporting fixed frame rate modes

VSync and refresh rate:

Skill ID One-line description
vsync_period_detection Detect VSync period, return refresh rate and confidence
vsync_config Parse actual VSync period and refresh rate settings from trace
vsync_alignment_in_range Analyze frame-to-VSync signal alignment
vsync_phase_alignment Analyze input event to VSync phase relationship, locating touch-to-display latency bottlenecks
vrr_detection Detect whether the device uses variable refresh rate (VRR/LTPO/Adaptive Sync)

CPU and scheduling:

Skill ID One-line description
cpu_topology_detection Dynamically detect CPU big.LITTLE core topology from cpufreq
cpu_topology_view Create reusable SQL VIEW _cpu_topology
cpu_slice_analysis Analyze CPU time slice distribution (with dynamic topology detection)
cpu_load_in_range Analyze per-CPU core load within a specified time range
cpu_cluster_load_in_range Calculate overall CPU load percentage for big and little core clusters
cpu_freq_timeline Analyze per-CPU core frequency change timeline
cpu_throttling_in_range Detect CPU thermal throttling situations
sched_latency_in_range Analyze thread scheduling wait time distribution, detecting CPU contention
scheduling_analysis Analyze thread scheduling latency (Runnability)
task_migration_in_range Analyze thread migration frequency between big and little cores
thread_affinity_violation Detect high-frequency core migration of main thread/RenderThread
thermal_predictor Predict thermal throttling risk based on CPU frequency trends
cache_miss_impact Count cache-miss counters and evaluate fluctuation

GPU:

Skill ID One-line description
gpu_render_in_range Analyze GPU rendering duration and Fence wait
gpu_freq_in_range Analyze GPU frequency changes
gpu_metrics Analyze GPU frequency, utilization, and rendering performance
gpu_power_state_analysis Analyze GPU frequency state transitions, identifying frequency reduction pressure and jitter

Main thread analysis:

Skill ID One-line description
main_thread_states_in_range Count main thread states, blocking functions, and percentages within a range
main_thread_slices_in_range Count main thread slice duration distribution within a range
main_thread_sched_latency_in_range Count main thread Runnable wait time distribution
main_thread_file_io_in_range Count main thread file IO related slice durations within a range

Binder IPC:

Skill ID One-line description
binder_in_range Analyze Binder transactions within a specified time range
binder_blocking_in_range Analyze counterpart process response delays in synchronous Binder calls
binder_root_cause Perform server/client-side blocking cause attribution for slow Binder transactions
binder_storm_detection Detect Binder transaction storms: too many IPC calls in a short period

Locks and synchronization:

Skill ID One-line description
lock_contention_in_range Analyze lock contention within a specified time range
futex_wait_distribution Count futex/mutex lock wait distribution and duration

Startup-specific (19):

Skill ID One-line description
startup_events_in_range Query startup events and TTID/TTFD metrics
startup_slow_reasons Startup slow reasons (Google official classification + self-check) v3.0
startup_critical_tasks Auto-identify all active threads during startup interval, sorted by CPU time
startup_thread_blocking_graph Build thread block/wakeup relationship graph using waker_utid
startup_jit_analysis Analyze JIT compilation thread impact on startup speed
startup_cpu_placement_timeline Analyze main thread core type changes by time bucket, detecting stuck-on-little-core during startup
startup_freq_rampup Analyze CPU frequency ramp-up speed during cold start, detecting frequency scaling delays
startup_binder_pool_analysis Analyze Binder thread pool utilization and saturation during startup
startup_hot_slice_states Analyze thread state distribution of Top N hot slices during startup interval
startup_main_thread_states_in_range Count main thread Running/Runnable/Blocked percentages during startup
startup_main_thread_slices_in_range Count main thread slice hotspots during startup
startup_binder_in_range Count Binder call distribution during startup
startup_main_thread_file_io_in_range Count main thread file IO during startup
startup_sched_latency_in_range Count main thread Runnable wait latency during startup
startup_main_thread_sync_binder_in_range Count main thread synchronous Binder duration during startup
startup_main_thread_binder_blocking_in_range Analyze main thread synchronous Binder blocking details during startup
startup_breakdown_in_range Count attribution reason time percentages during startup
startup_gc_in_range Count GC slices and main thread percentage during startup
startup_class_loading_in_range Count class loading slice durations during startup

Memory and GC:

Skill ID One-line description
gc_events_in_range Query GC events for a given process and optional time range
memory_pressure_in_range Analyze memory pressure metrics within a specified time range
page_fault_in_range Analyze Page Fault and memory reclaim impact on performance

Input and touch:

Skill ID One-line description
input_events_in_range Extract raw input events within a range, analyzing dispatch latency
input_to_frame_latency Measure latency from each MotionEvent to corresponding frame present
touch_to_display_latency Measure end-to-end latency from touch to frame rendering
scroll_response_latency Measure response latency from scroll gesture input to first frame rendering

System and device:

Skill ID One-line description
system_load_in_range Analyze overall system CPU utilization and process activity
device_state_snapshot Capture device environment info during trace (screen, battery, temperature, etc.)
device_state_timeline Track device state changes over time
wakelock_tracking Track Wake Lock holding, detecting battery drain anomalies

Others:

Skill ID One-line description
blocking_chain_analysis Analyze main thread blocking chain: what blocked the main thread? What was the waker doing?
anr_main_thread_blocking Deep analysis of main thread blocking cause during ANR
anr_context_in_range Extract first ANR event data as time window anchor
app_lifecycle_in_range Track Activity/Fragment lifecycle events
compose_recomposition_hotspot Detect Jetpack Compose recomposition hotspots
webview_v8_analysis Analyze WebView V8 engine: GC, script compilation, execution time
rendering_pipeline_detection Identify application rendering pipeline type (24 fine-grained detection types)
pipeline_key_slices_overlay Query pipeline key Slice ts/dur for timeline overlay
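Following the parameterized-SQL pattern shown in Q1, an atomic skill of this kind can be sketched as below. This is a hypothetical simplification – the field names mirror the article's earlier YAML example, not the full schema:

```yaml
id: main_thread_states_in_range
type: atomic
params: [main_thread_utid, start_ts, end_ts]
steps:
  - id: state_distribution
    type: atomic
    sql: |
      SELECT ts.state, COUNT(*) AS cnt, SUM(ts.dur) AS total_dur
      FROM thread_state ts
      JOIN thread_track tt ON ts.track_id = tt.id
      WHERE tt.utid = ${main_thread_utid}
        AND ts.ts BETWEEN ${start_ts} AND ${end_ts}
      GROUP BY ts.state
      ORDER BY total_dur DESC
```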

Composite Skills (29)

Combine multiple atomic skills, supporting iterator (per-frame/per-event deep drill) and conditional (data-driven branching).

Skill ID One-line description
scrolling_analysis Scrolling analysis main entry: overview -> frame list -> root cause classification -> per-frame diagnosis
flutter_scrolling_analysis Flutter-specific frame analysis, using Flutter thread model
jank_frame_detail Analyze a specific jank frame in detail: deep drill into jank cause and root cause classification
startup_analysis Startup analysis main entry: Iterator mode, big/little core analysis, four-quadrant
startup_detail Analyze a single startup event: main thread duration, Binder, CPU big/little core ratio
anr_analysis ANR v3.0 analysis: system issue vs. app issue, categorized handling
anr_detail Single ANR event detail: four-quadrant, Binder dependencies, deadlock detection
cpu_analysis CPU analysis: time distribution, big/little core analysis, scheduling chain
gpu_analysis GPU analysis: frequency distribution, memory usage, frame rendering correlation
memory_analysis Memory analysis: GC events, GC-to-frame correlation, thread states
gc_analysis GC analysis: based on stdlib android_garbage_collection_events
binder_analysis Binder deep analysis: transaction basics, thread states
binder_detail Single Binder transaction detail: CPU big/little core, four-quadrant, blocking cause
thermal_throttling Temperature monitoring, thermal throttling detection, CPU frequency correlation
lock_contention_analysis Lock contention multi-dimensional analysis: based on android.monitor_contention
surfaceflinger_analysis SF frame composition performance: GPU/HWC composition ratio, slow composition detection
click_response_analysis Click response analysis: based on stdlib android_input_events
click_response_detail Single slow input event detail: latency breakdown, four-quadrant, main thread blocking
scroll_session_analysis Single complete scroll session: Touch phase vs Fling phase FPS
navigation_analysis Activity/Fragment navigation performance: lifecycle, transition animations
lmk_analysis LMK analysis: cause distribution, timeline, frequency
dmabuf_analysis DMA Buffer analysis: allocation, release, leak detection
block_io_analysis Block IO analysis: device-level statistics, queue depth, long-duration IO
io_pressure IO blocking data detection, IO Wait time, severity assessment
suspend_wakeup_analysis Suspend/wakeup analysis: time distribution, wakeup source ranking
network_analysis Network analysis: traffic overview, per-app traffic, protocol distribution
irq_analysis Hard interrupt and soft interrupt frequency, duration, nesting
scene_reconstruction Reconstruct user operation scenarios through user input and screen state
state_timeline Four-lane continuous state timeline: device/user/app/system
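A composite skill combining the iterator and conditional mechanisms might be sketched as follows. The structure is illustrative only; keys like `for_each` and `when` are assumptions about the schema, not SmartPerfetto's actual YAML format:

```yaml
id: scrolling_analysis
type: composite
steps:
  - skill: consumer_jank_detection        # Atomic: detect dropped frames
    as: jank_frames
  - type: iterator                        # Per-frame deep drill
    for_each: jank_frames
    steps:
      - skill: jank_frame_detail
        with: { frame_id: ${item.id} }
  - type: conditional                     # Data-driven branching
    when: ${jank_frames.count} == 0
    steps:
      - skill: frame_pipeline_variance    # No hard jank: check jitter instead
```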

Deep Skills (2)

Deep profiling, typically requiring longer execution time.

Skill ID One-line description
cpu_profiling CPU performance profiling: usage hotspots and scheduling efficiency deep analysis
callstack_analysis Call stack hotspot analysis in Running state

Pipeline Skills (28)

Rendering pipeline detection + teaching. Each pipeline skill corresponds to a rendering architecture, including pipeline description, key threads, performance metrics, and optimization recommendations.

Skill ID Rendering architecture
pipeline_android_view_standard_blast Android 12+ standard HWUI + BLASTBufferQueue
pipeline_android_view_standard_legacy Pre-Android 12 standard HWUI + Legacy BufferQueue
pipeline_android_view_software CPU Skia software rendering, no RenderThread
pipeline_android_view_mixed View + SurfaceView mixed rendering
pipeline_android_view_multi_window Same-process multi-window (Dialog/PopupWindow)
pipeline_android_pip_freeform Picture-in-Picture and freeform window mode
pipeline_compose_standard Jetpack Compose + HWUI RenderThread
pipeline_flutter_textureview Flutter PlatformView fallback mode
pipeline_flutter_surfaceview_skia Flutter + Skia engine (JIT Shader)
pipeline_flutter_surfaceview_impeller Flutter + Impeller engine (pre-compiled Shader)
pipeline_webview_gl_functor Traditional WebView, App RenderThread synchronous wait
pipeline_webview_surface_control Modern WebView + Viz/OOP-R independent composition
pipeline_webview_textureview_custom X5/UC and other custom WebView engines
pipeline_webview_surfaceview_wrapper WebView fullscreen video wrapper mode
pipeline_chrome_browser_viz Chrome Viz compositor, multi-process architecture
pipeline_opengl_es Direct OpenGL ES / EGL rendering
pipeline_vulkan_native Native Vulkan rendering
pipeline_angle_gles_vulkan ANGLE: OpenGL ES -> Vulkan translation layer
pipeline_game_engine Unity/Unreal/Godot and other game engines
pipeline_surfaceview_blast Standalone SurfaceView + BLAST sync
pipeline_textureview_standard SurfaceTexture texture sampling/composition mode
pipeline_camera_pipeline Camera2/HAL3 multi-stream camera rendering
pipeline_video_overlay_hwc HWC video layer hardware-accelerated overlay
pipeline_hardware_buffer_renderer Android 14+ HBR API direct Buffer rendering
pipeline_surface_control_api NDK SurfaceControl direct transaction submission
pipeline_variable_refresh_rate VRR/ARR + FrameTimeline dynamic refresh rate
pipeline_imagereader_pipeline ImageReader API: ML inference, screen recording, custom camera
pipeline_software_compositing SF CPU software composition fallback (when GPU unavailable)

Note: _base.skill.yaml is the base template file for Pipeline Skills; it is not registered as an available Skill and is not counted in the total.
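Each pipeline skill pairs detection with a knowledge payload. A sketch of what one might look like – all field names here are illustrative assumptions, not the actual schema:

```yaml
id: pipeline_compose_standard
type: pipeline
extends: _base.skill.yaml
architecture: "Jetpack Compose + HWUI RenderThread"
key_threads: [main, RenderThread]
detection:
  sql: |
    SELECT COUNT(*) AS compose_slices
    FROM slice WHERE name GLOB 'Compose:*'
metrics: [recomposition_count, draw_duration]
recommendations:
  - "Run compose_recomposition_hotspot to check for excessive recomposition"
```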


Module Skills (18)

Modular analysis configuration, organized by layer. The Agent discovers them via list_skills and invokes on demand.

Hardware layer (5):

Skill ID One-line description
cpu_module CPU frequency, thermal throttling, and power states
gpu_module GPU rendering, frequency, and VRAM usage
memory_module Memory bandwidth, LMK, dmabuf, PSI, page faults
thermal_module Temperature sensors, thermal throttling detection, cooling policy
power_module Wake Lock, CPU idle, power mode, suspend/wakeup

Framework layer (6):

Skill ID One-line description
surfaceflinger_module Frame rendering timing, jank causes, GPU composition
choreographer_module VSync signal, doFrame callbacks, frame production pipeline
ams_module Application lifecycle, process management, startup timing
wms_module Window animations, Activity transitions, multi-window
art_module GC, JIT compilation, and memory allocation
input_module Touch latency, input dispatch, and click response

Kernel layer (4):

Skill ID One-line description
scheduler_module Thread scheduling latency, CPU utilization, big/little core assignment
binder_module Cross-process calls, blocking transactions, call latency
lock_contention_module Mutex/Futex, Java monitor, deadlock detection
filesystem_module Block IO, file operations, database, SharedPreferences

Application layer (3):

Skill ID One-line description
launcher_module Home screen performance, app launch, widget updates
systemui_module Status bar, notification shade, quick settings, navigation bar
third_party_module Third-party app performance, stuttering, and resource usage

Relationships between Skills

Module Skills (configuration layer)
+-> Define analysis scope and focus areas

Composite Skills (orchestration layer)
+-> Reference multiple Atomic Skills
+-> Iterator: per-frame/per-event traversal deep drill
+-> Conditional: data-driven branching

Atomic Skills (execution layer)
+-> Directly execute SQL, return DataEnvelope

Pipeline Skills (knowledge layer)
+-> Rendering pipeline teaching + detection

Deep Skills (profiling layer)
+-> Callstack / CPU profiling deep analysis

Agent’s typical invocation path (scrolling analysis example):

invoke_skill("scrolling_analysis")          <- Composite, internally calls multiple Atomic
-> consumer_jank_detection <- Atomic, detect frame drops
-> per-frame iterator -> jank_frame_detail <- Composite, deep drill per frame
-> main_thread_states_in_range <- Atomic
-> binder_blocking_in_range <- Atomic
-> frame_blocking_calls <- Atomic
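Conceptually, this invocation path is a recursive resolution: composite skills orchestrate children, and only atomic skills touch SQL. A toy TypeScript sketch with hypothetical types – the real engine also handles parameters, iterators, conditionals, and DataEnvelope assembly:

```typescript
// Toy sketch: composite skills recursively resolve to atomic SQL executions.
type AtomicSkill = { kind: "atomic"; id: string; sql: string };
type CompositeSkill = { kind: "composite"; id: string; steps: Skill[] };
type Skill = AtomicSkill | CompositeSkill;

// Execute a skill tree, returning the atomic skill IDs in invocation order.
function invokeSkill(skill: Skill, runSql: (sql: string) => void): string[] {
  if (skill.kind === "atomic") {
    runSql(skill.sql);  // Atomic: directly execute SQL (execution layer)
    return [skill.id];
  }
  // Composite: orchestrate child skills (orchestration layer)
  return skill.steps.flatMap((s) => invokeSkill(s, runSql));
}
```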

(Continuously updated; new questions will be added as received)
