Android Performance

Android ANR Series 2: ANR Analysis Methodology and Key Logs

2025/02/08

This is the second article in the Android App ANR series, focusing on ANR analysis methodology and key logs. The series includes:

  1. Android App ANR Series 1: Understanding Android ANR Design Philosophy
  2. Android App ANR Series 2: ANR Analysis Methodology and Key Logs
  3. Android App ANR Series 3: ANR Case Studies

ANR (Application Not Responding) - a simple definition that encompasses much of Android’s system design philosophy.

First, ANR falls within the application domain. This differs from SNR (System Not Responding), which covers cases where the system process (system_server) loses responsiveness, while ANR explicitly confines the problem to applications. SNR is guarded by the Watchdog mechanism (details can be found in Watchdog mechanism and problem analysis); ANR is guarded by the message handling mechanism. Android implements a sophisticated mechanism at the system layer to detect ANR, and its core principle is message scheduling plus timeout handling.

Second, the ANR mechanism is primarily implemented at the system layer. All ANR-related messages are scheduled by the system process (system_server) and then dispatched to application processes for actual processing. At the same time, the system process sets different timeout limits to track how each message is processed. Once an application fails to process a message in time, the timeout limit takes effect: the system collects state such as CPU/IO usage and process function call stacks, and reports to the user that a process is not responding (the ANR dialog; some ROMs don’t display the ANR dialog but drop straight back to the home screen).
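The "schedule a message, arm a timeout, disarm it on completion" idea can be sketched in a few lines. This is a toy simulation only, not Android's implementation: the class name TimeoutMonitor, the message names, and the timeout values are all hypothetical and chosen just to make the demo short.

```python
import threading
import time


class TimeoutMonitor:
    """Toy stand-in for the system side: arms one timer per dispatched message."""

    def __init__(self):
        self.reports = []      # messages whose timeout fired ("ANR" in this toy model)
        self._timers = {}
        self._lock = threading.Lock()

    def dispatch(self, message, timeout_s):
        # Arm the timeout before handing the message to the "application".
        timer = threading.Timer(timeout_s, self._on_timeout, args=(message,))
        with self._lock:
            self._timers[message] = timer
        timer.start()

    def finished(self, message):
        # The application reports completion; disarm the timeout if still pending.
        with self._lock:
            timer = self._timers.pop(message, None)
        if timer is not None:
            timer.cancel()

    def _on_timeout(self, message):
        with self._lock:
            if self._timers.pop(message, None) is None:
                return         # completion won the race; nothing to report
        self.reports.append(message)


def run_demo():
    monitor = TimeoutMonitor()

    monitor.dispatch("fast_broadcast", timeout_s=0.5)
    time.sleep(0.05)           # handled well within the limit
    monitor.finished("fast_broadcast")

    monitor.dispatch("slow_service_start", timeout_s=0.1)
    time.sleep(0.3)            # handler overruns its limit
    monitor.finished("slow_service_start")

    return monitor.reports
```

In this model only the message that overran its limit ends up in reports; the real system would additionally collect CPU/IO state and call stacks at that point.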

Third, ANR issues are essentially performance problems. The ANR mechanism actually imposes restrictions on the application’s main thread, requiring it to complete the most common operations (starting services, processing broadcasts, handling input) within specified time limits. If processing times out, the main thread is considered to have lost the ability to respond to other operations. Time-consuming operations on the main thread, such as intensive CPU computations, heavy I/O, complex UI layouts, etc., all reduce the application’s responsiveness.

Finally, some ANR problems are very difficult to analyze. Sometimes low-level system influences cause message scheduling to fail, and the problematic scenario is hard to reproduce. Such ANR issues often require significant time spent understanding system behavior, which goes beyond the scope of the ANR mechanism itself. Some ANR problems are hard to investigate because many factors can make the system unstable, such as memory fragmentation caused by Linux Kernel bugs, hardware damage, etc. Such low-level causes often leave ANR problems untraceable, and since they aren’t application issues at all, they waste a lot of application developers’ time. If you’ve worked on development and maintenance of an entire system, you’ll understand this deeply. Therefore, I cannot guarantee that understanding all the content in this chapter will solve every ANR problem. If you encounter a very difficult ANR issue, I suggest talking to friends who work on Framework, drivers, and kernel; or, if the problem is just a one-in-a-hundred-thousand chance occurrence that doesn’t affect normal program operation, I’d suggest ignoring it.

– From duanqz

ANR Analysis Methodology

ANR problems mainly stem from two causes: application-side issues and system-side anomalies. When analyzing ANR problems, the most important task is to determine which cause is responsible (though there are some gray areas, such as poorly written code that doesn’t manifest under normal conditions but quickly surfaces when the system has problems).

General ANR Analysis Steps:

  1. Check the EventLog for the exact ANR time (search for am_anr) and see whether it matches the ANR log, which tells you whether the ANR log is valid. If it is valid, analyze it to extract useful information: pid, tid, deadlocks, etc. When facing an ANR problem, always question whether the trace in front of you is the original crime scene. If a lot of information was being output when the ANR occurred, and CPU and I/O resources were tight at that time, the log output timestamp might be delayed by 10 to 20 seconds, so stay vigilant. Normally, however, the am_anr entry in the EventLog is the earliest record and the closest to the actual ANR time.
  2. Check MainLog (Android Log) or SystemLog for ANR details (search for “ANR in”), extracting effective information:
    1. Time when ANR occurred
    2. Process that printed the ANR log
    3. Process where ANR occurred
    4. Reason for ANR
    5. CPU load
    6. Memory load
    7. CPU usage statistics time period
    8. CPU usage rate of each process
      1. Total CPU usage rate
      2. Page fault counts
        1. xxx minor is the number of minor page faults, which are served from the page cache without disk access; it can be understood as the process performing memory access
        2. xxx major is the number of major page faults, which require loading the page from storage; it can be understood as the process performing I/O operations
    9. CPU usage summary
  3. Combine MainLog (Android Log) and EventLog, and extract all useful information between the start and end times of the CPU usage statistics period into a file:
    1. Collect key operations, such as unlock, app installation, screen on/off, app launch, etc.
    2. Collect exceptions and system key logs:
      1. System slowdown: such as Slow operation, Slow dispatch, Slow delivery, dvm_lock_sample
      2. Process changes: am_kill, am_proc_died, lowmemorykiller, ANR, app launch relationships, etc.
      3. System information: cpu info, meminfo, binder info (whether full), iowait (whether too high)
    3. Collect all key thread running conditions and thread priorities of the ANR process
    4. Based on the key information file extracted above, further understand the system’s situation and state at that time (VS Code or Notepad++ with global search is recommended for finding clues), such as:
      1. Was the system low on memory and frequently killing processes?
      2. Was it the first unlock after a reboot, with the system busy?
      3. Were multiple apps launched in a short time, making the system busy?
      4. Or was the application waiting on its own logic?
  4. If the cause is still unclear, add logs and try to reproduce the problem.
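Steps 1 and 3 above can be partly automated. The following is a minimal sketch, assuming logcat "threadtime"-style lines (MM-DD HH:MM:SS.mmm prefix) and an am_anr payload of five comma-separated fields in the order user, pid, package, flags, reason. The sample lines, the window sizes, and the helper names are illustrative assumptions, not output from a real device.

```python
import re
from datetime import datetime, timedelta

AM_ANR = re.compile(
    r"^(?P<ts>\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}).*\bam_anr\s*:\s*\[(?P<body>.*)\]\s*$"
)
# Keywords taken from the checklist above.
KEYWORDS = ("Slow operation", "Slow dispatch", "Slow delivery", "dvm_lock_sample",
            "am_kill", "am_proc_died", "lowmemorykiller")


def parse_ts(text):
    # logcat omits the year; a fixed leap year is assumed so that 02-29 parses.
    return datetime.strptime("2000-" + text, "%Y-%m-%d %H:%M:%S.%f")


def parse_am_anr(line):
    m = AM_ANR.match(line)
    if not m:
        return None
    # The reason may contain commas, so split off only the first four fields.
    fields = m.group("body").split(",", 4)
    if len(fields) != 5:
        return None
    _user, pid, package, _flags, reason = fields
    return {"time": parse_ts(m.group("ts")), "pid": int(pid),
            "package": package, "reason": reason}


def lines_in_window(lines, anr_time, before_s=30, after_s=5):
    """Keep keyword lines whose timestamp falls inside the window around the ANR."""
    start = anr_time - timedelta(seconds=before_s)
    end = anr_time + timedelta(seconds=after_s)
    hits = []
    for line in lines:
        try:
            ts = parse_ts(line[:18])
        except ValueError:
            continue               # not a timestamped log line
        if start <= ts <= end and any(k in line for k in KEYWORDS):
            hits.append(line)
    return hits


# Illustrative lines only, written to match the assumed layout.
SAMPLE = [
    "07-20 15:36:20.100  1000  1520  1597 W ActivityManager: Slow operation: 120ms so far",
    "07-20 15:36:30.250  1000  1520  1601 I am_kill : [0,2301,com.example.other,900,empty #17]",
    "07-20 15:36:36.472  1000  1520  1597 I am_anr  : [0,1480,com.example.app,952680005,"
    "Input dispatching timed out (Waiting to send key event, timeout)]",
    "07-20 15:40:00.000  1000  1520  1597 I am_proc_died: [0,2301,com.example.other,900,17]",
]

anr = parse_am_anr(SAMPLE[2])
context = lines_in_window(SAMPLE, anr["time"])
```

Here anr carries the pid, package, and reason from the am_anr entry, and context holds the keyword lines from the 30 seconds before and 5 seconds after it. The window should be widened if the log output was delayed, as noted in step 1.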

Distinguishing Between Application Problems and System Problems

First, Analyze Whether It’s an Application Problem

The key to analyzing application problems is to understand what the user was doing at that time and what role the application played during this user operation, then proceed with further analysis:

  1. Analyze whether there are time-consuming operations in key component lifecycles that might not be exposed normally but surface when system load increases (suggest adding corresponding logs in key lifecycle functions for easier debugging).
  2. Analyze whether extreme situations occurred causing application logic to be time-consuming, such as large amounts of data processing or import, too many threads running simultaneously, etc. (check application’s CPU/I/O usage).
  3. Analyze whether deadlocks exist.
  4. Analyze whether a thread is blocked waiting for a binder call to return.
  5. Analyze whether MainThread and RenderThread have abnormalities in Trace files.
  6. Analyze whether MainThread and WorkerThread have waiting relationships in Trace files.
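The deadlock and waiting-relationship checks above can be approximated by building a "who waits on whom" graph from the trace file. This sketch assumes ART-style trace text in which a thread header looks like "name" prio=5 tid=N State and a blocked thread has a line of the form "- waiting to lock <addr> (...) held by thread N". The sample trace and function names are hypothetical.

```python
import re

THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*\btid=(?P<tid>\d+)\b')
WAITING = re.compile(r"waiting to lock <[^>]+>.*held by thread (?P<holder>\d+)")


def build_wait_graph(trace_text):
    """Return (tid -> thread name, tid -> tid of the thread holding the wanted lock)."""
    names, waits_on = {}, {}
    current = None
    for line in trace_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group("tid"))
            names[current] = header.group("name")
            continue
        waiting = WAITING.search(line)
        if waiting and current is not None:
            waits_on[current] = int(waiting.group("holder"))
    return names, waits_on


def find_deadlock(trace_text):
    """Return the thread names on the first wait cycle found, or [] if none."""
    names, waits_on = build_wait_graph(trace_text)
    for start in waits_on:
        seen = []
        tid = start
        # Follow the chain of lock holders until it ends or revisits a thread.
        while tid in waits_on and tid not in seen:
            seen.append(tid)
            tid = waits_on[tid]
        if tid in seen:
            cycle = seen[seen.index(tid):]
            return [names.get(t, str(t)) for t in cycle]
    return []


# Illustrative trace fragment only.
SAMPLE_TRACE = '''\
"main" prio=5 tid=1 Blocked
  at com.example.app.A.foo(A.java:10)
  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 12
"worker" prio=5 tid=12 Blocked
  at com.example.app.B.bar(B.java:20)
  - waiting to lock <0x0d4e5f60> (a java.lang.Object) held by thread 1
"binder" prio=5 tid=20 Native
'''

deadlocked = find_deadlock(SAMPLE_TRACE)
```

A chain that ends without a cycle (for example, main waiting on a worker that is itself stuck in I/O) is not a deadlock, but the waits_on map still shows which thread to inspect next.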

Analyze System State

  1. Check CPU usage (CPU usage rate and CPU load) and see whether system-related processes or threads such as SystemServer, lowmemorykiller, HeapTaskDaemon, Audio, or SurfaceFlinger show high usage.
  2. Check whether there is heavy I/O by looking at the I/O load:
    1. faults: 118172 minor (page faults served from the page cache).
    2. major (page faults that require reading from storage).
  3. Check whether system is in low memory:
    1. Check dumpsys meminfo results to see if in low memory.
    2. Check kernel log for frequent lowmemorykiller.
    3. Check event log for frequent applications killed by system low memory policy.
    4. Check kswapd0.
  4. Check whether the application was frozen: if the application is in D state when the ANR occurs and the last operation is refriger, the application was frozen. This is normally caused by power optimization. Check whether logs such as xxxHansManager : unfreeze appear before or after the ANR, or whether the Kernel Callstack in Systrace shows: {kernel callsite when blocked:: "__refrigerator+0xe4/0x198"}.
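The CPU and page-fault checks in items 1 and 2 can be scripted against the per-process lines of the "ANR in" block. This is a rough sketch, assuming each line has the shape "NN% pid/name: NN% user + NN% kernel / faults: N minor M major" (with the faults part optional). The sample text and both thresholds are illustrative assumptions and should be tuned per device.

```python
import re

CPU_LINE = re.compile(
    r"^\s*[+-]?(?P<total>\d+(?:\.\d+)?)%\s+(?P<pid>\d+)/(?P<name>\S+):\s"
    r"(?:.*?faults:(?:\s+(?P<minor>\d+) minor)?(?:\s+(?P<major>\d+) major)?)?"
)


def parse_cpu_lines(text):
    """Extract per-process CPU usage and page-fault counts; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = CPU_LINE.match(line)
        if m:
            rows.append({
                "name": m.group("name"),
                "pid": int(m.group("pid")),
                "total": float(m.group("total")),
                "minor": int(m.group("minor") or 0),
                "major": int(m.group("major") or 0),
            })
    return rows


def suspicious(rows, cpu_threshold=40.0, major_threshold=100):
    # Hypothetical thresholds: high CPU share, or many major faults (I/O pressure).
    return [r["name"] for r in rows
            if r["total"] >= cpu_threshold or r["major"] >= major_threshold]


# Illustrative text only, written to match the assumed layout.
SAMPLE = """\
CPU usage from 0ms to 8058ms later:
  62% 1520/system_server: 40% user + 22% kernel / faults: 118172 minor 2 major
  15% 1480/com.example.app: 9% user + 6% kernel / faults: 5300 minor 410 major
  3.1% 97/kswapd0: 0% user + 3.1% kernel
38% TOTAL: 20% user + 12% kernel + 5% iowait + 1% softirq
"""

rows = parse_cpu_lines(SAMPLE)
flagged = suspicious(rows)
```

With the sample above, system_server is flagged for CPU share and com.example.app for its major fault count, which points toward checking system load and I/O respectively.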

