This is the first article in the Android App ANR series, mainly analyzing the design philosophy of Android ANR from a system perspective. The series directory is as follows:
- Android App ANR Series 1: Understanding Android ANR Design Philosophy
- Android App ANR Series 2: ANR Analysis Routines and Key Log Introduction
- Android App ANR Series 3: ANR Case Sharing
1. Universality and Complexity of ANR
In the Android ecosystem, Application Not Responding (ANR) is not only a common challenge for developers but also a core embodiment of system design philosophy. Although ANR is often simplified as a synonym for “time-consuming operations on the main thread,” this superficial understanding is far from enough to reveal the essence of the problem. In fact, the root cause of ANR lies in the complex synergy between Android’s multi-process architecture, event distribution, and resource scheduling mechanisms. Its essence is a comprehensive manifestation of strict constraints and monitoring imposed by the system on application behavior.
Android explicitly characterizes ANR as an application-layer problem, which is in sharp contrast to SNR (System Not Responding). SNR refers to the loss of responsiveness of system processes (such as system_server), usually relying on the Watchdog mechanism to monitor the status of key system threads; while ANR relies on the message scheduling mechanism, where the system process uses a carefully designed timeout model to track the responsiveness of the application’s main thread. This distinction reflects the governance strategies adopted by the system for problems at different levels:
- SNR focuses on ensuring the survival of core system services, adopting an active polling monitoring method;
- ANR focuses on the real-time response of application processes, judging through an event-driven asynchronous detection mechanism.
From the perspective of system architecture, the ANR mechanism is mainly implemented at the system layer (i.e., in the system_server process). Its core lies in building a cross-process event monitoring system. When an application process initiates an operation request (such as starting an Activity or processing an application broadcast) to system services through Binder, the system starts a timeout timer at the same time; for asynchronous operations such as input events, InputDispatcher establishes an event channel with the window through a socket and starts timeout detection after the event is dispatched. This layered monitoring design reflects the differentiated processing strategies Android adopts for different task types.
The deep significance of the ANR mechanism lies in balancing openness and system controllability. As an open platform, Android allows applications to freely apply for hardware resources (such as CPU, IO, memory, etc.), but strict rules must be used to prevent the abnormal behavior of a single application from spreading to the entire system. When a timeout event is detected, the system initiates a multi-dimensional circuit breaker mechanism: first, force-terminate the problematic process to release key system resources (such as preventing it from occupying the Binder thread pool or file descriptors), thereby avoiding cascading failures; at the same time, the system will freeze the process state and collect key information such as CPU usage, thread stacks, and memory snapshots, writing this data under /data/anr/ (traces.txt on older releases, one trace file per ANR on newer ones) to preserve the scene for subsequent problem analysis. Even more cleverly, the system hands over the final operation right to the user through a user-visible pop-up window, which not only avoids the risk of misjudgment caused by automated processing but also maintains the continuity of human-computer interaction. This design combining “fault isolation - scene protection - user decision” demonstrates how Android balances technical rigor and user experience friendliness.
2. Core Design Philosophy of ANR
Essence of ANR: System-Level Monitoring and Mandatory Intervention
The ANR mechanism constitutes a deep-seated stability defense system in the Android system architecture. Its core lies in building a global safety net independent of application status through cross-layer collaborative monitoring and asynchronous decision isolation. This design goes far beyond simple timeout detection; it is deeply rooted in the combination of the Linux process sandbox mechanism and the Android component architecture. System processes (such as system_server) monitor component lifecycles and input event flows through two core modules: ActivityManagerService (AMS) and InputManagerService (IMS). Thanks to this layered architecture, monitoring logic is decoupled from business logic. Even if the main thread of the target application is completely blocked, the system can still rely on independent threads for timeout adjudication, fundamentally avoiding the risk of “monitor being dragged down by the monitored object.”
At the implementation level, ANR fully embodies the essence of event-driven system design. For example, in component-type ANR scenarios, when AMS dispatches cross-process tasks to application processes through Binder, the system starts a countdown timer at the same moment (such as a 20-second threshold for Service startup). This “bomb planting” mechanism essentially transforms asynchronous tasks into synchronous contracts with timeout constraints. After the application process completes the task, it must actively “defuse the bomb” through the Binder callback. Otherwise, the system intervenes to collect scene information (such as main thread stacks) and trigger user interaction. The entire process is driven by the system process, and the application only acts as an event responder, ensuring the absolute authority of monitoring.
Task dispatch relies on Binder synchronous calls to ensure atomicity. At the same time, AMS pushes timeout detection messages into the message queue through a dedicated Handler, thereby monitoring the execution of tasks within the specified time. This design not only ensures the integrity of tasks during cross-process communication but also quickly triggers circuit breaker processing after timeouts.
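The plant/defuse contract described above can be sketched in plain Java. This is a minimal illustration rather than AOSP code: `TimeoutMonitor`, `plant`, `defuse`, and the token strings are hypothetical names, and a `ScheduledExecutorService` stands in for the delayed Handler message that AMS uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for the system-side timeout bookkeeping; not AOSP code. */
class TimeoutMonitor {
    // A dedicated timer thread, so the verdict never depends on the monitored task.
    private final ScheduledExecutorService timerThread =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-monitor");
                t.setDaemon(true);
                return t;
            });
    private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();
    private final Consumer<String> onTimeout;

    TimeoutMonitor(Consumer<String> onTimeout) {
        this.onTimeout = onTimeout;
    }

    /** "Plant the bomb": start the countdown at the moment the task is dispatched. */
    synchronized void plant(String token, long timeoutMs) {
        ScheduledFuture<?> timer =
                timerThread.schedule(() -> expire(token), timeoutMs, TimeUnit.MILLISECONDS);
        pending.put(token, timer);
    }

    /** "Defuse the bomb": the task reported completion in time. */
    synchronized boolean defuse(String token) {
        ScheduledFuture<?> timer = pending.remove(token);
        if (timer == null) {
            return false; // unknown token, or the timeout already fired
        }
        timer.cancel(false);
        return true;
    }

    private void expire(String token) {
        boolean stillPending;
        synchronized (this) {
            stillPending = pending.remove(token) != null;
        }
        if (stillPending) {
            onTimeout.accept(token); // the real system would collect traces here
        }
    }

    void shutdown() {
        timerThread.shutdownNow();
    }
}

class TimeoutMonitorDemo {
    public static void main(String[] args) throws InterruptedException {
        TimeoutMonitor monitor =
                new TimeoutMonitor(token -> System.out.println("timeout: " + token));
        monitor.plant("service-A", 100);
        monitor.defuse("service-A");     // completed in time, no report
        monitor.plant("service-B", 100); // never defused, reported after 100 ms
        Thread.sleep(300);
        monitor.shutdown();
    }
}
```

The point of the sketch is that the countdown lives on a thread the monitored task cannot block, which mirrors the decoupling of monitoring logic from business logic.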
Component-Class ANR: Global Protection Logic for Asynchronous Tasks
The monitoring logic of component-class ANR revolves around ActivityManagerService (AMS). Its essence is to realize full-link tracking of the asynchronous task lifecycle through the three-stage model of task dispatch – callback – circuit breaker. When the system dispatches a task to an application via Binder cross-process communication (such as starting a Service), AMS starts timeout detection at the same time, using MainHandler to send a delayed message for precise timing. Taking Service startup as an example, when AMS calls scheduleCreateService() of IApplicationThread, it starts the corresponding timeout monitoring (default 20 seconds). If the application does not notify AMS via the serviceDoneExecuting() callback within the specified time, an ANR determination is triggered.
Developers need to pay special attention to the timing trap of cross-process callbacks: even if the asynchronous task is completed in a sub-thread, if the Binder callback is delayed due to main thread message queue blocking (such as excessive calls to runOnUiThread), the system will still judge it as ANR.
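To make the timing trap concrete, here is a toy model on virtual time, assuming the completion report has to wait its turn in a single FIFO message queue. The class names, message names, and per-message costs are made up for illustration; only the 20-second threshold comes from the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a single-threaded message loop on virtual time; all names are hypothetical. */
class MainThreadModel {
    private static final class Message {
        final String name;
        final long costMs;

        Message(String name, long costMs) {
            this.name = name;
            this.costMs = costMs;
        }
    }

    private final Deque<Message> queue = new ArrayDeque<>();

    /** Appends a message that will occupy the thread for costMs once it runs. */
    void post(String name, long costMs) {
        queue.addLast(new Message(name, costMs));
    }

    /** Virtual time at which the named message finishes, or -1 if it was never posted. */
    long finishTimeOf(String name) {
        long now = 0;
        for (Message m : queue) {
            now += m.costMs; // messages run strictly one after another
            if (m.name.equals(name)) {
                return now;
            }
        }
        return -1;
    }
}

class CallbackDelayDemo {
    static final long SERVICE_TIMEOUT_MS = 20_000; // threshold quoted in the text

    /** True if the completion report would reach the system too late (or never). */
    static boolean wouldTimeOut(MainThreadModel mainThread, String doneMessage) {
        long finishedAt = mainThread.finishTimeOf(doneMessage);
        return finishedAt < 0 || finishedAt > SERVICE_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        MainThreadModel busy = new MainThreadModel();
        // The background work is already done, but 300 UI updates are queued first.
        for (int i = 0; i < 300; i++) {
            busy.post("ui-update-" + i, 80);
        }
        busy.post("report-done", 1);
        System.out.println(wouldTimeOut(busy, "report-done"));
    }
}
```

In this model the work itself costs nothing on the main thread, yet the report is sent only after 24,001 ms of queued messages, which is past the threshold.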
ProcessStateRecord introduced in Android 14+ has made a finer division of process states. It not only records the status of main thread message processing in detail but also monitors background tasks and suspended states in real-time, thereby reducing the false positive rate and providing developers with richer debugging information.
The key to this design lies in the decoupling of synchronous transactions and asynchronous circuit breaking. Task dispatch relies on Binder synchronous calls to ensure atomicity, while timeout detection executes asynchronously via the Handler message mechanism, avoiding blocking the system main thread.
When ANR is triggered, the system executes a multi-dimensional circuit breaker strategy:
- Scene Collection: The system collects key data such as main thread stack information, CPU usage, and process status, and writes this data to the /data/anr/traces.txt file. At the same time, the system uses ProcessCpuTracker to record detailed CPU usage statistics, providing a basis for subsequent problem analysis.
- Resource Isolation: The system ensures the real-time nature of circuit breaker decisions through the scheduling priority adjustment mechanism of ProcessRecord, ensuring that ANR processing flows can be executed in a timely manner even under high system load.
- Diagnostic Data Collection: The system provides the ApplicationExitInfo API, allowing developers to query historical ANR records, including occurrence time, process status, exception stacks, and other detailed information. These data are extremely important for problem reproduction and root cause analysis.
It is worth noting that Android 15 imposes stricter constraints on background services: foreground services must complete initialization and call startForeground() within 3 seconds, otherwise the system will directly trigger ANR. Specifically, the system manages this timeout mechanism via internal attributes (such as persist.sys.fgs_timeout) and API parameters (such as AMS internal parameters controlling foreground service startup timeouts). Developers can refer to the latest API documentation to understand these changes and ensure that strict response time limits are met when designing services.
The system also provides a variety of tools to support the diagnosis and analysis of ANR problems:
- System Log Collection: Developers can obtain ANR stack information and system reports via adb commands, which contain detailed system status at the time of the problem.
- Performance Analysis Tools: Android Studio’s CPU Profiler can monitor application performance in real-time, helping developers discover potential performance issues.
- System-Level Analysis: Perfetto provides powerful system-level performance analysis capabilities to help developers understand complex performance issues.
Through this multi-level monitoring and protection mechanism, the Android system ensures the reliability of application response performance and provides developers with a complete problem diagnosis toolchain. Developers need to deeply understand these mechanisms, fully consider performance factors in application design, follow system lifecycle contracts, reasonably manage main thread load, and ensure timely response of key callbacks.
This design philosophy reflects the Android platform’s strict requirements for application quality: promoting developers to build more reliable and responsive applications through clear timeout limits and perfect monitoring mechanisms. At the same time, rich diagnostic tools also provide necessary support for developers, helping them quickly locate and solve problems when they encounter them.
Input-Class ANR: Dynamic Circuit Breaker System for Event Distribution
The monitoring mechanism for input-class ANR is more complex. Its core challenge lies in balancing high real-time requirements and resource efficiency. From hardware event generation to application main thread processing, the input system builds an efficient and controllable event distribution link through three major components: EventHub, InputReader, and InputDispatcher.
- Event Reading Layer (EventHub): EventHub uses Linux’s epoll mechanism to listen to the device nodes under /dev/input, supporting concurrent listening of multiple devices and implementing zero idle CPU consumption through an event-driven model (rather than polling); inotify is used to notice device nodes being added or removed. When a hardware interrupt delivers new data, EventHub reads the raw input data from the device node and encapsulates it as a RawEvent.
- Event Preprocessing (InputReader): InputReader performs device-related preprocessing (such as touch calibration) on raw data through specific InputMappers and converts it into standard input events (such as MotionEvent or KeyEvent). At the same time, necessary event filtering is performed based on device type and configuration to ensure data quality.
- Event Distribution Layer (InputDispatcher): The core responsibility of InputDispatcher is to determine the current focus window and push events to the application process through InputChannel, which is based on a Unix Domain Socket. It uses a multiplexing mechanism to efficiently manage multiple InputChannels and relies on WindowManagerService to obtain the latest window focus information, ensuring events are accurately delivered to the target window.
The input ANR mechanism relies on continuous tracking and timeout determination of event status. Its core lies in the design of queue status management and cross-thread collaboration capability:
- inboundQueue: Stores events to be distributed received from InputReader
- outboundQueues: Output queues maintained for each connection; events that have been distributed but have not received completion responses are tracked via waitQueue
- waitQueue: Records events that have been distributed but have not yet received processing confirmation from the application side.
After an event is distributed, the system tracks its processing status through the MonitoredTimeout mechanism. The default timeout is 5000 milliseconds (adjustable via system properties). Timeout detection adopts an event-driven mode, triggered when new events arrive, application callbacks complete, or periodic heartbeat checks occur. Once a timeout is detected, the system notifies ActivityManagerService via WindowManagerService and collects diagnostic data including InputDispatcher status and application process information, subsequently potentially triggering ANR pop-ups and process restart flows.
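The queue bookkeeping for a single connection can be sketched as follows. This is a simplified model on virtual time, not the InputDispatcher implementation: `ConnectionModel` and its methods are hypothetical, the inbound queue is left out, and the only rule modelled is "the oldest delivered but unfinished event has waited longer than the 5000 ms default".

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one connection's queues; names mirror the text, logic is an assumption. */
class ConnectionModel {
    static final long DEFAULT_TIMEOUT_MS = 5_000;

    private static final class Entry {
        final int seq;
        long deliveredAtMs;

        Entry(int seq) {
            this.seq = seq;
        }
    }

    private final Deque<Entry> outboundQueue = new ArrayDeque<>(); // waiting to be sent
    private final Deque<Entry> waitQueue = new ArrayDeque<>();     // sent, not yet finished

    /** Queues an event for this connection. */
    void enqueue(int seq) {
        outboundQueue.addLast(new Entry(seq));
    }

    /** Sends everything in the outbound queue to the app at virtual time nowMs. */
    void publish(long nowMs) {
        Entry e;
        while ((e = outboundQueue.pollFirst()) != null) {
            e.deliveredAtMs = nowMs;
            waitQueue.addLast(e);
        }
    }

    /** The app reports that it finished handling event seq. */
    boolean finish(int seq) {
        return waitQueue.removeIf(e -> e.seq == seq);
    }

    /** True if the oldest unfinished event has waited longer than the timeout. */
    boolean isUnresponsive(long nowMs) {
        Entry oldest = waitQueue.peekFirst();
        return oldest != null && nowMs - oldest.deliveredAtMs > DEFAULT_TIMEOUT_MS;
    }
}

class ConnectionModelDemo {
    public static void main(String[] args) {
        ConnectionModel connection = new ConnectionModel();
        connection.enqueue(1);
        connection.publish(0);
        System.out.println(connection.isUnresponsive(3_000)); // still within the window
        System.out.println(connection.isUnresponsive(5_001)); // oldest event waited too long
        connection.finish(1);
        System.out.println(connection.isUnresponsive(9_000)); // nothing pending any more
    }
}
```

Checking only the head of the wait queue is enough in this model because entries are appended in delivery order, so the head is always the oldest.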
The entire input system adopts an optimized thread model design:
- InputReaderThread focuses on event reading and preprocessing
- InputDispatcherThread is responsible for event distribution and timeout monitoring
The two achieve efficient inter-thread communication through lock-free queues, ensuring that even if the main thread of an application process is blocked, system-level input processing can still operate normally, thereby effectively preventing problem spread.
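A rough sketch of this two-thread handoff, under loud assumptions: the names are hypothetical, a `LinkedBlockingQueue` (which uses locks internally) stands in for whatever queue the real implementation uses, and the application is modelled as a bounded channel that nobody drains. The dispatcher uses a non-blocking `offer`, so a stalled application never stalls the system-side threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical reader/dispatcher pair; a simplified model, not the AOSP thread code. */
class InputPipelineSketch {
    private static final int POISON = -1; // marks the end of the event stream

    /** Returns {events handled by the dispatcher, events accepted by the app channel}. */
    static int[] run(int eventCount, int channelCapacity) throws InterruptedException {
        BlockingQueue<Integer> inbound = new LinkedBlockingQueue<>();
        // The "app" never drains this channel, which models a blocked main thread.
        BlockingQueue<Integer> appChannel = new ArrayBlockingQueue<>(channelCapacity);
        int[] result = new int[2];

        Thread reader = new Thread(() -> {
            for (int i = 0; i < eventCount; i++) {
                inbound.add(i); // reading never waits for the dispatcher
            }
            inbound.add(POISON);
        }, "reader");

        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    int event = inbound.take();
                    if (event == POISON) {
                        return;
                    }
                    result[0]++;
                    if (appChannel.offer(event)) { // non-blocking: a full channel is skipped
                        result[1]++;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "dispatcher");

        reader.start();
        dispatcher.start();
        reader.join();
        dispatcher.join(); // join makes the dispatcher's writes to result visible here
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = run(100, 8);
        System.out.println("handled=" + counts[0] + ", accepted=" + counts[1]);
    }
}
```

With 100 events and a channel capacity of 8, the dispatcher still handles all 100 events even though the application accepts only the first 8.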
For developers, special attention should be paid to the responsiveness of the main thread to avoid performing time-consuming operations in input event processing callbacks. At the same time, understanding the layered design of the input system helps to improve the efficiency of the event processing link from a holistic perspective during performance optimization.
No Focused Window Class ANR
No Focused Window ANR is another important unresponsive scenario in the input system. Its essence is an abnormal window focus state that makes it impossible to distribute input events correctly. Unlike regular input timeouts, this type of ANR reflects coordination issues between the WindowManager subsystem and the input system.
In the design of WindowManagerService (WMS), window focus management is an independent and complex subsystem. When the user interface changes (such as Activity switching or dialog popping up), the system triggers a series of window transaction operations: first execute relayoutWindow on the old window to remove the focus flag, then execute addWindow for the new window and grant focus. These state changes are synchronized in real-time via WindowManagerPolicy to InputDispatcher, ensuring input events can be routed to the current focus window.
Acquisition and loss of focus are triggered by various system behaviors. For example:
- Focus Acquisition: a new Activity finishes starting and displays its first frame, a Dialog or PopupWindow pops up, the user touches a window area in split-screen mode, an application is restored from the background task switcher, the foreground application resumes after unlocking.
- Focus Loss: an Activity is covered by a full-screen Activity, the user presses the Home key, the system pops up a critical Dialog such as a permission request, the application enters the background, the device locks its screen, etc.
No Focused Window ANR is often related to abnormal window lifecycle management. The most common situation is that during Activity switching, due to the delayed execution of handleResumeActivity of the target Activity, the system cannot determine a legal focus window within a certain period of time. Unlike input timeout ANR, input timeout means the target window exists but fails to process the event in time, while No Focused Window ANR means the system cannot find a suitable event receiver. Based on this difference, the system adopts different protection strategies for these two situations: for input timeout, the system triggers ANR after a default of 5 seconds; for no focus window situations, if the target window cannot be found after consecutive event distributions, the system will start the ANR process faster.
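The difference between the two verdicts can be written down as a small decision function. This is a sketch of the reasoning in the paragraph above, not system code: the enum, the method, and in particular the `noFocusTimeoutMs` parameter are hypothetical, since the text gives no concrete threshold for the no-focus case.

```java
/** Hypothetical classification of the two input-side verdicts described in the text. */
class InputAnrClassifier {
    enum Verdict { OK, INPUT_DISPATCH_TIMEOUT, NO_FOCUSED_WINDOW }

    static final long INPUT_TIMEOUT_MS = 5_000; // default quoted in the text

    /**
     * @param hasFocusedWindow whether the system currently knows a window that can receive the event
     * @param waitedMs how long the event has been waiting
     * @param noFocusTimeoutMs assumed threshold for the no-focus case (not specified in the text)
     */
    static Verdict classify(boolean hasFocusedWindow, long waitedMs, long noFocusTimeoutMs) {
        if (!hasFocusedWindow) {
            // No receiver exists at all: the problem is on the window-management side.
            return waitedMs > noFocusTimeoutMs ? Verdict.NO_FOCUSED_WINDOW : Verdict.OK;
        }
        // A receiver exists but has not finished handling the event in time.
        return waitedMs > INPUT_TIMEOUT_MS ? Verdict.INPUT_DISPATCH_TIMEOUT : Verdict.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 6_000, 3_000));
        System.out.println(classify(false, 4_000, 3_000));
        System.out.println(classify(true, 4_000, 3_000));
    }
}
```

The same waiting time of 4,000 ms yields different verdicts depending on whether a focused window exists, which is why the two cases call for different lines of investigation.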
From the perspective of application development, the code paths affecting focus switching are relatively limited, mainly involving links such as Activity lifecycle callbacks, window addition/removal, and input event processing. Even if problems occur in these links (such as main thread blocking), they usually trigger regular ANR rather than No Focused Window ANR. Therefore, when encountering such problems, more attention should be paid to system-level indicators such as overall system resource usage status, CPU load of the system_server process, and Binder call delays between system services, rather than purely focusing on code optimization of a certain application. This is also the fundamental reason why No Focused Window ANR is often regarded as a system performance problem rather than an application quality problem.
Unifying Principles of System Design
Whether it is component-class ANR or input-class ANR, the monitoring mechanisms share the following core principles:
- State Traceability: Precisely track task progress through queues (such as waitQueue) and timers (such as Handler in AMS), ensuring the system always grasps the latest status of application behavior.
- Fault Isolation: Terminate the problematic process quickly after timeout to prevent local faults from spreading into system-level avalanches.
- User Control Fallback: Ensure users always have the final right of operation through pop-up prompts and process termination mechanisms, even if the application internals are completely out of control.
- Developer Constraint: Mandate that the main thread remains lightweight and asynchronous in design, promoting application architecture that fits the system design philosophy better.
From an architectural perspective, the ANR mechanism is the Android system’s final answer to the controllability of an open ecosystem: it allows developers to innovate freely while defining behavioral boundaries through rigid rules. This balance is not only reflected in technical implementation but also profoundly affects the performance optimization culture of the entire Android application ecosystem.
Global Resolution and Active Defense of ANR Problems
The complexity of ANR problems calls for an analysis framework that combines technical depth with a global view of the system, and that uses progressive logic to turn fragmented phenomena into a body of knowledge that can keep evolving. This method is more than simple layering of topics: it cross-validates multiple perspectives to establish a complete mapping from microscopic code defects to macroscopic system constraints.
From Phenomenon to Root Cause: Dissecting ANR Problems Layer by Layer
Building a vertical analysis path follows the chain logic of “Phenomenon → Mechanism → Support → Resource”. Its goal is to clarify the complete chain from the ANR pop-up seen by users to hardware resource problems:
- Mechanism Appearance (ANR Pop-up): As the outermost phenomenon visible to users, the ANR pop-up is actually the system’s final judgment on the fault: it does not reveal specific root causes, but only displays results. Developers often stop at the level of viewing stack logs and looking for main thread blocking points, but this is like only observing volcanic eruptions while ignoring the fundamental driving factors of crustal movement.
- System Implementation (AMS/InputDispatcher): Delving into the system service layer, hidden behind the ANR pop-up is AMS’s appNotResponding trigger flow. AMS tracks component lifecycles through the Binder transaction state machine, while InputDispatcher monitors input response using socket event streams. Analysis at this layer reveals the difference in timeout determination logic: AMS adopts synchronous blocking detection (e.g., BroadcastQueue timeout calculation), while InputDispatcher utilizes an asynchronous non-blocking model based on epoll to implement event loop monitoring.
- Underlying Support (Binder/Scheduler): Efficient operation of system services relies on core mechanisms of the Linux kernel. The Binder driver implements cross-process communication through memory mapping, and its thread pool scheduling strategy (e.g., BINDER_MAX_POOL_THREADS threshold limit) directly affects transaction processing capability; while the system’s fair scheduling mechanism determines whether the main thread can obtain execution resources in time through dynamic allocation of CPU time slices. The key at this layer is to analyze the contradiction between fairness and real-time performance of resource allocation. For example, to ensure fairness of multi-tasking, the system may allow CPU-intensive tasks of background processes to preempt the response time of foreground applications.
- Hardware Resources (CPU/IO/Memory): Ultimately, all software behaviors are limited by physical hardware. The out-of-order execution of the CPU may lead to randomness in lock contention problems; disk I/O latency will amplify main thread blocking during SharedPreferences writing; memory bandwidth contention may prevent RenderThread from obtaining texture data in time. This layer requires establishing an association model between hardware indicators and software behaviors, such as using the perf tool to analyze the correlation between CPU cache hit rate and ANR trigger frequency.
This vertical deepening is not linear progression, but a process of cyclic verification: when hardware layer analysis finds memory bandwidth bottlenecks, it is necessary to backtrack to the Binder driver layer to check if frequent cross-process communication causes memory copy storms, and finally optimize data transmission mechanisms at the system service layer.
From Passive Response to Active Defense: Four Steps for ANR Governance
The evolution path of the methodology, “Diagnosis → Tracking → Prediction → Design”, reflects the transition of technical cognitive maturity. Specific steps include:
- Stack Analysis: Traditional ANR analysis relies on thread stacks in traces.txt, which is essentially a static snapshot taken when a fault occurs. When a problem is caused by sporadic race conditions (such as instantaneous saturation of the Binder thread pool), the stack may show a normal nativePollOnce state and cannot reveal the true resource contention process. At this time, multi-time point stack comparison needs to be introduced to identify thread state migration patterns by comparing stack changes within 5 seconds before and after the ANR.
- Dynamic Tracking: Using millisecond-level event tracking capabilities provided by tools such as systrace and perfetto, one can monitor the event processing cycle of the main thread Looper and quantify the execution time of dispatchMessage; combined with binder_transaction events in the Binder driver, heat maps of cross-process calls can be drawn. The core value of dynamic tracking lies in revealing hidden time correlations, such as discovering that input event delays often follow immediately after SharedPreferences disk write operations.
- Machine Learning Prediction: When the root cause of ANR involves interaction of multiple subsystems (such as coupling effects of CPU scheduling, memory reclamation, and I/O load), traditional methods struggle to handle high-dimensional data. By collecting over 20 indicators such as thread status, Binder interaction data, and CPU contention situations, and using machine learning algorithms to build analysis models, ANR types (such as main thread blocking, IPC deadlocks, or resource contention) can be automatically identified. Google has applied similar technology in Android Vitals to achieve cloud-based aggregated analysis of ANR root causes.
- Architectural Preventive Design: The ultimate goal is to internalize system constraints from the code design stage, for example:
  - Communication Topology Constraints: Limit cross-process call levels, avoid chain calls of A → B → C, and use event bus broadcast patterns instead.
  - Resource Budget Management: Allocate Binder transaction quotas for each business module, and automatically downgrade when exceeding thresholds.
  - Asynchronous Boundary Reinforcement: Use HandlerThread and Executor to strictly isolate synchronous and asynchronous operations to prevent thread model confusion.
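As an illustration of multi-time point stack comparison, the following sketch takes a series of main-thread samples and reports the longest stretch during which the top frame did not change. It is a simplified assumption of how such a comparison could work: `StackSampleComparer`, `Sample`, and `Stall` are hypothetical names, and the sample data in `main` is invented for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that compares main-thread samples taken at several points in time. */
class StackSampleComparer {
    static final class Sample {
        final long timeMs;
        final String topFrame;

        Sample(long timeMs, String topFrame) {
            this.timeMs = timeMs;
            this.topFrame = topFrame;
        }
    }

    static final class Stall {
        final String frame;
        final long durationMs;

        Stall(String frame, long durationMs) {
            this.frame = frame;
            this.durationMs = durationMs;
        }
    }

    /** Finds the longest run of consecutive samples that share the same top frame. */
    static Stall longestStall(List<Sample> samples) {
        Stall best = new Stall(null, 0);
        int runStart = 0;
        for (int i = 1; i <= samples.size(); i++) {
            boolean runEnds = i == samples.size()
                    || !samples.get(i).topFrame.equals(samples.get(runStart).topFrame);
            if (runEnds) {
                // Duration is measured from the first to the last sample of the run.
                long span = samples.get(i - 1).timeMs - samples.get(runStart).timeMs;
                if (span > best.durationMs) {
                    best = new Stall(samples.get(runStart).topFrame, span);
                }
                runStart = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented sample data: the snapshot at 5000 ms alone would look idle.
        List<Sample> samples = Arrays.asList(
                new Sample(0, "MessageQueue.nativePollOnce"),
                new Sample(1000, "SharedPreferencesImpl.awaitLoadedLocked"),
                new Sample(2000, "SharedPreferencesImpl.awaitLoadedLocked"),
                new Sample(4000, "SharedPreferencesImpl.awaitLoadedLocked"),
                new Sample(5000, "MessageQueue.nativePollOnce"));
        Stall stall = longestStall(samples);
        System.out.println(stall.frame + " for " + stall.durationMs + " ms");
    }
}
```

A single snapshot taken at the last sample would show only the idle poll frame, while the comparison across samples exposes the three-second stretch in between.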
This evolutionary path from passive response to active defense not only provides effective strategies for preventing ANR from the root cause but also provides developers with rich diagnostic tools and optimization ideas.
ANR Related Material Sharing
- Reflection | Design and Implementation of Android Input System & ANR Mechanism
- Xigua Video Stability Governance System Construction 1: Tailor Principle and Practice
- Xigua Video Stability Governance System Construction 2: Raphael Principle and Practice
- Xigua Video Stability Governance System Construction 3: Sliver Principle and Practice
- Xigua Stutter & ANR Optimization Governance and Monitoring System Construction
- Toutiao ANR Optimization Practice Series - Design Principle and Influencing Factors
- Toutiao ANR Optimization Practice Series - Monitoring Tools and Analysis Ideas
- Toutiao ANR Optimization Practice Series - Instance Analysis Collection
- Toutiao ANR Optimization Practice Series - Barrier Causes Main Thread Deadlock
- Toutiao ANR Optimization Practice Series - Farewell to SharedPreference Waiting
- Android ANR | Principle Analysis and Common Cases
