This is the third article in the Android App ANR series, sharing several ANR case studies. The series includes:
- Android App ANR Series 1: Understanding Android ANR Design Philosophy
- Android App ANR Series 2: ANR Analysis Methodology and Key Logs
- Android App ANR Series 3: ANR Case Studies
ANR (Application Not Responding) - a simple definition that encompasses much of Android’s system design philosophy.
First, ANR belongs to the application domain. This differs from SNR (System Not Responding), which covers cases where the system process (system_server) itself stops responding; ANR explicitly confines the problem to applications. SNR is guarded by the Watchdog mechanism (details can be found in Watchdog mechanism and problem analysis), while ANR is guarded by the message handling mechanism. Android implements a sophisticated mechanism at the system layer to detect ANR, and its core principle is message scheduling combined with timeout handling.
Second, the ANR mechanism is primarily implemented at the system layer. All ANR-related messages are scheduled by the system process (system_server) and then dispatched to application processes for actual processing. At the same time, the system process applies different timeout limits to track how each message is processed. If an application fails to handle a message in time, the timeout fires: the system collects state such as CPU/IO usage and process call stacks, and reports to the user that the process is not responding (the ANR dialog; some ROMs skip the dialog and return directly to the home screen).
Third, ANR issues are essentially performance problems. The ANR mechanism effectively constrains the application's main thread, requiring it to finish the most common operations (starting services, processing broadcasts, handling input) within specified time limits. If processing times out, the main thread is considered to have lost the ability to respond to other operations. Time-consuming work on the main thread, such as intensive CPU computation, heavy I/O, or complex UI layouts, reduces the application's responsiveness.
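The "dispatch a message, arm a timeout, disarm on completion" principle described above can be modeled in a few lines of plain Java. This is a minimal sketch under stated assumptions, not the framework's implementation: `TimeoutDispatcher`, `dispatch`, and the two executors are hypothetical stand-ins for system_server's timer and the app's single main thread. Whichever event happens first, completion or timeout, decides the outcome.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical model of the system-side timeout check. The names are
 * illustrative only; they are not Android framework APIs.
 */
final class TimeoutDispatcher {
    // Stand-in for the system process's timer.
    private final ScheduledExecutorService systemTimer =
            Executors.newSingleThreadScheduledExecutor();
    // Stand-in for the app's main thread: messages run one at a time.
    private final ExecutorService appMainThread = Executors.newSingleThreadExecutor();

    /**
     * Hands work to the app's main thread. Returns true if it finished within
     * timeoutMs, or false if the timeout fired first (an "ANR").
     */
    boolean dispatch(Runnable work, long timeoutMs) {
        CompletableFuture<Boolean> outcome = new CompletableFuture<>();
        // Arm the timeout: if nothing completes the outcome first, report a timeout.
        ScheduledFuture<?> timeout = systemTimer.schedule(
                () -> outcome.complete(false), timeoutMs, TimeUnit.MILLISECONDS);
        appMainThread.execute(() -> {
            work.run();
            // The app reports completion; complete() is a no-op if the timeout already won.
            outcome.complete(true);
            timeout.cancel(false); // disarm the timer
        });
        // Returns as soon as either side completes the outcome.
        return outcome.join();
    }

    void shutdown() {
        systemTimer.shutdownNow();
        appMainThread.shutdownNow(); // interrupts any work that is still running
    }
}
```

A timed-out message keeps running after `dispatch` returns `false`, which mirrors a real ANR: the report is produced while the main thread is still stuck.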
Finally, some ANR problems are very difficult to analyze. Sometimes underlying system behavior causes message scheduling to fail, and the problematic scenario is hard to reproduce. Such ANR issues often require significant time to understand system behavior that goes beyond the ANR mechanism itself. Other ANR problems are hard to investigate because many factors can destabilize the system, such as memory fragmentation caused by Linux kernel bugs or damaged hardware. Such low-level causes often leave ANR problems untraceable; they are not application issues at all, yet they cost application developers a lot of time. If you have worked on whole-system development and maintenance, you will understand this well. Therefore, I cannot guarantee that understanding everything in this chapter will solve every ANR problem. If you encounter a very difficult ANR issue, I suggest talking to colleagues who work on the Framework, drivers, or kernel; if the problem is just a sporadic one-in-a-hundred-thousand occurrence that does not affect normal program operation, I suggest ignoring it.
– From duanqz
Common ANR Causes
When looking for the cause of an ANR, the usual approach is bold hypotheses, careful verification. After extracting the anomalies, first assume one of them is the cause, then use that assumption as a starting point and check the surrounding logs to see whether they support it. If they don't, move on to the next anomaly.
Issues Within Current Process
- Deadlock
- Main thread calling Thread.join(), Thread.sleep(), or Object.wait(), or waiting to acquire a lock
- Main thread blocked in nSyncDraw
- Main thread time-consuming operations, such as complex layouts, large for loops, I/O, etc.
- Main thread blocked by child thread synchronization locks
- Main thread timing out while waiting for a child thread
- Main thread Activity lifecycle function execution timeout
- Main thread Service lifecycle function execution timeout
- Main thread BroadcastReceiver.onReceive execution timeout (even if goAsync is called)
- Render thread time-consuming operations
- Time-consuming network access
- Large amounts of data reading/writing
- Database operations
- Hardware operations (e.g., Camera)
- Service Binder count reaching its limit
- Another thread terminating or crashing, leaving the main thread waiting indefinitely
- Memory dump operations
- Heavy concurrent SharedPreferences reads/writes
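The most common entry in this list, the main thread blocked by a lock held by a child thread, can be reproduced on a desktop JVM. This is an illustrative sketch under assumptions: `LockContentionDemo`, the thread named `fake-main`, and `worker-io` are hypothetical, and `java.lang.management.ThreadMXBean` is a standard JVM API that Android's runtime does not provide. On a device, the equivalent evidence (the main thread's state and the lock it is waiting for) comes from the ANR trace.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo: a worker holds a lock that the "main" thread needs. */
final class LockContentionDemo {
    private static final Object sharedLock = new Object();

    /** Name of the thread owning the monitor that blocked is waiting on, or null. */
    static String lockOwnerOf(Thread blocked) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(blocked.getId());
        return info == null ? null : info.getLockOwnerName();
    }

    static String run() {
        try {
            CountDownLatch lockHeld = new CountDownLatch(1);
            CountDownLatch release = new CountDownLatch(1);
            // Worker grabs the shared lock and holds it during a long "I/O" operation.
            Thread worker = new Thread(() -> {
                synchronized (sharedLock) {
                    lockHeld.countDown();
                    try { release.await(); } catch (InterruptedException ignored) { }
                }
            }, "worker-io");
            // Stand-in for the main thread: it needs the same lock for a quick update.
            Thread main = new Thread(() -> {
                synchronized (sharedLock) { /* quick UI-state update */ }
            }, "fake-main");
            worker.start();
            lockHeld.await();
            main.start();
            // Wait until the fake main thread is actually BLOCKED on the monitor.
            while (main.getState() != Thread.State.BLOCKED) Thread.sleep(1);
            String owner = lockOwnerOf(main);
            release.countDown(); // let the worker finish so both threads exit
            worker.join();
            main.join();
            return owner;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

`LockContentionDemo.run()` should return `worker-io`: the fake main thread sits in the BLOCKED state on a monitor owned by the worker, however short its own critical section is.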
Issues in Remote Processes or System
- Binder calls into SystemServer that take too long on the SystemServer side:
  - The called method itself is slow, causing a timeout
  - Heavy Binder lock contention inside SystemServer, causing lock-wait timeouts
- Timing out while waiting for another process to return, e.g., fetching data from another process's ContentProvider
- Window state disorder causing Input dispatch timeout
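- The process hosting a ContentProvider crashing frequently, which also kills the current process
- Low memory on the whole device
- High CPU usage on the whole device
- High I/O utilization on the whole device
- SurfaceFlinger timeout
- Bugs in the system freeze feature
- The Watchdog in System Server firing (a system_server hang)
- Thermal throttling on the whole device limiting CPU frequency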
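For the remote-process causes above, the app cannot make the other side faster; it can only avoid blocking its main thread on the answer. The sketch below shows one way to bound such a wait in plain Java. `RemoteCallGuard` and `callWithFallback` are hypothetical names, and the `Supplier` merely stands in for a slow ContentProvider query or Binder call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical guard that keeps slow cross-process calls off the caller's thread. */
final class RemoteCallGuard {
    private final ExecutorService ioPool = Executors.newFixedThreadPool(2);

    /**
     * Runs remoteCall on a worker thread and completes with fallback if it does not
     * answer within timeoutMs or if it throws. The caller receives a future, so a
     * main thread using this never blocks on the remote side.
     */
    <T> CompletableFuture<T> callWithFallback(Supplier<T> remoteCall, T fallback, long timeoutMs) {
        return CompletableFuture.supplyAsync(remoteCall, ioPool)
                .completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback); // e.g. the provider's process crashed
    }

    void shutdown() {
        ioPool.shutdownNow(); // interrupts calls that are still running
    }
}
```

The design choice here is to trade freshness for responsiveness: a late or failed answer is replaced by a fallback value instead of stalling the caller until the system's timeout fires.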