View Source

This wiki tracks developer-testing for NextGenMapReduce.

This aim of this document is to capture various failure handling scenarios for MapReduce applications running under YARN and the YARN framework itself.

Failure scenarios

User task error

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM is immediately notified of error by NM with appropriate error code/status-msg
CapacityScheduler releases resources for queue, user and application
RM notifies AM about status (including error code) of the container
AM fails the task attempt
AM re-runs task-attempt before other 'virgin' tasks on a different node

User task error, same task fails 4 times

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM is immediately notified of error by NM with appropriate error code/status-msg
CapacityScheduler releases resources for queue, user and application
RM notifies AM about status (including error code) of the container
AM fails the task attempt
AM re-runs task-attempt before other 'virgin' tasks on a different node
AM fails the MapReduce job and exits

Container failure

Localization error

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM is immediately notified of error by NM with appropriate error code/status-msg
CapacityScheduler releases resources for queue, user and application
RM notifies AM about status (including error code) of the container
AM fails the task attempt
AM re-runs task-attempt before other 'virgin' tasks on a different node

Exceeding memory or disk limits

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM is immediately notified of error by NM with appropriate error code/status-msg
CapacityScheduler releases resources for queue, user and application
RM notifies AM about status (including error code) of the container
AM fails the task attempt
AM re-runs task-attempt before other 'virgin' tasks on a different node

Lost map output or faulty NM Netty

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
Reduces report shuffle failure errors to AM
On sufficient fetch-failure notifications the AM re-runs map

User fails/kills map or reduce task

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM is immediately notified of error by NM with appropriate error code/status-msg
CapacityScheduler releases resources for queue, user and application
RM notifies AM about status (including error code) of the container
AM fails the task attempt
AM re-runs task-attempt before other 'virgin' tasks on a different node

Node failure due to timeout or health-check error

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM fails all running containers and informs appropriate AMs
Shuffle failures for completed map containers... handled (aggressively?) by AM
AM re-runs running task-attempts and completed maps

MapReduce AM failure

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
NM notifies RM
CapacityScheduler releases resources for queue, user and application
ASM recognises AM failure
ASM kills all running containers
ASM restarts MapReduce AM
MapReduce AM recovers and re-runs only non-complete tasks

ResourceManager bounce

Corrective measures	Developer(s) verifying the corrective measures	Date(s)
RM recovers all running AMs
RM recovers all running containers
RM rebuilds CapacityScheduler queue & user capacities
MapReduce AMs re-runs only non-complete tasks