How Facebook Keeps Its Large-Scale Infrastructure Up and Running

Facebook spends billions every year to keep its data centers up and running. With so many servers, certain types of failures are to be expected.

Once MachineChecker creates an alert in a centralized alert-handling system, a tool called Facebook Auto-Remediation (FBAR) picks up the alert and executes customizable remediations to fix the error.
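
To make that flow concrete, here is a minimal sketch of an alert queue feeding a remediation dispatcher. It is not Facebook's code: the names `Alert`, `AlertQueue`, `Remediator`, and `reboot_host`, along with the alert kinds and host names, are hypothetical and chosen only for illustration. It assumes that each alert carries a failure kind and that a remediation is simply a function registered for that kind.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    host: str
    kind: str  # hypothetical failure category, e.g. "unresponsive"


class AlertQueue:
    """Stand-in for the centralized alert-handling system (hypothetical)."""

    def __init__(self) -> None:
        self._alerts: List[Alert] = []

    def publish(self, alert: Alert) -> None:
        self._alerts.append(alert)

    def drain(self) -> List[Alert]:
        # Hand back all pending alerts and leave the queue empty.
        alerts, self._alerts = self._alerts, []
        return alerts


class Remediator:
    """Stand-in for an auto-remediation tool: maps alert kinds to fixes."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Alert], str]] = {}

    def register(self, kind: str, handler: Callable[[Alert], str]) -> None:
        # "Customizable" here just means any function can be plugged in.
        self._handlers[kind] = handler

    def run(self, queue: AlertQueue) -> List[str]:
        results = []
        for alert in queue.drain():
            handler = self._handlers.get(alert.kind)
            if handler is None:
                # No automated fix is known, so a person has to look at it.
                results.append(
                    f"{alert.host}: no remediation for {alert.kind}, "
                    "escalating to a human"
                )
            else:
                results.append(handler(alert))
        return results


def reboot_host(alert: Alert) -> str:
    # A real remediation would act on the machine; this one only reports.
    return f"{alert.host}: rebooted"


queue = AlertQueue()
queue.publish(Alert(host="web001", kind="unresponsive"))  # from the checker
queue.publish(Alert(host="db042", kind="bad_dimm"))

remediator = Remediator()
remediator.register("unresponsive", reboot_host)

results = remediator.run(queue)
for line in results:
    print(line)
```

Keeping detection and remediation on opposite sides of a queue means either side can change without the other knowing: new checks only need to publish alerts, and new fixes only need to be registered for an alert kind.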

This post explains how Facebook keeps all of its hardware functional and reliable.


