Every engineering discipline has a theory of reliability. For civil engineers, it lives in load calculations and redundancy specs. Aviation earned its safety record through obsessive incident review. In the software industry, reliability has become the practice of assuming things will eventually break. What happens when you stop treating failure as an anomaly and start preparing for it as a certainty? The answer has become known as Site Reliability Engineering, or SRE.
SRE originated at Google in 2003, when Ben Treynor Sloss founded the first site reliability engineering team. The discipline later produced the book Site Reliability Engineering: How Google Runs Production Systems (2016), often called the Google SRE book, which sought to answer one question above all others: how do you keep massive, complex technology infrastructure running reliably when the people operating it are human?