Reliability in distributed systems



Is your system stable? Do you know what happens when one of its dependencies starts failing? Do you know exactly what each part of your system is doing now, or was doing at any point in the past? And how quickly could you identify the root cause of a problem if your system went down at 2 a.m.?

The talk focuses on distributed systems (microservices and APIs that communicate with databases, memory, third-party services, etc.), how to monitor them, how they fail, and how they recover, to help you answer the questions above.

The first part covers the importance of monitoring such systems at several levels: hardware monitoring, application monitoring, monitoring from outside the system, and detecting malfunctions from anomalies in the system’s data flows.
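As a rough illustration of the last level, detecting malfunctions from anomalies in a data flow, here is a minimal sketch. The abstract does not say which method the talk uses; the rolling z-score below, the class name `FlowAnomalyDetector`, and its parameters are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class FlowAnomalyDetector:
    """Hypothetical helper: flags a sample that deviates strongly
    from a rolling window of recent samples (e.g. events per minute)."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # most recent observations
        self.threshold = threshold           # allowed distance in standard deviations
        self.min_samples = min_samples       # do not judge until we have a baseline

    def observe(self, value):
        """Return True if `value` looks anomalous, then record it."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma == 0:
                # A perfectly flat baseline: any change counts as a deviation.
                anomalous = value != mu
            else:
                anomalous = abs(value - mu) / sigma > self.threshold
        self.samples.append(value)
        return anomalous


detector = FlowAnomalyDetector(window=10, threshold=3.0, min_samples=5)
# Steady traffic followed by a sudden drop in processed events.
readings = [100, 102, 98, 101, 99, 100, 5]
flags = [detector.observe(r) for r in readings]
```

In this sketch only the final reading is flagged. A sudden drop in throughput like this can reveal a failure that hardware and application health checks still report as healthy.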

The second part presents several standard techniques for preventing system failure when a dependency suffers an outage, and a technique for recovering from an inconsistent state after an outage.
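The abstract does not name the techniques covered. One widely used pattern for surviving a dependency outage is the circuit breaker, sketched below as an assumption about the kind of technique meant. The names `CircuitBreaker` and `CircuitOpenError` and the parameters `max_failures`, `reset_timeout`, and `clock` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected because the dependency is considered down."""


class CircuitBreaker:
    """Hypothetical sketch: stop calling a failing dependency for a while
    so that its outage does not drag the caller down with it."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency disabled, failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and serve a fallback (a cached value or a degraded response) when `CircuitOpenError` is raised.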

The content is useful and interesting for beginner and intermediate developers. Senior developers and those working on reliable distributed systems should keep this material in mind and master the techniques shown.
