Editor's Review: A great talk on a subject very different from developing application functionality. The speaker is an expert on ensuring production reliability and performance. He talks about identifying, analyzing and debugging problems in production systems using Sentry, CollectD, Grafana, and OpenTracing.

You just deployed a new version of your application or micro-service; how do you know everything works as expected? You run your comprehensive test suite to verify functional correctness for known scenarios, along with performance tests, before deploying. But is your application really working right now, or is it just responding with error messages to all incoming requests?

I’m part of the team that runs a huge infrastructure for SAP HANA development. This infrastructure is vital for nearly all development and testing activities of SAP HANA. As it is powered by multiple applications developed in-house, we want to know immediately if an application starts to fail, and we need to be able to quickly diagnose what caused the failure.

This talk will give you an overview of how we monitor our full stack, from the 2000 physical machines up to the 10,000 Python application processes, micro-service instances and batch processing jobs running in parallel. It includes a review of the tools used, bad and good examples of instrumentation in Python code, the resulting visualization and an outlook on upcoming improvements.


0:00 Describes his job in Quality Assurance.

2:15 Anything that can go wrong will go wrong.

2:40 Identify that something is wrong

3:30 Analyze the problem

4:15 Observability

6:05 How To Log

10:00 How to look at logs

10:53 Distributed Logs

15:20 Can you fix the problem? / Sentry

18:26 Error Metrics / CollectD

20:25 Visualize Metrics / Grafana

22:25 Distributed Tracing / OpenTracing

25:05 Visualization of Distributed Tracing

26:15 Conclusion

27:10 Include Developers in Decisions

