Robert Lehmann
12–13 March 2025 | Alte Kaserne Winterthur
Staff SRE, Google
Robert has spent the past decade with Google SRE, building the production-monitoring tools used in the majority of Google's incident response. In a previous life he was a Python aficionado, and he enjoys teaching.
Talk
Planet-Scale Dashboards
Google runs hundreds of thousands of services globally, many of them interdependent and with shared concerns. At that scale, classical Federated Observability, in which a platform team provides foundations and/or building blocks that each team assembles on its own, breaks down.
In this talk, we will demonstrate how Google dramatically cut toil while providing best-in-class monitoring out of the box:
- What are the unique circumstances that contributed to Google’s scaling problem?
- A data model for re-usable dashboards
- Impact on both configuration overhead and incident response
- Beyond dashboards: how such re-use can be facilitated across the broader observability space
Robert will draw on Google’s research paper on Planet-Scale Dashboards (to be published in mid-2025) and on more than a decade of experience in SRE.