Wednesday, May 26th, 2021 - 4:00 PM to 5:30 PM PT
4:00pm - Welcome + Networking
4:15pm - 4:45pm - Jeffery Smith
4:45pm - 5:15pm - Jason Yee
5:15pm - Wrap up
** Times are approximate
Jeff Smith has been in the technology industry for over 20 years, oscillating between management and individual contributor roles. Jeff currently serves as the Director of Production Operations for Centro, an advertising software company headquartered in Chicago, Illinois. Before that, he served as the Manager of Site Reliability Engineering at Grubhub.
Jeff is passionate about DevOps transformations in organizations large and small, with a particular interest in the psychological aspects of problems in companies. He lives in Chicago with his wife, Stephanie, and their two kids, Ella and Xander.
Jeff is also the author of Operations Anti-Patterns, DevOps Solutions, published by Manning (https://www.manning.com/books/operations-anti-patterns-devops-solutions).
Troubleshooting Tiered Tragedy: A Peek Into Failure
Talk Abstract: Failure is complicated. Sometimes an incident can reveal latent failures in your systems that have just been sitting dormant, waiting for the right combination of factors to activate them. In this talk, Jeff Smith will walk through a real failure scenario and the process Centro uses to highlight issues that go beyond just the life cycle of an outage. We’ll walk through the importance of looking into signals before they become catastrophic and ensuring your team has the capacity to do so. We’ll examine how monitoring the same system from multiple vantage points can help avoid confusion and gain clarity during an incident. Lastly, we’ll look at how the Product organization plays a vital role in protecting system uptime and how a collaborative culture can decrease your Mean Time to Recovery.
Jason Yee is Director of Advocacy at Gremlin, where he helps companies build more resilient systems by learning from how they fail. He also leads the internal Chaos Engineering practices to make Gremlin more reliable. Previously, he worked at Datadog, O’Reilly Media, and MongoDB. Outside of work, he enjoys drinking whiskey, playing Pokémon GO, and making craft chocolate.
Validating your incident retrospective
How many times have you responded to an incident and thought, “This seems familiar”? After the last incident, you ran a retrospective, generated action items, and implemented those changes. So what went wrong? Complex systems. In a complex system, failures are always a combination of factors. Solving for one or more of those factors can often expose other risks that can contribute to other (sometimes similar) failures. In this talk, I’ll share how to use Chaos Engineering to validate your incident response/retrospective and uncover any latent issues they may cause.