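What exactly does SRE culture include?
I came across an article about what SRE is, or rather about what kind of SRE you actually want. Its points are well made, so I am sharing it here.
I cover the topics below in detail in my Geek Time (极客时间) column; follow my WeChat official account to find the link to the column.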
SRE is first and foremost about culture. Do you have a culture that is reliability focussed? What does reliability mean to you? Is it MTBF or MTTR? Do you have well defined SLAs (or SLOs)? How do you measure them? Do you meet them? Do you have Observability? Do you do blameless postmortems of incidents? Are they really blameless? Do you make your developers ‘carry pagers’ and do they do it without fear? Do you deploy on Fridays and sleep well over the weekend?
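The passage asks several key questions:
1. What does reliability mean to your team? Do you measure it with MTBF or with MTTR?
2. Do you have well-defined SLAs or SLOs for reliability? How do you measure them? Do you actually meet them?
3. Do you have observability, that is, systems that make the state of your services visible?
4. Do you hold blameless postmortems after incidents? Are they really blameless in practice?
5. And so on (the quote goes on to ask whether developers carry pagers without fear and whether you can deploy on Fridays and still sleep well).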
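It all comes down to one question: is the work your team does centered on reliability? Note that this means reliability, not DevOps. See the original article for details: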
https://sdarchitect.blog/2020/02/20/you-dont-need-sre-what-you-need-is-sre/
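To make questions 1 and 2 a little more concrete, here is a minimal Python sketch of how MTTR, MTBF, and availability against an SLO could be computed from a list of incidents. It is not taken from the article: the `Incident` class, the function names, the 30-day window, and the 99.9% target are all assumptions chosen for illustration, and it assumes non-overlapping incidents and at least one incident in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: one contiguous outage."""
    start: datetime  # when the service became unavailable
    end: datetime    # when the service was restored


def total_downtime(incidents):
    """Sum of all outage durations (assumes incidents do not overlap)."""
    return sum((i.end - i.start for i in incidents), timedelta())


def mttr(incidents):
    """Mean time to repair: average outage duration."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(incidents, window_start, window_end):
    """Fraction of the window during which the service was up."""
    window = window_end - window_start
    return 1 - total_downtime(incidents) / window


# Illustrative data: a 30-day window with two outages (30 and 90 minutes).
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 30)),
]

slo_target = 0.999  # assumed availability SLO of 99.9%
measured = availability(incidents, window_start, window_end)

print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
print("Availability: {:.4%}".format(measured))
print("SLO met:", measured >= slo_target)
```

With this sample data, two hours of downtime in 30 days gives an availability of roughly 99.72%, which misses the assumed 99.9% target. The value of such a calculation is less the numbers themselves than that the team has agreed on what is measured and checks it regularly.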
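I have also started a community for discussing SRE. Anyone interested is welcome to join; if the group is full, add me on WeChat (greatchengge) and I will invite you in.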