A monitoring system should consist of four parts: data collection, data aggregation, data visualization, and alerting.
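Data collection
Server monitoring:
CPU | memory | disk usage | disk I/O | network
Service monitoring:
Databases | message queues | web service containers
API and website monitoring
Errors and tracing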
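As a rough illustration of the server-level collection listed above, the following sketch samples a few host metrics using only the Python standard library. The function name `collect_server_metrics` and the metric key names are assumptions made for this example. Memory, disk I/O, and network counters are left out because the standard library does not expose them portably; a real collector would typically rely on an agent or a third-party library for those.

```python
import os
import shutil
import time


def collect_server_metrics(path="/"):
    """Return one snapshot of a few host-level metrics as a flat dict.

    Hypothetical helper for illustration; the key names are not a standard.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = 0.0
    if usage.total > 0:
        used_percent = round(100.0 * usage.used / usage.total, 2)

    sample = {
        "timestamp": time.time(),
        "cpu.count": os.cpu_count(),
        "disk.total_bytes": usage.total,
        "disk.used_bytes": usage.used,
        "disk.used_percent": used_percent,
    }

    # Load average exists only on Unix-like systems and may be unavailable.
    if hasattr(os, "getloadavg"):
        try:
            load1, load5, load15 = os.getloadavg()
            sample["cpu.load1"] = load1
            sample["cpu.load5"] = load5
            sample["cpu.load15"] = load15
        except OSError:
            pass

    return sample


if __name__ == "__main__":
    for key, value in sorted(collect_server_metrics().items()):
        print(key, value)
```

A collector like this would normally run on a fixed interval and ship each snapshot to the aggregation stage.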
Aggregation
Aggregate counter data and timer data.
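One way to picture this step is a small in-memory aggregator that sums counters and summarizes timers once per flush interval. This is a minimal sketch, not a specific tool's implementation: the class name `Aggregator`, its method names, and the choice of summary statistics (count, min, max, mean, nearest-rank 95th percentile) are all assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


class Aggregator:
    """Hypothetical in-memory aggregator for counter and timer metrics."""

    def __init__(self):
        self.counters = defaultdict(int)   # name -> running sum
        self.timers = defaultdict(list)    # name -> raw samples (ms)

    def incr(self, name, value=1):
        """Add `value` to the named counter."""
        self.counters[name] += value

    def timing(self, name, millis):
        """Record one duration sample for the named timer."""
        self.timers[name].append(millis)

    def flush(self):
        """Summarize everything seen since the last flush, then reset."""
        summary = {"counters": dict(self.counters), "timers": {}}
        for name, samples in self.timers.items():
            ordered = sorted(samples)
            # Nearest-rank 95th percentile.
            idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
            summary["timers"][name] = {
                "count": len(ordered),
                "min": ordered[0],
                "max": ordered[-1],
                "mean": mean(ordered),
                "p95": ordered[idx],
            }
        self.counters.clear()
        self.timers.clear()
        return summary


if __name__ == "__main__":
    agg = Aggregator()
    agg.incr("requests")
    agg.incr("requests", 2)
    for ms in (10, 20, 30):
        agg.timing("latency", ms)
    print(agg.flush())
```

Flushing on a fixed interval turns a stream of raw events into one data point per metric per interval, which is the shape a time-series store and dashboard expect.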
Visualization
- Quick configuration for time-series data
- Grafana
Alerting
- Email
- SMS | voice call
- App
- WeChat
VictorOps?
VictorOps is a hub for centralizing the flow of information throughout the incident lifecycle. Driven by IT and DevOps system data, VictorOps provides a unified platform for real-time alerting, collaboration, and documentation.
Using VictorOps, teams resolve incidents faster to help minimize the impact of downtime and speed innovation.
Slack
OpsGenie delivers alerts with all the supporting information to the right people, enabling them to assess the incident and take appropriate action rapidly. Email and SMS delivery times can be unpredictable. OpsGenie provides mobile apps that take advantage of push notification technology to deliver notifications in near real time, and allows users to define multiple notification methods that can be used in succession. For instance, users can choose to receive notifications via email or mobile push immediately, and then receive a text message or a phone call if they don’t see the alert within 5 minutes.
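The "notification methods used in succession" idea can be sketched as a small escalation policy: each step names a channel and a delay, and later steps fire only while the alert remains unacknowledged. This is an illustrative model, not OpsGenie's API; `Step`, `POLICY`, and `due_channels` are hypothetical names, and the 300-second delay mirrors the 5-minute example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    channel: str        # e.g. "email", "push", "sms", "voice"
    delay_seconds: int  # time after alert creation before this step fires


# Hypothetical policy: email and push immediately, SMS and voice after 5 min.
POLICY = [
    Step("email", 0),
    Step("push", 0),
    Step("sms", 300),
    Step("voice", 300),
]


def due_channels(policy, elapsed_seconds, acknowledged, already_sent):
    """Return the channels to notify now, in policy order.

    Nothing is sent once the alert is acknowledged, and a channel that has
    already been used for this alert is not notified again.
    """
    if acknowledged:
        return []
    return [
        step.channel
        for step in policy
        if step.delay_seconds <= elapsed_seconds
        and step.channel not in already_sent
    ]


if __name__ == "__main__":
    print(due_channels(POLICY, 0, False, set()))
    print(due_channels(POLICY, 300, False, {"email", "push"}))
    print(due_channels(POLICY, 300, True, {"email", "push"}))
```

Keeping the policy as plain data makes it easy to vary per user or per team, while the dispatch logic stays the same.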
PagerDuty is an IT alerting tool.