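OpenStack Production Operations

This article covers operations management for OpenStack production environments, including routine inspection, log management, and monitoring and alerting.

Routine Inspection

System Health Check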
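The service listings in a health check report a state for every service, while an operator usually only needs the ones that are not up. A minimal sketch of such a filter follows. The function name `down_services` and the three-column input layout are assumptions made for this example; the layout matches what `openstack compute service list -f value -c Binary -c Host -c State` is expected to print.

```bash
# Hypothetical helper: print "<binary> <host>" for every service whose
# state is not "up". Reads lines of "<binary> <host> <state>" from stdin.
down_services() {
    awk '$3 != "up" { print $1, $2 }'
}
```

It could be used as `openstack compute service list -f value -c Binary -c Host -c State | down_services`, where empty output means every service reported as up.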
```bash
#!/bin/bash
# daily_check.sh - OpenStack daily inspection script
echo "========== OpenStack Daily Check $(date) =========="
echo -e "\n=== Nova Services ==="
openstack compute service list
echo -e "\n=== Neutron Agents ==="
openstack network agent list
echo -e "\n=== Cinder Services ==="
openstack volume service list
echo -e "\n=== Glance Images ==="
openstack image list
echo -e "\n=== Host Aggregates ==="
openstack aggregate list
echo -e "\n=== Hypervisor Status ==="
openstack hypervisor list
echo -e "\n=== Instance Status ==="
openstack server list --all-projects
echo -e "\n=== Volume Status ==="
openstack volume list --all-projects
```

Resource Usage Check
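```bash
# Check compute resource usage (reported per project)
openstack usage list
# Show the quota limits for a project
openstack quota show <project-id>
# Check network resources
openstack network list
openstack subnet list
openstack port list
# Check storage usage
openstack volume list --all-projects
# List volume attachments (needs Block Storage API microversion 3.27 or later)
cinder --os-volume-api-version 3.27 attachment-list
```

Node Status Check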
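```bash
# Unit names vary by distribution: RDO/CentOS packages name the Nova units
# openstack-nova-*, Debian/Ubuntu packages use nova-*. Adjust to match your hosts.
# Check service status on controller nodes
systemctl status openstack-keystone
systemctl status openstack-nova-api
systemctl status neutron-server
# Check compute nodes
systemctl status nova-compute
systemctl status neutron-linuxbridge-agent
# Check network nodes
systemctl status neutron-l3-agent
systemctl status neutron-dhcp-agent
# Check disk usage
df -h
# Check memory usage
free -h
# Check CPU load
uptime
```

Log Management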
Log Locations

| Service | Log path |
|---|---|
| Keystone | /var/log/keystone/keystone.log |
| Nova | /var/log/nova/nova-api.log, /var/log/nova/nova-compute.log |
| Neutron | /var/log/neutron/server.log, /var/log/neutron/neutron-*.log |
| Cinder | /var/log/cinder/cinder-api.log, /var/log/cinder/cinder-volume.log |
| Glance | /var/log/glance/api.log |
| Horizon | /var/log/apache2/horizon-error.log |
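With the paths above, a quick first look during an inspection is how many error lines each log contains. The sketch below is one way to do that. The helper name `count_errors` is hypothetical, and it assumes the log format places the level name between spaces (as in ` ERROR `), which may not hold if the format strings have been customized.

```bash
# Hypothetical helper: for each readable file given as an argument,
# print "<count> <path>", where count is the number of lines containing
# " ERROR ". Unreadable or missing files are skipped silently.
count_errors() {
    for log_path in "$@"; do
        [ -r "$log_path" ] || continue
        printf '%s %s\n' "$(grep -c ' ERROR ' "$log_path")" "$log_path"
    done
}
```

For example, `count_errors /var/log/nova/*.log | sort -rn` would list the Nova logs with the most error lines first.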
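Log Level Configuration

```ini
# /etc/keystone/keystone.conf
[DEFAULT]
# debug = false logs at INFO and above; debug = true lowers the level to DEBUG
debug = false
log_dir = /var/log/keystone
log_file = keystone.log
# Send log records to syslog in addition to the file
use_syslog = true

# /etc/nova/nova.conf
[DEFAULT]
log_dir = /var/log/nova
log_file = nova-api.log

# /etc/neutron/neutron.conf
[DEFAULT]
log_dir = /var/log/neutron
debug = false
```

Log Rotation Configuration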
```conf
# /etc/logrotate.d/openstack
/var/log/nova/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 nova nova
    postrotate
        systemctl reload nova-api > /dev/null 2>&1 || true
    endscript
}
```

Centralized Log Collection
ELK Stack integration:

```yaml
# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nova/*.log
    fields:
      service: nova
    fields_under_root: true
output.logstash:
  hosts: ["logstash.example.com:5044"]
```

OpenStack syslog configuration:
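```ini
# /etc/neutron/neutron.conf
[DEFAULT]
use_syslog = true
# Syslog facility (LOG_USER is the oslo.log default); the message layout
# follows the logging_*_format_string options
syslog_log_facility = LOG_USER
```

Monitoring and Alerting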
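Prometheus Integration

Install the Prometheus Operator:

```bash
# Install kube-prometheus
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
# Create the namespace and CRDs first and wait until the CRDs are established
kubectl apply --server-side -f manifests/setup/
kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring
kubectl apply -f manifests/
```

Configure the OpenStack exporter: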
```bash
# Install openstack-exporter
pip install openstack-exporter
```

```yaml
# openstack_exporter.yml
auth_url: http://keystone:5000/v3
username: admin
password: password
project_name: admin
domain_name: default
```

Prometheus configuration:
```yaml
# prometheus.yaml
scrape_configs:
  - job_name: 'openstack'
    static_configs:
      - targets: ['openstack-exporter:9183']
```

Key Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| nova_instance_count | Number of instances | > 90% of the configured limit |
| neutron_agent_status | Network agent status | DOWN |
| cinder_volume_usage | Volume usage | > 85% |
| keystone_token_expiry | Time until token expiry | < 5 minutes |
| rabbitmq_queue_depth | Message queue depth | > 10000 |
| mysql_connections | Database connections | > 80% of the maximum |
| compute_cpu_usage | Compute node CPU | > 90% |
| compute_memory_usage | Compute node memory | > 90% |
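Several thresholds in the table are percentages of a limit. A rough sketch of that comparison in integer shell arithmetic follows; `exceeds_pct` and its argument order are hypothetical names chosen for this example and are not part of any exporter or OpenStack tool.

```bash
# Hypothetical helper: exceeds_pct <used> <total> <threshold_pct>
# Returns 0 when used/total is strictly above threshold_pct percent,
# 1 when it is not, and 2 when total is not a positive integer.
exceeds_pct() {
    pct_used=$1
    pct_total=$2
    pct_threshold=$3
    [ "$pct_total" -gt 0 ] || return 2
    # Compare used*100 with total*threshold to avoid fractional division
    [ $(( pct_used * 100 )) -gt $(( pct_total * pct_threshold )) ]
}
```

For instance, `exceeds_pct 172 200 85 && echo "volume usage above 85%"` prints the message because 172 of 200 is 86%.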
Grafana Dashboard

```json
{
  "dashboard": {
    "title": "OpenStack Nova Dashboard",
    "panels": [
      {
        "title": "Instance Count",
        "type": "graph",
        "targets": [
          {
            "expr": "openstack_nova_instances_running"
          }
        ]
      },
      {
        "title": "Hypervisor CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "openstack_nova_hypervisor_cpu_util"
          }
        ]
      }
    ]
  }
}
```

Alert Rule Example
```yaml
# alert-rules.yaml
groups:
  - name: openstack
    rules:
      - alert: OpenStackServiceDown
        expr: up{job="openstack"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OpenStack service is down"
      - alert: HighQueueDepth
        expr: rabbitmq_queue_messages > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is high"
      - alert: ComputeNodeHighCPU
        # CPU usage in percent, derived from node_exporter's idle-time counter
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Compute node CPU usage is high"
```

Backup and Recovery
Database Backup
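A dump that was cut short restores only part of the data, so it helps to check each file before relying on it. mysqldump normally ends its output with a `-- Dump completed` comment, unless comments are disabled with `--skip-comments`. A minimal sketch of a check built on that, with `dump_is_complete` as a hypothetical helper name:

```bash
# Hypothetical helper: succeed only if the file is non-empty and its
# last line is the "-- Dump completed" trailer that mysqldump writes
# when it finishes normally (assumes comments were not disabled).
dump_is_complete() {
    [ -s "$1" ] && tail -n 1 "$1" | grep -q '^-- Dump completed'
}
```

A backup job could call `dump_is_complete "$file" || echo "incomplete dump: $file" >&2` after each dump and before any old backups are deleted.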
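```bash
# Full MySQL backup
mysqldump -u root -p --all-databases --single-transaction \
    --routines --triggers > openstack_backup_$(date +%Y%m%d).sql
```

Scheduled backup script:

```bash
#!/bin/bash
# Credentials are read from the [client] section of /root/.my.cnf,
# because an interactive -p prompt cannot be answered under cron.
set -euo pipefail
BACKUP_DIR="/backup/openstack"
DATE=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
# Back up each service database
for db in nova nova_api neutron cinder glance keystone placement heat; do
    mysqldump --defaults-extra-file=/root/.my.cnf --single-transaction \
        "$db" > "$BACKUP_DIR/${db}_${DATE}.sql"
done
# Compress today's dumps into one archive, then remove the raw .sql files
tar -czf "$BACKUP_DIR/openstack_db_${DATE}.tar.gz" "$BACKUP_DIR"/*_"${DATE}".sql
rm -f "$BACKUP_DIR"/*_"${DATE}".sql
# Delete archives older than 7 days
find "$BACKUP_DIR" -name "openstack_db_*.tar.gz" -mtime +7 -delete
```

Configuration File Backup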
```bash
# Back up OpenStack configuration files
BACKUP_DIR="/backup/openstack/config"
DATE=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
# Back up each service's configuration directory
for svc in keystone nova neutron cinder; do
    tar -czf "$BACKUP_DIR/${svc}_${DATE}.tar.gz" "/etc/${svc}/"
done
```

Recovery Procedure
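```bash
# Record the enabled OpenStack units first: a glob passed to "systemctl start"
# only matches units that are already loaded, so it would miss stopped services.
# The pattern assumes "openstack-*" unit names; neutron-* units need their own pattern.
UNITS=$(systemctl list-unit-files 'openstack-*.service' --state=enabled --no-legend | awk '{print $1}')
# Stop all OpenStack services
systemctl stop $UNITS
# Restore the database
mysql -u root -p < openstack_backup_20240101.sql
# Restore configuration files
tar -xzf /backup/openstack/config/keystone_20240101.tar.gz -C /
# Start the services again
systemctl start $UNITS
```

Upgrades and Maintenance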
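Minor Version Upgrade

```bash
# Upgrade Kolla-Ansible; the 2023.2 release is published on PyPI as the 17.x series
pip install --upgrade 'kolla-ansible>=17,<18'
# Pull the new images
kolla-ansible -i multinode pull
# Run the upgrade
kolla-ansible -i multinode upgrade
```

Rolling Upgrade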
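```bash
# Upgrade controller nodes
# 1. Upgrade the first controller node
# 2. Verify that services are healthy
# 3. Upgrade the second controller node
# 4. Verify that services are healthy
# 5. Upgrade the third controller node
# Upgrade compute nodes
# 1. Migrate instances to other nodes
# 2. Upgrade the compute node
# 3. Verify that services are healthy
# 4. Move the instances back
```

Maintenance Window Operations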
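```bash
# Disable the compute node (prevents new instances from being scheduled to it)
openstack compute service set --disable --disable-reason "maintenance" compute01 nova-compute
# List the instances running on the node
openstack server list --all-projects --host compute01
# Live-migrate an instance off the node; the scheduler picks the target host
openstack server migrate --live-migration --block-migration vm1
# Re-enable the node when maintenance is finished
openstack compute service set --enable compute01 nova-compute
```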
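Draining a node usually means repeating the migrate command for every instance on it. The sketch below only prints the commands it would run, so the plan can be reviewed first. `plan_drain` is a hypothetical helper name, and the input is assumed to be one server ID per line, as produced by `openstack server list -f value -c ID`.

```bash
# Hypothetical helper: read server IDs from stdin (one per line) and
# print a live-migration command for each. Blank lines are ignored.
# Nothing is executed; the output is a plan to review or pipe to a shell.
plan_drain() {
    while IFS= read -r server_id; do
        [ -n "$server_id" ] || continue
        printf 'openstack server migrate --live-migration %s\n' "$server_id"
    done
}
```

A dry run could look like `openstack server list --all-projects --host compute01 -f value -c ID | plan_drain`. Printing rather than executing is a deliberate choice here, since a failed live migration in the middle of a loop is easier to handle by hand.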