
OpenStack Production Operations

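This guide covers day-to-day operations for a production OpenStack environment: routine health checks, log management, monitoring and alerting, backup and recovery, and upgrades.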

Daily Checks

System health check

bash
#!/bin/bash
# daily_check.sh - daily OpenStack health check script

echo "========== OpenStack Daily Check $(date) =========="

echo -e "\n=== Nova Services ==="
openstack compute service list

echo -e "\n=== Neutron Agents ==="
openstack network agent list

echo -e "\n=== Cinder Services ==="
openstack volume service list

echo -e "\n=== Glance Images ==="
openstack image list

echo -e "\n=== Host Aggregates ==="
openstack aggregate list

echo -e "\n=== Hypervisor Status ==="
openstack hypervisor list

echo -e "\n=== Instance Status ==="
openstack server list --all-projects

echo -e "\n=== Volume Status ==="
openstack volume list --all-projects
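
The script above only prints status tables, so someone still has to read them. As a minimal sketch, the check can be reduced to a number that cron or a monitoring wrapper can act on. The helper name `count_not_up` is hypothetical, and the sketch assumes the `State` column reports `up` or `down`, as `openstack compute service list` does.

```shell
# count_not_up: hypothetical helper, not part of the OpenStack CLI.
# Reads one service state per line on stdin and prints how many
# non-empty lines are anything other than "up" (case-insensitive).
count_not_up() {
    awk 'NF && tolower($1) != "up" { n++ } END { print n + 0 }'
}
```

One possible use is `openstack compute service list -f value -c State | count_not_up`, treating any result other than 0 as a reason to look at the full table.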

Resource usage check

bash
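# Check compute resource usage (lists every project by default)
openstack usage list

# Check quota limits for a project
openstack quota show <project-id>

# Check network resources
openstack network list
openstack subnet list
openstack port list

# Check storage usage
openstack volume list --all-projects
# Requires Block Storage API microversion 3.27 or later
openstack volume attachment list --all-projects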
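
Quota checks are most useful when they are turned into a percentage and compared against a limit. The following is a sketch only: `usage_pct` and `over_threshold` are hypothetical helpers, and the used and limit figures are assumed to have been read from the quota output by hand or by another script.

```shell
# usage_pct: hypothetical helper. Prints USED ($1) as an integer
# percentage of LIMIT ($2). OpenStack reports an unlimited quota as -1,
# so a limit of zero or less prints "n/a" instead of dividing.
usage_pct() {
    if [ "$2" -le 0 ]; then
        echo "n/a"
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# over_threshold: succeeds (exit status 0) when USED ($1) is at or above
# PCT ($3) percent of LIMIT ($2); unlimited quotas never trigger it.
over_threshold() {
    [ "$2" -gt 0 ] && [ $(( $1 * 100 / $2 )) -ge "$3" ]
}
```

For example, `over_threshold 45 50 90 && echo "instances quota nearly exhausted"` would fire for a project using 45 of 50 instances.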

Node status check

bash
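# Unit names below follow RDO/CentOS packaging; Debian/Ubuntu packages
# drop the "openstack-" prefix and run Keystone under apache2.

# Check control node services (Keystone runs as a WSGI app under httpd)
systemctl status httpd
systemctl status openstack-nova-api
systemctl status neutron-server

# Check compute nodes
systemctl status openstack-nova-compute
systemctl status neutron-linuxbridge-agent

# Check network nodes
systemctl status neutron-l3-agent
systemctl status neutron-dhcp-agent

# Check disk usage
df -h

# Check memory usage
free -h

# Check CPU load
uptime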
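
Reading `df` output by eye does not scale past a few nodes. As a rough sketch, the disk check can be filtered down to the filesystems that actually need attention. The helper name `full_mounts` is hypothetical; it assumes POSIX `df -P` output and mount points without spaces.

```shell
# full_mounts: hypothetical helper. Reads `df -P` output on stdin and
# prints the mount point of every filesystem whose capacity column is
# at or above the percentage given as $1. The header line is skipped.
full_mounts() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6
    }'
}
```

Running `df -P | full_mounts 85` would then print only the mount points that are at least 85% full.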

Log Management

Log locations

| Service  | Log path |
|----------|----------|
| Keystone | /var/log/keystone/keystone.log |
| Nova     | /var/log/nova/nova-api.log, nova-compute.log |
| Neutron  | /var/log/neutron/server.log, neutron-*.log |
| Cinder   | /var/log/cinder/cinder-api.log, cinder-volume.log |
| Glance   | /var/log/glance/api.log |
| Horizon  | /var/log/apache2/horizon-error.log |
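
Once the log paths are known, a quick count of error lines per file is a cheap first signal. This is a minimal sketch: `count_level` is a hypothetical helper, and it assumes the services use the default oslo.log line layout, in which the level is the fourth whitespace-separated field.

```shell
# count_level: hypothetical helper. Counts the lines in FILE ($2) whose
# level field equals LEVEL ($1). Assumes the default oslo.log layout:
#   <date> <time> <pid> <LEVEL> <logger> [request ids] message
count_level() {
    awk -v lvl="$1" '$4 == lvl { n++ } END { print n + 0 }' "$2"
}
```

For example, `count_level ERROR /var/log/nova/nova-api.log` prints the number of ERROR lines currently in that file. Deployments that customize the log format would need a different field number.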

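Log level configuration

ini
# /etc/keystone/keystone.conf
[DEFAULT]
# oslo.log has no per-level option: debug = false logs INFO and above
debug = false
log_dir = /var/log/keystone
log_file = keystone.log
use_syslog = true

# /etc/nova/nova.conf
[DEFAULT]
log_dir = /var/log/nova
# log_file applies to every service that reads this file; leave it unset
# to get one log per service, named after the service binary
# log_file = nova-api.log

# /etc/neutron/neutron.conf
[DEFAULT]
log_dir = /var/log/neutron
debug = false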

Log rotation configuration

bash
# /etc/logrotate.d/openstack
/var/log/nova/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 nova nova
    postrotate
        systemctl reload nova-api > /dev/null 2>&1 || true
    endscript
}

Centralized Log Collection

ELK Stack integration

yaml
# Filebeat configuration
filebeat.inputs:
- type: log
  paths:
    - /var/log/nova/*.log
  fields:
    service: nova
  fields_under_root: true

output.logstash:
  hosts: ["logstash.example.com:5044"]

OpenStack log forwarding configuration

ini
# /etc/neutron/neutron.conf
[DEFAULT]
use_syslog = True
# oslo.log has no syslog_format option; the syslog facility is configurable
syslog_log_facility = LOG_USER

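Monitoring and Alerting

Prometheus integration

Install Prometheus Operator

bash
# Install kube-prometheus
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
# Create the namespace and CRDs first, then wait until the CRDs are registered
kubectl apply --server-side -f manifests/setup/
kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring
kubectl apply -f manifests/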

Configure the OpenStack exporter

bash
# Install openstack-exporter
pip install openstack-exporter

# Contents of openstack_exporter.yml (YAML settings, not shell commands)
auth_url: http://keystone:5000/v3
username: admin
password: password
project_name: admin
domain_name: default

Prometheus configuration

yaml
# prometheus.yaml
scrape_configs:
  - job_name: 'openstack'
    static_configs:
      - targets: ['openstack-exporter:9183']

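Key metrics

| Metric | Description | Alert threshold |
|--------|-------------|-----------------|
| nova_instance_count | Number of instances | > 90% of the configured limit |
| neutron_agent_status | Network agent status | DOWN |
| cinder_volume_usage | Volume usage | > 85% |
| keystone_token_expiry | Time until token expiry | < 5 minutes |
| rabbitmq_queue_depth | Message queue depth | > 10000 |
| mysql_connections | Database connections | > 80% of the maximum |
| compute_cpu_usage | Compute node CPU | > 90% |
| compute_memory_usage | Compute node memory | > 90% |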

Grafana dashboard

json
{
  "dashboard": {
    "title": "OpenStack Nova Dashboard",
    "panels": [
      {
        "title": "Instance Count",
        "type": "graph",
        "targets": [
          {
            "expr": "openstack_nova_instances_running"
          }
        ]
      },
      {
        "title": "Hypervisor CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "openstack_nova_hypervisor_cpu_util"
          }
        ]
      }
    ]
  }
}

Example alert rules

yaml
# alert-rules.yaml
groups:
- name: openstack
  rules:
  - alert: OpenStackServiceDown
    expr: up{job="openstack"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "OpenStack service is down"

  - alert: HighQueueDepth
    expr: rabbitmq_queue_messages > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "RabbitMQ queue depth is high"

  - alert: ComputeNodeHighCPU
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Compute node CPU usage is high"

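Backup and Recovery

Database backup

bash
# Full MySQL backup
mysqldump -u root -p --all-databases --single-transaction \
  --routines --triggers > openstack_backup_$(date +%Y%m%d).sql

# Scheduled backup script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backup/openstack"
DATE=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# Back up each service database.
# Credentials are read from the [client] section of ~/.my.cnf, because an
# interactive -p prompt would block an unattended run.
for db in nova nova_api nova_cell0 neutron cinder glance keystone placement heat; do
    mysqldump --single-transaction "$db" > "$BACKUP_DIR/${db}_${DATE}.sql"
done

# Compress today's dumps into the backup directory
(cd "$BACKUP_DIR" && tar -czf "openstack_db_${DATE}.tar.gz" ./*_"${DATE}".sql)

# Remove dumps and archives older than 7 days
find "$BACKUP_DIR" -maxdepth 1 \( -name "*.sql" -o -name "*.tar.gz" \) -mtime +7 -delete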
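
A backup that cannot be read back is not a backup, so it is worth checking each archive right after it is written. The sketch below uses a hypothetical helper, `verify_backup`. It only confirms that the archive is non-empty and that tar can list it; it does not prove that the SQL inside restores cleanly.

```shell
# verify_backup: hypothetical helper. Succeeds only if ARCHIVE ($1)
# exists, is non-empty, and can be listed by tar, which means the gzip
# stream and the tar structure are both intact.
verify_backup() {
    [ -s "$1" ] && tar -tzf "$1" > /dev/null 2>&1
}
```

A backup job could end with something like `verify_backup "$archive" || echo "backup verification failed" >&2`, where `$archive` is whatever path the job wrote. A periodic test restore into a scratch database remains the stronger check.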

Configuration file backup

bash
# Back up all OpenStack configuration files
BACKUP_DIR="/backup/openstack/config"
mkdir -p "$BACKUP_DIR"

# Back up each service's configuration
tar -czf "$BACKUP_DIR/keystone_$(date +%Y%m%d).tar.gz" /etc/keystone/
tar -czf "$BACKUP_DIR/nova_$(date +%Y%m%d).tar.gz" /etc/nova/
tar -czf "$BACKUP_DIR/neutron_$(date +%Y%m%d).tar.gz" /etc/neutron/
tar -czf "$BACKUP_DIR/cinder_$(date +%Y%m%d).tar.gz" /etc/cinder/

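Restore procedure

bash
# Stop all OpenStack services
systemctl stop 'openstack-*' 'neutron-*'

# Restore the databases
mysql -u root -p < openstack_backup_20240101.sql

# Restore configuration files
tar -xzf /backup/openstack/config/keystone_20240101.tar.gz -C /

# Restart services. Globs only match units already loaded in memory, so
# the stopped units are listed explicitly from the installed unit files.
systemctl start $(systemctl list-unit-files --state=enabled --no-legend 'openstack-*' 'neutron-*' | awk '{print $1}')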

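Upgrades and Maintenance

Minor version upgrade

bash
# Upgrade Kolla-Ansible (OpenStack 2023.2 corresponds to the 17.x series on PyPI)
pip install --upgrade 'kolla-ansible>=17,<18'

# Pull the new images
kolla-ansible -i multinode pull

# Run the upgrade
kolla-ansible -i multinode upgrade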

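Rolling upgrade

bash
# Upgrade control nodes
# 1. Upgrade the first control node
# 2. Verify that services are healthy
# 3. Upgrade the second control node
# 4. Verify that services are healthy
# 5. Upgrade the third control node

# Upgrade compute nodes
# 1. Migrate instances to other nodes
# 2. Upgrade the compute node
# 3. Verify that services are healthy
# 4. Move instances back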
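
The steps above follow one pattern: upgrade a node, verify it, and only then move on. A minimal sketch of that loop is shown here. `upgrade_node` and `verify_node` are hypothetical hooks that the operator would have to define for their own tooling; they are not OpenStack or Kolla-Ansible commands.

```shell
# rolling_upgrade: upgrades the given nodes one at a time and stops at
# the first failure, so a bad upgrade never reaches the remaining nodes.
# upgrade_node and verify_node are hypothetical, operator-defined hooks.
rolling_upgrade() {
    for node in "$@"; do
        echo "upgrading $node"
        upgrade_node "$node" || { echo "upgrade failed on $node" >&2; return 1; }
        verify_node "$node" || { echo "verification failed on $node" >&2; return 1; }
    done
    echo "all nodes upgraded"
}
```

With both hooks defined, `rolling_upgrade ctl1 ctl2 ctl3` would process the control nodes in order. Stopping at the first failed verification is the design choice that keeps at least part of the control plane on the known-good version.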

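Maintenance window operations

bash
# Disable the compute node (no new instances are scheduled to it)
openstack compute service set --disable compute01 nova-compute

# List the instances running on the node
openstack server list --all-projects --host compute01

# Live-migrate an instance off the node; the scheduler picks the destination
openstack server migrate --live-migration --block-migration vm1

# Re-enable the node when maintenance is done
openstack compute service set --enable compute01 nova-compute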
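
Migrating instances one by one gets tedious on a busy host. As a sketch, the list and migrate steps can be combined into a loop. `drain_host` is a hypothetical helper; it assumes shared storage (no block migration) and lets the scheduler choose each destination.

```shell
# drain_host: hypothetical helper. Requests a live migration for every
# instance on HOST ($1). Prints the IDs whose request was rejected and
# returns non-zero if any request failed.
drain_host() {
    failed=0
    for id in $(openstack server list --all-projects --host "$1" -f value -c ID); do
        if ! openstack server migrate --live-migration "$id"; then
            echo "migration request failed: $id" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `drain_host compute01` after disabling the node submits the requests. Migrations run asynchronously, so the host should be confirmed empty with `openstack server list --all-projects --host compute01` before maintenance starts.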