Shut down all VMs for Ceph maintenance

To perform maintenance that takes Ceph offline, all VMs must be stopped beforehand and restarted after the maintenance has finished.

Prerequisites

Have the following command line tools available: curl, jq, and xargs.

As an admin user, create an API key and store it in the ADMIN_APIKEY environment variable. E.g. export ADMIN_APIKEY=x4g...

Store the API hostname in the API_HOST environment variable. E.g. export API_HOST=api.pilw.io
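
To verify that the key and hostname work before proceeding, call the host list endpoint used in the first step and count the entries (a minimal sanity check; it should print the number of hypervisors rather than an authentication error):

curl -sS -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/host/list | jq length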

Step-by-Step Guide

Mark all hypervisors as not accepting workloads so no new resources can be created.

curl -sS -X GET -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/host/list | \
jq -r '.[]|.uuid' | \
xargs -n 1 -I {} curl -X PUT -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/admin/host_flags -d uuid={} -d is_accepting_workloads=0
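
To confirm the change took effect, the host list can be inspected again. The check below assumes the list response exposes an is_accepting_workloads field mirroring the flag set above; the field name and type are assumptions, so adjust if the actual response differs.

# Expect 0 hosts still accepting workloads (field name is an assumption).
curl -sS -X GET -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/host/list | \
jq '[.[] | select(.is_accepting_workloads == true or .is_accepting_workloads == 1)] | length'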

Get a list of all currently running VMs and store the result to a file.

curl -sS -H "apikey: $ADMIN_APIKEY" -X GET "https://$API_HOST/v1/user-resource/vm/all?status=running" > running_vms.json
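
As a quick sanity check of the snapshot, count how many running VMs were captured (a one-liner sketch):

jq length running_vms.json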

Stop all running VMs. This can take time, and not all VMs will agree to stop; those will be stopped forcefully in a later step.

cat running_vms.json | jq -r '.[]|.uuid' | \
xargs -n 1 -I {} curl -X POST -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/user-resource/admin/vm/stop -d uuid={}

Open the VMs panel in the admin UI, set Filter by status to show only running VMs, and press the Reload button. Keep reloading every once in a while and watch the list get shorter.
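
The same progress can be followed from the command line; the count below should shrink toward 0 as VMs stop (a sketch reusing the endpoint from the previous step):

curl -sS -H "apikey: $ADMIN_APIKEY" "https://$API_HOST/v1/user-resource/vm/all?status=running" | jq length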

Not all VMs agree to stop; Windows in particular is known for ignoring stop requests. These VMs must be stopped forcefully.

# There is no harm in sending stop again to VMs that are already stopped. We can reuse the same list.
cat running_vms.json | jq -r '.[]|.uuid' | \
xargs -n 1 -I {} curl -X POST -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/user-resource/admin/vm/stop -d uuid={} -d force=True

It is now safe to perform maintenance and bring Ceph offline.

Maintenance-maintenance-maintenance-...

Once Ceph is available again, mark hosts as accepting workloads.

curl -sS -X GET -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/host/list | \
jq -r '.[]|.uuid' | \
xargs -n 1 -I {} curl -X PUT -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/base-operator/admin/host_flags -d uuid={} -d is_accepting_workloads=1

Now start all VMs that were running before.

cat running_vms.json | jq -r '.[]|.uuid' | \
xargs -n 1 -I {} curl -X POST -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/user-resource/admin/vm/start -d uuid={}
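
If starting every VM at once puts too much load on the freshly recovered cluster, the starts can be paced. The loop below is a sketch; the 2-second delay is an arbitrary value to tune as needed.

cat running_vms.json | jq -r '.[]|.uuid' | \
while read -r uuid; do
  curl -sS -X POST -H "apikey: $ADMIN_APIKEY" https://$API_HOST/v1/user-resource/admin/vm/start -d uuid="$uuid"
  sleep 2   # arbitrary pause between starts; adjust as needed
done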

This process will take some time. Some starts might fail; these need investigation, possibly by the VM owner.

Finally, check the current status of all VMs listed in running_vms.json.

cat running_vms.json | jq -r '.[]|.uuid' | \
xargs -n 1 -I {} bash -c \
"curl -sS -X GET -H \"apikey: $ADMIN_APIKEY\" https://$API_HOST/v1/user-resource/admin/vm?uuid={} | jq -r '.uuid+\"\t\"+(.user_id|tostring)+\"\t\"+.status'"

The result has three columns:

  • VM UUID
  • User ID
  • VM status

Make note of all VMs that do not have status running. Either try to start the VM again, for example manually from the UI while impersonating the user, or send a notification to the user that their VM was unable to start and that they should go and have a look. The Virtual Console is useful for troubleshooting VM boot issues.
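
To list only the problem VMs directly, the same status check can be filtered in jq (a sketch of the loop above with a select added):

cat running_vms.json | jq -r '.[]|.uuid' | \
xargs -n 1 -I {} bash -c \
"curl -sS -X GET -H \"apikey: $ADMIN_APIKEY\" https://$API_HOST/v1/user-resource/admin/vm?uuid={} | jq -r 'select(.status != \"running\") | .uuid+\"\t\"+(.user_id|tostring)+\"\t\"+.status'"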