OpenStack Best Practices for Reliable Cloud Deployments
Over the past few years, OpenStack has undergone major transformations. Driven by a dynamic open-source community with hundreds of contributors and the growing demands of modern cloud-native workloads, its evolution has been both rapid and profound. New components have emerged, others have been deprecated, deployment strategies have matured, and best practices have shifted to address real-world production challenges. To successfully face so many changes in this fast-moving landscape, staying up to date is critical, especially to avoid early architectural pitfalls that can become costly or even irreversible later on.
Keeping up with this fast-paced evolution is far from easy. The good news is that we at Cloudification have spent years building and operating OpenStack environments – from lean developer testbeds to production cloud infrastructures with tens of thousands of instances. This expertise has led us to create our own c12n Private Cloud Solution, which delivers enterprise-grade OpenStack. Now we want to share some of that knowledge with you in a compilation of the best practices we have learned along our OpenStack journey.
We’ve prepared this guide for you to help you get the most out of your OpenStack deployment, whether you’re just getting started or scaling up to support demanding mission-critical workloads. This guide consolidates cloud architecture design and operational best practices. Some of these choices – especially related to storage, networking, and service layout – should be considered early in the design process, as they can be difficult, if not impossible, to change later.
P.S. If you’re new to our blog – or just beginning your OpenStack journey and want to get up to speed—we invite you to check out our earlier OpenStack articles where we cover the basics (and a little bit more). It’s a great place to start before diving into the deeper best practices we explore here.🤿
15 Proven OpenStack Best Practices for a Smooth Production Deployment
Let’s take a closer look at the key architectural and operational practices that will help you build a rock-solid cloud – based on lessons we’ve learned in the field across dozens of OpenStack deployments of different versions, scales, and purposes.
1. Migrate to OVN for Modern Networking
ML2 with Open vSwitch (OVS) is deprecated and no longer actively developed, making it increasingly difficult to operate and troubleshoot.
🤓Migrate to Open Virtual Network (OVN), which provides better performance, native distributed L3 routing (East-West and North-South), and more stable SDN features. While the migration from ML2/OVS can be complex—especially in production—it’s a worthwhile investment for long-term maintainability and scalability.
If you’re starting a new OpenStack cloud deployment, make sure to go with ML2/OVN from the start. The old ML2/OVS setup is known to have scaling limitations, is harder to debug due to the additional network agents, and requires more operational knowledge – with painful full syncs among its well-known problems.
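As a rough sketch, the core Neutron settings for an ML2/OVN deployment look something like the excerpt below. The addresses, VNI ranges, and scheduler choice are illustrative – your deployment tool (Kolla Ansible, OpenStack-Helm, etc.) normally renders these for you:

```ini
# /etc/neutron/plugins/ml2/ml2_conf.ini (illustrative excerpt)
[ml2]
mechanism_drivers = ovn
type_drivers = local,flat,vlan,geneve
tenant_network_types = geneve
extension_drivers = port_security

[ml2_type_geneve]
vni_ranges = 1:65536
max_header_size = 38

[ovn]
ovn_nb_connection = tcp:192.0.2.10:6641   # OVN northbound DB (example address)
ovn_sb_connection = tcp:192.0.2.10:6642   # OVN southbound DB (example address)
ovn_l3_scheduler = leastloaded
enable_distributed_floating_ip = true
```

Enabling distributed floating IPs is what gives you the native East-West and North-South routing benefits without hair-pinning traffic through a central network node.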
2. Always Gracefully Stop Neutron Services Before Reboots
Rebooting nodes without gracefully stopping Neutron agents can result in inconsistent IP assignments, dangling ports, and broken metadata services.
🛑Always stop Neutron agents cleanly using systemctl stop neutron-* before rebooting. This ensures leases and state data are saved properly, reducing post-reboot issues with DHCP, metadata service, or router namespaces.
If using OVN, make sure to gracefully stop the ovn-controller and the OVN metadata agent (neutron-ovn-metadata-agent). If using Octavia, octavia-health-manager should be stopped too.
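A pre-reboot drain helper might look like the following sketch. The unit names are examples and vary between distributions and deployment tools; with DRY_RUN=1 (the default here) it only prints the commands instead of executing them:

```shell
#!/bin/sh
# Hypothetical pre-reboot drain helper. Unit names vary between distros and
# deployment tools; DRY_RUN=1 only prints what would be executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Stop metadata and health-manager services first and the SDN controller last,
# so state is flushed before the data plane goes down.
for unit in octavia-health-manager neutron-ovn-metadata-agent ovn-controller; do
  run systemctl stop "$unit"
done
```

Running it once with DRY_RUN=1 on a node is a cheap way to verify the unit names match your deployment before wiring it into a reboot procedure.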
When running OpenStack on Kubernetes, you’ll need to taint the node instead, because Pods managed by a DaemonSet cannot simply be deleted – they are instantly scheduled back. What you can do is:
$ kubectl taint nodes $NODE_NAME key=value:NoSchedule
3. Boot Instances from Cinder Root Volumes
Relying on ephemeral storage (the Nova default) means instance data is lost after deletion, rebuild, or an unrecoverable host failure. What can you do instead?
🗄️Boot instances from Cinder root volumes instead. This enables persistent storage, easier backups, migration, and snapshotting. It works great with Ceph or other storage options. It’s also essential for production workloads where durability matters.
When using Cinder for VM root volumes you can also decide whether to keep the root volume on VM deletion or delete it with the VM.
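With Terraform and the OpenStack provider, booting from a Cinder-backed root volume can be sketched like this – the image variable, flavor, sizes, and network name are placeholders for your environment:

```hcl
# Sketch: instance with a Cinder root volume (names and sizes are examples).
variable "image_id" {
  type = string
}

resource "openstack_compute_instance_v2" "web" {
  name        = "web-1"
  flavor_name = "m1.medium"

  block_device {
    uuid                  = var.image_id   # Glance image copied into the volume
    source_type           = "image"
    destination_type      = "volume"       # root disk lives in Cinder, not ephemeral
    volume_size           = 40             # GiB
    boot_index            = 0
    delete_on_termination = false          # keep the root volume when the VM is deleted
  }

  network {
    name = "private"
  }
}
```

Setting delete_on_termination to false is the code-level expression of the keep-or-delete decision mentioned above.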
4. Separate Storage, Compute and Networking Roles in Larger Clusters
Hyperconverged infrastructure (HCI) offers great cost efficiency; however, overloading nodes with too many roles (compute, storage, networking, etc.) leads to noisy-neighbor issues and hard-to-debug performance bottlenecks.
Check out this article if you want to discover how to use OpenStack with both HCI and traditional infrastructure environments.
⚡For clusters larger than ~20–30 nodes, separate out dedicated nodes for storage (Ceph OSDs and monitors) and networking (OVN gateway nodes). This keeps workloads better isolated and improves system stability and scaling.
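With Kolla Ansible, for example, role separation is expressed in the multinode inventory. A simplified sketch (hostnames and counts are placeholders):

```ini
# Kolla Ansible multinode inventory (simplified sketch, hostnames are examples)
[control]
ctrl[01:03]

[network]
net[01:02]       ; dedicated network / gateway nodes

[compute]
cmp[01:20]

[storage]
stor[01:06]      ; storage hosts, kept off the compute nodes

[monitoring]
mon01
```

Keeping these groups disjoint from the start makes it trivial to scale one tier without touching the others.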
5. Physically or Logically Separate Storage, API, and Tenant Traffic
A single flat network for all traffic can become saturated as I/O and instance traffic grow.
📝Create separate networks for:
- Ceph storage replication
- Internal API
- Tenant traffic
Use dedicated physical interfaces, or at least separate VLANs, to ensure that each type of traffic gets the bandwidth, QoS, and isolation it needs – especially when using high-throughput solutions like Ceph or NVMe over Fabrics (NVMe-oF).
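In Kolla Ansible this separation maps to per-role interface variables in globals.yml. A sketch, with interface names that are examples for your hardware:

```yaml
# globals.yml excerpt (interface names depend on your hardware)
network_interface: eth0              # default fallback for all services
api_interface: eth0                  # internal API endpoints
tunnel_interface: eth1               # tenant overlay (Geneve/VXLAN) traffic
storage_interface: eth2              # traffic towards Ceph / storage backends
neutron_external_interface: eth4     # provider / external networks
```

Even if several variables point at the same NIC today, defining them explicitly keeps the door open for splitting traffic onto dedicated links later.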
6. Use Ceph RGW over Classic Swift for Object Storage
Swift’s traditional object store has fallen behind in terms of operational convenience and features compared to Ceph. It has also been recently deprecated in popular tools such as openstack-kolla.
🧩Ceph’s RADOS Gateway (RGW) offers S3 compatibility, multi-tenancy, and great integration with the rest of the OpenStack ecosystem. If you already use Ceph for block and file storage, it makes architectural sense to unify around it for object storage as well.
With Ceph you can also assign pools to different device classes, allowing the use of classic HDD drives for Object storage and fast NVMe drives for Cinder Block devices in one cluster.
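The device-class split can be sketched with CRUSH rules like the following command transcript. Rule and pool names are examples, and these commands are meant to be run against an existing Ceph cluster, not copied blindly:

```
# Sketch: device-class-aware CRUSH rules (rule/pool names are examples).
# Create replicated rules restricted to one device class each:
ceph osd crush rule create-replicated rgw-hdd  default host hdd
ceph osd crush rule create-replicated vol-nvme default host nvme

# Point the RGW data pool at HDDs and the Cinder pool at NVMe:
ceph osd pool set default.rgw.buckets.data crush_rule rgw-hdd
ceph osd pool set cinder-volumes crush_rule vol-nvme
```

Changing a pool's CRUSH rule triggers data movement, so plan it during a quiet window on a busy cluster.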
7. Design Tenant and Project Structure with Future Scaling in Mind
Poorly structured projects and domains lead to permission issues, billing confusion, and management headaches.
🔑Use Keystone Domains for large orgs or tenants, projects for departments or teams, and sub-projects to represent finer-grained access. This hierarchy supports better RBAC (role-based access control) and clearer resource organization.
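As a sketch, the same hierarchy can be captured as code with the Terraform OpenStack provider – the domain, project, and team names below are examples:

```hcl
# Sketch: domain -> project -> sub-project hierarchy (names are examples).
resource "openstack_identity_project_v3" "acme" {
  name      = "acme"
  is_domain = true   # a Keystone domain is a top-level project
}

resource "openstack_identity_project_v3" "engineering" {
  name      = "engineering"
  domain_id = openstack_identity_project_v3.acme.id
}

resource "openstack_identity_project_v3" "platform_team" {
  name      = "platform-team"
  domain_id = openstack_identity_project_v3.acme.id
  parent_id = openstack_identity_project_v3.engineering.id
}
```

Defining the tree in code makes it easy to review and to replicate the same structure for each new organization.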
For more detailed information on OpenStack Multitenancy, check out this article where we cover Keystone and RBAC and how they form the core of identity and access control in OpenStack.
8. Use Fine-Grained RBAC to Avoid Resource Duplication
Admins often duplicate resources like networks or DNS zones across projects instead of sharing them securely.
🎉OpenStack supports fine-grained RBAC policies in many services. Use them to share external networks, floating IP pools, and Designate DNS zones from a central “services” project. This reduces duplication, ensures consistency, and improves governance.
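For example, a network owned by a central “services” project can be shared with a single tenant through Neutron’s RBAC API. A Terraform sketch, where the network name and target project ID are placeholders:

```hcl
# Sketch: share one network with one project instead of duplicating it
# (names and the target project ID are placeholders).
variable "tenant_a_project_id" {
  type = string
}

resource "openstack_networking_network_v2" "services_net" {
  name = "shared-services-net"
}

resource "openstack_networking_rbac_policy_v2" "share_net" {
  object_id     = openstack_networking_network_v2.services_net.id
  object_type   = "network"
  action        = "access_as_shared"
  target_tenant = var.tenant_a_project_id
}
```

The same pattern with action "access_as_external" lets a central project expose an external network to selected tenants only.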
9. Always Check Release Notes Before Upgrading
Upgrading without reading release notes can lead to broken APIs, use of deprecated options, or even compatibility problems.
📒Always read release notes—yes, for every component you’re upgrading. Glance, Cinder, Nova, and Neutron may introduce subtle behavioral changes or configuration updates that can affect your setup. Plan for dry runs and use pre-production environments to test upgrades first because rollback is never easy.
10. Think twice about OpenStack Vendor Lock-In
Some well-known vendors offer “open” OpenStack that isn’t really open – tying you tightly to their proprietary tooling or licensing model.
🫂Build OpenStack expertise in-house and use community-backed, vanilla OpenStack. In case the learning curve is too steep, managed solutions like c12n Private Cloud by Cloudification could help you to avoid vendor traps.
11. Deploy MySQL and RabbitMQ Separately Per Service
A single RabbitMQ or MySQL cluster can quickly become a bottleneck with as few as 20 nodes.
☁️In large clouds, consider running separate MySQL instances (e.g., one for Nova, one for Neutron, one for Cinder, etc.) and RabbitMQ clusters per service. This improves performance and fault isolation, and makes troubleshooting easier because each cluster serves exactly one service. Always monitor connection counts and queue lengths.
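Per-service isolation ends up as distinct connection strings in each service’s configuration. An illustrative excerpt, where hosts and credentials are placeholders:

```ini
# nova.conf (sketch): Nova talks only to its own RabbitMQ cluster and DB
[DEFAULT]
transport_url = rabbit://nova:secret@rabbit-nova-01:5672,nova:secret@rabbit-nova-02:5672//nova

[database]
connection = mysql+pymysql://nova:secret@mysql-nova/nova
```

Neutron, Cinder, and the rest would each point at their own rabbit-&lt;service&gt; and mysql-&lt;service&gt; endpoints in the same way, so a queue backlog or DB outage stays contained to one service.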
12. Use Infrastructure as Code for Full Lifecycle Management
Manual changes to OpenStack service configurations, or manual resource provisioning and configuration via the CLI or Horizon, don’t scale well and are easy to forget or misconfigure.
🧰Use Infrastructure as Code (IaC) tools to manage OpenStack resources just like any other infrastructure. Tools like Terraform, OpenTofu, Ansible, or OpenStack Heat let you document, review, and automate changes – making your setup more reproducible and auditable.
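A minimal Terraform sketch for tenant resources might look like this – authentication is assumed to come from a clouds.yaml entry, and all names and CIDRs are examples:

```hcl
# Sketch: a tenant network managed as code (names/CIDRs are examples).
terraform {
  required_providers {
    openstack = {
      source = "terraform-provider-openstack/openstack"
    }
  }
}

provider "openstack" {
  cloud = "mycloud"   # entry in clouds.yaml
}

resource "openstack_networking_network_v2" "app" {
  name = "app-net"
}

resource "openstack_networking_subnet_v2" "app" {
  name       = "app-subnet"
  network_id = openstack_networking_network_v2.app.id
  cidr       = "10.10.0.0/24"
  ip_version = 4
}
```

Every change then goes through plan, review, and apply – so the state of the cloud is documented in version control rather than in someone’s head.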
13. Stick to Mature OpenStack Services
Some OpenStack components – like Trove (DBaaS) or Sahara (Hadoop-as-a-Service) – aren’t widely adopted or actively maintained.
🦉Stick with mature, well-supported services unless you have a specific use case and the resources to maintain it. Core services like Nova, Neutron, Glance, Keystone, Cinder, and Ceph-backed storage are safe bets. Octavia, Manila, Designate, and Heat are stable enough too.
Check project activity in Git and the bug tracker on Launchpad before deploying another OpenStack service in production.
14. Integrate Centralized Logging & Monitoring Early
You can’t troubleshoot what you can’t see. OpenStack’s distributed architecture means logs are spread across multiple services and nodes.
💡Deploy a logging stack (like EFK – Elasticsearch, Fluentd, Kibana – or OpenSearch, or Loki + Promtail) and set up monitoring with Prometheus/Grafana from the start. Include the key OpenStack exporters and import ready-made dashboards.
Alerts on RabbitMQ queues and agent (service) status are must-haves, along with broad coverage of the rest of the stack. In our c12n.cloud, for example, we have around 250 alerts providing end-to-end visibility across all components.
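A sketch of a RabbitMQ backlog alert is shown below. The metric name follows the commonly used RabbitMQ Prometheus exporter, and the threshold and labels are examples to tune for your cloud:

```yaml
# prometheus-rules.yaml (sketch; threshold and labels are examples)
groups:
  - name: openstack-rabbitmq
    rules:
      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages_ready > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue {{ $labels.queue }} has a backlog"
          description: "More than 1000 ready messages for 10 minutes – an OpenStack service may be down or too slow."
```

A growing ready-message count is often the first visible symptom of a stuck or crashed OpenStack agent, well before users notice failures.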
15. Use Containerized Control Plane (Kolla-Ansible or OpenStack-Helm)
Containerized OpenStack control planes (like with Kolla-Ansible or OpenStack-Helm) allow faster upgrades, better separation of concerns, and more reproducibility. Many modern deployments are moving toward this model.
📦If you’re setting up a new cloud, consider starting with Kolla-Ansible or OpenStack-Helm. If you’re already running a traditional deployment with OS packages and systemd units, plan a migration over time to reduce tech debt and simplify day-2 operations.
Conclusion: Build a Solid Foundation Early
Many of the challenges we see in OpenStack environments aren’t caused by bugs but by early design decisions that don’t scale well or can’t easily be changed later. By following these proven best practices, you’re laying the groundwork for a cloud that’s robust, maintainable, and ready for future growth.
Needless to say, our own c12n Private Cloud solution is built on these very best practices (and a few more) and is continuously tested in real-world production scenarios. Whether you’re operating OpenStack yourself or looking for a fully managed option, c12n is designed to give you the power of OpenStack without the pain of vendor lock-in or unpredictable behavior.
And we’re not stopping here.
👀Stay tuned for our next article, where we’ll share a comprehensive OpenStack Troubleshooting Guide – full of real-world fixes, debugging tips, and insights from our cloud engineering team.
Want help hardening OpenStack or migrating your current setup to something more reliable? Reach out to us – we’re here to help you design, scale, and optimize your OpenStack cloud with confidence.