|
| 1 | +--- |
| 2 | +title: Scaling Cloud Controller |
| 3 | +owner: CAPI |
| 4 | +--- |
| 5 | + |
| 6 | +<strong><%= modified_date %></strong> |
| 7 | + |
| 8 | +The purpose of this document is to provide operators with guidance on how and when to scale the jobs within capi-release. It is broken down by BOSH job (e.g. `cloud_controller_ng`) and highlights some of the key metrics, heuristics, and logs we have found helpful. These lists are by no means exhaustive, but they should serve as a good place to start. |
| 9 | + |
| 10 | +As always, suggestions and contributions to this document are welcome! |
| 11 | + |
| 12 | +## <a id='jobs'></a> Jobs [cf-deployment instance group] |
| 13 | +### <a id='cloud_controller_ng'></a> cloud\_controller\_ng [api] |
| 14 | +The `cloud_controller_ng` Ruby process can be considered the _main_ job in capi-release. It, along with `nginx_cc`, powers the Cloud Controller API that all users of Cloud Foundry interact with. In addition to serving external clients (e.g. the cf cli, web UIs, log/metrics aggregators), `cloud_controller_ng` also provides APIs for internal components within Cloud Foundry, such as the Loggregator and Networking subsystems. |
| 15 | + |
| 16 | +#### When to scale |
| 17 | +##### Key Metrics |
| 18 | +* `cc.requests.outstanding` is at or consistently near 20 |
| 19 | +* `system.cpu.user` is above 0.85 utilization of a single core on the api vm (see footnote [1] on BOSH cpu metrics) |
| 20 | +* `cc.vitals.cpu_load_avg` is 1 or higher |
| 21 | +* `cc.vitals.uptime` is consistently low, indicating frequent restarts (possibly due to memory pressure) |
| 22 | + |
| 23 | +##### Heuristics |
| 24 | +* Average response latency |
| 25 | +* Degraded web UI responsiveness or timeouts |
| 26 | + |
| 27 | +##### Logs |
| 28 | +* `/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log` |
| 29 | +* `/var/vcap/sys/log/cloud_controller_ng/nginx-access.log` |
| 30 | + |
| 31 | +#### How to scale |
| 32 | +Before and after scaling Cloud Controller API VMs, it's important to verify that the CC's database is not overloaded. All Cloud Controller processes are backed by the same database (ccdb), so heavy load on the database will impact API performance regardless of the number of Cloud Controllers deployed. Cloud Controller supports both PostgreSQL and MySQL, and lets users "bring their own database," so we are unable to provide specific scaling guidance for the database (see footnote [2] on database performance). |
| 33 | + |
| 34 | +In CF deployments with internal MySQL clusters, a single MySQL database VM with CPU usage over ~80% can be considered overloaded. When this happens, the MySQL VMs must be scaled up or the added load of additional Cloud Controllers will only exacerbate the problem. External |
| 35 | + |
| 36 | +Cloud Controller API VMs should primarily be scaled horizontally. Scaling up the number of cores on a single VM is not helpful because Ruby's Global Interpreter Lock (GIL) limits the `cloud_controller_ng` process so that it can only effectively use a single CPU core on a multi-core machine. |
| 37 | + |
| 38 | +### <a id='cloud_controller_worker_local'></a> cloud\_controller\_worker\_local [api] |
| 39 | +Colloquially known as "local workers," this job is primarily responsible for handling files uploaded to the API VMs during `cf push` (e.g. `packages`, `droplets`, resource matching). |
| 40 | + |
| 41 | +#### When to scale |
| 42 | +##### Key Metrics |
| 43 | +* `cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>` (e.g. `cc.job_queue_length.cc-api-0`) is continuously growing |
| 44 | +* `cc.job_queue_length.total` is continuously growing |
| 45 | + |
| 46 | +##### Heuristics |
| 47 | +* `cf push` is intermittently failing |
| 48 | +* `cf push` average time is elevated |
| 49 | + |
| 50 | +##### Logs |
| 51 | +* `/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log` |
| 52 | + |
| 53 | +#### How to scale |
| 54 | +Because local workers are colocated with the Cloud Controller API job, they are scaled horizontally along with the API. |
| 55 | + |
| 56 | +### <a id='cloud_controller_worker'></a> cloud\_controller\_worker [cc-worker] |
| 57 | +Colloquially known as "generic workers" or just "workers", this job (and VM) is responsible for handling asynchronous work, batch deletes, and other periodic tasks scheduled by the `cloud_controller_clock`. |
| 58 | + |
| 59 | +#### When to scale |
| 60 | +##### Key Metrics |
| 61 | +* `cc.job_queue_length.cc-<VM_TYPE>-<VM_INDEX>` (e.g. `cc.job_queue_length.cc-cc-worker-0`) is continuously growing |
| 62 | +* `cc.job_queue_length.total` is continuously growing |
| 63 | + |
| 64 | +##### Heuristics |
| 65 | +* `cf delete-org ORG_NAME` appears to leave its contained resources around for a long time |
| 66 | +* Users report slow deletes for other resources |
| 67 | +* cf-acceptance-tests succeed generally, but fail during cleanup |
| 68 | + |
| 69 | +##### Logs |
| 70 | +* `/var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log` |
| 71 | + |
| 72 | +#### How to scale |
| 73 | +The cc-worker VM can safely scale horizontally in all deployments, but if your worker VMs have CPU/memory headroom you can also use the `cc.jobs.generic.number_of_workers` BOSH property to increase the number of worker processes on each VM. |
| 74 | + |
| 75 | +### <a id='clock_and_updater'></a> cloud\_controller\_clock and cc\_deployment\_updater [scheduler] |
| 76 | +The `cloud_controller_clock` runs the Diego sync process and schedules periodic background jobs. The `cc_deployment_updater` is responsible for handling v3 [rolling app deployments](https://docs.cloudfoundry.org/devguide/deploy-apps/rolling-deploy.html). |
| 77 | + |
| 78 | +#### When to scale |
| 79 | +##### Key Metrics |
| 80 | +* `cc.diego_sync.duration` is continuously increasing over time |
| 81 | +* `system.cpu.user` is high on the scheduler VM (see footnote [1] on BOSH cpu metrics) |
| 82 | + |
| 83 | +##### Heuristics |
| 84 | +* Diego domains are [frequently unfresh](https://github.com/cloudfoundry/bbs/blob/master/doc/domains.md#domain-freshness) |
| 85 | +* The Diego Desired LRP count is larger than the total process instance count reported via the Cloud Controller APIs |
| 86 | +* Deployments are slow to increase and decrease instance count |
| 87 | + |
| 88 | +##### Logs |
| 89 | +* `/var/vcap/sys/log/cloud_controller_clock/cloud_controller_clock.log` |
| 90 | +* `/var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log` |
| 91 | + |
| 92 | +#### How to scale |
| 93 | +Both of these jobs are singletons (only a single instance is active), so extra instances are for failover HA rather than scalability. Performance issues are likely due to database overloading or greedy neighbors on the scheduler VM. |
| 94 | + |
| 95 | +### <a id='blobstore_nginx'></a> blobstore\_nginx [singleton-blobstore] |
| 96 | +The internal [WebDAV](http://www.webdav.org/) blobstore that comes included with CF by default. It is used by the platform to store `packages` (app bits), staged `droplets`, `buildpacks`, and cached app resources. Files are typically uploaded to the internal blobstore via the Cloud Controller local workers and downloaded by Diego when app instances are started. |
| 97 | + |
| 98 | +#### When to scale |
| 99 | +##### Key Metrics |
| 100 | +* `system.cpu.user` is consistently high on the singleton-blobstore VM |
| 101 | +* `system.disk.persistent.percent` is high, indicating that the blobstore is running out of room for additional files |
| 102 | + |
| 103 | +##### Heuristics |
| 104 | +* `cf push` is intermittently failing |
| 105 | +* `cf push` average time is elevated |
| 106 | +* App droplet downloads are timing out/failing on Diego |
| 107 | + |
| 108 | +##### Logs |
| 109 | +* `/var/vcap/sys/log/blobstore/internal_access.log` |
| 110 | + |
| 111 | +#### How to scale |
| 112 | +The internal WebDAV blobstore **cannot be scaled horizontally**, not even for availability purposes, because of its reliance on the `singleton-blobstore` VM's persistent disk for file storage. For this reason, it is [not recommended](https://docs.cloudfoundry.org/concepts/high-availability.html#blobstore) for environments that require high availability. For these environments, we recommend using an [external blobstore](https://docs.cloudfoundry.org/deploying/common/cc-blobstore-config.html). |
| 113 | + |
| 114 | +It can be scaled vertically, however, so scaling up the number of CPUs or adding faster disk storage can improve the performance of the internal WebDAV blobstore if it is under high load. |
| 115 | + |
| 116 | +High numbers of concurrent app container starts on Diego can cause stress on the blobstore. This typically can happen during upgrades in environments with a large number of apps and Diego cells. If vertically scaling the blobstore or improving its disk performance is not an option, [limiting the max number of concurrent app container starts](https://bosh.io/jobs/auctioneer?source=github.com/cloudfoundry/diego-release&version=2.29.0#p%3ddiego.auctioneer.starting_container_count_maximum) can be used as a mitigation. |
| 117 | + |
| 118 | + |
| 119 | +--- |
| 120 | +### <a id='footnotes'></a> Footnotes |
| 121 | + |
| 122 | +##### [1] BOSH CPU Metrics |
| 123 | + |
| 124 | +BOSH system CPU metrics can be confusing. For example, running `bosh instances --vitals` will return CPU values that may look something like this: |
| 125 | +``` |
| 126 | +CPU CPU CPU CPU |
| 127 | +Total User Sys Wait |
| 128 | +- 2.9% 3.2% 1.3% |
| 129 | +``` |
| 130 | + |
| 131 | +The CPU User value corresponds to the `system.cpu.user` metric mentioned earlier, and it is **scaled by the number of CPUs**. On a 4-core `api` VM, for example, a `cloud_controller_ng` process that is using 100% of a core will appear as using `25%` in the `system.cpu.user` metric. |
| 132 | + |
| 133 | +##### [2] Assessing Database Health |
| 134 | +Since Cloud Controller supports both PostgreSQL and MySQL (and the concept of "bring your own database"), it is difficult for us to provide absolute guidance on what a healthy database might look like. In general, we've found that high database CPU utilization is a good indicator of scaling issues, but always defer to the documentation specific to your database. |
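
Footnote [1] says that `system.cpu.user` is scaled by the number of CPUs, while the `cloud_controller_ng` key metric is phrased as "above 0.85 utilization of a single core". The sketch below shows one way to reconcile the two when reading the metric by hand. The function names and the default threshold argument are hypothetical helpers for illustration and are not part of capi-release or BOSH. It also assumes that `cloud_controller_ng` is the dominant CPU consumer on the VM, so the result is an upper bound on what that one process is using.

```python
def per_core_utilization(system_cpu_user_percent: float, cores: int) -> float:
    """Convert a VM-wide system.cpu.user percentage into single-core terms.

    Because the metric is scaled by the number of CPUs, 25% on a 4-core VM
    corresponds to one fully used core (a result of 1.0).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return system_cpu_user_percent / 100.0 * cores


def exceeds_single_core_threshold(
    system_cpu_user_percent: float, cores: int, threshold: float = 0.85
) -> bool:
    """Return True when usage is above `threshold` of a single core.

    The 0.85 default mirrors the key metric listed for cloud_controller_ng;
    treat it as a starting point rather than a hard limit.
    """
    return per_core_utilization(system_cpu_user_percent, cores) > threshold


if __name__ == "__main__":
    # Illustrative readings for a 4-core api VM (made-up sample values).
    for reading in (20.0, 22.0, 25.0):
        print(reading, exceeds_single_core_threshold(reading, cores=4))
```

On a 4-core VM, a reading of 22% works out to 0.88 of a single core and would cross the threshold, even though the raw percentage looks low at first glance.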