Allocate Crucible Datasets On Separate Sleds #3702

@faithanalog


As of #3650, we allocate datasets randomly across the rack. We return an error if two datasets on the same zpool are selected (not currently possible, but it may become possible in the future if another Crucible dataset is added for some purpose). We do not take the sled into account at all, and we should: doing so will provide resiliency against sled-level failures.

After speaking with Sean on the matter (see some of the conversation in #3650), we both think it should be possible to do this in SQL with a few query steps in the CTE. It looks approximately like the following code, though I'm probably getting something slightly wrong.

First, select one random zpool from each sled:

let our_random_ordering = /* omitted here, see the existing code */;

zpool_dsl
  .inner_join(
    /* the query we have for making sure the zpools have enough space */
  )
  .order_by((zpool_dsl::sled_id, our_random_ordering))
  .distinct_on(zpool_dsl::sled_id)
  .select((zpool_dsl::id,));

Then, select one random dataset from each zpool. Note that this step should resolve the problem we have currently with potentially picking 2 datasets from a single pool during the selection, eliminating the need for the special error for that case. We could break that out into its own PR if we wanted.

dataset_dsl
  .inner_join(
    sled_selection.query_source()
      .on(sled_selection_dsl::zpool_id.eq(dataset_dsl::zpool_id))
  )
  .order_by((dataset_dsl::zpool_id, our_random_ordering))
  .distinct_on(dataset_dsl::zpool_id)
  .select((dataset_dsl::id,))

This gives us our list of possible datasets. Each dataset is from a distinct sled and zpool, and is on a zpool with enough space. Now we select 3 random candidates:

dataset_selection /* or whatever name */
  .query_source()
  .order(our_random_ordering)
  .limit(REGION_REDUNDANCY_THRESHOLD)
  .select((dataset_selection_dsl::id,))

and then we should finally have the datasets we want.
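
To sanity-check the logic of the three steps above without a database, here is a minimal in-memory sketch in plain Rust. It is not the Diesel query and not the existing allocation code: the `Dataset` struct, `ordering_key`, and `select_datasets` are hypothetical names invented for illustration, the hash-based key only stands in for `our_random_ordering`, and it assumes zpool and dataset IDs are globally unique and that the input has already been filtered to zpools with enough space.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory row: one Crucible dataset and where it lives.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Dataset {
    id: u32,
    zpool_id: u32,
    sled_id: u32,
}

/// Stand-in for the seeded random ordering: hash (seed, id) into a sort key.
fn ordering_key(seed: u64, id: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    (seed, id).hash(&mut hasher);
    hasher.finish()
}

/// Pick up to `count` datasets such that no two share a sled (or a zpool).
/// `candidates` is assumed to contain only datasets on zpools with enough space.
fn select_datasets(candidates: &[Dataset], seed: u64, count: usize) -> Vec<Dataset> {
    // Step 1: one zpool per sled (DISTINCT ON sled_id, random order within a sled).
    let mut zpool_by_sled: BTreeMap<u32, u32> = BTreeMap::new();
    for d in candidates {
        let best = zpool_by_sled.entry(d.sled_id).or_insert(d.zpool_id);
        if ordering_key(seed, d.zpool_id) < ordering_key(seed, *best) {
            *best = d.zpool_id;
        }
    }

    // Step 2: one dataset per chosen zpool (DISTINCT ON zpool_id).
    let mut dataset_by_zpool: BTreeMap<u32, Dataset> = BTreeMap::new();
    for d in candidates {
        // Skip datasets whose zpool was not the one picked for their sled.
        if zpool_by_sled.get(&d.sled_id) != Some(&d.zpool_id) {
            continue;
        }
        let best = dataset_by_zpool.entry(d.zpool_id).or_insert(*d);
        if ordering_key(seed, d.id) < ordering_key(seed, best.id) {
            *best = *d;
        }
    }

    // Step 3: order the survivors randomly, then LIMIT to `count`.
    let mut chosen: Vec<Dataset> = dataset_by_zpool.into_values().collect();
    chosen.sort_by_key(|d| ordering_key(seed, d.id));
    chosen.truncate(count);
    chosen
}
```

Note that if fewer than `count` sleds have an eligible zpool, this returns fewer than `count` datasets, so the caller would still need to treat a short result as an allocation failure.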

The change is not quite that simple, though. We may have tests that currently rely on being able to select 3 pools on a single sled. Additionally, being able to select 3 pools on one sled is a very nice feature to have for development. So we will also need to:

  • Modify tests accordingly; which tests remains to be seen.
  • Add a way for devs to configure nexus to use a 1-sled allocation strategy instead of a 3-sled allocation strategy.

#3650 introduced the concept of an allocation strategy at the lowest levels of the region allocation code, but did not plumb it through the rest of the codebase, so there's currently no way to configure it. This is a good time to add a way to do that. I do not know where that configuration should live (the RSS config file?), or how to get it from wherever we set it to the place where region allocation happens.
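
As a starting point for discussion, a configurable strategy could be modelled roughly like this. This is only a sketch: `AllocationStrategy`, its variants, `required_sleds`, and the `"single_sled"` / `"distinct_sleds"` config strings are all hypothetical and are not the names introduced in #3650. It also does not answer where the setting is stored; it only shows a value that could be parsed from a config file and handed down to the allocation code.

```rust
use std::str::FromStr;

/// Hypothetical strategy knob; names are illustrative, not those from #3650.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AllocationStrategy {
    /// Development and tests: all regions may land on a single sled.
    SingleSled,
    /// Every region must land on a different sled.
    DistinctSleds,
}

impl AllocationStrategy {
    /// Minimum number of sleds with an eligible zpool needed to place
    /// `redundancy` regions under this strategy.
    fn required_sleds(self, redundancy: usize) -> usize {
        match self {
            AllocationStrategy::SingleSled => 1,
            AllocationStrategy::DistinctSleds => redundancy,
        }
    }
}

impl FromStr for AllocationStrategy {
    type Err = String;

    /// Parse the (assumed) string form used in a config file.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "single_sled" => Ok(Self::SingleSled),
            "distinct_sleds" => Ok(Self::DistinctSleds),
            other => Err(format!("unknown allocation strategy: {other}")),
        }
    }
}
```

With something like this, a dev deployment could set the strategy to `single_sled`, and the allocation code could check `required_sleds(REGION_REDUNDANCY_THRESHOLD)` against the number of available sleds before running the query.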
