A customer recently hit what looks like IP address exhaustion while trying to provision instances in a VPC Subnet. They received the error "No available IP addresses for interface" when creating only 4 instances in a /26 subnet. That subnet should have space for 2**6 - 6 == 58 addresses, so we're definitely not allocating addresses correctly.
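As a quick sanity check on that arithmetic (assuming 6 reserved addresses per subnet, which is what the issue's `2**6 - 6` implies):

```rust
fn main() {
    // A /26 leaves 32 - 26 = 6 host bits, i.e. 2^6 = 64 addresses.
    let total = 2u32.pow(32 - 26);
    // Subtract the 6 reserved addresses implied by the issue's arithmetic.
    let usable = total - 6;
    assert_eq!(usable, 58);
    println!("usable addresses in a /26: {usable}");
}
```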
The address allocation is handled by a "next item" query, specifically this one:
omicron/nexus/db-queries/src/db/queries/network_interface.rs
Lines 521 to 529 in 068bf5d
```rust
pub struct NextIpv4Address {
    inner: NextItemSelfJoined<
        nexus_db_schema::schema::network_interface::table,
        IpNetwork,
        nexus_db_schema::schema::network_interface::dsl::ip,
        Uuid,
        nexus_db_schema::schema::network_interface::dsl::subnet_id,
    >,
}
```
This is a thin wrapper around this query:
omicron/nexus/db-queries/src/db/queries/next_item.rs
Lines 600 to 674 in 068bf5d
````rust
/// Select the next available item from a table, using a self-join.
///
/// This query is similar in spirit to the above `NextItem` query. However, its
/// implementation is different, and specifically designed to limit memory
/// consumption of the overall query.
///
/// CockroachDB eagerly evaluates subqueries. This leads to high memory usage
/// for the basic `NextItem` query, since that relies on a subquery that uses
/// `generate_series()` to create the list of all possible items. That entire
/// list must be materialized and held in memory.
///
/// In contrast, this query is implemented using a self-join between the
/// existing table to be searched and the "next entry" (e.g., the value of the
/// target column plus 1). This can result in lower memory consumption, since
/// that series need not be stored in memory. Note that this relies on an index
/// on the item column, which may be partial.
///
/// The full query looks like this:
///
/// ```sql
/// SELECT IF(
///     -- This condition detects if the table is empty.
///     EXISTS(
///         SELECT
///             1
///         FROM
///             <table>
///         WHERE
///             <scope_column> = <scope_key> AND
///             time_deleted IS NULL
///         LIMIT 1
///     ),
///     -- If the table is _not_ empty, we do the self-join between `item` and
///     -- `item + 1`, and take the first value where `item` is NULL, i.e., the
///     -- smallest value of `item + 1` where there is no corresponding `item`
///     -- in the table.
///     (
///         SELECT next_item AS <item_column>
///         FROM (
///             SELECT
///                 <item_column> + 1 AS next_item
///             FROM
///                 <table>
///             WHERE
///                 <scope_column> = <scope_key> AND
///                 time_deleted IS NULL
///         )
///         LEFT OUTER JOIN
///             <table>
///         ON
///             (<scope_column>, next_item <= <max_item>, time_deleted IS NULL) =
///             (<scope_key>, TRUE, TRUE)
///         WHERE
///             <item_column> IS NULL AND next_item <= <max_item>
///         ORDER BY next_item
///         LIMIT 1
///     ),
///     -- If the table _is_ empty, just insert the minimum value.
///     <min_item>
/// )
/// ```
///
/// # Which one to use?
///
/// One should probably prefer to use this form of the query, if possible. Both
/// queries appear to read roughly the same amount of data from disk, but the
/// series-based implementation uses far more memory and runs more slowly.
///
/// One downside of this query is that it always allocates the next _smallest_
/// item that's available. The `NextItem` query lets the caller choose a base
/// from which to start the search, which is useful in situations where one
/// would prefer to randomly distribute items rather than sequentially allocate
/// them.
#[derive(Debug, Clone, Copy)]
pub(super) struct NextItemSelfJoined<
````
That query uses a self-join between the table's "item column" (the thing we're trying to allocate) and that item column plus one. This self-join version was written to reduce the memory consumption of the vanilla version of the query, which uses the `generate_series()` SQL function to join the table against the sequence of all possible items.
But while the self-join version uses less memory, it behaves differently from the original query. Suppose you allocate two items and then deallocate the first. On the next allocation, the original query will return that lowest / first item again, while the self-join version will return the third item in the sequence. So if you allocate right up to the limit of your range and then delete almost all of the rows, the next allocation will still fail.
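To make the difference concrete, here's a minimal model of the two allocation strategies (hypothetical code, not from the repo; names and the 0..=7 range are illustrative). The self-join's only candidates are `item + 1` for each existing item, so a freed minimum with later items still allocated is never considered:

```rust
use std::collections::BTreeSet;

const MIN_ITEM: u32 = 0;
const MAX_ITEM: u32 = 7;

// Series-based allocation: consider every value in MIN_ITEM..=MAX_ITEM and
// return the smallest one not already in the table.
fn next_item_series(allocated: &BTreeSet<u32>) -> Option<u32> {
    (MIN_ITEM..=MAX_ITEM).find(|item| !allocated.contains(item))
}

// Self-join allocation: the candidates are `item + 1` for each existing item
// (or MIN_ITEM if the table is empty). A free MIN_ITEM with later items still
// allocated is never a candidate.
fn next_item_self_joined(allocated: &BTreeSet<u32>) -> Option<u32> {
    if allocated.is_empty() {
        return Some(MIN_ITEM);
    }
    allocated
        .iter()
        .map(|item| item + 1)
        .filter(|next| *next <= MAX_ITEM && !allocated.contains(next))
        .min()
}

fn main() {
    // Allocate items 0 and 1, then free item 0: only item 1 remains.
    let allocated: BTreeSet<u32> = [1].into_iter().collect();

    // The series-based query reuses the freed minimum...
    assert_eq!(next_item_series(&allocated), Some(0));
    // ...but the self-join query skips it and hands out 2 instead.
    assert_eq!(next_item_self_joined(&allocated), Some(2));
}
```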
There's a unit test that's supposed to catch this, here:
```rust
async fn test_next_item_self_joined_with_gaps() {
```
But that isn't quite exhaustive enough. It checks that we correctly fill gaps in the middle of the sequence, but not that we fill gaps at the beginning of the sequence. I've rewritten that test to skip the first value in the range too, and it immediately fails.
We need to make sure this query can also allocate free items at the beginning of the range, similar to the original query but without the memory consumption issues.
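One possible shape for a fix, sketched as standalone logic rather than a concrete SQL patch (the function and constants here are illustrative, not from the repo): keep the self-join's `item + 1` candidates, but also treat the range minimum as a candidate whenever no live row holds it.

```rust
use std::collections::BTreeSet;

const MIN_ITEM: u32 = 0;
const MAX_ITEM: u32 = 7;

// Candidate set: MIN_ITEM plus `item + 1` for each existing item. Making
// MIN_ITEM an explicit candidate closes the gap-at-the-start hole without
// materializing the full series.
fn next_item_fixed(allocated: &BTreeSet<u32>) -> Option<u32> {
    std::iter::once(MIN_ITEM)
        .chain(allocated.iter().map(|item| item + 1))
        .filter(|cand| *cand <= MAX_ITEM && !allocated.contains(cand))
        .min()
}

fn main() {
    // Items 0 and 1 were allocated, then 0 was freed.
    let allocated: BTreeSet<u32> = [1].into_iter().collect();
    // The freed minimum is now reused instead of skipped.
    assert_eq!(next_item_fixed(&allocated), Some(0));
}
```

In SQL terms this would presumably mean one extra candidate row (a constant `<min_item>`) unioned into the self-join's derived table, at negligible memory cost; the sketch only illustrates the candidate set.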