-
Notifications
You must be signed in to change notification settings - Fork 25
handle migrated kind nodes in diff and merge logic #6401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CodSpeed Performance ReportMerging #6401 will not alter performanceComparing Summary
|
8e81395
to
2e7329d
Compare
migrated kind/inheritance
2e7329d
to
8ece551
Compare
@@ -69,6 +70,56 @@ async def _run_diff_calculation_query( | |||
has_more_data = last_result.get_as_type("has_more_data", bool) | |||
offset += limit | |||
|
|||
async def _apply_kind_migrated_nodes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs a new cypher query to determine which nodes are part of a node kind/inheritance migration and sets is_node_kind_migation = True
on all of those nodes
this will be run any time we generate or update a diff and is_node_kind_migration
will stay True
on a given node across updates once it is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are all updates to use NodeIdentifier
for uniqueness instead of the node's UUID b/c the UUID will not actually be unique in the case of a node kind/inheritance migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes to support using the db_id
instead of labels
for node uniqueness
@@ -53,6 +59,11 @@ async def merge_graph(self, at: Timestamp) -> EnrichedDiffRoot: | |||
) | |||
log.info(f"Diff {latest_diff.uuid} retrieved") | |||
batch_num = 0 | |||
migrated_kinds_id_map = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this map tells the node merge queries which node with a given UUID is the new one and which nodes are part of a node kind migration on the branch
) | ||
await merge_properties_query.execute(db=self.db) | ||
log.info(f"Batch #{batch_num} merged") | ||
batch_num += 1 | ||
migrated_kind_uuids = {n.identifier.uuid for n in enriched_diff.nodes if n.is_node_kind_migration} | ||
if migrated_kind_uuids: | ||
migrated_merge_query = await DiffMergeMigratedKindsQuery.init( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new specialized query to merge nodes with migrated kinds
self.add_to_query(query=query) | ||
|
||
|
||
class DiffMergeMigratedKindsQuery(Query): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new query to merge nodes with a kind migration
when a node's kind or inheritance is migrated, we create a new node with the same UUID and an updated kind
property and/or labels, then add "deleted"
edges for all edges linked to the old node and "active"
edges to the new node
this query just copies the latest edges linked to any node with a given UUID from the source to the target branch, unless they already exist, in which case they are ignored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes to save is_node_kind_migration
property and uniquely identify nodes using UUID/database ID instead of just UUID
diff_node.identifier.kind = database_path.node_kind | ||
diff_node.db_id = database_path.node_db_id | ||
diff_node.from_time = database_path.node_changed_at | ||
diff_node.status = database_path.node_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was necessary when we were combining diffs for a node with an updated kind into 1, but we don't do that any longer
AND ALL( | ||
r in [r_node, r_prop] | ||
WHERE r.from < $to_time AND r.branch = top_diff_rel.branch | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apparently using all
is less efficient than just doing the same conditional multiple times if you can
has_more_data: bool | ||
|
||
|
||
class DiffMigratedKindNodesQuery(DiffCalculationQuery): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new lightweight (well, pretty lightweight) query to identify which nodes are part of a kind migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to run any performance tests against this to see if there are any regressions?
) -> None: | ||
has_more_data = True | ||
offset = 0 | ||
limit = int(config.SETTINGS.database.query_size_limit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config.SETTINGS.database.query_size_limit already is an int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed int()
here and in one other place above for the same config parameter
) | ||
log.info(f"Getting one batch of migrated kind nodes {limit=}, {offset=}") | ||
await diff_query.execute(db=self.db) | ||
log.info("Migrated kind nodes query complete") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be confusing to have this show up multiple times as it's just within this iteration, perhaps also include the offset to indicate that the query is complete for a given offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the {limit=}, {offset=}
to the query complete log message here and above in _run_diff_calculation_query
self.path_to_node.get("branch"), | ||
self.path_to_attribute.get("branch"), | ||
self.path_to_property.get("branch"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using this property would the order of these tuples matter in any way? I.e. would the caller have to know how the first str
differs from the other two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, nothing uses this anymore, so just going to delete it
ORDER BY diff_rel.from ASC | ||
WITH collect(diff_rel.status) AS statuses | ||
RETURN statuses = ["active", "deleted"] AS intra_branch_update | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a comment explaining above CALL block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the review
ORDER BY diff_rel.from ASC | ||
WITH collect(diff_rel.status) AS statuses | ||
RETURN statuses = ["active", "deleted"] AS intra_branch_update | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
self.path_to_node.get("branch"), | ||
self.path_to_attribute.get("branch"), | ||
self.path_to_property.get("branch"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, nothing uses this anymore, so just going to delete it
) | ||
log.info(f"Getting one batch of migrated kind nodes {limit=}, {offset=}") | ||
await diff_query.execute(db=self.db) | ||
log.info("Migrated kind nodes query complete") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the {limit=}, {offset=}
to the query complete log message here and above in _run_diff_calculation_query
) -> None: | ||
has_more_data = True | ||
offset = 0 | ||
limit = int(config.SETTINGS.database.query_size_limit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed int()
here and in one other place above for the same config parameter
@@ -87,13 +87,13 @@ class NodeIdentifier: | |||
|
|||
uuid: str | |||
kind: str | |||
labels: frozenset[str] | |||
db_id: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be fine because we don't allow deleting nodes from the database on the default branch and the only way that a node would be deleted on a branch is if the whole branch was deleted, in which case, there would be no use for the diff
this is a good question though. I needed to think about it for a minute
IFC-1452
TODOs
updates to the diff and merge logic to support nodes that have had their kind or namespace update on a branch
this problematic workflow is
previously, the diff calculation queries completely ignored these duplicated nodes, so they were not part of the diff or the merge logic at all. the only reason no one recognized the problem was that the schema migrations run during the merge operation would create a third node with the same UUID on the default branch. this kind of worked, but the database was in an unexpected state and would only get more unexpected as time continued
the basic solution I've implemented in this PR is to track which nodes are part of a kind migration as we calculate the diff and then merge the migrated-kind nodes using their own special query at the end of the diff merge logic. we cannot keep the migrated-kind nodes completely separate from the regular updates, b/c a node can have both kinds of changes (eg I update a relationship on a node, then I update the node schema's name, then I update an attribute on that node).