ghjklw commented Dec 18, 2024

Resolves #351

Description

Try to parse the result of the Databricks query to extract the following variables and add them to the DatabricksAdapterResponse (a parsing sketch follows the list):

  • num_affected_rows
  • num_inserted_rows
  • num_deleted_rows
  • num_updated_rows
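For context, the parsing is roughly along these lines. This is just a sketch: the helper name is illustrative, and it assumes the cursor follows PEP 249, so cursor.description carries the column names:

    from typing import Any, Optional

    # Metric columns returned by Delta DML statements (MERGE/INSERT/UPDATE/DELETE).
    STAT_COLUMNS = frozenset(
        {"num_affected_rows", "num_inserted_rows", "num_deleted_rows", "num_updated_rows"}
    )

    def parse_query_statistics(description: Any, row: Any) -> Optional[dict]:
        """Extract the DML metrics from a result row, or return None if the
        result does not look like a metrics row."""
        if description is None or row is None:
            return None
        names = [col[0] for col in description]  # PEP 249: first element is the column name
        if not STAT_COLUMNS.intersection(names):
            return None
        return {name: int(value) for name, value in zip(names, row) if name in STAT_COLUMNS}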

I have only tested it quickly, and have been able to retrieve results like this in run_results.json:

      "adapter_response": {
        "_message": "OK",
        "rows_affected": 358,
        "query_id": "2128ebeb-d6e8-4033-8849-dfa2a5279d65",
        "rows_updated": 358,
        "rows_deleted": 0,
        "rows_inserted": 0
      },

I have not yet written any tests for it, as I would first like to get your feedback on the approach. It also certainly requires some more testing to make sure that it returns these metrics consistently and does not have side effects on other methods like fetchone, fetchmany and fetchall (the last two in particular are likely to have been impacted).

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

ghjklw (Author) commented Dec 18, 2024

By the way, a much cleaner (and potentially more efficient) approach than caching the fetchone result would be to make sure that this is executed only for some request types (MERGE, INSERT, UPDATE and DELETE), or even better based on the materialization type. This way, we would know that there would be no reason to use a fetch method on the cursor, and could get rid of the caching issue.

Likewise, I don't believe this would work if the table is not Delta. It would not fail either, but it would be nice to avoid trying.

I found no way in DatabricksConnectionManager.execute to get any information about the query type, materialization or table type, and parsing the SQL query would be even more hacky. Maybe there could be a way to feed this information to it? It could, for example, take an additional optional boolean parameter (defaulting to False) to define whether the query impact metadata should be returned, just like I did for get_response.
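To make that concrete, here is a minimal sketch of the signature change I have in mind. The new parameter name is purely illustrative, and the rest of the signature is my assumption of the usual dbt execute shape:

    from typing import Optional

    def execute(
        self,
        sql: str,
        auto_begin: bool = False,
        fetch: bool = False,
        limit: Optional[int] = None,
        collect_query_statistics: bool = False,  # hypothetical opt-in flag
    ) -> tuple["DatabricksAdapterResponse", "agate.Table"]:
        # Only attempt to read the metrics row when the caller opted in,
        # so plain SELECT queries are never touched.
        ...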

        return self._cursor.fetchall()

    def fetchone(self) -> Optional[tuple]:

    def query_impact(self) -> DatabricksQueryImpact:
nicor88 commented Dec 18, 2024

A little nit on the name: maybe you can call it get_query_statistics. I'm not a native English speaker, but in other query engines the term "statistics" is used for this type of information.
If you like this name more, you should consider replacing "impact" with "statistics" everywhere for consistency.

benc-db (Collaborator) commented Dec 18, 2024

Another great issue to tackle, thanks.

benc-db (Collaborator) left a comment

I like the feature. Can we add some unit tests?



    @dataclass
    class DatabricksQueryImpact:
benc-db (Collaborator) commented

As mentioned by the other commenter, "statistics" is used elsewhere for a similar concept, so it could be used here.

    def fetchone(self) -> Optional[Row]:
        if self._cache_fetchone:
            # If `fetchone` result was cached by `query_metadata`, return it and invalidate it
            row = self._cache_fetchone
benc-db (Collaborator) commented

Why do we invalidate the cache, but at the call site reassign the cache to the same value?

benc-db (Collaborator) commented

feels like you would just want:

def fetchone(self) -> Optional[Row]:
    if not self._cache_fetchone:
        self._cache_fetchone = self._cursor.fetchone()
    return self._cache_fetchone

benc-db (Collaborator) commented Dec 20, 2024

@ghjklw I'll be on vacation the next two weeks. Just wanted to give you a heads-up that I still want your two PRs, even though you won't see any activity from me until the new year.

ghjklw (Author) commented Dec 20, 2024

> @ghjklw I'll be on vacation the next two weeks. Just wanted to give you a heads-up that I still want your two PRs, even though you won't see any activity from me until the new year.

Thanks for the heads up and no worries! That will give me some time to polish these 😊

If you have a couple of minutes, just to circle back to my message above: do you know if there's any way within DatabricksConnectionManager.execute to get information about either the type of operation/materialisation or the type of query, so as to avoid trying to collect the statistics when it's unnecessary? Or is that a dead end?

benc-db (Collaborator) commented Dec 20, 2024

> If you have a couple of minutes, just to circle back to my message above: do you know if there's any way within DatabricksConnectionManager.execute to get information about either the type of operation/materialisation or the type of query, so as to avoid trying to collect the statistics when it's unnecessary? Or is that a dead end?

I don't think there is any metadata that conveys this client-side, short of inspecting the SQL. Do you have any idea of the overhead of performing this for every operation? If it's small enough, we can ignore it; if it's large, we probably want to provide a config hook to opt into the behavior, since I don't think most users care about the data, though for some scenarios it is very nice to have. It might be possible to get the information from the cursor, but the type information in the databricks python sql connector (https://github.com/databricks/databricks-sql-python) is a little sparse.
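(For what it's worth, the naive version of "inspecting the SQL" could be a simple prefix check. Just a sketch, assuming leading whitespace and -- line comments are the only noise to strip:)

    import re

    DML_PREFIXES = ("MERGE", "INSERT", "UPDATE", "DELETE")

    def is_dml(sql: str) -> bool:
        # Strip leading whitespace and `--` line comments before checking the verb.
        stripped = re.sub(r"^\s*(--[^\n]*\n\s*)*", "", sql)
        return stripped.upper().startswith(DML_PREFIXES)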

ghjklw (Author) commented Jan 31, 2025

Just a short update to let you know that I'm not forgetting this (and the other PR)! Things have just been quite busy lately, but I absolutely intend to pick it up. In the meantime, if anyone feels like this is moving too slowly, feel free to take over ;)

benc-db (Collaborator) commented Jan 31, 2025

> Just a short update to let you know that I'm not forgetting this (and the other PR)! Things have just been quite busy lately, but I absolutely intend to pick it up. In the meantime, if anyone feels like this is moving too slowly, feel free to take over ;)

Thanks for the update. Things have been quite busy on my side as well, so I am content to wait until you free up :). Appreciate the effort and the communication.

ghjklw (Author) commented Mar 8, 2025

Hi!

I have had this PR in the back of my head for quite some time, and the thing is that I really don't like my "caching" approach; I feel like there are many ways it can break fetchone/fetchmany/fetchall. This PR also needs some pretty serious rebasing, as the implementation of the connection/cursor has changed significantly since then.

I'm therefore thinking of creating an entirely new PR with a new approach: treating the cursor as an iterator and using itertools.tee to create two copies of it:

  • One meant to be used to get query statistics, if relevant
  • The other one to be used by fetchone/many/all, which will then need to be reimplemented

The point is to make sure that fetching the statistics doesn't impact the cursor used by the fetch* methods. I have done some research, and there is unfortunately no way to reset the cursor.
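Something like this minimal sketch: the class and method names are illustrative (not the current dbt-databricks API), and I'm assuming the underlying cursor is iterable over rows:

    import itertools
    from typing import Optional

    class TeeCursor:
        """Split the cursor's row stream so that reading the metrics row
        never consumes rows needed by fetchone/fetchmany/fetchall."""

        def __init__(self, cursor):
            self._cursor = cursor
            # Two independent iterators over the same underlying stream;
            # tee() buffers rows read by one side until the other catches up.
            self._stats_iter, self._fetch_iter = itertools.tee(iter(cursor))

        def first_row_for_statistics(self) -> Optional[tuple]:
            # Peek at the first row without disturbing the fetch-side iterator.
            return next(self._stats_iter, None)

        def fetchone(self) -> Optional[tuple]:
            return next(self._fetch_iter, None)

        def fetchmany(self, size: int) -> list:
            return list(itertools.islice(self._fetch_iter, size))

        def fetchall(self) -> list:
            return list(self._fetch_iter)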

I'd love to hear your thoughts before doing something that you wouldn't be comfortable with 😉

benc-db (Collaborator) commented Mar 11, 2025

@ghjklw At this point I would recommend storing the stats on the CursorWrapper object. Since we don't execute async, I'm reasonably certain that at the point we create the cursor wrapper, we have completed execution but not fetched any results. Could we just store the stats in this wrapper as part of CursorWrapper construction? Do we even need to use tee?
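One way to read that suggestion, as a sketch: parse_metrics_row is a hypothetical helper, and collection would have to be opt-in so that SELECT cursors are never consumed:

    class CursorWrapper:
        def __init__(self, cursor, collect_statistics: bool = False):
            self._cursor = cursor
            self.statistics = None
            if collect_statistics:
                # At construction time execution has completed but nothing has
                # been fetched yet, so the single Delta DML metrics row (if any)
                # is still the first row of the result set.
                row = cursor.fetchone()
                if row is not None:
                    self.statistics = parse_metrics_row(row)  # hypothetical helper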

alaturqua commented

Any updates on this?
This feature would be really useful when working with the dbt-artifacts package or using a post-hook to persist run results, especially to track the number of affected rows per model run. It would help a lot with monitoring and debugging transformations.


Development

Successfully merging this pull request may close these issues.

rows_affected not returned by adapter
