CLN: Use to_dataframe to download query results. #247
Conversation
Looks like some of our system tests run …

There are a couple of errors. The one that most concerns me is the following:
Force-pushed from bd39414 to 77d2528
I've updated the tests to match the new dtypes. The main difference is that we now use …
I don't know what's going on with CircleCI right now. It can't find …
tests/system/test_gbq.py (Outdated)

    empty_columns = {
        "title": pandas.Series([], dtype=object),
        "id": pandas.Series([], dtype=object),
        "is_bot": pandas.Series([], dtype=object),
But isn't this a worse result than before, in that an empty column reverts to object rather than its original type?
Yes, it is worse. We'd have to add a second type-conversion step for any non-null object columns. I believe this case only happens on empty dataframes, so I guess it wouldn't be the worst to check for that.
I am definitely keen on doing this well once, rather than us & google-cloud-bigquery repeating logic. But here, it seems like the pandas-gbq logic is better: read the schema, and then use that schema in the dataframe. No?
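A minimal sketch of that suggestion, under stated assumptions: `cast_by_schema` and `_TYPE_MAP` are hypothetical names (not pandas-gbq's actual API), and the schema is taken to be the list-of-dicts form the BigQuery REST API returns.

```python
import pandas

# Illustrative mapping from BigQuery types to pandas dtypes; the real
# mapping would cover more types.
_TYPE_MAP = {"FLOAT": "float64", "INTEGER": "int64", "BOOLEAN": "bool"}

def cast_by_schema(df, schema_fields):
    # schema_fields: e.g. [{"name": "id", "type": "INTEGER"}, ...]
    # Only safe on empty frames: a non-empty object column may hold NULLs
    # that int64/bool cannot represent.
    dtypes = {
        field["name"]: _TYPE_MAP[field["type"]]
        for field in schema_fields
        if field["type"] in _TYPE_MAP
    }
    return df.astype(dtypes)
```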
Done in 0a6e8d9
Great!
Should we do this on every dataframe; no need to have a special case for empty ones?
Should we do this on every dataframe; no need to have a special case for empty ones?
We could if we tell astype() to ignore errors, but I think it's unnecessary. The dataframe constructor seems to pick the right type when bools and ints are passed in: object if it contains nulls, bool/int64 otherwise.
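For illustration, here is the constructor inference being described (with one nuance: integers with a missing value coerce to float64 with NaN rather than object, while booleans with a missing value do fall back to object):

```python
import pandas

# dtype inference in the Series constructor:
print(pandas.Series([1, 2, 3]).dtype)            # int64
print(pandas.Series([True, False]).dtype)        # bool
print(pandas.Series([1, None, 3]).dtype)         # float64 (NaN forces float)
print(pandas.Series([True, None, False]).dtype)  # object (no NaN-able bool)
```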
I'm very tempted to use the new nullable integer type by default at #242, but probably want to wait for pandas to make that change. For the cases where folks want a very particular dtype, I propose in #242 that we add a dtypes argument, much as we did in google-cloud-bigquery.
OK great, thanks @tswast
This allows us to remove logic for parsing the schema and align with google-cloud-bigquery.

Force-pushed from 770743e to 159bda0
I think the conda failure uncovered a real known issue in pandas-dev/pandas#12513. The workaround is to use the previous behavior of timezone-naive datetimes for BQ TIMESTAMP columns, even though that's not exactly correct.
@max-sixty Tests are finally green. Mind doing another review before I squash and merge?
Interesting. I haven't hit that before, but I don't use conda, so I guess that explains it. I think the workaround is fine in the meantime; we'll hear about it if anyone strongly needs the tz.
Thanks a lot!
    # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
    # #missing-data-casting-rules-and-indexing
    dtype_map = {
        "FLOAT": np.dtype(float),
        # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
        # support datetime64[ns, UTC] as dtype in DataFrame constructors. See:
        # https://github.com/pandas-dev/pandas/issues/12513
This is not generally an issue, as this is a single dtype for all columns; you are much better off constructing a properly dtyped Series in order to use the tz-aware dtypes.

Furthermore, you need to be really clear about whether this is a localized timestamp or not; see the SQL conversion code in pandas, which handles all of this correctly.
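A small sketch of that approach, with made-up sample values: build each column as a tz-aware Series first, rather than passing a tz-aware dtype to the DataFrame constructor.

```python
import pandas

# Build the column as a tz-aware Series up front; to_datetime(..., utc=True)
# localizes naive values (and NaT for missing ones) to UTC.
raw = pandas.Series(["2018-10-01 12:00:00", None, "2018-10-02 13:30:00"])
col = pandas.to_datetime(raw, utc=True)
print(col.dtype)  # datetime64[ns, UTC]
df = pandas.DataFrame({"created": col})
```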
Yeah, I wasn't having an issue except on the conda build with pandas pre-wheels. From pandas-dev/pandas#12513 (comment) you can see in the stacktrace that this is actually constructing Series and then combining them as a DataFrame. It's the Series constructor that's failing on dtype='datetime64[ns, UTC]'.
/opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1325: in _to_dataframe_dtypes
columns[column] = pandas.Series(columns[column], dtype=dtypes[column])
/opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:248: in __init__
raise_cast_failure=True)
/opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:2967: in _sanitize_array
subarr = _try_cast(data, False)
I had failures on the DataFrame constructor too in the unit tests. If it had worked to construct a series, you're right that I'd prefer to have these be tz-aware.
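As an illustration (not code from this PR), one way to probe for that failure mode is to try the constructor directly; affected versions raise, while newer ones accept the tz-aware dtype:

```python
import pandas

# Affected pandas versions raise from the Series constructor when handed a
# tz-aware dtype (pandas-dev/pandas#12513); newer versions accept it.
try:
    pandas.Series(["2018-10-01 12:00:00"], dtype="datetime64[ns, UTC]")
    print("tz-aware dtype supported in Series constructor")
except (TypeError, ValueError):
    print("tz-aware dtype unsupported; fall back to tz-naive datetime64[ns]")
```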
You must have an old version; this was worked out a while back.
OK. I'll double-check the Conda CI config.
I kind of prefer to leave this as-is (not tz-aware), as that's the existing behavior: https://github.com/pydata/pandas-gbq/blob/f729a44a48744acc2898350fbfbded791d900967/pandas_gbq/gbq.py#L647. I should make a separate issue for changing that for TIMESTAMP columns.
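A sketch of that existing tz-naive behavior, with illustrative values: parse as UTC, then strip the timezone so the column stays datetime64[ns].

```python
import pandas

# Parse as UTC, then drop the timezone: the column keeps the existing
# tz-naive datetime64[ns] dtype, with values expressed in UTC wall time.
aware = pandas.to_datetime(
    pandas.Series(["2018-10-01 12:00:00+00:00"]), utc=True
)
naive = aware.dt.tz_localize(None)
print(naive.dtype)  # datetime64[ns]
```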
This allows us to remove logic for parsing the schema and align with google-cloud-bigquery.
Towards #149
Note, I ran the profiler before and after. For some reason I got about an extra 20 seconds with this versus before (331 before, 352 after). I can't figure out why that would be since in both cases we're just looping over all the rows and constructing a dataframe. Maybe it was just a fluke of my work WiFi network?
In either case, this change will help us with #242 once I release the next version of google-cloud-bigquery. It'll also help us with #133 whenever the mystery faster API launches.