-
Notifications
You must be signed in to change notification settings - Fork 125
Faster dataframe construction #128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #128 +/- ##
=========================================
Coverage ? 31.03%
=========================================
Files ? 8
Lines ? 1621
Branches ? 0
=========================================
Hits ? 503
Misses ? 1118
Partials ? 0
Continue to review full report at Codecov.
|
I added this. Though needs a review, as it's possible I haven't thought of a corner case EDIT: Travis is fake news, this currently fails https://travis-ci.org/maxim-lian/pandas-gbq/builds/343686588. Will look into soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing anything obviously wrong, but I trust Travis. Thanks for linking to that build.
pandas_gbq/gbq.py
Outdated
|
||
column_dtypes = { | ||
str(field['name']): | ||
dtype_map.get(field['type'].upper(), object) for field in fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if our dtype_map
needs some updating, as google-cloud-bigquery
already does some type conversions. Travis seems to think something is up with this.
pandas_gbq/gbq.py:853: in read_gbq
final_df[field['name']].astype(type_map[field['type'].upper()])
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype
raise_on_error=raise_on_error, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype
return self.apply('astype', dtype=dtype, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply
applied = getattr(b, f)(**kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype
values=values, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype
values = _astype_nansafe(values.ravel(), dtype, copy=True)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409)
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'
Sidenote (note relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is REPEATED
.
ecd6a2b
to
5641fc3
Compare
This now works - was actually very close to working prior. This is probably 40% of the way to the 'load in each page into an array and then concat at the end'. 20% of the work though; so a decent place to pause & merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
||
df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys()) | ||
for column in df: | ||
df[column] = df[column].astype(column_dtypes[column]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need the astype
? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this a 0.6.1 release if you'd prefer to get this out as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point re int
with nulls. I'll add a test and see what happens.
I think we do need this for dates; otherwise they'll be str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165
What's happening: pandas is ignoring setting a column with NaN
to int
, and keeping it as object
This is an attempt to push the DataFrame construction down to Pandas, which is extremely fast, rather than looping over the entire dataframe in python ourselves
One downside is that the DataFrame doesn't accept multiple dtypes, so the resulting dataframe will be defined only by the data.
We could run
.astype
on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time