Skip to content

Faster dataframe construction #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 22, 2018
Merged

Conversation

max-sixty
Copy link
Contributor

This is an attempt to push the DataFrame construction down to Pandas, which is extremely fast, rather than looping over the entire dataframe in python ourselves

One downside is that the DataFrame doesn't accept multiple dtypes, so the resulting dataframe will be defined only by the data.

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

@codecov-io
Copy link

codecov-io commented Feb 20, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@d6b7507). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #128   +/-   ##
=========================================
  Coverage          ?   31.03%           
=========================================
  Files             ?        8           
  Lines             ?     1621           
  Branches          ?        0           
=========================================
  Hits              ?      503           
  Misses            ?     1118           
  Partials          ?        0
Impacted Files Coverage Δ
pandas_gbq/gbq.py 20.57% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6b7507...729b9f3. Read the comment docs.

@max-sixty
Copy link
Contributor Author

max-sixty commented Feb 20, 2018

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

I added this.

Though needs a review, as it's possible I haven't thought of a corner case

EDIT: Travis is fake news, this currently fails https://travis-ci.org/maxim-lian/pandas-gbq/builds/343686588. Will look into soon

Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing anything obviously wrong, but I trust Travis. Thanks for linking to that build.


column_dtypes = {
str(field['name']):
dtype_map.get(field['type'].upper(), object) for field in fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if our dtype_map needs some updating, as google-cloud-bigquery already does some type conversions. Travis seems to think something is up with this.

pandas_gbq/gbq.py:853: in read_gbq
    final_df[field['name']].astype(type_map[field['type'].upper()])
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype
    raise_on_error=raise_on_error, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype
    return self.apply('astype', dtype=dtype, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply
    applied = getattr(b, f)(**kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype
    values=values, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype
    values = _astype_nansafe(values.ravel(), dtype, copy=True)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409)
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'

Sidenote (note relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is REPEATED.

@max-sixty
Copy link
Contributor Author

This now works - was actually very close to working prior.

This is probably 40% of the way to the 'load in each page into an array and then concat at the end'. 20% of the work though; so a decent place to pause & merge

Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys())
for column in df:
df[column] = df[column].astype(column_dtypes[column])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need the astype? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this a 0.6.1 release if you'd prefer to get this out as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point re int with nulls. I'll add a test and see what happens.

I think we do need this for dates; otherwise they'll be str

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165

What's happening: pandas is ignoring setting a column with NaN to int, and keeping it as object

@max-sixty max-sixty merged commit 428845d into googleapis:master Aug 22, 2018
@max-sixty max-sixty deleted the no-python-loop branch August 22, 2018 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants