Faster dataframe construction #128

max-sixty · 2018-02-20T03:00:08Z

This is an attempt to push the DataFrame construction down to Pandas, which is extremely fast, rather than looping over the entire dataframe in python ourselves

One downside is that the DataFrame doesn't accept multiple dtypes, so the resulting dataframe will be defined only by the data.

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

codecov-io · 2018-02-20T03:11:59Z

Codecov Report

❗ No coverage uploaded for pull request base (master@d6b7507). Click here to learn what that means.
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master     #128   +/-   ##
=========================================
  Coverage          ?   31.03%           
=========================================
  Files             ?        8           
  Lines             ?     1621           
  Branches          ?        0           
=========================================
  Hits              ?      503           
  Misses            ?     1118           
  Partials          ?        0

Impacted Files	Coverage Δ
pandas_gbq/gbq.py	`20.57% <0%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6b7507...729b9f3. Read the comment docs.

max-sixty · 2018-02-20T15:14:49Z

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

I added this.

Though needs a review, as it's possible I haven't thought of a corner case

EDIT: Travis is fake news, this currently fails https://travis-ci.org/maxim-lian/pandas-gbq/builds/343686588. Will look into soon

tswast

I'm not seeing anything obviously wrong, but I trust Travis. Thanks for linking to that build.

tswast · 2018-02-21T13:01:54Z

pandas_gbq/gbq.py

+
+    column_dtypes = {
+        str(field['name']):
+        dtype_map.get(field['type'].upper(), object) for field in fields


I wonder if our dtype_map needs some updating, as google-cloud-bigquery already does some type conversions. Travis seems to think something is up with this.

pandas_gbq/gbq.py:853: in read_gbq final_df[field['name']].astype(type_map[field['type'].upper()]) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype raise_on_error=raise_on_error, **kwargs) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype return self.apply('astype', dtype=dtype, **kwargs) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply applied = getattr(b, f)(**kwargs) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype values=values, **kwargs) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype values = _astype_nansafe(values.ravel(), dtype, copy=True) ../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape) pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409) ??? _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > ??? E ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'

Sidenote (note relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is REPEATED.

max-sixty · 2018-08-22T06:32:03Z

This now works - was actually very close to working prior.

This is probably 40% of the way to the 'load in each page into an array and then concat at the end'. 20% of the work though; so a decent place to pause & merge

tswast

Thanks!

tswast · 2018-08-22T16:25:24Z

pandas_gbq/gbq.py

+
+    df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys())
+    for column in df:
+        df[column] = df[column].astype(column_dtypes[column])


Do we even need the astype? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this a 0.6.1 release if you'd prefer to get this out as-is.

That's a good point re int with nulls. I'll add a test and see what happens.

I think we do need this for dates; otherwise they'll be str

Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165

What's happening: pandas is ignoring setting a column with NaN to int, and keeping it as object

max-sixty added 2 commits February 19, 2018 21:56

push df construction to pandas

51a7fb2

lint

61a429b

max-sixty added 2 commits February 20, 2018 00:25

gitignore additions

fec6dc6

.astype conversion

59dfb01

tswast reviewed Feb 21, 2018

View reviewed changes

max-sixty added 6 commits March 10, 2018 15:30

Merge branch 'master' into no-python-loop

bb4321c

Merge branch 'master' into no-python-loop

a028e1d

ignore warnings from Google Cloud SDK auth

bb68875

handle repeated fields in schema

acfac47

remove comment

87531a0

whats new

5641fc3

max-sixty force-pushed the no-python-loop branch from ecd6a2b to 5641fc3 Compare August 22, 2018 06:31

tswast approved these changes Aug 22, 2018

View reviewed changes

remove duplicate test

729b9f3

max-sixty merged commit 428845d into googleapis:master Aug 22, 2018

max-sixty deleted the no-python-loop branch August 22, 2018 17:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster dataframe construction #128

Faster dataframe construction #128

max-sixty commented Feb 20, 2018

codecov-io commented Feb 20, 2018 •

edited

Loading

max-sixty commented Feb 20, 2018 •

edited

Loading

tswast left a comment

tswast Feb 21, 2018

max-sixty commented Aug 22, 2018

tswast left a comment

tswast Aug 22, 2018

max-sixty Aug 22, 2018

max-sixty Aug 22, 2018

Faster dataframe construction #128

Faster dataframe construction #128

Conversation

max-sixty commented Feb 20, 2018

codecov-io commented Feb 20, 2018 • edited Loading

Codecov Report

max-sixty commented Feb 20, 2018 • edited Loading

tswast left a comment

Choose a reason for hiding this comment

tswast Feb 21, 2018

Choose a reason for hiding this comment

max-sixty commented Aug 22, 2018

tswast left a comment

Choose a reason for hiding this comment

tswast Aug 22, 2018

Choose a reason for hiding this comment

max-sixty Aug 22, 2018

Choose a reason for hiding this comment

max-sixty Aug 22, 2018

Choose a reason for hiding this comment

codecov-io commented Feb 20, 2018 •

edited

Loading

max-sixty commented Feb 20, 2018 •

edited

Loading