Slow read_gbq compared to QueryJob.to_dataframe #924

Open
tulinski opened this issue May 8, 2025 · 2 comments
Assignees: Linchin
Labels: api: bigquery (Issues related to the googleapis/python-bigquery-pandas API.), priority: p3 (Desirable enhancement or fix. May not be included in next release.)

Comments


tulinski commented May 8, 2025

There is a substantial difference in execution time between the pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe functions.

The reason is that pandas_gbq.read_gbq calls rows_iter.to_dataframe here with dtypes=conversion_dtypes, which triggers per-column dtype conversions inside RowIterator.to_dataframe.

conversion_dtypes is created this way:
conversion_dtypes = _bqschema_to_nullsafe_dtypes(schema_fields)

Why is this the default behavior?
In our case we don't need these costly column conversions.
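For context, here is a minimal sketch of the comparison we ran; the project ID, dataset, and query are placeholders, and the timing harness is ours, not part of either library:

```python
import time

import pandas_gbq
from google.cloud import bigquery

# Placeholder project and query.
PROJECT = "my-project"
QUERY = "SELECT * FROM `my-project.my_dataset.my_table`"

# Path 1: pandas-gbq, which passes the nullsafe conversion dtypes down
# to RowIterator.to_dataframe.
start = time.perf_counter()
df_gbq = pandas_gbq.read_gbq(QUERY, project_id=PROJECT)
print(f"read_gbq:     {time.perf_counter() - start:.1f}s")

# Path 2: plain google-cloud-bigquery, which runs the same query but
# does not pass per-column conversion dtypes to to_dataframe().
client = bigquery.Client(project=PROJECT)
start = time.perf_counter()
df_job = client.query(QUERY).to_dataframe()
print(f"to_dataframe: {time.perf_counter() - start:.1f}s")
```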

Profiling results

117.538    RowIterator.to_dataframe        google/cloud/bigquery/table.py:2261
    65.551      DataFrame.__setitem__      pandas/core/frame.py:3630
    48.537      RowIterator.to_arrow       google/cloud/bigquery/table.py:2058
    2.302       table_to_dataframe

Column transformations (the DataFrame.__setitem__ calls: 65.551 of 117.538) take about 55% of the time spent in RowIterator.to_dataframe.

Environment details

  • Python version: 3.10.12
  • pandas-gbq version: 0.28.0
product-auto-label bot added the api: bigquery label May 8, 2025
shollyman assigned Linchin and unassigned shollyman May 8, 2025
Linchin (Contributor) commented May 13, 2025

Hi @tulinski, thanks for raising the issue. The default mapper is necessary because some BigQuery types don't map cleanly to the types pandas would pick by default. You can add to the column type mappings by passing a dictionary as the dtypes parameter, but indeed there is no way to remove the existing mappers.

We could offer a flag such as overwrite_default_types, but I think it might become really confusing for general users.

A pretty hacky workaround, though, is to pass dtypes={"FLOAT": None, "INTEGER": None, "TIME": None, "BOOLEAN": None} to overwrite the defaults, because we use dictionary.update() to combine the user-provided dict with the default one. I think it could work as a temporary solution.
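A minimal sketch of that workaround, assuming the None entries do reach the internal type map through that update() call (the query and project ID are placeholders):

```python
import pandas_gbq

# Placeholder query and project. The dtypes dict is merged into the
# default BigQuery-type -> pandas-dtype map via dict.update(), so
# (assuming that merge order) the None entries disable the default
# conversions for FLOAT, INTEGER, TIME, and BOOLEAN columns.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.my_table`",
    project_id="my-project",
    dtypes={"FLOAT": None, "INTEGER": None, "TIME": None, "BOOLEAN": None},
)
```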

Linchin added the priority: p3 label May 13, 2025
tulinski (Author) commented

Hi @Linchin, thank you for your response.

Could you please give an example of a query where the default mapper makes a difference between the default behavior of pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe?

We've experimented with the FLOAT, INTEGER, TIME, and BOOLEAN BigQuery data types (including NULL values), and there was no difference between the DataFrames produced by pandas_gbq.read_gbq and QueryJob.to_dataframe.
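Roughly, the check looked like this (the query is a stand-in covering the four types plus NULLs; the project ID is a placeholder):

```python
import pandas_gbq
from google.cloud import bigquery

# Stand-in query covering FLOAT, INTEGER, TIME, and BOOLEAN, with a
# row of NULLs to exercise the nullable-dtype paths.
QUERY = """
SELECT 1.5 AS f, 2 AS i, TIME '12:34:56' AS t, TRUE AS b
UNION ALL
SELECT NULL, NULL, NULL, NULL
"""

df_gbq = pandas_gbq.read_gbq(QUERY, project_id="my-project")
df_job = bigquery.Client(project="my-project").query(QUERY).to_dataframe()

print(df_gbq.dtypes)
print(df_job.dtypes)
print(df_gbq.equals(df_job))  # printed True in our runs
```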
