how to debug arrow/dplyr to consider a bug report? #46383

jameshowison · 2025-05-09T18:09:05Z

jameshowison
May 9, 2025

We are seeing unexpected behavior with arrow when using dplyr's filter. The issue seems to be centered on a less-than filter that works on in-memory data but doesn't work when we use open_dataset.

We asked about the issue on Stack Overflow here: https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580

And I've created a test dataset and code at: https://github.com/softcite/softcite-extractions-parquet-analysis in the https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/queries_on_parquet.qmd file.

I have no idea if this is pointing to a bug, so I don't want to post an issue. I didn't think that the Posit forums would help, since I think the arrow/parquet versions of the dplyr verbs are implemented here?

But I also don't know how to debug this further, so any guidance on that would be appreciated. If I can debug it further and it does look like an issue, I'll try to create a smaller dataset to show the behavior (but there is one in the GitHub repo above that isn't too giant).

Thanks!
James

thisisnic · 2025-05-12T14:17:10Z

thisisnic
May 12, 2025
Collaborator

Hi @jameshowison, totally fine to post an issue whether it's a bug or not, but I can help you look into it and walk through some debugging steps. Generally, what I'd do is try a few different things to rule out some issues, so I'll post my experiments below.

The difference in behaviour between read_parquet() and open_dataset() is likely caused by the fact that when you call read_parquet(), you pull the data into R session memory and then run the dplyr chain on the data frame, whereas with open_dataset() it converts the dplyr chain to Acero (Arrow C++ compute engine) commands and runs them before pulling the results back into R. So whatever is happening is happening in Acero, or the R bindings to it.

0 replies

thisisnic · 2025-05-12T14:30:42Z

thisisnic
May 12, 2025
Collaborator

One useful thing to try at this point is working out whether the discrepancy lives in the R bindings to the Arrow C++ library or in the Arrow C++ library itself. In the case of the former, I'll dig into it more myself, but in the case of the latter, I might choose to ask someone more familiar with it to help. One way to work this out is to test out the equivalent PyArrow code - both R and Python provide bindings to the C++ library, so if they have different results, we can conclude the issue is in R.

I asked ChatGPT for the Python equivalent of the snippet:

full_papers <- open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format = 'parquet')

full_papers |>
  filter(published_year < 1990) |>
  collect() |>
  nrow()

and got this:

import pyarrow.dataset as ds

# Load dataset
full_papers = ds.dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format='parquet')

# Filter and count rows
full_papers.to_table(filter=ds.field("published_year") < 1990).num_rows

which gave me the result:

And just to check things looked the same, I also tried the following Python:

full_papers.to_table(filter=ds.field("published_year") >= 1990).num_rows

which returned

Given that this maps to what you found in R, it looks like this is happening at the C++ level.

0 replies

thisisnic · 2025-05-12T14:37:48Z

thisisnic
May 12, 2025
Collaborator

I'm curious if there's anything special about the parquet file itself too, so I installed parquet-tools and took a look:

nic@xps-15:~/arrow$ parquet-tools inspect ../Downloads/papers.parquet 

############ file meta data ############
created_by: parquet-go version 18.0.0-SNAPSHOT
num_columns: 13
num_rows: 64141
num_row_groups: 1
format_version: 2.6
serialized_size: 1819


############ Columns ############
paper_id
softcite_id
title
published_year
published_date
publication_venue
publisher_name
doi
pmcid
pmid
genre
license_type
has_mentions

############ Column(paper_id) ############
name: paper_id
path: paper_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=false)
converted_type (legacy): UINT_32
compression: GZIP (space_saved: 13%)

############ Column(softcite_id) ############
name: softcite_id
path: softcite_id
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 50%)

############ Column(title) ############
name: title
path: title
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 56%)

############ Column(published_year) ############
name: published_year
path: published_year
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=16, isSigned=false)
converted_type (legacy): UINT_16
compression: GZIP (space_saved: 18%)

############ Column(published_date) ############
name: published_date
path: published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: GZIP (space_saved: 10%)

############ Column(publication_venue) ############
name: publication_venue
path: publication_venue
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 59%)

############ Column(publisher_name) ############
name: publisher_name
path: publisher_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(doi) ############
name: doi
path: doi
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 61%)

############ Column(pmcid) ############
name: pmcid
path: pmcid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 63%)

############ Column(pmid) ############
name: pmid
path: pmid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 57%)

############ Column(genre) ############
name: genre
path: genre
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 60%)

############ Column(license_type) ############
name: license_type
path: license_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(has_mentions) ############
name: has_mentions
path: has_mentions
max_definition_level: 0
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 99%)

Nothing too out of the ordinary here. I'll note the file was written with a snapshot (i.e. dev) version of parquet-go, though I don't think this should be an issue. It's Parquet format version 2.6, which is good, as it's a recent version. The column in question is a uint16 type, but this shouldn't be an issue either.

Next thing I'm gonna do is try to rule out any issues with working with this column type and work out whether it's something up with this file or with Arrow itself.

0 replies

thisisnic · 2025-05-12T14:42:57Z

thisisnic
May 12, 2025
Collaborator

First thing I'm gonna try is writing the dataset to a temporary file - this is all done at the arrow level without bringing it into R. Then I'll read it in again and see if the filter works.

tf <- tempfile()
dir.create(tf)

open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet') %>%
  write_dataset(tf)

open_dataset(tf) |>
  filter(published_year < 1990) |>
  collect() |>
  nrow()

I got 1720 here, so it feels like there's something wrong either with the file or with how it's being read. The next step is comparing the new file with the old one and seeing if there are any differences.

1 reply

jameshowison May 12, 2025
Author

Amazing @thisisnic I'm learning a ton watching you work through this!

thisisnic · 2025-05-12T14:56:14Z

thisisnic
May 12, 2025
Collaborator

Next I tried inspecting the Parquet file to see if there was anything particular about it and then comparing it with a version I'd written from Arrow C++.

I added a few extra parameters to try to match as closely as possible:

open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet') %>%
  write_dataset(tf, compression = "gzip", min_rows_per_group = 100000)

Then I ran parquet-tools to compare them, e.g.

parquet-tools inspect "/tmp/RtmpfoyxmB/file18fa6b312836/part-0.parquet"

There's a diff here comparing the original file and the one written by Arrow C++ which seems to be working fine: https://www.diffchecker.com/OE6AnZgn/
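The same comparison can be done locally instead of through a web diff tool. The sketch below uses Python's standard-library difflib on two saved inspect reports. `diff_reports` is a hypothetical helper; the first report reuses a few lines from the parquet-tools output earlier in this thread, and the second is a placeholder, not the real output for the rewritten file.

```python
import difflib


def diff_reports(original, rewritten):
    """Return unified-diff lines between two text reports (hypothetical helper)."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            rewritten.splitlines(),
            fromfile="original",
            tofile="rewritten",
            lineterm="",
        )
    )


# A few lines taken from the inspect output of the original file.
original_report = """created_by: parquet-go version 18.0.0-SNAPSHOT
num_columns: 13
num_rows: 64141
num_row_groups: 1"""

# Placeholder text standing in for the rewritten file's report (not real output).
rewritten_report = """created_by: PLACEHOLDER-WRITER
num_columns: 13
num_rows: 64141
num_row_groups: 1"""

changes = diff_reports(original_report, rewritten_report)
for line in changes:
    print(line)
```

In practice the two strings would come from saving the output of `parquet-tools inspect` for each file and reading the saved text back in.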

Super weird - they're pretty similar but getting different results. Unsure how to proceed right now but I'll have a think and get back to you!

0 replies

thisisnic · 2025-05-12T15:49:10Z

thisisnic
May 12, 2025
Collaborator

I'm going to close this discussion as I see you've already opened an issue, and I'll copy the key info across to there.

0 replies
