how to debug arrow/dplyr to consider a bug report? #46383
Replies: 6 comments 1 reply
-
Hi @jameshowison, totally fine to post an issue whether it's a bug or not, but I can help you look into it and walk through some debugging steps. Generally, what I'd do is try a few different things to rule out some issues, so I'll post my experiments below. The difference in behaviour between |
-
One useful thing to try at this point is working out whether the discrepancy lives in the R bindings to the Arrow C++ library or in the Arrow C++ library itself. In the former case, I'll dig into it more myself; in the latter, I might ask someone more familiar with the C++ code to help. One way to work this out is to test the equivalent PyArrow code: both R and Python provide bindings to the C++ library, so if they give different results, we can conclude the issue is in R.
I asked ChatGPT for the Python equivalent of the snippet:
    full_papers <- open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format = 'parquet')
    full_papers |>
      filter(published_year < 1990) |>
      collect() |>
      nrow()
and got this:
    import pyarrow.dataset as ds
    # Load dataset
    full_papers = ds.dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format='parquet')
    # Filter and count rows
    full_papers.to_table(filter=ds.field("published_year") < 1990).num_rows
which gave me the result:
And just to check things looked the same, I also tried the following Python:
    full_papers.to_table(filter=ds.field("published_year") >= 1990).num_rows
which returned
Given that this maps to what you found in R, it looks like this is happening at the C++ level.
-
I'm curious if there's anything special about the parquet file itself too, so I installed parquet-tools and took a look:
Nothing too out of the ordinary here, though I'll note the file was written with a snapshot (i.e. dev) version of parquet-go; I don't think this should be an issue. It's Parquet 2.6, which is good, as that's a recent format version. The column in question is a uint16 type, but this shouldn't be an issue either. Next thing I'm gonna do is try to rule out any issues with working with this column type and work out whether it's something up with this file or with Arrow itself.
-
First thing I'm gonna try is writing the dataset to a temporary file - this is all done at the arrow level without bringing it into R. Then I'll read it in again and see if the filter works.
I go |
-
Next I tried inspecting the Parquet file to see if there was anything particular about it, and then comparing it with a version I'd written from Arrow C++. I added a few extra parameters to try to match the original as closely as possible:
Then I ran parquet-tools to compare them, e.g.
There's a diff here comparing the original file and the one written by Arrow C++, which seems to be working fine: https://www.diffchecker.com/OE6AnZgn/ Super weird: the two files are pretty similar but give different results. Unsure how to proceed right now but I'll have a think and get back to you!
-
I'm going to close this discussion as I see you've already opened an issue, and I'll copy the key info across to there.
-
We are seeing unexpected behavior with arrow using dplyr filter. The issue seems to be centered around a less-than filter that works when we use in-memory data but doesn't work when we use open_dataset.
We asked about the issue on Stack Overflow here: https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580
And I've created a test dataset and code at: https://github.com/softcite/softcite-extractions-parquet-analysis in the https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/queries_on_parquet.qmd file.
I have no idea if this is pointing to a bug, so I don't want to post an issue. I didn't think that posit forums would help, since I think the arrow/parquet versions of the dplyr verbs are implemented here?
But I also don't know how to debug this further, so any guidance on that would be appreciated. If I can debug it further and it does look like an issue, I'll try to create a smaller dataset to show the behavior (there is one in the GitHub repo above, but it's too large).
Thanks!
James