-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Timezones with date only indexes in Athena #9368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Having just run this with some preaggregations set up, the queries generated to build preaggregations do not use Back to the drawing board. I must be doing something wrong here, surely. Or is this a limitation of Cube with Athena? |
Hi @benswinburne 👋 Thanks for a very detailed report/question and elaborating on the steps that you've taken. I don't think I fully grasped the issue here, so let me start with a question. I'm curious about this part:
What do you mean, exactly, by "need to query the data taking timezone into account? Why aren't you including I understand this as "3pm in the database should mean 3pm Sydney time when queried by a user from Syndey, 3pm NY time when queried by a user from NY." Is it really what you're assuming here? |
Hi Igor,
My data is all in UTC. It was only queried in UTC so no problem. Now I need to query the same data but taking timezones into account. My previous The missing Z is absent left over from me testing many things (although it is inconsequential because a missing Z defaults to Z anyway as far as I am aware). Regardless, that's not what is key to my question, perhaps I wasn't clear. Let me try again :) If were to run the following, the time defaults to 00:00:00 obviously.
Whilst this obviously does not give me the correct query results in the absence of a correct time, Athena correctly hits the partition because ultimately it translates to The problem is when I run the following
Athena is now unable to hit the correct partitions. I assume because it sees there are multiple columns it does not scan only the partitions which contain My data is partitioned as I.e. If I do any manipulation within To go back to my example in the OP, I added (where fromDate and toDate are YYYY-MM-DD)
And expected it to add (just dates, nom timestamps)
But it actually ends up querying the following because it's manipulated the dates
I am just trying to find a way such that I can modify my base query without Cube messing with it so that I can add additional dt=x where clauses and hit the right partitions in Athena. |
My data is stored in Athena, and indexed by partitions denoting the date with daily granularity, i.e.
2025-01-01
is a single partition.Up until recently I've been able to run queries against the
dt
column which hits the correct partition and scans the correct amount of data for the dates requested in the query.Now I need to query the data taking timezone into account. Obviously the above doesn't have time included so it'll select the wrong data.
The problem here is that now the partition isn't hit because there's no where clause directly referencing
dt
such asWHERE dt = '2025-01-01'
.The problem I'm facing is that using Cube's traditional means, I'm unable to select against the
dt
column.The way that I've currently got this working is using
queryRewrite
to add filters on a string version ofdt
which overlaps the date range set by the timeDimesions query by adding or subtracting a day where necessary.Resulting in
Update: This does not work with preagg building as it doesn't use queryRewrite so building preaggs still scans all my data because cube is not letting me just add a where clause to
dt
in Athena's expected format to match a partitionTo Reproduce
Steps to reproduce the behavior:
Using Athena as the driver, run the test below in vitest and observe results.
Expected behavior
I'm not entirely sure what the best course of action here is as obviously this is partly to do with Athena.
I tried to use various time dimensions rather than filters to manipulate the query prior to reaching the above solution. I could only manipulate the left side of the from
For example take the below filter function - I can't slice the time values off because cube modifies them
and it ends up like so
I suppose some better ability to control how dimensions (particularly time dimensions) are managed by cube maybe is the way here?
Perhaps I'm going about the whole thing wrong or it's just a missing feature relating to the Athena Driver which has been overlooked or something. It's felt like a huge struggle to get to the above so maybe it's the API, the DX, the Documentation or just me but I'm hoping to open a discussion about it in some way.
Minimally reproducible Cube Schema
In case your bug report is data modelling related please put your minimally reproducible Cube Schema here.
You can use selects without tables in order to achieve that as follows.
Use Athena as the driver.
And a vitest file
Version:
[e.g. 1.2.0]
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: