Skip to content

Timezones with date only indexes in Athena #9368

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
benswinburne opened this issue Mar 20, 2025 · 3 comments
Open

Timezones with date only indexes in Athena #9368

benswinburne opened this issue Mar 20, 2025 · 3 comments
Assignees
Labels
question The issue is a question. Please use Stack Overflow for questions.

Comments

@benswinburne
Copy link

benswinburne commented Mar 20, 2025

My data is stored in Athena, and indexed by partitions denoting the date with daily granularity, i.e. 2025-01-01 is a single partition.

Up until recently I've been able to run queries against the dt column which hits the correct partition and scans the correct amount of data for the dates requested in the query.

dt: {
  sql: `from_iso8601_timestamp(dt)`, // or date_parse(dt, '%Y-%m-%d')
  type: `time`,
},

Now I need to query the data taking timezone into account. Obviously the above doesn't have time included so it'll select the wrong data.

dt: {
  sql: `from_iso8601_timestamp(dt || 'T' || time)`,
  type: `time`,
},

The problem here is that now the partition isn't hit because there's no where clause directly referencing dt such as WHERE dt = '2025-01-01'.

-- hits the partition
select * from e where from_iso8601_timestamp(dt) BETWEEN ...

-- does not hit the partition
select * from e where from_iso8601_timestamp(dt || 'T' || time) BETWEEN ...

The problem I'm facing is that using Cube's traditional means, I'm unable to select against the dt column.

The way that I've currently got this working is using queryRewrite to add filters on a string version of dt which overlaps the date range set by the timeDimesions query by adding or subtracting a day where necessary.

queryRewrite: (query) => {
  const tdDt = query.timeDimensions.find(
    ({ dimension }) => dimension === 'EventsTesting.dt'
  );

  const offset = dayjs().tz(query.timezone).utcOffset();

  const from = dayjs(tdDt.dateRange[0]).subtract(offset > 0 ? 1 : 0, 'days');
  const to = dayjs(tdDt.dateRange[1]).add(offset < 0 ? 1 : 0, 'day');

  query.filters.push({
    member: 'EventsTesting.dtString',
    operator: 'gte',
    values: [from.format('YYYY-MM-DD')],
  });

  query.filters.push({
    member: 'EventsTesting.dtString',
    operator: 'lte',
    values: [to.format('YYYY-MM-DD')],
  });

  return query;
}

Resulting in

WHERE
  (
    from_iso8601_timestamp(dt || 'T' || time) >= from_iso8601_timestamp(?)
    AND from_iso8601_timestamp(dt || 'T' || time) <= from_iso8601_timestamp(?)
  )
  AND ("events_testing".dt >= ?)
  AND ("events_testing".dt <= ?)

Update: This does not work with preagg building as it doesn't use queryRewrite so building preaggs still scans all my data because cube is not letting me just add a where clause to dt in Athena's expected format to match a partition

To Reproduce
Steps to reproduce the behavior:

Using Athena as the driver, run the test below in vitest and observe results.

Expected behavior
I'm not entirely sure what the best course of action here is as obviously this is partly to do with Athena.

I tried to use various time dimensions rather than filters to manipulate the query prior to reaching the above solution. I could only manipulate the left side of the from

For example take the below filter function - I can't slice the time values off because cube modifies them

      FILTER_PARAMS.EventsTesting.dt.filter((from, to) => {
        const fromDate = from.substr(0, 10);
        const toDate = to.substr(0, 10);
        return `from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp(${from})
            AND from_iso8601_timestamp(${to})
            AND dt BETWEEN ${fromDate} AND ${toDate}
          `;
      })

and it ends up like so

(from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp('2025-01-31T13:00:00.000Z')
            AND from_iso8601_timestamp('2025-02-01T12:59:59.999Z')
            AND dt BETWEEN '2025-01-31T13:00:00.000Z' AND '2025-02-01T12:59:59.999Z'
          )

I suppose some better ability to control how dimensions (particularly time dimensions) are managed by cube maybe is the way here?

Perhaps I'm going about the whole thing wrong or it's just a missing feature relating to the Athena Driver which has been overlooked or something. It's felt like a huge struggle to get to the above so maybe it's the API, the DX, the Documentation or just me but I'm hoping to open a discussion about it in some way.

Minimally reproducible Cube Schema
In case your bug report is data modelling related please put your minimally reproducible Cube Schema here.
You can use selects without tables in order to achieve that as follows.

Use Athena as the driver.

cube(`EventsTesting`, {
  // LA 7,8,9,10
  // NY 6,7,8,9
  // SYD 3,4,5,6
  // UTC 5,6,7,8
  // ATH 4,5,6,7

  data_source: `events`,

  sql: `
    WITH data AS (
        SELECT * FROM (
            VALUES
                ('2025-01-31', '01:00:00', 1),
                ('2025-01-31', '07:00:00', 2),
                ('2025-01-31', '15:00:00', 3),
                ('2025-01-31', '23:00:00', 4),
                ('2025-02-01', '01:00:00', 5),
                ('2025-02-01', '07:00:00', 6),
                ('2025-02-01', '15:00:00', 7),
                ('2025-02-01', '23:00:00', 8),
                ('2025-02-02', '01:00:00', 9),
                ('2025-02-02', '07:00:00', 10),
                ('2025-02-02', '15:00:00', 11),
                ('2025-02-02', '23:00:00', 12),
                ('2025-02-03', '01:00:00', 13),
                ('2025-02-03', '07:00:00', 14),
                ('2025-02-03', '15:00:00', 15),
                ('2025-02-03', '23:00:00', 16)
        ) AS t(dt, time, event_id)
    )
    SELECT e.*
    FROM data e
    WHERE ${FILTER_GROUP(
      FILTER_PARAMS.EventsTesting.dt.filter(
        (from, to) =>
          `from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp(${from})
            AND from_iso8601_timestamp(${to})`
      )
    )}
  `,

  dimensions: {
    eventId: {
      sql: `event_id`,
      type: `number`,
      primaryKey: true,
    },

    dt: {
      sql: `from_iso8601_timestamp(dt || 'T' || time || 'Z')`,
      type: `time`,
    },

    dtString: {
      public: false,
      sql: `dt`,
      type: `string`,
    },
  },
});

And a vitest file

import { test, expect } from 'vitest';
import fetch from 'node-fetch';

test.concurrent.each([
  { timezone: 'America/Los_Angeles', ids: [7, 8, 9, 10] },
  { timezone: 'America/New_York', ids: [6, 7, 8, 9] },
  { timezone: 'Australia/Sydney', ids: [3, 4, 5, 6] },
  { timezone: 'Europe/London', ids: [5, 6, 7, 8] },
  { timezone: 'Europe/Athens', ids: [4, 5, 6, 7] },
])(
  'can return results in $timezone',
  { timeout: 10000 },
  async ({ timezone, ids }) => {
    const query = {
      timezone: timezone,
      measures: [],
      timeDimensions: [
        {
          dimension: 'EventsTesting.dt',
          dateRange: ['2025-02-01', '2025-02-01'],
        },
      ],
      dimensions: ['EventsTesting.eventId'],
    };

    // Create URL params
    const params = new URLSearchParams();
    params.append('query', JSON.stringify(query));

    // Make the request
    const response = await fetch(
      `http://localhost:4000/cubejs-api/v1/load?${params}`,
      {
        method: 'GET',
        headers: {
          Authorization: 'token',
        },
      }
    );

    const { data } = await response.json();

    expect(
      data.map((d) => Number(d['EventsTesting.eventId'])).sort((a, b) => a - b)
    ).toEqual(ids);
  }
);

Version:
[e.g. 1.2.0]

Additional context
Add any other context about the problem here.

@benswinburne
Copy link
Author

Having just run this with some preaggregations set up, the queries generated to build preaggregations do not use rewriteQuery (obviously in hindsight), and therefore don't add my dt filters and therefore scans all data for all partitions for all timezones.

Back to the drawing board.

I must be doing something wrong here, surely. Or is this a limitation of Cube with Athena?

@igorlukanin igorlukanin self-assigned this Mar 24, 2025
@igorlukanin igorlukanin added the question The issue is a question. Please use Stack Overflow for questions. label Mar 24, 2025
@igorlukanin
Copy link
Member

Hi @benswinburne 👋

Thanks for a very detailed report/question and elaborating on the steps that you've taken.

I don't think I fully grasped the issue here, so let me start with a question.

I'm curious about this part:

Now I need to query the data taking timezone into account. Obviously the above doesn't have time included so it'll select the wrong data.

dt: {
sql: from_iso8601_timestamp(dt || 'T' || time),
type: time,
},

What do you mean, exactly, by "need to query the data taking timezone into account? Why aren't you including Z into the sql here?

I understand this as "3pm in the database should mean 3pm Sydney time when queried by a user from Syndey, 3pm NY time when queried by a user from NY." Is it really what you're assuming here?

@benswinburne
Copy link
Author

Hi Igor,

What do you mean, exactly, by "need to query the data taking timezone into account? Why aren't you including Z into the sql here?

My data is all in UTC. It was only queried in UTC so no problem. Now I need to query the same data but taking timezones into account. My previous dt dimension only referred to dt in the database which is YYYY-MM-DD, now I'm taking the time into account (perhaps thats what I should have said before) so now I'm expanding my dt dimension definition to dt + time columns.

The missing Z is absent left over from me testing many things (although it is inconsequential because a missing Z defaults to Z anyway as far as I am aware). Regardless, that's not what is key to my question, perhaps I wasn't clear. Let me try again :)

If were to run the following, the time defaults to 00:00:00 obviously.

dt: {
  sql: from_iso8601_timestamp(dt),
  type: time,
},

Whilst this obviously does not give me the correct query results in the absence of a correct time, Athena correctly hits the partition because ultimately it translates to where from_iso8601_timestamp(dt) .... I assume Athena is smart enough to see that I am querying against dt, which maps to the partition and it scans only the correct data.

The problem is when I run the following

dt: {
  sql: from_iso8601_timestamp(dt || 'T' || time || 'Z'), // or without the Z
  type: time,
},

Athena is now unable to hit the correct partitions. I assume because it sees there are multiple columns it does not scan only the partitions which contain dt and scans data for all dates ever.

My data is partitioned as dt=YYYY-MM-DD/file.gz (as opposed to the more traditional way /year=YYYY/month=MM/day=DD/file.gz. Because of the way the data I have is partitioned, I must have a where dt=xclause in my query. Cube takes away my ability to do this anywhere in the base query because it modifies anything which looks like a data _after_ thedt.filter()` function is run.

I.e. If I do any manipulation within filter((from, to) => { manipulations }) Cube changes it later and tweaks the timezone, even for just dates.

To go back to my example in the OP, I added (where fromDate and toDate are YYYY-MM-DD)

AND dt BETWEEN ${fromDate} AND ${toDate}

And expected it to add (just dates, nom timestamps)

AND dt BETWEEN '2025-01-31T' AND '2025-02-01'

But it actually ends up querying the following because it's manipulated the dates

AND dt BETWEEN '2025-01-31T13:00:00.000Z' AND '2025-02-01T12:59:59.999Z'

I am just trying to find a way such that I can modify my base query without Cube messing with it so that I can add additional dt=x where clauses and hit the right partitions in Athena.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question The issue is a question. Please use Stack Overflow for questions.
Projects
None yet
Development

No branches or pull requests

2 participants