Skip to content

Timezones with date only indexes in Athena #9368

Open
@benswinburne

Description

@benswinburne

My data is stored in Athena, and indexed by partitions denoting the date with daily granularity, i.e. 2025-01-01 is a single partition.

Up until recently I've been able to run queries against the dt column which hits the correct partition and scans the correct amount of data for the dates requested in the query.

dt: {
  sql: `from_iso8601_timestamp(dt)`, // or date_parse(dt, '%Y-%m-%d')
  type: `time`,
},

Now I need to query the data taking timezone into account. Obviously the above doesn't have time included so it'll select the wrong data.

dt: {
  sql: `from_iso8601_timestamp(dt || 'T' || time)`,
  type: `time`,
},

The problem here is that now the partition isn't hit because there's no where clause directly referencing dt such as WHERE dt = '2025-01-01'.

-- hits the partition
select * from e where from_iso8601_timestamp(dt) BETWEEN ...

-- does not hit the partition
select * from e where from_iso8601_timestamp(dt || 'T' || time) BETWEEN ...

The problem I'm facing is that using Cube's traditional means, I'm unable to select against the dt column.

The way that I've currently got this working is using queryRewrite to add filters on a string version of dt which overlaps the date range set by the timeDimesions query by adding or subtracting a day where necessary.

queryRewrite: (query) => {
  const tdDt = query.timeDimensions.find(
    ({ dimension }) => dimension === 'EventsTesting.dt'
  );

  const offset = dayjs().tz(query.timezone).utcOffset();

  const from = dayjs(tdDt.dateRange[0]).subtract(offset > 0 ? 1 : 0, 'days');
  const to = dayjs(tdDt.dateRange[1]).add(offset < 0 ? 1 : 0, 'day');

  query.filters.push({
    member: 'EventsTesting.dtString',
    operator: 'gte',
    values: [from.format('YYYY-MM-DD')],
  });

  query.filters.push({
    member: 'EventsTesting.dtString',
    operator: 'lte',
    values: [to.format('YYYY-MM-DD')],
  });

  return query;
}

Resulting in

WHERE
  (
    from_iso8601_timestamp(dt || 'T' || time) >= from_iso8601_timestamp(?)
    AND from_iso8601_timestamp(dt || 'T' || time) <= from_iso8601_timestamp(?)
  )
  AND ("events_testing".dt >= ?)
  AND ("events_testing".dt <= ?)

Update: This does not work with preagg building as it doesn't use queryRewrite so building preaggs still scans all my data because cube is not letting me just add a where clause to dt in Athena's expected format to match a partition

To Reproduce
Steps to reproduce the behavior:

Using Athena as the driver, run the test below in vitest and observe results.

Expected behavior
I'm not entirely sure what the best course of action here is as obviously this is partly to do with Athena.

I tried to use various time dimensions rather than filters to manipulate the query prior to reaching the above solution. I could only manipulate the left side of the from

For example take the below filter function - I can't slice the time values off because cube modifies them

      FILTER_PARAMS.EventsTesting.dt.filter((from, to) => {
        const fromDate = from.substr(0, 10);
        const toDate = to.substr(0, 10);
        return `from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp(${from})
            AND from_iso8601_timestamp(${to})
            AND dt BETWEEN ${fromDate} AND ${toDate}
          `;
      })

and it ends up like so

(from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp('2025-01-31T13:00:00.000Z')
            AND from_iso8601_timestamp('2025-02-01T12:59:59.999Z')
            AND dt BETWEEN '2025-01-31T13:00:00.000Z' AND '2025-02-01T12:59:59.999Z'
          )

I suppose some better ability to control how dimensions (particularly time dimensions) are managed by cube maybe is the way here?

Perhaps I'm going about the whole thing wrong or it's just a missing feature relating to the Athena Driver which has been overlooked or something. It's felt like a huge struggle to get to the above so maybe it's the API, the DX, the Documentation or just me but I'm hoping to open a discussion about it in some way.

Minimally reproducible Cube Schema
In case your bug report is data modelling related please put your minimally reproducible Cube Schema here.
You can use selects without tables in order to achieve that as follows.

Use Athena as the driver.

cube(`EventsTesting`, {
  // LA 7,8,9,10
  // NY 6,7,8,9
  // SYD 3,4,5,6
  // UTC 5,6,7,8
  // ATH 4,5,6,7

  data_source: `events`,

  sql: `
    WITH data AS (
        SELECT * FROM (
            VALUES
                ('2025-01-31', '01:00:00', 1),
                ('2025-01-31', '07:00:00', 2),
                ('2025-01-31', '15:00:00', 3),
                ('2025-01-31', '23:00:00', 4),
                ('2025-02-01', '01:00:00', 5),
                ('2025-02-01', '07:00:00', 6),
                ('2025-02-01', '15:00:00', 7),
                ('2025-02-01', '23:00:00', 8),
                ('2025-02-02', '01:00:00', 9),
                ('2025-02-02', '07:00:00', 10),
                ('2025-02-02', '15:00:00', 11),
                ('2025-02-02', '23:00:00', 12),
                ('2025-02-03', '01:00:00', 13),
                ('2025-02-03', '07:00:00', 14),
                ('2025-02-03', '15:00:00', 15),
                ('2025-02-03', '23:00:00', 16)
        ) AS t(dt, time, event_id)
    )
    SELECT e.*
    FROM data e
    WHERE ${FILTER_GROUP(
      FILTER_PARAMS.EventsTesting.dt.filter(
        (from, to) =>
          `from_iso8601_timestamp(dt || 'T' || time)
            BETWEEN from_iso8601_timestamp(${from})
            AND from_iso8601_timestamp(${to})`
      )
    )}
  `,

  dimensions: {
    eventId: {
      sql: `event_id`,
      type: `number`,
      primaryKey: true,
    },

    dt: {
      sql: `from_iso8601_timestamp(dt || 'T' || time || 'Z')`,
      type: `time`,
    },

    dtString: {
      public: false,
      sql: `dt`,
      type: `string`,
    },
  },
});

And a vitest file

import { test, expect } from 'vitest';
import fetch from 'node-fetch';

test.concurrent.each([
  { timezone: 'America/Los_Angeles', ids: [7, 8, 9, 10] },
  { timezone: 'America/New_York', ids: [6, 7, 8, 9] },
  { timezone: 'Australia/Sydney', ids: [3, 4, 5, 6] },
  { timezone: 'Europe/London', ids: [5, 6, 7, 8] },
  { timezone: 'Europe/Athens', ids: [4, 5, 6, 7] },
])(
  'can return results in $timezone',
  { timeout: 10000 },
  async ({ timezone, ids }) => {
    const query = {
      timezone: timezone,
      measures: [],
      timeDimensions: [
        {
          dimension: 'EventsTesting.dt',
          dateRange: ['2025-02-01', '2025-02-01'],
        },
      ],
      dimensions: ['EventsTesting.eventId'],
    };

    // Create URL params
    const params = new URLSearchParams();
    params.append('query', JSON.stringify(query));

    // Make the request
    const response = await fetch(
      `http://localhost:4000/cubejs-api/v1/load?${params}`,
      {
        method: 'GET',
        headers: {
          Authorization: 'token',
        },
      }
    );

    const { data } = await response.json();

    expect(
      data.map((d) => Number(d['EventsTesting.eventId'])).sort((a, b) => a - b)
    ).toEqual(ids);
  }
);

Version:
[e.g. 1.2.0]

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

Labels

questionThe issue is a question. Please use Stack Overflow for questions.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions