Adding CURRENT_WATERMARK recipe #47
Merged
85 changes: 85 additions & 0 deletions
other-builtin-functions/03_current_watermark/03_current_watermark.md
@@ -0,0 +1,85 @@
# 03 Filtering out Late Data

![Twitter Badge](https://img.shields.io/badge/Flink%20Version-1.14%2B-lightgrey)

> :bulb: This example will show how to filter out late data using the `CURRENT_WATERMARK` function.

The source table (`mobile_usage`) is backed by the [`faker` connector](https://flink-packages.org/packages/flink-faker), which continuously generates rows in memory based on Java Faker expressions.

As explained in the [watermarks recipe](../../aggregations-and-analytics/02_watermarks/02_watermarks.md), Flink uses watermarks to measure progress in event time. By adding a `WATERMARK` clause to a table's DDL, we designate a column as the table's event time attribute and tell Flink how out of order we expect our data to arrive.

Rows often arrive even more out of order than anticipated, i.e. with a timestamp at or behind the current watermark. Such rows are called *late*. For example, someone might use a mobile app while offline, because of missing mobile coverage or because flight mode is enabled. Once Internet access is restored, the previously tracked activities are sent.

In this recipe, we'll filter out this late data using the [`CURRENT_WATERMARK`](https://ci.apache.org/projects/flink/flink-docs-release-1.14/docs/dev/table/functions/systemfunctions/) function. In the first statement, we'll combine the non-late data with the [`TUMBLE`](../../aggregations-and-analytics/01_group_by_window/01_group_by_window_tvf.md) function to send the number of unique IP addresses per minute to a downstream consumer (like a BI tool). In addition, we'll send the late data to a different sink. For example, you might want to use these rows to change the results of your product recommender for offline mobile app users.

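To see how the comparison behaves before filtering, each row can be shown next to the watermark at the time it is processed. This is a minimal sketch that assumes the `mobile_usage` table from the script below already exists; the `watermark_at_row` and `is_late` aliases are illustrative and not part of the recipe.

```sql
-- Sketch: inspect lateness per row (aliases are hypothetical)
SELECT
    log_time,
    CURRENT_WATERMARK(log_time) AS watermark_at_row,
    -- A NULL watermark means none is available yet, so the row is treated as on time
    (CURRENT_WATERMARK(log_time) IS NOT NULL
        AND log_time <= CURRENT_WATERMARK(log_time)) AS is_late
FROM mobile_usage;
```

`CURRENT_WATERMARK` returns `NULL` when no watermark is available yet at that point in the pipeline, which is why the `IS NOT NULL` check comes first.
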
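The table DDL contains both an event time and a processing time definition. `ingest_time` is defined as processing time, while `log_time` is defined as event time and will contain timestamps between 45 and 10 seconds in the past. Because the watermark trails the highest `log_time` seen so far by 15 seconds, a share of these rows will arrive late.
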
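A quick way to check which column carries which time attribute is to describe the table. This is a sketch that assumes the `mobile_usage` table from the script below has been created in the current session; the exact output layout depends on the Flink version.

```sql
-- Lists the columns with their types, including time attribute and watermark information
DESCRIBE mobile_usage;
```
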
## Script
MartijnVisser marked this conversation as resolved.

```sql
-- Create source table
CREATE TABLE `mobile_usage` (
    `activity` STRING,
    `client_ip` STRING,
    `ingest_time` AS PROCTIME(),
    `log_time` TIMESTAMP_LTZ(3),
    WATERMARK FOR log_time AS log_time - INTERVAL '15' SECONDS
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '50',
  'fields.activity.expression' = '#{regexify ''(open_push_message|discard_push_message|open_app|display_overview|change_settings)''}',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.log_time.expression' = '#{date.past ''45'',''10'',''SECONDS''}'
);

-- Create sink table for rows that are non-late
CREATE TABLE `unique_users_per_window` (
    `window_start` TIMESTAMP(3),
    `window_end` TIMESTAMP(3),
    `ip_addresses` BIGINT
) WITH (
  'connector' = 'blackhole'
);

-- Create sink table for rows that are late
CREATE TABLE `late_usage_events` (
    `activity` STRING,
    `client_ip` STRING,
    `ingest_time` TIMESTAMP_LTZ(3),
    `log_time` TIMESTAMP_LTZ(3),
    `current_watermark` TIMESTAMP_LTZ(3)
) WITH (
  'connector' = 'blackhole'
);

-- Create a view with non-late data
-- CURRENT_WATERMARK returns NULL until a watermark is available; such rows count as on time
CREATE TEMPORARY VIEW `mobile_data` AS
    SELECT * FROM mobile_usage
    WHERE CURRENT_WATERMARK(log_time) IS NULL
          OR log_time > CURRENT_WATERMARK(log_time);

-- Create a view with late data (timestamp at or behind the current watermark)
CREATE TEMPORARY VIEW `late_mobile_data` AS
    SELECT * FROM mobile_usage
    WHERE CURRENT_WATERMARK(log_time) IS NOT NULL
          AND log_time <= CURRENT_WATERMARK(log_time);

BEGIN STATEMENT SET;

-- Send all rows that are non-late to the sink for data that's on time
INSERT INTO `unique_users_per_window`
    SELECT `window_start`, `window_end`, COUNT(DISTINCT client_ip) AS `ip_addresses`
      FROM TABLE(
        TUMBLE(TABLE mobile_data, DESCRIPTOR(log_time), INTERVAL '1' MINUTE))
      GROUP BY window_start, window_end;

-- Send all rows that are late to the sink for late data
INSERT INTO `late_usage_events`
    SELECT *, CURRENT_WATERMARK(log_time) AS `current_watermark` FROM `late_mobile_data`;

END;
```
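
As a possible follow-up, the late rows can be summarised instead of forwarded one by one. The query below is a sketch that reuses the `late_mobile_data` view from the script above; the `late_events` alias is illustrative. Because it is a non-windowed aggregation, it produces a continuously updating result rather than one final row per activity.

```sql
-- Sketch: count late events per activity (alias is hypothetical)
SELECT activity, COUNT(*) AS late_events
FROM late_mobile_data
GROUP BY activity;
```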
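
## Example Output

![current_watermark](https://user-images.githubusercontent.com/9950126/137905430-efd09d7d-8be1-4dfe-b8af-b6325b0a67c6.png)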