Commit 48e9cf1

committed
Update new about compression
1 parent 6e22754 commit 48e9cf1

File tree

1 file changed

+80
-172
lines changed


use-timescale/compression/about-compression.md

Lines changed: 80 additions & 172 deletions
Original file line number | Diff line number | Diff line change
@@ -21,179 +21,87 @@ This section explains how to enable native compression, and then goes into
2121
detail on the most important settings for compression, to help you get the
2222
best possible compression ratio.
2323

24-
## Compression policy intervals
25-
26-
Data is usually compressed after an interval of time, and not
27-
immediately. In the "Enabling compression" procedure, you used a seven day
28-
compression interval. Choosing a good compression interval can make your queries
29-
more efficient, and also allow you to handle data that is out of order.
30-
31-
### Query efficiency
32-
33-
Research has shown that when data is newly ingested, the queries are more likely
34-
to be shallow in time, and wide in columns. Generally, they are debugging
35-
queries, or queries that cover the whole system, rather than specific, analytic
36-
queries. An example of the kind of query more likely for new data is "show the
37-
current CPU usage, disk usage, energy consumption, and I/O for a particular
38-
server". When this is the case, the uncompressed data has better query
39-
performance, so the native PostgreSQL row-based format is the best option.
40-
41-
However, as data ages, queries are likely to change. They become more
42-
analytical, and involve fewer columns. An example of the kind of query run on
43-
older data is "calculate the average disk usage over the last month." This type
44-
of query runs much faster on compressed, columnar data.
45-
46-
To take advantage of this and increase your query efficiency, you want to run
47-
queries on new data that is uncompressed, and on older data that is compressed.
48-
Setting the right compression policy interval means that recent data is ingested
49-
in an uncompressed, row format for efficient shallow and wide queries, and then
50-
automatically converted to a compressed, columnar format after it ages and is
51-
more likely to be queried using deep and narrow queries. Therefore, one
52-
consideration for choosing the age at which to compress the data is when your
53-
query patterns change from shallow and wide to deep and narrow.
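
As a rough sketch, assuming a hypertable named `example` that already has compression enabled, a seven-day policy could look like this:

```sql
-- Illustrative only: compress chunks once all of their data is at least
-- seven days old, so recent data stays in row format for shallow and
-- wide queries.
SELECT add_compression_policy('example', INTERVAL '7 days');
```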
54-
55-
### Modified data
56-
57-
Trying to change chunks that have already been compressed can be inefficient.
58-
You can always query data in compressed chunks, but the current version of
59-
compression does not support `DELETE` actions on compressed chunks. This
60-
limitation means you should only compress a chunk when it is
61-
unlikely to be modified again. How much time this requires is highly dependent
62-
on your individual setup. Choose a compression interval that minimizes the need
63-
to decompress chunks, but keep in mind that you want to avoid storing data that
64-
is out of order.
65-
66-
You can manually decompress a chunk to modify it if you need to. For more
67-
information on how to do that,
68-
see [decompressing chunks][decompress-chunks].
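
As an illustration, assuming a hypertable named `example` where only the last month of data still receives corrections, the decompress-modify-recompress workflow might be sketched like this:

```sql
-- Illustrative only: decompress the chunks that may still change,
-- modify the rows, then compress the chunks again afterwards.
SELECT decompress_chunk(c, true)
FROM show_chunks('example', newer_than => INTERVAL '1 month') c;
```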
69-
70-
### Compression states over time
71-
72-
A chunk can be in one of three states:
73-
74-
* `Active` and uncompressed
75-
* `Compression candidate` and uncompressed
76-
* `Compressed`
77-
78-
Active chunks are uncompressed and able to ingest data. Due to the nature of the
79-
compression mechanism, they cannot effectively ingest data while compressed. As
80-
shown in this illustration, as active chunks age, they become compression
81-
candidates, and are eventually compressed when they become old enough according
82-
to the compression policy.
83-
84-
<img class="main-content__illustration"
85-
width={1375} height={944}
86-
src="https://s3.amazonaws.com/assets.timescale.com/docs/images/compression_diagram.webp"
87-
alt="Compression timeline" />
88-
89-
## Segment by columns
90-
91-
When you compress data, you need to select which column to segment by. Each row
92-
in a compressed table must contain data about a single item. The column that a
93-
table is segmented by contains only a single entry, while all other columns can
94-
have multiple arrayed entries. For example, in this compressed table, the first
95-
row contains all the values for device ID 1, and the second row contains all the
96-
values for device ID 2:
97-
98-
|time|device_id|cpu|disk_io|energy_consumption|
99-
|---|---|---|---|---|
100-
|[12:00:02, 12:00:01]|1|[88.2, 88.6]|[20, 25]|[0.8, 0.85]|
101-
|[12:00:02, 12:00:01]|2|[300.5, 299.1]|[30, 40]|[0.9, 0.95]|
102-
103-
Because a single value is associated with each compressed row, there is no need
104-
to decompress to evaluate the value in that column. This means that queries with
105-
`WHERE` clauses that filter by a `segmentby` column are much more efficient,
106-
because decompression can happen after filtering instead of before. This avoids
107-
the need to decompress filtered-out rows altogether.
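
For example, with `device_id` as a `segmentby` column, a query like this sketch (against a hypothetical `example` hypertable) filters on the uncompressed `device_id` values and only decompresses the matching segments:

```sql
-- Illustrative only: rows for other devices are filtered out before any
-- decompression happens.
SELECT avg(cpu)
FROM example
WHERE device_id = 1;
```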
108-
109-
Because some queries are more efficient than others, it is important to pick the
110-
correct set of `segmentby` columns. If your table has a primary key, all of the
111-
primary key columns, except for `time`, can go into the `segmentby` list. For
112-
example, if our example table uses a primary key on `(device_id, time)`, then
113-
the `segmentby` list is `device_id`.
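
As a sketch of that configuration, assuming the hypertable is called `example`:

```sql
-- Illustrative only: segment by the non-time primary key column.
ALTER TABLE example
  SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'device_id');
```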
114-
115-
Another method is to determine a set of values that can be graphed over time.
116-
For example, in this EAV (entity-attribute-value) table, the series can be
117-
defined by `device_id` and `metric_name`. Therefore, the `segmentby` option
118-
should be `device_id, metric_name`:
119-
120-
|time|device_id|metric_name|value|
121-
|---|---|---|---|
122-
|8/22/2019 0:00|1|cpu|88.2|
123-
|8/22/2019 0:00|1|device_io|0.5|
124-
|8/22/2019 1:00|1|cpu|88.6|
125-
|8/22/2019 1:00|1|device_io|0.6|
126-
127-
The `segmentby` columns are useful, but can be overused. If you specify a lot of
128-
`segmentby` columns, the number of items in each compressed column is reduced,
129-
and compression is not as effective. A good guide is for each segment to contain
130-
at least 100 rows per chunk. To achieve this, you might also need to use
131-
the `compress_orderby` column.
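
One rough way to sanity-check this, assuming a hypothetical `example` hypertable with seven-day chunks, is to count rows per candidate segment over a chunk-sized window:

```sql
-- Illustrative only: segments with far fewer than 100 rows in a
-- chunk-sized window suggest the segmentby list is too fine-grained.
SELECT device_id, count(*) AS rows_per_chunk
FROM example
WHERE time >= now() - INTERVAL '7 days'
GROUP BY device_id
ORDER BY rows_per_chunk ASC
LIMIT 10;
```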
132-
133-
## Order entries
134-
135-
By default, the items inside a compressed array are arranged in descending order
136-
according to the hypertable's `time` column. In most cases, this works well,
137-
provided you have set the `segmentby` option appropriately. However, in some
138-
more complicated scenarios, you want to manually adjust the
139-
`compress_orderby` setting as well. Changing this value can improve the
140-
compression ratio and query performance.
141-
142-
Compression is most effective when adjacent data is close in magnitude or
143-
exhibits some sort of trend. Random data, or data that is out of order,
144-
compresses poorly. This means that it is important that the order of the input
145-
data causes it to follow a trend.
146-
147-
In this example, there are no `segmentby` columns set, so the data is sorted by
148-
the `time` column. If you look at the `cpu` column, you can see that it might
149-
not compress well, because even though both devices output floating-point
150-
values, the measurements have different magnitudes, with device 1
151-
showing numbers around 88, and device 2 showing numbers around 300:
152-
153-
|time|device_id|cpu|disk_io|energy_consumption|
154-
|-|-|-|-|-|
155-
|[12:00:02, 12:00:02, 12:00:01, 12:00:01 ]|[1, 2, 1, 2]|[88.2, 300.5, 88.6, 299.1]|[20, 30, 25, 40]|[0.8, 0.9, 0.85, 0.95]|
156-
157-
To improve the performance of this data, you can order by `device_id, time`
158-
instead, using these commands:
24+
## Key aspects of compression
15925

160-
```sql
161-
ALTER TABLE example
162-
SET (timescaledb.compress,
163-
timescaledb.compress_orderby = 'device_id, time');
164-
```
26+
Compression always starts with the hypertable. Every table has a different schema, but they all share some commonalities that you need to think about.
16527

166-
Using those settings, the compressed table now shows each measurement in
167-
consecutive order, and the `cpu` values show a trend. This table compresses much
168-
better:
28+
As an example, take a table named `metrics` with the following schema:
16929

170-
|time|device_id|cpu|disk_io|energy_consumption|
30+
|Column|Type|Collation|Nullable|Default|
17131
|-|-|-|-|-|
172-
|[12:00:02, 12:00:01, 12:00:02, 12:00:01 ]|[1, 1, 2, 2]|[88.2, 88.6, 300.5, 299.1]|[20, 25, 30, 40]|[0.8, 0.85, 0.9, 0.95]|
173-
174-
Putting items in `orderby` and `segmentby` columns often achieves similar
175-
results. In this same example, if you set it to segment by the `device_id`
176-
column, it has good compression, even without setting `orderby`. This is because
177-
ordering only matters within a segment, and segmenting by device means that each
178-
segment represents a series if it is ordered by time. So, if segmenting by an
179-
identifier causes segments to become too small, try moving the `segmentby`
180-
column into a prefix of the `orderby` list.
181-
182-
You can also use ordering to increase query performance. If a query uses similar
183-
ordering as the compression, you can decompress incrementally and still return
184-
results in the same order. You can also avoid a `SORT`. Additionally, the system
185-
automatically creates additional columns to store the minimum and maximum value
186-
of any `orderby` column. This way, the query executor looks at this additional
187-
column that specifies the range of values in the compressed column, without
188-
first performing any decompression, to determine whether the row could possibly
189-
match a time predicate specified by the query.
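
For instance, a time-bounded query like this sketch (against a hypothetical `example` hypertable ordered by `time`) lets the executor skip compressed batches whose stored `time` range cannot match the predicate:

```sql
-- Illustrative only: the min/max metadata for the orderby column is
-- checked before any decompression happens.
SELECT time, device_id, cpu
FROM example
WHERE time > now() - INTERVAL '1 hour'
ORDER BY time DESC;
```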
190-
191-
## Insert historical data into compressed chunks
192-
193-
In TimescaleDB 2.3 and later, you can insert data directly into compressed
194-
chunks. When you do this, the data that is being inserted is not compressed
195-
immediately. It is stored alongside the chunk it has been inserted into, and
196-
then a separate job merges it with the chunk and compresses it later on.
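
As a sketch, assuming the `example` hypertable used earlier in this section, such an insert is an ordinary `INSERT`:

```sql
-- Illustrative only: rows aimed at a compressed chunk are staged
-- uncompressed and merged into the chunk by a background job later.
INSERT INTO example (time, device_id, cpu, disk_io, energy_consumption)
VALUES ('2019-08-22 00:00:00+00', 1, 88.2, 20, 0.8);
```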
197-
198-
[decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks
199-
[manual-compression]: /use-timescale/:currentVersion:/compression/manual-compression/
32+
|time|timestamp with time zone||not null||
33+
|device_id|integer||not null||
34+
|device_type|integer||not null||
35+
|cpu|double precision||||
36+
|disk_io|double precision||||
37+
38+
The hypertable needs a primary dimension. This example shows the classic time-series use case, with the `time` column as the primary dimension used for partitioning. Besides that, the `cpu` and `disk_io` columns are value columns that are captured over time, and the `device_id` column is a lookup key that identifies which device the captured values belong to at a given point in time.
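
For reference, a minimal sketch of how such a hypertable might be created (the exact statement is an assumption, not part of this example):

<CodeBlock canCopy={false} showLineNumbers={false} children={`
-- Illustrative only: create the metrics table and turn it into a
-- hypertable partitioned on the time column.
CREATE TABLE metrics (
    time        timestamptz NOT NULL,
    device_id   integer NOT NULL,
    device_type integer NOT NULL,
    cpu         double precision,
    disk_io     double precision
);
SELECT create_hypertable('metrics', 'time');
`} />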
39+
Columns can be used in a few different ways:
40+
*   You can use the values in a column as a lookup key. In the example above, `device_id` is a typical example of such a column.
41+
*   You can use a column for partitioning the table. This is typically a time column, but it is possible to partition the table using other columns as well.
42+
*   You can use a column as a filter to narrow down the data you select. The `device_type` column is an example: you can decide to look only at, for example, solid state drives (SSDs).
43+
*   The remaining columns are typically the values or metrics you query for, and they are usually aggregated or presented in other ways. The `cpu` and `disk_io` columns are typical examples of such columns.
44+
An example query that uses the value columns and filters on time and device type could look like this:
45+
46+
<CodeBlock canCopy={false} showLineNumbers={false} children={`
47+
SELECT avg(cpu), sum(disk_io)
48+
FROM metrics
49+
WHERE device_type = 'SSD'
50+
  AND time >= now() - '1 day'::interval;
51+
`} />
52+
53+
When chunks in a hypertable are compressed, the data stored in them is reorganized and stored in column order rather than row order. As a result, the compressed chunk cannot use the same schema as the uncompressed chunk, so a different schema is created. TimescaleDB handles this automatically, but it has a few implications:
54+
*   The compression ratio and query performance are very dependent on the order and structure of the compressed data, so some consideration is needed when setting up compression.
55+
*   Indexes on the hypertable cannot always be used in the same manner for the compressed data.
56+
57+
58+
Based on the previous schema, data is filtered over a certain time period and analytics are done at device granularity. This access pattern lends itself to organizing the data in a layout that is well suited to compression.
59+
60+
### Segmenting and ordering
61+
62+
Segment the compressed data based on the way you access it. Essentially, you want to segment your data in a way that makes it easier for your queries to fetch the right data at the right time. In other words, your queries should dictate how you segment the data, so that they can be optimized and yield better query performance.
63+
64+
For example, if you want to access a single device using a specific `device_id` value, either all records or only those in a specific time range, you would otherwise need to filter each of those records individually at row access time. To get around this, you can use the `device_id` column for segmenting. This allows analytical queries on compressed data to run much faster when they look for specific device IDs.
65+
66+
<CodeBlock canCopy={false} showLineNumbers={false} children={`
67+
postgres=# \timing
68+
Timing is on.
69+
postgres=# SELECT device_id, AVG(cpu) AS avg_cpu, AVG(disk_io) AS avg_disk_io
70+
FROM metrics
71+
WHERE device_id = 5
72+
GROUP BY device_id;
73+
device_id | avg_cpu | avg_disk_io
74+
-----------+--------------------+---------------------
75+
5 | 0.4972598866221261 | 0.49820356730280524
76+
(1 row)
77+
Time: 177,399 ms
78+
postgres=# ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id', timescaledb.compress_orderby='time');
79+
ALTER TABLE
80+
Time: 6,607 ms
81+
postgres=# SELECT compress_chunk(c) FROM show_chunks('metrics') c;
82+
compress_chunk
83+
----------------------------------------
84+
_timescaledb_internal._hyper_2_4_chunk
85+
_timescaledb_internal._hyper_2_5_chunk
86+
_timescaledb_internal._hyper_2_6_chunk
87+
(3 rows)
88+
Time: 3070,626 ms (00:03,071)
89+
postgres=# SELECT device_id, AVG(cpu) AS avg_cpu, AVG(disk_io) AS avg_disk_io
90+
FROM metrics
91+
WHERE device_id = 5
92+
GROUP BY device_id;
93+
device_id | avg_cpu | avg_disk_io
94+
-----------+-------------------+---------------------
95+
5 | 0.497259886622126 | 0.49820356730280535
96+
(1 row)
97+
Time: 42,139 ms
98+
`} />
99+
100+
Ordering the data has a great impact on the compression ratio, because you want rows that change over a dimension (most likely time) to be close to each other. Most of the time, data changes in a predictable fashion, following a certain trend. You can exploit this fact to encode the data so that it takes less space to store. For example, if you order the records by time, they are compressed in that order and subsequently also accessed in the same order.
101+
102+
103+
This makes the `time` column a perfect candidate for ordering your data, since the measurements evolve as time goes on. If you used that as your only compression setting, you would most likely get a good enough compression ratio to save a lot of storage. However, how effectively you can access the data depends on your use case and your queries: with this setup, you would always have to access the data by the time dimension, and then filter all the rows based on any other criteria.
104+
105+
[insert query showing what happens when querying compressed chunks with orderby time]
106+
[alter table, compress chunk, run and time query]
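
As a sketch of what those examples might contain, assuming the same `metrics` hypertable starting uncompressed and ordering only by `time` (timings are omitted because they depend entirely on your data):

<CodeBlock canCopy={false} showLineNumbers={false} children={`
-- Illustrative only: compress with time ordering and no segmentby
-- column, then run the same per-device aggregate as before.
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_orderby = 'time');
SELECT compress_chunk(c) FROM show_chunks('metrics') c;

SELECT device_id, AVG(cpu) AS avg_cpu, AVG(disk_io) AS avg_disk_io
FROM metrics
WHERE device_id = 5
GROUP BY device_id;
-- Without device_id in segmentby, this query has to decompress and
-- filter rows from every device, so it runs slower than with the
-- segmented setup shown above.
`} />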
107+
