perf(retention): improve performance of exemplars removal #1018

Merged
merged 6 commits into main from fix/retention-slowdown on Apr 14, 2022

Conversation

@kolesnikovae kolesnikovae commented Apr 8, 2022

We observe a significant slowdown when a retention policy is configured and a server with historical data (presumably) runs it for the first time. This has a large impact on overall performance, manifesting in data losses and an inability to query data.

  • Partial revert of "fix: allow cache eviction while purging storage" (#962): BadgerDB GC and retention tasks will no longer run concurrently.
  • Refactored the part responsible for exemplars removal: the large read-only transaction was broken into many smaller ones, and the batch size was decreased to 10K (previously, it was set to the maximum allowed, i.e. millions). See the sketch after this list.
  • Now, exemplars are written to DB in batches.
  • Added graceful cancelation for exemplars removal.
  • Exemplars retention period now defaults to the absolute retention period if not defined.
  • Added metrics to track exemplars:
    • pyroscope_storage_exemplars_read_bytes
    • pyroscope_storage_exemplars_write_bytes
    • pyroscope_storage_exemplars_removed_total
    • pyroscope_storage_exemplars_retention_task_duration_seconds
  • Added a metric that indicates data losses (profiles rejected due to internal storage queue overflow):
    • pyroscope_storage_discarded_total
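
The sketch below illustrates the batching approach in Go against BadgerDB's public API: expired keys are collected in short read-only transactions of at most 10K entries, removed through a write batch, and the task checks for cancellation between batches. The function names, the prefix/expiration parameters, and the import path are illustrative assumptions rather than the actual Pyroscope code.

```go
// Illustrative sketch only; names and the isExpired predicate are assumptions,
// not the actual Pyroscope implementation.
package exemplars

import (
	"context"

	"github.com/dgraph-io/badger/v2" // import path depends on the Badger major version in use
)

const batchSize = 10_000 // many small batches instead of one huge transaction

func removeExpiredExemplars(ctx context.Context, db *badger.DB, prefix []byte, isExpired func(*badger.Item) bool) error {
	for {
		// Collect at most batchSize expired keys in a short read-only transaction.
		keys := make([][]byte, 0, batchSize)
		err := db.View(func(txn *badger.Txn) error {
			opts := badger.DefaultIteratorOptions
			opts.PrefetchValues = false // assume expiration can be decided from the key or item metadata
			it := txn.NewIterator(opts)
			defer it.Close()
			for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
				if isExpired(it.Item()) {
					keys = append(keys, it.Item().KeyCopy(nil))
					if len(keys) == batchSize {
						return nil
					}
				}
			}
			return nil
		})
		if err != nil {
			return err
		}
		if len(keys) == 0 {
			return nil // nothing left to remove
		}

		// Delete the collected keys in a single write batch.
		wb := db.NewWriteBatch()
		for _, k := range keys {
			if err := wb.Delete(k); err != nil {
				wb.Cancel()
				return err
			}
		}
		if err := wb.Flush(); err != nil {
			return err
		}

		// Graceful cancellation between batches.
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
	}
}
```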

Load testing shows that the Pyroscope server is only able to handle thousands of exemplars per second, which might be insufficient. The throughput can be increased significantly if we tune the internal storage queue: we may either implement some sort of throttling or allow users to configure the queue size explicitly (it is currently hardcoded to 100). As a side effect, the server will reject less data, creating more pressure on the storage subsystem (retention task + GC).
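
For illustration, a bounded ingestion queue with an explicit size and a discard counter could look roughly like the following; the queue type and function names are assumptions, while pyroscope_storage_discarded_total matches the metric added in this PR.

```go
// Illustrative sketch of a configurable, non-blocking storage queue.
package ingestion

import "github.com/prometheus/client_golang/prometheus"

type queue struct {
	jobs      chan func()
	discarded prometheus.Counter
}

func newQueue(size int, reg prometheus.Registerer) *queue {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "pyroscope_storage_discarded_total",
		Help: "Profiles rejected due to storage queue overflow.",
	})
	reg.MustRegister(c)
	return &queue{jobs: make(chan func(), size), discarded: c}
}

// enqueue never blocks the ingestion path: when the queue is full,
// the profile is dropped and the discard counter is incremented.
func (q *queue) enqueue(job func()) bool {
	select {
	case q.jobs <- job:
		return true
	default:
		q.discarded.Inc()
		return false
	}
}
```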

The picture below shows the results of synthetic tests (~300K exemplars, pull mode) in Diff View, before and after:
[image: Diff View before and after]
Note: stacks are marked as added simply because their proportion of the total has increased, not their absolute values. I think it would be more useful if we either normalised the values or let users choose.


However, the cost of deleting obsolete data (outside the retention period) is still a concern, as is BadgerDB GC, which causes a noticeable spike in I/O and CPU consumption. Even after the changes from this PR, GC will consume resources comparable to those required to ingest the data: the underlying WiscKey LSM is optimised for writes (like an append-only log), and a deletion is treated as a write operation. Furthermore, because of how BadgerDB GC is implemented, in practice it may be even more costly: when a value log is rewritten, data locality plays an important role. Given that the profile_id (basically a random span ID string) is part of the key, we have pretty poor data locality. As a result, at some point GC may need to rewrite the whole database (all of its value log files).

For example, the profile below depicts the CPU time consumed by a GC job that "compacted" the exemplars DB from 1GB to ~500MB:
[image: CPU profile of the GC job]

A subtle improvement (apart from the beneficial effect of write batches) was achieved by forcing BadgerDB compaction before starting to remove data; according to the tests, this decreased CPU consumption by 15%-20%.
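
A minimal sketch of that ordering, assuming Badger's Flatten and RunValueLogGC APIs; the worker count and discard ratio below are illustrative values, not the ones used in the PR.

```go
// Minimal sketch: compact the LSM before the removal pass, then reclaim
// value-log space. Error handling and tuning values are illustrative.
package exemplars

import (
	"context"

	"github.com/dgraph-io/badger/v2" // import path depends on the Badger major version in use
)

func compactThenRemove(ctx context.Context, db *badger.DB, remove func(context.Context) error) error {
	if err := db.Flatten(2); err != nil { // merge LSM levels before scanning
		return err
	}
	if err := remove(ctx); err != nil {
		return err
	}
	for {
		// ErrNoRewrite means no value log file was eligible for rewriting.
		if err := db.RunValueLogGC(0.5); err != nil {
			if err == badger.ErrNoRewrite {
				return nil
			}
			return err
		}
	}
}
```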


Future work:

  1. Experiment with ULID as the profile_id (instead of span IDs). This should drastically improve data locality (see the sketch after this list).
  2. Consider time-based partitioning: instead of using a single DB for exemplars, we may have a series of smaller databases, each storing data for a limited period of time (an hour or so). BadgerDB incurs considerable memory overhead (defaults to ~100MB per DB), which may become a serious problem in large deployments.
  3. Consider alternative storage engines. BadgerDB is very good in many respects, but in our use case (exemplar storage) it might not be the best option.
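
A hypothetical illustration of item 1: time-ordered ULIDs make keys written around the same time sort next to each other, which should help value-log rewrites touch fewer files (the helper below is made up for the example).

```go
// Hypothetical example only: ULID-based profile IDs instead of random span IDs.
package exemplars

import (
	"crypto/rand"
	"time"

	"github.com/oklog/ulid/v2"
)

// Note: a MonotonicEntropy source is not safe for concurrent use;
// a real implementation would guard it or use per-goroutine sources.
var entropy = ulid.Monotonic(rand.Reader, 0)

// newProfileID returns an ID whose lexicographic order follows creation time,
// so keys derived from it cluster by time instead of being scattered randomly.
func newProfileID(t time.Time) string {
	return ulid.MustNew(ulid.Timestamp(t), entropy).String()
}
```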

@kolesnikovae kolesnikovae added the storage (Low level storage matters) and performance (If there's anything we have to be really good at it's this) labels on Apr 8, 2022

codecov bot commented Apr 8, 2022

Codecov Report

Merging #1018 (532290a) into main (10baacd) will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   70.11%   70.13%   +0.03%     
==========================================
  Files          84       83       -1     
  Lines        2940     2926      -14     
  Branches      605      601       -4     
==========================================
- Hits         2061     2052       -9     
+ Misses        876      871       -5     
  Partials        3        3              
Impacted Files Coverage Δ
webapp/javascript/ui/Sidebar.tsx 92.50% <0.00%> (-0.52%) ⬇️
packages/pyroscope-flamegraph/src/index.tsx 100.00% <0.00%> (ø)
webapp/javascript/ui/Sidebar.module.css


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data


github-actions bot commented Apr 8, 2022

size-limit report 📦

Path Size Loading time (3g) Running time (snapdragon) Total time
webapp/public/assets/app.js 496.62 KB (0%) 10 s (0%) 2.4 s (+1.68% 🔺) 12.3 s


pyroscopebot commented Apr 8, 2022

Screenshots

Throughput
Disk Usage
Memory
Upload Errors (Total)
Successful Uploads (Total)
CPU Utilization

Generated by 🚫 dangerJS against e9e63a9

@kolesnikovae kolesnikovae force-pushed the fix/retention-slowdown branch from 6479953 to b9fb64e Compare April 12, 2022 14:05
@kolesnikovae kolesnikovae marked this pull request as ready for review April 12, 2022 15:23
@kolesnikovae kolesnikovae force-pushed the fix/retention-slowdown branch from b9fb64e to 4e8cfd8 Compare April 12, 2022 15:28
@kolesnikovae kolesnikovae marked this pull request as draft April 12, 2022 20:58
@kolesnikovae kolesnikovae marked this pull request as ready for review April 13, 2022 16:02
@petethepig petethepig merged commit 8e7e596 into main Apr 14, 2022
@petethepig petethepig deleted the fix/retention-slowdown branch April 14, 2022 04:18
thisisobate pushed a commit that referenced this pull request Apr 16, 2022
* Revert "fix: allow cache eviction and write-back while purging storage (#962)"

This reverts commit cad1afc.

* fix(exemplars): optimise exemplars removal

* perf(exemplars): batch insert

* perf(ingestion): adjust storage queue size

* perf(ingestion): make exemplars to handle batches queue during data removal

* fix tests