perf(retention): improve performance of exemplars removal #1018

Merged
merged 6 commits into main from fix/retention-slowdown on Apr 14, 2022

Conversation

@kolesnikovae kolesnikovae commented Apr 8, 2022

We observe a significant slowdown when a retention policy is configured and a server with historical data (presumably) runs it for the first time. This has a large impact on overall performance, manifesting in data losses and an inability to query data.

  • Partial revert of "fix: allow cache eviction while purging storage" (#962): BadgerDB GC and retention tasks will no longer run concurrently.
  • Refactored the part responsible for exemplars removal: the large read-only transaction was broken into many smaller ones, and the batch size was decreased to 10K (previously, it was set to the maximum allowed, i.e. millions). See the sketch after this list.
  • Now, exemplars are written to DB in batches.
  • Added graceful cancelation for exemplars removal.
  • Exemplars retention period now defaults to the absolute retention period if not defined.
  • Added metrics to track exemplars:
    • pyroscope_storage_exemplars_read_bytes
    • pyroscope_storage_exemplars_write_bytes
    • pyroscope_storage_exemplars_removed_total
    • pyroscope_storage_exemplars_retention_task_duration_seconds
  • Added a metric that indicates data losses (profiles rejected due to internal storage queue overflow):
    • pyroscope_storage_discarded_total
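
The sketch below illustrates the batching approach in Go against BadgerDB's public API: expired keys are collected in short read-only transactions of at most 10K entries, removed through a write batch, and the task checks for cancellation between batches. The function names, the prefix/expiration parameters, and the import path are illustrative assumptions rather than the actual Pyroscope code.

```go
// Illustrative sketch only; names and the isExpired predicate are assumptions,
// not the actual Pyroscope implementation.
package exemplars

import (
	"context"

	"github.com/dgraph-io/badger/v2" // import path depends on the Badger major version in use
)

const batchSize = 10_000 // many small batches instead of one huge transaction

func removeExpiredExemplars(ctx context.Context, db *badger.DB, prefix []byte, isExpired func(*badger.Item) bool) error {
	for {
		// Collect at most batchSize expired keys in a short read-only transaction.
		keys := make([][]byte, 0, batchSize)
		err := db.View(func(txn *badger.Txn) error {
			opts := badger.DefaultIteratorOptions
			opts.PrefetchValues = false // assume expiration can be decided from the key or item metadata
			it := txn.NewIterator(opts)
			defer it.Close()
			for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
				if isExpired(it.Item()) {
					keys = append(keys, it.Item().KeyCopy(nil))
					if len(keys) == batchSize {
						return nil
					}
				}
			}
			return nil
		})
		if err != nil {
			return err
		}
		if len(keys) == 0 {
			return nil // nothing left to remove
		}

		// Delete the collected keys in a single write batch.
		wb := db.NewWriteBatch()
		for _, k := range keys {
			if err := wb.Delete(k); err != nil {
				wb.Cancel()
				return err
			}
		}
		if err := wb.Flush(); err != nil {
			return err
		}

		// Graceful cancellation between batches.
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
	}
}
```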

Load testing shows that the Pyroscope server is only able to handle thousands of exemplars per second, which might be insufficient. The throughput can be increased significantly if we tune the internal storage queue: we may either implement some sort of throttling or allow users to configure the queue size explicitly (it is currently hardcoded to 100). As a side effect, the server will reject less data, creating more pressure on the storage subsystem (retention task + GC).
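
For illustration, a bounded ingestion queue with an explicit size and a discard counter could look roughly like the following; the queue type and function names are assumptions, while pyroscope_storage_discarded_total matches the metric added in this PR.

```go
// Illustrative sketch of a configurable, non-blocking storage queue.
package ingestion

import "github.com/prometheus/client_golang/prometheus"

type queue struct {
	jobs      chan func()
	discarded prometheus.Counter
}

func newQueue(size int, reg prometheus.Registerer) *queue {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "pyroscope_storage_discarded_total",
		Help: "Profiles rejected due to storage queue overflow.",
	})
	reg.MustRegister(c)
	return &queue{jobs: make(chan func(), size), discarded: c}
}

// enqueue never blocks the ingestion path: when the queue is full,
// the profile is dropped and the discard counter is incremented.
func (q *queue) enqueue(job func()) bool {
	select {
	case q.jobs <- job:
		return true
	default:
		q.discarded.Inc()
		return false
	}
}
```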

The picture below shows the results of synthetic tests (~300K exemplars, pull mode) in Diff View, before and after:
[image: Diff View before and after]
Note: stacks are marked as added simply because their proportion of the total has increased, not their absolute values. I think it would be more useful if we either normalised the values or let users choose.


However, the cost of deleting obsolete data (outside the retention period) is still a concern, as is BadgerDB GC, which causes a noticeable spike in I/O and CPU consumption. Even after the changes from this PR, GC will consume resources comparable to those required to ingest the data: the underlying WiscKey LSM is optimised for writes (like an append-only log), and a deletion is treated as a write operation. Furthermore, because of how BadgerDB GC is implemented, in practice it may be even more costly: when a value log is rewritten, data locality plays an important role. Given that the profile_id (basically a random span ID string) is part of the key, we have pretty poor data locality. As a result, at some point GC may need to rewrite the whole database (all of its value log files).

For example, the profile below depicts the CPU time consumed by a GC job that "compacted" the exemplars DB from 1GB to ~500MB:
[image: CPU profile of the GC job]

A subtle improvement (apart from the beneficial effect of write batches) was achieved by forcing BadgerDB compaction before starting to remove data; according to the tests, this decreased CPU consumption by 15%-20%.
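
A minimal sketch of that ordering, assuming Badger's Flatten and RunValueLogGC APIs; the worker count and discard ratio below are illustrative values, not the ones used in the PR.

```go
// Minimal sketch: compact the LSM before the removal pass, then reclaim
// value-log space. Error handling and tuning values are illustrative.
package exemplars

import (
	"context"

	"github.com/dgraph-io/badger/v2" // import path depends on the Badger major version in use
)

func compactThenRemove(ctx context.Context, db *badger.DB, remove func(context.Context) error) error {
	if err := db.Flatten(2); err != nil { // merge LSM levels before scanning
		return err
	}
	if err := remove(ctx); err != nil {
		return err
	}
	for {
		// ErrNoRewrite means no value log file was eligible for rewriting.
		if err := db.RunValueLogGC(0.5); err != nil {
			if err == badger.ErrNoRewrite {
				return nil
			}
			return err
		}
	}
}
```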


Future work:

  1. Experiment with ULID as the profile_id (instead of span IDs). This should drastically improve data locality (see the sketch after this list).
  2. Consider time-based partitioning: instead of using a single DB for exemplars, we may have a series of smaller databases, each storing data for a limited period of time (an hour or so). BadgerDB incurs considerable memory overhead (defaults to ~100MB per DB), which may become a serious problem in large deployments.
  3. Consider alternative storage engines. BadgerDB is very good in many respects, but in our use case (exemplar storage) it might not be the best option.
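
A hypothetical illustration of item 1: time-ordered ULIDs make keys written around the same time sort next to each other, which should help value-log rewrites touch fewer files (the helper below is made up for the example).

```go
// Hypothetical example only: ULID-based profile IDs instead of random span IDs.
package exemplars

import (
	"crypto/rand"
	"time"

	"github.com/oklog/ulid/v2"
)

// Note: a MonotonicEntropy source is not safe for concurrent use;
// a real implementation would guard it or use per-goroutine sources.
var entropy = ulid.Monotonic(rand.Reader, 0)

// newProfileID returns an ID whose lexicographic order follows creation time,
// so keys derived from it cluster by time instead of being scattered randomly.
func newProfileID(t time.Time) string {
	return ulid.MustNew(ulid.Timestamp(t), entropy).String()
}
```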

@kolesnikovae kolesnikovae added the storage (Low level storage matters) and performance (If there's anything we have to be really good at it's this) labels on Apr 8, 2022

codecov bot commented Apr 8, 2022

Codecov Report

Merging #1018 (532290a) into main (10baacd) will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   70.11%   70.13%   +0.03%     
==========================================
  Files          84       83       -1     
  Lines        2940     2926      -14     
  Branches      605      601       -4     
==========================================
- Hits         2061     2052       -9     
+ Misses        876      871       -5     
  Partials        3        3              
Impacted Files Coverage Δ
webapp/javascript/ui/Sidebar.tsx 92.50% <0.00%> (-0.52%) ⬇️
packages/pyroscope-flamegraph/src/index.tsx 100.00% <0.00%> (ø)
webapp/javascript/ui/Sidebar.module.css


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data


github-actions bot commented Apr 8, 2022

size-limit report 📦

Path Size Loading time (3g) Running time (snapdragon) Total time
webapp/public/assets/app.js 496.62 KB (0%) 10 s (0%) 2.4 s (+1.68% 🔺) 12.3 s


pyroscopebot commented Apr 8, 2022

Screenshots

Throughput
Disk Usage
Memory
Upload Errors (Total)
Successful Uploads (Total)
CPU Utilization

Generated by 🚫 dangerJS against e9e63a9

@kolesnikovae kolesnikovae force-pushed the fix/retention-slowdown branch from 6479953 to b9fb64e Compare April 12, 2022 14:05
@kolesnikovae kolesnikovae marked this pull request as ready for review April 12, 2022 15:23
@kolesnikovae kolesnikovae force-pushed the fix/retention-slowdown branch from b9fb64e to 4e8cfd8 Compare April 12, 2022 15:28
@kolesnikovae kolesnikovae marked this pull request as draft April 12, 2022 20:58
@kolesnikovae kolesnikovae marked this pull request as ready for review April 13, 2022 16:02
@petethepig petethepig merged commit 8e7e596 into main Apr 14, 2022
@petethepig petethepig deleted the fix/retention-slowdown branch April 14, 2022 04:18
thisisobate pushed a commit that referenced this pull request Apr 16, 2022
* Revert "fix: allow cache eviction and write-back while purging storage (#962)"

This reverts commit cad1afc.

* fix(exemplars): optimise exemplars removal

* perf(exemplars): batch insert

* perf(ingestion): adjust storage queue size

* perf(ingestion): make exemplars to handle batches queue during data removal

* fix tests