Initial pipelines docs #18595

Merged: 116 commits, Apr 9, 2025
Changes from 92 commits

Commits:
01564e4
Initial pipelines docs
maheshwarip Dec 5, 2024
e0cc62f
Fixed wrangler commands
maheshwarip Dec 5, 2024
1dfcc0d
Improved pipelines index page
maheshwarip Dec 6, 2024
4ab692a
Fixed broken links
maheshwarip Dec 6, 2024
4afaa0e
Fixed typos
maheshwarip Dec 6, 2024
72381fc
improved worker binding documentation
maheshwarip Dec 6, 2024
112140e
Fixed broken links
maheshwarip Dec 6, 2024
51ce611
PIPE-155 Add prefix to Pipelines naming (#18656)
oliy Dec 10, 2024
dedafff
Modified worker binding docs
maheshwarip Dec 17, 2024
8ffc4bc
added local development notes
maheshwarip Dec 17, 2024
9f6d180
Renamed worker bindings to .mdx
maheshwarip Dec 17, 2024
31bdfe1
Fixed filenames and broken comments
maheshwarip Dec 17, 2024
b4fcf86
updated dates
maheshwarip Dec 17, 2024
69d9dd5
Fixed render issues and page titles
maheshwarip Dec 17, 2024
59a1902
Updated prefix instructions
maheshwarip Jan 7, 2025
2631676
Updated heading
maheshwarip Jan 7, 2025
4511e58
Updated docs for Oauth flow
maheshwarip Jan 13, 2025
4c940f7
Added documentation about batch hints
maheshwarip Jan 27, 2025
5184319
Added R2 as a destination
maheshwarip Feb 3, 2025
38cb4b4
updated ordering
maheshwarip Feb 3, 2025
ff67c0e
Updated performance characteristics
maheshwarip Feb 3, 2025
a8392d1
updated ordering
maheshwarip Feb 3, 2025
8cd419a
Updated docs for new wrangler commands
maheshwarip Feb 12, 2025
f231b40
updated batch settings
maheshwarip Feb 12, 2025
ce0c086
Added CORS documentation
maheshwarip Feb 12, 2025
f73c367
Removed pricing doc
maheshwarip Feb 12, 2025
1ace647
updated cors
maheshwarip Feb 12, 2025
9166d38
Add tutorial (#20025)
harshil1712 Feb 27, 2025
7784a6a
initial transformations documentation
maheshwarip Mar 14, 2025
8b37f29
Added docs for env and ctx
maheshwarip Mar 17, 2025
45175ca
Fixed typo
maheshwarip Mar 17, 2025
fef991b
Add tutorial (#20476)
harshil1712 Mar 19, 2025
436084b
Fixed typos
maheshwarip Mar 20, 2025
420eae7
Updated import
maheshwarip Mar 20, 2025
80a4d85
Removed changelog entry for now
maheshwarip Mar 21, 2025
eeea924
removed changelog reference
maheshwarip Mar 21, 2025
ca49906
fixed imports
maheshwarip Mar 21, 2025
80f0ae4
Fixed broken link
maheshwarip Mar 21, 2025
20f8814
removed transformations
maheshwarip Mar 28, 2025
bf20e02
removed broken links
maheshwarip Mar 28, 2025
4f74185
Updated text
maheshwarip Mar 28, 2025
85fa5bd
Added javascript API
maheshwarip Apr 1, 2025
b9cdb1a
Removed old file
maheshwarip Apr 1, 2025
8781ded
fixed broken links
maheshwarip Apr 1, 2025
4df10ad
Fixed typo
maheshwarip Apr 1, 2025
85af8d4
Updated guides
maheshwarip Apr 1, 2025
dc61631
Re-organized docs
maheshwarip Apr 1, 2025
cff26ee
Improved getting started guide
maheshwarip Apr 2, 2025
924796d
Improved output settings
maheshwarip Apr 2, 2025
33137ac
Fixed broken links
maheshwarip Apr 2, 2025
152dd1c
Added more content about how pipelines works
maheshwarip Apr 2, 2025
fe1aefa
Added documentation for how pipelines work
maheshwarip Apr 3, 2025
09646a4
Improvements to overview
maheshwarip Apr 3, 2025
2fcd519
Improved getting started guide
maheshwarip Apr 3, 2025
09b61c2
Improved http endpoint docs
maheshwarip Apr 3, 2025
be960d7
Improved http options
maheshwarip Apr 3, 2025
f237dea
Improved output settings
maheshwarip Apr 3, 2025
41bbabf
Updated logo and text
maheshwarip Apr 3, 2025
a95fb41
Updated architecture
maheshwarip Apr 3, 2025
f04c9eb
Update how-pipelines-work.mdx
maheshwarip Apr 3, 2025
e1a8c28
Update how-pipelines-work.mdx
maheshwarip Apr 3, 2025
3af05d6
Fixed broken links
maheshwarip Apr 3, 2025
ed87e98
Updated tagline
maheshwarip Apr 3, 2025
c87125d
Updated text
maheshwarip Apr 3, 2025
a9196f6
Small changes
maheshwarip Apr 3, 2025
2c5d9f8
Changed phrasing
maheshwarip Apr 3, 2025
8b41ff4
improved shards documentation
maheshwarip Apr 4, 2025
88f5c53
Updated shard count documentation
maheshwarip Apr 4, 2025
38a5bc1
Updated wrangler commands
maheshwarip Apr 4, 2025
047106b
Updated metrics documentation
maheshwarip Apr 4, 2025
28363ce
Updated docs
maheshwarip Apr 4, 2025
759a8b0
Fixed typos
maheshwarip Apr 6, 2025
4f2ff46
Updated tutorial commands
maheshwarip Apr 6, 2025
36c5679
Improved tutorial text
maheshwarip Apr 6, 2025
31e9b75
Fixed typo
maheshwarip Apr 6, 2025
16222db
Improved text
maheshwarip Apr 6, 2025
d884e01
Fixed broken links
maheshwarip Apr 6, 2025
9a3da5d
fixed limits typo
maheshwarip Apr 6, 2025
73d7457
small improvements
maheshwarip Apr 6, 2025
5a60392
Improved data lake tutorial
maheshwarip Apr 6, 2025
db61fa0
Merge branch 'production' into pipelines-docs
maheshwarip Apr 6, 2025
2e3bd0a
Added changelog entry
maheshwarip Apr 6, 2025
45b0c23
Small updates
maheshwarip Apr 6, 2025
900ba16
Improved grammar
maheshwarip Apr 6, 2025
a203abc
Small changes to wrangler commands
maheshwarip Apr 6, 2025
f257f1e
Fixed typos
maheshwarip Apr 7, 2025
85a14a5
Improved wrangler configuration
maheshwarip Apr 7, 2025
5ee167f
Updated changelog entry
maheshwarip Apr 7, 2025
658dd38
Fixed changelog typo
maheshwarip Apr 7, 2025
686d709
updated changelog
maheshwarip Apr 7, 2025
d6524f2
updated link
maheshwarip Apr 7, 2025
447a1bd
Merge branch 'production' into pipelines-docs
kodster28 Apr 7, 2025
7dcde9e
Update src/content/docs/pipelines/getting-started.mdx
maheshwarip Apr 7, 2025
ae072d8
Apply suggestions from code review
maheshwarip Apr 7, 2025
591e655
Improved language
maheshwarip Apr 7, 2025
1327639
Update 2025-04-10-launching-pipelines.mdx
maheshwarip Apr 8, 2025
47ec978
Update http.mdx
maheshwarip Apr 8, 2025
525cc91
Update index.mdx
maheshwarip Apr 8, 2025
e1a1970
Update index.mdx
maheshwarip Apr 8, 2025
b452a6d
Update 2025-04-10-launching-pipelines.mdx
maheshwarip Apr 8, 2025
f75a3ec
Update how-pipelines-work.mdx
maheshwarip Apr 8, 2025
5bdc9f8
Clarified shard count
maheshwarip Apr 8, 2025
da23c75
Reorganized sources
maheshwarip Apr 8, 2025
c54275c
improved language
maheshwarip Apr 8, 2025
bdb1652
improved grammar
maheshwarip Apr 8, 2025
235c256
Updated limits
maheshwarip Apr 8, 2025
af6301c
Updated casing
maheshwarip Apr 8, 2025
00b8105
Fixed typo
maheshwarip Apr 8, 2025
e07dd28
Improved pricing docs
maheshwarip Apr 8, 2025
48b564d
Improved getting started guide
maheshwarip Apr 8, 2025
b2aa8bc
Improved text
maheshwarip Apr 8, 2025
37feeed
Update output-settings.mdx
maheshwarip Apr 9, 2025
7f80c98
Update output-settings.mdx
maheshwarip Apr 9, 2025
3b26f53
Update pipelines.mdx
maheshwarip Apr 9, 2025
6f7193a
Update pipelines.mdx
maheshwarip Apr 9, 2025
f66a13b
minor fixes to tutorials (#21552)
harshil1712 Apr 9, 2025
Binary file added src/assets/images/pipelines/architecture.png
Binary file added src/assets/images/pipelines/shards.png
@@ -0,0 +1,60 @@
---
title: Cloudflare Pipelines now available in beta
description: Use Cloudflare Pipelines to ingest real-time data streams and load them into R2.
products:
- pipelines
- r2
- workers
date: 2025-04-10 12:00:00 UTC
hidden: true
---

Cloudflare Pipelines is now available in beta to all users with a Workers Paid plan.

Pipelines let you ingest high volumes of real-time data without managing any infrastructure. A single pipeline can ingest up to 100 MB of data per second, via HTTP or from a [Worker](/workers). Ingested data is automatically batched, written to output files, and delivered to an [R2 bucket](/r2) in your account. You can use Pipelines to build a data lake of clickstream data, or to store events from a Worker.

Create your first pipeline with a single command:

```bash title="Create a pipeline"
$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket

🌀 Authorizing R2 bucket "my-bucket"
🌀 Creating pipeline named "my-clickstream-pipeline"
✅ Successfully created pipeline my-clickstream-pipeline

Id: 0e00c5ff09b34d018152af98d06f5a1xvc
Name: my-clickstream-pipeline
Sources:
HTTP:
Endpoint: https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/
Authentication: off
Format: JSON
Worker:
Format: JSON
Destination:
Type: R2
Bucket: my-bucket
Format: newline-delimited JSON
Compression: GZIP
Batch hints:
Max bytes: 100 MB
Max duration: 300 seconds
Max records: 100,000

🎉 You can now send data to your Pipeline!

Send data to your Pipeline's HTTP endpoint:
curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]'

To send data to your Pipeline from a Worker, add the following configuration to your config file:
{
"pipelines": [
{
"pipeline": "my-clickstream-pipeline",
"binding": "PIPELINE"
}
]
}
```
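
Once the binding is configured, a Worker can send events to the pipeline with a single call. A minimal sketch, assuming the binding exposes a `send()` method that accepts an array of JSON-serializable records:

```ts
// Worker using the PIPELINE binding from the config above.
// The send() signature shown here is assumed for illustration.
interface Env {
	PIPELINE: { send(records: object[]): Promise<void> };
}

export default {
	async fetch(request: Request, env: Env): Promise<Response> {
		// Each object in the array is ingested as one record.
		await env.PIPELINE.send([{ url: request.url, ts: Date.now() }]);
		return new Response("ok");
	},
};
```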

Head over to our [getting started guide](/pipelines/getting-started) for an in-depth tutorial on building with Pipelines.
109 changes: 109 additions & 0 deletions src/content/docs/pipelines/build-with-pipelines/http.mdx
@@ -0,0 +1,109 @@
---
title: Configure HTTP Endpoint
pcx_content_type: concept
sidebar:
order: 1
head:
- tag: title
content: Configure HTTP Endpoint
---

import { Render, PackageManagers } from "~/components";

Pipelines support data ingestion over HTTP. When you create a new pipeline, you'll receive a globally scalable ingestion endpoint. To ingest data, make HTTP POST requests to the endpoint.

```sh
$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket

🌀 Authorizing R2 bucket "my-bucket"
🌀 Creating pipeline named "my-clickstream-pipeline"
✅ Successfully created pipeline my-clickstream-pipeline

Id: 0e00c5ff09b34d018152af98d06f5a1xvc
Name: my-clickstream-pipeline
Sources:
HTTP:
Endpoint: https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/
Authentication: off
Format: JSON
Worker:
Format: JSON
Destination:
Type: R2
Bucket: my-bucket
Format: newline-delimited JSON
Compression: GZIP
Batch hints:
Max bytes: 100 MB
Max duration: 300 seconds
Max records: 100,000

🎉 You can now send data to your Pipeline!

Send data to your Pipeline's HTTP endpoint:
curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]'
```

## Accepted data formats
Pipelines accept arrays of valid JSON objects. You can send multiple objects in a single request, provided the total data volume is within the [documented limits](/pipelines/platform/limits). Sending data in a different format will result in an error.

For example, you can send data to your pipeline using a curl command like this:
```sh
curl -X POST https://<PIPELINE-ID>.pipelines.cloudflare.com \
-H "Content-Type: application/json" \
-d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]'

{"success":true,"result":{"committed":3}}
```

## Turning HTTP ingestion off
By default, ingestion via HTTP is turned on. You can turn it off by excluding HTTP from the list of sources, using the `--sources` flag when creating or updating a pipeline.

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --sources worker
```

Ingestion URLs are tied to your pipeline ID. Turning HTTP off, and then turning it back on, will not change the URL.

## Authentication
You can secure your HTTP ingestion endpoint using Cloudflare API tokens. By default, authentication is turned off. To configure authentication, use the `--require-http-auth` flag while creating or updating a pipeline.

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --require-http-auth true
```

Once authentication is turned on, you will need to include a Cloudflare API token in your request headers.

### Get API token
1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com) and select your account.
2. Navigate to the [API Tokens](https://dash.cloudflare.com/profile/api-tokens) page.
3. Select *Create Token*.
4. Choose the template for Workers Pipelines. Select *Continue to summary*, and then *Create Token*. Make sure to copy the API token and save it securely.

### Making authenticated requests
Include the API token you created in the previous step in the headers for your request:

```sh
curl https://<PIPELINE-ID>.pipelines.cloudflare.com \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${API_TOKEN}" \
-d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]'
```
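
The same request can be made from TypeScript. A minimal sketch, where `<PIPELINE-ID>` remains a placeholder and `API_TOKEN` is assumed to be supplied by your environment:

```ts
// Authenticated ingestion request; API_TOKEN holds the token created above.
declare const API_TOKEN: string;

const res = await fetch("https://<PIPELINE-ID>.pipelines.cloudflare.com", {
	method: "POST",
	headers: {
		"Content-Type": "application/json",
		Authorization: `Bearer ${API_TOKEN}`,
	},
	body: JSON.stringify([{ foo: "bar" }, { foo: "bar" }, { foo: "bar" }]),
});
if (!res.ok) throw new Error(`Ingestion failed with status ${res.status}`);
```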

## Specifying CORS settings
If you want to use your pipeline to ingest client-side data, such as website clicks, you'll need to configure your [Cross-Origin Resource Sharing (CORS) settings](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS).

Without CORS settings, browsers will restrict requests made to your pipeline endpoint. For example, if your website domain is `https://my-website.com`, and you want to post client-side data to your pipeline at `https://<PIPELINE-ID>.pipelines.cloudflare.com`, the request will fail without the correct CORS settings.

To fix this, you need to configure your pipeline to accept requests from `https://my-website.com`. You can do so while creating or updating a pipeline, using the flag `--cors-origins`. You can specify multiple domains separated by a space.

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins https://mydomain.com http://localhost:8787
```

You can specify that all cross-origin requests are accepted. We recommend only using this option in development, and not for production use cases.
```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins "*"
```

After `--cors-origins` has been set on your pipeline, it will respond to preflight requests and POST requests with the appropriate `Access-Control-Allow-Origin` headers set.
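
For example, once `https://my-website.com` has been added to your allowed origins, a page on that domain could post click events directly to your pipeline. A minimal sketch, with an illustrative event shape:

```ts
// Client-side snippet for a page on https://my-website.com, assuming that
// origin was passed to --cors-origins. <PIPELINE-ID> is a placeholder.
async function recordClick(event: MouseEvent): Promise<void> {
	await fetch("https://<PIPELINE-ID>.pipelines.cloudflare.com", {
		method: "POST",
		// The JSON content type triggers a CORS preflight, which the
		// pipeline answers with Access-Control-Allow-Origin headers.
		headers: { "Content-Type": "application/json" },
		body: JSON.stringify([
			{ type: "click", x: event.clientX, y: event.clientY, ts: Date.now() },
		]),
	});
}

document.addEventListener("click", (event) => void recordClick(event));
```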
8 changes: 8 additions & 0 deletions src/content/docs/pipelines/build-with-pipelines/index.mdx
@@ -0,0 +1,8 @@
---
title: Build with Pipelines
pcx_content_type: navigation
sidebar:
order: 3
group:
hideIndex: true
---
103 changes: 103 additions & 0 deletions src/content/docs/pipelines/build-with-pipelines/output-settings.mdx
@@ -0,0 +1,103 @@
---
title: Customize output settings
pcx_content_type: concept
sidebar:
order: 3
head:
- tag: title
content: Customize output settings
---

import { Render, PackageManagers } from "~/components";

Pipelines convert a stream of records into output files, and deliver the files to an R2 bucket in your account. This guide details how you can change the output destination, and how to customize batch settings to generate query-ready files.

## Configure an R2 bucket as a destination
To create or update a pipeline using Wrangler, run the following command in a terminal:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]
```

After running this command, you'll be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. Your pipeline uses the R2 API token to load data into your bucket. You can approve the request through the browser link which will open automatically.

If you prefer not to authenticate this way, you may pass your [R2 API Token](/r2/api/tokens/) to Wrangler:
```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY]
```

## File format and compression
Output files are generated as Newline Delimited JSON files (`ndjson`). Each line in an output file maps to a single record.
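
For example, an output file containing three records holds three lines, one JSON object per line (values are illustrative):

```txt
{"event":"pageview","url":"/home","ts":"2025-04-01T15:02:11Z"}
{"event":"click","url":"/pricing","ts":"2025-04-01T15:02:13Z"}
{"event":"pageview","url":"/docs","ts":"2025-04-01T15:02:20Z"}
```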

By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag:
```sh
npx wrangler pipelines update [PIPELINE-NAME] --compression none
```

Output files are named using a [ULID](https://github.com/ulid/spec) slug, followed by an extension.

## Customize batch behavior
When configuring your pipeline, you can define how records are batched before they are delivered to R2. Each batch of records is written to a single output file.

Batching can:
1. Reduce the number of output files written to R2, and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations).
2. Increase the size of output files, making them more efficient to query.

There are three ways to define how ingested data is batched:

1. `batch-max-mb`: The maximum amount of data that will be batched, in megabytes. Default is 10 MB, maximum is 100 MB.
2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default is `10,000` rows, maximum is `10,000,000` rows.
3. `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default is `15 seconds`, maximum is `300 seconds`.

Batch definitions are hints. A pipeline will follow these hints closely, but batches might not be exact.

All three batch definitions work together and whichever limit is reached first triggers the delivery of a batch.

For example, with `batch-max-mb` set to 100 MB and `batch-max-seconds` set to 100, a batch is delivered as soon as 100 MB of events have been posted to the pipeline. However, if it takes longer than 100 seconds for 100 MB of events to arrive, a batch of all the messages posted during those 100 seconds will be delivered instead.
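
In other words, whichever hint is crossed first wins. The flush decision can be sketched as follows (illustrative only, not the service's implementation):

```ts
// Illustrative only: a batch is flushed as soon as any one hint is crossed.
interface BatchHints {
	maxBytes: number; // from --batch-max-mb
	maxRows: number; // from --batch-max-rows
	maxSeconds: number; // from --batch-max-seconds
}

function shouldFlush(
	bytes: number,
	rows: number,
	ageSeconds: number,
	hints: BatchHints,
): boolean {
	return (
		bytes >= hints.maxBytes ||
		rows >= hints.maxRows ||
		ageSeconds >= hints.maxSeconds
	);
}
```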

### Defining batch settings using Wrangler
You can use the following batch settings flags while creating or updating a pipeline:
* `--batch-max-mb`
* `--batch-max-rows`
* `--batch-max-seconds`

For example:
```sh
npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300
```

### Batch size limits

| Setting | Default | Minimum | Maximum |
| ----------------------------------------- | ----------- | --------- | ----------- |
| Maximum Batch Size `batch-max-mb` | 10 MB | 0.001 MB | 100 MB |
| Maximum Batch Timeout `batch-max-seconds` | 15 seconds | 0 seconds | 300 seconds |
| Maximum Batch Rows `batch-max-rows` | 10,000 rows | 1 row | 10,000,000 rows |


## Deliver partitioned data
Partitioning organizes data into directories based on specific fields to improve query performance. Partitions reduce the amount of data scanned for queries, enabling faster reads.

:::note
By default, Pipelines partition data by event date and time. This will be customizable in the future.
:::

Output files are prefixed with event date and hour. For example, the output from a Pipeline in your R2 bucket might look like this:
```sh
- event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
- event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
```

## Deliver data to a prefix
You can specify an optional prefix for all the output files stored in your specified R2 bucket, using the flag `--r2-prefix`.

For example:
```sh
npx wrangler pipelines update [PIPELINE-NAME] --r2-prefix test
```

After running the above command, the output files generated by your pipeline will be stored under the prefix `test`. Files will remain partitioned. Your output will look like this:

```sh
- test/event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
- test/event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
```
56 changes: 56 additions & 0 deletions src/content/docs/pipelines/build-with-pipelines/shards.mdx
@@ -0,0 +1,56 @@
---
pcx_content_type: concept
title: Customize shard count
sidebar:
order: 11
---

import { Render, PackageManagers } from "~/components";

Shards affect a pipeline's throughput. Increasing the shard count for a pipeline increases the maximum throughput. By default, each pipeline is configured with two shards.

To set the shard count, use the `--shard-count` flag while creating or updating a pipeline:
```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --shard-count 10
```

:::note
The default shard count will be set to `auto` in the future, with support for automatic horizontal scaling.
:::

## How shards work
![Pipeline shards](~/assets/images/pipelines/shards.png)

Each pipeline is composed of stateless, independent shards. These shards are spun up when a pipeline is created. Each shard is composed of layers of [Durable Objects](/durable-objects). The Durable Objects buffer data, replicate it for durability, handle compression, and deliver output to R2.

When a record is sent to a pipeline:
1. The Pipelines [Worker](/workers) receives the record.
2. The record is routed to one of the shards.
3. The record is handled by a set of Durable Objects, which commit the record to storage and replicate for durability.
4. Records accumulate until the [batch definitions](/pipelines/build-with-pipelines/output-settings/#customize-batch-behavior) are met.
5. The batch is written to an output file and optionally compressed.
6. The output file is delivered to the configured R2 bucket.

Increasing the number of shards will increase the maximum throughput of a pipeline, as well as the number of output files created.

### Example
Your workload might require making 5,000 requests per second to a pipeline. If you create a pipeline with a single shard, all 5,000 requests will be routed to the same shard. If your pipeline has been configured with a maximum batch duration of 1 second, all 5,000 requests will be batched, and a single file will be delivered, every second.

Increasing the shard count to 2 will double the number of output files. The 5,000 requests will be split, with 2,500 requests routed to each shard. Every second, each shard will create a batch of data and deliver it to R2.

## Maximum shard count considerations

Increasing the shard count also increases the number of output files that your pipeline generates. This in turn increases the [cost of writing data to R2](/r2/pricing/#class-a-operations), as each file written to R2 counts as a single class A operation. Additionally, smaller files are slower, and more expensive, to query. Rather than setting the maximum, choose a shard count based on your workload needs.

## Choose the number of shards

Choose a shard count based on these factors:
* The number of requests per second you will make to your pipeline
* The amount of data per second you will send to your pipeline

Each shard is capable of handling approximately 7,000 requests per second, or ingesting 7 MB/s of data. Either factor might act as the bottleneck, so choose the shard count based on the higher number.

For example, if you estimate that you will ingest 70 MB/s while making 70,000 requests per second, set up a pipeline with 10 shards. However, if you estimate that you will ingest 70 MB/s while making 100,000 requests per second, set up a pipeline with 15 shards.

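A back-of-the-envelope sketch of this sizing rule (illustrative only, using the approximate per-shard figures above; the 15-shard cap comes from the limits below):

```ts
// Approximate per-shard capacity from the guidance above.
const REQUESTS_PER_SHARD = 7_000; // requests per second
const MB_PER_SHARD = 7; // MB per second

function suggestedShardCount(requestsPerSecond: number, mbPerSecond: number): number {
	const byRequests = requestsPerSecond / REQUESTS_PER_SHARD;
	const byVolume = mbPerSecond / MB_PER_SHARD;
	// Size for the tighter bottleneck; at least 1, at most 15 per the limits below.
	return Math.min(15, Math.max(1, Math.ceil(Math.max(byRequests, byVolume))));
}

console.log(suggestedShardCount(70_000, 70)); // => 10
console.log(suggestedShardCount(100_000, 70)); // => 15
```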

## Limits
| Setting | Default | Minimum | Maximum |
| ----------------------------------------- | ----------- | --------- | ----------- |
| Shards per pipeline `shard-count` | 2 | 1 | 15 |