
Commit dfaec48

maheshwaripoliy, harshil1712, kodster28, and dcpena authored and committed
Initial pipelines docs (#18595)
* Initial pipelines docs * Fixed wrangler commands * Improved pipelines index page * Fixed broken links * Fixed typos * improved worker binding documentation * Fixed broken links * PIPE-155 Add prefix to Pipelines naming (#18656) * Modified worker binding docs * added local development notes * Renamed worker bindings to .mdx * Fixed filenames and broken comments * updated dates * Fixed render issues and page titles * Updated prefix instructions * Updated heading * Updated docs for Oauth flow * Added documentation about batch hints * Added R2 as a destination * updated ordering * Updated performance characteristics * updated ordering * Updated docs for new wrangler commands * updated batch settings * Added CORS documentation * Removed pricing doc * updated cors * Add tutorial (#20025) * adds pipelines tutorial * minor fixes * initial transformations documentation * Added docs for env and ctx * Fixed typo * Add tutorial (#20476) * add pipelines client-side tutorial * minor updates * Fixed typos * Updated import * Removed changelog entry for now * removed changelog reference * fixed imports * Fixed broken link * removed transformations * removed broken links * Updated text * Added javascript API * Removed old file * fixed broken links * Fixed typo * Updated guides * Re-organized docs * Improved getting started guide * Improved output settings * Fixed broken links * Added more content about how pipelines works * Added documentation for how pipelines work * Improvements to overview * Improved getting started guide * Improved http endpoint docs * Improved http options * Improved output settings * Updated logo and text * Updated architecture * Update how-pipelines-work.mdx * Update how-pipelines-work.mdx * Fixed broken links * Updated tagline * Updated text * Small changes * Changed phrasing * improved shards documentation * Updated shard count documentation * Updated wrangler commands * Updated metrics documentation * Updated docs * Fixed typos * Updated tutorial commands * Improved tutorial text * Fixed typo * Improved text * Fixed broken links * fixed limits typo * small improvements * Improved data lake tutorial * Added changelog entry * Small updates * Improved grammar * Small changes to wrangler commands * Fixed typos * Improved wrangler configuration * Updated changelog entry * Fixed changelog typo * updated changelog * updated link * Update src/content/docs/pipelines/getting-started.mdx Co-authored-by: Denise Peña <[email protected]> * Apply suggestions from code review Co-authored-by: Denise Peña <[email protected]> Co-authored-by: Kody Jackson <[email protected]> * Improved language * Update 2025-04-10-launching-pipelines.mdx * Update http.mdx * Update index.mdx * Update index.mdx * Update 2025-04-10-launching-pipelines.mdx * Update how-pipelines-work.mdx * Clarified shard count * Reorganized sources * improved language * improved grammar * Updated limits * Updated casing * Fixed typo * Improved pricing docs * Improved getting started guide * Improved text * Update output-settings.mdx * Update output-settings.mdx * Update pipelines.mdx * Update pipelines.mdx * minor fixes to tutorials (#21552) --------- Co-authored-by: Oliver Yu <[email protected]> Co-authored-by: Harshil Agrawal <[email protected]> Co-authored-by: kodster28 <[email protected]> Co-authored-by: Denise Peña <[email protected]>
1 parent bbfc5b6 commit dfaec48

File tree

26 files changed: +2034 -0 lines changed
[Two binary image files added: 21.7 KB and 307 KB]
@@ -0,0 +1,60 @@
---
title: Cloudflare Pipelines now available in beta
description: Use Cloudflare Pipelines to ingest real-time data streams, and load them into R2.
products:
  - pipelines
  - r2
  - workers
date: 2025-04-10 12:00:00 UTC
hidden: true
---

[Cloudflare Pipelines](/pipelines) is now available in beta to all users with a [Workers Paid](/workers/platform/pricing) plan.

Pipelines let you ingest high volumes of real-time data without managing the underlying infrastructure. A single pipeline can ingest up to 100 MB of data per second, via HTTP or from a [Worker](/workers). Ingested data is automatically batched, written to output files, and delivered to an [R2 bucket](/r2) in your account. You can use Pipelines to build a data lake of clickstream data, or to store events from a Worker.

Create your first pipeline with a single command:

```bash title="Create a pipeline"
$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket

🌀 Authorizing R2 bucket "my-bucket"
🌀 Creating pipeline named "my-clickstream-pipeline"
✅ Successfully created pipeline my-clickstream-pipeline

Id:   0e00c5ff09b34d018152af98d06f5a1xvc
Name: my-clickstream-pipeline
Sources:
  HTTP:
    Endpoint:       https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/
    Authentication: off
    Format:         JSON
  Worker:
    Format:         JSON
Destination:
  Type:        R2
  Bucket:      my-bucket
  Format:      newline-delimited JSON
  Compression: GZIP
Batch hints:
  Max bytes:    100 MB
  Max duration: 300 seconds
  Max records:  100,000

🎉 You can now send data to your pipeline!

Send data to your pipeline's HTTP endpoint:
curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]'

To send data to your pipeline from a Worker, add the following configuration to your config file:
{
  "pipelines": [
    {
      "pipeline": "my-clickstream-pipeline",
      "binding": "PIPELINE"
    }
  ]
}
```
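With that binding in place, a Worker can publish events directly. The sketch below is illustrative, assuming the binding exposes a `send()` method that accepts an array of JSON-serializable objects; the event shape is made up for the example:

```ts
// Minimal Worker sketch, assuming the `PIPELINE` binding configured above.
interface Env {
	PIPELINE: { send(records: object[]): Promise<void> };
}

export default {
	async fetch(request: Request, env: Env): Promise<Response> {
		// Hypothetical event shape; send any array of JSON objects.
		const events = [{ event: "page_view", url: request.url, ts: Date.now() }];
		await env.PIPELINE.send(events); // records are batched and delivered to R2
		return new Response("Accepted", { status: 202 });
	},
};
```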
Head over to our [getting started guide](/pipelines/getting-started) for an in-depth tutorial on building with Pipelines.
@@ -0,0 +1,8 @@
---
title: Build with Pipelines
pcx_content_type: navigation
sidebar:
  order: 3
  group:
    hideIndex: true
---
@@ -0,0 +1,103 @@
---
title: Configure output settings
pcx_content_type: how-to
sidebar:
  order: 3
head:
  - tag: title
    content: Configure output settings
---

import { Render, PackageManagers } from "~/components";

Pipelines convert a stream of records into output files and deliver the files to an R2 bucket in your account. This guide details how you can change the output destination and customize batch settings to generate query-ready files.

## Configure an R2 bucket as a destination

To create or update a pipeline using Wrangler, run the following command in a terminal:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]
```

After running this command, you will be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. Your pipeline uses the R2 API token to load data into your bucket. You can approve the request through the browser link, which will open automatically.

If you prefer not to authenticate this way, you can pass your [R2 API token](/r2/api/tokens/) to Wrangler:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2 [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY]
```

## File format and compression

Output files are generated as newline-delimited JSON files (`ndjson`). Each line in an output file maps to a single record.

By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag:

```sh
npx wrangler pipelines update [PIPELINE-NAME] --compression none
```

Output files are named using a [ULID](https://github.com/ulid/spec) slug, followed by an extension.

## Customize batch behavior

When configuring your pipeline, you can define how records are batched before they are delivered to R2. Each batch of records is written out to a single output file.

Batching can:
- Reduce the number of output files written to R2, and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations).
- Increase the size of output files, making them more efficient to query.

There are three ways to define how ingested data is batched:

1. `batch-max-mb`: The maximum amount of data that will be batched, in megabytes. Default, and maximum, is `100 MB`.
2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default, and maximum, is `10,000,000` rows.
3. `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default, and maximum, is `300 seconds`.

Batch definitions are hints. A pipeline will follow these hints closely, but batches might not be exact.

All three batch definitions work together, and whichever limit is reached first triggers the delivery of a batch.

For example, a `batch-max-mb` of 100 MB and a `batch-max-seconds` of 100 means that if 100 MB of events are posted to the pipeline, the batch will be delivered. However, if it takes longer than 100 seconds for 100 MB of events to be posted, a batch of all the messages that were posted during those 100 seconds will be created.
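Conceptually, the three hints combine into a single flush condition. The snippet below is only an illustration of that interaction, not Pipelines' internal implementation:

```ts
// Illustrative only: how the three batch hints interact conceptually.
interface BatchState { bytes: number; rows: number; ageSeconds: number; }
interface BatchHints { maxBytes: number; maxRows: number; maxAgeSeconds: number; }

function shouldFlush(state: BatchState, hints: BatchHints): boolean {
	// A batch is delivered as soon as ANY of the three limits is reached.
	return (
		state.bytes >= hints.maxBytes ||
		state.rows >= hints.maxRows ||
		state.ageSeconds >= hints.maxAgeSeconds
	);
}
```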
### Defining batch settings using Wrangler

You can use the following batch settings flags while creating or updating a pipeline:
* `--batch-max-mb`
* `--batch-max-rows`
* `--batch-max-seconds`

For example:

```sh
npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300
```

### Batch size limits

| Setting                                   | Default         | Minimum  | Maximum         |
| ----------------------------------------- | --------------- | -------- | --------------- |
| Maximum Batch Size `batch-max-mb`         | 100 MB          | 1 MB     | 100 MB          |
| Maximum Batch Timeout `batch-max-seconds` | 300 seconds     | 1 second | 300 seconds     |
| Maximum Batch Rows `batch-max-rows`       | 10,000,000 rows | 1 row    | 10,000,000 rows |

## Deliver partitioned data

Partitioning organizes data into directories based on specific fields to improve query performance. Partitions reduce the amount of data scanned for queries, enabling faster reads.

:::note
By default, Pipelines partition data by event date and time. This will be customizable in the future.
:::

Output files are prefixed with event date and hour. For example, the output from a pipeline in your R2 bucket might look like this:

```sh
- event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
- event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
```

## Deliver data to a prefix

You can specify an optional prefix for all the output files stored in your specified R2 bucket, using the `--r2-prefix` flag.

For example:

```sh
npx wrangler pipelines update [PIPELINE-NAME] --r2-prefix test
```

After running the above command, the output files generated by your pipeline will be stored under the prefix `test`. Files will remain partitioned. Your output will look like this:

```sh
- test/event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
- test/event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
```
@@ -0,0 +1,56 @@
---
pcx_content_type: concept
title: Increase pipeline throughput
sidebar:
  order: 11
---

import { Render, PackageManagers } from "~/components";

A pipeline's maximum throughput can be increased by increasing the shard count. A single shard can handle approximately 7,000 requests per second, or can ingest 7 MB/s of data.

By default, each pipeline is configured with two shards. To set the shard count, use the `--shard-count` flag while creating or updating a pipeline:

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --shard-count 10
```

:::note
The default shard count will be set to `auto` in the future, with support for automatic horizontal scaling.
:::

## How shards work

![Pipeline shards](~/assets/images/pipelines/shards.png)

Each pipeline is composed of stateless, independent shards. These shards are spun up when a pipeline is created. Each shard is composed of layers of [Durable Objects](/durable-objects). The Durable Objects buffer data, replicate it for durability, handle compression, and deliver output files to R2.

When a record is sent to a pipeline:
1. The Pipelines [Worker](/workers) receives the record.
2. The record is routed to one of the shards.
3. The record is handled by a set of Durable Objects, which commit the record to storage and replicate it for durability.
4. Records accumulate until the [batch definitions](/pipelines/build-with-pipelines/output-settings/#customize-batch-behavior) are met.
5. The batch is written to an output file and optionally compressed.
6. The output file is delivered to the configured R2 bucket.

Increasing the number of shards will increase the maximum throughput of a pipeline, as well as the number of output files created.

### Example

Your workload might require making 5,000 requests per second to a pipeline. If you create a pipeline with a single shard, all 5,000 requests will be routed to the same shard. If your pipeline has been configured with a maximum batch duration of 1 second, the 5,000 requests received each second will be batched, and a single file will be delivered.

Increasing the shard count to 2 will double the number of output files. The 5,000 requests will be split into 2,500 requests to each shard. Every second, each shard will create a batch of data and deliver it to R2.

## Considerations while increasing the shard count

Increasing the shard count also increases the number of output files that your pipeline generates. This in turn increases the [cost of writing data to R2](/r2/pricing/#class-a-operations), as each file written to R2 counts as a single class A operation. Additionally, smaller files are slower, and more expensive, to query. Rather than setting the maximum, choose a shard count based on your workload needs.

## Determine the right number of shards

Choose a shard count based on these factors:
* The number of requests per second you will make to your pipeline
* The amount of data per second you will send to your pipeline

Each shard is capable of handling approximately 7,000 requests per second, or ingesting 7 MB/s of data. Either factor might act as the bottleneck, so choose the shard count based on the higher number, as shown in the sketch below.

For example, if you estimate that you will ingest 70 MB/s while making 70,000 requests per second, set up a pipeline with 10 shards. However, if you estimate that you will ingest 70 MB/s while making 100,000 requests per second, set up a pipeline with 15 shards.
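If you prefer to compute this, the following sketch applies the per-shard capacity figures above. The helper and its names are illustrative, not part of Wrangler or any API:

```ts
// Estimate a shard count from the ~7,000 requests/s and ~7 MB/s
// per-shard capacity figures documented above. Illustrative helper only.
function estimateShardCount(requestsPerSecond: number, mbPerSecond: number): number {
	const SHARD_MAX_RPS = 7_000;
	const SHARD_MAX_MBPS = 7;
	// Either requests or data volume can be the bottleneck; size for the larger.
	const needed = Math.max(
		requestsPerSecond / SHARD_MAX_RPS,
		mbPerSecond / SHARD_MAX_MBPS,
	);
	return Math.min(15, Math.ceil(needed)); // 15 is the current per-pipeline maximum
}

console.log(estimateShardCount(70_000, 70)); // 10
console.log(estimateShardCount(100_000, 70)); // 15
```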
## Limits

| Setting                           | Default | Minimum | Maximum |
| --------------------------------- | ------- | ------- | ------- |
| Shards per pipeline `shard-count` | 2       | 1       | 15      |
@@ -0,0 +1,88 @@
---
title: Configure HTTP endpoint
pcx_content_type: concept
sidebar:
  order: 1
head:
  - tag: title
    content: Configure HTTP endpoint
---

import { Render, PackageManagers } from "~/components";

Pipelines support data ingestion over HTTP. When you create a new pipeline using the default settings, you will receive a globally scalable ingestion endpoint. To ingest data, make HTTP POST requests to the endpoint.

```sh
$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket

🌀 Authorizing R2 bucket "my-bucket"
🌀 Creating pipeline named "my-clickstream-pipeline"
✅ Successfully created pipeline my-clickstream-pipeline

Id:   0e00c5ff09b34d018152af98d06f5a1xvc
Name: my-clickstream-pipeline
Sources:
  HTTP:
    Endpoint:       https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/
    Authentication: off
    Format:         JSON
  Worker:
    Format:         JSON
Destination:
  Type:        R2
  Bucket:      my-bucket
  Format:      newline-delimited JSON
  Compression: GZIP
Batch hints:
  Max bytes:    100 MB
  Max duration: 300 seconds
  Max records:  100,000

🎉 You can now send data to your pipeline!

Send data to your pipeline's HTTP endpoint:
curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]'
```

## Authentication

You can secure your HTTP ingestion endpoint using Cloudflare API tokens. By default, authentication is turned off. To configure authentication, use the `--require-http-auth` flag while creating or updating a pipeline:

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --require-http-auth true
```

Once authentication is turned on, you will need to include a Cloudflare API token in your request headers.

### Get API token

1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com) and select your account.
2. Navigate to your [API tokens](https://dash.cloudflare.com/profile/api-tokens).
3. Select **Create Token**.
4. Choose the template for Workers Pipelines. Select **Continue to summary** > **Create token**. Make sure to copy the API token and save it securely.

### Making authenticated requests

Include the API token you created in the previous step in the headers for your request:

```sh
curl https://<PIPELINE-ID>.pipelines.cloudflare.com \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]'
```

## Specifying CORS settings

If you want to use your pipeline to ingest client-side data, such as website clicks, you will need to configure your [Cross-Origin Resource Sharing (CORS) settings](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS).

Without CORS settings, browsers will restrict requests made to your pipeline endpoint. For example, if your website domain is `https://my-website.com`, and you want to post client-side data to your pipeline at `https://<PIPELINE-ID>.pipelines.cloudflare.com`, the request will fail unless CORS is configured.

To fix this, you need to configure your pipeline to accept requests from `https://my-website.com`. You can do so while creating or updating a pipeline, using the `--cors-origins` flag. You can specify multiple domains separated by a space.

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins https://mydomain.com http://localhost:8787
```

You can specify that all cross-origin requests are accepted. We recommend only using this option in development, and not for production use cases.

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins "*"
```

After `--cors-origins` has been set on your pipeline, your pipeline will respond to preflight requests and `POST` requests with the appropriate `Access-Control-Allow-Origin` headers set.
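As an illustration, once your page's origin is allowed, a client-side script can post events directly to the endpoint. The endpoint placeholder and event shape below are assumptions for the example:

```ts
// Client-side sketch: post a click event to a pipeline's HTTP endpoint,
// assuming --cors-origins already allows this page's origin.
const PIPELINE_ENDPOINT = "https://<PIPELINE-ID>.pipelines.cloudflare.com/";

async function trackClick(element: string): Promise<void> {
	// Pipelines expect an array of JSON objects in the request body.
	const events = [{ event: "click", element, ts: new Date().toISOString() }];
	const response = await fetch(PIPELINE_ENDPOINT, {
		method: "POST",
		headers: { "Content-Type": "application/json" },
		body: JSON.stringify(events),
	});
	if (!response.ok) {
		console.error("Failed to send event:", response.status);
	}
}
```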
@@ -0,0 +1,27 @@
---
title: Sources
pcx_content_type: concept
sidebar:
  order: 1
  group:
    hideIndex: false
---

Pipelines let you ingest data from the following sources:
* [HTTP clients](/pipelines/build-with-pipelines/sources/http), with optional authentication and CORS settings
* [Cloudflare Workers](/workers/), using the [Pipelines Workers API](/pipelines/build-with-pipelines/sources/workers-apis)

Multiple sources can be active on a single pipeline simultaneously. For example, you can create a pipeline which accepts data from Workers and via HTTP. There is no limit to the number of source clients: multiple Workers can be configured to send data to the same pipeline.

Each pipeline can ingest up to 100 MB/s of data or accept up to 100,000 requests per second, aggregated across all sources.

## Configuring allowed sources

By default, ingestion via HTTP and from Workers is turned on. You can configure the allowed sources by using the `--source` flag while creating or updating a pipeline.

For example, to create a pipeline which only accepts data via a Worker, run this command:

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --source worker
```

## Accepted data formats

Pipelines accept arrays of valid JSON objects. You can send multiple objects in a single request, provided the total data volume is within the [documented limits](/pipelines/platform/limits). Sending data in a different format will result in an error.
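For illustration, here is that distinction in code; the record fields are made up:

```ts
// Valid: Pipelines expect an array of JSON objects per request.
// Multiple records per request are fine, within the documented limits.
const validBody = JSON.stringify([
	{ event: "purchase", amount: 42 },
	{ event: "refund", amount: 7 },
]);

// Invalid: a bare object (not wrapped in an array) is a different
// format and will be rejected with an error.
const invalidBody = JSON.stringify({ event: "purchase", amount: 42 });
```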
