A No-Code Pattern for Large-Scale Data Ingestion in Google Cloud using Application Integration

Synchronizing large volumes of data between systems is a common but significant challenge for most enterprises.

Whether you’re migrating a multi-million-record database or performing a daily bulk data sync, you often face issues with performance, system timeouts, and error handling.

The question is:

How can you build a reliable and scalable ingestion pipeline without writing extensive code?

This article provides a robust, no-code pattern using Application Integration to tackle exactly this challenge. We will focus on a pattern for full or partial data synchronization, where large datasets are fetched in manageable chunks from a source, stored as raw files in Google Cloud Storage (GCS), and then loaded into their final destination.

Staging the raw data in GCS is a best practice that:

  • Provides a Safety Net for debugging & retries

  • Allows for Data Consistency Checks

  • Decouples the extraction process from the final loading step.

While Application Integration can also be used for real-time, event-based ingestion using its various triggers, this article focuses specifically on Bulk Data Movement.

Asynchronous Orchestrator-Worker Model

At the core of this pattern is an asynchronous orchestrator-worker model designed to efficiently transfer large datasets without hitting system limits. This approach breaks the large task into two distinct roles:

  • The Orchestrator: This integration flow acts as the project manager. It initiates the overall job, keeps track of progress, and triggers the worker. When all the work is done, the orchestrator handles the final steps, like sending a completion notification.

  • The Worker: This integration flow handles the actual data transfer by pulling a “chunk” of data from the source system and uploading it directly to a specified GCS bucket. Once its task is complete, it calls back to the orchestrator to report its progress, allowing the orchestrator to decide the next step.

This recursive and sequential process continues until the entire dataset is ingested, providing a scalable and observable workflow.
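To make the control flow concrete, here is a minimal single-process Python sketch of the same loop. It is purely illustrative: in Application Integration the two roles are separate integrations that call each other asynchronously, and the fetch and upload steps shown as plain functions below are connector tasks in the real flows.

from typing import Callable, List, Optional, Tuple

# Illustrative sketch only: the function names and the "workload" fields are
# stand-ins for Application Integration tasks and the workload variable.

def worker(fetch_page: Callable[[Optional[str], int], Tuple[List[dict], Optional[str]]],
           page_token: Optional[str], page_size: int,
           bucket: str, folder: str, chunk_index: int) -> Optional[str]:
    """Pull one chunk from the source, stage it in GCS, report the next token."""
    records, next_token = fetch_page(page_token, page_size)
    # In the real integration this step is the Cloud Storage connector task.
    print(f"staged {len(records)} records at gs://{bucket}/{folder}/chunk-{chunk_index:05d}.json")
    return next_token

def orchestrator(fetch_page, workload: dict) -> None:
    """Keep triggering the worker until it reports that no pages remain."""
    token, chunk_index = workload.get("next_page") or None, 0
    while True:
        token = worker(fetch_page, token, int(workload["page_size"]),
                       workload["gcs_bucket"], workload["gcs_folder"], chunk_index)
        chunk_index += 1
        if not token:                      # worker reported completion
            break
    print("ingestion complete, notifying", workload["notification_emails"])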

This pattern can be used with any of the connectors available in Application Integration, but the implementation differs slightly based on the source system’s capabilities.

Using Bulk or Batch APIs

SaaS providers like Salesforce often offer optimized bulk ingestion and query APIs. The pattern described here can implement efficient paged batch queries using the Salesforce Bulk API 2.0 via the Application Integration Salesforce connector, enabling reliable and performant querying of large data volumes (e.g., millions of records) in just a few minutes.

In this model, the orchestrator first initiates a query job in the source system (e.g., Salesforce) and then polls periodically to check its status. Once the source system confirms the job is ready, the orchestrator triggers the worker to download the results in chunks.
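The connector hides these API calls, but conceptually the initiate-and-poll step looks like the sketch below against the Salesforce Bulk API 2.0 REST endpoints. This is a rough illustration only; it assumes you already have an OAuth access token and an instance URL, which the connector normally manages for you.

import time
import requests

API_VERSION = "v58.0"  # assumed API version; use whichever your org supports

def run_bulk_query(instance_url: str, access_token: str, soql: str, poll_seconds: int = 30) -> str:
    """Create a Bulk API 2.0 query job, then poll until Salesforce reports JobComplete."""
    headers = {"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"}
    base = f"{instance_url}/services/data/{API_VERSION}/jobs/query"

    # 1. Initiate the query job (the orchestrator's first step).
    job = requests.post(base, headers=headers,
                        json={"operation": "query", "query": soql}).json()
    job_id = job["id"]

    # 2. Poll periodically until the results are ready (the orchestrator's timer loop).
    while True:
        state = requests.get(f"{base}/{job_id}", headers=headers).json()["state"]
        if state == "JobComplete":
            return job_id                  # the worker can now download results in chunks
        if state in ("Failed", "Aborted"):
            raise RuntimeError(f"Bulk query job {job_id} ended in state {state}")
        time.sleep(poll_seconds)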

For even greater efficiency, another strategy worth considering is to replace polling with listening for batch query completion events (such as a Salesforce Platform Event) using an Application Integration Salesforce Trigger.

Using Standard Pagination

Many APIs and databases don’t have a dedicated Bulk API but support standard pagination using page numbers or tokens. The MongoDB example demonstrates this universal pattern, which can be easily adapted for any system that allows you to fetch data page by page.

Here, the process is more direct: the orchestrator calls the worker, telling it which page of data to fetch. The worker retrieves that specific page, uploads it to GCS, and then calls back to the orchestrator with the identifier for the next page.
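As a rough illustration, one worker iteration in this variant could look like the sketch below, using pymongo and the Cloud Storage Python client. The database, collection, and bucket names are placeholders, and keyset pagination on _id stands in for whatever paging mechanism your source exposes.

import json
from typing import Optional

from bson import ObjectId
from google.cloud import storage
from pymongo import MongoClient

def ingest_one_page(mongo_uri: str, bucket_name: str, folder: str,
                    last_id: Optional[str], page_size: int, page_number: int) -> Optional[str]:
    """Fetch one page from MongoDB, stage it in GCS, and return the next page identifier."""
    collection = MongoClient(mongo_uri)["mydb"]["orders"]   # placeholder database/collection

    # Keyset pagination on _id: fetch the page that follows the last seen document.
    query = {"_id": {"$gt": ObjectId(last_id)}} if last_id else {}
    docs = list(collection.find(query).sort("_id", 1).limit(page_size))
    if not docs:
        return None                     # tells the orchestrator the job is finished

    # Stage the raw page as a JSON file in GCS.
    blob = storage.Client().bucket(bucket_name).blob(f"{folder}/page-{page_number:05d}.json")
    blob.upload_from_string(json.dumps(docs, default=str), content_type="application/json")

    return str(docs[-1]["_id"])         # identifier the orchestrator passes to the next call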


Get Started with Pre-Built Templates

To help you get started, we have published two official templates in Application Integration that implement this pattern:

  • Large-scale Data Ingestion - Salesforce with Bulk API

  • Large-scale Data Ingestion - Mongo_DB

These templates contain all the logic described above and are designed to be adapted to your specific needs.

You can use them as a starting point and easily switch out the source connector to Salesforce, an HTTP API, or any other supported system as needed.

For a detailed walkthrough of how to configure and run these templates, refer to the accompanying video tutorials.

A key part of this integration architecture is the workload JSON variable. This single variable acts as the central “state” for the entire ingestion job and is passed between the orchestrator and worker integrations. This design offers two major advantages:

  1. Simplified Management: Having one consolidated variable shared across all sub-integrations makes the entire process easier to understand, manage, and update. There’s no need to map several individual variables between flows.

  2. Resilience and Replayability: The workload variable holds the complete context of the job at every step (e.g., which page to process next). If an execution fails mid-way, you can diagnose the issue, correct the workload variable with the parameters from the last successful step, and restart the integration from that exact point. This avoids having to re-ingest the entire dataset from the beginning.

To start an integration, one of your first steps is to define this workload object. Here is an example from the Salesforce template:

{
  "SOQLQuery": "SELECT Id, Name, BillingStreet, BillingCity, BillingState, BillingPostalCode FROM Account",
  "page_size": "25000",
  "next_page": "",
  "job_status": "Open",
  "gcs_bucket": "YOUR_BUCKET",
  "gcs_folder": "YOUR_FOLDER",
  "job_ID": "YOU_UNIQUE_ID",
  "timer": "30",
  "notification_emails": "YOUR_EMAIL"
}
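As an illustration of how this object carries state, a worker iteration might update it as follows before handing it back to the orchestrator. The field names match the example above; the status values and update logic are illustrative, not the template's exact implementation.

def update_workload(workload: dict, next_locator: str, job_finished: bool) -> dict:
    """Return a copy of the workload advanced to the next step (illustrative only)."""
    updated = dict(workload)
    updated["next_page"] = "" if job_finished else next_locator
    updated["job_status"] = "Complete" if job_finished else "InProgress"
    return updated

# Example: after staging a chunk, the orchestrator receives the workload ready
# for the next iteration, or for a manual restart if a later step fails.
workload = {"next_page": "", "job_status": "Open", "page_size": "25000"}
print(update_workload(workload, "NEXT_PAGE_LOCATOR", job_finished=False))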

Next Steps

Once your data is successfully ingested into Google Cloud Storage, we encourage you to perform a sanity check to verify the consistency of the output files.
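For example, a quick check could list the staged files and count the records they contain, to compare against the record count reported by the source. The sketch below assumes newline-delimited JSON chunks; adapt it if your chunks are CSV.

from google.cloud import storage

def count_staged_records(bucket_name: str, folder: str) -> int:
    """Count files and records staged under gs://bucket/folder/ (assumes one JSON document per line)."""
    client = storage.Client()
    total_files, total_records = 0, 0
    for blob in client.list_blobs(bucket_name, prefix=f"{folder}/"):
        total_files += 1
        total_records += sum(1 for line in blob.download_as_text().splitlines() if line.strip())
    print(f"{total_files} files, {total_records} records under gs://{bucket_name}/{folder}/")
    return total_records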

From there, if your final destination is BigQuery, you can easily use BigQuery Jobs to transfer the raw data into your destination tables.

These BigQuery Jobs can also be initiated directly from Application Integration using the BigQuery Connector, with a simple flow that can be triggered by a scheduler or an API call.
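If you prefer to script this step outside Application Integration, a minimal sketch of the equivalent load job with the BigQuery Python client might look like this; the project, dataset, table, and GCS URI are placeholders for your own setup.

from google.cloud import bigquery

def load_staged_files(project: str, dataset: str, table: str, gcs_uri: str) -> None:
    """Run a BigQuery load job that moves the staged GCS files into a destination table."""
    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # adapt if chunks are CSV
        autodetect=True,                                             # or supply an explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        gcs_uri,                              # e.g. "gs://YOUR_BUCKET/YOUR_FOLDER/*.json"
        f"{project}.{dataset}.{table}",
        job_config=job_config,
    )
    load_job.result()                         # wait for the job to finish
    print(f"Loaded rows into {project}.{dataset}.{table}")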

Happy Integrating!

Simon Lebrun and Christopher Karl Chan