How Extraction Works

Extraction is the process of pulling data from a source system and loading it into Snowflake. Rime handles this end-to-end through a multi-stage pipeline. This page explains each stage, why it is designed this way, and what you can expect during a typical extraction run.

Pipeline overview

Data flows through six stages:

Source System -> Arrow RecordBatch -> Parquet File -> S3 -> Snowpipe -> Snowflake
  1. The connector authenticates with the source and reads data
  2. Data is held in memory as Apache Arrow record batches
  3. Arrow data is written to Parquet files with Snappy compression
  4. Parquet files are uploaded to an S3 staging bucket
  5. S3 event notifications trigger Snowpipe
  6. Snowpipe loads the data into Snowflake raw tables

Each stage is designed for a specific purpose: Arrow for fast in-memory processing, Parquet for compact storage, S3 for durable staging, and Snowpipe for event-driven loading.

Stage 1: Source connection

The connector authenticates with the source system using the credentials stored in Rime (encrypted at rest with AES-256-GCM). For databases, this means opening a connection and executing queries. For APIs, this means sending authenticated HTTP requests. For files, this means reading the uploaded file from Rime’s storage.

The connector only reads data. It never writes to, modifies, or deletes anything in the source system.

Stage 2: Arrow RecordBatch

Data from the source is converted into Apache Arrow record batches. Arrow is a columnar in-memory format designed for analytical workloads.

Why Arrow matters:

  • Columnar layout: data is organized by column rather than by row, which enables efficient compression and fast type-specific processing. A column of integers is stored as a contiguous block of integers, not interleaved with strings and timestamps.
  • Zero-copy reads: Arrow’s memory layout allows data to be passed between pipeline stages without serialization or copying. The bytes that come out of the source query are the same bytes written to Parquet.
  • Type safety: Arrow has a rich type system (int8 through int64, float32/float64, UTF-8 strings, timestamps with timezone, decimals, etc.) that preserves the source data’s types accurately through the pipeline.
  • Streaming: data is processed in batches rather than loading entire tables into memory. A table with 100 million rows is processed as a stream of batches (typically 10,000-100,000 rows each), keeping memory usage bounded regardless of table size.

You do not need to configure anything about Arrow. It is an internal implementation detail. But understanding it helps explain why Rime can extract large datasets efficiently without running out of memory.

Stage 3: Parquet with Snappy compression

Arrow record batches are written to Parquet files. Parquet is a columnar file format optimized for storage and query performance.

Why Parquet:

  • Compression: Parquet’s columnar layout means similar values are stored together, which compresses well. Rime uses Snappy compression, which prioritizes speed over compression ratio. A typical table compresses to 20-40% of its raw size.
  • Schema preservation: Parquet files carry their schema (column names, types, nullability), so downstream systems can read them without external metadata.
  • Efficient for Snowflake: Snowflake’s Snowpipe is optimized for loading Parquet files. It can read individual columns without scanning entire rows, and it maps Parquet types directly to Snowflake types.

Rime writes one Parquet file per table per extraction run. For very large tables, the file may be split into multiple parts to stay under size limits.

Stage 4: S3 staging

Parquet files are uploaded to an S3 bucket in your project’s AWS account. Rime organizes files in the bucket using a consistent path structure:

s3://<bucket>/raw/<connector-id>/<table-name>/<run-id>/<part>.parquet
  • connector-id: unique identifier for the connector
  • table-name: name of the source table or endpoint
  • run-id: unique identifier for this extraction run
  • part: file part number (if the table was split)

S3 serves as durable staging between extraction and loading. Files remain in S3 after Snowpipe loads them, providing a recovery path if Snowflake data needs to be reloaded. Lifecycle policies on the bucket control how long files are retained (configurable per project in the infrastructure settings).
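A helper that builds keys following this structure might look like the sketch below. The `part-{n:05d}` file-name format is an illustrative choice, not necessarily Rime's actual naming; the directory hierarchy matches the documented layout.

```python
def staging_key(connector_id: str, table_name: str, run_id: str, part: int = 0) -> str:
    """Build the S3 object key for a staged Parquet file:
    raw/<connector-id>/<table-name>/<run-id>/<part>.parquet
    (zero-padded part number is a hypothetical convention)."""
    return f"raw/{connector_id}/{table_name}/{run_id}/part-{part:05d}.parquet"
```

Keying files by run ID means re-runs never overwrite earlier output, which is what makes the bucket usable as a recovery path.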

Stage 5: Snowpipe ingestion

When a Parquet file arrives in S3, an S3 event notification (via SNS) triggers Snowpipe. Snowpipe is Snowflake’s continuous data loading service.

How Snowpipe works:

  1. The S3 bucket is configured with event notifications that fire on s3:ObjectCreated:* events.
  2. Notifications are sent to an SNS topic.
  3. Snowpipe subscribes to the SNS topic via an SQS queue.
  4. When a notification arrives, Snowpipe queues a COPY INTO command that loads the Parquet file into the target Snowflake table.

Snowpipe is event-driven, meaning data loads begin within seconds of the Parquet file arriving in S3. You do not need to schedule or trigger the load manually.

Copy history: Snowpipe tracks which files it has loaded. If the same file is somehow uploaded again (for example, due to a retry), Snowpipe skips it based on the file name. This prevents duplicate data from accidental re-uploads.
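The copy-history deduplication logic reduces to a set membership check on file names. The sketch below models it in plain Python; `copy_history` here is a stand-in for Snowflake's internal load history, not an API you call.

```python
def queue_loads(filenames, copy_history):
    """Queue only files Snowpipe has not loaded before, keyed by file name.
    copy_history is a set of already-loaded names (an in-memory stand-in
    for Snowflake's per-pipe load history)."""
    queued = []
    for name in filenames:
        if name in copy_history:
            continue  # duplicate upload (e.g. a retry): skip it
        copy_history.add(name)
        queued.append(name)
    return queued
```

Because the check is by file name, a genuinely new file that happens to reuse an old name would also be skipped, which is why run IDs in the S3 path matter.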

Stage 6: Snowflake raw tables

Data arrives in Snowflake’s raw schema. Each source table maps to a Snowflake table with the same column names, with types carried from the source through Arrow and Parquet. See Schema Discovery for details on type mapping.

Raw tables are the starting point for transformation. From here, dbt models transform raw data through staging, warehouse, and mart layers.

Per-table extraction

Rime extracts each selected table independently within a single connector run. This means:

  • If one table fails (due to a permission change, a timeout, a schema mismatch, or any other error), the remaining tables continue to extract normally.
  • The run status reflects the outcome of all tables: “completed” if all succeeded, “partially failed” if some failed, or “failed” if all failed.
  • Per-table status, row counts, and error details are available in the run history.

This design prevents a problem with one table from blocking the rest of your data pipeline. A missing column in a rarely used lookup table should not stop the extraction of your core transaction tables.
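The status aggregation described above is simple to state in code. A minimal sketch, assuming per-table outcomes are already known:

```python
def run_status(table_results: dict[str, bool]) -> str:
    """Aggregate per-table outcomes into an overall run status.
    table_results maps table name -> True (succeeded) / False (failed)."""
    succeeded = sum(1 for ok in table_results.values() if ok)
    if succeeded == len(table_results):
        return "completed"
    if succeeded == 0:
        return "failed"
    return "partially failed"
```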

The connector runner process

Each extraction runs in an isolated connector runner process. The runner is a standalone binary that:

  1. Receives the connector configuration and encrypted credentials from the Rime API
  2. Decrypts credentials in memory
  3. Executes the extraction pipeline for all selected tables
  4. Reports progress and results back to the Rime API

The runner process is isolated from the main Rime API server. If a connector crashes (due to a bug, an out-of-memory condition, or an unexpected source system response), it does not affect other connectors or the Rime platform itself. The crash is recorded as a failed run in the run history.

Connector runners are stateless. They do not store any data or credentials between runs. Each run starts fresh with the current connector configuration.

Timing and performance

Extraction time depends on several factors:

  • Data volume: a table with 1 million rows takes longer than a table with 1,000 rows
  • Network latency: extracting from a database in the same AWS region is faster than extracting across regions or from an on-premises server
  • Source system performance: a heavily loaded database responds to queries more slowly
  • API rate limits: SaaS connectors are constrained by the source API’s rate limits

Rime does not impose artificial limits on extraction speed. It reads data as fast as the source system and network allow, subject to any rate limiting configuration you have set for REST API connectors.

Next steps