Snowpipe: The Free Lunch From Snowflake. Load Data On The Cheap.

Okay, okay, it’s not free, obviously. But almost… 😇

In this article we’ll cover a few ways to load data into Snowflake and why Snowpipe is the cheapest and most effective. We’ll also review some Snowflake Stage types that you may not have played with before, and reasons to move away from External Stages. We will leverage the Snowpipe API to trigger the copy into command, instead of the traditional methods of invoking Snowpipe (such as bucket event notifications, SQS, SNS, or Pub/Sub).

Quick Background on the Inspiration

Recently in the dbt Slack community, I mentioned I was considering converting dbt Incremental models to read from Streams instead of fetching new records with a timestamp filter against {{this}}. The response from one community member was “why not Dynamic Tables?”, which led to “maybe you can create a blog post comparing the three options”. That blog post is coming soon, but first I had to create a massive amount of fake data in Snowflake to facilitate the case study. The data has to feel heavy, and it needs new records arriving often. In creating that case study, some very valuable lessons emerged, which I am sharing in this post.

Begin With the End In Mind

Consider this scenario: you need to load data to Snowflake every 5 minutes around the clock.

The copy into command is very efficient in Snowflake, but it requires a Virtual Warehouse to execute.  Since the minimum billing increment is 1 minute, that means you will pay for 1 minute every time that copy command is executed.

Using the copy into command with a scheduled task every 5 minutes, the warehouse will turn on 288 times a day, causing 288 minutes, or 4.8 hours, of billing. In the us-east region on Enterprise Edition, assuming an X-Small warehouse (1 credit per hour at roughly $3 per credit), that works out to about $14 per day. Not breaking the bank. No enterprise is worried about $14 per day, but we can do much better!

Let’s learn how Snowpipe, a serverless feature that does not spin up your warehouse, can drastically reduce the cost of loading data. I also want to show you a very fancy way to use Snowpipe: with a table stage and a call to the Snowpipe API. But first, some background on Stages.

What is a Snowflake Stage?

A stage in Snowflake is a location where files (csv, parquet, json, etc.) are staged to be copied into or out of Snowflake.  This is usually Cloud Storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage.  But there is more to the story.

External Stages

If you use Snowflake, likely you’ve created an External Stage.  An external stage points to cloud storage that is owned by you or your company.  It is your S3 bucket or GCS bucket, etc.  This is the most common type of stage for users to create and interact with.

Creating an External Stage usually looks like this:
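
(A minimal sketch; the bucket URL, storage integration, and file format below are placeholders.)

```sql
-- External stage pointing at your own cloud storage (names are illustrative)
create stage my_external_stage
  url = 's3://my-company-bucket/landing/'
  storage_integration = my_s3_integration
  file_format = (type = csv skip_header = 1);
```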

Internal Stages

An internal stage is a location in Snowflake’s cloud storage bucket. The location is associated with your Snowflake account, but it is not managed by you. These are the most popular types of stages for ETL tools to interact with. Fivetran, Airbyte, Estuary, Portable, Alteryx, etc. are all writing to Internal Stages. This is convenient for them because they don’t need access to your S3 in order to stage data; they just need the Snowflake credentials you already gave them. The fewer sets of credentials we need to load data, the better.

Three Types of Internal Stages

User Stage: 

Each Snowflake user gets a stage where they can put data. This is convenient because you don’t need to ask your IT department to set up a bucket or folder for you. I don’t recommend using your personal User Stage for Prod use cases, though: one day you will win the lottery, and your user stage will be dropped when IT shuts down your account. 🙂

For Prod use cases, you can use a User Stage with a Service Account. As long as that service account exists, the stage will be there.

User stages are referenced with @~; for example, list @~ will list the files in your personal stage. To learn how to put files and copy them into Snowflake, see the “Table Stage” section below, but use @~ instead of the @% table-stage reference.

There is no create stage syntax for User Stage! It automatically exists.
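
For example (hypothetical file names and paths):

```sql
-- List whatever is already in your personal user stage
list @~;

-- Put a local file into a folder within your user stage (run from SnowSQL or a connector session)
put file:///tmp/my_data.csv @~/staged/;

-- Clean up when you are done (PUT gzips the file by default)
remove @~/staged/my_data.csv.gz;
```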

Named Stages:

This is what I see a lot of ETL tools using. If you look at the query history these tools generate, you will see a pattern like this:

  • create temp stage NamedInternalStage

  • put file

  • copy into

  • drop stage

You can create your own named stage with this DDL:
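
(An illustrative internal named stage; the name and file format are placeholders.)

```sql
-- No URL means this is an internal named stage, living in Snowflake-managed storage
create stage my_named_stage
  file_format = (type = csv skip_header = 1)
  comment = 'Internal named stage for loading CSV files';
```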

Table Stages

Let’s welcome to the stage, the feature of this blog post, Table Stages!

💡 Every table in Snowflake has a Stage automatically associated with that table. 💡

This is super handy because you don’t need any DDL to create it, and you don’t need your IT department to provision a bucket.

The name of the stage is the same as the name of the table, and it is referenced with @%, for example @%my_table.

To put a file there, you simply do this:
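
(Run from SnowSQL or the Python connector, since put is a client-side command; the file path and table name are placeholders.)

```sql
-- PUT uploads (and by default gzips) the local file into my_table's stage
put file:///tmp/my_data.csv @%my_table;

-- Confirm the file landed
list @%my_table;
```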

Copying Data to Snowflake

Now that we’ve covered the types of stages, let’s quickly look at how to copy data into Snowflake from Stages.

The conventional way to load data into Snowflake is to leverage the copy command. copy is a bulk-load command: it is very fast and very efficient. For more on optimizing the copy command, see this overview from the select.dev blog.

Example usage:
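
(A minimal sketch; the table and stage names are illustrative.)

```sql
copy into mycsvtable
  from @my_external_stage
  file_format = (type = csv skip_header = 1)
  on_error = 'continue';
```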

This command will load data to a Snowflake table called mycsvtable from a named external stage.

As mentioned already, the cost of the copy command is just the cost of your running warehouse. There are no data ingress fees.

Snowflake’s write_pandas() function

When writing Python dataframes to Snowflake, you never want to use df.to_sql(). This causes inefficient inserts into Snowflake. Instead, you want to use the write_pandas() function that comes with snowflake-connector-python. write_pandas() will follow Snowflake’s bulk load best practice. Let’s use Query History to see exactly what it does:
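
For reference, calling write_pandas() looks roughly like this (a sketch; the connection parameters and table name are placeholders):

```python
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder connection parameters
conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)

df = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["a", "b", "c"]})

# write_pandas stages the dataframe as files and bulk loads them with COPY INTO
success, num_chunks, num_rows, _ = write_pandas(
    conn,
    df,
    table_name="MY_TABLE",
    auto_create_table=True,  # infer the schema and create the table if it doesn't exist
)
print(success, num_chunks, num_rows)
```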

Here’s what it did (read the query history in reverse order):

  • Create temp stage

  • Put File

  • Create temp file format

  • Infer schema to prepare the DDL to create the table.

  • Create the table if it doesn’t exist.

  • Copy the data in.

You can see that two of these steps used a Warehouse, which is exactly what we are trying to avoid.

Snowpipe Basics

This article is not a deep dive on how Snowpipe works, but let’s cover the basics. Snowpipe is an event-based mechanism for executing the copy command: it copies data from a stage into a Snowflake table when some event happens. Often that event comes through a Pub/Sub model or AWS Simple Queue Service / Simple Notification Service. Your notification service sends an event notification to Snowflake that says “Hey, a new file has arrived in the bucket, please issue that copy command now”, and Snowflake handles the loading.

Creating a pipe can look like this:
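
(A sketch of an auto-ingest pipe; the pipe, table, stage, and file format names are illustrative.)

```sql
create pipe my_pipe
  auto_ingest = true
as
  copy into my_table
  from @my_external_stage
  file_format = (type = csv skip_header = 1);

-- show pipes exposes the notification_channel (e.g. an SQS ARN) to wire your bucket events to
show pipes;
```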

You can see that a Snowpipe just wraps the copy command and provides a notification channel so Snowflake knows a new file has arrived.

🛑But We Don’t Want to Manage Buckets, SQS, Pub/Sub 🤚

For whatever reason, you may not want to create new objects in your Cloud Provider account. Maybe you don’t want a copy of the data on your own bucket or the responsibility to manage the lifecycle of those files.  Instead you want to do everything with native Snowflake.

The Secret Sauce: Copy Data to a Table Stage and Call the Snowpipe API

We said before that Snowpipe needs an event to trigger it.  But the event doesn’t have to come from your cloud provider and the file doesn’t need to be in your cloud storage first!

Instead, you can put the data to a table stage and call the Snowpipe API to trigger the copy command.  This is 100% serverless.

The Snowflake documentation is very unclear on how much this costs, so we need to experiment with it.

In the example I mentioned earlier, we need to copy data every 5 minutes around the clock.  I have been doing this now for >5 days, using AWS Lambda to create the fake data, stage it, and trigger Snowpipe.  Here are my costs:

Snowflake costs: After 5 full days, it has cost less than $1. That’s right, less than one US dollar. Even extrapolating from only my highest-usage day (after I increased the volume and frequency), it would cost about $1.83 per week.

Running Snowpipe for less than a dollar. Day 1 is partial; load frequency and volume were increased on 6/15.

AWS Costs: less than twenty cents.

  • Lambda cost: Free

  • Container Registry: 1 cent

  • AWS Secrets Manager: 13 cents

AWS Cost for the project: $0.16 (Secrets) + $0.02 (ECR)

Brass Tacks:

  • Copy Command on a similar schedule: $14 per day

  • Serverless (Snowpipe): Roughly $1 per week!

In this example, I’m loading about 1.4 million rows per day.

Loading 1.4 million rows per day to Snowflake for less than $2/week

A quick word on the results: my approach was not particularly scientific, your mileage will vary. This is an “order of magnitude” demonstration.

Architecture for this Experiment

How to PUT DataFrame to the Snowflake Stage:
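
A rough sketch of the idea (connection details, table name, and file paths are placeholders, not the exact code from the repo):

```python
import os
import tempfile

import pandas as pd
import snowflake.connector

# Placeholder connection details; recent connector versions accept private_key_file
# directly, older ones need the key bytes passed as private_key.
conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="MY_SERVICE_USER",
    private_key_file="/path/to/rsa_key.p8",
    database="MY_DB",
    schema="MY_SCHEMA",
)


def put_dataframe_to_table_stage(df: pd.DataFrame, table_name: str) -> str:
    """Write the dataframe to a local CSV, then PUT it into the table's stage (@%table)."""
    local_path = os.path.join(tempfile.gettempdir(), f"{table_name}_batch.csv")
    df.to_csv(local_path, index=False)

    with conn.cursor() as cur:
        # PUT gzips the file by default and uploads it to the table stage; no warehouse needed
        cur.execute(f"put file://{local_path} @%{table_name} auto_compress = true")

    # Return the staged (compressed) file name, which is what the Snowpipe API needs
    return f"{os.path.basename(local_path)}.gz"


staged_file = put_dataframe_to_table_stage(pd.DataFrame({"id": [1, 2, 3]}), "my_table")
```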

How to Trigger Snowpipe Rest API
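
A rough sketch using the requests library against the Snowpipe insertFiles endpoint (the account identifier and pipe name are placeholders; the JWT comes from the generator covered below):

```python
import uuid

import requests

# Placeholders: account identifier and a pipe that wraps "copy into my_table from @%my_table"
ACCOUNT = "myorg-myaccount"
PIPE_NAME = "MY_DB.MY_SCHEMA.MY_PIPE"


def trigger_snowpipe(staged_file: str, jwt_token: str) -> dict:
    """Ask Snowpipe to ingest a file that is already sitting in the stage."""
    url = (
        f"https://{ACCOUNT}.snowflakecomputing.com/v1/data/pipes/"
        f"{PIPE_NAME}/insertFiles?requestId={uuid.uuid4()}"
    )
    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {jwt_token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        json={"files": [{"path": staged_file}]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```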

Full Code

Obviously you can use those snippets above to get to the repo, but I thought I’d mention the main file is here. Everything else is in support of that file.

Also - the repo is not yet at a state where you can just clone it and run it. I will get there soon! When I update it, I will update here too.

What’s missing:

  • Snowflake formation stuff. I will add steps using Titan OSS. (Create database, schema, table, role, warehouse, user & key pair auth.)

    • You need a user configured with Keypair Authentication to generate the JWT required for the Snowpipe API.

  • AWS formation stuff: I may add steps to build using cdk, but I’m not committed to doing that yet. My goal is for someone to be able to run it locally and they can deploy it however they want.

Challenge of Using Snowpipe API

The only challenge of using the API vs. bucket events is that the authentication is a little tricky. It requires you to use Keypair Auth, then use that key pair to create a JWT. Creating the JWT was problematic at first, but eventually I found some boilerplate from Snowflake that I could copy and paste (slightly edited). If you need to generate a JWT for Snowflake, you can use this method.
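
Here is a condensed sketch of that generator (not Snowflake’s exact boilerplate; it assumes PyJWT and the cryptography package are installed and that the private key file is unencrypted):

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT
from cryptography.hazmat.primitives import serialization


class JWTGenerator:
    """Condensed sketch of Snowflake's sample JWT generator for key pair auth."""

    def __init__(self, account: str, user: str, private_key_path: str,
                 lifetime: timedelta = timedelta(minutes=59)):
        # Snowflake expects the account and user name upper-cased in the claims
        self.account = account.upper()
        self.user = user.upper()
        self.lifetime = lifetime
        with open(private_key_path, "rb") as key_file:
            self.private_key = serialization.load_pem_private_key(key_file.read(), password=None)

    def _public_key_fingerprint(self) -> str:
        # SHA256:<base64 digest of the DER-encoded public key>
        der = self.private_key.public_key().public_bytes(
            serialization.Encoding.DER,
            serialization.PublicFormat.SubjectPublicKeyInfo,
        )
        return "SHA256:" + base64.b64encode(hashlib.sha256(der).digest()).decode()

    def get_token(self) -> str:
        now = datetime.now(timezone.utc)
        qualified_user = f"{self.account}.{self.user}"
        payload = {
            "iss": f"{qualified_user}.{self._public_key_fingerprint()}",  # who signed the token
            "sub": qualified_user,                                        # who the token is for
            "iat": now,                                                   # issued at
            "exp": now + self.lifetime,                                   # max lifetime is 1 hour
        }
        return jwt.encode(payload, self.private_key, algorithm="RS256")
```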

Then using the JWTGenerator is simple:
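
(Again a sketch; the account identifier, user, and key path are placeholders.)

```python
# Swap in your account identifier, service user, and key location
generator = JWTGenerator(
    account="myorg-myaccount",
    user="MY_SERVICE_USER",
    private_key_path="/path/to/rsa_key.p8",
)
jwt_token = generator.get_token()

# Pass the token to the Snowpipe REST call shown earlier
# result = trigger_snowpipe("my_table_batch.csv.gz", jwt_token)
```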

What does the JWT do? On its own, your user Key Pair says “this is who I am, it is really me”. The JWT says “this is who I am, it is really me, and this is what time it is; my token is fresh”. You can also limit the scope of a JWT. So it is just an extra layer of security.

Wrap up

Hopefully I’ve convinced you that Snowpipe is cheaper than trying to automate the copy into command via cron job or Snowflake Task, especially if the data is arriving frequently.

Additionally, I hope you've gained some new insight into using the Table Stage and the Snowpipe REST API!

I’m looking forward to writing Part 2 of this Blog, where I will cover dbt Incremental models vs Snowflake Streams vs Snowflake Dynamic Tables.

For more info on ways to load data to Snowflake, head over to this blog post from SELECT.

About Me

Hey! I’m Jeff. I’d really love to hear from you! I founded GMDS in 2018 after slinging data at Keurig for many years. My background is in Supply Chain Analytics, which is an area where it is so easy to make such a HUGE impact with data. If you want to chat about data, Snowflake challenges you are having, reducing overall data stack costs, or making a big impact to your org with data, please hit the “Chat with us” button below. It will go right to my slack.
