
Snowflake Core Certification Guide 2.0 (2022)

Contents
1. Introduction to Snowflake ........................................................................................................................... 7
Data Warehouse Basics................................................................................................................................. 7
Cloud Service Models .................................................................................................................................... 9
Cloud Deployment Models ........................................................................................................................... 10
Modern Data Stack ...................................................................................................................................... 11
Data Cloud.................................................................................................................................................... 12
Snowflake Data Marketplace ....................................................................................................................... 12
Snowflake History......................................................................................................................................... 13
Cloud data platform ...................................................................................................................................... 14
Supported Regions and Platforms ............................................................................................................... 15

Editions ......................................................................................................................................................... 17

Standard Edition ....................................................................................................................................... 17

Enterprise Edition ..................................................................................................................................... 17

Business Critical Edition............................................................................................................................ 17

Virtual Private Snowflake (VPS) ................................................................................................................ 17

Pricing ........................................................................................................................................................... 19
Snowflake Credits ..................................................................................................................................... 19
Virtual Warehouses Size .......................................................................................................................... 19
Cloud Services.......................................................................................................................................... 19
On-Demand Buying .................................................................................................................................. 20
Pre-purchased Capacity ........................................................................................................................... 20

Snowflake Releases ...................................................................................................................................... 22


Sign up trial account ..................................................................................................................................... 24
2. Introduction to Web UI - Snowsight.......................................................................................................... 28
Admin Menu ................................................................................................................................................. 28
Data Menu .................................................................................................................................................... 30
Activity Menu ................................................................................................................................................ 34
3. Snowflake EcoSystem & Partner Connect............................................................................................... 40
4. Snowflake Architecture ............................................................................................................................. 42
5. Virtual warehouse ..................................................................................................................................... 44

Virtual Warehouse Parallel Processing ......................................................................................................... 44

Virtual Warehouse Considerations............................................................................................................... 48

How are Credits Charged for Warehouses? ............................................................................................. 48


How Does Query Composition Impact Warehouse Processing? ............................................................. 48

How Does Warehouse Caching Impact Queries?..................................................................................... 48

Selecting an Initial Warehouse Size .......................................................................................................... 49

Automating Warehouse Suspension......................................................................................................... 49

Scaling Up vs Scaling Out ......................................................................................................................... 50

Warehouse Resizing Improves Performance ............................................................................................ 50

Multi-cluster Warehouses Improve Concurrency..................................................................................... 50

Effects of Resizing a Running Warehouse .................................................................................................... 51

Monitoring Warehouse Load........................................................................................................................ 51

Using the Load Monitoring Chart to Make Decisions .................................................................................. 52

Slow Query Performance ............................................................................................................................. 52

Peak Query Performance ............................................................................................................................. 52

Excessive Credit Usage................................................................................................................................. 52


6. Role Based Access Control...................................................................................................................... 53

System Roles ................................................................................................................................................ 55

User, Role, Grants provisioning .................................................................................................................... 57

User Creation- Pending ................................................................................................................................ 59

Access Control Best Practices ...................................................................................................................... 60


7. Organization ............................................................................................................................................. 63
8. Accounts and Schema Objects ................................................................................................................ 65

Stages ........................................................................................................................................................... 67

Create Stage Command ............................................................................................................................ 69

Compression of Staged Files ..................................................................................................................... 70

Encryption of Staged Files ........................................................................................................................ 70

Platform Vs Stages .................................................................................................................................... 71

File Format.................................................................................................................................................... 72

Tables ........................................................................................................................................................... 73

External Tables.......................................................................................................................................... 73

Views ............................................................................................................................................................ 77

Secure Views............................................................................................................................................. 77

Materialized Views.................................................................................................................................... 77

Best Practices for Creating Materialized Views ........................................................................................ 79


9. Loading Structured Data .......................................................................................................................... 79

Bulk Loading Overview ................................................................................................................................ 79

Loading Using the Web Interface ................................................................................................................. 81


Bulk loading from local data source using SnowSQL ................................................................................... 82

Bulk loading from external source (Amazon Web Service) .......................................................................... 83

Copy into table command ............................................................................................................................ 84

Snowpipe - Continuous Loading.................................................................................................................. 86

Snowpipe Different Vs Bulk Data Loading ................................................................................................... 88

Data Loading Summary ................................................................................................................................ 90

• Data Loading Considerations ................................................................................................................ 92

Preparing Your Data Files - Best Practices ............................................................................................... 92

Staging Data - Best Practices .................................................................................................................... 92

Loading Data- Best Practices .................................................................................................................... 93

Managing Regular Data Loads .................................................................................................................. 93

Preparing to Load Data ............................................................................................................................. 95


10. Data Unloading ..................................................................................................................................... 98

Copy into Stage – Pending ........................................................................................................................... 98


11. Loading Semistructured Data ............................................................................................................... 98

Supported File Formats ................................................................................................................................ 99

Time Travel - Pending................................................................................................................................... 99

Cloning.......................................................................................................................................................... 99

Column Level Security ................................................................................................................................ 100

Snowpipe .................................................................................................................................................... 102

Streams ....................................................................................................................................................... 103

Tasks ........................................................................................................................................................... 105

Stored Procedures and UDF....................................................................................................................... 107


12. Stored Procedures and User Defined Functions................................................................................ 108

Differences Between Stored Procedures and UDFs .................................................................................. 109

User Defined Functions .............................................................................................................................. 110

External Function ....................................................................................................................................... 110


13. Snowflake Scripting ............................................................................................................................ 113
1. Understanding Blocks in Snowflake Scripting .................................................................................... 113
2. Working with Variables ....................................................................................................................... 113
3. Working with Branching Constructs ................................................................................................... 113
4. Working with Loops............................................................................................................................. 114
5. Working with Cursors .......................................................................................................................... 114
6. Working with RESULTSETs ............................................................................................................... 115
7. Handling Exceptions ........................................................................................................................... 116
14. Storage - Understanding Micropartitions ............................................................................................ 117

15. Clustering ............................................................................................................................................ 118


16. Caching ............................................................................................................................................... 121

Using Persisted Query Results ................................................................................................................... 121

Post-processing Query Results................................................................................................................... 121

Warehouse Caching Impact Queries .......................................................................................................... 122


17. Query and Results history .................................................................................................................. 123

Overview .................................................................................................................................................... 123

Query History Results ................................................................................................................................ 124

Export Results............................................................................................................................................. 124

Viewing Query Profile ................................................................................................................................... 125

Query profiling ........................................................................................................................................... 126


18. Resource Monitoring ........................................................................................................................... 136
Overview .................................................................................................................................................... 136

Credit Quota ............................................................................................................................................... 137


19. Data sharing ........................................................................................................................................ 138

Introduction to Secure Data Sharing .......................................................................................................... 138

How Does Secure Data Sharing Work ....................................................................................................... 138

Share Object ............................................................................................................................................... 139

Data Providers ............................................................................................................................................ 140

Data Consumers ......................................................................................................................................... 140

Reader Account .......................................................................................................................................... 141

General Limitations for Shared Databases ................................................................................................. 141

Overview of the Product Offerings for Secure Data Sharing..................................................................... 142

Direct Share ................................................................................................................................................ 142

Snowflake Data Marketplace ..................................................................................................................... 142

Data Exchange ............................................................................................................................................ 143


20. Access Control-Security Feature ........................................................................................................ 145

External Interfaces ...................................................................................................................................... 148

Infrastructure Security................................................................................................................................ 149

Security Compliance ................................................................................................................................... 150

Four Levels(Editions) of Snowflake Security .............................................................................................. 152

Encryption .................................................................................................................................................. 155

Network security ........................................................................................................................................ 160

Private Connectivity to Snowflake Internal Stages .................................................................................... 160


Federated Authentication & SSO ............................................................................................................... 161

Key Pair Authentication & Key Pair Rotation ............................................................................................. 163

External Function ....................................................................................................................................... 164

Column Level Security ................................................................................................................................ 166

Row Level Security ..................................................................................................................................... 167

Feature / Edition Matrix ............................................................................................................................. 167

Release Management .............................................................................................................................. 167

Security, Governance, & Data Protection ............................................................................................... 168

Data Replication & Failover .................................................................................................................... 169

Data Sharing ........................................................................................................................................... 169

Customer Support................................................................................................................................... 170


21. Database Replication .......................................................................................................................... 171

Primary Database ....................................................................................................................................... 171

Database Failover/Fallback ........................................................................................................................ 173

Understanding Billing for Database Replication ......................................................................................... 174

Replication and Automatic Clustering ........................................................................................................ 175

Replication and Materialized Views............................................................................................................ 175

Replication and External Tables.................................................................................................................. 175

Replication and Policies (Masking & Row Access) ...................................................................................... 176

Time Travel ................................................................................................................................................. 176


22. Search Optimization Service .............................................................................................................. 177
23. Account Usage Views ......................................................................................................................... 178
24. Differences Between Account Usage and Information Schema ........................................................ 178
25. Parameters.......................................................................................................................................... 178
26. Connectors and Drivers ...................................................................................................................... 182
Snowflake Connector for Python ............................................................................................................... 182

Snowflake Spark Connector ....................................................................................................................... 182

JDBC Driver................................................................................................................................................ 183

ODBC Driver .............................................................................................................................................. 183

PHP PDO Driver for Snowflake ................................................................................................................. 183

Snowflake Kafka Connector ....................................................................................................................... 183

Schema of Tables for Kafka Topics ......................................................................................................... 184

Workflow for the Kafka Connector ........................................................................................................ 185

Snowflake SQL Api ..................................................................................................................................... 187


27. Metadata Fields in Snowflake ............................................................................................................. 188
28. Few Facts to remember ...................................................................................................................... 189
29. Introduction to Snowpark .................................................................................................................... 193
30. SnowPro Core Certification (COF-C02) ............................................................................................. 195
Domain 1.0: Snowflake Cloud Data Platform Features and Architecture ................................................. 195

Snowflake Cloud Data Platform Features and Architecture Study Resources........................................... 196
Domain 2.0: Account Access and Security................................................................................................ 196
Domain 2.0: Account Access and Security Study Resources ............................................................... 197
Domain 3.0: Performance Concepts.......................................................................................................... 198
Domain 3.0: Performance Concepts Study Resources ......................................................................... 198
Domain 4.0: Data Loading and Unloading................................................................................................. 200
Domain 4.0: Data Loading and Unloading Study Resources ................................................................ 200
Domain 5.0: Data Transformations ............................................................................................................ 201
Domain 5.0: Data Transformations Study Resources ........................................................................... 201
Domain 6.0: Data Protection and Data Sharing ........................................................................................ 202
Domain 6.0: Data Protection and Data Sharing Study Resources ........................................................ 203
1. Introduction to Snowflake
Data Warehouse Basics

OLTP VS OLAP

Online transaction processing (OLTP): OLTP supports transaction-oriented applications, typically in a
3-tier architecture, and administers the day-to-day transactions of an organization.

Online Analytical Processing (OLAP): OLAP consists of software tools used to analyze data for business
decisions. OLAP provides an environment for gaining insights from data retrieved from multiple database
systems at one time.
Comparison of OLAP vs OLTP:

1. Definition
   OLAP: Well known as an online database query management system.
   OLTP: Well known as an online database modifying system.

2. Data source
   OLAP: Consists of historical data from various databases; different OLTP databases are used as data sources for OLAP.
   OLTP: Consists only of operational, current data; the original data source is OLTP and its transactions.

3. Method used
   OLAP: Makes use of a data warehouse.
   OLTP: Makes use of a standard database management system (DBMS).

4. Application
   OLAP: Subject-oriented. Used for data mining, analytics, decision making, etc.
   OLTP: Application-oriented. Used for business tasks.

5. Normalization
   OLAP: Tables are not normalized.
   OLTP: Tables are normalized (3NF).

6. Usage of data
   OLAP: The data is used in planning, problem-solving, and decision-making.
   OLTP: The data is used to perform day-to-day fundamental operations.

7. Volume of data
   OLAP: A large amount of data is stored, typically in TB or PB.
   OLTP: The size of the data is relatively small, as historical data is archived; for example, MB or GB.

8. Queries
   OLAP: Relatively slow because the amount of data involved is large; queries may take hours.
   OLTP: Very fast, as queries operate on only around 5% of the data.

9. Update
   OLAP: The database is not often updated, so data integrity is unaffected.
   OLTP: The data integrity constraint must be maintained in an OLTP database.

10. Backup and recovery
    OLAP: Only needs backup from time to time compared to OLTP.
    OLTP: The backup and recovery process is maintained rigorously.

11. Processing time
    OLAP: Processing complex queries can take a long time.
    OLTP: Comparatively fast, because of simple and straightforward queries.

12. Types of users
    OLAP: This data is generally managed by CEOs, MDs, and GMs.
    OLTP: This data is managed by clerks and managers.

13. Operations
    OLAP: Only read and rarely write operations.
    OLTP: Both read and write operations.

14. Database design
    OLAP: Designed with a focus on the subject.
    OLTP: Designed with a focus on the application.

15. Productivity
    OLAP: Improves the efficiency of business analysts.
    OLTP: Enhances the user's productivity.
ON-PREMISE Vs CLOUD

On-Premise

1. On-prem solutions sit on your local network, which means a high upfront cost, as you must invest in
hardware and the appropriate software licenses.
2. You need the right skills in-house, which may involve hiring a consultant to assist with installation and
ongoing support.
3. The advantage of on-premise is that you control every aspect of the repository. You also control
when and how data leaves your network.

Cloud System

1. Cloud solutions are highly cost-effective, with minimal upfront costs.
2. Simply set up an account with a cloud host and you're ready to go. Many cloud environments are
effectively plug-and-play, especially if you use a cloud-based ETL tool.
3. Ongoing costs are typically a monthly or annual subscription, with flexible pricing depending on your
data storage needs.
4. There is also the advantage of off-site data backups, which are vital for disaster recovery.
5. Using a cloud solution may raise some security concerns, as you will be transmitting sensitive
information to a third party, but many cloud providers offer strong encryption and high-level
physical security.

Cloud Service Models


Cloud Deployment Models

Private Cloud
• Exclusive use by a single organization comprising multiple consumers (e.g. business units).
• The cloud computing platform is implemented in a secure environment safeguarded by a firewall,
under the governance of the IT department of that particular customer.

Community Cloud
• Provisioned for exclusive use by a specific community of consumers from organizations that have
shared concerns.
• It may be owned, managed, and operated by one or more of the organizations in the community,
a third party, or some combination of them, and it may exist on or off premises.
• The setup is mutually shared between many organizations that belong to a particular community.

Public Cloud
• It may be owned, managed, and operated by a business, academic, or government organization,
or some combination of them.
• Services are delivered over a network that is open for public use.

Hybrid Cloud
• The cloud infrastructure is a composition of two or more distinct cloud infrastructures
(private, community, or public) that remain unique entities, but are bound together by
standardized or proprietary technology that enables data and application portability.
ETL VS ELT

Modern Data Stack

https://blog.dataiku.com/dataikus-role-in-the-modern-data-stack

Reverse ETL completes the data-integration loop by copying data from the warehouse back into operational
systems of record, so that teams can act on the same data that has been powering the reports they
consume.
Data Cloud Platform


Data Cloud
Over 400 million SaaS data sets remain siloed globally, isolated in cloud data storage and on-premise
data centers. The Data Cloud eliminates these silos, allowing you to seamlessly unify, analyze, share, and
even monetize your data.
Snowflake's cloud data platform supports multiple data workloads, from Data Warehousing and Data
Lake to Data Engineering, Data Science, and Data Application development across multiple cloud
providers and regions from anywhere in the organization. Snowflake’s unique architecture delivers near-
unlimited storage and computing in real time to virtually any number of concurrent users in the Data Cloud.

Snowflake Data Marketplace


The Snowflake Marketplace allows businesses to offer, discover, and consume live, governed data and
data services at scale, without the latency, effort, and cost required by traditional marketplaces. Snowflake
Marketplace users can access data sets from data providers such as Weather Source, Safegraph, FactSet,
Zillow, and more.
Snowflake for Data Exchange allows businesses to create their own version of a data marketplace to
manage data access between business units, partners, customers, and other stakeholders, and across
your supply chain.
Snowflake History

1. Snowflake history:

❖ Founded in 2012 in “stealth mode” by industry veterans
❖ Initial customers came on board in 2014
❖ Became generally available in 2015
❖ Cloud providers:
   o Amazon Web Services (AWS): supported since inception
   o Microsoft Azure: generally available as of September 2018
   o Google Cloud Platform: generally available in 2020
❖ Snowflake’s global presence and deployments continue to grow
   o Currently available in select regions within each cloud provider

2. Snowflake’s mission is to enable every organization to be data-driven.


3. Snowflake is an analytical data warehouse providing Software-as-a-Service (SaaS).
4. Snowflake’s data warehouse is not built on an existing database or “big data” software platform
such as Hadoop.
5. The Snowflake data warehouse uses a new SQL database engine with a unique architecture
designed for the cloud.
Cloud data platform

Why Snowflake
Supported Regions and Platforms


Considerations for Choosing the Hosting Region for Your Account

1. When you request a Snowflake account, you choose the region where the account is located.
2. If latency is a concern, choose the available region with the closest geographic proximity
to your end users.
3. The cloud provider supplies additional backup and disaster recovery beyond the standard recovery
support provided by Snowflake.
4. Snowflake does not place any restrictions on the region where you choose to locate each account.
5. Choosing a specific region may have cost implications, due to pricing differences between
regions.
6. If you are a government agency or a commercial organization that must comply with specific privacy
and security requirements of the US government, you can choose between two dedicated
government regions provided by Snowflake.
7. Snowflake does not move data between accounts, so any data in an account in a region remains in
that region unless users explicitly choose to copy, move, or replicate the data.

Differences Between Regions


8. Snowflake features and services are identical across regions except for some newly-introduced
features (based on cloud platform or region). However, there are some differences in unit costs for
credits and data storage between regions.
9. Another factor that impacts unit costs is whether your Snowflake account is On Demand or Capacity.
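
To see which regions Snowflake offers and confirm where your own account runs, the following commands can be used; a minimal sketch using standard Snowflake SQL:

    -- List all regions in which Snowflake accounts can be provisioned
    SHOW REGIONS;

    -- Show the region hosting the current account (e.g. AWS_AP_SOUTH_1)
    SELECT CURRENT_REGION();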
Editions

Standard Edition

1. Standard Edition is the introductory-level offering, providing full, unlimited access to all of
Snowflake’s standard features. It provides a strong balance between features, level of support,
and cost.

Enterprise Edition

2. Enterprise Edition provides all the features and services of Standard Edition, with additional
features designed specifically for the needs of large-scale enterprises and organizations.

Business Critical Edition

3. Business Critical Edition, formerly known as Enterprise for Sensitive Data (ESD)
4. It offers even higher levels of data protection to support the needs of organizations with
extremely sensitive data, particularly PHI data that must comply with HIPAA and HITRUST
CSF regulations.
5. It includes all the features and services of Enterprise Edition, with the addition of enhanced
security and data protection.
6. In addition, database failover/failback adds support for business continuity and disaster recovery.

Virtual Private Snowflake (VPS)

7. Virtual Private Snowflake offers the highest level of security for organizations that have the strictest
requirements, such as financial institutions and other large enterprises that collect, analyze,
and share highly sensitive data.
8. All new accounts, regardless of Snowflake Edition, receive Premier support, which includes 24/7
coverage.

9. It includes all the features and services of Business-Critical Edition, but in a completely separate
Snowflake environment, isolated from all other Snowflake accounts (i.e. VPS accounts do not share
any resources with accounts outside the VPS)
10. A hostname for a Snowflake account starts with an account identifier and ends with the Snowflake
domain (snowflakecomputing.com). Snowflake supports two formats to use as the account
identifier in your hostname:
o Account name (preferred)
o Account locator
11. Example
o Organization Name : NHPREQQ
o Account Name : MRFSPORE
o Account_locator : JB92030
o Region Name : ap-south-1
o Cloud Provider Name : aws

https://nhpreqq-xm17812.snowflakecomputing.com

or

https://jb92030.ap-south-1.aws.snowflakecomputing.com
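
To confirm these identifier values for the account you are connected to, the context functions below can be run from any worksheet; a minimal sketch (the example organization and account names above are illustrative):

    -- Organization name and account name (the preferred identifier format)
    SELECT CURRENT_ORGANIZATION_NAME(), CURRENT_ACCOUNT_NAME();

    -- Legacy account locator and region (used in the second URL format)
    SELECT CURRENT_ACCOUNT(), CURRENT_REGION();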
Pricing

Snowflake Credits

1. Snowflake credits are used to pay for the consumption of resources on Snowflake.
2. If one server runs for one hour, one credit is charged.
3. A Snowflake credit is a unit of measure, and it is consumed only when a customer is using resources,
such as when:

a) a virtual warehouse is running
b) the cloud services layer is performing work
c) serverless features are used

Virtual Warehouses Size

4. Snowflake supports a wide range of virtual warehouse sizes.

5. The size of the virtual warehouse determines how fast queries will run.
6. When a virtual warehouse is not running, it does not consume any Snowflake credits.
7. Each virtual warehouse size consumes credits at its own hourly rate, billed by the second with a
one-minute minimum.

Cloud Services

8. Cloud services resources are automatically assigned by Snowflake based on the requirements of
the workload.
9. Typical utilization of cloud services (up to 10% of daily compute credits) is included for free, which
means most customers will not see incremental charges for cloud services usage.
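
Credit consumption, including the cloud services portion, can be reviewed through the shared ACCOUNT_USAGE schema; a minimal sketch against the standard WAREHOUSE_METERING_HISTORY view:

    -- Credits consumed per warehouse over the last 7 days,
    -- split into compute and cloud services
    SELECT warehouse_name,
           SUM(credits_used_compute)        AS compute_credits,
           SUM(credits_used_cloud_services) AS cloud_services_credits,
           SUM(credits_used)                AS total_credits
    FROM   snowflake.account_usage.warehouse_metering_history
    WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY total_credits DESC;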
On-Demand Buying

Pre-purchased Capacity
https://www.snowflake.com/pricing/
Snowflake Releases
Snowflake is committed to providing a seamless, always up-to-date experience for its
users while also delivering ever-increasing value through rapid development and continual
innovation.

• Snowflake deploys new releases each week. This allows it to regularly deliver service
improvements in the form of new features, enhancements, and fixes.

Each week, Snowflake deploys two planned/scheduled releases:

Full release

A full release may include any of the following:

• New features
• Feature enhancements or updates
• Fixes

In addition, a full release includes the following documentation deliverables (as


appropriate):

• Weekly Release Notes (if needed), published in the Snowflake


Community.

https://docs.snowflake.com/en/release-notes/new-features.html

Full releases may be deployed on any day of the week, except Friday.

Patch release

A patch release includes fixes only. Note that the patch release for a given week
may be canceled if the full release for the week is sufficiently delayed or prolonged.

If needed, additional patch releases are deployed to address any issues that are
encountered during or after the release process.

Behavior change release

Every month, Snowflake deploys one behavior change release. Behavior change
releases contain changes to existing behaviors that may impact customers.
Behavior change releases take place over two months: during the first month, or test
period, the behavior change release is disabled by default, but you can enable it in
your account; during the second month, or opt-out period, the behavior change is
enabled by default, but you can disable it in your account.

Snowflake does not override these settings during the release: if you disable a
release during the testing period, we do not enable it at the beginning of the opt-out
period. At the end of the opt-out period, Snowflake enables the behavior changes in
all accounts. However, you can still request an extension to temporarily disable
specific behavior changes from the release by contacting Snowflake Support.
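
During the test and opt-out periods you can check and toggle a behavior change bundle yourself using system functions; a minimal sketch, where '2022_08' stands in for an actual bundle name:

    -- Check whether the bundle is DISABLED, ENABLED, or RELEASED in this account
    SELECT SYSTEM$BEHAVIOR_CHANGE_BUNDLE_STATUS('2022_08');

    -- Opt in early during the test period
    SELECT SYSTEM$ENABLE_BEHAVIOR_CHANGE_BUNDLE('2022_08');

    -- Opt out during the opt-out period
    SELECT SYSTEM$DISABLE_BEHAVIOR_CHANGE_BUNDLE('2022_08');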
Sign up trial account
2. Introduction to Web UI - Snowsight
Admin Menu

1. Warehouses
2. Resource Monitors
3. Users and Roles
4. Billing
5. Partner Connect
6. Help and Support

Warehouse Creation

System Roles
New Role Creation

New User Creation


Resource Monitor Creation

Data Menu
Database Creation
Schema Creation

Access Privilege Creation

Table Creation
View Creation

Stage Creation
File Format Creation
Activity Menu
Query History

Copy History
Worksheet

1. Create Folder to organize your Worksheets


2. Worksheet Creation
3. Code Auto Complete Feature
4. Format Query
5. Import SQL from File.
6. Automatic Contextual Statistics
o Filled/empty meters
o Histograms
7. Stored Results for Prior Worksheet Versions
8. Custom Filter
9. Worksheet sharing
10. Chart

Code Auto Completion


Format Query

Worksheet Filter
Automatic Contextual Statistics

Worksheet History (Versioning)


Worksheet Charts
3. Snowflake EcoSystem & Partner Connect

• Data Integration
• Business Intelligence (BI)
• Machine Learning & Data Science
• Security & Governance
• SQL Development & Management

• Native Programmatic Interfaces

Partner Connect

https://youtu.be/8sO53KczJ4M

● Snowflake does not provide tools to extract data from source systems and/or visualize data--it relies
upon its Technology Partners to provide those tools
● Snowflake’s relationships with/integrations with Technology Partners are driven largely by customer
requests and needs for them
● Snowflake engages with Technology Partners and works with technologies that are both cloud and on-
premises based
● As most activity in Snowflake revolves around integrating and visualizing data, Data Integration
and Business Intelligence technologies are the most prevalent in the Snowflake Technology
Ecosystem
● Various technologies offer different levels of integrations and advantages with Snowflake:
○ ELT tools like Talend and Matillion leverage Snowflake's scalable compute for data
transformation by pushing transform processing down to Snowflake
○ BI tools like Tableau and Looker offer native connectivity built into their products, with Looker
leveraging Snowflake's scalable in-database compute for querying
○ Snowflake has built a custom Spark library that allows the results of a Snowflake query to be
loaded directly into a DataFrame
● To fully understand the value of Snowflake, one must understand its advantages over each class of
competitor:
○ On-Premises EDW
■ Instant scalability
■ Separation of compute and storage
■ No need for data distribution

○ Cloud EDW
■ Concurrency
■ Automatic failover and disaster recovery
■ Built for the cloud
○ Hadoop
■ No hardware to manage
■ No need to manage files
■ Native SQL (including on semi-structured)
○ Data Engines
■ No need to manage files
■ Automated cluster management
■ Native SQL
○ Apache Spark
■ No need to manage data files
■ Automated cluster management
■ Full SQL Support

● Snowflake Ecosystem and Competitive Positioning Self-Guided Learning Material

https://www.snowflake.com/partners/technology-partners/

Code of Conduct for Competitive Partnerships

Overview of the Ecosystem

Snowflake Partner Connect

Getting Started on Snowflake with Partner Connect (Video)

Cloud Data Warehouse Buyer’s Guide

Snowflake Partners YouTube Playlist (Videos)

Snowflake Customers YouTube Playlist (Videos)


4. Snowflake Architecture

1. Snowflake’s unique architecture is known as multi-cluster, shared data


2. The Snowflake architecture consists of three distinct layers
I. Storage (Databases) Layer
• Cloud object storage via AWS S3, Azure Blob Storage, or Google Cloud Storage
• Snowflake storage scales virtually without limit as data requirements grow
• Costs are based on a daily average of all compressed data storage, including data retained
for Time Travel and Fail-safe
II. Compute (Virtual Warehouses) Layer
• Cloud compute via AWS EC2, Azure compute, or Google Compute Engine
• Can scale up and out to handle workloads, even as queries are running
III. Cloud Services Layer
• Transparent to end users
• The Snowflake “brain”
o Security management
o Infrastructure management
o Metadata management
o Query optimization

3. Snowflake’s query language is ANSI SQL, the most widely used database querying
language. SQL capabilities are natively built into the product.
• Allows customers to leverage the skills they already have
• Enables interoperability with trusted tools, specifically in data integration and business
intelligence
• Promotes simplified migration from legacy platforms
4. SQL functionality can be extended via SQL User Defined Functions (UDFs), Javascript UDFs,
session variables and Stored Procedures

5. Snowflake supports structured and semi-structured data within one fully SQL data warehouse.
• Semi-structured data strings are stored in a column with a data type of “VARIANT”
• Snowflake’s storage methodology optimizes semi-structured, or VARIANT, storage based on
repeated elements
• Just like structured data, semi-structured data can be queried using SQL while incorporating
JSON path notation (see the example at the end of this section)

6. Snowflake Architecture - A Deeper Dive Self-Guided Learning Material


• Snowflake Architecture
• Data Storage Considerations
• Virtual Warehouses
• Cloud Services
• The Snowflake X-Factor: Separate Metadata Processing (Video)
• Queries
• UDFs (User-Defined Functions)
• Semi-Structured Data Types
• Querying Semi-Structured Data
• Using Persisted Query Results
• Accelerating BI Queries with Caching in Snowflake (Video)
• Top 10 Cool Snowflake Features, #10: Snowflake Query Result Sets Available to Users
via History
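
Tying together points 4 and 5 above, the sketch below stores a JSON document in a VARIANT column and queries it with SQL plus JSON path notation (the table and field names are hypothetical):

    CREATE OR REPLACE TABLE demo_events (payload VARIANT);

    -- Load a JSON string into the VARIANT column
    INSERT INTO demo_events
      SELECT PARSE_JSON('{"user": {"name": "Lalit", "city": "Pune"}, "clicks": 3}');

    -- Query with path notation and cast to standard SQL types
    SELECT payload:user.name::STRING AS user_name,
           payload:clicks::INT       AS clicks
    FROM   demo_events;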
5. Virtual warehouse
Virtual Warehouse Parallel Processing
MAX_CONCURRENCY_LEVEL Parameter

Type : Object (for warehouses) — Can be set for Account » Warehouse

Data Type : Number

Description

Specifies the concurrency level for SQL statements (i.e. queries and DML) executed by a warehouse. When
the level is reached, the operation performed depends on whether the warehouse is a single or multi-cluster
warehouse:

Single or multi-cluster (in Maximized mode): Statements are queued until already-allocated resources are
freed or additional resources are provisioned, which can be accomplished by increasing the size of the
warehouse.

Multi-cluster (in Auto-scale mode): Additional warehouses are started.

MAX_CONCURRENCY_LEVEL can be used in conjunction with


the STATEMENT_QUEUED_TIMEOUT_IN_SECONDS parameter to ensure a warehouse is never
backlogged.

As each statement is submitted to a warehouse, Snowflake allocates resources for executing the statement;
if there aren’t enough resources available, the statement is queued or additional warehouses are started,
depending on the warehouse.

The actual number of statements executed concurrently by a warehouse might be more or less than the
specified level:

Smaller, more basic statements: More statements might execute concurrently because small statements
generally execute on a subset of the available compute resources in a warehouse. This means they only
count as a fraction towards the concurrency level.

Larger, more complex statements: Fewer statements might execute concurrently.

Default : 8

Lowering the concurrency level for a warehouse increases the compute resource allocation per statement,
which potentially results in faster query performance, particularly for large/complex and multi-statement
queries.

Raising the concurrency level for a warehouse decreases the compute resource allocation per statement;
however, it does not necessarily limit the total number of concurrent queries that can be executed by the
warehouse, nor does it necessarily improve total warehouse performance, which depends on the nature of
the queries being executed.

Note that, as described earlier, this parameter impacts multi-cluster warehouses (in Auto-scale mode)
because Snowflake automatically starts a new warehouse within the multi-cluster warehouse to avoid
queuing. Thus, lowering the concurrency level for a multi-cluster warehouse (in Auto-scale mode) potentially
increases the number of active warehouses at any time.

Also, remember that Snowflake automatically allocates resources for each statement when it is submitted
and the allocated amount is dictated by the individual requirements of the statement. Based on this, and
through observations of user query patterns over time, we’ve selected a default that balances performance
and resource usage.
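
A minimal sketch of adjusting these two parameters on a warehouse (the warehouse name is hypothetical):

    -- Give each statement more resources and cap how long statements may wait in the queue
    ALTER WAREHOUSE etl_wh SET
      MAX_CONCURRENCY_LEVEL = 4
      STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300;

    -- Verify the effective values
    SHOW PARAMETERS IN WAREHOUSE etl_wh;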
A typical Snowflake X-Small warehouse with assumed* compute capacity:

1. For a multi-cluster warehouse, the SCALING_POLICY decides when additional clusters
spawn (see the warehouse definition sketch after this list). When the value is set to ECONOMY, Snowflake starts the additional cluster
in a delayed fashion, giving more importance to cost control over performance. When the
value is set to STANDARD, Snowflake gives importance to performance and starts
additional clusters immediately when queries start queuing.
2. If the MAX_CONCURRENCY_LEVEL value is lower, the additional cluster in a
multi-cluster warehouse might start sooner.
3. The value of the parameter STATEMENT_QUEUED_TIMEOUT_IN_SECONDS affects when
the additional cluster in a multi-cluster warehouse will spawn. The
default is 0, which means no timeout. Any non-zero value is the
number of seconds a queued query will wait; in a single-cluster warehouse the query
will be cancelled if it does not get any compute resources within that number of
seconds, while in a multi-cluster warehouse an additional cluster will be spawned and
compute resources will be allocated to that query.
4. It is very important to use multiple warehouses for different types and sizes of processing
needs. Especially when a process is comparatively complex, deals
with huge data volumes, and consumes a lot of time and compute resources, use a separate,
larger warehouse to handle that process and do not use the same warehouse for any
other needs.
5. Consider tuning the MAX_CONCURRENCY_LEVEL parameter to provide more
compute resources to a single process so that it executes faster. Keeping in mind the
discussion about concurrency within a single-cluster warehouse, below is an example
of how a smaller warehouse can provide performance like a bigger warehouse and in
turn reduce cost. The comparison below provides an example of how this can be done.
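
The points above come together in the warehouse definition itself; a minimal sketch of a multi-cluster warehouse created for a heavy workload (the name and sizes are illustrative):

    CREATE WAREHOUSE IF NOT EXISTS heavy_etl_wh WITH
      WAREHOUSE_SIZE    = 'LARGE'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3           -- multi-cluster, Auto-scale mode (Enterprise Edition and above)
      SCALING_POLICY    = 'ECONOMY'   -- favor cost control; use 'STANDARD' to favor performance
      AUTO_SUSPEND      = 60          -- seconds of inactivity before suspending
      AUTO_RESUME       = TRUE
      STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600;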
Example where the performance consistently improves with warehouse size increase:

Example where the performance does not improve with increase in warehouse sizes. The LARGE
warehouse is not going to improve performance in this case.

*Important Notes:
• The number of CPUs and threads is used for discussion purposes only and is not disclosed by
Snowflake. There is no documentation from Snowflake describing
the exact type of servers used in warehouses or the detailed architecture of those
servers / instances. Even if the number of CPUs in a server or the number of cores in a
CPU differs from the numbers used in this discussion, the basic concepts
behind concurrency and how it is handled in Snowflake do not change.
• We may assume that the number of servers (nodes) per cluster remains the same as per
Snowflake's current standards, but over time the underlying architecture of the compute
clusters will keep changing, in the number of CPUs per node and the amount of RAM and
SSD available, as the compute instances offered by the
underlying cloud platforms change.
Virtual Warehouse Considerations

1. This topic provides general guidelines and best practices for using virtual warehouses in Snowflake to
process queries.
2. It does not provide specific or absolute numbers, values, or recommendations because every query
scenario is different and is affected by numerous factors, including number of concurrent users/queries,
number of tables being queried, and data size and composition, as well as your specific requirements for
warehouse availability, latency, and cost.
3. The keys to using warehouses effectively and efficiently are:

• Experiment with different types of queries and different warehouse sizes to determine the
combinations that best meet your specific query needs and workload.
• Don’t focus on warehouse size. Snowflake utilizes per-second billing, so you can run larger
warehouses (Large, X-Large, 2X-Large, etc.) and simply suspend them when not in use.

How are Credits Charged for Warehouses?


4. Credit charges are calculated based on:

• The warehouse size.


• The number of warehouses (if using multi-cluster warehouses).
• The length of time each warehouse is running.

5. When warehouses are provisioned:


• The minimum billing charge for provisioning a warehouse is 1 minute (i.e. 60 seconds).
• After the first 60 seconds, all subsequent billing for a running warehouse (until it is shut down) is per-second.
• If a warehouse runs for 30 to 60 seconds, it is billed for 60 seconds.
• If a warehouse runs for 61 seconds, it is billed for only 61 seconds.
• If a warehouse runs for 61 seconds, shuts down, and then restarts and runs for less than 60 seconds, it is billed for 121 seconds (61 + 60).
6. Resizing a warehouse provisions additional compute resources for the warehouse; the additional compute resources are billed from the moment they are provisioned (see the example query below).
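To see how these rules translate into actual consumption, per-warehouse credit usage can be inspected in the shared SNOWFLAKE database. A minimal sketch, assuming access to the ACCOUNT_USAGE schema (the one-week window is an arbitrary choice):

-- Credits consumed per warehouse over the last 7 days
select warehouse_name,
       sum(credits_used) as credits_used
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -7, current_timestamp())
group by warehouse_name
order by credits_used desc;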

How Does Query Composition Impact Warehouse Processing?


7. The compute resources required to process a query depend on the size and complexity of the query.
8. In general, query performance scales linearly with warehouse size, particularly for larger, more complex queries.

How Does Warehouse Caching Impact Queries?

Each warehouse, when running, maintains a cache of table data accessed as queries are processed by the warehouse. This enables improved performance for subsequent queries if they are able to read from the cache instead of from the table(s) in the query. The size of the cache is determined by the compute resources in the warehouse: the larger the warehouse (and, therefore, the more compute resources in the warehouse), the larger the cache.

This cache is dropped when the warehouse is suspended, which may result in slower initial performance for
some queries after the warehouse is resumed.

As the resumed warehouse runs and processes more queries, the cache is rebuilt, and queries that are able
to take advantage of the cache will experience improved performance.

Keep this in mind when deciding whether to suspend a warehouse or leave it running. In other words,
consider the trade-off between saving credits by suspending a warehouse versus maintaining the cache of
data from previous queries to help with performance.

When creating a warehouse, the two most critical factors to consider, from a cost and performance
perspective, are:

• Warehouse size (i.e. available compute resources)


• Manual vs automated management (for starting/resuming and suspending warehouses).

Selecting an Initial Warehouse Size

The initial size you select for a warehouse depends on the task the warehouse is performing and the
workload it processes. For example:

• For data loading, the warehouse size should match the number of files being loaded and the amount
of data in each file. For more details, see Planning a Data Load.
• For queries in small-scale testing environments, smaller warehouse sizes (X-Small, Small, Medium) may be sufficient.
• For queries in large-scale production environments, larger warehouse sizes (Large, X-Large, 2X-
Large, etc.) may be more cost effective.

However, note that per-second credit billing and auto-suspend give you the flexibility to start with larger
sizes and then adjust the size to match your workloads. You can always decrease the size of a warehouse at
any time.

• Larger is not necessarily faster for smaller, more basic queries.

Automating Warehouse Resumption


• We recommend enabling/disabling auto-resume depending on how much control you wish to exert
over usage of a particular warehouse:
• If cost and access are not an issue, enable auto-resume to ensure that the warehouse starts
whenever needed. Keep in mind that there might be a short delay in the resumption of the
warehouse due to provisioning.
• If you wish to control costs and/or user access, leave auto-resume disabled and instead manually
resume the warehouse only when needed.
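Both behaviors are configured per warehouse. A minimal sketch, assuming a warehouse named MY_WH (the 5-minute timeout is an arbitrary choice):

-- Suspend after 5 minutes of inactivity; resume automatically when a query arrives
alter warehouse my_wh set
  auto_suspend = 300
  auto_resume = true;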

Scaling Up vs Scaling Out

Snowflake supports two ways to scale warehouses:

• Scale up by resizing a warehouse.


• Scale out by adding warehouses to a multi-cluster warehouse
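The two approaches map to different warehouse properties. A minimal sketch, assuming a warehouse named MY_WH (multi-cluster settings require Enterprise Edition or higher):

-- Scale up: resize the warehouse for larger, more complex queries
alter warehouse my_wh set warehouse_size = 'LARGE';

-- Scale out: allow up to 3 clusters to absorb concurrency spikes
alter warehouse my_wh set min_cluster_count = 1 max_cluster_count = 3;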

Warehouse Resizing Improves Performance

Resizing a warehouse generally improves query performance, particularly for larger, more complex queries.
It can also help reduce the queuing that occurs if a warehouse does not have enough compute resources to
process all the queries that are submitted concurrently. Note that warehouse resizing is not intended for
handling concurrency issues; instead, use additional warehouses to handle the workload or use a multi-
cluster warehouse (if this feature is available for your account).

Snowflake supports resizing a warehouse at any time, even while running. If a query is running slowly and
you have additional queries of similar size and complexity that you want to run on the same warehouse, you
might choose to resize the warehouse while it is running; however, note the following:

• As stated earlier about warehouse size, larger is not necessarily faster; for smaller, basic queries that are
already executing quickly, you may not see any significant improvement after resizing.
• Resizing a running warehouse does not impact queries that are already being processed by the
warehouse; the additional compute resources, once fully provisioned, are only used for queued and new
queries.
• Resizing from a 5XL or 6XL warehouse to a 4XL or smaller warehouse results in a brief period during which the customer is charged for both the new warehouse and the old warehouse while the old warehouse is quiesced.
• Keep this in mind when choosing whether to decrease the size of a running warehouse or keep it at the
current size. In other words, there is a trade-off with regards to saving credits versus maintaining the
warehouse cache.

Multi-cluster Warehouses Improve Concurrency


• Multi-cluster warehouses are designed specifically for handling queuing and performance issues related to large numbers of concurrent users and/or queries, particularly when the number of users/queries tends to fluctuate.
• Unless you have a specific requirement for running in Maximized mode, multi-cluster warehouses should
be configured to run in Auto-scale mode, which enables Snowflake to automatically start and stop
warehouses as needed.
• When choosing the minimum and maximum number of warehouses for a multi-cluster warehouse:
• Keep the default minimum cluster count of 1; this ensures that additional clusters are only started as needed. However, if high availability of the warehouse is a concern, set the value higher than 1. This helps ensure multi-cluster warehouse availability and continuity in the unlikely event that a cluster fails.
• Set the maximum cluster count as large as possible, while being mindful of the warehouse size and corresponding credit costs.
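A minimal sketch of creating a multi-cluster warehouse in Auto-scale mode (the name, size, and limits are illustrative; multi-cluster warehouses require Enterprise Edition or higher):

create warehouse reporting_wh with
  warehouse_size = 'MEDIUM'
  min_cluster_count = 1        -- start additional clusters only as needed
  max_cluster_count = 4        -- cap the number of clusters (and credits)
  scaling_policy = 'STANDARD'  -- favor starting clusters over queuing queries
  auto_suspend = 300
  auto_resume = true;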

Effects of Resizing a Running Warehouse


• Resizing a suspended warehouse does not provision any new compute resources for the warehouse
• Compute resources added to a warehouse start using credits when they are provisioned
• Compute resources are removed from a warehouse only when they are no longer being used to execute
any current statements.

Monitoring Warehouse Load


The web interface provides a query load chart that depicts concurrent queries processed by a warehouse over a two-week period. Query load is calculated by dividing each query's execution time (in seconds) by the length of the interval (in seconds), as in the example below:

Query                  Status    Execution Time / Interval (in Seconds)   Query Load

Query 1                Running   30 / 300                                 0.10
Query 2                Running   201 / 300                                0.67
Query 3                Running   15 / 300                                 0.05
Query 4                Running   30 / 300                                 0.10
Running Load                                                              0.92
Query 5                Queued    24 / 300                                 0.08
Queued Load                                                               0.08
TOTAL WAREHOUSE LOAD                                                      1.00

Using the Load Monitoring Chart to Make Decisions


• The load monitoring chart can help you make decisions for managing your warehouses by showing
current and historic usage patterns.

Slow Query Performance


• When you notice that a query is running slowly, check whether an overloaded warehouse is causing the
query to compete for resources or get queued:
• If the running query load is high or there’s queuing, consider starting a separate warehouse and moving
queued queries to that warehouse. Alternatively, if you are using multi-cluster warehouses, you
could change your multi-cluster settings to add additional warehouses to handle higher concurrency
going forward.
• If the running query load is low and query performance is slow, you could resize the warehouse to
provide more compute resources. You would need to restart the query once all the new resources were
fully provisioned to take advantage of the added resources.

Peak Query Performance


• Analyze the daily workload on the warehouse over the previous two weeks. If you see recurring usage
spikes, consider moving some of the peak workload to its own warehouse and potentially running the
remaining workload on a smaller warehouse.

Excessive Credit Usage


• Analyze the daily workload on the warehouse over the previous two weeks. If the chart shows recurring
time periods when the warehouse was running and consuming credits, but the total query load was less
than 1 for substantial periods of time, the warehouse use is inefficient.
6. Role Based Access Control
1. Snowflake provides granular control over access to objects
➢ who can access what objects,
➢ what operations can be performed on those objects
➢ and who can create or alter access control policies.

2. Snowflake’s approach to access control combines aspects from both of the following models:
➢ Discretionary Access Control (DAC): Each object has an owner role, who can in turn grant
access to that object.
➢ Role-based Access Control (RBAC): Access privileges are assigned to roles, which are in turn
assigned to users.

3. The key concepts to understanding access control in Snowflake are:


➢ Securable object: An entity to which access can be granted. Unless allowed by a grant, access
will be denied.
➢ Privilege: A defined level of access to an object. Multiple distinct privileges may be used to
control the granularity of access granted.
➢ Role: An entity to which privileges can be granted. Roles are in turn assigned to users. Note
that roles can also be assigned to other roles, creating a role hierarchy.
➢ User: A user identity recognized by Snowflake, whether associated with a person or program.
4. In the Snowflake model, access to securable objects is allowed via privileges assigned to roles, which
are in turn assigned to other roles or users.

5. In addition, each securable object has an owner that can grant access to other roles.
6. Every securable object is owned by a single role, which is typically the role used to create the object.

7. The owning role has all privileges on the object by default, including the ability to grant or revoke
privileges on the object to other roles.

Roles

8. Roles are the entities to which privileges on securable objects can be granted and revoked.

9. A user can be assigned multiple roles. This allows users to switch roles (i.e. choose which role is
active in the current Snowflake session) to perform different actions using separate sets of privileges.

10. Roles can also be granted to other roles, creating a hierarchy of roles. The privileges associated with
a role are inherited by any roles above that role in the hierarchy.
System Roles
The following diagram illustrates the hierarchy for the system-defined roles along with the recommended
structure for additional, user-defined custom roles:

1. There are a small number of system-defined roles in a Snowflake account. System-defined roles
cannot be dropped. In addition, the privileges granted to these roles by Snowflake cannot be
revoked. Additional privileges can be granted to the system-defined roles, but this is not recommended.

2. System-defined roles are created with privileges related to account management. As a best practice,
it is not recommended to mix account-management privileges and entity-specific privileges in the
same role. If additional privileges are needed, we recommend granting the additional privileges to a
custom role and assigning the custom role to the system-defined role.

ORGADMIN (aka Organization Administrator)

• Role that manages operations at the organization level.


• More specifically, this role:
o Can create accounts in the organization.
o Can view all accounts in the organization (using SHOW ORGANIZATION
ACCOUNTS) as well as all regions enabled for the organization (using SHOW
REGIONS).
o Can view usage information across the organization.

ACCOUNTADMIN (aka Account Administrator)

• Role that encapsulates the SYSADMIN and SECURITYADMIN system-defined roles.


• It is the top-level role in the system and should be granted only to a limited/controlled
number of users in your account.

SECURITYADMIN (aka Security Administrator)

• Role that can manage any object grant globally, as well as create, monitor, and manage users
and roles. More specifically, this role:
• Is granted the MANAGE GRANTS security privilege to be able to modify any grant, including
revoking it.
• Inherits the privileges of the USERADMIN role via the system role hierarchy (e.g.
USERADMIN role is granted to SECURITYADMIN).

USERADMIN (aka User and Role Administrator)

• Role that is dedicated to user and role management only.


• More specifically, this role:
o Is granted the CREATE USER and CREATE ROLE security privileges.
o Can create users and roles in the account.
o This role can also manage users and roles that it owns. Only the role with the
OWNERSHIP privilege on an object (i.e. user or role), or a higher role, can modify the
object properties.

SYSADMIN (aka System Administrator)

• Role that has privileges to create warehouses and databases (and other objects) in an
account.
• If, as recommended, you create a role hierarchy that ultimately assigns all custom roles to the
SYSADMIN role, this role also has the ability to grant privileges on warehouses, databases,
and other objects to other roles.

PUBLIC

• Pseudo-role that is automatically granted to every user and every role in your account.
• The PUBLIC role can own securable objects, just like any other role; however, the objects
owned by the role are, by definition, available to every other user and role in your account.
• This role is typically used in cases where explicit access control is not needed and all users are
viewed as equal with regard to their access rights.

Custom Roles

• Custom roles (i.e. any roles other than the system-defined roles) can be created by the USERADMIN role (or a higher role, such as SECURITYADMIN) as well as by any role to which the CREATE ROLE privilege has been granted.
• By default, a newly-created role is not assigned to any user, nor granted to any other role.
• If a custom role is not assigned to SYSADMIN through a role hierarchy, the system administrators will not be able to manage the objects owned by the role. Only those roles granted the MANAGE GRANTS privilege (typically only the SECURITYADMIN role) will see the objects and be able to modify their access grants.
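A minimal sketch of wiring a custom role into the recommended hierarchy so that SYSADMIN can manage the objects it owns (role and user names are hypothetical):

use role useradmin;
create role analyst;
grant role analyst to user jane;

-- Attach the custom role to the system role hierarchy
use role securityadmin;
grant role analyst to role sysadmin;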

PRIVILEGES

• For each securable object, there is a set of privileges that can be granted on it.

• For existing objects, privileges must be granted on individual objects, e.g. the SELECT
privilege on the mytable table.

• Future grants allow defining an initial set of privileges on objects created in a schema; i.e. the
SELECT privilege on all new tables created in the myschema schema.

• In regular (i.e. non-managed) schemas, use of these commands is restricted to the role that
owns an object (i.e. has the OWNERSHIP privilege on the object) or roles that have the
MANAGE GRANTS global privilege for the object (typically the SECURITYADMIN role).

• In managed access schemas, object owners lose the ability to make grant decisions. Only the
schema owner or a role with the MANAGE GRANTS privilege can grant privileges on objects
in the schema, including future grants, centralizing privilege management.

User, Role, Grants provisioning

Recursive Grants

3. To simplify grant management, privileges can also be granted in bulk on all existing objects of a certain type in a database or schema with a single statement. The statements below grant the HR role usage on a database and all of its schemas, and SELECT on all existing tables in one schema:

use role sysadmin;


grant usage on database MRF_DB to role HR;
// usage on all existing schemas and select on all existing tables
grant usage on all schemas in database MRF_DB to role HR;
grant select on all tables in schema MRF_DB.HR_SCHEMA to role HR;

Future Grants

Future grants allow defining an initial set of privileges to grant on new (i.e. future) objects of a certain type in a database or schema. As new objects are created, the defined privileges are automatically granted to the specified role. Similarly to the example above, to grant privileges on all future schemas and tables at the database level:

// to all future schemas and tables in MRF_DB
grant usage on future schemas in database MRF_DB to role HR;
grant select on future tables in database MRF_DB to role HR;

4. You must define future grants on each object type (schemas, tables, views, streams, etc.)
individually.

5. When future grants are defined at both the database and schema level, the schema level grants
take precedence over the database level grants, and the database level grants are ignored.

6. At database level, the global MANAGE GRANTS privilege is required to grant or revoke privileges
on future objects in a database. Only the SECURITYADMIN and ACCOUNTADMIN system roles
have the MANAGE GRANTS privilege; however, the privilege can be granted to custom roles.

7. Future grants are not supported for:

• Data sharing

• Data replication

Enforcement Model

The Primary Role and Secondary Roles

Every active user session has a “current role,” also referred to as a primary role.
1. For organizations whose security model includes a large number of roles, each with a fine
granularity of authorization via permissions, the use of secondary roles simplifies role
management.

2. All roles that were granted to a user can be activated in a session. Secondary roles are
particularly useful for SQL operations such as cross-database joins that would otherwise require
creating a parent role of the roles that have permissions to access the objects in each database.

3. When a user attempts to create an object, Snowflake compares the privileges available to the
current role in the user’s session against the privileges required to create the object.

For any other SQL actions attempted by the user, Snowflake compares the privileges available to
the current primary and secondary roles against the privileges required to execute the action on
the target objects. If the session has the required privileges on the objects, the action is allowed.
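A minimal sketch of activating secondary roles in a session (role, database, and table names are hypothetical):

-- The primary role governs CREATE privileges; secondary roles add privileges for other actions
use role sales_reader;
use secondary roles all;   -- activate every other role granted to the user

select o.order_id, i.invoice_id
from sales_db.public.orders o
join finance_db.public.invoices i on i.order_id = o.order_id;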

User Creation- Pending


https://docs.snowflake.com/en/sql-reference/sql/create-user.html
Access Control Best Practices
https://docs.snowflake.com/en/user-guide/security-access-control-considerations.html

4. The account administrator role (i.e. users with the ACCOUNTADMIN system role) is the most powerful role in the system.

5. This role alone is responsible for configuring parameters at the account level. Users with the
ACCOUNTADMIN role can view and operate on all objects in the account, can view and manage
Snowflake billing and credit data, and can stop any running SQL statements.

We strongly recommend the following precautions when assigning the ACCOUNTADMIN role to users:

6. Assign this role only to a select/limited number of people in your organization.

7. All users assigned the ACCOUNTADMIN role should also be required to use multi-factor
authentication (MFA) for login (for details, see Configuring Access Control).

8. Assign this role to at least two users. We follow strict security procedures for resetting a
forgotten or lost password for users with the ACCOUNTADMIN role. Assigning the
ACCOUNTADMIN role to more than one user avoids having to go through these procedures
because the users can reset each other’s passwords.

9. Avoid Using the ACCOUNTADMIN Role to Create Objects

10. All securable database objects (such as TABLE, FUNCTION, FILE FORMAT, STAGE, SEQUENCE,
etc.) are contained within a SCHEMA object within a DATABASE. As a result, to access database
objects, in addition to the privileges on the specific database objects, users must be granted the
USAGE privilege on the container database and schema.

11. When a custom role is first created, it exists in isolation. The custom role must
also be granted to any roles that will manage the objects created by the custom
role.
12. With regular (i.e. non-managed) schemas in a database, object owners (i.e. roles
with the OWNERSHIP privilege on one or more objects) can grant access on
those objects to other roles, with the option to further grant those roles the ability
to manage object grants.
13. In a managed access schema, object owners lose the ability to make grant
decisions. Only the schema owner (i.e. the role with the OWNERSHIP privilege on
the schema) or a role with the MANAGE GRANTS privilege can grant privileges on
objects in the schema, including future grants, centralizing privilege management.
14. In Query History, a user cannot view the result set from a query that another user executed.
15. A cloned object is considered a new object in Snowflake. Any privileges granted
on the source object do not transfer to the cloned object.
16. However, a cloned container object (a database or schema) retains any privileges
granted on the objects contained in the source object. For example, a cloned
schema retains any privileges granted on the tables, views, UDFs, and other
objects in the source schema.
7. Organization
https://docs.snowflake.com/en/user-guide-organizations.html

1. An organization is a first-class Snowflake object that links the accounts owned by your business entity.
2. Organizations simplify
• Account management and billing,
• Database Replication and Failover/Fallback,
• Snowflake Secure Data Sharing,
• Other account administration tasks.
3. Once an account is created, ORGADMIN can view the account properties but does not have access to
the account data.
4. Snowflake provides historical usage data for all accounts in your organization via views in the
ORGANIZATION_USAGE schema

Benefits

5. A central view of all accounts within your organization.


SHOW ORGANIZATION ACCOUNTS;

6. Self-service account creation.


create account myaccount1
admin_name = admin
admin_password = 'TestPassword1'
first_name = jane
last_name = smith
email = 'jane.smith@example.com'
edition = enterprise
region = aws_us_west_2;

7. Once an account is created, ORGADMIN can view the account properties but does not have access to
the account data.
8. Data availability and durability by leveraging data replication and failover.
9. Seamless data sharing with Snowflake consumers across regions.
10. Ability to monitor and understand usage across all accounts in the organization

ORGADMIN Role

11. The organization administrator (ORGADMIN) system role is responsible for managing operations at the
organization level.
12. A user with the ORGADMIN role can perform the following actions:
• Create an account in the organization. For more information, see Creating an Account.
• View/show all accounts within the organization. For more information, see Viewing a List of
Organization Accounts.
• View/show a list of regions enabled for the organization. For more information, see Viewing a List of
Regions Available for Your Organization.
• View usage information for all accounts in the organization. For more information, see Organization
Usage.

Organization Functions and Views

13. To support retrieving information about organizations, Snowflake provides the following SQL function:
• SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER

select system$global_account_set_parameter('myaccount1',
    'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');

• In addition, Snowflake provides historical usage data for all accounts in your organization via views in the
ORGANIZATION_USAGE schema in a shared database named SNOWFLAKE.
• For information, see Organization Usage.
8. Accounts and Schema Objects
Database.Schema = Namespace

Schema Objects
Stages
External Stages

1. Loading data from any of the following cloud storage services is supported regardless of the cloud
platform that hosts your Snowflake account:
• Amazon S3
• Google Cloud Storage
• Microsoft Azure

2. You cannot access data held in archival cloud storage classes that require restoration before the data can be retrieved.
3. A named external stage is a database object created in a schema.
a. This object stores the URL to files in cloud storage,
b. the settings used to access the cloud storage account, and
c. format of staged files.

Internal Stages

4. Table Stage

• A table stage is available for each table created in Snowflake.


• This stage type is designed to store files that are staged and managed by one or more users but only loaded into a single table. Table stages cannot be altered or dropped.
• Table stage is not a separate database object; rather, it is an implicit stage tied to the table itself.
A table stage has no grantable privileges of its own.
• To stage files to a table stage, list the files, query them on the stage, or drop them, you must be
the table owner (have the role with the OWNERSHIP privilege on the table).

5. User Stage

• A user stage is allocated to each user for storing files.


• This stage type is designed to store files that are staged and managed by a single user but can be
loaded into multiple tables.
• User stages cannot be altered or dropped.

6. Named Stage

Named stages are database objects that provide the greatest degree of flexibility for data loading:

Named stages are optional but recommended when you plan regular data loads that could involve
multiple users and/or tables.
• A named internal stage is a database object created in a schema.
• This stage type can store files that are staged and managed by one or more users and loaded into
one or more tables.
• Because named stages are database objects, the ability to create, modify, use, or drop them can
be controlled using security access control privileges.
Create Stage Command
-- Internal stage
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ] <internal_stage_name>
internalStageParams
directoryTableParams
[ FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ COPY_OPTIONS = ( copyOptions ) ]
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]

-- External stage
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ] <external_stage_name>
externalStageParams
directoryTableParams
[ FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ COPY_OPTIONS = ( copyOptions ) ]
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]

internalStageParams ::=
[ ENCRYPTION = (TYPE = 'SNOWFLAKE_FULL' | TYPE = 'SNOWFLAKE_SSE') ]

externalStageParams (for Amazon S3) ::=


URL = 's3://<bucket>[/<path>/]'

URL = 's3://<bucket>[/<path>/]'
[ { STORAGE_INTEGRATION = <integration_name> } |
  { CREDENTIALS = ( { { AWS_KEY_ID = '<string>' AWS_SECRET_KEY = '<string>' [ AWS_TOKEN = '<string>' ] } | AWS_ROLE = '<string>' } ) } ]
[ ENCRYPTION = ( [ TYPE = 'AWS_CSE' ] [ MASTER_KEY = '<string>' ] |
                 [ TYPE = 'AWS_SSE_S3' ] |
                 [ TYPE = 'AWS_SSE_KMS' [ KMS_KEY_ID = '<string>' ] ] |
                 [ TYPE = 'NONE' ] ) ]

directoryTableParams (for internal stages) ::=


[ DIRECTORY = ( { ENABLE = TRUE | FALSE } ) ]

directoryTableParams (for Amazon S3) ::=


[ DIRECTORY = ( ENABLE = { TRUE | FALSE }
                [ AUTO_REFRESH = { TRUE | FALSE } ] ) ]

copyOptions ::=
ON_ERROR = { CONTINUE | SKIP_FILE | SKIP_FILE_<num> | SKIP_FILE_<num>% | ABORT_STATEMENT }
SIZE_LIMIT = <num>
PURGE = TRUE | FALSE
RETURN_FAILED_ONLY = TRUE | FALSE
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
ENFORCE_LENGTH = TRUE | FALSE
TRUNCATECOLUMNS = TRUE | FALSE
FORCE = TRUE | FALSE
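Concrete examples of the syntax above; a minimal sketch in which the stage names, bucket, and storage integration are hypothetical:

-- Named internal stage with a CSV file format
create stage my_int_stage
  file_format = (type = csv field_delimiter = ',' skip_header = 1);

-- Named external stage over S3 using a storage integration (preferred over embedding credentials)
create stage my_ext_stage
  url = 's3://mybucket/load/'
  storage_integration = my_s3_int
  file_format = (type = csv);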
Compression of Staged Files

Encryption of Staged Files


The following table describes how Snowflake handles encryption of data files for loading. The options are
different depending on whether the files are staged unencrypted or already-encrypted:

Feature: Unencrypted files
Supported: 128-bit or 256-bit keys
Notes: When staging unencrypted files in a Snowflake internal location, the files are automatically encrypted using 128-bit keys. 256-bit keys can be enabled (for stronger encryption); however, additional configuration is required.

Feature: Already-encrypted files
Supported: User-supplied key
Notes: Files that are already encrypted can be loaded into Snowflake from external cloud storage; the key used to encrypt the files must be provided to Snowflake.
Platform Vs Stages
File Format
File format options specify the type of data contained in a file, as well as other related characteristics about
the format of the data.

The file format options you can specify are different depending on the type of data you plan to load.
Snowflake provides a full set of file format option defaults

Supported File Formats:

Semi-structured File Formats

Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into
relational tables without requiring the definition of a schema in advance.

Named File Formats


Snowflake supports creating named file formats, which are database objects that encapsulate all of the
required format information.

Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted
data.

If file format options are specified in multiple locations, the load operation applies the options in the
following order of precedence:
1. COPY INTO TABLE statement.
2. Stage definition.
3. Table definition.
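A minimal sketch of a named file format that can then be referenced from a stage definition or a COPY statement (the name and options are illustrative):

create or replace file format my_pipe_csv_format
  type = csv
  field_delimiter = '|'
  skip_header = 1
  compression = gzip;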
Tables

External Tables
https://docs.snowflake.com/en/user-guide/tables-external-intro.html#partitioned-external-tables

• External tables reference data files located in a cloud storage (Amazon S3, Google Cloud Storage, or
Microsoft Azure) data lake
• External tables store file-level metadata about the data files such as the file path, a version identifier, and
partitioning information.
• External tables can access data stored in any format supported by COPY INTO <table> statements.
• External tables are read-only, therefore no DML operations can be performed on them; however,
external tables can be used for query and join operations.
• Views can be created against external tables.
• Querying data stored external to the database is likely to be slower than querying native database
tables; however, materialized views based on external tables can improve query performance.
• VALUE : A VARIANT type column that represents a single row in the external file.
• METADATA$FILENAME : A pseudocolumn that identifies the name of each staged data file included in
the external table, including its path in the stage.
• METADATA$FILE_ROW_NUMBER : A pseudocolumn that shows the row number for each record in a
staged data file.

• To create external tables, you are only required to have some knowledge of the file
format and record format of the source data files. Knowing the schema of the data files
is not required.
• When queried, external tables cast all regular or semi-structured data to a variant in the
VALUE column.
• Create Table Command
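A minimal sketch of creating and querying an external table over staged Parquet files (the stage, path, and column name are hypothetical):

create external table ext_sales
  with location = @my_ext_stage/sales/
  auto_refresh = true
  file_format = (type = parquet);

-- All file data is exposed through the VALUE variant column
select value:customer_id::number as customer_id,
       metadata$filename
from ext_sales
limit 10;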
Summary of Data Types
Bulk vs Continuous Loading

1. Bulk Loading Using the COPY Command

This option enables loading batches of data from files already available in cloud storage, or copying
(i.e. staging) data files from a local machine to an internal (i.e. Snowflake) cloud storage location before
loading the data into tables using the COPY command.

Compute Resources

Bulk loading relies on user-provided virtual warehouses, which are specified in the COPY statement. Users
are required to size the warehouse appropriately to accommodate expected loads.

Simple Transformations During a Load

Snowflake supports transforming data while loading it into a table using the COPY command. Options
include:

• Column reordering
• Column omission
• Casts
• Truncating text strings that exceed the target column length

There is no requirement for your data files to have the same number and ordering of columns as your target
table.

2. Continuous Loading Using Snowpipe

This option is designed to load small volumes of data (i.e. micro-batches) and incrementally make them
available for analysis. Snowpipe loads data within minutes after files are added to a stage and submitted for
ingestion. This ensures users have the latest results, as soon as the raw data is available.

Compute Resources

Snowpipe uses compute resources provided by Snowflake (i.e. a serverless compute model). These
Snowflake-provided resources are automatically resized and scaled up or down as required, and are charged
and itemized using per-second billing. Data ingestion is charged based upon the actual workloads.

3. Simple Transformations During a Load

The COPY statement in a pipe definition supports the same COPY transformation options as when bulk
loading data.

In addition, data pipelines can leverage Snowpipe to continuously load micro-batches of data into staging
tables for transformation and optimization using automated tasks and the change data capture (CDC)
information in streams.

4. Data Pipelines for Complex Transformations


A data pipeline enables applying complex transformations to loaded data. This workflow generally leverages
Snowpipe to load “raw” data into a staging table and then uses a series of table streams and tasks to
transform and optimize the new data for analysis.

Views
Secure Views
https://docs.snowflake.com/en/user-guide/views-secure.html
Materialized Views
1. A materialized view is a pre-computed data set derived from a query specification (the SELECT in the
view definition) and stored for later use.
2. Because the data is pre-computed, querying a materialized view is faster than executing a query against
the base table of the view.
3. This performance difference can be significant when a query is run frequently or is sufficiently complex.
4. As a result, materialized views can speed up expensive aggregation, projection, and selection operations,
especially those that run frequently and that run on large data sets.
5. Materialized views are designed to improve query performance for workloads composed of common,
repeated query patterns. However, materializing intermediate results incurs additional costs.
6. Materialized views are particularly useful when:
• Query results contain a small number of rows and/or columns relative to the base table (the table on
which the view is defined).
• Query results contain results that require significant processing, including:
o Analysis of semi-structured data.
o Aggregates that take a long time to calculate.
• The query is on an external table (i.e. data sets stored in files in an external stage), which might have
slower performance compared to querying native database tables.
• The view’s base table does not change frequently.

Deciding When to Create a Materialized View or a Regular View


• The query results from the view don’t change often.
• The results of the view are used often
7. Both materialized views and cached query results provide query performance benefits:

• Materialized views are more flexible than, but typically slower than, cached results.
• Materialized views are faster than tables because of their “cache” (i.e. the query results for
the view); in addition, if data has changed, they can use their “cache” for data that hasn’t
changed and use the base table for any data that has changed.
8. We don’t need to specify a materialized view in a SQL statement in order for the view to be used. The
query optimizer can automatically rewrite queries against the base table or regular views to use the
materialized view instead. For example, suppose that a materialized view contains all of the rows and
columns that are needed by a query against a base table. The optimizer can decide to rewrite the query
to use the materialized view, rather than the base table. This can dramatically speed up a query,
especially if the base table contains a large amount of historical data.
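A minimal sketch of a materialized view that pre-aggregates a large base table (table and column names are hypothetical; materialized views require Enterprise Edition or higher):

create materialized view mv_daily_sales as
  select sale_date,
         store_id,
         sum(amount) as total_amount,
         count(*)    as order_count
  from sales
  group by sale_date, store_id;

-- Queries of this shape against SALES may be rewritten by the optimizer to use the view
select sale_date, sum(amount)
from sales
group by sale_date;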

DML Operations on Materialized Views


• Snowflake does not allow standard DML (e.g. INSERT, UPDATE, DELETE) on materialized views.
• Snowflake does not allow users to truncate materialized views.

Materialized Views and Clustering


• Defining a clustering key on a materialized view is supported and can increase performance in many
situations. However, it also adds costs.
• If you cluster both the materialized view(s) and the base table on which the materialized view(s) are
defined, you can cluster the materialized view(s) on different columns from the columns used to cluster
the base table.
• In most cases, clustering a subset of the materialized views on a table tends to be more cost-effective
than clustering the table itself. If the data in the base table is accessed (almost) exclusively through the
materialized views, and (almost) never directly through the base table, then clustering the base table
adds costs without adding benefit.

Viewing Costs
• MATERIALIZED_VIEW_REFRESH_HISTORY table function (in the Snowflake Information Schema).
• MATERIALIZED_VIEW_REFRESH_HISTORY View view (in Account Usage).

Creating a Materialized View on Shared Data


You can create a materialized view on shared data.
Remember that maintaining materialized views will consume credits. When you create a
materialized view on someone else’s shared table, the changes to that shared table will
result in charges to you as your materialized view is maintained.
Best Practices for Creating Materialized Views
• Filtering rows (e.g. defining the materialized view so that only very recent data is included).In some
applications, the best data to store is the abnormal data. For example, if you are monitoring pressure in a
gas pipeline to estimate when pipes might fail, you might store all pressure data in the base table, and
store only unusually high pressure measurements in the materialized view. Similarly, if you are
monitoring network traffic, your base table might store all monitoring information, while your
materialized view might store only unusual and suspicious information
• Filtering columns (e.g. selecting specific columns rather than “SELECT * …”). Using SELECT * ... to define
a materialized view typically is expensive. It can also lead to future errors; if columns are added to the
base table later (e.g. ALTER TABLE ... ADD COLUMN ...), the materialized view does not automatically
incorporate the new columns.
• Perform resource-intensive operations and store the results so that the resource intensive operations
don’t need to be performed as often.
• You can create more than one materialized view for the same base table. For example, you can create
one materialized view that contains just the most recent data, and another materialized view that stores
unusual data. You can then create a non-materialized view that joins the two tables and shows recent
data that matches unusual historical data so that you can quickly detect unusual situations, such as a
DOS (denial of service) attack that is ramping up.

9. Loading Structured Data


Bulk Loading Overview
1. Both structured and semi-structured data can be loaded into Snowflake.
2. The methods for loading data include:
o Bulk loading into Snowflake tables from local files or external cloud storage
o Continuous loading in micro-batches with Snowpipe
3. Snowflake refers to the location of data files in cloud storage as a stage.
4. The COPY INTO <table> command used for both bulk and continuous data loads (Snowpipe)
supports cloud storage accounts managed by your business entity (i.e. external stages) as well as
cloud storage contained in your Snowflake account (i.e. internal stages).
Loading Using the Web Interface
5. The Snowflake web interface provides a convenient wizard for loading limited amounts of data into a
table from a small set of flat files.
6. Behind the scenes, the wizard uses the PUT and COPY commands to load data; however, the wizard
simplifies the data loading process by combining the two phases (staging files and loading data) into a
single operation and deleting all staged files after the load completes.
Bulk loading from local data source using SnowSQL


Prerequisites

Download the SnowSQL and install it in your local system

Step 1. Log into SnowSQL

Step 2. Create Snowflake Objects

Step 3. Stage the Data Files


Step 4. Copy Data into the Target Table
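A minimal sketch of steps 3 and 4 as run from SnowSQL (the local path, stage, and table names are hypothetical):

-- Step 3: upload local files to a named internal stage (PUT can only be run from a client such as SnowSQL)
put file:///tmp/data/contacts*.csv @my_csv_stage auto_compress=true;

-- Step 4: load the staged files into the target table
copy into contacts
  from @my_csv_stage
  file_format = (type = csv skip_header = 1)
  on_error = 'skip_file';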

Bulk loading from external source (Amazon Web Service)



Copy into table command
/* Standard data load */

COPY INTO [<namespace>.]<table_name>

FROM { internalStage | externalStage | externalLocation }

[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]

[ PATTERN = '<regex_pattern>' ]

[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |

TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]

[ copyOptions ]

[ VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS ]

FILES = ( 'file_name' [ , 'file_name' ... ] )

Specifies a list of one or more file names (separated by commas) to be loaded. If any of the specified files cannot be found, the default behavior ON_ERROR = ABORT_STATEMENT aborts the load operation unless a different ON_ERROR option is explicitly set in the COPY statement.

The maximum number of file names that can be specified is 1000.

PATTERN = 'regex_pattern'

A regular expression pattern string, enclosed in single quotes, specifying the file
names and/or paths to match.

FILE_FORMAT = ( FORMAT_NAME = 'file_format_name' ) or

FILE_FORMAT = ( TYPE = CSV | JSON | AVRO | ORC | PARQUET | XML [ ... ] )

COPY_OPTIONS = ( ... )

ON_ERROR = CONTINUE | SKIP_FILE | SKIP_FILE_num | SKIP_FILE_num% | ABORT_STATEMENT

Default:
Bulk loading using COPY: ABORT_STATEMENT
Snowpipe: SKIP_FILE

SIZE_LIMIT = num

Number (> 0) that specifies the maximum size (in bytes) of data to be loaded for a given COPY
statement.
For example, suppose a set of files in a stage path were each 10 MB in size. If multiple COPY
statements set SIZE_LIMIT to 25000000 (25 MB), each would load 3 files. That is, each COPY
operation would discontinue after the SIZE_LIMIT threshold was exceeded.

PURGE = TRUE | FALSE

Boolean that specifies whether to remove the data files from the stage automatically after the data is
loaded successfully.

Set PURGE=TRUE for the table to specify that all files successfully loaded into the table are purged
after loading:
alter table mytable set stage_copy_options = (purge = true);

copy into mytable;

You can also override any of the copy options directly in the COPY command:
copy into mytable purge = true;

MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE

String that specifies whether to load semi-structured data into columns in the target table that match
corresponding columns represented in the data.

ENFORCE_LENGTH = TRUE | FALSE

Alternative syntax for TRUNCATECOLUMNS with reverse logic (for compatibility with other systems).

Boolean that specifies whether to truncate text strings that exceed the target column length: if TRUE (default), the COPY statement produces an error if a loaded string exceeds the target column length; if FALSE, strings are automatically truncated.

TRUNCATECOLUMNS = TRUE | FALSE

Alternative syntax for ENFORCE_LENGTH with reverse logic (for compatibility with other systems).

Boolean that specifies whether to truncate text strings that exceed the target column length: if TRUE, strings are automatically truncated; if FALSE (default), the COPY statement produces an error if a loaded string exceeds the target column length.

FORCE = TRUE | FALSE

Boolean that specifies to load all files, regardless of whether they’ve been loaded previously and
have not changed since they were loaded. Note that this option reloads files, potentially duplicating
data in a table.

In the following example, the first command loads the specified files and the second command forces the
same files to be loaded again (producing duplicate rows), even though the contents of the files have not
changed:
copy into load1 from @%load1/data1/ files=('test1.csv', 'test2.csv');

copy into load1 from @%load1/data1/ files=('test1.csv', 'test2.csv') force=true;

LOAD_UNCERTAIN_FILES = TRUE | FALSE

Boolean that specifies to load files for which the load status is unknown. The COPY command skips
these files by default.

Validate files in a stage without loading:

Run the COPY command in validation mode and see all errors:
copy into mytable validation_mode = 'RETURN_ERRORS';
Snowpipe - Continuous Loading
Snowpipe enables loading data from files as soon as they’re available in a stage. This means you can load
data from files in micro-batches, making it available to users within minutes, rather than manually executing
COPY statements on a schedule to load larger batches.

How Does Snowpipe Work?

A pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe. The
COPY statement identifies the source location of the data files (i.e., a stage) and a target table. All data types
are supported, including semi-structured data types such as JSON and Avro.

Different mechanisms for detecting the staged files are available:

• Automating Snowpipe using cloud messaging


• Calling Snowpipe REST endpoints

Automating Snowpipe Using Cloud Messaging

1. Automated data loads leverage event notifications for cloud storage to inform Snowpipe of the
arrival of new data files to load.
2. Snowpipe copies the files into a queue, from which they are loaded into the target table in a
continuous, serverless fashion based on parameters defined in a specified pipe object.

Calling Snowpipe REST Endpoints

1. Client application calls a public REST endpoint with the name of a pipe object and a list of data
filenames.
2. If new data files matching the list are discovered in the stage referenced by the pipe object, they are
queued for loading.
3. Snowflake-provided compute resources load data from the queue into a Snowflake table based on
parameters defined in the pipe.
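A minimal sketch of a pipe using cloud-messaging auto-ingest (the database, schema, stage, and table names are hypothetical; the cloud storage event notification must be configured separately):

create pipe mydb.raw.events_pipe
  auto_ingest = true
  as
  copy into mydb.raw.events
    from @mydb.raw.events_stage
    file_format = (type = json);

-- Check the pipe's execution state and pending file count
select system$pipe_status('mydb.raw.events_pipe');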
Snowpipe vs Bulk Data Loading

Authentication
Bulk data load: Relies on the security options supported by the client for authenticating and initiating a user session.
Snowpipe: When calling the REST endpoints, requires key pair authentication with JSON Web Token (JWT). JWTs are signed using a public/private key pair with RSA encryption.

Load History
Bulk data load: Stored in the metadata of the target table for 64 days. Available upon completion of the COPY statement as the statement output.
Snowpipe: Stored in the metadata of the pipe for 14 days. Must be requested from Snowflake via a REST endpoint, SQL table function, or ACCOUNT_USAGE view.

Transactions
Bulk data load: Loads are always performed in a single transaction. Data is inserted into the table alongside any other SQL statements submitted manually by users.
Snowpipe: Loads are combined or split into a single or multiple transactions based on the number and size of the rows in each data file. Rows of partially loaded files (based on the ON_ERROR copy option setting) can also be combined or split into one or more transactions.

Compute Resources
Bulk data load: Requires a user-specified warehouse to execute COPY statements.
Snowpipe: Uses Snowflake-supplied compute resources.

Cost
Bulk data load: Billed for the amount of time each virtual warehouse is active.
Snowpipe: Billed according to the compute resources used in the Snowpipe warehouse while loading the files.

Continuous Data Loads (i.e. Snowpipe) and File Sizing

1. Snowpipe is designed to load new data typically within a minute after a file notification is sent
2. In addition to resource consumption, an overhead to manage files in the internal load queue is
included in the utilization costs charged for Snowpipe. Snowpipe charges 0.06 credits per 1000
files queued.
3. Consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a good balance between cost and performance.

Alternatives to Loading Data

It is not always necessary to load data into Snowflake before executing queries.

External Tables (Data Lake)

4. External tables enable querying existing data stored in external cloud storage for analysis without
first loading it into Snowflake.
5. The source of truth for the data remains in the external cloud storage. Data sets materialized in
Snowflake via materialized views are read-only.
6. This solution is especially beneficial to accounts that have a large amount of data stored in
external cloud storage and only want to query a portion of the data; for example, the most recent
data. Users can create materialized views on subsets of this data for improved query
performance.
Data Loading Summary
● Two separate commands can be used to load data into Snowflake:
○ COPY
■ Bulk insert
■ Allows insert on a SELECT against a staged file, but a WHERE clause cannot be used
■ More performant
○ INSERT
■ Row-by-row insert
■ Allows insert on a SELECT against a staged file, and a WHERE clause can be used
■ Less performant
● Snowflake also offers a continuous data ingestion service, Snowpipe, to detect and load streaming
data:
○ Snowpipe loads data within minutes after files are added to a stage and submitted for
ingestion.
○ The service provides REST endpoints and uses Snowflake-provided compute resources to
load data and retrieve load history reports.
○ The service can load data from any internal (i.e. Snowflake) or external (i.e. AWS S3 or Microsoft
Azure) stage.
○ With Snowpipe’s serverless compute model, Snowflake manages load capacity, ensuring
optimal compute resources to meet demand. In short, Snowpipe provides a “pipeline” for
loading fresh data in micro-batches as soon as it’s available.
● To load data into Snowflake, the following must be in place:
○ A Virtual Warehouse
○ A pre-defined target table
○ A Staging location with data staged
○ A File Format
● Snowflake supports loading from the following file/data types:
○ Text Delimited (referenced as CSV in the UI)
○ JSON
○ XML
○ Avro
○ Parquet
○ ORC
● Data must be staged prior to being loaded, either in an Internal Stage (managed by Snowflake) or
an External Stage (self-managed) in AWS S3 or Azure Blob Storage
● As data is loaded:
○ Snowflake compresses the data and converts it into an optimized internal format for efficient
storage, maintenance, and retrieval.
○ Snowflake gathers various statistics for databases, tables, columns, and files and stores this
information in the Metadata Manager in the Cloud Services Layer for use in query optimization

● Working With Snowflake - Loading and Querying Self-Guided Learning Material
○ Summary of Data Loading Features
○ Getting Started - Introduction to Data Loading (Video)
○ Loading Data
○ Bulk Load Using COPY
○ COPY INTO <table>
○ INSERT
○ Getting Started - Introduction to Databases and Querying (Video)
○ Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
○ Processing JSON data in Snowflake (Video)
○ Queries
○ Analyzing Queries Using Query Profile
○ Using the History Page to Monitor Queries

Reference: https://community.snowflake.com/s/article/Performance-of-Semi-Structured-Data-Types-in-Snowflake

For Further Study:

https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html#object

https://docs.snowflake.com/en/user-guide/querying-semistructured.html#sample-data-used-in-examples

https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
Data Loading Considerations
This set of topics provides best practices, general guidelines, and important considerations for bulk data
loading using the COPY INTO <table> command.

Preparing Your Data Files - Best Practices


General File Sizing Recommendations

• The number of load operations that run in parallel cannot exceed the number of data files to be loaded.
To optimize the number of parallel operations for a load, we recommend aiming to produce data files
roughly 100-250 MB (or larger) in size compressed.

Semi-structured Data Size Limitations

• The VARIANT data type imposes a 16 MB size limit on individual rows.

Continuous Data Loads (i.e. Snowpipe) and File Sizing

• Loading via Snowpipe can take significantly longer for really large files or in cases where an unusual amount of compute resources is necessary to decompress, decrypt, and transform the new data.
• In addition to resource consumption, an overhead to manage files in the internal load queue is included
in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files
queued for loading. Snowpipe charges 0.06 credits per 1000 files queued.
• Consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost (i.e. resources spent on Snowpipe queue management and the actual load)
and performance (i.e. load latency).

Preparing Delimited Text Files

• UTF-8 is the default character set; however, additional encodings are supported.

Staging Data - Best Practices


Organizing Data by Path

• Organizing your data files by path lets you copy any fraction of the partitioned data into Snowflake with
a single command. This allows you to execute concurrent COPY statements that match a subset of files,
taking advantage of parallel operations.
• For example, if you were storing data for a North American company by geographical location, you
might include identifiers such as continent, country, and city in paths along with data write dates:

Canada/Ontario/Toronto/2016/07/10/05/

United_States/California/Los_Angeles/2016/06/01/11/

United_States/New York/New_York/2016/12/21/03/
United_States/California/San_Francisco/2016/08/03/17/
Loading Data- Best Practices

Options for Selecting Staged Data Files

The COPY command supports several options for loading data files from a stage:

1. By path (internal stages) / prefix (Amazon S3 bucket). See Organizing Data by Path for information.

2. Specifying a list of specific files to load.

3. Using pattern matching to identify specific files by pattern.

• These options enable you to copy a fraction of the staged data into Snowflake with a single command.
This allows you to execute concurrent COPY statements that match a subset of files, taking advantage
of parallel operations.

• Of the three options for identifying/specifying data files to load from a stage, providing a discrete list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000 files.
• Pattern matching using a regular expression is generally the slowest of the three options for
identifying/specifying data files to load from a stage; however, this option works well if you exported
your files in named order from your external application and want to batch load the files in the same
order.
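A minimal sketch of each selection option against a hypothetical stage and table:

-- 1. By path / prefix
copy into mytable from @my_stage/2016/07/;

-- 2. Explicit file list (fastest; maximum of 1,000 files)
copy into mytable from @my_stage files = ('data1.csv.gz', 'data2.csv.gz');

-- 3. Pattern matching (slowest, but convenient for name-ordered exports)
copy into mytable from @my_stage pattern = '.*sales_.*[.]csv[.]gz';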

Executing Parallel COPY Statements That Reference the Same Data Files

• When a COPY statement is executed, Snowflake sets a load status in the table metadata
for the data files referenced in the statement. This prevents parallel COPY statements
from loading the same files into the table, avoiding data duplication.
• If one or more data files fail to load, Snowflake sets the load status for those files as load
failed. These files are available for a subsequent COPY statement to load.

Managing Regular Data Loads


Removing Loaded Data Files

When data from staged files is loaded successfully, consider removing the staged files to
ensure the data is not inadvertently loaded again (duplicated).

Staged files can be deleted from a Snowflake stage (user stage, table stage, or named stage)
using the following methods:

• Files that were loaded successfully can be deleted from the stage during a load
by specifying the PURGE copy option in the COPY INTO <table> command.
• After the load completes, use the REMOVE command to remove the files in
the stage.

Removing files ensures they aren’t inadvertently loaded again. It also improves load
performance, because it reduces the number of files that COPY commands must scan to
verify whether existing files in a stage were loaded already.
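A minimal sketch of both clean-up approaches (the stage and table names are hypothetical):

-- Option 1: purge successfully loaded files as part of the load
copy into mytable from @my_stage purge = true;

-- Option 2: remove files explicitly after the load completes
remove @my_stage pattern = '.*[.]csv[.]gz';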
Preparing to Load Data
Data File Compression

We recommend that you compress your data files when you are loading large data sets.

See CREATE FILE FORMAT for the compression algorithms supported for each data type.

When loading compressed data, specify the compression method for your data files. The COMPRESSION
file format option describes how your data files are already compressed in the stage. Set the
COMPRESSION option in one of the following ways:

• As a file format option specified directly in the COPY INTO <table> statement.
• As a file format option specified for a named file format or stage object. The named file format/stage
object can then be referenced in the COPY INTO <table> statement.

Supported Copy Options

Copy options determine the behavior of a data load with regard to error handling, maximum data size, and
so on.For descriptions of all copy options and the default values, see COPY INTO <table>.

Overriding Default Copy Options

You can specify the desired load behavior (i.e. override the default settings) in the COPY INTO <table>
statement, the stage definition, or the table definition.

If copy options are specified in multiple locations, the load operation applies the options in the following
order of precedence:

1. COPY INTO TABLE statement.
2. Stage definition.
3. Table definition.

Supported File Formats

• Structured
  o Delimited (CSV, TSV, etc.): Any valid single-byte delimiter is supported; the default is comma (i.e. CSV).

• Semi-structured
  o JSON
  o Avro: Includes automatic detection and processing of compressed Avro files.
  o ORC: Includes automatic detection and processing of compressed ORC files.
  o Parquet: Includes automatic detection and processing of compressed Parquet files. Currently, Snowflake
    supports the schema of Parquet files produced using the Parquet writer v1; files produced using v2 of the
    writer are not supported.
  o XML: Supported as a preview feature.

Named File Formats

Snowflake supports creating named file formats, which are database objects that encapsulate all of the
required format information.

Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted
data.
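
For instance, a named file format for gzipped, pipe-delimited files might be defined once and then referenced from a stage or a COPY statement (all object names here are illustrative):

create or replace file format my_csv_load_format
  type = 'CSV'
  field_delimiter = '|'
  skip_header = 1
  compression = 'GZIP';

-- Attach the named file format to a stage, or reference it directly in COPY INTO
create or replace stage my_csv_stage
  file_format = my_csv_load_format;

copy into mytable from @my_csv_stage;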

Overriding Default File Format Options

You can define the file format settings for your staged data (i.e. override the default settings) in the
COPY INTO <table> statement, the stage definition, or the table definition.

If file format options are specified in multiple locations, the load operation applies the options in the
following order of precedence:

• COPY INTO TABLE statement.
• Stage definition.
• Table definition.

Monitoring Files Staged Internally

Snowflake maintains detailed metadata for each file uploaded into an internal stage (for users, tables, and
stages), including:

• File name
• File size (compressed, if compression was specified during upload)
• LAST_MODIFIED date, i.e. the timestamp when the data file was initially staged or when it was last
modified, whichever is later

In addition, Snowflake retains historical data for COPY INTO commands executed within the previous 14
days. The metadata can be used to monitor and manage the loading process, including deleting files after
upload completes:

• Use the LIST command to view the status of data files that have been staged.
• Monitor the status of each COPY INTO <table> command on the History page of the Snowflake
web interface.
• Use the VALIDATE function to validate the data files you’ve loaded and retrieve any errors
encountered during the load.
• Use the LOAD_HISTORY Information Schema view to retrieve the history of data loaded into tables
using the COPY INTO command.
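
A minimal sketch of these monitoring commands, using illustrative stage and table names:

-- View the files currently staged
list @my_stage;

-- Return any errors encountered by the most recent COPY INTO execution
select * from table(validate(mytable, job_id => '_last'));

-- Review the 14-day load history for a table from the Information Schema
select *
from table(information_schema.load_history(table_name => 'MYTABLE'))
order by last_load_time desc;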
10. Data Unloading

Data Unloading

The process for unloading data into files is the same as the loading process, except in reverse:

1. Use the COPY INTO <location> command to copy the data from the Snowflake database table
   into one or more files in a Snowflake or external stage.
2. Download the file from the stage:
   • From a Snowflake stage, use the GET command to download the data file(s).
   • From S3, use the interfaces/tools provided by Amazon S3 to get the data file(s).
   • From Azure, use the interfaces/tools provided by Microsoft Azure to get the data file(s).

create or replace file format my_csv_unload_format


type = 'CSV' field_delimiter = ',';

create or replace stage my_unload_stage


file_format = my_csv_unload_format;

copy into @my_unload_stage/unload/ from trips;


get @my_unload_stage/unload/data_7_7_1.csv.gz file://unload;

Copy into Stage – Pending

11. Loading Semistructured Data

Semi-structured data is data that does not conform to the standards of traditional structured data, but it
contains tags or other types of mark-up that identify individual, distinct entities within the data.

Two of the key attributes that distinguish semi-structured data from structured data are nested data
structures and lack of a fixed schema

❖ Structured data requires a fixed schema that is defined before the data can be loaded and
queried in a relational database system. Semi-structured data does not require a prior definition
of a schema and can constantly evolve, i.e. new attributes can be added at any time.
❖ Unlike structured data, which represents data as a flat table, semi-structured data can contain n-
level hierarchies of nested information.
❖ Snowflake provides the same capabilities for semi-structured data that it provides for structured data
❖ Full support for SQL operations, such as joins and aggregations, is available on semi-structured data

Supported File Formats


• XML (Extensible Markup Language)
• JSON (JavaScript Object Notation) is a lightweight, plain-text, data-interchange format based on
a subset of the JavaScript Programming Language.
• Avro is an open-source data serialization and RPC framework originally developed for use with
Apache Hadoop. It utilizes schemas defined in JSON to produce serialized data in a compact
binary format. The serialized data can be sent to any destination (i.e., application or program) and
can be easily deserialized at the destination because the schema is included in the data.
• ORC: Used to store Hive data, the ORC (Optimized Row Columnar) file format was designed for
efficient compression and improved performance for reading, writing, and processing data over
earlier Hive file formats
• Parquet: Parquet is a compressed, efficient columnar data representation designed for projects in
the Hadoop ecosystem. The file format supports complex nested data structures and uses
Dremel record shredding and assembly algorithms.
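
A minimal sketch of loading JSON into a VARIANT column and querying it with path notation (the table, stage, and attribute names are illustrative, not from the original document):

create or replace table raw_json (v variant);

copy into raw_json
  from @my_stage/events/
  file_format = (type = 'JSON');

-- Query nested attributes with path notation; FLATTEN expands an array into rows
select v:device:type::string as device_type,
       f.value::string       as ip_address
from raw_json,
     lateral flatten(input => v:device:ip_addresses) f;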

Time Travel - Pending

Cloning
• A clone is writable and is independent of its source (i.e. changes made to the source or clone are not
reflected in the other object).

• Parameters that are explicitly set on a source database, schema, or table are retained in
any clones created from the source container or child objects.
• To create a clone, your current role must have the following privilege(s) on the source object:
o Tables : SELECT
o Pipes, Streams, Tasks : OWNERSHIP
o Other objects : USAGE
• For databases and schemas, cloning is recursive; however, the following object types
are not cloned:
o External tables
o Internal (Snowflake) stages
• Cloning a table replicates the structure, data, and certain other properties (e.g. STAGE FILE FORMAT ) of
the source table.
• The CREATE TABLE … CLONE syntax includes the COPY GRANTS keywords. If the COPY GRANTS
keywords are used, then the new object inherits any explicit access privileges granted on the original
table but does not inherit any future grants

When queried, external tables cast all regular or semi-structured data to a variant in the
VALUE column.
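
A minimal sketch of zero-copy cloning with COPY GRANTS (object names are illustrative):

-- Clone a table; COPY GRANTS carries over explicit access privileges from the source
create table orders_dev clone orders copy grants;

-- Databases and schemas are cloned recursively
create database analytics_dev clone analytics;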

Column Level Security


https://docs.snowflake.com/en/user-guide/security-column-intro.html

• You can apply the masking policy to one or more table/view columns with the matching data type
• While Snowflake offers secure views to restrict access to sensitive data, secure views present
management challenges due to large numbers of views and derived business intelligence (BI) dashboards
from each view.
• Masking policies support segregation of duties (SoD) through the role separation of policy administrators
from object owners.
o Object owners (i.e. the role that has the OWNERSHIP privilege on the object) do not have the
privilege to unset masking policies.
o Object owners cannot view column data in which a masking policy applies.

• Conditional masking uses a masking policy to selectively protect the column data in a table or view
based on the values in one or more different columns.
• Snowflake supports nested masking policies, such as a masking policy on a table and a masking policy on
a view for the same table.
• You cannot apply a masking policy to the following:
o Shared objects.
o Materialized views (MV)
o Virtual columns.
o External tables.
• Masking policies on columns in a table carry over to a stream on the same table.
• Cloning a schema results in the cloning of all masking policies within the schema.
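
A minimal sketch of a masking policy and attaching it to a column (role, table, and policy names are illustrative):

-- Only the ANALYST role sees clear-text email addresses
create or replace masking policy email_mask as (val string) returns string ->
  case
    when current_role() in ('ANALYST') then val
    else '*********'
  end;

alter table if exists customers
  modify column email set masking policy email_mask;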
Snowpipe
Snowpipe enables loading data from files as soon as they’re available in a stage. This means you can load
data from files in micro-batches, making it available to users within minutes, rather than manually executing
COPY statements on a schedule to load larger batches.

How Does Snowpipe Work?

A pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe. The
COPY statement identifies the source location of the data files (i.e., a stage) and a target table. All data types
are supported, including semi-structured data types such as JSON and Avro.

Different mechanisms for detecting the staged files are available:

• Automating Snowpipe using cloud messaging


• Calling Snowpipe REST endpoints

Automating Snowpipe Using Cloud Messaging

1. Automated data loads leverage event notifications for cloud storage to inform Snowpipe of the
   arrival of new data files to load.
2. Snowpipe copies the files into a queue, from which they are loaded into the target table in a
   continuous, serverless fashion based on parameters defined in a specified pipe object.

Calling Snowpipe REST Endpoints

1. A client application calls a public REST endpoint with the name of a pipe object and a list of data
   filenames.
2. If new data files matching the list are discovered in the stage referenced by the pipe object, they are
   queued for loading.
3. Snowflake-provided compute resources load data from the queue into a Snowflake table based on
   parameters defined in the pipe.

How Is Snowpipe Different from Bulk Data Loading

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#how-is-snowpipe-different-
from-bulk-data-loading

Continuous Data Loads (i.e. Snowpipe) and File Sizing

1. Snowpipe is designed to load new data typically within a minute after a file notification is sent.
2. In addition to resource consumption, an overhead to manage files in the internal load queue is
   included in the utilization costs charged for Snowpipe. Snowpipe charges 0.06 credits per 1,000 files
   queued.
3. Creating a new (potentially smaller) data file once per minute typically leads to a good balance
   between cost and performance.
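
A minimal sketch of a pipe that auto-ingests from an external stage via cloud event notifications (names are illustrative, and the stage is assumed to already have event notifications configured):

create or replace pipe mydb.public.mypipe
  auto_ingest = true
  as
  copy into mydb.public.mytable
  from @mydb.public.my_ext_stage
  file_format = (type = 'JSON');

-- Check the pipe definition and its current status / notification channel
show pipes;
select system$pipe_status('mydb.public.mypipe');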

Streams
• A stream object records data manipulation language (DML) changes made to tables, including inserts,
  updates, and deletes, as well as metadata about each change, so that actions can be taken using the
  changed data. This process is referred to as change data capture (CDC).
• Streams are database objects that reside within a schema (Database -> Schema -> Stream).
• When a stream object is created, it takes an initial snapshot of every row in the source table by
  initializing a point in time (called an offset) as the current transactional version of the table.
• When the stream is consumed, the offset moves forward to capture new data changes.
• Change information mirrors the column structure of the tracked source table and includes additional
  metadata columns that describe each change event:
  o METADATA$ACTION indicates the DML operation (INSERT, DELETE) recorded.
  o METADATA$ISUPDATE indicates whether the operation was part of an UPDATE statement. Updates to
    rows in the source table are represented as a pair of DELETE and INSERT records, with
    METADATA$ISUPDATE set to TRUE.
  o METADATA$ROW_ID specifies the unique row ID, which is used to track changes to a row over time.
• A stream stores only an offset for the source table, not any actual table data. Therefore, you can
  create any number of streams for a table without incurring significant cost.
• When a stream is created for a table, a pair of hidden columns is added to the source table and begins
  to store change tracking metadata. These columns consume a small amount of storage.
• The CDC records returned when querying a stream rely on a combination of the offset stored in the
  stream and the change tracking metadata stored in the table.
• When a stream is dropped and recreated using CREATE OR REPLACE, it loses all of its tracking history
  (offset).
• Multiple queries (SELECT) can independently consume the same data from a stream without changing
  the offset.
• To ensure multiple statements access the same change records in the stream, surround them with an
  explicit transaction statement; this locks the stream.
• Types of Streams
  Standard: A standard table stream tracks all DML changes to the source table, including inserts,
  updates, and deletes (including table truncates).
  Append-only: An append-only table stream tracks row inserts only; update and delete operations
  (including truncates) are not recorded. Supported on standard (local) tables only.
• The stream object leverages the Time Travel feature of the table to enable CDC; if the stream is kept
  beyond that period without being consumed, it becomes stale.
• If the data retention period for a source table is less than 14 days and the stream has not been
  consumed, Snowflake temporarily extends this period to prevent the stream from going stale. The
  period is extended to the stream's offset, up to a maximum of 14 days by default, regardless of the
  Snowflake edition for your account. Extending the data retention period requires additional storage,
  which is reflected in your monthly storage charges.
• To determine whether a stream has become stale, execute the DESCRIBE STREAM or SHOW STREAMS
  command.
• Renaming a source table does not break a stream or cause it to go stale. In addition, if a table is
  dropped and a new table is created with the same name, any streams linked to the original table are
  not linked to the new table.
• You can clone a stream. The clone of a stream inherits the current offset.
• When a database or schema that contains a source table and stream is cloned, any unconsumed
  records in the stream (in the clone) are not accessible.
• Streams are supported on local and external tables.
• Streams cannot track changes in materialized views.
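
A minimal sketch of creating a stream and consuming its change records inside a transaction (table and column names are illustrative):

create or replace table orders (id integer, status string);
create or replace table orders_history like orders;
create or replace stream orders_stream on table orders;

-- DML on the source table is captured by the stream
insert into orders values (1, 'shipped');

-- Reading the stream returns the changed rows plus the METADATA$ columns
select * from orders_stream;

-- Consuming the stream in a DML statement advances its offset
begin;
insert into orders_history
  select id, status from orders_stream where metadata$action = 'INSERT';
commit;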

Tasks

• A Snowflake task, in simple terms, is a scheduler that can help you schedule a single SQL statement or a
  stored procedure.
• Tasks can be combined with table streams for continuous ELT workflows to process recently changed
  table rows.
• Tasks can also be used independently to generate periodic reports by inserting or merging rows into a
  report table, or to perform other periodic work.
• Tasks require compute resources to execute SQL code. Either Snowflake-managed (i.e. the serverless
  compute model) or user-managed (i.e. a virtual warehouse) compute can be chosen for individual tasks.
• With the serverless compute model for tasks, compute resources are automatically resized and scaled up
  or down by Snowflake based on a dynamic analysis of previous runs of the same task.
• There is no event source that can trigger a task; instead, a task runs on a schedule, which can be defined
  when creating a task.
• A scheduled task runs according to the specified cron expression in the local time for a given time zone.
• Users can define a simple tree-like structure of tasks that starts with a root task and is linked together by
  task dependencies.
• A predecessor task can be defined when creating a task (using CREATE TASK ... AFTER) or later
  (using ALTER TASK ... ADD AFTER).
• A task can have a maximum of 100 child tasks.
• All tasks in a simple tree must have the same task owner and be stored in the same database and schema.
• If the predecessor for a child task is removed, then the former child task becomes either a standalone task
  or a root task.
• To modify or recreate any child task in a tree of tasks, the root task must first be suspended.
• When the root task is suspended, all future scheduled runs of the root task are cancelled; however, if any
  tasks are currently running, then those tasks and any descendent tasks continue to run.
• In addition to the task owner, a role that has the OPERATE privilege on the task can suspend or resume
  the task. This role must have the USAGE privilege on the database and schema that contain the task.
• When the owner role of a given task (i.e. the role with the OWNERSHIP privilege on the task) is deleted,
  the task is "re-possessed" by the role that dropped the owner role. This ensures that ownership moves to
  a role that is closer to the root of the role hierarchy.
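
A minimal sketch of a root task on a cron schedule with a dependent child task (warehouse, table, and task names are illustrative):

create or replace task load_orders_task
  warehouse = compute_wh
  schedule = 'USING CRON 0 * * * * UTC'
  as
  insert into orders_history
    select id, status from orders_stream where metadata$action = 'INSERT';

create or replace task refresh_report_task
  warehouse = compute_wh
  after load_orders_task
  as
  insert into daily_report
    select current_date, count(*) from orders_history;

-- Tasks are created in a suspended state; resume the children first, then the root
alter task refresh_report_task resume;
alter task load_orders_task resume;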

Stored Procedures and UDF


1. Stored procedures allow you to extend Snowflake SQL by combining it with JavaScript so that you
   can include programming constructs such as branching and looping.
2. Stored procedures are loosely similar to functions. As with functions, a stored procedure is created
   once and can be executed many times. A stored procedure is created with a CREATE PROCEDURE
   command and is executed with a CALL command.
3. A stored procedure returns a single value. Although you can run SELECT statements inside a stored
   procedure, the results must be used within the stored procedure, or be narrowed to a single value to
   be returned.
4. Snowflake stored procedures use JavaScript and, in most cases, SQL:
   • JavaScript provides the control structures (branching and looping).
   • SQL is executed by calling functions in a JavaScript API.

   Example: Suppose that you want to clean up a database by deleting data older than a specified date.
   You could write multiple DELETE statements, each of which deletes data from one specific table.
   Instead, you can put all of those statements in a single stored procedure and pass a parameter that
   specifies the cut-off date. Then you can simply call the procedure to clean up the database.

   The API enables you to perform operations such as:
   1. Execute a SQL statement.
   2. Retrieve the results of a query (i.e. a result set).

   The JavaScript API consists of the following objects:
   1. Snowflake, which has methods to create a Statement object and execute a SQL command.
   2. Statement, which helps you execute prepared statements and access metadata for those
      prepared statements, and allows you to get back a ResultSet object.
   3. ResultSet, which holds the results of a query (e.g. the rows of data retrieved for a SELECT
      statement).

   var my_sql_command1 = "delete from history_table where event_year < 2016";
   var statement1 = snowflake.createStatement({ sqlText: my_sql_command1 });
   var result = statement1.execute();

5. When calling, using, and getting values back from stored procedures, you often need to convert from
   a Snowflake SQL data type to a JavaScript data type, or vice versa.
6. Snowflake supports overloading of stored procedure names. Multiple stored procedures in the same
   schema can have the same name, as long as their signatures differ, either by the number of
   arguments or the argument types.
7. Stored procedures are not atomic; if one statement in a stored procedure fails, the other statements
   in the stored procedure are not necessarily rolled back. You can use stored procedures with
   transactions to make a group of statements atomic.
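
A minimal sketch of wrapping the JavaScript API calls above in a stored procedure and invoking it with CALL (the procedure name and cut-off parameter are illustrative; history_table comes from the example above):

create or replace procedure purge_history(cutoff_year float)
  returns string
  language javascript
  as
  $$
  // Bound arguments are exposed to JavaScript in upper case (CUTOFF_YEAR)
  var my_sql_command = "delete from history_table where event_year < " + CUTOFF_YEAR;
  var statement1 = snowflake.createStatement({ sqlText: my_sql_command });
  statement1.execute();
  return "Deleted rows older than " + CUTOFF_YEAR;
  $$;

call purge_history(2016);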

12. Stored Procedures and User Defined Functions

Stored procedures allow:
1. Procedural logic (branching and looping), which straight SQL does not support.
2. Error handling.
3. Dynamically creating a SQL statement and executing it.
4. Writing code that executes with the privileges of the role that owns the procedure, rather than with
   the privileges of the role that runs the procedure. This allows the stored procedure owner to delegate
   the power to perform specified operations to users who otherwise could not do so.

Differences Between Stored Procedures and UDFs

1. A stored procedure is called as an independent statement, rather than as part of a statement:

   CALL MyStoredProcedure1(argument_1);

   A function, by contrast, is called as part of a statement:

   SELECT calculate_bonus(emp_salary) FROM employee_tbl;

2. You can call a stored procedure inside another stored procedure; the JavaScript in the outer stored
   procedure can retrieve and store the output of the inner stored procedure. Remember, however, that
   the outer stored procedure (and each inner stored procedure) is still unable to return more than one
   value to its caller.
3. You can call the stored procedure and then call the RESULT_SCAN function, passing it the statement
   ID generated for the stored procedure.
4. You can store a result set in a temporary table or permanent table, and use that table after returning
   from the stored procedure call.
5. If the volume of data is not too large, you can store multiple rows and multiple columns in
   a VARIANT (for example, as a JSON value) and return that VARIANT.
6. A stored procedure runs with either the caller's rights or the owner's rights. It cannot run with both
   at the same time.
7. A caller's rights stored procedure runs with the privileges of the caller. The primary advantage of a
   caller's rights stored procedure is that it can access information about the caller or the caller's
   current session. For example, a caller's rights stored procedure can read the caller's session variables
   and use them in a query.
8. An owner's rights stored procedure runs mostly with the privileges of the stored procedure's owner.
   The primary advantage of an owner's rights stored procedure is that the owner can delegate specific
   administrative tasks, such as cleaning up old data, to another role without granting that role more
   general privileges, such as privileges to delete all data from a specific table.

User Defined Functions


1. User-defined functions (UDFs) let you extend the system to perform operations that are not available
   through the built-in, system-defined functions provided by Snowflake. Snowflake currently supports
   the following languages for writing UDFs:
   • JavaScript: A JavaScript UDF lets you use the JavaScript programming language to manipulate data
     and return either scalar or tabular results.
   • SQL: A SQL UDF evaluates an arbitrary SQL expression and returns either scalar or tabular results.
   • Java: A Java UDF lets you use the Java programming language to manipulate data and return either
     scalar or tabular results.
2. UDFs may be scalar or tabular.
   • A scalar function returns one output row for each input row. The returned row consists of a single
     column/value.
   • A tabular function, also called a table function, returns zero, one, or multiple rows for each input row.
3. To avoid conflicts when calling functions, Snowflake does not allow creating UDFs with the same name
   as any of the system-defined functions.
4. Snowflake supports overloading of SQL UDF names. Multiple SQL UDFs in the same schema can have
   the same name, as long as their argument signatures differ, either by the number of arguments or the
   argument types.
5. For security or privacy reasons, you might not wish to expose the underlying tables or algorithmic
   details for a UDF. With secure UDFs, the definition and details are visible only to authorized users
   (i.e. users who are granted the role that owns the UDF).
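
A minimal sketch of a scalar SQL UDF and a secure variant (function names are illustrative):

create or replace function area_of_circle(radius float)
  returns float
  as
  $$
    pi() * radius * radius
  $$;

-- A secure UDF hides the definition and details from unauthorized users
create or replace secure function area_of_circle_secure(radius float)
  returns float
  as
  $$
    pi() * radius * radius
  $$;

select area_of_circle(3);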

External Function
Calling AWS Lambda service from Snowflake external function using API Integration Object
1. An external function calls code that is executed outside Snowflake.
2. The remotely executed code is known as a remote service.
3. Information sent to a remote service is usually relayed through a proxy service.
4. Snowflake stores security-related external function information in an API integration.

An external function is a type of UDF. Unlike other UDFs, an external function does not
contain its own code; instead, the external function calls code that is stored and executed
outside Snowflake.

Inside Snowflake, the external function is stored as a database object that contains
information that Snowflake uses to call the remote service. This stored information includes
the URL of the proxy service that relays information to and from the remote service.

Remote Service:

1. The remotely executed code is known as a remote service.


2. The remote service must act like a function. For example, it must return a value.
3. Snowflake supports scalar external functions; the remote service must return
exactly one row for each row received.
4. Remote service can be implemented as:

• An AWS Lambda function.


• A Microsoft Azure Function.
• An HTTPS server (e.g. Node.js) running on an EC2 instance.
===========================================================================

================== Tracking Worksheet: CloudFormation Template ============

===========================================================================

Cloud Platform Account Id: 756826086051

lambda function : mysnowflake_lambda

New IAM Role Name: mysnowflake-lambda_role

New IAM Role ARN: arn:aws:iam::756826086051:role/mysnowflake-lambda_role

proxy_service_resource name : mysnowflake_proxy

Resource Invocation URL: https://sd4ykzvsql.execute-api.us-east-1.amazonaws.com/test/


method request arn : arn:aws:execute-api:us-east-1:756826086051:sd4ykzvsql/*/POST/

API_AWS_IAM_USER_ARN: arn:aws:iam::841748965781:user/u5at-s-insa6850

API_AWS_EXTERNAL_ID: WZ00864_SFCRole=3_mQPCuXPCeS072Ucv5HIeoiEMu68=
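
Putting the worksheet values to use, the API integration and external function might be created roughly as follows. This is a sketch: the function name and the resource path appended to the invocation URL are illustrative, and the Lambda itself must return exactly one row per input row.

create or replace api integration my_api_integration
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::756826086051:role/mysnowflake-lambda_role'
  api_allowed_prefixes = ('https://sd4ykzvsql.execute-api.us-east-1.amazonaws.com/test/')
  enabled = true;

-- DESCRIBE returns API_AWS_IAM_USER_ARN / API_AWS_EXTERNAL_ID for the IAM trust policy
describe integration my_api_integration;

create or replace external function my_remote_echo(input string)
  returns variant
  api_integration = my_api_integration
  as 'https://sd4ykzvsql.execute-api.us-east-1.amazonaws.com/test/mysnowflake_proxy';

select my_remote_echo('hello');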
13. Snowflake Scripting
Snowflake Scripting is an extension to Snowflake SQL that adds support for procedural logic.

You can use Snowflake Scripting 1) to write stored procedures and 2) to write procedural code outside of a
stored procedure.

1. Understanding Blocks in Snowflake Scripting


DECLARE
... (variable declarations, cursor declarations, etc.) ...
BEGIN
... (Snowflake Scripting and SQL statements) ...
EXCEPTION
... (statements for handling exceptions) ...
END;

Example

declare
radius_of_circle float;
area_of_circle float;
begin
radius_of_circle := 3;
area_of_circle := pi() * radius_of_circle * radius_of_circle;
return area_of_circle;
end;

2. Working with Variables

declare
profit number(38, 2) default 0.0;
cost number(38, 2) default 100.0;
revenue number(38, 2) default 110.0;
begin
profit := revenue - cost;
return profit;
end;

3. Working with Branching Constructs


IF (<condition>) THEN
  -- Statements to execute if the <condition> is true.
[ ELSEIF ( <condition_2> ) THEN
  -- Statements to execute if the <condition_2> is true.
]
[ ELSE
  -- Statements to execute if none of the conditions is true.
]
END IF;
Example

begin
let count := 1;
if (count < 0) then
return 'negative value';
elseif (count = 0) then
return 'zero';
else
return 'positive value';
end if;
end;

4. Working with Loops


A FOR loop repeats a sequence of steps for a specified number of times or for each row in a result set. Snowflake
Scripting supports the following types of FOR loops:
1. Counter-Based FOR Loops
2. Cursor-Based FOR Loops

--Counter-Based FOR Loops


declare
counter integer default 0;
maximum_count integer default 5;
begin
for i in 1 to maximum_count do
counter := counter + 1;
end for;
return counter;
end;
-- Cursor-Based FOR Loops
-- The following example uses a FOR loop to iterate over the rows in a cursor for the invoices table:

use schema airbnb.public;


create or replace table invoices (price number(12, 2));
insert into invoices (price) values
(11.11),
(22.22);

declare
total_price float;
c1 cursor for select price from invoices;
begin
total_price := 0.0;
for record in c1 do
total_price := total_price + record.price;
end for;
return total_price;
end;

5. Working with Cursors

• You can use a cursor to iterate through query results one row at a time. To retrieve data from the results
  of a query, use a cursor. You can use a cursor in loops to iterate over the rows in the results.
• To use a cursor, do the following:
  1) In the DECLARE section, declare the cursor. The declaration includes the query for the cursor.
  2) Execute the OPEN command to open the cursor. This executes the query and loads the results into
     the cursor.
  3) Execute the FETCH command to fetch one or more rows and process those rows.
  4) Execute the CLOSE command to close the cursor.



create or replace table invoices (id integer, price number(12, 2));


insert into invoices (id, price) values
(1, 11.11),
(2, 22.22);
declare
id integer default 0;
minimum_price number(13,2) default 10.00;
maximum_price number(13,2) default 20.00;
c1 cursor for select id from invoices where price > ? and price < ?;
begin
open c1 using (minimum_price, maximum_price);
fetch c1 into id;
return id;
end;

6. Working with RESULTSETs

In Snowflake Scripting, a RESULTSET is a SQL data type that points to the result set of a query.
Because a RESULTSET is just a pointer to the results, you must do one of the following to access the
results through the RESULTSET:
1) Use the TABLE() syntax to retrieve the results as a table.
2) Iterate over the RESULTSET with a cursor.
---Examples of Using a RESULTSET

create or replace procedure sp_host_rtr()


returns table(id integer,name varchar)
language sql
as
declare
res resultset default (select id,name from airbnb.raw.raw_hosts order by name);
begin
return table(res);
end;
7. Handling Exceptions

1. Snowflake Scripting raises an exception if an error occurs while executing a statement (e.g. if a
statement attempts to DROP a table that doesn’t exist).
2. An exception prevents the next lines of code from executing.
3. In a Snowflake Scripting block, you can write exception handlers that catch specific types of
exceptions declared in that block and in blocks nested inside that block.
4. In addition, for errors that can occur in your code, you can define your own exceptions that you
can raise when errors occur.

declare
my_exception exception (-20002, 'Raised MY_EXCEPTION.');
begin
let counter := 0;
let should_raise_exception := true;
if (should_raise_exception) then
raise my_exception;
end if;
counter := counter + 1;
return counter;
end;
14. Storage - Understanding Micropartitions
Micro partitions

1. All data in Snowflake is stored in database tables, logically structured as collections of columns and
rows. It is helpful to have an understanding of the physical structure behind this logical structure.

2. Traditional data warehouses rely on static partitioning of large tables to achieve acceptable
performance and enable better scaling. However, static partitioning has a number of well-known
limitations, such as maintenance overhead.

3. The Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-
partitioning, that delivers all the advantages of static partitioning without its known limitations.

4. All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units
of storage.

5. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the
actual size in Snowflake is smaller because data is always stored compressed).

6. Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar
fashion.

7. This size and structure allows for extremely granular pruning of very large tables, which can be
comprised of millions, or even hundreds of millions, of micro-partitions.

8. Snowflake stores metadata about all rows stored in a micro-partition, including:


• The range of values for each of the columns in the micro-partition.
• The number of distinct values.
• Additional properties used for both optimization and efficient query processing.
9. Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are
scanned. Columns are also compressed individually within micro-partitions.

10. All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of the underlying micro-
partition metadata. Some operations, such as deleting all rows from a table, are metadata-only operations.

Query Pruning

11. The micro-partition metadata maintained by Snowflake enables precise pruning of columns in micro-
partitions at query run-time, including columns containing semi-structured data.

12. Snowflake uses columnar scanning of partitions so that an entire partition is not scanned if a query
only filters by one column.

13. Snowflake does not prune micro-partitions based on a predicate with a subquery, even if the
subquery results in a constant
15. Clustering
Data Clustering

14. In general, Snowflake produces well-clustered data in tables; however, over time, particularly as DML
occurs on very large tables (as defined by the amount of data in the table, not the number of rows),
the data in some table rows might no longer cluster optimally on desired dimensions.

15. Typically, data stored in tables is sorted/ordered along natural dimensions (e.g. date and/or
geographic regions).

16. This “clustering” is a key factor in queries because table data that is not sorted or is only partially
sorted may impact query performance, particularly on very large tables.

17. To improve the clustering of the underlying table micro-partitions, you can always manually sort
rows on key table columns and re-insert them into the table; however, performing these tasks could
be cumbersome and expensive.

18. Instead, Snowflake supports automating these tasks by designating one or more table
columns/expressions as a clustering key for the table. A table with a clustering key defined is
considered to be clustered.

19. In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected and recorded
for each micro-partition created during the process.

20. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-
partitions during querying, significantly accelerating the performance of queries that reference these
columns.


21. Snowflake maintains clustering metadata for the micro-partitions in a table, including:
• The total number of micro-partitions that comprise the table.
• The number of micro-partitions containing values that overlap with each other (in a specified
subset of table columns).
• The depth of the overlapping micro-partitions.
Clustering Depth

22. The clustering depth for a populated table measures the average depth ( 1 or greater) of the
overlapping micro-partitions for specified columns in a table. The smaller the average depth, the
better clustered the table is with regards to the specified columns.
23. Clustering depth can be used for a variety of purposes, including:
o Monitoring the clustering “health” of a large table, particularly over time as DML is performed
on the table.
o Determining whether a large table would benefit from explicitly defining a clustering key.
o A table with no micro-partitions (i.e. an unpopulated/empty table) has a clustering depth of 0 .
24. The clustering depth for a table is not an absolute or precise measure of whether the table is well-
clustered. Ultimately, query performance is the best indicator of how well-clustered a table is:
o If queries on a table are performing as needed or expected, the table is likely well-clustered.
o If query performance degrades over time, the table is likely no longer well-clustered and may
benefit from clustering.

Clustering Keys


• Note: Clustering keys are not intended for all tables. The size of a table, as well as the query
performance for the table, should dictate whether to define a clustering key for the table.
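
When a clustering key is warranted, it can be defined or changed with ALTER TABLE ... CLUSTER BY, and clustering health can be inspected with the system functions below (the table and column names are illustrative):

-- Define (or change) a clustering key on a large table
alter table orders cluster by (o_orderdate, o_custkey);

-- Inspect average clustering depth and overlap statistics for chosen columns
select system$clustering_information('orders', '(o_orderdate)');
select system$clustering_depth('orders', '(o_orderdate)');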
16. Caching
Using Persisted Query Results
1. When a query is executed, the result is persisted (i.e. cached) for a period of time. At the end of the time
period, the result is purged from the system.
2. If a user repeats a query that has already been run, and the data in the table(s) hasn’t changed since the
last time that the query was run, then the result of the query is the same.

Typically, query results are reused if all of the following conditions are met:

3. The new query syntactically matches the previously-executed query.


4. The query does not include functions that must be evaluated at execution time
(e.g. CURRENT_TIMESTAMP() and UUID_STRING()). Note that the CURRENT_DATE() function is an
exception to this rule; even though CURRENT_DATE() is evaluated at execution time, queries that use
CURRENT_DATE() can still use the query reuse feature.
5. The query does not include user-defined functions (UDFs) or external functions.
6. The table data contributing to the query result has not changed.
7. The persisted result for the previous query is still available.
8. The role accessing the cached results has the required privileges.
9. If the query was a SELECT query, the role executing the query must have the necessary access privileges
for all the tables used in the cached query.
10. If the query was a SHOW query, the role executing the query must match the role that generated the
cached results.
11. Any configuration options that affect how the result was produced have not changed.
12. The table’s micro-partitions have not changed (e.g. been reclustered or consolidated) due to changes to
other data in the table.
Note : Meeting all these conditions does not guarantee that Snowflake reuses the query results.
By default, result reuse is enabled, but can be overridden at the account, user, and session level using
the USE_CACHED_RESULT session parameter.

13. Each time the persisted result for a query is reused, Snowflake resets the 24-hour retention period for
the result, up to a maximum of 31 days from the date and time that the query was first executed. After
31 days, the result is purged and the next time the query is submitted, a new result is generated and
persisted.
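
For example, result reuse (the USE_CACHED_RESULT parameter mentioned above) can be toggled for the current session as follows:

-- Disable reuse of persisted query results for this session (e.g. when benchmarking)
alter session set use_cached_result = false;

-- Re-enable the default behavior
alter session set use_cached_result = true;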

Post-processing Query Results


14. In some cases, you might want to perform further processing on the result of a query that you’ve already
run
• You are developing a complex query step-by-step and you want to add a new layer on top of the
previous query and run the new query without recalculating the partial results from scratch.
• The previous query was a SHOW <objects>, DESCRIBE <object>, or CALL statement, which
returns results in a form that are not easy to reuse.

-- Show the tables that are empty using result_scan

show tables;

select "schema_name", "name" as "table_name", "rows" from table(result_scan(last_query_id()))

where "rows" = 0;

Warehouse Caching Impact Queries


15. Each warehouse, when running, maintains a cache of table data accessed as queries are processed by
the warehouse.
16. This enables improved performance for subsequent queries if they are able to read from the cache
instead of from the table(s) in the query.
17. The size of the cache is determined by the compute resources in the warehouse (i.e. the larger the
warehouse and, therefore, more compute resources in the warehouse), the larger the cache.
18. This cache is dropped when the warehouse is suspended, which may result in slower initial performance
for some queries after the warehouse is resumed.
19. As the resumed warehouse runs and processes more queries, the cache is rebuilt, and queries that are
able to take advantage of the cache will experience improved performance.
20. Keep this in mind when deciding whether to suspend a warehouse or leave it running. In other words,
consider the trade-off between saving credits by suspending a warehouse versus maintaining the cache
of data from previous queries to help with performance.
17. Query and Results history
Overview
The History tab page allows you to view and drill into the details of all queries executed in the last 14 days.
The page displays a historical listing of queries, including queries executed from SnowSQL or other SQL
clients.

The default information displayed for each query includes:

• Current status of queries: waiting in queue, running, succeeded, failed.


• SQL text of your query.
• Query ID.
• Information about the warehouse used to execute the query.
• Query start and end time, as well as duration.
• Information about the query, including number of bytes scanned and number of rows returned.

1. Use the auto-refresh checkbox in the upper right to enable/disable auto-refresh for the session. If
selected, the page is refreshed every 10 seconds. You can also click the Refresh icon to refresh the
display at any time.
2. Use the Include client-generated statements checkbox to show or hide SQL statements run by web
interface sessions outside of SQL worksheets.
3. Use the Include queries executed by user tasks checkbox to show or hide SQL statements executed
or stored procedures called by user tasks.
4. Click any column header to sort the page by the column or add/remove columns in the display.
5. Click the text of a query (or select the query and click View SQL) to view the full SQL for the query.
6. Select a query that has not yet completed and click Abort to abort the query.
7. Click the ID for a query to view the details for the query, including the result of the query and the
Query Profile.
Query History Results
8. Snowflake persists the result of a query for a period of time, after which the result is purged. This
limit is not adjustable.
9. To view the details and result for a particular query, click the Query ID in the History page. The
Query Detail page appears (see below), where you can view query execution details, as well as the
query result (if still available).
10. You can also use the Export Result button to export the result of the query (if still available) to a file.
11. You can view results only for queries you have executed. If you have privileges to view queries
executed by another user, the Query Detail page displays the details for the query, but, for data
privacy reasons, the page does not display the actual query result.

Export Results
12. On any page in the interface where you can view the result of a query (e.g. Worksheets, Query
Detail), if the query result is still available, you can export the result to a file.
13. When you click the Export Result button for a query, you are prompted to specify the file name and
format. Snowflake supports the following file formats for query export:

Comma-separated values (CSV)

Tab-separated values (TSV)

14. You can export results only for queries for which you can view the results (i.e. queries you’ve
executed). If you didn’t execute a query or the query result is no longer available, the Export
Result button is not displayed for the query.
15. The web interface only supports exporting results up to 100 MB in size. If a query result exceeds this
limit, you are prompted whether to proceed with the export.
16. The export prompts may differ depending on your browser. For example, in Safari, you are prompted
only for an export format (CSV or TSV). After the export completes, you are prompted to download
the exported result to a new window, in which you can use the Save Page As… browser option to
save the result to a file.
Viewing Query Profile
In addition to query details and results, Snowflake provides the Query Profile for analyzing query statistics
and details, including the individual execution components that comprise the query. For more information,
see Analyzing Queries Using Query Profile.

<DEMO with Sample Data Queries >


Query profiling

1. Query Profile, available through the Snowflake web interface, provides execution details for a query.
2. For the selected query, it provides a graphical representation of the main components of the
processing plan for the query, with statistics for each component, along with details and statistics for
the overall query.
3. It can be used whenever you want or need to know more about the performance or behavior of a
particular query.
4. It is designed to help you spot typical mistakes in SQL query expressions to identify potential
performance bottlenecks and improvement opportunities.
5. Query IDs can be clicked on, specifically Worksheets & History
6. Run the below query
SELECT SUM(O_TOTALPRICE)
FROM ORDERS , LINEITEM
WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY
AND O_TOTALPRICE > 300
AND L_QUANTITY < (select avg(L_QUANTITY) from LINEITEM);

The interface consists of the following main elements:


1. Steps: If the query was processed in multiple steps, you can toggle between each step.
2. Operator tree : The middle pane displays a graphical representation of all the operator nodes for the selected
step, including the relationships between each operator node.
3. Node list :The middle pane includes a collapsible list of operator nodes by execution time.
4. Overview :The right pane displays an overview of the query profile. The display changes to operator details
when an operator node is selected.
Operator Types
Data Access and Generation Operators

1. TableScan : Represents access to a single table. Attributes:


• Full table name — the name of the accessed table, including database and schema.
• Columns — list of scanned columns
• Table alias — used table alias, if present
• Extracted Variant paths — list of paths extracted from VARIANT columns

2. ValuesClause : List of values provided with the VALUES clause. Attributes:

• Number of values — the number of produced values.


• Values — the list of produced values.

3. Generator : Generates records using the TABLE(GENERATOR(...)) construct. Attributes:

• rowCount — provided rowCount parameter.


• timeLimit — provided timeLimit parameter.

4. ExternalScan : Represents access to data stored in stage objects. Can be a part of queries that scan data
from stages directly, but also for data-loading COPY queries. Attributes:

• Stage name — the name of the stage where the data is read from.
• Stage type — the type of the stage (e.g. TABLE STAGE).

5. InternalObject : Represents access to an internal data object (e.g. an Information Schema table or the
result of a previous query). Attributes:

• Object Name — the name or type of the accessed object.


Data Processing Operators

6. Filter : Represents an operation that filters the records. Attributes:


• Filter condition - the condition used to perform filtering.

7. Join : Combines two inputs on a given condition. Attributes:


• Join Type — Type of join (e.g. INNER, LEFT OUTER, etc.).
• Equality Join Condition — for joins which use equality-based conditions, it lists the expressions used for
joining elements.
• Additional Join Condition — some joins use conditions containing non-equality based predicates. They are
listed here.

8. Aggregate : Groups input and computes aggregate functions. Can represent SQL constructs such as
GROUP BY, as well as SELECT DISTINCT. Attributes:

• Grouping Keys — if GROUP BY is used, this lists the expressions we group by.
• Aggregate Functions — list of functions computed for each aggregate group, e.g. SUM.

9. GroupingSets : Represents constructs such as GROUPING SETS, ROLLUP and CUBE. Attributes:
• Grouping Key Sets — list of grouping sets
• Aggregate Functions — list of functions computed for each group, e.g. SUM.

10. WindowFunction : Computes window functions. Attributes:


• Window Functions — list of window functions computed.

11. Sort : Orders input on a given expression. Attributes:


• Sort keys — expression defining the sorting order.

12. SortWithLimit : Produces a part of the input sequence after sorting, typically a result of
an ORDER BY ... LIMIT ... OFFSET ... construct in SQL. Attributes:
• Sort keys — expression defining the sorting order.
• Number of rows — number of rows produced.
• Offset — position in the ordered sequence from which produced tuples are emitted.

13. Flatten : Processes VARIANT records, possibly flattening them on a specified path. Attributes:
• input — the input expression used to flatten the data.

14. JoinFilter : Special filtering operation that removes tuples that can be identified as not possibly matching
the condition of a Join further in the query plan. Attributes:
• Original join ID — the join used to identify tuples that can be filtered out.

15. UnionAll : Concatenates two inputs. Attributes: none.

16. ExternalFunction : Represents processing by an external function.


DML Operators
1. Insert :Adds records to a table either through an INSERT or COPY operation. Attributes:

• Input expressions — which expressions are inserted.


• Table names — names of tables that records are added to.

2. Delete :Removes records from a table. Attributes:

• Table name — the name of the table that records are deleted from.

3. Update: Updates records in a table. Attributes:

• Table name — the name of the updated table.

4. Merge : Performs a MERGE operation on a table. Attributes:

• Full table name — the name of the updated table.

5. Unload : Represents a COPY operation that exports data from a table into a file in a stage. Attributes:

• Location - the name of the stage where the data is saved.


Metadata Operators

Some queries include steps that are pure metadata/catalog operations rather than data-processing
operations. These steps consist of a single operator. Some examples include:

1. DDL and Transaction Commands : Used for creating or modifying objects, session, transactions, etc.
Typically, these queries are not processed by a virtual warehouse and result in a single-step profile
that corresponds to the matching SQL statement. For example:

o CREATE DATABASE | SCHEMA | …


o ALTER DATABASE | SCHEMA | TABLE | SESSION | …
o DROP DATABASE | SCHEMA | TABLE | …
o COMMIT

2. Table Creation Command


o DDL command for creating a table. For example:
o CREATE TABLE
o Similar to other DDL commands, these queries result in a single-step profile; however, they can
also be part of a multi-step profile, such as when used in a CTAS statement. For example:
o CREATE TABLE … AS SELECT …
3. Query Result Reuse : A query that reuses the result of a previous query.
4. Metadata-based Result : A query whose result is computed based purely on metadata, without
accessing any data. These queries are not processed by a virtual warehouse. For example:

o SELECT COUNT(*) FROM …


o SELECT CURRENT_DATABASE()

Miscellaneous Operators
Result :Returns the query result. Attributes:
• List of expressions - the expressions produced.
Query/Operator Details

To help you analyze query performance, the detail panel provides two classes of profiling
information:

• Execution time, broken down into categories


• Detailed statistics

In addition, attributes are provided for each operator (described in Operator Types in this topic).

Execution Time

Execution time provides information about “where the time was spent” during the processing of a query.
Time spent can be broken down into the following categories, displayed in the following order:

o Processing — time spent on data processing by the CPU.


o Local Disk IO — time when the processing was blocked by local disk access.
o Remote Disk IO — time when the processing was blocked by remote disk access.
o Network Communication — time when the processing was waiting for the network data transfer.
o Synchronization — various synchronization activities between participating processes.
o Initialization — time spent setting up the query processing.
Statistics

A major source of information provided in the detail panel is the various statistics, grouped in the following
sections:

• IO — information about the input-output operations performed during the query:


• Scan progress — the percentage of data scanned for a given table so far.
• Bytes scanned — the number of bytes scanned so far.
• Percentage scanned from cache — the percentage of data scanned from the local
disk cache.
• Bytes written — bytes written (e.g. when loading into a table).
• Bytes written to result — bytes written to a result object.
• Bytes read from result — bytes read from a result object.
• External bytes scanned — bytes read from an external object, e.g. a stage.

• DML — statistics for Data Manipulation Language (DML) queries:


• Number of rows inserted — number of rows inserted into a table (or tables).
• Number of rows updated — number of rows updated in a table.
• Number of rows deleted — number of rows deleted from a table.
• Number of rows unloaded — number of rows unloaded during data export.
• Number of bytes deleted — number of bytes deleted from a table.

• Pruning — information on the effects of table pruning:


• Partitions scanned — number of partitions scanned so far.
• Partitions total — total number of partitions in a given table.

• Spilling — information about disk usage for operations where intermediate results do not
fit in memory:
• Bytes spilled to local storage — volume of data spilled to local disk.
• Bytes spilled to remote storage — volume of data spilled to remote disk.

• Network — network communication:


• Bytes sent over the network — amount of data sent over the network.

• External Functions — information about calls to external functions:


• The following statistics are shown for each external function called by the SQL
statement. If the same function was called more than once from the same SQL
statement, then the statistics are aggregated.
• Total invocations — The number of times that an external function was called.
(This can be different from the number of external function calls in the text of
the SQL statement due to the number of batches that rows are divided into,
the number of retries (if there are transient network problems), etc.)
• Rows sent — The number of rows sent to external functions.
• Rows received — The number of rows received back from external functions.
• Bytes sent (x-region) — The number of bytes sent to external functions. If the
label includes “(x-region)”, the data was sent across regions (which can impact
billing).
• Bytes received (x-region) — The number of bytes received from external
functions. If the label includes “(x-region)”, the data was sent across regions
(which can impact billing).
• Retries due to transient errors — The number of retries due to transient errors.
• Average latency per call — The average amount of time per invocation (call)
between the time Snowflake sent the data and received the returned data.

If the value of a field, for example “Retries due to transient errors”, is zero, then
the field is not displayed.
Common Query Problems Identified by Query Profile

“Exploding” Joins
One of the common mistakes SQL users make is joining tables without providing a join condition (resulting
in a “Cartesian Product”), or providing a condition where records from one table match multiple records from
another table. For such queries, the Join operator produces significantly (often by orders of magnitude)
more tuples than it consumes.

This can be observed by looking at the number of records produced by a Join operator, and it is typically
also reflected in the Join operator consuming a lot of time.


UNION Without ALL

In SQL, it is possible to combine two sets of data with either UNION or UNION ALL constructs. The
difference between them is that UNION ALL simply concatenates inputs, while UNION does the same, but
also performs duplicate elimination.

A common mistake is to use UNION when the UNION ALL semantics are sufficient. These queries show in
Query Profile as a UnionAll operator with an extra Aggregate operator on top (which performs duplicate
elimination).

Queries Too Large to Fit in Memory

For some operations (e.g. duplicate elimination for a huge data set), the amount of memory available for the
compute resources used to execute the operation might not be sufficient to hold intermediate results. As a
result, the query processing engine will start spilling the data to local disk. If the local disk space is not
sufficient, the spilled data is then saved to remote disks.
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling).
To alleviate this, we recommend:

• Using a larger warehouse (effectively increasing the available memory/local disk space for the
operation), and/or
• Processing data in smaller batches.
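For example, a hedged sketch of resizing a warehouse to give the operation more memory and local disk
(the warehouse name is a placeholder):

-- Give the operation more memory and local disk to reduce spilling.
alter warehouse my_wh set warehouse_size = 'LARGE';
-- Resize back down once the heavy job has finished:
-- alter warehouse my_wh set warehouse_size = 'MEDIUM';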

Inefficient Pruning

Snowflake collects rich statistics on data, allowing it to avoid reading unnecessary parts of a table based on
the query filters. However, for this to have an effect, the data storage order needs to be correlated with the
query filter attributes.

The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total statistics in
the TableScan operators. If the former is a small fraction of the latter, pruning is efficient. If not, the pruning
did not have an effect.

Of course, pruning can only help for queries that actually filter out a significant amount of data. If the
pruning statistics do not show data reduction, but there is a Filter operator above TableScan which filters
out a number of records, this might signal that a different data organization might be beneficial for this
query.

18. Resource Monitoring


Overview
To help control costs and avoid unexpected credit usage caused by running warehouses,
Snowflake provides resource monitors.

Resource monitors can be used to impose limits on the number of credits that are
consumed by:

• User-managed virtual warehouses
• Virtual warehouses used by cloud services

Limits can be set for a specified interval or date range. When these limits are reached
and/or are approaching, the resource monitor can trigger various actions, such as sending
alert notifications and/or suspending the warehouses.
Resource monitors can only be created by account administrators; however, account
administrators can choose to enable users with other roles to view and modify resource
monitors using SQL.

Credit Quota

Credit quota specifies the number of Snowflake credits allocated to the monitor for the
specified frequency interval. Any number can be specified.
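A minimal sketch of creating a resource monitor and attaching it to a warehouse (the monitor and warehouse
names are assumptions):

use role accountadmin;

-- Suspend the warehouse when 100 credits are consumed in a month;
-- notify at 75% and suspend immediately at 100%.
create or replace resource monitor monthly_limit
  with credit_quota = 100
       frequency = monthly
       start_timestamp = immediately
       triggers on 75 percent do notify
                on 100 percent do suspend_immediate;

alter warehouse my_wh set resource_monitor = monthly_limit;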

Summary

Snowflake provides Resource Monitors to help control costs and avoid unexpected credit
usage related to using Warehouses

Can be used to impose limits on the number of credits that Warehouses consume within
each monthly billing period.

When these limits are close and/or reached, the Resource Monitor can trigger various
actions, such as sending alert notifications and suspending the Warehouses.

Resource Monitors can only be created by account administrators (i.e. users with the
ACCOUNTADMIN role); however, account administrators can choose to enable users with
other roles to view and modify resource monitors.

19. Data sharing
• Use Cases
  o Sharing within the same organization / same Snowflake account
  o Sharing between different Snowflake accounts
  o Sharing to a non-Snowflake customer
  o Sharing within a cloud region
  o Sharing across cloud regions
  o Sharing across platforms
• Data Marketplace
• Data Exchange
• Data Sharing Methods

Introduction to Secure Data Sharing

Secure Data Sharing enables sharing selected objects in a database in your account with
other Snowflake accounts. The following Snowflake database objects can be shared:

• Tables
• External tables
• Secure views
• Secure materialized views
• Secure UDFs

How Does Secure Data Sharing Work

1. With Secure Data Sharing, no actual data is copied or transferred between accounts.
All sharing is accomplished through Snowflake’s unique services layer and metadata
store.
2. Shared data does not take up any storage in a consumer account. The only charges to
consumers are for the compute resources (i.e. virtual warehouses) used to query the
shared data.
3. In addition, because no data is copied or exchanged, sharing is quick and easy to set up:
the provider creates a share of a database in their account and grants access to specific
objects in the database.
4. The provider can also share data from multiple databases, as long as these databases
belong to the same account.
5. One or more accounts are then added to the share, which can include your own
accounts.
6. On the consumer side, a read-only database is created from the share.
7. Access to this database is configurable using the same, standard role-based access
control that Snowflake provides for all objects in the system.
8. Through this architecture, Snowflake enables creating a network of providers that
can share data with multiple consumers (including within their own organization) and
consumers that can access shared data from multiple providers:

Share Object
9. A new object created in a database in a share is not automatically available to
consumers. To make the object available to consumers, you must use the GRANT
<privilege> … TO SHARE command to explicitly add the object to the share.
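A hedged sketch of the provider-side flow (the database, schema, share, and account identifiers below are
placeholders):

use role accountadmin;

-- Provider side: create a share and add objects to it explicitly.
create share sales_share;
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on table sales_db.public.orders to share sales_share;

-- Add one or more consumer accounts to the share.
alter share sales_share add accounts = consumer_org.consumer_account;

-- Consumer side: create a read-only database from the share.
create database sales_from_provider from share provider_acct.sales_share;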

Data Providers

10. As a data provider, you share a database with one or more Snowflake
accounts.
11. Snowflake does not place any hard limits on the number of shares you can
create or the number of accounts you can add to a share.

Data Consumers

12. A data consumer is any account that chooses to create a database from a share
made available by a data provider.
13. As a data consumer, once you add a shared database to your account, you can
access and query the objects in the database just as you would with any other
database in your account.
14. Snowflake does not place any hard limits on the number of shares you can
consume from data providers; however, you can only create one database per share.
Reader Account

15. Data sharing is only supported between Snowflake accounts. As a data provider, you
might wish to share data with a consumer who does not already have a Snowflake
account.
16. Snowflake supports providers creating reader accounts.
17. Each reader account belongs to the provider account that created it.
18. The provider account uses shares to share databases with reader accounts.
19. A reader account can only consume data from the provider account that
created it.

20. Users in a reader account can query data that has been shared with it, but
cannot perform any of the DML tasks that are allowed in a full account (data loading,
insert, update, etc.).
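A minimal sketch of creating a reader account (the account name and admin credentials are placeholders):

use role accountadmin;

-- Create a reader (managed) account owned by the provider account.
create managed account reader_acct1
  admin_name = reader_admin,
  admin_password = '<strong-password>',
  type = reader;

-- Shares are then added to the reader account like any other consumer account:
-- alter share sales_share add accounts = <reader_account_locator>;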

General Limitations for Shared Databases

Shared databases have the following limitations for consumers:

• Shared databases are read-only. Users in a consumer account can view/query data, but
  cannot insert or update data, or create any objects in the database.
• The following actions are not supported:
  o Creating a clone of a shared database or any schemas/tables in the database.
  o Time Travel for a shared database or any schemas/tables in the database.
  o Editing the comments for a shared database.
• Shared databases and all the objects in the database cannot be forwarded (i.e.
  re-shared with other accounts).

Sharing Data With Data Consumers in a Different Region and Cloud Platform

Overview of the Product Offerings for Secure Data Sharing

Snowflake provides three product offerings for data sharing that utilize Snowflake Secure
Data Sharing to connect providers of data with consumers.

In this Topic:

• Direct Share
• Snowflake Data Marketplace
• Data Exchange
Direct Share

Direct Share is the simplest form of data sharing that enables account-to-account sharing of
data utilizing Snowflake’s Secure Data Sharing.

As a data provider you can easily share data with another company so that your data shows
up in their Snowflake account without having to copy it over or move it.

Snowflake Data Marketplace

Snowflake Data Marketplace is available to all Snowflake accounts hosted on non-VPS regions on all
supported cloud platforms.

The Data Marketplace utilizes Snowflake Secure Data Sharing to connect providers of data with
consumers.

You can discover and access a variety of third-party data sets and have them available directly in your
Snowflake account to query without transformation, joining them with your own data. If you need to use
several different vendors for data sourcing, the Data Marketplace gives you a single location from which to
get the data.

You can also become a provider and publish data in the Data Marketplace, which is an
attractive proposition if you are thinking about data monetization and different routes to
market.

For more information, see Introduction to the Snowflake Data Marketplace.

Data Exchange

Data Exchange is your own data hub for securely collaborating around data between a
selected group of members that you invite. It enables providers to publish data that can
then be discovered by consumers.

You can share data at scale with your entire business ecosystem such as suppliers, partners,
vendors, and customers, as well as business units at your own company. It allows you to
control who can join, publish, consume, and access data.

Once your Data Exchange is provisioned and configured, you can invite members and
specify whether they can consume data, provide data, or both.

The Data Exchange is supported for all Snowflake accounts hosted on non-VPS regions on
all supported cloud platforms.

For more information, see Data Exchange.

● Snowflake Architecture - Data Sharing Self-Guided Learning Material


○ Introduction to Data Sharing
○ Getting Started With Data Sharing
○ Data Sharing for Dummies (Webinar)
○ Data Sharing for Dummies (eBook)
○ Data Providers
○ Data Consumers
○ New and Improved Snowflake Data Sharing (Video)
○ Modern Data Sharing: The Opportunities Are Endless
Snowflake Security Framework

The threat of a data security breach, someone gaining unauthorized access to an organization’s data, is what
keeps CEOs, CISOs, and CIOs awake at night. Such a breach can quickly turn into a public relations
nightmare, resulting in lost business and steep fines from regulatory agencies.

Snowflake Cloud Data Platform sets the industry standard for data platform security, so you don’t have to
lose sleep. All aspects of Snowflake’s architecture, implementation, and operation are designed to protect
customer data in transit and at rest against both current and evolving security threats.

1. Snowflake was built from the ground up to deliver end-to-end data security for all data platform users.
2. As part of its overall security framework, it leverages NIST 800-53 and the CIS Critical Security Controls,
a set of controls created by a broad consortium of international security experts to identify the security
functions that are effective against real-world threats.
3. Snowflake comprises a multilayered security architecture to protect customer data and access to that
data. This architecture addresses the following:
• External interfaces
• Access control
• Data storage
• Physical infrastructure

This security architecture is complemented by the monitoring, alerts, controls, and processes that are part of
Snowflake’s comprehensive security framework.

4. Security for compliance requirements: Snowflake is a multi-tenant service that implements isolation at
   multiple levels. It runs inside a virtual private cloud (VPC), a logically isolated network section within
   either Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud Platform (GCP).
5. The dedicated subnet, along with the implementation of security groups, enables Snowflake to isolate
and limit access to its internal components.
6. The Business Critical edition provides additional security features to support customers who have
HIPAA, PCI DSS, or other compliance requirements.
7. In addition, VPS supports customers who have specific regulatory requirements that prevent them from
loading their data into a multi-tenant environment. VPS includes the Business Critical edition within a
dedicated version of Snowflake.
8. Snowflake also isolates query processing, which is performed by one or more compute clusters called
virtual warehouses. Snowflake provisions these compute clusters in such a way that the virtual
warehouses of each customer are isolated from other customers’ virtual warehouses.
9. Snowflake also isolates data storage by customer. Each customer’s data is always stored in an
independent directory and encrypted using customer-specific keys, which are accessible only by that
customer.
20. Access Control-Security Feature
Authentication

10. Snowflake employs robust authentication mechanisms, and every request to Snowflake must be
authenticated, for example:
• User password hashes are securely stored.
• Strong password policy is enforced.
• Various mechanisms are deployed by Snowflake to thwart brute-force attacks

A brute force attack, or exhaustive search, is a cryptographic hack that uses trial-and-error to guess possible
combinations for passwords used for logins, encryption keys, or hidden web pages

• Snowflake also offers built-in multi-factor authentication (MFA), MFA for users with administrative
privileges, and key-pair authentication for non-interactive users.
• For customers who want to manage the authentication mechanism for their account, and whose
providers support SAML 2.0, Snowflake offers federated authentication.
• System for Cross-domain Identity Management (SCIM) can be leveraged to help facilitate the
  automated management of user identities and groups (that is, roles) in cloud applications using
  RESTful APIs.

Authorization

11. Snowflake provides a sophisticated, role-based access control (RBAC) authorization framework to
ensure data and information can be accessed or operated on only by authorized users within an
organization.
12. Access control is applied to all database objects including tables, schemas, secure views, secure user-
defined functions (secure UDFs) and virtual warehouses. Access control grants determine a user’s ability
to both view and operate on database objects.
13. In Snowflake’s access control model, users are assigned one or more roles, each of which can be
assigned different access privileges. For every access to database objects, Snowflake validates that the
necessary privileges have been granted to a role assigned to the user.
14. Customers can choose from a set of built-in roles or create and define custom roles within the role
hierarchy defined by Snowflake.
15. The OAuth 2.0 authorization framework is also supported.

Encryption everywhere

16. In Snowflake, all customer data is always encrypted when it is stored on disk, and data is encrypted
when it’s moved into a Snowflake-provided staging location for loading into Snowflake.
17. Data is also encrypted when it is stored within a database object in Snowflake, when it is cached within a
virtual warehouse, and when Snowflake stores a query result.
Data encryption and key management

18. Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in a cluster of
    hardware security modules (HSMs).
19. Each customer account has a separate key hierarchy of account-level, table-level, and file-level keys.
20. Snowflake automatically rotates account and table keys on a regular basis. Data encryption and key
    management are entirely transparent to customers and require no configuration or management.

Data protection and recovery through retention and backups

21. Snowflake was designed from the ground up to be a continuously available cloud service that is resilient
to failures to prevent customer disruption and data loss.
22. Its continuous data protection (CDP) capabilities protect against and provide easy self-service recovery
from accidental errors, system failures, and malicious acts.

Recovery from accidental errors

23. The most common cause of data loss or corruption in a database is accidental errors made by a system
administrator, a privileged user, or an automated process.
24. Snowflake provides a unique feature called Time Travel that provides easy recovery from such errors.
Time Travel makes it possible to instantly restore or query any previous version of a table or database
from an arbitrary past point in time within a retention period.

How Time Travel works

25. When any data is modified, Snowflake internally writes those changes to a new storage object and
automatically retains the previous storage object for a period of time (the retention period) so that both
versions are preserved.
26. When data is deleted or database objects are dropped, Snowflake updates its metadata to reflect that
change but keeps the data during the retention period.
27. During the retention period, all data and data objects are fully recoverable by customers. Past versions
of a data object from any point in time within the retention period can also be accessed via SQL, both for
direct access by a SELECT statement as well as for cloning in order to create a copy of a past version of
the data object.
28. After the retention period has passed, Snowflake’s Fail-Safe feature provides an additional seven days
(the “fail-safe” period) to provide a sufficient length of time during which Snowflake can, at a customer’s
request, recover any data that was maliciously or inadvertently deleted by human or software error.
29. At the end of that Fail-Safe period, an automated process physically deletes the data. Because of this
design, it is impossible for the Snowflake service, any Snowflake personnel, or malicious intruders to
physically delete data.
30. CDP and Time Travel are standard features built into Snowflake. The length of the default retention
period is determined by the customer’s service agreement.
31. Customers can specify extended retention periods at the time that a new database, table, or schema is
created via SQL data definition language (DDL) commands. Extended retention periods incur additional
storage costs for the time that Snowflake retains the data during the retention and fail-safe periods.
32. If an errant data loading script corrupts a database, it is possible to create a logical duplicate of the
database (a clone) from the point in time just prior to the execution of a specific statement.
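A brief, hedged illustration of Time Travel-based recovery (the object names and statement ID are
placeholders):

-- Query a table as it existed 30 minutes ago.
select * from orders at(offset => -60*30);

-- Restore a table that was accidentally dropped (within the retention period).
undrop table orders;

-- Clone a database as it was just before a specific (errant) statement ran.
create database sales_db_restore
  clone sales_db before(statement => '<query_id>');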

Protection against system failures

33. The second most common type of data loss is caused by some form of system failure: both software
failures and infrastructure failures such as the loss of a disk, a disk array, a server or, most significantly, a
data center.
34. The Snowflake architecture is designed for resilience, without data loss, in the face of such failures.
Snowflake, which runs on all the major cloud providers’ platforms (AWS, GCP, and Azure), uses a fully
distributed and resilient architecture combined with the resiliency capabilities available in these cloud
platforms to protect against a wide array of possible failures.

• Compute layer. Consists of one or more virtual warehouses, each of which is a multinode compute
cluster that processes queries. Virtual warehouses cache data from the data storage layer in
encrypted form, but they do not store persistent data.
• Cloud services layer. The brain of the system, this layer manages infrastructure, queries, security,
and metadata. The services running in this layer are implemented as a set of stateless processes

35. Each layer in the Snowflake architecture is distributed across availability zones. Because availability
zones are geographically separated data centers with independent access to power and networking,
operations continue even if one or two availability zones become unavailable.

36. When a transaction is committed in Snowflake, the data is securely stored in the cloud provider’s highly
durable data storage, which enables data survival in the event of the loss of one or more disks, servers,
or even data centers. Amazon S3 synchronously and redundantly stores data across multiple devices in
multiple facilities. It is designed for eleven 9s (99.999999999%) of data durability.
External Interfaces
37. Customers access Snowflake via the internet using only secure protocols. All internet communication
between users and Snowflake is secured and encrypted using TLS 1.2 or higher.

The following drivers and tools may be used to connect to the service:

• Snowflake’s command-line interface (CLI) client


• Snowflake’s web-based user interface
• Snowflake Connector for Python
• Snowflake Connector for Spark
• Snowflake Connector for Kafka
• The Node.js driver
• The Go Snowflake driver
• The .NET driver
• The JDBC driver
• The ODBC driver

38. Snowflake also supports IP address whitelisting to enable customers to restrict access to the Snowflake
service by only trusted networks.
39. Customers who prefer to not allow any traffic to traverse the public internet may leverage either AWS
PrivateLink (and AWS DirectConnect) or Microsoft Azure Private Link.
Infrastructure Security
Threat detection

Snowflake uses advanced threat detection tools to monitor all aspects of its infrastructure.

• All security logs, including logs and alerts from third-party tools, are centralized in Snowflake’s
security data lake, where they are aggregated for analysis and alerting.
• Activities meeting certain criteria generate alerts that are triaged through Snowflake’s security
incident process.
• Specific areas of focus include the following:
o File integrity monitoring (FIM) tools are used to ensure that critical system files, such as
important system and application executable files, libraries, and configuration files, have not
been tampered with. FIM tools use integrity checks to identify any suspicious system
alterations, which include owner or permissions changes to files or directories, the use of
alternate data streams to hide malicious activities, and the introduction of new files.
o Behavioral monitoring tools monitor network, user, and binary activity against a known
baseline to identify anomalous behavior that could be an indication of compromise.

• Snowflake uses threat intelligence feeds to contextualize and correlate security events and harden
security controls to counteract malicious tactics, techniques, and procedures (TTPs).

Physical security

• Snowflake is hosted in AWS, Azure, or GCP data centers around the world.
• Snowflake’s infrastructure as-a-service cloud provider partners employ many physical security
measures, including biometric access controls and 24-hour armed guards and video surveillance to
ensure that no unauthorized access is permitted.
• Neither Snowflake personnel nor Snowflake customers have access to these data centers.
Security Compliance
Snowflake’s portfolio of security and compliance reports is continuously expanded as customers request
reports. The following is the current list of reports available to all customers and prospects who are
under a non-disclosure agreement.

SOC 1 Type 2

The SOC 1 Type 2 report is an independent auditor’s attestation of the financial controls that Snowflake
had in place during the report’s coverage period.

SOC 2 Type 2

The SOC 2 Type 2 report is an independent auditor’s attestation of the security controls that Snowflake
had in place during the report’s coverage period. This report is provided for customers and prospects
to review to ensure there are no exceptions to the documented policies and procedures in the policy
documentation.

PCI DSS

The Payment Card Industry Data Security Standard is a set of prescriptive requirements to which an
organization must adhere in order to be considered compliant. Snowflake’s PCI DSS Attestation of
Compliance provides an independent auditor’s assessment results after testing Snowflake’s
security controls.

HIPAA

The Health Insurance Portability and Accountability Act is a law that provides data security and privacy
provisions to protect protected health information. Snowflake is able to enter into a business associate
agreement (BAA) with any covered entity that requires HIPAA compliance.

ISO/IEC 27001

The International Organization for Standardization provides requirements for establishing, implementing,
maintaining, and continually improving an information security management system. Snowflake’s ISO
certificate is available for download.

FedRAMP

The Federal Risk and Authorization Management Program, or FedRAMP, is a government-wide
program that provides a standardized approach to security.


Four Levels (Editions) of Snowflake Security
Snowflake offers four editions with varying levels of security. Each subsequent edition contains all the
capabilities of the preceding editions. For example, the Business Critical edition includes everything the
Enterprise edition offers.

Enterprise edition

40. All data is re-encrypted annually.


41. Federated authentication is also available so users can access Snowflake with secure single sign-on
capability.
42. Snowflake’s unique data protection feature, Time Travel, enables deleted or modified data to be restored
to its original state for up to 90 days.
43. Cross-region replication is also available in the Enterprise edition, making it possible to add additional
    redundancy to Snowflake’s standard in-region replication.
Business critical edition

44. The Business Critical edition is Snowflake’s solution for customers who have specific compliance
requirements.
45. It includes HIPAA support, is PCI DSS compliant, and features an enhanced security policy.
46. This edition enables customers to use Tri-Secret Secure, which provides split encryption keys for
multiple layers of data security.
47. When a customer uses Tri-Secret Secure, access to the customer’s data requires the combination of the
Snowflake encryption key, the customer encryption key (which is wholly owned by the customer), and
valid customer credentials with role-based access to the data.
48. Because the data is encrypted with split keys, it is impossible for anyone other than the customer,
including Amazon, to gain access to the underlying data.
49. Snowflake can gain access to the data only if the customer key and access credentials are provided to
Snowflake. This ensures that only the customer can respond to demands for data access, regardless of
where they come from.

Virtual Private Snowflake (VPS)

50. VPS represents the most sophisticated solution for customers with sensitive data. It differs from other
Snowflake editions in a number of important ways.
51. With VPS, all of the servers that contain in-memory encryption keys are unique to each customer. Each
VPS customer has their own dedicated virtual servers, load balancer, and metadata store.
52. There are also dedicated virtual private networks (VPNs) or virtual private cloud (VPC) bridges from a
customer’s own VPC to the Snowflake VPC.
53. These dedicated services ensure that the most sensitive components of the customer’s data warehouse
are completely separate from those of other customers.
54. Even with VPS, Snowflake’s hardware security module and its maintenance, access, and deployment
    services are still shared services. These components are secure by design, even in a multi-tenant model.
    For instance, the hardware security module (HSM) is configured with a completely separate partition
    dedicated to the customer.
55. All data is stored in Amazon S3 within a separately provisioned AWS account. This design makes it
    possible for even the most security-conscious customers to trust VPS as a comprehensively secure
    solution for their data.
Outline Snowflake security principles and identify use cases where they should be applied.

• Encryption
• Network security
• User, Role, Grants provisioning
• Authentication

Encryption
Data Encryption

1. Snowflake encrypts all customer data by default, using the latest security standards, at no additional
   cost. Snowflake provides best-in-class key management, which is entirely transparent to customers.
2. End-to-end encryption (E2EE) is a form of communication in which no one but the end users can read
   the data. In Snowflake, this means that only a customer and the runtime components can read the data.
   No third parties, including Snowflake’s cloud computing platform or any ISP, can see data in the clear.

The flow of E2EE in Snowflake is as follows:

Customer-Provided Staging Area

1. A user uploads one or more data files to an external stage. The user may optionally encrypt the data
   files using client-side encryption; if the data is not encrypted, Snowflake immediately encrypts the
   data when it is loaded into a table.
2. Query results can be unloaded into an external stage. Results are optionally encrypted using
   client-side encryption.

Snowflake-Provided Staging Area

3. A user uploads one or more data files to a Snowflake (internal) stage. Data files are automatically
   encrypted by the client on the local machine prior to being transmitted to the internal stage.
4. Query results can be unloaded into a Snowflake stage. Results are automatically encrypted when
   unloaded to a Snowflake stage.
5. The user loads the data from the stage into a table. The data is transformed into Snowflake’s
   proprietary file format and stored in a cloud storage container (“data at rest”). In Snowflake, all data
   at rest is always encrypted.
6. The user downloads data files from the stage and decrypts the data on the client side.

In all of these steps, all data files are encrypted. Only the user and the Snowflake runtime components can
read the data. The runtime components decrypt the data in memory for query processing. No third-party
service can see data in the clear.

Client-Side Encryption

1. The customer creates a secret master key, which is shared with Snowflake.

2. The client, which is provided by the cloud storage service, generates a random encryption key and
encrypts the file before uploading it into cloud storage. The random encryption key, in turn, is
encrypted with the customer’s master key.
3. Both the encrypted file and the encrypted random key are uploaded to the cloud storage service.
The encrypted random key is stored with the file’s metadata.
4. When downloading data, the client downloads both the encrypted file and the encrypted random
key.
5. The client decrypts the encrypted random key using the customer’s master key.
6. Next, the client decrypts the encrypted file using the now decrypted random key.

7. All encryption and decryption happens on the client side. At no time does the cloud storage service
or any other third party (such as an ISP) see the data in the clear.

Ingesting Client-side Encrypted Data into Snowflake


1. To load client-side encrypted data from a customer-provided stage, you create a named stage object
with an additional MASTER_KEY parameter using a CREATE STAGE command, and then load data
from the stage into your Snowflake tables. The MASTER_KEY parameter requires a 256-bit
Advanced Encryption Standard (AES) key encoded in Base64.
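A hedged sketch of the pattern described above (the URL, credentials, and key value are placeholders):

-- Named external stage over a customer-managed S3 bucket that holds
-- client-side-encrypted files; MASTER_KEY is a Base64-encoded 256-bit AES key.
create or replace stage encrypted_customer_stage
  url = 's3://<your-bucket>/encrypted_files/'
  credentials = (aws_key_id = '<key_id>' aws_secret_key = '<secret_key>')
  encryption = (type = 'AWS_CSE' master_key = '<base64_encoded_256bit_key>');

-- Load the data from the stage into a table (decryption happens transparently).
copy into my_table
  from @encrypted_customer_stage
  file_format = (type = csv field_optionally_enclosed_by = '"');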

Encryption Key Management

1. Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in a hardware
security module(HSM).
2. Keys are automatically rotated on a regular basis by the Snowflake service, and data can be
   automatically re-encrypted (“rekeyed”) on a regular basis.
3. Data encryption and key management is entirely transparent and requires no configuration or
   management.

4. A hierarchical key model provides a framework for Snowflake’s encryption key management. The
hierarchy is composed of several layers of keys in which each higher layer of keys (parent keys)
encrypts the layer below (child keys).
5. In security terminology, a parent key encrypting all child keys is known as “wrapping”.
6. Snowflake’s hierarchical key model consists of four levels of keys:

1. The root key


2. Account master keys
3. Table master keys
4. File keys

7. Each customer account has a separate key hierarchy of account level, table level, and file level keys.
8. Encryption Key Rotation
9. Account and table master keys are automatically rotated by Snowflake when they are more than 30
   days old.
10. Active keys are retired, and new keys are created. When Snowflake determines the retired key is no
    longer needed, the key is automatically destroyed.

Periodic Rekeying
11. If periodic rekeying is enabled, when the retired encryption key for a table is older than one year,
    Snowflake automatically creates a new encryption key and re-encrypts all data previously protected
    by the retired key using the new key. The new key is used to decrypt the table data going forward.
12. Snowflake relies on one of several cloud-hosted hardware security module (HSM) services as a
    tamper-proof, highly secure way to generate, store, and use the root keys of the key hierarchy.
13. Tri-Secret Secure lets you control access to your data using a master encryption key that you
    maintain in the key management service for the cloud provider that hosts your Snowflake account.
14. The customer-managed key can then be combined with a Snowflake-managed key to create a
    composite master key. When this occurs, Snowflake refers to this as Tri-Secret Secure.
15. With Tri-Secret Secure enabled for your account, Snowflake combines your key with a Snowflake-
    maintained key to create a composite master key. This dual-key encryption model, together with
    Snowflake’s built-in user authentication, enables the three levels of data protection offered by
    Tri-Secret Secure.
Network security
1. Network policies provide options for managing network configurations to the Snowflake service.
2. Network policies allow restricting access to your account based on user IP address. Effectively, a
network policy enables you to create an IP allowed list, as well as an IP blocked list, if desired.
3. By default, Snowflake allows users to connect to the service from any computer or device IP
address.
4. A security administrator (or higher) can create a network policy to allow or deny access to a single IP
address or a list of addresses.
5. Network policies currently support only Internet Protocol version 4 (i.e. IPv4) addresses.
6. An administrator with sufficient permissions can create any number of network policies.
7. A network policy is not enabled until it is activated at the account or individual user level.
8. To activate a network policy, modify the account or user properties and assign the network policy to
the object.
9. Only a single network policy can be assigned to the account or a specific user at a time.
10. Snowflake supports specifying ranges of IP addresses using Classless Inter-Domain Routing (i.e.
    CIDR) notation. For example, 192.168.1.0/24 represents all IP addresses in the range of
    192.168.1.0 to 192.168.1.255.
11. To enforce a network policy for all users in your Snowflake account, activate the network policy for
your account.
12. If a network policy is activated for an individual user, the user-level network policy takes precedence.
For information about activating network policies at the user level, see Activating Network Policies
for Individual Users (in this topic).
13. To determine whether a network policy is set on your account or for a specific user, execute
the SHOW PARAMETERS command.
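A minimal sketch of a network policy (the policy name and IP ranges are placeholders):

use role securityadmin;

-- Allow only the corporate CIDR range and block one specific address.
create network policy corp_only_policy
  allowed_ip_list = ('192.168.1.0/24')
  blocked_ip_list = ('192.168.1.99');

-- Activate at the account level (or per user with ALTER USER ... SET NETWORK_POLICY).
use role accountadmin;
alter account set network_policy = corp_only_policy;

show parameters like 'network_policy' in account;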

Private Connectivity to Snowflake Internal Stages


These topics describe the administrative concepts and tasks associated with managing features that enable
private connectivity to Snowflake internal stages.
These topics are intended primarily for administrators (i.e. users with the ACCOUNTADMIN, SYSADMIN, or
SECURITYADMIN roles).
Federated Authentication & SSO

1. Federated authentication enables your users to connect to Snowflake using secure SSO (single sign-on).
2. With SSO enabled, your users authenticate through an external, SAML 2.0-compliant identity provider
(IdP).
3. Once authenticated by this IdP, users can securely initiate one or more sessions in Snowflake for the
duration of their IdP session without having to log into Snowflake.
4. They can choose to initiate their sessions from within the interface provided by the IdP or directly in
Snowflake.

5. For example, in the Snowflake web interface, a user connects by clicking the IdP option on the login
page:
a. If they have already been authenticated by the IdP, they are immediately granted access to
Snowflake.
b. If they have not yet been authenticated by the IdP, they are taken to the IdP interface where
they authenticate, after which they are granted access to Snowflake.

What is a Federated Environment?


6. In a federated environment, user authentication is separated from user access through the use of one or
more external entities that provide independent authentication of user credentials.
7. The authentication is then passed to one or more services, enabling users to access the services through
SSO. A federated environment consists of the following components:

Service provider (SP):

• In a Snowflake federated environment, Snowflake serves as the SP.

Identity provider (IdP):

The external, independent entity responsible for providing the following services to the SP:

• Creating and maintaining user credentials and other profile information.


• Authenticating users for SSO access to the SP.
8. Snowflake supports most SAML 2.0-compliant vendors as an IdP; however, certain vendors include
native support for Snowflake (see below for details).
Supported Identity Providers

The following vendors provide native Snowflake support for federated authentication and SSO:

• Okta — hosted service
• Microsoft ADFS (Active Directory Federation Services) — on-premises software (installed on
  Windows Server)

In addition to the native Snowflake support provided by Okta and ADFS, Snowflake supports
using most SAML 2.0-compliant vendors as an IdP, including:

• Google G Suite
• Microsoft Azure Active Directory
• OneLogin
• Ping Identity PingOne
Key Pair Authentication & Key Pair Rotation
1. Snowflake supports using key pair authentication for enhanced authentication security as an
alternative to basic authentication (i.e. username and password).
2. This authentication method requires, as a minimum, a 2048-bit RSA key pair. You can generate the
Privacy Enhanced Mail (i.e. PEM) private-public key pair using OpenSSL. Some of the Supported
Snowflake Clients allow using encrypted private keys to connect to Snowflake.
3. The public key is assigned to the Snowflake user who uses the Snowflake client to connect and
authenticate to Snowflake.
4. Snowflake also supports rotating public keys in an effort to allow compliance with more robust
security and governance postures.

Configuring Key Pair Authentication


• Step 1: Generate the Private Key
• Step 2: Generate a Public Key
• Step 3: Store the Private and Public Keys Securely
• Step 4: Assign the Public Key to a Snowflake User
• Step 5: Verify the User’s Public Key Fingerprint
• Step 6: Configure the Snowflake Client to Use Key Pair Authentication
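A hedged sketch of assigning a public key to a user and verifying it (the key material shown is a placeholder;
the RSA key pair itself is generated outside Snowflake, for example with OpenSSL):

-- Assign the public key (PEM body without header/footer lines) to the user.
alter user jsmith set rsa_public_key = 'MIIBIjANBgkqh...';

-- Verify the fingerprint recorded by Snowflake (RSA_PUBLIC_KEY_FP property).
describe user jsmith;

-- For key rotation, a second key can be set and the old one removed later:
alter user jsmith set rsa_public_key_2 = 'MIIBIjANBgkqh...';
alter user jsmith unset rsa_public_key;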
External Function
Calling AWS Lambda service from Snowflake external function using API Integration Object

5. An external function calls code that is executed outside Snowflake.


6. The remotely executed code is known as a remote service.
7. Information sent to a remote service is usually relayed through a proxy service.
8. Snowflake stores security-related external function information in an API integration.

An external function is a type of UDF. Unlike other UDFs, an external function does not
contain its own code; instead, the external function calls code that is stored and executed
outside Snowflake.

Inside Snowflake, the external function is stored as a database object that contains
information that Snowflake uses to call the remote service. This stored information includes
the URL of the proxy service that relays information to and from the remote service.

Remote Service:

5. The remotely executed code is known as a remote service.


6. The remote service must act like a function. For example, it must return a value.
7. Snowflake supports scalar external functions; the remote service must return
exactly one row for each row received.
8. Remote service can be implemented as:
• An AWS Lambda function.
• A Microsoft Azure Function.
• An HTTPS server (e.g. Node.js) running on an EC2 instance.
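A hedged sketch of wiring an AWS Lambda-backed remote service into Snowflake (the ARN, URLs, and
function names are placeholders):

use role accountadmin;

-- The API integration stores the security information needed to reach the proxy (API Gateway).
create or replace api integration my_api_integration
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::123456789012:role/my_api_role'
  api_allowed_prefixes = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
  enabled = true;

-- The external function object points at the proxy endpoint that relays to the Lambda.
create or replace external function sentiment_score(text varchar)
  returns variant
  api_integration = my_api_integration
  as 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/sentiment';

-- Called like any scalar function.
select review_text, sentiment_score(review_text) from product_reviews;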
Column Level Security

1. Dynamic Data Masking is a Column-level Security feature that uses masking policies
to selectively mask data at query time that was previously loaded in plain-text into
Snowflake.
2. Currently, Snowflake supports using Dynamic Data Masking on tables and views.
3. At query runtime, the masking policy is applied to the column at every location where
the column appears.
4. Depending on the masking policy conditions, Snowflake query operators may see the
plain-text value, a partially masked value, or a fully masked value.
5. Easily change masking policy content without having to reapply the masking policy to
thousands of columns.
use role sysadmin;

-- Create a masking policy: SYSADMIN sees the real value, all other roles see a masked value.
create or replace masking policy email_mask as (val string) returns string ->
  case
    when current_role() in ('SYSADMIN') then val
    else '*********'
  end;

-- Allow the SYSADMIN role to set or unset the email_mask masking policy (optional).
use role accountadmin;
grant apply on masking policy email_mask to role sysadmin;

-- Apply the masking policy to a table column.
use role sysadmin;
alter table if exists emp_basic_ingest modify column email set masking policy email_mask;

-- Query as ACCOUNTADMIN: the email column is masked because the policy only reveals it to SYSADMIN.
use role accountadmin;
select * from emp_basic_ingest;

-- The policy follows the column into views built on the table.
create view emp_info_view as select * from emp_basic_ingest;
select * from emp_info_view;
Row Level Security
Snowflake supports row-level security through the use of row access policies to determine which rows to
return in the query result.

Row access policies implement row-level security to determine which rows are visible in the query result.

A row access policy is a schema-level object that determines whether a given row in a table or view can be
viewed from SELECT statements, as well as rows selected by UPDATE, DELETE, and MERGE statements.
Feature / Edition Matrix

The following lists summarize the major features and services included with each edition; the editions that
include each feature are shown in parentheses.

Note

This is only a partial list of the features. For a more complete and detailed list, see Overview of Key
Features.

Release Management
• 24-hour early access to weekly new releases, which can be used for additional testing/validation
  before each release is deployed to your production accounts (Enterprise, Business Critical, VPS).
Security, Governance, & Data Protection
• SOC 2 Type II certification (all editions).
• Federated authentication and SSO for centralizing and streamlining user authentication (all editions).
• OAuth for authorizing account access without sharing or storing user login credentials (all editions).
• Network policies for limiting/controlling site access by user IP address (all editions).
• Automatic encryption of all data (all editions).
• Support for multi-factor authentication (all editions).
• Object-level access control (all editions).
• Standard Time Travel (up to 1 day) for accessing/restoring modified and deleted data (all editions).
• Disaster recovery of modified/deleted data (for 7 days beyond Time Travel) through Fail-safe
  (all editions).
• Extended Time Travel (up to 90 days) (Enterprise, Business Critical, VPS).
• Periodic rekeying of encrypted data for increased protection (Enterprise, Business Critical, VPS).
• Column-level Security to apply masking policies to columns in tables or views (Enterprise,
  Business Critical, VPS).
• Row Access Policies to determine which rows are visible in a query result (Enterprise,
  Business Critical, VPS).
• Object Tagging to apply tags to Snowflake objects to facilitate tracking sensitive data and resource
  usage (Enterprise, Business Critical, VPS).
• Customer-managed encryption keys through Tri-Secret Secure (Business Critical, VPS).
• Support for Private Connectivity to the Snowflake Service using AWS PrivateLink, Azure Private Link,
  or Google Cloud Private Service Connect (Business Critical, VPS).
• Support for PHI data (in accordance with HIPAA and HITRUST CSF regulations) (Business Critical, VPS).
• Support for PCI DSS (Business Critical, VPS).
• Support for FedRAMP Moderate data (in the US government regions) (Business Critical, VPS).
• Support for IRAP - Protected (P) data (in specified Asia Pacific regions) (Business Critical, VPS).
• Dedicated metadata store and pool of compute resources (used in virtual warehouses) (VPS only).

Data Replication & Failover
• Database replication between Snowflake accounts (within an organization) to keep the database
  objects and stored data synchronized (all editions).
• Database failover and failback between Snowflake accounts for business continuity and disaster
  recovery (Business Critical, VPS).
• Redirecting Client Connections between Snowflake accounts for business continuity and disaster
  recovery (Business Critical, VPS).

Data Sharing
• As a data provider, securely share data with other accounts (Standard, Enterprise, Business Critical).
• As a data consumer, query data shared with your account by data providers (Standard, Enterprise,
  Business Critical).
• Secure data sharing across regions and cloud platforms, through data replication (Standard,
  Enterprise, Business Critical).
• Snowflake Data Marketplace, where providers and consumers meet to securely share data
  (Standard, Enterprise, Business Critical).
• Data Exchange, a private hub of administrators, providers, and consumers that you invite to securely
  collaborate around data (Standard, Enterprise, Business Critical).

Customer Support
• Snowflake Community, Snowflake’s online Knowledge Base and support portal, for logging and
  tracking Snowflake Support tickets (all editions).
• Premier support, which includes 24/7 coverage and a 1-hour response window for Severity 1 issues
  (all editions [1]).
21. Database Replication
This feature enables replicating databases between Snowflake accounts (within the same
organization) and keeping the database objects and stored data synchronized. Database
replication is supported across regions and across cloud platforms.

Primary Database
• Replication can be enabled for any existing permanent or transient database.
• Enabling replication designates the database as a primary database.
• Any number of databases in an account can be designated a primary database.
• Likewise, a primary database can be replicated to any number of accounts in your organization.
• This involves creating a secondary database as a replica of a specified primary database in each of the
target accounts
• All DML/DDL operations are executed on the primary database.
• Each read-only, secondary database can be refreshed periodically with a snapshot of the primary
database, replicating all data as well as DDL operations on database objects (i.e. schemas, tables, views,
etc.).
• When a primary database is replicated, a snapshot of its database objects and data is transferred to the
secondary database
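A hedged sketch of the replication flow (the organization, account, and database names are placeholders):

-- On the source account: promote a local database to primary and allow replication
-- to another account in the same organization.
alter database sales_db enable replication to accounts myorg.target_account;

-- On the target account: create a secondary database as a replica of the primary.
create database sales_db as replica of myorg.source_account.sales_db;

-- Refresh the secondary database with a snapshot of the primary (typically scheduled via a task).
alter database sales_db refresh;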

Replicated object types and features:

• Tables
  o Permanent tables: replicated.
  o Transient tables: not replicated.
  o Temporary tables: not replicated.
  o Automatic Clustering of clustered tables: replicated.
  o External tables: not replicated. Creating or refreshing a secondary database is blocked if an
    external table exists in the primary database. Planned for a future version of database replication.
  o Table constraints: replicated, except if a foreign key in the database references a primary/unique
    key in another database.
• Sequences: replicated.
• Views
  o Views: replicated. If a view references any object in another database (e.g. table columns, other
    views, UDFs, or stages), both databases must be replicated.
  o Materialized views: replicated.
  o Secure views: replicated.
• File formats: replicated.
• Stages
  o Stages: not replicated. Planned for a future version of database replication.
  o Temporary stages: not replicated.
• Pipes: not replicated. Planned for a future version of database replication.
• Stored procedures: replicated.
• Streams: not replicated. Planned for a future version of database replication.
• Tasks: not replicated. Planned for a future version of database replication.
• SQL and JavaScript UDFs: replicated.
• Policies (Row Access & Column-level Security/masking): replicated. The replication operation is
  blocked if either of the following is true: the primary database is in an Enterprise (or higher) account
  and contains a policy but one or more of the accounts approved for replication are on lower editions,
  or a policy contained in the primary database has a reference to a policy in another database.
• Tags (Object Tagging): replicated.

• Currently, replication is supported for databases only. Other types of objects in an account cannot be
replicated. This list includes:

• Users
• Roles
• Warehouses
• Resource monitors
• Shares

• Privileges granted on database objects are not replicated to a secondary database.


• Account parameters are not replicated.

• Object parameters that are set at the schema or schema object level are replicated:

Parameter                                 Objects
DATA_RETENTION_TIME_IN_DAYS               schema, table
DEFAULT_DDL_COLLATION                     schema, table
MAX_DATA_EXTENSION_TIME_IN_DAYS           schema, table
PIPE_EXECUTION_PAUSED [1]                 schema, pipe
QUOTED_IDENTIFIERS_IGNORE_CASE            schema, table

• Parameter replication is only applicable to objects in the database (schema, table) and only if the
parameter is explicitly set. Database level parameters are not replicated.

Database Failover/Fallback
1. In the event of a massive outage (due to a network issue, software bug, etc.) that disrupts the cloud
services in a given region, access to Snowflake will be unavailable until the source of the outage is
resolved and services are restored.
2. To ensure continued availability and data durability in such a scenario, you can replicate your
databases in a given region to another Snowflake account (owned by your organization) in a different
region. This option allows you to recover multiple databases in the other region and continue to
process data after a failure in the first region results in full or partial loss of Snowflake availability.
3. Initiating failover involves promoting a secondary (i.e. replica) database in an available region to serve
as the primary database. When promoted, the now-primary database becomes writeable.
Concurrently, the former primary database becomes a secondary, read-only database.
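A hedged sketch of promoting a secondary database during failover (the database name is a placeholder):

-- Run on the account holding the secondary database (Business Critical edition or higher).
-- Promotes the replica to primary; the former primary becomes a read-only secondary.
alter database sales_db primary;

-- After the outage is resolved, fail back by refreshing and re-promoting in the original region:
-- alter database sales_db refresh;
-- alter database sales_db primary;   -- run in the original account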

Database replication is available in all editions; database failover/failback and Redirecting Client
Connections require the Business Critical edition (or higher).

Business Continuity and Disaster Recovery Flow


4. Two accounts in the same organization but different regions (Region A and Region B). In one
account, a local database has been promoted to serve as a primary database. Replication has been
enabled for the other account, allowing it to store a replica of the primary database (that is, a
secondary database):

1. A service outage in Region A, where the account that contains the primary database is located. The
secondary database (in Region B) has been promoted to serve as the primary database. Concurrently,
the former primary database has become a secondary, read-only database:
2. The following diagram shows that the service outage in Region A has been resolved. A database
refresh operation is in progress from the primary database (in Region B) to the secondary database
(in Region A):

3. The final diagram shows operations returned to their initial configuration (i.e. failback). The
secondary database (in Region A) has been promoted to once again serve as the primary database for
normal business operations. Concurrently, the former primary database (in Region B) has become a
secondary, read-only database:

Understanding Billing for Database Replication


• Charges based on database replication are divided into two categories: data transfer and compute
resources.
• Both categories are billed on the target account (i.e. the account that stores the secondary database that
is refreshed).
• The data transfer rate is determined by the location of the source account.
• Replication operations use Snowflake-provided compute resources to copy data between accounts
  across regions.
• The REPLICATION_USAGE_HISTORY table function (in the Snowflake Information Schema) returns
  replication usage activity within the last 14 days.
• The REPLICATION_USAGE_HISTORY view (in Account Usage) returns replication usage
  activity within the last 365 days (1 year).
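For example, a hedged sketch of checking recent replication usage:

-- Replication usage over the last 7 days, from the Information Schema table function (14-day window).
select *
from table(information_schema.replication_usage_history(
  date_range_start => dateadd('day', -7, current_date()),
  date_range_end   => current_date()));

-- Longer history (up to 365 days, with some latency) from Account Usage.
select * from snowflake.account_usage.replication_usage_history
order by start_time desc;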

Replication and Automatic Clustering


• In the primary database, Snowflake monitors clustered tables using Automatic Clustering and
reclusters them as needed. As part of a refresh operation, clustered tables are replicated to a secondary
database with the current sorting of the table micro-partitions. As such, reclustering is not performed
again on the clustered tables in the secondary database, which would be redundant.

Replication and Materialized Views


• In the primary database, Snowflake performs automatic background maintenance of materialized views.
When a base table changes, all materialized views defined on the table are updated by a background
service that uses compute resources provided by Snowflake. In addition, if Automatic Clustering is
enabled for a materialized view, then the view is monitored and reclustered as necessary in the primary
database.
• A refresh operation replicates the materialized view definitions to a secondary database; the
  materialized view data is not replicated. Automatic background maintenance of materialized views in a
  secondary database is enabled by default. If Automatic Clustering is enabled for a materialized view in
  the primary database, automatic monitoring and reclustering of the materialized view in the secondary
  database is also enabled.

Replication and External Tables

External tables in the primary database currently cause the replication or refresh operation
to fail with the following error message:

003906 (55000): SQL execution error:
Primary database contains an external table '<database_name>'. Replication of a database with external table
is not supported

• To work around this limitation, we recommend that you move the external tables into a
separate database that is not replicated.
• Alternatively, if you are migrating your databases to another account, you could clone
the primary database, drop the external table from the clone, and then replicate the
cloned database.
• After you promote the secondary database in the target account, you would need to
recreate the external tables in the database.
Replication and Policies (Masking & Row Access)

For masking and row access policies, if either of the following conditions is true, then the initial replication
operation or a subsequent refresh operation fails:

• The primary database is in an Enterprise (or higher) account and contains a policy but one or more of the
accounts approved for replication are on lower editions.
• A policy contained in the primary database has a reference to a policy in another database.

Time Travel

Querying tables and views in a secondary database using Time Travel can produce different
results than when executing the same query in the primary database.

Historical Data

Historical data available to query in a primary database using Time Travel is not
replicated to secondary databases.

Data Retention Period

The data retention period for tables in a secondary database begins when the
secondary database is refreshed with the DML operations (i.e. changing or deleting
data) written to tables in the primary database.
22. Search Optimization Service
1. A point lookup query returns only one or a small number of distinct rows. Such queries are typically run by:
• Business users who need fast response times for critical dashboards with highly selective filters.
• Data scientists who are exploring large data volumes and looking for specific subsets of data.
2. The search optimization service aims to significantly improve the performance of selective point lookup
queries on tables.
3. A user can register one or more tables to the search optimization service. Search optimization is a table-
level property and applies to all columns with supported data types (see the list of supported data types
further below).
4. The search optimization service relies on a persistent data structure that serves as an optimized search
access path.
5. A maintenance service that runs in the background is responsible for creating and maintaining the search
access path:
• When you add search optimization to a table, the maintenance service creates and populates the
search access path with the data needed to perform the lookups.
• When data in the table is updated (for example, by loading new data sets or through DML
operations), the maintenance service automatically updates the search access path to reflect the
changes to the data.
• If queries are run when the search access path hasn’t been updated yet, the queries might run
slower but will always return up-to-date results.
• There is a cost for the storage and compute resources for this service.

Considering Other Solutions for Optimizing Query Performance

6. The search optimization service is one of several ways to optimize query performance. Related
techniques include:
• Clustering a table.
• Creating one or more materialized views (clustered or unclustered).
7. Clustering a table can speed any of the following, as long as they are on the clustering key:
o Range searches.
o Equality searches.

However, a table can be clustered on only a single key (which can contain one or more columns or
expressions).

8. The search optimization service speeds up only equality searches. However, this applies to all the
columns of supported types in a table that has search optimization enabled.
9. A materialized view speeds up both equality searches and range searches, as well as some sort
operations, but only for the subset of rows and columns included in the materialized view.
10. Materialized views can also be used to define different clustering keys on the same source table (or a
subset of that table), or in conjunction with flattening JSON/variant data (see the sketch after this list).
11. If you clone a table, schema, or database, the search optimization property and search access paths of
each table are also cloned.
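As a rough illustration of points 7 and 10, a base table can be clustered on one key while a materialized view
over the same table is clustered on another (all object names are hypothetical; materialized views require
Enterprise Edition or higher):

-- Cluster the base table on one key
ALTER TABLE sales CLUSTER BY (sale_date);
-- Define a materialized view over the same table, clustered on a different key
CREATE MATERIALIZED VIEW sales_by_region
  CLUSTER BY (region)
  AS SELECT region, sale_date, amount FROM sales;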

23. Account Usage Views


• In the SNOWFLAKE database, the ACCOUNT_USAGE and READER_ACCOUNT_USAGE schemas
enable querying object metadata, as well as historical usage data, for your account and all reader
accounts (if any) associated with the account (an example query is shown after this list).
• In general, these views mirror the corresponding views and table functions in the Snowflake
Information Schema, but with the following differences:
o Records for dropped objects are included in each view.
o Longer retention time for historical usage data.
o Data latency.
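For example, a minimal query against one of these views, assuming a role with access to the shared
SNOWFLAKE database (such as ACCOUNTADMIN):

-- Credits consumed per virtual warehouse over the last 7 days
SELECT warehouse_name,
       SUM(credits_used) AS total_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;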

24. Differences Between Account Usage and Information Schema

The Account Usage views and the corresponding views (or table functions) in the Snowflake Information
Schema utilize identical structures and naming conventions, but with some key differences: the Account
Usage views include records for dropped objects, retain historical data for longer, and have higher data
latency, as noted in the previous section.

25. Parameters
https://docs.snowflake.com/en/user-guide/admin-account-management.html

1. Snowflake provides three types of parameters that can be set for your account: account parameters,
session parameters, and object parameters.
2. Account parameters: These affect your entire account. They are set at the account level and cannot be
overridden at a lower level of the hierarchy. All parameters have default values, which can be overridden
at the account level. To override default values at the account level, you must be an account
administrator.
3. Session parameters: These parameters mainly relate to users and their sessions. They can be set at the
account level, changed for each user, and then changed again within a user’s session.
For example, users connecting from the US may want to see dates displayed in “mm-dd-yyyy” format,
and users from Asia may want to see dates listed as “dd/mm/yyyy”.
• The account-level value for this parameter may be the default “yyyy-mm-dd”. Setting the value at
the user level ensures different users see dates the way they want to see them.
• Now, suppose a user from the US logs in to the data warehouse but wants to see dates in
“MMM-dd-yyyy” format. She could change the parameter for her own session only (see the
sketch after this item).
Both ACCOUNTADMIN and SYSADMIN role members can assign or change parameters for the
user. If no changes are made to a session-type parameter at the user or session level, the account-
level value is applied.
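A minimal sketch of this hierarchy using the DATE_OUTPUT_FORMAT session parameter (the user name
JANE_US is hypothetical):

-- Account-level default (requires an account administrator)
ALTER ACCOUNT SET DATE_OUTPUT_FORMAT = 'YYYY-MM-DD';
-- Override for a specific user
ALTER USER jane_us SET DATE_OUTPUT_FORMAT = 'MM-DD-YYYY';
-- Override again for the current session only
ALTER SESSION SET DATE_OUTPUT_FORMAT = 'MON-DD-YYYY';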

4. Object parameters: These apply to Snowflake objects (warehouses, databases, schemas, and tables).
Warehouses don’t have any hierarchy, so warehouse-specific parameters can be set at the account level
and then changed for individual warehouses (see the sketch after this list).

• Similarly, database-specific parameters can be set at the account level and then for each database.
• Unlike warehouses though, databases have a hierarchy. Within a database, a schema can override
the parameters set at the account or database level, and within the schema, a table can override the
parameters set at the account, database or schema level.
• If no changes are made at lower levels of the hierarchy, the value set at the nearest higher level is
applied downstream.
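For example, STATEMENT_TIMEOUT_IN_SECONDS is an object parameter that can be set at the account
level and then overridden for an individual warehouse (MYTESTWH is a hypothetical warehouse name):

-- Account-level value applies to all warehouses by default
ALTER ACCOUNT SET STATEMENT_TIMEOUT_IN_SECONDS = 86400;
-- Override the value for one warehouse only
ALTER WAREHOUSE mytestwh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;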

5. To see a list of the parameters and their current values for your account

-- Shows parameters defined at the account level
SHOW PARAMETERS IN ACCOUNT;
-- Shows parameters defined at the user level
SHOW PARAMETERS IN USER;
-- Shows parameters defined at the session level
SHOW PARAMETERS IN SESSION;
-- Same as above; SHOW PARAMETERS defaults to the current session
SHOW PARAMETERS;
-- Shows parameters defined for a specific database
SHOW PARAMETERS IN DATABASE mydb;
-- Filters parameters by name
SHOW PARAMETERS LIKE 'time%' IN ACCOUNT;
Here:

key is the parameter name

value is the current value set for the parameter

default is the default value for the parameter

level is the level in the hierarchy at which the current value was set. If the default value and the current
value are the same, this field is empty.

6. Here is an example of changing a parameter at the session level:


ALTER SESSION SET TIMEZONE = 'Asia/Kolkata';

If you now run this command:


SHOW PARAMETERS LIKE '%TIMEZONE%';

The output shows the current value is set at the session level, and it’s different from the default value.

To find what the value has been set to at account level, run this command:

SHOW PARAMETERS LIKE '%TIMEZONE%' IN ACCOUNT;

It shows the default value hasn’t changed at the account level; the “level” column is empty in this case.

To reset the parameter back to what it was, run this command:

ALTER SESSION UNSET TIMEZONE;


Let’s now talk about object-type parameters. Unlike account or session types, querying an object-type
parameter requires specifying the type of object (database, schema, table or warehouse) and the object
name.

This code snippet shows how to list object properties for database, schema, table and warehouse:

-- Shows parameters set at database level
SHOW PARAMETERS IN DATABASE MYTESTDB;
-- Shows parameters set at schema level
SHOW PARAMETERS IN SCHEMA MYTESTDB.TEST_SCHEMA;
-- Shows parameters set at table level
SHOW PARAMETERS IN TABLE MYTESTDB.TEST_SCHEMA.MY_TEST_TABLE;
-- Shows parameters set for a warehouse
SHOW PARAMETERS IN WAREHOUSE MYTESTWH;

Let’s say you want to change the data retention period to 0 days for the TEST_SCHEMA, which effectively
turns off its time travel. Run a command like this:
ALTER SCHEMA MYTESTDB.TEST_SCHEMA SET DATA_RETENTION_TIME_IN_DAYS=0;

To change it back to the default value, run a command like this:


ALTER SCHEMA MYTESTDB.TEST_SCHEMA UNSET DATA_RETENTION_TIME_IN_DAYS;

And that’s about everything you need to know about querying and setting Snowflake parameters.

7. To reset a parameter for your account to its default value, use ALTER ACCOUNT UNSET with the
parameter name, for example:

ALTER ACCOUNT UNSET TIMEZONE;
26. Connectors and Drivers
Snowflake Connector for Python
1. The Snowflake Connector for Python provides an interface for developing Python applications that
can connect to Snowflake and perform all standard operations. It provides a programming alternative
to developing applications in Java or C/C++ using the Snowflake JDBC or ODBC drivers.
2. The connector is a native, pure Python package that has no dependencies on JDBC or ODBC. It can
be installed using pip on Linux, macOS, and Windows platforms where a supported version of
Python is installed.
3. SnowSQL, the command line client provided by Snowflake, is an example of an application
developed using the connector.

Snowflake Spark Connector


4. The Snowflake Connector for Spark enables using Snowflake as an Apache Spark data source, similar
to other data sources (PostgreSQL, HDFS, S3, etc.).
JDBC Driver

7. Snowflake provides a JDBC type 4 driver that supports core JDBC functionality. The JDBC driver
must be installed in a 64-bit environment and requires Java 1.8 (or higher). The driver can be used
with most client tools/applications that support JDBC for connecting to a database server.
8. sfsql, the now-deprecated command line client provided by Snowflake, is an example of a JDBC-
based application.

ODBC Driver

9. Snowflake provides a driver for connecting to Snowflake using ODBC-based client applications. The
ODBC driver has different prerequisites depending on the platform where it is installed.

PHP PDO Driver for Snowflake

10. The PHP PDO driver for Snowflake provides an interface for developing PHP applications that can
connect to Snowflake and perform all standard operations.

Snowflake Kafka Connector

11. Apache Kafka software uses a publish and subscribe model to write and read streams of records,
similar to a message queue or enterprise messaging system. Kafka allows processes to read and write
messages asynchronously.
12. A subscriber does not need to be connected directly to a publisher; a publisher can queue a message
in Kafka for the subscriber to receive later.
13. An application publishes messages to a topic, and an application subscribes to a topic to receive
those messages. Kafka can process, as well as transmit, messages; however, that is outside the scope
of this document. Topics can be divided into partitions to increase scalability.
14. Kafka Connect is a framework for connecting Kafka with external systems, including databases. A
Kafka Connect cluster is a separate cluster from the Kafka cluster. The Kafka Connect cluster
supports running and scaling out connectors (components that support reading and/or writing
between external systems).
15. Kafka, like many message publish/subscribe platforms, allows a many-to-many relationship between
publishers and subscribers. A single application can publish to many topics, and a single application
can subscribe to multiple topics. With Snowflake, the typical pattern is that one topic supplies
messages (rows) for one Snowflake table.
16. From the perspective of Snowflake, a Kafka topic produces a stream of rows to be inserted into a
Snowflake table. In general, each Kafka message contains one row.
18. Kafka topics can be mapped to existing Snowflake tables in the Kafka configuration.

Schema of Tables for Kafka Topics

19. Every Snowflake table loaded by the Kafka connector has a schema consisting of two VARIANT
columns (a minimal sketch of such a table is shown at the end of this subsection):
a. RECORD_CONTENT. This contains the Kafka message.
b. RECORD_METADATA. This contains metadata about the message, for example, the topic from
which the message was read.
20. If Snowflake creates the table, then the table contains only these two columns. The
RECORD_CONTENT column contains the Kafka message.
21. For example, a message from an IoT (Internet of Things) weather sensor might include the timestamp
at which the data was recorded, the location of the sensor, the temperature, humidity, etc.
22. Typically, each message in a specific topic has the same basic structure. Different topics typically use
different structures.
23. Each Kafka message is passed to Snowflake in JSON format or Avro format.

24. Expressed in JSON syntax, a sample message might look similar to the
following:

{
"meta":
{
"offset": 1,
"topic": "PressureOverloadWarning",
"partition": 12,
"key": "key name",
"schema_id": 123,
"CreateTime": 1234567890,
"headers":
{
"name1": "value1",
"name2": "value2"
}
},
"content":
{
"ID": 62,
"PSI": 451,
"etc": "..."
}
}

Here is a simple example of extracting data based on the topic in the


RECORD_METADATA:

select
record_metadata:CreateTime,
record_content:ID
from table1
where record_metadata:topic = 'PressureOverloadWarning';
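For reference, a table populated by the connector is structurally equivalent to the following sketch
(MY_KAFKA_TABLE is a hypothetical name; in practice the connector creates the table and its columns
automatically):

CREATE TABLE my_kafka_table (
    record_content VARIANT,
    record_metadata VARIANT
);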

Workflow for the Kafka Connector

25. The Kafka connector completes the following process to subscribe to Kafka topics and create
Snowflake objects:

• The Kafka connector subscribes to one or more Kafka topics based on the
configuration information provided via the Kafka configuration file or command line
(or the Confluent Control Center; Confluent only).
• The connector creates the following objects for each topic:
o One internal stage to temporarily store data files for each topic.
o One pipe to ingest the data files for each topic partition.
o One table for each topic. If the table specified for each topic does not exist, the
connector creates it; otherwise, the connector creates the
RECORD_CONTENT and RECORD_METADATA columns in the existing table
and verifies that the other columns are nullable (and produces an error if they
are not).

The following diagram shows the ingest flow for Kafka with the Kafka connector:

1. One or more applications publish JSON or Avro records to a Kafka cluster. The
records are split into one or more topic partitions.
2. The Kafka connector buffers messages from the Kafka topics. When a
threshold (time or memory or number of messages) is reached, the connector
writes the messages to a temporary file in the internal stage. The connector
triggers Snowpipe to ingest the temporary file. Snowpipe copies a pointer to
the data file into a queue.
3. A Snowflake-provided virtual warehouse loads data from the staged file into
the target table (i.e. the table specified in the configuration file for the topic)
via the pipe created for the Kafka topic partition.
4. (Not shown) The connector monitors Snowpipe and deletes each file in the
internal stage after confirming that the file data was loaded into the table.

If a failure prevented the data from loading, the connector moves the file into
the table stage and produces an error message.

5. The connector repeats steps 2-4.


Attention

Snowflake polls the insertReport API for one hour. If the load status of an ingested file does not indicate
success within this hour, the files being ingested are moved to a table stage.

It may take at least one hour for these files to be available on the table stage. Files are only
moved to the table stage when their ingestion status could not be found within the
previous hour.
Snowflake SQL API

26. The Snowflake SQL API is a REST API that you can use to access and update data in a Snowflake
database. You can use this API to develop custom applications and integrations that:
• Perform queries
• Manage your deployment (e.g. provision users and roles, create tables, etc.)

The Snowflake SQL API provides operations that you can use to:

• Submit SQL statements for execution.


• Check the status of the execution of a statement.
• Cancel the execution of a statement.

You can use this API to execute standard queries and most DDL and DML statements.
27. Metadata Fields in Snowflake
1. The data contained in metadata fields may be processed outside of your Snowflake Region.
2. It is a customer responsibility to ensure that no personal data (other than for a User object), sensitive
data, export-controlled data, or other regulated data is entered into any metadata field when using
the Snowflake Service.
3. The most common metadata fields are:
a. Object definitions, such as a policy, an external function, or a view definition.
b. Object properties, such as an object name or an object comment.
28. Few Facts to remember
1. The number of GCP regions available in APAC is 0.
2. By default, the maximum number of accounts in an organization cannot exceed 25
3. To recover the management costs of Snowflake-provided compute resources, Snowflake applies
a 1.5x multiplier to resource consumption.
4. Snowflake credits charged per compute-hour: 1.5 for Snowflake-managed compute resources and 1 for
cloud services.
5. Virtual warehouses are billed by the second with 1 minute minimum
6. Typical utilization of cloud services (up to 10% of daily compute credits) is included for free
7. Each server in a cluster can process 8 files in parallel
8. The recommended file size for data loading is 100-250 MB (or larger) compressed.
9. When unloading, Snowflake creates files of approximately 16 MB each by default; this can be increased
to up to 5 GB per file using the MAX_FILE_SIZE copy option.
10. Snowpipe charges 0.06 credits per 1000 files queued.
11. An ALTER PIPE … REFRESH statement copies a set of data files staged within the previous 7 days to
the Snowpipe ingest queue for loading into the target table.
12. When a pipe is paused, event messages received for the pipe enter a limited retention period. The period
is 14 days by default. If a pipe is paused for longer than 14 days, it is considered stale (see the example
after this list).
13. Each micro-partition contains between 50 MB and 500 MB of uncompressed data
14. Standard Time Travel is 1 day. Extended Time Travel (up to 90 days) requires Snowflake Enterprise
Edition.
15. The fail-safe retention period is 7 days
16. The total number of reader accounts a provider can create is 20.
17. Result Cache holds the results of every query executed in the past 24 hours
18. History tab list includes (up to) 100 queries
19. Query results in History tab are available for a 24-hour period
20. The History tab page allows you to view and drill into the details of all queries executed in the last 14
days.
21. The web interface only supports exporting results up to 100 MB in size. If a query result exceeds this
limit, you are prompted whether to proceed with the export.
22. If the data retention period for a source table is less than 14 days and a stream has not yet been
consumed, Snowflake temporarily extends the retention period up to a maximum of 14 days by default,
regardless of your account’s Snowflake edition, to prevent the stream from going stale.
23. A task can have a maximum of 100 child tasks.
24. All ingested data stored in Snowflake tables is encrypted using AES-256 strong encryption.
25. Account and table master keys are automatically rotated by Snowflake when they are more than 30 days
old.
26. Maximum compressed row size in Snowflake is 16MB
27. micro-partitions are approximately 16MB in size
28. STATEMENT_TIMEOUT_IN_SECONDS for running queries can be set to up to 7 days; a value of 0
specifies that the maximum timeout value (7 days) is enforced. The default is 2 days (172800 seconds).
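As a small illustration of points 11 and 12 above (MYPIPE is a hypothetical pipe name):

-- Re-queue files staged within the last 7 days for loading
ALTER PIPE mypipe REFRESH;
-- Pause the pipe; if it stays paused for more than 14 days it is considered stale
ALTER PIPE mypipe SET PIPE_EXECUTION_PAUSED = TRUE;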
SnowPro Core Certification

New version of the SnowPro Core Certification Exam, released early September, 2022

The SnowPro™ Core Certification is designed for individuals who would like to demonstrate their
knowledge of the Snowflake Cloud Data Platform. The candidate has a thorough knowledge of:

• Data Loading and Transformation in Snowflake


• Virtual Warehouse Performance and Concurrency
• DDL and DML Queries
• Using Semi-Structured and Unstructured Data
• Cloning and Time Travel
• Data Sharing
• Snowflake Account Structure and Management

Exam Format

Exam Version: COF-C02

Total Number of Questions: 100

Question Types: Multiple Select, Multiple Choice

Time Limit: 115 minutes

Languages: English

Registration Fee: $175 USD

Passing Score: 750 (scaled scoring from 0 – 1000)

Unscored Content: Exams may include unscored items to gather statistical information.
These items are not identified on the form and do not affect your score, and additional time
is factored in to account for this content.

Exam Domain Breakdown

This exam guide includes test domains, weightings, and objectives. It is not a comprehensive listing
of all the content that will be presented on this examination. The table below lists the main content
domains and their weighting ranges.
Target Audience:

We recommend that individuals have at least 6 months of experience using the Snowflake Platform prior to
attempting this exam. Familiarity with basic ANSI SQL is recommended.

• Solution Architects
• Data Engineers
• Snowflake Account Administrators
• Database Administrators
• Data Scientists
• Data Analysts
• Application Developers

SNOWFLAKE OVERVIEW

Below is a list of documents, videos, and training modules about Snowflake:

Snowflake Overview

• Data Cloud Overview: Frank Slootman


• Introduction to the Snowflake Data Cloud
• Data Goes Around The World in 80 Seconds With Snowflake
• What is Snowflake? 8 Minute Demo
• Introduction to Snowflake - Key Concepts & Architecture
• Snowflake Getting Started
• Before You Begin

Getting Started with Snowflake is a resource of 12 modules designed to help you get familiar with
Snowflake. We recommend you complete all of the modules, but the individual topics are linked to the exam
content areas.


29. Introduction to Snowpark
Integrating Python with SQL Functionality
30. SnowPro Core Certification (COF-C02)
Domain 1.0: Snowflake Cloud Data Platform Features and Architecture
1.1 Outline key features of the Snowflake Cloud Data Platform.
● Elastic Storage
● Elastic Compute
● Snowflake’s three distinct layers
● Data Cloud/ Data Exchange/ Partner Network
● Cloud partner categories

1.2 Outline key Snowflake tools and user interfaces.


● Snowflake User Interfaces (UI)
● Snowsight
● Snowflake connectors
● Snowflake drivers
● SQL scripting
● Snowpark

1.3 Outline Snowflake’s catalog and objects.


● Databases
● Schemas
● Table Types
● View Types
● Data types
● User-Defined Functions (UDFs) and User Defined Table Functions (UDTFs)
● Stored Procedures
● Streams
● Tasks
● Pipes
● Shares
● Sequences

1.4 Outline Snowflake storage concepts.


● Micro partitions
● Types of column metadata clustering
● Data Storage Monitoring
● Search Optimization Service
Snowflake Cloud Data Platform Features and Architecture Study Resources
Snowflake University On Demand Trainings
Snowflake University, LevelUp: Snowflake’s Key Concepts

Snowflake University, Level Up: Snowflake Ecosystem

Getting Started With Snowflake

Module 2: Prepare your Lab Environment


Module 3: The Snowflake User Interface & Lab Story

Additional Assets
Quick Tour of the Web Interface (Document + Video)

Snowflake Documentation Links


• Caching in Snowflake Data Warehouse
• Classic Web Interface
• Constraints
• CREATE SEQUENCE
• CREATE STAGE
• CREATE STREAM
• CURRENT_CLIENT
• Data Storage Considerations
• Data Types
• Database, Schema, & Share DDL
• Databases, Tables & Views
• DROP STAGE
• Installing SnowSQL
• Introduction to Snowflake
• Introduction to Snowpipe
• Introduction to Tasks
• LIST
• SnowCD (Connectivity Diagnostic Tool)
• Snowflake High Availability for Data Applications
• Snowflake Information Schema

Domain 2.0: Account Access and Security


2.1 Outline security principles.
● Network security and policies
● Multi-Factor Authentication (MFA)
● Federated authentication
● Single Sign-On (SSO)

2.2 Define the entities and roles that are used in Snowflake.
● Outline how privileges can be granted and revoked
● Explain role hierarchy and privilege inheritance
2.3 Outline data governance capabilities in Snowflake.
● Accounts
● Organizations
● Databases
● Secure views
● Information schemas
● Access history and read support

Domain 2.0: Account Access and Security Study Resources


Snowflake University On Demand Trainings
Snowflake University, LevelUp: Accounts & Assurances
Snowflake University, Level Up: Container Hierarchy

Getting Started With Snowflake


Module 9: Working with Roles, Account Admin & Account Usage

Additional Assets
Crucial Security Controls for Your Cloud Data Warehouse (Video)
Quickly Visualize Snowflake’s Roles, Grants and Privileges (Article)
Snowflake Security Overview (Video)

Snowflake Documentation Links


Access Control in Snowflake
Account Usage
Authentication
CREATE TASK
Database Replication Considerations
GRANTS_TO_USERS View
GRANT OWNERSHIP
GRANT <privileges> … TO ROLE
Introduction to Organizations
LOGIN_HISTORY View
Network Policies
Overview of Views
SHOW GRANTS
Snowflake Information Schema
Summary of Governance Features
Summary of Security Features
USE SECONDARY ROLES
User Management
Using the Search Optimization Service
Working with Secure Views

Domain 3.0: Performance Concepts


3.1 Explain the use of the Query Profile.
● Explain plans
● Data spilling
● Use of the data cache
● Micro-partition pruning
● Query history

3.2 Explain virtual warehouse configurations.


● Multi-clustering
● Warehouse sizing
● Warehouse settings and access

3.3 Outline virtual warehouse performance tools.


● Monitoring warehouse loads
● Query performance
● Scaling up compared to scaling out
● Resource monitors

3.4 Optimize query performance.


● Describe the use of materialized views
● Use of specific SELECT commands
Domain 3.0: Performance Concepts Study Resources
Snowflake University On Demand Trainings
Snowflake University, Level Up: Query History & Caching
Snowflake University, LevelUp: Query & Result
Snowflake University, LevelUp: Context
Snowflake University: LevelUp: Resource Monitoring
Snowflake University, Essentials - Data Warehousing Workshop

Getting Started With Snowflake


Module 6: Working with Queries, The Results Cache & Cloning

Additional Assets
Accelerating BI Queries with Caching in Snowflake (Video)
Caching in Snowflake Data Warehouse (Article)
How to: Understand Result Caching (Article)
Managing Snowflake’s Compute Resources (Blog)
Performance Impact from Local and Remote Disk Spilling (Article)
Search Optimization: When & How to Use (Article)
Snowflake Materialized Views: A Fast, Zero-Maintenance Accurate Solution (Blog)
Snowflake Workloads Explained: Data Warehouse (Video)
Tackling High Concurrency with Multi-Cluster Warehouses (Video)
Tuning Snowflake (Article)
Using Materialized Views to Solve Multi-Clustering Performance Problems (Article)

Snowflake Documentation Links


Access Control Privileges
ALTER FILE FORMAT
ALTER WAREHOUSE
Analyzing Queries Using Query Profile
Clustering Keys & Clustered Tables
CREATE WAREHOUSE
LOGIN_HISTORY , LOGIN_HISTORY_BY_USER
Managing Cost in Snowflake
METERING_HISTORY View
Multi-cluster Warehouses
Queries
QUERY_HISTORY View
QUERY_HISTORY , QUERY_HISTORY_BY_*
Querying Semi-structured Data
RESOURCE_MONITORS View
Parameters
Understanding Snowflake Table Structures
Understanding Your Cost
Using Persisted Query Results
Using the Search Optimization Service
Virtual Warehouses
Warehouse Considerations
Working with Materialized Views
Working with Resource Monitors
Working with Warehouses
Domain 4.0: Data Loading and Unloading
4.1 Define concepts and best practices that should be considered when loading data.
● Stages and stage types
● File size
● File formats
● Folder structures
● Adhoc/bulk loading using the Snowflake UI

4.2 Outline different commands used to load data and when they should be used.
● CREATE PIPE
● COPY INTO
● GET
● INSERT/INSERT OVERWRITE
● PUT
● STREAM
● TASK
● VALIDATE

4.3 Define concepts and best practices that should be considered when unloading data.
● File formats
● Empty strings and NULL values
● Unloading to a single file
● Unloading relational tables

4.4 Outline the different commands used to unload data and when they should be used.
● LIST
● COPY INTO
● CREATE FILE FORMAT
● CREATE FILE FORMAT … CLONE
● ALTER FILE FORMAT
● DROP FILE FORMAT
● DESCRIBE FILE FORMAT
● SHOW FILE FORMAT
Domain 4.0: Data Loading and Unloading Study Resources
Snowflake University On Demand Trainings
Snowflake University, Level Up: Data Loading
Badge 1: Data Warehousing Workshop

Getting Started With Snowflake


Module 4: Preparing to Load Data
Module 5: Loading Data

Additional Assets
Best Practices for Data Unloading (Article)
Best Practices for Using Tableau with Snowflake (White Paper, requires email for access)
Building and Deploying Continuous Data Pipelines (Video)
Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
How to Load Terabytes into Snowflake - Speeds, Feeds and Techniques (Blog)

Snowflake Documentation Links


Continuous Data Pipelines
COPY INTO <location>
COPY INTO <table>
CREATE PIPE
GET
LIST
Loading Data into Snowflake
Managing Snowpipe
OBJECT_CONSTRUCT
PUT
REMOVE
Unloading Data from Snowflake
VALIDATE

Domain 5.0: Data Transformations


5.1 Explain how to work with standard data.
● Estimating functions
● Sampling
● Supported function types
● User-Defined Functions (UDFs) and stored procedures

5.2 Explain how to work with semi-structured data.


● Supported file formats, data types, and sizes
● VARIANT column
● Flattening the nested structure
5.3 Explain how to work with unstructured data.
● Directory tables
● SQL file functions
● Rest API
● Create User-Defined Functions (UDFs) for data analysis

Domain 5.0: Data Transformations Study Resources


Snowflake University On Demand Training
Badge 2: Data Application Builders Workshop

Getting Started With Snowflake


Module 7: Working with Semi-Structured Data, Views & Joins

Additional Assets
Best Practices for Managing Unstructured Data (E-book)
Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
Structured vs Unstructured vs Semi-Structured Data (Blog)
Understanding Unstructured Data With Language Models (Blog)

Snowflake Documentation Links


Constraints
CREATE <object> … CLONE
External Functions
FLATTEN
LAST_QUERY_ID
PARSE_JSON
SAMPLE / TABLESAMPLE
Semi-structured Data
Semi-structured Data Types
Stored Procedures
Tutorial: JSON Basics
Unstructured Data

Domain 6.0: Data Protection and Data Sharing

6.1 Outline Continuous Data Protection with Snowflake.


● Time Travel
● Fail-safe
● Data Encryption
● Cloning
● Replication

6.2 Outline Snowflake data sharing capabilities.


● Account types
● Data Marketplace and Data Exchange
● Private data exchange
● Access control options
● Shares
Domain 6.0: Data Protection and Data Sharing Study Resources
Snowflake University On Demand Trainings
Snowflake University, Level Up: Container Hierarchy
Snowflake University, Level Up: Backup and Recovery
Snowflake University, Essentials - Sharing, Marketplace & Exchanges Workshop
Badge 3: Sharing, Marketplace, & Exchanges Workshop

Getting Started With Snowflake


Module 8: Using Time Travel
Module 10: Data Sharing

Additional Assets
Data Protection with Time Travel in Snowflake (Video)
Getting Started on Snowflake with Partner Connect (Video)
Meta Data Archiving with Snowflake (Article)
Snowflake Continuous Data Protection (White Paper)
Top 10 Cool Snowflake Features, #7: Snowflake Fast Clone (Blog + Video)

Snowflake Documentation Links


Cloning Considerations
Continuous Data Protection
CREATE <object> … CLONE
Data Encryption
Data Storage Considerations
Database Replication Considerations
Key Pair Authentication & Key Pair Rotation
Parameters
Sharing Data Securely in Snowflake
Snowflake Time Travel & Fail-safe
Snowflake Marketplace
Understanding Data Transfer Billing
UNDROP SCHEMA
