Snowflake Core Certification Guide Dec 2022
Snowflake Core
Certification Guide 2.0
Contents
1. Introduction to Snowflake
   Data Warehouse Basics
   Cloud Service Models
   Cloud Deployment Models
   Modern Data Stack
   Data Cloud
   Snowflake Data Marketplace
   Snowflake History
   Cloud Data Platform
   Supported Regions and Platforms
   Editions
   Pricing
      Snowflake Credits
      Virtual Warehouse Sizes
      Cloud Services
      On-Demand Buying
      Pre-purchased Capacity
   Stages
   File Format
   Tables
      External Tables
   Views
      Secure Views
      Materialized Views
   Cloning
Snowflake Cloud Data Platform Features and Architecture Study Resources
Domain 2.0: Account Access and Security
   Domain 2.0: Account Access and Security Study Resources
Domain 3.0: Performance Concepts
   Domain 3.0: Performance Concepts Study Resources
Domain 4.0: Data Loading and Unloading
   Domain 4.0: Data Loading and Unloading Study Resources
Domain 5.0: Data Transformations
   Domain 5.0: Data Transformations Study Resources
Domain 6.0: Data Protection and Data Sharing
   Domain 6.0: Data Protection and Data Sharing Study Resources
1. Introduction to Snowflake
Data Warehouse Basics
OLTP VS OLAP

2. Data source
   OLAP: Consists of historical data from various databases; different OLTP databases are used as data sources for OLAP.
   OLTP: Consists only of operational, current data; the original data source is OLTP and its transactions.

6. Usage of data
   OLAP: The data is used in planning, problem-solving, and decision-making.
   OLTP: The data is used to perform day-to-day fundamental operations.

9. Volume of data
   OLAP: A large amount of data is stored, typically in TB or PB.
   OLTP: The size of the data is relatively small, as historical data is archived; for example MB or GB.

10. Queries
   OLAP: Relatively slow, as the amount of data involved is large; queries may take hours.
   OLTP: Very fast, as the queries operate on roughly 5% of the data.

11. Update
   OLAP: The OLAP database is not often updated; as a result, data integrity is unaffected.
   OLTP: The data integrity constraint must be maintained in an OLTP database.

12. Backup and Recovery
   OLAP: Only needs backup from time to time, compared to OLTP.
   OLTP: The backup and recovery process is maintained rigorously.

13. Processing time
   OLAP: The processing of complex queries can take a lengthy time.
   OLTP: Comparatively fast in processing, because of simple and straightforward queries.

15. Operations
   OLAP: Only read and rarely write operations.
   OLTP: Both read and write operations.

18. Database Design
   OLAP: Designed with a focus on the subject.
   OLTP: Designed with a focus on the application.

19. Productivity
   OLAP: Improves the efficiency of business analysts.
   OLTP: Enhances the end user’s productivity.
ON-PREMISE Vs CLOUD
On-Premise
1. On-prem solutions sit on your local network, which means a high upfront cost, as you must invest in
hardware and the appropriate software licenses.
2. You need the right skills in-house, which may involve hiring a consultant to assist with installation and
ongoing support.
3. The advantage of on-premise is that you control every aspect of the repository. You also control
when and how data leaves your network.
Cloud System
Modern Data Stack
https://blog.dataiku.com/dataikus-role-in-the-modern-data-stack
Reverse ETL completes the loop of data integration by copying data from the warehouse into systems of
record to enable teams to finally act on the same data that has been powering all the beautiful reports they
have been consuming.
Data Cloud Platform
Data Cloud
Over 400 million SaaS data sets remain siloed globally, isolated in cloud data storage and on-premise
data centers. The Data Cloud eliminates these silos, allowing you to seamlessly unify, analyze, share, and
even monetize your data.
Snowflake's cloud data platform supports multiple data workloads, from Data Warehousing and Data
Lake to Data Engineering, Data Science, and Data Application development across multiple cloud
providers and regions from anywhere in the organization. Snowflake’s unique architecture delivers near-
unlimited storage and computing in real time to virtually any number of concurrent users in the Data Cloud.
1. Snowflake history:
Why Snowflake
Supported Regions and Platforms
1. When you request a Snowflake account, choose the region where the account is located.
2. If latency is a concern, you should choose the available region with the closest geographic proximity
to your end users.
3. Cloud Provider provides additional backup and disaster recovery beyond the standard recovery
support provided by Snowflake.
4. Snowflake does not place any restrictions on the region where you choose to locate each account.
5. Choosing a specific region may have cost implications, due to pricing differences between the
regions.
6. If you are a government agency or a commercial organization that must comply with specific privacy
and security requirements of the US government, you can choose between two dedicated
government regions provided by Snowflake.
7. Snowflake does not move data between accounts, so any data in an account in a region remains in
the region unless users explicitly choose to copy, move, or replicate the data.
Editions
Standard Edition
1. Standard Edition is the introductory-level offering, providing full, unlimited access to all of
Snowflake’s standard features. It provides a strong balance between features, level of support,
and cost.
Enterprise Edition
2. Enterprise Edition provides all the features and services of Standard Edition, with additional
features designed specifically for the needs of large-scale enterprises and organizations.
Business Critical Edition
3. Business Critical Edition was formerly known as Enterprise for Sensitive Data (ESD).
4. It offers even higher levels of data protection to support the needs of organizations with
extremely sensitive data, particularly PHI data that must comply with HIPAA and HITRUST
CSF regulations.
5. It includes all the features and services of Enterprise Edition, with the addition of enhanced
security and data protection.
6. In addition, database failover/failback adds support for business continuity and disaster recovery.
Virtual Private Snowflake (VPS)
7. Virtual Private Snowflake offers the highest level of security for organizations that have the strictest
requirements, such as financial institutions and any other large enterprises that collect, analyze,
and share highly sensitive data.
8. It includes all the features and services of Business Critical Edition, but in a completely separate
Snowflake environment, isolated from all other Snowflake accounts (i.e. VPS accounts do not share
any resources with accounts outside the VPS).
9. All new accounts, regardless of Snowflake Edition, receive Premier support, which includes 24/7
coverage.
Account Identifiers
10. A hostname for a Snowflake account starts with an account identifier and ends with the Snowflake
domain (snowflakecomputing.com). Snowflake supports two formats to use as the account
identifier in your hostname:
o Account name (preferred)
o Account locator
11. Example
o Organization Name : NHPREQQ
o Account Name : MRFSPORE
o Account_locator : JB92030
o Region Name : ap-south-1
o Cloud Provider Name : aws
https://nhpreqq-mrfspore.snowflakecomputing.com
or
https://jb92030.ap-south-1.aws.snowflakecomputing.com
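As a quick check, the standard context functions below return the locator and region of the current account; the example values above (organization, account name, locator) are illustrative only.

-- Returns the account locator, the Snowflake region, and the current user for this session
SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_USER();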
Pricing
Snowflake Credits
1. Snowflake credits are used to pay for the consumption of resources on Snowflake.
2. If one server runs for one hour, one Snowflake credit is charged.
3. A Snowflake credit is a unit of measure, and it is consumed only when a customer uses resources
such as virtual warehouses, the cloud services layer, or serverless features.
4. The size of the virtual warehouse determines how fast queries will run.
5. When a virtual warehouse is not running (i.e. suspended), it does not consume any Snowflake credits.
6. Different virtual warehouse sizes consume credits at different rates, billed by the second with a one-
minute minimum.
Cloud Services
8. Cloud services resources are automatically assigned by Snowflake based on the requirements of
the workload.
9. Typical utilization of cloud services (up to 10% of daily compute credits) is included for free, which
means most customers will not see incremental charges for cloud services usage.
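For illustration, per-warehouse credit consumption, including the cloud services portion, can be reviewed from the Account Usage share (requires appropriate privileges, e.g. ACCOUNTADMIN); the date window is arbitrary.

-- Daily credits per warehouse, split into compute and cloud services
SELECT TO_DATE(start_time)              AS usage_date,
       warehouse_name,
       SUM(credits_used_compute)        AS compute_credits,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;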
On-Demand Buying
Pre-purchased Capacity
https://www.snowflake.com/pricing/
Snowflake Releases
Snowflake is committed to providing a seamless, always up-to-date experience for our
users while also delivering ever-increasing value through rapid development and continual
innovation.
• Snowflake deploys new releases each week. This allows us to regularly deliver service
improvements in the form of new features, enhancements, and fixes.
Full release
• New features
• Feature enhancements or updates
• Fixes
https://docs.snowflake.com/en/release-notes/new-features.html
Full releases may be deployed on any day of the week, except Friday.
Patch release
A patch release includes fixes only. Note that the patch release for a given week
may be canceled if the full release for the week is sufficiently delayed or prolonged.
If needed, additional patch releases are deployed to address any issues that are
encountered during or after the release process.
Every month, Snowflake deploys one behavior change release. Behavior change
releases contain changes to existing behaviors that may impact customers.
Behavior change releases take place over two months: during the first month, or test
period, the behavior change release is disabled by default, but you can enable it in
your account; during the second month, or opt-out period, the behavior change is
enabled by default, but you can disable it in your account.
Snowflake does not override these settings during the release: if you disable a
release during the testing period, we do not enable it at the beginning of the opt-out
period. At the end of the opt-out period, Snowflake enables the behavior changes in
all accounts. However, you can still request an extension to temporarily disable
specific behavior changes from the release by contacting Snowflake Support.
Sign up for a trial account
2. Introduction to Web UI - Snowsight
Admin Menu
1. Warehouses
2. Resource Monitors
3. Users and Roles
4. Billing
5. Partner Connect
6. Help and Support
Warehouse Creation
System Roles
New Role Creation
Data Menu
Database Creation
Schema Creation
Table Creation
View Creation
Stage Creation
File Format Creation
Activity Menu
Query History
Copy History
Worksheet
Worksheet Filter
Automatic Contextual Statistics
Technology Partner Categories
• Data Integration
• Business Intelligence (BI)
• Machine Learning & Data Science
• Security & Governance
• SQL Development & Management
Partner Connect
https://youtu.be/8sO53KczJ4M
● Snowflake does not provide tools to extract data from source systems and/or visualize data--it relies
upon its Technology Partners to provide those tools
● Snowflake’s relationships with/integrations with Technology Partners are driven largely by customer
requests and needs for them
● Snowflake engages with Technology Partners and works with technologies that are both cloud and on-
premises based
● As most activity in Snowflake revolves around integrating and visualizing data, Data Integration
and Business Intelligence technologies are the most prevalent in the Snowflake Technology
Ecosystem
● Various technologies offer different levels of integrations and advantages with Snowflake:
○ ELT tools like Talend and Matillion leverage Snowflake's scalable compute for data
transformation by pushing transform processing down to Snowflake
○ BI tools like Tableau and Looker offer native connectivity built into their products, with Looker
leveraging Snowflake’s in-database processing scalable compute for querying
○ Snowflake has built a custom Spark library that allows the results of a Snowflake query to be
loaded directly into a dataframe
● To fully understand the value of Snowflake, one must understand its advantages over its
competitors
○ On-Premises EDW
■ Instant scalability
■ Separation of compute and storage
■ No need for data distribution
○ Cloud EDW
■ Concurrency
■ Automatic failover and disaster recovery
■ Built for the cloud
○ Hadoop
■ No hardware to manage
■ No need to manage files
■ Native SQL (including on semi-structured)
○ Data Engines
■ No need to manage files
■ Automated cluster management
■ Native SQL
○ Apache Spark
■ No need to manage data files
■ Automated cluster management
■ Full SQL Support
https://www.snowflake.com/partners/technology-partners/
3. Snowflake’s processing engine uses ANSI SQL, the most familiar and widely used database querying
language. SQL capabilities have been natively built into the product.
• Allows customers to leverage the skills they already have
• Enables interoperability with trusted tools, specifically in data integration and business
intelligence
• Promotes simplified migration from legacy platforms
4. SQL functionality can be extended via SQL user-defined functions (UDFs), JavaScript UDFs,
session variables, and stored procedures.
5. Snowflake supports structured and semi-structured data within one fully SQL data warehouse.
• Semi-structured data strings are stored in a column with a data type of “VARIANT”
• Snowflake’s storage methodology optimizes semi-structured, or VARIANT, storage based on
repeated elements
• Just like structured data, semi-structured data can be queried using SQL while incorporating
JSON path notation
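A minimal sketch of storing and querying semi-structured data in a VARIANT column; the table, column, and JSON keys below are hypothetical.

-- Load raw JSON into a VARIANT column and query it with path notation
CREATE OR REPLACE TABLE raw_events (v VARIANT);

INSERT INTO raw_events
  SELECT PARSE_JSON('{"device":{"type":"mobile"},"events":[{"ts":"2022-12-01 10:00:00","code":200}]}');

SELECT v:device:type::STRING         AS device_type,
       v:events[0].code::NUMBER      AS first_event_code,
       v:events[0].ts::TIMESTAMP_NTZ AS first_event_ts
FROM raw_events;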
MAX_CONCURRENCY_LEVEL
Description
Specifies the concurrency level for SQL statements (i.e. queries and DML) executed by a warehouse. When
the level is reached, the operation performed depends on whether the warehouse is a single or multi-cluster
warehouse:
Single or multi-cluster (in Maximized mode): Statements are queued until already-allocated resources are
freed or additional resources are provisioned, which can be accomplished by increasing the size of the
warehouse.
As each statement is submitted to a warehouse, Snowflake allocates resources for executing the statement;
if there aren’t enough resources available, the statement is queued or additional warehouses are started,
depending on the warehouse.
The actual number of statements executed concurrently by a warehouse might be more or less than the
specified level:
Smaller, more basic statements: More statements might execute concurrently because small statements
generally execute on a subset of the available compute resources in a warehouse. This means they only
count as a fraction towards the concurrency level.
Default : 8
Lowering the concurrency level for a warehouse increases the compute resource allocation per statement,
which potentially results in faster query performance, particularly for large/complex and multi-statement
queries.
Raising the concurrency level for a warehouse decreases the compute resource allocation per statement;
however, it does not necessarily limit the total number of concurrent queries that can be executed by the
warehouse, nor does it necessarily improve total warehouse performance, which depends on the nature of
the queries being executed.
Note that, as described earlier, this parameter impacts multi-cluster warehouses (in Auto-scale mode)
because Snowflake automatically starts a new warehouse within the multi-cluster warehouse to avoid
queuing. Thus, lowering the concurrency level for a multi-cluster warehouse (in Auto-scale mode) potentially
increases the number of active warehouses at any time.
Also, remember that Snowflake automatically allocates resources for each statement when it is submitted
and the allocated amount is dictated by the individual requirements of the statement. Based on this, and
through observations of user query patterns over time, we’ve selected a default that balances performance
and resource usage.
A typical Snowflake X-Small warehouse with assumed* compute capacity:
1. For a multi-cluster warehouse, the SCALING_POLICY decides when additional clusters spin up.
When the value is set to ECONOMY, Snowflake starts additional clusters in a delayed fashion,
giving more importance to cost control than to performance. When the value is set to
STANDARD, Snowflake prioritizes performance and starts additional clusters immediately
when queries begin to queue.
2. If the MAX_CONCURRENCY_LEVEL value is lower, the additional clusters in a multi-cluster
warehouse may start sooner.
3. The value of the STATEMENT_QUEUED_TIMEOUT_IN_SECONDS parameter affects when the
additional cluster in a multi-cluster warehouse will spawn. The default is 0, which means no
timeout. Any non-zero value is the number of seconds a queued query will wait: in a single-
cluster warehouse, the query is cancelled if it does not get any compute resources within that
number of seconds; in a multi-cluster warehouse, an additional cluster is spawned and
compute resources are allocated to that query. (A sample warehouse definition using these
parameters follows this list.)
4. It is very important to use multiple warehouses for different types and sizes of processing
needs. In particular, if there is a process that is comparatively complex, deals with a huge
volume of data, and takes a lot of time and compute resources, use a separate, larger
warehouse to handle that process and do not use the same warehouse for any other needs.
5. Consider tuning the MAX_CONCURRENCY_LEVEL parameter to provide more compute
resources to a single process, so that it can execute faster. Keeping in mind the discussion
about concurrency within a single-cluster warehouse, below is an example of how a smaller
warehouse can provide performance like a bigger warehouse and in turn reduce cost. The
comparison below provides an example of how this can be performed.
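A hedged example of the parameters discussed in the list above, applied to a multi-cluster warehouse (names and values are illustrative; multi-cluster warehouses require Enterprise Edition or higher).

CREATE OR REPLACE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3           -- multi-cluster, Auto-scale mode
  SCALING_POLICY = 'STANDARD'     -- start extra clusters as soon as queries begin to queue
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Tune queuing behavior for this warehouse
ALTER WAREHOUSE etl_wh SET
  MAX_CONCURRENCY_LEVEL = 8
  STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 120;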
Example where the performance consistently improves with warehouse size increase:
Example where the performance does not improve with an increase in warehouse size. The LARGE
warehouse is not going to improve performance in this case.
*Important Notes:
• The numbers of CPUs and threads are used for discussion purposes only and are not disclosed by
Snowflake Computing. There is no documentation from Snowflake Computing describing the
exact type of servers used in the warehouses or the detailed architecture of those
servers / instances. Even if the number of CPUs in a server or the number of cores in a
CPU differs from the numbers used in this discussion, the basic concepts behind
concurrency and how it is handled in Snowflake do not change.
• We may assume that the number of servers (nodes) per cluster will remain the same as per
Snowflake's current standards, but over time the underlying architecture of the compute
clusters will keep changing in terms of the number of CPUs per node and the amount of RAM and
SSD available, as the compute instances available in the underlying cloud platforms change.
Virtual Warehouse Considerations
1. This topic provides general guidelines and best practices for using virtual warehouses in Snowflake to
process queries.
2. It does not provide specific or absolute numbers, values, or recommendations because every query
scenario is different and is affected by numerous factors, including number of concurrent users/queries,
number of tables being queried, and data size and composition, as well as your specific requirements for
warehouse availability, latency, and cost.
3. The keys to using warehouses effectively and efficiently are:
• Experiment with different types of queries and different warehouse sizes to determine the
combinations that best meet your specific query needs and workload.
• Don’t focus on warehouse size. Snowflake utilizes per-second billing, so you can run larger
warehouses (Large, X-Large, 2X-Large, etc.) and simply suspend them when not in use.
Each warehouse, when running, maintains a cache of table data accessed as queries are processed by the
warehouse. This enables improved performance for subsequent queries if they are able to read from the
cache instead of from the table(s) in the query. The size of the cache is determined by the compute
resources in the warehouse (i.e. the larger the warehouse and, therefore, more compute resources in the
warehouse), the larger the cache.
This cache is dropped when the warehouse is suspended, which may result in slower initial performance for
some queries after the warehouse is resumed.
As the resumed warehouse runs and processes more queries, the cache is rebuilt, and queries that are able
to take advantage of the cache will experience improved performance.
Keep this in mind when deciding whether to suspend a warehouse or leave it running. In other words,
consider the trade-off between saving credits by suspending a warehouse versus maintaining the cache of
data from previous queries to help with performance.
When creating a warehouse, the two most critical factors to consider, from a cost and performance
perspective, are:
• The warehouse size you select (i.e. the available compute resources)
• Whether the warehouse is left running or suspended (automatically or manually) when not in use
The initial size you select for a warehouse depends on the task the warehouse is performing and the
workload it processes. For example:
• For data loading, the warehouse size should match the number of files being loaded and the amount
of data in each file. For more details, see Planning a Data Load.
• For queries in small-scale testing environments, smaller warehouses sizes (X-Small, Small, Medium)
may be sufficient.
• For queries in large-scale production environments, larger warehouse sizes (Large, X-Large, 2X-
Large, etc.) may be more cost effective.
However, note that per-second credit billing and auto-suspend give you the flexibility to start with larger
sizes and then adjust the size to match your workloads. You can always decrease the size of a warehouse at
any time.
Resizing a warehouse generally improves query performance, particularly for larger, more complex queries.
It can also help reduce the queuing that occurs if a warehouse does not have enough compute resources to
process all the queries that are submitted concurrently. Note that warehouse resizing is not intended for
handling concurrency issues; instead, use additional warehouses to handle the workload or use a multi-
cluster warehouse (if this feature is available for your account).
Snowflake supports resizing a warehouse at any time, even while running. If a query is running slowly and
you have additional queries of similar size and complexity that you want to run on the same warehouse, you
might choose to resize the warehouse while it is running; however, note the following:
• As stated earlier about warehouse size, larger is not necessarily faster; for smaller, basic queries that are
already executing quickly, you may not see any significant improvement after resizing.
• Resizing a running warehouse does not impact queries that are already being processed by the
warehouse; the additional compute resources, once fully provisioned, are only used for queued and new
queries.
• Resizing from a 5XL or 6XL warehouse to a 4XL or smaller warehouse will result in a brief period
during which the customer is charged for both the new warehouse and the old warehouse while the old
warehouse is quiesced.
• Keep this in mind when choosing whether to decrease the size of a running warehouse or keep it at the
current size. In other words, there is a trade-off with regards to saving credits versus maintaining the
warehouse cache.
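A small sketch of per-second-billing-friendly settings and an in-flight resize; the warehouse name and sizes are placeholders.

-- Let the warehouse suspend itself when idle and resume on the next query
ALTER WAREHOUSE reporting_wh SET
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;

-- Resize while running; only queued and new queries use the added resources
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale back down once the heavy workload finishes
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';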
2. Snowflake’s approach to access control combines aspects from both of the following models:
➢ Discretionary Access Control (DAC): Each object has an owner role, who can in turn grant
access to that object.
➢ Role-based Access Control (RBAC): Access privileges are assigned to roles, which are in turn
assigned to users.
5. In addition, each securable object has an owner that can grant access to other roles.
6. Every securable object is owned by a single role, which is typically the role used to create the object.
7. The owning role has all privileges on the object by default, including the ability to grant or revoke
privileges on the object to other roles.
Roles
8. Roles are the entities to which privileges on securable objects can be granted and revoked.
9. A user can be assigned multiple roles. This allows users to switch roles (i.e. choose which role is
active in the current Snowflake session) to perform different actions using separate sets of privileges.
10. Roles can be also granted to other roles, creating a hierarchy of roles. The privileges associated with
a role are inherited by any roles above that role in the hierarchy.
System Roles
The following diagram illustrates the hierarchy for the system-defined roles along with the recommended
structure for additional, user-defined custom roles:
1. There are a small number of system-defined roles in a Snowflake account. System-defined roles
cannot be dropped. In addition, the privileges granted to these roles by Snowflake cannot be
revoked. Additional privileges can be granted to the system-defined roles, but this is not recommended.
2. System-defined roles are created with privileges related to account management. As a best practice,
it is not recommended to mix account-management privileges and entity-specific privileges in the
same role. If additional privileges are needed, we recommend granting the additional privileges to a
custom role and assigning the custom role to the system-defined role.
SECURITYADMIN
• Role that can manage any object grant globally, as well as create, monitor, and manage users
and roles. More specifically, this role:
• Is granted the MANAGE GRANTS security privilege to be able to modify any grant, including
revoking it.
• Inherits the privileges of the USERADMIN role via the system role hierarchy (e.g.
USERADMIN role is granted to SECURITYADMIN).
SYSADMIN
• Role that has privileges to create warehouses and databases (and other objects) in an
account.
• If, as recommended, you create a role hierarchy that ultimately assigns all custom roles to the
SYSADMIN role, this role also has the ability to grant privileges on warehouses, databases,
and other objects to other roles.
PUBLIC
• Pseudo-role that is automatically granted to every user and every role in your account.
• The PUBLIC role can own securable objects, just like any other role; however, the objects
owned by the role are, by definition, available to every other user and role in your account.
• This role is typically used in cases where explicit access control is not needed and all users are
viewed as equal with regard to their access rights.
Custom Roles
• Custom roles (i.e. any roles other than the system-defined roles) can be created by the
SECURITYADMIN role as well as by any role to which the CREATE ROLE privilege has been
granted.
• By default, the newly-created role is not assigned to any user, nor granted to any other role.
• Conversely, if a custom role is not assigned to SYSADMIN through a role hierarchy, the
system administrators will not be able to manage the objects owned by the role. Only those
roles granted the MANAGE GRANTS privilege (typically only the SECURITYADMIN role) will
see the objects and be able to modify their access grants.
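A sketch of creating a custom role and attaching it to the recommended hierarchy; the role, database, schema, and user names are placeholders, and the grants assume SYSADMIN owns the database.

USE ROLE SECURITYADMIN;
CREATE ROLE analyst;
-- Attach the custom role to the system role hierarchy so SYSADMIN can manage its objects
GRANT ROLE analyst TO ROLE SYSADMIN;
GRANT ROLE analyst TO USER jdoe;

-- The database/schema owner (assumed here to be SYSADMIN) grants object privileges
USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE mydb TO ROLE analyst;
GRANT USAGE ON SCHEMA mydb.myschema TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA mydb.myschema TO ROLE analyst;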
PRIVILEGES
• For each securable object, there is a set of privileges that can be granted on it.
• For existing objects, privileges must be granted on individual objects, e.g. the SELECT
privilege on the mytable table.
• Future grants allow defining an initial set of privileges on objects created in a schema; i.e. the
SELECT privilege on all new tables created in the myschema schema.
• In regular (i.e. non-managed) schemas, use of these commands is restricted to the role that
owns an object (i.e. has the OWNERSHIP privilege on the object) or roles that have the
MANAGE GRANTS global privilege for the object (typically the SECURITYADMIN role).
• In managed access schemas, object owners lose the ability to make grant decisions. Only the
schema owner or a role with the MANAGE GRANTS privilege can grant privileges on objects
in the schema, including future grants, centralizing privilege management.
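For illustration, a managed access schema is created with the WITH MANAGED ACCESS clause; the database and schema names are hypothetical.

-- Only the schema owner or a role with MANAGE GRANTS can grant privileges on objects in this schema
CREATE SCHEMA mydb.restricted WITH MANAGED ACCESS;

-- An existing regular schema can also be converted to managed access
ALTER SCHEMA mydb.myschema ENABLE MANAGED ACCESS;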
Recursive Grants
3. To simplify grant management, future grants allow defining an initial set of privileges to grant on
new (i.e. future) objects of a certain type in a database or a schema. As new objects are created,
the defined privileges are automatically granted to a specified role.
The below will grant select to all tables in a database
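A possible form of that statement, assuming a placeholder database mydb and role analyst:

GRANT SELECT ON ALL TABLES IN DATABASE mydb TO ROLE analyst;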
Future Grants
Similarly, to grant select on all future tables in a schema and database level.
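Assuming the same placeholder names, future grants at the schema and database level look like this:

-- Schema-level future grant (takes precedence over the database-level grant)
GRANT SELECT ON FUTURE TABLES IN SCHEMA mydb.myschema TO ROLE analyst;

-- Database-level future grant
GRANT SELECT ON FUTURE TABLES IN DATABASE mydb TO ROLE analyst;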
4. You must define future grants on each object type (schemas, tables, views, streams, etc.)
individually.
5. When future grants are defined at both the database and schema level, the schema level grants
take precedence over the database level grants, and the database level grants are ignored.
6. At database level, the global MANAGE GRANTS privilege is required to grant or revoke privileges
on future objects in a database. Only the SECURITYADMIN and ACCOUNTADMIN system roles
have the MANAGE GRANTS privilege; however, the privilege can be granted to custom roles.
• Data sharing
• Data replication
Enforcement Model
Every active user session has a “current role,” also referred to as a primary role.
1. For organizations whose security model includes a large number of roles, each with a fine
granularity of authorization via permissions, the use of secondary roles simplifies role
management.
2. All roles that were granted to a user can be activated in a session. Secondary roles are
particularly useful for SQL operations such as cross-database joins that would otherwise require
creating a parent role of the roles that have permissions to access the objects in each database.
3. When a user attempts to create an object, Snowflake compares the privileges available to the
current role in the user’s session against the privileges required to create the object.
For any other SQL actions attempted by the user, Snowflake compares the privileges available to
the current primary and secondary roles against the privileges required to execute the action on
the target objects. If the session has the required privileges on the objects, the action is allowed.
4. The account administrator (i.e. users with the ACCOUNTADMIN system role) is the most
powerful role in the system.
5. This role alone is responsible for configuring parameters at the account level. Users with the
ACCOUNTADMIN role can view and operate on all objects in the account, can view and manage
Snowflake billing and credit data, and can stop any running SQL statements.
We strongly recommend the following precautions when assigning the ACCOUNTADMIN role to users:
7. All users assigned the ACCOUNTADMIN role should also be required to use multi-factor
authentication (MFA) for login (for details, see Configuring Access Control).
8. Assign this role to at least two users. We follow strict security procedures for resetting a
forgotten or lost password for users with the ACCOUNTADMIN role. Assigning the
ACCOUNTADMIN role to more than one user avoids having to go through these procedures
because the users can reset each other’s passwords.
10. All securable database objects (such as TABLE, FUNCTION, FILE FORMAT, STAGE, SEQUENCE,
etc.) are contained within a SCHEMA object within a DATABASE. As a result, to access database
objects, in addition to the privileges on the specific database objects, users must be granted the
USAGE privilege on the container database and schema.
11. When a custom role is first created, it exists in isolation. The custom role must
also be granted to any roles that will manage the objects created by the custom
role.
12. With regular (i.e. non-managed) schemas in a database, object owners (i.e. roles
with the OWNERSHIP privilege on one or more objects) can grant access on
those objects to other roles, with the option to further grant those roles the ability
to manage object grants.
13. In a managed access schema, object owners lose the ability to make grant
decisions. Only the schema owner (i.e. the role with the OWNERSHIP privilege on
the schema) or a role with the MANAGE GRANTS privilege can grant privileges on
objects in the schema, including future grants, centralizing privilege management.
14. In Query History, a user cannot view the result set from a query that another
user executed.
15. A cloned object is considered a new object in Snowflake. Any privileges granted
on the source object do not transfer to the cloned object.
16. However, a cloned container object (a database or schema) retains any privileges
granted on the objects contained in the source object. For example, a cloned
schema retains any privileges granted on the tables, views, UDFs, and other
objects in the source schema.
7. Organization
https://docs.snowflake.com/en/user-guide-organizations.html
1. An organization is a first-class Snowflake object that links the accounts owned by your business entity.
2. Organizations simplify
• Account management and billing,
• Database Replication and Failover/Failback,
• Snowflake Secure Data Sharing,
• Other account administration tasks.
3. Once an account is created, ORGADMIN can view the account properties but does not have access to
the account data.
4. Snowflake provides historical usage data for all accounts in your organization via views in the
ORGANIZATION_USAGE schema
Benefits
8. Data availability and durability by leveraging data replication and failover.
9. Seamless data sharing with Snowflake consumers across regions.
10. Ability to monitor and understand usage across all accounts in the organization
ORGADMIN Role
11. The organization administrator (ORGADMIN) system role is responsible for managing operations at the
organization level.
12. A user with the ORGADMIN role can perform the following actions:
• Create an account in the organization. For more information, see Creating an Account.
• View/show all accounts within the organization. For more information, see Viewing a List of
Organization Accounts.
• View/show a list of regions enabled for the organization. For more information, see Viewing a List of
Regions Available for Your Organization.
• View usage information for all accounts in the organization. For more information, see Organization
Usage.
13. To support working with organizations, Snowflake provides the following SQL function:
• SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER
select system$global_account_set_parameter('myaccount1',
'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');
• In addition, Snowflake provides historical usage data for all accounts in your organization via views in the
ORGANIZATION_USAGE schema in a shared database named SNOWFLAKE.
• For information, see Organization Usage.
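For example, organization-wide credit consumption can be queried from the shared SNOWFLAKE database, assuming access to the ORGANIZATION_USAGE views has been enabled for the account.

-- Daily credit usage across all accounts in the organization
SELECT usage_date,
       account_name,
       SUM(credits_used) AS credits_used
FROM snowflake.organization_usage.metering_daily_history
GROUP BY usage_date, account_name
ORDER BY usage_date, account_name;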
8. Accounts and Schema Objects
A database and a schema together form a namespace (database.schema).
Schema Objects
Stages
External Stages
1. Loading data from any of the following cloud storage services is supported regardless of the cloud
platform that hosts your Snowflake account:
• Amazon S3
• Google Cloud Storage
• Microsoft Azure
2. You cannot access data held in archival cloud storage classes that require restoration before the data
can be retrieved.
3. A named external stage is a database object created in a schema.
a. This object stores the URL to files in cloud storage,
b. the settings used to access the cloud storage account, and
c. format of staged files.
Internal Stages
4. Table Stage
5. User Stage
6. Named Stage
Named stages are database objects that provide the greatest degree of flexibility for data loading:
Named stages are optional but recommended when you plan regular data loads that could involve
multiple users and/or tables.
• A named internal stage is a database object created in a schema.
• This stage type can store files that are staged and managed by one or more users and loaded into
one or more tables.
• Because named stages are database objects, the ability to create, modify, use, or drop them can
be controlled using security access control privileges.
Create Stage Command
-- Internal stage
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ] <internal_stage_name>
internalStageParams
directoryTableParams
[ FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ COPY_OPTIONS = ( copyOptions ) ]
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
-- External stage
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ] <external_stage_name>
externalStageParams
directoryTableParams
[ FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ COPY_OPTIONS = ( copyOptions ) ]
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
internalStageParams ::=
[ ENCRYPTION = (TYPE = 'SNOWFLAKE_FULL' | TYPE = 'SNOWFLAKE_SSE') ]
copyOptions ::=
ON_ERROR = { CONTINUE | SKIP_FILE | SKIP_FILE_<num> | SKIP_FILE_<num>% | ABORT_STATEMENT }
SIZE_LIMIT = <num>
PURGE = TRUE | FALSE
RETURN_FAILED_ONLY = TRUE | FALSE
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
ENFORCE_LENGTH = TRUE | FALSE
TRUNCATECOLUMNS = TRUE | FALSE
FORCE = TRUE | FALSE
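A small end-to-end sketch using a named internal stage; the stage name, file format options, and local file path are placeholders, and PUT is run from a client such as SnowSQL.

-- Named internal stage with an attached file format
CREATE STAGE my_int_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Upload local files to the stage (from SnowSQL), then list the staged files
PUT file:///tmp/sales_2022.csv @my_int_stage;
LIST @my_int_stage;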
Compression of Staged Files
Encryption of Staged Files
Already-encrypted files (user-supplied key): Files that are already encrypted can be loaded into Snowflake
from external cloud storage; the key used to encrypt the files must be provided to Snowflake.
Platform Vs Stages
File Format
File format options specify the type of data contained in a file, as well as other related characteristics about
the format of the data.
The file format options you can specify are different depending on the type of data you plan to load.
Snowflake provides a full set of file format option defaults
Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into
relational tables without requiring the definition of a schema in advance.
Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted
data.
If file format options are specified in multiple locations, the load operation applies the options in the
following order of precedence:
1. COPY INTO TABLE statement.
2. Stage definition.
3. Table definition.
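A hedged example of a named file format for delimited text; the name and option values are illustrative.

CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  NULL_IF = ('NULL', '')
  COMPRESSION = GZIP;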
Tables
External Tables
https://docs.snowflake.com/en/user-guide/tables-external-intro.html#partitioned-external-tables
• External tables reference data files located in a cloud storage (Amazon S3, Google Cloud Storage, or
Microsoft Azure) data lake
• External tables store file-level metadata about the data files such as the file path, a version identifier, and
partitioning information.
• External tables can access data stored in any format supported by COPY INTO <table> statements.
• External tables are read-only, therefore no DML operations can be performed on them; however,
external tables can be used for query and join operations.
• Views can be created against external tables.
• Querying data stored external to the database is likely to be slower than querying native database
tables; however, materialized views based on external tables can improve query performance.
• VALUE : A VARIANT type column that represents a single row in the external file.
• METADATA$FILENAME : A pseudocolumn that identifies the name of each staged data file included in
the external table, including its path in the stage.
• METADATA$FILE_ROW_NUMBER : A pseudocolumn that shows the row number for each record in a
staged data file.
• To create external tables, you are only required to have some knowledge of the file
format and record format of the source data files. Knowing the schema of the data files
is not required.
• When queried, external tables cast all regular or semi-structured data to a variant in the
VALUE column.
• Create Table Command
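The original command is not reproduced here; the following is a representative sketch of an external table over partitioned CSV files, where the stage, path layout, and column expressions are assumptions.

CREATE OR REPLACE EXTERNAL TABLE sales_ext (
  -- partition column derived from the file path, e.g. sales/2022-12-01/part-001.csv
  sale_date DATE   AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2), 'YYYY-MM-DD'),
  -- second CSV field of each row, exposed from the VALUE variant
  amount    NUMBER AS (VALUE:c2::NUMBER)
)
PARTITION BY (sale_date)
LOCATION = @my_ext_stage/sales/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
AUTO_REFRESH = TRUE;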
Summary of Data Types
Bulk vs Continuous Loading
Bulk Loading Using the COPY Command
This option enables loading batches of data from files already available in cloud storage, or copying
(i.e. staging) data files from a local machine to an internal (i.e. Snowflake) cloud storage location before
loading the data into tables using the COPY command.
Compute Resources
Bulk loading relies on user-provided virtual warehouses, which are specified in the COPY statement. Users
are required to size the warehouse appropriately to accommodate expected loads.
Snowflake supports transforming data while loading it into a table using the COPY command. Options
include:
• Column reordering
• Column omission
• Casts
• Truncating text strings that exceed the target column length
There is no requirement for your data files to have the same number and ordering of columns as your target
table.
Continuous Loading Using Snowpipe
This option is designed to load small volumes of data (i.e. micro-batches) and incrementally make them
available for analysis. Snowpipe loads data within minutes after files are added to a stage and submitted for
ingestion. This ensures users have the latest results as soon as the raw data is available.
Compute Resources
Snowpipe uses compute resources provided by Snowflake (i.e. a serverless compute model). These
Snowflake-provided resources are automatically resized and scaled up or down as required, and are charged
and itemized using per-second billing. Data ingestion is charged based upon the actual workloads.
The COPY statement in a pipe definition supports the same COPY transformation options as when bulk
loading data.
In addition, data pipelines can leverage Snowpipe to continuously load micro-batches of data into staging
tables for transformation and optimization using automated tasks and the change data capture (CDC)
information in streams.
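A simplified sketch of that pattern, where Snowpipe lands data in a staging table and a task consumes the change stream; all object names, the warehouse, the schedule, and the transformation are assumptions.

-- Track changes on the staging table that Snowpipe loads into
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

-- Periodically move new rows into the target table, only when the stream has data
CREATE OR REPLACE TASK transform_events
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO events_final
  SELECT v:device:type::STRING, v:events[0].ts::TIMESTAMP_NTZ
  FROM raw_events_stream;

-- Tasks are created suspended and must be resumed to start running
ALTER TASK transform_events RESUME;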
Views
Secure Views
https://docs.snowflake.com/en/user-guide/views-secure.html
Materialized Views
1. A materialized view is a pre-computed data set derived from a query specification (the SELECT in the
view definition) and stored for later use.
2. Because the data is pre-computed, querying a materialized view is faster than executing a query against
the base table of the view.
3. This performance difference can be significant when a query is run frequently or is sufficiently complex.
4. As a result, materialized views can speed up expensive aggregation, projection, and selection operations,
especially those that run frequently and that run on large data sets.
5. Materialized views are designed to improve query performance for workloads composed of common,
repeated query patterns. However, materializing intermediate results incurs additional costs.
6. Materialized views are particularly useful when:
• Query results contain a small number of rows and/or columns relative to the base table (the table on
which the view is defined).
• Query results contain results that require significant processing, including:
o Analysis of semi-structured data.
o Aggregates that take a long time to calculate.
• The query is on an external table (i.e. data sets stored in files in an external stage), which might have
slower performance compared to querying native database tables.
• The view’s base table does not change frequently.
Compared with other options:
• Materialized views are more flexible than, but typically slower than, cached query results.
• Materialized views are faster than tables because of their “cache” (i.e. the query results for
the view); in addition, if data has changed, they can use their “cache” for data that hasn’t
changed and use the base table for any data that has changed.
8. We don’t need to specify a materialized view in a SQL statement in order for the view to be used. The
query optimizer can automatically rewrite queries against the base table or regular views to use the
materialized view instead. For example, suppose that a materialized view contains all of the rows and
columns that are needed by a query against a base table. The optimizer can decide to rewrite the query
to use the materialized view, rather than the base table. This can dramatically speed up a query,
especially if the base table contains a large amount of historical data.
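A minimal sketch of a materialized view over a large base table; the names are placeholders, and materialized views require Enterprise Edition or higher.

CREATE OR REPLACE MATERIALIZED VIEW daily_sales_mv AS
  SELECT sale_date,
         SUM(amount) AS total_amount,
         COUNT(*)    AS order_count
  FROM sales
  GROUP BY sale_date;

-- Queries against the base table may be rewritten by the optimizer to use the materialized view
SELECT sale_date, total_amount FROM daily_sales_mv WHERE sale_date >= '2022-12-01';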
Viewing Costs
• MATERIALIZED_VIEW_REFRESH_HISTORY table function (in the Snowflake Information Schema).
• MATERIALIZED_VIEW_REFRESH_HISTORY view (in Account Usage).
Prerequisites
[ PATTERN = '<regex_pattern>' ]
[ copyOptions ]
PATTERN = 'regex_pattern'
A regular expression pattern string, enclosed in single quotes, specifying the file
names and/or paths to match.
COPY_OPTIONS = ( ... )
ON_ERROR default: ABORT_STATEMENT for bulk data loading using COPY; SKIP_FILE for Snowpipe.
SIZE_LIMIT = num
Number (> 0) that specifies the maximum size (in bytes) of data to be loaded for a given COPY
statement.
For example, suppose a set of files in a stage path were each 10 MB in size. If multiple COPY
statements set SIZE_LIMIT to 25000000 (25 MB), each would load 3 files. That is, each COPY
operation would discontinue after the SIZE_LIMIT threshold was exceeded.
PURGE = TRUE | FALSE
Boolean that specifies whether to remove the data files from the stage automatically after the data is
loaded successfully.
Set PURGE=TRUE for the table to specify that all files successfully loaded into the table are purged
after loading:
alter table mytable set stage_copy_options = (purge = true);
You can also override any of the copy options directly in the COPY command:
copy into mytable purge = true;
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
String that specifies whether to load semi-structured data into columns in the target table that match
corresponding columns represented in the data.
ENFORCE_LENGTH = TRUE | FALSE
Alternative syntax for TRUNCATECOLUMNS with reverse logic (for compatibility with other systems).
Boolean that specifies whether to truncate text strings that exceed the target column length.
TRUNCATECOLUMNS = TRUE | FALSE
Alternative syntax for ENFORCE_LENGTH with reverse logic (for compatibility with other systems).
Boolean that specifies whether to truncate text strings that exceed the target column length.
FORCE = TRUE | FALSE
Boolean that specifies to load all files, regardless of whether they’ve been loaded previously and
have not changed since they were loaded. Note that this option reloads files, potentially duplicating
data in a table.
In the following example, the first command loads the specified files and the second command forces the
same files to be loaded again (producing duplicate rows), even though the contents of the files have not
changed:
copy into load1 from @%load1/data1/ files=('test1.csv', 'test2.csv');
LOAD_UNCERTAIN_FILES = TRUE | FALSE
Boolean that specifies to load files for which the load status is unknown. The COPY command skips
these files by default.
VALIDATION_MODE
Run the COPY command in validation mode and see all errors:
copy into mytable validation_mode = 'RETURN_ERRORS';
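Putting several of the options above together in one hedged COPY example; the stage, named file format, and pattern are placeholders.

COPY INTO mytable
  FROM @my_int_stage/data/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  PATTERN = '.*sales_.*[.]csv'
  ON_ERROR = 'SKIP_FILE'
  PURGE = TRUE;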
Snowpipe - Continuous Loading
Snowpipe enables loading data from files as soon as they’re available in a stage. This means you can load
data from files in micro-batches, making it available to users within minutes, rather than manually executing
COPY statements on a schedule to load larger batches.
A pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe. The
COPY statement identifies the source location of the data files (i.e., a stage) and a target table. All data types
are supported, including semi-structured data types such as JSON and Avro.
1. Automated data loads leverage event notifications for cloud storage to inform Snowpipe of the
arrival of new data files to load.
2. Snowpipe copies the files into a queue, from which they are loaded into the target table in a
continuous, serverless fashion based on parameters defined in a specified pipe object.
1. Client application calls a public REST endpoint with the name of a pipe object and a list of data
filenames.
2. If new data files matching the list are discovered in the stage referenced by the pipe object, they are
queued for loading.
3. Snowflake-provided compute resources load data from the queue into a Snowflake table based on
parameters defined in the pipe.
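A representative auto-ingest pipe definition; the stage, target table, and event notification setup are assumptions, and the cloud storage notification itself is configured outside Snowflake.

CREATE OR REPLACE PIPE events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @my_ext_stage/events/
  FILE_FORMAT = (TYPE = JSON);

-- Check the pipe definitions and the execution state of this pipe
SHOW PIPES;
SELECT SYSTEM$PIPE_STATUS('EVENTS_PIPE');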
Snowpipe vs Bulk Data Loading
Authentication
Bulk data load
Relies on the security options supported by the client for authenticating and
initiating a user session.
Snowpipe
When calling the REST endpoints: Requires key pair authentication with JSON
Web Token (JWT). JWTs are signed using a public/private key pair with RSA
encryption.
Load History
Bulk data load
Stored in the metadata of the target table for 64 days. Available upon
completion of the COPY statement as the statement output.
Snowpipe
Stored in the metadata of the pipe for 14 days. Must be requested from
Snowflake via a REST endpoint, SQL table function, or ACCOUNT_USAGE
view.
Transactions
Bulk data load
Loads are always performed in a single transaction. Data is inserted into table
alongside any other SQL statements submitted manually by users.
Snowpipe
Loads are combined or split into a single or multiple transactions based on the
number and size of the rows in each data file. Rows of partially loaded files
(based on the ON_ERROR copy option setting) can also be combined or split
into one or more transactions.
Compute Resources
Bulk data load
Requires a user-specified warehouse to execute COPY statements.
Snowpipe
Uses Snowflake-supplied compute resources (a serverless model).
Cost
Bulk data load
Billed for the amount of time each virtual warehouse is active.
Snowpipe
Billed according to the compute resources used while loading the files, plus an overhead for
managing files in the internal load queue.
1. Snowpipe is designed to load new data typically within a minute after a file notification is sent
2. In addition to resource consumption, an overhead to manage files in the internal load queue is
included in the utilization costs charged for Snowpipe. Snowpipe charges 0.06 credits per 1000
files queued.
3. Consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost and performance.
It is not always necessary to load data into Snowflake before executing queries.
4. External tables enable querying existing data stored in external cloud storage for analysis without
first loading it into Snowflake.
5. The source of truth for the data remains in the external cloud storage. Data sets materialized in
Snowflake via materialized views are read-only.
6. This solution is especially beneficial to accounts that have a large amount of data stored in
external cloud storage and only want to query a portion of the data; for example, the most recent
data. Users can create materialized views on subsets of this data for improved query
performance.
Data Loading Summary
● Two separate commands can be used to load data into Snowflake:
○ COPY
■ Bulk insert
■ Allows insert on a SELECT against a staged file, but a WHERE clause cannot be used
■ More performant
○ INSERT
■ Row-by-row insert
■ Allows insert on a SELECT against a staged file, and a WHERE clause can be used
■ Less performant
● Snowflake also offers a continuous data ingestion service, Snowpipe, to detect and load streaming
data:
○ Snowpipe loads data within minutes after files are added to a stage and submitted for
ingestion.
○ The service provides REST endpoints and uses Snowflake-provided compute resources to
load data and retrieve load history reports.
○ The service can load data from any internal (i.e. Snowflake) or external (i.e. AWS S3 or Microsoft
Azure) stage.
○ With Snowpipe’s serverless compute model, Snowflake manages load capacity, ensuring
optimal compute resources to meet demand. In short, Snowpipe provides a “pipeline” for
loading fresh data in micro-batches as soon as it’s available.
● To load data into Snowflake, the following must be in place:
○ A Virtual Warehouse
○ A pre-defined target table
○ A Staging location with data staged
○ A File Format
● Snowflake supports loading from the following file/data types:
○ Text Delimited (referenced as CSV in the UI)
○ JSON
○ XML
○ Avro
○ Parquet
○ ORC
● Data must be staged prior to being loaded, either in an Internal Stage (managed by Snowflake) or
an External Stage (self-managed) in AWS S3 or Azure Blob Storage
● As data is loaded:
○ Snowflake compresses the data and converts it into an optimized internal format for efficient
storage, maintenance, and retrieval.
○ Snowflake gathers various statistics for databases, tables, columns, and files and stores this
information in the Metadata Manager in the Cloud Services Layer for use in query optimization
● Working With Snowflake - Loading and Querying Self-Guided Learning Material
○ Summary of Data Loading Features
○ Getting Started - Introduction to Data Loading (Video)
○ Loading Data
○ Bulk Load Using COPY
○ COPY INTO <table>
○ INSERT
○ Getting Started - Introduction to Databases and Querying (Video)
○ Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
○ Processing JSON data in Snowflake (Video)
○ Queries
○ Analyzing Queries Using Query Profile
○ Using the History Page to Monitor Queries
Reference: https://community.snowflake.com/s/article/Performance-of-Semi-Structured-Data-Types-in-Snowflake
https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html#object
https://docs.snowflake.com/en/user-guide/querying-semistructured.html#sample-data-used-in-examples
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
• Data Loading Considerations
This set of topics provides best practices, general guidelines, and important considerations for bulk data
loading using the COPY INTO <table> command.
• The number of load operations that run in parallel cannot exceed the number of data files to be loaded.
To optimize the number of parallel operations for a load, we recommend aiming to produce data files
roughly 100-250 MB (or larger) in size compressed.
• Loading via Snowpipe can take significantly longer for very large files, or in cases where an unusual
amount of compute resources is necessary to decompress, decrypt, and transform the new data.
• In addition to resource consumption, an overhead to manage files in the internal load queue is included
in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files
queued for loading. Snowpipe charges 0.06 credits per 1000 files queued.
• Consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost (i.e. resources spent on Snowpipe queue management and the actual load)
and performance (i.e. load latency).
• UTF-8 is the default character set, however, additional encodings are supported.
• Organizing your data files by path lets you copy any fraction of the partitioned data into Snowflake with
a single command. This allows you to execute concurrent COPY statements that match a subset of files,
taking advantage of parallel operations.
• For example, if you were storing data for a North American company by geographical location, you
might include identifiers such as continent, country, and city in paths along with data write dates:
Canada/Ontario/Toronto/2016/07/10/05/
United_States/California/Los_Angeles/2016/06/01/11/
United_States/New York/New_York/2016/12/21/03/
United_States/California/San_Francisco/2016/08/03/17/
Loading Data - Best Practices
The COPY command supports several options for loading data files from a stage:
1. By path (internal stages) / prefix (Amazon S3 bucket). See Organizing Data by Path for information.
2. By specifying a discrete list of files to load (using the FILES parameter).
3. By pattern matching against the staged file names (using the PATTERN parameter).
• These options enable you to copy a fraction of the staged data into Snowflake with a single command.
This allows you to execute concurrent COPY statements that match a subset of files, taking advantage
of parallel operations.
• Of the three options for identifying/specifying data files to load from a stage, providing a discrete list of
files is generally the fastest; however, the FILES parameter supports a maximum of 1,000 files.
• Pattern matching using a regular expression is generally the slowest of the three options for
identifying/specifying data files to load from a stage; however, this option works well if you exported
your files in named order from your external application and want to batch load the files in the same
order.
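As a sketch of these three options (the stage, table, and file names are hypothetical):

-- 1. By path/prefix:
COPY INTO my_table FROM @my_stage/united_states/california/;

-- 2. By a discrete list of files (FILES supports at most 1,000 file names):
COPY INTO my_table FROM @my_stage
  FILES = ('2016/07/file1.csv.gz', '2016/07/file2.csv.gz');

-- 3. By pattern matching with a regular expression (generally the slowest option):
COPY INTO my_table FROM @my_stage
  PATTERN = '.*2016/07/.*[.]csv[.]gz';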
Executing Parallel COPY Statements That Reference the Same Data Files
• When a COPY statement is executed, Snowflake sets a load status in the table metadata
for the data files referenced in the statement. This prevents parallel COPY statements
from loading the same files into the table, avoiding data duplication.
• If one or more data files fail to load, Snowflake sets the load status for those files as load
failed. These files are available for a subsequent COPY statement to load.
When data from staged files is loaded successfully, consider removing the staged files to
ensure the data is not inadvertently loaded again (duplicated).
Staged files can be deleted from a Snowflake stage (user stage, table stage, or named stage)
using the following methods:
• Files that were loaded successfully can be deleted from the stage during a load
by specifying the PURGE copy option in the COPY INTO <table> command.
• After the load completes, use the REMOVE command to remove the files in
the stage.
Removing files ensures they aren’t inadvertently loaded again. It also improves load
performance, because it reduces the number of files that COPY commands must scan to
verify whether existing files in a stage were loaded already.
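A short sketch of both cleanup approaches (the stage, table, and file format names are hypothetical):

-- Option 1: purge successfully loaded files as part of the load itself.
COPY INTO my_table FROM @my_stage/daily/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  PURGE = TRUE;

-- Option 2: remove the files explicitly after the load completes.
REMOVE @my_stage/daily/ PATTERN = '.*[.]csv[.]gz';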
Preparing to Load Data
Data File Compression
We recommend that you compress your data files when you are loading large data sets.
See CREATE FILE FORMAT for the compression algorithms supported for each data type.
When loading compressed data, specify the compression method for your data files. The COMPRESSION
file format option describes how your data files are already compressed in the stage. Set the
COMPRESSION option in one of the following ways:
• As a file format option specified directly in the COPY INTO <table> statement.
• As a file format option specified for a named file format or stage object. The named file format/stage
object can then be referenced in the COPY INTO <table> statement.
Copy options determine the behavior of a data load with regard to error handling, maximum data size, and
so on. For descriptions of all copy options and the default values, see COPY INTO <table>.
You can specify the desired load behavior (i.e. override the default settings) in the COPY INTO <table>
statement itself, in a named stage definition, or in the table definition. If copy options are specified in
multiple locations, the load operation applies them in the following order of precedence: the COPY statement
overrides the stage definition, which overrides the table definition.
Semi-structured JSON
Snowflake supports creating named file formats, which are database objects that encapsulate all of the
required format information.
Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted
data.
You can define the file format settings for your staged data (i.e. override the default settings) in the COPY
INTO <table> statement, in a named stage definition, or in the table definition. If file format options are
specified in multiple locations, the load operation applies them in the same order of precedence: the COPY
statement overrides the stage definition, which overrides the table definition.
Snowflake maintains detailed metadata for each file uploaded into internal stage (for users, tables, and
stages), including:
• File name
• File size (compressed, if compression was specified during upload)
• LAST_MODIFIED date, i.e. the timestamp when the data file was initially staged or when it was last
modified, whichever is later
In addition, Snowflake retains historical data for COPY INTO commands executed within the previous 14
days. The metadata can be used to monitor and manage the loading process, including deleting files after
upload completes:
• Use the LIST command to view the status of data files that have been staged.
• Monitor the status of each COPY INTO <table> command on the History page of the Snowflake
web interface.
• Use the VALIDATE function to validate the data files you’ve loaded and retrieve any errors
encountered during the load.
• Use the LOAD_HISTORY Information Schema view to retrieve the history of data loaded into tables
using the COPY INTO command.
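A sketch of these monitoring options, assuming a hypothetical stage (my_stage), table (my_table), and database (my_db):

-- Status of staged files.
LIST @my_stage/daily/;

-- Errors (if any) from the most recent COPY INTO executed against the table.
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));

-- Load history for the table over the previous 14 days.
SELECT file_name, status, row_count, first_error_message
FROM my_db.INFORMATION_SCHEMA.LOAD_HISTORY
WHERE table_name = 'MY_TABLE'
ORDER BY last_load_time DESC;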
10. Data Unloading
Data Unloading
The process for unloading data into files is the same as the loading process, except in reverse:
1. Use the COPY INTO <location> command to copy the data from the Snowflake database table
into one or more files in a Snowflake or external stage.
2. Download the file(s) from the stage:
• From a Snowflake stage, use the GET command to download the data file(s).
• From S3, use the interfaces/tools provided by Amazon S3 to get the data file(s).
• From Azure, use the interfaces/tools provided by Microsoft Azure to get the data file(s).
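A minimal sketch of an unload, assuming a hypothetical internal stage (my_stage) and table (orders); the GET command is run from a client such as SnowSQL:

-- Unload a query result to the stage as compressed CSV files.
COPY INTO @my_stage/unload/orders_
  FROM (SELECT * FROM orders WHERE o_orderdate >= '2022-01-01')
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  OVERWRITE = TRUE;

-- Download the unloaded files to the local machine (SnowSQL).
GET @my_stage/unload/ file:///tmp/unload/;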
Semi-structured data is data that does not conform to the standards of traditional structured data, but it
contains tags or other types of mark-up that identify individual, distinct entities within the data.
Two of the key attributes that distinguish semi-structured data from structured data are nested data
structures and lack of a fixed schema
❖ Structured data requires a fixed schema that is defined before the data can be loaded and
queried in a relational database system. Semi-structured data does not require a prior definition
of a schema and can constantly evolve, i.e. new attributes can be added at any time.
❖ Unlike structured data, which represents data as a flat table, semi-structured data can contain n-
level hierarchies of nested information.
❖ Snowflake handles semi-structured data in much the same way it handles structured data: once
loaded (typically into a VARIANT column), it is stored in an optimized form and queried with SQL.
❖ Full SQL support, including joins and aggregations, is available over semi-structured data (see the
sketch below).
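A small sketch of this, using the car_sales sample from the Snowflake documentation (other names are hypothetical):

CREATE OR REPLACE TABLE car_sales (src VARIANT);

INSERT INTO car_sales
  SELECT PARSE_JSON('{"dealership":"Valley View Auto","salesperson":{"id":"55","name":"Frank Beasley"},"vehicle":[{"make":"Honda","price":"26000"}]}');

-- Path notation plus FLATTEN for nested arrays; joins and aggregates work as usual.
SELECT src:dealership::STRING       AS dealership,
       src:salesperson.name::STRING AS salesperson,
       v.value:make::STRING         AS make,
       AVG(v.value:price::NUMBER)   AS avg_price
FROM car_sales,
     LATERAL FLATTEN(input => src:vehicle) v
GROUP BY 1, 2, 3;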
Cloning
• A clone is writable and is independent of its source (i.e. changes made to the source or clone are not
reflected in the other object).
• Parameters that are explicitly set on a source database, schema, or table are retained in
any clones created from the source container or child objects.
• To create a clone, your current role must have the following privilege(s) on the source object:
o Tables : SELECT
o Pipes, Streams, Tasks : OWNERSHIP
o Other objects : USAGE
• For databases and schemas, cloning is recursive: all objects contained in the database or schema are
also cloned. However, the following object types are not cloned:
o External tables
o Internal (Snowflake) stages
• Cloning a table replicates the structure, data, and certain other properties (e.g. STAGE FILE FORMAT ) of
the source table.
• The CREATE TABLE … CLONE syntax includes the COPY GRANTS keywords. If the COPY GRANTS
keywords are used, then the new object inherits any explicit access privileges granted on the original
table but does not inherit any future grants
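For example (object names hypothetical):

-- Zero-copy clone of a table, preserving explicit grants on the source table.
CREATE TABLE sales_dev CLONE sales COPY GRANTS;

-- Clone an entire database; external tables and internal stages inside it are not cloned.
CREATE DATABASE analytics_dev CLONE analytics;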
When queried, external tables cast all regular or semi-structured data to a variant in the
VALUE column.
• You can apply the masking policy to one or more table/view columns with the matching data type
• While Snowflake offers secure views to restrict access to sensitive data, secure views present
management challenges due to large numbers of views and derived business intelligence (BI) dashboards
from each view.
• Masking policies support segregation of duties (SoD) through the role separation of policy administrators
from object owners.
o Object owners (i.e. the role that has the OWNERSHIP privilege on the object) do not have the
privilege to unset masking policies.
o Object owners cannot view column data in which a masking policy applies.
• Conditional masking uses a masking policy to selectively protect the column data in a table or view
based on the values in one or more different columns.
• Snowflake supports nested masking policies, such as a masking policy on a table and a masking policy on
a view for the same table.
• You cannot apply a masking policy to:
o Shared objects.
o Materialized views (MV)
o Virtual columns.
o External tables.
• Masking policies on columns in a table carry over to a stream on the same table.
• Cloning a schema results in the cloning of all masking policies within the schema.
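A minimal sketch of a masking policy (role, table, and column names are hypothetical):

-- Mask email addresses for every role except ANALYST.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST') THEN val
    ELSE '*********'
  END;

-- Apply the policy to a column with a matching data type.
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;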
Snowpipe
Snowpipe enables loading data from files as soon as they’re available in a stage. This means you can load
data from files in micro-batches, making it available to users within minutes, rather than manually executing
COPY statements on a schedule to load larger batches.
A pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe. The
COPY statement identifies the source location of the data files (i.e., a stage) and a target table. All data types
are supported, including semi-structured data types such as JSON and Avro.
Automating Snowpipe using cloud messaging:
1. Automated data loads leverage event notifications for cloud storage to inform Snowpipe of the
arrival of new data files to load.
2. Snowpipe copies the files into a queue, from which they are loaded into the target table in a
continuous, serverless fashion based on parameters defined in a specified pipe object.
Calling Snowpipe REST endpoints:
1. A client application calls a public REST endpoint with the name of a pipe object and a list of data
filenames.
2. If new data files matching the list are discovered in the stage referenced by the pipe object, they are
queued for loading.
3. Snowflake-provided compute resources load data from the queue into a Snowflake table based on
parameters defined in the pipe.
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#how-is-snowpipe-different-from-bulk-data-loading
• Snowpipe is designed to load new data typically within a minute after a file notification is sent.
• In addition to resource consumption, an overhead to manage files in the internal load queue is
included in the utilization costs charged for Snowpipe. Snowpipe charges 0.06 credits per 1000 files
queued.
• Consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost and performance.
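A sketch of an auto-ingest pipe (stage, database, and table names are hypothetical; the cloud storage event notifications are assumed to be configured separately):

CREATE OR REPLACE PIPE my_db.public.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO orders
  FROM @my_ext_stage/orders/
  FILE_FORMAT = (TYPE = JSON);

-- Check the pipe, including the notification channel to configure in cloud storage.
SELECT SYSTEM$PIPE_STATUS('my_db.public.orders_pipe');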
Streams
• A stream object records data manipulation language (DML) changes made to a table, including inserts,
updates, and deletes, as well as metadata about each change, so that actions can be taken using the
changed data.
• When a stream object is created, it takes an initial snapshot of every row in the source table by
initializing a point in time (called an offset) as the current transactional version of the table.
• When the stream is consumed, the offset moves forward so that it captures only new data changes.
• Change information mirrors the column structure of the tracked source table and includes additional
metadata columns that describe each change event: METADATA$ACTION, METADATA$ISUPDATE, and
METADATA$ROW_ID.
• Updates to rows in the source table are represented as a pair of DELETE and INSERT records in the
stream, with METADATA$ISUPDATE set to TRUE.
• METADATA$ROW_ID specifies the unique row ID, which is used to track changes to a row over time.
• A stream stores only an offset for the source table, not any actual table data. Therefore, you can
create any number of streams for a table without incurring significant cost.
• When a stream is created for a table, a pair of hidden columns is added to the source table and begins
to store change-tracking metadata.
• The CDC records returned when querying a stream rely on a combination of the offset stored in the
stream and the change-tracking metadata stored in the table.
• When a stream is dropped and recreated using CREATE OR REPLACE, it loses all of its tracking history
(its offset).
• Multiple queries (SELECT) can independently read the same data from a stream without changing the
offset.
• To ensure multiple statements access the same change records in the stream, surround them with an
explicit transaction statement (BEGIN ... COMMIT).
• Types of Streams
Standard
A standard table stream tracks all DML changes to the source table, including inserts, updates, and
deletes (including table truncates).
Append-only
An append-only table stream tracks row inserts only. Update and delete operations (including
table truncates) are not recorded.
• A stream leverages the Time Travel feature of the source table to enable CDC; if the stream's offset
falls outside the table's data retention period, the stream becomes stale.
• A stream becomes stale if it is not consumed within the Time Travel (data retention) period. If the data
retention period for a source table is less than 14 days and the stream has not been consumed,
Snowflake temporarily extends this period to prevent the stream from going stale.
• The period is extended to the stream's offset, up to a maximum of 14 days by default, regardless of
the data retention period configured for the table.
• Extending the data retention period requires additional storage, which is reflected in your monthly
storage charges.
• To determine whether a stream has become stale, execute the DESCRIBE STREAM or SHOW STREAMS
command.
• Renaming a source table does not break a stream or cause it to go stale. In addition, if a table is
dropped and a new table is created with the same name, any streams linked to the original table are
not linked to the new table.
• You can clone a stream. The clone of a stream inherits the current offset.
• When a database or schema that contains a source table and its stream is cloned, any unconsumed
records in the cloned stream are inaccessible.
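A sketch of creating and consuming a stream (table and column names are hypothetical):

CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

-- Querying the stream shows change records but does not advance the offset.
SELECT * FROM orders_stream;

-- Consuming the stream in DML advances the offset; an explicit transaction ensures
-- all statements in it see the same change records.
BEGIN;
INSERT INTO orders_history (order_id, amount)
  SELECT order_id, amount
  FROM orders_stream
  WHERE metadata$action = 'INSERT';
COMMIT;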
Tasks
• A Snowflake task in simple terms is a scheduler that can help you to schedule a single SQL or a
stored procedure.
• Tasks can be combined with table streams for continuous ELT workflows to process recently changed
table rows.
• Tasks can also be used independently to generate periodic reports by inserting or merging rows into a
report table, or to perform other periodic work.
• Tasks require compute resources to execute SQL code. Either Snowflake-managed compute (i.e. the
serverless compute model) or user-managed compute (i.e. a virtual warehouse) can be chosen for
individual tasks.
• With the serverless compute model, compute resources for tasks are automatically resized and scaled
up or down by Snowflake based on a dynamic analysis of previous runs of the same task.
• There is no event source that can trigger a task; instead, a task runs on a schedule, which can be
defined when the task is created (using CREATE TASK) or later (using ALTER TASK).
• A scheduled task runs according to the specified cron expression in the local time for a given time
zone.
• Users can define a simple tree-like structure of tasks that starts with a root task and is linked together
by task dependencies.
• A predecessor task can be defined when creating a task (using CREATE TASK ... AFTER) or later (using
ALTER TASK ... ADD AFTER). All tasks in a tree must be stored in the same database and schema.
• If the predecessor for a child task is removed, then the former child task becomes either a standalone
task or a root task, depending on whether other tasks identify it as their predecessor.
• To modify or recreate any child task in a tree of tasks, the root task must first be suspended.
• When the root task is suspended, all future scheduled runs of the root task are cancelled; however, if
any tasks are currently running then these tasks and any descendent tasks continue to run.
• In addition to the task owner, a role that has the OPERATE privilege on the task can suspend or
resume the task. This role must have the USAGE privilege on the database and schema that contain
the task.
• When the owner role of a given task (i.e. the role with the OWNERSHIP privilege on the task) is
deleted, the task is “re-possessed” by the role that dropped the owner role. This ensures that
ownership moves to a role that is closer to the root of the role hierarchy.
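A sketch of a small task tree using a user-managed warehouse (all object names are hypothetical):

CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO orders_staging SELECT * FROM orders_raw;

CREATE OR REPLACE TASK transform_orders_task
  WAREHOUSE = etl_wh
  AFTER load_orders_task
AS
  MERGE INTO orders o USING orders_staging s ON o.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET o.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

-- Tasks are created suspended; resume child tasks before the root task.
ALTER TASK transform_orders_task RESUME;
ALTER TASK load_orders_task RESUME;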
2. Stored procedures are loosely similar to functions. As with functions, a stored procedure is created
once and can be executed many times. A stored procedure is created with a CREATE PROCEDURE
command and executed with a CALL command.
3. A stored procedure returns a single value. Although you can run SELECT statements inside a stored
procedure, the results must be used within the stored procedure, or be narrowed to a single value to
be returned.
Example: Suppose that you want to clean up a database by deleting data older than a specified date.
You can write multiple DELETE statements, each of which deletes data from one specific table. You
can put all of those statements in a single stored procedure and pass a parameter that specifies the
cut-off date. Then you can simply call the procedure to clean up the database.
4. Stored procedures written in JavaScript use the JavaScript API, which consists of the following objects:
1. snowflake, which has methods to create a Statement object and execute a SQL command.
2. Statement, which helps you execute prepared statements and access metadata for those
prepared statements (e.g. the column data types).
3. ResultSet, which holds the results of a query (e.g. the rows of data retrieved for a SELECT
statement).
5. When calling, using, and getting values back from stored procedures, you often need to convert from
SQL data types to JavaScript data types and vice versa.
6. Snowflake supports overloading of stored procedure names. Multiple stored procedures in the same
schema can have the same name, as long as their signatures differ, either by the number of
arguments or the argument types.
7. Stored procedures are not atomic; if one statement in a stored procedure fails, the other statements
in the stored procedure are not necessarily rolled back. You can use stored procedures with
transactions to make a group of statements atomic.
Stored procedures are useful for, among other things:
1. Procedural logic (branching and looping), which straight SQL does not support.
2. Error handling.
3. Dynamically creating a SQL statement and executing it.
4. Writing code that executes with the privileges of the role that owns the procedure, rather than with
the privileges of the role that runs the procedure. This allows the stored procedure owner to
delegate the power to perform specified operations to users who otherwise could not do so.
CALL MyStoredProcedure1(argument_1);
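The cleanup scenario above could look roughly like this as a Snowflake Scripting procedure (the procedure, table, and column names are hypothetical):

CREATE OR REPLACE PROCEDURE purge_old_rows(cutoff DATE)
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
  deleted NUMBER DEFAULT 0;
BEGIN
  DELETE FROM event_log WHERE event_date < :cutoff;
  deleted := SQLROWCOUNT;
  RETURN deleted;
END;
$$;

CALL purge_old_rows(TO_DATE('2021-01-01'));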
Calling Functions
2. You can call a stored procedure inside another stored procedure; the JavaScript in the outer stored
procedure can retrieve and store the output of the inner stored procedure. Remember, however,
that the outer stored procedure (and each inner stored procedure) is still unable to return more than
one value.
3. You can call the stored procedure and then call the RESULT_SCAN function, passing it the
statement ID generated for the stored procedure.
4. You can store a result set in a temporary or permanent table, and use that table after returning
from the stored procedure call.
5. If the volume of data is not too large, you can store multiple rows and multiple columns in a VARIANT
(for example, as a JSON value) and return that VARIANT.
6. A stored procedure runs with either the caller’s rights or the owner’s rights; it cannot run with the
privileges of both.
7. A caller’s rights stored procedure runs with the privileges of the caller. The primary advantage of a
caller’s rights stored procedure is that it can access information about that caller or about the caller’s
current session. For example, a caller’s rights stored procedure can read the caller’s session variables
and use them in a query.
The primary advantage of an owner’s rights stored procedure is that the owner can delegate specific
administrative tasks, such as cleaning up old data, to another role without granting that role more
general privileges, such as privileges to delete all data from a specific table.
1. User-defined functions (UDFs) let you extend the system to perform operations that are not
available through the built-in, system-defined functions provided by Snowflake. Snowflake currently
supports the following languages for writing UDFs:
• JavaScript: A JavaScript UDF lets you use the JavaScript programming language to
manipulate data and return either scalar or tabular results.
• SQL: A SQL UDF evaluates an arbitrary SQL expression and returns either scalar or tabular
results.
• Java: A Java UDF lets you use the Java programming language to manipulate data and return
either scalar or tabular results.
2. UDFs may be scalar or tabular.
• A scalar function returns one output row for each input row. The returned row consists of a
single column/value.
• A tabular function, also called a table function, returns zero, one, or multiple rows for each
input row.
3. To avoid conflicts when calling functions, Snowflake does not allow creating UDFs with the same
name as any of the system-defined functions.
4. Snowflake supports overloading of SQL UDF names. Multiple SQL UDFs in the same schema can
have the same name, as long as their argument signatures differ, either by the number of arguments
or the argument types.
5. For security or privacy reasons, you might not wish to expose the underlying tables or algorithmic
details for a UDF. With secure UDFs, the definition and details are visible only to authorized users
(i.e. users who are granted the role that owns the UDF).
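For example, a scalar SQL UDF and its secure counterpart (the function names are hypothetical):

CREATE OR REPLACE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS
$$
  PI() * radius * radius
$$;

-- A secure UDF hides the definition and details from unauthorized users.
CREATE OR REPLACE SECURE FUNCTION area_of_circle_secure(radius FLOAT)
RETURNS FLOAT
AS
$$
  PI() * radius * radius
$$;

SELECT area_of_circle(3);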
External Function
Calling AWS Lambda service from Snowflake external function using API Integration Object
1. An external function calls code that is executed outside Snowflake.
2. The remotely executed code is known as a remote service.
3. Information sent to a remote service is usually relayed through a proxy service.
4. Snowflake stores security-related external function information in an API integration.
An external function is a type of UDF. Unlike other UDFs, an external function does not
contain its own code; instead, the external function calls code that is stored and executed
outside Snowflake.
Inside Snowflake, the external function is stored as a database object that contains
information that Snowflake uses to call the remote service. This stored information includes
the URL of the proxy service that relays information to and from the remote service.
Remote Service:
The remote service (for example, an AWS Lambda function) must be configured to trust Snowflake. The
values needed to configure that trust are returned by DESCRIBE API INTEGRATION, for example:
API_AWS_IAM_USER_ARN: arn:aws:iam::841748965781:user/u5at-s-insa6850
API_AWS_EXTERNAL_ID: WZ00864_SFCRole=3_mQPCuXPCeS072Ucv5HIeoiEMu68=
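A rough sketch of the Snowflake-side objects; the role ARN, proxy URLs, and object names below are placeholders, and the AWS Lambda/API Gateway setup is assumed to exist:

CREATE OR REPLACE API INTEGRATION my_api_integration
  API_PROVIDER = AWS_API_GATEWAY
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-api-gateway-role'
  API_ALLOWED_PREFIXES = ('https://example.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- Returns API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID for the AWS trust policy.
DESCRIBE API INTEGRATION my_api_integration;

CREATE OR REPLACE EXTERNAL FUNCTION remote_echo(msg STRING)
  RETURNS VARIANT
  API_INTEGRATION = my_api_integration
  AS 'https://example.execute-api.us-east-1.amazonaws.com/prod/echo';

SELECT remote_echo('hello');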
13. Snowflake Scripting
Snowflake Scripting is an extension to Snowflake SQL that adds support for procedural logic.
You can use Snowflake Scripting 1) to write stored procedures and 2) procedural code outside of a
stored procedure.
Example
declare
radius_of_circle float;
area_of_circle float;
begin
radius_of_circle := 3;
area_of_circle := pi() * radius_of_circle * radius_of_circle;
return area_of_circle;
end;
declare
profit number(38, 2) default 0.0;
cost number(38, 2) default 100.0;
revenue number(38, 2) default 110.0;
begin
profit := revenue - cost;
return profit;
end;
The IF statement has the following syntax:
IF ( <condition> ) THEN
-- Statements to execute if the condition is true.
[ ELSEIF ( <condition> ) THEN
-- Statements to execute if this condition is true.
]
[ ELSE
-- Statements to execute if none of the conditions is true.
]
END IF;
Example
begin
let count := 1;
if (count < 0) then
return 'negative value';
elseif (count = 0) then
return 'zero';
else
return 'positive value';
end if;
end;
declare
total_price float;
c1 cursor for select price from invoices;
begin
total_price := 0.0;
for record in c1 do
total_price := total_price + record.price;
end for;
return total_price;
end;
• You can use a cursor to iterate through query results one row at a time. To retrieve data from the results of a
query, use a cursor. You can use a cursor in loops to iterate over the rows in the results.
1) In the DECLARE section, declare the cursor.The declaration includes the query for the cursor.
2) Execute the OPEN command to open the cursor. This executes the query and loads the results into the
cursor.
3) Execute the FETCH command to fetch one or more rows and process those rows.
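A minimal sketch of this OPEN/FETCH pattern, reusing the hypothetical invoices table from the example above:

DECLARE
  min_price FLOAT;
  c1 CURSOR FOR SELECT price FROM invoices ORDER BY price;
BEGIN
  OPEN c1;                 -- executes the query and loads the results into the cursor
  FETCH c1 INTO min_price; -- fetches the first row
  CLOSE c1;
  RETURN min_price;
END;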
1. Snowflake Scripting raises an exception if an error occurs while executing a statement (e.g. if a
statement attempts to DROP a table that doesn’t exist).
2. An exception prevents the next lines of code from executing.
3. In a Snowflake Scripting block, you can write exception handlers that catch specific types of
exceptions declared in that block and in blocks nested inside that block.
4. In addition, for errors that can occur in your code, you can define your own exceptions that you
can raise when errors occur.
declare
my_exception exception (-20002, 'Raised MY_EXCEPTION.');
begin
let counter := 0;
let should_raise_exception := true;
if (should_raise_exception) then
raise my_exception;
end if;
counter := counter + 1;
return counter;
end;
14. Storage - Understanding Micropartitions
Micro partitions
1. All data in Snowflake is stored in database tables, logically structured as collections of columns and
rows. It is helpful to have an understanding of the physical structure behind the logical structure.
2. Traditional data warehouses rely on static partitioning of large tables to achieve acceptable
performance and enable better scaling. However, static partitioning has a number of well-known
limitations, such as maintenance overhead.
3. The Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-
partitioning, that delivers all the advantages of static partitioning without the known limitations, as
well as providing additional benefits.
4. All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units
of storage.
5. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the
actual size in Snowflake is smaller because data is always stored compressed).
6. Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar
fashion.
7. This size and structure allows for extremely granular pruning of very large tables, which can be
comprised of millions, or even hundreds of millions, of micro-partitions.
10. All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of the underlying micro-
partition metadata. Some operations, such as deleting all rows from a table, are metadata-only
operations.
Query Pruning
11. The micro-partition metadata maintained by Snowflake enables precise pruning of columns in micro-
partitions at query run-time, including columns containing semi-structured data.
12. Snowflake uses columnar scanning of partitions so that an entire partition is not scanned if a query
only filters by one column.
13. Snowflake does not prune micro-partitions based on a predicate with a subquery, even if the
subquery results in a constant
15. Clustering
Data Clustering
14. In general, Snowflake produces well-clustered data in tables; however, over time, particularly as DML
occurs on very large tables (as defined by the amount of data in the table, not the number of rows),
the data in some table rows might no longer cluster optimally on desired dimensions.
15. Typically, data stored in tables is sorted/ordered along natural dimensions (e.g. date and/or
geographic regions).
16. This “clustering” is a key factor in queries because table data that is not sorted or is only partially
sorted may impact query performance, particularly on very large tables.
17. To improve the clustering of the underlying table micro-partitions, you can always manually sort
rows on key table columns and re-insert them into the table; however, performing these tasks could
be cumbersome and expensive.
18. Instead, Snowflake supports automating these tasks by designating one or more table
columns/expressions as a clustering key for the table. A table with a clustering key defined is
considered to be clustered.
19. In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected and recorded
for each micro-partition created during the process.
20. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-
partitions during querying, significantly accelerating the performance of queries that reference these
columns.
21. Snowflake maintains clustering metadata for the micro-partitions in a table, including:
• The total number of micro-partitions that comprise the table.
• The number of micro-partitions containing values that overlap with each other (in a specified
subset of table columns).
• The depth of the overlapping micro-partitions.
Clustering Depth
22. The clustering depth for a populated table measures the average depth ( 1 or greater) of the
overlapping micro-partitions for specified columns in a table. The smaller the average depth, the
better clustered the table is with regards to the specified columns.
23. Clustering depth can be used for a variety of purposes, including:
o Monitoring the clustering “health” of a large table, particularly over time as DML is performed
on the table.
o Determining whether a large table would benefit from explicitly defining a clustering key.
o A table with no micro-partitions (i.e. an unpopulated/empty table) has a clustering depth of 0 .
24. The clustering depth for a table is not an absolute or precise measure of whether the table is well-
clustered. Ultimately, query performance is the best indicator of how well-clustered a table is:
o If queries on a table are performing as needed or expected, the table is likely well-clustered.
o If query performance degrades over time, the table is likely no longer well-clustered and may
benefit from clustering.
Clustering Keys
25. A clustering key is one or more table columns/expressions explicitly designated to co-locate the table
data in the same micro-partitions, automating the re-sorting described above (see points 14, 17, and
18). A table with a clustering key defined is considered to be clustered.
• Note: Clustering keys are not intended for all tables. The size of a table, as well as the query
performance for the table, should dictate whether to define a clustering key for the table.
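For example (table and column names hypothetical):

-- Designate a clustering key for the table.
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Average overlap depth for the clustering key columns (smaller is better clustered).
SELECT SYSTEM$CLUSTERING_DEPTH('sales');

-- Detailed clustering information, including total and overlapping micro-partitions.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');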
16. Caching
Using Persisted Query Results
1. When a query is executed, the result is persisted (i.e. cached) for a period of time. At the end of the time
period, the result is purged from the system.
2. If a user repeats a query that has already been run, and the data in the table(s) hasn’t changed since the
last time that the query was run, then the result of the query is the same.
Typically, query results are reused if all of the following conditions are met: the new query syntactically
matches the previously executed query, the underlying table data has not changed, and the persisted result
is still available (see Using Persisted Query Results in the Snowflake documentation for the complete list of
conditions).
13. Each time the persisted result for a query is reused, Snowflake resets the 24-hour retention period for
the result, up to a maximum of 31 days from the date and time that the query was first executed. After
31 days, the result is purged and the next time the query is submitted, a new result is generated and
persisted.
For example, the output of a SHOW command (which is also persisted) can be post-processed with
RESULT_SCAN:
show tables;
select * from table(result_scan(last_query_id())) where "rows" = 0;
1. Use the auto-refresh checkbox in the upper right to enable/disable auto-refresh for the session. If
selected, the page is refreshed every 10 seconds. You can also click the Refresh icon to refresh the
display at any time.
2. Use the Include client-generated statements checkbox to show or hide SQL statements run by web
interface sessions outside of SQL worksheets.
3. Use the Include queries executed by user tasks checkbox to show or hide SQL statements executed
or stored procedures called by user tasks.
4. Click any column header to sort the page by the column or add/remove columns in the display.
5. Click the text of a query (or select the query and click View SQL) to view the full SQL for the query.
6. Select a query that has not yet completed and click Abort to abort the query.
7. Click the ID for a query to view the details for the query, including the result of the query and the
Query Profile.
Query History Results
8. Snowflake persists the result of a query for a period of time, after which the result is purged. This
limit is not adjustable.
9. To view the details and result for a particular query, click the Query ID in the History page. The
Query Detail page appears (see below), where you can view query execution details, as well as the
query result (if still available).
10. You can also use the Export Result button to export the result of the query (if still available) to a file.
11. You can view results only for queries you have executed. If you have privileges to view queries
executed by another user, the Query Detail page displays the details for the query, but, for data
privacy reasons, the page does not display the actual query result.
Export Results
12. On any page in the interface where you can view the result of a query (e.g. Worksheets, Query
Detail), if the query result is still available, you can export the result to a file.
13. When you click the Export Result button for a query, you are prompted to specify the file name and
format. Snowflake supports the following file formats for query export: comma-separated values
(CSV) and tab-separated values (TSV).
14. You can export results only for queries for which you can view the results (i.e. queries you’ve
executed). If you didn’t execute a query or the query result is no longer available, the Export
Result button is not displayed for the query.
15. The web interface only supports exporting results up to 100 MB in size. If a query result exceeds this
limit, you are prompted whether to proceed with the export.
16. The export prompts may differ depending on your browser. For example, in Safari, you are prompted
only for an export format (CSV or TSV). After the export completes, you are prompted to download
the exported result to a new window, in which you can use the Save Page As… browser option to
save the result to a file.
Viewing Query Profile
In addition to query details and results, Snowflake provides the Query Profile for analyzing query statistics
and details, including the individual execution components that comprise the query. For more information,
see Analyzing Queries Using Query Profile.
1. Query Profile, available through the Snowflake web interface, provides execution details for a query.
2. For the selected query, it provides a graphical representation of the main components of the
processing plan for the query, with statistics for each component, along with details and statistics for
the overall query.
3. It can be used whenever you want or need to know more about the performance or behavior of a
particular query.
4. It is designed to help you spot typical mistakes in SQL query expressions to identify potential
performance bottlenecks and improvement opportunities.
5. Query Profile is accessed by clicking a query ID, specifically in the Worksheets and History pages of
the web interface.
6. Run the below query
SELECT SUM(O_TOTALPRICE)
FROM ORDERS , LINEITEM
WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY
AND O_TOTALPRICE > 300
AND L_QUANTITY < (select avg(L_QUANTITY) from LINEITEM);
4. ExternalScan : Represents access to data stored in stage objects. Can be a part of queries that scan data
from stages directly, but also for data-loading COPY queries. Attributes:
• Stage name — the name of the stage where the data is read from.
• Stage type — the type of the stage (e.g. TABLE STAGE).
5. InternalObject : Represents access to an internal data object (e.g. an Information Schema table or the
result of a previous query). Attributes:
8. Aggregate : Groups input and computes aggregate functions. Can represent SQL constructs such as
GROUP BY, as well as SELECT DISTINCT. Attributes:
• Grouping Keys — if GROUP BY is used, this lists the expressions we group by.
• Aggregate Functions — list of functions computed for each aggregate group, e.g. SUM.
9. GroupingSets : Represents constructs such as GROUPING SETS, ROLLUP and CUBE. Attributes:
• Grouping Key Sets — list of grouping sets
• Aggregate Functions — list of functions computed for each group, e.g. SUM.
12. SortWithLimit : Produces a part of the input sequence after sorting, typically a result of
an ORDER BY ... LIMIT ... OFFSET ... construct in SQL. Attributes:
• Sort keys — expression defining the sorting order.
• Number of rows — number of rows produced.
• Offset — position in the ordered sequence from which produced tuples are emitted.
13. Flatten : Processes VARIANT records, possibly flattening them on a specified path. Attributes:
• input — the input expression used to flatten the data.
14. JoinFilter : Special filtering operation that removes tuples that can be identified as not possibly matching
the condition of a Join further in the query plan. Attributes:
• Original join ID — the join used to identify tuples that can be filtered out.
Delete : Represents deletion of records from a table. Attributes:
• Table name — the name of the table that records are deleted from.
4. Unload : Represents a COPY operation that exports data from a table into a file in a stage. Attributes:
Some queries include steps that are pure metadata/catalog operations rather than data-processing
operations. These steps consist of a single operator. Some examples include:
1. DDL and Transaction Commands : Used for creating or modifying objects, session, transactions, etc.
Typically, these queries are not processed by a virtual warehouse and result in a single-step profile
that corresponds to the matching SQL statement. For example:
Miscellaneous Operators
Result :Returns the query result. Attributes:
• List of expressions - the expressions produced.
Query/Operator Details
To help you analyze query performance, the detail panel provides two classes of profiling
information:
• Execution time, broken down into categories.
• Detailed statistics.
In addition, attributes are provided for each operator (described in Operator Types in this topic).
Execution Time
Execution time provides information about “where the time was spent” during the processing of a query.
Time spent can be broken down into the following categories, displayed in the following order: Processing,
Local Disk IO, Remote Disk IO, Network Communication, Synchronization, and Initialization.
Statistics
A major source of information provided in the detail panel is the various statistics, grouped in sections
such as IO, DML, Pruning, Spilling, Network, and External Functions:
• Spilling — information about disk usage for operations where intermediate results do not
fit in memory:
• Bytes spilled to local storage — volume of data spilled to local disk.
• Bytes spilled to remote storage — volume of data spilled to remote disk.
If the value of a field, for example “Retries due to transient errors”, is zero, then
the field is not displayed.
Common Query Problems Identified by Query Profile
“Exploding” Joins
One of the common mistakes SQL users make is joining tables without providing a join condition (resulting
in a “Cartesian Product”), or providing a condition where records from one table match multiple records from
another table. For such queries, the Join operator produces significantly (often by orders of magnitude)
more tuples than it consumes.
This can be observed by looking at the number of records produced by a Join operator, and typically is also
reflected in the Join operator consuming a lot of time.
UNION Without ALL
In SQL, it is possible to combine two sets of data with either UNION or UNION ALL constructs. The
difference between them is that UNION ALL simply concatenates inputs, while UNION does the same, but
also performs duplicate elimination.
A common mistake is to use UNION when the UNION ALL semantics are sufficient. These queries show in
Query Profile as a UnionAll operator with an extra Aggregate operator on top (which performs duplicate
elimination).
Queries Too Large to Fit in Memory
For some operations (e.g. duplicate elimination for a huge data set), the amount of memory available for the
compute resources used to execute the operation might not be sufficient to hold intermediate results. As a
result, the query processing engine will start spilling the data to local disk. If the local disk space is not
sufficient, the spilled data is then saved to remote disks.
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling).
To alleviate this, we recommend:
• Using a larger warehouse (effectively increasing the available memory/local disk space for the
operation), and/or
• Processing data in smaller batches.
Inefficient Pruning
Snowflake collects rich statistics on data allowing it not to read unnecessary parts of a table based on the
query filters. However, for this to have an effect, the data storage order needs to be correlated with the
query filter attributes.
The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total statistics in
the TableScan operators. If the former is a small fraction of the latter, pruning is efficient. If not, the pruning
did not have an effect.
Of course, pruning can only help for queries that actually filter out a significant amount of data. If the
pruning statistics do not show data reduction, but there is a Filter operator above TableScan which filters
out a number of records, this might signal that a different data organization might be beneficial for this
query.
Resource monitors can be used to impose limits on the number of credits that are consumed by
user-managed virtual warehouses and the virtual warehouses used by cloud services.
Limits can be set for a specified interval or date range. When these limits are reached
and/or are approaching, the resource monitor can trigger various actions, such as sending
alert notifications and/or suspending the warehouses.
Resource monitors can only be created by account administrators. However, account
administrators can choose to enable users with other roles to view and modify resource
monitors using SQL.
Credit Quota
Credit quota specifies the number of Snowflake credits allocated to the monitor for the
specified frequency interval. Any number can be specified.
Summary
Snowflake provides Resource Monitors to help control costs and avoid unexpected credit
usage related to using Warehouses
Can be used to impose limits on the number of credits that Warehouses consume within
each monthly billing period.
When these limits are approached and/or reached, the Resource Monitor can trigger various
actions, such as sending alert notifications and suspending the Warehouses.
Resource Monitors can only be created by account administrators (i.e. users with the
ACCOUNTADMIN role); however, account administrators can choose to enable users with
other roles to view and modify resource monitors.
Notification
When a resource monitor reaches a defined threshold, it can perform one of the following actions: Notify
(send an alert notification but take no other action), Notify & Suspend (send a notification and suspend the
assigned warehouses after running statements complete), or Notify & Suspend Immediately (send a
notification and suspend the warehouses, cancelling any running statements). A sketch of creating a
resource monitor with these triggers follows.
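The warehouse and monitor names below are hypothetical, and the statements assume the ACCOUNTADMIN role:

CREATE OR REPLACE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND
           ON 110 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_limit;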
19. Data sharing
Secure Data Sharing enables sharing selected objects in a database in your account with
other Snowflake accounts. The following Snowflake database objects can be shared:
• Tables
• External tables
• Secure views
• Secure materialized views
• Secure UDFs
1. With Secure Data Sharing, no actual data is copied or transferred between accounts.
All sharing is accomplished through Snowflake’s unique services layer and metadata
store.
2. Shared data does not take up any storage in a consumer account. The only charges to
consumers are for the compute resources (i.e. virtual warehouses) used to query the
shared data.
3. In addition, because no data is copied or exchanged, setting up a share is quick: the provider creates
a share of a database in their account and grants access to specific objects in the database.
4. The provider can also share data from multiple databases, as long as these databases
belong to the same account.
5. One or more accounts are then added to the share, which can include your own
accounts
6. On the consumer side, a read-only database is created from the share.
7. Access to this database is configurable using the same, standard role-based access
control that Snowflake provides for all objects in the system.
8. Through this architecture, Snowflake enables creating a network of providers that
can share data with multiple consumers (including within their own organization) and
consumers that can access shared data from multiple providers:
Share Object
9. A new object created in a database in a share is not automatically available to
consumers. To make the object available to consumers, you must use the GRANT
<privilege> … TO SHARE command to explicitly add the object to the share (see the sketch below).
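The database, table, share, and account identifiers below are placeholders:

-- Provider side: create the share, grant access to objects, and add a consumer account.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = xy12345;

-- Consumer side: create a read-only database from the share (one database per share).
CREATE DATABASE sales_from_provider FROM SHARE provider_account.sales_share;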
Data Providers
10. As a data provider, you share a database with one or more Snowflake
accounts.
11. Snowflake does not place any hard limits on the number of shares you can
create or the number of accounts you can add to a share.
Data Consumers
12. A data consumer is any account that chooses to create a database from a share
made available by a data provider.
13. As a data consumer, once you add a shared database to your account, you can
access and query the objects in the database just as you would with any other
database in your account.
14. Snowflake does not place any hard limits on the number of shares you can
consume from data providers; however, you can only create one database per share.
Reader Account
Reader accounts provide a quick and cost-effective way to share data without requiring the consumer to
become a Snowflake customer; the reader account is created by, belongs to, and is paid for by the
provider account.
20. Users in a reader account can query data that has been shared with it, but
cannot perform any of the DML tasks that are allowed in a full account (data loading,
insert, update, etc.).
Sharing Data With Data Consumers in a Different Region and Cloud Platform
To share data with consumers in a different region or on a different cloud platform, the data must first be
replicated to an account in the consumer's region and cloud platform (database replication).
Snowflake provides three product offerings for data sharing that utilize Snowflake Secure
Data Sharing to connect providers of data with consumers.
In this Topic:
• Direct Share
• Snowflake Data Marketplace
• Data Exchange
Direct Share
Direct Share is the simplest form of data sharing that enables account-to-account sharing of
data utilizing Snowflake’s Secure Data Sharing.
As a data provider you can easily share data with another company so that your data shows
up in their Snowflake account without having to copy it over or move it.
Snowflake Data Marketplace
You can discover and access a variety of third-party data and have those datasets available
directly in your Snowflake account to query without transformation and join it with your
own data. If you need to use several different vendors for data sourcing, the Data
Marketplace gives you one single location from where to get the data.
You can also become a provider and publish data in the Data Marketplace, which is an
attractive proposition if you are thinking about data monetization and different routes to
market.
Data Exchange
Data Exchange is your own data hub for securely collaborating around data between a
selected group of members that you invite. It enables providers to publish data that can
then be discovered by consumers.
You can share data at scale with your entire business ecosystem such as suppliers, partners,
vendors, and customers, as well as business units at your own company. It allows you to
control who can join, publish, consume, and access data.
Once your Data Exchange is provisioned and configured, you can invite members and
specify whether they can consume data, provide data, or both.
The Data Exchange is supported for all Snowflake accounts hosted on non-VPS regions on
all supported cloud platforms.
The threat of a data security breach, someone gaining unauthorized access to an organization’s data, is what
keeps CEOs, CISOs, and CIOs awake at night. Such a breach can quickly turn into a public
relations nightmare, resulting in lost business and steep fines from regulatory agencies.
Snowflake Cloud Data Platform sets the industry standard for data platform security, so you don’t have to
lose sleep. All aspects of Snowflake’s architecture, implementation, and operation are designed to protect
customer data in transit and at rest against both current and evolving security threats.
1. Snowflake was built from the ground up to deliver end-to-end data security for all data platform users.
2. As part of its overall security framework, it leverages NIST 800-53 and the CIS Critical Security Controls,
a set of controls created by a broad consortium of international security experts to identify the security
functions that are effective against real-world threats.
3. Snowflake comprises a multilayered security architecture to protect customer data and access to that
data. This architecture addresses the following:
• External interfaces
• Access control
• Data storage
• Physical infrastructure
This security architecture is complemented by the monitoring, alerts, controls, and processes that are part of
Snowflake’s comprehensive security framework.
4. Security for compliance requirements: Snowflake is a multi-tenant service that implements isolation at
multiple levels. It runs inside a virtual private cloud (VPC), a logically isolated network section within
either Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud Platform (GCP).
5. The dedicated subnet, along with the implementation of security groups, enables Snowflake to isolate
and limit access to its internal components.
6. The Business Critical edition provides additional security features to support customers who have
HIPAA, PCI DSS, or other compliance requirements.
7. In addition, VPS supports customers who have specific regulatory requirements that prevent them from
loading their data into a multi-tenant environment. VPS includes the Business Critical edition within a
dedicated version of Snowflake.
8. Snowflake also isolates query processing, which is performed by one or more compute clusters called
virtual warehouses. Snowflake provisions these compute clusters in such a way that the virtual
warehouses of each customer are isolated from other customers’ virtual warehouses.
9. Snowflake also isolates data storage by customer. Each customer’s data is always stored in an
independent directory and encrypted using customer-specific keys, which are accessible only by that
customer.
20. Access Control-Security Feature
Authentication
10. Snowflake employs robust authentication mechanisms, and every request to Snowflake must be
authenticated, for example:
• User password hashes are securely stored.
• Strong password policy is enforced.
• Various mechanisms are deployed by Snowflake to thwart brute-force attacks
A brute force attack, or exhaustive search, is a cryptographic hack that uses trial-and-error to guess possible
combinations for passwords used for logins, encryption keys, or hidden web pages
• Snowflake also offers built-in multi-factor authentication (MFA), MFA for users with administrative
privileges, and key-pair authentication for non-interactive users.
• For customers who want to manage the authentication mechanism for their account, and whose
providers support SAML 2.0, Snowflake offers federated authentication.
• System for Cross-domain Identity Management : (SCIM) can be leveraged to help facilitate the
automated management of user identities and groups (that is, roles) in cloud applications using
RESTful APIs.
Authorization
11. Snowflake provides a sophisticated, role-based access control (RBAC) authorization framework to
ensure data and information can be accessed or operated on only by authorized users within an
organization.
12. Access control is applied to all database objects including tables, schemas, secure views, secure user-
defined functions (secure UDFs) and virtual warehouses. Access control grants determine a user’s ability
to both view and operate on database objects.
13. In Snowflake’s access control model, users are assigned one or more roles, each of which can be
assigned different access privileges. For every access to database objects, Snowflake validates that the
necessary privileges have been granted to a role assigned to the user.
14. Customers can choose from a set of built-in roles or create and define custom roles within the role
hierarchy defined by Snowflake.
15. The OAuth 2.0 authorization framework is also supported.
Encryption everywhere
16. In Snowflake, all customer data is always encrypted when it is stored on disk, and data is encrypted
when it’s moved into a Snowflake-provided staging location for loading into Snowflake.
17. Data is also encrypted when it is stored within a database object in Snowflake, when it is cached within a
virtual warehouse, and when Snowflake stores a query result.
Data encryption and key management
18. Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in a cluster of
hardware security modules (HSMs).
19. Each customer account has a separate key hierarchy of account-level, table-level, and file-level keys.
20. Snowflake automatically rotates account and table keys on a regular basis. Data encryption and key
management are entirely transparent to customers and require no configuration or management.
21. Snowflake was designed from the ground up to be a continuously available cloud service that is resilient
to failures to prevent customer disruption and data loss.
22. Its continuous data protection (CDP) capabilities protect against and provide easy self-service recovery
from accidental errors, system failures, and malicious acts.
23. The most common cause of data loss or corruption in a database is accidental errors made by a system
administrator, a privileged user, or an automated process.
24. Snowflake provides a unique feature called Time Travel that provides easy recovery from such errors.
Time Travel makes it possible to instantly restore or query any previous version of a table or database
from an arbitrary past point in time within a retention period.
25. When any data is modified, Snowflake internally writes those changes to a new storage object and
automatically retains the previous storage object for a period of time (the retention period) so that both
versions are preserved.
26. When data is deleted or database objects are dropped, Snowflake updates its metadata to reflect that
change but keeps the data during the retention period.
27. During the retention period, all data and data objects are fully recoverable by customers. Past versions
of a data object from any point in time within the retention period can also be accessed via SQL, both for
direct access by a SELECT statement as well as for cloning in order to create a copy of a past version of
the data object.
28. After the retention period has passed, Snowflake’s Fail-Safe feature provides an additional seven days
(the “fail-safe” period) to provide a sufficient length of time during which Snowflake can, at a customer’s
request, recover any data that was maliciously or inadvertently deleted by human or software error.
29. At the end of that Fail-Safe period, an automated process physically deletes the data. Because of this
design, it is impossible for the Snowflake service, any Snowflake personnel, or malicious intruders to
physically delete data.
30. CDP and Time Travel are standard features built into Snowflake. The length of the default retention
period is determined by the customer’s service agreement.
31. Customers can specify extended retention periods at the time that a new database, table, or schema is
created via SQL data definition language (DDL) commands. Extended retention periods incur additional
storage costs for the time that Snowflake retains the data during the retention and fail-safe periods.
32. If an errant data loading script corrupts a database, it is possible to create a logical duplicate of the
database (a clone) from the point in time just prior to the execution of a specific statement (see the
Time Travel sketch after this list).
33. The second most common type of data loss is caused by some form of system failure: both software
failures and infrastructure failures such as the loss of a disk, a disk array, a server or, most significantly, a
data center.
34. The Snowflake architecture is designed for resilience, without data loss, in the face of such failures.
Snowflake, which runs on all the major cloud providers’ platforms (AWS, GCP, and Azure), uses a fully
distributed and resilient architecture combined with the resiliency capabilities available in these cloud
platforms to protect against a wide array of possible failures.
• Data storage layer. Persists table data, query results, and other data in the cloud provider's durable
object storage, always in encrypted form.
• Compute layer. Consists of one or more virtual warehouses, each of which is a multinode compute
cluster that processes queries. Virtual warehouses cache data from the data storage layer in
encrypted form, but they do not store persistent data.
• Cloud services layer. The brain of the system, this layer manages infrastructure, queries, security,
and metadata. The services running in this layer are implemented as a set of stateless processes
running across multiple availability zones.
35. Each layer in the Snowflake architecture is distributed across availability zones. Because availability
zones are geographically separated data centers with independent access to power and networking,
operations continue even if one or two availability zones become unavailable.
36. When a transaction is committed in Snowflake, the data is securely stored in the cloud provider’s highly
durable data storage, which enables data survival in the event of the loss of one or more disks, servers,
or even data centers. Amazon S3 synchronously and redundantly stores data across multiple devices in
multiple facilities. It is designed for eleven 9s (99.999999999%) of data durability.
External Interfaces
37. Customers access Snowflake via the internet using only secure protocols. All internet communication
between users and Snowflake is secured and encrypted using TLS 1.2 or higher.
The following drivers and tools may be used to connect to the service:
38. Snowflake also supports IP address whitelisting to enable customers to restrict access to the Snowflake
service to only trusted networks.
39. Customers who prefer to not allow any traffic to traverse the public internet may leverage either AWS
PrivateLink (and AWS DirectConnect) or Microsoft Azure Private Link.
Infrastructure Security
Threat detection
Snowflake uses advanced threat detection tools to monitor all aspects of its infrastructure.
• All security logs, including logs and alerts from third-party tools, are centralized in Snowflake’s
security data lake, where they are aggregated for analysis and alerting.
• Activities meeting certain criteria generate alerts that are triaged through Snowflake’s security
incident process.
• Specific areas of focus include the following:
o File integrity monitoring (FIM) tools are used to ensure that critical system files, such as
important system and application executable files, libraries, and configuration files, have not
been tampered with. FIM tools use integrity checks to identify any suspicious system
alterations, which include owner or permissions changes to files or directories, the use of
alternate data streams to hide malicious activities, and the introduction of new files.
o Behavioral monitoring tools monitor network, user, and binary activity against a known
baseline to identify anomalous behavior that could be an indication of compromise.
• Snowflake uses threat intelligence feeds to contextualize and correlate security events and harden
security controls to counteract malicious tactics, techniques, and procedures (TTPs).
Physical security
• Snowflake is hosted in AWS, Azure, or GCP data centers around the world.
• Snowflake’s infrastructure-as-a-service cloud provider partners employ many physical security
measures, including biometric access controls, 24-hour armed guards, and video surveillance, to
ensure that no unauthorized access is permitted.
• Neither Snowflake personnel nor Snowflake customers have access to these data centers.
Security Compliance
Snowflake’s portfolio of security and compliance reports is continuously expanded as customers request
reports. The following is the current list of reports available to all customers and prospects who are
under a nondisclosure agreement (NDA).
SOC 1 Type 2
The SOC 1 Type 2 report is an independent auditor’s attestation of the financial controls that Snowflake
had in place during the report’s coverage period.
SOC 2 Type 2
The SOC 2 Type 2 report is an independent auditor’s attestation of the security controls that Snowflake
had in place during the report’s coverage period. This report is provided for customers and prospects
to review to ensure there are no exceptions to the documented policies and procedures in the policy
documentation.
PCI DSS
The Payment Card Industry Data Security Standard (PCI DSS) is a set of prescriptive requirements to which an
organization must adhere in order to be considered compliant. Snowflake’s PCI DSS Attestation of
Compliance documents the independent assessment of these security controls.
HIPAA
The Health Insurance Portability and Accountability Act (HIPAA) is a law that provides data security and privacy
provisions to safeguard protected health information (PHI). Snowflake is able to enter into a business associate
agreement (BAA) with any covered entity that requires HIPAA compliance.
ISO/IEC 27001
The International Organization for Standardization provides requirements for establishing, implementing,
maintaining, and continually improving an information security management system. Snowflake’s ISO/IEC 27001
certification is available for customers and prospects to review.
FedRAMP
Each Snowflake edition builds on the capabilities of the preceding versions. For example, the Business Critical
edition includes everything in the Enterprise edition, plus additional security features.
44. The Business Critical edition is Snowflake’s solution for customers who have specific compliance
requirements.
45. It includes HIPAA support, is PCI DSS compliant, and features an enhanced security policy.
46. This edition enables customers to use Tri-Secret Secure, which provides split encryption keys for
multiple layers of data security.
47. When a customer uses Tri-Secret Secure, access to the customer’s data requires the combination of the
Snowflake encryption key, the customer encryption key (which is wholly owned by the customer), and
valid customer credentials with role-based access to the data.
48. Because the data is encrypted with split keys, it is impossible for anyone other than the customer,
including Amazon, to gain access to the underlying data.
49. Snowflake can gain access to the data only if the customer key and access credentials are provided to
Snowflake. This ensures that only the customer can respond to demands for data access, regardless of
where they come from.
50. VPS represents the most sophisticated solution for customers with sensitive data. It differs from other
Snowflake editions in a number of important ways.
51. With VPS, all of the servers that contain in-memory encryption keys are unique to each customer. Each
VPS customer has their own dedicated virtual servers, load balancer, and metadata store.
52. There are also dedicated virtual private networks (VPNs) or virtual private cloud (VPC) bridges from a
customer’s own VPC to the Snowflake VPC.
53. These dedicated services ensure that the most sensitive components of the customer’s data warehouse
are completely separate from those of other customers.
54. Even with VPS, Snowflake’s hardware security module and its maintenance, access, and deployment
services are still shared services. These components are secure by design, even in a multi-tenant model.
For instance, the hardware security module (HSM) is configured with a completely separate partition
dedicated to the customer.
55. All data is stored in Amazon S3 within a separately provisioned AWS account. As shown in the following
diagram, this design makes it possible for even the most security-conscious customers to trust VPS as a
comprehensively secure solution for their data.
Outline Snowflake security principles and identify use cases where they should be applied.
• Encryption
• Network security
• User, Role, Grants provisioning
• Authentication
Encryption
Data Encryption
1. Snowflake encrypts all customer data by default, using the latest security standards, at no additional
cost. Snowflake provides best-in-class key management, which is entirely transparent to customers.
2. End-to-end encryption (E2EE) is a form of communication in which no one but end users can read
the data. In Snowflake, this means that only a customer and the runtime components can read the data.
No third parties, including Snowflake’s cloud computing platform or any ISP, can see data in the
clear.
1. A user uploads one or more data files to an external stage. The user may optionally encrypt the data
files using client-side encryption; if the data is not encrypted, Snowflake immediately encrypts
the data when it is loaded into a table.
2. Query results can be unloaded into an external stage. Results are optionally encrypted using client-
side encryption.
3. A user uploads one or more data files to a Snowflake internal stage. The data files are automatically encrypted by
the client on the local machine prior to being transmitted to the internal stage.
4. Query results can be unloaded into a Snowflake internal stage. Results are automatically encrypted when
they are unloaded to the stage.
5. Data is loaded from the stage into a table, transformed into Snowflake’s
proprietary file format and stored in a cloud storage container (“data at rest”). In Snowflake, all data
at rest is always encrypted.
6. The user downloads data files from the stage and decrypts the data on the client side.
In all of these steps, all data files are encrypted. Only the user and the Snowflake runtime components can
read the data. The runtime components decrypt the data in memory for query processing. No third-party
service can see data in the clear.
Client-Side Encryption
1. The customer creates a secret master key, which is shared with Snowflake.
2. The client, which is provided by the cloud storage service, generates a random encryption key and
encrypts the file before uploading it into cloud storage. The random encryption key, in turn, is
encrypted with the customer’s master key.
3. Both the encrypted file and the encrypted random key are uploaded to the cloud storage service.
The encrypted random key is stored with the file’s metadata.
4. When downloading data, the client downloads both the encrypted file and the encrypted random
key.
5. The client decrypts the encrypted random key using the customer’s master key.
6. Next, the client decrypts the encrypted file using the now decrypted random key.
7. All encryption and decryption happens on the client side. At no time does the cloud storage service
or any other third party (such as an ISP) see the data in the clear.
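For example, client-side encryption can be configured on an external stage by registering the customer’s master key in the stage definition. A sketch only; the bucket URL, credentials, and key value are placeholders:
create stage my_cse_stage
  url = 's3://mybucket/encrypted-files/'
  credentials = (aws_key_id = '<aws_key_id>' aws_secret_key = '<aws_secret_key>')
  encryption = (type = 'AWS_CSE' master_key = '<base64_encoded_key>');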
1. Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in a hardware
security module (HSM).
2. Keys are automatically rotated on a regular basis by the Snowflake service, and data can be
automatically re-encrypted (“rekeyed”) on a regular basis.
3. Data encryption and key management is entirely transparent and requires no configuration or
management.
4. A hierarchical key model provides a framework for Snowflake’s encryption key management. The
hierarchy is composed of several layers of keys in which each higher layer of keys (parent keys)
encrypts the layer below (child keys).
5. In security terminology, a parent key encrypting all child keys is known as “wrapping”.
6. Snowflake’s hierarchical key model consists of four levels of keys: the root key, account master keys,
table master keys, and file keys.
7. Each customer account has a separate key hierarchy of account-level, table-level, and file-level keys.
8. Encryption Key Rotation
9. Account and table master keys are automatically rotated by Snowflake when they are more than 30
days old.
10. Active keys are retired, and new keys are created. When Snowflake determines the retired key is no
longer needed, the key is automatically destroyed.
Periodic Rekeying
11. If periodic rekeying is enabled, when the retired encryption key for a table is older than one year,
Snowflake automatically creates a new encryption key and re-encrypts all data previously protected
by the retired key using the new key. The new key is used to decrypt the table data going forward.
12. Snowflake relies on one of several cloud-hosted hardware security module (HSM) services as a
tamper-proof, highly secure way to generate, store, and use the root keys of the key hierarchy
13. Tri-Secret Secure lets you control access to your data using a master encryption key that you
maintain in the key management service for the cloud provider that hosts your Snowflake account:
AWS Key Management Service (KMS), Azure Key Vault, or Google Cloud Key Management Service.
14. The customer-managed key can then be combined with a Snowflake-managed key to create a
composite master key. When this occurs, Snowflake refers to this as Tri-Secret Secure
15. With Tri-Secret Secure enabled for your account, Snowflake combines your key with a Snowflake-
maintained key to create a composite master key. This dual-key encryption model, together with
Snowflake’s built-in user authentication, enables the three levels of data protection offered by Tri-
Secret Secure.
Network security
1. Network policies provide options for managing network configurations to the Snowflake service.
2. Network policies allow restricting access to your account based on user IP address. Effectively, a
network policy enables you to create an IP allowed list, as well as an IP blocked list, if desired.
3. By default, Snowflake allows users to connect to the service from any computer or device IP
address.
4. A security administrator (or higher) can create a network policy to allow or deny access to a single IP
address or a list of addresses.
5. Network policies currently support only Internet Protocol version 4 (i.e. IPv4) addresses.
6. An administrator with sufficient permissions can create any number of network policies.
7. A network policy is not enabled until it is activated at the account or individual user level.
8. To activate a network policy, modify the account or user properties and assign the network policy to
the object.
9. Only a single network policy can be assigned to the account or a specific user at a time.
10. Snowflake supports specifying ranges of IP addresses using Classless Inter-Domain Routing (i.e.
CIDR) notation. For example, 192.168.1.0/24 represents all IP addresses in the range
of 192.168.1.0 to 192.168.1.255.
11. To enforce a network policy for all users in your Snowflake account, activate the network policy for
your account.
12. If a network policy is activated for an individual user, the user-level network policy takes precedence.
For information about activating network policies at the user level, see Activating Network Policies
for Individual Users (in this topic).
13. To determine whether a network policy is set on your account or for a specific user, execute
the SHOW PARAMETERS command.
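A minimal sketch of creating and activating a network policy; the policy name, IP ranges, and user name are hypothetical:
create network policy corp_access_policy
  allowed_ip_list = ('192.168.1.0/24')
  blocked_ip_list = ('192.168.1.99');
alter account set network_policy = corp_access_policy;      -- account-level activation
alter user jsmith set network_policy = corp_access_policy;  -- user-level activation (takes precedence)
show parameters like 'network_policy' in account;
show parameters like 'network_policy' in user jsmith;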
1. Federated authentication enables your users to connect to Snowflake using secure SSO (single sign-on).
2. With SSO enabled, your users authenticate through an external, SAML 2.0-compliant identity provider
(IdP).
3. Once authenticated by this IdP, users can securely initiate one or more sessions in Snowflake for the
duration of their IdP session without having to log into Snowflake.
4. They can choose to initiate their sessions from within the interface provided by the IdP or directly in
Snowflake.
5. For example, in the Snowflake web interface, a user connects by clicking the IdP option on the login
page:
a. If they have already been authenticated by the IdP, they are immediately granted access to
Snowflake.
b. If they have not yet been authenticated by the IdP, they are taken to the IdP interface where
they authenticate, after which they are granted access to Snowflake.
The external, independent entity responsible for providing the following services to the SP:
The following vendors provide native Snowflake support for federated authentication and SSO:
In addition to the native Snowflake support provided by Okta and ADFS, Snowflake supports
using most SAML 2.0-compliant vendors as an IdP, including:
• Google G Suite
• Microsoft Azure Active Directory
• OneLogin
• Ping Identity PingOne
Key Pair Authentication & Key Pair Rotation
1. Snowflake supports using key pair authentication for enhanced authentication security as an
alternative to basic authentication (i.e. username and password).
2. This authentication method requires, as a minimum, a 2048-bit RSA key pair. You can generate the
Privacy Enhanced Mail (i.e. PEM) private-public key pair using OpenSSL. Some of the Supported
Snowflake Clients allow using encrypted private keys to connect to Snowflake.
3. The public key is assigned to the Snowflake user who uses the Snowflake client to connect and
authenticate to Snowflake.
4. Snowflake also supports rotating public keys in an effort to allow compliance with more robust
security and governance postures.
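A minimal sketch of assigning and rotating a public key for a user; the user name is hypothetical and the key values are truncated placeholders:
alter user jsmith set rsa_public_key = 'MIIBIjANBgkqh...';
-- rotation: register the new key in the second slot, then remove the old key
alter user jsmith set rsa_public_key_2 = 'MIIBIjANBgkqh...';
alter user jsmith unset rsa_public_key;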
An external function is a type of UDF. Unlike other UDFs, an external function does not
contain its own code; instead, the external function calls code that is stored and executed
outside Snowflake.
Inside Snowflake, the external function is stored as a database object that contains
information that Snowflake uses to call the remote service. This stored information includes
the URL of the proxy service that relays information to and from the remote service.
Remote Service: the code that runs outside Snowflake and is called, via the proxy service, by the external function.
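A minimal sketch of defining an external function, assuming an AWS API Gateway proxy in front of the remote service; the integration name, role ARN, URLs, and function name are hypothetical:
create api integration my_api_integration
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::123456789012:role/my_external_function_role'
  api_allowed_prefixes = ('https://abc123.execute-api.us-west-2.amazonaws.com/prod')
  enabled = true;
create external function calc_sentiment(input_text varchar)
  returns variant
  api_integration = my_api_integration
  as 'https://abc123.execute-api.us-west-2.amazonaws.com/prod/sentiment';
-- called like any other scalar function
select calc_sentiment(review_text) from product_reviews;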
1. Dynamic Data Masking is a Column-level Security feature that uses masking policies
to selectively mask data at query time that was previously loaded in plain-text into
Snowflake.
2. Currently, Snowflake supports using Dynamic Data Masking on tables and views.
3. At query runtime, the masking policy is applied to the column at every location where
the column appears.
4. Depending on the masking policy conditions, Snowflake query operators may see the
plain-text value, a partially masked value, or a fully masked value.
5. Easily change masking policy content without having to reapply the masking policy to
thousands of columns.
use role sysadmin;
create or replace masking policy email_mask as (val string) returns string ->
case when current_role() in ('SYSADMIN') then val
else '*********'
end;
-- allow the sysadmin role to set or unset the email_mask masking policy (optional)
use role accountadmin;
grant apply on masking policy email_mask to role sysadmin;
use role sysadmin;
-- apply masking policy to a table column
alter table if exists emp_basic_ingest modify column email set masking policy email_mask;
use role accountadmin;
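-- queried as ACCOUNTADMIN, which is not in the policy's allowed roles, so the email column returns the masked value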
select * from emp_basic_ingest;
create view emp_info_view as select * from emp_basic_ingest;
select * from emp_info_view;
Row Level Security
Snowflake supports row-level security through the use of row access policies to determine which rows to
return in the query result.
Row access policies implement row-level security to determine which rows are visible in the query result.
A row access policy is a schema-level object that determines whether a given row in a table or view can be
viewed from the following types of statements: SELECT statements, and rows selected by UPDATE, DELETE,
and MERGE statements.
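A minimal sketch of a row access policy; the role names, table, and column are hypothetical:
create or replace row access policy sales_region_policy
  as (sales_region varchar) returns boolean ->
    current_role() = 'SALES_ADMIN'
    or (current_role() = 'SALES_EMEA' and sales_region = 'EMEA');
-- attach the policy to a table column
alter table sales add row access policy sales_region_policy on (sales_region);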
The following tables provide a list of the major features and services included with each edition.
Note
This is only a partial list of the features. For a more complete and detailed list, see Overview of Key
Features.
Release Management
Feature/Service (by edition: Standard, Enterprise, Business Critical, VPS)
• Redirecting client connections between Snowflake accounts for business continuity and disaster recovery: available in Business Critical and VPS.
Data Sharing
Feature/Service (by edition: Standard, Enterprise, Business Critical, VPS)
Customer Support
Feature/Service (by edition: Standard, Enterprise, Business Critical, VPS)
Primary Database
• Replication can be enabled for any existing permanent or transient database.
• Enabling replication designates the database as a primary database.
• Any number of databases in an account can be designated a primary database.
• Likewise, a primary database can be replicated to any number of accounts in your organization.
• This involves creating a secondary database as a replica of a specified primary database in each of the
target accounts
• All DML/DDL operations are executed on the primary database.
• Each read-only, secondary database can be refreshed periodically with a snapshot of the primary
database, replicating all data as well as DDL operations on database objects (i.e. schemas, tables, views,
etc.).
• When a primary database is replicated, a snapshot of its database objects and data is transferred to the
secondary database
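A minimal sketch of enabling replication and refreshing a secondary database; the organization, account, and database names are hypothetical:
-- on the source account: designate the database as primary and allow replication
alter database sales_db enable replication to accounts myorg.account2;
-- on the target account: create the secondary database as a replica
create database sales_db as replica of myorg.account1.sales_db;
-- refresh the secondary database with a snapshot of the primary
alter database sales_db refresh;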
Database objects replicated to a secondary database (✔ indicates the object type is replicated):
• Transient tables: ✔
• Temporary tables: not replicated
• Automatic Clustering of clustered tables: ✔
• Sequences: ✔
• Materialized views: ✔
• Secure views: ✔
• File formats: ✔
• Temporary stages: not replicated
• Stored procedures: ✔
• SQL and JavaScript UDFs: ✔
• Policies (Row Access & Column-level Security (masking)): ✔. The replication operation is blocked if either of
the following use cases is true: the primary database is in an Enterprise (or higher) account and contains a
policy but one or more of the accounts approved for replication are on lower editions, or a policy contained in
the primary database has a reference to a policy in another database.
• Currently, replication is supported for databases only. Other types of objects in an account cannot be
replicated. This list includes:
• Users
• Roles
• Warehouses
• Resource monitors
• Shares
• Object parameters that are set at the schema or schema object level are replicated.
• Parameter replication is only applicable to objects in the database (schemas, tables) and only if the
parameter is explicitly set. Database-level parameters are not replicated.
Database Failover/Fallback
1. In the event of a massive outage (due to a network issue, software bug, etc.) that disrupts the cloud
services in a given region, access to Snowflake will be unavailable until the source of the outage is
resolved and services are restored.
2. To ensure continued availability and data durability in such a scenario, you can replicate your
databases in a given region to another Snowflake account (owned by your organization) in a different
region. This option allows you to recover multiple databases in the other region and continue to
process data after a failure in the first region results in full or partial loss of Snowflake availability.
3. Initiating failover involves promoting a secondary (i.e. replica) database in an available region to serve
as the primary database. When promoted, the now-primary database becomes writeable.
Concurrently, the former primary database becomes a secondary, read-only database.
1. A service outage in Region A, where the account that contains the primary database is located. The
secondary database (in Region B) has been promoted to serve as the primary database. Concurrently,
the former primary database has become a secondary, read-only database:
2. The following diagram shows that the service outage in Region A has been resolved. A database
refresh operation is in progress from the primary database (in Region B) to the secondary database
(in Region A):
3. The final diagram shows operations returned to their initial configuration (i.e. failback). The
secondary database (in Region A) has been promoted to once again serve as the primary database for
normal business operations. Concurrently, the former primary database (in Region B) has become a
secondary, read-only database:
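A minimal sketch of the failover commands; the database, organization, and account names are hypothetical, and database failover/failback requires Business Critical edition or higher:
-- on the source account: allow failover to the target account
alter database sales_db enable failover to accounts myorg.account2;
-- on the target account: promote the secondary (replica) database to primary
alter database sales_db primary;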
Materialized view data is not replicated. Automatic background maintenance of materialized views in a
secondary database is enabled by default. If Automatic Clustering is enabled for a materialized view in
the primary database, automatic monitoring and reclustering of the materialized view in the secondary
database is also enabled.
External tables in the primary database currently cause the replication or refresh operation
to fail with an error.
• To work around this limitation, we recommend that you move the external tables into a
separate database that is not replicated.
• Alternatively, if you are migrating your databases to another account, you could clone
the primary database, drop the external table from the clone, and then replicate the
cloned database.
• After you promote the secondary database in the target account, you would need to
recreate the external tables in the database.
Replication and Policies (Masking & Row Access)
For masking and row access policies, if either of the following conditions is true, then the initial replication
operation or a subsequent refresh operation fails:
• The primary database is in an Enterprise (or higher) account and contains a policy but one or more of the
accounts approved for replication are on lower editions.
• A policy contained in the primary database has a reference to a policy in another database.
Time Travel
Querying tables and views in a secondary database using Time Travel can produce different
results than when executing the same query in the primary database.
Historical Data
Historical data available to query in a primary database using Time Travel is not
replicated to secondary databases.
The data retention period for tables in a secondary database begins when the
secondary database is refreshed with the DML operations (i.e. changing or deleting
data) written to tables in the primary database.
22. Search Optimization Service
1. A point lookup query returns only one or a small number of distinct rows.
• Business users who need fast response times for critical dashboards with highly selective filters.
• Data scientists who are exploring large data volumes and looking for specific subsets of data.
2. The search optimization service aims to significantly improve the performance of selective point lookup
queries on tables.
3. A user can register one or more tables to the search optimization service. Search optimization is a table-
level property and applies to all columns with supported data types (see the list of supported data types
further below; a SQL example is shown at the end of this section).
4. The search optimization service relies on a persistent data structure that serves as an optimized search
access path.
5. A maintenance service that runs in the background is responsible for creating and maintaining the search
access path:
• When you add search optimization to a table, the maintenance service creates and populates the
search access path with the data needed to perform the lookups.
• When data in the table is updated (for example, by loading new data sets or through DML
operations), the maintenance service automatically updates the search access path to reflect the
changes to the data.
• If queries are run when the search access path hasn’t been updated yet, the queries might run
slower but will always return up-to-date results.
• There is a cost for the storage and compute resources used by this service.
6. The search optimization service is one of several ways to optimize query performance. Related
techniques include:
• Clustering a table.
• Creating one or more materialized views (clustered or unclustered).
7. Clustering a table can speed any of the following, as long as they are on the clustering key:
o Range searches.
o Equality searches.
However, a table can be clustered on only a single key (which can contain one or more columns or
expressions).
8. The search optimization service speeds up only equality searches. However, this applies to all the columns
of supported types in a table that has search optimization enabled.
9. A materialized view speeds both equality searches and range searches, as well as some sort operations,
but only for the subset of rows and columns included in the materialized view.
10. Materialized views can be also used in order to define different clustering keys on the same source table
(or a subset of that table), or in conjunction with flattening JSON / variant data.
11. If you clone a table, schema, or database, the search optimization property and search access paths of
each table are also cloned
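A minimal sketch of enabling search optimization on a table, as referenced in point 3 above; the table name is hypothetical:
alter table web_events add search optimization;
show tables like 'web_events';   -- the SEARCH_OPTIMIZATION and SEARCH_OPTIMIZATION_PROGRESS columns show status
select system$estimate_search_optimization_costs('web_events');   -- estimate storage/compute costs
alter table web_events drop search optimization;                  -- remove the property if no longer needed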
25. Parameters
https://docs.snowflake.com/en/user-guide/admin-account-management.html
1. Snowflake provides three types of parameters that can be set for your account:
2. Account parameters that affect your entire account. These parameters are set at the account level and
can’t be overridden at a lower level of the hierarchy. All parameters have default values, which can be
overridden at the account level. To override default values at the account level, you must be an account
administrator.
3. Session parameters: These parameters mainly relate to users and their sessions. They can be set at the
account level, but they can also be changed for each user, and a user can change them again within an
individual session.
For example, users connecting from the US may want to see dates displayed in “mm-dd-yyyy” format,
and users from Asia may want to see dates listed as “dd/mm/yyyy”.
• The account-level value for this parameter may be the default “yyyy-mm-dd”. Setting the value at
user-level ensures different users are seeing dates the way they want to see it.
• Now, let’s say a user from the USA logs in to the data warehouse, but wants to see dates in
“MMM-dd-yyyy” format. She could change the parameter for her own session only.
Both ACCOUNTADMIN and SYSADMIN role members can assign or change parameters for the
user. If no changes are made to a session type parameter at the user or session level, the account-
level value is applied.
4. Object parameters that default to objects (warehouses, databases, schemas, and tables). These
parameters are applicable to Snowflake objects, like warehouses and databases. Warehouses don’t have
any hierarchy, so warehouse-specific parameters can be set at account-level and then changed for
individual warehouses.
• Similarly, database-specific parameters can be set at the account level and then for each database.
• Unlike warehouses though, databases have a hierarchy. Within a database, a schema can override
the parameters set at account or database level, and within the schema, a table can override the
parameters set at account, database or schema level.
• If no changes are made at lower levels of the hierarchy, the value set at the nearest higher level is
applied downstream.
5. To see a list of the parameters and their current values for your account, run SHOW PARAMETERS IN ACCOUNT.
In the output, the “level” column is the hierarchical position at which the current value is applied. If the default
value and the current value are the same, this field is empty.
To check a parameter for the current session, run SHOW PARAMETERS LIKE '<parameter_name>' IN SESSION. If the
parameter has been overridden, the output shows the current value is set at the session level and differs from the
default value.
To find what the value has been set to at the account level, run SHOW PARAMETERS LIKE '<parameter_name>' IN
ACCOUNT. If the default value hasn’t been changed at the account level, the “level” column is empty.
This code snippet shows how to list object properties for database, schema, table and warehouse:
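A sketch of such a snippet; the table and warehouse names are hypothetical, and the database and schema names follow the example below:
show parameters in database mytestdb;
show parameters in schema mytestdb.test_schema;
show parameters in table mytestdb.test_schema.customers;
show parameters in warehouse compute_wh;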
Let’s say you want to change the data retention period to 0 days for the TEST_SCHEMA, which effectively
turns off its time travel. Run a command like this:
ALTER SCHEMA MYTESTDB.TEST_SCHEMA SET DATA_RETENTION_TIME_IN_DAYS=0;
And that’s about everything you need to know about querying and setting Snowflake parameters.
26. Connectors and Drivers
Snowflake Connector for Python
1. The Snowflake Connector for Python provides an interface for developing Python applications that
can connect to Snowflake and perform all standard operations. It provides a programming alternative
to developing applications in Java or C/C++ using the Snowflake JDBC or ODBC drivers.
2. The connector is a native, pure Python package that has no dependencies on JDBC or ODBC. It can
be installed using pip on Linux, macOS, and Windows platforms where a supported version of
Python is installed.
3. SnowSQL, the command line client provided by Snowflake, is an example of an application
developed using the connector.
JDBC Driver
7. Snowflake provides a JDBC type 4 driver that supports core JDBC functionality. The JDBC driver
must be installed in a 64-bit environment and requires Java 1.8 (or higher). The driver can be used
with most client tools/applications that support JDBC for connecting to a database server.
8. sfsql, the now-deprecated command line client provided by Snowflake, is an example of a JDBC-
based application.
ODBC Driver
9. Snowflake provides a driver for connecting to Snowflake using ODBC-based client applications. The
ODBC driver has different prerequisites depending on the platform where it is installed.
PHP PDO Driver
10. The PHP PDO driver for Snowflake provides an interface for developing PHP applications that can
connect to Snowflake and perform all standard operations.
Snowflake Connector for Kafka
11. Apache Kafka software uses a publish and subscribe model to write and read streams of records,
similar to a message queue or enterprise messaging system. Kafka allows processes to read and write
messages asynchronously.
12. A subscriber does not need to be connected directly to a publisher; a publisher can queue a message
in Kafka for the subscriber to receive later.
13. An application publishes messages to a topic, and an application subscribes to a topic to receive
those messages. Kafka can process, as well as transmit, messages; however, that is outside the scope
of this document. Topics can be divided into partitions to increase scalability.
14. Kafka Connect is a framework for connecting Kafka with external systems, including databases. A
Kafka Connect cluster is a separate cluster from the Kafka cluster. The Kafka Connect cluster
supports running and scaling out connectors (components that support reading and/or writing
between external systems).
15. Kafka, like many message publish/subscribe platforms, allows a many-to-many relationship between
publishers and subscribers. A single application can publish to many topics, and a single application
can subscribe to multiple topics. With Snowflake, the typical pattern is that one topic supplies
messages (rows) for one Snowflake table.
16. From the perspective of Snowflake, a Kafka topic produces a stream of rows to be inserted into a
Snowflake table. In general, each Kafka message contains one row.
18. Kafka topics can be mapped to existing Snowflake tables in the Kafka configuration.
19. Every Snowflake table loaded by the Kafka connector has a schema consisting of two VARIANT
columns:
a. RECORD_CONTENT. This contains the Kafka message.
b. RECORD_METADATA. This contains metadata about the message, such as the topic, partition, offset, and CreateTime.
24. Expressed in JSON syntax, a sample message might look similar to the
following:
{
"meta":
{
"offset": 1,
"topic": "PressureOverloadWarning",
"partition": 12,
"key": "key name",
"schema_id": 123,
"CreateTime": 1234567890,
"headers":
{
"name1": "value1",
"name2": "value2"
}
},
"content":
{
"ID": 62,
"PSI": 451,
"etc": "..."
}
}
select
record_metadata:CreateTime,
record_content:ID
from table1
where record_metadata:topic = 'PressureOverloadWarning';
25. The Kafka connector completes the following process to subscribe to Kafka topics and create
Snowflake objects:
• The Kafka connector subscribes to one or more Kafka topics based on the
configuration information provided via the Kafka configuration file or command line
(or the Confluent Control Center; Confluent only).
• The connector creates the following objects for each topic:
o One internal stage to temporarily store data files for each topic.
o One pipe to ingest the data files for each topic partition.
o One table for each topic. If the table specified for each topic does not exist, the
connector creates it; otherwise, the connector creates the
RECORD_CONTENT and RECORD_METADATA columns in the existing table
and verifies that the other columns are nullable (and produces an error if they
are not).
The following diagram shows the ingest flow for Kafka with the Kafka connector:
1. One or more applications publish JSON or Avro records to a Kafka cluster. The
records are split into one or more topic partitions.
2. The Kafka connector buffers messages from the Kafka topics. When a
threshold (time or memory or number of messages) is reached, the connector
writes the messages to a temporary file in the internal stage. The connector
triggers Snowpipe to ingest the temporary file. Snowpipe copies a pointer to
the data file into a queue.
3. A Snowflake-provided virtual warehouse loads data from the staged file into
the target table (i.e. the table specified in the configuration file for the topic)
via the pipe created for the Kafka topic partition.
4. (Not shown) The connector monitors Snowpipe and deletes each file in the
internal stage after confirming that the file data was loaded into the table.
If a failure prevented the data from loading, the connector moves the file into
the table stage and produces an error message.
Snowflake polls the insertReport API for one hour. If the status of an ingested file does not
succeed within this hour, the files being ingested are moved to a table stage.
It may take at least one hour for these files to be available on the table stage. Files are only
moved to the table stage when their ingestion status could not be found within the
previous hour.
Snowflake SQL API
26. The Snowflake SQL API is a REST API that you can use to access and update data in a Snowflake
database. You can use this API to develop custom applications and integrations that:
• Perform queries
• Manage your deployment (e.g. provision users and roles, create tables, etc.)
The Snowflake SQL API provides operations that you can use to submit SQL statements for execution, check
the status of a statement, cancel the execution of a statement, and fetch query results.
You can use this API to execute standard queries and most DDL and DML statements.
27. Metadata Fields in Snowflake
1. The data contained in metadata fields may be processed outside of your Snowflake Region.
2. It is the customer’s responsibility to ensure that no personal data (other than for a User object), sensitive
data, export-controlled data, or other regulated data is entered into any metadata field when using
the Snowflake Service.
3. The most common metadata fields are:
a. Object definitions, such as a policy, an external function, or a view definition.
b. Object properties, such as an object name or an object comment.
28. Few Facts to remember
1. There are 0 Snowflake GCP regions in APAC.
2. By default, the maximum number of accounts in an organization cannot exceed 25
3. To recover the management costs of Snowflake-provided compute resources, Snowflake applies
a 1.5x multiplier to resource consumption.
4. Snowflake credits charged per compute-hour: Snowflake-managed compute resources are billed at 1.5
credits, and cloud services at 1 credit.
5. Virtual warehouses are billed by the second, with a 1-minute minimum.
6. Typical utilization of cloud services (up to 10% of daily compute credits) is included for free
7. Each server in a cluster can process 8 files in parallel
8. The recommended file size for data loading is 100-250 MB (or larger) compressed.
9. When unloading, Snowflake creates files of approximately 16 MB each by default; a single file can be up to 5 GB using the MAX_FILE_SIZE copy option.
10. Snowpipe charges 0.06 credits per 1000 files queued.
11. An ALTER PIPE … REFRESH statement copies a set of data files staged within the previous 7 days to
the Snowpipe ingest queue for loading into the target table
12. When a pipe is paused, event messages received for the pipe enter a limited retention period. The period
is 14 days by default. If a pipe is paused for longer than 14 days, it is considered stale
13. Each micro-partition contains between 50 MB and 500 MB of uncompressed data
14. Standard Time Travel is 1 day. Extended Time Travel (up to 90 days) requires Snowflake Enterprise
Edition.
15. The fail-safe retention period is 7 days
16. The total number of reader accounts a provider can create is 20.
17. Result Cache holds the results of every query executed in the past 24 hours
18. History tab list includes (up to) 100 queries
19. Query results in History tab are available for a 24-hour period
20. The History tab page allows you to view and drill into the details of all queries executed in the last 14
days.
21. The web interface only supports exporting results up to 100 MB in size. If a query result exceeds this
limit, you are prompted whether to proceed with the export.
22. If the data retention period for a source table is less than 14 days and a stream on it has not been consumed,
Snowflake temporarily extends the retention period up to a maximum of 14 days by default, regardless of the
Snowflake edition of your account, to prevent the stream from going stale.
23. A task can have a maximum of 100 child tasks.
24. All ingested data stored in Snowflake tables is encrypted using AES-256 strong encryption.
25. Account and table master keys are automatically rotated by Snowflake when they are more than 30 days
old.
26. The maximum compressed row size in Snowflake is 16 MB.
27. Micro-partitions are approximately 16 MB in size (compressed).
28. STATEMENT_TIMEOUT_IN_SECONDS limits how long queries can run, up to 7 days; a value of 0 specifies that
the maximum timeout value is enforced. The default is 2 days.
SnowPro Core Certification
New version of the SnowPro Core Certification Exam, released early September, 2022
The SnowPro ™ Core Certification is designed for individuals who would like to demonstrate their
knowledge of the Snowflake Cloud Data Platform. The candidate has a thorough knowledge of:
Exam Format
Languages: English
Unscored Content: Exams may include unscored items to gather statistical information.
These items are not identified on the form and do not affect your score, and additional time
is factored in to account for this content.
This exam guide includes test domains, weightings, and objectives. It is not a comprehensive listing
of all the content that will be presented on this examination. The table below lists the main content
domains and their weighting ranges.
Target Audience:
We recommend that individuals have at least 6 months of knowledge using the Snowflake
Platform prior to attempting this exam. Familiarity with basic ANSI SQL is recommended.
• Solution Architects
• Data Engineers
• Snowflake Account Administrators
• Database Administrators
• Data Scientists
• Data Analysts
• Application Developers
SNOWFLAKE OVERVIEW
Snowflake Overview
Getting Started with Snowflake is a resource of 12 modules designed to help you get
familiar with Snowflake. We recommend you complete all of the modules.
Additional Assets
Quick Tour of the Web Interface (Document + Video)
2.2 Define the entities and roles that are used in Snowflake.
● Outline how privileges can be granted and revoked
● Explain role hierarchy and privilege inheritance
2.3 Outline data governance capabilities in Snowflake.
● Accounts
● Organizations
● Databases
● Secure views
● Information schemas
● Access history and read support
Additional Assets
Crucial Security Controls for Your Cloud Data Warehouse (Video)
Quickly Visualize Snowflake’s Roles, Grants and Privileges (Article)
Snowflake Security Overview (Video)
Additional Assets
Accelerating BI Queries with Caching in Snowflake (Video)
Caching in Snowflake Data Warehouse (Article)
How to: Understand Result Caching (Article)
Managing Snowflake’s Compute Resources (Blog)
Performance Impact from Local and Remote Disk Spilling (Article)
Search Optimization: When & How to Use (Article)
Snowflake Materialized Views: A Fast, Zero-Maintenance Accurate Solution (Blog)
Snowflake Workloads Explained: Data Warehouse (Video)
Tackling High Concurrency with Multi-Cluster Warehouses (Video)
Tuning Snowflake (Article)
Using Materialized Views to Solve Multi-Clustering Performance Problems (Article)
4.2 Outline different commands used to load data and when they should be used.
● CREATE PIPE
● COPY INTO
● GET
● INSERT/INSERT OVERWRITE
● PUT
● STREAM
● TASK
● VALIDATE
4.3 Define concepts and best practices that should be considered when unloading data.
● File formats
● Empty strings and NULL values
● Unloading to a single file
● Unloading relational tables
4.4 Outline the different commands used to unload data and when they should be used.
● LIST
● COPY INTO
● CREATE FILE FORMAT
● CREATE FILE FORMAT … CLONE
● ALTER FILE FORMAT
● DROP FILE FORMAT
● DESCRIBE FILE FORMAT
● SHOW FILE FORMAT
Domain 4.0: Data Loading and Unloading Study Resources
Snowflake University On Demand Trainings
Snowflake University, Level Up: Data Loading Badge 1:
Data Warehousing Workshop
Additional Assets
Best Practices for Data Unloading (Article)
Best Practices for Using Tableau with Snowflake (White Paper, requires email for access)
Building and Deploying Continuous Data Pipelines (Video)
Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
How to Load Terabytes into Snowflake - Speeds, Feeds and Techniques (Blog)
Additional Assets
Best Practices for Managing Unstructured Data (E-book)
Easily Loading and Analyzing Semi-Structured Data in Snowflake (Video)
Structured vs Unstructured vs Semi-Structured Data (Blog)
Understanding Unstructured Data With Language Models (Blog)
Additional Assets
Data Protection with Time Travel in Snowflake (Video)
Getting Started on Snowflake with Partner Connect (Video)
Meta Data Archiving with Snowflake (Article)
Snowflake Continuous Data Protection (White Paper)
Top 10 Cool Snowflake Features, #7: Snowflake Fast Clone (Blog + Video)