Snowflake Certification
SESSIONS
Data Cloud Tour for Financial Services
Data Cloud Tour for Telecommunications
Data Cloud Tour for Manufacturing
Data Cloud Tour for Retail & CPG
Data Cloud Tour for Public Sector
Data Cloud Tour for Media & Entertainment
Data Cloud Tour - Universal Session
--------------------------------------
• The “story” of this lab is based on the analytics team at Citi Bike, a real, citywide bike-share system in New York City, USA. This team wants to run analytics on its data to better understand its riders and how to serve them best. We will first load structured .csv data from rider transactions into Snowflake. This comes from Citi Bike's internal transactional systems. Later we will load open-source, semi-structured JSON weather data into Snowflake to see whether there is any correlation between the number of bike rides and the weather.
There are many ways to get data into Snowflake from many locations, including:
• the COPY command
• Snowpipe auto-ingestion
• an external connector
• a third-party ETL/ELT product
We are using the COPY command and S3 storage for this module in a manual process so you can see and learn from the steps involved. In the real world, a customer would more likely use an automated process or an ETL product to make the data loading process fully automated and much easier. (A sketch of the COPY pattern is shown below.)
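A minimal sketch of the manual COPY pattern described above; the stage name, bucket URL, table, and file format settings are illustrative placeholders, not the lab's actual objects:

    -- Hypothetical external stage over the pre-staged S3 bucket
    CREATE OR REPLACE STAGE citibike_trips_stage
      URL = 's3://example-bucket/citibike/trips/'
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

    -- Bulk load the staged .csv files into a target table
    COPY INTO trips
      FROM @citibike_trips_stage
      ON_ERROR = 'ABORT_STATEMENT';   -- default ON_ERROR for bulk loading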
Data
The data we will be using is bike share data provided by Citi Bike NYC. The data has been exported and pre-staged for you in an Amazon AWS S3 bucket in the US-EAST region. The data consists of information about trip times, locations, user type, gender, age of riders, etc. On AWS S3, the data represents 61.5M rows, 377 objects, and 1.9GB total size compressed.
DDL operations are free! Note that all the DDL operations we have done so far do NOT require compute resources, so we can create all our objects for free.
5.1 - Snowflake has a result cache that holds the results of every query executed in the past 24 hours. These results are available across warehouses, so query results returned to one user are available to any other user on the system who executes the same query, provided the underlying data has not changed. Not only do these repeated queries return extremely fast, they also use no compute credits.
5.2 - Snowflake allows you to create clones, also known as “zero-copy clones,” of tables, schemas, and databases in seconds. A snapshot of the data present in the source object is taken when the clone is created and is made available to the cloned object. The cloned object is writable and is independent of the clone source; that is, changes made to either the source object or the clone object are not part of the other. A popular use case for zero-copy cloning is to clone a production environment for Development & Testing to experiment on, without adversely impacting the production environment and without having to set up and manage two separate environments for production and Development & Testing.
5.3 - A massive benefit is that the underlying data is not copied; only the metadata/pointers to the underlying data change. Hence “zero-copy,” and storage requirements are not doubled when data is cloned. Most data warehouses cannot do this; for Snowflake it is easy!
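A minimal sketch of the zero-copy cloning pattern described in 5.2-5.3; the object names are illustrative placeholders:

    -- Clone a production database for dev/test in seconds; no data is physically copied
    CREATE DATABASE citibike_dev CLONE citibike_prod;

    -- Clones can also be taken at the schema or table level
    CREATE TABLE trips_backup CLONE trips;

Changes made to citibike_dev after this point are independent of citibike_prod, and storage is only consumed for data that subsequently diverges.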
6.1. Snowflake can easily load and query semi-structured data, such as JSON, Parquet, or Avro, without transformation. This is important because an increasing amount of business-relevant data being generated today is semi-structured, and many traditional data warehouses cannot easily load and query this sort of data.
○ Snowflake’s VARIANT data type allows Snowflake to ingest semi-structured data without having to pre-define the schema.
A view allows the result of a query to be accessed as if it were a table. Views can help you: present data to end users in a cleaner manner (like in this lab, where we will present “ugly” JSON in a columnar format), limit what end users can view in a source table for privacy/security reasons, or write more modular SQL. There are also materialized views, in which the SQL results are stored, almost as though the results were a table. This allows faster access, but requires storage space. Materialized views require Snowflake Enterprise Edition or higher.
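A minimal sketch of the two view types mentioned above; the view, table, and column names are illustrative placeholders:

    -- Standard view: presents raw JSON in a clean, columnar shape
    CREATE VIEW weather_vw AS
      SELECT v:time::timestamp AS observation_time,
             v:temp::float     AS temperature
      FROM weather_json;

    -- Materialized view: results are stored for faster access (Enterprise Edition or higher)
    CREATE MATERIALIZED VIEW daily_ride_counts AS
      SELECT start_date, COUNT(*) AS rides
      FROM trips
      GROUP BY start_date;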
Snowflake’s Time Travel capability enables accessing historical data at any point within a pre-configurable period of time. The default period of time is 24 hours, and with Snowflake Enterprise Edition it can be up to 90 days.
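A minimal sketch of querying and restoring historical data with Time Travel; the table name, offset, and statement ID are illustrative placeholders:

    -- Query the table as it looked 5 minutes ago
    SELECT * FROM trips AT(OFFSET => -60*5);

    -- Query the table as of just before a specific statement ran (ID is a placeholder)
    SELECT * FROM trips BEFORE(STATEMENT => '01a2b3c4-0000-0000-0000-000000000000');

    -- Restore an accidentally dropped table within the retention period
    UNDROP TABLE trips;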
In this module we will show some aspects of Snowflake role-based access control (RBAC), including creating a new role and granting it specific permissions. We will also cover the ACCOUNTADMIN (aka Account Administrator) role.
• Snowflake enables account-to-account sharing of data through shares, which are created by data providers and “imported” by data consumers, either through their own Snowflake account or a provisioned Snowflake Reader account. The consumer could be an external entity/partner, or a different internal business unit that is required to have its own, unique Snowflake account.
Note - Data Sharing is currently only supported between accounts in the same Snowflake provider and region.
• Resources
○ SnowPro Core Certification FAQs
○ Sample test
Snowflake Objects
• All Snowflake objects reside within logical containers, with the top-level container being the Snowflake account.
• All objects are individually securable.
• Users perform operations on objects using privileges that are granted to roles.
• Sample Privileges
○ Create a virtual Warehouse
○ List Tables in Schema
○ Insert data into a table
○ Select data from table.
• Role-based access control (RBAC) - see the grant sketch below the link.
https://docs.snowflake.com/en/user-guide/security-access-control-overview.html#securable-objects
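A minimal sketch of the privileges-to-roles-to-users pattern listed above; the role, warehouse, database, and user names are illustrative placeholders:

    CREATE ROLE analyst;

    -- Sample privileges granted to the role
    GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst;
    GRANT USAGE ON DATABASE citibike TO ROLE analyst;
    GRANT USAGE ON SCHEMA citibike.public TO ROLE analyst;
    GRANT SELECT, INSERT ON TABLE citibike.public.trips TO ROLE analyst;

    -- The role (not the privilege) is granted to the user
    GRANT ROLE analyst TO USER jane_doe;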
Table Types
• Permanent
○ Persist until dropped
○ Designed for data that requires the highest level of data protection and recovery
○ Default table type
○ Time travel - Up to 90 days with Enterprise
○ Fail Safe - Yes
• Temporary
○ Persist for, and are tied to, a session (think single user)
○ Used for transitory data (for example, ETL/ELT)
○ Time Travel - 0 or 1 days
○ Fail Safe - No
Tied to a single login session; accessible only from within that session.
• Transient
○ Persist until dropped
○ Multiple User
○ Used for data that needs to persist, but does not need the same level of data retention as a permanent table
○ Time Travel - 0 or 1 days
○ Fail Safe - no
○ Applicable to database, schema and table
• External
○ Persist until removed
○ Snowflake over an external datalake
○ Data accessed via an external stage
○ Read-only
○ Time Travel - No
○ Fail Safe - No
• Fail Safe
○ A safety-net backup feature. It allows admins to recover and restore data for 7 days after the Time Travel period has expired.
○ Fail-safe recovery is only possible by contacting Snowflake Technical Support.
○ Only permanent tables support it.
○ You are not paying for Fail-safe backup storage if your table is not permanent.
• Time Travel
○ It allows you to query data historically.
○ A customer, admin, or user can define the Time Travel retention period for a table, a schema, or an entire database.
○ In Enterprise Edition and above, you can have up to 90 days of Time Travel.
○ In Standard Edition you can only have 1 day of Time Travel.
○ From a user perspective
▪ You can query the historical data up to that period of time.
▪ You may run a query against that data and see what it looked like 1 day, 2 days, or n days ago.
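A minimal sketch of creating the table types listed above and setting a Time Travel retention period; the table names and retention values are illustrative placeholders:

    -- Permanent table (default type): up to 90 days of Time Travel on Enterprise Edition, plus Fail-safe
    CREATE TABLE trips_perm (ride_id INT) DATA_RETENTION_TIME_IN_DAYS = 90;

    -- Transient table: 0 or 1 day of Time Travel, no Fail-safe
    CREATE TRANSIENT TABLE trips_staging (ride_id INT);

    -- Temporary table: tied to the current session, no Fail-safe
    CREATE TEMPORARY TABLE trips_tmp (ride_id INT);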
Views
Management
Optimizer services
• SQL Optimizer - Cost based optimizer
• Automatic join order optimization
○ No user input or tuning required.
• Automatic statistics gathering.
• Pruning using metadata about micro-partitions
Security
• Authentication
• Access control for Roles and users
• Access control for shares
• Encryption and Key management.
Metadata Management
• All statistics and metadata are managed and kept automatically up-to-date by the cloud services layer.
• Stores metadata as data is loaded into the system.
• Handles queries that can be processed completely from metadata.
• Used for time travel and cloning
• Every aspect of Snowflake architecture leverages metadata.
• Stored procedures can currently be written in SQL, Python, Java, JavaScript, and Scala.
• The file formats that can be used to load data are CSV, XML, JSON, AVRO, ORC, and PARQUET.
• Databases
○ Schemas
▪ Database Objects
□ Tables
Architectures
○ Shared disk
○ Shared nothing
○ Multi-cluster, shared data architecture
▪ Hybrid architecture
□ Shared disk - common central storage
□ Shared nothing - uses MPP (massively parallel processing) compute clusters
Each node stores a portion of the data locally.
12. Pricing
Compute Cost
• Active warehouses
○ You are only paying for active (running) warehouses.
○ You are not paying for suspended warehouses.
Data Transfer
• SHOW TABLES - general statistics about table storage and table properties: SHOW TABLES;
• TABLE_STORAGE_METRICS view in INFORMATION_SCHEMA - more detailed information about the amount of storage for active data, Time Travel, and Fail-safe: SELECT * FROM DB_NAME.INFORMATION_SCHEMA.TABLE_STORAGE_METRICS;
• You can see the detailed data storage at the table level in the Admin --> Account Admin --> Usage view.
• Definition - resource monitors are objects that we can use to control and monitor the credit usage of both warehouses and our entire account. -- All editions.
• Set Credit Quota - set for a defined cycle; a limit that will reset every month.
• Can be set at the account level or on individual warehouses, as well as on a group of warehouses.
• We have three types of actions, and we can specify at what percentage of the quota each action should happen (see the sketch below):
• Notify - when a given % of the credit quota is used.
• Suspend and notify - when a given % of the quota is used; in this case it will complete existing queries and only then suspend.
• Suspend immediately and notify - when a given % of the quota is used; this will abort the current queries.
• We can use percentages that are above 100%.
• These can only be created by users that have the ACCOUNTADMIN role.
• The ACCOUNTADMIN can delegate some tasks by granting the MONITOR and MODIFY privileges on specific resource monitors.
• They can also be used to track the usage of the cloud services that are needed to execute certain queries related to a virtual warehouse.
○ So our resource monitor will suspend the virtual warehouse when the limit is reached, but it cannot entirely prevent the usage of the cloud services needed to run some basic steps for the warehouse.
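A minimal sketch of the resource monitor setup described above; the monitor and warehouse names, quota, and thresholds are illustrative placeholders:

    USE ROLE ACCOUNTADMIN;

    CREATE RESOURCE MONITOR monthly_limit
      WITH CREDIT_QUOTA = 100
           FREQUENCY = MONTHLY
           START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 75  PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND            -- lets running queries finish
               ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- aborts running queries

    -- Attach the monitor to a warehouse (or set it at the account level)
    ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_limit;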
• Multi-Cluster Warehouses
○ This is very important when you have a lot of concurrent users.
○ So when we have a huge load,
▪ a single warehouse might not be able to cater to all the queries,
▪ so there will be a queue of queries to be executed,
▪ which get executed one by one.
▪ This is very bad for the users, as they need to wait.
▪ So when we have a large number of queries to be processed, we group multiple compute nodes/warehouses together into one multi-cluster warehouse.
▪ This is somewhere we can use autoscaling.
▪ Snowflake will automatically detect when there is enough workload to add additional clusters to our warehouse.
○ This is especially good for more concurrent users.
○ This is not the ideal solution when we have a more complex workload.
○ For example, for some ML use cases where the data size is huge and the query is complex, a bigger warehouse will make much more sense.
○ This means we will
▪ Scale up
▪ Rather than scaling out.
▪ Scale up - move to a bigger warehouse
▪ Scale out - add multiple same-size warehouses/clusters to the existing compute capacity
○ Two modes
1. Maximized - always have the same number of clusters; no difference between minimum and maximum. For static workloads with the same number of users.
2. Auto-scale - specify different minimum and maximum cluster counts. For dynamic workloads with varying numbers of users.
How does this auto-scale mode work, i.e., at what point does Snowflake decide to add an additional cluster? (See the warehouse sketch below.)
• Query Acceleration - if you have unexpected workloads, they can be handled by query acceleration, i.e., additional compute managed by Snowflake itself. Basically, when you enable query acceleration you are telling Snowflake that there could be an unexpected workload, so please be ready to provide some additional, scalable compute (which Snowflake manages itself) if needed.
• Maximized
This mode is enabled by specifying the same value for both the maximum and minimum number of clusters (note that the specified value must be larger than 1). In this mode, when the warehouse is started, Snowflake starts all the clusters so that maximum resources are available while the warehouse is running.
This mode is effective for statically controlling the available compute resources, particularly if you have large numbers of concurrent user sessions and/or queries and the numbers do not fluctuate significantly.
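A minimal sketch of the two multi-cluster modes described above; the warehouse names and sizes are illustrative placeholders:

    -- Auto-scale mode: min and max differ; Snowflake adds/removes clusters with the load
    CREATE WAREHOUSE bi_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = STANDARD;

    -- Maximized mode: min = max (> 1), so all clusters start with the warehouse
    CREATE WAREHOUSE dashboard_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 3
      MAX_CLUSTER_COUNT = 3;

    -- Scaling up (a bigger warehouse) is a separate, manual change
    ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE';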
20. SnowSQL
Stages
• Internal Stage
○ Snowflake managed
○ Cloud provider storage
○ You can't manage the underlying storage yourself.
• External Stages
○ AWS, GCP or Azure
Internal Stages
• The COPY INTO command is used to load data from an internal stage into Snowflake tables.
• We can use unloading as well
○ Create a file out of an internal table and then move it into a stage.
• There are three different types (see the PUT/COPY sketch below)
▪ User Stages
▪ Tied to a single user
▪ Cannot be accessed by other users
▪ Every user has a default user stage
▪ They cannot be altered or dropped
▪ PUT files into the stage before loading.
▪ Explicitly remove the files again afterwards.
▪ Can load data into multiple tables
▪ Referred to using @~ (the at sign and tilde)
▪ Table Stages
▪ Automatically created with a table
▪ Can only be accessed by the associated table
▪ Cannot be altered or dropped
▪ Load into that one table only.
▪ Remove the files once they are loaded.
▪ Referred to with '@%TABLE_NAME'
▪ Named Internal Stages
▪ Created explicitly as database objects; can be accessed by multiple users and used to load multiple tables
▪ Referred to with '@STAGE_NAME'
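A minimal sketch of loading through the internal stages above using SnowSQL; the file path and table name are illustrative placeholders:

    -- Upload a local file to the user stage (@~) from SnowSQL
    PUT file:///tmp/trips.csv @~;

    -- Or upload it to the table stage of the TRIPS table
    PUT file:///tmp/trips.csv @%trips;

    -- Load from the table stage into the table, then clean up the staged file
    COPY INTO trips FROM @%trips FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
    REMOVE @%trips PATTERN = '.*trips.csv.gz';  -- PUT auto-compresses to .gz by default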
26. File_Format
• Whenever you create a stage, there are some default properties that get set, including the file_format, with CSV as the default.
• It is not recommended to do it without specifying the file_format parameter. You should always give the file_format parameter.
• Step 4 - You can override the stage's file format by specifying the file_format again in the COPY INTO command.
• ** If you don't specify skip_header while creating the file_format, it will be set to 0 by default, which means the header row will also be read as data.
• ** We can change a file_format's properties, but we cannot change its type.
▪ A CSV-type file_format is different from a JSON-type file_format.
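A minimal sketch of a named file format and how it interacts with stage and COPY settings; the names are illustrative placeholders:

    CREATE FILE FORMAT csv_ff
      TYPE = CSV
      SKIP_HEADER = 1
      FIELD_OPTIONALLY_ENCLOSED_BY = '"';

    -- The stage can reference the file format...
    CREATE STAGE trips_stage FILE_FORMAT = csv_ff;

    -- ...and COPY INTO can override it again if needed
    COPY INTO trips FROM @trips_stage FILE_FORMAT = (FORMAT_NAME = 'csv_ff');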
• INSERT OVERWRITE - It will truncate the table first and then insert the new rows.
▪ A storage integration stores the access information for external cloud storage, so for example for an Azure container.
30. Snowpipe
• ** The pipe definition should contain the COPY statement; the data will be loaded according to that COPY statement (see the sketch below).
• We can also have the stage defined in the COPY statement, which will load the data based on a file format object and our storage integration.
• ** A pipe can be paused and resumed. This is often done to alter the ownership of the pipe.
• ** In general, the copy options will be mentioned in the COPY command, but they can actually also be set as properties of our stage object.
• If we don't specify a property value, it will take the property default. For example, ON_ERROR is ABORT_STATEMENT by default.
• ** When creating the stage, we can give a few copy options.
• ** SKIP_FILE_<num> - e.g. SKIP_FILE_10: skip a file when the number of errors found in it reaches 10 or more.
• ** SKIP_FILE_<num>% - the same, but based on the percentage of error rows.
• ** The default ON_ERROR value for bulk loading is ABORT_STATEMENT.
• ** What's the difference between Snowpipe and bulk loading?
▪ SKIP_FILE is the default ON_ERROR for Snowpipe, whereas ABORT_STATEMENT is the default for bulk loading.
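A minimal sketch of a Snowpipe definition wrapping a COPY statement, as described above; the pipe, stage, and table names are illustrative placeholders:

    CREATE PIPE trips_pipe
      AUTO_INGEST = TRUE          -- load automatically on cloud storage event notifications
      AS
      COPY INTO trips
        FROM @trips_ext_stage
        FILE_FORMAT = (FORMAT_NAME = 'csv_ff');  -- ON_ERROR defaults to SKIP_FILE for pipes

    -- Pause/resume, e.g. before transferring ownership of the pipe
    ALTER PIPE trips_pipe SET PIPE_EXECUTION_PAUSED = TRUE;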
• VALIDATION_MODE - this will just validate the data in the COPY command, instead of actually loading it.
• ** RETURN_<n>_ROWS - e.g. RETURN_5_ROWS: validates <n> rows (returns errors or rows).
▪ This would validate the first five rows and actually return those five rows. If there are no errors, all of the rows will be returned.
▪ If there are errors, only the first error encountered will be returned.
34. Validate
• This function validates the files that were loaded in the previous execution of the COPY command.
• All the errors that were encountered will then be displayed.
• The default setting is ON_ERROR = ABORT_STATEMENT.
▪ If the default setting was used, you won't get any output.
• Here, "_last" is a dynamic parameter used to identify the most recent load in the history (see the sketch below).
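A minimal sketch of the validation options above; the table and stage names are illustrative placeholders:

    -- Dry run: validate the first 5 rows without loading anything
    COPY INTO trips FROM @trips_stage
      VALIDATION_MODE = RETURN_5_ROWS;

    -- Inspect the errors from the most recent COPY execution into this table
    SELECT * FROM TABLE(VALIDATE(trips, JOB_ID => '_last'));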
35. Unloading
• The same COPY command is used, only that rather than copying into a table, we copy into a stage.
• This is basically copying from a table into a file that will be stored in a stage (see the sketch below).
• Important parameters
▪ Default - by default the output will be split into multiple files based on file size.
▪ This is because SINGLE = FALSE is the default setting.
▪ MAX_FILE_SIZE is the parameter used to indicate the size at which a new file is started.
▪ So, for example, if it is 16 MB, then each output file will be at most roughly 16 MB.
▪ The default is 16 MB; it can be increased up to 5 GB.
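A minimal sketch of unloading a table into a stage, as described above; the stage path and table name are illustrative placeholders:

    COPY INTO @trips_stage/unload/trips_
      FROM trips
      FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
      SINGLE = FALSE              -- default: split the output into multiple files
      MAX_FILE_SIZE = 16777216;   -- ~16 MB per file (the default)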
• In general there are a lot of functions available when we want to query data.
• There are scalar functions, which return one value per row.
• Then we have aggregate functions, such as MAX and MIN, i.e., one value per grouping.
• Then we have table functions - typically we use these functions to obtain information about Snowflake features.
• Table functions return a table, allowing them to be used in the FROM clause of a SELECT statement. These functions can be used for various purposes, including
generating result sets that can be queried like a regular table.
• Example: FLATTEN() function takes an array or a variant column and produces a lateral view (i.e., a virtual table) of the elements.
• If I want to find out the number of distinct rows and I am OK with roughly 1.6% error, I can use an approximate function (e.g. APPROX_COUNT_DISTINCT) to get the number more quickly.
Let's imagine a table where people vote for their favorite fruit. This table has many rows, each representing a vote, and let's say there are 10 distinct fruits people
voted for, but the table has thousands of votes in total.
If you set counters to 5 when using APPROX_TOP_K, the algorithm doesn't randomly pick 5 fruits to track. It starts with an empty list and adds fruits as it encounters
votes for them. If more than 5 distinct fruits are found, it tracks the ones with the highest counts so far, potentially replacing less frequent ones as it processes more
data. With only 5 counters for 10 fruits, it's trying to keep the most frequent ones in its limited "memory" of 5, but it might miss some of the actual top fruits due to this
limit.
If your table has 1000 rows with votes for 10 different fruits and you use APPROX_TOP_K with counters set to 5, the algorithm will scan all the rows. It attempts to track the
most frequently voted fruits within its capacity to track 5 distinct fruits at any time. Through this process, it aims to estimate the top 5 fruits based on the highest counts, even
though it's only keeping track of 5 fruits at a time during the scan.
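A minimal sketch of the approximate functions discussed above, using a hypothetical votes table with a fruit column:

    -- Approximate distinct count (HyperLogLog-based, ~1.6% typical error)
    SELECT APPROX_COUNT_DISTINCT(fruit) FROM votes;

    -- Approximate top 5 most frequent fruits
    SELECT APPROX_TOP_K(fruit, 5) FROM votes;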
• Percentile Values
39. UDF's
• Default is owner.
• Secure UDFs & procedures - marking them secure means unwanted users won't be able to see the definition or infer the underlying data through the query optimizer.
• This is enabled using the SECURE keyword.
• ** How is the security related information of an external function stored? - Via an API integration.
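A minimal sketch of a secure SQL UDF, as described above; the function name and logic are illustrative placeholders:

    CREATE OR REPLACE SECURE FUNCTION masked_email(email STRING)
      RETURNS STRING
      AS 'CONCAT(LEFT(email, 2), ''****'')';

    -- Roles other than the owner cannot see the function body (e.g. via GET_DDL)
    SELECT masked_email('rider@example.com');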
• Supported Formats
○ JSON
○ XML
○ Parquet
○ ORC
○ AVRO
• Data Types in Snowflake to manage semi-structured data
○ Object
○ Array
○ Variant
• So you need to reference the nested structure within the SQL with a colon.
• Query - Select row_column:courses from Variant_table
• Query - Select $1:courses from variant_table
• In the above image, we can also see how to choose an element within the hierarchy.
• Here the information will be fetched from the array under the formats key.
• *** Array positions start at 0, then 1, and so on.
• If the response is in double quotes, the value can be cast to a VARCHAR data type by using two colons (::VARCHAR).
• Within semi-structured data you have hierarchies, and we need to flatten those hierarchies.
• So the FLATTEN function is used to convert semi-structured data into a relational table view.
• It's a table function (see the sketch below).
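A minimal sketch of FLATTEN over the VARIANT examples above; the table and column names are illustrative placeholders:

    -- One output row per element of the 'courses' array
    SELECT f.value::STRING AS course
    FROM variant_table,
         LATERAL FLATTEN(input => row_column:courses) f;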
50. Unstructured Data
• Snowflake supports URLs for files such as images and videos; that is how Snowflake accesses the files.
• The same URLs are needed to share files with other users.
• The URLs can be used for both internal and external stages.
• There are multiple URL types
○ Scoped URL
▪ Encoded, Snowflake-hosted URL with temporary access to a file.
▪ We can access the file without granting access to the entire stage.
▪ The URL expires when the persisted query results period ends.
▪ This means when the results cache expires, which is currently 24 hours.
▪ We can generate it by using the SQL file function BUILD_SCOPED_FILE_URL.
○ File URL
▪ Permanent access to the file that we have specified.
▪ This is for when we want to give permanent access to the file; it doesn't expire.
▪ Generated by using BUILD_STAGE_FILE_URL.
○ Pre-Signed URL
▪ An HTTPS URL that can be used directly to access the file via a web browser, with an expiration time.
▪ The expiration time can be passed as a parameter to the function.
▪ Generated by using GET_PRESIGNED_URL.
• All of these URLs are generated by calling the corresponding SQL file functions.
• When each of the functions is called, it returns the generated URL.
• We can add an optional expiration parameter in seconds, after which the URL expires (see the sketch below).
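A minimal sketch of the three file-URL functions above; the stage name and file path are illustrative placeholders:

    -- Temporary, scoped access (expires with the persisted query results)
    SELECT BUILD_SCOPED_FILE_URL(@img_stage, 'photos/bike.jpg');

    -- Permanent file URL (access still governed by stage privileges)
    SELECT BUILD_STAGE_FILE_URL(@img_stage, 'photos/bike.jpg');

    -- Pre-signed HTTPS URL, valid for 3600 seconds
    SELECT GET_PRESIGNED_URL(@img_stage, 'photos/bike.jpg', 3600);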
53. Tasks
• Used to schedule a SQL statement or stored procedure.
• Often combined with streams to set up continuous ETL workflows (a combined sketch follows the Streams section below).
54. Streams
• A stream is an object that can be used to record data manipulation language (DML) changes.
○ It's similar to GoldenGate.
○ Used in ETL to track DML changes.
• Staleness
○ A stream becomes stale when its offset falls outside the retention period of the source table.
○ For example, if the retention period is set to 7 days, Snowflake can by default extend it to up to 14 days for the stream; data older than that can't be consumed.
○ Maximum retention period is 90 days.
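A minimal sketch of the stream-plus-task pattern referenced in the Tasks section above; the object names, columns, and schedule are illustrative placeholders:

    -- Stream records DML changes (inserts/updates/deletes) on the source table
    CREATE STREAM trips_stream ON TABLE trips;

    -- Task periodically consumes the stream and loads a downstream table
    CREATE TASK load_trip_changes
      WAREHOUSE = etl_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('TRIPS_STREAM')
      AS
      INSERT INTO trips_history SELECT ride_id, start_time, end_time FROM trips_stream;

    ALTER TASK load_trip_changes RESUME;   -- tasks are created in a suspended state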
• Snowsight - WebUI
• SnowSQL - Command line tool
• Drivers - PHP, JDBC, Go, .NET, Node.js, etc.
• Connectors - Python, Spark, Kafka
• Partner Connect
58. Snowpark
• Snowpark is a set of libraries and runtimes within Snowflake that enables you to securely deploy and process data using various programming languages like Python, Java, and Scala.
○ Considerations
• Number of days for which historical data is preserved and Time travel can be applied.
• Its default is 1 day. This is set at the account level.
• If there is any data that is not possible to recover via Time Travel, we can fall back to Fail-safe.
○ Fail-safe is always 7 days and is non-configurable.
○ For permanent tables it is 7 days.
• Fail-safe covers 7 days beyond the Time Travel period.
• We need to reach out to Snowflake for restoration purposes.
• There is no charge for the Time Travel or Fail-safe features themselves, but there is a cost for the storage they consume.
• So if out of 1,000 rows only 10 rows got changed, only those rows will be charged for, not the rest of them.
• So only the amount of modified data counts toward Time Travel storage.
• Permanent
• Transient
○ They do not have Fail-safe.
• Temporary tables
○ 0 or 1 day of retention period.
○ Do not support Fail-safe.
• *** Dynamic data masking is where the data is masked at run time, not at the database (storage) level.
• A row access policy is always defined to return a Boolean, as it will always return true or false (see the sketch below).
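A minimal sketch of a dynamic masking policy and a row access policy, as described above; the policy, role, table, and column names are illustrative placeholders:

    -- Mask at query time unless the querying role is allowed to see the raw value
    CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN val ELSE '*** MASKED ***' END;

    ALTER TABLE riders MODIFY COLUMN email SET MASKING POLICY email_mask;

    -- Row access policy: must return a Boolean per row
    CREATE ROW ACCESS POLICY region_rap AS (region STRING) RETURNS BOOLEAN ->
      CURRENT_ROLE() = 'ACCOUNTADMIN' OR region = 'NYC';

    ALTER TABLE trips ADD ROW ACCESS POLICY region_rap ON (region);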
• Object metadata
• Historical Usage data
Account Usage
89. Caching
• Micro-partitions are the way data is stored at the storage layer, which sits with the external cloud providers.
• Each micro-partition contains around 50 to 500 MB of uncompressed data.
• Micro-partitions enable partition pruning.
○ Partition pruning means that when a query is executed, all the unnecessary partitions are skipped.
• How to determine which columns and which tables can benefit from clustering keys?
• There are two functions that are used to report clustering information and depth:
○ System$clustering_information
○ System$clustering_depth
• Search Optimization Service - a serverless feature.
• Only available in Enterprise Edition or higher.
• Enabled per table: ALTER TABLE table_name ADD SEARCH OPTIMIZATION;
• Requires ownership privileges on the table.
• *** A database clone will not inherit the privileges, but all of its child objects do. So the child objects' privileges are inherited, but not the database's own privileges.
• *** The same holds for schemas: schema privileges are not inherited, but the privileges on all the objects within the schema are.
• Data can be shared without actually making a copy of the data.
• *** A best practice is to use secure views to make sure that confidential data is not shared by mistake.
• *** A normal view can't be shared, so you always need to create a secure view to avoid accidentally sharing data you didn't intend to (see the share sketch below).
• To share, the consumer account needs to be in the same region and on the same cloud provider.
○ If it is in a different region or on a different cloud provider, replication needs to be enabled between the regions or cloud providers.
• The difference with replication is that the data is actually extracted and copied over to the other account.
• The data needs to be periodically synchronized.
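A minimal sketch of a provider-side share using a secure view, as described above; the share, database, view, and account identifiers are illustrative placeholders:

    CREATE SHARE citibike_share;

    GRANT USAGE ON DATABASE citibike TO SHARE citibike_share;
    GRANT USAGE ON SCHEMA citibike.public TO SHARE citibike_share;
    GRANT SELECT ON VIEW citibike.public.trips_secure_vw TO SHARE citibike_share;

    -- Add the consumer account (same cloud and region unless replication is set up)
    ALTER SHARE citibike_share ADD ACCOUNTS = partner_org.partner_account;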
• 4 key concepts
○ Securable object - an entity to which access can be granted
○ Privilege - granted to a role
○ Role - granted to other roles or to users
○ User - an identity that will log in to the account
74. Roles
• *** A privilege is always granted to a role, and the role is then granted to the user.
• Account Admin (ACCOUNTADMIN)
• Security Admin (SECURITYADMIN)
• User Admin (USERADMIN)
• Sysadmin (SYSADMIN)
• Public (PUBLIC)
• Custom Roles
• https://www.linkedin.com/pulse/how-crack-snowpro-advanced-architect-exam-ruchi-soni/