Wednesday, 10 September 2025
Snowflake - Cost Optimization
- Reduce auto-suspend to 60 seconds
- Reduce virtual warehouse size
- Ensure minimum clusters are set to 1
- Consolidate warehouses
- Separate warehouses by workload and requirement, not by domain
- Reduce query frequency
- At many organizations, batch data transformation jobs run hourly by default. But do downstream use cases need such low latency? Check with the business before setting the frequency.
- Only process new or updated data
- Ensure tables are clustered correctly
- Drop unused tables
- Lower data retention
- The Time Travel (data retention) setting can add storage cost, since Snowflake must keep copies of all modifications and changes made to a table over the retention period.
- Use transient tables
- Avoid frequent DML operations
- Ensure files are optimally sized
- To keep data loading cost effective, a best practice is to keep your files around 100-250 MB.
- To demonstrate the effect:
- If we only have one 1 GB file, we saturate just 1 of 16 threads on a Small warehouse used for loading.
- If you instead split this file into ten files of 100 MB each, you utilize 10 of 16 threads. This level of parallelization is much better, as it leads to better utilization of the given compute resources.
- Leverage access control
- Enable query timeouts
- Configure resource monitors
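A hedged SQL sketch of a few of these settings (warehouse, monitor, and table names are illustrative):
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;               -- suspend after 60 seconds idle
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';       -- reduce warehouse size
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1;           -- minimum clusters
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 1;  -- lower Time Travel retention
ALTER ACCOUNT SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;     -- query timeout
CREATE RESOURCE MONITOR my_monitor WITH CREDIT_QUOTA = 100
    TRIGGERS ON 90 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;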
Tuesday, 2 September 2025
Kafka - Topics, Partitions & Offset
KAFKA - EVENT PROCESSING SYSTEM
- No need to wait for a response
- Fire and Forget
- Real time processing (Streams)
- High throughput & Low latency
Topics
- Particular stream of data
- Can be identified by name
e.g. Tables in a database
- Supports all types of messages
- The sequence of messages is called a data stream
- You cannot query topics; instead, use Kafka producers to send data and Kafka consumers to read it
- Kafka topics are immutable: once data is written to a partition, it cannot be changed
- Data is kept for a limited time (default is one week - configurable)
Partitions
- Topics are split into partitions
- Messages within each partition are ordered
Offset
- Each message within a partition gets an incremental ID, called an offset
Producers
- Write data to topics
- Producers know which partition to write to
Kafka Connect
- Gets data in and out of Kafka
Step-by-Step to Start Kafka
- Step 1: Start ZooKeeper
- This will keep running in the terminal; open a new terminal window for the next step
- Step 2: Start Kafka Server (Broker)
- Step 3: Create a Kafka Topic
- Step 4: Start Producer
- Type messages here to send to Kafka.
- Step 5: Start Consumer (in a new terminal)
- You will see the messages you type in the producer appear here.
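A minimal command sketch for the steps above, assuming a local Kafka distribution that still uses ZooKeeper (run from the Kafka install directory; the topic name is illustrative):
# Step 1: Start ZooKeeper (keep this terminal running)
bin/zookeeper-server-start.sh config/zookeeper.properties
# Step 2: Start the Kafka broker (new terminal)
bin/kafka-server-start.sh config/server.properties
# Step 3: Create a topic
bin/kafka-topics.sh --create --topic my_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
# Step 4: Start a console producer; type messages to send
bin/kafka-console-producer.sh --topic my_topic --bootstrap-server localhost:9092
# Step 5: Start a console consumer (new terminal); messages typed in the producer appear here
bin/kafka-console-consumer.sh --topic my_topic --from-beginning --bootstrap-server localhost:9092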
Tuesday, 19 August 2025
Data Sharing
1. Create Share
CREATE SHARE my_share;
2. Grant privileges to share
GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.my_schema TO SHARE my_share;
GRANT SELECT ON TABLE my_db.my_schema.my_table TO SHARE my_share;
3. Add consumer account(s)
ALTER SHARE my_share ADD ACCOUNT a123bc;
4. Import share (run in the consumer account; reference the share as <provider_account>.<share_name>)
CREATE DATABASE my_db FROM SHARE provider_account.my_share;
Monday, 18 August 2025
Materialized View & Warehouse
Materialized View
To know the usage history,
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history());
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history(materialized_view_name => 'mname'));
SELECT * FROM snowflake.account_usage.materialized_view_refresh_history;
Warehouse
Resizing: Warehouses can be resized even while queries are running or while the warehouse is suspended.
Resizing impacts only future queries, not ones already running.
Scale Up vs Scale Out:
Scale Up (Resize) - More complex queries
Scale Out - More users (more concurrent queries)
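For example (warehouse name is illustrative; multi-cluster warehouses require Enterprise edition or higher):
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';                      -- scale up for complex queries
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 4;  -- scale out for concurrency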
Micro - Partitions & Clustering
- Immutable - Can't be changed
- New data - Added in new partitions
CACHING in Snowflake
Result Cache
- Stores the results of a query (Cloud Services)
- Same queries can use that cache in the future
- Table data has not changed
- Micro-partitions have not changed
- Query doesn't include UDFs or external functions
- Sufficient privileges & results are still available
- Very fast result (persisted query result)
- Avoids re-execution
- Can be disabled by using
- USE_CACHED_RESULT parameter
- If the result is not reused, it is purged after 24 hours
- If the result is reused, it can be retained for up to 31 days
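For example, to disable the result cache for the current session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;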
Data Cache
- Local SSD cache
- Cannot be shared with other warehouses
- Improve performance of subsequent queries that use the same data
- Purged if warehouse is suspended or resized
- Queries with similar data ⇒ same warehouse
- Size depends on warehouse size
Metadata Cache
- Stores statistics and metadata about objects
- Properties for query optimization and processing
- Range of values in micro-partition
- Count rows, count distinct values, max/min value
- Without using virtual warehouse
- DESCRIBE + system-defined functions
- Referred to as the "metadata store"
- Virtual Private Edition: Dedicated metadata store
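For example, queries like these can typically be answered from metadata alone, without resuming a warehouse (table name is illustrative):
SELECT COUNT(*) FROM my_table;
SELECT MIN(id), MAX(id) FROM my_table;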
Query History
Query history can be viewed in three ways:
1. Using SNOWSIGHT( Web UI)
2. Using INFORMATION_SCHEMA
SELECT * FROM TABLE (information_schema.query_history()) ORDER BY start_time;
3. Using ACCOUNT_USAGE
SELECT * FROM snowflake.account_usage.query_history;
UNLOADING
Syntax:
COPY INTO @stage_name FROM (SELECT col1, col2, col3 FROM table_name)
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
Additional Parameters:
SINGLE
Use the SINGLE parameter to specify whether the data is unloaded into one file or split into multiple files. The default is FALSE, which means the data will be split across multiple files.
MAX_FILE_SIZE
Defines the maximum size (in bytes) of each output file
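A sketch combining both options (the stage and table names follow the syntax above; MAX_FILE_SIZE is in bytes):
COPY INTO @stage_name FROM (SELECT col1, col2, col3 FROM table_name)
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
SINGLE = FALSE
MAX_FILE_SIZE = 52428800;  -- ~50 MB per output file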
COPY - Parameters
CopyOption | Description | Values |
---|---|---|
ON_ERROR | Specifies the error handling for the load operation | CONTINUE, SKIP_FILE, SKIP_FILE_num, 'SKIP_FILE_num%', ABORT_STATEMENT |
SIZE_LIMIT | Specifies the maximum size (in bytes) of data to be loaded | <num> |
PURGE | Remove files after a successful load | TRUE, FALSE |
RETURN_FAILED_ONLY | Return only files that failed to load | TRUE, FALSE |
MATCH_BY_COLUMN_NAME | Load semi-structured data into columns by matching column names | CASE_SENSITIVE, CASE_INSENSITIVE, NONE |
ENFORCE_LENGTH | If TRUE, raise an error when text strings exceed the target column length (FALSE truncates) | TRUE, FALSE |
TRUNCATECOLUMNS | If TRUE, truncate text strings that exceed the target column length (FALSE raises an error) | TRUE, FALSE |
FORCE | Load files even if loaded before | TRUE, FALSE |
LOAD_UNCERTAIN_FILES | Load files even if the load status is unknown | TRUE, FALSE |
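A hedged example combining a few of these options (table and stage names are illustrative):
COPY INTO emp_data FROM @emp_named_stage
ON_ERROR = 'SKIP_FILE'
PURGE = TRUE
FORCE = FALSE;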
Sunday, 17 August 2025
INSERT OVERWRITE
- Specifies that the target table should be truncated before inserting the values into the table.
- To use the OVERWRITE option on INSERT, you must use a role that has DELETE privilege on the table because OVERWRITE will delete the existing records in the table.
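A minimal sketch (table names are illustrative):
INSERT OVERWRITE INTO my_table
SELECT * FROM my_staging_table;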
[COPY] File Format Parameters
Property | Property Type |
---|---|
TYPE | String |
RECORD_DELIMITER | String |
FIELD_DELIMITER | String |
FILE_EXTENSION | String |
SKIP_HEADER | Integer |
DATE_FORMAT | String |
TIME_FORMAT | String |
TIMESTAMP_FORMAT | String |
BINARY_FORMAT | String |
ESCAPE | String |
ESCAPE_UNENCLOSED_FIELD | String |
TRIM_SPACE | Boolean |
FIELD_OPTIONALLY_ENCLOSED_BY | String |
NULL_IF | List |
COMPRESSION | String |
ERROR_ON_COLUMN_COUNT_MISMATCH | Boolean |
VALIDATE_UTF8 | Boolean |
SKIP_BLANK_LINES | Boolean |
REPLACE_INVALID_CHARACTERS | Boolean |
EMPTY_FIELD_AS_NULL | Boolean |
SKIP_BYTE_ORDER_MARK | Boolean |
ENCODING | String |
Tuesday, 12 August 2025
Combining Streams & Tasks
CREATE TASK my_task
WAREHOUSE = my_wh
SCHEDULE = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('my_stream_name')
AS
INSERT INTO my_tgt_table (time_col) VALUES (CURRENT_TIMESTAMP);
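For the task above to fire, the referenced stream must exist and the task must be resumed; a minimal sketch (the source table name is illustrative):
CREATE OR REPLACE STREAM my_stream_name ON TABLE my_src_table;
ALTER TASK my_task RESUME;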
Data Sampling Methods
"Data sampling" refers to selecting a subset of data from a larger dataset, typically for testing, analysis, or performance purposes.
- ROW or BERNOULLI
- Every ROW is chosen with percentage p
- More "Randomness"
- Smaller tables
- e.g. SELECT * FROM table_name SAMPLE ROW (<p>) SEED(15);
- BLOCK or SYSTEM
- Every BLOCK is chosen with percentage p
- More "Effectiveness"
- Larger tables
- e.g. SELECT * FROM table_name SAMPLE SYSTEM(<p>) SEED(15);
Snowflake - Important one word questions and answers
Unstructured data examples:
- video files
- audio files
- documents
- Scoped URL
- Temporary URL, expires in 24 hours
- e.g. SELECT BUILD_SCOPED_FILE_URL(@stage_name,'logo.png');
- File URL
- Permanent one
- e.g. SELECT BUILD_STAGE_FILE_URL(@stage_name,'logo.png');
- Pre-signed URL
- HTTPS URL used to access file via a web browser
- e.g. SELECT GET_PRESIGNED_URL(@stage_name,'logo.png',60);
- here, 60 denotes the seconds to expiry
Stream metadata columns:
- metadata$action
- metadata$isupdate
- metadata$row_id
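For example, rows read from a stream carry these columns automatically (stream name is illustrative):
SELECT * FROM my_stream_name WHERE METADATA$ACTION = 'INSERT';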
Fail-safe
- Protection of historical data in case of disaster
- No user interaction & recoverable only by snowflake
- Non configurable 7 day period
- Period starts immediately after Time Travel period ends
- Contributes to storage cost
Monday, 11 August 2025
Stage with JSON load
CREATE FILE FORMAT:
CREATE OR REPLACE FILE FORMAT my_json_format
TYPE = 'JSON'
COMPRESSION = 'AUTO'
ENABLE_OCTAL = FALSE
ALLOW_DUPLICATE = FALSE
STRIP_OUTER_ARRAY = TRUE
IGNORE_UTF8_ERRORS = FALSE;
CREATE INTERNAL NAMED STAGE:
CREATE OR REPLACE STAGE my_internal_stage
FILE_FORMAT = my_json_format;
CREATE TABLE:
CREATE OR REPLACE TABLE json_data_raw (
id NUMBER AUTOINCREMENT,
raw_json VARIANT,
load_time TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);
LIST THE FILES IN THE STAGE:
LIST @my_internal_stage;
REMOVE THE FILES FROM STAGE:
REMOVE @my_internal_stage;
COPY COMMAND TO LOAD THE DATA (INTO A PARTICULAR COLUMN IN THE TABLE)
COPY INTO json_data_raw(raw_json)
FROM @my_internal_stage/samplejson.json
FILE_FORMAT = (FORMAT_NAME = 'my_json_format');
QUERY THE TABLE:
SELECT
    RAW_JSON:name::string AS emp_name,
    RAW_JSON:email::string AS emp_email,
    RAW_JSON:isActive::boolean AS emp_status
FROM json_data_raw;
Stored Procedures
Typically performs database operations, usually administrative operations like INSERT, UPDATE or DELETE.
It doesn't need to return a value.
Supported languages are,
- Snowflake scripting (Snowflake SQL + procedural logic)
- Javascript
- Snowpark API (Python, Scala & Java)
- If an argument is used in a SQL statement to refer to an object (e.g. a table name), use IDENTIFIER(:argument)
- If an argument is used as a value in a SQL statement, use :argument (see the sketch after this list)
- Runs either with caller's or owner's rights.
- By default, will run with owner's privileges
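A minimal sketch of both uses (the procedure and table names are illustrative):
CREATE OR REPLACE PROCEDURE truncate_one_table(tbl VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    TRUNCATE TABLE IDENTIFIER(:tbl);     -- argument refers to an object
    RETURN 'Truncated table: ' || :tbl;  -- argument used as a value
END;
$$;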
UDF - User Defined Functions
Typically calculate and return a value.
Supported Languages:
- SQL
- Python
- Java
- Javascript
Scalar functions: Returns one output row per input row
Tabular functions: Returns a tabular value for each row
A function is a securable, schema-level object
General syntax for a function (the name add_two is illustrative):
CREATE FUNCTION add_two(n INT)
RETURNS INT
AS
$$
    n + 2
$$;
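A tabular-function sketch, assuming the employees table used elsewhere in these notes:
CREATE OR REPLACE FUNCTION emps_above(min_salary NUMBER)
RETURNS TABLE(emp_id INT, salary NUMBER)
AS
$$
    SELECT emp_id, salary
    FROM employees
    WHERE salary >= min_salary
$$;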
Approximation of the most frequent values
Syntax:
APPROX_TOP_K( expr, k, counters )
Arg1: the column name for which you want to find the most common values.
Arg2: the number of values to return; e.g. to see the top 10 most common values, set k to 10.
Arg3: the maximum number of distinct values that can be tracked.
Normal method to get most frequent value from the table,
SELECT customer_id, COUNT(customer_id) FROM table_name GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
Using estimation function,
SELECT APPROX_TOP_K(customer_id,5,20) FROM table_name;
Benchmark: For 150,000,000 rows
- Normal method took, 12s
- Estimation function took, 5.6s
Estimating the Number of Distinct Values
Normal method to count the distinct values,
SELECT COUNT(DISTINCT(customer_name)) FROM table_name;
Using estimation function,
SELECT HLL(customer_name) FROM table_name;
Benchmark: For 150,000,000 rows
- Normal method took, 12s
- Estimation function took, 5.6s
Thursday, 7 August 2025
Stage and Copy - Useful Commands
To show the list of stages:
- SHOW STAGES;
- DESC STAGE name_of_the_stage;
To show the list of file formats :
- SHOW FILE FORMATS;
- DESC FILE FORMAT name_of_the_file_format;
Sample file format creation :
CREATE OR REPLACE FILE FORMAT emp_csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('NULL', 'null');
Sample file format alter :
ALTER FILE FORMAT emp_csv_format
SET FIELD_DELIMITER = '|';
Sample stage creation :
CREATE OR REPLACE STAGE emp_named_stage
FILE_FORMAT = emp_csv_format;
List the files in the stage :
LIST @emp_named_stage;
Read data from the stage :
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
$1 AS column1_name, -- Referencing the first column
$2 AS column2_name, -- Referencing the second column
$3 AS Age,
$4 AS City,
$5 AS unknowncolumn
FROM @emp_named_stage (file_format => emp_csv_format) AS t;
Copy the data from stage to table :
COPY INTO emp_data FROM @emp_named_stage/samplecsv_error.csv ON_ERROR='continue';
Remove the file from stage :
REMOVE @emp_named_stage/samplecsv_error1.csv;
Validation mode :
COPY INTO emp_data
FROM @emp_named_stage/samplecsv_error1.csv
FILE_FORMAT = emp_csv_format
VALIDATION_MODE = 'RETURN_ERRORS';
To list the files in a stage (requires a directory table on the stage) :
SELECT * FROM DIRECTORY(@name_of_the_stage);
To check the history of file loads :
SELECT * FROM TABLE(information_schema.copy_history(table_name => 'EMP_DATA', start_time => DATEADD(hours, -24, CURRENT_TIMESTAMP())));
Tuesday, 5 August 2025
Stored Procedure - Language Selection SQL vs Javascript
Choosing between LANGUAGE JAVASCRIPT and LANGUAGE SQL (Snowflake Scripting) for stored procedures depends on what you're trying to accomplish.
Use SQL scripting when:
- You're mostly executing SQL statements.
- You want clean, readable procedures.
- You're doing DML (INSERT, UPDATE, DELETE).
Use JavaScript when:
- You need complex procedural logic not easily expressed in SQL.
- You want to build dynamic SQL more flexibly.
- You want to loop through result sets or arrays.
Snowflake - Stored Procedures [ Error free script - ready to use]
1. Basic Stored Procedure
create or replace procedure proc_return_str()
RETURNS VARCHAR
LANGUAGE javascript
EXECUTE AS CALLER
AS '
return "Hello from the snowflake stored procedure!";
';
call proc_return_str();
2. Stored Procedure with Parameters
CREATE OR REPLACE PROCEDURE proc_withparampass(message VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    RETURN 'Hello from Snowflake! You passed: ' || message;
END;
$$;
CALL proc_withparampass('Snowflake Stored Procedures');
3. Stored Procedures with Update SQL statement
CREATE OR REPLACE PROCEDURE apply_bonus(bonus_percentage DECIMAL(5, 2), min_rating INT)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    rows_updated INT;
BEGIN
    UPDATE employees
    SET tot_salary = salary * (1 + :bonus_percentage / 100)
    WHERE performance_rating >= :min_rating;
    rows_updated := SQLROWCOUNT;
    RETURN 'Bonus applied to ' || rows_updated || ' employees.';
END;
$$;
CALL apply_bonus(5, 5);
4. Stored Procedures with Return Output as Table
CREATE OR REPLACE PROCEDURE proc_return_table(min_salary DECIMAL(10, 2))
RETURNS TABLE(emp_id INT, salary DECIMAL(10, 2))
LANGUAGE SQL
AS
$$
DECLARE
    res RESULTSET;
BEGIN
    res := (SELECT emp_id, salary FROM employees WHERE salary >= :min_salary ORDER BY salary DESC);
    RETURN TABLE(res);
END;
$$;
CALL proc_return_table(70000);
5. Stored Procedures with Insert SQL statement
CREATE OR REPLACE PROCEDURE proc_calc_writein_insertinothertable(
    start_date DATE,
    end_date DATE,
    in_region STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    v_total_sales NUMBER;
    v_total_orders NUMBER;
    v_avg_order_value NUMBER;
BEGIN
    INSERT INTO SALES_SUMMARY_LOG (RUN_TIMESTAMP, SALES_REGION, START_DATE, END_DATE,
        TOTAL_SALES, TOTAL_ORDERS, AVERAGE_ORDER_VALUE, STATUS)
    SELECT CURRENT_TIMESTAMP,
        :in_region, :start_date, :end_date,
        SUM(AMOUNT) AS v_total_sales, COUNT(*) AS v_total_orders, AVG(AMOUNT) AS v_avg_order_value,
        'SUCCESS'
    FROM SALES
    WHERE ORDER_DATE BETWEEN :start_date AND :end_date
        AND SALES_REGION = :in_region;
    RETURN 'Summary calculation completed successfully for region: ' || :in_region;
EXCEPTION
    WHEN OTHER THEN
        INSERT INTO SALES_SUMMARY_LOG (RUN_TIMESTAMP, SALES_REGION, START_DATE, END_DATE,
            TOTAL_SALES, TOTAL_ORDERS, AVERAGE_ORDER_VALUE, STATUS)
        VALUES (CURRENT_TIMESTAMP, :in_region, :start_date, :end_date,
            NULL, NULL, NULL, 'FAILED');
        RETURN 'An error occurred during summary calculation.';
END;
$$;
CALL proc_calc_writein_insertinothertable('2025-01-02', '2025-01-02', 'West');
6. Stored Procedures with Calculation
Eg: 1
CREATE OR REPLACE PROCEDURE proc_simple_num_calc()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
    profit NUMBER(38, 2) DEFAULT 0.0; -- variable declared with a default value
    cost NUMBER(38, 2);               -- declared without a default value (NULL initially)
BEGIN
    cost := 100.0;
    LET revenue NUMBER(38, 2) DEFAULT 110.0; -- LET can also be used inside BEGIN...END
    profit := revenue - cost;
    RETURN profit;
END;
$$;
CALL proc_simple_num_calc();
Eg: 2
CREATE OR REPLACE PROCEDURE calculate_revenue()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
BEGIN
    LET sales_amount NUMBER(38, 2) := 500.0;      -- declared and initialized using :=
    LET discount_rate NUMBER(38, 2) DEFAULT 0.10; -- declared and initialized using DEFAULT
    LET net_revenue NUMBER(38, 2);
    net_revenue := sales_amount * (1 - discount_rate);
    RETURN net_revenue;
END;
$$;
CALL calculate_revenue();
7. Stored Procedure with Cursor & Get count of a table:
CREATE OR REPLACE PROCEDURE proc_cursor_getcountofthetable(table_name STRING)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    row_count INT DEFAULT 0;
    -- Note: the table_name argument is not used here; the cursor is hard-coded to customer_src
    C1 CURSOR FOR SELECT COUNT(*) FROM customer_src;
BEGIN
    OPEN C1;
    FETCH C1 INTO row_count;
    CLOSE C1;
    IF (row_count > 4) THEN
        RETURN 'Success';
    ELSE
        RETURN 'Failed';
    END IF;
END;
$$;
CALL proc_cursor_getcountofthetable('customer_src');
8. Stored Procedure with Execute Immediate & Dynamic SQL:
CREATE OR REPLACE PROCEDURE proc_with_ex_immediate_dynamic_sql(
minimum_price NUMBER(12,2),
maximum_price NUMBER(12,2))
RETURNS TABLE (order_id NUMBER(38,0), SALES_REGION VARCHAR, ORDER_DATE DATE, amount NUMBER(38, 0))
-- Give all the columns from the table, with exact data types and order
LANGUAGE SQL
AS
$$
DECLARE
rs RESULTSET;
query VARCHAR DEFAULT 'SELECT * FROM SALES WHERE amount > ? AND amount < ?';
BEGIN
rs := (EXECUTE IMMEDIATE :query USING (minimum_price, maximum_price));
RETURN TABLE(rs);
END;
$$
;
CALL proc_with_ex_immediate_dynamic_sql(1000,1500);