Wednesday, 10 September 2025
Snowflake - Cost Optimization
- Reduce auto-suspend to 60 seconds
- Reduce virtual warehouse size
- Ensure minimum clusters are set to 1
- Consolidate warehouses
- Separate warehouses by workload and requirement, not by domain
- Reduce query frequency
- At many organizations, batch data transformation jobs run hourly by default. But do downstream use cases need such low latency? Check with the business before setting the frequency.
- Only process new or updated data
- Ensure tables are clustered correctly
- Drop unused tables
- Lower data retention
- The Time Travel (data retention) setting can add storage cost, since Snowflake must keep copies of all modifications and changes made to a table over the retention period.
- Use transient tables
- Avoid frequent DML operations
- Ensure files are optimally sized
- To keep data loading cost effective, a best practice is to keep your files around 100-250 MB.
- To demonstrate the effect:
- If we only have one 1 GB file, we saturate just 1 of 16 threads on a Small warehouse used for loading.
- If you instead split this file into ten files of 100 MB each, you utilize 10 of 16 threads. This level of parallelization is much better, as it leads to better utilization of the given compute resources.
- Leverage access control
- Enable query timeouts
- Configure resource monitors
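A hedged SQL sketch of a few of these settings (warehouse, monitor, and table names are illustrative):
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;               -- suspend after 60 seconds idle
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';       -- reduce warehouse size
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1;           -- minimum clusters
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 1;  -- lower Time Travel retention
ALTER ACCOUNT SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;     -- query timeout
CREATE RESOURCE MONITOR my_monitor WITH CREDIT_QUOTA = 100
    TRIGGERS ON 90 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;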
Tuesday, 2 September 2025
Kafka - Topics, Partitions & Offset
KAFKA - EVENT PROCESSING SYSTEM
- No need to wait for a response
- Fire and Forget
- Real time processing (Streams)
- High throughput & Low latency
Topics
- Particular stream of data
- Can be identified by name
e.g. Tables in a database
- Supports all types of messages
- The sequence of messages is called a data stream
- You cannot query topics; instead, use Kafka producers to send data and Kafka consumers to read it
- Kafka topics are immutable: once data is written to a partition, it cannot be changed
- Data is kept for a limited time (default is one week - configurable)
Partitions
- Topics are split into partitions
- Messages within each partition are ordered
Offset
- Each message within a partition gets an incremental ID, called an offset
Producers
- Write data to topics
- Producers know which partition to write to
Kafka Connect
- Gets data in and out of Kafka
Step-by-Step to Start Kafka
- Step 1: Start ZooKeeper
- This will keep running in the terminal; open a new terminal window for the next step
- Step 2: Start Kafka Server (Broker)
- Step 3: Create a Kafka Topic
- Step 4: Start Producer
- Type messages here to send to Kafka.
- Step 5: Start Consumer (in a new terminal)
- You will see the messages you type in the producer appear here.
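A minimal command sketch for the steps above, assuming a local Kafka distribution that still uses ZooKeeper (run from the Kafka install directory; the topic name is illustrative):
# Step 1: Start ZooKeeper (keep this terminal running)
bin/zookeeper-server-start.sh config/zookeeper.properties
# Step 2: Start the Kafka broker (new terminal)
bin/kafka-server-start.sh config/server.properties
# Step 3: Create a topic
bin/kafka-topics.sh --create --topic my_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
# Step 4: Start a console producer; type messages to send
bin/kafka-console-producer.sh --topic my_topic --bootstrap-server localhost:9092
# Step 5: Start a console consumer (new terminal); messages typed in the producer appear here
bin/kafka-console-consumer.sh --topic my_topic --from-beginning --bootstrap-server localhost:9092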
Tuesday, 19 August 2025
Data Sharing
1. Create Share
CREATE SHARE my_share;
2. Grant privileges to share
GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.my_schema TO SHARE my_share;
GRANT SELECT ON TABLE my_db.my_schema.my_table TO SHARE my_share;
3. Add consumer account(s)
ALTER SHARE my_share ADD ACCOUNT a123bc;
4. Import share (run in the consumer account; reference the share as <provider_account>.<share_name>)
CREATE DATABASE my_db FROM SHARE provider_account.my_share;
Monday, 18 August 2025
Materialized View & Warehouse
Materialized View
To know the usage history,
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history());
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history(materialized_view_name => 'mname'));
SELECT * FROM snowflake.account_usage.materialized_view_refresh_history;
Warehouse
Resizing: Warehouses can be resized even while queries are running or while the warehouse is suspended.
Resizing impacts only future queries, not ones already running.
Scale Up vs Scale Out:
Scale Up (Resize) - More complex queries
Scale Out - More users (more concurrent queries)
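For example (warehouse name is illustrative; multi-cluster warehouses require Enterprise edition or higher):
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';                      -- scale up for complex queries
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 4;  -- scale out for concurrency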
Micro - Partitions & Clustering
- Immutable - Can't be changed
- New data - Added in new partitions
CACHING in Snowflake
Result Cache
- Stores the results of a query (Cloud Services)
- Same queries can use that cache in the future
- Table data has not changed
- Micro-partitions have not changed
- Query doesn't include UDFs or external functions
- Sufficient privileges & results are still available
- Very fast result (persisted query result)
- Avoids re-execution
- Can be disabled by using
- USE_CACHED_RESULT parameter
- If the result is not reused, it is purged after 24 hours
- If the result is reused, it can be retained for up to 31 days
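For example, to disable the result cache for the current session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;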
Data Cache
- Local SSD cache
- Cannot be shared with other warehouses
- Improve performance of subsequent queries that use the same data
- Purged if warehouse is suspended or resized
- Queries with similar data ⇒ same warehouse
- Size depends on warehouse size
Metadata Cache
- Stores statistics and metadata about objects
- Properties for query optimization and processing
- Range of values in micro-partition
- Count rows, count distinct values, max/min value
- Without using virtual warehouse
- DESCRIBE + system-defined functions
- Referred to as the "metadata store"
- Virtual Private Edition: Dedicated metadata store
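For example, queries like these can typically be answered from metadata alone, without resuming a warehouse (table name is illustrative):
SELECT COUNT(*) FROM my_table;
SELECT MIN(id), MAX(id) FROM my_table;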
Query History
Query history can be viewed in three ways:
1. Using SNOWSIGHT( Web UI)
2. Using INFORMATION_SCHEMA
SELECT * FROM TABLE (information_schema.query_history()) ORDER BY start_time;
3. Using ACCOUNT_USAGE
SELECT * FROM snowflake.account_usage.query_history;
UNLOADING
Syntax:
COPY INTO @stage_name FROM (SELECT col1, col2, col3 FROM table_name)
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
Additional Parameters:
SINGLE
Use the SINGLE parameter to specify whether the data is unloaded into one file or split into multiple files. The default is FALSE, which means the data will be split across multiple files.
MAX_FILE_SIZE
Defines the maximum size (in bytes) of each output file
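A sketch combining both options (the stage and table names follow the syntax above; MAX_FILE_SIZE is in bytes):
COPY INTO @stage_name FROM (SELECT col1, col2, col3 FROM table_name)
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
SINGLE = FALSE
MAX_FILE_SIZE = 52428800;  -- ~50 MB per output file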
COPY - Parameters
CopyOption | Description | Values |
---|---|---|
ON_ERROR | Specifies the error handling for the load operation | CONTINUE, SKIP_FILE, SKIP_FILE_num, 'SKIP_FILE_num%', ABORT_STATEMENT |
SIZE_LIMIT | Specifies the maximum size (in bytes) of data to be loaded | <num> |
PURGE | Remove files after a successful load | TRUE, FALSE |
RETURN_FAILED_ONLY | Return only files that failed to load | TRUE, FALSE |
MATCH_BY_COLUMN_NAME | Load semi-structured data into columns by matching column names | CASE_SENSITIVE, CASE_INSENSITIVE, NONE |
ENFORCE_LENGTH | If TRUE, raise an error when text strings exceed the target column length (FALSE truncates) | TRUE, FALSE |
TRUNCATECOLUMNS | If TRUE, truncate text strings that exceed the target column length (FALSE raises an error) | TRUE, FALSE |
FORCE | Load files even if loaded before | TRUE, FALSE |
LOAD_UNCERTAIN_FILES | Load files even if the load status is unknown | TRUE, FALSE |
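A hedged example combining a few of these options (table and stage names are illustrative):
COPY INTO emp_data FROM @emp_named_stage
ON_ERROR = 'SKIP_FILE'
PURGE = TRUE
FORCE = FALSE;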
Sunday, 17 August 2025
INSERT OVERWRITE
- Specifies that the target table should be truncated before inserting the values into the table.
- To use the OVERWRITE option on INSERT, you must use a role that has DELETE privilege on the table because OVERWRITE will delete the existing records in the table.
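A minimal sketch (table names are illustrative):
INSERT OVERWRITE INTO my_table
SELECT * FROM my_staging_table;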
[COPY] File Format Parameters
Property | Property Type |
---|---|
TYPE | String |
RECORD_DELIMITER | String |
FIELD_DELIMITER | String |
FILE_EXTENSION | String |
SKIP_HEADER | Integer |
DATE_FORMAT | String |
TIME_FORMAT | String |
TIMESTAMP_FORMAT | String |
BINARY_FORMAT | String |
ESCAPE | String |
ESCAPE_UNENCLOSED_FIELD | String |
TRIM_SPACE | Boolean |
FIELD_OPTIONALLY_ENCLOSED_BY | String |
NULL_IF | List |
COMPRESSION | String |
ERROR_ON_COLUMN_COUNT_MISMATCH | Boolean |
VALIDATE_UTF8 | Boolean |
SKIP_BLANK_LINES | Boolean |
REPLACE_INVALID_CHARACTERS | Boolean |
EMPTY_FIELD_AS_NULL | Boolean |
SKIP_BYTE_ORDER_MARK | Boolean |
ENCODING | String |
Tuesday, 12 August 2025
Combining Streams & Tasks
CREATE TASK my_task
WAREHOUSE = my_wh
SCHEDULE = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('my_stream_name')
AS
INSERT INTO my_tgt_table (time_col) VALUES (CURRENT_TIMESTAMP);
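For the task above to fire, the referenced stream must exist and the task must be resumed; a minimal sketch (the source table name is illustrative):
CREATE OR REPLACE STREAM my_stream_name ON TABLE my_src_table;
ALTER TASK my_task RESUME;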
Data Sampling Methods
"Data sampling" refers to selecting a subset of data from a larger dataset, typically for testing, analysis, or performance purposes.
- ROW or BERNOULLI
- Every ROW is chosen with percentage p
- More "Randomness"
- Smaller tables
- e.g. SELECT * FROM table_name SAMPLE ROW (<p>) SEED(15);
- BLOCK or SYSTEM
- Every BLOCK is chosen with percentage p
- More "Effectiveness"
- Larger tables
- e.g. SELECT * FROM table_name SAMPLE SYSTEM(<p>) SEED(15);
Snowflake - Important one word questions and answers
Unstructured data examples:
- video files
- audio files
- documents
- Scoped URL
- Temporary URL, expires in 24 hours
- e.g. SELECT BUILD_SCOPED_FILE_URL(@stage_name,'logo.png');
- File URL
- Permanent one
- e.g. SELECT BUILD_STAGE_FILE_URL(@stage_name,'logo.png');
- Pre-signed URL
- HTTPS URL used to access file via a web browser
- e.g. SELECT GET_PRESIGNED_URL(@stage_name,'logo.png',60);
- here, 60 denotes the seconds to expiry
Stream metadata columns:
- metadata$action
- metadata$isupdate
- metadata$row_id
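For example, rows read from a stream carry these columns automatically (stream name is illustrative):
SELECT * FROM my_stream_name WHERE METADATA$ACTION = 'INSERT';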
Fail-safe
- Protection of historical data in case of disaster
- No user interaction & recoverable only by snowflake
- Non configurable 7 day period
- Period starts immediately after Time Travel period ends
- Contributes to storage cost
Monday, 11 August 2025
Stage with JSON load
CREATE FILE FORMAT:
CREATE OR REPLACE FILE FORMAT my_json_format
TYPE = 'JSON'
COMPRESSION = 'AUTO'
ENABLE_OCTAL = FALSE
ALLOW_DUPLICATE = FALSE
STRIP_OUTER_ARRAY = TRUE
IGNORE_UTF8_ERRORS = FALSE;
CREATE INTERNAL NAMED STAGE:
CREATE OR REPLACE STAGE my_internal_stage
FILE_FORMAT = my_json_format;
CREATE TABLE:
CREATE OR REPLACE TABLE json_data_raw (
id NUMBER AUTOINCREMENT,
raw_json VARIANT,
load_time TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
);
LIST THE FILES IN THE STAGE:
LIST @my_internal_stage;
REMOVE THE FILES FROM STAGE:
REMOVE @my_internal_stage;
COPY COMMAND TO LOAD THE DATA (INTO A PARTICULAR COLUMN IN THE TABLE)
COPY INTO json_data_raw(raw_json)
FROM @my_internal_stage/samplejson.json
FILE_FORMAT = (FORMAT_NAME = 'my_json_format');
QUERY THE TABLE:
SELECT
    RAW_JSON:name::string AS emp_name,
    RAW_JSON:email::string AS emp_email,
    RAW_JSON:isActive::boolean AS emp_status
FROM json_data_raw;
Stored Procedures
Typically performs database operations, usually administrative operations like INSERT, UPDATE or DELETE.
It doesn't need to return a value.
Supported languages are,
- Snowflake scripting (Snowflake SQL + procedural logic)
- Javascript
- Snowpark API (Python, Scala & Java)
- If an argument is used in a SQL statement to refer to an object (e.g. a table name), use IDENTIFIER(:argument)
- If an argument is used as a value in a SQL statement, use :argument (see the sketch after this list)
- Runs either with caller's or owner's rights.
- By default, will run with owner's privileges
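A minimal sketch of both uses (the procedure and table names are illustrative):
CREATE OR REPLACE PROCEDURE truncate_one_table(tbl VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    TRUNCATE TABLE IDENTIFIER(:tbl);     -- argument refers to an object
    RETURN 'Truncated table: ' || :tbl;  -- argument used as a value
END;
$$;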
UDF - User Defined Functions
Typically calculate and return a value.
Supported Languages:
- SQL
- Python
- Java
- Javascript
Scalar functions: Returns one output row per input row
Tabular functions: Returns a tabular value for each row
A function is a securable, schema-level object
General syntax for a function (the name add_two is illustrative):
CREATE FUNCTION add_two(n INT)
RETURNS INT
AS
$$
    n + 2
$$;
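A tabular-function sketch, assuming the employees table used elsewhere in these notes:
CREATE OR REPLACE FUNCTION emps_above(min_salary NUMBER)
RETURNS TABLE(emp_id INT, salary NUMBER)
AS
$$
    SELECT emp_id, salary
    FROM employees
    WHERE salary >= min_salary
$$;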
Approximation of the most frequent values
Syntax:
APPROX_TOP_K( expr, k, counters )
Arg1: the column name for which you want to find the most common values.
Arg2: the number of values to return; e.g. to see the top 10 most common values, set k to 10.
Arg3: the maximum number of distinct values that can be tracked.
Normal method to get most frequent value from the table,
SELECT customer_id, COUNT(customer_id) FROM table_name GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
Using estimation function,
SELECT APPROX_TOP_K(customer_id,5,20) FROM table_name;
Benchmark: For 150,000,000 rows
- Normal method took, 12s
- Estimation function took, 5.6s
Estimating the Number of Distinct Values
Normal method to count the distinct values,
SELECT COUNT(DISTINCT(customer_name)) FROM table_name;
Using estimation function,
SELECT HLL(customer_name) FROM table_name;
Benchmark: For 150,000,000 rows
- Normal method took, 12s
- Estimation function took, 5.6s
Thursday, 7 August 2025
Stage and Copy - Useful Commands
To show the list of stages:
- SHOW STAGES;
- DESC STAGE name_of_the_stage;
To show the list of file formats :
- SHOW FILE FORMATS;
- DESC FILE FORMAT name_of_the_file_format;
Sample file format creation :
CREATE OR REPLACE FILE FORMAT emp_csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('NULL', 'null');
Sample file format alter :
ALTER FILE FORMAT emp_csv_format
SET FIELD_DELIMITER = '|';
Sample stage creation :
CREATE OR REPLACE STAGE emp_named_stage
FILE_FORMAT = emp_csv_format;
List the files in the stage :
LIST @emp_named_stage;
Read data from the stage :
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
$1 AS column1_name, -- Referencing the first column
$2 AS column2_name, -- Referencing the second column
$3 AS Age,
$4 AS City,
$5 AS unknowncolumn
FROM @emp_named_stage (file_format => emp_csv_format) AS t;
Copy the data from stage to table :
COPY INTO emp_data FROM @emp_named_stage/samplecsv_error.csv ON_ERROR='continue';
Remove the file from stage :
REMOVE @emp_named_stage/samplecsv_error1.csv;
Validation mode :
COPY INTO emp_data
FROM @emp_named_stage/samplecsv_error1.csv
FILE_FORMAT = emp_csv_format
VALIDATION_MODE = 'RETURN_ERRORS';
To list the files in a stage (requires a directory table on the stage) :
SELECT * FROM DIRECTORY(@name_of_the_stage);
To check the history of file loads :
SELECT * FROM TABLE(information_schema.copy_history(table_name => 'EMP_DATA', start_time => DATEADD(hours, -24, CURRENT_TIMESTAMP())));
Tuesday, 5 August 2025
Stored Procedure - Language Selection SQL vs Javascript
Choosing between LANGUAGE JAVASCRIPT and LANGUAGE SQL (Snowflake Scripting) for stored procedures depends on what you're trying to accomplish.
Use SQL scripting when:
- You're mostly executing SQL statements.
- You want clean, readable procedures.
- You're doing DML (INSERT, UPDATE, DELETE).
Use JavaScript when:
- You need complex procedural logic not easily expressed in SQL.
- You want to build dynamic SQL more flexibly.
- You want to loop through result sets or arrays.
Snowflake - Stored Procedures [ Error free script - ready to use]
1. Basic Stored Procedure
create or replace procedure proc_return_str()
RETURNS VARCHAR
LANGUAGE javascript
EXECUTE AS CALLER
AS '
return "Hello from the snowflake stored procedure!";
';
call proc_return_str();
2. Stored Procedure with Parameters
CREATE OR REPLACE PROCEDURE proc_withparampass(message VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    RETURN 'Hello from Snowflake! You passed: ' || message;
END;
$$;
CALL proc_withparampass('Snowflake Stored Procedures');
3. Stored Procedures with Update SQL statement
CREATE OR REPLACE PROCEDURE apply_bonus(bonus_percentage DECIMAL(5, 2), min_rating INT)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    rows_updated INT;
BEGIN
    UPDATE employees
    SET tot_salary = salary * (1 + :bonus_percentage / 100)
    WHERE performance_rating >= :min_rating;
    rows_updated := SQLROWCOUNT;
    RETURN 'Bonus applied to ' || rows_updated || ' employees.';
END;
$$;
CALL apply_bonus(5, 5);
4. Stored Procedures with Return Output as Table
CREATE OR REPLACE PROCEDURE proc_return_table(min_salary DECIMAL(10, 2))
RETURNS TABLE(emp_id INT, salary DECIMAL(10, 2))
LANGUAGE SQL
AS
$$
DECLARE
    res RESULTSET;
BEGIN
    res := (SELECT emp_id, salary FROM employees WHERE salary >= :min_salary ORDER BY salary DESC);
    RETURN TABLE(res);
END;
$$;
CALL proc_return_table(70000);
5. Stored Procedures with Insert SQL statement
CREATE OR REPLACE PROCEDURE proc_calc_writein_insertinothertable(
    start_date DATE,
    end_date DATE,
    in_region STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    v_total_sales NUMBER;
    v_total_orders NUMBER;
    v_avg_order_value NUMBER;
BEGIN
    INSERT INTO SALES_SUMMARY_LOG (RUN_TIMESTAMP, SALES_REGION, START_DATE, END_DATE,
        TOTAL_SALES, TOTAL_ORDERS, AVERAGE_ORDER_VALUE, STATUS)
    SELECT CURRENT_TIMESTAMP,
        :in_region, :start_date, :end_date,
        SUM(AMOUNT) AS v_total_sales, COUNT(*) AS v_total_orders, AVG(AMOUNT) AS v_avg_order_value,
        'SUCCESS'
    FROM SALES
    WHERE ORDER_DATE BETWEEN :start_date AND :end_date
        AND SALES_REGION = :in_region;
    RETURN 'Summary calculation completed successfully for region: ' || :in_region;
EXCEPTION
    WHEN OTHER THEN
        INSERT INTO SALES_SUMMARY_LOG (RUN_TIMESTAMP, SALES_REGION, START_DATE, END_DATE,
            TOTAL_SALES, TOTAL_ORDERS, AVERAGE_ORDER_VALUE, STATUS)
        VALUES (CURRENT_TIMESTAMP, :in_region, :start_date, :end_date,
            NULL, NULL, NULL, 'FAILED');
        RETURN 'An error occurred during summary calculation.';
END;
$$;
CALL proc_calc_writein_insertinothertable('2025-01-02', '2025-01-02', 'West');
6. Stored Procedures with Calculation
Eg: 1
CREATE OR REPLACE PROCEDURE proc_simple_num_calc()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
    profit NUMBER(38, 2) DEFAULT 0.0; -- variable declared with a default value
    cost NUMBER(38, 2);               -- declared without a default value (NULL initially)
BEGIN
    cost := 100.0;
    LET revenue NUMBER(38, 2) DEFAULT 110.0; -- LET can also be used inside BEGIN...END
    profit := revenue - cost;
    RETURN profit;
END;
$$;
CALL proc_simple_num_calc();
Eg: 2
CREATE OR REPLACE PROCEDURE calculate_revenue()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
BEGIN
    LET sales_amount NUMBER(38, 2) := 500.0;      -- declared and initialized using :=
    LET discount_rate NUMBER(38, 2) DEFAULT 0.10; -- declared and initialized using DEFAULT
    LET net_revenue NUMBER(38, 2);
    net_revenue := sales_amount * (1 - discount_rate);
    RETURN net_revenue;
END;
$$;
CALL calculate_revenue();
7. Stored Procedure with Cursor & Get count of a table:
CREATE OR REPLACE PROCEDURE proc_cursor_getcountofthetable(table_name STRING)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    row_count INT DEFAULT 0;
    -- Note: the table_name argument is not used here; the cursor is hard-coded to customer_src
    C1 CURSOR FOR SELECT COUNT(*) FROM customer_src;
BEGIN
    OPEN C1;
    FETCH C1 INTO row_count;
    CLOSE C1;
    IF (row_count > 4) THEN
        RETURN 'Success';
    ELSE
        RETURN 'Failed';
    END IF;
END;
$$;
CALL proc_cursor_getcountofthetable('customer_src');
8. Stored Procedure with Execute Immediate & Dynamic SQL:
CREATE OR REPLACE PROCEDURE proc_with_ex_immediate_dynamic_sql(
minimum_price NUMBER(12,2),
maximum_price NUMBER(12,2))
RETURNS TABLE (order_id NUMBER(38,0), SALES_REGION VARCHAR, ORDER_DATE DATE, amount NUMBER(38, 0))
-- Give all the columns from the table, with exact data types and order
LANGUAGE SQL
AS
$$
DECLARE
rs RESULTSET;
query VARCHAR DEFAULT 'SELECT * FROM SALES WHERE amount > ? AND amount < ?';
BEGIN
rs := (EXECUTE IMMEDIATE :query USING (minimum_price, maximum_price));
RETURN TABLE(rs);
END;
$$
;
CALL proc_with_ex_immediate_dynamic_sql(1000,1500);