Subramanian G

Wednesday, 10 September 2025

Snowflake - Cost Optimization

Reduce auto-suspend to 60 seconds
Reduce virtual warehouse size
Ensure minimum clusters are set to 1
Consolidate warehouses

Separate warehouse by workload, requirement & not by domain

Reduce query frequency

At many organizations, batch data transformation jobs often run hourly by default. But do downstream use cases need such low latency? Check with business before set up the frequency.

Only process new or updated data
Ensure tables are clustered correctly
Drop unused tables
Lower data retention

The time travel (data retention) setting can result in added costs since it must maintain copies of all modifications and changes to a table made over the retention period.

Use transient tables
Avoid frequent DML operations
Ensure files are optimally sized

To ensure cost effective data loading, a best practice is to keep your files around 100-250MB.
To demonstrate these effects,

If we only have one 1GB file, we will only saturate 1/16 threads on a Small warehouse used for loading.
If you instead split this file into ten files that are 100 MB each, you will utilize 10 threads out of 16. This level parallelization is much better as it leads to better utilisation of the given compute resources

Leverage access control
Enable query timeouts
Configure resource monitors

Tuesday, 2 September 2025

Kafka - Topics, Partitions & Offset

KAFKA - EVENT PROCESSING SYSTEM

No need to wait for response
Fire and Forget
Real time processing (Streams)
High throughput & Low latency

Topics

- Particular stream of data

- Can be identified by name

e.g. Tables in a database

- Support all type of messages

- The sequence of message is called, data stream

- You cannot query topics, instead use kafka producers to send data and kafka consumers to read the data

- Kafka topics are immutable, Once data is written to a partition, it cannot be changed

- Data is kept for a limited time (default is one week - configurable)

Partitions

- Topics are split into partitions

- Messages within each partitions are ordered

Offset

- Each message within a partition gets an incremental id, called offset

Producers

- Write data to topics

- Producers know to which partition to write

Kafka Connect

-Getting data in and out of kafka

Step-by-Step to Start Kafka

Step 1: Start ZooKeeper

This will keep running in the terminal. In a new terminal window

Step 2: Start Kafka Server (Broker)
Step 3: Create a Kafka Topic
Step 4: Start Producer

Type messages here to send to Kafka.

Step 5: Start Consumer (in a new terminal)

You will see the messages you type in the producer appear here.

Architecture

Tuesday, 19 August 2025

Data Sharing

1. Create Share


CREATE SHARE my_share;

2. Grant privileges to share


GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_schema.my_db TO SHARE my_share;
GRANT SELECT ON TABLE my_table.myschema.my_db TO SHARE my_share;

3. Add consumer account(s)


ALTER SHARE my_share ADD ACCOUNT a123bc;

4. Import share

CREATE DATABASE my_db FROM SHARE my_share;

Monday, 18 August 2025

Materialized View & Warehouse

Materialized View

To know the usage history,

SELECT * FROM information_schema.materialized_view_refresh_history();

SELECT * FROM information_schema.materialized_view_refresh_history(materialized_view_name => 'mname'));

SELECT * FROM snowflake.account_usage.materialized_view_refresh_history;

Warehouse

Resizing: Warehouses can be resized even when query is running or when suspended.

It impact only future queries, not the running one.

Scale Up vs Scale Out:

Scale Up (Resize) - More complex queries

Scale Out - More User (More queries)

Micro - Partitions & Clustering

Micro - Partitions

Immutable - Can't be changed
New data - Added in new partitions

Clustering

Get the clustering key details from the existing tables,

SELECT * FROM information_schema.tables WHERE clustering_key IS NOT NULL;

To know about the cluster in detail,

SELECT SYSTEM$CLUSTERING_INFORMATION ('table_name');

To know about the particular column cluster in detail,

SELECT SYSTEM$CLUSTERING_INFORMATION ('table_name','(column_name)');

To know about the clustering depth,

SELECT SYSTEM$CLUSTERING_DEPTH ('table_name');

CACHING in Snowflake

Result Cache

Stores the results of a query (Cloud Services)
Same queries can use that cache in the future

Table data has not changed
Micro-partitions have not changed
Query doesn't include UDFs or external functions
Sufficient privileges & results are still available

Very fast result (persisted query result)
Avoids re-execution
Can be disabled by using

USE_CACHED_RESULT parameter

If query is not re-used purged after 24 hours
If query is re-used can be stored up to 31 days

Tip : Result cache is resides in the CLOUD SERVICES layer

Data Cache

Local SSD cache
Cannot be shared with other warehouses
Improve performance of subsequent queries that use the same data
Purged if warehouse is suspended or resized
Queries with similar data ⇒ same warehouse
Size depends on warehouse size

Tip : Data cache is resides in the QUERY PROCESSING layer

Metadata Cache

Stores statistics and metadata about objects
Properties for query optimization and processing

Range of values in micro-partition

Count rows, count distinct values, max/min value
Without using virtual warehouse
DESCRIBE + system-defined functions
Called as "Metadata store"
Virtual Private Edition: Dedicated metadata store

Tip : Result cache is resides in the CLOUD SERVICES layer

Query History

In 3 ways we will ab able to view the query history,

1. Using SNOWSIGHT( Web UI)

2. Using INFORMATION_SCHEMA

SELECT * FROM TABLE (information_schema.query_history()) ORDER BY start_time;

3. Using ACCOUNT_USAGE

SELECT * FROM snowflake.account_usage.query_history;