- Reduce auto-suspend to 60 seconds
- Reduce virtual warehouse size
- Ensure minimum clusters are set to 1
- Consolidate warehouses
- Separate warehouses by workload requirements, not by domain
- Reduce query frequency
- At many organizations, batch data transformation jobs run hourly by default. But do downstream use cases really need such low latency? Check with the business before setting the frequency.
- Only process new or updated data
- Ensure tables are clustered correctly
- Drop unused tables
- Lower data retention
- The Time Travel (data retention) setting can add storage costs, since Snowflake must keep copies of all modifications and changes made to a table over the retention period.
- Use transient tables
- Avoid frequent DML operations
- Ensure files are optimally sized
- For cost-effective data loading, a best practice is to keep your files around 100-250 MB (compressed).
- To demonstrate the effect:
- With a single 1 GB file, we saturate only 1 of the 16 load threads on a Small warehouse.
- Split that file into ten 100 MB files and 10 of the 16 threads are used. This level of parallelization is much better, as it leads to better utilization of the given compute resources.
- Leverage access control
- Enable query timeouts
- Configure resource monitors
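Several of the tips above map directly to DDL settings. A sketch, assuming hypothetical object names (MY_WH, MY_DB.MY_SCHEMA.MY_TABLE, MY_MONITOR):

```sql
-- Suspend after 60 seconds of inactivity; start with a single cluster
-- (MIN_CLUSTER_COUNT applies to multi-cluster warehouses, Enterprise Edition)
ALTER WAREHOUSE my_wh SET
  AUTO_SUSPEND = 60
  MIN_CLUSTER_COUNT = 1;

-- Enable query timeouts: cap any statement at one hour
ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

-- Lower Time Travel retention to 1 day
ALTER TABLE my_db.my_schema.my_table SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- Transient tables skip Fail-safe and keep at most 1 day of Time Travel
CREATE TRANSIENT TABLE my_db.my_schema.staging_orders (id INT, payload VARIANT);

-- Resource monitor that suspends the warehouse when its credit quota is hit
CREATE RESOURCE MONITOR my_monitor WITH
  CREDIT_QUOTA = 100
  TRIGGERS ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;
```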
Subramanian G
Be Good. Do Good.
Wednesday, 10 September 2025
Snowflake - Cost Optimization
Tuesday, 2 September 2025
Kafka - Topics, Partitions & Offset
KAFKA - EVENT PROCESSING SYSTEM
- No need to wait for response
- Fire and Forget
- Real time processing (Streams)
- High throughput & Low latency
Topics
- Particular stream of data
    - Can be identified by name
        e.g. Tables in a database
    - Supports any type of message format
    - The sequence of messages in a topic is called a data stream
    - You cannot query topics; instead, use Kafka producers to send data and Kafka consumers to read it
    - Kafka topics are immutable: once data is written to a partition, it cannot be changed
    - Data is kept for a limited time (default is one week - configurable)
Partitions
- Topics are split into partitions
    - Messages within each partition are ordered
Offset
    - Each message within a partition gets an incremental ID called an offset
Producers
    - Write data to topics
    - Producers know which partition to write to
Kafka Connect
    - Getting data in and out of Kafka
Step-by-Step to Start Kafka
- Step 1: Start ZooKeeper
- ZooKeeper keeps running in this terminal; open a new terminal window for the next step
- Step 2: Start Kafka Server (Broker)
- Step 3: Create a Kafka Topic
- Step 4: Start Producer
- Type messages here to send to Kafka.
- Step 5: Start Consumer (in a new terminal)
- You will see the messages you type in the producer appear here.
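The steps above can be sketched with the CLI scripts that ship with a ZooKeeper-based Kafka distribution. Paths, the topic name, and ports are assumptions for a default local install:

```shell
# Step 1: start ZooKeeper (keeps running; use a new terminal per step)
bin/zookeeper-server-start.sh config/zookeeper.properties

# Step 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# Step 3: create a topic with 3 partitions
bin/kafka-topics.sh --create --topic demo-topic \
  --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

# Step 4: start a console producer; each line typed becomes a message
bin/kafka-console-producer.sh --topic demo-topic --bootstrap-server localhost:9092

# Step 5: start a console consumer in another terminal
bin/kafka-console-consumer.sh --topic demo-topic --from-beginning \
  --bootstrap-server localhost:9092
```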
Tuesday, 19 August 2025
Data Sharing
1. Create Share
CREATE SHARE my_share;
2. Grant privileges to share
GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.my_schema TO SHARE my_share;
GRANT SELECT ON TABLE my_db.my_schema.my_table TO SHARE my_share;
3. Add consumer account(s)
ALTER SHARE my_share ADD ACCOUNT a123bc;
4. Import share (run in the consumer account, qualified with the provider's account locator)
CREATE DATABASE my_db FROM SHARE provider_account.my_share;
Monday, 18 August 2025
Materialized View & Warehouse
Materialized View
To know the usage history,
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history());
SELECT * FROM TABLE(information_schema.materialized_view_refresh_history(materialized_view_name => 'mname'));
SELECT * FROM snowflake.account_usage.materialized_view_refresh_history;
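For context, a materialized view precomputes a query and Snowflake refreshes it automatically in the background (Enterprise Edition feature). A minimal sketch, with hypothetical table and view names:

```sql
-- Precompute a per-day aggregate over a single base table
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, SUM(amount) AS total_amount
FROM my_db.my_schema.orders
GROUP BY order_date;
```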
Warehouse
Resizing: Warehouses can be resized while queries are running or while suspended.
Resizing affects only future queries, not the ones already running.
Scale Up vs Scale Out:
Scale Up (Resize) - for more complex queries
Scale Out (Multi-cluster) - for more users (more concurrent queries)
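Both operations are single statements. The warehouse name is hypothetical, and scaling out requires a multi-cluster warehouse (Enterprise Edition):

```sql
-- Scale up: a bigger cluster for complex queries (affects future queries only)
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: more clusters for higher concurrency
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;
```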
Micro - Partitions & Clustering
- Immutable - micro-partitions can't be changed once written
- New data is added as new micro-partitions
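Tying this back to clustering: a clustering key influences which micro-partitions related rows land in, which helps partition pruning. Table and column names here are hypothetical:

```sql
-- Define a clustering key so rows with nearby dates share micro-partitions
ALTER TABLE my_db.my_schema.events CLUSTER BY (event_date);

-- Inspect how well the table is clustered on that key
SELECT SYSTEM$CLUSTERING_INFORMATION('my_db.my_schema.events', '(event_date)');
```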
CACHING in Snowflake
Result Cache
- Stores the results of a query (Cloud Services)
- Same queries can use that cache in the future
- Table data has not changed
- Micro-partitions have not changed
- Query doesn't include UDFs or external functions
- Sufficient privileges & results are still available
- Very fast result (persisted query result)
- Avoids re-execution
- Can be disabled with the USE_CACHED_RESULT session parameter
- If the result is not reused, it is purged after 24 hours
- If the result is reused, it can be retained for up to 31 days
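Disabling the result cache is occasionally useful, e.g. when benchmarking warehouse performance. A sketch using the session parameter named above:

```sql
-- Turn the result cache off for this session only
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
-- ... run the benchmark queries ...
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```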
Data Cache
- Local SSD cache
- Cannot be shared with other warehouses
- Improve performance of subsequent queries that use the same data
- Purged if warehouse is suspended or resized
- Queries with similar data ⇒ same warehouse
- Size depends on warehouse size
Metadata Cache
- Stores statistics and metadata about objects
- Properties for query optimization and processing
- Range of values in micro-partition
- Count rows, count distinct values, max/min value
- Without using virtual warehouse
- DESCRIBE + system-defined functions
- Known as the "metadata store"
- Virtual Private Edition: Dedicated metadata store
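Queries answered purely from the metadata store need no running warehouse. A few examples, with a hypothetical table name:

```sql
-- Row count comes straight from micro-partition statistics
SELECT COUNT(*) FROM my_db.my_schema.events;

-- MIN/MAX may also be served from metadata for some column types
SELECT MIN(event_date), MAX(event_date) FROM my_db.my_schema.events;

-- DESCRIBE reads object metadata only
DESCRIBE TABLE my_db.my_schema.events;
```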
Query History
Query history can be viewed in three ways:
1. Using SNOWSIGHT( Web UI)
2. Using INFORMATION_SCHEMA
    SELECT * FROM TABLE (information_schema.query_history()) ORDER BY start_time;
3. Using ACCOUNT_USAGE
    SELECT * FROM snowflake.account_usage.query_history;