Monday, 23 March 2026

Snowflake - Warehouse

In Snowflake, data is held in databases, and any processing of that data is done by a "warehouse", a cluster of compute resources that runs your queries.

By default, the account comes with three compute warehouses:

  1. COMPUTE_WH, owned by ACCOUNTADMIN.
  2. SNOWFLAKE_LEARNING_WH, owned by ACCOUNTADMIN.
  3. SYSTEM$STREAMLIT_NOTEBOOK_WH, used by Snowflake itself for any work required by the Streamlit apps and notebooks you create and run. You never use this warehouse directly; Snowflake uses it on your behalf.
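
A minimal sketch of creating and using a warehouse (the name my_wh and the settings are illustrative placeholders, not from the notes above):

-- Create a small warehouse that pauses itself after 60 seconds of idle time
CREATE WAREHOUSE IF NOT EXISTS my_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Point the current session at it; subsequent queries run on my_wh
USE WAREHOUSE my_wh;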

Snowflake - Authentication & Authorization

Authentication (Identity): proven through username & password

Authorization (Access): granted through RBAC role assignments

Account Admin (can see & do everything)

    Security Admin (manages the security aspects of the account)

        User Admin (creates and manages users and roles)

    Sys Admin (creates databases, warehouses, schemas, views)

Public (bottom of the hierarchy; implicitly granted to every user)

Note:

  • Outside this hierarchy sits ORGADMIN, which manages operations at the organization level (e.g. creating and viewing accounts)
  • Snowflake combines RBAC with Discretionary Access Control (DAC): every object has an owning role, and the owner can grant access to it
  • Switching your session to another role is temporary; when you log out and log back in, you revert to your default role
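
A minimal RBAC sketch tying these roles together (the role, user, password and database names are illustrative placeholders):

-- USERADMIN creates users and roles
USE ROLE USERADMIN;
CREATE ROLE IF NOT EXISTS analyst;
CREATE USER IF NOT EXISTS jdoe PASSWORD = 'ChangeMe123!' DEFAULT_ROLE = analyst;
GRANT ROLE analyst TO USER jdoe;

-- SECURITYADMIN manages grants
USE ROLE SECURITYADMIN;
GRANT USAGE ON DATABASE my_db TO ROLE analyst;

-- jdoe can switch roles within a session, but the switch does not survive a re-login
USE ROLE analyst;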

Snowflake - Databases

Every time you create a database, Snowflake will automatically create two schemas for you.

  • The INFORMATION_SCHEMA schema holds a collection of system-defined views and table functions that describe the objects in the database.
  • The INFORMATION_SCHEMA schema cannot be deleted (dropped), renamed, or moved.
  • The PUBLIC schema is created empty, and you can fill it with tables, views, and other objects over time.
  • The PUBLIC schema can be dropped, renamed, or moved at any time.
Note: 
  • By default, a database is owned by the role that created it; here that was ACCOUNTADMIN.
  • The SYSADMIN role is granted to ACCOUNTADMIN, so ACCOUNTADMIN also holds SYSADMIN's ownership rights, indirectly.
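
A quick sketch showing the two automatic schemas (demo_db is a placeholder name):

CREATE DATABASE demo_db;           -- also creates demo_db.INFORMATION_SCHEMA and demo_db.PUBLIC

SHOW SCHEMAS IN DATABASE demo_db;  -- lists INFORMATION_SCHEMA and PUBLIC

-- The metadata views are ordinary queryable views
SELECT table_name, table_type
FROM demo_db.INFORMATION_SCHEMA.TABLES;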

Tuesday, 24 February 2026

Airflow - DAG Dependencies

  • Define the order in which tasks should run
  • Tasks can be upstream (run before) or downstream (run after)
  • Declared after creating the tasks
Methods to declare

  • Recommended: bitshift operators
task1 >> task2 >> [task3, task4]  # task1, then task2, then task3 & task4 in parallel
  • Alternative: explicit setter methods
task1.set_downstream(task2)  # task1 runs before task2
task3.set_upstream(task2)    # task2 runs before task3
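
For context, a minimal sketch of a DAG in which these dependencies could be declared (the DAG id, task ids and the Airflow 2.x EmptyOperator are assumptions for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="demo_dependencies", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # Four placeholder tasks that do nothing
    task1, task2, task3, task4 = [EmptyOperator(task_id=f"task{i}") for i in range(1, 5)]
    # task1 -> task2 -> (task3 and task4 in parallel)
    task1 >> task2 >> [task3, task4]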

Wednesday, 10 September 2025

Snowflake - Cost Optimization

  1. Reduce auto-suspend to 60 seconds
  2. Reduce virtual warehouse size
  3. Ensure minimum clusters are set to 1
  4. Consolidate warehouses
    • Separate warehouses by workload requirements, not by team or domain
  5. Reduce query frequency
    • At many organizations, batch data transformation jobs run hourly by default. But do the downstream use cases actually need such low latency? Check with the business before settling on a frequency.
  6. Only process new or updated data
  7. Ensure tables are clustered correctly
  8. Drop unused tables
  9. Lower data retention
    • The Time Travel (data retention) setting can add cost, since Snowflake must keep copies of all modifications made to a table over the retention period.
  10. Use transient tables
  11. Avoid frequent DML operations
  12. Ensure files are optimally sized
    • For cost-effective data loading, a best practice is to keep files at roughly 100-250 MB (compressed).
    • To see the effect:
      • A single 1 GB file keeps only 1 of the 16 load threads on a Small warehouse busy.
      • Split it into ten 100 MB files and 10 of the 16 threads load in parallel, which makes much better use of the compute you are already paying for.
  13. Leverage access control
  14. Enable query timeouts
  15. Configure resource monitors (see the SQL sketch after this list)
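
Several of the numbered items above are one-line settings. A sketch, assuming a warehouse my_wh, a table my_table and illustrative limits (numbers refer to the list):

ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;                    -- (1) suspend after 60 s idle
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';            -- (2) smaller size
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1;                -- (3) multi-cluster floor (Enterprise edition)
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 1;       -- (9) shorter Time Travel window
CREATE TRANSIENT TABLE my_staging (id INT);                     -- (10) transient: no Fail-safe storage
ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;  -- (14) kill queries after 1 hour

-- (15) Cap monthly credit spend; suspend the warehouse at the limit
CREATE RESOURCE MONITOR my_monitor WITH CREDIT_QUOTA = 100 FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;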

Tuesday, 2 September 2025

Kafka - Topics, Partitions & Offset

Kafka is an event processing system:


  • No need to wait for a response
  • Fire and forget
  • Real-time processing (streams)
  • High throughput & low latency

 

Topics 

    - A particular stream of data

    - Identified by name

        e.g. like tables in a database

    - Support any type of message

    - The sequence of messages is called a data stream

    - You cannot query topics; instead, use Kafka producers to send data and Kafka consumers to read it

    - Kafka topics are immutable: once data is written to a partition, it cannot be changed

    - Data is kept for a limited time (default is one week; configurable)


Partitions

    - Topics are split into partitions

    - Messages within each partition are ordered (ordering is guaranteed only within a partition, not across the whole topic)


Offset

    - Each message within a partition gets an incremental ID, called an offset

    - Offsets only have meaning within their own partition


Producers

    - Write data to topics

    - Producers know in advance which partition a message will go to (hash of the message key, or round-robin when there is no key; see the sketch below)


Kafka Connect

    - Gets data in and out of Kafka (source connectors bring data in, sink connectors send it out)
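
To make topics, partitions and offsets concrete, a minimal sketch using the kafka-python client (the broker address, topic name and key are illustrative assumptions):

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages with the same key always hash to the same partition
producer = KafkaProducer(bootstrap_servers="localhost:9092")
metadata = producer.send("test-topic", key=b"user-1", value=b"hello").get(timeout=10)
print(metadata.partition, metadata.offset)  # partition picked by key hash; offset assigned by the broker
producer.flush()

# Consumer: reads the stream back in offset order within each partition
consumer = KafkaConsumer("test-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)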


Step-by-Step to Start Kafka (example commands after the list)


  • Step 1: Start ZooKeeper
    • ZooKeeper keeps running in this terminal; open a new terminal window for the next step
  • Step 2: Start the Kafka server (broker)
  • Step 3: Create a Kafka topic
  • Step 4: Start a producer
    • Type messages here to send them to Kafka
  • Step 5: Start a consumer (in a new terminal)
    • The messages you type into the producer appear here
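
The standard commands behind these steps, assuming a local installation run from the Kafka install directory (ZooKeeper-based setup as above; newer Kafka versions can instead run in KRaft mode without ZooKeeper, and test-topic is a placeholder name):

# Step 1: Start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# Step 2: Start the Kafka broker (new terminal)
bin/kafka-server-start.sh config/server.properties

# Step 3: Create a topic with 3 partitions
bin/kafka-topics.sh --create --topic test-topic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

# Step 4: Console producer; each line you type becomes a message
bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092

# Step 5: Console consumer (new terminal); messages appear here
bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092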

Architecture

Tuesday, 19 August 2025

Snowflake - Data Sharing

 

1. Create Share

CREATE SHARE my_share;

2. Grant privileges to share

GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.my_schema TO SHARE my_share;
GRANT SELECT ON TABLE my_db.my_schema.my_table TO SHARE my_share;

3. Add consumer account(s)

ALTER SHARE my_share ADD ACCOUNTS = a123bc;

4. Import share (run in the consumer account)

CREATE DATABASE shared_db FROM SHARE provider_account.my_share;  -- provider_account = the provider's account identifier
