Subramanian G: September 2025

Wednesday, 10 September 2025

At many organizations, batch data transformation jobs run hourly by default. But do the downstream use cases actually need such low latency? Check with the business before setting the frequency.

The time travel (data retention) setting can add storage costs, since copies of all modifications and changes made to a table must be maintained for the whole retention period.

To keep data loading cost effective, a best practice is to keep your files around 100-250 MB.
To demonstrate the effect of file size on parallelism:

If we only have one 1 GB file, we will saturate only 1 of the 16 threads on a Small warehouse used for loading.
If you instead split this file into ten files of 100 MB each, you will utilize 10 of the 16 threads. This level of parallelization is much better, as it makes fuller use of the given compute resources.
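The arithmetic above can be sketched in a few lines of Python. This is only a rough model: it assumes each loading thread picks up exactly one file at a time, and the 16-thread figure is taken from the note above. The function name `loading_thread_utilization` is made up for illustration.

```python
import math


def loading_thread_utilization(total_mb, file_mb, threads=16):
    """Return (file_count, busy_threads, utilization) for one load.

    Assumption: one file keeps exactly one thread busy, so the number
    of busy threads is capped by both the file count and the thread count.
    """
    file_count = math.ceil(total_mb / file_mb)
    busy_threads = min(file_count, threads)
    return file_count, busy_threads, busy_threads / threads


# One 1 GB file (treated as 1000 MB) versus ten 100 MB files.
print(loading_thread_utilization(1000, 1000))  # (1, 1, 0.0625)
print(loading_thread_utilization(1000, 100))   # (10, 10, 0.625)
```

Under this model, splitting into more than 16 files does not raise utilization further, since the thread count becomes the cap.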

Tuesday, 2 September 2025

KAFKA - EVENT PROCESSING SYSTEM

Topics

- Particular stream of data

- Identified by its name, similar to a table in a database

- Support any kind of message format

- The sequence of messages is called a data stream

- You cannot query topics; instead, use Kafka producers to send data and Kafka consumers to read it

- Kafka topics are immutable: once data is written to a partition, it cannot be changed

- Data is kept for a limited time (default is one week - configurable)

Partitions

- Topics are split into partitions

- Messages within each partition are ordered

Offset

- Each message within a partition gets an incremental id, called an offset
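To make partitions and offsets concrete, here is a toy in-memory model in Python. It is not the Kafka client API; the `Partition` class and its methods are hypothetical names used only to show that each partition is an append-only log and that offsets are counted per partition.

```python
class Partition:
    """Toy append-only log; the offset is the position in the list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # New messages always go at the end; existing ones are never changed.
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to this message

    def read(self, offset):
        if offset < 0 or offset >= len(self._messages):
            raise IndexError(f"no message at offset {offset}")
        return self._messages[offset]


# A "topic" with two partitions; offsets start at 0 in each one.
topic = [Partition(), Partition()]
print(topic[0].append("a"))  # 0
print(topic[0].append("b"))  # 1
print(topic[1].append("c"))  # 0
print(topic[0].read(1))      # b
```

Note that offset 0 exists in both partitions, so an offset only has meaning together with its partition.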

Producers

- Write data to topics

- Producers know in advance which partition they write to
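One common way a producer decides on the partition is by hashing the message key, so that all messages with the same key land in the same partition. The sketch below is a simplified stand-in: Kafka's default partitioner uses a murmur2 hash of the key, while this example uses CRC32 from the Python standard library just to show the idea. The function name `choose_partition` is hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition number.

    Assumption: CRC32 stands in for the real hash function; the point is
    only that the same key always maps to the same partition.
    """
    return zlib.crc32(key) % num_partitions


# Every message for the same key goes to the same partition,
# which keeps that key's messages in order.
first = choose_partition(b"user-42", 3)
second = choose_partition(b"user-42", 3)
print(first == second)  # True
```

If the number of partitions changes, the same key may map to a different partition, which is one reason the partition count is usually chosen carefully up front.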

Kafka Connect

- Getting data in and out of Kafka

Step-by-Step to Start Kafka

Architecture