Tuesday, 12 August 2025

Data Sampling Methods

"Data sampling" refers to selecting a subset of data from a larger dataset, typically for testing, analysis, or performance purposes.

  • ROW or BERNOULLI
    • Each row is included independently with probability p/100
    • More "random": the sample is drawn row by row
    • Suited to smaller tables
    • e.g. SELECT * FROM table_name SAMPLE ROW (<p>) SEED (15);
  • BLOCK or SYSTEM
    • Each block of rows is included, as a whole, with probability p/100
    • More "efficient": whole blocks are kept or skipped, which is usually faster
    • Suited to larger tables
    • e.g. SELECT * FROM table_name SAMPLE SYSTEM (<p>) SEED (15);
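    Here, <p> is the sampling percentage (0 to 100): the query returns approximately p% of the table's rows, chosen at random. The optional SEED value makes the sample repeatable as long as the table data is unchanged.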
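
To make the difference between the two methods concrete, here is a minimal Python sketch that simulates them on an in-memory list. It is only an illustration of the idea, not how Snowflake implements sampling: the function names sample_rows and sample_blocks and the fixed block_size are hypothetical, and real storage blocks do not all hold the same number of rows.

```python
import random


def sample_rows(rows, p, seed=None):
    """ROW/BERNOULLI-style: keep each row independently with probability p/100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p / 100]


def sample_blocks(rows, p, block_size, seed=None):
    """BLOCK/SYSTEM-style: keep or drop each block of rows as a whole.

    block_size is a hypothetical fixed size used only for this simulation.
    """
    rng = random.Random(seed)
    sampled = []
    for start in range(0, len(rows), block_size):
        # One random draw per block, not per row.
        if rng.random() < p / 100:
            sampled.extend(rows[start:start + block_size])
    return sampled


rows = list(range(10_000))
by_row = sample_rows(rows, p=10, seed=15)
by_block = sample_blocks(rows, p=10, block_size=500, seed=15)

# Both sizes are only approximately 10% of the input; the block-based
# count is always a multiple of block_size here, so it varies more.
print(len(by_row), len(by_block))
```

The row-based version makes one random draw per row, which spreads the sample evenly but touches every row. The block-based version makes one draw per block, which is cheaper but returns rows in clumps, so its result can deviate noticeably from p% when there are only a few blocks.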
