new drafts

This commit is contained in:
simonpetit 2025-09-25 15:46:51 +00:00
parent 0514599e7b
commit 8cf6460743
2 changed files with 542 additions and 0 deletions

# Leveraging `pg_stat_statements` for Tracking Metrics in PostgreSQL
## Introduction
PostgreSQL's `pg_stat_statements` extension is a powerful tool for monitoring and analyzing SQL query performance. While it doesn't report the rows affected by each individual execution of a statement such as `INSERT ... SELECT`, it provides valuable cumulative insights into query execution, including **execution time, rows processed, block I/O, and the number of times a query is called**. This is particularly useful for identifying performance bottlenecks and understanding query patterns.
In this guide, we'll explore how to use `pg_stat_statements` to track and analyze query metrics, and how to complement it with other techniques to capture row-level metrics.
---
## Why Use `pg_stat_statements`?
`pg_stat_statements` provides the following benefits:
1. **Query Performance Insights**: Track execution time, rows processed, and block I/O (shared buffer hits and reads) for each normalized query.
2. **Query Frequency**: Identify frequently executed queries and their impact on the database.
3. **Optimization Opportunities**: Pinpoint slow queries that need optimization.
4. **Non-Intrusive**: No need to modify queries or add triggers.
5. **Lightweight**: Minimal performance overhead compared to other auditing methods.
---
## Step 1: Install and Enable `pg_stat_statements`
### Install the Extension
The `pg_stat_statements` extension is included in the PostgreSQL contrib package. To install it:
#### For Debian/Ubuntu:
```bash
sudo apt-get install postgresql-contrib
```
#### For RHEL/CentOS:
```bash
sudo yum install postgresql-contrib
```
### Enable the Extension
After installing, enable the extension in PostgreSQL:
1. Add `pg_stat_statements` to `shared_preload_libraries` in `postgresql.conf`:
```ini
shared_preload_libraries = 'pg_stat_statements'
```
2. Set the maximum number of statements to track:
```ini
pg_stat_statements.max = 10000
```
3. Set the level of tracking:
```ini
pg_stat_statements.track = all  # track nested statements (inside functions) as well as top-level ones
```
Restart PostgreSQL to apply the changes:
```bash
sudo systemctl restart postgresql
```
Finally, create the extension in your database:
```sql
CREATE EXTENSION pg_stat_statements;
```
---
## Step 2: Query `pg_stat_statements` for Metrics
Once enabled, `pg_stat_statements` collects statistics on SQL queries. You can query the `pg_stat_statements` view to retrieve this information:
```sql
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
rows,
shared_blks_hit,
shared_blks_read
FROM
pg_stat_statements
ORDER BY
total_exec_time DESC
LIMIT 10;
```
### Key Columns in `pg_stat_statements`:
- **`query`**: The normalized SQL query text.
- **`calls`**: The number of times the query was executed.
- **`total_exec_time`**: Total execution time in milliseconds.
- **`mean_exec_time`**: Average execution time in milliseconds.
- **`rows`**: Total number of rows retrieved or affected.
- **`shared_blks_hit`**: Number of shared buffer hits.
- **`shared_blks_read`**: Number of shared blocks read from disk.
---
## Step 3: Track Row-Level Metrics
While `pg_stat_statements` provides the total number of rows retrieved or affected by a query (`rows` column), it doesn't break this down by individual query execution. To capture row-level metrics for each execution, you can combine `pg_stat_statements` with other techniques:
### Option 1: Use Triggers (If Needed)
If you need row counts for specific operations (e.g., `INSERT ... SELECT`), you can use triggers as described in previous guides; a minimal sketch follows. However, this approach is more intrusive and may impact write performance.
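Here is a minimal sketch of that idea, assuming a hypothetical `target_table` being loaded and a hypothetical `insert_audit` table for the counts (transition tables require PostgreSQL 10+, `EXECUTE FUNCTION` requires 11+):
```sql
-- Hypothetical audit table for per-statement row counts
CREATE TABLE insert_audit (
    table_name TEXT,
    rows_inserted BIGINT,
    inserted_at TIMESTAMP DEFAULT NOW()
);

-- Statement-level trigger: count the rows of each INSERT via a transition table
CREATE OR REPLACE FUNCTION log_insert_rowcount() RETURNS trigger AS $$
BEGIN
    INSERT INTO insert_audit (table_name, rows_inserted)
    SELECT TG_TABLE_NAME, COUNT(*) FROM new_rows;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_insert_rowcount
AFTER INSERT ON target_table
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE FUNCTION log_insert_rowcount();
```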
### Option 2: Parse PostgreSQL Logs
If you prefer a non-intrusive method, parse PostgreSQL logs to extract row-level metrics. Configure PostgreSQL to log detailed information:
```ini
log_statement = 'all'
log_duration = on
log_min_messages = INFO
```
Then, write a script to parse the logs and extract metrics such as statement text and duration. Note that `log_statement` and `log_duration` do not record the number of rows affected, so per-execution row counts still have to come from the application or from a trigger-based approach like the one above.
---
## Step 4: Automate Metrics Collection
To keep track of query performance over time, automate the collection of metrics from `pg_stat_statements` by periodically capturing them into a separate table. Keep in mind that the view exposes cumulative counters since the last reset, so either compute deltas between snapshots or call `pg_stat_statements_reset()` after each capture if each stored row should represent a single interval.
### Create a Table to Store Metrics
```sql
CREATE TABLE query_performance_metrics (
id SERIAL PRIMARY KEY,
capture_time TIMESTAMP DEFAULT NOW(),
query TEXT,
calls INT,
total_exec_time FLOAT,
mean_exec_time FLOAT,
rows BIGINT,
shared_blks_hit BIGINT,
shared_blks_read BIGINT
);
```
### Write a Script to Capture Metrics
Here's a Python script to capture and store metrics from `pg_stat_statements`:
```python
import psycopg2

def capture_metrics():
    # Connect to PostgreSQL (replace the placeholders with your own settings)
    conn = psycopg2.connect(
        dbname="your_database",
        user="your_user",
        password="your_password",
        host="your_host"
    )
    cursor = conn.cursor()

    # Fetch the current (cumulative) metrics from pg_stat_statements
    cursor.execute("""
        SELECT
            query,
            calls,
            total_exec_time,
            mean_exec_time,
            rows,
            shared_blks_hit,
            shared_blks_read
        FROM
            pg_stat_statements
    """)
    metrics = cursor.fetchall()

    # Store the snapshot in query_performance_metrics
    for metric in metrics:
        cursor.execute(
            """
            INSERT INTO query_performance_metrics (
                query, calls, total_exec_time, mean_exec_time,
                rows, shared_blks_hit, shared_blks_read
            )
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            metric
        )

    conn.commit()
    cursor.close()
    conn.close()

if __name__ == "__main__":
    capture_metrics()
```
### Schedule the Script
Use `cron` to run the script periodically:
```bash
crontab -e
```
Add a line to run the script every hour:
```
0 * * * * /usr/bin/python3 /path/to/your/script.py
```
---
## Step 5: Analyze Metrics
You can now analyze the collected metrics to gain insights into query performance and usage patterns:
### Example Queries
1. **Top 10 Slowest Queries by Total Execution Time**:
```sql
SELECT
query,
total_exec_time,
calls,
mean_exec_time
FROM
query_performance_metrics
ORDER BY
total_exec_time DESC
LIMIT 10;
```
2. **Queries with the Highest Row Impact**:
```sql
SELECT
query,
rows,
calls
FROM
query_performance_metrics
ORDER BY
rows DESC
LIMIT 10;
```
3. **Trend Analysis Over Time** (meaningful when each snapshot covers a single interval, e.g. because the counters are reset after every capture as noted in Step 4):
```sql
SELECT
DATE(capture_time) AS day,
SUM(total_exec_time) AS total_exec_time,
SUM(rows) AS total_rows
FROM
query_performance_metrics
GROUP BY
DATE(capture_time)
ORDER BY
day;
```
---
## Challenges and Considerations
1. **Performance Overhead**: While `pg_stat_statements` has minimal overhead, it can still impact performance on heavily loaded systems. Monitor your database performance after enabling it.
2. **Log Volume**: If you also parse PostgreSQL logs, ensure you have enough storage and a log rotation strategy.
3. **Query Normalization**: `pg_stat_statements` replaces constants with placeholders and groups similar queries together (see the example after this list), which can make it harder to track specific instances of a query.
4. **Security**: Ensure that sensitive information is not exposed in the logged queries.
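To illustrate normalization with a made-up `orders` table: two calls that differ only in a constant are collapsed into a single entry whose text carries a `$1` placeholder.
```sql
-- Two executions that differ only in their constant...
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE customer_id = 97;

-- ...show up in pg_stat_statements as one normalized entry:
--   query                                        | calls
--   SELECT * FROM orders WHERE customer_id = $1  |     2
```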
---
## Conclusion
`pg_stat_statements` is a powerful tool for tracking and analyzing query performance in PostgreSQL. While it doesn't provide row-level metrics for each query execution, it offers valuable insights into query execution time, frequency, and row impact. By combining `pg_stat_statements` with other techniques like log parsing or triggers, you can build a comprehensive monitoring and auditing system for your PostgreSQL database.
Start leveraging `pg_stat_statements` today to optimize your database performance and gain deeper insights into your query workloads!

# PostgreSQL as a Data Platform: ETL/ELT and Data Warehousing
## Introduction
PostgreSQL, often referred to as "Postgres," is widely recognized as a powerful relational database management system (RDBMS). However, its capabilities extend far beyond traditional database use cases. PostgreSQL can serve as a **full-fledged data platform** for **ETL/ELT processes** and **data warehousing**, thanks to its advanced features, extensibility, and support for semi-structured data like JSONB.
In this post, we'll explore how PostgreSQL can be used as a **data platform for ETL/ELT and data warehousing**, its advantages, and practical examples.
---
## Why PostgreSQL for ETL/ELT and Data Warehousing?
### 1. Extensibility
PostgreSQL supports **extensions** that add functionality for data processing and integration:
- **`http` (pgsql-http)**: For making HTTP requests directly from SQL, enabling data ingestion from APIs.
- **`pg_cron`**: For scheduling jobs within PostgreSQL, useful for automating ETL/ELT workflows.
- **`pg_partman`**: For table partitioning, improving performance for large datasets.
- **`plpython3u`**: For writing Python scripts within PostgreSQL, enabling advanced data transformations.
These extensions transform PostgreSQL into a **versatile ETL/ELT engine** and **data warehouse solution**.
---
### 2. Support for Semi-Structured Data
PostgreSQL excels at handling **semi-structured data** like JSONB, making it ideal for ETL/ELT processes:
- **JSONB**: A binary format for storing and querying JSON data efficiently.
- **Advanced JSONB Querying**: Use operators like `->`, `->>`, `#>>`, and functions like `jsonb_path_query` to extract and manipulate JSON data.
- **Indexing**: Create indexes on JSONB fields for faster querying.
This flexibility allows PostgreSQL to **ingest, transform, and store data** from various sources, including APIs, logs, and unstructured datasets. A short indexing and querying sketch follows.
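As a brief sketch of these capabilities (the `events` table and its payload shape are made up for illustration):
```sql
-- Hypothetical table holding raw JSON payloads
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    payload JSONB
);

-- GIN index to speed up containment (@>) and existence queries on the payload
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- Extract fields with operators...
SELECT
    payload->>'user_id'           AS user_id,
    payload #>> '{metrics,value}' AS metric_value
FROM events
WHERE payload @> '{"type": "signup"}';

-- ...or with a JSONPath expression (PostgreSQL 12+)
SELECT jsonb_path_query(payload, '$.metrics.value') AS metric_value
FROM events;
```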
---
### 3. Advanced Querying Capabilities
PostgreSQL offers powerful querying features for ETL/ELT and data warehousing (a short sketch follows the list):
- **Common Table Expressions (CTEs)**: For complex data transformations.
- **Window Functions**: For analytical queries and aggregations.
- **Foreign Data Wrappers (FDWs)**: To query external data sources like other databases or APIs.
- **Materialized Views**: For precomputing and storing query results, improving performance for repetitive queries.
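As a hedged sketch of CTEs and window functions working together, the query below assumes a `structured_data(user_id, timestamp, value)` table like the one built in the practical example later in this post:
```sql
-- Daily totals per user with a rolling 7-day average
WITH daily AS (
    SELECT
        user_id,
        DATE_TRUNC('day', timestamp) AS day,
        SUM(value) AS daily_value
    FROM structured_data
    GROUP BY user_id, DATE_TRUNC('day', timestamp)
)
SELECT
    user_id,
    day,
    daily_value,
    AVG(daily_value) OVER (
        PARTITION BY user_id
        ORDER BY day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM daily
ORDER BY user_id, day;
```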
---
### 4. Scalability
PostgreSQL can scale to handle large datasets (see the partitioning sketch after this list):
- **Table Partitioning**: Using `pg_partman` to manage large tables efficiently.
- **Parallel Query Execution**: For faster data processing.
- **Citus**: For horizontal scaling and distributed queries (if needed for very large datasets).
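Here is a minimal native range-partitioning sketch (PostgreSQL 11+; table and column names are illustrative). `pg_partman` builds on this by creating and retiring such partitions automatically:
```sql
-- Parent table partitioned by month
CREATE TABLE raw_events (
    id BIGSERIAL,
    event_time TIMESTAMP NOT NULL,
    payload JSONB,
    PRIMARY KEY (id, event_time)
) PARTITION BY RANGE (event_time);

-- Monthly partitions
CREATE TABLE raw_events_2025_01 PARTITION OF raw_events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE raw_events_2025_02 PARTITION OF raw_events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```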
---
### 5. Automation
PostgreSQL supports automation for ETL/ELT workflows:
- **`pg_cron`**: Schedule recurring tasks like data ingestion, transformations, and cleanups.
- **Triggers**: Automate actions based on data changes.
- **Event-Based Processing**: Use `LISTEN` and `NOTIFY` for real-time data processing.
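For instance, a small trigger can publish a notification whenever a row lands in the `api_data` table used later in this post (the channel and function names are illustrative):
```sql
-- Notify listeners whenever new API data arrives
CREATE OR REPLACE FUNCTION notify_new_api_data() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('api_data_inserted', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_notify_new_api_data
AFTER INSERT ON api_data
FOR EACH ROW
EXECUTE FUNCTION notify_new_api_data();

-- A consumer session subscribes with:
LISTEN api_data_inserted;
```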
---
## Use Cases for PostgreSQL as a Data Platform
### 1. ETL/ELT Pipelines
PostgreSQL can serve as the **central hub** for ETL/ELT pipelines:
- **Extract**: Ingest data from APIs, files, or other databases using the `http` extension or FDWs.
- **Transform**: Use SQL queries, Python scripts (`plpython3u`), or JSONB operations to clean and transform data.
- **Load**: Store the transformed data in structured or semi-structured formats.
---
### 2. Data Warehousing
PostgreSQL is an excellent choice for **lightweight data warehousing**:
- **Star Schema**: Design star schemas for analytical queries.
- **Materialized Views**: Precompute aggregations for faster reporting.
- **JSONB for Flexibility**: Store raw data in JSONB format while maintaining structured tables for analysis.
---
### 3. Real-Time Data Processing
PostgreSQL can handle **real-time data processing**:
- **Streaming Data**: Ingest and process streaming data using triggers or `pg_cron`.
- **Real-Time Analytics**: Use materialized views or CTEs for up-to-date insights.
---
## Practical Example: Building an ETL/ELT Pipeline with PostgreSQL
### Step 1: Setting Up PostgreSQL
Start by installing PostgreSQL and enabling the necessary extensions:
```sql
-- Enable extensions for ETL/ELT
CREATE EXTENSION http;        -- For making HTTP requests (provided by pgsql-http)
CREATE EXTENSION pg_cron;     -- For scheduling jobs (also requires shared_preload_libraries = 'pg_cron')
CREATE EXTENSION plpython3u;  -- For Python scripts
```
---
### Step 2: Ingesting Data from an API
Use the `http` extension to fetch data from an API and store the response body in a JSONB column:
```sql
-- Create a table to store API data
CREATE TABLE api_data (
id SERIAL PRIMARY KEY,
raw_data JSONB,
fetched_at TIMESTAMP DEFAULT NOW()
);
-- Fetch data from the API and insert the response body into the table
INSERT INTO api_data (raw_data)
SELECT content::jsonb
FROM http_get('https://api.example.com/data');
```
---
### Step 3: Transforming JSONB Data
Use PostgreSQL's JSONB functions to extract and transform data:
```sql
-- Extract specific fields from JSONB
SELECT
id,
raw_data->>'user_id' AS user_id,
raw_data->>'timestamp' AS timestamp,
raw_data->'metrics'->>'value' AS value
FROM
api_data;
-- Transform and store structured data
CREATE TABLE structured_data AS
SELECT
id,
raw_data->>'user_id' AS user_id,
(raw_data->>'timestamp')::TIMESTAMP AS timestamp,
(raw_data->'metrics'->>'value')::FLOAT AS value
FROM
api_data;
-- Add a unique constraint so the scheduled upsert in Step 4 has a conflict target
ALTER TABLE structured_data
ADD CONSTRAINT structured_data_user_id_timestamp_key UNIQUE (user_id, timestamp);
```
---
### Step 4: Automating ETL/ELT with `pg_cron`
Schedule regular data ingestion and transformation jobs:
```sql
-- Schedule a job to fetch data every hour
SELECT cron.schedule(
'fetch-api-data',
'0 * * * *',
$$
INSERT INTO api_data (raw_data)
SELECT content::jsonb FROM http_get('https://api.example.com/data');
$$
);
-- Schedule a job to transform data daily
SELECT cron.schedule(
'transform-api-data',
'0 0 * * *',
$$
INSERT INTO structured_data (user_id, timestamp, value)
SELECT
raw_data->>'user_id',
(raw_data->>'timestamp')::TIMESTAMP,
(raw_data->'metrics'->>'value')::FLOAT
FROM
api_data
WHERE
fetched_at > NOW() - INTERVAL '1 day'
ON CONFLICT (user_id, timestamp) DO UPDATE SET value = EXCLUDED.value;
$$
);
```
---
### Step 5: Building a Data Warehouse
Create a star schema for analytical queries:
```sql
-- Create fact and dimension tables
CREATE TABLE dim_users (
user_id VARCHAR(50) PRIMARY KEY,
user_name TEXT,
created_at TIMESTAMP
);
CREATE TABLE fact_metrics (
id SERIAL PRIMARY KEY,
user_id VARCHAR(50) REFERENCES dim_users(user_id),
timestamp TIMESTAMP,
value FLOAT,
loaded_at TIMESTAMP DEFAULT NOW()
);
-- Populate the data warehouse
INSERT INTO dim_users (user_id, user_name, created_at)
SELECT DISTINCT
raw_data->>'user_id' AS user_id,
raw_data->>'user_name' AS user_name,
(raw_data->>'created_at')::TIMESTAMP AS created_at
FROM
api_data
ON CONFLICT (user_id) DO NOTHING;  -- safe to re-run without violating the primary key
INSERT INTO fact_metrics (user_id, timestamp, value)
SELECT
raw_data->>'user_id' AS user_id,
(raw_data->>'timestamp')::TIMESTAMP AS timestamp,
(raw_data->'metrics'->>'value')::FLOAT AS value
FROM
api_data;
-- Create a materialized view for reporting
CREATE MATERIALIZED VIEW mv_user_metrics AS
SELECT
u.user_id,
u.user_name,
DATE_TRUNC('day', f.timestamp) AS day,
AVG(f.value) AS avg_value,
MAX(f.value) AS max_value,
MIN(f.value) AS min_value
FROM
dim_users u
JOIN
fact_metrics f ON u.user_id = f.user_id
GROUP BY
u.user_id, u.user_name, DATE_TRUNC('day', f.timestamp);
-- Refresh the materialized view periodically
SELECT cron.schedule(
'refresh-mv-user-metrics',
'0 0 * * *',
'REFRESH MATERIALIZED VIEW mv_user_metrics'
);
```
---
## Challenges and Considerations
### 1. Performance
- **Indexing**: Create indexes on frequently queried columns, including JSONB fields.
- **Partitioning**: Use `pg_partman` to partition large tables by time or other dimensions.
- **Query Optimization**: Use `EXPLAIN ANALYZE` to identify and optimize slow queries.
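For example, `EXPLAIN ANALYZE` reports the chosen plan and actual timings. The query below is an illustrative check against the `api_data` table from the practical example; whether an index scan appears depends on the indexes you have created (e.g. a GIN index on `raw_data`):
```sql
EXPLAIN ANALYZE
SELECT raw_data->>'user_id' AS user_id
FROM api_data
WHERE raw_data @> '{"type": "signup"}';
```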
### 2. Learning Curve
- PostgreSQL's advanced features (e.g., JSONB, FDWs, `pg_cron`) may require time to master.
- Invest in learning SQL, PostgreSQL extensions, and best practices for data modeling.
### 3. Maintenance
- **Regular Backups**: Use tools like `pg_dump` or `barman` to back up your data.
- **Monitoring**: Use tools like `pgBadger` or `Prometheus` to monitor database performance.
- **Vacuuming**: Regularly run `VACUUM` to reclaim space and maintain performance.
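A small maintenance sketch against the example warehouse tables built earlier in this post:
```sql
-- Reclaim dead tuples and refresh planner statistics on the fact table
VACUUM (VERBOSE, ANALYZE) fact_metrics;

-- Check when (auto)vacuum and analyze last ran
SELECT relname, last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'fact_metrics';
```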
---
## Conclusion
PostgreSQL is a **powerful and versatile data platform** that can handle **ETL/ELT processes** and **data warehousing** with ease. Its support for **semi-structured data (JSONB)**, **advanced querying**, and **automation** makes it an excellent choice for modern data workflows.
By leveraging PostgreSQL's extensibility, scalability, and flexibility, you can build **end-to-end data pipelines** without relying on multiple specialized tools. Start exploring its advanced features today and unlock the full potential of your data platform!
---
## Further Reading
- [PostgreSQL Official Documentation](https://www.postgresql.org/docs/)
- [pgsql-http (`http` extension) Documentation](https://github.com/pramsey/pgsql-http)
- [pg_cron Documentation](https://github.com/citusdata/pg_cron)
- [JSONB in PostgreSQL](https://www.postgresql.org/docs/current/datatype-json.html)
- [Citus Documentation](https://docs.citusdata.com/)