From 8cf6460743fd9f427692e43cc8eacb2c620e7787 Mon Sep 17 00:00:00 2001 From: simonpetit Date: Thu, 25 Sep 2025 15:46:51 +0000 Subject: [PATCH] new drafts --- drafts/leveraging_pg_stat_statements.md | 275 ++++++++++++++++++++++++ drafts/postgres_as_a_dataplatform.md | 267 +++++++++++++++++++++++ 2 files changed, 542 insertions(+) create mode 100644 drafts/leveraging_pg_stat_statements.md create mode 100644 drafts/postgres_as_a_dataplatform.md diff --git a/drafts/leveraging_pg_stat_statements.md b/drafts/leveraging_pg_stat_statements.md new file mode 100644 index 0000000..027a7d7 --- /dev/null +++ b/drafts/leveraging_pg_stat_statements.md @@ -0,0 +1,275 @@ +# Leveraging `pg_stat_statements` for Tracking Metrics in PostgreSQL + +## Introduction + +PostgreSQL's `pg_stat_statements` extension is a powerful tool for monitoring and analyzing SQL query performance. While it doesn't directly track the number of rows affected by operations like `INSERT INTO SELECT`, it provides valuable insights into query execution, including **execution time, CPU usage, and the number of times a query is called**. This can be particularly useful for identifying performance bottlenecks and understanding query patterns. + +In this guide, we'll explore how to use `pg_stat_statements` to track and analyze query metrics, and how to complement it with other techniques to capture row-level metrics. + +--- + +## Why Use `pg_stat_statements`? + +`pg_stat_statements` provides the following benefits: + +1. **Query Performance Insights**: Track execution time, CPU usage, and memory consumption for each query. +2. **Query Frequency**: Identify frequently executed queries and their impact on the database. +3. **Optimization Opportunities**: Pinpoint slow queries that need optimization. +4. **Non-Intrusive**: No need to modify queries or add triggers. +5. **Lightweight**: Minimal performance overhead compared to other auditing methods. + +--- + +## Step 1: Install and Enable `pg_stat_statements` + +### Install the Extension + +The `pg_stat_statements` extension is included in the PostgreSQL contrib package. To install it: + +#### For Debian/Ubuntu: +```bash +sudo apt-get install postgresql-contrib +``` + +#### For RHEL/CentOS: +```bash +sudo yum install postgresql-contrib +``` + +### Enable the Extension + +After installing, enable the extension in PostgreSQL: + +1. Add `pg_stat_statements` to `shared_preload_libraries` in `postgresql.conf`: +```ini +shared_preload_libraries = 'pg_stat_statements' +``` + +2. Set the maximum number of statements to track: +```ini +pg_stat_statements.max = 10000 +``` + +3. Set the level of tracking: +```ini +pg_stat_statements.track = all # Track all statements +``` + +Restart PostgreSQL to apply the changes: +```bash +sudo systemctl restart postgresql +``` + +Finally, create the extension in your database: +```sql +CREATE EXTENSION pg_stat_statements; +``` + +--- + +## Step 2: Query `pg_stat_statements` for Metrics + +Once enabled, `pg_stat_statements` collects statistics on SQL queries. You can query the `pg_stat_statements` view to retrieve this information: + +```sql +SELECT + query, + calls, + total_exec_time, + mean_exec_time, + rows, + shared_blks_hit, + shared_blks_read +FROM + pg_stat_statements +ORDER BY + total_exec_time DESC +LIMIT 10; +``` + +### Key Columns in `pg_stat_statements`: +- **`query`**: The normalized SQL query text. +- **`calls`**: The number of times the query was executed. +- **`total_exec_time`**: Total execution time in milliseconds. 
+- **`mean_exec_time`**: Average execution time in milliseconds. +- **`rows`**: Total number of rows retrieved or affected. +- **`shared_blks_hit`**: Number of shared buffer hits. +- **`shared_blks_read`**: Number of shared blocks read from disk. + +--- + +## Step 3: Track Row-Level Metrics + +While `pg_stat_statements` provides the total number of rows retrieved or affected by a query (`rows` column), it doesn't break this down by individual query execution. To capture row-level metrics for each execution, you can combine `pg_stat_statements` with other techniques: + +### Option 1: Use Triggers (If Needed) + +If you need to track row-level metrics for specific operations (e.g., `INSERT INTO SELECT`), you can use triggers as described in previous guides. However, this approach is more intrusive and may impact performance. + +### Option 2: Parse PostgreSQL Logs + +If you prefer a non-intrusive method, parse PostgreSQL logs to extract row-level metrics. Configure PostgreSQL to log detailed information: + +```ini +log_statement = 'all' +log_duration = on +log_min_messages = INFO +``` + +Then, write a script to parse the logs and extract metrics like the number of rows affected by each operation. + +--- + +## Step 4: Automate Metrics Collection + +To keep track of query performance over time, automate the collection of metrics from `pg_stat_statements`. You can create a script to periodically capture and store these metrics in a separate table. + +### Create a Table to Store Metrics + +```sql +CREATE TABLE query_performance_metrics ( + id SERIAL PRIMARY KEY, + capture_time TIMESTAMP DEFAULT NOW(), + query TEXT, + calls INT, + total_exec_time FLOAT, + mean_exec_time FLOAT, + rows BIGINT, + shared_blks_hit BIGINT, + shared_blks_read BIGINT +); +``` + +### Write a Script to Capture Metrics + +Here’s a Python script to capture and store metrics from `pg_stat_statements`: + +```python +import psycopg2 + +def capture_metrics(): + conn = psycopg2.connect( + dbname="your_database", + user="your_user", + password="your_password", + host="your_host" + ) + cursor = conn.cursor() + + # Fetch metrics from pg_stat_statements + cursor.execute(""" + SELECT + query, + calls, + total_exec_time, + mean_exec_time, + rows, + shared_blks_hit, + shared_blks_read + FROM + pg_stat_statements + """) + + metrics = cursor.fetchall() + + # Store metrics in query_performance_metrics + for metric in metrics: + cursor.execute( + """ + INSERT INTO query_performance_metrics ( + query, calls, total_exec_time, mean_exec_time, rows, shared_blks_hit, shared_blks_read + ) + VALUES (%s, %s, %s, %s, %s, %s, %s) + """, + metric + ) + + conn.commit() + cursor.close() + conn.close() + +if __name__ == "__main__": + capture_metrics() +``` + +### Schedule the Script + +Use `cron` to run the script periodically: + +```bash +crontab -e +``` + +Add a line to run the script every hour: +``` +0 * * * * /usr/bin/python3 /path/to/your/script.py +``` + +--- + +## Step 5: Analyze Metrics + +You can now analyze the collected metrics to gain insights into query performance and usage patterns: + +### Example Queries + +1. **Top 10 Slowest Queries by Total Execution Time**: +```sql +SELECT + query, + total_exec_time, + calls, + mean_exec_time +FROM + query_performance_metrics +ORDER BY + total_exec_time DESC +LIMIT 10; +``` + +2. **Queries with the Highest Row Impact**: +```sql +SELECT + query, + rows, + calls +FROM + query_performance_metrics +ORDER BY + rows DESC +LIMIT 10; +``` + +3. 
**Trend Analysis Over Time**: +```sql +SELECT + DATE(capture_time) AS day, + SUM(total_exec_time) AS total_exec_time, + SUM(rows) AS total_rows +FROM + query_performance_metrics +GROUP BY + DATE(capture_time) +ORDER BY + day; +``` + +--- + +## Challenges and Considerations + +1. **Performance Overhead**: While `pg_stat_statements` has minimal overhead, it can still impact performance on heavily loaded systems. Monitor your database performance after enabling it. + +2. **Log Volume**: If you also parse PostgreSQL logs, ensure you have enough storage and a log rotation strategy. + +3. **Query Normalization**: `pg_stat_statements` normalizes queries, which means it groups similar queries together. This can make it harder to track specific instances of a query. + +4. **Security**: Ensure that sensitive information is not exposed in the logged queries. + +--- + +## Conclusion + +`pg_stat_statements` is a powerful tool for tracking and analyzing query performance in PostgreSQL. While it doesn't provide row-level metrics for each query execution, it offers valuable insights into query execution time, frequency, and row impact. By combining `pg_stat_statements` with other techniques like log parsing or triggers, you can build a comprehensive monitoring and auditing system for your PostgreSQL database. + +Start leveraging `pg_stat_statements` today to optimize your database performance and gain deeper insights into your query workloads! \ No newline at end of file diff --git a/drafts/postgres_as_a_dataplatform.md b/drafts/postgres_as_a_dataplatform.md new file mode 100644 index 0000000..45b0445 --- /dev/null +++ b/drafts/postgres_as_a_dataplatform.md @@ -0,0 +1,267 @@ +# PostgreSQL as a Data Platform: ETL/ELT and Data Warehousing + +## Introduction +PostgreSQL, often referred to as "Postgres," is widely recognized as a powerful relational database management system (RDBMS). However, its capabilities extend far beyond traditional database use cases. PostgreSQL can serve as a **full-fledged data platform** for **ETL/ELT processes** and **data warehousing**, thanks to its advanced features, extensibility, and support for semi-structured data like JSONB. + +In this post, we'll explore how PostgreSQL can be used as a **data platform for ETL/ELT and data warehousing**, its advantages, and practical examples. + +--- + +## Why PostgreSQL for ETL/ELT and Data Warehousing? + +### 1. Extensibility +PostgreSQL supports **extensions** that add functionality for data processing and integration: +- **`pg_http`**: For making HTTP requests directly from PostgreSQL, enabling data ingestion from APIs. +- **`pg_cron`**: For scheduling jobs within PostgreSQL, useful for automating ETL/ELT workflows. +- **`pg_partman`**: For table partitioning, improving performance for large datasets. +- **`plpython3u`**: For writing Python scripts within PostgreSQL, enabling advanced data transformations. + +These extensions transform PostgreSQL into a **versatile ETL/ELT engine** and **data warehouse solution**. + +--- + +### 2. Support for Semi-Structured Data +PostgreSQL excels at handling **semi-structured data** like JSONB, making it ideal for ETL/ELT processes: +- **JSONB**: A binary format for storing and querying JSON data efficiently. +- **Advanced JSONB Querying**: Use operators like `->`, `->>`, `#>>`, and functions like `jsonb_path_query` to extract and manipulate JSON data. +- **Indexing**: Create indexes on JSONB fields for faster querying. 
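
As a brief illustration of the operators and indexing options listed above — using a hypothetical `events` table and payload layout — extraction and indexing might look like this:

```sql
-- A hypothetical table holding raw JSON payloads
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    payload JSONB NOT NULL
);

-- Extract fields with the ->, ->> and #>> operators
SELECT
    payload->>'user_id'          AS user_id,
    payload->'metrics'->>'value' AS metric_value,
    payload#>>'{metrics,unit}'   AS metric_unit
FROM events
WHERE payload->>'event_type' = 'purchase';

-- jsonb_path_query expands a JSON path expression into a set of values
SELECT id, jsonb_path_query(payload, '$.tags[*]') AS tag
FROM events;

-- A GIN index accelerates containment (@>) and existence (?) queries
CREATE INDEX idx_events_payload ON events USING GIN (payload);

SELECT * FROM events WHERE payload @> '{"event_type": "purchase"}';
```

A GIN index on the whole column covers containment and existence searches; for a single frequently filtered key, a B-tree expression index on `(payload->>'event_type')` is often smaller and faster.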
+ +This flexibility allows PostgreSQL to **ingest, transform, and store data** from various sources, including APIs, logs, and unstructured datasets. + +--- + +### 3. Advanced Querying Capabilities +PostgreSQL offers powerful querying features for ETL/ELT and data warehousing: +- **Common Table Expressions (CTEs)**: For complex data transformations. +- **Window Functions**: For analytical queries and aggregations. +- **Foreign Data Wrappers (FDWs)**: To query external data sources like other databases or APIs. +- **Materialized Views**: For precomputing and storing query results, improving performance for repetitive queries. + +--- + +### 4. Scalability +PostgreSQL can scale to handle large datasets: +- **Table Partitioning**: Using `pg_partman` to manage large tables efficiently. +- **Parallel Query Execution**: For faster data processing. +- **Citus**: For horizontal scaling and distributed queries (if needed for very large datasets). + +--- + +### 5. Automation +PostgreSQL supports automation for ETL/ELT workflows: +- **`pg_cron`**: Schedule recurring tasks like data ingestion, transformations, and cleanups. +- **Triggers**: Automate actions based on data changes. +- **Event-Based Processing**: Use `LISTEN` and `NOTIFY` for real-time data processing. + +--- + +## Use Cases for PostgreSQL as a Data Platform + +### 1. ETL/ELT Pipelines +PostgreSQL can serve as the **central hub** for ETL/ELT pipelines: +- **Extract**: Ingest data from APIs, files, or other databases using `pg_http` or FDWs. +- **Transform**: Use SQL queries, Python scripts (`plpython3u`), or JSONB operations to clean and transform data. +- **Load**: Store the transformed data in structured or semi-structured formats. + +--- + +### 2. Data Warehousing +PostgreSQL is an excellent choice for **lightweight data warehousing**: +- **Star Schema**: Design star schemas for analytical queries. +- **Materialized Views**: Precompute aggregations for faster reporting. +- **JSONB for Flexibility**: Store raw data in JSONB format while maintaining structured tables for analysis. + +--- + +### 3. Real-Time Data Processing +PostgreSQL can handle **real-time data processing**: +- **Streaming Data**: Ingest and process streaming data using triggers or `pg_cron`. +- **Real-Time Analytics**: Use materialized views or CTEs for up-to-date insights. 
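
As a minimal sketch of the event-based pattern above — the `api_data` table matches the pipeline example later in this post, while the channel and function names are illustrative — a trigger can publish a notification for each new row so that a listening worker reacts immediately:

```sql
-- Hypothetical trigger function that announces new rows on a channel
CREATE OR REPLACE FUNCTION notify_new_api_data()
RETURNS trigger AS $$
BEGIN
    -- Send the new row's id as the notification payload
    PERFORM pg_notify('new_api_data', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_notify_new_api_data
AFTER INSERT ON api_data
FOR EACH ROW
EXECUTE FUNCTION notify_new_api_data();

-- A worker process (or psql session) subscribes and reacts as rows arrive
LISTEN new_api_data;
```

Because `NOTIFY` payloads are limited to roughly 8000 bytes in a default build, the usual pattern is to send only a row id and let the listener fetch the full row.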
---

## Practical Example: Building an ETL/ELT Pipeline with PostgreSQL

### Step 1: Setting Up PostgreSQL
Start by installing PostgreSQL and enabling the necessary extensions. The HTTP extension from the pgsql-http project is created as `http`, and `pg_cron` must be listed in `shared_preload_libraries` before it can be created:
```sql
-- Enable extensions for ETL/ELT
CREATE EXTENSION http;       -- pgsql-http, for making HTTP requests
CREATE EXTENSION pg_cron;    -- For scheduling jobs (requires shared_preload_libraries = 'pg_cron')
CREATE EXTENSION plpython3u; -- For Python scripts
```

---

### Step 2: Ingesting Data from an API
Use `http_get()` to fetch data from an API and store the response body in a JSONB column:
```sql
-- Create a table to store API data
CREATE TABLE api_data (
    id SERIAL PRIMARY KEY,
    raw_data JSONB,
    fetched_at TIMESTAMP DEFAULT NOW()
);

-- Fetch data from the API and insert the response body into the table
INSERT INTO api_data (raw_data)
SELECT content::jsonb
FROM http_get('https://api.example.com/data');
```

---

### Step 3: Transforming JSONB Data
Use PostgreSQL's JSONB functions to extract and transform data:
```sql
-- Extract specific fields from JSONB
SELECT
    id,
    raw_data->>'user_id' AS user_id,
    raw_data->>'timestamp' AS timestamp,
    raw_data->'metrics'->>'value' AS value
FROM
    api_data;

-- Transform and store structured data
CREATE TABLE structured_data AS
SELECT
    id,
    raw_data->>'user_id' AS user_id,
    (raw_data->>'timestamp')::TIMESTAMP AS timestamp,
    (raw_data->'metrics'->>'value')::FLOAT AS value
FROM
    api_data;

-- Add a unique key so the scheduled job in Step 4 can upsert
ALTER TABLE structured_data
    ADD CONSTRAINT structured_data_user_ts_key UNIQUE (user_id, timestamp);
```

---

### Step 4: Automating ETL/ELT with `pg_cron`
Schedule regular data ingestion and transformation jobs:
```sql
-- Schedule a job to fetch data every hour
SELECT cron.schedule(
    'fetch-api-data',
    '0 * * * *',
    $$
    INSERT INTO api_data (raw_data)
    SELECT content::jsonb
    FROM http_get('https://api.example.com/data');
    $$
);

-- Schedule a job to transform data daily
-- (the upsert relies on the unique constraint added in Step 3)
SELECT cron.schedule(
    'transform-api-data',
    '0 0 * * *',
    $$
    INSERT INTO structured_data (user_id, timestamp, value)
    SELECT
        raw_data->>'user_id',
        (raw_data->>'timestamp')::TIMESTAMP,
        (raw_data->'metrics'->>'value')::FLOAT
    FROM
        api_data
    WHERE
        fetched_at > NOW() - INTERVAL '1 day'
    ON CONFLICT (user_id, timestamp) DO UPDATE SET value = EXCLUDED.value;
    $$
);
```

---

### Step 5: Building a Data Warehouse
Create a star schema for analytical queries:
```sql
-- Create fact and dimension tables
CREATE TABLE dim_users (
    user_id VARCHAR(50) PRIMARY KEY,
    user_name TEXT,
    created_at TIMESTAMP
);

CREATE TABLE fact_metrics (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(50) REFERENCES dim_users(user_id),
    timestamp TIMESTAMP,
    value FLOAT,
    loaded_at TIMESTAMP DEFAULT NOW()
);

-- Populate the data warehouse
INSERT INTO dim_users (user_id, user_name, created_at)
SELECT DISTINCT
    raw_data->>'user_id' AS user_id,
    raw_data->>'user_name' AS user_name,
    (raw_data->>'created_at')::TIMESTAMP AS created_at
FROM
    api_data
ON CONFLICT (user_id) DO NOTHING; -- keeps the load idempotent on reruns

INSERT INTO fact_metrics (user_id, timestamp, value)
SELECT
    raw_data->>'user_id' AS user_id,
    (raw_data->>'timestamp')::TIMESTAMP AS timestamp,
    (raw_data->'metrics'->>'value')::FLOAT AS value
FROM
    api_data;

-- Create a materialized view for reporting
CREATE MATERIALIZED VIEW mv_user_metrics AS
SELECT
    u.user_id,
    u.user_name,
    DATE_TRUNC('day', f.timestamp) AS day,
    AVG(f.value) AS avg_value,
    MAX(f.value) AS max_value,
    MIN(f.value) AS min_value
FROM
    dim_users u
JOIN
    fact_metrics f ON u.user_id = f.user_id
GROUP BY
    u.user_id, u.user_name, DATE_TRUNC('day', f.timestamp);

-- Refresh the materialized view periodically
SELECT cron.schedule(
'refresh-mv-user-metrics', + '0 0 * * *', + 'REFRESH MATERIALIZED VIEW mv_user_metrics' +); +``` + +--- + +## Challenges and Considerations + +### 1. Performance +- **Indexing**: Create indexes on frequently queried columns, including JSONB fields. +- **Partitioning**: Use `pg_partman` to partition large tables by time or other dimensions. +- **Query Optimization**: Use `EXPLAIN ANALYZE` to identify and optimize slow queries. + +### 2. Learning Curve +- PostgreSQL's advanced features (e.g., JSONB, FDWs, `pg_cron`) may require time to master. +- Invest in learning SQL, PostgreSQL extensions, and best practices for data modeling. + +### 3. Maintenance +- **Regular Backups**: Use tools like `pg_dump` or `barman` to back up your data. +- **Monitoring**: Use tools like `pgBadger` or `Prometheus` to monitor database performance. +- **Vacuuming**: Regularly run `VACUUM` to reclaim space and maintain performance. + +--- + +## Conclusion +PostgreSQL is a **powerful and versatile data platform** that can handle **ETL/ELT processes** and **data warehousing** with ease. Its support for **semi-structured data (JSONB)**, **advanced querying**, and **automation** makes it an excellent choice for modern data workflows. + +By leveraging PostgreSQL's extensibility, scalability, and flexibility, you can build **end-to-end data pipelines** without relying on multiple specialized tools. Start exploring its advanced features today and unlock the full potential of your data platform! + +--- + +## Further Reading +- [PostgreSQL Official Documentation](https://www.postgresql.org/docs/) +- [pg_http Documentation](https://github.com/pramsey/pgsql-http) +- [pg_cron Documentation](https://github.com/citusdata/pg_cron) +- [JSONB in PostgreSQL](https://www.postgresql.org/docs/current/datatype-json.html) +- [Citus Documentation](https://docs.citusdata.com/) \ No newline at end of file