# Leveraging `pg_stat_statements` for Tracking Metrics in PostgreSQL
## Introduction
PostgreSQL's `pg_stat_statements` extension is a powerful tool for monitoring and analyzing SQL query performance. While it doesn't record the rows affected by each individual execution of an operation like `INSERT INTO ... SELECT`, it provides valuable insights into query execution, including **execution time, buffer usage, and the number of times a query is called**. This is particularly useful for identifying performance bottlenecks and understanding query patterns.
In this guide, we'll explore how to use `pg_stat_statements` to track and analyze query metrics, and how to complement it with other techniques to capture row-level metrics.
---
## Why Use `pg_stat_statements`?
`pg_stat_statements` provides the following benefits:
1. **Query Performance Insights**: Track execution time, rows processed, and buffer (I/O) activity for each normalized query.
2. **Query Frequency**: Identify frequently executed queries and their impact on the database.
3. **Optimization Opportunities**: Pinpoint slow queries that need optimization.
4. **Non-Intrusive**: No need to modify queries or add triggers.
5. **Lightweight**: Minimal performance overhead compared to other auditing methods.
---
## Step 1: Install and Enable `pg_stat_statements`
### Install the Extension
The `pg_stat_statements` extension is included in the PostgreSQL contrib package. To install it:
#### For Debian/Ubuntu:
```bash
sudo apt-get install postgresql-contrib
```
#### For RHEL/CentOS:
```bash
sudo yum install postgresql-contrib
```
### Enable the Extension
After installing, enable the extension in PostgreSQL:
1. Add `pg_stat_statements` to `shared_preload_libraries` in `postgresql.conf`:
```ini
shared_preload_libraries = 'pg_stat_statements'
```
2. Set the maximum number of statements to track:
```ini
pg_stat_statements.max = 10000
```
3. Set the level of tracking:
```ini
pg_stat_statements.track = all # Track all statements
```
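Alternatively, `shared_preload_libraries` can be changed from `psql` with `ALTER SYSTEM`, which writes the value to `postgresql.auto.conf`; a restart is still required for it to take effect:
```sql
-- Same effect as editing postgresql.conf by hand.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
```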
Restart PostgreSQL to apply the changes:
```bash
sudo systemctl restart postgresql
```
Finally, create the extension in your database:
```sql
CREATE EXTENSION pg_stat_statements;
```
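To verify that statistics are being collected, run a few statements and then check the view:
```sql
-- The view should now contain entries for recently executed statements.
SELECT count(*) FROM pg_stat_statements;
-- Confirms the tracking level configured above.
SHOW pg_stat_statements.track;
```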
---
## Step 2: Query `pg_stat_statements` for Metrics
Once enabled, `pg_stat_statements` collects statistics on SQL queries. You can query the `pg_stat_statements` view to retrieve this information:
```sql
SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    shared_blks_hit,
    shared_blks_read
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;
```
### Key Columns in `pg_stat_statements`:
- **`query`**: The normalized SQL query text.
- **`calls`**: The number of times the query was executed.
- **`total_exec_time`**: Total execution time in milliseconds (named `total_time` before PostgreSQL 13).
- **`mean_exec_time`**: Average execution time in milliseconds (named `mean_time` before PostgreSQL 13).
- **`rows`**: Total number of rows retrieved or affected.
- **`shared_blks_hit`**: Number of shared buffer hits.
- **`shared_blks_read`**: Number of shared blocks read from disk.
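For example, the two buffer columns above can be combined into a rough per-statement cache hit ratio to highlight queries that keep going to disk. A sketch:
```sql
-- Hit ratio close to 1 means the statement is served almost entirely from shared buffers.
SELECT
    query,
    calls,
    shared_blks_hit,
    shared_blks_read,
    round(shared_blks_hit::numeric / NULLIF(shared_blks_hit + shared_blks_read, 0), 4) AS hit_ratio
FROM
    pg_stat_statements
ORDER BY
    shared_blks_read DESC
LIMIT 10;
```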
---
## Step 3: Track Row-Level Metrics
While `pg_stat_statements` provides the total number of rows retrieved or affected by a query (`rows` column), it doesn't break this down by individual query execution. To capture row-level metrics for each execution, you can combine `pg_stat_statements` with other techniques:
### Option 1: Use Triggers (If Needed)
If you need to track row-level metrics for specific operations (e.g., `INSERT INTO SELECT`), you can use triggers as described in previous guides. However, this approach is more intrusive and may impact performance.
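A minimal sketch of this approach uses a statement-level trigger with a transition table (PostgreSQL 11+ syntax); the `orders` table and the `insert_audit` log table are hypothetical names used for illustration:
```sql
-- Hypothetical audit table recording how many rows each INSERT statement added.
CREATE TABLE insert_audit (
    table_name    text,
    rows_inserted bigint,
    inserted_at   timestamptz DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_insert_count() RETURNS trigger AS $$
BEGIN
    -- new_rows is the transition table holding every row inserted by the statement.
    INSERT INTO insert_audit (table_name, rows_inserted)
    SELECT TG_TABLE_NAME, count(*) FROM new_rows;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Fires once per INSERT statement on the hypothetical orders table.
CREATE TRIGGER orders_insert_count
AFTER INSERT ON orders
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE FUNCTION log_insert_count();
```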
### Option 2: Parse PostgreSQL Logs
If you prefer a non-intrusive method, parse PostgreSQL logs to extract row-level metrics. Configure PostgreSQL to log detailed information:
```ini
log_statement = 'all'
log_duration = on
log_min_messages = INFO
```
Then, write a script to parse the logs and correlate each statement with its duration. Note that the standard log output records the statement text and its duration, but not the number of rows it affected; if you also need row counts in the logs, loading the `auto_explain` module with `auto_explain.log_analyze = on` will log actual row counts for each plan node.
---
## Step 4: Automate Metrics Collection
To keep track of query performance over time, automate the collection of metrics from `pg_stat_statements`. You can create a script to periodically capture and store these metrics in a separate table. Keep in mind that the view's counters are cumulative since the last server restart or call to `pg_stat_statements_reset()`, so each snapshot records running totals rather than per-interval activity.
### Create a Table to Store Metrics
```sql
CREATE TABLE query_performance_metrics (
    id SERIAL PRIMARY KEY,
    capture_time TIMESTAMP DEFAULT NOW(),
    query TEXT,
    calls BIGINT,
    total_exec_time FLOAT,
    mean_exec_time FLOAT,
    rows BIGINT,
    shared_blks_hit BIGINT,
    shared_blks_read BIGINT
);
```
### Write a Script to Capture Metrics
Here's a Python script to capture and store metrics from `pg_stat_statements`:
```python
import psycopg2

def capture_metrics():
    conn = psycopg2.connect(
        dbname="your_database",
        user="your_user",
        password="your_password",
        host="your_host"
    )
    cursor = conn.cursor()

    # Fetch the current counters from pg_stat_statements
    cursor.execute("""
        SELECT
            query,
            calls,
            total_exec_time,
            mean_exec_time,
            rows,
            shared_blks_hit,
            shared_blks_read
        FROM
            pg_stat_statements
    """)
    metrics = cursor.fetchall()

    # Store a snapshot of the counters in query_performance_metrics
    for metric in metrics:
        cursor.execute(
            """
            INSERT INTO query_performance_metrics (
                query, calls, total_exec_time, mean_exec_time,
                rows, shared_blks_hit, shared_blks_read
            )
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            metric
        )

    conn.commit()
    cursor.close()
    conn.close()

if __name__ == "__main__":
    capture_metrics()
```
### Schedule the Script
Use `cron` to run the script periodically:
```bash
crontab -e
```
Add a line to run the script every hour:
```
0 * * * * /usr/bin/python3 /path/to/your/script.py
```
---
## Step 5: Analyze Metrics
You can now analyze the collected metrics to gain insights into query performance and usage patterns:
### Example Queries
1. **Top 10 Slowest Queries by Total Execution Time**:
```sql
SELECT
    query,
    total_exec_time,
    calls,
    mean_exec_time
FROM
    query_performance_metrics
ORDER BY
    total_exec_time DESC
LIMIT 10;
```
2. **Queries with the Highest Row Impact**:
```sql
SELECT
    query,
    rows,
    calls
FROM
    query_performance_metrics
ORDER BY
    rows DESC
LIMIT 10;
```
3. **Trend Analysis Over Time**:
```sql
SELECT
    DATE(capture_time) AS day,
    SUM(total_exec_time) AS total_exec_time,
    SUM(rows) AS total_rows
FROM
    query_performance_metrics
GROUP BY
    DATE(capture_time)
ORDER BY
    day;
```
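Because the underlying counters are cumulative, summing raw snapshots (as in the trend query above) overstates activity: each snapshot already includes everything recorded before it. A sketch of computing per-interval deltas between consecutive snapshots of the same normalized query, assuming the `query_performance_metrics` table above:
```sql
-- Growth between consecutive snapshots for each normalized query text.
SELECT
    capture_time,
    query,
    calls - LAG(calls) OVER w                     AS calls_delta,
    total_exec_time - LAG(total_exec_time) OVER w AS exec_time_delta,
    rows - LAG(rows) OVER w                       AS rows_delta
FROM
    query_performance_metrics
WINDOW w AS (PARTITION BY query ORDER BY capture_time)
ORDER BY
    query, capture_time;
```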
---
## Challenges and Considerations
1. **Performance Overhead**: While `pg_stat_statements` has minimal overhead, it can still impact performance on heavily loaded systems. Monitor your database performance after enabling it.
2. **Log Volume**: If you also parse PostgreSQL logs, ensure you have enough storage and a log rotation strategy.
3. **Query Normalization**: `pg_stat_statements` normalizes queries, which means it groups similar queries together. This can make it harder to track specific instances of a query.
4. **Security**: Ensure that sensitive information is not exposed in the logged queries.
---
## Conclusion
`pg_stat_statements` is a powerful tool for tracking and analyzing query performance in PostgreSQL. While it doesn't provide row-level metrics for each query execution, it offers valuable insights into query execution time, frequency, and row impact. By combining `pg_stat_statements` with other techniques like log parsing or triggers, you can build a comprehensive monitoring and auditing system for your PostgreSQL database.
Start leveraging `pg_stat_statements` today to optimize your database performance and gain deeper insights into your query workloads!