Merge branch 'main' of https://git.simonpetit.top/simonpetit/blog
All checks were successful
continuous-integration/drone/push Build is passing
commit
b326349656
@@ -37,6 +37,7 @@ pre {
  background-color: lightgrey;
  border-radius: 0.5em;
  display: flex;
  overflow-x: auto;
}

pre > code {
50
drafts/postgres_cdc.md
Normal file
@@ -0,0 +1,50 @@
# Postgres CDC

## What is CDC?

CDC stands for Change Data Capture. It is a mechanism for replicating a database:
we listen to the changes made to its tables so that we can apply them to
another database.

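As a rough illustration of the idea, the sketch below replays a list of change events onto an in-memory copy of a table. The event shape (`op`, `key`, `row`) and the function names are assumptions made for this example; they are not the format Postgres emits.

```python
# Toy CDC consumer: replay change events onto a replica kept as a dict.
# The event layout (op / key / row) is hypothetical and only serves the example.

def apply_change(replica, event):
    """Apply one change event to the replica, a dict keyed by primary key."""
    op = event["op"]
    if op in ("insert", "update"):
        # Inserts and updates both leave the latest version of the row.
        replica[event["key"]] = event["row"]
    elif op == "delete":
        replica.pop(event["key"], None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica


def replay(events):
    """Build a replica from scratch by applying the events in order."""
    replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "alice"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "delete", "key": 2},
]

replica = replay(events)
print(replica)
```

The order of the events matters: applying the same changes in a different order could leave the replica in a different state than the source.
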
This is used in data engineering pipelines to extract data from source systems and replicate
it into a data warehouse, data lake or lakehouse, for example.
This way the data can be analyzed without impacting the transactional
database, which another application uses as its primary storage.

Another advantage of replicating the database is that the copy can be stored in a different format.
For example, the mirrored database can be stored as Parquet files,
or in any other columnar storage format, to speed up analytical queries.

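To give an intuition of why a columnar layout helps analytics, here is a small sketch in plain Python. It does not write Parquet files; it only pivots row-oriented records into one list per column, assuming every record has the same keys. The sample data and the `to_columns` helper are made up for the example.

```python
# Row-oriented records, the way a transactional database hands them out.
rows = [
    {"id": 1, "country": "FR", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "FR", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict holding one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column only touches that column's values,
# instead of walking through every field of every row.
total_amount = sum(columns["amount"])
print(total_amount)
```
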
## Replication in Postgres

The database needs some configuration to enable the kind of replication a CDC data pipeline requires.

First, the following three settings must be present in the `postgresql.conf` file:
- `wal_level=logical`
- `max_replication_slots=10`
- `max_wal_senders=10`

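The settings above can be checked with a few lines of Python. This is only a sketch: `parse_conf` and `check_cdc_settings` are hypothetical helpers, the parser handles just the simple `name = value` form with `#` comments, and a setting that is absent from the text is reported as a problem even though the server would fall back to its default.

```python
def parse_conf(text):
    """Parse simple `name = value` lines, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        # Drop trailing comments (simplification: ignores '#' inside quotes).
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip().strip("'")
    return settings


def check_cdc_settings(text):
    """Return a list of problems found for a CDC setup; empty if all is fine."""
    settings = parse_conf(text)
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    for name in ("max_replication_slots", "max_wal_senders"):
        value = settings.get(name)
        if value is None or not value.isdigit() or int(value) < 1:
            problems.append(f"{name} must be set to at least 1")
    return problems


sample = """
wal_level = logical          # needed for CDC
max_replication_slots=10
max_wal_senders=10
"""

print(check_cdc_settings(sample))
print(check_cdc_settings("wal_level = replica"))
```
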
Here is a quick explanation of what each of these parameters means:

### wal_level

WAL stands for Write-Ahead Log. It is the log in which Postgres
records every change made to the database before applying it to the data files.
By default the level is `replica`, which writes enough information for WAL archiving and physical replication to a standby server,
but for CDC we need the highest level, `logical`. This level adds the information needed to decode
the log into individual row changes, to the point that we can reconstruct the tables
from the log, which is exactly what CDC is trying to achieve.

### max_replication_slots

This brings in another concept: replication slots.
A replication slot records how far a given consumer has read the WAL, and `max_replication_slots` caps how many slots the server can hold.
Naturally, WAL is not kept forever, so the CDC pipeline needs a replication slot
to make sure unread WAL is not removed before the pipeline has had the chance to read it.

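The retention behaviour described above can be pictured with a toy model. The `ToyWal` class below is purely illustrative and is not how Postgres is implemented: it keeps log records in memory and only recycles those that every slot has confirmed.

```python
class ToyWal:
    """Append-only log whose records are recycled once every slot has read them."""

    def __init__(self):
        self.records = {}        # position -> payload
        self.next_position = 0   # position the next record will get
        self.slots = {}          # slot name -> first position still needed

    def create_slot(self, name):
        """Register a consumer starting at the current end of the log."""
        self.slots[name] = self.next_position

    def append(self, payload):
        self.records[self.next_position] = payload
        self.next_position += 1

    def read(self, name):
        """Return the records the given slot has not confirmed yet."""
        start = self.slots[name]
        return [(pos, self.records[pos]) for pos in range(start, self.next_position)]

    def confirm(self, name, position):
        """The consumer acknowledges everything up to and including `position`."""
        self.slots[name] = position + 1

    def recycle(self):
        """Drop the records that no slot needs any more."""
        keep_from = min(self.slots.values(), default=self.next_position)
        for pos in [p for p in self.records if p < keep_from]:
            del self.records[pos]


wal = ToyWal()
wal.create_slot("cdc")
for change in ("insert 1", "update 1", "delete 1"):
    wal.append(change)

wal.recycle()                        # nothing confirmed yet, so nothing is dropped
retained_before = len(wal.records)

wal.confirm("cdc", 1)                # the pipeline has processed positions 0 and 1
wal.recycle()
remaining = sorted(wal.records)
print(retained_before, remaining)
```

The flip side, visible in the model, is that a slot whose consumer never confirms anything keeps the log growing.
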
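### max_wal_senders

`max_wal_senders` is the maximum number of WAL sender processes that can run at the same time.
Each connection that streams WAL out of the server, the CDC pipeline's included, uses one of them.
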
## Publications