Merge branch 'main' of https://git.simonpetit.top/simonpetit/blog
All checks were successful
continuous-integration/drone/push Build is passing
commit b326349656
@@ -37,6 +37,7 @@ pre {
 background-color: lightgrey;
 border-radius: 0.5em;
 display: flex;
+overflow-x: auto;
 }
 
 pre > code {
drafts/postgres_cdc.md
Normal file (50 lines added)
@@ -0,0 +1,50 @@
# Postgres CDC

## What is CDC?

CDC stands for Change Data Capture. It is a mechanism that enables the replication of a database:
we listen to changes on the tables of the database so that we can replicate them into
another database.

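To make the idea concrete, here is a minimal sketch of the "replay" half of CDC. It assumes a hypothetical event shape `(operation, primary key, row)`, which is not what Postgres emits on the wire; it only illustrates that applying captured changes in order rebuilds the table somewhere else.

```python
from typing import Any, Optional

# Hypothetical change-event shape: (operation, primary key, row or None).
ChangeEvent = tuple[str, int, Optional[dict[str, Any]]]


def apply_changes(replica: dict[int, dict[str, Any]], events: list[ChangeEvent]) -> None:
    """Replay captured changes, in order, onto an in-memory copy of a table."""
    for op, key, row in events:
        if op in ("insert", "update"):
            # The event carries the full new version of the row.
            replica[key] = dict(row)
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")


# Changes captured on the source table (made-up data for illustration).
source_changes: list[ChangeEvent] = [
    ("insert", 1, {"name": "alice", "plan": "free"}),
    ("insert", 2, {"name": "bob", "plan": "free"}),
    ("update", 1, {"name": "alice", "plan": "pro"}),
    ("delete", 2, None),
]

replica_table: dict[int, dict[str, Any]] = {}
apply_changes(replica_table, source_changes)
print(replica_table)
```

The order of the events matters: replaying the same list in a different order could resurrect the deleted row or lose the update.
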
This is used in data engineering pipelines to extract data from sources and replicate
it into a data warehouse, data lake or lakehouse, for example.
This way it is possible to analyse the data without impacting the transactional
database, which another application uses as its primary storage.

The other advantage of replicating the database is that the copy can be stored in a different format.
For example, it is possible to store the mirrored database as Parquet files,
or in any other columnar storage format, to speed up analytical queries.

## Replication in Postgres

The database needs some configuration before it can support the kind of replication a CDC data pipeline relies on.

First, add the following three lines to the `postgresql.conf` file (changing them requires a server restart):
- `wal_level=logical`
- `max_replication_slots=10`
- `max_wal_senders=10`

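As a quick sanity check, the three settings above can be verified from the configuration text itself. The sketch below is an assumption-laden helper, not a Postgres tool: `parse_conf` and `missing_for_cdc` are hypothetical names, the parser only handles simple `name = value` lines, and a missing `max_*` line is treated as "server default applies" rather than as an error.

```python
def parse_conf(text: str) -> dict[str, str]:
    """Parse simple `name = value` lines; anything after '#' is a comment."""
    settings = {}
    for raw_line in text.splitlines():
        # Simplification: a '#' inside a quoted value is not handled.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # The '=' is optional in postgresql.conf, so treat it as whitespace.
        parts = line.replace("=", " ", 1).split(None, 1)
        if len(parts) != 2:
            continue
        name, value = parts
        settings[name.lower()] = value.strip().strip("'")
    return settings


def missing_for_cdc(settings: dict[str, str]) -> list[str]:
    """Return human-readable problems that would prevent logical replication."""
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    for name in ("max_replication_slots", "max_wal_senders"):
        value = settings.get(name)
        # An absent line means the server default applies; only flag explicit zeros.
        if value is not None and int(value) < 1:
            problems.append(f"{name} must be at least 1")
    return problems


example_conf = """
# replication settings for the CDC pipeline
wal_level=logical
max_replication_slots=10
max_wal_senders = 10   # trailing comment
"""
print(missing_for_cdc(parse_conf(example_conf)))
```

On a running server, the effective values are better read with `SHOW wal_level;` and friends, since the file may be overridden elsewhere.
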
Here is a quick explanation of what each of these parameters means:

### wal_level

WAL stands for Write-Ahead Log. This is the log in which Postgres
records all changes made to the database.
By default the level is `replica`, which is ... [TODO]
but for CDC we need the highest level, `logical`. This level records every change
happening in the database in a form that can be decoded row by row, to the point that we can
literally reconstruct the database from the logs, which is exactly what CDC is trying to achieve.

### max_replication_slots

Here comes another concept: replication slots.
These are ... [TODO]
Naturally, WAL files are not kept forever, hence we need to configure replication slots
so that unread WAL is not removed before our CDC pipeline has had the chance to read it.

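The retention behaviour described above can be sketched with a toy model. This is not how Postgres implements it (real retention also depends on checkpoints and other settings); `ToyWal` and the slot name `cdc_pipeline` are hypothetical and only illustrate that a slot pins log records until its consumer confirms them.

```python
class ToyWal:
    """Toy log: records are numbered, and each slot pins the records it has not confirmed."""

    def __init__(self):
        self.records = {}        # position -> payload
        self.next_position = 1
        self.slots = {}          # slot name -> last position confirmed by its consumer

    def create_slot(self, name):
        # A new slot starts at the current end of the log.
        self.slots[name] = self.next_position - 1

    def append(self, payload):
        self.records[self.next_position] = payload
        self.next_position += 1

    def read(self, slot):
        """Return every (position, payload) the slot's consumer has not confirmed yet."""
        start = self.slots[slot]
        return [(pos, self.records[pos]) for pos in sorted(self.records) if pos > start]

    def confirm(self, slot, position):
        self.slots[slot] = max(self.slots[slot], position)

    def recycle(self):
        """Drop records confirmed by all slots; with no slot at all, drop everything."""
        keep_after = min(self.slots.values(), default=self.next_position - 1)
        for pos in [p for p in self.records if p <= keep_after]:
            del self.records[pos]


wal = ToyWal()
wal.create_slot("cdc_pipeline")
wal.append("INSERT users id=1")
wal.append("UPDATE users id=1")

wal.recycle()                                  # nothing is dropped: the slot has not read yet
pending = wal.read("cdc_pipeline")
wal.confirm("cdc_pipeline", pending[-1][0])    # the pipeline acknowledges what it consumed

wal.recycle()                                  # now the confirmed records can go
remaining = wal.read("cdc_pipeline")
print(len(pending), len(remaining))
```

The flip side is visible in the model too: a slot whose consumer never confirms anything keeps every record alive, so an abandoned slot makes the log grow without bound.
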
### max_wal_senders

[TODO]

## Publications