The WAL (redo log) is essential to normal database operations. In some sense, “The LOG is the DATABASE” has some truth to it. The WAL stores a record of all activities performed in the database; in other words, it maintains the state of the database system. The WAL is useful in many ways, and one of its most important roles is in replication.
Replication in Postgres makes significant use of the WAL when copying data changes from one database server to another. Every data change, once committed, is replicated to all configured standbys by reading the WAL. It is important that the WAL is stored durably and retained long enough for new changes to be replicated to the standbys.
Postgres aims to be as resource efficient as possible. WAL files can build up quickly and, if not managed well, can fill up the available disk space, bringing the system to a halt. Postgres handles this by safely deleting older WAL files during a checkpoint to recover disk space for other uses. Once the REDO location is confirmed to be ahead of a WAL file (i.e., data has been flushed up to that point), that WAL file and all older files are removed or recycled as part of the database cleanup process. An issue with this process is that a WAL file can be deleted before its changes have been safely replicated downstream to standbys, leading to inconsistency between the primary/publisher and a standby/subscriber. When this occurs, replication normally stops until the issue is fixed or a new full backup is taken from the primary node.
Delete Not WAL
Archiving
Postgres offers a few solutions for keeping WAL files around long enough for standbys to access them before they are deleted. One such solution is continuous archiving. In archiving mode, old WAL files are copied to a separate location before being recycled from the pg_wal directory. The archive location can be a local disk or a remote server. Standbys can still access the archive location during recovery and stay consistent with the primary server. Archiving was used in the HA Postgres post to store old WAL files when setting up the HA cluster.
Archiving is primarily intended for backups and point-in-time recovery (PITR), so it can be used in addition to the other solutions.
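As a sketch, continuous archiving is enabled through the archive_mode and archive_command settings in postgresql.conf. The command pattern below is the canonical example from the Postgres documentation; the destination directory is only a placeholder.

```
# postgresql.conf -- example archiving setup; /mnt/archive is a placeholder path
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'
```

With this in place, completed segments are copied out of pg_wal before they are recycled, and a standby's restore_command can fetch them from the archive during recovery.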
Slots
Another solution for preventing premature WAL deletion is replication slots. A replication slot maintains the state of the WAL being accessed by a standby or subscriber that is tracking changes from the primary. The slot prevents the deletion of WAL files and the removal of old tuple versions that are still required to be replicated downstream. A replication slot is a simple data structure, represented on disk with the following fields:
// in src/include/replication/slot.h
typedef struct ReplicationSlotPersistentData
{
	NameData	name;
	Oid			database;
	ReplicationSlotPersistency persistency;
	TransactionId xmin;
	TransactionId catalog_xmin;
	XLogRecPtr	restart_lsn;
	ReplicationSlotInvalidationCause invalidated;
	XLogRecPtr	confirmed_flush;
	XLogRecPtr	two_phase_at;
	bool		two_phase;
	NameData	plugin;
	char		synced;
	bool		failover;
} ReplicationSlotPersistentData;
The interesting field in the struct is restart_lsn, which indicates the location from which WAL will be read. confirmed_flush points to a WAL location that has been flushed and acknowledged by the standby. xmin maintains the transaction horizon: tuple versions with transaction IDs newer than the value of xmin cannot be removed during a vacuum.
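The LSN-typed fields such as restart_lsn and confirmed_flush are printed in Postgres's textual LSN format, hi/lo: two 32-bit hexadecimal halves of a 64-bit byte position in the WAL. A minimal sketch of the conversion (parse_lsn is a helper of my own, not a Postgres function):

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/17ABBE0' into a 64-bit WAL byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# restart_lsn trails (or equals) confirmed_flush: WAL from the restart
# point onwards must be kept so decoding can be restarted from there.
assert parse_lsn("0/17ABBE0") <= parse_lsn("0/17ABC18")
```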
Creating a replication slot.
postgres=# SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);
slot_name | lsn
-----------+-----------
test_slot | 0/17ABC18
(1 row)
To view all the replication slots available in the database we can query the pg_replication_slots
view
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
test_slot | postgres | f | | 0/17ABBE0 | 0/17ABC18 | f
(1 row)
It displays the field values from the slot data structure shown above. The next WAL position to be read starts at location 0/17ABBE0. A new slot called test_slot has been created, using the test logical decoding output plugin test_decoding. The slot is a permanent one and will therefore be persisted to disk. It is also configured for decoding of prepared transactions, if any are used.
Create a table and put some data in it.
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
For a logical slot, DDL changes are not replicated. When we check for new changes from the slot, only the BEGIN and COMMIT of the transaction will be output.
postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
lsn | xid | data
-----------+-----+------------
0/17B1550 | 764 | BEGIN 764
0/17D3B50 | 764 | COMMIT 764
(2 rows)
Now if we insert some data, we’ll get some meaningful decoded output
postgres=# INSERT INTO foo VALUES (1,1);
INSERT 0 1
postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
lsn | xid | data
-----------+-----+------------------------------------------------------
0/17B1550 | 764 | BEGIN 764
0/17D3B50 | 764 | COMMIT 764
0/17D6CE0 | 765 | BEGIN 765
0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
0/17D6D50 | 765 | COMMIT 765
(5 rows)
We have not actually consumed any data from the slot. The pg_logical_slot_peek_changes function lets us look at the data that is yet to be replicated from the slot; it does not update the location from which the next data will be read. If we check the state of the slot, it should have the same values as before.
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
test_slot | postgres | f | | 0/17ABBE0 | 0/17ABC18 | f
(1 row)
To consume the changes from the WAL at 0/17ABBE0 and advance the slot’s restart_lsn position, the pg_logical_slot_get_changes function is used. The pg_logical_slot_get_changes and pg_logical_slot_peek_changes functions are similar: both share the same logic in pg_logical_slot_get_changes_guts. The major difference is that pg_logical_slot_peek_changes does not move the restart_lsn position after it is done (confirm=false in pg_logical_slot_get_changes_guts). The condition check below determines whether restart_lsn should be updated once reading from the WAL is done:
if (ctx->reader->EndRecPtr != InvalidXLogRecPtr && confirm)
{
	LogicalConfirmReceivedLocation(ctx->reader->EndRecPtr);
	ReplicationSlotMarkDirty();
}
The actual update of restart_lsn is done in the LogicalConfirmReceivedLocation function when the next expected restart point is valid. If the slot’s state is updated, the slot is marked dirty in ReplicationSlotMarkDirty and is scheduled to be flushed to disk at the next checkpoint.
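A much-simplified Python model of this confirm-and-mark-dirty flow may help. The names mirror the C code, but the logic is an illustrative reduction under my own assumptions, not the actual implementation:

```python
class SlotState:
    """Toy stand-in for a replication slot's persistent state."""
    def __init__(self, restart_lsn, confirmed_flush):
        self.restart_lsn = restart_lsn
        self.confirmed_flush = confirmed_flush
        self.candidate_restart_lsn = None    # proposed next restart point
        self.candidate_restart_valid = None  # LSN the standby must confirm first
        self.dirty = False                   # flush to disk at next checkpoint

def confirm_received_location(slot, lsn):
    """Sketch of LogicalConfirmReceivedLocation: record the acknowledged
    position, and advance restart_lsn only once the pending restart point
    has become valid (the standby confirmed past it)."""
    slot.confirmed_flush = lsn
    if slot.candidate_restart_valid is not None and slot.candidate_restart_valid <= lsn:
        slot.restart_lsn = slot.candidate_restart_lsn
        slot.candidate_restart_lsn = None
        slot.candidate_restart_valid = None
        slot.dirty = True  # ReplicationSlotMarkDirty()
```

In this model, the decoding machinery would have proposed candidate_restart_lsn earlier; only when the standby acknowledges an LSN past candidate_restart_valid does the restart point move forward and the slot get marked dirty.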
We can now consume from the slot.
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
-----------+-----+------------------------------------------------------
0/17B1550 | 764 | BEGIN 764
0/17D3B50 | 764 | COMMIT 764
0/17D6CE0 | 765 | BEGIN 765
0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
0/17D6D50 | 765 | COMMIT 765
(5 rows)
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
test_slot | postgres | f | | 0/17ABD78 | 0/17D6E90 | f
Notice the restart_lsn location has been updated to 0/17ABD78. All the changes have now been consumed and reflected in the slot’s state. There are no new changes at the new restart_lsn location.
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
-----+-----+------
(0 rows)
WAL Preservation
The main purpose of replication slots is to prevent deletion of WAL. Any WAL segment older than the one pointed to by restart_lsn is scheduled for deletion during the next checkpoint cycle.
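Which segment file a given LSN pins can be worked out from the LSN itself: with the default 16MB wal_segment_size, a WAL filename is the timeline followed by the 64-bit segment number, each part printed as 8 hex digits. A sketch of that mapping (wal_segment_name is my own helper, loosely mirroring Postgres's XLogFileName macro):

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default 16MB segments

def parse_lsn(text):
    """'0/17ABD78' -> 64-bit WAL byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_segment_name(timeline, lsn_text, seg_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file containing the given LSN."""
    segno = parse_lsn(lsn_text) // seg_size
    segs_per_id = 0x100000000 // seg_size  # segments per 4GB "xlogid"
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)

# The slot's restart_lsn 0/17ABD78 lives in the first segment,
# so that file must be kept.
print(wal_segment_name(1, "0/17ABD78"))  # 000000010000000000000001
```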
Let’s check it out. At this point there is only one WAL segment file, 000000010000000000000001, with the inode number 516146:
$ ls -lhi $PGDATA/pg_wal
total 17M
516146 -rw------- 1 postgres postgres 16M Aug 28 17:37 000000010000000000000001
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 summaries
Add more data to the foo table to fill up the active WAL segment file. The min_wal_size setting is left at its default of 80MB.
postgres=# INSERT INTO foo SELECT i, i FROM generate_series(1, 247580) i;
INSERT 0 247580
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
test_slot | postgres | f | | 0/17ABD78 | 0/17D6E90 | f
The slot’s state has not changed and restart_lsn still points to a location in the original WAL segment. Therefore, the 000000010000000000000001 segment file will not be removed immediately, even after a checkpoint. A new WAL segment file, 000000010000000000000002, has also been created:
$ ls -lhi $PGDATA/pg_wal
total 33M
516146 -rw------- 1 postgres postgres 16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres 16M Aug 28 17:43 000000010000000000000002
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
In order to delete the original segment file, we need to consume from the test_slot slot using the pg_logical_slot_get_changes function. With the wal_recycle setting turned on, the 000000010000000000000001 segment file will be recycled (its inode number remains the same) and renamed to 000000010000000000000003. New WAL records will then be stored in it. You can try it out to see this in action.
Idle Slots and System Resources
It is good practice to delete unused slots in order to release the resources they hold. A slot uses shared memory to store its in-memory state. Unused slots consume memory and hold up the cleanup of old WAL files, leading to disk bloat. The situation can quickly get worse on a busy server: WAL files newer than restart_lsn are preserved for as long as the slot’s state is not updated.
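How much WAL an idle slot is holding back can be checked with the built-in pg_wal_lsn_diff function, e.g. SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) FROM pg_replication_slots;. The same arithmetic as a Python sketch (the LSN values below are made up for illustration):

```python
def parse_lsn(text):
    """'0/17ABD78' -> 64-bit WAL byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_wal_bytes(current_lsn, restart_lsn):
    """Rough equivalent of pg_wal_lsn_diff(current, restart): how many
    bytes of WAL the slot is forcing the server to keep around."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)

# Hypothetical positions: a slot stuck 144MB behind the insert position
print(retained_wal_bytes("0/A7ABD78", "0/17ABD78") // (1024 * 1024))  # 144
```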
After adding more data to the table, multiple WAL files are still stored on disk even after a checkpoint.
$ ls -lhi $PGDATA/pg_wal
total 161M
516146 -rw------- 1 postgres postgres 16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000002
518596 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000003
518597 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000004
518598 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000005
518595 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000006
518599 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000007
518600 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000008
518601 -rw------- 1 postgres postgres 16M Aug 28 17:44 000000010000000000000009
518602 -rw------- 1 postgres postgres 16M Aug 28 17:44 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
The current active WAL segment is 00000001000000000000000A and none of the older segments have been deleted. The pg_wal directory will continue to grow unbounded if the slot remains inactive. To free up space, we need to get rid of the slot.
In order to clearly show the space reclamation, turn off recycling of WAL files (on by default):
$ echo "wal_recycle = off" >> $PGDATA/postgresql.conf
$ pg_ctl -l logfile restart # Assumes data directory is already set to PGDATA
Drop the replication slot
postgres=# SELECT pg_drop_replication_slot('test_slot');
pg_drop_replication_slot
--------------------------
(1 row)
-- Request an immediate checkpoint to clean up older WAL files.
postgres=# CHECKPOINT;
CHECKPOINT
The slot has been successfully deleted, and the pg_wal directory has been freed of old WAL files:
$ ls -lhi $PGDATA/pg_wal
total 17M
518602 -rw------- 1 postgres postgres 16M Aug 28 17:52 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 summaries
The size of the pg_wal directory is now 17MB, down from the previous 161MB, a reduction of roughly 89 per cent. That makes a big difference, especially in a large database where WAL files fill up quickly. The pg_replication_slots view can be used to quickly find and delete idle slots.
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots WHERE active = 'f';
slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
------------+----------+--------+------+-------------+---------------------+--------
test_slot1 | postgres | f | | 0/A116D08 | 0/A116D40 | f
test_slot2 | postgres | f | | 0/A116D40 | 0/A116D78 | f
(2 rows)
-- Delete slots based on how long they have been idle (e.g. 5 minutes)
postgres=# SELECT slot_name, pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE (extract(epoch from now() - inactive_since) / 60) > 5 AND active = 'f';
slot_name | pg_drop_replication_slot
------------+--------------------------
test_slot1 |
test_slot2 |
(2 rows)
Compared to the other solutions, replication slots provide a smoother experience for managing replication and monitoring its progress.