As databases grow toward the physical hardware limits of an instance, and especially at higher scale with heavy query volume and large data sets, PostgreSQL maintenance becomes a vital practice for ensuring consistent performance and reliability.

Maintenance operations enhance storage efficiency for both table and index data. They also give the query planner the information it needs to make better decisions, which in turn speeds up new writes and queries. By ensuring that storage is used effectively and data retrieval is optimized, maintenance operations contribute to the overall responsiveness and reliability of the database.

In this article we will cover:

  • The three important maintenance operations: VACUUM, ANALYZE, and REINDEX
  • Tuning of autovacuum parameters to suit your specific database workload
  • Strategies for index maintenance

Understanding the Problem

PostgreSQL’s approach to managing table rows involves creating new row versions, or tuples, every time a row is updated. These tuples exist to facilitate the database’s ability to manage data concurrency and integrity through a system known as Multiversion Concurrency Control (MVCC). This system ensures that the database can handle multiple transactions simultaneously without conflict, by maintaining different versions of data.

To illustrate, consider the process of updating a user's first name in the database:

SELECT ctid, id FROM users WHERE id = 1;
-- ctid => (0, 1) => page number 0, tuple number 1
UPDATE users SET first_name = 'Jon' WHERE id = 1;
SELECT ctid, id FROM users WHERE id = 1;
-- ctid => (0, 2) => page number 0, tuple number 2

What is happening here is a demonstration of how PostgreSQL's MVCC system handles updates. Instead of direct modifications on the original data, it maintains versions of rows to support transactional integrity and isolation. This allows for concurrent transactions to view consistent states of the database without locking issues.

The creation of new tuples leads to the presence of "dead tuples" — outdated versions of rows that are no longer in use. While invisible to current transactions, these dead tuples occupy space, leading to a phenomenon known as "bloat".

Tables with high amounts of updates and deletions are particularly susceptible to bloat because each operation creates dead tuples. Over time, without proper cleaning, this accumulation can severely affect the database's performance by wasting storage space and slowing down query processing, emphasizing the need for regular maintenance operations to manage and reduce it.

If you are curious about how your database fares in terms of bloat, you can use a tool like pgstattuple (postgresql.org/docs/current/pgstattuple.html) to obtain statistics about the physical storage of tables and indexes. It helps identify how much space is consumed by live tuples (rows), dead tuples, and free space within a PostgreSQL table or index.
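For example, once the extension is installed you can ask for a physical-storage summary of a single table; the table name below is just an illustration:

CREATE EXTENSION IF NOT EXISTS pgstattuple;  -- ships with the standard contrib package

-- Live tuples, dead tuples and free space for one table
SELECT table_len, tuple_count, dead_tuple_count, dead_tuple_percent, free_space
FROM pgstattuple('users');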

VACUUM and ANALYZE

The VACUUM operation is designed to tackle the accumulation of dead tuples head-on. By methodically identifying and removing these redundant data fragments, it plays a pivotal role in not just preventing table bloat but also in maximizing the database's storage efficiency, maintaining the overall health and performance of the database.

VACUUM scans the table's pages, identifying dead tuples that can be safely removed because they are no longer visible to any transactions. These dead tuples are not accessible or required by any current database transactions, rendering them expendable. The space occupied by these dead tuples is then marked as available for reuse by new tuples, helping to optimize storage utilization.
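A manual run can be as simple as the following; the VERBOSE option prints a summary of the work performed (the table name is only an example):

VACUUM (VERBOSE) users;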

The ANALYZE operation in PostgreSQL plays a pivotal role in optimizing query execution by refreshing the database's statistical data, which is fundamental for the query planner's decision-making process. By assessing the current distribution and characteristics of data across tables and indexes, ANALYZE ensures that the planner has up-to-date insights into the most efficient query paths, whether that involves using specific indexes or determining the optimal join methods. As the database undergoes changes with updates, insertions, and deletions, these statistics can become outdated. Thus, running ANALYZE periodically, either manually or through automated routines, is crucial for maintaining the database's performance, ensuring that query planning is always based on the latest, most accurate data snapshot.

ANALYZE complements the VACUUM operation by ensuring that query performance remains optimal after dead tuples have been removed and the distribution of live tuples may have changed.
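Both operations can be run by hand, and even combined into a single command, for example:

ANALYZE users;
-- or refresh statistics and clean up dead tuples in one pass
VACUUM (ANALYZE, VERBOSE) users;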

Autovacuum

Autovacuum is a background worker process that automates the execution of VACUUM and ANALYZE operations, maintaining database hygiene and performance without the need for manual oversight.

Autovacuum works behind the scenes, monitoring each table's activity levels and data modifications. It is designed to trigger the VACUUM and ANALYZE operations based on specific, configurable thresholds, such as the number of updated or deleted rows. This ensures that the actions to clear dead tuples and update statistical data are taken at the most opportune times, minimizing the impact on the database's operational efficiency.

By operating on a per-table basis, Autovacuum tailors its maintenance strategy to the unique patterns of usage and change each table experiences.
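You can watch this per-table activity through the pg_stat_user_tables view, which tracks dead tuple counts and the last time autovacuum processed each table:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;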

Tuning Autovacuum Parameters

The intelligent orchestration of PostgreSQL's Autovacuum relies heavily on its ability to adapt to the database's unique demands, primarily through the adjustment of its operational thresholds. By default, these thresholds are set conservatively, ensuring a balance between maintenance frequency and system performance. However, the true strength of Autovacuum lies in its configurability, allowing administrators to tailor its behavior to align more closely with the specific needs of their database environments.

As the workload of a database intensifies, especially with an increase in update and delete operations, the necessity for more frequent Autovacuum cycles becomes apparent. To accommodate this need, PostgreSQL enables the fine-tuning of Autovacuum thresholds, allowing for more aggressive cleanup schedules that can preemptively address bloat and ensure sustained query performance. Moreover, for databases juggling multiple tables, enhancing Autovacuum to operate more jobs in parallel can significantly reduce the time to maintain overall database health.

Configurations can be adjusted on a granular level for individual tables or applied globally through modifications to the postgresql.conf file. This flexibility empowers database administrators to precisely dictate how and when Autovacuum performs its essential duties, ensuring that the database's performance is optimized without compromising the system's stability or availability.
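Globally, the same parameters can be adjusted with ALTER SYSTEM (or by editing postgresql.conf directly). As a rough sketch, with illustrative values rather than recommendations:

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- default is 0.2
ALTER SYSTEM SET autovacuum_naptime = '30s';             -- default is 1min
ALTER SYSTEM SET autovacuum_max_workers = 5;             -- default is 3; changing it requires a server restart
SELECT pg_reload_conf();  -- applies the settings that do not require a restart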

Here are some of the parameters that can be adjusted:

  • autovacuum_vacuum_scale_factor

The fraction of a table's total tuples that is added to autovacuum_vacuum_threshold to determine how many dead tuples are needed before a VACUUM is triggered.

ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.01);
  • autovacuum_vacuum_threshold

Specifies the minimum number of dead tuples in a table before the Autovacuum process triggers a VACUUM.

ALTER TABLE my_table SET (autovacuum_vacuum_threshold = 100);
  • autovacuum_vacuum_cost_limit

Sets the amount of accumulated vacuum "cost" an Autovacuum worker can incur before it pauses. The "cost" is a measure PostgreSQL uses to rate-limit the vacuum operation and minimize its impact on database performance. Raising this value means the worker completes more work and runs longer before it sleeps.

  • vacuum_cost_limit

This setting is a global cap that applies to manual vacuum operations and serves as the fallback for Autovacuum operations unless autovacuum_vacuum_cost_limit is explicitly set.

  • autovacuum_vacuum_cost_delay

After accumulating the amount of "cost" specified by the cost limit, the vacuum process sleeps for the number of milliseconds set by this value. A higher delay means Autovacuum slows down, reducing its resource usage and its impact on database performance. A lower value speeds up the vacuum process at the cost of higher immediate resource usage. If you are running a version of PostgreSQL earlier than 12, it may be worth reducing the default value, which was set quite conservatively (20ms) in those older versions; from version 12 onward the default is 2ms. A brief example of adjusting these settings follows below.
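The cost-related settings can also be changed globally without a restart; a quick sketch with illustrative values:

ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- default is -1, i.e. fall back to vacuum_cost_limit (200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms'; -- the default since PostgreSQL 12; older versions defaulted to 20ms
SELECT pg_reload_conf();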

Bloat in indexes

Before presenting the last of our maintenance operations, it helps to understand the problem it addresses: index bloat. Besides table data, bloat also affects indexes, which play a pivotal role in enhancing query performance by pointing to specific data rows.

When rows are updated and new versions are created, the indexes are updated to point to the new row versions. However, the entries pointing to the old versions (now dead tuples) are not automatically removed, leaving behind a trail of entries linked to outdated, inaccessible tuples.

These stale entries are what make up index bloat: they occupy space without serving any purpose for current transactions, diminishing storage efficiency and dragging down the performance of the database.
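The same pgstattuple extension mentioned earlier provides pgstatindex() for B-tree indexes; a low average leaf density is a common symptom of index bloat (the index name here matches the example used below):

SELECT index_size, avg_leaf_density, leaf_fragmentation
FROM pgstatindex('index_trips_on_driver_id');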

REINDEX

To keep your indexes lean, periodically rebuild the important ones with the REINDEX operation. REINDEX creates a new version of the index without the dead tuple entries, effectively compacting it and improving its efficiency.

Starting with version 12, PostgreSQL makes it possible to rebuild an index concurrently, while it is still in use:

REINDEX (VERBOSE) INDEX CONCURRENTLY index_trips_on_driver_id;

The CONCURRENTLY option allows REINDEX to rebuild the index without requiring an exclusive lock on the table. It achieves this by building a new index alongside the old one and swapping them once the process is complete. This approach keeps the database available for reads and writes, significantly reducing downtime.

However, this method requires more total work than a standard index rebuild and takes significantly longer to complete, since it needs to wait for unfinished transactions that might modify the index. Because it allows normal operations to continue while the index is being rebuilt, it is useful for rebuilding indexes in a production environment. The extra CPU, memory, and I/O load imposed by the rebuild may still slow down other operations.

Providing a table name to REINDEX rebuilds all indexes on the specified table:

REINDEX (VERBOSE) TABLE CONCURRENTLY trips;
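While a concurrent rebuild is running (on version 12 and later) you can follow its progress through the pg_stat_progress_create_index view:

SELECT pid, relid::regclass AS table_name, index_relid::regclass AS index_name,
       command, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;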

With this operation in our toolbox, we have now seen the three main maintenance operations in Postgres: VACUUM, ANALYZE, and REINDEX. What other kinds of maintenance are available?

Index Maintenance

Indexes play a pivotal role in optimizing query performance, yet their integration must be approached carefully due to the additional space they require and the latency they introduce to write operations. The following section will explore two crucial maintenance strategies: eliminating unused indexes that impose unnecessary loads on the database, and removing duplicate and overlapping indexes, to simplify database operations and boost performance.

Unused Indexes

Beyond the direct implications of increased storage requirements and added latency to write operations, unused indexes harbor additional, less obvious drawbacks. They can obstruct the execution of Heap Only Tuple (HOT) updates, a mechanism designed for higher efficiency in database operations. Furthermore, these superfluous indexes contribute to prolonged VACUUM operation times and extend the duration of query planning processes. Equally important is their impact on slowing down critical maintenance tasks, including backup and restore operations.

To find unused indexes you can use the following query in your database:

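-- Indexes that have never been scanned since statistics were last reset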
SELECT
 pg_stat_user_indexes.schemaname,
 pg_stat_user_indexes.relname AS table_name,
 pg_stat_user_indexes.indexrelname AS index_name,
 idx_scan,
 idx_tup_read,
 idx_tup_fetch
FROM
 pg_stat_user_indexes
JOIN
 pg_indexes ON pg_stat_user_indexes.indexrelname = pg_indexes.indexname
WHERE
 idx_scan = 0;

Evaluate the identified unused indexes and consider their removal, being mindful that certain indexes, despite their current lack of use, may still be justified due to specific edge cases or future requirements of your system.
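When you do decide to drop one, DROP INDEX CONCURRENTLY avoids taking a lock that would block reads and writes on the table; note that it cannot run inside a transaction block (the index name here is hypothetical):

DROP INDEX CONCURRENTLY IF EXISTS index_users_on_legacy_flag;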

Pruning Duplicate And Overlapping Indexes

A common pitfall is the creation of overlapping indexes, where two or more indexes share similar definitions, leading to redundancy. This redundancy, while seemingly benign, can significantly impact the database's performance.

This often occurs when both multicolumn and single-column indexes exist for the same data fields. An index on first_name alone, for example, overlaps with a composite index on (first_name, last_name): the composite index can already serve queries that filter only on first_name, so the single-column index consumes additional storage and complicates query planning. The database's effort to choose between similar indexes adds unnecessary planning overhead and can delay response times.
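If you want a quick first pass in plain SQL, a simplified sketch like the one below lists indexes that share exactly the same column list on the same table; it deliberately ignores expression indexes, operator classes, and partial-index predicates:

-- Indexes on the same table with identical column lists
SELECT indrelid::regclass AS table_name,
       array_agg(indexrelid::regclass) AS indexes_with_same_columns
FROM pg_index
GROUP BY indrelid, indkey::text
HAVING count(*) > 1;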

To identify and resolve these redundancies, leveraging tools like PgHero becomes invaluable. PgHero, with its .duplicate_indexes() method, efficiently scans for and highlights such overlaps, allowing database administrators to make informed decisions on which indexes to keep and which to prune. By doing so, they not only reclaim valuable storage space but also simplify the decision-making process for the query planner, ensuring a smoother, faster execution path for queries.

Conclusion

Think of maintenance operations as regular health checks for your database. They're essential for keeping your PostgreSQL running smoothly, ensuring efficient data storage, and maintaining speedy query responses. By understanding and implementing these maintenance tasks, you're not just fixing problems as they arise; you're preventing issues before they start, ensuring your database remains robust, responsive, and ready to handle whatever your applications throw at it. Regular maintenance is the key to a healthy, efficient database that scales with your needs.