Monday 30 July 2018

Performance Tuning in PostgreSQL

Performance is very important in PostgreSQL. To get the best performance, the parameters, the hardware and the software all have to be tuned together.

Reasons For Poor Performance:

  1. Poorly Written SQL
  2. Poor session management - the postmaster handles every connection directly instead of a connection pooling tool such as PgBouncer or pgpool
  3. Misconfigured database parameters
  4. OS/IO issues
  5. No database maintenance

Tuning Technique:
  1. Start with the OS - check OS health to make sure the problem really is in the database
  2. Check the application - tune the SQL before tuning the DB configuration
  3. Tune the DB server - identify the biggest bottleneck and tune the area with the biggest potential benefit
  4. Stop - stop when the tuning goal is achieved.
HARDWARE CONFIGURATION:
  1. Focus on disk speed and RAM over CPU speed
  2. SAS disks have better speed than SATA (SAS: 10,000 to 15,000 RPM)
  3. Like other RDBMSs, PostgreSQL is I/O intensive
  4. More spindles - PostgreSQL utilizes multiple disks to parallelize read and write requests on the database
  5. Separate the transaction log - put the database transaction log (pg_xlog) on a dedicated disk resource. Whenever there is a commit, the WAL buffer is flushed to the WAL segments and the pages in the shared buffer area are marked as committed. The transaction log area must therefore be fast to write; a disk with high write speed makes commits complete faster and boosts performance
  6. Keep indexes on a separate mount point from the data directory (see the tablespace sketch after this list)
  7. Keep monitoring data on a separate mount point, and pg_log on a different mount point as well
  8. RAM should be high, as many activities such as sorts and updates happen in memory/buffers, which lessens I/O on disk; the more RAM, the more data the buffers can cache
  9. We can compromise on CPU, since PostgreSQL is not that CPU intensive
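A minimal sketch of putting indexes on their own mount point using a tablespace. The mount point path, index name and column below are assumptions for illustration only, not part of the original setup:

create tablespace idx_space location '/pgindex';                  -- /pgindex: hypothetical dedicated mount point owned by the postgres OS user
create index emp_empno_idx on emp (empno) tablespace idx_space;   -- build the index on the separate disk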
OS CONFIGURATION:
  1. Memory, CPU, Disk(I/O)
  2. Use the available memory on the server
  3. PostgreSQL uses the system shared memory segment, so set "shmmax" to an appropriate size based on the physical RAM in the server
  4. shmmax is set in bytes
  5. shmmax is the shared memory limit for all applications installed on the server: /proc/sys/kernel/shmmax
  6. Per EDB, shmmax can be set to 70 to 90% of the total RAM
  7. The file system makes a difference
  8. A journaling file system is not required for the transaction log.
  9. A journaling filesystem is a file system that maintains a special file called a journal that is used to repair any inconsistencies that occur as the result of an improper shutdown of a computer
  10. Use the file system based on its strengths to find the best fit for the application.
  11. cat /etc/sysctl.conf
DB PARAMETER TUNING:
  • Basic information needed for tuning, but not limited to (the sketch after this list shows how to gather most of it):
  1. Database size
  2. Largest table size
  3. Type and frequency of queries
  4. Available RAM
  5. Number of concurrent connections
  6. Default parameters are set for wide compatibility, not performance
  • Shared_buffers - The area where reads and writes happen. Sets the number of shared memory buffers used by the database server; each buffer is 8KB. The minimum is 16 buffers and at least 2 * max_connections. The default is only 32MB (historically 1000 buffers), but values of a few thousand buffers are not uncommon for production applications. 6 to 15% of available memory is a good general guideline, or 15 to 30% for a reporting server. pgbench can be used for benchmarking the server.
  • Max_connections - Sets the maximum number of concurrent connections. Each user connection has an associated backend process on the server, which is terminated when the user logs off. PostgreSQL is process-per-user: the postmaster creates a process, authenticates the client and then serves its queries. Connection pooling can decrease the overhead on the postmaster by reusing existing backend processes as long as the connection properties stay the same.
  • Work_mem - Amount of memory in KB used by internal sorts and hash joins/tables before switching to temporary disk files. The minimum allowed value is 64KB; it is set in KB and the default is 1024KB (1MB). work_mem is session based and should be sized with max_connections in mind. It does not need a restart: SET works at session level (for example, SET work_mem TO 10240), ALTER USER and ALTER DATABASE can scope it, and a change in postgresql.conf only needs a reload to become permanent.
  • Maintenance_work_mem - Maximum memory in KB used by maintenance operations such as VACUUM, CREATE INDEX and ALTER TABLE ADD FOREIGN KEY. The minimum allowed value is 1024KB. In EDB Postgres Advanced Server the default is managed by DynaTune.
  • Huge_pages - Enables/disables the use of huge memory pages. Valid values are try (the default), on and off. This parameter is only supported on Linux. It may increase performance by using a smaller page table and thus less CPU time on memory management.
  • Temp_file_limit - Maximum amount of disk space that a session can use for temporary files. A transaction attempting to exceed this limit will be cancelled. Default is -1( no limit).
  • Checkpoint - Writes the current in-memory modified pages (known as dirty buffers) to disk. An automatic checkpoint occurs each time checkpoint_segments or checkpoint_timeout is reached.
  • Checkpoint_segments - Maximum distance between automatic WAL checkpoints in log file segments; each log file segment is 16MB, so a checkpoint is forced when 16MB * checkpoint_segments of log has been filled. A larger setting results in fewer checkpoints. The minimum value is 1 and the default is 3. Monitor with select * from pg_stat_bgwriter;: checkpoints_timed counts checkpoints triggered by checkpoint_timeout (every 5 minutes by default), while checkpoints_req counts checkpoints forced because the segments filled up. checkpoints_req should always stay below checkpoints_timed, otherwise checkpoint_segments is getting filled up too quickly. A value of 96 works well even when transaction volume is low.
  • Checkpoint_warning - Writes a warning to the log file if checkpoints caused by filling of the segments happen closer together than this interval (for example, every minute).
  • Fsync - Ensures that all the WAL buffers are written to the WAL log at each commit. When on, fsync() or another wal_sync_method is invoked. Turning it off gives a performance boost but brings a chance of data corruption, so it should only be turned off in lower environments or during the initial loading of a database. synchronous_commit can give a similar benefit for non-critical transactions without any risk of data corruption.
  • Pg_prewarm - Loads an object into the OS cache or shared_buffers (see the sketch below).
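Most of the basic information above, plus the session-level parameters, can be checked with a few queries. A rough sketch (the table name emp is only an example, and pg_prewarm needs its extension created first):

select pg_size_pretty(pg_database_size(current_database()));   -- database size
select relname, pg_size_pretty(pg_total_relation_size(oid))
  from pg_class where relkind = 'r'
 order by pg_total_relation_size(oid) desc limit 5;             -- largest tables
select count(*) from pg_stat_activity;                          -- current concurrent connections
show work_mem;
set work_mem to '10MB';                                         -- session-level change, no restart needed
create extension if not exists pg_prewarm;
select pg_prewarm('emp');                                       -- load the table into shared_buffers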
Some of the important configuration parameters that need to be tuned are below:

Custom settings for a server with 96GB RAM:
fsync = on - ensures data is written to disk after commit
maintenance_work_mem = 8GB - RAM used for vacuum, reindex, analyze
checkpoint_completion_target = 0.8 
effective_cache_size = 88GB - planner's estimate of the cache available from the OS
work_mem = 96MB - For sort operation. Session Based
wal_buffers = 2MB
checkpoint_segments =    128 
checkpoint_timeout = 30min
checkpoint_warning = 586
synchronous_commit = off
shared_buffers = 30GB - used for DB activity (delete/update/alter)
max_connections = 900

log_destination = 'syslog'
syslog_facility = 'local4'
syslog_ident = 'postgres'
log_min_messages = error
log_error_verbosity = default
log_min_error_statement = error
log_min_duration_statement = 30000 - in milliseconds
log_connections = on
log_line_prefix = 'user=%u,db=%d ' -- Needs reload

autovacuum_naptime = 300 - autovacuum sleeps for 300 seconds between runs before starting again
autovacuum_vacuum_threshold = 1000 - minimum number of dead tuples before autovacuum starts on a table
autovacuum_analyze_threshold = 1000 - same, but for triggering analyze
autovacuum_vacuum_scale_factor = 0.1 - autovacuum starts once 10% of a table has changed; this can also be set per table (see the sketch below)
autovacuum_vacuum_cost_delay = 50ms - cost delay for all autovacuum workers
autovacuum_vacuum_cost_limit = 200 - cost limit for all autovacuum workers
full_page_writes = off
effective_io_concurrency = 3
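Autovacuum thresholds can also be overridden for an individual table that is much larger or busier than the rest. A hedged example; the table emp and the values are purely illustrative:

alter table emp set (autovacuum_vacuum_scale_factor = 0.05, autovacuum_vacuum_threshold = 500);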

tcp_keepalives_idle = 60
tcp_keepalives_interval = 60 
tcp_keepalives_count = 2
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0

Vacuum: reuses disk space within the same table
Analyze: updates planner statistics for better query plans
Vacuum Analyze: vacuuming and analyzing together
Vacuum Full Verbose: disk space is released so it can be used by other objects
Vacuum Freeze: PostgreSQL transaction IDs wrap around at about 2 billion; if old rows are not frozen before that point the data is effectively lost to queries, so either vacuum in time or use VACUUM FREEZE to freeze the transaction IDs
Reindex: rebuilds indexes - needs downtime
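A short illustration of the maintenance commands above, run against an example table emp (the table name is just a placeholder):

vacuum verbose emp;            -- reclaim dead-row space for reuse within the table
analyze emp;                   -- refresh planner statistics
vacuum analyze emp;            -- both in one pass
vacuum full verbose emp;       -- rewrite the table and return space to the OS (takes an exclusive lock)
reindex table emp;             -- rebuild all indexes on the table (blocks writes while it runs)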

Maintain DB Activity - Vacuum/Analyze should be scheduled regularly
Query Tuning - pgBadger is an open source tool for analyzing queries and the PostgreSQL instance.

Hardware issues:
We can use the below tools in Linux to analyze the hardware, although they require some Linux knowledge.
  1. top
  2. sar
  3. iostat
  4. free -m
  5. swap
  6. vmstat

SQL Query Optimization:
1. As mentioned earlier, pick the query from pgBadger and examine it with the EXPLAIN / EXPLAIN ANALYZE command. The query plan can also be pasted into http://explain.depesz.com/ for a better presentation. Check whether the plan uses an index scan or a sequential scan and act accordingly; most important is to check the cost, the width and the number of rows in the query plan. Some of the important query tuning parameters are -

     A. enable_hashjoin=off;
     B. join_collapse_limit=1;

The above act as hints to the query planner (see the sketch below).
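Both settings are session level, so they can be tried safely on one connection and reset afterwards. A sketch; the EXPLAIN statement here is only a stand-in for the problem query:

set enable_hashjoin = off;          -- discourage hash joins for this session only
set join_collapse_limit = 1;        -- join tables in the order written in the query
explain analyze select * from emp;  -- re-run the problem query here and compare the plan
reset enable_hashjoin;
reset join_collapse_limit;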

Statement/Query Processing - ( Query Flow)

The application program requests a connection to the PostgreSQL server, transmits a query to the server and waits for the result to be sent back.

Parse - Check syntax, call the traffic cop, identify the query type, invoke the command processor if needed, break the query into tokens. The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct, a parse tree is built and handed back; otherwise an error is returned. The parser stage creates the parse tree using fixed rules only: initially it does not look into the system catalog tables, so it cannot understand the detailed semantics. Semantic analysis is done only within the transaction, not upon receiving the query. The query tree created by the transformation process is structurally similar to the raw parse tree but differs in its details.

Optimize - Planner, Generate plan, Uses database statistics, Query cost calculation, choose best plan.
The task of the planner/optimizer is to create an optimal execution plan. A given SQL query can be executed in various ways that all produce the same result; the optimizer examines the possible execution plans and chooses the fastest one, which may take time to decide when there are many joins. After the fastest plan is determined, it is passed to the executor. A sequential scan plan is always created, and index scan plans are created where an index is available. The join strategies it considers are nested loop, merge and hash joins.

Nested loop Join - The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)

Merge Join - Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.

Hash Join - the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.
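Which strategy the planner picked is visible at the top of the plan. A hedged sketch, assuming a second table dept exists alongside emp (the table and column names here are illustrative):

explain select e.*, d.dname
  from emp e join dept d on e.deptno = d.deptno;
-- the join node in the output will read "Hash Join", "Merge Join" or "Nested Loop"
-- depending on table sizes, available indexes and statistics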

Execute - Executes the query based on the query plan, performs sorts and joins, evaluates qualifications and finally hands back the rows derived.

Below is an example of an EXPLAIN plan:
test=# explain select*from emp;
                      QUERY PLAN
-------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4)
(1 row)
Seq Scan on emp - the type of scan; it would be an index scan if an index were used
Cost - just a unit; it is not in milliseconds or seconds
cost=0.00 - startup cost, the work the server spends before producing the first row
rows=1000 - estimated number of rows returned, based on row sampling
width=4 - width of each row in bytes
cost=0.00..15.00 - most important: the total cost of the query; this is the number to bring down

test=# select relpages,reltuples from pg_class where relname='emp';
 relpages | reltuples
----------+-----------
        5 |      1000
(1 row)
test=# show seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)

test=# show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

The estimated cost is relpages * seq_page_cost + reltuples * cpu_tuple_cost:
test=# select 5*1+1000*0.01;
 ?column?
----------
    15.00
(1 row)
When a query runs for the first time nothing is cached yet, so the planning time is higher.
test=# explain analyze select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.009..0.200 rows=1000 loops=1)
 Planning time: 0.037 ms
 Execution time: 0.382 ms
(3 rows)

test=# explain analyze select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.049..0.227 rows=1000 loops=1)
 Planning time: 0.026 ms
 Execution time: 0.415 ms
(3 rows)

actual time=0.049 -- Actual startup time for processing
actual time=0.049..0.227 - Actual Time for query processing
Compare the planning time with the execution time to see where the time is being spent.
Explain Analyze Verbose - shows a more detailed plan
Explain (costs) select*from emp; - Same as explain command
Explain (analyze, costs) select*from emp; - Same as explain Analyze



Explain (analyze, buffers) select*from emp; - also shows buffer usage (shared hits and disk reads)
test=# Explain (analyze, buffers) select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.012..0.248 rows=1000 loops=1)
   Buffers: shared hit=5 read=62
 Planning time: 0.183 ms
 Execution time: 0.435 ms
(4 rows)
Read - blocks read from disk (I/O from disk)
Shared hit - blocks found in the shared buffer cache
External disk sort - Disk: 3433 kB - the sort spilled to disk because work_mem is too small
Quicksort - Memory: 4545 kB - the sort ran in memory (see the sketch below)
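Whether a sort spills to disk can be checked directly by shrinking and enlarging work_mem for the session. A sketch, assuming the emp table has a column empno (illustrative):

set work_mem = '64kB';                                  -- deliberately tiny
explain analyze select * from emp order by empno;       -- look for "Sort Method: external merge  Disk: ...kB"
set work_mem = '64MB';
explain analyze select * from emp order by empno;       -- now "Sort Method: quicksort  Memory: ...kB"
reset work_mem;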
select pg_prewarm('table_name'); - the full table will be loaded into cache (the pg_prewarm extension must be created first)
Show default_statistics_target; - 100 (the default) roughly means 1% of each table is sampled, while 10,000 means 100%. Keeping it low makes maintenance (ANALYZE) fast, but gives the optimizer less to work with; it is row sampling, and a higher number signals the server to gather and keep more statistics. The target can also be raised per column (see the sketch after the output below).
test=# Show default_statistics_target;
 default_statistics_target
---------------------------
 100
(1 row)
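The sampling target can also be raised for a single column whose estimates are poor, instead of raising it globally. A hedged example on the emp table (the column name is illustrative):

alter table emp alter column empno set statistics 500;
analyze emp;                    -- re-gather statistics with the new per-column target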