Tuesday, 11 September 2018

Installing & Configuring POWA (PostgreSQL Workload Analyzer) in Linux

POWA is another very good tool for a PostgreSQL DBA. It collects query statistics through a few extensions and presents them in a UI. Below are the important details on how to install and configure POWA in Linux and view the stats over the web.

Some of the important points of POWA -

  • POWA's current version supports PostgreSQL 9.4 and above; if your PostgreSQL version is older than 9.4, use the separate POWA 1.x release instead. 
  • POWA supports various extensions (pg_stat_statements, pg_qualstats, pg_stat_kcache, hypopg).
Some of the Pre-requisite -

  • POWA must be installed on the same instance where PostgreSQL is installed.
  • shared_preload_libraries needs to be modified (the Postgres service needs a restart).
  • The contrib package needs to be installed.
  • A POWA user should be created with superuser privileges.
  • POWA Web needs to be configured to view the stats.
Links from where the extensions needed for POWA can be downloaded -

  • hypopg - https://github.com/HypoPG/hypopg/releases
  • POWA - https://github.com/powa-team/
  • POWA Installation - https://powa.readthedocs.io/en/latest/quickstart.html#install-powa-from-packages-on-rhel-centos
  • PG_Qualstat - https://github.com/powa-team/pg_qualstats/archive/1.0.2.tar.gz
  • PG STAT Kcache - https://github.com/powa-team/pg_stat_kcache/archive/REL2_1_0.tar.gz
Steps to Implement POWA:
  • Install all the extensions which are not already available. 
  • Mandatory extensions are "pg_stat_statements, powa, pg_stat_kcache, pg_qualstats, hypopg" in one DB and hypopg in all the DBs. 
  • Download the extensions from the above links and build them (make && make install).
  • Make the change in the postgresql.conf file for the parameter - shared_preload_libraries='pg_stat_statements,powa,pg_stat_kcache,pg_qualstats' 
  • Next, create the database powa and create the extensions. 
  • \c powa
  1. CREATE EXTENSION pg_stat_statements;
  2. CREATE EXTENSION btree_gist;
  3. CREATE EXTENSION powa;
  4. CREATE EXTENSION pg_qualstats;
  5. CREATE EXTENSION pg_stat_kcache;
  6. CREATE EXTENSION hypopg; (in all the DBs)
  7. CREATE ROLE powa SUPERUSER LOGIN PASSWORD 'astrongpassword';
  • Install powa-web. 
  • The dependencies to install powa-web are - python, python-psycopg2, python-sqlalchemy & python-tornado.
  • Copy the powa-web config file to /etc/powa-web.conf
  • Edit the powa-web config file with the address and port for the web interface.
  • Start powa-web.
  • Check the console from the web.  
Install Powa -Web:
  • wget https://pypi.io/packages/source/p/powa-web/powa-web-3.1.3.tar.gz
  • tar -zxvf powa-web-3.1.3.tar.gz
  • cd powa-web-3.1.3
  • cp ./powa-web.conf-dist ./powa-web.conf
  • ./powa-web
  • Copy powa-web.conf to /etc/ and start powa-web as the root user. You should be able to view the web UI here - http://172.18.1.9:8888/login
Output of Powa-Web config file:
us02tespd02 ~ # cat /etc/powa-web.conf
servers={
  'us02tespd02': {
    'host': 'localhost',
    'port': '5432',
    'database': 'powa'
  },
'us02tespd03': {
    'host': '172.18.1.8',
    'port': '5432',
    'database': 'powa'
  }
}

cookie_secret="SECRET_STRING"
address="172.18.1.9"
port=8888


Pg_rewind in PostgreSQL

Short Summary:

Pg_rewind is a tool for synchronizing a PostgreSQL cluster with another copy of the same cluster, after the clusters' timelines have diverged. A typical scenario is to bring an old master server back online after failover, as a standby that follows the new master. The result is equivalent to replacing the target data directory with the source one. All files are copied, including configuration files. The advantage of pg_rewind over taking a new base backup, or tools like rsync, is that pg_rewind does not require reading through all unchanged files in the cluster. That makes it a lot faster when the database is large and only a small portion of it differs between the clusters.

Official Documentation Link:

IP Used for the Steps Below:

  • IP - 198.100.1.10 (Master/Primary)
  • IP - 198.100.1.11 (Slave/Replica/Secondary)

Pre-Requisite:

  1. Streaming replication is enabled between the master and replica instances.
  2. wal_log_hints should be enabled in the postgresql.conf file (a sample entry is shown after this list).
  3. The user used for pg_rewind should be a superuser.
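A minimal postgresql.conf sketch for these prerequisites (the wal_level and max_wal_senders values are only plausible examples for an existing 9.6 streaming-replication setup; wal_log_hints is the part pg_rewind actually needs, and changing it requires a restart):

wal_log_hints = on          # needed by pg_rewind (alternatively, initialize the cluster with data checksums)
wal_level = hot_standby     # already required for streaming replication
max_wal_senders = 5         # already required for streaming replication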

Steps:

  1. Suppose there is an issue with the Master DB (198.100.1.10) and it crashed for some reason. We need to fail over (promote, in PostgreSQL terms) so that the application connects and things are back to normal. Therefore, the below command can be used to perform the failover to the Replica (198.100.1.11).
    Command needs to be issued from Replica -  /usr/pgsql-9.6/bin/pg_ctl -D /var/lib/pgsql/9.6/data promote
  2. Now we have the new master after promotion/failover, i.e. 198.100.1.11, and we need to make sure we have a replica for it in place as early as possible.
  3. The traditional way to rebuild replication is to copy the data directory using pg_basebackup and start the replica instance. This can be very time consuming if the DB size is in TBs. Therefore, to reduce the effort and time, we can use pg_rewind; it syncs against the new master and builds the replica in a few minutes, as it only copies the delta of changes.
  4. To sync up the old primary (198.100.1.10) with the new primary (198.100.1.11), we need to stop the old primary (198.100.1.10) using the below command issued on the old primary.
    Command -  /usr/pgsql-9.6/bin/pg_ctl -D /var/lib/pgsql/9.6/data stop
  5. Before we run pg_rewind, it is good to backup the below configuration files -
    A. pg_hba.conf
    B. postgresql.conf
    C. postgresql.auto.conf
    D. recovery.conf
  6. Now we are good to run pg_rewind in the old primary(To be new replica) using the below command -
    Template Command -  /usr/pgsql-9.6/bin/pg_rewind --target-pgdata /var/lib/pgsql/9.6/data --source-server="port=<Source DB Port> host=<Source Host> dbname=postgres user=<Source DB Superuse> password=<Password for Superuser>" --progress
    Example Command -  /usr/pgsql-9.6/bin/pg_rewind --target-pgdata /var/lib/pgsql/9.6/data --source-server="port=5432 host=198.100.1.11 dbname=postgres user=superuser password=SomeStrongPassword" --progress
  7. Sample output after running pg_rewind -
    connected to server
    servers diverged at WAL position 198/C3C13ED8 on timeline 1
    rewinding from last common checkpoint at 198/C3C13E30 on timeline 1
    reading source file list
    reading target file list
    reading WAL in target
    need to copy 629 MB (total source directory size is 51531 MB)
    644481/644481 kB (100%) copied
    creating backup label and updating control file
    syncing target data directory
    Done!
  8. Once pg_rewind is complete, we need to configure the recovery.conf, pg_hba.conf & postgresql.conf files if they are different between the master and the replica.
  9. Sample recovery.conf file contents -
    primary_conninfo = 'host=198.100.1.10 port=5432 user=replicator password=StrongPass'
    trigger_file = '/var/lib/pgsql/9.6/data/failover'
    standby_mode = 'on'
    recovery_target_timeline = 'latest'
  10. The IP of the replication user should be present in pg_hba.conf so that there is no issue with replication after failover. It is always good to keep all the configuration files the same on both the Primary and the Replica.
  11. Lastly, it is good to check whether replication is working fine using the below two commands -
    Command 1 -
    SELECT
    CASE
    WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
    ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())::INTEGER
    END
    AS replication_lag;
    Command 2 -
    Select pg_last_xlog_receive_location(); select pg_last_xlog_replay_location(); -- these values should keep changing and should stay close to each other.
  12. Another simple way to check whether replication is working fine is to create a new DB on the master and check that it replicates to the Replica (if the DBA is allowed to create a test DB).

Auto Explain in PostgreSQL

Some Key Points for Auto Explain:
  1. Auto Explain automatically logs the execution plans of queries in the log file, based on the configuration we set up.
  2. Auto Explain can be set up at the session level (preferred) and also in the postgresql.conf file (which needs a restart).
  3. There is a performance impact if we set up auto explain for all queries, as it takes additional time to explain the queries and write the plans to the log file. 
  4. Auto Explain is supported from PG version 9.3 onwards.
  5. If we add Auto Explain in the postgresql.conf file, it needs a restart of the postgres service. 
  6. Auto Explain output gets logged in the default log file of the postgres instance.
Steps to set up Auto Explain in a session (once we disconnect from the session it will no longer log auto explain plans in the log file):
  1. We need to issue the below commands so that auto explain starts getting logged in log file.
  2.  Command 1 - LOAD 'auto_explain';
  3. Command 2 - SET auto_explain.log_min_duration = 5; (the value is in milliseconds unless a unit is given, e.g. '5s')
  4. Command 3 - SET auto_explain.log_analyze = true; (it logs the plan with actual run times, like EXPLAIN ANALYZE)
  5. This process does not need any restart or reload.
Steps to configure Auto Explain Plan in postgresql.conf file:
  1. Make the below parameter changes in postgresql.conf file.
  2. Parameter 1 - shared_preload_libraries = 'auto_explain'
  3. Parameter 2 - auto_explain.log_min_duration = '5s'
  4. Parameter 3 - auto_explain.log_analyze = true
  5. Restart postgres service.

Wednesday, 15 August 2018

Installing and Configuring PgBadger in Linux

pgBadger is one of the best open source tools for Postgres to keep track of slow queries, most frequent queries, user analysis and much more. Therefore, I am writing this blog with detailed steps to set up pgBadger in Linux and view the report on the web.

Since pgBadger reads the Postgres log file, we need to change the logging configuration so that the correct information is logged and pgBadger can parse the log file appropriately. Below are the changes to make for the Postgres log -

  • log_connections = on
  • log_disconnections = on
  • log_duration = off
  • log_statement = 'none'
  • log_hostname = on
  • log_line_prefix = '%t [%p]: [%l-1] db=%d, %u , app=%a, %h '  
  • log_min_duration_statement = 0
  • log_checkpoints = on
  • lc_messages = 'C'
Below is the configuration change for the log file size, format and rotation -

  • log_truncate_on_rotation = off
  • log_rotation_age = 1d 
  • log_rotation_size = 0
After making all the above changes, verify that the log file is being generated with the new format. After verifying, it is time to download pgBadger from the official site below and configure it - 

Link - https://github.com/dalibo/pgbadger/releases ( Version 9.2)

Steps for installing pgBadger in Linux (RHEL in my case):
  • Download the latest pgBadger using the wget command - wget https://github.com/dalibo/pgbadger/archive/v9.2.tar.gz
  • Untar the tarball - tar xvf v9.2.tar.gz
  • Install from the source code using the below steps - 
  • A. cd pgbadger-9.2  
  • B. perl Makefile.PL 
  • C. make 
  • D. make install
Once we install the pgbadger, we can view the report in text format or we can view the report in HTML format in web. We will see both the process in details - 

Text Format - 

In this process we simply parse the log file and generate a report in text format. The text report stays on the Linux server, and we can view it using the less, more or vi commands. Below is the command for it - 

"pgbadger /data/postgres/data/pg_log/postgresql-2018-07-02_075351.log --outfile /tmp/queries_4.txt"

HTML FORMAT - 

For this format there are a few more steps. We need to make sure Apache (httpd) is installed and have all the reports generated in the web root folder so that we can view them over the web. Below are the detailed steps -
  • Firstly, we need to install apache in the linux system using yum or rpm command. Along with it, there are few more dependencies we need to install. They are - GCC, apr-util and apr.
  • If we do not have apache in repository, instead we can install httpd and the above dependencies.
  • Lastly, we need to create the report in the apache root directory i.e. "/var/www/html" so that apache can read the html files.
  • Below is the command to generate html report - pgbadger --incremental --outdir /var/www/html /data/postgres/data/pg_log/postgresql-2018-07-02_075351.log -o out.html
  • Most important is to cron the above step so that the report is generated every day (a sample cron entry is shown after this list). 
  • You can view the reports online at http://127.0.0.1:80 if it is localhost or the IP where pgbadger is installed.
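A minimal cron sketch for that daily run, assuming pgBadger was installed to /usr/local/bin and reusing the log path and output directory from the command above (adjust both to your environment; the wildcard lets --incremental pick up the rotated daily log files):

0 4 * * * /usr/local/bin/pgbadger --incremental --outdir /var/www/html /data/postgres/data/pg_log/postgresql-*.log > /tmp/pgbadger_cron.out 2>&1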

I have installed pgbadger many times and have good hands on experience. So, please comment below if you face any issue and I will be there to help you. 

Simple Logical PostgreSQL Backup Script (Pg_dump)

Below is a script written by me to make life easier for all DBAs. You just need to change a few things according to your environment and the script is ready for use. Also, you can ask in the comment section if you want a modified script and I can work on that and share it with you. The script below is meant for taking a logical dump; I did not include much else, but as I said you can always tweak it. It is a working script and it makes my life easy. This script can be used for restores, not for point-in-time recovery; for that there is a different script using base backup.

Below are some of the details for the script -

1. I am declaring the backup user, backup directory, date, port and other small pieces of information which get repeated in the script.

2. The first condition checks whether the backup directory exists; if not, it creates one.

3. I have also made a table to note down all the details of the backups, so that I can query it and find out whether a backup succeeded or failed. It gives a better audit history.

4. The script makes a list of all the DBs to be backed up, takes the backup of each database in a for loop, and notes whether each backup was successful or not.

5. Lastly, I am taking the backup of the global objects (roles), which is very important, and making an entry in the table for it too.

6. A sample of how the backup table will look is shown below; it can be taken as a reference if you need a similar sort of backup strategy.

Here goes the below script -

#!/bin/bash
## Backup Script of Postgres
## Script will take backup of individual database and global objects. Script is good if DB size is small and do not need backup of WAL Files
## Author - rahulsaha0309@gmail.com

set -x

export PGPASSWORD=PassWord!
DATE=`(date +"%Y-%m-%d")`
SIMPLE_DATE=`(date +"%Y%m%d")`
BACKUP_DIRECTORY=/tmp/postgres_backup/$SIMPLE_DATE
BACKUP_USER=postgres
PG_DUMP=/data/postgres/bin/pg_dump
PORT=5432
PSQL=/data/postgres/bin/psql
DUMPALL=/data/postgres/bin/pg_dumpall


if [ ! -d $BACKUP_DIRECTORY ]
then
/bin/mkdir  -p $BACKUP_DIRECTORY
else
echo " Backup Directory exist"

fi

TABLE_EXIST=`$PSQL -d audit -t -c "SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE  table_schema = 'public' AND table_name='backup_logs')"`

if [ $TABLE_EXIST = f ]
    then
        $PSQL -d audit -c "create table backup_logs(date date, database varchar, status varchar, backup_location varchar)"
else
    echo "table_exist"
fi


LIST_DB=`$PSQL -t -d postgres -c "select datname from pg_database where datname not like 'postgres';"`

array1=($LIST_DB)
echo ${array1[@]}

for database in "${array1[@]}"

do
BACKUP_DATABASE="$PG_DUMP -U$BACKUP_USER -p $PORT -d $database -f $BACKUP_DIRECTORY/backup_$database.sql"
    echo "Taking backup of DATABASE $database"
$BACKUP_DATABASE

if [ $? -eq 0 ]
then
$PSQL -d audit -c "insert into backup_logs values(current_timestamp, '$database', 'Success', '$BACKUP_DIRECTORY/backup_$database.sql')"
else
$PSQL -d audit -c "insert into backup_logs values(current_timestamp, '$database', 'Failed', '$BACKUP_DIRECTORY/backup_$database.sql')"
fi

done

echo "BACKUP_GLOBAL_ROLES"

$DUMPALL -U$BACKUP_USER -p $PORT -g -f $BACKUP_DIRECTORY/backup_roles.sql


Output in the table showing whether each backup succeeded or not.

audit=# select * from backup_logs;
    date    | database  | status  |                  backup_location
------------+-----------+---------+----------------------------------------------------
 2018-07-24 | template1 | Success | /tmp/postgres_backup/20180724/backup_template1.sql
 2018-07-24 | template0 | Failed  | /tmp/postgres_backup/20180724/backup_template0.sql
 2018-07-24 | test      | Success | /tmp/postgres_backup/20180724/backup_test.sql
 2018-07-24 | test1     | Success | /tmp/postgres_backup/20180724/backup_test1.sql
 2018-07-24 | test2     | Success | /tmp/postgres_backup/20180724/backup_test2.sql
 2018-07-24 | new       | Success | /tmp/postgres_backup/20180724/backup_new.sql
 2018-07-24 | audit     | Success | /tmp/postgres_backup/20180724/backup_audit.sql
(7 rows)
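To restore one of these dumps, a plain psql run against an existing (or freshly created) database is enough, since pg_dump without -C does not create the database itself. Using the paths from the sample output above:

/data/postgres/bin/psql -U postgres -p 5432 -d test -f /tmp/postgres_backup/20180724/backup_test.sql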


Cron Entry:
##Scheduling Postgres Backup - Logical pg_dump backup
21 7 * * * /opt/postgresql/scripts/backup_postgres.sh > /data/postgres_backup/`/bin/date +'\%Y\%m\%d'`/log_file/postgres_backup.out 2>&1

-bash-4.1$ crontab -l
##Scheduling Postgres Backup - Logical pg_dump backup
5 21 * * * /opt/postgresql/scripts/backup_postgres.sh > /data/postgres_backup/backup_log_file/`date +\%Y\%m\%d`-cron.out 2>&1

Parameter Needs Restart in PostgreSQL

We often come across a scenario where we want to know whether a parameter needs a restart or a reload. This article describes how we can find out which parameter needs what. Below are a few parameters which need a restart. 

  1. archive_mode
  2. wal_level
  3. port
  4. max_connections
  5. shared_buffers
  6. max_wal_senders


Some of the parameters below need only a reload (or can even be changed at the session level):
  1. archive_command
  2. log_min_duration_statement
  3. work_mem
  4. maintenance_work_mem

But it is difficult to remember all the parameters, so there is a trick. When we issue the below command, if the result is "postmaster" it means the parameter needs a restart, and if the result is "sighup" it means it needs a reload ("user" or "superuser" means it can be changed within a session). We can also check whether a restart is pending for any parameter. Below are the example commands -

1. This command gives the context, which tells whether the parameter needs a restart.
SQL> select context from pg_settings where name = 'work_mem';

2. The below command tells us whether a restart is pending for the parameter.
SQL> select pending_restart from pg_settings where name = 'work_mem';
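The same pending_restart column can also be used to list every parameter that is waiting for a restart in one go (the column is available from version 9.5 onwards):

SQL> select name, setting from pg_settings where pending_restart;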

Below are some example commands for a few of the parameters -

postgres=# select context from pg_settings where name ='work_mem';
 context
---------
 user

postgres=# select context from pg_settings where name ='port';
  context
------------
 postmaster
(1 row)

postgres=# select context from pg_settings where name ='archive_command';
 context
---------
 sighup
(1 row)

postgres=# select context from pg_settings where name ='archive_mode';
  context
------------
 postmaster
(1 row)



Monday, 30 July 2018

Performance Tuning in PostgreSQL

Performance is very important for PostgreSQL. We need to correctly tweak the parameters, hardware and software all together to get the best performance from PostgreSQL.

Reasons For Poor Performance:

  1. Poorly Written SQL
  2. Poor session management - the postmaster handles every connection directly rather than a connection pooling tool like pgbouncer or pgpool
  3. Misconfigured database parameter
  4. OS I/O issue
  5. No DB Maintenance

Tuning Technique:
  1. Start with OS - Check OS health to make sure problem is in DB
  2. Check the Application - Tune SQL before tuning DB configuration
  3. Tune DB Server - Identify the biggest bottleneck. Tune the area with biggest potential benefit.
  4. Stop - Stop when tuning goal is achieved. 
HARDWARE CONFIGURATION:
  1. Focus on disk speed and RAM over CPU speed
  2. SAS has better disk speed than SATA (SAS - 10,000 to 15,000 rpm).
  3. Like other RDBMS, postgreSQL is IO intensive
  4. More spindles - PostgreSQL utilizes multiple disks to parallelize read and write request on the database
  5. Separate the transaction logs - put the database transaction log (pg_xlog) on a dedicated disk resource. Whenever there is a commit, the WAL buffer is flushed to the WAL segments and the shared buffer pages are marked as committed. The transaction log area must be fast to write, so the disk should have high write speed; commits will then complete faster and performance will improve.
  6. Index on separate mount point other than data directory.
  7. Keep monitoring data in separate mount point. Pg_log in different mount point.
  8. RAM should be high, as many activities such as sorts and updates happen in memory/buffers. That lessens the IO on disk: if RAM is high, the buffer can cache more data.
  9. We can compromise on CPU. PostgreSQL is not that CPU intensive.
OS CONFIGURATION:
  1. Memory, CPU, Disk(I/O)
  2. Use the available memory on the server
  3. PostgreSQL uses the system shared memory segment so set “shmmax” to an appropriate size based on the physical RAM in the server
  4. Shmmax is set in bytes
  5. shmmax is the shared memory limit for all applications installed on the server. /proc/sys/kernel/shmmax
  6. As per EDB, shmmax can be set to 70 to 90% of the total RAM.
  7. The file system makes a difference
  8. A journaling file system is not required for the transaction log.
  9. A journaling filesystem is a file system that maintains a special file called a journal that is used to repair any inconsistencies that occur as the result of an improper shutdown of a computer
  10. Use the file system based on its strengths to find the best fit for the application.
  11. Cat /etc/sysctl.conf
DB PARAMETER TUNING:
  • Basic information needed for tuning but not limited to:
  1. Database size
  2. Largest table size
  3. Type and frequency of queries
  4. Available RAM
  5. Number of concurrent connection
  6. Default parameters are for wide compatibility not performance
  • Shared buffers - the area where reads and writes happen. Sets the number of shared memory buffers used by the database server; each buffer is 8kb. The minimum value is 16 buffers and at least twice max_connections. The default setting is 1000, however values of a few thousand are not uncommon for production applications. 6 to 15% of available memory is a good general guideline (15 to 30% for a reporting server). pgbench can be used for benchmarking the server. 32MB is the default shared_buffers.
  • max_connections - sets the maximum number of concurrent connections. Each user connection has an associated backend process on the server, which is terminated when the user logs off. Connection pooling can decrease the overhead on the postmaster by reusing existing backend processes: PostgreSQL is process-per-user, the postmaster creates a process, authenticates it and lets it serve the session, and a pooler lets new clients reuse the same server connection as long as the protocol stays the same.
  • work_mem - amount of memory in KB to be used by internal sorts and hash tables/joins before switching to temporary disk files. The minimum allowed value is 64KB; it is set in KB and the default is 1024KB (1MB). work_mem is session based and should be set in accordance with max_connections. It does not need a restart: the SET command works here, and ALTER USER or ALTER DATABASE can also be used for this parameter; only a permanent change in postgresql.conf needs a reload (see the examples after this list).
  • maintenance_work_mem - maximum memory in KB used by maintenance operations such as VACUUM, CREATE INDEX and ALTER TABLE ADD FOREIGN KEY. The minimum allowed value is 1024KB. (The default is managed by DynaTune in EDB Postgres Advanced Server.)
  • Huge_pages - Enables/disables the use of huge memory pages. Valid values are try(default), on, off. This parameter is only supported in Linux. May help in increasing performance by using smaller page table thus less CPU time on memory management.
  • Temp_file_limit - Maximum amount of disk space that a session can use for temporary files. A transaction attempting to exceed this limit will be cancelled. Default is -1( no limit).
  • Checkpoint - writes the current in-memory modified pages (known as dirty buffers) to disk. An automatic checkpoint occurs each time checkpoint_segments or checkpoint_timeout is reached.
  • checkpoint_segments - maximum distance between automatic WAL checkpoints, in log file segments. Each log file segment is 16MB, so a checkpoint is forced when that much WAL has been written (16MB * checkpoint_segments). A larger setting results in fewer checkpoints. The minimum value is 1 and the default is 3. In select * from pg_stat_bgwriter, checkpoints_timed counts checkpoints triggered by checkpoint_timeout (every 5 minutes by default); checkpoints_req should always stay lower than checkpoints_timed, otherwise it means checkpoint_segments is filling up too quickly. 96 is a reasonable value even when the transaction volume is low.
  • checkpoint_warning - prints a warning in the error log file if checkpoints caused by filling WAL segments happen closer together than the number of seconds set in the parameter.
  • fsync - ensures that all WAL buffers are written to the WAL log at each commit; when on, fsync() or another wal_sync_method is invoked. Turning it off gives a performance boost but there is a chance of data corruption, so it should only be turned off in lower environments or during the initial loading of a database. synchronous_commit = off can give a similar benefit for non-critical transactions without any risk of data corruption.
  • pg_prewarm - loads an object into the OS cache or shared_buffers. 
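A quick sketch of the different ways work_mem can be set, as described in the work_mem bullet above (the role report_user and the database reporting are hypothetical names used only for illustration):

SET work_mem = '10MB';                           -- session level, effective immediately
ALTER USER report_user SET work_mem = '64MB';    -- default for one (hypothetical) role
ALTER DATABASE reporting SET work_mem = '64MB';  -- default for one (hypothetical) database
-- for a cluster-wide default, change work_mem in postgresql.conf and run: select pg_reload_conf();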
Some of the important config parameters which need to be tuned are below:

Custom:96GB RAM
fsync = on -Ensures data is written to disk after commit
maintenance_work_mem = 8GB - RAM used for vacuum, reindex, analyze
checkpoint_completion_target = 0.8 
effective_cache_size = 88GB - Cache for operation from OS
work_mem = 96MB - For sort operation. Session Based
wal_buffers = 2MB
checkpoint_segments =    128 
checkpoint_timeout = 30min
checkpoint_warning = 586
synchronous_commit = off
shared_buffers = 30GB - Used for DB activity - Delete/Update/alter/
max_connections = 900

log_destination = 'syslog'
syslog_facility = 'local4'
syslog_ident = 'postgres'
log_min_messages = error
log_error_verbosity = default
log_min_error_statement = error
log_min_duration_statement = 30000 - in millisecond
log_connections = on
log_line_prefix = 'user=%u,db=%d ' -- Needs reload

autovacuum_naptime = 300 - autovacuum sleeps for 300 sec and then starts again
autovacuum_vacuum_threshold = 1000 - minimum number of dead tuples before autovacuum starts on a table
autovacuum_analyze_threshold = 1000 - same, but for auto analyze
autovacuum_vacuum_scale_factor = 0.1 - autovacuum starts when 10% of the table has changed; can also be set per table
autovacuum_vacuum_cost_delay = 50ms - cost delay for all autovacuum workers
autovacuum_vacuum_cost_limit = 200 - cost limit for all autovacuum workers
full_page_writes = off
effective_io_concurrency = 3

tcp_keepalives_idle = 60
tcp_keepalives_interval = 60 
tcp_keepalives_count = 2
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0

Vacuum: Reuse disk space for same table
Analyze: update stats in pg_stat for better query plan
Vacuum Analyze: Vacuuming and Analyzing together
Vacuum full verbose: Disk space is released to be used by other object
Vacuum freeze - PostgreSQL transaction IDs wrap around after about 2 billion transactions; if old rows are not frozen before that, they can appear to vanish (transaction ID wraparound). So we either need to vacuum in time or use vacuum freeze to freeze the old transaction IDs.
Reindex: Rebuilds index - Needs downtime

Maintain DB Activity - Vacuum/Analyze should be scheduled regularly
Query Tuning - pgBadger is an open source tool for analyzing queries and the PostgreSQL instance. 

Hardware issues:
We can use the below tools in Linux to analyze the hardware. Although it needs Linux knowledge.
  1. top
  2. sar
  3. iostat
  4. free -m
  5. Swap
  6. VMstat

SQL Query Optimization:
1. As mentioned earlier, pick the query from pgBadger and analyze it using the EXPLAIN (or EXPLAIN ANALYZE) command. Also, we can take the query plan and paste it into http://explain.depesz.com/ for a better presentation. We can check for index scans or sequential scans in the query plan and act accordingly. Most important is to check the cost, width and number of rows in the query plan. Some of the important query tuning parameters are -

     A. enable_hashjoin=off;
     B. join_collapse_limit=1;

The above act as hints to the query planner.
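A small sketch of how these settings are usually applied just for one problematic query and then reset (the dept table and the join column are hypothetical; only emp appears in the examples below):

SET enable_hashjoin = off;
SET join_collapse_limit = 1;
EXPLAIN ANALYZE SELECT * FROM emp e JOIN dept d ON d.deptno = e.deptno;  -- dept/deptno are hypothetical
RESET enable_hashjoin;
RESET join_collapse_limit;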

Statement/Query Processing - ( Query Flow)

The application program requests a connection to the PostgreSQL server, transmits a query to the server and waits for the result to be sent back.

Parse - check syntax, call the traffic cop, identify the query type, hand off to the command processor if needed, break the query into tokens. The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct, a parse tree is built up and handed back; otherwise an error is returned. The parser stage creates the parse tree using fixed rules only; initially it does not look into the system catalog tables, so it cannot understand the detailed semantics. Semantic analysis is done only during the transformation process, not upon receiving the query. The query tree created by the transformation process is structurally similar to the raw parse tree but differs in details. 

Optimize - Planner, Generate plan, Uses database statistics, Query cost calculation, choose best plan.
The task of the planner/optimizer is to create an optimal execution plan. A given SQL query can be executed in various ways while producing the same result. The optimizer examines the possible execution plans and picks the one estimated to be fastest. It may take time to decide when there are many joins. After the fastest plan is determined, it is passed to the executor. A sequential scan plan is always created, and an index scan plan is also created where possible. The join strategies it considers are nested loop, merge and hash join. 

Nested loop Join - The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)

Merge Join - Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.

Hash Join - the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

Execute - Execute query based on query plan, performs sorts and joins, evaluates qualifications and finally hands back the row derived.

Below is the example of a EXPLAIN PLAN:
test=# explain select*from emp;
                      QUERY PLAN
-------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4)
(1 row)
Seq Scan on emp - Type of scanning. It may be index if it uses index.
Cost - It is just an unit. It is not in millisec or sec
cost=0.00 - Startup cost. Server might spend in beginning itself.
rows=1000 - Estimated number of rows scanned. Row sampling
width=4 - Width of each row in byte
cost=0.00..15.00 - Most important - total cost of the query. This is the cost that should be brought down. 

test=# select relpages,reltuples from pg_class where relname='emp';
 relpages | reltuples
----------+-----------
        5 |      1000
(1 row)
test=# show seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)

test=# show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

test=# select 5*1+1000*0.01;
Relpages*seq_page_cost + reltuples*cpu_tuple_cost
 ?column?
----------
    15.00
(1 row)
The first execution of a query has to be parsed and planned from scratch (nothing is cached yet), so the planning time is higher the first time.
test=# explain analyze select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.009..0.200 rows=1000 loops=1)
 Planning time: 0.037 ms
 Execution time: 0.382 ms
(3 rows)

test=# explain analyze select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.049..0.227 rows=1000 loops=1)
 Planning time: 0.026 ms
 Execution time: 0.415 ms
(3 rows)

actual time=0.049 -- Actual startup time for processing
actual time=0.049..0.227 - Actual Time for query processing
Planning time > Execution Time
Explain Analyze Verbose - Shows more details plan
Explain (costs) select*from emp; - Same as explain command
Explain (analyze, costs) select*from emp; - Same as explain Analyze



Explain (analyze, buffers) select*from emp;
test=# Explain (analyze, buffers) select*from emp;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on emp  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.012..0.248 rows=1000 loops=1)
   Buffers: shared hit=5 read=62
 Planning time: 0.183 ms
 Execution time: 0.435 ms
(4 rows)
Read - Read from disk . IO from disk
Shared hit - Read from buffer
External Disk Sort - 3433 KB - the sort spilled to disk because work_mem was too small
Quicksort Memory - 4545 KB - the sort was done in memory. 
Select pg_prewarm('table_name'); -- the full table will be loaded into cache
Show default_statistics_target; 100 means roughly 1% of each table is sampled and 10,000 means 100%. Keep it low so that ANALYZE stays fast, though that gives the optimizer less detail. It is row sampling; a higher number signals the server to gather and keep more statistics.
test=# Show default_statistics_target;
 default_statistics_target
---------------------------
 100
(1 row)
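The statistics target can also be raised for a single column instead of changing default_statistics_target globally; a minimal sketch using the emp table from above (the column name id and the value 500 are only illustrative):

ALTER TABLE emp ALTER COLUMN id SET STATISTICS 500;
ANALYZE emp;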







Monday, 29 January 2018

Partitions in PostgreSQL

Below are some important points regarding "Partitions" in PostgreSQL -
  1. PostgreSQL supports only basic table partitioning up to version 9.6.
  2. Query performance can be increased dramatically in situations where most of the heavily accessed rows of the table are in a single partition. 
  3. Partitioning substitutes for the leading columns of indexes, reducing index size and making it more likely that the heavily used parts of the indexes fit in memory. 
  4. When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of a sequential scan of that partition instead of an index scan with random-access reads scattered across the whole table. 
  5. Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design
  6. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
  7. Each partition must be created as a child table of a single parent table.
  8. The parent table itself is normally empty; it exists just to represent the entire data set.
  9. PostgreSQL supports partitioning via table inheritance until version 9.6
  10. Declarative partitioning is the new feature which is supported from PostgreSQL version 10.0
PARTITION USING TABLE INHERITANCE:
  1. Create the “parent” table, from which all of the partitions will inherit.
  2. Create several “child” tables (each representing a partition of the data) that each inherit from the parent.
  3. Add constraints to the partition tables to define the row values in each partition.
  4. Create indexes on any parent and child tables individually. (Indexes do not propagate from the parent tables to child tables).
  5. Write a suitable trigger function on the master (parent) table so that inserts into the parent table redirect into the appropriate partition table (a minimal sketch is shown after this list).
  6. Create a trigger that calls the trigger function.
  7. Redefine the trigger function when the set of child tables changes.
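A minimal sketch of such a trigger function and trigger, assuming monthly child tables such as measurement_y2006m02 that inherit from a parent table measurement with a logdate column (the same names as in the declarative-partitioning example further below, used here purely for illustration):

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2006-02-01' AND NEW.logdate < DATE '2006-03-01' THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF NEW.logdate >= DATE '2006-03-01' AND NEW.logdate < DATE '2006-04-01' THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range, fix measurement_insert_trigger()';
    END IF;
    RETURN NULL;  -- the row has already been routed into a child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();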
PROBLEMS:
  1. Data consistency depends on the CHECK constraints specified on each child table.
  2. INSERT and COPY commands do not automatically propagate data to other child tables in the inheritance hierarchy, but instead rely on triggers, resulting in slower inserts.
  3. Substantial manual work is required to create and maintain child sub-tables.
  4. Given 3, re-partitioning is non-trivial (e.g., when adding disks, making schema changes).
  5. Ensuring uniqueness across partitioned tables is not supported.
  6. Indexes, constraints, and many table maintenance commands need to be applied to child tables explicitly. This greatly complicates management tasks.
DECLARATIVE PARTITIONS ( Version 10 Onwards):

Currently, declarative partitioning supports RANGE and LIST partitions:

RANGE — partition the table into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, device_id.

LIST— partition the table by explicitly listing which key values appear in each partition, and are limited to a single column. For example, device_type.

How does it work?

Create a partitioned table using the PARTITION BY clause, which includes the partitioning method (RANGE in this example) and the list of column(s) to use as the partition key.

CREATE TABLE measurement (
 city_id int not null,
 logdate date not null,
 peaktemp int,
 unitsales int
) PARTITION BY RANGE (logdate);

After creating the partitioned table, partitions can be manually created:

CREATE TABLE measurement_y2006m02 PARTITION OF measurement
 FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

CREATE TABLE measurement_y2006m03 PARTITION OF measurement
 FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');

CREATE TABLE measurement_y2006m04 PARTITION OF measurement
 FOR VALUES FROM ('2006-04-01') TO ('2006-05-01');
...
CREATE TABLE measurement_y2007m11 PARTITION OF measurement
 FOR VALUES FROM ('2007-11-01') TO ('2007-12-01');

And then indexes can be created:

CREATE INDEX ON measurement_y2006m02 (logdate);
CREATE INDEX ON measurement_y2006m03 (logdate);
CREATE INDEX ON measurement_y2006m04 (logdate);
CREATE INDEX ON measurement_y2007m11 (logdate);

The partitioned table functions similarly to the parent or “master” table in table inheritance but is much improved, serving as a much more powerful query interface to the underlying child table partitions. For instance, commands such as TRUNCATE and COPY now propagate to child tables via execution on the partitioned table. Additionally, users can insert data into underlying child tables via the partitioned table, since tuples (i.e., rows) now automatically route to the right partition on INSERT and no longer depend on triggers, as is the case with table inheritance.
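As a small illustration of that automatic routing, an insert through the partitioned table lands in the matching partition, which can be confirmed with tableoid (the values are arbitrary):

INSERT INTO measurement (city_id, logdate, peaktemp, unitsales)
VALUES (1, '2006-02-15', 25, 100);

SELECT tableoid::regclass, * FROM measurement;   -- shows measurement_y2006m02 as the storing partition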

Index in PostgreSQL

There are different types of index in PostgreSQL. They are below:
  1. B-tree
  2. Hash
  3. GiST
  4. SP-GiST
  5. GIN
  6. Single Column Index
  7. Multi Column Index ( Composite Index)
  8. Functional Index
  9. Index on Foreign Key
  10. Unique Index
  11. Partial Index
  12. Implicit Index 

When an index is created, it blocks the table for writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but DML (write) and DDL operations will be in a waiting state until the index build is done. 

PostgreSQL also supports building indexes without locking out writes; we need to specify the CONCURRENTLY option while creating the index. With this option, PostgreSQL must perform two scans of the table and must wait for all existing transactions that could potentially use or modify the index to terminate, so it takes more time to complete. But it allows writes to continue in production, which is helpful, although it will slow down other operations. An example is shown below.
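A minimal sketch against the emp table used in the examples above (the index name and the empno column are hypothetical):

CREATE INDEX CONCURRENTLY idx_emp_empno ON emp (empno);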

B-Tree - Balanced Tree are used for Equality and Range Queries. All data types and NULL values.

Hash - Hash Indexes are only useful for equality comparisons, but you pretty much never want to use them since they are not transaction safe, need to be manually rebuilt after crashes, and are not replicated to followers.

GIN - Generalized Inverted Index are useful when an index must map many values to one row, whereas B-Tree indexes are optimized for when a row has a single key value. GINs are good for indexing array values as well as for implementing full-text search.

GIST - Generalized Search Tree - indexes allow you to build general balanced tree structures, and can be used for operations beyond equality and range comparisons. They are used to index the geometric data types, as well as full-text search.

Implicit Index - Indexes are automatically created for primary key constraints and unique constraints.

Places where we should avoid Index:
  1. Where there are too many updates/deletes on the table.
  2. We should avoid them on small tables.
  3. They should be avoided where there are too many NULL values.
  4. Columns that are frequently manipulated should not be indexed.


Monitoring in PostgreSQL

Database Monitoring:

  1. Database monitoring consists of capturing and recording the database events.
  2. This information helps DBA to detect, identify and fix potential database performance issues.
  3. Database monitoring statistics make it easy for DBAs to monitor the health and performance of a PostgreSQL database. 
  4. Several tools are available for monitoring database activity and analyzing performance.

Database Statistics:

  • DB statistics catalog tables store DB activity information like
  1. Current running sessions
  2. Running SQL
  3. Locks
  4. DML Counts
  5. Row Counts
  6. Index Usage
  • The stats collector collects all the above stats and updates the statistics tables in pg_catalog
  • pg_catalog is the schema which contains all the statistics tables
CATALOG TABLE - Which needs to be updated with analyze(pg_class, pg_catalog)
SYSTEM VIEWS - Which automatically gets updated in occurrence to an event(pg_stat_activity)

The statistics Collector:
  • The stats collector is a process that collects and reports information about database activities.
  • The stats collector adds some overhead to query execution.
  • Stats Parameter:
  1. Track counts - controls table and index statistics
  2. Track activities - enables monitoring of the current command.
  • The stats collector use temp files stored in pg_stat_tmp subdirectory.
  • Permanent statistics are stored in pg_catalog schema in global subdirectory
  • Performance TIP: Parameter stats_temp_directory can be pointed at a RAM based file System.
Database Statistics Table:
  1. Pg_catalog schema contains a set of table,view and function which store and report database statistics.
  2. pg_class and pg_stats are the two main catalog tables that store statistics; they hold index and table statistics.
  3. Pg_stat_database view can be used to view stats information about a database
  4. Pg_stat_bgwriter shows background writer steps
  5. Pg_stat_user_tables shows information about activities on a table like inserts, updates, deletes, vacuum, autovacuum etc
  6. Pg_stat_user_indexes shows information about index usage for all user table. 
  7. \dS - shows view and statistics table in a server. 
  8. Select * from pg_stat_database
  9. Pg_stat_% are all views
Pg_Stat_Database:

postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_database where datname='postgres';
-[ RECORD 1 ]--+---------------------------------
datid          | 12896            --- Database OID
datname        | postgres    --- Database Name
numbackends    | 1            --- Number of connection at this time
xact_commit    | 16791      --- How many transactions were committed
xact_rollback  | 2               --- How many transactions were rolled back
blks_read      | 187288      --- How many disk blocks were read
blks_hit       | 273549         --- How many blocks were found already in the buffer cache
tup_returned   | 7688150   --- How many tuples were returned
tup_fetched    | 93446        --- How many tuples were fetched
tup_inserted   | 0                --- How many tuples were inserted
tup_updated    | 0               --- How many tuples were updated
tup_deleted    | 0                --- How many tuples were deleted
conflicts      | 0                    ---   
temp_files     | 0
temp_bytes     | 0
deadlocks      | 0
blk_read_time  | 0
blk_write_time | 0
stats_reset    | 2016-10-07 12:29:11.117365+05:30  -- When statistics were reset

Pg_stat_bgwriter:

postgres=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+---------------------------------
checkpoints_timed     | 96              -- How many checkpoints were scheduled (timed)
checkpoints_req       | 11394         -- How many checkpoints were requested (forced)
checkpoint_write_time | 210405805  -- Time spent writing checkpoint files, in ms
checkpoint_sync_time  | 8204341    -- Time spent syncing checkpoint files, in ms
buffers_checkpoint    | 49477813   -- How many buffers were written during checkpoints
buffers_clean         | 234507961    -- How many buffers were written by the background writer
maxwritten_clean      | 1296814     -- How many times the background writer stopped because it wrote too many buffers
buffers_backend       | 1768849795 -- How many buffers were written directly by backends 
buffers_backend_fsync | 0              -- How many times a backend had to run its own fsync
buffers_alloc         | 2155806772     -- How many buffers were allocated
stats_reset           | 2016-10-07 12:29:04.686137+05:30  -- Stats were reset on

When checkpoints_req is more than checkpoints_timed, there is an issue: there are too many transactions and checkpoints are happening because checkpoint_segments keeps getting full.
SOLUTION: Increase shared_buffers and increase checkpoint_segments. The ratio can be checked with the query below.
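A small query sketch to keep an eye on that ratio:

select checkpoints_timed, checkpoints_req,
       round(100.0 * checkpoints_req / nullif(checkpoints_timed + checkpoints_req, 0), 2) as pct_requested
from pg_stat_bgwriter;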

Pg_stat_user_tables:

postgres=# Select * from pg_stat_user_tables where relname='a';
-[ RECORD 1 ]-----+-----------
relid             | 2338166798
schemaname        | public   -- Schemaname
relname           | a                -- Table Name
seq_scan          | 0               -- How many seq_scan
seq_tup_read      | 0            -- How many tuples were read sequentially
idx_scan          |                   -- How many index scan
idx_tup_fetch     |                 -- How many tuples returned using index
n_tup_ins         | 0                -- How many tuples inserted
n_tup_upd         | 0              -- How many tuples updated
n_tup_del         | 0               -- How many tuples deleted
n_tup_hot_upd     | 0           -- How many tuples were HOT updated
n_live_tup        | 0                -- Estimated number of live tuples
n_dead_tup        | 0              -- Estimated number of dead tuples
last_vacuum       |                 -- When the table was last vacuum
last_autovacuum   |              -- when the table was last auto vacuum 
last_analyze      |                  -- when the table was last analyzed
last_autoanalyze  |               -- When the table was last auto analyzed
vacuum_count      | 0            -- How many vacuum was done
autovacuum_count  | 0         -- How many autovacuum was done
analyze_count     | 0             -- How many analyze was done
autoanalyze_count | 0          -- How many auto analyze was done.

Autoanalyze: the autovacuum daemon does auto analyze. But if a table rarely gets inserts, it has to be analyzed manually. If the table being analyzed has parent and child tables, statistics are gathered for the parent alone and for the parent together with its children. 

Sequential scans should generally be fewer than index scans; the query below can help spot tables with heavy sequential scanning. 
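A quick sketch for spotting such tables, purely illustrative:

select relname, seq_scan, idx_scan, n_live_tup
from pg_stat_user_tables
order by seq_scan desc
limit 10;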


IMPORTANT VIEWS:

Below are some important views which are very much essential for a DBA.

pg_stat_activity
pg_stat_bgwriter
pg_stat_database
pg_stat_all_tables
pg_stat_sys_tables
pg_stat_user_tables
pg_stat_xact_all_tables
pg_stat_xact_sys_tables
pg_stat_xact_user_tables
pg_stat_all_indexes
pg_stat_sys_indexes
pg_stat_user_indexes
pg_statio_all_tables
pg_statio_sys_tables
pg_statio_user_tables
pg_statio_all_indexes
pg_statio_sys_indexes
pg_statio_user_indexes
pg_statio_all_sequences
pg_statio_sys_sequences
pg_statio_user_sequences
pg_stat_user_functions
pg_stat_xact_user_functions
pg_stat_replication
Pg_stat_database_conflicts

Pg_stat_activity:

postgres=# select * from pg_stat_activity;

-[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------------------------------------------------
datid         | 16384  -- DB ID
datname       | ewcloud  -- DB Name
pid           | 32438     -- Process ID matches with pid of ps -ef | grep post
usesysid      | 32217507
usename       | t_a3efe696b07a44f9ae84ef5b5f096e0c  -- User name
application_name | Cisco Energy Management Server Processor Message Queue reportin
client_addr   | 172.30.2.64
client_hostname  |
client_port   | 61742
backend_start | 2016-10-07 12:29:05.183022+05:30  -- Connection started
xact_start    |
query_start   | 2016-10-23 00:30:32.395862+05:30  -- Query start time
state_change  | 2016-10-23 00:30:32.396797+05:30
waiting       | f  -- Query is in waiting state or not
state         | idle
query         | COMMIT
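A handy sketch built on the same view for spotting long-running statements (the column names are the 9.x ones shown above):

select pid, usename, state, now() - query_start as runtime, query
from pg_stat_activity
where state <> 'idle'
order by runtime desc;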


Replication in PostgreSQL

Synchronous Replication:

  1. Ensures that data is first written to the transaction log on disk on both the master and the slave before the commit is acknowledged to the client.
  2. It ensures maximum Protection.
  3. Synchronous replication offers the ability to confirm that all changes made by a transaction have been transferred to one synchronous standby server.
  4. The only chance of data loss is if both the master and the slave crash at the same time. 
  5. The maximum wait time is the round trip from the master to the slave and back. Once streaming replication has been configured, configuring synchronous replication requires only one additional configuration step: synchronous_standby_names must be set to a non-empty value (see the example below).
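A minimal sketch of that one extra step in the master's postgresql.conf (the standby name 'standby1' is hypothetical and must match the application_name used in the standby's primary_conninfo):

synchronous_standby_names = 'standby1'
synchronous_commit = on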

Asynchronous Replication:


  1. It is by default.
  2. If the Master/Primary server crashes, then some transactions that were committed may not yet have been replicated to the standby server/slave. The amount of data loss is proportional to the replication delay at the time of failover.

Logical Replication:


  1. Replication slots with Streaming replication
  2. Replication Slot - a slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database, sending each change exactly once. Only one receiver may consume changes from a slot at any given time.
  3. It can only replicate DML changes not DDL changes.
  4. It can replicate across any versions and any databases, as it is logical replication.
  5. Replication should be synchronous replication.
  6. Only after the replicated changes are read from a slot will the corresponding xlog be removed.
  7. It does not need much overhead on the server, as it reads from the WAL files.
  8. It is a gradual approach towards multi-master.
  9. Replication slot connections are counted towards max_connections.
  10. Process - there is a logical decoding step which pulls all the changes from the WAL. The decoder needs access to pg_catalog, as it needs various information from there. The DML changes are decoded by the decoder, and an output plugin attached to the decoder reads the decoded changes and sends statements to the slave server. The plugin can be customized according to our use; we can use it for analytical purposes as well as for logical replication. 
  11. We should make sure max_replication_slots, max_wal_senders and the WAL receiver settings are configured on both the master and the slave.
  12. Please find the below example.


postgres=# -- Create a slot named 'regression_slot' using the output plugin 'test_decoding'

postgres=# SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
    slot_name    | xlog_position
-----------------+---------------
 regression_slot | 0/16B1970
(1 row)

postgres=# SELECT * FROM pg_replication_slots;
    slot_name    |    plugin     | slot_type | datoid | database | active |  xmin  | catalog_xmin | restart_lsn
-----------------+---------------+-----------+--------+----------+--------+--------+--------------+-------------
 regression_slot | test_decoding | logical   |  12052 | postgres | f      |        |          684 | 0/16A4408
(1 row)

postgres=# -- There are no changes to see yet
postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
location | xid | data
----------+-----+------
(0 rows)

postgres=# CREATE TABLE data(id serial primary key, data text);
CREATE TABLE

postgres=# -- DDL isn't replicated, so all you'll see is the transaction
postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
location  | xid |    data
-----------+-----+------------
0/16D5D48 | 688 | BEGIN 688
0/16E0380 | 688 | COMMIT 688
(2 rows)

postgres=# -- Once changes are read, they're consumed and not emitted
postgres=# -- in a subsequent call:
postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
location | xid | data
----------+-----+------
(0 rows)

postgres=# BEGIN;
postgres=# INSERT INTO data(data) VALUES('1');
postgres=# INSERT INTO data(data) VALUES('2');
postgres=# COMMIT;

postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
location  | xid |                     data
-----------+-----+-----------------------------------------------
0/16E0478 | 689 | BEGIN 689
0/16E0478 | 689 | table public.data: INSERT: id[integer]:1 data[text]:'1'
0/16E0580 | 689 | table public.data: INSERT: id[integer]:2 data[text]:'2'
0/16E0650 | 689 | COMMIT 689
(4 rows)

postgres=# INSERT INTO data(data) VALUES('3');

postgres=# -- You can also peek ahead in the change stream without consuming changes
postgres=# SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);
location  | xid |                     data
-----------+-----+-----------------------------------------------
0/16E09C0 | 690 | BEGIN 690
0/16E09C0 | 690 | table public.data: INSERT: id[integer]:3 data[text]:'3'
0/16E0B90 | 690 | COMMIT 690
(3 rows)

postgres=# -- The next call to pg_logical_slot_peek_changes() returns the same changes again
postgres=# SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);
location  | xid |                     data
-----------+-----+-----------------------------------------------
0/16E09C0 | 690 | BEGIN 690
0/16E09C0 | 690 | table public.data: INSERT: id[integer]:3 data[text]:'3'
0/16E0B90 | 690 | COMMIT 690
(3 rows)

postgres=# -- options can be passed to output plugin, to influence the formatting
postgres=# SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-timestamp', 'on');
location  | xid |                     data
-----------+-----+-----------------------------------------------
0/16E09C0 | 690 | BEGIN 690
0/16E09C0 | 690 | table public.data: INSERT: id[integer]:3 data[text]:'3'
0/16E0B90 | 690 | COMMIT 690 (at 2014-02-27 16:41:51.863092+01)
(3 rows)

postgres=# -- Remember to destroy a slot you no longer need to stop it consuming
postgres=# -- server resources:
postgres=# SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
-----------------------

(1 row)



Physical Replication:

  1. Streaming Replication
  2. Hot Standby
  3. Warm Standby
  4. Log shipping

Streaming Replication:

There should always be a real-time copy of the data so that it stays safe. Streaming replication is one of the processes that continuously syncs the real-time data to a different server. The main server is called the master and the server to which the data gets synced is called the slave. 

The WAL sender process keeps checking for changed WAL and sends it to the replicated server, where the WAL receiver makes sure it is written to the database accordingly. Streaming replication can replicate changes practically row by row, as the process reads from the xlog continuously. 

Setting up Streaming Replication:


Pre-Requisite:
  1. Slave Empty
  2. Same Version (9.2.2)
  3. Same OS-Linux (Ubuntu)
  4. Same bit - 32
Steps:

1) Set up SSH keys so that there is no problem connecting between the two servers, i.e. slave and master
 
   a) As a root user create the key in master and in slave
            Master: ssh-keygen -t rsa
            Slave: ssh-keygen -t rsa
   b) Now copy the key from master to slave and slave to master
             Master: vi ~/.ssh/authorized_keys
                        cat id_rsa.pub from the slave and paste it into authorized_keys
             Slave: do the same thing on the slave

2) Change the permission of the authorized_keys file using the below command -
      [root@linux] chmod 600 authorized_keys

3) We need to repeat the above steps to set up the keys as the postgres user. If server A needs to connect to server B as the postgres user, then we need to copy the public key of server A into authorized_keys on server B. Permissions are very important; without them it won't work. Also we need to make sure that known_hosts is cleared up before/after we do the ssh keygen setup.
       .ssh permission - 700
       Authorized_keys - 640

4) Check whether the master can connect to the slave without a password.
         ssh postgres@192.168.175.3
     If it still does not connect and throws an error, reset the postgres password at the OS level
         passwd postgres

5) Create a directory named wallogs under / on the slave and change its ownership
      chown -R postgres wallogs

6) Change postgresql.conf (master):
     archive_mode = on
     archive_command = 'ssh slave_host test -d /wallogs && rsync -a %p slave_host:/wallogs/%f'
      (it SSHes to the slave, tests that /wallogs exists and, if everything is OK, copies the WAL file to the slave's wallogs)
     wal_level = hot_standby
     max_wal_senders = 2
     wal_keep_segments = 30

7) Restart the master
    service postgresql restart
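
After the restart, you can confirm that the settings from step 6 took effect. A minimal check (not part of the original steps) could be:

     select name, setting from pg_settings
     where name in ('archive_mode', 'wal_level', 'max_wal_senders', 'wal_keep_segments');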

8) We can switch the WAL file to check that logs are getting created and shipped to the slave.
    select pg_switch_xlog();

9) Change in pg_hba.conf
    type    database    user          host(slave)            Method
    host    all         all         192.168.175.196/32      trust
    host  replication   all         192.168.175.196/32      trust
     Change all md5 to trust.

10) Next we will run commands from the slave. We won't touch the master.

11) Keep a backup of postgresql.conf and pg_hba.conf in home directory.

12) Going to root:
     rm /tmp/data_pgsql.trigger.5432
     /etc/init.d/postgresql stop
     rm /usr/local/pgsql/data/ -rf
     echo "SELECT pg_start_backup('`hostname``/bin/date '+%Y%m-%d-%T'`');" | psql -h masr_hostte -U postgres
     rsync -avz   postgres@master_host:/usr/local/pgsql/data/  /usr/local/pgsql/data/
     su - postgres -c "/usr/local/pgsql/bin/psql -h master_host -c 'SELECT pg_stop_backup();'"
     rm /usr/local/pgsql/data/pg_xlog/archive_status/*
     rm /usr/local/pgsql/data/pg_xlog/*

13) Change in postgresql.conf (slave)
      hot_standby = on

14) Create recovery.conf
      standby_mode = 'on'
      primary_conninfo = 'host=master_host port=5432 user=postgres'
      restore_command = 'cp /wallogs/%f %p'
      trigger_file = '/tmp/pgsql.trigger'
      archive_cleanup_command = 'pg_archivecleanup /wallogs/ps_admin_juliet %r'

15) Remove the postmaster.pid before restarting the slave server.

16) Restart the slave and check whether replication is working.

17) Monitoring: pg_stat_replication

     Large differences between pg_current_xlog_location and sent_location field might indicate that the master server is under heavy load, while differences between sent_location and pg_last_xlog_receive_location on the standby might indicate network delay, or that the standby is under heavy load.

Recovery.conf File:

standby_mode = 'on'
primary_conninfo = 'host=master_host port=5432 user=postgres'
restore_command = 'cp /wallogs/%f %p'
trigger_file = '/tmp/pgsql.trigger'
archive_cleanup_command = 'pg_archivecleanup /wallogs/ps_admin_juliet %r'
recovery_target_name = named restore point created with pg_create_restore_point() (see the example after this list)
recovery_target_time = recover up to a specific point in time
recovery_target_xid = recover up to a specific transaction ID
recovery_target_inclusive = true (default); false stops just before the recovery target
recovery_target_timeline = recover along a specific timeline
pause_at_recovery_target = pause once the recovery target is reached
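
For reference, the named restore point used by recovery_target_name is created on the master with pg_create_restore_point(); the LSN returned below is only illustrative:

postgres=# SELECT pg_create_restore_point('before FOSDEM');
 pg_create_restore_point
--------------------------
 13/ED000090
(1 row)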

What happens at the end of recovery?

  1. End of recovery means the point where the database opens up for writing
  2. New timeline is chosen
  3. A timeline history file is written
  4. The partial last WAL file on the previous timeline is copied with the new timeline's ID
  5. A checkpoint record is written on the new timeline
Example:

LOG: database system was interrupted; last known up at 2013-01-30
21:45:14 EET
LOG: starting archive recovery
LOG: redo starts at 13/E00000C8
LOG: could not open file "pg_xlog/0000000100000013000000E4": No
such file or directory
LOG: redo done at 13/E3D389A0
LOG: last completed transaction was at log time 2013-01-30 21:45:20+02
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections

Partial File:

0000000100000013000000E4
------------------- - - - - - - - -  - - -      - - - - ( Unused Portion)
0000000200000013000000E4
------------------------------------           --------(Merged Portion)
Both are partial files at one point in time; the partial file then vanishes and becomes a normal WAL file.

TimeLine History File:

After a timeline switch there is a history file sitting in between the WAL segments:
0000000100000013000000E4
0000000100000013000000E5
00000002.history
0000000200000013000000E3
0000000200000013000000E4

cat 00000002.history
13/E4000000  no recovery target specified
13/ED000090  at restore point "before FOSDEM"

Promoting a standby also switches timeline.
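
For example, with the recovery.conf shown earlier, promoting the standby is done by creating the configured trigger file (pg_ctl promote does the same thing):

     touch /tmp/pgsql.trigger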

Pg-receivexlog:

  1. pg_receivexlog streams the transaction log from a running PostgreSQL cluster.
  2. This tool uses the streaming replication protocol to stream transaction logs into a local directory.
  3. These files can be used for PITR.
  4. Transaction logs are streamed in real time, thus pg_receivexlog can be used instead of archive_command.
  5. EX: pg_receivexlog -h localhost -D /usr/local/wal_archive
  6. It can be used as an alternative to archiving or for streaming replication.
  7. max_wal_senders must be at least 1, and an entry must exist in pg_hba.conf for the client IP as a replication connection; then it will work (a sketch follows this list).
  8. It uses the replication protocol.
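
A minimal sketch of the configuration implied by points 6-7; the IP addresses and directory below are illustrative, not taken from this setup:

postgresql.conf (on the server being streamed from):
     max_wal_senders = 1
     wal_level = archive

pg_hba.conf (on the same server):
     host    replication    postgres    192.168.0.20/32    trust

From the machine that will store the WAL:
     pg_receivexlog -h 192.168.0.10 -U postgres -D /usr/local/wal_archive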

Replication Troubleshooting:

  1. Generally we look into the view pg_stat_replication.
  2. The below queries can be used to find the lag.
  3. psql> select extract('epoch' from now() - pg_last_xact_replay_timestamp())::int;
  4. A few more queries are below (a byte-based lag query follows them):
$ psql -c "SELECT pg_current_xlog_location()" -h192.168.0.10 (primary host)
 pg_current_xlog_location 
--------------------------
 0/2000000
(1 row)

$ psql -c "select pg_last_xlog_receive_location()" -h192.168.0.20 (standby host)
 pg_last_xlog_receive_location 
-------------------------------
 0/2000000
(1 row)

$ psql -c "select pg_last_xlog_replay_location()" -h192.168.0.20 (standby host)
 pg_last_xlog_replay_location 
------------------------------
 0/2000000
(1 row)
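
To express the lag in bytes rather than comparing LSNs by eye, pg_xlog_location_diff() can be used (renamed pg_wal_lsn_diff in PostgreSQL 10); this is a sketch using the 9.x function names used throughout this post:

$ psql -c "SELECT application_name, pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes FROM pg_stat_replication" -h192.168.0.10 (primary host)

$ psql -c "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location()) AS receive_vs_replay_bytes" -h192.168.0.20 (standby host)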

Pg_stat_replication:

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 14447
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 172.31.28.103
client_hostname  |
client_port      | 34792
backend_start    | 2016-11-29 00:09:15.958171-05
backend_xmin     |
state            | streaming
sent_location    | 0/2000060
write_location   | 0/2000060
flush_location   | 0/2000060
replay_location  | 0/2000060
sync_priority    | 0
sync_state       | async


Column (type): Description

pid (integer): Process ID of a WAL sender process
usesysid (oid): OID of the user logged into this WAL sender process
usename (name): Name of the user logged into this WAL sender process
application_name (text): Name of the application that is connected to this WAL sender
client_addr (inet): IP address of the client connected to this WAL sender. If this field is null, it indicates that the client is connected via a Unix socket on the server machine.
client_hostname (text): Host name of the connected client, as reported by a reverse DNS lookup of client_addr. This field will only be non-null for IP connections, and only when log_hostname is enabled.
client_port (integer): TCP port number that the client is using for communication with this WAL sender, or -1 if a Unix socket is used
backend_start (timestamp with time zone): Time when this process was started, i.e., when the client connected to this WAL sender
state (text): Current WAL sender state
sent_location (text): Last transaction log position sent on this connection
write_location (text): Last transaction log position written to disk by this standby server
flush_location (text): Last transaction log position flushed to disk by this standby server
replay_location (text): Last transaction log position replayed into the database on this standby server
sync_priority (integer): Priority of this standby server for being chosen as the synchronous standby
sync_state (text): Synchronous state of this standby server

Cascading Replication:

  1. Allows a standby server to accept connections and stream WAL to other standby servers, acting like a relay. This helps in reducing the number of direct connections to the master.
  2. A standby acting as both a receiver and a sender is known as a cascading standby.
  3. Standby servers closer to the master are called upstream servers, and those that are further away are called downstream servers.
  4. Cascading replication is presently asynchronous.
  5. We need to set primary_conninfo on the downstream standby to point to the cascading standby (a sketch follows this list).
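
As a minimal sketch, the downstream standby's recovery.conf points at the cascading standby rather than the master; the host name below is illustrative:

standby_mode = 'on'
primary_conninfo = 'host=cascading_standby_host port=5432 user=postgres'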

Log Shipping/Warm Standby:

  1. Log shipping is asynchronous.
  2. wal_level = archive (master)
  3. hot_standby = off (slave)
  4. Clients are not able to connect to the slave; it stays in "the database system is starting up" mode.
  5. Also known as Warm Standby.

Hot-Standby:

  1. wal_level = hot_standby (master)
  2. hot_standby = on (slave)
  3. Clients can connect to the slave for read-only operations (see the check below).
  4. With archive-based shipping the slave will be one WAL file behind the master, since it reads from the archive location.
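
A quick way to confirm a hot standby is up and accepting read-only connections is to query it directly; the output below is illustrative:

postgres=# SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)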