Monday, March 26, 2012

PostgreSQL 9.2 Volume 2 Part ii

Chapter 11. Indexes
Without an index, the system has to scan the entire table, row by row, to find all matching entries. If there are many rows in a table and only a few rows (perhaps zero or one) that would be returned by a query, this is clearly an inefficient method. But if the system has been instructed to maintain an index on the id column, it can use a more efficient method for locating matching rows. For instance, it might only have to walk a few levels deep into a search tree.
CREATE INDEX test1_id_index ON test1 (id);
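A minimal sketch of the case described above. The table definition and the query shape follow the example in the PostgreSQL manual that these notes summarize; the literal 42 is only an illustrative value, and the plan actually chosen depends on the table size and its statistics.

```sql
-- Table assumed by the index example (two columns, as in the manual's example).
CREATE TABLE test1 (
    id      integer,
    content varchar
);

-- Without an index, this query has to scan every row of test1.
SELECT content FROM test1 WHERE id = 42;

-- With an index on id, the planner can locate matching rows through the index.
CREATE INDEX test1_id_index ON test1 (id);

-- Refresh statistics, then check which plan the planner picks.
ANALYZE test1;
EXPLAIN SELECT content FROM test1 WHERE id = 42;

-- An index can be dropped again without touching the table data.
DROP INDEX test1_id_index;
```

On a very small table the planner may still prefer a sequential scan, so the EXPLAIN output is worth checking on realistic data volumes.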
B-tree indexes can handle equality and range queries on data that can be sorted into some ordering. The planner will consider a B-tree index whenever an indexed column is involved in a comparison using one of these operators: <, <=, =, >=, >. Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also use a B-tree index search, as can IS NULL and IS NOT NULL conditions. Pattern matching (LIKE) can use a B-tree index when the pattern is a constant anchored at the beginning of the string, for example col LIKE 'foo%'.
B-tree indexes can also be used to retrieve data in sorted order. This is not always faster than a simple scan and sort, but it is often helpful.
Hash indexes can only handle simple equality comparisons. The planner will consider a hash index whenever an indexed column is involved in a comparison using the = operator.
CREATE INDEX name ON table USING hash (column);
Multicolumn Indexes
CREATE INDEX test2_mm_idx ON test2 (major, minor);
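A short sketch of how queries relate to a two-column index. The table definition follows the example in the PostgreSQL manual; the literal values 1 and 7 are illustrative assumptions. A multicolumn B-tree index is most efficient when the leading (leftmost) column is constrained.

```sql
-- Table assumed by the multicolumn example.
CREATE TABLE test2 (
    major int,
    minor int,
    name  varchar
);

CREATE INDEX test2_mm_idx ON test2 (major, minor);

-- Both indexed columns are constrained: a good fit for the index.
SELECT name FROM test2 WHERE major = 1 AND minor = 7;

-- Only the leading column is constrained: the index is still useful.
SELECT name FROM test2 WHERE major = 1;

-- Only the trailing column is constrained: the index can be used in principle,
-- but much less efficiently, so the planner will often choose a sequential scan.
SELECT name FROM test2 WHERE minor = 7;
```

Running EXPLAIN on each query shows which of them the planner actually serves from test2_mm_idx on a given data set.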
Chapter 12. Full Text Search
Chapter 13. Concurrency Control
Data consistency is maintained by using a multiversion model (Multiversion Concurrency Control, MVCC). This means that while querying a database each transaction sees a snapshot of data as it was some time ago, regardless of the current state of the underlying data. This protects the transaction from viewing inconsistent data that could be caused by (other) concurrent transaction updates on the same data rows, providing transaction isolation for each database session.
The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an innovative Serializable Snapshot Isolation (SSI) level.

13.2. Transaction Isolation

The SQL standard defines four levels of transaction isolation. The strictest is Serializable. The phenomena that are prohibited at various levels are:
Dirty read - A transaction reads data written by a concurrent uncommitted transaction.
Non-repeatable read - A transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).
Phantom read - A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.
The four transaction isolation levels and the corresponding behaviors are described in Table 13-1.
Table 13-1. Transaction Isolation Levels

Isolation Level    Dirty Read     Nonrepeatable Read   Phantom Read
Read uncommitted   Possible       Possible             Possible
Read committed     Not possible   Possible             Possible
Repeatable read    Not possible   Not possible         Possible
Serializable       Not possible   Not possible         Not possible
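A hedged sketch of how an isolation level is selected and what a snapshot means in practice. The accounts table and its columns are hypothetical, and the second session is shown only as comments because it has to run on a separate connection. Note that the table lists what the SQL standard permits; in PostgreSQL, requesting Read uncommitted gives Read committed behavior, and its Repeatable read implementation does not allow phantom reads.

```sql
-- Session A: the snapshot is taken at the first query and kept until COMMIT.
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM accounts;      -- accounts is a hypothetical table

-- Session B (separate connection) commits a change in the meantime:
--   INSERT INTO accounts (id, balance) VALUES (1001, 0);

-- Session A: still sees the same count, because it reads from its snapshot.
SELECT count(*) FROM accounts;
COMMIT;

-- The level can also be given directly on BEGIN.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SHOW transaction_isolation;         -- reports the level of the current transaction
COMMIT;
```

Under the default Read committed level, the second SELECT in session A would instead see the row committed by session B.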
13.3. Explicit Locking 
Chapter 14. Performance Tips
14.1. Using EXPLAIN

PostgreSQL devises a query plan for each query it receives. You can use the EXPLAIN command to see what query plan the planner creates for any query. 

The structure of a query plan is a tree of plan nodes. The first line (topmost node) has the estimated total execution cost for the plan; it is this number that the planner seeks to minimize.

                         QUERY PLAN
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
The numbers that are quoted by EXPLAIN are (left to right): 
  • Estimated start-up cost (time expended before the output scan can start, e.g., time to do the sorting in a sort node) 
  • Estimated total cost (if all rows are retrieved, though they might not be; e.g., a query with a LIMIT clause will stop short of paying the total cost of the Limit plan node's input node) 
  • Estimated number of rows output by this plan node (again, only if executed to completion) 
  • Estimated average width (in bytes) of rows output by this plan node 
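As a rough illustration of where the total cost of 458.00 in the plan above can come from: for a sequential scan the estimate is (disk pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost). The sketch assumes the tenk1 table from the PostgreSQL regression test database, the default cost parameters, and the page count of 358 reported in the manual's example; other installations will show different numbers.

```sql
-- The command that produces the plan shown above.
EXPLAIN SELECT * FROM tenk1;

-- Page and row counts the planner works from (kept up to date by ANALYZE/VACUUM).
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';

-- Cost parameters; the defaults are seq_page_cost = 1.0 and cpu_tuple_cost = 0.01.
SHOW seq_page_cost;
SHOW cpu_tuple_cost;

-- Assuming 358 pages and 10000 rows:
-- (358 * 1.0) + (10000 * 0.01) = 358 + 100 = 458
SELECT 358 * 1.0 + 10000 * 0.01 AS estimated_total_cost;
```

The costs are in arbitrary planner units, not milliseconds, so they are useful for comparing plans rather than for predicting run time.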

14.4. Populating a Database

One might need to insert a large amount of data when first populating a database.

** Disable Autocommit

When using multiple INSERTs, turn off autocommit and just do one commit at the end. An additional benefit of doing all insertions in one transaction is that if the insertion of one row were to fail then the insertion of all rows inserted up to that point would be rolled back, so you won't be stuck with partially loaded data.
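A minimal sketch of the single-transaction approach. The items table and its columns are hypothetical. In psql, wrapping the statements in an explicit BEGIN ... COMMIT block has the same effect as turning autocommit off for the duration of the load.

```sql
-- Hypothetical target table.
CREATE TABLE items (
    id    integer,
    label text
);

BEGIN;
INSERT INTO items (id, label) VALUES (1, 'first');
INSERT INTO items (id, label) VALUES (2, 'second');
INSERT INTO items (id, label) VALUES (3, 'third');
-- If any INSERT above fails, the transaction is aborted and COMMIT
-- rolls everything back, so no partial load is left behind.
COMMIT;
```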

** Use COPY
Use COPY to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads. Since COPY is a single command, there is no need to disable autocommit if you use this method to populate a table.
COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway. However, this consideration only applies when wal_level is minimal as all commands must write WAL otherwise.
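A sketch of the TRUNCATE-plus-COPY pattern described above. The items table, its columns, and the file path are assumptions for illustration. COPY ... FROM with a file name reads a file on the database server; from a client machine, psql's \copy is the usual alternative.

```sql
BEGIN;

-- Emptying (or creating) the table in the same transaction is what allows
-- WAL to be skipped for the load when wal_level is minimal.
TRUNCATE items;

-- Hypothetical server-side CSV file with two columns: id,label
COPY items (id, label) FROM '/tmp/items.csv' WITH (FORMAT csv);

COMMIT;
```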
** Remove Indexes
If you are loading a freshly created table, the fastest method is to create the table, bulk load the table's data using COPY, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
If you are adding large amounts of data to an existing table, it might be a win to drop the indexes, load the table, and then recreate the indexes. Of course, the database performance for other users might suffer during the time the indexes are missing. One should also think twice before dropping a unique index, since the error checking afforded by the unique constraint will be lost while the index is missing.
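For an existing table, the drop, load, recreate sequence could look like the following sketch. The table, index name, and file path are hypothetical.

```sql
-- Remove the index so it is not updated row by row during the load.
DROP INDEX items_label_idx;

-- Bulk load the new rows (hypothetical server-side file).
COPY items (id, label) FROM '/tmp/items.csv' WITH (FORMAT csv);

-- Rebuild the index once, over all the data.
CREATE INDEX items_label_idx ON items (label);
```

Queries from other sessions that relied on items_label_idx will run without it until the CREATE INDEX completes.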
** Remove Foreign Key Constraints
Just as with indexes, a foreign key constraint can be checked "in bulk" more efficiently than row-by-row. So it might be useful to drop foreign key constraints, load data, and re-create the constraints. Again, there is a trade-off between data load speed and loss of error checking while the constraint is missing.
What's more, when you load data into a table with existing foreign key constraints, each new row requires an entry in the server's list of pending trigger events (since it is the firing of a trigger that checks the row's foreign key constraint). Loading many millions of rows can cause the trigger event queue to overflow available memory, leading to intolerable swapping or even outright failure of the command. If temporarily removing the constraint is not acceptable, an alternative is to split up the load operation into smaller transactions.
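The drop and re-create approach might be sketched as follows. The orders and order_lines tables, the constraint name, and the file path are all hypothetical.

```sql
-- Remove the constraint so rows are not checked one at a time during the load.
ALTER TABLE order_lines DROP CONSTRAINT order_lines_order_id_fkey;

-- Bulk load (hypothetical server-side file).
COPY order_lines FROM '/tmp/order_lines.csv' WITH (FORMAT csv);

-- Re-adding the constraint validates all existing rows in one pass;
-- it fails if any loaded row references a missing order.
ALTER TABLE order_lines
    ADD CONSTRAINT order_lines_order_id_fkey
    FOREIGN KEY (order_id) REFERENCES orders (id);
```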

** Increase maintenance_work_mem

Temporarily increasing the maintenance_work_mem configuration variable when loading large amounts of data can improve performance. This will help to speed up CREATE INDEX commands and ALTER TABLE ADD FOREIGN KEY commands. It won't do much for COPY itself, so this advice is only useful when you are using one or both of the above techniques.
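Because maintenance_work_mem can be changed per session, the temporary increase can be scoped to the load itself. The value of 512MB, the table, and the index name below are illustrative assumptions, not recommendations.

```sql
-- Raise the limit for this session only.
SET maintenance_work_mem = '512MB';

-- Index builds (and ALTER TABLE ... ADD FOREIGN KEY) benefit from the extra memory.
CREATE INDEX items_label_idx ON items (label);

-- Return to the server's configured value.
RESET maintenance_work_mem;
```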

** Increase checkpoint_segments

Temporarily increasing the checkpoint_segments configuration variable can also make large data loads faster. This is because loading a large amount of data into PostgreSQL will cause checkpoints to occur more often than the normal checkpoint frequency (specified by the checkpoint_timeout configuration variable). Whenever a checkpoint occurs, all dirty pages must be flushed to disk. By increasing checkpoint_segments temporarily during bulk data loads, the number of checkpoints that are required can be reduced.
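Unlike maintenance_work_mem, checkpoint_segments cannot be changed with SET; it is read from postgresql.conf (or the server command line). A sketch of the workflow, where the value 32 is only an illustrative assumption and reloading requires superuser rights. This parameter applies to the 9.2 series covered here; later releases replaced it with max_wal_size.

```sql
-- Current value (the default in 9.2 is 3).
SHOW checkpoint_segments;

-- Edit postgresql.conf on the server, for example:
--   checkpoint_segments = 32     (illustrative value)
-- then ask the server to re-read its configuration files:
SELECT pg_reload_conf();

-- Confirm the new value is active, run the bulk load, and afterwards
-- restore the original setting the same way.
SHOW checkpoint_segments;
```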
** Run ANALYZE Afterwards
Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning. Note that if the autovacuum daemon is enabled, it might run ANALYZE automatically.
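A small sketch of refreshing statistics after a load and checking when they were last gathered. The items table is hypothetical.

```sql
-- Collect fresh planner statistics for one table.
ANALYZE items;

-- See when statistics were last gathered, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'items';
```

Running ANALYZE with no table name processes every table in the current database.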
