|
This document describes and defines the on-disk database file format used by SQLite.
The complete state of an SQLite database is usually contained a single file on disk called the "main database file".
During a transaction, SQLite stores additional information in a second file called the "rollback journal", or if SQLite is in WAL mode, a write-ahead log file. If the application or host computer crashes before the transaction completes, then the rollback journal or write-ahead log contains critical state information needed to restore the main database file to a consistent state. When a rollback journal or write-ahead log contains information necessary for recovering the state of the database, they are called a "hot journal" or "hot WAL file". Hot journals and WAL files are only a factor during error recovery scenarios and so are uncommon, but they are part of the state of an SQLite database and so cannot be ignored. This document defines the format of a rollback journal and the write-ahead log file, but the focus is on the main database file.
The main database file consists of one or more pages. The size of a page is a power of two between 512 and 65536 inclusive. All pages within the same database are the same size. The page size for a database file is determined by the 2-byte integer located at an offset of 16 bytes from the beginning of the database file.
Pages are numbered beginning with 1. The maximum page number is 2147483646 (231 - 2). The minimum size SQLite database is a single 512-byte page. The maximum size database would be 2147483646 pages at 65536 bytes per page or 140,737,488,224,256 bytes (about 140 terabytes). Usually SQLite will hit the maximum file size limit of the underlying filesystem or disk hardware size limit long before it hits its own internal size limit.
In common use, SQLite databases tend to range in size from a few kilobytes to a few gigabytes.
At any point in time, every page in the main database has a single use which is one of the following:
All reads from and writes to the main database file begin at a page boundary and all writes are an integer number of pages in size. Reads are also usually an integer number of pages in size, with the one exception that when the database is first opened, the first 100 bytes of the database file (the database file header) are read as a sub-page size unit.
Before any information-bearing page of the database is modified, the original unmodified content of that page is written into the rollback journal. If a transaction is interrupted and needs to be rolled back, the rollback journal can then be used to restore the database to its original state. Freelist leaf pages bear no information that would need to be restored on a rollback and so they are not written to the journal prior to modification, in order to reduce disk I/O.
The first 100 bytes of the database file comprise the database file header. The database file header is divided into fields as shown by the table below. All multibyte fields in the database file header are stored with the most significant byte first (big-endian).
Offset | Size | Description |
---|---|---|
0 | 16 | The header string: "SQLite format 3\000" |
16 | 2 | The database page size in bytes. Must be a power of two between 512 and 32768 inclusive, or the value 1 representing a page size of 65536. |
18 | 1 | File format write version. 1 for legacy; 2 for WAL. |
19 | 1 | File format read version. 1 for legacy; 2 for WAL. |
20 | 1 | Bytes of unused "reserved" space at the end of each page. Usually 0. |
21 | 1 | Maximum embedded payload fraction. Must be 64. |
22 | 1 | Minimum embedded payload fraction. Must be 32. |
23 | 1 | Leaf payload fraction. Must be 32. |
24 | 4 | File change counter. |
28 | 4 | Size of the database file in pages. The "in-header database size". |
32 | 4 | Page number of the first freelist trunk page. |
36 | 4 | Total number of freelist pages. |
40 | 4 | The schema cookie. |
44 | 4 | The schema format number. Supported schema formats are 1, 2, 3, and 4. |
48 | 4 | Default page cache size. |
52 | 4 | The page number of the largest root b-tree page when in auto-vacuum or incremental-vacuum modes, or zero otherwise. |
56 | 4 | The database text encoding. A value of 1 means UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be. |
60 | 4 | The "user version" as read and set by the user_version pragma. |
64 | 4 | True (non-zero) for incremental-vacuum mode. False (zero) otherwise. |
68 | 24 | Reserved for expansion. Must be zero. |
92 | 4 | The version-valid-for number. |
96 | 4 | SQLITE_VERSION_NUMBER |
Every valid SQLite database file begins with the following 16 bytes (in hex): 53 51 4c 69 74 65 20 66 6f 72 6d 61 74 20 33 00. This byte sequence corresponds to the UTF-8 string "SQLite format 3" including the nul terminator character at the end.
The two-byte value beginning at offset 16 determines the page size of the database. For SQLite versions 3.7.0.1 and earlier, this value is interpreted as a big-endian integer and must be a power of two between 512 and 32768, inclusive. Beginning with SQLite version 3.7.1, a page size of 65536 bytes is supported. The value 65536 will not fit in a two-byte integer, so to specify a 65536-byte page size, the value is at offset 16 is 0x00 0x01. This value can be interpreted as a big-endian 1 and thought of is as a magic number to represent the 65536 page size. Or one can view the two-byte field as a little endian number and say that it represents the page size divided by 256. These two interpretations of the page-size field are equivalent.
The file format write version and file format read version at offsets 18 and 19 are intended to allow for enhancements of the file format in future versions of SQLite. In current versions of SQLite, both of these values are 1 for rollback journalling modes and 2 for WAL journalling mode. If a version of SQLite coded to the current file format specification encounters a database file where the read version is 1 or 2 but the write version is greater than 2, then the database file must be treated as read-only. If a database file with a read version greater than 2 is encounter, then that database cannot be read or written.
SQLite has the ability to set aside a small number of extra bytes at the end of every page for use by extensions. These extra bytes are used, for example, by the SQLite Encryption Extension to store a nonce and/or cryptographic checksum associated with each page. The "reserved space" size in the 1-byte integer at offset 20 is the number of bytes of space at the end of each page to reserve for extensions. This value is usually 0. The value can be odd.
The "usable size" of a database page is the page size specify by the 2-byte integer at offset 16 in the header less the "reserved" space size recorded in the 1-byte integer at offset 20 in the header. The usable size of a page might be an odd number. However, the usable size is not allowed to be less than 480. In other words, if the page size is 512, then the reserved space size cannot exceed 32.
The maximum and minimum embedded payload fractions and the leaf payload fraction values must be 64, 32, and 32. These values were originally intended to as tunable parameters that could be used to modify the storage format of the b-tree algorithm. However, that functionality is not supported and there are no current plans to add support in the future. Hence, these three bytes are fixed at the values specified.
The file change counter is a 4-byte big-endian integer which is incremented whenever the database file is unlocked after having been modified. When two or more processes are reading the same database file, each process can detect database changes from other processes by monitoring the change counter. A process will normally want to flush its database page cache when another process modified the database, since the cache has become stale. The file change counter facilitates this.
In WAL mode, changes to the database are detected using the wal-index and so the change counter is not needed. Hence, the change counter might not be incremented on each transaction in WAL mode.
The 4-byte big-endian integer at offset 28 into the header stores the size of the database file in pages. If this in-header datasize size is not valid (see the next paragraph), then the database size is computed by looking at the actual size of the database file. Older versions of SQLite ignored the in-header database size and used the actual file size exclusively. Newer versions of SQLite use the in-header database size if it is available but fall back to the actual file size if the in-header database size is not valid.
The in-header database size is only considered to be valid if it is non-zero and if the 4-byte change counter at offset 24 exactly matches the 4-byte version-valid-for number at offset 92. The in-header database size is always valid when the database is only modified using recent versions of SQLite (versions 3.7.0 and later). If a legacy version of SQLite writes to the database, it will not know to update the in-header database size and so the in-header database size could be incorrect. But legacy versions of SQLite will also leave the version-valid-for number at offset 92 unchanged so it will not match the change-counter. Hence, invalid in-header database sizes can be detected (and ignored) by observing when the change-counter does not match the version-valid-for number.
Unused pages in the database file are stored on a freelist. The 4-byte big-endian integer at offset 32 stores the page number of the first page of the freelist, or zero if the freelist is empty. The 4-byte big-endian integer at offset 36 stores stores the total number of pages on the freelist.
The schema cookie is a 4-byte big-endian integer at offset 40 that is incremented whenever the database schema changes. A prepared statement is compiled against a specific version of the database schema. Whenever the database schema changes, the statement must be reprepared. Whenever a prepared statement runs, it first checks the schema cookie to make sure the value is the same as when the statement was prepared and if the schema cookie has changed, the statement aborts in order to force the statement to be reprepared.
The schema format number is a 4-byte big-endian integer at offset 44. The schema format number is similar to the file format read and write version numbers at offsets 18 and 19 except that the schema format number refers to the high-level SQL formatting rather than the low-level b-tree formatting. Four schema format numbers are currently defined:
New database files created by SQLite use format 4 by default. The legacy_file_format pragma can be used to cause SQLite to create new database files using format 1. The format version number can be made to default to 1 instead of 4 by setting SQLITE_DEFAULT_FILE_FORMAT=1 at compile-time.
The 4-byte big-endian signed integer at offset 48 is the suggested cache size in pages for the database file. The value is a suggestion only and SQLite is under no obligation to honor it. The absolute value of the integer is used as the suggested size. The suggested cache size can be set using the default_cache_size pragma.
The two 4-byte big-endian integers at offsets 52 and 64 are used to manage the auto_vacuum and incremental_vacuum modes. If the integer at offset 52 is zero then pointer-map (ptrmap) pages are omitted from the database file and neither auto_vacuum nor incremental_vacuum are supported. If the integer at offset 52 is non-zero then it is the page number of the largest root page in the database file, the database file will contain ptrmap pages, and the mode must be either auto_vacuum or incremental_vacuum. In this latter case, the integer at offset 64 is true for incremental_vacuum and false for auto_vacuum. If the integer at offset 52 is zero then the integer at offset 64 must also be zero.
The 4-byte big-endian integer at offset 56 determines the encoding used for all text strings stored in the database. A value of 1 means UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be. No other values are allowed.
The 4-byte big-endian integer at offset 60 is the user version which is set and queried by the user_version pragma. The user version is not used by SQLite.
The 4-byte big-endian integer at offset 96 stores the SQLITE_VERSION_NUMBER value for the SQLite library that most recently modified the database file. The 4-byte big-endian integer at offset 92 is the value of the change counter when the version number was stored. The integer at offset 92 indicates which transaction the version number is valid for and is sometimes called the "version-valid-for number".
All other bytes of the database file header are reserved for future expansion and must be set to zero.
The lock-byte page is the single page of the database file that contains the bytes at offsets between 1073741824 and 1073742335, inclusive. A database file that is less than or equal to 1073741824 bytes in size contains no lock-byte page. A database file larger than 1073741824 contains exactly one lock-byte page.
The lock-byte page is set aside for use by the operating-system specific VFS implementation in implementing the database file locking primitives. SQLite does not use the lock-byte page. The SQLite core will never read or write the lock-byte page, though operating-system specific VFS implementations may choose to read or write bytes on the lock-byte page according to the needs and proclivities of the underlying system. The unix and win32 VFS implementations that come built into SQLite do not write to the lock-byte page, but third-party VFS implementations for other operating systems might.
A database file might contain one or more pages that are not in active use. Unused pages can come about, for example, when information is deleted from the database. Unused pages are stored on the freelist and are reused when additional pages are required.
The freelist is organized as a linked list of freelist trunk pages with each trunk pages containing page numbers for zero or more freelist leaf pages.
A freelist trunk page consists of an array of 4-byte big-endian integers. The size of the array is as many integers as will fit in the usable space of a page. The minimum usable space is 480 bytes so the array will always be at least 120 entries in length. The first integer in the array is the page number of the next freelist trunk page in the list or zero if this is the last freelist trunk page. The second integer in the array is the number of leaf page pointers to follow. Call the second integer L. If L is greater than zero then integers with array indexes between 2 and L+1 inclusive contain page numbers for freelist leaf pages.
Freelist leaf pages contain no information. SQLite avoids reading or writing freelist leaf pages in order to reduce disk I/O.
A bug in SQLite versions prior to 3.6.0 caused the database to be reported as corrupt if any of the last 6 entries in the freelist trunk page array contained non-zero values. Newer versions of SQLite do not have this problem. However, newer versions of SQLite still avoid using the last six entries in the freelist trunk page array in order that database files created by newer versions of SQLite can be read by older versions of SQLite.
The number of freelist pages is stored as a 4-byte big-endian integer in the database header at an offset of 36 from the beginning of the file. The database header also stores the page number of the first freelist trunk page as a 4-byte big-endian integer at an offset of 32 from the beginning of the file.
A b-tree page is either an interior page or a leaf page. A leaf page contains keys and in the case of a table b-tree each key has associated content. An interior page contains K keys without content but with K+1 pointers to child b-tree pages. A "pointer" in an interior b-tree page is just the 31-bit integer page number of the child page.
Define the depth of a leaf b-tree to be 1 and the depth of any interior b-tree to be one more than the maximum depth of any of its children. In a well-formed database, all children of any one interior b-tree have the same depth.
In an interior b-tree page, the pointers and keys logically alternate with a pointer on both ends. (The previous sentence is to be understood conceptually - the actual layout of the keys and pointers within the page is more complicated and will be described in the sequel.) All keys within the same page are unique and are logically organized in ascending order from left to right. (Again, this ordering is logical, not physical. The actual location of keys within the page is arbitrary.) For any key X, pointers to the left of a X refer to b-tree pages on which all keys are less than or equal to X. Pointers to the right of X refer to pages where all keys are greater than X.
Within an interior b-tree page, each key and the pointer to its immediate left are combined into a structure called a "cell". The right-most pointer is held separately. A leaf b-tree page has no pointers, but it still uses the cell structure to hold keys for index b-trees or keys and content for table b-trees.
Every b-tree page has at most one parent b-tree page. A b-tree page without a parent is called a root page. A root b-tree page together with the closure of its children form a complete b-tree. It is possible (and in fact rather common) to have a complete b-tree that consists of a single page that is both a leaf and the root. Because there are pointers from parents to children, every page of a complete b-tree can be located if only the root page is known. Hence, b-trees are identified by their root page number.
A b-tree page is either a table b-tree page or an index b-tree page. All pages within each complete b-tree are of the same type: either table or index. There is a one-to-one mapping from table b-trees in the database file to (non-virtual) tables in the database schema, including system tables such as sqlite_master. There is one-to-one mapping between index b-trees in the database file and indices in the schema, including implied indices created by uniqueness constraints. The b-tree corresponding to the sqlite_master table always has its root page on a page number of 1. The sqlite_master table contains the root page number for every other table and index in the database file.
Each entry in a table b-tree consists of a 64-bit signed integer key and up to 2147483647 bytes of arbitrary data. Interior table b-trees hold only keys and pointers to children. All data is contained in the table b-tree leaves.
Each entry in an index b-tree consists of an arbitrary key of up to 2147483647 bytes in length and no data.
Define the "payload" of a cell to be the arbitrary length section of the cell. For an index b-tree, the key is always arbitrary in length and hence the payload is the key. There are no arbitrary length elements in the cells of interior table b-tree pages and so those cells have no payload. Table b-tree leaf pages contain arbitrary length content and so for cells on those pages the payload is the content.
When the size of payload for a cell exceeds a certain threshold (to be defined later) then only the first few bytes of the payload are stored on the b-tree page and the balance is stored in a linked list of content overflow pages.
A b-tree page is divided into regions in the following order:
The 100-byte database file header is found only on page 1, which is always a table b-tree page. All other b-tree pages in the database file omit this 100-byte header.
The reserved region is an area of unused space at the end of every page (except the locking page) that extensions can use to hold per-page information. The size of the reserved region is determined by the one-byte unsigned integer found at an offset of 20 into the database file header. The size of the reserved region is usually zero.
The b-tree page header is 8 bytes in size for leaf pages and 12 bytes for interior pages. All multibyte values in the page header are big-endian. The b-tree page header is composed of the following fields:
Offset | Size | Description |
---|---|---|
0 | 1 | A flag indicating the b-tree page type A value of 2 means the page is an interior index b-tree page. A value of 5 means the page is an interior table b-tree page. A value of 10 means the page is a leaf index b-tree page. A value of 13 means the page is a leaf table b-tree page. Any other value for the b-tree page type is an error. |
1 | 2 | Byte offset into the page of the first freeblock |
3 | 2 | Number of cells on this page |
5 | 2 | Offset to the first byte of the cell content area. A zero value is used to represent an offset of 65536, which occurs on an empty root page when using a 65536-byte page size. |
7 | 1 | Number of fragmented free bytes within the cell content area |
8 | 4 | The right-most pointer (interior b-tree pages only) |
The cell pointer array of a b-tree page immediately follows the b-tree page header. Let K be the number of cells on the btree. The cell pointer array consists of K 2-byte integer offsets to the cell contents. The cell pointers are arranged in key order with left-most cell (the cell with the smallest key) first and the right-most cell (the cell with the largest key) last.
Cell content is stored in the cell content region of the b-tree page. SQLite strives to place cells as far toward the end of the b-tree page as it can, in order to leave space for future growth of the cell pointer array. The area in between the last cell pointer array entry and the beginning of the first cell is the unallocated region.
If a page contains no cells (which is only possible for a root page of a table that contains no rows) then the offset to the cell content area will equal the page size minus the bytes of reserved space. If the database uses a 65536-byte page size and the reserved space is zero (the usual value for reserved space) then the cell content offset of an empty page wants to be 65536. However, that integer is too large to be stored in a 2-byte unsigned integer, so a value of 0 is used in its place.
A freeblock is a structure used to identify unallocated space within a b-tree page. Freeblocks are organized as a chain. The first 2 bytes of a freeblock are a big-endian integer which is the offset in the b-tree page of the next freeblock in the chain, or zero if the freeblock is the last on the chain. The third and fourth bytes of each freeblock form a big-endian integer which is the size of the freeblock in bytes, including the 4-byte header. Freeblocks are always connected in order of increasing offset. The second field of the b-tree page header is the offset of the first freeblock, or zero if there are no freeblocks on the page. In a well-formed b-tree page, there will always be at least one cell before the first freeblock.
A freeblock requires at least 4 bytes of space. If there is an isolated group of 1, 2, or 3 unused bytes within the cell content area, those bytes comprise a fragment. The total number of bytes in all fragments is stored in the fifth field of the b-tree page header. In a well-formed b-tree page, the total number of bytes in fragments may not exceed 60.
The total amount of free space on a b-tree page consists of the size of the unallocated region plus the total size of all freeblocks plus the number of fragmented free bytes. SQLite may from time to time reorganize a b-tree page so that there are no freeblocks or fragment bytes, all unused bytes are contained in the unallocated space region, and all cells are packed tightly at the end of the page. This is called "defragmenting" the b-tree page.
A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more byte which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are the more significant and bits taken from the later bytes.
The format of a cell depends on which kind of b-tree page the cell appears on. The following table shows the elements of a cell, in order of appearance, for the various b-tree page types.
Table B-Tree Leaf Cell:
- A varint which is the total number of bytes of payload, including any overflow
- A varint which is the integer key, a.k.a. "rowid"
- The initial portion of the payload that does not spill to overflow pages.
- A 4-byte big-endian integer page number for the first page of the overflow page list - omitted if all payload fits on the b-tree page.
Table B-Tree Interior Cell:
- A 4-byte big-endian page number which is the left child pointer.
- A varint which is the integer key
Index B-Tree Leaf Cell:
- A varint which is the total number of bytes of key payload, including any overflow
- The initial portion of the payload that does not spill to overflow pages.
- A 4-byte big-endian integer page number for the first page of the overflow page list - omitted if all payload fits on the b-tree page.
Index B-Tree Interior Cell:
- A 4-byte big-endianpage number which is the left child pointer.
- A varint which is the total number of bytes of key payload, including any overflow
- The initial portion of the payload that does not spill to overflow pages.
- A 4-byte big-endian integer page number for the first page of the overflow page list - omitted if all payload fits on the b-tree page.
The information above can be recast into a table format as follows:
Datatype | Appears in... | Description | |||
---|---|---|---|---|---|
Table Leaf | Table Interior | Index Leaf | Index Interior | ||
4-byte integer | ✔ | ✔ | Page number of left child | ||
varint | ✔ | ✔ | ✔ | Number of bytes of payload | |
varint | ✔ | ✔ | Rowid | ||
byte array | ✔ | ✔ | ✔ | Payload | |
4-byte integer | ✔ | ✔ | ✔ | Page number of first overflow page |
The amount of payload that spills onto overflow pages also depends on the page type. For the following computations, let U be the usable size of a database page, the total page size less the reserved space at the end of each page. And let P be the payload size.
- Table B-Tree Leaf Cell:
If the payload size P is less than or equal to U-35 then the entire payload is stored on the b-tree leaf page. Let M be ((U-12)*32/255)-23. If P is greater than U-35 then the number of byte stored on the b-tree leaf page is the smaller of M+((P-M)%(U-4)) and U-35. Note that number of bytes stored on the leaf page is never less than M.
- Table B-Tree Interior Cell:
Interior pages of table b-trees have no payload and so there is never any payload to spill.
- Index B-Tree Leaf Or Interior Cell:
Let X be ((U-12)*64/255)-23). If the payload size P is less than or equal to X then the entire payload is stored on the b-tree page. Let M be ((U-12)*32/255)-23. If P is greater than X then the number of byte stored on the b-tree page is the smaller of M+((P-M)%(U-4)) and X. Note that number of bytes stored on the index page is never less than M.
The overflow thresholds are designed to give a minimum fanout of 4 for index b-trees and to make sure enough of the payload is on the b-tree page that the record header can usually be accessed without consulting an overflow page. In hindsight, the designers of the SQLite b-tree logic realize that these thresholds could have been made much simpler. However, the computations cannot be changed without resulting in an incompatible file format. And the current computations work well, even if they are a little complex.
When the payload of a b-tree cell is too large for the b-tree page, the surplus is spilled onto overflow pages. Overflow pages form a linked list. The first four bytes of each overflow page are a big-endian integer which is the page number of the next page in the chain, or zero for the final page in the chain. The fifth byte through the last usable byte are used to hold overflow content.
Pointer map or ptrmap pages are extra pages inserted into the database to make the operation of auto_vacuum and incremental_vacuum modes more efficient. Other page types in the database typically have pointers from parent to child. For example, an interior b-tree page contains pointers to its child b-tree pages and an overflow chain has a pointer from earlier to later links in the chain. A ptrmap page contains linkage information going in the opposite direction, from child to parent.
Ptrmap pages must exist in any database file which has a non-zero largest root b-tree page value at offset 52 in the database header. If the largest root b-tree page value is zero, then the database must not contain ptrmap pages.
In a database with ptrmap pages, the first ptrmap page is page 2. A ptrmap page consists of an array of 5-byte entries. Let J be the number of 5-byte entries that will fit in the usable space of a page. (In other words, J=U/5.) The first ptrmap page will contain back pointer information for pages 3 through J+2, inclusive. The second pointer map page will be on page J+3 and that ptrmap page will provide back pointer information for pages J+4 through 2*J+3 inclusive. And so forth for the entire database file.
In a database that uses ptrmap pages, all pages at locations identified by the computation in the previous paragraph must be ptrmap page and no other page may be a ptrmap page. Except, if the byte-lock page happens to fall on the same page number as a ptrmap page, then the ptrmap is moved to the following page for that one case.
Each 5-byte entry on a ptrmap page provides back-link information about one of pages that immediately follow the pointer map. If page B is a ptrmap page then back-link information about page B+1 is provided by the first entry on the pointer map. Information about page B+2 is provided by the second entry. And so forth.
Each 5-byte ptrmap entry consists of one byte of "page type" information followed by a 4-byte big-endian page number. Five page types are recognized:
In any database file that contains ptrmap pages, all b-tree root pages must come before any non-root b-tree page, cell payload overflow page, or freelist page. This restriction ensures that a root page will never be moved during an auto-vacuum or incremental-vacuum. The auto-vacuum logic does not know how to update the root_page field of the sqlite_master table and so it is necessary to prevent root pages from being moved during an auto-vacuum in order to preserve the integrity of the sqlite_master table. Root pages are moved to the beginning of the database file by the CREATE TABLE, CREATE INDEX, DROP TABLE, and DROP INDEX operations.
The foregoing text describes low-level aspects of the SQLite file format. The b-tree mechanism provides a powerful and efficient means of accessing a large data set. This section will describe how the low-level b-tree layer is used to implement higher-level SQL capabilities.
The content of a table b-tree leaf page and the key of any index b-tree page was characterized above as an arbitrary sequence of bytes. The prior discussion mentioned one key being less than another, but did not define what "less than" meant. The current section will address these omissions.
Payload, either table content or index keys, is always in the "record format". The record format defines a sequence of values corresponding to columns in a table or index. The record format specifies the number of columns, the datatype of each column, and the content of each column.
The record format makes extensive use of the variable-length integer or varint representation of 64-bit signed integers defined above.
A record contains a header and a body, in that order. The header begins with a single varint which determines the total number of bytes in the header. The varint value is the size of the header in bytes including the size varint itself. Following the size varint are one or more additional varints, one per column. These additional varints are called "serial type" numbers and determine the datatype of each column, according to the following chart:
Serial Type | Content Size | Meaning |
---|---|---|
0 | 0 | NULL |
1 | 1 | 8-bit twos-complement integer |
2 | 2 | Big-endian 16-bit twos-complement integer |
3 | 3 | Big-endian 24-bit twos-complement integer |
4 | 4 | Big-endian 32-bit twos-complement integer |
5 | 6 | Big-endian 48-bit twos-complement integer |
6 | 8 | Big-endian 64-bit twos-complement integer |
7 | 8 | Big-endian IEEE 754-2008 64-bit floating point number |
8 | 0 | Integer constant 0. Only available for schema format 4 and higher. |
9 | 0 | Integer constant 1. Only available for schema format 4 and higher. |
10,11 | Not used. Reserved for expansion. | |
N≥12 and even | (N-12)/2 | A BLOB that is (N-12)/2 bytes in length |
N≥13 and odd | (N-13)/2 | A string in the database encoding and (N-13)/2 bytes in length. The nul terminator is omitted. |
Note that because of the way varints are defined, the header size varint and serial type varints will usually consist of a single byte. The serial type varints for large strings and BLOBs might extend to two or three byte varints, but that is the exception rather than the rule. The varint format is very efficient at coding the record header.
The values for each column in the record immediately follow the header. Note that for serial types 0, 8, 9, 12, and 13, the value is zero bytes in length. If all columns are of these types then the body section of the record is empty.
The order of keys in an index b-tree is determined by the sort order of the records that the keys represent. Record comparison progresses column by column. Columns of a record are examined from left to right. The first pair of columns that are not equal determines the relative order of the two records. The sort order of individual columns is as follows:
A collating function for each column is necessary in order to compute the order of text fields. SQLite defines three built-in collating functions:
BINARY Strings are compared byte by byte using the memcmp() function from the standard C library. NOCASE Like BINARY except that uppercase ASCII characters ('A' through 'Z') are folded into their lowercase equivalents prior to running the comparison. Note that only ASCII characters are case-folded. NOCASE does not implement a general purpose unicode caseless comparison. RTRIM Like BINARY except that spaces at the end of the string are elided prior to comparison.
Additional application-specific collating functions can be added to SQLite using the sqlite3_create_collation() interface.
The default collating function for all strings is BINARY. Alternative collating functions for table columns can be specified in the CREATE TABLE statement using the COLLATE clause on the column definition. When a column is indexed, the same collating function specified in the CREATE TABLE statement is used for the column in the index, by default, though this can be overridden using a COLLATE clause in the CREATE INDEX statement.
Each ordinary SQL table in the database schema is represented on disk by a table b-tree. Each entry in the table b-tree corresponds to a row of the SQL table. The rowid of the SQL table is the 64-bit signed integer key for each entry in the table b-tree.
The content of each SQL table row is stored in the database file by first combining the values in the various columns into a byte array in the record format, then storing that byte array as the payload in an entry in the table b-tree. The order of values in the record is the same as the order of columns in the SQL table definition. When an SQL table that includes an INTEGER PRIMARY KEY column (which aliases the rowid) then that column appears in the record as a NULL value. SQLite will always use the table b-tree key rather than the NULL value when referencing the INTEGER PRIMARY KEY column.
If the affinity of a column is REAL and that column contains a value that can be converted to an integer without loss of information (if the value contains no fractional part and is not too large to be represented as an integer) then the column may be stored in the record as an integer. SQLite will convert the value back to floating point when extracting it from the record.
Each SQL index, whether explicitly declared via a CREATE INDEX statement or implied by a UNIQUE constraint, corresponds to an index b-tree in the database file. There is one entry in index b-tree for each row in the corresponding table. The key to an index b-tree is a record composed of the columns that are being indexed followed by the rowid of the table row. Because every row in a table has a unique rowid and all keys in an index contain the rowid, all keys in an index are unique.
There is a one-to-one mapping between rows in a table and entries in each index associated with that table. Corresponding rows int the index and table b-trees share the same rowid value, and contain the same value for all indexed columns.
Page 1 of a database file is the root page of a table b-tree that holds a special table named "sqlite_master" (or "sqlite_temp_master" in the case of a TEMP database) which stores the complete database schema. The structure of the sqlite_master table is as if it had been created using the following SQL:
CREATE TABLE sqlite_master( type text, name text, tbl_name text, rootpage integer, sql text );
The sqlite_master table contains one row for each table, index, view, and trigger (collectively "objects") in the database schema, except there is no entry for the sqlite_master table itself. The sqlite_master table contains entries for internal schema objects in addition to application- and programmer-defined objects.
The sqlite_master.type column will be one of the following text strings: 'table', 'index', 'view', or 'trigger' according to the type of object defined. The 'table' string is used for both ordinary and virtual tables.
The sqlite_master.name column will hold the name of the object. UNIQUE and PRIMARY KEY constraints on tables cause SQLite to create internal indices with names of the form "sqlite_autoindex_TABLE_N" where TABLE is replaced by the name of the table that contains the constraint and N is an integer beginning with 1 and increasing by one with each constraint seen in the table definition.The sqlite_master.tbl_name column holds the name of a table or view that the object is associated with. For a table or view, the tbl_name column is a copy of the name column. For an index, the tbl_name is the name of the table that is indexed. For a trigger, the tbl_name column stores the name of the table or view that causes the trigger to fire.
The sqlite_master.rootpage column stores the page number of the root b-tree page for tables and indices. For rows that define views, triggers, and virtual tables, the rootpage column is 0 or NULL.
The sqlite_master.sql column stores SQL text that describes the object. This SQL text is a CREATE TABLE, CREATE VIRTUAL TABLE, CREATE INDEX, CREATE VIEW, or CREATE TRIGGER statement that if evaluated against the database file when it is the main database of a database connection would recreated the object. The text is usually a copy of the original statement used to create the object but with normalizations applied so that the text conforms to the following rules:
The text in the sqlite_master.sql column is a copy of the original CREATE statement text that created the object, except normalized as described above and as modified by subsequent ALTER TABLE statements. The sqlite_master.sql is NULL for the internal indices that are automatically created by UNIQUE or PRIMARY KEY constraints.
In addition to the tables, indices, views, and triggers created by the application and/or the developer using CREATE statements SQL, the sqlite_master table may contain zero or more entries for internal schema objects that are created by SQLite for its own internal use. The names of internal schema objects always begin with "sqlite_" and any table, index, view, or trigger whose name begins with "sqlite_" is an internal schema object. SQLite prohibits applications from creating objects whose names begin with "sqlite_".
Internal schema objects used by SQLite may include the following:
Indices with names of the form "sqlite_autoindex_TABLE_N" that are used to implement UNIQUE and PRIMARY KEY constraints on ordinary tables.
A table with the name "sqlite_sequence" that is used to keep track of the maximum historical INTEGER PRIMARY KEY for a table that using AUTOINCREMENT.
Tables with names of the form "sqlite_statN" where N is an integer. Such tables store database statistics gathered by the ANALYZE command and used by the query planner to help determine the best algorithm to use for each query.
Additional internal schema objects names, always beginning with "sqlite_", may be added to the SQLite file format in future releases.
The sqlite_sequence table is an internal table used to help implement AUTOINCREMENT. The sqlite_sequence table is created automatically whenever any ordinary table with an AUTOINCREMENT integer primary key is created. Once created, the sqlite_sequence table exists in the sqlite_master table forever; it cannot be dropped. The schema for the sqlite_sequence table is:
CREATE TABLE sqlite_sequence(name,seq);
There is a single row in the sqlite_sequence table for each ordinary table that uses AUTOINCREMENT. The name of the table (as it appears in sqlite_master.name) is in the sqlite_sequence.main field and the largest INTEGER PRIMARY KEY ever used by that table is in the sqlite_sequence.seq field. New automatically generated integer primary keys for AUTOINCREMENT tables are guaranteed to be larger than the sqlite_sequence.seq field for that table. If the sqlite_sequence.seq field of an AUTOINCREMENT table is already at the largest integer value (9223372036854775807) then attempts to add new rows to that table with an automatically generated integer primary will fail with an SQLITE_FULL error. The sqlite_sequence.seq field is automatically updated if required when new entries are added to an AUTOINCREMENT table. The sqlite_sequence row for an AUTOINCREMENT table is automatically deleted when the table is dropped. If the sqlite_sequence row for an AUTOINCREMENT table does not exist when the AUTOINCREMENT table is updated, then a new sqlite_sequence row is created. If the sqlite_sequence.seq value for an AUTOINCREMENT table is manually set to something other than an integer and there is a subsequent attempt to insert the or update the AUTOINCREMENT table, then the behavior is undefined.
Application code is allowed to modify the sqlite_sequence table, to add new rows, to delete rows, or to modify existing rows. However, application code cannot create the sqlite_sequence table if it does not already exist. Application code can delete all entries from the sqlite_sequence table, but application code cannot drop the sqlite_sequence table.
The sqlite_stat1 is an internal table created by the ANALYZE command and used to hold supplemental information about tables and indices that the query planner can use to help it find better ways of performing queries. Applications can update, delete from, insert into or drop the sqlite_stat1 table, but may not create or alter the sqlite_stat1 table. The schema of the sqlite_stat1 table is as follows:
CREATE TABLE sqlite_stat1(tbl,idx,stat);
There is normally one row per index, with the index identified by the name in the sqlite_stat1.idx column. The sqlite_stat1.tbl column is the name of the table to which the index belongs. In each such row, the sqlite_stat.stat column will be a string consisting of a list of integers. The first integer in this list is the number of rows in the index and in the table. The second integer is the average number of rows in the index that have the same value in the first column of the index. The third integer is the average number of rows in the index that have the same value for the first two columns. The N-th integer (for N>1) is the average number of rows in the index which have the same value for the first N-1 columns. For a K-column index, there will be K+1 integers in the stat column. If the index is unique, then the last integer will be 1.
The list of integers in the stat column can optionally be followed by the keyword "unordered". The "unordered" keyword, if it is present, must be separated from the last integer by a single space. If the "unordered" keyword is present, then the query planner assumes that the index is unordered and will not use the index for a range query.
If the sqlite_stat1.idx column is NULL, then the sqlite_stat1.stat column contains a single integer which is the (estimated) number of rows in the table identified by sqlite_stat1.tbl.
The sqlite_stat2 is only created and is only used if SQLite is compiled with SQLITE_ENABLE_STAT2 and if the SQLite version number is between 3.6.18 and 3.7.8. The sqlite_stat2 table is neither read nor written by any version of SQLite before 3.6.18 nor after 3.7.8. The sqlite_stat2 table contains additional information about the distribution of keys within an index. The schema of the sqlite_stat2 table is as follows:
CREATE TABLE sqlite_stat2(tbl,idx,sampleno,sample);
The sqlite_stat2.idx column and the sqlite_stat2.tbl column in each row of the sqlite_stat2 table identify an index described by that row. There are usually 10 rows in the sqlite_stat2 table for each index.
The sqlite_stat2 entries for an index that have sqlite_stat2.sampleno between 0 and 9 inclusive are samples of the left-most key value in the index taken at evenly spaced points along the index. Let C be the number of rows in the index. Then the sampled rows are given by
rownumber = (i*C*2 + C)/20
The variable i in the previous expression varies between 0 and 9. Conceptually, the index space is divided into 10 uniform buckets and the samples are the middle row from each bucket.
The format for sqlite_stat2 is recorded here for legacy reference. Recent versions of SQLite no longer support sqlite_stat2 and the sqlite_stat2 table, it is exists, is simply ignored.
The sqlite_stat3 is only created and is only used if SQLite is compiled with SQLITE_ENABLE_STAT3 and if the SQLite version number is 3.7.9 or greater. The sqlite_stat3 table is neither read nor written by any version of SQLite before 3.6.9. The sqlite_stat3 table contains additional information about the distribution of keys within an index, information that the query planner can use to devise better and faster query algorithms. The schema of the sqlite_stat3 table is as follows:
CREATE TABLE sqlite_stat3(tbl,idx,nEq,nLt,nDLt,sample);
There are usually multiple entries in the sqlite_stat3 table for each index. The sqlite_stat3.sample column holds the value of the left-most field of an index identified by the sqlite_stat3.idx and the sqlite_stat3.tbl columns. If the sqlite_stat3.idx and sqlite_stat3.tbl columns hold the same value, then that row contains a sample from the INTEGER PRIMARY KEY of the table. The sqlite_stat3.nEq column holds the approximate number of entries in the index whose left-most column exactly matches the sample. The sqlite_stat3.nLt holds the approximate number of entries in the index whose left-most column is less than the sample. The sqlite_stat3.nDLt column holds the approximate number of distinct left-most entries in the index that are less than the sample.
Future versions of SQLite might change to store a string containing multiple integers values in the sqlite_stat3.nDLt column, a string in which the first integer will be the number of prior index entries that are distinct in the left-most column, the second integer is the number of prior index entries that are distinct in the first two columns, the third integer is the number of prior index entries that are distinct in the first three columns, and so forth With such an extension, the nDLt field will become similar in function to the sqlite_stat1.stat field.
There can be an arbitrary number of sqlite_stat3 entries per index. The ANALYZE command will typically generate sqlite_stat3 tables that contain between 10 and 40 samples that are distributed across the key space and with large nEq values.
The rollback journal is a file associated with each SQLite database file that hold information used to restore the database file to its initial state during the course of a transaction. The rollback journal file is always located in the same directory as the database file and has the same name as the database file but with the string "-journal" appended. There can only be a single rollback journal associated with a give database and hence there can only be one write transaction open against a single database at one time.
If a transaction is aborted due to an application crash, an operating system crash, or a hardware power failure or crash, then the database may be left in an inconsistent state. The next time SQLite attempts to open the database file, the presence of the rollback journal file will be detected and the journal will be automatically played back to restore the database to its state at the start of the incomplete transaction.
A rollback journal is only considered to be valid if it exists and contains a valid header. Hence a transaction can be committed in one of three ways:
A valid rollback journal begins with a header in the following format:
Offset | Size | Description |
---|---|---|
0 | 8 | Header string: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7 |
8 | 4 | The "Page Count" - The number of pages in the next segment of the journal, or -1 to mean all content to the end of the file |
12 | 4 | A random nonce for the checksum |
16 | 4 | Initial size of the database in pages |
20 | 4 | Size of a disk sector assumed by the process that wrote this journal. |
24 | 4 | Size of pages in this journal. |
A rollback journal header is padded with zeros out to the size of a single sector (as defined by the sector size integer at offset 20). The header is in a sector by itself so that if a power loss occurs while writing the sector, information that follows the header will be (hopefully) undamaged.
After the header and zero padding are zero or more page records. Each page record stores a copy of the content of a page from the database file before it was changed. The same page may not appear more than once within a single rollback journal. To rollback an incomplete transaction, a process has merely to read the rollback journal from beginning to end and write pages found in the journal back into the database file at the appropriate location.
Let the database page size (the value of the integer at offset 24 in the journal header) be N. Then the format of a page record is as follows:
Offset | Size | Description |
---|---|---|
0 | 4 | The page number in the database file |
4 | N | Original content of the page prior to the start of the transaction |
N+4 | 4 | Checksum |
The checksum is an unsigned 32-bit integer computed as follows:
The checksum value is used to guard against incomplete writes of a journal page record following a power failure. A different random nonce is used each time a transaction is started in order to minimize the risk that unwritten sectors might by chance contain data from the same page that was a part of prior journals. By changing the nonce for each transaction, stale data on disk will still generate an incorrect checksum and be detected with high probability. The checksum only uses a sparse sample of 32-bit words from the data record for performance reasons - design studies during the planning phases of SQLite 3.0.0 showed a significant performance hit in checksumming the entire page.
Let the page count value at offset 8 in the journal header be M. If M is greater than zero then after M page records the journal file may be zero padded out to the next multiple of the sector size and another journal header may be inserted. All journal headers within the same journal must contain the same database page size and sector size.
If M is -1 in the initial journal header, then the number of page records that follow is computed by computing how many page records will fit in the available space of the remainder of the journal file.
Beginning with version 3.7.0, SQLite supports a new transaction control mechanism called "write-ahead log" or "WAL". When a database is in WAL mode, all connections to that database must use the WAL. A particular database will use either a rollback journal or a WAL, but not both at the same time. The WAL is always located in the same directory as the database file and has the same name as the database file but with the string "-wal" appended.
A WAL file consists of a header followed by zero or more "frames". Each frame records the revised content of a single page from the database file. All changes to the database are recorded by writing frames into the WAL. Transactions commit when a frame is written that contains a commit marker. A single WAL can and usually does record multiple transactions. Periodically, the content of the WAL is transferred back into the database file in an operation called a "checkpoint".
A single WAL file can be reused multiple times. In other words, the WAL can fill up with frames and then be checkpointed and then new frames can overwrite the old ones. A WAL always grows from beginning toward the end. Checksums and counters attached to each frame are used to determine which frames within the WAL are valid and which are leftovers from prior checkpoints.
The WAL header is 32 bytes in size and consists of the following eight big-endian 32-bit unsigned integer values:
Offset | Size | Description |
---|---|---|
0 | 4 | Magic number. 0x377f0682 or 0x377f0683 |
4 | 4 | File format version. Currently 3007000. |
8 | 4 | Database page size. Example: 1024 |
12 | 4 | Checkpoint sequence number |
16 | 4 | Salt-1: random integer incremented with each checkpoint |
20 | 4 | Salt-2: a different random number for each checkpoint |
24 | 4 | Checksum-1: First part of a checksum on the first 24 bytes of header |
28 | 4 | Checksum-2: Second part of the checksum on the first 24 bytes of header |
Immediately following the wal-header are zero or more frames. Each frame consists of a 24-byte frame-header followed by a page-size bytes of page data. The frame-header is six big-endian 32-bit unsigned integer values, as follows:
Offset | Size | Description |
---|---|---|
0 | 4 | Page number |
4 | 4 | For commit records, the size of the database file in pages after the commit. For all other records, zero. |
8 | 4 | Salt-1 copied from the WAL header |
12 | 4 | Salt-2 copied from the WAL header |
16 | 4 | Checksum-1: Cumulative checksum up through and including this page |
20 | 4 | Checksum-2: Second half of the cumulative checksum. |
A frame is considered valid if and only if the following conditions are true:
The salt-1 and salt-2 values in the frame-header match salt values in the wal-header
The checksum values in the final 8 bytes of the frame-header exactly match the checksum computed consecutively on the WAL header and the first 8 bytes and the content of all frames up to and including the current frame.
The checksum is computed by interpreting the input as an even number of unsigned 32-bit integers: x(0) through x(N). The 32-bit integers are big-endian if the magic number in the first 4 bytes of the WAL header is 0x377f0683 and the integers are little-endian if the magic number is 0x377f0682. The checksum values are always stored in the frame header in a big-endian format regardless of which byte order is used to compute the checksum.
The checksum algorithm only works for content which is a multiple of 8 bytes in length. In other words, if the inputs are x(0) through x(N) then N must be odd. The checksum algorithm is as follows:
s0 = s1 = 0 for i from 0 to n-1 step 2: s0 += x(i) + s1; s1 += x(i+1) + s0; endfor # result in s0 and s1
The outputs s0 and s1 are both weighted checksums using Fibonacci weights in reverse order. (The largest Fibonacci weight occurs on the first element of the sequence being summed.) The s1 value spans all 32-bit integer terms of the sequence whereas s0 omits the final term.
On a checkpoint, the WAL is first flushed to persistent storage using the xSync method of the VFS. Then valid content of the WAL is transferred into the database file. Finally, the database is flushed to persistent storage using another xSync method call. The xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.
After each checkpoint, the WAL header salt-1 value is incremented and the salt-2 value is randomized. This prevents old and new frames in the WAL from being considered valid at the same time and being checkpointing together following a crash.
To read a page from the database (call it page number P), a reader first checks the WAL to see if it contains page P. If so, then the last valid instance of page P that is followed by a commit frame or is a commit frame itself becomes the value read. If the WAL contains no copies of page P that are valid and which are a commit frame or are followed by a commit frame, then page P is read from the database file.
To start a read transaction, the reader records the index of the last valid frame in the WAL. The reader uses this recorded "mxFrame" value for all subsequent read operations. New transactions can be appended to the WAL, but as long as the reader uses its original mxFrame value and ignores subsequently appended content, the reader will see a consistent snapshot of the database from a single point in time. This technique allows multiple concurrent readers to view different versions of the database content simultaneously.
The reader algorithm in the previous paragraphs works correctly, but because frames for page P can appear anywhere within the WAL, the reader has to scan the entire WAL looking for page P frames. If the WAL is large (multiple megabytes is typical) that scan can be slow, and read performance suffers. To overcome this problem, a separate data structure called the wal-index is maintained to expedite the search for frames of a particular page.
Conceptually, the wal-index is shared memory, though the current VFS implementations use a mmapped file for the wal-index. The mmapped file is in the same directory as the database and has the same name as the database with a "-shm" suffix appended. Because the wal-index is shared memory, SQLite does not support journal_mode=WAL on a network filesystem when clients are on different machines. All users of the database must be able to share the same memory.
The purpose of the wal-index is to answer this question quickly:
Given a page number P and a maximum WAL frame index M, return the largest WAL frame index for page P that does not exceed M, or return NULL if there are no frames for page P that do not exceed M.
The M value in the previous paragraph is the "mxFrame" value defined in section 4.4 that is read at the start of a transaction and which defines the maximum frame from the WAL that the reader will use.
The wal-index is transient. After a crash, the wal-index is reconstructed from the original WAL file. The VFS is required to either truncate or zero the header of the wal-index when the last connection to it closes. Because the wal-index is transient, it can use an architecture-specific format; it does not have to be cross-platform. Hence, unlike the database and WAL file formats which store all values as big endian, the wal-index stores multi-byte values in the native byte order of the host computer.
This document is concerned with the persistent state of the database file, and since the wal-index is a transient structure, no further information about the format of the wal-index will be provided here. Complete details on the format of the wal-index are contained within comments in SQLite source code.