Tuesday, March 2, 2010

Simple is Good

The claim was made in an earlier entry that the TFS (Transactional File Server) being introduced in our next version of RDM Embedded has "relatively little to do" when a transaction is submitted to it by a runtime library. This post is going to explain why that is true.

First of all, remember that the runtimes execute independently of the TFS and of each other, so that multiple cores, processors and/or computers are used to the fullest. Each runtime, regardless of where it is located relative to the TFS, has the job of requesting resources (locks and pages) from the TFS, performing its entire transaction on its own, and finally encapsulating that transaction in a form that the TFS can merge into the database(s) with minimal effort. This matters because the TFS is the critical path for throughput: the more work the other processes take off the TFS's hands, the higher the overall system throughput.
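
To make that division of labor concrete, here is a rough sketch (in C) of the kinds of messages it implies. To be clear, these names and layouts are my illustration, not the actual RDM Embedded protocol:

    /* Hypothetical sketch of runtime-to-TFS traffic. Only the last
     * message kind costs the TFS real work, and even that work is a
     * merge of pages prepared entirely by the runtime. */
    #include <stdint.h>

    enum tfs_request_type {
        TFS_LOCK_REQUEST,   /* runtime asks the TFS for locks              */
        TFS_PAGE_REQUEST,   /* runtime asks for current images of pages    */
        TFS_COMMIT_LOG      /* runtime hands over a finished page-image log */
    };

    struct tfs_request {
        uint32_t type;      /* one of tfs_request_type                     */
        uint32_t file_id;   /* which database file (lock/page requests)    */
        uint32_t page_no;   /* which page within that file                 */
        uint32_t log_bytes; /* size of the attached log (TFS_COMMIT_LOG)   */
        /* payload (lock list or log contents) follows on the wire */
    };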

A database maintained by RDM Embedded is a collection of files, each made up of fixed-size pages. A typical update changes several files, and several pages within each of those files, and all of those pages must change together - updating some of them but not all can corrupt the database. The atomicity principle of database transactions insists that all of the updates occur as a single unit. The durability principle insists that once the DBMS says a transaction is committed, it doesn't matter if the computer crashes - the transaction will still be there when the computer comes back up.

There has been much research on how best to achieve these transaction requirements without sacrificing performance. Some of the solutions are elegant, sophisticated and "computer sciencey." Well, I have been trained in computer science too, and I understand the geeky pull of solutions that only special people can understand. But having been in the database engine business for 25 years now, I'm going to make the claim that Simple is Good. And simple, in the case of an RDM Embedded transaction, means having the runtime write images of every page changed or created by a transaction into a log file, hand that log file to the TFS, and let the TFS control the writing of those images to the database files. We call it page-image-logging, and it is too simple to warrant its own page in Wikipedia. Other techniques do have pages; see Write-Ahead-Logging and Shadow Paging.
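
To show just how simple, here is a sketch of what such a log could look like. The page size, record layout and function here are illustrative assumptions, not RDM Embedded's actual on-disk format:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096     /* assumed page size, for illustration */

    /* One log record: the complete after-image of one page. */
    struct page_image {
        uint32_t file_id;          /* which database file the page belongs to */
        uint32_t page_no;          /* page number within that file            */
        uint8_t  image[PAGE_SIZE]; /* full image of the page after the change */
    };

    /* Append the image of every page the transaction changed or created
     * to its log file. Nothing clever is stored - just whole pages,
     * ready to be copied into place by the TFS. */
    int write_txn_log(const char *log_path,
                      const struct page_image *pages, size_t npages)
    {
        FILE *log = fopen(log_path, "wb");
        if (!log)
            return -1;
        if (fwrite(pages, sizeof pages[0], npages, log) != npages) {
            fclose(log);
            return -1;
        }
        return fclose(log); /* syncing to disk happens at commit, sketched below */
    }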


Okay, it is simple, but why is it good? Because of two programming "truths": complexity leads to unreliability, and simple pieces fit together with other simple pieces.


Simplicity Leads to Reliability

To make sure that each transaction is durable, the runtimes make sure that a transaction's page-image-log is "synced" to disk, meaning that even if the computer loses power, the file containing the transaction will be found in its entirety on restart. Until the file sync function returns, neither the runtimes nor the TFS will assume that the data can be found again. The thing about file syncing is that it is expensive! A DBMS must sync files, but it should do so at key moments, as infrequently as possible.
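
In POSIX terms, that "sync" is an fsync() of the log file before the commit is acknowledged. A sketch - the function name and calling convention are mine, the system calls are real:

    #include <fcntl.h>
    #include <unistd.h>

    /* Durability sketch: write the finished page-image log, then
     * fsync() it before anyone is told the transaction committed. */
    int commit_txn_log(const char *log_path, const void *log, size_t len)
    {
        int fd = open(log_path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, log, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        /* Only now - after fsync() has returned - may the runtime
         * report the transaction as committed. */
        return close(fd);
    }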

To make sure each transaction is atomic, the TFS makes sure that either all of the pages stored in a log are written to the database files, or none of them are. It can promise this because the log itself is already safely on disk: if a crash interrupts the database writes, the intact log is simply replayed on restart.

Any reliable DBMS must assume, every time it starts up, that it may be recovering from a crash. The only evidence it has is what is stored on the disk, and that evidence must be coherent. In our case, it consists of whatever log files exist and haven't yet been removed (simple - no sophisticated analysis needed!). Several of these files may exist, and because of their simple format, re-writing them to the database in their original order always produces the exact same database contents - even if some of them had already been written successfully before the crash. That repeatable result is a benefit of the simplicity of the log file format.
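
Recovery, then, can be little more than the following sketch: collect the surviving logs, sort them back into commit order, and replay them. The sequence-numbered file names and the apply_log() helper are illustrative assumptions:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical replay of one log: copy each page image into its
     * place in the database files. Body omitted; the record format is
     * sketched above. */
    static void apply_log(const char *dir, const char *name)
    {
        (void)dir;
        (void)name;
    }

    /* Assumes (for illustration) that log file names carry an increasing
     * sequence number, e.g. "000041.log", so sorting the names
     * reproduces commit order. */
    static int by_name(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    void recover(const char *log_dir)
    {
        char *names[1024];
        size_t n = 0;
        DIR *d = opendir(log_dir);
        struct dirent *e;

        if (!d)
            return;
        while ((e = readdir(d)) != NULL && n < 1024)
            if (strstr(e->d_name, ".log"))
                names[n++] = strdup(e->d_name);
        closedir(d);

        qsort(names, n, sizeof names[0], by_name);
        for (size_t i = 0; i < n; i++) {
            /* Re-applying a log that was already written before the
             * crash is harmless: the same images produce the same
             * database. */
            apply_log(log_dir, names[i]);
            free(names[i]);
        }
    }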

While running, the TFS receives log files and queues them up for writing to the database. Every so often it writes all of the accumulated logs to the database files and syncs them. Only then can the logs be removed; until then, the writing is repeatable. And because the logs are batched and written as one large series of writes, the sync operation is performed far less frequently than it would be if it were done after every transaction. This sequence guarantees that the database is recoverable no matter when the computer may crash - and it is quick.
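
That sequence is short enough to sketch outright. Assume, for brevity, a single database file and a hypothetical apply_log_pages() helper:

    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical: copy the page images in one log into the (open)
     * database file. Body omitted. */
    static void apply_log_pages(int db_fd, const char *log_path)
    {
        (void)db_fd;
        (void)log_path;
    }

    /* The order of the three steps is what makes a crash survivable at
     * any point. */
    void flush_log_batch(int db_fd, char *const *log_paths, size_t nlogs)
    {
        /* 1. Write every queued log's pages into the database. The logs
         *    are still on disk, so this whole step can be redone after a
         *    crash. */
        for (size_t i = 0; i < nlogs; i++)
            apply_log_pages(db_fd, log_paths[i]);

        /* 2. One fsync() covers the entire batch: the expensive
         *    operation is paid once per batch, not once per transaction. */
        if (fsync(db_fd) != 0)
            return; /* keep the logs and retry the batch later */

        /* 3. The images are now durable in the database itself, so the
         *    logs are no longer needed as evidence. */
        for (size_t i = 0; i < nlogs; i++)
            remove(log_paths[i]);
    }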

Simple Pieces Fit Together with Other Simple Pieces

Let's say that 100 transactions have occurred and there are 100 log files ready to be written to the database files. Most databases have "hot spots" that are modified all the time, so it's likely that the 100 log files contain many repeats of the same pages. When all of the logs are written to the database, only the most recent image of each page survives - every earlier one is overwritten. Why, then, should every log page be written at all? It shouldn't have to be: because it is possible to find the most recent image of each page among all of the page images in all of the logs, the actual writing to the database can be optimized to write each page only once. An optimization made possible by simplicity.
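
One way to carry out that optimization - a sketch, not our actual implementation - is to sort every image in the batch by file and page, with commit order as the tiebreaker, and keep only the last image of each group:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical: write one page image to the database. Body omitted. */
    static void write_page(uint32_t file_id, uint32_t page_no,
                           const uint8_t *image)
    {
        (void)file_id;
        (void)page_no;
        (void)image;
    }

    /* One page image drawn from some log in the batch. The seq field is
     * the log's position in commit order and exists only for the
     * tiebreak. */
    struct img {
        uint32_t file_id;
        uint32_t page_no;
        uint32_t seq;
        const uint8_t *image;
    };

    static int cmp(const void *a, const void *b)
    {
        const struct img *x = a, *y = b;
        if (x->file_id != y->file_id) return x->file_id < y->file_id ? -1 : 1;
        if (x->page_no != y->page_no) return x->page_no < y->page_no ? -1 : 1;
        return (x->seq > y->seq) - (x->seq < y->seq);
    }

    /* After sorting by (file, page, commit order), the newest image of
     * each page is the last element of its group - write only those. */
    void write_newest_only(struct img *v, size_t n)
    {
        qsort(v, n, sizeof v[0], cmp);
        for (size_t i = 0; i < n; i++)
            if (i + 1 == n || v[i].file_id != v[i + 1].file_id
                           || v[i].page_no != v[i + 1].page_no)
                write_page(v[i].file_id, v[i].page_no, v[i].image);
    }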

Another possibility with page-image-logging: database mirroring. What if the log files are transferred to another location and applied in the same order as they were originally? You have another, identical database for very little cost. What if the logs on the receiving side are given to the TFS running there as though they had been created by local runtimes (it won't know the difference)? You have a database that can be read through that TFS, thereby offloading the original database. What if you eliminated the redundant pages from several logs before sending one combined log to the mirror location? It wouldn't know the difference, and would apply it as though it had always been one big transaction.
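
Even the shipping step is trivial. A sketch, with an assumed "spool" directory where the mirror's TFS looks for incoming logs:

    #include <stdio.h>

    /* Mirroring sketch: "shipping" a log is nothing more than copying
     * its bytes to wherever the mirror's TFS looks for incoming logs.
     * The spool-path convention is an assumption for illustration. */
    int ship_log(const char *log_path, const char *mirror_spool_path)
    {
        char buf[1 << 16];
        size_t n;
        FILE *in = fopen(log_path, "rb");
        FILE *out = in ? fopen(mirror_spool_path, "wb") : NULL;

        if (!in || !out) {
            if (in)
                fclose(in);
            return -1;
        }
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            if (fwrite(buf, 1, n, out) != n)
                break;
        fclose(in);
        /* The mirror's TFS applies this file exactly as it applies a
         * log from a local runtime - it can't tell the difference. */
        return fclose(out);
    }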

Database mirroring is a major new feature of the next version of RDM Embedded, of course. It wasn't too hard to add. And it got us to thinking - mirroring looks like a major piece of a distributed database solution, doesn't it? It does, and we are working on that too. More later.