Tuesday, March 2, 2010

Simple is Good

The claim was made in an earlier entry that the TFS (Transactional File Server) being introduced in our next version of RDM Embedded has "relatively little to do" when a transaction is submitted to it by a runtime library. This post is going to explain why that is true.

First of all, remember that the runtimes execute independently of the TFS and of each other, so that maximum use is made of multiple cores, processors and/or computers. Each runtime, regardless of where it is located relative to the TFS, has the job of requesting resources (locks and pages) from the TFS, performing its entire transaction independently, and finally encapsulating the transaction in a form that can be merged into the database(s) by the TFS with minimal effort. This is important because the TFS is the critical path for throughput - the more work other processes do to reduce the work done by the TFS, the higher the overall system throughput.

A database maintained by RDM Embedded is a collection of files, each with fixed-size pages. Whenever a database must be updated, it is typical that several files, and several pages within each of those files, must be updated together. Updating some of the pages but not all of them can result in database corruption. The atomic principle of database transactions insists that all updates occur as a single unit. The durable principle insists that once the DBMS says a transaction is committed, it doesn't matter if the computer crashes - the transaction will still be found when the computer comes back up.

There has been much research on how best to achieve the above requirements of transactions without sacrificing performance. Some of the solutions are very elegant, sophisticated and "computer sciencey." Well, I have been trained in computer science too, and understand the geeky pull of solutions that only special people can understand. But having been in the database engine business for 25 years now, I'm going to make the claim that Simple is Good. And simple, in the case of an RDM Embedded transaction, is to have the runtime write page images of all pages that have been changed or created by a transaction into a log file, give that log file to the TFS, and have the TFS control the writing of those images to the database files. We call it page-image logging, and it is too simple to warrant its own page in Wikipedia. Other techniques do have their own pages; see Write-Ahead Logging and Shadow Paging.
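
To make the idea concrete, here is a minimal sketch of what writing such a log might look like. It is illustrative only - the structure and function names (PageImage, write_log) are invented for this post, not taken from the RDM Embedded source.

    /* Illustrative page-image logging: write the after-image of every page
       a transaction changed or created into one log file. Hypothetical code. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    typedef struct {
        uint32_t file_id;            /* which database file the page belongs to */
        uint32_t page_no;            /* page number within that file */
        uint8_t  data[PAGE_SIZE];    /* the full after-image of the page */
    } PageImage;

    /* Write all changed/created pages of one transaction to a single log file. */
    int write_log(const char *log_path, const PageImage *pages, uint32_t count)
    {
        FILE *log = fopen(log_path, "wb");
        if (!log)
            return -1;
        fwrite(&count, sizeof count, 1, log);      /* tiny header: page count */
        fwrite(pages, sizeof *pages, count, log);  /* then the page images */
        fclose(log);                               /* syncing is covered below */
        return 0;
    }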


Okay, it is simple, but why is it good? Because of two programming "truths": complexity leads to unreliability, and simple pieces fit together with other simple pieces.


Simplicity Leads to Reliability

To make sure that each transaction is durable, the runtimes make sure that a transaction's page-image log is "synced" to disk, meaning that the computer can lose power and the file containing the transaction will still be found, in its entirety, when the computer restarts. Until a file sync function returns, neither the runtimes nor the TFS will assume that the data can be found again. The thing about file syncing is that it is expensive! A DBMS must sync files, but it should do so at key moments and as infrequently as possible.
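
On a POSIX system, the sync step looks roughly like the sketch below. fsync() is the standard call; the actual call used by the runtime may differ by platform, and sync_log() is a name made up for this example.

    #include <fcntl.h>
    #include <unistd.h>

    /* Durability sketch: a transaction is not acknowledged as committed until
       its log file has reached stable storage. Illustrative only. */
    int sync_log(const char *log_path)
    {
        int fd = open(log_path, O_WRONLY);
        if (fd < 0)
            return -1;
        int rc = fsync(fd);  /* expensive: forces the data out of OS buffers */
        close(fd);
        return rc;           /* only after success may "committed" be reported */
    }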

To make sure each transaction is atomic, the TFS makes sure that either all pages stored in a log are written to the database files, or none are.

Any reliable DBMS must assume, when it starts up, that it is recovering from a crash. The only evidence it has is what is stored on disk, and that evidence must be coherent. In our case, recovery consists of finding the log files that exist and haven't yet been removed (simple - no sophisticated analysis needed!). Because of their simple format, many of these files may exist, and as long as they are re-written to the database in the same order, the exact same database contents will result. This is true even if they were successfully written the first time. The repeatable result is a benefit of the simplicity of the log file format.

While running, the TFS receives log files and queues them up for writing to the database. Every so often, it writes all accumulated logs to the database and syncs the database files. Only then can the logs be removed. Until then, the writing is repeatable. And because the logs are batched and written as one large series of writes, the sync operation is performed much less frequently than it would be if it were done after every transaction. This sequence makes sure that a database is recoverable no matter when the computer may crash. It also keeps the whole process quick.
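
Putting the last two paragraphs together, the commit/recovery path on the TFS side might look something like the sketch below. The log layout matches the write_log() sketch above; db_files, sync_database_files() and the rest are placeholders, not the shipping code.

    /* Replay queued logs in their original order, sync the database files once,
       then delete the logs. Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define MAX_FILES 16

    typedef struct {
        uint32_t file_id;
        uint32_t page_no;
        uint8_t  data[PAGE_SIZE];
    } PageImage;

    extern FILE *db_files[MAX_FILES];        /* open database files, indexed by file_id */
    extern void  sync_database_files(void);  /* fsync each database file */

    static void replay_log(const char *log_path)
    {
        FILE *log = fopen(log_path, "rb");
        uint32_t count;
        PageImage img;

        if (!log)
            return;
        if (fread(&count, sizeof count, 1, log) != 1) {
            fclose(log);
            return;
        }
        while (count-- && fread(&img, sizeof img, 1, log) == 1) {
            FILE *db = db_files[img.file_id];
            fseek(db, (long)img.page_no * PAGE_SIZE, SEEK_SET);
            fwrite(img.data, PAGE_SIZE, 1, db);  /* re-writing is harmless: same result */
        }
        fclose(log);
    }

    void apply_logs(const char *log_paths[], int n_logs)
    {
        for (int i = 0; i < n_logs; i++)
            replay_log(log_paths[i]);  /* same order as the transactions committed */

        sync_database_files();         /* one sync for the whole batch */

        for (int i = 0; i < n_logs; i++)
            remove(log_paths[i]);      /* only now may the logs be discarded */
    }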

Simple Pieces Fit Together with Other Simple Pieces

Let's say that 100 transactions have occurred and there are 100 log files ready to write to the database files. In most databases, there are "hot spots" that get modified all the time, so it's likely that the 100 log files contain many repeats of the same pages. When all the logs are written to the database, only the most recent image of each page survives - all earlier ones are overwritten. Why then should every log page get written? It shouldn't. Because it is possible to find the most recent image of each page among all the page images in all the logs, the actual writing to the database can be optimized so that each page is written only once - an optimization made possible by simplicity.
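
A sketch of that optimization, again with made-up names and the same hypothetical PageImage layout as earlier: scan the logs oldest to newest and keep only the newest image of each (file, page) pair.

    #include <stdint.h>

    #define PAGE_SIZE    4096
    #define MAX_DISTINCT 1024

    typedef struct {
        uint32_t file_id;
        uint32_t page_no;
        uint8_t  data[PAGE_SIZE];
    } PageImage;

    static PageImage latest[MAX_DISTINCT];  /* newest image of each distinct page */
    static int       n_latest = 0;

    /* Feed every page image from every log, oldest log first, through this. */
    static void keep_latest(const PageImage *img)
    {
        for (int i = 0; i < n_latest; i++) {
            if (latest[i].file_id == img->file_id &&
                latest[i].page_no == img->page_no) {
                latest[i] = *img;           /* a newer image replaces the older one */
                return;
            }
        }
        if (n_latest < MAX_DISTINCT)
            latest[n_latest++] = *img;      /* first time this page has appeared */
    }

    /* Afterwards, only the images remaining in latest[] need to be written
       to the database files - each page exactly once. */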

Another possibility with page-image logging: database mirroring. What if the log files are transferred to another location and applied in the same order as they were originally? You have another, identical database for very little cost. What if the logs on the receiving side are given to the TFS running there as though they were created by local runtimes (it won't know the difference)? You have a database that can be read through this TFS, thereby offloading the original database. What if you eliminated the redundant pages from several logs before sending one combined log to the mirror location? It wouldn't know the difference, and would apply it as though it had always been one big transaction.

Database mirroring is a major new feature of the next version of RDM Embedded, of course. It wasn't too hard to add. And it got us to thinking - mirroring looks like a major piece of a distributed database solution, doesn't it? It does, and we are working on that too. More later.

Tuesday, February 23, 2010

RDM Embedded's Transactional File Server

There are many programmers who, over the past 20 years, have implemented a multi-user database system using db_VISTA, Raima Database Manager or RDM Embedded (depending on when you got on board). Congratulations! That was not easy.

The next release of RDM Embedded has been re-architected for efficient, scalable multi-core operation, but it has also become much simpler to use. It turns out that application developers don't want or need to configure every little option in order to use a database. Rather, they would just like to use it. That may seem like a subtle point, but it has had a dramatic effect on usability of RDM Embedded by the programmer.

Both usability and multi-core awareness are facilitated through the new component of RDM Embedded, called the Transactional File Server (or TFS, as I will refer to it). The TFS owns databases, controlling how they are read and updated. In the old architecture, the runtime library did all this from within each application that was using an Embedded database. The TFS centralizes the critical functionality of database reading and updating. It is the central piece in satisfying the major multi-core design principle I discussed in the last blog entry:
  • Obtain all resources needed for a task (the TFS requires database pages to be locked prior to reading; it then provides the pages to the requesting runtime library).
  • Perform the task. This is the job of the runtime library. After obtaining existing database pages, or creating new ones, the runtime library submits the completed transaction to the TFS for commit.
  • Quickly merge, or re-integrate, the results of the task. Since the runtime library has encapsulated the changes of a transaction into a single unit (actually, a log file), the TFS has a minimal amount of work to do to commit the changes. The "heavy lifting" was done by the process linked to the runtime library, and the TFS has little to do but flip the "commit" switch.
So the way it all fits together is like this. The runtime library is linked into application programs. As always, the runtime library reads database pages, makes changes to them, creates new pages, then writes out all pages, modified or new, to the database. Each application in the system has its own copy of the runtime library, maintaining its own local cache and submitting the new or changed pages to the database. Only now, both the reading and the writing go through the TFS. In all of this, the real work is done by the runtime library in its local cache. The TFS, serving pages and committing transactions for multiple runtimes, is left with relatively little to do.
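
In outline, one transaction looks roughly like this from the runtime's point of view. Every name below (tfs_lock_pages, tfs_read_page, tfs_commit_log, and so on) is invented for illustration - this is not the actual RDM Embedded API.

    /* Conceptual flow of one transaction through the runtime library. */
    typedef struct Page Page;        /* a database page held in the local cache */
    typedef struct LogFile LogFile;  /* the page-image log for one transaction */

    extern void     tfs_lock_pages(void);            /* ask the TFS for the needed locks */
    extern Page    *tfs_read_page(int file_id, int page_no);
    extern void     modify_page(Page *p);            /* all real work happens locally */
    extern LogFile *build_log_from_dirty_pages(void);
    extern void     tfs_commit_log(LogFile *log);    /* TFS "flips the commit switch" */
    extern void     tfs_release_locks(void);

    void do_transaction(int file_id, int page_no)
    {
        tfs_lock_pages();                           /* 1. obtain resources */

        Page *p = tfs_read_page(file_id, page_no);  /* 2. perform the task locally */
        modify_page(p);

        LogFile *log = build_log_from_dirty_pages();
        tfs_commit_log(log);                        /* 3. quick merge by the TFS */

        tfs_release_locks();
    }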


This allows multiple cores to work simultaneously without blocking each other. Even within the TFS, the threads that service the runtimes are mostly independent of one another.

Now for the usability claim. The TFS is the runtime's only "connection" to the outside world - if the runtime can connect to the TFS, it can do its job. The connection is established through the most pervasive mechanism ever known: TCP/IP. If the TFS is visible, the runtime can connect to it. There are no more shared disk drives or TRUENAME for identifying files, no lock manager, no DBDPATH, DBFPATH, DBTAF, ... (ugh, what were we thinking at the time?).

So, a TFS running on a computer does not need to share any files or drives in order for other runtimes to use the databases within the TFS domain. Runtimes may be on the same computer as the TFS, on another computer on its LAN, or somewhere else on the Internet. The only difference will be performance, not whether it works.

This brings up an unintended benefit of this architecture - not only can multiple cores on one computer allow the TFS and runtimes to scale up together, but other computers can be added to the mix. Multiple cores plus multiple computers, all working together in parallel. Not only that, but one runtime can access databases from multiple TFSs. Not only that, but one runtime can view databases scattered throughout the world, in different TFSs, as though they were one unified database.

The separation of database semantics and operations (the runtime) from the TFS (safe, transactional updating of files) lets the pieces fit where they need to fit, and to be combined in an indefinite number of ways, depending on the need.

Usability now means more than "easier to use," it means usable in ways never considered before.

Wednesday, February 17, 2010

Multi-core Aware Software Architecture

Almost every major software system in use today was initially created prior to the advent of multi-core computers. Multi-processor computers have existed for a while, but not for the common computer, and very few software systems were adapted to take full advantage of them.

So what's the problem? Why can't you throw more hardware at it and expect it to run faster? The issue is almost always characterized as "shared resources." A shared resource is anything that one task, or thread must use without worrying about another task changing it while it is being used. Concepts of synchronization, locking, mutual exclusion or "critical sections" were developed to deal with the issue of safe sharing of common resources.

The traditional mechanisms for creating safe sharing among tasks are semaphores, queues and events. Used correctly, these programming primitives allow everything from the incrementing of a shared number to a complete database transaction to be done without the corruption that would occur if tasks were not properly sequenced, or "serialized."

In the single-CPU age, multi-processing or multi-threading existed to create the illusion of multiple tasks happening concurrently. When synchronization primitives prevented one task from making progress, no problem - the single CPU would simply switch to another task that was not blocked. Software systems with shared resources were not degraded too much by synchronization as long as there was always something for the CPU to do.

Why can't multiple cores just make this go 2 or 4 times faster? I'll try to use an analogy to explain it. Imagine a person at a desk with an inbox, a computer, and an outbox. The job is to take the next item from the inbox and read it, enter information from the item into the computer, get a response from the computer, write it down on a form, prepare it for delivery and put it in the outbox. This cycle takes on average 5 minutes per item, and the boss needs an output of 30 per hour, not the current 12. To solve this problem, the boss puts another person (core) at the same desk and allows the computer to swivel. Getting started, both people reach into the inbox, grab the same item and tear it in two. Starting over, they agree to notify each other before grabbing. Now, they each read their next items and are ready for data entry. Unfortunately, only one can do this at a time, so the other waits, turns the monitor around and starts data entry. In the meantime, the first person has prepared the response for the outbox, grabbed the next item, and now needs to wait a short time for the computer monitor.

The good news is that output went from 12 items per hour to 18. The bad news is that the boss still requires 30. Another person is added to the desk and production goes from 18 to 22 items per hour. The next person takes it to 24. One more person moves it back down to 22, because this person, being equally skilled in processing items, is actually interfering with the work that was already occurring.

So it goes, in a very real way, with software systems that have more cores thrown at them. The first few help, a little, but they get to a point where performance degrades.

Instead, if another desk, inbox, computer and outbox were provided for the second person, output would nearly double. Nearly, because some additional time would be required to fill two inboxes and empty two outboxes rather than just one, but clearly a good trade-off. The solution is more expensive, but pays for itself quickly.

Changing a software system to split up the work is not so easy, especially software that is well established and mature. It's not just a change in algorithms, it's a change in architecture.

If you do a web search for multi-core optimization, you will find articles about shared L2 cache or matrix processing. Many OS and chip vendors say that they have done the optimization for you, but they don't say how. Some have proposed programming languages that have parallel processing constructs added to the language. All of these are fine and helpful, but they don't fix an architectural problem, any more than sending the people who share a single inbox to a speed-reading class does. Speed-reading is great, but in the absence of a new desk architecture, it has very limited benefit.

There are very few articles about multi-core and system architecture, where the main benefits of system-wide performance are discussed. It turns out there is only one major design principle for creating a software system that scales up with multiple cores:
  1. Obtain all resources needed for a task.
  2. Perform the task.
  3. Quickly merge, or re-integrate the results of the task.
Step one is best if it is non-blocking. Next best is making it very granular, meaning that one task can have pieces of a resource while another has different pieces.

Step two may be time consuming, but in a correct architecture it does not block other tasks; it can execute in parallel. An important principle of step two is that the sum of the time taken across all cores can be much more than the time it would take one CPU, but the wall-clock time is much less because the tasks run at the same time. For example, if step two takes a single-CPU system 100ms per task, then twelve tasks take 1200ms. In a re-architected multi-core system where step two takes 180ms, 4 cores each performing the task 3 times (twelve tasks in all) finish in 540ms. That hasn't cut the time by 4, but the stage is set for true scaling when even more cores are added.

The third step is hard, especially for a database system, where ACID principles must be followed. Saving transactions safely to disk in a consistent and durable manner requires synchronization around shared resources. The key is to do as much work beforehand as possible, so that the re-integration is quick. This is also why step two takes longer: it does more work (in parallel) so that the serialized portion is short. It's a massively worthwhile trade-off.
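
The pattern, reduced to its bones with POSIX threads, might look like the sketch below. The helper names (obtain_resources, do_heavy_work, merge_result) are placeholders for whatever the real system does at each step.

    #include <pthread.h>
    #include <stddef.h>

    /* Only the short merge step is serialized; the heavy work runs in parallel. */
    static pthread_mutex_t merge_lock = PTHREAD_MUTEX_INITIALIZER;

    extern void *obtain_resources(void);   /* step 1: non-blocking, or fine-grained locks */
    extern void *do_heavy_work(void *in);  /* step 2: runs fully in parallel, no locks held */
    extern void  merge_result(void *out);  /* step 3: must be quick */

    void *worker(void *arg)
    {
        (void)arg;
        void *input  = obtain_resources();
        void *output = do_heavy_work(input);  /* the 180ms part in the example above */

        pthread_mutex_lock(&merge_lock);      /* serialize only the quick merge */
        merge_result(output);
        pthread_mutex_unlock(&merge_lock);
        return NULL;
    }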

The trick now is to take working software and re-architect it, not just optimize it, for multi-core scaling.

Raima is addressing this right now. The next release of RDM Embedded has gone through the transformation described above. Check back for details.

Friday, February 12, 2010

Mid-Course Correction, Again

In 1991, Raima sold a linkable C library called db_VISTA, which worked fine for small work-groups that needed to share a database, but the database industry in general was more interested in the client/server architecture.

So, we started with db_VISTA and created a database server which was named Velocis. We had two products, aimed at two types of applications. This was the first major mid-course correction. Technically, it involved a fundamental architectural change. db_VISTA performs disk I/O directly from each application program, keeping a local cache of database pages. Velocis performs disk I/O only from the server, and keeps only one centralized cache.

Fast-forward to 2010. db_VISTA is now called RDM Embedded, Velocis is now called RDM Server, and they are both actively used and supported. The company has gone through multiple ownership changes, but the products have remained viable for over 20 years. Raima is now a division of Birdstep Technology. The original developers (Randy Merilatt and me, Wayne Warren) are back after trying other things for a while, and other long-term developers have survived all the changes. The products have had several new features added, run on new platforms, and are much more reliable.

But the industry has changed, and another mid-course correction was necessary. To state some of the obvious changes:
  • Memory is massive and cheap,
  • Disk space is massive and cheap,
  • Networking is fast and pervasive, and
  • Multi-core and multiprocessor systems are common.
It turns out that a database system that was architected to optimize memory and disk and a single CPU does not scale up as one would expect it to. This is not good, but it is something that is being realized throughout the industry. Multi-core scaling requires an architecture that turns each core loose, rather than each core interfering with the others. I'll talk more about a multi-core-aware architecture in another post.

This second major mid-course correction started again from the smaller product, RDM Embedded, and will be released as the next version of RDM Embedded. It will work for the applications currently using RDM Embedded, but the underlying architecture has changed to take advantage of all of the above - use memory to obtain speed, use networking to achieve parallelism, and program the database access to allow multiple cores/processors to run in parallel.

It turns out that by doing all of the above, the product also became simpler to use and deploy, and more flexible in using multiple computers together. You just start using it, and it just starts working (any of you old-school RDM Embedded programmers will love this).

I'll get into more details as I have time. But I'm also updating the documentation, which is keeping me very busy.