Tuesday, February 23, 2010

RDM Embedded's Transactional File Server

There are many programmers who, over the past 20 years, have implemented a multi-user database system using db_VISTA, Raima Database Manager or RDM Embedded (depending on when you got on board). Congratulations! That was not easy.

The next release of RDM Embedded has been re-architected for efficient, scalable multi-core operation, but it has also become much simpler to use. It turns out that application developers don't want or need to configure every little option in order to use a database; they would just like to use it. That may seem like a subtle point, but it has had a dramatic effect on the usability of RDM Embedded for the programmer.

Both usability and multi-core awareness are facilitated through a new component of RDM Embedded called the Transactional File Server (or TFS, as I will refer to it). The TFS owns databases, controlling how they are read and updated. In the old architecture, the runtime library did all of this from within each application that was using an Embedded database. The TFS centralizes the critical functionality of database reading and updating. It is the central piece in satisfying the major multi-core design principle I discussed in the last blog entry:
  • Obtain all resources needed for a task (the TFS requires database pages to be locked prior to reading, then it provides the pages to the requesting runtime library).
  • Perform the task. This is the job of the runtime library. After obtaining existing database pages, or creating new ones, the runtime library submits the completed transaction to the TFS to be committed.
  • Quickly merge, or re-integrate, the results of the task. Since the runtime library has encapsulated the changes of a transaction into a single unit (actually, a log file), the TFS has a minimal amount of work to do to commit them. The "heavy lifting" was done by the process linked to the runtime library; the TFS has little to do but flip the "commit" switch.
So the way it all fits together is like this. The runtime library is linked into application programs. As always, the runtime library reads database pages, makes changes to them, creates new pages, then writes out all pages, modified or new, to the database. Each application in the system has its own copy of the runtime library maintaining its own local cache and submitting the new or changed pages to the database. Only now, both the reading and the writing go through the TFS. In all of this, the real work is done by the runtime library in its local cache. The TFS, serving pages and committing transactions for multiple runtimes, is left with relatively little to do.
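
To make that division of labor concrete, here is a minimal C sketch of one transaction's life cycle. The type and function names (tfs_read_page, tfs_commit, and so on) are hypothetical stand-ins, not the actual RDM Embedded API, and the stubs exist only so the sketch compiles and runs:

    #include <stdio.h>

    typedef struct { int unused; } TFS_CONN;  /* connection to a TFS   */
    typedef struct { int unused; } PAGE;      /* a database page       */
    typedef struct { int unused; } TXN_LOG;   /* local transaction log */

    /* Stubs standing in for network I/O to the TFS. */
    static PAGE *tfs_read_page(TFS_CONN *c, int pageno)
    {
        (void)c; (void)pageno;
        static PAGE page;                 /* served by the TFS in reality */
        return &page;
    }
    static void local_modify(PAGE *p)             { (void)p; }
    static void log_append(TXN_LOG *log, PAGE *p) { (void)log; (void)p; }
    static int  tfs_commit(TFS_CONN *c, TXN_LOG *log)
    {
        (void)c; (void)log;
        return 0;                         /* "flip the commit switch" */
    }

    int main(void)
    {
        TFS_CONN conn = {0};   /* connected to the TFS over TCP/IP */
        TXN_LOG  log  = {0};   /* the runtime's local log          */

        /* 1. Obtain resources: the TFS locks the pages, then serves them. */
        PAGE *p = tfs_read_page(&conn, 42);

        /* 2. Perform the task in the runtime's local cache; no other
              task is blocked while this runs.                          */
        local_modify(p);
        log_append(&log, p);

        /* 3. Re-integrate: hand the finished log to the TFS, which has
              little left to do but commit it.                          */
        if (tfs_commit(&conn, &log) == 0)
            printf("transaction committed\n");
        return 0;
    }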


This allows multiple cores to work simultaneously without blocking each other. Even within the TFS, the threads that service the runtimes are mostly independent of one another.

Now for the usability claim. The TFS is the runtime's only "connection" with the outside world - if the runtime can reach the TFS, it can do its job. Connection is established through the most pervasive mechanism ever known - TCP/IP. If the TFS is visible, the runtime can connect to it. There are no more shared disk drives or TRUENAME for identifying files, no lock manager, no DBDPATH, DBFPATH, DBTAF, ... (ugh, what were we thinking at the time?).

So, a TFS running on a computer does not need to share any files or drives in order for other runtimes to use the databases within the TFS domain. Runtimes may be on the same computer as the TFS, on another computer on its LAN, or somewhere else on the Internet. The only difference will be performance, not whether it works.
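
To illustrate what that buys you, under a hypothetical naming scheme (the host names, port number, and database names below are invented for this example, not the real API or defaults), identifying a database reduces to naming the TFS that owns it:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical, for illustration only: a database is identified
           by the TFS that owns it, not by a shared drive or an
           environment variable. */
        const char *same_machine = "localhost/inventory";
        const char *on_the_lan   = "db-server.example.com/orders";
        const char *on_internet  = "203.0.113.10:21553/telemetry";

        printf("%s\n%s\n%s\n", same_machine, on_the_lan, on_internet);
        return 0;
    }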

This brings up an unintended benefit of this architecture - not only can multiple cores on one computer allow the TFS and runtimes to scale up together, but other computers can be added to the mix. Multiple cores plus multiple computers all working together in parallel. Not only that, but one runtime can access databases from multiple TFS's. Not only that, but one runtime can view databases scattered throughout the world in different TFS's as though they are one unified database.

The separation of database semantics and operations (the runtime) from the safe, transactional updating of files (the TFS) lets the pieces fit where they need to fit, and be combined in any number of ways, depending on the need.

Usability now means more than "easier to use"; it means usable in ways never considered before.

Wednesday, February 17, 2010

Multi-core Aware Software Architecture

Almost every major software system in use today was initially created prior to the advent of multi-core computers. Multi-processor computers have existed for a while, but not for the common computer, and very few software systems were adapted to take full advantage of them.

So what's the problem? Why can't you throw more hardware at it and expect it to run faster? The issue is almost always characterized as "shared resources." A shared resource is anything that one task, or thread, must use without worrying about another task changing it while it is being used. Concepts of synchronization, locking, mutual exclusion and "critical sections" were developed to deal with the issue of safe sharing of common resources.

The traditional mechanisms for creating safe sharing among tasks are called semaphores, queues, or events. Used correctly, these programming primitives allow everything from the incrementing of a shared number to a complete database transaction to be done without the corruption that would occur if tasks were not properly sequenced, or "serialized."
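
The smallest case mentioned above, incrementing a shared number, looks like this with a POSIX mutex (a minimal sketch; build with something like cc -std=c99 file.c -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    static long            counter = 0;                        /* shared resource  */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects counter */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* serialize access...         */
            counter++;                    /* ...so the increment is safe */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* always 200000 */
        return 0;
    }

Without the lock, the two threads' read-modify-write cycles would interleave and updates would be lost; that is the corruption the primitives exist to prevent.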

In the single-CPU age, multi-processing or multi-threading existed to create the illusion of multiple tasks happening concurrently. When synchronization primitives prevented one task from making progress, no problem, because the single CPU would switch to another task that was not blocked. Software systems with shared resources were not degraded too much from synchronization if there was always something for the CPU to do.

Why can't multiple cores just make this go 2 or 4 times faster? I'll try to use an analogy to explain. Imagine a person at a desk with an inbox, a computer, and an outbox. The job is to take the next item from the inbox and read it, enter information from the item into the computer, get a response from the computer, write it down on a form, prepare it for delivery and put it in the outbox. This cycle takes on average 5 minutes per item, and the boss needs an output of 30 per hour, not the current 12. To solve this problem, the boss puts another person (core) at the same desk and allows the computer to swivel. Getting started, both people reach into the inbox, grab the same item and tear it in two. Starting over, they agree to notify each other before grabbing. Now, they each read their next items and are ready for data entry. Unfortunately, only one can do this at a time, so the other waits, then turns the monitor around and starts data entry. In the meantime, the first person prepares the response for the outbox, grabs the next item, and again waits a short time for the computer monitor.

The good news is that output went from 12 items per hour to 18. The bad news is that the boss still requires 30. Another person is added to the desk and production goes from 18 to 22 items per hour. The next person takes it to 24. One more person drops it back to 22, as this person, equally skilled at processing items, is now actually interfering with the work already occurring.

So it goes, in a very real way, with software systems that have more cores thrown at them. The first few help, a little, but at some point performance actually degrades.
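
To put a rough number on the diminishing returns, Amdahl's law says that if a fraction p of the work can run in parallel, the best possible speedup with n workers is 1 / ((1 - p) + p/n). Reading the analogy as if a quarter of each item's 5 minutes were the serialized computer step (my assumption, so p = 0.75), two people could manage at most 1 / (0.25 + 0.375) = 1.6x, about 19 items per hour - close to the 18 observed. Note that the formula alone never predicts a decline; the drop from 24 back to 22 comes from coordination overhead (the grabbing protocol, the swiveling monitor) that grows with every person added, and that overhead is exactly what a new architecture must remove.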

Instead, if another desk, inbox, computer and outbox were provided for the second person, output would nearly double. Nearly, because some additional time would be required to fill two inboxes and empty two outboxes rather than just one, but clearly a good trade-off. The solution is more expensive, but pays for itself quickly.

Changing a software system to split up the work is not so easy, especially software that is well established and mature. It's not just a change in algorithms, it's a change in architecture.

If you do a web search for multi-core optimization, you will find articles about shared L2 cache or matrix processing. Many OS and chip vendors say that they have done the optimization for you, but they don't say how. Some have proposed programming languages with parallel processing constructs added. All of these are fine and helpful, but they don't fix an architectural problem, any more than sending the people at our overcrowded desk to a speed-reading class would fix theirs. Speed-reading is great, but in the absence of a new desk architecture, it has very limited benefit.

There are very few articles about multi-core and system architecture, where the main benefits of system-wide performance are discussed. It turns out there is only one major design principle for creating a software system that scales up with multiple cores:
  1. Obtain all resources needed for a task.
  2. Perform the task.
  3. Quickly merge, or re-integrate the results of the task.
Step one is best if it is non-blocking. Next best is making it very granular, meaning that one task can have pieces of a resource while another has different pieces.

Step two may be time consuming, but in a correct architecture, it does not block other tasks; it can execute in parallel. An important principle of step two is that the sum of the time taken by all cores can be much more than the time it would take one CPU, while the wall-clock time is much less because the tasks run at the same time. For example, if step two takes a single-CPU system 100ms, then twelve tasks would take 1200ms. Suppose that in a re-architected multi-core system step two takes 180ms. With 4 parallel cores each performing the task 3 times, the twelve tasks are done in 540ms of wall-clock time. That hasn't cut the time by 4, but the stage is set for true scaling when even more cores are added.

The third step is hard, especially for a database system, where ACID principles must be followed. Saving transactions safely to disk in a consistent and durable manner requires synchronization around shared resources. The key here is to do as much work beforehand as possible, so that the re-integration is quick. This is also why step two takes longer: it does more work (in parallel) so that the serialized portion is short. It's a massively worthwhile trade-off.
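
Here is a minimal C sketch of the three-step pattern itself (hypothetical code, not RDM internals): each thread obtains only private resources, does its expensive work without touching shared state, and holds the merge lock just long enough to re-integrate its result:

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS 12

    static long            results[NTASKS];
    static int             nresults = 0;
    static pthread_mutex_t merge_lock = PTHREAD_MUTEX_INITIALIZER;

    static long expensive_work(long id)        /* step 2: no shared state */
    {
        long sum = 0;
        for (long i = 0; i < 10000000; i++)    /* stand-in for the real task */
            sum += (id * 31 + i) % 97;
        return sum;
    }

    static void *task(void *arg)
    {
        long id = (long)arg;                   /* step 1: private resources only */
        long r  = expensive_work(id);          /* step 2: runs fully in parallel */

        pthread_mutex_lock(&merge_lock);       /* step 3: short serialized merge */
        results[nresults++] = r;
        pthread_mutex_unlock(&merge_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTASKS];
        for (long i = 0; i < NTASKS; i++)
            pthread_create(&t[i], NULL, task, (void *)i);
        for (int i = 0; i < NTASKS; i++)
            pthread_join(t[i], NULL);
        printf("merged %d results\n", nresults);
        return 0;
    }

The design choice to notice: nearly all of the CPU time is spent in expensive_work, outside any lock, so adding cores adds throughput; only the brief append to the results array is serialized.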

The trick now is to take working software and re-architect it, not just optimize it, for multi-core scaling.

Raima is addressing this right now. The next release of RDM Embedded has gone through the transformation described above. Check back for details.

Friday, February 12, 2010

Mid-Course Correction, Again

In 1991, Raima sold a linkable C library called db_VISTA, which worked fine for small work-groups that needed to share a database, but the database industry in general was more interested in the client/server architecture.

So, we started with db_VISTA and created a database server which was named Velocis. We had two products, aimed at two types of applications. This was the first major mid-course correction. Technically, it involved a fundamental architectural change. db_VISTA performs disk I/O directly from each application program, keeping a local cache of database pages. Velocis performs disk I/O only from the server, and keeps only one centralized cache.

Fast-forward to 2010. db_VISTA is now called RDM Embedded, Velocis is now called RDM Server, and they are both actively used and supported. The company has gone through multiple ownership changes, but the products have remained viable for over 20 years. Raima is now a division of Birdstep Technology. The original developers (Randy Merilatt and me, Wayne Warren) are back after trying other things for a while, and other long-term developers have survived all the changes. The products have had several new features added, run on new platforms, and are much more reliable.

But the industry has changed, and another mid-course correction was necessary. To state some of the obvious changes:
  • Memory is massive and cheap,
  • Disk space is massive and cheap,
  • Networking is fast and pervasive, and
  • Multi-core and multiprocessor systems are common.
It turns out that a database system architected to optimize scarce memory, scarce disk space, and a single CPU does not scale up as one would expect. This is not good, but it is something being realized throughout the industry. Multi-core scaling requires an architecture that turns each core loose, rather than letting the cores interfere with one another. I'll talk more about a multi-core-aware architecture in another post.

This second major mid-course correction started again from the smaller product, RDM Embedded, and will be released as the next version of RDM Embedded. It will work for the applications currently using RDM Embedded, but the underlying architecture has changed to take advantage of all of the above - use memory to obtain speed, use networking to achieve parallelism, and program the database access to allow multiple cores/processors to run in parallel.

It turns out that by doing all of the above, the product also became simpler to use and deploy, and more flexible in using multiple computers together. You just start using it, and it just starts working (any of you old-school RDM Embedded programmers will love this).

I'll get into more details as I have time. But I'm also updating the documentation, which is keeping me very busy.