----------------------------------------------------------------------------
                          The Florida SunFlash

       The Sun Database Excelerator (SunDBE) Technical Description

SunFLASH Vol 29 #10                                                 May 1991
----------------------------------------------------------------------------

The Sun Database Excelerator (SunDBE) is an unbundled software product that
is an alternative kernel to SunOS for Sun customers who have database
applications. It increases both the performance and the number of concurrent
users of these systems. SunDBE is the kernel that was used to obtain the
excellent TPC-B benchmark numbers with Sybase and Informix, and it greatly
increases performance for Ingres and Oracle as well.

SunDBE 1.0 for use only with SunOS 4.1 was released in the fall of 1990, and
was described in the August 1990 STB. The newest release is SunDBE 1.1 for
use only with SunOS 4.1.1 and SunOS 4.1.1 Rev B, and it is available now. In
this article we use SunDBE to refer to SunDBE 1.1.

SunDBE 1.1 includes all features from SunDBE 1.0: support for very large
shared memory partitions and large numbers of concurrent users with less
overhead than SunOS 4.1; faster asynchronous I/O for raw disk devices where
individual disk operations are less than 63 kilobytes; and 1024 file
descriptors per process, rather than 256 (the increase in file descriptors
is available only to a few relational database management systems).

SunDBE 1.1 adds new features such as miscellaneous kernel code optimizations
to the system call dispatcher, the read system call, and the semaphore
system call to increase DBMS performance. In addition, certain DBMS
processes can share memory management resources; a high-resolution timer
facility has been added; and a Unix File System (UFS) change has been added
to increase performance for updates to UFS files for one relational DBMS.

SunDBE 1.1: Optimized for SPARCservers and Requires SunOS 4.1.1

SunDBE was designed specifically to optimize the performance of Oracle,
Sybase, Ingres, and INFORMIX-OnLine DBMS products on Sun SPARCservers.
However, some of the performance enhancements can also be used by a wide
variety of other DBMS and non-DBMS applications, especially those with large
virtual memory requirements.

SunDBE provides full SunOS kernel functionality. The SunDBE release consists
of a pre-built kernel and the object files necessary to create custom
kernels.

SunDBE is intended to be used only on the following Sun-4, server-class
machines with at least two disks and 16 MBytes of memory:

    SPARCserver 1
    SPARCserver 1+
    SPARCserver 2
    SPARCserver 330
    SPARCserver 370
    SPARCserver 390
    SPARCserver 470
    SPARCserver 490
    Sun-4/260
    Sun-4/280

SunDBE 1.1 will work with the SPARCengine 1, SPARCengine 1+, SPARCengine 2,
and SPARCengine 3XX board products, with no changes to SunOS, since they are
the CPU boards that go into the SPARCserver 1, 1+, 2, and 3XX, respectively.
The only qualification is that the SunOS that is running must be SunOS 4.1.1
or SunOS 4.1.1 Rev B. SunDBE will not work with the SPARCengine 1E boards
because they require a special version of SunOS.

Please note that SunDBE is not a substitute for an entire SunOS
installation. SunOS 4.1.1 must be installed before SunDBE is installed.
SunDBE is supported only on SunOS 4.1.1 and does not work with earlier
versions of SunOS (SunDBE 1.0 works with SunOS 4.1).

Virtual Memory Enhancements

The virtual memory changes in SunDBE are intended to enhance the performance
of the MMU (Memory Management Unit) in situations where there are many
competing processes, possibly with large shared memory caches and large text
segments. The changes can also help any time there are large virtual memory
requirements, regardless of the number of competing processes.

Determining when SunDBE will improve overall system performance requires an
introduction to how virtual-to-physical address translation is handled. The
specific SunDBE changes are discussed following this introduction.

The translation is handled in hardware, and the basic unit used in
translation is a PMEG, or page map entry group. Page map entries (PMEs) are
used to translate virtual addresses to physical addresses in all existing
Sun SPARC machines. Each PME translates one page. PMEGs are a way of
grouping PMEs to allow for more efficient handling of sparse address spaces.

To make the translation efficient, PMEGs are stored in fast memory. Fast
memory is expensive, however, so the MMU cannot hold translations for all of
virtual memory. Each PMEG holds 32 PMEs (64 on the SPARCserver 1, 1+, and
2); thus each PMEG maps 256KB (32*8KB or 64*4KB). The numbers of PMEGs for
each type of server are:

    Server               PMEGs
    SPARCserver 1,1+       128
    SPARCserver 2          256
    SPARCserver 3x0        256
    SPARCserver 4x0       1024
    Sun-4/260,280          512

As the aggregate amount of virtual memory used by all processes increases,
the number of PMEGs used also increases. Eventually, when virtual memory
requirements become too large, the system "runs out" of PMEGs. For example,
on a SPARCserver 1, this would be at 32 megabytes (128 * 256K), and on
SPARCserver 4X0 machines, this would be at 256 megabytes (1024 * 256K).

Running out of PMEGs is rather like running out of physical memory. It
requires the kernel to "steal" a PMEG from a (hopefully) inactive process
and give it to the process that needs it. Note that if a process is large
enough, it could steal from itself. Stealing takes time. Also, the kernel
does not have perfect knowledge of all applications, so its choice of which
PMEG to steal is not always optimal. This can lead to "PMEG thrashing," just
as insufficient physical memory can lead to page thrashing.
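
The arithmetic above can be checked with a few lines of code. The following
Python is only an illustrative sketch: the names PMEG_COUNTS, pmeg_span, and
mmu_reach are invented for this example, and the figures are the ones quoted
in this article.

```python
# Illustrative sketch only; all names here are hypothetical.
# The PMEG counts and page sizes are the figures quoted in the article.
KB = 1024
MB = 1024 * KB

PMEG_COUNTS = {
    "SPARCserver 1,1+": 128,
    "SPARCserver 2": 256,
    "SPARCserver 3x0": 256,
    "SPARCserver 4x0": 1024,
    "Sun-4/260,280": 512,
}


def pmeg_span(pmes_per_pmeg, page_size):
    """Bytes of virtual memory mapped by one PMEG."""
    return pmes_per_pmeg * page_size


def mmu_reach(pmeg_count, span=256 * KB):
    """Bytes of virtual memory that can be mapped before PMEGs run out."""
    return pmeg_count * span


# Both PMEG layouts described in the article map the same 256 KB.
print(pmeg_span(32, 8 * KB) // KB, pmeg_span(64, 4 * KB) // KB)

for machine, count in PMEG_COUNTS.items():
    print(f"{machine:18s} {count:5d} PMEGs -> {mmu_reach(count) // MB:4d} MB")
```

The loop prints, for each server type, the aggregate virtual memory that the
hardware can map at once; beyond that point the kernel has to start stealing
PMEGs.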

SunDBE enhances virtual memory performance by reducing the amount of virtual
address translation resources required for a given application mix. Software
page tables reduce the amount of PMEG thrashing, and shared page tables
reduce the number of PMEGs required.

Software Page Tables

SunDBE enhances performance by maintaining software page tables that can be
thought of as PMEG caches. PMEGs are cached in these software page tables
when a steal is necessary, and restored when needed again (assuming the
translations are still valid). Thus, SunDBE improves performance by
efficiently caching translations, thereby reducing or eliminating costly
translations as a result of PMEG steals.

The software cache consists of software page tables, each of which can store
one PMEG. Software page tables are a standard part of SunOS 4.1.1. However,
the number of software page tables configured in each generic SunOS 4.1.1
kernel is less than that of SunDBE. This means that for application mixes
that use large amounts of virtual memory, SunDBE may provide significantly
better performance than a generic SunOS 4.1.1 kernel. The default numbers of
page tables for SunDBE were determined to run all DBMS packages without
thrashing on the VM system.

Software page tables are not without cost: The physical memory used for
software page tables is locked down at boot time and cannot be reclaimed for
other uses.

This change may boost throughput, and should allow more users with
reasonable performance.

Shared Page Tables (Intimate Shared Memory)

Unlike generic SunOS 4.1.1, SunDBE actually allows different processes to
share the same PMEGs. This feature is referred to as shared page tables, or
Intimate Shared Memory (ISM). This decreases the number of PMEGs necessary
to hold translations for all processes. As a result, performance is enhanced
because PMEG thrashing can be avoided entirely in many cases.
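
A rough feel for the saving from shared page tables comes from counting the
PMEGs needed to map one shared memory segment. The Python below is a
simplified sketch, not a description of the kernel: the function names are
invented, the 64-process, 32-megabyte workload is a made-up example, and the
model ignores the PMEGs each process also needs for text, stack, and private
data.

```python
# Simplified model; names and the example workload are hypothetical.
KB = 1024
MB = 1024 * KB
PMEG_SPAN = 256 * KB  # virtual memory mapped by one PMEG, per the article


def pmegs_for_segment(segment_bytes):
    """PMEGs needed to map one segment, rounded up to whole PMEGs."""
    return (segment_bytes + PMEG_SPAN - 1) // PMEG_SPAN


def pmeg_demand(processes, segment_bytes, shared):
    """Total PMEGs needed for all processes to map the same segment.

    Without sharing, every process needs its own set of PMEGs for the
    segment; with shared page tables, one set serves all of them.
    """
    per_process = pmegs_for_segment(segment_bytes)
    return per_process if shared else per_process * processes


private = pmeg_demand(64, 32 * MB, shared=False)
shared = pmeg_demand(64, 32 * MB, shared=True)
print(f"private PMEGs: {private}, shared PMEGs: {shared}")
```

In this made-up case the unshared demand is several times the 1024 PMEGs of
a SPARCserver 4x0, while the shared demand fits comfortably.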

This feature is available only on the SPARCserver 470 and SPARCserver 490,
and only to processes attaching to the same shared memory address. Shared
page tables are intended to be used only by INFORMIX-OnLine and Oracle RDBMS
(other RDBMS with different architectures do not require this feature for
best performance).

When software page tables are not enough to prevent PMEG thrashing (or when
PMEG thrashing would normally occur on a system on which a generic SunOS
4.1.1 kernel is installed), the SunDBE shared page tables feature can be
enabled. The ISM feature is turned off by default; it must be specifically
enabled, usually by a database or system administrator.

This feature can boost performance up to 50% with flatter throughput as more
users are added.

Asynchronous I/O Enhancements

The asynchronous I/O changes in SunDBE are intended to reduce the CPU
overhead of asynchronous disk I/O operations done on raw partitions where
each I/O buffer size is less than or equal to 63 kilobytes. The "freed" CPU
cycles can then be used for other "real" work done by the DBMS to provide
increased performance and support a greater number of users.

SunDBE asynchronous I/O takes a short path through the kernel code and calls
the disk drivers directly to perform each raw disk I/O. Generic asynchronous
I/O in SunOS 4.1.1 requires more CPU cycles per I/O because it has (1) a
longer code path through the file system layer of the kernel and (2) the
added overhead of kernel threads management.

No special action is required by a process to enable the faster asynchronous
I/O. The SunDBE kernel automatically uses the new feature when the two
preconditions are met: the I/O is to a raw partition, and the buffer size is
no more than 63 kilobytes. This feature is available to all types of
applications, not just database management systems. It can result in a
throughput increase of up to 30%, although 5% - 10% is typical.
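
The two preconditions for the faster asynchronous I/O path can be expressed
as a simple check. The Python below is a sketch under stated assumptions: it
treats a "raw partition" as a character special device, the function names
are invented for this example, and it does not reproduce the test the SunDBE
kernel actually performs. The second function shows one way an application
might split a large transfer so that every piece stays within the limit.

```python
# Sketch only; names are hypothetical and the kernel's real test is not shown.
import stat

KB = 1024
FAST_PATH_MAX = 63 * KB  # size limit quoted in the article


def qualifies_for_fast_path(st_mode, nbytes):
    """True if a request meets both preconditions described above.

    Assumption: a raw partition appears as a character special device.
    """
    return stat.S_ISCHR(st_mode) and 0 < nbytes <= FAST_PATH_MAX


def split_transfer(total_bytes, chunk=FAST_PATH_MAX):
    """Split one large transfer into pieces no larger than the limit."""
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk, remaining)
        sizes.append(n)
        remaining -= n
    return sizes


raw_mode = stat.S_IFCHR | 0o600   # synthetic mode bits for a raw device
file_mode = stat.S_IFREG | 0o644  # synthetic mode bits for a regular file

print(qualifies_for_fast_path(raw_mode, 8 * KB))    # raw and small
print(qualifies_for_fast_path(raw_mode, 64 * KB))   # raw but too large
print(qualifies_for_fast_path(file_mode, 8 * KB))   # not a raw device
print([n // KB for n in split_transfer(200 * KB)])
```

In practice the mode bits would come from a stat call on the device path
rather than from synthetic values.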

File Descriptor Enhancements

Multi-threaded server DBMS products consume large quantities of file
descriptors for managing user connections, disk partitions, and network
connections. SunOS 4.1.1 supports up to 256 open file descriptors (fd's) per
process. This constraint places a hard limit on the number of concurrent
users on a system using a multi-threaded DBMS server.

SunDBE raises the limit on file descriptors per process to 1024. Currently
this feature is available only to three RDBMS (relational database
management systems) that know how to use it. All other products running on a
SunDBE kernel will see the normal SunOS 4.1.1 limit on file descriptors per
process.

UFS Enhancement

This is the same as one of the enhancements shipping with Sun QuickCheck. It
increases Ingres performance by avoiding unnecessary writes of the database
inode during updating. This feature is only available to Ingres 6.3, and it
is transparent to Ingres application programs and system administrators.
This feature can increase throughput up to 32% for update transactions.

High Resolution Timer

SunDBE 1.1 allows user programs to access a low-overhead, high-resolution
time function which is not in SunOS 4.1.1.

All Sun4 and Sun4c machines with the exception of 4/1xx and 4/2xx have
Mostek chips that have two counters that increment each microsecond. One of
the counters is used to generate level 10 interrupts each 10 milliseconds to
give the kernel a notion of a "tick". The other counter is normally disabled
and used by profiling kernels to interrupt (at level 14) each 10
milliseconds.

In SunOS 4.1.1, both the kernel and user programs have access to time using
the datatype struct timeval:

    struct timeval {
        int tv_sec;     /* seconds */
        int tv_usec;    /* microseconds */
    };

Though tv_usec is a count of microseconds, its precision is only 10
milliseconds. The new function hrestime returns the value of the current
time.
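
The practical effect of a 10 millisecond precision can be seen by rounding
timestamps down to the last clock tick. The Python below is only a sketch:
the function names and the two sample timestamps are invented, and it models
the tick-limited clock described above rather than calling hrestime, whose
interface is not given here.

```python
# Sketch only; names and sample timestamps are hypothetical.
TICK_USEC = 10_000  # one clock tick is 10 milliseconds, per the article


def tick_limited(tv_sec, tv_usec):
    """Round a (seconds, microseconds) pair down to the last 10 ms tick."""
    return tv_sec, (tv_usec // TICK_USEC) * TICK_USEC


def elapsed_usec(start, end):
    """Microseconds between two (seconds, microseconds) pairs."""
    return (end[0] - start[0]) * 1_000_000 + (end[1] - start[1])


# Two events a little over 3 ms apart, inside the same 10 ms tick.
t0 = (100, 123_456)
t1 = (100, 126_789)

print(elapsed_usec(t0, t1))                                # true interval
print(elapsed_usec(tick_limited(*t0), tick_limited(*t1)))  # as seen via ticks
```

With microsecond precision the interval is measured as 3333 microseconds;
with tick precision both events carry the same timestamp and the interval
appears to be zero.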
The hrestime function has a precision of 1 microsecond on all supported
machine types (see Table 2-10 for a list of supported machine types) except
the 4/2X0, where the precision is the same as SunOS 4.1.1.

Other Enhancements

SunDBE also includes miscellaneous kernel code optimizations to the system
call dispatcher, the read system call, and the semaphore system call. These
improvements are transparent to all applications. They can boost throughput
up to 10%.

SunDBE Media and Availability

SunDBE is available on CD-ROM only, except to customers with SunDBE 1.0
support contracts who want 1/4 inch tape media. SunDBE 1.0 customers with
SunDBE support contracts will automatically receive SunDBE 1.1. Everyone
else needs to purchase SunDBE 1.1.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
For information send mail to info-sunflash@sunvice.East.Sun.COM.
Subscription requests should be sent to sunflash-request@sunvice.East.Sun.COM.
Archives are on solar.nova.edu and paris.cs.miami.edu.

All prices, availability, and other statements relating to Sun or third
party products are valid in the U.S. only. Please contact your local Sales
Representative for details of pricing and product availability in your
region. Descriptions of, or references to, products or publications within
SunFlash do not imply an endorsement of that product or publication by Sun
Microsystems.

John McLaughlin, SunFlash editor, flash@sunvice.East.Sun.COM. (305) 776-7770.