Berkeley DB Performance Tuning

When it comes to performance, BerkeleyDB’s cache size is the single most important configuration parameter. To achieve maximum performance, the data your application accesses most frequently should fit in the cache, so that read and write requests trigger as little disk I/O as possible.

BerkeleyDB databases are grouped together in application environments: directories that hold data and log files along with common settings used by a particular application. The database cache is shared by all databases in an environment and must be allocated when that environment is created. Most of the time this is done programmatically. The problem is that the optimal cache size depends directly on the size of the data set, which is not always known at application design time. Many BerkeleyDB applications end up storing considerably more data than originally envisioned. Even worse, some applications do not explicitly specify a cache size at all, so the default cache size of 256KB is allocated, which is far too low for many applications. Such applications suffer from degraded performance as they accumulate more data; their performance can be significantly improved by increasing the cache size. Luckily, most of the time this can be achieved without any code changes by creating a configuration file in your BerkeleyDB application environment.

The db_stat tool can help you determine the size of the BerkeleyDB cache and the cache hit rate (the share of page requests satisfied from the cache rather than loaded from disk). Run the command in the directory containing the BerkeleyDB environment (or use the -h switch to specify a directory):

$ db_stat -m
 264KB 48B       Total cache size
 ...
 100004  Requested pages found in the cache (42%)
 ...

The output above tells us that the cache size is about 264KB and that 42% of the page requests were satisfied by the cache, i.e., without reading from the data files. This number pertains to the environment as a whole.
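The percentage can also be recomputed from the raw counters, which is handy when logging it over time. The following is a minimal sketch, assuming the db_stat -m output also contains a "Requested pages not found in the cache" line and that both counters are printed as plain integers. The function name cache_hit_rate and the not-found count in the sample are made up for illustration.

```shell
# Print the cache hit rate (in percent) from `db_stat -m` output read on stdin.
# Assumption: counters appear as plain integers in the first column.
cache_hit_rate() {
  awk '
    /Requested pages found in the cache/     { found = $1 }
    /Requested pages not found in the cache/ { missed = $1 }
    END {
      total = found + missed
      if (total > 0) printf "%.1f\n", found * 100 / total
      else print "n/a"
    }
  '
}

# Sample input: the first line is from the output above,
# the second line uses a hypothetical value.
cache_hit_rate <<'EOF'
 100004  Requested pages found in the cache (42%)
 138101  Requested pages not found in the cache
EOF
```

Against a live environment, the function would be fed directly from the tool, for example db_stat -m -h /path/to/env | cache_hit_rate, where the path is a placeholder.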

In order to determine how much cache we need, we will use the db_stat command again, this time with the -d option to specify a database file:

$ db_stat -d test0.db
 ...
 4096 Underlying database page size
 ...
 29230 Number of hash buckets
 15044 Number of bucket overflow pages
 0 Number of duplicate pages

The amount of the data in a BerkeleyDB file can be calculated as follows:

(Number_of_Hash_Buckets + Number_of_Bucket_Overflow_Pages + Number_of_Duplicate_Pages) * Page_Size

In this case this equals (29230 + 15044 + 0) * 4096 bytes, or about 173 MB.
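The arithmetic can be checked with plain shell integer arithmetic. This is only a sketch: the variable names are made up for illustration, and the values are copied by hand from the db_stat -d output shown earlier.

```shell
# Values copied from the db_stat -d output for test0.db.
page_size=4096
hash_buckets=29230
overflow_pages=15044
duplicate_pages=0

# Total pages holding data, then bytes, then whole megabytes (rounded down).
total_pages=$((hash_buckets + overflow_pages + duplicate_pages))
data_bytes=$((total_pages * page_size))
data_mb=$((data_bytes / 1024 / 1024))

echo "pages: $total_pages"   # pages: 44274
echo "bytes: $data_bytes"    # bytes: 181346304
echo "MB:    $data_mb"       # MB:    172
```

Integer division rounds down, which is why the script prints 172 while the exact figure is just under 173 MB.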

Note that the output above covers a single database file, while the cache is global for the entire environment. Different databases may also use different page sizes. If your environment contains more than one file, run the command for each file, calculate the size of each one using its own page size, and add up the results.
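For an environment with several hash database files, the per-file calculation can be scripted. The sketch below assumes the db_stat -d output has the same layout as the sample shown earlier; the function names db_data_bytes and env_data_bytes are hypothetical helpers, not part of BerkeleyDB.

```shell
# Estimate the data size in bytes from `db_stat -d` output read on stdin,
# using the formula above: (buckets + overflow + duplicate pages) * page size.
db_data_bytes() {
  awk '
    /Underlying database page size/   { page = $1 }
    /Number of hash buckets/          { pages += $1 }
    /Number of bucket overflow pages/ { pages += $1 }
    /Number of duplicate pages/       { pages += $1 }
    END { printf "%d\n", pages * page }
  '
}

# Sum the estimate over several database files (requires db_stat on PATH).
# Usage: env_data_bytes test0.db test1.db ...
env_data_bytes() {
  total=0
  for f in "$@"; do
    n=$(db_stat -d "$f" | db_data_bytes)
    total=$((total + n))
  done
  echo "$total"
}

# Demo with the sample output for test0.db; prints 181346304.
db_data_bytes <<'EOF'
 4096 Underlying database page size
 29230 Number of hash buckets
 15044 Number of bucket overflow pages
 0 Number of duplicate pages
EOF
```

Because each file is multiplied by its own page size before the totals are added, mixed page sizes within one environment are handled correctly.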

Ideally, we should aim for a cache hit rate of 100%, which would mean that the complete dataset is loaded in the cache. Depending on the size of your dataset and the amount of memory on your system, that may not be realistic. Increasing the cache size beyond a certain point is counterproductive: the OS will usually grant the requested amount of memory, but once the cache no longer fits in physical memory, the OS starts paging excessively, which defeats the purpose of having a cache. As a rule of thumb, you should plan on keeping only the most frequently requested data in the cache.

Once you have decided to increase the size of your BerkeleyDB cache, here is a step-by-step guide on how to achieve that:

Step 1. Stop all processes that use the BerkeleyDB environment you are about to modify. While this is application-specific, the processes should be shut down gracefully in order to avoid corrupting the databases.

Step 2. Use your favorite backup tool to back up the complete BerkeleyDB environment directory, including database files and log files.
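If you do not have a backup tool at hand, a tarball of the whole directory is often enough. The following is a minimal sketch using tar; the function name backup_env and the archive naming scheme are assumptions made for illustration.

```shell
# Archive a BerkeleyDB environment directory into a timestamped tarball.
# Usage: backup_env ENV_DIR BACKUP_DIR
backup_env() {
  env_dir=$1
  backup_dir=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$backup_dir/bdb-env-$stamp.tar.gz"

  mkdir -p "$backup_dir" || return 1
  # -C makes the paths in the archive relative to the environment's parent
  # directory, so database files and log files are stored together.
  tar -czf "$archive" -C "$(dirname "$env_dir")" "$(basename "$env_dir")" || return 1
  echo "$archive"
}
```

The function prints the path of the archive it created, so a call such as backup_env /path/to/env /path/to/backups (both paths are placeholders) can be followed by tar -tzf on that path to confirm that the data and log files are all there.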

Step 3. Create a file named DB_CONFIG in the BerkeleyDB environment directory with the following contents:

#Sample BerkeleyDB DB_CONFIG - set cache size to 200 MB
set_cachesize 0 209715200 1

The three parameters of the set_cachesize directive are as follows:

  • Cache size in gigabytes – in our case 0 since we want less than one GB.
  • Additional size in bytes (209715200 or 200 MB)
  • Number of chunks to split cache into (1)
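Splitting a target size into the gigabyte and byte parameters is easy to get wrong by hand, so it can help to let the shell do it. The helper below is a sketch; the name cachesize_line is hypothetical, and it assumes the target size is given in whole megabytes.

```shell
# Print a DB_CONFIG set_cachesize line for a cache of SIZE_MB megabytes.
# Usage: cachesize_line SIZE_MB [NCACHE]
cachesize_line() {
  size_mb=$1
  ncache=${2:-1}                               # number of chunks, default 1
  gbytes=$((size_mb / 1024))                   # whole gigabytes
  bytes=$(( (size_mb % 1024) * 1024 * 1024 ))  # remainder, in bytes
  echo "set_cachesize $gbytes $bytes $ncache"
}

cachesize_line 200    # set_cachesize 0 209715200 1
cachesize_line 1536   # set_cachesize 1 536870912 1
```

The first call reproduces the 200 MB line from the sample DB_CONFIG; the second shows how a 1.5 GB cache is split into one gigabyte plus 512 MB expressed in bytes.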

Keep in mind that if you request less than 500 MB, BerkeleyDB will automatically allocate 25% more for overhead.

Step 4. Run db_recover -e in the environment directory to increase the cache size; then verify the new settings using db_stat -m.

# db_recover -e
# db_stat -m
25MB 4KB 48B    Total cache size
...

Running db_recover will essentially re-create the BerkeleyDB environment. Keep in mind that, as a result, the environment statistics reported by db_stat (such as the cache hit rate) will be re-initialized, so you will need to run your application for a while for BerkeleyDB to collect new statistical data.

Step 5. Start your application and continue monitoring its performance and the BerkeleyDB cache hit rate.

Now that you have seen how to optimize your BerkeleyDB application for maximum performance, in the next installment we will show you how Monitis can help you monitor your environment continuously and identify bottlenecks before they cause significant performance degradation.
