System-global counter that can be updated programmatically (on various Linux versions)? - c

I need some fast method to update key/value type data or alternatively an arbitrary amount of "counters" system-wide on Linux. The systems in question are Ubuntu 10.04, RHEL 4.8 and RHEL 5.x.
Now, I am aware of memcached, but it seems to be more suited for long-running processes, such as FastCGI processes. Unfortunately my process is a traditional CGI and therefore has to use some persistent storage outside of the process itself.
What options do I have and which are easiest and cheapest (w.r.t. runtime) to access from C/C++?
Note: this is not to measure speed (i.e. performance counters) but to count the number of times a certain type of event happens. And in order to count reliably, I need to be able to atomically increment the counters at will ...

You could use a simple DBM-like database, for example GDBM.
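A minimal sketch of an incrementable counter on top of GDBM, assuming the library is available on all three distributions (it usually is); the database path, counter name and retry policy are arbitrary choices of mine, not anything GDBM prescribes:

```c
/* Sketch: per-name counter in a GDBM file (link with -lgdbm).
 * GDBM allows only one writer at a time, so a busy CGI may have
 * to retry the open; the path below is just an example. */
#include <gdbm.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

long increment_counter(const char *name)
{
    GDBM_FILE db = NULL;
    int tries;

    /* Retry briefly if another process currently holds the write lock. */
    for (tries = 0; tries < 100 && db == NULL; tries++) {
        db = gdbm_open("/var/tmp/counters.gdbm", 0, GDBM_WRCREAT, 0644, NULL);
        if (db == NULL)
            usleep(10000);
    }
    if (db == NULL)
        return -1;

    datum key = { (char *)name, (int)strlen(name) };
    datum old = gdbm_fetch(db, key);          /* returns malloc'd memory */

    long count = (old.dptr != NULL) ? atol(old.dptr) : 0;
    free(old.dptr);
    count++;

    char buf[32];
    datum val;
    val.dptr  = buf;
    val.dsize = snprintf(buf, sizeof buf, "%ld", count) + 1;
    gdbm_store(db, key, val, GDBM_REPLACE);

    gdbm_close(db);
    return count;
}
```

Since the whole read-increment-write cycle happens while this process is the sole writer, the counter stays consistent across concurrent CGI invocations.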

You could try a small SQLite database. SQLite is fast and reliable, any application can modify it, and transactions prevent collisions. Just add a record to a table for each event, or use a single table with an [event] column. Inserting is really fast; what's slow is searching, but you'll only search when analysing the data, hopefully after performance has stopped being a factor.
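For illustration, a hedged sketch of that approach with the SQLite C API; the database path, table layout and timestamp column are my own choices, not something SQLite prescribes:

```c
/* Sketch: logging one row per event with the SQLite C API
 * (link with -lsqlite3). Table/column names are examples only. */
#include <sqlite3.h>
#include <stdio.h>

int log_event(const char *event_name)
{
    sqlite3 *db;

    if (sqlite3_open("/var/tmp/events.db", &db) != SQLITE_OK)
        return -1;

    /* Wait up to 2 s if another CGI process holds the write lock. */
    sqlite3_busy_timeout(db, 2000);

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS events (event TEXT, ts INTEGER)",
        NULL, NULL, NULL);

    /* A prepared statement avoids any string escaping of the payload. */
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db,
            "INSERT INTO events (event, ts) VALUES (?, strftime('%s','now'))",
            -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, event_name, -1, SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }

    sqlite3_close(db);
    return 0;
}
```

Counting later is then a single `SELECT event, COUNT(*) FROM events GROUP BY event` during analysis.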

Nowadays, in order to update key/value type data, developers often use NoSQL databases. They run mainly on Linux systems, and some of them are written in C++ (MongoDB and ClusterPoint). They are really fast for this kind of thing, they try hard to keep latency low, and they should be easy to access from C++ since they are themselves written in C++.

Related

Mongodb interface for high speed updates

Are there any examples with source code for high-speed (at least 10,000 reads/writes of a record per second) MongoDB reads/updates of a single record at a time?
Alternatively, where could I look in the MongoDB server code for a way to, say, inject a customised put/get of a record, for example with the WiredTiger storage engine?
For example, if the Mongo C interface is similar to Oracle's SQL*Net client, I'd need something similar to the sqlldr bulk insert/update tool.
Thank you for any hint where to start from.
Raw performance is highly dependent on hardware. If the only requirement is "10,000 reads/writes of one document per second", then all of the following would help:
Having an empty database.
Storing the database in memory (tmpfs).
Using Linux rather than Windows.
Using a CPU with high clock speed, at the expense of core count.
Using the fastest memory available.
You might notice that some of these conditions are incompatible with many practical systems. For example, if you truly have an empty or nearly-empty database that fits in memory, you could probably use something like Redis instead, which is much simpler (i.e. has far less functionality) and would therefore be much faster for a simple operation like reading or writing a small document.
If you try to implement this requirement on a real database, things become much more complicated. For example, "10,000 reads/writes of one document which is part of a 10 GB collection that is also being used in various aggregation pipelines" is a very different problem to solve.
For a real-world deployment there are simply too many variables to account for; you need to look at your system and go through a performance investigation, or hire a consultant who can do this. For example, if you talk to your database over TLS (which is very common), the wire-protocol overhead alone will noticeably reduce your peak read/write throughput for trivially small documents.
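As a starting point rather than a tuned benchmark, a single-document update loop with the MongoDB C driver (libmongoc) looks roughly like this; the URI, namespace, field name and the assumption that the target document already exists are all mine, and `mongoc_collection_update_one` requires driver version 1.9 or newer:

```c
/* Sketch: repeated single-document updates via libmongoc.
 * Assumes a local mongod and an existing {"_id": "page_hits"} document. */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main(void)
{
    mongoc_init();
    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_collection_t *coll =
        mongoc_client_get_collection(client, "test", "counters");

    bson_t *selector = BCON_NEW("_id", BCON_UTF8("page_hits"));
    bson_t *update   = BCON_NEW("$inc", "{", "n", BCON_INT64(1), "}");
    bson_error_t error;

    /* Tight loop of one-document updates; with the server local and the
       working set in memory this is the kind of workload where tens of
       thousands of ops/s are plausible, hardware permitting. */
    for (int i = 0; i < 10000; i++) {
        if (!mongoc_collection_update_one(coll, selector, update,
                                          NULL, NULL, &error)) {
            fprintf(stderr, "update failed: %s\n", error.message);
            break;
        }
    }

    bson_destroy(selector);
    bson_destroy(update);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
```

Timing a loop like this on your own hardware, with and without TLS, is a quick way to see where your ceiling actually is before digging into the storage engine.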

Best programming language for fast DB reads and fast local data structure manipulation

I have a mysql database with a variety of different tables with some storing 100k+ rows. I wanted a language that would allow me to read quickly from the database, allowing me to collate data from various different tables and store them into local objects/data structures. I would then do most of the complex processing locally, which I would also like to be optimized for.
This is mainly for an analysis project of data that is cleared out every day. Some friends recommended Ruby or Python, but not knowing either, I wanted a second opinion before I took the leap.
Part of the problem here is the latency between the DB and your application-tier code. Ping the DB server from the machine you intend to query it from; double that and that's your turnaround time for every operation. If you can live with that time, then you're OK. But you might be better off writing your manipulations in stored procedures or something else that lives close to the DB, and using your application code just to make the results presentable to the user.
The DB is going to be the bottleneck in most cases in terms of getting data out.
What your "complex processing" actually is will make the greatest difference in which language to choose and what performance you need.
In terms of getting up and running, Python and Ruby are quick to get started with and get something working. They're a bit slower than other languages for heavy computation, but even then, both can compute a hell of a lot of stuff before you'll notice much difference from a natively compiled language.
100,000 records really isn't that much. Provided you have enough RAM, and multiple local "indexes" into the data reference the same objects rather than copies, you'll be able to cache it locally and access it very quickly without concern. While Ruby and Python are both interpreted languages and operation-for-operation slower than compiled languages, the reality is that when executing an application only a small portion of CPU time is spent in your code; the majority is spent in the built-in libraries you call into, which are often native implementations and thus as fast as compiled code.
Either Ruby or Python would work fine for this, and even if you find, after testing, that your performance is in fact not sufficient, translating from one of these to a faster language like Java, .NET or even C++ would be significantly quicker than rewriting from scratch, since you've really already done the tough work.
One other option is to cache all the data in an in-memory database. Depending on how dynamic the analysis you need to do is, this may work well in your situation. SQLite works very well for this.
Also note that since you're asking about caching the data locally and then acting on the local cache only, the performance calling out to the database doesn't apply.
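If you do go the in-memory SQLite route, it is just a matter of opening the special `":memory:"` name; the schema below is an invented example:

```c
/* Sketch: an in-memory SQLite cache (link with -lsqlite3).
 * ":memory:" creates a database that lives only inside this process. */
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_open(":memory:", &db);

    /* Load it once from the real tables, then run the complex
       processing against this local copy. Schema is illustrative. */
    sqlite3_exec(db,
        "CREATE TABLE cache (id INTEGER PRIMARY KEY, payload TEXT);"
        "CREATE INDEX cache_payload ON cache(payload);",
        NULL, NULL, NULL);

    sqlite3_close(db);
    return 0;
}
```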

Which embedded database capable of 100 million records has an efficient C or C++ API

I'm looking for a cross-platform database engine that can handle databases of up to hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which will allow easy, fast construction of records and parsing of returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criterion, so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
you might consider C-Tree by FairCom - tell 'em I sent you ;-)
I'm the author of hamsterdb.
Tokyo Cabinet and BerkeleyDB should work fine. hamsterdb definitely will work. It's a plain C API, open source, platform independent, very fast and tested with databases up to several hundred GB and hundreds of millions of items.
If you are willing to evaluate and need support, then drop me a mail (contact form on hamsterdb.com) - I will help as well as I can!
bye
Christoph
You didn't mention what platform you are on, but if Windows only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
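For reference, the LevelDB C API (leveldb/c.h) looks roughly as follows; the path, key and value here are made up, and the secondary-index scaffolding mentioned above is not shown:

```c
/* Sketch of the LevelDB C API (link with -lleveldb).
 * LevelDB is a plain key/value store; any multiple-indexing
 * layer is something you build on top yourself. */
#include <leveldb/c.h>
#include <stdio.h>

int main(void)
{
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/testdb", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* Keys and values are arbitrary byte strings: no string conversion. */
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, "ip:192.0.2.1", 12, "\x01\x02\x03\x04", 4, &err);

    size_t vlen = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, "ip:192.0.2.1", 12, &vlen, &err);
    leveldb_free(val);

    leveldb_readoptions_destroy(ropts);
    leveldb_writeoptions_destroy(wopts);
    leveldb_options_destroy(opts);
    leveldb_close(db);
    return 0;
}
```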
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and easily without the overhead found in other databases. Berkeley DB is a C library that runs in the same process as your application, avoiding the interprocess communication delays of using a remote database server. Shared caches keep the most active data in memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
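To make the "direct API, no string conversion" point concrete, here is a rough sketch of the Berkeley DB C API; the file name, key and the IPv4 value are purely illustrative:

```c
/* Sketch of direct Berkeley DB access from C (link with -ldb).
 * No SQL anywhere: keys and values are raw byte buffers (DBTs). */
#include <db.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return 1;
    if (dbp->open(dbp, NULL, "records.db", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    /* Store a binary value (e.g. an IPv4 address) without string conversion. */
    unsigned int ip = 0xC0000201;            /* 192.0.2.1 */
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data  = "host-a";   key.size  = 6;
    data.data = &ip;        data.size = sizeof ip;
    dbp->put(dbp, NULL, &key, &data, 0);

    /* Read it back. */
    memset(&data, 0, sizeof data);
    if (dbp->get(dbp, NULL, &key, &data, 0) == 0)
        printf("stored %u bytes\n", data.size);

    dbp->close(dbp, 0);
    return 0;
}
```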
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html. It is under LGPL, so not compatible with your restrictions, but an interesting project to check out, nonetheless.
SQLite would meet those criteria, except for the eventual shared-file scenario in the future (and actually it could probably do that too, if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB: fast, simple to use, can store arbitrary binary data. No provision for shared databases.
GLib's HashTable module seems quite interesting too and is very common, so you won't risk going into a dead end. On the other hand, I'm not sure there is an easy way to store the database on disk; it's mostly for in-memory use (a minimal usage sketch follows below).
I've tested both on multi-million records projects.
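For the GLib option, a minimal sketch of the GHashTable API; note it is purely in-memory, as mentioned above, and the keys and values are just example strings:

```c
/* Sketch of the GLib hash table (build flags from
 * `pkg-config --cflags --libs glib-2.0`). In-memory only:
 * persisting it to disk is entirely up to you. */
#include <glib.h>
#include <stdio.h>

int main(void)
{
    GHashTable *table = g_hash_table_new_full(g_str_hash, g_str_equal,
                                              g_free, g_free);

    /* Keys and values are just pointers; here both are strings. */
    g_hash_table_insert(table, g_strdup("alpha"), g_strdup("first value"));
    g_hash_table_insert(table, g_strdup("beta"),  g_strdup("second value"));

    const char *val = g_hash_table_lookup(table, "alpha");
    printf("alpha -> %s\n", val);

    g_hash_table_destroy(table);   /* frees keys and values too */
    return 0;
}
```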
Since you are familiar with Fairtree, you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings but I think you have to build a SQL command to do the insertion and that command will have some string building.
BerkeleyDB is a well-engineered product if you don't need a relational DB. I have no idea what Oracle charges for it, or whether you would need a licence for your application.
Personally I would consider why you have some of your requirements. Have you done testing to verify the requirement that you need to do direct insertion into the database? It seems like you could take a couple of hours to write a wrapper that converts from whatever API you want to SQL, and then see if SQLite, MySQL, ... meet your speed requirements.
There used to be a product called Btrieve, but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.

Is there is a software caching API out there?

I'm looking for an API or an application that can cache data from a file or database.
My idea is that I have an application that reads a database, but access to the database is sequential and it is on disk.
What I basically want to do is get the data from cache first and then if it doesn't exist in cache, then hit the database. Note I'm not using a mainstream database, I'm using SQLite, but my performance requirements are very high.
So is there any product or API (free or commercial) that I can use for this purpose? Also I must have an API to interface with my cache.
I want to implement something like a web server's cache.
I'm using C and Unix platform.
Thanks
You might want to look at using a shared memory cache such as memcached, although this requires a separate daemon, or roll something similar for yourself.
One thing I'd mention is that you should probably do some actual benchmarking to check that your database is your bottleneck, and if performance is a real concern there, then you're going to have to consider scaling up to a non-embedded DBMS. If that's not an option, then you may still be able to optimise the existing database accesses (query optimisation, indices, etc.).
Check out memcached. Brian Aker has written a C library for it.
But I would also second Rob's suggestion. SQLite and "performance requirements are very high" may not necessarily go together, depending on what aspect of performance you mean.
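For what it's worth, a shared counter via that C library (libmemcached) might look roughly like this; the key name and server address are my assumptions, and it needs a running memcached daemon:

```c
/* Sketch: shared counter via libmemcached (link with -lmemcached);
 * assumes a memcached daemon listening on localhost:11211. */
#include <libmemcached/memcached.h>
#include <stdio.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers = memcached_servers_parse("localhost:11211");
    memcached_server_push(memc, servers);
    memcached_server_list_free(servers);

    /* Seed the counter if it doesn't exist yet (stored as ASCII digits),
       then increment it atomically on the server. */
    memcached_add(memc, "hits", 4, "0", 1, (time_t)0, (uint32_t)0);

    uint64_t value = 0;
    memcached_increment(memc, "hits", 4, 1, &value);
    printf("hits = %llu\n", (unsigned long long)value);

    memcached_free(memc);
    return 0;
}
```

The caveat from the original question still applies: this depends on a separate long-running daemon, and the counter is lost if memcached restarts.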
You might try Zola's CaLi library:
http://icis.pcz.pl/~zola/CaLi/

How to tell how much main memory an app uses

I need to choose a database management system (DBMS) that uses the least amount of main memory since we are severely constrained. Since a DBMS will use more and more memory to hold the index in main memory, how exactly do I tell which DBMS has the smallest memory footprint?
Right now I just have a memory monitor program open while I perform a series of queries we'll call X. Then I run the same set of queries X on a different DBMS and see how much memory is used in its lifetime and compare with the other memory footprints.
Is this a not-dumb way of going about it? Is there a better way?
Thanks,
Jbu
Just use SQLite. In a single process. With C++, preferably.
What you can do in the application is manage how you fetch data. If you fetch all rows from a given query, it may try to build a Collection in your application, which can consume memory very quickly if you're not careful. This is probably the most likely cause of memory exhaustion.
To solve this, open a cursor to a query and fetch the rows one by one, discarding the row objects as you iterate through the result set. That way you only store one row at a time, and you can predict the "high-water mark" more easily.
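With the embedded SQLite suggested above, that cursor-style, row-at-a-time pattern looks roughly like this (the table and columns are invented for the example):

```c
/* Row-at-a-time iteration with the SQLite C API: only one row's worth
 * of data is held in the application at any moment, so the high-water
 * mark stays predictable. Table and column names are illustrative. */
#include <sqlite3.h>
#include <stdio.h>

void scan(sqlite3 *db)
{
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db, "SELECT id, name FROM items", -1,
                           &stmt, NULL) != SQLITE_OK)
        return;

    while (sqlite3_step(stmt) == SQLITE_ROW) {
        /* Process the current row, then let it go: nothing accumulates. */
        int id = sqlite3_column_int(stmt, 0);
        const unsigned char *name = sqlite3_column_text(stmt, 1);
        printf("%d %s\n", id, name);
    }
    sqlite3_finalize(stmt);
}
```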
Depending on the JDBC driver (i.e. the brand of database you're connecting to), it may be tricky to convince the JDBC driver not to do a fetchall. For instance, some drivers fetch the whole result set to allow you to scroll through it backwards as well as forwards. Even though JDBC is a standard interface, configuring it to do row-at-a-time instead of fetchall may involve proprietary options.
On the database server side, you should be able to manage the amount of memory it allocates to index cache and so on, but the specific things you can configure are different in each brand of database. There's no shortcut for educating yourself about how to tune each server.
Ultimately, this kind of optimization is probably answering the wrong question.
Most likely the answers you gather through this sort of testing are going to be misleading, because the DBMS will react differently under "live" circumstances than during your testing. Furthermore, you're locking yourself in to a particular architecture. It's difficult to change DBMS down the road, once you've got code written against it. You'd be far better served finding which DBMS will fill your needs and simplify your development process, and then make sure you're optimizing your SQL queries and indices to fit the needs of your application.
