Benchmarks for Crate.io - benchmarking

Are there any benchmarks for Crate.io?
I am particularly interested in storing a binary file (using Crate.io's shards and replicas) and accessing it using the provided SQL-like language.

Thanks for the question. Right now, we don't have any benchmarks available. We've been planning to do some benchmarking, although it's important to note that many different factors go into the results: cloud services are slower than bare metal, and the machine configuration, the makeup of the data, the size of the cluster, the type of query... the list goes on. So if we're going to do benchmarks, we want to make sure we have the right setup and conditions.
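In the meantime, if you want a rough number for your own cluster and workload, you can time queries yourself. Here is a minimal sketch using the crate Python client (DB-API style); the table name, query, and loop count are placeholders, and the result only reflects your particular setup:

```python
import time
from crate import client  # pip install crate

connection = client.connect("http://localhost:4200")  # point at your own cluster
cursor = connection.cursor()

runs = 1000
start = time.perf_counter()
for _ in range(runs):
    cursor.execute("SELECT count(*) FROM my_table")  # hypothetical table/query
    cursor.fetchone()
elapsed = time.perf_counter() - start
print(f"{runs / elapsed:.0f} queries/sec on this cluster and configuration")
```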

Related

Mongodb interface for high speed updates

Are there any examples with source code for high-speed (at least 10,000 record reads/writes per second) MongoDB reads/updates of a single record at a time?
Alternatively, where could I look in the MongoDB server code for a way to, say, inject a customised put/get of a record, for example with the WiredTiger storage engine?
For example, if the MongoDB C interface is similar to Oracle's SQL*Net client, I'd need something similar to the sqlldr bulk insert/update tool.
Thank you for any hint on where to start.
Raw performance is highly dependent on hardware. If the only requirement is "10,000 reads/writes of one document per second", then all of the following would help:
Having an empty database.
Storing the database in memory (tmpfs).
Using Linux rather than Windows.
Using a CPU with high clock speed, at the expense of core count.
Using the fastest memory available.
You might notice that some of these conditions are mutually exclusive with many practical systems. For example, if you truly have an empty or nearly empty database that fits in memory, you could probably use something like Redis instead, which is much simpler (i.e. has far less functionality) and would therefore be much faster for a simple operation like reading or writing a small document.
If you try to implement this requirement on a real database, things become way more complicated. For example, "10,000 reads/writes of one document which is part of a 10 GB collection which is also being used in various aggregation pipelines" is a very different problem to solve.
For a real-world deployment there are simply too many variables to account for; you need to look at your own system and go through a performance investigation, or hire a consultant who can. For example, if you talk to your database over TLS (which is very common), that adds significant overhead to the wire protocol, which will seriously affect your peak read/write performance for trivial document sizes.
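If you just want to see where your own setup lands, a rough single-threaded microbenchmark with PyMongo against a local server looks something like the sketch below; the collection name and loop count are arbitrary, and real throughput will depend heavily on hardware, concurrency, and write concern:

```python
import time
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
coll = client.bench.docs                           # arbitrary database/collection
coll.drop()
coll.insert_one({"_id": 1, "value": 0})

n = 50_000
start = time.perf_counter()
for i in range(n):
    coll.update_one({"_id": 1}, {"$set": {"value": i}})  # write one document
    coll.find_one({"_id": 1})                            # read it back
elapsed = time.perf_counter() - start
print(f"{2 * n / elapsed:,.0f} single-document ops/sec (this client, this machine)")
```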

which db to go for tiny data requirements

I need some help choosing databases for my application.
My web application will basically consist of a main table; let's call it the "User" table.
It will have user info like name, id, password, address, phone, etc.
There will be 5 other related tables where I will save each user's info,
e.g. a table for books read, a table for songs heard, food eaten, etc.
Overall I don't expect my data to go beyond 1,000 users.
So, I have got tiny data requirements.
Generally I would have gone with MySQL, but I am feeling a bit adventurous.
I want to try out some of the new solutions on the block.
My requirements are:
1. pure performance
2. good documentation, ease of use
Since my DB size shouldn't be more than a few hundred megs, I'd rather keep the entire tablespace in memory for faster performance. How about some of the new NoSQL DBs?
Any recommendations? I have worked mainly on Oracle and MySQL and don't have much idea of all the new exciting stuff out there.
I would suggest going with SQLite if your database requirements are small.
From the SQLite website:
SQLite is a compact library. With all features enabled, the library
size can be less than 350KiB, depending on the target platform and
compiler optimization settings. (64-bit code is larger. And some
compiler optimizations such as aggressive function inlining and loop
unrolling can cause the object code to be much larger.) If optional
features are omitted, the size of the SQLite library can be reduced
below 200KiB. SQLite can also be made to run in minimal stack space
(4KiB) and very little heap (100KiB), making SQLite a popular database
engine choice on memory constrained gadgets such as cellphones, PDAs,
and MP3 players. There is a tradeoff between memory usage and speed.
SQLite generally runs faster the more memory you give it.
Nevertheless, performance is usually quite good even in low-memory
environments.
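If you want to get a feel for SQLite from Python before committing, here is a minimal sketch using the standard-library sqlite3 module; the tables are just illustrative stand-ins for the "User" and related tables described above:

```python
import sqlite3  # ships with Python, no server to install or run

conn = sqlite3.connect(":memory:")  # use a file path instead if you want persistence
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
conn.execute("CREATE TABLE books_read (user_id INTEGER REFERENCES user(id), title TEXT)")

conn.execute("INSERT INTO user (name, phone) VALUES (?, ?)", ("Alice", "555-0100"))
conn.execute("INSERT INTO books_read VALUES (1, 'SQLite in 5 Minutes')")
conn.commit()

# Join the main table against one of the related tables.
for row in conn.execute(
    "SELECT u.name, b.title FROM user u JOIN books_read b ON b.user_id = u.id"
):
    print(row)
```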
Object-oriented DBs such as db4o or Versant could also be used.
Neo4j (for Java) is a pretty awesome tool. It's technically a graph database, but by the sound of your data model, I think it would be well suited to you. From what I've seen it performs very well, its documentation is incredibly good, and if you are using Java it feels like second nature. You basically point it at a directory and it sets up shop there.
If you are feeling adventurous and happen to be using Java, I suggest you give it a try.
I think Redis is exactly what you want!
Yesterday I downloaded and installed it for the first time. It runs completely in memory, which meets your performance requirement. (It only writes the data to disk as a safeguard against things like power failure, like a backup, and this does not slow down writes.)
For Linux and the like there is a tar.gz on the download page.
For Windows you can download Dusan's native port: http://redis.io/download - it is precompiled and also has the client console to try out.
The documentation is very good; for example, this is the page for the data types: http://redis.io/topics/data-types and you will also find all the other relevant information there as a fast-to-browse reference.
And there is a nice online tutorial to get started quickly: http://try.redis-db.com/ which is actually fun to work through.
I like the atomic operations like "increment by" and the list structures with push and pop.
There is also a hash type.
For Python there is redis-py: https://github.com/andymccurdy/redis-py
Being a Python coder myself, I think the data structures that Redis offers map very well onto the Python datatypes.
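For example, a quick redis-py sketch against a local server, showing the set/get, atomic increment, list push/pop, and hash operations mentioned above (key names are made up):

```python
import redis  # pip install redis (redis-py)

r = redis.Redis(host="localhost", port=6379, db=0)  # assumes a local Redis server

r.set("user:1:name", "Alice")
print(r.get("user:1:name"))            # b'Alice'

r.incr("user:1:logins")                # atomic "increment by"
r.lpush("user:1:books", "Dune")        # list push ...
print(r.rpop("user:1:books"))          # ... and pop

r.hset("user:1", "name", "Alice")      # hash type
r.hset("user:1", "phone", "555-0100")
print(r.hgetall("user:1"))
```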

Best C language key/value database around for massive amounts of entries

I am trying to create a key/value database with 300,000,000 key/value pairs of 8 bytes each (both for the key and the value). The requirement is to have a very fast key/value mechanism which can query about 500,000 entries per second.
I tried BDB, Tokyo DB, Kyoto DB, and LevelDB and they all perform very badly when it comes to databases of that size (their performance is not even close to their benchmarked rates at 1,000,000 entries).
I cannot store my database in memory because of hardware limitations (32-bit software), so memcached is out of the question.
I cannot use external server software either (only a database module), and there is no need for multi-user support at all. Of course, server software cannot sustain 500,000 queries per second from a single endpoint anyway, so that leaves out Redis, Tokyo Tyrant, etc.
David Segleau here. Product Manager for Berkeley DB.
The most common problem with BDB performance is that people don't configure the cache size, leaving it at the default, which is pretty small. The second most common problem is that people write application behavior emulators that do purely random look-ups (even though their application is not really completely random), which forces them to read data from outside the cache. The random I/O then takes them down a path of conclusions about performance that are based on the simulated workload rather than the actual application behavior.
From your description, I'm not sure if you're running into these common problems or into something else entirely. In any case, our experience is that Berkeley DB tends to perform and scale very well. We'd be happy to help you identify any bottlenecks and improve your BDB application throughput. The best place to get help in this regard is the BDB forums at: http://forums.oracle.com/forums/forum.jspa?forumID=271. When you post to the forum it would be useful to show the critical query segments of your application code and the db_stat output showing the performance of the database environment.
It's likely that you will want to use BDB HA/Replication in order to load balance the queries across multiple servers. 500K queries/second is probably going to require a larger multi-core server or a series of smaller replicated servers. We've frequently seen BDB applications doing 100-200K queries/second on commodity hardware, but 500K queries per second on 300M records in a 32-bit application is likely going to require some careful tuning. I'd suggest focusing on optimizing the performance of the queries in the BDB application running on a single node, and then using HA to distribute that load across multiple systems in order to scale your queries/second throughput.
I hope that helps.
Good luck with your application.
Regards,
Dave
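The question asks for C, but just to illustrate the cache-size tuning Dave mentions, here is a rough sketch using the third-party bsddb3 Python bindings for Berkeley DB; the environment path and cache size are placeholders, not recommendations:

```python
import os
from bsddb3 import db  # pip install bsddb3 (needs a Berkeley DB installation)

env_dir = "/tmp/bdb-env"               # hypothetical environment directory
os.makedirs(env_dir, exist_ok=True)

env = db.DBEnv()
env.set_cachesize(0, 512 * 1024 * 1024, 1)   # 512 MB cache instead of the tiny default
env.open(env_dir, db.DB_CREATE | db.DB_INIT_MPOOL)

d = db.DB(env)
d.open("kv.db", dbtype=db.DB_HASH, flags=db.DB_CREATE)

d.put(b"12345678", b"87654321")        # 8-byte key, 8-byte value
assert d.get(b"12345678") == b"87654321"

d.close()
env.close()
```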
I found a good benchmark comparison web page that basically compares 5 renowned databases:
LevelDB
Kyoto TreeDB
SQLite3
MDB
BerkeleyDB
You should check it out before making your choice: http://symas.com/mdb/microbench/.
P.S. - I know you've already tested them, but you should also consider that your configuration in those tests may not have been optimal, since the benchmark results suggest otherwise.
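The MDB on that list is what is now known as LMDB. If you want a quick feel for it from Python before wiring it into your C code, the lmdb binding is a thin wrapper; a sketch, with the path and map size as placeholders:

```python
import lmdb  # pip install lmdb

# map_size is the maximum size the memory map may grow to; on a 32-bit build
# it is constrained by the available address space.
env = lmdb.open("/tmp/lmdb-test", map_size=1 * 1024**3)

with env.begin(write=True) as txn:      # write transaction, committed on exit
    txn.put(b"12345678", b"87654321")   # 8-byte key, 8-byte value

with env.begin() as txn:                # read-only transaction
    assert txn.get(b"12345678") == b"87654321"

env.close()
```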
Try ZooLib.
It provides a database with a C++ API that was originally written for a high-performance multimedia database for educational institutions called Knowledge Forum. It could handle 3,000 simultaneous Mac and Windows clients (also written in ZooLib - it's a cross-platform application framework), all of them streaming audio and video and working with graphically rich documents created by the teachers and students.
It has two low-level APIs for actually writing your bytes to disk. One is very fast but is not fault-tolerant. The other is fault-tolerant but not as fast.
I'm one of ZooLib's developers, but I don't have much experience with ZooLib's database component. There is also no documentation - you'd have to read the source to figure out how it works. That's my own damn fault, as I took on the job of writing ZooLib's manual over ten years ago, but barely started it.
ZooLib's primary developer, Andy Green, is a great guy and always happy to answer questions. What I suggest you do is subscribe to ZooLib's developer list at SourceForge, then ask on the list how to use the database. Most likely Andy will answer you himself, but maybe one of our other developers will.
ZooLib is Open Source under the MIT License, and is really high-quality, mature code. It has been under continuous development since 1990 or so, and was placed in Open Source in 2000.
Don't be concerned that we haven't released a tarball since 2003. We probably should, as this leads lots of potential users to think it's been abandoned, but it is very actively used and maintained. Just get the source from Subversion.
Andy is a self-employed consultant. If you don't have time but you do have a budget, he would do a very good job of writing custom, maintainable top-quality C++ code to suit your needs.
I would too, if it were any part of ZooLib other than the database, which as I said I am unfamiliar with. I've done a lot of my own consulting work with ZooLib's UI framework.
300 M pairs * 16 bytes (an 8-byte key plus an 8-byte value) = 4.8 GB of raw payload, which already exceeds what a 32-bit process can address.
Since you'll also need to handle overflow (either by a rehashing scheme or by chaining), memory gets even tighter: linear probing probably needs well over 400 M slots, and chaining adds a pointer per entry (bit fiddling might gain you a few bits back), pushing the total footprint toward 6 GB.
In any case, even a specially crafted kernel that restricts its own "reserved" address space to a few hundred MB will not get you there, and escaping to a disk-based scheme would be too slow in all cases. (PAE plus address windowing could save you, but it is tricky.)
IMHO your best choice would be to migrate to a 64-bit platform.
500,000 entries per second without holding the working set in memory? Wow.
In the general case this is not possible using HDDs, and it is difficult even with SSDs.
Do you have any locality properties that might help to make the task a bit easier? What kind of queries do you have?
We use Redis. Written in C, it's only slightly more complicated than memcached, by design. We've never tried to use that many rows, but for us latency is very important; it handles those latencies well and still lets us persist the data to disk.
Here is a benchmark blog entry comparing Redis and memcached.
Berkeley DB could do it for you.
I achieved 50,000 inserts per second about 8 years ago, with a final database of 70 billion records.

Best programming language for fast DB reads and fast local data structure manipulation

I have a mysql database with a variety of different tables with some storing 100k+ rows. I wanted a language that would allow me to read quickly from the database, allowing me to collate data from various different tables and store them into local objects/data structures. I would then do most of the complex processing locally, which I would also like to be optimized for.
This is mainly for an analysis project of data that is cleared out every day. Some friends recommended Ruby or Python, but not knowing either, I wanted a second opinion before I took the leap.
Part of the problem here is the latency between the DB and your application-tier code. Ping the DB server from where you intend to query the database. Double that, and that's your minimum turnaround time for every operation. If you can live with that time, then you're OK. But you might be better off writing your manipulations in sprocs or something else that lives close to the DB, and using your application code just to make the results presentable to a user.
The DB is going to be the bottleneck in most cases in terms of getting data out.
What your "complex processing" actually is will make the greatest difference in which language you need and what performance you get.
In terms of being easy to get up and running, Python and Ruby are quick to start with and to get something working. They are a bit slower than other languages for computation, but even then, both can compute a hell of a lot before you will notice much difference from a natively compiled language.
100,000 records really isn't that much. Provided you have enough RAM, and your multiple local "indexes" into the data reference the same objects and not copies, you'll be able to cache it locally and access it very quickly without concern. While Ruby and Python are both interpreted languages and operation-for-operation slower than compiled languages, the reality is that when executing an application only a small portion of CPU time is spent in your code; the majority is spent in the built-in libraries you call into, which are often native implementations and thus as fast as compiled code.
Either Ruby or Python would work fine for this, and even if you find, after testing, that your performance is in fact not sufficient, translating from one of these to a faster language like Java or .NET or even C++ would be significantly faster than rewriting from scratch, since you've already done the tough work.
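To illustrate the "same objects, not copies" caching idea, here is a rough Python sketch assuming the PyMySQL driver and a made-up orders table: the rows are fetched once, and both local "indexes" reference the very same dicts:

```python
from collections import defaultdict
import pymysql  # pip install pymysql; any DB-API driver works the same way

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="analytics")  # hypothetical credentials
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SELECT id, customer_id, region, total FROM orders")  # made-up table
    rows = cur.fetchall()               # ~100k rows fits comfortably in RAM

# Two local indexes that point at the *same* dict objects, not copies.
by_id = {row["id"]: row for row in rows}
by_region = defaultdict(list)
for row in rows:
    by_region[row["region"]].append(row)

# All further processing is local; no more round trips to MySQL.
print(sum(r["total"] for r in by_region.get("EMEA", [])))
```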
One other option is to cache all the data in an in-memory database. Depending on how dynamic the analysis you need to do is, this may work well in your situation. SQLite works very well for this.
Also note that since you're asking about caching the data locally and then acting only on the local cache, the performance of calling out to the database doesn't really apply.

which db should i select if performance of postgres is low

In a web app that supports more than 5,000 users, Postgres is becoming the bottleneck.
It takes more than 1 minute to add a new user (even after optimizations and on Win 2k3).
So, as a design issue, which other DBs might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes will not make you a better dancer.
Do you know what is causing the slowness? Is it contention, time to update indexes, seek times?
Are all 5,000 users trying to write to the user table at the exact same time as you are trying to insert the 5,001st user? That, I can believe, could cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of the number of transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you; 30 years after the technology was invented, users should need less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled DB concurrency and scalability? They are using SQL Server, I know that much.
P.P.S.
So, as chance would have it, I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: we had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing... so we had (on a multi-terabyte system with thousands of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is: in addition to checking your design, make sure your queries are using bind variables and getting reused, and that pressure on shared resources is minimal.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
I think your best choice is still PostgreSQL. Spend the time to make sure you have properly tuned your application. After you're confident you have reached the limits of what can be done with tuning, start caching everything you can. After that, start thinking about moving to an asynchronous master-slave setup... Also, are you running OLAP-type functionality on the same database you're doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your RAM for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most; if you are going to switch but stay with a relational DB, Oracle would be it. ODBMSs scale better still, but they have their own issues, in that setting one up is closer to programming.
Yahoo uses PostgreSQL; that should tell you something about its scalability.
As highlighted above, the problem is not with the particular database you are using (i.e. PostgreSQL) but with one of the following:
Schema design: maybe you need to add, remove, or refine your indexes
Hardware: maybe you are asking too much of your server - you said 5k users, but then again very few of them are probably querying the DB at the same time
Queries: perhaps poorly written, resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgreSQL log files and find out which queries are:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts, and you will most likely resolve your issues fairly quickly. There is no silver bullet; you have to do some homework, but this will be small compared to changing your DB vendor.
Good news... there are lots of utilities to analyse your log files that are easy to use and produce easy-to-interpret results; here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running on PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
I had the same issue previously at my current company. When I first joined them, their queries were huge and very slow; it took 10 minutes to run them. I was able to optimize them down to a few milliseconds or 1 to 2 seconds. I learned many things during that time and will share a few of the highlights.
Check your query first. Doing an inner join of all the tables that you need will always take some time. One thing that I would suggest is to always start off with the table that lets you cut your data down to just the rows you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ilike '%abc') AS person;
If you look at the example above, it cuts your results down to something you know you need, and you can refine them further by doing an inner join. This is one of the best ways to speed up your query, but there is more than one way to skin a cat. I cannot explain all of them here because there are just too many; starting from the example above, you just need to adapt the idea to suit your needs.
It depends on your Postgres version. Older Postgres versions do less internal query optimization; one example is that on Postgres 8.2 and below, IN statements are slower than on 8.4.
EXPLAIN ANALYZE is your friend. If your query is running slow, run EXPLAIN ANALYZE on it to determine which part is causing the slowness (see the sketch after this answer).
Vacuum your database. This helps ensure that the planner statistics closely match the actual data; a big difference between the statistics and the actual data will result in your query running slow.
If none of these help, try modifying your postgresql.conf. Increase the shared memory and experiment with the configuration to better suit your needs.
Hope this helps; of course, these tips are just for Postgres optimization.
BTW, 5,000 users is not much. My DBs contain from about 200k to a million users.
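Regarding the EXPLAIN ANALYZE tip above, here is a minimal sketch of running it from Python with psycopg2; the connection string, table, and query are placeholders, and note that EXPLAIN ANALYZE actually executes the statement:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=myapp user=app password=secret host=localhost")

with conn.cursor() as cur:
    # EXPLAIN ANALYZE runs the query and reports where the time was actually spent.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM users WHERE email = %s",
                ("someone@example.com",))
    for (line,) in cur.fetchall():
        print(line)

conn.close()
```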
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version are you using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be unrelated to PostgreSQL.
If you have many more reads than writes, you may want to try MySQL, assuming the problem is with Postgres - but your problem sounds like a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS database servers, or document-oriented ones like Mnesia and CouchDB, depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?
