I am dealing with a huge dataset consisting of key-value pairs. The queries are always in the form of range queries on the key space (keys are numbers) hence any persistent B-Tree like structure will handle the situation. I would like to use BDB-Java Edition but the product is closed source and my company doesn't want to buy BDB-JE License. I am wondering, would you please share your experience with any non-GPL java based key-value storage system.
Thanks,
-A
There is also OrientDB, which is a document database written in Java and can be embedded to application (no external server) like BDB Java edition. They use Apache 2.0 license.
They also have key/value based variant: OrientKV. I haven't really used Orient myself, just poking around, so I don't know if it supports your use case (range queries on key space). However, it advertises itself as really fast.
Though, it seems Orient DB is not very widely used. I even made a question asking if anybody has any experiences to share.
Tokyo Cabinet comes to mind as a very fast KV store which is under the LGPL and is embedded like the BDB and supports BTrees. It is c-based but a javaclient is available and I had no trouble installing it.
MongoDB and CouchDB nice , but it runs as a separate server. Again Java support is available.
Related
I'm in need for an embedded database for a Clojure application. Maybe it's the same criteria as for any other Java application but I rather get some other people's opinion anyway. I'm not picking SQLite because that's not pure Java so distribution of a standalone application gets much more complex. It seems the way to go is Apache Derby. Anything else I should consider?
Without a doubt, H2
Here are the settings,
(def demo-settings
{
:classname "org.h2.Driver"
:subprotocol "h2:file"
:subname (str (System/getProperty "user.dir") "/" "demo")
:user "sa"
:password ""
}
)
And then the usual Clojure SQL code:
(with-connection demo-settings
(create-table :DEMO_TABLE
[:M_LABEL "varchar(120)"]
[:M_DATE "varchar(120)"]
[:M_COMMENT "varchar(32)"]))
Have you looked at FleetDB? It's a Clojure database with a JSON protocol and clients in several languages. I suspect you could probably run it embedded without working too hard at it.
h2
oracle Berkley DB
I used an embedded database, H2 within clojure and used clojureQL to access it. Be warned though that since the database is in process you should not use this for large amounts of records (> than 10,000s in a single table) as you will get huge performance problems as the database and your code will both be sharing the same JVM
I think Derby makes an excellent 100% Java embedded database, and it's useful for a wide variety of applications, well-maintained by an active community, and very well documented.
If you don't mind NOSQL, neo4j is an embeddable graph db with transactions, licensed under the GPL. The most up to date bindings I've found are https://github.com/hgavin/borneo
There is also an interesting graph db project in clojure with pluggable backends: https://github.com/flatland/jiraph
The still quite young but promising looking OrientDB might be worth a look: http://www.orientechnologies.com/orient-db.htm
http://github.com/eduardoejp/clj-orient
Then there's http://jdbm.sourceforge.net/
I am using https://github.com/clojurewerkz/archimedes which allows you to specify a backend later.
Another option to consider is a key-value store Chronicle Map, because it's pure Java and provides a vanilla Java Map interface, so working with it should be very simple using Clojure.
I'm using Squeak4.1. How does it handle Database connections? Does it provides something similar to ODBC/ADO in .NET or the J2EE stuff?
Which packages deal with database operations?
Can anybody give me some hints?
Few links that might be of use to you:
Squeak Smalltalk and Databases
SqueakDBX
Persistence in Seaside (Also see Chapter 8 in the Seaside tutorial)
Magma
Databases and Persistence
Squeak PostreSQL
If you want something that's truly an analog to ODBC/JDBC or ADO.NET, then the closest analog would be SqueakDBX, a generic, FFI-based connector to a wide variety of databases. While it uses FFI, the developers have gone to great lengths to ensure that long operations do not block the VM. While I can't honestly say I've used it in production, reviews have been positive, it supports a very wide variety of databases (MySQL, Microsoft SQL Server, PostgreSQL, SQLite3, and more), and it's being actively developed, so it's probably a good bet.
Historically, the downside of SqueakDBX is that you didn't get GLORP, the major ORM used in the Smalltalk world these days. The good news is that's no longer true: SqueakDBX now has GlorpDBX, which brings GLORP to SqueakDBX. Drivers are currently available for PostgreSQL, MS SQL, and MySQL, among others. If you need to connect to a traditional database, this is probably your best bet.
Benjamin: We have already started to modify Glorp, we call it GlorpDBX and now Glorp works with a generic database driver, included a GlorpSqueakDBX driver. Right now we have GlorpDBX working with SqueakDBX for Postgres, MSSQL and Oracle.
Cheers
You might not need to. If your smalltalk code runs in Gemstone, there is no need to worry about database connections and queries before you have a lot of data/a lot of transactions.
And if the number of objects is very small, SandstoneDB is much easier to use. On the Persistence in Seaside page you can find the links.
Ideally, where would an application like Facebook store its "Friends" data?
In a database table? in an xml file?
From Facebooks engineering page:
"Already, we are the second most-trafficked PHP site in the world (Yahoo is #1), and one of the largest MySQL installations anywhere, running thousands of databases."
and
"We've built a lightweight but powerful multi-language RPC framework that allows us to seamlessly and easily tie together subsystems written in any language, running on any platform. Facebook is built in PHP, C++, Perl, Python, Erlang, Java, and even a little bit of ML—and it all works together.
* We are the largest user in the world of memcached, an open-source caching system. Originally developed by LiveJournal, we've since made so many scalability improvements and performance upgrades that we will be the primary contributor of features in the next major release.
* We've created a custom-built search engine serving millions of queries a day, completely distributed and entirely in-memory, with real-time updates."
Relational databases?
check out this blog: http://highscalability.com/ many real-world examples of systems architecures to learn from
"Friends" data is well-described in a graph database. Neo4j is an example, though I know it's not the way Facebook stores this information.
Facebook uses a number of database technologies that may be involved:
a patched version of MySQL
Cassandra
Hadoop
... others
Most probably it should contain some other mechanism. As an example a search engine does not keep its index as a database or XML file. To obtain a maximum performance generally they keep some tree (Binary search tree or more complicated one) and store them on disk in performance effective manner. So I guess such mechanism.
Certainly not in a XML file.
Yes, in a database, in one or several tables. And for the precise exemple of facebook, on several server.
With the rising of non-sql database usage in high traffic website, I'm interested to use it for my project. Now I've heard several names like Voldermort, MongoDB and CouchDB. But which are among these NonSQL database that is production ready? I've seen the download pages and it seems that none of them is production ready because is not version 1.0 yet. Is there any other names other than these 3 that is recommendable to be used in production?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list as well as the mnesia database provided by the Erlang VM.
I think you need to start out from your project requirements to see what kind of database you really need. There are many non-relational DBMS:s out there and they differ a lot in what kind of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There's also a lot of stackoverflow threads on similar topics, these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn’t you use a relational
database?
Good reasons NOT to use a relational
database?
When you have narrowed down what you actually need you can take a deeper look into the alternatives to see which DBMS are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy some solution that for example lacks in tool support - in another project this could be a no-go.
As for version numbers different projects have a different take on this, so you can't just compare the version numbers. I'm involved in the graph database project Neo4j and even if it has been in production use for 5+ years by now we still haven't released a version 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
Nice article comparing rdbms with 'next gen' and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I will suggest you to use Arangodb.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32bit and 64bit systems, though using a 32bit system will limit you to using only approximately 2 to 3 GB of data with ArangoDB.
I'm looking for a cross-platform database engine that can handle databases up hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which will allow easy, fast construction of records and parsing returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criteria so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
you might consider C-Tree by FairCom - tell 'em I sent you ;-)
i'm the author of hamsterdb.
tokyo cabinet and berkeleydb should work fine. hamsterdb definitely will work. It's a plain C API, open source, platform independent, very fast and tested with databases up to several hundreds of GB and hundreds of million items.
If you are willing to evaluate and need support then drop me a mail (contact form on hamsterdb.com) - i will help as good as i can!
bye
Christoph
You didn't mention what platform you are on, but if Windows only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and
easily without the overhead found in
other databases. Berkeley DB is a C
library that runs in the same process
as your application, avoiding the
interprocess communication delays of
using a remote database server. Shared
caches keep the most active data in
memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html . It is under LGPL, so not compatible with your restrictions, but an interesting project to check out, nonetheless.
SQLite would meet those criteria, except for the eventual shared file scenario in the future (and actually it could probably do that to if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB fast, simple to use, can store arbitrary binary data. No provision for shared databases.
Glib HashTable module seems quite interesting too and is very
common so you won't risk going into a dead end. On the other end,
I'm not sure there is and easy way to store the database on the
disk, it's mostly for in-memory stuff
I've tested both on multi-million records projects.
As you are familiar with Fairtree, then you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings but I think you have to build a SQL command to do the insertion and that command will have some string building.
BerkeleyDB is a well engineered product if you don't need a relationDB. I have no idea what Oracle charges for it and if you would need a license for your application.
Personally I would consider why you have some of your requirements . Have you done testing to verify the requirement that you need to do direct insertion into the database? Seems like you could take a couple of hours to write up a wrapper that converts from whatever API you want to SQL and then see if SQLite, MySql,... meet your speed requirements.
There used to be a product called b-trieve but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.