Has anyone experimented with an RDF storage solution like Sesame? I'm looking for performance reviews of this kind of solution compared to a traditional database solution.
There are plenty of scalability reports and benchmarks on the web about various triple-stores.
Here is a fine scalability report.
The W3C itself maintains a wiki with lots of information about Large Triplestores and Benchmarks.
Follow these three links and take the time to read them; they're very informative. :)
I've used Sesame extensively in my projects at work. I've found it to be speedy and reliable enough for most situations I find myself in. It has definitely outperformed Jena's storage solutions on a variety of fronts. Sesame 1.x has faster query performance than the 2.x version, but the 2.x version has some nice features such as contexts and SPARQL support.
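For a flavour of the API, here is a minimal sketch of opening an in-memory Sesame 2.x repository and running a SPARQL query (the data file, base URI, and query are placeholders; swap in a NativeStore for on-disk persistence):

    import java.io.File;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.sail.memory.MemoryStore;

    public class SesameSketch {
        public static void main(String[] args) throws Exception {
            // In-memory triple store; use a NativeStore for persistence.
            Repository repo = new SailRepository(new MemoryStore());
            repo.initialize();

            RepositoryConnection con = repo.getConnection();
            try {
                // "data.rdf" is a placeholder for whatever RDF/XML you load.
                con.add(new File("data.rdf"), "http://example.org/", RDFFormat.RDFXML);

                TupleQueryResult result = con.prepareTupleQuery(
                        QueryLanguage.SPARQL,
                        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10").evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
                result.close();
            } finally {
                con.close();
            }
        }
    }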
If you are looking to use a traditional relational database, you could look at something like D2RQ, or at Owlgres if you want inferencing.
One intuition is that if you have a very large number of entities, tuple stores can save you the trouble of having your indexes routinely knocked out of memory as you switch between tables; instead, the first couple of levels of the tuple index stay in RAM.
I'm intrigued by the database service Datomic, but I'm not sure if it fits the needs of the projects I work on. When is Datomic a good choice, and when should it be avoided?
With the proviso that I haven't used Datomic in production, I thought I'd give you an answer.
Advantages
Datalog queries are powerful (more so than non-recursive SQL) and very expressive.
Queries can be written as Clojure data structures, and it's NOT a weak DSL like the ones many SQL libraries offer for querying with data structures (see the sketch after this list).
It's immutable, so you get the advantages that immutability gives you in Clojure/other languages as well
a. This also allows you to store all past facts in your database (with structural sharing, so it stays compact); this is VERY useful for auditing and more
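To make the query point concrete, here is a minimal sketch using Datomic's Java Peer API (in Clojure the query would be a literal data structure; here it is passed as an edn string, and the in-memory URI is purely for illustration):

    import java.util.Collection;
    import java.util.List;
    import datomic.Connection;
    import datomic.Peer;

    public class DatomicQuerySketch {
        public static void main(String[] args) {
            // Throwaway in-memory database, just for illustration.
            String uri = "datomic:mem://example";
            Peer.createDatabase(uri);
            Connection conn = Peer.connect(uri);

            // A Datalog query written as data. This one lists every installed
            // :db/ident, so it works even against an empty database.
            Collection<List<Object>> results = Peer.q(
                    "[:find ?ident :where [_ :db/ident ?ident]]",
                    conn.db());

            for (List<Object> row : results) {
                System.out.println(row.get(0));
            }
        }
    }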
Disadvantages
It can be slow, as Datalog is just going to be slower than equivalent SQL (assuming an equivalent SQL statement can be written).
If you are writing a LOT, you might need to worry about the single transactor getting overwhelmed. This seems unlikely for most use cases, but it's something to think about (you could do a sort of sharding across separate databases and probably save yourself, but this isn't a DB for, e.g., storing stock tick data).
It's a bit tricky to get up and running with, it's expensive, and the licensing and price make it difficult to use a hosted instance: you'll end up sysadminning it yourself instead of using something like Postgres on Heroku or Mongo at MongoHQ.
I'm sure I'm missing some on each side, and though I have three listed under disadvantages, I think the advantages outweigh them wherever the disadvantages don't outright preclude its use. Price is probably the one that will keep it out of most small projects (those you expect to outlast the one-year free trial).
See this short post describing Datomic simply for more information.
Expressivity (cf. Datalog) and immutability are awesome. It's SO much fun to work with Datomic in that regard, and you can tell it's powerful just by using it a bit.
One important thing to consider when deciding whether Datomic is the right fit for your application is the shape of the data you are going to store and query. Datomic facts are actually very similar to RDF triples (plus a first-class notion of time), so it lends itself very well to modeling complex relationships (linked graph data), something that is often cumbersome with traditional SQL databases.
I found this aspect one of the most appealing and important, and it worked really well. Of course, this is not exclusive to Datomic; there are many other high-quality graph databases, and Neo4j deserves a mention among JVM-based solutions.
Regarding the Datomic schema, I think it strikes just the right balance between flexibility and stability.
To complete the above answers, I'd like to emphasize that immutability and the ability to remember the past are not 'wizardry features' suited only to a few special cases like auditing. It is an approach with several deep benefits compared to 'mutable cell' databases (which are 99% of databases today). Stuart Halloway demonstrates this nicely in this video: the Impedance Mismatch is our fault.
In my personal opinion, this approach is fundamentally more sane conceptually. Having used it for several months, I don't see Datomic as having crazy magical sophisticated powers, but rather as a more natural paradigm without some of the big problems the others have.
Here are some features of Datomic I find valuable, most of which are enabled by immutability:
because reading is not remote, you don't have to design your queries like an expedition over the wire. In particular, you can separate concerns into several queries (e.g. find the entities that are the input to my query, answer some business question about those entities, then fetch associated data for presenting the result)
the schema is very flexible, without sacrificing query power
it's comfortable to have your queries integrated in your application programming language
the Entity API brings you the good parts of ORMs
the query language is programmable and has primitives for abstraction and reuse (rules, predicates, database functions)
performance: writers impede only other writers, and no one impedes readers. Plus, lots of caching.
... and yes, a few superpowers like travelling to the past, speculative writes or branching reality.
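As a rough sketch of what 'travelling to the past' looks like with the Java Peer API (conn here is assumed to be an existing datomic.Connection), you take the current database value and rewind it with asOf:

    import java.util.Date;
    import datomic.Connection;
    import datomic.Database;

    public class TimeTravelSketch {
        static void example(Connection conn) {
            // The current immutable database value.
            Database now = conn.db();

            // The database as it was a week ago; the instant is an
            // arbitrary placeholder.
            Database lastWeek = now.asOf(
                    new Date(System.currentTimeMillis() - 7L * 24 * 3600 * 1000));

            // Both values can be queried side by side (e.g. to diff an
            // entity's state for auditing); nothing is mutated.
        }
    }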
Regarding when not to use Datomic, here are the current constraints and limitations I see:
you have to be on the JVM (there is also a REST API, but you lose most of the benefits IMO)
not suited to write-heavy scaling or huge data volumes
won't be especially integrated into frameworks; e.g. you won't currently find a library which generates CRUD REST endpoints from a Datomic schema
it's a commercial database
since reading happens in the application process (the 'Peer'), you have to make sure that the Peer has enough memory to hold all the data it needs to traverse in a query.
So my very vague and informal answer would be that Datomic is a good fit for most non-trivial applications whose write load is reasonable, as long as you don't have a problem with the license or being on the JVM.
As an analogy, you can ask yourself the same question for Git as compared to other version control systems which are not based on immutability.
Just to tentatively add to the other answers:
It is probably fair to say Datomic presents the best conceptual framework for a queryable data store of all the current options out there, while being only partially scalable and not exceptionally performant.
I say only partially scalable because queries need to fit in the Peer's RAM or fail. And not exceptionally performant, because top-notch SQL engines can optimize queries to fit in memory through sophisticated execution plans, something I've not yet seen mentioned as a feature of Datomic; Datomic's decoupling of transacting and querying might, on the whole, offset this.
Unlike in many NoSQL engines, though, transactions are a first-class citizen, which puts it on par with RDBMS systems in that key regard.
For applications where data is read more than written, where transactions are needed, where queries always fit in memory (or memory is very cheap), and where the overall size of accumulated data isn't too large, it might be a win, provided a commercial-only product can be afforded and you are willing to embrace the novel conceptual framework implied by its API.
I've used Teradata a little. I've never touched Hadoop, but since yesterday I've been doing some research on it. From the descriptions, the two seem quite interchangeable, yet some papers say they serve different purposes. Everything I've found is vague, and I am confused.
Does anybody have experience with both of them? What is the real difference between them?
Simple example: I want to build an ETL pipeline that will transform billions of rows of raw data and organize them into a DWH, then run some resource-expensive analysis on them. Why use TD? Why Hadoop? Or why not?
I think this article, titled 'MapReduce and Parallel DBMSs: Friends or Foes?', does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
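To make the 'parallel transformations' point concrete, here is the canonical MapReduce example, a word count, sketched against the org.apache.hadoop.mapreduce API (mapper and reducer only; job configuration and input/output paths are omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Emits (word, 1) for every token in an input line.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Sums the counts for each word; runs in parallel across reducers.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }
    }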
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course Introduction to Data Science there is a lecture titled 'Comparing MapReduce and Databases', as well as a lecture on parallel databases, within the MapReduce section of the course.
Here is a summary from these lectures on the comparison of MapReduce vs. RDBMS (not necessarily parallel RDBMS).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will note in parentheses the MapReduce extensions that add some of this functionality.
Some functionality/properties that RDBMSs have but native MapReduce does not:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DyradLINQ, Hadapt)
Logical Data Independence
Indexing (HBase)
Algebraic optimization (Pig, Dryad, Hive)
Caching/Materialized Views
ACID/Transactions
Some functionality/properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
I've been asked this question several times; the answer I usually give is a car analogy (which is pretty silly, because I'm not a car person, but it seems to work).
Teradata is the car/DBMS for the masses: it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/DBMS for the enthusiast: it isn't as reliable or mature, but it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission-critical processes (operational reporting, enterprise reporting, decision support, etc.).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch, or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturers product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are country or city driving, bolt on a turbocharger, and/or your family complains about how long you spend in the garage on weekends, then Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community fixing bugs and making improvements on a consistent basis. Hadoop's storage layer, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Likewise, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyse their high volumes of data, both in real time and in batch.
For your question about ETL systems, read these slides.
OK, now why Hadoop?
Open Source
Proven storage and analysis model for large quantities of data
Minimal hardware requirements to set up and run
OK, now why TD?
Commercial Support
What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know, MemCache is one such database system; am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.
I don't agree with the answers I'm seeing. Although it's true that NoSQL solutions tend to break the ACID rules, not all of them were created from that approach.
I think you should first define what a SQL solution is; then you can put the "Not Only" in front of it, which gives a more accurate definition of what a NoSQL solution is.
With this approach in mind:
"SQL databases" groups all the data stores that are accessible using Structured Query Language as the main (and most of the time only) way to communicate with them; this requires the database to support the structures common to those systems: "tables", "columns", "rows", "relationships", and so on.
Now put the "Not Only" in front of that sentence and you get a definition of "NoSQL": it groups all the stores created as an attempt to solve problems that don't fit into the table/column/row structure, or even into SQL statements. In most cases these databases don't support relationships; they abandon the well-known structures because the problems have changed since those structures were conceived.
If you have a text file, and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
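Taken literally, that is only a few lines of code. Here is a toy sketch of such a 'database' in Java, a key/value store persisted to a plain text file (java.util.Properties handles the file format; everything here is made up for illustration):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    // A toy "NoSQL database": a key/value store backed by a text file.
    public class FileStore {
        private final File file;
        private final Properties data = new Properties();

        public FileStore(File file) throws IOException {
            this.file = file;
            if (file.exists()) {
                try (InputStream in = new FileInputStream(file)) {
                    data.load(in);
                }
            }
        }

        public String get(String key) {
            return data.getProperty(key);
        }

        public void put(String key, String value) throws IOException {
            data.setProperty(key, value);
            // Naive write-through persistence: rewrite the file on each put.
            try (OutputStream out = new FileOutputStream(file)) {
                data.store(out, null);
            }
        }
    }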
All of this means that there are several solutions for storing information in ways that traditional SQL systems don't allow, to achieve better performance, flexibility, etc. Every NoSQL provider tries to solve a different problem, which is why you won't be able to compare two different solutions directly. For example:
djondb is a document store created as a NoSQL enterprise solution supporting transactions, consistency, etc., but it sacrifices performance relative to its counterparts.
MongoDB is a document store (similar to djondb) which achieves great performance but trades away some of the ACID properties to do so.
CouchDB is another document store which handles queries slightly differently, providing views to retrieve information without doing a full query every time.
...
As you may have noticed, I only talked about document stores. That's because I wanted to show you that three different document store implementations take different approaches; therefore you should keep in mind the golden rule of NoSQL stores: "use the right tool for the right job".
I'm the creator of djondb, and I did a lot of research even before starting my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see information storage.
From wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc...
To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.
What is NoSQL?
NoSQL stands for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed, and horizontally scalable on commodity hardware. NoSQL databases offer a variety of features for solving problems involving a variety of data types, where a "blob" used to be the only RDBMS data type for unstructured data.
1 Dynamic Schema
NoSQL databases allow the schema to be flexible: new columns can be added at any time, rows may or may not have values for those columns, and data types for columns are not strictly enforced. This flexibility is handy for developers, especially when they expect frequent changes during the product life cycle.
2 Variety of Data
NoSQL databases support any type of data: structured, semi-structured, and unstructured. Logs, image files, videos, graphs, JPEGs, JSON, and XML can all be stored and operated on as-is, without any pre-processing, which reduces the need for ETL (Extract, Transform, Load).
3 High Availability Cluster
NoSQL databases support distributed storage on commodity hardware, and they achieve high availability through horizontal scalability. This lets NoSQL databases benefit from the elastic nature of cloud infrastructure services.
4 Open Source
NoSQL databases are open source software. Usage is free, and most of them are free to use in commercial products. The open source codebases can be modified to fit business needs. There are minor variations among open source licenses, so users must be aware of the license agreements.
5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich APIs for performing DML and CRUD operations. These APIs are more developer-friendly and are supported in a variety of programming languages.
Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
I used something called Raima Data Manager more than a dozen years ago that qualifies as NoSQL. It calls itself a "set-oriented database". It's not based on tables, and there is no query "language", just a C API for asking for subsets.
It's fast, and easier to work with from C/C++ than SQL: there's no building up strings to pass to a query interpreter, and the data comes back as an enumerable object rather than as an array. Variable-sized records are normal and don't waste space. I never saw the source code, but there were some hints in the interface that, internally, the code used pointers a lot.
I'm not sure that the product I used is even sold anymore, but the company is still around.
MongoDB looks interesting, SourceForge is now using it.
I listened to a podcast with a team member. The idea with NoSQL isn't so much to replace SQL as it is to provide a solution for problems that aren't solved well with traditional RDBMS. As mentioned elsewhere, they are faster and scale better at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document based system would work great.
Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM, John! I work at Raima, so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM
Let's pretend it's for word frequency counts in a web crawler. Is relational the way to go (I'm imagining a simple two-column table) or is there a NoSQL option better suited to this task?
When I say better, I mean more conceptually suited to the task. I'm not really concerned with scalability, just simplicity and an obvious conceptual mapping to the task at hand. In the way that, for me at least, CouchDB maps much more sensibly to a blog than MySQL does.
If this is something you'll only run on one machine, I'd just use an internal data structure: a red-black tree, or perhaps a trie, for something as simple and small as this (see the sketch below).
Or I'd embed a key/value pair database such as BerkeleyDB.
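For the in-memory route, a quick sketch: Java's TreeMap is a red-black tree, so a word frequency counter is only a few lines (the tokenization is deliberately naive; a real crawler would normalize harder):

    import java.util.Map;
    import java.util.TreeMap;

    public class WordFrequency {
        public static Map<String, Integer> count(Iterable<String> pages) {
            // TreeMap is a red-black tree: sorted keys, O(log n) updates.
            Map<String, Integer> freq = new TreeMap<>();
            for (String page : pages) {
                // Naive tokenization on non-word characters.
                for (String word : page.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        freq.merge(word, 1, Integer::sum);
                    }
                }
            }
            return freq;
        }
    }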
First: are you looking for a paid database or a free one?
If you have one or two tables and maybe one or two indexes, and no need for enterprise features (clustering/replication/flashbacks), then any DB will do fine, from MySQL (the free one) to SQL Server Express (still free), to SQL Server and Oracle, which are both commercial and cost money.
You need to understand that the hardware might play a role here too, as the schema looks very simple and there wouldn't be much optimization possible. But then again, I don't know your exact needs...
(On the other hand, if you're talking about extremely large tables with a lot of reads and writes, and I mean A LOT, you might need a configuration with two active nodes and other advanced features that might push you towards paid databases.)
Have a look at the papers and ideas behind Google BigTable (and the MapReduce operations possible on it). There are other implementations that think along those lines; what you're really implementing is a distributed hash table, to give you a term to throw at Google.
I am looking into mechanisms for better search capabilities against our database. Search is currently a huge bottleneck (causing long-running queries that hurt our database performance).
My boss wanted me to look into Solr, but on closer inspection, it seems we actually want some kind of DB integration mechanism with Lucene itself.
From the Lucene FAQ, they recommend Hibernate Search, Compass, and DBSight.
As a background of our current technology stack, we are using straight JSPs on Tomcat, no Hibernate, no other frameworks on top of it... just straight Java, JSP, and JDBC against a DB2 database.
Given that, it seems Hibernate Search might be a bit more difficult to integrate into our system, though it might be nice to have the option of using Hibernate after such an integration.
Does anyone have any experiences they can share with using one of these tools (or other similar Lucene based solutions) that might help in picking the right tool?
It needs to be a FOSS solution, and ideally it will manage updating Lucene with changes from the database automagically (though efficiently), without extra effort to notify the tool when changes have been made (otherwise, it seems rolling my own Lucene solution would be just as good). Also, we have multiple application servers with just one database (+failover), so it would be nice if the solution were easy to use from all application servers seamlessly.
I am continuing to inspect the options now, but it would be really helpful to utilize other people's experiences.
When you say "search against a DB", what do you mean?
Relational databases and information retrieval systems use very different approaches for good reason. What kind of data are you searching? What kind of queries do you perform?
If I were going to implement an inverted index on top of a database, as Compass does, I would not use their approach, which is to implement Lucene's Directory abstraction with BLOBs. Rather, I'd implement Lucene's IndexReader abstraction.
Relational databases are quite capable of maintaining indexes. The value that Lucene brings in this context is its analysis capabilities, which are most useful for unstructured text records. A good approach would leverage the strengths of each tool.
As updates are made to the index, Lucene creates more segments (additional files or BLOBs), which degrade performance until a costly "optimize" procedure is used. Most databases will amortize this cost over each index update, giving you more stable performance.
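To illustrate leveraging the strengths of each tool: keep the records in the database and build an ordinary Lucene index next to it from a JDBC ResultSet. This is a rough sketch against a recent Lucene (5+) API; the JDBC URL, table, and column names are made up:

    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("lucene-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, cfg);
                 Connection db = DriverManager.getConnection(
                         "jdbc:db2://host:50000/mydb", "user", "pass");
                 Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, body FROM documents")) {
                while (rs.next()) {
                    Document doc = new Document();
                    // Store the primary key so hits can be joined back to the DB.
                    doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                    // Analyze the text for full-text search; don't store it.
                    doc.add(new TextField("body", rs.getString("body"), Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }

A searcher would then look up matching "id" values in Lucene and fetch the full rows from DB2, keeping each system doing what it is good at.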
I have had good experiences with Compass. It has really good integration with Hibernate and can mirror data changes made through Hibernate and JDBC directly to the Lucene indexes through its GPS devices: http://www.compass-project.org/docs/1.2.2/reference/html/gps-jdbc.html.
Maintaining the Lucene indexes on all your application servers may be an issue. If you have multiple app servers updating the DB, you may hit issues keeping the index in sync with all the changes. Compass may have an alternate mechanism for handling this now.
The Alfresco project (CMS) also uses Lucene and has a mechanism for replicating Lucene index changes between servers that may be useful in handling these issues.
We started using Compass before Hibernate Search was really off the ground so I cannot offer any comparison with it.
LuSql (http://code.google.com/p/lusql/) allows you to load the contents of a JDBC-accessible database into Lucene, making it searchable. It is highly optimized and multi-threaded. I am the author of LuSql and will be coming out with a new version (re-architected with a new pluggable architecture) in the next month.
For a pure performance boost with searching Lucene will certainly help out a lot. Only index what you care about/need and you should be good. You could use Hibernate or some other piece if you like but I don't think it is required.
Well, it seems DBSight doesn't meet the FOSS requirement, so unless it is an absolutely stellar solution, it is not an option for me right now...