I am developing a web application where I want to use Solr for search only and keep my data in a separate database.
I will have two data stores: a relational one (SQL Server) and a copy of it in the NoSQL Solr store.
I'll be searching for specific fields in the Solr documents (e.g. by id, name and type, including join queries), i.e. NOT full-text search.
I know Solr's strength is full-text search, achieved by building an inverted index over the documents' data. What I want to know is: does it also help in my case, by building some other kind of index over my documents that makes ordinary field lookups faster than a SQL Server index?
Yes, it will help you.
You need to consider what your requirements and preferences are.
If you add Solr as an additional option for searching your application data, keep in mind that you will have to keep Solr constantly in sync with the database, and you will need additional infrastructure for it.
If performance is your main criterion and you don't want to put any search load on your RDBMS, then adding Solr to your system makes sense. Also consider how big your data in the RDBMS is, because RDBMS systems are strong enough to support searching on their own.
Weighing all of the above, you can make your decision.
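To make the use case concrete, here is a minimal sketch of a fielded (non-full-text) Solr query in Python, assuming a hypothetical core named "products" with indexed "id", "name" and "type" fields:

import requests

# Fielded lookup rather than full-text search: filter on exact field values.
params = {
    "q": "type:book AND name:solr",  # query specific fields, not free text
    "fl": "id,name,type",            # fields to return
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/products/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["name"])

Queries like this hit Solr's inverted index on each field, so exact-match lookups are fast; just remember that joins are very limited in Solr, so joined data is usually denormalized into the documents at index time.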
I'm used to working with MySQL, but for my next series of projects CouchDB (NoSQL) seems to be the way to go, basically to avoid EAV in MySQL and to embrace all the cool features it has to offer.
After a lot of investigation and documentation reading, there is one thing I don't quite understand.
Let's assume I host three web applications on my server and thus need three databases accordingly. For instance, one is a webshop with product and invoice tables, one is a weblog with article and comment tables, and another one is a web-based game with game-stats tables (a simplification, obviously).
So I host multiple sites on one installation of mysql, and each application I run on my server gets its own database with tables, fields and content.
Now, with CouchDB I want to do the exact same thing. The problem seems to be that creating a database in CouchDB is more like creating a table in MySQL; i.e. I would create databases called 'comments', 'articles', etc. for my weblog, and inside each I would create a document per article or a document per comment.
So my question is: how can I separate my data from multiple web applications on one CouchDB installation?
I think I am doing something fundamentally wrong here but hopefully one of you guys can help me get on the right track.
In CouchDB, there's no explicit need to separate unrelated data into multiple databases. If you've constructed your documents and views correctly, only relevant data will appear in your queries.
If you do decide to separate your data into separate databases, simply create a new database.
$ curl -X PUT http://localhost:5984/somedb
{"ok":true}
From my experience with CouchDB, separating unrelated data into different databases is very important for performance, and also a no-brainer. View generation is a painful part of CouchDB. Every time the database is updated, the views (think of them as indexes in a traditional relational SQL database) have to be regenerated, which involves iterating over every document in the database. So if you have, say, 2 million documents of type A and 300 documents of type B, and you need to regenerate a view that queries type B, then all 2,000,300 documents will be enumerated during view generation, and it will take a long time (it might even hit a read timeout).
Therefore, having multiple databases is a no-brainer when it comes to keeping views (the way you query in CouchDB, an obviously important and unavoidable feature) up to date.
@Zombies is absolutely right about performance. CouchDB isn't suited to operating over a large number of documents in a single database. If you need to work with, let's say, more than 5,000 documents, MongoDB will outperform CouchDB.
Views in CouchDB are essential but painful, with limited JavaScript options for building your queries (don't even think about document references or nested objects). Having multiple databases for different document types is very much the solution. Some people will say something like:
CouchDB is a NoSQL database, and as such you should not need to order or filter your documents using anything other than views. A NoSQL database's core feature is the ability to store schema-less documents [...]
And I find it very annoying when you need to find a workaround for performance and querying. You should not mind creating a few databases to separate your data; it will still be a 'single CouchDB installation'. Don't forget that CouchDB is suited to small databases. The smaller a database is, the faster your queries will be, and the better the performance.
EDIT
Some companies, like ArangoDB, have published comparisons between themselves, MongoDB and CouchDB, and they confirm what I said about the number of documents.
There are a lot of other resources on their website. That said, the statement above also comes from personal experience: I benchmarked them during my internship with a PHP benchmarking tool I found on the Internet.
I would like to be able to search a CouchDB database using Solr. Are there any projects that provide such an integration?
I am also aware of CouchDB-Lucene. Is there a way to hook Solr into that?
Thanks!
It would make more sense to roll your own, given how easy it is. First you need to decide what kind of SOLR schema to use and how to map your CouchDB documents onto that schema. Then simply iterate through all the documents in the database (see "Pagination in CouchDB?") and generate SOLR <add> documents.
People do this all the time, with all kinds of data sources. Since SOLR essentially searches a single table, the hard work is often figuring out how to map your database format onto that single table. Read up on what you can do with the SOLR schema, and you may be surprised at how easy this is.
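As a rough sketch of rolling your own, the Python loop below pulls every document out of CouchDB and posts SOLR <add> documents to the update handler. The database name, core name and field mapping are all hypothetical, and for large databases you would paginate _all_docs with startkey/limit as described in "Pagination in CouchDB?":

import requests
from xml.sax.saxutils import escape

couch = "http://localhost:5984/articles"
solr = "http://localhost:8983/solr/articles/update"

# Fetch all documents (paginate this for anything big)
rows = requests.get(couch + "/_all_docs",
                    params={"include_docs": "true"}).json()["rows"]

# Map each CouchDB document onto the flat SOLR schema
docs = []
for r in rows:
    doc = r["doc"]
    docs.append("<doc><field name='id'>%s</field>"
                "<field name='title'>%s</field></doc>"
                % (escape(doc["_id"]), escape(doc.get("title", ""))))

requests.post(solr, data=("<add>%s</add>" % "".join(docs)).encode("utf-8"),
              headers={"Content-Type": "text/xml"},
              params={"commit": "true"})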
There is a CouchDB integration for ElasticSearch available, apart from feeding ElasticSearch with JSON on your own. Both work with schema-less JSON, so it's very easy to integrate them.
In terms of features, ElasticSearch would offer a comparable set to Solr (in addition to some unique features, of course.)
According to this
http://wiki.apache.org/couchdb/Related_Projects
there was a CouchDB-Solr2 project (scroll down to the end), which is no longer maintained.
I was reading about App Engine on wikipedia and came across some GQL restrictions:
JOIN is not supported
can SELECT from at most one table at a time
can put at most 1 column in the WHERE clause
What are the advantages of these restrictions?
Are these restrictions common in other places where scalability is a priority?
The datastore that GQL talks to is:
not a relational database like MySQL or PostgreSQL
a column-oriented DBMS called BigTable
One reason to have a database like this is to get a very high-performance database that you can scale across hundreds of servers.
GQL is not SQL; it is SQL-like.
Here are some references:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
http://en.wikipedia.org/wiki/BigTable
http://code.google.com/appengine/docs/datastore/overview.html
http://code.google.com/appengine/docs/datastore/gqlreference.html
I believe the answer has to do with the underlying technology of the datastore rather than with any arbitrary restriction on what is available. Google isn't using a relational database under the hood, but BigTable; they have just added a nice API with SQL-like queries to flatten the learning curve for those who are used to relational databases. Those who are more used to ORMs will take to it like a duck to water.
The existing answers do a good job with the high-level question.
One additional note: the third restriction you mention isn't actually true. GQL queries can include as many columns in the WHERE clause as you like. There are a few caveats, but the number of columns is not explicitly limited. More:
http://code.google.com/appengine/docs/python/datastore/queries.html
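For illustration, a sketch using the (old) Python google.appengine.ext.db API; the Person model is made up. Two columns in the WHERE clause work fine; the main caveat is that inequality filters may only touch one property per query:

from google.appengine.ext import db

class Person(db.Model):
    first_name = db.StringProperty()
    age = db.IntegerProperty()

# Two columns in WHERE: allowed. A second inequality on another
# property (e.g. adding "AND height > :3") would be rejected.
q = db.GqlQuery(
    "SELECT * FROM Person WHERE first_name = :1 AND age >= :2",
    "Alice", 21)
for person in q:
    print("%s is %d" % (person.first_name, person.age))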
What other types of database systems are out there? I've recently come across CouchDB, which handles data in a non-relational way, and it got me thinking about what other models people are using.
So, I want to know what other types of data models are out there. (I'm not looking for specifics, I just want to see how other people handle data storage; my interest is purely academic.)
The ones I already know are:
RDBMS (MySQL, Postgres, etc.)
Document-based (CouchDB, Lotus Notes)
Key/value pairs (BerkeleyDB)
db4o
Quote from the "about" page:
db4o is the open source object database that enables Java and .NET developers to store and retrieve any application object with only one line of code, eliminating the need to predefine or maintain a separate, rigid data model.
Older non-relational databases:
Network Database
Hierarchical Database
Both mostly went out of style when relational became feasible.
Column-oriented databases are also a bit of a different animal. Many of them do support standard relational database SQL though. These are generally used for data warehouse type applications.
Semantic Web is also a non-relational data storage paradigm. There are no relations, all metadata is stored in the same way as data, and every entity has potentially its own unique set of attributes. Open-source projects that implement RDF, a Semantic Web standard, include Jena and Sesame.
Isn't Amazon's SimpleDB non-relational?
db4o, as mentioned by Eric, is an Object-Oriented database management system (OODBMS).
There are object-based databases (GemStone, for example). I am not sure how you would categorize Google's BigTable and Amazon's Simple Storage Service, but both are map-reduce based.
A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed user-preferences store that would be immune to shape changes, into which we could serialize preference objects from Java and access them just as easily with JavaScript from a XULRunner-based client application.
I'd like to expand on Bill Karwin's answer about the semantic web and triplestores, since that is what I am working on at the moment and I have something to say about it.
The idea behind a triplestore is to store a graph-based database whose data model is rooted in RDF. With RDF, you describe nodes and the associations among nodes (in other words, edges). Data is organized in triples:
start node ----relation----> end node
(in RDF speak: subject --predicate--> object). With this very simple data model, any data network can be represented by adding more and more triples, provided you give a meaning to nodes and relations.
RDF is very general, and its graph-based data model is well suited to searching for all triples with a particular combination of subject, predicate and object, in any combination. Eventually, through a query language called SPARQL, you can also perform more complex queries, an operation that boils down to a graph-isomorphism search over the graph, both in terms of topology and in terms of node/edge meaning (we'll see this in a moment). SPARQL allows you only SELECT (and similar) queries: no DELETE, no INSERT, no UPDATE. The information you query for (e.g. the specific nodes you are interested in) is mapped into a table, which is what you get as the result of your query.
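As a small illustration, here is a triple and a SPARQL SELECT run through the rdflib Python library; the namespace and data are invented:

from rdflib import Graph, Namespace, Literal

ns = Namespace("http://example.org/ns#")
g = Graph()
g.add((ns.MyFiat, ns.numberOfWheels, Literal(4)))

# Match a triple pattern; the variable bindings come back as table rows
results = g.query("""
    PREFIX ns: <http://example.org/ns#>
    SELECT ?thing ?wheels
    WHERE { ?thing ns:numberOfWheels ?wheels . }
""")
for thing, wheels in results:
    print(thing, wheels)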
Now, topology in itself does not mean a lot. For this, a schema language has been invented. Actually, more than one, and calling them schema languages is, in some cases, very limiting. The most famous and widely used today are RDF-Schema and OWL (Lite and Full), which descend from the now-obsolete DAML+OIL. The point of these languages is, boiling things down, to give a meaning to nodes (by granting them a type, itself described as a triple) and to relationships (edges). Also, you can define the "range" and "domain" of these relationships, or, put differently, what type the start node and the end node have: you can say, for example, that the property "numberOfWheels" can only be applied to connect a node of type Vehicle to a non-zero integer value.
ns:MyFiat --rdf:type--> ns:Vehicle
ns:MyFiat --ns:numberOfWheels--> 4
Now, you can use these ontologies in two directions: validation and inference. Validation is not that fancy today, but I've seen it used. Inference is what is cool today, because it allows reasoning. Inference basically takes an RDF graph containing a set of triples, takes an ontology, mixes them into a triplestore database that contains an "inference engine", and, like magic, the inference engine derives new triples according to your ontological description. Example: suppose you store just this information in the database
ns:MyFiat --ns:numberOfWheels--> 4
and nothing else. No type is specified for this node, but the inference engine will automatically add a triple saying that
ns:MyFiat --rdf:type--> ns:Vehicle
because you said in your ontology that only objects of type Vehicle can be described by the property numberOfWheels.
Conversely, you can use the inference engine to validate your data against the ontology and refuse non-compliant data (somewhat like XML Schema for XML). In this case, you would need both triples for your data to be accepted by the triplestore.
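The whole round trip can be reproduced with rdflib plus the owlrl package (which provides an RDFS inference engine); the namespace is invented:

from rdflib import Graph, Namespace, Literal, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

ns = Namespace("http://example.org/ns#")
g = Graph()
# Ontology: numberOfWheels may only describe Vehicles (rdfs:domain)
g.add((ns.numberOfWheels, RDFS.domain, ns.Vehicle))
# Data: one triple, with no type stated for ns:MyFiat
g.add((ns.MyFiat, ns.numberOfWheels, Literal(4)))

# The inference engine adds the missing rdf:type triple
DeductiveClosure(RDFS_Semantics).expand(g)
print((ns.MyFiat, RDF.type, ns.Vehicle) in g)  # True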
Additional characteristics of triplestores are formulas and context-aware storage. Formulas are statements (as usual, subject-predicate-object triples) that describe something hypothetical. I have never used formulas, so I won't go into the details of something I don't know. Context awareness basically means subgraphs: the problem with storing triples is that nothing tells you where they came from. Suppose you have two dealers that describe the price of the same component. One says the price is 5.99 and the other 4.99. If you just store both triples in a database, you no longer know who stated which price. There are two ways to solve this problem.
One is reification. Reification means storing additional triples that describe another triple. It's wasteful, and it makes life hell because you have to reify each and every triple you store. The alternative is context awareness. A context-aware store lets you box a bunch of triples into a container with a label on it (the context identifier). You can then use this identifier as the subject of additional statements, describing a whole bunch of triples in a single action.
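rdflib models context awareness as named graphs; here is a sketch of the two-dealers example (all URIs invented):

from rdflib import ConjunctiveGraph, Namespace, Literal, URIRef

ns = Namespace("http://example.org/ns#")
g = ConjunctiveGraph()

# Box each dealer's statements into its own context
dealer1 = g.get_context(URIRef("http://example.org/dealer/1"))
dealer2 = g.get_context(URIRef("http://example.org/dealer/2"))
dealer1.add((ns.Component, ns.price, Literal(5.99)))
dealer2.add((ns.Component, ns.price, Literal(4.99)))

# The context identifier tells you who stated which price
for s, p, o, ctx in g.quads((ns.Component, ns.price, None)):
    print(o, ctx.identifier)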
4. Navigational. Includes Tree/Hierarchy and Graph/Network.
File systems, the semantic web, XML, Object databases, CODASYL, and many others all fit into this category.
Those 4 are pretty much it.
There is also what is referred to as an "inverted index" or "inverted list" database. Software AG's Adabas product would be an example. As with hierarchical databases, these continue to be used in large corporate or university environments because of legacy considerations or due to a performance advantage in certain situations (typically high-end transactional applications).
There are BASE systems (Basically Available, Soft State, Eventually consistent) and they work well with simple data models holding vast volumes of data. Google's BigTable, Dojo's Persevere, Amazon's Dynamo, Facebook's Cassandra are some examples.
The illuminate Correlation Database is a new, revolutionary non-relational database. The Correlation Database Management System (CDBMS) is data-model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture, in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values (the data is 100% indexed). Queries are performed using natural language instead of SQL (NoSQL).
Learn more at: www.datainnovationsgroup.com