NOSQL "basic operations" - google-app-engine

If I wanted to define a minimum set of functionality for an investigative NOSQL implementation, which ones should I pick?
I like the Google AppEngine API and its simple methods (e.g. Get all objects of type X, get all objects with property Y, etc.) but I wonder whether there is a base layer of calls that can be considered "expected of all similar systems", a sort of (portable?) baseline API for NOSQL?
Is there an instructive FOSS minimalistic implementation of such simple system? I would like to study it and see how it creates indices, allocates storage blocks, performs queries, distributes saves, etc.
This is for educational and research purposes.

There's an enormous breadth of 'NoSQL' systems out there, and many of them have very little in common except the label of 'NoSQL' and the most basic of operations: insert, update/replace and delete. Pretty much every other attribute of a storage system - indexing, datatypes, data structure, allowable queries, and so forth - varies wildly from database to database.
It may be that over time standards emerge, but for now about the only thing you can be sure of about a NoSQL database is that it'll be capable of storing data and retrieving it again.
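
As a rough illustration of that lowest common denominator, here is a minimal in-memory sketch in Python. The class name `TinyStore` and its method names are hypothetical, not taken from any real product; the only thing it tries to capture is insert, update/replace, delete and retrieval by key.

```python
from typing import Any, Dict, List, Optional


class TinyStore:
    """Hypothetical in-memory store with only the operations most NoSQL systems share."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Insert and update/replace are the same operation in this sketch.
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        # Returns True only if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False

    def keys(self) -> List[str]:
        return sorted(self._data)


store = TinyStore()
store.put("user:1", {"name": "Ada"})
store.put("user:1", {"name": "Ada", "lang": "en"})  # replaces the earlier value
fetched = store.get("user:1")
removed = store.delete("user:1")
missing = store.get("user:1")
```

Anything beyond this (secondary indexes, queries by property, distribution) is where real systems start to diverge.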

Related

What does using a triplestore mean?

Does using a triplestore mean that we are going to use a database that has a table with 3 columns and 7 indexes?
I mean, is using a triple store always tied to the relational model?

From http://en.wikipedia.org/wiki/Triplestore:
A triplestore is a purpose-built database for the storage and retrieval of Resource Description Framework (RDF) metadata.
It looks like the high-performance triplestores use a non-relational model:
Some triplestores have been built as database engines from scratch, while others have been built on top of existing commercial relational database engines (i.e. SQL-based). Like the early development of OLAP databases, this intermediate approach allowed large and powerful database engines to be constructed for little programming effort in the initial phases of triplestore development. Long-term though it seems likely that native triplestores will have the advantage for performance. A difficulty with implementing triplestores over SQL is that although "triples" may thus be "stored", implementing efficient querying of a graph-based RDF model (i.e. mapping from SPARQL) onto SQL queries is difficult.

Not necessarily. There are triplestores that rely on other RDBMS systems as backends. Examples of this case are: Jena/SDB, 3store or Virtuoso.
Others implement their own native persistent model customized to respond better to the RDF data model, like 4store, Jena/TDB or BigData. These tend to scale better.

Conceptually, yes - a triple is a binary relation with a subject, predicate, and object (e.g. <JohnSmith--marriedTo->JillSmith>).
Higher arity relations are not possible in a triple store as they are in a normal RDBMS (though you can fake them via the use of multiple triples).
The implementation varies though, as previous answers state.
Most triple stores actually store quads, so they can group triples into subsets ("Named Graphs" in RDF-speak).
The indexes are of course optional, but usually present in some form - again often modified to accommodate quads.
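
To make the "triples plus indexes" idea concrete, here is a minimal in-memory sketch in Python. It is not how any of the stores named above is implemented; the class `TripleStore`, its three per-position indexes and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Hypothetical triple store: one set of triples plus an index per position."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()
        self._by_s: DefaultDict[str, Set[Triple]] = defaultdict(set)
        self._by_p: DefaultDict[str, Set[Triple]] = defaultdict(set)
        self._by_o: DefaultDict[str, Set[Triple]] = defaultdict(set)

    def add(self, s: str, p: str, o: str) -> None:
        triple = (s, p, o)
        self._triples.add(triple)
        self._by_s[s].add(triple)
        self._by_p[p].add(triple)
        self._by_o[o].add(triple)

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        # None acts as a wildcard. Start from the smallest candidate set
        # offered by the indexes, then filter on the remaining positions.
        candidates = [self._triples]
        if s is not None:
            candidates.append(self._by_s.get(s, set()))
        if p is not None:
            candidates.append(self._by_p.get(p, set()))
        if o is not None:
            candidates.append(self._by_o.get(o, set()))
        smallest = min(candidates, key=len)
        return sorted(
            t for t in smallest
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


ts = TripleStore()
ts.add("JohnSmith", "marriedTo", "JillSmith")
ts.add("JohnSmith", "livesIn", "London")
ts.add("JillSmith", "livesIn", "London")
londoners = ts.match(p="livesIn", o="London")
```

A quad store would add a fourth element (the named graph) to each tuple and, typically, further indexes that include it.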

What is the difference between graph-based databases and object-oriented databases?

What is the difference between graph-based databases (http://neo4j.org/) and object-oriented databases (http://www.db4o.com/)?

I'd answer this differently: object and graph databases operate on two different levels of abstraction.
An object database's main data elements are objects, the way we know them from an object-oriented programming language.
A graph database's main data elements are nodes and edges.
An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity etc. A graph database does not have the notion of a pointer that can be NULL. (Of course one can imagine hybrids.)
In terms of schema, an object database's schema is whatever the set of classes is in the application. A graph database's schema (whether implicit, by convention of what String labels mean, or explicit, by declaration as models as we do it in InfoGrid for example) is independent of the application. This makes it much simpler, for example, to write multiple applications against the same data using a graph database instead of an object database, because the schema is application-independent. On the other hand, using a graph database you can't simply take an arbitrary object and persist it.
Different tools for different jobs I would think.

Yes, the API seems like the major difference, but it is not really a superficial one. Conceptually a set of objects will form a graph and you could think of an API that treats this graph in a uniform way. Conversely, you could in theory mine a generic graph structure for patterns and map them to objects exposed via some API. But the design of the API of an actual product will generally have consequences for how data is actually stored and how it can be queried, so it would be far from trivial to, say, create a wrapper and make it look like something else. Also, an object-oriented database must offer some integrity guarantees and a typing structure that a graph database normally won't. In fact, serious OO databases are far from "free form" :)
Take a look at HyperGraphDB - it is both a full object-oriented database (like db4o) and a very advanced graph database both in terms of representational and querying capabilities. It is capable of storing generalized hypergraphs (where edges can point to more than one node and also to other edges as well), it has a fully extensible type system embedded as a graph etc.
Unlike other graph databases, in HyperGraphDB every object becomes a node or an edge in the graph, with none-to-minimal API intrusion and you have the choice of representing your objects as a graph or treating them in a way that is orthogonal to the graph structure (as "payload" values of your nodes or edges). You can do sophisticated traversals, customized indexing and querying.
For an explanation of why HyperGraphDB is in fact an ODBMS, see the blog post Is HyperGraphDB an OO Database? at Kobrix's website.

As Will describes from another angle, a graphdb will keep your data separated from your application classes and objects. A graphdb also has more built-in functionality to deal with graphs, obviously - like shortest path or deep traversals.
Another important difference is that in a graphdb like neo4j you can traverse the graph based on relationship (edge) types and directions without loading the full nodes (including node properties/attributes). There's also the choice of using neo4j as the backend of an object db while still being able to use all the graphy stuff; see jo4neo. Another project, neo4j.rb, takes a different approach that could also count as an object db on top of neo4j. A new option is to use Spring Data Graph, which gives graphdb support through annotations.
The same question was asked in the comments to this blogpost.

From a quick browse of both their websites:
The major difference is the way the APIs are structured, rather than the kind of free-form database you can build with them.
db4o uses an object mapping - you create a Java/C# class, and it uses reflection to persist it in the database.
neo4j has an explicit manipulation API.
Neo4j seemed, in my humble opinion, much nicer to interact with.
You might also consider a key-value store - you could make exactly the same free-form database with one of those.

The difference at the low level is not so huge. Both manage relationships as direct links without costly joins. Furthermore, both have a way to traverse relationships with the query language, but the graph database has operators to recurse to the Nth level.
But the biggest difference is in the domain: in a graph database everything is based on two types, vertices and edges, even if you can usually define your own types as a sort of subtype of Vertex or Edge.
In the ODBMS you have no Vertex and Edge concepts, unless you write your own.
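
The "recurse to the Nth level" operator can be pictured with a small Python sketch. This is not the API of any particular graph database; the `Graph` class, the `reachable` method and the edge type names are hypothetical, and the traversal is a plain breadth-first search limited by depth and edge type.

```python
from collections import defaultdict, deque
from typing import DefaultDict, List, Set, Tuple


class Graph:
    """Hypothetical property-free graph: vertices are names, edges carry a type."""

    def __init__(self) -> None:
        # vertex -> list of (edge_type, target_vertex)
        self._out: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self._out[source].append((edge_type, target))

    def reachable(self, start: str, edge_type: str, max_depth: int) -> Set[str]:
        """Vertices reachable from start via edges of one type, within max_depth hops."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            vertex, depth = frontier.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the requested level
            for etype, target in self._out.get(vertex, []):
                if etype == edge_type and target not in seen:
                    seen.add(target)
                    frontier.append((target, depth + 1))
        seen.discard(start)
        return seen


g = Graph()
g.add_edge("alice", "KNOWS", "bob")
g.add_edge("bob", "KNOWS", "carol")
g.add_edge("carol", "KNOWS", "dave")
g.add_edge("alice", "WORKS_WITH", "erin")
friends_of_friends = g.reachable("alice", "KNOWS", max_depth=2)
```

In an ODBMS you would have to write this kind of traversal yourself over object references; a graph database usually ships it as a query primitive.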

With graph databases, you have a slight semblance of a chance that it is based on mathematical graph theory. With Object-oriented databases, you have the certainty that it is based on nothing at all (and most certainly no mathematical theory at all).

Database alternatives?

I was wondering the trade-offs for using databases and what the other options were? Also, what problems are not well suited for databases?
I'm concerned with Relational Databases.

The concept of database is very broad. I will make some simplifications in what I present here.
For some tasks, the most common database is the relational database. It's a database based on the relational model. The relational model assumes that you describe your data in rows, belonging to tables where each table has a given and fixed number of columns. You submit data on a "per row" basis, meaning that you have to provide a row in a single shot containing the data relative to all columns of your table. Every submitted row normally gets an identifier which is unique at the table level, sometimes at the database level. You can create relationships between entities in the relational database, for example by saying that a given cell in your table must refer to another table's row, so to preserve the so called "referential integrity".
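
As a small illustration of rows, fixed columns and referential integrity, here is a sketch using Python's built-in `sqlite3` module. The `author` and `book` tables are made up for the example; note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Each table has a fixed set of columns; every row gets a table-level unique id.
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE book ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES author(id))"
)

# A row is submitted in one shot, with values for all required columns.
cur = conn.execute("INSERT INTO author (name) VALUES (?)", ("Codd",))
author_id = cur.lastrowid
conn.execute(
    "INSERT INTO book (title, author_id) VALUES (?, ?)",
    ("A Relational Model of Data", author_id),
)

# A book pointing at a non-existent author violates referential integrity.
try:
    conn.execute("INSERT INTO book (title, author_id) VALUES (?, ?)", ("Orphan", 999))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

book_count = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]
```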
This model works fine, but it's not the only one out there. In some cases, data are better organized as a tree. The filesystem is a hierarchical database: it starts at a root, and everything goes under this root in a tree-like structure. Another model is the key/value pair. Sleepycat BDB is basically a store of key/value entities.
LDAP is another database, with several advantages: it stores rather generic data, it's distributed by design, and it's optimized for reading.
Graph databases and triplestores allow you to store a graph and perform isomorphism search. This is typically needed if you have a very generic dataset that can encompass a broad level of description of your entities, so broad that is basically unknown. This is in clear opposition to the relational model, where you create your tables with a very precise set of columns, and you know what each column is going to contain.
Some relational column-based databases exist as well. Instead of submitting data by row, you submit them by whole column.
So, to answer your question: a database is a method to store data. Technically, even a text file is a database, although not a particularly nice one. The choice of the model behind your database depends mostly on the typical needs of your application.
Setting the answer as CW as I am probably saying something strictly not correct. Feel free to edit.

This is a rather broad question, but databases are well suited for managing relational data. Alternatives would almost always imply designing your own data storage and retrieval engine, which for most standard/small applications is not worth the effort.
A typical scenario that is not well suited for a database is the storage of large amounts of data organized as a relatively small number of logical files; in this case a simple filesystem-like system can be enough.

Don't forget to take a look at NOSQL databases. It's pretty new technology and well suited for stuff that doesn't fit/scale in a relational database.

Use a database if you have data to store and query.
Technically, most things are suited for databases. Computers are made to process data and databases are made to store them.
The only thing to consider is cost. Cost of deployment, cost of maintenance, time investment, but it will usually be worth it.
If you only need to store very simple data, flat files would be an alternative (text files).
Note: you used the generic term 'database', but there are many many different types and implementations of these.

For search applications, full-text search engines (some of which are integrated to traditional DBMSes, but some of which are not), can be a good alternative, allowing both more features (various linguistic awareness, ability to have semi-structured data, ranking...) as well as better performance.
Also, I've seen applications where configuration data is stored in the database, and while this makes sense in some cases, using plain text files (or YAML, XML and such) and loading the underlying objects during initialization, may be preferable, due to the self-contained nature of such alternative, and to the ease of modifying and replicating such files.
A flat log file can be a good alternative to logging to a DBMS, depending on usage of course.
That said, in the last 10 years or so, DBMSs in general have added many features to help them handle different forms of data and different search capabilities (e.g. the aforementioned full-text search, XML, smart storage/handling of BLOBs, powerful user-defined functions, etc.), which make them more versatile, and hence a fairly ubiquitous service. Their strength remains mainly with relational data, however.

What exactly is NoSQL?

What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know, MemCache is one such database system, am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.

I don't agree with the answers I'm seeing. Although it's true that NoSQL solutions tend to break the ACID rules, not all of them are created with that approach.
I think you should first define what a SQL solution is, and then you can put the "Not Only" in front of it; that will give a more accurate definition of what a NoSQL solution is.
With this approach in mind:
SQL databases are a way to group all the data stores that are accessible using Structured Query Language as the main (and most of the time only) way to communicate with them. This means the database must support the structures that are common to those systems, like "tables", "columns", "rows", "relationships", etc.
Now, put the "Not Only" in front of the last sentence and you will get a definition of what "NoSQL" means. NoSQL groups all the stores created as an attempt to solve problems that cannot fit into the table/column/row structures or even into SQL statements. In most cases these databases will not support relationships; they abandon the well-known structures simply because the problems have changed since those structures were conceived.
If you have a text file, and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
All of this means that there are several solutions that store information in ways traditional SQL systems do not allow, in order to achieve better performance, flexibility, etc. Every NoSQL provider tries to solve a different problem, and that's why you won't be able to compare two different solutions directly. For example:
djondb is a document store created to be used as a NoSQL enterprise solution supporting transactions, consistency, etc., but it sacrifices some of the performance of its counterparts.
MongoDB is a document store (similar to djondb) which achieves great performance but trades away some of the ACID properties to do so.
CouchDB is another document store which handles queries slightly differently, providing views to retrieve the information without doing a full query every time.
...
As you may have noticed, I only talked about document stores. That's because I wanted to show you that three different document store implementations take different approaches; therefore you should keep in mind the golden rule of NoSQL stores: "Use the right tool for the right job".
I'm the creator of djondb and I've been doing a lot of research even before trying to start my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see the information storage.

From wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc...

To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.

What is NoSQL?
NoSQL is an acronym for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve various problems with a variety of data types, whereas “blob” used to be the only data type in an RDBMS for storing unstructured data.
1 Dynamic Schema
NoSQL databases allow the schema to be flexible. New columns can be added at any time. Rows may or may not have values for those columns, and there is no strict enforcement of data types for columns. This flexibility is handy for developers, especially when they expect frequent changes during the course of the product life cycle.
2 Variety of Data
NoSQL databases support many types of data: structured, semi-structured and unstructured. Logs, image files, videos, graphs, JPEGs, JSON and XML can be stored and operated on as they are, without any pre-processing. This reduces the need for ETL (Extract – Transform – Load).
3 High Availability Cluster
NoSQL databases support distributed storage using commodity hardware. They also support high availability through horizontal scalability. This feature enables NoSQL databases to benefit from the elastic nature of cloud infrastructure services.
4 Open Source
Many NoSQL databases are open source software. The software is free to use, and most of them are free to use in commercial products. The open source codebase can be modified to meet business needs. There are minor variations among the open source software licenses, so users must be aware of the license agreements.
5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich APIs to perform DML and CRUD operations. These APIs are more developer friendly and are supported in a variety of programming languages.
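
To picture what "dynamic schema" plus an API instead of SQL means in practice, here is a toy Python sketch. The `Collection` class and its `insert`/`find` methods are hypothetical and only loosely modeled on how document stores tend to look; no real product's API is implied.

```python
from typing import Any, Dict, List


class Collection:
    """Hypothetical schemaless collection: each document is just a dict."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def insert(self, doc: Dict[str, Any]) -> None:
        # No schema check: documents may have different fields and value types.
        self._docs.append(dict(doc))

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        # A document matches when it has every requested field with the requested value.
        return [
            doc for doc in self._docs
            if all(key in doc and doc[key] == value for key, value in criteria.items())
        ]


users = Collection()
users.insert({"name": "Ada", "email": "ada@example.org"})
users.insert({"name": "Bob", "tags": ["admin"]})  # different fields, same collection
admins = users.find(tags=["admin"])
with_phone = users.find(phone="555-0100")  # field nobody has: no error, no results
```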

Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB

I used something called the Raima Data Manager more than a dozen years ago that qualifies as NoSQL. It calls itself a "Set Oriented Database". It's not based on tables, and there is no query "language", just a C API for asking for subsets.
It's fast and easier to work with in C/C++ than SQL: there's no building up strings to pass to a query interpreter, and the data comes back as an enumerable object rather than as an array. Variable-sized records are normal and don't waste space. I never saw the source code, but there were some hints in the interface that, internally, the code used pointers a lot.
I'm not sure that the product I used is even sold anymore, but the company is still around.

MongoDB looks interesting, SourceForge is now using it.
I listened to a podcast with a team member. The idea with NoSQL isn't so much to replace SQL as it is to provide a solution for problems that aren't solved well with traditional RDBMS. As mentioned elsewhere, they are faster and scale better at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document based system would work great.

Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM John! I work at Raima so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM

Database system that is not relational

What other types of database systems are out there? I recently came across couchDB, which handles data in a non-relational way. It got me thinking about what other models people are using.
So, I want to know what other types of data model are out there. (I'm not looking for any specifics, I just want to look at how other people are handling data storage; my interest is purely academic.)
The ones I already know are:
RDBMS (mysql,postgres etc..)
Document based approach (couchDB, lotus notes)
Key/value pair (BerkeleyDB)

db4o
Quote from the "about" page:
db4o is the open source object database that enables Java and .NET developers to store and retrieve any application object with only one line of code, eliminating the need to predefine or maintain a separate, rigid data model.

Older non-relational databases:
Network Database
Hierarchical Database
Both mostly went out of style when relational became feasible.

Column-oriented databases are also a bit of a different animal. Many of them do support standard relational database SQL though. These are generally used for data warehouse type applications.

Semantic Web is also a non-relational data storage paradigm. There are no relations, all metadata is stored in the same way as data, and every entity has potentially its own unique set of attributes. Open-source projects that implement RDF, a Semantic Web standard, include Jena and Sesame.

Isn't Amazon's SimpleDB non-relational?

db4o, as mentioned by Eric, is an Object-Oriented database management system (OODBMS).

There are object-based databases (GemStone, for example). I am not sure how you would categorize Google's BigTable and Amazon's Simple Storage, but both are map-reduce based.

A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed access user preferences store that would be immune to shape changes to which we could serialize preference objects from Java and access those just as easily with Javascript from a XULRunner based client application.

I'd like to detail more on Bill Karwin's answer about semantic web and triplestores, since it's what I am working on at the moment, and I have something to say on it.
The idea behind a triplestore is to store a graph-based database whose data model is rooted in RDF. With RDF, you describe nodes and associations among nodes (in other words, edges). Data is organized in triples:
start node ----relation----> end node
(in RDF-speak: subject --predicate--> object). With this very simple data model, any data network can be represented by adding more and more triples, provided you give a meaning to nodes and relations.
RDF is very general, and it's a graph-based data model well suited for searches that look for all triples matching a particular subject, predicate, or object, in any combination. Beyond that, through a query language called SPARQL, you can also perform more complex queries, an operation that boils down to a graph isomorphism search on the graph, both in terms of topology and in terms of node-edge meaning (we'll see this in a moment). SPARQL, as originally specified, allows only SELECT (and similar) queries: no DELETE, no INSERT, no UPDATE (update operations were added later, in SPARQL 1.1 Update). The information you query (e.g. the specific nodes you are interested in) is mapped into a table, which is what you get as the result of your query.
Now, topology in itself does not mean a lot. For this, a schema language has been invented. Actually, more than one, and calling them schema languages is, in some cases, very limiting. The most famous and widely used today are RDF-Schema and OWL (Lite and Full), which descend from the now obsolete DAML+OIL. The point of these languages is, boiled down, to give a meaning to nodes (by granting them a type, also described as a triple) and to relationships (edges). Also, you can define the "range" and "domain" of these relationships, or said differently, what type the start node is and what type the end node is: you can say, for example, that the property "numberOfWheels" can be applied only to connect a node of type Vehicle to a non-zero integer value.
ns:MyFiat --rdf:type--> ns:Vehicle
ns:MyFiat --ns:numberOfWheels--> 4
Now, you can use these ontologies in two directions: validation and inference. Validation is not that fancy today, but I've seen instances of its use. Inference is what is cool today, because it allows reasoning. Inference basically takes an RDF graph containing a set of triples, takes an ontology, mixes them into a triplestore database which contains an "inference engine", and like magic the inference engine invents triples according to your ontological description. Example: suppose you just store this information in the database
ns:MyFiat --ns:numberOfWheels--> 4
and nothing else. No type is specified about this node, but the inference engine will add automatically a triple saying that
ns:MyFiat --rdf:type--> ns:Vehicle
because you said in your ontology that only objects of type Vehicle can be described by a property numberOfWheels.
Conversely, you can use the inference engine to validate your data against the ontology, so as to reject non-compliant data (sort of like XML-Schema for XML). In this case, you will need both triples to have your data successfully accepted by the triplestore.
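
The two uses of the ontology can be sketched in a few lines of Python. This is a toy under strong assumptions, not how a real inference engine works: the "ontology" is reduced to a single hypothetical `DOMAINS` table mapping a property to the type its subject must have, and the function names `infer_types` and `validate` are made up.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical ontology fragment: property -> type required of the subject (its domain).
DOMAINS: Dict[str, str] = {"ns:numberOfWheels": "ns:Vehicle"}


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Inference: add the rdf:type triple implied by each property's domain."""
    inferred = set(triples)
    for subject, predicate, _ in triples:
        if predicate in DOMAINS:
            inferred.add((subject, "rdf:type", DOMAINS[predicate]))
    return inferred


def validate(triples: Set[Triple]) -> Set[Triple]:
    """Validation: return triples whose subject lacks the explicitly stated required type."""
    violations = set()
    for subject, predicate, obj in triples:
        if predicate in DOMAINS and (subject, "rdf:type", DOMAINS[predicate]) not in triples:
            violations.add((subject, predicate, obj))
    return violations


data = {("ns:MyFiat", "ns:numberOfWheels", "4")}
with_inferred = infer_types(data)   # the engine "invents" the type triple
rejected = validate(data)           # used as a validator, the same data is refused
accepted = validate(with_inferred)  # with both triples present, nothing is refused
```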
Additional characteristics of triplestores are Formulas and Context-aware storage. Formulas are statements (as usual, triples of subject, predicate, object) that describe something hypothetical. I never used Formulas, so I won't go into more detail about something I don't know. Contexts are basically subgraphs: the problem with storing triples is that you don't have anything to say where these triples come from. Suppose you have two dealers that each state a price for the same component. One says that the price is 5.99 and the other 4.99. If you just store both triples in a database, you no longer know who stated what. There are two ways to solve this problem.
One is reification. Reification means that you store additional triples to describe another triple. It's wasteful, and makes life hell because you have to reify each and every triple you store. The alternative is context-awareness. Having context-aware storage is like being able to box a bunch of triples into a container with a label on it (the context identifier). You can now use this identifier as the subject for additional statements, hence describing a bunch of triples in a single action.
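
The dealer example can be sketched with quads, i.e. triples that carry a context identifier as a fourth element. The `QuadStore` class and all identifiers (`ctx:dealerA`, `ex:price`, and so on) are hypothetical names chosen for this illustration.

```python
from typing import List, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)


class QuadStore:
    """Hypothetical context-aware store: every triple lives in a labeled context."""

    def __init__(self) -> None:
        self._quads: Set[Quad] = set()

    def add(self, s: str, p: str, o: str, context: str) -> None:
        self._quads.add((s, p, o, context))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None, context: Optional[str] = None) -> List[Quad]:
        # None acts as a wildcard for that position.
        pattern = (s, p, o, context)
        return sorted(
            quad for quad in self._quads
            if all(want is None or want == got for want, got in zip(pattern, quad))
        )


store = QuadStore()
# Each dealer's statements go into that dealer's own context.
store.add("ex:Widget", "ex:price", "5.99", "ctx:dealerA")
store.add("ex:Widget", "ex:price", "4.99", "ctx:dealerB")
# The context identifier is itself used as the subject of further statements.
store.add("ctx:dealerA", "ex:statedBy", "ex:DealerA", "ctx:meta")
store.add("ctx:dealerB", "ex:statedBy", "ex:DealerB", "ctx:meta")

prices = store.match(s="ex:Widget", p="ex:price")
dealer_a_only = store.match(context="ctx:dealerA")
```

One statement about `ctx:dealerA` describes every triple in that context, which is what reification would otherwise have to repeat per triple.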

4. Navigational. Includes Tree/Hierarchy and Graph/Network.
File systems, the semantic web, XML, Object databases, CODASYL, and many others all fit into this category.
Those 4 are pretty much it.

There is also what is referred to as an "inverted index" or "inverted list" database. Software AG's Adabas product would be an example. As with hierarchical databases, these continue to be used in large corporate or university environments because of legacy considerations or due to a performance advantage in certain situations (typically high-end transactional applications).

There are BASE systems (Basically Available, Soft State, Eventually consistent) and they work well with simple data models holding vast volumes of data. Google's BigTable, Dojo's Persevere, Amazon's Dynamo, Facebook's Cassandra are some examples.

The illuminate Correlation Database is a new non-relational database. The Correlation Database Management System (CDBMS) is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values (data is 100% indexed). Queries are performed using natural language instead of SQL (NoSQL).
Learn more at: www.datainnovationsgroup.com
