Architecture for database analytics

We have an architecture where we provide each customer with Business Intelligence-like services for their website (internet merchant). Now I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and it is potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compare weeks, and so on. That is on the order of billions of entries, if not more.
The way it is currently done is quite standard: daily scripts that scan the databases and generate big CSV files. I don't like this solution for several reasons:
as is typical with these kinds of scripts, they fall into the write-once-and-never-touched-again category
tracking things in "real-time" is necessary (we currently have a separate toolset to query the last few hours)
this is slow and not very "agile"
Although I have some experience dealing with huge datasets for scientific use, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the application database), but I would like to know what other options are available for this kind of problem.

You will want to google Star Schema. The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance will be made up of facts and dimensions.
In the example below, sales 'facts' are modeled to provide analytics based on customer, store, product, time and other 'dimensions'.
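As a rough sketch of what such a model can look like (table and column names here are invented for illustration, not taken from Adventure Works):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Dimensions: the descriptive attributes you slice and group by.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE dim_store    (store_id    INTEGER PRIMARY KEY, city TEXT, region TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, sku  TEXT, category TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

    -- Fact table: one row per sale, with foreign keys into every dimension,
    -- plus the numeric measures that get aggregated.
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        store_id    INTEGER REFERENCES dim_store(store_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        quantity    INTEGER,
        amount      REAL
    );
    """)

Queries then join the fact table to whichever dimensions the question needs and aggregate the measures.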
You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.

There are specialized databases for analytics, like Greenplum, Aster Data, Vertica, Netezza, Infobright and others. You can read about these databases at http://www.dbms2.com/

The canonical handbook on star-schema-style data warehouses is Ralph Kimball's "The Data Warehouse Toolkit" (there's also "Clickstream Data Warehousing" in the same series, but that is from around 2002 and somewhat dated; a newer edition of the Kimball book may serve you better). If you google "web analytics data warehouse" there are a bunch of sample schemas available to download and study.
On the other hand, a lot of the NoSQL work that happens in real life is based around mining clickstream data, so it might be worth seeing what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies, to see if your use case matches well with what they can do.

Related

How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related?

I'm trying to understand what OLAP, OLTP, data mining, analytics, etc. are about, and I feel like my understanding of some of these concepts is still a bit vague. Information about these subjects tends to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research, I understand the following about these concepts; is it correct?
Analysis is decomposing something complex to understand its inner workings better.
Analytics is predictive analysis of information, requiring a lot of math and statistics.
There are many types of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams and are therefore easier to update, because they are in normalized form.
In contrast, OLAP uses denormalized star schemas and is therefore easier to query.
OLAP is used for predictive analysis, and OLTP is usually used in more practical situations, since there is no redundancy.
A data warehouse is a type of OLAP database, and usually consists of multiple other databases.
Data mining is a tool used in analytics, where you use computer software to find relationships in data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain it from the top of the pyramid:
Business Intelligence (which you didn't mention) is the term in IT for a complex of systems that produces useful information about a company from its data.
So, a BI system has a target: clean, accurate and meaningful information.
Clean means there are no technical problems (missing keys, incomplete data, etc.). Accurate means exactly that; BI systems are also used as fault checkers for the production database (logical faults, e.g. an invoice amount is too high, or an inactive partner is used). This is accomplished with rules. Meaningful is hard to explain, but in plain English it means all your data (even the Excel table from the last meeting), presented the way you want.
So, a BI system has a back end: the data warehouse.
A DWH is nothing other than a database (an instance, not software). It can be stored in an RDBMS, an analytical database (columnar or document-store types), or a NoSQL database.
Data warehouse is the term usually used for the whole database described above. It can consist of a number of data marts (if the Kimball model is used, which is more common), or be a relational system in third normal form (the Inmon model), called an enterprise data warehouse.
Data marts are related tables inside the DWH (star schema or snowflake schema): a fact table (a business process in denormalized form) plus dimension tables.
Each data mart represents one business process. Example: a DWH has 3 data marts. One is retail sales, the second is export, and the third is import. In retail sales you can see total sales, quantity sold, import price and profit (measures) by SKU, date, store, city, etc. (dimensions).
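As an illustration of "measures by dimensions", a query against such a retail-sales mart might look like this (all table and column names are invented for the example):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_date  (date_id  INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE fact_retail_sales (store_id INTEGER, date_id INTEGER,
                                    qty_sold INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'Zagreb'), (2, 'Split');
    INSERT INTO dim_date  VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_retail_sales VALUES (1, 1, 3, 30.0), (1, 2, 1, 10.0), (2, 1, 5, 50.0);
    """)

    # Measures (SUMs of amount and quantity) grouped by dimensions (city, month).
    for row in con.execute("""
        SELECT s.city, d.month, SUM(f.amount) AS total_sales, SUM(f.qty_sold) AS qty
        FROM fact_retail_sales f
        JOIN dim_store s ON s.store_id = f.store_id
        JOIN dim_date  d ON d.date_id  = f.date_id
        GROUP BY s.city, d.month"""):
        print(row)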
Loading data into the DWH is called ETL (extract, transform, load):
Extract data from multiple sources (ERP DB, CRM DB, Excel files, web services...)
Transform data (clean it, combine data from different sources, match keys, mine data)
Load data (load the transformed data into the specific data marts)
edit because of comment: The ETL process is usually built with an ETL tool, or manually with a programming language (Python, C#, etc.) and APIs.
The ETL process is a group of SQL statements, procedures, scripts and rules, related and separated into the 3 parts above, controlled by metadata.
It is either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
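A toy sketch of the three steps in Python (the source file, the rule, and the target table are all invented for illustration):

    import csv, sqlite3

    def extract(path):
        # Extract: read raw rows from one source (a CSV here; could be an ERP DB, an API...).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: clean, cast and filter; drop rows that break a rule.
        out = []
        for r in rows:
            if not r.get("customer_id"):      # rule: the key must be present
                continue
            r["amount"] = float(r["amount"])  # cast text to a number
            out.append(r)
        return out

    def load(rows, con):
        # Load: bulk-insert the cleaned rows into the target data mart.
        con.executemany("INSERT INTO fact_sales VALUES (:customer_id, :amount)", rows)

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact_sales (customer_id TEXT, amount REAL)")
    # load(transform(extract("sales_export.csv")), con)  # "sales_export.csv" is hypothetical

A real pipeline adds logging, key matching against dimension tables, and scheduling, but the three-part shape stays the same.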
OLTP and OLAP are types of data processing. OLTP is used for transactional purposes, between a database and application software (usually a single input/output path for data).
OLAP is for analytical purposes, and that means multiple sources, historical data, high SELECT query performance, and mined data.
edit because of comment: Data processing is the way data is stored in and accessed from a database. So, depending on your needs, the database is set up in a different way.
(Image from http://datawarehouse4u.info/.)
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight into a business process, or even a forecast.
Analysis, in the BI world, refers to the process: the ease of getting the information you ask for out of the data. Multidimensional analysis describes how the system slices your data (with dimensions inside a cube). Wikipedia says that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics refers to the result of the analysis process.
Don't fuss too much over those two words.
I can tell you about data mining, as I had a project on it. Data mining is not a tool; it is a method of mining data, and the different tools used for data mining are WEKA, RapidMiner, etc. Data mining uses many algorithms which are built into tools like Weka and RapidMiner, such as clustering algorithms, association algorithms, etc.
A simple example I can give you of data mining: a teacher teaches a science subject in class using different teaching methods, such as the chalkboard, presentations and practicals. Now our aim is to find which method suits the students. So we run a survey and take the students' opinions: 40 students like the chalkboard, 30 like presentations and 20 like the practical method. With the help of this data we can make a rule, for example: the science subject should be taught by the chalkboard method.
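The "rule" here is just the option with the most support; as a trivial sketch:

    # Survey results from the example: votes per teaching method.
    votes = {"chalkboard": 40, "presentation": 30, "practical": 20}

    # The mined "rule" is simply the method with the most support.
    best = max(votes, key=votes.get)
    print(f"Rule: teach science by the {best} method "
          f"({votes[best]} of {sum(votes.values())} students prefer it)")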
To learn the different algorithms, you can use Google :D.

Hadoop vs. Teradata: what is the difference?

I've worked with Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. Going by their descriptions, the two seem quite interchangeable, but some papers say they serve different purposes. All I have found is vague, and I am confused.
Does anybody have experience with both of them? What is the essential difference between them?
A simple example: I want to build an ETL pipeline which will transform billions of rows of raw data and organize them into a DWH, then run some resource-expensive analysis on them. Why use Teradata? Why Hadoop? Or why not?
I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes?' does quite a good job of describing the situations in which each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
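To make "parallel transformations to sanitize incoming data" concrete, here is a toy map/reduce in plain Python; it only mimics the shape of the computation, with no Hadoop involved:

    from collections import defaultdict

    raw_lines = [
        "2024-01-01 /home 200",
        "garbage line",
        "2024-01-01 /home 200",
        "2024-01-02 /cart 500",
    ]

    def map_phase(line):
        # Map: parse and sanitize one record; emit (key, value) pairs or nothing.
        parts = line.split()
        if len(parts) != 3:
            return []                     # drop malformed input
        date, path, _status = parts
        return [((date, path), 1)]

    def reduce_phase(pairs):
        # Reduce: aggregate all values that share a key.
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    pairs = [kv for line in raw_lines for kv in map_phase(line)]
    print(reduce_phase(pairs))  # {('2024-01-01', '/home'): 2, ('2024-01-02', '/cart'): 1}

In Hadoop, the map phase runs in parallel across the cluster over shards of the input; the shape of the computation is the same.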
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course Introduction to Data Science there is a lecture titled Comparing MapReduce and Databases, as well as a lecture on parallel databases, within the MapReduce section of the course.
Here is a summary from these lectures comparing MapReduce with RDBMSs (not necessarily parallel RDBMSs).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will put in parentheses the MapReduce extensions that add some of these functionalities/properties.
Some functionality/properties that RDBMSs have but native MapReduce does not:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DyradLINQ, Hadapt)
Logical Data Independence
Indexing (Hbase)
Algebraic Optimization (Pig, Dryad, HIVE)
Caching/Materialized Views
ACID/Transactions
Things MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
I've been asked this question several times; the answer I usually give is a car analogy (which is pretty silly, because I'm not a car person, but it seems to work).
Teradata is the car/DBMS for the masses: it is reliable, mature, works well, and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/DBMS for the enthusiast: it isn't as reliable or mature, but it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch, or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturers product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are doing country or city driving, bolt on a turbocharger, and/or your family complains about how long you spend in the garage on weekends - Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community fixing bugs and making improvements on a consistent basis. Hadoop's storage layer, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyse their high volumes of data, in real time as well as passively.
For your question about ETL systems, see these slides.
OK, now: why Hadoop?
Open source
A proven storage and analysis model for large quantities of data
Minimal hardware requirements to set up and run
OK, now: why Teradata?
Commercial support

Database selection for a web-scale analytics application

I want to build a web application similar to Google Analytics, in which I collect statistics on my customers' end users and show my customers analyses based on that data.
Characteristics:
High scalability, handle very large volume
Compartmentalized - Queries always run on a single customer's data
Support analytical queries (drill-down, slices, etc.)
Due to the analytical need, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. A NoSQL database? Would a simple RDBMS do?
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (more generally, the whole set of Pentaho BI suite components)
PostgreSQL: I am not going to describe PostgreSQL; it is a really strong open-source RDBMS that will certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and provides a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads rather than transaction processing. This is my data warehouse database.
Mondrian: Mondrian is an open-source R-OLAP cube engine. LucidDB made it easy to connect those two programs together.
I would recommend you look at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.
Hope I could help,
There are two main architectures you could opt for at true web scale:
1. "BI" architecture
Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
(Optional) Event journaller or immutable event store feeds
NoSQL database (e.g. Cassandra, Riak, HBase) feeds
A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to batch your analytics events and do bulk updates to your database (even with something like HDFS), rather than doing an atomic write for every single page view etc.
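A rough sketch of that batching pattern (the buffer size and the sink are arbitrary placeholders):

    class EventBatcher:
        """Buffer analytics events and flush them to the store in bulk."""
        def __init__(self, sink, batch_size=1000):
            self.sink, self.batch_size, self.buffer = sink, batch_size, []

        def record(self, event):
            self.buffer.append(event)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if self.buffer:
                self.sink(self.buffer)   # one bulk write instead of N atomic ones
                self.buffer = []

    batcher = EventBatcher(sink=lambda batch: print(f"bulk load of {len(batch)} events"),
                           batch_size=3)
    for i in range(7):
        batcher.record({"page_view": i})
    batcher.flush()                      # flush the tail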
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture you can always shard by customer if the volumes grow truly epic (billions of rows per customer), because there is (I'm guessing) no need for cross-customer analytics.
I'd say that having OLAP analysis in place is always nice, and it then has great potential for sophisticated data analysis using MDX.
What do you mean by large volume?
Where is your customers' user information stored?
What kind of front end and reporting are you going to use?
Cheers.
Disclaimer: I'll make some publicity for my own solution - have a look at www.icCube.com and contact me for more details.

What exactly is NoSQL?

What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know, memcached is one such system, am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.
I don't agree with the answers I'm seeing. Although it's true that NoSQL solutions tend to break the ACID rules, not all of them are created from that approach.
I think you should first define what a SQL solution is, and then you can put the "Not Only" in front of it; that will be a more accurate definition of what a NoSQL solution is.
With this approach in mind:
SQL databases are a way to group all the data stores that are accessible using Structured Query Language as the main (and most of the time only) way to communicate with them; this means the database must support the structures that are common to those systems, like "tables", "columns", "rows", "relationships", etc.
Now, put the "Not Only" in front of the last sentence and you get a definition of "NoSQL". NoSQL groups all the stores created as an attempt to solve problems that cannot fit into the table/column/row structure, or even into SQL statements. In most cases these databases do not support relationships; they abandon the well-known structures because the problems have changed since those structures were conceived.
If you have a text file and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
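Taken literally, that is only a handful of lines; a deliberately naive sketch:

    import json

    class FileStore:
        """A text file plus a store/retrieve API: arguably the smallest 'NoSQL database'."""
        def __init__(self, path):
            self.path = path

        def put(self, key, value):
            data = self._read()
            data[key] = value
            with open(self.path, "w") as f:
                json.dump(data, f)

        def get(self, key):
            return self._read().get(key)

        def _read(self):
            try:
                with open(self.path) as f:
                    return json.load(f)
            except FileNotFoundError:
                return {}

    db = FileStore("tiny.db")
    db.put("user:1", {"name": "Ada"})
    print(db.get("user:1"))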
All of this means that there are several solutions for storing information in ways that let you achieve better performance, flexibility, etc. than traditional SQL systems allow. Every NoSQL provider tries to solve a different problem, which is why you can't compare two different solutions directly. For example:
djondb is a document store created to be a NoSQL enterprise solution, supporting transactions, consistency, etc., but sacrificing some of the performance of its counterparts.
MongoDB is a document store (similar to djondb) which achieves great performance, but trades away some of the ACID properties to do so.
CouchDB is another document store, which handles queries slightly differently, providing views to retrieve information without doing a full query every time.
...
As you may have noticed, I only talked about document stores; that's because I wanted to show you that three different document-store implementations take different approaches, so you should keep in mind the golden rule of NoSQL stores: "Use the right tool for the right job".
I'm the creator of djondb, and I did a lot of research even before starting my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see information storage.
From wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc.
To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.
What is NoSQL ?
NoSQL is the acronym for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed, and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve various problems with a variety of data types, where "blob" used to be the only data type in an RDBMS for storing unstructured data.
1 Dynamic Schema
NoSQL databases allow the schema to be flexible: new columns can be added at any time, rows may or may not have values for those columns, and there is no strict enforcement of data types for columns (see the sketch after this list). This flexibility is handy for developers, especially when they expect frequent changes during the course of the product life cycle.
2 Variety of Data
NoSQL databases support any type of data: structured, semi-structured and unstructured. They can store and operate on logs, image files, videos, graphs, JPEGs, JSON and XML as-is, without any pre-processing, which reduces the need for ETL (extract, transform, load).
3 High Availability Cluster
NoSQL databases support distributed storage on commodity hardware. They also support high availability through horizontal scalability. These features let NoSQL databases benefit from the elastic nature of cloud infrastructure services.
4 Open Source
NoSQL databases are open-source software. Use of the software is free, and most of them are free to use in commercial products. The open-source codebase can be modified to solve business needs. There are minor variations among open-source software licenses, so users must be aware of the license agreements.
5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich API interfaces to perform DML and CRUD operations. These APIs are more developer-friendly and are supported in a variety of programming languages.
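The dynamic-schema point (1 above) sketched in code, with plain dicts standing in for documents:

    # Schemaless documents: records in the same collection need not share fields.
    users = [
        {"_id": 1, "name": "Ada"},
        {"_id": 2, "name": "Grace", "email": "g@example.com"},    # new field, no migration
        {"_id": 3, "name": "Edsger", "langs": ["ALGOL", "THE"]},  # a different shape again
    ]

    # Queries must tolerate missing fields instead of relying on a fixed schema.
    with_email = [u for u in users if "email" in u]
    print(with_email)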
Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
I used something called the Raima Data Manager more than a dozen years ago that qualifies as NoSQL. It calls itself a "set-oriented database". It's not based on tables, and there is no query "language", just a C API for asking for subsets.
It's fast, and easier to work with from C/C++ than SQL: there's no building up strings to pass to a query interpreter, and the data comes back as an enumerable object rather than as an array. Variable-sized records are normal and don't waste space. I never saw the source code, but there were some hints at the interface that, internally, the code used pointers a lot.
I'm not sure the product I used is even sold anymore, but the company is still around.
MongoDB looks interesting; SourceForge is now using it.
I listened to a podcast with a team member. The idea of NoSQL isn't so much to replace SQL as to provide a solution for problems that aren't solved well with traditional RDBMSs. As mentioned elsewhere, NoSQL databases are faster and scale better, at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document-based system would work great.
Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM, John! I work at Raima, so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM

What is an example of a non-relational database? Where/how are they used?

I have been working with relational databases for some time, but it only recently occurred to me that there must be other types of databases that are non-relational.
What are some examples of non-relational databases, and where/how are they used in the real world? Why would you choose to use a non-relational database over relational databases?
Edit: Two other similar questions have been mentioned in the answers:
Database system that is not relational.
Good reasons NOT to use a relational database?
An admittedly obscure but interesting alternative to the types of databases mentioned here is the associative database, such as Sentences from LazySoft Technology. There is a free personal version you can download and try on your own. The Enterprise Edition is also free, but requires a request to the company.
Essentially, an associative database allows you to store information in much the same way our brains do: as things, and associations between those things. The name "Sentences" comes from the way this information can be represented in a subject-verb-object syntax:
Tom is brother to Laura
San Francisco is located in California
Mike has a credit limit of $10,000
A sentence may be the subject or object of another sentence:
(Bus 570 arrives at 8:15am) on Sundays
Mary says (the pie was baked by William)
So, everything can be boiled down to entities and associations.
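One minimal way to picture the subject-verb-object idea (this says nothing about how Sentences is actually implemented):

    # Each fact is a (subject, verb, object) triple; a triple can itself be a subject.
    facts = set()

    def assert_fact(subject, verb, obj):
        triple = (subject, verb, obj)
        facts.add(triple)
        return triple            # so a fact can be nested inside another fact

    arrives = assert_fact("Bus 570", "arrives at", "8:15am")
    assert_fact(arrives, "on", "Sundays")
    assert_fact("Tom", "is brother to", "Laura")

    # Query: everything we know about Tom.
    print([f for f in facts if f[0] == "Tom"])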
There is, of course, much more to Sentences than can be expressed here. I recommend that you take some time to read more about it in a white paper from LazySoft.
"The Associative Model of Data" is a book available in PDF format by Simon Williams, one of the creators of Sentences.
Flat file
CSV or other delimited data
spreadsheets
/etc/passwd
mbox mail files
Hierarchical
Windows Registry
Subversion using the filesystem backend, FSFS, instead of Berkeley DB
A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed user-preferences store that would be immune to shape changes, to which we could serialize preference objects from Java, and access them just as easily with JavaScript from a XULRunner-based client application.
Any database that claims to be a "Berkeley-style database" or "key/value" database is not relational.
These databases are usually based on complex hashing algorithms and provide very fast O(1) lookup by key, but leave any form of relational goodness to the end user.
For example, in a relational database, you would normalize your structure and join many tables together to create a single result set.
In a key/value database, you would denormalize as much as possible and then use a unique key to look up data.
If you need to pull data from two sources, you have to join the resulting sets together by hand.
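Sketched in code, the contrast looks roughly like this (toy dicts standing in for a key/value store):

    # Key/value style: denormalize so one key fetch returns everything (O(1) lookup).
    kv = {
        "order:42": {
            "customer": {"id": 7, "name": "Ada"},   # customer data duplicated into the order
            "items": [{"sku": "X1", "qty": 2}],
        }
    }
    print(kv["order:42"]["customer"]["name"])        # single lookup, no join

    # Two sources: the "join" happens by hand in application code.
    orders    = {"42": {"customer_id": 7}}
    customers = {7: {"name": "Ada"}}
    order = orders["42"]
    print(customers[order["customer_id"]]["name"])   # manual two-step join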
All databases were originally non-relational; it was only with the arrival of DB2 and Oracle in the mid-1980s that relational databases became common. Before that, most databases were either flat files or hierarchical.
Flat files are inherently boring, but hierarchical databases are much less so, particularly as DB2 was actually implemented on top of a hierarchical store (namely VSAM) in the first instance. VSAM is, I believe, still around on mainframe systems and is of considerable importance.
DB/1 (so obscure now that I can't even find a Wikipedia link for it) was IBM's prime-time predecessor to DB2 (hence the name). It was hierarchical: basically you had a file which consisted of any number of 'root' records, generally directly accessible by a key. Each root record could then have any number of child records off it, each of which could in turn have its own children. The net effect is an index file of root records, with each root being the top of a potential tree-like structure. Accessing the child records could be tricky; there were limitations on direct access, so usually you ended up traversing the tree looking for the record you needed. A 'database' could have any number of these files in it, usually related by keys.
This had major disadvantages, not least that actually doing anything demanded that a full program be written: basically the equivalent of a day's work for what we can now do in SQL in a few minutes. However, it really did score on execution speed; in those days a mainframe had about the processing power of your iPhone (albeit optimized for data I/O), and poor DB2 queries could kill a multi-million-dollar installation dead. This was never an issue with DB/1, and in a world where programmers were less expensive than CPU time, it made sense.
Google App Engine Datastore :
The App Engine datastore is not a relational database. While the datastore interface has many of the same features of traditional databases, the datastore's unique characteristics imply a different way of designing and managing data to take advantage of the ability to scale automatically.
The PI historical database from OSIsoft is non-relational. It is made only to archive timestamped data. It's used a lot in industry, especially as the back-end database for all those 'dashboards'.
There's no need for it to be relational, since there are no joins.
Two other types of databases that haven't come up yet:
Content repositories are databases designed for content (i.e. files, documents, images, etc.). They typically have additional constructs, such as a hierarchical way to browse content, search, transformation between different formats, versioning, and many other things. Examples: Alfresco, Documentum, JackRabbit, Day, OpenText, and many other ECM vendors.
Directories, i.e. Active Directory or LDAP directories. These are databases designed for low-write/high-read scenarios and are highly distributed, across large geographical distances and high-latency connections. While mostly used for authentication/authorization, they don't have to be, if your use case matches their requirements.
Dimensional databases are great examples of non-relational databases. They are very commonly used for 'business dashboards'/'business intelligence', for KPIs and other kinds of aggregate or statistical data. They are usually populated from relational databases and can offer better performance in certain situations.
http://en.wikipedia.org/wiki/Dimensional_database
XML databases, e.g. Xindice
Object databases, e.g. db4o
Be aware that the concept of relational databases is highly contentious. Purists such as C. J. Date would argue that many databases in common use (such as Oracle and SQL Server) do not comply sufficiently with the relational model to be termed 'relational'.
Non-relational databases simply do not meet Codd's requirements.
InterSystems Caché seems to be a total rewrite/redesign of the old Pick operating system's database. From the little I've read of Caché, it appears to be a nicely done redesign.
It permits .NET programs to access the database just as they would SQL. Caché runs Pick OS programs without requiring any changes. By importing your Pick files into Caché you can still run your old green-screen applications, but you can also write new programs in .NET, so you can migrate to Windows applications without abandoning the years of data design you've already invested in.
Here is some background on the Pick DB model. A Pick database uses totally variable-length records and fields. All tables are keyed by a single unique key and are accessible without reading an index. Pick designed the system to use a hashing algorithm that reads an item from disk on, generally, the first physical read (assuming system maintenance was performed correctly).
Fields in Pick are untyped. All data is stored as strings, and casting is up to the programmer. Nulls are stored as an empty string, so a null does not take up disk space as it does in SQL. There is no need for foreign keys. In the 'relational world', the DBA has to create an Order Header table and an Order Line Item table. In the 'Pick model' there is a single table. An example: 'Order Date' is a field that stores a number of days since Dec 31, 1967 (the date the Pick OS was turned on for the first time), so Pick programmers did not have Y2K problems. A second column would be Customer Number. The big difference comes at the Product Number column: it would be 'multi-valued' (the Codd non-conformance). In other words, the database can handle 1-32000 product numbers in that column. Other columns, like Quantity Ordered, would be in a controlling/dependent relationship with the Product Number and would also be multi-valued. When you get to Quantity Shipped, Pick goes to a third dimension and uses sub-multi-valued fields: you would have a Shipment Number column, multi-valued by line item and sub-multi-valued, containing the shipment quantity for that line for that shipment number. No inner joins are needed; all the data for an order is stored in one table and in a single record. No orphan rows, ever!
Secondly, the data definition is a bit different: the dictionaries can contain definitions for data that is not in the table, or data that is being manipulated. A couple of examples: Customer Name would be defined as 'use the Customer Number column and return the Name field from the Customer table'; Line Item Extension would be defined as the calculation Quantity * Price / PricePer.
I believe I read somewhere that Caché claims over 100,000 installations.
I would think a flat-file database in Excel is non-relational, and it is used by quite a few people.
It is really just a database table that cannot be joined with other tables.
Object-Oriented Databases are one interesting type of non-relational database.
The trading sector sometimes uses OO databases, since each deal/contract can look sort of like the others in its category but also have unique attributes. It is VERY difficult to represent that relationally.
eXist-db is an XML database that has been around for a long time. It is particularly useful for running XQuery over tons of XML documents.
Any file or group of files that contains data but does not express relationships within that data is a non-relational database.
RRDtool is designed to store and aggregate log data. You configure a sampling interval and feed data into it; it then returns time-based results. It's optimized for fixed-size storage, and it starts aggregating past results after a time. For example, suppose you have a round-robin database with a 5-minute interval: even if you send it temperature data once per second, it still only stores the results in 5-minute increments. After a week, it averages those results into hourly values. After a month, hourly results are averaged into daily numbers, and so on.
RRDtool is commonly used as the backend for tools like Cricket and MRTG to track network and environmental data for months and years at a stretch.
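The consolidation behaviour can be pictured with a toy function (an illustration of the idea only, not RRDtool's actual format):

    def consolidate(samples, step):
        # Average every `step` raw samples into one stored value,
        # the way a round-robin archive consolidates 1-second points into 5-minute ones.
        return [sum(samples[i:i + step]) / step
                for i in range(0, len(samples) - step + 1, step)]

    per_second = list(range(600))              # e.g. one temperature reading per second
    five_min = consolidate(per_second, 300)    # stored at the 5-minute resolution
    print(five_min)                            # two consolidated points: [149.5, 449.5]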
For a graph-based DBMS you have Neo4j.
For a hierarchical DBMS you have any standard filesystem or, with "schema" support, any LDAP implementation.
There are many answers here, but they all end up falling into one of two major categories:
Navigational. This includes tree/hierarchy databases and graph databases.
Databases that break first normal form (multiple values in a field). This includes Pick databases, and Lotus Notes and its offspring, like CouchDB.
EDIT: And of course key/value stores like BDB aren't relational, but that just goes without saying, doesn't it? I mean, they're just key/value stores.
dBase. Although it was marketed as relational, it doesn't meet the requirements.
As an OO database, InterSystems Caché comes to mind. Some medical and library systems are built on it.
In my company, www.smartsgroup.com, we have a proprietary database engine we call a "transaction log database". It is built on flat files, each file containing a sequence of "events" or "messages" in binary format, plus various indexes on this data and algorithms for reproducing the state of a stock exchange's order book. It is highly optimised for sequential updates and sequential access.
In scientific applications it is also common to use proprietary database engines rather than RDBMSs. I also worked for a company that has the world's largest database of EEG brain recordings: www.brainresource.com. There we used a flat-file database, and it worked well for us.
SmartsGroup also uses a temporal database, which is like a non-relational database table, except that we store a history of all changes to all fields, so we can reproduce the state of a particular row on a particular date.
The wiki page for dimensional databases linked to above seems to have disappeared.
Some OLAP systems are backed by multidimensional databases (MOLAP); these are often used in financial analysis. They afford interactive clients that let one navigate through the data at different levels of aggregation.
At my university there is a group that researches deductive databases.
