The more I read about NoSQL, the more it starts to sound like a column-oriented database to me.
What's the difference between NoSQL (e.g. CouchDB, Cassandra, MongoDB) and a column-oriented database (e.g. Vertica, MonetDB)?
NoSQL is a term used for "Not Only SQL", which covers four major categories: key-value, document, column-family, and graph databases.
Key-value databases are well suited to applications that have frequent small reads and writes along with simple data models.
Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.
e.g. Redis, Riak, etc.
Document databases have the ability to store varying attributes along with large amounts of data.
e.g. MongoDB, CouchDB, etc.
Column-family databases are designed for large volumes of data, read and write performance, and high availability.
e.g. Cassandra, HBase, etc.
A graph database uses graph structures with nodes, edges, and properties to represent and store data, enabling semantic queries.
e.g. Neo4j, InfiniteGraph, etc.
Before understanding NoSQL, you have to understand some key concepts.
Consistency – All the servers in the system will have the same data so anyone using the system will get the same copy regardless of which server answers their request.
Availability – The system will always respond to a request (even if it's not the latest data, or not consistent across the system, or just a message saying the system isn't working).
Partition Tolerance – The system continues to operate as a whole even if individual servers fail or can't be reached.
In most cases, a NoSQL database will satisfy only two of the above three properties.
From your question,
CouchDB: AP (Availability & Partition Tolerance), document database
Cassandra: AP (Availability & Partition Tolerance), column-family database
MongoDB: CP (Consistency & Partition Tolerance), document database
Vertica: CA (Consistency & Availability), column-oriented relational database
MonetDB: ACID (Atomicity, Consistency, Isolation, Durability), relational database
From: http://blog.nahurst.com/visual-guide-to-nosql-systems
Have a look at this article1, article2 and ppt for various scenarios in which you would select a particular type of database.
Some NoSQL databases are column-oriented databases, and some SQL databases are column-oriented as well. Whether the database is column or row-oriented is a physical storage implementation detail of the database and can be true of both relational and non-relational (NoSQL) databases.
Vertica, for example, is a column-oriented relational database so it wouldn't actually qualify as a NoSQL datastore.
A "NoSQL movement" datastore is better defined as a non-relational, shared-nothing, horizontally scalable database without (necessarily) ACID guarantees. Some column-oriented databases can be characterized this way. Besides column stores, NoSQL implementations also include document stores, object stores, tuple stores, and graph stores.
A NoSQL database is a different paradigm from traditional schema-based databases. They are designed to scale and to hold documents such as JSON data. They obviously have a way of querying information, but you should expect syntax like eval("person = * and age > 10") for retrieving data. Even if they support a standard SQL interface, they are intended for something else, so if you like SQL you should stick to traditional databases.
A column-oriented database differs from traditional row-oriented databases in how it stores data. By storing a whole column together instead of a whole row, you can minimize disk access when selecting only a few columns from rows that contain many columns. In row-oriented databases there's no difference whether you select just one field or all fields from a row.
You have to pay for this with more expensive inserts, though. Inserting a new row causes many disk operations, depending on the number of columns.
But there's no difference from traditional databases in terms of SQL, ACID, foreign keys, and the like.
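To make the physical difference concrete, here is a minimal Python sketch (the data and layout are purely illustrative) contrasting how a row store and a column store serve the same single-column SELECT:

# Row-oriented layout: all fields of a record are stored together,
# so selecting one column still touches every record in full.
rows = [
    {"id": 1, "name": "alice", "age": 30, "city": "Oslo"},
    {"id": 2, "name": "bob",   "age": 25, "city": "Lima"},
]

# Column-oriented layout: each column lives in its own contiguous block,
# so "SELECT age FROM people" reads only the 'age' block.
columns = {
    "id":   [1, 2],
    "name": ["alice", "bob"],
    "age":  [30, 25],
    "city": ["Oslo", "Lima"],
}

ages_from_rows = [record["age"] for record in rows]  # scans whole records
ages_from_columns = columns["age"]                   # reads one block only

# The insert cost is mirrored: one write for the row store,
# one write per column block for the column store.
new = {"id": 3, "name": "carol", "age": 41, "city": "Kyiv"}
rows.append(new)
for col in columns:
    columns[col].append(new[col])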
I would suggest reading the taxonomy section of the NoSQL wikipedia entry to get a feel for just how different NoSQL databases are from a traditional schema-oriented database. Being column-oriented implies rows and columns, which implies a (two dimensional) schema, while NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).
For document stores, the structure and contents of each "document" are independent of other documents in the same "collection". Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, "removing" a field could mean that you simply stop referring to it in your code rather than going to the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest contents). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.
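In MongoDB, for example, that "adding a field is a code change" approach could look like the following pymongo sketch (the collection and field names are made up):

from pymongo import MongoClient

db = MongoClient().blog  # assumes a local MongoDB instance

# Newer documents simply carry the new field...
db.posts.insert_one({"title": "New post", "category": "tech"})

# ...while older documents never had it. No ALTER TABLE needed:
# queries just treat the missing field as null/absent.
uncategorized = db.posts.find({"category": {"$exists": False}})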
Documents can also hold lists as well as other nested documents. Here's a sample document from MongoDB (a post from a blog or other forum), represented as JSON:
{
    _id : ObjectId("4e77bb3b8a3e000000004f7a"),
    when : Date("2011-09-19T02:10:11.3Z"),
    author : "alex",
    title : "No Free Lunch",
    text : "This is the text of the post. It could be very long.",
    tags : [ "business", "ramblings" ],
    votes : 5,
    voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
    comments : [
        { who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
          comment : "I agree." },
        { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
          comment : "You must be joking. etc etc ..." }
    ]
}
Note how "comments" is a list of nested documents with their own independent structure. Queries can "reach into" these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.
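In pymongo, such "reach into" queries could look like this (a sketch assuming the sample collection above lives in a local MongoDB; dot notation addresses fields of the nested documents):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient().forum  # assumes a local MongoDB instance

# Posts that have a comment by Jane:
by_jane = db.posts.find({"comments.who": "jane"})

# Posts with comments in a certain date range:
in_range = db.posts.find({
    "comments.when": {
        "$gte": datetime(2011, 9, 19),
        "$lt":  datetime(2011, 9, 21),
    }
})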
So in short, two of the major differences typical of NoSQL databases are the lack of a (formal) schema and contents that go beyond the two dimensional orientation of a traditional row/column database.
Distinguishing between column stores: read this blog; it answers your question.
As @tuinstoel wrote, the article answers your question in point 3:
3. Interface. Group A is distinguished by being part of the NoSQL movement and does not typically have a traditional SQL interface. Group B supports standard SQL interfaces.
Here is how I see it: column-oriented databases deal with the way data is physically stored on disk. As the name suggests, each column is stored in its own separate space/file. This allows for two important things:
You achieve a better compression ratio, on the order of 10:1, because you have a single data type to deal with (see the sketch after this list).
You achieve better read performance because you avoid whole-row scans and can just pick and choose the columns specified in your SELECT query.
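Here is a rough way to see the compression point for yourself (a Python sketch with made-up data; the exact ratios will vary):

import struct
import zlib

# A column of 100,000 'status' codes: one data type, very repetitive,
# stored contiguously - exactly what a column store sees.
column = b"".join(struct.pack("<i", i % 3) for i in range(100_000))

# The same codes interleaved with a string field, as a row store
# would lay them out on disk.
rows = b"".join(
    struct.pack("<i", i % 3) + f"user{i}".encode().ljust(12)
    for i in range(100_000)
)

print(len(column) / len(zlib.compress(column)))  # high ratio
print(len(rows) / len(zlib.compress(rows)))      # noticeably lower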
NoSQL databases, on the other hand, are a whole new breed of databases that define "logical" aggregate levels to explain the data. Some treat the data as having a hierarchical relationship (the aggregate being a "node"), while others treat the data as documents (the document being the aggregate level). They do not dictate the physical storage strategy (some may, but it is abstracted away from the end user).
Also, the whole NoSQL movement has more to do with unstructured data, or rather data sets whose schema cannot be predefined, or is unknown beforehand, and therefore cannot conform to the strict relational model.
Column-oriented databases still deal with relational data, although they can eliminate the need for separate indexes, etc.
Related
In a SQL database, we generally have related information stored in different tables in a database. From what I read from RocksDB document, there's really no clear or 'right' way to represent this kind of structure. So I'm wondering what is the practice to categorize information?
Say I have three types of information, Customer, Product, and Employee. And I want to implement these in RocksDB. Should I use prefix of the key, different column families, or different databases?
Thanks for any suggestion.
You can do it by coming up with a prefix scheme in which a prefix means such-and-such table, column, and id. For simplicity you could store everything in one column family, and definitely in one db, since you then get atomic operations, snapshots, and so on. The better question is why you would want to store relational data in a NoSQL db, unless you are building something higher level.
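A minimal sketch of such a prefix scheme, assuming the python-rocksdb bindings (the key layout and values are illustrative):

import rocksdb

db = rocksdb.DB("crm.db", rocksdb.Options(create_if_missing=True))

# One keyspace, partitioned into "tables" by key prefix:
db.put(b"customer:0001", b'{"name": "Acme"}')
db.put(b"product:0042",  b'{"name": "Widget"}')
db.put(b"employee:0007", b'{"name": "Jane"}')

# Point lookup:
acme = db.get(b"customer:0001")

# Range scan over one "table" (RocksDB keys are byte-ordered):
it = db.iterkeys()
it.seek(b"customer:")
for key in it:
    if not key.startswith(b"customer:"):
        break  # walked past the customer prefix
    print(key, db.get(key))

Zero-padding the ids keeps the lexicographic order consistent with the numeric order, which is what makes the prefix/range scan useful.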
By the way, check out linqdb, which is an example of a higher-level db where you can store entities and perform linq-style operations; it uses rocksdb underneath.
The way data is organized in a key-value store is up to the implementation. There is no single "good way to go"; it depends on the underlying key-value store's features (in particular, whether it is key-ordered).
The same normalization/denormalization techniques apply.
I think the piece you are missing about key-value store application design is the concept of key composition. Briefly, it is the practice of building keys in such a way that they are amenable to querying. If the database is ordered, then it will also allow prefix/range/scan queries and next/previous navigation. This will lead you to build key prefixes in such a way that querying is fast, i.e. doesn't require a full table scan.
Expand your research to other key-value stores like bsddb or leveldb here on Stack Overflow.
What is the difference between a DBMS and an RDBMS, with some examples and some newer tools as examples? Why can't we really use a DBMS instead of an RDBMS, or vice versa?
A relational DBMS will expose to its users "relations, and nothing else". Other DBMS's will violate that principle in various ways. E.g. in IDMS, you could do <ACCEPT <hostvar> FROM CURRENCY> and this would expose the internal record id of the "current record" to the user, violating the "nothing else".
A relational DBMS will allow its users to operate exclusively at the logical level, i.e. work exclusively with assertions of fact (which are represented as tuples). Other DBMS's made/make their users operate more at the "record" level (too "low" on the conceptual-logical-physical scale) or at the "document" level (in a certain sense too "high" on that same scale, since a "document" is often one particular view of a multitude of underlying facts).
A relational DBMS will also offer facilities for manipulation of the data, in the form of a language that supports the operations of the relational algebra. Other DBMS's, seeing as they don't support relations to begin with, obviously cannot build their data manipulation facilities on relational algebra; as a consequence, their data manipulation facilities/language are mostly ad hoc. On the "too low" end of the spectrum, this forces DBMS users to hand-write operations such as JOIN again and again and again. On the "too high" end of the spectrum, it causes problems of combinatorial explosion in language complexity/size (the RA has some 4 or 5 primitive operators, and that's all it needs - can you imagine 4 or 5 operators that would allow you to do just any "document transform" anyone would ever want to do?)
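To illustrate the "hand-write JOIN again and again" point, here is what a record-at-a-time interface forces on you (a Python sketch with made-up data), versus the single declarative expression an RA-based language provides:

# Declaratively: SELECT o.*, c.name FROM orders o
#                JOIN customers c ON o.customer_id = c.id
# By hand: a nested-loop join you get to rewrite for every use case.
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
orders = [{"id": 10, "customer_id": 1, "total": 250},
          {"id": 11, "customer_id": 2, "total": 75}]

joined = [
    {**order, "customer_name": customer["name"]}
    for order in orders
    for customer in customers
    if order["customer_id"] == customer["id"]
]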
(Note very carefully that even SQL systems violate basic relational principles quite seriously, so "relational DBMS" is a thing that arguably doesn't even exist, except then in rather small specialized spaces, see e.g. http://www.thethirdmanifesto.com/ - projects page.)
DBMS: Database management system; here we can store some data and retrieve it.
Imagine a single table: save and read.
RDBMS: Relational database management system; here you can join several tables together and get related, queried data (say, data for a particular user or for a particular order, not all users or all orders).
Normalization comes into play in an RDBMS: we don't need to store repeated data again and again; we can store it in one table and use its id in another table. That makes updates easier, and for reading we can join both tables and get what we want.
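A small sketch of that normalization-plus-join idea, using Python's built-in sqlite3 (table and column names are made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         item TEXT);
    INSERT INTO users  VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")

# The user's name is stored once; each order carries only the id.
# Reading joins the tables back together:
for row in con.execute("""
    SELECT users.name, orders.item
    FROM orders JOIN users ON orders.user_id = users.id
"""):
    print(row)  # ('alice', 'book'), ('alice', 'pen')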
DBMS:
DBMS applications store data as files. In a DBMS, data is generally stored in either a hierarchical form or a navigational form. Normalization is not present in a DBMS.
RDBMS:
RDBMS applications store data in tabular form. In an RDBMS, the tables have an identifier called a primary key, and the data values are stored in the form of tables. Normalization is present in an RDBMS.
I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the doctrine associations (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/working-with-associations.html) in the Person entity class along with the set/get/add/remove methods for each, along with the original 30 columns and their methods, and some supporting utility functions then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries / reports on that sub-table, etc., and Person*Services which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships / associations back to one primary table. How do I avoid bloating out the Person entity class?
Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the people table to create separate tables for one-to-one relations isn't going to help performance or maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing) it will support embedding single Value Objects. Unfortunately we will have to wait for a future version for support of collections of Value objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
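For the flavour of it, here is the Value Object idea in a language-neutral Python sketch (the Address fields are hypothetical; in Doctrine you would express this with an embeddable class instead):

from dataclasses import dataclass

# A Value Object has no identity of its own: two addresses with the
# same fields ARE the same address. Immutability plus value-based
# equality are the defining traits.
@dataclass(frozen=True)
class Address:
    street: str
    city: str
    postal_code: str

home = Address("Main St 1", "Springfield", "12345")
same = Address("Main St 1", "Springfield", "12345")
assert home == same  # equal by value, not by object identity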
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think in terms of behavior. Think of what kinds of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for an HR system you'll end up with many services that revolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large amounts of tables and large amounts of columns. But performance highly depends on how you use it.
OLTP vs OLAP
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, writing a single aggregate (or a short list of aggregates) to the database.
For OLAP systems an ORM is not suited. OLAP involves many complex analytical queries, usually resulting in large object-graphs. For these kind of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to leverage your mapping metadata, then use scalar or array hydration to fetch the data.
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object, with which you can map the results to whatever you want.
I think a HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system for example. OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy-loading associations, which can prevent fetching data you're not going to use. But also choose to eager-load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow when you're doing analytical operations, where you don't really need full-blown entities anyway.
@Quicker makes a valid point by saying you can create specialized View objects: you can fetch only the data you need in specific cases and manually mold that data into objects. This goes along with his point not to bloat the user interface with options a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).
I understand that you have a fully normalized persons table and that you are now asking how best to denormalize it.
As long as you do not hit any technical constraints (such as a maximum of 64 KB per row), I find 70 columns definitely not overloaded for a persons table in an HR system. Do yourself a favour and do not segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs (an) extra index/indexes, which increases your overall memory utilization -> this sounds like a minor issue, as disk is cheap; however, keep in mind that via caching, the RAM-to-disk utilization ratio determines your performance to a huge extent
changes become more complex, as extra relations demand extra care
since any edit/update/read view can be restricted to deal with only slices of your physical data from the tables, no "cosmetic" pressure arises from the end user (or even admin) perspective
In summary, the table subsetting causes lots of issues and effort but adds little if any value.
By the way, databases are optimized for data storage. Millions of rows and some dozens of columns are no-brainers at that end.
For our project, we need a database that supports JOINs and has the ability to easily add and modify attributes of the entity (schema-less/free). Key points:
The system is designed to work with customers (CRM)
Basic entities: User, Customer, Case, Case Interaction, Order
Currently in the database there are ~200k customers and ~250k orders
Customer entity contains 15-20 optional attributes that are most often not filled
About 100 new cases a day
The data is synchronized with several other sources in the background
Requirements (high to low priority):
Ability to implement search/sort by related entities, e.g. Case by linked Customer name (support JOINs)
Having the flexibility to change the schema of the data, and avoiding storing NULL for a large number of attributes
Performance
ORM for Python with support for monitoring changes and the possibility of storing only the changes to the database
What we've tried:
MongoDB does not satisfy requirement 1.
PostgreSQL with all the attributes in one table does not satisfy requirement 2.
PostgreSQL with a separate table for each attribute, or EAV, does not satisfy requirement 3 (a lot of slow joins), but it seems a better solution than the others.
Can you suggest any database or design of the system that will meet our needs?
Datomic might be worth checking out (http://www.datomic.com/). It satisfies requirements 1-3, and although there's no python ORM, there is a REST API.
Datomic is based on an Entity Attribute Value schema (it's not quite schema free - you need to specify a name and type for each attribute - but any entity can have any attribute). It is transactional and has support for joins, unlike some of the other flexible "NoSQL" solutions. Interestingly, it also has first-class support for time (e.g. what is the history of this entity/what did the database look like at time t,etc), which might be useful if you're tracking cases and interactions.
Queries are based on datalog, which queries by unification. Query by unification looks a bit odd at first but is brilliant once you get used to it.
For example, a query to find cases by linked customer name would be something like this:
[:find ?x
 :in $
 :where [?x :case/linked-customers ?c]
        [?c :customer/name "Barry"]]
The query engine looks in the database, and tries to satisfy the where clause by unifying all occurrences of a given variable. In this case, only ?c appears twice (the case has a linked customer c whose name is Barry), but queries can obviously get a lot more complex. The $ here represents the database.
You may want to consider storing the "flexible" part as XML. Some databases, e.g. DB2, allow XML indexing so lookup performance should be as good as with the relational data store. DB2 Express-C is free and does not have an artificial limit on the database size.
Update: Since 2015, DB2 Express-C limits the database user data volume to 15 TB, which should still be plenty.
What would be considered best practice when you need additional data about facet results?
I.e. I need a friendly name / image / meta keywords / description / and more for product categories (when faceting on categories). The options:
include it in the document? (can lead to lots of duplication)
introduce category as a new index in Solr (or fake it with a doctype=category field in Solr)
use an RDBMS to look up the additional data using a SELECT ... WHERE IN (..category facet result ids..)
Thanks,
Remco
Use a fast NoSQL db that fits your data.
BTW, Lucene, which is Solr's underlying layer, is in fact also a NoSQL-type storage facility.
If I were you, I'd use MongoDB. That's the first db that came to mind, since you need binary data, and they practically invented BSON, which is now a widespread means of transferring binary data in a JSON-like fashion.
If your data structure is more graph-shaped (like a social network), check out Neo4j, which has blindingly fast graph-traversal algorithms.
A relational DB can reliably enforce the "category is a first-class entity" rule. You would need referential integrity: a product may not belong to a category that doesn't exist, and a deleted category must not leave its child categories lying around. A normalized RDB can enforce referential integrity through the schema; a NoSQL DB must rely on client-side code (which you must write) to enforce it.
Let's see how "a product's category must exist" and "subcategories' parents must exist" are done:
RDB: The table that assigns categories to products (an m:n relation) is keyed to the product and the category with an ON DELETE CASCADE. If a category is deleted, a product simply cannot keep that category. A category that links up to another category as a child: the relevant field has an ON DELETE CASCADE, so if a parent is deleted, its children cannot exist. This entire method is declarative ("it is declared thus"); all the complexity lives in the data, and we don't need no stinking code to do it for us. You can model a DB as naturally as you understand its real-world implications (see the sketch after this comparison).
Document-store-type NoSQL: You need to write code to do everything. "A category is deleted" is a use case, and you need to find the products that have that category and update each one. You have to write code for each use case. The same goes for managing subcategories. The data model may be incredibly dumb, but its real-world implications must be modeled in the code. And it's tougher to reason in code and control flow than in data structures.
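A minimal sketch of the declarative variant, using Python's built-in sqlite3 (the schema is illustrative; SQLite needs the foreign_keys pragma turned on per connection):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE category (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES category(id) ON DELETE CASCADE
    );
    CREATE TABLE product_category (
        product_id INTEGER,
        category_id INTEGER REFERENCES category(id) ON DELETE CASCADE
    );
    INSERT INTO category VALUES (1, NULL), (2, 1);  -- 2 is a child of 1
    INSERT INTO product_category VALUES (100, 2);
""")

con.execute("DELETE FROM category WHERE id = 1")
# The child category and the product link are gone too - no client code:
print(con.execute("SELECT COUNT(*) FROM category").fetchone())          # (0,)
print(con.execute("SELECT COUNT(*) FROM product_category").fetchone())  # (0,)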
Do you really have performance needs that require NoSQL databases?
So use RDBMSs to manage your data. Then use the DataImportHandler or client-side code to insert/update denormalized entities for searching. If most requests to your site can be expressed as Solr queries, great!
As for expressing hierarchical faceting in Solr, see 'Ways to do hierarchical faceting in Solr?'.
I would think about two alternatives:
1.) storing the information for every document without indexing it (to keep the index as small as possible). The point is that I would not store the image inside Lucene/Solr - only a file pointer.
2.) storing the additional data in an RDBMS or NoSQL store (like MongoDB) for lookup, as you wrote.
My favourite is the 2nd one, because a database is the traditional and most optimized way of storing data.
But in the end it depends on your system, because you should keep in mind that you need time for connecting to a database, searching through the data, and sending the additional information back to the application.
So it could be faster to store everything in Lucene.
A small performance test would probably be useful.
Maybe I am wrong, but if you are on Solr trunk you could benefit from Solr join support; this would allow you to index several entities with relations among them while enforcing conditions on both.