Why, or why not, to use a schema free database? - database

I've been using only MongoDB in a web application for some time, but I wonder why some people say that schema-free or dynamic schemas are powerful. So far I don't find them that fantastic or wonderful. Would anybody like to talk about the proper cases for using schema-free databases? First I'd like to tell some of my stories.
What is schema free, the database or the code?
Most NoSQL databases like to say they are schema-free, but I think that, down to earth, the important part is the code running in the application.
For example, the storage of user information could be schema free, but that doesn't mean you could store the username as an object or the password as a timestamp. The code for user login assumes that the username is a string and the password is a hash, and that eventually constrains the database storage to a schema.
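To make that point concrete, here is a minimal sketch of the schema that lives in the code. The function name check_user_document and the field names username and password_hash are hypothetical, chosen only for illustration.

```python
def check_user_document(doc):
    """Return a list of problems that would break a login routine.

    The database would accept any of these documents; the checks mirror
    what the application code silently assumes.
    """
    problems = []
    if not isinstance(doc.get("username"), str):
        problems.append("username must be a string")
    if not isinstance(doc.get("password_hash"), str):
        problems.append("password_hash must be a string")
    return problems


good = {"username": "alice", "password_hash": "not-a-real-hash"}
bad = {"username": {"first": "alice"}, "password_hash": 1393718400}

print(check_user_document(good))  # nothing to report
print(check_user_document(bad))   # both fields are rejected
```

The store itself would take both documents; only the application knows that the second one is unusable.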
Embedded documents are hard to maintain or to query
I created a CMS as the example to start my NoSQL database life. At the beginning the posts and comments data were stored like this
[
{
title: 'Mongo is Good',
content: 'Mongo is a NoSQL database.',
tags: ['Database', 'MongoDB', 'NoSQL'],
comments: [ COMMENT_0, COMMENT_1, ... ]
},
{
title: 'Design CMS',
content: 'Design a blog or something else.',
tags: ['Web', 'CMS'],
comments: [ COMMENT_2, COMMENT_3, ... ]
},
...
]
As you see, I embedded comments into a list in each post. It was quite convenient, as I could easily append a new comment to any post or retrieve comments along with the post. But soon I encountered the first problem: it was quite messy to delete a certain comment (usually spam) from the list. To my surprise, Mongo still hasn't implemented it.
Aside from that API-level problem, it is also hard to query embedded documents across the collection. If I insisted on that design, the following queries could only be implemented in brute-force ways:
recent comments
comments by one certain user
Eventually I had to place comments into another collection, with a post_id field storing the id of the post the comment belongs to, just like a foreign key in a relational database.
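The difference between the two designs can be sketched with plain Python lists standing in for collections. The sample users, texts and dates below are made up, and no real MongoDB calls are involved.

```python
from datetime import datetime

# Embedded design: comments live inside each post document.
posts = [
    {"_id": 1, "title": "Mongo is Good", "comments": [
        {"user": "ann", "text": "Nice", "created": datetime(2012, 1, 3)},
        {"user": "bob", "text": "Spam", "created": datetime(2012, 1, 5)},
    ]},
    {"_id": 2, "title": "Design CMS", "comments": [
        {"user": "ann", "text": "Useful", "created": datetime(2012, 1, 4)},
    ]},
]


def recent_comments_embedded(posts, limit):
    """Brute force: walk every post to collect and sort all comments."""
    every_comment = [c for p in posts for c in p["comments"]]
    return sorted(every_comment, key=lambda c: c["created"], reverse=True)[:limit]


# Referenced design: one flat collection, each comment carries post_id.
comments = [dict(c, post_id=p["_id"]) for p in posts for c in p["comments"]]


def comments_by_user(comments, user):
    """A plain filter on one collection; an index on `user` could serve it."""
    return [c for c in comments if c["user"] == user]


print([c["text"] for c in recent_comments_embedded(posts, 2)])
print([c["post_id"] for c in comments_by_user(comments, "ann")])
```

With the flat collection, "recent comments" and "comments by one user" become ordinary single-collection queries, at the cost of a second lookup when showing a post with its comments.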
Despite the comments design, the post tags are pretty helpful.
I found an opinion in this post
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it.
But what about changes in the requirements? Isn't it too crazy to restructure a database only because a new query has to be supported?
The cases that are worth schema free
Some other cases do need schema-free storage. For example, a Twitter-like timeline, with data in the following format
[
{
_id: ObjectId('aaa'),
type: 'tweet',
user: ObjectId('xxx'),
content: '0000',
},
{
_id: ObjectId('bbb'),
type: 'retweet',
user: ObjectId('yyy'),
ref: ObjectId('aaa'),
},
...
]
The problem is it won't be an easy job to render the documents into HTML. I render them in this way (Python)
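# Map each update type to the function that renders it.
render_methods = {
    'tweet': render_tweet,
    'retweet': render_retweet,
}
# Dispatch on the stored 'type' field of every update.
result = [ render_methods[u['type']](u) for u in updates ]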
That is because only the JSON data is stored, without member functions. As a result, I have to manually map a render function to each update according to its type. (Similar things happen when the server sends the JSON to the browser intact via AJAX.)
The above problems confuse me a lot. Would anyone like to describe good practice for schema-free databases, and whether it would be a good decision to mix a relational database with a schema-free database in a single application?

The main strength of schemaless databases comes to light when using them in an object-oriented context with inheritance.
Inheritance means that you have objects which have some attributes in common, but also some attributes which are specific to the sub-type of object.
Imagine, for example, a product catalog for a computer hardware store.
Every product will have the attributes name, vendor and price. But CPUs will have a clock_rate, hard drives will have a capacity, RAM both capacity and clock_rate and network cards a bandwidth. Doing this in a relational database leaves you two options which are equally cumbersome:
create a table with fields for all possible attributes, but leave most of them NULL for products where they don't apply.
create a secondary table "product_attributes" with productId, attribute_name and attribute_value.
A schemaless database, on the other hand, makes it easy to store items with different sets of optional properties in the same collection. The code that renders the product attributes to HTML would then check for the existence of each known optional property and call an appropriate function that outputs its value as a table row.
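A minimal sketch of that rendering step, assuming each product arrives as a plain dict. The OPTIONAL_PROPERTIES table, its labels and the units are hypothetical choices for the example.

```python
from html import escape

# Hypothetical mapping from optional property to a label and formatter.
OPTIONAL_PROPERTIES = {
    "clock_rate": ("Clock rate", lambda v: f"{v} MHz"),
    "capacity": ("Capacity", lambda v: f"{v} GB"),
    "bandwidth": ("Bandwidth", lambda v: f"{v} Mbit/s"),
}


def render_rows(product):
    """Return HTML table rows for the optional properties a product has."""
    rows = []
    for key, (label, fmt) in OPTIONAL_PROPERTIES.items():
        if key in product:  # only render properties this document carries
            value = escape(fmt(product[key]))
            rows.append(f"<tr><th>{escape(label)}</th><td>{value}</td></tr>")
    return rows


cpu = {"name": "CPU X", "vendor": "Acme", "price": 199, "clock_rate": 3200}
ram = {"name": "RAM Y", "vendor": "Acme", "price": 89, "capacity": 16, "clock_rate": 1600}

print(render_rows(cpu))
print(render_rows(ram))
```

Adding a new product type then means adding an entry to the mapping, not altering a table.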
Another advantage of schemaless databases is the additional agility they give you during development. You can try new features without having to restructure your database, which makes it very easy to stay backward compatible with data created by a previous version of the application without running complicated database conversion routines. I am currently developing an MMORPG using MongoDB. During development I added lots of new features which required new data about each character to be persisted in the database. I never had to run a single command equivalent to CREATE TABLE or ALTER TABLE on my database. MongoDB just ate and spewed out whatever data I threw at it. My first test character is still playable, although I never did any intentional upgrading of its database document. It has some obsolete fields which are remnants of features I discarded or refactored, but these don't hurt at all. They might even prove useful if my players scream for me to bring back a feature I removed.

Doing this in a relational database leaves you two options which are equally cumbersome:
There is one more option here, I guess:
Add the data in XML format and let the application serialize and deserialize it the way it wants.
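A rough sketch of that option, using SQLite and the standard library XML module as stand-ins. The table layout and the extra_xml column name are assumptions for the example, not something the answer prescribes.

```python
import sqlite3
import xml.etree.ElementTree as ET


def to_xml(attributes):
    """Serialize a flat dict of product-specific attributes to an XML string."""
    root = ET.Element("attributes")
    for name, value in attributes.items():
        ET.SubElement(root, "attribute", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Deserialize the XML string back into a dict (values come back as str)."""
    return {el.get("name"): el.text for el in ET.fromstring(text)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, vendor TEXT, price REAL, extra_xml TEXT)")
conn.execute(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    ("RAM Y", "Acme", 89.0, to_xml({"capacity": 16, "clock_rate": 1600})),
)

(extra_xml,) = conn.execute(
    "SELECT extra_xml FROM product WHERE name = ?", ("RAM Y",)
).fetchone()
print(from_xml(extra_xml))
```

The trade-off is that the database sees only an opaque string, so plain SQL cannot filter or index on the attributes inside it.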

Related

Storing custom collection templates in MongoDB

Background
I'm developing a web application that requires users to create and fill in custom tables (think HTML or MS Word tables, not database tables). The idea is that admins will create the templates for those tables, and general users will fill them in. Each general user can fill their own template, so there will be many tables that use the same template.
Now, these templates can hold any amount of columns, and each column can have a different data type. They can also change over time, meaning there should be versions of them. In other words, once a user fills a template, it should be stored with that structure, even if the template changes in the future. It will keep the old version of the template, while new users will use the newest version.
I usually work with relational databases, but for this scenario, a relational database doesn't seem like a good fit. I feel I would end up with a bunch of tables and require many joins to extract the data I need, especially if the database is normalized.
I thought about using something like MongoDB. I'm new to Mongo, so I'm sure I haven't explored every single option, but I thought it would be a better idea than using MySQL (which I've used for other apps). Do tell me if I'm wrong about this, though; maybe I'm missing something.
I want to be able to store the templates (name, columns (in order, with their names and data types)), then reuse these templates every time a user wants to fill them in. I would use a collection to store all tables that use the same template version. That is, a collection would be created each time a template is made. The collection should have defined data types that all documents inside should follow. If I'm understanding correctly, MongoDB offers schema validation for this.
My approach and questions
How can I store templates in the database? I thought about storing them as documents in a "templates" collection, which would be something like the example below, but I'm concerned about having the data types as strings.
{
name: <String>,
fields: [
{
name: <String>,
type: <String> <----this is what I have an issue with
},
...
]
}
I would have to read these templates and convert their structure into HTML tables so users can fill them, then store them as documents in the collections with the template name or id.
Something like:
{
template_id: <uuid, integer, string or something>
name: <String>,
rows: [
{
<field name>:<field value of specified type>,
...
}
]
}
How feasible and recommendable is it to create collections on the fly?
If anything is unclear, let me know so I can clarify. Like I said, I'm new to MongoDB (and non-relational DBs in general), so if I'm missing something obvious, feel free to point it out and offer suggestions. Helpful links to the docs or tutorials are also welcome. I'm also open to trying other NoSQL solutions.
I have the same idea about storing templates in MongoDB in my application.
You should use a standard schema format for your project, because there are support libraries for validating data against it.
JSON Schema:
The current version is draft-07: http://json-schema.org/
You can read about the schema standard there and find client libraries for many programming languages.
MongoDB:
MongoDB 3.6 now supports validation with JSON Schema draft-04:
https://docs.mongodb.com/manual/core/schema-validation/#json-schema
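As a rough sketch of how a template with string type names could be turned into such a validator: the TYPE_TO_BSON mapping and the type names used in the template are assumptions, while bsonType, required, properties, items and additionalProperties are keywords of MongoDB's $jsonSchema operator. The code only builds the document and does not talk to a server.

```python
# Hypothetical mapping from the template's type strings to BSON type names.
TYPE_TO_BSON = {
    "string": "string",
    "integer": "int",
    "decimal": "double",
    "boolean": "bool",
    "date": "date",
}


def template_to_validator(template):
    """Build a $jsonSchema validator document for rows of one template."""
    row_properties = {
        field["name"]: {"bsonType": TYPE_TO_BSON[field["type"]]}
        for field in template["fields"]
    }
    return {
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["template_id", "rows"],
            "properties": {
                "rows": {
                    "bsonType": "array",
                    "items": {
                        "bsonType": "object",
                        "properties": row_properties,
                        "additionalProperties": False,
                    },
                },
            },
        }
    }


template = {
    "name": "inventory",
    "fields": [{"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}],
}
validator = template_to_validator(template)
print(validator["$jsonSchema"]["properties"]["rows"]["items"]["properties"])
```

Keeping the type as a string in the template is then not much of a problem: the string is only a lookup key, and an unknown type name fails loudly with a KeyError when the validator is built.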
Storing the schema in MongoDB:
You will run into a problem when storing a JSON Schema in a collection.
JSON Schema uses some characters that MongoDB cannot store in field names (for example, keywords such as $schema and $ref begin with $).
For reference, see https://accraze.info/storing-json-schema-templates-in-mongodb/
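One possible workaround, sketched under the assumption that the problematic characters are the leading $ of JSON Schema keywords, is to rename those keys before storing and restore them after loading. The _dollar_ prefix is a made-up convention, and the linked article may well take a different approach.

```python
def rename_keys(value, transform):
    """Recursively apply `transform` to every dict key in a JSON-like value."""
    if isinstance(value, dict):
        return {transform(k): rename_keys(v, transform) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v, transform) for v in value]
    return value


# Hypothetical escaping convention: "$ref" <-> "_dollar_ref".
PREFIX = "_dollar_"


def escape_key(key):
    return PREFIX + key[1:] if key.startswith("$") else key


def unescape_key(key):
    return "$" + key[len(PREFIX):] if key.startswith(PREFIX) else key


schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"author": {"$ref": "#/definitions/person"}},
}

stored = rename_keys(schema, escape_key)      # safe to store as a document
restored = rename_keys(stored, unescape_key)  # back to valid JSON Schema
print(sorted(stored))
```

A property that is itself named with the _dollar_ prefix would be mangled on the way back, so a real implementation would need a prefix that cannot collide with user-defined names.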

How to select which database to use for object storing

I want to deploy a small web project I have in mind, where the data I want to save are structs with nested structs inside, and most of the time the inner structs do not have the same fields and types.
For example, I'd like something like this to be a "row" in a table:
{
    event : "The Oscars",
    place : "Los Angeles, USA",
    date : "March 2, 2014",
    awards :
    {
        bestMovie :
        {
            name : "someName",
            director : "someDirector",
            actors :
            [
                ... etc
            ]
        },
        bestActor : "someActor"
    }
}
( JSON objects are easy to use for me at the moment, and passing it between server and client side. The client-side is run on JavaScript )
I started with MySQL/PHP but very soon I saw that it doesn't suit me. I tried mongoDB for a few days but I don't know how exactly to refine my search on which is the best db to use.
I want to be able to set some object models/schemas, and select exactly which part to update and which fields are unique in each struct.
Any suggestions? Thanks.
This is not an answerable question so will likely be closed.
There is no one right answer here and there are a few questions to ask such as data structure, speed requirements, and the old CAP Theorem questions of what do you need:
Consistency
Availability
Partition tolerance
I would suggest mongo will be a great place to start if you are casually working away and don't anticipate having to deal with any of the issues above at scale. Couch is another similar option but doesn't have the same community size.
I say mongo because your data is denormalized into a document and mongo is good at serving documents. It also speaks json!
An RDBMS would require you to normalize your documents into tables and create relationships between them, which is quite a bit of work from where you are compared to sticking the documents into a document store.
You could serialize the data using protocol buffers and put that in an RDBMS, but this is not advisable.
For blazing speed you could use Redis, which has constant-time lookups in memory. But this is better suited (in most cases) to ephemeral data like user sessions, not long-term persistent storage.
Finally, there are graph databases like Neo4j, which are document-like databases that store relationships between nodes with typed edges. These suit social and recommendation problems quite well, but that's probably not the problem you're trying to solve; the question simply asks what is best for storing your data.
Looking at some of the possibilities, I think you'll probably find Mongo best suits your needs, as you already have JSON document structures and only need simple persistence for those documents.

What choices for a relational Document-store (NoSql?) database engine?

What choices are there for document-store databases that allow for relational data to be retrieved? To give a real example, say you have a database to store blog posts. I'd like to have the data look something like:
{id: 12345,
title: "My post",
body: "The body of my post",
author: {
id: 123,
name: "Joe Bloggs",
email: "joe.bloggs@example.com"
}
}
Now, you will likely have a number of these records that all share the author details. What I'd really like is to have the author itself stored as a different record in the database, so that if you update this one record every post record that links to it gets the updates as well. To date the only way I've seen mentioned to do this is to have the post record instead store an ID of the author record, so that the calling code will have to make two queries of the data store - one for the post and another for the author ID that is linked to the post.
Are there any document store databases that will allow me to make a single query and return a structured document containing the linked records? And preferably allow me to edit an internal part of the document, persist the document as a whole and have the correct thing happen [I.e. in the above, if I retrieved the entire document, changed the value of email and persisted the entire document then the email address of the author record is changed, and reflected in all posts that have that author...]
First, let me acknowledge: This particular type of data is somewhat relational by nature. It just depends on exactly how you want to structure this type of data, and what technologies that you have easy access to for this particular project. That said, how do you want your data structured?
If you can structure your data any way you want, you could go with something like this:
{
name: 'Joe',
email: 'joe.bloggs@ex.com',
posts: [
{
id: 123,
title: "My post"
},
{..}
]
}
Here all the posts are contained in one particular key/value pair. This particular type of data I would say is uniquely suited for Riak (due to it being able to query internally against JSON using JavaScript natively). Though you could probably come at it from just about any of the NoSQL data store points of view (Cassandra, Couch, Mongo, et al.), as most of them can store straight-up JSON. I just have a tendency towards Riak at this point, due to my personal experience with it.
The more interesting things that you'll probably run up against will relate to how you deal with the data store. For instance, I really like using Ripple for Ruby, which lets me deal with this kind of data in Riak real easy. But if you're in Java land, that might make adoption of this technique a bit more difficult (though I haven't spent a lot of time looking in to Java adoption of Riak), since it tends to lag on 'edge' style data storage techniques.
More than that, getting your brain to start thinking in NoSQL terms, without using 'relations', is what usually takes the longest in structuring data. Because there isn't a schema, and there aren't any of the preconceptions that come with it, you can do a lot of things that are thought of as simply wrong in the relational DB world, like storing all of the blog posts for a single user in one document, which just wouldn't work in the standard schema-heavy, strongly table-based relational world.
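For comparison, the two-lookup approach described in the question can be hidden behind a small pair of helper functions. This is only a sketch with in-memory dicts standing in for the data store; load_post, save_post and the author_id field are hypothetical names.

```python
# Two in-memory "collections" standing in for a document store.
authors = {
    123: {"id": 123, "name": "Joe Bloggs", "email": "joe.bloggs@example.com"},
}
posts = {
    12345: {"id": 12345, "title": "My post", "body": "The body of my post", "author_id": 123},
}


def load_post(post_id):
    """Fetch a post, then fetch its author and embed a copy in the result."""
    post = dict(posts[post_id])               # first lookup
    author = authors[post.pop("author_id")]   # second lookup
    post["author"] = dict(author)
    return post


def save_post(post):
    """Split the structured document back into its two stored records."""
    post = dict(post)
    author = post.pop("author")
    authors[author["id"]] = author
    post["author_id"] = author["id"]
    posts[post["id"]] = post


doc = load_post(12345)
doc["author"]["email"] = "joe@example.org"
save_post(doc)
print(load_post(12345)["author"]["email"])
```

Every post that references author 123 sees the new email on its next load, because the author is stored once; the cost is the extra lookup on each read.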

Are there ORM (OKM) for key-value stores?

Object-Relational-Mappers have been created to help applications (which think in terms of objects) deal with stored data in a more application-friendly way like every other class/object.
However, I have never seen an OKM (Object-Key/Value-Mapper) for NoSQL "Key/Value" storage systems. That seems odd, because the need should be far greater given that more value relations have to be hard-coded into the app than with a regular, single SQL table row object.
Four requests:
user:id
user:id:name
user:id:email
user:id:created
vs one request:
user = [id => ..., name => ..., email => ...]
Plus you must keep track of "lists" (post has_many comments) since you don't have has_many through tables or foreign keys.
INSERT INTO user_groups (user_id, group_id) VALUES (23, 54)
vs
usergroups:user_id = {54,108,32,..}
groupsuser:group_id = {23,12,645,..}
And there are lots more examples of the added logic that an application would need to replicate some basic features that normal relational databases provide. All of these reasons make the idea of an OKM sound like a shoo-in.
Are there any? Are there any reasons there are not any?
Ruby's DataMapper project is an ORM and will happily talk to a key-value store through the use of an adapter.
Redis and MongoDB have adapters that already exist. CouchDB has an adapter — it's not maintained, but at one point it worked pretty well. I don't think anyone's done anything with Cassandra yet, but there's no reason it couldn't be done. The Dubious framework for Google App Engine takes a very similar approach to Data Mapper to make the Data Store available to applications.
So it's very possible to do ORM with key-value stores. The ORM just really needs to avoid the assumption that SQL is its primary vocabulary.
One of the design goals of SQL is that any data can be stored/queried in any relational database - There are some differences between platforms, but in general the correct way to handle a particular data structure is well known and easily automated but requiring fairly verbose code. That is not the case with NoSQL - generally you will be directly storing the data as used in your application rather than trying to map it to a relational structure, and without joins or other object/relational differences the mapping code is trivial.
Beyond generating the boilerplate data access code, one of the main purposes of an ORM is abstraction of differences between platforms. In my experience the ability to switch platforms has always been purely theoretical, and this lowest common denominator approach simply won't work for NoSQL as the platform is usually chosen specifically for capabilities not present on other platforms. Your example is only for the most trivial key value store - depending on your platform you most likely have some useful additional commands, so your first example could be
MGET user:id:name user:id:email ... (multiget - get any number of keys in a single call)
GET user:id:* (key wildcards)
HGETALL user:id (redis hash - gets all subkeys of user)
You might also have your user object stored in a serialized form - unlike in a relational database this will not break all your queries.
Working with lists isn't great if your platform doesn't have support built in - native list/set support is one of the reasons I like to use redis - but aside from potentially needing locks it's no worse than getting the list out of sql.
It's also worth noting that you may not need all the relationships you would define in SQL. For example, if you have a group containing a million users, the ability to get a list of all users in a group is completely useless, so you would never create the groupsuser list at all, and rather than a separate usergroups list you would have user:id:groups as a multivalue property. If you just need to check for membership, you could set up keys as usergroups:userid:groupid and get constant-time lookup.
I find it helps to think in terms of indexes rather than relationships: when setting up your data access code, decide which fields will need to be queried and add appropriate index records when those fields are written.
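A minimal sketch of that idea, with a plain dict standing in for the key-value store. The key layouts follow the ones mentioned above; the user_by_email index and the function names are made up for the example.

```python
store = {}  # stand-in for a key-value store: key -> value or set


def save_user(user_id, name, email, groups):
    """Write the user record and the index records that depend on it."""
    store[f"user:{user_id}"] = {"name": name, "email": email}
    # Index record: look up a user id by email.
    store[f"user_by_email:{email}"] = user_id
    # Multi-value property on the user instead of a separate join table.
    store[f"user:{user_id}:groups"] = set(groups)
    # One key per membership for constant-time membership checks.
    for group_id in groups:
        store[f"usergroups:{user_id}:{group_id}"] = True


def is_member(user_id, group_id):
    """Membership check is a single key lookup."""
    return f"usergroups:{user_id}:{group_id}" in store


save_user(23, "Joe", "joe@example.com", [54, 108])
print(store["user_by_email:joe@example.com"])
print(is_member(23, 54), is_member(23, 7))
```

The sketch does not remove stale index records when an email or a group list changes; that bookkeeping is exactly the logic the application has to own once the database no longer does it.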
ORMs don't map terribly well to the schema-less nature of key-value stores. That being said, if you're using Riak and Ruby, you could take a look at Ripple. There are a number of other drivers for Riak which might fit with your language.
If you're looking into MongoDB (more of a document store than a k/v store), there are a number of drivers available.
The UniVerse DB, which is a descendant of Pick, lets you store a list of key-value pairs for a given key. However, this is very old technology and the world ran away from these databases a long time ago.
You can implement this in an SQL database with a three column table
CREATE TABLE ATTRS ( KEYVAL VARCHAR(32),
ATTRNAME VARCHAR(32),
ATTRVAR VARCHAR(1024)
)
Although most DBAs will hit you over the head with the very thick Codd and Date hardback edition if you propose this, it is in fact a very common pattern in packaged applications to allow you to add site specific attributes to a system.
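A short sketch of how an application might read and write that three-column table, using SQLite in memory as a stand-in for whatever RDBMS is in use. The sample keys and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ATTRS (KEYVAL VARCHAR(32), ATTRNAME VARCHAR(32), ATTRVAR VARCHAR(1024))"
)
conn.executemany(
    "INSERT INTO ATTRS VALUES (?, ?, ?)",
    [
        ("user:23", "name", "Joe"),
        ("user:23", "email", "joe@example.com"),
        ("user:42", "name", "Ann"),
    ],
)


def load(keyval):
    """Collect all attribute rows for one key into a dict."""
    rows = conn.execute(
        "SELECT ATTRNAME, ATTRVAR FROM ATTRS WHERE KEYVAL = ?", (keyval,)
    )
    return dict(rows.fetchall())


print(load("user:23"))
```

Every value comes back as a string and nothing stops two rows from sharing the same key and attribute name, which is part of why DBAs dislike the pattern.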
To paraphrase the comment on Lisp usually credited to Philip Greenspun (his tenth rule):
"Any reasonably functional data storage system will eventually end up implementing its own version of an RDBMS."

Database design help with varying schemas

I work for a billing service that uses some complicated mainframe-based billing software for its core services. We have all kinds of codes we set up that are used for tracking things: payment codes, provider codes, write-off codes, etc... Each type of code has a completely different set of data items that control what the code does and how it behaves.
I am tasked with building a new system for tracking changes made to these codes. We want to know who requested what code, who/when it was reviewed, approved, and implemented, and what the exact setup looked like for that code. The current process only tracks two of the different types of code. This project will add immediate support for a third, with the goal of also making it easy to add additional code types into the same process at a later date. My design conundrum is that each code type has a different set of data that needs to be configured with it, of varying complexity. So I have a few choices available:
I could give each code type its own table(s) and build them independently. Considering we only have three codes I'm concerned about at the moment, this would be simplest. However, this concept has already failed or I wouldn't be building a new system in the first place. It's also weak in that writing generic source code at the presentation level to display request data for any code type (even those not yet implemented) is not trivial.
Build a db schema capable of storing the data points associated with each code type: not only values, but what type they are and how they should be displayed (for example, a dropdown list from an enum of some kind). I have a decent db schema for this started, but it just feels wrong: overly complicated to query and maintain, and it ultimately requires a custom query to view the full data in a nice tabular form for each code type anyway.
Store the data points for each code request as XML. This greatly simplifies the database design and will hopefully make it easier to build the interface: just set up a schema for each code type. Then have code that validates requests against their schema, transforms a schema into display widgets and maps an actual request item onto the display. What this option lacks is a way to handle changes to the schema.
My questions are: how would you do it? Am I missing any big design options? Any other pros/cons to those choices?
My current inclination is to go with the xml option. Given the schema updates are expected but extremely infrequent (probably less than one per code type per 18 months), should I just build it to assume the schema never changes, but so that I can easily add support for a changing schema later? What would that look like in SQL Server 2000 (we're moving to SQL Server 2005, but that won't be ready until after this project is supposed to be completed)?
[Update]:
One reason I'm thinking xml is that some of the data will be complex: nested/conditional data, enumerated drop down lists, etc. But I really don't need to query any of it. So I was thinking it would be easier to define this data in xml schemas.
However, le dorfier's point about introducing a whole new technology hit very close to home. We currently use very little xml anywhere. That's slowly changing, but at the moment this would look a little out of place.
I'm also not entirely sure how to build an input form from a schema, and then merge a record that matches that schema into the form in an elegant way. It will be very common to only store a partially-completed record and so I don't want to build the form from the record itself. That's a topic for a different question, though.
Based on all the comments so far Xml is still the leading candidate. Separate tables may be as good or better, but I have the feeling that my manager would see that as not different or generic enough compared to what we're currently doing.
There is no simple, generic solution to a complex, meticulous problem. You can't have both simple storage and simple app logic at the same time. Either the database structure must be complex, or else your app must be complex as it interprets the data.
I outline five solutions to this general problem in "product table, many kind of product, each product have many parameters."
For your situation, I would lean toward Concrete Table Inheritance or Serialized LOB (the XML solution).
The reason that XML might be a good solution is that:
You don't need to use SQL to pick out individual fields; you're always going to display the whole form.
Your XML can annotate fields for data type, user interface control, etc.
But of course you need to add code to parse and validate the XML. You should use an XML schema to help with this. In which case you're just replacing one technology for enforcing data organization (RDBMS) with another (XML schema).
You could also use an RDF solution instead of an RDBMS. In RDF, metadata is queryable and extensible, and you can model entities with "facts" about them. For example:
Payment code XYZ contains attribute TradeCredit (Net-30, Net-60, etc.)
Attribute TradeCredit is of type CalendarInterval
Type CalendarInterval is displayed as a drop-down
.. and so on
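The facts above can be modeled as subject/predicate/object triples. The sketch below uses plain Python tuples, not a real RDF store, and the predicate names hasAttribute, hasType and displayedAs are made up for illustration.

```python
# Facts as (subject, predicate, object) triples.
facts = [
    ("PaymentCode:XYZ", "hasAttribute", "TradeCredit"),
    ("TradeCredit", "hasType", "CalendarInterval"),
    ("CalendarInterval", "displayedAs", "drop-down"),
]


def objects(subject, predicate):
    """Return every object matching a subject and predicate."""
    return [o for s, p, o in facts if s == subject and p == predicate]


def widgets_for(code):
    """Follow attribute -> type -> display facts to pick a widget per attribute."""
    result = {}
    for attribute in objects(code, "hasAttribute"):
        for type_name in objects(attribute, "hasType"):
            for widget in objects(type_name, "displayedAs"):
                result[attribute] = widget
    return result


print(widgets_for("PaymentCode:XYZ"))
```

Supporting a new code type then means adding facts, not tables, and the same traversal answers "how do I display this?" for all of them.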
Re your comments: Yeah, I am wary of any solution that uses XML. To paraphrase Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use XML." Now they have two problems.
Another solution would be to invent a little Domain-Specific Language to describe your forms. Use that to generate the user-interface. Then use the database only to store the values for form data instances.
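As a very small sketch of what such a DSL could look like: the "name: kind options" line format, the widget names and the helper functions are all invented for this example.

```python
FORM_SPEC = """
trade_credit: choice Net-30 Net-60
description: text
active: flag
"""

# Hypothetical mapping from field kind in the DSL to a UI control.
WIDGETS = {"choice": "dropdown", "text": "textbox", "flag": "checkbox"}


def parse_form(spec):
    """Parse 'name: kind option...' lines into a list of field descriptions."""
    fields = []
    for line in spec.strip().splitlines():
        name, rest = line.split(":", 1)
        kind, *options = rest.split()
        fields.append({"name": name.strip(), "widget": WIDGETS[kind], "options": options})
    return fields


def missing_fields(fields, values):
    """Names not yet filled in; partially completed records are allowed."""
    return [f["name"] for f in fields if f["name"] not in values]


fields = parse_form(FORM_SPEC)
print(fields[0])
print(missing_fields(fields, {"trade_credit": "Net-30"}))
```

Because the form is built from the spec and not from the stored record, a partially completed record still renders every field, and the database only has to hold the values.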
Why do you say "this concept has already failed or I wouldn't be building a new system in the first place"? Is it because you suspect there must be a scheme for handling them in common?
Else I'd say to continue the existing philosophy, and establish additional tables. At least it would be sharing an existing pattern and maintaining some consistency in that respect.
Do a web search on "generalized specialized relational modeling". You'll find articles on how to set up tables that store the attributes of each kind of code, and the attributes common to all codes.
If you’re interested in object modeling, just search on “generalized specialized object modeling”.

Resources