I have a document DB that follows the following rough json structure:
{
id: 32,
Name: 'John',
Occupation:{
name: 'Programmer',
hours: 40
}
}
Is there a way to have a separate document collection or something similar that occupation can relate to? Such as:
{
id: 32,
Name: 'John',
Occupation: 'jobs/JOB_PROGRAMMER_ID'
}
// jobs document collection
{
id: 'JOB_PROGRAMMER_ID',
name: 'Programmer',
hours: 40
}
I can write out a backend querying function which will automatically resolve these relationships but was wondering if there was an inbuilt way to do this.
This is possible with Cosmos DB. The only catch is that, if you do not use multiple collections, you need to embed the related data within a single collection, as explained below.
Relational databases are not the only place where you can create
relationships between entities. In a document database you can have
information in one document that actually relates to data in other
documents. Now, I am not advocating for even one minute that we build
systems that would be better suited to a relational database in Azure
Cosmos DB, or any other document database, but simple relationships
are fine and can be very useful.
Embedding documents
You can store any type of reference you want within a document. However, as @gaurav mentioned in his comment, you will need to run additional queries to follow these relationships (whether in the same collection or in different collections).
Queries (as well as stored procedures) are scoped to a single collection (more accurately, to a single partition, although you can have cross-partition queries). They cannot span multiple collections; it is up to you to interpret your reference (whether a path, an id, etc.) and then use it as the basis for your follow-on query.
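The follow-on query described above can be sketched in application code. This is a minimal illustration, not a Cosmos DB API: the `COLLECTIONS` dictionary, `fetch` and `resolve_reference` are hypothetical names, and the in-memory dictionaries stand in for real single-collection queries. It assumes references are stored as 'collection/id' strings, as in the question.

```python
# In-memory stand-ins for two collections; a real application would query the database instead.
COLLECTIONS = {
    "people": {32: {"id": 32, "Name": "John", "Occupation": "jobs/JOB_PROGRAMMER_ID"}},
    "jobs": {"JOB_PROGRAMMER_ID": {"id": "JOB_PROGRAMMER_ID", "name": "Programmer", "hours": 40}},
}


def fetch(collection, doc_id):
    """A lookup that is always scoped to a single collection."""
    return COLLECTIONS[collection].get(doc_id)


def resolve_reference(doc, field):
    """Return a copy of doc where the 'collection/id' string in `field` is replaced by the referenced document."""
    target_collection, target_id = doc[field].split("/", 1)
    resolved = dict(doc)
    resolved[field] = fetch(target_collection, target_id)  # the follow-on query
    return resolved


person = resolve_reference(fetch("people", 32), "Occupation")
print(person["Occupation"]["name"])  # Programmer
```

The stored document keeps only the reference string; the resolved copy exists only in the application.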
If multiple queries are impractical for your solution, you would need to devise some other scheme (such as denormalizing, where you would store a subset of related data within your document, as a subdocument or collection of subdocuments). Note: with embedded documents, if the number of embedded documents is unbounded (e.g. no limit, like comments for a blog post, or replies to a tweet), you run a risk of exceeding maximum document size, which will break your app once this happens (unless you have alternative logic for storing content beyond a single document's size limit).
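One possible shape for the "alternative logic" mentioned above is to check the serialized size before embedding and to spill into a separate store once the limit would be exceeded. This is only a sketch: `add_comment`, `doc_size` and the `overflow` list are hypothetical, and the 2 MB figure is an assumption that should be checked against the documented limit of your database.

```python
import json

MAX_DOC_BYTES = 2 * 1024 * 1024  # assumed limit; verify against your database's documentation


def doc_size(doc):
    """Size of the document when serialized as UTF-8 JSON."""
    return len(json.dumps(doc).encode("utf-8"))


def add_comment(post, comment, overflow, max_bytes=MAX_DOC_BYTES):
    """Embed the comment if the post stays within max_bytes; otherwise store it
    separately with a post_id back-reference. Returns True if embedded."""
    candidate = dict(post, comments=post.get("comments", []) + [comment])
    if doc_size(candidate) <= max_bytes:
        post["comments"] = candidate["comments"]
        return True
    overflow.append(dict(comment, post_id=post["id"]))
    return False


post = {"id": 1, "comments": []}
overflow = []  # stand-in for a separate collection of non-embedded comments
add_comment(post, {"text": "first"}, overflow)
```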
Related
I'm trying to store data within Solr so that I can best maintain the indexes. The problem I'm having is that my data structure is heavily nested. Example:
Company
(to many) Person
(to many) Property
(to many) Network
(to many) SubNetwork
I'm trying to create a full-text search index for each SubNetwork that will display the current parent fields alongside it.
Currently my data is completely denormalised, e.g:
{
"company": "Coca-Cola",
"property": "1 plaza hotel",
"network": "ABC",
"subNetwork": "123"
}
Now, if a user were to go into the application and change the name of the company, right now (in the denormalized state) that would require Solr to partially update (atomic update) many documents, which doesn't feel very efficient. Re-indexing the whole index isn't a preferred solution, as this is a multi-tenant application.
I have tried putting the relational data in separate indexes and then using a join within Solr, but this does not copy the joined index's fields into the final result, which means a full-text search on all the fields isn't possible.
{!join from=inner_id to=outer_id}field:value
I'm trying to configure Solr in a way that when a parent record is updated, it only requires one atomic update but still retains the ability to search on all fields. Is this possible?
Unless you are seeing performance issues, your initial implementation seems correct, especially if you are returning subnetworks and may be searching on subnetwork and parent values at the same time.
An atomic update, under the covers, actually re-indexes the document anyway (and creates a new Lucene-level document). It also requires all fields to be stored so that the document can be recreated. And the join reduces the scoring flexibility you can have.
One optimization you could do is to NOT store the parent fields, but keep them index-only. This will be more space-efficient and less disk/record re-hydration work. But then you cannot return those fields to the user and would have to fetch them from the original source instead.
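To make the cost of the denormalized approach concrete, here is a sketch of the atomic-update payload for a company rename: one "set" operation per affected document. The document ids and the helper name `build_atomic_updates` are hypothetical; the list would typically be sent as JSON to the collection's update handler, and each entry still causes a full re-index of that document as described above.

```python
import json


def build_atomic_updates(doc_ids, field, new_value):
    """One atomic 'set' update per denormalized document that carries the parent field."""
    return [{"id": doc_id, field: {"set": new_value}} for doc_id in doc_ids]


# Hypothetical ids of the subNetwork documents that belong to the renamed company.
affected_ids = ["subnet-1", "subnet-2", "subnet-3"]
payload = build_atomic_updates(affected_ids, "company", "The Coca-Cola Company")
print(json.dumps(payload[0]))
# {"id": "subnet-1", "company": {"set": "The Coca-Cola Company"}}
```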
Background
I'm developing a web application that requires users to create and fill in custom tables (think HTML or MS Word tables, not database tables). The idea is that admins will create the templates for those tables, and general users will fill them in. Each general user can fill their own template, so there will be many tables that use the same template.
Now, these templates can hold any amount of columns, and each column can have a different data type. They can also change over time, meaning there should be versions of them. In other words, once a user fills a template, it should be stored with that structure, even if the template changes in the future. It will keep the old version of the template, while new users will use the newest version.
I usually work with relational databases, but for this scenario, a relational database doesn't seem like a good fit. I feel I would end up with a bunch of tables and require many joins to extract the data I need, especially if the database is normalized.
I thought about using something like MongoDB. I'm new to Mongo, so I'm sure I haven't explored every single option, but I thought it would be a better idea than using MySQL (which I've used for other apps). Do tell me if I'm wrong about this, though; maybe I'm missing something.
I want to be able to store the templates (name, columns (in order, with their names and data types)), then reuse these templates every time a user wants to fill them in. I would use a collection to store all tables that use the same template version. That is, a collection would be created each time a template is made. The collection should have defined data types that all documents inside should follow. If I'm understanding correctly, MongoDB offers schema validation for this.
My approach and questions
How can I store templates in the database? I thought about storing them as documents in a "templates" collection, which would be something like the example below, but I'm concerned about having the data types as strings.
{
name: <String>,
fields: [
{
name: <String>,
type: <String> <----this is what I have an issue with
},
...
]
}
I would have to read these templates and convert their structure into HTML tables so users can fill them, then store them as documents in the collections with the template name or id.
Something like:
{
template_id: <uuid, integer, string or something>
name: <String>,
rows: [
{
<field name>:<field value of specified type>,
...
}
]
}
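Regarding the concern about storing data types as strings: one common way to handle this is to map the stored type names to real types in application code and validate each row against the template. The sketch below is only an illustration; `TYPE_REGISTRY`, `validate_row` and the type names themselves are hypothetical choices, not part of MongoDB.

```python
from datetime import date

# Hypothetical registry: the type names stored in a template map to Python types.
TYPE_REGISTRY = {"string": str, "integer": int, "boolean": bool, "date": date}


def validate_row(template, row):
    """Return a list of error messages; an empty list means the row matches the template."""
    errors = []
    for field in template["fields"]:
        name, expected = field["name"], TYPE_REGISTRY[field["type"]]
        if name not in row:
            errors.append(f"missing field: {name}")
        elif type(row[name]) is not expected:  # exact match, so True is not accepted as an integer
            errors.append(f"{name}: expected {field['type']}, got {type(row[name]).__name__}")
    return errors


template = {
    "name": "inventory",
    "fields": [{"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}],
}
print(validate_row(template, {"item": "bolt", "qty": 3}))    # []
print(validate_row(template, {"item": "bolt", "qty": "3"}))  # ['qty: expected integer, got str']
```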
How feasible and recommendable is it to create collections on the fly?
If anything is unclear, let me know so I can clarify. Like I said, I'm new to MongoDB (and non-relational DBs in general), so if I'm missing something obvious, feel free to point it out and offer suggestions. Helpful links to the docs or tutorials are also welcome. I'm also open to trying other NoSQL solutions.
I had the same idea about storing templates in MongoDB for my application.
You should use a standard schema format for your project, because support libraries already exist for validating data against it.
JSON Schema:
The current version is draft-07: http://json-schema.org/
You can read about the schema standard there and find client libraries for many programming languages.
MongoDB:
As of version 3.6, MongoDB supports validation with JSON Schema draft-04:
https://docs.mongodb.com/manual/core/schema-validation/#json-schema
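As a rough sketch of how a stored template could be turned into such a validator: the function below builds a `$jsonSchema` document for the filled-in tables from the question. The `template_to_validator` name and the `BSON_TYPES` mapping are assumptions made for illustration; the resulting dictionary is what you would pass as the `validator` option when creating the collection.

```python
# Hypothetical mapping from the template's type names to BSON type aliases.
BSON_TYPES = {"string": "string", "integer": "int", "boolean": "bool", "date": "date"}


def template_to_validator(template):
    """Build a $jsonSchema validator for documents whose rows follow the template."""
    row_schema = {
        "bsonType": "object",
        "required": [f["name"] for f in template["fields"]],
        "properties": {f["name"]: {"bsonType": BSON_TYPES[f["type"]]} for f in template["fields"]},
    }
    return {
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["template_id", "rows"],
            "properties": {"rows": {"bsonType": "array", "items": row_schema}},
        }
    }


template = {
    "name": "inventory",
    "fields": [{"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}],
}
validator = template_to_validator(template)
```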
Storing the schema in MongoDB:
You will run into a problem when storing a JSON Schema in a collection:
JSON Schema uses some characters in its keys that MongoDB cannot store.
For reference, see https://accraze.info/storing-json-schema-templates-in-mongodb/
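One possible workaround (not necessarily the one described in the linked article) is to rename the problematic keys before storing and restore them after loading. The sketch assumes the offending keys are the ones that start with `$`, such as `$schema` and `$ref`; the `_dollar_` placeholder and both function names are hypothetical.

```python
def escape_keys(value, prefix="$", replacement="_dollar_"):
    """Recursively rename dictionary keys that start with `prefix`."""
    if isinstance(value, dict):
        return {
            (replacement + key[len(prefix):] if key.startswith(prefix) else key):
                escape_keys(item, prefix, replacement)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [escape_keys(item, prefix, replacement) for item in value]
    return value


def unescape_keys(value):
    """Reverse escape_keys after reading the schema back from the database."""
    return escape_keys(value, prefix="_dollar_", replacement="$")


schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "properties": {"author": {"$ref": "#/definitions/person"}},
}
stored = escape_keys(schema)
```

Only keys are renamed; values such as the `#/definitions/person` reference string are left untouched.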
I've been using only MongoDB in a web application for some time, but I'm wondering why some people say a schema-free or dynamic schema is powerful. So far I don't find it so fantastic or wonderful. Would anybody like to talk about the proper cases for using schema-free databases? First, I'd like to tell some of my stories.
What is schema free, the database or the code?
Most NoSQL databases like to say they are schema-free, but I think that, realistically, the important part is the code running in the application.
For example, the storage of user information could be schema free, but that doesn't mean you could store the username as an object or the password as a timestamp. The code for user login assumes that the username is a string and the password is a hash. Eventually that constrains the database storage to a schema.
Embedded documents are hard to maintain or to query
I created a CMS as the example to start my NoSQL database life. At the beginning the posts and comments data were stored like this
[
{
title: 'Mongo is Good',
content: 'Mongo is a NoSQL database.',
tags: ['Database', 'MongoDB', 'NoSQL'],
comments: [ COMMENT_0, COMMENT_1, ... ]
},
{
title: 'Design CMS',
content: 'Design a blog or something else.',
tags: ['Web', 'CMS'],
comments: [ COMMENT_2, COMMENT_3, ... ]
},
...
]
As you see, I embedded comments into a list in each post. It was quite convenient, as I could easily append a new comment to any post or retrieve comments along with the post. But soon I encountered the first problem: it was quite messy to delete a certain comment (usually spam) from the list. To my surprise, Mongo still hasn't implemented it.
Aside from that API-level problem, it is also hard to query embedded documents across the collection. If I had insisted on that design, the following queries could only have been implemented in brute-force ways:
recent comments
comments by one certain user
Eventually I had to place comments into another collection, with a post_id field storing the id of the post the comment belongs to, just like a foreign key in a relational database.
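With comments in their own collection, the two problematic queries and the deletion become straightforward. The sketch below uses an in-memory list as a stand-in for that collection; the field names other than post_id and the helper functions are hypothetical.

```python
from datetime import datetime

# Stand-in for a separate comments collection; post_id plays the role of a foreign key.
comments = [
    {"post_id": 1, "user": "ann", "text": "Nice post", "created": datetime(2012, 1, 3)},
    {"post_id": 2, "user": "bob", "text": "Buy pills", "created": datetime(2012, 1, 5)},
    {"post_id": 1, "user": "ann", "text": "One more thing", "created": datetime(2012, 1, 4)},
]


def recent_comments(limit):
    """Newest comments across all posts."""
    return sorted(comments, key=lambda c: c["created"], reverse=True)[:limit]


def comments_by_user(user):
    """All comments written by one user, regardless of post."""
    return [c for c in comments if c["user"] == user]


def delete_comment(post_id, text):
    """Remove a single comment (for example spam) without touching the post document."""
    comments[:] = [c for c in comments if not (c["post_id"] == post_id and c["text"] == text)]
```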
Despite the problems with the comments design, the embedded post tags are pretty helpful.
I found an opinion in this post
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it.
But what about changes in requirements? Is it too crazy to restructure a database only because a new query should be supported?
Cases where schema free is worth it
Some other cases do need schema-free storage. For example, a Twitter-like timeline, with data in the following format:
[
{
_id: ObjectId('aaa'),
type: 'tweet',
user: ObjectId('xxx'),
content: '0000',
},
{
_id: ObjectId('bbb'),
type: 'retweet',
user: ObjectId('yyy'),
ref: ObjectId('aaa'),
},
...
]
The problem is it won't be an easy job to render the documents into HTML. I render them in this way (Python)
# Map each update type to the function that renders it.
render_methods = {
    'tweet': render_tweet,
    'retweet': render_retweet,
}
result = [render_methods[u['type']](u) for u in updates]
This is because only the JSON data is stored, without any member functions. As a result, I have to manually map a render function to each update according to its type. (Something similar happens when the server sends the JSON to the browser intact via AJAX.)
The above problems confuse me a lot. Would anyone like to talk about good practice with schema-free databases, and whether it would be a good decision to mix a relational database with a schema-free database in a single application?
The main strength of schemaless databases comes to light when using them in an object-oriented context with inheritance.
Inheritance means that you have objects which have some attributes in common, but also some attributes which are specific to the sub-type of object.
Imagine, for example, a product catalog for a computer hardware store.
Every product will have the attributes name, vendor and price. But CPUs will have a clock_rate, hard drives will have a capacity, RAM both capacity and clock_rate and network cards a bandwidth. Doing this in a relational database leaves you two options which are equally cumbersome:
create a table with fields for all possible attributes, but leave most of them NULL for products where they don't apply.
create a secondary table "product_attributes" with productId, attribute_name and attribute_value.
A schemaless database, on the other hand, easily allows you to store items with different sets of optional properties in the same collection. The code that renders the product attributes to HTML would then check for the existence of each known optional property and call an appropriate function which outputs its value as a table row.
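A minimal sketch of that rendering approach, assuming products arrive as plain dictionaries: the formatter table, the units and the `render_rows` name are hypothetical and only illustrate the "check whether the property exists, then call its function" idea.

```python
from html import escape

# Hypothetical formatters for the optional attributes from the hardware-store example.
OPTIONAL_ATTRIBUTES = {
    "clock_rate": lambda v: f"{v} MHz",
    "capacity": lambda v: f"{v} GB",
    "bandwidth": lambda v: f"{v} Mbit/s",
}


def render_rows(product):
    """Render the common attributes plus whichever optional ones this product has."""
    rows = [("name", product["name"]), ("vendor", product["vendor"]), ("price", product["price"])]
    for key, fmt in OPTIONAL_ATTRIBUTES.items():
        if key in product:  # only render properties that exist on this document
            rows.append((key, fmt(product[key])))
    return "".join(f"<tr><td>{escape(k)}</td><td>{escape(str(v))}</td></tr>" for k, v in rows)


cpu = {"name": "X1", "vendor": "Acme", "price": 199, "clock_rate": 3200}
ram = {"name": "R8", "vendor": "Acme", "price": 49, "capacity": 8, "clock_rate": 1600}
html_cpu = render_rows(cpu)
html_ram = render_rows(ram)
```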
Another advantage of schemaless databases is the additional agility they give you during development. You can easily try new features without having to restructure your database. This makes it very easy to maintain backward compatibility with data created by a previous version of the application without having to run complicated database conversion routines. I am currently developing an MMORPG using MongoDB. During development I added lots of new features which required new data about each character to be persisted in the database. I never had to run a single command equivalent to CREATE TABLE or ALTER TABLE on my database. MongoDB just ate and spewed out whatever data I threw at it. My first test character is still playable, although I never did any intentional upgrading of its database document. It has some obsolete fields which are remnants of features I discarded or refactored, but these don't hurt at all. They could even be useful if my players start screaming for me to bring back a feature I removed.
Doing this in a relational database leaves you two options which are equally cumbersome:
There is one more option here, I guess:
Store the data in XML format and let the application serialize and deserialize it however it wants.
What choices are there for document-store databases that allow for relational data to be retrieved? To give a real example, say you have a database to store blog posts. I'd like to have the data look something like:
{id: 12345,
title: "My post",
body: "The body of my post",
author: {
id: 123,
name: "Joe Bloggs",
email: "joe.bloggs@example.com"
}
}
Now, you will likely have a number of these records that all share the author details. What I'd really like is to have the author itself stored as a different record in the database, so that if you update this one record every post record that links to it gets the updates as well. To date the only way I've seen mentioned to do this is to have the post record instead store an ID of the author record, so that the calling code will have to make two queries of the data store - one for the post and another for the author ID that is linked to the post.
Are there any document store databases that will allow me to make a single query and return a structured document containing the linked records? And preferably allow me to edit an internal part of the document, persist the document as a whole and have the correct thing happen [I.e. in the above, if I retrieved the entire document, changed the value of email and persisted the entire document then the email address of the author record is changed, and reflected in all posts that have that author...]
First, let me acknowledge: This particular type of data is somewhat relational by nature. It just depends on exactly how you want to structure this type of data, and what technologies you have easy access to for this particular project. That said, how do you want your data structured?
If you can structure your data any way you want, you could go with something like this:
{
name: 'Joe',
email: 'joe.bloggs@ex.com',
posts: [
{
id: 123,
title: "My post"
},
{..}
]
}
Here, all the posts are contained in one particular key/value pair. This particular type of data is, I would say, uniquely suited for Riak (since it can query internally against JSON using JavaScript natively). You could probably come at it from just about any of the NoSQL data stores' points of view (Cassandra, Couch, Mongo, et al.), as most of them can store straight-up JSON. I just have a tendency towards Riak at this point, due to my personal experience with it.
The more interesting things that you'll probably run up against will relate to how you deal with the data store. For instance, I really like using Ripple for Ruby, which lets me deal with this kind of data in Riak real easy. But if you're in Java land, that might make adoption of this technique a bit more difficult (though I haven't spent a lot of time looking in to Java adoption of Riak), since it tends to lag on 'edge' style data storage techniques.
What is more than that, getting your brain to start thinking in NoSQL terms, or without using 'relations' is what usually takes the longest in structuring data. Because there isn't a schema, and there aren't any preconceptions that come with it, that means that you can do a lot of things that are thought of as simply wrong in the relational DB world. Like storing all of the blog posts for a single user in one document, which just wouldn't work in the standard schema-heavy strongly table based relational world.
The more I read about NoSQL, the more it begins to sound like a column oriented database to me.
What's the difference between NoSQL (e.g. CouchDB, Cassandra, MongoDB) and a column oriented database (e.g. Vertica, MonetDB)?
NoSQL is a term commonly expanded as "Not Only SQL"; it covers four major categories: key-value, document, column-family and graph databases.
Key-value databases are well-suited to applications that have frequent small reads and writes along with simple data models.
These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.
e.g. Redis, Riak etc.
Document databases can store records with varying attributes along with large amounts of data
e.g. MongoDB , CouchDB etc.
Column family databases are designed for large volumes of data, read and write performance, and high availability
e.g Cassandra, HBase etc.
A graph database uses graph structures with nodes, edges and properties to represent and store data, and supports semantic queries over them
e.g Neo4j, InfiniteGraph etc.
Before understanding NoSQL, you have to understand some key concepts.
Consistency – All the servers in the system will have the same data so anyone using the system will get the same copy regardless of which server answers their request.
Availability – The system will always respond to a request (even if it's not the latest data or consistent across the system or just a message saying the system isn't working) .
Partition Tolerance – The system continues to operate as a whole even if individual servers fail or can't be reached.
Most of the time, a NoSQL database satisfies only two of the three properties above.
From your question,
CouchDB : AP ( Availability & Partition) & Document database
Cassandra : AP ( Availability & Partition) & Column family database
MongoDB : CP ( Consistency & Partition) & Document database
Vertica : CA ( Consistency & Availability) & Column-oriented relational database
MonetDB : ACID (Atomicity Consistency Isolation Durability) & Relational database
From : http://blog.nahurst.com/visual-guide-to-nosql-systems
Have a look at this article1 , article2 and ppt for various scenarios to select a particular type of database.
Some NoSQL databases are column-oriented databases, and some SQL databases are column-oriented as well. Whether the database is column or row-oriented is a physical storage implementation detail of the database and can be true of both relational and non-relational (NoSQL) databases.
Vertica, for example, is a column-oriented relational database so it wouldn't actually qualify as a NoSQL datastore.
A "NoSQL movement" datastore is better defined as a non-relational, shared-nothing, horizontally scalable database without (necessarily) ACID guarantees. Some column-oriented databases can be characterized this way. Besides column stores, NoSQL implementations also include document stores, object stores, tuple stores, and graph stores.
A NoSQL database is a different paradigm from traditional schema-based databases. NoSQL databases are designed to scale and to hold documents such as JSON data. Obviously they have a way of querying information, but you should expect syntax like eval("person = * and age > 10") for retrieving data. Even if they support a standard SQL interface, they are intended for something else, so if you like SQL you should stick to traditional databases.
A column-oriented database differs from a traditional row-oriented database in how it stores data. By storing a whole column together instead of a whole row, it minimizes disk access when you select a few columns from a row containing many columns. In row-oriented databases it makes no difference whether you select just one or all fields from a row.
You have to pay for a more expensive insert though. Inserting a new row will cause many disk operations, depending on the number of columns.
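The trade-off can be illustrated with two in-memory layouts of the same data. This is only a toy model (real engines work on disk pages, not Python lists), and the table contents and function names are made up for the example.

```python
# Row-oriented layout: each record is kept together.
rows = [
    {"id": 1, "name": "Ann", "age": 31, "city": "Oslo"},
    {"id": 2, "name": "Bob", "age": 45, "city": "Rome"},
    {"id": 3, "name": "Cy", "age": 28, "city": "Lima"},
]

# Column-oriented layout: one list per column, values aligned by position.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def average_age_row_store(rows):
    # Every full row is touched even though only one field is needed.
    return sum(row["age"] for row in rows) / len(rows)


def average_age_column_store(columns):
    # Only the 'age' column is read.
    return sum(columns["age"]) / len(columns["age"])


def insert_column_store(columns, row):
    # One append per column: the insert cost grows with the number of columns.
    for key, value in row.items():
        columns[key].append(value)
```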
But there's no difference with traditional databases in terms of SQL, ACID, foreign keys and stuff like that.
I would suggest reading the taxonomy section of the NoSQL wikipedia entry to get a feel for just how different NoSQL databases are from a traditional schema-oriented database. Being column-oriented implies rows and columns, which implies a (two dimensional) schema, while NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).
For document stores, the structure and contents of each "document" are independent of other documents in the same "collection". Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, "removing" a field could mean that you simply stop referring to it in your code rather than going to the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest contents). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.
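In code, "adding a field" under this model usually just means reading it with a default. A small sketch, with hypothetical documents and helper names:

```python
posts = [
    {"title": "Old post"},              # written before the 'votes' field existed
    {"title": "New post", "votes": 5},  # written by the newer code
]


def votes_of(post):
    """Older documents simply have no value (None) for the new field."""
    return post.get("votes")


def total_votes(posts):
    """Treat the missing field as zero instead of migrating every old document."""
    return sum(post.get("votes", 0) for post in posts)


print(votes_of(posts[0]), total_votes(posts))  # None 5
```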
Documents can also hold lists as well as other nested documents. Here's a sample document from MongoDB (a post from a blog or other forum), represented as JSON:
{
_id : ObjectId("4e77bb3b8a3e000000004f7a"),
when : Date("2011-09-19T02:10:11.3Z"),
author : "alex",
title : "No Free Lunch",
text : "This is the text of the post. It could be very long.",
tags : [ "business", "ramblings" ],
votes : 5,
voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
comments : [
{ who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
comment : "I agree." },
{ who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
comment : "You must be joking. etc etc ..." }
]
}
Note how "comments" is a list of nested documents with their own independent structure. Queries can "reach into" these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.
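The two example queries can be mimicked over plain Python data to show what "reaching into" nested documents means. In MongoDB itself the first one would typically be expressed with dot notation, for example a filter such as {"comments.who": "jane"}; the sample posts and function names below are hypothetical.

```python
from datetime import datetime

posts = [
    {"title": "No Free Lunch", "comments": [
        {"who": "jane", "when": datetime(2011, 9, 19, 4, 0, 10), "comment": "I agree."},
        {"who": "meghan", "when": datetime(2011, 9, 20, 14, 36, 6), "comment": "You must be joking."},
    ]},
    {"title": "Second post", "comments": [
        {"who": "li", "when": datetime(2011, 10, 1, 9, 0, 0), "comment": "Interesting."},
    ]},
    {"title": "No comments yet"},  # documents need not have the field at all
]


def posts_with_comment_by(posts, who):
    """Posts that have at least one nested comment by the given person."""
    return [p for p in posts if any(c["who"] == who for c in p.get("comments", []))]


def posts_with_comment_between(posts, start, end):
    """Posts that have at least one nested comment with start <= when < end."""
    return [p for p in posts if any(start <= c["when"] < end for c in p.get("comments", []))]
```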
So in short, two of the major differences typical of NoSQL databases are the lack of a (formal) schema and contents that go beyond the two dimensional orientation of a traditional row/column database.
Distinguishing between column stores: read this blog. It answers your question.
As @tuinstoel wrote, the article answers your question in point 3:
3. Interface. Group A is distinguished by being part of the NoSQL movement and does not typically have a traditional SQL interface. Group B supports standard SQL interfaces.
Here is how I see it: column-oriented databases deal with the way data is physically stored on disk. As the name suggests, each column is stored in its own separate space/file. This allows for two important things:
You achieve a better compression ratio, on the order of 10:1, because you have a single data type to deal with.
You achieve better data read performance because you avoid whole row scans and can just pick and choose the columns specified in your SELECT query.
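To illustrate the compression point: when the values of one column sit next to each other, they share a type and often repeat, so even a very simple scheme works well. Run-length encoding is used here purely as an example; it is not a claim about what any particular column store does, and the sample column is made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


country_column = ["NO", "NO", "NO", "IT", "IT", "PE", "PE", "PE", "PE"]
encoded = run_length_encode(country_column)
print(encoded)  # [('NO', 3), ('IT', 2), ('PE', 4)]
```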
NoSQL databases, on the other hand, are a whole new breed of databases that define "logical" aggregate levels to describe the data. Some treat the data as having a hierarchical relationship (the aggregate being a "node"), while others treat the data as documents (the document being the aggregate level). They do not dictate the physical storage strategy (some may, but it is abstracted away from the end user).
Also, the whole NoSQL movement has more to do with unstructured data, or rather data sets whose schema cannot be predefined or is unknown beforehand, and which therefore cannot conform to the strict relational model.
Column-oriented databases still deal with relational data, although they eliminate the need for indexes etc.