Being new to CouchDB, just wanted to discuss the best practice for structuring a database and documents. My background is from MySQL, so still trying to get a handle on document-driven databases.
To outline the system: we have several clients who each access a separate website with separate data. Each client's data will be split into its own database. Each database will have data constantly inserted (every 5 minutes, for at least a year) for logging events; a new document is created every 5 minutes with a timestamp and value. We also need to store some information about the client, which is a single document that rarely, if ever, gets updated.
Below is an example of how one client database looks...
{
"_id": "client_info",
"name": "Client Name",
"role": "admin",
....
},
{
"_id": "1199145600",
"alert_1_value": 0.150,
"alert_2_value": 1.030,
"alert_3_value": 12.500,
...
},
{
"_id": "1199145900",
"alert_1_value": 0.150,
"alert_2_value": 1.030,
"alert_3_value": 12.500,
...
},
{
"_id": "1199146200",
"alert_1_value": 0.150,
"alert_2_value": 1.030,
"alert_3_value": 12.500,
...
},
etc... literally millions more of these, one created every 5 minutes...
My question is: is this sort of structure correct? I understand CouchDB has a flat document space (no tables), but there will be literally millions of these timestamp/value documents in the database. I may just be being picky, but it seems a little disorganized to me.
Thanks!
Use the timestamp as your _id if it's guaranteed to be unique. This dramatically improves CouchDB's ability to maintain its b-trees, both for the documents themselves and for building and maintaining views, and it also saves you the len(_id) bytes a separately generated id would take.
Each doc you add (for such small data) adds some overhead in b-tree space. In your view (the logical equivalent of your SQL query) you can always parse the doc fields and emit them separately, or multiple times, if needed.
This type of unchanging data is a great fit for CouchDB. As the data is added to couch, you can trigger a view update periodically, and the view will build the query data in advance. This means that, unlike SQL, where you'd calculate that aggregate data on the fly each time, couch will simply read that data, cached in the view b-tree's intermediate nodes. Much faster.
So the typical CouchDB approach is:
- model your transactions to minimise the # of docs (i.e. denormalise)
- use different views if needed to filter or sort results differently.
I guess you'll want to produce aggregate stats across that time period. This will likely be much more efficient (CPU-wise) in Erlang, so take a look at https://github.com/apache/couchdb/blob/trunk/src/couchdb/couch_query_servers.erl#L172-205 to see how the built-in reduce functions are done.
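Here's a minimal sketch of that approach in Python, using the requests library against CouchDB's HTTP API. The database, design document and view names are hypothetical, and the map function assumes your timestamp _ids; "_stats" is one of the built-in Erlang reduces linked above.

import requests

BASE = "http://localhost:5984/client_db"   # hypothetical per-client database

# Design document: the map emits a [year, month, day, hour] key per sample so
# the built-in _stats reduce can pre-aggregate sum/count/min/max per period.
design = {
    "views": {
        "alert_1_by_time": {
            "map": """
                function (doc) {
                  if (doc.alert_1_value !== undefined) {
                    var d = new Date(parseInt(doc._id, 10) * 1000);
                    emit([d.getUTCFullYear(), d.getUTCMonth() + 1,
                          d.getUTCDate(), d.getUTCHours()], doc.alert_1_value);
                  }
                }
            """,
            "reduce": "_stats"
        }
    }
}
requests.put(BASE + "/_design/stats", json=design)

# Hourly aggregates: group_level=4 groups results by [year, month, day, hour].
resp = requests.get(BASE + "/_design/stats/_view/alert_1_by_time",
                    params={"group_level": 4})
print(resp.json()["rows"][:3])

Once the view is built, subsequent queries just read the pre-computed totals out of the view b-tree.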
I need to be able to search records that have any user ID within a group of user IDs in the query.
However, the number of user IDs that must be searched will grow substantially over time. Therefore, I must be able to add thousands of user IDs to a single query and search across all of them.
I'm considering using ElasticSearch for this via a managed service like bonsai.
How well does ElasticSearch perform when queried with thousands of conditions?
The answer depends on lots of things (number of servers, RAM, CPU, etc), and it will probably take some experimentation to figure out what works best for you. I'm confident that Elasticsearch can solve your problem, but it's hard to predict performance in general.
You might want to investigate terms lookup. Basically you store all the terms for which you want to search in a document in the index (or another one), then you can reference that list in your search.
So you could save the IDs you want to search for as
PUT /test_index/idlist/1
{
"ids" : [2,1982,939,1982,98716,7611,983838,...]
}
Then you can search another type using that list, for example with a top-level filter:
POST /test_index/doc/_search
{
"filter": {
"terms": {
"id": {
"index": "test_index",
"type": "idlist",
"id": "1",
"path": "ids"
}
}
}
}
This probably only makes sense if you're going to run the same query more than once. You could have more than one list of IDs, though, and give the documents holding lists descriptive IDs if it helps.
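If you're using the Python client, a minimal sketch of the same idea looks like this. The index, type and field names are the hypothetical ones from the example above, and the top-level filter syntax assumes the same 1.x-era API.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Store the list of user IDs once as its own document...
es.index(index="test_index", doc_type="idlist", id=1,
         body={"ids": [2, 1982, 939, 98716, 7611]})

# ...then reference that document from the search instead of inlining
# thousands of terms in every query.
query = {
    "filter": {
        "terms": {
            "id": {
                "index": "test_index",
                "type": "idlist",
                "id": "1",
                "path": "ids"
            }
        }
    }
}
results = es.search(index="test_index", doc_type="doc", body=query)
print(results["hits"]["total"])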
Using a managed service makes it easy to experiment with different cluster setups (number of nodes, size of machines, data center, and so on). I would suggest you take a look at Qbox (I'm biased, since I work with Qbox). New customers get a $40 introductory credit, which is usually enough to experiment with a proof of concept.
I've been using only MongoDB in a web application for some time, but I'm wondering why some people say schema-free (or dynamic schema) is so powerful. Right now I don't find it so fantastic or wonderful. Would anybody like to talk about the proper cases for using schema-free databases? First I'd like to tell some of my stories.
What is schema free, the database or the codes?
Most of the NoSQL databases like to say they are schema-free, but I think, down to earth, the important part is the code running in the application.
For example, the storage of user information could be schema-free, but that doesn't mean you could store the username as an object or the password as a timestamp. The code for user login assumes that the username is a string and the password is a hash, and eventually that constrains the database storage to a schema anyway.
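What I mean is something like this small Python sketch (the field names and the validate_user helper are just illustrative):

from datetime import datetime

def validate_user(doc):
    # The application's implicit schema for a user document.
    assert isinstance(doc.get("username"), str), "username must be a string"
    assert isinstance(doc.get("password_hash"), str), "password must be a hash string"
    assert isinstance(doc.get("created_at"), datetime), "created_at must be a datetime"
    return doc

# Anything that doesn't fit the code's expectations breaks at login time,
# so in practice the storage ends up constrained to a schema anyway.
user = validate_user({
    "username": "alice",
    "password_hash": "5f4dcc3b5aa765d61d8327deb882cf99",
    "created_at": datetime.utcnow(),
})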
Embedded documents are hard to maintain or to query
I created a CMS as an example to start my NoSQL database life. At the beginning, the post and comment data were stored like this:
[
{
title: 'Mongo is Good',
content: 'Mongo is a NoSQL database.',
tags: ['Database', 'MongoDB', 'NoSQL'],
comments: [ COMMENT_0, COMMENT_1, ... ]
},
{
title: 'Design CMS',
content: 'Design a blog or something else.',
tags: ['Web', 'CMS'],
comments: [ COMMENT_2, COMMENT_3, ... ]
},
...
]
As you see, I embedded comments as a list in each post. It was quite convenient, as I could easily append a new comment to any post or retrieve comments along with the post. But soon I encountered the first problem: it was quite messy to delete a certain comment (usually spam) from the list. To my surprise, Mongo still hasn't implemented it.
Aside from that API-level problem, it is also hard to query embedded documents across the collection. If I insisted on that design, the following queries could only be implemented in brute-force ways:
recent comments
comments by one certain user
Eventually I had to place comments into another collection, with a post_id field storing the id of the post each comment belongs to, just like an FK in a relational database.
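Roughly, the new design looks like this with pymongo (collection and field names are just what I used; the ObjectIds below are placeholders):

from datetime import datetime
from bson import ObjectId
from pymongo import MongoClient, DESCENDING

db = MongoClient()["cms"]
post_id = ObjectId("5f0000000000000000000001")   # hypothetical post _id
user_id = ObjectId("5f0000000000000000000002")   # hypothetical user _id

# Each comment carries a post_id, much like a foreign key in a relational DB.
comment_id = db.comments.insert_one({
    "post_id": post_id,
    "user": user_id,
    "text": "Nice article!",
    "created_at": datetime.utcnow(),
}).inserted_id

# The queries that were awkward with embedded comments become trivial:
recent = db.comments.find().sort("created_at", DESCENDING).limit(10)
by_user = db.comments.find({"user": user_id})
for_post = db.comments.find({"post_id": post_id})

# And deleting a single spam comment is a one-liner.
db.comments.delete_one({"_id": comment_id})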
In contrast to the comments design, the embedded post tags are pretty helpful.
I found an opinion in this post
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it.
But what about changes in requirements? Is it too crazy to restructure a database only because a new query should be supported?
The cases where schema-free is worth it
There are some other cases that do need schema-free storage. For example, a Twitter-like timeline, with data in the following format:
[
{
_id: ObjectId('aaa'),
type: 'tweet',
user: ObjectId('xxx'),
content: '0000',
},
{
_id: ObjectId('bbb'),
type: 'retweet',
user: ObjectId('yyy'),
ref: ObjectId('aaa'),
},
...
]
The problem is that it won't be an easy job to render these documents into HTML. I render them this way (Python):
# map each update type to its render function
render_methods = {
    'tweet': render_tweet,
    'retweet': render_retweet,
}
result = [render_methods[u['type']](u) for u in updates]
This is because only the JSON data is stored, without member functions; as a result I have to manually map a render function to each update according to its type. (Similar things happen when the server sends the JSON to the browser intact via AJAX.)
The above problems confuse me a lot. Would anyone like to talk about good practice with schema-free databases, and whether it would be a good decision to mix a relational database with a schema-free database in a single application?
The main strength of schemaless databases comes to light when using them in an object-oriented context with inheritance.
Inheritance means that you have objects which have some attributes in common, but also some attributes which are specific to the sub-type of object.
Imagine, for example, a product catalog for a computer hardware store.
Every product will have the attributes name, vendor and price. But CPUs will have a clock_rate, hard drives will have a capacity, RAM both a capacity and a clock_rate, and network cards a bandwidth. Doing this in a relational database leaves you two options which are equally cumbersome:
create a table with fields for all possible attributes, but leave most of them NULL for products where they don't apply.
create a secondary table "product_attributes" with product_id, attribute_name and attribute_value.
A schemaless database, on the other hand, easily allows you to store items with different sets of optional properties in the same collection. The code to render the product attributes to HTML would then check for the existence of each known optional property and call an appropriate function which outputs its value as a table row.
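A short sketch of that rendering approach in Python (the attribute names and row formatters are illustrative):

# Known optional properties and how to render each one as a table row.
OPTIONAL_ATTRS = {
    "clock_rate": lambda v: "<tr><td>Clock rate</td><td>%s GHz</td></tr>" % v,
    "capacity":   lambda v: "<tr><td>Capacity</td><td>%s GB</td></tr>" % v,
    "bandwidth":  lambda v: "<tr><td>Bandwidth</td><td>%s Mbit/s</td></tr>" % v,
}

def render_product(doc):
    # Attributes every product has.
    rows = ["<tr><td>%s</td><td>%s</td></tr>" % (k, doc[k])
            for k in ("name", "vendor", "price")]
    # Only render the optional attributes this particular product actually has.
    for attr, render in OPTIONAL_ATTRS.items():
        if attr in doc:
            rows.append(render(doc[attr]))
    return "<table>%s</table>" % "".join(rows)

print(render_product({"name": "Fast RAM", "vendor": "Acme",
                      "price": 79.90, "capacity": 16, "clock_rate": 3.2}))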
Another advantage of schemaless databases is the additional agility during development. They easily allow you to try new features without having to restructure your database, which makes it very easy to maintain backward compatibility with data created by a previous version of the application without running complicated database conversion routines.
I am currently developing an MMORPG using MongoDB. During development I added lots of new features which required new data about each character to be persisted to the database. I never had to run a single command equivalent to CREATE TABLE or ALTER TABLE on my database; MongoDB just ate and spewed out whatever data I threw at it. My first test character is still playable, although I never did any intentional upgrading of its database document. It has some obsolete fields which are remnants of features I discarded or refactored, but these don't hurt at all, and they would even be useful if my players screamed for me to bring back a feature I removed.
Doing this in a relational database leaves you two options which are equally cumbersome:
There is one more option, I guess:
Store the data in XML format and let the application serialize/deserialize it the way it wants.
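For example, a small sketch in Python (the table and column names are made up), serializing the variable attributes to XML and keeping them in a single text column:

import sqlite3
import xml.etree.ElementTree as ET

def attrs_to_xml(attrs):
    root = ET.Element("attributes")
    for key, value in attrs.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, attrs_xml TEXT)")
conn.execute("INSERT INTO product (name, attrs_xml) VALUES (?, ?)",
             ("Fast RAM", attrs_to_xml({"capacity": 16, "clock_rate": 3.2})))

# The application deserializes the XML back into a dict when it needs it.
xml_blob = conn.execute("SELECT attrs_xml FROM product").fetchone()[0]
print({child.tag: child.text for child in ET.fromstring(xml_blob)})

The obvious trade-off is that the database can no longer index or query the serialized attributes directly.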
I want to deploy a small web project I have in mind, where the data I want to save consists of structs with nested structs inside, and most of the time the inner structs do not have the same fields and types.
For example, I'd like something like this to be a "row" in a table:
{
    "event": "The Oscars",
    "place": "Los Angeles, USA",
    "date": "March 2, 2014",
    "awards": {
        "bestMovie": {
            "name": "someName",
            "director": "someDirector",
            "actors": [
                ... etc
            ]
        },
        "bestActor": "someActor"
    }
}
(JSON objects are easy for me to use at the moment, and easy to pass between server and client side. The client side runs JavaScript.)
I started with MySQL/PHP but very soon I saw that it doesn't suit me. I tried MongoDB for a few days, but I don't know exactly how to refine my search for the best db to use.
I want to be able to set some object models/schemas, and select exactly which part to update and which fields are unique in each struct.
Any suggestions? Thanks.
This is not an answerable question, so it will likely be closed.
There is no one right answer here, and there are a few questions to ask, such as data structure, speed requirements, and the old CAP theorem question of what you need:
Consistency
Availability
Partition tolerance
I would suggest Mongo is a great place to start if you are casually working away and don't anticipate having to deal with any of the issues above at scale. Couch is another similar option, but it doesn't have the same community size.
I say mongo because your data is denormalized into a document and mongo is good at serving documents. It also speaks json!
An RDBMS would require you to normalize your documents into tables and create relationships, which is quite a bit of work from where you are, relative to just sticking the documents into a document store.
You could serialize the data using protocol buffers and put that in an rdbms but this is not advisable.
For blazing speed you could use Redis, which has constant-time lookups in memory. But this is better suited (in most cases) to ephemeral data like user sessions, not long-term persistent storage.
Finally, there are graph databases like Neo4j, which are document-like databases that store relationships between nodes with typed edges. This suits social and recommendation problems quite well, but that's probably not the problem you're trying to solve; the question simply asks what is best for storing your data.
Looking at the possibilities, I think you'll find Mongo best suits your needs, as you already have JSON document structures and only need simple persistence for those documents.
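Since you mentioned wanting to update specific parts and enforce unique fields, here is a minimal pymongo sketch of those two things (the database, collection and field names are taken from your example and otherwise hypothetical):

from pymongo import MongoClient, ASCENDING

events = MongoClient()["awards_db"]["events"]

# Enforce uniqueness on a field: here, one document per event name.
events.create_index([("event", ASCENDING)], unique=True)

events.insert_one({
    "event": "The Oscars",
    "place": "Los Angeles, USA",
    "date": "March 2, 2014",
    "awards": {
        "bestMovie": {"name": "someName", "director": "someDirector"},
        "bestActor": "someActor",
    },
})

# Update exactly one nested field without touching the rest of the document.
events.update_one({"event": "The Oscars"},
                  {"$set": {"awards.bestActor": "someOtherActor"}})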
I have an application developed using the MVC pattern and I would now like to index multiple models of it, meaning each model has a different data structure.
Is it better to use multiple indices, one for each model, or to have a type within the same index for each model? Both ways would also require different search queries, I think. I've just started on this.
Are there performance differences between the two approaches if the data set is small or huge?
I would test the second question myself if somebody could recommend some good sample data for that purpose.
There are different implications to both approaches.
Assuming you are using Elasticsearch's default settings, having one index for each model will significantly increase the number of shards: one index uses 5 shards, so 5 data models will use 25 shards, while having 5 object types in one index still uses only 5 shards.
Implications of having each data model as an index:
Efficient and fast to search within an index, as the amount of data in each shard should be smaller, since it is distributed across different indices.
Searching a combination of data models from 2 or more indices is going to generate overhead, because the query will have to be sent to more shards across indices, compiled and sent back to the user.
Not recommended if your data set is small since you will incur more storage with each additional shard being created and the performance gain is marginal.
Recommended if your data set is big and your queries are taking a long time to process, since dedicated shards are storing your specific data and it will be easier for Elasticsearch to process.
Implications for having each data model as an object type within an index:
More data will be stored within the 5 shards of an index, which means there are fewer overhead issues when you query across different data models, but your shard size will be significantly bigger.
More data within the shards is going to take Elasticsearch longer to search through, since there are more documents to filter.
Not recommended if you know you are going through terabytes of data and you are not distributing your data across different indices or multiple shards in your Elasticsearch mapping.
Recommended for small data sets, because you will not waste storage space for a marginal performance gain, since each shard takes up space in your hardware.
If you are asking what counts as too much data vs. small data: typically it depends on the processor speed and RAM of your hardware, the amount of data you store within each variable in your mapping, and your query requirements; using many facets in your queries is going to slow down your response time significantly. There is no straightforward answer, and you will have to benchmark according to your needs.
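To make the shard arithmetic concrete, here is a sketch with the Python client. The index, type and field names are hypothetical; multiple types per index and the "string" field type assume an older (pre-6.x) cluster, with the default 5 primary shards per index.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Option A: one index per model -> 2 models = 10 shards with default settings.
es.indices.create(index="articles")
es.indices.create(index="users")

# Option B: one shared index with a type per model -> 5 shards in total.
es.indices.create(index="app", body={
    "mappings": {
        "article": {"properties": {"title": {"type": "string"}}},
        "user":    {"properties": {"name":  {"type": "string"}}},
    }
})

# Querying one model vs. several models:
es.search(index="articles", body={"query": {"match_all": {}}})        # option A
es.search(index="app", doc_type=["article", "user"],
          body={"query": {"match_all": {}}})                          # option B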
Although Jonathan's answer was correct at the time, the world has moved on and it now seems that the people behind ElasticSearch have a long term plan to drop support for multiple types:
Where we want to get to: We want to remove the concept of types from Elasticsearch, while still supporting parent/child.
So for new projects, using only a single type per index will make the eventual upgrade to ElasticSearch 6.x easier.
Jonathan's answer is great. I would just add a few other points to consider:
the number of shards can be customized per solution you select. You may have one index with 15 primary shards, or split it into 3 indices of 5 shards each; the performance perspective won't change (assuming data is distributed equally)
think about data usage, i.e. if you use Kibana to visualize, it's easier to include/exclude particular indices, but types have to be filtered in the dashboard
data retention: for application log/metric data, use different indices if you require different retention periods
Both the above answers are great!
I am adding an example of several types in an index.
Suppose you are developing an app to search for books in a library.
There are a few questions to ask the library owner:
Questions:
How many books are you planning to store?
What kind of books are you going to store in the library?
How are you going to search for books?
Answers:
I am planning to store 50k to 70k books (approximately).
I will have 15k to 20k technology-related books (computer science, mechanical engineering, chemical engineering and so on), 15k historical books, 10k medical science books, and 10k language-related books (English, Spanish and so on).
Search by author's first name, author's last name, year of publication, and name of the publisher. (This gives you an idea of what information you should store in the index.)
From the above answers we can say the schema in our index should look somewhat like this:
// This is not the exact mapping, just for the example
{
    "yearOfPublish": {
        "type": "integer"
    },
    "author": {
        "type": "object",
        "properties": {
            "firstName": {
                "type": "string"
            },
            "lastName": {
                "type": "string"
            }
        }
    },
    "publisherName": {
        "type": "string"
    }
}
In order to achieve the above, we can create one index called Book and have various types.
Index: Book
Types: Science, Arts
(Or you can create many more types, such as Technology, Medical Science, History and Language, if you have a lot more books.)
The important thing to note here is that the schema is similar, but the data is not identical. The other important thing is the total amount of data you are storing.
Hope the above helps when deciding whether to go for different types in an index; if you have different schemas you should consider different indices. Small index for less data, big index for big data :-)
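For completeness, a short sketch of the library example with the Python client (again assuming an older, multi-type-capable cluster; the index, type and field names follow the example above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One "book" index, one type per broad category, same schema for both.
es.index(index="book", doc_type="science",
         body={"yearOfPublish": 1999, "publisherName": "somePublisher",
               "author": {"firstName": "Andrew", "lastName": "Tanenbaum"}})
es.index(index="book", doc_type="arts",
         body={"yearOfPublish": 1851, "publisherName": "someOtherPublisher",
               "author": {"firstName": "Herman", "lastName": "Melville"}})

# Search the whole library, or narrow down to a single category (type).
es.search(index="book",
          body={"query": {"match": {"author.lastName": "Melville"}}})
es.search(index="book", doc_type="science",
          body={"query": {"match": {"author.lastName": "Tanenbaum"}}})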
Cassandra doesn't have anything in CQL like the LIKE clause in MySQL for searching more specific data in the database.
I have looked through some material and came up with some ideas:
1. Using Hadoop
2. Using a MySQL server as another database server
But are there any ways I can improve my Cassandra DB performance more easily?
Improving your Cassandra DB performance can be done in many ways, but I feel like you need to query the data efficiently which has nothing to do with performance tweaks on the db itself.
As you know, Cassandra is a nosql database, which means when dealing with it, you are sacrificing flexibility of queries for fast read/writes and scalability and fault tolerance. That means querying the data is slightly harder. There are many patterns which can help you query the data:
Know what you need in advance. As querying with CQL is slightly less flexible than what you'd find in an RDBMS engine, you can take advantage of the fast reads/writes and save the data you want to query in the proper format by duplicating it. Too complex?
Imagine you have a user entity that looks like this:
{
"pk" : "someTimeUUID",
"name": "someName",
"address": "address",
"birthDate": "someBirthDate"
}
If you persist users like that, you will get a list of users sorted in the order they joined your db (you persisted them). Let's assume you want the same list of users, but only those named "John". It is possible to do that with CQL, but it is slightly inefficient. What you can do to amend this problem is to denormalize your data by duplicating it, in order to fit the query you are going to execute over it. You can read more about this here:
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
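Here is a small sketch of that idea in CQL via the Python driver (the keyspace, table and column names are made up); users_by_name is the duplicated, query-shaped copy of the data:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")   # hypothetical keyspace

# A second table keyed by name: the same user data, duplicated to fit the query.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_name (
        name text,
        joined timeuuid,
        address text,
        birth_date text,
        PRIMARY KEY (name, joined)
    )
""")

# On every insert, write to both the original table and the query-shaped copy.
session.execute(
    "INSERT INTO users_by_name (name, joined, address, birth_date) "
    "VALUES (%s, now(), %s, %s)",
    ("John", "someAddress", "someBirthDate"))

# Now "all users named John, in join order" is a single efficient query.
rows = session.execute("SELECT * FROM users_by_name WHERE name = %s", ("John",))
for row in rows:
    print(row.name, row.address)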
However, this approach works for simple queries; for complex queries it is somewhat hard to achieve, and if you are unsure what you are going to query in advance, there is no way to store the data in the proper manner beforehand.
Hadoop comes to the rescue. As you know, you can use Hadoop's MapReduce to solve tasks involving a large amount of data, and Cassandra data, in my experience, can become very, very large. With Hadoop, to solve the above example, you would iterate over the data as it is and, in each map method, check whether the user is named John; if so, write it to the context.
Here is how the pseudocode would look:
// pseudocode for the map method
map(data) {
    if ("John".equals(data.getColumn("name"))) {
        context.write(data);
    }
}
At the end of the map phase, you would end up with a list of all users named John. You could put a time range (range slice) on the data you feed to Hadoop, which would give you
all the users who joined your database over a certain period and are named John. As you can see, you are left with a lot more flexibility here and can do virtually anything. If the resulting data set is small enough, you could put it in an RDBMS as summary data or cache it somewhere so further queries for the same data can retrieve it easily. You can read more about Hadoop here:
http://hadoop.apache.org/