Message storage duplication for messaging systems

In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice that user message history is stored in more than one place. On the one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for history? It is in fact able to serve it.

The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner (e.g. wipe the index and try a new awesome analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled. You're not afraid of losing data in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not. For example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last but not least, one of the hardest operations in Elasticsearch/Solr is retrieving stored values, since it's not well optimised for that, especially if you want to return 10k documents at once. In this case a separate data source also helps: you can return only the matched document ids from Elasticsearch/Solr, then retrieve the needed content from the data source and return it to the user.
The summary is simple: Elasticsearch/Solr should be thought of as search engines, not data storage.
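To make the "ids from the index, content from the data source" split concrete, here is a minimal sketch. It uses a toy in-memory inverted index in place of Elasticsearch/Solr and a plain dict in place of the real data source; the names `messages`, `build_index`, `search_ids` and `fetch` are made up for illustration.

```python
from collections import defaultdict

# Primary store: the source of truth, keyed by message id (a plain dict here).
messages = {
    1: "lunch at noon?",
    2: "running late, start lunch without me",
    3: "meeting moved to friday",
}


def build_index(store):
    """Build a toy inverted index: token -> set of message ids."""
    index = defaultdict(set)
    for msg_id, text in store.items():
        cleaned = text.lower().replace(",", " ").replace("?", " ")
        for token in cleaned.split():
            index[token].add(msg_id)
    return index


def search_ids(index, term):
    """The search side returns only matching ids, never stored content."""
    return sorted(index.get(term.lower(), set()))


def fetch(store, ids):
    """Content is read from the primary store by id."""
    return [store[i] for i in ids]


index = build_index(messages)
hits = search_ids(index, "lunch")
results = fetch(messages, hits)
```

Because the index is derived entirely from `messages`, it can be thrown away and rebuilt with a different tokenizer at any time without touching the stored history.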

True that ES is NOT a database per se and will never be. But no one says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology in general, there is no one-size-fits-all approach and with ES (and the like) it's no different.
A primary source of truth might not necessarily be a relational DBMS, and they are not necessarily "duplicating" the data in the sense that you meant. It can be anything that has a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many, many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
One thing that you need to think of, though, when architecting your system is that you might not need to have the entirety of your data in ES. Since ES is a search and analytics engine, you should only store in there what is necessary to support your search and analytics needs, and be able to recreate that information anytime. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
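As a rough illustration of the safety-net idea, the sketch below rebuilds a search index by replaying an append-only event log, and copies only the fields needed for search. The log format, the field names and `rebuild_index` are assumptions made for the example, not the API of Kafka or ES.

```python
# Hypothetical append-only log standing in for a replayable source of truth
# (flat files, a Kafka topic, a DB change feed, ...).
event_log = [
    {"op": "upsert", "id": "m1",
     "doc": {"text": "hello world", "author": "ann", "client_ip": "10.0.0.1"}},
    {"op": "upsert", "id": "m2",
     "doc": {"text": "second message", "author": "bob", "client_ip": "10.0.0.2"}},
    {"op": "upsert", "id": "m1",
     "doc": {"text": "hello again", "author": "ann", "client_ip": "10.0.0.1"}},
    {"op": "delete", "id": "m2"},
]

# Only what search and analytics need goes into the index.
INDEXED_FIELDS = ("text", "author")


def rebuild_index(log):
    """Replay the log from the start and return a fresh index (id -> fields)."""
    index = {}
    for event in log:
        if event["op"] == "upsert":
            doc = event["doc"]
            index[event["id"]] = {f: doc[f] for f in INDEXED_FIELDS if f in doc}
        elif event["op"] == "delete":
            index.pop(event["id"], None)
    return index


rebuilt = rebuild_index(event_log)
```

Replaying the same log always yields the same index, which is what makes losing the index an inconvenience rather than a data loss.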
Also worth reading: Using ElasticSeach as primary source for part of my DB

Related

Ideal database for a minimalist blog engine

So I'm designing this blog engine and I'm trying to just keep my blog data without considering comments or membership system or any other type of multi-user data.
The blog itself revolves around two types of data. The first is the actual blog post entry, which consists of a title, post body and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second type of data is the blog admin configuration and personal information. The comment system and other features will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with spikes in visits (I know you might argue this, but let's take it for granted). Since I started this project I've been moving well with the rest of my stack, except the data layer. Now I'm having this dilemma choosing the database. I've considered MongoDB, but some reviews and articles/benchmarks suggest slow reads after collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF. While Redis is good at both fast reading and writing, I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents" etc.
So is there any way I can settle this issue for good? Consider that I only need to represent my data in a key/value structure, and that I only require fast reading (not writing) and fault tolerance.
Thank you

If I were you I would start small and not try to optimize for big data just yet. A lot of blogs you read about the downsides of a NoSQL solution are around large data sets - or people that are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
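A minimal sketch of what "abstracting the database" could look like: the blog code depends only on a small `PostStore` interface, and two interchangeable backends implement it. All class and function names here are hypothetical, and the file-backed store is only a stand-in for a real database client.

```python
import json
import os
import tempfile
from typing import Optional, Protocol


class PostStore(Protocol):
    """The only interface the blog code is allowed to depend on."""

    def put(self, key: str, post: dict) -> None: ...
    def get(self, key: str) -> Optional[dict]: ...


class InMemoryStore:
    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, post: dict) -> None:
        self._data[key] = post

    def get(self, key: str) -> Optional[dict]:
        return self._data.get(key)


class JsonFileStore:
    """One JSON file per post; stands in for any persistent backend."""

    def __init__(self, directory: str) -> None:
        self._dir = directory

    def _path(self, key: str) -> str:
        return os.path.join(self._dir, f"{key}.json")

    def put(self, key: str, post: dict) -> None:
        with open(self._path(key), "w", encoding="utf-8") as fh:
            json.dump(post, fh)

    def get(self, key: str) -> Optional[dict]:
        try:
            with open(self._path(key), encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            return None


def publish(store: PostStore, key: str, title: str, body: str) -> Optional[dict]:
    """Application code: works with any backend that satisfies PostStore."""
    store.put(key, {"title": title, "body": body})
    return store.get(key)


memory_result = publish(InMemoryStore(), "hello", "Hello", "First post")
with tempfile.TemporaryDirectory() as tmp:
    file_result = publish(JsonFileStore(tmp), "hello", "Hello", "First post")
```

Trying out Mongo, Couchbase or Riak would then mean writing one more small class with the same two methods, while `publish` stays untouched.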

This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
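For a sense of how little is involved, here is a small sketch of a static generator: it renders each post to an HTML file that any web server or CDN can serve directly. The post structure, the `render` template and the output layout are assumptions for the example.

```python
import html
import tempfile
from pathlib import Path

# Hypothetical posts; in a real CMS these would come from the admin UI.
posts = [
    {"slug": "hello-world", "title": "Hello <world>", "body": "First post."},
]


def render(post):
    """Render one post to a complete HTML page, escaping user content."""
    title = html.escape(post["title"])
    body = html.escape(post["body"])
    return f"<!doctype html>\n<title>{title}</title>\n<h1>{title}</h1>\n<p>{body}</p>\n"


def build_site(all_posts, out_dir):
    """Write one static file per post and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for post in all_posts:
        path = out / f"{post['slug']}.html"
        path.write_text(render(post), encoding="utf-8")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    pages = build_site(posts, tmp)
    first_page = pages[0].read_text(encoding="utf-8")
```

The generator only needs to run when a post is published or edited, so read traffic never reaches a database at all.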

NoSql/Raven DB implementation best practices

I'm investigating a new project which will be a social networking style site. I'm reading up on RavenDB and I like the look of a lot of its features. I've not read up on NoSQL all that much, but I'm wondering if there's a niche it fits best, with old-school SQL still being the best choice for other stuff.
I'm thinking that the permissions plug-in would be ideal for a social-network style site, but will it really perform in an environment where the database will be getting hammered? Or is it optimised for a more reporting-style system, where it's possible to keep throwing new data structures at the database and report on those structures?
I'm eager to use the right tool for the job. I'll be using MVC3, Windsor + either NHibernate + SQL Server or RavenDB.
Should I stick with old-school SQL or go with the new kid on the block, RavenDB?

This question can get very close to being subjective (even though it's really not). You're talking about NoSQL as if it is just one thing, and that is not the case.
You have
graph databases (Neo4j etc),
map/reduce style document databases (Couch,Raven),
document databases which attempt to feel like ordinary databases (Mongo),
Key/value stores (Cassandra etc)
moar goes here.
Each of them attempts to solve a different problem via different means, and whether you'd use one of them over a traditional relational store is
A matter of suitability
A matter of personal preference
At the end of the day, for the primary data-storage for a single system, a document database or relational store is probably what you want, although for different parts of your system you may well end up utilising a graph database (For calculating neighbours etc), or a key/value store (like Facebook does/did for inbox messages).
The main benefit of choosing a document store as your primary store over that of a relational one, is that you haven't got to worry about trying to map your objects into a collection of tables, and there is less configuration overhead involved in doing so.
The other downside/upside would be that you have to learn something new and make mistakes along the way.
So my answer if I am going to be direct?
RavenDB would be suitable
SQL would be suitable
Which do you prefer to use? These days I'd probably just go for Raven, knowing I can dump data into a relational store for reporting purposes and probably do likewise for other parts of my system, and getting free-text search and fastish-writes/fast-reads without going through the effort of defining separate read/write stores is an overall win.
But that's me, and I am biased.

Should I choose relational or non-relation database for social-network like app

I'm in the process of choosing a database for my application. I have been using MySQL for the longest time, but for my current application performance and scalability are important and I know MySQL has its limitations. I have been hearing a lot about key-value stores, column-based DBs, document-based DBs and others. I have looked into:
Cassandra
MongoDB
Redis
CouchDB
They all seem (or claim) to be faster than relational DBs such as MySQL.
I'm using Ruby on Rails and there are clients for all the above so it shouldn't be a problem.
My data model is simple for the most part: it is centered on a user object (with a rich profile and preferences) related to different items such as photos, videos, posts, etc., and each of these has one or more tags.
Because these databases are new, there don't seem to be a lot of resources for them online. Plus, they are structurally different in a way, so it will not be trivial to switch from one to another later.
I would appreciate your input on which DB you think would best suit my application, with good performance and scale.
Thanks,
Tam

Step 1) Create your design using whatever technology you are strongest with.
Step 2) Release your social network, begin researching non-relational databases and master whichever you feel most comfortable with.
Step 3) Refactor your data tier so you could potentially replace MySQL quickly and easily with your newly learned DB technology.
Step 4) Wait for your website to become so big that the need to replace MySQL comes around and begin to plug the holes.
I know this seems kind of cheeky, but really my point is just release your software and start to worry about scale etc. when it actually becomes a concern.

The primary benefit of something like a document database, at least for your app, is that you can treat the entire User glob of info as a single document. You don't have to worry about adding tables for properties, or new features, or whatever; rather, you can keep the bulk of it in the user document and update it dynamically.
For read often, write rarely, this works a treat.
Now you don't need a "document database" to do something like this. MySQL et al will work just fine with a primary key and a CLOB (text) / BLOB field to hold the document.
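A minimal sketch of the "primary key plus a text field holding the document" approach, using Python's built-in sqlite3 as a stand-in for MySQL. The table layout and the helper names `save_user` and `load_user` are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per user: a primary key and the whole user document as JSON text.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")


def save_user(db, user_id, doc):
    """Store (or overwrite) the whole user document under its key."""
    db.execute(
        "INSERT OR REPLACE INTO users (id, doc) VALUES (?, ?)",
        (user_id, json.dumps(doc)),
    )
    db.commit()


def load_user(db, user_id):
    """Return the user document as a dict, or None if the key is unknown."""
    row = db.execute("SELECT doc FROM users WHERE id = ?", (user_id,)).fetchone()
    return json.loads(row[0]) if row else None


save_user(conn, 1, {"name": "Ann", "tags": ["photos"]})

# Adding a new feature is just a new key in the document, with no schema change.
user = load_user(conn, 1)
user["preferences"] = {"theme": "dark"}
save_user(conn, 1, user)
```

The trade-off is that the database cannot index or query inside the document without extra work, which is where views in something like CouchDB come in.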
Where something like CouchDB (the one that I'm most familiar with in this space) can help is that it has well supported replication, and it's straightforward to create views on specific attributes of the documents (for example, you want all "premiere" members, or whatever).
Plus, since CouchDB is HTTP, it works well with the modern caches and such that are available, which can help you in scaling, especially in, again, read heavy operations.
A lot of this is more about overall architecture than actual tools, so make sure you consider that first.

There is also Tokyo Cabinet which is used by some large sites.
I have not yet used one, but my understanding is that when a site like Twitter needs to turn large numbers of messages around very quickly, the overhead of the RDBMS is just too great and starts to slow response times down significantly.
What you would need to do is look at the advantages you get from an RDBMS and weigh them against its speed, then do the same in reverse for a NoSQL-type database.
RDBMSs give you a standard; they give you security, integrity and a general-purpose language based on sets to make data manipulation easier. However, if you do not need all or any of that structure, you are losing out on speed.
Prior to SQL there were CODASYL and network databases. SQL took over because of portability, transferability of skills, etc. But I think the mobile wired world is changing this and it would be worth investigating.

How much business logic should be in the database?

I'm developing a multi-user application which uses a (postgresql-)database to store its data. I wonder how much logic I should shift into the database?
e.g. When a user is going to save some data he just entered. Should the application just send the data to the database and the database decides if the data is valid? Or should the application be the smart part in the line and check if the data is OK?
In the last (commercial) project I worked on, the database was very dumb. No constraints, no views, etc.; everything was ruled by the application. I think that's very bad, because every time a certain table was accessed in the code, the same code to check whether the access was valid was repeated over and over again.
By shifting the logic into the database (with functions, triggers and constraints), I think we can save a lot of code in the application (and a lot of potential errors). But I'm afraid that putting too much of the business logic into the database will come back like a boomerang and that someday it will be impossible to maintain.
Are there some real-life-approved guidelines to follow?

If you don't need massive distributed scalability (think companies with as much traffic as Amazon or Facebook etc.) then the relational database model is probably going to be sufficient for your performance needs. In which case, using a relational model with primary keys, foreign keys, constraints plus transactions makes it much easier to maintain data integrity, and reduces the amount of reconciliation that needs to be done (and trust me, as soon as you stop using any of these things, you will need reconciliation -- even with them you likely will due to bugs).
However, most validation code is much easier to write in languages like C#, Java, Python etc. than it is in languages like SQL because that's the type of thing they're designed for. This includes things like validating the formats of strings, dependencies between fields, etc. So I'd tend to do that in 'normal' code rather than the database.
Which means that the pragmatic solution (and certainly the one we use) is to write the code where it makes sense. Let the database handle data integrity because that's what it's good at, and let the 'normal' code handle data validity because that's what it's good at. You'll find a whole load of cases where this doesn't hold true, and where it makes sense to do things in different places, so just be pragmatic and weigh it up on a case by case basis.
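Here is a small sketch of that division of labour, using sqlite3 as a stand-in for a real RDBMS: the schema enforces integrity (keys, NOT NULL, UNIQUE), while ordinary code checks validity (does this look like an email?). The table names, the deliberately loose regular expression and the helper functions are assumptions for the example.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users(id),"
    " title TEXT NOT NULL)"
)

# Validity (format) is checked in 'normal' code; this pattern is intentionally loose.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def add_user(db, email):
    if not EMAIL_SHAPE.match(email):
        raise ValueError(f"does not look like an email address: {email!r}")
    with db:  # commits on success, rolls back on error
        return db.execute("INSERT INTO users (email) VALUES (?)", (email,)).lastrowid


def add_post(db, user_id, title):
    with db:
        return db.execute(
            "INSERT INTO posts (user_id, title) VALUES (?, ?)", (user_id, title)
        ).lastrowid


user_id = add_user(conn, "ann@example.com")
post_id = add_post(conn, user_id, "Hello")

outcomes = []
try:
    add_user(conn, "not-an-email")
except ValueError:
    outcomes.append("rejected by application")
try:
    add_post(conn, 999, "Orphan post")  # no such user
except sqlite3.IntegrityError:
    outcomes.append("rejected by database")
```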

Two cents: if you choose smart, remember not to go in the "too smart" field. The database should not deal with inconsistencies that are inappropriate for its level of understanding of the data.
Example: suppose you want to insert a valid (checked with a confirmation mail) email address in a field. The database could check whether the email actually conforms to a given regular expression, but asking the database to check whether the email address is valid (e.g. checking that the domain exists, sending the email and handling the response) is a bit too much.
It's not meant to be a real-world example, just an illustration that a smart database has limits to its smartness anyway: if a nonexistent email address gets into it, the data is still not valid, but for the database it is fine. As in the OSI model, everything should handle data at its level of understanding. Ethernet does not care whether it's transporting ICMP or TCP, or whether they are valid or not.

I find that you need to validate in both the front end (either the GUI client, if you have one, or the server) and the database.
The database can easily assert for nulls, foreign key constraints etc. i.e. that the data is the right shape and linked up correctly. Transactions will enforce atomic writes of this. It's the database's responsibility to contain/return data in the right shape.
The server can perform more complex validations (e.g. does this look like an email, does this look like a postcode etc.) and then re-structure the input for insertion into the database (e.g. normalise it and create the appropriate entities for insertion into the tables).
Where you put the emphasis on validation depends to some degree on your application. e.g. it's useful to validate a (say) postcode in a GUI client and immediately provide feedback, but if your database is used by other applications (e.g. an application to bulkload addresses) then your layer surrounding the database needs to validate as well. Sometimes you end up providing validation in two different implementations (e.g. in the above, perhaps a Javascript front-end and a Java DAO backend). I've never found a good strategic solution to this.

Using the common features of relational databases, like primary and foreign key constraints, datatype declarations, etc., is good sense. If you're not going to use them then why bother with a relational db?
That said, all data should be validated for both type and business rules before it hits the db. Type validation is just defensive programming- assume users are out to hack you and then you'll get fewer unpleasant surprises. Business rules are what your application is all about. If you make them part of the structure of your db they become much more tightly bound to how your app works. If you put them in the application layer, it's easier to change them if business requirements change.
As a secondary consideration: clients often have less choice about which db they use (PostgreSQL, MySQL, Oracle, etc.) than which application language they have available. So if there is a good chance that your application will be installed on many different systems, your best bet is to make sure that your SQL is as standard as possible. This may well mean that constructing language-agnostic db features like triggers, etc. will be more trouble than putting that same logic in your application layer.

It depends on the application :)
For some applications the dumb database is the best. For example, Google's applications run on a big dumb database that can't even do joins, because they need amazing scalability to be able to serve millions of users.
On the other hand, for some internal enterprise app it can be beneficial to go with a very smart database, as those are often used by more than one application and you therefore want a single point of control. Think of an employee database.
That said if your new application is similar to the previous one, I would go with dumb database. In order to eliminate all the manual checks and database access code I would suggest using an ORM library such as Hibernate for Java. It will essentially automate your data access layer but will leave all the logic to your application.
Regarding validation it must be done on all levels. See other answers for more details.

One other item of consideration is deployment. We have an application where the deployment of database changes is actually much easier for remote installations than the actual code base is. For this reason, we've put a lot of application code in stored procedures and database functions.
Deployment is not your #1 consideration, but it can play an important role in deciding between various choices.

This is as much a people question as it is a technology question. If your application is the only application that's ever going to manipulate the data (which is rarely the case, even if you think that's the plan), and you've only got application coders to hand, then by all means keep all the logic in the application.
On the other hand, if you've got DBAs who can handle it, or you know that more than one app will need to have its access validated, then managing data actually in the database makes a lot of sense.
Remember, though, that the best things for the database to be validating are a) the types of the data and b) relational constraints, which anything calling itself an RDBMS should have a handle on anyway.
If you've got any transactions in your application code, it's also worthwhile asking yourself whether they should be pushed to the database as a stored procedure so that it's impossible for them to be incorrectly reimplemented elsewhere.
I do know of shops where the only access allowed to the database is via stored procedures, so the DBAs have full responsibility for both the data storage semantics and access restrictions, and anyone else has to go through their gateways. There are obvious advantages to this, especially if more than one application has to have access to the data. Whether you go quite that far is up to you, but it's a perfectly valid approach.

While I believe that most data should be validated from the user interface (why send known bad stuff across the network tying up resources?), I also believe it is irresponsible not to put constraints on the database as the user interface is unlikely to be the only way that data ever gets into the database. Data also comes in from imports, other applications, quick script fixes for problems run at the query window, mass updates run (to update all prices by 10% for example). I want all bad records rejected no matter what their source and the database is the only place where you can be assured that will happen. To skip the database integrity checks because the user interface does it is to guarantee that you will most likely eventually have data integrity issues and then all of your data become meaningless and useless.
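The mass-update case can be sketched with a CHECK constraint: a quick script that bypasses the application is still stopped by the database when it would produce a bad record. sqlite3 is used as a stand-in, and the table and the two updates are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " price_cents INTEGER NOT NULL CHECK (price_cents > 0))"
)
conn.executemany(
    "INSERT INTO products (name, price_cents) VALUES (?, ?)",
    [("widget", 1000), ("gadget", 500)],
)
conn.commit()


def prices(db):
    return [row[0] for row in db.execute("SELECT price_cents FROM products ORDER BY id")]


# A "quick script fix" run straight against the database, no UI involved.
# It would drive the gadget's price to zero, so the CHECK constraint rejects it.
try:
    with conn:
        conn.execute("UPDATE products SET price_cents = price_cents - 500")
    bad_update_applied = True
except sqlite3.IntegrityError:
    bad_update_applied = False
after_bad_update = prices(conn)

# A legitimate mass update (raise all prices by 10%) goes through.
with conn:
    conn.execute("UPDATE products SET price_cents = price_cents * 110 / 100")
after_good_update = prices(conn)
```

Note that the rejected update leaves every row unchanged, including the widget row that would have been valid on its own.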

"e.g. When a user is going to save some data he just entered. Should the application just send the data to the database and the database decides if the data is valid? Or should the application be the smart part in the line and check if the data is OK?"
It's better to have the validation in the front end as well as on the server side, so that if the data is invalid the user is notified immediately. Otherwise he will have to wait for the DB to respond after a postback.
Where security is concerned, it's better to validate at both ends, the front end as well as the DB. Otherwise, how can the DB trust all the data that is sent by the application? ;-)

Validation should be done on both the client side and the server side, and once the data is valid it should be stored.
The only work that the database should do is querying logic: updating rows, inserting rows and selects. Everything else should be handled by the server-side logic, since that's where the real meat of the application lives.
Structuring your insert properly will handle any foreign key constraints. Getting your business logic to call a sproc will insert data in the correct format. I don't really consider this validation, but some people might.

My decision is: never use stored procedures in the database. Stored procedures are not portable.

Pros/cons of document-based databases vs. relational databases

I've been trying to see if I can accomplish some requirements with a document based database, in this case CouchDB. Two generic requirements:
CRUD of entities with some fields which have unique index on it
ecommerce web app like eBay (better description here).
And I'm beginning to think that a document-based database isn't the best choice to address these requirements. Furthermore, I can't imagine a use for a document-based database (maybe my imagination is too limited).
Can you explain to me if I am asking pears from an elm when I try to use a Document oriented database for these requirements?

You need to think of how you approach the application in a document oriented way. If you simply try to replicate how you would model the problem in an RDBMS then you will fail. There are also different trade-offs that you might want to make. ([ed: not sure how this ties into the argument but:] Remember that CouchDB's design assumes you will have an active cluster of many nodes that could fail at any time. How is your app going to handle one of the database nodes disappearing from under it?)
One way to think about it is to imagine you didn't have any computers, just paper documents. How would you create an efficient business process using bits of paper being passed around? How can you avoid bottlenecks? What if something goes wrong?
Another angle you should think about is eventual consistency, where you will get into a consistent state eventually, but you may be inconsistent for some period of time. This is anathema in RDBMS land, but extremely common in the real world. The canonical transaction example is of transferring money between bank accounts. How does this actually happen in the real world: through a single atomic transaction, or through different banks issuing credit and debit notices to each other? What happens when you write a cheque?
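The bank example can be sketched as a toy simulation: one bank debits immediately and sends a credit notice, and the other applies the notice later. In between, the two ledgers disagree; once the notice is processed they are consistent again. The `Bank` class and its methods are purely illustrative.

```python
from collections import deque


class Bank:
    """A toy bank that settles incoming transfers via queued credit notices."""

    def __init__(self, name, balances):
        self.name = name
        self.balances = dict(balances)
        self.inbox = deque()  # credit notices not yet applied

    def send(self, other, from_account, to_account, amount):
        """Debit locally right away; the credit only travels as a notice."""
        if self.balances[from_account] < amount:
            raise ValueError("insufficient funds")
        self.balances[from_account] -= amount
        other.inbox.append({"account": to_account, "amount": amount})

    def process_inbox(self):
        """Apply all pending credit notices."""
        while self.inbox:
            notice = self.inbox.popleft()
            self.balances[notice["account"]] += notice["amount"]


def total(*banks):
    return sum(sum(bank.balances.values()) for bank in banks)


bank_a = Bank("A", {"alice": 100})
bank_b = Bank("B", {"bob": 20})

before = total(bank_a, bank_b)      # both ledgers agree
bank_a.send(bank_b, "alice", "bob", 30)
in_flight = total(bank_a, bank_b)   # money has left A but not yet reached B
bank_b.process_inbox()
settled = total(bank_a, bank_b)     # consistent again, eventually
```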
So let's look at your examples:
CRUD of entities with some fields with unique index on it.
If I understand this correctly in CouchDB terms, you want to have a collection of documents where some named value is guaranteed to be unique across all those documents? That case isn't generally supportable because documents may be created on different replicas.
So we need to look at the real world problem and see if we can model that. Do you really need them to be unique? Can your application handle multiple docs with the same value? Do you need to assign a unique identifier? Can you do that deterministically? A common scenario where this is required is where you need a unique sequential identifier. This is tough to solve in a replicated environment. In fact if the unique id is required to be strictly sequential with respect to time created it's impossible if you need the id straight away. You need to relax at least one of those constraints.
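One way to "do that deterministically" is to derive the document id from the value that must be unique, so every replica computes the same id without coordination. The sketch below uses a name-based UUID; the namespace, the normalisation rule and the two dicts standing in for replicas are assumptions for the example, and this gives up sequential ids entirely.

```python
import uuid

# Hypothetical application namespace; any fixed UUID would do.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def doc_id_for(email):
    """Derive the document id from the unique field, after normalising it."""
    return str(uuid.uuid5(APP_NAMESPACE, email.strip().lower()))


# Two replicas accept the "same" user independently, with no coordination.
replica_a = {}
replica_b = {}
replica_a[doc_id_for("Ann@example.com")] = {"email": "ann@example.com", "seen_on": "A"}
replica_b[doc_id_for("ann@example.com ")] = {"email": "ann@example.com", "seen_on": "B"}

shared_ids = set(replica_a) & set(replica_b)
```

Because both replicas used the same id, replication sees one document with two competing versions (a conflict the application can resolve) instead of two unrelated documents carrying the same value.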
ecommerce web app like eBay
I'm not sure what to add here as the last comment you made on that post was to say "very useful! thanks". Was there something missing from the approach outlined there that is still causing you a problem? I thought MrKurt's answer was pretty full and I added a little enhancement that would reduce contention.

Is there a need to normalize the data?
Yes: Use relational.
No: Use document.

I am in the same boat. I am loving CouchDB at the moment, and I think that the whole functional style is great. But when exactly do we start to use these databases in earnest for applications? I mean, yes, we can all start to develop applications extremely quickly and cruft-free, with all those nasty hang-ups about normal form left by the wayside and no schemas. But, to coin a phrase, "we are standing on the shoulders of giants". There is a good reason to use an RDBMS, to normalise and to use schemas. My old Oracle head is reeling thinking about data without form.
My main wow factor on CouchDB is the replication stuff and the versioning system working in tandem.
I have been racking my brain for the last month trying to grok the storage mechanisms of CouchDB. Apparently it uses B-trees but doesn't store data based on normal form. Does this mean that it is really, really smart and realises that bits of data are replicated, so let's just make a pointer to this B-tree entry?
So far I am thinking of xml documents, config files, resource files streamed to base64 strings.
But would I use CouchDB for structural data? I don't know; any help is greatly appreciated on this.
Might be useful in storing RDF data or even free form text.

A possibility is to have a main relational database that stores definitions of items that can be retrieved by their IDs, and a document database for the descriptions and/or specifications of those items. For example, you could have a relational database with a Products table with the following fields:
ProductID
Description
UnitPrice
LotSize
Specifications
And that Specifications field would actually contain a reference to a document with the technical specifications of the product. This way, you have the best of both worlds.
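A rough sketch of that hybrid layout: the Products table lives in a relational database (sqlite3 here) and its Specifications column holds only a key into a document store (a plain dict standing in for one). The sample row, the document contents and `product_with_specs` are assumptions for the example.

```python
import sqlite3

# Stand-in for a document database: free-form specs keyed by document id.
spec_documents = {
    "spec-42": {
        "weight_kg": 1.2,
        "dimensions_mm": [200, 100, 50],
        "materials": ["steel"],
    },
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products ("
    " ProductID INTEGER PRIMARY KEY,"
    " Description TEXT NOT NULL,"
    " UnitPrice REAL NOT NULL,"
    " LotSize INTEGER NOT NULL,"
    " Specifications TEXT)"  # reference to a document, not the document itself
)
conn.execute(
    "INSERT INTO Products VALUES (?, ?, ?, ?, ?)",
    (42, "Bracket", 3.5, 10, "spec-42"),
)
conn.commit()

COLUMNS = ("ProductID", "Description", "UnitPrice", "LotSize", "Specifications")


def product_with_specs(db, documents, product_id):
    """Load the relational row, then resolve its document reference."""
    row = db.execute(
        "SELECT ProductID, Description, UnitPrice, LotSize, Specifications"
        " FROM Products WHERE ProductID = ?",
        (product_id,),
    ).fetchone()
    if row is None:
        return None
    product = dict(zip(COLUMNS, row))
    product["Specifications"] = documents.get(product["Specifications"])
    return product


product = product_with_specs(conn, spec_documents, 42)
```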

Document-based DBs are best suited for storing, well, documents. Lotus Notes is a common implementation and Notes email is an example. For what you are describing (eCommerce, CRUD, etc.), relational DBs are better designed for storage and retrieval of data items/elements that are indexed (as opposed to documents).

Re CRUD: the whole REST paradigm maps directly to CRUD (or vice versa). So if you know that you can model your requirements with resources (identifiable via URIs) and a basic set of operations (namely CRUD), you may be very near to a REST-based system, which quite a few document-oriented systems provide out of the box.
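The mapping can be sketched without any web framework: resources are identified by a path, and each HTTP method corresponds to one CRUD operation on a dict. The `handle` function and the status codes it returns are a simplified illustration, not any particular system's API.

```python
# Resource store keyed by URI path.
resources = {}


def handle(method, path, body=None):
    """Map an HTTP-style request onto a CRUD operation; return (status, payload)."""
    exists = path in resources
    if method == "PUT":  # Create or Update
        resources[path] = body
        return (200 if exists else 201), body
    if method == "GET":  # Read
        return (200, resources[path]) if exists else (404, None)
    if method == "DELETE":  # Delete
        if exists:
            del resources[path]
            return 204, None
        return 404, None
    return 405, None  # method not supported in this sketch


statuses = [
    handle("PUT", "/items/1", {"name": "lamp"})[0],
    handle("GET", "/items/1")[0],
    handle("PUT", "/items/1", {"name": "desk lamp"})[0],
    handle("DELETE", "/items/1")[0],
    handle("GET", "/items/1")[0],
]
```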
