Using ElasticSearch as source of truth - database

I am working with a team which uses two data sources.
MSSQL as a primary data source for making transaction calls.
ES as a back-up/read-only source of truth for viewing the data.
For example, if I place an order, the order is inserted into the DB, and then a RabbitMQ listener/batch job synchronizes the data from the DB to ES.
Somehow this system fails even for just a million records. By "fails" I mean the records are not updated in ES in a timely fashion. For example, say I create a coupon: the coupon is generated in the DB, and the customer immediately tries to redeem it, but ES doesn't have the information about the coupon yet, so the redemption fails. Of course there are options such as RabbitMQ priority queues, but the questions I have are very basic.
I have a few questions in mind, which I asked the team, and I still haven't received satisfactory answers:
What is the minimum load at which it makes sense to use Elasticsearch, and doesn't it become overkill if we have just 1M records?
Does it really make sense to use ES as a source of truth for real-time data?
Is ES designed to handle relational-like data, and data that gets continuously updated? AFAIK such search-optimized databases are of the write-once, read-many kind.
If we are doing it to handle load, then how would it be different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
The main question I have in mind is: how can we optimize this architecture so that it scales better?
PS:
When I asked about minimum load, what I really meant was: at what number of records/transactions can we say ES will be faster than a conventional relational database? Or is there no such threshold at all?
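For context, here is a minimal sketch of the kind of listener we have (illustrative only, not our actual code; it assumes the pika and elasticsearch Python clients, and the queue and index names are made up):

import json

import pika                              # RabbitMQ client (assumed)
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x style API (assumed)

es = Elasticsearch("http://localhost:9200")

def on_message(channel, method, properties, body):
    # Index one DB change event into ES, then ack the message.
    doc = json.loads(body)               # event published after the DB commit
    es.index(index="orders", id=doc["id"], document=doc)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders-sync", durable=True)
channel.basic_consume(queue="orders-sync", on_message_callback=on_message)
channel.start_consuming()                # until this consumer catches up, ES lags behind the DB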

What is the minimum load at which it makes sense to use Elasticsearch, and doesn't it become overkill if we have just 1M records?
Answer: the possible load depends on the capabilities of your servers.
Does it really make sense to use ES as a source of truth for real-time data?
From ES website: "Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected."
So yes, it can be your source of truth. That said, it is "eventually consistent", which raises the question of how soon is soon enough to count as "real-time", and there is no way to answer that without testing and measuring your system.
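To put a rough number on "how soon": by default ES makes newly indexed documents searchable only after a periodic refresh (1 second by default), and an indexing client can ask to block until that refresh has happened. A minimal sketch, assuming the elasticsearch Python client (8.x style API); index and field names are illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Default behaviour: the document becomes searchable only after the next
# periodic refresh (index.refresh_interval, 1 second by default).
es.index(index="coupons", id="C-42", document={"code": "C-42", "active": True})

# refresh="wait_for" blocks until the document is searchable, trading
# indexing latency for read-your-writes behaviour on the search side.
es.index(
    index="coupons",
    id="C-43",
    document={"code": "C-43", "active": True},
    refresh="wait_for",
)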
Is ES designed to handle relational-like data, and data that gets continuously updated? AFAIK such search-optimized databases are of the write-once, read-many kind.
That's a good point: like any eventually consistent system, it is indeed NOT optimized for a series of modifications!
If we are doing it to handle load, then how would it be different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
It won't. Bear in mind that ES, as quoted above, was built to meet the requirements of search and analysis. If that's not what you intend to do with it, you should consider another tool. Use the right tool for the job.

1)
There isn't a minimum expected load.
You can have 2 small nodes (master & data) with 2 shards per index (1 primary + 1 replica).
You can also split your data into multiple indices if it makes sense from a functional point of view (i.e. how data is searched).
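For example, such an index could be created like this (a minimal sketch assuming the elasticsearch Python client, 8.x style API; the index name is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1 primary + 1 replica = 2 shards in total, spread over the 2 nodes.
es.indices.create(
    index="orders",
    settings={"number_of_shards": 1, "number_of_replicas": 1},
)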
2)
In my experience, the main benefits you get from ElasticSearch are:
Near linear scalability.
Lucene-based text search.
Many ways to put your data to work: RESTful query API, Kibana...
Easy administration (compared to your typical RDBMS).
If your project doesn't get these benefits, then most probably ES is not the right tool for the job.
3)
ElasticSearch doesn't like data that is updated frequently. The best use case is for read-only data.
Anyway, this doesn't explain the high latency you are getting; your problem must lie in RabbitMQ or the network.
4)
Indeed, that's what I would do: MSSQL cluster for application data and ES for analytics.

Related

Message storage duplication for messaging systems

In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication in where user message history is stored. On one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for the history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for the history? It is in fact able to.
The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner (e.g. wipe the index and try a new, better analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled: you are not afraid of losing data in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not. For example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last but not least, one of the most expensive operations in Elasticsearch/Solr is retrieving stored values, since they are not particularly optimized for it, especially if you want to return 10k documents at once. In this case a separate datasource also helps: you can return only the matched document ids from Elasticsearch/Solr, then retrieve the needed content from the datasource and return it to the user.
The summary is simple: Elasticsearch/Solr should be thought of as search engines, not data storage.
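To illustrate that last point, a minimal sketch of the "ids from the search engine, content from the datasource" pattern, assuming the elasticsearch (8.x style API) and psycopg2 Python clients; index, table and column names are illustrative:

import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
db = psycopg2.connect("dbname=messages user=app")   # hypothetical DSN

def search_messages(text, limit=20):
    # Ask ES only for matching ids, not for the stored documents themselves.
    hits = es.search(
        index="messages",
        query={"match": {"body": text}},
        source=False,          # skip fetching _source from the index
        size=limit,
    )["hits"]["hits"]
    ids = [hit["_id"] for hit in hits]
    if not ids:
        return []
    # Hydrate the full rows from the primary datasource.
    with db.cursor() as cur:
        cur.execute("SELECT id, author, body FROM messages WHERE id = ANY(%s)", (ids,))
        return cur.fetchall()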
It's true that ES is NOT a database per se and never will be. But no one says you cannot use it as one, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology, there is no one-size-fits-all approach, and with ES (and the like) it's no different.
A primary source of truth is not necessarily a relational DBMS, and it is not necessarily "duplicating" the data in the sense you meant; it can be anything that holds a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many, many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
One thing you need to think about when architecting your system, though, is that you might not need the entirety of your data in ES. Since ES is a search and analytics engine, you should only store there what is necessary to support your search and analytics needs, and you should be able to recreate that information at any time. In the end, ES is just one subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
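As an illustration of that safety net, a minimal sketch of rebuilding an index from a relational source of truth, assuming psycopg2 and the elasticsearch Python client with its bulk helper; table, column and index names are illustrative:

import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
db = psycopg2.connect("dbname=shop user=app")   # hypothetical DSN

def rebuild_orders_index():
    # Drop and re-create the search index from the primary database.
    es.indices.delete(index="orders", ignore_unavailable=True)
    es.indices.create(index="orders")

    def actions():
        with db.cursor(name="orders_export") as cur:   # server-side cursor streams rows
            cur.execute("SELECT id, customer_id, status, total FROM orders")
            for oid, customer_id, status, total in cur:
                yield {
                    "_index": "orders",
                    "_id": oid,
                    "_source": {"customer_id": customer_id, "status": status, "total": float(total)},
                }

    bulk(es, actions())

rebuild_orders_index()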
Also worth reading: Using ElasticSearch as primary source for part of my DB

Scalable database technology and architecture

I've been trying to learn more about database scaling in a distributed system, and I am stuck between RDBMS and NoSQL.
Some articles online suggest that NoSQL is the solution to modern Big Data. Others say NoSQL is just a hype and RDBMS can be just as scalable with good design, and it provides good data structure.
Instead of reading others' opinions, I'd love to judge the two myself, but I do not understand exactly what is required for a scalable RDBMS and a scalable NoSQL.
I've done a bit more reading on RDBMS, and it seems that the solution requires leveraging memcache and sharding to reduce database size and the number of DB queries. Are there other tricks? Can you still use tables with many columns? Or should you use fewer columns and more joins?
As for NoSQL, I've read a little about MongoDB. I understand that it encourages data aggregation. But how does that make it more scalable? I'm also starting to learn Cassandra because I read that it scales much better than MongoDB, but I have no idea how it is more scalable.
I would very much appreciate a basic (or advanced, if you have the patience to type it out) condensed and down-to-the-core explanation on scaling RDBMS and NoSQL, or good articles online or books that explain the topic. :)
I won't cover ways you can scale by implementing things on your own, such as putting a memcache server in between; I'll just cover what comes right out of the box.
Let's start first with RDBMS:
I think setting up an RDBMS cluster is more complicated than a NoSQL cluster, but that's just my opinion. Usually you have one master and multiple slaves. You have to send all your writes to the master and can read from any slave you want. Since you have an RDBMS with ACID guarantees, the system should ensure that you won't read stale data. The assumption here is that your application writes once and reads often (as is usually the case). For that access pattern, one server for read/write and multiple servers for reads works great. The problem comes when your writes are so frequent that the one machine can no longer keep up with them; that is your bottleneck. In addition to the built-in solutions from Oracle, for instance, which are huge, there is also http://www.scalearc.com/, which can cache queries and handle the scaling for you.
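To make the read/write split concrete, a minimal sketch of routing writes to the master and reads to a replica, assuming psycopg2 and hypothetical host names:

import random

import psycopg2

# Hypothetical connection strings: one master for writes, replicas for reads.
MASTER_DSN = "host=db-master dbname=app user=app"
REPLICA_DSNS = [
    "host=db-replica-1 dbname=app user=app",
    "host=db-replica-2 dbname=app user=app",
]

master = psycopg2.connect(MASTER_DSN)
replicas = [psycopg2.connect(dsn) for dsn in REPLICA_DSNS]

def create_user(name):
    # All writes go to the master.
    with master, master.cursor() as cur:
        cur.execute("INSERT INTO users (name) VALUES (%s) RETURNING id", (name,))
        return cur.fetchone()[0]

def get_user(user_id):
    # Reads can be served by any replica (possibly slightly stale).
    conn = random.choice(replicas)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()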
NoSQL:
There is no single NoSQL model that is implemented by all the DBs; every system is a bit different. MongoDB, for instance, is quite similar to an RDBMS: it also has one master and several slaves to which it can replicate data, but additionally you can create shards. Data is split between shards and replicated to slaves, so you can have multiple masters, each responsible for a smaller part of the data. When you read, you can choose whether to read from the master or from any slave, depending on how urgently you need the latest data.
Cassandra, on the other hand, works quite differently. I'm not sure whether you can write to multiple servers or exactly how it works, but basically the servers keep a log of all writes. So even if they can't process the writes immediately, the writes are stored in a log, which still gives you a fast response. When you read, you can again specify how urgently you need the latest data, and if you really want the very latest data, Cassandra will need to check the log for any pending updates, which costs you a lot of time.
Key-value stores like ElasticSearch, CouchDB and Couchbase work differently again. Here the key of the item is hashed, and based on the hash the item is sent to the one node that is responsible for it. This way, when you read after the key was written, you get up-to-date information, because you read from the same node. The idea behind this design is that no single key will be of interest to everyone, so the load gets distributed. These are also the DBs that I think scale best and make it easiest to add more servers to the cluster, but you lose the power of complex queries that you have in MongoDB, Cassandra and, of course, an RDBMS. ElasticSearch has some simple search queries, and CouchDB and Couchbase have only views produced by MapReduce, where you can get the data you want if it fits the view; otherwise you can only access it by key.
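A toy sketch of that key-hashing idea in plain Python (real stores use consistent hashing and replication rather than a simple modulo; this only shows why reads and writes for a key land on the same node):

import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster members

def node_for(key):
    # Pick the node responsible for a key from its hash.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always maps to the same node, so a read that follows a
# write goes to the node that just stored the value.
print(node_for("user:42"))   # e.g. 'node-b'
print(node_for("user:42"))   # same node again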
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis is a very comprehensive summary of the most common NoSQL DBs, their strengths and weaknesses, and the most common usage scenarios.
In the end, the question is also: why do you want to scale? How many records are you going to have in the database? A few million is not a problem at all. A few hundred million is also not a problem for most RDBMSs on a powerful enough server. And if you design the DB and its indices properly, even a billion records per year should still be fine.

What database is good enough for logging application?

I am writing a web application with Node.js that can be used by other applications to store logs, which can later be accessed in a web interface or by the applications themselves through an API. It is similar to Graylog2 but schema-free.
I've already tried CouchDB, in which each document would be a log entry, but since I'm not really using revisions, it seems to me I'm not using all of its features. Besides that, I think that if the logs grow beyond a certain size they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched and capped, with the latest events easily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering using Cassandra (I'm not really familiar with it) because of the points made here. MongoDB seems good here too, since Graylog2 uses MongoDB, and there are some good points about it here.
I have already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want to make is that instead of having a single message field, there are fields defined by the client, which is why I want it to be schema-free, and because of that I may need to query on the user-defined fields. We could build it on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically, what I'm looking for is to end up with rich statistical data, easy debugging and a lot of other stuff that we can't get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount of logs (terabytes or petabytes per day), then Apache Kafka, which is designed to allow data to be PULLED by HDFS in parallel, is an interesting solution (still in the incubation stage). I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of the message traffic. Kafka is optimized to pump and append that data to HDFS nodes via message brokers fast. Once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution for finding information, particularly documents. Otherwise, if you have more time to process the data (i.e. batch processes that take hours or sometimes days), then Hadoop or some other MapReduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire-hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failures and recover with a very low probability of losing data.
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more developer overhead to design, build and maintain.
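For the schema-free, client-defined-fields requirement from the question, a minimal sketch assuming pymongo; the database, collection and field names are made up:

from datetime import datetime, timezone

from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client.logging.events               # hypothetical database/collection

# Each client application can send whatever fields it likes.
logs.insert_one({
    "ts": datetime.now(timezone.utc),
    "app": "checkout",
    "level": "error",
    "order_id": 1234,                      # user-defined field
    "payment_gateway": "stripe",           # user-defined field
})

# Querying on user-defined fields works without any schema migration.
logs.create_index([("app", ASCENDING), ("ts", DESCENDING)])
recent_errors = logs.find({"app": "checkout", "level": "error"}).sort("ts", DESCENDING).limit(50)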
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing
Easy to start small but expand to large clusters without changing application code.
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.

how to gain a high performance with a very big database

I always wondered how a very big site like Facebook could be faster than other sites, despite the very large amount of data that gets stored every day.
What are they using to store information, and if I use SQL Server to store e.g. the news feed, is that OK? (The news feed will be stored in a separate table called News.)
On the other hand, what would happen if I joined many huge tables with each other? Would it be slow, or does it not matter how big the tables are?
Thanks :)
When you talk about scaling at the size of Facebook, it is a whole different ball park. The latest estimates put Facebook's datacenter at about 60,000 servers (sixty thousand). The cache alone is estimated at about 30 TB (terabytes), in a massive Memcached cluster. Although their back end is still MySQL, it is used as a pure key-value store, according to publicly available information:
Facebook uses MySQL, but primarily as a key-value persistent storage, moving joins and logic onto the web servers since optimizations are easier to perform there (on the "other side" of the Memcached layer).
There are various other technologies in use there:
HipHop to compile PHP into native code
Haystack for media (photo) storage
BigPipe for HTTP delivery
Cassandra for Inbox search
You can also watch the SIGMOD 2010 keynote address Building Facebook: Performance at big scale. They even present their basic internal API:
cache_get ($ids,
'cache_function',
$cache_params,
'db_function',
$db_params);
So if you connect the dots you'll see that at such scale you no longer talk about a 'big database'. You talk about huge clusters of services, key-value storage partitioned across thousands of servers, many technologies used together and so on and so forth.
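The API quoted above is essentially the cache-aside pattern. A minimal sketch in plain Python (a dict stands in for the Memcached cluster; all names are illustrative, not Facebook's actual code):

cache = {}   # stands in for Memcached

def fetch_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "user-%d" % user_id}

def cache_get(user_id):
    # Cache-aside: try the cache first, fall back to the DB, populate the cache.
    value = cache.get(user_id)
    if value is None:                 # cache miss
        value = fetch_from_db(user_id)
        cache[user_id] = value        # so the next read is cheap
    return value

print(cache_get(42))   # miss: hits the "DB" and fills the cache
print(cache_get(42))   # hit: served from the cache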
As a side note, you can also see a pretty good presentation of MySpace internals. Although the technology stack is completely different (Microsoft .Net and SQL Server based, with a huge emphasis on message passing via Service Broker) there are similar points in how they approach storage. To sum up: application layer partitioning.
It depends. Facebook is very fast because they have a server farm, so queries are optimized and each query hits many servers.
Regarding huge tables, they can be fast as long as you have enough physical memory to index whatever you need to search on. Having the correct indexes can improve database performance hugely when it comes to retrieving data.
As long as it makes sense to join many huge tables together, then yes; but if they're separate and not related, then no. If you provide more details about what kind of tables you are looking to merge, we might be able to help you more.
According to link text and other pages, Facebook uses a technique called sharding.
It simply uses a bunch of databases with a small portion of the site on each database. A simple algorithm for deciding which database to use could be using the first letter in the username as an index for the database. One database for 'a', one for 'b', etc. I'm sure Facebook has a more advanced scheme than that, but the principle is the same.
The result is many small independent databases that are small enough to handle the load. Facebook and all other major sites has all sorts of similar tricks to make the sites fast and responsive.
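A toy sketch of that kind of shard selection in plain Python (Facebook's real scheme is not public, and production systems usually hash the key rather than use its first letter; the connection strings are hypothetical):

import zlib

SHARD_DSNS = [               # one connection string per shard
    "host=shard-0 dbname=app",
    "host=shard-1 dbname=app",
    "host=shard-2 dbname=app",
    "host=shard-3 dbname=app",
]

def shard_for(username):
    # Map a username to the database shard that owns its data.
    return SHARD_DSNS[zlib.crc32(username.encode()) % len(SHARD_DSNS)]

# Every query for this user goes to one small database instead of one huge one.
print(shard_for("alice"))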
They continuously monitor the sites for performance and other metrics and come up with solutions to the issues they find.
I think the monitoring part is more important to the performance success than the actual techniques used to gain the performance. You cannot make a site fast by blindly throwing some "good performance spells" at it. You have to know where and why you have bottlenecks before you can remove them.
It depends on what the performance bottleneck is. One problem is often using the wrong technology for the job, e.g. using a relational DB when an object DB or document store would be better, or vice versa of course.
Some people try to use the same DB for everything, which is not always the answer. Sometimes it is useful to have multiple denormalizations of the same data for different purposes.
Thinking about the nature of the data and how it is written, read, queried, etc. is important. You can put all write-once data in one DB and optimize that DB for it. Other data that is written frequently could be stored in a DB optimized for that.
Distribution techniques can also assist with upscaling.

Which NoSQL backend to store trace data from webpage

In our web application we need to trace what users click, what they type into the search box, etc. Lots of data will be sent by AJAX. The functionality is generally a bit similar to Google Analytics, but we need to customize it in different ways.
Data will be collected and, once per day, aggregated and exported to PostgreSQL, so the backend should be able to handle dozens of inserts. I am not considering a traditional SQL database, because it probably won't handle so many inserts efficiently.
I wonder which backend you would use for such a task? I'm actually thinking about MongoDB or Cassandra, but maybe you know of better software for the task, perhaps something other than a NoSQL database?
The web application is written in Ruby on Rails, so support for Ruby would be nice, but that's definitely not the most important thing.
Sounds like you need to analyse your specific requirements.
It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.
Depending on your tolerance for data loss, there are a lot of options. If you choose a system with single-server durability, a major write bottleneck will be fdatasync() (assuming you store your data on hard drives).
If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.
Depending on your table and index structure etc., I'd expect that you can get quite a lot of inserts out of a "conventional" DB (e.g. PostgreSQL) if you manage it correctly and tune the durability (if it supports that) to your liking.
Sharding this into several instances of course will enable you to scale this up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.
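To illustrate the durability tuning mentioned above, a sketch of batched inserts with relaxed per-transaction durability in PostgreSQL, assuming psycopg2; the table and columns are illustrative (synchronous_commit can be turned off per session or per transaction, which skips waiting for the WAL flush at commit):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=tracking user=app")   # hypothetical DSN

def store_events(events):
    # Insert a batch of trace events with relaxed commit durability.
    with conn, conn.cursor() as cur:
        # Don't wait for the WAL flush at commit time; a crash can lose the
        # last few transactions, but write throughput goes up considerably.
        cur.execute("SET LOCAL synchronous_commit TO OFF")
        execute_values(
            cur,
            "INSERT INTO events (user_id, action, payload) VALUES %s",
            [(e["user_id"], e["action"], e["payload"]) for e in events],
        )

store_events([
    {"user_id": 1, "action": "click", "payload": "buy-button"},
    {"user_id": 2, "action": "search", "payload": "shoes"},
])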

Resources