I've read a lot of posts on using Elasticsearch (ES) as a primary store and whether that is good practice. But I could not find anywhere whether ES will fail the request 100% of the time when it is unavailable.
Under what conditions can ES lose your data? One condition I found is nodes being down, but in that case the insert/update request itself will fail and the client can retry. What I am looking for is whether the ES REST high-level client can respond with 200 OK and still lose the data internally.
Also, I don't need ACID properties, but I do need fault tolerance and some protection against data loss. Is that possible with ES, for example with replication?
The articles and blogs that I've read so far are:
Using ElasticSearch as source of truth
How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability
AWS Elasticsearch as a primary database
I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer, please?
The question:
We use a secondary datastore (Elasticsearch alongside our main database) for real-time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover: Your main database is usually transactional, so a write is either committed or it is not. After a record is inserted into your main database, there is no guarantee that it will also be written to ES. In fact, if you commit several records to your primary DB, some of them may reach ES while others do not. This is a MAJOR issue.
Refresh interval: Elasticsearch by default refreshes every second. That means "real-time" is generally 1 second later. If you commit a record to your primary DB and immediately search for it in ES, it may not be found. The simplest way around this is to GET the record by its ID, which does not wait for a refresh; the refresh parameter of the index API (for example refresh=wait_for) is another option.
Data duplication: Elasticsearch cannot do relational joins. You need to denormalize all data that comes from an RDBMS. If one user has many posts, you cannot "join" users to posts at search time. You have to add the user ID and any other user-specific details to every post object.
Hardware: Elasticsearch needs RAM (a bare minimum of 1 GB) to work properly. This is assuming you don't use anything else from the ELK stack. This is an important cost consideration.
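One common mitigation for the missing transactional cover is the transactional outbox pattern; it is not something this answer proposes, so treat what follows as a minimal sketch under assumptions. The record and an "outbox" event are written in the same database transaction, and a separate relay later pushes the events to ES. Here sqlite3 stands in for the main database, and the table names, `save_post`, `relay_outbox` and the `index_document` callback are all hypothetical.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory database with a data table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
    conn.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
    return conn


def save_post(conn, post_id, user_id, body):
    """Write the record and its outbox event in ONE transaction."""
    payload = json.dumps({"id": post_id, "user_id": user_id, "body": body})
    with conn:  # commits on success, rolls back if either INSERT fails
        conn.execute(
            "INSERT INTO posts (id, user_id, body) VALUES (?, ?, ?)",
            (post_id, user_id, body),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def relay_outbox(conn, index_document):
    """Push pending events to the secondary store, oldest first.

    A row is deleted only after index_document succeeded, so a failure
    leaves the event in place for the next run (at-least-once delivery).
    """
    rows = conn.execute("SELECT seq, payload FROM outbox ORDER BY seq").fetchall()
    delivered = 0
    for seq, payload in rows:
        try:
            index_document(json.loads(payload))
        except Exception:
            break  # secondary store unavailable: stop and retry later
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
        delivered += 1
    return delivered
```

Because delivery is at-least-once, the indexing call should be idempotent, for example by using the primary key as the document ID.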
One problem might be synchronization issues, where the Elasticsearch store gets out of sync and starts serving stale data. To avoid this, you will have to implement monitoring on your data pipeline, Elasticsearch and the primary database, and detect problems by checking update times, delay, number of records (within some margin of error) in each of them, and overall system status (up/down).
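The monitoring check described above can be sketched as a comparison of record counts and last-update times. This is only an illustration: `StoreStats`, `check_sync` and the thresholds are hypothetical, and how the counts and timestamps are collected from each store is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoreStats:
    record_count: int
    last_update: datetime


def check_sync(primary, secondary, max_count_drift=0.01, max_lag=timedelta(minutes=5)):
    """Return a list of detected problems; an empty list means 'in sync'."""
    problems = []

    # Relative difference in record counts, tolerating a small margin of error.
    if primary.record_count:
        drift = abs(primary.record_count - secondary.record_count) / primary.record_count
    else:
        drift = 0.0 if secondary.record_count == 0 else 1.0
    if drift > max_count_drift:
        problems.append(f"record count drift {drift:.1%}")

    # How far the secondary's newest update trails the primary's.
    lag = primary.last_update - secondary.last_update
    if lag > max_lag:
        problems.append(f"secondary is {lag} behind")

    return problems
```

A scheduler could run this periodically and raise an alert whenever the returned list is not empty.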
Another is disconnection and recovery: what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and start synchronising data again.
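A simple form of automatic reconnection is retrying with exponential backoff. The sketch below assumes the failing call raises `ConnectionError`; `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument is injectable so the function can be tested without waiting.

```python
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with growing delays.

    The delay doubles after each failure (0.5s, 1s, 2s, ...) up to max_delay.
    The last failure is re-raised so the caller can alert or give up.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

After a successful reconnect, the pipeline still has to resume from the last record it synchronised, which this sketch does not cover.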
You also have to take into account a sudden influx of data: how to scale Elasticsearch ingestion or your data processor (data pipeline) when there is a large number of updates and inserts at peak hours, or after reconnection following network issues.
We have a microservice application which saves its data into an Oracle DB.
So far the DB is our single point of failure which we want to improve (we are using a single Oracle DB with a cold failover instance).
Now the company is asking us to upgrade the Oracle DB. The issue is that this requires downtime.
For that reason we were thinking about:
add a global/geo-replicated cache layer (e.g. Redis) between the microservice and the DB
for each new record that should be saved on the DB:
Add the record to the cache (storing the entries on disk in case the whole cache layer crashes)
publish an event to a queue (we have RabbitMQ). On the other side of the queue we can create a new service that consumes the events and adds them to the DB asynchronously.
It's basically adding a write-behind cache layer.
In the above scenario we are confident that we can easily keep 1 week of data or more in the cache.
If the DB is down, the new service listening to the queue will simply keep retrying to add the rows to the DB. As soon as an event is added to the DB, it can be acked and the next one will be consumed. In this way, if the DB is down or if we have to do some maintenance, the main application should not be affected: the users can still "save" the data and retrieve it (with the 1 week max constraint whenever the DB is down).
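The consume, insert, then ack loop described above can be sketched as follows. This is an illustration only: a `collections.deque` stands in for the RabbitMQ queue, and `DatabaseDown`, `drain_queue` and the `insert_row` callback are hypothetical names rather than part of any real client library.

```python
from collections import deque


class DatabaseDown(Exception):
    """Raised by insert_row when the database cannot be reached."""


def drain_queue(queue, insert_row):
    """Write queued events to the DB in order; ack only after a successful insert.

    The event is peeked, not removed, before the insert. If the DB is down the
    event stays at the head of the queue and is retried on the next run.
    """
    saved = 0
    while queue:
        event = queue[0]  # peek: not acked yet
        try:
            insert_row(event)
        except DatabaseDown:
            break  # leave the event unacked and try again later
        queue.popleft()  # ack: the row is safely in the DB
        saved += 1
    return saved
```

If the process dies after the insert but before the ack, the same event is delivered again, so the insert should be idempotent (for example, keyed on an event ID).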
The downside is that the architecture is more complex and our data is now only eventually consistent.
Is there another design pattern to better deal with database downtime without having the users feel that something is wrong?
Do you know any existing tools that we can use to automatically read an event from RabbitMQ and save it in the DB? (We are already doing this with Logstash to automatically forward some RabbitMQ events to Elasticsearch.)
The next step would be to have a DB cluster (Cassandra, MongoDB, etc.), but for now we do not have the capacity for that.
Adding a cache to increase availability is probably an awkward solution, as you will eventually run into the same issue of keeping the cache available. Also, handling cold caches is not a simple task.
I am not familiar with Oracle, but most databases support replication, and you have options for synchronous, asynchronous and semi-synchronous patterns.
A quick search led me to "Oracle Data Guard", which seems to be the tool you need. The docs say that Data Guard supports data replication and failover.
As for using Cassandra, I highly recommend evaluating it carefully first: Oracle gives you ACID properties and joins, which makes application code much simpler. Also, consistency patterns will be different. There are lots of details to think about.
My general recommendation is to look into your data layer (Oracle in this case) and follow the vendor's recommendations for achieving high availability. Oracle is a mature product, and availability is well supported.
I want to implement a real-time chat. My main db is PostgreSQL, with the backend written in NodeJS. Clients will be mobile devices.
As far as I understand, to achieve real-time performance for messaging, I need to use Redis.
Therefore, my plan was to use Redis for the X most recent messages (for example 1000) between 2 or more people (group chat), with everything synced and backed up in my main DB, which is PostgreSQL.
However, since Redis is essentially just RAM, the chat history can be too "vulnerable", owing to the volatile nature of storing data in RAM.
So if my Redis server has some unexpected and temporary failure, the recent messages in conversations would be lost.
What are the best practices nowadays to implement something like this?
Do I simply need to persist Redis data to disk? But in that case, wouldn't that hurt performance, since it will increase the write time for each message sent?
Or perhaps I should just prepare a recovery method, that fetches the recent history from PostgreSQL in case my redis chat history list is empty?
P.S - while we're at it, could you please suggest a good approach for maintaining the status of users (online/offline) ? Is it done with Redis as well?
What are the best practices nowadays to implement something like
this? Do I simply need to persist Redis data to disk? but in that
case, wouldn't that hurt performance, since it will increase the write
time for each message sent?
Yes, enabling persistence will impact the performance of Redis.
Your best bet is to run a quick benchmark with the expected IOPS and type of operations from your application, to identify the impact of enabling persistence.
RDB vs AOF:
With RDB persistence enabled, the parent process does not perform disk I/O to write the snapshot. Based on the configured save points, Redis forks a child process that writes the RDB file.
However, depending on the save point configuration, you may lose the data written after the last save point if the server restarts or crashes before the next snapshot.
If your use case cannot tolerate data loss for this period, you need to look at the AOF persistence method. AOF logs all write operations, and the log is replayed to reconstruct the data when the server restarts.
AOF with the fsync policy set to every second can be slower than RDB; however, it can be as fast as RDB if fsync is disabled (at the cost of leaving flushing to the operating system).
Read the trade-offs of using RDB or AOF: https://redis.io/topics/persistence#redis-persistence
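Independently of the persistence choice, the recovery method mentioned in the question (reloading the recent history from PostgreSQL when the Redis copy is missing) can be sketched like this. It is a minimal illustration with assumptions: a plain dict stands in for Redis, and `ChatHistory`, `save_to_db` and `load_recent_from_db` are hypothetical names for code that would talk to PostgreSQL.

```python
class ChatHistory:
    """Keeps the newest `limit` messages per conversation in a cache.

    The durable database is always written first; the cache is only a copy
    that can be rebuilt from the database after a cache failure.
    """

    def __init__(self, cache, save_to_db, load_recent_from_db, limit=1000):
        self.cache = cache  # dict: conversation_id -> list of messages
        self.save_to_db = save_to_db
        self.load_recent_from_db = load_recent_from_db
        self.limit = limit

    def append(self, conversation_id, message):
        self.save_to_db(conversation_id, message)  # durable write first
        cached = self.cache.get(conversation_id)
        if cached is not None:
            cached.append(message)
            del cached[:-self.limit]  # trim to the newest `limit` messages
        # On a cache miss nothing is cached here; recent() reloads lazily.

    def recent(self, conversation_id):
        cached = self.cache.get(conversation_id)
        if cached is None:  # cold or lost cache: rebuild from the database
            cached = list(self.load_recent_from_db(conversation_id, self.limit))
            self.cache[conversation_id] = cached
        return list(cached)
```

Writing to the database before updating the cache means a Redis failure can only cost a slower first read, never a lost message.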
P.S - while we're at it, could you please suggest a good approach for
maintaining the status of users (online/offline) ? Is it done with
Redis as well?
Yes
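The answer does not say how, so the following is only one possible approach, stated as an assumption: each client sends a periodic heartbeat that refreshes a per-user entry with a time-to-live, and a user counts as online while the entry has not expired. In Redis this would typically be a key set with an expiry; the sketch below models the same idea in memory, and `PresenceTracker` is a hypothetical name.

```python
import time


class PresenceTracker:
    """In-memory model of heartbeat-plus-TTL presence tracking."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._expires_at = {}

    def heartbeat(self, user_id):
        """Called whenever the client pings; extends the user's online window."""
        self._expires_at[user_id] = self.clock() + self.ttl

    def is_online(self, user_id):
        """A user is online until ttl_seconds pass without a heartbeat."""
        expires = self._expires_at.get(user_id)
        return expires is not None and self.clock() < expires
```

With this scheme a crashed or disconnected client simply stops sending heartbeats and turns offline once the TTL runs out, without any explicit logout.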
Let's say I have a distributed application that writes to the database. To decrease latency one of the instances (app + database) is hosted in Australia and another one is hosted in Europe. Both instances of database need to share the same data.
So what we are after here is data locality. The reason for it is obvious: we don't want users in Australia shooting requests to our database in Europe because that would increase latency.
The natural choice would be to deploy both database instances in one replica set. But it seems that with MongoDB you can write to only one instance within a replica set.
What are the strategies with MongoDB to have two database instances that share the same data and can both accept writes? Or is MongoDB just the wrong choice for this requirement?
Huge subject, but I'll try to give you a short and simple answer:
As your two instances must share the same data, you can't use a sharded cluster with zones. But a replica set can be your solution:
Create a replica set with at least the following:
a server in a 'neutral' zone. It will be the primary server (give it a higher priority). As long as it remains primary, this server will handle your write operations.
your two existing servers with lower priority.
Set the Read Preference in your application to 'nearest'. This way, your read operations will be handled by the server with the lowest network latency, regardless of whether it is primary or secondary.
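The setup above can be sketched as a replica set configuration document with member priorities, plus a client connection string that sets the read preference. This is a rough illustration: the host names, the set name `rs0`, the database name and the helper functions are all hypothetical, while `priority`, `replicaSet` and `readPreference=nearest` are standard MongoDB options.

```python
from urllib.parse import urlencode


def replica_set_config(name, neutral_host, regional_hosts):
    """Build a replica set config: the neutral member gets the highest priority."""
    members = [{"_id": 0, "host": neutral_host, "priority": 2}]
    for i, host in enumerate(regional_hosts, start=1):
        members.append({"_id": i, "host": host, "priority": 1})
    return {"_id": name, "members": members}


def client_uri(hosts, replica_set, database="app"):
    """Connection string that sends reads to the lowest-latency member."""
    options = urlencode({"replicaSet": replica_set, "readPreference": "nearest"})
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


hosts = ["neutral.example.com:27017", "eu.example.com:27017", "au.example.com:27017"]
config = replica_set_config("rs0", hosts[0], hosts[1:])
uri = client_uri(hosts, "rs0")
```

Writes still go to the primary in the neutral zone, so only read latency improves for the Australian and European users.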
But I highly recommend checking the documentation to see how to deploy this architecture correctly. Here's a good start
EDIT
Some considerations about this solution:
This is one of the rare use cases where it's better to read from secondaries. In general, prefer reading your data from the primary, since a replica set is designed for high availability, not for scalability.
If some of your data can be 'located' to be accessed faster, consider sharding collections as a better solution
I am implementing a license key system on Google AppEngine. Keys are generated ahead of time and emailed to users. Then they log into the system and enter the key to activate a product.
I could have potentially several hundred people submitting their keys for validation at the same time. I need the transactions to be strongly consistent so that the same license key cannot be used more than once.
Option 1: Use the datastore
To use the datastore, I need it to be strongly consistent, so I will use an EntityGroup for the license keys. However, there is a limit of 1 write / second to an entity group. App Engine requests must complete within 60 seconds, so this would mean either notifying users offline when their key was activated, or having them poll in a loop until their key was accepted.
Option 2: Use Google Cloud SQL
Even the smallest tier of Google Cloud SQL can handle 250 concurrent connections. I don't expect these queries to take very long. This seems like it would be a lot faster and would handle hundreds or thousands of simultaneous license key requests without any issues.
The downside to Google Cloud SQL is that it is limited in size to 500GB per instance. If I run out of space, I'll have to create a new database instance and then query both for the submitted license key. I think it will be a long time before I use up that 500GB and it looks like you can even increase the size by contacting Google.
Seems like Option 2 is the way to go, but I'm wondering what others think. Do you find Entity Group performance for transactions acceptable?
Option 2 seems more feasible, neat and clean in your case, but you have to manage DB connections yourself, and that becomes a hassle under increasing load if connection pooling is not used properly.
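For Option 2, the requirement that a key can be used only once can be met with a single conditional UPDATE, so that two simultaneous submissions cannot both succeed. The sketch below uses sqlite3 as a stand-in for Cloud SQL; the `license_keys` table, its columns and `activate_key` are hypothetical.

```python
import sqlite3


def make_db(keys):
    """Create an in-memory table of pre-generated, not yet activated keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE license_keys (license_key TEXT PRIMARY KEY, activated_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO license_keys (license_key, activated_by) VALUES (?, NULL)",
        [(k,) for k in keys],
    )
    conn.commit()
    return conn


def activate_key(conn, license_key, user_email):
    """Activate a key exactly once; returns True only for the first caller.

    The WHERE clause matches only unactivated keys, so a second attempt
    (or an unknown key) updates zero rows.
    """
    with conn:
        cursor = conn.execute(
            "UPDATE license_keys SET activated_by = ? "
            "WHERE license_key = ? AND activated_by IS NULL",
            (user_email, license_key),
        )
    return cursor.rowcount == 1
```

The check and the write happen in one statement, so no separate SELECT-then-UPDATE race can occur.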
Datastore can also be used for a license key system by defining multiple EntityGroups with dummy ancestors based on a few leading or trailing digits of the key, to work around the limit of 1 write / second per entity group. This way you can also easily determine the EntityGroup of a generated or provided license key.
For example, if 4321 G42T 531P 8922 is a license key, then 4321 can be used as the EntityGroup, and all keys starting with 4321 will be part of this EntityGroup. This is a sharding-like mechanism to avoid simultaneous writes to a single entity group.
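Deriving the group from the key prefix could look like the following sketch. `entity_group_for` is a hypothetical helper, and creating the actual ancestor keys in the datastore is not shown.

```python
def entity_group_for(license_key, prefix_length=4):
    """Return the entity group name for a key: its first prefix_length characters.

    Whitespace is removed and letters are upper-cased first, so differently
    formatted copies of the same key map to the same group.
    """
    normalized = "".join(license_key.split()).upper()
    if len(normalized) < prefix_length:
        raise ValueError("license key is shorter than the prefix length")
    return normalized[:prefix_length]


group = entity_group_for("4321 G42T 531P 8922")  # "4321"
```

The write load only spreads out if the leading characters of the generated keys are evenly distributed.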
If you need to perform queries on some columns other than license key then a separate mapping table can be maintained without an EntityGroup.
You can mix them: Google Cloud SQL only has to hold keys and emails, and with 500 GB I believe you can store keys for everyone on the planet.
On the other hand, you can ask Google to increase the data size limit.
I would go with Option 1, the datastore; it's much faster and more scalable.
And I don't know why you need to create an EntityGroup; you could use the "license key" itself as the Key, so each Entity is in its own EntityGroup... only this will make things scalable.