NoSQL database for high read concurrency

Hi, I am using LDAP to store user configuration. When I started I had a small amount of data; it has now grown to more than 20 million records.
Now I face performance issues. I preferred LDAP because user configuration is updated far less often than it is read and searched.
I want to replace LDAP with a NoSQL DB that can give me 20,000 read operations/sec over more than 50 million records.
The data in LDAP is user info, credentials and user-specific settings, and the issue arises because of all-IDs. I have indexed the data on first name and last name. Sun LDAP did well when I had less data, around 500K entries, but when the data grew to around 5 million I started having problems: searches were effectively unindexed. I later found the issue is the all-IDs mechanism. For example, Chavan is a very common surname in India, and when a value appears in more entries than the all-ids-threshold property allows, my search always fails. I increased the all-IDs threshold several times, but that has its own performance cost, so I want to get rid of LDAP and use a NoSQL DB.

What DB do you want to use? In Mongo, for example, you can use sharding for some load balancing (see the sketch below).
Also, a switch to NoSQL should be considered carefully, because it depends heavily on the application logic.
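If Mongo is the direction, here is a minimal sketch of what sharding plus a compound index on the troublesome name fields could look like, assuming PyMongo against an already-running sharded cluster (mongos); the database, collection and field names are my assumptions, not anything from the question:

```python
# Hypothetical sketch: shard a users collection and index the name fields.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos-host:27017")  # placeholder mongos URI

# Enable sharding for the database and spread users evenly with a hashed key.
client.admin.command("enableSharding", "userdb")
client.admin.command("shardCollection", "userdb.users", key={"_id": "hashed"})

# Compound index so searches on very common surnames stay indexed
# (the LDAP all-IDs failure mode the question describes).
users = client["userdb"]["users"]
users.create_index([("lastName", ASCENDING), ("firstName", ASCENDING)])

# Example read: served from the index, not a collection scan.
for doc in users.find({"lastName": "Chavan"}).limit(10):
    print(doc["_id"])
```

Reads can additionally be spread over replica-set secondaries with a read preference, at the cost of potentially reading slightly stale data.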

Related

Problems and solutions when using a secondary datastore alongside the main database?

I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer for me, please?
The question:
We use a secondary datastore (we use Elasticsearch alongside our main database) for real-time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover: if your main database is transactional (which it usually is), you either commit or you don't. After the record is inserted into your main database, there is no guarantee that it will also be committed to ES. In fact, if you commit several records to your primary DB, you may end up with some of them committed to ES and others not. This is a MAJOR issue.
Refresh interval: Elasticsearch refreshes every second by default, so "real-time" generally means at least one second later, or whenever the data is next queried. If you commit a record into your primary DB and immediately search for it via ES, it may not be found. The only way around this is to GET the record by its ID (see the sketch after this list).
Data duplication: Elasticsearch cannot do joins. You need to denormalize all data that is coming from an RDBMS. If one user has many posts, you cannot "join" to search; you have to add the user ID and any other user-specific details to every post object.
Hardware: Elasticsearch needs RAM (a bare minimum of 1 GB) to work properly, and that is assuming you don't use anything else from the ELK stack. This is an important cost consideration.
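To make the refresh-interval and denormalization points concrete, here is a hedged sketch using the 8.x Python Elasticsearch client; the index name and fields are made up for illustration:

```python
# Hypothetical sketch: denormalized document, real-time GET vs. near-real-time search.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Denormalized post: user fields are copied onto every post because
# Elasticsearch cannot join a separate users index at query time.
post = {
    "post_id": 42,
    "title": "Hello",
    "user_id": 7,
    "user_name": "alice",   # duplicated from the users table
}
es.index(index="posts", id=post["post_id"], document=post)

# GET by ID is real-time: it finds the document even before the next refresh.
doc = es.get(index="posts", id=42)

# A search may miss the document until the ~1 s refresh has happened.
hits = es.search(index="posts", query={"term": {"user_id": 7}})
```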
One problem might be synchronization issues, where the Elasticsearch store gets out of sync and starts serving stale data. To catch this, you will have to implement monitoring of your data pipeline, Elasticsearch and the primary database, detecting problems by checking update times, delays, record counts (within some margin of error) in each of them, and overall system status (up/down).
Another is disconnection and recovery: what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and start synchronising data again.
You also have to take into account sudden influxes of data: how do you scale Elasticsearch ingestion or your data processor (data pipeline) when there is a large volume of updates and inserts at peak hours, or after reconnecting following a network issue?

database to be used for log storing

I am starting on a log monitoring tool which captures audit logs, firewall logs and many other kinds of logs. I have an issue choosing the right kind of database for this project, as at least 500 logs are generated per second and all of them have to be stored.
Let's assume that this should be able to support 1BN+ log entries per month. The two factors that will likely matter most are the ability to write quickly and the ability to display reports quickly.
A common stack for log storage is the ELK stack, composed of Elasticsearch, Logstash, and Kibana.
Elasticsearch is used to store the documents and execute queries (a rough sketch of feeding it log entries follows below).
Logstash monitors and parses the logs.
Kibana is used to create reports based on the data in Elasticsearch.
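As an illustration of the write path only, here is a hedged sketch of bulk-indexing parsed log lines into Elasticsearch from Python; in a real ELK setup Logstash or a Beats shipper does this, and the index name and log fields below are assumptions:

```python
# Hypothetical sketch: bulk-index log entries into a per-day Elasticsearch index.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def to_actions(log_lines):
    """Turn raw log lines into bulk actions targeting a daily index."""
    now = datetime.now(timezone.utc)
    for line in log_lines:
        yield {
            "_index": f"logs-{now:%Y.%m.%d}",
            "_source": {"message": line, "@timestamp": now.isoformat()},
        }

# helpers.bulk batches requests, which is what makes ~500 writes/sec comfortable.
lines = ["DENY tcp 10.0.0.5:443", "ALLOW udp 10.0.0.9:53"]
helpers.bulk(es, to_actions(lines))
```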
There are other options out there. Splunk is a paid solution that does all of the above. Graylog is another solution, similar to Splunk.

Is Google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication, all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry would be saved as a single document; each client uploads its documents on a daily basis and can produce around 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions: how many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to go into the server.
5 inserts per entity per second: is a transaction considered a single insert?
Post-analysis: text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch: getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching data from ES.
Thanks!
It is a bad idea to use Datastore, and even worse if you use entity groups with parent/child, as a comment mentions when comparing performance.
Those numbers do not apply, but Datastore is simply not designed for what you want.
BigQuery is what you want. It is designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
I do not agree. Datastore is a fully managed NoSQL document store; you can store the logs you want in it and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB-style log analysis use case on Google Cloud.
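To make the schema point concrete, here is a hedged sketch of the same log entry written both ways with the Google Cloud Python clients; the project, dataset, table, kind and field names are all assumptions for illustration:

```python
# Hypothetical sketch: BigQuery needs a schema up front, Datastore does not.
from google.cloud import bigquery, datastore

# --- BigQuery: the table schema must exist before rows are streamed in. ---
bq = bigquery.Client()
table = bigquery.Table(
    "my-project.logs.firewall",  # placeholder table id
    schema=[
        bigquery.SchemaField("ts", "TIMESTAMP"),
        bigquery.SchemaField("client", "STRING"),
        bigquery.SchemaField("message", "STRING"),
    ],
)
bq.create_table(table, exists_ok=True)
bq.insert_rows_json(table, [{"ts": "2024-01-01T00:00:00Z",
                             "client": "acme", "message": "DENY tcp ..."}])

# --- Datastore: schemaless entity; properties can vary per log entry. ---
ds = datastore.Client()
entity = datastore.Entity(key=ds.key("LogEntry"))
entity.update({"client": "acme", "message": "DENY tcp ...", "extra_field": 42})
ds.put(entity)
```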

Implementing a license key management system on GAE: Datastore or Cloud SQL?

I am implementing a license key system on Google AppEngine. Keys are generated ahead of time and emailed to users. Then they log into the system and enter the key to activate a product.
I could have potentially several hundred people submitting their keys for validation at the same time. I need the transactions to be strongly consistent so that the same license key cannot be used more than once.
Option 1: Use the datastore
To use the Datastore, I need it to be strongly consistent, so I will use an entity group for the license keys. However, there is a limit of 1 write/second to an entity group. App Engine requests must complete within 60 seconds, so this would mean either notifying users offline when their key was activated, or having them poll in a loop until their key was accepted.
Option 2: Use Google Cloud SQL
Even the smallest tier of Google Cloud SQL can handle 250 concurrent connections. I don't expect these queries to take very long. This seems like it would be a lot faster and would handle hundreds or thousands of simultaneous license key requests without any issues.
The downside to Google Cloud SQL is that it is limited in size to 500GB per instance. If I run out of space, I'll have to create a new database instance and then query both for the submitted license key. I think it will be a long time before I use up that 500GB and it looks like you can even increase the size by contacting Google.
Seems like Option2 is the way to go - but I'm wondering what others think. Do you find Entity Group performance for transactions acceptable?
Option 2 seems more feasible, neat and clean in your case, but you have to take care of database connections yourself, and that becomes a hassle under increasing load if connection pooling is not used properly.
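A hedged sketch of how the Cloud SQL path could enforce one-time activation while pooling connections, using SQLAlchemy against a MySQL instance; the DSN, table and column names are assumptions:

```python
# Hypothetical sketch: atomically activate a license key in Cloud SQL (MySQL),
# with pooled connections so hundreds of concurrent requests don't exhaust the DB.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:pass@cloudsql-host/licenses",  # placeholder DSN
    pool_size=20, max_overflow=10,                        # connection pooling
)

def activate(key: str, user_id: int) -> bool:
    """Return True if the key was activated now, False if it was already used."""
    with engine.begin() as conn:  # one transaction per attempt
        # The UPDATE only matches while the key is unused, so two concurrent
        # requests for the same key cannot both succeed.
        result = conn.execute(
            text("UPDATE license_keys SET used_by = :uid, used_at = NOW() "
                 "WHERE license_key = :key AND used_by IS NULL"),
            {"uid": user_id, "key": key},
        )
        return result.rowcount == 1
```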
Datastore can also be used for a license key system by defining multiple entity groups with dummy ancestors based on a few leading or trailing digits of the key, to work around the 1 write/second per entity group limit. This way you can also easily determine the entity group of a generated or submitted license key.
For example, if 4321 G42T 531P 8922 is a license key, then 4321 can be used as the entity group, and all keys starting with 4321 will be part of that entity group. This is a sharding-like mechanism to avoid simultaneous writes to a single entity group (see the sketch after this answer).
If you need to query on columns other than the license key, a separate mapping table can be maintained without an entity group.
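A minimal sketch of that prefix-based ancestor idea with the legacy App Engine ndb library; the LicenseKey model and helper names are hypothetical, and only the prefix idea comes from the answer above:

```python
# Hypothetical sketch: shard license keys across entity groups by key prefix.
from google.appengine.ext import ndb

class LicenseKey(ndb.Model):
    code = ndb.StringProperty(required=True)
    used_by = ndb.StringProperty()          # None until activated

def group_key(code):
    # The first 4 characters pick the entity group, e.g. "4321" for "4321 G42T ...".
    return ndb.Key("KeyGroup", code.replace(" ", "")[:4])

def create(code):
    # Keys are stored under their prefix group so later writes stay in that group.
    LicenseKey(parent=group_key(code), code=code).put()

@ndb.transactional
def activate(code, user):
    """Strongly consistent: the transaction spans only the prefix's entity group."""
    entity = LicenseKey.query(LicenseKey.code == code,
                              ancestor=group_key(code)).get()
    if entity is None or entity.used_by:
        return False                        # unknown or already-used key
    entity.used_by = user
    entity.put()
    return True
```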
You can mix them: let Google Cloud SQL hold only the keys and emails; with 500 GB I believe you can store a key for every person on the planet.
Alternatively, you can ask Google to increase the data size limit.
I would go with Option 1, the Datastore; it's much faster and more scalable.
And I don't know why you would need to create an entity group: you could make the license key itself the Key, so that each entity is in its own entity group. That alone keeps things scalable.
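A hedged sketch of that variant, again with a hypothetical LicenseKey model: the key string becomes the entity ID, so every activation is a transaction over its own single-entity group:

```python
# Hypothetical sketch: one entity group per license key.
from google.appengine.ext import ndb

class LicenseKey(ndb.Model):
    used_by = ndb.StringProperty()   # the entity ID is the license key string itself

@ndb.transactional
def activate(code, user):
    entity = ndb.Key(LicenseKey, code).get()   # direct, strongly consistent lookup
    if entity is None or entity.used_by:
        return False
    entity.used_by = user
    entity.put()
    return True
```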

Centralized data access or variables

I'm trying to find a way to access a centralized database for both retrieval and update.
The following is what I'm looking for.
Server 1 has this variable, for example:
int counter;
Server 2 will be interacting with the user and will increase the counter whenever the user uses the service, until a certain threshold is reached. Once that threshold is reached, server 2 will start rejecting the user's access.
Also, the user will be able to use multiple servers (like server 2) from multiple locations, and each time the user accesses any server the counter will be increased.
I tried Google, but it's hard to search for something without a name.
One approach to designing this is to shard by user, i.e. split the users between your servers depending on the ID of the user. That is, if you have 10 servers, then users with IDs ending in 2 would have all of their data stored on server 2, and so on. This assumes that user IDs are distributed uniformly; a tiny routing sketch follows below.
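Here is a minimal sketch of that routing rule; the server list and the idea of a single "home" server per user are assumptions for illustration:

```python
# Hypothetical sketch: route each user's counter to one "home" server by user ID.
SERVERS = [f"server{i}" for i in range(10)]   # placeholder server names

def home_server(user_id: int) -> str:
    # Users whose ID ends in 2 always land on server2, and so on.
    return SERVERS[user_id % len(SERVERS)]

# Every front-end server forwards increment/check requests for a user to
# home_server(user_id), so the counter has one authoritative location.
print(home_server(10042))   # -> "server2"
```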
One other approach is to shard the users by location - if you have servers in Asia vs Europe, for example. You'd need a property in the User record that tells you where the user is located; based on that, you'll know which server to route them to.
Ultimately, all of these design options have a concept of "where does the master record for a user reside?" Each of these approaches attempts to definitively answer this question.
A different category of approaches has to do with multi-master replication, which is supported by some database vendors; this approach does not scale as well (i.e. it's hard to get it to scale to 20 servers), but you might want to look into it, too.
