We have been looking for a big data storage solution that can collect a huge pool of user information.
I would also like to mention that we are working on an RTB platform (specifically the DSP side). As a result, our platform handles around 100 million requests per day, which is about 1-2 thousand requests per second, depending on the time of day. Here is a simple overview of what we are going to implement:
My questions are:
Is Solr (SolrCloud) a good solution for a Data Management Platform?
Do you think SolrCloud can handle high-frequency traffic?
I have looked at the Solr performance problems page, and one of the items is extremely frequent commits (under "Slow commits"). What is the limit?
What SolrCloud configuration would be suitable for us? I mean the number of shards and cores, and the server configuration (CPU, RAM) needed to handle 1-2K QPS and store about 500M docs. How can we calculate this?
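For reference, here is our own rough back-of-envelope attempt in Python; every constant (bytes per doc, sustainable QPS per node, comfortable shard size) is a guess that would need validating against a test index built from our real data:

    # Hypothetical SolrCloud sizing sketch; replace every constant with measured values.
    total_docs     = 500_000_000   # target corpus size from the question
    bytes_per_doc  = 2 * 1024      # assumed on-disk index cost per document
    target_qps     = 2_000         # peak query rate from the question
    qps_per_node   = 250           # assumed sustainable QPS per node at our latency target
    docs_per_shard = 100_000_000   # assumed comfortable shard size for this document type

    index_size_gb = total_docs * bytes_per_doc / 1024**3
    shards        = -(-total_docs // docs_per_shard)   # ceiling division
    query_nodes   = -(-target_qps // qps_per_node)
    replicas      = max(2, -(-query_nodes // shards))  # at least 2 replicas for redundancy

    print(f"estimated index size : {index_size_gb:.0f} GB")
    print(f"shards x replicas    : {shards} x {replicas} = {shards * replicas} cores")

The two numbers that matter, bytes_per_doc and qps_per_node, can presumably only come from indexing and querying a representative sample of our own documents; once measured, the rest is arithmetic, plus enough spare RAM per node for the OS page cache to hold the hot portion of the index.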
Related
I am writing a feature that might lead to us executing a few hundred or even a thousand MongoDB transactions for a particular endpoint. I want to know whether there is a maximum limit to the number of transactions that can occur in MongoDB.
I read this old answer about SQL Server, "Can SQL Server 2008 handle 300 transactions a second?", but couldn't find anything on Mongo.
It's really hard to find an unbiased benchmark, let alone one that objectively reflects your projected workload.
Here is one, by the makers of Cassandra (obviously, Cassandra wins here): Cassandra vs. MongoDB vs. Couchbase vs. HBase.
The takeaway there is a few thousand operations per second as a starting point, and it only goes up as the cluster size grows.
Once again, the numbers there are just a baseline and cannot be used to accurately estimate the performance of your application on your data. Not all transactions are created equal.
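If you want a number for your own workload rather than a vendor benchmark, a small script is usually enough. Here is a minimal sketch with PyMongo; the connection string, database, and collection are placeholders, the transaction body is a stand-in for your endpoint's real writes, and multi-document transactions require a replica set:

    # Minimal, hypothetical throughput probe for MongoDB transactions.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
    coll = client.testdb.testcoll                                      # placeholder namespace

    def txn_body(session):
        # Stand-in for whatever your endpoint actually writes.
        coll.insert_one({"ts": time.time()}, session=session)

    n = 1_000
    start = time.time()
    with client.start_session() as session:
        for _ in range(n):
            session.with_transaction(txn_body)
    elapsed = time.time() - start
    print(f"{n} transactions in {elapsed:.1f}s -> {n / elapsed:.0f} txn/s")

Run it against hardware comparable to production; the result reflects your write concern, document sizes, and disk throughput far more than any fixed product limit.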
Well, this isn't a direct answer to your question, but since you have quoted a comparison, I would like to share an experience with Couchbase. A Couchbase cluster's performance is usually limited by network bandwidth (assuming you have given it SSD/NVMe storage, which improves storage latency). I have achieved in excess of 400k TPS on a 4-node cluster running Couchbase 4.x and 5.x in a K/V use case.
Node specs below:
2 x 12-core Xeon on HP BL460c blades
SAS SSDs (NVMe would generally be a lot better)
10 Gbps network within the blade chassis
Before we arrived here, we moved away from MongoDB, which was limiting system throughput to a few tens of thousands of operations per second at most.
We are currently using Couchbase for data caching, and there is talk of doing cross-data-center replication with it. However, we will need up to 1,000 documents replicated to multiple locations every second. Documents will be between 2 KB and 64 KB each.
Is there anyone out there with XDCR experience who can tell me whether this is even feasible, or whether we will have to use other means to replicate this data at that speed? The only "benchmark" in the Couchbase documentation implies that the rate of XDCR is only about 100 TPS (149 ms to replicate 11 documents).
The replication rate of XDCR is limited by network bandwidth and latency first, then CPU and disk I/O. Assuming you have enough bandwidth between the data centers and your clusters are provisioned properly, Couchbase will replicate hundreds of thousands of documents per second, or more. It's a pretty simple experiment to run: just set up XDCR between two single-node clusters and use one of the load-generator tools that come with Couchbase to create some traffic (cbworkloadgen in the Couchbase bin folder, or cbc-pillowfight, which comes with libcouchbase).
There are several config settings you can play with to optimize throughput, such as increasing batch size, changing the optimistic replication threshold, etc.
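Before running the experiment, it is also worth sanity-checking the bandwidth side with the numbers from the question; the sketch below assumes a ~32 KB average document and a rough 2x factor for protocol and metadata overhead:

    # Back-of-envelope XDCR bandwidth estimate (the overhead factor is an assumption).
    docs_per_sec  = 1_000
    avg_doc_bytes = 32 * 1024   # documents are 2-64 KB, so assume ~32 KB on average
    overhead      = 2.0         # rough allowance for protocol, metadata, and retries

    payload_mbps = docs_per_sec * avg_doc_bytes * 8 / 1_000_000
    print(f"raw payload   : {payload_mbps:.0f} Mbit/s")
    print(f"with overhead : {payload_mbps * overhead:.0f} Mbit/s")

At roughly half a gigabit per second including overhead, 1,000 documents per second is comfortably within a typical inter-datacenter link, which is consistent with bandwidth and latency, not Couchbase itself, being the first things to verify.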
We are currently using Solr 4.3 Cloud in master-slave mode and have been pretty happy with our initial Solr POC. We are looking to store social data (tweets, blogs, Facebook feeds) in Solr and make it searchable, while also utilizing the faceting capabilities Solr provides.
Given the amount of social data that comes in, we were wondering what kind of infrastructure would be required to, say, store 2 TB of data and query it with minimal latency.
Also, given the rate at which tweets come in, what would be the best indexing strategy?
I would suggest choosing a cloud platform. AWS is a good option here, since you can always start with one machine and change it if it doesn't suit your requirements.
So what will suit your requirements?
For Solr, given the heavy I/O, I would suggest a machine with high I/O capacity and good processing power.
I suggest using an AWS c3.2xlarge (8 vCPUs, 15 GB RAM, 2 x 80 GB SSD instance storage) and attaching an EBS volume of at least 4 TB to it.
This should cover your requirements.
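For the disk figure, a quick sanity check helps; both factors below are assumptions to validate against your own index, since the index-to-raw-data ratio depends heavily on the schema, and large segment merges or an optimize can temporarily need a multiple of the index size in free space:

    # Rough disk-sizing check for ~2 TB of raw data in Solr (factors are assumptions).
    raw_data_tb    = 2.0
    index_overhead = 1.2   # assumed index size relative to raw data for this schema
    merge_headroom = 2.0   # transient free space needed during large merges/optimize

    index_tb = raw_data_tb * index_overhead
    disk_tb  = index_tb * merge_headroom
    print(f"expected index size : {index_tb:.1f} TB")
    print(f"disk to provision   : {disk_tb:.1f} TB")

With those factors the result lands near 5 TB, so treat 4 TB as a starting point and monitor free space during merges.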
I am using Google App Engine for an app, and the app is currently hitting the datastore at a rate of around 2.5 million row writes and 4.5 million row reads per day.
I am currently porting the app to Amazon Elastic Beanstalk and Amazon RDS due to the very high costs of running an application on GAE.
Based on the values above, how can I find out or estimate what type of RDS instance I will need for my requirements? Is the above a considerable amount of processing for, let's say, a Small or Micro MySQL RDS instance to handle in a day?
Totally depends on a number of factors:
Row size.
Field types and sizes.
Complexity of your queries (joins, etc).
Proper use of indexes.
Row contention and other possible bottlenecks.
It is really hard to tell. But from experience, if you don't need fancy replication or sharding, the GAE datastore usually costs more, because it offers full redundancy, distribution, scalability, etc.
My suggestion would be to write a quick program that benchmarks RDS with a load that replicates what you are expecting. It should be easy to write if you forgo all the business rules and just do fake but randomized reads and writes.
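For scale, 2.5 million writes and 4.5 million reads per day average out to only about 30 writes/s and 50 reads/s, though peaks will be several times higher. A throwaway benchmark along those lines with PyMySQL (hypothetical table, credentials, and read/write mix; adjust it to resemble your app) could look like:

    # Throwaway load probe for an RDS MySQL instance (hypothetical schema and credentials).
    import random, time
    import pymysql

    conn = pymysql.connect(host="your-rds-endpoint", user="bench", password="...",
                           database="benchdb", autocommit=True)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS bench (id INT PRIMARY KEY AUTO_INCREMENT, "
                "k INT, v VARCHAR(255), KEY(k))")

    ops, duration = 0, 60                      # run for one minute
    deadline = time.time() + duration
    while time.time() < deadline:
        if random.random() < 0.36:             # roughly the 2.5M:4.5M write/read ratio
            cur.execute("INSERT INTO bench (k, v) VALUES (%s, %s)",
                        (random.randint(1, 100_000), "x" * 100))
        else:
            cur.execute("SELECT v FROM bench WHERE k = %s LIMIT 10",
                        (random.randint(1, 100_000),))
            cur.fetchall()
        ops += 1
    print(f"{ops / duration:.0f} ops/s sustained from one client")

Run a few copies in parallel from an EC2 instance in the same region so client latency doesn't become the bottleneck, and watch CPU, IOPS, and connection counts in CloudWatch while it runs.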
We're being asked to spec out production database hardware for an ASP.NET web application that hasn't been built yet.
The specs we need to determine are:
Database CPU
Database I/O
Database RAM
Here are the metrics I'm currently looking at:
Estimated number of future hits to the website, based on current IIS logs.
Estimated worst-case peak loads to the website.
Estimated number of DB queries per page, on average.
Number of servers in the web farm that will be hitting the database.
Cache polling traffic from the database (using SqlCacheDependency).
Estimated data cache misses.
Estimated number of daily database transactions.
Maximum acceptable page render time.
Any other metrics we should be taking into account?
Also, once we have all those metrics in place, how do they translate into hardware requirements?
What I have been doing lately for server planning is using some free tools that HP provides, collectively referred to as "server sizers". These are great tools because they figure out the optimal type of RAID to use, the correct number of disk spindles to handle the load (very important when planning a good DB server), the memory, the processors, and so on. I've provided the link below; I hope this helps.
http://h71019.www7.hp.com/ActiveAnswers/cache/70729-0-0-225-121.html?jumpid=reg_R1002_USEN
What I am missing is a measure for the needed / required / defined level of reliability.
While you could probably spec out a big honking machine to handle all the load, depending on your reliability requirements you might rather want to invest in multiple smaller machines and a safer disk subsystem (e.g. RAID 5).
Marc
In my opinion, estimating hardware for an application that hasn't been built and designed yet is more of a political issue than a scientific one. By the time you finish the project, hardware capabilities and prices, functional requirements, the expected number of concurrent users, external systems, and everything else will have changed, and that change is beyond your control.
However, this question comes up very often, since you need to put numbers in a proposal or provide a report to your manager. If it is a proposal, what you are trying to accomplish is to come up with a spec that can support the proposed software system. The only trick is to propose a system that does not inflate your cost and hurt your competitiveness, while not putting yourself at risk of an underpowered system.
If you can characterize your current workload in terms of hits to pages, then you can:
1) calculate the typical queries that will be run for each page
2) using those two pieces of information, estimate the workload on the database server
You also need to determine your performance requirements: what are the maximum and average response times you want for your website?
Given the workload and performance requirements, you can then calculate capacity. The best way to make this estimate is to take some existing hardware, run a simulated database workload against a database on that hardware, and then extrapolate your hardware requirements from the numbers you gathered in the first steps.
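As a purely hypothetical illustration of that last step, the arithmetic that turns the metrics listed in the question into a required query rate, and then into a server count using the benchmarked throughput, looks roughly like this (every input is a placeholder for your IIS logs and benchmark results):

    # Hypothetical capacity calculation; replace all inputs with measured values.
    peak_page_hits_per_sec = 200     # estimated worst-case peak load on the website
    queries_per_page       = 5       # average DB queries per page
    cache_hit_ratio        = 0.70    # effectiveness of SqlCacheDependency caching
    headroom               = 2.0     # safety factor for growth and estimation error

    required_qps = peak_page_hits_per_sec * queries_per_page * (1 - cache_hit_ratio) * headroom

    measured_qps_per_server = 400    # from the simulated workload on reference hardware

    servers_needed = -int(-required_qps // measured_qps_per_server)   # ceiling division
    print(f"required DB throughput : {required_qps:.0f} queries/s")
    print(f"servers of that spec   : {servers_needed}")

The same shape of calculation applies to I/O and RAM: measure the per-query cost on reference hardware, multiply by the projected peak rate, and add headroom for the reliability level you need.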