CouchDB parallel replications cause high CPU usage

I have a per-user DB architecture like so:
There are around 200 user DBs, and each has a continuous replication link to the master couch (all within the same couch instance). The problem is that CPU usage at any given time is always close to 100%.
The DBs are idle, so no data is being written to or read from them. There's only a few KB of data per DB, so I don't think the load is an issue at this point. The master DB size is less than 10 MB.
How can I go about debugging this performance issue?

You should have a look at https://github.com/redgeoff/spiegel - it's a tool for handling many CouchDB replications in a scalable way. Basically, it achieves that by listening to the _global_changes feed and creating replications only when they are needed.
In recent CouchDB versions (2.1.0+) the replicator has been improved, but I think that for replicating per-user databases it still makes sense to use an external mechanism like Spiegel to manage the number of active replications.
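For reference, here is a rough Python sketch of that "replicate on demand" idea using plain HTTP calls against CouchDB. The host, credentials, and the userdb-* naming convention are assumptions for illustration, and the _global_changes feed has to be enabled on the server:

```python
# Sketch only: trigger a one-shot replication whenever a user DB changes,
# instead of keeping 200 continuous replications alive at once.
import json
import requests

COUCH = "http://admin:password@localhost:5984"   # assumption

def replicate_once(source_db):
    """Create a one-shot (non-continuous) replication from a user DB to master."""
    doc = {
        "source": f"{COUCH}/{source_db}",
        "target": f"{COUCH}/master",
        "continuous": False,          # one-shot: finishes and frees the worker
    }
    requests.post(f"{COUCH}/_replicate", json=doc).raise_for_status()

def watch_global_changes():
    """Follow _global_changes and replicate only the DBs that actually changed."""
    params = {"feed": "continuous", "since": "now", "heartbeat": 30000}
    with requests.get(f"{COUCH}/_global_changes/_changes",
                      params=params, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue              # heartbeat newline
            change = json.loads(line)
            # _global_changes ids look like "updated:dbname"
            kind, _, dbname = change.get("id", "").partition(":")
            if kind == "updated" and dbname.startswith("userdb-"):
                replicate_once(dbname)

if __name__ == "__main__":
    watch_global_changes()
```

Spiegel does essentially this, plus retries, concurrency limits, and clean-up, which is why I'd still use it rather than a hand-rolled script in production.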

Just as a reminder: there are known security flaws in CouchDB 2.1.0, so you may need to upgrade to 2.1.1. It's even possible your instance has already been compromised, as has happened to others.

Related

What is the smallest AWS EC2 instance I can run a postgres db on?

There is a free tier on AWS, so I can get a micro EC2 instance essentially for free, or close to it. I'm sure setting up Elastic IPs, load balancers, etc. is extra.
Would it effectively be possible for me to run a Postgres DB for a small API? Roughly 50 inserts + 50 reads per second, say about 6,000 operations per minute at the most.
I can't seem to find anything online, which makes me think that this might be a silly idea.
For this not to be an "open question", it's simply: is it possible and realistic to expect usable performance on an EC2 instance running my Postgres DB?
The best way to determine whether the database can handle a particular workload is to test it at that capacity. Launch the database, simulate traffic and monitor its performance. Please note that every application uses a database differently, so nobody can provide "general advice" as to whether a particular-sized database would meet the needs of your particular application.
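As an example of "simulate traffic", here is a minimal load-simulation sketch in Python. It assumes psycopg2, a reachable Postgres instance, and a hypothetical events table; tune the mix and rate to your real workload and watch the latencies it prints alongside CloudWatch CPU/IOPS metrics:

```python
# Sketch: drive roughly 50 inserts + 50 reads per second and report latency.
import random
import time

import psycopg2

DSN = "host=your-ec2-host dbname=test user=app password=secret"  # assumption
TARGET_OPS_PER_SEC = 100   # ~50 inserts + ~50 reads

def run(duration_sec=60):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, payload text)")

    latencies = []
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        start = time.perf_counter()
        if random.random() < 0.5:
            cur.execute("INSERT INTO events (payload) VALUES (%s)", ("x" * 100,))
        else:
            cur.execute("SELECT * FROM events ORDER BY id DESC LIMIT 10")
            cur.fetchall()
        latencies.append(time.perf_counter() - start)
        # Pace the loop so it doesn't exceed the target op rate.
        time.sleep(max(0.0, 1.0 / TARGET_OPS_PER_SEC - latencies[-1]))

    latencies.sort()
    print("ops:", len(latencies),
          "p50: %.1f ms" % (latencies[len(latencies) // 2] * 1000),
          "p99: %.1f ms" % (latencies[int(len(latencies) * 0.99)] * 1000))

if __name__ == "__main__":
    run()
```

If the p99 latency stays acceptable for an hour or more at that rate, the instance size is probably workable for this workload.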
If you are going to run 'production' workloads, try to avoid using the Burstable performance instances (T2, T3) since they can hit limits under heavy workloads unless the 'Unlimited' option is selected. T2/T3 is great for bursty workloads, but not for sustained workloads.
Comparing m5.xlarge between EC2 and RDS:
Amazon EC2: 19.2c/hr ($4.61/day)
Amazon RDS: 35.6c/hr ($8.54/day)
For the additional price, Amazon RDS provides a fully-managed database, automated backups, CloudWatch metrics, etc. This is probably worth much more than $4 of your time every day.
Alternatively, if you can modify your application to use NoSQL instead of SQL, you could use Amazon DynamoDB where the capacity you mention would cost 4c/hour ($1/day) plus request and storage costs.
Don't underspend on your database — it powers everything you do. Instead, try to save money by turning off non-production systems when they aren't being used (eg weekends and evenings). That will hopefully give you enough savings to afford an appropriately-powered database.

SQL query Performance testing regarding cache

The situation
We are planning to migrate our entire (production) DWH to a new cluster. One of the requirements is that the new cluster is at least as fast as the current cluster. This calls for performance testing, with the current cluster serving as a baseline.
When conducting these tests, we want both environments to be near identical in terms of behaviour.
I can already clone the user behaviour from the live production cluster and execute it on the new cluster. That leaves the cache to be tackled.
The catch
Since we are going to compare this new cluster to the live production environment, I can't simply clear the cache of both servers. Clearing the cache of the new cluster would be possible, since it isn't in production yet. However, I am not going to clear the cache of the live production cluster, since it is still being used and this would have a big impact on its performance.
I was wondering if it would be possible to clone/mimic the cache between the two clusters.
I'm also open to an entirely different approach to this matter.
I think you are going about this the wrong way and here is why. I assume the following:
The new cluster's hardware is of the same vendor, quality, etc as the previous
The cores / CPU, RAM, etc is as good or better on the new instance
The instance is of the same version, or an upgraded version of SQL Server. Note, upgrading doesn't mean the queries will perform better in all cases
The storage is the same, or better (SAN / NAS configurations)
The server settings are the same (MAXDOP, etc)
If these aren't true, then I don't see why you are conducting the test anyway, since it wouldn't be comparing apples to apples. That being said, I still don't see how the tests would be equal even if you could mirror the plan cache. You could create a brand-new query to be used for performance testing and run it on both instances to compare their performance (it would use a new plan on each), but here's a big catch... you aren't going to kick all the users off the production instance, so your baseline query will contend for resources there. Unless you have an identical mirror of your production server that no users are using, I don't see how you're going to get an unbiased test.
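If you do go down the baseline-query route, a small timing harness like this sketch at least makes the comparison repeatable. The server names, query, and connection strings are placeholders, and the caveat above about contending with production users still applies:

```python
# Sketch: run the same baseline query N times on each instance and compare medians.
import statistics
import time

import pyodbc

QUERY = "SELECT COUNT(*) FROM dbo.FactSales WHERE SaleDate >= '2017-01-01'"  # hypothetical
SERVERS = {
    "current-cluster": "DRIVER={ODBC Driver 17 for SQL Server};SERVER=old-dwh;DATABASE=DWH;Trusted_Connection=yes",
    "new-cluster":     "DRIVER={ODBC Driver 17 for SQL Server};SERVER=new-dwh;DATABASE=DWH;Trusted_Connection=yes",
}

def time_query(conn_str, runs=10):
    timings = []
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        for _ in range(runs):
            start = time.perf_counter()
            cur.execute(QUERY)
            cur.fetchall()
            timings.append(time.perf_counter() - start)
    return timings

for name, conn_str in SERVERS.items():
    t = time_query(conn_str)
    print(f"{name}: median {statistics.median(t):.3f}s over {len(t)} runs")
```

Using the median over several runs also smooths out the first cold-cache execution, which is exactly the variable you said you can't control on the production side.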
With all that being said, most often you are upgrading to faster, better hardware, so one could feel safe that it will be faster, or at least not slower, assuming equal configurations. Additionally, there are tons of performance-tuning blogs out there from Pinal Dave, Paul White, Brent Ozar, Paul Randal, Aaron Bertrand, etc., ranging from optimal server settings to query tuning. This alone could be a night-and-day difference in performance, along with proper DB maintenance (fixing index fragmentation, fixing queries with hundreds or thousands of plans which only get used once, fixing indexing in general, etc.).

Import script that maintains data integrity w/ zero downtime

Background
I'm writing an import script which is fairly computationally expensive and results in many insert and update database queries. My intention is to store the database on an EBS volume and use EC2's command-line tools to launch a c1.xlarge instance, perform the import (writing to the EBS volume), and self-destruct on completion (to save $).
On instance termination, the EBS volume (which contains all the imported data) is then programmatically attached and mounted to the machine that runs my webserver.
By using this scheme, the webserver machine can continue to respond to HTTP requests without being:
CPU and RAM overloaded.
Serving incomplete data while the import is running.
Wasting resources (being an expensive instance type).
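For what it's worth, here is a hedged sketch of that launch / attach / self-destruct / re-attach flow using boto3 (the modern replacement for the old EC2 command-line tools). The AMI, volume and instance IDs, region, and device names are placeholders:

```python
# Sketch: run the import on a throwaway big instance, then hand the data volume
# back to the webserver once the worker has terminated itself.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

DATA_VOLUME_ID = "vol-0123456789abcdef0"        # EBS volume holding the database (placeholder)
WEBSERVER_INSTANCE_ID = "i-0aaaaaaaaaaaaaaaa"   # placeholder

def launch_import_instance():
    """Launch the worker instance and attach the data volume to it."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI with the import script baked in
        InstanceType="c5.xlarge",          # modern stand-in for c1.xlarge
        MinCount=1, MaxCount=1,
        # "terminate" lets the import script end with `shutdown -h now`
        # to self-destruct and stop billing.
        InstanceInitiatedShutdownBehavior="terminate",
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.attach_volume(VolumeId=DATA_VOLUME_ID, InstanceId=instance_id, Device="/dev/sdf")
    return instance_id

def reattach_to_webserver(worker_instance_id):
    """Once the worker has terminated, move the volume to the webserver."""
    ec2.get_waiter("instance_terminated").wait(InstanceIds=[worker_instance_id])
    ec2.attach_volume(VolumeId=DATA_VOLUME_ID,
                      InstanceId=WEBSERVER_INSTANCE_ID,
                      Device="/dev/sdf")
    # Mounting the filesystem and pointing the DB at it still happens on the host itself.
```

Note that the volume must live in the same Availability Zone as both instances, and the gap between detaching from the worker and mounting on the webserver is exactly the downtime window discussed in the answers below.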
Question
Is this a sound approach? Is it essentially how companies that manage large amounts of data are able to do so without downtime, whilst keeping up-to-date? Good books or blog posts on the subject?
I would posit that if you have a single webserver and are concerned about the cost of a c1.xlarge for a short period of time, you can tolerate slightly more than zero downtime. The setup you're describing sounds fine; just realize that once you turn off the DB on the c1.xlarge, your downtime clock starts ticking until the DB is up and your app is reconfigured to point to the local instance of the DB. Figure that'll be a few minutes, so plan accordingly - perhaps with a maintenance page, or if your app can work off cached data / in read-only mode, do that.
Or if you're using a supported DB, just use RDS. That'll probably get you a lot closer to zero downtime with less work. You'll pay something for it, but the multi-AZ failover is worth the price of admission.
And no, this is not how a large company with lots of data would do it. They'd most likely use replication, ensure the data is on a few different servers, and then failover the master. That's also what you get with RDS.

Improving database record retrieval throughput with appengine

Using App Engine with Python and the HRD, retrieving records sequentially (via an indexed field which is an incrementing integer timestamp), we get 15,000 records returned in 30-45 seconds. (Batching and limiting are used.) I did experiment with running queries on two instances in parallel but still achieved the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.
Try splitting the records across different entity groups. That might force them onto different physical servers. Read the entity groups in parallel from multiple threads or instances.
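Roughly, the idea looks like this with the Python ndb client. The LogRecord model, shard count, and field names are assumptions, and the records would need to have been written with the shard parent for the ancestor queries to find them:

```python
# Sketch: shard records into N entity groups and fetch the shards in parallel.
import threading

from google.appengine.ext import ndb

NUM_SHARDS = 8

class LogRecord(ndb.Model):
    ts = ndb.IntegerProperty()      # incrementing integer timestamp
    payload = ndb.TextProperty()

def shard_parent(i):
    """Parent key that defines entity group number i (use as parent= when writing)."""
    return ndb.Key("LogShard", "shard-%d" % i)

def fetch_shard(i, since_ts, out):
    """Ancestor query: reads only the records that live in entity group i."""
    q = (LogRecord.query(ancestor=shard_parent(i))
                  .filter(LogRecord.ts >= since_ts)
                  .order(LogRecord.ts))
    out[i] = q.fetch(batch_size=500)

def fetch_all(since_ts=0):
    results = {}
    threads = [threading.Thread(target=fetch_shard, args=(i, since_ts, results))
               for i in range(NUM_SHARDS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge the per-shard results back into overall timestamp order.
    merged = [r for shard in results.values() for r in shard]
    return sorted(merged, key=lambda r: r.ts)
```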
Using a cache might not work well for large tables.
Maybe you can cache your records, e.g. using Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application's access. I don't think the App Engine Datastore is designed for speed so much as for scalability. Memcache, however, is.
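A minimal cache-aside sketch with the App Engine memcache API, where the LogRecord model and key scheme are assumptions for illustration:

```python
# Sketch: serve repeated reads from memcache and hit the Datastore only on a miss.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class LogRecord(ndb.Model):
    ts = ndb.IntegerProperty()
    payload = ndb.TextProperty()

def get_recent_records(since_ts, limit=1000):
    cache_key = "records:%d:%d" % (since_ts, limit)
    records = memcache.get(cache_key)
    if records is None:
        records = (LogRecord.query(LogRecord.ts >= since_ts)
                            .order(LogRecord.ts)
                            .fetch(limit))
        # Cache for 60 seconds; stale-but-fast may be fine for log downloads.
        memcache.set(cache_key, records, time=60)
    return records
```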
BTW, if you are concerned about the performance GAE gives you for what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have active community support. I'm using CapeDwarf in my local environment; it is still in beta, but it works.
Move to any of the in-memory databases. If you have Oracle Database, using TimesTen will improve the throughput multifold.

What are the pros and cons of a distributed second level cache versus focusing on tuning database

We have a website that uses NHibernate and a second-level cache. We are having a debate, as one person wants to turn off the second-level cache because we are moving to a multi-webserver environment (with a load balancer in front).
One argument is to get rid of the second-level cache and focus on optimizing and tuning the DB; the other argument is to roll out a distributed cache as the second-level cache.
I am curious to hear folks' pros and cons of DB tuning versus a distributed cache (factoring in effort involved, cost, complexity, etc.).
In the case of a load-balancing scenario, you have to use a distributed cache provider to get the best performance and consistency; that has nothing to do with optimizing your database. In any scenario you should optimize your database.
Both. You should have a distributed cache to prevent unnecessary calls to the database and a tuned database so the initial calls are returned quickly. As an example, Facebook required a significant amount of caching to scale, but I'm sure it wouldn't do much good if the initial queries took 10 minutes. :)
Two words: measure it.
Since you already have the cache implemented, you can probably measure what the impact would be of turning it off for benchmark purposes.
I would think that a multi-web server and a distributed second level cache can -and probably should- coexist.
First of all, taking memcached as an example: it supports distributed object storage, so if you're not using that, you could switch to it. It works.
Secondly, I'm guessing that you're introducing the webserver farm to respond to increasing web requests, which will in turn mean increasing requests for data. If you kill your caching, it won't matter how much you optimize your database: you're going to thrash it with queries. You may improve query execution time, but you'll still be waiting on the database to return your data on every request.
This is especially true for the case where web-node 1 requests dataset A and web-node 2 also requests dataset A --> you are going to run the same query twice, while with second-level caching you only run it once.
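To illustrate that point in a language-agnostic way, here is a cache-aside sketch in Python against memcached; in the real app this is what the NHibernate second-level cache provider does for you, and the host, key scheme, and load function are assumptions:

```python
# Sketch: with a shared cache, only the first web node pays for the query;
# every other node gets a cache hit for the same dataset.
import json

from pymemcache.client.base import Client

cache = Client(("memcached-host", 11211))   # shared by all web nodes

def load_dataset_a_from_db():
    # Placeholder for the expensive SQL query both web nodes would otherwise run.
    return [{"id": 1, "name": "example"}]

def get_dataset_a():
    cached = cache.get("dataset:A")
    if cached is not None:
        return json.loads(cached)           # second node: cache hit, no DB round-trip
    data = load_dataset_a_from_db()         # first node: runs the query once
    cache.set("dataset:A", json.dumps(data), expire=300)
    return data
```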
So my recommendation is:
Don't kill your second level cache. You have already spent resources to implement it and by disabling it you are NOT going to improve your application's performance. Even a single node of memcached is going to be faster than having none at all.
Do optimize your database operations. This means both on the database side (indexes, views, stored procedures, functions, perhaps a cluster with read-only and write-only nodes) and on the application side (optimize your queries, profile lazy/eager loading, don't fetch data you don't need, combine multiple queries into single round-trips via Future, MultiQuery, MultiCriteria).
Do optimize your second-level cache implementation. There are datasets that have an infinite expiration date, and thus you query the db for them only once, and there are datasets that have short expiration dates, and thus probably expensive queries are executed more frequently. By optimizing your queries and your db you are going to improve the performance for the queries but the second-level cache is going to save your skin on peak load where short-expiration date datasets will be fetched by the cache more frequently.
If using textual queries is an everyday operation, use the database's full-text capabilities or, even better, use an independent service like Lucene.NET (which can be integrated with NHibernate via NHibernate.Search).
That's a very difficult topic. In either case you need proficiency. Either a very proficient DBA, or a very proficient NHibernate / Cache administrator.
Personally, I prefer having full control over my SQL and tuning the database. Since you only have multiple webservers (and not necessarily multiple database instances), you might be better off that way, too. Modern databases have very efficient caches, so usually you create more harm with badly configured second-level caches in the application than by just letting the database cache SQL statements, cursors, data, buffers, etc. I have seen this work very well for around 15 WebLogic servers and only one database with lots of memory.
Since you do have NHibernate already, though, moving away from it, back to SQL (maybe with LINQ?) might be quite a costly task, that's not worth the effort.
We use NHibernate's 2nd level cache in our multi-server environment using Microsoft AppFabric distributed cache framework (NHibernate Velocity Provider) with great success.
Having said that, using 2nd level cache requires deeper understanding of the framework to prevent unexpected results. In addition, before using distributed caches, it is important to measure their overhead.
So my answer is basically - before using 2nd-level cache, you should really test and see whether it is really needed.
