CouchDB is great, and I like its P2P replication functionality, but it's a bit large (because we have to install Erlang) and slow when used in a desktop application.
As I tested on an Intel Core Duo CPU:
12 seconds to load 10,000 docs
10 seconds to insert 10,000 docs, plus another 20 seconds to update the view, so 30 seconds in total
Is there any NoSQL implementation with the same P2P replication functionality, but with a footprint as small as SQLite's and reasonably good speed (say, 1 second to load 10,000 docs)?
Have you tried using Hovercraft and/or the Erlang view server? I had a similar problem and found that staying within the Erlang VM (thereby avoiding excursions to SpiderMonkey) gave me the boost I needed. I did 3 things...
Boosting Queries: Porting your map/reduce functions from JavaScript to "native" Erlang usually gives a tremendous performance boost when querying CouchDB (http://wiki.apache.org/couchdb/EnableErlangViews). Also, managing views is easier because you can call external libs or your own compiled modules (just add them to your ebin dir), reducing the number of uploads you need to do during development.
Boosting Inserts: Using Hovercraft for inserts gives up to a 100x increase in performance (https://github.com/jchris/hovercraft). This was mentioned in the CouchDB book (http://guide.couchdb.org/draft/performance.html). A small HTTP-level sketch of the same idea follows this list.
Pre-Run Views: The last thing you can do for desktop apps is run your views during application startup (say, while the splash screen is showing). The first time a view is run is always the slowest; subsequent runs are faster.
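Hovercraft itself is Erlang-only, but the insert and view-warming ideas above can also be illustrated over CouchDB's plain HTTP API. This is just a rough sketch: it assumes a local CouchDB at http://localhost:5984, a database called "mydb", and a hypothetical design document "app" with a view "by_type"; adjust everything to your own setup.

```python
# Sketch of batched inserts and view warm-up over CouchDB's HTTP API.
# Uses the third-party "requests" library; URLs and names are placeholders.
import requests

COUCH = "http://localhost:5984"
DB = "mydb"

def bulk_insert(docs, batch_size=1000):
    """Insert docs in batches via _bulk_docs instead of one PUT per doc."""
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        resp = requests.post("%s/%s/_bulk_docs" % (COUCH, DB),
                             json={"docs": batch})
        resp.raise_for_status()

def warm_view(design_doc, view_name):
    """Hit the view once (e.g. behind the splash screen) so the index is
    built before the user's first real query."""
    url = "%s/%s/_design/%s/_view/%s" % (COUCH, DB, design_doc, view_name)
    requests.get(url, params={"limit": 0}).raise_for_status()

if __name__ == "__main__":
    bulk_insert([{"type": "test", "n": n} for n in range(10000)])
    warm_view("app", "by_type")   # hypothetical design doc / view names
```

Batching through _bulk_docs avoids one HTTP round trip per document, and hitting the view once at startup pays the index-build cost before the user's first real query.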
These helped me a lot.
Edmond -
Unfortunately the question doesn't offer enough details about your app requirements, so it's kind of difficult to offer advice. Anyway, I'm not aware of any other storage solution offering similar or more advanced P2P replication.
A couple of questions/comments about your requirements:
What kind of desktop app requires 10,000 inserts per second?
When you say size, what exactly are you referring to?
You might want to take a look at:
Redis
RavenDB
Also check some of the other NoSQL solutions listed on http://nosql.mypopescu.com against your app requirements.
So I was recently hired by a big department of a Fortune 50 company, straight out of college. I'll be supporting a brand new ASP.NET MVC app - over a million lines of code written by contractors over 4 years. The system works great with up to 3 or 4 simultaneous requests, but becomes very slow with more. It's supposed to go live in 2 weeks ... I'm looking for practical advice on how to drastically improve the scalability.
The advice I was given in Uni is to always run a profiler first. I've already secured a sizeable tools budget with my manager, so price wouldn't be a problem. What is a good or even the best profiler for ASP.NET MVC?
I'm also looking at adding caching. There is currently no second-level or query cache configured for NHibernate. My current thinking is to use Redis for that purpose. I'm also looking at output caching, but unfortunately the majority of the users will log in to the site. Is there a way to still cache parts of the pages served by MVC?
Do you have any monitoring or instrumentation set up for the application? If not, I would highly recommend starting there. I've been using New Relic for a few years with ASP.NET apps and have been very happy with it.
Right off the bat you get a nice graph of request response times broken down into 3 kinds of tasks that contribute to the response time:
.NET CLR - Time spent running .NET code
Database - Time spent waiting on SQL requests
Request Queue - Time spent waiting for application workers to become available
It also breaks down performance by MVC action so you can see which ones are the slowest. You also get a breakdown of performance per database query. I've used this many times to detect procedures that were way too slow for heavy production loads.
If you want to, you can have New Relic add some unobtrusive JavaScript to your pages that allows you to instrument browser load times. This helps you figure out things like "my users outside North America spend on average 500ms loading images. I need to move my images to a CDN!"
I would highly recommend you use some instrumentation software like this. It will definitely get you pointed in the right direction and help you keep your app available and healthy.
SQL Server Profiler is a handy tool to watch how apps communicate with your database and to debug odd behaviour. It's not a long-term solution for performance instrumentation, given that it puts a load on your server and the results require quite a bit of laborious processing and digestion to paint a clear picture for you.
Random thought: check out your application pool configuration and keep an eye out in the event log for too many recycling events. When an application pool recycles, it takes a long time to become responsive again. It's just one of those things that can kill performance, and you can rip your hair out trying to track it down. Improper recycling settings bit me recently, so that's why I mention it.
For NHibernate analysis (session queries, caching, execution time) you could use the HibernatingRhinos NHibernate Profiler. It's developed by the people who developed NHibernate, so you know it will work really well with it.
Here is the URL for it:
http://hibernatingrhinos.com/products/nhprof
You could give it a try and decide if it helps you or not.
I have an idea for a web application and I am currently researching different platforms. I am really interested in Google App Engine, but it looks like it works pretty well for certain application types while being less suitable for others (there are horror stories as well as success stories, e.g. Goodbye Google App Engine vs. Why we are really happy with Google App Engine).
There is also a similar negative story in this thread from a year ago, concluding that GAE was not ready to be a commercial production platform: GAE as Production Platform. There are also other threads from 2009 talking about a data select limit (1,000 rows) that has since been lifted.
My app will essentially perform some mathematical analysis based on data pulled from external data feeds (possibly a substantial amount of data). The pull would be real-time only the first time data is downloaded for a specific item; from that point on it would be stored in and retrieved from the local database. There will also be some additional external data pulls at scheduled intervals.
Based on this brief description, should I even bother starting on GAE? In general, what are the rules of thumb for deciding whether GAE is suitable for the problem at hand? Also, what are some good examples of apps in production that use GAE? It looks like the GAE App Gallery is not around anymore, but I would definitely appreciate any Web 2.0 app examples running on App Engine.
In your specific case I would double check these factors:
a. Is the mathematical analysis a long-running, CPU-intensive job?
GAE is not designed for long-running, CPU-intensive computational jobs; they lead to a high billing cost and force you to design your application around several GAE limitations (10 minutes max per job, limited soft memory, CPU quota, etc.). A common way of working around this is sketched after point b. below.
b. Are you planning to retrieve external data using a mainstream API (Twitter, Yahoo, Facebook)?
Your application shares the same pool of IPs with other applications; if the API you want to adopt does not allow authenticated requests, your application will suffer hiccups caused by throttling/quota-limit errors. I faced this problem here.
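To give an idea of what working around the limits in point a. usually looks like on the classic Python runtime: do a bounded slice of work per task and re-enqueue the rest, so no single task approaches the 10-minute ceiling. This is only a sketch; load_items() and analyze_item() are hypothetical stand-ins for your own datastore query and analysis code.

```python
# Rough sketch of chunked background work on classic (Python) App Engine.
# Requires the "deferred" builtin to be enabled in app.yaml.
from google.appengine.ext import deferred

CHUNK_SIZE = 500

def load_items(dataset_id, offset, limit):
    # Hypothetical placeholder for your datastore query.
    return []

def analyze_item(item):
    # Hypothetical placeholder for the actual mathematical analysis.
    pass

def process_dataset(dataset_id, offset=0):
    items = load_items(dataset_id, offset, CHUNK_SIZE)
    for item in items:
        analyze_item(item)
    if len(items) == CHUNK_SIZE:
        # More work left: hand the remainder to a fresh task.
        deferred.defer(process_dataset, dataset_id, offset + CHUNK_SIZE)

# Kick it off from a request handler or a cron job:
# deferred.defer(process_dataset, some_dataset_id)
```

Each task then stays short and cheap, at the cost of having to make the analysis resumable.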
App Engine should work fine for your application. It's generally designed to serve, and to scale, sites that serve mostly user-facing traffic. It's not suitable for applications such as video transcoding, which rely heavily on backend processing, or things that have to shell out to native code, such as 3D graphics, and so on.
It depends on what type of mathematical analysis you are doing. If your application is heavy on I/O, I would give it some pause. On GAE you're kind of limited in your I/O options. You basically have the following:
RAM: I can't recall exactly, but GAE imposes a hard limit of around 200MB of RAM.
Datastore: You get plenty of space here, but it's slow compared to a cached local file system.
Memcache: Faster than the datastore, but not nearly as fast as a cached disk. And worse, it's a cache, so there's no guarantee it won't get wiped out.
External sources: These include calling out to external web pages. Lots of flexibility, but very slow.
In sum, I would perhaps look at other options if you're doing heavy I/O on a medium-sized dataset (more than 20MB but roughly under 2GB). These are probably non-issues for 90% of web apps, although you should be aware of them.
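To make the memcache-versus-datastore point concrete, here is a tiny read-through cache sketch for the classic Python runtime; fetch_from_datastore() is a hypothetical stand-in for whatever query you actually run:

```python
# Read-through cache sketch: memcache absorbs repeat reads, datastore is the
# source of truth. Classic App Engine Python runtime assumed.
from google.appengine.api import memcache

def fetch_from_datastore(key):
    # Hypothetical placeholder for your real datastore query.
    return None

def get_dataset(key, ttl_seconds=300):
    data = memcache.get(key)
    if data is not None:
        return data                            # fast path: served from memcache
    data = fetch_from_datastore(key)           # slow path: hits the datastore
    memcache.set(key, data, time=ttl_seconds)  # best effort; entry may be evicted
    return data
```

Memcache absorbs repeat reads nicely, but since entries can be evicted at any moment the datastore path always has to work on its own.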
All the negatives aside, working on GAE is a joyous experience. You spend more time programming and less time configuring. And it's really cheap.
I'm working on an Apache Solr project (distributed in a cloud environment, on Amazon EC2 instances).
I've noticed Solr does an excellent job of caching the results.
When I execute the same queries again, the response reports a Solr QTime of 0 or 1 millisecond.
I want to stress test the Solr system. For that I have a limited list of queries I can use (50,000 unique queries). The problem is that all of the queries end up cached!
When I stress test, after 5 minutes or so all my queries have been sent to Solr and executed.
This makes the system sweat under the heavy load :) (which was the purpose).
But then, as I execute the same query set again, QTime is almost zero!
--> Solr has an easy time and isn't stressed.
My question:
How can you turn off ALL Solr caches (both Solr and Lucene caches)?
Or how can you limit the caches?
I've tried turning off all of Solr's internal caches, but the caching still persists
(queryResultCache and FieldCache).
Note: the config mentions that Lucene manages an internal cache of its own; maybe this cache is the problem?
It's just weird that all 50,000 queries can be stored in the cache, out of the box.
You can comment out the filterCache, queryResultCache and documentCache in your configuration. Lucene's FieldCache cannot be disabled.
Although it doesn't really make any sense to do so, even for benchmarking. Would you also disable disk caching in your operating system? CPU caches (all three levels)? The internal cache of each hard disk?
Caches are part of the system, if you disable them you won't accurately simulate what happens in production, thus rendering the benchmark useless.
Turning off caches is an excellent idea, at least those that are application-specific. The benchmark in this case is intended, I gather, to find the response time and cost of a query that has not been seen before, as opposed to queries that stay popular within a cache-expiry window.
It sounds like you want metrics that tell you how the search system performs, not how the query cache performs.
The previous answers are really out of left field in suggesting that all benchmarks should measure the same thing, namely their authors' own definition of "real-life" performance. That is not how engineering works.
As for the remark about "disk caches": there is no separate disk cache in Linux, only the page cache. Whether a page is persisted on disk, created and destroyed in memory, or pre-allocated by a smart file system, they are all pages.
There is a benefit to benchmarking with caches enabled... if you bother to measure the cache performance metrics.
BTW, between "-server" and "-XX:CompileThreshold" you want to make sure your first large set of queries is either random enough, or specifically chosen to exercise as many code paths in Solr/Lucene as you can, so that the JIT is both active and somewhat settled.
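To put numbers on "how the search system performs", here is a rough measurement sketch along those lines. It assumes a Solr core reachable at http://localhost:8983/solr/mycore and a queries.txt file with one query per line; both are placeholders, and the third-party requests library is used.

```python
# Replay the query set in random order and record both wall-clock latency and
# Solr's own QTime, so a cold pass can be compared against a warmed-up pass.
import random
import time
import requests

SOLR = "http://localhost:8983/solr/mycore/select"   # placeholder core URL

def run_pass(queries):
    qtimes, wall = [], []
    for q in queries:
        start = time.time()
        resp = requests.get(SOLR, params={"q": q, "wt": "json", "rows": 10})
        wall.append((time.time() - start) * 1000)
        qtimes.append(resp.json()["responseHeader"]["QTime"])
    return qtimes, wall

with open("queries.txt") as f:
    queries = [line.strip() for line in f if line.strip()]
random.shuffle(queries)

cold_qtimes, cold_wall = run_pass(queries)   # first pass: caches cold
warm_qtimes, warm_wall = run_pass(queries)   # second pass: caches warm
print("cold avg QTime %.1f ms, warm avg QTime %.1f ms"
      % (sum(cold_qtimes) / len(cold_qtimes),
         sum(warm_qtimes) / len(warm_qtimes)))
```

Comparing the cold pass against the warm pass tells you both the uncached query cost and how much the caches actually buy you, which is usually more useful than switching them off.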
Here's the requirement at a very high level.
We are going to distribute desktop agents (or browser plugins) to collect certain information from tons of users (in thousands or possibly millions down the road).
These agents collect data and periodically upload it to a server app.
The server app will allow for analyzing collected data (filter, sort etc based on 4-5 attributes) and summarize in form of charts etc.
We should also be able to export some of the collected data (CSV or PDF).
We are looking for a platform to host the server app. GAE seems attractive because of its low administrative cost and scalability (as the user base increases, the platform will handle the scale... hopefully!).
Is GAE a viable option for us?
One important consideration is that the volume of uploads from the agents can sometimes exceed 50MB per upload cycle. We will also have users in places where Internet connections can be very slow. Apparently GAE has a limit on how long a request can last, and the upload volume may cause the request (transferring data from an agent to the server) to take longer than 30 seconds. How would one handle such a situation?
Thanks!
The time of the upload is not considered part of the script execution time, so no worries there.
Google App Engine is very good at performing a vast number of small jobs, but not so good at complex, long-running background jobs (because of the 30-second limit plus the even smaller database connection time limit). So GAE would probably be a very good platform for GATHERING the data, but not for actually ANALYZING it. You would probably want to separate the two.
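On the upload-size/timeout concern specifically, the route generally used on the classic Python runtime was to hand the transfer itself to the Blobstore, so the request limit only applies to a small callback rather than to the upload, and then to queue the heavy processing separately. A hedged sketch follows; the handler names, routes, and the 'payload' form field are illustrative, not prescribed.

```python
# Sketch: agent asks /upload-url for a one-shot Blobstore upload URL, POSTs its
# data there, and App Engine calls /upload-done once the blob is stored.
import webapp2
from google.appengine.ext import blobstore, deferred
from google.appengine.ext.webapp import blobstore_handlers

def process_upload(blob_key):
    # Hypothetical: parse the stored blob and write summary rows.
    pass

class GetUploadUrl(webapp2.RequestHandler):
    def get(self):
        self.response.write(blobstore.create_upload_url('/upload-done'))

class UploadDone(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        blob_info = self.get_uploads('payload')[0]   # 'payload' = form field name
        # Defer the analysis so this callback returns well inside the limit.
        deferred.defer(process_upload, blob_info.key())
        self.response.write('ok')

app = webapp2.WSGIApplication([
    ('/upload-url', GetUploadUrl),
    ('/upload-done', UploadDone),
])
```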
We went ahead and implemented the first version on GAE anyway. The experience has been very much what is described here: http://www.carlosble.com/?p=719
For a proof-of-concept prototype, what we have built so far is acceptable. However, we have decided not to go with GAE (at least in its current shape) for the production version. The pains somewhat outweigh the benefits in our case.
The problems we faced were numerous. Unlike my experience with J2EE stacks, when you run into an issue it is often a dead end, and the workarounds, if you can find one, are very complicated and ugly.
By writing good prototypes you can figure out whether GAE is right for the solution being built; however, the hype is a problem. Many newcomers are going to get overly excited about GAE because of that hype and end up failing badly, because they will choose GAE for all kinds of purposes it is not suitable for.
I'm very new to performance engineering, so I have a very basic question.
I'm working on a client-server system that uses a SQL Server backend. The application is a huge tax-related application that requires performance testing at peak load, meaning there should be something like 10 million tax returns in the system when we run scenarios related to creating and submitting tax returns. A proportional number of users will also need to be created.
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
When I bring up creating a smaller dataset and extrapolating the performance figures, a very common answer I hear is that we need the 10 million records because we cannot tell from a smaller data set how the database or the network will behave.
So how does one plan capacity and test performance on a large enterprise application without creating a peak level of data or running the peak number of scenarios?
Thanks.
Personally, I would throw as much data and traffic at it as you can. Forget what traffic you "think you need to handle". And just see how much traffic you CAN handle and go from there. Knowing the limits of your system is more valuable than simply knowing it can handle 10 million records.
Maybe it does handle 10 million, but at 11 million it dies a horrible death. Or maybe it's well written and will scale to 100 million before it dies. There's a very distinct difference between the two, even though both pass the "10 million test".
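A minimal sketch of that kind of "find the ceiling" run, assuming Python with the requests library and a placeholder endpoint (substitute whatever transaction you are actually testing):

```python
# Ramp concurrency and watch where p95 latency or the failure count takes off.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://test-env.example.com/returns/submit"   # hypothetical endpoint

def timed_request(_):
    start = time.time()
    try:
        requests.get(URL, timeout=30)
    except requests.RequestException:
        return None                       # count as a failure
    return (time.time() - start) * 1000

for concurrency in (10, 50, 100, 250, 500):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, range(concurrency * 10)))
    ok = sorted(r for r in results if r is not None)
    failures = len(results) - len(ok)
    p95 = ok[int(len(ok) * 0.95)] if ok else float("nan")
    print("concurrency %4d -> p95 %.0f ms, %d failures"
          % (concurrency, p95, failures))
```

The point at which latency or the failure count blows up is your real limit, and that is the number to compare against the expected 5,000 users.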
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
Why do you think so?
Of course you can (and should) test with limited amounts of data, but you also really, really need to test with a realistic load, which means testing with the amount (and type) of data that you will use in production.
This is just a special case of a general rule: for system or integration testing, you need to test in a scenario that is as close as possible to production; ideally you just copy/clone a live production system, data, config and all, and use that for testing. That is actually what we do (if we technically can and the client agrees). We just run a few SQL scripts to randomize personal data in the test data set, to avoid privacy concerns.
There are always issues that crop up because production data is somehow different from what you tested on, and this is the only way to prevent (or at least limit) these problems.
I've planned and implemented reporting and imports, and they invariably break or misbehave the first time they're exposed to real data, because there are always special cases or scaling problems you didn't expect. You want that breakage to happen during development, not in production :-).
In short:
Bite the bullet, and (after having done all the tests with "toy data"), get a realistic dataset to test on. If you don't have the hardware to handle that, then you don't have the right hardware for your tests :-).
I would take a look at Redgate's SQL Data Generator. It does a good job of generating representative data.
Have a peek at The Art of Application Performance Testing by Ian Molyneaux (O'Reilly, 2009).
Your test data is ideally a realistic variety of records. But for first approximations you could have just a few unique records, and duplicate them until you have the desired size. Then use ApacheBench to roughly approximate the traffic.
To help generate data, look at Ruby's faker gem and Perl's Data::Faker. I have had good luck with them in generating large data sets for testing. Redgate's SQL Data Generator is good too.
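The same idea, sketched with the Python "faker" package (an analogue of the Ruby/Perl tools above): generate a CSV of plausible-looking records and bulk-load it afterwards, e.g. with BULK INSERT or bcp. The column set here is made up purely for illustration; substitute your real schema.

```python
# Generate a CSV of fake tax-return-like rows for bulk loading into SQL Server.
# Requires the third-party "faker" package; columns are illustrative only.
import csv
import random
from faker import Faker

fake = Faker("en_US")
ROWS = 1000000   # scale up over several runs if disk or time is a concern

with open("tax_returns.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["taxpayer_name", "ssn", "address", "filing_date", "refund"])
    for _ in range(ROWS):
        writer.writerow([
            fake.name(),
            fake.ssn(),
            fake.address().replace("\n", ", "),
            fake.date_between(start_date="-1y", end_date="today").isoformat(),
            round(random.uniform(0, 20000), 2),
        ])
```

A million rows of fake data is cheap to produce this way, and re-running it at different sizes lets you see how performance scales between your "toy data" and the 10 million-record target.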