Slow local queries

Slow local queries - database

Using AWS DocumentDB, I've deployed a test database cluster in Germany. When I run a test query on it from an EC2 instance in Germany, it takes less than 2 seconds. When I query it from my country (in the Middle East), it takes more than a minute.
This is the simple test I run (it just goes through the whole collection):
// keep the cursor and the current document in separate variables
var cursor = db.dev.find();
while (cursor.hasNext()) { var last_doc = cursor.next(); }
The thing is, we also have an RDS (MySQL) DB in AWS, also in Germany, and a query of the same size from my country to RDS takes less than 2 seconds.
I tried viewing the logs so I followed this document but I can't seem to find any logs from docdb in CloudWatch. These are my cluster parameters. I also tried opening a ticket with AWS but apparently our basic subscription doesn't allow creating tickets.
Does anyone happen to have a suggestion on how to tackle this? What should I do or look for?
Thanks ahead!
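One thing worth checking, as an assumption rather than a confirmed diagnosis: a cursor that is iterated document by document fetches results in batches, and each batch costs one network round trip. The sketch below is a back-of-envelope model of that effect. The document count, batch size, and round-trip times are hypothetical numbers chosen for illustration, not measurements from this cluster.

```python
import math


def estimated_transfer_seconds(total_docs, batch_size, round_trip_ms):
    """Rough model: the cursor pays one network round trip per batch."""
    batches = math.ceil(total_docs / batch_size)
    return batches * round_trip_ms / 1000.0


# Hypothetical numbers for illustration only.
same_region = estimated_transfer_seconds(100_000, 100, 1)    # about 1 ms round trip
cross_region = estimated_transfer_seconds(100_000, 100, 80)  # about 80 ms round trip
print(same_region, cross_region)
```

If this model applies, measuring the actual round-trip time to the cluster and trying a larger batch size on the cursor would show whether the number of round trips is what dominates.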

Related

MongoDB slow in fetching from database

I'm using MongoDB in combination with Meteor + React and the result fetching takes like 5 sec even on a small database.
This happens only on the production server (AWS) and it works instantly on the local machine.
In order to fetch the results, I'm using the following code.
return{ cand : Job.find({thejob:props.id}).fetch() };
and to see if the array has been loaded, I use the following code on the frontend side.
if(!this.props.cand){return(<div>Loading....</div>)}
but the Loading.... takes like 5 sec on the server always. The database is a small one with less than 1000 records.

I have had similar experiences. Performance is usually good when you run the queries on the local machine. If a query is slow on a platform like AWS but not locally, it's most likely due to network latency.

I suspect there isn't an index on the thejob field.
First check if there is an index on thejob field
db.job.getIndexes()
If there is none, simply create one
db.job.createIndex({thejob:1})
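To see what an index changes, here is a small sketch that uses SQLite from the Python standard library as a stand-in for MongoDB (an assumption made only so the example runs without a server). The table name job and column thejob mirror the names above, the index name idx_thejob and the sample data are made up, and the query plan text is SQLite's, not MongoDB's.

```python
import sqlite3

# In-memory SQLite database standing in for the real collection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, thejob TEXT)")
conn.executemany(
    "INSERT INTO job (thejob) VALUES (?)",
    [("job-%d" % (i % 50),) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's plan description for a lookup by thejob."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM job WHERE thejob = ?", ("job-7",)
    ).fetchall()
    return " ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_thejob ON job (thejob)")
plan_after = query_plan(conn)   # with the index: an index search

print(plan_before)
print(plan_after)
```

In the mongo shell, the comparable check would be to run the find with explain() before and after creating the index.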

NDB query().iter() of 1000<n<1500 entities is wigging out

I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?

This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
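The chunked approach could look roughly like the sketch below. It does not call the datastore: fetch_page here is a hypothetical stand-in that slices a plain list, and the page size of 500 is an arbitrary value picked to stay under the threshold described above. In a real script the stand-in would be replaced by a limited query.

```python
def fetch_page(items, page_size, start=0):
    """Hypothetical stand-in for a limited query.

    Returns one page, the offset of the next page, and whether more remain.
    """
    page = items[start:start + page_size]
    next_start = start + len(page)
    return page, next_start, next_start < len(items)


def download_in_chunks(items, page_size):
    """Collect everything by repeatedly asking for one bounded page."""
    collected, start, more = [], 0, True
    while more:
        page, start, more = fetch_page(items, page_size, start)
        collected.extend(page)
    return collected


entities = list(range(1200))  # roughly the size of BarModel in the question
downloaded = download_in_chunks(entities, 500)
print(len(downloaded))
```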
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.

How to rollback/tear down/clear the database changes after a system test runs?

I have a test method, using NUnit and Selenium, which opens a browser on our website which is on the Production Server and registers a user and verifies that the registration is successful.
(I know ideally the system tests should run on a separate Test Server rather than production but here they want to test whether the prod system works!)
The problem is how to roll back the database changes made by this test. That is, the state of my database should be the same before and after running the test.
I thought of 3 possible options but none is practical:
1) Writing SQL queries to delete from the actual tables before starting the test (Setup) and after running the test (TearDown). This is my current approach.
The problem with this approach is that I have to know exactly which tables are involved in each system test, and this can quickly become very complex because a test may affect more than one table.
2) Writing transactional Code
This is not an option since the code changes are done by the website, not by the unit test written.
3) Taking a snapshot of the existing database (SQL Server 2008 R2) before each test starts, then restoring the database from that snapshot after the test finishes.
This idea would sound good to me if we could run the tests only on the Staging environment, but the tests have to run on Production and may take about 5 minutes in total, so restoring the snapshot would be a bad idea: any changes made by real users during those 5 minutes would be lost!
Please advise which approach would be the best option to resolve this problem. Is there perhaps a fourth option?
Thanks,

Option 4: never, ever run tests on a production server. It's a recipe for disaster (see the thousands of stories on the internet, funny only if you are not the protagonist, about how this can go horribly wrong). The right thing to do would be to configure the test and production servers in the same way.

There is a fifth option. If the website receives a registration for user "WeAreTestingOutSite", it does everything except actually adding the user to the database.
To be honest, as was said, there are better ways to test if a production site is still in operation than to run bots to register a user to make sure it is working (or operational).

I would recommend a fourth option: introduce a new feature that allows deleting a user. It probably shouldn't be available to the users themselves, only to the system admins (back-office users). That way you can test that a user can be registered and then deleted afterwards, without caring that much about the SQL scripts.
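As a rough sketch of that register-then-delete idea, written in Python rather than NUnit: FakeSite below is a hypothetical in-memory stand-in for the website, and register and admin_delete are made-up names for the registration flow and the proposed admin-only delete feature. The point is only that the cleanup is registered before the test body runs, so it executes even if an assertion fails.

```python
import unittest


class FakeSite:
    """Hypothetical in-memory stand-in for the website under test."""

    def __init__(self):
        self.users = set()

    def register(self, name):
        self.users.add(name)

    def admin_delete(self, name):
        # Stand-in for the proposed admin-only delete feature.
        self.users.discard(name)


class RegistrationTest(unittest.TestCase):
    site = FakeSite()

    def setUp(self):
        self.name = "system-test-user"
        # Cleanup runs after the test whether it passes or fails.
        self.addCleanup(self.site.admin_delete, self.name)

    def test_registration(self):
        self.site.register(self.name)
        self.assertIn(self.name, self.site.users)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegistrationTest)
result = unittest.TextTestRunner().run(suite)
```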

Drupal website blocked because of many connection errors - website goes offline

From time to time, the number of database connections from our Drupal 6.20 system to our MySQL database reaches 100-150, and after a while the website goes offline. The error message when trying to connect to MySQL manually is "blocked because of many connection errors. Unblock with 'mysqladmin flush-hosts'". Since the database is hosted on Amazon RDS, I don't have permission to issue this command, but I can reboot the database, and once it has rebooted the website works normally again. Until the next time.
Drupal reports multiple errors prior to going offline, of two types:
Duplicate entry '279890-0-all' for key 'PRIMARY' query: node_access_write_grants /* Guest : node_access_write_grants */ INSERT INTO node_access (nid, realm, gid, grant_view, grant_update, grant_delete) VALUES (279890, 'all', 0, 1, 0, 0) in /var/www/quadplex/drupal-6.20/modules/node/node.module on line 2267.
Lock wait timeout exceeded; try restarting transaction query: content_write_record /* Guest : content_write_record */ UPDATE content_field_rating SET vid = 503621, nid = 503621, field_rating_value = 1212 WHERE vid = 503621 in /var/www/quadplex/drupal-6.20/sites/all/modules/cck/content.module on line 1213.
The nids in these two queries are always the same and refer to two nodes that are frequently automatically updated by a custom module. I can track down a correlation between these errors and unusually many web requests in the Apache logs. I would understand that the website would become slower because of this. But:
Why do these errors occur, and how can they be solved? It seems to me it's to do with several web requests trying to update the same node at the same time. But surely Drupal should deal with this by locking the tables etc? Or should I deal with it in some special way?
Despite the higher web load, why does the database lock up completely and need to be rebooted? Wouldn't it be better if the website still had access to MySQL so that, once the load drops, it can serve pages again? Is there some setting for this?
Thank you!

This may be solved by checking one or all of these three things:
Are you out of disk space? From ssh, run df -h and make sure you still have disk space.
Are the tables damaged? Repair the tables in phpMyAdmin, or CLI instructions here: http://dev.mysql.com/doc/refman/5.1/en/repair-table.html
Have you performance-tuned your MySQL with an /etc/my.cnf? See this for more ideas: http://drupal.org/node/51263
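Separately from the checks above, the "Lock wait timeout exceeded; try restarting transaction" message quoted in the question itself suggests retrying the write. A minimal sketch of that idea, in Python rather than Drupal's PHP: LockWaitTimeout is a hypothetical stand-in for whatever error the database driver raises, flaky_update simulates an update that fails twice, and the attempt count and delays are arbitrary.

```python
import time


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for the driver's lock-timeout error."""


def run_with_retry(transaction, attempts=3, base_delay=0.01):
    """Run transaction(), retrying with exponential backoff on lock timeouts."""
    for attempt in range(attempts):
        try:
            return transaction()
        except LockWaitTimeout:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))


calls = {"n": 0}


def flaky_update():
    """Simulated update that hits a lock timeout on its first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockWaitTimeout()
    return "updated"


print(run_with_retry(flaky_update))
```

Retrying only eases the symptom. If a custom module updates the same two nodes on many concurrent requests, reducing how often those writes happen is likely to matter more.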

How can I find why some classic asp pages randomly take a real long time to execute?

I'm working on a rather large classic asp / SQL Server application.
A new version was rolled out a few months ago with a lot of new features, and I must have a very nasty bug somewhere: some very basic pages randomly take a very long time to execute.
A few clues :
It isn't the database : when I run the query profiler, it doesn't detect any long running query
When I launch IIS Diagnostic tools, reqviewer shows that the request is in state "processing"
This can happen on ANY page
I can't reproduce it easily, it's completely random.
To give an idea of "a very long time": this morning I had a page take more than 5 minutes to execute, when it normally should be returned to the client in less than 100 ms.
The application can handle rather large upload and download of files (up to 2 gb in size). This is also handled with a classic asp script, using SoftArtisan FileUp. Don't think it can cause the problem though, we've had these uploads for quite a while now.
I've had the problem on two separate servers (in two separate locations, with different sets of data). One is running the application with good ol' SQL Server 2000 and the other runs SQL Server 2005. The web server is IIS 6 in both cases.
Any idea what the problem is, or how to track down that kind of problem?
Thanks.
Sebastien
Edit :
The problem came from memory fragmentation. Some asp pages were used to download files from the server. File sizes could go from a few kb to more than 2 gb. These variations in size induced memory fragmentation. The asp pages could also take quite some time to execute (the time for the user to download the pages minus what is put in cache at IIS's level), which is not really standard for server pages that should execute quickly.
This is what I did to improve things :
Put all the download logic in a single asp page with session turned off
That allowed me to put that asp page in a specific pool that could be recycled every so often (downloads would no longer disturb the rest of the application)
Turn on LFH (Low-Fragmentation Heap), which is not enabled by default on Windows 2003, in order to reduce memory fragmentation
References for LFH :
http://msdn.microsoft.com/en-us/library/aa366750(v=vs.85).aspx
Link (there is a dll there that you can use to turn on LFH, but the article is in French. You'll have to learn our beautiful language now!)

I noticed the same thing on a classic ASP + ajax application that I worked on. Using Timer, I timed the page load at 153 milliseconds, but the Firebug waterfall chart randomly reports 3.5 seconds. The Timer output is in the response, and the waterfall chart claims that Firefox is waiting for a response from the server. Because the waterfall chart also shows the response, I can compare it to the timer, and there's a huge discrepancy every so often.

Can you establish whether this is a problem for all pages or a common subset of pages?
If it is a subset, examine what these pages have in common. For example, do they all use a specific COM dll that other pages don't?
Does this problem affect multiple clients or just a few?
In other words, is there an issue with a specific browser or OS version?
Is this public or intranet?
Can you reproduce the problem from a client you own?

Is there any chance there are some full-text search queries going on SQL Server?
Because if so, and if SQL Server has no access to the internet, it may cause a 45-second delay every few hours or so when it tries to check the certificates (though this does not apply to SQL Server 2000).
For a detailed explanation of what I'm referring to, read this.

Are any other apps running on your web server? If so, is your problematic application in the same app pool as any of them? If so, try creating a dedicated app pool for it. Maybe one of the other apps is having a problem and is adversely affecting yours.

One thing to watch out for is if you have server side debugging turned on in IIS, the web server will run in single threaded mode.
So if you try to load a page and someone else has hit that URL at the same time, you will be queued up behind them. It will seem like pages take a long time to load, but it's simply because the server is handling page requests in single file and sometimes you aren't at the front of the line.
You may have turned this on for debugging and forgot to turn it off for production.
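To illustrate the queueing effect described here, a tiny model of a single worker serving requests in arrival order. The service times are hypothetical and not taken from any of the servers in this thread.

```python
def completion_times(service_times_ms):
    """Finish time of each request when one worker serves them in order."""
    finished, clock = [], 0
    for service in service_times_ms:
        clock += service
        finished.append(clock)
    return finished


# Hypothetical: a slow 3400 ms request arrives just ahead of a 100 ms page.
times = completion_times([3400, 100])
print(times)
```

The fast page does only 100 ms of work but is not finished until the slow request ahead of it is done, which is how a quick page can look slow from the client side.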