Google App Engine - Sudden Increase of Datastore Read Operations

I'm maintaining a blog app (blog.wokanxing.info, it's in Chinese) for myself, built on Google App Engine. It's been two or three years since the first deployment, and I've never hit any quota issue thanks to its simplicity and small visit count.
However, since early last month I've noticed that from time to time the app returns a 500 server error, and the admin panel shows a mysteriously fast consumption of the free datastore read operation quota. Within a single hour about 10% of the free read quota (~5k ops) is consumed, but I count only a dozen requests that involve datastore read ops, 30 tops. That would mean an average of 150 to 200 read ops per request, which sounds impossible to me.
I've not committed any change to my codebase for months, and I'm not seeing any change in the datastore or the quota policy either. It also confuses me how such consumption could happen at all. I use memcache a lot, which leaves the front page as the biggest player: it fetches the latest threads using Post.all().order('-date').fetch(10, offset). Other requests merely fetch a single model using Post.get_by_key_name and iterate over post.comment_set. A rough sketch of the access pattern is below.
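For reference, here is a minimal sketch of the front-page path described above, using the old GAE Python db and memcache APIs. The cache key scheme and expiry time are illustrative assumptions, not details from the original app:

```python
from google.appengine.api import memcache
from google.appengine.ext import db

class Post(db.Model):
    date = db.DateTimeProperty(auto_now_add=True)

def front_page_posts(offset=0):
    # Hypothetical cache key; the real app's scheme isn't shown.
    cache_key = 'front_page:%d' % offset
    posts = memcache.get(cache_key)
    if posts is None:
        # On a cache miss, the query costs roughly one read op
        # plus one read op per entity returned (10 here).
        posts = Post.all().order('-date').fetch(10, offset)
        memcache.set(cache_key, posts, time=600)  # cache for 10 minutes
    return posts
```

With this pattern, only cache misses should touch the datastore, which is why a dozen requests burning ~5k read ops points at uncached traffic such as crawlers.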
Sorry for my poor English, but can anyone give me some clues? Thanks.

From the Admin console, check your logs.
Do not check only for errors; look at all types of messages in the log.
Look for requests made by robots/web crawlers. In most cases you can spot such "users" by the words "robot" or "bot" in the user-agent (well, if they are honest...); a sketch of such a check follows below.
The first thing you can do is edit your robots.txt file. For more detail, read How to identify web-crawler?. GAE's documentation also covers serving a robots.txt file.
If that fails, try to find the IP addresses the bots use. Using the GAE Admin console, put those addresses on a blacklist and check your quota consumption again.
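As a sketch of the user-agent check suggested above (webapp2 is the standard GAE Python framework; the handler layout and marker list are illustrative assumptions):

```python
import webapp2

# Honest crawlers identify themselves in the User-Agent header.
BOT_MARKERS = ('bot', 'crawler', 'spider')

class BotAwareHandler(webapp2.RequestHandler):
    """Hypothetical base handler that skips datastore work for bots."""

    def is_bot(self):
        user_agent = (self.request.headers.get('User-Agent') or '').lower()
        return any(marker in user_agent for marker in BOT_MARKERS)

    def get(self):
        if self.is_bot():
            # Serve a cheap static response instead of querying the datastore.
            self.response.write('See you later, crawler.')
            return
        # ...normal, datastore-backed rendering here...
```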

Related

Datastore Quota Reached: Project quota page shows none reached

I'm receiving this error on our App:
The API call datastore_v3.Put() required more quota than is available.
However, when I check our quotas page, nothing is flagged as being over quota (or even close). We have billing enabled, we're not at our daily budget ($2, set to rule that out, although it's normally $0), and these errors have been showing for over a minute (so I don't think it's the per-minute limits).
How can an API call fail due to being over quota, if everything seems to show that we're not over quota?
Budgets for API calls take a while to kick in. In this case, the project had hit the free limit for datastore operations (0.05M), but the increased daily budget had only just been enabled, so the app was still unable to use more operations.
The problem resolved itself after a couple of hours.
For others experiencing this issue, you can find the free datastore quotas here. Compare your current usage (as shown in the view in the question) against those limits. If it looks like you've gone over, reassess your daily budget (or whether you really need so many datastore calls!).
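If you'd rather fail gracefully than surface a raw 500 while a new budget kicks in: in the GAE Python runtime these failures raise apiproxy_errors.OverQuotaError, which you can catch. A minimal sketch (the fallback behaviour is an assumption, not from the question):

```python
from google.appengine.ext import db
from google.appengine.runtime import apiproxy_errors

def save_entity(entity):
    """Try to store an entity; report quota exhaustion instead of a raw 500."""
    try:
        return db.put(entity)
    except apiproxy_errors.OverQuotaError:
        # The datastore_v3.Put() call was refused for lack of quota.
        # An increased budget can take a while to apply, so back off
        # rather than retrying in a tight loop.
        return None
```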

Objectify cache hit/miss and quotas

I launched a new web app this month. I'm trying to understand why my datastore read count is so high, even though all my entities are cached.
My main point of confusion is this: in the total quota overview for this month I have 1.12M read operations in the datastore.
But when I go to the memcache section of the console, it tells me the hit ratio is 96.35%, with 1,457,499 hits / 55,177 misses.
First of all, are these numbers per month or per day?
Second, how is this possible?
I know that reads inside transactions don't use the cache, but I don't make heavy use of transactions. Is there anything other than transactions that could cause this?
If you want more insight into your Objectify memcache hit rates, mount the MemcacheStatsServlet (or look at its code and do something similar). This will provide your cache hit ratio broken down by Kind.
Keep in mind that since it is reporting for just one instance (whichever you happen to hit with your request for stats), this is only a representative sample of what is going on in your cluster.

What happens when a function in a GAE app exceeds quota yet is unfinished?

I ran a function that loads a lot of data into GAE using db.put(). However, it raised an over-quota exception when I hit my write quota. When I rechecked the data by running the app, it was indeed incomplete. So when quota became available again, I ran the data loader again starting from some index (so I wouldn't write the same data over and over).
Here is the problem: after running the data loader manually (again and again), all the data the app needs seems to be there, even though the first load raised an over-quota exception.
So, my question specifically is: is a function that runs over quota in GAE queued until quota is available again, or is it terminated?
Background: my friend and I are building a search system, and we need to load its database into GAE.
If you hit the write quota while adding many values to the datastore, the remaining values are not saved anywhere and you will have to try again. Datastore Admin shows entity counts based on datastore statistics, which are updated with a delay: officially up to 24 hours, but it can be even longer, as mentioned in this previous post. So to find out whether recently uploaded entities made it into the datastore, you cannot rely on Datastore Admin; instead, query for a particular entity you added recently. Alternatively, read the entity keys returned by each db.put() and use the last returned value to see which entity was stored successfully last; a sketch of that approach is below.
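A minimal sketch of that key-tracking idea, using the old db API (the batch size and resume bookkeeping are assumptions):

```python
from google.appengine.ext import db
from google.appengine.runtime import apiproxy_errors

def load_in_batches(entities, batch_size=100):
    """Put entities batch by batch, remembering the last stored key."""
    last_key = None
    try:
        for start in range(0, len(entities), batch_size):
            batch = entities[start:start + batch_size]
            # db.put() returns the keys of the entities it stored.
            keys = db.put(batch)
            last_key = keys[-1]
    except apiproxy_errors.OverQuotaError:
        # Everything after last_key was not saved; resume from there
        # once quota is available again.
        pass
    return last_key
```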

Why am I hitting the datastore read operation quota?

I was in a place without Internet access for 3 weeks and just came back to find out that since January 18 one of my apps has been hitting a quota limit (Datastore Read Operations) after around 18 hours each day.
I don't see any increase in traffic from either users or crawlers.
This is the error in the logs:
"The API call datastore_v3.RunQuery() required more quota than is available."
It seems very strange, since this application has been running for some years and I'm memcaching most of the datastore requests.
Please help - This is affecting my bottom line!
Thanks.
I found a subset of pages on the site that had attracted sudden interest from several crawlers, and some of the requests those pages made to the datastore were not being memcached, so that was it... problem solved.
Thanks.

Is there a way to make more than 10K requests on Google search from the same IP?

I am currently working on an app that requires scraping data from Google's search results, for example google.com/search?q=domain.com and so on. But Google blocks my IP address after I make a number of requests. I know there are Google APIs, but many sites out there just scrape the data directly.
Scraping Google search results is a breach of the terms of service. Google actively discourages it and blocks those who do it. They share their information with you free of charge, but they don't appreciate you trying to get a copy of all of it.
Better to do your own crawling of the domain.
Too bad I did not see your question earlier; in case it's not too late:
Scraping Google does indeed violate their terms of service; on the other hand, you may choose not to accept them. You accept their TOS when you create a Google account, for example, but as far as I know you can also withdraw that acceptance (at least when they change the terms).
For a smaller amount of data you can use their API, or their commercial API, but if you need the results and ranks exactly as a user would see them (for SEO purposes), I know of no official way to get their permission.
I am not a lawyer, so you might want to consult one if you want to be sure about the legal consequences.
However, scraping Google usually does not lead to any legal problems. I remember that even Bing (Microsoft's engine) got caught scraping Google for unknown keywords a few years ago. My personal guess is that the majority of their original results were copied from Google in secret.
There is an open-source project, http://google-rank-checker.squabbel.com, which does work for scraping large amounts of Google results. As far as I remember, without modification it is limited to about 50-70k result pages per day.
I suggest taking a look at the code; it's PHP with libcURL.
You will also need proper IP addresses (not shared, not previously abused). Scraping with a single IP will get you blocked by Google within an hour.
Usually the first thing that happens is a captcha; by solving the captcha you obtain a cookie that lets you keep making requests.
If you continue past that, you will get a complete ban.
And if you "hammer" Google with a huge number of requests, you will alert their staff, who can put a manual ban on a whole ISP or network block.
A sustainable rate is around 10 requests per hour per IP; that's what I have stuck to in my related projects.
So if you do scrape Google, make sure you have functions that validate the results and watch for unexpected responses. In such a case your code should immediately stop accessing Google, so it doesn't keep hitting a page that is just showing a captcha; see the sketch below.
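A minimal Python sketch of that stop-on-captcha rule (Python 2, to match the GAE-era snippets above; the URL shape and captcha markers are assumptions about Google's markup, not guaranteed to match it):

```python
import time
import urllib
import urllib2

REQUESTS_PER_HOUR = 10  # the sustainable per-IP rate suggested above

def fetch_serp(query):
    """Fetch one results page; stop immediately if a captcha appears."""
    url = 'https://www.google.com/search?q=' + urllib.quote(query)
    html = urllib2.urlopen(url).read()
    # Assumed markers for Google's interstitial captcha/sorry page.
    if '/sorry/' in html or 'captcha' in html.lower():
        raise RuntimeError('Captcha detected; stop hitting Google from this IP')
    time.sleep(3600 / REQUESTS_PER_HOUR)  # spread requests out
    return html
```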
