Paying for crawlers on AppEngine - google-app-engine

Yesterday my app was visited 35 times by HUMANS. It seems, however, that a machine was also crawling the website. I was over quota within a few hours (mostly frontend instance hours).
Today I pay a maximum of 5 USD per day. For 35 real people that seems way too much.
I don't feel good paying for crawlers that block access to my website for regular users. Two questions for you guys:
Is it normal that this happens?
What can I do to invest money in real users instead of crawlers? (And I'm not talking about removing my app from search indexes.)
app: www.conceptstore.me

A well-behaved crawler should:
follow the rules in /robots.txt - so upload one (see the example below). This alone should be enough.
provide a distinct User-Agent HTTP request header - so look at the User-Agents automatically recorded in the App Engine logs, then return error pages for User-Agents you don't like (a sketch of that follows the robots.txt example).
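For instance, a minimal robots.txt served at the root of the app might look like this (the bot name is a placeholder; use the User-Agents you actually see in your logs):

    # Politely slow down all crawlers; lock one misbehaving bot out entirely.
    User-agent: *
    Crawl-delay: 10

    User-agent: BadBot
    Disallow: /

Note that Crawl-delay is honored by some crawlers and ignored by others (Googlebot ignores it), and impolite bots ignore robots.txt altogether, which is where the User-Agent check comes in. Here is a sketch of that check as a Node.js/Express middleware (the denylist entries are hypothetical, and your app may use a different language or framework, but the idea carries over):

    const express = require('express');
    const app = express();

    // Hypothetical denylist, built from User-Agents seen in the App Engine logs.
    const BLOCKED_AGENTS = [/badbot/i, /evilscraper/i];

    app.use((req, res, next) => {
      const ua = req.get('User-Agent') || '';
      if (BLOCKED_AGENTS.some((re) => re.test(ua))) {
        return res.status(403).send('Forbidden'); // cheap response, no page rendering
      }
      next();
    });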

Related

Why is my Google App Engine site over quota?

I'm getting "Over Quota
This application is temporarily over its serving quota. Please try again later." on my GAE app. It's not billing-enabled. I ran a security scan against it today, which presumably triggered the over quota, but I can't explain why based on the information in the console.
Note that 1.59G has been used responding to 4578 requests. That's an average of about 347k per request, but none of my responses should ever be that large.
By filtering my logs I can see that there was no request today whose response size was greater than 25k. So although the security scan generated a lot of small requests over its 14 minute run, it couldn't possibly account for 1.59G. Can anyone explain this?
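For what it's worth, the arithmetic above checks out (a quick sanity check using the reported figures):

    // Sanity check of the numbers reported by the console.
    const totalBytes = 1.59e9; // 1.59 GB of response traffic
    const requests = 4578;
    console.log(Math.round(totalBytes / requests)); // ~347313 bytes per request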
Note: mostly suppositions ...
The Impact of Security Scanner on logs section mentions:
Some traces of the scan will appear in your log files. For instance,
the security scanner generates requests for unlikely strings such as
"~sfi9876" and "/sfi9876" in order to examine your application's error
pages; these intentionally invalid page requests will show up in your
logs.
My interpretation is that some of the scan requests will not appear in the app's logs.
I guess it's not impossible for some of the scanner's requests to similarly not be counted in the app's request stats, which might explain the suspicious computation results you reported. I don't see any mention of this in the docs to validate or invalidate this theory. However...
In the Pricing, costs, and traffic section I see:
Currently, a large scan stops after 100,000 test requests, not
including requests related to site crawling. (Site crawling requests
are not capped.)
A couple of other quotes from Google Cloud Security Scanner doc:
The Google Cloud Security Scanner identifies security vulnerabilities
in your Google App Engine web applications. It crawls your
application, following all links within the scope of your starting
URLs, and attempts to exercise as many user inputs and event handlers
as possible.
Because the scanner populates fields, pushes buttons, clicks links,
and so on, it should be used with caution. The scanner could
potentially activate features that change the state of your data or
system, with undesirable results. For example:
In a blog application that allows public comments, the scanner may post test strings as comments on all your blog articles.
In an email sign-up page, the scanner may generate large numbers of test emails.
These quotes suggest that, depending on your app's structure and functionality, the number of requests can be fairly high. Your app would need to be really basic for the quoted kinds of activities to be achieved in 4578 requests - kinda supporting the above theory that some scanner requests might not be counted in the app's stats.

How do I set a cost limit in Google Developers Console

Some functions in the Google Developers Console, like the Analytics API, are free until you reach a quota. Other functions, like Google Cloud Storage, create costs from the first click.
When I upload a file under https://console.developers.google.com/ > Storage > Cloud Storage > Storage Browser and I make this file publicly available, I pay about $0.12 per GB traffic.
But theoretically the traffic to this link could explode, e.g. because of sudden popularity. Therefore I would like to set something like a daily or monthly cost limit.
Q: How do I protect myself from overly high costs in the Google Developers Console?
You cannot. I asked Google about this; here's their response, from May 7, 2016:
(GCE = Google Compute Engine — no spending limits.
GAE = Google App Engine — yes, it has spending limits.)
... you are eligible for support on ... only ...
... [various helpful links] ...
That been said, at the moment there is no a feature that allows you to
configure a limited budget on GCE. This feature is certainly available
for GAE [1]. As you mentioned in your comments, you either can totally
shut down your VMs (will depend on your use case) or set the VMs to
send you alerts if they reach a certain traffic limit [2].
Sincerely,
Someone's first name
Technical Solutions Representative
Google Cloud Platform
[1] https://cloud.google.com/appengine/docs/quotas
[2] https://cloud.google.com/monitoring/support/notification-options
#wmdry, you wrote: "traffic to this link could explode" — I'm afraid of this too; that's why I asked Google about it. I'm planning to avoid Google's CDN because of this and use another CDN provider instead, one that has spending limits. Unlike with Nginx, I don't see any way to rate-limit or throttle Google's CDN.
I do plan to use GCE (Google Compute Engine), though. So right now I'm reading about how to rate-limit my Nginx server (a sketch below), because if I configure Nginx correctly, those $0.12/GB you mentioned can't explode to something like $10k in a month. What if Google sent me a $10k bill when I came back from a few weeks' vacation, just because of my hobby project and a few people downloading a 1 MB movie over and over again forever (because: evil)? And the bigger and faster my servers, the higher the risk.
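A minimal sketch of that kind of Nginx throttling (the zone name, rates, and location are arbitrary placeholders; the limit_req_zone directive belongs inside the http {} block of nginx.conf):

    # Allow each client IP at most 2 requests/second, with a small burst,
    # and cap the transfer speed of each response.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;

        location /downloads/ {
            limit_req zone=perip burst=10 nodelay;
            limit_rate 500k;   # throttle each response to ~500 kB/s
        }
    }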
I hope Google will add spending limits, because I did want to use Google's CDN.
Update 2020: Apparently this does bite people from time to time — look here:
"Burnt $72k testing Firebase and Cloud Run and almost went bankrupt", Dec 08, 2020, https://news.ycombinator.com/item?id=25372336,
In that case, they could contact Google and in the end didn't need to pay.
As of July 2017 you can set budgets that send notifications via email but do not cap spending:
To set an alert-only budget, which will not cap spending:
Go to the Cloud Platform Console.
Open the console left side menu and click Billing
If you have more than one billing account, click the billing account name.
On the left, click Budgets & alerts.
Official help page: https://support.google.com/cloud/answer/6293540?hl=en
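If you prefer the command line, newer gcloud releases can create the same alert-only budget (a sketch; the billing account ID and amounts are placeholders, and older gcloud versions may require the beta component):

    # Email alerts at 50% and 90% of a $100/month budget; this notifies, it does not cap.
    gcloud billing budgets create \
        --billing-account=XXXXXX-XXXXXX-XXXXXX \
        --display-name="monthly-alert-budget" \
        --budget-amount=100.00USD \
        --threshold-rule=percent=0.5 \
        --threshold-rule=percent=0.9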
I found that Google's documentation now provides two methods to actually limit the cost of a GCP project. The setup looks like this:
Create a Cloud Function that checks the cost against the budget and carries out a certain action if the cost exceeds it. Google's documentation provides a sample code snippet that can either shut down all VM instances in a project or disable billing for the project. Shutting down all VMs stops all VM-related cost, but you get to keep your data (and still pay for the storage). Disabling billing for a project effectively zaps all cost-generating activity, and you could lose data. You can name the Cloud Function "budget-enforcer".
The Google code snippet mentioned above has a hard-coded ZONE variable. Remember to change it to match your zone!
Create a Service Account to run the "budget-enforcer" Cloud Function. For shutting down VMs, the Service Account needs the "Compute Instance Admin (v1)" role. For disabling billing on a project, it needs the "Project Billing Manager" role.
Set a Pub/Sub topic for the Cloud Function (I call mine "proj-name-stop-vm" and "proj-name-disable-bill").
Set up a budget alert as usual and connect it to one of the Pub/Sub topics above.
Note that Google's documentation mentions there can be a delay between when the cost exceeds the budget and when the function is triggered, so build in a buffer if you have an absolute hard cost limit. I use 90% of the budget as the trigger line for shutting down my instances (a sketch of the trigger logic follows).
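A minimal sketch of the Pub/Sub-triggered entry point for such a "budget-enforcer" (the 90% trigger line and the function name are my choices from above; costAmount and budgetAmount are fields of Google's budget notification payload, and the actual shutdown/disable step is elided):

    // Cloud Function (Node.js) subscribed to the budget's Pub/Sub topic.
    exports.budgetEnforcer = async (pubsubEvent) => {
      const data = JSON.parse(
        Buffer.from(pubsubEvent.data, 'base64').toString()
      );
      // Act at 90% of the budget to leave a buffer for the reporting delay.
      if (data.costAmount < data.budgetAmount * 0.9) {
        return 'Cost within budget, no action taken';
      }
      // ...stop the VM instances or disable billing here, as described above...
      return 'Trigger line crossed, enforcement action taken';
    };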
The API usage can be limited with a hard limit:
Depending on the API, you can explicitly cap requests in a variety of
ways, including: requests per day, requests per 100 seconds, and
requests per 100 seconds per user. You might want to limit the
billable usage by setting caps. For example, to prevent getting billed
for usage beyond the free courtesy usage limits, you can set requests
per day caps
Source
You can combine budget pub/sub alerts with a cloud function that can disable billing on your entire account if a threshold is met.
Full Tutorial Here:
https://www.youtube.com/watch?v=KiTg8RPpGG4
GitHub Repo Here: https://github.com/aioverlords/Google-Cloud-Platform-Killswitch
To Disable Billing
    // Requires the googleapis Node.js client ("npm install googleapis") and
    // credentials with the Project Billing Manager role.
    const {google} = require('googleapis');
    const billing = google.cloudbilling('v1').projects;

    const _disableBillingForProject = async projectName => {
      const res = await billing.updateBillingInfo({
        name: projectName, // e.g. 'projects/my-project-id'
        resource: {
          billingAccountName: '' // an empty name detaches the billing account
        },
      });
      console.log(res);
      console.log('Billing Disabled');
      return `Billing disabled: ${JSON.stringify(res.data)}`;
    };
Simply go to the developer console:
https://console.developers.google.com/project
Select your project.
Select "billings & settings"
Enable billing.
Then go to Compute/AppEngine/Settings and set a daily budget.
Go to Google Cloud console, and then to Billing / Budgets and Alerts and create a new budget for one or all your projects. You can select which services should be included in the limit and set a monthly amount that should not be exceeded.

google app engine, budget is short warning script

I have a critical app running on GAE. I've had two cases where I went over budget.
I have tests warning me when I lose the website, but I'd love to get an earlier notification, when my budget is running out.
Is there a way to access the daily budget estimation, so I can send myself a warning before it becomes a problem?
There's no API for this purpose, sorry; your best bet right now (if it's that important) is parsing the quota page (a rough sketch below).
A ticket was opened a few years back; it's acknowledged, but no news.
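A rough sketch of what that parsing could look like (everything here is a placeholder: the quota page URL, the session cookie, and the regex all depend on the console's actual HTML, which you'd have to inspect yourself):

    // Hypothetical scraper: poll the quota page and warn when the remaining
    // budget falls below a threshold. The URL, cookie, and regex are
    // placeholders, not a documented interface. Needs Node 18+ for global fetch.
    const THRESHOLD = 0.2; // warn when less than 20% of the budget remains

    async function checkBudget() {
      const res = await fetch(process.env.QUOTA_PAGE_URL, {
        headers: {Cookie: process.env.ADMIN_SESSION_COOKIE}, // valid session needed
      });
      const html = await res.text();
      const m = html.match(/\$([\d.]+) of \$([\d.]+)/); // adapt to the real markup
      if (m && Number(m[1]) / Number(m[2]) < THRESHOLD) {
        console.warn(`Budget low: $${m[1]} left of $${m[2]}`); // hook up email here
      }
    }

    setInterval(checkBudget, 60 * 60 * 1000); // check hourly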

"Over quota" when using GCS json-api from App Engine

I am using Go on App Engine. In most cases I use the file API to access GCS, which works great, except that deletes don't work; so to delete files I use the JSON API (specifically, the google-go-api-client). To authenticate, I use App Engine service accounts. We sometimes see an error of "Over quota:" with nothing after the colon. Since we are a paid app, what quota could this be? Is there a burst limit (e.g. no more than X requests in a single minute)? Is there any place where such quotas are documented?
The caching mechanism is broken for goauth2 and serviceaccount tokens. You can see the issue I created here for more detail: https://code.google.com/p/goauth2/issues/detail?id=28
I came across an "over quota" issue myself when requesting more than 60 service account tokens a minute. I opened a ticket with App Engine support (I pay for the silver package) and got this undocumented information out of them.
You can apply the patch yourself in your $GOPATH/src/code.google.com/p/goauth2/appengine/serviceaccount/cache.go file. This fixed the issue you described for my team.
I ran into the same problem and found two causes:
1. Daily budget
2. Log retention
Solution:
For problem 1, increase the daily budget.
For problem 2, increase the log retention from 1 GB to a higher value.

Subdomain is preventing my search results from rising as it should in page rank

My problem is that I have a site which requires a dedicated page for every city I choose to support. Early on, I decided to use subdomains rather than paths directly under my domain (i.e. I used la.truxmap.com rather than truxmap.com/la). I realize now that this was a major mistake, because Google seems to treat la.truxmap.com as a completely different site from ny.truxmap.com. So for instance, if I search "la food truck map" my site will be near the top; however, if I search "nyc food truck map" I'm nowhere in sight, because ny.truxmap.com wouldn't rank very high by itself, and it doesn't get the boost it ought to be getting from the better-known la.truxmap.com.
So a mistake I made a year ago is now haunting my page rank. I'd like to know the most painless way of resolving my dilemma. I have received so much press at la.truxmap.com that I can't just kill the site, but could I redirect all requests at la.truxmap.com to truxmap.com/la, and do the same for all supported cities, without trashing the current, satisfactory page rank I'm getting from la.truxmap.com?
EDIT
I left out some critical information. I am using Google Apps to manage my domain (that is, to add the subdomains) and Google App Engine to host my site. Google Apps provides a simple mechanism to mask truxmap.appspot.com (the App Engine domain) as la.truxmap.com, but I don't see how I can mask it as truxmap.com/la. If I can get this done, then I can just 301-redirect la.truxmap.com to truxmap.com/la as suggested below.
Thanks so much!
You could send a "301 Moved Permanently" redirect to cause the Google crawler to update its references to your site, no?
See this article on 301 redirects and SEO.
You'll need to modify your app as follows:
Add www.truxmap.com as an alias for the app (you can't serve naked domains in App Engine, so just truxmap.com won't work)
Add support to your app for handling URLs of the form www.truxmap.com/something/, routing to the same handlers as the subdomain. You'll need to make sure you've debugged any relative path issues well before continuing.
Modify your app to serve 301 (permanent) redirects from every URL under something.truxmap.com/whatever to www.truxmap.com/something/whatever, as sketched below.
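A sketch of that redirect step in Node.js/Express terms (the original app was presumably Python or Java on App Engine, but the host-to-path mapping is the same idea; the domain names follow the question):

    const express = require('express');
    const app = express();

    // Map <city>.truxmap.com/<path> to www.truxmap.com/<city>/<path>.
    app.use((req, res, next) => {
      const m = req.hostname.match(/^(\w+)\.truxmap\.com$/);
      if (m && m[1] !== 'www') {
        // 301 so crawlers transfer the old subdomain's rank to the new path.
        return res.redirect(301, `http://www.truxmap.com/${m[1]}${req.url}`);
      }
      next();
    });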
