How to block Yandex and Baidu IPs in GAE? - google-app-engine

In robots.txt I can put:
#Baiduspider
User-agent: Baiduspider
Disallow: /
#Yandex
User-agent: Yandex
Disallow: /
to tell the search engines to stop crawling my app pages (PHP app). But how can I block them by IP in GAE?

There are two ways:
1. Do it in your code.
2. Use the DoS protection facilities: https://developers.google.com/appengine/docs/python/config/dos
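For the first option (doing it in your code), here is a minimal sketch in Python 3; the same check ports to PHP easily. The CIDR ranges below are documentation placeholders, not real Yandex or Baidu ranges, and `is_blocked` and `BLOCKED_NETWORKS` are hypothetical names used only for illustration.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler ranges.
# Replace them with ranges you have verified yourself.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address: treat it as not blocked in this sketch.
        return False
    return any(addr in net for net in networks)
```

In a request handler you would pass the client's remote address to `is_blocked` and answer with a 403 when it returns True.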

Related

405 Err on OPTIONS preflight for upload_url on Google Appengine SDK on different port

I have a Google AppEngine project that works fine in production but not locally.
There is a React browser application running locally on port 3001 and a python api service running on 9090.
When I attempt to upload files via the React client, I first call a REST endpoint that returns the blobstore get_upload_url() to the client. This url is something like: http://localhost:9090/_ah/upload/aghkZXZ-... <-- note the port is that of the python service
When I fashion a POST request to that url from the browser client to actually upload the file, I get a 405 on the OPTIONS preflight check. So far as I understand, this is due to the ports being different. This only occurs in the local App Engine SDK since I am using dispatch.yaml settings in production to have everything on the same domain/port.
I had dug into the SDK code a while ago and put a hack in place. (https://gist.github.com/blainegarrett/4d3b3081d09b4ff7be00765eb32b0d94)
However, since upgrading Google Cloud to 218.0.0, the hack was overwritten and I'm back to square one.
Here are the headers to the blobstore upload url:
OPTIONS /_ah/upload/aghkZXZ-Tm9uZXIiCxIVX19CbG9iVXBsb2FkU2Vzc2lvbl9fGICAgICA77ALDA HTTP/1.1
Host: localhost:9090
Connection: keep-alive
Origin: http://localhost:3001
Access-Control-Request-Method: POST
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
I am currently using vanilla XMLHttpRequest() for the upload call specifically.
Does anyone have any suggestion on how to either get around the preflight check when the ports are different and/or to allow OPTIONS checks on the upload url in a less hacky way?
Update: I'd still like to hear an answer regarding the 405 on the SDK, but I was able to dodge the preflight check by getting rid of the xhr progress listener. My original assertion that the port difference was triggering the preflight check was incorrect. It was the progress callback.
xhr.upload.addEventListener('progress', function(e) { .. });
See research on: CORS request is preflighted, but it seems like it should not be
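To make the update above concrete, here is a simplified sketch of the decision a browser makes before sending a cross-origin XHR. It is written in Python purely as an illustration. `needs_preflight` is a hypothetical helper, and the rules are a reduced version of the CORS "simple request" conditions, ignoring details such as header value restrictions.

```python
SIMPLE_METHODS = {"GET", "HEAD", "POST"}
SAFELISTED_HEADERS = {"accept", "accept-language", "content-language", "content-type"}
SIMPLE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}


def needs_preflight(method, author_headers, has_upload_listeners=False):
    """Approximate whether a cross-origin XHR triggers an OPTIONS preflight."""
    # Any listener registered on xhr.upload (e.g. 'progress') forces a preflight.
    if has_upload_listeners:
        return True
    if method.upper() not in SIMPLE_METHODS:
        return True
    for name, value in author_headers.items():
        lname = name.lower()
        if lname not in SAFELISTED_HEADERS:
            return True
        if lname == "content-type":
            # Compare only the media type, ignoring parameters like boundary=...
            media_type = value.split(";", 1)[0].strip().lower()
            if media_type not in SIMPLE_CONTENT_TYPES:
                return True
    return False
```

Under these rules a plain multipart POST is not preflighted regardless of the port, while the same POST with an upload progress listener is, which matches the behaviour described in the update.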

Google says my site is not mobile-friendly, but it is

I have created a site with a responsive layout that works well.
Apparently Google thinks the site isn't mobile-friendly, and has listed a whole pile of resources that I notice are covered by these rules in the robots.txt file:
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
It looks like I need to allow access to some of these files/folders, including media, templates and plugins.
I am concerned that Google will then list administrator-type pages in its search results.
What should I do?
Is it OK to do this, and which ones should I allow?
Thanks
After some more rooting around, I just made image, media and templates viewable to robots. Now my site is friends with Google.
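One way to check that opening up asset folders does not also expose the admin area is to test the edited rules locally. The sketch below uses Python's standard `urllib.robotparser`. The rule set is a shortened version of the list above with the media and templates lines removed, and the two paths are made-up examples. Python's parser only approximates Google's matching, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Shortened rule set for illustration: asset folders are no longer disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /tmp/
"""


def allowed(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With these rules, a stylesheet under /templates/ is fetchable while anything under /administrator/ stays disallowed.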

server log analysis of 404s

A large site I am working with is getting 80K+ 404s a day from Google for garbage URLs. I can't figure out where they are coming from. Here is a sample of a few. These URIs exist nowhere in the site structure, so I am assuming they are being created by an external agent/site that is driving Googlebot to crawl them. Does anyone have any ideas?
7/2/2013 22:05 /Sl/4watQCXBFtF6obwFRA0f35148b 10262 404 - Not Found No Referrer Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
7/2/2013 22:05 /PvDIs6AveH9tju3tETtWg045cb22d 10261 404 - Not Found No Referrer Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
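A first step in tracking these down could be to group the 404s by their leading path segment and look for patterns. The Python sketch below assumes each log entry sits on a single line laid out as date, time, path, a numeric field and then the status code. That layout is a guess from the two sample entries, and the meaning of the numeric field is unknown. `summarize_404s` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: date, time, path, numeric field (meaning unknown), status.
LOG_LINE = re.compile(
    r"^(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<path>\S+)\s+(?P<number>\d+)\s+(?P<status>\d{3})\b"
)


def summarize_404s(lines):
    """Count Googlebot 404s by the first path segment of the requested URL."""
    by_prefix = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        if "Googlebot" not in line:
            continue
        segments = match.group("path").split("/")
        prefix = "/" + segments[1] if len(segments) > 1 else "/"
        by_prefix[prefix] += 1
    return by_prefix
```

Calling `summarize_404s(open(path))` on the real log and inspecting `most_common()` may show whether the garbage URLs cluster under a few prefixes.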

Google Gae : Unreachable robots.txt

I have uploaded robots.txt to http://watchmariyaanmovieonline.appspot.com/robots.txt, but when I use Google Webmaster Tools and run Fetch as Google for my home page http://watchmariyaanmovieonline.appspot.com/ I get the error "Unreachable robots.txt".
Your robots.txt contains an empty Disallow line, which is why you get that error:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://watchmariyaanmovieonline.appspot.com/sitemap.xml
Update it to:
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://watchmariyaanmovieonline.appspot.com/sitemap.xml
And it should work just fine, let me know if this helps :)

"HTTP ERROR: 500 No realm" running GWT's "MobileWebApp" sample

I'm trying to run the GWT 2.4 sample app "MobileWebApp". I get a 500 "No Realm" error when I try to run the app in dev mode through Eclipse.
I understand this is an authentication problem.
I'm not familiar with Google App Engine or Jetty but from looking at the web.xml I can see there is a servlet filter where it is using the appengine UserService to presumably redirect the user to Google for authentication.
I'm using:
Eclipse 3.7 (Indigo SR1)
Google Plugin for Eclipse 2.4
m2eclipse
I'm including an excerpt from the web.xml below. I'm not sure what other info would be helpful in diagnosing this problem.
<security-constraint>
  <display-name>
    Redirect to the login page if needed before showing
    the host html page.
  </display-name>
  <web-resource-collection>
    <web-resource-name>Login required</web-resource-name>
    <url-pattern>/MobileWebApp.html</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>*</role-name>
  </auth-constraint>
</security-constraint>
<filter>
  <filter-name>GaeAuthFilter</filter-name>
  <!--
    This filter demonstrates making GAE authentication
    services visible to a RequestFactory client.
  -->
  <filter-class>com.google.gwt.sample.gaerequest.server.GaeAuthFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>GaeAuthFilter</filter-name>
  <url-pattern>/gwtRequest/*</url-pattern>
</filter-mapping>
Below is the output in the Eclipse console:
[WARN] Request /MobileWebApp.html failed - no realm
[ERROR] 500 - GET /MobileWebApp.html?gwt.codesvr=127.0.0.1:9997 (127.0.0.1) 1401 bytes
Request headers
Host: 127.0.0.1:8888
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Response headers
Content-Type: text/html; charset=iso-8859-1
Content-Length: 1401
Many thanks for any helpful advice!
Edit on 11/11/11: I added Jetty tag since it seems relevant to this problem.
If your very first request fails, just getting the /MobileWebApp.html page, then it probably isn't an authentication problem. Do you have GAE enabled for that project (not only GWT)? That might be one issue.
I read somewhere that there are two ways of debugging an app in Eclipse: one is with run as/webapp, and I forget what the other one is (I don't use Eclipse). One of them works and the other doesn't.
If that doesn't work, you can try replacing the built-in jetty:
add a GWT param: -server com.google.appengine.tools.development.gwt.AppEngineLauncher
VM param: -javaagent:/path_to/appengine-agent.jar
And the last option is with -noserver, but then you won't be able to debug the server-side code, just the client-side GWT stuff: first start jetty with mvn jetty:run and then debug in Eclipse with the -noserver GWT param.
I had the same problem. Finally I noticed that when I switched to a newer version of Appengine, the older Appengine libraries remained in the WEB-INF/lib along with the new ones.
Removing them solved the problem.
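To spot leftover libraries like this, a small script can flag jars that appear in more than one version. The Python sketch below assumes jar names follow a `name-version.jar` pattern. `find_duplicate_jars` is a hypothetical helper, and the directory path in the usage note is only an example.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <name>-<version>.jar, where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.jar$")


def find_duplicate_jars(filenames):
    """Return {library name: [versions]} for libraries present in several versions."""
    versions = defaultdict(set)
    for filename in filenames:
        match = JAR_NAME.match(filename)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

For example, `find_duplicate_jars(os.listdir("war/WEB-INF/lib"))` (after `import os`, with the path adjusted to your project) lists the libraries that need cleaning up.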
