Apache HttpClient and the Cloud - google-app-engine

I want to deploy a scraping service that uses Apache HttpClient to the cloud. I've read that problems are possible with Google App Engine, as direct network access and thread creation are prohibited there. What about other cloud hosting providers? Does anyone have experience with Apache HttpClient + cloud?

AppEngine has threads and direct network access (HTTP only). There is a workaround to make it work with HttpClient.
Also, if you plan to run many parse tasks in parallel, you might check out Task Queue or even MapReduce.
Btw, there is a "misfeature" in GAE: you cannot fully set a custom User-Agent header on your requests. GAE always appends "AppEngine" to the end of it (this breaks requests to certain sites, most notably iTunes).

It's certainly possible to create threads and access other websites from CloudFoundry; you're just time-limited for each process. For example, take a look at http://rack-scrape.cloudfoundry.com/, a simple Rack application that inspects the 'a' tags from Google.com:
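require 'rubygems'
require 'open-uri'
require 'hpricot'
# Rack endpoint: fetch Google's home page and list its anchor tags
run Proc.new { |env|
  doc = Hpricot(open("http://www.google.com"))  # download and parse the HTML
  anchors = (doc/"a")                           # select every <a> element
  [200, {"Content-Type" => "text/html"}, [anchors.inspect]]
}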
As for Apache HttpClient, I have no experience with it, but I understand the older Commons HttpClient 3.x is no longer maintained (it was superseded by Apache HttpComponents HttpClient).

Related

Long-running script on Google App Engine

I'm attempting to create a microservice on Google App Engine that is not intended to handle HTTP requests.
Instead, I was hoping to have a continuously running Python script that monitors a remote queue--RabbitMQ, to be precise--and sends out an api-call to another service as tasks are pushed to the queue.
I was wondering, firstly, is it possible to run a script upon deployment--one that did not originate with a user action/request?
Secondly, how would I accomplish this?
Thanks in advance for your time!
You can deploy your "script" as a manually scaled module -- see https://cloud.google.com/appengine/docs/python/modules/ -- with exactly one instance. As the docs say, "When you start a manual scaling instance, App Engine immediately sends a /_ah/start request to each instance"; so, just set that module's handler for /_ah/start to the handler you want to run (in the module's yaml file and the WSGI app in the Python code, using whatever lightweight framework you like -- webapp2, falcon, flask, bottle, or whatever else... the framework won't be doing much for you in this case save the one-off routing).
Note that the number of free machine hours for manual scaling modules is limited to 8 hours per day (for the smaller, B1 instance class; proportionally fewer for larger instance classes), so you may need to upgrade to paid-app status if you need to run for more than 8 hours.
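A minimal sketch of the /_ah/start idea, using a plain WSGI callable instead of a framework. Everything named here is an assumption for illustration: an in-memory queue.Queue stands in for RabbitMQ, call_api is a hypothetical function that forwards a task to the other service, and the loop stops once the queue is empty so the example terminates (a real worker would keep polling).

```python
import queue


def drain(tasks, call_api):
    """Forward every task currently in the queue to call_api.

    Returns the number of tasks handled. A real worker would block and
    keep polling instead of returning when the queue runs dry.
    """
    handled = 0
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break
        call_api(task)
        handled += 1
    return handled


def make_app(tasks, call_api):
    """Build a WSGI app whose /_ah/start handler runs the worker loop."""
    def app(environ, start_response):
        if environ.get("PATH_INFO") == "/_ah/start":
            handled = drain(tasks, call_api)
            body = ("handled %d tasks" % handled).encode("utf-8")
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [body]
        # Any other path is outside the scope of this one-off router.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    return app
```

The module's yaml file would then route /_ah/start to the object returned by make_app; the exact yaml entries depend on the runtime and are not shown here.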
Like @brant said, App Engine is designed to handle HTTP requests. It's not a perfect fit for background jobs, unless you wrap your logic into one HTTP request.
Further, App Engine will emit an error when the response times out, depending on your scaling settings. If you want to try it, consider basic or manual scaling.
For this type of workload, I would suggest you use a VM.
I think there are a few problems with this design.
First, App Engine is designed to be an HTTP request processor, not a RabbitMQ message processor. GAE is intended for many small requests, not one long-running process.
Second, "RabbitMQ should not be exposed to the public internet, it wasn't created for such use case."
I would recommend that you keep the RabbitMQ clients on the same internal network as the RabbitMQ broker, and have the clients send HTTP requests to App Engine.

detect AppEngine vs basic server

I am developing a web app to run on either Google's AppEngine or a basic server with file storage (it may not stay that way but that's the current status).
How do I detect whether the AppEngine services (most importantly blobstore) are available at runtime?
I have tried using code like the following:
try {
    Class.forName( "com.google.appengine.api.blobstore.BlobstoreServiceFactory" );
    logger.info( "Using GAE blobstore backend" );
    return new GAEBlobService();
} catch( ClassNotFoundException e ) {
    logger.info( "Using filesystem-based backend" );
    return new FileBlobService();
}
but it doesn't work because BlobstoreServiceFactory is available at compile time and is bundled with the app, so the class lookup always succeeds. What actually fails when trying to use GAE's blobstore without a GAE server is the following:
com.google.apphosting.api.ApiProxy$CallNotFoundException: The API package 'blobstore' or call 'CreateUploadURL()' was not found.
There are a few things you can use.
You can inspect the runtime environment to find out whether, and under which version of, App Engine you are running. Check the section about "The Environment" in the runtime docs: https://developers.google.com/appengine/docs/java/runtime
You could also do what you were doing, but attempt to make a call that uses the SDK API functions (instead of just checking for the existence of a class) and catch the exception. This may negatively impact performance since you're making an extra RPC.
You could check for GAE-specific request headers too.
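The second suggestion (probe with a real call, fall back on failure) can be sketched independently of the SDK. The sketch below is in Python rather than the question's Java, and every name in it is hypothetical: probe stands for a cheap real API call, BackendUnavailable stands in for an error such as CallNotFoundException, and the two factories stand for the question's GAEBlobService and FileBlobService. Caching the result means the extra RPC is paid only once.

```python
class BackendUnavailable(Exception):
    """Stand-in for the error raised when the API package is missing."""


class BackendSelector:
    """Pick a storage backend by probing once and caching the outcome."""

    def __init__(self, probe, gae_factory, file_factory):
        self._probe = probe                # cheap real call into the API
        self._gae_factory = gae_factory    # builds the App Engine backend
        self._file_factory = file_factory  # builds the filesystem backend
        self._backend = None

    def get(self):
        # Probe only on the first call; later calls reuse the cached backend.
        if self._backend is None:
            try:
                self._probe()
            except BackendUnavailable:
                self._backend = self._file_factory()
            else:
                self._backend = self._gae_factory()
        return self._backend
```

Catching only the specific "API not available" error, rather than every exception, keeps genuine failures of the probe call visible.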

DWR with Google app engine

I created a Google App Engine project and configured DWR (Direct Web Remoting) in it.
I have created an Ajax function that checks the user's username and password. But it does not give me any output; instead it gives this error:
dwr is not defined
Source File: http://localhost:8081/dwr.jsp
Line: 16
Whereas it works fine in a Tomcat web application.
So my question is: does Google App Engine not support DWR configuration?
DWR is compatible with GAE according to the "Will it play in Java" page (broken link; possibly a list of GAE-compatible Java libraries).
Several links to threads with proof of success and potential limitations:
thread about "DWR 3 RC1" from "users@dwr.java.net"
StackOverflow question "DWR sometimes die on the GAE server"
Most useful post in "DWR 3 RC1" thread is this one (about potential problems).
Do you require reverse ajax? (Current implementation uses a thread to clean up expired sessions. Current implementation uses javax.swing.event.EventListenerList which is blacklisted by GAE/J.)
Do you require file upload? (Can potentially write to disk)
Do you require file download? (Current implementation uses a thread to clean up expired downloads)
If you don't need any of the above, then you could replace the problematic classes with dummy implementations.
I don't think this is supported, but the same functionality is offered by channels:
https://developers.google.com/appengine/docs/java/channel/

google ajax-search-api call "quota exceeded" on google app engine

I tried to use the custom search API ( http://code.google.com/intl/de-DE/apis/websearch/docs ) with Java. It works perfectly in Eclipse on my local machine.
When I try to do the same from Google App Engine, the reply is: {"responseData": null, "responseDetails": "Quota Exceeded. Please see http://code.google.com/apis/websearch", "responseStatus": 403}
I do not understand. Isn't it possible to call the search API from GAE apps?
If you look at the very top of that page you linked to, they note that the API has been deprecated and the number of search queries you can make is limited.
However, if you absolutely NEED to use that API instead of the Custom Search API as Google suggests, there are a few troubleshooting steps you can take:
1) Check that your API key is unique to the project, and the limited number of queries you're allowed isn't being consumed by some other application.
2) Google does (did?) hostname filtering so that one computer doesn't use up all the API requests. You may be able to move the queries to JavaScript instead of Java -- essentially move the request from the server to the client.
3) Try using a named backend (Java Backends)
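While troubleshooting, it may help to surface the error details from the reply instead of failing later on a null responseData. The sketch below parses the reply quoted in the question; SearchError and parse_reply are hypothetical names, and it assumes that a successful reply carries responseStatus 200.

```python
import json

# The reply quoted in the question, verbatim.
QUOTA_REPLY = (
    '{"responseData": null, "responseDetails": "Quota Exceeded. '
    'Please see http://code.google.com/apis/websearch", "responseStatus": 403}'
)


class SearchError(Exception):
    """Raised when the reply body reports a non-success status."""

    def __init__(self, status, details):
        super().__init__("search failed (%s): %s" % (status, details))
        self.status = status
        self.details = details


def parse_reply(raw):
    """Return responseData, or raise SearchError with the reported details."""
    reply = json.loads(raw)
    status = reply.get("responseStatus")
    if status != 200:  # assumption: 200 signals success
        raise SearchError(status, reply.get("responseDetails"))
    return reply["responseData"]
```

Calling parse_reply(QUOTA_REPLY) raises SearchError with status 403, which makes quota failures easy to log and count per deployment.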

Google Application Engine - Using URL Fetch Service

I've looked at http://code.google.com/appengine/docs/java/urlfetch/overview.html
but the code does not show a pooling example.
I mean, if I want to fetch www.example.com/1.html, www.example.com/2.html, www.example.com/3.html, ...., www.example.com/1000.html
I'd have to open 1000 connections and close 1000 connections.
I think I could just open one 'keep-alive' connection, issue 1000 requests, and then close it.
That should be faster.
But I have no idea how to do that using url.openStream().
The URLFetch service operates at a higher level of abstraction than individual connections, and the native Python and Java libraries that use it are modified to use this service. As such, you have no direct control over connections - but you can expect that the underlying service will keep connections open when it deems it appropriate.
Unfortunately, as the docs for Java App Engine say, at this time "The Java API for the URL Fetch service only supports synchronous requests". The Python version of App Engine does support async requests, so, if porting to Python is just unthinkable, you may wait in the reasonable hope that such functionality will eventually be in the Java side too. After all, the Python version has been around for a year more, so of course it's more mature, stable, and function-rich.
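For illustration only, here is a sketch of issuing many fetches concurrently in Python. It is not the URL Fetch API: it uses the standard library's thread pool, and fetch is a hypothetical callable that takes a URL and returns its body, so any client can be plugged in. The page_urls helper simply builds the numbered URLs from the question.

```python
from concurrent.futures import ThreadPoolExecutor


def page_urls(first, last):
    """Build http://www.example.com/<n>.html for n in first..last inclusive."""
    return ["http://www.example.com/%d.html" % n for n in range(first, last + 1)]


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL with a small thread pool.

    Returns a dict mapping each URL to whatever fetch returned for it,
    in the same order as the input list.
    """
    urls = list(urls)  # allow any iterable, since we walk it twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = pool.map(fetch, urls)  # map preserves input order
        return dict(zip(urls, bodies))
```

A call such as fetch_all(page_urls(1, 1000), fetch) would then issue the requests a few at a time; whether connections are reused underneath is still up to the HTTP client or service behind fetch.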

Resources