Web Scraping with Google App Engine - google-app-engine

I am trying to scrape a website and republish the data as an RSS feed. How hard is this to set up with Google App Engine? What are the advantages and disadvantages of using GAE? Any recommendations and guidelines are greatly appreciated!

Google App Engine offers much more functionality (and complexity) than you will need if all you truly want to do is republish some structured data as RSS.
Personally, I would use something like Yahoo Pipes for a task like this.
That being said... if you want/need to get your feet wet with GAE, go for it!

Working with Google App Engine is pretty straightforward. I would recommend going through the Getting Started guide. It's short and simple and touches on the essential GAE topics. There are more pros and cons than I will list here.
Pros:
In general, App Engine is designed for high-traffic web applications that need to scale. Furthermore, it is designed from a programmer's perspective. Many of the scalability issues (database optimization, server administration, etc.) are handled by Google. Overall, I find it to be a nice platform. It is still being actively developed by Google engineers, and task scheduling (a long-requested feature) is on the current roadmap.
Cons:
Perhaps the biggest downside right now is, again, the lack of official scheduling support, along with the quota limits currently set for free accounts. However, you can't complain much if it's free. Currently it only supports Python as a programming language (although a new language [Java, I predict] is coming soon). Furthermore, Python 2.6 (and 3.0, for that matter) is not yet supported. In addition, Django 1.0 is not officially supported in App Engine (although you can package Django 1.0 with your application).

Harder than it would be in most other technologies.
GAE can sort of do scheduled batch stuff like this now, but it's really not intended for that type of thing. Pick pretty much any other language and platform for this particular task, and you'll make your life a lot easier.

I think BeautifulSoup could run on GAE, so all your scraping needs are handled :D
Also, GAE has a URL Fetch service (urlfetch) for retrieving pages. The only problem I think you might have is not having enough time to get the data (the 30-second request limit).
I am working on a similar project and I've decided that it's easier to prepare the data on another server and push it to GAE.
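As a rough illustration of the scrape-and-republish step discussed in these answers, here is a minimal sketch that turns the links found in an HTML page into an RSS 2.0 document. It uses only the modern Python 3 standard library rather than BeautifulSoup or anything GAE-specific, and it assumes the page has already been fetched (for example with urlfetch on GAE). The names LinkCollector, build_rss and html_to_rss are made up for this example, and treating every link as a feed item is an assumption about the target site.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_rss(title, site_url, links):
    """Builds an RSS 2.0 string with one <item> per (href, text) pair."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Scraped from " + site_url
    for href, text in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


def html_to_rss(html, title, site_url):
    """Parses already-fetched HTML and returns the feed as a string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return build_rss(title, site_url, parser.links)
```

Keeping the fetch separate from the parsing also fits the approach of preparing the data on another server: only the finished feed string needs to be pushed to GAE.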

You might also want to look into Yahoo! Query Language (YQL).

Related

Should I use GAE + Lift for my Scala based webapp?

Such questions have been asked before - but all of the answers are outdated now.
I am looking forward to working on a Scala-based webapp. I understand this question could be split into two, but I am posting them as one because they share the same context: both depend on the hosting platform and frameworks used.
I have read multiple (awesome) debates on Play! and Lift, but cannot find a good comparison between Play! 2.1 and Lift. How do I decide which one is better for my scenario (a social network website)?
Similarly, this discussion has some very good arguments about which platform to use if I go with Lift, but it's from 2010 and seems outdated. The recommended provider (stax.net) is dead (or I guess it has merged with cloudbees.com). I am personally inclined towards GAE, as it is quick to get started with, but I am unsure whether these issues still exist:
Support for actors (I am not sure if Akka helps us solve this problem)
Requests for a given session being served by different JVMs without notice to running app
Quoting David Pollak (lead author of Lift) :
GAE is slow and non-scalable, despite Google's claims (everyone I've
spoken with that have tried to scale GAE apps have failed and gone
elsewhere). GAE locks you into a tremendously suboptimal storage
mechanism. GAE is free, but so is Stax and there are many inexpensive
options including SliceHost. Next up, you've got Amazon EC2 and
RackSpace. So, I haven't found a good reason for anyone to use GAE.
And if there's no good reason to use GAE, devoting a pile of resources
to code around the GAE JVM incompatibilities (e.g., no new threads)
seems like a waste.
Another issue if I go with GAE is the lack of Play! 2.1 support; I still don't see a module for that. Another is the difficulty of migrating to other databases in the future (although I hear migrating to MongoDB should be relatively easy). The worst case would be to move off GAE and use AppScale.
Personally I use Lift, Cloudbees, and MongoLab as my first choice for most of my projects. I tried several cloud hosting services to no avail, Heroku and RedHat in particular. (I don't think I tried GAE, due to the post from David Pollak that you have already referenced.) To use Cloudbees, you just need an sbt plugin. Then it is as easy as running the cloudbees-deploy target. Within a minute, your code is up and running. I was floored by how easy it was. I went with Mongo primarily because of this excellent g8 template (note, there is now an SQL equivalent).
Another thing I really like about Cloudbees and MongoLab is they both have free services. It's great for me because I only work on these projects in my free time, so I don't want to spend any money while my ideas are half-baked.
As for Lift, I can't compare it much to Play. I downloaded and installed Play and was immediately turned off by how MVC it is. I felt that the view-first approach, albeit foreign to me, was a much more intuitive and powerful way to build web applications. I love how Lift doesn't obscure from me the fact that I am indeed developing a web application. I often feel that MVC frameworks try to keep all of the HTML/CSS/JS etc. at arm's length.
The question is quite open so I will share my experience and opinion regarding Scala web app development as it might help you with your decision.
I built my first Scala web application using Scalatra and Scalate, with Jetty as the server. The app is hosted on an Amazon EC2 instance and I've had no problems with this... it's been running since the end of 2011 with only one small blip that took 10 minutes to resolve. I found it a good experience for learning to use Scala in web applications.
http://www.scalatra.org/
Typesafe (http://typesafe.com) appears to have opted for the Play Framework, so for my next Scala-based web app I am likely to go with Play. A book I have been reading on the Play Framework is "Play for Scala". It was published this month (Oct 2013).
http://www.manning.com/hilton/
My impression is that Lift was the go-to framework in the past but that this has shifted to the Play Framework.

In retrospect, best open source stack/tools to build facebook

I have been doing high-performance scientific computing in C++ most of my life. I am trying to learn to develop AJAXy web applications. As an exercise, I would like to build something that has a subset of Facebook's functionality (profiles, posts with comment threads, friend lists) plus the ability to search any post/comment.
I have no experience developing this kind of app, except for a small amount of toying with Google App Engine using GWT+Java and a little bit of Python. What tools/stack would you suggest using for it? I understand that this is a very vague question, but I'd like to get a few opinions and your thought process about how you would go about it.
How does the choice change if you want a prototype as fast as possible, versus if you are trying to build something that can scale and last through a few cycles of feature requests?
To be more specific, I'm lost in questions like: should I consider Drupal? Should I consider Lucene for search? Would GWT get me what I want in the UI, or would Python+Django be faster to develop with? I should probably not overthink it and just pick something, but some perspective from others would be nice.
If you have started out with Python, that might be the easiest to get going with, especially since you have some experience with Google's App Engine already. However, if you have spent most of your time working with C++ ... did you know that C++ has at least two different full-stack web frameworks?
CppCMS
Wt (WebToolkit)
Remember, it's what you develop fastest in that makes the difference in the long run. What will slow you down most of all is dealing with what you dislike. So, if long compile times kill you, then try Python, Ruby, PHP, or some other dynamic language. If having code that is less than perfectly optimized (and slower than it could be) is what bothers you most, use C++, C#, or Java instead.
One disadvantage of Google App Engine is that there is no CMS like Drupal or Joomla for it, so you're going to have to write your own if you want some of that functionality. However, the advantages of Google App Engine outweigh that disadvantage: development and deployment are easier, you won't have to fiddle with phpMyAdmin or other ugly SQL interfaces, you leverage Google's huge infrastructure, and since it's cloud computing you only pay for what you use. If you want the option that you as a developer will be happiest with, then I recommend you choose Google App Engine.

Which web solution should I use for my project?

I'm going to create a fairly large (from my point of view anyway) web project with a friend. We will create a site with roads and other road related info.
Our estimate is that we will have around 100k items in our database. Each item will contain some information like location, name, etc. (about 30 things each). We are counting on having a few hundred thousand unique visitors per month.
The 100k items and their locations (that will be searchable) will be the main part of the page but we will also have some articles, comments, news and later on some more social functions (accounts, forums, picture uploads etc.).
We were going to use Google App Engine to develop our project since it is really scalable and free (at least for a while). But I'm actually starting to doubt that App Engine is right for us. It seems to be meant for web apps and not sites like ours.
Which system (language/framework etc.) would you guys recommend us to use? It doesn't really matter whether we already know the language (we like learning new stuff), but it would be good if it's something that is future-proof.
I think that GAE can do the job. Google claims that App Engine can handle about 5 million page views a month for free, and you only have to start paying if you exceed the free quota.
It's also pretty easy to get started. If you don't have experience administering websites and choose a regular hosting service, you will have to worry about several things that you can't even imagine now.
My only concern would be with respect to the kind of data and queries you will need, since GAE does not have a relational database. That said, there is an open source project for GAE called GeoModel that gives GAE the ability to do complex geospatial queries, like a proximity fetch. Have a look at their tutorial and the demo app.
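To make "proximity fetch" concrete, here is a minimal sketch of what such a query returns: the items closest to a point, within a maximum distance. This is not GeoModel's API or its indexing method; it is a brute-force scan using the haversine great-circle formula in plain Python, and the item layout (dicts with name, lat and lon keys) and the function names are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def proximity_fetch(items, lat, lon, max_km, limit=10):
    """Returns the names of up to `limit` items within `max_km`, nearest first."""
    hits = []
    for item in items:
        distance = haversine_km(lat, lon, item["lat"], item["lon"])
        if distance <= max_km:
            hits.append((distance, item["name"]))
    hits.sort()  # nearest first
    return [name for _, name in hits[:limit]]
```

A linear scan like this touches every item on every query, which is why a datastore without relational or spatial queries needs a helper library that precomputes an index instead.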
As for your impression that GAE is intended only for web apps, there are a couple of CMSs that run on it.
Good luck!
If one of your concerns is scalability, and you don't want to depend on expensive or commercial tools, I would recommend that you take a look at this tech stack:
Erlang - A programming language designed for concurrency and distribution.
Nitrogen - An Erlang web framework with a lot of cool stuff, like transparent AJAX.
NoSQL scalable databases, such as CouchDB or Riak - They save you the hassle of SQL code and are more scalable than plain MySQL. Both have native Erlang APIs.
To be honest, I don't know if this tool set is your cup of tea; these are not mainstream solutions. I just suggest them to everyone who asks about size-sensitive web applications.
All serious web frameworks will provide you with what you need. The real issues (for example scalability) might be tackled in a different way depending on what you use, but you won't be limited if you choose a well-known one. The choice of database system might be more important for that (SQL vs NoSQL), even if both of those will do fine too.
It's all about knowing how to use, and enjoying using, the tool(s) you've chosen.
In either case, name-dropping some suggestions:
Rails (Ruby)
Django (Python)
Nitrogen (Erlang)
ASP.NET MVC (C#)
And please note, if you really want to learn everything from the bottom up, you'd be fine with any of these (or one of the other gazillion out there). But if you want to perform your best, choose one that supports a language you know well or uses techniques/tools you have experience with. Think twice about how you weigh "this is fun and we learn a lot" against "we want to be productive and do a really good job".

How difficult is it to migrate away from Google App Engine?

I am thinking of making an (initially) small web application that would eventually have the potential to grow. All things considered, Google App Engine seems like a very attractive option. Say the user base and complexity grow and, for one reason or another, I need to leave GAE behind. How difficult would it be to migrate away?
1) Does GAE provide a way to export the database? What format would it be? Would it be difficult to put it under MySQL (or similar)?
2) In which areas (ex. database access, others?) would I have to use GAE API? I.e. which parts of implementation would have to be abstracted away / interfaced?
Edit: 3) Alternatively, is it even worth abstracting away the GAE API?
For question #1: I don't know if GAE specifically supports exports of a database but you can always roll your own, worst case scenario. If you are in a position where you need to, you'll probably have the resources to do it, too.
For question #2: You can and should always encapsulate those kinds of outside dependencies anyway. It doesn't matter whether or not they provide interfaces. Coupling to those interfaces should be kept to an absolute minimum.
For question #3: This question is not really super-clear so I cannot answer it.
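The encapsulation advice for question #2, together with the "roll your own" export from question #1, can be sketched as follows. This is a minimal illustration in plain Python, not GAE code: PostStore, InMemoryPostStore and export_json are hypothetical names, and a real application would add a second implementation of the same interface that wraps the GAE datastore calls.

```python
import json
from abc import ABC, abstractmethod


class PostStore(ABC):
    """Storage interface the rest of the application codes against."""

    @abstractmethod
    def put(self, post_id, data):
        """Stores a dict of fields under post_id."""

    @abstractmethod
    def get(self, post_id):
        """Returns the stored dict, or None if post_id is unknown."""

    @abstractmethod
    def all_items(self):
        """Returns every (post_id, data) pair, ordered by post_id."""


class InMemoryPostStore(PostStore):
    """Stand-in backend; a datastore-backed class would expose the same methods."""

    def __init__(self):
        self._rows = {}

    def put(self, post_id, data):
        self._rows[post_id] = dict(data)

    def get(self, post_id):
        return self._rows.get(post_id)

    def all_items(self):
        return sorted(self._rows.items(), key=lambda pair: pair[0])


def export_json(store):
    """Hand-rolled export: works with any backend that implements PostStore."""
    return json.dumps([{"id": key, **fields} for key, fields in store.all_items()])
```

Because the export only talks to the interface, the same function could dump data from the platform-specific backend and feed an import script for MySQL or any other target.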
I'm speaking strictly from a Java webapp point of view...
Google App Engine for Python has a backup/restore utility:
http://code.google.com/appengine/articles/gae_backup_and_restore.html
There is huge interest in porting this to the Java flavor.
You can use the higher-level standard database APIs (JDO/JPA) to make it easier to move your app away from Google's database services. I suggest purchasing the DataNucleus tools in order to smooth the transition from Bigtable to something like MySQL or Oracle.
The packaged services GAE provides are enumerated at
http://code.google.com/appengine/docs/java/javadoc/
The stock JRE should handle porting of the urlfetch, mail, and memcache api packages.
You'll have to find a substitute technology for the users, blobstore, xmpp, and taskqueue packages.

Could gSoap be integrated with the Google app engine?

We are using GAE to host our web services. As far as I know, GAE only supports Java and Python at the moment; however, most of our engineers here are more comfortable with C/C++, so I was wondering if there is a way to integrate gSoap with GAE at all. Thanks for your help!
Though I am not an expert at Google App Engine, it is unlikely you'd be able to use native C++ code in the app engine. Based on experience with an app engine like Tomcat, the purpose of application engines is to make your application run on a shared service in its own little sandbox so that it can't affect the other shared services. With C++, you can get a pointer to the beginning of the process memory and start writing zeros if you so desired. This doesn't turn out to work too well in a shared computing environment.
The app engine pages indicate Java and Python runtime environments are available. I've been using C++ for many years and am a big fan of gSoap, but I think these are tools best used in limited cases these days. Web services for Java aren't that much different from gSoap in terms of ramp-up time anyway.
I've used Axis2 for Java web services and it isn't that difficult to use. However, I think it suffers from being overly complex and under-documented. I have used WSO2 under PHP and was impressed with how easy it was to use. WSO2 is built on top of Axis and has a Java port too (though I have not used it). If your engineers want to dig in, WSO2 is probably going to be the easiest route.
Motivating them might be hard, but my take is that if they are real software engineers then they won't have a problem adapting.
This might be helpful too: http://code.google.com/appengine/docs/java/overview.html
The short answer is no. Google App Engine is very limited in what you can do (you can't even dynamically create new files).

Resources