Migrating from Google App Engine Mapreduce to Apache Beam - google-app-engine

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it, and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and does not tell you how to migrate.
In particular, my issues are these:
In Mapreduce, the jobs would run on your existing deployed application. In Beam, however, you have to create and deploy a custom Docker image to build the environment for Dataflow. Is this right?
To create a new job template in Mapreduce, you just need to edit a YAML file and deploy it. To create one in Apache Beam, you need to write custom runner code, deploy a template file to Google Cloud Storage, and link it up with the Docker image. Is this right?
Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?

In technical terms that's what is happening, but unless you have some specific advanced use cases, you won't need to set up any custom Docker images manually. Dataflow does some work in the background to package your user-written code and its dependencies into a container so that it can execute them on its VMs.
In Dataflow, writing a job template mainly requires writing some pipeline code in your chosen language (Java or Python), and possibly some metadata. Once your code is written, creating and staging the template itself isn't much different from running a normal Dataflow job. There's a page documenting the process.
I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why that is. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes. It's a different pipeline model and most likely will require some level of rewriting your code for the migration. A migration guide that fully covered everything would end up repeating most of the existing documentation.
Since it sounds like most of your questions are around setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but will teach you how to set up your environment to write and run pipelines. There are links in the quickstarts which direct you to Apache Beam tutorials that teach you the Beam API and how to write your own pipelines, and those will be useful for rewriting your Mapreduce code in Beam.
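To give a sense of what that pipeline code looks like, here is a minimal word-count-style sketch in Python; the project, region, and bucket names are placeholders, not values from your setup:

    # Minimal Beam pipeline sketch; project/region/bucket are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        options = PipelineOptions(
            runner="DataflowRunner",              # use "DirectRunner" to test locally
            project="my-gcp-project",             # placeholder project id
            region="us-central1",
            temp_location="gs://my-bucket/temp",  # placeholder bucket
        )
        with beam.Pipeline(options=options) as p:
            (p
             | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
             | "Split" >> beam.FlatMap(lambda line: line.split())
             | "PairWithOne" >> beam.Map(lambda word: (word, 1))
             | "CountPerWord" >> beam.CombinePerKey(sum)
             | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
             | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))

    if __name__ == "__main__":
        run()

If I remember the templates documentation correctly, staging something like this as a classic template is then mostly the same invocation with a template_location option pointing at a GCS path, which is covered on the page mentioned above.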

Related

Weird ruby process in App Engine Flexible instance

I am connecting via SSH to one of my App Engine Flex instances, which has a .NET Core application running on it, and I get this:
Where does that Ruby process (with 24% CPU usage) come from? Is it some internal Google service?
The running Ruby process is /usr/sbin/google-fluentd. This package contains the logging agent that Stackdriver Logging is based on, and it is written in Ruby (distributed as a Ruby gem), as explained in this document. All in all, the Ruby process is using CPU because of your application's logging.
As an aside, I noticed that the screenshot you uploaded contains your account ID and project ID. I strongly suggest you re-upload the picture without this information, for security and privacy reasons.

Deploying AngularJS + Sinatra to AWS

I have an AngularJS site consuming an API written in Sinatra.
I'm simply trying to deploy these 2 components together on an AWS EC2 instance.
How would one go about doing that? What tools do you recommend? What structure do you think is most suitable?
Cheers
This is based on my experience using the HashiCorp line of tools.
Manual: Launch an Ubuntu image, gem install sinatra, and deploy your code. Take a snapshot for safekeeping. This one-off approach is good for a development box to iron out the configuration process. Write down the commands you run and any options you may need.
Automated: Use the Packer EC2 Builder and Shell Provisioner to automate your commands from the previous manual approach. This will give you a configured AMI that can be launched.
You can apply different methods of getting to an AMI using different toolsets. However, in the end, you want a single immutable image that can be deployed repeatedly.

How to launch app from the web/cloud

I have developed an app in Twilio which I would like to run from the cloud. I tried learning about AWS and Google App Engine but am quite confused at this stage:
I have 2 questions which I hope to get your help on:
1) How can I store my scripts and database in the cloud? Right now, everything is running out of my local machine but I would like to transfer the scripts and db to another server and run my app at a predetermined time of day. What would be the best way to do this?
2) How can I write a batch file to run my app at a predetermined time of day in the cloud?
I understand this does not have code, but I really hope someone can point me in the right direction. I have spent lots of time trying to understand this myself but am still unsure. Thanks in advance.
Update: The application is a Twilio app that makes calls to people; the script simply applies an algorithm to make the calls in a certain fashion, and the database is a MySQL DB that provides the details of the people to be called.
It is quite difficult to provide an exact answer without understanding what the application is, what the DB is, or what the script that you wish to run does.
I can give you a couple of ideas that might be helpful in such cases.
OpsWorks (http://aws.amazon.com/opsworks/) is a managed service for deploying and managing applications. You can define your stack (multiple layers like web, workers, DB...) and the Chef recipes that should run at various points in the life of the instances in each layer (startup, shutdown, app deployment or stack modification...). You can then use the ability to add instances to each layer on specific days and at specific hours to implement the functionality of running at predetermined times, as you requested.
In such a solution you can either keep some of your instances (like the DB) always on, or even bootstrap them with the Chef recipes every day, restoring from a snapshot on start and creating a snapshot on shutdown.
Another AWS service that you can use is Data Pipeline (http://aws.amazon.com/datapipeline/). It is designed to move data periodically between data sources, for example from a MySQL database to Amazon Redshift, the data warehouse service. But you can also use it to trigger arbitrary shell scripts (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html) and schedule them to run under various conditions, such as every hour/day or at specific times (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html).
A simple path here would be just to create an EC2 instance in AWS and put the components needed to run your app on it. A thorough walkthrough is here:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html
Essentially you will create an EC2 virtual machine, which you can for most purposes treat just like any other Linux server. You can install MySQL on it, copy your script there, and run it. Of course whatever container or support libraries your code requires will need to be installed as well.
You don't say what OS you are using locally, but if it is Mac or Linux, you should be able to follow almost the same process to get your script running on an EC2 instance that you used on your local machine.
As you get to know AWS, there are sophisticated services you can use for deployment, infrastructure orchestration, database services, and so on. But just to get started running a script from a virtual machine should be pretty straightforward.
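Just to make the moving parts concrete, a script like the one you describe (read people from MySQL, dial them with Twilio) might look roughly like this; the table, column, credential, and phone-number values are all placeholders for illustration:

    # Hypothetical sketch: dial every number stored in a MySQL table via Twilio.
    # Credentials, table/column names, numbers, and URLs are placeholders.
    import pymysql
    from twilio.rest import Client

    client = Client("ACCOUNT_SID", "AUTH_TOKEN")       # placeholder Twilio credentials

    conn = pymysql.connect(host="localhost", user="app",
                           password="secret", db="callers")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name, phone FROM people_to_call")
            for name, phone in cur.fetchall():
                # The TwiML served at this URL controls what the callee hears.
                client.calls.create(
                    to=phone,
                    from_="+15550100000",              # placeholder Twilio number
                    url="https://example.com/twiml/greeting",
                )
    finally:
        conn.close()

Once that runs on the instance, the "predetermined time of day" part can be as simple as a cron entry on the EC2 box, e.g. 0 9 * * * python /home/ubuntu/dialer.py (the path is a placeholder).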
I recently developed a Twilio application using Ruby on Rails for the backend and found Heroku extremely simple to set up and launch. While Heroku does cost more than AWS, I found that the time I saved using Heroku more than made up for this. As an early-stage startup, we wanted to spend our time developing important features, not "wasting" time optimizing our AWS cloud.
However, while I believe Heroku is ideal for early-stage websites/startups, I do believe hosting should be re-evaluated once a company reaches a certain size. At some point it becomes economically viable to devote resources to optimizing an AWS cloud solution, because it will be cheaper than Heroku in the long run.

Should I be using Gradle for continuous deployment?

Does anyone have past experience with Gradle? I'm thinking of using it for continuous deployment... I'm considering either using my own scripts (Python) or Gradle.
Can anyone say from experience which way they would recommend going? Note that I already use Maven and I don't intend to move away from it for my dependency management and project management.
Thanks
We have implemented Gradle-based deployment and environment management in a big governmental project (100+ servers). But we had to develop a custom set of plugins (which is actually a rather straightforward process in Gradle) to handle tasks like remote SSH command execution through a Groovy DSL, creation of application server domains/clusters (we are using WebLogic), and application/configuration deployment.
We are also thinking of integrating Gradle with Puppet for easier Linux administration.
If you are coming from the Java world, then using Gradle (which is Groovy-based) will be rather simple for you, because you can reuse your Java/Ant/Maven/Groovy knowledge to write scripts. Also, the ability to create DSLs in Groovy may allow you to build interesting abstractions. Gradle has a very clean API which allows building nice dependencies between tasks. It also integrates very well with the Maven infrastructure, and you can reuse all Ant tasks.
Yes, Gradle-based deployment is possible with the gradle-ssh-plugin.
Here is an article with a good usage example.
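For comparison, if you went the "own scripts (Python)" route from the question instead, the remote-SSH step that the plugin handles is something you would write yourself, for example with paramiko; this is only a sketch, and the host, user, key path, and command are placeholders:

    # Hypothetical Python alternative: run a remote deployment command over SSH.
    # Host, username, key path, and the command are placeholders.
    import paramiko

    def deploy(host, command):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username="deploy", key_filename="/home/me/.ssh/id_rsa")
        try:
            stdin, stdout, stderr = client.exec_command(command)
            print(stdout.read().decode())
            print(stderr.read().decode())
        finally:
            client.close()

    deploy("app01.example.com", "sudo systemctl restart my-app")

Either way works; the trade-off is mostly whether you prefer maintaining such scripts yourself or expressing the same steps as Gradle tasks with their dependency graph.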

Is there a good utility / 3rd party library to manage the AppEngine datastore?

I have been developing an app using App Engine. We are likely to be storing a lot of records in the datastore, but I find the admin functionality you are given to manage this data lacking.
As an example, there are no good ways to bulk delete a bunch of data - you have to write a class of your own to do this.
Before I start down the path of building the admin ui and features I need to manage the datastore entities, I was wondering if anyone knows of a good 3rd party tool that's already been written to do this for me? Something that has basic CRUD functionality plus bulk import and bulk export features.
I am using the Python SDK.
You haven't specified whether you're using the Java or Python SDK, but if you're using Java App Engine, I suggest using the Objectify framework to interact with the datastore rather than the standard JDO/JPA method. It's much nicer.
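On the bulk-delete point specifically, since you mention the Python SDK: the hand-rolled helper usually comes down to a keys-only query plus ndb.delete_multi. A rough sketch (the model name is a placeholder):

    # Hedged sketch for the Python (ndb) SDK: delete all entities of one kind
    # in batches. "LogEntry" is a placeholder model name.
    from google.appengine.ext import ndb

    class LogEntry(ndb.Model):
        message = ndb.StringProperty()

    def bulk_delete(model=LogEntry, batch_size=500):
        """Delete every entity of `model` using keys-only batches."""
        while True:
            keys = model.query().fetch(batch_size, keys_only=True)
            if not keys:
                break
            ndb.delete_multi(keys)

It isn't the admin UI you're after, but it is the kind of small helper the question alludes to having to write yourself.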
