State of Map-Reduce on Appengine? - google-app-engine

There is appengine-mapreduce, which seems to be the official way to do things on App Engine. But there seems to be no documentation besides some hacked-together wiki pages and lengthy videos. There are statements that the lib only supports the map step, yet the source indicates that there are also implementations for shuffle.
A version of this appengine-mapreduce library also seems to be included in the SDK, but it is not blessed for public use. So you are basically expected to load the library twice into your runtime.
Then there is appengine-pipeline: "A primary use-case of the API is connecting together various App Engine MapReduces into a computational pipeline." But there also seems to be pipeline-related code in the appengine-mapreduce library.
So where do I start to find out how this all fits together? Which is the library to call from my project? Is there any decent documentation on appengine-mapreduce besides parsing change logs?

Which is the library to call from my project?
They serve different purposes, and you've provided no details about what you're attempting to do.
The most fundamental layer here is the task queue, which lets you schedule background work that can be highly parallelized. This is fan-out. Let's say you had a list of 1000 websites, and you wanted to check the response time for each one and send an email for any site that takes more than 5 seconds to load. By running these as concurrent tasks, you can complete the work much faster than if you checked all 1000 sites in sequence.
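As a minimal fan-out sketch on App Engine Python (the /check_site worker URL is a hypothetical handler you'd register yourself in app.yaml):

from google.appengine.api import taskqueue

def fan_out(site_urls):
    # one task per site; the task queue runs them concurrently
    for site in site_urls:
        taskqueue.add(url='/check_site', params={'site': site})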
Now let's say you don't want to send an email for every slow site, you just want to check all 1000 sites and send one summary email that says how many took more than 5 seconds and how many took fewer. This is fan-in. It's trickier with the task queue, because you need to know when all tasks have completed, and you need to collect and summarize their results.
Enter the Pipeline API. The Pipeline API abstracts the task queue to make fan-in easier. You write what looks like synchronous, procedural code, but it uses Python futures and is executed (as much as possible) in parallel. The Pipeline API keeps track of task dependencies and collects results to facilitate building distributed workflows.
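For instance, the fan-in example above might look like this with the Pipeline API. This is a rough sketch: CheckSite, Summarize, measure_response_time and send_summary_email are my names, not the library's.

import pipeline

class CheckSite(pipeline.Pipeline):
    def run(self, site):
        return measure_response_time(site)  # hypothetical helper

class Summarize(pipeline.Pipeline):
    def run(self, *timings):
        slow = sum(1 for t in timings if t > 5.0)
        send_summary_email(slow, len(timings) - slow)  # hypothetical helper

class CheckAllSites(pipeline.Pipeline):
    def run(self, sites):
        # each yield returns a future; the library tracks the dependencies
        # and only runs Summarize once every CheckSite has finished
        timings = []
        for site in sites:
            timings.append((yield CheckSite(site)))
        yield Summarize(*timings)

# kick it off from a request handler (list_of_sites is assumed):
# CheckAllSites(list_of_sites).start()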
The MapReduce API wraps the Pipeline API to facilitate a specific type of distributed workflow: mapping the results of a piece of work into a set of key/value pairs, and reducing multiple sets of results to one by combining their values.
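To give a flavor of the API, here is a word-count sketch in the style of the library's demo app. The reader/writer names are assumptions about where your data lives (blob_key is assumed to hold the input), and note that BlobstoreLineInputReader hands the mapper (offset, line) tuples:

from mapreduce import mapreduce_pipeline

def word_count_map(data):
    offset, line = data  # BlobstoreLineInputReader yields (offset, line)
    for word in line.split():
        yield (word, '')

def word_count_reduce(key, values):
    yield '%s: %d\n' % (key, len(values))

job = mapreduce_pipeline.MapreducePipeline(
    'word_count',
    'main.word_count_map',    # assumes these functions live in main.py
    'main.word_count_reduce',
    'mapreduce.input_readers.BlobstoreLineInputReader',
    'mapreduce.output_writers.BlobstoreOutputWriter',
    mapper_params={'blob_keys': blob_key},   # blob_key is assumed
    reducer_params={'mime_type': 'text/plain'},
    shards=16)
job.start()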
So they provide increasing layers of abstraction and convenience around a common system of distributed task execution. The right solution depends on what you're trying to accomplish.

There is official documentation here: https://developers.google.com/appengine/docs/java/dataprocessing/

Related

Sharing data to client whenever API has results available?

I have the following scenario; can anyone advise on the best approach?
Front End —> Rest API —> SOAP API (Legacy applications)
The legacy applications behave unpredictably: sometimes they're very slow, sometimes fast.
The following is what needs to be achieved:
- As and when data is available to Rest API, the results should be made available to the client
- Whatever info is available, show the intermediate results.
Can anyone share insights in how to design this system?
You have several options to do that:
- Polling from the UI - will require some changes to the API: the initial call returns a URL where results will be available, and the UI checks it periodically.
- WebSockets - will require changing the API.
- Server-sent events - essentially keeping the HTTP connection open and pushing new results as they become available; this sounds the closest to what you want (a minimal sketch follows).
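Here's what the server-sent events option might look like. A minimal sketch assuming a Flask front end, where fetch_next_result() is a hypothetical generator that yields partial results as the legacy SOAP calls come back:

import json
from flask import Flask, Response

app = Flask(__name__)

@app.route('/results/<job_id>')
def stream_results(job_id):
    def generate():
        # each SSE message is a "data: ...\n\n" frame; the browser's
        # EventSource API fires one event per frame
        for partial in fetch_next_result(job_id):  # hypothetical helper
            yield 'data: %s\n\n' % json.dumps(partial)
    return Response(generate(), mimetype='text/event-stream')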
You want some sort of event-based API that the API consumers can subscribe to.
Event-driven architectures come in many forms: from event notification ('hey, I have new data, come and get it') to message/payload delivery, and from full-on publish/subscribe solutions that allow consumers to subscribe to one or more "topics", with event back-up and replay functionality, to relatively basic ones.
If you don't want a full-on eventing platform, you could look at WebHooks.
A great way to get started is to familiarize yourself with some event-based architecture patterns. Chris Richardson's website has a lot of great info on such architectures and is well worth a look.
In terms of defining the event API, if you're familiar with OpenAPI, there's AsyncAPI, which is the async equivalent.
In terms of solutions, there are a few well-known platforms, including open source ones. The big cloud providers (Azure, GCP and AWS) also have async/event-based services you can use.
For more background there's the Wikipedia page on event-driven architecture (which I have not read, so I can't speak for its quality, but it does look detailed).
Update: Webhooks
Webhooks are a bit like an iceberg: there's more to them than might appear at first glance. A full-on eventing solution will have a very steep learning curve, but it will solve problems that you'll otherwise have to address separately (writing your own code, etc.). Two big areas to think about:
Consumer management. How will you onboard new consumers? Is it a small handful of internal systems / URLs that you can manage through some basic config, manually? Or is it external facing for public third parties? If it's the latter, will you need to provide auto-provisioning through a secure developer portal or get them to email/submit details for manual set-up at your end?
Error handling & missed events. Let's say you have an event and you call the subscribing webhook, but there's no response (or an error). What do you do? Do you retry? If so, how often and for how long? Once the consumer is back up, what do you do: did you save the old events to replay? How many events? How do you track who has received what? (A toy sketch of the retry side follows.)
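To make the retry question concrete, here's a toy sketch using the requests library, exponential backoff, and a cap on attempts; a real system would also need to persist undelivered events so they survive a restart:

import time
import requests

def deliver(subscriber_url, event, attempts=5):
    delay = 1
    for _ in range(attempts):
        try:
            resp = requests.post(subscriber_url, json=event, timeout=5)
            if resp.status_code < 300:
                return True
        except requests.RequestException:
            pass  # treat network errors like any other failed delivery
        time.sleep(delay)
        delay *= 2  # back off: 1s, 2s, 4s, ...
    return False  # give up; park the event somewhere for later replay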
Polling
@Arnon is right to mention polling as an approach, but I'd only do it if you have no other choice, or if you have a very small number of internal systems doing the polling (i.e. it incurs low load) and you control both "ends" of the polling; in such a scenario it's a valid approach.
But if it's for an external API, you'll need to implement throttling to protect your systems, as you'll have limited control over who's calling you and how often. Caching is another obvious topic to explore in a polling approach.
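By throttling I mean something like a per-caller token bucket; a bare-bones sketch (the class and its names are mine, not any particular library's):

import time

class TokenBucket(object):
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.time()

    def allow(self):
        now = time.time()
        # refill in proportion to the time elapsed since the last check
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # let this request through
        return False      # reject or delay this request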

Logger/data store recommendation

I am looking for a recommendation for the following scenario: at a high level, we have a service consisting of a front-end web app that serves API and web UI requests (the latter are less important), decomposes them, and puts them as tasks in a queue for processing, plus a number of worker services consuming the tasks from the queue and processing them. The API clients poll for results asynchronously.
We need to be able to log pieces of information along the way (starting from the originating request, through intermediate outputs, to final results) so that they can be accessed later if needed (mostly to troubleshoot what went wrong for a given request).
Ultimately, what we need is:
- To be used as secure storage for information related to logging and short-term auditing.
- Low-overhead insertion:
  - (Low) constant-time insertion, either truly non-blocking or effectively non-blocking (guaranteed quick).
  - Very frequent insertion – think multiple inserts per one CF API call.
- Retrieval used significantly less frequently; can be slow-ish.
- Items need to be retrievable at least by ID, but:
  - Payloads are effectively text or binary.
  - Full-text search capability would be a plus.
  - Understanding the structure of the text, e.g. being able to query JSON elements, is a mild nice-to-have.
- Data retention policies either built in or easy to implement.
"Secure" means we're processing personal information in several countries, usual regulations/ standards apply.
This can be software (open source, usable in commercial environment) that we'd host ourselves or an Amazon AWS service.
Check out, as a base for your app, Sherlock on SourceForge.net. It's an open-source Log4J implementation that you can modify as you like (e.g. containerize the headless Tomcat server). It's a "Chain of Custody" (C2) compliant rsyslog-replacement server that collects syslog and syslog-relay data: it first stores the logs as flat files per source, then post-processes and dumps the log data into a MySQL db. On top of that there is an older web client with some regex support to search/filter the data, so you can get at the log data for forensics.
The guys that put this together with me came from Platespin (later sold to Novell). The team that built this code actually sold a derivative work for decent cash right around the time they built it, and then went on to work for Tibco (later Mulesoft) and RIM (Blackberry, and now BMO), so it's solid code.
Here is the link:
https://sourceforge.net/projects/sherlock/

Simulating multiple Policy Decision Points (PDPs) in distributed environment

Let's take a scenario where subjects will be requesting access to many objects per second. A heavy load on a single PDP would mean an increase in wait and read/write times per request.
So far I have used the AuthzForce Core project to setup a single PDP for which I have a for loop sending multiple requests (this can be done simultaneously using threads). However, this does not seem like a suitable setup for evaluating my policies in a distributed environment.
Is there any way that it can be done? Perhaps using AuthzForce Server?
Edit:
I am running a Java application which uses Authzforce Core. The program creates an instance of a PDP which loads a single policy document, and then a for loop executes multiple requests. This is all done locally within the program itself.
It is difficult to help improve the performance here without looking at the code or the architecture, but I can give a few general tips (some of them may be obvious to you, but just to be thorough):
Since the PDP is embedded in your Java app, I assume (or make sure you do) that you are using AuthzForce's native Java API (example on the README), which is the most efficient way to evaluate requests.
I also assume you are (re-)using the same PDP (BasePdpEngine) instance throughout the lifetime of your application. It should be thread-safe.
In order to evaluate multiple requests at once, you may try the PDP engine's evaluate(List) method (javadoc) instead of the usual evaluate(DecisionRequest); it is faster in some cases.
If by "distributed environment", you mean that you may have multiple instances of your Java app deployed in different places, therefore multiple PDPs, the right setup(s) depend on where/how you load the policy document: local file, remote db, etc. See my last comment. As mentioned in Rafael Sisto's answer, you can reuse some guidelines from the High Availability section of AuthzForce Server installation guide there.
AuthzForce Server has an option for high availability:
https://github.com/authzforce/fiware/blob/master/doc/InstallationAndAdministrationGuide.rst#high-availability
You could follow the same guidelines to implement this using your single PDP.

What would be a good pipeline to create a scalable, distributed web crawler and scraper?

I would like to build a semi-general crawler and scraper for pharmacy product webpages.
I know that most websites are not equal, but most of the URLs I have in my list follow one specific type of logic:
For example, by using Microdata, JSON-ld, etc. I can already scrape a certain group of webpages.
By using XPath stored in configuration files I can crawl and scrape some other websites.
And other methods work well for the rest of the websites; if I can extract the information I need from 80% of the data, I'll be more than happy with the result.
In essence, I am worried about building a good pipeline to address issues related to monitoring (to handle webpages that suddenly change their structure), scalability, and performance.
I have thought of the following pipeline (not taking into account storage):
Create two main spiders. One crawls the websites given their domains: it gets all the URLs inside a webpage (obeying robots.txt, of course) and puts them into a queue system that stores the URLs that are scrape-ready. The second spider then picks up the last URL in the queue and extracts the data using metadata, XPath, or any other method. The output is put into another queue, which is eventually handled by a module that writes all the data in the queue to a database (which I still do not know whether it should be SQL or NoSQL).
The advantage of this system is that by putting queues between the main processes of extraction and storage, parallelization and scalability become feasible.
Is there anything flawed in my logic? What are the things that I am missing?
Thank you so much.
First off, that approach will work; my team and I have built numerous crawlers based on that structure, and they are efficient.
That said, if you're looking to scale, I would recommend a slightly different approach. For my own large-scale crawler, I have a 3-program approach.
- One program handles scheduling: which URLs to download.
- One program performs the actual downloading.
- One program extracts the information from the downloaded pages and adds new links for the program that handles the scheduling.
The other major recommendation is that if you're using cURL at all, you'll want to use the cURL multi interface and a FIFO queue to handle sending the data from the scheduler to the downloader.
The advantage of this approach is that it separates out the processing from the downloading. This allows you to scale up your crawler by adding new servers and operating in parallel.
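In sketch form (single-machine, with in-process queues standing in for whatever distributed queue you'd actually deploy; store, extract_links and extract_data are placeholders for your own persistence and scraping logic):

import queue
import threading
import urllib.request

download_q = queue.Queue()  # URLs scheduled for download
parse_q = queue.Queue()     # fetched pages waiting for extraction
seen = set()                # naive dedup; a real crawler persists this

def scheduler(start_urls):
    for url in start_urls:
        seen.add(url)
        download_q.put(url)

def downloader():
    while True:
        url = download_q.get()
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
            parse_q.put((url, html))
        except Exception:
            pass  # a real system logs this and maybe retries
        download_q.task_done()

def parser():
    while True:
        url, html = parse_q.get()
        store(extract_data(url, html))         # placeholder persistence
        for link in extract_links(url, html):  # feed new links back to scheduling
            if link not in seen:
                seen.add(link)
                download_q.put(link)
        parse_q.task_done()

threading.Thread(target=downloader, daemon=True).start()
threading.Thread(target=parser, daemon=True).start()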
At Potent Pages, this is the architecture we use for our site spider, which handles downloading hundreds of sites simultaneously. We use MySQL for saving the data (links, etc.), but as you scale up you'll need to do a lot of optimization. Also, phpMyAdmin starts to break down if you have a lot of databases, but having one database per site really speeds up the parsing process, since you don't have to go through millions of rows of data.

Parallel calls to google.appengine.api.channel.send_message

I am using send_message(client_id, message) in google.appengine.api.channel to fan out messages. The most common use case is two users. A typical trace looks like the following:
The two calls to send_message are independent. Can I perform them in parallel to save latency?
Well there's no async api available, so you might need to implement a custom solution.
Have you already tried native threading? It could work in theory, but because of the GIL it only helps if the xmpp api actually blocks on I/O, which I'm not sure it does.
A custom implementation will invariably come with some overhead, so it might not be the best idea for your simple case, unless it breaks the experience for the >2 user cases.
There is, however, another problem that might make it worth your while: what happens if the instance crashes and only got to send the first message? The api isn't transactional, so you should have some kind of safeguard. Maybe a simple recovery mode will suffice, given how infrequently this will happen, but I'm willing to bet a transactional message channel sounds more appealing, right?
Two ways you could go about it, off the top of my head:
Push a task for every message (see the sketch after this list): tasks are transactional, guaranteed to run, and will execute in parallel with fairly uniform run times. It'll increase the time it takes for the first message to go out, but will keep it consistent across all of them.
Use a service built for this exact use case, like Firebase (though it might even be too powerful, lol). In my experience the channel api is not very consistent and its performance is underwhelming for gaming, so this might make your system even better.
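A minimal sketch of the first option, assuming a hypothetical /send_one worker whose handler reads the params and calls channel.send_message:

from google.appengine.api import taskqueue

def broadcast(client_ids, message):
    for client_id in client_ids:
        # each task is retried independently until it succeeds
        taskqueue.add(url='/send_one',
                      params={'client_id': client_id, 'message': message})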
Fixed that for you
I just posted a patch on googleappengine issue 9157, adding:
channel.send_message_async for asynchronously sending a message to a recipient.
channel.send_message_multi_async for asynchronously broadcasting a single message to multiple recipients.
Some helper methods to make those possible.
Until the patch is pushed into the SDK, you'll need to include the channel_async.py file (that's attached on that thread).
Usage
import channel_async as channel
# this is synchronous D:
channel.send_message(<client-id>, <message>)
# this is asynchronous :D
channel.send_message_async(<client-id>, <message>)
# this is good for broadcasting a single message to multiple recipients
channel.send_message_multi_async([<client-id>, <client-id>], <message>)
# or
channel.send_message_multi_async(<list-of-client-ids>, <message>)
Benefits
Speed comparison on production:
Synchronous model: 2 - 80 ms per recipient (and blocking -.-)
Asynchronous model: 0.15 - 0.25 ms per recipient
