Priorities of Pods in Google Container Cluster

Is it possible to schedule upcoming Pods/containers based on their priorities?
(For example, if container1 is critical and needs resources, can the Google orchestrator kill other low-priority containers?)
If yes, are there specific priority tags (like: critical, monitoring, production, ...)?

This use case is described in both the Borg paper and the Omega paper. However, it is not presently implemented within Kubernetes. Here are some related links to ongoing proposals:
QoS Tiers
Preemption Policy / Scheme
Resource Quality of Service
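For context, the "Resource Quality of Service" proposal builds on the per-container resource requests and limits that already exist. Below is a minimal sketch using the Kubernetes Python client; the names and values are illustrative, and the priority/preemption behaviour asked about is not part of it - only the requests/limits that the proposed QoS tiers would be derived from:

from kubernetes import client, config

config.load_kube_config()

# Under the Resource QoS proposal, a container whose requests equal its
# limits would land in the highest ("guaranteed") tier.
container = client.V1Container(
    name="container1",                       # illustrative name
    image="gcr.io/my-project/critical-app",  # illustrative image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "256Mi"},
        limits={"cpu": "500m", "memory": "256Mi"},
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="critical-pod"),
    spec=client.V1PodSpec(containers=[container]),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)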

Related

Message queue with conditional processing and consensus between workers

I'm implementing a workflow engine, where a job request is received first and executed later by a pool of workers. This sounds like a typical message queue use case.
However, there are some restrictions on parallel processing. For example, it's not allowed to run concurrent jobs for the same customer. In other words, there must be some sort of consensus between workers.
I'm currently using a database table with business identifiers, status flags, row locking and conditional queries to store and poll available jobs according to the spec. It works, but using a database for asynchronous processing feels counterintuitive. Do messaging systems support my requirement of conditional processing?
As the author of a few workflow engines, I believe that a persistence component for maintaining state is essential. I cannot imagine a workflow engine that only uses queues.
Unless you are doing it just for fun, implementing your own is a questionable idea. A fully featured workflow engine is an extremely complex piece of software, comparable to a database, so I would recommend looking into existing ones rather than building your own if it is for production use. You could start with my open source project temporal.io :). It is used by thousands of companies for mission-critical applications and can scale to almost any rate given enough DB capacity.
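For reference, the kind of conditional, row-locking poll described in the question might look roughly like this against PostgreSQL (a simplified sketch; the table and column names are hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=workflow")  # hypothetical connection string

def claim_next_job(conn):
    """Atomically claim one pending job whose customer has no running job."""
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE jobs SET status = 'running'
            WHERE id = (
                SELECT id FROM jobs
                WHERE status = 'pending'
                  AND customer_id NOT IN (
                      SELECT customer_id FROM jobs WHERE status = 'running')
                ORDER BY created_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED)
            RETURNING id, customer_id, payload
        """)
        # Simplified: a production version needs extra guards so two pending
        # jobs for the same customer cannot be claimed concurrently.
        return cur.fetchone()  # None when no eligible job is available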

GAE, am I still required to implement a load balancer?

We have a production web-application deployment on GAE (LAMP stack) with the autoscaled setting, and according to the documentation, Google will automatically spin up additional instances to meet demand. This seemed to be borne out when we went live hours before a season finale aired that was guaranteed to drive traffic to our site, and the site did NOT fall over even with the sizable influx - so kudos to Google! However, I'd be naive to think that this server architecture is done, knowing that we're still in our infancy and could get 10-100x more traffic on a consistent basis in the near future as we gain popularity and move into the global market. So my question is:
Should I be implementing a Load Balancer in GCP or will GAE be able to scale "indefinitely" to accommodate?
Based on this answer: AppEngine load balancing across multiple regions, you'll need to implement a load balancer if you're targeting multiple regions.
Otherwise, it will depend on your configuration and the thresholds you've set in your GAE config.
According to https://cloud.google.com/appengine/docs/standard/go/how-instances-are-managed, there are three ways you can define scaling on your AppEngine instance:
Automatic scaling
Automatic scaling creates dynamic instances based on request rate, response latencies, and other application metrics. However, if you specify a number of minimum idle instances, that specified number of instances run as resident instances while any additional instances are dynamic.
Basic scaling
Basic scaling creates dynamic instances when your application receives requests. Each instance will be shut down when the app becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.
Manual scaling
Manual scaling uses resident instances that continuously run the specified number of instances regardless of the load level. This allows tasks such as complex initializations and applications that rely on the state of the memory over time.
So the answer is: it depends. You'll need to base your scaling strategy on how your load distribution looks. I would expect automatic scaling to be fine for 90% of early-stage websites, though that's just my impression.
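As a rough illustration, the automatic-scaling thresholds mentioned above live in app.yaml; a minimal excerpt might look like this (the values are illustrative, not recommendations):

automatic_scaling:
  min_idle_instances: 2          # resident instances kept warm for traffic spikes
  max_idle_instances: automatic
  min_pending_latency: 30ms      # lower values add instances more aggressively
  max_pending_latency: automatic
  max_concurrent_requests: 50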

Limit on the rate of inferences one can make for a SageMaker endpoint

Is there a limit on the rate of inferences one can make for a SageMaker endpoint?
Is it determined somehow by the instance type behind the endpoint or the number of instances?
I tried looking for this info under AWS Service Quotas for SageMaker but couldn't find it.
I am invoking the endpoint from a Spark job and wondered whether the number of concurrent tasks is a factor I should be taking care of when running inference (assuming each task runs one inference at a time).
Here's the throttling error I got:
com.amazonaws.services.sagemakerruntime.model.AmazonSageMakerRuntimeException: null (Service: AmazonSageMakerRuntime; Status Code: 400; Error Code: ThrottlingException; Request ID: b515121b-f3d5-4057-a8a4-6716f0708980)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.doInvoke(AmazonSageMakerRuntimeClient.java:236)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invoke(AmazonSageMakerRuntimeClient.java:212)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.executeInvokeEndpoint(AmazonSageMakerRuntimeClient.java:176)
at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invokeEndpoint(AmazonSageMakerRuntimeClient.java:151)
at lineefd06a2d143b4016906a6138a6ffec15194.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$a5cddfc4633c5dd8aa603ddc4f9aad5$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Predictor.predict(command-2334973:41)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2000)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:533)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:539)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Amazon SageMaker offers a model hosting service (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which gives you a lot of flexibility based on your inference requirements.
As you noted, first you can choose the instance type to use for your model hosting. The large set of options makes it possible to tune the hosting to your models. You can host the model on GPU-based machines (P2/P3/P4) or CPU ones. You can have instances with a faster CPU (C4, for example) or more RAM (R4, for example). You can also choose instances with more cores (16xl, for example) or fewer (medium, for example). Here is the full range of instance types you can choose from: https://aws.amazon.com/sagemaker/pricing/instance-types/. It is important to balance performance and cost. The instance type, together with the type and size of your model, will determine the invocations-per-second you can expect from your model in this single-node configuration. It is important to measure this number to avoid hitting the throttling errors that you saw.
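Until the right capacity is in place, the client can also back off instead of failing. A hedged sketch with boto3 (the endpoint name and payload are hypothetical; the retry settings are illustrative):

import boto3
from botocore.config import Config

# Let the SDK retry throttling errors with adaptive client-side backoff.
runtime = boto3.client(
    "sagemaker-runtime",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",   # hypothetical endpoint name
    ContentType="text/csv",
    Body="5.1,3.5,1.4,0.2",       # hypothetical payload
)
print(response["Body"].read())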
The second important feature of SageMaker hosting is the ability to auto-scale your model to multiple instances. You can configure the endpoint to automatically add and remove instances based on its load. AWS places a load balancer in front of the instances hosting your model and distributes the requests among them. Autoscaling allows you to keep a smaller instance for low-traffic hours, scale up during peak hours, and still keep your costs low and your throttling errors to a minimum. See here for documentation on the SageMaker autoscaling options: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
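A hedged sketch of wiring that up with the Application Auto Scaling API via boto3 (the endpoint and variant names are hypothetical; the target value has to be measured against what one instance of your model can sustain):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Resource ID format: endpoint/<endpoint-name>/variant/<variant-name>
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # hypothetical names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Average invocations per instance per minute to aim for; illustrative.
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)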

Time limit for "background thread" in Google App Engine

In GAE, web requests are limited to 30 seconds, and tasks are limited to 10 minutes. However, background threads exist as well. According to their documentation:
Background threads created using this API do not inherit the context of their creator and do not need to end before the creator request completes.
Does this mean that they have no time limit? What about their memory limits?
As far as my own research goes, the only place I find background threads mentioned in the docs (other than the module documentation above) is in the "backends" documentation. Backends are deprecated (in favor of modules, which have since been renamed to services, it would appear... and yet all of these terms are used freely in the docs!). So I don't know how much of that page is applicable, and even then, it doesn't mention whether background threads have time limits.
Yes, background threads have no time limit, but they have to run on manual scaling or basic scaling instances, and they can only use as much memory as the instance offers.
The official documentation suggests not using background threads and recommends alternatives like task queues:
https://cloud.google.com/appengine/docs/java/runtime#threads
Task queues can also run on manual scaling and basic scaling instances, and they have a time limit of 24 hours.
See the overview Table here:
https://cloud.google.com/appengine/docs/java/an-overview-of-app-engine#scaling_types_and_instance_classes
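For completeness, in the Python standard runtime the background-thread API (again, only available on manual or basic scaling instances) looks roughly like this; a sketch, not a recommendation over task queues:

from google.appengine.api import background_thread

def long_running_work():
    # Placeholder for work that runs outside the request context; it is
    # bounded only by the instance's lifetime and memory, per the answer above.
    pass

# Start the work; the creating request can return before it finishes.
thread_id = background_thread.start_new_background_thread(long_running_work, [])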

Mesos as a giant Linux box

Is it possible to see all the mesos resources as a giant Linux box without custom code for the framework?
I am wondering: if I want to run a program using 2500 TB of RAM, can Mesos abstract the master/slave architecture away? Do I have to use custom code?
You have to write custom code. Mesos offers resources on a per-agent (slave) basis, and it is up to you to coordinate the binaries of your app running on different machines.
Q1: Mesos is a resource manager. Yes, it's a giant pool of resources, although at any given time it will offer you only a subset of all resources, assuming there are other users who might need some of them (don't worry, there's a way to utilize almost the whole cluster).
Q2: Mesos is designed for commodity hardware (many nodes, not a single giant HPC computer). A framework running on Mesos is given a list of resources (and slaves/worker nodes), and Mesos will execute a task within the bounds of the given resources. This way you can start an MPI job or run a task on top of Apache Spark, which will handle the communication between nodes for you (Mesos itself will not).
Q3: You haven't specified what kind of task you'd like to compute. Spark comes with quite a few examples; you can run any of those without writing your own code (see the sketch below).
(Image credits: Malte Schwarzkopf, Google talk EuroSys 2013 in Prague)
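As a concrete illustration of Q2/Q3, here is a minimal PySpark job submitted against a Mesos master; Spark (not Mesos) splits the work across whatever agents offer resources. The master URL and resource sizes are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mesos-example")
         .master("mesos://zk://zk1:2181,zk2:2181/mesos")  # hypothetical Mesos master
         .config("spark.executor.memory", "8g")
         .config("spark.cores.max", "64")
         .getOrCreate())

# A trivial distributed computation spread over the offered agents.
total = spark.sparkContext.parallelize(range(1000000), numSlices=200).sum()
print(total)
spark.stop()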
