DC/OS: Service vs Marathon App

I have the following two questions:
1) Are DC/OS Services just marathon apps? (Or: What's the difference between a DC/OS Service (like Cassandra) and a Cassandra app installed via Marathon?)
2) Scaling: Do DC/OS services like Cassandra automatically scale to all available nodes in the cluster (given sufficient workload)?
Thank you for your help :)

1) To answer the first part of your question, let me add one more concept, the DC/OS package, so we are comparing DC/OS package vs DC/OS service vs Marathon app.
a) DC/OS service vs Marathon App
They are the same: a long-running service that Marathon automatically keeps running. You can see this when creating a new DC/OS service, which you do with a Marathon app definition.
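For example, a minimal Marathon app definition (the app id and command here are just illustrative placeholders) could be added like this:
$ cat > my-service.json <<'EOF'
{
  "id": "/my-service",
  "cmd": "python3 -m http.server 8080",
  "cpus": 0.5,
  "mem": 128,
  "instances": 1
}
EOF
$ dcos marathon app add my-service.json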
b) DC/OS packages (and I believe the core of your question)
dcos package install cassandra will deploy the DC/OS Apache Cassandra package.
The interesting piece of the Cassandra package is the scheduler, a piece of software that manages your Cassandra cluster (e.g., by bootstrapping the cluster or automatically restarting failed tasks) and also provides endpoints for scaling, upgrading,...
If you like, it is an automated version of an administrator for your Cassandra cluster.
Now we also have to ensure that this administrator is always available (i.e., what happens if the scheduler task, or the node it runs on, fails?).
This is why the scheduler itself is deployed by Marathon: so it will be automatically restarted.
Marathon
|
Cassandra Scheduler
|
Cassandra Cluster
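You can see both layers from the DC/OS CLI (assuming a configured cluster; the exact output varies by package version):
$ dcos package install cassandra
$ dcos marathon app list    # the Cassandra scheduler appears here as a Marathon app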
2) Second part of your question: Autoscaling
The package provides endpoints for scaling, so the typical pattern is to provide a script (e.g., marathon-autoscale) to scale the Cassandra cluster. The reason you need your own script is that scaling is very individual to every user, especially scaling down.
Keep in mind that you are scaling a persistent service: how do you select the node you want to remove? Do you first drain traffic from that node? Do you migrate data to other nodes?
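As a sketch only, a scale-up script could look like the following. The scheduler host, endpoint paths, and load metric below are hypothetical placeholders, not the actual DC/OS Cassandra API; check your package's docs for the real endpoints:
#!/bin/bash
# Hypothetical autoscale sketch; endpoints and metric are assumptions.
SCHEDULER="http://cassandra-scheduler.example:9000"
NODES=$(curl -s "$SCHEDULER/v1/nodes" | jq length)   # current cluster size (assumed endpoint)
LOAD=$(curl -s "$SCHEDULER/v1/metrics/cpu")          # some load metric (assumed endpoint)
if [ "$LOAD" -gt 80 ]; then
  # scaling up is the easy direction; scaling down needs draining/migration first
  curl -s -X PUT "$SCHEDULER/v1/config" -d "{\"nodes\": $((NODES + 1))}"
fi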


MongoDB for small production project with very low traffic but with opportunity to scale fast

I've run into a problem with databases (specifically with MongoDB).
At first, I thought of deploying it to MongoDB Atlas Cloud, but the production clusters (M30 and higher) are quite expensive for a little project like mine.
Now I am thinking about deploying the MongoDB replica set somewhere on a Heroku Dyno or maybe an AWS instance.
Could you please suggest any production-ready way/solution that would not cost much?
As I can see, AWS offers some free tiers for DBs (Amazon RDS, Amazon DynamoDB, and others), but is it really free or is it a trap? I have heard that AWS's pricing policy is, in reality, not as honey-sweet as it looks on the pricing page.
Any advice or help is welcome!
Your least expensive way to experiment with MongoDB that still offers a path to scaling to something much larger is to get a free-tier EC2 instance like a t2.micro and install MongoDB Community on it yourself. The Linux 64-bit distribution is available here: https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-amazon-5.0.5.tgz. Simply run tar xf on that tarball, and the mongod server and mongo CLI executables will be extracted. No other configuration is required; pretty much, you would start it like this:
$ mkdir -p /path/to/dbfiles
$ mongod --dbpath /path/to/dbfiles --logpath /path/to/dbfiles/mongo.log --replSet "rs0" --fork
You do not need a multi-node replica set to use aggregation, transactions, etc.; the single-node replica set started above (note the --replSet flag) is enough, since transactions require a replica set but not multiple members.
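Because mongod was started with --replSet, one extra one-time step from the mongo shell completes that single-node replica set:
$ mongo --eval 'rs.initiate()'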
This solution will also expose you to a lot of important things like network security, database security, log management, driver compatibility, key vaults, and using the AWS Console and perhaps the AWS CLI to manipulate the environment. You can graduate to a bigger machine and create and add members to a replica set later -- or you can take the plunge and go with MongoDB Atlas. But at that point you'll be comfortable with the query functions, especially the aggregation pipeline, the document model, drivers, etc., all learned on an essentially zero-cost platform.
Performance is not really an issue here, but using the handy load generator POCDriver (https://github.com/johnlpage/POCDriver) on a t2.micro instance with no special setup and a doc size of 0.28 KB, with a 90/10 read/insert mix on the default of 4 threads and batch updates disabled (-b 1), we get about 450 inserts/sec and 4200 reads/sec on primary key _id.
java -jar POCDriver.jar --host "mongodb://localhost:27017" -k 90 -i 10 -b 1
Of course, the load generator is competing with the DB engine itself for resources but as a first cut it is good to know the performance independent of network considerations.
From launching the EC2 instance to running the test, including installing Java (sudo yum -y install java-1.8.0-openjdk-devel.x86_64), took about 5 minutes.

Flink: Multi-data center deployment possible?

Does Flink support multi-data-center deployments, so that jobs can survive a single data-center outage (i.e., resume failed jobs in a different data center)?
My limited research on Google suggests the answer: there is no such support.
Thanks
While there is no explicit support for multi-datacenter failover in Flink, it can be (and has been) set up. A Flink Forward talk provides one example: engineers from Uber describe their experience with this and related topics in "Practical Experience running Flink in Production". There are a number of components to their solution, but the key ideas are: (1) running extra YARN resource managers, configured with a longer inetaddr.ttl, and (2) checkpointing to a federation of two HDFS clusters.
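As a rough sketch of the checkpointing half (the federated namespace and path below are illustrative, not taken from the talk), the relevant flink-conf.yaml entries would look something like:
$ cat >> conf/flink-conf.yaml <<'EOF'
# illustrative values; the federated HDFS namespace is an assumption
state.backend: filesystem
state.checkpoints.dir: hdfs://federated-ns/flink/checkpoints
yarn.application-attempts: 10
EOF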

Moving app to GAE

Context:
We have a web application written in Java EE on Glassfish + MySQL with JDBC for storage + XMPP connection for GCM CCS upstream messaging + Quartz scheduler library. Those are the main components of the app.
We're wrapping up our app development stage and we're trying to figure out the best option for deploying it, considering future scalability, both in number of users and their geographical location (ex, multiple VMS on multiple continents, if needed). Currently we're using a single VM instance from DigitalOcean for both the application server and the MySQL server, which we can then scale up with some effort (not much, but probably more than GAE).
Questions:
1) From what I understood, Cloud SQL is replicated in multiple datacenters across the world, so storage-wise, geographical scalability is assured. This is a bonus. However, we would need to update the DB interaction code in the app to make use of the Cloud SQL structure, thus being locked in to their platform. Has this been a problem for you? (I haven't looked at their DB interaction code yet, so I might be off on this one.)
2) Do GAE instances work pretty much like a normal VM would? Are there any differences in this regard? I've understood the auto-scaling capability, but how are they geographically scalable? Do you have to select a datacenter location for each individual instance? Or how can you achieve multiple worldwide instance locations as Cloud SQL does right out of the box?
3) How would the XMPP connection fare with multiple instances? Considering each instance is a separate VM, each instance will have a unique XMPP connection to the GCM CCS server. That would cause multiple problems, e.g., if more than 10 instances are opened, the cap of 10 simultaneous XMPP connections will be hit.
4) How would the Quartz scheduler fare with the instances? Right now we're saving the scheduled jobs in the MySQL database and scheduling them at each server start. If there are multiple instances, the jobs will be scheduled on each of them, i.e., scheduled multiple times. We wouldn't want that.
So, if problems 3 and 4 are indeed as I described, what would be the solution to those two problems? Having a single instance for the XMPP connection, as well as a single instance for the Quartz scheduler?
1) Although Cloud SQL is a managed, replicated RDBMS, you still need to choose a region when creating an instance. That said, you cannot expect seamlessly low latency globally; you would still need to design a proper architecture to achieve that.
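For example, with the Cloud SDK (the instance name and tier here are just placeholders), the region is fixed when the instance is created:
$ gcloud sql instances create my-db --region=europe-west1 --tier=db-n1-standard-1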
2) GAE isn't a VM in any sense. GAE is a PaaS and, therefore, a runtime environment. You should expect several differences: you are restricted to Java, PHP, Go, and Python, to the mechanisms GAE provides out of the box, and to compatible third-party libraries. You cannot install middleware there, for example. On the other hand, you don't have to deal with anything from the infrastructure standpoint, and you get transparent high availability and scalability.
3) XMPP is not my strong suit, but GAE offers an XMPP service; you should take a look at it. More info: https://developers.google.com/appengine/docs/java/xmpp/?hl=pt-br
4) GAE offers a cron service that works pretty well. You wouldn't need to use Quartz.
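As a sketch of what that looks like for GAE Java (the URL and schedule are placeholders), a cron.xml under WEB-INF replaces the Quartz setup:
$ cat > war/WEB-INF/cron.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/tasks/scheduled-job</url>
    <description>replaces a Quartz job; URL and schedule are placeholders</description>
    <schedule>every 24 hours</schedule>
  </cron>
</cronentries>
EOF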
My last piece of advice: if you want to migrate an existing application, your best choice would probably be GCE + Cloud SQL, not GAE.
Good luck!
Cheers!

Is it a best practice in continuous deployment to deploy to all production servers immediately or just to a subset at first?

We use CD in our project, and since the application is used worldwide, we use more than one data center (one per region). Each data center hosts an isolated instance of the application (each regional deployment uses its own DB, application server, etc.). Data is not shared between data centers.
There are two different approaches that we can take:
1) Deploy to the integration server (I) where all tests are run, then deploy to the first data center A and then (once the deployment to A is finished) to data center B.
2) Region A has a smaller user base, and to prevent an outage in both A and B caused by a software bug that was not caught on the integration server (I), an alternative is to deploy to the integration server, then "bake" the code in region A for 24 hours, and deploy the application to data center B only after it was tested in production for 24 hours. Does this alternative go against CI best practices, since there is no "continuous" deployment in this case?
There is a big difference between Continuous Integration and Continuous Deployment. CI is for a situation where multiple users work on the same code base and integration tests are run repeatedly across check-ins, so that integration failures are handled quickly and programmatically. Continuous deployment is a paradigm in which, once your acceptance tests pass, you deploy quickly and programmatically, so that you ship as fast as feasible (instead of the usual ticketing delays that exist in most IT organizations). The question you are asking is a mix of both.
As for your specific question: yes, your practice does go against best practices. With two different data centers, you run the chance of hitting separate issues in each one.
I would rather design your data centers to have the flexibility to switch between the current and next versions. That way, you can deploy your code to the "next" environment and run your tests there. Once testing confirms that the new environment is good to go, you switch your environments from current to next.
The best practice, as Paul Hicks commented, is probably to decouple deployment from feature delivery with feature flags. That said, organizations with many production servers usually protect their uptime by either deploying to a subset of servers ("canary deployment") and monitoring before deploying to all, or by using blue-green deployment. Once the code is deployed, one can hedge one's bet further by flipping a feature flag only for a subset of users and, again, monitoring before exposing the feature to all.
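A minimal sketch of the canary idea (the host names, health URL, and deploy step below are placeholders, not any specific tool):
#!/bin/bash
# Canary sketch: deploy to one server first, check it, then roll out to the rest.
CANARY="app-server-1"                                    # placeholder hosts
OTHERS="app-server-2 app-server-3"
deploy() { ssh "$1" 'sudo systemctl restart myapp'; }    # placeholder deploy step
deploy "$CANARY"
sleep 300                                                # bake time; in practice, watch error rates instead
if curl -fs "http://$CANARY/health" > /dev/null; then
  for h in $OTHERS; do deploy "$h"; done
else
  echo "canary unhealthy, aborting rollout" >&2
  exit 1
fi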

Is there a framework for distributing browser automation testing over a cluster of EC2 instances?

I am aiming to simulate a large number of 'real users' hitting and realistically using our site at the same time, and ensuring they can all get through their use cases. I am looking for a framework that combines some EC2 grid management with a web automation tool (such as GEB/WATIR). Ideal 'pushbutton' operation would do all of this (see the sketch after the list):
1) Start up a configurable number of EC2 instances (using a specified AMI preconfigured with my browser automation framework and test scripts).
2) Start the web automation framework test(s) running on all of them, in parallel. I guess they would have to be headless.
3) Wait for completion.
4) Aggregate results.
5) Shut down the EC2 instances.
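Absent a ready-made framework, I imagine the flow is scriptable with the AWS CLI; a rough sketch, where the AMI ID, key name, and test entry point are placeholders:
#!/bin/bash
# Sketch: spin up N test runners, run headless tests in parallel, collect results, tear down.
IDS=$(aws ec2 run-instances --image-id ami-12345678 --count 5 \
      --instance-type t2.micro --key-name test-key \
      --query 'Instances[].InstanceId' --output text)
aws ec2 wait instance-running --instance-ids $IDS
HOSTS=$(aws ec2 describe-instances --instance-ids $IDS \
        --query 'Reservations[].Instances[].PublicDnsName' --output text)
for h in $HOSTS; do ssh -o StrictHostKeyChecking=no ec2-user@"$h" './run-tests.sh' & done
wait                                        # block until all parallel test runs finish
for h in $HOSTS; do scp ec2-user@"$h":results.xml "results-$h.xml"; done
aws ec2 terminate-instances --instance-ids $IDS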
While not a framework per se, I've been really happy with http://loader.io/
It has an API for your own custom integration, plus reporting and analytics for analysis.
PS. I'm not affiliated with them, just a happy customer.
However, in my experience, you need to do both load testing and actual client testing. Even loader.io will only hit your service from a handful of hosts. And it skips a major part: the client-side performance across a number of different clients' browsers.
This video has more on that topic:
http://www.youtube.com/watch?v=Il4swGfTOSM&feature=youtu.be
BrowserMob used to offer such a service; it looks like they got acquired.
