Open source distributed computing/cloud computing frameworks

I was wondering if anyone knows of any good open source distributed computing projects? I have a computationally intensive program that could benefit from distributed computing (a la SETI@home, etc.) and want to know if anyone has seen such a thing, or will I be developing it from scratch?

I see that this question is over a year old, but here is a new and relevant answer:
http://openstack.org/

Here's one for Java and one for C#, and here's an open source grid toolkit.

SETI@home uses BOINC.

MPAPI - Parallel and Distributed Applications Framework.

- Sector 0 article: http://sector0.dk/?page_id=15. Gives a good overview of the framework, the architecture and the theory behind it.
- Works on anything from a single machine to 'n' machines.
- Lets you design distributed logic into the system.
- Focuses on message passing to isolate the state that each thread has access to, i.e. no shared state, only messages (illustrated in the sketch after this list).
- Is open source =] and is Mono compatible, yay!

Architecture in a nutshell:

Cluster
- A single main node controls the cluster.
- Numerous sub-nodes (one per machine) are the workhorses of the cluster.
- A single registration server binds the cluster together by allowing nodes to register/unregister with the cluster, notifying the existing nodes.

Communication
- Node to node directly. Each worker communicates with the others through its node.
- Messages are not propagated down through the remoting layer unless the two workers are on different nodes.
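I don't have MPAPI's own API at hand, so here is a minimal sketch in Python (using multiprocessing queues) of the "no shared state, only messages" principle the framework is built on; it illustrates the idea, not MPAPI itself.

```python
# Minimal message-passing sketch: the worker owns all of its state and
# interacts with the outside world only through message queues.
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue) -> None:
    # Each worker keeps its state private; everything arrives as messages.
    while True:
        msg = inbox.get()
        if msg is None:           # sentinel message: shut down
            break
        outbox.put(msg * msg)     # do some work, reply with a message

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for n in range(5):
        inbox.put(n)              # send work as messages
    inbox.put(None)               # tell the worker to stop
    results = [outbox.get() for _ in range(5)]
    p.join()
    print(results)                # [0, 1, 4, 9, 16]
```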

Hadoop if you want to run the machines yourself. Amazon Elastic MapReduce if you want to let others run your workers. Amazon Elastic MapReduce is based on Hadoop.
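To make the Hadoop/Elastic MapReduce option concrete, here is a hedged sketch of a classic Hadoop Streaming word count in Python; the file names and the invocation path in the comment are assumptions, not something from the answer above.

```python
# mapper.py -- reads lines from stdin, emits one "word\t1" pair per word.
# Run under Hadoop Streaming with something like (path is a placeholder):
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- input arrives sorted by key, so we can sum counts per word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")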

I have personally used BOINC, which is a robust, widely used solution that offers a great range of customization possibilities.
It is the most complete solution I know. The only problems I had were that it is difficult to use for remote job submission (if you don't have access to the server) and it can take a while to set up. But overall it is a very good solution.
If you would rather implement distributed computing over just a local grid, you can use GridCompute, which should be quick to set up and lets you use your application through Python scripts.
PS: I am the developer of GridCompute.

Related

Difference between SageMaker instance count and Data parallelism

I can't understand the difference between the SageMaker instance count and data parallelism. We already have a feature that lets us specify how many instances to train the model on when we write a training script using the sagemaker-sdk.
However, at re:Invent 2021 the SageMaker team launched and demonstrated SageMaker managed data parallelism, and this feature also provides distributed training.
I've searched a lot of sites to learn about this, but I can't find a really clear explanation. I'm sharing the link that explains the concept I mentioned most closely: https://godatadriven.com/blog/distributed-training-a-diy-aws-sagemaker-model/
Increasing the instance count will enable SageMaker to launch that many instances and copy data to them. This only enables parallelization at the infrastructure level. To really carry out distributed training we need support at the framework/code level, where the code knows how to aggregate/send gradients across all the GPUs/instances within the cluster, and in some cases how to distribute the data as well, usually when using DataLoaders. To achieve this, SageMaker has a Distributed Data Parallelism feature built into it. It is similar to alternatives like Horovod, PyTorch DDP, etc.
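As a hedged sketch of what the framework-level feature looks like in practice (the entry point script, IAM role, S3 path and framework versions below are placeholders for your own setup), the smdistributed data parallel backend is enabled through the estimator's distribution argument:

```python
# Hedged sketch: launching a PyTorch training job with SageMaker's
# distributed data parallel (SMDDP) library enabled.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder)
    role="YOUR_SAGEMAKER_ROLE_ARN",    # placeholder IAM role
    framework_version="1.12",
    py_version="py38",
    instance_count=2,                  # infrastructure-level parallelism
    instance_type="ml.p3.16xlarge",    # SMDDP needs multi-GPU instance types
    # Framework/code-level parallelism: gradients are aggregated across
    # all GPUs/instances by the smdistributed dataparallel backend.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://your-bucket/training-data")  # placeholder S3 input
```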

How to implement continuous delivery on a platform consisting of multiple applications which all depend on one database and on each other?

We are working on an old project which consists of multiple applications that all use the same database and strongly depend on each other. Because of the size of the project, we can't refactor the code so that they all use the API as a single database source. The platform contains the following applications:
Website
Admin / CMS
API
Cronjobs
Right now we want to start implementing a CI/CD pipeline using Gitlab. We are currently experiencing problems, because we can't update the database for the deployment of one application without breaking all other applications (unless we deploy all applications).
I was thinking about a solution where one pipeline triggers all the other pipelines. Every pipeline would run all newly added database migrations and test whether it still works as it should. If all pipelines succeed, the deployment of all applications would be started.
I doubt this is a good solution, because it would only increase the already high coupling between our applications. Does anybody know a better way to implement CI/CD for our platform?
You have to stop thinking about these as separate applications. You have a monolith with multiple modules, but until they can be decoupled, they are all one application and will have to be deployed as such.
Fighting this by pretending they aren't is likely a waste of time; your efforts would be better spent actually decoupling these systems.
There are likely a lot of solutions, but one that I've used in the past is to create a separate repository for the CI/CD of the entire system.
Each individual repo builds its component, and you create tags as components are released or ready for CI at the system level.
The separate CI/CD repo pulls in the appropriate tag for each item and runs CI/CD against all of them as one unit. This lets you pin which tag you want for each repo, which should prevent this pipeline from failing when changes are made to the individual components.
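As an illustration of the separate CI/CD-repo idea (the project ids, tags, hostname and token below are made up), a system-level job could trigger each component's pipeline at its pinned tag through GitLab's pipeline trigger API:

```python
# Hedged sketch: trigger each component's pipeline at a pinned tag via
# GitLab's trigger API (POST /projects/:id/trigger/pipeline).
import requests

GITLAB = "https://gitlab.example.com/api/v4"   # placeholder host
TRIGGER_TOKEN = "..."                          # placeholder trigger token

components = {                                 # project id, pinned tag
    "website":  ("42", "v1.4.0"),
    "admin":    ("43", "v2.1.3"),
    "api":      ("44", "v3.0.1"),
    "cronjobs": ("45", "v1.0.9"),
}

for name, (project_id, tag) in components.items():
    resp = requests.post(
        f"{GITLAB}/projects/{project_id}/trigger/pipeline",
        data={"token": TRIGGER_TOKEN, "ref": tag},
    )
    resp.raise_for_status()
    print(f"{name}: pipeline {resp.json()['id']} started at {tag}")
```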
Ask yourself why these "distinct applications" are using "one and the same database". Is that because every single one of those "distinct applications" deals with one and the same business semantics? If so, as Rob already stated, then you simply have one single application (and on top of that, there will be no decoupling, precisely because your business semantics are singular/atomic).
Or are there discernible portions in the DB structure, such that a highly accurate mapping could be identified saying "this component uses that portion", etc.? In that case, what is it that causes you to say things like "can't update the database for the deployment of ..."? (By the way, "update the database" is not the same thing as "restructure the database". Please, please, please be precise.) The answer to that will identify what you've got to tackle.

Manage multiple clusters in Hadoop OR Distributed Computing Framework

I have five computers networked together. One is the master computer and the other four are slave computers.
Each slave computer has its own set of data (a very big integer matrix). I want to run four different clustering programs on the four slaves, then take the results back to the master computer for further processing (such as visualization).
I initially thought of using Hadoop, but I cannot find a nice way to convert the above problem (specifically the output results) into the MapReduce framework.
Is there any nice open-source distributed computing framework with which I can perform the above task easily?
Thanks in advance.
You should use YARN to manage multiple clusters or resources.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
Reference
It seems that you have already stored the data on each of the nodes, so you have already solved the "distributed storage" element of the problem.
Since each node's dataset is different, this isn't a parallel processing problem either.
It seems to me that you don't need Hadoop or any other big data framework. However, you can embrace the philosophy of Hadoop by taking the code to the data. You run the clustering algorithm on each node, and then handle the results in whatever way you need. A caveat would be if you also have a problem in loading the data and running the clustering algorithm on each node, but that is a different problem.
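As a hedged sketch of "taking the code to the data" without any framework (the hostnames, paths and script name are placeholders), the master can simply drive the four slaves over SSH and copy the results back:

```python
# Hedged sketch: run the clustering program on each slave, where the data
# already lives, then pull the results back to the master.
import subprocess

SLAVES = ["slave1", "slave2", "slave3", "slave4"]   # placeholder hostnames

# Launch the clustering job on every slave in parallel.
procs = [
    subprocess.Popen(
        ["ssh", host,
         "python3 /data/cluster_job.py /data/matrix.bin /data/result.out"]
    )
    for host in SLAVES
]
for p in procs:
    p.wait()   # wait until all four jobs have finished

# Copy each result back to the master for visualization.
for host in SLAVES:
    subprocess.run(
        ["scp", f"{host}:/data/result.out", f"results/{host}.out"],
        check=True,
    )
```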

OrientDB in distributed architecture works with vertex replication across servers?

I have worked on a project with the OrientDB graph database. I managed to fill the database and run my queries against it without problems. But afterwards I needed to run my queries using OrientDB's distributed feature, and I came upon an important (maybe trivial) question.
I managed to use the distributed mode without problems across 3 different machines, but I wanted to be sure that OrientDB is really storing my database across the 3 machines I used. Is there any way to check that?
While researching this, I came to the conclusion that OrientDB replicates the entire database across all the machines; is that correct? The goal of using the distributed architecture was to improve performance, but if OrientDB works with replication and I run one query on a specific machine, will the query be processed using all machines, or only one?
In short, I want to know whether OrientDB in distributed mode distributes the vertices and edges across the machines and processes queries using all of them.
I've read the entire documentation (http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html) and could not find a clear explanation for these questions.
Thanks in advance!
OrientDB, by default, replicates the entire DB on all the servers. What you're looking for is called "sharding". OrientDB supports manual sharding (automatic sharding is planned for the future), which means you (the application) decide where to store the vertices/edges.
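To illustrate the manual sharding the answer describes (the class name, cluster names and credentials below are made up), the application picks the target cluster at insert time; here sketched with the pyorient driver:

```python
# Hedged sketch: manual sharding in OrientDB -- the application chooses the
# cluster (and therefore the server group) each vertex is stored in.
import pyorient

client = pyorient.OrientDB("localhost", 2424)
client.connect("root", "root_pwd")            # placeholder credentials
client.db_open("mygraph", "admin", "admin")   # placeholder database

# "client_europe" / "client_usa" would be clusters of the Client class,
# each assigned to a different server group in the distributed config.
client.command("CREATE VERTEX Client CLUSTER client_europe SET name = 'Luca'")
client.command("CREATE VERTEX Client CLUSTER client_usa SET name = 'Jay'")
```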

Silverlight Multi-User application with synchronization

I am wondering if it's possible to create a graphical application in Silverlight which supports synchronization between the different clients.
To be a bit more precise, I am drafting concepts for a Silverlight game. Visitors would log in and see live, synchronized, what the other visitors are doing.
If this can be implemented, I would like to know what is needed to create a fully synced Silverlight environment between multiple peers. Anything from links, code snippets, ideas and/or alternatives is more than appreciated!
Please do not suggest Flash, as I do not own a valid Flash building license; I prefer to have this created within Visual Studio 2010.
Edit:
I want it to be as lightweight for the clients as possible, and low on bandwidth consumption; I don't care much about the server. I don't know whether a broadcasting principle is the only option to have all the events take place at the same time.
You may want to take a look at WCF's Polling Duplex protocol, which follows the publish/subscribe concept. Support in Silverlight has been around since version 2, so there are plenty of articles out there. An article I referenced for a message broadcast system we put in place at work can be found here:
http://tomasz.janczuk.org/2009/07/pubsub-sample-using-http-polling-duplex.html
It also mentions an interesting project on CodePlex (which I've not used):
http://laharsub.codeplex.com/
A simple and working (but rather inefficient) solution would be for all clients to ask a WCF/RIA service on the server for status updates at regular intervals, perhaps once every X seconds or so, letting the server keep track of the changes relevant to each calling client.
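For what that polling approach boils down to, here is a hedged, language-agnostic sketch (in Python rather than Silverlight/C#; the endpoint, payload shape and interval are placeholders) of a client asking the server for everything that changed since its last check:

```python
# Hedged sketch of the simple polling fallback (the pattern, not the WCF
# API): each client periodically asks the server for changes it has missed.
import time
import requests

STATUS_URL = "https://example.com/game/updates"   # placeholder endpoint
last_seen = 0

while True:
    resp = requests.get(STATUS_URL, params={"since": last_seen})
    resp.raise_for_status()
    for event in resp.json():            # apply every event we missed
        print("apply update:", event)
        last_seen = max(last_seen, event["id"])
    time.sleep(5)                        # poll every X seconds
```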
