How do different components in a distributed system know where to send messages to access certain services?
For example, let's say I have a service which handles authentication, and a service which handles searching. How does the component which handles searching know where to send an authentication request? Are subdomains commonly used for this? If so, how does replication work in this scenario? Is there some registry of local IP addresses which handles all this routing?
The problem you are describing is called service discovery (also service lookup, service registry, or resource lookup), and the answer depends on how large your system is and how dynamic it is.
If you only have a few components, it may be feasible to store the necessary information in a config file, or pass it as a parameter. Many systems use DNS as a lookup system, but it is not considered a good one for this purpose, because records are cached and lookups can have long latency.
I think many distributed systems use ZooKeeper to store this information for them. This way, all the services only need to know the IP addresses of the ZooKeeper cluster. If you have replication, you just store multiple addresses inside ZooKeeper. Depending on which system you are using, you either choose an address on your own or the driver does it for you (for instance when you are connecting to a replicated database).
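To make the registry idea concrete, here is a minimal sketch. It uses an in-memory class as a stand-in for a ZooKeeper cluster, so `ServiceRegistry`, `pick_instance`, the service names and the addresses are all made up for illustration and are not part of any real client library. It shows a service registering several replicas and a caller choosing one of them on its own.

```python
import random


class ServiceRegistry:
    """In-memory stand-in for a registry such as a ZooKeeper cluster (hypothetical)."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" strings

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service, address):
        addresses = self._instances.get(service, [])
        if address in addresses:
            addresses.remove(address)

    def lookup(self, service):
        """Return every known address for a service, i.e. all of its replicas."""
        return list(self._instances.get(service, []))


def pick_instance(registry, service, rng=random):
    """Client-side choice among replicas; here simply a random pick."""
    addresses = registry.lookup(service)
    if not addresses:
        raise LookupError(f"no instances registered for {service!r}")
    return rng.choice(addresses)


registry = ServiceRegistry()
registry.register("auth", "10.0.0.5:8080")   # example addresses only
registry.register("auth", "10.0.0.6:8080")
registry.register("search", "10.0.0.9:9200")

# The search component asks the registry where authentication lives.
auth_address = pick_instance(registry, "auth")
print("sending auth request to", auth_address)
```

With a real registry the `register` call would typically create an entry that disappears when the service dies, so stale addresses are cleaned up automatically. That part is left out of this sketch.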
Another way to do this is to use a message queue or messaging library, like ZMQ, which will forward the messages to the correct instances. ZMQ can deal with replication and load balancing as well.
Let us say there is a system containing data, where the user can view or manipulate the data using the options in the system, but should not be able to copy, extract or export it out of the system. Bots such as RPA tools or crawlers should not be able to export it either. The data strictly resides in the system.
E.g. VDI (Virtual Desktop Infrastructure) does something like this. People can connect to remote machines and do some work, but cannot extract data to their local machine unless the system allows it. RPA bots are not allowed to run on the remote system either. They can only run on the local system, where building such a bot is tedious, so VDI comes close to solving the above problem.
I am just looking for alternative, lightweight options. Please let me know if there is any solution available.
There is simply no way of stopping all information export.
A user could just take a photo of the screen and share the info.
If by exporting you mean exporting files, then simply do not offer file export in your program, or restrict the option. If you need to store data on the disk, store it encrypted.
The best option would be to configure a machine to run only that software: on boot it launches the software fullscreen, USB autorun is denied, and something like Veyon is installed so the machine can be controlled remotely. Keep some config data on the disk, but pretty much all the data on a remote server.
If you need a local cache, you can keep it encrypted.
That said, theoretically, if a user had physical access to the RAM, he/she could retrieve that data, but it is highly unlikely.
First of all, you'll have to make ssh and ftp useless! This is to prevent scp or other FTP software from being used to move things out of the system and vice versa. Block ports 20, 21 and 22!
If possible, I'd block access to cloud storage services (via DNS or firewall), so that no one with access to the machine is able to upload stuff to common cloud services, or to any known address that might be a likely destination for your protected data. Make sure that online code repositories are also blocked! If the data can be stored as text, it can also be transferred to github/gitlab/bitbucket as a normal repo... you can block them at DNS level as well. Make sure that users don't have the privilege to change network settings, otherwise they can bypass your DNS blocks!
You should prevent any kind of external storage connectivity, by disallowing your VM from connecting to the server's USB ports, or even Bluetooth if it exists.
That's off the top of my head... I'll edit this answer if I remember any more things to block.
In our application we have a server which stores entities, along with their relations and processing rules, in a DB. A number of clients, such as Raspberry Pis, gateways and Android apps, are connected to that server.
I want to push configuration and processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sustaining and avoid outages when the server or network is down.
How should I push/pull the configuration? I don't want to maintain DBs on the clients and configure replication, because maintaining and patching DBs on that many clients would be tough.
Is there any better alternative?
At the same time, I have to push logs upstream (to the server).
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from server). If your data can be structured as a few object variables, you could write them to JSON files, and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
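As a rough illustration of the "request the whole configuration and keep a copy on the filesystem" approach, here is a minimal sketch. The `fetch` callable, the file name and the shape of the rules are assumptions made up for the example. In practice `fetch` would wrap whatever HTTP or messaging client the device uses.

```python
import json
import os
import tempfile


def load_config(fetch, cache_path):
    """Return the newest config available, preferring the server over the local copy.

    `fetch` is a hypothetical callable that returns the config as a JSON string
    and raises OSError (e.g. ConnectionError) when the server is unreachable.
    """
    try:
        config = json.loads(fetch())
    except (OSError, ValueError):
        # Server/network down or reply not valid JSON: fall back to the last good copy.
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    # Write to a temp file and rename it, so a crash never leaves a half-written cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(config, f)
    os.replace(tmp_path, cache_path)
    return config


def fetch_ok():
    # Stand-in for a successful GET of the whole ruleset.
    return '{"rules": [{"sensor": "temp", "max": 80}]}'


def fetch_down():
    # Stand-in for an outage.
    raise ConnectionError("server unreachable")


cache_file = os.path.join(tempfile.mkdtemp(), "config.json")
online = load_config(fetch_ok, cache_file)     # fetched and cached
offline = load_config(fetch_down, cache_file)  # served from the cached file
```

The device keeps working from the cached file during an outage and simply overwrites it the next time the server answers. Logs could be buffered to a local file in the same spirit and uploaded when the connection returns.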
https://ringpop.readthedocs.org/en/latest/
To my understanding, sharding can be implemented in library routines, and the application programs are just linked with the library. If the library is an RPC client, the sharding can be queried from the server side in real time. So, even if there is a new partition, it is transparent to the applications.
Ringpop is an application-layer sharding strategy, based on the SWIM membership protocol. I wonder what the major advantage of doing this at the application layer is?
And what about the other side, say sharding in the system layer?
Thanks!
Maybe a bit late for this reply, but maybe someone still needs this information.
Ringpop has introduced the idea of 'sharding' the application rather than the data. It works more or less like application-level middleware, with the advantage that it offers an easy way to build scalable and fault-tolerant applications.
The things that Ringpop shards are the requests coming from clients to a specific service. This is one of its major advantages (there are more, keep reading).
In a traditional SOA architecture, all requests for a specific service go to a single system that dispatches them among the workers for load balancing. These workers do not know each other: they are independent entities and cannot communicate with one another. They do their job and send back a reply.
Ringpop is the opposite: the workers know each other and can discover new ones, regularly talk among themselves to check each other's health status, and spread this information to the other workers.
How does Ringpop shard the requests?
It uses the concept of keyspaces. A keyspace is just a range of numbers. You are free to choose any range you like, but the obvious choice is to hash the IDs of the objects in the application and use the hash function's codomain as the range.
A keyspace can be pictured as a hash "ring", but in practice it is just the range of a 4- or 8-byte integer.
A worker, i.e. a node that can serve requests for a specific service, is 'virtually' placed on this ring, i.e. it owns a contiguous portion of the ring. In practice, it is assigned a sub-range. A worker is in charge of handling all the requests belonging to its sub-range. Handling a request means one of two things:
- process the request and provide a response, or
- forward the request to another service that actually knows how to serve it
Every application is built with this behaviour embedded: there is logic to handle a request or to forward it to another service that can handle it. The forwarding mechanism is nothing more than a remote procedure call, made using TChannel, Uber's high-performance network framing protocol for general RPC.
If you think about this, you can see that Ringpop offers a very nice thing that traditional SOA architectures do not have. The clients don't need to know or care about which instance can serve their request. They can just send a request anywhere in Ringpop, and the receiving worker will serve it or forward it to the right owner.
Ringpop has another interesting feature. New workers can dynamically enter the ring and old workers can leave the ring (e.g. because of a crash or just a shutdown) without any service interruption.
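The handle-or-forward behaviour can be sketched in a few lines. This is only an illustration of the idea, not Ringpop's actual code or API: the MD5-based `ring_position`, the single point per worker, and the `handle`/`forward` callbacks (which stand in for the local logic and the TChannel call) are all assumptions chosen to keep the example small.

```python
import bisect
import hashlib


def ring_position(value):
    """Hash a string to a 4-byte integer, i.e. a position on the ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, workers):
        # One point per worker; each worker owns the arc that ends at its point.
        self._points = sorted((ring_position(w), w) for w in workers)
        self._positions = [position for position, _ in self._points]

    def owner(self, key):
        """Return the worker whose sub-range contains the key's hash."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[index % len(self._points)][1]  # wrap around the ring


def handle_or_forward(me, ring, key, handle, forward):
    """Serve the request locally if this worker owns the key, else forward it."""
    owner = ring.owner(key)
    if owner == me:
        return handle(key)
    return forward(owner, key)


workers = ["10.0.0.1:3000", "10.0.0.2:3000", "10.0.0.3:3000"]  # example addresses
ring = HashRing(workers)


def handle(key):
    return f"served {key} locally"


def forward(owner, key):
    return f"forwarded {key} to {owner}"  # a real worker would make an RPC here


# The same request sent to each worker in turn: one serves it, the others forward it.
replies = [handle_or_forward(w, ring, "user:42", handle, forward) for w in workers]
for worker, reply in zip(workers, replies):
    print(worker, "->", reply)
```

Because every worker computes the same owner for a given key, it does not matter which worker the client contacts first. When a worker leaves, only the keys in its own sub-range move to a neighbour, while all other keys keep their owner.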
Ringpop implements a membership protocol based on SWIM.
It enables workers to discover one another and to exclude a broken worker from the ring using a TCP-based gossip protocol. When a new worker is discovered by another worker, a new connection is established between them. Every worker tracks the status of the other workers by sending ping requests at regular intervals, and spreads status information to the other workers if a ping does not get a reply (i.e. membership updates are piggybacked on pings, gossip style).
These three elements (consistent hashing, request forwarding and a membership protocol) make Ringpop an interesting solution for promoting scalability and fault tolerance at the application layer while keeping complexity and operational overhead to a minimum.
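A toy simulation may help to picture the ping-and-piggyback idea. It is heavily simplified and is not how SWIM or Ringpop is actually implemented: real SWIM pings one random member per period, uses indirect pings and a suspect state, and resolves conflicts with incarnation numbers. Here `Worker`, `merge` and `protocol_round` are made-up names, "network" calls are plain function calls, and a worker marked faulty is never revived.

```python
class Worker:
    def __init__(self, address):
        self.address = address
        self.up = True                      # whether the simulated process is running
        self.members = {address: "alive"}   # this worker's view of the ring


def merge(view, updates):
    """Apply piggybacked updates; in this sketch a 'faulty' entry is never overwritten."""
    for address, status in updates.items():
        if view.get(address) != "faulty":
            view[address] = status


def protocol_round(worker, cluster):
    """Ping every known member once; gossip our view on each ping that is acked."""
    for address in list(worker.members):
        if address == worker.address:
            continue
        target = cluster[address]
        if target.up:
            merge(target.members, worker.members)   # updates ride along with the ping
        else:
            worker.members[address] = "faulty"      # no reply: exclude it from the ring


a, b, c = Worker("a"), Worker("b"), Worker("c")
cluster = {w.address: w for w in (a, b, c)}
a.members.update({"b": "alive", "c": "alive"})  # assume a was started with b and c as seeds

protocol_round(a, cluster)   # b and c learn about each other through a
c.up = False                 # c crashes
protocol_round(a, cluster)   # a's ping to c gets no reply
protocol_round(a, cluster)   # a's next ping to b carries the news about c

print("b's view:", b.members)
```

Note that b ends up knowing c is gone without ever pinging c itself. That is the point of piggybacking membership updates on the pings that are being sent anyway.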
I am wondering what the stats are for different ways of storing (and therefore retrieving) content. Are there any charts out there, or do you guys have any quick tests to show, the requests per second, etc., of:
Direct (local) database access, vs.
HTTP Access to cached data, vs.
HTTP Access to uncached data (remote database), vs.
Direct File access
I am trying to judge how necessary it is to locally cache data if I'm using remote services.
Thanks!
.. what the stats are ...
Although some people may have published their findings, these will not map directly to your experience. You may find the opposite of what they discovered.
Sometimes it may be faster to retrieve data from a database than from a file. It depends on the size of the file, the filesystem or DBMS it resides on, the other data which affects the access path (e.g. indexes, the number of I/O operations needed to locate the start of the file...), the underlying hardware, the amount of caching available, whether the data (or information about its location) is already in the cache, and the interaction between each of these factors.
And that's before you start considering the additional variables introduced when you start talking about HTTP, which also implies remote network access.
Ultimately any file has to be read from the filesystem at some point, which suggests that direct file access would be the fastest method (but only on the local machine). However, once you consider centralized caching and concurrency, this is not necessarily the case.
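Since published numbers won't transfer to your setup, the practical route is to measure your own access paths. Below is a minimal sketch of such a quick test, comparing a direct file read with a read from a local SQLite database. The 4 KB payload, the table layout and the `requests_per_second` helper are arbitrary choices for illustration. The HTTP cases are left out because they depend entirely on your network and remote service, but they could be timed with the same helper.

```python
import os
import sqlite3
import tempfile
import timeit

payload = b"x" * 4096  # arbitrary 4 KB test item
workdir = tempfile.mkdtemp()

# Direct file access: the item stored as a plain file.
file_path = os.path.join(workdir, "item.bin")
with open(file_path, "wb") as f:
    f.write(payload)

# Direct (local) database access: the same item stored as a row.
db = sqlite3.connect(os.path.join(workdir, "items.db"))
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, body BLOB)")
db.execute("INSERT INTO items (id, body) VALUES (1, ?)", (payload,))
db.commit()


def read_file():
    with open(file_path, "rb") as f:
        return f.read()


def read_db():
    return db.execute("SELECT body FROM items WHERE id = 1").fetchone()[0]


def requests_per_second(fn, n=2000):
    """Call fn n times and return the average number of calls per second."""
    seconds = timeit.timeit(fn, number=n)
    return n / seconds


results = {name: requests_per_second(fn)
           for name, fn in [("file", read_file), ("sqlite", read_db)]}
for name, rate in results.items():
    print(f"{name}: {rate:,.0f} reads/s")
```

Keep in mind that a loop like this mostly hits the OS and database caches, so it measures the best case. Vary the payload size, the number of distinct items and the concurrency to get closer to your real workload.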
I am trying to judge how necessary it is to locally cache data if I'm using remote services.
Rather hard to say. How remote? What are your bandwidth costs? Latency? What level of service do you hope to provide? Does the remote system already provide caching information? How do you deal with cache invalidation?
If we knew everything about your application, the data source, your customers, the networks connecting them and your budget for implementing the service, then we might hazard a guess. And, yes, caching on the MITM server probably is a good idea, but only if you know that you're not breaking anything by using caching.
C.
On the memcached website it says that memcached is a distributed memory cache. This implies that it can run across multiple servers and maintain some sort of consistency. When I make a request in Google App Engine, there is a high probability that requests in the same entity group will be serviced by the same server.
My question is: say there were two servers servicing my requests, is the view of memcached from these two servers the same? That is, are things I put in memcached on one server reflected in the memcached instance for the other server, or are these two completely separate memcached instances (one for each server)?
Specifically, I want each server to actually run its own instance of memcached (no replication in other memcached instances). If it is the case that these two memcached instances update one another concerning changes made to them, is there a way to disable this?
I apologize if these questions are stupid, as I just started reading about it, but these are initial questions I have run into. Thanks.
App Engine does not really use memcached, but rather an API-compatible reimplementation (chiefly by the same guy, I believe -- in his "20% time";-).
Cached values may disappear at any time (via explicit expiration, a crash in one server, or due to memory scarcity in which case they're evicted in least-recently-used order, etc), but if they don't disappear they are consistent when viewed by different servers.
The memcached server chosen doesn't depend on the entity group that you're using (the entity group is a concept from the datastore, a different beast).
Each server runs its own instance of memcached, and each server will store a percentage of the objects that you store in memcache. The way it works is that when you use the memcached API to store something under a given key, a memcached server is chosen (based on the key).
There is no replication between memcached instances. If one of those boxes goes down, you lose 1/N of your memcached data (N being the number of memcached instances running in App Engine).
Typically, memcached does not share data between servers. The application server hashes the key to choose a memcached server, and then communicates with that server to get or set the data.
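A minimal sketch of that client-side hashing, assuming plain dictionaries as stand-ins for the memcached servers and CRC32-modulo as the hash (real clients vary, and many use consistent hashing instead). `ShardedCacheClient` is a made-up name, not a real memcached client API.

```python
import zlib


class ShardedCacheClient:
    """Toy cache client: picks a server from the key, then talks only to that server."""

    def __init__(self, servers):
        self.servers = servers  # list of dicts standing in for memcached servers

    def _server_for(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def set(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key).get(key)


memcached_servers = [{}, {}, {}]  # three independent instances, no replication

# Two application servers, each with its own client but the same server list.
app_server_1 = ShardedCacheClient(memcached_servers)
app_server_2 = ShardedCacheClient(memcached_servers)

app_server_1.set("user:42", "Alice")
value_seen_by_other = app_server_2.get("user:42")
print(value_seen_by_other)
```

Both application servers see the same value because they hash the key to the same memcached instance, not because the instances talk to each other. The value lives on exactly one of the three stores.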
Based on what I know, there is only ONE instance of Memcache for your entire application. There could be many instances of your code running, each with its own memory, and many datastores around the world, but there is only one Memcache service at a time. Keep in mind that this service is susceptible to failure, and there is not even an SLA for it.