What is the difference between a P2P file system and a distributed file system?

When I googled for a distributed storage tool for my app,
I found two types of technology:
the first present themselves as P2P file systems (IPFS, ...) and the others as distributed file systems (Ceph, ...).
So what is the difference between P2P systems and distributed systems?
What I believe (I may be wrong) is that P2P systems don't assume trust between nodes, whereas in distributed systems all nodes have to trust each other, or at least trust a "master" node.

P2P is a Distributed System Architecture.
What I believe (I may be wrong) is that P2P systems don't assume
trust between nodes, whereas in distributed systems all nodes have to
trust each other, or at least trust a "master" node.
It depends on your definition of trust. If by "trust" you mean that each node can operate as a standalone computer, then you are correct.
P2P involves a component called a peer. In P2P, each peer has the same power/capability as every other peer in the network; one peer can work on its own without the others.
Another example of a distributed system architecture is the client-server architecture.
A client has limited capability compared to a peer: it must connect to a server to perform a specific task, and it can do little without one.

A distributed file system (DFS) combines the storage of several nodes (possibly a large number) in a way that the end user sees as a single storage space. A middleware layer manages all of the disk space and takes care of the data. Such a distributed file system can rely on servers or on simple workstations. If the nodes are workstations we are talking about a P2P distributed file system; if they are servers we just say distributed file system. Note that even a P2P file system can involve nodes that act as servers for indexing files, mapping locations, etc. A P2P DFS is affected by the churn of peers (join/leave behaviour), while server-based systems don't have this problem.
The best approach is to analyze several P2P distributed file systems like Freenet, CFS, OceanStore (interesting since it uses untrusted servers that act as peers), Farsite, etc.; take a look here for more.
And some DFSs like Ceph, Hadoop, Riak, etc.; some of them you can find here.
Hope this helped.

Related

How to transfer rules and configuration to edge devices?

In our application we have a server which contains entities along with their relations and processing rules stored in a DB. Connected to that server there will be any number of clients such as Raspberry Pis, gateways, and Android apps.
I want to push configuration & processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sustaining and avoid outages when the server/network is down.
How should I push/pull the configuration? I don't want to maintain DBs at the clients and configure replication, because maintaining and patching DBs on that many clients would be tough.
Is there any better alternative?
At the same time I have to push logs upstream (to the server).
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux devices, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from server). If your data can be structured as a few object variables, you could write them to JSON files, and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
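If the plain-filesystem/JSON route is enough, the whole cycle can be as small as "save the server reply to disk, reload it at boot". A minimal sketch in C using the jansson library; the config path and the idea of persisting the raw server reply as-is are assumptions for illustration, not something your stack requires:

```c
/* Sketch of the "filesystem" option: keep the last known configuration as
 * a JSON file so the device can keep processing while the server or
 * network is down.  CONFIG_PATH is a hypothetical location. */
#include <stdio.h>
#include <jansson.h>

#define CONFIG_PATH "/var/lib/myapp/config.json"   /* assumed location */

/* Save a configuration blob received from the server. */
int save_config(const char *json_text)
{
    json_error_t err;
    json_t *cfg = json_loads(json_text, 0, &err);   /* parse server reply */
    if (!cfg) {
        fprintf(stderr, "bad config: %s (line %d)\n", err.text, err.line);
        return -1;
    }
    int rc = json_dump_file(cfg, CONFIG_PATH, JSON_INDENT(2));
    json_decref(cfg);
    return rc;
}

/* Load the last saved configuration at startup (e.g. when offline). */
json_t *load_config(void)
{
    json_error_t err;
    json_t *cfg = json_load_file(CONFIG_PATH, 0, &err);
    if (!cfg)
        fprintf(stderr, "no local config: %s\n", err.text);
    return cfg;   /* caller owns the reference; json_decref() when done */
}
```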

When does a distributed system need ZooKeeper

Why do some distributed systems like Solr or Kafka need ZooKeeper, but some distributed systems like Cassandra don't?
ZooKeeper provides a strongly consistent store for critical system state. Many systems, e.g. Storm and Kafka, rely on ZooKeeper for service discovery and leader election. Because ZooKeeper's ZAB protocol falls on the CP side of the CAP theorem, it can guarantee that two clients will not see different views of the same system. So, for instance, Kafka will not mistakenly believe both node A and node C are the leader for the same partition.
These systems use ZooKeeper simply because it is a very well tested and proven technology for storing this type of critical metadata; it acts as a central point for coordination. Cassandra, however, has a more decentralized architecture and implements its own consensus algorithm (Paxos) rather than relying on an external CP store like ZooKeeper. Depending on how Cassandra uses its gossip and consensus protocols, it may simply make some concessions that systems like Kafka and Solr do not. This lets Cassandra avoid depending on an external system like ZooKeeper, which can generally tolerate fewer failures than HA systems can.
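To make the leader-election part concrete, here is a minimal sketch of the usual ephemeral-sequential-znode recipe using the ZooKeeper C client. The connect string and the /election path are invented for the example, and real systems such as Kafka implement this in Java with watches and re-election logic that the sketch leaves out:

```c
/* Each node creates an ephemeral, sequential znode under a shared election
 * path; whoever owns the lowest sequence number is the leader.  Assumes
 * the parent /election znode already exists. */
#include <stdio.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

static void watcher(zhandle_t *zh, int type, int state,
                    const char *path, void *ctx) { /* ignore events here */ }

int main(void)
{
    zhandle_t *zh = zookeeper_init("zk1:2181,zk2:2181,zk3:2181",
                                   watcher, 30000, NULL, NULL, 0);
    if (!zh) return 1;

    /* Ephemeral: the znode disappears if this process dies, releasing
     * leadership.  Sequential: ZooKeeper appends an increasing counter. */
    char my_node[256];
    if (zoo_create(zh, "/election/candidate-", "", 0,
                   &ZOO_OPEN_ACL_UNSAFE,
                   ZOO_EPHEMERAL | ZOO_SEQUENCE,
                   my_node, sizeof(my_node)) != ZOK)
        return 1;

    struct String_vector children;
    if (zoo_get_children(zh, "/election", 0, &children) != ZOK)
        return 1;

    /* Everyone sees the same ordering because ZooKeeper is CP. */
    const char *lowest = children.data[0];
    for (int i = 1; i < children.count; i++)
        if (strcmp(children.data[i], lowest) < 0)
            lowest = children.data[i];

    printf("I am %s, leader is /election/%s\n", my_node, lowest);
    zookeeper_close(zh);
    return 0;
}
```

Because the znodes are ephemeral, a crashed leader's entry disappears and the surviving nodes can simply re-run the same check.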
Systems that need ZooKeeper rely on it for cluster coordination. Cassandra's architecture is different because it is a peer-to-peer system; as a consequence, coordination is distributed among the nodes.
In Kafka, consumers of topics register themselves in ZooKeeper in order to coordinate with each other and balance the consumption of data.
Consumers can also store their offsets in ZooKeeper by setting offsets.storage=zookeeper.
Solr embeds and uses Zookeeper as a repository for cluster configuration and coordination - think of it as a distributed filesystem that contains information about all of the Solr servers.
Apart from these, ZooKeeper is used in many other systems, such as Hadoop high availability and HBase.

Distributed systems: where to send requests?

How do different components in a distributed system know where to send messages to access certain services?
For example, let's say I have a service which handles authentication and a service which handles searching. How does the component which handles searching know where to send an authentication request? Are subdomains more commonly used? If so, how does replication work in this scenario? Is there some registry of local IP addresses which handles all this routing?
The problem you are describing is called service lookup / service registry / resource lookup / ..., and the answer depends on how large your system is and how dynamic it is.
If you only have few components, it might be feasible enough to store the necessary information in a config file, or pass it as parameter. Generally, many use DNS as a lookup system, but it’s not considered to be a good one, due to the caching and long latency.
I think most distributed systems use Zookeeper to store this information for them. This way, all the services only need to know the IP-addresses of the Zookeeper cluster. If you have replication, you just store multiple addresses inside Zookeeper, and depending on which system you are using, you’ll need to choose an address on your own, or the driver does it (in case you’re connecting to a replicated database for instance).
Another way to do this, is to use a message queue, like ZMQ which will forward the messages to the correct instances. ZMQ can deal with replications and load balancing as well.
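As a concrete illustration of the message-queue route, here is a minimal libzmq sketch in C. A REQ socket connected to several endpoints distributes requests across them round-robin, which is the load-balancing behaviour mentioned above; the service endpoints and the payload are made up for the example:

```c
/* The searching component asks a (replicated) authentication service to
 * validate a token.  Connecting one REQ socket to both replicas makes
 * ZeroMQ alternate requests between them, so the caller needs no registry
 * of individual hosts. */
#include <stdio.h>
#include <string.h>
#include <zmq.h>

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *auth = zmq_socket(ctx, ZMQ_REQ);

    /* Two hypothetical replicas of the authentication service. */
    zmq_connect(auth, "tcp://auth-1.internal:5555");
    zmq_connect(auth, "tcp://auth-2.internal:5555");

    const char *req = "AUTH token-abc123";          /* illustrative payload */
    zmq_send(auth, req, strlen(req), 0);

    char reply[256] = {0};
    zmq_recv(auth, reply, sizeof(reply) - 1, 0);    /* blocks for the answer */
    printf("auth service said: %s\n", reply);

    zmq_close(auth);
    zmq_ctx_destroy(ctx);
    return 0;
}
```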

Peer to peer replication of local databases

I have a program in C that monitors traffic and records the URLs visited by the user. Currently, I am maintaining this in a hash table. The key is the src IP address and the value is a data structure with a linked list of URLs. I currently maintain 50k to 100k records in the hash table. When the user logs out, the record can be deleted.
The program runs independently on an Active-Standby pair. I want to replicate this database to the other machine in case my primary machine crashes (the two systems act as client and server) and continue recording stuff associated with the user.
The hard way is to write code to send this information to the peer and, on the peer system, to receive and store it. The issue is that it will add lots of code (and bugs!). For the data replication and data store, here are a few prerequisites:
I want data-record replication between these machines. I am NOT looking at adding another machine/cluster unless required.
Prefer a library so that queries are quick; if not, another process on the same machine that I can reach via IPC.
Add, update and delete operations should be supported.
An in-memory database is a must.
Support multiple such databases with different keys.
Something that has publish/subscribe.
Resync capability if the backup dies and comes back again.
Interface should be in C
Possible options I looked at were ZooKeeper, Redis, Memcached, SQLite, and Berkeley DB.
ZooKeeper - Needs an odd number of systems for tie-breaking. Not suitable for a 1-to-1 pair.
Redis - Looks to fit my requirements, with hiredis as the C interface. A separate process, though.
Memcached - I don't have any caching requirements.
SQLite - Embedded database with a C interface.
Berkeley DB - Embedded database for better scale.
So Redis, SQLite, and Berkeley DB look like my options going forward. I'd appreciate any help/thoughts on the DBs I should research further for my requirements, or on any other DBs I should look into. I apologize if my question is very generic; if it does not belong here, please point me to the right forum.
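To make the comparison concrete, this is roughly what the Redis option would look like from C with hiredis: one Redis list per source IP, replication to the standby handled by Redis itself (e.g. configuring the standby as a replica of the primary), and an optional publish for subscribers. The key names, port, and data layout are assumptions, not requirements:

```c
/* Rough hiredis sketch of the add / publish / delete cycle described
 * above.  Redis replication carries the data to the standby machine and
 * resyncs it when it comes back. */
#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void)
{
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) {
        fprintf(stderr, "redis: %s\n", c ? c->errstr : "alloc failed");
        return 1;
    }

    /* Record a visited URL for a user, keyed by source IP. */
    redisReply *r = redisCommand(c, "RPUSH urls:%s %s",
                                 "10.1.2.3", "http://example.com/");
    freeReplyObject(r);

    /* Optional publish so the standby (or anyone else) can react to it. */
    r = redisCommand(c, "PUBLISH url-events %s", "10.1.2.3");
    freeReplyObject(r);

    /* On user logout, drop the whole record. */
    r = redisCommand(c, "DEL urls:%s", "10.1.2.3");
    freeReplyObject(r);

    redisFree(c);
    return 0;
}
```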

Can a Berkeley DB database be opened and accessed from multiple programs at the same time?

According to the Berkeley DB documentation, with the Transactional (TDS) and Concurrent Data Store (CDS) versions of the database, multiple threads may access (and change) the database.
Does this also mean that I can have two programs linked against the Berkeley DB "client" and have them access the same database file without any problems?
(I ask since, for a separate database server, this would of course be no problem, but in the case of Berkeley DB the database engine is linked into your program.)
thanks!
R
Some documentation seems to think you can use the same database concurrently from multiple processes as well as from multiple threads. Specifically:
"Multiple processes, or multiple threads in a single process, can all use the database at the same time as each uses the Berkeley DB library. Low-level services like locking, transaction logging, shared buffer management, memory management, and so on are all handled transparently by the library."
A cursory read did not shed any light on what BDB uses to control access from multiple processes, but if filesystem locks are used, access from multiple processes on a network filesystem may well be problematic.
Chapter 16: The Locking Subsystem from the reference guide looks promising.
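For a concrete picture of the multi-process case, here is a hedged sketch using the Concurrent Data Store flavour: both programs open the same environment home directory, and the library's locking coordinates the readers and the single writer across processes. The paths and database name are placeholders, and error checking is omitted for brevity:

```c
/* Two separate programs can each run this; as long as they point at the
 * same DB_ENV home, the library handles locking between them. */
#include <string.h>
#include <db.h>

int main(void)
{
    DB_ENV *env;
    DB *db;
    DBT key, val;

    /* Every participating program must open the same environment dir. */
    db_env_create(&env, 0);
    env->open(env, "/var/lib/myapp/dbenv",
              DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL, 0);

    db_create(&db, env, 0);
    db->open(db, NULL, "shared.db", NULL, DB_BTREE, DB_CREATE, 0664);

    memset(&key, 0, sizeof(key));
    memset(&val, 0, sizeof(val));
    key.data = "hello";  key.size = sizeof("hello");
    val.data = "world";  val.size = sizeof("world");

    db->put(db, NULL, &key, &val, 0);  /* safe while another process reads */

    db->close(db, 0);
    env->close(env, 0);
    return 0;
}
```

The key point is the shared DB_ENV: the environment's region files are what the library uses to coordinate locking between processes, which is also why the documentation discourages placing an environment on a network filesystem.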

Resources