Where does IPFS store all the data?

I've been trying to implement and understand the working of IPFS and have a few things that aren't clear.
Things I've tried:
Implemented IPFS on my system and stored files on it. Even if I delete the files from my system and close the ipfs daemon, I am still able to access the files from a different machine through IPFS.
I've noticed there's a .ipfs folder in my home directory that contains the blocks of the data that I add to IPFS.
Questions:
Are the blocks stored locally on my system too?
Where else is the data stored? On other peers that I am connected to? Because I'm still able to access the file if I close my ipfs daemon.
If this is true, and data is stored at several places, is there still a possibility of losing my data if all the peers disconnect from the network?
Does every peer on the network store the entire file or just a part of the file?
If a copy of data is being distributed across the p2p network, it means the data is being duplicated multiple times? How is this efficient in terms of storage?
Do we store data uploaded by other peers too?
What are the minimum system requirements for running IPFS? Do we just need abundant storage, not necessarily a powerful system?

When you add something, the file is chunked by IPFS and stored in your cache folder (.ipfs).
If you check the file's existence on another peer of the network (say the main gateway, ipfs.io), that peer requests the file from you and caches it too.
If you later switch off your daemon and can still see the file on the gateway, it's probably because the gateway or some other peer on the network still has it cached.
When a peer wants to download a file but is out of space (it can no longer cache), it evicts the least recently used files to free space.
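If you want to see that in practice, here is a small sketch that shells out to the ipfs command-line tool (it assumes the daemon is running and example.bin is a placeholder file of yours):

    import subprocess

    # Add a file: IPFS chunks it and stores the blocks in ~/.ipfs
    out = subprocess.run(["ipfs", "add", "-Q", "example.bin"],
                         capture_output=True, text=True, check=True)
    cid = out.stdout.strip()  # content identifier of the root object
    print("root CID:", cid)

    # List the blocks that are now cached in your local repository
    blocks = subprocess.run(["ipfs", "refs", "local"],
                            capture_output=True, text=True, check=True)
    print(len(blocks.stdout.splitlines()), "blocks in the local repo")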
If you want to dive deep into the technology, check first these fundamentals:
how Git works
distributed hash tables (DHTs)
Kademlia
Merkle trees
Together these should give you a rough idea of how the mechanism works.
Now, let's answer point by point
Are the blocks stored locally on my system too?
Yes
Where else is the data stored? On other peers that I am connected to? Because I'm still able to access the file if I close my ipfs daemon.
All the peers that request your file cache it
If this is true, and data is stored at several places, is there still a possibility of losing my data if all the peers disconnect from the network?
You lose the file when it's no longer possible to reconstitute it from all the peers that had a part of it cached (including yourself); see the pinning sketch after these answers for one way to guard against that.
Does every peer on the network store the entire file or just a part of the file?
A peer can hold just a part of it. Imagine you are watching a movie and you stop more or less at the half... that's it, you've cached just half of it.
If a copy of the data is being distributed across the p2p network, does that mean the data is being duplicated multiple times? How is this efficient in terms of storage?
When you watch a video on YouTube your browser caches it (that's a replication too!). IPFS is more efficient in terms of traffic: say you close the browser and two minutes later you want to watch the video again. IPFS gets it from your cache, while YouTube makes you download it all over again. There's also an interesting topic around delta storage (related to Git) and around where you get the data from (it could be inside your LAN, which means blazing fast), but I want to stick to the questions.
Do we store data uploaded by other peers too?
If you request data, you cache it, so yes...
What are the minimum system requirements for running IPFS? Do we just need abundant storage, not necessarily a powerful system?
The main daemon is written in Go. Go is efficient, though not as much as C++, C, or Rust... Also, the tech is pretty young and it will improve with time. The more space you have, the more you can cache; CPU power isn't THAT important.
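One practical note on the "losing data" point: you can pin content on a node you control so that node never garbage-collects it. A sketch, assuming the ipfs CLI is installed and with a placeholder CID:

    import subprocess

    CID = "QmExampleCid"  # placeholder: put the real content identifier here

    # Pin the content so this node keeps the blocks permanently.
    subprocess.run(["ipfs", "pin", "add", CID], check=True)

    # Garbage collection then only removes unpinned cached blocks.
    subprocess.run(["ipfs", "repo", "gc"], check=True)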
If you are interested in ways to store data in a p2p manner, here are some links to interesting projects.
https://filecoin.io/
https://storj.io/
https://maidsafe.net/
https://www.ethereum.org/ and its related storage layer
https://ethersphere.github.io/swarm-home/

Files are stored inside IPFS objects, which are up to 256 KB in size. An IPFS object can also contain links to other IPFS objects. Files larger than 256 KB, such as an image or video, are split up into multiple IPFS objects that are each up to 256 KB in size, and afterwards the system creates an empty IPFS object that links to all the other pieces of the file. Each object gets hashed and given a unique content identifier (CID), which serves as a fingerprint. This makes it fast and easy to store the small pieces of your data on the network.
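As a rough illustration of the chunk-and-link idea, here is a simplified Python sketch (it is not the real IPFS Merkle DAG or CID format, just plain SHA-256 over 256 KB pieces):

    import hashlib

    CHUNK_SIZE = 256 * 1024  # 256 KB, as described above

    def chunk_and_hash(path):
        # Split a file into 256 KB pieces and hash each one.
        links = []
        with open(path, "rb") as f:
            while True:
                piece = f.read(CHUNK_SIZE)
                if not piece:
                    break
                links.append(hashlib.sha256(piece).hexdigest())
        # The "root object" just links to all the pieces.
        root = hashlib.sha256("".join(links).encode()).hexdigest()
        return root, links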
Because IPFS uses content-based addressing, once something is added it cannot be changed: it is an immutable data store, much like a blockchain. IPFS can help deliver content in a way that can save you considerable money.
IPFS removes duplicates across the network and tracks version history for every file. IPFS also provides high performance and clustered persistence.
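A toy example of why content addressing deduplicates (this is just a dictionary keyed by hashes, not IPFS's actual block store):

    import hashlib

    store = {}

    def put(data):
        key = hashlib.sha256(data).hexdigest()   # the key IS the content's hash
        store[key] = data                        # identical bytes land on the same key
        return key

    a = put(b"same bytes")
    b = put(b"same bytes")
    assert a == b and len(store) == 1            # duplicate content is stored once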
Since IPFS supports versioning of your files, let's say you want to share an important file with someone over IPFS. IPFS will create a new commit object. It is very basic: it just tells IPFS which commit went before it, and it links to the IPFS object of your file. Let's say that after a while you want to update the file. You just add the updated file to the IPFS network and the software will create a new commit object for it. This commit object now links to the previous commit, and this can go on endlessly. IPFS will make sure your file plus its entire history is accessible to the other nodes on the network.
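A simplified model of that commit chain (the field names and the "hash-of-file-v1"/"v2" values are made up for illustration; real IPFS uses IPLD objects and CIDs):

    import hashlib, json

    def cid_of(obj):
        # Stand-in for a real CID: a hash of the object's canonical encoding.
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

    def make_commit(file_cid, parent_cid):
        # A minimal "commit object": which commit came before it,
        # plus a link to the object holding that version of the file.
        return {"file": file_cid, "parent": parent_cid}

    v1 = make_commit("hash-of-file-v1", None)        # first version
    v2 = make_commit("hash-of-file-v2", cid_of(v1))  # update links back to v1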
The biggest problem with IPFS is keeping files available. Every node on the network keeps a cache of the files that it has downloaded and helps to share them if other people need them. But if a specific file is hosted by only 4 nodes and those nodes go offline, that file becomes unavailable and no one can grab a copy of it. There are two possible solutions to this problem: either we incentivize people to store files and make them available, or we proactively distribute files and make sure that there is always a certain number of copies available on the network.
That is exactly what Filecoin intends to do. Filecoin is created by the same group of people that created IPFS. It is basically a blockchain built on top of IPFS that wants to create a decentralized market for storage: if you have some free space you can rent it out to others and make money from it in the process.
IPFS and the blockchain are a perfect fit. You can address large
amounts of data with IPFS and place the immutable IPFS links into a blockchain transaction. This timestamps and secures your content without having to put the data on the chain itself.

How to prevent people or a program from extracting data out of a system?

Let's say there is a system containing data, where the user can view or manipulate it using the options in the system, but should not be able to copy/extract/export the data out of the system. Also, no bots such as RPA tools or crawlers should be able to export it either. The data strictly resides in the system.
E.g.: VDI (Virtual Desktop Infrastructure) does some sort of this. People can connect to remote machines and do some work, but cannot extract data out of them to their local machine unless it allows the user to do so. Even RPA bots will not be allowed to run in that remote system and can only run on the local system, but it would be tedious to build such a bot, so this provides a closer solution to the above problem.
I am just looking for alternate light-weight options. Please let me know if there is any solution available.
There is simply no way of stopping all information export.
A user could just take a photo of the screen and share the info.
If by exporting you mean exporting files, then simply do not allow exporting files in your program, or restrict the option; if you need to store data on the disk, store it encrypted.
The best option would be to configure a machine to use only that software, so on boot it would launch the software fullscreen, deny any USB autorun keys, and have something like Veyon installed so it can be remotely controlled, with some config data on the disk but pretty much all the data on a remote server.
If you need a local cache, you can keep it encrypted.
That said, theoretically, if a user had physical access to the RAM, they could retrieve that data, but it is highly unlikely.
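If you go the encrypted-local-cache route, a minimal sketch could look like this (it assumes the third-party cryptography package; note that if the key lives on the same machine, a determined user can still dig it out, so this only raises the bar):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # in practice, keep this out of the user's reach
    fernet = Fernet(key)

    def write_cache(path, data):
        with open(path, "wb") as f:
            f.write(fernet.encrypt(data))   # ciphertext on disk, not plaintext

    def read_cache(path):
        with open(path, "rb") as f:
            return fernet.decrypt(f.read())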
First of all, you'll have to make SSH and FTP useless! This is to prevent scp or other FTP software from being used to move things from inside the system out, and vice versa: block ports 20, 21 and 22!
If possible, I'd block access to cloud storage services (DNS/firewall), so that no one with access to the machine would be able to upload stuff to common cloud services or to a known address that might be a potential destination for your protected data. Make sure that online code repositories are also blocked! If the data can be stored as text, it can also be transferred to github/gitlab/bitbucket as a normal repo... you can block them at the DNS level too. Make sure that users don't have the privilege to change network settings, otherwise they can bypass your DNS blocks!
You should prevent any kind of external storage connectivity by disallowing your VM from connecting to the server's USB ports, or even Bluetooth if it exists.
That's off the top of my head... I'll edit this answer if I remember any more things to block.
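For the port blocking, something along these lines could work (a sketch only; it assumes a Linux host with iptables and root privileges, and the blocked domain is a made-up example):

    import subprocess

    # Block outbound FTP (20, 21) and SSH (22) so scp/FTP can't move data out.
    for port in ("20", "21", "22"):
        subprocess.run(
            ["iptables", "-A", "OUTPUT", "-p", "tcp", "--dport", port, "-j", "REJECT"],
            check=True,
        )

    # Crude DNS-level block for a cloud-storage or code-hosting domain:
    # append an unroutable entry to /etc/hosts (hypothetical domain).
    with open("/etc/hosts", "a") as hosts:
        hosts.write("0.0.0.0 uploads.example-cloud.com\n")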

How to transfer rules and configuration to edge devices?

In our application we have a server which contains entities along with their relations and processing rules stored in a DB. Connected to that server there will be n clients such as Raspberry Pis, gateways, and Android apps.
I want to push configuration & processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sustaining and avoid outages when the server/network is down.
How do I push/pull the configuration? I don't want to maintain DBs on the clients and configure replication, because maintenance and patching of DBs for that number of clients will be tough.
So is there any better alternative?
At the same time I have to push logs upstream (to the server).
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from server). If your data can be structured as a few object variables, you could write them to JSON files, and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
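For the "GET a string / download a JSON file" option, a minimal sketch might look like this (the URL and local path are placeholders; adjust to your own server API):

    import json, urllib.request

    CONFIG_URL = "https://server.example.com/devices/42/rules.json"  # placeholder
    LOCAL_COPY = "/var/lib/myapp/rules.json"                         # placeholder

    def load_rules():
        try:
            with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
                rules = json.load(resp)
            with open(LOCAL_COPY, "w") as f:   # refresh the on-device copy
                json.dump(rules, f)
        except OSError:
            # Server or network is down: fall back to the cached rules.
            with open(LOCAL_COPY) as f:
                rules = json.load(f)
        return rules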

Fastest way to store tiny data on the server

I'm looking for a better and faster way to store data on my webserver at the best speed possible.
My idea is to log the IP address of every incoming request to any website on the server and if it reaches a certain number within a set time then users will be redirected to a page where they need to enter a code to regain access.
I created an apache module that does just that. It attempts to create files on a ramdisk; however, I constantly run into permission problems since another module switches users before my module has a chance to run.
Using a physical disk is an option that is too slow.
So my only options are as follows:
1. Create folders on the ramdrive for each website so IP addresses can be logged independently.
2. Somehow figure out how to make my apache module execute its functionality before all other modules.
3. Allocate a huge amount of ram and store everything in it.
If I choose option #2, then I'll continue to beat around the bush as I have already attempted that.
If I choose option #1, then I might need lots more ram as tons of duplicate IP addresses are expected to be stored across several folders on the ramdrive.
If I choose option #3, then the apache module will have to constantly seek through the ram space allocated in order to find the IP address, and seeking takes time.
People say that memory access is faster than file access but I'm not sure if just a direct memory access via malloc is faster than storing data to a ram drive.
The reason why I expect to collect a lot of IP addresses is to block script kiddies from constantly accessing my server at a very high rate.
So what I'm asking is what is the best way I should store my data and why?
You can use a hashmap instead of a huge amount of RAM. It will be pretty fast.
Maybe use several hashmaps, one for each website, or use a composite key like string_hash(website_name) + (int_hash(ip) << 32).
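To make the hashmap idea concrete, here is a sketch of a per-(website, IP) counter with a sliding window (in Python for readability; the real thing would live in your C module, and the limits are made-up numbers):

    import time
    from collections import defaultdict

    WINDOW = 60    # seconds
    LIMIT = 100    # requests per window before redirecting to the code page

    hits = defaultdict(list)   # (website, ip) -> list of request timestamps

    def allow(website, ip):
        now = time.time()
        key = (website, ip)
        hits[key] = [t for t in hits[key] if now - t < WINDOW]  # drop old entries
        hits[key].append(now)
        return len(hits[key]) <= LIMIT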
If the problem is with permissions, why not solve it at that level? Use a common user account or group. Or make everything on the RAM disk world readable/writable.
If you want to solve it at the Apache level, you might want to look into mod_security and mod_evasive.

Silverlight Large File Downloader

I've got an interesting one: the ability to marshal the download of files - many in the gigabyte region of data.
I have a silverlight website that allows the upload of large volumes of data (Gigs) using the following plugin: http://silverlightuploader.codeplex.com/
However, I also want to be able to allow users to download the same data too. But I want to be able to restrict the amount of concurrent downloads. Thus the idea of directly controlling a stream of data to the client via silverlight is compelling - as I don't want to directly install anything on the machine.
My question is: For the volume of data I am looking at retrieving is it appropriate to use the WebClient class (I can specify how many bytes into the http stream I want to read, so I can download it incrementally, and put some business rules round it checking how many people are currently downloading, and make it wait until user count has gone down...), or can I use sockets to keep the overhead down of HTTP?
Unless there is a project I've not found which does exactly this thing!
Cheers in advance,
Matt
As long as you download the data in chunks of some smaller size then the actual volume of the total file won't matter and it won't really matter what you use to do the downloading. For example, for a file of that size I would just use the WebClient class and download chunks of maybe 1 or 2 MB at a time to a temporary storage file on disk. You'll have to keep track of how much you've downloaded and where you need to start the next chunk from, but that isn't an overly difficult problem. You can use sockets but then you have to communicate with the web server yourself to get access to the file in the first place.
When a client connects to download the next chunk, that is where you would enforce your business logic concerning the number of concurrent users. There might be a library you can use to do all this but to be honest it's not a complex problem.
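To illustrate the chunked range-request pattern in a language-neutral way (a Python sketch, not Silverlight/WebClient code; the URL and chunk size are placeholders, and it assumes the server honours Range headers):

    import urllib.error, urllib.request

    def download(url, dest, chunk=2 * 1024 * 1024):   # 2 MB per request
        start = 0
        with open(dest, "wb") as out:
            while True:
                req = urllib.request.Request(
                    url, headers={"Range": f"bytes={start}-{start + chunk - 1}"})
                try:
                    with urllib.request.urlopen(req) as resp:
                        data = resp.read()
                except urllib.error.HTTPError as e:
                    if e.code == 416:       # requested past the end: we're done
                        break
                    raise
                out.write(data)
                start += len(data)
                if len(data) < chunk:       # short read: last chunk received
                    break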

What's the best way to send pictures to a browser when they have to be stored as blobs in a database?

I have an existing database containing some pictures in blob fields. For a web application I have to display them.
What's the best way to do that, considering stress on the server and maintenance and coding efforts.
I can think of the following:
"Cache" the blobs to external files and send the files to the browser.
Read them directly from the database every time they're requested.
Some additional facts:
I cannot change the database and get rid of the blobs altogether, saving only file references in the database (like in the good ol' Access days), because the database is used by another application which actually requires the blobs.
The images change rarely, i.e. if an image is in the database it mostly stays that way forever.
There'll be many read accesses to the pictures; 10-100 pictures per view will be displayed (depending on the user's settings).
The pictures are relatively small, < 500 KB.
I would suggest a combination of your two ideas: the first time the item is requested, read it from the database, but afterwards make sure it is cached by something like Squid so you don't have to retrieve it from the database every time it is requested.
One important thing is to use proper HTTP cache control... that is, setting expiration dates properly, responding to HEAD requests correctly (not all platforms/webservers allow that)...
Caching those blobs to the file system makes sense to me... even more so if the DB is running on another server... but even if not, I think a simple file access is a lot cheaper than piping the data through a local socket... if you did cache the DB to the file system, you could most probably configure any webserver to do good cache control for you... if it's not the case already, you should maybe request that there is a field to indicate the last update of the image, to make your cache more efficient...
greetz
back2dos
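Putting the two answers together, a rough sketch of the file-cache-plus-HTTP-headers approach (framework-agnostic; load_blob_from_db and the cache directory are placeholders you would supply):

    import os

    CACHE_DIR = "/var/cache/app-images"   # hypothetical cache location

    def image_response(image_id, load_blob_from_db):
        # Returns (file_path, headers) for an image stored as a DB blob.
        # load_blob_from_db is a callable you provide; the first request
        # writes the blob to disk, later requests hit the file cache.
        path = os.path.join(CACHE_DIR, f"{image_id}.img")
        if not os.path.exists(path):      # cache miss: pull from the DB once
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(load_blob_from_db(image_id))
        headers = {
            # Images rarely change, so let browsers/proxies keep them for a day.
            "Cache-Control": "public, max-age=86400",
        }
        return path, headers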
