Distributed file system vs mounting a drive over the network

I would like to ask what the pros and cons are of using an off-the-shelf distributed file system (like HadoopFS) over just mounting a drive over the network on Linux. As I understand it, both approaches achieve the same thing: the same data is available at many remote locations.
Cheers!

Distributed file systems provide many benefits, such as automatic backup or rule-based distribution of data (you can, say, add many new elements to your storage, and the operation will be transparent to the applications using it).
Mounting drives can become a pain the day one of the elements in your network goes down for some reason while your applications rely on it.

Related

Writing lots of small pieces of data to a file (like logging)

I have a job that needs to write lots of small pieces of data individually to a file (like logging). If I implement it as ordinary file writes, will that wear out my disk very quickly?
I later realized that the best solution also depends on the system. In my case I can use a ramdisk or similar. But I wonder what solution industrial systems normally use, in case I want the design to scale.
"But I wonder what solution industrial systems normally use"
Filebeat was used to manage logging in one of my past projects. As far as I remember, you only need a few settings in its config file: the remote server and the file that you want to keep uploading.
Filebeat will then keep shipping your logs to that remote server.
You can read more about it: https://www.elastic.co/beats/filebeat
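On the disk-wear part of the question, one common approach is to collect small records in memory and write them out in batches, so the disk sees a few larger writes instead of many tiny ones. The following is only a minimal Python sketch of that idea; the class name `BatchedLogWriter` and the `max_records` threshold are illustrative assumptions, not something taken from the answer above or from Filebeat.

```python
import os
import tempfile


class BatchedLogWriter:
    """Collects small records in memory and appends them to a file in batches."""

    def __init__(self, path, max_records=100):
        self.path = path
        self.max_records = max_records  # hypothetical batch size
        self._pending = []
        self.flush_count = 0  # number of actual file writes performed

    def write(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.max_records:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        # One append per batch instead of one per record.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("".join(line + "\n" for line in self._pending))
        self._pending.clear()
        self.flush_count += 1

    def close(self):
        self.flush()


def demo():
    """Write 250 records in batches of 100; return (file writes, lines on disk)."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        writer = BatchedLogWriter(path, max_records=100)
        for i in range(250):
            writer.write(f"event {i}")
        writer.close()
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        return writer.flush_count, len(lines)
    finally:
        os.remove(path)
```

The trade-off is durability: records still sitting in memory are lost if the process crashes before the next flush, so the batch size is a choice between fewer writes and a smaller window of possible loss.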

Java EE, EJBs File handling

I'm developing a web application in which users are allowed to upload pictures; the system then generates thumbnails for them.
My problem lies in the fact that EJBs can be distributed across several servers and are therefore not allowed to handle files directly. I could store the images in the database, but I was hoping to store them as files on one of the servers. How can I do this? Is there any way to centralize the storage of files? Or any approach to dealing with files in Java EE with EJBs?
Currently, I'm storing my files in a database, so I have centralized access and I don't need a dedicated file server. I'm doing this because I don't know how to integrate FTP servers and EJBs. Is this a good alternative, though?
What I want is to use stateless EJBs to store the uploaded images as files, with the paths to them in the database, so that I can display them using
<h:graphicImage ... />
You actually have four aspects here:
Receiving and sending files
Creating thumbnails
Storing the files somewhere every node can access
Storing the original + thumbnail entities in a common database
Since you already have a Java EE server, you probably also already have an HTTP servlet container, for which there are numerous ways of doing load balancing and caching, not to mention the obvious potential for web-based interaction. If anything, support FTP transfer with a directory watcher as a bonus.
You should not create the thumbnails in stateless session beans: your servers will perform badly at peak time, because the server gives priority to business logic over accepting new connections. Instead, first receive and store the file plus the original entity in the database, and then use a service bean to queue up thumbnail creation (maybe with n worker threads or message queues if you want). You can also use native tools in some cases; we do on Linux.
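The "store first, then queue the thumbnail work" idea can be sketched independently of the container. The example below is a Python illustration of the pattern only, not Java EE code; `make_thumbnail` is a hypothetical placeholder for the real resize step, and in an application server the queue would more likely be a message queue with its own consumers.

```python
import queue
import threading


def make_thumbnail(image_id):
    # Placeholder for the real resize step (hypothetical); returns a fake path.
    return f"thumbs/{image_id}.jpg"


def run_thumbnail_workers(image_ids, n_workers=3):
    """Process image ids on n worker threads; return {image_id: thumbnail path}."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            image_id = jobs.get()
            if image_id is None:  # sentinel: no more work for this worker
                jobs.task_done()
                break
            path = make_thumbnail(image_id)
            with lock:
                results[image_id] = path
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    # The upload handler would only enqueue the id and return immediately.
    for image_id in image_ids:
        jobs.put(image_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results
```

The point of the design is that the request path only stores the original and enqueues an id, so a burst of uploads lengthens the queue instead of tying up request-handling threads.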
You should use a shared file system, for example on a SAN, which is the right tool for sharing files across several machines. Structure your files according to your file system's limits, such as the number of files per directory and the read/write capacity.
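One way to stay under a files-per-directory limit is to spread files over subdirectories derived from a hash of the file name. This is a minimal Python sketch under assumptions of my own: the function name `sharded_path`, the use of SHA-256, and the two-level, two-character layout are all illustrative choices rather than anything the answer prescribes.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, levels=2, width=2):
    """Return root/<h0>/<h1>/filename, where h0, h1 are slices of a hash of the name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters each.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)
```

With two hex characters per level, each directory holds at most 256 subdirectories, and the same file name always maps to the same location, so every node in the cluster can compute the path without a lookup.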
And a single database will be good enough for at least a small cluster, as long as you are not killing it with big binary blobs.
If in doubt, buy more RAM ;) The thumbnails especially are very cacheable and will perform well in Tomcat too. If you are not familiar with multi-threading, look for a ready-made cache on Google rather than writing your own. Naturally, cache the entities as well, not only the files.
You might want to use a (private) FTP server for this. Your EJB beans can contact this server for storing and retrieving files.
There are various Java libraries for accessing FTP servers. JCA-based FTP connectors would be particularly well suited to an EJB environment, but 'normal' ones will usually work fine too.
You can also look into using a clustered file system. Red Hat Global File System and Symantec's Veritas Clustered File System are two I have experience with. These products allow you to mount the same file system on several servers for read/write access. All your application sees is just another directory. If your environment already has the requisite components (a SAN and a good sysadmin), this might be the best-performing solution in a lot of use cases.
However, there are drawbacks to this approach:
You shift complexity from your app to the OS. These products aren't trivial to set up.
Scalability might become an issue if you have a large server farm. And when scaling problems arise, finding the bottleneck is not as straightforward as with arjan's FTP solution.
You need a SAN.
If you can make reasonable assumptions about "where" your EJB instance is, direct handling of a file is no problem. In your case (since you want to have files) I would read the image into a local temp folder and upload it to a remote destination.
A possible way to do that is http://txconnect.sourceforge.net/, a JCA transaction adapter that handles (among others) FTP connections. Configure the factory in XML, inject the connection into your bean, and you are ready to go.
Depending on your application server, there might be a special connector available (e.g. for Oracle or IBM systems).
I'd suggest you stick to your current solution. FTP access (if needed for purposes other than just keeping files together) can be built on top of your EJB layer. Displaying images stored in the DB is not a problem; a simple servlet will do the trick.
You can:
Create a WebDAV-based file share. This can be done using one of the many libraries available for Java or other languages. One such library is: http://milton.ettrema.com/index.html
All EJB instances can read and write images from this file share. They would need to use WebDAV client libraries.
Do set up backups of the directories behind this file share.

What kind of file system configuration changes are handled by an RDBMS?

From here,
Oracle ASM provides several advantages over conventional file systems and storage managers, including the following:
Rebalances data automatically after storage configuration changes
What kind of configuration changes are we talking about here? In a database, what kind of configuration changes happen?
"What kind of configuration changes are we talking about here?"
Adding new hard drives, reconfiguring the SAN, replacing DAS with NAS or SAN.
The point of ASM is to shield the database from the physical directory structure. Tablespaces are logical structures which provide one layer of indirection, but they are still tied to actual OS files. ASM uses another logical structure, the disk group, which operates at a lower level. It manages the allocation of actual OS storage resource to the disk groups.
Things like tablespaces?

Distributed Key-Value Data Store with Offline Access (Static Partitioning)

I need to be able to set up one or more servers that replicate all information, acting as a master data store that holds all the data.
Also need servers that specifically store/replicate certain data, available in local LANs, so that when the internet connection goes down, they can still access their local data. Under normal circumstances, the clients will access most of their data from the local LAN, and may use others when the local LAN server goes down.
This is wanted alongside the benefits of a distributed data store, such as failure resistance and speed.
Which Distributed Key-Value Data Store or other data storage method would be most suited for this?
Try out CouchDB. Your use case reads like it was built for it. Admittedly, CouchDB is much more than a key/value store, but that makes it no less suitable.
You get replication and, as an added bonus, fault tolerance, conflict detection (and resolution) and an easy HTTP API.
Let me know if you have any other questions.
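As a rough illustration of how the static partitioning could map onto CouchDB replication: if each LAN site keeps its data in its own database, the site server and the master can replicate that database in both directions. The sketch below only builds the JSON bodies that would be sent to CouchDB's `_replicate` endpoint; the function name, the URLs and the one-database-per-site layout are assumptions made for the example, not something stated in the answer.

```python
import json


def replication_bodies(master_url, site_url, db_name):
    """Build JSON bodies for two-way continuous replication of one site database.

    Each body is meant to be POSTed to a CouchDB server's /_replicate endpoint.
    The URLs and database name are supplied by the caller (hypothetical here).
    """
    push = {
        "source": f"{site_url}/{db_name}",
        "target": f"{master_url}/{db_name}",
        "continuous": True,
    }
    pull = {
        "source": f"{master_url}/{db_name}",
        "target": f"{site_url}/{db_name}",
        "continuous": True,
    }
    return [json.dumps(push, sort_keys=True), json.dumps(pull, sort_keys=True)]
```

Under this layout the master ends up holding every site database, while each site server holds only its own, which keeps local data reachable when the internet connection is down.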
Of course you must remember that replication is something completely different from backup, because one system's programmatic failure in handling the data can quickly replicate to other nodes resulting in total mayhem.
Maybe using a Hadoop File System or OpenAFS would be a good solution here?
I haven't used any of those systems in real-life scenarios, only had interest in them during my research on peer-to-peer and distributed storage solutions, but I think they're worth a try.
Have you checked out Microsoft's new Velocity? http://msdn.microsoft.com/en-us/data/cc655792.aspx. Unlike many other cloud services, you can run the setup (for Velocity) on your own premises.

Non-file FileSystems?

I've been thinking on this for a while now (you know, that dangerous thing programmers tend to do) and I've been wondering, is the method of storing data that we're so accustomed to really all that efficient? The trouble with answering this question is that I really don't have anything to compare it to, since it's the only thing I've ever used.
I don't mean FAT or NTFS or a particular type of file system, I mean the filesystem structure as a whole. We are simply used to thinking of "files" inside "folders" like our hard drive was one giant filing cabinet. This is a great analogy and indeed, it makes it a lot easier to learn when we think of it this way, but is it really the best way to go about describing programs and their respective parts?
I'd like to know if anyone can think of (or knows about) a data storage technique that might be used to store data for an Operating System to use that would organize the parts of data in a different manner. Does anything... different even exist?
Emails are often stored in folders. But ever since I have migrated to Gmail, I have become accustomed to classifying my emails with tags.
I often wondered if we could manage a whole file-system that way: instead of storing files in folders, you could tag files with the tags you like. A file identifier would not look like this:
/home/john/personal/contacts.txt
but more like this:
contacts[john,personal]
Well... just food for thought (maybe this already exists!)
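To make the tagging idea concrete, here is a small Python sketch of an in-memory index in which a file is found by its tags rather than by a path. The class name `TagIndex` and its methods are hypothetical, and a real file system would of course need persistence and much more.

```python
from collections import defaultdict


class TagIndex:
    """Maps tags to file names so files can be looked up by any set of tags."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, name, tags):
        for tag in tags:
            self._by_tag[tag].add(name)

    def find(self, *tags):
        """Return the names that carry every one of the given tags, sorted."""
        if not tags:
            return []
        # .get avoids creating empty entries for unknown tags.
        sets = [self._by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*sets))


index = TagIndex()
index.add("contacts", ["john", "personal"])
index.add("budget", ["john", "work"])
```

Here `index.find("john", "personal")` plays the role of `contacts[john,personal]`: the order of the tags does not matter, which is exactly what a fixed folder path cannot express.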
You can, for example, have dedicated solutions like Oracle raw partitions. Other databases support something similar. In these cases the file system adds unnecessary overhead and can be omitted; the DB software takes care of organising the structure.
The problem seems very application dependent and files/folders seem to be a reasonable compromise for many applications (and is easy for human beings to comprehend).
Mainframes used to just give programmers a number of 'devices' to use. A device corresponded to a drive or a partition thereof, and the programmer was responsible for the organisation of all data on it. Of course they quickly built up libraries to help with that.
The only OS I can think of that doesn't use the common hierarchical arrangement of flat files (like UNIX) is PICK. That used a sort of relational database as the file system.
Microsoft had originally planned to introduce a new file system for Windows Vista (WinFS, Windows Future Storage). The idea was to store everything in a relational database (SQL Server). As far as I know, this project was never (or not yet?) finished.
There's more information about it on Wikipedia.
I knew a guy who wrote his doctorate about a hard disk that comes with its own file system. It was based on an extension of SCSI commands that allowed the usual open, read, write and close commands to be sent to the disk directly, bypassing the file system drivers of the OS. I think the conclusion was that it is inflexible, and does not add much efficiency.
Anyway, this disk based file system still had a folder like structure I believe, so I don't think it really counts for you ;-)
Well, there's always Pick, where the OS and file system were an integrated database.
Traditional file systems are optimized for fast file access if you know the name of the file you want (including its path). Directories are a way of grouping files together so that they're easier to find if you know properties of the file but not its actual name.
Traditional file systems are not good at finding files if you know very little about them, however they are robust enough that one can add a layer on top of them to aid in retrieving files based on content or meta-information such as tags. That's what indexers are for.
The bottom line is we need a way to store persistently the bytes that the CPU needs to execute. So we have traditional file systems which are very good at organizing sequential sets of bytes. We also need to store persistently the bytes of files that aren't executed directly, but are used by things that do execute. Why create a new system for the same fundamental thing?
What more should a file system do other than store and retrieve bytes?
I'll echo the other responses. If I could pick a filesystem type, I personally would rather see a hybrid approach: a flat database of subtrees, where each subtree is considered as a cohesive unit, but if you consider the subtrees themselves as discrete units they would have no hierarchy, but instead could have metadata + be queryable on that metadata.
The reason for files is that humans like to attach names to "things" they have to use. Otherwise, it becomes hard to talk or think about or even distinguish them.
When we have too many things on a heap, we like to separate the heap. We sort it by some means and we like to build hierarchies where you can navigate arbitrarily sized amounts of things.
Hence directories and files just map our natural way of working with real objects. And since you can put anything in a file, on Unix even hardware is mapped into the file system as "device nodes": special files that you can read and write to send commands to the hardware.
I think the metaphor is so powerful, it will stay.
I spent a while trying to come up with an automagically versioning file system that would maintain versions (and version history) of any specific file and/or directory structure.
The idea was that all of the standard access commands (e.g. dir, read, etc.) would have an optional date/time parameter that could be passed to access the file system as it looked at that point in time.
I got pretty far with it, but had to abandon it when I had to actually go out and earn some money. It's been on the back-burner since then.
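The "optional date/time parameter" idea can be illustrated with a toy in-memory store that keeps every version of a file and answers reads as of a given time. This Python sketch is my own illustration under simple assumptions (numeric timestamps, whole-file versions, hypothetical class name `VersionedStore`); it is not the design described above.

```python
import bisect


class VersionedStore:
    """Keeps every written version of a file and can read it as of a given time."""

    def __init__(self):
        self._times = {}     # path -> sorted list of timestamps
        self._contents = {}  # path -> list of contents, parallel to _times

    def write(self, path, timestamp, data):
        times = self._times.setdefault(path, [])
        contents = self._contents.setdefault(path, [])
        # Insert in timestamp order so reads can use binary search.
        i = bisect.bisect_right(times, timestamp)
        times.insert(i, timestamp)
        contents.insert(i, data)

    def read(self, path, at=None):
        """Return the latest version, or the version current at time `at`."""
        times = self._times.get(path, [])
        if not times:
            raise FileNotFoundError(path)
        if at is None:
            return self._contents[path][-1]
        i = bisect.bisect_right(times, at)
        if i == 0:
            raise FileNotFoundError(f"{path} did not exist at {at}")
        return self._contents[path][i - 1]


store = VersionedStore()
store.write("notes.txt", 10, "first draft")
store.write("notes.txt", 20, "second draft")
```

With this data, `store.read("notes.txt", at=15)` sees the file as it was between the two writes, while `store.read("notes.txt")` without a time behaves like an ordinary read of the current version.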
If you take a look at the start-up times for operating systems, it should be clear that improvements in accessing disks can be made. I'm not sure if the changes should be in the file system or rather in the OS start-up code.
Personally, I'm really sorry WinFS didn't fly. I loved the concept.
From Wikipedia (http://en.wikipedia.org/wiki/WinFS) :
WinFS includes a relational database for storage of information, and allows any type of information to be stored in it, provided there is a well defined schema for the type. Individual data items could then be related together by relationships, which are either inferred by the system based on certain attributes or explicitly stated by the user. As the data has a well defined schema, any application can reuse the data; and using the relationships, related data can be effectively organized as well as retrieved. Because the system knows the structure and intent of the information, it can be used to make complex queries that enable advanced searching through the data and aggregating various data items by exploiting the relationships between them.
