file server to replace a clustered file system

For various maintenance, stability and backup reasons I need to replace a 10-node (10 Linux hosts) OCFS2 shared filesystem with something that does not rely on a shared disk. The client applications are PHP in a Linux-only environment.
Right now each PHP client requests a unique id from the database and creates a file with that id/name on the shared disk. The database stores all the file metadata. Existing files are accessed in a similar fashion.
I want to replace the shared-disk solution with putfile(id, '/tmp/path') and getfile(id, '/tmp/path') calls to a file server over the network. Client-side I could work with the files in a tmpfs. The server should handle compression etc. This would also free me of the PHP client dependency, and I could use the file server directly from other applications as well, such as Windows Delphi applications.
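Purely as an illustration of the interface I have in mind, a minimal sketch of such a client over plain HTTP (Java here, but PHP or Delphi would look the same; the /files/<id> URL scheme and the server behind it are assumptions, not an existing product):

    // Hypothetical putfile/getfile client over plain HTTP.
    // Assumes a server that accepts PUT and GET on /files/<id>.
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.*;

    public class FileStoreClient {
        private final String baseUrl; // e.g. "http://filestore:8080/files/"

        public FileStoreClient(String baseUrl) { this.baseUrl = baseUrl; }

        public void putfile(long id, String localPath) throws IOException {
            HttpURLConnection c = (HttpURLConnection) new URL(baseUrl + id).openConnection();
            c.setRequestMethod("PUT");
            c.setDoOutput(true);
            OutputStream out = c.getOutputStream();
            Files.copy(Paths.get(localPath), out); // stream the local file up
            out.close();
            if (c.getResponseCode() / 100 != 2)
                throw new IOException("PUT failed: " + c.getResponseCode());
        }

        public void getfile(long id, String localPath) throws IOException {
            HttpURLConnection c = (HttpURLConnection) new URL(baseUrl + id).openConnection();
            InputStream in = c.getInputStream();
            Files.copy(in, Paths.get(localPath), StandardCopyOption.REPLACE_EXISTING);
            in.close();
        }
    }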
In theory an FTP-based solution could even work, though it would probably not perform very well. Or am I wrong to distrust the old FTP protocol?
I currently have over 30 million file IDs, most of them a few KB in size with notable exceptions of up to 300 MB, totalling only 320 GB. The PHP client also does some compression and grouping with gzip and tar; it's all very clumsy.
I was hoping to find something fast and simple like memcachedb, but for files. The closest I've found is Hadoop's HDFS, but I don't think that's quite the correct solution.
Any recommendations? Something obvious I'm missing?

Related

writing lots of small amounts of data to a file (like logging)

I have a job that needs to write lots of small amounts of data individually to a file (like logging). If I implement it just like normal file write access, will that wear out my disk very quickly?
I later realized the best solution also depends on the system. In my case I could use a ramdisk or the like. But I wonder what solution industry systems normally use, in case I want expandability.
Filebeat was used to manage logging in one of my past projects. As far as I remember, you only need a few settings in its config file: you specify the remote server and the file(s) you want to keep uploading, and Filebeat will keep shipping your logs to that remote server. A minimal configuration might look like the sketch below.
You can read more about it here: https://www.elastic.co/beats/filebeat
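For illustration only, a hypothetical minimal filebeat.yml; the log path and host are placeholders, and the exact keys have changed between Filebeat versions:

    # Hypothetical minimal filebeat.yml; path and host are placeholders.
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/myapp/*.log          # the file(s) to keep shipping

    output.logstash:
      hosts: ["logs.example.com:5044"]    # the remote server receiving the logs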

Java EE, EJBs File handling

I'm developing a web application in which users are allowed to upload pictures; the system will then generate thumbnails for them.
My problem lies in the fact that EJBs can be distributed over several servers and thus are not allowed to handle files directly. I could store the images in the database, but I was hoping to store them as files on one of the servers. How can I do this? Is there any way to centralize the storage of files? Or any approach to dealing with files in Java EE with EJBs?
Currently I'm storing my files in a database, so I have centralized access and I don't need a dedicated file server. I'm doing this because I don't know how to integrate FTP servers and EJBs. Is this a good alternative, though?
What I want is: Using Stateless EJBs, store the uploaded images as files and the path to them in the database. So I can display them using
<h:graphicImage ... />
You actually have four aspects here:
Receiving and sending files
Creating thumbnails
Storing the files somewhere every node can access
Storing the original + thumbnail entities in a common database
Since you already have a Java EE server, you probably also already have an (HTTP) servlet container, in which there are numerous ways of doing load balancing and caching, not to mention the obvious potential for web-based interaction. If anything, support FTP transfer with a directory watcher as a bonus.
You should not create the thumbnails using stateless session beans; that means your servers will be crap at peak time, because the server will give priority to business logic over making new connections. Rather, first receive and store the file + original entity in the database, and then use a service bean to queue up thumbnail creation (maybe with n worker threads or message queues if you want; see the sketch below). You can also use native tools in some cases, as we do on Linux.
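A minimal sketch of that queueing idea, assuming a plain worker pool (class and method names are illustrative; in a full Java EE setup, JMS message-driven beans would be the more idiomatic choice):

    // Hypothetical sketch: decouple thumbnail creation from the upload request.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThumbnailService {
        // small fixed pool so thumbnailing cannot starve request threads at peak load
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        /** Called after the original file and its entity have been persisted. */
        public void queueThumbnail(final long imageId) {
            workers.submit(new Runnable() {
                public void run() {
                    makeThumbnail(imageId); // scale the stored original, save thumb + entity
                }
            });
        }

        private void makeThumbnail(long imageId) {
            // e.g. load the original, scale it (java.awt.Graphics2D or a native tool),
            // write the thumbnail and insert its entity into the database
        }
    }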
You should use a shared file system on a SAN, which is the right tool for sharing files across several machines. And structure your files according to your file system's limits, such as the number of files per directory and read/write capacity.
And a single database will be good enough for at least a small cluster, as long as you are not killing it with big binary blobs.
If in doubt, buy more RAM ;) The thumbnails especially are very cacheable and will give good performance even in Tomcat; if you are not familiar with multi-threading, find an existing cache library via Google (one possible shape is sketched below). Also cache the entities themselves, not only the files.
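For illustration, one hypothetical shape such a cache could take, built on java.util.LinkedHashMap's access-order mode (the size limit is arbitrary; wrap it with Collections.synchronizedMap for multi-threaded use):

    // Hypothetical in-process LRU cache for rendered thumbnails.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ThumbnailCache extends LinkedHashMap<Long, byte[]> {
        private static final int MAX_ENTRIES = 10000;

        public ThumbnailCache() {
            super(16, 0.75f, true); // true = access order, i.e. LRU eviction
        }

        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > MAX_ENTRIES; // evict the least-recently-used thumbnail
        }
    }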
You might want to use a (private) FTP server for this. Your EJB beans can contact this server for storing and retrieving files.
There are various libraries in Java for accessing FTP servers. Specifically well suited for use in an EJB environment would be JCA-based FTP connectors, but 'normal' ones will usually work fine too.
You can also look into using a clustered file system. RedHat Global File System and Symantec's Veritas Clustered File System are two I have experience with. These products allow you to mount the same file system across several servers for read/write access. All your application sees is just another directory. If your environment already has the requisite components (SAN and a good Sys Admin), this might be the best performing solution in a lot of use cases.
However, there are drawbacks to this approach:
You shift complexity from your app to the OS. These products aren't trivial to set up.
Scalability might become an issue if you have a large server farm. And when scaling problems arise, finding the bottleneck is not as straightforward as with arjan's FTP solution.
You need a SAN.
If you can make reasonable assumptions about "where" your EJB instance is, direct handling of a file is no problem. In your case (since you want to have files) I would read the image into a local temp folder and upload it to a remote destination.
A possible way to do that is http://txconnect.sourceforge.net/, a JCA transaction adapter that handles (among others) FTP connections. Configure the factory in XML, inject the connection into your bean, and you are ready to go.
Depending on your application server there might also be a special connector available (e.g. for Oracle or IBM systems).
I'd suggest you stick to your current solution. FTP access (if needed for purposes other than just keeping files together) can be built on top of your EJB layer. Displaying images stored in the DB is not a problem; a simple servlet will do the trick (see the sketch below).
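As an illustration of that servlet, a minimal sketch assuming a JDBC DataSource and an images(id, mime_type, data) table (the JNDI name, table and columns are all placeholders):

    // Hypothetical servlet streaming an image BLOB from the database.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.sql.DataSource;

    public class ImageServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            long id = Long.parseLong(req.getParameter("id"));
            try {
                DataSource ds = (DataSource) new InitialContext()
                        .lookup("java:comp/env/jdbc/appDS"); // placeholder JNDI name
                Connection con = ds.getConnection();
                try {
                    PreparedStatement ps = con.prepareStatement(
                            "SELECT mime_type, data FROM images WHERE id = ?");
                    ps.setLong(1, id);
                    ResultSet rs = ps.executeQuery();
                    if (!rs.next()) {
                        resp.sendError(HttpServletResponse.SC_NOT_FOUND);
                        return;
                    }
                    resp.setContentType(rs.getString(1));
                    InputStream in = rs.getBinaryStream(2);
                    OutputStream out = resp.getOutputStream();
                    byte[] buf = new byte[8192];
                    for (int n; (n = in.read(buf)) != -1; ) {
                        out.write(buf, 0, n); // stream straight to the browser
                    }
                } finally {
                    con.close();
                }
            } catch (NamingException e) {
                throw new ServletException(e);
            } catch (SQLException e) {
                throw new ServletException(e);
            }
        }
    }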
You can:
Create a WebDAV-based file share. This can be done using one of the many libraries available for Java or other languages. One such library is http://milton.ettrema.com/index.html
All EJB instances can read/write images from this file share; they would need to use WebDAV client libraries (see the sketch below).
Do set up backups of the directories behind this file share.
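For the client side, a minimal sketch using the Sardine WebDAV client library as one hypothetical choice (Milton, mentioned above, covers the server side; the URL, credentials and the choice of Sardine are assumptions):

    // Hypothetical WebDAV client using Sardine (https://github.com/lookfirst/sardine).
    import com.github.sardine.Sardine;
    import com.github.sardine.SardineFactory;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    public class WebDavImageStore {
        private final Sardine sardine = SardineFactory.begin("user", "password");
        private final String base = "http://fileshare.example.com/images/"; // placeholder

        public void store(String name, byte[] data) throws Exception {
            sardine.put(base + name, data); // HTTP PUT to the shared directory
        }

        public byte[] load(String name) throws Exception {
            InputStream in = sardine.get(base + name);
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        }
    }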

Best way of storing binary or image files

What is the best way to store binary or image files?
Database System
File System
Would you please explain, why?
There is no real best way, just a bunch of trade-offs.
Database Pros:
1. Much easier to deal with in a clustering environment.
2. No reliance on additional resources like a file server.
3. No need to set up "sync" operations in load balanced environment.
4. Backups automatically include the files.
Database Cons:
1. Size / Growth of the database.
2. Depending on DB Server and your language, it might be difficult to put in and retrieve.
3. Speed / Performance.
4. Depending on DB server, you have to virus scan the files at the time of upload and export.
File Pros:
1. For single web/single db server installations, it's fast.
2. Well understood ability to manipulate files. In other words, it's easy to move the files to a different location if you run out of disk space.
3. Can virus scan when the files are "at rest". This allows you to take advantage of scanner updates.
File Cons:
1. In multi web server environments, requires an accessible share, which should also be clustered for failover.
2. Additional security requirements to handle file access. You have to be careful that the web server and/or share does not allow file execution.
3. Transactional Backups have to take the file system into account.
The above said, SQL Server 2008 has a feature called FILESTREAM which combines both worlds: you upload to the database and it transparently stores the files in a directory on disk. When retrieving, you can either pull from the database or go directly to where the file lives on the file system.
Pros of storing binary files in a DB:
1. Some decrease in complexity, since the data access layer of your system need only interface to a DB, and not a DB + file system.
2. You can secure your files using the same comprehensive permissions-based security that protects the rest of the database.
3. Your binary files are protected against loss along with the rest of your data by way of database backups. No separate filesystem backup system required.
Cons of storing binary files in a DB:
1. Depending on the size/number of files, they can take up significant space, potentially decreasing performance (depending on whether your binary files are stored in a table that is often queried for other content) and making for longer backup times.
Pros of storing binary files in the file system:
1. This is what file systems are good at. File systems handle defragmentation well, and retrieving files (say, to stream a video file through a web server) will likely be faster than with a DB.
Cons of storing binary files in the file system:
1. Slightly more complex data access layer.
2. Needs its own backup system.
3. Need to consider referential integrity issues (e.g. a deleted pointer in the database must result in deletion of the file, so as not to leave 'orphaned' files in the filesystem).
On balance I would use the file system. In the past, using SQL Server 2005 I would simply store a 'pointer' in db tables to the binary file. The pointer would typically be a GUID.
Here's the good news if you are using SQL Server 2008 (and maybe others - I don't know): there is built-in support for a hybrid solution with the new VARBINARY(MAX) FILESTREAM data type. These columns behave logically like VARBINARY(MAX) columns, but behind the scenes SQL Server 2008 stores the data in the file system.
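For illustration, a hypothetical minimal table definition (shown in T-SQL, since the feature is declared in the schema; all names are placeholders, and the instance and database must already have FILESTREAM enabled):

    -- Hypothetical FILESTREAM-backed table (SQL Server 2008+); names are placeholders.
    CREATE TABLE Photos (
        PhotoId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Name    NVARCHAR(260) NOT NULL,
        Data    VARBINARY(MAX) FILESTREAM NULL  -- stored as a file on the NTFS volume
    );

    -- Reads and writes look like any other VARBINARY(MAX) column:
    SELECT Data FROM Photos WHERE Name = N'profile_2.jpg';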
There is no best way.
What? You need more info?
There are three ways I know of... One, as byte arrays in the database. Two, as a file with the path stored in the database. Three, as a hybrid (only if DB allows, such as with the FileStream type).
The first is pretty cool because you can query and get your data in the same step. Which is always nice. But what happens when you have LOTS of files? Your database gets big. Now you have to deal with big database maintenance issues, such as the trials of backing up databases that are over a terabyte. And what happens if you need outside access to the files? Such as type conversions, or mass manipulation (resize all images, apply watermarks, etc.)? It's much harder to do than when you have files.
The second is great for somewhat large numbers of files. You can store them on NAS devices, back them up incrementally, keep your database small, etc etc. But then, when you have LOTS of files, you start running into limitations in the file system. And if you spread them over the network, you get latency issues, user rights issues, etc. Also, I take pity on you if your network gets rearranged. Now you have to run massive updates on the database to change your file locations, and I pity you if something screws up.
Then there's the hybrid option. It's almost perfect: you can get your files via your query, yet your database isn't massive. Does this solve all your problems? Probably not. Your database isn't portable anymore; you're locked to a particular DBMS. And this stuff isn't mature yet, so you get to enjoy the teething process. And who says this solves all the different issues?
Fact is, there is no "best" way. You just have to determine your requirements, make the best choice depending on them, and then suck it up when you figure out you did the wrong thing.
I like storing images in a database. It makes it easy to switch from development to production just by changing databases (no copying files). And the database can keep track of properties like created/modified dates just as well as the File System.
I personally never store images IN the database, for performance reasons. On all of my sites I have a "/files" folder where I put sub-folders based on what kind of images I'm going to store. Then I name the files by convention.
For example, if I'm storing a profile picture, I'll store it in "/files/profile/" as profile_2.jpg (if 2 is the ID of the account). I always make it a rule to resize the image on the server to the largest size I'll need, and then smaller ones if I need them. So I'd save "profile_2_thumb.jpg" and "profile_2_full.jpg".
By creating rules for yourself, in the code you can simply call img src="/files/profile/profile_2_thumb.jpg".
That's how I do it, anyway!
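A tiny sketch of that convention as code (the folder layout is the poster's; the class and method names are mine):

    // Hypothetical helper encoding the "/files/<kind>/<kind>_<id>_<size>.jpg" convention.
    public final class ImagePaths {
        private static final String ROOT = "/files";

        /** e.g. imagePath("profile", 2, "thumb") -> "/files/profile/profile_2_thumb.jpg" */
        public static String imagePath(String kind, long id, String size) {
            return ROOT + "/" + kind + "/" + kind + "_" + id + "_" + size + ".jpg";
        }
    }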

Which is a better method for storing images - folder or SQL Server as binary?

I am planning the development of a photo gallery application for a client. I am developing the app in ASP.NET 3.5 and would like to build it so that I can re-use the application across multiple platforms using various front-ends. Basically, I am wondering what the advantages and disadvantages are of storing images in the database as binary files, as opposed to simply storing the files in an application folder.
Any advice would be greatly appreciated!
Thanks,
Tristan
SQL Server 2008 supports FILESTREAM storage.
The files are stored on an NTFS volume like plain files, but are subject to transaction control and can be accessed via special file names passed to Win32 API functions (and of course any API built upon it) with additional SQL Server security checks (like GRANT options etc).
The disadvantage of storing as binary is that you blow the database up to an incredible size. If you were to use an Express edition of SQL Server, which is limited to 4 GB per database, your photo gallery would "finish" quite soon.
The advantage is that you can easily manipulate the access restrictions per file and per user. You just look at user rights and decide whether you serve back the image or not.
File system storage will offer much better performance in saving and serving images, and is supported in every platform. If you can live without the security and transactional goodies you get from db storage, then I would go with the file system.
We had a large LOB application that provided bank tellers with identification information about the member standing in front of them. Our textual data was stored in SQL Server; image data was stored in files, and the database field simply held a filename. This approach works well if you are behind the firewall. Reading and writing files is easy. The trouble is the file management. You should secure the file system so that random people cannot view the directory. Also, backups are more complicated with loose image files: you have to back up the database and the image files, and the fields can reference paths that no longer exist. For example, some IT dude decides to move the image folder and now all the references in the db are broken. If your application needs to pass information through the firewall, I would suggest storing images in SQL Server using the mentioned FILESTREAM storage.
Storing the images in the database would have saved us some grief: we would only have had to back up the database, it would be more secure, the references would never break, and we would not have had to jump through hoops to get files from the network outside the firewall.
This debate has been going on in almost every SQL Server community for ages. There are good arguments for both sides and there is definitely no one-size-fits-all answer; it really depends on your individual situation and on many factors, such as number of users, average file size, update frequency, read/write ratio, disk subsystem, yadayadayada...
But as you mention SQLExpress, probably the most important factor is the max database size limit, and that is a very good reason to go for the filesystem approach. Anyway, this research paper might still be interesting for you: "To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?" It used to be on the Microsoft Research site at http://research.microsoft.com/apps/pubs/default.aspx?id=64525, but that link doesn't work for me. SQL Server has come a long way since then; Quassnoi already mentioned FILESTREAM, for example.

What's the recommended way to store (externally) and read data on a BlackBerry?

We have an application we would like to port to the BlackBerry platform that reads its data from a SQLite database that, for the purposes of this port, will be around 4 MB. This database isn't particularly complicated (few relationships, two interesting tables to index/search and the resultant data) and is only used for reading.
What's the best way to reproduce such a thing on the BlackBerry without a database?
A couple of notes:
We'd love to use a database on the BlackBerry but, since this application is freeware, we can only consider freeware solutions (such as SQLite). We cannot push these costs to our consumers.
We are aware that 5.0 supports SQLite but we'll want to support older devices (i.e. OS 4.2).
This application cannot rely on a connection to the internet.
It looks like the following options are possibilities:
RMS (Record Management System) - Appears to be a possibility, but we haven't been able to find a good API to write these files outside of the device. For example, we'd like to prepare the database using a Java or .NET program (much as we do with SQLite) and simply transfer the resultant data files to the device. We won't be writing the records from a BlackBerry application.
BlackBerry Persistent Store - Appears to be a nicer version of RMS, with the same major drawback.
File Connection API - This appears to be our best choice, even though we have to do all the heavy lifting. I haven't had the chance to do the research, but I'm hoping there are some APIs for writing database-like formats (e.g. something akin to JSON) to flat files for applications such as the one we propose. Any help here would be appreciated.
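For illustration, a minimal sketch of reading a prepared flat file through the JSR-75 File Connection API (the path and the key/value record layout are assumptions; the file would be written off-device as DataOutputStream.writeUTF pairs):

    // Hypothetical flat-file lookup via the File Connection API (JSR 75).
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import javax.microedition.io.Connector;
    import javax.microedition.io.file.FileConnection;

    public class FlatFileStore {
        /** Linear scan for a key/value record; returns the value or null. */
        public String find(String key) throws IOException {
            FileConnection fc = (FileConnection)
                    Connector.open("file:///SDCard/appdata/records.dat", Connector.READ);
            try {
                DataInputStream in = fc.openDataInputStream();
                try {
                    while (true) {
                        String k = in.readUTF(); // records written as writeUTF pairs
                        String v = in.readUTF();
                        if (k.equals(key)) {
                            return v;
                        }
                    }
                } catch (EOFException end) {
                    return null;                 // end of file: key not found
                } finally {
                    in.close();
                }
            } finally {
                fc.close();
            }
        }
    }

A sorted file plus a small in-memory index of offsets would allow seeking instead of a linear scan.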
To use the persistent store or a file for DB storage? It depends on several things:
data amount (if it's > 1 MB, better to use the filesystem and several files)
security (if you need encryption, use the persistent store; see the sketch below)
performance (the persistent store will slow down memory performance, while filesystem I/O will hit processor performance no matter how large the file is)
framework limitations (e.g. you can't open an XML file larger than about 1.7 MB using kXML)
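For reference, a minimal sketch of the persistent store option (net.rim.device.api.system); the key value and the use of a Hashtable as the persisted contents are illustrative:

    // Hypothetical persistent store wrapper; the long key must be app-unique.
    import java.util.Hashtable;
    import net.rim.device.api.system.PersistentObject;
    import net.rim.device.api.system.PersistentStore;

    public class SettingsStore {
        private static final long KEY = 0x9c2f81e3a4d51b07L; // arbitrary app-unique id

        public static void save(Hashtable data) {
            PersistentObject po = PersistentStore.getPersistentObject(KEY);
            synchronized (po) {
                po.setContents(data); // Hashtable is implicitly persistable
                po.commit();          // survives reboots
            }
        }

        public static Hashtable load() {
            PersistentObject po = PersistentStore.getPersistentObject(KEY);
            Hashtable data = (Hashtable) po.getContents();
            return data != null ? data : new Hashtable();
        }
    }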
See also:
Blackberry - application settings save/load
J2ME/Blackberry - how to read/write text file?
Better approach for XML Creation in Blackberry
