Java EE, EJBs and file handling

I'm developing a web application in which users are allowed to upload pictures, the system will then generate thumbs for them.
My problem lies in the fact that EJBs can be distributed across several servers and thus are not supposed to handle files directly. I could store the images in the database, but I was hoping to store them as files on one of the servers. How can I do this? Is there any way to centralize the storage of files, or any approach for dealing with files in Java EE with EJBs?
Currently, I'm storing my files in a database, so I have centralized access and don't need a dedicated file server. I'm doing this because I don't know how to integrate FTP servers with EJBs. Is this a good alternative, though?
What I want is: using stateless EJBs, store the uploaded images as files, and the paths to them in the database, so I can display them using
<h:graphicImage ... />

You actually have four aspects here:
Receiving and sending files
Creating thumbnails
Storing the files somewhere every node can access
Storing the original + thumbnail entities in a common database
Since you already have a Java EE server, you probably also already have an (HTTP) servlet server, which offers numerous ways of doing load balancing and caching, not to mention the obvious potential for web-based interaction. If anything, support FTP transfer with a directory watcher as a bonus.
You should not create the thumbnails in the stateless session beans themselves; that means your servers will be crap at peak time, since the server will give priority to business logic over making new connections. Rather, first receive and store the file + original entity in the database, and then use a service bean to queue up thumbnail creation (maybe with n worker threads, or message queues if you want). You can also use native tools in some cases; we do on Linux.
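A minimal sketch of that receive-then-queue split, assuming JMS 2.0, a queue at java:/jms/queue/ThumbnailQueue, and a StoredImage JPA entity (the queue name and the entity are hypothetical, not from the original post):

// Sketch only: persist the upload, then defer thumbnail creation to a JMS
// queue instead of blocking the request. Queue name and StoredImage entity
// are assumptions.
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.inject.Inject;
import javax.jms.JMSContext;
import javax.jms.Queue;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
public class ImageIngestBean {

    @PersistenceContext
    private EntityManager em;

    @Inject
    private JMSContext jms;

    @Resource(lookup = "java:/jms/queue/ThumbnailQueue") // assumed queue
    private Queue thumbnailQueue;

    public long storeOriginal(byte[] data, String fileName) {
        StoredImage image = new StoredImage(fileName, data); // hypothetical entity
        em.persist(image);
        // An MDB listening on the queue picks up the id and renders the
        // thumbnail off the request path, at its own pace.
        jms.createProducer().send(thumbnailQueue, image.getId());
        return image.getId();
    }
}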
You should use a shared file system on a SAN, which is the right tool for sharing files across several machines. Structure your files according to your file system's limits, like the number of files per directory and read/write capacity.
And a single database will be good enough for at least a small cluster, as long as you are not killing it with big binary blobs.
If in doubt, buy more RAM ;) The thumbnails especially are very cacheable and will give good performance, also in Tomcat; if you are not familiar with multi-threading, look for a ready-made cache library. Naturally, also cache the entities, not only the files.

You might want to use a (private) FTP server for this. Your EJB beans can contact this server for storing and retrieving files.
There are various libraries in Java for accessing FTP servers. Specifically well suited for use in an EJB environment would be JCA based FTP connectors, but 'normal' ones will usually work fine too.
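For illustration, a minimal upload sketch with Apache Commons Net (host, credentials, and paths are placeholders):

// Sketch: store an image on a central FTP server using Apache Commons Net.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class FtpImageStore {

    public void upload(String localPath, String remotePath) throws IOException {
        FTPClient ftp = new FTPClient();
        try {
            ftp.connect("ftp.example.com");        // placeholder host
            ftp.login("appuser", "secret");        // placeholder credentials
            ftp.setFileType(FTP.BINARY_FILE_TYPE); // images must go as binary
            ftp.enterLocalPassiveMode();           // friendlier to firewalls
            try (InputStream in = new FileInputStream(localPath)) {
                if (!ftp.storeFile(remotePath, in)) {
                    throw new IOException("Upload failed: " + ftp.getReplyString());
                }
            }
            ftp.logout();
        } finally {
            if (ftp.isConnected()) {
                ftp.disconnect();
            }
        }
    }
}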

You can also look into using a clustered file system. RedHat Global File System and Symantec's Veritas Clustered File System are two I have experience with. These products allow you to mount the same file system across several servers for read/write access. All your application sees is just another directory. If your environment already has the requisite components (SAN and a good Sys Admin), this might be the best performing solution in a lot of use cases.
However, there are drawbacks to this approach:
You shift complexity from your app to the OS. These products aren't trivial to set up.
Scalability might become an issue if you have a large server farm. And when scaling problems arise, finding the bottleneck is not as straightforward as with arjan's FTP solution.
You need a SAN.

If you can make reasonable assumptions about "where" your EJB instance is, direct handling of a file is no problem. In your case (since you want to have files) I would read the image into a local temp folder and upload it to a remote destination.
A possible way to do that is http://txconnect.sourceforge.net/, a JCA transaction adapter that handles (among others) FTP connections. Configure the factory in XML, inject the connection into your bean, and you are ready to go.
Depending on your application server there might be a special connector available (e.g. for Oracle or IBM systems).

I'd suggest you stick to your current solution. FTP access (if needed for purposes other than just keeping files together) can be built on top of your EJB layer, and displaying images stored in the DB is not a problem: a simple servlet will do the trick.
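A minimal sketch of such a servlet, assuming a table images(id, content_type, data) and a container-managed DataSource (all names are assumptions):

// Sketch: stream an image stored as a BLOB straight from the database.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.annotation.Resource;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

@WebServlet("/image")
public class ImageServlet extends HttpServlet {

    @Resource(lookup = "java:comp/env/jdbc/AppDS") // assumed DataSource name
    private DataSource ds;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        long id = Long.parseLong(req.getParameter("id"));
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "SELECT content_type, data FROM images WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    resp.sendError(HttpServletResponse.SC_NOT_FOUND);
                    return;
                }
                resp.setContentType(rs.getString(1));
                try (InputStream in = rs.getBinaryStream(2);
                     OutputStream out = resp.getOutputStream()) {
                    in.transferTo(out);
                }
            }
        } catch (SQLException e) {
            throw new ServletException(e);
        }
    }
}

The question's <h:graphicImage ... /> could then point at it, e.g. <h:graphicImage value="/image?id=#{photo.id}" /> (the EL expression is illustrative).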

You can:
Create a WebDAV-based file share. This can be done using many libraries available for Java or other languages. One such library is: http://milton.ettrema.com/index.html
All EJB instances can read/write images from this file share. They would need to use WebDAV client libraries (see the sketch below).
Do set up backups of the directories behind this file share.
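As an illustration of the client side, a sketch using the Sardine WebDAV client (server URL and credentials are placeholders):

// Sketch: read/write images on a WebDAV share with the Sardine client.
import java.io.IOException;
import java.io.InputStream;
import com.github.sardine.Sardine;
import com.github.sardine.SardineFactory;

public class WebDavImageStore {

    private static final String BASE = "http://dav.example.com/images/"; // placeholder
    private final Sardine sardine = SardineFactory.begin("appuser", "secret");

    public void store(String name, byte[] data) throws IOException {
        sardine.put(BASE + name, data);
    }

    public InputStream load(String name) throws IOException {
        return sardine.get(BASE + name);
    }
}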

Related

file server to replace a clustered file system

For various maintenance, stability and backup reasons I need to replace a 10 node (10 Linux hosts) ocfs2 shared filesystem with something that does not rely on a shared disk. The client applications are PHP in a Linux only environment.
Right now each PHP client requests a unique id from the database and creates a file with that id/name on the shared disk. The database stores all the file metadata. Existing files are accessed in a similar fashion.
I want to replace the shared disk solution with putfile(id, '/tmp/path') and getfile(id, '/tmp/path') calls to a file server over the network. Client-side I could work with the files in a tmpfs. The server should handle compression etc. This would also free me of the PHP client dependency and I could use the file server directly from some other applications as well like from Windows Delphi applications.
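Sketched as an interface (illustrative only, not an existing product's API), the contract I'm after would be roughly:

// Sketch of the file-server contract described above.
import java.io.IOException;

public interface FileStore {
    // Upload the local file at tmpPath and store it under the given id.
    void putFile(long id, String tmpPath) throws IOException;

    // Fetch the file stored under id into the local tmpPath.
    void getFile(long id, String tmpPath) throws IOException;
}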
In theory a FTP based solution could even work, though it would probably not perform very well. Or am I wrong to distrust the old FTP protocol?
I have over 30 million file IDs currently, most of them being a few KB in size with notable exceptions up to 300MB, totalling only 320GB. The PHP client also does some compression and grouping with gzip and tar; it's all very clumsy.
I was hoping to find something fast and simple like memcachedb, but for files. The closest I've found is Hadoop's HDFS, but I don't think that's quite the correct solution.
Any recommendations? Something obvious I'm missing?

Architecture for Image hosting site

Scenario:
* A user uploads an image and enters some information about that image
* Information and image get uploaded (to all servers)
* User gets confirmation that image is uploaded
Factors:
* Dozens of servers, distributed all over the world
* Image should end up on disk, since it will be served
* Information should end up in a database
* Images are small, no bigger than 5MB
We considered various architectural solutions and technologies (git, murder, and rsync, to name a few), but we're still not 100% sure how to approach this. The current solution is way too slow and we're looking to improve it (we push files to all servers from our "upload" server).
Any thoughts? Thanks in advance
First, let's assume for simplicity that the data is written to a file and both files are zipped up together. So below I'm going to assume there is only one file (the zip file). This is just a detail (and is in fact completely unnecessary for BitTorrent!).
BitTorrent (or something that works in a similar way) is basically the fastest way to do this for large files. As soon as a server has downloaded a piece of the file, it will start trying to upload it to any other servers that need it. You could modify BitTorrent to prefer geographically closer IPs in order to minimise inter-LAN bandwidth usage.
If you don't need to use BitTorrent, or if the files are small enough that it wouldn't make sense, just make one server upload to two others, then those two others upload to two others each, and so on. Or you could use a fan-out factor of more than 2; experiment with what works best for you.
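A toy sketch of that fan-out scheduling (server names are placeholders; the actual transfer mechanism, scp/rsync/HTTP, is out of scope): each round, a server that already has the file pushes it to fanOut servers that don't.

// Toy sketch: print which server pushes the file to which.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class FanOutPlanner {

    public static void plan(List<String> servers, int fanOut) {
        Queue<String> haveFile = new ArrayDeque<>();
        haveFile.add(servers.get(0)); // the upload server seeds the file
        int next = 1;
        while (next < servers.size()) {
            String source = haveFile.remove();
            for (int i = 0; i < fanOut && next < servers.size(); i++) {
                String target = servers.get(next++);
                System.out.println(source + " -> " + target);
                haveFile.add(target); // target can push in later rounds
            }
            haveFile.add(source); // source stays available as well
        }
    }

    public static void main(String[] args) {
        plan(List.of("upload", "us-1", "us-2", "eu-1", "eu-2", "ap-1", "ap-2"), 2);
    }
}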
Take a look at Riak. It offers very good support for massive distribution and data replication. We've been using it successfully for a while now and it's proven very resilient.
I'd model it on Riak with the image and metadata stored separately, with a link between them. They both end up in a "database" and on disk this way, with an easy way to get from one to the other, and accessible via a URL.
Note: for replication over WAN you'll need the enterprise version, which isn't free.

Which is a better method for storing images - folder or SQL Server as binary?

I am planning the development of a photo gallery application for a client. I am developing the app in ASP.NET 3.5 and would like to develop it so that I can re-use the application across multiple platforms using various front-ends. Basically, I am wondering what the advantages and disadvantages are of storing images in the database as binary files, as opposed to simply storing the files in an application folder.
Any advice would be greatly appreciated!
Thanks,
Tristan
SQL Server 2008 supports FILESTREAM storage.
The files are stored on an NTFS volume like plain files, but are subject to transaction control and can be accessed via special file names passed to Win32 API functions (and of course any API built upon it) with additional SQL Server security checks (like GRANT options etc).
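From client code a FILESTREAM column still behaves like a varbinary(max); for illustration, a plain JDBC read (connection string and table are assumptions, and the same idea carries over to ADO.NET):

// Sketch: read a FILESTREAM-backed column through ordinary JDBC.
import java.io.FileOutputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FilestreamRead {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=Gallery;user=app;password=secret"; // placeholder
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                     "SELECT photo FROM dbo.Photos WHERE id = ?")) {
            ps.setInt(1, 42);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    try (InputStream in = rs.getBinaryStream(1);
                         FileOutputStream out = new FileOutputStream("photo.jpg")) {
                        in.transferTo(out);
                    }
                }
            }
        }
    }
}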
The disadvantage of storing as binary is that you blow the database up to incredible sizes. If you were to use an express edition of SQL Server, which is limited to 4GB per database, your photo gallery would "finish" quite soon.
The advantage is that you can easily manipulate the access restrictions per file and per user. You just look at user rights and decide whether you serve back the image or not.
File system storage will offer much better performance in saving and serving images, and is supported in every platform. If you can live without the security and transactional goodies you get from db storage, then I would go with the file system.
We had a large LOB application that provided bank tellers with identification information about the member standing in front of them. Our textual data was stored in SQL Server; image data was stored in files, and the database field simply had a filename. This approach works well if you are behind the firewall. Reading and writing files is easy.
The trouble is the file management. You should secure the file system so that random people cannot view the directory. Also, backups are more complicated with loose image files: you have to back up the database and the image files. And the fields can reference paths that no longer exist. For example, some IT dude decides to move the image folder and now all the references in the db are broken.
If your application needs to pass information through the firewall, I would suggest storing images in SQL Server using the mentioned FILESTREAM storage.
Storing the images in the database would have saved us some grief. We would have only had to backup the database, it would be more secure, the references would never break and we would not have had to jump through hoops to get files from network outside the firewall.
This debate has been going on in almost any SQL Server community for ages. There are good arguments for both sides, and there is definitely no one-size-fits-all answer. It really depends on your individual situation and on many factors, such as number of users, average file size, update frequency, read/write ratio, disk subsystem, yada yada yada...
But as you mention SQL Server Express, probably the most important factor is the max database size limit, and this is a very good reason to go for the filesystem approach. Anyway, this research paper might still be interesting for you: "To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?" This paper used to be on the Microsoft Research site at http://research.microsoft.com/apps/pubs/default.aspx?id=64525, but that link doesn't work for me. SQL Server has come a long way since then; Quassnoi already mentioned FILESTREAM, for example.

Store options in database or in a file?

In a client-server database application, where would you store the different options the client needs to read from the server: in the database, or in some file which is then transferred over the network? Or is there a better way?
It depends on the specifics.
Generally speaking, there is very little reason to store something in a file on the server (apart from files themselves, such as images, videos, songs and so forth) rather than in a database.
If you're storing, say, client preferences you may store them in a file on the client but this has portability problems (in that the profile settings don't go to another PC with the same user) but it might be appropriate if the client can be used "offline".
Probably the best of both worlds is to store things in a database on the server and cache them on the client (in files probably) to allow offline usage, if that's appropriate to the application in question.
Depends on how dynamic access to the values has to be.
Putting them in a file means having to edit the file for changes. You have to edit the file, perhaps repackage the app with the new values, and bounce the server. If you're using an exploded version of the code on the server, it means giving clients write permissions on the server, which can be problematic.
If you put them in the database, clients can see changes without having to edit a file. They get to see the values right away. No server bounce needed. And you can dole out access using database permissions.
UPDATE: Another thought - are the options for all users or just a single individual? If it's the former, you have to worry about "oil canning", where one user changes a value and another switches it back. If it's for a single individual, you'd have to have a file for each user. A large user base could be a problem.
The clients need to read just some configuration settings, that's all. Also those options will not be changed from the clients, only from the server.
It depends upon the use cases, but my own experience says it is better to store options in the database. In today's world, we need to go towards a shared-nothing type of architecture as far as we can. That way, if tomorrow you make your application failsafe, you will find it is better to store all options in the database; otherwise you need to sync your files across the various nodes running copies of your application. If the options are in the database, on the other hand, it is all there in the db, and most DBs support this kind of high-availability use case, so you do not need to worry about keeping files in sync across nodes.
That depends on a number of factors:
If you have more than one front-end server, the database approach requires less maintenance.
If you have a Dev and QA as well as a Prod environment, the file approach makes sense, as the config will not be changed when copying/restoring a database from one of the other environments.
Normally, I would store the configuration settings in a database, if available. However, in the project I am now working on, the client wants multiple distributed copies of the database, updated from a master. Each installation has its own configuration, such as printer settings.
The answer in this case is to have a local config file where the local user settings are stored. The application generates the default for each setting. If need be, the file can be edited to update settings on the fly.
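A minimal sketch of that local config file with generated defaults, using java.util.Properties (file name and keys are illustrative):

// Sketch: per-installation settings in a local file; the application
// writes defaults on first run, and the file can be edited on the fly.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class LocalSettings {

    private static final Path FILE = Path.of("settings.properties"); // illustrative
    private final Properties props = new Properties();

    public LocalSettings() throws IOException {
        // Generate defaults first, so every setting always has a value.
        props.setProperty("printer.name", "default");
        props.setProperty("printer.copies", "1");
        if (Files.exists(FILE)) {
            try (InputStream in = Files.newInputStream(FILE)) {
                props.load(in); // file values override the defaults
            }
        } else {
            try (OutputStream out = Files.newOutputStream(FILE)) {
                props.store(out, "Local installation settings");
            }
        }
    }

    public String get(String key) {
        return props.getProperty(key);
    }
}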

Store images(jpg,gif,png) in filesystem or DB? [duplicate]

Possible Duplicates:
Which is more secure: filesystem or database?
User images - database vs. filesystem storage
store image in database or in a system file ?
I can't decide which one I should follow. Can you guys give some opinions? Should I store my images in the file-system or DB? (I would like to prevent others from stealing my images)
When you answer this question, please include comparisons of the security, performances etc.
Thanks.
Exact Duplicate: User Images: Database or filesystem storage?
Exact Duplicate: Storing images in database: Yea or nay?
Exact Duplicate: Should I store my images in the database or folders?
Exact Duplicate: Would you store binary data in database or folders?
Exact Duplicate: Store pictures as files or or the database for a web app?
Exact Duplicate: Storing a small number of images: blob or fs?
Exact Duplicate: store image in filesystem or database?
Moving your images into a database and writing the code to extract the image may be more hassle than it's worth. It will all go back to the business requirements surrounding the need to protect the images, or the requirement for performance.
I'd suggest sticking to the tried and true system of storing the filepath or directory location in the DB, and keeping the files on disk. Here's why:
A filesystem is easier to maintain. Some thought has to be put into the structure and organization of the images, e.g. a directory for each customer, and a subdirectory for each [Attribute-X] and another subfolder for each [Attribute-Y]. Keeping too many images in one directory (hundreds of thousands, say) will end up slowing down file access; a common sharding scheme is sketched after this list.
If the idea of storing in a DB is a counter-measure against filesystem loss (i.e. a disk goes down, or a directory is deleted by accident), then I'd counter with the suggestion that if you keep the files under source control, it's no problem to retrieve any lost/missing/deleted files.
If you ever need to scale and move to a content distribution scenario, you'd have to move back out to the filesystem or perform a big extract to push out to the providers.
It also goes with the saying: "keep structured data in a database". Microsoft has an article on Managing Unstructured Data.
If security is an issue to be addressed, the filesystem has a whole structure with ACLs. Reinventing the wheel on security may be out of scope in the business requirements.
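On the directory-organization point above, a common trick is to shard by a hash of the id so no single directory accumulates hundreds of thousands of files; a sketch:

// Sketch: derive a two-level directory shard from the image id.
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ImagePaths {

    public static Path shardedPath(Path root, String imageId) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1")
                    .digest(imageId.getBytes(StandardCharsets.UTF_8));
            String hex = HexFormat.of().formatHex(hash);
            // e.g. root/3f/a2/<imageId>.jpg
            return root.resolve(hex.substring(0, 2))
                       .resolve(hex.substring(2, 4))
                       .resolve(imageId + ".jpg");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(shardedPath(Path.of("/srv/images"), "customer42-photo7"));
    }
}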
A large amount of discussion for this topic is also found at:
Question 3748
Question 561447
Having your images stored as varbinary or a blob of some kind (depending on your platform) is, I'd suggest, more hassle than it's worth. The effort that you'll need to expend means more code that you'll have to maintain, unit test, and defend against defects.
If your environment can support SQL Server 2008, you can have the best of both worlds with their new FileStream datatype in SQL 2008.
An MSDN article is touting the FileStream datatype in SQL 2008 as high performance.
SQL Skills has a great article with some SQL 2008 Filestream performance measurements.
Here is an article addressing varbinary vs. FileStream and performance of both datatypes.
If you are a SQL Mag subscriber, you can see a great article at SQL Mag on SQL 2008 FileStream.
Microsoft Research article: To Blob or Not To Blob
I'd love to see studies in real-world scenarios with large user bases like Flickr or Facebook.
Again, it all goes back to your business requirements. Good luck!
It doesn't matter where you store them in terms of preventing "theft". If you deliver the bytestream to a browser, it can be intercepted and saved by anyone. There's no way to prevent that (I'm assuming you're talking about delivering the images to a browser).
If you're just talking about securing images on the machine itself, it also doesn't matter. The operating system can be locked down as securely as the database, preventing anyone from getting at the images.
In terms of performance (when presenting images to a browser), I personally think it'll be faster serving from a filesystem. You have to present the images in separate HTTP transactions anyway, which would almost certainly require multiple trips to the database. I suspect (although I have no hard data) that it would be faster to store the image URLs in the database which point to static images on the file system - then the act of getting an image is a simple file open by the web server rather than running code to access the database.
You're probably going to have to get a whole ton of "but the filesystem is a DB" answers. This isn't one of them.
The filesystem option depends on many factors, for example: does the server have write permissions to the directory? (And yes, I have seen servers where Apache couldn't write to DocumentRoot.)
If you want 100% cross-compatibility across platforms, then the Database option is the best way to go. It'll also let you store image-related metadata such as a user ID, the date uploaded, and even alternate versions of the same image (such as cached thumbnails).
On the down side, you need to write custom code to read images from the DB and serve them to the user, while any standard web server would just let you send the images as they are.
When it comes to the bottom line, though, you should just choose the option that fits your project, and your server configuration.
Store them in FileSystem, store the file path in the DB.
Of course you can make this scalable and distributed; you just need to keep the image dirs synced between servers (for JackM), or use shared storage connected to multiple web front-end servers.
Anyway, the stealing part was covered in your other question, and preventing it is basically impossible. People who can access the images will always be able (with more or less work) to save them locally, even if it means print-screen, paste into Photoshop, and save.
It depends on how many images you expect to handle, and what you have to do with them. I have an application that needs to temporarily store between 100K and several million images a day. I write them in 2GB contiguous blocks to the filesystem, storing the image identifier, filename, beginning position, and length in a database.
For retrieval I just keep the indices in memory, the files open read only and seek back and forth to fetch them. I could find no faster way to do it. It is far faster than finding and loading an individual file. Windows can get pretty slow once you've dumped that many individual files into the filesystem.
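A sketch of that retrieval path, assuming the (block file, offset, length) index has already been loaded from the database (names are illustrative; the real thing keeps the block files open instead of reopening per read):

// Sketch: serve images out of large contiguous block files by seeking.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlockImageStore {

    record Entry(String blockFile, long offset, int length) {}

    private final Map<Long, Entry> index = new ConcurrentHashMap<>();

    public void addToIndex(long imageId, String blockFile, long offset, int length) {
        index.put(imageId, new Entry(blockFile, offset, length));
    }

    public byte[] read(long imageId) throws IOException {
        Entry e = index.get(imageId);
        try (RandomAccessFile raf = new RandomAccessFile(e.blockFile, "r")) {
            raf.seek(e.offset);      // jump to the image's start position
            byte[] data = new byte[e.length];
            raf.readFully(data);     // read exactly one image
            return data;
        }
    }
}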
I suppose it could contribute to security, since the images would be somewhat difficult to retrieve without the index data.
For scalability, it would not take long to put a web service in front of it and distribute it across several machines.
For a web application I look after, we store the images in the database, but make sure they're well cached in the filesystem.
A request from one of the web server frontends for an image requires a quick memcache check (sketched below) to see if the image has changed in the database and, if not, serves it from the filesystem. If it has changed, it fetches it from the central database and puts a copy in the filesystem.
This gives most of the advantages of storing them in the filesystem while keeping some of the advantages of a database: we only have one location for all the data, which makes backups easier and means we can scale across quite a few machines without issue. It also doesn't put excessive load on the database.
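A sketch of that memcache check, using the spymemcached client; the version-key scheme and the database fetch helper are assumptions:

// Sketch: serve from the filesystem cache when the memcached version
// matches; otherwise refetch from the central database.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import net.spy.memcached.MemcachedClient;

public class CachedImageReader {

    private final MemcachedClient memcached;
    private final Path cacheDir = Path.of("/var/cache/images"); // placeholder

    public CachedImageReader() throws IOException {
        memcached = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    public byte[] fetch(long imageId) throws IOException {
        Object version = memcached.get("imgver:" + imageId); // bumped on update
        String v = version == null ? "0" : version.toString();
        Path cached = cacheDir.resolve(imageId + "." + v);
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached); // unchanged: cheap local hit
        }
        byte[] fresh = loadFromDatabase(imageId); // hypothetical helper
        Files.createDirectories(cacheDir);
        Files.write(cached, fresh);               // refresh the local copy
        return fresh;
    }

    private byte[] loadFromDatabase(long imageId) {
        throw new UnsupportedOperationException("DB fetch elided in this sketch");
    }
}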
If you want your application to be scalable, do not use a file system on the actual web servers. You can store the location of files in a persistent datastore such as a database or a NoSQL solution.
For an AWS-based solution, for example, you could (sketched below):
Store the images on S3
Save the S3 key to the database
Serve your images on S3 through CloudFront (Amazon's CDN)
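A sketch of the S3 step with the AWS SDK for Java v2; bucket, region, and the CloudFront domain are placeholders, and the DB insert is elided:

// Sketch: upload to S3, return the CDN URL whose key you'd store in the DB.
import java.nio.file.Path;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3ImageUpload {

    public static String upload(Path image, String key) {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-image-bucket") // placeholder bucket
                            .key(key)
                            .contentType("image/jpeg")
                            .build(),
                    RequestBody.fromFile(image));
        }
        // Save `key` to the database, then serve through the CloudFront
        // distribution fronting the bucket:
        return "https://d1234abcd.cloudfront.net/" + key; // placeholder domain
    }
}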
Saving your files to the DB provides some security, in that another user would need access to the DB in order to retrieve the files; but as far as efficiency goes, it means a SQL query for every image loaded, leaving all the load on the server side. Do yourself a favor and find a way to protect your images inside the filesystem; there are many.
The biggest out-of-the-box advantage of a database is that it can be accessed from anywhere on the network, which is essential if you have more than one server.
If you want to access a filesystem from other machines you need to set up some kind of sharing scheme, which could add complexity and could have synchronization issues.
If you do go with storing images in the database, you can still use local (unshared) disks as caches to reduce the strain on the DB. You'd just need to query the DB for a timestamp or something to see if the file is still up-to-date (if you allow files that can change).
If the issue is scalability you'll take a massive loss by moving things into the database. You can round-robin webservers via DNS but adding the overhead of both a CGI process and a database lookup to each image is madness. It also makes your database that much harder to maintain and your images that much harder to process.
As to the other part of your question: you can secure access to a file as easily as a database record, but at the end of the day, as long as there is a URL that returns a file, you have limited options to prevent that URL being used (at least without making cookies and/or JavaScript compulsory).
Store files in a file server, and store primitive data in a database. While file servers (especially HTTP-based) scale well, database servers do not. Don't mix them together.
If you need to edit, manage, or otherwise maintain the images, you should store them outside the database.
Also, the filesystem has many security features that a database does not.
The database is good for storing pointers (file paths) to the actual data.
