I'm running a database-backed web site on shared hosting that occasionally gets swarmed after a mention on a link sharing site.
Because of how much load the first couple of traffic surges put on the database, I have implemented file-based caching.
When a query runs, I just serialize the resultset object and save it to a file. I have a sub-directory structure in the cache directory that keeps thousands of files from ending up in the same directory. Next time I have to run the same query, I just pull the object out of the file instead.
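Roughly, the idea looks like this (a simplified sketch in Python just for illustration; paths and names are made up):

```python
import hashlib
import os
import pickle

CACHE_DIR = "cache"  # illustrative location

def cache_path(sql):
    # Hash the query text and use leading hex digits as subdirectories,
    # so thousands of cache files don't land in one directory.
    h = hashlib.sha1(sql.encode()).hexdigest()
    return os.path.join(CACHE_DIR, h[:2], h[2:4], h + ".pkl")

def cached_query(sql, run_query):
    path = cache_path(sql)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)      # cache hit: no database access
    rows = run_query(sql)              # cache miss: hit the database
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(rows, f)
    return rows
```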
It's been working pretty well so far. But I'm worried that I am overlooking something, and possibly asking for trouble if there is a higher level of traffic than I've previously enjoyed. Or maybe there's just an easier way to do this?
Please poke some holes in this for me. Thanks!
Ideally, cache in memory to remove disk access. Have a look at something like memcached.
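A minimal sketch of the same query cache backed by memcached instead of files, assuming a memcached daemon on localhost:11211 and the pymemcache client library (the key scheme is illustrative):

```python
import hashlib
import pickle

from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def cached_query(sql, run_query, ttl=300):
    key = "q:" + hashlib.sha1(sql.encode()).hexdigest()
    hit = client.get(key)              # returns None on a miss
    if hit is not None:
        return pickle.loads(hit)       # served from RAM, no disk access
    rows = run_query(sql)
    client.set(key, pickle.dumps(rows), expire=ttl)  # expires after ttl seconds
    return rows
```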
Since you're on shared hosting, you should do some throttling (google "Throttling your web server (Oct 00)" for ideas).
A related interesting read (which also mentions Stonehenge::Throttle) is
Building a Large-Scale E-commerce site with Apache and mod_perl
http://perl.apache.org/docs/tutorials/apps/scale_etoys/etoys.html
If you go to my Heroku-hosted to-do list program, you can put test data in, but it's gone pretty soon. This is because, I learned, Heroku has an "ephemeral" filesystem and disposes of any data that users write to it at runtime. I don't know how to set up a PostgreSQL database or any other kind of database (although maybe I soon will, as I'm working through Hartl's Rails tutorial). I'm just using a humble YAML file. It works fine in my local environment.
Any suggestions for beginners to work around this problem, short of just learning how to host a database? Is there another free service I might use that would work without further setup? Any advice greatly welcome.
I fully understand that I can't do what I'm trying to do with Heroku (see e.g. questions like this one). I just want to understand my options better.
UPDATE: Looks like this and this might have some ideas about using Dropbox to host (read/write) flat files.
The answer is no. But I'll take a minute to explain why.
I realize that you aren't yet familiar with building web applications, databases, and all that stuff. And that's OK! This is an excellent question.
What you need to know, however, is that doing what you're asking is a really bad idea when you're trying to build scalable websites. And Heroku is a platform company that SPECIFICALLY tries to help developers building scalable websites. That's really what the platform excels at.
While Heroku is really easy to learn and use, it isn't targeted at beginners. It's meant for experienced developers. This is really clear if you take a look at what Heroku's principles are, and what policies they enforce on their platform.
Heroku goes out of their way to make building scalable websites really easy, and makes it VERY difficult to do things that would make building scalable websites harder.
So, let's talk for a second about why Heroku has an ephemeral file system in the first place!
This design decision forces you (the developer of the application) to store files that your application needs in a safer, faster, dedicated file storage service (like Amazon S3). This practice results in a lot of scalability benefits:
If your webservers don't need to write to disk, they can be deployed many many times without worrying about storage constraints.
No disks need to be shared across webservers. Sharing disks typically causes IO contention and can adversely affect performance.
It makes it easy to scale your web application horizontally across commodity servers, since disk resources aren't required.
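To make the S3 option above concrete, here's a minimal upload sketch using the boto3 library (bucket and key names are made up; credentials are assumed to come from the environment):

```python
import boto3

s3 = boto3.client("s3")

# Upload a user file to a dedicated storage service instead of local disk.
s3.upload_file("avatar.png", "my-app-uploads", "avatars/user-42.png")

# Later, hand out a time-limited URL instead of serving the file yourself.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-app-uploads", "Key": "avatars/user-42.png"},
    ExpiresIn=3600,
)
```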
So, the reason you cannot store flat files on Heroku is that doing so causes scalability and performance problems, and would make it nearly impossible for Heroku to help you scale your application easily (which is their main goal).
That is why it is recommended to use a file storage service to store files (like Amazon S3), or a database for storing data (like Postgres).
What I'd recommend doing (personally) is using Heroku Postgres. You mentioned you're using Rails, and Rails has excellent Postgres support built in. It has what's called an ORM that lets you talk to the database using some very simple Ruby objects, removing almost all the prerequisite database background needed to get going. It's really fun / easy once you give it a try!
Finally: Heroku Postgres also has a great free plan, which means you can store the data for your todo app in it for no cost at all.
Hope this helps!
I'm writing a document editing web service, in which documents can be edited via a website, or locally and pushed via git. I'm trying to decide if the documents should be stored as individual documents on the filesystem, or in a database. The points I'm wondering are:
If they're in a database, is there any way for git to see the documents?
How much higher are the overheads using the filesystem? I assume the OS is doing a lot more work. How can I alleviate some of this? For example, the web editor autosaves, what would the best way to cache the save data be, to minimise writes?
Does one scale significantly better or worse than the other? If all goes according to plan, this will be a service with many thousands of documents being accessed and edited.
If the documents go into a database, git can't directly see the documents. git will see the backing storage file(s) for the database, but have no way of correlating changes there to changes to files.
The overhead of using the database is higher than using a filesystem, as answered by Carlos. Databases are optimized for transactions, which they'll do in memory, but they still have to hit the file. Unless you program the application to do database transactions at a sub-document level (e.g. changing only modified lines), the database will give you no performance improvement. Most modern filesystems do caching too, and you can 'write' in a way that sits in RAM rather than going to your backing storage as well. You'll need to manage the granularity of the 'autosaves' in your application (every change? every 30 seconds? 5 minutes?), but really, doing it at the same granularity with a database will cause the same amount of traffic to the backing store.
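To illustrate the granularity point, a minimal sketch of time-based autosave batching (Python; all names illustrative): keep the latest content in memory and flush to disk at most once per interval.

```python
import time

class AutosaveBuffer:
    def __init__(self, path, interval=30.0):
        self.path = path
        self.interval = interval       # seconds between flushes
        self.pending = None
        self.last_flush = 0.0

    def save(self, content):
        self.pending = content         # later edits replace earlier unsaved ones
        if time.time() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.pending is not None:
            with open(self.path, "w") as f:
                f.write(self.pending)  # one write per interval, not per keystroke
            self.pending = None
            self.last_flush = time.time()
```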
I think you intended to ask "does the filesystem scale as well as the database?" :) If you have some way to organize your files per-user, and you figure out the security issue of ensuring a particular user can only access/modify the files they should (both doable, imo), the filesystem should work fine.
Filesystem will always be faster than a DB, because after all, DBs store their data in the filesystem!
Git is quite efficient on its own, as proven by GitHub, so I say you stick with Git and work around it.
After all, Linus should know something... ;)
I'm developing a web application in which users are allowed to upload pictures, the system will then generate thumbs for them.
My problem lies in the fact that EJBs can be distributed across several servers and thus are not allowed to handle files directly. I could store the images in the database, but I was hoping to store them as files on one of the servers. How can I do this? Is there any way to centralize the storage of files? Or any approach to dealing with files in Java EE with EJBs?
Currently, I'm storing my files in a database, so I have centralized access and don't need a dedicated file server. I'm doing this because I don't know how to integrate FTP servers and EJBs. Is this a good alternative, though?
What I want is: Using Stateless EJBs, store the uploaded images as files and the path to them in the database. So I can display them using
<h:graphicImage ... />
You actually have four aspects here:
Receiving and sending files
Creating thumbnails
Storing the files somewhere every node can access
Storing the original + thumbnail entities in a common database
Since you already have a Java EE server, you probably also already have an (HTTP) servlet server, for which there are numerous ways of doing load balancing and caching, not to mention the obvious potential for web-based interaction. If anything, support FTP transfer with a directory watcher as a bonus.
You should not create the thumbnails using stateless session beans; this means your servers will be crap at peak time - the server will give priority to business logic over making new connections. Rather, first receive and store the file + original entity in the database, and then use a service bean to queue up thumbnail creation (maybe with n worker threads or message queues if you want). You can also use native tools in some cases; we do on Linux.
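A sketch of the store-first, thumbnail-later pattern (shown in Python for brevity; in Java EE the queue would typically be JMS with message-driven beans as workers, and make_thumbnail stands in for real image scaling):

```python
import queue
import threading

jobs = queue.Queue()

def make_thumbnail(path):
    pass  # stand-in for actual image scaling (library call or native tool)

def worker():
    while True:
        path = jobs.get()              # blocks until a job is queued
        try:
            make_thumbnail(path)
        finally:
            jobs.task_done()

# n worker threads, decoupled from request handling
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# In the upload handler: store the file and its entity first, then enqueue:
jobs.put("/shared/files/ab/cd/original.jpg")
jobs.join()                            # e.g. in a test: wait for workers to finish
```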
You should use a shared file system (e.g. on a SAN), which is the right tool for sharing files across several machines. And structure your files according to your file system's limits - like the number of files per directory and read/write capacity.
And a single database will be good enough for at least a small cluster, as long as you are not killing it with big binary blobs.
If in doubt, buy more RAM ;) The thumbnails especially are very cacheable and will give good performance in Tomcat too - if you are not familiar with multi-threading, look for an existing cache library. Also, naturally, cache the entities, not only the files.
You might want to use a (private) FTP server for this. Your EJB beans can contact this server for storing and retrieving files.
There are various libraries in Java for accessing FTP servers. Specifically well suited for use in an EJB environment would be JCA based FTP connectors, but 'normal' ones will usually work fine too.
You can also look into using a clustered file system. RedHat Global File System and Symantec's Veritas Clustered File System are two I have experience with. These products allow you to mount the same file system across several servers for read/write access. All your application sees is just another directory. If your environment already has the requisite components (SAN and a good Sys Admin), this might be the best performing solution in a lot of use cases.
However, there are drawbacks to this approach:
You shift complexity from your app to the OS. These products aren't trivial to set up.
Scalability might become an issue if you have a large server farm. And when scaling problems arise, finding the bottleneck is not as straightforward as with arjan's FTP solution.
You need a SAN.
If you can make reasonable assumptions about "where" your EJB instance is, direct handling of a file is no problem. In your case (since you want to have files) I would read the image into a local temp folder and upload it to a remote destination.
A possible way to do that is http://txconnect.sourceforge.net/ , a JCA transaction adapter that handles (among others) FTP connections. Configure the factory in XML, inject the connection into your bean, and you are ready to go.
Depending on your application server, there might be a special connector available (e.g. for Oracle or IBM systems).
I'd suggest you stick to your current solution. FTP access (if needed for purposes other than just keeping files together) can be built on top of your EJB layer. Displaying images stored in the DB is not a problem; a simple servlet will do the trick.
You can:
Create a WebDAV-based file share. This can be done using one of the many libraries available for Java or other languages. One such library is: http://milton.ettrema.com/index.html
All EJB instances can read/write images from this file share. They would need to use WebDAV client libraries.
DO set up backups of the directories behind this file share.
I'm considering using SQLite as a production database for a site that would receive perhaps 20 simultaneous users, but with the potential for a peak that could be many multiples of that (since the site would be accessible on the open internet and there's always a possibility that someone will post a link somewhere that could drive many people to the site all at once).
Is SQLite a possibility?
I know it's not an ideal production scenario. I'm only asking whether this is within the realm of realistic possibility.
SQLite only allows one writer at a time (a write locks the whole database), so you may have problems running it on a production website. If you're looking for a 'lighter' database, perhaps consider trying a contemporary object-document store like CouchDB.
By all means, continue to develop against SQLite, and you're probably fine to use it initially. If your application gains more users down the track, however, you're going to want to transition to Postgres or MySQL.
The author of SQLite addresses this on the website:
SQLite works great as the database engine for most low to medium traffic websites (which is to say, most websites). The amount of web traffic that SQLite can handle depends on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic.
The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015), it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.
So I think the long and short of it is, go for it, and if it's not working well for you, making the transition to an enterprise-class database is fairly trivial anyway. Do take care of your schema, however, and design your database with growth and efficiency in mind.
Here's a thread with some more independent comments around using SQLite for a production web application. It sounds like it has been used with some mixed results.
Edit (2014):
Since this answer was posted, SQLite has gained a multi-threaded mode and a write-ahead logging (WAL) mode, which may influence your evaluation of its suitability for low-to-medium traffic sites.
Charles Leifer has written a blog post about SQLite's WAL (write ahead logging) feature and some well-considered opinions on appropriate use cases.
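For reference, a minimal sketch of enabling WAL from Python's built-in sqlite3 module; once set, the journal mode persists in the database file:

```python
import sqlite3

conn = sqlite3.connect("app.db")
# Switch to write-ahead logging: readers no longer block the writer
# (and vice versa). The pragma returns the journal mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # -> "wal"
```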
This small excerpt from the SQLite website says it all:
Is the data separated from the application by a network? → choose client/server
Many concurrent writers? → choose client/server
Big data? → choose client/server
Otherwise → choose SQLite!
SQLite "just works" (until it doesn't of course)
We often use SQLite for internal databases; the employee directory, our calendar of events, and other intranet services all run on lightweight databases. It would be major overkill to run these apps, at the scale we do, on a "real" database like MySQL. This is especially true when you factor in that they're running alongside 4 other virtual machines on a single mid-range computer.
At one point we had an outward-facing site that ran on an SQLite DB for months with only a single reboot required. Obviously it was very low traffic, but it puttered along nicely for what it did.
We encountered a similar choice in an environment with absolutely no writes, and we selected SQLite.
See my blog post on the subject:
Well, the main assumption which makes this solution theoretically possible is that our SQLite database is totally read-only. Our server code should never change it. This would solve any locking problems, as there are no read locks. We could find nowhere on the internet anyone saying there is a problem in high-throughput reading of SQLite when there are no writes - it could be possible!
I think it would depend mostly on what your read/write ratio will be. If it's mostly reading from the database, you may be okay. Multi-user writing in SQLite can be a problem because of how it locks the database.
People speak about concurrency problems, but SQLite has a busy handler: incoming requests that hit a locked database can wait and retry for a configurable amount of time rather than failing immediately.
I've read claims that the default timeout setting is zero, meaning it times out immediately, and that's nonsense. Maybe people didn't adjust this setting?
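For example, with Python's built-in sqlite3 module the busy timeout is adjustable both at connect time and via a pragma (the five-second connect default is Python's, not raw SQLite's):

```python
import sqlite3

# Wait up to 30 seconds on a locked database instead of raising
# "database is locked" immediately (Python's own default is 5 seconds).
conn = sqlite3.connect("app.db", timeout=30.0)

# Equivalently, at the SQLite level (value in milliseconds):
conn.execute("PRAGMA busy_timeout = 30000")
```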
Depends on the usage of the site. If most of the time you're just reading data, you can pretty much use anything for a DB and cache the data in the application to achieve good performance.
I am using it in a very low traffic web server (it is a genomic database) and I don't have any problems. But there are only SELECT statements, no writing to the DB involved.
To add to an already brilliant answer: since you are working with a server-less database in this case, you can say goodbye to replication, any sort of horizontal scaling of your DB, and other advanced options. It also isn't the best choice if multiple users update the same exact chunk of information. If you were to shard the database in the future, you would have to migrate the data and move to something else. And if you have a load balancer and multiple systems involved, it would be difficult to maintain data centrality with SQLite. These are just some of the reasons why it isn't recommended. It's great for smaller projects, and great for development.
It seems like with queuing you could also avoid a lot of SQLite's concurrent-write problems. Instead of writing directly to the SQLite DB, you would write to a queue that in turn writes sequentially to the SQLite DB in first-in, first-out order. I'm not sure whether your application will reach the point where you'd need this, or whether it would be worth writing versus just moving on to a client/server DB... but it's a thought.
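A minimal sketch of that idea (Python; the table and names are illustrative): all writes funnel through one queue consumed by a single writer thread, so SQLite only ever sees one writer, in FIFO order.

```python
import queue
import sqlite3
import threading

writes = queue.Queue()

def writer():
    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS hits(path TEXT)")
    while True:
        sql, params = writes.get()     # blocks until a write is queued
        conn.execute(sql, params)
        conn.commit()
        writes.task_done()

threading.Thread(target=writer, daemon=True).start()

# Request handlers enqueue instead of writing directly:
writes.put(("INSERT INTO hits(path) VALUES (?)", ("/index",)))
writes.join()                          # wait for pending writes (e.g. in tests)
```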
A web app I'm working on requires frequent parsing of diverse web resources (HTML, XML, RSS, etc). Once downloaded, I need to cache these resources to minimize network load. The app requires a very straightforward cache policy: only re-download a cached resource when more than X minutes have passed since the access time.
Should I:
Store both the access time (e.g. 6/29/09 at 10:50 am) and the resource itself in the database.
Store the access time and a unique identifier in the database. The unique identifier is the filename of the resource, stored on the local disk.
Use another approach or third party software solution.
Essentially, this question can be re-written as, "Which is better for storing moderate amounts of data - a database or flat files?"
Thanks for your help! :)
NB: The app is running on a VPS, so size restrictions on the database/flat files do not apply.
To answer your question: "Which is better for storing moderate amounts of data - a database or flat files?"
The answer is (in my opinion) flat files. Flat files are easier to back up and easier to remove.
However, you have extra information that isn't encapsulated in this question, mainly the fact that you will need to access this stored data to determine if a resource has gone stale.
Given this need, it makes more sense to store it in a database. Flat files do not lend themselves well to random access and search, compared to a relational DB.
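A sketch of the staleness check with everything in the database (SQLite here; the schema and names are illustrative, and download stands in for your fetch code):

```python
import sqlite3
import time

MAX_AGE = 15 * 60   # "X minutes", in seconds

conn = sqlite3.connect("cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS resources(
    url TEXT PRIMARY KEY, fetched_at REAL, body TEXT)""")

def get(url, download):
    row = conn.execute(
        "SELECT fetched_at, body FROM resources WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[0] < MAX_AGE:
        return row[1]                  # still fresh: serve from cache
    body = download(url)               # stale or missing: re-download
    conn.execute("INSERT OR REPLACE INTO resources VALUES (?, ?, ?)",
                 (url, time.time(), body))
    conn.commit()
    return body
```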
Depends on the platform. If you use .NET, the answer is 3: use the Cache object, which is ideally suited for this in ASP.NET. You can set time- and dependency-based expiration.
This doc explains the Cache object:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-5034946.html
Neither.
Have a look at memcached to see if it works with your server/client platform. This is easier to set up and performs much better than filesystem/RDBMS-based caching, provided you can spare the RAM needed for the data being cached.
All of the proposed solutions are reasonable. However, for my particular needs, I went with flat files. Oddly enough, though, I did so for reasons not mentioned in some of the other answers. It doesn't really matter to me that flat files are easier to backup and remove, and both DB and flat-file solutions allow for easy checking of whether or not the cached data has gone stale. I went with flat files first and foremost because, on my mid-sized one-box VPS LAMP architecture, I think it will be faster than a third-party cache or DB-based solution.
Thanks to all for your thoughts! :)