I am considering adopting ZFS and would be happy to hear about your experiences in both production and testing environments.
I am not using ZFS in production - I haven't had the chance yet. Basically, we have no need for giant storage at the moment, and we didn't run any 7.0 releases until recently.
At home I have a FreeBSD system (7.0-ish) which is more bleeding edge. I have been using ZFS on it for almost eight months now, and I currently have 1.2 TB in my tank. I like ZFS a lot, for multiple reasons:
grow my filesystem on demand
storage from "inexpensive" disks
filesystem snapshots
self-healing
the copies property (this is probably the most awesome of it all; see the sketch after this list)
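A rough sketch of how a few of those features map to commands, driven from Python here just for illustration (the pool, dataset, and device names are hypothetical; assumes a host with the ZFS tools installed):

```python
import subprocess

def run(*cmd):
    # Thin wrapper around the zfs/zpool command-line tools.
    subprocess.run(cmd, check=True)

# Point-in-time snapshot of a dataset.
run("zfs", "snapshot", "tank/home@nightly")

# Keep two copies of every block for extra self-healing headroom.
run("zfs", "set", "copies=2", "tank/home")

# Grow the pool on demand by adding another mirrored pair of disks.
run("zpool", "add", "tank", "mirror", "da2", "da3")
```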
If you are looking to try it out and like FreeBSD, I'd recommend the FreeBSD wiki.
I have had some of the issues that are outlined on the wiki, and I got a lot of help/feedback from people on IRC (#freebsdhelp on EFnet). I haven't lost any data though. :) (Knock on wood!) If you are looking for more feedback, you can check on IRC; there are a bunch of people there who run ZFS pools.
Aside from FreeBSD, ZFS has been around for a while on the Sun platform. It's way more mature there, since what I run on FreeBSD is a port and still very much a work in progress. :)
What do you plan to use it for? Most questions about filesystems can only be answered sensibly if there's a good understanding of the application and usage patterns. What works well for a traditional mail spool filesystem will probably not be what you choose for a database store, for example.
I used it as a low-rent near-line storage system on a machine with OpenSolaris installed. I had it in a basic mirrored RAID setup with 30 days' worth of snapshots. On more than one occasion it saved my bacon, and that was on a very basic setup. I can only imagine how much you could do with it on more serious/capable hardware.
As a sysadmin in a Linux shop, I use ZFS for a backup server. I used to run a cron job to take snapshots, but these days I use the zfs-auto-snapshot service that comes with SXCE. The backup is NFS-exported and automounted on all machines in the network, so people can restore files themselves - even the snapshots are exported over the network!
I even have my home directory NFS-mounted on all the Linux machines, so I get hourly snapshots of my daily work.
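A minimal sketch of the export side of that setup, again via the ZFS tools (the dataset name is hypothetical; assumes snapshots are exposed under the hidden .zfs directory):

```python
import subprocess

# Share the backup dataset over NFS so clients can automount it.
subprocess.run(["zfs", "set", "sharenfs=on", "tank/backup"], check=True)

# Snapshots are browsable read-only under the hidden .zfs directory,
# so users can restore their own files without admin help.
subprocess.run(["ls", "/tank/backup/.zfs/snapshot"], check=True)
```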
While ZFS is not perfect it really seems to be the best filesystem available today.
I do development and use ZFS in two environments:
1) On my Mac Pro, with RAIDZ2 over four disks
2) On a backup server running DesktopBSD (based on FreeBSD), with two disks in RAIDZ1
My overall experience is that, for the first time, I don't have to go around making daily backups of my data: ZFS seems to be the most reliable storage system I have ever used.
I am currently searching for a good distributed file system.
It should:
be open-source
be horizontally scalable (replication and sharding)
have no single point of failure
have a relatively small footprint
Here are the four most promising candidates in my opinion:
GridFS (based on MongoDB)
GlusterFS
Ceph
HekaFS
The filesystem will be used mainly for media files (images and audio). There are very small as well as medium-sized files (1 KB - 10 MB). The number of files should be in the several millions.
Are there any benchmarks regarding performance, CPU-load, memory-consumption and scalability? What are your experiences using these or other distributed filesystems?
I'm not sure your list is quite correct. It depends on what you mean by a file system.
If you mean a file system that is mountable in an operating system and usable by any application that reads and writes files using POSIX calls, then GridFS doesn't really qualify. It is just how MongoDB stores BSON-formatted objects. It is an Object system rather than a File system.
There is a project to make GridFS mountable, but it is a little weird because GridFS doesn't have concepts for things like hierarchical directories, although paths are allowed. Also, I'm not sure how well distributed writes through gridfs-fuse would behave.
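To make the distinction concrete, here is a minimal sketch of the GridFS access pattern using pymongo (the database, connection, and file names are hypothetical): you put and get whole objects through a driver API rather than through POSIX calls on a mounted filesystem.

```python
import gridfs
from pymongo import MongoClient

# GridFS chunks each file across documents in two ordinary MongoDB
# collections (fs.files and fs.chunks).
db = MongoClient("localhost", 27017)["media"]  # hypothetical database
fs = gridfs.GridFS(db)

# Store and retrieve go through the driver, not the kernel VFS.
file_id = fs.put(b"...audio bytes...", filename="clips/intro.ogg")
data = fs.get(file_id).read()

# "Paths" are just filename strings you can query on; there is no
# real directory hierarchy to walk.
latest = fs.get_last_version(filename="clips/intro.ogg")
```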
GlusterFS and Ceph are comparable: both are distributed, replicable, mountable file systems. You can read a comparison between the two here (and a follow-up update of the comparison), although keep in mind that the benchmarks were done by someone who is a little biased. You can also watch this debate on the topic.
As for HekaFS, it is GlusterFS set up for cloud computing, adding encryption and multitenancy as well as an administrative UI.
After working with Ceph for 11 months, I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but found them not good enough either.
I wholeheartedly recommend LizardFS, a fork of the now-proprietary MooseFS. LizardFS features data integrity, monitoring, and superior performance with very few dependencies.
2019 update: the situation has changed and LizardFS is no longer actively maintained.
MooseFS is stronger than ever and free from most of LizardFS's bugs. It is well maintained and faster than LizardFS.
RozoFS has matured and may be worth a try.
GfarmFS has its niche, but today I would choose MooseFS for most applications.
OrangeFS, anyone?
I am looking for an HPC DFS and found this discussion here:
http://forums.gentoo.org/viewtopic-t-901744-start-0.html
Lots of good data and comparisons :)
After some discussion, the OP decided on OrangeFS, quoting:
"OrangeFS. It does not support quotas nor file locks (though all i/o operations are atomic and this
way consistency is kept without locks). But it works, and works well and stable. Furthermore this is
not a general file storage oriented system, but HPC dedicated one, targeted on parallel I/O including
ROMIO support. All test were done for stripe data distribution.
a) No quotas — to hell quotas. I gave up on them anyway, even glusterfs supports not common
uid/gid based quotas, but directory size limitations, more like LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata
storage (single node) this gives +50% performance on small files and no significant difference on
large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by a sum of local hard drive
(do not forget each node participates as a data server as well) speed and available network bandwidth.
CPU consumption on such load is decent and is about 50% of single core on a client node and about
10% percents on each other data server nodes.
d) Fair performance on large sets of small files. For the test I untared linux kernel 3.1. It took 5 minutes
over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison.
CPU load is about 50% of single core (of course, it is actually distributed between cores) on the client and
about several percents on each node.
e) Support of ROMIO MPI I/O API. This is a sweet yummy for MPI aware applications, which allows to use
PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifo, block devices). Thus can't be safely used as /home and I use
NFSv4 for that task providing users quota-restricted small home space. Though most distributed
filesystems don't support special files anyway. "
I do not know about the other systems you posted, but I have made a comparison of 3 PHP CMSes/frameworks on local storage vs. GlusterFS to see if it does better in real-world tests than in raw benchmarks. Sadly, it does not.
http://blog.lavoie.sl/2013/12/glusterfs-performance-on-different-frameworks.html
I'm gathering information for an upcoming massive online game. I have experience with MEGA MASSIVE farm-like games (millions of DAU), where SQL databases were a great solution. I also worked on a massive online game where a NoSQL DB was used, and that particular DB (Mongo) was not the best fit - it behaved badly with lots of connections and lots of concurrent writes going on.
I'm looking for facts, benchmarks, and presentations about modern massive online games, and for technical details about their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like pgbouncer for Postgres).
Can it manage tens of thousands of concurrent reads and writes?
What about disk space fragmentation? Can it be optimized without stopping the database?
What about smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master, know exactly what data is missing, and act appropriately?
Can it fail gracefully? (like Postgres, for example)
Good reviews from production use
Start with the premise that hard crashes are exceedingly rare, and that when they occur it won't be a tragedy if some information is lost.

Use of the database shouldn't be strongly coupled to the routine management of the game. Routine events ought to be managed through more ephemeral storage, with some secondary process organizing the ephemeral events for eventual storage in the database. At the extreme, you could imagine there being just one database read and one database write per character per session.
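A minimal sketch of that decoupling, assuming a hypothetical event format and SQLite as the durable store: gameplay events go to an in-memory queue, and a secondary thread batches them into the database so the game loop never blocks on it.

```python
import queue
import sqlite3
import threading

events = queue.Queue()  # ephemeral storage for routine game events

def persist_worker(db_path="game.db", batch_size=100):
    # Secondary process: drain the queue and commit in batches.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (player TEXT, action TEXT)")
    while True:
        batch = [events.get()]  # block until at least one event arrives
        while len(batch) < batch_size and not events.empty():
            batch.append(events.get())
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        conn.commit()

threading.Thread(target=persist_worker, daemon=True).start()

# The game loop just enqueues and moves on.
events.put(("player42", "harvest_crop"))
```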
Have you considered NoSQL?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.

In short, NoSQL database management systems are useful when working with a huge quantity of data when the data's nature does not require a relational model. The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Usage examples might be to store millions of key–value pairs in one or a few associative arrays or to store millions of data records. This organization is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a large group of users).
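As a toy illustration of that key-value model, here is a sketch using Python's standard-library shelve module as a persistent associative array (the keys and values are hypothetical); a production system would put Redis, Cassandra, or the like behind the same idea.

```python
import shelve

# A persistent key-value store: no schema, no joins, no relations -
# just lookups by key, the access pattern NoSQL stores optimize for.
with shelve.open("players.db") as store:
    store["player:42"] = {"name": "Ann", "gold": 1200, "level": 7}
    store["player:43"] = {"name": "Bob", "gold": 80, "level": 2}
    print(store["player:42"]["gold"])  # single-key retrieval
```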
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.
I have a small Google App Engine site which seems to be outgrowing the platform, and I want to migrate somewhere else.
It is based on Java / Stripes Framework / Objectify, and only uses URLFetch from Google services. It uses ~60 front-end hours and ~4 GB datastore at the moment, with ~5k visitors doing ~25k page views per day.
Reasons why I believe I should migrate:
I made some assumptions early on which are no longer valid, and I am running into the 1 MB memcache/datastore limits. While I could refactor around this, it would likely increase the number of datastore operations and worsen overall performance characteristics, and it would involve a database conversion step (which I might as well use to migrate somewhere else)
I want to add some features which would involve a significant increase in stored data (to ca. 100 GB) and in front-end hours
As I'm using resources past the free quota, the costs seem to rise quite fast. While they are quite manageable now, I'm afraid that if the application becomes more popular, it may no longer be affordable.
Some stability issues (I'm getting OutOfMemoryErrors and other errors that I cannot explain and cannot replicate very well in my local environment)
I'm evaluating the following options:
Staying on GAE, optimizing the application, and living with the growing costs (Cons: will still have high costs and reliability issues)
Moving to AWS EC2 / EBS with MongoDB as a datastore (Pros: probably the most mature solution; Cons: appears difficult to set up, easy to make architecture/design mistakes)
Using AppScale to leave my application largely as-is, but host it on AWS EC2 (Pros: seems easy on paper; Cons: seems to presume a largely Unix development environment, and I have no idea if it is production-ready or what happens behind the scenes)
Using CloudFoundry.com with MongoDB as a datastore (Cons: no idea if it is production-ready, and post-beta costs are not known)
Getting a VPS or a dedicated server with some hosting provider and deploying with MongoDB as a datastore (Cons: probably teaches me less of the things I want to learn than the other options, and there is plenty of sysadmining to do)
It is a hobby site, so part of the goal is also to learn some new technologies in practice; I just want to invest my time in learning the right ones.
Notes - I have some, but quite limited, system administration skills, especially on Linux, and I don't enjoy doing it. I have done a small project in MongoDB before (though never put it in production). I've never used any of the AWS infrastructure.
My questions:
a. Is AppScale mature enough to run a small website without much hassle (bugs, lack of documentation, etc.)? Is the learning curve very steep? How much system administration would using it in scenario #3 require? And most importantly, do I understand correctly that the Google 1 MB and similar limits all still apply on AppScale?
b. Is CloudFoundry mature enough to run a small website without much hassle (bugs, lack of documentation, etc.)? Is the learning curve very steep? How much system administration would using it in scenario #4 require? I presume moving from CloudFoundry.com to another CloudFoundry host should be fairly easy if needed.
c. How much sysadmining does deploying the described application on AWS EC2 / EBS involve? Assuming I don't care that much about temporary outages but do care about permanent data loss, do I need to mirror the EBS volume on my own, or can I just leave that to AWS?
d. Which of the new options (AppScale, CloudFoundry, EC2/EBS) would work well with a Windows / Eclipse based development approach? Which has the best Eclipse plug-ins?
I'm asking because, from a quick review of the AppScale docs, they seem to assume the development VM will be hosted on a Unix host, which is another hassle for me.
e. Which of my options 1.-5. would you recommend in my case?
I'm split between #2 and #4 at the moment.
Just some observations:
a. AppScale is a thin wrapper around other technologies (runtimes, datastores), so in general it's as reliable as those underlying parts are. For a small, non-mission-critical website it is IMO reliable enough. Btw, the memcache 1 MB limit is per object, not per memcache, so you could break a large value up into multiple smaller objects (see the chunking sketch at the end of this answer).
b. I don't have experience with CloudFoundry, but they do say they are in "beta" and they do not have an SLA: http://support.cloudfoundry.com/entries/20971351-cloud-foundry-sla
c. I'd guess a few hours a week. EBS is a disk-based service, so you should not be losing data with it. But you can still do incremental backups of EBS to S3 (see the snapshot sketch at the end of this answer). There are many solutions that do it automatically, for example: http://www.stardothosting.com/blog/2012/05/automated-amazon-ebs-snapshot-backup-script-with-7-day-retention/
d. IMO EC2 is the most mature, with the most tools available. Note that AppScale is just a wrapper - you can deploy it to EC2. The dev environment (Eclipse + Windows) has nothing to do with the deployment environment (usually Linux, though it can also be Windows on EC2).
e. Personally, I'd recommend staying with GAE (= option 1). IMHO anything else would be less reliable and more costly (due to setup/support costs, not even comparing base service costs).
Btw, if you are getting OutOfMemoryErrors you should really review your code: where are you keeping massive amounts of data in memory?
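Regarding the per-object memcache limit in (a), here is a minimal sketch of the chunking idea against a generic memcache-style client (the set_multi/get_multi calls mirror GAE's Python memcache API; the key names are hypothetical):

```python
CHUNK = 1000 * 1000  # stay under the ~1 MB per-object limit

def set_chunked(client, key, blob):
    # Split one large value into several sub-1MB memcache entries.
    parts = {f"{key}:{i}": blob[i * CHUNK:(i + 1) * CHUNK]
             for i in range((len(blob) + CHUNK - 1) // CHUNK)}
    client.set(f"{key}:n", len(parts))
    client.set_multi(parts)

def get_chunked(client, key):
    # Reassemble the original value from its chunks.
    n = client.get(f"{key}:n")
    parts = client.get_multi([f"{key}:{i}" for i in range(n)])
    return b"".join(parts[f"{key}:{i}"] for i in range(n))
```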
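And for the EBS backups in (c), a hedged sketch of a snapshot-with-retention job using the modern boto3 SDK (the volume ID is hypothetical; the script linked above predates boto3 and drives the EC2 command-line tools instead):

```python
import datetime
import boto3

ec2 = boto3.client("ec2")
VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical EBS volume

# EBS snapshots are incremental and stored in S3 behind the scenes.
ec2.create_snapshot(VolumeId=VOLUME_ID, Description="nightly backup")

# Prune snapshots of this volume older than 7 days.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "volume-id", "Values": [VOLUME_ID]}],
)["Snapshots"]
for snap in snapshots:
    if snap["StartTime"] < cutoff:
        ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```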
Would anyone please explain which is better to use, SQLite or SQL Server? I was using an XML file as data storage for add, delete, and update operations. Someone suggested using SQLite for faster operation, but I am not familiar with SQLite; I know SQL Server.
SQLite is a great embedded database that you deploy along with your application. If you're writing a distributed application that customers will install, then SQLite has the big advantage of not having any separate installer or maintenance - it's just a single DLL that gets deployed along with the rest of your application.
SQLite also runs in-process and avoids a lot of the overhead that a database server brings - all data is cached and queried in-process.
SQLite integrates with your .NET application better than SQL Server does. You can write custom functions in any .NET language that run inside the SQLite engine yet still within your application's calling process and address space, and can thus call back into your application to integrate additional data or perform actions while executing a query. This very unusual ability makes certain tasks significantly easier.
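The answer describes the .NET binding, but the same capability exists elsewhere; here is a hedged sketch using Python's standard sqlite3 module, where a function defined in the host application runs inside a query (the function and table names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('ann', 120.0), ('bob', 30.0)")

# An application-defined SQL function: invoked by the SQLite engine,
# but it executes in our process and could call back into app state.
def loyalty_discount(amount):
    return amount * 0.9 if amount > 100 else amount

conn.create_function("loyalty_discount", 1, loyalty_discount)

for row in conn.execute("SELECT customer, loyalty_discount(amount) FROM orders"):
    print(row)
```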
SQLite is generally a lot faster than SQL Server.
However, SQLite supports only a single writer at a time (meaning the execution of an individual transaction). SQLite locks the entire database when it needs a lock (either read or write), and only one writer can hold a write lock at a time. Due to its speed this actually isn't a problem for low- to moderate-size applications, but if you have a higher volume of writes (hundreds per second) it could become a bottleneck. There are a number of possible solutions, like separating the database data into different databases or caching the writes to a queue and writing them asynchronously. However, if your application is likely to run into these usage requirements and hasn't already been written for SQLite, then it's best to use something else like SQL Server that has finer-grained locking.
UPDATE: SQLite 3.7.0 added a new journal mode called Write-Ahead Logging (WAL) that supports concurrent reading while writing. In our internal multi-process contention test, the timing went from 110 seconds to 8 seconds for the exact same sequence of contentious reads/writes.
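Enabling it is a one-line pragma; a minimal sketch with Python's sqlite3 module (the file name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect("app.db")

# Switch the journal to write-ahead logging; the setting is persistent
# and applies to all future connections to this database file.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal" once the switch succeeds
```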
The two are in different leagues altogether. One is built for enterprise-level data management and the other for mobile devices (embedded or serverless environments). Though SQLite deployments can hold many hundreds of GB of data, that is not what it is built for.
Update, to reflect the updated question:
Please read this blog post on SQLite. I hope it helps you and redirects you to resources on programmatically accessing SQLite from .NET.
What would be the best DB for inserting records at a very high rate?
The DB will have only one table, and the application is very simple: insert a row into the DB and commit it, but the insertion rate will be very high.
I'm targeting about 5000 row inserts per second.
Any of the very expensive DBs like Oracle/SQL Server are out of the question.
Also, what are the technologies for taking a DB backup, and would it be possible to recreate a DB from older backups?
I can't use the in-memory capabilities of any DB, as I can't afford to lose data if the application crashes. I need to commit each row as soon as I receive it.
If your main goal is to insert a lot of data in a short time, perhaps the filesystem is all you need.
Why not write the data to a file, optionally in a DB-friendly format (CSV, XML, ...)? That way you can probably achieve 10 times your performance goal without too much trouble. And most OSs are robust enough nowadays to prevent data loss on application failure.
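A minimal sketch of that approach (the file name and record layout are hypothetical): append each record as a CSV line and fsync, so an acknowledged row survives an application crash.

```python
import csv
import os

# Append-only CSV log: a record is "committed" once fsync returns.
with open("inserts.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for record in [("2010-01-01T12:00:00", 42, "sensor-a")]:  # hypothetical rows
        writer.writerow(record)
        f.flush()
        os.fsync(f.fileno())  # force the row to disk before acknowledging it
```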
Edit: As said below, journaling file systems are pretty much designed so that data is not lost in case of software (or even hardware, in the case of RAID arrays) failures. ZFS has a good reputation.
Postgres provides a WAL (Write-Ahead Log), which essentially buffers inserts in RAM until the buffer is full or the system has time to breathe. Combine a large WAL cache with a UPS (for safety) and you get very efficient insert performance.
If you can't do SQLite, I'd take a look at Firebird SQL if I were you.
To get high throughput you will need to batch inserts into big transactions. I really doubt you could find any DB that lets you round-trip 5000 times a second from your client.
SQLite can handle tons of inserts (25K per second in a transaction) provided things are not too multithreaded and the inserts are batched.
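A minimal sketch of that batching pattern with Python's sqlite3 module (the table and file names are hypothetical): one transaction per batch instead of one per row, so the per-commit cost is amortized over thousands of inserts.

```python
import sqlite3

conn = sqlite3.connect("ingest.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, value REAL)")

rows = [("2010-01-01T12:00:00", 1.5)] * 5000  # hypothetical batch

with conn:  # opens a transaction and commits on success
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
```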
Also, if structured correctly, I see no reason why MySQL or Postgres would not support 5000 rows per second (provided the rows are not too fat). Both MySQL and Postgres are a lot more forgiving about a larger number of concurrent transactions.
The performance you want is really not that hard to achieve, even on a "traditional" relational DBMS. If you look at the results for non-clustered TPC-C (TPC-C is the de facto standard benchmark for transaction processing), many systems can provide 10 times your requirement on a non-clustered system. If you are going for cheap and solid, you might want to check out DB2 Express-C. It is limited to two cores and two gigabytes of memory, but that should be more than enough to satisfy your needs.