I would like my program (written in Python) to monitor a given file system's hierarchy, record it into persistent data storage, and be able to update it when the file system changes. It might be read into volatile memory for quick access.
I've found some posts that suggested "the best persistent storage method to use with Python" here and here, as well as another post that answered "how to represent a filesystem in a relational database" here.
From the above links, it appears that SQLite is a good choice for persistence, as it is quick. However, I couldn't find much opinion on how good it is to use a database to store and represent filesystem hierarchy.
My considerations for the method implemented are:
Scalability: I need to monitor and keep up to date a hierarchy of potentially hundreds of thousands of files
Ease of use: when I read the file system hierarchy into memory
Any other suggested considerations?
Is it a good idea to use an RDBMS to represent a filesystem hierarchy? What are the pros and cons of this method? Do you have other suggested methods, and what are the pros and cons of those?
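To make the question concrete, here is a minimal sketch of what "a filesystem hierarchy in SQLite" could look like, assuming an adjacency-list layout (each row points at its parent). The table name node, its columns, and the helper names snapshot, children and path_of are made up for illustration; detecting changes (for example with a file-watching library) is left out.

```python
import os
import sqlite3

# Hypothetical schema: one row per file or directory, linked to its parent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),  -- NULL for the root
    name      TEXT NOT NULL,
    is_dir    INTEGER NOT NULL,
    mtime     REAL NOT NULL,
    UNIQUE (parent_id, name)
);
CREATE INDEX IF NOT EXISTS node_parent ON node(parent_id);
"""


def snapshot(conn, root):
    """Walk `root` and record every directory and file as one row."""
    conn.executescript(SCHEMA)
    root = os.path.abspath(root)
    insert = "INSERT INTO node (parent_id, name, is_dir, mtime) VALUES (?, ?, ?, ?)"
    with conn:  # a single transaction for the whole scan
        conn.execute("DELETE FROM node")
        cur = conn.execute(insert, (None, root, 1, os.lstat(root).st_mtime))
        ids = {root: cur.lastrowid}  # directory path -> row id
        for dirpath, dirnames, filenames in os.walk(root):
            parent_id = ids[dirpath]
            entries = [(n, 1) for n in dirnames] + [(n, 0) for n in filenames]
            for name, is_dir in entries:
                full = os.path.join(dirpath, name)
                cur = conn.execute(
                    insert, (parent_id, name, is_dir, os.lstat(full).st_mtime)
                )
                if is_dir:
                    ids[full] = cur.lastrowid


def children(conn, parent_id):
    """List the names directly under one directory row."""
    rows = conn.execute(
        "SELECT name FROM node WHERE parent_id = ? ORDER BY name", (parent_id,)
    )
    return [r[0] for r in rows]


def path_of(conn, node_id):
    """Rebuild the full path of a row by walking up with a recursive CTE."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent_id, n.name, up.depth + 1
            FROM node n JOIN up ON n.id = up.parent_id
        )
        SELECT name FROM up ORDER BY depth DESC
        """,
        (node_id,),
    ).fetchall()
    return os.path.join(*(r[0] for r in rows))
```

Listing a directory is then a single indexed lookup, while rebuilding a full path costs one step per level. Storing the full path in each row instead would reverse that trade-off.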
I am currently learning UML. I looked everywhere but couldn't find an answer. Should the creator of an object always be the one to save it to persistent storage via a data access object? Or, following the expert principle, is it better for an object to save itself to persistent storage via a data access object?
Main alternatives
UML is agnostic in this regard. It depends on your architectural choices:
A popular approach is to use Repository objects that act as a kind of collection that hides the database. You’d insert, retrieve, delete or update elements in the repository, and the repository takes care of the database.
Another popular approach is the active record. Each persistent object is in charge of inserting, updating or deleting itself in the database.
There are several other approaches.
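To illustrate the two approaches above, here is a minimal sketch in Python with sqlite3. The Customer, CustomerRepository and ActiveCustomer classes and the customer table are invented for the example; real code would add error handling and more operations.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

CREATE = "CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name TEXT)"


@dataclass
class Customer:
    """Plain domain object: it knows nothing about the database."""
    name: str
    id: Optional[int] = None


class CustomerRepository:
    """Repository: a collection-like object that hides the database."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(CREATE)

    def add(self, customer):
        with self._conn:
            cur = self._conn.execute(
                "INSERT INTO customer (name) VALUES (?)", (customer.name,)
            )
        customer.id = cur.lastrowid

    def get(self, customer_id):
        row = self._conn.execute(
            "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return Customer(name=row[1], id=row[0]) if row else None


class ActiveCustomer:
    """Active record: the object inserts and updates itself."""

    def __init__(self, conn, name, id=None):
        self._conn = conn
        self.name = name
        self.id = id
        conn.execute(CREATE)

    def save(self):
        with self._conn:
            if self.id is None:
                cur = self._conn.execute(
                    "INSERT INTO customer (name) VALUES (?)", (self.name,)
                )
                self.id = cur.lastrowid
            else:
                self._conn.execute(
                    "UPDATE customer SET name = ? WHERE id = ?", (self.name, self.id)
                )
```

Note how ActiveCustomer needs a database connection just to exist, whereas Customer can be created and tested without one.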
How to choose?
The scenario you describe corresponds to the second one. It is a tempting approach for CRUD-oriented applications with little domain logic, but it has several drawbacks that limit its suitability:
it does not enforce proper separation of concerns (see also the Single Responsibility Principle).
it couples your system tightly to the database, and it might be difficult at a later stage to change the underlying database.
Moreover, you’d also need to take care of transactional logic (i.e. either complete a set of related changes or cancel them all), and to avoid having several objects in memory that correspond to the same object in the database. While not impossible, this is more difficult with active records.
How to deepen your understanding of these alternatives?
The most comprehensive book on this topic is Fowler’s “Patterns of Enterprise Application Architecture”. Another book worth investing in, to understand what’s behind repositories, is Evans’ DDD bible. (Imho, you will save years of on-the-job experimentation and discovery by reading both books. You can then use the saved time to deepen your skills in other innovative domains.)
A database ultimately stores its data in files, and a file system also stores data in files. In that case, what is the difference between a DB and a file system? Is it in the way the data is retrieved, or something else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing, plus procedural extensions to SQL (such as Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, files are more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find any number of other differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, there are situations where this will be the way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare but happens. More info.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
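As a rough illustration of the "delete a row" case above, here is the same operation against a CSV file and against an SQLite table. The helper names, the item table, and the assumption that the key lives in the first column are all invented for the example.

```python
import csv
import os
import sqlite3
import tempfile


def delete_row_csv(path, key):
    """Delete rows whose first column equals `key` by rewriting the whole file."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as src, tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False
    ) as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):  # full scan: there is no index to consult
            if row[:1] != [key]:
                writer.writerow(row)
    os.replace(dst.name, path)  # swap the rewritten copy into place


def delete_row_sqlite(conn, key):
    """Delete one row; the primary-key index finds it without a full scan."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM item WHERE id = ?", (key,))
```

The CSV version costs a full read and write no matter where the row is, and the temporary-file dance is there only to avoid leaving a half-written file behind. The database version gets both the lookup and the safety from the engine.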
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
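If you would rather check this from a program than from the shell, the same numbers that df -i reports are available through os.statvfs on Unix-like systems. A small sketch (the function name inode_usage is just for illustration):

```python
import os


def inode_usage(path="/"):
    """Return (used, total, percent_used) inodes for the filesystem holding `path`."""
    st = os.statvfs(path)  # Unix only
    total = st.f_files     # total inodes on the filesystem
    used = total - st.f_ffree
    # Some filesystems report 0 total inodes because they allocate them dynamically.
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent


if __name__ == "__main__":
    used, total, percent = inode_usage("/")
    print(f"{used} of {total} inodes used ({percent:.1f}%)")
```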
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. A database management system, on the other hand, is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy, whereas a dbms has less.
File processing system provides less flexibility in accessing data, whereas dbms has more flexibility in accessing data.
File processing system does not provide data consistency, whereas dbms provides data consistency through normalization.
File processing system is less complex, whereas dbms is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystems provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on a database to get (SELECT), add (INSERT) and update (UPDATE) the data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, a database is (usually) more secure than a file system.
Migration is very easy with a file system (just copy and paste to the target), while for a database this task is not as simple.
I am working to develop an application that needs data distributed across countries. Content will be supplied "per region", but needs to be able to be easily copied to another region. On top of this I have general information that needs to be shared and synchronized across the databases.
The organisation I work for is considering implementing this system themselves, but it feels like there should be some good solutions out there already (I am open to cloud solutions - the less my company needs to manage the better)?
This might be a vague question, but I think it is possible to answer it well.
What are my options when developing this kind of distributed data system?
Should have elaborated (but I'm not sure how much I can say given NDA). Suffice to say, I have "Content" which I need stored on some space (files). I need metadata stored about the content distributed over several nodes (that might be hosted by us or some one else) to allow fast-paced communication and regionalized differences in data. I need to control HOW data is replicated between nodes, but preferably in a standards compliant way. (Preferably not written by us)
You can try CouchDB. Its off-line replication model sounds like a good fit for a geo-distributed system.
Interesting question - but it would really help to get more context.
You talk about "data", which usually means something with a fairly well-defined structure, often implemented in a relational database.
You also talk about "content", which usually means something with a (much) less well-defined structure, often implemented as a document of some type. Many solutions exist for structuring "documents", e.g. file systems or web sites.
Assuming we are talking about structured data, the simplest thing to do is have a single repository, accessible everywhere. Have a look at "cloud" offerings - Amazon's a good bet. Creating your own global data repository is a significant undertaking - but if you're dealing with highly confidential data, or have specific performance requirements, it may be the way to go.
If neither of those options work, you're in the world of "enterprise service bus". Google it, but be careful - it's a complex field, and you really want to find someone who knows what they're doing.
Having said that, using an off the shelf ESB is many times less painful than building your own distributed data structure.
I know it's years after asking, but I was looking up the answer to the same question and it looks like Cassandra may fit the bill. Once set up, it looks and acts like other database solutions (Tables, Views, SQL, Transactions, etc.), but it can also be entirely decentralized. Each instance acts as a node in a cluster of other Cassandra nodes. They synchronize behind the scenes and if one goes down, the others pick up the slack. This makes Cassandra both highly scalable and highly fault tolerant.
We've been discussing the design of a data warehouse strategy within our group to meet testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want you all can help.
Most of our files are large, 50+ Gig in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree type relationship with the Big files. One file's content will point to the next and so on, and so on. It's not a "spoked wheel," it's a full blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
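For reference, the first two requirements above (look-up by a name/date/source/time/artifact key, and reading only part of a file) can be sketched with nothing but the standard library: a small SQLite index that maps the composite key to a path, plus a ranged read. The table layout and function names are assumptions for illustration, and read_range merely stands in for the proprietary third-party API, which is not shown here.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Create (if needed) and open the hypothetical metadata index."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS big_file (
            name TEXT, date TEXT, source TEXT, time TEXT, artifact TEXT,
            path TEXT NOT NULL,
            PRIMARY KEY (name, date, source, time, artifact)
        )
        """
    )
    return conn


def register(conn, key, path):
    """Record where the file for a 5-part key lives; the file itself stays on disk."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO big_file VALUES (?, ?, ?, ?, ?, ?)", (*key, path)
        )


def locate(conn, key):
    """Return the path for a key, or None; nothing is loaded into memory."""
    row = conn.execute(
        "SELECT path FROM big_file WHERE name = ? AND date = ? AND source = ?"
        " AND time = ? AND artifact = ?",
        key,
    ).fetchone()
    return row[0] if row else None


def read_range(path, offset, length):
    """Read only `length` bytes starting at `offset` (stand-in for the vendor API)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Only the small index goes through the database; the 50+ GB files are never copied into it.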
Have you had a look at MongoDB's GridFS?
You can query files by the default metadata, plus your own additional metadata. Files are broken out into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table) and you get Mongo's replication features to boot.
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?
A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this? What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using FUSE, with a database back-end.
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
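To give an idea of the database back-end part, here is a minimal sketch of a storage layer that FUSE callbacks (readdir, read, mkdir, create) could delegate to. It deliberately leaves out the FUSE binding itself. The DbBackend class, the entry table and the method signatures are invented for illustration and do not follow any particular FUSE library.

```python
import posixpath
import sqlite3


class DbBackend:
    """Hypothetical storage layer: every file and directory is a row keyed by path."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS entry ("
            " path TEXT PRIMARY KEY, parent TEXT, is_dir INTEGER NOT NULL, data BLOB)"
        )
        conn.execute("INSERT OR IGNORE INTO entry VALUES ('/', NULL, 1, NULL)")
        conn.commit()

    def _insert(self, path, is_dir, data):
        parent = posixpath.dirname(path)
        with self.conn:  # parent check and insert happen in one transaction
            row = self.conn.execute(
                "SELECT is_dir FROM entry WHERE path = ?", (parent,)
            ).fetchone()
            if row is None or not row[0]:
                raise FileNotFoundError(parent)
            self.conn.execute(
                "INSERT INTO entry VALUES (?, ?, ?, ?)", (path, parent, is_dir, data)
            )

    def mkdir(self, path):
        self._insert(path, 1, None)

    def create(self, path, data=b""):
        self._insert(path, 0, data)

    def readdir(self, path):
        rows = self.conn.execute(
            "SELECT path FROM entry WHERE parent = ? ORDER BY path", (path,)
        )
        return [posixpath.basename(r[0]) for r in rows]

    def read(self, path, size, offset):
        # SQLite's substr() is 1-based and works byte-wise on BLOBs.
        row = self.conn.execute(
            "SELECT substr(data, ?, ?) FROM entry WHERE path = ? AND is_dir = 0",
            (offset + 1, size, path),
        ).fetchone()
        if row is None:
            raise FileNotFoundError(path)
        return row[0]
```

Things a simple plan like this misses include permissions, timestamps, renames of whole subtrees (every descendant path changes), and efficient writes into the middle of large files.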
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems as well as filesystems in which you can search by SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can take a really deep look, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious. Look at mainframes and minis, especially the iSeries OS (now called IBM i, formerly i5/OS and OS/400).
Using a relational database as a mass data store is relatively easy. Oracle and MySQL both have these. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database. File with some catalogue information.
3) Large data in BLOBs with extensive metadata and complex structures in the database. File with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> Relational map with extensive meta-data. While there may be an exportable form, the application naturally works with the database, the notion of the file as the repository is lost.
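As a rough sketch of step 2 above (data in BLOBs with light catalogue metadata), here is what it might look like in SQLite. The document table, its columns and the function names are assumptions made up for the example.

```python
import sqlite3


def store_document(conn, name, content, mime_type):
    """Store the raw bytes as a BLOB next to a little catalogue metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document ("
        " id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
        " mime_type TEXT, size INTEGER, content BLOB)"
    )
    with conn:
        cur = conn.execute(
            "INSERT INTO document (name, mime_type, size, content)"
            " VALUES (?, ?, ?, ?)",
            (name, mime_type, len(content), content),
        )
    return cur.lastrowid


def find_by_type(conn, mime_type):
    """Query the catalogue without selecting the BLOB column."""
    return conn.execute(
        "SELECT id, name, size FROM document WHERE mime_type = ? ORDER BY id",
        (mime_type,),
    ).fetchall()
```

Steps 3 and 4 would add more metadata columns and eventually break the BLOB's internal structure out into tables of its own.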