Would it be possible to use Git as a hierarchical text database?
Obviously you would have to write a front end that would act as a middleman, translating user commands into git commands.
A record would correspond to a "file". In the "file", the text would have to have some kind of conventional format like:
[name]: John Doe
[address]: 13 Maple Street
[city]: Plainview
To do queries, you would have to write a grep front end to use git's search capability.
The database itself would be the repository.
The directory structure would be the hierarchical structure of the database.
The tricky part I see is that you want the records to be in memory, not usually files on the drive (although that would be possible). So you would have to configure git to work with files in a virtual file system that actually lives in the memory of the db middleware.
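To make it concrete, here is roughly what I picture the middleware doing (just a sketch; the repo path, field names, and record layout are all invented):

```python
# Rough sketch of the git-backed "record store" idea.
import pathlib
import subprocess

REPO = pathlib.Path("customers-db")  # invented location

def git(*args):
    return subprocess.run(["git", "-C", str(REPO), *args],
                          capture_output=True, text=True, check=True)

def put_record(rel_path, fields):
    """Write a record file in the [key]: value format and commit it."""
    record = REPO / rel_path
    record.parent.mkdir(parents=True, exist_ok=True)
    record.write_text("".join(f"[{k}]: {v}\n" for k, v in fields.items()))
    git("add", rel_path)
    git("commit", "-m", f"upsert {rel_path}")

def query(pattern):
    """Queries become 'git grep' over the tracked record files."""
    return git("grep", pattern).stdout

REPO.mkdir(exist_ok=True)
subprocess.run(["git", "init", str(REPO)], check=True, capture_output=True)
git("config", "user.name", "db-middleware")      # so commits work anywhere
git("config", "user.email", "db@example.invalid")
put_record("people/john-doe", {"name": "John Doe",
                               "address": "13 Maple Street",
                               "city": "Plainview"})
print(query("Plainview"))  # -> people/john-doe:[city]: Plainview
```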
Kind of a crazy idea, but would it work?
Potential Advantages:
all the records would be hashed with SHA-1 so there would be high integrity
git takes care of all the persistence problems
db operations like edits can be managed as git merges
db operations like record deletes can be managed as removals (rm)
all changes to the database are stored, so you can recover ANY change or previous state
making copies of the database can be done with clone
Yes, but it would be very slow, and it wouldn't really need to involve git: the functionality of git grep and git clone is available without git.
Filesystems can be used as certain types of databases. In fact, git itself uses the filesystem as a simple, reliable, fast, robust key/value store. Object 4fbb4749a2289a3cd949ebe08255266befd18f23 lives in .git/objects/4f/bb4749a2289a3cd949ebe08255266befd18f23. Where the master branch points is recorded in .git/refs/heads/master.
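The mapping from object ID to path is trivial to reproduce; here's a sketch (real git also zlib-compresses the object before writing it, which is skipped here):

```python
# How git turns content into a storage path: SHA-1 over a
# "blob <size>\0" header plus the content; the first two hex digits
# become the fan-out directory.
import hashlib
import os

def git_blob_path(content: bytes) -> str:
    blob = b"blob %d\x00" % len(content) + content
    sha = hashlib.sha1(blob).hexdigest()
    return os.path.join(".git", "objects", sha[:2], sha[2:])

print(git_blob_path(b"hello world\n"))
# .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```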
What filesystem databases are very bad at is searching the contents of those files. Without indexing, you have to look at every file every time. You can use basic Unix file utilities like find and grep for it.
In addition, you'd have to parse the contents of the files on each search, which can be expensive and complicated.
Concurrency becomes a serious issue. If multiple processes want to work on a change at the same time, each has to copy the whole repository and working directory, which is very expensive. Then they need to do a remote merge, also expensive, which may result in a conflict. Remote access has the same problem.
As to having the files in memory, your operating system will take care of this for you. It will keep frequently accessed files in memory.
Addressing the specific points...
all the records would be hashed with SHA-1 so there would be high integrity
This only tells you that a file has changed, or that someone has tampered with the history. In a database, files are supposed to change. It doesn't tell you whether the content is corrupted, malformed, or just the result of a normal edit.
git takes care of all the persistence problems
Not sure what that means.
db operations like edits can be managed as git merges
They're files, edit them. I don't know how merging gets involved.
Merging means conflicts which means human intervention, not something you want in a database.
db operations like record deletes can be managed as removals (rm)
If each single file is a record, yes, but you can do the same thing without git.
all changes to the database are stored, so you can recover ANY change or previous state
This is an advantage, it sort of gives you transactions, but it will also make writing to your database supremely slow. Git is not meant to be committing hundreds of times a second.
making copies of the database can be done with clone
cp -r does the same thing.
In short, unless you're doing a very simple key/value store, there is very little advantage to using a filesystem as a database. Something like SQLite or Berkeley DB is superior in almost every way.
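For comparison, here is a minimal sketch of the same record store in SQLite (table and column names are just made up to mirror the example above); lookups hit an index instead of grepping every file:

```python
import sqlite3

db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records "
           "(id TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT)")
db.execute("CREATE INDEX IF NOT EXISTS idx_city ON records(city)")
db.execute("INSERT OR REPLACE INTO records VALUES (?, ?, ?, ?)",
           ("john-doe", "John Doe", "13 Maple Street", "Plainview"))
db.commit()
# Uses the index; no full scan, no per-file parsing:
print(db.execute("SELECT name FROM records WHERE city = ?",
                 ("Plainview",)).fetchall())
```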
Related
A database ultimately stores its data in files, and a file system also stores data in files. So what is the difference between a DB and a file system? Is it in the way data is retrieved, or something else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction-processing extensions to SQL (e.g. Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system are:
When handling small data sets of arbitrary, probably unrelated data, a file is more efficient than a database.
For simple operations (read, write), file operations are faster and simpler.
You can find any number of further differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files and folders, and there are situations where this will be the way to go. There is a well-known versioning solution (SVN) that eventually ended up using a filesystem-based model to store data, ditching its BerkeleyDB backend. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful: you have to alter each row and read+write the whole file (sketched below).
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points, but these are the first mountains you'll try to climb when you think of a file-based db alternative. Those guys programmed all this for you; it's yours to use. Think of the likely (most frequent) scenarios, enumerate all the actions you want to perform on your data, and decide which works better for you. Think in benefits, not fashion.
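To spell out the first bullet from the list above, here is what "insert a column" costs with a flat CSV file (a sketch; it assumes the file has a header row):

```python
import csv
import os

def insert_column(path, column_name, default=""):
    """Adding one column means rewriting every row of the file."""
    tmp = path + ".tmp"
    with open(path, newline="") as src, open(tmp, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader) + [column_name])  # extend the header
        for row in reader:                             # touch all 99999 rows
            writer.writerow(row + [default])
    os.replace(tmp, path)  # atomically swap in the rewritten file

# The same change in a database is a single statement, no full rewrite:
#   ALTER TABLE people ADD COLUMN phone TEXT;
```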
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the percentage used; this is effectively a filesystem file-count limit, EVEN IF you have plenty of disk space.
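You can also check inode headroom from code. This sketch reports what `df -i` shows, using POSIX statvfs (note: on filesystems that allocate inodes dynamically, such as btrfs, f_files may be 0):

```python
import os

st = os.statvfs("/")                 # total vs. free inodes for this mount
if st.f_files:
    used = st.f_files - st.f_ffree
    print(f"inodes used: {used}/{st.f_files} "
          f"({100.0 * used / st.f_files:.1f}%)")
else:
    print("filesystem allocates inodes dynamically")
```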
The difference between file processing system and database management system is as follow:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. A database management system, on the other hand, is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy; a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS offers more.
A file processing system does not enforce data consistency, whereas a DBMS provides consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystems provided by OSs were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software, or you would look around for existing alternatives. So the need created a market for 3rd-party data storage software, which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules, like having files and directories, this is not true. The biggest operating systems work like that, but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember: to ship a new filesystem, you also need to write a new OS, which makes adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes: transactional, relational, hierarchical, graph, tabular; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler with a file system, while you must learn SQL to perform any query on a database to get (SELECT), add (INSERT) or update data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, a database is (usually) more secure than a file system.
Migration is very easy with a file system: just copy and paste to the target. For a database this task is not as simple.
Hi, so I am working on a terminal application and would like to use a database. I am not sure of the best way to go about doing it, though.
I want the program to install the way you would install git, or with brew install, etc. Then each person who installs it would have their own database, accessed via certain commands.
I'm just not sure what the best way to do this would be. I was thinking that for each entry I put in the database, I could just create a file in a hidden .directory?
Thanks.
The filesystem is actually quite a good database, suitable for storing a moderate number of documents with sizes ranging from tiny to huge. It's not relational (the only "key" is the file name), and it's actually closer to a "NoSQL" database than a relational database.
The Git database, in fact, can store all its objects with one object per file. However, for efficiency reasons, it was found that the access patterns required by Git were better served by combining many objects into one "pack" file.
Git has a much different "database" scheme than other applications. It has a .git directory, with an objects/ directory and a bunch of files named after SHA checksums. This is not a very good model for someone who is starting out.
If you want to use the Git filesystem, there are ways to do that (search for "git plumbing").
I'd recommend one of MySQL, XML files, or a filesystem format (keep in mind, if you run on Windows, anything below 4KB is wasting space).
You are most likely looking for SQLite.
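A sketch of what that could look like: one SQLite file per user, created on first run under their home directory (the app name, table, and command behavior here are invented):

```python
import pathlib
import sqlite3
import sys

DB_PATH = pathlib.Path.home() / ".myapp" / "entries.db"

def open_db():
    DB_PATH.parent.mkdir(exist_ok=True)
    db = sqlite3.connect(DB_PATH)
    db.execute("CREATE TABLE IF NOT EXISTS entries ("
               "id INTEGER PRIMARY KEY, body TEXT, "
               "created DEFAULT CURRENT_TIMESTAMP)")
    return db

if __name__ == "__main__":
    db = open_db()
    if len(sys.argv) > 1:            # `myapp some text` adds an entry
        db.execute("INSERT INTO entries (body) VALUES (?)", (sys.argv[1],))
        db.commit()
    for row in db.execute("SELECT id, body FROM entries"):  # `myapp` lists
        print(*row)
```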
What is the best way storing binary or image files?
Database System
File System
Would you please explain, why?
There is no real best way, just a bunch of trade-offs.
Database Pros:
1. Much easier to deal with in a clustering environment.
2. No reliance on additional resources like a file server.
3. No need to set up "sync" operations in load balanced environment.
4. Backups automatically include the files.
Database Cons:
1. Size / Growth of the database.
2. Depending on DB Server and your language, it might be difficult to put in and retrieve.
3. Speed / Performance.
4. Depending on DB server, you have to virus scan the files at the time of upload and export.
File Pros:
1. For single web/single db server installations, it's fast.
2. Well understood ability to manipulate files. In other words, it's easy to move the files to a different location if you run out of disk space.
3. Can virus scan when the files are "at rest". This allows you to take advantage of scanner updates.
File Cons:
1. In multi-web-server environments, requires an accessible share, which should also be clustered for failover.
2. Additional security requirements to handle file access. You have to be careful that the web server and/or share does not allow file execution.
3. Transactional Backups have to take the file system into account.
The above said, SQL 2008 has a thing called FILESTREAM which combines both worlds. You upload to the database and it transparently stores the files in a directory on disk. When retrieving you can either pull from the database; or you can go direct to where it lives on the file system.
Pros of Storing binary files in a DB:
Some decrease in complexity, since the data access layer of your system need only interface to a DB, not a DB plus a file system.
You can secure your files using the same comprehensive permissions-based security that protects the rest of the database.
Your binary files are protected against loss along with the rest of your data by way of database backups. No separate filesystem backup system is required.
Cons of Storing binary files in a DB:
Depending on the size/number of files, they can take up significant space, potentially decreasing performance (depending on whether your binary files are stored in a table that is often queried for other content) and making for longer backup times.
Pros of Storing binary files in the file system:
This is what file systems are good at. File systems handle defragmentation well, and retrieving files (say, to stream a video file through a web server) will likely be faster than with a db.
Cons of Storing binary files in the file system:
Slightly more complex data access layer. Needs its own backup system. You need to consider referential integrity issues (e.g. a deleted pointer in the database must result in deletion of the file, so as not to leave 'orphaned' files in the filesystem).
On balance I would use the file system. In the past, using SQL Server 2005 I would simply store a 'pointer' in db tables to the binary file. The pointer would typically be a GUID.
Here's the good news if you are using SQL Server 2008 (and maybe others - I don't know): there is built-in support for a hybrid solution with the new VARBINARY(MAX) FILESTREAM data type. These columns behave logically like VARBINARY(MAX), but behind the scenes SQL Server 2008 stores the data in the file system.
There is no best way.
What? You need more info?
There are three ways I know of... One, as byte arrays in the database. Two, as a file with the path stored in the database. Three, as a hybrid (only if DB allows, such as with the FileStream type).
The first is pretty cool because you can query and get your data in the same step, which is always nice. But what happens when you have LOTS of files? Your database gets big. Now you have to deal with big-database maintenance issues, such as the trials of backing up databases that are over a terabyte. And what happens if you need outside access to the files? Such as type conversions, or mass manipulation (resize all images, apply watermarks, etc.)? It's much harder to do than when you have plain files.
The second is great for somewhat large numbers of files. You can store them on NAS devices, back them up incrementally, keep your database small, etc etc. But then, when you have LOTS of files, you start running into limitations in the file system. And if you spread them over the network, you get latency issues, user rights issues, etc. Also, I take pity on you if your network gets rearranged. Now you have to run massive updates on the database to change your file locations, and I pity you if something screws up.
Then there's the hybrid option. It's almost perfect: you can get your files via your query, yet your database isn't massive. Does this solve all your problems? Probably not. Your database isn't portable anymore; you're locked to a particular DBMS. And this stuff isn't mature yet, so you get to enjoy the teething process. And who says it solves all the different issues?
Fact is, there is no "best" way. You just have to determine your requirements, make the best choice depending on them, and then suck it up when you figure out you did the wrong thing.
I like storing images in a database. It makes it easy to switch from development to production just by changing databases (no copying files). And the database can keep track of properties like created/modified dates just as well as the File System.
I personally never store images IN the database, for performance reasons. On all of my sites I have a "/files" folder where I can put sub-folders based on what kind of images I'm going to store. Then I name them by convention.
For example, if I'm storing a profile picture, I'll store it in "/files/profile/" as profile_2.jpg (if 2 is the ID of the account). I always make it a rule to resize the image on the server to the largest size I'll need, and then smaller ones if I need them. So I'd save "profile_2_thumb.jpg" and "profile_2_full.jpg".
By creating rules for yourself, you can simply reference img src="/files/profile_{id}_thumb.jpg" in your code.
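A sketch of that convention using Pillow (assumed installed via pip install Pillow; the 150px thumbnail size is my invention, not part of the original scheme):

```python
from PIL import Image

def save_profile_images(upload_path, account_id):
    img = Image.open(upload_path).convert("RGB")  # JPEG can't store alpha
    img.save(f"files/profile/profile_{account_id}_full.jpg")
    img.thumbnail((150, 150))                     # shrink in place
    img.save(f"files/profile/profile_{account_id}_thumb.jpg")

save_profile_images("upload.png", 2)
# -> files/profile/profile_2_full.jpg and files/profile/profile_2_thumb.jpg
```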
That's how I do it anyway!
I have an application that creates records in a table (rocket science, I know). Users want to associate files (.doc, .xls, .pdf, etc...) to a single record in the table.
Should I store the contents of the file(s) in the database? Wouldn't this bloat the database?
Should I store the file(s) on a file server, and store the path(s) in the database?
What is the best way to do this?
I think you've accurately captured the two most popular approaches to solving this problem. There are pros and cons to each:
Store the Files in the DB
Most RDBMSs have support for storing BLOBs (binary file data: .doc, .xls, etc.) in a db, so you're not breaking new ground here.
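As an illustration of the round trip (using sqlite3 here since it ships with Python; other drivers look much the same, and the table is invented):

```python
import sqlite3

db = sqlite3.connect("attachments.db")
db.execute("CREATE TABLE IF NOT EXISTS attachments "
           "(id INTEGER PRIMARY KEY, filename TEXT, body BLOB)")

with open("report.pdf", "rb") as f:   # any binary file
    db.execute("INSERT INTO attachments (filename, body) VALUES (?, ?)",
               ("report.pdf", f.read()))
db.commit()

# Reading it back out is just another query:
body, = db.execute("SELECT body FROM attachments WHERE filename = ?",
                   ("report.pdf",)).fetchone()
with open("copy.pdf", "wb") as f:
    f.write(body)
```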
Pros
Simplifies Backup of the data: you backup the db you have all the files.
The linkage between the metadata (the other columns ABOUT the files) and the file itself is solid and built into the db, so it's a one-stop shop to get data about your files.
Cons
Backups can quickly blossom into a HUGE nightmare as you're storing all of that binary data with your database. You could alleviate some of the headaches by keeping the files in a separate DB.
Without the DB or an interface to the DB, there's no easy way to get to the file content to modify or update it.
In general, it's harder to code and coordinate the upload and storage of data to a DB vs. the filesystem.
Store the Files on the FileSystem
This approach is pretty simple: you store the files themselves in the filesystem, and your database stores a reference to each file's location (as well as all of the metadata about the file). One helpful hint here is to standardize your naming scheme for the files on disk (don't use the file name that the user gives you; generate one of your own and store theirs in the db).
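One way to implement that hint, sketched below: a generated name on disk, the user's original name kept only in the database (the store path and schema are invented):

```python
import pathlib
import shutil
import sqlite3
import uuid

STORE = pathlib.Path("filestore")
db = sqlite3.connect("files.db")
db.execute("CREATE TABLE IF NOT EXISTS files "
           "(id TEXT PRIMARY KEY, original_name TEXT)")

def store_file(upload_path):
    STORE.mkdir(exist_ok=True)
    file_id = uuid.uuid4().hex            # a name we control; no collisions
    shutil.copy(upload_path, STORE / file_id)
    db.execute("INSERT INTO files VALUES (?, ?)",
               (file_id, pathlib.Path(upload_path).name))
    db.commit()
    return file_id                        # store this reference in your rows
```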
Pros
Keeps your file data cleanly separated from the database.
Easy to maintain the files themselves: if you need to change out a file or update it, you do so in the file system itself. You can just as easily do it from the application via a new upload.
Cons
If you're not careful, your database records about the files can get out of sync with the files themselves.
Security can be an issue (again if you're careless) depending on where you store the files and whether or not that filesystem is available to the public (via the web I'm assuming here).
At the end of the day, we chose to go the filesystem route. It was easier to implement quickly, easy on the backup, and pretty secure once we locked down any holes and streamed the file out (instead of serving it directly from the filesystem). It's been operational in pretty much the same format for about 6 years in two different government applications.
J
How well you can store binaries, or BLOBs, in a database will be highly dependent on the DBMS you are using.
If you store binaries on the file system, you need to consider what happens in the case of file name collision, where you try and store two different files with the same name - and if this is a valid operation or not. So, along with the reference to where the file lives on the file system, you may also need to store the original file name.
Also, if you are storing a large amount of files, be aware of possible performance hits of storing all your files in one folder. (You didn't specify your operating system, but you might want to look at this question for NTFS, or this reference for ext3.)
We had a system that had to store several thousands of files on the file system, on a file system where we were concerned about the number of files in any one folder (it may have been FAT32, I think).
Our system would take a new file to be added, and generate an MD5 checksum for it (in hex). It would take the first two characters and make that the first folder, the next two characters and make that the second folder as a sub-folder of the first folder, and then the next two as the third folder as a sub-folder of the second folder.
That way, we ended up with a three-level set of folders, and the files were reasonably well scattered so no one folder filled up too much.
If we still had a file name collision after that, then we would just add "_n" to the file name (before the extension), where n was just an incrementing number until we got a name that didn't exist (and even then, I think we did atomic file creation, just to be sure).
Of course, then you need tools to do the occasional comparison of the database records to the file system, flagging any missing files and cleaning up any orphaned ones where the database record no longer exists.
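Here is a sketch of that sharding scheme (hex MD5, three two-character folder levels, "_n" suffix on collision; the atomic-create step mentioned above is left out for brevity):

```python
import hashlib
import os

def shard_path(root, data, filename):
    digest = hashlib.md5(data).hexdigest()
    folder = os.path.join(root, digest[:2], digest[2:4], digest[4:6])
    os.makedirs(folder, exist_ok=True)
    base, ext = os.path.splitext(filename)
    candidate, n = os.path.join(folder, filename), 0
    while os.path.exists(candidate):      # append _1, _2, ... until free
        n += 1
        candidate = os.path.join(folder, f"{base}_{n}{ext}")
    return candidate

data = b"...file contents..."
path = shard_path("store", data, "report.pdf")
with open(path, "wb") as f:
    f.write(data)
```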
You should only store files in the database if you're reasonably sure you know that the sizes of those files aren't going to get out of hand.
I use our database to store small banner images, which I always know what size they're going to be. Your database will store a pointer to the data inside a row and then plunk the data itself somewhere else, so it doesn't necessarily impact speed.
If there are too many unknowns though, using the filesystem is the safer route.
Use the database for data and the filesystem for files. Simply store the file path in the database.
In addition, your webserver can probably serve files more efficiently than your application code can (which would otherwise have to stream the file from the DB back to the client).
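One common pattern for that, sketched here assuming Flask and an nginx location marked `internal` that maps to your file store (both are my assumptions, not from the answer): the app only checks permissions and sets a header, and nginx streams the bytes.

```python
from flask import Flask, Response   # pip install flask

app = Flask(__name__)

@app.route("/download/<int:file_id>")
def download(file_id):
    # ...authorization checks would go here...
    resp = Response(status=200)
    # nginx intercepts this header and serves the file itself:
    resp.headers["X-Accel-Redirect"] = f"/protected/{file_id}.pdf"
    resp.headers["Content-Disposition"] = "attachment"
    return resp
```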
Store the paths in the database. This keeps your database from bloating, and also allows you to separately back up the external files. You can also relocate them more easily; just move them to a new location and then UPDATE the database.
One additional thing to keep in mind: In order to use most of the filetypes you mentioned, you'll end up having to:
Query the database to get the file contents in a blob
Write the blob data to a disk file
Launch an application to open/edit/whatever the file you just created
Read the file back in from disk to a blob
Update the database with the new content
All that as opposed to:
Read the file path from the DB
Launch the app to open/edit/whatever the file
I prefer the second set of steps, myself.
The best solution would be to put the documents in the database. This simplifies all the linking, backup, and restore issues. But it might not fit the basic "we just want to point to documents on our file server" mindset the users may have.
It all depends (in the end) on actual user requirements.
But my recommendation would be to put it all together in the database so you retain control of the files. Leaving them in the file system leaves them open to being deleted, moved, ACL'd, or any one of hundreds of other changes that could render your links to them pointless or even damaging.
Database bloat is only an issue if you haven't sized for it. Do some tests and see what effects it has. 100GB of files on a disk is probably just as big as the same files in a database.
I would try to store it all in the database (I haven't done it myself). If you don't, there is a small risk that file names get out of sync with the files on disk, and then you have a big problem.
And now for the completely off-the-wall suggestion: you could consider storing the binaries as attachments in a CouchDB document database. This would avoid the file name collision issues, since you would use a generated UID as each document ID (which is what you would store in your RDBMS), and the actual attachment's file name is kept with the document.
If you are building a web-based system, then the fact that CouchDB uses REST over HTTP could also be leveraged. And there are also the replication facilities that could prove of use.
Of course, CouchDB is still in incubation, although there are some who are already using it 'in the wild'.
I would store the files in the filesystem. But to keep the linking to the files robust, i.e. to avoid the cons of this option, I would generate some hash for each file, and then use the hash to retrieve it from the filesystem, without relying on the filenames and/or their path.
I don't know the details, but I know that this can be done, because this is the way BibDesk works (a BibTeX app for the Mac OS). It is wonderful software, used to keep track of the PDF attachments to the database of scientific literature that it manages.
I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will let me quickly tag new files and move them into a single database for storage, and will also provide an easy mechanism for querying the db by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives; the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is referential integrity: you can't easily move or modify the files without the action being brokered by the db. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response; if the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with from Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I mean a single query/response). Usually this limit is around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex-encoding it, thus doubling the size from the start)
Execute the generated query (which means the database has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.