I'm doing a project that deals with a structured document database. I have a tree of categories (~1,000 categories, up to ~50 per level); each category contains several thousand structured documents (up to, say, ~10,000). Each document is a few kilobytes of data in some structured form (I'd prefer YAML, but JSON or XML would work just as well).
Users of this system perform several types of operations:
retrieving documents by ID
searching for documents by some of the structured attributes inside them
editing documents (i.e. adding/removing/renaming/merging); each edit operation should be recorded as a transaction with a comment
viewing the history of recorded changes for a particular document (including who, when, and why changed the document, getting an earlier version, and probably reverting to it if requested)
Of course, the traditional solution would be to use some sort of document database (such as CouchDB or MongoDB) for this problem; however, this version control (history) requirement tempted me into a wild idea: why shouldn't I use a git repository as the database backend for this application?
At first glance, it could be solved like this:
category = directory, document = file
getting document by ID => changing directories + reading a file in a working copy
editing documents with edit comments => making commits by various users + storing commit messages
history => normal git log and retrieval of older revisions
search => that's the slightly trickier part; I guess it would require a periodic export of a category into a relational database, with indexes on the columns we'll allow searching by
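That periodic export could be sketched roughly like this (a minimal sketch, assuming JSON documents and illustrative attribute names; `title` and `author` are hypothetical placeholders for whatever attributes you decide to index):

```python
import json
import os
import sqlite3

# Hypothetical searchable attributes; adjust to your document schema.
INDEXED_ATTRS = ("title", "author")

def export_to_sqlite(root_dir, conn):
    """Mirror every JSON document under root_dir into an indexed table.

    Directory layout follows the category = directory, document = file
    convention: the relative directory is the category, the filename
    (minus extension) is the document ID.
    """
    cols = ", ".join(f"{a} TEXT" for a in INDEXED_ATTRS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, category TEXT, {cols})")
    for a in INDEXED_ATTRS:
        conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{a} ON docs ({a})")
    for dirpath, _dirs, files in os.walk(root_dir):
        category = os.path.relpath(dirpath, root_dir)  # directory = category
        for name in files:
            if not name.endswith(".json"):
                continue
            with open(os.path.join(dirpath, name)) as f:
                doc = json.load(f)
            row = [name[:-5], category] + [doc.get(a) for a in INDEXED_ATTRS]
            qmarks = ", ".join("?" for _ in row)
            conn.execute(f"INSERT OR REPLACE INTO docs VALUES ({qmarks})", row)
    conn.commit()
```

Run from cron (or a post-receive hook), this keeps a searchable mirror while git remains the source of truth.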
Are there any other common pitfalls in this solution? Has anyone tried to implement such a backend already (e.g. for any popular framework: RoR, node.js, Django, CakePHP)? Does this solution have any implications for performance or reliability, i.e. is it known that git would be much slower than traditional database solutions, or are there any scalability/reliability pitfalls? I presume that a cluster of such servers pushing/pulling each other's repositories should be fairly robust and reliable.
Basically, tell me whether this solution will work, and why it will or won't.
Answering my own question is not the best thing to do, but, as I ultimately dropped the idea, I'd like to share the rationale that applied in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to the architect to decide.
Generally, the first main point my question misses is that I'm dealing with a multi-user system whose users work in parallel, concurrently, using my server from a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this, but all of them are either too heavy on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation work to git in the first place):
"Blunt" approach: 1 user = 1 state = 1 full working copy of a repository that server maintains for user. Even if we're talking about fairly small document database (for example, 100s MiBs) with ~100K of users, maintaining full repository clone for all of them makes disc usage run through the roof (i.e. 100K of users times 100MiB ~ 10 TiB). What's even worse, cloning 100 MiB repository each time takes several seconds of time, even if done in fairly effective maneer (i.e. not using by git and unpacking-repacking stuff), which is non acceptable, IMO. And even worse — every edit that we apply to a main tree should be pulled to every user's repository, which is (1) resource hog, (2) might lead to unresolved edit conflicts in general case.
Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.
"Only active users" approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:
As user logs in, you clone the repository. It takes several seconds and ~100 MiB of disc space per active user.
As user continues to work on the site, he works with the given working copy.
As user logs out, his repository clone is copied back to main repository as a branch, thus storing only his "unapplied changes", if there are any, which is fairly space-efficient.
Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than number of total users, but it makes logging in/out more complicated and slower, as it involves cloning of a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops 10 TiBs of disc usage down to 10..100 GiBs in my case, that might be acceptable, but, yet again, we're now talking about fairly small database of 100 MiBs.
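The back-of-envelope arithmetic behind those two figures:

```python
# Disc usage estimates for the two approaches (all sizes in MiB).
repo_mib = 100          # one full repository clone per working copy
total_users = 100_000
active_users = 1_000

blunt_tib = repo_mib * total_users / 1024 / 1024   # clone per user
active_gib = repo_mib * active_users / 1024        # clone per active user only

print(f"blunt: ~{blunt_tib:.0f} TiB, active-only: ~{active_gib:.0f} GiB")
# -> blunt: ~10 TiB, active-only: ~98 GiB
```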
"Sparse checkout" approach: making "sparse checkout" instead of full-blown repo clone per active user doesn't help a lot. It might save ~10x of disc space usage, but at expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.
"Workers pool" approach: instead of doing full-blown clones every time for active person, we might keep a pool of "worker" clones, ready to be used. This way, every time a users logs in, he occupies one "worker", pulling there his branch from main repo, and, as he logs out, he frees the "worker", which does clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. Does not help much with disc usage (it's still pretty high — only full clone per active user), but at least it makes logging in/out faster, as expense of even more complexity.
That said, note that I intentionally calculated the numbers for a fairly small database and user base: 100K users, 1K active users, 100 MiB total database + edit history, 10 MiB working copy. If you look at more prominent crowd-sourcing projects, the numbers there are much higher:
│ Project      │ Users │ Active users │ DB+edits │ DB only │
├──────────────┼───────┼──────────────┼──────────┼─────────┤
│ MusicBrainz  │ 1.2M  │ 1K/week      │ 30 GiB   │ 20 GiB  │
│ en.wikipedia │ 21.5M │ 133K/month   │ 3 TiB    │ 44 GiB  │
│ OSM          │ 1.7M  │ 21K/month    │ 726 GiB  │ 480 GiB │
Obviously, for those amounts of data and activity, this approach would be utterly unacceptable.
Generally, it would have worked if one could use the web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on the client's side, not on the server's side.
There are also other points that I've missed, but they're not that bad compared to the first one:
The very pattern of keeping a "thick" per-user edit state is at odds with normal ORMs, such as ActiveRecord, Hibernate, DataMapper, Tower, etc.
As far as I've searched, there is no existing free codebase implementing this approach to git for any popular framework.
There is at least one service that somehow manages to do this efficiently, and that is obviously GitHub; but, alas, their codebase is closed source, and I strongly suspect that they do not use normal git servers / repo storage techniques inside, i.e. that they have basically implemented an alternative "big data" git.
So, the bottom line: it is possible, but for most current use cases it won't be anywhere near the optimal solution. Rolling your own document-edit-history-to-SQL implementation or trying to use an existing document database would probably be a better alternative.
My 2 pence worth; a bit long, but... I had a similar requirement in one of my incubation projects. Similar to yours, my key requirements were a document database (XML in my case) with document versioning. It was for a multi-user system with a lot of collaboration use cases. My preference was to use available open source solutions that support most of the key requirements.
To cut to the chase, I could not find any one product that provided both, in a way that was scalable enough (number of users, usage volumes, storage and compute resources). I was biased towards git for all its promising capability, and the (probable) solutions one could craft out of it. As I toyed with the git option more, moving from a single-user perspective to a multi-user perspective became an obvious challenge. Unfortunately, I did not get to do substantial performance analysis like you did (lazy / quit early; saving it for the "version 2" mantra). Power to you! Anyway, my biased idea has since morphed into the next (still biased) alternative: a mash-up of the tools that are best in their separate spheres, databases and version control.
While still a work in progress (and slightly neglected), the morphed version is simply this:
on the frontend (user-facing): use a database for the first-level storage (interfacing with user applications)
on the backend: use a version control system (VCS, like git) to perform versioning of the data objects in the database
In essence, it would amount to adding a version control plugin to the database, with some integration glue that you may have to develop yourself, but that may be a lot easier to build.
How it would (be supposed to) work is that the primary multi-user data exchanges go through the database. The DBMS handles all the fun and complex issues, such as multi-user access, concurrency, and atomic operations. On the backend, the VCS performs version control on a single set of data objects (no concurrency or multi-user issues). For each effective transaction on the database, version control is performed only on the data records that have effectively changed.
As for the interfacing glue, it takes the form of a simple interworking function between the database and the VCS. In terms of design, a simple approach would be an event-driven interface, with data updates from the database triggering the version control procedures (hint: assuming MySQL, use triggers and sys_exec(), etc.). In terms of implementation complexity, it can range from the simple and effective (e.g. scripting) to the complex and wonderful (some programmed connector interface). It all depends on how crazy you want to go with it, and how much sweat capital you are willing to spend. I reckon simple scripting should do the magic. And to access the end result, the various data versions, a simple alternative is to populate a clone of the database (more a clone of the database structure) with the data referenced by the version tag/id/hash in the VCS; again, this bit will be a simple query/translate/map job for an interface.
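The scripting end of that glue might look something like this (a sketch under assumptions: one JSON file per record inside a git working tree; the function is what a trigger-launched script would call; the table and field names are hypothetical, and the git step is injectable so it can be stubbed out):

```python
import json
import os
import subprocess

def version_record(repo_dir, table, record, run=subprocess.run):
    """Serialize one changed record into the VCS working tree and commit it.

    `run` is injectable so the git invocation can be stubbed out in tests;
    a real deployment would let the database trigger launch a script that
    calls this after each effective transaction.
    """
    path = os.path.join(repo_dir, table, f"{record['id']}.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)  # stable output, clean diffs
    run(["git", "-C", repo_dir, "add", "--", path], check=True)
    run(["git", "-C", repo_dir, "commit", "-m",
         f"update {table}/{record['id']}"], check=True)
    return path
```

Since the VCS side is single-writer, no locking is needed here; the DBMS has already serialized the transactions.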
There are still some challenges and unknowns to be dealt with, but I suppose the impact and relevance of most of them will largely depend on your application requirements and use cases. Some may turn out to be non-issues. The issues include: performance matching between the two key modules, the database and the VCS, for an application with high-frequency data update activity; and scaling of resources (storage and processing power) over time on the git side as the data and users grow, be it steady, exponential, or an eventual plateau.
Of the cocktail above, here is what I'm currently brewing:
using Git for the VCS (I initially considered good old CVS, due to its use of only changesets or deltas between two versions)
using MySQL (due to the highly structured nature of my data: XML with strict XML schemas)
toying around with MongoDB (to try a NoSQL database, which closely matches the native database structure used in git)
Some fun facts:
- git actually does clever things to optimize storage, such as compression and storing only deltas between revisions of objects
- YES, git does store only changesets or deltas between revisions of data objects, where applicable (it knows when and how). Reference: packfiles, deep in the guts of Git internals
- A review of git's object storage (a content-addressable filesystem) shows striking similarities, from a conceptual perspective, with NoSQL databases such as MongoDB. Again, at the expense of sweat capital, it may provide more interesting possibilities for integrating the two and for performance tweaking
If you got this far, let me know whether the above may be applicable to your case and, assuming it would be, how it would square up with some of the aspects of your last comprehensive performance analysis.
An interesting approach indeed. I would say that if you need to store data, use a database, not a source code repository, which is designed for a very specific task. If you could use Git out of the box, fine; but you would probably need to build a document repository layer over it, and you could build that over a traditional database just as well, right? And if it's built-in version control that you're interested in, why not just use one of the open source document repository tools? There are plenty to choose from.
Well, if you decide to go for a Git backend anyway, then it would basically work for your requirements if you implemented it as described. But:
1) You mentioned a "cluster of servers that push/pull each other". I've thought about it for a while and I'm still not sure: you can't push/pull several repos as an atomic operation, and I wonder whether concurrent work could create some merge mess.
2) Maybe you don't need it, but an obvious feature of a document repository that you did not list is access control. You could possibly restrict access to some paths (= categories) via submodules, but you probably won't be able to grant access at the document level easily.
I implemented a Ruby library on top of libgit2 which makes this pretty easy to implement and explore. There are some obvious limitations, but it's also a pretty liberating system since you get the full git toolchain.
The documentation includes some ideas about performance, tradeoffs, etc.
As you mentioned, the multi-user case is trickier to handle. One possible solution is to use user-specific Git index files, resulting in:
no need for separate working copies (disk usage is restricted to changed files)
no need for time-consuming preparatory work (per user session)
The trick is to combine Git's GIT_INDEX_FILE environment variable with the tools for creating Git commits manually:
git hash-object
git update-index
git write-tree
git commit-tree
A solution outline follows (actual SHA1 hashes omitted from the commands):
# Initialize the index
# N.B. Use the commit hash since refs might change during the session.
$ GIT_INDEX_FILE=user_index_file git reset --hard <starting_commit_hash>
#
# Change data and save it to `changed_file`
#
# Save changed data to the Git object database. Returns a SHA1 hash to the blob.
$ cat changed_file | git hash-object -t blob -w --stdin
da39a3ee5e6b4b0d3255bfef95601890afd80709
# Add the changed file (using the object hash) to the user-specific index
# N.B. When adding new files, --add is required
$ GIT_INDEX_FILE=user_index_file git update-index --cacheinfo 100644 <changed_data_hash> path/to/the/changed_file
# Write the index to the object db. Returns a SHA1 hash to the tree object
$ GIT_INDEX_FILE=user_index_file git write-tree
8ea32f8432d9d4fa9f9b2b602ec7ee6c90aa2d53
# Create a commit from the tree. Returns a SHA1 hash to the commit object
# N.B. The parent commit should be the same commit as in the first phase.
$ echo "User X updated their data" | git commit-tree <new_tree_hash> -p <starting_commit_hash>
3f8c225835e64314f5da40e6a568ff894886b952
# Create a ref to the new commit
$ git update-ref refs/heads/users/user_x_change_y <new_commit_hash>
Depending on your data, you could use a cron job to merge the new refs into master, but conflict resolution is arguably the hardest part here.
Ideas to make it easier are welcome.
Related
We are trying to determine the best source control branching strategy to use at work. We use a VSO frontend attached to a Git backend. We have four database environments: DEV, QA, STAGE, and PROD. At any given time we have many teams working on different features that often leapfrog each other, in addition to a lot of ongoing database cleanup work (adding primary and foreign keys, setting columns to non-nullable, etc.).
My idea is to maintain four persistent branches, one for each database environment, each reflecting its respective environment. Any team working on a new feature branches from DEV and, when the work is done, merges back into the persistent DEV branch. When the work is ready for QA it is merged to QA; when it is ready to move to STAGE it is merged to STAGE; and so forth. Any non-breaking database update not tied to a feature (like making columns non-nullable) can flow as a changeset without needing a feature branch, but every potentially breaking change will need to go through a feature branch.
Has anyone used this strategy? Did it work?
Is there a better branching model you can recommend?
Well, since no one ended up answering, I will use my own answer for an update. It appears we will end up using this branching strategy, more or less as I originally described it. The main difference is that the PROD branch will be called master, and feature branches will be branched from master rather than from DEV.
This is because the master/PROD branch is considered more stable than DEV. The previous environment where I successfully branched from DEV was a single release train; since features are expected to leapfrog each other here, we can't do that.
Also, all development will need to be done in feature branches. This is due to a limitation of the VSO Git plugin, since its mechanism for linking Git pushes to VSO tickets requires that work be done in feature branches.
I'm working on a web-based Java project that stores end-user data in a MySQL database. I'd like to implement something that gives the user functionality similar to what I have for my source code in version control (e.g. Subversion). In other words, I'd like to write code that allows the user to commit and roll back work and return to an existing branch. Is there an existing framework for this? Putting the database data into version control and exposing the version control functionality to the end user (i.e. writing code that lets the user commit, roll back, etc.) seems like a reasonable approach, but it also seems there might be problems with it. For example, how would you allow one user to view a rolled-back version of the data? (You can't just replace the data the database points to if only one user wants to look at a rolled-back version.) If I had the choice of completely rebuilding the system using any persistence architecture, what could be used to store the data that would make this type of functionality easy to implement?
There are two very common solutions for versioning the database schema itself:
http://www.liquibase.org/
https://flywaydb.org/
Branching and merging the user data
Your question, however, is about solutions for versioning the user data in an application, giving your users capabilities such as branching and merging. You pondered exposing a real version control system such as SVN.
The side-effects I can foresee are:
You will have to index things by directory and filename, perhaps using an abstraction of directories as entities and filenames as the primary key.
Operating systems (Linux, macOS, and Windows alike) do not handle directories with millions of files well. You will have to partition the entity, usually by hashing the ID (MD5, for example) and using the beginning of the hash to create a subdirectory. The number of digits to take from the hash depends on the expected size of the entity.
Operating systems (Linux, macOS, and Windows alike) are not prepared for huge quantities of files. I tested this: it took me days to back up and finally remove a file tree with hundreds of millions of files.
You will not be able to have additional indexes beyond the primary key; however, you can work around that by building a data mart, as I describe below.
You will not have database constraints, but similar functionality can be implemented through git/svn/cvs hooks.
You will not have strong transactions, but similar functionality can be implemented through git/svn/cvs hooks.
You will have a working copy for each user, which will consume space depending on the size of the repositories. That way each user will be at a single point in time.
Git is fast enough to switch from one branch to another, so going back in time and back again will take only seconds (unless the user data is big, of course).
I saw a Linus interview where he warned about poor performance in huge git repositories. It is probably best to have one repository per user, or some other means of keeping your application from having a single humongous repository.
Resolution of the changes: I bet that if you create gazillions of versions, any version control system will complain. I do not know what "gazillions" means in practice; you will have to test it.
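The hash-partitioning scheme mentioned above (MD5 of the ID, hash prefix as subdirectory) can be sketched as follows; the two-character prefix is an arbitrary choice:

```python
import hashlib
import os

def shard_path(root, entity, record_id, prefix_len=2):
    """Map a record ID to a file path, using the first characters of the
    MD5 hash as a subdirectory so no single directory grows too large."""
    digest = hashlib.md5(str(record_id).encode("utf-8")).hexdigest()
    return os.path.join(root, entity, digest[:prefix_len], str(record_id))
```

With `prefix_len=2` this spreads records over 256 subdirectories; add a second prefix level if the entity is expected to hold many millions of records.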
Query database
A version control working copy limits you to primary-key lookups using the "=" operator and sequential scans. This is not enough for good reports and statistics on any usage pattern I can think of. That is why you need to build a data mart from your application data, and there are two ways of doing that:
A batch process that reads the whole repository history and builds cubes and other views to allow easier querying.
Git/SVN/CVS hooks that call programs you write on file addition, modification, deletion, branch creation, and merging. These could be used to update the database whenever a change happens.
The batch is easier to implement, but the reports and statistics take time to catch up with the activity. You will probably want to go that way in the 1.0 version and, over time, move to hooks to make things more dynamic.
Simulating constraints and transactions
Git, SVN, and CVS support hooks that execute programs when a new version is submitted. Relationships and consistency can then be checked in order to accept or reject the change.
Alternative Solutions
Since you did not specify the kind of application, I will talk about blogs, content portals, and online stores. For those kinds of applications I see little reason to reinvent the wheel and build a custom database. Most of the versioning needed can be anticipated in the database model; a good event-oriented database design will be enough.
For example, a revision of a blog post could be modeled by marking the end date/time on the current row and creating a new row for the revised post, incrementing the version number and setting the previous-version id. The same strategy can be used for the sales and catalog of an online store. If you model your application with good logs, you do not need version control.
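A minimal sketch of that revision model, using SQLite and hypothetical table/column names (rows are never updated in place; a revision closes the current row and inserts a new one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE post_version (
    post_id      INTEGER,
    version      INTEGER,
    prev_version INTEGER,   -- NULL for the first version
    body         TEXT,
    valid_to     TEXT,      -- NULL marks the current row
    PRIMARY KEY (post_id, version))""")

def revise(post_id, body, now):
    """Close the current row (if any) and insert the next version."""
    cur = conn.execute(
        "SELECT version FROM post_version WHERE post_id = ? AND valid_to IS NULL",
        (post_id,)).fetchone()
    prev = cur[0] if cur else None
    if prev is not None:
        conn.execute(
            "UPDATE post_version SET valid_to = ? WHERE post_id = ? AND version = ?",
            (now, post_id, prev))
    conn.execute(
        "INSERT INTO post_version VALUES (?, ?, ?, ?, NULL)",
        (post_id, (prev or 0) + 1, prev, body))

revise(1, "first draft", "2010-10-10")
revise(1, "revised draft", "2010-10-11")
```

The full history stays queryable with plain SQL, which is exactly what the filesystem-as-database approach gives up.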
Some developers also use a row-level trigger that records everything that changes in the database. This is a bit harder on an auditor, who would have to reconstruct the past from badly designed logs. I personally do not like this approach, because it is very difficult to index those kinds of queries. I prefer to build my whole application around a well-designed and meaningful log.
For example:
History Table
10/10/2010 [new process] process_id=1; name=john
11/10/2010 [change name] process_id=1; old_name=john; new_name=john doe
12/10/2010 [change name] process_id=1; old_name=john doe; new_name=john doe junior
Process Table after 12/10/2010.
proc_id=1 name=john doe junior
That way I can reconstruct almost everything in the past and still have my operational data in an easy-to-use format.
However, this is not close to the usage pattern you want (branching and merging).
Conclusion
The applicability of version control as a database seems to me very powerful on one hand and very limited and dangerous on the other. It is very inspiring for auditing and error-correction purposes, but my main concerns would be scale and reliability.
It seems like you want version control for your data rather than for the database schema. I could find two databases that implement most of the version control features, such as fork, clone, branch, merge, push, and pull:
https://github.com/dolthub/dolt - SQL based
https://github.com/terminusdb/terminusdb - graph based
You mentioned Subversion, which is a Centralized Version Control System. But let us focus on Git, because of reasons. Git is a Decentralized Version Control System. A local copy of a Git repository is the same as a remote copy of the repository, if a remote copy exists at all (services such as GitLab and GitHub provide the remote housing and managing of Git projects). With Git you can have version control in an arbitrary directory in your machine. You can do whatever you are accustomed to doing with SVN, and more, in this arbitrary directory.
What I am getting at is that you could programmatically create per-user directories/repositories on your server and apply version control in these directories/repositories, keeping a separate repository per user (the specifics of the architecture would be decided later, depending on the structure of the user's "work"). Your application would be in charge of adding and removing files on behalf of the user (e.g. Biography, My Sample Project, etc.), editing files, committing the changes, presenting a file history, and so on, essentially issuing Git commands. Your application would thus interface with the Git repository, exploiting the advanced version control that Git provides. Your database would just make sure that the user is linked to the directory/repository that contains their "work".
To provide a critical analogy, the GitLab project is an open source web-based Git repository manager with wiki and issue tracking features. GitLab is written in Ruby and uses PostgreSQL (preferably). It is a typical (as in Code - Database - Data directories and files) multiuser web-based application. Its purpose is to manage Git repositories. These Git repositories are stored in a designated directory in the server. Part of the code is responsible for accessing the Git repositories that the logged-in user is authorized to access (as the owner or as a collaborator). An interesting use case is of a user editing a file online, which will result in a commit in some branch in some repository. Another interesting use case is of a user checking the history of a file. A final interesting use case is of a user reverting a specific commit. All of these actions are performed online, via a web browser.
To provide an interesting real-world use case, Atlas by O'Reilly is an online platform for publishing-related collaboration using GitLab as the backend.
For Java there is JGit, a lightweight, pure Java library implementing the Git version control system. JGit is used by Eclipse for all actions related to managing Git repositories. Maybe you could look into it. It is an extremely active project, supported by many, Google included.
All of the above make sense, if the "work" you refer to is more than some fields in a database table, which the user will fill in and may later change the values of. For instance, it would make sense for structured text, HTML, etc.
If this "work" is not so large-scale, maybe doing something like what is described above is overkill. In that case, you could employ some of the version control concepts in your database design, such as calculating diffs and applying patches (also in reverse, for viewing past versions / rolling back). Your tables should allow for a tree-like structure, to store the diffs, so you could allow for branches. You could have the active version of a file readily available, as well as the active index (what Git calls HEAD), and navigate to another indexed/hashed/tagged version in the file's history by applying all patches sequentially, if moving forward, or applying patches in reverse, and in the reverse chronological order, if moving backwards. If this "work" is really small-scale, you could even ditch the diff concept, and store the whole version of the "work" in the tree-like structure.
Pure fun.
Would it be possible to use Git as a hierarchical text database?
Obviously, you would have to write a front end to act as a middleman, translating user commands into git commands.
A record would correspond to a "file". In the "file", the text would have to follow some conventional format, like:
[name]: John Doe
[address]: 13 Maple Street
[city]: Plainview
To do queries, you would have to write a grep front end to use git's search capability.
The database itself would be the repository.
The directory structure would be the hierarchical structure of the database.
The tricky part I see is that you would want the records in memory, not in files on the drive as usual (although that would be possible). So you would have to configure git to work with files in a virtual file system that actually lives in the memory of the db middleware.
Kind of a crazy idea, but would it work?
Potential Advantages:
all the records would be hashed with SHA-1 so there would be high integrity
git takes care of all the persistence problems
db operations like edits can be managed as git merges
db operations like record deletes can be managed as removals (rm)
all changes to the database are stored, so you can recover ANY change or previous state
making copies of the database can be done with clone
Yes, but it would be very slow, and it wouldn't really involve git: the functionality of git grep and git clone is available without git.
Filesystems can be used as certain types of databases. In fact, git itself uses the filesystem as a simple, reliable, fast, robust key/value store. Object 4fbb4749a2289a3cd949ebe08255266befd18f23 is in .git/objects/4f/bb4749a2289a3cd949ebe08255266befd18f23. Where the master branch is pointing at is located in .git/refs/heads/master.
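That key/value mapping is simple enough to express directly (this is where git keeps *loose* objects; packed objects live in packfiles instead):

```python
import os

def loose_object_path(git_dir, sha1):
    """Path of a loose git object: the first two hex digits of the SHA-1
    become a subdirectory, the remaining 38 the filename."""
    return os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
```

For example, `loose_object_path(".git", "4fbb4749a2289a3cd949ebe08255266befd18f23")` yields the `.git/objects/4f/bb...` path mentioned above.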
What filesystem databases are very bad at is searching the contents of those files. Without indexing, you have to look at every file every time. You can use basic Unix file utilities like find and grep for it.
In addition, you'd have to parse the contents of the files each search which can be expensive and complicated.
Concurrency becomes a serious issue. If multiple processes want to work on a change at the same time, they have to copy the whole repository and working directory, which is very expensive. Then they need to do a remote merge, also expensive, which may result in a conflict. Remote access has the same problem.
As to having the files in memory, your operating system will take care of this for you. It will keep frequently accessed files in memory.
Addressing the specific points...
all the records would be hashed with SHA-1 so there would be high integrity
This only tells you that a file is different, or that someone has tampered with the history. In a database, files are supposed to change; it doesn't tell you whether the content is corrupted or malformed, or whether it's a normal change.
git takes care of all the persistence problems
Not sure what that means.
db operations like edits can be managed as git merges
They're files, edit them. I don't know how merging gets involved.
Merging means conflicts which means human intervention, not something you want in a database.
db operations like record deletes can be managed as removals (rm)
If each single file is a record, yes, but you can do the same thing without git.
all changes to the database are stored, so you can recover ANY change or previous state
This is an advantage; it sort of gives you transactions. But it will also make writing to your database supremely slow: Git is not meant to commit hundreds of times a second.
making copies of the database can be done with clone
cp -r does the same thing.
In short, unless you're doing a very simple key/value store, there is very little advantage to using a filesystem as a database. Something like SQLite or Berkeley DB is superior in almost every way.
I'm writing a document editing web service, in which documents can be edited via a website, or locally and pushed via git. I'm trying to decide whether the documents should be stored as individual files on the filesystem, or in a database. The points I'm wondering about are:
If they're in a database, is there any way for git to see the documents?
How much higher are the overheads of using the filesystem? I assume the OS is doing a lot more work. How can I alleviate some of this? For example, the web editor autosaves; what would be the best way to cache the save data, to minimise writes?
Does one scale significantly better or worse than the other? If all goes according to plan, this will be a service with many thousands of documents being accessed and edited.
If the documents go into a database, git can't directly see the documents. git will see the backing storage file(s) for the database, but have no way of correlating changes there to changes to files.
The overhead of using a database is higher than using a filesystem, as answered by Carlos. Databases are optimized for transactions, which they'll do in memory, but they still have to hit the disk. Unless you program the application to do database transactions at a sub-document level (e.g. writing only modified lines), the database will give you no performance improvement. Most modern filesystems do caching too, and you can 'write' in a way that sits in RAM rather than going to your backing storage. You'll need to manage the granularity of the autosaves in your application (every change? every 30 seconds? every 5 minutes?), but doing it at the same granularity with a database will cause the same amount of traffic to the backing store.
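One common way to manage that autosave granularity is to coalesce rapid edits in memory and flush only after a quiet period. A minimal debounce sketch (the delay value and the flush target are placeholders, not a prescription):

```python
import threading, time

class AutosaveBuffer:
    """Holds the latest document text in RAM and writes it out only
    after `delay` seconds with no further edits, so a burst of
    keystrokes costs one disk write instead of hundreds."""

    def __init__(self, flush, delay=2.0):
        self.flush = flush        # callable that performs the real write
        self.delay = delay
        self._timer = None
        self._lock = threading.Lock()

    def save(self, text):
        with self._lock:
            self._latest = text
            if self._timer:
                self._timer.cancel()  # a new edit resets the quiet period
            self._timer = threading.Timer(self.delay, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            self.flush(self._latest)

writes = []
buf = AutosaveBuffer(writes.append, delay=0.1)
for i in range(50):               # simulate a burst of autosaves
    buf.save(f"draft {i}")
time.sleep(0.3)                   # wait past the quiet period
print(len(writes))                # 1 -- only the final draft was written
```

The same buffer works whether the flush target is a file, a database row, or a git commit; it just bounds the write rate.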
I think you intended to ask "does the filesystem scale as well as the database?" :) If you have some way to organize your files per user, and you solve the security issue of a particular user only being able to access/modify the files they should be able to (both doable, imo), the filesystem should work fine.
The filesystem will always be faster than a DB, because after all, DBs store their data in the filesystem!
Git is quite efficient on its own, as proven by GitHub, so I say stick with git and work around its limitations.
After all, Linus should know something... ;)
Question:
Should I write my application to directly access a database Image Repository or write a middleware piece to handle document requests.
Background:
I have a custom Document Imaging and Workflow application that currently stores about 15 million documents/document images (90%+ single page, group 4 tiffs, the rest PDF, Word and Excel documents). The image repository is a commercial, 3rd party application that is very expensive and frankly has too much overhead. I just need a system to store and retrieve document images.
I'm considering moving the imaging directly into a SQL Server 2005 database. The indexing information is very limited - basically 2 index fields. It's a life insurance policy administration system so I index images with a policy number and a system wide unique id number. There are other index values, but they're stored and maintained separately from the image data. Those index values give me the ability to look-up the unique id value for individual image retrieval.
The database server is a dual-quad core windows 2003 box with SAN drives hosting the DB files. The current image repository size is about 650GB. I haven't done any testing to see how large the converted database will be. I'm not really asking about the database design - I'm working with our DBAs on that aspect. If that changes, I'll be back :-)
The current system to be replaced is obviously a middleware application, but it's a very heavyweight system spread across 3 windows servers. If I go this route, it would be a single server system.
My primary concerns are scalability and performance - heavily weighted toward performance. I have about 100 users, and usage growth will probably be slow for the next few years.
Most users are primarily read users - they don't add images to the system very often. We have a department that handles scanning and otherwise adding images to the repository. We also have a few other applications that receive documents (via FTP) and insert them into the repository automatically as they are received, either with full index information or as "batches" that a user reviews and indexes.
Most (90%+) of the documents/images are very small, < 100K, probably < 50K, so I believe that storage of the images in the database file will be the most efficient rather than getting SQL 2008 and using a filestream.
Oftentimes scalability and performance are ultimately married to each other, in the sense that six months from now management comes back and says "Function Y in Application X is running unacceptably slow - how do we speed it up?" And all too often the answer is to upgrade the back-end solution. And when it comes to upgrading back ends, it's almost always going to be less expensive to scale out than to scale up in terms of hardware.
So, long story short, I would recommend building a middleware app that specifically handles incoming requests from the user app and then routes them to the appropriate destination. This will sufficiently abstract your front-end user app from the back end storage solution so that when scalability does become an issue only the middleware app will need to be updated.
This is straightforward. Write the application to an interface, use some kind of factory mechanism to supply that interface, and implement that interface however you want.
Once you're happy with your interface, then the application is (mostly) isolated from the implementation, whether it's talking straight to a DB or to some other component.
Thinking ahead a bit on your interface design, but doing bone-stupid, "it's simple, it works here, it works now" implementations, offers a good balance of future-proofing the system without necessarily over-engineering it.
It's easy to argue you don't even need an interface at this juncture, rather just a simple class that you instantiate. But if your contract is well defined (i.e. the interface or class signature), that is what protects you from change (such as redoing the back end implementation). You can always replace the class with an interface later if you find it necessary.
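The interface-plus-factory idea might look something like this (names are illustrative, not from the question; sketched in Python, though the pattern is language-agnostic):

```python
from abc import ABC, abstractmethod

class ImageRepository(ABC):
    """The contract the front end codes against. Swapping the back
    end later means adding a class here, not touching callers."""

    @abstractmethod
    def fetch(self, unique_id: str) -> bytes: ...

    @abstractmethod
    def store(self, unique_id: str, image: bytes) -> None: ...

class InMemoryRepository(ImageRepository):
    """A 'bone stupid, works now' implementation. A SQL- or
    filesystem-backed class would implement the same two methods."""
    def __init__(self):
        self._docs = {}
    def fetch(self, unique_id):
        return self._docs[unique_id]
    def store(self, unique_id, image):
        self._docs[unique_id] = image

def make_repository(kind: str) -> ImageRepository:
    # The factory is the single place that knows about concrete back ends.
    backends = {"memory": InMemoryRepository}
    return backends[kind]()

repo = make_repository("memory")
repo.store("POL-123/0001", b"...tiff bytes...")
print(repo.fetch("POL-123/0001") == b"...tiff bytes...")  # True
```

When the back end changes, only `make_repository` and the new implementation class change; every caller keeps talking to `ImageRepository`.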
As far as scalability, test it. Then you know not only if you may need to scale, but perhaps when as well. "Works great for 100 users, problematic for 200, if we hit 150 we might want to consider taking another look at the back end, but it's good for now."
That's due diligence and a responsible design tactic, IMHO.
I agree with gabriel1836. However, an added benefit would be that you could run a hybrid system for a time, since you aren't going to convert 15 million documents from your proprietary system to your home-grown system overnight.
Also, I would strongly encourage you to store the documents outside of a database. Store them on a file system (local, SAN, NAS it doesn't matter) and store pointers to the documents in the database.
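The "files on disk, pointers in the database" pattern is simple to sketch (the schema, paths, and sharding scheme here are made up for illustration):

```python
import sqlite3, hashlib, tempfile
from pathlib import Path

store_root = Path(tempfile.mkdtemp())  # would be a SAN/NAS mount in practice
db = sqlite3.connect(":memory:")       # would be your SQL Server instance
db.execute("""CREATE TABLE images (
    unique_id TEXT PRIMARY KEY,
    policy_no TEXT,
    path      TEXT)""")

def add_image(unique_id: str, policy_no: str, data: bytes) -> None:
    # Shard by hash prefix so no single directory holds millions of files.
    digest = hashlib.sha256(data).hexdigest()
    path = store_root / digest[:2] / digest
    path.parent.mkdir(exist_ok=True)
    path.write_bytes(data)
    with db:  # the pointer row commits atomically
        db.execute("INSERT INTO images VALUES (?, ?, ?)",
                   (unique_id, policy_no, str(path)))

def get_image(unique_id: str) -> bytes:
    (path,) = db.execute("SELECT path FROM images WHERE unique_id = ?",
                         (unique_id,)).fetchone()
    return Path(path).read_bytes()

add_image("IMG-0001", "POL-123", b"\x49\x49*\x00 tiff bytes")
print(get_image("IMG-0001")[:4])  # b'II*\x00' -- TIFF little-endian magic
```

The database stays small and fast for index lookups, while the bulk image bytes live on storage that's cheap to grow.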
I'd love to know what document management system you are using now.
Also, don't underestimate the effort of replacing the capture (scanning and importing) provided by the proprietary system.