Hi, so I'm working on a terminal application and would like to use a database, but I'm not sure of the best way to go about it.
I want the program to be installable the way you'd install git, or with brew install, etc. Each person who installs it would then have their own database, which would be accessed via certain commands.
I'm just not sure what the best way to do this would be. I was thinking that for each entry I put in the database I could just create a file in a hidden dot directory?
Thanks.
The filesystem is actually quite a good database, suitable for storing a moderate number of documents with sizes ranging from tiny to huge. It's not relational (the only "key" is the file name), and it's closer to a "NoSQL" store than a relational database.
The Git database, in fact, can store all its objects with one object per file. However, for efficiency reasons, it was found that the access patterns required by Git were better served by combining many objects into one "pack" file.
Git has a very different "database" scheme than most applications: a .git directory containing an objects/ directory and a bunch of files named after SHA-1 checksums. This is not a very good model to use for someone who is starting out.
If you want to use the Git filesystem, there are ways to do that (search for "git plumbing").
I'd recommend one of MySQL, XML files, or a filesystem format (keep in mind that on Windows, with NTFS's default 4 KB cluster size, any file smaller than 4 KB still occupies a full cluster, wasting space).
You are most likely looking for SQLite.
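If it helps, here's a minimal sketch of what that typically looks like: the tool keeps one SQLite file per user inside a hidden dot directory in their home folder. The ~/.myapp path and the entries table are made-up names, just for illustration.

```python
# Minimal sketch: a per-user SQLite database in a dot directory.
# "~/.myapp" and the "entries" table are hypothetical names.
import os
import sqlite3

db_dir = os.path.expanduser("~/.myapp")
os.makedirs(db_dir, exist_ok=True)          # created on first run
conn = sqlite3.connect(os.path.join(db_dir, "data.db"))

conn.execute("""CREATE TABLE IF NOT EXISTS entries (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    body TEXT
)""")

# Insert and query a record; "?" placeholders keep the SQL injection-safe.
with conn:
    conn.execute("INSERT INTO entries (name, body) VALUES (?, ?)",
                 ("example", "hello"))
for row in conn.execute("SELECT id, name FROM entries ORDER BY id"):
    print(row)
conn.close()
```

Each person who installs the tool gets their own data.db created the first time they run it, which matches the per-user setup you describe.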
Related
I would like to know whether a GitHub repository can accept a database file that is bigger than 100 MB.
Is that something that can be done on GitHub?
No, it's not possible to save a file that large on GitHub. GitHub imposes a 100 MB file size limit.
In general, Git repositories are not good for backups because storing large binary objects in them repeatedly makes them bloated and inefficient to maintain. Usually if people try to do this anyway, it causes problems for the hosting provider and they'll be asked to move their repository elsewhere.
Using Git LFS for this purpose avoids the painful maintenance problems, but it's not any better as a choice, since Git repositories, with or without Git LFS, store the entire history of a project forever, and usually people are not interested in storing every backup forever.
If you need a backup service, you should use a service directly for that. However, such a service will not be free because nobody will store your data for free. There are low-cost options, such as Tarsnap, that may meet your needs, though.
No, GitHub imposes a file size limit of 100 MB.
Consider using something like git-annex or Git LFS which allow you to store large files outside of the repo (e.g. in a traditional file server) while storing version controlled links to those files inside the repo.
Would it be possible to use Git as a hierarchical text database?
Obviously you would have to write a front end that would act as a middle man, translating user commands into git commands.
A record would correspond to a "file". In the "file", the text would have to have some kind of conventional format like:
[name]: John Doe
[address]: 13 Maple Street
[city]: Plainview
To do queries, you would have to write a grep front end to use git's search capability.
The database itself would be the repository.
The directory structure would be the hierarchical structure of the database.
The tricky part I see would be that you want the records to be in memory, not usually files on the drive (although that would be possible). So you would have to configure git to be working with files in a virtual file system that was actually in the memory of the db middleware.
Kind of a crazy idea, but would it work?
Potential Advantages:
all the records would be hashed with SHA-1 so there would be high integrity
git takes care of all the persistence problems
db operations like edits can be managed as git merges
db operations like record deletes can be managed as removals (rm)
all changes to the database are stored, so you can recover ANY change or previous state
making copies of the database can be done with clone
Yes, but it would be very slow, and it wouldn't really need git: the functionality of git grep and git clone is available without it.
Filesystems can be used as certain types of databases. In fact, git itself uses the filesystem as a simple, reliable, fast, robust key/value store. Object 4fbb4749a2289a3cd949ebe08255266befd18f23 is in .git/objects/4f/bb4749a2289a3cd949ebe08255266befd18f23. Where the master branch is pointing at is located in .git/refs/heads/master.
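To make that concrete, here's a small Python sketch of how a loose blob's key and on-disk path are derived. It mirrors git's documented object format (SHA-1 over a "blob <size>\0" header plus the content), but it's only a demonstration, not something to run against a real repository.

```python
# Sketch: how git turns content into a key and a path for a loose object.
import hashlib
import os

def loose_object_path(content: bytes) -> str:
    header = b"blob %d\0" % len(content)          # object type, size, NUL
    sha1 = hashlib.sha1(header + content).hexdigest()
    # The key doubles as the location: first two hex chars are the directory,
    # the remaining 38 the file name. (Real loose objects are also
    # zlib-compressed on disk, which this sketch skips.)
    return os.path.join(".git", "objects", sha1[:2], sha1[2:])

print(loose_object_path(b"hello\n"))
# .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```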
What filesystem databases are very bad at is searching the contents of those files. Without indexing, you have to look at every file every time. You can use basic Unix file utilities like find and grep for it.
In addition, you'd have to parse the contents of the files on each search, which can be expensive and complicated.
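To make the cost concrete, an unindexed search over a filesystem database boils down to a full scan like this (a sketch; the directory and search string are arbitrary):

```python
# Sketch: without an index, every file is opened and read on every query.
import os

def naive_search(root, needle):
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as f:
                    if needle in f.read():
                        matches.append(path)
            except OSError:
                pass  # unreadable file: skip it
    return matches

print(naive_search("records", "Plainview"))
```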
Concurrency becomes a serious issue. If multiple processes want to make changes at the same time, each has to copy the whole repository and working directory, which is very expensive. Then they need to do a remote merge, also expensive, which may result in a conflict. Remote access has the same problem.
As to having the files in memory, your operating system will take care of this for you. It will keep frequently accessed files in memory.
Addressing the specific points...
all the records would be hashed with SHA-1 so there would be high integrity
This only tells you that a file is different, or that someone has tampered with the history. In a database, files are supposed to change. It doesn't tell you whether the content is corrupted, malformed, or just a normal change.
git takes care of all the persistence problems
Not sure what that means.
db operations like edits can be managed as git merges
They're files, edit them. I don't know how merging gets involved.
Merging means conflicts which means human intervention, not something you want in a database.
db operations like record deletes can be managed as removals (rm)
If each single file is a record, yes, but you can do the same thing without git.
all changes to the database are stored, so you can recover ANY change or previous state
This is an advantage, it sort of gives you transactions, but it will also make writing to your database supremely slow. Git is not meant to be committing hundreds of times a second.
making copies of the database can be done with clone
cp -r does the same thing.
In short, unless you're doing a very simple key/value store there is very little advantage to using a filesystem as a database. Something like SQLite or Berkeley DB is superior in almost every way.
I'm working on a system that will need to store a lot of documents (PDFs, Word files etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not a good idea for storing 1M documents
sql database - but I won't need most of its relational features, as I only need to store the binary document and its id, so this might not be the fastest solution
no-sql database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what you think is the best way to store those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with NFS shares served by commercial storage appliances.
Then they moved to their own HTTP file server implementation called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS shares: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This can be a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS shares (like NetApp) you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so a file with id 1234567891 is stored as storage/0012/3456/7891.
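A small sketch of that sharding scheme: zero-pad the id to a fixed width, split it into groups, and use one directory level per group (the width of 12 and groups of 4 match the example path above):

```python
# Sketch: spread files across directories by splitting a zero-padded id.
import os

def shard_path(file_id, root="storage", width=12, group=4):
    digits = str(file_id).zfill(width)
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return os.path.join(root, *parts)

print(shard_path(1234567891))   # storage/0012/3456/7891
```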
Hope that helps.
In my opinion...
I would store files compressed onto disk (file system) and use a database to keep track of them.
and possibly use SQLite if this is its only job.
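A rough sketch of that design, assuming gzip for the compression and a single SQLite table to keep track of where each file went (the file names and the documents table are placeholders):

```python
# Sketch: compressed files on disk, with a small database tracking them.
import gzip
import os
import shutil
import sqlite3

conn = sqlite3.connect("documents.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents (
    id            INTEGER PRIMARY KEY,
    original_name TEXT NOT NULL,
    stored_path   TEXT,
    size_bytes    INTEGER
)""")

def store(src_path, store_dir="blobs"):
    os.makedirs(store_dir, exist_ok=True)
    with conn:
        cur = conn.execute(
            "INSERT INTO documents (original_name, size_bytes) VALUES (?, ?)",
            (os.path.basename(src_path), os.path.getsize(src_path)),
        )
        doc_id = cur.lastrowid
        dest = os.path.join(store_dir, "%d.gz" % doc_id)
        # The file itself lives compressed on disk; the DB only records it.
        with open(src_path, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        conn.execute("UPDATE documents SET stored_path = ? WHERE id = ?",
                     (dest, doc_id))
    return doc_id
```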
File system: looking at the big picture, the DBMS uses the file system anyway, and the file system is dedicated to keeping files, so you get to benefit from its optimizations (as LukeH mentioned).
A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using fuse, with a database back-end.
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
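For the FUSE route mentioned above, here's a minimal read-only sketch of a filesystem whose "files" are rows in an SQLite table. It assumes the third-party fusepy package (pip install fusepy); the files table and the flat, no-subdirectory layout are just illustrative choices, not a real design.

```python
# Sketch: a read-only FUSE filesystem backed by an SQLite table (via fusepy).
import errno
import sqlite3
import stat
import sys

from fuse import FUSE, FuseOSError, Operations

class SQLiteFS(Operations):
    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")

    def _lookup(self, path):
        row = self.db.execute("SELECT data FROM files WHERE name = ?",
                              (path.lstrip("/"),)).fetchone()
        if row is None:
            raise FuseOSError(errno.ENOENT)
        return row[0]

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        data = self._lookup(path)
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                "st_size": len(data)}

    def readdir(self, path, fh):
        return [".", ".."] + [r[0] for r in self.db.execute("SELECT name FROM files")]

    def read(self, path, size, offset, fh):
        return self._lookup(path)[offset:offset + size]

if __name__ == "__main__":
    # usage: python sqlitefs.py files.db /mnt/point
    FUSE(SQLiteFS(sys.argv[1]), sys.argv[2], foreground=True, nothreads=True)
```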
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems as well as filesystems in which you can search with SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can take a really deep look inside, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious. Look at mainframes and minis, especially the iSeries OS (now called IBM i; formerly i5/OS and OS/400).
Using a relational database as a mass data store is relatively easy; Oracle and MySQL both have this. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database; a file with some catalogue information (a rough sketch of this stage follows the list).
3) Large data in BLOBs with extensive metadata and complex structures in the database; a file with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> Relational map with extensive meta-data. While there may be an exportable form, the application naturally works with the database, the notion of the file as the repository is lost.
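Here's the rough sketch promised above for stage 2, with SQLite standing in for whatever database you'd actually use; the blobs schema is made up for illustration.

```python
# Sketch of stage 2: the whole document goes into the database as a BLOB,
# with light metadata (the catalogue information) stored beside it.
import os
import sqlite3

conn = sqlite3.connect("catalogue.db")
conn.execute("""CREATE TABLE IF NOT EXISTS blobs (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    mimetype TEXT,
    content  BLOB NOT NULL
)""")

def ingest(path, mimetype=None):
    with open(path, "rb") as f:
        data = f.read()
    with conn:
        cur = conn.execute(
            "INSERT INTO blobs (filename, mimetype, content) VALUES (?, ?, ?)",
            (os.path.basename(path), mimetype, sqlite3.Binary(data)),
        )
    return cur.lastrowid
```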
I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and also provide an easy mechanism for querying the db by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, and the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is referential integrity: you can't easily move or modify the files without the action being brokered by the db. I'm sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response; if the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
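For illustration only, here's a minimal sketch of that "index the text, store only the path" pattern. It uses Whoosh (a pure-Python indexing library) in place of Lucene/PyLucene, purely because it keeps the example short; the idea is the same.

```python
# Sketch: the full-text index holds extracted text plus the path; files stay on disk.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
# "content" would be the text you extracted from the PDF/Word file elsewhere.
writer.add_document(path="/archive/report-2009.pdf",
                    content="quarterly revenue figures")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("revenue")
    for hit in searcher.search(query):
        print(hit["path"])   # only the path comes back; open the file from disk
```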
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually this limit is around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex encoding it, thus doubling the size from the start)
Execute the generated query (which itself means, for the database, that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
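A sketch of that naming convention: the metadata row goes in first, and the row's primary key becomes the file name on disk, so locating a record's file never needs a second lookup (paths and schema are placeholders).

```python
# Sketch: metadata in the DB, files on disk named after the primary key.
import os
import shutil
import sqlite3

conn = sqlite3.connect("filemeta.db")
conn.execute("""CREATE TABLE IF NOT EXISTS files (
    id            INTEGER PRIMARY KEY,
    original_name TEXT NOT NULL,
    added_at      TEXT DEFAULT CURRENT_TIMESTAMP
)""")

def add_file(src, store_root="files"):
    os.makedirs(store_root, exist_ok=True)
    with conn:
        cur = conn.execute("INSERT INTO files (original_name) VALUES (?)",
                           (os.path.basename(src),))
        file_id = cur.lastrowid
    # The primary key *is* the on-disk name.
    shutil.copyfile(src, os.path.join(store_root, str(file_id)))
    return file_id
```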
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.