Does it make sense to use Neo4j to index a file system?

I am working on a Java-based backup client that scans the file system and populates a SQLite database with the directories and file names it finds to back up. Would it make sense to use Neo4j instead of SQLite? Would it be more performant and easier to use for this application? I was thinking that because a filesystem is a tree (or a graph, if you consider symbolic links), a graph database might be suitable. The SQLite database schema defines only two tables, one for directories (full path and other info) and one for files (name only, with a foreign key to the containing directory in the directory table), so it's relatively simple.
The application needs to index many millions of files so the solution needs to be fast.
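For reference, here is a minimal sketch (not the poster's actual code) of such a two-table SQLite schema and scan in Java with JDBC. It assumes the Xerial sqlite-jdbc driver is on the classpath; the table and column names are purely illustrative.

```java
import java.nio.file.*;
import java.sql.*;
import java.util.stream.Stream;

public class BackupScanner {
    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args[0]);
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:backup.db");
             Statement ddl = db.createStatement()) {
            // One table for directories (full path), one for files (name + FK to the directory)
            ddl.execute("CREATE TABLE IF NOT EXISTS directory (id INTEGER PRIMARY KEY, path TEXT UNIQUE)");
            ddl.execute("CREATE TABLE IF NOT EXISTS file (id INTEGER PRIMARY KEY, name TEXT, " +
                        "dir_id INTEGER REFERENCES directory(id))");

            try (PreparedStatement insDir = db.prepareStatement(
                         "INSERT OR IGNORE INTO directory(path) VALUES (?)");
                 PreparedStatement insFile = db.prepareStatement(
                         "INSERT INTO file(name, dir_id) SELECT ?, id FROM directory WHERE path = ?");
                 Stream<Path> walk = Files.walk(root)) {
                // Files.walk visits a directory before its children, so the parent row
                // already exists by the time its files are inserted.
                walk.forEach(p -> {
                    try {
                        if (Files.isDirectory(p)) {
                            insDir.setString(1, p.toString());
                            insDir.executeUpdate();
                        } else {
                            insFile.setString(1, p.getFileName().toString());
                            insFile.setString(2, p.getParent().toString());
                            insFile.executeUpdate();
                        }
                    } catch (SQLException e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        }
    }
}
```

At the scale of millions of files, wrapping the walk in explicit transactions and batching the inserts would make a large difference with SQLite.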

As long as you can perform the DB operations essentially using string matching on the stored file system paths, using a relational database makes sense. The moment the data model gets more complex and you can't express your queries with string matching but need to traverse a graph, a graph database will make this much easier.
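To make the distinction concrete, here is a hypothetical pair of queries; the schema, node labels and relationship types are assumptions for illustration, not anything from the question.

```java
// Relational model: anything under a directory can be found by plain string
// matching on the stored full paths (table and column names are illustrative).
String sql =
    "SELECT d.path, f.name " +
    "FROM file f JOIN directory d ON f.dir_id = d.id " +
    "WHERE d.path LIKE '/home/alice/projects/%'";

// Graph model: "which files are reachable from this directory, including via
// symlinks?" is a traversal, which Cypher (Neo4j) expresses directly.
String cypher =
    "MATCH (d:Directory {path: '/home/alice/projects'})-[:CONTAINS|LINKS_TO*]->(f:File) " +
    "RETURN f.name";
```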

As I understand it, one of the earliest uses of Neo4j was to do exactly this, as part of the CMS system that Neo4j originated from.
Lucene, the indexing backend for Neo4j, will allow you to build any indexes you might need.
You should read up on that and ask them directly.

I am considering a similar solution to index a data store on a filesystem. The remark above about the queries is right.
Examples of worst-case operations:
For SQLite:
if you have a large number of subdirectories somewhere deep in the file system, your space usage in SQLite will not be optimal: you store the full path for every small subdirectory (think of a code project, for instance)
if you need to move a directory, the closer it is to the root, the more rows you will have to rewrite, so it will not be O(1) as it would be with Neo4j (see the sketch after this list)
can you use multithreading with SQLite to scale?
For Neo4j:
each time you search for a full path, you need to split it into components and build a Cypher query with all the elements of the path
the data model will probably be more complex than two tables: all the different objects, plus dir-in-dir, file-in-dir and symlink relationships
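As a rough illustration of the two worst cases above (the schema, labels and paths are made up for the example):

```java
// Worst case for SQLite with full paths stored per directory: moving
// /home/alice/projects to /home/alice/archive rewrites every descendant row,
// so the cost grows with the size of the subtree.
String moveSql =
    "UPDATE directory " +
    "SET path = '/home/alice/archive' || substr(path, length('/home/alice/projects') + 1) " +
    "WHERE path = '/home/alice/projects' OR path LIKE '/home/alice/projects/%'";

// Worst case for Neo4j: looking up one full path means splitting it into
// components and chaining one relationship hop per component in Cypher.
String lookupCypher =
    "MATCH (:Root)-[:CONTAINS]->(:Directory {name: 'home'})" +
    "-[:CONTAINS]->(:Directory {name: 'alice'})" +
    "-[:CONTAINS]->(:Directory {name: 'projects'})" +
    "-[:CONTAINS]->(f:File {name: 'readme.txt'}) " +
    "RETURN f";
```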
Greetings, hj

Related

File storage system for a PostgreSQL database

I currently have a database which needs to store a lot of files. However, I also store information about specific files. I want to be able to store the files alongside the database which contains this metadata, and I am wondering how best to do this. The files are auto-generated and are in a hierarchical structure which is suited to file systems.
The original idea was to store the path to the root directory of the file system containing all the files, and then reference the files relative to this (e.g. ../file_group_1/file_1). Looking into this, it seems difficult to find a way to store the files in this file system without, say, running a separate server alongside the database that manages the filesystem.
I have looked into the Large Objects type in PostgreSQL, but I'm also concerned about the security implications. Is there a better solution to this?
It is often better to store files outside the database, because access is faster and the database size stays small. The downside is that your application will have to manage consistency between the database and the file system in the face of crashes, aborted transactions, etc.
I'd store the complete path of the file with the metadata; splitting it up doesn't save a lot and will make queries more complicated.
To maintain consistency between the database and the file system, you could always write the file first and never delete files except during special garbage collection runs when you can also identify and remove orphans.
If files are stored in the database, bytea is better than large objects unless the files are very large or need to be read and written in parts.
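Here is a minimal sketch of that "write the file first, record it second" ordering in Java with JDBC; the table layout and paths are hypothetical.

```java
import java.nio.file.*;
import java.sql.*;

public class FileStore {
    private final Connection db;
    private final Path storageRoot;

    public FileStore(Connection db, Path storageRoot) {
        this.db = db;
        this.storageRoot = storageRoot;
    }

    /** Write the bytes to disk first; only if that succeeds, record the metadata row. */
    public void store(String relativePath, byte[] content, String description) throws Exception {
        Path target = storageRoot.resolve(relativePath);
        Files.createDirectories(target.getParent());
        Files.write(target, content);                      // step 1: file on disk

        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO document (path, description) VALUES (?, ?)")) {
            ps.setString(1, relativePath);                 // step 2: metadata row
            ps.setString(2, description);
            ps.executeUpdate();
        }
        // If step 2 fails or the transaction aborts, the file on disk is an orphan;
        // a periodic garbage-collection pass can list files with no matching row and delete them.
    }
}
```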
For the best file system layout, I suggest using a folder and document hierarchy.
The document table will have a reference to the entity table and a parent_doc_id column for the hierarchy logic. You should use a recursive CTE to get the document tree as required (see the sketch after this answer).
In the file system you can use a path built from the document references.
i.e.:
entity => 1001
Document 1 => 1002
Document 2 => 1003
I suggest using an integer path in the file system to avoid duplicate filename collisions.
for document 1: 1001\1002
for document 2: 1001\1003
The actual file name and path can be stored in a table for reference.
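As a rough illustration of the recursive CTE mentioned above, in PostgreSQL syntax embedded in a Java string; the document table columns (doc_id, entity_id, parent_doc_id) are an assumption based on the description.

```java
// Recursive CTE over a document table with (doc_id, entity_id, parent_doc_id),
// building the integer path (e.g. 1001\1002) for every document under entity 1001.
String documentTreeSql =
    "WITH RECURSIVE tree(doc_id, path) AS ( " +
    "  SELECT doc_id, CAST(entity_id AS TEXT) || '\\' || CAST(doc_id AS TEXT) " +
    "  FROM document WHERE entity_id = 1001 AND parent_doc_id IS NULL " +
    "  UNION ALL " +
    "  SELECT d.doc_id, tree.path || '\\' || CAST(d.doc_id AS TEXT) " +
    "  FROM document d JOIN tree ON d.parent_doc_id = tree.doc_id " +
    ") " +
    "SELECT doc_id, path FROM tree";
// Note: '\\' is the Java escape for a single backslash in the generated SQL.
```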

Database of metadata files across multiple directories

Consider multiple binary files associated with one metadata file each across multiple directories:
directory1: file1.bin file1.txt
directory2: file2.bin file2.txt
The metadata files contain structured data in XML or JSON format.
Is there a database which can use these metadata files for operating and running queries on them?
From what I understand about document-oriented databases, their data files are stored in one directory.
My question is related to this Stack Exchange question. Unfortunately, there is no good description of an XML-based solution.
To get good query performance on metadata-based queries, virtually any system will have to extract the metadata from the individual metadata files and store it in a more optimized form: one or more indexes of some kind. If there is associated data stored only in files and not in the index (like your .bin files), then the index entry needs to store a path to the file so the associated data can be retrieved when needed. The path can typically include directory names, machine names, etc.; in modern systems it could be a URL.
A document-oriented database might be a perfectly good place to store the metadata index, but it isn't necessarily the best choice. If the metadata you need to query on is highly regular (always has the same fields), then some other form of index storage could have substantially better performance; but if you don't know the structure of the metadata ahead of time, a document-oriented database might be more flexible. Another approach might be a full-text search engine if you are trying to match words and phrases in the metadata.
So yes, such databases exist. Unfortunately, there are far too many unspecified factors to make a specific recommendation. The question isn't well suited to a generic answer: the size of the document collection, the expected transaction rate, the required storage and retrieval latency targets, and the consistency requirements could all factor into a recommendation, as could any platform preferences (Windows vs. *nix, on-premise vs. cloud, etc.).
If you want to query structured data directly in XML or JSON files there are tools for doing so, for example:
xml-grep
jq
If your metadata text files relate to interpreting the binary files, I'm not aware of any generic parser for this. One may exist, but it seems a stretch unless you are using well-defined formats.
The general approach of working with these files directly is going to be inefficient if you need to make repeated queries, as any non-database solution is going to involve parsing the files to resolve your queries. A document-oriented database refers to the ability to store structured content, but the on-disk format will be more efficient (and complex) than text files and XML/JSON metadata which has to be parsed.
If you actually want to use a database and build appropriate indexes over structured content, you should import your raw data into one.
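As a sketch of that import step, here is one way it might look in Java with JDBC. The table layout is an assumption; the metadata is stored as raw text alongside the path of its sibling .bin file, so the database (or a JSON/XML-aware column type) can index it afterwards.

```java
import java.io.IOException;
import java.nio.file.*;
import java.sql.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MetadataImporter {
    /** Walk the directories and load each metadata file, plus the path of its
     *  sibling .bin file, into one table that the database can then index. */
    public static void importMetadata(Connection db, Path root) throws IOException, SQLException {
        try (Statement ddl = db.createStatement()) {
            ddl.execute("CREATE TABLE IF NOT EXISTS metadata (" +
                        "bin_path TEXT PRIMARY KEY, meta_text TEXT)");
        }
        List<Path> metaFiles;
        try (Stream<Path> walk = Files.walk(root)) {
            metaFiles = walk.filter(p -> p.toString().endsWith(".txt"))
                            .collect(Collectors.toList());
        }
        try (PreparedStatement ins = db.prepareStatement(
                "INSERT INTO metadata (bin_path, meta_text) VALUES (?, ?)")) {
            for (Path meta : metaFiles) {
                String binPath = meta.toString().replaceAll("\\.txt$", ".bin");
                ins.setString(1, binPath);
                ins.setString(2, Files.readString(meta));  // raw XML/JSON text
                ins.executeUpdate();
            }
        }
    }
}
```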

When would you store metadata on a filesystem rather than a database?

I want to store documents with metadata in a web application such that a person can view them in a hierarchy.
I gather that a typical way to do this is to create a database entry for each document, store metadata in the database and store the files on a filesystem.
It seems much simpler and faster to store both the documents and the metadata on the filesystem. So a directory might look like this
$ ls subdirectory
.json
Subsubdirectory
bar.pdf
bar.json
foo.tex
foo.json
And then I could get the metadata from the json files (or whatever format I use). I could render subdirectory/foo.html based on the contents of subdirectory/foo.json. And I could render subdirectory.html based on the contents of subdirectory/.json and the contents of the other child json files.
The main disadvantage I've thought of is that it might be harder to search based on the contents of the metadata file (though I could search based on filesystem-level metadata). What other disadvantages are there? And if people do use this approach, why don't I hear about it?
EDIT: I'm not really so concerned about searching; if I build some sort of searching, it'll probably be within a single, smallish directory.
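For what it's worth, here is a minimal pure-JDK sketch of reading that sidecar layout; the rendering step is left out, and the naming convention simply follows the listing above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class SidecarMetadata {
    /** For a document like subdirectory/foo.tex, return the raw contents of its
     *  "foo.json" sidecar, or null if there is none (parsing/rendering omitted). */
    static String metadataFor(Path document) throws IOException {
        String name = document.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        Path sidecar = document.resolveSibling(base + ".json");
        return Files.exists(sidecar) ? Files.readString(sidecar) : null;
    }

    public static void main(String[] args) throws IOException {
        try (Stream<Path> entries = Files.list(Paths.get(args[0]))) {
            entries.filter(p -> !p.toString().endsWith(".json"))
                   .forEach(p -> {
                       try {
                           System.out.println(p + " -> " + metadataFor(p));
                       } catch (IOException e) {
                           throw new UncheckedIOException(e);
                       }
                   });
        }
    }
}
```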
"I could search based on filesystem-level metadata" - you could do this but it means each time you do a search you have to read all the metdata files from the FS and then you have to manually process it. There's no indexing, this is roughly the equivalent of a full table scan in an SQL database (but it's even slower..).
In general storing data on the FS has some other drawbacks, you have to do replication both for durability (so you don't lose files if the disk dies), and if your site is popupar, for scalability. But since you already storing the files on the disk you have to solve this issue anyway.

How to efficiently store hundreds of thousands of documents?

I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea for storing 1M documents
SQL database - but I won't need most of its relational features, as I only need to store the binary document and its id, so this might not be the fastest solution
NoSQL database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what, in your opinion, would be the best way of storing those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can look at how Facebook does it, since it stores a lot of files (15 billion photos):
They initially started with an NFS share served by commercial storage appliances.
Then they moved to their own implementation of an HTTP file server, called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counter-intuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS appliances (like NetApp), you will likely need to keep the files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so the file with id 1234567891 is stored as storage/0012/3456/7891 (see the sketch below).
Hope that helps.
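A tiny Java sketch of that id-to-path scheme; the zero-padding width and group size simply follow the example above.

```java
public class ShardedPath {
    /** Turn a numeric id into a sharded path like storage/0012/3456/7891,
     *  so that no single directory ends up with millions of entries. */
    static String shardedPath(long id) {
        String padded = String.format("%012d", id);   // 1234567891 -> "001234567891"
        return String.format("storage/%s/%s/%s",
                padded.substring(0, 4), padded.substring(4, 8), padded.substring(8, 12));
    }

    public static void main(String[] args) {
        System.out.println(shardedPath(1234567891L)); // storage/0012/3456/7891
    }
}
```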
In my opinion...
I would store the files compressed on disk (in the file system) and use a database to keep track of them,
and possibly use SQLite if this is its only job.
File system: thinking about the big picture, the DBMS uses the file system underneath anyway, and the file system is dedicated to keeping files, so you get its optimizations for free (as LukeH mentioned).
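A minimal sketch of that approach in Java (gzip the file onto disk, record its location in SQLite); the table layout and paths are hypothetical.

```java
import java.io.OutputStream;
import java.nio.file.*;
import java.sql.*;
import java.util.zip.GZIPOutputStream;

public class CompressedStore {
    /** Compress the document onto disk and record where it went in the database. */
    public static void store(Connection db, Path storageRoot, String docId, byte[] content)
            throws Exception {
        Path target = storageRoot.resolve(docId + ".gz");
        Files.createDirectories(target.getParent());
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            out.write(content);                            // compressed file on disk
        }
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO document (doc_id, stored_path) VALUES (?, ?)")) {
            ps.setString(1, docId);
            ps.setString(2, target.toString());            // the DB only tracks the location
            ps.executeUpdate();
        }
    }
}
```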

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and also provide an easy mechanism for querying the DB by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, and the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but what open-source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents, any competent database will be able to handle the metadata for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response; if the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That, plus storing the paths in the database rather than the file contents, is almost always the recommended approach (a rough sketch follows below).
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
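As a rough sketch of the Lucene route mentioned above, here is a minimal Java indexing example. Extracting text from PDFs/Word files (e.g. with Tika) is a separate step and is assumed to have happened already; only the path is stored, not the file contents.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class FileIndexer {
    /** Index the extracted text of one file, storing only its path, not its contents. */
    static void indexFile(IndexWriter writer, String path, String extractedText) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));          // where the original lives
        doc.add(new TextField("content", extractedText, Field.Store.NO)); // searchable, not stored
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            indexFile(writer, "/data/docs/report.pdf", "text already extracted from the PDF ...");
        }
    }
}
```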
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually this limit is around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex-encoding it, thus doubling the size from the start)
Execute the generated query (which itself means, for the database, that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.
