Neo4j: Binary file storage and text search "stack"

I have a project I would like to work on which I feel is a beautiful case for Neo4j, but there are aspects of implementing it that I don't understand well enough to list my questions succinctly. So instead, I'll let the scenario speak for itself:
Scenario: Put simply, I want to build an application for users who receive files of various types (Word documents, Excel spreadsheets, PDFs, images, audio clips and, occasionally, videos), allowing them to upload and categorize these files.
With each file, they will enter any and all associations. Examples:
If Joe authors a PDF, Joe is associated with the PDF.
If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
If Bill sent an email to Jane, Bill is associated with Jane (and the email).
If company X sends an invoice (Excel grid) to company Y, X is associated with Y.
and so on...
So the basic goal at this point would be to:
Have users load in files as they receive them.
Enter the associations that each file contains.
Review associations holistically, in order to predict or take some action.
Generate a report of the interested associations including the files that the associations are based on.
The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if a user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files those associations are based on - i.e. the PDF or Excel file or whatever.
Initial thoughts...
I should also add that this application would be hosted internally and used by roughly 50 users, so I probably don't need the fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large, though: maybe up to a terabyte in a year? (Not the associations, but the actual files.)
Wouldn't it be great if Neo4j just did all of this! Obviously it should handle the graph aspects very nicely, but I figure the file storage and text search are going to need another player added to the mix.
Some combinations of solutions I know of would be:
Store EVERYTHING, including the files as binary, in Neo4j.
I'd be wrestling Neo4j into something it's not built for.
How would I search the text?
Store only associations and metadata in Neo4j, and the uploaded files on the filesystem.
How would I do text searches on files stored on a file server?
Store only associations and metadata in Neo4j, and the uploaded files in Postgres.
I'm not so confident about having all my files inside a DB; I feel more comfortable having them accessible in folders.
Everyone says it's great to put your files in a DB. Everyone says it's terrible to put your files in a DB.
Get to the bloody questions..
Can anyone suggest a good "stack" that would suit the above?
Please give a basic outline on how you would implement your suggestion, ie:
Have the application store the data in Neo4j, then use triggers to update Postgres.
Or have the files loaded into Postgres, with triggers updating Neo4j.
Or have the application load data into Neo4j and then load data into Postgres.
etc
How you would tie these together is probably what I am really trying to grasp.
Thank you very much for any input on this.
Cheers.
p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)

Here are my recommendations:
Never store binary files in the database. Store them in the filesystem or a service like AWS S3 instead, and reference each file in your data model.
I would store the file in S3 first, then a reference to it in your primary database (Neo4j?).
If you want to be able to search for any word in a document, I would recommend a full-text search engine like Elasticsearch. Elasticsearch can index multiple document formats, such as PDF, using Apache Tika.
You can probably also use Elasticsearch/Tika to extract relationships from the documents and surface them to update your graph.
Suggested Stack:
Neo4j
Elasticsearch
AWS S3 or some other redundant filesystem to avoid data loss
Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
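To make the glue between the three systems concrete, here is a minimal sketch in Python. All function names are hypothetical, and the actual service calls (boto3's put_object, the Neo4j driver's session.run, Elasticsearch's index) are left out so the sketch stays self-contained: the application hashes the file to get a stable S3 key, builds the Cypher that registers the File node, and builds the document that Elasticsearch would index with text extracted by Tika.

```python
import hashlib
import mimetypes
from pathlib import Path

def s3_key(data: bytes, filename: str) -> str:
    """Content-addressed S3 key: the hash never changes, so the
    reference stored in Neo4j stays valid even if the file is renamed."""
    digest = hashlib.sha256(data).hexdigest()
    return f"files/{digest}/{Path(filename).name}"

def file_node_query(key: str, filename: str):
    """Cypher (plus parameters) to create the File node that all
    user-entered associations will attach to."""
    query = (
        "MERGE (f:File {s3Key: $key}) "
        "SET f.name = $name, f.mimeType = $mime "
        "RETURN f"
    )
    mime, _ = mimetypes.guess_type(filename)
    return query, {"key": key, "name": filename,
                   "mime": mime or "application/octet-stream"}

def es_document(key: str, filename: str, extracted_text: str) -> dict:
    """Body for the Elasticsearch index call; extracted_text is what
    Tika pulled out of the PDF/DOC/XLS."""
    return {"s3Key": key, "name": filename, "content": extracted_text}
```

The important design point is that the S3 key is the shared identifier: Neo4j and Elasticsearch both store it, so a full-text hit in Elasticsearch can be joined back to the graph, and a graph traversal can fetch the original file for the "How did you come to that conclusion?" report.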

Related

Database creation and query

So I have to create a recipe website, and HTML/CSS is mainly my forte. I need a database to search through over 100 recipes and sort them, mainly by author, apart from the other sorting orders. I don't want to use a CMS like Joomla. How do I start?
Do I store the entire recipe (with a picture or two) in the database, or only a link to the recipe?
Secondly, the client will be updating the website as well; is there any way to simplify the process for a client who has absolutely no knowledge of adding records to a database?
You're going to need to do some server-side scripting. If you don't want to use a CMS or framework, you (or someone else) will have to write the code for all of the site.
DB design pointers:
Store the recipe in the database, along with the author, etc.
Don't store the pictures in the db, even though it's easy enough to do. Better to store them on the server and keep a field in the db, called 'filename' or something, which holds the path of the image on the server.
For the client - you will need to build a backend/admin page(s) with 'forms' for the client to upload (add), update and delete recipes and pictures.
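A sketch of those pointers, shown here with Python's built-in sqlite3 for brevity (the same schema translates directly to MySQL behind a PHP site); the recipe text and author live in the database, while the picture column holds only a path:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path such as "recipes.db" in production
conn.executescript("""
CREATE TABLE authors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE recipes (
    id        INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    title     TEXT NOT NULL,
    body      TEXT NOT NULL,
    picture   TEXT              -- path to the image file, not the image bytes
);
""")
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Alice')")
conn.execute(
    "INSERT INTO recipes (author_id, title, body, picture) "
    "VALUES (1, 'Pancakes', 'Mix and fry.', 'img/pancakes.jpg')"
)
# the "sort by author" query from the question
rows = conn.execute(
    "SELECT r.title, a.name FROM recipes r "
    "JOIN authors a ON a.id = r.author_id ORDER BY a.name"
).fetchall()
```

The table and column names here are made up for illustration; the point is the separation between recipe rows and image paths.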
You don't need to save pictures in the database. See PrestaShop's database model, for example (look only at the image-related parts; there are several tables involved).
Regards and good luck!
You can add pictures to databases as well. For that, you can always reduce the size of the images before inserting them into the database.
For database access, you can use PHP or JavaScript; both provide easy ways of working with a database.
JavaScript even has built-in transaction commit and rollback features.

Designing a generic unstructured data store

The project I have been given is to store and retrieve unstructured data from a third party. This could be HR information (users, pictures, CVs, voicemail, etc.) or factory-related stuff (work items, parts lists, time sheets, etc.). Basically, almost any type of data.
Some of these items may be linked, so a User may have a Picture, for example. I don't need to examine the content of the data, as my storage solution will receive the data as XML and send it out as XML. It's down to the recipient to convert the XML back into a picture or sound file, etc. The recipient may request all Users, so I need to be able to find User records and their related "child" items such as pictures, or the recipient may just want pictures.
My database is MS SQL and I have to stick with that. My question is, are there any patterns or existing solutions for handling unstructured data in this way.
I’ve done a bit of Googling and have found some sites that talk about this kind of problem but they are more interested in drilling into the data to allow searches on their content. I don’t need to know the content just what type it is (picture, User, Job Sheet etc).
To those who have given their comments:
The problem I face is linking objects together. A User object may be added to the data store, and at a later date the user's picture may be added. When the User is requested, I need to return both the User object and its associated Picture. The user may update their picture, so you can see I need to keep relationships between objects. That is what I was trying to get across in the second paragraph. The problem is that my solution must be very generic, as I should be able to store anything and link these objects per the end users' requirements, e.g. Users, Pictures and emails, or Work items, Parts lists, etc. I see that Microsoft has developed Zentity, which looks like it may be useful, but I don't need to drill into the data contents, so it's probably overkill for what I need.
I have been using Microsoft Zentity since version 1, and whilst it is excellent at storing huge amounts of structured data and allowing (relatively) simple access to it, if your data structure is likely to change then recreating the 'data model' (and the regression testing) would probably remove the benefits of using such a system.
Another point worth noting is that Zentity requires filestream storage so you would need to have the correct version of SQL Server installed (2008 I think) and filestream storage enabled.
Since you deal with XML, it's not really unstructured data. Microsoft SQL Server 2005 and later have an XML column type that you can use.
Now, if you don't need to access XML nodes and you think you never will, go with plain varbinary(max). For your information, storing XML content in an XML-typed column lets you not only retrieve XML nodes directly through database queries, but also validate the XML against schemas, which may be useful to ensure that the content you store is valid.
Don't forget FILESTREAM (SQL Server 2008 or later) if your XML data grows in size (2 MB+). This is probably your case, since voicemail or pictures can easily be larger than 2 MB, especially when they are Base64-encoded inside an XML file.
Since your data is quite freeform and changeable, your best bet is to put it on a plain old filesystem, not a relational database. By all means store some meta-information in SQL where it makes sense to search through structured data relationships, but if your main data content is not structured with data relationships, then you're doing yourself a disservice using an SQL database.
The filesystem is blindingly fast at looking up files and streaming them, especially if this is an intranet application. All you need to do is share a folder and apply sensible file permissions, and a large chunk of unnecessary development disappears. If you need to deliver this over the web, consider using WebDAV with IIS.
A reasonably clever file and directory naming convention, with a small piece of software you write to help people get to the right path, will, hands down, beat any SQL database for both access speed and sequential data streaming. Filesystem paths and file names will always beat any clever SQL index for data-location speed. And plain old files are the ultimate unstructured, flexible data store.
Use SQL for what it's good for. Use files for what they are good for. Best tools for the job and all that...
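The "clever naming convention" above can be as small as one function. A sketch (in Python, with a made-up layout; the actual scheme is yours to define) that maps an object type and id deterministically to a path, so both the writer and the reader can find a file without consulting the database at all:

```python
from pathlib import Path

def object_path(root: Path, obj_type: str, obj_id: str) -> Path:
    """Deterministic path convention: <root>/<type>/<id prefix>/<id>.xml.
    Sharding on the first two id characters keeps any single directory
    from accumulating millions of entries."""
    return root / obj_type.lower() / obj_id[:2] / f"{obj_id}.xml"
```

Because the path is a pure function of (type, id), no lookup table is needed to locate a stored object; the SQL side only has to remember ids and relationships.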
You don't really need any pattern for this implementation. Store all your data in a BLOB entry. Read from it when required and then send it out again.
You would probably also need to investigate other infrastructure aspects, like periodically cleaning up the db to remove expired entries.
Maybe I'm not understanding the problem clearly.
Am I right in saying that all you need to store is a blob of XML with whatever binary information it contains? Why can't you have a users table and then a linked (foreign-key) table of user objects, keyed by userId?
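A minimal sketch of that generic objects-plus-links shape, illustrated with Python's sqlite3 for brevity (on MS SQL the xml column would be an XML or varbinary(max) column, possibly FILESTREAM-backed as noted above); two tables cover "store anything" and "link anything to anything":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,   -- 'User', 'Picture', 'WorkItem', ...
    xml  TEXT NOT NULL    -- payload stored and returned as-is
);
CREATE TABLE links (
    parent_id INTEGER NOT NULL REFERENCES objects(id),
    child_id  INTEGER NOT NULL REFERENCES objects(id),
    PRIMARY KEY (parent_id, child_id)
);
""")
user = conn.execute(
    "INSERT INTO objects (type, xml) VALUES ('User', '<user/>')"
).lastrowid
pic = conn.execute(
    "INSERT INTO objects (type, xml) VALUES ('Picture', '<picture/>')"
).lastrowid
conn.execute("INSERT INTO links VALUES (?, ?)", (user, pic))

# "give me this User and everything linked to it"
children = conn.execute(
    "SELECT o.type, o.xml FROM links l "
    "JOIN objects o ON o.id = l.child_id WHERE l.parent_id = ?",
    (user,),
).fetchall()
```

Updating a user's picture is then just replacing one row in objects; the link row is untouched, which is the relationship-keeping behaviour the question asks for.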

Finding databases for use in applications

Does anyone have recommendations on how to find databases for the random things I might want to use in an application? For example, a database of zip code locations, area code cities, car engines, IP address locations, food calorie counts, book lists, or whatever. I'm just asking generally: when you decide you need a bunch of data, where are some good places to start looking other than Google?
EDIT:
Sorry if I was unclear. It doesn't necessarily need to be a database, just something I could dump into a relational database, like a CSV file. I was just listing some examples of things I've needed in the past. I always find myself searching all over Google for these types of things and am trying to find a few places to look first. Are there companies that just collect tons of data on things and sell it or give it away?
Bureau of Labor Statistics
The Zip Code Database Project
US Census Bureau (You can use their interface to create and download custom CSV files)
Data.gov has tons of stuff
You don't necessarily need a database to get going. A lot of the kinds of information you're listing will exist in delimited files (such as CSVs). Any structured text file will do, since importing is fairly trivial with most major database engines. In fact, the raw data will most likely not exist in a db for that reason: it can then be imported into the RDBMS of your choice, and the provider of that data doesn't need to worry about a bunch of different db formats.
Zip Code CSV file at SourceForge
List of English words (not CSV per se, but just as easy to import; useful for things like spell checkers)
Really good looking source of a few different datasets
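For instance, pulling one of those delimited files into a database takes only a few lines. A sketch with Python's csv and sqlite3 modules (the zip-code columns here are invented sample data; any RDBMS bulk-import tool does the same job):

```python
import csv
import io
import sqlite3

# A delimited file as described; a real zip-code CSV would be read from disk.
csv_text = "zip,city,state\n90210,Beverly Hills,CA\n10001,New York,NY\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zips (zip TEXT, city TEXT, state TEXT)")

# DictReader yields one dict per row, which maps onto named placeholders.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany("INSERT INTO zips VALUES (:zip, :city, :state)", reader)

row = conn.execute("SELECT city FROM zips WHERE zip = '90210'").fetchone()
```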
If you want to track the user's position, this http://code.google.com/apis/gears/api_geolocation.html might be a better way than a lookup table or a CSV file.

Local File-type Datastore for Winforms

I hope my title is not misleading, but what I'm looking for is a file-type datastore for a Winforms app (.NET 3.5) that will allow me to:
store/retrieve strings and decimal numbers (only);
open and save it to a file with a custom extension (like: *.abc); and
have some relational data mapping.
Basically, the application should allow me to create a new file with my custom extension (I am au fait with handling file-associations), and then save into this file data based on functionality determined by the application itself. Similar to when you would create a Word document, except that this file should also be able to store relational data.
An elementary example:
the "file" represents a person's car
every car will have standard data that will apply to it - Make, Model, Year, Color, etc.
However, each car may have self-determined categories associated with it such as: Mechanic History, In-Car Entertainment History, Modifications History, etc.
From the data stored in this "file", the person can then generate a comprehensive report on the car.
I realise that the above example could easily warrant using an embedded DB like SQLCE (SQL server compact edition), but this time around I want to be able to create datastores that can move with people, assuming that this application resides on more than one computer (eg. Work and Home).
I'm unsure if XML is the option to go with here, as the relational data-mapping may pose a problem. In theory I envision having a pre-defined data-model, and create a new SQLCE-type database (the "file") to which the data can be persisted to and retrieved from. File-size is not an issue, as long as it's portable, ie. I can copy it to a flash disk from the office and continue working with it at home.
If my question is still unclear, please let me know and I'll try my best to clarify! I would really appreciate any help in this regard.
THANX A MILLION!!
EDIT: my apologies for the long-winded essay!
Here are a couple of ways you could do this.
XML, DataSets and Linq
You can use an XML file, load it into a DataSet, then use LINQ to query it.
SQLite
SQLite is a file-based database. You can ship the SQLite dll with your application; it needs no installation. You can name the files whatever you like and then access the data using ADO.NET. Here is an ADO.NET provider and SQLite implementation rolled into one dll: http://sqlite.phxsoftware.com/
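The car example from the question maps onto this nicely. A sketch of the idea (shown in Python's sqlite3 for brevity; in the Winforms app you would do the same through the ADO.NET provider above, and the file name and schema here are made up): the "document" is simply a SQLite database saved under the custom extension, so copying it to a flash drive moves everything.

```python
import sqlite3

# In production: conn = sqlite3.connect("mycar.abc") - any extension works,
# since SQLite doesn't care what the file is called.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE car (make TEXT, model TEXT, year INTEGER, color TEXT);
CREATE TABLE history (
    id       INTEGER PRIMARY KEY,
    category TEXT NOT NULL,   -- 'Mechanic', 'Entertainment', 'Modification', ...
    entry    TEXT NOT NULL
);
""")
conn.execute("INSERT INTO car VALUES ('Toyota', 'Corolla', 2008, 'Blue')")
conn.execute(
    "INSERT INTO history (category, entry) VALUES ('Mechanic', 'Oil change')"
)
conn.commit()
```

The self-determined categories become rows rather than columns, so the user can invent new ones without any schema change, and the report query is a plain SELECT over the single portable file.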

Managing a large collection of music

I'd like to write my own music streaming web application for personal use, but I'm racking my brain over how to manage it. Existing music and its location rarely change, but they can (fixing a filename or ID3 tags, /The Chemical Brothers instead of /Chemical Brothers). How would the community manage all of these files? I can gather a lot of information with just an ID3 reader and my filesystem, but it would also be nice to keep track of how often each track is played and such. Would using iTunes's .xml file be a good choice, i.e. keeping my music current in iTunes and basing my web application's data off of it? I was thinking of keeping track of all my music by md5'ing each file and using that as the unique identifier, but if I change the ID3 tags, will that change the md5 value?
I suppose my real question is: how can you keep track of large amounts of music? Keep the meta info in a database? How I would connect a file to its db entry is my real question - or should I just read from the filesystem as needed?
I missed part 2 of your question (the md5 thing). I don't think an MD5/SHA/... solution will work well, because those hashes don't allow you to find duplicates in your collection (like popular tracks that appear on many different samplers). And especially with big collections, that's something you will want to do someday.
There's a technique called acoustic fingerprinting that shows a lot of promise; have a look here for a quick intro. Even if there are minor differences in recording levels (like those popular "normalized" tracks), the acoustic fingerprint should remain the same - I say should, because none of the techniques I tested is really 100% error-free. Another advantage of acoustic fingerprints is that they can help you with tagging: a service like FreeDB will only work on complete CDs, while acoustic fingerprints can identify single tracks.
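To answer the md5 sub-question directly: yes, hashing the whole file means editing ID3 tags changes the hash. One workaround, sketched below with caveats (it ignores ID3v1 tags at the end of the file and the rarer ID3v2 footer), is to hash only the audio bytes after the leading ID3v2 tag, whose header stores its own size as a 28-bit "syncsafe" integer:

```python
import hashlib

def audio_md5(data: bytes) -> str:
    """MD5 of the audio frames only, skipping a leading ID3v2 tag so
    that retagging a track does not change its identifier."""
    if data[:3] == b"ID3" and len(data) >= 10:
        # Header bytes 6-9 hold the tag size, 7 bits per byte (syncsafe).
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        data = data[10 + size:]  # 10-byte header plus the tag body
    return hashlib.md5(data).hexdigest()
```

With this, the same track hashes identically before and after a tag edit; but as noted above, it still won't catch the same recording ripped twice, which is where acoustic fingerprinting comes in.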
For inspiration, and maybe even for a complete solution, check out ampache. I don't know what you call large, but ampache (a php application backed by a mysql db) easily handles music collections of tens of thousands of tracks.
Recently I discovered SubSonic, and the web site says "Manage 100,000+ files in your music collection without hassle", but I haven't been able to test it yet. It's written in Java and the source looks pretty neat at first sight, so maybe there's inspiration to be had there too.
