What database is good for unstructured data?

I am working on a project with a great deal of unstructured data. Is there database software or a tool that is suitable for unstructured data? If there is none, what database design should I use if MySQL or SQL Server are my only choices?

If you are going to have enough structured data to formulate a key, I'd stick with any database that supports BLOBs.
If you're not going to have a structured key, I'd go with something like CouchDB. It allows you to use unstructured keys to store unstructured data.
If you have unstructured keys and you're absolutely stuck with MySQL or SQL Server, you can still store unstructured data (MySQL, for instance, supports column prefix indexing, where you give it the number of leading characters of a variable-length field to use for the index).

VelocityDB is a high-performance database suitable for handling unstructured data. It is common to create inverted indexes when handling unstructured data. The VelocityDB website and download provide sample code for creating inverted indexes from books, web pages and the entire Wikipedia text.
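To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It does not use VelocityDB or its API; the function names and the sample documents are made up for illustration. It only shows the general technique of mapping each word to the set of documents that contain it.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the first word's postings, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, not taken from the original post.
documents = {
    1: "CouchDB stores unstructured documents",
    2: "An inverted index maps words to documents",
    3: "MySQL supports prefix indexes",
}
index = build_inverted_index(documents)
print(search(index, "inverted documents"))
```

A real system would add stemming, stop words and persistent storage, but the lookup structure stays the same.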

Related

Is there any database model that is "NoSQL", "schema-free" and "relational"? Does it support C++?

I need a schema-free database with relational features for my C++ application.
I am already using PostgreSQL and MySQL in my project.
I want to store data relationally in documents and need CRUD using SQL.
"Relational" and "schema-free" are mutually exclusive.
Modern DBMSs support several data models. For example, SQL Server supports relational, document-oriented (both XML and JSON) and graph (network) data models. You can combine different models in the same database. A typical example is a table of documents that has several columns for the most important attributes, including the keys, and one column that stores XML.
However, the relational data model is well structured by default, so it's hard to implement a schemaless relational database. This may be simulated with Excel sheets or tables using only some "variant" data type but such a solution seems to be fragile and has performance issues.
Another way is to use an EAV (entity-attribute-value) design inside a relational database.
You can have a look at the book "Programming with databases", which contains some examples of using both SQL and NoSQL.
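As a rough illustration of the EAV idea mentioned above, the following sketch uses Python's built-in sqlite3 module. The table name, column names and helper functions are assumptions made for this example, not something from the answer. Every attribute becomes a row, so entities can carry different attributes without a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_attribute (
        entity_id INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attribute)
    )
""")


def set_attr(entity_id, attribute, value):
    """Create or overwrite one attribute of an entity (values stored as text)."""
    conn.execute(
        "INSERT OR REPLACE INTO entity_attribute VALUES (?, ?, ?)",
        (entity_id, attribute, str(value)),
    )


def get_entity(entity_id):
    """Reassemble one entity as a dict of attribute -> value."""
    rows = conn.execute(
        "SELECT attribute, value FROM entity_attribute WHERE entity_id = ?",
        (entity_id,),
    ).fetchall()
    return dict(rows)


# Two entities with different sets of attributes (hypothetical data).
set_attr(1, "name", "widget")
set_attr(1, "color", "red")
set_attr(2, "name", "gadget")
set_attr(2, "weight_kg", 1.5)
print(get_entity(2))
```

Note that every value ends up as text here, which hints at the fragility and performance concerns raised in the answer: type checking and multi-attribute queries become the application's problem.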
Please take a look at the AgensGraph database. AgensGraph is a multi-model database supporting both the relational model and the graph model (schema-free). It also supports the key-value and document models, and it is based on the C language.
The actual answer to your specific question based on your parameters:
Schema free
CRUD support
SQL "like", aka Relational
C++ support
is ArangoDB.

What NoSQL database is ideal to use for storing code/snippets?

I want to store code similar to how jsfiddle stores code. I currently use Postgres for my main database, but I'm wondering whether a NoSQL database would be a better fit.
Code snippets for now will have just one author, but in the future there may be multiple authors and I want the ability for reverting as well.
I know there are key/value databases and document-oriented databases. Which specific NoSQL database would suit my needs? Or should I stick with my Postgres database?
FYI:
I'm using Django
The users will be permanently stored in Postgres (I'm using OpenID)
You can't choose a non-relational data strategy without defining what you want to do with your data.
Relational database design comes from rules of normalization, which you can apply once you know your data alone. But non-relational database design depends on your queries more than your data.
But without knowing anything about your application, my first recommendation would be to stick with PostgreSQL. Store your code snippets in text blobs, and metadata about the code (authorship, date, language, project, etc.) in additional columns alongside the text blob. You can also consider using GiST indexes to allow for flexible searching.
You might also consider Apache Solr, which is technically similar to a document-oriented DBMS, though it is usually presented as a fulltext search engine.
As for NoSQL databases, the only ones I'm familiar with are XML (doesn't scale well and has bad concurrency), and local databases such as Paradox, dBase, FoxPro and Access. I would not recommend any of these.
I think that the idea that it's a NoSQL database should be a smaller factor in your decision. Consider these things instead.
Redundancy. Can you run it on two servers at the same time or does it support failover? (SQL Server, Interbase, Firebird)
Concurrency. Will you host this app on the web? How will it handle 10 concurrent operations? (PostgreSQL, MySQL, Interbase, Firebird)
Speed. How long is acceptable for a lookup or post?
Embeddability. Is this a desktop application? An embedded database can make things easier. (Local databases such as Paradox, dBase, FoxPro, Access, Interbase, Firebird or SQLite)
Portability. Desktop apps may run on Mac, Linux, Windows. (SQLite)
This sounds like a relatively uncomplicated application that could be implemented in a traditional relational database or a NoSQL database without too many problems.
However, if you're keeping the user base info in PostgreSQL, it would seem simplest to just stick with that as a single storage method. Using both an SQL database and a NoSQL database adds complexity, makes joining across the datasets hard (so, e.g., you couldn't write a query to do something like "list users along with their most recent document"), and makes it impossible to ensure consistency between the two datasets.
What do you get for this trouble? You want versioning. CouchDB will give you revision control, but it's questionable whether you should be using that for UI-level versioning (e.g., because compacting the database will lose your old versions).
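For comparison, application-level versioning in a single relational database can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 rather than PostgreSQL, and the table names, columns and helper functions are invented for the example. Each save appends a revision row, a revert copies an old revision forward, and users and snippets can be joined in one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE snippets (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE revisions (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        snippet_id INTEGER NOT NULL REFERENCES snippets(id),
        author_id  INTEGER NOT NULL REFERENCES users(id),
        language   TEXT NOT NULL,
        body       TEXT NOT NULL
    );
""")


def save_revision(snippet_id, author_id, language, body):
    """Append a new revision; older revisions are never modified."""
    cur = conn.execute(
        "INSERT INTO revisions (snippet_id, author_id, language, body) "
        "VALUES (?, ?, ?, ?)",
        (snippet_id, author_id, language, body),
    )
    return cur.lastrowid


def current_body(snippet_id):
    """The current version is simply the newest revision."""
    row = conn.execute(
        "SELECT body FROM revisions WHERE snippet_id = ? "
        "ORDER BY id DESC LIMIT 1",
        (snippet_id,),
    ).fetchone()
    return row[0] if row else None


def revert(snippet_id, revision_id, author_id):
    """Revert by copying an old revision forward, so history is kept."""
    language, body = conn.execute(
        "SELECT language, body FROM revisions WHERE id = ? AND snippet_id = ?",
        (revision_id, snippet_id),
    ).fetchone()
    return save_revision(snippet_id, author_id, language, body)


def latest_per_user():
    """List users along with the body of their most recent revision."""
    return conn.execute("""
        SELECT u.name, r.body
        FROM users u
        JOIN revisions r ON r.author_id = u.id
        WHERE r.id = (SELECT MAX(id) FROM revisions WHERE author_id = u.id)
        ORDER BY u.name
    """).fetchall()


# Hypothetical sample data.
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
conn.execute("INSERT INTO snippets VALUES (1, 'hello')")
first = save_revision(1, 1, "javascript", "alert('v1');")
save_revision(1, 2, "javascript", "alert('v2');")
```

Because a revert is just another insert, old versions survive any maintenance on the database, and the per-revision author column leaves room for multiple authors later.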

Databases for easy comparison

We have an application which has metadata information stored in database (some tables with relations between). The metadata can be edited through web app or directly manipulating values in SQL Server database.
The problem: metadata changes and needs to be merged between different environments (test, staging, production, etc.). There are tools (e.g. RedGate) that help, but it is still quite a lot of work to compare databases if autogenerated IDs are being used (as is currently the case in our DB, and yes, one way is to use natural keys to make comparison easier).
However, our metadata does not necessarily have to be stored in an SQL database. It could be stored as documents in NoSQL databases (MongoDB, CouchDB, RavenDB) or even simple XML databases (maybe Berkeley DB XML?). Storing it as an XML file seems like it would work (as it is easier to compare and merge files than databases), but it may not be a good option because we need some concurrency mechanisms and some degree of transaction support.
We do not need replication to other servers, there is no need for high availability, etc.
The requirements to store data:
some kind of ACID
Should run on Windows
Easy comparison (bi-directional sync)
(optional) GUI to see what is in database
(optional) export to file (JSON, XML)
What are the options?
Why conflate the storage with the representation you are performing the diff on?
I'd keep everything in SQL, but when it comes time to compare, select all the important data (not the IDs) into an XML format and use an XML differencing tool (or into a CSV format, and use a plain-text comparer).
I have never used it, but CouchDB has built-in support for bidirectional syncing between databases.
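A minimal sketch of that export-then-diff approach, assuming a made-up `setting` table with a natural key `name` and using Python's built-in sqlite3 and difflib in place of SQL Server and an external comparer. The autogenerated ids differ between the two databases, but they are left out of the export, so only real differences show up.

```python
import difflib
import sqlite3


def make_db(rows):
    """Build a throwaway database; ids are autogenerated and therefore arbitrary."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE setting ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE, value TEXT)"
    )
    conn.executemany("INSERT INTO setting (name, value) VALUES (?, ?)", rows)
    return conn


def export_lines(conn):
    """Dump rows without surrogate ids, ordered by the natural key."""
    rows = conn.execute("SELECT name, value FROM setting ORDER BY name")
    return [f"{name}={value}" for name, value in rows]


# Hypothetical environments with the same data inserted in a different order.
test_db = make_db([("timeout", "30"), ("theme", "dark")])
prod_db = make_db([("theme", "dark"), ("retries", "3"), ("timeout", "60")])

diff = list(
    difflib.unified_diff(
        export_lines(test_db),
        export_lines(prod_db),
        fromfile="test",
        tofile="production",
        lineterm="",
    )
)
print("\n".join(diff))
```

Sorting by the natural key is what makes the text output stable; without it, row order alone could produce spurious differences.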

What advantages does a Document-based database have over a Relational database?

For example: Microsoft SQL Server vs. CouchDB.
The main benefit for me with CouchDB is that you can access it from pretty much anywhere! What advantages does a document based database have over a relational one?
Where would a document based db be a better choice over a relational?
I wouldn't say "accessing it from anywhere" is an advantage of CouchDB over SQL Server. Both are fully accessible from a variety of clients.
The key differentiating factor is the fundamental concept of how data is persisted: as tables and columns (SQL Server) versus documents (CouchDB). In addition, CouchDB is designed to leverage multiple copies with replication and map-reduce in a highly forgiving fashion. SQL Server can achieve the same level of fault tolerance, but true map-reduce is nonexistent in it (although its ability to deal with sets fundamentally mimics those capabilities; see the GROUPING SETS keyword).
You should note this post which really shows that map reduce has its place, but you need to pick the right tool for the job:
http://gigaom.com/2009/04/14/mapreduce-vs-sql-its-not-one-or-the-other/

Is there a database that can easily digest hierarchical data besides XML?

It is very difficult for me to design the database because it requires a lot of recursion. I really can't use XML because it is just not practical for scalability and the amount of data I need to store. Do you guys know of a database that can be used to store hierarchical data?
SQL Server 2008 has the HierarchyId data type. It's specifically designed for this task. Proper indexing and keys will give you fast access to data in both depth-first and breadth-first searches.
http://technet.microsoft.com/en-us/library/bb677290.aspx
Maybe you want a hierarchical database like LDAP? OpenLDAP is a free implementation.
Oracle easily allows hierarchical queries with the CONNECT BY syntax.
You can have a self-referential table like:
Part
    part_id
    parent_part_id
or a couple of tables like:
Organization
    org_id
    name
org_relation
    org_id1
    org_id2
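The self-referential Part table above can be queried with a recursive common table expression, which many SQL databases support. The sketch below uses Python's built-in sqlite3 module; the `name` column and the sample rows are additions for illustration and are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE part (
        part_id        INTEGER PRIMARY KEY,
        parent_part_id INTEGER REFERENCES part(part_id),
        name           TEXT NOT NULL  -- hypothetical extra column
    )
""")
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [
        (1, None, "bicycle"),
        (2, 1, "wheel"),
        (3, 2, "spoke"),
        (4, 1, "frame"),
        (5, None, "helmet"),
    ],
)


def subtree(root_id):
    """Return (name, depth) for the root part and all of its descendants."""
    return conn.execute(
        """
        WITH RECURSIVE tree(part_id, name, depth) AS (
            SELECT part_id, name, 0 FROM part WHERE part_id = ?
            UNION ALL
            SELECT p.part_id, p.name, t.depth + 1
            FROM part p
            JOIN tree t ON p.parent_part_id = t.part_id
        )
        SELECT name, depth FROM tree ORDER BY depth, part_id
        """,
        (root_id,),
    ).fetchall()


for name, depth in subtree(1):
    print("  " * depth + name)
```

The recursion happens inside the database, so the application issues one query per subtree instead of one query per level.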
If you're open to NoSQL, then I'd recommend MongoDB. It is document-oriented, so you're not tied down to a fixed schema. It is also very scalable and performant. There are a lot of good OOP-like things in it. For instance, a document can contain embedded documents, so if your data is already designed as XML, it will be mostly trivial to store in MongoDB.
SQL Server also allows you to store XML documents in single fields and then easily parse specific elements or attributes from them via queries.
