Are RDF / Triple Stores suited for storing application data? (as opposed to the graph Metadata) - database

I'm trying to create a small web application for a "personal information manager" / wiki kind of tool where I can take notes in the form of HTML snippets (or maybe Markdown), annotate them with some https://schema.org/ microdata and store both the snippet and the metadata somewhere for querying.
My understanding so far is that most semantic data stores (triple/quad stores, or databases supporting RDF) are better suited for storing and querying mainly the metadata. So I'll probably also want some traditional store of some sort (relational, document store, key-value, or even a non-rdf graph db) where I can store the full text of each note and maybe some other bits like time of last access, user-id that owns the note, etc, and also perform traditional (non-semantic) fulltext queries.
I started looking for stores that would allow me to store both data and metadata in a single place. I found a few: Ontotext GraphDB, Stardog, MarkLogic, etc. All of these seem to do exactly what I want, but have some pretty limiting free license terms that really discourage me from studying them in depth: I prefer to study open technologies that I could potentially use on a real product.
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
if there's any setup involving free/open source software that developers with experience in RDF/Sparql can recommend, given the problem I describe.
Right now I'm just leaning towards using Apache Jena for the RDF store and SPARQL queries, and something completely independent for the rest of the data (PostgreSQL most likely).

Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
Not necessarily, no, though there certainly are some cases in which that distinction may be useful. But most RDF databases offer scalable storage for both data and metadata. The only requirement is that your (meta)data is represented as RDF. If you're worried about performance of things like text queries, most of them offer support for full-text indexing through Lucene, Solr, or Elasticsearch.
if there's any setup involving free/open source software that developers with experience in RDF/Sparql can recommend, given the problem I describe.
This is really not the right place to ask this question. Tool recommendations are considered off-topic on StackOverflow since they attract biased answers. But as said, there's plenty of tools, both open-source/free and commercial, that you can look into. I suggest you pick one you like the look of, experiment a bit, and perhaps talk to that particular tool's community to explain what you're trying to do. Apache Jena and Eclipse Rdf4j are two popular open-source projects, but there's plenty of others.

Related

Ideal database for a minimalist blog engine

So I'm designing this blog engine and I'm trying to just keep my blog data without considering comments or membership system or any other type of multi-user data.
The blog itself is surrounded around 2 types of data, the first is the actual blog post entry which consists of: title, post body, meta data (mostly dates and statistics), so it's really simple and can be represented by simple json object. The second type of data is the blog admin configuration and personal information. Comment system and other will be implemented using disqus.
My main concern here is the ability of such engine to scale with spiked visits (I know you might argue this but lets take it for granted). So since I've started this project I'm moving well with the rest of my stack except the data layer. Now I've been having this dilemma choosing the database, I've considered MongoDB but some reviews and articles/benchmarking were suggesting slow reads after collections read certain size. Next I was looking at Redis and using its persistence features RDB and AOF, while Redis is good at both fast reading/writing I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents" etc.
So is there any way I can settle this issue for good? considering that I only need to represent my data in key,value structure and only require fast reading but not writing and the ability to be fault tolerant.
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of blogs you read about the downsides of a NoSQL solution are around large data sets - or people that are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.

Backend for Web Development using Clojure/ClojureScript

I'm familiar with developing desktop apps in Clojure (written a multithreaded interactive visualization system). However, I'm fairly new to Web development using Clojure.
I plan to use Clojure on the server for handling logic; and ClojureScript for handing client side work. However, I don't know what to use for my database server. Should I use something like Monogodb? or Hadoop? Or .... ?
The app is something very simple; a basic forum. Total number of concurrent users will be < 100 at a given time. One thing that is important to me is the ability to easily backup / data consistency -- it's very very important to me that I can easily make daily backups (and not lose all the data.)
Thanks!
You can use many databases; if the database has an API for Java, you should be good to go. MySQL, MongoDB, Postgres, Hadoop… and more.
For a nice overview of the webstack in Clojure, check out brehaut's article on the matter.
For getting up and running quickly with Clojure and ClojureScript, try ClojureScriptOne.
There are many ways to write what you want to write; if you're already familiar with Clojure, it shouldn't be too hard to get going.
Haven't used it myself, but Datomic ( http://datomic.com/ ) looks great for anyone coming from Clojure.
Datomic is an amazing database, and I'd highly recommend it. It has many features which set it apart from other database systems:
Like Clojure's data structures, it's persistent, meaning that by default, adding new facts to the database doesn't delete old facts, allowing you to query the state of the database at a previous point in time, enhancing audit-ability and assistance in debugging.
The underlying Entity Attribute Value (EAV/triple) data model (at least partly inspired by RDF & the Semantic Web), is extremely flexible, allowing you to express arbitrary graph structures and effortlessly deal with polymorphism.
The query language is flavor of Datalog, a sort of pattern matching based query language strictly more expressive than SQL and the like in that it can do recursive queries, making it particularly well suited for dealing with graph data/queries.
In addition to Datalog queries, there's a pull api, which let's you pull data out of the database more simply using a GraphQL like expression which specifies the shape of a document-like structure you'd like to pull out of the database. These queries can even be used from within the :find clause of a Datalog query.
You can use Clojure functions from within your queries.
The indexing system is very smart and more or less automatic, in stark contrast with the work that typically goes into tuning SQL databases for performance.
Transactions go through a different API/function call than queries, meaning that the number one security risk identified by OWASP (SQL injection) is literally impossible in Datomic.
The transactor/read-replica design makes it super easy to scale reads/queries, while keeping pressure off the transactor.
It's fun as hell.
One of the things worth pointing out here is that by embracing the EAV data model and datalog/pull queries, Datomic ends up having structural flexibility closer to that of a NoSQL database, while still being fundamentally relational, and even more expressive in it's relational queries than SQL.
It's amazing and you should absolutely give it a shot. It will melt your brain a little. In the good way.
It's also worth noting that it's popularity has inspired a number of successful open source projects, so the underlying approach is not going anywhere any time soon:
DataScript: In memory clj/cljs partial implementation
Datahike: Fork of DataScript which queries over on disk indices, meaning you don't have to keep everything in memory to query
Mentat: Mozilla project trying to make a Datomic-alike for a Mozilla project

Distributed database management system - alternatives?

I am working to develop an application that needs data distributed across countries. Content will be supplied "per region", but needs to be able to be easily copied to another region. On top of this I have general information that needs to be shared and synchronized across the databases.
The organisation I work for is considering implementing this system themselves, but it feels like there should be some good solutions out there already (I am open to cloud solutions - the less my company needs to manage the better)?
This might be a vague question, but I think it is possible to answer it well.
What are my options when developing this kind of distributed data system?
Update:
Should have elaborated (but I'm not sure how much I can say given NDA). Suffice to say, I have "Content" which I need stored on some space (files). I need metadata stored about the content distributed over several nodes (that might be hosted by us or some one else) to allow fast-paced communication and regionalized differences in data. I need to control HOW data is replicated between nodes, but preferably in a standards compliant way. (Preferably not written by us)
You can try CouchDB. Its off-line replication model sounds like a good fit for geo distributed system.
Interesting question - but it would really help to get more context.
You talk about "data", which usually means something with a fairly well-defined structure, often implemented in a relational database.
You also talk about "content", which usually means something with a (much) less well-defined structure, often implemented as a document of some type. Many solutions exist for structuring "documents", e.g. file systems or web sites.
Assuming we are talking about structured data, the simplest thing to do is have single repository, accessible everywhere. Have a look at "cloud" offerings - Amazon's a good bet. Creating your own global data repository is a significant undertaking - but if you're dealing with highly confidential data, or have specific performance requirements, it may the way to go.
If neither of those options work, you're in the world of "enterprise service bus". Google it, but be careful - it's a complex field, and you really want to find someone who knows what they're doing.
Having said that, using an off the shelf ESB is many times less painful than building your own distributed data structure.
I know it's years after asking, but I was looking up the answer to the same question and it looks like Cassandra may fit the bill. Once setup, it looks and acts like other database solutions (Tables, Views, SQL, Transactions, etc.), but it can also be entirely decentralized. Each instance acts as a node in a cluster of other Cassandra nodes. They synchronize behind the scenes and if one goes down, the others pick up the slack. This makes Cassandra both highly scalable and highly fault tolerant.

What exactly is NoSQL?

What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know MemCache is one of such database systems, am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.
I'm not agree with the answers I'm seeing, although it's true that NoSQL solutions tends to break the ACID rules, not all are created from that approach.
I think first you should define what is a SQL Solution and then you can put the "Not Only" in front of it, that will be more accurate definition of what is a NoSQL solution.
With this approach in mind:
SQL databases are a way to group all the data stores that are accessible using Structured Query Language as the main (and most of the time only) way to communicate with them, this means it requires that the database support the structures that are common to those systems like "tables", "columns", "rows", "relationships", etc.
Now, put the "Not Only" in front of the last sentence and you will get a definition of what means "NoSQL". NoSQL groups all the stores created as an attempt to solve problems which cannot fit into the table/column/rows structures or even in SQL Statements, in most of the cases these databases will not support relationships, they're abandoning the well known structures just because the problems have changed since their conception.
If you have a text file, and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
All of these means that there are several solutions to store the information in a way that traditional SQL systems will not allow to achieve better performance, flexibility, etc etc. Every NoSQL provider tries to solve a different problem and that's why you wont be able to compare two different solutions, for example:
djondb is a document store created to be used as
NoSQL enterprise solution supporting transactions, consistency, etc.
but sacrifice performance of its counterparts.
MongoDB is a document store (similar to
djondb) which accomplish great performance but trades some of the
ACID properties to achieve this.
CouchDB is another document store which
solves the queries slightly different providing views to retrieve the
information without doing a full query every time.
...
As you may have noticed I only talked about the document stores, that's because I wanted to show you that 3 different document stores implementations have different approach, therefore you should keep in mind the golden rule of NoSQL stores "Use the right tool for the right job".
I'm the creator of djondb and I've been doing a lot of research even before trying to start my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see the information storage.
From wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc...
To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.
What is NoSQL ?
NoSQL is the acronym for Not Only SQL. The basic qualities of NoSQL databases are schemaless, distributed and horizontally scalable on commodity hardware. The NoSQL databases offers variety of functions to solve various problems with variety of data types, where “blob” used to be the only data type in RDBMS to store unstructured data.
1 Dynamic Schema
NoSQL databases allows schema to be flexible. New columns can be added anytime. Rows may or may not have values for those columns and no strict enforcement of data types for columns. This flexibility is handy for developers, especially when they expect frequent changes during the course of product life cycle.
2 Variety of Data
NoSQL databases support any type of data. It supports structured, semi-structured and unstructured data to be stored. Its supports logs, images files, videos, graphs, jpegs, JSON, XML to be stored and operated as it is without any pre-processing. So it reduces the need for ETL (Extract – Transform – Load).
3 High Availability Cluster
NoSQL databases support distributed storage using commodity hardware. It also supports high availability by horizontal scalability. This features enables NoSQL databases get the benefit of elastic nature of the Cloud infrastructure services.
4 Open Source
NoSQL databases are open source software. The usage of software is free and most of them are free to use in commercial products. The open sources codebase can be modified to solve the business needs. There are minor variations in the open source software licenses, users must be aware of license agreements.
5 NoSQL – Not Only SQL
NoSQL databases not only depend SQL to retrieve data. They provide rich API interfaces to perform DML and CRUD operations. These are APIs are move developer friendly and supported in variety of programming languages.
Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
I used something called the Raima Data Manager more than a dozen years ago, that qualifies as NoSQL. It calls itself a "Set Oriented Database" Its not based on tables, and there is no query "language", just an C API for asking for subsets.
It's fast and easier to work with in C/C++ and SQL, there's no building up strings to pass to a query interpreter and the data comes back as an enumerable object rather than as an array. variable sized records are normal and don't waste space. I never saw the source code, but there were some hints at the interface that internally, the code used pointers a lot.
I'm not sure that the product I used is even sold anymore, but the company is still around.
MongoDB looks interesting, SourceForge is now using it.
I listened to a podcast with a team member. The idea with NoSQL isn't so much to replace SQL as it is to provide a solution for problems that aren't solved well with traditional RDBMS. As mentioned elsewhere, they are faster and scale better at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document based system would work great.
Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM John! I work at Raima so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM

Databases versus plain text

When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Integrity, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do , we start missing the sql capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.
On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 1GB, possible more.
If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific questions, and you are not likely to be reusing these applications after you answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you to use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline pilot, or Snaplogic, as they allow you to query data in a variety of formats and locations (rdbms, flat files, webservices), run algorithms, and build powerful web apps that allow you to easily publish your workflows to your users and let them interact at specific points).
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested on that: Blast on DB. The guy who is working on that is a friend of mine and has a work on fast similarity sequence search, he found out to make his own binary storage better than using databases at this point.
I don't know specific details about his solution but you probably can exchange one or two ideias mailing the guy, even sharing code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have go the flat file route, which means there are well defined file formats for most types of data in the feild. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.

Resources