I need to develop an ontology in computational biochemistry and molecular dynamics. For this, I have collected the terms that are going to be used and attempted to reuse existing ontologies by searching for the terms on an ontology search service, such as EBI-OLS. Some terms are very relevant to import/reuse; however, the ontology they come from is intended for a more specific domain, such as the National Cancer Institute Thesaurus (which has 171,081 classes). Beyond that, there are about 10 other source ontologies that I could potentially reuse. Some of them are also huge, such as the EDAM ontology.
Is it okay to reuse an ontology that is seemingly intended for a more
specific domain, such as cancer? We will use the ontology for more
generic purposes in life science, not only the cancer-related domain.
Is there any general rule of thumb for deciding which of those 10-ish ontologies
are suitable for reuse? (E.g., the paper describing the
ontology should be cited by at least n other papers, or it
should be compatible with the Open Biological and Biomedical Ontology
(OBO) Foundry principles, or it should be backed by a well-known
institution and still be maintained.)
How do you decide the sweet spot for
the number of ontology sources to build on? While we could reuse
as many available terms as possible (from many ontology sources,
especially in the life-science domain), there is a concern that this would make
the resulting knowledge-graph representation much more complex.
Thank you for your answers.
Answers to your questions:
I would say yes, assuming the terms that you intend to use are indeed a match for your use case. I.e., if there is a term that you are interested in using, but, say, its definition or its synonyms do not match your needs, then I would probably consider not using that term.
Yes, there are. I really recommend reading the paper Ten Simple Rules for Selecting a Bio-ontology and the OBO Tutorial.
Try to keep the number of ontologies you use as small as is sensible (that is, the smallest set of ontologies that is well aligned with the needs of your use case). The reason for this is that you will want to engage with the designers of the ontologies you use in order to extend and amend those ontologies for your use case. The more ontologies you use, the greater the chance that you will need to communicate with a larger community to effect change for your use case. This may increase development times. However, using an ontology that is not well aligned with your use case will also increase communication and timelines. Hence the advice to keep the number of ontologies as small as is sensible.
As for your concern regarding importing large ontologies into your ontology: the way this is usually dealt with is to extract only the terms you are interested in using ROBOT, and then to import the extracted module into your own ontology.
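For a concrete feel of that extraction step, here is a minimal sketch that shells out to ROBOT from Python; the file names and the choice of the BOT extraction method are illustrative assumptions, not prescriptions:

    import subprocess

    # Extract a small module containing only the needed terms from a very
    # large source ontology (e.g. NCIt) with ROBOT's SLME extraction.
    # terms.txt lists one term IRI/CURIE per line; file names are examples.
    subprocess.run(
        [
            "robot", "extract",
            "--method", "BOT",             # seed terms plus their superclasses
            "--input", "ncit.owl",         # the large source ontology
            "--term-file", "terms.txt",    # the terms you actually want to reuse
            "--output", "ncit-module.owl", # the small module you then import
        ],
        check=True,
    )

The resulting ncit-module.owl is small enough to import wholesale, and you can re-run the extraction whenever your term list or the source ontology changes.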
In general, I would really strongly recommend reaching out to the OBO Foundry. They have been developing life-science ontologies for a number of years. Working with them, you are likely to avoid many of the typical problems people run into when they start designing ontologies.
I have also written up some general guidelines from my perspective wrt choosing biological ontologies here.
My question is about subsumption in databases (versus ontologies). My understanding is that if I have instances that belong to Class B, then Class A, which is the superclass of Class B, will also have these instances.
Ontologies provide built-in subsumption inference through the various reasoners (e.g. RDFS++, Pellet, etc.). I would like to know whether it is possible to achieve a similar task in database systems. If so, how flexible or easy is it to implement? Are there any advantages of the database implementation (if any) over the ontology-based approach?
To clarify, an ontology doesn't perform reasoning. An ontology is the set of logical axioms that a reasoner uses to answer queries with the inferred (and explicit) information in your knowledge base.
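To make that division of labour concrete, here is a minimal sketch in Python using rdflib with the owlrl reasoner (the example classes and instance are invented):

    from rdflib import Graph, Namespace, RDF, RDFS
    from owlrl import DeductiveClosure, RDFS_Semantics

    EX = Namespace("http://example.org/")
    g = Graph()

    # The "ontology" part: a single axiom stating B is a subclass of A.
    g.add((EX.B, RDFS.subClassOf, EX.A))
    # The data: an instance that is explicitly typed only as a B.
    g.add((EX.b1, RDF.type, EX.B))

    # The reasoner materialises the inferred triples into the graph.
    DeductiveClosure(RDFS_Semantics).expand(g)

    print((EX.b1, RDF.type, EX.A) in g)  # True: b1 is inferred to be an A

The axioms on their own do nothing; it is the expand() call, i.e. the reasoner, that adds the inferred type triple.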
There are a number of existing open-source and commercial systems which perform reasoning and could be considered a database, as opposed to something that is purely a reasoner, like Pellet/FaCT/HermiT. Examples include AllegroGraph, GraphDB, and Stardog. So obviously, yes, it's possible. There are a couple of different ways to approach the implementation, so you have some flexibility in how to design the system based on your preferred use case.
It's not hard to create a toy reasoner that will parse an ontology and do some basic reasoning like subsumption. But if you want to support (correctly) one of the OWL fragments, and you want to do it at scale, it's not easy.
Go look at how Jena and its reasoners are implemented; that will be enough to get you going.
Sesame also has an RDFS reasoner, so that would be another source for you to review.
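And to see what a database-side implementation can amount to without any reasoner at all: a plain recursive query over a subclass table already gives you transitive subsumption. A minimal sketch with SQLite (schema and data invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE subclass_of (sub TEXT, super TEXT);
        CREATE TABLE instance_of (instance TEXT, class TEXT);
        INSERT INTO subclass_of VALUES ('B', 'A'), ('C', 'B');
        INSERT INTO instance_of VALUES ('x1', 'C');
    """)

    # Walk the subclass hierarchy upwards with a recursive CTE, so an
    # explicit instance of C is also reported as a B and as an A.
    rows = conn.execute("""
        WITH RECURSIVE ancestors(cls) AS (
            SELECT class FROM instance_of WHERE instance = 'x1'
            UNION
            SELECT s.super FROM subclass_of s
            JOIN ancestors a ON s.sub = a.cls
        )
        SELECT cls FROM ancestors
    """).fetchall()
    print(rows)  # e.g. [('C',), ('B',), ('A',)]

This is easy for pure subsumption, but each further OWL feature (property chains, disjointness, etc.) means more hand-written queries, which is where dedicated reasoners earn their keep.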
I'm investigating a new project which will be a social-networking-style site. I'm reading up on RavenDB and I like the look of a lot of its features. I've not read up on NoSQL all that much, but I'm wondering if there's a niche it fits best, and whether old-school SQL is still the best choice for other stuff.
I'm thinking that the permissions plug-in would be ideal for a social-net-style site. But will it really perform in an environment where the database is getting hammered, or is it optimised for a more reporting-style system, where it's possible to keep throwing new data structures at the database and report on those structures?
I'm eager to use the right tool for the job. I'll be using MVC3, Windsor, and either NHibernate + SQL Server or RavenDB.
Should I stick with old-school SQL or go with the new kid on the block, RavenDB?
This question can get very close to being subjective (even though it's really not). You're talking about NoSQL as if it were just one thing, and that is not the case.
You have:
graph databases (Neo4j, etc.),
map/reduce-style document databases (Couch, Raven),
document databases which attempt to feel like ordinary databases (Mongo),
key/value stores (Cassandra, etc.),
and more.
Each of them attempts to solve a different problem via different means, and whether you'd use one of them over a traditional relational store is:
A matter of suitability
A matter of personal preference
At the end of the day, for the primary data storage for a single system, a document database or relational store is probably what you want, although for different parts of your system you may well end up utilising a graph database (for calculating neighbours, etc.) or a key/value store (like Facebook does/did for inbox messages).
The main benefit of choosing a document store as your primary store over a relational one is that you don't have to worry about trying to map your objects into a collection of tables, and there is less configuration overhead involved in doing so.
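To illustrate that mapping point, here is a rough sketch in Python, with plain JSON standing in for whatever document database you choose; the aggregate structure is invented:

    import json

    # In a relational store this aggregate would typically become three
    # tables (users, posts, comments) plus the join logic to reassemble
    # it; in a document store you save and load the whole thing as one unit.
    user = {
        "name": "alice",
        "posts": [
            {"title": "hello", "comments": [{"by": "bob", "text": "hi!"}]},
        ],
    }
    blob = json.dumps(user)        # store this under the user's id
    restored = json.loads(blob)    # ...and get the same object shape back
    print(restored["posts"][0]["comments"][0]["by"])  # bob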
The other downside/upside would be that you have to learn something new and make mistakes along the way.
So my answer if I am going to be direct?
RavenDB would be suitable
SQL would be suitable
Which do you prefer to use? These days I'd probably just go for Raven, knowing I can dump data into a relational store for reporting purposes (and probably do likewise for other parts of my system); getting free-text search and fast-ish writes/fast reads, without going through the effort of defining separate read/write stores, is an overall win.
But that's me, and I am biased.
I was thinking of starting a project that very clearly needs a persistent store. I was about to reluctantly decide on an RDBMS when I came across an article which briefly mentioned CouchDB. It seems some advancements in DB technology have happened since I last looked, so I thought I would ask here about databases before I got into it.
Here are my criteria. (I list the criteria again at the end, so if you want to skip the explanations, just scroll down.)
The project is open source and I will not be asking anything for it, so preferably the database is open source and free. Furthermore the software has to run on both Linux and Windows.
There are parts of the project that have to be in C++. The project is not large enough code wise to justify using a second language. So basically the whole thing will be C++.
This project will not have anything to do with the web, so preferably
the database will not require the detritus of a web library.
The objects I want to store fall into one of two categories: basic objects and container objects. The difference is that container objects contain further objects, i.e., a parts-of-parts problem. I need a database that can handle such cases cleanly and efficiently.
I also expect the schema to evolve rapidly, at least initially. I also suspect that some of the old data simply will not fit into the new schemas, so I would like to keep different versions of the schema around. When possible, I would like to be able to transform data in one schema into another schema.
For the application to work as intended, people would have to exchange large chunks of the database with each other. So I would want simple ways of importing and exporting data, which I could automate to some degree.
Finally it would be nice if the database could in someway be simulated in unit tests.
Those are my requirements. I have replicated them below to make it easier for people answering.
Thank you
Non-technical requirements
1. Open source, preferably free.
2. Runs on Windows and Linux.
Technical requirements
1. Has a C++ interface.
2. Is able to handle a non-web application, preferably without REST.
3. Can handle a "parts of parts" problem fairly well.
4. Can handle multiple indexes.
5. Has some concept of schema versions, can handle multiple schema versions, and can migrate tables from one schema to another.
6. Should have a simple mechanism for moving data from one instance of the database to another.
7. Preferably has some mechanism for testing.
HDF5 is a binary format which behaves like a hierarchical database. It has bindings and libraries for C++ and Python (I only use the latter), and it is used to store large amounts of data, like the data produced in certain physics and astronomy experiments.
http://www.hdfgroup.org/HDF5/
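To give a feel for how that hierarchy maps onto code, a minimal sketch using the h5py Python binding (group and dataset names are invented):

    import h5py
    import numpy as np

    # Groups behave like directories and datasets like typed arrays, which
    # maps naturally onto a "parts of parts" containment hierarchy.
    with h5py.File("experiment.h5", "w") as f:
        run = f.create_group("run_001")                     # a container object
        run.attrs["temperature_k"] = 300.0                  # metadata on the container
        run.create_dataset("readings", data=np.arange(10))  # a basic object inside it

    with h5py.File("experiment.h5", "r") as f:
        print(f["run_001/readings"][:5])  # [0 1 2 3 4]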
I looked at a few NoSQL databases some time ago (I had a different requirement than you, though: I needed a standalone server). The ones that I remember as particularly interesting are Redis and Kyoto Cabinet. Have a look.
BTW, you don't mention any performance requirements. If there are none, have you considered SQLite? Simple, embedded, stable, and with the flexibility of SQL after all. With prepared statements, the performance penalty of SQL should not be very high.
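As a rough sketch of that embedded usage with prepared (parameterized) statements, plus a self-referencing parent column for the parts-of-parts case (the schema is invented):

    import sqlite3

    conn = sqlite3.connect("parts.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS part (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id INTEGER REFERENCES part(id)  -- NULL for top-level parts
        )
    """)

    # Parameterized statements let SQLite compile the SQL once and reuse
    # the plan across rows, keeping the per-row overhead low.
    conn.executemany(
        "INSERT INTO part (name, parent_id) VALUES (?, ?)",
        [("engine", None), ("piston", 1), ("piston ring", 2)],
    )
    conn.commit()

    for row in conn.execute("SELECT name FROM part WHERE parent_id = ?", (1,)):
        print(row)  # ('piston',)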
EDIT: ooops, just noticed that you asked this more than a year ago... Well, perhaps you can tell us what you've chosen :)
When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Isolation, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I had just a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
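Taking that claim at face value, in Python it really is about two lines before you have a durable store (the table here is just an example):

    import sqlite3
    db = sqlite3.connect("data.db")  # that's it: a real, ACID, on-disk database

    # From here it's ordinary SQL:
    db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("species", "E. coli"))
    db.commit()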
The problem with small projects is that they become bigger before we know it. And once they do, we start missing SQL's capabilities.
Always design so that a database can be utilized later on, if required, without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of the time, direct text-file or binary-file access can be extremely fast and efficient, as well as giving you all the file-access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module for the specific parsing you need (or makes it easy to write one).
If what you need is many appends (INSERTs?), sequential or infrequent reads, and little/no concurrency, files are the way to go.
On the other hand, when you have requirements for concurrency, non-sequential reading/writing, atomicity, access permissions, data that is relational by nature, etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300 KB), ACID compliant, written in C, and highly ubiquitous (if it isn't already bundled with your programming language, as it is with Python, a binding is surely available). It can be useful even on DB files as big as 1 GB, possibly more.
If your requirements were bigger, there wouldn't even be a discussion: go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often writing one-shot applications (often scripts that define a workflow of calculations) that answer a specific question, and you are not likely to reuse these applications after you have answered that question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, running some local algorithms on the data gathered from different sources, and producing some tabular or structured output format (XML, JSON, etc.).
For that, I would suggest using workflow tools like KNIME (or a commercial solution like InforSense KDE, Accelrys's Pipeline Pilot, or SnapLogic), as they allow you to query data in a variety of formats and locations (RDBMS, flat files, webservices), run algorithms, and build powerful web apps that let you easily publish your workflows to your users and let them interact at specific points.
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change every day, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal Reports, JasperReports, or whatever reporting solution is available out there, and show data to your users in better shape than a spreadsheet or a CSV file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics, you may be interested in this: Blast on DB. The guy who is working on that is a friend of mine and has done work on fast sequence-similarity search; he found that his own binary storage format beat using a database for this.
I don't know the specific details of his solution, but you could probably exchange an idea or two by emailing him, or even share code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full-on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files (if they're faster/easier to develop with) will be fine. If you expect your record formats to change frequently during development, then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
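To make the ALTER TABLE point concrete, a small sketch of a mid-development schema change using SQLite (the schema is invented; note that SQLite's ALTER TABLE supports only a limited set of operations, such as ADD COLUMN and renames):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (id INTEGER PRIMARY KEY, bases TEXT)")

    # The schema evolves mid-development: just add the column and move on.
    # With flat files, the same change would mean rewriting every existing
    # file and every parser that reads them.
    conn.execute("ALTER TABLE sequence ADD COLUMN organism TEXT")
    conn.execute(
        "INSERT INTO sequence (bases, organism) VALUES (?, ?)",
        ("ACGT", "E. coli"),
    )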
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure you need it. If you are on the fence, go with the database-less solution and stick with flat files. It is also important to note that traditionally, bioinformatics researchers have gone the flat-file route, which means there are well-defined file formats for most types of data in the field. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.