So this is the scenario:
You have a bunch of data that needs to end up in SQL.
It needs to entered by hand.
It is not an "enter once and you're done" scenario: it will need to be modified and expanded by humans in an ongoing iterative way. Comments will be associated with entries. It is also useful for data entry people to be able to see related entries near each other.
Different parts of data will need to be worked on simultaneously by different people.
Some error checking also needs to happen. (Let the data entry people correct their mistakes before SQL picks them up)
I have one answer, which is how my project currently operates, but it occurred to me that maybe there are other awesome ways of doing this which don't have the problems of my current method.
Look at YAML as a way to represent the data as plain, human-readable, and human-fixable text.
A very simple program can parse the YAML, locate errors and (if there are no errors) update the database.
These are some really basic requirements, and you probably have more issues than those stated. Nonetheless, you need a simple admin utility to enter data into your database.
A straight SQL query/update utility doesn't cut it because your team needs validation and such. You need multi-user access to the same data with transactional support. You also want to annotate your data entries and allow "related entries" to be viewed by your other users.
You need a database-maintenance application.
Consider using something like Django and it's built admin utilities. It might be more than you're expecting, but I imagine you have more needs in your future than what you've stated here.
My answer is basically
Have the data entry work in Prolog files (Prolog facts)
Have multiple files, split up in a way that is sane for the data.
Have a script that converts the Prolog facts to SQL.
Have some tests in Prolog that validate the Prolog facts.
CONS of this approach:
a little bit annoying to have to check across multiple files to see if an entry already exists, or has been moved etc.
Writing Prolog, as simple as this is, is pretty scary for non-programmers (compared to say, filling out an Excel spreadsheet, or some guided process)
maybe: Merging is tricky, or maybe my VCS is just not very smart (see Which SCM/VCS cope well with moving text between files?)
So this works pretty well, but maybe there is something better that I've never thought of!
If the constraints you're referring to can be enforced at the database level, free software like Quest Toad could allow them enter data directly into the db. It feels very much like using a spreadsheet when in grid view and displays an error when constraints are violated.
Alternatively, depending on what existing stack you have available, .Net grid views make it easy to slap together crud screens in little time.
Related
My lab is doing a lot of sequencing, but the way the sequences are documented makes it difficult to retrieve them or keep track of the data. I would like to create a database that has following features:
-A Graphical user interface to allow one to upload/retrieve/view data, and can incorporate links to quickly BLAST or analyse the sequences with other online tools.
-allows one to access it in the command line
-that has another section on the GUI that has records of what's in the lab, what needs to be ordered etc.
I wanted to know if there are general database templates I can adopt and modify to suit my lab needs? I have no experience in database design but have read about mySQL.
What are the first steps I should take in embarking on this project?
Thank you!
This is an interesting question and problem domain (one I now have expierence with btw). Your first step is to decide on a general architecture and then select technologies for this.
For the web/graphical side, there are lots of off the shelf components (I assume you are aware of tools like AntiSMASH, JBrowse, etc). But you will need to evaluate these. That is way outside the scope of the db side however.
On the database side, PostgreSQL performs admirably here. I have worked on a heavily loaded 10+TB db which was specifically storing sequencing data, BLAST reports, and so forth. If you add stuff like PostBIS on top of that, you get something quite functional.
A lot of the heavier portions of the industry however are using Hadoop because of the fact that the quantity of data available is increasing very rapidly but the amount of expertise required to make that work is also appropriately higher.
We already have a database structure, but it is the structure without normalization and very confused and in need of change, but already has a large volume of stored data, for example, all financial data company, which finance department officials are afraid of losing.
We are undecided about remodeling the entire structure of the database and retrieve the most basic and all that is possible, or continue with the same model along with their problems.
I wonder if someone has made a change like this, if you can actually transfer the data to a new structure.
thanks
Before you do any thing I would BACKUP!!! Next I would create a new database with the ideas that you had in mind. Remember this is were all the real work should be once this is created it is hard to go back. Put a lot of thought in and make the design a bullet proof tiger to the design of your company. Next create some procedures to transform the data you have in the new database as you see fit. It would help if you mentioned the platform(s) you are using and mabey provide some generic examples
I have found SSIS packages work well for projects like this if you are using SQLSERVER. While you will need to still write your transforms out the packages make the work easier for others to see what is happening
Anything can be done by you the developer. However it might make business sense to check out various 3rd party tools. There are many out there and depending on exactly what you are doing you may benefit from doing some research
Yes, it's called "database conversion". It is a very common practice, but it must be done carefully and methodically, ideally by someone who has done many of them and knows the pitfalls. It is not to be done casually by any means. Moreover, it is not unusual in the financial sector to run the "old system" in parallel with the new system for a couple of months, to reconcile month-end reports, before saying goodbye to the old system. Running parallel is a PITA, and can only be done if all of the conversion programs are in place, but it's better to be safe than sorry when the numbers must be correct to the penny.
I had the same problem, the way I solved this is by re-design a new database, then I made a script that copies the data from the old schema to the new one. It's not an easy task because you need to take care of what you are copying from the old model to the new one but it's doable!
absolutely you can migrate the data to an new structure. The real question is 'how difficult (expensive/time consuming/reliable) will the migration be?' To answer that question one would have to know
The accuracy of the existing data - does it have gaps, duplication that disagrees with each other and no way to resolve, errors, etc.
What structure do you imagine going to and is this going to introduce complexity to the migration
the skill level of the person/team doing the migration
How long the migration will take and will the platforms be changing (either the live system being modified or the new system design changing)
I was thinking of starting a project that very clearly needs a persistent store. I was about to reluctantly decide on a RDBMS, when I came across an article which briefly mentions CouchDB. Seems some advancements in DB technology have happened since I last looked, so I thought I would ask here about databases before I got into it.
Here are my criteria. ( I list the criteria again at the end, so if you want to skip the explanations just scroll down. )
The project is open source and I will not be asking anything for it, so preferably the database is open source and free. Furthermore the software has to run on both Linux and Windows.
There are parts of the project that have to be in C++. The project is not large enough code wise to justify using a second language. So basically the whole thing will be C++.
This project will not have anything to do with the web, so preferably
the database will not require the detritus of a web library.
The objects I want to store fall into one of two categories: a basic object and a container object. The difference being objects which are containers will contain even more objects, ie: a parts of parts problem. I need a database that can handle such cases cleanly and efficiently.
I also expect the schema to evolve rapidly, at least initially. I alse suspect that some of the old data simply will not fit into the new schemas. So I would like to keep different versions of the schema around. Win possible, I would like to be able to transform data in one to schema into another schema.
For the application to work the way intended, people would have to exchange large chunks of database with each other. So I would want simple ways of importing and exporting data, which I could automate to some degree.
Finally it would be nice if the database could in someway be simulated in unit tests.
THose are my requirements. I have replicated them below to make it easier for people answering.
Thank you
Non Technical requirements
1. Open source preferably free.
2. Run on Windows and Linux
Has a C++ interface.
Is able to handle a non-web application, preferably without REST.
Can handle a "parts of parts" problem fairly well.
Can handle multiple indexes.
Has sort of concept of schema version, can handle multiple schema versions, and can migrate tables from one schema to another.
Should have a simple mechanism for move data from one instance of the database to another.
Preferably has some mechanism for testing.
HDF5 is a binary format which behaves like an hierarchical database. It has binding and libraries for C++ and python (I only use the latter) and it is used to store big amounts of data, like the ones produces in certain physics and astronomy experiments.
http://www.hdfgroup.org/HDF5/
I've looked at a few nosql databases some time ago (had an different requirement than than you though - needed it to be a standalone server). The ones that I remember as particularly interesting are Redis and Kyoto Cabinets. Have a look.
BTW, you don't mention any performance requirement. If so, have you considered SQLite? Simple, embedded, stable, and with the flexibility of SQL after all. With prepared statement the performance penalty of SQL should not be very high.
EDIT: ooops, just noticed that you asked this more than a year ago... Well, perhaps you can tell us what you've chosen :)
When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Integrity, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do , we start missing the sql capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.
On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 1GB, possible more.
If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific questions, and you are not likely to be reusing these applications after you answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you to use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline pilot, or Snaplogic, as they allow you to query data in a variety of formats and locations (rdbms, flat files, webservices), run algorithms, and build powerful web apps that allow you to easily publish your workflows to your users and let them interact at specific points).
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested on that: Blast on DB. The guy who is working on that is a friend of mine and has a work on fast similarity sequence search, he found out to make his own binary storage better than using databases at this point.
I don't know specific details about his solution but you probably can exchange one or two ideias mailing the guy, even sharing code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have go the flat file route, which means there are well defined file formats for most types of data in the feild. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have been developing web/desktop applications for about 6 years now. During the course of my career, I have come across application that were heavily written in the database using stored procedures whereas a lot of application just had only a few basic stored procedures (to read, insert, edit and delete entity records) for each entity.
I have seen people argue saying that if you have paid for an enterprise database use its features extensively. Whereas a lot of "object oriented architects" told me its absolute crime to put anything more than necessary in the database and you should be able to drive the application using the methods on those classes?
Where do you think is the balance?
Thanks,
Krunal
I think it's a business logic vs. data logic thing. If there is logic that ensures the consistency of your data, put it in a stored procedure. Same for convenience functions for data retrieval/update.
Everything else should go into the code.
A friend of mine is developing a host of stored procedures for data analysis algorithms in bioinformatics. I think his approach is quite interesting, but not the right way in the long run. My main objections are maintainability and lacking adaptability.
I'm in the object oriented architects camp. It's not necessarily a crime to put code in the database, as long as you understand the caveats that go along with that. Here are some:
It's not debuggable
It's not subject to source control
Permissions on your two sets of code will be different
It will make it more difficult to track where an error in the data came from if you're accessing info in the database from both places
Anything that relates to Referential Integrity or Consistency should be in the database as a bare minimum. If it's in your application and someone wants to write an application against the database they are going to have to duplicate your code in their code to ensure that the data remains consistent.
PLSQL for Oracle is a pretty good language for accessing the database and it can also give performance improvements. Your application can also be much 'neater' as it can treat the database stored procedures as a 'black box'.
The sprocs themselves can also be tuned and modified without you having to go near your compiled application, this is also useful if the supplier of your application has gone out of business or is unavailable.
I'm not advocating 'everything' should be in database, far from it. Treat each case seperately and logically and you will see which makes more sense, put it in the app or put it in the database.
I'm coming from almost the same background and have heard the same arguments. I do understand that there are very valid reasons to put logic into the database. However, it depends on the type of application and the way it handles data which approach you should choose.
In my experience, a typical data entry app like some customer (or xyz) management will massively benefit from using an ORM layer as there are not so many different views at the data and you can reduce the boilerplate CRUD code to a minimum.
On the other hand, assume you have an application with a lot of concurrency and calculations that span a lot of tables and that has a fine-grained column-level security concept with locking and so on, you're probably better off doing stuff like that directly in the database.
As mentioned before, it also depends on the variety of views you anticipate for your data. If there are many different combinations of columns and tables that need to be presented to the user, you may also be better off just handing back different result sets rather than map your objects one-by-one to another representation.
After all, the database is good at dealing with sets, whereas OO code is good at dealing with single entities.
Reading these answers, I'm quite confused by the lack of understanding of database programming. I am an Oracle Pl/sql developer, we source control for every bit of code that goes into the database. Many of the IDEs provide addins for most of the major source control products. From ClearCase to SourceSafe. The Oracle tools we use allow us to debug the code, so debugging isn't an issue. The issue is more of logic and accessibility.
As a manager of support for about 5000 users, the less places i have to look for the logic, the better. If I want to make sure the logic is applied for ALL applications that use the data , even business logic, i put it in the DB. If the logic is different depending on the application, they can be responsible for it.
#DannySmurf:
It's not debuggable
Depending on your server, yes, they are debuggable. This provides an example for SQL Server 2000. I'm guessing the newer ones also have this. However, the free MySQL server does not have this (as far as I know).
It's not subject to source control
Yes, it is. Kind of. Database backups should include stored procedures. Those backup files might or might not be in your version control repository. But either way, you have backups of your stored procedures.
My personal preference is to try and keep as much logic and configuration out of the database as possible. I am heavily dependent on Spring and Hibernate these days so that makes it a lot easier. I tend to use Hibernate named queries instead of stored procedures and the static configuration information in Spring application context XML files. Anything that needs to go into the database has to be loaded using a script and I keep those scripts in version control.
#Thomas Owens: (re source control) Yes, but that's not source control in the same sense that I can check in a .cs file (or .cpp file or whatever) and go and pick out any revision I want. To do that with database code requires a potentially-significant amount of effort to either retrieve the procedure from the database and transfer it to somewhere in the source tree, or to do a database backup every time a minor change is made. In either case (and regardless of the amount of effort), it's not intuitive; and for many shops, it's not a good enough solution either. There is also the potential here for developers who may not be as studious at that as others to forget to retrieve and check in a revision. It's technically possible to put ANYTHING in source control; the disconnect here is what I would take issue with.
(re debuggable) Fair enough, though that doesn't provide much integration with the rest of the application (where the majority of the code could live). That may or may not be important.
Well, if you care about the consistency of your data, there are reasons to implement code within the database. As others have said, placing code (and/or RI/constraints) inside the database acts to enforce business logic, close to the data itself. And, it provides a common, encapsulated interface, so that your new developer doesn't accidentally create orphan records or inconsistent data.
Well, this one is difficult. As a programmer, you'll want to avoid TSQL and such "Database languages" as much as possible, because they are horrendous, difficult to debug, not extensible and there's nothing you can do with them that you won't be able to do using code on your application.
The only reasons I see for writing stored procedures are:
Your database isn't great (think how SQL Server doesn't implement LIMIT and you have to work around that using a procedure.
You want to be able to change a behaviour by changing code in just one place without re-deploying your client applications.
The client machines have big calculation-power constraints (think small embedded devices).
For most applications though, you should try to keep your code in the application where you can debug it, keep it under version control and fix it using all the tools provided to you by your language.