We're developing an application that has to query 3D shapes within a bounding box (filtering on other parameters as well). There are more shapes than I want to keep in memory, so I need a database to handle it.
Specifically, our primary operations are inserts and queries. We never modify existing data.
Because it's a desktop application, I'm trying to avoid a separate-server database like PostgreSQL or MySQL, hoping for something simpler to deploy. I found SpatiaLite, but it does not index on the third dimension, so it won't work.
I tried searching for a kd-tree database but haven't found anything yet. I know there are kd-tree implementations, but wrapping one in database form ourselves would take a lot of effort, so I'm trying to see if there is something already out there.
The application is in Haskell, but if we have to integrate with some other language, we can deal with that.
SQLite R*Trees
Given a query rectangle, an R-Tree is able to quickly find all entries that are contained within the query rectangle or which overlap the query rectangle. This idea is easily extended to three dimensions for use in CAD systems.
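For example, a 3D index is declared by giving the virtual table an id plus three min/max coordinate pairs; the table and column names below are just illustrative:

    CREATE VIRTUAL TABLE shape_index USING rtree(
        id,             -- integer primary key
        minX, maxX,     -- x bounds of each shape's bounding box
        minY, maxY,     -- y bounds
        minZ, maxZ      -- z bounds
    );

    -- All entries whose boxes overlap the query box (0,0,0)-(10,10,10):
    SELECT id FROM shape_index
    WHERE maxX >= 0 AND minX <= 10
      AND maxY >= 0 AND minY <= 10
      AND maxZ >= 0 AND minZ <= 10;

You'd keep the shapes themselves (and your other query parameters) in an ordinary table keyed by the same id.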
I would respectfully challenge your attempt to avoid PostgreSQL/MySQL. I'm experienced in PostgreSQL and it does the job you want and it is not difficult to administer. Certainly, anything else you find is not going to have the level of development and testing PostgreSQL does - so why bother?
Related
My lab is doing a lot of sequencing, but the way the sequences are documented makes it difficult to retrieve them or keep track of the data. I would like to create a database that has the following features:
- A graphical user interface that allows one to upload/retrieve/view data and incorporates links to quickly BLAST or analyse the sequences with other online tools.
- Command-line access.
- Another section of the GUI that keeps records of what's in the lab, what needs to be ordered, etc.
I wanted to know if there are general database templates I can adopt and modify to suit my lab's needs. I have no experience in database design but have read about MySQL.
What are the first steps I should take in embarking on this project?
Thank you!
This is an interesting question and problem domain (one I now have experience with, by the way). Your first step is to decide on a general architecture and then select technologies for it.
For the web/graphical side, there are lots of off-the-shelf components (I assume you are aware of tools like AntiSMASH, JBrowse, etc.), but you will need to evaluate these yourself; that is well outside the scope of the database side, however.
On the database side, PostgreSQL performs admirably here. I have worked on a heavily loaded 10+TB db which was specifically storing sequencing data, BLAST reports, and so forth. If you add stuff like PostBIS on top of that, you get something quite functional.
A lot of the heavier portions of the industry, however, are using Hadoop, because the quantity of data available is increasing very rapidly; the expertise required to make that work is correspondingly higher, though.
I have inherited a complex, poorly documented SQL Server database and am trying to reverse engineer the tables, their relationships and in general how the data fits together. To do this I am drawing the table diagrams using Visio, attempting to determine the data relationships using the existing queries (there are no relationships actually defined) and using the application logic (what little there is) to help.
This is quite challenging, however, given the lack of documentation and the obfuscated nature of the system, so I was wondering if there are any techniques or approaches that can help.
Not an easy job, and the bigger it is the harder it is. I do however have one small trick for discovering the necessary surface area of the database - remove all user access!
This sort of database often runs with full access (i.e. users are DBO level or above) so anyone can do anything. If you remove all user access (but not your own!), then start application testing and enable minimum permissions on a per object basis, you will eventually reveal the surface area of the database (the exposed objects). This information is very useful in helping to identify dead objects - your job is hard enough as it is without having to document SPs that are no longer used for example.
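A rough sketch of the idea in T-SQL; the login and object names here are hypothetical:

    -- Strip the blanket access first (SQL Server 2008 era syntax):
    EXEC sp_droprolemember 'db_owner', 'app_user';

    -- Then, as application testing hits permission errors, grant back
    -- only what each failure demands, object by object:
    GRANT EXECUTE ON dbo.usp_GetOrders TO app_user;
    GRANT SELECT  ON dbo.Orders        TO app_user;

At the end, the accumulated set of GRANTs is the surface area.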
How successful this technique can be depends on how thorough you think you can be in your testing, how sure you can be that someone somewhere doesn't have an Excel workbook that connects to the database once a year to run some query that no-one else ever does.
I'm trying to design a database for a small project I'm working on. Eventually, I'd like to make it a web-app, but right now I don't mind just experimenting with data offline. However, I'm stuck in a crossroads.
The basic concept is that a user inputs values for 10 fields, to be compared against what is in the database, with each item having a weighted value. I know that if I were to code it, I could use a look-up table for each field, add up the values, and display the result to the end user.
Another example would be getting the distance between two points, with each point stored in a row and the X and Y values each getting their own column.
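For that second example, what I have in mind is something like the following (table and column names are made up; Access would spell the functions differently, e.g. Sqr()):

    SELECT SQRT(POWER(b.x - a.x, 2) + POWER(b.y - a.y, 2)) AS distance
    FROM points AS a, points AS b
    WHERE a.id = 1 AND b.id = 2;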
Now, if I store the data in a database, should I try to do everything within queries (which I think would involve temporary tables, among other things), or just use simple queries and manipulate the returned rows in the application code?
Right now I'm leaning towards the latter (manipulating data within the app) and just using queries to reduce the amount of data I have to sort through. What would you guys suggest?
EDIT: Right now I'm using Microsoft Access to get the basics down pat and try to get a good design going. IIRC from my experience with Oracle and MySQL, you can run commands together in a batch process and return just one result, but I'm not sure you can do that with Access.
If you're using a database I would strongly suggest using SQL to do all your manipulation. SQL is far more capable and powerful for this kind of job as compared to imperative programming languages.
Of course, it does imply that you're comfortable thinking about data as "sets" and programming in a declarative style. But spending time now to get really comfortable with SQL and manipulating data with it will pay off big time in the long run, not only for this project but for future ones too. I would also suggest using stored procedures over queries in code, because stored procedures provide a beautiful abstraction layer that allows your table design to change over time without impacting the rest of the system.
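To make the abstraction point concrete, here's a hypothetical T-SQL sketch: callers only see the procedure's contract, so the lookup tables behind it can be redesigned freely:

    CREATE PROCEDURE dbo.GetWeightedScore
        @Field1 INT, @Field2 INT
    AS
    BEGIN
        SELECT SUM(w.Weight) AS TotalScore
        FROM dbo.FieldWeights AS w   -- hypothetical lookup table
        WHERE (w.FieldNo = 1 AND w.Value = @Field1)
           OR (w.FieldNo = 2 AND w.Value = @Field2);
    END;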
A very big part of using and working with databases is understanding data modeling, normalization, and the like. Like everything else it will take effort, but in the long run it will pay off.
May I ask why you're using Access when you have a far better database available to you, such as MSSQL Express? The migration path from MSSQL Express to MSSQL or even SQL Azure is quite seamless, and everything you do and experience today (in this project) translates completely to MSSQL Server/SQL Azure for future projects, or if this project grows beyond your expectations.
I don't understand your last statement about running a batch process and getting just one result, but if you can do it in Oracle and MySQL then you can do it in MSSQL Express as well.
What Shiv said, and also...
A good DBMS has quite a bit of solid engineering in it. There are two components that are especially carefully engineered, namely the query optimizer and the transaction controller. If you adopt the view of using the DBMS as just a stupid table retrieval tool, you will most likely end up inventing your own optimizer and transaction controller inside the application. You won't need the transaction controller until you move to an environment that supports multiple concurrent users.
Unless your engineering talents are extraordinary, you will probably end up with a home brew data management system that is not as good as the one in a good DBMS.
The learning curve for SQL is steep. You need to learn how to phrase queries that join, project, and restrict data from multiple tables. You need to learn how to handle updates in the context of a transaction.
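A minimal example of all three over a hypothetical schema:

    SELECT c.Name, o.OrderDate, o.Total               -- project
    FROM Customers AS c
    JOIN Orders AS o ON o.CustomerID = c.CustomerID   -- join
    WHERE o.Total > 100;                              -- restrict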
You need to learn simple and sound table design and index design. This includes, but is not limited to, data normalization and data modeling. And you need a DBMS with a good optimizer and good transaction control.
The learning curve is steep. But the view from the top is worth the climb.
I have a very large database I need to diagram. The database is SQL Server 2008 on x64. It is large in that there are hundreds of related tables, each with up to 2000 fields (some are sparse), multiple relationships between tables (often hundreds per table, in fact), multiple schemas... you get the idea.
I tried to use the Database Diagrams feature of SQL Server Management Studio, but it crashed with a Win32Exception: "Not enough storage is available to process this command..."
I tried to use Visio's reverse engineering feature on a different machine to connect in and diagram it, but that's been going for a few hours with no sign of completion.
The scripts to build this giant schema are being generated by a tool we built for the job. While the tool is doing its job just fine, it's tricky to visualise its output.
I'm after a tool that can kick out a diagram of this database. Any suggestions?
EDIT:
Just to emphasize, the diagram is indeed not supposed to be used for actual useful reference. It's a client relationship management device to demonstrate the complexity/scale of the system.
I worked at a place that had several hundred tables (near 1k) and no one really knew what was going on in the system; the company was growing and hiring a lot. A guy was tasked with doing a diagram, and he auto-magically created a gigantic tiled poster that contained every table, with lines connecting various tables going all over the place. I'm not sure what he used; it was Unix and Oracle years ago (way before Linux and open source). There was no real rhyme or reason to the layout of the tables in his diagram. He had successfully created a diagram of every table. The "poster" was put on a wall in a common area and got a few looks, but no one ever really used it: it was too cluttered and too unorganized to be usable.

As a result, I used MS Word to create a single-page diagram containing the 20 main tables (it went through a few iterations as I "discovered" new main tables), with lines for each foreign key and each table located in a logical manner. I showed the column name, data type, nullability, PK, and all FKs. I put my diagram up on the wall by my monitor. Eventually everyone wanted a copy of it, including the person who made the "poster". When I left that job they were still giving my diagram to new hires.
I recommend that you work like an explorer, find the key tables and map them as you go, making as many specific diagrams as necessary as you discover the system. Trying to make a gigantic "poster" automatically will not work very well.
Generating an image of any kind for a database of that size simply becomes eye candy stuck on a wall: it draws gasps but honestly serves no real purpose except the occasional glance. Why not use a tool like Red Gate's documentation tool, which will serve an actual purpose? Please understand I'm not saying this in a mocking way, but I've been down this road before trying to diagram a huge database, and while I succeeded to some degree, I never found a good outlet where it was of real use.
Since you have multiple schemas, maybe a good idea is to generate diagrams per schema instead.
Use Graphviz. Use some SQL statements to generate the diagram, then run it through dot.exe to generate a PDF or PNG.
I've used it to generate diagrams of data within SQL Server tables; no reason why you can't use it for the tables themselves.
http://www.graphviz.org/
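For example, on SQL Server you can emit DOT edges straight from the foreign-key catalog; this is just a sketch (table names come out unqualified, no styling):

    SELECT line FROM (
        SELECT 0 AS ord, 'digraph schema {' AS line
        UNION ALL
        SELECT 1, '  "' + OBJECT_NAME(parent_object_id) + '" -> "'
                + OBJECT_NAME(referenced_object_id) + '";'
        FROM sys.foreign_keys
        UNION ALL
        SELECT 2, '}'
    ) AS t
    ORDER BY ord;

Save the output to schema.dot and run something like dot -Tpdf schema.dot -o schema.pdf.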
There are also Java, Silverlight, and AJAX utilities for navigating extra-large graphs, as the PDF output is only a single page.
I'd avoid doing the whole thing in a single diagram. As you mentioned, the tools crash, and it's probably not possible to comprehend a diagram with hundreds of tables, each with potentially thousands of fields. Can you generate diagrams of smaller logical areas, with some overlap into other logical areas?
Alternatively, you could try using something like Graphviz to parse the DDL statements and produce a graph. It will probably churn for a while, but I remember seeing poster-sized diagrams with tiny print at a university that were probably of the same complexity as yours. Good luck!
FWIW, assuming you do want to go ahead with this, I've personally found that the Visual Studio 2010 database modeller produces the nicest diagrams I've come across so far. Just import your database as if you were going to use it for Linq2SQL.
SchemaSpy provides a handy interface to generate interactive diagrams that span multiple schemas, using Graphviz as a backend. I've never tried it on anything this size, though.
IntelliJ (specifically IDEA, as that's what I just tried this with, but I believe their other IDEs offer the feature too: https://www.jetbrains.com/) has a built-in database client. From there you can connect to your database and analyse individual tables, a specific combination of tables, or all of them: highlight the desired tables, right-click, and select the 'diagram' option. You can save the diagram for later reference and also print it. I have just tried this on a large DB of 500+ tables and it rendered in seconds; the vector diagram serves as an alternative way to digest database structures visually, along with the relationships and constraints between certain tables, but it is not recommended for printing.
I was thinking of starting a project that very clearly needs a persistent store. I was about to reluctantly decide on a RDBMS, when I came across an article which briefly mentions CouchDB. Seems some advancements in DB technology have happened since I last looked, so I thought I would ask here about databases before I got into it.
Here are my criteria. (I list the criteria again at the end, so if you want to skip the explanations just scroll down.)
The project is open source and I will not be asking anything for it, so preferably the database is open source and free. Furthermore the software has to run on both Linux and Windows.
There are parts of the project that have to be in C++. The project is not large enough code-wise to justify using a second language, so basically the whole thing will be C++.
This project will not have anything to do with the web, so preferably the database will not require the detritus of a web library.
The objects I want to store fall into one of two categories: basic objects and container objects, the difference being that container objects contain further objects, i.e. a parts-of-parts problem. I need a database that can handle such cases cleanly and efficiently.
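To give a feel for what I mean, the relational version of this would be a self-referencing table plus a recursive query (names made up; recursive CTEs as in SQLite, PostgreSQL, and others):

    CREATE TABLE parts (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES parts(id)  -- NULL for top-level objects
    );

    WITH RECURSIVE subparts(id, name) AS (
        SELECT id, name FROM parts WHERE id = 1          -- the container
        UNION ALL
        SELECT p.id, p.name
        FROM parts AS p JOIN subparts AS s ON p.parent_id = s.id
    )
    SELECT * FROM subparts;                              -- it and everything inside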
I also expect the schema to evolve rapidly, at least initially, and I suspect that some of the old data simply will not fit into the new schemas. So I would like to keep different versions of the schema around and, when possible, be able to transform data in one schema into another schema.
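Roughly, what I picture is keeping the old table around and copying across what still fits (again, the names are made up):

    CREATE TABLE parts_v2 (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        material TEXT            -- new in v2; old rows have no value for it
    );

    INSERT INTO parts_v2 (id, name, material)
    SELECT id, name, NULL
    FROM parts_v1;               -- rows that don't fit stay behind in v1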
For the application to work as intended, people would have to exchange large chunks of the database with each other. So I would want simple ways of importing and exporting data, which I could automate to some degree.
Finally it would be nice if the database could in someway be simulated in unit tests.
Those are my requirements. I have replicated them below to make it easier for people answering.
Thank you
Non-technical requirements
1. Open source, preferably free.
2. Runs on Windows and Linux.
Technical requirements
3. Has a C++ interface.
4. Is able to handle a non-web application, preferably without REST.
5. Can handle a "parts of parts" problem fairly well.
6. Can handle multiple indexes.
7. Has some concept of schema versions, can handle multiple schema versions, and can migrate data from one schema to another.
8. Has a simple mechanism for moving data from one instance of the database to another.
9. Preferably has some mechanism for testing.
HDF5 is a binary format which behaves like a hierarchical database. It has bindings and libraries for C++ and Python (I only use the latter) and is used to store big amounts of data, like those produced in certain physics and astronomy experiments.
http://www.hdfgroup.org/HDF5/
I looked at a few NoSQL databases some time ago (I had a different requirement than you, though: I needed a standalone server). The ones I remember as particularly interesting are Redis and Kyoto Cabinet. Have a look.
BTW, you don't mention any performance requirements. If performance isn't critical, have you considered SQLite? Simple, embedded, stable, and with the flexibility of SQL after all. With prepared statements, the performance penalty of SQL should not be very high.
EDIT: oops, just noticed that you asked this more than a year ago... Well, perhaps you can tell us what you've chosen :)