I've never designed a database before, but I have experience programming in a few languages and assembler from college, as well as some web design, so I can at least pick up what I need to know if I'm pointed in the right direction.

One of the tasks of my job is to sort through data that we've been collecting in the field using a "sonde", which measures temperature, pH, conductivity, and other parameters. The device sits in a stream 24/7, except when we swap it with our other sonde every couple of weeks so that a newly calibrated one goes into the stream and we can retrieve the data from the one that was in the field. It collects data every 15 minutes or so, and has done so since 2007. Currently, all of our data is spread across multiple Excel spreadsheets, and we have additional data from a weather station and another instrument, all of which gets compiled into quarterly documents.

My goal is to design as simple a database as possible with most of the functionality of a database like this: http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml. Ours would be significantly simpler, as it would not serve live data; instead it would load data from files that we upload once we've finished formatting and compiling everything. I would very much like the graphing ability that the site above has, but at a minimum I need to be able to select a time range, select as many variables as I want within that range, and then download the result as a spreadsheet (or at least a CSV file).
I realize this is a tough ask, and since I have not designed a database before, I expect it to be very much an uphill climb. However, if I can learn what's necessary to do this and make it web-accessible, it would be a huge accomplishment and would very much impress my boss. Any advice or tips to point me in the right direction would be very much appreciated.
Thanks for your help!
There are actually 2 parts to the solution you're looking for:
The database, which will store your data in a single organized place, and
The application, which is the interface used by people to interact with the database.
Basically, a database by itself is just a container. You need some kind of application that accepts criteria from a user, pulls the data meeting those criteria from the database, and displays it to the user in a meaningful fashion - in this case, a graph or a spreadsheet.
Normally for web-based apps the database and application are two separate components. However, for a small app with a fairly small number of users, and especially for someone just starting out, you may want to consider an all-in-one solution like InfoDome, sort of like MSAccess for the web.
Either way, you're still going to need to learn about database design. There are many good tutorials out there; just do some searching. DatabaseAnswers.org has been useful for me. They have a set of tutorials as well as a large collection of sample database schemas.
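For the "select a time range and download a CSV" piece in particular, the database side can stay very simple. Below is a minimal sketch using Python's built-in sqlite3 module; the one-row-per-measurement layout and all of the table, column, and file names are illustrative assumptions, not the only way to structure it:

    import csv
    import sqlite3

    conn = sqlite3.connect("sonde.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            measured_at TEXT NOT NULL,   -- ISO 8601 timestamp
            parameter   TEXT NOT NULL,   -- e.g. 'temperature', 'pH', 'conductivity'
            value       REAL NOT NULL
        )
    """)

    def export_range(start, end, parameters, out_path):
        """Select the chosen parameters within a time range and write them to CSV."""
        placeholders = ",".join("?" for _ in parameters)
        rows = conn.execute(
            f"SELECT measured_at, parameter, value FROM readings "
            f"WHERE measured_at BETWEEN ? AND ? AND parameter IN ({placeholders}) "
            f"ORDER BY measured_at",
            [start, end, *parameters],
        )
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["measured_at", "parameter", "value"])
            writer.writerows(rows)

    export_range("2007-01-01", "2007-03-31", ["temperature", "pH"], "q1_2007.csv")

A web application would then just expose something like export_range behind a form, and the graphing page could be fed by the same query.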
I'm trying to design a database schema for a Django REST framework web application.
At some point, I have two choices:
1- Choose a schema in which, for one or several API endpoints, I have to get a queryset from the database and then iterate over and order it in Python. (For example, I could store some data in an array-typed column, fetch it from the database, and sort it with Python.)
2- Store the data in another table, inserting a fairly large number of rows with each insert. This way, I can get the data in the format I want with far fewer lines of ORM code.
I ran some basic tests and benchmarks to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means to set up a more realistic situation, so here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests from other endpoints and clients each second?
Is the database (and ORM) usually faster and more reliable than the backend code?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often a distributed system, so it has access to more resources.
Databases are designed to handle large volumes of data, so they are not limited by the memory available to a single application thread.
When this question comes up, more data often needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something: if the application does the ordering, every row has to be transferred just to throw away all but ten.
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.
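To make the top-10 point concrete, here is a small self-contained comparison using Python's sqlite3 module; the table and data are made up purely for illustration. (With the Django ORM from the question, the database-side version corresponds roughly to ordering and slicing a queryset, which the ORM translates into ORDER BY ... LIMIT.)

    import random
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurement (id INTEGER PRIMARY KEY, score REAL)")
    conn.executemany("INSERT INTO measurement (score) VALUES (?)",
                     [(random.random(),) for _ in range(100_000)])

    # Database does the work: ORDER BY + LIMIT run inside the engine,
    # and only ten rows come back to Python.
    top_ten = conn.execute(
        "SELECT id, score FROM measurement ORDER BY score DESC LIMIT 10"
    ).fetchall()

    # Application does the work: all 100,000 rows are shipped to Python,
    # materialized as tuples, and sorted in memory just to keep ten of them.
    all_rows = conn.execute("SELECT id, score FROM measurement").fetchall()
    top_ten_slow = sorted(all_rows, key=lambda r: r[1], reverse=True)[:10]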
My lab is doing a lot of sequencing, but the way the sequences are documented makes it difficult to retrieve them or keep track of the data. I would like to create a database that has the following features:
- A graphical user interface that allows one to upload/retrieve/view data, and can incorporate links to quickly BLAST or analyse the sequences with other online tools.
- Allows one to access it from the command line.
- Has another section on the GUI with records of what's in the lab, what needs to be ordered, etc.
I wanted to know whether there are general database templates I can adopt and modify to suit my lab's needs. I have no experience in database design but have read about MySQL.
What are the first steps I should take in embarking on this project?
Thank you!
This is an interesting question and problem domain (and one I now have experience with, by the way). Your first step is to decide on a general architecture and then select technologies for it.
For the web/graphical side, there are lots of off the shelf components (I assume you are aware of tools like AntiSMASH, JBrowse, etc). But you will need to evaluate these. That is way outside the scope of the db side however.
On the database side, PostgreSQL performs admirably here. I have worked on a heavily loaded 10+TB db which was specifically storing sequencing data, BLAST reports, and so forth. If you add stuff like PostBIS on top of that, you get something quite functional.
A lot of the heavier portions of the industry, however, are using Hadoop, because the quantity of data available is increasing very rapidly; the amount of expertise required to make that work is correspondingly higher, though.
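If you go the PostgreSQL route, the command-line access the question asks for can be as small as a script around a parameterized query. Here is a minimal sketch in Python with psycopg2; the "sequences" table, its columns, and the connection details are hypothetical placeholders, not an existing schema:

    import argparse
    import psycopg2

    parser = argparse.ArgumentParser(description="Look up sequences by name")
    parser.add_argument("name_pattern", help="SQL LIKE pattern, e.g. %%16S%%")
    args = parser.parse_args()

    # Connection details are placeholders for your own server/credentials.
    conn = psycopg2.connect(dbname="labdb", user="labuser")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT name, organism, added_on FROM sequences "
            "WHERE name LIKE %s ORDER BY added_on DESC",
            (args.name_pattern,),
        )
        for name, organism, added_on in cur.fetchall():
            print(f"{added_on}  {organism:<30}  {name}")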
I work for a research organization in India and have recently taken up with a program extending immunizations among poor rural communities. They're a fairly large organization but don't really have any IT infrastructure. Data reports on vaccine coverage, logistical questions, meeting attendance etc. come from hundreds of villages, go from pen-and-paper through several iterations of data entry and compilation, finally arriving each month at the central office as HUNDREDS of messy Excel sheets. The organization generally needs nothing more than simple totals and proportions from a large series of indicators, but doctors and high-level professionals are left spending days summing the sheets by hand, introducing lots of error and generally wasting a ton of time. I threw in some formulas and at least automated the process within single sheets, but the compilation and cross-referencing is still an issue.
There's not much to be done at the point of data collection...obviously it would be great to implement some system at the point of entry, but that would involve training hundreds of officials and local health workers; not practical at the moment.
My question: what can be done with the stack of Excel sheets every month so we can analyze them individually and also holistically? Is there any type of management app or simple database we can build to upload and compile the data for easy analysis in R or even (gasp) Excel? What kind of tools could I implement and then pass on to some relative technophobes? Can we house it all online?
I'm by no means a programmer, but I'm an epidemiologist/stats analyst proficient in R and Google products and the general tools of a not-so-tech-averse millennial. I'd be into using this as an opportunity to learn some MySQL or similar, but I need some guidance. Any ideas are appreciated... there has to be a better way!
A step-by-step approach would be to first store the raw data from the Excel sheets and papers in a structured database. Once the data is maintained in a DB, you will have many tools to manipulate it later.
Use any database like MySQL to store the Excel sheets; Excel sheets or CSV files can be imported into a database directly.
Later, with simple database operations, you can manipulate the data; you can use reports, a web application, etc. to display and manage it.
And keep up the good work!
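As a sketch of what that import step can look like, assuming the monthly workbooks are saved out as CSV files that share a common set of columns; the file paths and column names below (village, doses_given) are made up, and pandas is used because it also plays nicely with the kind of analysis you would do in R:

    import glob
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("immunization.db")

    # Append each monthly CSV into one table, recording which file each row came from.
    for path in glob.glob("monthly_reports/*.csv"):
        df = pd.read_csv(path)
        df["source_file"] = path
        df.to_sql("vaccine_reports", conn, if_exists="append", index=False)

    # Once everything is in one table, totals and proportions become single queries,
    # and the same database file can be opened from R (e.g. via RSQLite).
    summary = pd.read_sql_query(
        "SELECT village, SUM(doses_given) AS total_doses "
        "FROM vaccine_reports GROUP BY village",
        conn,
    )
    print(summary)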
It might look like an odd question, but the fact is I currently have an existing spreadsheet with data: not a lot, just hundreds of rows. What I want to do is build a web app that provides a "search" interface for the data (maybe with some more advanced features in the future, like data updates). Thus I have two options:
I can program on top of Excel, i.e. read/write the spreadsheet directly.
Or I can export all the data to a database, say MySQL, and work from that. The data conversion will take some time, but not too much.
What I need to consider is:
Whether it is easy to code, for either solution. I have some experience coding C with MySQL but have never touched Excel programming. If the learning curve is too steep, I would rather spend the days exporting all the data to a database.
Whether it is easy to maintain. Day-to-day work like searches or updates will be done through the web once the app is up. It needs to be stable and easy to expand. More features, like comparing different columns or reformatting some data and emailing it out, could be considered.
Any suggestions? And what language, libraries, or framework should I pick up?
You'll definitely want to use a database. Excel wasn't made for quickly accessing data; it was made for calculations. Trying to build on it will likely cause lots of hassle and lag.
I think you should definitely choose database.
In my experience, programming against Excel is very painful. Tools for accessing a database are usually easy to use.
If you do not have any experience with web programming, I would suggest you use PHP with MySQL. It is the easiest to start with, and if you do not need anything complex, it will do totally fine.
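Whatever language you end up with, the search itself boils down to a parameterized query against the table you import the spreadsheet into. Here is a minimal sketch of that query, written with Python's sqlite3 just to show the shape; the "records" table and its columns are placeholders for whatever your spreadsheet holds, and a PHP/MySQL version would issue the same SQL through a prepared statement:

    import sqlite3

    conn = sqlite3.connect("exported_spreadsheet.db")
    # Placeholder table standing in for the imported spreadsheet.
    conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, email TEXT)")
    conn.execute("INSERT INTO records VALUES ('Jane Smith', 'jane@example.com')")

    def search(term):
        # The database does the filtering; the ? placeholder keeps user input
        # from being pasted into the SQL string (guards against injection).
        return conn.execute(
            "SELECT name, email FROM records WHERE name LIKE ?",
            (f"%{term}%",),
        ).fetchall()

    print(search("smith"))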
I have to develop a database for a unique environment. I don't have experience with database design and could use everybody's wisdom.
My group is designing a database for a piece of physics hardware and a data acquisition system. We need a system that will store all the hardware configuration parameters and track the changes to these parameters as they are changed by the user.
The setup:
We have nearly 200 detectors and roughly 40 parameters associated with each detector. Of these 40 parameters, we expect only a few to change during the course of the experiment. Most parameters associated with a single detector are static.
We collect data for this experiment in timed runs. During these runs, the parameters loaded into the hardware must not change, although we should be able to edit the database at any time to prepare for the next run.
The current plan:
The database will provide the difference between the current parameters and the parameters used during the last run.
At the start of a new run, the most recent database changes will be loaded into the hardware.
The settings used for the upcoming run must be tagged with a run number and the current date and time. This is essential. I need a run-by-run history of the experimental setup.
There will be several different clients that both read and write to the database. Although changes to the database will be infrequent, I cannot guarantee that the changes won't happen concurrently.
The database must be robust and non-corruptible. The configuration of the experimental system depends on the hardware. Any breakdown of the database would prevent data acquisition, and our time is expensive. Database backups?
My current plan is to implement the above requirements using an SQLite database, although I am unsure whether it can support all my requirements. Is there any other technology I should look into? Has anybody done something similar? I am willing to learn any technology, as long as it's mature.
Tips and advice are welcome.
Thank you,
Sean
Update 1:
Database access:
There are three lite applications that can read from and write to the database, and one application that can only read.
The applications with write access are responsible for setting a non-overlapping subset of the hardware parameters. To be specific, we have one application (of which there may be multiple copies) which sets the high voltage, one application which sets the remainder of the hardware parameters which may change during the experiment, and one GUI which sets the remainder of the parameters which are nearly static and are only essential for the proper reconstruction of the data.
The program with read access only is our data analysis software. It needs access to nearly all of the parameters in the database to properly format the incoming data into something we can analyze properly. The number of connections to the database should be >10.
Backups:
Another setup at our lab dumps an XML file every run. Even though I don't think XML is appropriate, I was planning to back up the system every run, just in case.
Some basic things about the design: make sure that you don't delete data from any tables; keep track of the most recent data (probably best with a last-updated datetime); when a value changes, don't delete the old data. When a run is initiated, tag every table used with the run ID (in an extra column); this way you maintain a full historical record of every setting, and can pin down exactly what state was used for a given run.
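Here is a minimal sketch of that append-only, run-tagged idea, using sqlite3 since that is what the question proposes; all the table, column, and function names are illustrative rather than prescriptive:

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("hardware_config.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS parameter_history (
            id          INTEGER PRIMARY KEY,
            detector_id INTEGER NOT NULL,
            name        TEXT    NOT NULL,
            value       REAL    NOT NULL,
            changed_at  TEXT    NOT NULL      -- when the value was entered; rows are never updated
        );
        CREATE TABLE IF NOT EXISTS run_parameter (
            run_id       INTEGER NOT NULL,
            parameter_id INTEGER NOT NULL REFERENCES parameter_history(id),
            PRIMARY KEY (run_id, parameter_id)
        );
    """)

    def set_parameter(detector_id, name, value):
        """Record a new value; older rows are kept, never overwritten or deleted."""
        conn.execute(
            "INSERT INTO parameter_history (detector_id, name, value, changed_at) "
            "VALUES (?, ?, ?, ?)",
            (detector_id, name, value, datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()

    def start_run(run_id):
        """Tag the latest value of every (detector, parameter) pair as this run's state."""
        conn.execute("""
            INSERT INTO run_parameter (run_id, parameter_id)
            SELECT ?, id FROM parameter_history p
            WHERE changed_at = (SELECT MAX(changed_at) FROM parameter_history
                                WHERE detector_id = p.detector_id AND name = p.name)
        """, (run_id,))
        conn.commit()

Because nothing is ever updated or deleted, reproducing the exact configuration used for any past run is just a join against run_parameter.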
Ask around among your colleagues.
You don't say what kind of physics you're doing, or how big the working group is, but in my discipline (particle physics) there is a deep repository of experience putting up and running just this type of system (we call it "slow controls" and similar). There is a pretty good chance that someone you work with has either done this or knows someone who has. There may be a detailed description of the last time out in someone's thesis.
I don't personally have much to do with this, but I do know this: one common feature is a no-delete, no-overwrite design. You can only add data, never remove it. This preserves your chances of figuring out what really happened in case of trouble.
Perhaps I should explain a little more. While this is an important task and has to be done right, it is not really related to physics, so you can't look it up on SPIRES or on arXiv.org. No one writes papers on the design and implementation of medium-sized slow-controls databases. But they do sometimes put it in their dissertations. The easiest way to find a pointer really is to ask a bunch of people around the lab.
This is not a particularly large database by the sounds of things. So you might be able to get away with using Oracle's free database which will give you all kinds of great flexibility with journaling (not sure if that is an actual word) and administration.
Your mention of 'non-corruptible' right after you say "There will be several different clients that both read and write to the database" raises a red flag for me. Are you planning on creating some sort of application that has an interface for this? Or were you planning on direct access to the db via a tool like TOAD?
In order to preserve your data integrity you will need to get really strict about your permissions. I would only allow one person (and a backup) to have admin rights with the ability to do data manipulation outside the GUI (which will make your life easier).
Backups? Yes, absolutely! Not only should you do daily, weekly, and monthly backups, you should do both full and incremental ones. Also, test your backup images often to confirm they actually work.
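If you do end up on SQLite as planned in the question, the Python standard library can script those backups and sanity-check them. A minimal sketch (file names are placeholders; Connection.backup requires Python 3.7+):

    import sqlite3
    from datetime import date

    src = sqlite3.connect("hardware_config.db")
    dst = sqlite3.connect(f"hardware_config_backup_{date.today():%Y%m%d}.db")

    # Makes a consistent copy even while other connections are using the database.
    src.backup(dst)

    # "Test your backup images": open the copy and run SQLite's integrity check.
    status = dst.execute("PRAGMA integrity_check").fetchone()[0]
    assert status == "ok", f"backup failed integrity check: {status}"

    dst.close()
    src.close()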
As for the data structure, I would need much greater detail on what you are trying to store and how you would access it. But from what you have put here I would say you need the following tables (to begin with):
Detectors
Parameters
Detector_Parameters
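As a rough sketch of what those three tables might look like: the columns below are guesses based on the question (about 200 detectors, roughly 40 parameters each, values tied to a run number) rather than a finished design, and sqlite3 is used only because it is what the asker plans to try first:

    import sqlite3

    conn = sqlite3.connect("experiment_config.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS Detectors (
            detector_id INTEGER PRIMARY KEY,
            label       TEXT NOT NULL            -- e.g. position or serial number
        );
        CREATE TABLE IF NOT EXISTS Parameters (
            parameter_id INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,          -- e.g. 'high_voltage'
            units        TEXT
        );
        CREATE TABLE IF NOT EXISTS Detector_Parameters (
            detector_id  INTEGER NOT NULL REFERENCES Detectors(detector_id),
            parameter_id INTEGER NOT NULL REFERENCES Parameters(parameter_id),
            run_number   INTEGER NOT NULL,       -- ties each value to a specific run
            value        REAL NOT NULL,
            recorded_at  TEXT NOT NULL,
            PRIMARY KEY (detector_id, parameter_id, run_number)
        );
    """)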
Some additional notes:
Since you will be making so many changes, I recommend using a version control system like SVN to keep track of all your DDL scripts, etc. I would also recommend using something like Bugzilla for bug tracking (if needed) and Google Docs for team document management.
Hope that helps.