Database efficiency - database

I am about to write a program to keep track of my school assignments and I was wondering what database language would be the most efficient and simple to implement to track the meta-data of the assignments? I am thinking about XML, but it would require several documents.
I (currently) have at least ten assignments per week for 45 weeks. The data that has to be stored includes name, issue date, due date, path, and various states of completion. Whatever language it's in would have to be able to take a large increase in both the number of assignments and the amount of meta-data without having to make large changes in either the format or the retrieval system.

Quite frankly, if you pick a full-fledged database you run the risk of spending more time on data entry than you do on your homework. If you really need to keep track of this, I would seriously recommend a spreadsheet.

First, I think you are confusing a relational database system with a database language. In all likelihood, you will be using a database that uses SQL. From there, you will need another programming platform to build an application around it. If you wanted, you could use a Microsoft Access database, which allows you to build a simple front-end that is stored in the same file as the database. In this case you would be programming with VBA.
Pretty much any modern database system would be suitable for your needs; even Access can handle orders of magnitude more work than you are describing.
Some possible database systems are, again, Microsoft Access, Microsoft SQL Server Express, VistaDB, SQLite (probably the best choice after Access for your needs), and of course there are many others.
You could either build a web front end or a desktop; I assume you are using Windows. You could use Visual Studio C# Express for this if you wanted. Or you could go with VB.NET, VB6, or what have you.

My answer isn't directly related, but as you are designing your database structures you might want to take a look at some of the objects in the SIF specification; in particular, look at the Assignment and GradingAssignment objects.
As for how to store the data, you could use an RDBMS (SQLite, MySQL) or perhaps a key-value database (ZODB, link).
Of course, if this is just a small personal project you could just serialize the data to something like XML, JSON, CSV or whatever and store it as a file. It might be better in the long run to use a database though. A database format will probably scale a lot more easily.

I would recommend Oracle Express (with Application Express). It will scale up to 4 GB of user data. Beyond that, you would have to start paying. Application Express makes it very simple to build CRUD applications, which is what it sounds like yours is.

For a project like that I would use SQLite or MySQL; it'd be fast enough. Plus it's easy to set up.
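To make the SQLite suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and helper functions (open_tracker, add_assignment, due_before) are my own assumptions based on the fields listed in the question, not a prescribed design.

```python
import sqlite3

def open_tracker(path=":memory:"):
    """Open (or create) the assignment database; all names are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS assignments (
               id         INTEGER PRIMARY KEY,
               name       TEXT NOT NULL,
               issue_date TEXT NOT NULL,   -- ISO 8601 dates compare correctly as text
               due_date   TEXT NOT NULL,
               path       TEXT,
               state      TEXT NOT NULL DEFAULT 'not started'
           )"""
    )
    return conn

def add_assignment(conn, name, issue_date, due_date, path=None):
    """Insert one assignment and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO assignments (name, issue_date, due_date, path) "
            "VALUES (?, ?, ?, ?)",
            (name, issue_date, due_date, path),
        )
    return cur.lastrowid

def due_before(conn, date):
    """Unfinished assignments due on or before the given ISO date, soonest first."""
    rows = conn.execute(
        "SELECT name, due_date FROM assignments "
        "WHERE due_date <= ? AND state != 'done' ORDER BY due_date",
        (date,),
    )
    return rows.fetchall()

conn = open_tracker()  # pass a filename instead to keep the data on disk
add_assignment(conn, "Essay draft", "2024-09-02", "2024-09-09", "essays/draft.odt")
add_assignment(conn, "Problem set 1", "2024-09-02", "2024-09-05")
print(due_before(conn, "2024-09-07"))
```

A few hundred rows per year is far below anything SQLite would notice, and the whole database is a single file you can copy or back up.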

Related

Databases in offline software?

I'm primarily a web developer, currently learning C and planning on going into C++ in a year or so when I feel absolutely confident with C (Note: I'm not saying I'll be a master at C, just that I'll understand it in a fair amount of depth and will retain it properly rather than forgetting it when I see a new language).
My question is, how are offline/networked applications written with database functionality? I've built many-a database driven website in PHP and MySQL and would like to know how to use databases with my C projects - a lot of the applications I have the desire to write rely more on content management rather than processing data as such. What database formats are available to me? What should I be looking at to build a simple contact database for example?
Thanks in advance.
I'd suggest SQLite for file-based database. Mongo is pretty awesome too if you run it locally but it is still networked.
For a small application SQLite might be a good option for you - it is part of your application and not dependent on other software, but as a database it is fairly limited (no stored procedures afaik, though it does support triggers).
If you are looking for something more substantial (especially when it involves multiple users) you should be looking at MySQL or SQL Server. These can be accessed directly through their respective APIs or via some kind of common mediator such as ODBC.
Your question is really very open. Much application software depends on relational database technology at some level, but the OS and the required task usually dictate the best choices.
Going the SQL route with offline applications in C is not straightforward. While database storage brings advantages (in terms of reliability, for example), it adds conversion steps to every save/load of your data, simply by using SQL.
The question is why would you want to create SQL commands as character strings to load/save data that is treated as binary in your program, and that you can store as binary directly in your system's local storage? It costs!
On the other hand, if you already know SQL well, then you'll only have to learn one of the several APIs for accessing a database (SQLite, MySQL ...) from C to get started.

Fooling Around With Databases, Online Development

I'm trying to design a database for a small project I'm working on. Eventually, I'd like to make it a web-app, but right now I don't mind just experimenting with data offline. However, I'm stuck in a crossroads.
The basic concept would be a user inputs values for 10 fields, to be compared against what is in the database, with each item having a weighted value. I know that if I were to code it, I could use look-up tables for each field, add up the values, and display the result to the end-user.
Another example would be having to get the distance between two points, each point stored in a row, with the X value getting its own column as well as the Y value.
Now, if I store data within a database, should I try to do everything within queries (which I think would involve temporary tables among other things), or just use simple queries, and manipulate the rows returned within the application code?
Right now, I'm thinking to go for the latter (manipulate data within the app) and just use queries to reduce the amount of data that I would have to go sort through. What would you guys suggest?
EDIT: Right now I'm using Microsoft Access to get the basics down pat and try to get a good design going. IIRC with my experience with Oracle and MySQL you can run commands together in a batch process and return just one result. But not sure if you can do that with Access.
If you're using a database I would strongly suggest using SQL to do all your manipulation. SQL is far more capable and powerful for this kind of job as compared to imperative programming languages.
Of course it does imply that you're comfortable thinking about data as "sets" and programming in a declarative style. But spending time now to get really comfortable with SQL and manipulating data using SQL will pay off big time in the long run, not only for this project but for projects in the future. I would also suggest using stored procedures over queries in code, because stored procedures provide a beautiful abstraction layer that allows your table design to change over time without impacting the rest of the system.
A very big part of using and working with databases is understanding data modeling, normalization and the like. Like everything else it will be an effort, but in the long run it will pay off.
May I ask why you're using Access when you have a far better database available to you such as MSSQL Express? The migration path from MSSQL Express to MSSQL or SQL Azure even is quite seamless and everything you do and experience today (in this project) completely translates to MSSQL Server/SQL Azure for future projects as well as if this project grows beyond your expectations.
I don't understand your last statement about running a batch process and getting just one result, but if you can do it in Oracle and MySQL then you can do it in MSSQL Express as well.
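As a rough illustration of doing the weighted look-up from the question inside the database rather than in application code, here is a sketch using Python's sqlite3 module. The weights table, its sample rows, and the score helper are all hypothetical; the point is only that one JOIN plus SUM replaces the per-field look-up loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (field, value) pair with its weight; contents are made up.
    CREATE TABLE weights (field TEXT, value TEXT, weight REAL,
                          PRIMARY KEY (field, value));
    INSERT INTO weights VALUES
        ('color', 'red', 2.0), ('color', 'blue', 1.0),
        ('size', 'large', 3.0), ('size', 'small', 0.5);
""")

def score(conn, answers):
    """Sum the weights of the submitted {field: value} answers in one query."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS input (field TEXT, value TEXT)")
    conn.execute("DELETE FROM input")
    conn.executemany("INSERT INTO input VALUES (?, ?)", answers.items())
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(w.weight), 0) "
        "FROM input AS i JOIN weights AS w "
        "ON w.field = i.field AND w.value = i.value"
    ).fetchone()
    return total

print(score(conn, {"color": "red", "size": "large"}))
```

The temporary table is one way to hand the user's input to the query; with only ten fields, passing the values as parameters to a single statement would work just as well.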
What Shiv said, and also...
A good DBMS has quite a bit of solid engineering in it. There are two components that are especially carefully engineered, namely the query optimizer and the transaction controller. If you adopt the view of using the DBMS as just a stupid table retrieval tool, you will most likely end up inventing your own optimizer and transaction controller inside the application. You won't need the transaction controller until you move to an environment that supports multiple concurrent users.
Unless your engineering talents are extraordinary, you will probably end up with a home brew data management system that is not as good as the one in a good DBMS.
The learning curve for SQL is steep. You need to learn how to phrase queries that join, project, and restrict data from multiple tables. You need to learn how to handle updates in the context of a transaction.
You need to learn simple and sound table design and index design. This includes, but is not limited to, data normalization and data modeling. And you need a DBMS with a good optimizer and good transaction control.
The learning curve is steep. But the view from the top is worth the climb.
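To show what "updates in the context of a transaction" buys you, here is a small sketch with Python's sqlite3 module. The accounts table and the transfer and balances helpers are invented for illustration; any DBMS with transactions behaves similarly, though the API differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount between accounts; both updates apply or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False

def balances(conn):
    """Current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails after bob has already been credited, and the rollback removes that credit. Getting the same guarantee with hand-written file updates is exactly the home-brew transaction controller mentioned above.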

Nonrelational Databases for C++

I was thinking of starting a project that very clearly needs a persistent store. I was about to reluctantly decide on a RDBMS, when I came across an article which briefly mentions CouchDB. Seems some advancements in DB technology have happened since I last looked, so I thought I would ask here about databases before I got into it.
Here are my criteria. ( I list the criteria again at the end, so if you want to skip the explanations just scroll down. )
The project is open source and I will not be asking anything for it, so preferably the database is open source and free. Furthermore the software has to run on both Linux and Windows.
There are parts of the project that have to be in C++. The project is not large enough code wise to justify using a second language. So basically the whole thing will be C++.
This project will not have anything to do with the web, so preferably the database will not require the detritus of a web library.
The objects I want to store fall into one of two categories: a basic object and a container object. The difference being objects which are containers will contain even more objects, ie: a parts of parts problem. I need a database that can handle such cases cleanly and efficiently.
I also expect the schema to evolve rapidly, at least initially. I also suspect that some of the old data simply will not fit into the new schemas. So I would like to keep different versions of the schema around. When possible, I would like to be able to transform data in one schema into another schema.
For the application to work the way intended, people would have to exchange large chunks of database with each other. So I would want simple ways of importing and exporting data, which I could automate to some degree.
Finally it would be nice if the database could in someway be simulated in unit tests.
Those are my requirements. I have replicated them below to make it easier for people answering.
Thank you
Non Technical requirements
1. Open source preferably free.
2. Run on Windows and Linux
Has a C++ interface.
Is able to handle a non-web application, preferably without REST.
Can handle a "parts of parts" problem fairly well.
Can handle multiple indexes.
Has some sort of concept of schema version, can handle multiple schema versions, and can migrate tables from one schema to another.
Should have a simple mechanism for moving data from one instance of the database to another.
Preferably has some mechanism for testing.
HDF5 is a binary format which behaves like a hierarchical database. It has bindings and libraries for C++ and Python (I only use the latter) and it is used to store large amounts of data, like those produced in certain physics and astronomy experiments.
http://www.hdfgroup.org/HDF5/
I've looked at a few NoSQL databases some time ago (I had a different requirement than you though - I needed it to be a standalone server). The ones that I remember as particularly interesting are Redis and Kyoto Cabinet. Have a look.
BTW, you don't mention any performance requirement. If there is none, have you considered SQLite? Simple, embedded, stable, and with the flexibility of SQL after all. With prepared statements the performance penalty of SQL should not be very high.
EDIT: ooops, just noticed that you asked this more than a year ago... Well, perhaps you can tell us what you've chosen :)
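For what it's worth, the "parts of parts" requirement can be handled in SQLite with a self-referencing table and a recursive query. The sketch below uses Python's sqlite3 module only because it is short to show; the same SQL works through the C API from C++. The parts table, its sample rows, and the descendants helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES parts(id)  -- NULL for a top-level part
    );
    CREATE INDEX parts_by_parent ON parts(parent_id);
    INSERT INTO parts VALUES
        (1, 'bicycle', NULL),
        (2, 'wheel', 1),
        (3, 'frame', 1),
        (4, 'spoke', 2),
        (5, 'tire', 2);
""")

def descendants(conn, root_id):
    """Names of every part nested under root_id, at any depth."""
    rows = conn.execute(
        """WITH RECURSIVE subtree(id, name) AS (
               SELECT id, name FROM parts WHERE parent_id = ?
               UNION ALL
               SELECT p.id, p.name FROM parts AS p
               JOIN subtree AS s ON p.parent_id = s.id
           )
           SELECT name FROM subtree ORDER BY name""",
        (root_id,),
    )
    return [name for (name,) in rows]

print(descendants(conn, 1))
```

Using ":memory:" as the database name also gives you a throwaway database per test, which covers the unit-testing wish in the question reasonably well.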

Databases versus plain text

When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Isolation, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do, we start missing the SQL capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
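One way to leave that door open is to hide storage behind a small interface, so a flat file can be swapped for a database later. The following is only a sketch under my own assumptions: RecordStore, JsonFileStore, SqliteStore, and remember_sequence are hypothetical names, and the record layout is invented.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional, Protocol

class RecordStore(Protocol):
    """The only storage interface the rest of the application sees."""
    def put(self, key: str, record: dict) -> None: ...
    def get(self, key: str) -> Optional[dict]: ...

class JsonFileStore:
    """Flat-file backend: the whole data set lives in one JSON document."""
    def __init__(self, path):
        self.path = path

    def _load(self):
        try:
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def put(self, key, record):
        data = self._load()
        data[key] = record
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load().get(key)

class SqliteStore:
    """Database backend with the same two methods."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, key, record):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?)",
                (key, json.dumps(record)),
            )

    def get(self, key):
        row = self.conn.execute(
            "SELECT body FROM records WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

def remember_sequence(store: RecordStore, name: str, length: int) -> None:
    """Application code: it neither knows nor cares which backend it was given."""
    store.put(name, {"length": length})

with tempfile.TemporaryDirectory() as tmp:
    for store in (JsonFileStore(os.path.join(tmp, "records.json")), SqliteStore()):
        remember_sequence(store, "seq1", 1200)
        print(type(store).__name__, store.get("seq1"))
```

The JSON backend rewrites the whole file on every put, which is fine for a prototype and is precisely the kind of cost that makes the database backend attractive once the data grows.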
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is mostly appends (INSERTs?), sequential or infrequent access, and little or no concurrency, files are the way to go.
On the other hand, when you have requirements for concurrency, non-sequential reading/writing, atomicity, or permissions, or when your data is relational by nature, you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C, and highly ubiquitous (if it isn't already included in your programming language, as it is in Python, there is surely a binding available). It can be useful even on db files as big as 1GB, possibly more.
If your requirements were bigger, there wouldn't even be a discussion: go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often writing one-shot applications (often scripts that define a workflow of calculations) that answer a specific question, and you are not likely to reuse these applications after you have answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline Pilot, or Snaplogic), as they allow you to query data in a variety of formats and locations (RDBMS, flat files, web services), run algorithms, and build powerful web apps that let you easily publish your workflows to your users and let them interact at specific points.
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested in this: Blast on DB. The guy working on it is a friend of mine who works on fast sequence similarity search; he found that building his own binary storage worked better than using databases at this point.
I don't know the specific details of his solution, but you can probably exchange an idea or two by mailing him, or even share code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
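To illustrate the ALTER TABLE point, here is a small sketch with Python's sqlite3 module. The runs table, its columns, and the column_names helper are made up for the example; other databases use the same statement with minor syntax differences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, sample TEXT NOT NULL)")
conn.execute("INSERT INTO runs (sample) VALUES ('s1')")

# A new field shows up mid-project: add it without touching existing rows.
conn.execute("ALTER TABLE runs ADD COLUMN score REAL DEFAULT 0.0")
conn.execute("INSERT INTO runs (sample, score) VALUES ('s2', 0.87)")

def column_names(conn, table):
    """List a table's columns via SQLite's table_info pragma."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

rows = conn.execute("SELECT sample, score FROM runs ORDER BY id").fetchall()
print(column_names(conn, "runs"))
print(rows)
```

The row inserted before the change simply reports the default for the new column, whereas a flat-file format would need every reader and writer updated at the same time.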
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with that also scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have gone the flat file route, which means there are well defined file formats for most types of data in the field. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.

File database suggestion with support for multiple concurrent users

I need a database that could be stored on a network drive and would allow multiple users (up to 20) to use it without any server software.
I'm considering MS Access or Berkeley DB.
Can you share your experience with file databases?
Which one did you use, did you have any problems with it?
I really don't think that file-based databases can scale past half a dozen users. The last time I had an Access database (admittedly this was quite a while ago) I had to work really hard to get it to work for 8-9 people.
It is really much easier to install Ubuntu on an old junk computer with PostgreSQL or MySQL. That's what I had to do even when I kept my Access front-end.
I would suggest SQLite because the entire database is stored in a single file, and it quite safely handles multiple users accessing it at the same time. There are several different libraries that you can use for your client application and there is no server software needed.
One of its strengths is that it mimics SQL servers so closely that if you need to convert from using a database file to a full-fledged SQL server, most of the queries in your client won't need to change. You'll just need to migrate the data over to the new server database (I wouldn't be surprised if there are programs to convert SQLite databases to MySQL databases, for example).
Beware of any file based database, they are all likely to have the same problems. Your situation really calls for a Client/Server solution.
From SQLite FAQ
A good rule of thumb is that you should avoid using SQLite in situations where the same database will be accessed simultaneously from many computers over a network filesystem.
http://www.sqlite.org/whentouse.html
Access can be a bitch. I've been in the position where I had to go around and tell 20-50 people to close Access so I could go to "design mode" to change the design of the forms and maybe a column. No fun at all. (Old Access, and it might just have been a bad setup.)
Ayende was recently trying to make a similar decision, and tried a bunch of so-called embedded databases. Hopefully his observations can help you.
I have been using Access for some time and in a variety of situations, including on-line. I have found that Access works well if it is properly set up according to the guidelines. One advantage of Access is that it includes everything in one package: Forms, Query Building, Reports, Database Management, and VBA. In addition, it works well with all other Office applications. The Access 2007 runtime can be obtained free from here, which makes distribution less expensive. Access is certainly unsuitable for large operations, but it should be quite suitable for twenty users. EDIT: Microsoft puts the number of concurrent users at 255.
Can Access be set up to support 10-20 users? Yes. However, it, like all file-based databases, uses the file system for locking and concurrency control. And Access data files are more susceptible to database corruption than database servers are. And, while you can set it up for this, you MUST, as David Fenton mentions above, follow best practices if you want to end up with a reliable system.
Personally, I find that, given the hoops that you need to jump through to ensure that an Access solution is reasonably trouble-free, it is much less trouble to implement an instance of MSDE/SQL Server Express, or postgreSql.
Berkeley DB supports a high degree of concurrency (far more than 20), but it does so primarily by utilizing shared memory and mutexes (possibly even replication) - facilities that do not work well when BDB is deployed as a file stored on a network drive.
In order to take advantage of BDB's concurrency capabilities you will have to build an application around it.
The original question makes no sense to me, in that the options don't belong together. BerkeleyDB is a database engine only, while Access is an application development tool that ships with a default file-based (i.e., non-server) database engine (Jet). By virtue of putting Access with Berkeley, it seems obvious that what is needed is only a database engine, and no application at all, but how end users use Berkeley DB without a front end, I don't know (I've only used it from the command line).
Those who cannot run a Jet MDB with 20 simultaneous users are simply not competent to be giving advice on using Jet as a data store. It is completely doable as long as best practices are followed. I would recommend in addition to Microsoft's Best Practices web page, Tony Toews's Best Practices, and Tony's Corruption FAQ (i.e., things you want to avoid doing in order to have a stable application).
I strongly doubt that the original questioner is building no front end application, but since he doesn't indicate what kind of front end is involved, it's hard to recommend a back end that will go with it. Access has the advantage of giving you both parts of the equation, and when used properly, is perfectly reliable for multiple users.

Resources