I would like to store data persistently for my application, but I don't really need a full-blown relational database. I could get by with basic "cache"-like persistent storage where the structure is just (key, value) pairs.
In lieu of a database what are my best, scalable options?
There's always SQLite, a database that's stored in a single file. SQLite handles locking internally, so you don't have to manage file locks yourself, and it's really fast for reads.
If, however, you are doing lots of database changes, it's best to do them all at once inside a transaction. This writes the changes to the file once, as opposed to every time a change query is issued, which dramatically speeds up multiple changes.
When a change query is issued, whether it's inside a transaction or not, the whole database is locked until that query finishes. This means that extremely large transactions could adversely affect the performance of other processes, because they must wait for the transaction to finish before they can access the database. In practice, I haven't found this to be that noticeable, but it's always good practice to minimize the number of database-modifying queries you issue.
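For illustration, here is a minimal sketch of the batching idea described above, using Python's built-in sqlite3 module (the file name, table, and rows are made up for the example):

```python
import sqlite3

# Open (or create) a database file; the name is just an example.
conn = sqlite3.connect("app_data.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

rows = [("user:1", "alice"), ("user:2", "bob"), ("user:3", "carol")]

# Group all the INSERTs into a single transaction so the file is written
# once, instead of once per statement.
with conn:  # commits on success, rolls back on an exception
    conn.executemany(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", rows
    )

print(conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0])
conn.close()
```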
If you want a "persistent cache" and already use memcached, check out MemcacheDB. It's a persistent hash table that speaks the memcached protocol, so you don't need a new client (just a new daemon).
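A rough sketch of what that looks like from Python, assuming the python-memcached client and a MemcacheDB daemon already running locally (the host and port below are assumptions; use whatever your daemon listens on):

```python
import memcache  # the ordinary memcached client works unchanged

# MemcacheDB speaks the memcached protocol; only the daemon differs.
# The address below is an assumption; point it at your own instance.
mc = memcache.Client(["127.0.0.1:21201"])

mc.set("user:42", {"name": "alice", "visits": 7})  # persisted by the daemon
print(mc.get("user:42"))
```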
I asked a similar question recently. Here are some choices:
Berkeley DB is groovy (also see the Berkeley DB Wikipedia entry).
If you are on Windows, then you can use the built in Extensible Storage Engine. This used to be called "Jet Blue".
Microsoft SQL Server Compact Edition is also free (but not built into the OS).
If you need scalability, then an RDBMS is your best bet. At the most basic level, you can serialize data structures into files; however, you'd then need to account for file locking issues, which would limit concurrency.
SQLite is an SQL file-based database engine, which can run without a persistent database daemon (in PHP for example, it runs as an extension), however it too has concurrency issues (read the answers to this question which could help you decide if SQLite is right for you).
Unless you have a really good reason not to use a real RDBMS, I'd suggest you stick with MySQL or another "full-blown" engine.
If you want something really scalable, I wouldn't opt for a flat or XML file. As your data grows, it could kill your performance.
If you will have a lot of data at some stage, I would still opt for a database; I would take a look at something like SQLite with a very simple schema to suit your needs.
I'm really not sure you should, but have you considered just storing the info in an XML doc if it really is that light? And if it's not, have you considered SQLite?
If you're writing a Java program that wants an embedded database, look at HSQLDB: it is written in Java and works much better than SQLite when called from Java programs.
If you are writing Java, then there are Java database implementations (Jared mentions HSQLDB; there are others) that you can include directly.
SQLite is fine for static inclusion; you can also embed MySQL within your application if you're using a compatible language such as C.
I think you would appreciate having SQL available as well. XML files just don't cut it any longer; maybe they did a couple of years ago when writing PDA software, but even the iPhone and Android include SQLite now.
For key=value pairs, you can use the INI file format, with simple load and save procedures that read it into an in-memory hash table and write it back out.
That can later scale up to anything, just by changing the load and save procedures to work with a DB.
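A minimal sketch of that idea in Python, using the standard configparser module as the INI reader/writer (the file name and section name are arbitrary):

```python
import configparser

INI_PATH = "settings.ini"   # arbitrary file name for the example
SECTION = "data"

def load() -> dict:
    """Read the INI file into a plain in-memory dict of key/value pairs."""
    parser = configparser.ConfigParser()
    parser.read(INI_PATH)
    return dict(parser[SECTION]) if parser.has_section(SECTION) else {}

def save(table: dict) -> None:
    """Write the in-memory dict back out as an INI file."""
    parser = configparser.ConfigParser()
    parser[SECTION] = {k: str(v) for k, v in table.items()}
    with open(INI_PATH, "w") as fh:
        parser.write(fh)

data = load()
data["last_user"] = "alice"
save(data)
```

Scaling up later means swapping the bodies of load() and save() for database calls while the rest of the code keeps talking to a plain dict.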
You could try CouchDB, a very flexible document-oriented database that doesn't force you to define a schema up front. It is written in Erlang and, thanks to this, is considered a very scalable solution. It can be easily queried through a REST interface.
HSQLDB in in-memory mode will give you much better performance than flat-file-based databases. It's pretty easy to use, and if a table gets too big there is an option to cache it on disk. A performance comparison chart is available on hsqldb.org.
Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users; the database will be almost read-only (if things get written, it will be by a controlled automatic process), and downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases. I am sure they are reliable, but the images would probably only be accessible through a database query; is that true? In that case I am worried that future users of the data might be faced with the problem of learning how to query the database before actually getting anything done.
Question
What database could/should I use?
Storing big fields that are not used in queries outside the main "lookup table" is recommended practice for some database systems, so it does not seem unusual to keep the 30M images in the file system.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: whatever you can do in the DB, do it in the DB. You won't get anywhere near the same performance if you pull all the data out of the database and do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information out of it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large Blobs in the DB (IMHO). I would also note that the filesystem performance will be better if you have many folders and subfolders rather than 30M images in one big folder (citation needed)
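A small sketch of the "filesystem plus metadata" idea: derive a two-level subdirectory from a hash of the image id so no single folder holds millions of files, and keep only the resulting relative path in the database. The root directory, fan-out, and file extension below are assumptions for the example:

```python
import hashlib
from pathlib import Path

IMAGE_ROOT = Path("/data/images")   # assumed mount point

def path_for(image_id: str) -> Path:
    """Map an image id to a stable path like /data/images/ab/cd/<id>.jpg."""
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return IMAGE_ROOT / digest[:2] / digest[2:4] / f"{image_id}.jpg"

def store(image_id: str, data: bytes) -> str:
    """Write the image bytes and return the relative path to keep in the DB."""
    target = path_for(image_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return str(target.relative_to(IMAGE_ROOT))
```

With a two-by-two hex fan-out that is 65,536 leaf directories, roughly 450 files each for 30M images.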
I'm creating a website using Django. It's just about done, but it hasn't gone live yet. I'm trying to decide whether SQLite is good enough for the site or if it would be worthwhile to use PostgreSQL now, at the beginning, rather than risk needing to transition to it later. (In this post, I mention PostgreSQL because that's the other contender for me; I'm sure a similar analysis could be made with MySQL or Oracle.)
I could use some input from people about how they decide upon what database to use with their django projects.
Here's what I currently understand about this:
From my experience, SQLite is super easy. I don't need to worry about installing some other dependency for it and it pretty much just works out of the box with django.
From my research online, it seems that SQLite is able to handle quite a bit of load before becoming a performance bottleneck.
Here's what I don't know:
What would be involved with me transitioning to PostgreSQL from SQLite? Again, I'm currently in a development-only phase and therefore would not need to transition any database data from SQLite. Is it pretty much just a matter of installing PostgreSQL on the server and then tweaking the settings.py file to use it? I doubt it, but would any of my django code need to change? (I don't have any raw SQL queries - my database access is limited to django's model API.)
From a performance perspective, is PostgreSQL simply better in every way over SQLite? Or does SQLite have certain advantages over PostgreSQL?
Performance aside, does using PostgreSQL offer other deployment benefits over SQLite?
Essentially I'm thinking that SQLite is good enough for my little site. What are the odds of it becoming really popular? Probably not that great. SQLite is working for me now and would require no changes on my end. However, I'm concerned that maybe using PostgreSQL from the beginning would be the easier path, and that I'll kick myself a year from now for not making the transition. I'm torn, though: if I go to PostgreSQL, perhaps it will be unnecessary hassle on my part with no benefit.
Does anyone have general guidelines for deciding between SQLite and something else?
Thanks!
Here are a few things to consider.
SQLite does not allow concurrent writes. If an insert or an update is issued, the entire database is locked, and even readers are not allowed during the short moment of the actual update. If your application is going to have many users that update its state (posting comments, adding likes, etc.), this is going to become a bottleneck. Unpleasant slowdowns will occur from time to time, even with a relatively small number of users.
SQLite does not allow several processes to efficiently access a database. There can only be one writing process, even if you have multiple CPUs, and even then the locking mechanism is very inefficient. To ensure data integrity you'd need to jump through many hoops, and every update will be horribly slow. Postgres can reorder locks optimally, lock tables at row level or even update without locking, so it will run circles around SQLite performance-wise, unless your database is strictly read-only.
SQLite does not allow data partitioning or even putting different tables into different tablespaces; everything lives in one file. If you have a "hot" table that is touched very often (e.g. sessions, authorization, stats), you cannot tweak its parameters, put it on an SSD, etc. You can use a separate database for that, though, if relational integrity is not crucial.
SQLite does not have replication or failover features. If your app's downtime is going to cost you money, you'd better have a hot backup database server, ready to take over if the main server goes down. With Postgres, this is relatively painless; with SQLite, hardly.
SQLite does not have online backup and point-in-time recovery capabilities. If data you receive from users is going to cost you money (e.g. merchant orders or user data under an SLA), you had better back up your data regularly, even continuously. Postgres, of course, can do this; SQLite cannot.
In short: when your site stops being a toy, you should have switched already. You should have switched some time before your first serious load spike, to iron out any obvious problems.
Fortunately, the Django ORM makes switching very easy on the Python side: you mostly change the connection settings in settings.py. On the actual database side you'll have to do a bit more: profile your most important queries, tweak certain column types and indexes, etc. Unless you know how to tune Postgres yourself, seek the help of people who do; databases have many non-obvious subtleties that influence performance significantly. Deploying Postgres is definitely more tricky than SQLite (though not really hard), but the result is far more functional when it comes to operation and maintenance under load.
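For illustration only, here is a sketch of the settings.py side of the switch, assuming a reasonably recent Django and the psycopg2 driver; every name and credential below is a placeholder:

```python
# settings.py: switching the default database from SQLite to PostgreSQL.
# All names and credentials below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "mysite",
        "USER": "mysite_user",
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

# The SQLite version would have looked something like:
# DATABASES = {
#     "default": {
#         "ENGINE": "django.db.backends.sqlite3",
#         "NAME": "/path/to/db.sqlite3",
#     }
# }
```

After that, recreating the schema is a matter of running Django's usual table-creation command (syncdb on older versions, migrate on newer ones), and code that only uses the model API should not need to change.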
I'm primarily a web developer, currently learning C and planning on going into C++ in a year or so when I feel absolutely confident with C (Note: I'm not saying I'll be a master at C, just that I'll understand it in a fair amount of depth and will retain it properly rather than forgetting it when I see a new language).
My question is, how are offline/networked applications written with database functionality? I've built many a database-driven website in PHP and MySQL and would like to know how to use databases in my C projects; a lot of the applications I want to write rely more on content management than on processing data as such. What database formats are available to me? What should I be looking at to build a simple contact database, for example?
Thanks in advance.
I'd suggest SQLite for a file-based database. Mongo is pretty awesome too if you run it locally, but it is still networked.
For a small application, SQLite might be a good option for you: it is part of your application and not dependent on other software, but as a database it is fairly weak (no stored procedures, AFAIK).
If you are looking for something more substantial (especially when it involves multiple users), you should be looking at MySQL or SQL Server. These can be accessed directly through their respective APIs or via some kind of common mediator such as ODBC.
Your question is really very open; much application software depends on relational database technology at some level, but the OS and the required task usually dictate the best choices.
Going the SQL route with offline applications in C is not straightforward. While database storage brings advantages, in terms of reliability for example, it adds conversion steps during the save/load of your data simply by using SQL.
The question is why you would want to build SQL commands as character strings to load/save data that is treated as binary in your program, and that you could store as binary directly in local storage. It costs!
On the other hand, if you already know SQL well, then to get started you'll only have to learn one of the several APIs for accessing a database (SQLite, MySQL, ...) from C.
I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images, and so on. At the moment, these records are in a 25 GB flat file.
I'm interested in generating various statistics from this file, such as:
number of IP addresses represented per country
average response time per country
number of images v response time
etc etc.
My question is, how would you achieve this type and scale of processing, and what platform and tools would you use (in a reasonable time)?
I am open to all suggestions, from MS SQL on Windows to Ruby on Solaris :-) Bonus points for DRY (don't repeat yourself); I'd prefer not to write a new program each time a different cut is required.
Any comments on what works, and what's to be avoided, would be greatly appreciated.
Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.
Step 2: use SQL queries to determine the values of interest.
You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable. It should save you replicating the rest of the work.
Edited:
Note that although you probably can do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
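As a concrete, heavily simplified sketch of those two steps, assuming the flat file is tab-delimited with the columns in the order the question lists them (that ordering, the file name, and the use of SQLite rather than a server DBMS are all assumptions made just for the example):

```python
import csv
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sites (
        url TEXT, ip TEXT, title TEXT, country TEXT,
        server TEXT, response_ms INTEGER, image_count INTEGER
    )
""")

# Step 1: bulk-load the delimited file, streaming it row by row rather than
# reading 25 GB into memory. Assumes every line has exactly these 7 fields.
with open("urls.tsv", newline="", encoding="utf-8") as fh, conn:
    reader = csv.reader(fh, delimiter="\t")
    conn.executemany("INSERT INTO sites VALUES (?, ?, ?, ?, ?, ?, ?)", reader)

# Index the columns the questions group/filter on, after the load finishes.
conn.execute("CREATE INDEX IF NOT EXISTS idx_sites_country ON sites (country)")

# Step 2: each statistic is just a query.
for country, ips, avg_ms in conn.execute("""
        SELECT country, COUNT(DISTINCT ip), AVG(response_ms)
        FROM sites GROUP BY country
        """):
    print(country, ips, round(avg_ms, 1))
```

Any mainstream RDBMS would follow the same load-index-query pattern; only the driver and connection code would change.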
Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
If you happen to use Windows, take a look at Log Parser. It can be found as a standalone download and is also included as part of the IIS Resource Kit.
Log Parser can read your logs and upload them to the Database.
Database Considerations:
For your database server you will want something that is fast (Microsoft SQL Server, IBM's DB2, PostgreSQL, or Oracle). MySQL might be useful too, but I have no experience with large databases on it.
You will want all the memory you can afford. If you will be using the database with regularity, I'd say 4 GB at least. It can be done with less, but you WILL notice a big difference in performance.
Also, go for multi-core/multi-CPU servers if you can afford it and, again, if you will be using this database with regularity.
Another recommendation is to analyze the kind of queries you will be doing and plan the indexes accordingly. Remember: every index you create will require additional storage space.
Of course, turn off indexing or even drop the indexes before massive data load operations; that will make the load much faster. Re-index or re-create the indexes after the data load operation.
Now, if this database will be an ongoing operation (i.e. not just something to investigate/analyze and then discard), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.
P.S.
I will take the risk of including something obvious here, but...
I think you may be interested in a Log Analyzer. These are computer programs that generate statistics from Web Server log files (some can analyze also ftp, sftp and mail server log files).
Web log analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair amount of variety in depth of analysis and options; some are very customizable and some are not. You will find both commercial products and open source ones.
For the amount of data you will be managing, double-check each candidate product and take a closer look at its speed and ability to handle it.
One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about what sort of fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
25GB of flat file. I don't think writing any component on your own to read this file will be a good idea.
I would suggest that you should go for SQL import and take all the data to SQL Server. I agree that it would take ages to get this data in SQL Server, but once it is there you can do any thing you want with this data.
Hopefully, once you put this data in a DB, you will receive only deltas of information after that, not another 25 GB flat file.
You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but presume that your flat file is formatted in some delimited way and a db import is a relatively simple task. If that is not the case then you first have the daunting task of decompiling the data cleanly into a set of fields on which you can do your analysis.
I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.
If the data have a regular structure, even without nice clean field delimiters you may be able to turn an ETL tool onto the job, such as Informatica. Since you are a techy and this is a one-off job, you should definitely consider writing some code of your own which does some regex comparisons for extraction of the parts that you want and spits out a file which you can then load into a database. Either way you are going to have to invest some significant effort in parsing and cleansing your data, so don't think of this as an easy task.
If you do write your own code then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).
Either way, you are going to have a pretty big job making sure that the results of any process you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. At that scale, it can be hard to detect a fault like that.
Once you have parsed it then I think that an RDBMS is the right choice to store and analyse your data.
Is this a one-time thing, or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer; I've heard great things about Log Parser. Also check out http://awstats.sourceforge.net/, a full-fledged web stats application.
SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.
If you have more than one computer at your disposal, this is a perfect job for MapReduce.
Sounds like a job for Perl to me. Just keep counts of the stats you want, and use a regex to parse each line. It would probably take less than 10 minutes to parse a file of that size; my computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.
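The same single-pass, keep-running-counts approach sketched in Python for reference (it uses csv for field splitting instead of a regex, and the file name, delimiter, and column positions are assumptions made for the example):

```python
import csv
from collections import defaultdict

ips_per_country = defaultdict(set)
time_sum = defaultdict(float)
time_count = defaultdict(int)

# Stream the file one line at a time; never hold 25 GB in memory.
with open("urls.tsv", newline="", encoding="utf-8") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        url, ip, title, country, server, response_ms, images = row[:7]
        ips_per_country[country].add(ip)
        time_sum[country] += float(response_ms)
        time_count[country] += 1

# Example cuts: distinct IPs and average response time per country.
for country in sorted(ips_per_country):
    print(country,
          len(ips_per_country[country]),
          round(time_sum[country] / time_count[country], 1))
```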
I have been developing web/desktop applications for about 6 years now. During the course of my career, I have come across applications that were heavily written in the database using stored procedures, whereas a lot of applications had only a few basic stored procedures (to read, insert, edit, and delete entity records) for each entity.
I have seen people argue that if you have paid for an enterprise database, you should use its features extensively, whereas a lot of "object-oriented architects" have told me it's an absolute crime to put anything more than necessary in the database and that you should be able to drive the application using the methods on your classes.
Where do you think the balance lies?
Thanks,
Krunal
I think it's a business logic vs. data logic thing. If there is logic that ensures the consistency of your data, put it in a stored procedure. Same for convenience functions for data retrieval/update.
Everything else should go into the code.
A friend of mine is developing a host of stored procedures for data analysis algorithms in bioinformatics. I think his approach is quite interesting, but not the right way in the long run. My main objections are maintainability and a lack of adaptability.
I'm in the object oriented architects camp. It's not necessarily a crime to put code in the database, as long as you understand the caveats that go along with that. Here are some:
It's not debuggable
It's not subject to source control
Permissions on your two sets of code will be different
It will make it more difficult to track where an error in the data came from if you're accessing info in the database from both places
Anything that relates to Referential Integrity or Consistency should be in the database as a bare minimum. If it's in your application and someone wants to write an application against the database they are going to have to duplicate your code in their code to ensure that the data remains consistent.
PL/SQL for Oracle is a pretty good language for accessing the database, and it can also give performance improvements. Your application can also be much 'neater', since it can treat the database stored procedures as a 'black box'.
The sprocs themselves can also be tuned and modified without you having to go near your compiled application, this is also useful if the supplier of your application has gone out of business or is unavailable.
I'm not advocating that 'everything' should be in the database, far from it. Treat each case separately and logically, and you will see which makes more sense: put it in the app or put it in the database.
I'm coming from almost the same background and have heard the same arguments. I do understand that there are very valid reasons to put logic into the database. However, it depends on the type of application and the way it handles data which approach you should choose.
In my experience, a typical data entry app like some customer (or xyz) management will massively benefit from using an ORM layer as there are not so many different views at the data and you can reduce the boilerplate CRUD code to a minimum.
On the other hand, assume you have an application with a lot of concurrency and calculations that span a lot of tables and that has a fine-grained column-level security concept with locking and so on, you're probably better off doing stuff like that directly in the database.
As mentioned before, it also depends on the variety of views you anticipate for your data. If there are many different combinations of columns and tables that need to be presented to the user, you may also be better off just handing back different result sets rather than map your objects one-by-one to another representation.
After all, the database is good at dealing with sets, whereas OO code is good at dealing with single entities.
Reading these answers, I'm quite confused by the lack of understanding of database programming. I am an Oracle PL/SQL developer, and we source-control every bit of code that goes into the database. Many of the IDEs provide add-ins for most of the major source control products, from ClearCase to SourceSafe. The Oracle tools we use allow us to debug the code, so debugging isn't an issue. The issue is more one of logic and accessibility.
As a manager of support for about 5000 users, the fewer places I have to look for the logic, the better. If I want to make sure the logic is applied for ALL applications that use the data, even business logic, I put it in the DB. If the logic is different depending on the application, the applications can be responsible for it.
#DannySmurf:
It's not debuggable
Depending on your server, yes, they are debuggable; SQL Server 2000, for example, supports this, and I'm guessing the newer versions do as well. However, the free MySQL server does not have this (as far as I know).
It's not subject to source control
Yes, it is. Kind of. Database backups should include stored procedures. Those backup files might or might not be in your version control repository. But either way, you have backups of your stored procedures.
My personal preference is to try and keep as much logic and configuration out of the database as possible. I am heavily dependent on Spring and Hibernate these days so that makes it a lot easier. I tend to use Hibernate named queries instead of stored procedures and the static configuration information in Spring application context XML files. Anything that needs to go into the database has to be loaded using a script and I keep those scripts in version control.
#Thomas Owens: (re source control) Yes, but that's not source control in the same sense that I can check in a .cs file (or .cpp file or whatever) and go and pick out any revision I want. To do that with database code requires a potentially-significant amount of effort to either retrieve the procedure from the database and transfer it to somewhere in the source tree, or to do a database backup every time a minor change is made. In either case (and regardless of the amount of effort), it's not intuitive; and for many shops, it's not a good enough solution either. There is also the potential here for developers who may not be as studious at that as others to forget to retrieve and check in a revision. It's technically possible to put ANYTHING in source control; the disconnect here is what I would take issue with.
(re debuggable) Fair enough, though that doesn't provide much integration with the rest of the application (where the majority of the code could live). That may or may not be important.
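For what it's worth, the "retrieve the procedure from the database and transfer it to the source tree" step can itself be scripted rather than done by hand. A rough sketch, assuming SQL Server reached through pyodbc; the connection string and output folder are placeholders:

```python
import pyodbc
from pathlib import Path

OUT_DIR = Path("db/procedures")  # placeholder location in the source tree
CONN_STR = ("DRIVER={SQL Server};SERVER=localhost;"
            "DATABASE=MyDb;Trusted_Connection=yes")

OUT_DIR.mkdir(parents=True, exist_ok=True)

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    # sys.sql_modules holds the full definition text of each stored procedure.
    cursor.execute("""
        SELECT o.name, m.definition
        FROM sys.sql_modules AS m
        JOIN sys.objects AS o ON o.object_id = m.object_id
        WHERE o.type = 'P'
    """)
    for name, definition in cursor.fetchall():
        (OUT_DIR / f"{name}.sql").write_text(definition, encoding="utf-8")
```

Run from a scheduled job or a pre-commit step, this gives ordinary file diffs and history for procedure code.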
Well, if you care about the consistency of your data, there are reasons to implement code within the database. As others have said, placing code (and/or RI/constraints) inside the database acts to enforce business logic, close to the data itself. And, it provides a common, encapsulated interface, so that your new developer doesn't accidentally create orphan records or inconsistent data.
Well, this one is difficult. As a programmer, you'll want to avoid T-SQL and such "database languages" as much as possible, because they are horrendous, difficult to debug, and not extensible, and there's nothing you can do with them that you won't be able to do with code in your application.
The only reasons I see for writing stored procedures are:
Your database isn't great (think of how SQL Server doesn't implement LIMIT and you have to work around that using a procedure).
You want to be able to change a behaviour by changing code in just one place without re-deploying your client applications.
The client machines have big calculation-power constraints (think small embedded devices).
For most applications though, you should try to keep your code in the application where you can debug it, keep it under version control and fix it using all the tools provided to you by your language.