What should I use for a Database?

My vb.net code calculates the growth rate of a company's stock price for every quarter from 1901 to present and stores it in a datatable. This takes a while to perform (10-15 minutes). I would like to save the information in the datatable after it is calculated so that I don't have to recalculate past growth rates every time I run the program. When I open my program I want the datatable to contain any growth rates that have already been calculated so I only have to calculate growth rates for new quarters.
Should I store my datatable in a database of some kind or is there another way to do this? My datatable is quite large. It currently has 450 columns (one for each quarter from 1901 to present) and can have thousands of rows (one for each company). Is this too big for Microsoft Access? Would Microsoft Excel be an option?
Thanks!

First of all, it's not clear you actually need a database. If you don't need things such as concurrent access, client/server operation, or ACID transactions, you might as well just implement your cache using the file system.
If you conclude you do need a DBMS, there are many good choices: free options such as PostgreSQL, MS SQL Server Express, Oracle Express, MySQL, Firebird, and SQLite, or commercial ones such as Oracle, MS SQL Server, IBM DB2, and Sybase.
I suggest you make your data model flexible, so you don't have to add a new column for each new quarter:
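Something along these lines, as a minimal sketch (the table and column names are my own, not anything you have to use); the point is one row per company per quarter rather than one column per quarter:

    CREATE TABLE GrowthRate (
        CompanyId  INT           NOT NULL,
        Year       SMALLINT      NOT NULL,
        Quarter    SMALLINT      NOT NULL,   -- 1 through 4
        GrowthRate DECIMAL(12,6)     NULL,   -- NULL until calculated
        PRIMARY KEY (CompanyId, Year, Quarter)
    );

A new quarter is then just new rows, with no schema change.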
This model is also well suited for clustering (if your DBMS of choice supports it), so the calculations belonging to the same company are stored physically close together in the database, potentially lowering the I/O during querying. Alternatively, you may choose to cluster on year/quarter.

I would change the database design to:
ID
Quarter
Year
CompanyName
Value1
Value2
Value3
as your columns and start saving it as a vertical table.
With that design you don't have as much data as you think, so I'd recommend something free like MySQL, or even NoSQL, since you're not doing anything but storing and retrieving the data. Any text-based file (XML, CSV, XLS) is going to be way slower, because the entire file needs to be loaded into memory before you can parse it.
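For example, loading whatever has already been calculated back into your program, or finding the last stored quarter per company, is a cheap query against the vertical table (the table name and the use of Value1 for the growth rate are assumptions):

    -- Everything already calculated for one company:
    SELECT Year, Quarter, Value1
    FROM   GrowthRates
    WHERE  CompanyName = 'ACME Corp'
    ORDER  BY Year, Quarter;

    -- Last stored quarter per company, so only newer quarters get recalculated:
    SELECT CompanyName, MAX(Year * 10 + Quarter) AS LastStoredQuarter
    FROM   GrowthRates
    GROUP  BY CompanyName;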

Excel has limits on sheet size, and you shouldn't really ever use it as an explicit "database" for anything you wish to port over to different structures. It's good for spreadsheets and accounting in general, but you shouldn't use it as an absolute-truth database as understood in computing. Also, Excel has a limit on the number of records that can be contained: as of Excel 2003, a worksheet is limited to 65,536 rows by 256 columns.
Access may work for this, but with the number of records you're looking at, you'll probably begin to experience issues with file sizes, slowdowns, and just general things like that. Once you start having more than 3,000 records at a time, it's probably better to use one of the big RDBMSs: Oracle, MySQL, SQL Server, etc.

I think that the main problem might be the way you designed the database.
A column for each quarter doesn't sound like very good practice, especially when you have to change your DB schema every new quarter.
You could start with a MS Access database and then if you have any performance problems with it, migrate to a SQL Server database or something.
Again, I think you should take a careful look at your database design.

I have a great deal of experience with stock data. Having tested quite a few methods, I think that for a simple, free method you should try SQL Server. The amount of data you are working with is just too much for Access (and I imagine these are not the only calculations you would like to run). You can use SQL Server Express for free.
For this design I would create a database within SQL Server named HistoricalGrowthRate. I would have a table for each stock symbol and store the data in there.
One way to accomplish this is to have a separate database with a table that contains all the symbols you wish to follow (if you don't have one, you can use the CompanyList.csv from Nasdaq). Loop through each symbol in that table and run a CREATE TABLE in HistoricalGrowthRate. When you wish to populate the values, simply loop again and insert your values. You could also just export from Access, whichever is faster for you.
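Roughly, the loop would look something like this in T-SQL (the SymbolList database, the Symbols table, and the column layout are assumptions on my part):

    DECLARE @symbol SYSNAME, @sql NVARCHAR(MAX);

    DECLARE symbol_cursor CURSOR FOR
        SELECT Symbol FROM SymbolList.dbo.Symbols;

    OPEN symbol_cursor;
    FETCH NEXT FROM symbol_cursor INTO @symbol;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- One table per symbol inside HistoricalGrowthRate
        SET @sql = N'CREATE TABLE HistoricalGrowthRate.dbo.' + QUOTENAME(@symbol) +
                   N' (QuarterEnd DATE NOT NULL PRIMARY KEY, GrowthRate DECIMAL(12,6) NULL);';
        EXEC sp_executesql @sql;
        FETCH NEXT FROM symbol_cursor INTO @symbol;
    END

    CLOSE symbol_cursor;
    DEALLOCATE symbol_cursor;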
This will decrease the load when you call for the information and provide an easy way to access it. So, if you want the historical growth rate for AAPL, you simply set the connection string to your HistoricalGrowthRate database, reference the AAPL table, and extract the values.

Related

Best database for multi million row store/query

We have a database that has been growing for around 5 years. The main table has nearly 100 columns and 700 million rows (and growing).
The common use case is to count how many rows match a given criteria, that is:
select count(*) from main_table where column1 = 'TypeA' and column2 = 'BlockC';  -- main_table stands in for the actual table name
The other use case is to retrieve the rows that match a criteria.
The queries started by taking a bit of time; now they take a couple of minutes.
I want to find some DBMS that allows me to make the two use cases as fast as possible.
I've been looking into some column-store databases and Apache Cassandra, but still have no idea what the best option is. Any ideas?
Update: these days I'd recommend Hive 3 or PrestoDB for big data analysis
I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your dbs.
You are going to want a few features to help speed up analysis:
Materialized views. This is essentially pre-calculating values and then storing the results for later analysis. MySQL and Postgres (coming soon in Postgres 9.3) do not support this, but you can mimic it with triggers (see the sketch after this list).
Easy OLAP analysis. You could use the Mondrian OLAP server (Java); Excel doesn't talk to it easily, but JasperSoft and Pentaho do.
You might want to change the schema for easier OLAP analysis, i.e. a star schema. Good book:
http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247/ref=pd_sim_b_1
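As a rough sketch of the trigger workaround (MySQL syntax for brevity; the summary table name is mine, and the columns match the count(*) example above):

    CREATE TABLE row_counts (
        column1 VARCHAR(50) NOT NULL,
        column2 VARCHAR(50) NOT NULL,
        cnt     BIGINT      NOT NULL DEFAULT 0,
        PRIMARY KEY (column1, column2)
    );

    -- Keep the counts current as rows are inserted into the big table
    CREATE TRIGGER main_table_ai AFTER INSERT ON main_table
    FOR EACH ROW
        INSERT INTO row_counts (column1, column2, cnt)
        VALUES (NEW.column1, NEW.column2, 1)
        ON DUPLICATE KEY UPDATE cnt = cnt + 1;

    -- The count(*) query then becomes a single-row lookup:
    -- SELECT cnt FROM row_counts WHERE column1 = 'TypeA' AND column2 = 'BlockC';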
If you want open source, I'd go with Postgres (it doesn't choke on big queries like MySQL can), plus Mondrian, plus Pentaho.
If not open source, then best bang for buck is likely Microsoft SQL Server with Analysis Services.

Need for OLAP cubes if we can Build views based directly off the RAW table

Assume that the tables in the source data are clean and in a state where they can be used directly.
I am trying to understand whether building views based off the RAW table is better than creating cubes. To make the views dynamic, we can have a .NET application which would take parameters for the view, execute it with those parameters, and get the data for reporting and analysis.
Say I want to view the sales of a product for the United States in the month of February. I can create a view joining Product and Customer to get the sales for a particular day in that month.
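Something like this is what I have in mind (just a rough sketch; all table and column names here are made up):

    CREATE VIEW vw_ProductSales AS
    SELECT  p.ProductName,
            c.Country,
            s.SaleDate,
            s.Amount
    FROM    Sales    s
    JOIN    Product  p ON p.ProductId  = s.ProductId
    JOIN    Customer c ON c.CustomerId = s.CustomerId;

    -- "Sales of a product for the United States in February" is then:
    -- SELECT SUM(Amount) FROM vw_ProductSales
    -- WHERE ProductName = 'Parka' AND Country = 'United States'
    --   AND SaleDate >= '2013-02-01' AND SaleDate < '2013-03-01';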
That would be instead of forming a star schema with Product, Date, and Customer dimensions. I am really trying to understand what standard a company should go with.
I have folks telling me cubes are only good for analysis, not for reporting, and that whatever information we want we can get by creating dynamic views.
Any advice or ideas on this?
Thanks!!
As the name suggests, SSAS (SQL Server Analysis Services) is indeed built for analysis. The reason for this is the highly denormalized table structure (e.g., a star schema), which allows for super efficient indexing combined with the pre-processing of aggregated values.
Views are a great way to take data that already exists within your OLTP (as compared to OLAP) database and transform it in a manner that better fits your querying needs. This works in the same manner as "get" stored procedures.
Now for my opinion:
If you have a small amount of data (relative to the power of your server, as well as many other factors) and you're not performing intense aggregations of the data, consider using stored procedures to access your database. You can specify the parameters in .NET like any other function, making this method super easy (a sketch follows at the end of this answer).
If you have a lot of data (like, over 100 million rows), consider creating a cube. This will allow your queries to fly. There's a lot more work that goes into these, but the speed payoff is huge.
End note:
If the data in your reports is pretty similar to the data you already have in your database (including JOINing the tables) and you have under half a billion rows, just use a stored proc, and look into using SSRS (or not). If you have a ton of data that needs to be aggregated and transformed, look into SSAS OLAP cubes.
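Here is a hedged T-SQL sketch of the stored-procedure route from the first case above (all object names are made up for illustration):

    CREATE PROCEDURE dbo.GetProductSales
        @Country   NVARCHAR(50),
        @StartDate DATE,
        @EndDate   DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Plain JOINs and aggregation against the OLTP tables
        SELECT  p.ProductName,
                SUM(s.Amount) AS TotalSales
        FROM    Sales    s
        JOIN    Product  p ON p.ProductId  = s.ProductId
        JOIN    Customer c ON c.CustomerId = s.CustomerId
        WHERE   c.Country  = @Country
          AND   s.SaleDate >= @StartDate
          AND   s.SaleDate <  @EndDate
        GROUP BY p.ProductName;
    END

From .NET you call it like any other parameterized command (CommandType.StoredProcedure plus three parameters).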
From my limited experience with Microsoft's Analysis Services, I would agree with Norla. If the execution time of the view is reasonable, that would be the way to go. Cubes can certainly be reported against, as SQL Reporting Services accommodates them fairly well, but the development process can often be much more involved when using a cube as your data source.
Building views can be an alternative for small datasets. You could consider going that route, but:
1) once the reports start taking a long time to load, or
2) the views start slowing down the transactional systems,
then you'll have to consider cubes.

Use one large database or use single databases per customer

Currently I'm working on an online web application for construction materials. Companies can log in on our website and then they can use the web app.
From the beginning the idea was to create a database per customer. But now it's becoming larger and larger (100+), so we now have 100 databases to manage.
We have to run an update script for DB maintenance approx. twice a year.
The advantage that I see is that when a customer wants to quit, we delete their database and then it's finished.
When I want to add a new customer, I have to fill the database with approx. 1,000,000 unique records for that specific customer, because every customer has different prices/materials.
For backups I use a MySQL dump script that creates a *.sql file per database, which I download every day.
What is your opinion, and what do you think?
One large database, or a database per customer?
I'm using MySQL with ASP.NET/C#...
I don't want to make a suggestion because there are far too many variables.
I do want to note, however, that my employer has 1000s of deployed databases -- we use one database per customer with replication (2+ databases).
So, the idea is workable. My job isn't related to DB management, but I do recall that we do a lot in the way of automation and online tools. Backups and DB management are handled by a team.
Ultimately, you can make the 100+ deployments work, but you are going to want to start investing in the development of utilities and tools to help automate the backup and/or management of the DBs.
Ideally, nothing (DB Management) should be done by hand. Furthermore, the connection strings should be abstracted away from a given web app deployment.
But now it's becoming larger and larger (100+), so we now have 100 databases to manage
I think you have your answer right there.
Have to agree with @Hogan - the overhead of managing that many databases is probably far from ideal, especially if you ever need to make schema changes, etc. in the future.
That said, if you use a single database are you ever likely to need to separate out a given customer's data into a standalone database/site? If this is likely, how long would it take to carry out this separation?
In essence, if it's likely to take less effort to write a set of tools to handle the above case, then I'd be tempted to go for the single database approach. However, you'll also need to factor in the likely timescales for creating a unified version of the database schemas that handle datasets for each customer, etc.
Also, are the schemas precisely the same for all of the existing 100+ databases? If not, there's potentially a world of pain if you decide to migrate the existing data into a single database.
Update - Incidentally, all of the above is a bit generalised, but it's hard to be specific without knowing more about the amount of data, and traffic, etc. in use. (e.g.: If you ever had a high demand site for a customer it would be trivial to put it onto its own DB server if you were using a per-customer database.)
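For what it's worth, the single-database approach usually boils down to carrying a CustomerId on every customer-specific table; a minimal sketch (all names here are invented):

    CREATE TABLE Customer (
        CustomerId INT          NOT NULL PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE MaterialPrice (
        CustomerId INT           NOT NULL,
        MaterialId INT           NOT NULL,
        Price      DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (CustomerId, MaterialId),
        FOREIGN KEY (CustomerId) REFERENCES Customer (CustomerId)
    );

    -- Removing a customer becomes a delete rather than dropping a database:
    -- DELETE FROM MaterialPrice WHERE CustomerId = 42;
    -- DELETE FROM Customer      WHERE CustomerId = 42;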
I agree with @Hogan and @middaparke... if the schemas are the same, you should put it in one instance.
Unfortunately it is impossible to tell from here whether your schemas would benefit from reusing most of those million rows or not; if they are normalized well, then certainly it would be beneficial.
It is also impossible to tell how difficult any changes to the applications would be as a result of this change.
Unfortunately, it sounds like you have a large customer base with working applications, and therefore momentum to keep going in that direction, which throws you into the realm of sucking it up and dealing with it by automating the management of so many DBs... not the way you would do it from scratch, but maybe the cheapest option since you are where you are.

Fastest method to fill a database table with 10 Million rows

What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow for a way to do this as fast as possible. I'm not requiring this data to be indexed during this initial data-table population.
Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, which seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
The fastest method is probably running an INSERT SQL statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even the same database, a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (eliminating instances where a count with GROUP BY was less than or equal to 2).
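A rough sketch of what I mean, with made-up table and column names; a cross join against a small multiplier set turns each source row into several test rows, and repeating the statement (or widening the multiplier) gets you to 10 million quickly:

    INSERT INTO target_table (customer_name, order_total)
    SELECT  s.customer_name,
            s.order_total * (1 + m.n / 100.0)   -- vary the copies slightly
    FROM    source_table s
    CROSS JOIN (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3
                UNION ALL SELECT 4 UNION ALL SELECT 5) AS m;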
Also, Red Gate actually provides a utility to do just what you're asking. It's not free and I think it's SQL Server-specific, but their tools are top notch. Well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means. A decent developer should be able to knock out a similarly featured, though alpha/hardcoded, version of the app in a day or two...
You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL Server (2005) database. For SQL Server, it appears that an SSIS/DTS package is the fastest way to bulk import data into a database.
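If the data already exists as a flat file and you don't need a full SSIS package, SQL Server's BULK INSERT is a simpler option (the file path and table name below are placeholders):

    BULK INSERT dbo.TargetTable
    FROM 'C:\data\rows.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK          -- helps SQL Server minimize logging during the load
    );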
It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in a relatively short order. 10 million rows is nothing.
Use MySQL or MS SQL and embedded functions to generate records inside the database engine. Or generate a text file (in CSV-like format) and then use bulk copy functionality.

Help figuring out approaches to (near) real time multi dimensional data querying

I have a system that involves numerous related tables. Think of a standard category/product/order/customer/order-item scenario. Some tables are self-referencing (like Categories). None of the tables are particularly large (around 100k rows, with an estimated scale to around 1 million rows). There are a lot of dimensions to this data I need to consider, and it must be queried in a near real-time way. I also don't know which dimensions a particular user is interested in; it can be one or many criteria across numerous tables. Things can range from:
Give me everything with a category of Jackets
Give me everything with a category of Jackets->Parkas having a red color purchased in the last month in New York
Give me everything which wasn't purchased in New York and costs over $100.
Currently we have a very long SP which uses a "cascading data" approach: we go table by table, filtering everything into a temp table using whatever criteria were specified for that table. For the next table, we join the current temp table to whatever table we're using and apply a new filter set into a new temp table. It works, but it's hard to maintain and performance is poor. I need something better.
I need a new approach to this problem. It's clearly a need for OLAP, possibly using a star schema. Does this work in real time? Can it be configured to work in real time? Should I use indexed views to create a set of denormalized tables? Should I offload this outside of the database completely?
FYI We're using Sql Server.
As you say, this is perfect for OLAP.
With SQL Server 2005 and 2008 you can set up an almost real-time solution. You should:
Create a denormalized star schema (see the sketch below)
Build an OLAP cube using that schema
Enable proactive caching to update the cube when the underlying data source changes.
It's not a trivial job, and you need the Enterprise edition of SQL Server to use proactive caching. You also need some front-end tool (maybe Excel would do) to consume the cube.
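For step 1, a very small sketch of what the denormalized star schema might look like for the category/product/order/customer scenario in the question (all names are assumptions, and a date dimension is omitted for brevity):

    CREATE TABLE DimProduct (
        ProductKey  INT          NOT NULL PRIMARY KEY,
        ProductName VARCHAR(100) NOT NULL,
        Category    VARCHAR(100) NOT NULL   -- flattened from the self-referencing Categories table
    );

    CREATE TABLE DimCustomer (
        CustomerKey INT          NOT NULL PRIMARY KEY,
        City        VARCHAR(100) NOT NULL,
        Region      VARCHAR(100) NOT NULL
    );

    CREATE TABLE FactOrderItem (
        DateKey     INT           NOT NULL,  -- would reference a DimDate table
        ProductKey  INT           NOT NULL REFERENCES DimProduct (ProductKey),
        CustomerKey INT           NOT NULL REFERENCES DimCustomer (CustomerKey),
        Quantity    INT           NOT NULL,
        Amount      DECIMAL(12,2) NOT NULL
    );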
It would probably be better to build a dynamic query in your code with all the joins you need, customized to each individual request (properly parameterized for security, of course).
You would use much of the same cascading logic you have now, but you move it into the code instead of the database. Then you only submit the exact query you need.
The performance would beat using all of the temp tables and you might get some caching benefit after a few queries were run.
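For example, the request "everything in category Jackets, purchased in New York, costing over $100" might generate nothing more than the following (table and column names are assumptions); only the tables that request actually touches get joined, and every value is a parameter:

    SELECT  oi.OrderItemId, p.ProductName, oi.Price
    FROM    OrderItem oi
    JOIN    Product   p  ON p.ProductId   = oi.ProductId
    JOIN    Category  c  ON c.CategoryId  = p.CategoryId
    JOIN    [Order]   o  ON o.OrderId     = oi.OrderId
    JOIN    Customer  cu ON cu.CustomerId = o.CustomerId
    WHERE   c.Name   = @category      -- 'Jackets'
      AND   cu.City  = @city          -- 'New York'
      AND   oi.Price > @minPrice;     -- 100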
Your dilemma sounds to me like "Is it better to achieve the same result by performing complex processing every time I need it, or should I do it once only for each new piece of data?".
