I currently have a lot of financial data I would like to analyze and compute on. I have built a data system that reads from flat files and does some reasonably intelligent caching to maintain the performance I want, but I am starting to have too much data for this approach...
I am now thinking about using PostgreSQL with a schema roughly like this:
Table: Things
Fields: T_id, Row, Sub-Row, Column, Resolution, Readable-Name, Meta
Table: Data
Fields: d_id, T_id, timestamp, value
I was wondering whether PostgreSQL would still perform well with the above schema if the Data table grows to billions of rows.
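Concretely, something like this is what I have in mind (types, names and the index are only first guesses on my part; I adjusted a couple of column names to avoid SQL reserved words):

    CREATE TABLE things (
      t_id          serial PRIMARY KEY,
      row_name      text,
      sub_row_name  text,
      column_name   text,
      resolution    text,
      readable_name text,
      meta          text
    );

    CREATE TABLE data (
      d_id  bigserial PRIMARY KEY,
      t_id  integer REFERENCES things(t_id),
      ts    timestamp NOT NULL,
      value double precision
    );

    -- Most reads would be "all values for one thing in a time range".
    CREATE INDEX data_tid_ts_idx ON data (t_id, ts);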
Another idea I had was to use a column-oriented database, but I can't seem to find any good open-source ones to get started with. Cassandra is really not made for this situation, as I will be reading far more than writing.
It depends on what you expect. PostgreSQL can probably run these queries on your schema, but they may take minutes or hours depending on how many rows are processed, whereas a column-store database can be around 10 times faster. PostgreSQL is a relational OLTP database, your schema is not well normalized, and your workload sounds like OLAP.
There are some open-source column-store databases such as MonetDB or LucidDB, but they are not from the PostgreSQL ecosystem; the only column store in that space, Vertica, is commercial. You can also look at MySQL column-store engines: http://www.mysqlperformanceblog.com/2010/08/16/testing-mysql-column-stores/
The answer depends on your budget.
Here is a list of the solutions we use in practice (from cheap to expensive):
MongoDB
PostgreSQL
InfiniDB
kdb+
I am working on a project for a highly performance-sensitive dashboard where the results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries that fetch this mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group by, and many where conditions).
The issue is that as the data grows, the DB queries are taking longer than the agreed limits. I am thinking of moving the aggregated functionality (all the counts) to a columnar store such as HBase, while the remaining row-level data is still fetched from Oracle; the two result sets would then be merged on a key in the application layer. I would appreciate an expert opinion on whether this is the right approach.
There are a few things that are not clear to me:
1. Can Sqoop load data based on a query/view, or only whole tables? And can it do so on a continuous basis, or only as a one-time import?
2. If a record is modified (e.g. its status changes), how will HBase find out?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database; commercial options are Vertica or Amazon Redshift, and for an open-source alternative you can use Infobright. These columnar options will give you great aggregate query performance.
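For example, a rough sketch of the Hive/Parquet route (the table and column names here are made up, not taken from your system, and STORED AS PARQUET needs a reasonably recent Hive):

    CREATE TABLE dashboard_events (
      event_id BIGINT,
      status   STRING,
      region   STRING,
      event_ts TIMESTAMP
    )
    STORED AS PARQUET;

    -- Aggregates only read the columns they touch thanks to the columnar layout.
    SELECT status, region, COUNT(*) AS cnt
    FROM dashboard_events
    GROUP BY status, region;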
I have a PostgreSQL operational DB with data partitioned per day, and a PostgreSQL data warehouse DB. I would like to copy the data from the operational DB to the DWH as quickly as possible and with the fewest resources.
Since the tables are partitioned by day, I understand that each partition is a table in its own right. Does that mean I can somehow copy the data files between the machines and create the tables in the DWH from those data files? What is the best practice in this case?
EDIT:
To answer the questions asked here:
1. I'm building an ETL process. The first step of the ETL is to copy the data with as little impact on the operational DB as possible.
2. I would be willing to replicate the data as long as it doesn't slow down writes on the operational DB.
3. A bit more detail: the operational DB is not my responsibility, but the main concern is write performance on that DB. It writes about 500 million rows a day; some hours are more heavily loaded than others, but there are no hours without writes at all.
4. I came across a few tools/approaches, such as replication and pg_dump, but I couldn't find anything that compares them so I could understand when to use which and what fits my case.
If you are doing a bulk transfer, I would actually consider running pg_dump on the warehouse system and piping the results into psql once a day. You could probably run Slony too, but that would require more resources and would probably be more complicated.
There are many good ways to replicate data between databases. But since you are just looking for a fast transfer of a table between databases, a simple and fast solution is provided by the extension dblink. There are many examples here on SO; try a search.
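For instance, a minimal sketch (the connection string, table names and column list below are placeholders you would replace with your own):

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- Pull one daily partition from the operational DB into the DWH in a single statement.
    INSERT INTO dwh_daily_data (id, ts, value)
    SELECT id, ts, value
    FROM dblink('host=operational-db dbname=prod user=etl',
                'SELECT id, ts, value FROM daily_data_2014_01_15')
         AS t(id bigint, ts timestamp, value numeric);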
If you want a broader approach, continued synchronization, etc., consider one of the established replication tools. There is a nice comparison in the manual to get you started.
I have a database whose size could grow to 1 TB in a month. If I query it directly, it takes a long time, so I was thinking of using Hadoop on top of the database; most of the time my queries would involve searching the entire database. I would have only 1 or 2 database instances, not more than that, and after a while we purge the database.
So can we use the Hadoop framework, since it helps process large amounts of data?
Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
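A hedged sketch of what that could look like (the HDFS location, table and column names are invented for illustration):

    CREATE EXTERNAL TABLE raw_records (
      record_id BIGINT,
      category  STRING,
      payload   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw_records';

    -- Hive compiles this full scan into jobs that read the whole dataset.
    SELECT *
    FROM raw_records
    WHERE payload LIKE '%needle%';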
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
Hadoop works better on a distributed system. Moreover, 1 TB is not big data; for that, your relational database will do the job.
The real power of Hadoop comes when you have to process 100 TB or more of data, where relational databases start to fail.
If you look into HBase, it is fast, but it is not a substitute for your MySQL or Oracle.
We have a database that has been growing for around 5 years. The main table has nearly 100 columns and 700 million rows (and growing).
The common use case is to count how many rows match given criteria, for example:
select count(*) from main_table where column1 = 'TypeA' and column2 = 'BlockC';
The other use case is to retrieve the rows that match the criteria.
The queries started out taking a little while; now they take a couple of minutes.
I want to find some DBMS that allows me to make the two use cases as fast as possible.
I've been looking into some column-store databases and Apache Cassandra, but I still have no idea what the best option is. Any ideas?
Update: these days I'd recommend Hive 3 or PrestoDB for big data analysis
I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your dbs.
You are going to want a few features to help speed up analysis:
Materialized views. These essentially pre-calculate values and store the results for later analysis. MySQL and Postgres do not support them yet (support is coming in Postgres 9.3), but you can mimic them with triggers; there is a small sketch after this list.
Easy OLAP analysis. You could use the Mondrian OLAP server (Java); Excel doesn't talk to it easily, but JasperSoft and Pentaho do.
You might want to change the schema for easier OLAP analysis, i.e. a star schema. A good book on the subject: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247/ref=pd_sim_b_1
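Here is a rough sketch of the pre-aggregation idea, assuming a main table and columns like the ones in your count query (on versions before 9.3 you would maintain a summary table with triggers instead):

    -- Postgres 9.3+ syntax; the table and column names are placeholders.
    CREATE MATERIALIZED VIEW counts_by_type AS
    SELECT column1, column2, count(*) AS cnt
    FROM main_table
    GROUP BY column1, column2;

    -- Refresh periodically (e.g. nightly) instead of re-scanning 700M rows per query.
    REFRESH MATERIALIZED VIEW counts_by_type;

    SELECT cnt FROM counts_by_type
    WHERE column1 = 'TypeA' AND column2 = 'BlockC';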
If you want open source, I'd go with Postgres (it doesn't choke on big queries like MySQL can), plus Mondrian, plus Pentaho.
If not open source, then the best bang for the buck is likely Microsoft SQL Server with Analysis Services.
I have an interesting challenge of building a database that imports data from about 500 different sources.
Each source has its own schema, and many are very different. However, they all describe the same kind of entity.
My first thought is a typical Entity/Attribute/Value (EAV) schema; however, after converting the denormalized import from one source (550k rows) into EAV, I end up with 36 million rows in the Attribute_Value table. With proper indexes this is still very fast, but that is just one of the 500 import sources so far.
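A simplified sketch of the kind of layout I mean (names are illustrative, not my real schema):

    CREATE TABLE entity (
      entity_id bigint PRIMARY KEY,
      source_id int NOT NULL
    );

    CREATE TABLE attribute_value (
      entity_id    bigint NOT NULL REFERENCES entity(entity_id),
      attribute_id int    NOT NULL,
      value        varchar(4000),
      PRIMARY KEY (entity_id, attribute_id)
    );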
I don't think this will scale; however, it does make for very nice logical partitioning. We don't need to join across import sources, so we could (theoretically) build out 50 or so separate databases.
I'm looking for people who have worked with massive data sources and their experience with how to handle things when the row count is in the hundreds of millions.
Have you considered OLAP solutions? They are probably designed for situations like yours: massive amounts of data to read and analyze.
I have billion-plus-row tables; the number of rows is not as critical as the fragmentation level and the width of the table itself, since the wider the table, the fewer rows fit on a page.
Besides OLAP/SSAS, have you looked at using partition functions (new in SQL Server 2005)?
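For instance (the boundary values, filegroup and table/column names below are just an example, not tuned for your data):

    CREATE PARTITION FUNCTION pf_by_month (datetime)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME ps_by_month
    AS PARTITION pf_by_month ALL TO ([PRIMARY]);

    -- The table is then created on the partition scheme, keyed by the date column.
    CREATE TABLE dbo.BigTable (
      Id        bigint   NOT NULL,
      CreatedAt datetime NOT NULL,
      Payload   varchar(200)
    ) ON ps_by_month (CreatedAt);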
You could also take advantage of page- and row-level compression (new in SQL Server 2008); this will help you fit more data into RAM. I did my own testing with compression; check out this post to see how it compared to no compression: A Quick Look At Compression In SQL 2008.
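For example (the table name is a placeholder; estimate the savings first, then rebuild):

    -- Estimate the savings before committing to a rebuild.
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo', @object_name = 'BigTable',
         @index_id = NULL, @partition_number = NULL,
         @data_compression = 'PAGE';

    ALTER TABLE dbo.BigTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);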