Join query on OLAP and operations data

I manage in my database a list of tools with this schema:
[id] int PRIMARY
[name] varchar
Every few seconds each tool emits a measurement. I will save it in an OLAP store with this schema:
[toolID] int
[time] timestamp
[measurement] int
(We have not chosen the OLAP store yet but assume we need one due to data amounts, semantics, and types of queries we will run)
How do I query the list of tool names with measurements greater than 100? The challenge is that I need to join data from both OLAP and OLTP stores.
Option 1 - Also save the tool name in OLAP with each measurement (denormalization). The problem is that the tool name might have changed since the measurement, and I need the latest one. There may also be many more details (and detail data) per tool, and I am not sure it makes sense to save them all with every measurement.
Option 2 - OLAP returns just a list of IDs, then I issue a query to OLTP to get the names. This would require SQL queries with many embedded IDs, which does not seem right.
Option 3 - Synchronize all OLTP data into OLAP every few minutes. But OLAP stores (e.g. Vertica) are not optimized for updates, so this does not seem efficient.

Generally, in OLAP/DW systems, option 3 is preferred and the list of tools and their details would be stored in a Tool dimension table and the measurements would be stored in a Measurement fact table.
If, as you mentioned in your comment, you are not concerned with keeping the history of a tool's details when they change, and both the frequency and the number of updates to the tool details are small, then I would just update the records in the Tool dimension, since it will be a relatively small number of updates.
If the frequency of updates is small but the actual number of updates is large then it may be easier and faster to simply truncate the Tool dimension and insert all Tool records from the OLTP system. In this case, you would need to ensure that there is a way to preserve the dimension keys in order to join back to the fact measurements that have already been stored. This could be difficult if you are using a surrogate key based on an auto-generated sequence.
The real problem arises when the frequency and number of updates to the tool details is large. In this case, you would have to step back and look at the overall model and determine if the tool details actually belong in a dimension or if they deserve their own fact table.
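The dimension/fact layout described above can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the (not yet chosen) OLAP store; the names `dim_tool`, `fact_measurement`, `sync_tools` and `tools_over` are hypothetical. The dimension reuses the OLTP `id` as its key, so refreshing it never breaks the join to measurements that are already stored.

```python
import sqlite3

# In-memory SQLite stands in for the OLAP store; all names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_tool (
        tool_id INTEGER PRIMARY KEY,   -- same value as [id] in the OLTP tools table
        name    TEXT NOT NULL
    );
    CREATE TABLE fact_measurement (
        tool_id     INTEGER NOT NULL,
        time        TEXT    NOT NULL,
        measurement INTEGER NOT NULL
    );
""")

def sync_tools(conn, oltp_rows):
    """Copy the current (id, name) rows from OLTP into the Tool dimension.

    INSERT OR REPLACE overwrites the row with the same tool_id, so the
    dimension always holds the latest name and the key stays stable.
    """
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dim_tool (tool_id, name) VALUES (?, ?)",
            oltp_rows,
        )

def tools_over(conn, threshold):
    """Names of tools with at least one measurement above the threshold."""
    rows = conn.execute(
        "SELECT DISTINCT t.name "
        "FROM fact_measurement AS f "
        "JOIN dim_tool AS t ON t.tool_id = f.tool_id "
        "WHERE f.measurement > ? "
        "ORDER BY t.name",
        (threshold,),
    )
    return [name for (name,) in rows]

# Initial sync, a few measurements, then a rename that arrives with a later sync.
sync_tools(conn, [(1, "drill"), (2, "lathe")])
with conn:
    conn.executemany(
        "INSERT INTO fact_measurement VALUES (?, ?, ?)",
        [(1, "2024-01-01 00:00:00", 150), (2, "2024-01-01 00:00:05", 90)],
    )
sync_tools(conn, [(1, "drill-v2"), (2, "lathe")])

print(tools_over(conn, 100))  # ['drill-v2']
```

Because the measurement rows carry only the ID, the rename is picked up by the join without touching the fact table.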

Related

SQLite performance advice for .net

I am using SQLite in my application. The scenario is that I have stock market data, and each company is a database with one table. That table stores records, which can range from a couple of thousand to half a million.
Currently, when I update the data in real time, I open a connection and check whether that particular data exists. If not, I insert it and close the connection. This is done in a loop, and each database (representing a company) is updated. The number of records inserted is low and is not the problem. But is the process okay?
An alternative is to have one database with many tables (each company can be a table), where each table can have a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010. The language is VB.NET.

The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
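A minimal sketch of that single-table layout, written in Python with the standard `sqlite3` module rather than VB.NET. The table name `quotes`, its columns and the helper `insert_new_quotes` are assumptions made for illustration. The composite primary key doubles as the index for per-company lookups, and `INSERT OR IGNORE` replaces the separate "check whether it exists, then insert" round trip.

```python
import sqlite3

# One database, one table, with the company as a column (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quotes (
        company TEXT NOT NULL,
        ts      TEXT NOT NULL,
        price   REAL NOT NULL,
        PRIMARY KEY (company, ts)      -- also serves as the lookup index
    )
""")

def insert_new_quotes(conn, rows):
    """Insert rows in one transaction, skipping ones already stored.

    Returns the number of rows actually added.
    """
    before = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO quotes (company, ts, price) VALUES (?, ?, ?)",
            rows,
        )
    after = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    return after - before

added_first = insert_new_quotes(conn, [
    ("ACME", "2024-01-02T10:00:00", 10.5),
    ("ACME", "2024-01-02T10:01:00", 10.6),
])
added_again = insert_new_quotes(conn, [
    ("ACME", "2024-01-02T10:01:00", 10.6),   # duplicate, ignored
    ("INIT", "2024-01-02T10:01:00", 99.0),   # new company, same table
])
print(added_first, added_again)  # 2 1
```

Batching all companies into one transaction on one connection avoids opening and closing roughly 500 database files per update cycle.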

I did something similar, with similar-sized data in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (one table per file, representing a cohesive unit; in your case, one company). Plus you gain the advantage of each company table having the same name, versus having x tables with different names that share the same schema (and no sanitizing of company names is required to create new tables).
Internally, other DBMSs often keep at least one file per table, so SQL is just a layer of abstraction above that. SQLite (despite its creators' boasting) is meant for small projects, and querying larger data models gets more finicky if you want it to work well.

Difference between SQL query aggregation and querying an OLAP cube

I have a question about the advantages of building an OLAP cube versus aggregating data in a database table for querying, say, six months of data, and then archiving the SQL table later for analytics purposes.
Which one is better, a table or an OLAP cube, and why? After all, I can also aggregate and keep the data in my tables and query the aggregated data as and when needed.

Short version: Like many development decisions, it depends.
Long version: I wouldn't say that one is "better" than the other - it's just that the two have separate uses and one or the other might be the better solution depending on what the requirements are.
If you have a few specific reports which require specific aggregations, then it might be simpler and easier for everyone involved to just aggregate that information in a table or a view, and point your reports at that.
As an example, if you know your users only want reports at a monthly level for a particular set of parameters - maybe your sales department want the monthly value of each salesperson's sales, for example - then your best bet might be to aggregate this up and pop it into a report where they can select the month and the salesperson, and get the number that they want.
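As a rough illustration of the "aggregate it into a table and point the report at it" approach, here is a minimal sketch using Python's `sqlite3`. The tables `sales` and `monthly_sales` and the function `rebuild_monthly_sales` are hypothetical names, and the sample figures are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (salesperson TEXT, sale_date TEXT, amount REAL);
    CREATE TABLE monthly_sales (
        salesperson TEXT,
        month       TEXT,              -- 'YYYY-MM'
        total       REAL,
        PRIMARY KEY (salesperson, month)
    );
""")

def rebuild_monthly_sales(conn):
    """Recompute the monthly aggregate table from the detail rows."""
    with conn:
        conn.execute("DELETE FROM monthly_sales")
        conn.execute("""
            INSERT INTO monthly_sales (salesperson, month, total)
            SELECT salesperson, strftime('%Y-%m', sale_date), SUM(amount)
            FROM sales
            GROUP BY salesperson, strftime('%Y-%m', sale_date)
        """)

with conn:
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("ann", "2024-01-05", 100.0),
        ("ann", "2024-01-20", 50.0),
        ("bob", "2024-01-07", 70.0),
        ("ann", "2024-02-01", 30.0),
    ])
rebuild_monthly_sales(conn)

# The report only ever reads the small aggregate table.
monthly_totals = conn.execute(
    "SELECT salesperson, month, total FROM monthly_sales "
    "ORDER BY salesperson, month"
).fetchall()
print(monthly_totals)
```

The trade-off is visible in the schema: the report can filter by month and salesperson, but any other breakdown needs another aggregate table.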
The benefits of this might be that it's quick to develop and provide to your users, there's not too much time spent testing as only a few figures need checking, etc. Your users also don't need to spend time being trained/learning to use a cube - reports are generally pretty easy for people to pick up and use.
But if your users want to be able to carry out much more open-ended analysis on their own terms then it's not much use if you need to go away and develop a report every time they have a new requirement. Your database might start getting very full of similar-but-different tables full of aggregated amounts. You could run into issues where one report ends up not agreeing with another for some reason - you might find you're dealing with the same data quality issues over and over again in each report.
In this case, it might make more sense to develop a cube over the top of data held at the lowest grain which your users want to analyse. In this way, they can essentially self-serve, rather than getting back in touch with you every time they need a new set of aggregated data. They can slice and dice through the data using multiple different "parameters" (dimensions in the OLAP world), rather than being limited by the nature of the reports.
Aggregated data still sometimes plays a role even when you have a cube in place, though. Sometimes performance gains can be found by aggregating data up to certain levels and holding it in a physical table, and getting your OLAP tool to use the physically aggregated data at that level instead of using its own aggregations - but this is an optimization step which would need careful consideration to see whether it's beneficial in terms of performance, whether the space vs. performance payoff is worthwhile, etc. I wouldn't worry about this aspect if you're just starting to look at OLAP, but wanted to note it for the sake of completeness.

To add to Jo's great answer, consider the grain of the facts that need to be aggregated and compared. If you have daily sales by product, but budgets by month and product category, you're going to need an aggregate fact table based on sales in order to compare budgets. That would be further represented as two cubes in your OLAP database - Sales cube, and Budget cube.

If there are very regular use cases which involve specific aggregated data, and this aggregated data would take a while to return from SQL database tables, then a cube might help.
If there are lots of potential ways in which your table data needs to be sliced and diced at an aggregated level, then there is definitely a good argument for starting to play around with OLAP cubes.
In terms of sums of data, OLAP is a great aggregation tool. I'm not convinced that it is the best tool for distinct counts though, so if your requirements include lots of distinct counts then maybe look elsewhere. Do you have the option of Tabular/PowerPivot/DAX?

Retrieve first 100 rows sorted by a function without evaluating all rows in the table?

I think the question in the title says it all and is general.
I can give a concrete example as well:
I have tagged articles and want to find similar articles with the tags associated with them.
The score function will look at two articles and count the number of tags in common.
Since the score is not stored anywhere, I'll have to calculate the score every time I need to find similar articles for a given article.
But this is too expensive.
What is the common work-around to this kind of problem in general?
Is there a better approach for my specific tag problem? (e.g. solr's moreLikeThis)
edit
I'm using postgres, if that matters.
I'm looking for a general solution that people have used successfully, such as batch-calculating the score and saving it somewhere.

The answer will vary wildly by database product and version. For example, in some database products, it may be the case that a view or an indexed view might be faster than the more common solution...
Typically the way to handle a situation like this is by precalculating the result. You can do that in a handful of ways:
a. You can use something like triggers (added in the SQL 99 standard) that update the counts as rows are added, updated or removed from the source table. In this solution, you are making a (presumably) small sacrifice on inserts, updates and deletes of the source table in order to make significant gains in retrieving the information.
b. You can use a data warehouse where you accept some level of latency between live data and reported data. That means you accept that the data queried from the data warehouse will be stale by some accepted number of minutes, hours, days, or weeks. The data warehouse works by periodically querying the live OLTP (Online Transaction Processing) data and updating the OLAP (Online Analytical Processing) database, which contains the precalculated results. You then run your reports off the OLAP data or a combination of OLTP and OLAP data. A formal data warehouse isn't required to achieve equivalent results. You could write a procedure, executed on a timer, that periodically updates a table with fresh results.
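The timer-driven variant can be sketched for the tag example from the question. This is only an illustration, assuming SQLite in place of Postgres; the tables `article_tags` and `article_similarity` and the functions `refresh_similarity` and `most_similar` are hypothetical. The batch job computes the shared-tag count for every pair once, so the top-N lookup reads precomputed rows instead of scoring the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_tags (
        article_id INTEGER,
        tag        TEXT,
        PRIMARY KEY (article_id, tag)
    );
    -- Precomputed scores; refreshed by a periodic job, so they may be stale.
    CREATE TABLE article_similarity (
        article_id  INTEGER,
        other_id    INTEGER,
        shared_tags INTEGER,
        PRIMARY KEY (article_id, other_id)
    );
    CREATE INDEX idx_similarity_score
        ON article_similarity (article_id, shared_tags DESC);
""")

def refresh_similarity(conn):
    """Batch job: recompute the number of shared tags for every article pair."""
    with conn:
        conn.execute("DELETE FROM article_similarity")
        conn.execute("""
            INSERT INTO article_similarity (article_id, other_id, shared_tags)
            SELECT a.article_id, b.article_id, COUNT(*)
            FROM article_tags AS a
            JOIN article_tags AS b
              ON a.tag = b.tag AND a.article_id <> b.article_id
            GROUP BY a.article_id, b.article_id
        """)

def most_similar(conn, article_id, limit=100):
    """Top matches for one article, read from the precomputed table."""
    return conn.execute(
        "SELECT other_id, shared_tags FROM article_similarity "
        "WHERE article_id = ? "
        "ORDER BY shared_tags DESC, other_id "
        "LIMIT ?",
        (article_id, limit),
    ).fetchall()

with conn:
    conn.executemany("INSERT INTO article_tags VALUES (?, ?)", [
        (1, "sql"), (1, "olap"), (1, "index"),
        (2, "sql"), (2, "olap"),
        (3, "sql"),
        (4, "java"),
    ])
refresh_similarity(conn)
print(most_similar(conn, 1))  # [(2, 2), (3, 1)]
```

Pairs with no tag in common are never stored, which keeps the precomputed table smaller than the full cross product.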

how to manage millions/billions of small values in a "database"

I have an application that will generate millions of date/type/value entries. We don't need to do complex queries, only, for example, get the average value per day of type X between date A and date B.
I'm sure a normal DB like MySQL isn't the best fit for this sort of thing. Is there a better system for this sort of data?
EDIT: The goal is not to say that a relational database cannot handle my problem, but to find out whether another type of database (key/value, NoSQL, document-oriented, ...) would be better suited to what I want to do.

If you are dealing with a simple table as such:
CREATE TABLE myTable (
[DATE] datetime,
[TYPE] varchar(255),
[VALUE] varchar(255)
)
Creating an index, probably on TYPE, DATE, VALUE (in that order), will give you good performance on the query you've described. Use EXPLAIN PLAN, or whatever the equivalent is on the database you're working with, to review the performance metrics. Also set up a scheduled task to defragment that index regularly; the frequency will depend on how often inserts, deletes and updates occur.
As far as an alternative persistence store (i.e. NoSQL) goes, you don't gain anything. NoSQL shines when you want schema-less storage, in other words when you don't know the entity definitions ahead of time. But from what you've described, you have a very clear picture of what you want to store, which lends itself well to a relational database.
Possibilities for scaling over time include partitioning and moving the records of each TYPE into a separate table. The partitioning could be done by type and/or date. It really depends on the nature of the queries you're dealing with (for instance, whether you typically query for values within the same year) and on what your database offers in that regard.
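The index and the "average per day of type X between A and B" query might look like this. It is a minimal sketch in Python's `sqlite3`, assuming a numeric value column and ISO-formatted text timestamps; the table `entries`, the index name and the function `daily_average` are illustrative choices, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (
        ts    TEXT NOT NULL,   -- 'YYYY-MM-DD HH:MM:SS'
        type  TEXT NOT NULL,
        value REAL NOT NULL    -- numeric here so that AVG() is meaningful
    );
    -- Equality column first, then the range column, then the aggregated column.
    CREATE INDEX idx_entries_type_ts_value ON entries (type, ts, value);
""")

def daily_average(conn, entry_type, start, end):
    """Average value per day for one type, with start <= ts < end."""
    return conn.execute(
        """
        SELECT substr(ts, 1, 10) AS day, AVG(value)
        FROM entries
        WHERE type = ? AND ts >= ? AND ts < ?
        GROUP BY day
        ORDER BY day
        """,
        (entry_type, start, end),
    ).fetchall()

with conn:
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", [
        ("2024-01-01 10:00:00", "X", 10),
        ("2024-01-01 12:00:00", "X", 20),
        ("2024-01-02 09:00:00", "X", 30),
        ("2024-01-01 11:00:00", "Y", 99),
    ])

print(daily_average(conn, "X", "2024-01-01", "2024-01-03"))
```

Because all three columns are in the index, a query of this shape can usually be answered from the index alone; checking the query plan on the actual database is still the way to confirm it.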

MS SQL Server and Oracle offer the concept of partitioned tables and indexes.
In short: you can group your rows by some value, e.g. by year and month. Each group is then accessible as a separate table with its own index, so you can list, summarize and edit February 2011 sales without accessing all rows. Partitioned tables complicate the database, but for extremely long tables they can lead to significantly better performance.

Depending on cost, you can choose either MySQL or SQL Server. Be clear about what you want to achieve with the database: if it is just storage, any RDBMS can handle it.

You could store the data as fixed length records in a file.
Open the file for random access and binary-search it to find your start and end records (this assumes the records are written in date order), then sum the appropriate field of all records between the start index and the end index.
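A minimal sketch of the fixed-length-record idea in Python. The record layout (an 8-byte integer timestamp followed by an 8-byte float), the file name and the function names are all assumptions for illustration, and the records are assumed to be stored in timestamp order.

```python
import os
import struct
import tempfile

# Fixed-length record: 8-byte integer timestamp + 8-byte float value (assumed layout).
RECORD = struct.Struct("<qd")

def write_records(path, records):
    """Write (timestamp, value) pairs sorted by timestamp."""
    with open(path, "wb") as f:
        for ts, value in sorted(records):
            f.write(RECORD.pack(ts, value))

def _timestamp_at(f, index):
    """Read only the timestamp of the record at the given index."""
    f.seek(index * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))[0]

def _lower_bound(f, count, ts):
    """Index of the first record whose timestamp is >= ts (binary search)."""
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if _timestamp_at(f, mid) < ts:
            lo = mid + 1
        else:
            hi = mid
    return lo

def sum_between(path, start_ts, end_ts):
    """Sum the values of all records with start_ts <= timestamp < end_ts."""
    count = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        first = _lower_bound(f, count, start_ts)
        last = _lower_bound(f, count, end_ts)
        f.seek(first * RECORD.size)
        data = f.read((last - first) * RECORD.size)
    return sum(value for _, value in RECORD.iter_unpack(data))

path = os.path.join(tempfile.mkdtemp(), "type_x.dat")
write_records(path, [(10, 1.0), (20, 2.0), (30, 3.0), (40, 4.0)])
print(sum_between(path, 20, 40))  # 5.0
```

Each lookup touches only about log2(n) records before the one sequential read, but there is one file per type and no support for ad hoc queries, which is the price of skipping the database.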

Storing time-series data, relational or non?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection; unlike tables, collections can be created programmatically without setup. With these I could partition the storage of data for each device, or even for each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
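A sketch of the table without the Id column, using Python's `sqlite3` for illustration. The column names (`sampled_at` rather than timestamp) and the helpers `record` and `series` are hypothetical. `WITHOUT ROWID` is a SQLite-specific clause used here so that the composite primary key is the table's only index; other DBMSs express the same idea with a clustered primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_instances (
        device_id    INTEGER NOT NULL,
        metric_id    INTEGER NOT NULL,
        sampled_at   TEXT    NOT NULL,
        metric_value REAL    NOT NULL,
        PRIMARY KEY (device_id, metric_id, sampled_at)
    ) WITHOUT ROWID
""")

def record(conn, device_id, metric_id, sampled_at, value):
    """Store one sample; returns False if the key already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO data_instances VALUES (?, ?, ?, ?)",
                (device_id, metric_id, sampled_at, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate (device, metric, time) rejected by the key

def series(conn, device_id, metric_id, start, end):
    """Samples for one device and metric, located via the primary key."""
    return conn.execute(
        "SELECT sampled_at, metric_value FROM data_instances "
        "WHERE device_id = ? AND metric_id = ? "
        "AND sampled_at >= ? AND sampled_at < ? "
        "ORDER BY sampled_at",
        (device_id, metric_id, start, end),
    ).fetchall()

record(conn, 1, 2, "2024-01-01 00:00", 41.5)
record(conn, 1, 2, "2024-01-01 00:05", 43.0)
record(conn, 1, 3, "2024-01-01 00:00", 7.0)    # other metric, same device
record(conn, 2, 2, "2024-01-01 00:00", 55.0)   # other device, same metric

print(series(conn, 1, 2, "2024-01-01 00:00", "2024-01-02 00:00"))
```

The key both enforces uniqueness of (device, metric, time) and serves the graphing query, so no surrogate Id and no extra index are needed.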
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

I found the above answers very interesting.
Let me add a couple more considerations here.
1) Data aging
Time-series management usually needs aging policies. A typical scenario (e.g. monitoring server CPU) requires storing:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Relational models can certainly manage this appropriately (my company implemented massive centralized databases for some large customers with tens of thousands of data series), but the new breed of data stores adds interesting functionality worth exploring, such as:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
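The aging policy above can be sketched in plain Python, independent of any particular store. The function names `downsample` and `expire` are hypothetical, timestamps are assumed to be epoch seconds, and averaging is just one possible aggregate.

```python
from collections import defaultdict

def downsample(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-width time buckets.

    Each bucket is labelled with its start time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]

def expire(samples, now, max_age_seconds):
    """Keep only samples that are at most max_age_seconds old."""
    return [(ts, value) for ts, value in samples if now - ts <= max_age_seconds]

# 1-second raw samples rolled up into 5-minute (300 s) aggregates.
raw = [(0, 10.0), (1, 20.0), (299, 30.0), (300, 40.0)]
five_minute = downsample(raw, 300)
print(five_minute)  # [(0, 20.0), (300, 40.0)]

# Raw samples older than the retention window are then dropped.
recent_raw = expire(raw, now=400, max_age_seconds=150)
print(recent_raw)  # [(299, 30.0), (300, 40.0)]
```

In a relational setup the same two steps would typically run as a scheduled job; stores with built-in expiry take over the second step.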
2) Real-time collection
Even more importantly, some non-relational data stores are inherently distributed and allow for much more efficient real-time (or near-real-time) data collection. This can be a problem with an RDBMS because of hotspots (maintaining indexes while inserting into a single table). In the RDBMS space this problem is typically solved by falling back to batch import procedures (we managed it this way in the past), while NoSQL technologies have succeeded in massive real-time collection and aggregation (see Splunk, for example, mentioned in previous replies).

Your data lives in a single table, so relational vs. non-relational is not the question. Basically you need to read a lot of sequential data. If you have enough RAM to hold a year's worth of data, then there is nothing like using Redis/MongoDB etc.
Most NoSQL databases will store your data in the same location on disk, and in compressed form, to avoid multiple disk accesses.
NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With a relational database, even if you do this, the index and the data may be in different places, and there would be a lot of disk IO.
Tools like Splunk use NoSQL backends to store time-series data and then use map-reduce to create aggregates (which might be what you want later). So in my opinion NoSQL is an option, as people have already tried it for similar use cases. But will a million rows bring the database to a crawl? Maybe not, with decent hardware and proper configuration.

Create a file and name it 1_2.data. Weird idea? Here is what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
If you want it even more optimized, start thinking about splitting your files like this:
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
Or use kdb+ from http://kx.com, because they do all this for you :) A column-oriented store is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

You should look into time series databases. They were created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
A popular example of a time series database is InfluxDB.

I think that the answer to this kind of question should mainly revolve around the way your database utilizes storage.
Some database servers use RAM and disk, some use RAM only (optionally disk for persistence), etc.
Most common SQL database solutions use memory+disk storage and write the data in a row-based layout (every inserted row is written in the same physical location).
For time-series stores, in most cases the workload is something like: massive amounts of inserts at relatively short intervals, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric).
I have found that columnar databases (google it, you'll find MonetDB, InfoBright, ParAccel, etc.) do a terrific job for time series.
As for your question, which personally I think is somewhat invalid (as are all discussions using the faulty term NoSQL, IMO):
You can use a database server that talks SQL on the one hand, making your life very easy, since everyone has known SQL for many years and the language has been refined over and over again for data queries, but that still utilizes RAM, CPU cache and disk in a column-oriented way, making your solution the best fit for time series.

5 million rows is nothing for today's torrential data. Expect data to be in the TB or PB range in just a few months. At this point RDBMSs do not scale to the task and we need the linear scalability of NoSQL databases. Performance would be achieved through the columnar partitioning used to store the data: more columns and fewer rows, to boost performance. Leverage the OpenTSDB work done on top of HBase or MapR-DB, etc.

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.
