How do I programmatically build ad-hoc queries quickly? - pivot-table

I've used Excel PivotTable to analyze data from my database because it lets me "slice and dice" very quickly. Since we know what is in our database tables, we can all write SQL queries that do what PivotTable does.
But I am wondering why PivotTable can construct its queries so fast when it knows nothing about the data or the meanings of and relationships between the data fields we give it.
To put the question another way: how can we build ad-hoc SQL queries in such a fast and efficient way? ("Use PivotTable, of course!" Yep, but what I want is a programmatic way.)

Just manipulate your order and group clauses as necessary.
Excel is fast because all the data is in memory, and it can be sorted fast and efficiently.

@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it faster computationally. It's also possible that Excel pre-indexes datasets in a way that makes it more responsive than your database.
There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.
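For illustration, here is a rough sketch of that temp-table experiment. Every table and column name below is invented, and CREATE TABLE ... AS is not universal (SQL Server, for instance, uses SELECT ... INTO):
-- Materialise the joined data once, Excel-style, then run ad hoc
-- aggregations against the flat copy (illustrative names only).
CREATE TABLE tmpSalesFlat AS
SELECT c.customerName,
       c.region,
       o.orderDate,
       o.orderTotal
FROM   tblOrders    o
JOIN   tblCustomers c ON c.customerID = o.customerID;

-- Ad hoc questions now cost a scan of a single table, with no joins:
SELECT region, COUNT(*) AS orders, SUM(orderTotal) AS revenue
FROM   tmpSalesFlat
GROUP  BY region;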
One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally-normalized database will be far from optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc optimization for the sake of keeping all the data sensible. Although this is a poor example, consider this normalized schema:
+---------+    +----------+
|tblUsers |    |luGenders |
+---------+    +----------+
|userID   |    |genderID  |
|genderID |    |gender    |
+---------+    +----------+
SELECT * FROM luGenders;
> 1 Female
> 2 Male
If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, this is a bad example due to the low number of joins and the low number of possible values, which generally should bring about some database-engine optimisation). However, if you were to dump this data to Excel, you'd still incur some database penalty to pull the data, but actually pivoting the data in Excel would be fairly speedy. It could be that this up-front, fixed-cost penalty is what your impression of Excel being quicker than straight ad hoc queries overlooks, but I don't have the data to comment.
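For reference, the ad hoc version of that female/male count against the schema above has to go through the join:
SELECT g.gender, COUNT(*) AS user_count
FROM   tblUsers  u
JOIN   luGenders g ON g.genderID = u.genderID
GROUP  BY g.gender;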
The most tangential point, though, is that while general databases are good for accuracy, they often suck at ad hoc reports. To produce ad hoc reports, it's often necessary to de-normalize ("warehouse") the data in a more queryable structure. Looking up info on data warehousing will provide a lot of good results on the subject.
Moral of the story: having a fully algorithmic, fast ad hoc query system is an awesome ideal, but is less than practical given space and time constraints (memory and people-hours). To effectively generate an ad hoc system, you really need to understand the use cases of your data, and then denormalize it effectively.
I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA, I'm just a lowly analyst who spends 80 hours per week munging Excel and Oracle. I know your pain.

My intuitive feeling tells me that the answer would have something to do with a Pivot Table outline, which has a fixed number of zones, namely:
- the Page Fields zone
- the Column Fields zone
- the Row Fields zone and
- the Data zone
Here's my wild guess:
- The Page zone builds the WHERE part of the ad-hoc query.
- The Column zone puts whichever fields are dropped onto it into the GROUP BY clause.
- The Row zone builds a SELECT DISTINCT <field names>.
- The Data zone applies an AGGREGATE function to the field dropped onto it.
What do you think would happen "behind the scene" when we drag fields to those zones?
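For concreteness, if that guess is roughly right, dropping, say, Region onto Rows, ProductCategory onto Columns, Year onto Page and SalesAmount onto Data might generate a query of this shape (all names are invented; in practice both the Row and Column fields typically end up in the GROUP BY, and the client reshapes the flat result into a grid):
SELECT   Region,                           -- Row zone
         ProductCategory,                  -- Column zone
         SUM(SalesAmount) AS total_sales   -- Data zone (aggregate)
FROM     tblSales
WHERE    SalesYear = 2009                  -- Page zone filter
GROUP BY Region, ProductCategory;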

Related

Difference between SQL query aggregation and aggregating and querying an OLAP cube

I have a question about the advantages of building an OLAP cube versus aggregating data in a database table for querying, say, 6 months of data, and then archiving the SQL table later for analytics purposes.
Which one is better, a table or an OLAP cube, and why? After all, I can also aggregate and keep data in my tables and query the aggregated data as and when needed.
Short version: Like many development decisions, it depends.
Long version: I wouldn't say that one is "better" than the other - it's just that the two have separate uses and one or the other might be the better solution depending on what the requirements are.
If you have a few specific reports which require specific aggregations, then it might be simpler and easier for everyone involved to just aggregate that information in a table or a view, and point your reports at that.
As an example, if you know your users only want reports at a monthly level for a particular set of parameters - maybe your sales department want the monthly value of each salesperson's sales, for example - then your best bet might be to aggregate this up and pop it into a report where they can select the month and the salesperson, and get the number that they want.
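A rough sketch of what that aggregation could look like, with hypothetical table and column names (the date-part syntax varies by engine):
CREATE VIEW vwMonthlySalesBySalesperson AS
SELECT   salesperson_id,
         EXTRACT(YEAR  FROM sale_date) AS sale_year,
         EXTRACT(MONTH FROM sale_date) AS sale_month,
         SUM(sale_amount)              AS total_sales
FROM     sales
GROUP BY salesperson_id,
         EXTRACT(YEAR  FROM sale_date),
         EXTRACT(MONTH FROM sale_date);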
The benefits of this might be that it's quick to develop and provide to your users, there's not too much time spent testing as only a few figures need checking, etc. Your users also don't need to spend time being trained/learning to use a cube - reports are generally pretty easy for people to pick up and use.
But if your users want to be able to carry out much more open-ended analysis on their own terms then it's not much use if you need to go away and develop a report every time they have a new requirement. Your database might start getting very full of similar-but-different tables full of aggregated amounts. You could run into issues where one report ends up not agreeing with another for some reason - you might find you're dealing with the same data quality issues over and over again in each report.
In this case, it might make more sense to develop a cube over the top of data held at the lowest grain which your users want to analyse. In this way, they can essentially self-serve, rather than getting back in touch with you every time they need a new set of aggregated data. They can slice and dice through the data using multiple different "parameters" (dimensions in the OLAP world), rather than being limited by the nature of the reports.
Aggregated data still sometimes plays a role even when you have a cube in place, though. Sometimes performance gains can be found by aggregating data up to certain levels and holding it in a physical table, and getting your OLAP tool to use the physically aggregated data at that level instead of using its own aggregations - but this is an optimization step which would need careful consideration to see whether it's beneficial in terms of performance, whether the space vs. performance payoff is worthwhile, etc. I wouldn't worry about this aspect if you're just starting to look at OLAP, but wanted to note it for the sake of completeness.
To add to Jo's great answer, consider the grain of the facts that need to be aggregated and compared. If you have daily sales by product, but budgets by month and product category, you're going to need an aggregate fact table based on sales in order to compare against budgets. That would be further represented as two cubes in your OLAP database - a Sales cube and a Budget cube.
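A hypothetical sketch of that roll-up, bringing the daily, product-grain sales fact up to the month/category grain of the budget (all names are made up):
CREATE TABLE fact_sales_month_category AS
SELECT   p.category_id,
         EXTRACT(YEAR  FROM s.sale_date) AS sale_year,
         EXTRACT(MONTH FROM s.sale_date) AS sale_month,
         SUM(s.sale_amount)              AS actual_sales
FROM     fact_sales_daily s
JOIN     dim_product      p ON p.product_id = s.product_id
GROUP BY p.category_id,
         EXTRACT(YEAR  FROM s.sale_date),
         EXTRACT(MONTH FROM s.sale_date);
-- The grain now matches the budget (category, year, month), so the two
-- facts can be compared, or exposed as the Sales and Budget cubes.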
If there are very regular use cases which involve specific aggregated data, and this aggregated data would take a while to return from sql database tables then a cube might help.
If there are lots of potential ways in which your db table data needs to be sliced and diced at an aggregated level then there is definitely a good argument to start playing around with olap cubes.
In terms of sums of data, OLAP is a great aggregation tool. I'm not convinced that it is the best tool for distinct counts, though, so if your requirements include lots of distinct counts then maybe look elsewhere. Do you have the option of Tabular/PowerPivot/DAX?

database design for large number of users

Which one is better, a or b?
a) 7 tables for each user, e.g. user7messages, user7mail, etc. In this case, if we have 1000 users then there will be 7000 tables.
b) 7 tables in total, e.g. messages, mails, etc. All the messages or mails of every user will be in the same table.
In this case, for 1000 users we have only 7 tables.
In most cases, on modern hardware and with reasonable tuning, your database should be able to support tens of millions of records without too much pain, as long as your data really is relational. If you're searching for text, or storing hierarchical data, or storing documents, or running reports, there are alternative options (e.g. NoSQL).
Where at all possible, stick with the orthodox way of using relational databases; that means normalization, query tuning, using caches and throwing hardware at the problem.
Only once you've proven you have a performance problem is it worth looking at more exotic solutions. Within RDBMS world, that might mean partitioning the data (sorta kinda similar to your "table per user" idea). Alternatively, you might jump to NoSQL.
The problem with your "table per user" strategy is that you gain almost no benefit when querying by index (on a modern RDBMS, searching a table with 1 row or a table with a million rows makes almost no difference for finding the data when the index is hit). For actions that don't hit the index, you should see a decent gain - but that's usually a sign you're not really relational in the first place...
It makes developing the client application rather error prone, and more complicated than it needs to be, especially when creating moderately complex SQL queries (e.g. multi-table joins) - and tuning those queries will become much harder as a result. You won't be able to use the tools available to manage database queries (e.g. ORM tools), as these are all based on the "standard" relational model.
The biggest problem is changing the database - if you have to add an attribute to "message", you have to repeat that change over 7000 tables. You'll either spend a lot of time writing custom database management scripts, or have a human being repeat the same thing thousands of times (and make hard-to-spot mistakes).
Case B will be much better; just make sure that your users have a user_id type field that increments automatically, and link your tables together via that ID, e.g.
user_id  email
1000     hello
This will improve lookup speed, because you do not have to include functionality to pick a specific piece of data out of a search of thousands of tables (in that case it would mean searching the columns of table after table until the right table with the right column was found, which would be ludicrous).
And if you are searching a specific table (e.g. you only need messages), only 1 table is involved in the lookup, which is much faster, and it is easier to manage all the tables at an admin level.
But an even better idea would be 1 table with several columns, say a 'communications' table, which could look like:
user_id  email  messages
1000     hello  hi
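To make case (b) concrete, here is a minimal sketch of the shared-table layout with everything linked by user_id (types and names are assumptions, and the auto-increment syntax differs per engine):
CREATE TABLE users (
    user_id INT PRIMARY KEY,   -- make this auto-incrementing in your engine
    email   VARCHAR(255)
);

CREATE TABLE messages (
    message_id INT PRIMARY KEY,
    user_id    INT NOT NULL REFERENCES users (user_id),
    body       VARCHAR(4000)
);

-- Index the foreign key so per-user lookups stay fast with many users.
CREATE INDEX ix_messages_user_id ON messages (user_id);

-- "Only need messages" for one user:
SELECT body FROM messages WHERE user_id = 1000;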

Tableau Reading Data: Which table type is better?

What performs better (what returns queries faster) with Tableau (a read-only program) when Tableau is connected to tables of data through SQL Server? Multiple tall, thin tables that are joined or a single short and wide table?
The tall and thin tables have many rows but few columns and are joined. The short and wide table has fewer rows, but more columns.
I believe the tall and thin option returns queries faster because there is less redundant data, fewer columns (which makes indexing faster), fewer NULLs, and less indexing (because there are fewer columns), but I need at least a second opinion, so please let me know yours.
The reason I'm interested in this question is to improve query performance for our clients when they query our server for data to render their visualizations.
It depends largely on what you're trying to achieve. For some applications, it's better to have fewer entries with many fields, and for others it's better to have many entries with fewer fields.
Keep in mind that Tableau is not like Excel or SQL, meaning you should keep data manipulation to a minimum, as some calculations are not easy/possible to do in Tableau (and some are possible but involve exporting data and reconnecting to it). Tableau should be used mostly for data visualization purposes.
Additionally, it's very troublesome to compare different measures in the same chart. Meaning, if you want to compare sum(A) to sum(B), you'll have to plot 2 different charts (and not put both in the same one). I find it easier to have few measure fields and lots of dimensions. That way I can easily slice/compare measures. In that example, instead of having 1 entry with A and B measures, I would have 2 entries: one with the A measure and one dimension (saying it's A that is being measured), and one with the B measure and one dimension (in the same respective fields).
BUT that doesn't mean you should always go with "tall thin tables". You need to see what you're trying to achieve and what format better suits your needs (and Tableau's design). And unless you're working with really big tables and your analyses are run many times a day (or in real time) and performance is a very big issue, you should focus on what makes your life easier (especially when you have to change and adapt analyses later on).
And for performance, in Tableau I follow 3 rules:
1) Always extract (data to a tde) - it's way faster than most other data source formats (I didn't test them all, but it's way faster than csv, mdb, xls, or a direct SQL connection)
2) Never use Tableau links - unless it does not affect performance (e.g., nomenclature for a low-range field), it's better that all your information is already in the same database
3) Remove the trash - it's very appealing to have all possible information in a database, but it also comes at a performance cost. I try to keep only the information necessary for the analysis, within the limits of the flexibility I need. Filtering the data is ok, putting the filter in context is better, but filtering on the extract or in the data source itself is the best solution
After lots of researching, I've found a general answer. Generally, and especially with SQL Server and Tableau, you want to steer towards normalizing your tables, so you can avoid redundant data; your table then has less data to scan, making its queries faster to execute. However, you don't want to normalize your tables to the point where the joins between the tables actually cause the query to take longer than if the query were just being sent to one short, wide table. Ultimately, you're just going to have to test to see what amount of normalization/denormalization is best for the quickest query return.

Database Structure for hierarchical data with horizontal slices

We're currently looking at trying to improve the performance of queries for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (ie. we have full control over data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents, I'm thinking that some sort of Tree-based or Document based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystem's Cache is one Object Database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know on MS SQL that a switch we did from having some core queries use a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of THE most expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
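A hypothetical sketch of that kind of normalization, so joins and comparisons run on integers rather than VARCHARs (names and types are invented):
CREATE TABLE last_names (
    last_name_id INT PRIMARY KEY,
    last_name    VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE people (
    person_id    INT PRIMARY KEY,
    first_name   VARCHAR(100),
    last_name_id INT NOT NULL REFERENCES last_names (last_name_id)
);

-- The join now runs on integer keys; the text is only read at the end.
SELECT p.person_id, p.first_name, ln.last_name
FROM   people     p
JOIN   last_names ln ON ln.last_name_id = p.last_name_id;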
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
Document/Tree based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design -- I fail to see any? Querying one level up and down doesn't count: it is a simple join. Please have in mind that going "Document/Tree based database" route you would compromise your general querying ability. To summarize, just hire a competent db specialist who would analyze your performance bottlenecks -- they are usually cured with mundane index addition.
There's not really enough info here to say much useful - you'd need to measure things, look at "explains", etc. - but one option that goes beyond the usual indexing would be to shard by level3 instances. That would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - I have no idea what tools MySQL has for this).
And if your data volume isn't so high, then with sharding you might be able to get it down to SSDs...

Storing time-series data, relational or non?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
Which is better, a relational database (such as MySQL or PostgreSQL) or a non-relational/NoSQL database (such as MongoDB or Redis), with regard to performance when querying data for graphing?
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
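As a sketch (exact syntax varies by engine, and the timestamp column may need quoting since it clashes with a type name in some databases), a single composite index can serve the graphing query above with one range scan:
CREATE INDEX ix_data_device_metric_time
    ON data_instances (fk_to_device, fk_to_metric, timestamp);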
Non-Relational (NoSQL)
MongoDB has the concept of a collection; unlike tables, these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
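A sketch of the table with the natural composite key and no Id column; the column names follow the answer's (Device, Metric, DateTime) wording rather than the question's fk_* fields, and the data types are assumptions:
CREATE TABLE data_instances (
    device_id    INT            NOT NULL,
    metric_id    INT            NOT NULL,
    date_time    TIMESTAMP      NOT NULL,
    metric_value DECIMAL(18, 4) NOT NULL,
    PRIMARY KEY (device_id, metric_id, date_time)
);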
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
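This isn't the author's actual code, but a generic sketch of that kind of pivot using conditional aggregation, with metrics down the side and a couple of placeholder devices across the top:
SELECT   metric_id,
         AVG(CASE WHEN device_id = 1 THEN metric_value END) AS device_1,
         AVG(CASE WHEN device_id = 2 THEN metric_value END) AS device_2
FROM     data_instances
WHERE    date_time >= '2012-01-01'   -- restrict to the period being charted
GROUP BY metric_id;
-- Swapping device_id and metric_id in the SELECT/CASE pivots the other way.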
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.
I found the above answers very interesting. Let me try to add a couple more considerations here.
1) Data aging
Time-series management usually needs aging policies to be created. A typical scenario (e.g. monitoring a server's CPU) requires storing (a rough rollup sketch follows this list):
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
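A rough SQL sketch of the 5-minute rollup step above, with hypothetical table and column names; the time-bucketing expression and the scheduling of this job differ per engine and tool:
INSERT INTO cpu_samples_5min (device_id, bucket_start, avg_value, max_value)
SELECT device_id,
       -- bucket raw epoch-second timestamps into 5-minute (300 s) windows
       FLOOR(epoch_seconds / 300) * 300 AS bucket_start,
       AVG(cpu_value),
       MAX(cpu_value)
FROM   cpu_samples_raw
WHERE  epoch_seconds < :cutoff   -- placeholder bind parameter: only roll up data past the raw retention window
GROUP  BY device_id, FLOOR(epoch_seconds / 300) * 300;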
Although relational models can certainly manage this appropriately (my company implemented massive centralized databases for some large customers with tens of thousands of data series), the new breed of data stores adds interesting functionality to be explored, like:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly, some non-relational data stores are inherently distributed and allow for much more efficient real-time (or near-real-time) data collection, which can be a problem for an RDBMS because of the creation of hotspots (managing indexing while inserting into a single table). This problem in the RDBMS space is typically solved by reverting to batch import procedures (we managed it this way in the past), while NoSQL technologies have succeeded at massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
Your data lives in a single table, so relational vs. non-relational is not really the question. Basically you need to read a lot of sequential data. Now, if you have enough RAM to store a year's worth of data, then there's nothing like using Redis/MongoDB, etc.
Mostly, NoSQL databases will store your data in the same location on disk and in compressed form to avoid multiple disk accesses.
NoSQL does the same thing as creating an index on device id and metric id, but in its own way. With a database, even if you do this, the index and the data may be in different places and there would be a lot of disk IO.
Tools like Splunk use NoSQL backends to store time-series data and then use map-reduce to create aggregates (which might be what you want later). So in my opinion NoSQL is an option, as people have already tried it for similar use cases. But will a million rows bring the database to a crawl? Maybe not, with decent hardware and proper configuration.
Create a file, name it 1_2.data. Weird idea? Here's what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp, metric_value) to the file by appending the data, so you get an order by timestamp for free (assuming that your sources don't send out-of-order data for a device).
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
If you'd like it even more optimized, start thinking about splitting your files like this:
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
Or use kdb+ from http://kx.com, because they do all this for you :) Column-oriented is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru
You should look into time-series databases. They were created for this purpose.
A time-series database (TSDB) is a software system that is optimized for handling time-series data: arrays of numbers indexed by time (a datetime or a datetime range).
A popular example of a time-series database is InfluxDB.
I think that the answer to this kind of question should mainly revolve around the way your database utilizes storage.
Some database servers use RAM and disk, some use RAM only (optionally with disk for persistence), etc.
The most common SQL database solutions use memory+disk storage and write the data in a row-based layout (every inserted row is written to the same physical location).
For time-series stores, in most cases the workload looks like: massive amounts of inserts arriving at a relatively low interval, while reads are column-based (in most cases you want to read a range of data from a specific column, representing a metric).
I have found that columnar databases (google it, you'll find MonetDB, InfoBright, ParAccel, etc.) do a terrific job for time series.
As for your question, which personally I think is somewhat invalid (as are all discussions using the faulty term NoSQL - IMO):
You can use a database server that can talk SQL on one hand, making your life very easy since everyone has known SQL for many years and this language has been perfected over and over again for data queries, but that still utilizes RAM, CPU cache and disk in a column-oriented way, making your solution a best fit for time series.
5 million rows is nothing for today's torrential data. Expect data to be in the TB or PB range in just a few months. At that point an RDBMS does not scale to the task, and you need the linear scalability of NoSQL databases. Performance comes from the columnar partitioning used to store the data - a "more columns and fewer rows" kind of concept. Leverage the OpenTSDB work done on top of HBase or MapR-DB, etc.
I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.
