I'm working on a report where I have to get 3 different values making use of 3 queries:
Count of all our users
Users who purchased certain products in the last year
Users who have a certain attribute and have sent documents for signing
All these queries use tables from the same database. The first query takes under a second, the second one takes around 2 minutes and the third takes around 1 minute.
I can create the report in two ways:
Create one data set and get all the required values (see the sketch just below).
Create one data set for each of these queries and let them run in parallel.
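For reference, a simplified version of what the single data set (the first option) would look like, with made-up table and column names standing in for the real queries:

    SELECT
        (SELECT COUNT(*) FROM Users)                              AS TotalUsers,
        (SELECT COUNT(DISTINCT p.UserId)
         FROM   Purchases p
         WHERE  p.ProductId IN (1, 2, 3)                          -- the "certain products"
           AND  p.PurchaseDate >= DATEADD(YEAR, -1, GETDATE()))   AS PurchasedLastYear,
        (SELECT COUNT(DISTINCT d.SenderUserId)
         FROM   Documents d
         JOIN   Users u ON u.UserId = d.SenderUserId
         WHERE  u.SomeAttribute = 1)                              AS SentForSigning

The second option would instead be three data sets, each containing just one of those subqueries.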
If my understanding is correct, the first method would take around 3 minutes to execute, whereas the second would take around 2 minutes. In that case, creating multiple data sets seems much better for performance than creating one single data set. What do you think? Does running queries in parallel improve performance in this case? What is good practice in SSRS in this kind of scenario? Thanks!
I'm planning to split 1 schema into multiple schemas. This will allow me to run multiple cores with different document types; I will then use a join to get the related documents when needed.
At the moment I have multiple document types distinguished by a type field.
How will this change affect the performance?
As far as I know, when you join between cores, you can only return fields from one core (not the other).
In my opinion, Solr works best when it has to pull data from one location only. Joining can add overhead and slow the whole operation down.
However, consider the following situation: a user has 20 million records in one core and Solr has to search each and every record. If the user is able to separate them into two cores, one holding a small subset (say 1 million records) that is actually searched and the other holding the rest, then joining might be efficient in such a case.
Summary: it depends on how much data you have now and how much you will have once you split into multiple cores. If your situation is not like the above, then I suggest you look for some other alternative.
Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google Visualization API that powers Google Finance, with regard to time scaling: the longer the time period, the "smoother" (more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user's request. For example, if the user asks for a month of data the interface will request an average of every 86400 rows, which would return ~30 data points, whereas if the user asks for a day of data the interface will request an average of every 2880 rows, which will also return 30 data points but with more granularity. (A rough sketch of such a query follows the list of options.)
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
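To make Option A concrete, the kind of query I have in mind looks roughly like this (PostgreSQL; the bucket size in seconds would be chosen from the requested interval, e.g. 86400 gives one point per day, 2880 gives 30 points per day):

    -- average the raw per-second rows into fixed-size time buckets
    SELECT to_timestamp(floor(extract(epoch FROM time_stamp) / 86400) * 86400) AS bucket,
           avg(pressure)    AS avg_pressure,
           avg(temperature) AS avg_temperature,
           avg(speed)       AS avg_speed
    FROM   data
    WHERE  time_stamp BETWEEN ? AND ?     -- startTime / endTime from the UI
    GROUP  BY bucket
    ORDER  BY bucket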
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do instead is periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views that present summary data by reading from the summary tables where they have been populated, combined with the detail table for the ranges where the summary doesn't exist yet.
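As a very rough sketch of that approach (illustrative names; hourly granularity picked arbitrarily):

    -- summary table holding one row per hour
    CREATE TABLE data_hourly (
        bucket          timestamp PRIMARY KEY,
        avg_pressure    double precision,
        avg_temperature double precision,
        avg_speed       double precision
    );

    -- run from cron shortly after each hour rolls over
    INSERT INTO data_hourly (bucket, avg_pressure, avg_temperature, avg_speed)
    SELECT date_trunc('hour', time_stamp),
           avg(pressure), avg(temperature), avg(speed)
    FROM   data
    WHERE  time_stamp >= date_trunc('hour', now()) - interval '1 hour'
      AND  time_stamp <  date_trunc('hour', now())
    GROUP  BY 1;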
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
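The inheritance-based scheme described in those docs looks roughly like this, one child table per month (recent PostgreSQL releases also offer declarative PARTITION BY RANGE, which is simpler):

    -- child table holding one month of raw readings; the CHECK constraint
    -- lets the planner skip months outside the queried range
    CREATE TABLE data_2012_01 (
        CHECK (time_stamp >= DATE '2012-01-01' AND time_stamp < DATE '2012-02-01')
    ) INHERITS (data);

    CREATE INDEX data_2012_01_time_idx ON data_2012_01 (time_stamp);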
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against DB normalization, or any other standards/best practices? And what is the best way to do it? For example, should every table have a companion table for its aggregated data, should the aggregates be stored in the same table they describe, and when should the aggregated data be updated?
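To make it concrete, the kind of running tally I mean would be kept roughly like this (PostgreSQL-style, made-up tables):

    -- denormalised tally maintained by a trigger on the answers table
    ALTER TABLE questions ADD COLUMN answer_count integer NOT NULL DEFAULT 0;

    CREATE FUNCTION bump_answer_count() RETURNS trigger AS $$
    BEGIN
        UPDATE questions
           SET answer_count = answer_count + 1
         WHERE id = NEW.question_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER answers_after_insert
        AFTER INSERT ON answers
        FOR EACH ROW EXECUTE PROCEDURE bump_answer_count();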
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you then measure a lack of performance, investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table) that works like a view except that it is materialized from a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values. (A rough DDL sketch is at the end of this answer.)
2) In the software package itself I kept information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's only somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout' => 1234569783, 'count' => 15); // cached count plus a timeout used to decide when to refresh it
Nonetheless, however you store that single value it saves you from running it with every request.
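For (1), the DB2 DDL was roughly along these lines (names made up; the exact restrictions on the query depend on your DB2 version):

    -- deferred-refresh MQT holding the pre-computed count
    CREATE TABLE user_count_mqt AS
        (SELECT COUNT(*) AS user_count FROM users)
        DATA INITIALLY DEFERRED
        REFRESH DEFERRED;

    -- run on whatever schedule you like, e.g. every 5 minutes
    REFRESH TABLE user_count_mqt;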
Here's the issue:
The database is highly normalized, and one particular query relies on the multiple relationships in the database. The query is designed to join all the tables, construct the entire object, and then return a list of those objects.
In other words this particular query does a lot of work.
Now, the query only returns X items because it supports pagination, but we also need to know the total count of items.
Currently these two tasks are independent, but highly similar queries in our Domain Service. Ideally what I'd like to do is combine these two queries so that the call to the server happens once, rather than twice, and that the joins happen only once.
Output/Reference parameters don't work, and since the function is designed to return an IQueryable of items, I'm stuck on how to return this list of items as well as the total count.
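In raw SQL, the kind of combined query I'm imagining is a windowed count along these lines (illustrative names, SQL Server 2012+ paging syntax); what I can't see is how to surface that extra column through the IQueryable:

    -- every returned row carries the total number of matching rows
    SELECT   i.*,
             COUNT(*) OVER () AS TotalCount
    FROM     Items AS i                    -- stand-in for the big multi-table join
    WHERE    i.CategoryId = @categoryId
    ORDER BY i.Name
    OFFSET   @skip ROWS FETCH NEXT @take ROWS ONLY;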
I'm sure someone's come across this before - any thoughts?
A count over the joined tables is not the same thing as returning a subset of those records; they just happen to share a certain amount of SQL (specifically, the joins). RIA does the actual paging server-side, so you are actually getting a slightly different query for every paging call.
A count operation would also operate much faster than the record query as SQL counts can often be performed using database indexes only (although Linq may well optimise this for you to the same end result... Clever Linq coders!).
You would only request the total count once (on page load, I assume); after that you page through the data with multiple queries over different portions of it, hitting different parts of the database with every call.
You are better off treating them as two distinct functions (as you were) and wear the slight overhead of an additional server call. There is always somewhere else you could make bigger gains (caching etc).
When in doubt: Do not overcomplicate any process for the sake of only a very small gain.
If the problem is the client-server communication, you can put the count result in a header of the response.
We are building a new application in .NET 3.5 with a SQL Server database. The database is fairly large, having around 60 tables with loads of data. The .NET application has functionality to bring data into this database from data entry and from third-party systems.
After all the data is available in the database, the system has to do lots of calculations. The calculation logic is pretty complex. All the data required for the calculations is in the database, and the output also needs to be stored in the database. The data gathering happens every week, and the calculations need to be done every week to generate the required reports.
Given the above scenario, I was thinking of doing all these calculations in stored procedures. The problem is that we also need database independence, and stored procedures will not give us that. But if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly.
For example, I need to query one table which will return me 2000 rows, then for each row I need to query another table which will return me 300 results, then for each row of this I need to query multiple tables (around 10) to get the required data, do the calculation and store the output in another table.
Now my question: should I go ahead with the stored-procedure solution and forget about database independence, since performance is important? I also think development time will be much less with the stored-procedure solution. If any client wants this solution on, say, an Oracle database (because they don't want to maintain another database), then we port the stored procedures to Oracle and maintain two versions for any future changes/enhancements. Similarly, other clients may ask for other databases.
The 2000 rows I mentioned above are product SKUs. The 300 rows are the different attributes we want to calculate, e.g. handling cost, transport cost, etc. The 10 tables hold information about currency conversion, unit conversion, network, area, company, sell price, number sold per day, etc. The resulting table stores all the information as a star schema for analysis and reporting purposes. The goal is to capture every minute detail about a product so that one knows which attribute of selling a product is costing us money and where we can improve.
I wouldn't consider doing the data manipulation anywhere other than in the database.
Most people try to work with database data using looping algorithms. If you need real speed, think of your data as a SET of rows: you can update thousands of rows within a single UPDATE. I have rewritten so many cursor loops written by novice programmers into single UPDATE statements where the execution time was massively improved.
you say:
"I need to query one table which will return me 2000 rows, then for each row I need to query another table which will return me 300 results, then for each row of this I need to query multiple tables (around 10) to get required data"
From your question it looks like you are not using joins, and you are already thinking in loops. Even if you do intend to loop, it is much better to write a query that joins in all the necessary data and then loop over that. Remember that UPDATE and INSERT statements can have massively complex queries driving them. Throw in CASE statements, derived tables and conditional joins (LEFT OUTER JOIN) and you can solve just about any problem in a single UPDATE/INSERT.
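For example, something shaped like this (made-up table and column names) produces the whole result set in one pass instead of 2000 * 300 round trips:

    -- one set-based pass: every sku crossed with every attribute,
    -- joined to the lookup tables, calculated and stored in a single statement
    INSERT INTO sku_attribute_cost (sku_id, attribute_id, cost_usd)
    SELECT s.sku_id,
           a.attribute_id,
           CASE WHEN a.attribute_type = 'transport'
                THEN s.weight_kg * a.rate
                ELSE s.units_sold_per_day * a.rate
           END * c.rate_to_usd * u.factor
    FROM   skus s
           CROSS JOIN attributes a
           JOIN currency_conversion c ON c.currency_code = s.currency_code
           JOIN unit_conversion     u ON u.from_unit = s.unit
                                     AND u.to_unit   = a.unit
           LEFT OUTER JOIN price_override p ON p.sku_id       = s.sku_id
                                           AND p.attribute_id = a.attribute_id
    WHERE  p.sku_id IS NULL    -- e.g. skip anything that already has a manual override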
Well, without any specific details of what data you have in those tables, a back-of-the-napkin calculation shows that you're talking about processing about 6 million rows of information in the example you provided (2,000 rows * 300 rows * (1 row * 10 tables)).
Are all of these rows distinct, or are the 10 tables lookup information that has a relatively low cardinality? In other words, would it be possible to make a program that has the information from the 10 lookup tables in memory, and then just process the 300 row result set in memory to perform the calculations?
Also, I would be concerned about scalability -- if you do this in a stored procedure, it is guaranteed to be a serial process limited by the speed of the single database server. If you have the possibility of multiple copies of a client program, each processing a chunk of the 2,000 initial record set, then you can perform some of the calculations in parallel perhaps speeding up your overall processing time, as well as making it scalable for when your initial record set is 10 times larger.
Programming things like calculation code tend to be easier and more maintainable in C#. Also, normally keeping processing on the SQL Server to a minimum is a good practice since the database is the hardest to scale.
Having said that, from your description it sounds like the stored procedure approach is the way to go. When calculation code is dependent on large volumes of data, it's going to be more expensive to move the data off the server for calculation. So unless you have reasonable ways of optimizing access to the dependent data (such as caching lookup tables?), you are most likely going to find it more painful than it's worth not to use a stored proc.
Stored procedures every time, but as KM said, within those stored procedures keep iteration to a minimum; that is to say, use joins in your SQL. Relational databases are soooooo good at joining.
Database scalability will be a small issue, especially as it sounds like you'd be performing these calculations in a batch process.
Database independence doesn't really exist except for the most trivial of CRUD applications, so if your initial requirement is to get this all working with SQL Server, then leverage the tools that the RDBMS provides (after all, your client will have spent a great deal of money on it). If (and it's a big if) a subsequent client really, really doesn't want to use SQL Server, then you'll have to bite the bullet and code it up in another flavour of stored procedure. But then, as you identified ("if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly"), you've deferred that expense until if and when it's required.
I would consider doing this in SQL Server Integration Services (SSIS). I'd put the calculations into SSIS, but leave the queries as stored procedures. This would provide you database independence - SSIS can process data from any database with an ODBC connection - as well as high performance. Only the simple SELECT statements would be in stored procedures, and those are the parts of the SQL standard most likely to be identical across multiple database products (assuming you stick to standard forms of query).