My app requires daily reports based on various user activities. My current design does not store daily totals in the database, which means I must compute them every time.
For example, a report that shows the top 100 users based on the number of submissions they made on a given day.
If I have 50,000 users, what is the best way to create such a daily report?
How would I create monthly and yearly reports from such data?
If this is not a good design, how should I handle the situation where the report's metrics are not clear at database design time, and by the time they are clear we already have a huge amount of data with a limited set of fields?
Please advise.
Ideally, I would advise you to design your data model so that everything that needs to be reported on can be precomputed, minimizing the amount of querying that has to be done against the database. It sounds like you might not be able to do that, and in any case it is an approach that can be brittle and resistant to change.
With the release of version 1.3.1 of the SDK you now have access to query cursors, which makes it a good deal easier to generate reports over a large number of users. You could use App Engine cron jobs to put a job on a task queue that computes the numbers for the report.
Since any given invocation of your task is unlikely to complete in the time App Engine allows it to run, you'll have to pass the query cursor from one invocation to the next until it finishes.
This approach also lets you adapt to changes in your database and reporting needs, since the task that computes the report values is fairly easy to modify.
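Here is a minimal sketch of that pattern, assuming the Python runtime and the old google.appengine.ext.db API (import paths for the task queue varied slightly across early SDK versions). The Submission and DailyCount models and the /tasks/daily_report URL are hypothetical placeholders; a cron entry would enqueue the first task with just the day parameter and no cursor.

    from google.appengine.api import taskqueue
    from google.appengine.ext import db, webapp

    BATCH_SIZE = 500

    class Submission(db.Model):
        """Hypothetical model of one user submission."""
        user_id = db.StringProperty()
        day = db.StringProperty()          # e.g. '2010-03-01'

    class DailyCount(db.Model):
        """Hypothetical running total, keyed by '<day>:<user_id>'."""
        count = db.IntegerProperty(default=0)

    class DailyReportWorker(webapp.RequestHandler):
        def post(self):
            day = self.request.get('day')
            cursor = self.request.get('cursor')
            query = Submission.all().filter('day =', day)
            if cursor:
                query.with_cursor(cursor)      # resume where the previous task stopped
            batch = query.fetch(BATCH_SIZE)
            if not batch:
                return                         # finished: totals are in DailyCount
            for sub in batch:
                total = DailyCount.get_or_insert('%s:%s' % (day, sub.user_id))
                total.count += 1
                total.put()
            # Re-enqueue ourselves with the new cursor before hitting the deadline.
            taskqueue.add(url='/tasks/daily_report',
                          params={'day': day, 'cursor': query.cursor()})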
Related
I have started to build a Power BI dashboard to display various information about a lot of customers.
All my information is stored in an MS SQL Server database, and my main table has 30+ million rows (growing by roughly 1 million rows each month). The data is a mix of financial and dimension-type information.
Previously I created a monthly report in PowerPoint, but I want more flexibility and have therefore started to build the same “setup” in Power BI.
My issue is that performance is slower than I think it should be, especially if I am to distribute this to senior stakeholders in my company. Ideally I would like the numbers to appear almost instantly, rather than having to wait 10, 15, 30 or 60 seconds for things to load every time a filter changes.
Currently I am using DirectQuery.
I have now experimented with creating a number of aggregated tables in my database; I lose a lot of detail, but I am able to load data much, much faster (even with DirectQuery), almost instantly. My idea is to create a stored procedure that fires once a month, when new data is loaded, to update each of these aggregated tables with the newest data.
However, I feel this partly defeats the purpose of having a BI tool, as I lose the ability to drill down and do ad hoc investigations into the underlying parameters of each metric.
I have created some relationships in Power BI that give me some drill-down functionality, but not nearly as much detail as I would like.
What is best practice here? Is it common to use aggregated tables, even at the expense of drill-downs and detail? Or should I consider some other approach?
I have other datasets with even more detail (hundreds of millions of rows), so I am also looking for best practice for when I get around to those.
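To make the aggregated-table idea above concrete, here is a minimal sketch of such a monthly refresh, assuming SQL Server reached from Python via pyodbc; the table and column names (dbo.FactSales, dbo.AggSalesMonthly, and so on) are hypothetical placeholders. The Power BI report can then point at the aggregate table in DirectQuery mode, while a separate detail page can still query the base table for drill-down.

    import pyodbc

    REFRESH_SQL = """
    TRUNCATE TABLE dbo.AggSalesMonthly;

    INSERT INTO dbo.AggSalesMonthly (MonthStart, CustomerId, ProductGroup, Revenue, Units)
    SELECT  DATEFROMPARTS(YEAR(f.SaleDate), MONTH(f.SaleDate), 1),
            f.CustomerId,
            f.ProductGroup,
            SUM(f.Revenue),
            SUM(f.Units)
    FROM    dbo.FactSales AS f
    GROUP BY DATEFROMPARTS(YEAR(f.SaleDate), MONTH(f.SaleDate), 1),
             f.CustomerId,
             f.ProductGroup;
    """

    def refresh_monthly_aggregate(conn_str):
        """Rebuild the aggregate table; schedule this once a month after the load."""
        with pyodbc.connect(conn_str, autocommit=False) as conn:
            conn.execute(REFRESH_SQL)
            conn.commit()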
I need a general piece of advice, but for the record I use JPA.
I need to generate usage statistics, e.g. a breakdown of user purchases per product. I see three possible strategies: 1) generate the stats on the fly each time they are viewed, 2) maintain a dedicated stats table that I update each time there is a change, or 3) do offline processing at regular intervals.
All have issues and advantages, e.g. cost versus stale data, and I was wondering if anyone with experience in this field could provide some advice. I am aware the question is pretty broad; I can refine my use case if needed.
I've done a lot of reporting, and the first thing I always want to know is whether the stakeholder needs the data in real time or not. This definitely shifts how you think and how you'll design a reporting system.
Based on the size of your data, I think it's possible to do real time reporting. If you had data in the millions, then maybe you'd need to do some pre-processing or data warehousing (your options 2/3).
Some general recommendations:
If you want to do real time reporting, think about making a copy of the database so you aren't running reports against production data. Some reports can use queries that are heavy, so it's worth looking into replicating production data to some other server where you can run reports.
Use intermediate structures a lot for reports. Write views, stored procedures, etc., so every report isn't just some huge complex query (see the sketch below).
If the reports start to get too complex to handle at the database level, make sure you move the report logic into the application layer. I've been bitten by this many times: I start writing a report with queries purely at the database level, and eventually it gets too complex and I have to jump through hoops to make it work.
Shoot for real time and then go to stale data if necessary. Databases are capable of doing a lot more than you'd think. Quite often you can make changes to your database structures that will give you a big yield in performance.
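Here is a minimal sketch of the "intermediate structures" point above, assuming PostgreSQL and psycopg2 purely for illustration; the purchases/products tables and the view name are hypothetical. From JPA you could read such a view through a mapped entity or a native query.

    import psycopg2

    CREATE_VIEW = """
    CREATE OR REPLACE VIEW report_purchases_per_product AS
    SELECT p.product_id,
           p.product_name,
           COUNT(*)      AS purchase_count,
           SUM(o.amount) AS total_amount
    FROM   purchases o
    JOIN   products  p ON p.product_id = o.product_id
    GROUP  BY p.product_id, p.product_name;
    """

    with psycopg2.connect("dbname=reporting") as conn:
        with conn.cursor() as cur:
            cur.execute(CREATE_VIEW)
    # Reports (or a JPA native query) can now SELECT from the view instead of
    # repeating the join/aggregation in every report.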
I'm currently working on a home-automation project which lets users view their energy usage over a period of time. Currently we collect data every 15 minutes, and we are expecting around 2000 users for our first big pilot.
My boss is requesting that we store at least half a year of data. A quick calculation gives an estimate of around 35 million records. Although these records are small (around 500 bytes each), I'm still wondering whether storing them in our database (Postgres) is the right decision.
Does anyone have some good reference material and/or advice on how to deal with this amount of information?
For now, 35M records of 0.5K each means about 17.5 GB of data. This fits in a database for your pilot, but you should also think about the step after the pilot. Your boss will not be happy when the pilot is a big success and you have to tell him that you cannot add 100,000 users to the system in the coming months without redesigning everything. Moreover, what about a new feature for VIP users that requests data every minute...
This is a complex issue and the choice you make will restrict the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheaply as possible --> OK for a database. But tell your boss that you cannot open up the service like that, and that you will have to change things before taking on 10,000 new users per week.
One thing for the next release: have several data repositories: one for your user data that is updated frequently, one for your query/statistics system, ...
You could look at RRD (a round-robin database, such as RRDtool) for your next release.
Also keep in mind the update frequency: 2000 users updating data every 15 minutes means 2.2 updates per second --> OK; 100,000 users updating data every 5 minutes means 333.3 updates per second. I am not sure a simple database can keep up with that, and a single web service server definitely cannot.
We frequently hit tables that look like this. Obviously structure your indexes based on usage (do you read or write a lot, etc), and from the start think about table partitioning based on some high level grouping of the data.
Also, you can implement an archiving scheme to keep the live table thin. Historical records are either never touched or only reported on, neither of which is any good for a live table in my opinion.
It's worth noting that we have tables of around 100 million records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
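Here is a minimal sketch of the archiving idea, assuming PostgreSQL and psycopg2; the table names and the 90-day cutoff are hypothetical placeholders (the archive table is assumed to have the same column layout as the live one).

    import psycopg2

    ARCHIVE_SQL = """
    WITH moved AS (
        DELETE FROM energy_samples
        WHERE  sample_time < now() - interval '90 days'
        RETURNING *
    )
    INSERT INTO energy_samples_archive
    SELECT * FROM moved;
    """

    def archive_old_samples(dsn):
        """Move cold rows out of the live table in one transaction."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(ARCHIVE_SQL)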
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
First of all, I would suggest that you run a performance test: write a program that generates test entries corresponding to the number of entries you'll see over half a year, insert them, and check whether query times are satisfactory. If not, try indexing as suggested in the other answers. It is, by the way, also worth testing write performance to ensure that you can actually insert the amount of data you're generating in 15 minutes in... 15 minutes or less.
Making a test will avoid the mother of all problems - assumptions :-)
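Here is a minimal sketch of such a test, assuming PostgreSQL and psycopg2; the table layout and the roughly 500-byte payload are hypothetical, sized to the question's numbers (2000 users, one sample every 15 minutes, half a year of data).

    import random
    import time
    from datetime import datetime, timedelta

    import psycopg2
    from psycopg2.extras import execute_values

    USERS = 2000
    SAMPLES_PER_USER = 182 * 24 * 4          # half a year of 15-minute samples

    def load_test(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("""CREATE TABLE IF NOT EXISTS energy_samples (
                               user_id int, sample_time timestamptz,
                               watts real, payload text)""")
            base = datetime(2024, 1, 1)
            started = time.time()
            for user_id in range(USERS):
                rows = [(user_id, base + timedelta(minutes=15 * i),
                         random.random() * 3000, 'x' * 400)
                        for i in range(SAMPLES_PER_USER)]
                execute_values(cur, "INSERT INTO energy_samples VALUES %s",
                               rows, page_size=10000)
            print("inserted %d rows in %.1fs"
                  % (USERS * SAMPLES_PER_USER, time.time() - started))

            started = time.time()
            cur.execute("SELECT count(*) FROM energy_samples WHERE user_id = 42")
            print("count for one user: %s in %.3fs"
                  % (cur.fetchone()[0], time.time() - started))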
Also think about production performance: your pilot will have 2000 users, but will your production environment have 4000 users or 200,000 users in a year or two?
If we're talking about a really big environment, you need to think about a solution that allows you to scale out by adding more nodes, instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application, by keeping track of which of multiple database machines hosts the details for a specific user, or you can use one of the PostgreSQL clustering methods, or you could go a completely different path: the NoSQL approach, where you walk away from the RDBMS entirely and use systems built to scale horizontally.
There are a number of such systems; I only have personal experience with Cassandra. You have to think completely differently from what you're used to in the RDBMS world, which is something of a challenge: think more about how you want to access the data rather than how to store it. For your example, I think storing the data with the user id as the row key, and then adding one column per sample with the column name being the timestamp and the column value being your data for that timestamp, would make sense. You can then ask for slices of those columns, for example for graphing results in a web UI; Cassandra has good enough response times for UI applications.
The upside of investing time in learning and using a NoSQL system is that when you need more space, you just add a new node. The same goes if you need more write performance or more read performance.
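Here is a minimal sketch of the per-user time-series layout described above, using the DataStax Python driver. In modern CQL the "column name = timestamp" wide-row layout becomes a clustering column; the keyspace, table and column names are hypothetical.

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("energy")   # keyspace assumed to exist

    session.execute("""
        CREATE TABLE IF NOT EXISTS samples (
            user_id     int,
            sample_time timestamp,
            watts       double,
            PRIMARY KEY (user_id, sample_time)           -- one wide row per user
        ) WITH CLUSTERING ORDER BY (sample_time DESC)
    """)

    # Write one sample.
    session.execute(
        "INSERT INTO samples (user_id, sample_time, watts) VALUES (%s, %s, %s)",
        (42, datetime(2024, 1, 1, 0, 15), 1234.5))

    # Read a slice of one user's samples, e.g. for graphing in a web UI.
    rows = session.execute(
        "SELECT sample_time, watts FROM samples "
        "WHERE user_id = %s AND sample_time >= %s AND sample_time < %s",
        (42, datetime(2024, 1, 1), datetime(2024, 1, 8)))
    for row in rows:
        print(row.sample_time, row.watts)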
Are you not better off not keeping individual samples for the full period? You could implement some sort of consolidation mechanism that rolls weekly/monthly samples up into a single record, and run that consolidation on a schedule.
Your decision has to depend on the type of queries you need to be able to run against the database.
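Here is a minimal sketch of such a consolidation job, assuming PostgreSQL and psycopg2; the table names and the 30-day cutoff are hypothetical, and you would run it from cron or whatever scheduler you already use.

    import psycopg2

    CONSOLIDATE_SQL = """
    INSERT INTO energy_samples_daily (user_id, day, avg_watts, max_watts, sample_count)
    SELECT user_id,
           date_trunc('day', sample_time) AS day,
           avg(watts), max(watts), count(*)
    FROM   energy_samples
    WHERE  sample_time < now() - interval '30 days'
    GROUP  BY user_id, date_trunc('day', sample_time);

    DELETE FROM energy_samples
    WHERE sample_time < now() - interval '30 days';
    """

    def consolidate(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(CONSOLIDATE_SQL)   # both statements run in one transaction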
There are lots of techniques for handling this problem. You will only get good performance if you touch the minimum number of records. In your case you can use the following techniques.
Try to keep old data in a separate table. You can use table partitioning for this, or take a different approach where you store old data in the file system and serve it directly from your application without connecting to the database; that way your database stays free. I am doing this for one of my projects, which already has more than 50 GB of data, and it is running very smoothly.
Index your table columns, but be careful, as indexes will affect your insertion speed.
Try batch processing for your insert or select queries; you can handle this issue very smartly that way (see the sketch below).
Example: suppose you get a request to insert a record into some table every second. You can build a mechanism that processes these requests in batches of 5 records, so that you only hit the database every 5 seconds, which is much better. Yes, you can make users wait a few seconds for their record to be inserted, much like Gmail when you send an email and it shows a "processing" message. For selects, you can periodically write your result sets to the file system and serve them to users directly without touching the database, as most stock-market data companies do.
You can also use an ORM like Hibernate, which uses caching techniques to boost the speed of your data access.
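Here is a minimal sketch of the batching idea, assuming PostgreSQL and psycopg2; the table name and the batch size of 5 are hypothetical placeholders.

    import psycopg2
    from psycopg2.extras import execute_values

    class BatchedWriter(object):
        """Buffer incoming inserts and flush them to the database in groups."""

        def __init__(self, dsn, batch_size=5):
            self.conn = psycopg2.connect(dsn)
            self.batch_size = batch_size
            self.buffer = []

        def add(self, user_id, sample_time, watts):
            self.buffer.append((user_id, sample_time, watts))
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            with self.conn, self.conn.cursor() as cur:   # one transaction per flush
                execute_values(cur,
                    "INSERT INTO energy_samples (user_id, sample_time, watts) VALUES %s",
                    self.buffer)
            self.buffer = []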
For any further questions you can mail me at ranjeet1985#gmail.com
I'm working for a company running a software product based on an MS SQL Server database, and over the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are happy with it.
But it has some drawbacks:
New changes can be quite development-intensive
The user can't experiment much with the data - it is locked to a hard-coded view
It can be slow for big reports
I am considering gradually moving to an OLAP-based approach, which can be queried from Excel or some web-based service. But I would like to do this in a way that introduces the least amount of new complexity in the IT environment - as few different services, synchronization jobs, etc. as possible!
I have some questions in this regard:
1) Workflow-related:
What is a good development route from "black box SQL server" to "OLAP ready to use"?
Which servers and services should be set up, and which scripts should be written?
Which are the hardest/most critical/most time-intensive parts?
2) ETL:
I suppose it is best to have separate servers for the data warehouse and the production SQL Server?
How are these kept in sync (push/pull)? Using which technologies/languages?
To me, SSIS looks overly complicated, and the graphical workflow doesn't appeal much to me -- I would rather have a text-based script that does the job. Is this feasible?
Or is it advantageous to use the graphical client with only one source and one destination?
3) Development:
How much of this (data integration, analysis services) can be efficiently maintained from a CLI-tool?
Can the setup be transferred back and forth between production and development easily?
I'm happy with any answer that covers just some of this - and even though it is an MS environment, I'm also interested to hear about the advantages of other technologies.
I only have experience with Microsoft OLAP, so here are my two cents regarding what I know:
If you are implementing cubes, then separate the production SQL Server from the source for the cubes. Cubes require a lot of SELECT DISTINCT column_name FROM source.table. You don't want cube processing to block your mission critical production system.
Although you can implement OLAP cubes with standard relational tables, you will quickly find that unless your data is a ledger-style system, you will probably need to fully reprocess your fact and dimension tables, and this will require requerying the source database over and over again. That's a strong argument for building a separate data warehouse that uses ledger-style transactions for the fact tables. For instance, if a customer orders something and then cancels it, your source system may track this as a status change. In your fact table, you probably need to show this as one row for the order, with a positive quantity and revenue stream, and one row for the cancellation, with a negative quantity and revenue stream.
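As a tiny illustration of the ledger style (with a hypothetical column layout and values): the cancellation becomes a second fact row with negative quantity and revenue, rather than an update to the original row.

    # (date_key, order_id, product, quantity, revenue)
    fact_order_lines = [
        (20240105, "ORD-1001", "WIDGET", +1, +100.00),   # order placed
        (20240108, "ORD-1001", "WIDGET", -1, -100.00),   # order cancelled
    ]

    net_quantity = sum(row[3] for row in fact_order_lines)   # 0
    net_revenue = sum(row[4] for row in fact_order_lines)    # 0.0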
OLAP may be overkill for your environment. The main issue you appear to be raising is that your reports are static and users want access to the data directly. You could build a data model and give users Report Builder access in SSRS, or report-writing access in some other BI suite like Cognos, Business Objects, etc. I don't generally recommend this approach, since it is way beyond what most users should have to know to get data, but in a small shop this may be sufficient and it is easy to implement. Let's face it -- users generally just want to get the data into Excel to manipulate it further. So if you don't want to give them a web front-end and you just want them to get to the data from Excel, you could give them direct database access to a copy of the production data. The downside of this approach is that users don't generally understand SQL or database relationships. OLAP helps you avoid forcing users to learn SQL or relationships, but it isn't easy to implement on your end. If you only have a couple of power users who need this kind of access, it could be easy enough to teach those few power users how to do basic queries in Excel against the database, and they will be happy to get that tomorrow. OLAP won't be ready by tomorrow.
If you only have a few kinds of source data systems, you could get away with building a super-dynamic static report. For instance, I have a report that was written in C# that basically allows users to select as many columns as they want from a list of 30 columns and filter the data on a few date range fields and field filter lists. This simple report covers about 40% of all ad hoc report requests from end-users since it covers all the basic, core customer metrics and fields. We recently moved this report to SSRS and that allowed us to up the number of fields to about 100 and improved the overall user experience. Regardless of the reporting platform, it is possible to give users some dynamic flexibility even in the confines of a static reporting system.
If you only have a couple of databases, you can probably back up and restore the databases as your ETL. However, if you want to do anything beyond that, then you might as well bite the bullet and use SSIS (or some other ETL tool). Once you get into ETL for data warehousing, you are going to use a graphically oriented design tool. Coding works well for applications, but ETL is more about workflows, and that's why the tools tend to converge on a graphical UI. You can work around this and try to code a data warehouse from a text editor, but in the end you are going to lose out on a lot. See this post for more details on the differences between loading data from code and loading data from SSIS.
FEEDBACK ON HOW TO USE CUBES WITH A RELATIONAL DATA STORE
It is possible to implement a cube over a relational data store, but there are some major problems with this approach. The main reason it is technically feasible has to do with how you configure your DSV (Data Source View). The DSV is essentially a logical layer between the physical database and the cube/dimension definitions. Instead of importing the relational tables into the DSV, you can define Named Queries or create views in the database that flatten the data.
The advantages of this approach are as follows:
It is relatively easy to implement since you don't have to build an entire ETL subsystem to get started with OLAP.
This approach works well for prototyping how you want to build a more long-term solution. You can prototype it in 1-2 days and show some of the benefits of OLAP today.
Some very, very large tables don't have to be completely duplicated just to support an OLAP cube. I have several multi-billion-row tables that are almost completely standardized fact tables. The only columns they don't have are date keys, and they also contain some NULL values in fields that shouldn't have nulls at all. Instead of duplicating these massive tables, you can create the surrogate date keys and set values for the nulls in the view or named query (a sketch of such a view follows this list). If you aren't going to see a huge performance boost from duplicating the table, then it may be a candidate for leaving in a more raw format in the database itself.
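Here is a minimal sketch of the kind of flattening view described in the last point, assuming SQL Server reached via pyodbc; the schema, table, view and column names are hypothetical, and conversion style 112 produces a yyyymmdd surrogate date key.

    import pyodbc

    FLATTEN_VIEW = """
    CREATE VIEW dsv.FactOrdersFlat AS
    SELECT  CONVERT(int, CONVERT(char(8), o.OrderDate, 112)) AS DateKey,  -- surrogate date key
            o.OrderId,
            o.ProductId,
            COALESCE(o.Quantity, 0)  AS Quantity,   -- patch NULLs the cube can't handle
            COALESCE(o.Revenue, 0.0) AS Revenue
    FROM    dbo.Orders AS o;
    """

    with pyodbc.connect("DSN=warehouse") as conn:
        conn.execute(FLATTEN_VIEW)
        conn.commit()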
The disadvantages of this approach are as follows:
If you haven't built a true Kimball-method data warehouse, then you probably aren't tracking transactions in a ledger style. Kimball-method fact tables (at least as I understand them) always change values by adding and subtracting rows. If someone cancels part of an order, you can't update the value in the cube for that single transaction. Instead, you have to balance out the transaction with a negative value. If you have to update the transaction, then you will have to fully reprocess the partition of the cube to replace the value, which can be a very expensive operation. Unless your source system is a ledger-style transaction system, you will probably have to build a ledger-style copy in your ETL subsystem.
If you don't build a Kimball-method data warehouse, then you are probably using unobscured and possibly non-integer primary keys in your database. This directly impacts query performance inside the cube. It also sets you up for a theoretically inflexible data warehouse. For instance, if you have a product ordering system that uses an integer key and you start using a second product ordering system, either as a replacement for the legacy system or in tandem with it, you may struggle to combine the data merely through the DSV, since each system has different data points, metrics, workflows, data types, etc. Worse, if they have the same data type for the order id and the order id values overlap between systems, then you must declare a surrogate key that you can use across both systems. This can be difficult, but not impossible, to implement without a flattened data warehouse.
You may have to build the system twice if you start with the relational data store and then move to a flattened database. Frankly, I think the amount of duplicated work is trivial. Most of what you learn building the cube off a relational data store will translate to setting up the new OLAP cube. The main problem, though, is that you will probably create a new cube altogether, and then any users of the old cube will have to migrate to the new one. Any reports built in SSRS or Excel will probably break at that point and need to be rewritten from the ground up. So the main cost of rebuilding the cube is really in rebuilding the dependent objects -- not in rebuilding the cube itself.
Let me know if you want me to expand on any of the above points. Good luck.
You're basically asking the million dollar question of "How do I build a DWH". This is not really a question that can decisively be answered.
Nevertheless, here is a kickstart:
If you are looking for a minimum viable product, be aware that you are in a data environment, and not a pure software one. In data-heavy environments, it is much harder to incrementally build a product, because the amount of effort to introduce changes in the system is much greater. Think about it as if every change you make in a piece of software has to be somehow backwards-compatible with anything you've ever done. Now you understand the hell Microsoft are in :-).
Also, data systems involve many third-party tools such as DBs, ETL tools and reporting platforms. The choices you make should be viable for the expected development of your system, else you might have to completely replace these tools down the road.
While you could start with a DB clone based on simple copy SQL statements and then aggregate it or push it into an OLAP cube, I would recommend getting your hands dirty with a real ETL tool from the start. This is especially true if you foresee the need to grow; 9 times out of 10, the need will grow.
MS-SQL is a good choice for a DB if you don't mind the cost. The natural ETL tool would be SSIS, and it's a solid tool as well.
Even if your first transformations are merely "take this table and dump it in there", you still gain a lot in terms of process management (has the job run? What happens if it fails? etc) and debugging. Also, it is easier to organically grow as requirements and/or special cases have to be dealt with.
I have run into the following situation several times, and was wondering what best practices say about this situation:
Rows are inserted into a table as users complete some action. For example, every time a user visits a specific portion of a website, a row is inserted indicating their IP address, username, and referring URL. Elsewhere, I want to show summary information about those actions. In our example, I'd want to allow administrators to log onto the website and see how many visits there are for a specific user.
The most natural way to do this (IMO) is to insert a row for every visit and, every time the administrator requests totals, count up the number of rows in the appropriate table for that user. However, in situations like these there can be thousands and thousands of rows per user. If administrators frequently request the totals, constantly requesting the counts could create quite a load on the database. So it seems like the right solution is to insert individual rows but simultaneously keep some kind of summary data with running totals as data is inserted (to avoid recalculating those totals over and over).
What are the best practices or the most common database schema designs for this situation? You can ignore the specific example I made up; my real question is how to handle cases like this, dealing with high-volume data and frequently requested totals or counts of that data.
Here are a couple of practices; the one you select will depend upon your specific situation:
Trust your database engine. Many database engines will automatically cache the query plans (and results) of frequently used queries. Even if the underlying data has changed, the query plan itself will remain the same. Appropriate parts of indexes will be kept in main memory, making rerunning a given query almost free. The most you may need to do in this case is tune the database's parameters.
Denormalize your database. While third normal form (3NF) is still considered the appropriate database design, for performance reasons it can become necessary to add additional tables that hold summary values that would normally be calculated as needed via a SELECT ... GROUP BY ... query. Frequently these extra tables are kept up to date by triggers, stored procedures, or background processes (see the sketch below). See Wikipedia for more about denormalization.
Data warehousing. With a data warehouse, the goal is to push copies of live data to secondary databases (warehouses) for query and special reporting purposes. This is usually done with background processes, using whatever replication techniques your database supports. These warehouses are frequently indexed more rigorously than may otherwise be needed for your base application, with the intent to support large queries with massive amounts of historical data.
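Here is a minimal sketch of the denormalized running-total approach from the second point, assuming PostgreSQL and psycopg2; the visits and visit_totals tables are hypothetical (visit_totals needs a unique constraint on username), and a trigger or background process could maintain the summary instead of doing it inline in the application as shown here.

    import psycopg2

    def record_visit(dsn, username, ip_address, referrer):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            # 1. Keep the detail row so drill-down is still possible.
            cur.execute(
                "INSERT INTO visits (username, ip_address, referrer, visited_at) "
                "VALUES (%s, %s, %s, now())",
                (username, ip_address, referrer))
            # 2. Maintain the running total in the same transaction.
            cur.execute(
                "INSERT INTO visit_totals (username, visit_count) VALUES (%s, 1) "
                "ON CONFLICT (username) DO UPDATE "
                "SET visit_count = visit_totals.visit_count + 1",
                (username,))

    def visit_count(dsn, username):
        """Read the precomputed total instead of counting detail rows."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT visit_count FROM visit_totals WHERE username = %s",
                        (username,))
            row = cur.fetchone()
            return row[0] if row else 0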