Database solution for self-service

I'm an analyst preparing Tableau reports with analysis for other teams. I would like to take some of the workload off my shoulders by creating a data source optimized well enough that users can get the data they need and do the analysis by themselves.
Current situation:
We use Amazon Redshift. We have tables with raw data coming directly from the source systems, as well as some transformed tables for easier work. All in all, it's tens and tens of tables. We are using Tableau Desktop and Tableau Server.
Desired situation:
I would like to retain access to the raw data so I can backtrack any potential issues back to the original source. From the raw data, I would like to create transformed tables that will allow users to make queries on them (two-layer system). The tables should contain all the data a user might need, yet be simple enough for a beginner-level SQL user.
I see two ways of approaching this:
Small number of very large tables with all the data. If there are just a couple of tables containing the maximum amount of data, the user can query a single table and ask only for the columns they need, or, if necessary, join one or two more tables to it.
Many small and very specialized tables. Users will have to do multiple joins to get the data they need, but each table will be very simple, so the joins should not be difficult.
Also, access permissions to the data need to be considered.
What do you think is a good approach to solving my issue? Is it one of the two solutions mentioned above? Do you have another solution? What would you recommend?

We had this problem and we sorted it out with AWS Athena. You pay only when data is scanned by a query; otherwise you pay nothing and no data is touched.
With AWS Athena you can create any set of tables with different attributes, and the role permissions are easy to maintain.
The last part to cover: Tableau has a direct interface to Athena, so there is no need for any intermediate storage.
Also, any time you no longer want a table, just delete it and remove it from the roles; the rest is taken care of automatically.
On an additional note, we tried Redshift Spectrum on JSON data and it does not work with nested JSON yet, so all your attributes should be only one level deep.
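For reference, defining such a flat table in Athena looks roughly like this; the database, table and column names and the S3 location below are just placeholders, not something from our setup:

    -- Sketch only: an Athena external table over flat (one-level-deep) JSON files in S3.
    -- All names and the S3 path are placeholders.
    CREATE EXTERNAL TABLE analytics.orders_json (
      order_id     string,
      customer_id  string,
      order_date   string,
      total_amount double
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-bucket/orders/';

Access to each table is then controlled through the role permissions mentioned above.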
Hope it helps.
EDIT1:
Redshift is a columnar database, so there is no real difference between small tables and big tables; if you can, avoid joins by not splitting into smaller tables. Even if a table is bigger, your query speed depends on the fields involved in the query: a field that is not required by the query is never touched when you query the data.
I prefer to have all related data in one bigger table, so there is no need to duplicate relations or join across tables.
Also, you need to ensure there is not too much duplication of data when you store everything in a bigger table.
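As a rough sketch of the bigger-table approach in Redshift, you could build one wide, denormalized reporting table from the raw tables with CREATE TABLE AS; all schema, table and column names here are invented for illustration, and you would pick DISTKEY/SORTKEY to match your own queries:

    -- Sketch only: one wide reporting table built from placeholder raw tables.
    CREATE TABLE reporting.orders_wide
    DISTKEY (customer_id)
    SORTKEY (order_date)
    AS
    SELECT o.order_id,
           o.order_date,
           c.customer_id,
           c.customer_name,
           c.region,
           p.product_id,
           p.product_name,
           p.category,
           oi.quantity,
           oi.unit_price
    FROM   raw.orders      o
    JOIN   raw.order_items oi ON oi.order_id   = o.order_id
    JOIN   raw.customers   c  ON c.customer_id = o.customer_id
    JOIN   raw.products    p  ON p.product_id  = oi.product_id;

Because Redshift is columnar, users querying only a few columns of this wide table do not pay for the columns they leave out.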
For more about database normalization, see: MySQL: multiple tables or one table with many columns?

How do I minimize storage impact (bytes) of audit columns in database tables?

I'm a senior database developer who has always practiced creating four auditing columns on most database tables, if not all, as follows:
DATE_INSERTED
USER_INSERTED
DATE_UPDATED
USER_UPDATED
The reason I wish to capture this information is NOT to comply with some external auditing requirement like Sarbanes-Oxley. It's simply for troubleshooting purposes: when Development is asked to investigate some data scenario in the database, knowing who originally inserted a record and when, as well as who last updated it and when, strongly aids the troubleshooting effort. Needing to store every version/state of a record that ever existed is probably overkill, but might be useful in some cases.
I'm new to both Azure SQL Database (essentially SQL Server) and its memory-optimized (in-memory) tables, which I'm planning to implement. The number of bytes you can store in memory database-wide is limited, so I'm being very conscious of limiting the number of bytes I put into memory, especially for large tables. These four auditing columns would eat up a significant number of in-memory bytes if I added them to large in-memory tables, and I'd love to avoid that. The audit columns will never be queried or displayed in app code or reports, but they will likely be populated by a combination of column default values, triggers and app code.
My question is whether there's a good data modeling strategy to keep these four columns out of memory while keeping the remainder of the table in memory. The worst-case scenario seems to be creating the main in-memory table and then creating a separate on-disk table with the 4 audit columns and a UNIQUE FOREIGN KEY column pointing to the in-memory table (thereby creating a 1-to-1 FK instead of 1-to-many). But I was hoping there was a more elegant approach to accomplish this in-memory/on-disk split, perhaps leveraging some SQL Server feature I haven't been able to find. As a bonus, it'd be nice if the main table and the audit columns appeared as a single table to SQL without having to implement a database view.
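For illustration, the split I have in mind would look roughly like this (table and column names are placeholders, and it assumes the database tier supports memory-optimized tables):

    -- Sketch only: main table in memory, audit columns pushed to a disk-based side table.
    CREATE TABLE dbo.Orders
    (
        OrderId    INT           NOT NULL PRIMARY KEY NONCLUSTERED,
        CustomerId INT           NOT NULL,
        OrderTotal DECIMAL(18,2) NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

    -- Disk-based 1-to-1 side table holding only the four audit columns.
    -- Note: SQL Server does not allow a foreign key between a disk-based and a
    -- memory-optimized table, so the 1-to-1 link here is just the shared key value.
    CREATE TABLE dbo.Orders_Audit
    (
        OrderId       INT       NOT NULL PRIMARY KEY,  -- same value as dbo.Orders.OrderId
        DATE_INSERTED DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
        USER_INSERTED SYSNAME   NOT NULL DEFAULT SUSER_SNAME(),
        DATE_UPDATED  DATETIME2 NULL,
        USER_UPDATED  SYSNAME   NULL
    );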
Thanks in advance for any suggestions!

Need for OLAP cubes if we can Build views based directly off the RAW table

Assume that the tables in the source data are clean and in a state where they can be used directly.
I am trying to understand whether building views based off the raw tables is better than creating cubes. To make the views dynamic, we can have a .NET application that takes parameters for the view, executes the view with those parameters, and gets the data for reporting and analysis.
For example, if I want to view the sales of a product for the United States in the month of February, I can create a view joining Product and Customer to get the sales for a particular day in February.
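As a rough illustration of what I mean (all table and column names here are made up):

    -- Sketch only: a reporting view over made-up tables; filters are applied by the caller.
    CREATE VIEW dbo.vw_ProductSalesByCountry
    AS
    SELECT p.ProductName,
           c.Country,
           s.SaleDate,
           s.SaleAmount
    FROM   dbo.Sales    s
    JOIN   dbo.Product  p ON p.ProductId  = s.ProductId
    JOIN   dbo.Customer c ON c.CustomerId = s.CustomerId;

    -- The .NET application would then pass the "parameters" as a WHERE clause, e.g.:
    -- SELECT * FROM dbo.vw_ProductSalesByCountry
    -- WHERE Country = 'United States'
    --   AND SaleDate >= '20170201' AND SaleDate < '20170301';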
This would be instead of forming a star schema with Product, Date and Customer dimensions. I am really trying to understand what standard a company should go with.
I have folks telling me that cubes are only good for analysis, not for reporting, and that whatever information we want we can get by creating dynamic views.
Any advice or ideas on this?
Thanks!!
As the name suggests, SSAS (SQL Server Analysis Services) is indeed built for analysis. The reason for this is the denormalized, dimensional table structure (e.g., the star schema), which allows for super efficient indexing combined with the pre-processing of aggregated values.
Views are a great way to take data that already exists within your OLTP (as compared to OLAP) database and transform it in a manner that better fits your querying needs. This works in the same manner as "get" stored procedures.
Now for my opinion:
If you have a small amount of data (relative to the power of your server, as well as many other factors) and you're not performing intense aggregations of the data, consider using stored procedures to access your database. You can specify the parameters in .NET like any other function call, making this method super easy.
If you have a lot of data (like, over 100 million rows), consider creating a cube. This will allow your queries to fly. There's a lot more work that goes into these, but the speed payoff is huge.
End note:
If the data in your reports is pretty similar to the data you already have in your database (including JOINing the tables) and you have under half a billion rows, just use a stored proc, and look into using SSRS (or not). If you have a ton of data that needs to be aggregated and transformed, look into SSAS OLAP cubes.
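As a rough illustration of the stored proc route (all names here are placeholders, not anything from your schema):

    -- Sketch only: a parameterized "get" procedure the .NET app can call directly.
    CREATE PROCEDURE dbo.usp_GetProductSales
        @Country   NVARCHAR(50),
        @StartDate DATE,
        @EndDate   DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT p.ProductName,
               c.Country,
               s.SaleDate,
               s.SaleAmount
        FROM   dbo.Sales    s
        JOIN   dbo.Product  p ON p.ProductId  = s.ProductId
        JOIN   dbo.Customer c ON c.CustomerId = s.CustomerId
        WHERE  c.Country  = @Country
          AND  s.SaleDate >= @StartDate
          AND  s.SaleDate <  @EndDate;
    END;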
From my limited experience with Microsoft's Analysis Services, I would agree with Norla. If the execution time of the view is reasonable, that would be the way to go. Cubes can certainly be reported against, as SQL Reporting Services accommodates them fairly well, but the development process can often be much more involved when using a cube as your data source.
Building views can be an alternative for small datasets. You could consider going that route, but:
1) once the reports start taking a long time to load, or
2) it slows down the transactional systems,
then you'll have to consider cubes.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried about the fact that one table in particular will be getting very big. Each customer has approximately 20,000,000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advice below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code), on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently; as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table, and run the query there. This is not how it should be. Therefore, we are considering buying Enterprise Edition for the use of partitioned tables. If the purchase cannot go through, I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
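If it does, a rough sketch of what 2008-style partitioning might look like (an Enterprise Edition feature, as noted in the update above); the function/scheme names, boundary values, columns and filegroup mapping are all invented for illustration:

    -- Sketch only: range partitioning by customer id on SQL Server 2008.
    CREATE PARTITION FUNCTION pfCustomer (INT)
        AS RANGE RIGHT FOR VALUES (10, 20, 30);

    CREATE PARTITION SCHEME psCustomer
        AS PARTITION pfCustomer ALL TO ([PRIMARY]);

    CREATE TABLE dbo.BenchmarkData
    (
        CustomerId   INT           NOT NULL,
        HospitalCode INT           NOT NULL,
        MeasureDate  DATETIME      NOT NULL,
        MeasureValue DECIMAL(18,4) NOT NULL
    ) ON psCustomer (CustomerId);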
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well, I assume you know some things about the subject. Depending on your goals you could go for a star schema (a multidimensional model with fact and dimension tables). Store all fast-changing data in one table (per subject) and the slow-changing data in other dimension/'snowflake' tables.
Another option is the Data Vault method by Dan Linstedt, which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge amount of records and SQL Server should handle it with ease.
A partitioned single table is usually the best way to go. Trying to maintain separate individual customer tables is very costly in terms of time and effort and far more prone to errors.
Also examine your current queries if you are experiencing performance issues. If you don't have proper indexing (did you, for instance, index the foreign key fields?), queries will be slow. If you don't have sargeable queries, they will be slow; if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is strictly needed? If you have SELECT * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views, or if you used an EAV table, you will have performance issues at this level. If you allowed a framework to autogenerate SQL code, you may well have badly performing queries. Remember, Profiler is your friend. Of course you could also have a hardware issue: you need a pretty good-sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you hire a professional DBA with performance tuning experience. It is quite complex stuff. Databases designed by application programmers are often bad performers once they get a real number of users and records. A database MUST be designed with data integrity, performance and security in mind. If you didn't do that, the chances of having them are slim indeed.
Partitioning is definitely something to look into. I had a database with two sharded tables, each containing around 30-35 million records. I have since merged them into one large table and assigned some good indexes. So far I've not had to partition this table as it's working a treat, but I'm keeping partitioning in mind. One thing I have noticed, compared to when the data was sharded, is the data import: it is now slower, but I can live with that as the import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also help queries perform over very large data sets (though partitioning is not usually about performance but about maintenance: being able to add and drop partitions quickly) - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, customers aren't exactly the kind of table you can easily 'archive off', and the aggravation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at B-tree searching than your own invention would be).
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

SQL Server performance with a large number of tables in database

I am updating a piece of legacy code in one of our web apps. The app allows the user to upload a spreadsheet, which we will process as a background job.
Each of these user uploads creates a new table to store the spreadsheet data, so the number of tables in my SQL Server 2000 database will grow quickly - thousands of tables in the near term. I'm worried that this might not be something that SQL Server is optimized for.
It would be easiest to leave this mechanism as-is, but I don't want to leave a time-bomb that is going to blow up later. Better to fix it now if it needs fixing (the obvious alternative is one large table with a key associating records with user batches).
Is this architecture likely to create a performance problem as the number of tables grows? And if so, could the problem be mitigated by upgrading to a later version of SQL Server?
Edit: Some more information in response to questions:
Each of these tables has the same schema. There is no reason that it couldn't have been implemented as one large table; it just wasn't.
Deleting old tables is also an option. They might be needed for a month or two, no longer than that.
Having many tables is not an issue for the engine. The catalog metadata is optimized for very large sizes. There are also some advantages to having each user own their own table, like the ability to have separate security ACLs per table, separate table statistics for each user's content, and, not least, improved query performance for the 'accidental' table scan.
What is a problem, though, is maintenance. If you leave this in place you must absolutely set up a task for automated maintenance; you cannot leave this as a manual task for your admins.
I think this is definitely a problem that will be a pain later. Why would you need to create a new table every time? Unless there is a really good reason to do so, I would not do it.
The best way would be to simply create an ID and associate all uploaded data with an ID, all in the same table. This will require some work on your part, but it's much safer and more manageable to boot.
Having all of these tables isn't ideal for any database. After the upload, does the web app use the newly created table? Maybe it gives some feedback to the user on what was uploaded?
Does your application utilize all of these tables for any reporting etc.? You mentioned keeping them around for a few months - not sure why. If not, move the contents to a central table and drop the individual tables.
Once the backend is taken care of, recode the website to save uploads to a central table. You may need two tables: an UploadHeader table to track the upload batch (who uploaded, when, etc.) that links to a detail table with the individual records from the Excel upload.
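Roughly like this (the column names are invented; the real detail columns would mirror the shared spreadsheet schema):

    -- Sketch only: one header row per upload batch, detail rows keyed to it.
    CREATE TABLE dbo.UploadHeader
    (
        UploadId   INT IDENTITY(1,1) PRIMARY KEY,
        UploadedBy NVARCHAR(100) NOT NULL,
        UploadedAt DATETIME      NOT NULL DEFAULT GETDATE(),
        FileName   NVARCHAR(260) NULL
    );

    CREATE TABLE dbo.UploadDetail
    (
        UploadDetailId INT IDENTITY(1,1) PRIMARY KEY,
        UploadId       INT NOT NULL REFERENCES dbo.UploadHeader (UploadId),
        RowNumber      INT NOT NULL,
        Col1           NVARCHAR(255) NULL,  -- placeholders for the spreadsheet columns
        Col2           NVARCHAR(255) NULL
    );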
I suggest you store these data in a single table. On the server side you can create a console from which a user/operator could manually start the task of freeing up table entries. You can ask them for the range of dates whose data is no longer needed, and that data will be deleted from the db.
You can take it a step further and set up a scheduled database job to wipe the entries/records after a specified time period. You can again add a UI from which the user/operator/admin could set this data-validity limit.
Thus you could create the system such that junk data is auto-deleted after a specified time (which could again be set by the admin), as well as provide them with a console from which they can manually delete additional unwanted data.
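For example, the scheduled cleanup could be as simple as the following; the retention period, table and column names are placeholders, and it assumes a header/detail layout like the one suggested above:

    -- Sketch only: delete upload batches older than the configured retention period.
    -- Works on SQL Server 2000 and later; run it from a scheduled job or the admin console.
    DECLARE @RetentionDays INT
    SET @RetentionDays = 60

    DELETE FROM dbo.UploadDetail
    WHERE UploadId IN (SELECT UploadId
                       FROM   dbo.UploadHeader
                       WHERE  UploadedAt < DATEADD(day, -@RetentionDays, GETDATE()))

    DELETE FROM dbo.UploadHeader
    WHERE UploadedAt < DATEADD(day, -@RetentionDays, GETDATE())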

Help figuring out approaches to (near) real time multi dimensional data querying

I have a system that involves numerous related tables. Think of a standard category/product/order/customer/orderitem scenario. Some tables are self referencing (like Categories). None of the tables are particularly large (around 100k rows with an estimated scale to around 1 million rows). There are a lot of dimensions to this data I need to consider, but must be queried in a near real time way. I also don't know which dimensions a particular user is interested in- it can be one or many criteria across numerous tables. Things can range from
Give me everything with a category of Jackets
Give me everything with a category of Jackets->Parkas having a red color purchased in the last month in New York
Give me everything which wasn't purchased in New York which costs over $100.
Currently we have a very long SP which uses a "cascading data" approach: we go table by table, filtering everything into a temp table using whatever criteria were specified for that table. For the next table, we join the current temp table to whatever table we're using and apply a new filter set into a new temp table. It works, but manageability is poor and performance is slow. I need something better.
I need a new approach to this problem. It's clearly a need for OLAP, possibly using a star schema. Does this work in real time? Can it be configured to work in real time? Should I use indexed views to create a set of denormalized tables? Should I offload this outside of the database completely?
FYI We're using Sql Server.
As you say, this is perfect for OLAP.
With Sql Server 2005 and 2008 you can set up an almost real time solution. You should:
Create a denormalized star schema
Build an OLAP cube using that schema
Enable proactive caching to update the cube when the underlying data source changes.
It's not a trivial job, and you need the Enterprise version of SQL Server to use proactive caching. You also need some front-end tool (maybe Excel would do) to consume the cube.
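A minimal sketch of what the denormalized star schema from the first step could look like for the category/product/order scenario (all table and column names are invented):

    -- Sketch only: one fact table surrounded by a few dimension tables.
    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
        ProductName NVARCHAR(100) NOT NULL,
        Category    NVARCHAR(100) NOT NULL   -- self-referencing category hierarchy flattened in
    );

    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerName NVARCHAR(100) NOT NULL,
        City         NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.DimDate
    (
        DateKey      INT PRIMARY KEY,   -- e.g. 20100215
        FullDate     DATETIME NOT NULL,
        MonthOfYear  TINYINT  NOT NULL,
        CalendarYear SMALLINT NOT NULL
    );

    CREATE TABLE dbo.FactOrderItem
    (
        ProductKey  INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        CustomerKey INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        DateKey     INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        Quantity    INT NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    );

The cube is then built on top of tables like these, and proactive caching keeps it in sync with the relational source.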
It would probably be better to build a dynamic query in your code with all the joins you need, customized to each individual request (properly parameterized, for security, of course).
You would use much of the same cascading logic you have now, but you move it to the code instead of the database. Then you only submit the exact query you need.
The performance would beat using all of the temp tables, and you might get some caching benefit after a few queries have run.
Your dilemma sounds to me like "Is it better to achieve the same result by performing complex processing every time I need it, or should I do it once only for each new piece of data?".
