I would like to understand whether there is a way in Snowflake to segregate credit usage by schema, or whether this question simply cannot be answered given the nature of Snowflake.
By that I mean: I have schemas with normalized tables, and then I have a schema that contains the output of scheduled tasks in Snowflake. The queries behind the scheduled tasks join multiple tables from multiple schemas, so perhaps Snowflake does not track the credit cost of the resources used against each schema.
I'd like to identify not just which of my tasks are the most intensive with respect to credit usage, but also which schemas are driving that expense.
I did see and attempt the query in this SO answer, Find credits used per database and schema in Snowflake, but it mainly gives a granular breakdown by warehouse rather than by schema, which is what that question originally asked about.
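For illustration, this is the kind of apportionment I'm after. It is only a rough sketch: Snowflake meters credits per warehouse, not per schema, and as far as I can tell QUERY_HISTORY.SCHEMA_NAME only records the session's current schema, not every schema a query touched. The idea is to weight each warehouse's credits by each query's share of execution time.

```sql
-- Rough sketch only: apportion each warehouse's credits to schemas by each
-- query's share of execution time over the last 30 days. SCHEMA_NAME here is
-- the session's schema, not the schemas of the tables the query joined.
WITH query_share AS (
    SELECT
        warehouse_name,
        schema_name,
        execution_time,
        SUM(execution_time) OVER (PARTITION BY warehouse_name) AS total_execution_time
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
      AND warehouse_name IS NOT NULL
),
warehouse_credits AS (
    SELECT warehouse_name, SUM(credits_used) AS credits_used
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
)
SELECT
    q.schema_name,
    SUM(w.credits_used * q.execution_time / NULLIF(q.total_execution_time, 0)) AS approx_credits
FROM query_share q
JOIN warehouse_credits w ON w.warehouse_name = q.warehouse_name
GROUP BY q.schema_name
ORDER BY approx_credits DESC;
```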
I'm an analyst preparing Tableau reports and analysis for other teams. I would like to take some of the workload off my shoulders by creating a data source optimized enough that users can get the data they need and do the analysis by themselves.
Current situation:
We use Amazon Redshift. We have tables with raw data coming directly from the systems. Also we have some transformed tables for easier work. All in all, it's tens and tens of tables. We are using Tableau desktop and Tableau server.
Desired situation:
I would like to retain access to the raw data so I can backtrack any potential issues back to the original source. From the raw data, I would like to create transformed tables that will allow users to make queries on them (two-layer system). The tables should contain all the data a user might need, yet be simple enough for a beginner-level SQL user.
I see two ways of approaching this:
Small number of very large tables with all the data. If there are just a couple of tables that contain the maximum amount of data, the user can query just one table and pick the columns they need, or, if necessary, join one or two more tables to it (this option is sketched below).
Many small and very specialized tables. Users will have to do multiple joins to get the data they need, but all the tables will be very simple, so it will not be difficult.
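For illustration of the first option (the table, column and schema names here are made up), the transformed layer could be a wide view built on top of the raw tables, something like:

```sql
-- Illustrative only: hypothetical raw tables and columns in Redshift.
-- The "reporting" schema is the user-facing, transformed layer;
-- the "raw" schema keeps the untouched source data for backtracking.
CREATE VIEW reporting.orders_wide AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_id,
    c.customer_name,
    c.customer_segment,
    p.product_id,
    p.product_name,
    p.product_category
FROM raw.orders o
JOIN raw.customers c ON c.customer_id = o.customer_id
JOIN raw.products  p ON p.product_id  = o.product_id;
```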
Also, access permissions to the data need to be considered.
What do you think is a good approach to solving my issue? Is it any of the two above mentioned solutions? Do you have any other solution? What would you recommend?
We had this problem and we sorted it out with AWS Athena. You pay only when data is scanned by a query; otherwise you pay nothing and no data is touched.
With AWS Athena you can create any set of tables with different attributes, and the role permissions are easy to maintain.
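For example, an external table over raw JSON files in S3 might look something like this (the bucket, database and column names are placeholders, not your actual setup):

```sql
-- Athena external table over JSON files in S3 (hypothetical names).
-- Data stays in S3; you are billed only for the bytes a query scans.
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_raw (
    order_id    string,
    customer_id string,
    amount      double,
    order_date  string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/raw/orders/';
```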
The last part to cover: Tableau has a direct interface to Athena, so there is no need for any intermediate storage.
Also, any time you don't want a table, just delete it and remove it from the roles; the rest will be taken care of automatically.
On an additional note, we tried Redshift Spectrum on JSON data and it does not work with nested JSON yet, so all your attributes should be only one level deep.
Hope it helps.
EDIT1:
Redshift is a columnar database, so there is little difference between small tables and big tables, and a wider table lets you avoid joins with smaller tables. Even if the table is bigger, your query speed depends on the columns involved in the query; if a column is not required by the query, it is never touched when you are querying the data.
I prefer to have all related data in one bigger table, so there is no need to duplicate relations or joins across tables.
Also, you need to ensure there is not too much duplication of data when you store it in a bigger table.
More about database normalization:
MySQL: multiple tables or one table with many columns?
I have a question: why don't operational databases meet business challenges the way a data warehouse does?
In an operational database I can create detailed reports about any product or anything else, and I can issue statistical reports with charts and diagrams, so why can't an operational database be used as a data warehouse?
Best Regards
Usually an operational database only keeps track of the current state of each record.
The purpose of a data warehouse is two-fold:
- Keep track of historic events without overwhelming the operational database;
- Isolate OLAP queries so that they don't impact the load on the operational datastore.
If you try to query your operational data store for sales per product line per month for the past year, the number of joins required, as well as the amount of data you need to read from storage, may cause performance degradation on your operational database.
A data warehouse tries to avoid this by 1) keeping things separated and 2) denormalising the data model (the Kimball approach) so that query plans are simpler.
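For example, against a denormalised star schema the "sales per product line per month" question above becomes a short, predictable query. The fact/dimension names here are illustrative, and the date syntax varies by database:

```sql
-- Illustrative star-schema query: sales per product line per month, past year.
SELECT
    d.year,
    d.month,
    p.product_line,
    SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date    d ON d.date_key    = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE d.full_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY d.year, d.month, p.product_line
ORDER BY d.year, d.month, p.product_line;
```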
I suggest reading The Data Warehouse Toolkit, by Ralph Kimball. The first chapter deals precisely with this question: why do we need a data warehouse if we already have an operational data store?
i can create reports in details about any product or any thing
and i can issue statistical reports with charts and diagrams
Yes, you can, but a business user cannot, as they don't know SQL. And it's very difficult to put a BI tool (for business users to use) over the top of an operational database, for many reasons:
The data model is not built for an end user to understand; a data warehouse data model is (i.e. there is ONE table for customers that has everything about a customer in it, rather than being split into addresses, accounts etc.)
The operational data store is probably missing important reporting reference data such as grouping levels and hierarchies
A slowly changing dimension is a method of 'transparently' modelling changes to, for example, customers. An operational data model generally doesn't do this very well: you need to understand all the tables and join them correctly, if this information is even stored.
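As a rough sketch (hypothetical names, generic SQL), a Type 2 slowly changing dimension keeps one row per version of a customer, so facts can be joined to the customer "as at" the time of the event:

```sql
-- Illustrative Type 2 slowly changing dimension:
-- every change to a customer gets a new row with its own validity window.
CREATE TABLE dim_customer (
    customer_key     INT          NOT NULL,  -- surrogate key, new value per version
    customer_id      INT          NOT NULL,  -- business key from the operational system
    customer_name    VARCHAR(200) NOT NULL,
    customer_segment VARCHAR(50)  NOT NULL,
    valid_from       DATE         NOT NULL,
    valid_to         DATE         NOT NULL   -- e.g. '9999-12-31' for the current version
);

-- "Which segment was the customer in when each order was placed?"
SELECT f.order_id, c.customer_segment
FROM fact_orders f
JOIN dim_customer c
  ON  c.customer_id = f.customer_id
  AND f.order_date BETWEEN c.valid_from AND c.valid_to;
```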
There are many other reasons but these just serve to address your points
When you get to the point that you are too busy to service business users' requests, and you are issuing reports that don't match from one day to the next, you'll start to see the value of a data warehouse.
I have an ASP.NET MVC 4 application that uses a database in US or Canada, depending on which website you are on.
The application lets users filter job data on various criteria, and the criteria get translated into a SQL query with a good number of table joins. The data is filtered, then grouped/aggregated.
However, now I have a new requirement: query and do some grouping and aggregation (average salary) on data from both the Canada server and the US server.
Right now, the lookup tables are duplicated on both database servers.
Here's the approach I was thinking:
Run the query on the US server, run the query again on the Canada server and then merge the data in memory.
Here is one use case: rank the companies by average salary.
In terms of the logic, I am just filtering and querying a job table and grouping the results by company and average salary.
What are some other ways to do this? I was thinking of populating a reporting table with a nightly job and running the queries against that reporting table.
To be honest, the queries themselves are not that fast to begin with; running the query again against the Canada database seems like it would make the site much slower.
Any ideas?
Quite a number of variables here. If you don't have too much data, then doing the queries on each DB and merging is fine, so long as you get the database to do as much of the work as it can (i.e. the grouping, averaging etc.).
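For example (hypothetical table and column names), have each server return additive aggregates rather than a finished average, so the in-memory merge stays exact: the combined average per company is SUM(total_salary) / SUM(job_count) across the two result sets.

```sql
-- Run the same statement against the US and the Canada database, then merge
-- the two result sets in memory. Averages can't be averaged directly, so
-- return the additive pieces instead.
SELECT
    company_id,
    SUM(salary) AS total_salary,
    COUNT(*)    AS job_count
FROM dbo.jobs
WHERE posted_date >= @from_date   -- plus whatever filter criteria the user selected
GROUP BY company_id;
```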
Other options include linking your databases and doing a single query, but there are a few downsides to this, including:
Having to link databases
Security associated with a linked database
A single query will require both databases to be online, whereas you can most likely work around that with two queries
Scheduled, prebuilt tables have some advantages and disadvantages, but they are probably not really relevant to the root problem of you having two databases where perhaps you should have one (maybe, maybe not).
If the query is quite slow and called many times, then a single snapshot taken once could save you some resources, provided the data "as at" the time of the snapshot is relevant and useful to your business need.
A hybrid is to create an "Indexed View" which can let the DB create a running average for you. That should be fast to query and relatively unobtrusive to keep up to date.
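A minimal sketch of that indexed view (SQL Server, hypothetical names, and assuming salary is NOT NULL, since indexed views disallow AVG and SUM over nullable columns):

```sql
-- Indexed view keeping a running SUM and COUNT per company; AVG is not allowed
-- in an indexed view, so divide at query time.
CREATE VIEW dbo.vw_company_salary
WITH SCHEMABINDING
AS
SELECT
    company_id,
    SUM(salary)  AS total_salary,   -- salary assumed NOT NULL
    COUNT_BIG(*) AS job_count       -- COUNT_BIG(*) is required with GROUP BY
FROM dbo.jobs
GROUP BY company_id;
GO

CREATE UNIQUE CLUSTERED INDEX ix_vw_company_salary
    ON dbo.vw_company_salary (company_id);
GO

-- The "running average" per company:
SELECT company_id,
       1.0 * total_salary / job_count AS avg_salary
FROM dbo.vw_company_salary WITH (NOEXPAND);
```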
Hope some of that helps.
I am administering a rather large database that has grown in complexity and design from a single application database. Now there is a plan to add a fifth application that carries with it its own schema and specific data. I have been researching SSO solutions but that is not really what I am after. My goal is to have one point of customer registration, logins and authorization.
Ideally, each application would request authentication and be given authorization to multiple applications, where the applications would then connect to the appropriate database for operations. I do not have first hand experience dealing with this degree of separation as the one database has been churning flawlessly for years. Any best practice papers would be appreciated :)
I would envision a core database that maintained shared data - Customer/Company/Products
Core tables and primary keys – To maintain referential integrity, should I have a smaller replicated table in each "application" database? What are some ways to share keys among various databases and ensure referential integrity?
Replication – Two subscribers currently pull data from the production database, where data is later batched into a DW solution for reporting. Am I going down a road that can lead to frustration?
Data integrity – How can I ensure for example that:
DATABASE_X.PREFERENCES.USER_ID always references CORE_DATABASE.USERS.USER_ID (one trigger-based approach is sketched after this list)
Reporting – What type of hurdles would I cross to replicate/transform data from multiple databases into one reporting database?
White Papers - Can anyone point me to good references for this strategy in practice?
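For the data integrity point above, one possible sketch (SQL Server, using the database and table names from the question): cross-database foreign keys are not supported, so a trigger in the application database is one way to enforce the rule.

```sql
-- Sketch only: reject PREFERENCES rows whose USER_ID is missing from the core DB.
USE DATABASE_X;
GO
CREATE TRIGGER dbo.trg_preferences_userid_check
ON dbo.PREFERENCES
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    IF EXISTS (
        SELECT 1
        FROM inserted i
        WHERE NOT EXISTS (
            SELECT 1
            FROM CORE_DATABASE.dbo.USERS u
            WHERE u.USER_ID = i.USER_ID
        )
    )
    BEGIN
        RAISERROR('USER_ID does not exist in CORE_DATABASE.dbo.USERS.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;
GO
```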
Thanks
A few URLs for you. Scale-out implementations can vary wildly to suit requirements, but hopefully these can help you.
http://blogs.msdn.com/b/sqlcat/archive/2008/06/12/sql-server-scale-out.aspx
this one is 2005 centric but is VERY good
http://msdn.microsoft.com/en-us/library/aa479364.aspx#scaloutsql_topic4
this one is a good solution for reporting...
http://msdn.microsoft.com/en-us/library/ms345584.aspx
I've given you an Analysis Services one too :)
http://sqlcat.com/whitepapers/archive/2010/06/08/scale-out-querying-for-analysis-services-with-read-only-databases.aspx
I created something like this a few years ago using views and stored procedures to bring the data from the master database into the subordinate databases. This allows you to fairly easily join those master tables to the other subordinate tables.
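Something along these lines (hypothetical database and table names), so queries in the subordinate database can join to the core data as if it were local:

```sql
-- Expose the core Customers table inside a subordinate application database.
USE ApplicationDB_X;
GO
CREATE VIEW dbo.vw_CoreCustomers
AS
SELECT CustomerID, CustomerName, CompanyID
FROM CoreDB.dbo.Customers;
GO

-- Local tables then join to the master data through the view:
SELECT p.PreferenceName, c.CustomerName
FROM dbo.Preferences p
JOIN dbo.vw_CoreCustomers c ON c.CustomerID = p.CustomerID;
```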
Have you looked into using RAC? You can have multiple physical databases but only one logical database. This would solve all of your integrity issues. And you can set aside nodes just for reporting.
Don't throw out the idea of having separate applications and linking the logon/off functions via webservice (esque) requests. I have seen billing/user registration systems separated in this way. Though at extremely large scales, this might not be a good idea.
As part of my role at the firm I'm at, I've been forced to become the DBA for our database. Some of our tables have row counts approaching 100 million, and many of the things that I know how to do in SQL Server (like joins) simply break down at this level of data. I'm left with a couple of options:
1) Go out and find a DBA with experience administering VLDBs. This is going to cost us a pretty penny and come at the expense of other work that we need to get done. I'm not a huge fan of it.
2) Most of our data is historical data that we use for analysis. I could simply create a copy of our database schema and start from scratch with new data, putting any analysis of our current data on hold until I find a proper way to solve the problem (this is my current "best" solution).
3) Reach out to the developer community to see if I can learn enough about large databases to get us through until I can implement solution #1.
Any help that anyone could provide, or any books you could recommend would be greatly appreciated.
Here are a few thoughts, but none of them are quick fixes:
- Develop an archival strategy for the data in your large tables. Create tables with similar formats to the existing transactional table and copy the data out into those tables on a periodic basis. If you can get away with whacking the data out of the tx system, then fine.
- Develop a relational data warehouse to store the large data sets, complete with star schemas consisting of fact tables and dimensions. For an introduction to this approach there is no better book (IMHO) than Ralph Kimball's Data Warehouse Toolkit.
- For analysis, consider using MS Analysis Services for pre-aggregating this data for fast querying.
- Of course, you could also look at your indexing strategy within the existing database. Be careful with any changes, as you could add indexes that would improve querying at the cost of insert and transactional performance.
- You could also research partitioning in SQL Server.
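As a sketch of that last point, date partitioning in SQL Server looks roughly like this (the table, column and boundary values are made up); old partitions can then be switched out cheaply when archiving:

```sql
-- Partition a large history table by year.
CREATE PARTITION FUNCTION pf_history_date (datetime)
AS RANGE RIGHT FOR VALUES ('20090101', '20100101', '20110101');

CREATE PARTITION SCHEME ps_history_date
AS PARTITION pf_history_date ALL TO ([PRIMARY]);

CREATE TABLE dbo.AnalysisHistory
(
    HistoryID   bigint        NOT NULL,
    HistoryDate datetime      NOT NULL,
    Payload     varchar(4000) NULL,
    -- the partitioning column must be part of the clustered key
    CONSTRAINT PK_AnalysisHistory PRIMARY KEY (HistoryID, HistoryDate)
)
ON ps_history_date (HistoryDate);
```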
Don't feel bad about bringing in a DBA on a contract basis to help out...
To me, your best bet would be to begin investigating moving that data out of the transactional system if it is not necessary for day-to-day use.
Of course, you are going to need to pick up some new skills for dealing with these amounts of data. Whatever you decide to do, make a backup first!
One more thing you should do is ensure that your I/O is being spread appropriately across as many spindles as possible. Your data files, log files and SQL Server tempdb data files should all be on separate drives in a database system that large.
DBAs are worth their weight in gold, if you can find a good one. They specialize in doing the very thing that you are describing. If this is a one-time problem, maybe you can subcontract one.
I believe Microsoft offers a similar service. You might want to ask.
You'll want to get a DBA in there, at least on contract to performance tune the database.
Joining to a 100 million record table shouldn't bring the database server to its knees. My company's customers do it many hundreds (possibly thousands) of times per minute on our system.