Redshift data summarization - database

I have 3 tables in Amazon Redshift which have information regarding the usage of an app by its users (basically screen clicks, OS version, app version, etc.).
I wish to create a summary table which would store a profile of each user with details like last logged-in time, most recently used app version, last visited screen, etc.
I am not very familiar with columnar databases and have previously worked only on RDBMS. I was thinking of writing a cron job which would run join queries over the three tables for the past day of data and merge the results into the profile table. I don't know if this is possible to do in Redshift.

Amazon Redshift is a fully SQL-compliant database. The fact that it is a columnar database shouldn't impact how you use it -- it simply means that it can be faster and more efficient at certain types of operations (e.g. scanning millions or even billions of rows).
Your idea of running a regular set of database queries would work fine. However, to make it more efficient, the queries should only update information for users who have had activity since the last update. That is, do not try to update information about all users since most user information would not change every day.
The query would basically say "select the latest value of click, os, version for any user who accessed the system since the last time we did an update", rather than "select latest click, os, version for all users".
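For example, a minimal sketch of that incremental update, assuming hypothetical tables app_events (the raw click stream) and user_profile (the summary), with last_login maintained by the job itself:

-- Latest event per user, restricted to users active since the last run
-- (on the very first run, seed user_profile or use a fixed cutoff instead)
BEGIN;

CREATE TEMP TABLE latest_activity AS
SELECT user_id, event_time, screen, os_version, app_version
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM app_events
    WHERE event_time > (SELECT MAX(last_login) FROM user_profile)
) t
WHERE rn = 1;

-- Classic Redshift upsert: delete the rows being replaced, then insert
DELETE FROM user_profile
USING latest_activity
WHERE user_profile.user_id = latest_activity.user_id;

INSERT INTO user_profile (user_id, last_login, last_app_version, last_screen)
SELECT user_id, event_time, app_version, screen
FROM latest_activity;

COMMIT;

Keeping the whole thing set-based like this (staging table, then delete-and-insert in one transaction) is the pattern Redshift handles well.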
Also, consider whether you actually need such a table to exist. Perhaps you could retrieve this information on-the-fly when you are seeking information about particular users rather than pre-computing the values each day. This would, of course, depend upon how often you wish to retrieve such information.

Related

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators work on 1 order one after another, sometimes one operator works on multiple small orders, and there is no Order-Operator link in the database.
I want to have a unified fact table where, for every timestamp, I know MachineID, OrderID and OperatorID. If a timestamp exists in DB1, then the record will have the numeric measures from it (consumables); if it exists in DB2, then it will have the numeric measures from DB2 (good/bad production items). If it exists in both databases, then it will have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure to do a FULL JOIN by timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge the full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to resolve this problem, but I don't know what the right tool is. Can Hadoop or a similar parallel processing tool help with this task?
I am looking for a solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and Power BI reporting are working just fine for me.
Hadoop seems hard to justify in this use case, given the limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but thanks to parallel IO, when data is distributed across multiple servers/storage media. Unless you're happy to copy all the data to HDFS distributed among multiple nodes, it likely will not help much. If you want to spin up a Hadoop cluster and run multiple jobs querying a single SQL Server, it'll likely end up badly for the latter.
Have you considered optimizations which would allow you to limit the amount of data you're processing nightly?
E.g. what is 'timestamp' field? Does it reflect last update time? Can you use it to filter rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of the last update, can you add an "updateTime" field and triggers on updates to populate it, so you don't need to import rows which have not changed since the previous run? If you build an index on that field, and the number of updates during the day is not high relative to the total table size, a query filtering on it will hit the index and fetching the incremental changes should be fast.
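A sketch of that idea in T-SQL, assuming a hypothetical source table dbo.Production with key column ProductionID (all names here are illustrative):

ALTER TABLE dbo.Production
    ADD UpdateTime DATETIME2 NOT NULL
        CONSTRAINT DF_Production_UpdateTime DEFAULT SYSUTCDATETIME();

CREATE INDEX IX_Production_UpdateTime ON dbo.Production (UpdateTime);
GO

-- Keep UpdateTime current on every modification (direct trigger recursion is
-- off by default, so this does not re-fire itself)
CREATE TRIGGER trg_Production_UpdateTime ON dbo.Production
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE p
    SET    UpdateTime = SYSUTCDATETIME()
    FROM   dbo.Production AS p
    JOIN   inserted AS i ON i.ProductionID = p.ProductionID;
END;
GO

-- The nightly load then only fetches rows changed since the previous run,
-- with @LastSuccessfulLoadTime read from a load-control table
DECLARE @LastSuccessfulLoadTime DATETIME2;
SELECT *
FROM   dbo.Production
WHERE  UpdateTime > @LastSuccessfulLoadTime;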
Another thing to consider - are those DBs running on the same node/SQL server? Access to remote DBs is slow, so if that's the case, think about how to fix this first.

How often is the schema of a table supposed to change, and how should I deal with it?

We are using Redshift at my workplace, and in the last week I have been running through a series of requests about changing the schema of a certain table, which has become a very tedious process (involving updates of ETL jobs and Redshift views) every day.
The process can be summarized as:
1. Change the ETL job that produces the raw data before loading it into Redshift.
2. Temporarily modify a Redshift view that uses the underlying table, to allow modifications on that table.
3. Modify the table (e.g. add/change/remove column(s)).
4. Modify the view back to use the updated table.
Of course, in the process there's testing involved and other time-consuming steps.
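Concretely, steps 2-4 look something like the following (a sketch with a hypothetical table analytics.events and view reporting.events_v):

-- Step 2: drop (or redefine) the view so it no longer blocks the table change
DROP VIEW reporting.events_v;

-- Step 3: modify the table
ALTER TABLE analytics.events ADD COLUMN device_type VARCHAR(32);

-- Step 4: recreate the view against the updated table
CREATE VIEW reporting.events_v AS
SELECT event_id, user_id, event_time, device_type
FROM analytics.events;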
How often is it "natural" for a table schema to change? What are the best practices to deal with this without losing too much time or having to do all the "mechanic" process all over again?
Thanks!
This is one of the reasons that data warehouse automation tools exist. We know that users will change their mind when they see the warehouse, or as business requirements change. Automating the process means that everything you asked for could be delivered in a few clicks of a mouse.
You'll find a list of all the data warehouse automation products we know of on our web site, at http://ajilius.com/competitors/

Transactional table and Reporting table within same database

I did read posts about transactional and reporting database.
We have a single table which is used for both reporting (historical) and transactional purposes,
e.g. order, with fields
orderid, ordername, orderdesc, datereceived, dateupdated, confirmOrder
Is it a good idea to split this table into neworder and orderhistory?
The neworder table would record the current day's transactions (select, insert and update activity every millisecond for the orders received on that day). Later we merge this table into orderhistory.
Is this a recommended approach?
Do you think this would minimize the load and processing time on the database?
PostgreSQL supports basic table partitioning, which allows splitting what is logically one large table into smaller physical pieces; see the PostgreSQL partitioning documentation for more info.
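For example, a minimal sketch of declarative range partitioning (PostgreSQL 10 and later), reusing the order columns from the question:

CREATE TABLE orders (
    orderid       bigint,
    ordername     text,
    orderdesc     text,
    datereceived  date NOT NULL,
    dateupdated   date,
    confirmorder  boolean
) PARTITION BY RANGE (datereceived);

-- One physical partition per year; queries filtering on datereceived only touch the relevant piece
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');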
To answer your second question: No. Moving data from one place to another is an extra load that you otherwise wouldn't have if you used the transactional table for reporting. But there are some other questions you need to ask before you make this decision.
How often are these reports run?
If you are running these reports once an hour, it may make sense to keep them in the same table. However, if this report takes a while to run, you'll need to take care not to tie up resources for the other clients using it as a transactional table.
How up-to-date do these reports need to be?
If the reports are run less than daily or weekly, it may not be critical to have up to the minute data in the reports.
And this is where the reporting table comes in. The approaches I've seen typically involve having a "data warehouse," whether that be implemented as a single table or an entire database. This warehouse is filled on a schedule with data from the transactional table, which subsequently triggers the generation of a report. This seems to be the approach you are suggesting, and it is a completely valid one. Ultimately, the one question you need to answer is when you want your server to handle the load. If this can be done on a schedule during non-peak hours, I'd say go for it. If it needs to be run at any given time, then you may want to keep the single-table approach.
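For example, a nightly job along these lines (a sketch reusing the neworder/orderhistory tables from the question) moves the day's rows into the reporting table during off-peak hours:

-- Move everything older than today into the history/reporting table
INSERT INTO orderhistory (orderid, ordername, orderdesc, datereceived, dateupdated, confirmorder)
SELECT orderid, ordername, orderdesc, datereceived, dateupdated, confirmorder
FROM   neworder
WHERE  datereceived < CURRENT_DATE;

DELETE FROM neworder
WHERE  datereceived < CURRENT_DATE;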
Of course there is nothing saying you can't do both. I've seen a few systems that have small on-demand reports run on transactional tables, scheduled warehousing of historical data, and then long-running reports against that historical data. It's really just a matter of how real-time you want the data to be.

SQL Server performance with a large number of tables in database

I am updating a piece of legacy code in one of our web apps. The app allows the user to upload a spreadsheet, which we will process as a background job.
Each of these user uploads creates a new table to store the spreadsheet data, so the number of tables in my SQL Server 2000 database will grow quickly - thousands of tables in the near term. I'm worried that this might not be something that SQL Server is optimized for.
It would be easiest to leave this mechanism as-is, but I don't want to leave a time-bomb that is going to blow up later. Better to fix it now if it needs fixing (the obvious alternative is one large table with a key associating records with user batches).
Is this architecture likely to create a performance problem as the number of tables grows? And if so, could the problem be mitigated by upgrading to a later version of SQL Server?
Edit: Some more information in response to questions:
Each of these tables has the same schema. There is no reason that it couldn't have been implemented as one large table; it just wasn't.
Deleting old tables is also an option. They might be needed for a month or two, no longer than that.
Having many tables is not an issue for the engine. The catalog metadata is optimized for very large sizes. There are also some advantages to having each user own their own table, like the ability to have separate security ACLs per table, separate table statistics for each user's content and, not least, better query performance for the 'accidental' table scan.
What is a problem, though, is maintenance. If you leave this in place you absolutely must set up tasks for automated maintenance; you cannot leave this as a manual task for your admins.
I think this is definitely a problem that will be a pain later. Why would you need to create a new table every time? Unless there is a really good reason to do so, I would not do it.
The best way would be to simply create an ID and associate all uploaded data with an ID, all in the same table. This will require some work on your part, but it's much safer and more manageable to boot.
Having all of these tables isn't ideal for any database. After the upload, does the web app use the newly created table? Maybe it gives some feedback to the user on what was uploaded?
Does your application utilize all of these tables for any reporting etc? You mentioned keeping them around for a few months - not sure why. If not, move the contents to a central table and drop the individual tables.
Once the backend is taken care of, recode the website to save uploads to a central table. You may need two tables: an UploadHeader table to track the upload batch (who uploaded, when, etc.) linked to a detail table with the individual records from the Excel upload.
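A sketch of that two-table design (names and column types are illustrative, kept to features available in SQL Server 2000):

CREATE TABLE dbo.UploadHeader (
    UploadID     INT IDENTITY(1,1) PRIMARY KEY,
    UploadedBy   NVARCHAR(128) NOT NULL,
    UploadedAt   DATETIME      NOT NULL DEFAULT GETDATE(),
    FileName     NVARCHAR(260) NULL
);

CREATE TABLE dbo.UploadDetail (
    UploadDetailID INT IDENTITY(1,1) PRIMARY KEY,
    UploadID       INT NOT NULL REFERENCES dbo.UploadHeader (UploadID),
    -- ... columns matching the spreadsheet layout ...
    Col1           NVARCHAR(255) NULL,
    Col2           NVARCHAR(255) NULL
);

-- Makes "give me batch X" queries and the eventual purge cheap
CREATE INDEX IX_UploadDetail_UploadID ON dbo.UploadDetail (UploadID);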
I suggest you store this data in a single table. On the server side you can create a console from which a user/operator could manually start the task of freeing up table entries. You can ask them for the range of dates whose data is no longer needed, and delete that data from the DB.
You can go a step further and set up a scheduled database job to wipe the records after a specified retention period, and again add a UI from which the user/operator/admin could set this data-validity limit.
Thus you could build the system so that junk data is automatically deleted after a specified time (configurable by the admin), while also providing a console through which they can manually delete additional unwanted data.
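A sketch of the scheduled purge, assuming a hypothetical central table dbo.UploadData with an UploadedAt column and an admin-configured retention period:

DECLARE @RetentionDays INT;
SET @RetentionDays = 60;   -- value the admin console would supply

DELETE FROM dbo.UploadData
WHERE UploadedAt < DATEADD(DAY, -@RetentionDays, GETDATE());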

Does Oracle have something like Change Data Capture in SQL Server 2008?

Change Data Capture is a new feature in SQL Server 2008. From MSDN:
Change data capture provides historical change information for a user table by capturing both the fact that DML changes were made and the actual data that was changed. Changes are captured by using an asynchronous process that reads the transaction log and has a low impact on the system.
This is highly sweet - no more adding CreatedDate and LastModifiedBy columns manually.
Does Oracle have anything like this?
Sure. Oracle actually has a number of technologies for this sort of thing depending on the business requirements.
Oracle has had something called Workspace Manager for a long time (8i days) that allows you to version-enable a table and track changes over time. This can be a bit heavyweight, though, because it is based on views with instead-of triggers.
Starting in 11.1 (as an extra-cost option to Enterprise Edition), Oracle has Total Recall, which asynchronously mines the redo logs for data changes that get logged to a separate table, which can then be queried using flashback query syntax against the main table. Total Recall automatically partitions and compresses the historical data and automatically takes care of purging it after a specified retention period.
Oracle has a LogMiner technology that mines the redo logs and presents transactions to consumers. There are a number of technologies that are then built on top of LogMiner including Change Data Capture and Streams.
You can also use materialized views and materialized view logs if the goal is to replicate changes.
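For example, a minimal sketch of fast-refresh replication through a materialized view log, assuming a hypothetical ORDERS table with primary key ORDER_ID:

CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;

CREATE MATERIALIZED VIEW orders_mv
  REFRESH FAST ON DEMAND
  AS SELECT order_id, status, last_updated FROM orders;

-- 'F' requests a fast (incremental) refresh using only the changes in the MV log
BEGIN
  DBMS_MVIEW.REFRESH('ORDERS_MV', 'F');
END;
/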
Oracle has Change Data Notification where you register a query with the system and the resources accessed in that query are tagged to be watched. Changes to those resources are queued by the system allowing you to run procs against the data.
This is managed using the DBMS_CHANGE_NOTIFICATION package.
Here's an infodoc about it:
http://www.oracle-base.com/articles/10g/dbms_change_notification_10gR2.php
If you are connecting to Oracle from a C# app, ODP.NET (Oracle's .NET client library) can interact with Change Data Notification to alert your C# app when Oracle changes are made - pretty kewl. Goodbye to polling repeatedly for data changes if you ask me - just register the table, set up change data notification through ODP.NET and voilà, C# methods get called only when necessary. woot!
"no more adding CreatedDate and LastModifiedBy columns manually" ... as long as you can afford to keep complete history of your database online in the redo logs and never want to move the data to a different database.
I would keep adding them and avoid relying on built-in database techniques like that. If you have a need to keep historical status of records then use an audit table or ship everything off to a data warehouse that handles slowly changing dimensions properly.
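For example, a sketch of the audit-table approach, using a hypothetical EMPLOYEES table (the audit table and trigger names are illustrative):

CREATE TABLE employees_audit (
    employee_id  NUMBER,
    old_salary   NUMBER,
    new_salary   NUMBER,
    changed_by   VARCHAR2(30),
    changed_at   TIMESTAMP,
    operation    VARCHAR2(1)
);

CREATE OR REPLACE TRIGGER trg_employees_audit
AFTER UPDATE OR DELETE ON employees
FOR EACH ROW
DECLARE
    v_op VARCHAR2(1);
BEGIN
    -- UPDATING/DELETING are trigger predicates, so capture the operation in PL/SQL first
    IF UPDATING THEN
        v_op := 'U';
    ELSE
        v_op := 'D';
    END IF;

    INSERT INTO employees_audit
        (employee_id, old_salary, new_salary, changed_by, changed_at, operation)
    VALUES
        (:OLD.employee_id, :OLD.salary, :NEW.salary, USER, SYSTIMESTAMP, v_op);
END;
/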
Having said that, I'll add that Oracle 10g+ can query historical data simply by using flashback query syntax. Examples here: http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_10002.htm#i2112847
This technology is also used in Oracle's Datapump export utility to provide consistent data for multiple tables.
I believe Oracle has provided auditing features since 8i; however, the tables used to capture the data are rather complex and there is a significant performance impact when this is turned on.
In Oracle 8i you could only enable this for an entire database and not a table at a time; however, 9i introduced Fine Grained Auditing, which provides far more flexibility. This has been expanded upon in 10g/11g.
For more information see http://www.oracle.com/technology/deploy/security/database-security/fine-grained-auditing/index.html.
Also in 11g Oracle introduced Audit Vault, which provides secure storage for audit information; even DBAs cannot change this data (according to Oracle's documentation; I haven't used this feature yet). More info can be found at http://www.oracle.com/technology/deploy/security/database-security/fine-grained-auditing/index.html.
Oracle has a mechanism called Flashback Data Archive. From A Fresh Look at Auditing Row Changes:
Oracle Flashback Query retrieves data as it existed at some time in the past.
Flashback Data Archive provides the ability to track and store all transactional changes to a table over its lifetime. It is no longer necessary to build this intelligence into your application. A Flashback Data Archive is useful for compliance with record stage policies and audit reports.
CREATE TABLESPACE SPACE_FOR_ARCHIVE
datafile 'C:\ORACLE DB12\ARCH_SPACE.DBF' SIZE 50G;
CREATE FLASHBACK ARCHIVE longterm
TABLESPACE space_for_archive
RETENTION 1 YEAR;
ALTER TABLE EMPLOYEES FLASHBACK ARCHIVE LONGTERM;
select EMPLOYEE_ID, FIRST_NAME, JOB_ID, VACATION_BALANCE,
VERSIONS_STARTTIME TS,
nvl(VERSIONS_OPERATION,'I') OP
from EMPLOYEES
versions between timestamp timestamp '2016-01-11 08:20:00' and systimestamp
where EMPLOYEE_ID = 100
order by EMPLOYEE_ID, ts;
