Netezza find distribution key of a table programatically

Netezza find distribution key of a table programatically - netezza

Is it possible to programatically find what sort of distribution algorithm a Netezza table is using?
I can do it manually in Workbench by exporting table DDL but I would like to be able to do it programatically by running some sort of metadata SQL query.
I looked into most system tables but can't find this information anywhere.
Any ideas?

There might be a solution to this.
Running this query:
select * from _v_table_dist_map where database='database' and tablename='tablename';
If it returns no rows, it can be assumed a random distribution is being used (DISTRIBUTE ON RANDOM).
If it returns 1 or more rows, column based destribution is being used (DISTRIBUTE ON (col1, ..., coln)).

Related

Identify if a column is Virtual in Snowflake without SHOW COLUMNS

Currently we're identifying if a column is virtual in Snowflake by running a SHOW COLUMN query and checking the KIND field for VIRTUAL_COLUMN. Unfortunately, there's a 10k limit on entries returned from SHOW queries in Snowflake and we'd like to be able to run this query at the schema level on schemas ~25k tables.
According to this post there's no way to identify virtual columns in the information_schema.columns view and we'd like to avoid having to run a SHOW COLUMNS query at the table level or having to run a desc table on every table.
Is there some other way we can identify virtual columns at scale?

Unfortunately, not aware of any native capability. I would consider writing a script using the get_ddl() function and run it against all objects in a schema.

Can a Snowflake UDF be used to create MD5 on the fly?

I was wondering if anyone has an example of creating an MD5 result using an UDF in Snowflake?
Scenario: I want a UDF that can set X columns depending on the source to create an MD5 result. So table A might have 5 columns, table B has 10....and accounting for various data types.
Thanks,
Todd

Snowflake already provided md5 built in fucntion.
https://docs.snowflake.com/en/sql-reference/functions/md5.html
select md5('Snowflake');
----------------------------------+
MD5('SNOWFLAKE') |
----------------------------------+
edf1439075a83a447fb8b630ddc9c8de |
----------------------------------+

There are many ways you can do the MD5 calculation. But I thought it will be good to understand your use case. I am assuming that you want to use MD5 to validate the data migrated to Snowflake. If that is the case, then MD5 way of checking each row on snowflake may be expensive. A more optimal way of validation will be to identify each column for the table and calculate the MIN, MAX, COUNT, NUMBER OF NULLS, DISTINCT COUNT for each column and validate it with the source. I have created a framework with this approach where I use the 'SHOW COLUMNS' query to get the list if COLUMNS. The framework also allows to skip some columns if required, also filter the number of rows retrieved based on a dynamic criteria. This way of validating the data will be more optimal. It will definitely help to understand your use case better.
MD5
Does this work for you
create or replace function md5_calc (column_name varchar)
returns varchar
LANGUAGE SQL
AS $$
select md5(column_name)
$$;
SELECT EMPLID,md5_calc(EMPLID),EMPNAME,md5_calc(EMPNAME) from employee;

Remove duplicates from a SQL server rows using DISTINCT

I need to remove SQL server duplicated rows when importing file into database with distinct method.
HallGroup is my table in database. I'm using this
Sql procedure:
SELECT DISTINCT * INTO tempdb.dbo.tmpTable
FROM HallGroup
DELETE FROM HallGroup
INSERT INTO HallGroup SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
With this procedure works fine duplicated rows are deleted, but the problem is when i try to import again data to SQL server rows are still duplicating. What i'm missing, So any hint?
How to remove SQL server duplicated rows properly when importing file into database with distinct method?

I am just getting back into SQL after being out for a bit but I would not have solved your problem in that way that you are trying (not that I completely understand why you are doing it that way) as I believe (even if it were working correctly) over time your process will take longer each time you do it as the size of the table increases.
It would be much more efficient if you inserted the new data based on the absence of a key (you indicate you are already using a stored proc). If you don't have a key to use (which very recently happened to me), make one. I just solved a similar problem to yours whereas I am importing data into a table from an external source and wanted to eliminate the possibility of duplicates. In my case, I associate name of the external source datafile (is distinct by dataset to import) with the data to be imported and use that to ensure I am not re-importing already imported data. I load the external data into a table using a dtsx and then run a stored proc to merge that data with an existing table. This gives me the added advantage of having a audit trail of where each record came from.
Hope this helps.

Questions in Sybase lag and over by concept

I have table like this below in my sybase database
ID,Col1,Col2
1,100,300
2,300, 400
3,400,500
4,900,1000.
I want result like this below only in sybase.
1,100,500 --- cross interrow checking the values
2,900,1000.

SInce you did not specify which database you're using, I'm assuming your using Sybase ASE (rather than Sybase IQ or Sybase SQL Anywhere, which do support lag/lead etc.)
Also it's not quite clear what you want since you have not defined how the relation between the various rows and columns should be interpreted. But I'm guessing you're essentially hinting at a dependency graph between Col2->Col1.
In ASE, you'll need to write this as a multi-step, loop-based algorithm whereby you determine the dependency graph. Since you don't know how many levels deep this will run, you need a loop rather than a self-join. You need to keep track of the result in a temporary table.
Can't go further here... but that's the sort of approach you'll need.

How do you get an SSIS package to only insert new records when copying data between servers

I am copying some user data from one SqlServer to another. Call them Alpha and Beta. The SSIS package runs on Beta and it gets the rows on Alpha that meet a certain condition. The package then adds the rows to Beta's table. Pretty simple and that works great.
The problem is that I only want to add new rows into Beta. Normally I would just do something simple like....
INSERT INTO BetaPeople
SELECT * From AlphaPeople
where ID NOT IN (SELECT ID FROM BetaPeople)
But this doesn't work in an SSIS package. At least I don't know how and that is the point of this question. How would one go about doing this across servers?

Your example seems simple, looks like you are adding only new people, not looking for changed data in existing records. In this case, store the last ID in the DB.
CREATE TABLE dbo.LAST (RW int, LastID Int)
go
INSERT INTO dbo.LAST (RW, LastID) VALUES (1,0)
Now you can use this to insert the last ID of the row transferred.
UPDATE dbo.LAST SET LastID = #myLastID WHERE RW = 1
When selecting OLEDB source, set data access mode to SQL Command and use
DECLARE #Last int
SET #Last = (SELECT LastID FROM dbo.LAST WHERE RW = 1)
SELECT * FROM AlphaPeople WHERE ID > #Last;
Note, I do assume that you are using ID int IDENTITY for your PK.
If you have to monitor for data changes of existing records, then have the "last changed" column in every table, and store time of the last transfer.
A different technique would involve setting-up a linked server on Beta to Alpha and running your example without using SSIS. I would expect this to be way slower and more resource intensive than the SSIS solution.
INSERT INTO dbo.BetaPeople
SELECT * FROM [Alpha].[myDB].[dbo].[AlphaPeople]
WHERE ID NOT IN (SELECT ID FROM dbo.BetaPeople)

Add a lookup between your source and destination.
Right click the lookup box to open Lookup Transformation Editor.
Choose [Redirect rows to no match output].
Open columns, map your key columns.
Add an entry with the table key in lookup column , lookup operation as
Connect lookup box to destination, choose [Lookup no Match Output]

Simplest method I have used is as follows:
Query Alpha in a Source task in a Dataflow and bring in records to the data flow.
Perform any needed Transformations.
Before writing to the Destination (Beta) perform a lookup matching the ID column from Alpha to those in Beta. On the first page of the Lookup Transformation editor, make sure you select "Redirect rows to no match output" from the dropdown list "Specify how to handle rows with now matching error"
Link the Lookup task to the Destination. This will give you a prompt where you can specify that it is the unmatched rows that you want to insert.

This is the classical Delta detection issue. The best solution is to use Change Data Capture with/without SSIS. If what you are looking for is a once in a life time activity, no need to go for SSIS. Use other means such as linked server and compare with existing records.

The following should solve issue of loading Changed and New records using SSIS:
Extract Data from Source usint Data flow.
Extract Data from Target.
Match on Primary key Add Unmatch records and split matched and unmatched records from Source and Matched records from Target call them Matched_Source,
Unmatch_Source and Matched_Target.
Compare Matched_Source and Matched_Target and Split Matched_Source to Changed and Unchanged.
Null load TempChanged Table.
Add Changed Records to TempChanged.
Execute SQL script/stored proc to Delete Records from Target for primary key in TempChanged and add records in TempChanged to Target.
Add Unmatched_Source to Target.

Another solution would be to use a temporary table.
In the properties for Beta's connection manager, change RetainSameConnection to true (by default SSIS runs each query in it's own connection, this would mean the temporary table would be killed as soon as it has been created).
Create a SQL Task using Beta's connection and use the following SQL to create your temporary table:
SELECT TOP 0 *
INTO ##beta_temp
FROM Beta
Next create a data flow that pulls data from Alpha and loads into ##beta_temp (you will need to run the SQL statement above on SSMS first so that Visual Studio can see the table at design time and you will also need to set the DelayValidation property to true on the Data Flow task).
Now you have two tables on the same server and you can just use your example SQL modified to use the temporary table.
INSERT INTO Beta
SELECT * FROM ##beta_temp
WHERE ID NOT IN (SELECT ID FROM Beta)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Netezza find distribution key of a table programatically - netezza

Related

Identify if a column is Virtual in Snowflake without SHOW COLUMNS

Can a Snowflake UDF be used to create MD5 on the fly?

Remove duplicates from a SQL server rows using DISTINCT

Questions in Sybase lag and over by concept

How do you get an SSIS package to only insert new records when copying data between servers

Categories

Resources