Snowflake external table from .csv file under S3

Let's assume I have a .csv file like:
event,user
1,123
2,321
This .csv file is located in S3.
I run the following SQL to create an external table (with @TEST_STAGE already created and pointing to the correct S3 path):
CREATE OR REPLACE EXTERNAL TABLE TEST_CSV_TABLE1(
event_id VARCHAR AS (value:$1::varchar),
user_id VARCHAR AS (value:$2::varchar)
)
WITH LOCATION = @TEST_STAGE
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
Querying this table results in the following output:
| Row | VALUE                      | EVENT_ID | USER_ID |
|-----|----------------------------|----------|---------|
| 1   | { "c1": "1", "c2": "123" } | NULL     | NULL    |
| 2   | { "c1": "2", "c2": "321" } | NULL     | NULL    |
However, if I just create a table as
CREATE OR REPLACE TABLE TEST_CSV_TABLE2(
event_id VARCHAR,
user_id VARCHAR
);
and load the same file like:
COPY INTO TEST_CSV_TABLE2 FROM @TEST_STAGE
FILES = ('test.csv')
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
or even like:
COPY INTO TEST_CSV_TABLE2
FROM (
SELECT
t.$1,
t.$2
FROM @TEST_STAGE t)
FILES = ('test.csv')
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
This results in properly assigned columns:
| Row | EVENT_ID | USER_ID |
|-----|----------|---------|
| 1   | 1        | 123     |
| 2   | 2        | 321     |
Why are the columns not picked up properly in the case of the external table?
Many thanks ahead.

You need to use the name of the column when you are pulling it out of the JSON. What you have is creating the JSON column and then parsing it for attributes in the JSON called "$1" and "$2". When it doesn't find such an attribute, it returns NULL for that column.
CREATE OR REPLACE EXTERNAL TABLE TEST_CSV_TABLE1(
event_id VARCHAR AS (value:c1::varchar),
user_id VARCHAR AS (value:c2::varchar)
)
WITH LOCATION = @TEST_STAGE
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
Using a COPY INTO with $1 and $2 isn't using those to parse JSON like above; that is syntax specific to a COPY INTO query for referencing the columns in a file.
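As a quick check (a minimal sketch against the table above), you can query the raw VALUE column next to explicit c1/c2 projections; this works against either version of the external table, since VALUE is always populated with the parsed row:
-- The VALUE column carries the parsed row, so the auto-generated attribute
-- names (c1, c2, ...) can be inspected and cast directly:
SELECT value, value:c1::varchar AS event_id, value:c2::varchar AS user_id
FROM TEST_CSV_TABLE1
LIMIT 10;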

Related

Snowflake Data Masking

We're looking to mask certain PII in our Snowflake environment where it relates to team members. At the moment our masking is set up to mask every row in the column we define in our masking policies.
What we'd like to get to, though, is masking only the rows whose membership number appears in a separate table. Is that possible to implement, or how would I go about doing it?
| member | name   |
|--------|--------|
| A      | acds   |
| B      | asdas  |
| C      | asdeqw |

| member |
|--------|
| B      |
Just as an example, in the above tables, we'd only want to mask member B. At the moment, all 3 rows in the first table would be masked.
We have a possible workaround that does this in the logic of an extra view, but that actually alters the data; our hope was to use Dynamic Data Masking and then have exception processes for it.
Prepare the data:
create or replace table member (member_id varchar, name varchar);
insert into member values ('A', 'member_a'),('B', 'member_b'),('C', 'member_c');
create or replace table member_to_be_masked(member_id varchar);
insert into member_to_be_masked values ('B');
If you want to mask the member column:
create or replace masking policy member_mask as (val string) returns string ->
case
when exists
(
select
member_id
from
member_to_be_masked
where member_id = val
)
then '********'
else val
end;
alter table if exists member
modify column member_id
set masking policy member_mask;
select * from member;
+-----------+----------+
| MEMBER_ID | NAME     |
|-----------+----------|
| A         | member_a |
| ********  | member_b |
| C         | member_c |
+-----------+----------+
However, if you want to mask the name column, I don't see an easy way. I have tried having the policy link back to the table itself to find the member_id for the current name value, but it fails with the error message below:
Policy body contains a UDF or Select statement that refers to a Table attached to another Policy.
It looks like the policy cannot reference back to the source table. And because the policy only receives the value of the column it is defined on, it has no knowledge of the other columns' values, so it cannot decide whether to apply the mask or not.
It can work if you also store the "name" in the mapping table, together with the member_id, like below:
create or replace table member (member_id varchar, name varchar);
insert into member values ('A', 'member_a'),('B', 'member_b'),('C', 'member_c');
create or replace table member_to_be_masked(member_id varchar, name varchar);
insert into member_to_be_masked values ('B', 'member_b');
create or replace masking policy member_mask as (val string) returns string ->
case
when exists
(
select member_id
from member_to_be_masked
where name = val
)
then '********'
else val
end;
alter table if exists member
modify column name
set masking policy member_mask;
select * from member;
+-----------+----------+
| MEMBER_ID | NAME     |
|-----------+----------|
| A         | member_a |
| B         | ******** |
| C         | member_c |
+-----------+----------+
The downside of this approach is that if different members share the same name, all members with that name will be masked, regardless of whether the member's id is in the mapping table or not.
Keeping my previous answer in case it can still be useful.
Another workaround I can think of is to use variant data and then create a view on top of it.
Prepare the data in JSON format:
create or replace table member_json (member_id varchar, data variant);
insert into member_json
select
'A', parse_json('{"member_id": "A", "name" : "member_a"}')
union
select
'B', parse_json('{"member_id": "B", "name" : "member_b"}')
union
select
'C', parse_json('{"member_id": "C", "name" : "member_c"}')
;
create or replace table member_to_be_masked(member_id varchar);
insert into member_to_be_masked values ('B');
The data looks like this:
select * from member_json;
+-----------+----------------------+
| MEMBER_ID | DATA                 |
|-----------+----------------------|
| A         | {                    |
|           |   "member_id": "A",  |
|           |   "name": "member_a" |
|           | }                    |
| B         | {                    |
|           |   "member_id": "B",  |
|           |   "name": "member_b" |
|           | }                    |
| C         | {                    |
|           |   "member_id": "C",  |
|           |   "name": "member_c" |
|           | }                    |
+-----------+----------------------+
select * from member_to_be_masked;
+-----------+
| MEMBER_ID |
|-----------|
| B         |
+-----------+
Create a JS UDF:
create or replace function json_mask(mask boolean, v variant)
returns variant
language javascript
as
$$
if (MASK) {
V["member_id"] = '******'
V["name"] = '******';
}
return V;
$$;
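To see what the UDF does on its own, it can be called directly with a literal (an illustrative call, independent of any table or policy):
-- Both attributes of the variant are replaced when the mask flag is true
select json_mask(true, parse_json('{"member_id": "B", "name": "member_b"}'));
-- -> {"member_id": "******", "name": "******"}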
Create a masking policy using the UDF:
create or replace masking policy member_mask
as (val variant)
returns variant ->
case
when exists
(
select
member_id
from
member_to_be_masked
where member_id = val['member_id']
)
then json_mask(true, val)
else val
end;
Apply the policy to the member_json table:
alter table if exists member_json
modify column data
set masking policy member_mask;
Querying the table, you will see member B masked:
select * from member_json;
+-----------+--------------------------+
| MEMBER_ID | DATA                     |
|-----------+--------------------------|
| A         | {                        |
|           |   "member_id": "A",      |
|           |   "name": "member_a"     |
|           | }                        |
| B         | {                        |
|           |   "member_id": "******", |
|           |   "name": "******"       |
|           | }                        |
| C         | {                        |
|           |   "member_id": "C",      |
|           |   "name": "member_c"     |
|           | }                        |
+-----------+--------------------------+
Create a view on top of it:
create or replace view member_view
as
select
data:"member_id" as member_id,
data:"name" as name
from member_json;
Querying the view, you will see the masked data as well:
select * from member_view;
+-----------+------------+
| MEMBER_ID | NAME       |
|-----------+------------|
| "A"       | "member_a" |
| "******"  | "******"   |
| "C"       | "member_c" |
+-----------+------------+
Not sure if this can help in your use case.
As I understand it, you want to mask one column in your table based on another column and a lookup table.
We can use conditional masking in this case - https://docs.snowflake.com/en/sql-reference/sql/create-masking-policy.html#conditional-masking-policy
create or replace masking policy name_mask as (val string, member_id_arg string) returns string ->
case
    when exists
    (
        select 1
        from member_to_be_masked m
        where m.member_id = member_id_arg
    )
    then '********'
    else val
end;
(The second argument is deliberately not named member_id, so that it is not shadowed by the member_id column of member_to_be_masked inside the subquery.)
In the query profile, this shows up as a secure function. Please evaluate the performance: depending on the total number of records the function has to be applied to, the performance difference may be significant.
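For completeness, attaching a conditional masking policy differs slightly from the earlier ALTER statements: the USING clause names the masked column first, followed by the column(s) the condition reads. A minimal sketch against the member table above, assuming no other policy is currently set on the column:
-- name is the column being masked; member_id feeds the second policy argument
alter table if exists member
    modify column name
    set masking policy name_mask using (name, member_id);

-- With member B present in member_to_be_masked, only that row's name is masked:
select * from member;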

Postgres select like any of array of text

I have 2 tables; table 1 contains wildcard paths and table 2 contains files with full paths.
I want to select all files that match a wildcard path.
Example:
table1:
| type      | path   |
|-----------|--------|
| sys       | /etc/* |
| protected | /etc/* |
| sys       | /sys/* |
| log       | /log/* |
table2:
| file   | path             |
|--------|------------------|
| f1.cmd | /etc/folder/name |
| f2.cmd | /log/folder/name |
| f3.cmd | /etc/folder/name |
| f4.cmd | /sys/folder/name |
My ultimate goal is to create a VIEW that has all the data from table2 plus one more column, type, to tell me which type each file belongs to,
so that I can select all files that are of type = sys, for example.
**What I tried:**
Step 1: get a list of all paths of the wanted type.
select array_agg(replace(path,'*','%')) from
table1 where type = 'sys'
group by type
This will result in something like {"etc\\%","sys\\%"}.
Step 2: select files using LIKE ANY:
select * from symbols where path like any ( array['etc\\%', 'sys\\%'] )
This successfully returned all files with paths like the ones I need.
Now the question is: how can I combine both queries into one?
Or is there an easier way, using a JOIN for example?
Thanks
You could get the table1.type from each row in table2 by checking if table1.path is a substring of table2.path:
with table1(type, path) as (
values ('sys', '/etc/*'),
('sys', '/sys/*'),
('log', '/log/*'),
('etc', '/etc/*')
),
table2(file, path) as (
values ('f1.cmd', '/etc/folder/name'),
('f2.cmd', '/log/folder/name'),
('f3.cmd', '/etc/folder/name'),
('f4.cmd', '/sys/folder/name')
)
select *,
(select type
from table1
where position(replace(path, '*', '') in table2.path) > 0
limit 1) as type
from table2;
  file  |       path       | type
--------+------------------+------
 f1.cmd | /etc/folder/name | sys
 f2.cmd | /log/folder/name | log
 f3.cmd | /etc/folder/name | sys
 f4.cmd | /sys/folder/name | sys
(4 rows)
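If you would rather keep the wildcard semantics from the question instead of a plain substring check, the two steps can also be collapsed into a single join that the view can be built on. A sketch, assuming the same table1/table2 as above; note that a file matching several patterns will appear once per matching type:
-- table1.path wildcards are translated to LIKE patterns on the fly
create view files_with_type as
select t2.file, t2.path, t1.type
from table2 t2
left join table1 t1
  on t2.path like replace(t1.path, '*', '%');

-- e.g. all files of type 'sys':
-- select file, path from files_with_type where type = 'sys';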

Different results when reading messages written in Kafka with upsert-kafka format

I am using the following three test cases to test the behavior of upsert-kafka:
Write the aggregation results into Kafka with the upsert-kafka format (TestCase1)
Use Flink table result print to output the messages (TestCase2)
Consume the Kafka messages directly with the kafka-console-consumer tool (TestCase3)
I found that when using Flink table result print, it prints pairs of messages with -U and +U to indicate that one is deleted and the other is inserted, while the console consumer prints the results directly.
I would like to ask why Flink table result print behaves the way I have observed.
Where do the -U and +U (delete and insert) messages come from? Are they saved in Kafka as two messages? I think the answer is NO, because I didn't see these intermediate results when consuming with the console consumer.
package org.example.official.sql
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.example.model.Stock
import org.example.sources.StockSource
import org.scalatest.funsuite.AnyFunSuite
class UpsertKafkaTest extends AnyFunSuite {
  val topic = "test-UpsertKafkaTest-1"

  //Test Case 1
  test("write to upsert kafka: upsert-kafka as sink") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setRuntimeMode(RuntimeExecutionMode.STREAMING)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val ds: DataStream[Stock] = env.addSource(new StockSource(emitInterval = 1500, print = false))
    ds.print()
    val tenv = StreamTableEnvironment.create(env)
    tenv.createTemporaryView("sourceTable", ds)
    val ddl =
      s"""
         CREATE TABLE sinkTable (
           id STRING,
           total_price DOUBLE,
           PRIMARY KEY (id) NOT ENFORCED
         ) WITH (
           'connector' = 'upsert-kafka',
           'topic' = '$topic',
           'properties.bootstrap.servers' = 'localhost:9092',
           'key.format' = 'json',
           'value.format' = 'json'
         )
      """.stripMargin(' ')
    tenv.executeSql(ddl)
    tenv.executeSql(
      """
        insert into sinkTable
        select id, sum(price)
        from sourceTable
        group by id
      """.stripMargin(' '))
    env.execute()
  }
  //Test Case 2
  test("read from upsert kafka: upsert-kafka as source 2") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val tenv = StreamTableEnvironment.create(env)
    val ddl =
      s"""
         CREATE TABLE sourceTable (
           id STRING,
           total_price DOUBLE,
           PRIMARY KEY (`id`) NOT ENFORCED
         ) WITH (
           'connector' = 'upsert-kafka',
           'topic' = '$topic',
           'properties.bootstrap.servers' = 'localhost:9092',
           'properties.group.id' = 'testGroup001',
           'key.format' = 'json',
           'key.json.ignore-parse-errors' = 'true',
           'value.format' = 'json',
           'value.json.fail-on-missing-field' = 'false',
           'value.fields-include' = 'EXCEPT_KEY'
         )
      """.stripMargin(' ')
    tenv.executeSql(ddl)
    val result = tenv.executeSql(
      """
        select * from sourceTable
      """.stripMargin(' '))
    result.print()
    /*
    +----+-----+-------------+
    | op | id  | total_price |
    +----+-----+-------------+
    | +I | id1 |         1.0 |
    | -U | id1 |         1.0 |
    | +U | id1 |         3.0 |
    | -U | id1 |         3.0 |
    | +U | id1 |         6.0 |
    | -U | id1 |         6.0 |
    | +U | id1 |        10.0 |
    | -U | id1 |        10.0 |
    | +U | id1 |        15.0 |
    | -U | id1 |        15.0 |
    | +U | id1 |        21.0 |
    | -U | id1 |        21.0 |
    | +U | id1 |        28.0 |
    | -U | id1 |        28.0 |
    | +U | id1 |        36.0 |
    | -U | id1 |        36.0 |
    | +U | id1 |        45.0 |
    */
  }
  //Test Case 3
  test("read from upsert kafka with consumer console") {
    /*
    kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test-UpsertKafkaTest-1 --from-beginning
    {"id":"id1","total_price":1.0}
    {"id":"id1","total_price":3.0}
    {"id":"id1","total_price":6.0}
    {"id":"id1","total_price":10.0}
    {"id":"id1","total_price":15.0}
    {"id":"id1","total_price":21.0}
    {"id":"id1","total_price":28.0}
    {"id":"id1","total_price":36.0}
    {"id":"id1","total_price":45.0}
    */
  }
}
With Flink SQL we speak of the duality between tables and streams -- that a stream can be thought of as a (dynamic) table, and vice versa. There are two types of streams/tables: appending and updating. An append stream corresponds to a dynamic table that only performs INSERT operations; nothing is ever deleted or updated. And an update stream corresponds to a dynamic table where rows can be updated and deleted.
Your source table is an upsert-kafka table, and as such, is an update table (not an appending table). An upsert-kafka source corresponds to a compacted topic, and when compactions occur, that leads to updates/retractions where the existing values for various keys are updated over time.
When an updating table is converted into a stream, there are two possible results: you either get an upsert stream or a retraction stream. Some sinks support one or the other of these types of update streams, and some support both.
What you are seeing is that the upsert-kafka sink can handle upserts, and the print sink cannot. So the same update table is being fed to Kafka as a stream of upsert (and possibly deletion) events, and it's being sent to stdout as a stream with an initial insert (+I) for each key, followed by update_before/update_after pairs encoded as -U +U for each update (and deletions, were any to occur).

Split string and output into rows

I want to split a string based on the delimiter ',' and put the results into rows. Hence, I'm trying to use the SPLIT_TO_TABLE function in Snowflake, but it is not working successfully.
I used regexp_replace to clean the string. How can I output this into rows for each id?
SELECT value,
TRIM(regexp_replace(value, '[{}_]', ' ')) AS extracted
Here is the sample data:
+--------+------------------------------------+
| id     | value                              |
+--------+------------------------------------+
| fsaf12 | {Other Questions,Missing Document} |
| sfas11 | {Other}                            |
+--------+------------------------------------+
Expected result:
+--------+------------------+
| id     | extracted        |
+--------+------------------+
| fsaf12 | Other Questions  |
| fsaf12 | Missing Document |
| sfas11 | Other            |
+--------+------------------+
Adding another way to split the data and output it as rows:
SELECT b,TRIM(regexp_replace(splitvalue, '[{}_]', '')) AS extracted from
(SELECT b, C.value::string AS splitvalue
FROM split,
LATERAL FLATTEN(input=>split(a, ',')) C);
where a and b are the columns in table "split" and the data is as follows:
| A              | B    |
|----------------|------|
| {First,Second} | row1 |
| {Third,Fourth} | row2 |
Here is an answer, using the REPLACE function instead of regexp_replace:
WITH DATATABLE(ID ,VALUEA ) AS
(
SELECT * FROM VALUES ('fsaf12','{Other Questions,Missing Document}'),('sfas11',' {Other} ')
)
SELECT ID, REPLACE(REPLACE(VALUE, '{', ''), '}', '') SPLITTED_VALUE FROM DATATABLE, LATERAL SPLIT_TO_TABLE(VALUEA, ',');
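Applied to the sample data from the question (a sketch, assuming the source table is called src and has the id and value columns shown above), the same pattern with a TRIM for stray spaces would be:
-- SPLIT_TO_TABLE emits one row per comma-separated element; its VALUE column
-- is then stripped of the surrounding braces and whitespace
SELECT t.id,
       TRIM(REPLACE(REPLACE(s.value, '{', ''), '}', '')) AS extracted
FROM   src t,
       LATERAL SPLIT_TO_TABLE(t.value, ',') s;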

Inject SQL before Codeigniters $this->db->get() call

Here is what I am trying to do. I have a table with the following structure, which is supposed to hold translated values of data in any other table:
Translations
| Language id | translation | record_id | column_name | table_name |
|-------------|-------------|-----------|-------------|------------|
| 1           | Hello       | 1         | test_column | test_table |
| 2           | Aloha       | 1         | test_column | test_table |
| 1           | Test input  | 2         | test_column | test_table |
In my views, I have a function that looks up this table and returns the string in the user's language. If the string is not translated in their language, the function returns the string in the application's default language (let's say the one with ID = 1).
It works fine, but I have to go through about 600 view files to apply this... I was wondering if it was possible to inject some SQL into my CodeIgniter models right before the $this->db->get() call for the original record, so that it replaces the original column with the translated one.
Something like this:
$this->db->select('column_name, col_2, col_3');
// Injected SQL pseudocode:
// If RECORD EXISTS in table Translations where Language_id = 2 and record_id = 2 AND column_name = test_column AND table_name = test_table
// BEGIN
// SELECT translations.translation as column_name
// WHERE translations.table_name = test_table AND column_name = test_column AND record_id = 2
// END
// ELSE
// BEGIN
// SELECT translations.translation as column_name
// WHERE translations.table_name = test_table AND column_name = test_column AND record_id = 1
// END
$this->db->get('test_table');
Is this possible to be done somehow?
What you're asking for doesn't really make sense. You "inject" by simply making a different query first, then altering your second query based on the results.
The other option (perhaps better) would be to do all of this in a stored procedure, but it is still essentially the same, just with fewer connections and probably quicker processing.
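If it helps, the lookup-with-fallback the question describes can be expressed as one query and then either sent through $this->db->query() or wrapped in the stored procedure mentioned above. A hedged sketch; any table or column name not shown in the question (such as test_table.id) is an assumption:
-- Prefer the user's language, fall back to the default language (ID = 1),
-- and finally fall back to the untranslated column itself
SELECT COALESCE(tr_user.translation, tr_default.translation, t.test_column) AS test_column,
       t.col_2,
       t.col_3
FROM test_table t
LEFT JOIN translations tr_user
       ON tr_user.table_name  = 'test_table'
      AND tr_user.column_name = 'test_column'
      AND tr_user.record_id   = t.id
      AND tr_user.language_id = 2   -- the user's language
LEFT JOIN translations tr_default
       ON tr_default.table_name  = 'test_table'
      AND tr_default.column_name = 'test_column'
      AND tr_default.record_id   = t.id
      AND tr_default.language_id = 1;  -- the application default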
