dbt - stage_external_sources - partitioning - snowflake-cloud-data-platform

I'm trying to set up a clean partitioning scheme for the incremental files delivered to Azure Blob Storage.
However, I can't get the partition column and the table columns to play nice together.
If I remove the partition column, I get all of the table columns specified.
If I include the partition column, I only get the variant column and the partition column.
  - name: arsm
    external:
      location: '@my_azure_stage'
      file_format: 'myformat'
      pattern: '.*path.*tablename_.*'
      auto_refresh: false # depends on your Azure setup
      partitions: # optional
        - name: LOAD_DATE
          data_type: date
          expression: TO_DATE(substring(metadata$filename,16,10))
      columns:
        - name: "AoNr"
          data_type: bigint
        - name: "AoNrAlfa"
          data_type: varchar(65)
        - name: "AoPos"
          data_type: int
        - name: "ArtikelVariant"
          data_type: varchar(30)
        - name: "ArtKalkBer"
          data_type: NUMERIC

I think you just need to out-dent the columns array by two spaces. The columns array should be a top-level key of the source table, at the same level as name and external; right now you have it nested within the external dict.
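For reference, here is a sketch of the corrected source definition (untested, keeping your stage, pattern and column names exactly as posted), with columns moved up to the same level as name and external:

  - name: arsm
    external:
      location: '@my_azure_stage'
      file_format: 'myformat'
      pattern: '.*path.*tablename_.*'
      auto_refresh: false # depends on your Azure setup
      partitions: # optional
        - name: LOAD_DATE
          data_type: date
          expression: TO_DATE(substring(metadata$filename,16,10))
    columns:
      - name: "AoNr"
        data_type: bigint
      - name: "AoNrAlfa"
        data_type: varchar(65)
      - name: "AoPos"
        data_type: int
      - name: "ArtikelVariant"
        data_type: varchar(30)
      - name: "ArtKalkBer"
        data_type: NUMERIC

With that layout, stage_external_sources should pick up both the partition column and the regular table columns.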

Related

CassandraDB table with multiple Key-Value

I am a new CassandraDB user. I am trying to create a table which has 3 static columns, for example "name", "city" and "age", and then I was thinking of two "key" and "value" columns, since my table could receive a lot of inputs. How can I define this table? I am trying to achieve something scalable, e.g.:
Table columns --> "Name", "City", "Age", "Key", "Value"
Name: Mark
City: Liverpool
Age: 26
Key: Car
Value: Audi A3
Key: Job
Value: Computer Engineer
Key: Main hobby
Value: Football
I am looking for the TABLE DEFINITION. Any help? Thank you so much in advance.
If I understand correctly, you want to create a key-value store, grouped by "name", "city" and "age". There are a few ways to approach this.
First, by using STATIC columns:
create table record_by_id(
    recordId text,
    name text static,
    city text static,
    age int static,
    key text,
    value text,
    primary key (recordId, key)
);
With this table design, name, city and age remain constant for the same recordId. You can have any number of key-value pairs for the same recordId.
The second approach would be:
create table record_by_id(
    name text,
    city text,
    age int,
    key text,
    value text,
    primary key ((name, city, age), key)
);
In this design, name, city and age are part of the partition key, and the key column is the clustering key.
Both approaches are scalable, but the first approach is easier to maintain.
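To make the first design concrete, here is a rough usage sketch (the recordId value 'r1' is just an illustrative placeholder):

-- the static columns are stored once per partition (recordId)
INSERT INTO record_by_id (recordId, name, city, age, key, value)
VALUES ('r1', 'Mark', 'Liverpool', 26, 'Car', 'Audi A3');

-- further key/value pairs only need the partition key and the clustering key
INSERT INTO record_by_id (recordId, key, value)
VALUES ('r1', 'Job', 'Computer Engineer');

-- all pairs for one record come back from a single partition
SELECT * FROM record_by_id WHERE recordId = 'r1';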
table which has 3 static columns
So by "static" I assume you're not referring to Cassandra's definition of static columns. Which is cool, I know what you mean. But the mention did give me an idea of how to approach this:
trying to create the table definition
I see two ways to go about this.
CREATE TABLE user_properties (
    name TEXT,
    city TEXT STATIC,
    age INT STATIC,
    key TEXT,
    value TEXT,
    PRIMARY KEY (name, key));
Because we have static columns (only stored w/ the partition key name) adding more key/values is just a matter of adding more keys to the same name, so INSERTing data looks like this:
INSERT INTO user_properties (name,city,age,key,value)
VALUES ('Mark','Liverpool',26,'Car','Audi A3');
INSERT INTO user_properties (name,key,value)
VALUES ('Mark','Job','Computer Engineer');
INSERT INTO user_properties (name,key,value)
VALUES ('Mark','Main hobby','Football');
Querying looks like this:
> SELECT * FROM user_properties WHERE name='Mark';
name | key | age | city | value
------+------------+-----+-----------+-------------------
Mark | Car | 26 | Liverpool | Audi A3
Mark | Job | 26 | Liverpool | Computer Engineer
Mark | Main hobby | 26 | Liverpool | Football
(3 rows)
This is the "simple" way to go about it.
Or
CREATE TABLE user_properties_map (
    name TEXT,
    city TEXT,
    age INT,
    kv MAP<TEXT,TEXT>,
    PRIMARY KEY (name));
With a single partition key as the PRIMARY KEY, we can INSERT everything in one shot:
INSERT INTO user_properties_map (name,city,age,kv)
VALUES ('Mark','Liverpool',26,{'Car':'Audi A3',
'Job':'Computer Engineer',
'Main hobby':'Football'});
And querying looks like this:
> SELECT * FROM user_properties_map WHERE name='Mark';
name | age | city | kv
------+-----+-----------+--------------------------------------------------------------------------
Mark | 26 | Liverpool | {'Car': 'Audi A3', 'Job': 'Computer Engineer', 'Main hobby': 'Football'}
(1 rows)
This has the added benefit of putting the properties into a map, which might be helpful if that's the way you're intending to work with it on the application side. The drawbacks are that Cassandra collections are best kept under 100 items, the writes are a little more complicated, and you can't query individual entries of the map.
But by keying on name (might want to also include last name or something else to help with uniqueness), data should scale fine. And partition growth won't be a problem, unless you're planning on thousands of key/value pairs.
Basically, choose the structure based on the standard Cassandra advice of considering how you'd query the data, and then build the table to suit it.

Snowflake External Table Partition - Granular Path

Experts,
We have our JSON files stored in S3 in the folder structure below:
/appname/lob/2020/07/24/12, /appname/lob/2020/07/24/13, /appname/lob/2020/07/24/14
stage @SFSTG = /appname/lob/
We need to create an external table with partitions based on the hours. We can derive the partition part from metadata$filename. However, the question here is: should the partition column be created as a timestamp or a varchar?
Which partition data type gives us better performance when accessing the files from Snowflake using an external table?
Snowflake's recommendation is the following:
date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
*Double-check that 14 is the correct start of the date portion in your stage URL; I may have it incorrect here.
Full example:
CREATE OR REPLACE EXTERNAL TABLE Database.Schema.ExternalTableName(
    date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
    col1 varchar AS (value:col1::varchar),
    col2 varchar AS (value:col2::varchar))
PARTITION BY (date_part)
INTEGRATION = 'YourIntegration'
LOCATION = @SFSTG/appname/lob/
AUTO_REFRESH = true
FILE_FORMAT = (TYPE = JSON);
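If you want pruning down to the individual hour folders as well, one option (a sketch, not part of the original answer) is to add a second partition column for the hour. The substr offsets below assume the date starts at position 14, as above, and that the hour immediately follows the /YYYY/MM/DD/ part; verify both against your actual metadata$filename values:

CREATE OR REPLACE EXTERNAL TABLE Database.Schema.ExternalTableName(
    date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
    hour_part number as to_number(substr(metadata$filename, 25, 2)),
    col1 varchar AS (value:col1::varchar),
    col2 varchar AS (value:col2::varchar))
PARTITION BY (date_part, hour_part)
INTEGRATION = 'YourIntegration'
LOCATION = @SFSTG/appname/lob/
AUTO_REFRESH = true
FILE_FORMAT = (TYPE = JSON);

Either way, keeping the partition columns typed (date and number rather than varchar) means the filters you write in queries, e.g. WHERE date_part = '2020-07-24' AND hour_part = 13, compare against typed values instead of relying on string formatting.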

Dynamic workflow: storing data in static tables vs creating dynamic table

I have a web application where users can create data forms on the fly. The application is in ASP.NET with MSSQL in the backend. In order to create a form, the user has to specify fields for the form. In order to add a field, the user has to specify the field label, the type of the field (e.g. numeric, amount, email) and other attributes. Below is an example of a dynamic form configuration:
Form Name: Sales
Field Label: Customer Name
Field Type: text
Length: 100
Field Label: Amount
Field Type: money
I hope you are getting the gist. There can be many such forms. There can be thousands of rows for each form. The field configuration is stored in tables with schema as below:
Table: DataTypes
Column: DataTypeId int, DataTypeName varchar(100)
Table: FormFields
Columns: FieldId int, FieldLabel varchar(100), DataTypeId int
There are two ways I can store the form values in database.
Approach 1: Store data in static tables as key-value pairs. With this approach my table will look like below:
Table: FormFieldValues
Columns: FieldId int, Numeric_value numeric(18,5), Text_Value varchar(500), Money_Value money
So a row in FormFieldValues table will look like below
FieldId | Numeric_Value | Text_Value | Money_Value
1       | 100.512       |            |
2       |               | John Wayne |
3       |               |            | $200.50
The advantage of this approach is that, since the table structure is static, dynamic SQL is not needed when fetching rows, for example. The disadvantage is having to convert rows into columns in code (a sketch of that conversion is shown below).
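For illustration only: the row-to-column conversion could also be pushed into SQL with conditional aggregation. This sketch assumes FormFieldValues also carries a RecordId column identifying one submitted form instance (not shown in the schema above):

SELECT
    v.RecordId,
    MAX(CASE WHEN f.FieldLabel = 'Customer Name' THEN v.Text_Value END) AS CustomerName,
    MAX(CASE WHEN f.FieldLabel = 'Amount' THEN v.Money_Value END) AS Amount
FROM FormFieldValues AS v
JOIN FormFields AS f ON f.FieldId = v.FieldId
GROUP BY v.RecordId;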
Approach 2: Using the field configuration, create a dynamic table on the fly. So in the above example I can create a table as below:
Table: Form1
Columns: Id int, CustomerName varchar(100), Amount money, Version numeric (15,5)
A row in this table looks like this:
CustomerName | Amount  | Version
John Wayne   | $200.50 | 100.512
The advantage of this approach is that each column is created with the exact type and precision specified in the field configuration. The disadvantage is that dynamic SQL is needed to store and fetch the data (a sketch of the dynamic DDL follows below).
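As a minimal sketch of the dynamic DDL (illustrative only; in the real application the column list would be generated from the FormFields/DataTypes configuration rather than hard-coded):

DECLARE @sql nvarchar(max) =
    N'CREATE TABLE dbo.Form1 (' +
    N'Id int IDENTITY(1,1) PRIMARY KEY, ' +
    N'CustomerName varchar(100), ' +
    N'Amount money, ' +
    N'Version numeric(15,5));';
EXEC sys.sp_executesql @sql;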
Which approach is better? Which approach would be more scalable and performance-friendly in the long run?
Note: SO doesn't support table formats so the tables may not appear proper but I hope the readers will get the gist

Partitioning table based on first letter of a varchar field

I have a massive table (over 1B records) that has a specific requirement for table partitioning:
(1) Is it possible to partition a table in Postgres based on the first character of a varchar field?
For example:
For the following 3 records:
a-blah
a-blah2
b-blah
a-blah and a-blah2 would go in the "A" partition, b-blah would go into the "B" partition.
(2) If the above is not possible with Postgres, what is a good way to evenly partition a large growing table? (without partitioning by create date -- since that is not something these records have).
You can use an expression in the partition by clause, e.g.:
create table my_table(name text)
partition by list (left(name, 1));
create table my_table_a
partition of my_table
for values in ('a');
create table my_table_b
partition of my_table
for values in ('b');
Results:
insert into my_table
values
('abba'), ('alfa'), ('beta');
select 'a' as partition, name from my_table_a
union all
select 'b' as partition, name from my_table_b;
partition | name
-----------+------
a | abba
a | alfa
b | beta
(3 rows)
If the partitioning should be case insensitive you might use
create table my_table(name text)
partition by list (lower(left(name, 1)));
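One optional addition (not in the original answer, just a suggestion): on PostgreSQL 11 or later you can also add a DEFAULT partition, so rows whose first character has no dedicated partition yet don't fail on insert as the table grows:

create table my_table_default
partition of my_table
default;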
Read in the documentation:
Table Partitioning
CREATE TABLE

SQL Server nonclustered indexes

I am trying to figure out the best way to handle the indexes on a table in SQL Server.
I have a table that only needs to be read from. No real writing to the table (after the initial setup).
I have about 5-6 columns in the table that need to be indexed. Does it make more sense to set up one nonclustered index for the entire table and add all the columns that I need indexed to that index, or should I set up multiple nonclustered indexes, each with one column?
I am wondering which setup would have better read performance.
Any help on this would be great.
UPDATE:
There are some good answers already but I wanted to elaborate on my needs a little more.
There is one main table with auto records. I need to be able to perform very quick counts on over 100MM records. The WHERE clauses will vary, but I am trying to index all of the possible columns that can appear in them. So I will have queries like:
SELECT COUNT(recordID)
FROM tableName
WHERE zip IN (32801, 32802, 32803, 32809)
AND makeID = '32'
AND modelID IN (22, 332, 402, 504, 620)
or something like this:
SELECT COUNT(recordID)
FROM tableName
WHERE stateID = '9'
AND classCode IN (3,5,9)
AND makeID NOT IN (55, 56, 60, 80, 99)
So there are about 5-6 columns that could be in the WHERE clause, but which ones will vary a lot.
The fewer indexes you have - the better. Each index might speed up some queries - but it also incurs overhead and needs to be maintained. Not so bad if you don't write much to the table.
If you can combine multiple columns into a single index - perfect! But if you have a compound index on multiple columns, that index can only be used if you use/need the n left-most columns.
So if you have an index on (City, LastName, FirstName) like in a phone book - this works if you're looking for:
everyone in a given city
every "Smith" in "Boston"
every "Paul Smith" in "New York"
but it cannot be used to find all entries with first name "Paul", or all people with a last name of "Brown", in your entire table; the index can only be used if you also specify the City column.
So compound indexes are beneficial and desirable - but only if you can really use them! Having just one index with all 6 of your columns does not help you at all if you need to filter on the columns individually.
Update: with your concrete queries, you can now start to design what indexes would help:
SELECT COUNT(recordID)
FROM tableName
WHERE zip IN (32801, 32802, 32803, 32809)
AND makeID = '32'
AND modelID IN (22, 332, 402, 504, 620)
Here, an index on (zip, makeID, modelID) would probably be a good idea - all three columns are used in the WHERE clause (together), and having the recordID in the index as well (as an INCLUDE(recordID) clause) should help, too.
SELECT COUNT(recordID)
FROM tableName
WHERE stateID = '9'
AND classCode IN (3,5,9)
AND makeID NOT IN (55, 56, 60, 80, 99)
Again: based on the WHERE clause - create an index on (stateID, classCode, makeID) and possibly add INCLUDE(recordID) so that the nonclustered index becomes covering (i.e. all the info needed for your query is in the nonclustered index itself - no need to go back to the "base" table).
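As a sketch of those two suggestions (the index names are made up; adjust them and the schema qualification to your environment):

CREATE NONCLUSTERED INDEX IX_tableName_zip_makeID_modelID
    ON dbo.tableName (zip, makeID, modelID)
    INCLUDE (recordID);

CREATE NONCLUSTERED INDEX IX_tableName_stateID_classCode_makeID
    ON dbo.tableName (stateID, classCode, makeID)
    INCLUDE (recordID);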
It depends on your access pattern
For a read only table, I'd most likely create multiple non-clustered indexes, each having multiple key columns to match WHERE clauses, and INCLUDEd columns for non-key columns
I would have neither one nonclustered index covering all columns nor one index per column: they won't be useful for your actual queries.
