Hive - insert values in array from file (comma and semicolon separated)

I have a file with columns separated by semicolons. I want to add a type column as an Array<String>. Right now I store my values raw, like this (the type column is text):
| age | type | country |
| 24  | a    | us      |
| 29  | a,b  | au      |  <--------- this row is not OK
| 25  | a    | uk      |
My file is like the following:
age;type1,type2;country
age;type1;country
age;type2;country
How do I correctly put the types in my table as an Array<String>?

The same data will work. Create the table:
CREATE TABLE array_data_type(
  age int,
  type array<string>,
  country varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
COLLECTION ITEMS TERMINATED BY ',';
Load the same data into this table.
If this data is in a local file:
LOAD DATA LOCAL INPATH '<file-path>' INTO TABLE array_data_type;
or in case of HDFS file:
LOAD DATA INPATH '<hdfs-file-path>' INTO TABLE array_data_type;
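Once loaded, the type column behaves as a real array. A quick hedged check, assuming the table and sample data above:

SELECT * FROM array_data_type;
-- the a,b row should now show as: 29  ["a","b"]  au

-- access elements by index
SELECT age, type[0] AS first_type, country FROM array_data_type;

-- or explode the array into one row per element
SELECT age, t.single_type, country
FROM array_data_type
LATERAL VIEW explode(type) t AS single_type;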

Related

Snowflake join table with stage file

I have CSV files with multiple columns. Sometimes there can be 2, sometimes 43. I have mapped these columns to the Snowflake metadata table. I want to insert values into the target table, but in the CSV files the column name can differ, for example subject, subject_name or subject_names. In the target table I have only one column for this, called subject_name. So if the "subject" column in the CSV file is null I need to check the "subject_name" column, and if "subject_name" is null I need to check "subject_names". Is there any way to check whether these columns have null values? I must add that the columns in the CSV are not always in the same position, so I can't use select $1 from @stage.
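No answer is recorded here, but one common pattern is to land the file into a wide staging table that has a column for every header variant and COALESCE them on the way into the target. A minimal sketch, where landing_table and target_table are hypothetical names:

-- landing_table is assumed to have one column per possible header
-- variant (subject, subject_name, subject_names), populated however
-- the file was mapped; all names here are hypothetical.
INSERT INTO target_table (subject_name)
SELECT COALESCE(subject, subject_name, subject_names)
FROM landing_table;

COALESCE returns the first non-null argument, which matches the "check subject, then subject_name, then subject_names" requirement.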

Snowflake - Keeping target table schema in sync with source table variant column value

I ingest data into a table source_table with AVRO data. There is a column in this table, say "avro_data", which is populated with variant data.
I plan to copy data into a structured table target_table whose columns have the same names and datatypes as the avro_data fields in the source table.
Example:
select avro_data from source_table
{"C1":"V1", "C2", "V2"}
This will result in
select * from target_table
------------
| C1 | C2 |
------------
| V1 | V2 |
------------
My question is: when the schema of avro_data evolves and new fields get added, how can I keep the schema of target_table in sync by adding equivalent columns to it?
Is there anything out of the box in Snowflake to achieve this, or has someone created code to do something similar?
Here's something to get you started. It shows how to take a variant column and parse out the internal columns. This uses a table in the Snowflake sample data database, which is not always available, so you may need to adjust the table name and column name.
SELECT DISTINCT
       regexp_replace(regexp_replace(f.path, '\\[(.+)\\]'), '(\\w+)', '"\\1"') AS path_name,    -- paths with each level enclosed in double quotes (ex: "path"."to"."element"); also strips bracket-enclosed array element references (like "[0]")
       DECODE(substr(typeof(f.value), 1, 1),
              'A', 'ARRAY',
              'B', 'BOOLEAN',
              'I', 'FLOAT',
              'D', 'FLOAT',
              'STRING') AS attribute_type,                                                      -- column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
       regexp_replace(regexp_replace(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS alias_name  -- column aliases based on the path
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."JCUSTOMER",
     LATERAL FLATTEN("CUSTOMER", RECURSIVE => true) f
WHERE TYPEOF(f.value) != 'OBJECT'
  AND NOT contains(f.path, '[');
This is a snippet of code modified from here: https://community.snowflake.com/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling. The blog author attributes credit to a colleague for this section of code.
While the current incarnation of the stored procedure will create a view from the internal columns in a variant, an alternate version could create and/or alter a table to keep it in sync with changes.
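Building on that query, one way to do the table variant is to generate ALTER TABLE statements from the discovered paths and execute them from a stored procedure. A hedged sketch against the question's source_table and avro_data column; target_table and the type mapping are assumptions carried over from the query above:

-- Emit one ALTER TABLE statement per leaf attribute discovered in the
-- variant; execute each one, skipping columns that already exist.
SELECT DISTINCT
       'ALTER TABLE target_table ADD COLUMN ' ||
       regexp_replace(regexp_replace(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') || ' ' ||
       DECODE(substr(typeof(f.value), 1, 1),
              'A', 'ARRAY', 'B', 'BOOLEAN', 'I', 'FLOAT', 'D', 'FLOAT', 'STRING') ||
       ';' AS ddl
FROM source_table,
     LATERAL FLATTEN(avro_data, RECURSIVE => true) f
WHERE TYPEOF(f.value) != 'OBJECT'
  AND NOT contains(f.path, '[');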

How to load an array when the collection delimiter is the same as the field delimiter in Hive?

I have one file. It contains 4 fields, of which the last two should go into an array. So I created the table in Hive as:
create table testtable(f1 string, f2 string, f3 array<string>) row format delimited fields terminated by ',' collection items terminated by ',';
Data:
a,b,c,d
1,sdf,2323,sdaf
1,sdf,34,wer
1,sdf,223,daf
1,sdf,233,af
When I load data into the table using the query below, it loads successfully but gives an incorrect result. It didn't load the last two fields into the array; it loaded just one. Below is the result:
load data inpath 'data/file.txt' into table testtable;
Result:
hive> select * from testtable;
OK
a b ["c"]
1 sdf ["2323"]
1 sdf ["34"]
1 sdf ["223"]
1 sdf ["233"]
So the question is: how can I load the data into the array field when it uses the same delimiter as the collection delimiter? My source file will always contain the same delimiter.
Hive is interpreting all the separators as field separators, and therefore sees your input as 4 columns. Since you've defined your table as having 3 columns, it just ignores the fourth. I think you need to read your data into a temporary 4-column table, then build your desired table from it:
create table temptesttable(f1 string, f2 string, f3 string, f4 string)
row format delimited fields terminated by ',';
load data inpath 'data/file.txt' into table temptesttable;
create table testtable as select f1, f2, array(f3, f4) as f3 from temptesttable;
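If that works, the rebuilt table should pair the last two fields into the array. A quick hedged check against the sample data above:

select * from testtable;
-- expected output:
-- a    b    ["c","d"]
-- 1    sdf  ["2323","sdaf"]
-- 1    sdf  ["34","wer"]
-- 1    sdf  ["223","daf"]
-- 1    sdf  ["233","af"]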

T-SQL for loop through column names and insert

I'm working on a data warehouse project, and I'm stuck at the point where I need to loop through the column names in a dimension table and select the value corresponding to each column name from my base data table (the one with the actual data that I want to insert into the fact table). Here's my table structure:
Data Table
closing_course | max_course | min_course
----------------------------------------
234            | 241        | 187
254            | 277        | 198
Dimension Table
course_id | course_type
------------------------
1         | closing_course
2         | max_course
3         | min_course
In short, I want to build a procedure that, FOR EVERY COURSE TYPE, gets the VALUE of each course and inserts the course_id and corresponding value into the FACT TABLE (among other dimension data, but I think I can handle that).
I am not quite sure what you are looking for; maybe you could show an example of what you want to achieve. Here is a possible solution:
INSERT INTO FactTable (courseId, value)
SELECT 1, closing_course FROM DataTable
UNION ALL  -- UNION ALL keeps duplicate rows that plain UNION would collapse
SELECT 2, max_course FROM DataTable
UNION ALL
SELECT 3, min_course FROM DataTable;
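If the course types should be driven by the dimension table rather than hard-coded one SELECT at a time, a hedged alternative is UNPIVOT joined to the dimension. A sketch assuming the table and column names above; note the IN list still names the columns, so a fully dynamic version would need dynamic SQL built from the dimension rows:

-- Rotate the course columns into rows, then look up each course_id
INSERT INTO FactTable (courseId, value)
SELECT d.course_id, u.course_value
FROM DataTable
UNPIVOT (course_value FOR course_type IN (closing_course, max_course, min_course)) AS u
JOIN DimensionTable d
  ON d.course_type = u.course_type;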

Find alphabet characters in a number column in SQL

I have a table like this:
CREATE TABLE [Mytable](
  [Name] [varchar](10),
  [number] [nvarchar](100))
I want to find [number] values that include alphabet characters.
The data should be formatted like this:
Name | number
---------------
Jack | 2131546
Ali | 2132132154
but sometimes a number gets inserted malformed, with alphabet characters and other non-numeric characters in it, like this:
Name | number
---------------
Jack | 2[[[131546ddfd
Ali | 2132*&^1ASEF32154
I want to find these malformed rows.
I can't use 'LIKE', because 'LIKE' makes my query very slow.
Updated to find all non-numeric characters:
select * from Mytable where number like '%[^0-9]%'
Regarding the comments on performance: using CLR and regex might speed things up slightly, but the bulk of the cost for this query is going to be the number of logical reads.
A bit outside the box, but you could do something like:
1. Bulk copy the data out of your table into a flat file.
2. Create a table that has the same structure as your original table but with a proper numeric type (e.g. int) for the [number] column.
3. Bulk copy your data into this new table, making sure to specify a batch size of 1 and an error file (where rows that won't fit the schema will go).
4. Rows that end up in the error file are the rows that have non-numerics in the [number] column.
Of course, you could do the same thing with a cursor and a temp table or two...
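A minimal T-SQL sketch of steps 2-4, using BULK INSERT for the reload (the bcp utility would work the same way); C:\data\mytable.csv and MyNumericTable are placeholder names:

-- Step 2: same structure, but a numeric type for [number]
CREATE TABLE MyNumericTable([Name] varchar(10), [number] int);

-- Step 3: reload row by row, diverting rows that fail conversion
BULK INSERT MyNumericTable
FROM 'C:\data\mytable.csv'
WITH (FIELDTERMINATOR = ',',
      BATCHSIZE = 1,
      MAXERRORS = 999999,
      ERRORFILE = 'C:\data\bad_rows.txt');

-- Step 4: C:\data\bad_rows.txt now contains the rows whose [number]
-- values are not numeric.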
