Hive Query: working with String Array - arrays

I have a Hive table with the following schema:
hive>desc books;
gen_id int
author array<string>
rating double
genres array<string>
hive>select * from books;
| gen_id | rating | author    | genres    |
+--------+--------+-----------+-----------+
| 1      | 10     | ["A","B"] | ["X","Y"] |
| 2      | 20     | ["C","A"] | ["Z","X"] |
| 3      | 30     | ["D"]     | ["X"]     |
Is there a SELECT query that returns one row per array element, like this:
| gen_id | rating | SplitData
+-------------+---------------+-------------
| 1 | 10 | "A"
| 1 | 10 | "B"
| 1 | 10 | "X"
| 1 | 10 | "Y"
| 2 | 20 | "C"
| 2 | 20 | "A"
| 2 | 20 | "Z"
| 2 | 20 | "X"
| 3 | 30 | "D"
| 3 | 30 | "X"
Can someone show me how to get this result? Thanks in advance for any help.

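You need LATERAL VIEW with explode. Explode each array in its own SELECT and combine the two with UNION ALL; chaining both lateral views in a single SELECT would instead return every author/genre combination, so values would repeat. For example:
SELECT
  gen_id,
  rating,
  SplitData
FROM (
  -- one row per author
  SELECT gen_id, rating, ex_author AS SplitData
  FROM books
  LATERAL VIEW explode(books.author) exploded_authors AS ex_author
  UNION ALL
  -- one row per genre
  SELECT gen_id, rating, ed_genres AS SplitData
  FROM books
  LATERAL VIEW explode(books.genres) exploded_genres AS ed_genres
) tab;
I had no chance to test it, but it should show you the general path. Row order is not guaranteed unless you add an ORDER BY. GL!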
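To make the target shape concrete, here is a minimal Python sketch (not Hive code) that flattens the sample rows from the question into one row per array element. The name `split_rows` is made up for illustration.

```python
# Sample rows copied from the question: (gen_id, rating, author, genres).
books = [
    (1, 10, ["A", "B"], ["X", "Y"]),
    (2, 20, ["C", "A"], ["Z", "X"]),
    (3, 30, ["D"], ["X"]),
]


def split_rows(rows):
    """Yield one (gen_id, rating, value) row per element of author, then genres."""
    for gen_id, rating, author, genres in rows:
        for value in author + genres:
            yield (gen_id, rating, value)


result = list(split_rows(books))
for row in result:
    print(row)
```

Each input row contributes len(author) + len(genres) output rows, which is the count any working query should reproduce.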

Related

Best way of storing enumerated fields with ability to change order Postgres

What is the best way for storing enumerated fields with ability to change its order?
Let's say my database looks like this:
| Table |
|---------------------|
| id | name | order|
| 1 | 1st | 1 |
| 2 | 2nd | 2 |
| 3 | 3rd | 3 |
| 4 | 4th | 4 |
Now, when the user changes the order like this:
| Table |
|---------------------|
| id | name | order|
| 1 | 1st | 1 |
| 4 | 4th | 2 |
| 2 | 2nd | 3 |
| 3 | 3rd | 4 |
Here I would have to update all rows in this table.
I am considering two solutions:
Solution 1)
When inserting row X between, for example, order 2 and order 3, I would set row X's order field to 2.5, i.e. the midpoint between the adjacent orders.
The table above would then look like this:
| Table |
|---------------------|
| id | name | order|
| 1 | 1st | 1 |
| 4 | 4th | 1.5 |
| 2 | 2nd | 2 |
| 3 | 3rd | 3 |
Then, after some number of changes (for example 16), I would update the table and normalize all order fields, so the table after normalization would look like this:
| Table |
|---------------------|
| id | name | order|
| 1 | 1st | 1 |
| 4 | 4th | 2 |
| 2 | 2nd | 3 |
| 3 | 3rd | 4 |
Solution 2)
I am also considering adding a "next" field (or "next" and "prev") to each row, but that looks to me like a waste of memory.
I really don't want to update the whole table every time somebody changes the order. What is the best way of solving this problem?
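Solution 1 can be sketched in a few lines of Python to show that a move touches only one row until normalization. This is a rough illustration, not Postgres code; `position_between` and `normalize` are hypothetical helper names.

```python
def position_between(before, after):
    """Return an order value halfway between two neighbouring order values."""
    return (before + after) / 2


def normalize(rows):
    """Rewrite order values as 1, 2, 3, ... while keeping the current sequence."""
    ordered = sorted(rows, key=lambda r: r["order"])
    for index, row in enumerate(ordered, start=1):
        row["order"] = index
    return ordered


rows = [
    {"id": 1, "name": "1st", "order": 1},
    {"id": 2, "name": "2nd", "order": 2},
    {"id": 3, "name": "3rd", "order": 3},
    {"id": 4, "name": "4th", "order": 4},
]
by_id = {r["id"]: r for r in rows}

# Move id 4 between id 1 and id 2: only this one row is updated.
midpoint = position_between(by_id[1]["order"], by_id[2]["order"])
by_id[4]["order"] = midpoint
sequence = [r["id"] for r in sorted(rows, key=lambda r: r["order"])]

# Occasional clean-up pass that restores integer order values.
normalized = normalize(rows)
```

Repeated halving in the same gap eventually runs out of floating-point precision, which is why the periodic normalization step matters.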

Generate a new dataset from two existing datasets with conditions

I have two datasets with the same columns, and I would like to create a new one in another sheet with all rows from the first dataset plus specific rows from the second one.
My first dataset is like:
| Item Type | Item Numb | Start Date | End date |
---------------------------------------------------
| 1 | 1 | 17/02/2022 | 21/02/2022 |
| 1 | 2 | 19/02/2022 | 24/02/2022 |
| 2 | 1 | 15/02/2022 | 18/02/2022 |
| 2 | 2 | 17/02/2022 | 20/02/2022 |
| 3 | 1 | 21/02/2022 | 25/02/2022 |
And the second one is like:
| Item Type | Item Numb | Start Date | End date |
---------------------------------------------------
| 1 | 2 | 17/02/2022 | 20/02/2022 |
| 2 | 2 | 17/02/2022 | 20/02/2022 |
| 2 | 3 | 20/02/2022 | 23/02/2022 |
| 3 | 1 | 20/02/2022 | 23/02/2022 |
| 4 | 1 | 21/02/2022 | 24/02/2022 |
| 4 | 2 | 23/02/2022 | 28/02/2022 |
So now I would like a new sheet that contains the rows from the first dataset, followed by the rows from the second one that are missing from the first.
If a combination of "Item Type" and "Item Numb" has already been imported, I don't want to take it from the second dataset; if that combination isn't in the first dataset, I would like to add the row.
This is the result I need:
| Item Type | Item Numb | Start Date | End date |
---------------------------------------------------
| 1 | 1 | 17/02/2022 | 21/02/2022 |
| 1 | 2 | 19/02/2022 | 24/02/2022 |
| 2 | 1 | 15/02/2022 | 18/02/2022 |
| 2 | 2 | 17/02/2022 | 20/02/2022 |
| 3 | 1 | 21/02/2022 | 25/02/2022 |
| 2 | 3 | 20/02/2022 | 23/02/2022 |
| 4 | 1 | 21/02/2022 | 24/02/2022 |
| 4 | 2 | 23/02/2022 | 28/02/2022 |
Thanks in advance for your time folks!
Try:
=INDEX(ARRAY_CONSTRAIN(QUERY(SORTN(
{Sheet1!A2:D, Sheet1!A2:A&Sheet1!B2:B;
Sheet2!A2:D, Sheet2!A2:A&Sheet2!B2:B}, 9^9, 2, 5, 1),
"where Col1 is not null", 0), 9^9, 4)

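The rule being applied (keep every row of the first dataset, then append rows of the second whose "Item Type" / "Item Numb" pair has not been seen yet) can be sketched in Python. This is only an illustration of the logic with the sample data, not a Sheets formula; `append_missing` is a made-up name.

```python
# (item_type, item_numb, start_date, end_date), copied from the question.
first = [
    (1, 1, "17/02/2022", "21/02/2022"),
    (1, 2, "19/02/2022", "24/02/2022"),
    (2, 1, "15/02/2022", "18/02/2022"),
    (2, 2, "17/02/2022", "20/02/2022"),
    (3, 1, "21/02/2022", "25/02/2022"),
]
second = [
    (1, 2, "17/02/2022", "20/02/2022"),
    (2, 2, "17/02/2022", "20/02/2022"),
    (2, 3, "20/02/2022", "23/02/2022"),
    (3, 1, "20/02/2022", "23/02/2022"),
    (4, 1, "21/02/2022", "24/02/2022"),
    (4, 2, "23/02/2022", "28/02/2022"),
]


def append_missing(first, second):
    """Return first plus the rows of second whose (type, numb) key is new."""
    seen = {(row[0], row[1]) for row in first}
    combined = list(first)
    for row in second:
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            combined.append(row)
    return combined


combined = append_missing(first, second)
```

Using the pair as a tuple key avoids the ambiguity of plain concatenation, where type 1 with number 12 and type 11 with number 2 would both produce "112".
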
Stripping out dates, of several formats, from strings

I have a column of strings, called "MyStrings" like the following:
...
Foo bar Jul15 blah blah.xlsx
Choo bar Jul-15 blah far.xlsx
Star bar 10-Jul-15 blah far.xlsx
Car Star bar 10.Jul.2015 blah far.xlsx
...
...
I'd like to do string manipulation so all dates, whatever format, are not included in the results.
So the following query:
SELECT results = <manipulated "MyStrings">
FROM aTable
Should have these results:
...
Foo bar blah blah.xlsx
Choo bar blah far.xlsx
Star bar blah far.xlsx
Car Star bar blah far.xlsx
...
...
Is there a quick way of doing this or do I need to consider each format individually?
You need a split function.
If you first split on <space>, it is easy to write a pattern for each of these formats:
monDD
mon-DD
DD-mon-YY
DD.mon.YYYY
SQL Fiddle Demo
WITH splitCTE AS (
SELECT s.[id], f.Number, f.Item
FROM dbo.SourceData AS s
CROSS APPLY dbo.SplitStrings(s.[test], ' ') as f
)
SELECT *,
CASE
WHEN item Like 'Jul[0-9][0-9]' THEN 'mmmdd'
WHEN item Like 'Jul-[0-9][0-9]' THEN 'mmm-dd'
WHEN item Like '[0-9][0-9]-Jul-[0-9][0-9]' THEN 'dd-mmm-yy'
WHEN item Like '[0-9][0-9].Jul.[0-9][0-9][0-9][0-9]' THEN 'dd.mmm.yyyy'
ELSE ''
END matchType
FROM splitCTE
Notes:
You need a join with a list of three-character month names to replace the hard-coded Jul.
It is easy to extend this to also match the full month name.
This will match Jul77 as mmmdd, but it is a start.
You can calculate an IsValidDate column in another step.
For some of the formats you can use CONVERT to check for a valid date.
For others, like Jul77, you can separate the first 3 characters from the last 2 and try to build a date.
OUTPUT
| id | Number | Item | matchType |
|----|--------|-------------|-------------|
| 1 | 1 | Foo | |
| 1 | 2 | bar | |
| 1 | 3 | Jul15 | mmmdd |
| 1 | 4 | blah | |
| 1 | 5 | blah.xlsx | |
| 2 | 1 | Choo | |
| 2 | 2 | bar | |
| 2 | 3 | Jul-15 | mmm-dd |
| 2 | 4 | blah | |
| 2 | 5 | far.xlsx | |
| 3 | 1 | Star | |
| 3 | 2 | bar | |
| 3 | 3 | 10-Jul-15 | dd-mmm-yy |
| 3 | 4 | blah | |
| 3 | 5 | far.xlsx | |
| 4 | 1 | Car | |
| 4 | 2 | Star | |
| 4 | 3 | bar | |
| 4 | 4 | 10.Jul.2015 | dd.mmm.yyyy |
| 4 | 5 | blah | |
| 4 | 6 | far.xlsx | |
Then use your favorite FOR XML PATH technique to join the items back together, leaving out the ones with a non-empty matchType.
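The split, classify, and rejoin idea can be sketched end to end in Python. This is an illustration of the approach rather than T-SQL, it assumes English three-letter month abbreviations and only the four formats from the question, and `strip_dates` is a made-up name.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# One alternative per format: monDD / mon-DD, DD-mon-YY, DD.mon.YYYY.
DATE_TOKEN = re.compile(
    rf"{MONTH}-?\d{{2}}|\d{{2}}-{MONTH}-\d{{2}}|\d{{2}}\.{MONTH}\.\d{{4}}"
)


def strip_dates(text):
    """Split on spaces, drop tokens that look like dates, and rejoin the rest."""
    kept = [tok for tok in text.split(" ") if not DATE_TOKEN.fullmatch(tok)]
    return " ".join(kept)


samples = [
    "Foo bar Jul15 blah blah.xlsx",
    "Choo bar Jul-15 blah far.xlsx",
    "Star bar 10-Jul-15 blah far.xlsx",
    "Car Star bar 10.Jul.2015 blah far.xlsx",
]
cleaned = [strip_dates(s) for s in samples]
```

As with the LIKE patterns, this checks shape only, so a token such as Jul77 would still be removed.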

Generate variables that move information between rows in hierarchical data with spss syntax

I was wondering if you can help me with the following problem in spss syntax.
My dataset has nested structure.
Data are nested in companies, and each company has 1 or 2 bosses, but in this case I only care about boss 1. At an earlier point in time the boss graded the workers (not all of them). Currently, the ID and the grade of each worker are on that worker's own row.
I would like to take the information obtained during the workers' assessment and create new sets of variables (worker ID and grade, one pair per graded worker) on the row of the boss.
+---------+------+----------+----------------+-----------+----------+---------+----------+
| company | boss | workerID | worker's grade | N:workID1 | N:grade1 | N:work2 | N:grade2 |
+---------+------+----------+----------------+-----------+----------+---------+----------+
| A       | 1    | 1        |                | 3         | A        | 4       | A        |
| A       | 2    | 2        |                |           |          |         |          |
| A       | 0    | 3        | A              |           |          |         |          |
| A       | 0    | 4        | A              |           |          |         |          |
| A       | 0    | 5        |                |           |          |         |          |
| B       | 1    | 1        |                | 3         | B        | 4       | A        |
| B       | 0    | 2        |                |           |          |         |          |
| B       | 0    | 3        | B              |           |          |         |          |
| B       | 0    | 4        | A              |           |          |         |          |
| C       | 1    | 1        |                | 2         | D        | -1      | -1       |
| C       | 0    | 2        | D              |           |          |         |          |
I would like to copy each graded worker's ID and grade to the row of the boss in the NEW variables, without losing the existing workerID and worker's grade variables.
Basically, I need to carry the information into the new variables on the row where boss EQ 1, separately for each company.
I have no idea how to proceed with this. I assume that I need a loop that creates a new variable for each worker ID that has a valid grade and then copies the information from the worker's row to the boss's newly generated variables.
Any suggestions are very welcome :-)
Take a look at VARSTOCASES (Data > Restructure)
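To pin down the intended result, here is a rough Python sketch (not SPSS syntax) that collects the graded workers per company and writes them onto the boss 1 row. The function name, the generated variable names, and the use of -1 as filler (mirroring company C in the table) are assumptions for illustration.

```python
from collections import defaultdict

# (company, boss, workerID, grade); an empty string means "not graded".
rows = [
    ("A", 1, 1, ""), ("A", 2, 2, ""), ("A", 0, 3, "A"), ("A", 0, 4, "A"), ("A", 0, 5, ""),
    ("B", 1, 1, ""), ("B", 0, 2, ""), ("B", 0, 3, "B"), ("B", 0, 4, "A"),
    ("C", 1, 1, ""), ("C", 0, 2, "D"),
]


def attach_graded_workers(rows, missing=-1):
    """Copy each company's graded (workerID, grade) pairs onto its boss 1 row."""
    graded = defaultdict(list)
    for company, _boss, worker_id, grade in rows:
        if grade:
            graded[company].append((worker_id, grade))
    # Every boss row gets the same number of slots, padded with `missing`.
    slots = max((len(pairs) for pairs in graded.values()), default=0)
    result = []
    for company, boss, worker_id, grade in rows:
        record = {"company": company, "boss": boss, "workerID": worker_id, "grade": grade}
        if boss == 1:
            pairs = graded[company]
            for i in range(slots):
                wid, g = pairs[i] if i < len(pairs) else (missing, missing)
                record[f"workID{i + 1}"] = wid
                record[f"grade{i + 1}"] = g
        result.append(record)
    return result


wide = attach_graded_workers(rows)
bosses = [r for r in wide if r["boss"] == 1]
```

The original workerID and grade values stay on every row; only the boss 1 rows gain the extra variables.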

How to Write Conditional Statement in SQL Server

I am having a logic issue in relation to querying an SQL database. I need to exclude 3 different categories and any item that is included in those categories; however, if an item under one of those categories meets the criteria for another category I need to keep said item.
This is an example output I will get after querying the database at its current version:
ExampleDB | item_num | pro_type | area | description
1 | 45KX-76Y | FLCM | Finished | coil8x
2 | 68WO-93H | FLCL | Similar | y45Kx
3 | 05RH-27N | FLDR | Finished | KH72n
4 | 84OH-95W | FLEP | Final | tar5x
5 | 81RS-67F | FLEP | Final | tar7x
6 | 48YU-40Q | FLCM | Final | bile6
7 | 19VB-89S | FLDR | Warranty | exp380
8 | 76CS-01U | FLCL | Gator | low5
9 | 28OC-08Z | FLCM | Redo | coil34Y
item_num and description are in a table together, and pro_type and area are in 2 separate tables--a total of 3 tables to pull data from.
I need to construct a query that will not pull back any item_num where area is equal to Finished, Final, or Redo; but I also need to pull in any item_num that meets the type criteria: FLCM and FLEP. In the end, my query result should look like this:
ExampleDB | item_num | pro_type | area | description
1 | 45KX-76Y | FLCM | Finished | coil8x
2 | 68WO-93H | FLCL | Similar | y45Kx
3 | 84OH-95W | FLEP | Final | tar5x
4 | 81RS-67F | FLEP | Final | tar7x
5 | 19VB-89S | FLDR | Warranty | exp380
6 | 76CS-01U | FLCL | Gator | low5
7 | 28OC-08Z | FLCM | Redo | coil34Y
Try this:
select * from table
join...
where area not in ('Finished', 'Final', 'Redo') or pro_type in ('FLCM', 'FLEP')
Are you looking for something like this?
SELECT *
FROM Table_1
JOIN Table_ProType ON Table_1.whatnot = Table_ProType.whatnot
JOIN Table_Area ON Table_1.whatnot = Table_Area.whatnot
WHERE Table_Area.area NOT IN ('Finished','Final','Redo') OR Table_ProType.pro_type IN ('FLCM','FLEP')
Giving the names of the three tables and the joining criteria will help me improve the answer.
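The effect of the NOT IN ... OR ... IN predicate can be checked against the sample data with a small Python sketch using the standard sqlite3 module. It assumes, for brevity, a single already-joined table named `items` (a made-up name) rather than the three real tables, and SQLite rather than SQL Server.

```python
import sqlite3

# (item_num, pro_type, area, description), copied from the question.
rows = [
    ("45KX-76Y", "FLCM", "Finished", "coil8x"),
    ("68WO-93H", "FLCL", "Similar", "y45Kx"),
    ("05RH-27N", "FLDR", "Finished", "KH72n"),
    ("84OH-95W", "FLEP", "Final", "tar5x"),
    ("81RS-67F", "FLEP", "Final", "tar7x"),
    ("48YU-40Q", "FLCM", "Final", "bile6"),
    ("19VB-89S", "FLDR", "Warranty", "exp380"),
    ("76CS-01U", "FLCL", "Gator", "low5"),
    ("28OC-08Z", "FLCM", "Redo", "coil34Y"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (item_num TEXT, pro_type TEXT, area TEXT, description TEXT)"
)
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)

# Keep a row if its area is not excluded, or if its type is always wanted.
query = """
    SELECT item_num FROM items
    WHERE area NOT IN ('Finished', 'Final', 'Redo')
       OR pro_type IN ('FLCM', 'FLEP')
    ORDER BY rowid
"""
kept = [item_num for (item_num,) in conn.execute(query)]
conn.close()
print(kept)
```

With this predicate only 05RH-27N is dropped. Note that 48YU-40Q (FLCM, Final) is kept as well, although the expected output in the question leaves it out, so that row is worth double-checking against the real requirement.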
