Checking whether a set (array) in Hive contains a value

I have the expression below, which gives me a set in Hive:
collect_set(quotaname)
Sample output:
["Quota1", "Quota2"]
I have tried the contains and LIKE operators, but I can't figure out how to say "if the last item in the set = X then...".
Can anyone advise me?

You can use array_contains:
hive> select array_contains(array("Quota1", "Quota2"),'Quota2');
OK
true
Time taken: 0.144 seconds, Fetched: 1 row(s)
hive> select array_contains(array("Quota1", "Quota2"),'Quota3');
OK
false
Time taken: 0.096 seconds, Fetched: 1 row(s)
To access the last item in a set, use set[size(set)-1]:
with mydata as (
select array('Quota1', 'Quota2') as myset
)
select case when myset[size(myset)-1] = 'Quota2' then 'Contains!' else 'No' end from mydata ;
Result:
OK
Contains!
Time taken: 3.375 seconds, Fetched: 1 row(s)
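You can also apply the LIKE operator to the last element, for example: myset[size(myset)-1] LIKE '%Quota%'. Note that collect_set does not guarantee the order of its elements, so the "last" item is only meaningful if the order is controlled some other way.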
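If the collected array is post-processed outside Hive, the same three checks can be sketched in Python. This is only an analogy for the Hive expressions above, not Hive code; the function names are hypothetical.

```python
def contains(items, value):
    """Rough analogue of array_contains(items, value)."""
    return value in items


def last_item(items):
    """Rough analogue of items[size(items)-1]; returns None for an empty list."""
    return items[len(items) - 1] if items else None


def last_item_like(items, fragment):
    """Rough analogue of items[size(items)-1] LIKE '%fragment%'."""
    last = last_item(items)
    return last is not None and fragment in last


quotas = ["Quota1", "Quota2"]
print(contains(quotas, "Quota2"))        # membership check
print(last_item(quotas) == "Quota2")     # equality on the last element
print(last_item_like(quotas, "Quota"))   # substring match on the last element
```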

What is wrong with my UPDATE statement WHERE NOT EXISTS?

I am trying to set ISCURRENT to 'N' and EFFECTIVE_END_DATE to the current date for any record that does not have the most recent EFFECTIVE_START_DATE for its type.
No error is thrown; it just reports "0 rows affected". However, I created a record with a more recent EFFECTIVE_START_DATE, which should cause the other record in the table, the one with the earlier EFFECTIVE_START_DATE, to be updated.
Here is an image of the 2 records I'm using to test it out.
This script should change the record with a KTEXT of '400 Atlantic' to ISCURRENT='N' and EFFECTIVE_END_DATE=GETDATE(), because the record with a KTEXT of '500 Maria' has a more recent EFFECTIVE_START_DATE.
UPDATE [SAP].[src_gl_sap_m_cepct]
set ISCURRENT='N',
EFFECTIVE_END_DATE=GETDATE()
WHERE NOT EXISTS (SELECT [SPRAS],
[PRCTR],
MAX(EFFECTIVE_START_DATE)
FROM [SAP].[src_gl_sap_m_cepct] AS A
WHERE CONCAT([SAP].[src_gl_sap_m_cepct].[SPRAS],[SAP].[src_gl_sap_m_cepct].[PRCTR]) = CONCAT(A.[SPRAS],A.[PRCTR]
)
GROUP BY [SPRAS],[PRCTR]);
Thank you!
Correct me if I am wrong, but this part of your query
FROM [SAP].[src_gl_sap_m_cepct] AS A
WHERE CONCAT([SAP].[src_gl_sap_m_cepct].[SPRAS],[SAP].[src_gl_sap_m_cepct].[PRCTR]) = CONCAT(A.[SPRAS],A.[PRCTR]
can also be written like this (because you have a self join)
FROM [SAP].[src_gl_sap_m_cepct] AS A
WHERE CONCAT(A.[SPRAS], A.[PRCTR]) = CONCAT(A.[SPRAS], A.[PRCTR]
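Written like this, it becomes clear that you are simply comparing a value with itself, so the condition is always true.
The subquery therefore always returns a row, the NOT EXISTS clause is never true, and no rows are updated.
Even if the unaliased table name binds to the table being updated rather than to A, every row still matches itself on SPRAS and PRCTR, so the outcome is the same.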
I think something like this might work for you
UPDATE c
set c.ISCURRENT='N',
c.EFFECTIVE_END_DATE = GETDATE()
FROM SAP.src_gl_sap_m_cepct c
WHERE EXISTS ( select 1
FROM SAP.src_gl_sap_m_cepct AS A
WHERE CONCAT(c.SPRAS, c.PRCTR) = CONCAT(A.SPRAS, A.PRCTR)
AND A.EFFECTIVE_START_DATE > c.EFFECTIVE_START_DATE
)
If I understood correctly, the statement should be like this:
UPDATE c
set ISCURRENT='N',
EFFECTIVE_END_DATE = GETDATE()
FROM [SAP].[src_gl_sap_m_cepct] c
WHERE EXISTS (
SELECT 1
FROM [SAP].[src_gl_sap_m_cepct] AS A
WHERE CONCAT(c.[SPRAS], c.[PRCTR]) = CONCAT(A.[SPRAS],A.[PRCTR])
AND c.EFFECTIVE_START_DATE < A.EFFECTIVE_START_DATE
);
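The correlated EXISTS pattern from the answers above can be tried out in a small, self-contained sketch. It uses Python's sqlite3 module as a stand-in for SQL Server, so the syntax differs slightly (no UPDATE ... FROM alias, date('now') instead of GETDATE()). The table name cepct and the SPRAS, PRCTR and date values are hypothetical; only the two KTEXT values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cepct (
        SPRAS TEXT,
        PRCTR TEXT,
        KTEXT TEXT,
        EFFECTIVE_START_DATE TEXT,
        EFFECTIVE_END_DATE TEXT,
        ISCURRENT TEXT
    )
    """
)
# Hypothetical key and date values; only the KTEXT values come from the question.
conn.executemany(
    "INSERT INTO cepct VALUES (?, ?, ?, ?, NULL, 'Y')",
    [
        ("E", "1000", "400 Atlantic", "2019-01-01"),
        ("E", "1000", "500 Maria", "2019-06-01"),
    ],
)

# Close out every row for which a newer row with the same SPRAS and PRCTR exists.
# The key columns are compared one by one instead of through CONCAT.
conn.execute(
    """
    UPDATE cepct
    SET ISCURRENT = 'N',
        EFFECTIVE_END_DATE = date('now')
    WHERE EXISTS (
        SELECT 1
        FROM cepct AS a
        WHERE a.SPRAS = cepct.SPRAS
          AND a.PRCTR = cepct.PRCTR
          AND a.EFFECTIVE_START_DATE > cepct.EFFECTIVE_START_DATE
    )
    """
)

current_flags = dict(conn.execute("SELECT KTEXT, ISCURRENT FROM cepct"))
print(current_flags)
```

Comparing SPRAS and PRCTR separately avoids a subtle risk of the CONCAT comparison: two different key pairs can concatenate to the same string (for example 'AB' + 'C' and 'A' + 'BC').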

How to return just one result from SELECT CASE query?

I have a table like this:
DBName  p_server_fqdn  p_server_alias  q_server_fqdn  q_server_alias
cube1   server1.com    p1server.com    server5.com    q1server.com
cube1   server2.com    p1server.com    server6.com    q1server.com
cube2   server3.com    p2server.com    server7.com    q2server.com
cube2   server4.com    p2server.com    server8.com    q2server.com
I want to run a CASE select query that, given a server name, returns the alias from the row whose server column matches it and whose DBName matches the given cube.
This is what I'm trying so far:
$SAlias = Invoke-sqlcmd -Query "SELECT DISTINCT CASE
WHEN ($cubeTable.DBName like $CUBE_input) AND ($cubeTable.p_server_fqdn) like $server_input THEN p_server_alias
WHEN ($cubeTable.DBName like $CUBE_input) AND ($cubeTable.q_server_fqdn) like $server_input THEN q_server_alias
ELSE 'unknown'
END as SAlias
FROM table $cubeTable" -ConnectionString "connectionstuff" | Select -ExpandProperty SAlias
But when I try the query itself in SSMS (with hardcoded values such as cube1 and server2.com), I get back 2 rows: one shows "unknown" for the rows that don't match, and one shows the p_server_alias.
Result I'm getting:
I should only get back the first row, p1server.com in this case, so why am I also getting unknown?
set @cubeInput = 'cube1';
set @serverInput = 'server6.com';
select
case when count(*) = 0 then 'UNKNOWN'
when m.p_server_fqdn = @serverInput then m.p_server_alias
when m.q_server_fqdn = @serverInput then m.q_server_alias
end as alias
from mytable m
where DBName = @cubeInput and (
p_server_fqdn = @serverInput
or q_server_fqdn = @serverInput
);
Here is an implementation of my answer: http://sqlfiddle.com/#!9/b967a22/61
@Cataster's solution returns 2 rows because the CASE expression is actually evaluated for all 4 rows (3 rows of 'unknown' and 1 row of 'p1server.com'), and DISTINCT then collapses these into 2 rows.
My solution is a little bit tricky :). It filters in the WHERE clause, and if no row matches, the count function still produces 1 row with the value 0, which is shown as 'UNKNOWN'.
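The idea above (filter first, then fall back to 'unknown' only when nothing matched) can also be sketched with the fallback handled in the calling code. This is a minimal illustration using Python's sqlite3 module rather than SQL Server or MySQL; the table name cube_servers and the function lookup_alias are hypothetical, while the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cube_servers (
        DBName TEXT,
        p_server_fqdn TEXT,
        p_server_alias TEXT,
        q_server_fqdn TEXT,
        q_server_alias TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO cube_servers VALUES (?, ?, ?, ?, ?)",
    [
        ("cube1", "server1.com", "p1server.com", "server5.com", "q1server.com"),
        ("cube1", "server2.com", "p1server.com", "server6.com", "q1server.com"),
        ("cube2", "server3.com", "p2server.com", "server7.com", "q2server.com"),
        ("cube2", "server4.com", "p2server.com", "server8.com", "q2server.com"),
    ],
)


def lookup_alias(conn, cube, server):
    """Return the alias for `server` within `cube`, or 'unknown' if no row matches."""
    row = conn.execute(
        """
        SELECT CASE
                   WHEN p_server_fqdn = :server THEN p_server_alias
                   ELSE q_server_alias
               END
        FROM cube_servers
        WHERE DBName = :cube
          AND (p_server_fqdn = :server OR q_server_fqdn = :server)
        LIMIT 1
        """,
        {"cube": cube, "server": server},
    ).fetchone()
    # Non-matching rows were filtered out, so they never produce an 'unknown' row.
    return row[0] if row else "unknown"


print(lookup_alias(conn, "cube1", "server2.com"))
print(lookup_alias(conn, "cube1", "server6.com"))
print(lookup_alias(conn, "cube2", "server2.com"))
```

Because the WHERE clause removes the non-matching rows before the CASE expression runs, DISTINCT is no longer needed and at most one value comes back.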

Select data based on datestamp with sqlite3 query

I'm trying to use sqlite3 to access data in a database based on a value for the datestamp. Please consider the following code:
live_db_conn = sqlite3.connect('/Users/user/Documents/database.db')
time_period = (dt.now() - timedelta(seconds=time)).strftime('%H:%M:%S')
time_period_data = pd.read_sql_query('SELECT * FROM table1 WHERE Datestamp > {}'.format(str(time_period)), live_db_conn)
When I run this code I get the following error:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM table1 WHERE Datestamp > 12:33:33': near ":33": syntax error
I don't understand where this error comes from, because if I run the following code:
df = pd.read_sql_query('SELECT Datestamp FROM table1 LIMIT 10', live_db_conn)
print(df)
I get the following output:
Datestamp
0 10:46:54
1 10:46:59
2 10:47:04
3 10:47:09
4 10:47:14
5 10:47:19
6 10:47:24
7 10:47:29
8 10:47:34
9 10:47:39
So it seems (to me at least) that my sql query is correct. I've tried to do .format(time_period) instead of .format(str(time_period)) but I can't figure out what I'm doing wrong.
Question: How do I select the portion of the data that corresponds to the selected time period?
Edit: It seems that something is going wrong with the minutes in the timestamp. When I ran the code again I got the same error but with a different timestamp:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM table1 WHERE Datestamp > 12:49:10': near ":49": syntax error
So I'd say that the syntax error has something to do with the minutes in the timestamp.
Instead of
time_period_data = pd.read_sql_query('SELECT * FROM table1 WHERE Datestamp > {}'.format(str(time_period)), live_db_conn)
I did:
time_period_data = pd.read_sql_query('SELECT * FROM table1 WHERE Datestamp > "{}"'.format(time_period), live_db_conn)
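which solved the problem! The time value has to be quoted so that SQLite treats it as a string instead of trying to parse 12:33:33 as part of the SQL expression.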
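An alternative to quoting the value by hand is to pass it as a query parameter, which avoids the syntax error and also sidesteps quoting mistakes. The sketch below uses only the sqlite3 module with an in-memory table holding a few of the Datestamp values from the question; the function name rows_after is hypothetical. pandas accepts the same placeholder style through the params argument of read_sql_query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (Datestamp TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("10:46:54",), ("10:46:59",), ("10:47:04",)],
)


def rows_after(conn, cutoff):
    """Return all Datestamp values later than `cutoff` ('HH:MM:SS' text)."""
    cursor = conn.execute(
        "SELECT Datestamp FROM table1 WHERE Datestamp > ? ORDER BY Datestamp",
        (cutoff,),  # the driver binds the value; no manual quoting needed
    )
    return [row[0] for row in cursor]


print(rows_after(conn, "10:46:55"))
```

With pandas, the equivalent call would be pd.read_sql_query('SELECT * FROM table1 WHERE Datestamp > ?', live_db_conn, params=(time_period,)). The comparison is a plain string comparison, which works here because zero-padded HH:MM:SS values sort in time order within a single day.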

Never ending query in SQL Server 2012

SQL Server 2012, Amazon RDS
This is my simple query
update [dbo].[DeliveryPlan]
set [Amount] = dp.Amount +
case when @useAmountColumn = 1 and dbo.ConvertToInt(bs.Amount) > 0
then dbo.ConvertToInt(bs.Amount)
else @amount
end
from
BaseSpecification bs
join
BaseSpecificationStatusType t on (StatusTypeID = t.StatusTypeID)
join
[DeliveryPlan] dp on (dp.BaseSpecificationID = bs.BaseSpecificationID and dp.ItemID = @itemID)
where
bs.BaseID = 130 and t.IsActive = 1
It never finishes. If I change the WHERE condition from bs.BaseID=130 (updates 7,000 rows) to bs.BaseID=3 (updates 1,000,000 rows), it completes in 13 seconds.
Statistics are up to date, I think.
In Performance Monitor I see 5% processor usage.
When I use a stored procedure to watch active connections, I see the following for this query:
tempdb_allocations 32, tempdb_current 32, reads 32,000,000, cpu 860,000 (the query has been running for 20 minutes)
What is the problem?
UPDATE: I added a non-clustered index on [DeliveryPlan] (BaseSpecificationID + ItemID) and the problem is gone. Unfortunately, I see this problem every day with different queries, and it disappears unpredictably.
This may perform better, because the join conditions narrow down the number of rows right away instead of waiting for the WHERE clause to be applied. The execution plans may differ between the two versions (with and without the WHERE clause), so it is worth comparing them.
UPDATE dp
SET Amount = dp.Amount + CASE
WHEN @useAmountColumn = 1
AND dbo.ConvertToInt( bs.Amount ) > 0 THEN dbo.ConvertToInt( bs.Amount )
ELSE @amount
END
FROM BaseSpecification bs
JOIN BaseSpecificationStatusType t ON
( bs.StatusTypeID = t.StatusTypeID
AND bs.BaseID = 130
AND t.IsActive = 1
)
JOIN DeliveryPlan dp ON
( dp.BaseSpecificationID = bs.BaseSpecificationID
AND dp.ItemID = @itemID
);
You may be suffering from locking on your base tables.
Rewrite the query to update dp directly, to avoid updating all rows of DeliveryPlan:
update dp set [Amount] = dp.Amount +
case
when @useAmountColumn=1 and dbo.ConvertToInt(bs.Amount)>0 then
dbo.ConvertToInt(bs.Amount)
else @amount
end
from
BaseSpecification bs
join BaseSpecificationStatusType t on (bs.StatusTypeID = t.StatusTypeID)
join [DeliveryPlan] dp on (dp.BaseSpecificationID = bs.BaseSpecificationID)
where
bs.BaseID = 130
and t.IsActive = 1
and dp.ItemID = @itemID
If the problem mentioned in the update comes and goes randomly, it sounds like bad parameter sniffing. When the problem happens, you could look in the plan cache to check whether the query plan looks OK. If it doesn't, check which parameter values the plan was compiled with (you can find them in the leftmost operator in the plan), then, for example, use sp_recompile and see what kind of plan you get the next time.
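The asker's update reports that a non-clustered index on DeliveryPlan (BaseSpecificationID + ItemID) made the problem go away. The effect of such an index on a lookup can be illustrated with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in, so the plan output is SQLite's and not SQL Server's; the index name ix_dp_spec_item and the simplified lookup query are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DeliveryPlan ("
    "BaseSpecificationID INTEGER, ItemID INTEGER, Amount INTEGER)"
)

# Simplified lookup on the two columns used in the join condition.
QUERY = "SELECT Amount FROM DeliveryPlan WHERE BaseSpecificationID = ? AND ItemID = ?"


def plan(conn):
    """Return the detail text of SQLite's query plan for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1, 1)).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the description


plan_before = plan(conn)  # no index yet: the table has to be scanned
conn.execute(
    "CREATE INDEX ix_dp_spec_item ON DeliveryPlan (BaseSpecificationID, ItemID)"
)
plan_after = plan(conn)   # the composite index can now be used for the lookup

print(plan_before)
print(plan_after)
```

In SQL Server, the comparable check is to compare the actual execution plans before and after creating the index.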

Netezza timestamp failing

I'm trying to query the database as follows:
select count(distinct(TE_ID)) from TE where LAST_UPDATE_TIME >= '2013-01-08-00:00:00.000000' and LAST_UPDATE_TIME < '2013-01-09-00:00:00.000000'
However the error I receive is:
11:25:09 [SELECT - 0 row(s), 0.000 secs] [Error Code: 1100, SQL State: HY000] ERROR: Bad timestamp external representation '2013-01-08-00:00:00.000000'
... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]
The timestamp you are giving has an extra dash.
Yours: select cast('2013-01-08-00:00:00.000000' as timestamp)
Should be: select cast('2013-01-08 00:00:00.000000' as timestamp)
To be safe, it might be a good idea to convert the string explicitly, as in the example below:
to_timestamp('2013-01-08 00:00:00.000000','YYYY-MM-DD HH24:MI:SS.US')
HH24 = Hour (00-23)
MI = Minute
SS = Second
US = Microseconds
Try this:
select count(distinct(TE_ID)) from TE where LAST_UPDATE_TIME >= '2013-01-08 00:00:00.000000' and LAST_UPDATE_TIME < '2013-01-09 00:00:00.000000'
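If the timestamp strings are built in client code, they can be checked before they are sent to the database. The sketch below is a client-side check in Python, not Netezza code; the function names are hypothetical, and the format string simply mirrors the corrected 'YYYY-MM-DD HH:MI:SS.US' layout with a space between date and time.

```python
from datetime import datetime

# Date, a single space, then time with microseconds.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(text):
    """Parse a timestamp in the expected layout; raises ValueError otherwise."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def is_valid_timestamp(text):
    """Return True if `text` matches the expected layout."""
    try:
        parse_timestamp(text)
        return True
    except ValueError:
        return False


print(is_valid_timestamp("2013-01-08 00:00:00.000000"))  # corrected form
print(is_valid_timestamp("2013-01-08-00:00:00.000000"))  # form with the extra dash
```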
