Select list based on a select query in Apache Zeppelin

I am using Apache Zeppelin version 0.6.
I have the following Hive query:
select certificate_name,count(*) from student_withdraw
Now I want a WHERE clause whose value the end user picks from a select list. The inner query looks like this:
select certificate_name,count(*) from student_withdraw where lecturer_name in (select distinct lecturer_name from student_withdraw)
The default notation for a select list is "${item=A,A|B|C}".
I tried the following:
%jdbc(hive)
select certificate_name,count(*) from student_withdraw where lecturer_name = "${item=Null,select distinct lecturer_name from student_withdraw}" group by certificate_name
But this does not fetch the distinct lecturers into the select list; all the select list shows is the query text itself.
How can I populate the select list with the distinct lecturers?
Thank you.

Assumption
Your scenario involves dynamic forms in Zeppelin. Your logic makes sense, but a dynamic form does not execute any SQL or HiveQL and then pass the result to the page as options; it shows exactly the text you typed.
I assume your Zeppelin installation includes all interpreters, the table is a managed native Hive table, and letting end users pick a lecturer is a requirement.
Solution
If the number of unique lecturer names is small, say fewer than 10, just type them all manually into the select form of the query.
SELECT certificate_name, COUNT(*)
FROM student_withdraw
WHERE lecturer_name = '${item=nameA,nameA|nameB|nameC}'
GROUP BY certificate_name
Otherwise, consider building the string of all lecturer names first, then copying and pasting the result into the select form of the query.
Something like the following:
%pyspark
from pyspark.sql import HiveContext
hc = HiveContext(sc)
student_withdraw = hc.table("student_withdraw")
student_withdraw.registerTempTable("student_withdraw")
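# Run the query through the HiveContext; a DataFrame has no sql() method
lecturer_list = hc.sql('SELECT DISTINCT lecturer_name FROM student_withdraw').rdd.map(lambda r: r[0]).collect()
# Skip NULL names so join() does not fail on None
lecturer_names = '|'.join(name for name in lecturer_list if name is not None)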
print(lecturer_names)
 
%jdbc(hive)
SELECT certificate_name, COUNT(*)
FROM student_withdraw
/* the second argument of the select form is copied from the output of the previous paragraph */
WHERE lecturer_name = '${item=nameA,nameA|nameB....|nameY|nameZ}'
GROUP BY certificate_name
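The copy-and-paste step can be shortened by printing the complete form template instead of only the joined names. Below is a minimal sketch in plain Python; `build_select_form` is a hypothetical helper (it is not part of Zeppelin) that only assembles the `${name=default,opt1|opt2}` text, and the sample names are made up.

```python
def build_select_form(name, options, default=None):
    """Return a Zeppelin select-form template such as ${item=A,A|B|C}."""
    # Drop missing values and duplicates while keeping first-seen order.
    cleaned = []
    for option in options:
        if option is not None and option not in cleaned:
            cleaned.append(option)
    if not cleaned:
        raise ValueError("at least one option is required")
    if default is None:
        default = cleaned[0]  # assumption: the first option is the default
    return "${%s=%s,%s}" % (name, default, "|".join(cleaned))


# Hypothetical lecturer names; in the notebook this would be lecturer_list.
form = build_select_form("item", ["nameA", "nameB", None, "nameA", "nameC"])
print(form)  # ${item=nameA,nameA|nameB|nameC}
```

The printed text still has to be pasted into the %jdbc(hive) paragraph by hand, since the form itself does not run any query.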

Related

WHERE IN as Subquery in Postgres

I have two tables.
The users table has a column category_ids of type integer[], which stores values such as {1001,1002}.
I am trying to get the details of all those categories, but it is not working.
SELECT *
FROM categories
WHERE id IN (Select category_ids from users where id=1)
When I run Select category_ids from users where id=1, I get {1001,1002} as the result. How can I turn {1001,1002} into (1001,1002), or what should I change to make the query above work?
You could use =ANY():
SELECT *
FROM categories
JOIN users ON categories.id =any(category_ids)
WHERE users.id = 1;
But why are you using an array?
I would use an EXISTS condition:
SELECT *
FROM categories c
WHERE EXISTS (Select *
from users u
where c.id = any(u.category_ids)
and u.id = 1);
This is possibly the fastest option, but you should run EXPLAIN ANALYZE on all the answers to make sure.
SELECT *
FROM categories c
WHERE id IN (SELECT unnest(category_ids) from users where id=1)
SELECT *
FROM categories c
JOIN (SELECT unnest(category_ids) AS id from users where id=1) AS foo
ON c.id=foo.id
The first query will remove duplicates in the array due to IN(), the second will not.
I didn't test the syntax, so there could be typos. Basically, unnest() is a function that expands an array into a set of rows, i.e. it turns an array into a table that you can then join to other tables, put into an IN() clause, and so on. The opposite is array_agg(), an aggregate function that returns an array of the aggregated elements. These two are quite convenient.
If the optimizer turns the =ANY() query into a seq scan with a filter, then unnest() should be much faster, as it should unnest the array and use a nested loop index scan on categories.
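The difference between the IN() form and the JOIN form only shows up when the array contains a repeated value. The following is a rough simulation in plain Python, not PostgreSQL itself; the category rows and the array contents are hypothetical.

```python
# Hypothetical rows standing in for the categories table.
categories = [
    {"id": 1001, "name": "books"},
    {"id": 1002, "name": "music"},
    {"id": 1003, "name": "games"},
]
# Hypothetical users.category_ids value; note the duplicate 1001.
category_ids = [1001, 1002, 1001]

# WHERE id IN (SELECT unnest(category_ids) ...):
# each category row is tested for membership once, so it appears at most once.
in_rows = [c["id"] for c in categories if c["id"] in category_ids]

# JOIN (SELECT unnest(category_ids) AS id ...) ON c.id = foo.id:
# one output row per matching array element, so duplicates are kept.
join_rows = [c["id"] for c in categories for cid in category_ids if c["id"] == cid]

print(in_rows)    # [1001, 1002]
print(join_rows)  # [1001, 1001, 1002]
```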
You could also try this:
SELECT *
FROM categories c
WHERE id =ANY(SELECT category_ids from users where id=1)

Using sum on two queries joined using UNION ALL

I am using Microsoft SQL Server 2014. I have two queries that I have joined using UNION ALL. Each query gives me a total, but I need the total of those two queries. In other words, I want to take the values returned by these two queries and add them together to get my final number. The two queries are:
select sum(acct.balance) as 'Balance'
from acct
where
acct.status <> 'closed'
Union all
select sum(term.balance) as 'Balance'
from term
where
term.status = 'active'
I have tried other suggestions posted on here but none have worked. My query should show me the balance of Acct.balance + term.balance.
In this case your problem is simple: you have only two values, so you could have added them directly instead of UNION-ing them. I give this example only for completeness.
select (select sum(acct.balance) from acct where acct.status <> 'closed' ) + (select sum(term.balance) from term where term.status = 'active') as Balance
I mention that because it seems like the union all is what got you stuck. And yes, you can put that in a sub query or CTE, but in this case you don't even have a set, but just two values, since you aren't grouping by anything.
Other examples show CTE and subquery, which is how you can continue and build upon an existing query. (Another option may be to create a view if it's going to get reused a lot, but again, that is overkill for your example.)
When to use which?
I prefer a CTE when I'm going to join something in more than once, for example when I find and rank something and then join the prior item to the next item. There are also other tricks with CTEs that go beyond that, into areas like recursion. (http://www.databasejournal.com/features/mssql/article.php/3910386/Tips-for-Using-Common-Table-Expressions.htm)
If I just have a query that I want to build upon, I often make it a subquery, as long as the code is short and straightforward.
A nice thing about either a CTE or a sub query is that you can select that inner code, and run just that when you're trying to understand why you're seeing the actual results.
All that being said, I don't generally like to see subqueries in the SELECT list, so how I'd actually write this would be closer to:
select sum(SubTotals.Balance) as Balance
from
(
select sum(acct.balance) as Balance
from acct
where acct.status <> 'closed'
Union all
select sum(term.balance) as Balance
from term
where term.status = 'active'
) SubTotals
I give that example with the comment that meaningful names are good.
You can use a CTE to do this:
;with mycte as (
select
sum(acct.balance) as 'Balance'
from acct
where acct.status <> 'closed'
Union all
select sum(term.balance) as 'Balance'
from term
where
term.status = 'active'
)
Select
sum(Balance) as total_balance
from mycte
select sum(t.balance) from
(select balance
from acct
where
acct.status <> 'closed'
Union all
select balance
from term
where
term.status = 'active') t
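One practical difference between adding the two scalar subqueries directly and summing over the UNION ALL appears when one of the queries matches no rows: SUM then returns NULL, and NULL plus a number is NULL, while an outer SUM simply skips the NULL subtotal. The sketch below uses Python's built-in sqlite3 as a stand-in for SQL Server (an assumption made only so the example is runnable); the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acct (balance INTEGER, status TEXT);
    CREATE TABLE term (balance INTEGER, status TEXT);
    INSERT INTO acct VALUES (100, 'open'), (50, 'closed'), (25, 'open');
    INSERT INTO term VALUES (10, 'expired');  -- no 'active' rows on purpose
""")

# Direct addition of two scalar subqueries: 125 + NULL is NULL.
added = conn.execute("""
    SELECT (SELECT SUM(balance) FROM acct WHERE status <> 'closed')
         + (SELECT SUM(balance) FROM term WHERE status = 'active')
""").fetchone()[0]

# Outer SUM over the UNION ALL: the NULL subtotal is ignored.
summed = conn.execute("""
    SELECT SUM(Balance) FROM (
        SELECT SUM(balance) AS Balance FROM acct WHERE status <> 'closed'
        UNION ALL
        SELECT SUM(balance) AS Balance FROM term WHERE status = 'active'
    ) AS SubTotals
""").fetchone()[0]

print(added, summed)  # None 125
```

Wrapping each scalar subquery in COALESCE(..., 0) would make the direct addition return 125 as well.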

SQL Server : group all data by one column

I need some help writing a SQL Server stored procedure. All data should be grouped by Train_B_N.
My table data: [image]
Expected result: [image]
with CTE as
(
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
)
select
*
from Train_M as m
join CTE as c on c.Train_B_N = m.Train_B_N
What's wrong with my query?
GROUP BY collapses the rows of each group into a single row, so selecting columns that are neither grouped nor aggregated is a problem: there is no single value to return for them.
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
SQL Server follows the classic ANSI rule: every column in the SELECT list that is not inside an aggregate function must appear in the GROUP BY clause.
WITH CTE AS (SELECT TRAIN_B_N, MAX(DATE) AS Last_Date
FROM TRAIN_M
GROUP BY TRAIN_B_N)
SELECT A.Train_B_N, Duration, Date,Trainer,Train_code,Training_Program
FROM TRAIN_M AS A
INNER JOIN CTE ON CTE.Train_B_N = A.Train_B_N
AND CTE.Last_Date = A.Date
This example returns the last training program, trainer, and train_code used for each Train_B_N.
This works through the MAX(DATE) aggregate function, which keeps the greatest (latest) DATE for each group. Because the GROUP BY reduces the rows to one per distinct group, the JOIN returns only the matching subset of the table's rows.
Keep in mind that a join returns one output row for every matching pair of rows, so if more than one row in a group carries the latest date, you will get extra rows.
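To see both effects, the latest row per group and the extra rows on a tie, here is a small sketch that runs the same CTE-and-join pattern through Python's built-in sqlite3. SQLite stands in for SQL Server here, the table is cut down to three columns, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Train_M (Train_B_N TEXT, Date TEXT, Trainer TEXT);
    INSERT INTO Train_M VALUES
        ('B1', '2020-01-01', 'Ann'),
        ('B1', '2020-02-01', 'Bob'),
        ('B2', '2020-03-01', 'Cid'),
        ('B2', '2020-03-01', 'Dee');  -- two rows share the latest date
""")

rows = conn.execute("""
    WITH CTE AS (SELECT Train_B_N, MAX(Date) AS Last_Date
                 FROM Train_M
                 GROUP BY Train_B_N)
    SELECT A.Train_B_N, A.Date, A.Trainer
    FROM Train_M AS A
    INNER JOIN CTE ON CTE.Train_B_N = A.Train_B_N
                  AND CTE.Last_Date = A.Date
    ORDER BY A.Train_B_N, A.Trainer
""").fetchall()

for row in rows:
    print(row)
# ('B1', '2020-02-01', 'Bob')
# ('B2', '2020-03-01', 'Cid')
# ('B2', '2020-03-01', 'Dee')
```

B1 yields exactly one row, while B2 yields two because both of its rows match the group's latest date.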
Look up GROUP BY on MSDN. I suggest you first read everything outside the syntax examples and absorb what the purpose of the clause is.
Also, next time, try googling your problem like this: 'GROUP BY, SQL' or insert the error code given by your IDE (SSMS or otherwise). You need to understand why things work...and SO is here to help, not be your google search engine. ;)
Hope you find this begins your interest in learning all about SQL. :D

Summarizing count of multiple tables in one row or column

I've designed a migration script and as the last sequence, I'm running the following two lines.
select count(*) from Origin
select count(*) from Destination
However, I'd like to present those numbers as cells in the same table. I haven't decided yet if it's most suitable to put them as separate rows in one column or adjacent columns on one row but I do want them in the same table.
How can I select stuff from those selects into vertical/horizontal line-up?
I've tried select on them both with and without parentheses, but it didn't work out (probably because of the missing from)...
This question is related to another one but differs in two respects. First, it is much more to the point and states the issue more clearly. Second, it asks about both horizontal and vertical line-up of the selected values, whereas the linked question only covers the former.
select
select count(*) from Origin,
select count(*) from Destination
select(
select count(*) from Origin,
select count(*) from Destination)
You need to nest the two select statements under a main (top) SELECT in order to get one row with the counts of both tables:
SELECT
(select count(*) from Origin) AS OriginCount,
(select count(*) from Destination) AS DestinationCount
I hope this is what you are looking for, since the "same table" you mention is slightly confusing (I'm assuming you're referring to the result set).
Alternatively, you can use UNION ALL to return two rows with the counts of both tables.
SELECT COUNT(*) AS [Count], 'Origin' AS [Table] FROM Origin
UNION ALL
SELECT COUNT(*) AS [Count], 'Destination' AS [Table] FROM Destination
I recommend adding the second text column so that you know which table each number belongs to.
Unlike a plain UNION, UNION ALL returns two rows every time. Without the text column, UNION would collapse the result to a single row whenever both tables have the same row count, because it removes duplicate rows.
...or if you want vertical...
select 'OriginalCount' as Type, count(*)
from origin
union
select 'DestinationCount' as Type, count(*)
from destination
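The horizontal and vertical layouts, and the way a plain UNION without a label column collapses equal counts, can be checked with a small script. This sketch uses Python's built-in sqlite3 in place of SQL Server purely for convenience; the tables hold invented rows, and the column names TableName and Total are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Origin (id INTEGER);
    CREATE TABLE Destination (id INTEGER);
    INSERT INTO Origin VALUES (1), (2), (3);
    INSERT INTO Destination VALUES (1), (2), (3);
""")

# Horizontal: one row, one column per table.
horizontal = conn.execute("""
    SELECT (SELECT COUNT(*) FROM Origin) AS OriginCount,
           (SELECT COUNT(*) FROM Destination) AS DestinationCount
""").fetchall()

# Vertical: one labelled row per table.
vertical = conn.execute("""
    SELECT 'Origin' AS TableName, COUNT(*) AS Total FROM Origin
    UNION ALL
    SELECT 'Destination', COUNT(*) FROM Destination
""").fetchall()

# Plain UNION without a label: equal counts are merged into one row.
collapsed = conn.execute("""
    SELECT COUNT(*) FROM Origin
    UNION
    SELECT COUNT(*) FROM Destination
""").fetchall()

print(horizontal)  # [(3, 3)]
print(sorted(vertical))
print(collapsed)   # [(3,)]
```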

Transact SQL parallel query execution

Suppose I have
INSERT INTO #tmp1 (ID) SELECT ID FROM Table1 WHERE Name = 'A'
INSERT INTO #tmp2 (ID) SELECT ID FROM Table2 WHERE Name = 'B'
SELECT ID FROM #tmp1 UNION ALL SELECT ID FROM #tmp2
I would like to run queries 1 & 2 in parallel, and then combine results after they are finished.
Is there a way to do this in pure T-SQL, or a way to check if it will do this automatically?
Some background for those who want it: I am investigating a complex search with multiple conditions that are later combined (term OR (term2 AND term3) OR term4 AND item5=term5), so I want to find out whether it would be useful to execute those largely unrelated conditions in parallel and later combine the resulting tables (and calculate ranks, weights, and so on).
E.g. there should be several result sets:
SELECT COUNT(*) #tmp1 union #tmp3
SELECT ID from (#tmp1 union #tmp2) WHERE ...
SELECT * from TABLE3 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
SELECT * from TABLE4 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
You don't. SQL doesn't work like that: it isn't procedural. It leads to race conditions and data issues because of other connections.
Table variables are also scoped to the batch and connection, so you can't share results across two connections, in case you're wondering.
In any case, all you need is this, unless you gave us a bad example:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION
SELECT ID FROM Table2 WHERE Name = 'B'
I suspect you're thinking of "run in parallel" because of this procedural thinking. What is the actual problem, and what is your goal?
Note: table variables do not allow parallel operations: Can queries that read table variables generate parallel execution plans in SQL Server 2008?
You don't decide what to parallelise - SQL Server's optimizer does. And the largest unit of work that the optimizer will work with is a single statement - so, you find a way to express your query as a single statement, and then rely on SQL Server to do its job, which it will usually do quite well.
If, having constructed your query, the performance isn't acceptable, then you can look at applying hints or forcing certain plans to be used. A lot of people break their queries into multiple statements, either believing that they can do a better job than SQL Server, or because it's how they "naturally" think of the task at hand. Both are "wrong" (for certain values of wrong), but if there's a natural breakdown, you may be able to replicate it using Common Table Expressions - these would allow you to name each sub-part of the problem, and then combine them together, all as part of a single statement.
E.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabB AS (
SELECT ID FROM Table2 WHERE Name = 'B'
)
SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB
And this will allow the server to decide how best to resolve this query (e.g. deciding whether to store intermediate results in "temp" tables).
I saw in one of your other comments that you need to "work with" the intermediate results. This can still be done with CTEs (if it's not just a case of being unable to express the "final" result as a single query), e.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabAWithCalcs AS (
SELECT ID,(ID*5+6) as ModID from TabA
)
SELECT * FROM TabAWithCalcs
Why not just:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION ALL
SELECT ID FROM Table2 WHERE Name = 'B'
Then if SQL Server wants to run the two selects in parallel, it will do so of its own volition.
Otherwise we need more context for what you're trying to achieve if this isn't practical.
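As a quick check that nothing is lost by dropping the temp tables, the sketch below compares the multi-statement version with the single-statement CTE version. It uses Python's built-in sqlite3 with invented rows; SQLite is only a stand-in and says nothing about how SQL Server would parallelise the plan, so the comparison covers results only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER, Name TEXT);
    INSERT INTO Table1 VALUES (1, 'A'), (2, 'X');
    INSERT INTO Table2 VALUES (3, 'B'), (4, 'B'), (5, 'Y');
""")

# Multi-statement version: materialise each part, then combine.
conn.executescript("""
    CREATE TEMP TABLE tmp1 AS SELECT ID FROM Table1 WHERE Name = 'A';
    CREATE TEMP TABLE tmp2 AS SELECT ID FROM Table2 WHERE Name = 'B';
""")
stepwise = conn.execute(
    "SELECT ID FROM tmp1 UNION ALL SELECT ID FROM tmp2 ORDER BY ID"
).fetchall()

# Single-statement version: the same sub-parts, named as CTEs.
single = conn.execute("""
    WITH TabA AS (SELECT ID FROM Table1 WHERE Name = 'A'),
         TabB AS (SELECT ID FROM Table2 WHERE Name = 'B')
    SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB ORDER BY ID
""").fetchall()

print(stepwise == single)  # True
print(single)              # [(1,), (3,), (4,)]
```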
