HIVE: GROUP BY doesn't behave as it would in MySQL

HIVE: GROUP BY doesn't behave as it would in MySQL - database

I have some experience with MySQL and recently I have to do some work on HIVE instead.
The basic structure of the queries is quite similar between the two, but the GROUP BY in HIVE seems to work a bit differently... Thus I cannot achieve what I could previously achieve in MySQL using GROUP BY.
Following is my question, so say I have a table with column A, B, C, and I want to select the rows with max. B column values grouping by column A. I will do:
SELECT A, max(B) FROM myTable GROUP BY A
The above code would work in HIVE with no problem. But what if I also want to see the value in column C which is in the same row of the row with max. B value? In MySQL I can just do:
SELECT A, max(B), C FROM myTable GROUP BY A
But in HIVE I can't do this. It complains that C is not in the GROUP BY keys, but if I add C into GROUP BY, the result is totally not what I want.
So what is the way to select such desired result in HIVE? Some say using collect_set on column C can solve the problem, but I have no idea how the collect_set is ordered and thus don't know which element to return...

Okay I figured this out... The following would do the trick:
SELECT A, maxB, C FROM myTable JOIN
(SELECT A, max(B) as maxB FROM myTable GROUP BY A) temp
ON myTable.A = temp.A AND myTable.B = temp.maxB
It turns out that I have to write much more code in HIVE to get the same result as I would get with just one line in MySQL... :(

In MySQL you would just get a random C, which is not the one you seem to expect.
See MySQL's SQL_MODE to appropriately let MySQL also refuse such ambiguous code.
(or use MIN(C), to get a specific one)

Related

Group by an evaluated field (sql server) [duplicate]

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalize to only represent data once in the database ought to extend to code as well. I'd want to only right that calculation expression once (in the SELECT column list), and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.

One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

There is a lot of elementary inconsistencies in SQL, and use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be a similar queries (the difference between the two can be made infinitesimally small, after all), which are not.

I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.

use aliasses :
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year

databases that don't support this basically are choosing not to. understand the order of the processing of the various steps, but it is very easy (as many databases have shown) to parse the sql, understand it, and apply the translation for you. Where its really a pain is when a column is a long case statement. having to repeat that in the group by clause is super annoying. yes, you can do the nested query work around as someone demonstrated above, but at this point it is just lack of care about your users to not support group by column numbers.

SQL Server pulling data from multiple tables

I'm having a problem trying to pull a specific data from two tables. According to the textbook its:
Select *
From terra..retailsales and terra..retailaccount
Where retailaccountid in retailsales = 2345678
Get date range from = 3/01/2014 to 6/30/2015
However, when running the code it produces an syntax error within the in. Yet to me the whole code looks wrong. Can someone help me. I would like to get this to work in order to do my assignment. It's driving me nuts! I contacted the prof and he said that the code in the book is correct, but I think he's wrong.
Can someone help?

The code you provided is not TSQL - actually looks more like some kind of pseudocode.
Just guessing at your column names here, but if I've got it right your query should look something like this:-
SELECT * FROM terra..retailsales
WHERE retailaccountid = 2345678
AND [date range] BETWEEN '20140301' AND '20150630'
Not sure where the 2nd table comes into this though.

You can JOIN two table, like this:
SELECT *
FROM terra..retailsales RS
INNER JOIN terra..retailaccount RC
ON RS.retailaccountid = RC.ID
WHERE RS.retailaccountid = 2345678
AND [date] BETWEEN '20140301' AND '20150630'

Your provided code is very confusing. I see the [terra..retailsales] table, but have no idea what your other table is. Are you sure you need to get your data from two tables?
What is the syntax error you're receiving? Can you paste the exact code you're trying in a code block? Not much of that makes any sense.
In order to pull data from two tables, you could union those tables in a CTE (common table expression), throw them both into a temp table, or join them in a select statement. If the format of both tables is identical, then why do you have two of them?
You're missing the column name where you want to compare a [date] to "3/01/2014 to 6/30/2015". You can use getDate() to return the current time.
Select *
FROM [terra..retailsales]
Where [retailaccountid] = 2345678
AND [<DateColumn>] BETWEEN '3/01/2014' AND '6/30/2015'
You don't need to re-specify your table name again in line "Where retailaccountid in retailsales = 2345678". It will just assume that the retailaccountid is from retailsales.

sql select (..) where id in (list_of_values) vs select (...) where id in (subquery)

This is my first post on Stackoverflow. :-)
I use an SQL server, which has a huge table (up to 50'000'000 records), and a smaller table (up to 500'000 records).
Let's consider two queries:
Query A
SELECT *
FROM my_huge_table
WHERE column_name IN(list_of_values)
versus Query B
SELECT *
FROM my_huge_table
WHERE column_name IN (
SELECT column_name
FROM smaller_table
WHERE ...
)
The subquery in query B returns exactly the same list as list_of_values in query A.
The length of list_of_values in query A is limited (as described in many places on web). If my list_of_values is long (eg: 10'000), I have to split it into chunks, but when I use query B, it works fine, although results from the subquery from query B also contains 10'000 records...
What is more when I use query B, it is faster than query A. I looked into execution plans and it shows that query B uses some parallel calculations (I'm not familiar with execution plans).
Questions
In my script I have a list of values, I cannot easily create a subquery. Is there any way to execute something similar to query B using list of values?
Why is query B faster than query A?
Are there any other optimizations to be done?
PS: I've already created an index on the queried column.
Thanks.

Mixing indexed and calculated fields in a table-valued function

I work with SQL Server 2008, but can use a later version if it would matter.
I have 2 tables with pretty similar data about some people but in different formats (no intersections between these 2 sets of people).
Table 1:
int personID
bit IsOldPerson //this field is indexed
Table 2:
int PersonID
int Age
I want to have a combined view that has the same structure as the Table 1. So I write the following script (a simplified version):
CREATE FUNCTION CombinedView(#date date)
RETURNS TABLE
AS
RETURN
select personID as PID, IsOldPerson as IOP
from Table1
union all
select personID as PID, dbo.CheckIfOld(Age,#date) as IOP
from Table2
GO
The function "CheckIfOld" returns yes/no depending on the input age at the date #date.
So I have 2 questions here:
A. if I try select * from CombinedView(TODAY) where IOP=true, whether the SQL Server will do the following separately: 1) for the Table 1 use the index for the field IsOldPerson and do a "clever" index-based selection of results; 2) for the Table 2 calculate CheckIfOld for all the rows and during the calculation pick up or rejecting rows on the row-by-row basis ?
B. how can I check the execution plan in this particular case to understand whether my guess in the question (A) is correct or not?
Any help is greatly appreciated! Thanks!

Yes, if the query isn't too complex, the query optimizer should "see through" the view into its constituent UNION-ed SELECT statements, evaluate them separately, and concatenate the results. If there is an index on Table1, it should be able to use it. I tested this using tables we had and the same function concepts you presented. I reviewed the query plans of the raw SELECT to Table1 and the SELECT to the inline table-valued function with the UNION and the portion of the query plan relevant to Table1 was the same-- and it used the index.
Now if performance is a concern, I suggest you do one of two things:
If (a) Table2 is read-heavy rather than write-heavy, (b) you have the space, and (c) you can write CheckIfOld as a single CASE statement (as its name and context in your question implies), then you should consider creating a persisted calculated field in Table2 with the calculation from IsOldPerson and applying an index to it.
If Table2 is write-heavy, or you have no space for additional fields, you should at least consider converting CheckIfOld into an inline function. You will likely reap performance gains, depending on how it is used. In your case, it would be used like this:
select personID as PID, IOP.IsOldPerson from Table2 CROSS APPLY dbo.CheckIfOld(Age,#date) AS IOP

SQL query join elements

I will re-write my doubt to be more easy to understand.
I have one table named SeqNumbers that have only one column of data named PossibleNumbers, that has value from 1 to 10.000.
Then I have another Table named Terminals and one of the columns have the serial numbers of the terminals. What I want is get all the SerialNumbers that not exists in the Terminals table from 1 to 10.000.
I've created the SeqNumbers table only to do this... maybe there's another solution without using it... that's fine to me.
The query I have is:
SELECT PossibleNumbers from SeqNumbers
Where PossibleNumbers NOT IN (SELECT SerialNumbers from Terminals)**
Basically I want to list ALL serial numbers of terminals that doesn't exists in the database.
This Query works fine I think... but the problem is that I don't want all results in a single column.. I want these results displayed in 4 or 5 columns.
For my purpose I can only use the results from the query like that. I cannot use programmatically methods to do that.
Hope this is more clear now... Thanks for all the help...

select x, x+1000 from tablename
Will that do it for you?

If I'm reading this right, you'd probably have to do a self join; something like:
SELECT
LeftValues.ColA,
RightValues.ColA AS ColB
FROM YourTable LeftValues
LEFT JOIN YourTable RightValues ON LeftValues.ColA = RightValues.ColA - 1000
WHERE LeftValues.ColA < 1000
Note: Use the JOIN that makes sense for you (left if you are willing to accept NULLs in ColB, inner if you only want them where both values exist)

You can use a scripting language to parse the MySQL results to format it anyway you like. Are you using PHP to access the database? If so, let me know and I can cook one up for you.
I just saw your new updated question. In this case the order of the columns will be ordered by your SELECT statement and you can also sort too. Here is an example:
SELECT Column1, Column2 FROM my_table ORDER BY Column1, Column2 ASC

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

HIVE: GROUP BY doesn't behave as it would in MySQL - database

In MySQL you would just get a random C, which is not the one you seem to expect. See MySQL's SQL_MODE to appropriately let MySQL also refuse such ambiguous code. (or use MIN(C), to get a specific one)

Related

Group by an evaluated field (sql server) [duplicate]

SQL Server pulling data from multiple tables

sql select (..) where id in (list_of_values) vs select (...) where id in (subquery)

Mixing indexed and calculated fields in a table-valued function

SQL query join elements

Categories

Resources