Druid: order data by a column other than the timestamp using native queries

I'm using a scan query in Druid and I'm looking for a way to sort the data with the query. How can I do this?
Right now I have:
DataSource: Data,
Intervals: "1000"/"2000",
Limit: 10,
Legacy: true
I have a column "values" and I want to sort the data by this column (not by the timestamp) and return every column from the table, sorted by "values".
Something like:
SELECT __time, value, company, count
FROM shares
WHERE value > 200
ORDER BY value ASC

I tried a similar query with the wikipedia test data:
SELECT namespace, channel, cityName, sum_added
FROM "wikipedia_demo" r
WHERE sum_added > 30
ORDER BY sum_added DESC
which results in an error:
Error: Unknown exception
Cannot build plan for query: SELECT namespace, channel, cityName, sum_added FROM "wikipedia_demo" r WHERE sum_added > 30 ORDER BY sum_added DESC
org.apache.druid.java.util.common.ISE
The reason is that ORDER BY is only allowed on GROUP BY columns or aggregate expressions; if no grouping is done, it is only allowed on __time.
Take a look at the docs here: https://druid.apache.org/docs/latest/querying/sql.html#order-by
If you are not aggregating, you can still use GROUP BY selecting all the SELECT expressions and then ORDER BY any of them, as in:
SELECT namespace, channel, cityName, sum_added
FROM "wikipedia_demo" r
WHERE sum_added > 30
GROUP BY 1,2,3,4
ORDER BY sum_added DESC
Caution: since it is time-series data, it is a good practice to include a condition on __time to avoid scanning the whole table.
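For example, a variant of the query above restricted to a sample time window (the interval bounds below are just placeholders; adjust them to your data):
SELECT namespace, channel, cityName, sum_added
FROM "wikipedia_demo" r
WHERE sum_added > 30
  AND __time >= TIMESTAMP '2016-06-27 00:00:00'
  AND __time <  TIMESTAMP '2016-06-28 00:00:00'
GROUP BY 1,2,3,4
ORDER BY sum_added DESC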

Please also see the following docs page on the options you have regarding order on scan queries.
https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering

That's really a separate question, but yes, you can submit SQL through the API instead of a native JSON query.
JSON file named "a_query.json" (the SQL must go on a single line, since JSON strings cannot contain literal newlines):
{
  "query": "SELECT namespace, channel, cityName, sum_added FROM \"wikipedia_demo\" r WHERE sum_added > 30 GROUP BY 1,2,3,4 ORDER BY sum_added DESC"
}
Example API call:
curl -X POST -H 'Content-Type:application/json' -d @/<path_to_file>/a_query.json http://localhost:8888/druid/v2/sql | jq

The SQL query works fine for me, and I also found that adding "EXPLAIN PLAN FOR" before the SQL in the Druid console makes Druid show what the native JSON query should look like.
I solved my problem this way.
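For example, prefixing the query from above (the exact native JSON it produces depends on your Druid version):
EXPLAIN PLAN FOR
SELECT namespace, channel, cityName, sum_added
FROM "wikipedia_demo" r
WHERE sum_added > 30
GROUP BY 1,2,3,4
ORDER BY sum_added DESC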
Thanks

Related

Need to filter data based on the number of days for one record from a column, along with all the actual records

I have two tables, Caseclaims and claimstates. From these tables I am getting the resulting claim count and conditional counts for claim states such as SENT, ACCEPTED, and CLAIM STATUS REQUESTED.
Now, my requirement is that along with the count for all claim states, I need to filter the count for 'SENT' based on primaryADJUDICATIONDAYS (this is from another table).
I am not sure what you are looking for. Add the table structure (using DESC <table_name>) to your question, which will greatly help contributors.
Just answering on assumptions: try a GROUP BY on the column you want to count, i.e.:
SELECT COUNT(status) FROM ...... .... .... GROUP BY (the column you want to count by)
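As a rough sketch of that idea with conditional counts, assuming hypothetical column names (claim_id and status on the claimstates table mentioned in the question):
SELECT claim_id,                                                -- hypothetical grouping column
       COUNT(*) AS total_states,
       COUNT(CASE WHEN status = 'SENT' THEN 1 END) AS sent_count,
       COUNT(CASE WHEN status = 'ACCEPTED' THEN 1 END) AS accepted_count
FROM claimstates                                                -- table name taken from the question
GROUP BY claim_id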

Snowflake - View what tables and columns are queried the most

Is there any way within Snowflake/SQL to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users and I'm not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
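The same query can be rolled up at the table level (same assumptions about the ACCESS_HISTORY view) if you only care about which tables are hit most, for example:
select obj.value:objectName::string objName
     , count(*) uses
     , min(query_start_time) since
     , max(query_start_time) until
from snowflake.account_usage.access_history
     , table(flatten(direct_objects_accessed)) obj
group by 1
order by uses desc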
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution would be to take the queries from query_history() and run SYSTEM$EXPLAIN_PLAN_JSON on them outside that query (so the strings are constant); then you will be able to find the most queried tables.
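A rough sketch of that two-step workflow (the filter on query_text is only an illustrative assumption):
-- Step 1: list candidate query texts from the recent history
select query_text
from table(information_schema.query_history())
where query_text ilike 'select%';
-- Step 2: paste one of those texts back in as a constant string literal
SELECT "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'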

How to select the first rows distinct by a column name in a sub-query in SQL Server?

Actually I am building a Skype-like tool where I have to show the last 10 distinct users who have logged in to my web application.
I have a table in SQL Server with a field called last_active_time. My requirement is to sort the table by last_active_time and show all the columns of the last 10 distinct users.
There is another field called WWID which uniquely identifies a user.
I am able to find the distinct WWIDs but not able to select all the columns of those rows.
I am using the query below to find the distinct WWIDs:
select distinct(wwid) from (select top 100 * from dbo.rvpvisitors where last_active_time != '' order by last_active_time DESC) as newView;
But how do I find those distinct rows? I also want to show how long each user has been away from the web app, using the difference between the current time and the last active time.
I am new to SQL, so the question may be naive, but I am struggling to get it right.
If you are using proper data types for your columns you won't need a subquery to get that result; the following query should do the trick:
SELECT TOP 10
[wwid]
,MAX([last_active_time]) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE
[last_active_time] != ''
GROUP BY
[wwid]
ORDER BY
[last_active_time] DESC
If the column [last_active_time] is of type varchar/nvarchar (which is probably the case since you check for empty strings in the WHERE clause) you might need to use CAST or CONVERT to treat it as an actual date, so that functions like MIN/MAX behave correctly on it.
In general I would suggest you use proper data types for your columns; if you have date or timestamp data, use the "date" or "datetime2" data types.
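As a rough sketch of the "time away" calculation mentioned in the question, assuming [last_active_time] holds a parseable date/time string (the alias names are just placeholders):
SELECT TOP 10
    [wwid]
   ,MAX(CAST([last_active_time] AS datetime2)) AS [last_seen]                                       -- most recent activity per user
   ,DATEDIFF(MINUTE, MAX(CAST([last_active_time] AS datetime2)), SYSDATETIME()) AS [minutes_away]   -- time since that activity
FROM [dbo].[rvpvisitors]
WHERE [last_active_time] != ''
GROUP BY [wwid]
ORDER BY [last_seen] DESC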
Edit:
The query aggregates the data based on the column [wwid], and for each wwid returns the maximum [last_active_time].
The result is then sorted and filtered.
In order to add more columns "as-is" (without aggregating them) just add them in the SELECT and GROUP BY sections.
If you need more aggregated columns add them in the SELECT with the appropriate aggregation function (MIN/MAX/SUM/etc)
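For instance, a sketch of the query above extended with a hypothetical [department] column and an extra aggregate (both additions are placeholders, not columns known from the question):
SELECT TOP 10
    [wwid]
   ,[department]                                      -- hypothetical non-aggregated column, also added to GROUP BY
   ,MAX([last_active_time]) AS [last_active_time]
   ,MIN([last_active_time]) AS [first_active_time]    -- example of an extra aggregated column
FROM [dbo].[rvpvisitors]
WHERE [last_active_time] != ''
GROUP BY [wwid], [department]
ORDER BY MAX([last_active_time]) DESC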
I suggest you have a look at GROUP BY on W3
To learn more about the "execution order" of the clauses you can have a look here.
You can solve problems like this by rank-ordering the rows within each key and keeping only the first per key; this removes duplicates while preserving the ordering.
;
WITH RankOrdered AS
(
SELECT
*,
wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC )
FROM
dbo.rvpvisitors
where
last_active_time!=''
)
SELECT TOP(10) * FROM RankOrdered WHERE wwidRank = 1 ORDER BY last_active_time DESC
If my understanding is right, the query below will give the desired output. You can add conditions according to your needs.
select top 10 wwid from dbo.rvpvisitors group by wwid order by max(last_active_time) desc

How to construct yql for "select count(DISTINCT user_id), count(*) from music group by gender"

I tried the following yql statement for
select count(DISTINCT user_id), count(*) from music group by gender
/search/?yql=select (…) | all(group(gender) each(output(count())
all(group(user_id) output(count()))));
and got an error:
"code": 5,
"message": "Failed searching: Can not use output label 'count()' for multiple siblings.",
The problem is that you want two different counts at the same level (documents and unique users).
This can usually be solved by unique labelling using "as", but unfortunately we require a matching each() for the label, which doesn't account for this case. I'll create a GitHub issue for that. That said, you can achieve this by writing it as two parallel groupings with matching each() clauses:
all(all(group(gender) each(output(count())) as(documents))
all(group(gender) each(group(user_id) output(count())) as(users)))

TSQL: order by asc without column name

I'm quite new to SQL Server. Now I came across a query like this:
SELECT country FROM Hovercraft.Orders GROUP BY country ORDER BY ASC
There is no column name given in the order by clause. Is this possible? SSMS says no.
Jörg
It's probably a misprint - you have to specify what you are ordering by; this can be a column name, an expression, or the number of a column in the output. It's most likely that the query you saw was of the latter kind and simply omitted the column number 1 - like so:
SELECT country FROM Hovercraft.Orders GROUP BY country ORDER BY 1 ASC
This would order by the contents of the first column of the output (i.e. country).
I agree with @Mahmoud Gamal. But it's also possible to write a hack like this:
SELECT o.country, const_column = 1
FROM Hovercraft.Orders o
GROUP BY o.country
ORDER BY const_column ASC
In this case, sorting will be performed, but the rows' order will not actually change, since every row gets the same constant value.
This is not possible!
The ORDER BY clause always requires a column name, an expression, or a column number.
Could you let me know why you need this kind of query? I suspect you are working with a dynamic query; if not, please clarify.
As per the SQL standard this is not possible.
Thanks.
