Flink Stream SQL order by - apache-flink

I have a streaming input, say stock price data (covering multiple stocks), and I want to rank them by price every minute. The ranking is based on every stock's latest price and must sort all of them, whether or not a stock was updated in the previous minute. I tried to use ORDER BY in Flink stream SQL.
I failed to implement my logic, and I am confused about two things:
Why can ORDER BY only use a time attribute as the primary sort key, and why does it only support ASC? How can I implement ordering by another attribute, like price?
What does the SQL below (from the Flink documentation) mean? There is no window, so I assume the statement is executed immediately for each incoming order; in that case, it looks meaningless to sort a single element.
[Update]: When I read the code of ProcTimeSortProcessFunction.scala, it seems that Flink sorts the elements received during the next one millisecond.
SELECT *
FROM Orders
ORDER BY orderTime
Finally, is there a way to implement my logic in SQL?

ORDER BY in streaming queries is difficult to compute, because we don't want to update the whole result whenever a new row would have to be inserted near the beginning of the result table. Therefore, we only support ORDER BY on a time attribute, where we can guarantee that the results have (roughly) increasing timestamps.
In the future (Flink 1.6 or later), we will also support queries such as ORDER BY x ASC LIMIT 10, which will result in an updating table that contains the records with the 10 smallest x values.
Anyway, you cannot (easily) compute a top-k ranking per minute using a GROUP BY tumbling window. GROUP BY queries aggregate the records of a group (including the window, in the case of GROUP BY TUMBLE(rtime, INTERVAL '1' MINUTE)) into a single record. So there won't be multiple records per minute, just one.
If you'd like a query that computes a top-10 on field a per minute, you would need a query similar to this one:
SELECT a, b, c
FROM (
  SELECT
    a, b, c,
    RANK() OVER (PARTITION BY CEIL(t TO MINUTE) ORDER BY a) AS rnk
  FROM yourTable)
WHERE rnk <= 10
However, such queries are not yet supported by Flink (as of version 1.4), because the time attribute is used in the PARTITION BY clause and not in the ORDER BY clause of the OVER window.

Related

SELECT with WHERE drops performance

In my application I use queries like
SELECT column1, column2
FROM table
WHERE myDate >= date1 AND myDate <= date2
My application crashes with a timeout exception. I copied the query and ran it in SSMS. The results pane shows roughly 40 seconds of execution time. Then I removed the WHERE part of the query and ran it again. This time the first rows appeared immediately in the results grid, although the query kept returning more rows (there are 5 million rows in the table).
My question is: how can a WHERE clause affect query performance?
Note: I don't change the CommandTimeout property in the application; it is left at its default.
Without a WHERE clause, SQL Server is told to just start returning rows, so that's what it does, starting from the first row it can find efficiently (which may be the "first" row in the clustered index, or in a covering non-clustered index).
When you limit it with a WHERE clause, SQL Server first has to go find those rows. That's what you're waiting on: because you don't have an index on myDate (or on date1/date2, which I'm not sure are columns or variables), it needs to examine every single row.
Another way to look at it is to think of a phone book, which is an outdated analogy but gets the job done. Without a WHERE clause, it's like you're asking me to read you off all of the names and numbers in the book. If you add a WHERE clause that is not supported by an index, like read me off the names and numbers of every person with the first name 'John', it's going to take me a lot longer to start returning rows because I can't even start until I find the first John.
Or a slightly different analogy is to think of the index in a book. If you ask me to read off the page numbers for all the terms that are indexed, I can do that from the index, just starting from the beginning and reading through until the end. If you ask me to read off all the page numbers for all the terms that aren't in the index, or a specific unindexed term (like "the"), or even all the page numbers for indexed terms that contain the letter a, I'm going to have a much harder time.

Efficient way to select many records from an Oracle database (in a short time)

I am currently developing a program to retrieve records from a database based on a number of tables.
Here is my SQL command string.
select distinct a.id,
a.name,
b.status,
b.date
from table1 a
left join table2 b
on a.id = b.id
where #$%^$#%#
Some of the tables have around 50 million records or more. Most of the time the socket does not hit a timeout error, because users input WHERE clauses narrowing down what they want. However, when the program tries to retrieve all the records from the database, it shows a socket error because the retrieval takes more time than is allowed.
One of my thoughts is to limit the rows retrieved by using rownum, because users might not really want that much data from the tables. For example, a user could input the maximum number of rows they want to retrieve, say 10 thousand, and I would return 10000 records back to them. But I fail to retrieve exactly that number of records using rownum < 10000, and I don't know how to make it work either...
And so here I am to ask for suggestions from the professional developers here. Please help! Thanks so much!
First of all, you have to make it clear (to yourself) what data you need.
If you need to generate overall statistics, then you need all the data. Saving intermediate results may help, but you still have to read everything. In that case, set the socket timeout to something like 24 hours, and just make sure your SELECTs don't block other processes.
But if you are making a client application (displaying some data in an HTML table), then you definitely do not need everything. Design your application so that users apply filters first, then they receive the first result page, then the second... See how Google search or e-shops work - you get an empty homepage first (possibly with some promotion), after that you start filtering.
Secondly, technical ideas:
The row-limiting clause was introduced in Oracle 12c, so you can use SELECT * FROM table OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY (see the sketch just after this list). For older versions you have to use the old WHERE rownum <= 10000, which does not combine well with ORDER BY.
Save intermediate results when using aggregations, etc.
Minimize the need for JOINs (denormalize).
I can also think of using an optimizer hint to tell Oracle that you want the first n rows fast, as described here: https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4942
The other two answers already mention that you can implement paging using an ORDER BY clause and rownum, like this:
select * from (
  SELECT t.*, rownum rnum
  FROM (SELECT foo FROM Foo ORDER BY OFR_ID) t
  WHERE rownum <= stopOffset
) where rnum >= startOffset
or by using OFFSET in modern Oracle.
I want to point out an additional thing that shows up when you retrieve many rows (hundreds of thousands to millions) to process them in an application: be sure to set a large enough fetch size (usually in the range of 1000 to 5000) when you do it. There is typically a big difference in execution time between retrieving results with the default fetch size and retrieving them with a larger one when it's known that there will be a lot of rows. For example, in Java you can explicitly set the fetch size on your Statement object when crafting a query:
// Ask the JDBC driver to fetch 1000 rows per round trip to the database
PreparedStatement statement = connection.prepareStatement(query);
statement.setFetchSize(1000);

Use TOP (1) specification when searching for primary key?

When querying a table using its primary key, like this:
SELECT * FROM foo WHERE myPrimaryKey = @bar;
would it make sense/be faster to use a TOP (1) specification?
SELECT TOP (1) * FROM foo WHERE myPrimaryKey = @bar;
Or is SQL Server smart enough to stop searching after it's found the primary key?
No, in your particular case using TOP (1) is not useful at all.
The TOP clause is applied after the rest of the query is processed, so it is useful only to limit the overhead of a possibly high data flow between the server and the client, or when you want to cap the number of rows you retrieve from the server no matter what.
The reason I say that TOP is applied after everything else is that it needs the ordered data, so it has to run after the last evaluated clause: ORDER BY.
TOP can also let you retrieve the first x percent of rows using TOP(x) PERCENT, so again, it needs to know the total number of rows and their order.
A simple example is the biggest enemy of a DBMS in development: SELECT * FROM Table (I say development because that's the only environment where that kind of query should be seen).
Sometimes I use a SELECT * FROM kind of query when I have to understand what kind of data (not data types) to expect when I develop something that has to use that table.
Since I want to write a very short query and all I need is a bunch of records, I use the TOP clause: SELECT TOP 5 * FROM Table
SQL Server still processes the query as SELECT * FROM Table, but it will only send back the first 5 rows.
You can try this out yourself: write a query that should retrieve more than one row, check its execution plan, then add the TOP clause and check the execution plan again. It will be the same in both cases.
To see how TOP affects the time spent, consider a query without TOP that returned around 40700 rows: the wait time on the server was only 2 ms, but all the rest of the time (267 ms) was spent downloading data.

SQL Server last 25 records query optimization

I have 4 million records in one of my tables. I need to get the last 25 records that have been added in the last week.
This is how my current query looks
SELECT TOP(25) [t].[EId],
[t].[DateCreated],
[t].[Message]
FROM [dbo].[tblEvent] AS [t]
WHERE ( [t].[DateCreated] >= Dateadd(DAY, Datediff(DAY, 0, Getdate()) - 7, 0)
AND [t].[EId] = 1 )
ORDER BY [t].[DateCreated] DESC
Now, I do not have any indexes on this table and do not intend to add one. This query takes about 10-15 seconds to run, and my app times out. Is there a way to do better?
You should create an index on (EId, DateCreated), or at least on DateCreated.
Without that, the only way of optimising this that I can think of would be to maintain the last 25 rows in a separate table via an insert trigger (and possibly update and delete triggers as well).
If you have an auto-increment ID in the table (not the EId, but a separate primary key), you can ORDER BY ID DESC instead of DateCreated; that might make your ORDER BY faster.
Otherwise you do need an index (but your question says you do not want one).
If the table has no indexes to support the query you are going to be forced to perform a table scan.
You are going to struggle to get around the table scan aspect of that - and as the table grows, the response time will get slower.
You are going to have to endeavour to educate your client about the problems they face going forward, and get them to consider an index. If they are saying no, you need to show the evidence to support the reasoning: show them timings with and without the index, and make sure the impact on record insertion is also shown. It's a relatively simple cost/benefit/detriment case for adding or not adding the index. If they insist on no index, then you have no choice but to extend your timeouts.
You should also try a query hint with OPTION (FAST n), where n is the number of rows:
http://msdn.microsoft.com/en-us/library/ms181714.aspx

SQL Query Costing, aggregating a view is faster?

I have a table, Sheet1$, that contains 616 records. I have another table, Rates$, that contains 47880 records. Rates contains a response rate for a given record in the sheet for each of the 90 days from a mailing date. Across all 90 days of a record's Rates rows, the total response is ALWAYS 1 (100%).
Example:
Sheet1$: Record 1, 1000 QTY, 5% Response, Mail 1/1/2009
Rates$: Record 1, Day 1, 2% Response
Record 1, Day 2, 3% Response
Record 1, Day 90, 1% Response
Record N, Day N, N Response
Given that, I've written a view that joins these tables (a right join onto the rates) to expand the data, so I can do some math to get a return per day for any given record.
SELECT s.[Mail Date] + r.Day AS Mail_Date,
       s.Quantity * s.[Expected Response Rate] * r.Response AS Pieces,
       s.[Bounce Back Card], s.Customer, s.[Point of Entry]
FROM Sheet1$ AS s
RIGHT OUTER JOIN Rates$ AS r
  ON s.[Appeal Code] = r.Appeal
WHERE s.[Mail Date] IS NOT NULL
  AND s.Quantity <> 0
  AND s.[Expected Response Rate] <> 0
  AND s.Quantity IS NOT NULL
  AND s.[Expected Response Rate] IS NOT NULL;
So I save this as a view called Test_Results. Using SQL Server Management Studio I run this query and get a result of 211,140 records. Elapsed time was 4.121 seconds, Est. Subtree Cost was 0.751.
Now I run a query against this view to aggregate a piece count on each day.
SELECT Mail_Date, SUM(Pieces) AS Piececount
FROM Test_Results
GROUP BY Mail_Date
That returns 773 rows and took only 0.452 seconds to execute, with an Est. Subtree Cost of 1.458.
My question is: with a higher cost estimate, how did this execute SO much faster than the original view itself? I would assume part of it might be that it is returning fewer rows to Management Studio. If that is the case, how would I go about viewing the true cost of this query without having to account for the time spent returning results?
SELECT * FROM view1 will have a plan
SELECT * FROM view2 (where view2 is based on view1) will have its own complete plan
The optimizer is smart enough to combine and collapse the operations in the plan for view2 into the most efficient operation. It only has to observe the semantics of the design of view1, but it is not required to take the plan for SELECT * FROM view1 and then apply another plan on top of it; in general it will produce a completely different plan, and it will do whatever it can to get the most efficient results.
Typically, it's going to push the aggregation down to improve selectivity and reduce the amount of data involved, and that's going to speed up the operation.
I think that Cade has covered the most important part - selecting from a view doesn't necessarily entail returning all of the view rows and then selecting against that. SQL Server will optimize the overall query.
To answer your question though: if you want to avoid the network and display costs, you can simply select each query result into a table. Just add INTO Some_Table after the column list of the SELECT clause.
You should also be able to separate things out by showing client statistics or by using Profiler, but the SELECT...INTO method is quick and easy.
Query costs are unitless and are used only by the optimizer to choose what it thinks is the most efficient execution path for a particular query, so they can't really be compared between queries. This, although old, is a good quick read. Then you'll probably want to look around for some books or articles on the MSSQL optimizer and on reading query plans if you're really interested.
(Also, make sure you're viewing the actual execution plan, and not the estimated plan; they can be different.)
