wrong result in Apache flink full outer join - apache-flink

I have 2 data streams which were created from 2 tables like:
Table orderRes1 = ste.sqlQuery(
"SELECT orderId, userId, SUM(bidPrice) as q FROM " + tble +
" Group by orderId, userId");
Table orderRes2 = ste.sqlQuery(
"SELECT orderId, userId, SUM(askPrice) as q FROM " + tble +
" Group by orderId, userId");
DataStream<Tuple2<Boolean, Row>> ds1 = ste.toRetractStream(orderRes1 , Row.class).
filter(order-> order.f0);
DataStream<Tuple2<Boolean, Row>> ds2 = ste.toRetractStream(orderRes2 , Row.class).
filter(order-> order.f0);
I wonder to perform a full outer join on these 2 streams, and I used both orderRes1.fullOuterJoin(orderRes2 ,$(exp))
and a sql query containing a full outer join, as below:
Table bidOrdr = ste.fromDataStream(bidTuple, $("orderId"),
$("userId"), $("price"));
Table askOrdr = ste.fromDataStream(askTuple, $("orderId"),
$("userId"), $("price"));
Table result = ste.sqlQuery(
"SELECT COALESCE(bidTbl.orderId,askTbl.orderId) , " +
" COALESCE(bidTbl.userId,askTbl.orderId)," +
" COALESCE(bidTbl.bidTotalPrice,0) as bidTotalPrice, " +
" COALESCE(askTbl.askTotalPrice,0) as askTotalPrice, " +
" FROM " +
" (SELECT orderId, userId," +
" SUM(price) AS bidTotalPrice " +
" FROM " + bidOrdr +
" Group by orderId, userId) bidTbl full outer JOIN " +
" (SELECT orderId, userId," +
" SUM(price) AS askTotalPrice" +
" FROM " + askOrdr +
" Group by orderId, userId) askTbl " +
" ON (bidTbl.orderId = askTbl.orderId" +
" AND bidTbl.userId= askTbl.userId) ") ;
DataStream<Tuple2<Boolean, Row>> = ste.toRetractStream(result, Row.class).filter(order -> order.f0);
However, the result in some cases in not correct: imagine user A sells with a price to B 3 times, after that user B sells to A 2 times, the second time the result is:
7> (true,123,a,300.0,0.0)
7> (true,123,a,300.0,200.0)
10> (true,123,b,0.0,300.0)
10> (true,123,b,200.0,300.0)
the second and forth lines are the expected result of stream, but it will generate the 1st and 3rd lines too.
worth mentioning that coGroup is the other solution, yet I do not want to use windowing in this scenario, and a non-windowing solution is just accessible in bounded streams (DataSet).
Hint: orderId and userId will repeat in both streams, and I want to produce 2 rows in each action, containing:
orderId, userId1, bidTotalPrice, askTotalPrice AND
orderId, userId2, bidTotalPrice, askTotalPrice

Something like this is to be expected with streaming queries (or in other words, with queries executed on dynamic tables). Unlike a traditional database, where the input relations to a query are kept static during query execution, the inputs to a streaming query are being continuously updated -- and so the result must also be continuously updated.
If I understand the setup here, the "incorrect" results on lines 1 and 3 are correct up until the relevant rows from orderRes2 are processed. If those rows never arrive, then lines 1 and 3 will remain correct.
What you should expect is an eventually correct result, including retractions as necessary. You can reduce the number of intermediate results by turning on mini-batch aggregation.
This mailing list thread gives more insight. If I've misunderstood your situation, please provide a reproducible example that illustrates the problem.

Related

Nested match_recognize query not supported in flink SQL?

I am using flink 1.11 and trying nested query where match_recognize is inside, as shown below :
select * from events where id = (SELECT * FROM events MATCH_RECOGNIZE (PARTITION BY org_id ORDER BY proctime MEASURES A.id AS startId ONE ROW PER MATCH PATTERN (A C* B) DEFINE A AS A.tag = 'tag1', C AS C.tag <> 'tag2', B AS B.tag = 'tag2'));
And I am getting an error as : org.apache.calcite.sql.validate.SqlValidatorException: Table 'A' not found
Is this not supported ? If not what's the alternative ?
I was able to get something working by doing this:
Table events = tableEnv.fromDataStream(input,
$("sensorId"),
$("ts").rowtime(),
$("kwh"));
tableEnv.createTemporaryView("events", events);
Table matches = tableEnv.sqlQuery(
"SELECT id " +
"FROM events " +
"MATCH_RECOGNIZE ( " +
"PARTITION BY sensorId " +
"ORDER BY ts " +
"MEASURES " +
"this_step.sensorId AS id " +
"AFTER MATCH SKIP TO NEXT ROW " +
"PATTERN (this_step next_step) " +
"DEFINE " +
"this_step AS TRUE, " +
"next_step AS TRUE " +
")"
);
tableEnv.createTemporaryView("mmm", matches);
Table results = tableEnv.sqlQuery(
"SELECT * FROM events WHERE events.sensorId IN (select * from mmm)");
tableEnv
.toAppendStream(results, Row.class)
.print();
For some reason, I couldn't get it to work without defining a view. I kept getting Calcite errors.
I guess you are trying to avoid enumerating all of the columns from A in the MEASURES clause of the MATCH_RECOGNIZE. You may want to compare the resulting execution plans to see if there's any significant difference.

What to substitute for array_agg in a PostgreSQL subquery when switching to Derby?

I have a java application that for some installations acceses a PostgreSQL database, while in others it acceses essentially the same database in Derby.
I have a SQL query that returns an examination record from the examination table. There is an exam_procedure table that relates to the examination table in a one (examination) to many fashion. I need to concatenate the potentially multiple string records in the exam_procedure table so that I can add a single string value to the query return that represents all the related exam_procedure records. For a variety of reasons (eg, joins return too many records, especially when multiple subqueries are needed for other related one to many tables), I need to do this via a subquery in the SELECT section of the main query. The following SQL works just fine for PostgreSQL, but my understanding is that array_agg is not available in Derby. What Derby subquery can I substitute for the PostgreSQL subquery?
Many thanks.
// part of the query
"SELECT "
+ "patient_id, "
+ "examination_date, "
+ "examination_number, "
+ "operating_physician_id, "
+ "referring_physician_id, "
+ "patient.last_name AS pt_last_name, "
+ "patient.first_name AS pt_first_name, "
+ "patient.middle_name AS pt_middle_name, "
+ "("
+ "SELECT "
+ "array_agg(prose) "
+ "FROM "
+ "exam_procedure "
+ "WHERE examination_id = " + examId
+ " GROUP BY examination_id"
+ ") AS agg_procedures, "
+ "FROM "
+ "examination "
+ "JOIN patient ON patient.id = examination.patient_id "
+ "WHERE "
+ "examination.id = ?"
;

How to select unrelated entities with a single query

Let's imagine we have 5 tables that have no relationship between each other but all of them share the same column. Let's name the tables ClojureConf, KotlinConf, ScalaConf, GroovyConf, JavaConf. They all have a column UserId. The number and data types of other columns are different in each of them. A given User may have attended zero or more conferences.
The task is to just select all the records from each of the 5 tables for a given UserId, convert them to DTOs and return as json.
Currently the code does 5 trips to the database to get a list of results from each table.
Is there a library support in hibernate/jpa what would make a single trip to a database? The goal is to improve performance.
Is it possible to define a projection for entities that would look similar to this:
interface ConfAttended {
List<ClojureConf> getClojureConfs();
List<KotlinConf> getKotlinConfs();
List<ScalaConf> getScalaConfs();
List<GroovyConf> getGroovyConfs();
List<JavaConf> getJavaConfs();
}
and a repository that would select and map the results in one go
interface ConfAttendedDAO extends JpaRepository<User, Long> {
#Query("SELECT c, k, s, g, j FROM ClojureConf c " +
"JOIN KotlinConf k ON c.UserId = k.UserId " +
"JOIN ScalaConf s ON c.UserId = s.UserId " +
"JOIN GroovyConf g ON c.UserId = g.UserId " +
"JOIN JavaConf j ON c.UserId = j.UserId " +
"WHERE c.UserId = :userId")
ConfAttended findByUserIdForProjection(#Param("userId") long userId);
}
?
I ended up having a query like this:
interface ConfAttendedDAO extends JpaRepository<User, Long> {
#Query("SELECT c, k, s, g, j FROM User u " +
"LEFT JOIN ClojureConf c ON c.UserId = :userId " +
"LEFT JOIN KotlinConf k ON k.UserId = :userId " +
"LEFT JOIN ScalaConf s ON s.UserId = :userId " +
"LEFT JOIN GroovyConf g ON g.UserId = :userId " +
"LEFT JOIN JavaConf j ON j.UserId = :userId " +
"WHERE u.Id = :userId")
List<Object[]> findAllByUserId(#Param("userId") long userId);
}
Hibernate takes care of mapping rows to entities. Each Object[] has all 5 entities (or nulls) as its elements. Selecting from User is to force the query to return results. Otherwise if first table returns nothing - the whole query returns nothing. Another downside is that if one table has 10 results and another has 1, the table with less results gets them duplicated.
As for performance (the sole point of doing all this), getting and processing results is 4-5 times faster than with 5 separate SELECTs.

Get random result from JPQL query over large table

I'm currently using JPQL queries to retrieve information from a database. The purpose of the project is testing a sample environments through randomness with different elements so I need the queries to retrieve random single results all through the project.
I am facing that JPQL does not implement a proper function for random retrieval and postcalculation of random takes too long (14 seconds for the attached function to return a random result)
public Player getRandomActivePlayerWithTransactions(){
List<Player> randomPlayers = entityManager.createQuery("SELECT pw.playerId FROM PlayerWallet pw JOIN pw.playerId p"
+ " JOIN p.gameAccountCollection ga JOIN ga.iDAccountStatus acs"
+ " WHERE (SELECT count(col.playerWalletTransactionId) FROM pw.playerWalletTransactionCollection col) > 0 AND acs.code = :status")
.setParameter("status", "ACTIVATED")
.getResultList();
return randomPlayers.get(random.nextInt(randomPlayers.size()));
}
As ORDER BY NEWID() is not allowed because of JPQL restrictions I have tested the following inline conditions, all of them returned with syntax error on compilation.
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10
WHERE Rnd % 100 < 10
FROM TABLESAMPLE(10 PERCENT)
Have you consider to generate a random number and skip to that result?
I mean something like this:
String q = "SELECT COUNT(*) FROM Player p";
Query query=entityManager.createQuery(q);
Number countResult=(Number) query.getSingleResult();
int random = Math.random()*countResult.intValue();
List<Player> randomPlayers = entityManager.createQuery("SELECT pw.playerId FROM PlayerWallet pw JOIN pw.playerId p"
+ " JOIN p.gameAccountCollection ga JOIN ga.iDAccountStatus acs"
+ " WHERE (SELECT count(col.playerWalletTransactionId) FROM pw.playerWalletTransactionCollection col) > 0 AND acs.code = :status")
.setParameter("status", "ACTIVATED")
.setFirstResult(random)
.setMaxResults(1)
.getSingleResult();
I have figured it out. When retrieving the player I was also retrieving other unused related entity and all the entities related with that one and so one.
After adding fetch=FetchType.LAZY (don't fetch entity until required) to the problematic relation the performance of the query has increased dramatically.

HIbernate + MSSQL query compatibility

I need to get the latest "version" of a Task object for a given objectUuid. The Task is identified by its objectUuid, taskName and createdTimestamp attributes.
I had the following HQL query:
select new list(te) from " + TaskEntity.class.getName() + " te
where
te.objectUuid = '" + domainObjectId + "' and
te.createdTimestamp = (
select max(te.createdTimestamp) from " + TaskEntity.class.getName() + " teSub
where teSub.objectUuid = te.objectUuid and teSub.taskName = te.taskName
)
which ran and produced the correct results on H2 (embedded) and MySQL.
However after installing this code in production to MS SQL Server I get the following error:
An aggregate may not appear in the WHERE clause unless it is in a
subquery contained in a HAVING clause or a select list, and the column
being aggregated is an outer reference.
I tried to rewrite the query but HQL doesn't seem to support subqueries properly. My latest attempt is something like:
select new list(te) from " + TaskEntity.class.getName() + " te
inner join (
select te2.objectUuid, te2.taskName, max(te2.createdTimestamp)
from " + TaskEntity.class.getName() + " te2
group by te2.objectUuid, te2.taskName
) teSub on
teSub.objectUuid = te.objectUuid and teSub.taskName = te.taskName
where
te.objectUuid = '" + domainObjectId + "'
but of course it fails at the "(" after the join statement.
Since this is a very frequent type of query I cannot believe there is no solution that works with HQL+MSSQL.
Uh-oh. Can this be a typo?
max(teSub.createdTimestamp)
instead of
max(te.createdTimestamp)
in the subquery.

Resources