Spark speed up multiple join operations

Suppose I have a rule like this:
p(v3,v4) :- t1(k1,v1), t2(k1,v2), t3(v1,v3), t4(v2,v4).
The task is to join t1, t2, t3, and t4 to produce the relation p.
Suppose t1 through t4 already share the same partitioner on their keys.
A common strategy is to join the relations one by one, but that forces at least 3 shuffle/repartition operations. The details are below (suppose I have 10 partitions):
1. join: x = t1.join(t2)
2. repartition: x = x.map(lambda kv: kv[1]).partitionBy(10)  # (k1, (v1, v2)) -> (v1, v2)
3. join: x = x.join(t3)
4. repartition: x = x.map(lambda kv: kv[1]).partitionBy(10)  # (v1, (v2, v3)) -> (v2, v3)
5. join: x = x.join(t4)
6. repartition: x = x.map(lambda kv: kv[1]).partitionBy(10)  # (v2, (v3, v4)) -> (v3, v4)
Because t1 through t4 all share the same partitioner, and I repartition the intermediate result after every join, the joins themselves never involve a shuffle.
However, the intermediate result (the variable x) is huge in my real code, so 3 shuffle operations are still too many for me.
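For reference, here is the whole strategy as a runnable sketch (PySpark; the tiny parallelize datasets are illustrative placeholders for my real relations):

from pyspark import SparkContext

sc = SparkContext(appName="four-way-join")
P = 10  # number of partitions

# Illustrative data; in practice t1..t4 arrive already partitioned on their keys.
t1 = sc.parallelize([("k", 1)]).partitionBy(P)  # (k1, v1)
t2 = sc.parallelize([("k", 2)]).partitionBy(P)  # (k1, v2)
t3 = sc.parallelize([(1, 3)]).partitionBy(P)    # (v1, v3)
t4 = sc.parallelize([(2, 4)]).partitionBy(P)    # (v2, v4)

x = t1.join(t2)                             # co-partitioned: no shuffle
x = x.map(lambda kv: kv[1]).partitionBy(P)  # re-key on v1: shuffle 1
x = x.join(t3)                              # co-partitioned: no shuffle
x = x.map(lambda kv: kv[1]).partitionBy(P)  # re-key on v2: shuffle 2
x = x.join(t4)                              # co-partitioned: no shuffle
p = x.map(lambda kv: kv[1]).partitionBy(P)  # re-key on v3: shuffle 3
print(p.collect())                          # [(3, 4)] for the toy data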
My questions are:
Is there anything wrong with my strategy to evaluate this rule? Is there any better, more efficient solution?
My understanding of the shuffle operation is that, for each partition, Spark repartitions independently: it writes the repartitioned results of each partition to disk (the so-called shuffle write), and then each partition reads its new data back from disk (the so-called shuffle read). If my understanding is correct, every shuffle/repartition always costs a disk write and a disk read. That seems wasteful if I can guarantee my memory is large enough to hold all the data, as described in http://www.trongkhoanguyen.com/2015/04/understand-shuffle-component-in-spark.html. Is there any workaround to disable these shuffle write and read operations? I think my program's performance bottleneck is this shuffle I/O overhead.
Thank you.

Related

Database merge join cost evaluation problem

I was given a question asking me to evaluate the minimum page I/O cost of the query π_{A,B,C,D}(R ⋈_{A=C} S) using the merge join method. I need to evaluate the following:
Page I/O cost to sort R.
Page I/O cost to sort S.
Page I/O cost to join R and S.
My question is: since the query projects onto a subset of the attributes (A, B, C, D) only, is it possible to eliminate the unwanted attributes during the separate sorts of R and S (given that A and B are in R, and C and D are in S)? If so, then the formula $2 b_R (\lceil \log_{M-1}(b_R/M) \rceil + 1)$ seems not to apply directly.
Or, more precisely: when is the best point to eliminate the unwanted attributes?
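For reference, written out cleanly, the standard external-sort cost is
$$2\,b_R\left(\left\lceil \log_{M-1}(b_R/M) \right\rceil + 1\right),$$
where $b_R$ is the number of pages of R and $M$ the number of buffer pages. If the projection is applied while writing the initial sorted runs, then (using $b'_R$, my own notation, for the page count of the projected tuples) the cost becomes roughly
$$b_R + b'_R + 2\,b'_R\left\lceil \log_{M-1}(b_R/M) \right\rceil,$$
since the first pass still reads all $b_R$ pages but writes only $b'_R$ pages, and every later merge pass moves only $b'_R$ pages.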
This question has had me stuck for a long time. I hope to get some insight on it.
Thanks.

CASE statement versus temporary table

I have a choice between two different techniques for converting codes to text:
insert into #TMP_CONVERT(code, value)
values
 (1, 'Uno')
,(2, 'Dos')
,(3, 'Tres')
;
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN #TMP_CONVERT tc
on tc.code = x.code
Or
case x.code
when 1 then 'Uno'
when 2 then 'Dos'
when 3 then 'Tres'
else 'Unknown'
end as THE_VALUE
The table has about 20 million rows.
The typical size of the code lookup table is 10 rows.
I would rather use the first approach, but I don't like left outer joins.
My questions are:
Is one faster than the other in any really meaningful way?
Does any SQL engine optimize this out anyway? That is, does it just read the table into memory and essentially do the CASE statement logic anyway?
I happen to be using T-SQL, but I would like to know for any number of RDBMSs, because I use several.
[Edit to clarify not liking LEFT OUTER JOIN]
I use LEFT OUTER JOINs when I need them, but whenever I use them I double-check my logic and data to confirm I actually need them, and then I add a comment to the code indicating why I am using a LEFT OUTER JOIN. Of course, I have to do a similar exercise when I use an INNER JOIN; that is, make sure I am not dropping data.
There is some overhead to using a join. However, if the code is made a primary clustered key, then the performance for the two might be comparable -- with the join even winning out. Without an index, I would expect the case to be slightly better than the left join.
These are just guesses. As with all performance questions, though, you should check on your data on your systems.
I also want to react to your not liking left joins. These provide important functionality for SQL and are the right way to address this problem.
The code executed for the join will likely be substantially more than the code executed for the hardcoded case options.
The execution plan will have an extra join iterator, along with an extra scan or seek operator (depending on the availability of a suitable index). On the positive side, the 10 rows will likely all fit on a single page in #TMP_CONVERT, and that page will be in memory anyway; being a temp table, it also won't bother taking and releasing row locks each time. Still, the code to latch the page, locate the correct row, and crack the desired column value out of it, repeated over 20,000,000 iterations, would likely add some measurable CPU time compared with looking the value up in a hardcoded list. (Potentially you could also try nested CASE statements to perform a binary search and avoid the need for 10 branches.)
But even if there is a measurable time difference, it may still not be particularly significant as a proportion of the query time as a whole. Test it and let us know what you find.
You can also avoid creating a temporary table in this case by using a CTE (the WITH construct). Your query might look something like this:
WITH TMP_CONVERT(code, value) AS -- A semicolon may be required before WITH.
(
SELECT * FROM (VALUES (1, 'Uno'),
(2, 'Dos'),
(3, 'Tres')
) tbl(code,value)
)
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN TMP_CONVERT tc
on tc.code = x.code
Or even a subquery can be used:
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN (VALUES (1, 'Uno'),
(2, 'Dos'),
(3, 'Tres')
) tc(code,value)
ON tc.code = x.code
Hope this can be helpful.

Optimising Matrix Updating and Multiplication

Consider as an example the matrix
X(a,b) = [ a  b
           a  a ]
I would like to perform some relatively intensive matrix algebra computations with X, update the values of a and b and then repeat.
I can see two ways of storing the entries of X:
1) As numbers (i.e. floats). Then after our matrix algebra operation we update all the values in X to the correct values of a and b.
2) As pointers to a and b, so that after updating them, the entries of X are automatically updated.
Now, I initially thought method (2) was the way to go as it skips the updating step. However, I believe method (1) allows better use of the cache when, for example, doing matrix multiplication in parallel (although I am no expert, so please correct me if I'm wrong).
My hypothesis is that for inexpensive matrix computations you should use method (2), and that there is some threshold, as the computation becomes more complex, past which you should switch to (1).
I imagine this is not too uncommon a problem, and my question is: which is the optimal method to use for general matrices X?
Neither approach sounds particularly hard to implement. The simplest answer is to make a test calculation, try both, and benchmark them; take the faster one. Depending on what sort of operations you're doing (matrix multiplication, inversion, etc.?), you can potentially reduce the computation by simplifying the operations, given the assumptions you can make about your matrix structure. But I can't speak to that any more deeply, since I'm not sure what types of operations you're doing.
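To make the benchmarking advice concrete, here is a minimal sketch of method (1) in Python/NumPy, using the 2x2 matrix from the question (the update rule and iteration count are illustrative assumptions; method (2) has no direct Python equivalent, since it needs raw pointers such as double* entries in C/C++):

import time
import numpy as np

def build_X(a, b):
    # Method (1): plain float entries, rebuilt from a and b on every update.
    return np.array([[a, b],
                     [a, a]])

a, b = 1.0, 2.0
start = time.perf_counter()
for _ in range(100_000):
    X = build_X(a, b)  # the explicit "update all entries" step
    Y = X @ X          # stand-in for the intensive matrix algebra
    # Derive new parameters from the result, kept bounded in (0, 1).
    a, b = 1.0 / (1.0 + Y[0, 0]), 1.0 / (1.0 + Y[0, 1])
print(f"method (1): {time.perf_counter() - start:.3f} s")

Timing the same loop against a pointer-based variant in a language that supports one would show where the crossover the question hypothesises actually lies.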
But from experience, with a matrix that size you probably won't see a performance difference. With larger matrices you will, as the CPU's cache starts to fill. In that case, things like separating multiplication and addition operations, using pointer indexing, and passing inputs as const enable the compiler to make significant performance improvements.
See
Optimized matrix multiplication in C and
Cache friendly method to multiply two matrices

How do I perform an atomic operation using the Datomic database?

TL;DR I want the function: "update Y ONLY IF Y=10", otherwise it fails.
Example: imagine the timeline T1, T2, T3. At time T1 the entity X has the attribute Y=10; at time T2 the attribute is Y=14. My aim is to apply a complex operation to Y (assume the operation is adding 1). I read the value of Y at T1, which is 10, and place it in a queue for processing. At T3, when the complex operation has completed and the result is 11, I update the attribute Y. If I simply overwrite the attribute, the value Y=14 written at T2 is mistakenly discarded. So at T3, before updating, I want to be sure the value is still Y=10; otherwise I have to read Y=14 and reprocess it.
I know about Database Functions for atomic read-modify-update processing, but this approach is not good if the operation is complex and needs to be done in a distributed fashion (after being put on a queue).
What I want is something equivalent to Conditional Writes in DynamoDB.
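In other words, what I want is a compare-and-swap loop around the expensive step. A toy sketch in Python of the behaviour I'm after (the Store class is an in-memory stand-in for the database, not the Datomic API; Datomic's :db.fn/cas would play the role of Store.cas):

import threading

class Store:
    # Toy in-memory stand-in for the database (illustration only).
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Atomic compare-and-swap: write `new` only if the current
        # value still equals `expected`; report whether it succeeded.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

def process(store, complex_operation):
    while True:
        expected = store.read()               # e.g. Y=10 at T1
        result = complex_operation(expected)  # slow, done off a queue
        if store.cas(expected, result):       # conditional write
            return result
        # Otherwise someone wrote Y in the meantime (e.g. Y=14 at T2):
        # loop, re-read the new value, and reprocess it.

y = Store(10)
print(process(y, lambda v: v + 1))  # prints 11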
You could run the ensure process peer-side, validating against a certain basis T, and then check the database's basis T in the transaction. That way, the computationally complex or expensive code is handled peer-side, and the transaction function is only responsible for the basis-T validation.
For anything that matches the standard use case (e.g., your example in your description), Database Functions are the correct and canonical answer.
The built-in :db.fn/cas function is available and is now documented at http://docs.datomic.com/transactions.html
With regard to Felipe's comment "a transaction function that insert only if or throw exception": can you not just use the built-in one, :db.fn/cas?

Multithreading a series of equations

I have a long series of equations that looks something like this except with about 113 t's:
t1 = L1;
t2 = L2 + 5;
t3 = t2 + t1;
t4 = L3;
...
t113 = t3 + t4
return t113;
Where L's are input arguments.
It takes a really long time to calculate t113, so I'm trying to split the work across several threads to make it quicker. The problem is that I'm not sure how to do this. I tried drawing the t's out as a tree by hand on paper so I could analyse it better, but it grew too large and unwieldy midway through.
Are there other ways to make the calculations faster? Thanks.
EDIT: I'm using an 8 core DSP with SYS/BIOS. According to my predecessor, these inverse and forward kinematic equations will take the most time to process. My predecessor also knowingly chose this 8 core DSP as the hardware for implementation. So I'm assuming I should write the code in a way that takes advantage of all 8 cores.
With values that depend on other values, you're going to have a very tough time allocating the work to different threads, and it's likely you'll end up with one thread waiting on another. Firing off new threads is also likely more expensive than just calculating the 113 values.
Are you sure it's the calculation of t113 that takes a long time, or is it something else?
I'm assuming the tasks are time-intensive and involve more than just L2 + L3 or the like. If not, the overhead of the threading is going to vastly exceed any minimal gains from it.
If this were Java, I'd use Executors.newCachedThreadPool(), which starts a new thread whenever needed, and allow the jobs themselves to submit jobs to the thread pool and wait for the response. That's a bit of a strange pattern, but it would work.
For example:
private final ExecutorService threadPool = Executors.newCachedThreadPool();
...
public class T3 implements Callable<Double> {
    public Double call() throws Exception {
        // Submit the two tasks t3 depends on; the pool runs them in parallel.
        Future<Double> t2 = threadPool.submit(new T2());
        Future<Double> t1 = threadPool.submit(new T1());
        // get() blocks until each dependency has finished.
        return t2.get() + t1.get();
    }
}
Then the final task would be:
Future<Double> t3 = threadPool.submit(new T3());
// this throws some exceptions that need to be caught
double result = t3.get();
threadPool.shutdown();
Then the thread pool would just take care of the results, doing as much parallelization as it can. Note, though, that if the output of the T1 task were used in multiple places, this would not work.
If this is another language, maybe a similar pattern can be used depending on the thread libraries available.
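In Python, for instance, the same fork/join shape could be written with concurrent.futures (a sketch covering only the four equations shown; the real code would have 113 of them):

from concurrent.futures import ThreadPoolExecutor

def compute(L1, L2, L3):
    with ThreadPoolExecutor() as pool:
        f1 = pool.submit(lambda: L1)      # t1 = L1
        f2 = pool.submit(lambda: L2 + 5)  # t2 = L2 + 5
        f4 = pool.submit(lambda: L3)      # t4 = L3
        t3 = f1.result() + f2.result()    # t3 = t2 + t1 (blocks on both)
        return t3 + f4.result()           # t113 = t3 + t4

print(compute(1, 2, 3))  # 1 + (2 + 5) + 3 = 11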
If all the assignments are as simple as the ones you show, a reasonable compiler will reduce the series just fine. For the parts you show,
return L1 + L2 + L3 + 5; should be all the work it ends up doing.
Perhaps this could be done in two threads (on two CPUs) like:
T1: L1 + L2
T2: L3 + 5
Parent thread: Add the two results.
But with only 113 additions -- if that's what they are -- and given that modern computers are very good at adding, it probably won't be "faster".
Your simple example would automatically be multithreaded (with the solution path optimised) by Excel's multi-threaded calculation engine.
But you don't give enough specifics to tell whether this would be a sensible approach for your real-world application.
