Why is skipping nodes from a query very slow in Jackrabbit?

When I perform a simple query like this:
select * from nodeType
Calling skip(N) on the range iterator is slow.
What am I doing wrong?

Found out why (self-answering): the query was using document order by default.
Adding a sensible "order by" to the query takes it from minutes for 10,000 nodes to under a second.
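For illustration, a minimal sketch of the fix through the JCR API (the node type, the ordering property, and the JCR-SQL2 syntax are assumptions, not from the question):

import javax.jcr.NodeIterator;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

// "session" is an open javax.jcr.Session. With an explicit "order by",
// Jackrabbit can resolve the result order from its index instead of
// materializing document order, which makes skip() cheap.
QueryManager qm = session.getWorkspace().getQueryManager();
Query query = qm.createQuery(
    "SELECT * FROM [my:nodeType] ORDER BY [jcr:created]", Query.JCR_SQL2);
NodeIterator nodes = query.execute().getNodes();
nodes.skip(10000); // jumps to the offset without visiting every node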

Sadly, the skip() method in Jackrabbit's RangeIterator implementation (RangeIterator itself is just an interface) traverses the nodes linearly. You might as well just write:
int counter = 0;
while (counter < offset && iter.hasNext()) {
    iter.next();
    counter++;
}

Why does CASE..WHEN get a table scan? How to work around it

When I use CASE .. WHEN .. END I get an index scan, which is less efficient than an index seek.
I have complex business rules that require the CASE; is there any workaround?
Query A:
select * from [dbo].[Mobile]
where ((
    CASE
        WHEN [MobileNumber] = (LTRIM(RTRIM('987654321'))) THEN 1
    END
) = 1)
This query gets an index scan and 199 logical reads.
Query B:
select * from [dbo].[Mobile]
where ([MobileNumber] = (LTRIM(RTRIM('987654321'))))
This query gets an index seek and 122 logical reads.
For the table
CREATE TABLE #T(X CHAR(1) PRIMARY KEY);
And the query
SELECT *
FROM #T
WHERE CASE WHEN X = 'A' THEN 1 ELSE 0 END = 1;
It is apparent without that much thought that the only circumstances in which the CASE expression evaluates to 1 are when X = 'A' and that the query has the same semantics as
SELECT *
FROM #T
WHERE X = 'A';
However the first query will get a scan and the second one a seek.
The SQL Server optimiser will try all sorts of relational transformations on queries, but it will not even attempt to rearrange an expression such as CASE WHEN X = 'A' THEN 1 ELSE 0 END = 1 into the equivalent X = 'A' so that it can perform an index seek on it.
It is up to the query writer to write their queries in such a way that they are sargable.
There is no workaround to get an index seek on column MobileNumber with your existing CASE predicate. You just need to express the condition differently (as in your example B).
Potentially you could create a computed column with the CASE expression and index that - and you could then see an index seek on the new column. However this is unlikely to be useful to you as I assume in reality the mobile number 987654321 is dynamic and not something to be hardcoded into a column used by an index.
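Since the number is dynamic in practice, the usual pattern is to keep the predicate sargable and bind the value as a parameter from application code. A hedged sketch in JDBC (the connection, the method wrapper, and the input handling are assumptions, not from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

static void findMobile(Connection conn, String userInput) throws Exception {
    // The column stays bare on the left side of '=', so an index on
    // MobileNumber remains seekable; the input is trimmed client-side,
    // mirroring what constant folding does to LTRIM(RTRIM('...')).
    String sql = "select * from [dbo].[Mobile] where [MobileNumber] = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, userInput.trim());
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process the matching row(s)
            }
        }
    }
}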
After cleaning up and fixing your code, you have a WHERE whose boolean expression is built around a CASE.
As mentioned by @MartinSmith, there is simply no way SQL Server will re-arrange this. It does not do the kind of dynamic slicing that would allow it to re-arrange the first query into the second version.
select *
from [dbo].[Mobile]
where
    CASE
        WHEN [MobileNumber] = LTRIM(RTRIM('987654321'))
        THEN 1
    END = 1
You may ask: the second version also has an expression in it, so why does it not also get a scan?
select *
from [dbo].[Mobile]
where [MobileNumber] = LTRIM(RTRIM('987654321'))
The reason is that SQL Server can recognize that LTRIM(RTRIM('987654321')) is a deterministic constant expression: it does not change depending on runtime settings, nor on the result of in-row calculations.
Therefore, it can optimize by calculating it at compile time. Under the hood the query becomes the following, which can be used against an index on MobileNumber.
select *
from [dbo].[Mobile]
where [MobileNumber] = '987654321'

Subtotals with ArrayFormula()

I've created a simple table and am trying to split the data with subtotals.
Column A indicates the subtotal lines.
Column B contains the row number of the previous subtotal. This is just an extra field to simplify the formulas.
Column C contains some amounts.
Column D contains subtotals of the amounts between the previous and the current subtotal line.
The subtotal formula looks like this:
=ArrayFormula(
IF($A2:$A; MMULT(
($B2:$B < TRANSPOSE(ROW($A2:$A))) * (TRANSPOSE(ROW($A2:$A)) < ROW($A2:$A));
IF(ISNUMBER(C2:C); C2:C; 0)
); )
)
The problem is that the formula is extremely slow. Is there a way to make it faster?
Example file:
https://docs.google.com/spreadsheets/d/1HPGeLZfar2s6pIQMVdQ8mIPzNdw2ESqKAwZfo4IicnA/edit?usp=sharing
You could also try this much simpler formula:
=ArrayFormula(
if(B3:B="","",
sumif(row(B3:B),"<="&row(B3:B),C3:C)-
sumif(row(B3:B),"<="&B3:B,C3:C)
)
)
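The trick here is a running total: the subtotal for a block is the cumulative sum up to the current row minus the cumulative sum up to the previous subtotal row, which is what the two SUMIF calls compute. The same idea in procedural terms (a hypothetical sketch; the values are illustrative):

// prefix[i] holds the running sum of amounts[0..i]; the subtotal of the block
// between subtotal rows p and r is then prefix[r] - prefix[p], found in O(1).
double[] amounts = {10, 20, 5, 7};            // column C
double[] prefix = new double[amounts.length];
double running = 0;
for (int i = 0; i < amounts.length; i++) {
    running += amounts[i];
    prefix[i] = running;
}
int prev = 1, curr = 3;                        // previous and current subtotal rows
double subtotal = prefix[curr] - prefix[prev]; // 5 + 7 = 12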
Yes, there is.
The easiest fix is to remove the blank rows below the data range.
Another, which might require maintenance: replace open references like $A2:$A with closed references, i.e. $A2:$A100.
A third, which increases the formula's complexity and volatility but is easier to maintain when new data rows are added: put each open reference inside ARRAY_CONSTRAIN.
use the "necessary" range:
=ARRAYFORMULA(IFERROR(IF(A2:A; MMULT((
INDIRECT("B2:B"&MAX(IF(B2:B="";; ROW(B2:B)))) < TRANSPOSE(ROW(
INDIRECT("A2:A"&MAX(IF(A2:A=TRUE; ROW(A2:A); )))))) * (TRANSPOSE(ROW(
INDIRECT("A2:A"&MAX(IF(A2:A=TRUE; ROW(A2:A); ))))) < ROW(
INDIRECT("A2:A"&MAX(IF(A2:A=TRUE; ROW(A2:A); ))))); IF(ISNUMBER(
INDIRECT("C2:C"&MAX(IF(C2:C="";; ROW(C2:C)+1))));
INDIRECT("C2:C"&MAX(IF(C2:C="";; ROW(C2:C)+1))); 0)); )))
this should be way faster...

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split a record reader is created, which starts reading records from the input split.
If there are n records in an input split, the map method in the mapper is called n times, reading one key-value pair through the record reader each time.
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfiguration and set the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know:
How are the InputSplits created in the above case (database)?
What does the input split creation depend on: the number of records which my SQL query generates, or the total number of records in the table (database)?
How many DBRecordReaders are created in the above case (database)?
How are the InputSplits created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
    DBInputSplit split;
    if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
    else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
    splits.add(split);
}
That is the how; to understand what it depends on, let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count = SELECT COUNT(*) FROM tableName and divides it by chunks = mapred.map.tasks, or 1 if that is not defined in the configuration.
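A quick worked example of that arithmetic (the numbers are assumptions, not from the question):

// count = 100 rows, mapred.map.tasks = 7
long count = 100;
int chunks = 7;
long chunkSize = count / chunks; // 14
// splits cover [0,14), [14,28), ..., [70,84); the last split is built as
// (6 * 14, count), i.e. [84,100), so it absorbs the remainder.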
Then, finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance a MySQLDBRecordReader for a MySQL database.
For more info check out the source
It appears @Engineiro explained it well by walking through the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task (in the case where the child JVM is not reused for further map tasks). In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So each map task's record reader has the meta information to construct the query to get a subset of data from the table. Each record reader executes a pagination type of SQL similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format: the number of record readers equals the number of map tasks.
This post covers all of this: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
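For completeness, a hedged sketch of how DBInputFormat is typically wired up with the new mapreduce API (the class names, table, and connection details are assumptions; the record class is trimmed to one column):

import java.io.*;
import java.sql.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.*;

public class EmpRecord implements Writable, DBWritable {
    int id;
    public void readFields(ResultSet rs) throws SQLException { id = rs.getInt("id"); }
    public void write(PreparedStatement ps) throws SQLException { ps.setInt(1, id); }
    public void readFields(DataInput in) throws IOException { id = in.readInt(); }
    public void write(DataOutput out) throws IOException { out.writeInt(id); }
}

// In the job driver: 7 map tasks => 7 input splits => 7 DBRecordReaders.
Configuration conf = new Configuration();
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://host/db", "user", "password");
conf.setInt("mapred.map.tasks", 7);
Job job = Job.getInstance(conf);
job.setInputFormatClass(DBInputFormat.class);
DBInputFormat.setInput(job, EmpRecord.class, "emp", null /*conditions*/, "id", "id");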

How to get data from an sqlite3 prepared statement with a 'union all' clause

I have three SELECT statements connected with UNION ALL. I set up the compound statement such that sqlite3_prepare_v2() would prepare the resulting rows for me. If I want to make use of the resulting data, can I use a for() loop?
int i;
for (i = 0; sqlite3_step(res) == SQLITE_ROW; i++) {
    if (i == 0)
        data1 = sqlite3_column_int(res, 0);
    else if (i == 1)
        data2 = sqlite3_column_int(res, 0);
    else
        data3 = sqlite3_column_int(res, 0);
}
Is this supposed to work? I've tried it but it gives me garbage data. Is there an alternative way to implement this?
This works; the garbage data came from somewhere else entirely. You can use this to shorten the amount of code you type, but note that there is a limit on how many SELECTs you can chain in one compound statement (SQLITE_MAX_COMPOUND_SELECT, 500 by default).
Additional info: you can use UNION to select from different tables using the same WHERE arguments on each. Say you let the user input two data parameters and there are four tables keyed on those parameters, each differing only by some additional value, and you want the value of that additional column. If the user types something distinct to the table your desired output belongs to, SELECT statements compounded with UNION pick out the desired output much more easily than explicit comparisons and lengthy iterations.
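The same stepping pattern, sketched in Java through JDBC for illustration (the SQLite JDBC driver, the table names, and the parameters are assumptions, not from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

static void lookup(String p1, String p2) throws Exception {
    // Hypothetical schema: tables t1 and t2 share columns (p1, p2, extra);
    // one compound statement returns the "extra" value of whichever rows match.
    String sql = "SELECT extra FROM t1 WHERE p1 = ? AND p2 = ? "
               + "UNION ALL SELECT extra FROM t2 WHERE p1 = ? AND p2 = ?";
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:test.db");
         PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, p1); ps.setString(2, p2); // the same user inputs are
        ps.setString(3, p1); ps.setString(4, p2); // bound to each UNION branch
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {                              // one iteration per row,
                System.out.println(rs.getString("extra"));   // like sqlite3_step()
            }
        }
    }
}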

What is the best way to create all pairs of two objects from a huge list of objects (say, a million strings)?

Let's say I have a list of 10 strings (let's just call them "str1", "str2", ... "str10"). I want to be able to generate all pairs from this:
("str1", "str2")
("str1", "str3")
.
.
.
and so on, up to ("str9", "str10"). That is easy with two loops. How do I do the same thing with a million strings? Is there any way to put them in a table and run a query?
Put them in a table, and use this join:
Select T1.StringValue, T2.StringValue
From StringsTable T1
INNER JOIN StringsTable T2
    ON T1.StringValue <> T2.StringValue
Now, if you run a million strings in some sort of Query Analyzer / GUI, you're setting yourself up for some hurt - that's a huge load of data returned.
In C# (Java would be similar; C++ only a bit different):
for (int i = 0; i < ArrayOfString.Length - 1; ++i)
    for (int j = i + 1; j < ArrayOfString.Length; ++j) // j advances past i, yielding each unordered pair once
        ListOfPairs.Add(new Pair(ArrayOfString[i], ArrayOfString[j]));
If you want to create all those pairs you will get almost one trillion pairs.
To store them somewhere you need approximately 20 TB of data, based on 20 bytes/string-pair.
If you want to make all those pairs you should consider a generative approach that generates the pairs on the fly instead of storing them somewhere.
In C# it would look something like this:
private IEnumerable<Tuple<string, string>> GetPairs(IEnumerable<string> strings)
{
    foreach (string outer in strings)
    {
        foreach (string inner in strings)
        {
            if (outer != inner)
            {
                yield return Tuple.Create(outer, inner);
            }
        }
    }
}
The call
string[] strings = new string[] { "str1", "str2", "str3" };
foreach (var stringPair in GetPairs(strings))
{
    Console.WriteLine("({0},{1})", stringPair.Item1, stringPair.Item2);
}
Generates the expected result (if you care about the order of the items in the pair).
(str1,str2)
(str1,str3)
(str2,str1)
(str2,str3)
(str3,str1)
(str3,str2)
Expect it to take a while with 1M strings.
To do this in a table (I presume you mean SQL Server or similar):
create table T
(
    Value nvarchar(10)
)

insert into T select '1'
insert into T select '2'
insert into T select '3'
insert into T select '4'
insert into T select '5'

select
    A.Value, B.Value
from T A
cross join T B
where A.Value <> B.Value -- use A.Value < B.Value to emit each unordered pair once
order by A.Value
