JackRabbit - Removal of duplicate rows

I have asked this question on jackrabbit-users list but I didn't get an answer.
JCR-SQL2 doesn't provide SELECT DISTINCT (or anything similar, AFAIK), and neither does SQL nor XPath in JCR 1.0... How are people getting around this? What's the best way of removing duplicate rows?
I read that someone was iterating over the results and putting them in a Set. In my case, because of the possible huge number of results, that approach may end up being too costly.
Does anyone here have a suggestion?

None of the query languages defined in JCR 1.0 (i.e., JSR-170) or JCR 2.0 (i.e., JSR-283) has a notion of SELECT DISTINCT.
The only way to do this is to process the results manually, throwing out any rows (or nodes) you've already seen. Using a set of paths or Node objects would work. This isn't too difficult, but it's unfortunately harder than it should be and, as you mention, can be expensive if there are a lot of rows and/or duplicates.
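A minimal sketch of that dedup-by-set pattern (in Python for brevity - the real JCR API is Java, and the row shape and key names here are hypothetical stand-ins for query rows keyed by node path):

```python
def unique_rows(rows, key):
    """Yield each row at most once, keyed by e.g. its node path.

    Memory grows with the number of *distinct* keys seen, which is
    exactly the cost the question is worried about for huge results.
    """
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row

# e.g. deduplicate query results by node path
rows = [{"path": "/a"}, {"path": "/b"}, {"path": "/a"}]
print(list(unique_rows(rows, key=lambda r: r["path"])))
```

Because it is a generator, it streams: you never hold all rows in memory, only the set of keys already emitted.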
This is why ModeShape provides full support for JCR-SQL2 queries but also allows the use of SELECT DISTINCT. In fact, ModeShape supports a number of other features, such as:
non-correlated subqueries in the WHERE clause
LIMIT n and OFFSET m
UNION, INTERSECT and EXCEPT
FULL OUTER JOIN and CROSS JOIN
BETWEEN criteria
set criteria, using IN and NOT IN
DEPTH and PATH dynamic operands
and a few others. For details, see the documentation.
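For anyone unsure what DISTINCT buys you here, a quick standard-SQL illustration (SQLite via Python, with a throwaway table; any engine supporting DISTINCT behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (name TEXT);
INSERT INTO t VALUES ('a'), ('b'), ('a'), ('a');
""")

# Without DISTINCT every matching row comes back, duplicates included.
all_rows = conn.execute("SELECT name FROM t").fetchall()
# With DISTINCT the engine deduplicates server-side.
distinct_rows = conn.execute("SELECT DISTINCT name FROM t").fetchall()

print(len(all_rows), len(distinct_rows))  # 4 2
```

This is the work that, absent DISTINCT in JCR-SQL2, you end up doing yourself on the client.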

Related

How can I make T-SQL perform OPTION (FORCE ORDER) when defining views?

My DBA tells me that I should always use OPTION (FORCE ORDER) in my SQL statements when accessing a particular set of views. I understand this is to prevent the server vetoing the order of his joins.
Fair enough - it's worthwhile keeping the DBA happy, and I am happy to comply.
However, I would like to write a couple of views in my own schema, but this isn't supported apparently.
How, then, can I achieve the same when writing my views, i.e., have OPTION (FORCE ORDER) enforced?
Thanks
Fred
Blindly appending OPTION (FORCE ORDER) onto all queries that reference a particular view is extremely poor blanket advice.
OPTION (FORCE ORDER) is a query hint, and these are not valid inside a view - you would need to put it at the outer level of every query referencing your own views.
It is valid to use join hints inside views, though, and per the documentation:
If a join hint is specified for any two tables, the query optimizer
automatically enforces the join order for all joined tables in the
query, based on the position of the ON keywords.
So
SELECT v1.Foo,
       v2.Bar
FROM v1
     INNER HASH JOIN v2
     ON v1.x = v2.x;
This would enforce the join order inside v1 and v2 (as well as enforcing the join order and algorithm between them).
But I would not recommend this. These types of hints should only be used in an extremely targeted manner, as a last resort, after not being able to get a satisfactory plan any other way - not as a matter of policy without even testing alternatives.

GAE NDB Sorting a multiquery with cursors

In my GAE app I'm doing a query which has to be ordered by date. The query has to contain an IN filter, but this results in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO questions (like this one), which suggest changing the sort to __key__ (as the error also points out). The problem, however, is that the query then becomes useless for its purpose: it needs to be sorted by date. What are some suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned by one or more of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
Related, you should read into underlying caveats/limitations on cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is a static list rather than one determined at runtime, a workaround is to compute this as an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use an IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition.
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key works because we can alternate between different single queries retrieving the next set of values and not have to worry about doing a linear scan over the entire preceding result set. This gives us a bounded data set to work with.
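A rough sketch of that precomputed-flag workaround, with plain Python dicts standing in for ndb entities (the model and property names are hypothetical):

```python
# The static set of states that counts as "open" - fixed at write time.
OPEN_STATES = {"new", "open", "assigned"}

def make_issue(title, status):
    # Compute the indexed boolean when the entity is written, so reads
    # can use one equality filter (is_open == True) plus an order-by-date
    # instead of IN('new', 'open', 'assigned'), which forces a multi-query.
    return {"title": title, "status": status, "is_open": status in OPEN_STATES}

issues = [
    make_issue("crash on start", "new"),
    make_issue("typo in docs", "closed"),
    make_issue("slow query", "assigned"),
]

# Stand-in for: Issue.query(Issue.is_open == True).order(Issue.date)
open_issues = [i for i in issues if i["is_open"]]
print([i["title"] for i in open_issues])  # ['crash on start', 'slow query']
```

The trade-off: if the set of "open" states ever changes, you must re-write (or migrate) existing entities to refresh the flag.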

Using For XML + STUFF to achieve a STRAGG-like functionality multiple times for the same selection

I've seen people combine FOR XML and STUFF statements to achieve STRAGG-like aggregation in SQL Server.
I have a problem with this approach: given that I have a couple of fields I want to aggregate into strings in the outermost query, I end up repeating the same select statement over and over, once for each new aggregation.
As far as I understand, however, this FOR XML + STUFF solution has to be applied at the innermost level - meaning I can't add an inline view with the select statement I will be using in the aggregations and apply DISTINCT to it, because by then I will already have joined the results with each of the distinct values.
In short, building on the example taken from the site cited above, here's what I want to do:
http://www.sqlfiddle.com/#!3/84199/2/0
Is there any better solution when you want to do this for a number of aggregations, so as to avoid such redundant performance penalties?

Which join syntax is better?

So we are migrating from Informix to SQL Server, and I have noticed that in Informix the queries are written in this manner:
select [col1], [col2], [col3], [col4], [col5]
from tableA, tableB
where tableA.[col1] = tableB.[gustavs_custom_chrome_id]
Whereas all the queries I write in SQL Server are written as:
select [col1], [col2], [col3], [col4], [col5]
from tableA
inner join tableB on tableA.[col1] = tableB.[gustavs_custom_chrome_id]
Now, my first thought was: that first query is bad. It probably creates a huge record set and then whittles it down to the actual record set using the WHERE clause. Therefore, it's bad for performance. And it's non-ANSI. So it's doubly bad.
However, after some googling, it seems that the two are, in theory, pretty much the same. And both are ANSI-compliant.
So my questions are:
Do both queries perform the same? I.e., do they run just as fast and always give the same answer?
Are both really ANSI-compliant?
Are there any outstanding reasons why I should push for one style over another? Or should I just leave good enough alone?
Note: These are just examples of the queries. I've seen some queries (of the first kind) join up to 5 tables at a time.
Well, "better" is subjective. There is some style here. But I'll address your questions directly.
Both perform the same.
Both are ANSI-compliant.
The problem with the first example is that
it is very easy to inadvertently derive the cross product (since it is easier to leave out join criteria)
it also becomes difficult to debug the join criteria as you add more and more tables to the join
since the old-style outer join (*=) syntax has been deprecated (it has long been documented to return incorrect results), when you need to introduce outer joins, you need to mix new style and old style joins ... why promote inconsistency?
while it's not exactly the authority on best practices, Microsoft recommends explicit INNER/OUTER JOIN syntax
with the latter method:
you are using consistent join syntax regardless of inner / outer
it is tougher (not impossible) to accidentally derive the cross product
isolating the join criteria from the filter criteria can make debugging easier
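The "both perform the same / give the same answer" claim is easy to check for inner joins. A small demonstration with SQLite via Python (the tables and values are made up; I've used a plausible tableB.id in place of the question's truncated column name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (col1 INTEGER, name TEXT);
CREATE TABLE tableB (id INTEGER, descr TEXT);
INSERT INTO tableA VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO tableB VALUES (1, 'x'), (3, 'y'), (4, 'z');
""")

# Old-style comma join with the join predicate in WHERE...
old_style = conn.execute("""
SELECT tableA.col1, tableB.descr
FROM   tableA, tableB
WHERE  tableA.col1 = tableB.id
""").fetchall()

# ...and the explicit ANSI INNER JOIN spelling of the same query.
new_style = conn.execute("""
SELECT tableA.col1, tableB.descr
FROM   tableA
       INNER JOIN tableB ON tableA.col1 = tableB.id
""").fetchall()

print(sorted(old_style) == sorted(new_style))  # True
```

Same rows either way; the arguments above are about readability and safety (forgetting the WHERE predicate in the first form silently gives you the cross product), not about results or speed.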
I wrote the post Kevin pointed to.

High-performance wiki-schema

I'm using MS SQL Server 2005.
What's the best schema for a wiki-like system, where users edit/revise a submission and the system keeps track of these submissions?
Let's say we're doing a simple wiki-based system. It will keep track of each revision, plus the views and latest activity of each revision. On other screens, the system will list "Latest Submissions" and "Most Viewed", plus search by title.
My current schema (and I know it's bad) uses a single table. When I need to see the "Latest Submissions" I sort by "LatestActivity", group by "DocumentTitle", then take the first N records. I assume a lot of grouping (especially grouping on nvarchar) is bad news. For listing the most viewed I do the same: sort by views, group by name, take the first N records. Most of the time, I will also be doing a "WHERE DocumentName LIKE '%QUERY-HERE%'".
My current schema is "Version 1", see below:
(Schema diagram: http://www.anaimi.com/junk/schemaquestion.png)
I assume this is not acceptable, so I'm trying to come up with another, more performant design. How does Version 2 sound to you? In Version 2 I get the advantage of grouping on WikiHeadId, which is a number - I'm assuming grouping over a number is better than over an nvarchar.
Or there's the extreme case, Version 3, where I do no grouping, but which has several disadvantages, such as duplicating values, maintaining these values in code, etc.
Or is there a better/known schema for such systems?
Thanks.
(Moved from ServerFault - I think it's a development question more than an IT question.)
Firstly (and out of curiosity) how does the current schema indicate what the current version is? Do you just have multiple 'WikiDocument' entries with the same DocumentTitle?
I'm also not clear on why you need a 'LastActivity' at a Version level. I don't see how 'LastActivity' fits with the concept of a 'Version' -- in most wikis, the 'versions' are write-once: if you modify a version, then you're creating a new version, so the concept of a last-updated type value on the version is meaningless -- it's really just 'datecreated'.
Really, the 'natural' schema for your design is #2. Personally, I'm a bit of a fan of the old DB axiom 'normalize until it hurts, then denormalize until it works'. #2 is a cleaner, nicer design (simple, with no duplication), and if you have no urgent reason to denormalize to version 3, I wouldn't bother.
Ultimately, it comes down to this: are you worrying about a 'more performant' design because you've observed performance problems, or because you hypothetically might have some? There's no real reason #2 shouldn't perform well. Grouping isn't necessarily bad news in SQL Server - in fact, if there's an appropriate covering index for the query, it can perform extremely well, because it can just navigate to a particular level in the index to find the grouped values, then use the remaining columns of the index for the MIN/MAX/whatever. Grouping by NVARCHAR isn't particularly bad either - if it's not observed to be a problem, don't fret about it, though (non-binary) collations can make it a little tricky - but in Version 2, where you need to GROUP BY, you can do it by WikiHeadId, right?
One thing that may make life easier, if you do a lot of operations on the current version (as I assume you would), is to add an FK back from the head table to the body table, indicating the current version. If you want to view the current versions with the highest number of hits, with #2 as it stands now it might be:
SELECT TOP ...
FROM WikiHead
     INNER JOIN
     (SELECT WikiHeadId, MAX(WikiBodyVersion) /* or LastUpdated? */ AS Latest
      FROM WikiBody GROUP BY WikiHeadId) AS LatestVersions
     ON WikiHead.WikiHeadId = LatestVersions.WikiHeadId
     INNER JOIN WikiBody
     ON (LatestVersions.WikiHeadId = WikiBody.WikiHeadId)
     AND (WikiBody.WikiBodyVersion = LatestVersions.Latest)
ORDER BY
     Views DESC
or alternatively
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiBody.WikiBodyVersion =
    (SELECT MAX(WikiBodyVersion) FROM WikiBody WHERE WikiBody.WikiHeadId = WikiHead.WikiHeadId))
...
both of which are icky. If the WikiHead keeps a pointer to the current version, it's just
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiHead.Latest = WikiBody.WikiBodyVersion)
...
or whatever, which may be a useful denormalization just because it makes your life easier, not for performance.
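To see the difference the pointer makes, here's a runnable sketch of the denormalized version using SQLite via Python (column names mirror the answer; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WikiHead (WikiHeadId INTEGER PRIMARY KEY,
                       Title TEXT, Views INTEGER, Latest INTEGER);
CREATE TABLE WikiBody (WikiHeadId INTEGER, WikiBodyVersion INTEGER, Body TEXT);
INSERT INTO WikiHead VALUES (1, 'Home',  90, 2), (2, 'About', 10, 1);
INSERT INTO WikiBody VALUES (1, 1, 'v1'), (1, 2, 'v2'), (2, 1, 'v1');
""")

# With WikiHead.Latest pointing at the current version, "current versions
# ordered by views" is one plain join - no MAX() subquery or derived table.
rows = conn.execute("""
SELECT WikiHead.Title, WikiBody.Body
FROM   WikiHead
       INNER JOIN WikiBody
       ON  WikiHead.WikiHeadId = WikiBody.WikiHeadId
       AND WikiHead.Latest     = WikiBody.WikiBodyVersion
ORDER  BY WikiHead.Views DESC
""").fetchall()
print(rows)  # [('Home', 'v2'), ('About', 'v1')]
```

The cost, as with any denormalization, is keeping the Latest pointer in sync whenever a new version is written.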
Check this out.
It's the database schema for MediaWiki, the software Wikipedia runs on.
It looks pretty well documented and would be an interesting read for you.
(From this page.)
