Counting in facet results - solr

By counting in facet results I mean solving the following problem:
I have 7 documents:
A1 B1 C1
A2 B1 C1
A3 B2 C1
A4 B2 C2
A5 B3 C2
A6 B3 C2
A7 B3 C2
If I facet on field B, I get the result: B1=2, B2=2, B3=3.
A1 B1 C1
A2 B1 C1 2 - faceting by B
--------------====
A3 B2 C1
A4 B2 C2 2 - faceting by B
--------------====
A5 B3 C2
A6 B3 C2
A7 B3 C2 3 - faceting by B
--------------====
I want to get additional information, something like a count within the facet results, by field C. How can I query to get a result similar to the following?
A1 B1 C1
A2 B1 C1 2, 1 - faceting by B, count of C in facet results
--------------=======
A3 B2 C1
A4 B2 C2 2, 2 - faceting by B, count of C in facet results
--------------=======
A5 B3 C2
A6 B3 C2
A7 B3 C2 3, 1 - faceting by B, count of C in facet results
--------------=======
Thanks

What you need is facet pivots (pivot faceting).
This gives you the results and counts for a hierarchy of fields.
It is available in the Solr 4.0 trunk build, so on an earlier version you may need to apply the patch.
References -
http://wiki.apache.org/solr/HierarchicalFaceting
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
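To make the target numbers concrete, here is a minimal Python sketch (plain Python, not Solr code) that emulates what a pivot on B,C would report for the seven documents in the question: for each value of B, the nested counts per value of C. The number of nested entries is the "count of C" the question asks for. The request parameters mentioned in the comment are the pivot parameters described on the wiki page above; the helper name `pivot_counts` is made up for illustration.

```python
from collections import Counter, defaultdict

# The seven documents from the question as (A, B, C) triples.
docs = [
    ("A1", "B1", "C1"), ("A2", "B1", "C1"),
    ("A3", "B2", "C1"), ("A4", "B2", "C2"),
    ("A5", "B3", "C2"), ("A6", "B3", "C2"), ("A7", "B3", "C2"),
]

def pivot_counts(documents):
    """Emulate a request with facet=true&facet.pivot=B,C:
    per-B groups with nested per-C document counts."""
    nested = defaultdict(Counter)
    for _a, b, c in documents:
        nested[b][c] += 1
    return {b: dict(cs) for b, cs in nested.items()}

pivot = pivot_counts(docs)
for b, sub in sorted(pivot.items()):
    # total documents for this B value, then number of distinct C values
    print(b, sum(sub.values()), len(sub), sub)
```

For B3 this prints a total of 3 and a single distinct C value, which is the pair of numbers wanted next to each group.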

Related

Django ORM to get a nested tree of models from DB with minimal SQL queries

I have some nested models with foreign key relationships going 4 levels deep.
A <- B <- C <- D
A has a set of B models, which each have a set of C models, which each have a set of D models.
I'm iterating over each model (4 layers of looping from A down to D). This is producing lots of DB hits.
I don't need to do any filtering at the DB fetch level, as I need all the data from all the DB tables, so ideally I'd like to get all the data with ONE SQL query that hits the DB (if that's possible) and then somehow have the data organized/filtered into their correct sets for each model, i.e. it's all pre-fetched and structured ready for use (e.g. in a web dashboard).
There seem to be a lot of Django-related prefetch helpers and packages, but none of them seem to work the way I expect, e.g. django-auto-prefetch (which seems ideal).
Is this a common use case (I thought it would be)?
How can I construct the ORM query to get all the data in one hit and then just use the bits I need?
NOTE: target system is raspberry pi class device (1GHz Arm processor) with eMMC storage (similar to SD card), and using SQLite as the DB backend.
NOTE: I'm also using this with django-polymorphic, which may or may not make a difference?
Thanks, Brendan.
Using one query would result in a huge amount of bandwidth, since the values for the columns of the A model will be repeated per B model per C model per D model.
Indeed, the response would look like:
a_col1 | a_col2 | b_col1 | b_col2 | c_col1 | d_col1
A1 A1 B1 B1 C1 D1
A1 A1 B1 B1 C1 D2
A1 A1 B1 B1 C1 D3
A1 A1 B1 B1 C2 D4
A1 A1 B1 B1 C2 D5
A1 A1 B1 B1 C2 D6
A1 A1 B2 B2 C3 D7
A1 A1 B2 B2 C3 D8
A1 A1 B2 B2 C3 D9
A1 A1 B2 B2 C4 D10
A1 A1 B2 B2 C4 D11
A1 A1 B2 B2 C4 D12
A2 A2 B3 B3 C5 D13
A2 A2 B3 B3 C5 D14
A2 A2 B3 B3 C5 D15
A2 A2 B3 B3 C5 D16
We thus would repeat the values for the a_columns, b_columns, etc. a large number of times resulting in a large amount of bandwidth going from the database to the Python/Django layer. This would not only result in large amounts of data being transferred, but also large amounts of memory being used by Django to deserialize this response.
.prefetch_related therefore makes at most one (or two, depending on the type of relation) extra queries per level, so four to seven queries in total, which keeps the bandwidth to a minimum.
You thus can fetch all objects in memory with:
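# One query for A plus one prefetch query per nested level.
for a in A.objects.prefetch_related('b_set', 'b_set__c_set', 'b_set__c_set__d_set'):
    print(a)
    for b in a.b_set.all():  # served from the prefetch cache, no extra query
        print(b)
        for c in b.c_set.all():
            print(c)
            for d in c.d_set.all():
                print(d)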
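The bandwidth argument can be checked with a little arithmetic. The sketch below is plain Python, not Django: it rebuilds the example result set from a small tree and compares how many cells a single JOIN transfers with how many separate per-table queries would transfer. The `tree` literal and the decision to ignore foreign-key columns are assumptions made only to keep the illustration short.

```python
# Hypothetical tree mirroring the example result set: A -> B -> C -> [D, ...].
tree = {
    "A1": {
        "B1": {"C1": ["D1", "D2", "D3"], "C2": ["D4", "D5", "D6"]},
        "B2": {"C3": ["D7", "D8", "D9"], "C4": ["D10", "D11", "D12"]},
    },
    "A2": {"B3": {"C5": ["D13", "D14", "D15", "D16"]}},
}

# Single JOIN: one row per D, repeating the A, B and C columns every time.
join_rows = [
    (a, a, b, b, c, d)
    for a, bs in tree.items()
    for b, cs in bs.items()
    for c, ds in cs.items()
    for d in ds
]
join_cells = sum(len(row) for row in join_rows)

# One query per table: every distinct row is transferred exactly once.
# Column counts follow the example header (2 for A, 2 for B, 1 for C, 1 for D);
# foreign-key columns are ignored to keep the comparison simple.
n_a = len(tree)
n_b = sum(len(bs) for bs in tree.values())
n_c = sum(len(cs) for bs in tree.values() for cs in bs.values())
n_d = len(join_rows)
separate_cells = 2 * n_a + 2 * n_b + 1 * n_c + 1 * n_d

print(join_cells, separate_cells)
```

For this toy data the JOIN moves 96 cells against 31 for the per-table queries, and the gap widens as the A and B models gain columns or the tree gets deeper.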

Relational Algebra expression for the given query

I know this question is already asked here.
But it's not answered properly there.
Q) Consider the following relational database schemes:
COURSES(Cno, name)
PRE-REQ(Cno, pre-Cno)
COMPLETED(student_no, Cno)
COURSES gives the number and name of all the available courses.
PRE-REQ gives the information about which courses are pre-requisites for a given course.
COMPLETED indicates what courses have been completed by students
Express the following using relational algebra:
List all the courses for which a student with student_no 2310 has
completed all the pre-requisites.
Answer given here:
S ← π Cno (σ student_no=2310 (COMPLETED))
RESULT ← ((ρ (Course,Cno) (PRE−REQ))÷S)
But I found a flaw in it.
Suppose
PRE-REQ              COMPLETED
Cno   Pre-Cno        Student_no   Cno
C1    C3             2310         C3
C2    C4             2310         C4
The desired result should be C1, C2, but the query above will return an empty relation, as C1 doesn't have C4 as its pre-requisite course and, similarly, C2 doesn't have C3 as its pre-requisite course.
S        RESULT
Cno      Course   Cno
C3
C4
Another solution, given in one of the answers here but using SQL, is:
SELECT Pre-Req.Cno
FROM Completed, Pre-Req
WHERE student_no = '2310'
GROUP BY Pre-Req.Cno
HAVING pre-Cno IN (
    SELECT C.cno
    FROM Completed AS C
    WHERE C.student_no = '2310'
);
Is there any other possible way to write it as a relational algebra expression?
Here is a possible solution (I will use a simpler notation):
COURSES_OF_2310 = π c←Cno (σ student_no=2310 (COMPLETED))
PARTIALLY_SATISFIED = PRE_REQ ⨝ PRE_REQ.preCno=c COURSES_OF_2310
NOT_SATISFIED = PRE_REQ - π Cno, preCno←c PARTIALLY_SATISFIED
FULLY_SATISFIED = π Cno PRE_REQ - π Cno (NOT_SATISFIED)
This is quite complex and probably could be simplified. However it should work now. Here is an example tested with RelaX:
COURSES(Cno)
C1
C2
C3
C4
C5
C6
PRE-REQ(Cno, pre-Cno)
C1 C3
C1 C4
C2 C3
C2 C4
C5 C3
C5 C6
C6 C4
COMPLETED(student_no, Cno)
2310 C3
2310 C4
PARTIALLY_SATISFIED(Cno, preCno, c)
C1 C3 C3
C1 C4 C4
C2 C3 C3
C2 C4 C4
C5 C3 C3
C6 C4 C4
NOT_SATISFIED(Cno, preCno)
C5 C6
FULLY_SATISFIED(Cno)
C1
C2
C6
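The four steps can also be mirrored with Python sets, which makes it easy to try other inputs. This is only a sketch of the expression above, not RelaX code: relations are sets of tuples, and the function name `fully_satisfied` is made up for illustration.

```python
# Relations from the worked example, as sets of tuples.
pre_req = {
    ("C1", "C3"), ("C1", "C4"), ("C2", "C3"), ("C2", "C4"),
    ("C5", "C3"), ("C5", "C6"), ("C6", "C4"),
}
completed = {(2310, "C3"), (2310, "C4")}

def fully_satisfied(pre_req, completed, student_no):
    # COURSES_OF_2310: select the student's rows, project onto Cno.
    done = {cno for (s, cno) in completed if s == student_no}
    # PARTIALLY_SATISFIED, already projected back onto (Cno, preCno).
    partially = {(cno, pre) for (cno, pre) in pre_req if pre in done}
    # NOT_SATISFIED: prerequisite pairs the student has not covered.
    not_satisfied = pre_req - partially
    # FULLY_SATISFIED: courses with prerequisites minus those with a gap.
    return {cno for (cno, _) in pre_req} - {cno for (cno, _) in not_satisfied}

result = fully_satisfied(pre_req, completed, 2310)
print(sorted(result))
```

With the two-row PRE-REQ table from the question (C1 needs C3, C2 needs C4) the same function returns C1 and C2, which is the case where the division-based answer came back empty.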

Systems of inequations and counting constraint

I am playing with Gecode to solve a problem where you label the arcs of a multi-digraph such that the sums of the arc labels in each path from a single source to a single sink come from a multiset. So for example we might have a graph with 8 paths and want path arc sums to come from the set {5,5,5,4,3,2,1,0}. So we must have exactly 3 paths with a sum of 5 and 5 paths with a unique sum of 0..4.
You can reformulate this problem as asking if some permutation of {5,5,5,4,3,2,1,0} is in the column space of the path arc incidence matrix of the graph.
I model this multiset match with the "count" constraint. The arc sums are linear equations.
My graphs have many parallel edges that come in pairs. Using symmetry I use this to impose a partial order on the path sums. This also means that there are sets of pairs of path sums that have the same difference. So from my example if the paths are b0...b7 I have the following pair sets:
b0 - b1 = b3 - b4 = b5 - b6, b0 - b2 = b5 - b7, b0 - b3 = b1 - b4, b0 - b5 = b1 - b6 = b2 - b7
b1 - b2 = b6 - b7, b3 - b5 = b4 - b6
Including these differences into the model seems to cut down the search space by two orders of magnitude in Gecode. I am pleased with this because I think it's telling me something important about the graphs I am studying and it fits with some conjectures in the area I work in.
The partial order now tells us that only b0,b2,b3,b5 and b7 may take the value 5.
It's possible to now prove this system can not be satisfied. I am interested in techniques from constraint satisfaction etc that can be used to analyse a system of inequations (!=) along with "count". Obviously Gecode can prove this by assigning values and failing. I am interested in general techniques to both learn about constraint satisfaction, help improve the model and maybe gain some understanding of the things I am investigating.
To see the problem is not soluble we can show that each of the 6 sets of pair difference systems can not have a difference of zero. If they did they would generate duplicates of the wrong values or too many 5's.
For example b0 - b1 = b3 - b4 = b5 - b6 would have b0 = b1 which is impossible since b1 can not be 5 and that's the only value that can have duplicates.
Or b0 - b2 = b5 - b7 would mean that b0 = b2 and b5 = b7 requiring 4 5's.
So we end up with a set of inequations on the paths whose sum could be 5:
b0 != b2, b0 != b3, b0 != b5, b2 != b7, b3 != b5, b5 != b7
We can see that b0 != 5, since if it were we could get only 2 fives. Of the remaining 4 values we can then set at most two to 5, so the whole system is insoluble.
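The last step is small enough to check by brute force. The sketch below is plain Python rather than Gecode: it tries every way of choosing the three paths that take the value 5 from the five candidates and rejects any choice that would make a forbidden pair equal. The inequation set includes b3 != b5, which follows from the pair set b3 - b5 = b4 - b6 by the same zero-difference argument; the function name `valid_triples` is made up for illustration.

```python
from itertools import combinations

# Paths that may take the value 5 according to the partial order.
candidates = ["b0", "b2", "b3", "b5", "b7"]

# Pairs that must differ. b3 != b5 comes from the pair set
# b3 - b5 = b4 - b6, which cannot have a difference of zero.
must_differ = {
    frozenset(pair) for pair in [
        ("b0", "b2"), ("b0", "b3"), ("b0", "b5"),
        ("b2", "b7"), ("b5", "b7"), ("b3", "b5"),
    ]
}

def valid_triples(candidates, must_differ):
    """Ways to pick the three paths with sum 5 without equating a forbidden pair."""
    return [
        trio for trio in combinations(candidates, 3)
        if not any(frozenset(pair) in must_differ
                   for pair in combinations(trio, 2))
    ]

print(valid_triples(candidates, must_differ))
```

The list comes back empty, which matches the claim that the system is insoluble. Dropping b3 != b5 from the set leaves the single triple (b2, b3, b5), so that inequation is needed for the argument to go through.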

What is the optimum query for searching for strings that appear in many columns of 1 table?

Say I have a table with 3 columns: c1, c2, c3
C1 - C2 - C3
A2 - B2 - N2
K1 - B2 - N1
K1 - B3 - N1
L1 - A2 - C1
When a user searches for any combination of A1, A2, A3, B1, B2, ..., the system should pick the rows with the closest match (as long as a word appears in 1 column, the row is picked; the more words appear in more columns, the closer the match) and order them from closest match down.
Ex1: a user searches for "K1 A2 C1 N1", the system will show:
K1 - B2 - N1
K1 - B3 - N1
A2 - B2 - N2
L1 - A2 - C1
Ex2: a user searches for "K1 A2 C1 N2 B2", the system will show:
A2 - B2 - N2
L1 - A2 - C1
K1 - B2 - N1
K1 - B3 - N1
My solution is to split the search string into separate words and then search each of these words against the columns in the table. But I am not sure this is the optimum query, since the DB has to search in many loops.
So if you are a DB expert, what is the best query for this scenario?
select c1
      ,c2
      ,c3
      ,case when c1 in ('K1','A2','C1','N1') then 1 else 0 end
      +case when c2 in ('K1','A2','C1','N1') then 1 else 0 end
      +case when c3 in ('K1','A2','C1','N1') then 1 else 0 end as sortweight
from theTable
where c1 in ('K1','A2','C1','N1')
   or c2 in ('K1','A2','C1','N1')
   or c3 in ('K1','A2','C1','N1')
order by sortweight desc
        ,c1
        ,c2
        ,c3
Someone from another forum suggested the above query.
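To try the suggested query without typing the word list six times, here is a small sketch using Python's built-in sqlite3 module with an in-memory table. It builds the same CASE / IN query with bound parameters from an arbitrary list of search words. The choice of SQLite and the helper name `search` are assumptions for the demo; the question does not say which DB is in use.

```python
import sqlite3

def search(conn, terms):
    """Rank rows of theTable by how many of c1, c2, c3 match a search word."""
    marks = ",".join("?" for _ in terms)  # one placeholder per search word
    sql = f"""
        SELECT c1, c2, c3,
               CASE WHEN c1 IN ({marks}) THEN 1 ELSE 0 END
             + CASE WHEN c2 IN ({marks}) THEN 1 ELSE 0 END
             + CASE WHEN c3 IN ({marks}) THEN 1 ELSE 0 END AS sortweight
        FROM theTable
        WHERE c1 IN ({marks}) OR c2 IN ({marks}) OR c3 IN ({marks})
        ORDER BY sortweight DESC, c1, c2, c3
    """
    # The word list appears six times in the statement, so bind it six times.
    return conn.execute(sql, list(terms) * 6).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theTable (c1 TEXT, c2 TEXT, c3 TEXT)")
conn.executemany(
    "INSERT INTO theTable VALUES (?, ?, ?)",
    [("A2", "B2", "N2"), ("K1", "B2", "N1"), ("K1", "B3", "N1"), ("L1", "A2", "C1")],
)

rows = search(conn, "K1 A2 C1 N2 B2".split())
for row in rows:
    print(row)
```

For the words in Ex2 the row A2 - B2 - N2 comes first with a weight of 3. Rows with equal weight are ordered by c1, c2, c3, so ties may come out in a different order than in the hand-written examples.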

How to generate all possible rankings of a document set when each document can take one of 2 types?

I need to generate all possible rankings of n documents. I understand that the permutations of an array {1, 2, ..., n} give me the set of all possible rankings.
My problem is a bit more complex, as each document can take one of 2 possible types. Therefore, in all there are n! * 2^n possible rankings.
For instance, let us say I have 3 documents a, b, and c. Then possible rankings are the following:
a1 b1 c1
a1 b1 c2
a1 b2 c1
a1 b2 c2
a2 b1 c1
a2 b1 c2
a2 b2 c1
a2 b2 c2
a1 c1 b1
a1 c1 b2
a1 c2 b1
a1 c2 b2
a2 c1 b1
a2 c1 b2
a2 c2 b1
a2 c2 b2
b1 a1 c1
b1 a1 c2
b1 a2 c1
b1 a2 c2
b2 a1 c1
b2 a1 c2
...
What would be an elegant way to generate such rankings?
It's a kind of cross product between the permutations of B={a,b,...} and the k-combinations of T={1,2}, where k is the number of elements in B (order matters and values may repeat, so these are really k-tuples over T). Say we take a p from Perm(B), e.g. p=(b,c,a), and a c from 3-Comb(T), e.g. c=(2,1,1); then we would merge p and c into (b2,c1,a1).
I don't really know if it's elegant, but I would choose an algorithm to generate the permutations of B sequentially (cf TAOCP Volume 4 fascicle 2b) and, for each permutation, apply the above "product" with all the k-combinations, generated sequentially or stored in an array if k is small (cf TAOCP Volume 4 fascicle 3a).
B = {a, b, c, ...}
T = {1, 2}
k = length(B)
reset_perm(B)
do
    p = next_perm(B)
    reset_comb(T, k)
    do
        c = next_kcomb(T, k)
        output product(p, c)
    while not last_kcomb(T, k)
while not last_perm(B)
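The pseudocode above can be sketched in Python with the standard itertools module, which replaces the hand-written generators: `permutations` stands in for next_perm and `product(..., repeat=k)` for next_kcomb. The function name `typed_rankings` is made up for illustration.

```python
from itertools import permutations, product

def typed_rankings(docs, types=(1, 2)):
    """Yield every ordering of docs combined with every type assignment."""
    for perm in permutations(docs):
        # All k-tuples over the types, one type per position in the ranking.
        for assignment in product(types, repeat=len(docs)):
            yield tuple(f"{doc}{t}" for doc, t in zip(perm, assignment))

rankings = list(typed_rankings("abc"))
print(len(rankings))
print(rankings[:3])
```

For three documents this yields 3! * 2^3 = 48 rankings, starting with a1 b1 c1, a1 b1 c2, a1 b2 c1 as in the listing in the question. Because it is a generator, the rankings can be consumed one at a time without storing all n! * 2^n of them.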
