Using the Sphinx search engine with an access rights table

I want to check a user's read permission while that user searches a Sphinx index for documents.
For example, I have a documents table with doc_id, doc_title and doc_is_global. On the other side I have an accessprivileges table with a structure like:
user_id, user_group_id, doc_id, doc_category_id
Users can be grouped into a "user_group" identified by user_group_id, and documents likewise into document categories.
The access table could look like:
user_id, user_group_id, doc_id, doc_category_id
1 , NULL, 1, NULL
NULL, 12, NULL, 32
1, NULL, NULL, 31
NULL, 10, 1, NULL
A user should only find documents where the doc_is_global flag is set to 1, or to which he has access through his user_id or through a group he is a member of.
In plain MySQL I get the right result with some JOINs like:
SELECT * FROM documents d
LEFT JOIN document_category dc ON dc.doc_id = d.doc_id
LEFT JOIN access a1 ON a1.user_id = {$user} AND a1.doc_id = d.doc_id
LEFT JOIN access a2 ON a2.doc_category_id = dc.category_id AND a2.user_group_id IN ({$groups})
[...]
I know that in Sphinx I can attach multiple attributes to an indexed document, but that is not what I want. In my production environment I also have to check which user granted the read access, and only if that user is allowed to grant it does the other user get read access.
Modelling this situation with multi-value attributes in Sphinx gives something like:
access_user_id = (1,4,6,2) accessed_by_user = (1,5,3)
so there is no way to check who gave read permission to whom. The next problem is that Sphinx only supports at most 4 GB of attributes per index.
I need a hint on how to build the index so that results the user isn't allowed to see are filtered out (maybe with multiple indexes?).

Well, you can index this with:
sql_query = \
SELECT d.doc_id, ... \
GROUP_CONCAT(a.user_id) AS access_user_id, \
GROUP_CONCAT(a.user_group_id) AS access_user_group_id \
FROM documents d \
LEFT JOIN document_category dc ON (dc.doc_id = d.doc_id) \
LEFT JOIN access a ON (a.doc_id = d.doc_id OR a.doc_category_id = dc.category_id) \
GROUP BY d.doc_id
Then you can filter on that:
$cl->setSelect("*, IF(IN({$user},access_user_id),1,0)+IF(IN({$group},access_user_group_id),1,0) AS myint");
$cl->setFilter('myint',array(1,2));
Regarding "Next problem is that Sphinx only supports max. 4gb attributes per index":
Sphinx only supports 4 GB of string attributes per index. Are you sure there is such a limit on MVA attributes?
In any case, if there are too many attributes, the limit is per index, so shard the index into parts :)
If you run into the maximum length of GROUP_CONCAT, the easiest fix would probably be to use an MVA query.
See the docs for it: http://sphinxsearch.com/docs/current.html#conf-sql-attr-multi
There you can define a query that fetches the data for the MVA directly, which avoids GROUP_CONCAT/GROUP BY:
sql_query = SELECT d.doc_id, ... FROM documents d
sql_attr_multi = uint access_user_id from query; \
SELECT DISTINCT d.doc_id, a.user_id FROM documents d \
LEFT JOIN document_category dc ON (dc.doc_id = d.doc_id) \
LEFT JOIN access a ON (a.doc_id = d.doc_id OR a.doc_category_id = dc.category_id)
sql_attr_multi = uint access_user_group_id from query; \
SELECT DISTINCT d.doc_id, a.user_group_id FROM documents d \
LEFT JOIN document_category dc ON (dc.doc_id = d.doc_id) \
LEFT JOIN access a ON (a.doc_id = d.doc_id OR a.doc_category_id = dc.category_id)
(You can probably optimise those queries a bit, but they should at least show enough to get started.)
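To see what the aggregated join produces before pointing an indexer at it, here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL and Sphinx. The table and column names come from the question; the sample rows, the helper names (make_sample_db, access_lists, visible_docs) and the extra doc_is_global check are assumptions made for illustration. The Python filter only mimics the idea behind the IF(IN(...)) expression above; it is not the Sphinx API.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory database shaped like the tables in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, doc_title TEXT,
                                doc_is_global INTEGER);
        CREATE TABLE document_category (doc_id INTEGER, category_id INTEGER);
        CREATE TABLE access (user_id INTEGER, user_group_id INTEGER,
                             doc_id INTEGER, doc_category_id INTEGER);
        -- Sample rows are made up; the access rows mirror the question.
        INSERT INTO documents VALUES (1,'a',0), (2,'b',0), (3,'c',0), (4,'d',1);
        INSERT INTO document_category VALUES (2,31), (3,32);
        INSERT INTO access VALUES (1,NULL,1,NULL), (NULL,12,NULL,32),
                                  (1,NULL,NULL,31), (NULL,10,1,NULL);
        """
    )
    return conn


def _ids(csv):
    """Turn a GROUP_CONCAT result such as '1,4' into {1, 4}; NULL becomes an empty set."""
    return {int(v) for v in csv.split(",")} if csv else set()


def access_lists(conn):
    """Return {doc_id: (is_global, user_ids, group_ids)} from the aggregated join."""
    rows = conn.execute(
        """
        SELECT d.doc_id, d.doc_is_global,
               GROUP_CONCAT(DISTINCT a.user_id),
               GROUP_CONCAT(DISTINCT a.user_group_id)
        FROM documents d
        LEFT JOIN document_category dc ON dc.doc_id = d.doc_id
        LEFT JOIN access a
               ON a.doc_id = d.doc_id OR a.doc_category_id = dc.category_id
        GROUP BY d.doc_id
        """
    ).fetchall()
    return {doc: (bool(g), _ids(u), _ids(grp)) for doc, g, u, grp in rows}


def visible_docs(conn, user_id, group_ids):
    """Emulate the filter: keep a document if it is global or either list matches."""
    wanted = set(group_ids)
    return sorted(
        doc
        for doc, (is_global, users, groups) in access_lists(conn).items()
        if is_global or user_id in users or wanted & groups
    )
```

For example, visible_docs(make_sample_db(), 1, []) lists the documents user 1 can reach directly, through a category grant, or because they are global.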

Related

SqlServer Many to Many AND

I have 3 (hypothetical) tables.
Photos (a list of photos)
Attributes (things describing the photos)
PhotosToAttributes (a table to link the first 2)
I want to retrieve the Names of all the Photos that have a list of attributes.
For example, all photos that have both dark lighting and are portraits (AttributeID 1 and 2). Or, for example, all photos that have dark lighting, are portraits and were taken at a wedding (AttributeID 1 and 2 and 5). Or any arbitrary number of attributes.
The scale of the database will be maybe 10,000 rows in Photos, 100 Rows in Attributes and 100,000 rows in PhotosToAttributes.
This question: SQL: Many-To-Many table AND query is very close. (I think.) I also read the linked answers about performance. That leads to something like the following. But, how do I get Name instead of PhotoID? And presumably my code (C#) will build this query and adjust the attribute list and count as necessary?
SELECT PhotoID
FROM PhotosToAttributes
WHERE AttributeID IN (1, 2, 5)
GROUP by PhotoID
HAVING COUNT(1) = 3
I'm a bit database illiterate (it's been 20 years since I took a database class); I'm not even sure this is a good way to structure the tables. I wanted to be able to add new attributes and photos at will without changing the data access code.
It is probably a reasonable way to structure the database. An alternative would be to keep all the attributes as a delimited list in a varchar field, but that would lead to performance issues when you search the field.
Your code is close. To take it the final step, just join the other two tables like this:
Select p.Name, p.PhotoID
From Photos As p
Join PhotosToAttributes As pta On p.PhotoID = pta.PhotoID
Join Attributes As a On pta.AttributeID = a.AttributeID
Where a.Name In ('Dark Light', 'Portrait', 'Wedding')
Group By p.Name, p.PhotoID
Having Count(*) = 3;
By joining the Attributes table like that it means you can search for attributes by their name, instead of their ID.
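On the question of building this query from application code for an arbitrary number of attributes, here is a minimal sketch. It is written in Python with the built-in sqlite3 module rather than C# and SQL Server, so treat the placeholder syntax and the helper name photos_with_all as assumptions; the idea of generating one parameter per attribute and passing the count to HAVING carries over. COUNT(DISTINCT ...) is used so that a duplicated link row cannot inflate the count.

```python
import sqlite3


def photos_with_all(conn, attribute_names):
    """Return names of photos that carry every attribute in attribute_names."""
    names = list(dict.fromkeys(attribute_names))  # drop duplicates, keep order
    if not names:
        return []
    placeholders = ",".join("?" * len(names))  # e.g. "?,?,?" for three names
    sql = f"""
        SELECT p.Name
        FROM Photos AS p
        JOIN PhotosToAttributes AS pta ON p.PhotoID = pta.PhotoID
        JOIN Attributes AS a ON pta.AttributeID = a.AttributeID
        WHERE a.Name IN ({placeholders})
        GROUP BY p.PhotoID, p.Name
        HAVING COUNT(DISTINCT a.AttributeID) = ?
        ORDER BY p.Name
    """
    # Attribute values are bound as parameters; only the "?" list is generated.
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]


def make_sample_db():
    """Made-up sample data for the three tables described in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Photos (PhotoID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Attributes (AttributeID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE PhotosToAttributes (PhotoID INTEGER, AttributeID INTEGER);
        INSERT INTO Photos VALUES (1,'sunset.jpg'), (2,'bride.jpg'), (3,'groom.jpg');
        INSERT INTO Attributes VALUES (1,'Dark Light'), (2,'Portrait'), (5,'Wedding');
        INSERT INTO PhotosToAttributes VALUES (1,1), (2,1), (2,2), (2,5), (3,2), (3,5);
        """
    )
    return conn
```

Calling photos_with_all(conn, ["Dark Light", "Portrait"]) on the sample data returns only the photo that has both attributes.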
First, create a view from your joins:
create view vw_PhotosWithAttributes
as
select
p.PhotoId,
a.AttributeID,
p.Name PhotoName,
a.Name AttributeName
from Photos p
inner join PhotosToAttributes pa on p.PhotoId = pa.PhotoId
inner join Attributes a on a.AttributeID = pa.AttributeID
You can then easily query by attribute, name or ID, but don't forget to index the fields properly.

NHibernate Criteria SQL Inner Join on Sub Select Same Table

I can't for the life of me figure out how to translate the following SQL query using NHibernate's Criteria API:
SELECT r.* from ContentItemVersionRecords as r
INNER JOIN (
SELECT ContentItemId as CID, Max(Number) as [Version]
FROM ContentItemVersionRecords
GROUP BY ContentItemId
) AS l
ON r.ContentItemId = l.CID and r.Number = l.[Version]
WHERE Latest = 0 and Published = 0
The table looks like this:
The result of the SQL query above will return the highlighted records.
The idea is to select the latest version of content items, so I basically need to group by ContentItemId and get the record with the highest Number.
So the result will look like this:
I started out with a detached criteria, but I am clueless as to how to use it in the criteria:
// Sub select for the inner join:
var innerJoin = DetachedCriteria.For<ContentItemVersionRecord>()
.SetProjection(Projections.ProjectionList()
.Add(Projections.GroupProperty("ContentItemId"), "CID")
.Add(Projections.Max("Number"), "Version"));
// What next?
var criteria = session.CreateCriteria<ContentItemVersionRecord>();
Please note that I have to use the Criteria API - I can't use LINQ, HQL or SQL.
Is this at all possible with the Criteria API?
UPDATE: I just came across this post which looks very similar to my question. However, when I apply that as follows:
var criteria = session
.CreateCriteria<ContentItemVersionRecord>()
.SetProjection(
Projections.ProjectionList()
.Add(Projections.GroupProperty("ContentItemId"))
.Add(Projections.Max("Number")))
.SetResultTransformer(Transformers.AliasToBean<ContentItemVersionRecord>());
I get 2 results, which looks promising, but all of the integer properties are 0:
UPDATE 2: I found out that if I supply aliases, it will work (meaning I will get a list of ContentItemVersionRecords with populated objects):
var criteria = session
.CreateCriteria<ContentItemVersionRecord>()
.SetProjection(
Projections.ProjectionList()
.Add(Projections.Max("Id"), "Id")
.Add(Projections.GroupProperty("ContentItemId"), "ContentItemId")
.Add(Projections.Max("Number"), "Number"))
.SetResultTransformer(Transformers.AliasToBean<ContentItemVersionRecord>());
However, I can't use the projected values as the end result - I need to use these results as some sort of input into the outer query, e.g.
SELECT * FROM ContentItemVersionRecord WHERE Id IN ('list of record ids as a result from the projection / subquery / inner join')
But that won't work, since the projection returns 3 scalar values (Id, ContentItemId and Number). If it would just return "Id", then it might work. But I need the other two projections to group by ContentItemId and order by Max("Number").
OK, so in a nutshell, you need to unwind that nested query, and do a group by with a having clause, which is pretty much a where on aggregated values, as in the following HQL:
SELECT civ.ContentItem.Id, MAX(civ.Number) AS VersionNumber
FROM ContentItemVersionRecord civ
JOIN ContentItem ci
GROUP BY civ.ContentItem.Id
HAVING MAX(civ.Latest) = 0 AND MAX(civ.Published) = 0
For each deleted content item (one whose Latest and Published flags are zero on all of its content item version records), this gives you the maximum version number, i.e. the latest version of each deleted content item.
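For comparison, the SQL from the question can also be written with a correlated subquery in the WHERE clause instead of a join on a derived table, a shape that criteria-style APIs tend to express more easily. The sketch below only checks that the two forms select the same rows. It uses Python's sqlite3 module with made-up sample rows; the table and column names are taken from the question, and nothing here is NHibernate code.

```python
import sqlite3

# The query from the question: join each row to the max Number of its item.
JOIN_ON_MAX = """
SELECT r.Id
FROM ContentItemVersionRecords AS r
INNER JOIN (
    SELECT ContentItemId AS CID, MAX(Number) AS Version
    FROM ContentItemVersionRecords
    GROUP BY ContentItemId
) AS l ON r.ContentItemId = l.CID AND r.Number = l.Version
WHERE r.Latest = 0 AND r.Published = 0
ORDER BY r.Id
"""

# Same rows, expressed as a correlated subquery on the outer row's item.
CORRELATED = """
SELECT r.Id
FROM ContentItemVersionRecords AS r
WHERE r.Number = (
    SELECT MAX(i.Number)
    FROM ContentItemVersionRecords AS i
    WHERE i.ContentItemId = r.ContentItemId
)
AND r.Latest = 0 AND r.Published = 0
ORDER BY r.Id
"""


def make_sample_db():
    """Made-up rows: (Id, ContentItemId, Number, Latest, Published)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ContentItemVersionRecords (Id INTEGER PRIMARY KEY, "
        "ContentItemId INTEGER, Number INTEGER, Latest INTEGER, Published INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ContentItemVersionRecords VALUES (?, ?, ?, ?, ?)",
        [
            (1, 10, 1, 0, 0), (2, 10, 2, 0, 0),
            (3, 11, 1, 0, 0), (4, 11, 2, 0, 0), (5, 11, 3, 0, 0),
            (6, 12, 1, 1, 1),
            (7, 13, 1, 0, 0), (8, 13, 2, 1, 1),
        ],
    )
    return conn


def latest_ids(conn, sql):
    """Run one of the queries above and return the selected Ids."""
    return [row[0] for row in conn.execute(sql)]
```

On this sample data both queries pick the highest-numbered record of items 10 and 11 and skip items 12 and 13, whose highest-numbered record is still flagged Latest or Published.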

How can I convert a view containing a START WITH...CONNECT BY sub-query to SQL Server?

I am trying to convert a view from an Oracle RDBMS to SQL Server. The view looks like:
create or replace view user_part_v
as
select part_region.part_id, users.id as users_id
from part_region, users
where part_region.region_id in(select region_id
from region_relation
start with region_id = users.region_id
connect by parent_region_id = prior region_id)
Having read about recursive CTE's and also about their use in sub-queries, my best guess at translating the above into SQL Server syntax is:
create view user_part_v
as
with region_structure(region_id, parent_region_id) as (
select region_id
, parent_region_id
from region_relation
where parent_region_id = users.region_id
union all
select r.region_id
, r.parent_region_id
from region_relation r
join region_structure rs on rs.parent_region_id = r.region_id
)
select part_region.part_id, users.id as users_id
from part_region, users
where part_region.region_id in(select region_id from region_structure)
Obviously this gives me an error about the reference to users.region_id in the CTE definition.
How can I achieve the same result in SQL Server as I get from the Oracle view?
Background
I am working on the conversion of a system from running on an Oracle 11g RDBMS to SQL Server 2008. This system is a relatively large Java EE based system, using JPA (Hibernate) to query the database.
Many of the queries use the above mentioned view to restrict the results returned to those appropriate for the current user. If I cannot convert the view directly then the conversion will be much harder as I will need to change all of the places where we query the database to achieve the same result.
The tables referenced by this view have a structure similar to:
USERS
    ID
    REGION_ID
REGION
    ID
    NAME
REGION_RELATIONSHIP
    PARENT_REGION_ID
    REGION_ID
PART
    ID
    PARTNO
    DESCRIPTION
PART_REGION
    PART_ID
    REGION_ID
So, we have regions, arranged into a hierarchy. A user may be assigned to a region. A part may be assigned to many regions. A user may only see the parts assigned to their region. The regions reference various geographic regions:
World
    Europe
        Germany
        France
        ...
    North America
        Canada
        USA
            New York
            ...
If a part, #123, is assigned to the region USA, and the user is assigned to the region New York, then the user should be able to see that part.
UPDATE: I was able to work around the error by creating a separate view that contains the necessary data and then having my main view join to this view. This has the system working, but I have not yet done thorough correctness or performance testing. I am still open to suggestions for better solutions.
I reformatted your original query to make it easier for me to read.
create or replace view user_part_v
as
select part_region.part_id, users.id as users_id
from part_region, users
where part_region.region_id in(
select region_id
from region_relation
start with region_id = users.region_id
connect by parent_region_id = prior region_id
);
Let's examine what's going on in this query.
select part_region.part_id, users.id as users_id
from part_region, users
This is an old-style join where the tables are cartesian joined and then the results are reduced by the subsequent where clause(s).
where part_region.region_id in(
select region_id
from region_relation
start with region_id = users.region_id
connect by parent_region_id = prior region_id
);
The sub-query with the connect by clause uses the region_id from the users table in the outer query to define the starting point for the recursion.
Then the in clause checks to see if the region_id for the part_region is found in the results of the recursive query.
This recursion follows the parent-child linkages given in the region_relation table.
So the combination of doing an in clause with a sub-query that references the parent and the old-style join means that you have to consider what the query is meant to accomplish and approach it from that direction (rather than just a tweaked re-arrangement of the old query) to be able to translate it into a single recursive CTE.
This query will also return multiple rows if the part is assigned to multiple regions along the same branch of the region hierarchy, e.g. if the part is assigned to both North America and USA, a user assigned to New York will get two rows returned for their users_id with the same part_id.
Given the Oracle view and the background you gave of what the view is supposed to do, I think what you're looking for is something more like this:
create view user_part_v
as
with user_regions(users_id, region_id, parent_region_id) as (
select u.id, u.region_id, rr.parent_region_id
from users u
left join region_relation rr on u.region_id = rr.region_id
union all
select ur.users_id, rr.region_id, rr.parent_region_id
from user_regions ur
inner join region_relation rr on ur.parent_region_id = rr.region_id
)
select pr.part_id, ur.users_id
from part_region pr
inner join user_regions ur on pr.region_id = ur.region_id;
Note that I've added the users_id to the output of the recursive CTE, and then just done a simple inner join of the part_region table and the CTE results.
Let me break down the query for you.
select u.id, u.region_id, rr.parent_region_id
from users u
left join region_relation rr on u.region_id = rr.region_id
This is the starting set for our recursion. We're taking the region_relation table and joining it against the users table, to get the starting point for the recursion for every user. That starting point being the region the user is assigned to along with the parent_region_id for that region. A left join is done here and the region_id is pulled from the user table in case the user is assigned to a top-most region (which means there won't be an entry in the region_relation table for that region).
select ur.users_id, rr.region_id, rr.parent_region_id
from user_regions ur
inner join region_relation rr on ur.parent_region_id = rr.region_id
This is the recursive part of the CTE. We take the existing results for each user, then add rows for each user for the parent regions of the existing set. This recursion continues until we run out of parents (i.e. we hit rows that have no entries for their region_id in the region_relation table).
select pr.part_id, ur.users_id
from part_region pr
inner join user_regions ur on pr.region_id = ur.region_id;
This is the part where we grab our final result set. Assuming (as I do from your description) that each region has only one parent (which would mean that there's only one row in region_relation for each region_id), a simple join will return all the users that should be able to view the part based on the part's region_id. This is because there is exactly one row returned per user for the user's assigned region, and one row per user for each parent region up to the hierarchy root.
NOTE:
Both the original query and this one have a limitation that I want to make sure you are aware of. If the part is assigned to a region that is lower in the hierarchy than the user's (i.e. a region that is a descendant of the user's region, such as the part being assigned to New York and the user to USA instead of the other way around), the user won't see that part. The part has to be assigned either to the user's assigned region or to one higher in the region hierarchy.
Another thing is that this query still exhibits the behavior I mentioned above for the original query: if a part is assigned to multiple regions along the same branch of the hierarchy, multiple rows will be returned for the same combination of users_id and part_id. I kept this because I wasn't sure whether you wanted that behavior changed.
If this is actually an issue and you want to eliminate the duplicates, then you can replace the query below the CTE with this one:
select p.id as part_id, u.id as users_id
from part p
cross join users u
where exists (
select 1
from part_region pr
inner join user_regions ur on pr.region_id = ur.region_id
where pr.part_id = p.id
and ur.users_id = u.id
);
This does a cartesian join between the part table and the users table and then only returns rows where the combination of the two has at least one row in the results of the subquery, which are the results that we are trying to de-duplicate.
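To try the recursive CTE and the de-duplicating variant on a small hierarchy, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is standing in for SQL Server, so the RECURSIVE keyword and the in-memory setup are assumptions of this sketch; the table and column names follow the question's schema (users.id, part.id), and the sample rows are made up.

```python
import sqlite3

# Same shape as the CTE above: anchor on each user's region, then walk to parents.
USER_REGIONS_CTE = """
WITH RECURSIVE user_regions(users_id, region_id, parent_region_id) AS (
    SELECT u.id, u.region_id, rr.parent_region_id
    FROM users u
    LEFT JOIN region_relation rr ON u.region_id = rr.region_id
    UNION ALL
    SELECT ur.users_id, rr.region_id, rr.parent_region_id
    FROM user_regions ur
    INNER JOIN region_relation rr ON ur.parent_region_id = rr.region_id
)
"""

# Plain join: one row per matching region, so duplicates are possible.
WITH_DUPLICATES = USER_REGIONS_CTE + """
SELECT pr.part_id, ur.users_id
FROM part_region pr
INNER JOIN user_regions ur ON pr.region_id = ur.region_id
ORDER BY pr.part_id, ur.users_id
"""

# EXISTS variant: at most one row per (part, user) combination.
DEDUPLICATED = USER_REGIONS_CTE + """
SELECT p.id, u.id
FROM part p
CROSS JOIN users u
WHERE EXISTS (
    SELECT 1
    FROM part_region pr
    INNER JOIN user_regions ur ON pr.region_id = ur.region_id
    WHERE pr.part_id = p.id AND ur.users_id = u.id
)
ORDER BY p.id, u.id
"""


def make_sample_db():
    """Regions (made up): 1 World, 2 North America, 3 USA, 4 New York, 5 Europe."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, region_id INTEGER);
        CREATE TABLE region_relation (parent_region_id INTEGER, region_id INTEGER);
        CREATE TABLE part (id INTEGER PRIMARY KEY, partno TEXT);
        CREATE TABLE part_region (part_id INTEGER, region_id INTEGER);
        INSERT INTO region_relation VALUES (1,2), (2,3), (3,4), (1,5);
        INSERT INTO users VALUES (100,4), (101,5);
        INSERT INTO part VALUES (123,'P-123'), (456,'P-456');
        -- Part 123 sits on two regions of the same branch (USA and North America).
        INSERT INTO part_region VALUES (123,3), (123,2), (456,5);
        """
    )
    return conn
```

Running conn.execute(WITH_DUPLICATES).fetchall() on the sample data returns the pair (123, 100) twice, while DEDUPLICATED returns it once. One thing this sample also shows: because the root region (1) has no row of its own in region_relation, the recursive step never emits it for user 100, so parts assigned directly to the root may need separate handling.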

How to determine the types of indexes in a database?

I'm fiddling around with PostgreSQL right now.
I can see the user indexes using SELECT * FROM pg_stat_user_indexes
However, it doesn't seem like the result gives any information on the type of each index such as 'B-tree', 'R-tree', 'Hash', and 'GiST'.
Anyone know how I can find out the type of each index?
pg_stat_user_indexes stores statistics, not the general index data.
Use this:
SELECT i.indexname, a.amname
FROM pg_indexes i
JOIN pg_class c
ON c.relname = i.indexname
JOIN pg_am a
ON a.oid = c.relam
WHERE i.schemaname = 'public' -- or whatever your schema is

How do I assign weights to different columns in a full text search?

In my full text search query, I want to assign particular columns a higher weightage. Consider this query:
SELECT Key_Table.RANK, FT_Table.* FROM Restaurants AS FT_Table
INNER JOIN FREETEXTTABLE(Restaurants, *, 'chilly chicken') AS Key_Table
ON FT_Table.RestaurantID = Key_Table.[KEY]
ORDER BY Key_Table.RANK DESC
Now, I want the Name column to have a higher weightage in the results (Name, Keywords and Location are full-text indexed). Currently, if the result is found in any of the three columns, the ranks are not affected.
For example, I'd like a row with Name "Chilly Chicken" to have higher rank than one with Keywords "Chilly Chicken", but another name.
Edit:
I'm not eager to use ContainsTable, because that would mean separating the phrases (Chilly AND Chicken, etc.), which would involve me having to search all possible combinations - Chilly AND Chicken, Chilly OR Chicken, etc. I would like the FTS engine to automatically figure out which results match best, and I think FREETEXT does a fine job this way.
Apologies if I've misunderstood how CONTAINS/CONTAINSTABLE works.
The best solution is to use ContainsTable. Use a union to create a query that searches all 3 columns and adds an integer used to indicate which column was searched. Sort the results by that integer and then rank desc.
The rank is internal to sql server and not something you can adjust.
You could also manipulate the returned rank by dividing the rank by the integer (Name would be divided by 1, Keyword and Location by 2 or higher). That would cause the appearance of different rankings.
Here's some example sql:
--Recommend using start change tracking and start background updateindex (see books online)
SELECT 1 AS ColumnLocation, Key_Table.Rank, FT_Table.* FROM Restaurants AS FT_Table
INNER JOIN ContainsTable(Restaurants, Name, 'chilly chicken') AS Key_Table ON
FT_Table.RestaurantId = Key_Table.[Key]
UNION SELECT 2 AS ColumnLocation, Key_Table.Rank, FT_Table.* FROM Restaurants AS FT_Table
INNER JOIN ContainsTable(Restaurants, Keywords, 'chilly chicken') AS Key_Table ON
FT_Table.RestaurantId = Key_Table.[Key]
UNION SELECT 3 AS ColumnLocation, Key_Table.Rank, FT_Table.* FROM Restaurants AS FT_Table
INNER JOIN ContainsTable(Restaurants, Location, 'chilly chicken') AS Key_Table ON
FT_Table.RestaurantId = Key_Table.[Key]
ORDER BY ColumnLocation, Rank DESC
In a production environment, I would insert the output of the query into a table variable to perform any additional manipulation before returning the results (may not be necessary in this case). Also, avoid using *, just list the columns you really need.
Edit: You're right about ContainsTable: you would have to modify the keywords to be '"chilly*" AND "chicken*"'. I do this with a process that tokenizes the input phrase. If you don't want to do that, just replace every instance of ContainsTable above with FreeTextTable; the query will still work the same way.
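The idea of dividing the returned rank by a per-column integer can also be applied after the per-column queries have run. Below is a minimal Python sketch of that post-processing step, assuming each column's search has already produced (key, rank) pairs. The function name merge_ranked, the input shape and the divisor values are hypothetical; nothing here calls SQL Server.

```python
def merge_ranked(results_by_column, divisors):
    """Combine per-column hits into one list ordered by adjusted rank.

    results_by_column: {column_name: [(key, rank), ...]}
    divisors:          {column_name: positive number}; smaller means more weight
    Returns [(key, adjusted_rank), ...], best first; ties are ordered by key.
    """
    best = {}
    for column, hits in results_by_column.items():
        for key, rank in hits:
            adjusted = rank / divisors[column]
            # Keep the best adjusted rank seen for each row across columns.
            if key not in best or adjusted > best[key]:
                best[key] = adjusted
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Made-up ranks: restaurant 1 matches on Name, 2 on Keywords, 3 on Location.
sample_hits = {
    "Name": [(1, 100)],
    "Keywords": [(2, 120), (1, 40)],
    "Location": [(3, 90)],
}
sample_divisors = {"Name": 1, "Keywords": 2, "Location": 2}
ordered = merge_ranked(sample_hits, sample_divisors)
```

With these divisors a Name match keeps its full rank while Keywords and Location matches are halved, so restaurant 1 sorts ahead of restaurant 2 even though its raw rank is lower.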
