We understand that it is not possible to do ORs in Cloud Datastore. Has anyone any suggested patterns for a workaround for this particular query requirement:
Users have a list of tags that represent their access control. Groups also have a list of tags that represent their membership. We want to run a query essentially that gets all the Groups a User has access to, similar to:
SELECT * FROM Group where tags IN ("administrator", "super_user")
or
SELECT * FROM Group where tags.administrator = true OR tags.super_user = true
Is there any way to model this particular access pattern? We understand that we can do this at the application level by merging queries, but these resultsets are large and we don't want to start sorting at the application level.
If we can't do this any way then is there a better way to model our entities in Datastore to make this kind of query possible?
Related
I am parsing documents on the web and storing them in solr database. Every day I see thousand of documents and some of them are repeating.
I'd like to give user an option to see which document was most seen on a given date, or in a given timespan. Queries of interest correspond to:
-show me which documents were seen the most on 16/10/2022,
-show me which documents were seen the most between 16/10/2022 and 23/10/2022
When writing solr queries, you specify field name to search on. What field type should I use and in what format should I store the number of times the document was seen on a given date?
How I would try it:
Create a separate collection - very simple collection with fields:
view time
doc id
title or body (whatever you're querying)
... do this for EVERY view.
you can query it by the gap you want:
curl http://localhost:8983/solr/query -d 'q=title:abc&rows=0&json.facet={
per_month: { range : {
field : last_modified,
start:'2022-01-01T00:00:00Z',
end:'2022-12-31T23:59:59Z',
gap:'+1MONTH',
}}
}}
This would return all views by MONTH (can change it to DAY, YEAR, etc).
But your doc is probably too big for this solution. If you want to normalize this:
a JOIN query. Since solr 8.6, you can now do cross-collection joins on multiple shards. this is a good article about how to do those queries. this is a decent video of how to set this up It's not that hard to do.
The JOIN query would be much faster.
If you don't want to do the JOIN query:
If the views change often, do not store them in the document store. There's no notion of partial updates in solr. If you're updating views every day, you'll need to update every document that's been viewed. That's going to cause a lot of unnecessary disk thrashing.
Other thoughts:
can you use a database? This is a far better use of views. Solr isn't good as a master record for views.
Another suggestion is to make the views go to an analytics engine - a far better solution since you can get rich analytics about the actual users. An analytics engine does a lot that rendering views does not - especially filtering out false positives (like bots!). It's not fun to maintain an accurate view count if you have a high-trafficed site.
In the past I've used an analytics engine to collect the data and used the analytics engine to export that data into solr. This way you can have the view logic be done by the software component that knows views best (the analytics engine like Google analytics or Salesforce marketing engine) and run an hourly process to update the views in solr using one of the above tactics.
How could I express the following design?
There are two entities: user and group
Group can have users and other groups
User can't have other users or groups
Efficiently query any group and everything it contains
There are conceptually no depth limits (current hardware dictates it, e.g 5 for query speed)
Examples:
I need to use NoSQL and also be able to cache this data (Redis for example, which is NoSQL itself).
---
My current idea:
Every group is a single unit and only contains children (users and groups) IDs. Then I query all the children by IDs. If some of these also have children, I'll make another roundtrip and so on and on..
As you can imagine, this solution requires multiple queries and the amount increases with every "level of deepness". The good news is that I query all these items by ID which should be extremely fast.
Can anyone suggest a better way?
I would use a graph database as they are very powerful when dealing with this kind of queries.
Bear in mind that you won't be able to query the "parents" of a node though.
You could use Neo4j for this. They have a community edition that is free. https://neo4j.com/
We have an application running in Google App Engine and storing data in Google Datastore.
For a given Datastore kind, all our entities have a property type.
We are interested in running a query with an IN query filter to fetch multiple types at once, something like:
type in ['event', 'comment', 'custom']
As there are thousands of entities within this kind, pagination is needed.
The problem we are having is that it is a known limitation of the Datastore that queries with "IN" filters do not support cursor.
Are there sensible ways to get around this limitation?
Using offset would be costy and not performant. Also we can't fetch all entities and filter in the client as we are building an API, hence we don't develop the client ourselves.
Any hint would be really appreciated, thanks!
IN filter results in individual EQUAL queries for each item on the list. This is why they do not support cursors - in your case, there will be 3 distinct positions in the index after your run the IN query.
Consider instead adding another property to your entity, which will serve as a flag for this type of API call: its value will be "true" if a type is in ['event', 'comment', 'custom'], or "false" otherwise. Maybe this flag may allow you to make "type" property unindexed - that would be an additional benefit.
With this new indexed property you can use a regular EQUAL filter. It will be faster (1 query instead of 3), and you can use cursors for pagination.
For example I have Documents A, B, C. User 1 must only be able to see Documents A, B. User 2 must only be able to see Document C. Is it possible to do it in SOLR without filtering by metadata? If I use metadata filter, everytime there are access right changes, I have to reindex.
[update 2/14/2012] Unfortunately, in the client's case, change is frequent. Data is confidential and usually only managed by the owners which are internal users. Then the specific case is they need to be able to share those documents to certain external users and specify access levels for those users. And most of the time this is an adhoc task, and not identified ahead of time
I would suggest storing the access roles (yes, its plural) as document metadata. Here the required field access_roles is a facet-able multi-valued string field.
Doc1: access_roles:[user_jane, manager_vienna] // Jane and the Vienna branch manager may see it
Doc2: access_roles:[user_john, manager_vienna, special_team] // Jane, the Vienna branch manager and a member of special team may see it
The user owning the document is a default access role for that document.
To change the access roles of a document, you edit access_roles.
When Jane searches, the access roles she belongs to will be part of the query. Solr will retrieve only the documents that match the user's access role.
When Jane (user_jane), manager at vienna office (manager_vienna) searches, her searches go like:
q=mainquery
&fq=access_roles:user_jane
&fq=access_roles:manager_vienna
&facet=on
&facet.field=access_roles
which fetches all documents which contains user_jane OR manager_vienna in access_roles; Doc1 and Doc2.
When Bob, (user_bob), member of a special team (specia_team) searches,
q=mainquery
&fq=access_roles:user_bob
&fq=access_roles:special_team
&facet=on
&facet.field=access_roles
which fetches Doc2 for him.
Queries adapted from http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
Might want to check the Document level Security patches.
https://issues.apache.org/jira/browse/SOLR-1872
https://issues.apache.org/jira/browse/SOLR-1834
I think my approach would be similar to #aitchnyu's answer. I would however NOT use individual users in the meta data.
If you create groups for each document, then you will have to reindex for security reason less often.
For a given document, you might have access_roles: group_1, group_3
In this way, the group_1 and group_3 always retain rights to the document. However, I can vary what groups each user belongs to and adjust the query accordingly.
When the query then is generated, it always passes as a part of the query the user's groups. If I belong to group_1 and group_2, my query will look like this:
q=mainquery
&fq=access_roles:group_1
&fq=access_roles:group_2
Since the groups are dynamically generated in the query, I simply remove a user from the group, and when a new query is issued, they will no longer include the removed group in the query. So removing the user from group_1 would new create a query like this:
q=mainquery
&fq=access_roles:group_2
All documents that require group 1 will no longer be accessible to the user.
This allows most changes to be done in real-time w/out the need to reindex the documents. The only reason you would have to reindex for security reasons is if you decided that a particular group should no longer have access to a document.
In many real-world scenarios, that should be a relatively uncommon occurrence. It seems much more likely that HR documents will always be available to the HR department, however a specific user may not always be part of the HR group.
Hope that helps.
You can implement your security model using Solr's PostFilter. For more information see http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
Note: you should probably cache your access rights otherwise performance will be terrible.
Keeping in mind that solr is pure text based search engine,indexing system,to facilitate fast searching, you should not expect RDMS style capabilities from it. solr does not provide security for documents being indexed, you have to write such an implementation if you want. In that case you have two options.
1)Just index documents into solr and keep authorization details into RDBMS.Now query solr for your search and collect the results returned.Now fire another query to DB for the doc ids returned by solr to see if the user has an access to them or not.Filter out those documents on which user in action has no access.You are done ! But not really, your problem starts from here only.Assume, what if all results returned by solr gets filtered out ? (Assuming you are not accessing all the documents at a time,means you are retrieving top 1000 results only from solr result set,otherwise you can not get fast search) You have to query solr again for next bunch of result set and have to iterate these steps until you get enough results to display.
2)Second approach to this is to index authorization meta data along with document in solr.Same as aitchnyu has explained.But to answer your query for document sharing to an external user,along with usergroup and role detail, you index these external user's userid into access_roles field or you can just add an another field to your schema 'access_user' too. Now you can modify search queries for external user's sharing to include access_user field into your filter query.
e.g
q=mainquery
&fq=access_roles:group_1
&fq=access_user:externaluserid
Now the most important thing, update to an indexed documents.Well its off course tedious task, but with careful design and async processing along with solrs partial document update feature(solr 4.0=>), you can achieve reasonably good TPS with solr. If you are using solr <4.0 you can have separate systems for both searching and updates and with care full use of load balancer and master slave replication strategies you will have smile on your face !
There are no built in mechanisms for Solr that I am aware of that will allow you to control access to documents without maintaining the rights in the metadata. The approach outlined by aitchnyu seems reasonable if you keep it a true role level and not assign user specific permissions to a document. That way you can assign roles to users and this will grant them the ability to see documents in the index. Granted you will still need to reindex documents when the roles change, but hopefully you can identify most of the needed roles ahead if time and reduce the need for frequent reindexing.
I need a list of all the users common to a known collection of groups, using a single LDAP query of our Active Directory. It would seem, from the our reading so far, that such is not possible, but I thought it best to ask the hive mind.
Try this:
(&(objectCategory=Person)
(&
(memberOf=CN=group1,dc=company,dc=local)
(memberOf=CN=group2,dc=company,dc=local)
(memberOf=CN=group3,dc=company,dc=local)
)
)
This is similar to my question, except there I wanted all users who were NOT members of groups. You'll need to delete all the whitespace for most query tools to work.
Yes it's possible with an attribute scoped query. It requires W2K3 AD or later but will give you the all of the users that have a particular attribue i.e. membership in a group or in your case multiple groups (intersection of groups). One of the best examples is from Joe Kaplan and Ryan Dunns book "The .NET Developers Guide to Directory Services Programming" for AD work it's hard to beat look at page 179 for a good walk through.
Caveat:At this point you are past trivial searches in AD and a number of things are becoming important like the search root, scope and the effect of searching through some potentially HUGE set of data for the items you want. Looking through 50 or 60K users to find the members of a group does have an effect on performance and be prepared to do paged results or similar in case the dataset is large. Kaplan/Ryan do an excellent job of down to earth work to get you where you need to be. That said, I have used them on two AD projects with great success. Being able retrieve the data from AD without recursive queries is VERY worth while and I found that it is fast as long as I control the size of my dataset.
It's not possible in a single query if your groups contain nested groups.
You would need to write some code that recursively resolves the group members and does the logical equivalent of an "inner join" on the results, producing a list of users that are common to all the original groups.