The right record access implementation - solr

I am looking into indexing engines, specifically Apache Lucene Solr. We would like to use it for our searches, but one of the problems our framework's search solves is row-level access.
Solr does not provide record access out of the box:
<...> Solr does not concern itself with security either at the document level or the communication level.
And in the section about document level security:
There are a few suggestions: either use Manifold CF (which is largely undocumented and seems to be at a very pre-beta stage) or write your own request handler/search component (that part is marked as a stub). I guess the latter would have a bigger impact on performance.
So I assume not much is being done in this field.
The recently released 4.0 version of Solr introduced joins between two indexed entities. Joining might seem like a nice idea, since our framework also does a join to find out whether a record is accessible to the user. The problem is that sometimes we do an inner join and sometimes an outer join, depending on whether the security setting in the scope is optimistic (everything that is not forbidden is allowed) or pessimistic (everything is forbidden except what is explicitly allowed).
To give a better understanding of what our structure looks like:
Documents:
DocumentNr | Name
1 | Foo
2 | Bar
DocumentRecordAccess:
DocumentNr | UserNr | AllowRead | AllowUpdate | AllowDelete
1 | 1 | 1 | 1 | 0
So, for example, the generated query for Documents in the pessimistic security setting would be:
SELECT * FROM Documents AS d
INNER JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
This would return only Foo, but not Bar. And in the optimistic setting:
SELECT * FROM Documents AS d
LEFT JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
This returns both Foo and Bar.
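To make the two settings easy to reproduce, here is a minimal sketch that runs both queries against an in-memory SQLite database. The table and column names come from the structure above; the helper name accessible_names and the use of SQLite are assumptions made only for illustration, not part of our framework.

```python
import sqlite3


def accessible_names(conn, user_nr, pessimistic):
    """Return document names for a user with the join type chosen by the security setting."""
    # Pessimistic: INNER JOIN keeps only rows with an explicit read grant.
    # Optimistic: LEFT JOIN keeps every document, grant or not.
    join = "INNER JOIN" if pessimistic else "LEFT JOIN"
    sql = (
        "SELECT d.Name FROM Documents AS d "
        f"{join} DocumentRecordAccess AS dra "
        "ON dra.DocumentNr = d.DocumentNr AND dra.AllowRead = 1 AND dra.UserNr = ? "
        "ORDER BY d.DocumentNr"
    )
    return [row[0] for row in conn.execute(sql, (user_nr,))]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Documents (DocumentNr INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE DocumentRecordAccess (
        DocumentNr INTEGER, UserNr INTEGER,
        AllowRead INTEGER, AllowUpdate INTEGER, AllowDelete INTEGER
    );
    INSERT INTO Documents VALUES (1, 'Foo'), (2, 'Bar');
    INSERT INTO DocumentRecordAccess VALUES (1, 1, 1, 1, 0);
    """
)

print(accessible_names(conn, 1, pessimistic=True))
print(accessible_names(conn, 1, pessimistic=False))
```

Any Solr-side solution would have to reproduce both behaviours, which is why a single kind of join is not enough for us.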
Coming back to my question - maybe someone has already done this and can share their insight and experience?

I am afraid there's no easy solution here. You will have to sacrifice something to get ACLs working together with the search.
1. If your corpus is small (I'd say up to 10K documents), you could create a cached bit set of forbidden (or allowed, whichever is less verbose) documents and send the relevant filter query (+*:* -DocumentNr:1 ... -DocumentNr:X). Needless to say, this doesn't scale. Sending large queries will make the search a bit slower, but this is manageable (up to a point, of course). Query parsing is cheap.
2. If you can somehow group these documents and apply ACLs to document groups, this would cut down the query length and the above approach would fit perfectly. This is pretty much what we are using: our solution implements a taxonomy and applies taxonomy permissions via an fq query.
3. If you don't need to show the overall result set count, you can run your query and filter the result set on the client side. Again, not perfect.
4. You can also denormalize your data structures and store both tables flattened in a single document like this:
DocumentNr: 1
Name: Foo
Allowed_users: u1, u2, u3 (or Forbidden_users: ...)
The rest is as easy as sending the user id with your query.
This is only viable if the ACLs rarely change and you can afford reindexing the entire corpus when they do.
5. You could write a custom query filter which would hold cached BitSets of allowed or forbidden documents per user (or group?), retrieved from the database. This would require not only giving the Solr webapp access to the DB but also extending/repackaging the .war which comes with Solr. While this is relatively easy, the harder part would be cache invalidation: the main app would somehow have to signal the Solr app when ACL data changes.
Options 1 and 2 are probably more reasonable if you can put Solr and your app onto the same JVM and use the javabin driver.
It's hard to advise more without knowing the specifics of the corpus/ACLs.
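For the exclusion filter query and the denormalized Allowed_users field described above, a minimal sketch of building the filter strings could look like this. The function names are made up for illustration, and the second one assumes Allowed_users is indexed as a multi-valued string field, which the answer does not state explicitly.

```python
def exclusion_fq(forbidden_doc_nrs):
    """Build a filter such as '+*:* -DocumentNr:1 -DocumentNr:7' from forbidden ids."""
    # int() guards against anything other than plain document numbers.
    clauses = ["+*:*"] + [f"-DocumentNr:{int(nr)}" for nr in sorted(forbidden_doc_nrs)]
    return " ".join(clauses)


def allowed_user_fq(user_id):
    """Build a filter on the denormalized Allowed_users field for one user id."""
    # Quote the value and escape backslashes and quotes so it stays a single term.
    escaped = str(user_id).replace("\\", "\\\\").replace('"', '\\"')
    return f'Allowed_users:"{escaped}"'


print(exclusion_fq({7, 1}))
print(allowed_user_fq("u1"))
```

The resulting string would be sent as the fq parameter next to the user's actual query, so the ACL part stays cacheable independently of the search terms.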

I agree with mindas. I implemented my solution the way he suggested (solution 4), but the difference is that I have a few different types of ACLs: at user group level, at user level and even at document level (private access).
The solution is working fine. But the main concern in my case is that the ACLs change frequently and have to be updated in the index, while search performance should not be affected in the meantime.
I am trying to manage this with load balancing and by adding a few more nodes to the cluster.
mindas, unicron, can you please share your thoughts on this?


Salesforce SOQL query - Jersey read timeout error

I'm having a problem on a batch job that has a simple SOQL query that returns a lot of records. More than a million.
The query, as it is, cannot be optimized much further according to SOQL best practices. (At least, as far as I know. I'm not an SF SOQL expert.)
The problem is that I'm getting -
Caused by: Read timed out
I tried bumping up the Jersey read timeout value from 30 seconds to 60 seconds, but it still times out.
Any recommendations on how to deal with this issue? Is there a recommended value for the read timeout parameter for a query that returns that much data?
The query is like this:
SELECT Id, field1, field2__c, field3__c, field3__c FROM Object__c
WHERE field2__c = true AND (not field3 like '\u0025Some string\u0025')
ORDER BY field4__c ASC
In no specific order...
Batches written in Apex time out after 2 minutes, so maybe set the same timeout in your Java application.
Run your query in the Developer Console using the query plan feature (you will probably have to put a real % in there, not \u0025). Pay attention to which part has a "Cost" column > 1.
What are the field types? Plain checkbox and text, or some complex formulas?
Is that text static, or does it change depending on what your app needs? Would you consider filtering out the string in your code rather than in SOQL? It is counter-intuitive to return more records than you really need, but it might be an option.
Would you consider making a formula field with either the whole logic or just the string search and then asking SF to index the formula? Or maybe making another field (another checkbox?) holding the "yes, it contains that text" info, with the value set by a workflow (essentially preparing your data a bit so you can query it efficiently later)?
Read up about skinny tables and see if it's something that could work for you (needs SF support).
Can you make an analytic snapshot of your data (make a report, make SF save the results to a helper object, query that object)? Even if it just contained lookups to your original source, so that you always access fresh values, it could help. It might be a storage killer though.
Have you considered "big objects" and async SOQL?
I'm not proud of it but in the past I had some success badgering the SF database. Not via API but if I had a nightly batch job that was timing out I kept resubmitting it and eventually 3rd-5th time it managed to start. Something in the query optimizer, creation of cursor in underlying Oracle database, caching partial results... I don't know.
What's in the ORDER BY? Some date field? If you need the records updated since X first, then maybe the replication API could help by getting the ids first.
Does it make sense to use LIMIT 200, for example? Which API are you using, SOAP or REST? It might be that returning smaller chunks (SOAP: batch size, REST API: special header) would help it finish faster.
When all else fails (but do contact SF support and make sure you have exhausted the options), maybe restructure the whole thing. Make SF push data to you whenever it changes rather than pulling it. There's the "Streaming API" (CometD implementation, Bayeux protocol, however these are called) and "Change Data Capture" and "Platform Events" for nice event bus-driven architecture decisions, with replay of old events up to 3 days back if the client was down and couldn't listen... But that's a totally different topic.
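On the point about returning smaller chunks: a minimal sketch of consuming a large result page by page, so that no single HTTP read has to carry the whole result set. It assumes the REST query resource, whose JSON responses carry records, done and nextRecordsUrl, and where the chunk size can be requested with the Sforce-Query-Options: batchSize=... header. The fetch_json callable is hypothetical; it stands for whatever HTTP client call you already use (Jersey in your case).

```python
def iter_records(fetch_json, first_url):
    """Yield records page by page, following nextRecordsUrl until done is true.

    fetch_json is a hypothetical callable that performs the HTTP GET for a URL
    (with auth and any batch-size header) and returns the decoded JSON dict.
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("records", [])
        # Stop when the server reports the last page.
        url = None if page.get("done", True) else page.get("nextRecordsUrl")


# Stand-in for the real HTTP calls, just to show the control flow.
fake_pages = {
    "/query?q=...": {"done": False, "nextRecordsUrl": "/query/next-1",
                     "records": [{"Id": "a"}, {"Id": "b"}]},
    "/query/next-1": {"done": True, "records": [{"Id": "c"}]},
}

ids = [record["Id"] for record in iter_records(fake_pages.__getitem__, "/query?q=...")]
print(ids)
```

Each page is a separate request with its own read timeout, so the timeout only has to cover one chunk. It does not help if the server already spends too long producing the first chunk, which is where the query plan advice comes in.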

How to handle frequently changing multivalue string fields in SOLR?

I have a SOLR (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well, if everything is warmed up.
The problem is a multi-value string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes way too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside SOLR and to inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't get what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warmup my caches again afterwards.
I'm not sure that JIRA issue will help you: it seems to be an advanced topic and, most importantly, it is still unresolved, so it is not yet available.
Partial document updates are not useful here because a) they require everything to be stored in your schema and b) behind the scenes they reindex the whole document again.
From what you say, it seems you have one monolithic index. Have you considered splitting the index using sharding or SolrCloud? That way each "portion" would be smaller and autowarming shouldn't be a big problem.

Everything in one "table" on app engine?

This question refers to database design using app engine and objectify. I want to discuss pros and cons of the approach of placing all (or let's say multiple) entities into a single "table".
Let's say I have a (very simplified) data model of two entities:
class User {
    @Index Long userId;
    String name;
}
class Message {
    @Index Long messageId;
    String message;
    private Ref<User> recipient;
}
At first glance, it makes no sense to put these into the same "table" as they are completely different.
But let's look at what happens when I want to search across all entities. Let's say I want to find and return users and messages, which satisfy some search criteria. In traditional database design I would either do two separate search requests, or else create a separate index "table" during writes where I repeat fields redundantly so that I can later retrieve items in a single search request.
Now let's look at the following design. Assume I would use a single entity, which stores everything. The datastore would then look like this:
Type | userId | messageId | Name | Message
USER | 123456 | empty | Jeff | empty
MESSAGE | empty | 789012 | Mark | This is text.
See where I want to go? I could now search for a Name and would find all Users AND Messages in a single request. I would even be able to add an index field, something like
@Index List index;
to the "common" entity and would not need to write data twice.
Given the behavior of the datastore that it never returns a record when searching for an indexed field which is empty, and combining this with partial indexes, I could also get the User OR Message by querying fields unique to a given Type.
The cost for storing long (non-normalized) records is not higher than storing individual records, as long as many fields are empty.
I see further advantages:
I could use the same "table" for auditing as well, as every record stored would form a "history" entry (as long as I don't allow updates, in which case I would need to handle this manually).
I can easily add new Types without extending the db schema.
When search results are returned over REST, I can return them in a single List, and the client looks at the Type.
There might be disadvantages as well, for example with caching, but maybe not. I can't see this at this point.
Anybody there, who has tried going down this route or who can see serious drawbacks to this approach?
This is actually how the google datastore works under the covers. All of your entities (and everyone else's entities) are stored in a single BigTable that looks roughly like this:
{yourappid}/{key}/{serialized blob of your entity data}
Indexes are stored in three BigTables shared across all applications. I try to explain this in a fair amount of detail in my answer to this question: efficient searching using appengine datastore ancestor paths
So to rephrase your question, is it better to have Google maintain the Kind or to maintain it yourself in your own property?
The short answer is that having Google maintain the Kind makes it harder to query across all Kinds but makes it easier to query within one Kind. Maintaining the pseudo-kind yourself makes it easier to query across all Kinds but makes it harder to query within one Kind.
When Google maintains the Kind as per normal use, you already understand the limitation: there is no way to filter on a property across all different kinds. On the other hand, using a single Kind with your own discriminator means you must add an extra filter() clause every time you query:
ofy().load().type(Anything.class).filter("discriminator", "User").filter("name >", "j")
Sometimes these multiple-filter queries can be satisfied with zigzag merges, but some can't. And even the ones that can be satisfied with zigzag aren't as efficient. In fact, this tickles the specific degenerative case of zigzags - low-cardinality properties like the discriminator.
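To make the trade-off concrete, here is a toy sketch using plain Python dictionaries as stand-ins for entities in one shared "table". This is not the datastore or Objectify API, and the field names simply follow the table in the question. It only shows that a cross-type search needs one condition while a within-type search always needs the additional discriminator condition.

```python
ENTITIES = [
    {"type": "USER", "userId": 123456, "name": "Jeff"},
    {"type": "MESSAGE", "messageId": 789012, "name": "Mark", "message": "This is text."},
    {"type": "MESSAGE", "messageId": 789013, "name": "Jeff", "message": "Another text."},
]


def find_by_name(entities, name, type_=None):
    """Search across all types, or within one type when a discriminator is given."""
    return [
        e for e in entities
        if e.get("name") == name and (type_ is None or e["type"] == type_)
    ]


everything_named_jeff = find_by_name(ENTITIES, "Jeff")       # one condition
only_users_named_jeff = find_by_name(ENTITIES, "Jeff", "USER")  # two conditions
print(len(everything_named_jeff), len(only_users_named_jeff))
```

In the real datastore that second condition is an extra indexed filter on a low-cardinality property, which is the expensive part described above.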
Your best bet is to pick and choose your shared Kinds carefully. Objectify makes this easy for you with polymorphism:
A polymorphic type hierarchy shares a single Kind (the kind of the base @Entity); Objectify manages the discriminator property for you and ensures queries like ofy().load().type(Subclass.class) are converted to the correct filter operation under the covers.
I recommend using this feature sparingly.
One SERIOUS drawback to that will be indexes:
every kind of query you run needs its own index to be servable, and then ALL your writes will need to update ALL of these index tables (for NO reason, in a good number of cases).
I can't think of other drawbacks at the moment, except the limit of a meg per entity (if you have a LOT of types with a LOT of values, you might run into this as you end up having a gazillion columns).
Not to mention how big your ONE entity model would be, and how convoluted your code to "triage" your entity types could end up being.

google app engine query optimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
There are several cases.
First, you keep track of all the ids of a user's posts. You must use an entity group for consistency, which means the write rate to the datastore would be about 1 entity per second. The cost is 1 read for the object holding the ids plus 1 read per entity.
Second, you just use a query. This does not require consistency. The cost is 1 read + 1 read per entity retrieved.
Third, you query only the keys and fetch the entities afterwards. The cost is 1 read + 1 small operation per key retrieved. See: Keys-Only Queries. In terms of cost this is equal to a projection query.
And if you have many results and use pagination, then you need to use Query Cursors. They prevent wasted datastore operations.
The most economical solution is the third case. See: Batch Operations.
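The operation counts from the three cases can be written down as a small calculator. This is only a sketch of the cost model exactly as stated in this answer; the function name and strategy labels are made up, and the keys-only case counts just the query itself, not any later fetch of the entities.

```python
def datastore_ops(n_posts, strategy):
    """Return the read and small operation counts for fetching n_posts posts."""
    if strategy == "id_list":
        # 1 read for the object holding the ids, 1 read per post entity.
        return {"read": 1 + n_posts, "small": 0}
    if strategy == "query":
        # 1 read for the query, 1 read per entity retrieved.
        return {"read": 1 + n_posts, "small": 0}
    if strategy == "keys_only":
        # 1 read for the query, 1 small operation per key retrieved.
        return {"read": 1, "small": n_posts}
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("id_list", "query", "keys_only"):
    print(name, datastore_ops(50, name))
```

For 50 posts the first two cases both come to 51 reads, while the keys-only query is 1 read plus 50 small operations, which is why it comes out cheapest under this model.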
If you have a list of ids because they are stored with your entity, a call to ndb.get_multi (if you are using NDB, but it would be similar with any other framework that uses memcache to cache single entities) would save you further datastore calls if all (or most) of the entities corresponding to the keys are already in memcache.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would.
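The effect can be illustrated with a cache-aside lookup in which plain dictionaries stand in for memcache and the datastore. This is a sketch of the idea only, not how NDB is implemented, and the function name is made up.

```python
def get_multi_cached(keys, cache, datastore):
    """Return (entities, datastore_calls) for keys, consulting the cache first."""
    found = {k: cache[k] for k in keys if k in cache}
    missing = [k for k in keys if k not in found]
    datastore_calls = 0
    if missing:
        # One batched datastore lookup for everything the cache did not have.
        datastore_calls = 1
        for k in missing:
            if k in datastore:
                found[k] = cache[k] = datastore[k]  # also populate the cache
    # Preserve the order of the requested keys; unknown keys map to None.
    return [found.get(k) for k in keys], datastore_calls


datastore = {1: "post one", 2: "post two", 3: "post three"}
cache = {1: "post one", 2: "post two"}

print(get_multi_cached([1, 2], cache, datastore))
print(get_multi_cached([1, 2, 3], cache, datastore))
print(get_multi_cached([1, 2, 3], cache, datastore))
```

The first and third calls never touch the datastore stand-in, whereas a query would hit the datastore every time.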
See this issue for a discussion and caveats:

SOLR Permissions / Filtering Results depending on Access Rights

For example, I have Documents A, B, C. User 1 must only be able to see Documents A and B. User 2 must only be able to see Document C. Is it possible to do this in SOLR without filtering by metadata? If I use a metadata filter, every time access rights change I have to reindex.
[update 2/14/2012] Unfortunately, in the client's case, change is frequent. Data is confidential and usually managed only by the owners, who are internal users. The specific case is that they need to be able to share those documents with certain external users and specify access levels for those users. Most of the time this is an ad hoc task that is not identified ahead of time.
I would suggest storing the access roles (yes, it's plural) as document metadata. Here the required field access_roles is a facet-able multi-valued string field.
Doc1: access_roles:[user_jane, manager_vienna] // Jane and the Vienna branch manager may see it
Doc2: access_roles:[user_john, manager_vienna, special_team] // John, the Vienna branch manager and members of the special team may see it
The user owning the document is a default access role for that document.
To change the access roles of a document, you edit access_roles.
When Jane searches, the access roles she belongs to will be part of the query. Solr will retrieve only the documents that match the user's access role.
When Jane (user_jane), manager at the Vienna office (manager_vienna), searches, her searches go like:
which fetches all documents which contain user_jane OR manager_vienna in access_roles: Doc1 and Doc2.
When Bob (user_bob), member of the special team (special_team), searches,
which fetches Doc2 for him.
Queries adapted from
Might want to check the Document level Security patches.
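A minimal sketch of how the roles could be turned into request parameters, assuming the access_roles field described above and the standard q and fq parameters. The helper names are made up, and this is not necessarily the exact form of the queries this answer was adapted from.

```python
from urllib.parse import urlencode


def roles_fq(roles, field="access_roles"):
    """Build a filter query such as access_roles:("user_jane" OR "manager_vienna")."""
    if not roles:
        # An empty role list must not silently match everything.
        raise ValueError("a user needs at least one access role")
    quoted = " OR ".join(
        '"' + role.replace("\\", "\\\\").replace('"', '\\"') + '"' for role in roles
    )
    return f"{field}:({quoted})"


def search_params(user_query, roles):
    """URL-encode the user's query together with the role filter."""
    return urlencode({"q": user_query, "fq": roles_fq(roles)})


print(roles_fq(["user_jane", "manager_vienna"]))
print(search_params("contract", ["user_bob", "special_team"]))
```

Keeping the role restriction in fq rather than in q means it does not influence scoring and can be served from the filter cache.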
I think my approach would be similar to @aitchnyu's answer. I would, however, NOT use individual users in the metadata.
If you create groups for each document, then you will have to reindex for security reasons less often.
For a given document, you might have access_roles: group_1, group_3
In this way, the group_1 and group_3 always retain rights to the document. However, I can vary what groups each user belongs to and adjust the query accordingly.
When the query then is generated, it always passes as a part of the query the user's groups. If I belong to group_1 and group_2, my query will look like this:
Since the groups are dynamically generated in the query, I simply remove a user from the group, and when a new query is issued, it will no longer include the removed group. So removing the user from group_1 would now create a query like this:
All documents that require group 1 will no longer be accessible to the user.
This allows most changes to be done in real-time w/out the need to reindex the documents. The only reason you would have to reindex for security reasons is if you decided that a particular group should no longer have access to a document.
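Here is a toy sketch of that behaviour, with an in-memory dictionary standing in for the index and another for the group memberships kept outside Solr. The document names, the user name and the helper are made up for illustration.

```python
# What is indexed with each document: the groups that may see it.
DOCS = {
    "hr_policy": {"group_1", "group_3"},
    "roadmap": {"group_2"},
}

# Kept outside the index (e.g. in the application database).
memberships = {"alice": {"group_1", "group_2"}}


def visible_docs(user, memberships, docs):
    """Return the documents sharing at least one group with the user's current groups."""
    groups = memberships.get(user, set())
    return sorted(name for name, allowed in docs.items() if allowed & groups)


before = visible_docs("alice", memberships, DOCS)
memberships["alice"].discard("group_1")  # revoke membership; DOCS is untouched
after = visible_docs("alice", memberships, DOCS)
print(before, after)
```

Only the membership data changed between the two calls; the indexed documents stayed exactly as they were.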
In many real-world scenarios, that should be a relatively uncommon occurrence. It seems much more likely that HR documents will always be available to the HR department, however a specific user may not always be part of the HR group.
Hope that helps.
You can implement your security model using Solr's PostFilter. For more information see
Note: you should probably cache your access rights otherwise performance will be terrible.
Keep in mind that Solr is a purely text-based search engine and indexing system built to facilitate fast searching, so you should not expect RDBMS-style capabilities from it. Solr does not provide security for the documents being indexed; you have to write such an implementation yourself if you want it. In that case you have two options.
1) Just index the documents into Solr and keep the authorization details in an RDBMS. Query Solr for your search and collect the results returned. Then fire another query at the DB for the doc ids returned by Solr to see whether the user has access to them, and filter out the documents the current user has no access to. You are done! But not really; your problems only start here. What if all the results returned by Solr get filtered out? (Assuming you are not fetching all the documents at once, i.e. you retrieve only the top 1000 results from the Solr result set, otherwise you cannot get a fast search.) You have to query Solr again for the next batch of results and repeat these steps until you have enough results to display.
2) The second approach is to index the authorization metadata along with the document in Solr, as aitchnyu has explained. To handle document sharing with an external user, you index the external user's user id into the access_roles field along with the user group and role details, or you can add another field to your schema, 'access_user'. You can then modify the search queries for external users to include the access_user field in the filter query.
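The loop in the first approach can be sketched as follows. The two callables are hypothetical stand-ins: search_page for a paged Solr request returning doc ids, and allowed_subset for the DB query that returns which of those ids the user may see.

```python
def collect_authorized(search_page, allowed_subset, wanted, page_size=1000):
    """Fetch result pages until `wanted` authorized doc ids are found or results run out."""
    results, start = [], 0
    while len(results) < wanted:
        doc_ids = search_page(start, page_size)
        if not doc_ids:
            break  # the search has no more results
        allowed = allowed_subset(doc_ids)  # one DB round trip per page
        results.extend(d for d in doc_ids if d in allowed)  # keep ranking order
        start += len(doc_ids)
    return results[:wanted]


# Toy stand-ins: ten ranked documents, and only even ids are readable.
ranked = list(range(10))
page = lambda start, rows: ranked[start:start + rows]
even_only = lambda ids: {d for d in ids if d % 2 == 0}

print(collect_authorized(page, even_only, wanted=3, page_size=3))
```

Note that the number of round trips depends on how many results get filtered out, and the total hit count reported by Solr no longer matches what the user can actually see.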
Now the most important thing: updates to indexed documents. It is of course a tedious task, but with careful design and async processing, along with Solr's partial document update feature (Solr 4.0 and later), you can achieve reasonably good TPS with Solr. If you are using Solr < 4.0, you can have separate systems for searching and updates, and with careful use of a load balancer and master-slave replication strategies you will have a smile on your face!
There are no built-in mechanisms in Solr that I am aware of that will allow you to control access to documents without maintaining the rights in the metadata. The approach outlined by aitchnyu seems reasonable if you keep it at a true role level and do not assign user-specific permissions to a document. That way you can assign roles to users, and this will grant them the ability to see documents in the index. Granted, you will still need to reindex documents when the roles change, but hopefully you can identify most of the needed roles ahead of time and reduce the need for frequent reindexing.
