We have a use case where users must be able to search content that is only available in Groups that they have access to. The search must be across all groups that they have access to.
Some details:
A Group has many Posts, and a user may have access to hundreds of Groups and thousands of Posts within each Group.
A search for "Foo" should return all Groups with "Foo" in the name and all Posts, within the Groups that they have access to, and have "Foo" in the content.
The way I thought of dealing with it is to have a list of user_ids associated with each document index, and then include the user_id in the query string to verify that the user has access. Once the results are returned, we could do an additional check that the user has access to the content before returning the results.
The document index is something like this:
fields = [
    search.TextField(name="data", value="some searchable stuff"),
    search.AtomField(name="post_id", value="id of post"),
    search.AtomField(name="group_id", value="id of group"),
    search.AtomField(name="user_id", value=user_id_1),
    search.AtomField(name="user_id", value=user_id_2),
    # ... add the thousand other users who have access to the group (done in a loop)
]
# then a query run for user 123 would be as follows:
results = index.search("data = Foo AND user_id = 123")
My concern with the above approach:
Every new user who subscribes to a group would require every document in that group to be reindexed to include their user_id.
Is there a better way of handling this use case?
Thanks
Rob
There is no simple answer to your question. You need to plan for (a) a typical use-case, and (b) extreme cases.
If a typical user belongs to 1-3 groups, searching by group_id may be the best solution. You will do 1-2 extra searches, but you won't need to re-index every document every time a user joins or leaves a group, which is prohibitively expensive.
You can have a separate implementation for extreme cases. If a user belongs to more than X groups, it may be more efficient to retrieve all results matching the keyword, and then filter them by group_id.
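A minimal sketch of that two-path approach, assuming a hypothetical get_user_group_ids() helper, a threshold you would tune empirically, and the document layout from the question (group_id stored as an AtomField):

MAX_GROUPS_IN_QUERY = 10  # the "X" above; pick it empirically

def search_for_user(index, keyword, user_id):
    group_ids = get_user_group_ids(user_id)  # hypothetical helper
    if len(group_ids) <= MAX_GROUPS_IN_QUERY:
        # Typical case: OR the user's groups directly into the query string.
        group_clause = " OR ".join("group_id = %s" % g for g in group_ids)
        return index.search("data = %s AND (%s)" % (keyword, group_clause))
    # Extreme case: fetch by keyword only, then filter by group in memory.
    allowed = set(group_ids)
    return [doc for doc in index.search("data = %s" % keyword)
            if any(f.name == "group_id" and f.value in allowed for f in doc.fields)]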
An alternative approach is to always retrieve all results regardless of group_id/user_id, and store them in Memcache. Then you can filter them in memory.
Users tend to search using the same keywords - depending on your corpus, 1% of words may account for up to 99% of searches. If you have a lot of users - and a big enough cache - you will get a lot of cache hits. Note that 1GB of cache can fit tens or even hundreds of thousands of query results. An additional advantage of this approach is that it speeds up all queries, especially phrase or multi-keyword searches.
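A rough sketch of that cache-then-filter idea on App Engine; the cache key scheme, TTL, and the shape of the cached tuples are all assumptions:

from google.appengine.api import memcache

def cached_search(index, keyword, allowed_group_ids):
    cache_key = "search:%s" % keyword
    hits = memcache.get(cache_key)
    if hits is None:
        # Cache the unfiltered result set once; all users share it.
        hits = [(doc.field("group_id").value, doc.doc_id)
                for doc in index.search("data = %s" % keyword)]
        memcache.set(cache_key, hits, time=600)  # 10-minute TTL
    allowed = set(allowed_group_ids)
    return [doc_id for group_id, doc_id in hits if group_id in allowed]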
Related
If I have an index with 10,000,000 documents and search text and ask to retrieve the top 1,000 items, is there a way to scope the facets to those 1,000 items?
My current problem is:
We have a very large index with a few different facets, including manufacturer. If I search for a product (WD-40 for instance), it matches a lot of different documents and document fields. It finds the product and it is the top-scoring match, but because the manufacturer only makes 1 or 2 products, the manufacturer doesn't show up as a top facet option, since the facet is sorted by count.
Is there a way to scope the facets to the top X documents? Or, is there a way to only grab documents which are above a certain @search.score?
The purpose of a refiner is to give users options to narrow down the result set. I would say the $top parameter and the returned facets work as they should. Trying to limit the refiners to the top 1000 results is a bad idea when we think about it: you'll end up with confusing usability and recall issues.
Your query for WD-40 returns a large result set. So large that there are 155,347 unique manufacturers listed. I'm guessing you have several million hits. The intent of that query is to return the products called WD-40 (my assumption). But, since you search all properties in all types of content, you end up with various products like doors, hinges, and bikes that might have some text saying "put some WD-40 on it to stop squeaks".
I'm guessing that most of the hits you get are irrelevant. Thus, you should either limit the scope of your initial query by default (for example, search only the title property) or add a filter to exclude a category of documents (like manuals, price lists, etc.).
You could also consider submitting different queries from your frontend application. One narrowly scoped query that retrieves the refiners and another, broader query that returns the results.
I don't have a relevant data set to test on, but I believe the $top parameter might do what you want. See this link:
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#top-optional
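For reference, a hedged sketch of passing $top and facets together against the Azure Search REST API; the service name, index name, API version, and key are placeholders:

import requests

URL = ("https://myservice.search.windows.net"
       "/indexes/products/docs/search?api-version=2020-06-30")

body = {
    "search": "WD-40",
    "top": 1000,                 # the $top parameter from the link above
    "facets": ["manufacturer"],  # request manufacturer counts alongside results
}
resp = requests.post(URL, json=body, headers={"api-key": "<query-key>"})
print(resp.json().get("@search.facets"))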
That said, there are other approaches to solve your use case.
Normalize your data
I don't know how clean your data is. But, for any data set of this size, it's common that the manufacturer name is not consistent. For example, your manufacturer may be listed as
WD40 Company
WD-40 Company
WDFC
WD 40
WD-40 Inc.
...
Normalizing will greatly reduce the number of values in your refiners. It's probably not enough for your use case, but still worth doing.
Consider adding more refiners
When you have a refiner with too many options, it's always a good idea to consider adding more refiners with coarser values. For example, a category, or perhaps a simple refiner that splits the results in two: "Physical vs. Digital" product as a first choice, consumer vs. professional product, in stock or on back-order. This pattern allows users to quickly reduce the result set without having to use the brand refiner.
Categorize your refiner with too many options
In your case, the manufacturer refiner contains too many options. I have seen examples where people add a search box within the refiner. Another option is to group your refiner options into buckets. For text values like a manufacturer, you could generate a new property containing the first character of the manufacturer's name. That way you could present a refiner that lets users select a manufacturer from A-Z.
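As a sketch, that bucket value could be derived at indexing time; the field names here are invented:

def add_manufacturer_bucket(doc):
    # Derive an A-Z refiner value from the manufacturer name.
    name = (doc.get("manufacturer") or "").strip()
    first = name[:1].upper()
    doc["manufacturer_letter"] = first if first.isalpha() else "#"  # digits/symbols share a bucket
    return doc

print(add_manufacturer_bucket({"manufacturer": "WD-40 Company"}))
# {'manufacturer': 'WD-40 Company', 'manufacturer_letter': 'W'}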
I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    # ... more properties

class X2(ndb.Model):
    p1 = ndb.StringProperty()
    # ... more properties

def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    record_list = []
    if q == "a":
        qry = X1.query(X1.p1 == w)
        record_list = qry.fetch()
    elif q == "b":
        qry = X2.query(X2.p1 == w)
        record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    # ... more properties

def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    if q == "a":
        k = ndb.Key("type_1", "k1")
    elif q == "b":
        k = ndb.Key("type_2", "k1")
    # filter arguments must come before the ancestor keyword argument
    qry = X1.query(X1.p1 == w, ancestor=k)
    record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance as I scale up the number of entities?
Would there be a significant impact on query performance if I scale up the ancestors (horizontally, at the same hierarchy level) to 10,000 or 100,000 in approach 2?
Is this application the correct use case for ancestors?
Context:
This project is for understanding GAE better, and the goal is to create an e-commerce website like amazon.com where I need to query based on many (about 10) filter conditions (price range, brand, screen size, and so on). Each filter condition has a few ranges (for example, there could be five price bands); multiple ranges of a filter condition can be selected simultaneously. Multiple filter conditions can be selected, just like in the left pane on amazon.com.
If I put all the filter conditions in the query in the form of an AND/OR-connected expression, it would take a huge amount of time for scaled data sets, even if I use a query cursor and fetch page by page.
To overcome this, I thought I would store the data in entities whose parent is a string: a concatenation of the different filter options which the product matches. There would be a lot of redundancy, as I would store the same data in several entities for all the combinations of filter values it satisfies. The disadvantage of this approach is that each product is stored multiple times in different entities (much more storage); but I was hoping to get much better query performance (<2 seconds), since now my query string would contain only one or two AND/OR-connected elements apart from the ancestor. The ancestor would be the concatenation of the filter conditions which the user has selected to search for a product.
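If I read the strategy right, a minimal sketch would look like this; the kind name and separator are invented:

from google.appengine.ext import ndb

def filter_combo_key(selected):
    # e.g. {'brand': 'acme', 'price': '0-100'} -> 'brand=acme|price=0-100'
    name = "|".join("%s=%s" % kv for kv in sorted(selected.items()))
    return ndb.Key("FilterCombo", name)  # hypothetical kind

# At write time the product is stored once per matching combination;
# at query time one ancestor query replaces the big AND/OR expression:
ancestor = filter_combo_key({"brand": "acme", "price": "0-100"})
# results = Product.query(ancestor=ancestor).fetch(15)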
Please let me know if I am not clear. This is just an experimental approach that I am trying. Another approach would have been to cache the results through a cron job periodically.
Any other suggestion to achieve good query performance for such a website would be highly appreciated.
UPDATE (NEW STRATEGY):
I have decided to go with a model with some boolean properties (flags) for each range of each category (about 14 such properties per entity). For one category, which had two possible values, I have three models (one holding all entities with either of the two values, and the other two holding entities with each value), so there is duplication (the same data could be stored twice in two entities).
Also, my complete product data model is a separate one; the model above contains a key to this complete model.
I could not do away with the Query class and write my own filtering (I actually did that with good success initially). The reason is that I need to fetch results page by page (~15 results), and I need to sort them too. If I fetch all results and apply my own filtering, with a large data set the fetch alone takes a huge amount of time because of the size of the results returned.
The initial development server results look good: query execution time is <3 seconds for ~6000 matched entities (though I wished for ~1 second). I need to scale up the production datastore to test there.
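A minimal sketch of the flag model described in this update, with invented property names (note that combining equality filters with a sort order requires a composite index):

from google.appengine.ext import ndb

class ProductIndex(ndb.Model):
    # one boolean flag per filter range (~14 in total)
    price_band_1 = ndb.BooleanProperty(default=False)
    brand_acme = ndb.BooleanProperty(default=False)
    rating = ndb.IntegerProperty()              # used for sorting
    product = ndb.KeyProperty(kind="Product")   # link to the full product entity

qry = (ProductIndex.query(ProductIndex.price_band_1 == True,
                          ProductIndex.brand_acme == True)
       .order(-ProductIndex.rating))
results, cursor, more = qry.fetch_page(15)  # page-by-page fetching with a cursor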
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way:
Write throughput (1 write/sec per entity group)
Query inequality filter limits
Cross-entity-group transactions at write time (duplicating your product in each "query filter" specific entity group)
Max entity size (1MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion, your first solution will get overly complex as you add new filtering criteria, types of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple though, just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a rough result set, refine it by hand (map-reduce job, async task...), and then cache the result for as long as you can.
-> Your aggressive caching solution looks far better from a performance, cost, and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.
Just my 2 cents :) I'll be curious to see what solution you end up adopting.
You typically use ancestors for data that is owned by another entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its Writer is quite meaningless.
User is the ancestor of BlogPost.
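A hedged NDB sketch of the Book/Page example (kinds and properties invented):

from google.appengine.ext import ndb

class Book(ndb.Model):
    title = ndb.StringProperty()

class Page(ndb.Model):
    number = ndb.IntegerProperty()
    text = ndb.TextProperty()

book_key = ndb.Key(Book, "moby-dick")
Book(key=book_key, title="Moby-Dick").put()
Page(parent=book_key, number=1, text="Call me Ishmael.").put()

# An ancestor query returns only this book's pages, with strong consistency.
pages = Page.query(ancestor=book_key).order(Page.number).fetch()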
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additional "type" attribute to determine whether you're talking about X type 1 or X type 2.
I've been trying to solve this problem for a week and couldn't come up with any solutions in all my research, so I thought I'd ask you all.
I have a "Product" table and a "productSent" table, here's a quick scheme to help explain:
class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

# the key name here is md5(Product key + UUID)
class productSent(ndb.Model):
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest-rated products that they've never seen before--fast. To keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher-ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users like product #25, making it the #10 product in the database--I'm worried that the product will never be shown to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that once the user has blown past the first 1,000 products, query performance really starts to slow down: I'm literally pulling 1,000+ results, checking whether each has been sent by querying against the productSent table (doing a key-name lookup to speed things up), and looping until 15 new ones have been found.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase my storage costs.
Has anyone dealt with this problem before? (I'm sure someone has!) Any tips on the best way to structure it?
update
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative, and sort by that. Of course, items that have a rating of 0 and get a 'like' (making them positive) would still have a chance of being out of order or skipped by the cursor, but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent, would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest-rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent, you could speed the query up even more, since you will only have to query against one model.
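A rough sketch of that inverse model; the names are assumptions, and the rating is denormalized onto the entity as suggested (the equality filter plus sort order will need a composite index):

from google.appengine.ext import ndb

class ProductNotSent(ndb.Model):
    user = ndb.KeyProperty(kind="userData")
    product = ndb.KeyProperty(kind="Product")
    rating = ndb.IntegerProperty()  # copied from Product so one query suffices

def next_unseen(user_key, n=15):
    return (ProductNotSent.query(ProductNotSent.user == user_key)
            .order(-ProductNotSent.rating)
            .fetch(n))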
Another idea would be to limit the number of productNotSent entities per user, so each user only has ~100 of these entities at a time. This would mean your query time would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd need a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You may also want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (I only did a quick perusal of your question), but you may consider using JSON TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dumps()-ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly with indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer record updates, etc.
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach should be an order of magnitude cheaper than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your JSON objects are fairly simple and lightweight to avoid this (e.g. do not include the product description, sub-objects, etc. in the JSON item, just the basics such as product number). HTH, -stevep
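A minimal sketch of the TextProperty idea under those constraints; the model and payload shape are invented:

import json
from google.appengine.ext import ndb

class UserFeed(ndb.Model):
    payload = ndb.TextProperty()  # JSON blob; TextProperty is not indexed

# Write: dump a lightweight list of product records for this user.
items = [{"num": 101, "rating": 4}, {"num": 102, "rating": 5}]
UserFeed(id="user-123", payload=json.dumps(items)).put()

# Read: get_by_id() replaces a query entirely; parse here or on the client.
feed = UserFeed.get_by_id("user-123")
records = json.loads(feed.payload)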
I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models and I need help choosing the right one.
All the data is represented in two classes User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1. Search 10,000,000 Word entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query 1,000 users for the one I need and then call getWords(), which returns the List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2 is better, simply because you don't need to search millions of entities. Who knows for sure, though? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
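The question is Java, but the pattern is quickest to sketch in Python NDB (names invented); the same structure carries over to the Java datastore API:

from google.appengine.ext import ndb

class User(ndb.Model):
    email = ndb.StringProperty()

class Word(ndb.Model):
    text = ndb.StringProperty()

# The username is the entity ID, so the user is fetched by key, not by query.
user_key = ndb.Key(User, "john")
user = user_key.get()

# An ancestor query returns only John's words.
words = Word.query(ancestor=user_key).fetch(100)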
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, option 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
Option 2 will be a bit more work on your part, though, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000, I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.
For example, I have Documents A, B, C. User 1 must only be able to see Documents A and B. User 2 must only be able to see Document C. Is it possible to do this in Solr without filtering by metadata? If I use a metadata filter, every time there are access-right changes I have to reindex.
[update 2/14/2012] Unfortunately, in the client's case, change is frequent. Data is confidential and usually managed only by the owners, who are internal users. The specific case is that they need to be able to share those documents with certain external users and specify access levels for those users. Most of the time this is an ad hoc task, not identified ahead of time.
I would suggest storing the access roles (yes, it's plural) as document metadata. Here the required field access_roles is a facet-able, multi-valued string field.
Doc1: access_roles:[user_jane, manager_vienna] // Jane and the Vienna branch manager may see it
Doc2: access_roles:[user_john, manager_vienna, special_team] // John, the Vienna branch manager, and members of the special team may see it
The user owning the document is a default access role for that document.
To change the access roles of a document, you edit access_roles.
When Jane searches, the access roles she belongs to will be part of the query, and Solr will retrieve only the documents that match those roles.
When Jane (user_jane), manager at the Vienna office (manager_vienna), searches, her searches go like:
q=mainquery
&fq=access_roles:(user_jane OR manager_vienna)
&facet=on
&facet.field=access_roles
which fetches all documents that contain user_jane OR manager_vienna in access_roles: Doc1 and Doc2. (Note that separate fq parameters are intersected by Solr, so the user's roles must be ORed together inside a single fq.)
When Bob (user_bob), member of a special team (special_team), searches:
q=mainquery
&fq=access_roles:(user_bob OR special_team)
&facet=on
&facet.field=access_roles
which fetches Doc2 for him.
Queries adapted from http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
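However the roles are stored, the per-request filter is cheap to build; a sketch assuming the roles come from your authentication layer (again, separate fq parameters are ANDed by Solr, so the roles are ORed inside one fq):

def role_filter(roles):
    # Match documents visible to ANY of the user's roles.
    return "access_roles:(%s)" % " OR ".join(roles)

params = {
    "q": "mainquery",
    "fq": role_filter(["user_jane", "manager_vienna"]),
    "facet": "on",
    "facet.field": "access_roles",
}
# -> fq=access_roles:(user_jane OR manager_vienna)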
You might want to check the document-level security patches:
https://issues.apache.org/jira/browse/SOLR-1872
https://issues.apache.org/jira/browse/SOLR-1834
I think my approach would be similar to aitchnyu's answer. I would, however, NOT use individual users in the metadata.
If you create groups for each document, then you will have to reindex for security reasons less often.
For a given document, you might have access_roles: group_1, group_3
In this way, group_1 and group_3 always retain rights to the document. However, I can vary which groups each user belongs to and adjust the query accordingly.
When the query is generated, it always passes the user's groups as part of the query. If I belong to group_1 and group_2, my query will look like this:
q=mainquery
&fq=access_roles:(group_1 OR group_2)
Since the groups are dynamically inserted into the query, I simply remove a user from the group, and when a new query is issued, it will no longer include the removed group. So removing the user from group_1 would now create a query like this:
q=mainquery
&fq=access_roles:group_2
All documents that require group 1 will no longer be accessible to the user.
This allows most changes to be done in real time without the need to reindex the documents. The only reason you would have to reindex for security reasons is if you decided that a particular group should no longer have access to a document.
In many real-world scenarios, that should be a relatively uncommon occurrence. It seems much more likely that HR documents will always be available to the HR department, however a specific user may not always be part of the HR group.
Hope that helps.
You can implement your security model using Solr's PostFilter. For more information see http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
Note: you should probably cache your access rights otherwise performance will be terrible.
Keep in mind that Solr is a pure text-based search engine and indexing system built to facilitate fast searching; you should not expect RDBMS-style capabilities from it. Solr does not provide security for the documents being indexed; you have to write such an implementation yourself if you want one. In that case you have two options.
1) Just index documents into Solr and keep authorization details in an RDBMS. Query Solr for your search and collect the results returned. Then fire another query to the DB for the doc ids returned by Solr to see whether the user has access to them, and filter out the documents the user in action has no access to. You are done! But not really: your problem starts from here. What if all the results returned by Solr get filtered out? (Assuming you are not fetching all documents at once, i.e. you retrieve only the top 1,000 results from the Solr result set; otherwise you cannot get fast search.) You have to query Solr again for the next bunch of results and iterate these steps until you get enough results to display.
2) The second approach is to index authorization metadata along with the document in Solr, the same as aitchnyu has explained. But to handle document sharing with an external user: along with the user group and role details, you index these external users' user ids into the access_roles field, or you can just add another field, access_user, to your schema. Now you can modify search queries for external-user sharing to include the access_user field in your filter query.
e.g
q=mainquery
&fq=access_roles:group_1
&fq=access_user:externaluserid
Now the most important thing: updates to indexed documents. It is of course a tedious task, but with careful design and async processing, along with Solr's partial document update feature (Solr 4.0+), you can achieve reasonably good TPS with Solr. If you are using Solr < 4.0, you can have separate systems for searching and updates, and with careful use of a load balancer and master-slave replication strategies you will have a smile on your face!
There are no built-in mechanisms for Solr that I am aware of that will allow you to control access to documents without maintaining the rights in the metadata. The approach outlined by aitchnyu seems reasonable if you keep it at a true role level and do not assign user-specific permissions to a document. That way you can assign roles to users, and this will grant them the ability to see documents in the index. Granted, you will still need to reindex documents when the roles change, but hopefully you can identify most of the needed roles ahead of time and reduce the need for frequent reindexing.