I have a community website with 20,000 members who use a search form every day to find other members. The results are sorted by the most recently connected members. I'd like to use Solr for my search (right now it's MySQL), but first I'd like to know whether it's good practice to update the document of every member who logs in, in order to change their login date and time. There would be around 20,000 document updates a day, and I don't really know if that's too much updating and could hurt performance. Thank you for your help.
20k updates/day is not unreasonable at all for Solr.
OTOH, for very frequently updated fields (imagine one user could log in multiple times a day, so you might want to update it all those times), you can use an ExternalFileField to keep that field stored outside the index (in a text file) and still use it for sorting in Solr.
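A minimal sketch of what that could look like, assuming the unique key field is id and the externally stored field is called last_login (these names, and the float-epoch encoding, are illustrative assumptions, not from the original post):

<!-- schema.xml: keep last_login outside the index in an external file -->
<fieldType name="extLastLogin" class="solr.ExternalFileField" keyField="id" defVal="0" valType="float"/>
<field name="last_login" type="extLastLogin" indexed="false" stored="false"/>

The values then live in a plain text file named external_last_login in the index data directory, one id=value line per member (for example an epoch timestamp), and are reloaded when a new searcher opens:

member_1001=1683021600
member_1002=1683025200

You can then sort with sort=field(last_login) desc without ever reindexing the member documents.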
Generally, Solr is not meant to be used for this purpose; using your database for it is still better.
However, if you want to use Solr, you would deal with it much like a database, i.e. every user document should have a unique field, id for example. When the user logs in, you can run an update against that user's document's last_login_date field by its id. You can read more about Partial Update at this link.
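As a rough sketch, assuming Solr 4+ with atomic updates available, a collection named members and a field named last_login_date (all of these names are illustrative assumptions), the per-login update could look like:

curl 'http://localhost:8983/solr/members/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "12345", "last_login_date": {"set": "2023-05-01T10:15:00Z"}}]'

Keep in mind that atomic updates need the updateLog enabled and every other field stored (or backed by docValues), because Solr rebuilds the whole document behind the scenes.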
Related
I am parsing documents on the web and storing them in a Solr database. Every day I see thousands of documents, and some of them are repeating.
I'd like to give users an option to see which documents were seen the most on a given date, or in a given timespan. The queries of interest correspond to:
-show me which documents were seen the most on 16/10/2022,
-show me which documents were seen the most between 16/10/2022 and 23/10/2022
When writing Solr queries, you specify a field name to search on. What field type should I use, and in what format should I store the number of times the document was seen on a given date?
How I would try it:
Create a separate collection - very simple collection with fields:
view time
doc id
title or body (whatever you're querying)
... do this for EVERY view (a sketch of adding one such view document follows).
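Assuming the new collection is called views and the fields are named view_time, doc_id and title (names chosen here purely for illustration), each view would be recorded roughly like this (in practice you would rely on autoCommit rather than committing per view):

curl 'http://localhost:8983/solr/views/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "view-000001", "doc_id": "doc-42", "view_time": "2022-10-16T14:03:00Z", "title": "abc"}]'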
You can query it with whatever gap you want:
curl http://localhost:8983/solr/views/query --data-urlencode 'q=title:abc' --data-urlencode 'rows=0' --data-urlencode 'json.facet={
  per_month: { range: {
    field: view_time,
    start: "2022-01-01T00:00:00Z",
    end: "2022-12-31T23:59:59Z",
    gap: "+1MONTH"
  }}
}'
This would return all views by MONTH (can change it to DAY, YEAR, etc).
But duplicating the title or body for every view is probably too heavy if your docs are big. If you want to normalize this, use a JOIN query. Since Solr 8.6, you can do cross-collection joins across multiple shards. This is a good article about how to write those queries, and this is a decent video of how to set it up. It's not that hard to do.
The JOIN query would be much faster.
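A hedged sketch of such a join, assuming the views collection above and a main collection whose id field is what views.doc_id points at (the crossCollection join method exists since Solr 8.6; all the names here are assumptions):

q={!join method="crossCollection" fromIndex="views" from="doc_id" to="id"}view_time:[2022-10-16T00:00:00Z TO 2022-10-17T00:00:00Z]

This returns the main-collection documents that were viewed on that day; the per-document view counts themselves would still come from faceting on doc_id in the views collection.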
If you don't want to do the JOIN query:
If the views change often, do not store them in the document itself. Even Solr's partial (atomic) updates rewrite the whole document under the hood, so if you're updating views every day, you'll effectively reindex every document that's been viewed. That's going to cause a lot of unnecessary disk thrashing.
Other thoughts:
can you use a database? A database is a far better home for view counts; Solr isn't good as the master record for views.
Another suggestion is to send the views to an analytics engine, a far better solution since you can also get rich analytics about the actual users. An analytics engine does a lot that simply recording views does not, especially filtering out false positives (like bots!). Maintaining an accurate view count on a high-traffic site is not fun.
In the past I've used an analytics engine to collect the data and then used it to export that data into Solr. This way the view logic is done by the software component that knows views best (the analytics engine, like Google Analytics or the Salesforce marketing engine), and an hourly process updates the views in Solr using one of the above tactics.
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
On the other hand, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you do so you of course lose this great feature.
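For illustration, on a recent Elasticsearch version the two update styles look roughly like this (the index name members, the field names and the document id are assumptions; older Elasticsearch versions use a slightly different URL that includes a type):

# partial document merged into the existing one
curl -XPOST 'http://localhost:9200/members/_update/12345' \
  -H 'Content-Type: application/json' \
  -d '{"doc": {"last_login_date": "2023-05-01T10:15:00Z"}}'

# script executed against the existing document (assumes a numeric login_count field)
curl -XPOST 'http://localhost:9200/members/_update/12345' \
  -H 'Content-Type: application/json' \
  -d '{"script": {"source": "ctx._source.login_count += 1"}}'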
For example, I have documents A, B, C. User 1 must only be able to see documents A and B. User 2 must only be able to see document C. Is it possible to do this in SOLR without filtering by metadata? If I use a metadata filter, every time the access rights change I have to reindex.
[update 2/14/2012] Unfortunately, in the client's case, change is frequent. The data is confidential and usually managed only by the owners, who are internal users. The specific case is that they need to be able to share those documents with certain external users and specify access levels for those users. Most of the time this is an ad hoc task, not something identified ahead of time.
I would suggest storing the access roles (yes, it's plural) as document metadata. Here the required field access_roles is a facetable, multi-valued string field.
Doc1: access_roles:[user_jane, manager_vienna] // Jane and the Vienna branch manager may see it
Doc2: access_roles:[user_john, manager_vienna, special_team] // John, the Vienna branch manager and a member of the special team may see it
The user owning the document is a default access role for that document.
To change the access roles of a document, you edit access_roles.
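On Solr 4 or later that edit can be an atomic (partial) update rather than a full reindex, provided the other fields are stored or have docValues; a sketch, assuming a collection named docs (the collection name and role values are illustrative):

curl 'http://localhost:8983/solr/docs/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "Doc1", "access_roles": {"set": ["user_jane", "manager_vienna"]}}]'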
When Jane searches, the access roles she belongs to will be part of the query. Solr will retrieve only the documents that match at least one of her access roles.
When Jane (user_jane), a manager at the Vienna office (manager_vienna), searches, her query looks like:
q=mainquery
&fq=access_roles:(user_jane OR manager_vienna)
&facet=on
&facet.field=access_roles
which fetches all documents whose access_roles contains user_jane OR manager_vienna: Doc1 and Doc2.
When Bob (user_bob), a member of the special team (special_team), searches,
q=mainquery
&fq=access_roles:(user_bob OR special_team)
&facet=on
&facet.field=access_roles
which fetches Doc2 for him.
Queries adapted from http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
Might want to check the Document level Security patches.
https://issues.apache.org/jira/browse/SOLR-1872
https://issues.apache.org/jira/browse/SOLR-1834
I think my approach would be similar to aitchnyu's answer. I would however NOT use individual users in the metadata.
If you create groups for each document, then you will have to reindex for security reasons less often.
For a given document, you might have access_roles: group_1, group_3
In this way, group_1 and group_3 always retain rights to the document. However, I can vary which groups each user belongs to and adjust the query accordingly.
When the query is generated, it always includes the user's groups as part of the query. If I belong to group_1 and group_2, my query will look like this:
q=mainquery
&fq=access_roles:(group_1 OR group_2)
Since the groups are added to the query dynamically, I simply remove a user from the group, and the next query they issue will no longer include the removed group. So removing the user from group_1 would now create a query like this:
q=mainquery
&fq=access_roles:group_2
All documents that require group 1 will no longer be accessible to the user.
This allows most changes to be done in real-time w/out the need to reindex the documents. The only reason you would have to reindex for security reasons is if you decided that a particular group should no longer have access to a document.
In many real-world scenarios, that should be a relatively uncommon occurrence. It seems much more likely that HR documents will always be available to the HR department, however a specific user may not always be part of the HR group.
Hope that helps.
You can implement your security model using Solr's PostFilter. For more information see http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
Note: you should probably cache your access rights otherwise performance will be terrible.
Keep in mind that Solr is a pure text-based search engine and indexing system built to facilitate fast searching; you should not expect RDBMS-style capabilities from it. Solr does not provide security for the documents being indexed; you have to implement that yourself if you want it. In that case you have two options.
1) Just index documents into Solr and keep the authorization details in the RDBMS. Query Solr for your search and collect the results it returns, then fire another query to the DB for the doc ids returned by Solr to see whether the user has access to them or not, and filter out the documents the user in action has no access to. You are done! But not really; your problem starts here. What if all the results returned by Solr get filtered out? (Assuming you are not fetching all documents at once, i.e. you retrieve only the top 1000 results from the Solr result set, otherwise you cannot get fast search.) You then have to query Solr again for the next bunch of results and iterate these steps until you have enough results to display.
2) The second approach is to index the authorization metadata along with the document in Solr, the same as aitchnyu has explained. But to answer your question about sharing documents with an external user: along with the user group and role details, you index these external users' userids into the access_roles field, or you can add another field such as access_user to your schema. You can then modify search queries for documents shared with external users to include the access_user field in your filter query.
e.g.
q=mainquery
&fq=access_roles:group_1 OR access_user:externaluserid
Now the most important thing: updates to indexed documents. It is of course a tedious task, but with careful design and asynchronous processing, along with Solr's partial document update feature (Solr 4.0+), you can achieve reasonably good TPS with Solr. If you are using Solr < 4.0 you can have separate systems for searching and updates, and with careful use of a load balancer and master-slave replication strategies you will have a smile on your face!
There are no built-in mechanisms in Solr that I am aware of that will allow you to control access to documents without maintaining the rights in the metadata. The approach outlined by aitchnyu seems reasonable if you keep it at a true role level and do not assign user-specific permissions to a document. That way you can assign roles to users, and this will grant them the ability to see documents in the index. Granted, you will still need to reindex documents when the roles change, but hopefully you can identify most of the needed roles ahead of time and reduce the need for frequent reindexing.
I use Lucene to index my documents and search. Actually I have 800k documents indexed in Lucene. Those documents have some fields:
Id: is a Numeric field to index the documents
Name: is a textual field to be stored and analyzed
Description: like name
Availability: is a numeric field to filter results. This field can be updated frequently, every day.
My question is: What's the better way to create a filter for availability?
1 - add this information to the index and make a Lucene filter.
With this approach I have to update the document (remove and add, because Lucene 3.0.2 does not support in-place updates) every time the availability changes. What is the cost of reindexing?
2 - don't add this information to the index, and filter the results with a DB select.
This approach will do a lot of selects, because I need to select every id from the database to check its availability.
3 - Create a separate index with id and availability.
I don't know if it is a good solution, but I can create one index with the static information and another with the information that is updated frequently. I think that is better than updating the whole document just because some fields changed.
I would stay away from 2: if you can handle the search entirely in Lucene instead of Lucene + DB, do it. I deal with this case (Lucene search + DB search) in my project, but only because there is no way around it.
The cost of an update is internally:
delete the doc
insert new doc (with new field).
I would just try approach number 1 (as it is the simplest); if the performance is good enough, stick with it, and if not, look for ways to optimize it or try 3.
Answer provided on the Lucene mailing list:
How often is "frequently"? How many updates do you expect to do in
a day? And how quickly must those updates be reflected in the search
results?
800K documents isn't all that many. I'd go with the simple approach first
and monitor the results, #then# go to a more complex solution if you
see a problem arising. Just update (delete/add) the documents when
the value changes.
Well, the cost to reindex is just about the same as the cost to index it originally. The old version of the document is marked deleted and the new one is added. It's essentially the same cost as indexing a new document.
This leaves some gaps in your index, that is the deleted docs are still in
there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that, say, once daily (or even weekly).
HTH
Erick
I have a location auto-complete field which offers auto-complete for all countries, cities, neighborhoods, villages, and zip codes. This is part of a location-tracking feature I am building for my website. So you can imagine this list will be in the multi-millions of rows; I'm expecting over 20 million at least with all the villages and postal codes. To make the auto-complete work well I will use memcached so we don't always hit the database to get this list. It will be used a lot as this is the primary feature on the site. But the question is:
Is only one instance of the list stored in memcached irrespective of the users pulling the info, or does it need to maintain a separate instance for each? So if, say, 20 million people are using it at the same time, will that differ from just one person using the location auto-complete? I am also open to other ideas on how to implement this location auto-complete so it performs well.
Or can I do something like this: when a user logs in, I send them the list in the background anyway, so by the time they reach the auto-complete text field their computer will have it ready to load instantly?
Take a look at Solr (or Lucene itself), using NGram (or EdgeNGram) tokenizers you can get good autocomplete performance on massive datasets.
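As a rough sketch of what such a field could look like in a Solr schema (the field and type names are made up for illustration; solr.EdgeNGramTokenizerFactory could be used instead of the filter):

<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="location_name" type="text_autocomplete" indexed="true" stored="true"/>

At query time the user's prefix (e.g. location_name:vien) matches the edge n-grams indexed for "Vienna", so prefix wildcards are not needed and lookups stay fast even on tens of millions of rows.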