I am using solr to index reports from DB. I am successful in doing that. However, I also need to track user activity to report whether a document has been read by the user or not. I am aware that Solr is not built to index/keep track user activity, but is there a good approach going about this ?
Any suggestions?
No, as you say there is no support for this in Solr. From a Solr perspective it’s more related to how you build you web-application. I would recommend you to ask yourself this:
When tracking the reading statistics of my users do I need to index that information into Solr too?
The answer to this question depends on if you need to the information to facet, search or use it in the relevance model. Say for example you want to have a facet that allows your users to filter on read or unread documents then of course you need to index this into Solr.
If you only want to present whether or not a document has been read or not (in the web interface) you might as well store this information inside a SQL database fetching it when presenting the results.
Related
We have a Cloudant database on Bluemix that contains a large number of documents that are answer units built by the Document Conversion service. These answer units are used to populate a Solr Retrieve and Rank collection for our application. The Cloudant database serves as our system of record for the answer units.
For reasons that are unimportant, our Cloudant database is no longer valid. What we need is a way to download everything from the Solr collection and re-create the Cloudant database. Can anyone tell me a way to do that?
I'm not aware of any automated way to do this.
You'll need to fetch all your documents from Solr (and assuming you have a lot of them, do this in a paginated way - there are some examples of how to do this in the Solr doc) and add them into Cloudant.
Note that you'll only be able to do this for the fields that you have set to be stored in your schema. If there are important fields that you need in Cloudant that you haven't got stored in Solr, then you might be stuck. :(
You can replicate one Cloudant database to another which will create you an exact replica.
Another technique is to use a tool such as couchbackup which takes a copy of your database's documents (ignoring any deletions) and allows you to save the data in a text file. You can then use the couchrestore tool to upload the data file to a new database.
See this blog for more details.
Is there any way through Elastic Search or Lucene metadata, to store a count of how many times a particular document has satisfied queries even though one has not recalled the document for processing.
For example, say you issue a query and get 100 results. You process the first 10 and not go any further. We would like to flag ALL the documents (100) that satisfied the search criteria for later analysis.
Thanks
Currently, Azure Search does not expose this information (and neither does Elasticsearch or Lucene). However, we're working on building better ranking models, and we're thinking about capturing (and potentially exposing) this type of data.
We'd be very interested in learning more about your scenario. Could you email me at eugenesh at the usual Microsoft domain? Thanks!
Is solr just for searching ie it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at SOLR as an alt option, I see you make your queries through http requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr mainly is for Full Text search, and rather should not be used as a Persistent store.
Solr stores its data in the File store and does not provide the features of Relational database (ACID or Nested Entities etc )
Usually, the model followed is use Relationship database for you data management.
Replicate the data into Solr for Full Text search.
You can always control the Insert/Update access for Solr by securing the urls.
I want to create prospective search subscription with empty query, but GAE raises exception
QuerySyntaxError: query:'' detail:'Query is empty.'
which is not compatible with Search API, which allows empty queries. Any workarounds? Should I file an issue?
The Prospective Search Service is intended to support applications that filter a stream of documents; applications that want less than all documents matched. In such an application, an "empty query" would normally be considered evidence of a bug. Admittedly, empty queries might sometimes be useful for various debugging purposes, however, the decision was made to design the interface's contracts with production use in mind.
As suggested by Will Brown, if you want a subscription that will match all documents, then insert some dummy field with a constant value into your documents and then create a query that matches just that field and value. Given that there is such an easy work-around available for those rare cases when "all documents" are needed, I think it unlikely that we would provide support for empty queries. It might also be interesting to note that the prohibition against empty queries is not just in the AppEngine code but also in the backend servers that AppEngine accesses to provide the Prospective Search Service.
Although the "Search API" (which really should be called the "Retrospective Search API") may support empty queries, it is important to realize that resource utilization patterns for prospective search are very, very different from those of retrospective search. For instance, you might have an application that is streaming hundreds of documents per second into both a document index (using retrospective search) and through a query index (using prospective seach). In such a system, an empty retrospective query is only going to return just a few documents whenever that query is submitted. On the other hand, a prospective query would generate a real-time stream of all documents. The presence of just a few prospective queries could thus generate significant loads on your application. In general, if you want a firehose, real-time push feed of everything published, it is best to code that up explicitly.
You can file a feature request for this, but it is by design (I don't know why). If you know that incoming documents will have something in common, you can write a query for those; for example, if you add a field "alldocuments" with content "yes" to the document when you send the request, you could register a query like "alldocuments:yes" to match all documents.
Consider the following situation. We have a database which stores writers and books in two separate tables. One book obviously stores the reference to the writer who wrote the book.
For Solr i have to denormalize this structure into one big document where every book contains the details of the writer associated. This index is now used for querying books.
One user of the system now decides to update a writer record in the system. Because many books can be associated with it i have to update every document in Solr which have embedded data from this writer record. This is very painful because i have to delete and re-add every affected document as far as i know.
Is there any better way of doing this? I need near realtime update of the index in the system if one of the referenced data gets modified.
This would be a perfect usecase for nested documents. As far as I know lucene does support nested documents but Solr doesn't, not totally sure about the current state of this feature.
This feature is available in elasticsearch though. You might want to have a look at it, there's an article I just wrote that can be interesting if you want to know what's so cool about elasticsearch in my opinion. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more you can have a look at this article. By the way it contains exactly the books/authors example.
Elasticsearch also helps you while updating documents. You don't need to reindex the whole document but send only the changes through a script. Thanks to the fact that it stores the source document that has been indexed it internally retrieves it, updates it running the script and reindexes it. That's how lucene internally works since its index segments are write-once. With Solr 4, which will be soon released, you can update documents providing only the changes, but as far as I know this works only if all your fields are stored. The fields that are not stored cannot be retrieved from the index.
If we are talking about Near Real Time updates, elasticsearch does use the Lucene Near Real Time API and refreshes automatically the index reader every second. Solr 3 doesn't use yet those APIs but Solr 4 does.
For updating nested types in SOLR you can use dataimporters and delta imports. The example on https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to have solr access your database.