How can I modify the Solr Update Handler to not simply overwrite existing documents? - solr

I'm working with Solr indexing data from two sources - real-time "pump" inserting (and updating) documents into Solr and database which holds backups of those documents.
The problem we encountered looks like that - if we make a data import from database while pump is performing inserts, we may index a doc from pump, and later overwrite it with doc extracted from database - which is a backup, so it's probably little outdated.
If we close the pump, import from database and open the pump again, it probably will cause instabilities in our application.
What I'd like to do is tell Solr to not automatically overwrite the document, but do so conditionally (for example by the value of 'last_modified_date' field).
My question is - how can I do it? Do I have to modify Solr source, make a new class overwriting some update processor, or just add some magic lines to solrconfig?

Sorry, but there there is not an option or config to tell Solr to not automatically update documents, but instead use some conditional check. The current model for Solr is that if you insert a document with the same unique id as one already in the index, it will "update" that document by a delete/add operation. Solr also does not currently support the ability to only update specific fields in an existing indexed document. Please see issue SOLR-139 for more details.
Based on the scenario you have described, I would suggest that you create a process outside of Solr that handles the retrieval of items from your data sources and then performs the conditional check to see what is in the index already and determine if an update to the index is necessary.

You can use solr script processors to check if that document exists proceeds in its accordance
Below code only works when solr uses java 8
function processAdd(cmd) {
doc = cmd.solrDoc;
var previousDoc=null;
try {
// create a term type object
var Term = Java.type("org.apache.lucene.index.Term");
var TermObject =new Term("fieldForSearchTryUnique","Value of field");
//retrieve document id from solr return -1 if not present
previousDocId= req.getSearcher().getFirstMatch(TermObject);
if(-1!=perviousDocId) {
// get complete document from solr for that searched field
previousDoc=req.getSearcher().doc(previousDocId);
// do required process here
}
}
catch(err) {
logger.error("error in update processor "+err)
}
}

Related

How to get fullDocument from MongoDB changeStream when a document is deleted?

My Code
I have a MongoDB with two collections, Items and Calculations.
Items
value: Number
date: Date
Calculations
calculation: Number
start_date: Date
end_date: Date
A Calculation is a stored calcluation based off of Item values for all Items in the DB which have dates in between the Calculation's start date and end date.
Mongo Change Streams
I figure a good way to create / update Calculations is to create a Mongo Change Stream on the Items collection which listens for changes to the Items collection to then recalculate relevant Calculations.
The issue is that according to the Mongo Change Event docs, when a document is deleted, the fullDocument field is omitted which would prevent me from accessing the deleted Item's date which would inform which Calculations should be updated.
Question
Is there any way to access the fullDocument of a Mongo Change Event that was fired due to a document deletion?
No I don't believe there is a way. From https://docs.mongodb.com/manual/changeStreams/#event-notification:
Change streams only notify on data changes that have persisted to a majority of data-bearing members in the replica set.
When the document was deleted and the deletion was persisted across the majority of the nodes in a replica set, the document has ceased to exist in the replica set. Thus the changestream cannot return something that doesn't exist anymore.
The solution to your question would be transactions in MongoDB 4.0. That is, you can adjust the Calculations and delete the corresponding Items in a single atomic operation.
fullDocument is not returned when a document is deleted.
But there is a workaround.
Right before you delete the document, set a hint field. (Obviously use a name that does not collide with your current properties.)
await myCollection.updateOne({_id:theId}, {_deleting: true})
await myCollection.deleteOne({_id:theId})
This will trigger one final change event in the stream, with a hint that the document is getting deleted. Then in your stream watcher, you simple check for this value.
stream.on('change', event => {
if (!event.fullDocument) {
// The document was deleted
}
else if (event.fullDocument._deleting) {
// The document is going to be deleted
}
else {
// Document created or updated
}
})
My oplog was not getting updated fast enough, and the update was looking up a document that was already removed, so I needed to add a small delay to get this working.
myCollection.updateOne({_id:theId}, {_deleting: true})
setTimeout( ()=> {
myCollection.deleteOne({_id:theId})
}, 100)
If the document did not exist in the first place, it won't be updated or deleted, so nothing gets triggered.
Using TTL Indexes
Another way to make this work is to add a ttl index, and then set that field to the current time. This will trigger an update first, and then delete the document automatically.
myCollection.setIndex({_deleting:1}, {expireAfterSeconds:0})
myCollection.updateOne({_id:theId}, {$set:{_deleting:new Date()}})
The problem with this approach is that mongo prunes TTL documents only during certain intervals, 60s or more as stated in the docs, so I prefer to use the first approach.

XQuery update for document with versions

I am trying to update value of some property in xDB using XQuery, here is my transaction:
let $source := doc('/historicalresource')/HistoricalResourceData[id/#UUID = '0361513e-30fa-45e4-a73a-d05870b8a284']
let $res := $source/ResourceProperty[#PropertyName="cpu|limit"]/#PropertyValue
let $change := '21572'
return replace value of node $res with $change
After running this query I receive such error:
com.xhive.error.XhiveException: VERSION_ACCESS_DENIED: This document
is versioned and can only be changed through a versioning operation
Indeed in my case historicalresource is a folder which may contain more than one document and all they have a version, like: v1.1 ,v1.2, etc..
How can I update the value of last version using xquery?
How should I modify my query to be able to update desired value ?
Finally, I have found the answer to my question, therefore adding it here:
There is no way to update xDB file using xQuery as it is version based system and xQuery cannot create a new version, but it could be done manually or automatically by some coding language (for example: java):
Manually - by editing existing version in xDB, in that case a new version will be automatically created after saving any change to existing version of the file.
Automatically - by checking out the file, making the change and submitting the change as a new version.
For versioned documents (on which XhiveLibraryChildIf.makeVersionable() was called), this is indeed the case: you have to check out the document first, then run the update query on the checked out copy, and finally check the new version in. For non-versioned documents, however, you can apply the update query directly.

CFINDEX throwing attribute validation error exception

I am upgrading to ColdFusion 11 from ColdFusion 8, so I need to rebuild my search indices to work Solr instead of Verity. I cannot find any reliable way to import my old Verity collections, so I'm attempting to build the new indices from scratch. I am using the following code to index some items along with their corresponding documents which are located on the server:
<cfsetting requesttimeout="3600" />
<cfquery name="qDocuments" datasource="#APPLICATION.DataSource#">
SELECT DISTINCT
ID,
Status,
'C:\Documents\'
CONCAT ID
CONCAT '.PDF' AS File
FROM tblDocuments
</cfquery>
<cfindex
query="qDocuments"
collection="solrdocuments"
action="fullimport"
type="file"
key="document_file"
custom1="ID"
custom2="Status" />
A very similar setup was used with Verity for years without a problem.
When I run the above code, I get the following exception:
Attribute validation error for CFINDEX.
The value of the FULLIMPORT attribute is invalid.
Valid values are: UPDATE, DELETE, PURGE, REFRESH, FULL-IMPORT,
DELTA-IMPORT,STATUS, ABORT.
This makes absolutely no sense, since there is no "FULLIMPORT" attribute for CFINDEX.
I am running ColdFusion 11 Update 3 with Java 1.8.0_25 on Windows Server 2008R2/IIS7.5.
You should believe the error message. try this:
<cfindex
query="qDocuments"
collection="solrdocuments"
action="FULL-IMPORT"
type="file"
key="document_file"
custom1="ID"
custom2="Status" />
It's referring to the value of the attribute action.
This is definitely a bug. In the ColdFusion documentation, fullimport is not an attribute of cfindex.
I know this is an old thread, but in case anyone else has the same question, it's just poor descriptions in the documentation. The action "FullImport" is only available when using type="dih" (i.e. Data Import Handler). When using the query attributes, use action="refresh" instead.
Source: CFIndex Documentation:
...
When type="dih", these actions are used:
abort: Aborts an ongoing indexing task.
deltaimport: For partial indexing. For instance, for any updates in the database,
instead of a full import, you can perform delta import to update your
collection.
fullimport: To index full database. For instance,
when you index the database for the first time.
status:
Provides the status of indexing, such as the total number of documents
processed and status such as idle or running.

Solr Custom RequestHandler - optimizing results

Yet another potentially embarrassing question. Please feel free to point any obvious solution that may have been overlooked - I have searched for solutions previously and found nothing, but sometimes it's a matter of choosing the wrong keywords to search for.
Here's the situation: coded my own RequestHandler a few months ago for an enterprise-y system, in order to inject a few necessary security parameters as an extra filter in all queries made to the solr core. Everything runs smoothly until the part where the docs resulting from a query to the index are collected and then returned to the user.
Basically after the filter is created and the query is executed we get a set of document ids (and scores), but then we have to iterate through the ids in order to build the result set, one hit at a time - which is a good 10x slower that querying the standard requesthandler, and only bound to get worse as the number of results increase. Even worse, since our schema heavily relies on dynamic fields for flexibility, there is no way (that I know of) of previously retrieving the list of fields to retrieve per document, other than testing all possible combinations per doc.
The code below is a simplified version of the one running in production, for querying the SolrIndexSearcher and building the response.
Without further ado, my questions are:
is there any way of retrieving all results at once, instead of building a response document by document?
is there any possibility of getting the list of fields on each result, instead of testing all possible combinations?
any particular WTFs in this code that I should be aware of? Feel free to kick me!
//function that queries index and handles results
private void searchCore(SolrIndexSearcher searcher, Query query,
Filter filter, int num, SolrDocumentList results) {
//Executes the query
TopDocs col = searcher.search(query,filter, num);
//results
ScoreDoc[] docs = col.scoreDocs;
//iterate & build documents
for (ScoreDoc hit : docs) {
Document doc = reader.document(hit.doc);
SolrDocument sdoc = new SolrDocument();
for(Object f : doc.getFields()) {
Field fd = ((Field) f);
//strings
if (fd.isStored() && (fd.stringValue() != null))
sdoc.addField(fd.name(), fd.stringValue());
else if(fd.isStored()) {
//Dynamic Longs
if (fd.name().matches(".*_l") ) {
ByteBuffer a = ByteBuffer.wrap(fd.getBinaryValue(),
fd.getBinaryOffset(), fd.getBinaryLength());
long testLong = a.getLong(0);
sdoc.addField(fd.name(), testLong );
}
//Dynamic Dates
else if(fd.name().matches(".*_dt")) {
ByteBuffer a = ByteBuffer.wrap(fd.getBinaryValue(),
fd.getBinaryOffset(), fd.getBinaryLength());
Date dt = new Date(a.getLong());
sdoc.addField(fd.name(), dt );
}
//...
}
}
results.add(sdoc);
}
}
Per OPs request:
Although this doesn't answer your specific question, I would suggest another option to solve your problem.
To add a Filter to all queries, you can add an "appends" section to the StandardRequestHandler in the SolrConfig.xml file. Add a "fl" (stands for filter) section and add your filter. Every request piped through the StandardRequestHandler will have the filter appended to it automatically.
This filter is treated like any other, so it is cached in the FilterCache. The result is fairly fast filtering (through docIds) at query time. This may allow you to avoid having to pull the individual documents in your solution to apply the filtering criteria.

Creating logger in CouchDB?

I would like to create a logger using CouchDB. Basically, everytime someone accesses the file, I would like like to write to the database the username and time the file has been accessed. If this was MySQL, I would just add a row for every access correspond to the user. I am not sure what to do in CouchDB. Would I need to store each access in array? Then what do I do during update, is there a way to append to the document? Would each user have his own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it to the database. So you'll want to keep the documents small for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like
{
"_id": "32 char hash",
"_rev": "32 char hash",
"when": Unix time stamp,
"by": "some unique identifier
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
emit(doc.at, 1);
}
Reduce:
function(keys, values, rereduce)
{
return sum(values);
}
The reason I threw the time stamp (doc.at) into the key is that it allows us to get total views for a range of time. Ex., /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000&group=true gives us the total number of views between those two time stamps.
Cheers.
Although Sam's answer is an ok pattern to follow I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
"_id": "some/file/path/name.txt",
"_rev": "32 char hash",
"accesses": [
{"at": 1282839291, "by": "ben"},
{"at": 1282839305, "by": "kate"},
{"at": 1282839367, "by": "ozone"}
]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
if (doc.accesses) {
for (i=0; i < doc.accesses.length; i++) {
event = doc.accesses[i];
emit([doc._id, event.by, event.at], 1);
}
}
}
function(keys, values, rereduce)
{
return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you wanted to count the number of times each file was accessed by each user you'll query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
var whom = req.query.by;
var when = Math.round(new Date().getTime() / 1000);
if (!doc.accesses) doc.accesses = [];
var event = {"at": when, "by": whom}
doc.accesses.push(event);
var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
return [doc, message];
}
Now whenever you need to log an access to a file issue a request like the following (depending on how you name your design document and update function):
http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
A comment to the last two anwers is that they refer to CouchBase not Apache CouchDb.
It is however possible to define updatehandlers in CouchDb but I have not used it.
http://wiki.apache.org/couchdb/Document_Update_Handlers

Resources