DIH in SOLR based on primary key of a table - solr

I am currently using DIH to pull data from MS SQL Server into Solr. I use dataimporter.last_index_time to pull only the records that were added to the database after last_index_time. I was wondering whether DIH offers an alternative to last_index_time, something like last_pk_id.
Is such an option available? Could anyone let me know?

This is not provided by Solr itself, but nothing prevents you from doing it yourself.
Set your DIH SQL for the delta like this:
WHERE (last_pk_id > '${dataimporter.request.LAST_PK_ID}')
When you run some indexing, store, outside Solr, the last_pk_id value you indexed, say 333.
Next time you need to delta index, add this to your request:
...&clean=false&LAST_PK_ID=333
Then store your new LAST_PK_ID (you can query Solr for this).
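The bookkeeping described above could be sketched in Python as follows. This is only a sketch under assumptions: the function names, the JSON state file, and the example URL in the note below are hypothetical, and the primary key is assumed to be an integer. It appends clean=false and LAST_PK_ID to whatever import request you already use, and keeps the last indexed value in a small file outside Solr.

```python
import json
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def load_last_pk_id(state_file, default=0):
    """Return the last indexed primary key, or `default` if nothing is stored yet."""
    path = Path(state_file)
    if not path.exists():
        return default
    return int(json.loads(path.read_text())["last_pk_id"])


def save_last_pk_id(state_file, last_pk_id):
    """Persist the last indexed primary key outside Solr (hypothetical JSON file)."""
    Path(state_file).write_text(json.dumps({"last_pk_id": int(last_pk_id)}))


def with_delta_params(import_url, last_pk_id):
    """Append clean=false and LAST_PK_ID to an existing DIH request URL.

    Any clean/LAST_PK_ID parameters already present are replaced, so the
    function can be applied repeatedly to the same URL.
    """
    parts = urlsplit(import_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("clean", "LAST_PK_ID")
    ]
    query += [("clean", "false"), ("LAST_PK_ID", str(int(last_pk_id)))]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For instance, with_delta_params("http://localhost:8983/solr/mycore/dataimport?command=full-import", 333) returns that URL with &clean=false&LAST_PK_ID=333 appended (the host and core name are made up). After a successful run, call save_last_pk_id with the highest key you indexed.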

Related

Apache Nifi GetSolr configuration without Date Field

I'm new to Apache Nifi. My requirement is to retrieve data from a solr index, do some processing and store it in a different solr index.
I'm trying to use the NiFi GetSolr processor to retrieve the data. The GetSolr processor has a mandatory property, Date Field. However, my Solr index doesn't have a date/timestamp field in its collections.
Please see a sample document in my solr collection below.
Is there any workaround for this? Can I use GetSolr without the Date Field and use something like the version field instead?
Thanks.
GetSolr is meant to do incremental extraction from an index, meaning each time it runs it finds documents newer than the last time it ran. It can only do that if it can sort the documents by a date/time to compare against its last execution time.
If you just want a one-time extraction, you may want to use QuerySolr instead.

Solr to Solr via import/update and modify data

Given: running Solr v.4.6 (3.5M docs) in a closed system with a fixed data source (just queries possible)
ToDo: Solr v.7.4
I'm looking for an easy way to import the complete index from v.4.6, changing the schema (copy fields) and the config (faceting within the searchHandler) along the way, and then keep it up to date based on the documents' last modified date (about 1k updated titles/day).
Any recommendation on how to do this?
I'd really appreciate it.
Francois

How to update index of a document by its "id" in Solr?

How can I update or re-index an existing document in Solr by using its "id"? I am using Solr 4.0.
I researched this but couldn't find a definite answer anywhere.
Lucene is an append-only store. Update syntax in Solr essentially translates into a delete and an insert, so we need to give the full document with all its fields to update it. With Solr 4 and above, you have the option of an atomic update, which lets you update only certain fields. In this case there is a constraint that all the Solr schema fields should be "stored". Internally, Solr gets the stored document, merges the new field values into it, and then inserts the result.
The documentation links below:
Update using CSV - The index can also be updated using XML.
Atomic Update - This just saves the network round trip of retrieving the document to the client and then updating it.
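As a rough illustration of what an atomic update request body looks like, the sketch below builds the JSON that would be posted to the core's /update handler with Content-type application/json. The helper name and the example field are made up for illustration, and it assumes the unique key field is called "id".

```python
import json


def build_atomic_update(doc_id, set_fields, id_field="id"):
    """Build the JSON body for an atomic update of a single document.

    Each field is wrapped in a {"set": value} modifier so that only those
    fields are replaced; Solr rebuilds the rest of the document from its
    stored fields.
    """
    if not set_fields:
        raise ValueError("at least one field to update is required")
    doc = {id_field: doc_id}
    for name, value in set_fields.items():
        doc[name] = {"set": value}
    return json.dumps([doc])
```

Calling build_atomic_update("42", {"title": "New title"}) produces a one-element JSON array holding the id and the wrapped title field. Sending the plain document without the "set" wrapper would instead replace the whole document.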

conversion of DateField to TrieDateField in Solr

I'm using Apache Solr for powering the search functionality in my Drupal site using a contributed module for drupal named ApacheSolr Search Integration. I'm pretty novice with Solr and have a basic understanding of it, hence wish to convey my apologies in advance if this query sounds outrageous.
I have a date field, added through one of Drupal's hooks, named ds_myDate, which I initially used for sorting the search results. I decided to use date boosting, so that the search results are displayed based on relevancy and boosted by their date rather than merely being displayed in descending order of date. Once I had updated my hook to implement this by adding a boost field as recip(ms(NOW/HOUR,ds_myDate),3.16e-11,1,1) I got an HTTP 400 error stating
Can't use ms() function on non-numeric legacy date field ds_myDate
Googling for the same suggested that I use a TrieDateField instead of the Legacy DateField to prevent this error. Adding a TrieDate field named tds_myDate following the suggested naming convention and implementing the boost as recip(ms(NOW/HOUR,tds_myDate),3.16e-11,1,1) did effectively achieve the boosting. However this requires me to reindex all the content (close to 500k records) to populate the new TrieDate field so that I may be able to use it effectively.
I'd like to know whether there is a workaround that avoids re-indexing all my content, such as converting ds_myDate to a TrieDateField in place, the way an ALTER query changes the type of a MySQL table column. Since I'm unfamiliar with how Solr works, I'd like to know whether such an option is feasible and what the right thing to do would be in this case.
You may be able to achieve it by doing a partial update, but for that you need to be on Solr 4+ and storing all indexed fields.
Here is how I would go about this:
Make sure the version of Solr is 4+.
Make sure all indexed fields are stored (a requirement for partial updates).
If both conditions are met, write a script (PHP) which does the following:
1) Iterate through the full Solr index, and for each doc:
    a) Read the value stored in the ds_myDate field
    b) Convert it to TrieDateField format
    c) Push it to Solr via a partial update to only the tds_myDate field (see sample query)
Sample query:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"$id","tds_myDate":{"set":"$converted_Val"}}]'
For more details on partial updates: http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
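The loop over the index described above might look roughly like this in Python rather than PHP. It is a sketch with assumptions: the function name is hypothetical, the documents are assumed to have already been fetched from Solr as dicts, and the stored ds_myDate value is copied unchanged by default (pass a different convert function if your values need reformatting). Each yielded string is a JSON body for the /update handler.

```python
import json


def copy_field_updates(docs, source="ds_myDate", target="tds_myDate",
                       batch_size=500, convert=lambda value: value):
    """Yield JSON bodies that set `target` from `source` via partial updates.

    `docs` is any iterable of dicts as returned by a Solr query. Documents
    without the source field are skipped. Updates are grouped into batches
    so a large index does not need one request per document.
    """
    batch = []
    for doc in docs:
        if source not in doc:
            continue
        batch.append({"id": doc["id"], target: {"set": convert(doc[source])}})
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield json.dumps(batch)
```

Batching is a design choice here, not something the answer prescribes; committing once at the end instead of after every request is usually cheaper for around 500k documents.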
Unfortunately, once a document has been indexed a certain way and you change the schema, you cannot have the new schema changes be applied to existing documents until those documents are re-indexed.
Please see this previous question - Does Schema Change need Reindex for additional details.

updating solr document using a query?

In the solr documentation, there are options to delete documents using a query, something like the following:
<delete><query>*:*</query></delete>
<delete><query>id:298253</query>
<query>entitytype:BlogEntry</query></delete>
However, I could not find any references about updating documents based on a query. Is this possible with updates in solr? Basically I would like to update the values of all the documents that match a query.
Something like update prop1=val1, prop2=val2 where ( prop3 < val3 and prop4=val4 )
Thanks,
-Vineel
The ability to update documents is being added to the Solr 4.0 release, which just went into Beta this week. I am not sure if there will be the ability to update documents based on a query, but you could ask in the Solr Users List. Unfortunately, I have not had a chance to explore the 4.0 version yet to see how atomic updates work.
Keep in mind that to partially update documents in Solr, their fields need to be stored, which increases the index size. Check this for some background
