Solr suggester filter at build time

Is it possible to create a suggestion dictionary using a pre-filtered query? In other words, is it possible to build a suggestion dictionary from a subset of an existing index instead of the whole index?
This is very similar to context filtering, but applied when the dictionary is built rather than at query time.
In my case, I'd like to build a suggestion dictionary from my main index using only semi-public data (not owned by a tenant) and filtered by language.

No, that is not possible right now. I see three possible ways to do what you want:
create a new collection from scratch, indexing from the original source only the subset you want into a field, and use that field with a DocumentDictionaryFactory for the suggester (a config sketch follows this list)
again, create a new index, but index it off the existing collection, either using DIH and the SolrEntityProcessor or with streaming expressions
write a custom FilteredDocumentDictionaryFactory that does what you need and plug it into Solr.
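For the first option, a minimal sketch of what the suggester configuration in solrconfig.xml could look like, assuming the dedicated collection exposes a suggest_text field that holds only the semi-public content (the field, suggester, and handler names here are illustrative):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">semiPublicSuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <!-- field in the pre-filtered collection that feeds the dictionary -->
    <str name="field">suggest_text</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">semiPublicSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Issuing /suggest?suggest.build=true against that collection then builds the dictionary from the filtered documents only.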

Related

In Azure Search, can an indexer combine information from different documents into a single index item without them overwriting each other?

My goal is to create a single searchable Azure index that has all of the relevant information currently stored in many different SQL tables.
I'm also using an Azure Cognitive Service to add additional info from related documents. Each document is tied to only a single item in my Index, but each item in the index will be tied to many documents.
According to my understanding, if two documents have the same value for the indexer's Key, then the index will overwrite the extracted information from the first document with the information extracted from the second. I'm hoping there's a way to append the information instead of overwriting it. For example: if two documents relate to the same index item, I want the values mapped to keyphrases for that item to include the keyphrases found in the first document and the keyphrases found in the second document.
Is this possible? Is there a different way I should be approaching this?
If it is possible, can I do it without having duplicate values?
Currently I have multiple indexes and I'm combining the search results from each one, but this seems inefficient and likely messes up the default scoring algorithm.
Every code example I find only has one document for each index item and doesn't address my problem. Admittedly, I haven't tried to set up my index as described above, because it would take a lot of refactoring, and I'm confident it would just overwrite itself.
I am currently creating my indexes and indexers programmatically using dotnet. I'm assuming my code isn't relevant to my question, but I can provide it if need be.
Thank you so much! I'd appreciate any feedback you can give.
Edit: I'm thinking about creating a custom skill to do the aggregation for me, but I don't know how the skill would access everything it needs. It needs the extracted info from the current document, and it needs the previously aggregated info from earlier documents. I guess the custom skill could perform a search on the index and get the item that way, but that sounds dangerously hacky. Any thoughts would be appreciated.
Pasting from docs:
Indexing actions: upload, merge, mergeOrUpload, delete
You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.
Whether you use the REST API or an SDK, the following document operations are supported for data import:
Upload, similar to an "upsert" where the document is inserted if it is new, and updated or replaced if it exists. If the document is missing values that the index requires, the document field's value is set to null.
merge updates a document that already exists, and fails a document that cannot be found. Merge replaces existing values. For this reason, be sure to check for collection fields that contain multiple values, such as fields of type Collection(Edm.String). For example, if a tags field starts with a value of ["budget"] and you execute a merge with ["economy", "pool"], the final value of the tags field is ["economy", "pool"]. It won't be ["budget", "economy", "pool"].
mergeOrUpload behaves like merge if the document exists, and upload if the document is new.
delete removes the entire document from the index. If you want to remove an individual field, use merge instead, setting the field in question to null.
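As a rough illustration of the REST form of these actions (the service name, index name, key, and field names below are made up), a batch that merges into one document and deletes another looks like this; note that the Tags collection sent in the merge completely replaces the existing Tags value, as described above:

POST https://[service name].search.windows.net/indexes/hotels/docs/index?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
  "value": [
    { "@search.action": "mergeOrUpload", "HotelId": "1", "Tags": ["economy", "pool"] },
    { "@search.action": "delete", "HotelId": "2" }
  ]
}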

Adding new fields to Solr after indexing with dynamic fields

I am trying to learn and implement Solr for a customer use case where we ran into the question: if we need to add more fields (for storing and indexing), is that possible without re-indexing or reloading the data? When I searched the net, most places said that adding a new field that does not need to be indexed is fine and can be done, but that if we want to add a new indexed field we have to reload/reindex the data. However, there are dynamic fields in schema.xml which can be used to map new fields, whether they need to be indexed or just stored. My question is:
If that is a possible workaround for adding new fields to existing data/index, why is it not suggested? Is there any overhead associated with it, or is it fine to use dynamic fields?
Dynamic fields are there so Solr knows how to map your new content to the types. You would still need to reindex the actual documents.
Solr has an API/format to partially update a document, so you only need to provide the additional information, but under the covers that's still reindexing, and you need to be careful that all fields are stored. If a field is stored=false and you try a partial update, that value will disappear.
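For reference, a partial (atomic) update posted to the /update handler looks roughly like this (the collection name, id, and field name are illustrative); Solr reads back the other stored fields and re-indexes the whole document internally:

[
  { "id": "doc-42", "new_category": { "set": "electronics" } }
]

For example, POST it with Content-Type: application/json to http://localhost:8983/solr/mycollection/update?commit=true.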

Create index of nested documents in SOLR

How should I import nested entities from a DB into the Solr index? For some reasons I don't want to flatten the documents into a single one. What should I write in schema.xml and data-config.xml? I'm using Solr 4.10.
The currently distributed version of the DataImportHandler does not support nested documents (or BlockJoins as they're called in Solr/Lucene).
There is however a patch available that you can try out - be sure to follow the discussion on JIRA (SOLR-5147) about how to use it and where it goes in the future.
Since you can't use the DataImportHandler, you could write custom code to do this. I'd recommend using SolrJ to load childDocuments. To handle childDocuments, first you have to create all of your required fields (for all of your different record types) in your schema.xml (or use dynamic fields). From there, you can create a SolrInputDocument for the parent, and a SolrInputDocument for the child, and then call addChildDocument(doc) on the parent SolrInputDocument to add the child to it.
I'd also recommend creating a field that can indicate what level you're at - something like "content_type" that you fill in with "parent" or "root," or whatever works for you. Then, once you've loaded the records, you can use the Block/Join Queries to search hierarchically. Be aware that doing this will create an entry for each record, though, and if you do a q=*:* query, you'll get all of your records intermixed with each other.
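A rough SolrJ sketch of that parent/child loading (the core URL, ids, field names, and content_type values are made up; addChildDocument is available in SolrJ 4.5+):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NestedDocLoader {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycollection");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "order-1");
        parent.addField("content_type", "parent");
        parent.addField("customer", "ACME");

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "order-1-line-1");
        child.addField("content_type", "child");
        child.addField("product", "widget");

        // attach the child so parent and child are indexed together as one block
        parent.addChildDocument(child);

        server.add(parent);
        server.commit();
        server.shutdown();
    }
}

A block join query such as q={!parent which="content_type:parent"}product:widget then returns the parents whose children match.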

Apache solr multi field indexing with django haystack

I have a design problem in django-haystack that I don't know how to solve. In my Django model, there are a number of text fields that I want to index.
As far as I know from the official haystack documentation, the only way to index content is to merge everything you want to index in a single template.
The problem here is that I want to maintain a per-field index, i.e. I want to do a full-text search on each field separately or on a set of fields.
Is there a way do it?
You should just be able to create additional CharFields. The first tutorial page gives an example for the author field: http://django-haystack.readthedocs.org/en/latest/tutorial.html#handling-data
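For reference, the index from that tutorial page looks roughly like this (the Note model and attribute names come from the tutorial; adapt them to your own model). Each extra field becomes its own field in the Solr schema, so it can be searched separately from the main text document:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # main full-text document, built from a template
    text = indexes.CharField(document=True, use_template=True)
    # per-field indexes that can be queried on their own
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

You can then filter on a single field, e.g. SearchQuerySet().filter(author='john').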

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When a user updates the custom metadata, the document index has to be updated to reflect the metadata changes in search results.
But during the index update, even though the content of the file has not changed, it is re-indexed as well, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid re-indexing the content and update just the metadata?
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query across these two different indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. Its like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two solr indexes are considered as two databases, not two tables, if you had DB-like joins in mind. I just answered on it on How to start and Stop SOLR from A user created windows service .
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Serialize and compress the object and then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene (a sketch follows below).
Don't forget to also store, as one of the fields inside the SolrInputDocument, the XML containing the text extracted by Tika, which is used for search/indexing.
The only negative: your index size will grow a little bit, but you will get what you want without re-feeding the data.
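A hedged sketch of that snapshot round-trip, assuming a binary field (say raw_doc) exists in the schema and that SolrJ's SolrInputDocument is Java-serializable (the field name and class name are illustrative):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.solr.common.SolrInputDocument;

public class DocSnapshot {

    // serialize + gzip the document so it can be stored in the raw_doc binary field
    public static byte[] snapshot(SolrInputDocument doc) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(doc);
        }
        return bos.toByteArray();
    }

    // restore the original document from the stored bytes
    public static SolrInputDocument restore(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
            return (SolrInputDocument) ois.readObject();
        }
    }
}

On a metadata change you would fetch raw_doc, restore() it, update the metadata fields, take a fresh snapshot, and re-add the whole document, so the Tika-extracted text never has to be re-extracted.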

Resources