DSpace and Solr configuration

I would like to add omitNorms="true" to the title field; it is wrongly over-boosting some of our titles.
However, I don't know how the title field is indexed. What is its name, just dc.title?
I don't see anything about it in schema.xml. What type does that field have, and what analyzer (or anything else) is used for it? Is there any way to find out?

Most metadata fields in DSpace are handled via dynamic fields. That's why you don't see each specified individually in the search core's schema.xml file.
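If the goal is just to turn norms off for the title, one option is an explicit field declaration, since explicitly declared fields take precedence over matching dynamic-field patterns. A minimal sketch, assuming the field is exposed as dc.title with a text type (verify both against your actual search-core schema, e.g. via the Schema Browser mentioned below):

<!-- illustrative only: the field name, type, and dynamic pattern are
     assumptions; check them against your DSpace search core's schema -->
<dynamicField name="*" type="text_general" indexed="true" stored="true"/>
<!-- an explicit declaration overrides the dynamic pattern for this one
     field and disables length norms -->
<field name="dc.title" type="text_general" indexed="true" stored="true"
       omitNorms="true"/>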
I'm not sure where the boosting is happening (or whether DSpace does any at all); I don't recall seeing any boost clauses when looking through the Solr log files. I do see some extraction parameters being set in SolrServiceImpl#writeDocument, where the document is indexed. There appears to be an extraction parameter for boosting individual fields, so perhaps you can experiment with that to get what you'd like.
If you want to see the field type for any Solr field, the easiest option is probably the Schema Browser in the Solr admin user interface, e.g.
http://localhost:8080/solr/#/search/schema-browser?field=title (you may need an SSH tunnel or the like to reach Solr running on a different host, since the DSpace Solr install is typically IP-limited to access from localhost).

Related

Solr Language Detection using DataImportHandler

In my Solr configuration files I have defined a DataImportHandler that fetches data from a MySQL database and also processes the contents of PDF files related to records in the SQL database. The data import works fine.
I'm trying to detect the language of the text contained in the files during the data-import phase. I have specified a TikaLanguageIdentifierUpdateProcessorFactory in my solrconfig.xml, as explained in https://wiki.apache.org/solr/LanguageDetection, and have defined the language fields in my document schema. Nevertheless, after I run the indexing from the Solr admin, I cannot see any language field on my documents.
In all the examples I have seen, language detection is done by posting a document to Solr with the post command. Is it possible to do language detection with a DataImportHandler?
Once you have defined the UpdateRequestProcessor chain, you need to actually attach it to the request handler (the DataImportHandler's, in this case). You do that with the update.chain parameter.
Also, make sure the chain includes the LogUpdate and RunUpdate processors; without RunUpdate, nothing gets indexed at all.
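For illustration, the wiring in solrconfig.xml might look roughly like this (the chain name "languageid" and the field names are assumptions; the Tika factory also needs the langid contrib jars on the classpath):

<updateRequestProcessorChain name="languageid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title</str>
    <str name="langid.langField">language</str>
  </processor>
  <!-- without RunUpdate the documents never reach the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- this attaches the chain to the DataImportHandler -->
    <str name="update.chain">languageid</str>
  </lst>
</requestHandler>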

How does Solr's schema-less feature work? How to revert it to classic schema?

I just found out that Solr 5 doesn't require a schema file to be predefined; it generates the schema based on the indexing being performed. I would like to know how this works in the background.
Also, is it good practice or not? Is there any way to disable it?
The schemaless feature has been in Solr since version 4.3, but it may only have become stable more recently: a concurrency issue with it was fixed in 4.10.
It is also called managed schema. When you configure Solr to use a managed schema, Solr uses a special UpdateRequestProcessor to intercept document indexing requests and guess field types.
Solr starts with your schema.xml file and creates a new file called, by default, managed-schema to store all the inferred schema information. This file is automatically overwritten by Solr as it detects changes to the schema.
You should then use the Schema API if you want to make changes to the Schema. See also the Schemaless Mode documentation.
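For example, adding a field through the Schema API might look like this (core name and field definition are illustrative):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycore/schema' \
  -d '{"add-field": {"name": "title", "type": "text_general", "stored": true}}'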
How to change Solr managed schema to classic schema
Stop Solr: $ bin/solr stop
Go to server/solr/mycore/conf, where "mycore" is the name of your core/collection.
Edit solrconfig.xml (a sketch of the resulting fragment follows these steps):
search for <schemaFactory class="ManagedIndexSchemaFactory"> and comment out the whole element
search for <schemaFactory class="ClassicIndexSchemaFactory"/> and uncomment it
search for the <initParams> element that refers to add-unknown-fields-to-the-schema and comment out the whole <initParams>...</initParams>
Rename managed-schema to schema.xml and you are done.
You can now start Solr again ($ bin/solr start), go to http://localhost:8983/solr/#/mycore/documents, and check that Solr now refuses to index a document containing a field not yet specified in schema.xml.
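For reference, the relevant part of solrconfig.xml should end up looking roughly like this (your file's exact contents may differ):

<!-- managed schema factory, now commented out -->
<!--
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
-->
<schemaFactory class="ClassicIndexSchemaFactory"/>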
Is it a good practice? When to use it?
It depends on what you want. If you want to enforce a specific document structure (e.g. to make sure that all docs are "well-formed" according to your definition), then you want classic schema management.
If, on the other hand, you don't know up front what the document structure is, then you might want to use the schemaless feature.
Limits
While it is called schemaless, there are limits to the kinds of structures you can index; this is true for both Solr and Elasticsearch, by the way. For example, if you first index this doc:
{"name":"John Doe"}
then you will get an error if you next try to index a doc like this:
{"name": {
"first": "Daniel",
"second": "Dennett"
}
}
That is because in the first case the name field was of type string, while in the second case it is an object.
If you would like indexing that goes beyond these limitations, you could use SIREn, an open-source semi-structured information retrieval engine implemented as a plugin for both Solr and Elasticsearch. (Disclaimer: I worked for the company that develops SIREn.)
This is the so-called schemaless mode in Solr. I don't know about the internal details of how it's implemented, etc.
bin/solr start -e schemaless
The snippet above starts Solr in schemaless mode; if you don't do that, it will work as usual.
For more information on schemaless, take a look here - https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode

How to Insert a document to a specific shard in Apache Solr Cloud Mode

Is there a way to add documents to a specific shard?
For example, documents of type A always get inserted into shard1 and documents of type B always go to shard2.
I have tried using a custom router, but it does not guarantee that different prefixes will route to different shards.
P.S. I am on Solr 5 using cloud mode.
A caveat: I'm using SolrNet to access SolrCloud, and it doesn't integrate with ZooKeeper yet. For Java clients, this might be far easier.
Despite what I read here and here with regard to the CompositeId router, I could never get it to work. What @jay helped me figure out is a way to use "implicit" routing to achieve this. If you create your collection like this (leaving out the numShards parameter):
http://localhost:8983/solr/admin/collections?action=CREATE&name=myCol&maxShardsPerNode=2&router.name=implicit&shards=shard1,shard2&router.field=shard
then add a field to your schema.xml named "shard" (matching the router.field parameter). You can then index to a specific shard simply by adding the shard field to the document being indexed and setting it to the shard name. At query time, you can specify the shards to search (more here; I was able to simply specify the shard name without a specific address).
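For illustration, indexing a type-A document to shard1 and then querying only that shard might look like this over raw HTTP (I used SolrNet myself, so treat this as a sketch; collection and field names follow the create call above):

curl 'http://localhost:8983/solr/myCol/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "a-1", "type": "A", "shard": "shard1"}]'

curl 'http://localhost:8983/solr/myCol/select?q=*:*&shards=shard1'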
I haven't tested this in production yet, but have verified using multiple VirtualBox instances, with ZooKeeper, HAProxy, and several Solr nodes, and it's doing exactly what I expected. Corrections and comments welcome.

Display the actual index field

At the moment I am researching the best Solr configuration for the scope of my application. This involves a lot of testing, and I was wondering whether I can display what Solr saves as its index, i.e. the tokenized, stemmed, lower-cased, etc. version of my documents. Is there any way Solr will provide this information?
Thank you
Jan
Have a look at Luke: http://www.getopt.org/luke/
Solr also has a Luke handler built-in: https://wiki.apache.org/solr/LukeRequestHandler
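For example, a request to the built-in Luke handler that shows the top indexed terms for a field might look like this (core and field names are illustrative):

curl 'http://localhost:8983/solr/mycore/admin/luke?fl=title&numTerms=10'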
You can also use the Analysis page provided in the Solr admin interface: http://wiki.apache.org/solr/SolrAdminGUI
On the analysis page, just enter the field type or field name you want the analysis for, and put in any field value. The Analysis page will show you what each tokenizer/filter is doing and exactly how your content looks after each step. It's great for testing and debugging.
You can do the same for a query, if you have defined such analyzers (tokenizers/filters) for the query side in your schema as well.
Hope this helps.

Identifying strings in documents, with nutch+solr?

I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.
I'm new to Nutch and Solr, so I wonder whether this is best done in Nutch or in Solr. One solution would be to write a parser in Nutch that identifies the strings in question and then indexes the name of the company, later mapped to a Solr value. I'm not sure how, but I guess this could also be done inside Solr, directly from the text?
Does it make sense to do this string identification in Nutch or in Solr, and is there functionality in Solr or Nutch that could help me here?
Thanks.
You could embed an NER library (see OpenNLP, LingPipe, GATE) in a custom parser, generate new fields, and create an indexing filter accordingly. This is not particularly difficult, and the advantage over doing it on the Solr side is that you'd gain from the scalability of MapReduce (NLP tasks are often CPU-hungry).
See Behemoth for an example of how to embed GATE in MapReduce.
Nutch works with Solr by indexing the crawled data into Solr via the Solr HTTP API. You trigger the indexing by calling the solrindex command; see this page for details on how to set this up.
To extract the company names, I would add the necessary code on the Solr side, using an UpdateRequestProcessor. It allows you to add an extra step to the indexing process, to add extra fields to the document being indexed. Your UpdateRequestProcessor would examine the documents sent to Solr by Nutch, extract the company names from the text, and add them as new fields on the document. Solr would then index the document plus the fields that you added.
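A minimal sketch of such a processor (class, field, and helper names are hypothetical; you would also need a matching UpdateRequestProcessorFactory registered in an update chain in solrconfig.xml):

import java.io.IOException;
import java.util.Collections;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical processor: reads the "content" field, runs a (stubbed)
// company-name extractor, and adds each match as a "company" field value.
public class CompanyNameExtractionProcessor extends UpdateRequestProcessor {

    public CompanyNameExtractionProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object content = doc.getFieldValue("content");
        if (content != null) {
            for (String company : extractCompanyNames(content.toString())) {
                doc.addField("company", company);
            }
        }
        super.processAdd(cmd); // hand the document to the next processor
    }

    private Iterable<String> extractCompanyNames(String text) {
        // stand-in for a call into OpenNLP / GATE / LingPipe
        return Collections.emptyList();
    }
}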
