In my Solr configuration files I have defined a DataImportHandler that fetches data from a MySQL database and also processes the contents of PDF files related to records in the SQL database. The data import works fine.
I'm trying to detect the language of the text contained in the files during the data import phase. I have specified a TikaLanguageIdentifierUpdateProcessorFactory in my solrconfig.xml as explained at https://wiki.apache.org/solr/LanguageDetection and have defined the language fields in my document schema. Nevertheless, after I run the import from the Solr admin, I cannot see any language field on my documents.
In all the examples I have seen, language detection is done by posting a document to Solr with the post command. Is it possible to do language detection with a DataImportHandler?
Once you have defined the UpdateRequestProcessor chain, you need to actually specify it in the request handler (the DataImportHandler's in this case). You do that with the update.chain parameter.
Also, ensure that you include the LogUpdate and RunUpdate processors, otherwise you are not indexing at all.
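A minimal sketch of what this looks like in solrconfig.xml (the field names in langid.fl and langid.langField below are assumptions, adjust them to your schema):

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <!-- assumed field names: detect language from "text", store it in "language" -->
    <str name="langid.fl">text</str>
    <str name="langid.langField">language</str>
  </processor>
  <!-- without these two, documents are never actually written to the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- this is the part that ties the chain to the DIH -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>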
I want to use Solr to index a library of books in FB2 format.
In fact, FB2 is just XML with a corresponding XSD schema.
But post.jar ignores *.fb2 files, and I don't understand how to map values in an fb2 file to index fields, like:
<book-title>some book</book-title>
...to "book-title" field in index.
Should I create a plug-in, or something else?
You should look at the Solr Data Import Handler (DIH).
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
In the Solr examples folder there is an RSS import example. If you look in the rss-data-config.xml file you will see how they use the XPathEntityProcessor to map from XML to Solr fields.
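As a rough sketch of how this could look for FB2 files (the FB2 element paths and the file location here are illustrative assumptions; verify them against the actual FB2 XSD, and for a whole directory of files you would wrap this in a FileListEntityProcessor):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="book"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/FictionBook"
            url="/path/to/books/example.fb2">
      <!-- illustrative xpaths mapping FB2 elements to Solr fields -->
      <field column="book-title" xpath="/FictionBook/description/title-info/book-title"/>
      <field column="author" xpath="/FictionBook/description/title-info/author/last-name"/>
    </entity>
  </document>
</dataConfig>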
Here is some more information: http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
I have also written Tika parsers in the past to work with specific file formats.
https://lucidworks.com/blog/2010/06/18/extending-apache-tika-capabilities/
For more flexibility, you can just read your files using your favorite programming language and send the data to Solr using an API. We had to do this for a recent application, as the DIH wasn't flexible enough for what we wanted to achieve.
I would like to add omitNorms=true to the title field.
It is wrongly over-boosting some of our titles.
However, I don't know how the title field is indexed. What is its name, just dc.title?
In the schema.xml I don't see anything about it. What is the type of that field, and what analyzer or anything else is used for it? Is there any way to know?
Most metadata fields in DSpace are handled via dynamic fields. That's why you don't see each specified individually in the search core's schema.xml file.
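For illustration, a dynamic field rule in schema.xml looks something like this (a generic sketch, not the exact DSpace definition):

<!-- any field whose name matches the pattern gets this type automatically,
     so individual metadata fields never need their own <field> entries -->
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>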
I'm not sure where the boosting is happening (or whether DSpace does any at all). I don't recall seeing any boost clauses when looking through the Solr log files. I do see some extraction parameters being set in SolrServiceImpl#writeDocument, where the document is indexed. It looks like there is an extraction parameter for boosting individual fields; perhaps you can play with that to get what you'd like.
If you want to see the field type for any Solr field, the easiest option is probably the Schema Browser in the Solr admin user interface, e.g. http://localhost:8080/solr/#/search/schema-browser?field=title (you may need an SSH tunnel or the like to access Solr running on a different host, since the DSpace Solr install is typically IP-limited to access from localhost).
I just found that Solr 5 doesn't require a schema file to be predefined; it generates the schema based on the indexing being performed. I would like to know how this works in the background.
Also, is it good practice or not? Is there any way to disable it?
The schemaless feature has been in Solr since version 4.3, but it may only have become stable recently, as a concurrency issue with it was fixed in 4.10.
It is also called managed schema. When you configure Solr to use a managed schema, Solr uses a special UpdateRequestProcessor to intercept document indexing requests and guess field types.
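An abridged sketch of the kind of update processor chain involved (the processor classes below are real, but the chain shipped with Solr is longer; this is trimmed for illustration):

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- first try to parse raw string values into more specific types -->
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory"/>
  <!-- then add any still-unknown fields to the schema with a guessed type -->
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">text_general</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str>
      <str name="fieldType">tlongs</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>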
Solr starts with your schema.xml file and creates a new file called, by default, managed-schema to store all the inferred schema information. This file is automatically overwritten by Solr as it detects changes to the schema.
You should then use the Schema API if you want to make changes to the Schema. See also the Schemaless Mode documentation.
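For example, a field can be added through the Schema API with a simple HTTP call (the core name "mycore" and field name "pub_year" here are made up):

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"pub_year","type":"int","stored":true}}' \
  http://localhost:8983/solr/mycore/schema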
How to change Solr managed schema to classic schema
Stop Solr: $ bin/solr stop
Go to server/solr/mycore/conf, where "mycore" is the name of your core/collection.
Edit solrconfig.xml:
search for <schemaFactory class="ManagedIndexSchemaFactory"> and comment out the whole element
search for <schemaFactory class="ClassicIndexSchemaFactory"/> and uncomment it (see the snippet after these steps)
search for the <initParams> element that refers to add-unknown-fields-to-the-schema and comment out the whole <initParams>...</initParams>
Rename managed-schema to schema.xml and you are done.
You can now start Solr again ($ bin/solr start), go to http://localhost:8983/solr/#/mycore/documents, and check that Solr now refuses to index a document with a new field not yet specified in schema.xml.
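The relevant part of solrconfig.xml ends up looking roughly like this (a sketch; the exact sub-elements vary between Solr versions):

<!-- managed schema factory, commented out -->
<!--
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
-->

<!-- classic factory, uncommented: Solr now reads schema.xml and rejects unknown fields -->
<schemaFactory class="ClassicIndexSchemaFactory"/>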
Is it a good practice? When to use it?
It depends on what you want. If you want to enforce a specific document structure (e.g. to make sure that all docs are "well-formed" according to your definition), then you want to use classic schema management.
If, on the other hand, you don't know the doc structure upfront, then you might want to use the schemaless feature.
Limits
While it is called schema-less, there are limits to the kinds of structures that you can index. This is true both for Solr and Elasticsearch, by the way. For example, if you first index this doc:
{"name":"John Doe"}
then you will get an error if you try to index a doc like this next:
{"name": {
"first": "Daniel",
"second": "Dennett"
}
}
That is because in the first case the field "name" was of type string, while in the second case it is an object.
If you would like indexing that goes beyond these limitations, you could use SIREn - an open source semi-structured information retrieval engine implemented as a plugin for both Solr and Elasticsearch. (Disclaimer: I worked for the company that develops SIREn.)
This is the so-called schemaless mode in Solr. I don't know the internal details of how it's implemented, etc.
bin/solr start -e schemaless
The command above starts Solr in schemaless mode; if you don't do that, it will work as usual.
For more information on schemaless, take a look here - https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode
I recently got involved in a task that requires using Apache Solr (for document search) and Apache Tika (to extract the metadata or plain text from documents).
I haven't integrated Solr and Tika yet, but I have worked with both of them individually. I have a set of questions related to Apache Solr and Apache Tika; they might be at a beginner or intermediate level.
With Solr, I have done the following kinds of practical work: created a dummy database, configured schema.xml and the like, ran the Solr server, wrote a program which fetches documents from the database and stores them in the Solr document index, made a simple client to fetch data from Solr via the JSON interface, and made a program which keeps a MySQL database in sync with Solr's document index.
With Tika, I have compiled and installed Tika and understood its document parsing capabilities.
My sample task statement:
Part of my project requires storing around 100,000 documents (the data of these 100,000 docs (DOC, PDF, TXT) is extracted by Apache Tika, pushed to a MySQL database, and later pushed to Apache Solr's document index) for full-text search, and searching them via a client interface (browser).
At a simple programming level, this task can be done.
I would like to understand the challenges related to managing the index (or anything else) in Solr, e.g.:
** At an advanced level, does it require modifying Solr's open source code?
** Even when Solr is working properly, does it present any specific challenges?
** What key things need to be considered initially so that Solr works properly?
** Do you think any extra tool needs to be developed to monitor Solr?
I hope this gives an idea of the questions I have.
** Also, if you have any experience of using Apache Tika with Apache Solr, are there any challenges or key things to consider?
Could you recommend any specific sources, or do you have any document or anything else you feel would be helpful?
I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it just wasn't designed with security in mind. It's meant to be used as an internal service, just as you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not concern itself with security either at the document level or the communication level. It is strongly recommended that the application server containing Solr be firewalled such that the only clients with access to Solr are your own. A default/example installation of Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
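In an older (pre-4.0) solrconfig.xml, that registration looks something like this; commenting it out is enough (the handler name may differ in your config):

<!-- comment out or delete the update handler registration -->
<!--
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
-->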
Replication is a nice way to set up read-only access while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any indexing directly.
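A sketch of the ReplicationHandler configuration for such a setup (the master URL, poll interval, and conf file list are placeholders):

<!-- on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on the read-only slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- placeholder URL: point at the master core's replication handler -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>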
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update requestHandler and I was no longer able to index.
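If I understand it correctly, this is the enable attribute, which can even be driven by a system property (the property name below is made up):

<!-- hypothetical property name; defaults to disabled, so the instance is read-only
     unless started with -Dsolr.updates.enabled=true -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler"
                enable="${solr.updates.enabled:false}" />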