Solr: indexing fb2 files

I want to use Solr to index a library whose books are in fb2 format.
fb2 is essentially just XML with a corresponding XSD schema.
But post.jar ignores *.fb2 files, and I don't understand how to map values in an fb2 file to index fields, like:
<book-title>some book</book-title>
...to a "book-title" field in the index.
Should I create a plug-in, or something else?

You should look at the Solr Data Import Handler (DIH).
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
In the Solr examples folder there is an RSS import example. If you look in the rss-data-config.xml file you will see how the XPathEntityProcessor is used to map from XML to Solr fields.
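A minimal data-config.xml for the fb2 case might look like the sketch below. The XPaths follow the usual FictionBook structure and the paths are placeholders; treat this as an illustrative starting point, not a tested configuration (note that DIH's XPath reader only supports a subset of XPath and generally matches on local element names, ignoring the FB2 namespace):

```xml
<!-- data-config.xml: a sketch, assuming fb2 files under /path/to/library
     and a "book-title" field defined in schema.xml -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="books"
            processor="FileListEntityProcessor"
            baseDir="/path/to/library"
            fileName=".*\.fb2"
            rootEntity="false">
      <entity name="book"
              processor="XPathEntityProcessor"
              url="${books.fileAbsolutePath}"
              forEach="/FictionBook">
        <field column="book-title"
               xpath="/FictionBook/description/title-info/book-title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```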
Here is some more information: http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
I have also written Tika parsers in the past to work with specific file formats.
https://lucidworks.com/blog/2010/06/18/extending-apache-tika-capabilities/
For more flexibility you can just read your files using your favorite programming language and send the data to Solr using an API. We had to do this for a recent application as the DIH wasn't flexible enough for what we wanted to achieve.

Related

Solr Language Detection using DataImportHandler

In my Solr configuration files I have defined a DataImportHandler that fetches data from a MySQL database and also processes the contents of PDF files associated with records in the SQL database. The data import works fine.
I'm trying to detect the language of the text contained in the files during the data import phase. I have specified a TikaLanguageIdentifierUpdateProcessorFactory in my solrconfig.xml as explained in https://wiki.apache.org/solr/LanguageDetection and have defined the language fields in my document schema. Nevertheless, after I run the import from the Solr admin, I cannot see any language field on my documents.
In all the examples I have seen, language detection is done by posting a document to Solr with the post command. Is it possible to do language detection with a DataImportHandler?
Once you have defined the UpdateRequestProcessor chain, you need to actually specify it in the request handler (the DataImportHandler's, in this case). You do that with the update.chain parameter.
Also, ensure that you include the LogUpdate and RunUpdate processors, otherwise nothing will actually be indexed.
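Putting both pieces together, a solrconfig.xml sketch might look like this (the chain name and the field names in langid.fl are assumptions; adapt them to your schema):

```xml
<!-- solrconfig.xml sketch: language detection chain wired into DIH -->
<updateRequestProcessorChain name="languages">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,content</str>
    <str name="langid.langField">language</str>
  </processor>
  <!-- without these two, documents never reach the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- the key part: route DIH updates through the chain -->
    <str name="update.chain">languages</str>
  </lst>
</requestHandler>
```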

How to index text files in apache solr

I have some information in a text file. I want to index it in Solr. What should the procedure be? Is there any tool that can be used for indexing in Solr? Please guide me in detail, as I am not very familiar with Solr.
I'd refer you to the Solr DataImportHandler page; it has a comprehensive tutorial on how to import data from various sources. Importing text files is covered under FileDataSource.
One way would be to convert the plain text into a CSV file. You can then use Solr's CSV upload process to index the data. Check the CSV update documentation for more configuration options.
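As a rough illustration of that conversion step, here is a small Python sketch that turns the lines of a text file into a CSV ready for Solr's CSV handler. The "id" and "text" column names are assumptions and must match fields in your schema:

```python
import csv
import io

def text_to_solr_csv(lines, out):
    """Write one CSV row per input line, with a synthetic id column.

    Solr's CSV handler reads the header row to decide which fields
    to populate; 'id' and 'text' are assumed to exist in the schema.
    """
    writer = csv.writer(out)
    writer.writerow(["id", "text"])
    for i, line in enumerate(lines, start=1):
        writer.writerow([i, line.rstrip("\n")])

# Build the CSV in memory; in practice you would write it to a file
# and POST it, e.g.:
#   curl 'http://localhost:8983/solr/update/csv?commit=true' \
#        --data-binary @docs.csv -H 'Content-type:text/csv'
buf = io.StringIO()
text_to_solr_csv(["first line", "second line"], buf)
print(buf.getvalue())
```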

How to extract solr index docs

I need to do some transformations on docs before indexing them in Solr, but the texts come from various sources and it's difficult to do the transformations before indexing, because I would have to adapt several programs to parse the files. I'm thinking of indexing them in Solr, extracting the text fields, doing the transformations, and reindexing.
I tried :
curl 'http://localhost:8983/solr/collection1/select?q=*&rows=20000&wt=xml&indent=true'
but the output is a search-results XML file, while I'm looking for some way to extract the docs with their fields in the same format used for posting. Is this possible? How should I do it?
Thanks
I would recommend using one of the Solr Clients listed on the Integrating Solr page. This will allow you to use your programming language of choice to extract and transform the Solr documents and then reload them into the index.
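As a sketch of the extract/transform/reload loop, here is what the transformation step might look like in Python. The field names and the normalization itself are illustrative assumptions, and the commented round trip assumes a client such as pysolr:

```python
def transform_doc(doc):
    """Example transformation: normalize the 'text' field.

    The actual transformation is whatever you need; this one just
    lowercases and strips whitespace as an illustration.
    """
    out = dict(doc)
    if "text" in out:
        out["text"] = out["text"].strip().lower()
    # Drop Solr-internal fields before re-adding the document
    out.pop("_version_", None)
    return out

# With a client such as pysolr, the round trip is roughly:
#   solr = pysolr.Solr("http://localhost:8983/solr/collection1")
#   docs = [transform_doc(d) for d in solr.search("*:*", rows=20000)]
#   solr.add(docs)
print(transform_doc({"id": "1", "text": "  Some TEXT  ", "_version_": 123}))
```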

How to index text files using apache solr

I wanted to index text files. After a lot of searching I learned about Apache Tika. On some sites where I read about Apache Tika, it said that Tika converts the text into XML format and then sends it to Solr. But while converting, it creates only one tag, for example
.......
Now, the text file I wish to index is a Tomcat localhost access log. This file is gigabytes in size, and I cannot store it as a single indexed document. I want each line to have a line-id
.......
so that I can easily retrieve the matching line.
Can this be done in Apache Tika?
Solr with Tika supports extraction of data from multiple file formats.
The complete list of supported file formats can be found in the Tika documentation.
You can provide any of these file formats as input; Tika will autodetect the format, extract the text from the file, and provide it to Solr for indexing.
Edit :-
Tika does not convert the text file to XML before sending it to Solr.
Tika just extracts the metadata and the content of the file and populates fields in Solr as per the mapping defined.
You either have to feed the entire file as input to Solr, in which case it would be indexed as a single document, or you have to read the file line by line and provide each line to Solr as a separate document.
Solr and Tika would not handle this for you.
You may want to look at DataImportHandler to parse the file into lines or entries. It is a better match than running Tika on something that already has internal structure.
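A minimal data-config.xml sketch for that approach is below. The log path and field names are illustrative only, and LineEntityProcessor on its own does not number the lines; a unique per-line id would still need something like a ScriptTransformer or a UUID update processor:

```xml
<!-- data-config.xml sketch: index an access log one line per document -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="lines"
            processor="LineEntityProcessor"
            url="/var/log/tomcat/localhost_access_log.txt">
      <!-- LineEntityProcessor emits each line in the 'rawLine' column -->
      <field column="rawLine" name="line"/>
    </entity>
  </document>
</dataConfig>
```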

Solr metadata index

I am new to Solr and I am extracting metadata from binary files through URLs stored in my database. I would like to know which fields are available for indexing from PDFs (the ones that would be initiated as column=""). I would also like to know how to create customized fields in Solr: how is that implemented and mapped to specific metadata coming from the files? If someone has a code snippet to show me, it would be greatly appreciated.
Thank you in advance.
To create custom fields in Solr, you will need to modify the schema.xml file for your Solr installation. The schema.xml file that comes with the Solr example included in the distribution (found under the /example folder) includes a large number of predefined metadata fields for file extraction. For information on creating custom fields in Solr, please see the following:
SchemaXml
Documents, Fields & Schema Design
Solr has a built in request handler for extracting and mapping metadata from binary files. For details, please referer to the following:
ExtractingRequestHandler
Uploading Data with Solr Cell using Apache Tika
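For illustration, a solrconfig.xml sketch of the extracting handler with some mapping rules is shown below. The target field names (author_s, text, the ignored_ prefix) are assumptions; they must match fields or dynamic fields in your schema:

```xml
<!-- solrconfig.xml sketch: Solr Cell with explicit metadata mapping -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map Tika's "Author" metadata to a custom field -->
    <str name="fmap.Author">author_s</str>
    <!-- prefix any unmapped metadata field so it can be caught
         by a dynamic field (or ignored) in the schema -->
    <str name="uprefix">ignored_</str>
    <!-- put the extracted body content into the "text" field -->
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>
```

A document is then posted to this handler with something like: curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@file.pdf".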