Solr metadata index

I am new to Solr and am extracting metadata from binary files through URLs stored in my database. I would like to know which fields are available for indexing from PDFs (the ones that would be initialized as column=""). I would also like to know how to create custom fields in Solr: how are they implemented and mapped to specific metadata coming from the files? If someone has a code snippet that could show me, it would be greatly appreciated.
Thank you in advance.

To create custom fields in Solr, you will need to modify the schema.xml file for your Solr installation. The schema.xml file that comes with the Solr example included in the distribution (under the /example folder) already defines a large number of metadata fields for file extraction. For information on creating custom fields in Solr, please see the following (a minimal field declaration follows the links):
SchemaXml
Documents, Fields & Schema Design
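As a minimal sketch, a custom field is a single field element in schema.xml. The author_s name below is an assumption for illustration; the ignored type and the ignored_* dynamic field do ship with the example schema:

<!-- a custom field to hold extracted PDF "Author" metadata -->
<field name="author_s" type="string" indexed="true" stored="true"/>
<!-- catch-all for extracted metadata fields you have not mapped explicitly -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>

Fields declared this way can then be targeted by the extraction handler's fmap.* parameters, as shown in the next answer.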
Solr has a built-in request handler for extracting and mapping metadata from binary files. For details, please refer to the following (a SolrJ sketch follows the links):
ExtractingRequestHandler
Uploading Data with Solr Cell using Apache Tika
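As a hedged sketch, here is how you might post a PDF to the extracting handler with SolrJ. The Solr URL, core name, document id, and field names are assumptions, and the client class names vary by SolrJ version (this uses HttpSolrClient from SolrJ 6+):

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractPdf {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Send the binary file to the ExtractingRequestHandler endpoint.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("manual.pdf"), "application/pdf");

        req.setParam("literal.id", "doc1");    // literal.* sets a field value directly
        req.setParam("fmap.content", "text");  // fmap.* maps a Tika field to a schema field
        req.setParam("uprefix", "ignored_");   // unmapped metadata falls into ignored_*

        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(req);
        solr.close();
    }
}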

Related

How to index text files in Apache Solr

I have some information in a text file and I want to index it in Solr. What should the procedure be? Is there any tool that can be used for indexing in Solr? Please guide me in detail, as I am not very familiar with Solr.
I'd refer you to the Solr DataImportHandler page; it has a comprehensive tutorial on how to import data from various sources. Importing text files is covered under FileDataSource.
One way would be to convert the plain text into a CSV file. You can then use the CSV upload process to index the data in Solr (a sketch follows). Check the documentation here for more configuration options.
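If you take the CSV route, a hedged SolrJ sketch follows; the file name, Solr URL, and core name are assumptions. In recent Solr versions the /update handler dispatches on content type, so posting with text/csv selects the CSV loader:

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class UploadCsv {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // The first row of the CSV is expected to name the target fields.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
        req.addFile(new File("data.csv"), "text/csv");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}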

Solr reindexing only modified documents

I am using the Solr DataImportHandler with Tika to search rich documents such as Word and PDF files. Whenever a new file is added or an existing file changes, I have to do a full import to include the changes in the search. As the number of documents is very high, I need a way to re-index only the newly added or modified files (similar to delta-import). I know delta-import cannot be used with the Tika entity processor, and the clean=false attribute does not work for my scenario either. Is there any way I can achieve this? Thanks in advance for the response.

Tika installation

I integrated Tika with Solr following the instructions provided in this link.
Correct me if I am wrong, but it seems that it can index document files (pdf, doc, audio) located on my own system (given the path of the directory in which those files are stored), yet it cannot index files located on the internet when I crawl some sites using Nutch.
Can I index document files (pdf, audio, doc, zip) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr
Using ExtractingRequestHandler, through which you upload the binary file to the Solr server so that Solr does the work for you; this way Tika is not required on the client side.
In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to generate text content from them, and then index the data in Solr as it would normally do with text documents. Nutch already uses Tika; I guess it is just a matter of configuring the types of documents you want to index by changing the regex-urlfilter.txt Nutch config file, removing from the following line the file extensions that you want indexed:
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
This way you would use the first option I mentioned. You then need to enable the Tika plugin for Nutch within your nutch-site.xml; have a look at this discussion from the Nutch mailing list.
This should theoretically work; let me know if it doesn't.
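For the first option outside of Nutch, a minimal client-side sketch looks like this, assuming Tika and SolrJ are on the classpath; the file URL, Solr URL, core name, and field names (id, title, text) are assumptions:

import java.io.InputStream;
import java.net.URL;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideTikaIndexer {
    public static void main(String[] args) throws Exception {
        String fileUrl = "http://example.com/report.pdf"; // any reachable binary file

        // Extract text and metadata on the client with Tika.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new URL(fileUrl).openStream()) {
            parser.parse(in, handler, metadata);
        }

        // Manually index the extracted content with SolrJ.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fileUrl);
        doc.addField("title", metadata.get("title"));
        doc.addField("text", handler.toString());

        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}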

Path of Solr Document

I would like to know where the indexed documents are saved in Solr. I have installed the Solr server at C:\solr and am using Solr 1.4. By making the necessary changes in the configuration files, I am able to search data using the Solr client. I am just wondering where the indexed documents are saved.
Indexed documents are saved in the index, which is located in the solr/data/index folder.
Here you can find more details about those files.
From LuceneFAQ:
The index database is composed of 'segments', each stored in a separate file. When you add documents to the index, new segments may be created. These are periodically merged together.
EDIT:
If you want to examine the contents of your index and tweak or troubleshoot your schema (analysis), see the instructions in this recent post about the greatest Lucene tool ever: Luke.

Identifying strings in documents with Nutch + Solr?

I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.
I'm new to Nutch and Solr, so I wonder whether this is best done in Nutch or in Solr. One solution would be to write a parser in Nutch that identifies the strings in question and then indexes the company name, later mapped to a Solr value. I'm not sure how, but I guess this could also be done inside Solr directly from the text?
Does it make sense to do this string identification in Nutch or in Solr, and is there some functionality in Solr or Nutch that could help me here?
Thanks.
You could embed an NER library (see OpenNLP, LingPipe, GATE) into a custom parser, generate new fields, and create an indexing filter accordingly. This is not particularly difficult, and the advantage compared to doing it on the Solr side is that you would gain from the scalability of MapReduce (NLP tasks are often CPU-hungry).
See Behemoth for an example of how to embed GATE in MapReduce.
Nutch works with Solr by indexing the crawled data into Solr via the Solr HTTP API. You trigger the indexing by calling the solrindex command; see this page for details on how to set this up, and the sample invocation below.
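For reference, the Nutch 1.x tutorial invokes it roughly like this; the crawl directory layout is an assumption from the standard tutorial setup:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*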
To be able to extract the company names, I would add the necessary code in Solr. I would use an UpdateRequestProcessor, which lets you add an extra step to the indexing process to add extra fields to the document being indexed. Your UpdateRequestProcessor would examine the document sent to Solr by Nutch, extract the company names from the text, and add them as new fields. Solr would then index the document plus the fields that you added. A sketch follows.
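Here is a hedged sketch of such a processor. The text and company field names are assumptions, and findCompanies() is a hypothetical placeholder for your actual NER code; the factory would be registered in solrconfig.xml as part of an updateRequestProcessorChain attached to your update handler:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CompanyNameProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new CompanyNameProcessor(next);
    }

    static class CompanyNameProcessor extends UpdateRequestProcessor {
        CompanyNameProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object text = doc.getFieldValue("text"); // assumed full-text field
            if (text != null) {
                // findCompanies() is hypothetical: plug in your NER library
                // or dictionary lookup here.
                for (String company : findCompanies(text.toString())) {
                    doc.addField("company", company); // assumed multiValued field
                }
            }
            super.processAdd(cmd); // pass the document down the chain
        }

        private java.util.List<String> findCompanies(String text) {
            return java.util.Collections.emptyList(); // placeholder
        }
    }
}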
