Apache Solr converting to XML - solr

I am learning how to use Solr, but I am stuck on how to upload a book in .txt format to Solr. I know I need to convert it to XML format, but I don't know how to do that or what the format looks like. Can someone explain the process step by step?

To avoid creating the input document in XML format yourself, you could use the Tika request handler, which extracts text from various formats, including plain text.
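If you do want to build the XML yourself, Solr's XML update format is an <add> element containing one <doc> per document, with one <field name="..."> element per field. The following is a minimal Python sketch of that conversion, not a definitive implementation. The function name book_to_solr_xml and the field names id, title and text are assumptions for illustration; the fields would have to exist in your own schema.

```python
import xml.etree.ElementTree as ET


def book_to_solr_xml(doc_id, title, text):
    """Wrap one plain-text book in Solr's <add><doc> XML update format."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions; they must match fields in your schema.
    for name, value in (("id", doc_id), ("title", title), ("text", text)):
        field = ET.SubElement(doc, "field", name=name)
        # ElementTree escapes &, < and > when the tree is serialized.
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

You would read the .txt file into a string, pass it as text, and post the returned XML to Solr's update handler.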

Indexing customized text documents in solrj

I am trying to parse a text document that is in a custom format.
The format of this document is as follows:
16:15:32.036 attribute1:text1
16:15:32:042 attribute2:text2
.
.
.
.
So the format is timestamp attribute:value. I am designing the search application using the SolrJ API.
My question is: how should I go about this?
Should I index the whole text, or should I index it line by line?
I am not able to find a SolrJ API for this. I would appreciate it if you could guide me with the approach you would take.
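One possible approach is to parse the file yourself and treat each line as its own document, so that the timestamp, attribute and value become separate searchable fields. The sketch below illustrates only the parsing step, in Python. The field names (id, timestamp, attribute, value), the source prefix used to build ids, and the decision to accept either "." or ":" before the milliseconds (both appear in the sample) are assumptions. In SolrJ, each resulting record would be copied into a SolrInputDocument before being added.

```python
import re

# timestamp, whitespace, attribute name, colon, value
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}[.:]\d{3})\s+([^:\s]+):(.*)$")


def parse_lines(lines, source="log-1"):
    """Turn 'timestamp attribute:value' lines into one record per line."""
    docs = []
    for number, raw in enumerate(lines, start=1):
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines that do not follow the format
        timestamp, attribute, value = match.groups()
        docs.append({
            "id": f"{source}-{number}",  # hypothetical unique key
            "timestamp": timestamp,
            "attribute": attribute,
            "value": value.strip(),
        })
    return docs
```

Indexing per line keeps each hit small and lets you filter by attribute; indexing the whole file as one document would only tell you which file matched.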

I am new to Apache Solr and am working on a package. The index created by Solr has only a .CFS file, a .gen file, an insegmentparents file, and a .del file.

I know it contains a header and file data in raw format, but does this mean that every time I query the index, the raw data is processed to find the frequency of terms, since I cannot see a .frq file? Is there any way to find out how the data is stored in the .cfs file?
The index uses the compound file format, which is why a single .cfs file is created that combines all the other index files.
Check the Lucene File Formats documentation, which describes the index file formats in detail.
You can use Luke to explore your Lucene index files.

formatting of files before indexing into solr server

I'm using the Solr server to provide search capability for a tool. I wanted to know whether Solr provides a facility that will allow me to format some files before they are indexed. More specifically, I have a plain text file with a lot of data, and I want to convert it to an XML format before I index the XML file, e.g.
some data! some more data : more values
I want to convert this sample line to something like
<field 1>sample data </field 1>
<field 2> some more data </field 2>
<field 3> more values </field 3>
Does Solr provide a facility for this type of transformation before indexing a file using Solr Cell? Does it provide any classes or interfaces that I can implement in my Java application?
Thanks in advance!
Are you pushing data into Solr, or can Solr pull it from the source?
If you are pushing into Solr, then you have to use an Update Request Processor. However, I am not aware of any that will split data into multiple fields. You may need to write one yourself.
If you are pulling from the source using DataImportHandler, it has built-in support for splitting content into multiple fields using RegexTransformer.
Both Request Processor and DIH support JavaScript (and possibly other Java scripting languages) transformers, so you can also write your own script to split the data in whatever way you want.
Some of this is only available starting with version 4 of Solr, though. That's a requirement to keep in mind.
You'll need a custom Index Handler or a SolrRequestHandler.
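A third option is to do the split on the client before anything is sent to Solr. The Python sketch below shows that idea for the sample line from the question. It assumes that the first "!" and the following ":" are the delimiters, and the names split_line, field_1, field_2 and field_3 are hypothetical; the same regular expression could serve as a starting point for a RegexTransformer or a custom processor.

```python
import re

# Three groups: text before the first "!", text up to the next ":", the rest.
SPLIT_RE = re.compile(r"^(.*?)!(.*?):(.*)$")


def split_line(line):
    """Split 'a! b : c' into three named fields (names are hypothetical)."""
    match = SPLIT_RE.match(line)
    if match is None:
        raise ValueError(f"line does not match the expected layout: {line!r}")
    first, second, third = (part.strip() for part in match.groups())
    return {"field_1": first, "field_2": second, "field_3": third}
```

Each returned dictionary maps directly onto the fields of one Solr document.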

Solr CSV import that has a field with a non standard date format

I am trying to import several .csv files that contain a few fields with a date format of "yyyyMMdd". I quickly found that the DataImportHandler does not easily support csv files. In the DataImportHandler, it is possible to use the LineEntityProcessor and then a RegexTransformer, but that is pretty messy. The next method I tried was to post the file to the CSVRequestHandler, but I have not found a way to specify what SimpleDateFormat to use to parse the field. I have been searching for a way around this problem, but I think I am doomed to either a pre-processing step, or mucking with the RegexTransformer. Any help would be greatly appreciated.
Edit: I should add that I am on Solr 3.5.
Or, to stay within Solr, index it in a string field and use an UpdateProcessor to copy it to a date field in the right format. Pretty easy.
Have you looked into the DateFormatTransformer?
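If you do end up with a pre-processing step, it can be quite small. The Python sketch below rewrites yyyyMMdd values into the ISO 8601 form that Solr date fields accept (for example 2012-03-15T00:00:00Z) before the file is posted to the CSV handler. The function names and the column name used in the usage note are assumptions, and the dates are treated as midnight UTC, which may not be what your data means.

```python
import csv
import io
from datetime import datetime


def to_solr_date(value):
    """Convert 'yyyyMMdd' to Solr's date format, assuming midnight UTC."""
    parsed = datetime.strptime(value, "%Y%m%d")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def convert_csv(text, date_columns):
    """Return the CSV text with the named date columns rewritten."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in date_columns:
            if row[column]:  # leave empty values untouched
                row[column] = to_solr_date(row[column])
        writer.writerow(row)
    return out.getvalue()
```

For a file with a hypothetical released column, convert_csv(text, ["released"]) returns CSV text that can be posted unchanged.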

How to export solr result into a text file?

I need to export the doc_id, all fields, score, and rank of one search result in order to evaluate the results. How can I do this in Solr?
Solr provides a CSV response writer, which will help you export the results of a Solr query to a CSV file.
http://localhost:8983/solr/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv
All the fields queried would be returned by Solr in proper format.
This has nothing to do with SOLR. When you make a SOLR query over http, then SOLR does the search and returns the results to you in your desired format. The default is XML but lots of people specify wt=json to get results in json format. If you want this result in a text file, then make your search client put it there.
In the browser, File -> Save As.
But most people who want this use curl as the client and use the -o option like this:
curl -o result1.xml 'http://solr.local:8080/solr/stuff/select?indent=on&version=2.2&q=fish&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&explainOther=&hl.fl='
Note the single quotes around the URL due to the use of & characters.
There is no built-in export function in Solr. The easiest way would be to query your Solr instance and evaluate the XML result. Check out Querying Data in the Solr Tutorial for details on how to query a result from Solr. To convert the result into a text file, I would recommend using one of the Solr clients found on the Integrating Solr page in the Solr Wiki, in the programming language of your choice, to create the text file.
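As an illustration of the client-side approach, here is a small Python sketch that takes an already parsed wt=json response and writes one tab-separated line per document: rank, id, score, then all returned fields as JSON. It assumes the standard response layout with the documents under response.docs, that score was requested through the fl parameter, and that the unique key field is called id. Rank is not a Solr field; it is derived here from the position in the result list plus the start offset. The function names are hypothetical.

```python
import json


def results_to_lines(response, start=0):
    """Build one 'rank<TAB>id<TAB>score<TAB>fields' line per document."""
    docs = response["response"]["docs"]
    lines = []
    for offset, doc in enumerate(docs):
        rank = start + offset + 1  # 1-based position in the full result list
        fields = json.dumps(doc, sort_keys=True)
        lines.append(f"{rank}\t{doc.get('id')}\t{doc.get('score')}\t{fields}")
    return lines


def write_results(response, path, start=0):
    """Write the formatted lines to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in results_to_lines(response, start):
            handle.write(line + "\n")
```

The response dictionary can come from any HTTP client, for example by loading the body that curl saved with json.load.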
