Indexing customized text documents in solrj - solr

I am trying to parse a text document that is in a custom format.
The format of the document is as follows:
16:15:32.036 attribute1:text1
16:15:32:042 attribute2:text2
.
.
.
.
So the format is timestamp attribute:value. I am building the search application with the SolrJ API.
My question is: how should I go about this?
Should I index the whole text as one document, or should I index it line by line?
I cannot find a SolrJ API for this. I would appreciate it if you could describe the approach you would take.
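One possible line-by-line approach, as a minimal sketch rather than a definitive answer: parse each line into a small map of fields using only the JDK, then turn each map into one Solr document. The class name LogLineParser and the field names timestamp, attribute and value are assumptions made for illustration; they would have to match whatever your schema defines, and a unique key field would normally be needed as well. The pattern accepts either "." or ":" before the milliseconds, since both appear in the sample.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of SolrJ.
class LogLineParser {
    // timestamp, whitespace, attribute name, colon, value (rest of the line)
    private static final Pattern LINE =
        Pattern.compile("^(\\d{2}:\\d{2}:\\d{2}[.:]\\d{3})\\s+([^:\\s]+):(.*)$");

    /** Returns the fields of one line, or null if the line does not match the format. */
    static Map<String, String> parseLine(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("timestamp", m.group(1)); // assumed field name
        doc.put("attribute", m.group(2)); // assumed field name
        doc.put("value", m.group(3).trim()); // assumed field name
        return doc;
    }

    /** Parses every line and silently skips the ones that do not match. */
    static List<Map<String, String>> parseAll(List<String> lines) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String line : lines) {
            Map<String, String> doc = parseLine(line);
            if (doc != null) {
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

Each resulting map could then be copied into a SolrInputDocument with addField and sent with SolrClient.add, followed by a commit. Indexing per line keeps the timestamp and attribute searchable as separate fields, which indexing the whole file as one text blob would not.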

Related

How to search for text in Solr using the REST API?

I have a query in Elastic Search shown below
sample code
How can I perform the same using Solr?
You can search with a query string like the one below.
http://domain/solr/core_name/select?q=*:*&fq=studentId:14466&wt=json&indent=true
You can do a single match against a field in standard Lucene syntax:
q=field:value
Having field names with characters outside of [a-zA-Z0-9] isn't recommended, but you could probably escape it properly with student\ I\'d:14466.
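If the request is built from Java rather than typed into a browser, the parameters should be URL-encoded. The following is a small sketch using only the JDK (Java 10 or later); the class name SolrSelectUrl and its build method are made up for this example, and the host and core name are the placeholders from the URL above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a select URL from a core URL and request parameters.
class SolrSelectUrl {
    static String build(String coreUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            // Encode key and value separately so that ':' or spaces in them
            // cannot break the query string.
            query.add(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        return coreUrl + "/select?" + query;
    }
}
```

Passing a LinkedHashMap with q=*:*, fq=studentId:14466, wt=json and indent=true, in that order, yields the same request as above with the colons percent-encoded as %3A. A plain Map holds only one value per key, so several fq filters would need a different parameter type.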

In Solr hierarchical facets, is there a way to use a character other than «/» to separate nodes in the hierarchical faceting path field?

I need your help.
I'm working on a Typo3 website about mathematics, and we use:
A Solr server to provide the search engine.
A Typo3 Solr extension to provide the connection between our Typo3 CMS and our Solr server.
We have indexed objects that are organized in a tree, and we use this tree to provide a hierarchical facet presentation for search. For this, we programmatically generate and maintain a path string that Solr uses.
Unfortunately, some of our indexed object titles contain slashes «/» (for example those involving fractions). That leads to unpredictable results when rendering the hierarchical facets based on these titles, because Solr interprets each slash as the start of a child node.
We cannot encode the titles as HTML entities and decode them again, because we would lose the search features on the names unless we handled encoding and decoding of the special characters everywhere, which we have no time to do.
My question is simple:
Is there a way to configure a separator character for the hierarchical facet path? For example, a neat, simple configuration key in TypoScript:
plugin.tx_solr.index.fieldProcessingInstruction.separator = ### #<--Whatever...
I would be so glad not to have to dive into the Typo3 Solr extension source code again to bugfix my website!
Thanks to anybody for any clue.
OK, after having lost some time trying to configure it in the schema.xml and in general_schema_*.xml files, I went to the source code of the Typo3 Solr extension, my old dreaded sleeping balrog.
It turns out that the separator character is hardcoded in 5 scattered class files:
class.tx_solr_facet_hierarchicalfacetrenderer.php
class.tx_solr_fieldprocessor_pathtohierarchy.php
class.tx_solr_facet_hierarchicalfacethelper.php
class.tx_solr_fieldprocessor_pageuidtohierarchy.php
class.tx_solr_query_filterencoder_hierarchy.php
All I did was replace it in these files (pointing to one single public static constant, duh) and apologize to my supervisors for taking so long to fix such a simple and stupid bug, and now everything works fine!

formatting of files before indexing into solr server

I'm using a Solr server to provide search capability for a tool. I wanted to know whether Solr provides a facility that would let me format some files before they are indexed. More specifically, I have a plain text file with a lot of data, and I want to convert it to an XML format before I index the XML file. For example:
some data! some more data : more values
I want to convert this sample line to something like
<field 1>sample data </field 1>
<field 2> some more data </field 2>
<field 3> more values </field 3>
Does Solr provide a facility for this type of transformation before indexing a file using Solr Cell? Does it provide any classes or interfaces that I can implement in my Java application?
Thanks in advance!
Are you pushing data into Solr, or can Solr pull it from the source?
If you are pushing into Solr, then you have to use an Update Request Processor. However, I am not aware of any that will split data into multiple fields. You may need to write one yourself.
If you are pulling from the source using DataImportHandler, it has built-in support for splitting content into multiple fields using RegexTransformer.
Both Update Request Processors and DIH support transformers written in JavaScript (and possibly other Java scripting languages), so you can also write your own script to split the data in whatever way you want.
Some of this is only available starting with version 4 of Solr, though. That's a requirement to keep in mind.
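As a client-side alternative to the transformers above, the split could also be done in the Java application before the data is pushed to Solr. This is only a sketch under assumptions: it treats "!" and ":" as the delimiters, as in the sample line from the question, and the class name LineSplitter and the field names field_1 to field_3 are placeholders, not anything Solr provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits one raw line into named fields.
class LineSplitter {
    static Map<String, String> split(String line) {
        // Split on '!' or ':' together with surrounding whitespace,
        // into at most three parts.
        String[] parts = line.trim().split("\\s*[!:]\\s*", 3);
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put("field_" + (i + 1), parts[i]); // assumed field names
        }
        return fields;
    }
}
```

For the sample line "some data! some more data : more values" this gives three fields: "some data", "some more data" and "more values". The map can then be sent as a document with SolrJ, so no intermediate XML file has to be written.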
You'll need a custom index handler or a SolrRequestHandler.

Apache Solr converting to XML

I am learning how to use Solr, but I am stuck on how to upload a book in .txt format to Solr. I know I need to convert it to XML format, but I don't know how to do that or what the format looks like. Can someone explain the process step by step?
To avoid creating the input document in XML format yourself, you could use the Tika request handler (it extracts text from various formats, including plain text).

How to export solr result into a text file?

I need to export the doc_id, all fields, score, and rank of a search result in order to evaluate the results. How can I do this in Solr?
Solr provides a CSV response writer, which lets you export Solr results as a CSV file.
http://localhost:8983/solr/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv
All the fields queried would be returned by Solr in proper format.
This has nothing to do with SOLR. When you make a SOLR query over http, then SOLR does the search and returns the results to you in your desired format. The default is XML but lots of people specify wt=json to get results in json format. If you want this result in a text file, then make your search client put it there.
In the browser, File -> Save As.
But most people who want this use curl as the client and use the -o option like this:
curl -o result1.xml 'http://solr.local:8080/solr/stuff/select?indent=on&version=2.2&q=fish&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&explainOther=&hl.fl='
Note the single quotes around the URL due to the use of & characters.
There is no built-in export function in Solr. The easiest way would be to query your Solr instance and evaluate the XML result. Check out Querying Data in the Solr Tutorial for details on how to query Solr. To convert the result into a text file, I would recommend using one of the Solr clients listed on the Integrating Solr page in the Solr Wiki, and then creating the text file in your programming language of choice.
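A rough sketch of that last step in Java, assuming the client has already returned the documents in result order as a list of field maps (with SolrJ, each SolrDocument can be read that way). The class name ResultExporter and the tab-separated layout are choices made for this example. Solr does not return a rank field, so the rank is derived here from the position in the result list plus the start offset of the query.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: writes already-fetched results to a tab-separated text file.
class ResultExporter {
    static void writeTsv(List<Map<String, Object>> docs, List<String> fields,
                         int start, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("rank\t" + String.join("\t", fields)); // header row
        int rank = start + 1; // rank = 1-based position in the full result list
        for (Map<String, Object> doc : docs) {
            StringBuilder row = new StringBuilder().append(rank++);
            for (String field : fields) {
                Object value = doc.get(field);
                // Missing fields become empty cells; tabs inside values are not escaped.
                row.append('\t').append(value == null ? "" : value.toString());
            }
            lines.add(row.toString());
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

To get the score into the file, the query has to request it explicitly in the field list, as in the fl=*,score parameter of the curl example above.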
