Formatting of files before indexing into the Solr server

I'm using the Solr server to provide search capability for a tool. I wanted to know if there is a facility provided by Solr that will allow me to format some files before they are indexed. More specifically, I have a plain text file with a lot of data, and I want to convert it to an XML format before I index the XML file. E.g.
some data! some more data : more values
I want to convert this sample line to something like:
<field1>some data</field1>
<field2>some more data</field2>
<field3>more values</field3>
Does Solr provide a facility for this type of transformation before indexing a file using Solr Cell? Does it provide any classes or interfaces that I can implement in my Java application?
Thanks in advance!

Are you pushing data into Solr, or can Solr pull it from the source?
If you are pushing into Solr, then you have to use an Update Request Processor. However, I am not aware of one that will split data into multiple fields, so you may need to write it yourself.
If you are pulling from the source using the DataImportHandler, it has built-in support for splitting content into multiple fields using the RegexTransformer.
Both Update Request Processors and the DIH support JavaScript (and possibly other JVM scripting languages) transformers, so you can also write your own script to split the data in whatever way you want.
Some of this is only available starting with Solr 4, which is a requirement to keep in mind.
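To illustrate the Update Request Processor route, here is a minimal sketch, assuming the raw line arrives in a field called "rawline" and should be split into "field1", "field2", and so on. All field names and the split on "!" / ":" are placeholders based on the question's example, not an existing Solr processor:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SplitLineUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object raw = doc.getFieldValue("rawline"); // placeholder source field
                if (raw != null) {
                    String[] parts = raw.toString().split("[!:]"); // naive split, adapt to your data
                    for (int i = 0; i < parts.length; i++) {
                        doc.addField("field" + (i + 1), parts[i].trim());
                    }
                }
                super.processAdd(cmd); // pass the document on to the rest of the chain
            }
        };
    }
}

You would then register the factory in an updateRequestProcessorChain in solrconfig.xml and reference that chain from your update handler.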

You'll need a custom Index Handler or a SolrRequestHandler

Related

Where is the schema definition of a PDF index in Solr?

All, I have succeeded in indexing a PDF file into Solr with post.jar.
I can see the file indexed when I query for it.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion etc. come from. I tried to find them in schema.xml, but haven't found them yet. Where are they defined? Did I miss something? Thanks.
This is the metadata stored by Apache Tika.
In addition to Tika's metadata, Solr adds its own fields (defined in ExtractingMetadataConstants):
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
From the documentation's "Metadata" section:
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr.
Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See
ContentStream. "stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommend that you try using the extract only option to
see what values actually get set for these.
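For example, with SolrJ you could run an extract-only request to see exactly which fields Tika and Solr would produce for a given file. This is a rough sketch; the server URL and file name are placeholders, and the two-argument addFile is the Solr 4.x signature (3.x takes only the File):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnlyExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("resume.pdf"), "application/pdf");
        req.setParam("extractOnly", "true"); // nothing is indexed; Solr only reports what it would extract
        NamedList<Object> result = solr.request(req);
        System.out.println(result); // includes the extracted text plus all metadata fields
    }
}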

In Solr hierarchical facets, is there a way to use a character other than «/» to separate nodes in the hierarchical faceting path field?

I need your help.
I'm working on a Typo3 website about mathematics, and we use:
A Solr server to provide the search engine.
A Typo3 Solr extension to provide the connection between our Typo3 CMS and our Solr server.
We have indexed objects that are organized in a tree, and we use this tree to provide a hierarchical facet presentation for search. For this, we programmatically generate and maintain a path string that Solr uses.
Unfortunately, we happen to have slashes «/» in some of our indexed objects' titles (for example those involving fractions), and that leads to unpredictable results when rendering the hierarchical facets based on these titles, because Solr interprets the slashes as child nodes.
We cannot use HTML entitizing and de-entitizing because we would lose the search features on the names, unless we manage encoding and decoding of the special characters everywhere, which we have no time to do.
My question is simple:
Is there a way to configure a separator character for the hierarchical facets path? For example, in TypoScript, a neat simple configuration key:
plugin.tx_solr.index.fieldProcessingInstruction.separator = ### #<--Whatever...
I would be so glad not to have to dive into the Typo3 Solr extension source code again to bugfix my website!
Thanks to anybody for any clue.
OK, after having lost some time trying to configure it in schema.xml and in the general_schema_*.xml files, I went to the source code of the Typo3 Solr extension, my old dreaded sleeping balrog.
It appears that the separator character is hardcoded in 5 scattered class files:
class.tx_solr_facet_hierarchicalfacetrenderer.php
class.tx_solr_fieldprocessor_pathtohierarchy.php
class.tx_solr_facet_hierarchicalfacethelper.php
class.tx_solr_fieldprocessor_pageuidtohierarchy.php
class.tx_solr_query_filterencoder_hierarchy.php
All I did was replace it in these files (pointing them all to one shared public static constant, duh), apologize to my supervisors for taking so long to correct such a simple and stupid bug, and now everything works fine!

How to update Solr documents on the Solr server side with custom handler / plugin

I have a core with millions of records.
I want to add a custom handler which scans the existing documents and updates one of the fields based on a condition (age > 12, for example).
I prefer doing it on the Solr server side to avoid sending millions of documents to the client and back.
I was thinking of writing a Solr plugin which will receive a query and update some fields on the query's documents (like the delete-by-query handler).
I was wondering whether there are existing solutions or better alternatives.
I searched the web for a while and couldn't find examples of Solr plugins which update documents (I don't need to extend the update handler).
I've written a plug-in which uses the following code; it works fine but isn't as fast as I need.
Currently I do:
AddUpdateCommand addUpdateCommand = new AddUpdateCommand(solrQueryRequest);
DocIterator iterator = docList.iterator();
SolrIndexSearcher indexReader = solrQueryRequest.getSearcher();
while (iterator.hasNext()) {
    // load the stored Lucene document for the next hit
    Document document = indexReader.doc(iterator.nextDoc());
    // build a fresh input document carrying only the id and the changed field
    SolrInputDocument solrInputDocument = new SolrInputDocument();
    addUpdateCommand.clear();
    addUpdateCommand.solrDoc = solrInputDocument;
    addUpdateCommand.solrDoc.setField("id", document.get("id"));
    addUpdateCommand.solrDoc.setField("my_updated_field", new_value);
    // push the new document through the update request processor chain
    updateRequestProcessor.processAdd(addUpdateCommand);
}
But this is very expensive, since the update handler will fetch the document again, which I already hold at hand.
Is there a safe way to update the Lucene document and write it back while taking into account all the Solr-related code such as caches, extra Solr logic, etc.?
I was thinking of converting it to a SolrInputDocument and then just adding the document through Solr, but first I need to convert all the fields.
Thanks in advance,
Avner
I'm not sure whether the following is going to improve the performance, but thought it might help you.
Look at SolrEntityProcessor
Its description sounds very relevant to what you are searching for.
This EntityProcessor imports data from different Solr instances and cores.
The data is retrieved based on a specified (filter) query.
This EntityProcessor is useful in cases where you want to copy your Solr index
and want to slightly modify the data in the target index.
In some cases Solr might be the only place where all data is available.
However, I couldn't find an out-of-the-box feature to embed your logic. So, you may have to extend the following class.
SolrEntityProcessor and the link to sourcecode
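A rough sketch of such an extension could look like the following. The field names and the age > 12 condition are just the example from the question, and nextRow() is the method the DataImportHandler calls for each document pulled from the source core:

import java.util.Map;
import org.apache.solr.handler.dataimport.SolrEntityProcessor;

public class AgeUpdatingEntityProcessor extends SolrEntityProcessor {
    @Override
    public Map<String, Object> nextRow() {
        Map<String, Object> row = super.nextRow(); // next document pulled from the source core
        if (row != null) {
            Object age = row.get("age");
            if (age instanceof Number && ((Number) age).intValue() > 12) {
                row.put("my_updated_field", "new_value"); // apply the custom update
            }
        }
        return row;
    }
}

You would then reference this class in the processor attribute of the entity in your data-config.xml.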
You probably already know this, but a couple of other points:
1) Make the entire process exploit all the CPU cores available. Make it multi-threaded.
2) Use the latest version of Solr.
3) Experiment with two Solr apps on different machines with minimal network delay. This would be a tough call:
same machine, two processes vs. two machines, more cores, but network overhead.
4) Tweak the Solr cache in a way that applies to your use case and particular implementation.
5) A couple more resources: Solr Performance Problems and SolrPerformanceFactors
Hope it helps. Let me know the stats regardless; I'm curious, and your info might help somebody later.
To point out where to put custom logic, I would suggest having a look at the SolrEntityProcessor in conjunction with Solr's ScriptTransformer.
The ScriptTransformer allows you to process each entity after it is extracted from the source of a data import, manipulate it, and add custom field values before the new entity is written to Solr.
A sample data-config.xml could look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <script>
    <![CDATA[
      function calculateValue(row) {
        row.put("CALCULATED_FIELD", "The age is: " + row.get("age"));
        return row;
      }
    ]]>
  </script>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/your-core-name"
            query="*:*"
            wt="javabin"
            transformer="script:calculateValue">
      <field column="ID" name="id" />
      <field column="AGE" name="age" />
      <field column="CALCULATED_FIELD" name="update_field" />
    </entity>
  </document>
</dataConfig>
As you can see, you may perform any data transformation you like, as long as it is expressible in JavaScript. So this would be a good point to express your logic and transformations.
You say one constraint may be age > 12. I would handle this via the query attribute of the SolrEntityProcessor. You could write query=age:[13 TO *] (assuming integer ages) so that only records with an age greater than 12 would be read for the update.

Index every word of a text file, delimited by whitespace, in Solr?

I am implementing Solr 3.6 in my application. I have the below data in my text file:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
device_id=C010600504-TYGJD3 deployment_mode="Route"
log_id=031006209001 log_type="Anti Virus" log_component="FTP"
log_subtype="Clean" status="Denied" priority=Critical fw_rule_id=""
user_name="hemant" virus="codevirus" FTP_URL="ftp.myftp.com"
FTP_direction="download" filename="hemantresume.doc" file_size="550k"
file_path="deepti/Shortcut to virus.lnk" ftpcommand="RETR"
src_ip=10.103.6.100 dst_ip=10.103.6.66 protocol="TCP" src_port=2458
dst_port=21 dstdomain="myftp.cpm" sent_bytes=162 recv_bytes=45
message="An FTP download of File resume.doc of size 550k from server
ftp.myftp.com could not be completed as file was infected with virus
codevirus"
Now I want to split the above data into key-value pairs and have each value indexed based on its key.
I want the changes to be in the configuration files. I have looked at tokenizers, of which the WhitespaceTokenizer might work, but I want the whole structure to be indexed. Can anyone please help me with this?
Thanks.
There is no tokenizer that I know of that does this.
Using static fields:
You have to define all your "keys" as fields in schema.xml. They should have the relevant types (dates, string, etc.).
Create a POJO with these fields, parse the key/value pairs, and populate the POJO. Add this POJO to Solr using SolrJ.
Using dynamic fields:
In this case you don't need to define the keys in the schema, but can use dynamic fields (based on the type of data). You still need to parse the key/value pairs and add them to the Solr document. These fields are added using the solrInputDoc.addField method, as in the sketch below.
As you add new key/value pairs, the client still needs to know that each new key exists, but your indexer does not.
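A minimal SolrJ sketch of the dynamic-field approach might look like this. The "_s" suffix, the document id, and the simple regex for key=value / key="quoted value" pairs are assumptions for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LogLineIndexer {
    // matches key=value and key="quoted value" pairs
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\"[^\"]*\"|\\S+)");

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        String line = "date=2011-07-08 time=10:55:06 timezone=\"IST\" device_name=\"CR1000i\"";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "log-1"); // every document needs a unique key
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            String key = m.group(1);
            String value = m.group(2).replaceAll("^\"|\"$", ""); // strip surrounding quotes
            doc.addField(key + "_s", value); // "_s" maps to a string dynamic field in the example schema
        }
        solr.add(doc);
        solr.commit();
    }
}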
This cannot be done with a tokenizer. Tokenizers are called for each field, but you need processing before handing the data to a field.
A Transformer could probably do this, or you might do some straightforward conversion before submitting it as XML. It should not be hard to write something that reads that format and generates the proper XML format for Solr submissions. It sure wouldn't be hard in Python.
For this input:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
You would need to create the matching fields in a schema, and generate:
<doc>
  <field name="date">2011-07-08</field>
  <field name="time">10:55:06</field>
  <field name="timezone">IST</field>
  <field name="device_name">CR1000i</field>
  ...
Also in this pre-processing, you almost certainly want to convert the first three fields into a single datetime in UTC.
For details about the Solr XML update format, see: http://wiki.apache.org/solr/UpdateXmlMessages
The Apache wiki is down at this exact moment, so try again if there is an error page.
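As a small illustration of the suggested merge of the first three fields into a single UTC datetime (assuming the log's "IST" means India Standard Time; SimpleDateFormat is used only for brevity):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateMergeExample {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        in.setTimeZone(TimeZone.getTimeZone("Asia/Kolkata")); // the log's timezone="IST"
        Date parsed = in.parse("2011-07-08 10:55:06");        // date + time from the log line

        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(out.format(parsed)); // 2011-07-08T05:25:06Z, the format Solr date fields expect
    }
}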

How to export Solr results into a text file?

I need to export the doc_id, all fields, score, and rank for one search result set so I can evaluate the results. How can I do this in Solr?
Solr provides a CSV Response Writer, which will help you export the results from Solr to a CSV file.
http://localhost:8983/solr/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv
All the queried fields will be returned by Solr in CSV format.
This has nothing to do with Solr. When you make a Solr query over HTTP, Solr does the search and returns the results to you in your desired format. The default is XML, but lots of people specify wt=json to get results in JSON format. If you want this result in a text file, then make your search client put it there.
In the browser, File -> Save As.
But most people who want this use curl as the client and use the -o option like this:
curl -o result1.xml 'http://solr.local:8080/solr/stuff/select?indent=on&version=2.2&q=fish&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&explainOther=&hl.fl='
Note the single quotes around the URL due to the use of & characters.
There is no built-in export function in Solr. The easiest way would be to query your Solr instance and evaluate the XML result. Check out Querying Data in the Solr Tutorial for details on how to query a result from Solr. In order to convert the result into a text file, I would recommend using one of the Solr Clients found on the Integrating Solr page in the Solr Wiki and then using your programming language of choice to create the text file.
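For example, with SolrJ a rough sketch could look like the following (the core URL, the query, the row count, and the output file name are placeholders):

import java.io.PrintWriter;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ExportResultsToTextFile {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("ipod");
        query.setFields("*", "score"); // all stored fields plus the score
        query.setRows(100);
        QueryResponse response = solr.query(query);

        PrintWriter out = new PrintWriter("results.txt", "UTF-8");
        int rank = 1;
        for (SolrDocument doc : response.getResults()) {
            // one line per hit: rank, id, score, then the remaining stored fields
            out.println("rank=" + rank++ + " id=" + doc.getFieldValue("id")
                    + " score=" + doc.getFieldValue("score") + " " + doc);
        }
        out.close();
    }
}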
