How to index every whitespace-delimited word of a text file in Solr?

I am implementing Solr 3.6 in my application, and I have the below data in my text file:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
device_id=C010600504-TYGJD3 deployment_mode="Route"
log_id=031006209001 log_type="Anti Virus" log_component="FTP"
log_subtype="Clean" status="Denied" priority=Critical fw_rule_id=""
user_name="hemant" virus="codevirus" FTP_URL="ftp.myftp.com"
FTP_direction="download" filename="hemantresume.doc" file_size="550k"
file_path="deepti/Shortcut to virus.lnk" ftpcommand="RETR"
src_ip=10.103.6.100 dst_ip=10.103.6.66 protocol="TCP" src_port=2458
dst_port=21 dstdomain="myftp.cpm" sent_bytes=162 recv_bytes=45
message="An FTP download of File resume.doc of size 550k from server
ftp.myftp.com could not be completed as file was infected with virus
codevirus"
Now I want to split the above data into key-value pairs, and I want each value to be indexed under its key.
I want these changes to be in the configuration files. I have gone through the tokenizers; WhitespaceTokenizer may work, but I want the whole structure to be indexed. Can anyone please help me with this?
Thanks.

There is no tokenizer that I know of that does this.
Using static fields:
You have to define all your "keys" as fields in schema.xml. They should have the relevant types (dates, strings etc.).
Create a POJO with these fields, parse the key/value pairs, and populate the POJO. Add this POJO to Solr using SolrJ.
Using dynamic fields:
In this case you don't need to define the keys in the schema but can use dynamic fields (based on the type of data). You still need to parse the key/value pairs and add them to a Solr document. These fields need to be added using the SolrInputDocument.addField method.
As you add new key/value pairs, the client would still need to know of the existence of each new key, but your indexer does not.
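Either way, the parsing step is the same. Here is a minimal sketch in Python (the helper name is mine, not part of any Solr API); shlex keeps the quoted values such as "Anti Virus" together as a single token:

```python
import shlex

def parse_log_line(line):
    """Return a dict of key -> value for a 'key=value key="some value"' line."""
    pairs = {}
    for token in shlex.split(line):
        # each token looks like key=value; shlex has already stripped the quotes
        key, _, value = token.partition("=")
        pairs[key] = value
    return pairs

sample = 'date=2011-07-08 time=10:55:06 timezone="IST" log_type="Anti Virus"'
print(parse_log_line(sample))
```

Each resulting key/value pair can then be added to a SolrInputDocument (or mapped to a dynamic field name) before indexing.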

This cannot be done with a tokenizer. Tokenizers are called for each field, but you need processing before handing the data to a field.
A Transformer could probably do this, or you might do some straightforward conversion before submitting it as XML. It should not be hard to write something that reads that format and generates the proper XML format for Solr submissions. It sure wouldn't be hard in Python.
For this input:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
You would need to create the matching fields in a schema, and generate:
<doc>
<field name="date">2011-07-08</field>
<field name="time">10:55:06</field>
<field name="timezone">IST</field>
<field name="device_name">CR1000i</field>
...
Also in this pre-processing, you almost certainly want to convert the first three fields into a single datetime in UTC.
For details about the Solr XML update format, see: http://wiki.apache.org/solr/UpdateXmlMessages
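The whole pre-processing step is short. A sketch in Python (function name is mine; the field names come straight from the keys, so each must exist in the schema or match a dynamic field):

```python
import shlex
import xml.etree.ElementTree as ET

def to_solr_xml(line):
    """Convert one key=value log line into a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        field = ET.SubElement(doc, "field", name=key)
        field.text = value
    return ET.tostring(add, encoding="unicode")

print(to_solr_xml('date=2011-07-08 time=10:55:06 timezone="IST"'))
```

The resulting XML can be POSTed to Solr's /update handler as-is.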

Related

Change Solr Field type

One string field in Lucene/Solr is stored like this: 'yyyyMMdd'.
I need to convert the field to tdate type.
How can I achieve this and do a re-index?
If your data is coming in with the incomplete date format and you want to parse it, you need to use an UpdateRequestProcessor chain for that. The specific URP is ParseDateFieldUpdateProcessorFactory. It is used as part of the schemaless example in Solr, so you can check its usage in the solrconfig.xml there.
Most likely you need to re-index from the source collection. There are no rewrite-in-place options in Solr for individual fields.
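A minimal solrconfig.xml sketch of such a chain (the chain name is made up; the format string matches the 'yyyyMMdd' values from the question):

```xml
<updateRequestProcessorChain name="parse-date">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyyMMdd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain then has to be attached to your update handler, e.g. via the update.chain parameter, so it runs during re-indexing.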

Solr dynamicField not searched in query without field name

I'm experimenting with the Example database in Solr 4.10 and not understanding how dynamicFields work. The schema defines
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
If I add a new item with a new field name (say "example_s":"goober" in JSON format), a query like
?q=goober
returns no matches, while
?q=example_s:goober
will find the match. What am I missing?
I would like to see the SearchHandler definition from the solrconfig.xml file that you are using to execute the above-mentioned query.
In the SearchHandler we generally have a default query field, i.e. the qf parameter.
Check that your dynamic field example_s is present in that query field list in solrconfig.xml; otherwise you can pass it while sending the query to the search handler.
Hope this helps you resolve your problem.
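For illustration, a handler definition along these lines would make the dynamic field searchable by default (the handler name and the edismax parser are assumptions, not taken from your config):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search both the aggregate text field and the dynamic string field -->
    <str name="qf">text example_s</str>
  </lst>
</requestHandler>
```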
If you are using the default schema, here's what's happening:
You are probably using the default endpoint (/select), so you get the definition of the search type and parameters from that. This means it is the default (lucene) query parser and the field searched is text.
The text field is an aggregate; it is populated by copyField instructions from other fields.
Your dynamic field definition for *_s allows you to index text under any name ending in _s, such as example_s. It is indexed (so you can search against it directly) and stored (so you can see it when you ask for all fields). It will not, however, be searched as general text. Notice that (differently from ElasticSearch) Solr strings have to be matched fully and completely. If you have multi-word text in a string field, there is barely any point searching it; "goober" is one word, so it is not a very good example for understanding the difference here.
The easiest solution for you is to add another copyField instruction:
<copyField source="*_s" dest="text"/>, so that all your *_s dynamic fields are also searchable. But notice that the analyzers applied at search time will not be the ones from the *_s definition, but the ones from the text field's definition, which is not string but text_general, defined elsewhere in the file.
As to Solr vs. ElasticSearch, they both err on the different sides of magic. Solr makes you configure the system and makes it very easy to see the exact current configuration. ElasticSearch hides all of the configuration, but you have to rediscover it the second you want to change away from the default behaviour. In the end, the result is probably similar and meets somewhere in the middle.

SOLR Search results with associated file

I am using Solr search (Solr 4.x) and everything is working as expected. I have a new requirement: I need to show the associated file along with the search results.
I am getting the search results but not the files. How do I get at least the file name along with the search results?
Thanks for the help.
Solr is a generic enterprise search server. It does not know anything about files or where the data it indexes comes from. You will have to do this on your own.
The Schema (schema.xml) defines what fields get indexed. When you design your schema, you have to make decisions on what is stored and in what way.
If you want the filenames back, you will have to manually add them to your index, by first providing a field in your schema and then by filling that field every time you add something to your index.
You probably do not want to tokenize your filename, unless you want to search on it too. If your filename includes the full path, it can be considered unique and you could use it as your id too.
If you add it via xml, all you need is a new field in your doc list, e.g.
<doc>
...
<field name="filename">/some/path/basename.extension</field>
...
</doc>
If you are using solrj, it will look something like this:
import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer server = new HttpSolrServer(host);
SolrInputDocument doc = new SolrInputDocument();
// store the file name alongside the indexed content
doc.addField("filename", document.getFilename());
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
server.add(docs);
server.commit(); // make the new document visible to searches

formatting of files before indexing into solr server

I'm using the Solr server to provide search capability for a tool. I wanted to know if there is a facility provided by Solr that will allow me to format some files before they are indexed. More specifically, I have a plain text file with a lot of data, and I want to convert it to an XML format before I index the XML file, e.g.
some data! some more data : more values
i want to convert this sample line to something like
<field1>some data</field1>
<field2>some more data</field2>
<field3>more values</field3>
Does Solr provide a facility for this type of transformation before indexing a file using Solr Cell? Does it provide any classes or interfaces that I can implement in my Java application?
Thanks in advance!
Are you pushing data into Solr or can you pull it from the source by Solr?
If you are pushing into Solr, then you have to use an Update Request Processor. However, I am not aware of one that will split data into multiple fields; you may need to write one yourself.
If you are pulling from the source using DataImportHandler, it has built-in support for splitting content into multiple fields using RegexTransformer.
Both Request Processors and DIH support JavaScript (and possibly other JVM scripting languages) transformers, so you can also write your own script to split the data in whatever way you want.
Some of this only arrived with version 4 of Solr though; that is a requirement to keep in mind.
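For the DIH route, a rough data-config.xml sketch for your example line (the file path and the target field names are hypothetical, and the exact RegexTransformer attributes are worth verifying against the DIH documentation):

```xml
<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="lines" processor="LineEntityProcessor"
            url="/path/to/data.txt" transformer="RegexTransformer">
      <!-- split "some data! some more data : more values" into three fields -->
      <field column="rawLine"
             regex="(.*)!(.*):(.*)"
             groupNames="field1,field2,field3"/>
    </entity>
  </document>
</dataConfig>
```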
You'll need a custom Index Handler or a SolrRequestHandler

Solr copyField mixed with RegexTransformer

Scenario:
In the database I have a field called Categories which is of type string and contains a number of digits, pipe-delimited, such as 1|8|90|130|
What I want:
In Solr index, I want to have 2 fields:
Field Categories_pipe, which would contain the exact string as in the DB, i.e. 1|8|90|130|
Field Categories, which would be a multi-valued field of type int containing the values 1, 8, 90 and 130
For the latter, in the entity specification I can use a RegexTransformer and specify the following field in data-config.xml:
<field column="Categories" name="Navigation" splitBy="\|"/> and then specify the field as multi-valued in schema.xml
What I do not know is how I can 'copy' the same field twice and perform the regex splitting on only one of them. I know there is the copyField facility that can be defined in schema.xml; however, I can't find a way to transform the copied field because, from what I know (and I may be wrong here), transformers are only available in the entity specification.
As a workaround I can also send the same field twice from the entity query but in reality, the field Categories is a computed field (selects nested) which is somewhat expensive so I would like to avoid it.
Any help is appreciated, thanks.
Instead of splitting it in data-config.xml, you could do it in your schema.xml. Here is what you could do:
Create a fieldType with the tokenizer PatternTokenizerFactory that uses a regex to split on |.
FieldSplit: create a multi-valued field using this new fieldType; it will eventually hold 1, 8, 90, 130.
FieldOriginal: create a string field (if you need no analysis on it) that preserves the original value 1|8|90|130|.
Now you can use copyField to copy values between FieldSplit and FieldOriginal as needed.
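A schema.xml sketch of that setup (the fieldType and field names are only examples):

```xml
<!-- tokenize on the pipe character; the trailing "|" simply yields no extra token -->
<fieldType name="pipeDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\|"/>
  </analyzer>
</fieldType>

<field name="FieldOriginal" type="string" indexed="true" stored="true"/>
<field name="FieldSplit" type="pipeDelimited" indexed="true" stored="true" multiValued="true"/>

<!-- send the raw 1|8|90|130| string once; the copy gets split by the analyzer -->
<copyField source="FieldOriginal" dest="FieldSplit"/>
```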
Check this Question, it is similar.
You can create two columns from the same data and treat them separately.
SELECT categories, categories as categories_pipe FROM category_table
Then you can split the "categories" column, but index the other one as-is.
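In data-config.xml that approach could look like this, reusing the query above (entity name is an example; splitBy comes from RegexTransformer):

```xml
<entity name="category" transformer="RegexTransformer"
        query="SELECT categories, categories AS categories_pipe FROM category_table">
  <!-- multi-valued int field: 1, 8, 90, 130 -->
  <field column="categories" splitBy="\|"/>
  <!-- untouched original: 1|8|90|130| -->
  <field column="categories_pipe"/>
</entity>
```

The computed-column cost is paid only once, since the alias duplicates the value in the result set rather than re-running the nested selects.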
