SOLR Search results with associated file

I am using Solr search (Solr 4.x) and everything works as expected. I now have a requirement to show the associated file along with the search results.
I am getting the search results but not the files. How do I get them? At a minimum, I expect the file name to be returned along with the search results.
Thanks for the help.

Solr is a generic enterprise search server. It does not know anything about files or where the data it indexes comes from. You will have to do this on your own.
The Schema (schema.xml) defines what fields get indexed. When you design your schema, you have to make decisions on what is stored and in what way.
If you want the filenames back, you will have to add them to your index yourself: first provide a field in your schema, and then fill that field every time you add something to your index.
You probably do not want to tokenize your filename, unless you want to search on it, too. If your filename includes a full path, it can be considered unique and you could use it as your id, too.
If you add it via XML, all you need is a new field in your doc list, e.g.
<doc>
...
<field name="filename">/some/path/basename.extension</field>
...
</doc>
If you are using SolrJ, it will look something like this:
// host is the base URL of your Solr core
HttpSolrServer server = new HttpSolrServer(host);
SolrInputDocument doc = new SolrInputDocument();
// document stands for your own object that knows its file name
doc.addField("filename", document.getFilename());
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
server.add(docs);

Related

Solr dynamicField not searched in query without field name

I'm experimenting with the Example database in Solr 4.10 and not understanding how dynamicFields work. The schema defines
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
If I add a new item with a new field name (say "example_s":"goober" in JSON format), a query like
?q=goober
returns no matches, while
?q=example_s:goober
will find the match. What am I missing?

I would like to see the SearchHandler definition from the solrconfig.xml file that you are using to execute the query above.
A SearchHandler generally defines the default fields to search: the df parameter for the standard query parser, or the qf parameter for dismax/edismax.
Check that your dynamic field example_s is covered by that setting in solrconfig.xml; otherwise you can pass it as a parameter when sending the query to the search handler.
Hope this helps you resolve your problem.

If you are using the default schema, here's what's happening:
You are probably using the default endpoint (/select), so the search type and parameters come from its definition. That means it is the default (lucene) query parser and the field searched is text.
The text field is an aggregate and is populated by copyField instructions from other fields.
Your dynamic field definition for *_s allows you to index text under any name ending in _s, such as example_s. It is indexed (so you can search against it directly) and stored (so you can see it when you ask for all fields). It will not, however, be searched as general text. Notice that (unlike in ElasticSearch) Solr string fields have to be matched fully and exactly. If a string field holds multi-word text, there is barely any point searching it. "goober" is a single word, so it is not a very good example for understanding the difference here.
The easiest solution for you is to add another copyField instruction:
<copyField source="*_s" dest="text"/>
Then all your *_s dynamic fields would also be searchable. But notice that the search analyzers will not be the ones from the *_s definition, but the ones from the text field's definition, whose type is not string but text_general, defined elsewhere in the file.
As to Solr vs. ElasticSearch, they err on opposite sides of magic. Solr makes you configure the system and makes it very easy to see the exact current configuration. ElasticSearch hides all of the configuration, but you have to rediscover it the second you want to move away from the default behaviour. In the end, the result is probably similar and meets somewhere in the middle.

In Solr 4 - How do I include file names in the index?

I am building a search engine with Solr 4.8.1. In doing so, I am attempting to display the file names of each indexed document in my GUI search results.
I can successfully display any field that is in Solr's schema.xml file (title, author, id, resourcename, last_modified, etc.). I cannot, however, find a field in schema.xml that holds the name of the file (such as the name "Test" for the file Test.pdf, or the word "Example" for Example.docx).
The closest field I can find is "resourcename" which displays the entire file path in my system (ex. C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx when all I want to display is filename.docx)
(1) How do I tell solr to index the name of a file?
or
(2) Is there a field that cover the file name that I am just missing?
Sincerest thanks!
---Research Update---
It seems this question is asking for the same thing - Solr return file name - however, I do not believe that simply adding a field called "filename" will cause Solr to index the file name! I know I need to add a field to the Schema.xml file - now how do I point that field to the name of a file?

This is not so much a question about Solr functionality as it is about the tools you use to publish to Solr. Adding a new field called fileName to Solr resolves only part of the issue; you also have to modify the publishing tool so that it adds the file name value (e.g. testPDF.pdf) to each document. I guess I'd point my eyes at Tika: http://tika.apache.org/ , seeing how you mention both pdf and doc files.
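As a rough illustration of what the publishing side could do, here is a minimal sketch that derives the file name and the extension-free name from a full path such as the resourcename value in the question. The class and method names are made up for this example, and the sketch assumes the path uses either / or \ as separator; the results would then be sent to Solr in whatever field you declared (e.g. fileName).

```java
public class FileNameExtractor {

    /** Returns the part after the last '/' or '\', e.g. "filename.docx". */
    public static String fileName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    /** Returns the file name without its extension, e.g. "Test" for "Test.pdf". */
    public static String baseName(String path) {
        String name = fileName(path);
        int dot = name.lastIndexOf('.');
        // A leading dot (".hidden") is treated as part of the name, not an extension.
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        String resourceName =
            "C:\\Users\\myusername\\Documents\\solr-4.8.1\\example\\exampledocs\\filename.docx";
        System.out.println(fileName(resourceName));
        System.out.println(baseName(resourceName));
    }
}
```

Splitting on both separators by hand keeps the result the same no matter which operating system the indexing code runs on.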

Adding and Updating Solr and lucene field

I am new to Solr. Can someone address the questions below?
1. Currently I have an index with 1.5 million records. I need to update the value of a field to a new value. How do I do it? Will it require re-indexing? Sample code would be helpful.
2. I have another need where I want to add an index field but don't want to reindex the entire content. I have the document ids with me. For this requirement I can use Lucene if that helps.

Currently I have an index with 1.5 million records. I need to update the value of a field to a new value. How do I do it? Will it require re-indexing? Sample code would be helpful.
Well, the good news is that recent versions of Solr (starting with 4.0) allow you to do what they call Atomic Updates. See here:
http://wiki.apache.org/solr/Atomic_Updates
From the coding point of view, it is as if you were only updating the desired field. Using the Java SolrJ API, it's something like this:
Let's say you have a document with a multi-valued field called "stuffedAnimals". The field already contains "teddy bear" and "stuffed turtle" as values. You want to update it and add a new value like "pink fluffy flamingo". What you can do is:
SolrInputDocument updateDocument = new SolrInputDocument();
// The id field must carry the id of the document you want to update:
updateDocument.addField("id", 2312312);
// "add" appends the new value to the existing ones, rather than replacing them:
Map<String, Object> addOperation = new HashMap<String, Object>();
addOperation.put("add", "pink fluffy flamingo");
updateDocument.addField("stuffedAnimals", addOperation);
The problem with this is performance: what actually happens when you do this is that the document is removed and re-added entirely (not just the field). This is something you need to take into consideration if you plan on doing a lot of such operations.
I have another need where I want to add an index field but don't want to reindex the entire content. I have the document ids with me. For this requirement I can use Lucene if that helps.
Well, as I was saying above: when you update a field, the document is actually re-written entirely, so that means it's re-indexed with the new field as well. If you're using Solr 4.4 or earlier you need to declare the new fields in the schema.xml file. If you're using Solr 4.5 or newer you don't need to worry about the schema.xml any more.
Finally, as a remark for both questions: if you want to update a Solr document, make sure all its fields are marked as "stored" (stored=true in schema.xml). Since a partial update on a field translates into Solr removing and re-adding the document (with the update applied), if certain fields are not stored, Solr won't know what value to put in them after the update.
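If you are not going through SolrJ, the same kind of partial update can be sent as JSON to the update handler. Below is a minimal sketch that only builds the request body; the class and method names are invented for this example, the escaping covers just quotes and backslashes, and actually POSTing the string (typically to /update with Content-Type application/json) is left out. Using "set" as the operation replaces the field value, which is what the first question asks for, while "add" appends to a multi-valued field.

```java
public class AtomicUpdateJson {

    /** Escapes backslashes and double quotes only; control characters are not handled. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds a one-document atomic update body of the form
     * [{"id":"<id>","<field>":{"<operation>":"<value>"}}]
     */
    public static String build(String id, String field, String operation, String value) {
        return "[{\"id\":\"" + escape(id) + "\",\""
            + escape(field) + "\":{\""
            + escape(operation) + "\":\"" + escape(value) + "\"}}]";
    }

    public static void main(String[] args) {
        // Append a value to the multi-valued field from the example above.
        System.out.println(build("2312312", "stuffedAnimals", "add", "pink fluffy flamingo"));
        // Replace the value of a (hypothetical) single-valued field.
        System.out.println(build("2312312", "status", "set", "archived"));
    }
}
```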

Take a look at the atomic update feature added in 4.0.
It allows you to change the value of a particular field without resending the whole document.
Remember that all fields in your schema have to be stored (except copyField destinations). If you need further assistance, please write a more detailed description.

how to implement solr index partitioning

I want Solr to create indexes based on a specific field. For example, I have a field in schema.xml, createDate (which might have the value 2012, 2013, etc.). While indexing, if the value of that field is 2013, the document should be indexed in the /data/2013/index folder (or some logically separated folder). I tried to provide the following in my solrconfig.xml just before the <config> tag ends:
<partition>
<partitionField name="creationYear">
<value>2004</value>
<value>2005</value>
<value>2006</value>
<value>2007</value>
<value>2008</value>
<value>2009</value>
<value>2010</value>
<value>2011</value>
<value>2012</value>
<value>2013</value>
</partitionField>
</partition>
It is not working while indexing, and it seems that this was just an idea that was never really implemented in Solr. Is my assumption correct? Or is there a way to make Solr create dynamic index folders based on the year (as in this example)?
Any help would be appreciated!!

Index every word of a text file which are delimited by whitespace in solr?

I am implementing Solr 3.6 in my application. I have the data below in my text file:
**
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
device_id=C010600504-TYGJD3 deployment_mode="Route"
log_id=031006209001 log_type="Anti Virus" log_component="FTP"
log_subtype="Clean" status="Denied" priority=Critical fw_rule_id=""
user_name="hemant" virus="codevirus" FTP_URL="ftp.myftp.com"
FTP_direction="download" filename="hemantresume.doc" file_size="550k"
file_path="deepti/Shortcut to virus.lnk" ftpcommand="RETR"
src_ip=10.103.6.100 dst_ip=10.103.6.66 protocol="TCP" src_port=2458
dst_port=21 dstdomain="myftp.cpm" sent_bytes=162 recv_bytes=45
message="An FTP download of File resume.doc of size 550k from server
ftp.myftp.com could not be completed as file was infected with virus
codevirus"
**
Now I want to split the data above into key-value pairs, and I want each value to be indexed under its key.
I would like the changes to be in the configuration files. I have gone through the tokenizers, and WhitespaceTokenizer may work, but I want the whole structure to be indexed. Can anyone please help me with this?
Thanks.

There is no tokenizer that I know of that does this.
Using static fields:
You have to define all your "keys" as fields in schema.xml. They should have the relevant types (date, string, etc.).
Create a POJO with these fields, parse the key/value pairs and populate the POJO. Add this POJO to Solr using SolrJ.
Using dynamic fields:
In this case you don't need to define the keys in the schema but use dynamic fields (based on the type of data). You still need to parse the key/value pairs and add them to the Solr document. These fields need to be added using the SolrInputDocument.addField method.
As you add new key/value pairs, the search client would still need to know about the new key, but your indexer does not.
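Either way, the parsing step has to happen in your own code before the data reaches Solr. Here is a minimal sketch of such a parser for the log format in the question. The class name, the regular expression and the _s suffix mapping are assumptions made for this example (the suffix presumes a *_s string dynamic field exists in your schema); each resulting entry would then be passed to addField.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueLogParser {

    // key=value, where the value is either "quoted" (may contain spaces or
    // line breaks, may be empty) or a bare token without whitespace.
    private static final Pattern PAIR =
        Pattern.compile("(\\w+)=(?:\"([^\"]*)\"|(\\S+))");

    /** Parses a log record into an ordered key -> value map; quotes are stripped. */
    public static Map<String, String> parse(String logRecord) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        Matcher m = PAIR.matcher(logRecord);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            fields.put(m.group(1), value);
        }
        return fields;
    }

    /** Maps a log key to a string dynamic field name, e.g. log_type -> log_type_s. */
    public static String dynamicFieldName(String key) {
        return key + "_s";
    }

    public static void main(String[] args) {
        String record = "log_type=\"Anti Virus\" priority=Critical fw_rule_id=\"\" src_ip=10.103.6.100";
        for (Map.Entry<String, String> e : parse(record).entrySet()) {
            System.out.println(dynamicFieldName(e.getKey()) + " = " + e.getValue());
        }
    }
}
```

With static fields you would skip dynamicFieldName and copy the parsed values into your POJO instead.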

This cannot be done with a tokenizer. Tokenizers are called for each field, but you need processing before handing the data to a field.
A Transformer could probably do this, or you might do some straightforward conversion before submitting it as XML. It should not be hard to write something that reads that format and generates the proper XML format for Solr submissions. It sure wouldn't be hard in Python.
For this input:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
You would need to create the matching fields in a schema, and generate:
<doc>
<field name="date">2011-07-08</field>
<field name="time">10:55:06</field>
<field name="timezone">IST</field>
<field name="device_name">CR1000i</field>
...
Also in this pre-processing, you almost certainly want to convert the first three fields into a single datetime in UTC.
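A minimal sketch of that date conversion, assuming Java 8 or newer for java.time. The class and method names are made up, and treating "IST" as India Standard Time (Asia/Kolkata, via ZoneId.SHORT_IDS) is an assumption, since the abbreviation is ambiguous. The result is an ISO-8601 UTC timestamp, which is the form Solr date fields expect.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneId;

public class LogTimestamp {

    /**
     * Combines the date, time and timezone values of a log record into one
     * UTC timestamp such as 2011-07-08T05:25:06Z.
     */
    public static String toSolrDate(String date, String time, String zoneAbbreviation) {
        // SHORT_IDS resolves abbreviations like "IST" to a full zone id (assumed mapping).
        ZoneId zone = ZoneId.of(zoneAbbreviation, ZoneId.SHORT_IDS);
        LocalDateTime local = LocalDateTime.of(LocalDate.parse(date), LocalTime.parse(time));
        return local.atZone(zone).toInstant().toString();
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("2011-07-08", "10:55:06", "IST"));
    }
}
```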
For details about the Solr XML update format, see: http://wiki.apache.org/solr/UpdateXmlMessages
The Apache wiki is down at this exact moment, so try again if there is an error page.
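The answer suggests Python, but to stay with the Java used elsewhere on this page, here is a minimal sketch of the generation step: it turns already-parsed key/value pairs into an XML add message. The class and method names are invented for this example, and it assumes every key has a matching field in your schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrXmlBuilder {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an <add><doc>...</doc></add> message with one <field> per map entry. */
    public static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add>\n<doc>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
               .append(escapeXml(e.getValue()))
               .append("</field>\n");
        }
        return xml.append("</doc>\n</add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("timezone", "IST");
        fields.put("device_name", "CR1000i");
        System.out.println(toAddXml(fields));
    }
}
```

Escaping matters here because log values such as messages or URLs can contain & or <, which would otherwise make the update message invalid XML.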