Indexing full-text and descriptive metadata in Solr

I have a small set of descriptive metadata records (~50) and, for each of them, a corresponding full-text file (.txt). My understanding is that the Apache Tika framework is used for detecting and extracting metadata and structured text from various types of documents. However, I would also need a linkage mechanism whereby each metadata record is matched to its full text. Can this be done in Solr?
Thanks,
Ilaria

If you have metadata and the document content, you can index the metadata and store the content. Your field definitions would look something like this:
<field name="filename" type="text" indexed="true" stored="true"/>
... <!-- other metadata fields -->
<field name="content" type="text" indexed="false" stored="true"/>
This will allow you to search by any metadata, and give you back the content. You can add as much meta information as required to search the text. I wouldn't index the full text as there is already some structured metadata available.
Apache Tika extracts meta information from HTML pages etc. Since you already have the metadata available, you need not use Tika. Besides, AFAIK, Tika does not work with plain text files.
Edit 1:
OK, so the link between the metadata and the content will be maintained in Solr. For example, if you have
File1.txt <-> Metadata1.txt
you could have one record (document) in Solr that has (number of metadata fields + 1 plain-text content field). This gives you the flexibility to look up the document by any metadata field. For example,
q=filename:File1.txt
or
q=filesize:[1 TO 100]
where filename and filesize are example metadata fields, and plaintextcontent would be your text file's content. Thus, within a single Solr document, you have your link.
Now the trick is to set up the indexing. Here's one way to do it:
Indexing the text file is very simple. You could use the DataImportHandler's PlainTextEntityProcessor.
Indexing the metadata along with it could be slightly tricky (it depends on the structure of your metadata). You could use the LineEntityProcessor or one of the Transformers of the DataImportHandler, depending on what suits you best.
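For illustration, here is a minimal data-config.xml sketch along those lines, assuming one Solr document per text file; the baseDir is hypothetical, while the processors and their implicit columns (file, fileSize, fileAbsolutePath, plainText) come from the DataImportHandler:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" name="fds"/>
  <document>
    <!-- walk the directory; rootEntity="false" so each file yields one document -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/txt" fileName=".*\.txt" rootEntity="false">
      <field column="file" name="filename"/>
      <field column="fileSize" name="filesize"/>
      <!-- read the whole file into the plain-text content field -->
      <entity name="text" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="fds">
        <field column="plainText" name="plaintextcontent"/>
      </entity>
    </entity>
  </document>
</dataConfig>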

Related

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It is for an older version of Solr, when the managed schema was not available. Before fully understanding what it does, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some query). But even if I define the fields as he did, I cannot see them as keys in the JSON search results.
The _text_ field seems to be a concatenation of other fields; does it contain the full text? It does not seem to be accessible by default, though.
To be brief: using The Elements of Statistical Learning as an example, how do I highlight the relevant passages for the query "SVM"? And if I rename the file to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, let me give a brief overview of how Solr works. At its core, Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index of the documents by term: for each term, it keeps the list of documents containing it, which is what makes lookups so fast. Getting to your questions:
Solr itself does not convert your PDF to text; it is the update processor configured in the handler that does it. This can be configured in solrconfig.xml, or you can write your own handler.
Coming back to why there are two fields: simply put, the first one (content) is a stored field, which stores the data as it is. The second one is a copyField target, which receives a copy of each document's data as per the configuration in schema.xml.
We do this because it lets us choose the indexing strategy. For example, we add a lowercase filter factory to the text field so that everything is indexed in lower case; then "Sam" and "sam" return the same results when searched. Or we remove certain commonly occurring words such as "a" and "the", which would otherwise unnecessarily increase your index size. A large index uses a lot of memory when you are dealing with millions of records, so you want to be careful about which fields to index in order to better utilise the resources.
The field "text" is a copyfield which copies data from certain fields as mentioned in the schema to text field. Then when searching in general one does not need to fire multiple queries for each field. As everything thing is copied into "text" field and you get the result. This is the reason it's "multivaled". As it can stores an array of data. Content is a stored field and text is not,and opposite for indexed because when you return your result to the end user you show him what ever you saved not the stripped down data that you just did with the text field applying multiple filters(such as removing stop words and applying case filters,stemming etc).
This is the reason you do not see the "text" field in the search results: it is used internally by Solr.
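To make this concrete, here is a minimal schema.xml sketch of such a setup; the exact analyzer chain is illustrative, and stopwords.txt is whichever stop-word list you configure:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drop common words, then lowercase; this affects only the indexed terms -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- copy the stored content into the searchable text field -->
<copyField source="content" dest="text"/>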
For highlighting see this.
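As a rough example of such a request against the two-field setup above (the core name is hypothetical; q searches the indexed "text" field while hl.fl highlights the stored "content" field):

http://localhost:8983/solr/core/select?q=text:SVM&hl=true&hl.fl=content&hl.snippets=3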
For more, Yonik's and Joel's blogs are great resources.
Hope this helps. :)

Solr: should I index large fields?

After a webpage has been crawled with Apache Nutch 2.2.1, the contents of that page are pushed to Solr. Solr stores the contents of entire webpages in the "content" field, so the data in that field is usually very sizable. So here are my concerns:
Should I index the "content" field in Solr? Indexing such a large field will increase index size. In Solr's schema.xml file I found the following recommendation:
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content.
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
However, if I left this field unindexed, would it increase search response time significantly?
I'd greatly appreciate any information that will help me to understand benefits of not indexing this large field or benefits of indexing it.
If you're going to search against the field, it needs to be indexed. The example in the schema assumes that since you're going to search against text instead of content, there is no need to create the index twice. They do, however, want to keep a reference to the content by itself, so that it can be displayed in the application or used for highlighting (which requires the whole field content to be available).
If you don't see any situation where you'll need the field for querying, there is no need to create an index for it.
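As a quick sketch of that division of labor (core name hypothetical), you would query the indexed "text" field but ask Solr to return the stored "content" field:

http://localhost:8983/solr/core/select?q=text:apple&fl=id,content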

How to view non-stored fields per document?

I have a field like this:
<field name="status" type="string" indexed="true" stored="false" required="false" />
Using the LukeRequestHandler, I can view only statistics of the indexed terms; I could view indexed terms per document only if stored="true". The TermsComponent can show only frequencies of terms; I cannot view terms per document.
Is it possible to look inside the inverted index without setting stored="true" and reindexing Solr?
In order to view the indexed terms for a single document, you need to use the full Luke application, not the LukeRequestHandler. You would need to copy the index folder from your Solr data directory to another location, then open it in Luke.
There is, however, a workaround within Solr itself: do a search that will return just the one document, and facet on the field you want to examine. Every term in the index for that field on that document will be an entry in the facet output. Here is a full sample URL for this kind of search:
http://localhost:8983/solr/core/select?q=id:1234&facet.field=status&facet.limit=-1&facet.mincount=1&facet=true&facet.method=enum
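For a single matching document, every indexed term of that field then appears in the facet section of the response; illustratively (the field value here is hypothetical):

"facet_counts": {
  "facet_fields": {
    "status": ["active", 1]
  }
}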
If you decide to go the Luke route, you can step through your index (or search for an individual document) and view just one document.
The official Luke page is here, but it only supports up through 4.0-ALPHA:
http://code.google.com/p/luke/
You can find Luke for versions beyond 4.0-ALPHA here:
https://java.net/projects/opengrok/downloads
There is an effort underway to absorb Luke into the Lucene/Solr source code as a module, so it will always be current and released at the same time as each Lucene/Solr version.

SOLR - Use single text field in schema for full text search

I am getting familiar with SOLR.
I would like to use SOLR for full-text search over many kinds of entities. I don't want to create a separate Document definition for every type of entity, and I don't want to search on specific fields. I am only interested in whether a specified string appears anywhere in any item.
In database terms for example I have a table News and a table Employee and I want to search for the word 'apple', I don't mind in which field it is, I only want to get back the database ID from the records which contain it.
Could it be a solution, that I use a SOLR schema something like this:
<fields>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="false"/>
</fields>
So, I only need an ID and the contents. I put all the data in which I want to be able to search into one 'content' field. When I search for some words, it looks for them in the 'id' and in the 'content'.
Is this a good idea? Any performance or design problem?
Thanks,
Tamas
See https://wiki.apache.org/solr/SchemaXml#Copy_Fields. It says:
A common requirement is to copy or merge all input fields into a single solr field. This can be done as follows:-
<copyField source="*" dest="text"/>
That's typically what is done to search across multiple fields.
But if you don't even want your original fields, just concatenate all your fields into one big content field yourself and index that in Solr. There should be no problems with that.
You can either copyField to text (see example in the distribution) and have that set as default field ("df" parameter in solrconfig.xml for the select handler).
Or, if you anticipate more complex requirements down the line and/or non-text searches, I would recommend looking at eDismax with qf parameter and it will handle searching all those fields itself.
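As a hedged sketch of the copyField route (field names follow the question; the "text" field type is assumed to be defined in your schema):

In schema.xml:
<field name="id" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="content"/>

In solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">content</str>
  </lst>
</requestHandler>

A query like q=apple then searches the merged content field by default and returns the stored id.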

Index every word of a text file delimited by whitespace in Solr?

I am implementing Solr 3.6 in my application. I have the below data in my text file:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
device_id=C010600504-TYGJD3 deployment_mode="Route"
log_id=031006209001 log_type="Anti Virus" log_component="FTP"
log_subtype="Clean" status="Denied" priority=Critical fw_rule_id=""
user_name="hemant" virus="codevirus" FTP_URL="ftp.myftp.com"
FTP_direction="download" filename="hemantresume.doc" file_size="550k"
file_path="deepti/Shortcut to virus.lnk" ftpcommand="RETR"
src_ip=10.103.6.100 dst_ip=10.103.6.66 protocol="TCP" src_port=2458
dst_port=21 dstdomain="myftp.cpm" sent_bytes=162 recv_bytes=45
message="An FTP download of File resume.doc of size 550k from server
ftp.myftp.com could not be completed as file was infected with virus
codevirus"
Now I want to split the above data into key-value pairs, and I want each value to be indexed under its key.
I want the changes to be confined to the configuration files. I have gone through the tokenizers; the WhitespaceTokenizer may work, but I want the whole structure to be indexed. Can anyone please help me with this?
Thanks.
There is no tokenizer that I know of that does this.
Using static fields:
You have to define all your "keys" as fields in schema.xml. They should have the relevant types (dates, strings, etc.).
Create a POJO with these fields, parse the key/value pairs to populate it, and add the POJO to Solr using SolrJ.
Using dynamic fields:
In this case you don't need to define the keys in the schema; you use dynamic fields (named based on the type of the data). You still need to parse the key/value pairs and add them to the Solr document; these fields are added using the SolrInputDocument.addField method.
As you add new key/value pairs, the client still needs to know of the existence of each new key, but your indexer does not; see the sketch below.
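A hedged sketch of the dynamic-field route in schema.xml, using the usual type-bearing suffix convention (the suffixes themselves are just a convention, not mandatory):

<dynamicField name="*_s"  type="string" indexed="true" stored="true"/>
<dynamicField name="*_i"  type="int"    indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date"   indexed="true" stored="true"/>

The indexer then calls, for example, solrInputDoc.addField("user_name_s", "hemant"), and new keys need no schema change.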
This cannot be done with a tokenizer. Tokenizers are called for each field, but you need processing before handing the data to a field.
A Transformer could probably do this, or you might do some straightforward conversion before submitting it as XML. It should not be hard to write something that reads that format and generates the proper XML format for Solr submissions. It sure wouldn't be hard in Python.
For this input:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
You would need to create the matching fields in a schema, and generate:
<doc>
<field name="date">2011-07-08</field>
<field name="time">2011-07-08</field>
<field name="timezone">IST</field>
<field name="device_name">CR1000i</field>
...
Also in this pre-processing, you almost certainly want to convert the first three fields into a single datetime in UTC.
For details about the Solr XML update format, see: http://wiki.apache.org/solr/UpdateXmlMessages
The Apache wiki is down at this exact moment, so try again if there is an error page.
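As a hedged sketch of that pre-processing in Python (the regex and the field handling are illustrative, and the UTC datetime conversion mentioned above is left out for brevity):

import re
from xml.sax.saxutils import escape

# Illustrative: parse one key=value log line; values may be double-quoted
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line):
    return {key: value.strip('"') for key, value in PAIR.findall(line)}

def to_solr_doc(fields):
    # Render the parsed pairs as a Solr <doc> element for an <add> message
    rows = ['  <field name="%s">%s</field>' % (k, escape(v))
            for k, v in fields.items()]
    return "<doc>\n%s\n</doc>" % "\n".join(rows)

line = 'date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"'
print("<add>\n%s\n</add>" % to_solr_doc(parse_line(line)))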
