Using Apache Solr to do faceted search for unstructured text - solr

I have a set of text documents indexed in a collection defined in Solr. I am able to do a keyword-based search that returns the documents containing the typed-in keyword. My next objective is to do a faceted search on the unstructured text, so that I can retrieve results based on facet fields.
I have tried the following steps:
1) I have defined a new field (distributioncompany) in managed-schema to act as a facet field, with a copyField target also defined (distributioncompany_str). But when I do the indexing curl command (with id and distribution company name passed as arguments), I get the facet counts with q=*, but it does not work when I type a keyword into the q field.
2) I also tried the text tagger feature, with a tag field defined for the facet field so that the required entity can be extracted from a document and matched against a list of tag values. But the tag field is not getting returned.
For the 1st approach:
1) Added the new field:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"distributioncompany_str","type":"string","multiValued":true,
"indexed":true,"stored":false}}'
http://localhost:8983/solr/collectionname/schema
(the same command was used to add the distributioncompany field)
2) Copy field added:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field":{"source":"distributioncompany","dest":"distributioncompany_str"}}'
http://localhost:8983/solr/collectionname/schema
3) Added a new document to the index:
curl 'http://localhost:8983/solr/collectionname/update/json/docs' -H 'Content-type:application/json' -d '{"id":"Appeal No. 220 of 2013.pdf.txt","distributioncompany":"Himachal Pradesh State Electricity Board"}'
But if the query is done using q=*, it shows the facet field count; if the query is done using a keyword present in the document, the facet does not show up.
For the 2nd approach (text tagger):
1) Added a new field type "tag" to the schema of the collection
2) Added new fields: a) trancompany (type: text_general), b) trancompany_tag (type: tag), and c) a copyField between these 2 fields
3) Added a new custom Solr request handler in solrconfig via the Config API:
curl -X POST -H 'Content-type:application/json' http://192.168.0.95:8983/solr/rajdhanitest2/config -d '{
"add-requesthandler":{
"name":"/tag",
"class":"solr.TaggerRequestHandler",
"defaults":{"field":"trancompany_tag"}
}
}'
4) Updated values for the tag field "trancompany_tag" with a curl command
5) But on passing text containing one of the updated tag values, only the id gets returned, not the tag value
For both approaches, the required faceted/tagged field that should be extracted from the text document is not returned when a search query is done. I would appreciate guidance on how to do a faceted search over unstructured text documents.
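For context, a keyword search and faceting are combined in a single request through the q, facet, and facet.field parameters. Below is a minimal sketch of building such a request URL with Python's standard library; the keyword "electricity" is a hypothetical example, while the collection and field names are taken from the question and the host is the default local Solr:

```python
from urllib.parse import urlencode

# Build a faceted keyword query: q matches the indexed text, while
# facet.field buckets the matching documents by distribution company.
params = {
    "q": "electricity",                        # hypothetical keyword
    "facet": "true",
    "facet.field": "distributioncompany_str",  # string copyField from the question
    "facet.mincount": 1,
    "rows": 10,
}
url = "http://localhost:8983/solr/collectionname/select?" + urlencode(params)
print(url)
```

Note that facet counts are computed only over documents matching q, so a keyword query that matches no documents will also produce empty facet buckets.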

Related

Is it possible to use multiple words in a filter query in SOLRJ / SOLR?

I am using SOLRJ (with SOLR 7) and my index features some fields for the document contents named content_eng, content_ita, ...
It also features a field with the full path to the document (processed by a StandardTokenizer and a WordDelimiterGraphFilter).
The user is able to search in the content_xyz fields thanks to the lines :
final SolrQuery query = new SolrQuery();
query.setQuery(searchedText);
query.set("qf",searchFields); // searchFields is a generated String which looks like "content_eng content_ita" (field names separated by space)
Now the user needs to be able to specify some words contained in the path (namely some subdirectories). So I added a filterQuery :
query.addFilterQuery(
"full_path_split:" + searchedPath);
If searchedPath contains only a single word from the document path, the document is correctly returned; however, if searchedPath has several words from the path, the document is not returned. To sum it up, the fq only works if searchedPath contains a single word.
For example, doc1 is at /home/user/dir1/doc1.txt.
If I search for all documents (* in searchedText) that are in the user dir (fq=full_path_split%3Adir), doc1.txt is returned.
If I do the same search but for documents that are in user and dir1 (fq=full_path_split%3Auser+dir1), doc1.txt is not returned, and I think it is because the fq is parsed as "+full_path_split:user +text:dir1", as debug=query shows. I don't know where text comes from; it may be a default field.
So is it possible to use a filter query with several words to fulfill my needs ?
Any help appreciated,
Your suspicion is correct - the _text_:dir1 part comes from you not providing a field name, and the default field name being used instead.
You can work around this by using the more general edismax (or the older dismax) parser as you're doing in your main query with qf:
fq={!type=edismax qf='full_path_split'}user dir1
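In SolrJ this value can be passed as-is to query.addFilterQuery(...). As a sketch, here is how the same filter query would look when URL-encoded by hand with Python's standard library (the host and query values are just the ones from the question):

```python
from urllib.parse import urlencode

# The edismax local params route every term of the filter to
# full_path_split, instead of letting the second term fall back
# to the default field (which caused the "+text:dir1" clause).
fq = "{!type=edismax qf='full_path_split'}user dir1"
query_string = urlencode({"q": "*:*", "fq": fq})
print(query_string)
```

Since edismax treats the terms with an OR/minimum-match semantics by default, adding mm=100% to the local params would require all path words to match.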

Solr query using facets is missing the special characters and shows split values

I have added the documents to Solr using the Solr Java client API.
Consider 2 fields:
field1       | field2
aaa#test.com | value1
I was able to successfully index the documents.
In the Solr admin UI, when I executed the query, I was able to see 1 record with the above values.
In the admin UI I then enabled Facet on this field and tried to execute the query.
But I got the result in split values as shown below.
I checked the facet checkbox, set facet.field=owner, then clicked Execute Query and got the below result:
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
    "owner":[
      "com",1,
      "test",1,
      "aaa",1]},
  "facet_ranges":{},
  "facet_intervals":{},
  "facet_heatmaps":{}}}
As you can see in the above result, I got a split string. How do I get it as a single value:
aaa#test.com , 1
Please help me with this.
The facets are generated from the tokens for the field. If you're using a text based field with a tokenizer attached, the value will be split into multiple tokens.
To get the behavior you want, use a string field and reindex your content to that field. Use a copyField instruction if you still want to be able to search with partial content against the field, and facet on the new field instead.
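As a sketch, these are the two Schema API payloads that set this up; "owner" is the field from the question, while "owner_str" is a hypothetical name for the new string field:

```python
import json

# 1) A string field: indexed as a single untokenized value, so facets
#    return "aaa#test.com" whole instead of "aaa", "test", "com".
add_field = {"add-field": {
    "name": "owner_str",
    "type": "string",
    "indexed": True,
    "stored": True,
}}

# 2) Copy the original tokenized field into it, so partial-match search
#    on "owner" keeps working while faceting uses "owner_str".
add_copy = {"add-copy-field": {"source": "owner", "dest": "owner_str"}}

# These JSON bodies would be POSTed to /solr/<collection>/schema.
print(json.dumps(add_field))
print(json.dumps(add_copy))
```

After adding the fields, existing documents must be reindexed before the new facet field is populated.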

How can I query Solr to get a list with all field-names prefixed by a string?

I would like to create an output based on the field-names of my Solr index objects.
What I have are objects like this e.g.:
{
"Id":"ID12345678",
"GroupKey":"Beta",
"PricePackage":5796.0,
"PriceCoupon":5316.0,
"PriceMin":5316.0
}
Whereby the Price* fields may vary from object to object, some might have more than three of those, some less, however they would be always prefixed with Price.
How can I query Solr to get a list with all field-names prefixed by Price?
I've looked into filters, facets but could not find any clue on how to do this, as all examples - e.g. regex facet - are in regard to the field-value, not the field-name itself. Or at least I could not adapt it to that.
You can get a comma separated list of all existing field names if you query for 0 documents and use the csv response writer (wt parameter) to generate the field name list.
For example if you request /solr/collection/select?q=*:*&wt=csv you get a list of all fields. If you only want fields prefixed with Price you could also add the field list parameter (fl) to limit the fields.
So the request to /solr/collection/select?q=*:*&wt=csv&fl=Price* should return the following response:
PricePackage,PriceCoupon,PriceMin
With this solution you get all fields existing including dynamic fields.
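As a sketch, extracting the field names from that CSV header line is a one-liner; the response body here is assumed to be the single header line shown above:

```python
# The CSV writer's header row is a comma-separated list of the
# fields matched by the fl parameter (here fl=Price*).
csv_header = "PricePackage,PriceCoupon,PriceMin"  # first line of the wt=csv response

price_fields = [name for name in csv_header.split(",") if name.startswith("Price")]
print(price_fields)  # → ['PricePackage', 'PriceCoupon', 'PriceMin']
```

The startswith filter is redundant when fl=Price* already restricts the columns, but it keeps the client safe if the fl parameter is ever dropped.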

How to edit the Solr 5 schema which is created by default

How do I edit a schema such as the gettingstarted collection as mentioned in
https://lucene.apache.org/solr/quickstart.html
Thanks
Joyce
Solr 5 uses a managed schema by default, while Solr 4 used the schema.xml file. Solr 5 automatically creates the schema for you by guessing the type of the field. Once the type is assigned to the field, you can't change it. You have to set the type of the field before you add data to Solr 5.
To change the schema in Solr 5, you will want to use the Schema Api, which is a REST interface.
Schemaless Mode states the following:
You Can Still Be Explicit - Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively create some fields, with explicit types, before you index documents that use them.
... Once a field has been added to the schema, its field type is fixed.
If you are using the quick start guide for Solr 5, here's what you have to do if you want to explicitly specify the field types:
After you run the following command: bin/solr start -e cloud -noprompt
Then enter a command like this:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field" : { "name":"MYFIELDNAMEHERE", "type":"tlong",
"stored":true}}' http://localhost:8983/solr/gettingstarted/schema
The previous command will force the MYFIELDNAMEHERE field to be a tlong. Replace MYFIELDNAMEHERE with the field name that you want to be explicitly set, and change tlong to the Solr type that you want to use.
After doing that, then load your data as usual.

SOLR Tika: add text of file to existing record (ExtractingRequestHandler)

I am indexing posts in SOLR with "name", "title", and "description" fields. I'd like to later be able to add a file (like a Word doc or a PDF) using Tika / the ExtractingRequestHandler.
I know I can add documents like so: (or through other interfaces)
curl
'http://localhost:8983/solr/update/extract?literal.id=post1&commit=true'
-F "myfile=@tutorial.html"
But this replaces the correct post (post1 above) -- is there a parameter I can pass to have it only add to the record?
In Solr (ver < 4.0) you can't modify fields in a document. You can only delete or add/replace whole documents. Therefore, when "appending" a file to the Solr document you have to rebuild your document from its current values (using literal), i.e. query for the document and then:
http://localhost:8983/solr/update/extract?literal.id=post1&literal.name=myName&literal.title=myTitle&literal.description=myDescription&commit=true
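Since every stored field of the existing document has to be re-supplied as a literal.* parameter, building the URL programmatically avoids typos. A sketch with Python's standard library; the field values are the placeholders from the answer above:

```python
from urllib.parse import urlencode

# Rebuild the whole document: each stored field becomes a literal.*
# parameter, and Tika appends the extracted file text on top.
doc = {
    "id": "post1",
    "name": "myName",
    "title": "myTitle",
    "description": "myDescription",
}
params = {f"literal.{field}": value for field, value in doc.items()}
params["commit"] = "true"

url = "http://localhost:8983/solr/update/extract?" + urlencode(params)
print(url)
```

In practice the current values would first be fetched with a query on the id, then fed into the doc dict before posting the file.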
