Can SOLR calculate and then query based on array position matches?

In our SOLR documents, we store contributor first and last names as two separate arrays that correspond to one another by position, along with a third array holding each contributor's role in the publication.
I'd like to write a query, run from our middleware platform, that matches names only for contributors whose role corresponds to certain contributor codes. Here is an example:
<arr name="Role">
<str>Author</str>
<str>Cover</str>
<str>Author2</str>
<str>Summary</str>
</arr>
<arr name="Forename">
<str>George</str>
<str>John</str>
<str>George</str>
<str>Sue</str>
</arr>
<arr name="Surname">
<str>Anderson</str>
<str>Smith</str>
<str>Anderson</str>
<str>Maryson</str>
</arr>
However, the positions of the Role array items vary from document to document.
Here, I'd like to query only Author and Author2, i.e. Forename and Surname array positions 1 and 3 ("George" and "Anderson"), and ignore John Smith and Sue Maryson.
Is this possible at the query level?
Thanks
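For illustration, the position-based matching described above can be sketched outside Solr, for example in middleware code. This is only a minimal sketch, assuming the document has already been fetched and parsed into plain Python lists; `names_for_roles` is a hypothetical helper, not a Solr feature.

```python
from typing import Dict, List, Set, Tuple


def names_for_roles(doc: Dict[str, List[str]], wanted_roles: Set[str]) -> List[Tuple[str, str]]:
    """Return (forename, surname) pairs whose role at the same array position is wanted."""
    rows = zip(doc["Role"], doc["Forename"], doc["Surname"])
    return [(fore, sur) for role, fore, sur in rows if role in wanted_roles]


# The example document from the question, as parallel lists.
doc = {
    "Role": ["Author", "Cover", "Author2", "Summary"],
    "Forename": ["George", "John", "George", "Sue"],
    "Surname": ["Anderson", "Smith", "Anderson", "Maryson"],
}

authors = names_for_roles(doc, {"Author", "Author2"})
print(authors)
```

Because the roles are looked up per document, the result does not depend on where Author and Author2 happen to sit in the Role array.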

Related

How to search over documents with 2 or more entries in multivalued field in Solr?

I have a schema with a multivalued field. How do I construct a search that returns only documents that have 2 or more entries in that field? For example, in this subset of data:
<doc>
<str name="id">A</str>
<arr name="multivaluedField">
<str>One</str>
<str>Two</str>
</arr></doc>
<doc>
<str name="id">B</str>
<arr name="multivaluedField">
<str>One</str>
</arr></doc>
<doc>
<str name="id">C</str>
<arr name="multivaluedField">
<str>Three</str>
<str>Four</str>
</arr></doc>
The search would return documents A and C only, since they each have 2 entries in multivaluedField, even though the entries differ.
The easiest (and most effective) way would be to index an integer value that contains the count of values alongside the existing values, so you have a multiValued_count field. This field can be indexed, and you can then do both efficient range queries and exact value lookups on it.
You can do this directly in your indexing code, or in an update processor if needed.
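The "count field" idea can be sketched in indexing code roughly as follows. This is a minimal sketch under assumptions: documents are plain Python dicts before being sent to Solr, `with_value_count` is a hypothetical helper, and the last lines only simulate locally what a range query on the count field would select.

```python
from typing import Any, Dict


def with_value_count(doc: Dict[str, Any], field: str, count_field: str) -> Dict[str, Any]:
    """Return a copy of doc with count_field set to the number of values in field."""
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]  # a single value counts as one entry
    enriched = dict(doc)
    enriched[count_field] = len(values)
    return enriched


docs = [
    {"id": "A", "multivaluedField": ["One", "Two"]},
    {"id": "B", "multivaluedField": ["One"]},
    {"id": "C", "multivaluedField": ["Three", "Four"]},
]
to_index = [with_value_count(d, "multivaluedField", "multiValued_count") for d in docs]

# Solr range query on the count field: two or more entries.
filter_query = "multiValued_count:[2 TO *]"

# Local stand-in for what that range query would select.
matching_ids = [d["id"] for d in to_index if d["multiValued_count"] >= 2]
print(matching_ids)
```

The count has to be recomputed whenever the multivalued field changes, which is why doing it at the point where documents are built is convenient.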

Solr Indexing Design Requirement

I have 5 tables in a database, namely State, District, City, Locality and Pincode (in that hierarchy).
Each table has foreign keys to all of its parents in the hierarchy, but some Pincodes may not have a locality id. I am trying to index this data with Solr.
So far I am indexing it as below:
<doc>
<str name="state">Punjab</str>
<arr name="district">
<str>test</str>
<str>test1</str>
</arr>
<arr name="city">
<str>abc</str>
<str>dfsdf</str>
</arr>
<arr name="locality">
<str>fggf</str>
<str>gddd</str>
</arr>
<arr name="pincode">
<str>123</str>
<str>345</str>
</arr>
</doc>
But I suspect this is not the correct way to index the data for retrieval, as it keeps no relation between district and city, city and locality, etc.
Please help me with this.
You are looking at this problem backwards. You need to work from the results. What do you want to find?
Imagine you already have everything working correctly. Given that, what individual record would appear in the search results (pincode-level entries, perhaps)? Then denormalize down to that level and include all the information required to find that record.
See the presentation from Gilt regarding how they refactored their initial architecture to reflect their needs better. Ignore all the technical details for now; just follow the logical arguments.
Then, you will probably have a (separate) technical question on how to implement it.
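As a rough illustration of denormalizing to pincode-level records, here is a minimal sketch. Everything in it is an assumption made for the example: the lookup tables, the ids, and in particular which district, city and locality each pincode belongs to (the question does not say), as well as the `to_solr_doc` helper.

```python
from typing import Any, Dict, Optional

# Hypothetical lookup tables keyed by primary key.
states = {1: "Punjab"}
districts = {10: "test", 11: "test1"}
cities = {100: "abc", 101: "dfsdf"}
localities = {1000: "fggf", 1001: "gddd"}

# Hypothetical Pincode rows carrying foreign keys to every parent level.
pincode_rows = [
    {"pincode": "123", "state_id": 1, "district_id": 10, "city_id": 100, "locality_id": 1000},
    {"pincode": "345", "state_id": 1, "district_id": 11, "city_id": 101, "locality_id": None},
]


def to_solr_doc(row: Dict[str, Any]) -> Dict[str, str]:
    """Build one flat, single-valued document per pincode."""
    doc = {
        "id": row["pincode"],
        "pincode": row["pincode"],
        "state": states[row["state_id"]],
        "district": districts[row["district_id"]],
        "city": cities[row["city_id"]],
    }
    locality_id: Optional[int] = row["locality_id"]
    if locality_id is not None:  # some pincodes have no locality
        doc["locality"] = localities[locality_id]
    return doc


solr_docs = [to_solr_doc(row) for row in pincode_rows]
for d in solr_docs:
    print(d)
```

With one document per pincode, each district, city and locality value stays tied to the record it belongs to, instead of being pooled into unrelated multivalued arrays.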

Store complex (i.e. label + id) meta data in SOLR document

I use SOLR to store documents with some metadata that is composed of multiple values, usually an id with a label. A simple example would be the name of a city and the unique id of that city. The id is needed because different cities can have the same name, like Berlin in Germany and Berlin in the US. The name is obviously needed because I want to search for that string.
If I use facets, I would like to get back two facets with the label "Berlin". If I restrict my search (using some other metadata field) to documents from Germany, I would expect to get only one facet, for the German Berlin. Obviously this does not work if I store id and label in two separate SOLR fields.
I would assume that this is not an uncommon requirement, but I was not able to find any useful information. My current approaches are:
Implement a complete custom field type in Java: hard for me to estimate, because I'm currently just a SOLR user, not a SOLR developer.
Put id and label in a single string (like "123:Berlin" and "456:Berlin") and define custom field types in schema.xml using a custom analyzer which splits the value. This sounds reasonable to me, but I'm not 100% sure it will work with faceting.
I found some references to subfields, but only on older pages and I was not able to find useful documentation.
Is there some well known way to solve this in SOLR?
Pivot faceting can work.
Say you have the fields: cityId, cityName, country
Do a pivot facet over city-id, city-name by using query parameters:
facet.pivot=cityId,cityName
At the first level, like a standard facet, you will get each city ID. But on the second level, you will get the name of each city. Given that each city ID will have only one name, you can simply read each city ID's name from the next facet level (under the pivot element in the XML).
<lst name="facet_pivot">
<arr name="cityId,city">
<lst>
<str name="field">cityId</str>
<str name="value">1</str>
<int name="count">1</int>
<arr name="pivot">
<lst>
<str name="field">city</str>
<str name="value">berlin</str>
<int name="count">1</int>
</lst>
</arr>
</lst>
<lst>
<str name="field">cityId</str>
<str name="value">2</str>
<int name="count">1</int>
<arr name="pivot">
<lst>
<str name="field">city</str>
<str name="value">berlin</str>
<int name="count">1</int>
</lst>
</arr>
</lst>
<lst>
<str name="field">cityId</str>
<str name="value">3</str>
<int name="count">1</int>
<arr name="pivot">
<lst>
<str name="field">city</str>
<str name="value">melbourne</str>
<int name="count">1</int>
</lst>
</arr>
</lst>
</arr>
</lst>
Basically, if the ID is unique, you will be guaranteed to only have one pivot value at the second level.
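Reading the name off the second pivot level could look roughly like this on the client. It is a minimal sketch, assuming the response was requested as JSON and already parsed into a dict that mirrors the XML above; `names_by_id` is a hypothetical helper.

```python
from typing import Any, Dict, List


def names_by_id(pivot: List[Dict[str, Any]]) -> Dict[Any, Any]:
    """Map each first-level pivot value (the ID) to its first second-level value (the name)."""
    result: Dict[Any, Any] = {}
    for entry in pivot:
        children = entry.get("pivot", [])
        if children:  # a unique ID is expected to have exactly one child
            result[entry["value"]] = children[0]["value"]
    return result


# Parsed equivalent of the facet_pivot section shown above.
facet_pivot = {
    "cityId,city": [
        {"field": "cityId", "value": "1", "count": 1,
         "pivot": [{"field": "city", "value": "berlin", "count": 1}]},
        {"field": "cityId", "value": "2", "count": 1,
         "pivot": [{"field": "city", "value": "berlin", "count": 1}]},
        {"field": "cityId", "value": "3", "count": 1,
         "pivot": [{"field": "city", "value": "melbourne", "count": 1}]},
    ]
}

labels = names_by_id(facet_pivot["cityId,city"])
print(labels)
```

The two Berlins stay distinct because they are keyed by ID, while the label is still available for display.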
Optionally, if you want to group your 'Berlins' together, just reverse the order of the facet pivot and make it:
facet.pivot=cityName,cityId
and you will get 'Berlin' at the first level and possibly multiple IDs at the second level (and as a bonus, you could add a third level country so that you can read the country for each city off the third level).
There seems to be no out-of-the-box solution.
Your #2 should work fine with some client-side modifications.
You can index your data with id_name as a single string field. This needs to change at indexing time; it is easier using Transformers if you are using DIH.
You would now have unique facets for each id, and on the client side you can always split the facets for display.
You can also check facet pivots, which can provide hierarchical faceting.
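The client-side split for the combined id_name approach might look like the sketch below. It assumes the facet values arrive as a flat list alternating value and count, and that ":" is the separator (as in "123:Berlin"); the counts and the `split_facets` helper are made up for illustration.

```python
from typing import List, Tuple


def split_facets(flat: List[object], sep: str = ":") -> List[Tuple[str, str, int]]:
    """Turn [value, count, value, count, ...] into (id, label, count) triples."""
    triples = []
    for value, count in zip(flat[0::2], flat[1::2]):
        # Split only on the first separator so labels may contain it.
        city_id, label = str(value).split(sep, 1)
        triples.append((city_id, label, int(count)))
    return triples


# Hypothetical facet values for a combined "id:label" field.
flat_facets = ["123:Berlin", 4, "456:Berlin", 2]
print(split_facets(flat_facets))
```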
That should work. If you add a filter query such as fq=country_name:Germany, it should return facets for cities only in Germany. Please take a look at this example below:
Suppose you have 4 fields in your schema:
id, city_name, country_name, state_name
SAMPLE DATA:
id: 1
city_name: Berlin
country_name: Germany
state_name: Some_State1
id: 2
city_name: Berlin
country_name: USA
state_name: Some_State2
id: 3
city_name: Dublin
country_name: Ireland
state_name: Some_State3
id: 4
city_name: Dublin
country_name: USA
state_name: California
id: 5
city_name: Dublin
country_name: USA
state_name: Virginia
If you want to get facet for all cities with name Dublin:
/select/?q=*:*&facet=true&facet.field=country_name&facet.field=city_name&fq=city_name:Dublin
In the result, the count of facet Dublin will be 3
Now if you want to get facet for all cities with name Dublin and restrict country to USA, your query will be:
/select/?q=*:*&facet=true&facet.field=country_name&facet.field=city_name&fq=city_name:Dublin&fq=country_name:USA
In the result, the count for facet Dublin will be 2, because there are two Dublins in the USA: one in California and the other in Virginia.
NOTE: I added &fq=country_name:USA
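Building these two requests from client code can be sketched as follows. This is a minimal sketch using only the Python standard library; `facet_query` is a hypothetical helper, and it does no Solr-specific escaping of the values, so it assumes simple values like the ones in the sample data.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode


def facet_query(city: str, country: Optional[str] = None) -> str:
    """Build the facet request path, adding one fq parameter per restriction."""
    params: List[Tuple[str, str]] = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", "country_name"),
        ("facet.field", "city_name"),
        ("fq", f"city_name:{city}"),
    ]
    if country is not None:
        params.append(("fq", f"country_name:{country}"))
    # A list of pairs lets the same parameter name repeat.
    return "/select/?" + urlencode(params)


print(facet_query("Dublin"))
print(facet_query("Dublin", "USA"))
```

The printed URLs are percent-encoded, so they look different from the hand-written ones above but decode to the same parameters.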
A rather simple suggestion: at index time, use two fields, linked through copyField, for values like "123:Berlin":
one stored string field for faceting, with parsing/cleaning done on the client side (note that Solr can only facet on a field that is indexed or has docValues, so it cannot be entirely non-indexed);
and, for search, the copy, indexed but not stored, with a simple regex analyzer, e.g. PatternReplaceCharFilterFactory.
There is no need for custom analyzers or new field types, just as you already pointed out in your second solution.

Replacing SOLR output field value

The SOLR query below works fine.
query:"COMPLEX CONDITION 1" OR query:"COMPLEX CONDITION 2"
I get 4 documents in the result: 2 from condition1 and 2 from condition2. I need to know which condition each document matched.
I cannot figure that out from the result, as the conditions are too complex.
What I want to do is change the value of the "status" field in the output.
Let's say status=Active for condition1 and status=Expired for condition2.
The current value of status is not accurate as the status is decided based on the conditions i use.
Is there a way to overwrite the output value of any field(s) in SOLR?
Have you tried using highlighting to determine which documents matched which condition? If you turn on highlighting (&hl=on&hl.fl=<fields_you're_trying_to_match>), Solr returns a structure called "highlighting" at the end of the results (whether you request JSON or XML). It contains one entry per matching document, named by the unique key of your index (if there is one), listing the fields that matched.
<lst name="highlighting">
<lst name="1">
<arr name="title">
<str>Bob <em>Jones</em></str>
</arr>
<arr name="category">
<str><em>Jones</em> Family</str>
</arr>
<arr name="description">
<str>This is a book about Bob <em>Jones</em>, the patriarch of the <em>Jones</em> Family.</str>
</arr>
</lst>
</lst>
More here:
How to return column that matched the query in Solr..?
Now I apologize that this doesn't answer the latter part of your question, but gives you some help for the first part.
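Reading that highlighting structure on the client could be sketched like this. It is a minimal sketch, assuming the response was requested as JSON and parsed into a dict mirroring the XML above; `matched_fields` is a hypothetical helper. It tells you which fields matched for each document, which you would still have to map back to your own conditions.

```python
from typing import Dict, List


def matched_fields(highlighting: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """For each document key, list the fields that produced at least one highlight snippet."""
    return {
        doc_id: sorted(field for field, snippets in fields.items() if snippets)
        for doc_id, fields in highlighting.items()
    }


# Parsed equivalent of the highlighting section shown above.
highlighting = {
    "1": {
        "title": ["Bob <em>Jones</em>"],
        "category": ["<em>Jones</em> Family"],
        "description": [
            "This is a book about Bob <em>Jones</em>, the patriarch of the <em>Jones</em> Family."
        ],
    }
}

print(matched_fields(highlighting))
```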

Filter doc if a specified multivalued field contains only one value

We have a query case where we need to filter out a doc if a specified multivalued field contains only one value.
For instance:
We have an index of suits, each consisting of clothes, trousers or other items. If only one product is left in a suit because the others are out of stock, we can't show the suit to the user, because it is no longer a 'suit'.
Here is our data:
<doc>
<int name="suitId">001</int>
<arr name="productName">
<str>T-shirt</str>
<str>jeans</str>
</arr>
</doc>
<doc>
<int name="suitId">002</int>
<arr name="productName">
<str>T-shirt</str>
</arr>
</doc>
We want to exclude the suit with suitId=002.
It would be better to have a separate field that maintains the count of products for a suit, and to use it to filter the suits.
I don't think you can use range queries to count the values of a text multivalued field.
You can probably use productName:[* TO *] to select suits having at least one product, but that does not give you the count.
