Preserve association between multivalued fields in Solr

I have multivalued fields in my Solr data source. A sample document:
<doc>
<str name="id">23606</str>
<arr name="institution">
<str>Harvard University</str>
<str>Yale Universety</str>
<str>Cornell University</str>
<str>TUFTS University</str>
<str>University of Arizona</str>
</arr>
<arr name="degree_level">
<str>Bachelors</str>
<str>Diploma</str>
<str>Master</str>
<str>Master</str>
<str>PhD</str>
</arr>
</doc>
In the example above, the user has a Bachelors degree from Harvard, a Diploma from Yale, a Master from Cornell, a Master from TUFTS, and a PhD from Arizona.
Now, if I search for users who have a Bachelors degree and graduated from Harvard, I get this user, which is correct.
MyDomain:8888/solr/mycol/select?facet=true&q=*:*&fq=degree_level:Bachelors&fq=institution:Harvard+University
But if I search for users who have a Bachelors from Cornell, I get this user as well, which is incorrect!
MyDomain:8888/solr/mycol/select?facet=true&q=*:*&fq=degree_level:Bachelors&fq=institution:Cornell+University
The question is: how can I preserve the ordering/mapping between multivalued fields in Solr?
Edit:
By the way, I know that I can solve my problem by creating a new field that contains the concatenation of the degree and the university (i.e., "Bachelors_Harvard University", "Diploma_Yale Universety", and so on), but I need a solution based on the Solr core itself, as I have a lot of multivalued fields with a lot of combinations.
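For reference, the concatenation workaround described in the edit could be sketched in Python roughly as follows. This is only an illustration: the helper name pair_values and the "_" separator default are assumptions, and the combined values would have to be written to an extra multivalued field at index time.

```python
def pair_values(degrees, institutions, sep="_"):
    """Join two parallel lists position by position into combined values."""
    if len(degrees) != len(institutions):
        raise ValueError("parallel lists must have the same length")
    return [f"{d}{sep}{i}" for d, i in zip(degrees, institutions)]


# Values copied from the sample document in the question.
degrees = ["Bachelors", "Diploma", "Master", "Master", "PhD"]
institutions = [
    "Harvard University",
    "Yale Universety",
    "Cornell University",
    "TUFTS University",
    "University of Arizona",
]

combined = pair_values(degrees, institutions)
print("Bachelors_Harvard University" in combined)  # True
print("Bachelors_Cornell University" in combined)  # False
```

Because each combined value carries both halves of the pair, a filter on the combined field can no longer match a degree from one position with an institution from another.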

Below is a list of some suggestions.
Try using dynamic fields:
<dynamicField name="degree_level_*" type="string" indexed="true" stored="true" />
Create the fields dynamically while indexing, e.g. degree_level_Bachelors with the value Harvard University, and so on. Then, when you want to filter on the Bachelors degree, filter on the field degree_level_Bachelors. Similarly, if you want to allow filtering on institutions, create a dynamic field for institutions.
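A minimal sketch of how the indexing side of the dynamic-field idea might look, assuming the document is built as a plain Python dict before being sent to Solr. The helper name to_dynamic_fields is hypothetical. Note that a degree can occur more than once (Master appears twice in the sample), so the sketch collects a list per field; the dynamicField would presumably need multiValued="true" for that to index.

```python
from collections import defaultdict


def to_dynamic_fields(doc_id, degrees, institutions, prefix="degree_level_"):
    """Build one field per degree, holding the institutions for that degree."""
    fields = defaultdict(list)
    for degree, institution in zip(degrees, institutions):
        fields[prefix + degree].append(institution)
    doc = {"id": doc_id}
    doc.update(fields)
    return doc


doc = to_dynamic_fields(
    "23606",
    ["Bachelors", "Diploma", "Master", "Master", "PhD"],
    ["Harvard University", "Yale Universety", "Cornell University",
     "TUFTS University", "University of Arizona"],
)
print(doc["degree_level_Bachelors"])  # ['Harvard University']
print(doc["degree_level_Master"])     # ['Cornell University', 'TUFTS University']
```

With this layout, a filter such as fq=degree_level_Bachelors:"Cornell University" would not match the sample user.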
You can predefine how you will store the data:
<year><separator><degree><separator><institution><separator><major> etc.
and then filter on the required pattern.
E.g.:
fq=educationDetails:2009#Bachelors#Harvard#*
This will give you all records with a Bachelors from Harvard in 2009.
You will have to come up with the patterns for all the different filters.
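Building those filters by hand gets tedious, so here is a rough sketch of a helper that produces them. The function name education_filter is an assumption, and it assumes the values contain no spaces or query-syntax characters. Parts that are left out become a * wildcard. Since # starts the fragment in a URL, the filter has to be URL-encoded before it goes into the request.

```python
from urllib.parse import quote


def education_filter(year="*", degree="*", institution="*",
                     sep="#", field="educationDetails"):
    """Build a wildcard filter; unspecified parts match anything."""
    return f"{field}:{year}{sep}{degree}{sep}{institution}{sep}*"


fq = education_filter(year="2009", degree="Bachelors", institution="Harvard")
print(fq)  # educationDetails:2009#Bachelors#Harvard#*

# Any year, Bachelors, any institution.
print(education_filter(degree="Bachelors"))  # educationDetails:*#Bachelors#*#*

# Encode before appending to the request URL as fq=...
encoded = quote(fq, safe="")
print(encoded)
```

Filters that start with a wildcard (such as the second one) can be slow on large indexes, so it may be worth ordering the parts so that the most commonly filtered one comes first.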
Use two collections to correctly model the one-to-many relationship between user and degree, queried using {!join}.
Use one collection at a "user-degree" level of granularity that gets deduplicated via Solr's field collapsing support.
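The "user-degree" granularity suggestion could look something like the sketch below: each user is flattened into one document per degree, so degree and institution always sit in the same document. The field name user_id, the id scheme, and both helper names are assumptions made for illustration. The matching_users function is only a local stand-in for what the Solr filters would do.

```python
def flatten_user(user_id, degrees, institutions):
    """Return one document per (degree, institution) pair of a user."""
    return [
        {"id": f"{user_id}_{n}", "user_id": user_id,
         "degree_level": degree, "institution": institution}
        for n, (degree, institution) in enumerate(zip(degrees, institutions))
    ]


def matching_users(docs, degree, institution):
    """Local stand-in for the filters: both values must be in the same doc."""
    return {d["user_id"] for d in docs
            if d["degree_level"] == degree and d["institution"] == institution}


docs = flatten_user(
    "23606",
    ["Bachelors", "Diploma", "Master", "Master", "PhD"],
    ["Harvard University", "Yale Universety", "Cornell University",
     "TUFTS University", "University of Arizona"],
)
print(matching_users(docs, "Bachelors", "Harvard University"))  # {'23606'}
print(matching_users(docs, "Bachelors", "Cornell University"))  # set()

# Request parameters one might send; the collapse filter keeps one
# document per user_id so each user shows up once in the results.
params = {
    "q": "*:*",
    "fq": [
        "degree_level:Bachelors",
        'institution:"Harvard University"',
        "{!collapse field=user_id}",
    ],
}
```

The two-collection variant would keep the user documents separate and use a {!join} query from the degree documents back to the users instead of collapsing.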

Related

Working with Highlights on Solr 6.4.1

I am running Solr 6.4.1 on a Windows 7 machine, with Chrome for testing query URLs currently.
I have set up a working index on a set of test documents: a small number of webpages saved as .docx files in a folder. I can get basic queries working and am now trying to get highlighting working.
I have not modified the schema in any way - simply indexed the folder into a Core called test.
The following query runs and highlights as I expect:
http://localhost:8983/solr/test/select?hl=on&hl.fl=meta_author&q=steven&wt=xml&fl=meta_author
and returns
...<lst name="highlighting">
<lst name="C:\Users\steven\Documents\Indexing\Dungeon Arena Building.docx">
<arr name="meta_author">
<str><em>steven</em></str>
</arr>
</lst>...
However, if I change the fields to try to highlight where the term is found in the title of the document, it does not work in the same way.
http://localhost:8983/solr/test/select?hl=on&hl.fl=dc_title&q=gothic&wt=xml&fl=dc_title
returns
...<lst name="highlighting">
<lst name="C:\Users\steven\Documents\Indexing\Basic Gothic Dungeon.docx"/>
<lst name="C:\Users\steven\Documents\Indexing\Dungeon Arena Building.docx"/>
</lst>...
The results are correct but it does not highlight the identified data fields.
Are there some rules around the available fields that can be highlighted or do I need to amend something in the schema?
For context I aim to bring over all the file content into the index so that I can then present back the match in context of the surrounding text for the users to see.
Check whether the dc_title field is stored.
In your schema the field should look like the following (the field type can differ from this, but set stored="true"). After the modification, reindex the documents and search again.
<field name="dc_title" type="text_general" indexed="true" stored="true"/>

Using logic AND in a text field

I'm using a schema that has a text field containing ids separated by spaces. The field definition in schema is below:
<field name="aux_identifiers" type="text" indexed="true" stored="true"/>
A query that fetches a single document returns the field as below, for example:
<str name="aux_identifiers">1 2 3 4</str>
Is there any possibility of applying a logical AND operator to this field? I need to find the documents that have, for example, the ids 2 and 3 in the field.
FYI, we can't change those fields to multivalued or arrays and reindex right now. That's why I'm trying an alternative solution.
It would depend on what kind of processing you have on that field, but this should work:
q=aux_identifiers:2 AND aux_identifiers:3
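If the list of ids varies per request, the query string could be assembled with a small helper like this one. It is a sketch only: the name all_of is made up, and it assumes the field is tokenized on whitespace so that each id is indexed as its own term.

```python
from urllib.parse import urlencode


def all_of(field, values):
    """Require every value to be present in the field."""
    return " AND ".join(f"{field}:{v}" for v in values)


q = all_of("aux_identifiers", [2, 3])
print(q)  # aux_identifiers:2 AND aux_identifiers:3

# URL-encoded form, ready to append to .../select?
print(urlencode({"q": q}))  # q=aux_identifiers%3A2+AND+aux_identifiers%3A3
```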

Solr click scoring implementation

After searching and searching over the net, I've found a possible open-source solution for click-count popularity in Solr (i.e., one that does not require a paid version of LucidWorks Search).
In my next two answers I will try to solve the problem in an easy way and in a slightly more complex way.
But first, some prerequisites.
We assume a Google-like scenario:
1. the user enters some terms in a text field and pushes the search button
2. the system (a custom webapp coupled with Solr) produces a web page with clickable results
3. the user selects one of the results (e.g. to access the details), which informs the system to change the 'popularity' of the selected result
The very easy way.
We define a field called 'popularity' in the Solr schema.xml:
<field name="popularity" type="long" indexed="true" stored="true"/>
Suppose the user clicks on the document with id 1234. We (i.e., the webapp) then have to call Solr to update the popularity field of the document with id 1234, using the URL
http://mysolrappserver/solr/update?commit=true
and posting in the body:
<add>
<doc>
<field name="id">1234</field>
<field name="popularity" update="inc">1</field>
</doc>
</add>
So, each time the webapp queries Solr (combining/ordering the Solr 'boost' field with our custom 'popularity' field), we obtain a list that is also ordered by popularity.
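On the webapp side, the update body for a click could be generated like this. It is a minimal sketch using only the Python standard library; the function name increment_body is an assumption, and actually posting the body to the update URL is left out.

```python
import xml.etree.ElementTree as ET


def increment_body(doc_id, field="popularity", amount=1):
    """Build the XML body of an atomic 'inc' update for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", {"name": "id"}).text = str(doc_id)
    ET.SubElement(doc, "field", {"name": field, "update": "inc"}).text = str(amount)
    return ET.tostring(add, encoding="unicode")


body = increment_body(1234)
print(body)
```

The resulting string has the same structure as the hand-written body above and would be posted with an XML content type.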
The more complex idea is to update the Solr index by tracking not only the user's selection but also the search terms used to obtain the list.
First of all, we have to define a history field in which to store the search terms used:
<field name="searchHistory" type="text_general" stored="true" indexed="true" multiValued="true"/>
Then suppose the user searched for 'something' and selected the document with id 1234 from the result list. The webapp will call the Solr instance at the URL
http://mysolrappserver/solr/update?commit=true
adding a new value to the field searchHistory:
<add>
<doc>
<field name="id">1234</field>
<field name="searchHistory" update="add">something</field>
</doc>
</add>
Finally, using the Solr termfreq function in every subsequent query, we obtain a 'score' that, combined with the 'boost' field, can produce a sorted list based on click-count popularity (and the history of search terms).
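One way the termfreq part might be wired into a request is sketched below. The choice of the edismax parser with a bf boost function is my assumption and not something stated above; the function name click_boost_params is also made up. The sketch only handles a single plain term, and it lowercases the term on the assumption that the text_general analysis lowercases values at index time.

```python
def click_boost_params(term, history_field="searchHistory"):
    """Build request params that boost docs previously clicked for this term."""
    if not term or "'" in term or any(ch.isspace() for ch in term):
        raise ValueError("expected a single plain term")
    return {
        "defType": "edismax",
        "q": term,
        # termfreq(field, term) counts how often the term occurs in the field.
        "bf": f"termfreq({history_field},'{term.lower()}')",
    }


params = click_boost_params("something")
print(params["bf"])  # termfreq(searchHistory,'something')
```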
This is an interesting approach; however, I see some disadvantages in it:
Overall storage will grow dramatically with each and every search.
You're assuming that choosing a specific item is 100% correct and that it wasn't done by mistake or only for a brief look. This way you might get wrong search results along the way.
I suggest only incrementing the counter, or even maintaining a relative counter based on the other results that the user didn't click.

Searching multi-valued fields in the same position

Let's say we have a document like this:
<arr name="pvt_rate_type">
<str>CORPORATE</str>
<str>AGENCY</str>
</arr>
<arr name="pvt_rate_set_id">
<str>1</str>
<str>2</str>
</arr>
Now I do a search where I want to return the document only if it contains pvt_rate_set_id = 1 AND pvt_rate_type = AGENCY in the same position in their multi-valued fields, so the above document should NOT be returned (because pvt_rate_set_id 1 has a pvt_rate_type of CORPORATE).
Is this possible at all in Solr? Or is my schema badly designed? How else would you design the schema to allow for the searching I want?
This may not be available out of the box.
You would need to modify the schema to have fields with the pvt rate type as the field name and the id as its value,
e.g.
CORPORATE=1
AGENCY=2
This can be achieved by defining dynamic fields,
e.g.
<dynamicField name="*_pvt_rate_type" type="string" indexed="true" stored="true"/>
So you can input data as corporate_pvt_rate_type or agency_pvt_rate_type with the respective values.
The filter queries will then be able to match the exact mappings, e.g. fq=corporate_pvt_rate_type:1
Unfortunately Solr does not seem to support this.
Another way to do this in Solr would be to store a concatenated string field type_and_id, with a delimiter (say, a comma) separating the type and the id, and query like:
q=type_and_id:AGENCY%2C1
(where %2C is the URL encoding for comma).
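To make the concatenation concrete, here is a short sketch of building the type_and_id values and the encoded query. The helper name type_and_id_values is an assumption; the values come from the sample document in the question.

```python
from urllib.parse import quote


def type_and_id_values(types, ids, sep=","):
    """Pair up the two parallel lists into 'TYPE,ID' strings."""
    return [f"{t}{sep}{i}" for t, i in zip(types, ids)]


values = type_and_id_values(["CORPORATE", "AGENCY"], ["1", "2"])
print(values)  # ['CORPORATE,1', 'AGENCY,2']

# The sample document should not match AGENCY with id 1.
print("AGENCY,1" in values)  # False

print("q=type_and_id:" + quote("AGENCY,1"))  # q=type_and_id:AGENCY%2C1
```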

Solr query must match all words/tokens in a field

I have a text-field called name in my schema.xml. A query q=name:(organic) returns the following documents:
<doc>
<str name="id">ontology.category.1483</str>
<str name="name">Organic Products</str>
</doc>
<doc>
<str name="id">ontology.keyword.4896</str>
<str name="name">Organic Stores</str>
</doc>
This is perfectly right for a normal Solr search; however, I would like to construct the query so that it doesn't return anything, because 'organic' matches only 1 of the 2 words in the field.
A better way to say it could be this: Only return results if all tokens in the field are matched. So if there are two words (tokens) in a field and I only match 1 ('organic', 'organics','organ' etc.) I shouldn't get a match because only 50% of the field has been searched on.
Is this possible in Solr? How do I construct the query?
You are probably using StandardTokenizerFactory (or something similar). One solution is to use KeywordTokenizerFactory and issue a phrase query; then only perfect matches will work. Of course, remember the other filters you might want to use (like LowerCaseFilterFactory etc.). Note that "stores organic" will not match your doc either.
Due to time constraints, I had to resort to the following (hacky) solution.
I added the term count to the index via a dynamic field called tc_i.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
Now at query time I count the terms and append the count to the query, so q=name:(organic) becomes q=name:(organic) AND tc_i:(1). This won't return the documents for "organic stores" / "organic products", because their tc_i fields are set to 2 (two words).
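Roughly, the term-count hack looks like this on the application side. The helper names are made up for illustration, and counting by whitespace is an assumption; the real count should follow whatever the field's analyzer produces.

```python
def term_count(text):
    """Count whitespace-separated terms (a stand-in for the real analyzer)."""
    return len(text.split())


def index_doc(doc_id, name):
    """Add the term count to the document at index time."""
    return {"id": doc_id, "name": name, "tc_i": term_count(name)}


def exact_length_query(user_input, field="name"):
    """Append the term count so only fields of the same length can match."""
    return f"{field}:({user_input}) AND tc_i:({term_count(user_input)})"


print(index_doc("ontology.keyword.4896", "Organic Stores")["tc_i"])  # 2
print(exact_length_query("organic"))  # name:(organic) AND tc_i:(1)
```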
