Indexing joined records in Solr

I am new to Solr and stuck on something that is probably basic and down to a gap in my understanding. I've read the DIH (DataImportHandler) documentation and spent a lot of time searching for this issue without finding a solution.
My use case is a messaging/email system where users can message each other and start a thread, to which they can reply (so it's more like email than direct messaging).
The setup is simple: I have one table, threads, that is the base for this and contains searchable data like user info and subject. Joined to it is the emails table, whose html column should be searchable.
When I run the import below and do a search, Solr only picks up a single email per thread and searches that, rather than all emails belonging to the thread, which is what I'm hoping for. So with 10 threads and 100 messages, the import reports Fetched: 100 but Processed: 10.
How do I get Solr to index all of this content properly and make it searchable? I have also tried the reverse approach: fetching messages first, then the thread each one belongs to, and de-duplicating the results. That works to some extent, but the next step is a left join for email attachments, so I'm looking for a solution that keeps this setup.
Using Solr 6.6.
<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="${dataimporter.request.url}"
              user="${dataimporter.request.user}"
              password="${dataimporter.request.password}"/>
  <document name="threads">
    <entity name="thread" dataSource="ds-db"
            query="
              SELECT threads.id
                   , threads.user_id
                   , threads.subject
                   , users.first_name
                   , users.last_name
                   , users.email
              FROM threads
              LEFT JOIN users ON users.user_id=threads.user_id
            ">
      <field column="id" name="thread_id"/>
      <field column="user_id" name="user_id"/>
      <field column="subject" name="subject"/>
      <field column="first_name" name="first_name"/>
      <field column="last_name" name="last_name"/>
      <field column="email" name="email"/>
      <entity name="message" dataSource="ds-db" transformer="HTMLStripTransformer"
              query="
                SELECT id
                     , html
                FROM emails
                WHERE thread_id = ${thread.id}
              ">
        <field column="id" name="id"/>
        <field column="html" name="html" stripHTML="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
managed-schema
<schema name="example-data-driven-schema" version="1.6">
...
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="thread_id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="first_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="last_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="email" type="string_lowercase" indexed="true" stored="true"/>
<field name="subject" type="string_lowercase" indexed="true" stored="true"/>
<field name="html" type="string_lowercase" indexed="true" stored="true"/>
...
<copyField source="first_name" dest="_text_"/>
<copyField source="last_name" dest="_text_"/>
<copyField source="email" dest="_text_"/>
<copyField source="subject" dest="_text_"/>
<copyField source="html" dest="_text_"/>
...
</schema>

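If you want all the emails in a single field, that field has to be declared multiValued="true"; otherwise only one of the dependent (child) entity rows is indexed per thread. DIH creates one Solr document per row of the root entity (thread) and folds the rows of the nested message entity into that document, which is why the import reports Fetched: 100 but Processed: 10.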
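
As a rough sketch of that change (not a tested configuration): the html field from the question's managed-schema is declared multi-valued so it can hold every email in a thread. The message_id field is a hypothetical name introduced here, on the assumption that id is the schema's uniqueKey, which cannot hold more than one value per document. Under that assumption the email ids need a multi-valued field of their own, and the document's id would come from the thread.

```xml
<!-- managed-schema (sketch): one Solr document per thread, holding all of its emails -->

<!-- Assumed to be the uniqueKey: exactly one value per document (the thread's id) -->
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>

<!-- Hypothetical field (not in the original schema): ids of all emails in the thread -->
<field name="message_id" type="string" multiValued="true" indexed="true" stored="true"/>

<!-- multiValued="true" lets the field keep the html of every email, not just one -->
<field name="html" type="string_lowercase" multiValued="true" indexed="true" stored="true"/>
```

With this layout, the DIH mappings would change to match: the thread entity would populate id from the thread's id, and the message entity would map its id column to message_id. A full re-import is needed after changing the schema.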

Related

Solr - Not Getting Results When searching

I am having trouble with Solr 8.5.2 when providing a word in a query. It works fine when the query is *:*, but when I put in a word, it does not match any document.
Here is my schema.xml config.
<field name="quoteid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="quotenumber" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="formdata" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="creationtimeintickssinceepoch" type="plong" indexed="true" stored="true"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
Here is a sample document (the formdata field is actually a JSON string, as you can see).
{
"quoteid":"466f4dea-XXXX-443c-b1e4-XXXXXXX",
"quotenumber":"NAAAAA",
"creationtimeintickssinceepoch":15927195449809739,
"formdata":"{\"formModel\": {\"SomeProperty0\":\"somevalue\",\"SomeProperty1\":\"somevalue\",\"SomeProperty2\":\"somevalue\"}"...blahblahblah here,
"_version_":1670089165635584000}
I tried searching for NAAAAA: no results. I tried 'SomeProperty1': no results either.

If you don't give a field name in your query and aren't using dismax or edismax with the qf parameter, the default search field is used. It is usually named _text_; it could be configured in the schema in older versions, but it is usually supplied as the df parameter of the default request handler.
To search any other field you need to include the field name in the query: quotenumber:NAAAAA gets hits in the quotenumber field.
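
If unqualified queries should also match these fields, one option is to copy them into the default field with copyField directives in the schema. This is only a sketch, assuming _text_ is the default search field as described above; documents must be reindexed before the copies take effect.

```xml
<!-- schema.xml (sketch): feed the catch-all _text_ field so that
     queries without a field name can match these values -->
<copyField source="quotenumber" dest="_text_"/>
<copyField source="formdata" dest="_text_"/>
```

The _text_ field in the schema shown is already multiValued, which it needs to be in order to receive values from more than one source field.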

SOLR Related facet search

I am using Solr and I have a schema something like this:
<fields>
<field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Username" type="text_general" indexed="true" stored="true" omitNorms="true" multiValued="false"/>
<field name="ServerName" type="text_general" indexed="true" stored="true" multiValued="false" />
</fields>
I want to use faceting to get the number of users per server.
How can I do that?
Desired result:
server 1 : 200 (userNumber)
server 2: 300
and so on...
thank you

This is not a complete solution, as I do not have your data and schema, but I think what you need is pivot faceting: http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
So you need to do something like this (again, you will need to adjust it to make it work for you):
http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=Username,ServerName
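
Two hedged notes on the request above. If every document represents one user, a plain field facet (facet=true&facet.field=ServerName) already returns a document count per server. Either way, ServerName is declared as text_general in the question's schema, so facet values would be the individual indexed tokens rather than whole server names. A minimal sketch of one workaround, assuming a hypothetical field named ServerName_str that holds an untokenized copy for faceting:

```xml
<!-- schema.xml (sketch): ServerName_str is a hypothetical name,
     not part of the original schema -->

<!-- string fields are indexed verbatim, so each distinct server name
     becomes exactly one facet bucket -->
<field name="ServerName_str" type="string" indexed="true" stored="false" multiValued="false"/>

<!-- Fill the facet field from the existing ServerName value at index time -->
<copyField source="ServerName" dest="ServerName_str"/>
```

After reindexing, the facet parameter would point at ServerName_str instead of ServerName.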

Index and query Unique Key URL Solr

I indexed a collection of archived websites for querying with Solr. As the unique key I use the URLs of the sites. What I would like to do is use the url field in filter queries to limit the search to a certain domain when needed. For example, I want to query for "Barack Obama" but limit the results to the "whitehouse.gov" domain. This sounds like a pretty basic use case to me, but searches on the url field do not return any results at all. Here is my config (schema.xml):
.
.
.
<field name="collection" type="string" indexed="true" stored="true"/>
<field name="content" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="string" indexed="true" stored="true"/>
<field name="digest" type="string" indexed="true" stored="true"/>
<field name="length" type="string" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="true" stored="true"/>
<field name="site" type="string" indexed="true" stored="true"/>
<field name="title" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="url" type="text_en_splitting" indexed="true" stored="true"/>
.
.
.
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>url</uniqueKey>
And here is my query (simplified):
http://mysolrserver.com:8983/solr/select/?q=content:Barack+Obama&fq=url:whitehouse.gov
The query analyzer tells me that my query should match.
Does anyone have an idea why this is not working? I highly appreciate any hints I can get! Thanks a lot, guys!

The fq=url:whitehouse.gov filter should work.
However, I see a problem with the query q=content:Barack+Obama.
What is your default search field?
Does removing the query component and using q=*:* return results for you?
The query q=content:Barack+Obama is actually parsed as content:barack defaultsearchfield:obama, because the field prefix applies only to the first term.
If both terms are required and the default search field does not contain obama, the query returns no results. Grouping the terms, as in q=content:(Barack+Obama), applies the field to both of them.

solr indexes documents but does not search in them

I am a novice with Solr and was trying the example that comes in the example folder of the Solr 3.6 package (apache-solr-3.6.0.tgz). I started the server and posted the sample XML files in example/exampledocs; I could then search and Solr returned matches, so all was good. Then I tried posting another XML file with more than 10,000 documents. I modified example/solr/conf/schema.xml to add the fields of my XML file, restarted the server and posted the file. The statistics in the Solr admin panel (http://localhost:8983/solr/admin/stats.jsp) show numDocs : 10020, which means the documents were posted successfully. But when I search for anything that is present in my posted documents (from the 10,000-document XML file), it returns 0 results. Solr is still able to return results for searches that match content in the documents that ship in example/exampledocs. I am clueless about what has happened here, since the value of numDocs clearly suggests that the documents I posted were indexed.
Is there anything else I can inspect to see what's wrong?
The schema that comes with the example in the Solr package looks like this:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
and more....
The XML file I posted had some fields in common with the above schema, like title, description and price, so I added the remaining fields to schema.xml like this:
<field name="cid" type="int" indexed="false" stored="false"/>
<field name="discount" type="float" indexed="true" stored="true"/>
<field name="link" type="string" indexed="true" stored="true"/>
<field name="status" type="string" indexed="true" stored="true"/>
<field name="pubDate" type="string" indexed="true" stored="true"/>
<field name="image" type="string" indexed="false" stored="false"/>

If you are using the default settings from the Solr example, the df setting in solrconfig.xml for the /select request handler sets the default search field to the text field:
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
....
</requestHandler>
If you look in the schema.xml file just below the field definitions you will see the multiple copyField settings that are moving the values from certain fields into the text field and therefore making them searchable via the default field setting. In your example of searching for Sony in the title field, if you look at the copyField statements, you will see that the title field is not being copied to the text default search field. Therefore, the documents with the Sony title value are not being returned in your query.
I would suggest the following:
Try a query by specifying the following: title:Sony that should return what you are expecting.
If you want the title field to be included in the default query field, then add the following copyField statement to the schema.xml file and reload your 10000 document file.
<copyField source="title" dest="text"/>
I hope this helps.

Solr Schema Design

I have some questions regarding Solr schema design. Basically, I'm setting up a search engine for a product catalogue website, and my table relationships are as follows.
Product Belongs to Merchant
Product Belongs to Brand
Product has and belongs to many Categories
Category has many Sub Categories
Sub Category has many Types
Type has many Sub Types
So far my schema.xml looks like this.
<field name="product_id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="string" indexed="true" stored="true"/>
<field name="merchant" type="string" indexed="true" stored="true"/>
<field name="merchant_id" type="string" indexed="true" stored="true"/>
<field name="brand" type="string" indexed="true" stored="true"/>
<field name="brand_id" type="string" indexed="true" stored="true"/>
<field name="categories" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="sub_categories" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="types" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="sub_types" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="image" type="text" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<uniqueKey>product_id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="name" dest="text"/>
<copyField source="merchant" dest="text"/>
<copyField source="brand" dest="text"/>
<copyField source="categories" dest="text"/>
<copyField source="sub_categories" dest="text"/>
<copyField source="types" dest="text"/>
<copyField source="sub_types" dest="text"/>
So my Questions now:
1) Is the Schema correct?
2) Let's assume I need to find products for Category XYZ. My senior programmer doesn't like querying Solr by category name; instead he wants to use the category ID.
He suggests storing CategoryID_CategoryName (1001_Category XYZ) and sending the ID from the web front end (on the assumption that names with white space don't work properly).
So to find the products I would then do a partial match on categories and identify the category ID from the string, i.e. fetch 1001 from 1001_Category XYZ.
or
What if I keep the names in the categories field and set up another field for category_ids? That seems a better option to me.
or
Is there any Solr multi-valued field type to store CategoryID and CategoryName together?
Let me know your thoughts, thanks.

Answers to your questions:
1) Maybe. It depends on how you plan to structure your queries, what you intend to search and what you intend to retrieve in search results. In your schema you're storing and indexing everything, which can be quite inefficient. Index what you intend to query; store what you intend to retrieve or display. If you are looking for optimizations, I would review the data types used in the schema and try to stay as close to the source type as you can.
2) Querying by CategoryId: your programmer is correct, you want to query by category ID. Your approach of storing IDs and names in separate fields is sound as well. Presuming your ID-based fields are integers/longs, don't define them as strings but as integers/longs.
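
A minimal sketch of the separate-fields layout, assuming a hypothetical field named category_ids and an integer field type registered as int in the schema (the type name varies between Solr versions):

```xml
<!-- schema.xml (sketch): names stay in categories for display and matching -->
<field name="categories" type="string" multiValued="true" indexed="true" stored="true"/>

<!-- Hypothetical field: numeric ids used for filtering.
     Indexed but not stored, on the assumption that it is only queried, never displayed -->
<field name="category_ids" type="int" multiValued="true" indexed="true" stored="false"/>
```

A filter such as fq=category_ids:1001 would then select the products of that category without any partial string matching.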
