Find duplicate objects with Solr 4 and Haystack - solr

I use Solr's faceting to find duplicates. It works pretty well, but I can't figure out how to get the objects' ids.
>>> from haystack.query import SearchQuerySet
>>> sqs = SearchQuerySet().facet('text_string', limit=-1)
>>> sqs.facet_counts()
{
    'dates': {},
    'fields': {
        'text_string': [
            ('the red ballon', 4),
            ('my grand pa is an alien', 2),
            ('be kind rewind', 12),
        ],
    },
    'queries': {}
}
How can I get the ids of the objects behind 'the red ballon', 'my grand pa is an alien', etc.? Do I have to add an id field to Solr's schema.xml?
I'm expecting something like this:
>>> sqs.facet_counts()
{
    'dates': {},
    'fields': {
        'text_string': [
            (object_id, 'the red ballon', 4),
            (object_id, 'my grand pa is an alien', 2),
            (object_id, 'be kind rewind', 12),
        ],
    },
    'queries': {}
}
EDIT: Added schema.xml and search_indexes.py
schema.xml for solr
...
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="_version_" type="long" indexed="true" stored ="true"/>
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="title" type="text_en" indexed="true" stored="true" multiValued="false" />
<!-- Used for duplicate content detection -->
<copyField source="title" dest="text_string" />
<field name="text_string" type="string" indexed="true" stored="true" multiValued="false" />
<field name="pk" type="long" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>text</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>
...
search_indexes.py
from haystack import indexes
# Video is the project's Django model; its import is not shown in the original.

class VideoIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    pk = indexes.IntegerField(model_attr='pk')
    title = indexes.CharField(model_attr='title', boost=1.125)

    def index_queryset(self, using=None):
        # Only index videos visible on the current site.
        return Video.on_site.all()

    def get_model(self):
        return Video

Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.
Here are good examples of it:
faceting example by Yonik
faceting example on solr wiki
Facet counts only return each value with its number of hits, not the matching documents, so in your case you will need to run a second query, filtered on the duplicated value, to get the ids and other details.
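The two-step approach can be sketched roughly as follows. This is a minimal illustration, not Haystack's own API: `duplicated_values` and `ids_by_value` are hypothetical helper names, and the `lookup` callable stands in for whatever second query you run. The Haystack call shown in the comment is an untested assumption that filtering on `text_string` works for a field that is only defined through a copyField in schema.xml.

```python
def duplicated_values(facet_counts, field, min_count=2):
    """Return the facet values of `field` seen at least `min_count` times."""
    pairs = facet_counts.get("fields", {}).get(field, [])
    return [value for value, count in pairs if count >= min_count]


def ids_by_value(values, lookup):
    """Run one follow-up lookup per duplicated value and collect the ids."""
    return {value: list(lookup(value)) for value in values}


# With Haystack, the lookup could look like this (assumption, untested):
#   from haystack.query import SearchQuerySet
#   lookup = lambda v: [r.pk for r in SearchQuerySet().filter(text_string__exact=v)]

# Facet counts copied from the question.
counts = {
    "dates": {},
    "fields": {
        "text_string": [
            ("the red ballon", 4),
            ("my grand pa is an alien", 2),
            ("be kind rewind", 12),
        ],
    },
    "queries": {},
}

dupes = duplicated_values(counts, "text_string")
print(dupes)
```

Each value in `dupes` then costs one extra query, which is usually acceptable when the number of duplicated titles is small.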

Related

Missing 2 fields after applying the schema file puzzler

I am using Solr 7.4 and creating a core using the 3 files from the gist (one can download the files and save them in the directory <dir>/test/conf).
solr create -c test -d <dir>/test
The schema has 14 fields, but only 12 show up in the Schema Browser in the Admin UI.
The schema file looks like:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="collection" version="1.6"
xmlns:inc="http://www.w3.org/2001/XInclude">
<types>
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
<fieldType name="int" class="solr.IntPointField" sortMissingLast="true"/>
<fieldType name="long" class="solr.LongPointField" sortMissingLast="true"/>
</types>
<fields>
<field name="childCode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="parentCode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<filed name="sortOrder" type="int" indexed="true" stored="true" multiValued="false" />
<filed name="locked" type="boolean" indexed="true" stored="true" multiValued="false" />
<field name="status" type="string" indexed="true" stored="true" multiValued="false" />
<field name="filename" type="string" indexed="false" stored="true" multiValued="false" />
<field name="url" type="string" indexed="false" stored="true" multiValued="false" />
<field name="previewUrl" type="string" indexed="false" stored="true" multiValued="false" />
<field name="shape" type="string" indexed="true" stored="true" multiValued="false" />
<field name="originalHeight" type="int" indexed="true" stored="true" multiValued="false" />
<field name="originalWidth" type="int" indexed="true" stored="true" multiValued="false" />
<field name="sizes" type="string" indexed="true" stored="true" multiValued="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
</schema>
The missing fields are 'sortOrder' and 'locked'. Based on the documentation those are valid field names:
The name of the field. Field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g., _version_) are reserved. Every field must have a name.
Other camel-case int fields, such as 'originalHeight' and 'originalWidth', are created. I am also able to go into the Admin UI and add the missing fields manually with the name and the type from the file.
I am puzzled and would appreciate any clue to this disappearing fields mystery.
The element name is misspelled in two places:
<filed name="sortOrder" ..
<filed name="locked" ..
Change it to <field> and they will work like the other fields.
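A typo like this is easy to miss by eye, so a small check can help. The sketch below uses Python's standard library to list every child of <fields> whose tag is not one of the expected element names. The function name `unexpected_children` and the allowed set are assumptions made for illustration, and the sample schema is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Element names expected inside <fields>; this set is an assumption.
ALLOWED = {"field", "dynamicField", "copyField"}


def unexpected_children(schema_xml, parent="fields", allowed=ALLOWED):
    """Return (tag, name) pairs for children of <parent> with an unknown tag."""
    root = ET.fromstring(schema_xml)
    found = []
    for container in root.iter(parent):
        for child in container:
            if child.tag not in allowed:
                found.append((child.tag, child.get("name")))
    return found


sample = """<schema name="collection" version="1.6">
  <fields>
    <field name="id" type="string" />
    <filed name="sortOrder" type="int" />
    <filed name="locked" type="boolean" />
    <field name="status" type="string" />
  </fields>
</schema>"""

for tag, name in unexpected_children(sample):
    print("unexpected element <%s> for field %r" % (tag, name))
```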

parent child indexing in apache solr

I'm new to Apache Solr search. I can't figure out how to get Solr search results with child documents.
My entity in data-config.xml
<entity name="products" query="SELECT DISTINCT IDENTIFIER,PDT_NAME,PDT_DESCRIPTION FROM PARENT_TABLE"
deltaQuery="SELECT IDENTIFIER FROM PARENT_TABLE WHERE LAST_MODIFIED_DATE > '${dataimporter.last_index_time}'">
<field column="IDENTIFIER" name="pdtid" />
<field column="PDT_NAME" name="productname" />
<field column="PDT_DESCRIPTION" name="productdescription" />
<entity name="productVersions" child="true" query="SELECT DISTINCT child_id , child_name FROM WHERE IDENTIFIER = '${products.IDENTIFIER}'">
<field column="IDENTIFIER" name="productVersions.pdtesat" />
<field column="VERSION_NUMBER" name="productVersions.versionnum" />
<field column="DISPLAY_NAME" name="productVersions.displayname" />
</entity>
</entity>
field details in managed-schema file:
<field name="pdtid" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productname" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="productnamerrr" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productdescription" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.childid" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.versionnum" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="productVersions.displayname" type="text_general" indexed="true" stored="true" multiValued="false" />
I expect my Solr result to look like this:
"response":{"numFound":26,"start":0,"docs":[
{
"productdescription":" Java",
"productnamerrr":"pdtid",
"pdtid":"6591",
"child_docs" : [
"productVersions":[
"productVersions.childid":"123"
"productVersions.versionnum":"V1"
"productVersions.displayname":"disp"],
"productVersions":[
"productVersions.childid":"456"
"productVersions.versionnum":"V2"
"productVersions.displayname":"disp2"]
],
"id":"92689209-dc5f-4ae6-bd3c-d55dbd0e200c",
"_version_":1599132440456069120},
Please help me get the multiple child docs in JSON format after indexing.
May 2nd edit.
My query result from the Solr search looks like this:
"response":{"numFound":38,"start":0,"docs":[
{
"productdescription":" JIRA provides issue (bug) and project tracking
for the software development team.",
"productnamerrr":"Atlassian JIRA",
"productVersions":
["childid:6.x,versionnum:Jira 6.x,displayname :Withdrawn",
"childid:2.0.3,versionnum:Atlassian JIRA,displayname:Planning",
"childid:JIRA Server 5.0.1 - 6.3.15,versionnum:JIRA - JEditor,displayname :Withdrawn",
"childid:1.x,versionnum:Jira 1.x,displayname :Withdrawn"
],
"id":"0b5ba528-ef7a-49ba-a97b-2ea94922cbb5",
"_version_":1599297669816123392},
Edited on May 3, 2018
The returned data is correct, but I'm expecting explicit parent/child documents. I'm getting the child docs as below.
"productVersions":["childid:6.x,versionnum:Jira 6.x,displayname :Withdrawn",
"childid:2.0.3,versionnum:Atlassian JIRA,displayname:Planning",
"childid:JIRA Server 5.0.1 - 6.3.15,versionnum:JIRA - JEditor,displayname :Withdrawn",
"childid:1.x,versionnum:Jira 1.x,displayname :Withdrawn"
],
I'm expecting something like this:
"productVersions":[
"productVersions.childid":"123"
"productVersions.versionnum":"V1"
"productVersions.displayname":"disp"],
"productVersions":[
"productVersions.childid":"456"
"productVersions.versionnum":"V2"
"productVersions.displayname":"disp2"]
],
How can I change the query to get the child docs as separate entities?

solr join function to query documents in multiple cores NullPointerException

I use a Solr join to query documents from two cores. My cores are defined as follows:
Post core:
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="creatorId" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
.
.
.
</fields>
User core:
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="username" type="string" indexed="true" stored="true" multiValued="false" />
<field name="email" type="string" indexed="true" stored="true" multiValued="false" />
<field name="userBrief" type="string" indexed="true" stored="true" multiValued="false" />
<field name="jobNumber" type="string" indexed="true" stored="true" multiValued="false" />
</fields>
Now I want to query all users who have created a post. I use the join function, and my URL looks like this:
http://localhost:9080/solr/user/select?q=*:*&fq={!join from=creatorId to=id fromIndex=post}
But it doesn't work, and it throws an exception:
null: java.lang.NullPointerException
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:559)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:646)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
.
.
.
I don't know why, can you help me?
The fq parameter needs an actual query after the {!join ...} local params; the local params on their own do not form a valid query.
Try adding a match-all query (*:*) to the end of the fq param, like this: http://localhost:9080/solr/user/select?q=*:*&fq={!join from=creatorId to=id fromIndex=post}*:*
In a realistic setting you would likely want to filter the joined results in some way, for example, "Find me all action movies rated by this user updated in the past two weeks," where the movies and user ratings are stored as separate documents.
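When the request is built from code, it can help to assemble the filter and URL-encode it instead of pasting braces and spaces into the URL by hand. The following is a minimal sketch using only Python's standard library; `join_filter` and `select_url` are hypothetical helper names, and the host, port, and core names are the ones from the question.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def join_filter(from_field, to_field, from_index, inner_query="*:*"):
    """Build a join filter; the inner query must not be left empty."""
    return "{!join from=%s to=%s fromIndex=%s}%s" % (
        from_field, to_field, from_index, inner_query)


def select_url(core_url, q, fq):
    """Return a /select URL with q and fq percent-encoded."""
    return core_url + "/select?" + urlencode({"q": q, "fq": fq})


fq = join_filter("creatorId", "id", "post")
url = select_url("http://localhost:9080/solr/user", "*:*", fq)
print(url)

# Decoding the query string again gives back the original filter.
print(parse_qs(urlsplit(url).query)["fq"][0])
```

Passing a narrower `inner_query` restricts which documents in the `post` core take part in the join, which is where the "filter the joined results" advice above would plug in.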

SOLR 4.0 alphabetical sorting trouble

I'm having a hard time getting my head around an issue I have with my SOLR address database.
I built this one up from the example files. I'm basically running the example configuration with a modified schema.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />
<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
I am populating the database by pushing about 20,000 random test datasets like the following to post.jar:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
<doc>
<field name="id">1352498443_1</field>
<field name="givenname_s">Aynur</field>
<field name="middleinitial_s"/>
<field name="surname_s">Lehnen</field>
<field name="gender_s">F</field>
<field name="pictureuri_s">dummy_assets/female.jpg</field>
<field name="function_s">Zugschaffner/in</field>
<field name="organizationalunit_s">P 07</field>
<field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
<field name="company_s">Lorem Lagna Epsum Emet</field>
<field name="street_s">Erlenweg</field>
<field name="streetnumber_s">82</field>
<field name="postcode_s">76297</field>
<field name="city_s">Lübeck</field>
<field name="building_s"/>
<field name="roomnumber_s">242</field>
<field name="country_s">GERMANY</field>
<field name="countrycode_s">DE</field>
<field name="emailaddress_s">aynur.lehnen@lorem-lagna-epsum-emet.de</field>
<field name="phone1_s">0392984823</field>
<field name="phone2_s">0124111417</field>
<field name="mobile_s">0325117132</field>
<field name="fax_s">0171459177</field>
</doc>
</add>
However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query and its response:
{
"responseHeader": {
"status": 0,
"QTime": 5,
"params": {
"sort": "surname_s asc",
"fl": "surname_s",
"indent": "true",
"wt": "json",
"q": "city_s:berlin"
}
},
"response": {
"numFound": 1094,
"start": 0,
"docs": [{
"surname_s": "Weil"
}, {
"surname_s": "Abel"
}, {
"surname_s": "Adam"
}, {
"surname_s": "Ade"
}, {
"surname_s": "Adrian"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Albers"
}]
}
}
Why is "Weil" on position one, while the rest of the data appears to be sorted correctly?
I believe that some of the additional analyzers applied in the text_de field type are the cause of this sorting behavior. In my experience, the best results when sorting strings come from using the alphaOnlySort fieldType that comes with the example schema.xml, shown below.
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- The PatternReplaceFilter gives you the flexibility to use
Java Regular expression to replace any sequence of characters
matching a pattern with an arbitrary replacement string,
which may include back references to portions of the original
string matched by the pattern.
See the Java Regular Expression documentation for more
information on pattern and replacement string syntax.
http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:
<field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />
<copyField source="surname_s" dest="surname_s_sort"/>
Note: there is not any need to store the value in the surname_s_sort field, hence the stored="false" attribute, unless you expect to display that to the users.
Then you can just change your query to sort on the surname_s_sort instead.
Sorting doesn't work well on multivalued and tokenized fields.
Documentation -
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)
Use string as the field type and copy the surname_s field into the new field.
<field name="surname_s_sort" type="string" indexed="true" stored="false"/>
<copyField source="surname_s" dest="surname_s_sort" />
As @Paige answered, you can also use a keyword tokenizer with a lowercase filter, neither of which splits the field into multiple tokens.
I had similar issues and I tried alphaOnlySort. It works for the most part, but it starts messing up the sort results when the field contains characters like -, / or spaces.
So the result was something like
/ abc
aa
/ abc2
So I ended up using the field type lowercase. It was already there, so I figured that it's a default type. I did use the copyField construction, so my final config was:
<schema>
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
</fields>
<copyField source="job_name" dest="job_name_sort"/>
</schema>
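To see why the two field types order values differently, it can help to imitate their analysis chains outside Solr. The sketch below is only a Python approximation of the filters listed in the configs above (keyword tokenizer, lowercase, trim, and the `[^a-z]` pattern replace); it is not Solr code, and the function names are hypothetical.

```python
import re


def alpha_only_sort_key(value):
    """Approximate alphaOnlySort: lowercase, trim, then drop everything but a-z."""
    return re.sub(r"[^a-z]", "", value.lower().strip())


def lowercase_key(value):
    """Approximate the 'lowercase' type: whole value kept, only lowercased."""
    return value.lower()


values = ["/ abc", "aa", "/ abc2"]

# Show the sort key each chain would produce for every value.
for v in values:
    print(repr(v), "->", repr(alpha_only_sort_key(v)), "|", repr(lowercase_key(v)))

by_alpha_only = sorted(values, key=alpha_only_sort_key)
by_lowercase = sorted(values, key=lowercase_key)
print(by_alpha_only)
print(by_lowercase)
```

With alphaOnlySort, "/ abc" and "/ abc2" collapse to the same key, so their relative order is left to tie-breaking, while the lowercase type keeps punctuation and digits as part of the key.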

Solr 4 indexing creates a field with no name but all field values concatenated as value for this field

I have several text_en fields in Solr which are "Indexed" but not "Stored". I store these large text values for the document in MongoDB. However, when I look at the Solr index, every document has a field which has no name, and all the fields of the document (including the indexed but not stored ones) are stored in this field.
What is this field and how can I eliminate it? It is increasing the size of my index.
<fields>
<dynamicField indexed="true" name="*_i" stored="true" type="int"/>
<dynamicField indexed="true" name="*_s" stored="true" type="string"/>
<dynamicField indexed="true" name="*_l" stored="true" type="long"/>
<dynamicField indexed="true" name="*_t" stored="true" type="text_en"/>
<dynamicField indexed="true" name="*_b" stored="true" type="boolean"/>
<dynamicField indexed="true" name="*_f" stored="true" type="float"/>
<dynamicField indexed="true" name="*_d" stored="true" type="double"/>
<dynamicField indexed="true" name="*_tiled" stored="false" type="double"/>
<dynamicField indexed="true" name="*_dt" stored="true" type="date"/>
<dynamicField indexed="true" name="*_p" stored="true" type="location"/>
<dynamicField name="random_*" type="random"/>
<dynamicField indexed="true" multiValued="true" name="attr_*" stored="true" type="string"/>
<dynamicField indexed="true" multiValued="true" name="*" stored="true" type="text_en"/>
<dynamicField indexed="true" multiValued="true" name="attr_*" stored="true" type="string"/>
<!-- My Custom Fields -->
<uniqueKey>id</uniqueKey>
<defaultSearchField>text_all</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
<copyField dest="author_display" source="author"/>
<copyField dest="keywords_display" source="keywords"/>
<copyField dest="text_all" source="id"/>
<copyField dest="text_all" source="url"/>
<copyField dest="text_all" source="title"/>
<copyField dest="text_all" source="description"/>
<copyField dest="text_all" source="keywords"/>
<copyField dest="text_all" source="author"/>
<copyField dest="text_all" source="body"/>
<copyField dest="text_all" source="*_t"/>
<copyField dest="spell" source="title"/>
<copyField dest="spell" source="body"/>
<copyField dest="spell" source="description"/>
<copyField dest="spell" source="author"/>
<copyField dest="autocomplete" source="title"/>
<copyField dest="autocomplete" source="body"/>
<copyField dest="autocomplete" source="description"/>
<copyField dest="autocomplete" source="author"/>
</fields>
You are seeing this behavior because of the following entry in your schema.xml file:
<dynamicField indexed="true" multiValued="true" name="*" stored="true" type="text_en"/>
This is a generic catch-all field that you have defined in your schema. If you send documents to the index with field names that do not match any other field in the schema, either by convention (via your other dynamicField settings) or by specific field name, Solr will create that field "on the fly" as a text_en type that can have multiple entries, since it is set up as multiValued="true". These fields are also all stored because of the stored="true" setting. I would recommend removing this field from your schema.xml and reindexing your data.
For more details on the settings in this file, please refer to SchemaXml on the Solr Wiki.
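Before removing the catch-all, it may be worth checking which of your incoming field names currently depend on it. The sketch below is a rough illustration in Python: it uses `fnmatch` as a stand-in for Solr's dynamicField matching, which is an approximation, and the explicit field names are hypothetical because the custom fields are elided in the question.

```python
from fnmatch import fnmatchcase

# dynamicField patterns copied from the schema in the question.
DYNAMIC_PATTERNS = [
    "*_i", "*_s", "*_l", "*_t", "*_b", "*_f", "*_d", "*_tiled",
    "*_dt", "*_p", "random_*", "attr_*", "*",
]

# Hypothetical explicit fields; the real list is not shown in the question.
EXPLICIT_FIELDS = {"id", "title", "body"}


def matching_patterns(field_name, patterns=DYNAMIC_PATTERNS):
    """Return every dynamicField pattern that the field name matches."""
    return [p for p in patterns if fnmatchcase(field_name, p)]


def relies_on_catch_all(field_name, explicit=EXPLICIT_FIELDS):
    """True if only the '*' pattern would accept this field name."""
    if field_name in explicit:
        return False
    return matching_patterns(field_name) == ["*"]


for name in ["title", "price_f", "created_dt", "summary"]:
    print(name, relies_on_catch_all(name))
```

Any name reported as relying on the catch-all would need its own field definition, or would have to be dropped from the documents, before "*" is removed from schema.xml.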

Resources