How to search Alfresco for empty property? - solr

I have Alfresco 5.2 and my task is to get all documents in which one particular property is empty. I am creating this query:
searchParameters.setQuery("search +TYPE:\"ecmcndintregst:nd_int_reg_standards\" +#ecmcnddoc\\:doc_name_ru:\"\" -ASPECT:\"ecmcdict:inactive\" AND ( #ecmcnddoc\\:doc_kind_cp_ecmcdict_value:\"mek\")");
This returns all the documents, both those with an empty ecmcnddoc:doc_name_ru and those with a non-empty one.
How can I get ONLY the documents with an empty ecmcnddoc:doc_name_ru?
Thank you
Please tell me what I am doing wrong. How do I search Solr for empty properties? When I submit +#ecmcnddoc:doc_name_ru:"" (without the backslash), I get all documents with ANY ecmcnddoc:doc_name_ru value :(

Related

Azure search not behaving as expected for dashes

I'm having an issue when using Azure Search with the following example data set: abc-123-456, abc-123-457, abc-123-458, etc.
When searching for abc-123-456, I expected only one result, but instead I get all results containing abc-123-...
Is there some setting or way to change this behavior?
Current search settings:
TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
    Side = EdgeNGramTokenFilterSide.Front,
    MinGram = 3,
    MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
    TokenFilters =
    {
        TokenFilterName.Lowercase,
        new TokenFilterName("frontEdgeNGram"),
        TokenFilterName.Classic,
        TokenFilterName.AsciiFolding
    }
});
SearchOptions UsersSearchOptions = new SearchOptions
{
    QueryType = SearchQueryType.Simple,
    SearchMode = SearchMode.All,
};
Using azure.search.documents ver 11.1.1
Edit: Searching for abc-123-456* (with the asterisk) gives me the one result as expected. How can I get this behavior by default?
Just to add to this..
The portal version is 2020-06-30
The sdk version we use is azure.search.documents ver 11.1.1
abc-123-456 does NOT work as expected
"abc-123-456" does NOT work as expected
"abc-123-456"* does NOT work
"abc-123-456*" does NOT work
If we append an asterisk to the end of the search text and it is not within a phrase, it works as expected.
For example:
abc-123-456* works as expected.
(abc-123-456* | abc-123-457* ) works as expected.
Why is the asterisk required? How can we make this work within a phrase?
This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text "abc-123-456" is broken into smaller tokens like "abc", "abc-1", "abc-12", "abc-123", ..., "abc-123-456". Use the Analyze Text API to see the full list of tokens generated by a particular analyzer.
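To make the token expansion concrete, here is a minimal Python sketch of what a front edge n-gram filter does to a single token. It is not the Azure implementation; the function name edge_ngrams is made up for illustration, and the defaults simply mirror the MinGram = 3 and MaxGram = 20 settings from the question.

```python
def edge_ngrams(token, min_gram=3, max_gram=20):
    """Return the front edge n-grams of one token (hypothetical helper).

    Every prefix whose length lies between min_gram and max_gram is emitted,
    which is roughly what a front-side edge n-gram token filter produces.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


if __name__ == "__main__":
    # The whitespace tokenizer keeps "abc-123-456" as one token, so the
    # filter sees the whole string, dashes included.
    for gram in edge_ngrams("abc-123-456"):
        print(gram)
```

Because "abc-123-457" and "abc-123-458" share most of these prefixes, all three documents end up with many index tokens in common.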
For the query abc-123, if the default analyzer is being used, the query terms will be abc and 123, and the query will match all documents that contain these terms.
A prefix query, on the other hand, is not analyzed and looks for documents that contain the prefix as is: "abc-123". A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result comes back. Full-text search is over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) is over verbatim strings in a separate unprocessed/internal index.
Another option is to set only the search analyzer on the field to keyword, so that the input query is not broken into tokens.
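The effect of a keyword search analyzer can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: documents are indexed with front edge n-grams (3 to 20 characters), a document matches when it shares at least one token with the analyzed query, and all function names are hypothetical. The real service scores and combines terms in more elaborate ways.

```python
def edge_ngrams(token, min_gram=3, max_gram=20):
    """Front edge n-grams of one token (toy stand-in for the index filter)."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def ngram_analyzer(text):
    """Toy index analyzer: whitespace split, lowercase, edge n-grams."""
    tokens = set()
    for word in text.lower().split():
        tokens.update(edge_ngrams(word))
    return tokens


def keyword_analyzer(text):
    """Toy keyword analyzer: the whole input becomes a single token."""
    return {text.lower()}


def search(docs, query, search_analyzer):
    """Return the docs that share at least one token with the analyzed query."""
    query_tokens = search_analyzer(query)
    return [doc for doc in docs if query_tokens & ngram_analyzer(doc)]


if __name__ == "__main__":
    docs = ["abc-123-456", "abc-123-457", "abc-123-458"]
    # Same analyzer at query time: the query is also cut into prefixes,
    # so every document sharing a prefix such as "abc" matches.
    print(search(docs, "abc-123-456", ngram_analyzer))
    # Keyword analyzer at query time: only the full string is looked up.
    print(search(docs, "abc-123-456", keyword_analyzer))
```

In this model the first search returns all three documents and the second returns only abc-123-456, which is the behavior the question asks for.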

local param not working in solr 8 but working in solr 5

I am migrating from Solr 5.5 to Solr 8.
The query for Solr 5.5 looks like this:
qt=/dismax
product_fields_Ref1=product_concept^279841
sku_and_product_fields_Ref1=silhouette_concept^234256 $product_fields_Ref1
product_phrase_Ref2=pant
concept_with_synonyms_ref1=({!edismax2 qf=$sku_and_product_fields_Ref1 v=$product_phrase_Ref2})
top_concept_query_ref= (+({!maxscore v=$concept_with_synonyms_ref1}) )
productQueryRef3=+(+({!query v=$cq})) +( ({!maxscore v=$top_concept_query_ref}) )
sq=+{!lucene v=$productQueryRef3}
q={!parent tag=top which=$pq score=max v=$sq}
On Solr 8.0, however, it fails with this error:
Error from server at http://localhost:8080/products: org.apache.solr.search.SyntaxError: Query Field '$product_fields_Ref1' is not a valid field name
The error goes away if I modify the query like this (removing the variable product_fields_Ref1 and putting its value directly in sku_and_product_fields_Ref1):
qt=/dismax
sku_and_product_fields_Ref1=silhouette_concept^234256 product_concept^279841
product_phrase_Ref2=pant
concept_with_synonyms_ref1=({!edismax2 qf=$sku_and_product_fields_Ref1 v=$product_phrase_Ref2})
top_concept_query_ref= (+({!maxscore v=$concept_with_synonyms_ref1}) )
productQueryRef3=+(+({!query v=$cq})) +( ({!maxscore v=$top_concept_query_ref}) )
sq=+{!lucene v=$productQueryRef3}
q={!parent tag=top which=$pq score=max v=$sq}
The problem is that I cannot modify this query, since the value of the param "product_fields_Ref1" is compiled from a large number of places.
I am using defType=dismax only.
Can anyone advise what needs to be fixed?
I went through the source code of "org.apache.solr.search.ExtendedDismaxQParser"
and found that a new validation check has been added which does NOT allow a local parameter in the qf field of the edismax parser (this check was introduced in Solr 8.0.0).
The check works like this:
any parameter coming in qf MUST match a field in the schema of the core (I am not using schema-less mode). The method is
validateQueryFields(up);
It executes in
public Query parse() throws SyntaxError { ... }
of
org.apache.solr.search.ExtendedDismaxQParser
I got this working by creating my own custom parser that overrides the parse() method and removes this validation.
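If the request passes through a single place on the client before it is sent, another option might be to expand the nested reference there, so that qf only ever contains real field names. The Python sketch below is an assumption of mine and not something Solr provides: the helper expand_refs and its regular expression are hypothetical, and it should only be applied to values such as the qf list, since references inside local params like {!edismax2 qf=$...} are still resolved by Solr itself.

```python
import re

# A "$name" reference: dollar sign followed by an identifier-like name.
_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def expand_refs(value, params, max_depth=10):
    """Replace $name references in value with params[name] (hypothetical helper).

    Unknown names are left untouched. Expansion repeats until the value stops
    changing, so references that point at other references are resolved too.
    """
    for _ in range(max_depth):
        expanded = _REF.sub(lambda m: params.get(m.group(1), m.group(0)), value)
        if expanded == value:
            return expanded
        value = expanded
    raise ValueError("parameter references nested too deeply (possible cycle)")


if __name__ == "__main__":
    params = {
        "product_fields_Ref1": "product_concept^279841",
        "sku_and_product_fields_Ref1": "silhouette_concept^234256 $product_fields_Ref1",
    }
    # Flatten the qf value before the request is sent.
    print(expand_refs(params["sku_and_product_fields_Ref1"], params))
```

With the values from the question, the flattened qf value is the same string as in the hand-modified query shown in the question.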
Support for local parameters has changed significantly in more recent versions of Solr (see https://lucene.apache.org/solr/guide/7_5/solr-upgrade-notes.html#solr-7-2).
The only way I have been able to get some of the behavior back is by setting lucene as the default parser in solrconfig.xml and then passing the local parameters in the query, for example: q={!dismax qf=$param1}coffee
I understand that you can get the old behavior back by setting luceneMatchVersion to 7.1.0, but that change did not work for me.
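As a rough illustration of the q={!dismax qf=$param1}coffee form, the following Python sketch builds the URL-encoded query string for such a request. The function name build_query_string is made up, the field list is borrowed from the question, and passing defType=lucene per request (instead of changing solrconfig.xml) is my assumption about an equivalent setup.

```python
from urllib.parse import parse_qs, urlencode


def build_query_string(user_query, qf_value):
    """Build request parameters that pass dismax local params via the lucene parser."""
    params = {
        # Assumption: selecting the lucene parser per request instead of in solrconfig.xml.
        "defType": "lucene",
        # Local params in q; $param1 is dereferenced by Solr on the server.
        "q": "{!dismax qf=$param1}" + user_query,
        "param1": qf_value,
    }
    return urlencode(params)


if __name__ == "__main__":
    qs = build_query_string("coffee", "silhouette_concept^234256 product_concept^279841")
    print(qs)
    # Decoding shows the parameters exactly as Solr would receive them.
    print(parse_qs(qs))
```

urlencode takes care of escaping the braces, the dollar sign and the caret, so the query can be appended to the select URL as is.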

How to remove escape character from solr indexed field?

I am indexing JSON data into a Solr field, e.g.
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
But the JSON is getting indexed with escaped characters, so now I get the JSON back as
"{\"employees\":[\n {\"firstName\":\"John\", \"lastName\":\"Doe\"},\n {\"firstName\":\"Anna\", \"lastName\":\"Smith\"},\n {\"firstName\":\"Peter\", \"lastName\":\"Jones\"}\n]}"
Is there any way to index without escaping the JSON, or to unescape the result for display, solely on the Solr side?
This is a perfectly fine way to store JSON data in a Solr text field.
If you look at it through the admin UI, you will see the JSON in escaped format, but if you query it and then decode the JSON, you get the correct object back in the language you are using.
Python example:
import json
my_json_field = json_string  # read from Solr using API calls or a module like pysolr
my_obj = json.loads(my_json_field)
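A slightly fuller, self-contained version of the same idea is sketched below. The response body is built by hand here rather than fetched from Solr, and the field name my_json_field and the helper extract_json_field are hypothetical. The point is that the stored value is decoded twice: once as part of the response envelope, and once as the JSON document itself.

```python
import json

# The JSON document as it was sent to Solr (stored as one string value).
stored = '{"employees":[{"firstName":"John", "lastName":"Doe"}]}'

# Hand-built stand-in for a Solr JSON response; a real one has more keys.
raw_response = json.dumps(
    {"response": {"docs": [{"id": "1", "my_json_field": stored}]}}
)


def extract_json_field(raw, field):
    """Decode the response, then decode the JSON stored inside each document."""
    body = json.loads(raw)  # first decode: the response envelope
    return [
        json.loads(doc[field])  # second decode: the stored JSON string
        for doc in body["response"]["docs"]
        if field in doc
    ]


if __name__ == "__main__":
    # The raw body contains backslash-escaped quotes, just like the admin UI.
    print(raw_response)
    for obj in extract_json_field(raw_response, "my_json_field"):
        print(obj["employees"][0]["firstName"])
```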
In the end the solution was very simple: use a document transformer (see "Transforming Result Documents" in the Solr guide),
e.g.
fl=my_field_with_escaped_json:[json]
Thanks everyone

How to get word count of SOLR document?

I have the binary content of a pdf file, and I want to upload it to SOLR and index its content:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest('/update/extract')
up.setParam("literal.id", map.id)
def tmpFile = null
tmpFile = File.createTempFile(map.id, ".tmp")
tmpFile.append(binary)
up.addFile(tmpFile, ".pdf")
// Do the SOLR stuff here
def solr = getSolrServer()
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true)
def response = solr.request(up)
if (tmpFile) {
    tmpFile.delete()
}
return response
When I query SOLR, I can retrieve the SOLR document. How can I get the actual content of the file? Basically I need to find the word count of the document I've uploaded so I was planning to do a size() on the string returned (if that's even possible)....
I'm very new to SOLR so am probably on the wrong track... any assistance greatly appreciated :)
I am assuming you want to count the number of words in the PDF that you have indexed. Make sure that:
The entire extracted content of the PDF is indexed into one field.
This field has at least a whitespace tokenizer enabled, so that it splits the sentences into words based on whitespace.
Once you do this, you can find the number of words using either facets or the Term Vector Component. The SO answer below might be helpful:
https://stackoverflow.com/a/26933126/689625
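For comparison, if the extracted text is also stored and returned with the document, a client-side count is a rough alternative. The Python sketch below assumes the text has already been read from the Solr response into a string; the function names are made up, and splitting on whitespace only approximates what a whitespace tokenizer does on the server.

```python
from collections import Counter


def count_words(text):
    """Count whitespace-separated words, like a whitespace tokenizer would."""
    return len(text.split())


def term_frequencies(text):
    """Per-term counts, loosely comparable to what facets or term vectors report."""
    return Counter(text.lower().split())


if __name__ == "__main__":
    # Stand-in for the extracted PDF content read from the Solr document.
    content = "Solr indexes text. Solr also extracts text from PDF files."
    print(count_words(content))
    print(term_frequencies(content).most_common(3))
```

Summing the per-term counts gives the same total as count_words, which is essentially what the facet or term vector approach computes inside Solr.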

searching quote char with surround query parser in SOLR

I am trying to search for two consecutive words as follows:
{!surround}FieldName:first w second
The query works great, but Solr throws a parse exception when one of the words contains a quote character, i.e.
{!surround}FieldName:first w sec"ond
I have tried to escape the quote:
{!surround}FieldName:first w sec\"ond
but it didn't help.
I also tried using the v parameter of LocalParams, but that did not work either.
{!surround v="first w sec\"ond"}FieldName
I am currently running Solr 4.0.
Does anybody know how to overcome this problem?
Try this:
{!surround}FieldName:(w(first,"sec\"ond"))
