I have the binary content of a PDF file, and I want to upload it to Solr and index its content:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest('/update/extract')
up.setParam("literal.id", map.id)

// Write the binary content to a temporary file so it can be streamed to Solr
def tmpFile = File.createTempFile(map.id, ".tmp")
try {
    tmpFile.append(binary)
    // The second argument is the content type, not the file extension
    up.addFile(tmpFile, "application/pdf")

    // Do the Solr stuff here
    def solr = getSolrServer()
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true)
    def response = solr.request(up)
    return response
} finally {
    // Always clean up the temp file, even if the request fails
    tmpFile.delete()
}
When I query Solr, I can retrieve the Solr document. How can I get the actual content of the file? Basically, I need to find the word count of the document I've uploaded, so I was planning to call size() on the returned string (if that's even possible)....
I'm very new to Solr, so I am probably on the wrong track... any assistance greatly appreciated :)
I am assuming you want to count the number of words in the PDF which you have indexed. Make sure that:
The entire extracted contents of the PDF are indexed into one field.
That field has at least a whitespace tokenizer enabled, so that it splits sentences into words on whitespace (see the schema sketch below).
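For example, a minimal field type in schema.xml might look like this (the field name content is an assumption, not something from your setup; termVectors="true" additionally enables the term vector approach mentioned below):

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_ws" indexed="true" stored="true" termVectors="true"/>

If you are using the /update/extract handler from the snippet above, adding up.setParam("fmap.content", "content") maps the text Tika extracts into that field.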
Once you do this, you can find the number of words using either facets or the term vector component. The SO answer below might be helpful:
https://stackoverflow.com/a/26933126/689625
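As a rough sketch of the term vector route in SolrJ (the field name content, the /tvrh handler, and the document id are assumptions, not part of your setup), summing the per-term frequencies gives the total word count:

SolrQuery q = new SolrQuery("id:mydoc");  // "mydoc" is a placeholder id
q.setRequestHandler("/tvrh");             // handler wired to the TermVectorComponent
q.set("tv.tf", true);                     // return term frequencies
q.set("tv.fl", "content");                // restrict to the extracted-text field
QueryResponse rsp = solr.query(q);

// The response contains a nested NamedList:
// termVectors -> <doc key> -> content -> <term> -> tf
NamedList<?> termVectors = (NamedList<?>) rsp.getResponse().get("termVectors");
// Walking that structure and summing every "tf" value yields the word count.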
I'm having an issue when using Azure Search with the following example data set: abc-123-456, abc-123-457, abc-123-458, etc.
When searching for abc-123-456, I'd expect only one result to be returned, but instead I get all results containing abc-123-...
Is there some setting or way to change this behavior?
Current search settings:
TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
Side = EdgeNGramTokenFilterSide.Front,
MinGram = 3,
MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
TokenFilters =
{
TokenFilterName.Lowercase,
new TokenFilterName("frontEdgeNGram"),
TokenFilterName.Classic,
TokenFilterName.AsciiFolding
}
});
SearchOptions UsersSearchOptions = new SearchOptions
{
QueryType = SearchQueryType.Simple,
SearchMode = SearchMode.All,
};
Using Azure.Search.Documents version 11.1.1.
Edit: Searching for abc-123-456* (with the asterisk) gives me the one result as expected. How can I get this behavior by default?
Just to add to this:
The portal version is 2020-06-30
The sdk version we use is azure.search.documents ver 11.1.1
abc-123-456 does NOT work as expected
"abc-123-456" does NOT work as expected
"abc-123-456"* does NOT work
"abc-123-456*" does NOT work
If we append an asterisk to the end of the search text and it is not within a phrase, it works as expected.
I.e.:
abc-123-456* works as expected.
(abc-123-456* | abc-123-457* ) works as expected.
Why is the asterisk required? How can we make this work within a phrase?
This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text “abc-123-456” is broken into smaller tokens like “abc”, “abc-1”, “abc-12”, “abc-123”, …, “abc-123-456”. Check out the Analyze API for the full list of tokens generated by a particular analyzer.
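As a quick way to see this, you can call the Analyze endpoint directly (a sketch; the service name, index name, and api-key are placeholders, and the index is assumed to contain the FrontEdgeNGram analyzer):

POST https://{service}.search.windows.net/indexes/{index}/analyze?api-version=2020-06-30
Content-Type: application/json
api-key: {admin-key}

{
  "text": "abc-123-456",
  "analyzer": "FrontEdgeNGram"
}

The response lists every token the analyzer emits, which should match the edge n-grams described above.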
For a query like abc-123, if the default analyzer is being used, the query terms will be abc and 123, and the query will match all documents that contain these terms.
The prefix query, on the other hand, is not analyzed and looks for documents that contain the prefix as-is: “abc-123”. A prefix search bypasses full-text analysis and looks for verbatim matches, which is why the correct result comes back. Full-text search runs over tokens in inverted indexes; everything else (filters, fuzzy, regex, prefix/wildcard, etc.) runs over verbatim strings in a separate, unprocessed internal index.
Another way can be to set only the search analyzer on the field to keyword, so that the input query is not broken up, while keeping FrontEdgeNGram as the index analyzer.
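In the index definition, that looks roughly like this (a sketch; the field name sku is a placeholder, and indexAnalyzer/searchAnalyzer are set as a pair instead of the single analyzer property):

{
  "name": "sku",
  "type": "Edm.String",
  "searchable": true,
  "indexAnalyzer": "FrontEdgeNGram",
  "searchAnalyzer": "keyword"
}

At query time the keyword analyzer leaves abc-123-456 as a single token, so it can only match documents whose indexed edge n-grams include that exact string.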
I have Alfresco 5.2 and my task is to get all documents where one of the properties is empty. I am creating a query:
searchParameters.setQuery("search +TYPE:\"ecmcndintregst:nd_int_reg_standards\" +#ecmcnddoc\\:doc_name_ru:\"\" -ASPECT:\"ecmcdict:inactive\" AND ( #ecmcnddoc\\:doc_kind_cp_ecmcdict_value:\"mek\")");
And I get all the documents: both those with an empty and those with a non-empty ecmcnddoc:doc_name_ru.
How can I get ONLY the ones with an empty ecmcnddoc:doc_name_ru?
Thank you
Please tell me, what am I doing wrong? How do I search Solr for empty properties? When I submit +#ecmcnddoc:doc_name_ru:"" (without the slash), I get all documents with ANY ecmcnddoc:doc_name_ru value :(
Thank you
When using Tone Analyzer, I am only able to retrieve one result. For example, if I use the following input text:
string m_StringToAnalyse = "The World Rocks ! I Love Everything !! Bananas are awesome! Old King Cole was a merry old soul!";
The results only return the analysis for the document level and sentence_id = 0, i.e. "The World Rocks !". The analysis for the next three sentences is not returned.
Any idea what I am doing wrong, or am I missing something? This is the case when running the provided sample code as well.
string m_StringToAnalyse = "This service enables people to discover and understand, and revise the impact of tone in their content. It uses linguistic analysis to detect and interpret emotional, social, and language cues found in text.";
Running tone analysis using the sample code on the sample sentence provided above also returns results for the document and the first sentence only.
I have tried versions "2016-02-19" and "2017-03-15", with the same results.
I believe that if you want sentence-by-sentence analysis, you need to send each sentence as a separate JSON object. It will then return the analysis in an array where id = SENTENCE_NUM.
Here is an example of one I did using multiple YouTube comments (using Python):
import json
import requests
from watson_developer_cloud import ToneAnalyzerV3  # SDK matching the username/password auth used below

# youtube_credentials and watson_credentials are dicts defined elsewhere in this script

def get_comments(video):
    # Get the comments from the YouTube API using requests
    url = 'https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&maxResults=100&videoId=' + video + '&key=' + youtube_credentials['api_key']
    r = requests.get(url)
    comment_dict = list()
    # For each item in the comments, add an object to the list with the text of the comment
    for item in r.json()['items']:
        the_comment = {"text": item['snippet']['topLevelComment']['snippet']['textOriginal']}
        comment_dict.append(the_comment)
    # Return the list as JSON to the sentiment_analysis function
    return json.dumps(comment_dict)

def sentiment_analysis(words):
    # Load Watson credentials using the Python SDK
    tone_analyzer = ToneAnalyzerV3(
        username=watson_credentials['username'], password=watson_credentials['password'], version='2016-02-11')
    # Get the tone, based on the JSON object that is passed to sentiment_analysis
    return_sentiment = json.dumps(tone_analyzer.tone(text=words), indent=2)
    return_sentiment = json.loads(return_sentiment)
    return return_sentiment  # the original snippet built this but never returned it
Afterwards you can do whatever you want with the JSON object. I would also like to note, for anyone else looking at this: if you want to do an analysis of many objects, you can add sentences=False to the tone_analyzer.tone function call.
I am indexing JSON data into a Solr field, e.g.:
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
But the JSON is getting indexed with escaped characters, so now I get the JSON back as:
"{\"employees\":[\n {\"firstName\":\"John\", \"lastName\":\"Doe\"},\n {\"firstName\":\"Anna\", \"lastName\":\"Smith\"},\n {\"firstName\":\"Peter\", \"lastName\":\"Jones\"}\n]}"
Is there any way to index the JSON without it being escaped, or to de-escape the result on the Solr side when displaying it?
This is a perfectly fine way to store JSON data in a Solr text field.
If you look at it through the admin UI, you will see the JSON in escaped format, but if you query it and then decode the JSON, you will get the correct object back in the language you are using.
Python example:

import json

my_json_field = json_string  # read from Solr using API calls or a module like pysolr
my_obj = json.loads(my_json_field)
In the end, the solution was very simple, using Transforming Result Documents:
e.g.,
fl=my_field_with_escaped_json:[json]
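For example, a full select request would look something like this (the core name mycore is a placeholder):

http://localhost:8983/solr/mycore/select?q=id:1&fl=id,my_field_with_escaped_json:[json]&wt=json

With the [json] transformer, the stored string is inlined into a JSON response as raw JSON instead of an escaped string.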
Thanks everyone
I am using Solr to implement my indexing and searching feature, and I am a beginner with Solr.
I want to index geolocation into the Solr index and also make queries on it, so I went through some articles:
http://wiki.apache.org/solr/SpatialSearch
And exactly those schema types are present in my schema.xml.
Now my question: I want to write Java code that indexes the geolocation into dynamic geolocation fields. How do I write it, and is there any sample Java code for indexing it? I looked but didn't find any, so I'd appreciate any help.
I also understand that when indexing we would need to write something like:
document.addField(myDynLocFld+"_p", val));
If using this approach, what should val be: an instance of a location object with both lat and lng values embedded in it? How do I handle this, or is there a different approach in Solr's Java API for this?
Thanks in advance.
Check this code sample:
// An analyzer is needed for the IndexWriterConfig; StandardAnalyzer is the usual default
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
//Directory directory = new RAMDirectory();
// To store an index on disk:
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
For more details, check the Lucene APIs.
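For the original question about the *_p dynamic field: with SolrJ, the usual approach for the solr.LatLonType location type from the example schema is to pass val as a "lat,lon" string rather than a location object. A minimal sketch, with the id, the coordinate values, and the solrServer instance assumed:

// lat/lng would come from your data source
double lat = 51.5074;
double lng = -0.1278;

SolrInputDocument document = new SolrInputDocument();
document.addField("id", "loc-1");                        // placeholder id
document.addField(myDynLocFld + "_p", lat + "," + lng);  // LatLonType expects "lat,lon"
solrServer.add(document);
solrServer.commit();

Queries can then use the spatial syntax from the SpatialSearch wiki page, e.g. fq={!geofilt sfield=myfield_p pt=51.5074,-0.1278 d=5}.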