Lucene/Solr - how to get the match count of each word in a query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and take the match count of each word from its query.
But this method takes too much time.
So is there a way to get the count for each word with one query?
Any help appreciated!

In this case, function queries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example syntax: termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of totaltermfreq. Example syntax: ttf(text,'memory')
For instance, the following query:
q=*%3A*&fl=cntOnSummary%3Atermfreq(summary%2C%27hello%27)+cntOnTitle%3Atermfreq(title%2C%27entry%27)+cntOnSource%3Atermfreq(source%2C%27activities%27)&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
]
Please note that while this works well on single-valued fields, for multivalued fields the functions seem to consider only the first entry. In the example above, termfreq(source%2C%27activities%27) returns 1 instead of 2.
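To apply this to the five words from the question, the pseudo-fields can be generated rather than typed by hand. Here is a minimal Python sketch, assuming the field is called name as in the question. The helper name build_count_params and the cnt_ aliases are made up for illustration. Note that it uses ttf, so each value is the total number of occurrences in the index, which is not necessarily the number of matching documents:

```python
from urllib.parse import urlencode

def build_count_params(field, words):
    """Build Solr request parameters asking for ttf(field, word) per word."""
    # One aliased function query per word, e.g. cnt_cat:ttf(name,'cat').
    pseudo_fields = [f"cnt_{w}:ttf({field},'{w}')" for w in words]
    return {
        "q": "*:*",
        "rows": 1,  # ttf is index-wide, so a single document carries all counts
        "fl": " ".join(pseudo_fields),
        "wt": "json",
    }

# The five words from the question; "name" is the field used there.
params = build_count_params("name", ["cat", "dog", "fish", "bird", "animals"])
print(urlencode(params))
```

The printed string can be appended to the select URL of the core; the counts then appear as cnt_cat, cnt_dog and so on in the single returned document.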

Attribute Syntax for JSON query in check_json.pl

So, I'm trying to set up check_json.pl in NagiosXI to monitor some statistics. https://github.com/c-kr/check_json
I'm using the code with the modification I submitted in pull request #32, so line numbers reflect that code.
The json query returns something like this:
[
{
"total_bytes": 123456,
"customer_name": "customer1",
"customer_id": "1",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
},
{
"total_bytes": 123456,
"customer_name": "customer2",
"customer_id": "2",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
}
]
I'm trying to monitor the size of specific files, so a check should look something like:
/path/to/check_json.pl -u https://path/to/my/json -a "SOMETHING" -p "SOMETHING"
...where I'm trying to figure out the SOMETHINGs so that I can monitor the total_bytes of filename1 in customer2 where I know the customer_id and index but not their position in the respective arrays.
I can monitor customer1's total bytes by using the string "[0]->{'total_bytes'}", but I need to be able to specify which customer and dig deeper into the file name (known) and file size (the stat to monitor). Also, the working query only gives me the status (OK, WARNING, or CRITICAL); when I add -p, all I get are errors.
The error with -p no matter how I've been able to phrase it is always:
Not a HASH reference at ./check_json.pl line 235.
Even when I can get a valid OK from the example "[0]->{'total_bytes'}", using that in -p still gives the same error.
Links pointing to documentation on the format to use would be very helpful. Examples in the README for the script or in the -h output are failing me here. Any ideas?
I really have no idea what your question is. I'm sure I'm not alone, hence the downvotes.
Once you have the decoded json, if you have a customer_id to search for, you can do:
my ($customer_info) = grep { $_->{customer_id} eq $customer_id } @$json_response;
Regarding the error on line 235, this looks odd:
foreach my $key ($np->opts->perfvars eq '*' ? map { "{$_}"} sort keys %$json_response : split(',', $np->opts->perfvars)) {
# ....................................... ^^^^^^^^^^^^^
$perf_value = $json_response->{$key};
If perfvars eq "*", you appear to be looking for $json_response->{"{total}"}, for example. You might want to validate the user's input:
die "no such key in json data: '$key'\n" unless exists $json_response->{$key};
This entire business of stringifying the hash ref lookups just smells bad.
A better question would look like:
I have this JSON data. How do I get the sum of total_bytes for the customer with id 1?
See https://stackoverflow.com/help/mcve
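The lookup the asker describes (find the customer by customer_id, then the index by name, without knowing array positions) can be sketched outside check_json.pl. This is an illustration in Python, not a patch to the plugin. The function name find_total_bytes is hypothetical and the data is a trimmed copy of the sample from the question:

```python
import json

# Trimmed copy of the sample data from the question.
SAMPLE = """
[
  {"customer_id": "1", "indices": [{"total_bytes": 12345, "index": "filename1"}]},
  {"customer_id": "2", "indices": [{"total_bytes": 12345, "index": "filename1"},
                                   {"total_bytes": 45678, "index": "filename2"}]}
]
"""

def find_total_bytes(customers, customer_id, index_name):
    """Return total_bytes for one index of one customer, or None if absent."""
    for customer in customers:
        if customer.get("customer_id") != customer_id:
            continue
        for entry in customer.get("indices", []):
            if entry.get("index") == index_name:
                return entry.get("total_bytes")
    return None

customers = json.loads(SAMPLE)
print(find_total_bytes(customers, "2", "filename1"))
```

Returning None for a missing customer or index keeps the "no such key" case explicit, in the same spirit as the validation suggested above.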

count based on a specific field in SOLR

How to return the count of a field with each object in Solr
When I do fq=verify_ix:1 I get the response below. I want to get the count of documents where verify_ix = 1 in the response too. How can I do that?
"response": {
"numFound": 9484,
"start": 0,
"maxScore": 1,
"docs": [
{
"id": "10000000000965509",
"description_s": "No Description",
"recommendation_ix": 0,
"sId_lx": 30005938,
"sType_sx": "P",
"condition_ix": 1000,
"verify_ix": 1
},
.
.
.
{
"id": "10000000000965734",
"description_s": "No Description",
"recommendation_ix": 1,
"sId_lx": 30005947,
"sType_sx": "P",
"condition_ix": 2000,
"verify_ix": 1
}
]}
If you want counts of the different values for a given field, you can send a request to Solr with facet=true and facet.field=verify_ix. For counts over all records, set q=*:*. If you don't want to see any rows returned, you can set rows=0.
See here for more details on faceting:
https://cwiki.apache.org/confluence/display/solr/Faceting
(I tested this with Solr 5, but faceting should work with Solr 4 as well.)
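With the default json.nl=flat setting, Solr returns each facet field as a flat list that alternates value and count. Here is a small Python sketch of turning that list into a dictionary. The response fragment is invented for illustration (only the 9484 count comes from the question), and the helper name facet_counts is hypothetical:

```python
def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list to a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    # Even positions hold the values, odd positions hold their counts.
    return dict(zip(flat[0::2], flat[1::2]))

# Invented fragment of a response to
# q=*:*&rows=0&facet=true&facet.field=verify_ix
sample = {
    "facet_counts": {
        "facet_fields": {"verify_ix": ["1", 9484, "0", 120]}
    }
}
print(facet_counts(sample, "verify_ix"))
```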

solr query for not equal to text value and number greater than 0

I have solr documents with two fields, one is a string and one is an integer. Both fields are allowed to be null. I am attempting to write a query that will eliminate documents with the following properties:
textField = "badValue" AND (numberField is null OR numberField = 0)
I added the following fq:
((NOT textField=badValue) OR numberField=[1 TO *])
This does not seem to have worked properly, because I am getting a document with textField = badValue and numberField = 0. What did I do wrong with my fq?
The full query response header, containing the parsed query is:
"responseHeader": {
"status": 0,
"QTime": 245,
"params": {
"q": "(numi) AND (solr_specs:[* TO *] OR full_description:[* TO *])",
"defType": "edismax",
"bf": "log(sum(popularity,1))",
"indent": "true",
"qf": "categories^3.0 manufacturer^1.0 sku^0.2 split_sku^0.2 upc^1.0 invoice_description^2.6 full_description solr_specs^0.8 solr_spec_values^1.7 legacyid legacy_altcode id",
"fl": "distributor_status,QOH_estimate,id,score",
"start": "0",
"fq": "((*:* NOT distributor_status=VENDORDISC) OR QOH_estimate=[1 TO *])",
"sort": "score desc,id desc",
"rows": "20",
"wt": "json",
"_": "1441220051438"
}
}
QOH_estimate is numberField and distributor_status is textField.
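Please try the following in your fq parameter: ((*:* NOT textField:badValue) OR numberField:[1 TO *]). With your field names:
((*:* NOT distributor_status:VENDORDISC) OR QOH_estimate:[1 TO *])
Two things change compared to your fq: field and value are separated by a colon rather than an equals sign, and the negative clause starts from *:*, because a purely negative clause inside parentheses has nothing to subtract from and matches no documents. The filter first selects all documents that do not contain textField:badValue and then ORs them with the documents matching numberField:[1 TO *].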
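That this filter keeps exactly the documents the asker wants follows from De Morgan's law: NOT (text = bad AND (number is null OR number = 0)) is the same as (text != bad) OR (number >= 1), assuming the number field never holds negative values. Here is a quick Python check of that equivalence on made-up documents (field names from the question, values invented, function names hypothetical):

```python
def should_eliminate(doc):
    """The rule from the question: bad text and a missing or zero number."""
    number = doc.get("numberField")
    return doc.get("textField") == "badValue" and (number is None or number == 0)

def matches_fq(doc):
    """What (*:* NOT textField:badValue) OR numberField:[1 TO *] selects."""
    number = doc.get("numberField")
    return doc.get("textField") != "badValue" or (number is not None and number >= 1)

# Invented documents covering the combinations of null, zero and positive.
docs = [
    {"textField": "badValue", "numberField": 0},
    {"textField": "badValue"},
    {"textField": "badValue", "numberField": 5},
    {"textField": "good", "numberField": 0},
    {},
]
for doc in docs:
    # Every document is either kept by the filter or eliminated, never both.
    assert matches_fq(doc) != should_eliminate(doc)
print([matches_fq(doc) for doc in docs])
```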

Graphite summarize() function results are inconsistent for different 'from' values

I am using Graphite to record user login information.
When I run the following :
render?target=summarize(stats_counts.login.success,"1day")&format=json&from=-1days
I am getting the result :
[
{
"target": "summarize(stats_counts.login.success, \"1day\", \"sum\")",
"datapoints": [
[
5,
1435708800
],
[
21,
1435795200
]
]
}
]
But for the following query :
render?target=summarize(stats_counts.login.success,"1day")&format=json&from=-7days
I am getting the result :
[
{
"target": "summarize(stats_counts.login.success, \"1day\", \"sum\")",
"datapoints": [
[
0,
1435190400
],
[
1,
1435276800
],
[
0,
1435363200
],
[
0,
1435449600
],
[
5,
1435536000
],
[
16,
1435622400
],
[
6,
1435708800
],
[
21,
1435795200
]
]
}
]
Notice the value for the bucket 1435708800 in both results.
In the first result it is 5 and in the second it is 6.
In the first query I am trying to get the number of user logins per day for yesterday and today, and in the second one the number of user logins per day over the last week.
What is the reason for this difference?
UPDATE
Graphite Version : 0.9.10
Retention Settings :
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[real_time]
priority = 200
pattern = ^stats.*
retentions = 1:34560000
[stats]
priority = 110
pattern = .*
retentions = 1s:24h,1m:7d,10m:1y
Try setting alignToFrom to true, since the number of data points that fall into each bucket interval varies with the from time.
By default, buckets are calculated by rounding to the nearest interval. This works well for intervals smaller than a day. For example, 22:32 will end up in the bucket 22:00-23:00 when the interval=1hour.
Passing alignToFrom=true will instead create buckets starting at the from time. In this case, the bucket for 22:32 depends on the from time. If from=6:30 then the 1hour bucket for 22:32 is 22:30-23:30.
"summarize(ex.cpe.ex.xxx,'30s','avg', true)"
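The two bucketing modes described above come down to a little arithmetic. Here is a minimal Python sketch, working in seconds since midnight for readability. It mirrors the documented behaviour and is not Graphite's actual code; bucket_start and hhmm are hypothetical names:

```python
def bucket_start(ts, interval, from_ts=None):
    """Return the start of the bucket that ts falls into.

    With from_ts=None buckets are aligned to multiples of interval
    (the default); otherwise buckets start at from_ts (alignToFrom=true).
    """
    if from_ts is None:
        return ts - (ts % interval)
    return from_ts + ((ts - from_ts) // interval) * interval

def hhmm(seconds):
    """Format seconds since midnight as HH:MM."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"

HOUR = 3600
ts = 22 * HOUR + 32 * 60  # 22:32
print(hhmm(bucket_start(ts, HOUR)))                              # 22:00
print(hhmm(bucket_start(ts, HOUR, from_ts=6 * HOUR + 30 * 60)))  # 22:30
```

Both results match the 1hour example quoted above: 22:32 lands in the 22:00 bucket by default and in the 22:30 bucket when from=6:30.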

SOLR/Tika - I need to concat values from 2 columns from 2 entities

Please look at my schema: http://pastebin.com/uPxwq8Zs and my data config: http://pastebin.com/ebeDfPM9
So, I need to index the page content and some other fields, and also the text of all file attachments linked to each page. Because there can be multiple attachments, "text" and the rest of the file-related fields need to be declared as multivalued. So, example output could be:
{
"header": " ",
"page_id": 25352,
"title": "Informacje, których nie ma w BIP",
"content": [
"<p> TEST test test"
],
"file_name": [
"Wniosek",
"Wniosek"
],
"file_desc": [
"Wniosek o udostępnienie informacji publicznej - format PDF",
"Wniosek o udostępnienie informacji publicznej - format RTF"
],
"file_path": [
"/mnt/storage/content/www/html/smartsite/src/data/resource/1/2/422/zalacznik_nr_1_wniosek.pdf",
"/mnt/storage/content/www/html/smartsite/src/data/resource/1/2/422/zalacznik_nr_1_wniosek.rtf"
],
"text": [
"Dzień dobry - dokument testowy.\nTEST.\n"
],
"_version_": 1479310282003054600
},
As you can see, it generally works, but it's useless for me. In my example, there are 2 file attachments declared in the database for page_id 25352. The first attachment doesn't exist on disk on the server, so Tika was unable to index it; the second attachment was indexed successfully, and this is the extracted text:
"text": [
"Dzień dobry - dokument testowy.\nTEST.\n"
],
But I need to know which attachment it was extracted from. So my idea is to concatenate my "text" field value with the "file_path" value and some separator, so I will get a result like this:
"text": [
/mnt/storage/content/www/html/smartsite/src/data/resource/1/2/422/zalacznik_nr_1_wniosek.rtf*"Dzień dobry - dokument testowy.\nTEST.\n"
],
How can I get a result like this using Solr/Tika in my case?
