System.QueryException: Non-selective query against large object type (more than 100000 rows) - salesforce

I had been working with my class for months, but now it gives me the following error:
"System.QueryException: Non-selective query against large object type (more than 100000 rows). Consider an indexed filter or contact salesforce.com about custom indexing. Even if a field is indexed a filter might still not be selective when: 1. The filter value includes null (for instance binding with a list that contains null) 2. Data skew exists whereby the number of matching rows is very large (for instance, filtering for a particular foreign key value that occurs many times)"
On this line:
usuarios=[select id,PersonEmail,marca__c,
Marcas_de_las_que_quiere_recibir_ofertas__c,dni__c,
Segmento__c,Datos_Preferencias__c,
Datos_Test_Compatibilidad__c,Datos_Test_Big_Five__c,
Datos_CV_Obligatorio__c, datos_contratacion__c,
Fecha_orientacion_a_marca__c,Resultado_orientacion_a_marca__c,
Fecha_formacion_online__c,Resultado_formacion_online__c
from Account
where
Segmento__c in :setsegmentos
and Marca_relacional__c in :listaMarcas
and Baja__c=false and Ya_estoy_trabajando__c=false
and Back_list__c=false
and Inactivo__c=false limit 3000];
My SOQL only takes 3000 records. Can anyone help me, please?
I made some changes based on advice from chri (thank you!), but I still get the same error.

Related

How do I perform a clustering algorithm (like DBSCAN) in Snowflake?

I need to cluster some data in Snowflake using DBSCAN. I created a UDF, but the results don't match a local run, so I ran a UDF that just builds a list of the row numbers being processed, and it returns a list with repeated values whose maximum is much smaller than the number of rows in my table. (The expected result was unique values up to the number of rows.)
Can this be a parallelization issue?
If so, is there a way to cluster data using DBSCAN in Snowflake?
Thanks!
EDIT -> code example:
from snowflake.snowpark import functions as funcs, types
from sklearn.cluster import DBSCAN
import pandas as pd

@funcs.pandas_udf(name='DBSCAN_TEST', is_permanent=True,
                  stage_location='@UDF', replace=True,
                  packages=['scikit-learn==1.0.2', 'pandas', 'numpy'])
def DBSCAN_TEST(data_x: types.PandasDataFrame[float, float]) -> types.PandasSeries[float]:
    # data_x arrives as a two-column pandas DataFrame for each batch of rows
    DBSCAN_cluster = DBSCAN(eps=2.5, min_samples=4)
    DBSCAN_cluster.fit(data_x)
    # return one cluster label per input row
    return pd.Series(DBSCAN_cluster.labels_)
As input data I used this to test DBSCAN.
If I run that locally (using the exact same code inside the UDF) I end up with 24 clusters and that is the expected result. But if I use the UDF it scales up to 71.
I've tried changing the input types to string, as a coworker suggested, but it didn't work.
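For what it's worth, here is a quick local sketch (plain scikit-learn, no Snowflake involved; the synthetic data and batch size are arbitrary assumptions) showing why batched execution would change the result: DBSCAN run independently on disjoint chunks of the data does not reproduce DBSCAN run on the full dataset, which matches the behaviour you'd see if the vectorized UDF receives the rows in batches.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic two-column data standing in for the real table (arbitrary choice)
X, _ = make_blobs(n_samples=2000, centers=24, cluster_std=0.5, random_state=0)

# One global run: this is what "running locally" corresponds to
global_labels = DBSCAN(eps=2.5, min_samples=4).fit(X).labels_
print("clusters, full data:", len(set(global_labels)) - (-1 in global_labels))

# Simulate batched execution: run DBSCAN independently on each chunk,
# the way a vectorized UDF sees only its own batch of rows
batch_size = 500  # arbitrary; Snowflake picks its own batch sizes
clusters_per_batch = []
for start in range(0, len(X), batch_size):
    labels = DBSCAN(eps=2.5, min_samples=4).fit(X[start:start + batch_size]).labels_
    clusters_per_batch.append(len(set(labels)) - (-1 in labels))
print("clusters per batch:", clusters_per_batch)
The per-batch label counts typically do not reproduce the single global count, which is consistent with the parallelization hypothesis rather than a bug in the clustering code itself.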
The only clustering option in Snowflake is a clustering key, and the general rule of thumb is that this isn't something you typically need until you exceed a TB of data in a table and/or the automatic clustering proves to be deficient in performance.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html
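If you do decide a clustering key is warranted, it is set with plain DDL; a minimal sketch through a Snowpark session (the connection parameters, table and column names below are placeholders, not anything from the question):
from snowflake.snowpark import Session

# Placeholder credentials; fill in your own account details
connection_parameters = {
    "account": "my_account", "user": "my_user", "password": "my_password",
    "warehouse": "my_wh", "database": "my_db", "schema": "my_schema",
}
session = Session.builder.configs(connection_parameters).create()

# Define a clustering key on a hypothetical table and column
session.sql("ALTER TABLE MY_TABLE CLUSTER BY (EVENT_DATE)").collect()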

OR search in solr

I have a situation where I have to search for documents in Solr with multiple OR keywords. The number of keywords can go up to 5000, which results in an awfully large query with 5000 OR conditions and causes the Solr server to hang. Is there any other way I can design the query to make this work? A short sample of the query is given below.
tweet_id:337931022601699328 OR 337931064293081089 OR 337931089538584576 OR 337931098761871361 OR 337931138851016704 OR 337931143099854848 OR 337931160082591745 OR 337931163857453056 OR 337931230819516416 OR 337931239996665857 OR 337931287518126080 OR 337931322850951168 OR 337931325648535553 OR 337931331398934528 OR 337931413057830912 OR 337931442363441152 OR 337931448629731329 OR 337931453344129025 OR 337931465016877056 OR 337931482066726912 OR 337931514388029442 OR 337931533149155328 OR 337931645527130114 OR 337931704935256064 OR 337931784459268096 OR 337931845545103360 OR 337931889086185472 OR 337931892668108801 OR 337931963983855617 OR 337932154212319233 OR 337932176454721536 OR 337932193198374912 OR 337932229659459584 OR 337932437290090496 OR 337932436807749632 OR 337932436828725250 OR 337932437449474048 OR 337932448518250496 OR 337932458832035843 OR 337932458634915840 OR 337932458278387712 OR 337932474246119425 OR 337932476209041409 OR 337932477408620544 OR 337932480478842880 OR 337932478775959554 OR 337932480566931456 OR 337932478763376640 OR 337932481841999872 OR 337932479337992192 OR 337932479296045057 OR 337932479333797889 OR 337932484614434816 OR 337932484606038017 OR 337932482777317376 OR 337932484664758272 OR 337932482785718273 OR 337932484589273088 OR 337932487399444481 OR 337932489031032833 OR 337932489114923008 OR 337932486573166592 OR 337932490704560130 OR 337932489144270848 OR 337932488762601472 OR 337932492097069056 OR 337932497780355072 OR 337932498900230144 OR 337932499722321921 OR 337932514431729665 OR 337932561806409731 OR 337932567284154368 OR 337932567300935680 OR 337932574603214848 OR 337932571134533632 OR 337932574674518016 OR 337932575484026881 OR 337932578206121984 OR 337932582215892994 OR 337932586653454336 OR 337932584917024768 OR 337932592986865664 OR 337932597017587712 ....
I intend to facet the result based on a few fields.
I'm not sure whether this solution will help you or not, but I tried something for your problem.
Whatever query you provide to Solr, it first parses that query into its own understandable format and then executes it to produce the result. You can do some calculation before querying Solr. Consider the following scenario for your use case.
Suppose you have 5000 tweet_id values in total and you have to do an OR query on around 4000 of them. In that case it's better to query on the other (5000-4000=1000) 1000 tweet_id values with a negated (NOT) query, so the query passes far fewer values.
So, try querying with the rest of the tweet_id values using a negated query instead of an OR query.
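As an illustration of that idea, here is a small sketch using the pysolr client (the core URL, facet field and the toy id sets are placeholders; in practice all_ids would be every indexed tweet_id and wanted_ids the ~4000 you actually need). It queries the complement with a negated clause instead of thousands of ORs, and still lets you facet:
import pysolr

# Placeholder core URL
solr = pysolr.Solr("http://localhost:8983/solr/tweets")

# Toy id sets for the sketch
all_ids = {"337931022601699328", "337931064293081089", "337931089538584576", "337931098761871361"}
wanted_ids = {"337931022601699328", "337931064293081089", "337931089538584576"}

# Exclude the (small) complement instead of OR-ing the (large) wanted set
excluded = all_ids - wanted_ids
negated = " ".join("-tweet_id:%s" % i for i in sorted(excluded))
results = solr.search("*:* %s" % negated,
                      **{"facet": "true", "facet.field": "some_field", "rows": 100})

for doc in results:
    print(doc["tweet_id"])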
If I were you, I'd create a new field denoting this, e.g. custom_list_id. Whenever you generate a new list, index the new data and then query by the list id.

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie at query optimisation. It takes a full minute to show results in the browser, even though only 45,000 records are accessed. Could you please suggest ways to reduce the time it takes to show results?
I wrote the following query to find the call duration for people in an age group:
sigma = 0
popn = len(Demo.objects.filter(age_group=age))
card_list = [Demo.objects.filter(age_group=age)[i].card_no
             for i in range(popn)]
for card in card_list:
    dic = Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma += dic['duration__sum']
avgDur = sigma / popn
The above code sits inside a for loop that iterates over the age groups.
The models are as follows:
class Demo(models.Model):
    card_no = models.CharField(max_length=20, primary_key=True)
    gender = models.IntegerField()
    age = models.IntegerField()
    age_group = models.IntegerField()

class Fact_table(models.Model):
    pri_key = models.BigIntegerField(primary_key=True)
    card_no = models.CharField(max_length=20)
    duration = models.IntegerField()
    time_8bit = models.CharField(max_length=8)
    time_of_day = models.IntegerField()
    isBusinessHr = models.IntegerField()
    Day_of_week = models.IntegerField()
    Day = models.IntegerField()
Thanks
Try this:
from django.db.models import Sum

sigma = 0
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # One
card_list = demo_by_age.values_list('card_no', flat=True)  # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # Three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you make, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
from django.db.models import Avg

age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print "Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg'])
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

SalesForce limit on SOQL?

Using the PHP library for Salesforce, I am running:
SELECT ... FROM Account LIMIT 100
But the result is always capped at 25 records, even though I set LIMIT 100. I am selecting many fields (60 fields). Is this a hard limit?
The skeleton code:
$client = new SforceEnterpriseClient();
$client->createConnection("EnterpriseSandboxWSDL.xml");
$client->login(USERNAME, PASSWORD.SECURITY_TOKEN);
$query = "SELECT ... FROM Account LIMIT 100";
$response = $client->query($query);
foreach ($response->records as $record) {
    // ... there are only 25 records
}
Here is my checklist:
1) Make sure you have more than 25 records.
2) After your first loop, call queryMore to check whether there are more records.
3) Make sure batchSize is not set to 25.
I don't use the PHP library for Salesforce, but I can assume that before running
SELECT ... FROM Account LIMIT 100
some other SELECT queries have been performed. If you didn't write them, maybe the PHP library does it for you ;-)
The Salesforce SOAP API query method will only return a finite number of rows. There are a couple of reasons why it may be returning fewer than your defined limit.
The QueryOptions header batchSize has been set to 25. If this is the case, you could try adjusting it. If it hasn't been explicitly set, you could try setting it to a larger value.
When the SOQL statement selects a number of large fields (such as two or more custom fields of type long text), Salesforce may return fewer records than defined in the batchSize. The reduction in batch size also occurs when dealing with base64-encoded fields, such as Attachment.Body. If this is the case, you can just use queryMore with the QueryLocator from the first response.
In both cases, check the done and size properties of the QueryResult to determine whether you need to call queryMore and to find the total number of rows that match the SOQL query.
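The same done/queryMore loop, sketched below with Python's simple_salesforce client rather than the PHP toolkit (credentials and field list are placeholders); the client is different, but the pattern described above is identical: keep fetching while done is false.
from simple_salesforce import Salesforce

# Placeholder credentials
sf = Salesforce(username="user@example.com", password="password", security_token="token")

result = sf.query("SELECT Id, Name FROM Account LIMIT 100")
records = list(result["records"])

# Keep pulling batches until the server reports the query is done
while not result["done"]:
    result = sf.query_more(result["nextRecordsUrl"], identifier_is_url=True)
    records.extend(result["records"])

print(len(records))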
To avoid governor limits, it might be better to add all the records to a list, then do everything you need to the records in the list. After you're done, just update your database using: update listName;

MDX MEMBER causing NON EMPTY to not filter

I'm using an MDX query to pull information to support a set of reports. A high degree of detail is required for the reports so they take some time to generate. To speed up the access time we pull the data we need and store it in a flat Oracle table and then connect to the table in Excel. This makes the reports refresh in seconds instead of minutes.
Previously the MDX was generated and run by department for 100 departments and then for a number of other filters. All this was done in VB.Net. The requirements for filters have grown to the point where this method is not sustainable (and probably isn't the best approach regardless).
I've built the entire dataset into one MDX query that works perfectly. One of my sets that I cross join includes members from three different levels of hierarchy, it looks like this:
(
Descendants([Merch].[Merch CHQ].[All], 2),
Descendants([Merch].[Merch CHQ].[All], 3),
[Merch].[Merch CHQ].[Department].&[1].Children
)
The problem for me is that in our hierarchy (which I can't change), each group (first item) and each department (second item) follow the same naming structure, i.e. 15-DeptName, and it's confusing to work with.
To address it I added a member:
MEMBER
[Measures].[Merch Level] AS
(
[Merch].[Merch CHQ].CurrentMember.Level.Name
)
This returns which level the member is on, and it works perfectly.
The problem is that it is evaluated for every member, so none of the rows get filtered out by NON EMPTY; instead of 65k rows I have 130k rows, which will hurt my access performance.
Can my query be altered so that the empty rows are still filtered out, short of using IIF to check each measure for null?
You can make your member return Null based on your main measure, like:
MEMBER
[Measures].[Merch Level] AS
IIf(IsEmpty([Measures].[Normal Measure]),null,[Merch].[Merch CHQ].CurrentMember.Level.Name)
That way it will only produce a value when there is data. You can go further and add additional dimensions to the empty check if you need to be more precise.
