How do I perform a clustering algorithm (like DBSCAN) in Snowflake? - snowflake-cloud-data-platform

I need to cluster some data in Snowflake using DBSCAN. I created a UDF but the results won't match with a local run, so I ran an UDF that just creates a list with the row number that is being processed and it results in a list that has repeated values and its max value is much smaller than the number of rows in my table. (The expected result was unique values up to the number of rows)
Can this be a parallelization issue?
If so, is there a way to cluster data using DBSCAN in Snowflake?
Thanks!
EDIT -> code example:
#funcs.pandas_udf(name='DBSCAN_TEST', is_permanent=True,
stage_location='#UDF', replace=True,
packages=['scikit-learn==1.0.2', 'pandas', 'numpy'])
def DBSCAN_TEST(data_x: types.PandasDataFrame[float, float]) -> types.PandasSeries[float]:
data_x.columns = [features]
DBSCAN_cluster = DBSCAN(eps=2.5, min_samples=4)
DBSCAN_cluster.fit(data_x)
return resul
As input data I used this
to test DBSCAN.
If I run that locally (using the exact same code inside the UDF) I end up with 24 clusters and that is the expected result. But if I use the UDF it scales up to 71.
I've tried changing the input types to string, as a coworker suggested but it didn't work.

The only clustering option in Snowflake is a clustering key, and general rule of thumb is that this isn't something typically needed until you eclipse a TB of data in a table and/or the auto clustering proves to be deficient in performance.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

Related

A field with big array on mongodb

I am a beginner at Mongo and I made a data base with the following topology.
Some fields of metadata and one field that contain the experiment results.
experiment results- vector of integers with ~150,000 values
status = db.DataTest.insert_one(
{
"person_num" : num,
"life_cycle" : cycle,
"other_metadata" : meta_data,
"results_of_experiment": big_array
}
)
I inserted something like 7500 of those documents
Its occupied 8GB of memory and work really slowly for find operations.
I don't need those experiment results to search by them only the option to retrieve them from the DB as chunk of data.
Is there another solution to store on the DB the experiment results?
Is using "gridfs" is relevant to this case and not too complicated?
Based on your comments, the most common query is
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5)
Without an index on the life_cycle field, MongoDB is forced to do a collection scan. That is, fetch & evaluate all documents in your collection one by one. In a large collection, this will take a long time.
MongoDB does not create indexes automatically. You would have to observe your most common queries, and create indexes to support those queries. As far as I know, there is no automatic index creation in any database software; SQL, NoSQL, or otherwise.
Database indexing is a deep subject and cannot be explained in a short answer.
Having said that, if you create an index on the life_cycle field, it should improve your query times but only for the query you posted above. Other query types would likely require different indexes. You can do so in the mongo shell:
db.DataTest.createIndex({life_cycle: 1})
I encourage you to read these pages to understand more about indexing in MongoDB:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Neo4J / Cypher Query very slow with order by property

I have a graph database with 5M of nodes and 10M of relationships.
I'm on a Macbook Pro with 4GB RAM. I have already try to adjust java heap size and neo4j memory without success.
My problem is that i have a simply cypher query like that :
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
LIMIT 30;
This query takes 100ms , which is impressive. But when i add an "ORDER BY" this query takes a long time => 8s :/
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
ORDER BY r.date DESC
LIMIT 30;
Does Someone has an idea ?
You might want to consider relationship indexes to speed up your query. The date property could be indexed this way. You're using the ORDER BY keyword which will almost always make your query slower as it needs to iterate the entire result set to perform the ordering.
Also consider using a single MATCH statement if that suits your needs:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)-[r:POSTED]->(n)

Inconsistent values for getNumberFound() in Search API

I have a full-text search index with 42 documents like in the screenshot below:
When I query the index for "" it returns all the 42 documents correctly (good), but when I use the limit and offset options in the query, the value returned for the total number of matches found (results.getNumberFound()) varies from time to time. It gives me different values for different offsets!! In short, making the same query just with different offset values gives a different value for results.getNumberFound() function!
NOTE: This happens only in production server after I deploy the app. In local server everything
works perfectly (i.e for the same query, the number of total hits found is the same regardless of the offset option value).
Query query = Query.newBuilder()
.setOptions(QueryOptions.newBuilder()
.setLimit(limit)
.setOffset(offset).build())
.build(searchPhrase);
Results<ScoredDocument> results = INDEX.search(query);
LOG.warning( "Phrase:'" + searchPhrase +
"' limit:" + limit +
" offset:" + offset +
" num:" + results.getNumberFound());
Here's a screenshot of the log output:
So is there something wrong I'm doing or it's a bug in the Search API because the weird thing is that the issue only happens in the production server not the local one.
The python docs say
number_found
Returns an approximate number of documents matching the query. QueryOptions defining post-processing of the search results. If the QueryOptions.number_found_accuracy parameter were set to 100, then number_found <= 100 is accurate.
Similiar api components in exist in Java. From your code it appears you haven't set an accuracy. See java QueryOptions https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions
Having said that I have seen many questions/discussions about lack of accuracy on the number of found results.
Surprisingly, this is working as intended (as Tim says).
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions.Builder#setNumberFoundAccuracy(int)
In its default state, the datastore scans the minimal set of data to fulfill the request. The database provides a very rough estimate of match results by multiplying ID range with estimate of matching keys (#keys found that matched / #ids scanned during the query).
For small data sets, set the accuracy value higher (500 or 1000) and call it a day. You can also improve the estimate by making sure key IDs are uniformly distributed and by fetching a higher limit each call (though if you don't need the data, just use the accuracy parameter).
This might not be applicable here but this is a general workaround for larger data sets:
Use num_accuracy == 1000. When queries return an estimate of <1000, you can trust that. When a query returns an estimate of >1000, perform your own estimate using a second query:
Include an extra numeric field with your data, which is a value of a discrete probabilistic event (e.g. #0s in a hash of some randomish data). When you get a large estimate from the first query, repeat your query with the additional constraint (e.g. AND ZERO_COUNT == y), where y is chosen based on the first query's estimate to match <1000 entities, producing an exact count for the second query which you can accurately extrapolate. Since you don't need the results of this data, you can set limit to 1 & num_accuracy == 1000.

Django Query Optimisation

I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Is it possible to query a value (or a value set) with cassandra even if we don't know the key (or key range) in advance?

I am going to express the idea in SQL:
SELECT key,value
FROM table1
WHERE value > 10
Or do we always need to know the key?
I suppose you can use secondary indexes which are available since version 0.7 of casssandra.
You might also checkout the following answer: Cassandra and Secondary-Indexes, how do they work internally?
it is recommended to use secondary indexes only for low-cardinality columns, which means for columns which do not have many different values (e.g. columns like 'status' or 'priority' which have usually only a handful different values like 'high', 'medium', 'low').
In case you are using Hector as your cassandra client you can find information here how to use them:
https://github.com/rantav/hector/wiki/User-Guide
Yes, of course, for example, you can use *
select * from CF where value = 10
If you use the Hector API (e.g. CqlQuery), you can get a list of rows back from this query.
Note, currently for secondary indexes, you must have at least one equality conditional, so your query with just value > 10 would not work. See this question

Resources