MongoDB numeric index

I was wondering if it's possible to create a numeric count index where the first document would be 1 and, as new documents are inserted, the count would increase. If possible, can it also be applied to documents imported via mongoimport? I have created an index via db.collection.createIndex( {index : 1} ) but it doesn't seem to be applying.

I would strongly recommend using ObjectId as your _id field. It is a good value for distributed systems, and it also encodes the date it was created. The _id field also has a built-in index in MongoDB.
Example using Morphia:
Date d = ...;
Query<MyClass> query = datastore.createQuery(MyClass.class);
query.field("_id").greaterThanOrEq(new ObjectId(d)); // ObjectId embeds a creation timestamp
query.order("_id"); // ascending _id = ascending creation time (Morphia 1.x API)
query.limit(100);
List<MyClass> myDocs = query.asList();
This would fetch all documents created since date d in order of creation.
To load the next batch, change to:
query.field("_id").greaterThan(lastDoc.getId());
This will very efficiently load the next batch based on the ID of the last document from the previous batch.
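If you are working in Python rather than Morphia, a rough PyMongo sketch of the same pattern looks like this (the database and collection names are placeholders, not from the question):

from datetime import datetime, timezone
from bson import ObjectId
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # assumed database/collection names

d = datetime(2024, 1, 1, tzinfo=timezone.utc)  # the "since" date
first_batch = list(
    coll.find({"_id": {"$gte": ObjectId.from_datetime(d)}})
    .sort("_id", 1)
    .limit(100)
)

# Next batch: strictly greater than the last _id from the previous batch.
if first_batch:
    last_id = first_batch[-1]["_id"]
    next_batch = list(coll.find({"_id": {"$gt": last_id}}).sort("_id", 1).limit(100))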

Related

How can I do an incremental load based on record ID in Dagster

I am trying to consume an HTTP API in my Dagster code. The API provides a log of "changes", each of which contains an incrementing ID. It supports an optional parameter fromUpdateId, which lets you only fetch updates that have a higher ID than some value.
This should let me do incremental loads, by looking at the highest update ID I have seen so far, and providing this as the parameter.
How can I accomplish this in Dagster? I am thinking it should be possible to write the highest ID as metadata when I materialize the asset. The metadata would then be available the next time the asset is materialized.
That sounds like the right approach to me: write the highest ID as metadata when the asset is materialized, and read it back the next time the asset is materialized.
Here's some Python code that implements that approach:
from dagster import asset, Output

@asset
def asset1(context):
    asset_key = context.asset_key_for_output()
    latest_materialization_event = context.instance.get_latest_materialization_events(
        [asset_key]
    ).get(asset_key)

    if latest_materialization_event:
        materialization = (
            latest_materialization_event.dagster_event.event_specific_data.materialization
        )
        metadata = {
            entry.label: entry.entry_data for entry in materialization.metadata_entries
        }
        cursor = metadata["cursor"].value
    else:
        cursor = 0

    return Output(value=..., metadata={"cursor": cursor + 1})
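To close the loop, here is a hypothetical sketch of how the stored cursor could drive the HTTP call; fetch_updates, the /changes path, and the response shape are assumptions based on the question, not a documented API:

import requests

def fetch_updates(base_url, from_update_id):
    # Hypothetical client for the change-log API described in the question.
    resp = requests.get(
        base_url + "/changes",
        params={"fromUpdateId": from_update_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Inside the asset body: the cursor recovered from the last materialization
# becomes fromUpdateId, and the new cursor is the highest ID seen this run.
updates = fetch_updates("https://api.example.com", from_update_id=cursor)
new_cursor = max((u["id"] for u in updates), default=cursor)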

Django query based on greater date

I want to know how efficiently this filter can be done with Django queries. Essentially I have the following two classes:
class Act(models.Model):
    Date = models.DateTimeField()
    Doc = models.ForeignKey('Doc', on_delete=models.CASCADE)  # string reference, since Doc is defined below
    ...

class Doc(models.Model):
    ...
So one Doc can have several Acts, and for each Doc I want to get the Act with the greatest Date. I'm only interested in Act objects.
For example, if I have
act1 = (Date=2021-01-01, Doc=doc1)
act2 = (Date=2021-01-02, Doc=doc1)
act3 = (Date=2021-01-03, Doc=doc2)
act4 = (Date=2021-01-04, Doc=doc2)
act5 = (Date=2021-01-05, Doc=doc2)
I want to get [act2, act5] (the Act with Doc=doc1 with the greatest Date, and the Act with Doc=doc2 with the greatest Date).
My only solution is a for loop over the Docs.
Thank you so much
You can do this with one or two queries: the first retrieves the latest Act per Doc, and the second then retrieves those Acts:
from django.db.models import OuterRef, Subquery

last_acts = Doc.objects.annotate(
    latest_act=Subquery(
        Act.objects.filter(
            Doc_id=OuterRef('pk')
        ).values('pk').order_by('-Date')[:1]
    )
).values('latest_act')
and then we can retrieve the corresponding Acts:
Act.objects.filter(pk__in=last_acts)
Depending on the database, it might be more efficient to first retrieve the primary keys and then make an extra query:
Act.objects.filter(pk__in=list(last_acts))
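As a side note beyond the one-or-two-query approach above: if the database is PostgreSQL, the same result can be obtained in a single query with DISTINCT ON:

# PostgreSQL only: keep the newest Act per Doc via DISTINCT ON (Doc_id).
latest_per_doc = Act.objects.order_by('Doc_id', '-Date').distinct('Doc_id')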

How to get the maximum number in DynamoDB column?

I have a DynamoDB table called URLArray that contains a list of URLs (myURL) and a unique video number (myNum).
I use AWS Amplify to query my table, for example:
URLData = await API.graphql(graphqlOperation(getUrlArray, { id: "173883db-9ff1-4...."}));
myNum is also a GSI, so I can query a row using it too, for example:
URLData = await API.graphql(graphqlOperation(getURLinfofromMyNum, { myNum: 5 }));
My question is: I would like to simply query this table to find the maximum value of myNum. With my current data it'd return myNum = 12. How do I query my table to get this?
DynamoDB does not have an equivalent of the SQL expression SELECT MAX(myNum), so you cannot do what you are asking with your table as-is.
A few suggestions:
Record the highest value of myNum as you insert items into the table. For example, you could create an item with PK = "METADATA" and an attribute named maxMyNum. The maxMyNum attribute could be updated conditionally if you have a value that is higher than what is stored in DDB.
You could build a secondary index with myNum as the sort key in a single partition. This would allow you to execute a query operation with ScanIndexForward set to false (descending order) and pick the first returned entry (the max value); see the sketch after this list.
If you are generating an auto incrementing value in your application code, consider checking out the documentation regarding atomic counters.
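For illustration, a minimal boto3 sketch of the first two suggestions; the GSI name byMyNum, the constant partition key attribute type = "URL", and the key attribute id are assumptions, not part of the question:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("URLArray")

# Suggestion 2: query the assumed GSI in descending order and take one item.
resp = table.query(
    IndexName="byMyNum",                           # assumed GSI name
    KeyConditionExpression=Key("type").eq("URL"),  # assumed constant partition key
    ScanIndexForward=False,                        # descending by the sort key (myNum)
    Limit=1,
)
items = resp.get("Items", [])
max_my_num = items[0]["myNum"] if items else None

# Suggestion 1: conditionally record the max as items are inserted.
# Raises ConditionalCheckFailedException when the stored value is already higher.
table.update_item(
    Key={"id": "METADATA"},                        # assumed key attribute name
    UpdateExpression="SET maxMyNum = :n",
    ConditionExpression="attribute_not_exists(maxMyNum) OR maxMyNum < :n",
    ExpressionAttributeValues={":n": 13},          # the myNum just inserted
)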

Update commerce index incrementally in Sitecore

I am trying to rebuild the commerce index in Sitecore incrementally (so, I don't want to do a full rebuild). The index strategy is currently set to "manual".
I made some changes in the catalog (updated relations and data, added/removed products and categories) and now just want to update the index accordingly.
I tried to use the IndexCustodian class (I read about it in an article). However, I can't find detailed documentation for this class with code examples.
For example, I tried the IncrementalUpdate method. As the second parameter I used an array of IndexableUniqueId (created from a search result using the uniqueid field), but the index didn't change.
IEnumerable<IIndexableUniqueId> uniqueIds = foundProducts.Select(x => new IndexableUniqueId<string>(x.UniqueId));
Job job = IndexCustodian.IncrementalUpdate(ContentSearchManager.GetIndex(indexName), uniqueIds);
In another attempt, I tried the Refresh method. As the second parameter I passed a CommerceIndexableItem created for the root Sitecore item of my catalog. New products were added, but existing products were only partially updated: the category relation was updated, but a localized string field wasn't. Products removed from the catalog are still present in the index.
Database database = Database.GetDatabase("web");
Item rootFolder = database.GetItem(Paths.DefaultCatalogPath);
CommerceIndexableItem indexableFolder = new CommerceIndexableItem(rootFolder);
Job job = IndexCustodian.Refresh(ContentSearchManager.GetIndex(indexName), indexableFolder);
I would appreciate any example of using IndexCustodian, or any other way to update the index incrementally.
Thank you in advance for the help.

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie in query optimisation. Showing the results in the browser takes a full minute, while only 45,000 records are accessed. Could you please suggest ways to reduce the time?
I wrote the following query to find the average call duration for an age group:
from django.db.models import Sum

sigma = 0
popn = len(Demo.objects.filter(age_group=age))
card_list = [Demo.objects.filter(age_group=age)[i].card_no
             for i in range(popn)]
for card in card_list:
    dic = Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma += dic['duration__sum']
avgDur = sigma / popn
The above code is inside a for loop that iterates over the age groups.
Model is as follows:
class Demo(models.Model):
    card_no = models.CharField(max_length=20, primary_key=True)
    gender = models.IntegerField()
    age = models.IntegerField()
    age_group = models.IntegerField()

class Fact_table(models.Model):
    pri_key = models.BigIntegerField(primary_key=True)
    card_no = models.CharField(max_length=20)
    duration = models.IntegerField()
    time_8bit = models.CharField(max_length=8)
    time_of_day = models.IntegerField()
    isBusinessHr = models.IntegerField()
    Day_of_week = models.IntegerField()
    Day = models.IntegerField()
Thanks
Try this:
from django.db.models import Sum

demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # One
card_list = demo_by_age.values_list('card_no', flat=True)  # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # Three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as with a primary key. This still lets other access paths, e.g. foreign keys (as I'll explain in my next point), use the (slightly) faster integer primary key that Django adds, and it eases the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, but also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, on_delete=models.CASCADE, related_name='facts')  # on_delete is required in Django 2.0+
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
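For instance, a small sketch of the reverse relation in use (the card number is made up):

from django.db.models import Sum

demo = Demo.objects.get(card_no='abc123')  # hypothetical card number
total_duration = demo.facts.aggregate(Sum('duration'))['duration__sum']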
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
from django.db.models import Avg

age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print("Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg']))
.values('age_group') selects just the age_group field from the Demo database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), fetches all Fact_table objects related to any Demo object within that age_group, and calculates the average of their duration fields - all in a single query.
