How can I do an incremental load based on record ID in Dagster

I am trying to consume an HTTP API in my Dagster code. The API provides a log of "changes", each of which carries an incrementing ID. It supports an optional parameter fromUpdateId, which lets you fetch only updates that have a higher ID than some value.
This should let me do incremental loads by looking at the highest update ID I have seen so far and providing it as the parameter.
How can I accomplish this in Dagster? I am thinking it should be possible to write the highest ID as metadata when I materialize the asset. The metadata would then be available the next time the asset is materialized.

I am thinking it should be possible to write the highest ID as metadata when I materialize the asset. The metadata would then be available the next time the asset is materialized.
That sounds like the right approach to me.
Here's some Python code that implements that approach:
from dagster import asset, Output

@asset
def asset1(context):
    # Look up this asset's most recent materialization to recover the
    # cursor that was stored as metadata on the previous run.
    asset_key = context.asset_key_for_output()
    latest_materialization_event = context.instance.get_latest_materialization_events(
        [asset_key]
    ).get(asset_key)
    if latest_materialization_event:
        materialization = (
            latest_materialization_event.dagster_event.event_specific_data.materialization
        )
        metadata = {entry.label: entry.entry_data for entry in materialization.metadata_entries}
        cursor = metadata["cursor"].value
    else:
        cursor = 0
    # Attach the new high-water mark as metadata so the next run can read it.
    return Output(value=..., metadata={"cursor": cursor + 1})
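To tie this back to the API in the question: the cursor read above is what gets passed as fromUpdateId, and the stored cursor should be the highest update ID fetched rather than a simple increment. A minimal sketch, assuming a hypothetical JSON endpoint (the URL, the requests client, and the "id" field are placeholders):

import requests  # assumed HTTP client; any client works

def fetch_updates(cursor):
    # fromUpdateId is the query parameter described in the question;
    # the endpoint URL is a placeholder for the real API.
    response = requests.get(
        "https://example.com/api/changes",
        params={"fromUpdateId": cursor},
    )
    response.raise_for_status()
    updates = response.json()
    # The next cursor is the highest update ID seen, or the old
    # cursor if there were no new updates.
    next_cursor = max([u["id"] for u in updates], default=cursor)
    return updates, next_cursor

Inside the asset, you would then return Output(value=updates, metadata={"cursor": next_cursor}).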

Related

Django query based on greater date

I want to know how efficiently this filtering can be done with Django queries. Essentially I have the following two classes:
class Act(models.Model):
    Date = models.DateTimeField()
    Doc = models.ForeignKey(Doc)
    ...
class Doc(models.Model):
    ...
so one Doc can have several Acts, and for each Doc I want to get the Act with the greatest Date. I'm only interested in Act objects.
For example, if I have
act1 = (Date=2021-01-01, Doc=doc1)
act2 = (Date=2021-01-02, Doc=doc1)
act3 = (Date=2021-01-03, Doc=doc2)
act4 = (Date=2021-01-04, Doc=doc2)
act5 = (Date=2021-01-05, Doc=doc2)
I want to get [act2, act5] (the Act with Doc=doc1 with the greatest Date, and the Act with Doc=doc2 with the greatest Date).
My only solution is to make a for loop over the Docs.
Thank you so much.
You can do this with one or two queries: the first retrieves the latest Act per Doc, and the second retrieves those Acts:
from django.db.models import OuterRef, Subquery

last_acts = Doc.objects.annotate(
    latest_act=Subquery(
        Act.objects.filter(
            Doc_id=OuterRef('pk')
        ).values('pk').order_by('-Date')[:1]
    )
).values('latest_act')
and then we can retrieve the corresponding Acts:
Act.objects.filter(pk__in=last_acts)
Depending on the database, it might be more efficient to first retrieve the primary keys (materializing last_acts with list(...)) and then make a second query:
Act.objects.filter(pk__in=list(last_acts))
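If the database is PostgreSQL, DISTINCT ON can do the same thing in a single query; a sketch using the models above:

# PostgreSQL only: distinct(*fields) is unsupported on other backends.
# Keep one row per Doc, choosing the Act with the greatest Date in each group.
Act.objects.order_by('Doc', '-Date').distinct('Doc')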

MongoDB numeric index

I was wondering if it's possible to create a numeric count index where the first document would be 1, and as new documents are inserted the count would increase. If possible, are you also able to apply it to documents imported via mongoimport? I have created an index via db.collection.createIndex( {index : 1} ) but it doesn't seem to be applying.
I would strongly recommend using ObjectId as your _id field. This has the benefit of being a good value for distributed systems, and it is also based on the date it was created. It also has a built-in index inside MongoDB.
Example using Morphia:
Date d = ...;
Query<MyClass> query = datastore.createQuery(MyClass.class);
query.field("_id").greaterThanOrEq(new ObjectId(d));
query.order("_id");
query.limit(100);
List<MyClass> myDocs = query.asList();
This would fetch all documents created since date d in order of creation.
To load the next batch, change to:
query.field("_id").greaterThan(lastDoc.getId());
This will very efficiently load the next batch based on the ID of the last document from the previous batch.
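The same pattern works from Python as well; a rough PyMongo sketch, assuming d is a datetime and placeholder database/collection names:

from bson import ObjectId
from pymongo import ASCENDING, MongoClient

coll = MongoClient().mydb.mycoll  # placeholder database/collection
# Seed the cursor from a datetime, then page forward by _id.
start = ObjectId.from_datetime(d)
batch = list(coll.find({"_id": {"$gte": start}}).sort("_id", ASCENDING).limit(100))
# Next batch: strictly greater than the last _id of the previous batch.
last_id = batch[-1]["_id"]
batch = list(coll.find({"_id": {"$gt": last_id}}).sort("_id", ASCENDING).limit(100))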

Sitecore > Update commerce index incrementally

I am trying to rebuild a commerce index in Sitecore incrementally (so I don't want to do a full rebuild). The index strategy is set to "manual" at the moment.
I made some changes in the catalog (updated relations and data, added/removed products and categories) and now just want to update the index correspondingly.
I tried to use the IndexCustodian class (I read about this class in an article). However, I can't find detailed documentation for this class with code examples.
For example, I tried the IncrementalUpdate method. As the second parameter I used an array of IndexableUniqueId (created from a search result using the uniqueid field), but the index hasn't changed.
IEnumerable<IIndexableUniqueId> uniqueIds = foundProducts.Select(x => new IndexableUniqueId<string>(x.UniqueId));
Job job = IndexCustodian.IncrementalUpdate(ContentSearchManager.GetIndex(indexName), uniqueIds);
Another example: I tried the Refresh method. As the second input parameter I used an object of type CommerceIndexableItem created for the root Sitecore item of my catalog. New products were added, but existing products were updated only partially: the category relation was updated, but a localized string field wasn't. Products removed from the catalog are still present in the index.
Database database = Database.GetDatabase("web");
Item rootFolder = database.GetItem(Paths.DefaultCatalogPath);
CommerceIndexableItem indexableFolder = new CommerceIndexableItem(rootFolder);
Job job = IndexCustodian.Refresh(ContentSearchManager.GetIndex(indexName), indexableFolder);
I will appreciate any example of using IndexCustodian, or any other way to update the index.
Thank you in advance for the help.

Google App Engine ndb: prevent using the same id

I am building an order system on Google App Engine.
Suppose I have the following model of Order:
class Order(ndb.Model):
    time = ndb.DateTimeProperty()
The date plus a serial number as a string is used as the id of one entity:
today = str(datetime.datetime.now())[:10].replace('-', '')
# Fetch today's latest order
orders = Order.query(...).order(-Order.time).fetch(1)
if len(orders) == 0:  # no entities: today's first order
    orderNum = today + '00001'
else:
    orderNum = str(int(orders[0].key.id()) + 1)
order = Order(id=orderNum)
order.time = datetime.datetime.now()
order.put()
Suppose there are multiple clerks and they may execute the program at the same time. An obvious problem is that they may obtain the same orderNum and actually write to the same entity.
How do I do to prevent such situation from happening?
You could use something similar to the code below.
@ndb.transactional(retries=3)
def create_order_if_not_exists(order):
    exists = order.key.get()
    if exists is not None:
        raise Exception("Order already exists: ID=%s" % order.key.id())
    order.put()
Or you can use Model's class method get_or_insert() that transactionally retrieves an existing entity or creates a new one. See details here.
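A minimal sketch of the get_or_insert() route, reusing the Order model and orderNum from the question:

import datetime

# get_or_insert runs transactionally: it returns the existing Order with
# this id if one exists, and only otherwise creates and puts a new one,
# so two clerks can never overwrite each other's entity.
order = Order.get_or_insert(orderNum, time=datetime.datetime.now())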

Autoincrement ID in App Engine datastore

I'm using the App Engine datastore, and would like to make sure that row IDs behave similarly to "auto-increment" fields in a MySQL DB.
I tried several generation strategies, but I can't seem to take control over what happens:
the IDs are not consecutive; there seem to be several "streams" growing in parallel
the IDs get "recycled" after old rows are deleted
Is such a thing at all possible?
I really would like to refrain from keeping (indexed) timestamps for each row.
It sounds like you can't rely on IDs being sequential without a fair amount of extra work. However, there is an easy way to achieve what you are trying to do:
We'd like to delete old items (older than two months' worth, for example)
Here is a model that automatically keeps track of both its creation and its modification times. Simply using the auto_now_add and auto_now parameters makes this trivial.
from google.appengine.ext import db

class Document(db.Model):
    user = db.UserProperty(required=True)
    title = db.StringProperty(default="Untitled")
    content = db.TextProperty(default=DEFAULT_DOC)  # DEFAULT_DOC defined elsewhere
    created = db.DateTimeProperty(auto_now_add=True)
    modified = db.DateTimeProperty(auto_now=True)
Then you can use cron jobs or the task queue to schedule your maintenance task of deleting old stuff. Finding the oldest stuff is as easy as sorting by created or modified date:
db.Query(Document).order("modified")
# or
db.Query(Document).order("created")
What I know is that auto-generated IDs as long integers are available in Google App Engine, but there's no guarantee that the values are increasing, and there's also no guarantee that the numbers are real one-increments.
So, if you need timestamping and increments, add a DateTime field with milliseconds, but then you don't know that the numbers are unique.
So, the best thing to do (what we are using) is (sorry, but this is indeed IMHO the best option):
use an auto-generated ID as a Long (we use Objectify in Java)
use a timestamp on each entity and a descending index to query the top X entities
I think this is probably a fairly good solution, however be aware that I have not tested it in any way, shape or form. The syntax may even be incorrect!
The principle is to use memcache to generate a monotonic sequence, using the datastore to provide a fall-back if memcache fails.
from google.appengine.api import memcache
from google.appengine.ext import db

class IndexEndPoint(db.Model):
    index = db.IntegerProperty(indexed=False, default=0)

def find_next_index(cls):
    """Finds the next free index for an entity type."""
    index_name = 'seqindex-%s' % cls.kind()

    def _from_ds():
        """A very naive way to find the next free key.
        We just take the last known end point and loop until it's free.
        """
        # Datastore numeric ids must be positive, so start at 1 at minimum.
        tmp_index = max(IndexEndPoint.get_or_insert(index_name).index, 1)
        index = None
        while index is None:
            key = db.Key.from_path(cls.kind(), tmp_index)
            if db.get(key) is None:
                index = tmp_index
            tmp_index += 1
        return index

    index = None
    while index is None:
        index = memcache.incr(index_name)
        if index is None:  # our counter might have been evicted
            index = _from_ds()
            if not memcache.add(index_name, index):  # False: someone beat us to it
                index = None
    # TODO: use a named task to update IndexEndPoint so that if the memcache
    # counter gets evicted we don't have too many items to cycle over to find
    # the end point again.
    return index

def make_new(cls):
    """Makes a new entity with an incrementing ID."""

    def txn(index):
        """Makes a new entity if the index is free.
        This should only fail if we had a memcache miss
        (does not have to be on this instance).
        """
        key = db.Key.from_path(cls.kind(), index)
        if db.get(key) is not None:
            return None
        entity = cls(key=key)
        entity.put()
        return entity

    result = None
    while result is None:
        index = find_next_index(cls)
        result = db.run_in_transaction(txn, index)
    return result
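Usage would then look something like this (untested, like the code above), with a hypothetical Order model standing in for any db.Model subclass:

class Order(db.Model):
    created = db.DateTimeProperty(auto_now_add=True)

# Allocates the next sequential numeric id and creates the entity at it;
# retries automatically on contention or a memcache miss.
order = make_new(Order)  # order.key().id() is 1, 2, 3, ... in allocation order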
