Data pipeline for running data records through modules which are acyclically dependent on each other - database

My question revolves around the best practice for processing data through multiple modules, where each module creates new data from the original data or from data generated by other modules. Let me create an illustrative example, where I'll be considering a MongoDB document and modules which create update parameters for the document. Consider a base document, {"a":2}. Now, consider these modules, which I've written as Python functions:
def mod1(data):
    a = data["a"]
    b = 2 * a
    return {"$set": {"b": b}}

def mod2(data):
    b = data["b"]
    c = "Good" if b > 1 else "Bad"
    return {"$set": {"c": c}}

def mod3(data):
    a = data["a"]
    d = a - 3
    return {"$set": {"d": d}}

def mod4(data):
    c = data["c"]
    d = data["d"]
    e = d if c == "Good" else 0
    return {"$set": {"e": e}}
When applied in the correct order, the updated document will be {"a":2,"b":4,"c":"Good","d":-1,"e":-1}. Note that mod1 and mod3 can run simultaneously, while mod4 must wait for mod2 and mod3 to run on a document. My question is more general, though: what's the best way to do something like this? My current method is very similar to this, but with each module getting its own Docker container due to the less trivial nature of their processing. The problem with this is that every module must continuously query the entire collection in which these documents reside to see whether documents have become valid for it to process.

You really want an event-driven architecture for pipelines like this. As records are updated, push events onto a topic queue (or ESB) and have the other pipeline stages listen to those queues. As events are published, those stages can pick up records and process them.
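As a minimal sketch of that idea, using MongoDB change streams via pymongo in place of a separate broker (the URI, database, and collection names are assumptions, and change streams require a replica set), a worker for mod4 might look like:
import pymongo

def mod4(data):
    c = data["c"]
    d = data["d"]
    e = d if c == "Good" else 0
    return {"$set": {"e": e}}

client = pymongo.MongoClient("mongodb://localhost:27017")  # assumed URI
coll = client["pipeline"]["docs"]

# updateLookup delivers the full post-update document with each event
with coll.watch(full_document="updateLookup") as stream:
    for event in stream:
        doc = event.get("fullDocument") or {}
        # Run mod4 only once its inputs (c from mod2, d from mod3) exist
        if {"c", "d"} <= doc.keys() and "e" not in doc:
            coll.update_one({"_id": doc["_id"]}, mod4(doc))
Each module gets its own such worker, so a container only acts when an event shows its inputs are ready, instead of polling the collection.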

Related

Is it okay to use multiple concurrent read/write operations on the same database file but with different stores in Sembast?

My app has profile and work databases (and others) stored locally using Sembast DB.
Please look at the two examples below; which one is better practice for asynchronous processes?
Example 1:
final profileDBPath = p.join(appDocumentDir.path, dbDirectory, 'profile.db');
final profileDB = await databaseFactoryIo.openDatabase(profileDBPath);
final workDBPath = p.join(appDocumentDir.path, dbDirectory, 'work.db');
final workDB = await databaseFactoryIo.openDatabase(workDBPath);
final workStore = stringMapStoreFactory.store('work');
final profileStore = stringMapStoreFactory.store('profile');
Example 2:
final dbPath = p.join(appDocumentDir.path, dbDirectory, 'database.db');
final database = await databaseFactoryIo.openDatabase(dbPath);
final workStore = stringMapStoreFactory.store('work');
final profileStore = stringMapStoreFactory.store('profile');
Notice that Example 1 opens two different database files, one for profile and one for work, while Example 2 uses the same database file for both.
The question is: which one is better in terms of stability?
For coding simplicity I like Example 2 better, but my worry is that in an async operation Example 2 will crash when two writes hit the same file at the same time. Any ideas?
Thank you
Example 2 will crash when they write on the same file at the same time
I don't know whether that is something you have experienced or just an assumption. Sembast supports multiple concurrent readers and a single writer (single process and single isolate) and uses a kind of mutex to ensure data consistency. Concurrent writes are serialized and should not trigger any crash. And if they do, that's a bug that you should file!
Personally, I would go for a single database: it allows cross-store transactions for data consistency, which two databases cannot provide.

Quickly update django model objects from pandas dataframe

I have a Django model that records transactions. I need to update only some of the fields (two) of some of the transactions.
In order to update, the user is asked to provide additional data and I use pandas to make calculations using this extra data.
I use the output from the pandas script to update the original model like this:
for i in df.tnsx_uuid:
    t = Transactions.objects.get(tnsx_uuid=i)
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
    t.save()
this is very slow. What is the best way to do this?
UPDATE:
after some more research, I found bulk_update and changed the code to:
transactions = Transactions.objects.select_for_update()\
    .filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
    i = t.tnsx_uuid
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])
this has approximately halved the time required.
How can I improve performance further?
I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:
transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))
for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])
I don't know how DataFrame.loc[] is implemented, but it could be slow if it needs to search the whole DataFrame for each use rather than just doing a hash lookup. For that reason, and to simplify things with a single iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stack Overflow answer on iterating over DataFrames to loop over the zipped columns of interest.
I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the selected rows and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
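For reference, a minimal sketch (my own, with placeholder values) of how select_for_update is normally used: it only acquires row locks inside a transaction, so the read-modify-write cycle would be wrapped in atomic():
from django.db import transaction

with transaction.atomic():
    qs = Transactions.objects.select_for_update().filter(
        tnsx_uuid__in=list(df.tnsx_uuid))
    objs = list(qs)              # evaluate once; rows stay locked
    for t in objs:
        t.start_bal = 0.0        # placeholder updates
        t.end_bal = 0.0
    Transactions.objects.bulk_update(objs, ['start_bal', 'end_bal'])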
In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:
3.05 s to query and build the DataFrame
2.79 ms to join the instances to df
5.79 ms to run the for loop and update the instances
1.21 s to bulk_update the changes
So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
I faced roughly the same issue (a similar number of records, ~3500), and I would like to add: bulk_update seems to perform much worse than bulk_create. In my case deleting objects was allowed, so instead of bulk-updating, I delete all objects and then recreate them.
I used the same approach as you (thanks for the idea), but with some modifications:
a) I create the dataframe from the query itself:
all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values)
b) Then I create the column of objects without iterating (I make sure these are ordered):
self.df_values['object'] = list(all_objects)
c) For updating object values (after operations made in my dataframe), I iterate over the rows (not sure about the performance difference):
for index, row in self.df_values.iterrows():
    row['object'].amount = row['amount']
d) At the end, I re-create all objects:
MyModel.objects.bulk_create(self.df_values['object'].tolist())
Conclusion:
In my case, the most time-consuming step was the bulk_update, so re-creating the objects solved it for me (from 19 seconds with bulk_update to 10 seconds with delete + bulk_create).
In your case, using my approach may improve the time for all other operations.
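A minimal sketch of the full delete + re-create flow described above (my own consolidation; it assumes re-inserting rows with their original primary keys is acceptable, and wraps both steps in a transaction so they succeed or fail together):
from django.db import transaction

objs = self.df_values['object'].tolist()
with transaction.atomic():
    MyModel.objects.filter(pk__in=[o.pk for o in objs]).delete()
    # Objects keep their pks, so foreign keys pointing at them stay valid
    MyModel.objects.bulk_create(objs, batch_size=500)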

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have an application running in the Google App Engine Standard Environment which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, each of my WeatherData entities stores the relevant fields that I want to graph, in addition to a very large JSON string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast JSON string for graphing, only a few other fields stored in the entity. The other approach I tried was to fetch only a couple of entities at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\n' + ','.join(d)

# Children entity - log of weather at a parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...
    # Function for querying data below a given ancestor between two optional times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM = 500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be that Python's garbage collector is not freeing the already-used data because it is still referenced, but even when I include gc.collect() I see no improvement.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely, querying all properties of the entity except the JSON string. While this is not ideal, as it still pulls gratuitous information from the database each time, generating an individual projection for each specific query is not scalable due to the exponential growth of necessary indices. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity (a minimal sketch follows this list of suggestions). Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, but I imagine it wouldn't be trivial).
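As a minimal sketch of the projection-query suggestion (property names other than time, such as temperature, are assumptions here; projected properties must be indexed, which is why the big unindexed forecast JSON cannot be projected and so stays out of memory):
from google.appengine.ext import ndb

qry = WeatherData.query(
    WeatherData.time >= start_date,
    WeatherData.time <= end_date,
    ancestor=ndb.Key('Location', loc)).order(-WeatherData.time)

# Fetch only the fields needed for graphing; the forecast JSON is skipped.
rows = qry.fetch(projection=[WeatherData.time, WeatherData.temperature])
for row in rows:
    print row.time, row.temperature  # Python 2 runtime on GAE Standard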

Firebase + AngularFire -> States?

I'd like to know how I would deal with object states in a Firebase environment.
What do I mean by states? Well, let's say you have an app with which you organize order lists. Each list consists of a bunch of orders, so it can be considered a hierarchical data structure. Furthermore each list has a state which might be one of the following:
deferred
open
closed
sent
acknowledged
ware completely received
ware partially received
something else
On the visual (HTML) side the lists shall be distinguished by their state. Each state shall be presented to the client in its own, say, div-element, listing all the related orders beneath.
So the question is, how do I deal with this state in Firebase (or any other document-based database)?
structure
Do I...
... (option 1) use a state field for each orderlist and filter on the client side using if or something similar:
orderlist1.state = open
    order1
    order2
orderlist2.state = open
    order1
orderlist3.state = closed
orderlist4.state = deferred
... (option 2) use the hierarchy of Firebase to classify the orderlists like so:
open
    orderlist1
        order1
        order2
    orderlist2
        order1
closed
    orderlist3
deferred
    orderlist4
... (option 3) take a totally different approach?
So, what's the royal road here?
retrieval, processing & visual output of option 2
Since for option 1 the answer to this question is apparently pretty straightforward (if state == ...), I'll continue with option 2: how do I retrieve the data? Do I use a Firebase object for each state, like so:
var closedRef = new Firebase("https://xxx.firebaseio.com/closed");
var openRef = new Firebase("https://xxx.firebaseio.com/open");
var deferredRef = new Firebase("https://xxx.firebaseio.com/deferred");
var somethingRef = new Firebase("https://xxx.firebaseio.com/something");
Or what's considered the best approach to deal with that sort of data/structure?
There is no universal answer to this question. The "best approach" is going to depend on the particulars of your use case, which you haven't provided here. Specifically, how you will be reading and manipulating the data.
Data architecture in NoSQL is all about working hard on writes to make reads easy. It's all about how you plan to use the data. (It's also enough material for a chapter in a book.)
The advantage to "option 1" is that you can easily iterate all the entire list. Great if your list is measured in hundreds. This is a great approach if you want to fetch the list and manipulate it on the fly on the client side.
The advantage to "option 2" is that you can easily grab a subset of the list. Great if your list is measured in thousands and you will typically be fetching open issues only rather than closed ones. This is great for archiving/new/old lists like yours.
There are other options as well.
Sorted Data using Priorities
Perhaps the most universal approach is to use ordered data. This allows you to query a subset of your records using something like:
new Firebase(URL).startAt('open').endAt('open').limit(10);
This is sufficient in most cases where you have only one criteria, or when you can create a unique identifier from multiple criteria (e.g. 'open:marketing') without difficulty. Examples are scoreboards, state lists like yours, data ordered by timestamps.
Using an index
You can also create custom subsets of your data by creating an index of keys and using that to fetch the others.
This is most useful when there is no identifiable characteristic of your subsets. For example, if I pick them from a list and store my favorites.
I think this plnkr of mine can help you with this. Click on edit/add and check the country (order, in your case) / state (state, in your case) dependent dropdown; it may be just what you want. The one thing you may need to add is filtering.
The two are different tables in the db.
You can also get it from git.

How to generate large files (PDF and CSV) using AppEngine and Datastore?

When I first started developing this project, there was no requirement for generating large files, however it is now a deliverable.
Long story short, GAE just doesn't play nice with any large-scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a PDF with ReportLab from 1500 records seems to hit a DeadlineExceededError. This is just a simple PDF consisting of a table.
I am using the following code:
self.response.headers['Content-Type'] = 'application/pdf'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))
elements = []
dataset = Voter.all().order('addr_str')
data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]
i = 0
r = 1
s = 100
while (i < 1500):
    voters = dataset.fetch(s, offset=i)
    for voter in voters:
        data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
        r = r + 1
    i = i + s
t = Table(data, '', r*[0.4*inch], repeatRows=1)
t.setStyle(TableStyle([('ALIGN', (0,0), (-1,-1), 'CENTER'),
                       ('INNERGRID', (0,0), (-1,-1), 0.15, colors.black),
                       ('BOX', (0,0), (-1,-1), .15, colors.black),
                       ('FONTSIZE', (0,0), (-1,-1), 8)]))
elements.append(t)
doc.build(elements)
Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system and generate the file in bits, and then rejoin them that might work, but I think the system precludes this.
I need to do the same thing for a CSV file, however the limit is obviously a bit higher since it's just raw output.
self.response.headers['Content-Type'] = 'application/csv'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'
dataset = Voter.all().order('addr_str')
writer = csv.writer(self.response.out, dialect='excel')
writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])
i = 0
s = 100
while (i < 2000):
    last_cursor = memcache.get('db_cursor')
    if last_cursor:
        dataset.with_cursor(last_cursor)
    voters = dataset.fetch(s)
    for voter in voters:
        writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
    memcache.set('db_cursor', dataset.cursor())
    i = i + s
memcache.delete('db_cursor')
Any suggestions would be very much appreciated.
Edit:
Below I have documented three possible solutions based on my research, plus suggestions, etc.
They aren't necessarily mutually exclusive, and could be a slight variation on or combination of any of the three; however, the gist of the solutions is there. Let me know which one you think makes the most sense and might perform the best.
Solution A: Using mapreduce (or tasks), serialize each record, and create a memcache entry for each individual record keyed with the keyname. Then process these items individually into the pdf/xls file. (Use get_multi and set_multi; see the sketch after this list.)
Solution B: Using tasks, serialize groups of records, and load them into the db as a blob. Then trigger a task once all records are processed that will load each blob, deserialize them and then load the data into the final file.
Solution C: Using mapreduce, retrieve the keynames and store them as a list, or serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better: storing them as a list (and what would the limitations be...I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)?
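For Solution A, a minimal sketch of the memcache step (my own illustration; the serialize/deserialize helpers and the key scheme are hypothetical):
from google.appengine.api import memcache

# During the mapreduce/task phase: cache each serialized record by keyname
memcache.set_multi(
    dict((v.key().name(), serialize(v)) for v in voters),
    key_prefix='voter:')

# During the build phase: pull the cached records back in bulk
cached = memcache.get_multi(keynames, key_prefix='voter:')
rows = [deserialize(cached[k]) for k in keynames if k in cached]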
Thanks in advance for any advice.
Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.
Start a task which does the initial query and fetches 300 (arbitrary number) records, then enqueues a named(!important) task that you pass the cursor to. That one in turn queries [your arbitrary number] records, and then passes the cursor to a new named task as well. Continue that until you have enough records.
Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
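A minimal sketch of that chaining (my own, against the old GAE Python taskqueue and db APIs; the handler URL and the save_serialized_chunk helper are hypothetical):
from google.appengine.api import taskqueue

CHUNK_SIZE = 300  # the arbitrary per-task fetch size

def process_chunk(cursor_str, chunk_no):
    qry = Voter.all().order('addr_str')
    if cursor_str:
        qry.with_cursor(cursor_str)
    voters = qry.fetch(CHUNK_SIZE)
    if voters:
        save_serialized_chunk(chunk_no, voters)  # store on a 'processing' model
        taskqueue.add(                           # named task prevents duplicates
            url='/tasks/process_chunk',
            name='process-chunk-%d' % (chunk_no + 1),
            params={'cursor': qry.cursor(), 'chunk': chunk_no + 1})
    else:
        taskqueue.add(url='/tasks/build_output') # all chunks stored; build file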
To serialize your table pretty fast you could use:
serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)
Have the last task (when you get enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the last task name. Again, you'll want to be mindful of the size of your fetches from the datastore!
To deserialize:
list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))
Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.
Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
Some time ago I faced the same problem with GAE. After many attempts I just moved to another web host, since I could. Nevertheless, before moving I had two ideas for how to resolve it. I haven't implemented them, but you may try to.
The first idea is to use a SOA/RESTful service on another server, if that is possible. You can even create another application on GAE in Java, do all the work there (I guess with Java's PDFBox it will take much less time to generate the PDF), and return the result to Python. But this option requires you to know Java and also forces you to divide your app into several parts with terrible modularity.
So, there's another approach: you can create a "ping-pong" game with the user's browser. The idea is that if you cannot do everything in a single request, force the browser to send you several. During the first request, do only the part of the work that fits the 30-second limit, then save the state and generate a 'ticket', a unique identifier for the 'job'. Finally, send the user a response which is a simple page with a redirect back to your app, parametrized by the job ticket. When you get it, just restore the state and proceed with the next part of the job.
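A minimal sketch of that ping-pong loop (my own, in webapp style; the Job model and the do_some_work helper are hypothetical, and a real app might serve a meta-refresh page instead of an HTTP redirect to avoid the browser's redirect limit):
import uuid
from google.appengine.ext import db, webapp

class Job(db.Model):
    state = db.BlobProperty()                  # pickled intermediate state
    done = db.BooleanProperty(default=False)

class JobHandler(webapp.RequestHandler):
    def get(self):
        ticket = self.request.get('ticket') or uuid.uuid4().hex
        job = Job.get_or_insert(ticket)        # restore or create the job
        do_some_work(job, budget_secs=20)      # stay under the 30 s limit
        job.put()                              # persist state for the next round
        if job.done:
            self.response.out.write(job.state) # final result
        else:
            self.redirect('/job?ticket=%s' % ticket)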
