How to model nested lists with many items using Google Drive Realtime API?

I'd like to model ordered nested lists of uniform items (like what you would see in a standard tree widget) using the Google Drive realtime API. These trees could get quite large, ideally working well with many thousands of items.
One approach would be:
Item:
title: CollaborativeString
attributes: CollaborativeMap
children: CollaborativeList // recursively holds other items
But I'm unsure if this is feasible when dealing with a large number of items.
An alternative might be to store all items in tree order in a single CollaborativeList and add an additional "level" attribute, then reconstruct the tree structure on the client from that level value, as sketched below. That would mean maintaining a single big CollaborativeList instead of thousands of small ones. There are probably lots of other alternatives that I don't know about.
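A minimal sketch of that client-side reconstruction in plain Java (the Node type and the flat input are hypothetical; it assumes the items are stored in tree/pre-order together with their level attribute):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical client-side node; the collaborative model itself would only hold
// the flat list of (title, level) entries.
class Node {
    final String title;
    final int level;
    final List<Node> children = new ArrayList<>();
    Node(String title, int level) { this.title = title; this.level = level; }
}

class TreeBuilder {
    // Rebuild the tree from items stored in pre-order with a "level" attribute.
    static List<Node> build(List<Node> flat) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();  // ancestors of the current item
        for (Node item : flat) {
            while (!stack.isEmpty() && stack.peek().level >= item.level) {
                stack.pop();                      // climb back up to the item's parent
            }
            if (stack.isEmpty()) {
                roots.add(item);
            } else {
                stack.peek().children.add(item);
            }
            stack.push(item);
        }
        return roots;
    }
}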
Thanks for pointers on the best way to model this in the Google Drive Realtime API.

So long as the total size of the document is within the size limits, there shouldn't be a significant performance difference between the approaches from a framework perspective. (One caveat, using ObjectChangedListeners with a highly connected graph may slow things down. Prefer registering listeners on the specific objects instead.)
Modeling it as a real tree makes sense, since that will be the easiest to work with, and you can use the new move operation to atomically rearrange items in the lists.

Related

Multiple weights per Edge in a JGraphT DAG

Is there a way in JGraphT that I can assign multiple weights to a single edge? For example, suppose I have a graph representing travel time between cities. I want to assign edge weights for "time by plane", "time by car", "time by bus", etc., and then find the least-cost route for some specified mode of travel.
One approach I can think of is to have a distinct graph for each travel mode and then add every city vertex to every graph, but that seems like a messy and memory-intensive solution.
My next thought was that I might be able to extend the class implementing the graph (probably DirectedWeightedPseudograph) and customize the getEdgeWeight() method to take an additional argument specifying which weight value to use. That, however, would require extending all the algorithm classes as well (e.g., DijkstraShortestPath), which I am trying to avoid.
To get around that problem I considered the following:
Extend my Graph class by adding a method setWeightMode(enum mode)
Customize the getEdgeWeight() method to use the currently assigned mode to determine which weight value to return to the caller.
On the plus side it would be 100% transparent to any existing analysis classes. On the negative side, it would not be thread-safe.
At this point I'm out of ideas. Can anyone suggest an approach that is scalable for large graphs, supports multi-threading, and minimizes the need to re-implement code already provided by JGraphT?
There exists a much easier solution: you want to use the AsWeightedGraph class. This is a wrapper class that allows you to create different weighted views of an underlying graph. From the class description:
Provides a weighted view of a graph. The class stores edge weights internally. All getEdgeWeight calls are handled by this view; all other graph operations are propagated to the graph backing this view.
This class can be used to make an unweighted graph weighted, to override the weights of a weighted graph, or to provide different weighted views of the same underlying graph. For instance, the edges of a graph representing a road network might have two weights associated with them: a travel time and a travel distance. Instead of creating two weighted graphs of the same network, one would simply create two weighted views of the same underlying graph.
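A short sketch of that approach (city names and weights are made up, and exact constructor signatures vary a little between JGraphT versions):

import java.util.HashMap;
import java.util.Map;
import org.jgrapht.Graph;
import org.jgrapht.GraphPath;
import org.jgrapht.alg.shortestpath.DijkstraShortestPath;
import org.jgrapht.graph.AsWeightedGraph;
import org.jgrapht.graph.DefaultDirectedGraph;
import org.jgrapht.graph.DefaultEdge;

public class MultiWeightRouting {
    public static void main(String[] args) {
        // One underlying route network, shared by all travel modes.
        Graph<String, DefaultEdge> network = new DefaultDirectedGraph<>(DefaultEdge.class);
        network.addVertex("A");
        network.addVertex("B");
        network.addVertex("C");
        DefaultEdge ab = network.addEdge("A", "B");
        DefaultEdge bc = network.addEdge("B", "C");
        DefaultEdge ac = network.addEdge("A", "C");

        // One weight map per travel mode (numbers invented for the example).
        Map<DefaultEdge, Double> carTimes = new HashMap<>();
        carTimes.put(ab, 1.0);
        carTimes.put(bc, 1.0);
        carTimes.put(ac, 3.0);
        Map<DefaultEdge, Double> busTimes = new HashMap<>();
        busTimes.put(ab, 2.0);
        busTimes.put(bc, 2.0);
        busTimes.put(ac, 3.5);

        // Two weighted views over the same graph; no vertices or edges are duplicated.
        Graph<String, DefaultEdge> byCar = new AsWeightedGraph<>(network, carTimes);
        Graph<String, DefaultEdge> byBus = new AsWeightedGraph<>(network, busTimes);

        // Unmodified JGraphT algorithms work on each view.
        GraphPath<String, DefaultEdge> carRoute = new DijkstraShortestPath<>(byCar).getPath("A", "C");
        GraphPath<String, DefaultEdge> busRoute = new DijkstraShortestPath<>(byBus).getPath("A", "C");

        System.out.println("By car: " + carRoute.getVertexList() + " (cost " + carRoute.getWeight() + ")");
        System.out.println("By bus: " + busRoute.getVertexList() + " (cost " + busRoute.getWeight() + ")");
    }
}

Both views share the same vertices and edges, so the memory overhead is just one Map per travel mode, and existing algorithms such as DijkstraShortestPath run unchanged against whichever view you hand them.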

Better way to store hierarchical data with known depth?

I have an (actually quite simple) data structure with tree-like adjacency. I am trying to find a good way to represent data for a film-industry web app which needs to store data about film projects. The data consists of:
project -> scene -> shot -> version - each adjacent to the previous in a "one-to-many" fashion.
Right now I am thinking about a simple adjacency list, but I have trouble believing it would be efficient enough to quickly retrieve the name of the project given just the version, as I'd have to walk through the other tables to get it. The (simplified) layout would be like this:
[diagram: simple adjacency layout]
I was thinking about referencing all higher-level ancestors instead of only the direct parent, knowing that the hierarchy has a fixed depth. That way I could use these shortcuts to get my information with only one query. But is this bad data modeling? Are there any other ways to do it?
It's not good data modelling from the perspective of normalisation. If you realise that you put the wrong scene in for a project, you then have to move it and everything down the hierarchy.
But... does efficiency matter to you? How much data are you talking about? How fast do you need a response? I'd say go with what you've got and if you need it faster, have something that regularly extracts the data to a cache.
Try a method called Modified Preorder Tree Traversal: http://www.sitepoint.com/hierarchical-data-database/
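For reference, a small sketch of the numbering that Modified Preorder Tree Traversal (the nested set model) relies on; the types here are hypothetical, and in the database you would simply store the two integers per row:

import java.util.ArrayList;
import java.util.List;

// Each node gets a (left, right) pair from a depth-first walk over the tree.
class MpttNode {
    final String name;
    final List<MpttNode> children = new ArrayList<>();
    int left, right;
    MpttNode(String name) { this.name = name; }
}

class MpttNumbering {
    // Assign preorder left/right values; call as number(root, 1).
    static int number(MpttNode node, int counter) {
        node.left = counter++;
        for (MpttNode child : node.children) {
            counter = number(child, counter);
        }
        node.right = counter++;
        return counter;
    }
}

All descendants of a node X are then exactly the rows whose left value lies between X.left and X.right, so questions like "everything under this project" or "which project owns this version" become a single range query.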

GAE MapReduce Huge Query

Abstract: Is MapReduce a good idea when the job is to process an entire collection of data from the database, rather than to find the answer to some complex (or just big) question?
I would like to sync a set of syndication sources (e.g. URLs like http://xkcd.com/rss.xml), which are stored in GAE's datastore as a collection/table. I see two options. One is straightforward: make simple tasks which you put in a queue, where each task handles 100 or 1000 or whatever natural number of sources seems to fit in one task (a sketch of this option follows below). The other option is MapReduce.
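A minimal sketch of the queue option, assuming a hypothetical /tasks/syncFeeds worker URL and a batch size of 100 (both made up for illustration):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Enqueue one task per batch of feed entities; a servlet mapped to the
// (assumed) /tasks/syncFeeds URL would fetch each feed and store its posts.
public class FeedSyncScheduler {
    private static final int BATCH_SIZE = 100;

    public static void enqueueBatches(int totalFeeds) {
        Queue queue = QueueFactory.getDefaultQueue();
        for (int offset = 0; offset < totalFeeds; offset += BATCH_SIZE) {
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/syncFeeds")
                    .param("offset", Integer.toString(offset))
                    .param("limit", Integer.toString(BATCH_SIZE)));
        }
    }
}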
In the MapReduce case, the Map does everything and the Reduce does nothing. Moreover, the map has no result; it just alters the 'state' (of the datastore).
@Override
public void map(Entity entity) {
    String url = (String) entity.getProperty("url");
    for (Post p : www.fetchPostsFromFeed(url)) {
        p.save(); // one feed entity "explodes" into many stored posts
    }
}
As you can see, one source can map to many posts, so my map might as well be called "Explode".
So no emits, and nothing for reduce to do. The reason I like this map approach is that I tell Google: here, take my collection/table, split it however you see fit across different mappers, and then store the posts wherever you like. The datastore uses 'high replication', so the data is highly available, and carefully choosing which 'computational unit' handles which entity doesn't really reduce network communication. The same goes for saving the posts, as they need to go to all datastore units anyway. What I do like is that MapReduce has some fault recovery for map computations that get stuck, and that it knows how many tasks to send to which node, instead of me queueing some number of entities somewhere and hoping it makes sense.
Maybe my way of thinking here is wrong, in which case, please correct me. Anyhow, is this approach 'wrong' for the lack of reduce and map being an 'explode'?
Nope, Map does pretty much the same as your manual enqueuing of tasks.

Self Tracking Entities Traffic Optimization

I'm working on a personal project using WPF with Entity Framework and Self-Tracking Entities. I have a WCF web service which exposes some methods for the CRUD operations. Today I decided to do some tests to see what actually travels over this service, and even though I expected something like this, I got really disappointed. The problem is that for a simple update (or delete) operation on just one object - let's say a Category - I send to the server the whole object graph, including all of its parent categories, their items, child categories and their items, etc. In my case it was a 170 KB XML payload for a really small database (2 main categories, about 20 categories in total and about 60 items). I can't imagine what will happen with a really big database.
I tried to google for some articles concerning traffic optimization with STE, but with no success, so I decided to ask here if somebody has done something similar, knows some good practices, etc.
One of the possible ways I came up with is to get the data I need per object type with more service calls:
return context.Categories.ToList();//only the categories
...
return context.Items.ToList();//only the items
Instead of:
return context.Categories.Include("Items").ToList();
This way the categories and the items will be separated and when making changes or deleting some objects the data sent over the wire will be less.
Have any of you faced a similar problem, and if so, how did you solve it?
We've encountered similar challenges. The first thing, as you already mentioned, is to keep the entities as small as possible (as dictated by the desired client functionality). And second, when sending entities back over the wire to be persisted: strip all navigation properties (nested objects) when they haven't changed. This sounds very simple but is not at all trivial. What we do is recursively dig into the entities present in the trackable collections of, say, the "topmost" entity (and their trackable collections, and theirs, and...) and remove them when their change-tracking state is "Unchanged". But be careful with this, because in some cases you still need these entities, for example when they have been removed from or added to trackable collections of their parent entity (and then you shouldn't remove them).
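A rough sketch of that stripping idea, expressed here in plain Java with hypothetical types (the real code would be C# working against the generated self-tracking entities and their ChangeTracker, and the sketch assumes the child collections form an acyclic graph):

import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the generated entity types.
enum ChangeState { UNCHANGED, MODIFIED, ADDED, DELETED }

interface TrackedEntity {
    ChangeState getState();
    List<TrackedEntity> getChildren(); // the entity's trackable collections, flattened
}

class EntityStripper {
    // Depth-first: prune children that are Unchanged and have nothing changed below them,
    // so only the modified part of the graph goes back over the wire.
    static void strip(TrackedEntity entity) {
        Iterator<TrackedEntity> it = entity.getChildren().iterator();
        while (it.hasNext()) {
            TrackedEntity child = it.next();
            strip(child);
            boolean nothingToPersist = child.getState() == ChangeState.UNCHANGED
                    && child.getChildren().isEmpty();
            if (nothingToPersist) {
                it.remove();
            }
        }
    }
}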
This "StripEntity" step is also mentioned (though without any code sample) in Julie Lerman's Programming Entity Framework.
And although it might not be as efficient as a more purist approach, using STEs saves a lot of code for queries against the database. We are not in need of optimal performance in a high-traffic situation, so STEs suit our needs and take away a lot of the code needed to communicate with the database. You have to decide what the "best" solution is for your situation. Good luck!
You can find an Entity Framework project item at http://selftrackingentity.codeplex.com/. With version 0.9.8, I added a method called GetObjectGraphChanges() that returns an optimized entity object graph with only objects that have changes.
Also, there are two helper methods: EstimateObjectGraphSize() and EstimateObjectGraphChangeSize(). The first returns the estimated size of the whole entity object graph; the latter returns the estimated size of the optimized graph containing only objects that have changes. With these two helper methods, you can decide whether it makes sense to call GetObjectGraphChanges() or not.

How to model Data Transfer Objects for different front ends?

I've run into a recurring problem for which I haven't found any good examples or patterns.
I have one core service that performs all heavy database operations and sends results to different front ends (HTML, Silverlight/Flash, web services, etc.).
One of the service operations is "GetDocuments", which provides a list of documents based on different filter criteria. If I only had one front end, I would package the result in a list of Document DTOs (data transfer objects) that just contain the data. However, different front ends need different amounts of "metadata". The simplest client just needs the document headline and a link reference. Another client wants a short text snippet of the document, another also wants a thumbnail, and a third wants the name of the author. It's basically up to the GUI implementation what needs to be displayed.
What's the best way to model this:
As a lot of different DTOs (Document, DocumentWithThumbnail, DocumentWithTextSnippet) - tends to become a lot of classes
As one DTO containing all the data, where the client chooses what to display - lots of unnecessary data sent
As one DTO where certain fields are populated based on what the client requested - tends to become a very large class that needs to be extended over time
As one DTO, but with some kind of generic "Metadata" field containing the requested metadata.
Or are there other options?
Since I want a high performance service, I need to think about both network load and caching strategies.
Does anyone have any good patterns or practices that might help me?
What I would do is give the front end the ability to request the wanted metadata explicitly, say getDocument(WITH_THUMBNAILS | WITH_TEXT_SNIPPET).
Then this DTO is built with only this requested information.
Adding all the possible metadata is, as you said, unacceptable.
I would definitely stay with one class defining all the possible accessors (getTitle(), getThumbnail()), and where possible it would return a placeholder when the thumbnail was not requested, something like "Image not available".
If you want to model this like a pattern, take a look at the factory patterns.
Hope this helps you.
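A minimal sketch of that flag-based idea, with all names (DocumentField, DocumentDto, the loadXxx stubs) invented for illustration:

import java.util.Set;

// The caller declares which optional metadata it wants; the DTO is only
// populated (and serialized) with those parts.
enum DocumentField { TEXT_SNIPPET, THUMBNAIL, AUTHOR }

class DocumentDto {
    String title;          // always present
    String link;           // always present
    String textSnippet;    // null unless TEXT_SNIPPET was requested
    byte[] thumbnail;      // null unless THUMBNAIL was requested
    String author;         // null unless AUTHOR was requested
}

class DocumentService {
    DocumentDto getDocument(long id, Set<DocumentField> fields) {
        DocumentDto dto = new DocumentDto();
        dto.title = loadTitle(id);
        dto.link = "/documents/" + id;
        if (fields.contains(DocumentField.TEXT_SNIPPET)) dto.textSnippet = loadSnippet(id);
        if (fields.contains(DocumentField.THUMBNAIL)) dto.thumbnail = loadThumbnail(id);
        if (fields.contains(DocumentField.AUTHOR)) dto.author = loadAuthor(id);
        return dto;
    }

    // Stubs standing in for the real data access layer.
    private String loadTitle(long id) { return "title " + id; }
    private String loadSnippet(long id) { return "snippet"; }
    private byte[] loadThumbnail(long id) { return new byte[0]; }
    private String loadAuthor(long id) { return "author"; }
}

The simplest client would call getDocument(42, EnumSet.noneOf(DocumentField.class)), while a richer one might pass EnumSet.of(DocumentField.THUMBNAIL, DocumentField.TEXT_SNIPPET); only the requested parts are loaded and sent over the wire.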
Is there any noticeable cost to creating a DTO that has all the data any of your views could need and using it everywhere? I would do that, especially since it insulates you from a requirement change down the line where one of the views needs data that another view already uses.
For example, maybe your Silverlight/Flash view doesn't show the title itself because it's in the thumbnail now, but later they decide they want to sort by it.
To clarify, I don't necessarily think you need to pass down all of the data every time, but I think your DTO class should define all of it. Just don't fall into the pits of premature optimization or analysis paralysis. Do the simplest thing first, then justify added complexity. Throw it all in and profile it. If the performance is unacceptable, optimize and try again.
