Cache intermediate values on cluster - distributed

I am running dask distributed and would like intermediate values to be saved. For example, when I run this:
from distributed import Client
client = Client(IP_PORT_TO_SCHEDULER)
from dask import delayed
@delayed(pure=True)
def myfunction(a):
    print("recomputing")
    return a + 3
res = myfunction(1)
res2 = res**2
res3 = client.persist(res2)
resagain = res**3
resagain2 = client.persist(resagain)
I would expect "recomputing" to print only once. However, in this case it prints twice. I think this might be because the client doesn't cache this intermediate value. For example, running client.has_what(), I see this:
{'tcp://xx.xx.xx.xx:xxxx': ['pow-9d66a68ce8be79ff9cca17a2dc58aa0b',
'pow-440784f1abedb14511aa0d633935b55a']}
I see the final result of the power functions but not the intermediate computation. Is there a way to force the client to store this intermediate computation? thanks!

Dask will hold on to all results that you have explicitly persisted. Any intermediate results will be cleaned up in order to save on memory.
So in your case you probably want to do the following:
res = myfunction(1)
res = res.persist() # Ask Dask to keep this in memory
...
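Applied to the code from the question, a minimal sketch might look like this (IP_PORT_TO_SCHEDULER is the placeholder from the question; note that the print runs on a worker, so it shows up in the worker's output rather than your local console):
from dask import delayed
from distributed import Client

client = Client(IP_PORT_TO_SCHEDULER)   # placeholder address from the question

@delayed(pure=True)
def myfunction(a):
    print("recomputing")                # executes on a worker
    return a + 3

res = myfunction(1).persist()           # pin the intermediate result on the cluster
res2 = (res ** 2).persist()             # reuses the persisted intermediate
resagain = (res ** 3).persist()         # reuses it again; "recomputing" runs once
Here res.persist() and client.persist(res) do the same thing, since the client registers itself as the default scheduler.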

Related

Is it okay to use multiple concurrent read/write operations on the same database file but with different stores in Sembast?

My app has profile and work databases (and others) stored locally using Sembast DB.
Please look at the two examples below; which one is better practice for asynchronous processes?
Example 1:
final profileDBPath = p.join(appDocumentDir.path, dbDirectory, 'profile.db');
final profileDB = await databaseFactoryIo.openDatabase(profileDBPath);
final workDBPath = p.join(appDocumentDir.path, dbDirectory, 'work.db');
final workDB = await databaseFactoryIo.openDatabase(workDBPath);
final workStore = stringMapStoreFactory.store('work');
final profileStore = stringMapStoreFactory.store('profile');
Example 2:
final dbPath = p.join(appDocumentDir.path, dbDirectory, 'database.db');
final database = await databaseFactoryIo.openDatabase(dbPath);
final workStore = stringMapStoreFactory.store('work');
final profileStore = stringMapStoreFactory.store('profile');
Notice that Example 1 opens two different database files, one for profile and one for work, while Example 2 uses the same database file for both.
The question is which one is better in terms of stability?
For coding simplicity I like Example 2 better, but my worry is that in an async operation Example 2 will crash when two writes hit the same file at the same time. Any ideas?
Thank you
Example 2 will crash when they write on the same file at the same time
I don't know if that is something you have experienced or just an assumption. Sembast supports multiple concurrent readers and a single writer (single process, single isolate) and uses a kind of mutex to ensure data consistency. Concurrent writes are serialized and should not trigger any crash. And if they do, that's a bug that you should file!
Personally, I would go for a single database; it allows cross-store transactions for data consistency that two separate databases cannot provide.

Data pipeline for running data records through modules which are acyclically dependent on each other

My question revolves around the best practice for processing data through multiple modules, where each module creates new data from the original data or from data generated by other modules. Let me create an illustrative example, where I'll be considering a MongoDB document and modules which create update parameters for the document. Consider a base document, {"a":2}. Now, consider these modules, which I've written as Python functions:
def mod1(data):
    a = data["a"]
    b = 2 * a
    return {"$set": {"b": b}}

def mod2(data):
    b = data["b"]
    c = "Good" if b > 1 else "Bad"
    return {"$set": {"c": c}}

def mod3(data):
    a = data["a"]
    d = a - 3
    return {"$set": {"d": d}}

def mod4(data):
    c = data["c"]
    d = data["d"]
    e = d if c == "Good" else 0
    return {"$set": {"e": e}}
When applied in the correct order, the updated document will be {"a":2,"b":4,"c":"Good","d":-1,"e":-1}. Note that mod1 and mod3 can run simultaneously, while mod4 must wait for mod2 and mod3 to run on a document. My question is more general, though: what's the best way to do something like this? My current method is very similar to this, but each module gets its own Docker container due to the less trivial nature of its processing. The problem is that every module has to continuously query the entire collection these documents reside in to see whether any documents have become valid for it to process.
You really want an event-driven architecture for pipelines like this. As records are updated, push events onto a topic queue (or ESB) and have the other pipeline stages listen to those queues. As events are published, those stages can pick up records and process them.
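As a rough illustration of the idea (not a production setup), here is a minimal in-process sketch: each module declares the fields it needs, updates are published as events on a queue, and a module runs as soon as its inputs exist. In a real deployment the queue.Queue would be a broker topic (Kafka, RabbitMQ, etc.) and each module its own containerized consumer; the names MODULES and process are illustrative, and the MongoDB "$set" updates from the question are simplified to plain dict updates.
import queue

def mod1(data): return {"b": 2 * data["a"]}
def mod2(data): return {"c": "Good" if data["b"] > 1 else "Bad"}
def mod3(data): return {"d": data["a"] - 3}
def mod4(data): return {"e": data["d"] if data["c"] == "Good" else 0}

# Each module declares which fields it needs, so nobody has to poll the
# whole collection looking for documents that have become valid for it.
MODULES = [(mod1, {"a"}), (mod2, {"b"}), (mod3, {"a"}), (mod4, {"c", "d"})]

def process(document):
    events = queue.Queue()
    events.put(set(document))                    # initial "fields updated" event
    done = set()
    while not events.empty():
        events.get()
        for i, (mod, needs) in enumerate(MODULES):
            if i not in done and needs <= set(document):
                document.update(mod(document))   # apply the module's update
                done.add(i)
                events.put(needs)                # publish an update event
    return document

print(process({"a": 2}))   # {'a': 2, 'b': 4, 'c': 'Good', 'd': -1, 'e': -1}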

Apache Flink: Why do reduce or groupReduce transformations not operate in parallel?

For example:
DataSet<Tuple1<Long>> input = env.fromElements(
        Tuple1.of(1L), Tuple1.of(2L), Tuple1.of(3L), Tuple1.of(4L), Tuple1.of(5L),
        Tuple1.of(6L), Tuple1.of(7L), Tuple1.of(8L), Tuple1.of(9L));
DataSet<Tuple1<Long>> sum = input.reduce(new ReduceFunction<Tuple1<Long>>() {
    public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
        return new Tuple1<>(value1.f0 + value2.f0);
    }
});
If the above reduce transformation is not a parallel operation, do I need to use the two additional transformations 'partitionByHash' and 'mapPartition', as below:
DataSet<Tuple1<Long>> input = env.fromElements(
        Tuple1.of(1L), Tuple1.of(2L), Tuple1.of(3L), Tuple1.of(4L), Tuple1.of(5L),
        Tuple1.of(6L), Tuple1.of(7L), Tuple1.of(8L), Tuple1.of(9L));
DataSet<Tuple1<Long>> sum = input
    .partitionByHash(0)
    .mapPartition(new MapPartitionFunction<Tuple1<Long>, Tuple1<Long>>() {
        public void mapPartition(Iterable<Tuple1<Long>> values, Collector<Tuple1<Long>> out) {
            long sum = getSum(values);
            out.collect(new Tuple1<>(sum));
        }
    })
    .reduce(new ReduceFunction<Tuple1<Long>>() {
        public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
            return new Tuple1<>(value1.f0 + value2.f0);
        }
    });
And why is the result of the reduce transformation still an instance of DataSet rather than an instance of Tuple1<Long>?
Both reduce and reduceGroup are group-wise operations and are applied on groups of records. If you do not specify a grouping key using groupBy, all records of the data set belong to the same group. Therefore, there is only a single group and the final result of reduce and reduceGroup cannot be computed in parallel.
If the reduce transformation is combinable (which is true for any ReduceFunction and all combinable GroupReduceFunctions), Flink can apply combiners in parallel.
Two answers, to your two questions:
(1) Why is reduce() not parallel
Fabian gave a good explanation. The operations are parallel if applied by key. Otherwise only pre-aggregation is parallel.
In your second example, you make it parallel by introducing a key. Instead of your complex workaround with "mapPartition()", you can also simply write (Java 8 Style)
DataSet<Tuple1<Long>> input = ...;
input.groupBy(0).reduce((a, b) -> new Tuple1<>(a.f0 + b.f0));
Note however that your input data is so small that there will be only one parallel task anyways. You can see parallel pre-aggregation if you use larger input, such as:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
DataSet<Long> input = env.generateSequence(1, 100000000);
DataSet<Long> sum = input.reduce ( (a, b) -> a + b );
(2) Why is the result of a reduce() operation still a DataSet ?
A DataSet is still a lazy representation of the result in the cluster. You can continue to use that data in the parallel program without triggering any computation and without fetching the result data back from the distributed workers to the driver program. That allows you to write larger programs that run entirely on the distributed workers and are lazily executed. No data is ever fetched to the client and re-distributed to the parallel workers.
Especially in iterative programs this is very powerful, as the entire loop runs without ever involving the client or needing to re-deploy operators.
You can always get the actual value by calling "dataSet.collect().get(0);" - which makes it explicit that something should be executed and fetched.

Groovy sql dataset causes java.lang.OutOfMemory

I have a table with 252759 tuples. I would like to use the DataSet object to make my life easier; however, when I try to create a DataSet for my table, after 3 seconds I get java.lang.OutOfMemoryError.
I have no experience with DataSets. Are there any guidelines on how to use the DataSet object for big tables?
Do you really need to retrieve all the rows at once? If not, then you could just retrieve them in batches of (for example) 10000 using the approach shown below.
import groovy.sql.Sql

def db = [url: 'jdbc:hsqldb:mem:testDB', user: 'sa', password: '', driver: 'org.hsqldb.jdbcDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)

String query = "SELECT * FROM my_table WHERE id > ? ORDER BY id LIMIT 10000"
Integer maxId = 0

// Closure that executes the query and returns true if some rows were processed
Closure executeQuery = {
    def oldMaxId = maxId
    sql.eachRow(query, [maxId]) { row ->
        // Code to process each row goes here.....
        maxId = row.id
    }
    return maxId != oldMaxId
}

while (executeQuery());
AFAIK limit is a MySQL-specific feature, but most other RDBMS have an equivalent feature that limits the number of rows returned by a query.
Also, I haven't tested (or even compiled) the code above, so handle with care!
Why not start with giving the JVM more memory?
java -Xms<initial heap size> -Xmx<maximum heap size>
252759 tuples doesn't sound like anything a machine with 4GB RAM + some virtual memory couldn't handle in memory.

What's the best way to count results in GQL?

I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is that my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT function...
You have to flip your thinking when working with a scalable datastore like GAE: do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at display time.
class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    counter.count += 1
    bar.put()
    counter.put()
db.run_in_transaction(createNewBar,'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method for keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward; a sketch follows below.
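For reference, a minimal sharded counter sketch using the same old google.appengine.ext.db API as the other answers might look like this (NUM_SHARDS, CounterShard and the field names are illustrative, not an official API):
import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)    # logical counter name, e.g. a category
    count = db.IntegerProperty(default=0)

def increment(name):
    # Bump one randomly chosen shard inside a transaction; write contention is
    # spread across NUM_SHARDS entities instead of piling up on a single one.
    index = random.randint(0, NUM_SHARDS - 1)
    shard_key_name = '%s-%d' % (name, index)
    def txn():
        shard = CounterShard.get_by_key_name(shard_key_name)
        if shard is None:
            shard = CounterShard(key_name=shard_key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    # Reading the total means summing all shards for that name.
    return sum(s.count for s in CounterShard.all().filter('name =', name))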
Count functions in all databases are slow (e.g., O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip's solution works with a little tweaking:
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records changing during any given day is not drastic.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
    var query = new GqlQuery
    {
        QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
        AllowLiterals = true
    };
    var result = database.RunQuery(query);
    return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest key:
max = MyObj.all().order('-num').get()
if max:
    max = max.num + 1
else:
    max = 0

newObj = MyObj(num=max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num >', 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...
