Caching for Neo4j user-defined procedures

I am currently running a comparative experiment for some algorithms on top of a relational database (PostgreSQL) and a graph database (Neo4j).
I implemented my algorithm as a user-defined procedure for Neo4j, but it doesn't look like it performs any caching out of the box.
Is there a way to configure caching for user-defined procedures in Neo4j?
Thanks

You'll have to implement the caching yourself, if it's relevant for your use case and you actually have something to cache. Cached values should not be tied to a transaction, so no nodes or relationships; Neo4j ids are also tricky, since they can be reused, so it's probably best to cache them only for a short duration, or not at all. Application-level ids are fine, as are beans composed of strings or scalar types.
Suppose you have this procedure defined:
public class MyProcedure {

    @Context
    public GraphDatabaseService db;

    @Procedure
    public Stream<MyBean> doSomething(@Name("uuid") String uuid) {
        int count = 0;
        // ...
        return Stream.of(new MyBean(count));
    }

    public static class MyBean {
        public int count;

        public MyBean(int count) {
            this.count = count;
        }
    }
}
You can add some simple caching using a ConcurrentMap:
public class MyProcedure {

    private static final ConcurrentMap<String, Collection<MyBean>> CACHE =
            new ConcurrentHashMap<>();

    @Context
    public GraphDatabaseService db;

    @Procedure
    public Stream<MyBean> doSomething(@Name("uuid") String uuid) {
        Collection<MyBean> result = CACHE.computeIfAbsent(uuid,
                k -> doSomethingCacheable(k).collect(Collectors.toList()));
        return result.stream();
    }

    private Stream<MyBean> doSomethingCacheable(String uuid) {
        int count = 0;
        // ...
        return Stream.of(new MyBean(count));
    }

    public static class MyBean {
        // ...
    }
}
Note that you can't cache a Stream as it can only be consumed once, so you have to consume it yourself by collecting into an ArrayList (you can also move the collect inside the method, change the return type to Collection<MyBean> and use a method reference). If the procedure takes more than one argument, you'll need to create a proper class for the composite key (immutable if possible, with correct equals and hashCode implementation). The restrictions that apply to cacheable values also apply to the keys.
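For instance, a minimal composite key for a hypothetical two-argument procedure (the uuid and depth parameters are made up for illustration) could look like this:

import java.util.Objects;

public final class CacheKey {

    private final String uuid;
    private final long depth;

    public CacheKey(String uuid, long depth) {
        this.uuid = uuid;
        this.depth = depth;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey other = (CacheKey) o;
        return depth == other.depth && Objects.equals(uuid, other.uuid);
    }

    @Override
    public int hashCode() {
        return Objects.hash(uuid, depth);
    }
}

On Java 16 and later, a record (public record CacheKey(String uuid, long depth) {}) gives you the same equals/hashCode for free.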
This is an eternal, unbounded cache. If you need more features (expiration, maximum size), I suggest you use a real cache implementation, such as Guava's Cache (or LoadingCache) or Ben Manes' Caffeine.
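As a rough sketch with Caffeine (assuming the library is on the classpath; the maximum size and TTL values below are arbitrary), the unbounded map could be swapped for a bounded, expiring cache:

import java.time.Duration;
import java.util.Collection;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;

public class MyProcedure {

    // Bounded, expiring cache instead of the eternal ConcurrentHashMap
    private static final Cache<String, Collection<MyBean>> CACHE = Caffeine.newBuilder()
            .maximumSize(10_000)                      // arbitrary upper bound
            .expireAfterWrite(Duration.ofMinutes(10)) // arbitrary TTL
            .build();

    @Procedure
    public Stream<MyBean> doSomething(@Name("uuid") String uuid) {
        // get() computes, caches and returns the value if it is absent
        Collection<MyBean> result = CACHE.get(uuid,
                k -> doSomethingCacheable(k).collect(Collectors.toList()));
        return result.stream();
    }

    private Stream<MyBean> doSomethingCacheable(String uuid) {
        int count = 0;
        // ...
        return Stream.of(new MyBean(count));
    }

    public static class MyBean {
        // ...
    }
}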

Related

Pre-shuffle aggregation in Flink

We are migrating a Spark job to Flink. We have used pre-shuffle aggregation in Spark. Is there a way to perform a similar operation in Flink? We are consuming data from Apache Kafka and using a keyed tumbling window to aggregate the data. We want to aggregate the data in Flink before performing the shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Yes, it is possible, and I will describe three ways: first, the aggregation already built into the Flink Table API; second, building your own pre-aggregate operator; third, a dynamic pre-aggregate operator that adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
As shown here, you can do MiniBatch Aggregation or Local-Global Aggregation. The second option is better. You basically tell Flink to create mini-batches of 5000 events and pre-aggregate them before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome because you have to create your own operator using OneInputStreamOperator and call it using doTransform(). Here is an example of the BundleOperator:
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
        extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
        implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // get the key and value for the map bundle
        final IN input = element.getValue();
        final K bundleKey = getKey(input);
        final V bundleValue = this.bundle.get(bundleKey);

        // get a new value after adding this element to the bundle
        final V newBundleValue = userFunction.addInput(bundleValue, input);

        // update the map bundle
        bundle.put(bundleKey, newBundleValue);

        numOfElements++;
        bundleTrigger.onElement(input);
    }

    @Override
    public void finishBundle() throws Exception {
        if (!bundle.isEmpty()) {
            numOfElements = 0;
            userFunction.finishBundle(bundle, collector);
            bundle.clear();
        }
        bundleTrigger.reset();
    }
}
The callback interface defines when to trigger the pre-aggregation. Every time the stream reaches the bundle limit (if (count >= maxCount)), the pre-aggregate operator emits events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {

    private final long maxCount;
    private transient BundleTriggerCallback callback;
    private transient long count = 0;

    public CountBundleTrigger(long maxCount) {
        Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
        this.maxCount = maxCount;
    }

    @Override
    public void registerCallback(BundleTriggerCallback callback) {
        this.callback = Preconditions.checkNotNull(callback, "callback is null");
    }

    @Override
    public void onElement(T element) throws Exception {
        count++;
        if (count >= maxCount) {
            callback.finishBundle();
            reset();
        }
    }

    @Override
    public void reset() {
        count = 0;
    }
}
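For reference, the userFunction used by the operator above is a MapBundleFunction. A minimal sketch of one, assuming the addInput/finishBundle contract shown in the operator above and a simple per-key sum (the Tuple2<String, Long> input type is just an example):

import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Sums values per key inside a bundle and emits one (key, sum) record
// per key when the bundle is finished, i.e. before the shuffle.
public class SumPerKeyBundleFunction
        extends MapBundleFunction<String, Long, Tuple2<String, Long>, Tuple2<String, Long>> {

    @Override
    public Long addInput(Long value, Tuple2<String, Long> input) {
        // value is null the first time this key is seen in the current bundle
        return value == null ? input.f1 : value + input.f1;
    }

    @Override
    public void finishBundle(Map<String, Long> bundle, Collector<Tuple2<String, Long>> out) {
        for (Map.Entry<String, Long> entry : bundle.entrySet()) {
            out.collect(Tuple2.of(entry.getKey(), entry.getValue()));
        }
    }
}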
Then you call your operator using the doTransform:
myStream.map(....)
.doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
.map(...)
.keyBy(...)
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
A dynamic pre-aggregation
In case you wish to have a dynamic pre-aggregate operator, check AdCom (Adaptive Combiner for stream aggregation). It adjusts the pre-aggregation based on backpressure signals, which lets the shuffle phase be used as fully as possible.

ConcurrentHashMap in Codename One

I need to use the same HashMap from different threads (the EDT, a Timer, and maybe the network thread). Since Codename One doesn't have a ConcurrentHashMap implementation in its API, I tried to work around this like so:
/**
* ConcurrentHashMap emulation for Codename One.
*/
public class ConcurrentHashMap<K, V> {

    private static final EasyThread thread = EasyThread.start("concurrentHashMap_Thread");
    private final Map<K, V> map = new HashMap<>();

    public V put(K key, V value) {
        return thread.run(() -> map.put(key, value));
    }

    public V get(K key) {
        return thread.run(() -> map.get(key));
    }

    public V remove(K key) {
        return thread.run(() -> map.remove(key));
    }
}
I have two questions:
Is code like the above actually necessary in Codename One when accessing a Map from two or more threads, or is it better to use a HashMap directly?
What do you think about that code? I tried to keep it as simple as possible, but I'm not sure of its correctness.
That code is correct, but it's also very inefficient: every call takes a lock and hops to a different thread just to touch the map. You're effectively bringing a tank to a problem that's far simpler.
The TL;DR is this:
map = Collections.synchronizedMap(map);
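A minimal usage sketch (the names are illustrative); note that compound operations and iteration over a synchronized map still need an explicit synchronized block:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SharedState {

    // Created once and shared between the EDT, a timer, and the network thread
    public static final Map<String, Integer> SHARED =
            Collections.synchronizedMap(new HashMap<String, Integer>());

    public static void example() {
        // Single operations are thread-safe as-is
        SHARED.put("visits", 1);
        Integer visits = SHARED.get("visits");

        // Iteration and check-then-act sequences must be guarded manually
        synchronized (SHARED) {
            for (Map.Entry<String, Integer> e : SHARED.entrySet()) {
                // ...
            }
        }
    }
}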
Your code could have been written like this and would have been more efficient:
public class ConcurrentHashMap<K, V> {

    private final Map<K, V> map = new HashMap<>();

    public synchronized V put(K key, V value) {
        return map.put(key, value);
    }

    public synchronized V get(K key) {
        return map.get(key);
    }

    public synchronized V remove(K key) {
        return map.remove(key);
    }
}
But you obviously don't need it since we have the synchronizedMap method...
So why don't we have ConcurrentHashMap?
Because we don't run on huge servers. See this article: https://crunchify.com/hashmap-vs-concurrenthashmap-vs-synchronizedmap-how-a-hashmap-can-be-synchronized-in-java/
The TL;DR of that article is that ConcurrentHashMap makes sense when you have a lot of threads and a lot of server cores; it goes the extra mile to limit synchronization for maximum scale. On mobile, that's huge overkill.

Objectify - save order of Ref<?>-s

I have a system where I'm trying to minimize the number of Datastore writes (who wouldn't?), all the while using ancestor relations. Consider the following simplified classes:
public class Ancestor {

    @Id
    private String id;

    private String field;

    private Ref<Descendant> descendantRef;

    public Descendant getDescendant() {
        return this.descendantRef.get();
    }

    public void setDescendant(Descendant des) {
        this.descendantRef = Ref.create(des);
    }
}

public class Descendant {

    @Id
    private String id;

    private String field;

    @Parent
    private Key parent;
}
My problem: even though I set the descendant ref, upon saving the Ancestor entity, a null is saved, but if I save the Descendant as well, Objectify complains Attempted to save a null entity.
My question: I gathered that Objectify optimizes the order of get() operations with the #Load annotation, so is there a way to make it do the same on save() operations as well, so that by the time the Ancestor is being sent to the Datastore, the Descendant ref is populated properly?
Thank you in advance for any advice!
You can hide this implementation like this:
public Descendant getDescendant() {
    // You probably don't want to break your code on a null descendantRef
    if (this.descendantRef != null) {
        return this.descendantRef.get();
    }
    return null;
}

public void setDescendant(Descendant des) {
    // Insert the descendant if it has never been inserted
    if (getDescendant() != null) {
        new DescendantEndpoint().insert(des);
    }
    // You probably don't want to break your code on a null des
    if (des != null) {
        this.descendantRef = Ref.create(des);
    }
}
This way, you don't have to deal with inserting the ref in every endpoint you write. Still, this method is not optimized for bulk inserts, since each one is a separate datastore call.
For that you can do something like this:
private Object bulkInsertAncestor(ArrayList<Ancestor> list) {
    ArrayList<Descendant> descendantArrayList = // get descendant list
    // You should do all of this inside a transaction
    new DescendantEndpoint().bulkInsertDescendant(descendantArrayList);
    return ofy().save().entities(list);
}
Objectify only complains that you "Attempted to save a null entity" if you literally pass a null value to the save() method.
This isn't an order-of-operations issue. FWIW, neither Objectify nor the underlying datastore provides any kind of referential integrity checks. It does not matter which you save first.
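For illustration, assuming the getters/setters from the question and that the ids and the @Parent key have already been populated, you can simply save both entities:

import static com.googlecode.objectify.ObjectifyService.ofy;

public class SaveExample {

    // Assumes both entities already have their ids (and the Descendant's
    // @Parent key) populated, so that Ref.create() can build a valid key.
    void saveBoth(Ancestor anc, Descendant des) {
        anc.setDescendant(des);
        // One save() call for both entities; Objectify does not care about the order.
        ofy().save().entities(anc, des).now();
    }
}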

Flink executes dataflow twice

I'm new to Flink and I work with the DataSet API. After a whole bunch of processing, as the last stage I need to normalize one of the values by dividing it by its maximum value. So I have used the .max() operator to take the max, and later I pass the result as a constructor argument to the MapFunction.
This works, however all the processing is performed twice: one job is executed to find the max values, and later another job is executed to create the final result (starting execution from the beginning). Is there any workaround to execute the whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;

result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...);

@FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
@FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {

    private final Tuple6<...> maxValues;

    public NormalizeAttributes(Tuple6<...> maxValues) {
        this.maxValues = maxValues;
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed a second time.
Besides the side effect of triggering an execution, using collect() to distribute values to a subsequent transformation also has the drawback that the data is transferred to the client and later back into the cluster. Flink offers so-called broadcast variables to ship a DataSet as a side input into another transformation.
Using Broadcast variables in your program would look as follows:
DataSet<Tuple6<...>> maxValues = result.max(2);

result
    .map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
    .writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttrs extends RichMapFunction<Tuple6<...>, Tuple6<...>> {

    private Tuple6<...> maxValues;

    @Override
    public void open(Configuration config) {
        maxValues = (Tuple6<...>) getRuntimeContext().getBroadcastVariable("maxValues").get(0);
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
You can find more information about Broadcast variables in the documentation.
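To make the single execution explicit, here is a sketch of how the pieces fit together (the environment setup, output path, and job name are assumptions, and the Tuple6<...> elision is kept from the question):

// One program, one submission: max(2) and the normalization belong to the
// same dataflow, so nothing is computed twice.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple6<...>> result = ...; // the existing processing pipeline

DataSet<Tuple6<...>> maxValues = result.max(2);

result
    .map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
    .writeAsCsv("file:///tmp/normalized.csv");

env.execute("normalize attributes"); // the only job submission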

How to share an Array between all Classes in an application?

I want to share an Array so that all classes can "get" and "change" the data inside it. Something like a global array or multi-access array. How is this possible with ActionScript 3.0?
There are a couple of ways to solve this. One is to use a global variable (as suggested in unkiwii's answer) but that's not a very common approach in ActionScript. More common approaches are:
Class variable (static variable)
Create a class called DataModel or similar, and define an array variable on that class as static:
public class DataModel {
    public static var myArray : Array = [];
}
You can then access this from any part in your application using DataModel.myArray. This is rarely a great solution because (like global variables) there is no way for one part of your application to know when the content of the array is modified by another part of the application. This means that even if your data entry GUI adds an object to the array, your data list GUI will not know to show the new data, unless you implement some other way of telling it to redraw.
Singleton wrapping array
Another way is to create a class called ArraySingleton, which wraps the actual array and provides access methods to it, and an instance of which can be accessed using the very common singleton pattern of keeping the single instance in a static variable.
public class ArraySingleton {

    private var _array : Array;
    private static var _instance : ArraySingleton;

    public static function get INSTANCE() : ArraySingleton {
        if (!_instance)
            _instance = new ArraySingleton();
        return _instance;
    }

    public function ArraySingleton() {
        _array = [];
    }

    public function get length() : uint {
        return _array.length;
    }

    public function push(object : *) : void {
        _array.push(object);
    }

    public function itemAt(idx : uint) : * {
        return _array[idx];
    }
}
This class wraps an array, and a single instance can be accessed through ArraySingleton.INSTANCE. This means that you can do:
var arr : ArraySingleton = ArraySingleton.INSTANCE;
arr.push('a');
arr.push('b');
trace(arr.length); // traces '2'
trace(arr.itemAt(0)); // traces 'a'
The great benefit of this is that you can dispatch events when items are added or when the array is modified in any other way, so that all parts of your application can be notified of such changes. You will likely want to expand on the example above by implementing more array-like interfaces, like pop(), shift(), unshift(), etc.
Dependency injection
A common pattern in large-scale application development is called dependency injection, and basically means that by marking your class in some way (AS3 meta-data is often used) you can signal that the framework should "inject" a reference into that class. That way, the class doesn't need to care about where the reference is coming from, but the framework will make sure that it's there.
A very popular DI framework for AS3 is Robotlegs.
NOTE: I discourage the use of Global Variables!
But here is your answer
You can go to your default package and create a file with the same name as your global variable, declaring the variable public:
//File: GlobalArray.as
package {
    public var GlobalArray:Array = [];
}
And that's it! You have a global variable. You can access it from your code (from anywhere) like this:
function DoSomething() {
    GlobalArray.push(new Object());
    GlobalArray.pop();
    for each (var object:* in GlobalArray) {
        //...
    }
}
As this question was linked recently, I will add something as well. I was advised to use a singleton ages ago and gave up on it as soon as I realized how namespaces and references work, and that basing everything on global variables is a bad idea.
Alternative
Note this is just a showcase and I do not advise you to use such an approach all over the place.
As an alternative to a singleton you could have:
public class Global {
    public static const myArray:Alternative = new Alternative();
}
and use it almost like a singleton:
var ga:Alternative = Global.myArray;
ga.e.addEventListener(GDataEvent.NEW_DATA, onNewData);
ga.e.addEventListener(GDataEvent.DATA_CHANGE, onDataChange);
ga.push(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "ten");
trace(ga[5]); // 5
And your Alternative.as would look similar to a singleton:
package adnss.projects.tchqs
{
    import flash.utils.Proxy;
    import flash.utils.flash_proxy;

    public class Alternative extends Proxy
    {
        private var _data:Array = [];
        private var _events:AltEventDisp = new AltEventDisp();
        private var _dispatching:Boolean = false;
        public var blockCircularChange:Boolean = true;

        public function Alternative() {}

        override flash_proxy function getProperty(id:*):* {
            var i:int = id;
            return _data[i += (i < 0) ? _data.length : 0];
            //return _data[id]; // version without negative index handling - var i:int could be removed.
        }

        override flash_proxy function setProperty(id:*, value:*):void {
            var i:int = id;
            if (_dispatching) { throw new Error("You cannot set data while DATA_CHANGE event is dispatching"); return; }
            i += (i < 0) ? _data.length : 0;
            if (i > 9) { throw new Error("You can override only the first 10 items without using push."); return; }
            _data[i] = value;
            if (blockCircularChange) _dispatching = true;
            _events.dispatchEvent(new GDataEvent(GDataEvent.DATA_CHANGE, i));
            _dispatching = false;
        }

        public function push(...rest) {
            var c:uint = -_data.length + _data.push.apply(null, rest);
            _events.dispatchEvent(new GDataEvent(GDataEvent.NEW_DATA, _data.length - c, c));
        }

        public function get length():uint { return _data.length; }
        public function get e():AltEventDisp { return _events; }
        public function toString():String { return String(_data); }
    }
}

import flash.events.EventDispatcher;

/**
 * Dispatched after data at an existing index is replaced.
 * @eventType adnss.projects.tchqs.GDataEvent
 */
[Event(name = "dataChange", type = "adnss.projects.tchqs.GDataEvent")]

/**
 * Dispatched after new data is pushed into the array.
 * @eventType adnss.projects.tchqs.GDataEvent
 */
[Event(name = "newData", type = "adnss.projects.tchqs.GDataEvent")]

class AltEventDisp extends EventDispatcher { }
The only difference from a singleton is that you can actually have multiple instances of this class, so you can reuse it like this:
public class Global {
    public static const myArray:Alternative = new Alternative();
    public static const myArray2:Alternative = new Alternative();
}
to have two separate global arrays, or even use it as an instance variable at the same time.
Note
Wrapping an array like this and using methods like myArray.get(x) or myArray[x] is obviously slower than accessing a raw array (see all the additional steps we take in setProperty), for example:
public static const staticArray:Array = [1,2,3];
On the other hand, with a raw array you don't have any control over access, and its contents can be changed from anywhere.
Caution about events
I should add that if you want to involve events in accessing data this way, you should be careful. As with every sharp blade, it's easy to get cut.
For example, consider what happens when you do this:
private function onDataChange(e:GDataEvent):void {
    trace("dataChanged at:", e.id, "to", Global.myArray[e.id]);
    Global.myArray[e.id]++;
    trace("new onDataChange is called before function exits");
}
The function is called after data in the array was changed, and inside that function you change the data again. Basically it's similar to doing something like this:
function f(x:Number) {
    f(++x);
}
You can see what happens in such a case if you toggle myArray.blockCircularChange. Sometimes you will intentionally want such recursion, but it is more likely that you will do it by accident. Unfortunately, Flash will silently stop dispatching such events without even telling you why, and this can be confusing.
Download full example here
Why is using global variables bad in most scenarios?
I guess there is plenty of information about that all over the internet, but to be complete I will add a simple example.
Consider that your app has some view where you display text, graphics, or most likely game content. Say you have a chess game. Maybe you have separated logic and graphics into two classes, but you want both to operate on the same pawns. So you create your Global.pawns variable and use it in both the Graphics and Logic classes.
Everything is fine and dandy and works flawlessly. Now you come up with a great idea: add an option for the user to play two matches at once, or even more. All you have to do is create another instance of your match... right?
Well, you are doomed at this point, because every single instance of your class will use the same Global.pawns array. Not only is the variable global, but you have also limited yourself to a single instance of each class that uses it. :/
So before you use any global variables, think twice about whether the thing you want to store in them is really global and universal across your entire app.
