Erlang mnesia database access

I have designed a mnesia database with 5 different tables. The idea is to simulate queries from many nodes (computers), not just one. At the moment I can execute a query from the terminal, but I need help making it so that I can request information from many computers. I am testing for scalability and want to investigate the performance of mnesia versus other databases. Any ideas will be highly appreciated.

The best way to test mnesia is with an intensive, concurrent workload both on the local Erlang node where mnesia is running and on remote nodes. Usually, you want remote nodes using RPC calls in which reads and writes are executed against the mnesia tables. Of course, high concurrency comes with a trade-off: transaction speed will drop, and many transactions may be retried since many locks may be held at a given time; but mnesia will ensure that all processes receive an {atomic,ok} for each transactional call they make.
The Concept
I propose that we build a non-blocking overload, with both writes and reads directed at each mnesia table by as many processes as possible. We measure the time difference between the call to the write function and the moment our massive mnesia subscriber receives a write event. These events are sent by mnesia after every successful transaction, so we need not interrupt the working/overloading processes; instead we let a "strong" mnesia subscriber wait for asynchronous events reporting successful deletes and writes as soon as they occur.

The technique is this: we take a timestamp just before calling a write function, and note down the record key together with that write CALL timestamp. Our mnesia subscriber then notes down the record key with the write/read EVENT timestamp. The difference between these two timestamps (let's call it the CALL-to-EVENT time) gives us a rough idea of how loaded, or how efficient, we are. As locks increase with concurrency, we should register an increasing CALL-to-EVENT time. Processes doing writes (unlimited) will do so concurrently, while those doing reads will continue without interruption. We will choose the number of processes for each operation, but let's first lay the groundwork for the entire test case.
All of the above applies to local operations (processes running on the same node as mnesia).
--> Simulating Many Nodes
Well, I have personally not simulated nodes in Erlang; I have always worked with real Erlang nodes on the same box or on several different machines in a networked environment. However, I advise that you look closely at this module: http://www.erlang.org/doc/man/slave.html, concentrate more on this one: http://www.erlang.org/doc/man/ct_slave.html, and look at the following links as they talk about creating, simulating and controlling many nodes under another parent node (http://www.erlang.org/doc/man/pool.html, Erlang: starting slave node, https://support.process-one.net/doc/display/ERL/Starting+a+set+of+Erlang+cluster+nodes, http://www.berabera.info/oldblog/lenglet/howtos/erlangkerberosremctl/index.html). I will not dive into a jungle of Erlang nodes here because it is another complicated topic, but I will concentrate on tests on the same node running mnesia. I have come up with the above mnesia test concept, so let's start implementing it.
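As a minimal sketch of the slave approach (assumptions of mine: the test module below is compiled as mnesia_bench, a name of our choosing, and "-pa ebin" points at wherever its beam file actually lives):

start_slaves(N)->
    {ok,Host} = inet:gethostname(),
    [begin
         Name = list_to_atom("load" ++ integer_to_list(I)),
         %% slave nodes inherit the master's cookie and die with it
         {ok,Node} = slave:start(list_to_atom(Host),Name,"-pa ebin"),
         Node
     end || I <- lists:seq(1,N)].

Load can then be started on each returned node with rpc:call(Node,mnesia_bench,start_write_jobs,[]).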
Now, first of all, you need to make a test plan for each table (separately). This should include both writes and reads. Then you need to decide whether you want to do dirty operations or transactional operations on the tables. You also need to test the speed of traversing a mnesia table in relation to its size (a small sketch of such a timing helper follows the record definition below). Let's take an example of a simple mnesia table:
-record(key_value,{key,value,instanceId,pid}).
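As an aside, the traversal-speed test mentioned above could be sketched like this (a helper of our own, not from the original answer), timing a full walk of the table with timer:tc/3:

traversal_time()->
    F = fun() -> mnesia:foldl(fun(_Rec,N) -> N + 1 end, 0, key_value) end,
    {Micros,{atomic,Count}} = timer:tc(mnesia,transaction,[F]),
    {Count,Micros}.  %% records traversed, and microseconds taken

Running it at different table sizes shows how traversal cost grows.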
We would want to have a general function for writing into our table, here below:
write(Record)->
    %% Use mnesia:activity/4 to test several activity
    %% contexts (and if your table is fragmented)
    %% like the commented code below:
    %%
    %% mnesia:activity(
    %%     transaction, %% sync_transaction | async_dirty | ets | sync_dirty
    %%     fun(Y) -> mnesia:write(Y) end,
    %%     [Record],
    %%     mnesia_frag
    %% )
    mnesia:transaction(fun() -> ok = mnesia:write(Record) end).
And for our reads, we will have:
read(Key)->
    %% Use mnesia:activity/4 to test several activity
    %% contexts (and if your table is fragmented)
    %% like the commented code below:
    %%
    %% mnesia:activity(
    %%     transaction, %% sync_transaction | async_dirty | ets | sync_dirty
    %%     fun(Y) -> mnesia:read({key_value,Y}) end,
    %%     [Key],
    %%     mnesia_frag
    %% )
    mnesia:transaction(fun() -> mnesia:read({key_value,Key}) end).
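As a baseline without transaction overhead, it may also be worth benchmarking the dirty counterparts (a small sketch of mine; same key_value table as above):

dirty_write(Record)-> mnesia:dirty_write(Record).

dirty_read(Key)-> mnesia:dirty_read({key_value,Key}).

Comparing the transactional CALL-to-EVENT times against the dirty ones gives a feel for how much of the latency is lock management rather than raw table access.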
Now, we want to write very many records into our small table. We need a key generator: our own pseudo-random string generator. However, we need our generator to tell us the instant it generates a key so we can record it. We want to see how long it takes to write a generated key. Let's put it down like this:
timestamp()-> os:timestamp().  %% erlang:now/0 is deprecated; same {Mega,Sec,Micro} shape

str(XX)-> integer_to_list(XX).

generate_instance_id()->
    %% NOTE: this must NOT call guid/0, because guid/0 calls this
    %% function; the original mutual recursion would loop forever.
    str(rand:uniform(65536 * 65536)) ++ str(erlang:phash2({self(),make_ref(),os:timestamp()})).

guid()->
    MD5 = erlang:md5(term_to_binary({self(),os:timestamp(),node(),make_ref()})),
    MD5List = binary_to_list(MD5),
    F = fun(N) -> lists:flatten(io_lib:format("~2.16.0B",[N])) end,
    L = lists:flatten([F(N) || N <- MD5List]),
    %% tell our massive mnesia subscriber about this generation
    InstanceId = generate_instance_id(),
    mnesia_subscriber ! {self(),{key,write,L,timestamp(),InstanceId}},
    {L,InstanceId}.
To make very many concurrent writes, we need a function which will be executed by the many processes we spawn. In this function, it is desirable NOT to put any blocking calls such as sleep/1, usually implemented as sleep(T)-> receive after T -> true end. Such a function would make a process hang for the specified milliseconds. mnesia_tm does the lock control, retries, blocking, etc. on behalf of the processes to avoid deadlocks. Let's say we want each process to write an unlimited number of records. Here is our function:
-define(NO_OF_PROCESSES,20).

start_write_jobs()->
    [spawn(?MODULE,generate_and_write,[]) || _ <- lists:seq(1,?NO_OF_PROCESSES)],
    ok.
generate_and_write()->
    %% Remember that in the function ?MODULE:guid/0,
    %% we inform our mnesia_subscriber about the generated key
    %% together with the timestamp of the generation, just before
    %% a write is made.
    %% The subscriber will note this down (in the benchmark table
    %% below) and then wait for the mnesia event about the write
    %% operation. It will then take the event timestamp and calculate
    %% the time difference, from which we can judge performance.
    %% In this case, we make the processes perform unlimited writes
    %% into our mnesia tables. Our subscriber will trap the events as
    %% soon as a successful write is made in mnesia.
    %% For all keys we just write a zero as the value.
    {Key,Instance} = guid(),
    write(#key_value{key = Key,value = 0,instanceId = Instance,pid = self()}),
    generate_and_write().
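These writer processes run forever until the VM is stopped. For controlled, repeatable runs, a bounded variant may be preferable (my own sketch, not part of the original answer):

generate_and_write(0)-> done;
generate_and_write(N) when N > 0 ->
    {Key,Instance} = guid(),
    write(#key_value{key = Key,value = 0,instanceId = Instance,pid = self()}),
    generate_and_write(N - 1).

Spawning it with, for example, spawn(?MODULE,generate_and_write,[100000]) gives every run a fixed workload, which makes different runs comparable.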
Likewise, let's see how the read jobs will be done.
We will have a key provider: a process that keeps rotating around the mnesia table, picking only keys, up and down the table. Here is its code:
first()-> mnesia:dirty_first(key_value).

next(FromKey)-> mnesia:dirty_next(key_value,FromKey).

start_key_picker()-> register(key_picker,spawn(fun() -> key_picker() end)).

key_picker()->
    try ?MODULE:first() of
        '$end_of_table' ->
            io:format("\n\tTable is empty, my dear !~n",[]),
            %% let's throw something in there to start with
            {Key,Instance} = guid(),
            ?MODULE:write(#key_value{key = Key,value = 0,instanceId = Instance,pid = self()}),
            key_picker();
        Key -> wait_key_reqs(Key)
    catch
        EXIT:REASON ->
            error_logger:error_report(["Key Picker dies",{EXIT,REASON}]),
            exit({EXIT,REASON})
    end.
wait_key_reqs('$end_of_table')->
    receive
        {From,<<"get_key">>} ->
            Key = ?MODULE:first(),
            From ! {self(),Key},
            wait_key_reqs(?MODULE:next(Key));
        {_,<<"stop">>} -> exit(normal)
    end;
wait_key_reqs(Key)->
    receive
        {From,<<"get_key">>} ->
            From ! {self(),Key},
            NextKey = ?MODULE:next(Key),
            wait_key_reqs(NextKey);
        {_,<<"stop">>} -> exit(normal)
    end.
key_picker_rpc(Command)->
    try erlang:send(key_picker,{self(),Command}) of
        _ ->
            receive
                {_,Reply} -> Reply
            after timer:seconds(60) ->
                %% key_picker hung, or is too busy
                erlang:throw({key_picker,hanged})
            end
    catch
        _:_ ->
            %% key_picker is dead; restart it and retry
            start_key_picker(),
            sleep(timer:seconds(5)),
            key_picker_rpc(Command)
    end.

%% the sleep/1 helper mentioned earlier
sleep(T)-> receive after T -> true end.
%% Now, this is where the reader processes will be
%% accessing keys. It will appear to them as though
%% it's random, because a single process is doing the
%% traversal. It will all be a game of chance,
%% depending on the scheduler's choice of
%% who gets the next read turn.
%% Okay, let's get going below :)
get_key()->
    Key = key_picker_rpc(<<"get_key">>),
    %% let's report to our "massive" mnesia subscriber
    %% about a read which is about to happen,
    %% together with a timestamp.
    Instance = generate_instance_id(),
    mnesia_subscriber ! {self(),{key,read,Key,timestamp(),Instance}},
    {Key,Instance}.
Wow !!! Now we need to create the function where we will start all the readers.
-define(NO_OF_READERS,10).

start_read_jobs()->
    [spawn(?MODULE,constant_reader,[]) || _ <- lists:seq(1,?NO_OF_READERS)],
    ok.

constant_reader()->
    {Key,InstanceId} = ?MODULE:get_key(),
    %% read/1 wraps mnesia:read/1 in a transaction, so unwrap the result
    {atomic,[Record]} = ?MODULE:read(Key),
    %% tell mnesia_subscriber that a read has been done so it creates a timestamp
    mnesia:report_event({read_success,Record,self(),InstanceId}),
    constant_reader().
Now for the biggest part: mnesia_subscriber !!! This is a simple process that subscribes to simple events. Get the mnesia events documentation from the Mnesia User's Guide. Here is the mnesia subscriber:
-record(read_instance,{
    instance_id,
    before_read_time,
    after_read_time,
    read_time          %% after_read_time - before_read_time
}).

-record(write_instance,{
    instance_id,
    before_write_time,
    after_write_time,
    write_time         %% after_write_time - before_write_time
}).

-record(benchmark,{
    id,                %% {pid(),Key}
    read_instances = [],
    write_instances = []
}).
subscriber()->
    mnesia:subscribe({table,key_value,simple}),
    %% let's also subscribe to system
    %% events, because events passed through
    %% mnesia:report_event/1 arrive as
    %% system events.
    mnesia:subscribe(system),
    wait_events().
-include_lib("stdlib/include/qlc.hrl").
wait_events()->
    receive
        {From,{key,write,Key,TimeStamp,InstanceId}} ->
            %% A process is just about to call
            %% mnesia:write/1, so we note this down
            Fun = fun() ->
                    case qlc:e(qlc:q([X || X <- mnesia:table(benchmark),X#benchmark.id == {From,Key}])) of
                        [] ->
                            ok = mnesia:write(#benchmark{
                                    id = {From,Key},
                                    write_instances = [
                                        #write_instance{
                                            instance_id = InstanceId,
                                            before_write_time = TimeStamp
                                        }]
                                    }),
                            ok;
                        [Here] ->
                            WIs = Here#benchmark.write_instances,
                            NewInstance = #write_instance{
                                    instance_id = InstanceId,
                                    before_write_time = TimeStamp
                                },
                            ok = mnesia:write(Here#benchmark{write_instances = [NewInstance|WIs]}),
                            ok
                    end
                  end,
            mnesia:transaction(Fun),
            wait_events();
        {mnesia_table_event,{write,#key_value{key = Key,instanceId = I,pid = From},_ActivityId}} ->
            %% A process has successfully made a write. So we look it up,
            %% take the timestamp difference, and finish benchmarking that write
            WriteTimeStamp = timestamp(),
            F = fun()->
                    [Here] = mnesia:read({benchmark,{From,Key}}),
                    WIs = Here#benchmark.write_instances,
                    {value,WriteInstance} = lists:keysearch(I,2,WIs),
                    BeforeTmStmp = WriteInstance#write_instance.before_write_time,
                    NewWI = WriteInstance#write_instance{
                            after_write_time = WriteTimeStamp,
                            write_time = time_diff(WriteTimeStamp,BeforeTmStmp)
                        },
                    ok = mnesia:write(Here#benchmark{write_instances = [NewWI|lists:keydelete(I,2,WIs)]}),
                    ok
                end,
            mnesia:transaction(F),
            wait_events();
        {From,{key,read,Key,TimeStamp,InstanceId}} ->
            %% A process is just about to do a read
            %% using mnesia:read/1, so we note this down
            Fun = fun()->
                    case qlc:e(qlc:q([X || X <- mnesia:table(benchmark),X#benchmark.id == {From,Key}])) of
                        [] ->
                            ok = mnesia:write(#benchmark{
                                    id = {From,Key},
                                    read_instances = [
                                        #read_instance{
                                            instance_id = InstanceId,
                                            before_read_time = TimeStamp
                                        }]
                                    }),
                            ok;
                        [Here] ->
                            RIs = Here#benchmark.read_instances,
                            NewInstance = #read_instance{
                                    instance_id = InstanceId,
                                    before_read_time = TimeStamp
                                },
                            ok = mnesia:write(Here#benchmark{read_instances = [NewInstance|RIs]}),
                            ok
                    end
                  end,
            mnesia:transaction(Fun),
            wait_events();
        {mnesia_system_event,{mnesia_user,{read_success,#key_value{key = Key},From,I}}} ->
            %% A process has successfully made a read. So we look it up,
            %% take the timestamp difference, and finish benchmarking that read
            ReadTimeStamp = timestamp(),
            F = fun()->
                    [Here] = mnesia:read({benchmark,{From,Key}}),
                    RIs = Here#benchmark.read_instances,
                    {value,ReadInstance} = lists:keysearch(I,2,RIs),
                    BeforeTmStmp = ReadInstance#read_instance.before_read_time,
                    NewRI = ReadInstance#read_instance{
                            after_read_time = ReadTimeStamp,
                            read_time = time_diff(ReadTimeStamp,BeforeTmStmp)
                        },
                    ok = mnesia:write(Here#benchmark{read_instances = [NewRI|lists:keydelete(I,2,RIs)]}),
                    ok
                end,
            mnesia:transaction(F),
            wait_events();
        _ -> wait_events()
    end.
time_diff({A2,B2,C2} = _After,{A1,B1,C1} = _Before)->
    {A2 - A1,B2 - B1,C2 - C1}.
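Note that this element-wise tuple difference is awkward to aggregate (individual fields can go negative). If you would rather work with a single number, here is a hedged alternative using the standard library:

%% microseconds between two {Mega,Sec,Micro} timestamps
time_diff_us(After,Before)-> timer:now_diff(After,Before).

timer:now_diff/2 returns plain microseconds, which makes the averages discussed below much easier to compute.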
Alright ! That was huge :) So we are done with the subscriber. We need to put the code that will crown it all together and run the necessary tests.
install()->
    mnesia:stop(),
    mnesia:delete_schema([node()]),
    mnesia:create_schema([node()]),
    mnesia:start(),
    {atomic,ok} = mnesia:create_table(key_value,[
        {attributes,record_info(fields,key_value)},
        {disc_copies,[node()]}
    ]),
    {atomic,ok} = mnesia:create_table(benchmark,[
        {attributes,record_info(fields,benchmark)},
        {disc_copies,[node()]}
    ]),
    mnesia:stop(),
    ok.
start()->
    mnesia:start(),
    ok = mnesia:wait_for_tables([key_value,benchmark],timer:seconds(120)),
    %% boot up our subscriber
    register(mnesia_subscriber,spawn(?MODULE,subscriber,[])),
    start_write_jobs(),
    start_key_picker(),
    start_read_jobs(),
    ok.
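For completeness, a hedged sketch of how the pieces above could be assembled (the module name mnesia_bench is my own choice; export_all is acceptable for a throwaway benchmark module):

-module(mnesia_bench).
-compile(export_all).
-include_lib("stdlib/include/qlc.hrl").
%% ...all the records and functions shown above go here...

Then, from a shell started with a node name (e.g. erl -name devel@127.0.0.1):

1> c(mnesia_bench).
2> mnesia_bench:install().
3> mnesia_bench:start().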
Now, with proper analysis of the benchmark table records, you will get average read times, average write times, etc. You can draw a graph of these times against an increasing number of processes; as the number of processes grows, you will see the read and write times increase. Get the code, read it and make use of it. You may not use all of it, but I am sure you could pick up new concepts from it as others send in their solutions. Using mnesia events is the best way to test mnesia reads and writes without blocking the processes doing the actual writing or reading. In the example above, the reading and writing processes are out of any control; in fact, they will run forever until you terminate the VM. You can traverse the benchmark table with a good formula to make use of the read and write times per read or write instance, and then calculate averages, variances, etc.
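As one hedged example of such an analysis (assumption: write_time was stored as plain microseconds, e.g. via the timer:now_diff/2 variant shown earlier):

average_write_time()->
    F = fun() ->
            qlc:e(qlc:q([W#write_instance.write_time
                         || B <- mnesia:table(benchmark),
                            W <- B#benchmark.write_instances,
                            W#write_instance.write_time =/= undefined]))
        end,
    {atomic,Times} = mnesia:transaction(F),
    lists:sum(Times) / max(1,length(Times)).  %% mean CALL-to-EVENT time, in microseconds

The same shape of query works for read_instances, and grouping by the pid inside the id field gives per-process figures.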
Testing from remote computers, simulating nodes, and benchmarking against other DBMSs may not be as relevant, for many reasons. The concepts, motivations and goals of mnesia are very different from those of several existing database types: document-oriented DBs, RDBMSs, object-oriented DBs, etc. In fact, mnesia ought to be compared with a database such as Ericsson's NDB (linked at the end of this answer). It is a distributed DBMS with hybrid/unstructured data structures that belong to the language Erlang. Benchmarking mnesia against another type of database may not be right, because its purpose is very different from many, as is its tight coupling with Erlang/OTP. However, knowledge of how mnesia works, of transaction contexts, indexing, concurrency and distribution, can be key to a good database design. Mnesia can store very complex data structures. Remember: the more complex a data structure is, with nested information, the more work is required to unpack it and extract the information you need at run-time, which means more CPU cycles and memory. Sometimes, normalization with mnesia may simply result in poor performance, so the implementation of its concepts is far removed from that of other databases.
It is good that you are interested in mnesia performance across several machines (distributed); however, the performance is only as good as Distributed Erlang is. The great thing is that atomicity is ensured for every transaction. Concurrent requests from remote nodes can still be sent via RPC calls. Remember that if you have multiple replicas of mnesia on different machines, processes running on each node will write to that very node, and mnesia will carry on from there with its replication. Mnesia is very fast at replication, unless the network is doing really badly and/or the nodes are not connected, or the network becomes partitioned at runtime.
Mnesia ensures consistency and atomicity of CRUD operations. For this reason, replicated mnesia databases depend heavily on network availability for good performance. As long as the Erlang nodes remain connected, two or more mnesia nodes will always have the same data. Reads on one node will ensure that you get the most recent information. Problems arise when a disconnection occurs and each node registers the other as being down. More information on mnesia's performance can be found at the following links:
http://igorrs.blogspot.com/2010/05/mnesia-one-year-later.html
http://igorrs.blogspot.com/2010/05/mnesia-one-year-later-part-2.html
http://igorrs.blogspot.com/2010/05/mnesia-one-year-later-part-3.html
http://igorrs.blogspot.com/2009/11/consistent-hashing-for-mnesia-fragments.html
As a consequence, the concepts behind mnesia can only be compared with Ericsson's NDB database, found here: http://www.dolphinics.no/papers/abstract/ericsson.html, but not with existing RDBMSs, document-oriented databases, etc. Those are my thoughts :) let's wait for what others have to say.....

You start additional nodes using a command like this:
erl -name test1@127.0.0.1 -cookie devel \
    -mnesia extra_db_nodes "['devel@127.0.0.1']" \
    -s mnesia start
where 'devel@127.0.0.1' is the node where mnesia is already set up. In this case all tables will be accessed from the remote node, but you can make local copies with mnesia:add_table_copy/3.
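For example, here is a one-liner (table name assumed from the first answer above) to replicate the key_value table onto the newly started node as a RAM copy:

mnesia:add_table_copy(key_value, node(), ram_copies).

Run it on the new node once it has joined; reads can then be served from the local copy.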
Then you can use spawn/2 or spawn/4 to start load generation on all nodes with something like:
lists:foreach(fun(N) ->
                  spawn(N, fun() ->
                               %% generate some load
                               ok
                           end)
              end,
              ['test1@127.0.0.1', 'test2@127.0.0.1']).
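Equivalently, if the benchmark module from the first answer is loaded on every node (mnesia_bench being my assumed module name), the whole write load can be kicked off remotely:

[rpc:call(N, mnesia_bench, start_write_jobs, [])
 || N <- ['test1@127.0.0.1', 'test2@127.0.0.1']].

rpc:call/4 runs the call on the remote node and returns its result, so the same mechanism can collect per-node results when a test finishes.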

Related

Flink - Grouping query to external system per operator instance while enriching an event

I am currently writing a streaming application where:
as an input, I am receiving some alerts from a kafka topic (1 alert is linked to 1 resource, for example 1 alert will be linked to my-router-1 or to my-switch-1 or to my-VM-1 or my-VM-2 or ...)
I need then to do a query to an external system in order to enrich the alert with some additional information linked to the resource on which the alert is linked
When querying the external system:
I do not want to do 1 query per alert and not even 1 query per resource
I rather want to do group queries (1 query for several alerts linked to several resources)
My idea was to have something like n buffers (n being a small number representing the number of queries that I will do in parallel), and then, for a given time period (let's say 100ms), put all alerts into one of those buffers, and at the end of those 100ms do my n queries in parallel (1 query being responsible for enriching several alerts belonging to several resources).
In Spark, it is something that I would do through mapPartitions (if I have n partitions, then I will do only n queries in parallel to my external system, and each query will be for all the alerts received during the micro-batch for one partition).
Now, I am currently looking at Flink and I haven't really found the best way of doing this kind of grouping when requesting an external system.
When looking at this kind of use case and especially at asyncio (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html), it seems that it deals with 1 query per key.
For example, I can very easily:
define the resource id as a key
define a processing time window of 100ms
and then do my query to the external system (synchronously, or maybe better, asynchronously through the asyncio feature)
But by doing so, I will do 1 query per resource (maybe for several alerts, but all linked to the same key, i.e. the same resource).
This is not what I want, as it will lead to too many queries to the external system.
I've then explored the option of defining a kind of technical key for my requests (something like the hashCode of my resource id % nb of queries I want to perform).
So, if I want to do max 4 queries in parallel, then my key will be something like "resourceId.hashCode % 4".
I was thinking that it was OK, but when looking more deeply at some metrics while running my job, I found that my queries were not well distributed over my 4 operator instances (only 2 of them were doing something).
This comes from the mechanism used to assign a key to a given operator instance:
public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
    return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
}
(In my case, parallelism is 4, maxParallelism is 128 and my key values are in the range [0,4[.) In this context, 2 of my keys go to operator instance 3 and 2 to operator instance 4, while operator instances 1 and 2 have nothing to do.
I was thinking that key 0 would go to operator 0, key 1 to operator 1, key 2 to operator 2 and key 3 to operator 3, but that is not the case.
So do you know what would be the best approach for this kind of grouping while querying an external system? I.e., 1 query per operator instance for all the alerts "received" by this operator instance during the last 100ms.
You can put an aggregator function upstream of the async function, where that function (using a timed window) outputs a record with <resource id><list of alerts to query>. You'd key the stream by the <resource id> ahead of the aggregator, which should then get pipelined to the async function.

Server out-of-memory issue when using RJDBC in a parallel computing environment

I have an R server with 16 cores and 8Gb ram that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through a RJDBC connection. When multiple workers are getting data from the SQL server, ram usage explodes and the R server crashes.
The strange thing is that the RAM usage of a worker loading in data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20MB when saved as an R object on disk and about 160MB when saved as a comma-delimited file. Yet, the RAM usage of the R session is about 2.3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets with names of all tables to query by number of records, such that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
  # note the trailing space after "from", needed for a valid query
  nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge, decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
  # force reconnect to SQL server
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
  try(dbDisconnect(con), silent = T)
  con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<user>","<pass>")
  # remove the to.merge object (warns on the first pass, when it does not exist yet)
  rm(complete.data.to.merge)
  # force garbage collection
  gc()
  jgc()
  # ask database for dataset d (again, note the space after "from")
  complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))
  # first dataset
  if (d == merge.order[1]){
    complete.data <- complete.data.to.merge
    colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
  }
  # later dataset
  else {
    complete.data <- merge(
      x = complete.data,
      y = complete.data.to.merge,
      by.x = "key_1", by.y = "key", all.x = T)
  }
}
return(complete.data)
When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely the merge call somehow blows up the usage. Over the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the dataset at the end is at most 162 MB...
As you can see in the code above, I've already tried a couple of fixes, like calling gc() and jgc() (the latter being a function to force a Java garbage collection: jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into column count constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160MB does not mean that the amount of memory needed to process it is 160MB. Many algorithms require O(n^2) storage space, which would be in the GBs for your chunk of data. So I actually don't see anything here which is surprising.
I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, but it is free to ignore the request. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting the overhead which R needs to crunch your data.

What Erlang data structure to use for ordered set with the possibility to do lookups?

I am working on a problem where I need to remember the order of events I receive, but I also need to look up an event by its id. How can I do this efficiently in Erlang, if possible without a third-party library? Note that I have many potentially ephemeral actors, each with their own events (I already considered mnesia, but it requires atoms for table names and the tables would stick around if my actor died).
-record(event, {id, timestamp, type, data}).
Based on the details included in the discussion in comments on Michael's answer, a very simple, workable approach would be to create a tuple in your process state variable that stores the order of events separately from the K-V store of events.
Consider:
%%% Some type definitions so we know exactly what we're dealing with.
-type id()     :: term().
-type type()   :: atom().
-type data()   :: term().
-type ts()     :: calendar:datetime().
-type event()  :: {id(), ts(), type(), data()}.
-type events() :: dict:dict(id(), {type(), data(), ts()}).

% State record for the process.
% Should include whatever else the process deals with.
-record(s,
        {log    :: [id()],
         events :: events()}).

%%% Interface functions we will expose over this module.

-spec lookup(pid(), id()) -> {ok, event()} | error.
lookup(Pid, ID) ->
    gen_server:call(Pid, {lookup, ID}).

-spec latest(pid()) -> {ok, event()} | error.
latest(Pid) ->
    gen_server:call(Pid, get_latest).

-spec notify(pid(), event()) -> ok.
notify(Pid, Event) ->
    gen_server:cast(Pid, {new, Event}).

%%% gen_server handlers

handle_call({lookup, ID}, _From, State = #s{events = Events}) ->
    Result = find(ID, Events),
    {reply, Result, State};
handle_call(get_latest, _From, State = #s{log = [Last | _], events = Events}) ->
    Result = find(Last, Events),
    {reply, Result, State};
% ... and so on...

handle_cast({new, Event}, State) ->
    {ok, NewState} = catalog(Event, State),
    {noreply, NewState};
% ...

%%% Implementation functions

find(ID, Events) ->
    case dict:find(ID, Events) of
        {ok, {Type, Data, Timestamp}} -> {ok, {ID, Timestamp, Type, Data}};
        error                         -> error
    end.

catalog({ID, Timestamp, Type, Data},
        State = #s{log = Log, events = Events}) ->
    NewEvents = dict:store(ID, {Type, Data, Timestamp}, Events),
    NewLog = [ID | Log],
    {ok, State#s{log = NewLog, events = NewEvents}}.
This is a completely straightforward implementation that hides the details of the data structure behind the interface of the process. Why did I pick a dict? Just because (it's easy). Without knowing your requirements better, I really have no reason to pick a dict over a map over a gb_tree, etc. If you have relatively small data (hundreds or thousands of things to store) the performance isn't usually noticeably different among these structures.
The important thing is that you clearly identify what messages this process should respond to and then force yourself to stick to it elsewhere in your project code by creating an interface of exposed functions over this module. Behind that you can swap out the dict for something else. If you really only need the latest event ID and won't ever need to pull the Nth event from the sequence log then you could ditch the log and just keep the last event's ID in the record instead of a list.
So get something very simple like this working first, then determine if it actually suits your need. If it doesn't then tweak it. If this works for now, just run with it -- don't obsess over performance or storage (until you are really forced to).
If you find later on that you have a performance problem, switch out the dict and list for something else -- maybe gb_tree or orddict or ETS or whatever. The point is to get something working right now, so you have a base from which to evaluate the functionality and run benchmarks if necessary. (The vast majority of the time, though, I find that whatever I start out with as a specced prototype turns out to be very close to whatever the final solution will be.)
Your question makes it clear you want to look up by ID, but it's not entirely clear whether you also want to look up or traverse your data by time, and what operations you might want to perform in that regard; you say "remember the order of events", but storing your records with an index on the ID field will accomplish that.
If you only have to look up by ID then any of the usual suspects will work as a suitable storage engine, so ets, gb_trees and dict for example would all be good. Don't use mnesia unless you need the transactions and safety and all those good features; mnesia is good, but there is a high performance price to be paid for all that stuff, and it's not clear from your question that you need it.
If you do want to look up or traverse your data by time, then consider an ets table of type ordered_set. If that can do what you need, it's probably a good choice. In that case you would employ two tables: one set to provide a hash lookup by ID, and another ordered_set to look up or traverse by timestamp.
If you have two different lookup methods like this, there's no getting around the fact that you need two indexes. You could store the whole record in both, or, assuming your IDs are unique, you could store just the ID as the data in the ordered_set. Which you choose is really a trade-off between storage utilisation and read and write performance.
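A minimal sketch of that two-table layout (the names and record shape are mine; it assumes unique IDs and timestamps that sort correctly as ETS keys):

new_store()->
    %% both tables are owned by the calling process and are deleted
    %% automatically when it dies, which suits ephemeral actors
    {ets:new(events_by_id,[set]), ets:new(events_by_time,[ordered_set])}.

add({ById,ByTime}, {ID,Timestamp,Type,Data})->
    ets:insert(ById, {ID,Timestamp,Type,Data}),
    ets:insert(ByTime, {Timestamp,ID}).  %% store only the ID under its timestamp

by_id({ById,_}, ID)-> ets:lookup(ById, ID).

latest({ById,ByTime})->
    case ets:last(ByTime) of
        '$end_of_table' -> error;
        Ts ->
            [{Ts,ID}] = ets:lookup(ByTime, Ts),
            ets:lookup(ById, ID)
    end.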

Inconsistency in App Engine datastore vs what I know it should be from parsing the same data source locally

This may be a trivial question, but I was just hoping to get some practical experience from people who may know more about this than I do.
I wanted to generate a database in GAE from a very large series of XML files -- as a form of validation, I am calculating statistics on the GAE datastore, and I know there should be ~16,000 entities, but when I perform a count, I'm getting more on the order of 12,000.
The way I'm doing the counting is basically: I perform a filter, fetch a page of 1000 entities, and then spin up a task for each entity (using its key). Each task then adds "1" to a counter that I'm storing.
I think I may have juiced the datastore writes too much; I set the rate of my task queues to 50/s. I did get some write errors, but not nearly enough to justify the 4,000 difference. Could it be that I was rushing the counting calls so much that it led to inconsistency? Would slowing the rate at which I process task queues to something like 5/s solve the problem? Thanks.
You can count your entities very easily (no tasks and almost for free):
int total = 0;
Query q = new Query("entity_kind").setKeysOnly();
// set your filter on this query
QueryResultList<Entity> results;
Cursor cursor = null;
FetchOptions queryOptions = FetchOptions.Builder.withLimit(1000).chunkSize(1000);
do {
    if (cursor != null) {
        queryOptions.startCursor(cursor);
    }
    results = datastore.prepare(q).asQueryResultList(queryOptions);
    total += results.size();
    cursor = results.getCursor();
} while (results.size() == 1000);
System.out.println("Total entities: " + total);
UPDATE:
If looping like I suggested takes too long, you can spin up a task for every 100/500/1000 entities; it's definitely more efficient than creating a task for each entity. Even very complex calculations should take milliseconds in Java if done right.
For example, each task can retrieve a batch of entities, spin up a new task (passing a query cursor to this new task), and then proceed with your calculations.

Which NDB query function is more efficient to iterate through a big set of query results?

I use NDB for my app and use iter() with a limit and a starting cursor to iterate through 20,000 query results in a task. A lot of the time I run into a timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit, but I thought a task can run as long as 10 minutes. 10 minutes should be more than enough time to go through 20,000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same underlying mechanism. The exception is if you know how many entities you want up front, in which case fetch can be more efficient, as you end up with just one round trip.
You can increase the batch size for iter, that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs the default batch size is 20, which would mean a lot of batches for 20,000 entities.
Other things that can help: consider using map and/or map_async for the processing, rather than explicitly calling process(entity). Have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map. Introducing async into your processing can also mean improved concurrency.
Having said all of that, you should profile so you can understand where the time is used. For instance, the delays could be in your processing due to things you are doing there.
There are other things to consider with ndb, like context caching, which you need to disable. But I have also used the iter method for these cases. I also made an ndb version of the mapper api with the old db.
Here is my ndb mapper api, which should solve the timeout problems and ndb caching, and makes it easy to create this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper api you can create something like the following, or improve on it:
class NameYourJob(Mapper):

    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)

# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an online maintenance handler we use, which generally runs within ten seconds; if not, we get a return code saying it needs to be run again, thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task. Here is some sample code, edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False

# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
