Ada + PolyORB: can I have multiple instances of the same RCI unit and call a specific one from code?

I'm developing a distributed application using Ada's Distributed Systems Annex (DSA) and PolyORB, and I have a somewhat peculiar request.
Let's say I have an RCI (Remote Call Interface) unit called U and a main program called MAIN that uses it.
What I want to know is:
1) Can I configure DSA to create multiple copies of the partition U?
2) If the answer is yes, can I then call a specific one of these partitions from my code in MAIN?
I can't find any information on this online. Right now the only solution I can think of is to have a pre-processor generate multiple "copies" of U from a generic template and patch the DSA configuration file accordingly. Is there a less "hacky" way to do this?

After reading all the documentation I could find, I conclude that this is not possible, i.e. it's not contemplated in DSA.
I still think the hacky solution I proposed in my original question would work; however, it doesn't make much sense from a practical point of view.
For my application, I decided to switch to SOAP, using AWS (the Ada Web Server) and ditching DSA altogether.
EDIT: in response to Shark8's comment:
My idea, in detail, was this. Let's assume I have some kind of configuration file in which I describe how many copies of the partition I want and how each one is configured. Here's how I'd do it:
1) Have a pre-processor read the configuration file and generate source files for all the needed partitions from a template (basically this would just create a copy of the files and alter the package name and file name to avoid collisions).
2) Have the pre-processor patch the DSA configuration file accordingly, to generate those partitions.
3) I would then have one final package that acts as a "gateway", or a "selector" if you want, to the other packages. It would contain "generic" procedures that would act as selectors, something like:
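   procedure Some_Procedure (partitionID : Integer; params : whatever) is
   begin
      --  Forward the call to the copy of the partition selected by the caller.
      case partitionID is
         when 1 =>
            Partition1.Some_Procedure (params);
         when 2 =>
            Partition2.Some_Procedure (params);
         --  etc...
         when others =>
            raise Constraint_Error;  --  unknown partition ID
      end case;
   end Some_Procedure;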
Obviously, like the partitions themselves, the various branches of the case statement will have to be generated by the pre-processor. Also, the main program that calls this procedure needs to pass it the partition ID, but this is no problem: it will have read the same configuration as the pre-processor did in step 1), so it knows the IDs and which one it wants to call, depending on its internal logic.
As I said, it's a tortuous method, but I think it should work.
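The generation of the case branches by the pre-processor could be sketched roughly as follows. This is only an illustration: Python is used here as a convenient scripting language for the generator, `render_gateway` is a hypothetical helper name, the `PartitionN` package naming follows the example above, and the `when others` branch is an assumption added so that the generated case statement covers all Integer values.

```python
def render_gateway(partition_ids, procedure="Some_Procedure"):
    """Return Ada source text for a gateway procedure.

    One 'when' branch is emitted per partition ID, each forwarding the
    call to the generated package PartitionN (hypothetical naming).
    """
    lines = [
        f"procedure {procedure} (partitionID : Integer; params : whatever) is",
        "begin",
        "   case partitionID is",
    ]
    for pid in partition_ids:
        lines.append(f"      when {pid} =>")
        lines.append(f"         Partition{pid}.{procedure} (params);")
    lines += [
        "      when others =>",
        "         raise Constraint_Error;",
        "   end case;",
        f"end {procedure};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # In the real pre-processor the IDs would come from the configuration file.
    print(render_gateway([1, 2, 3]))
```

The same list of IDs would also drive the copying of the template sources and the patching of the DSA configuration file, so all three outputs stay consistent.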

The DSA (PolyORB) is very flexible, and you can have multiple unique RCI units. Why would you want to do this? Is it to maximise processing speed? Having used the DSA quite a bit, I would question your motives for doing it this way and perhaps suggest that you change your design.
If it's processing speed, I would suggest you do your processing in multiple client partitions and use the asynchronous messaging mechanism in the DSA. If you need more processing speed, then do your calculation/processing in multiple stages across client partitions.
There's a great example of this messaging mechanism in the banking example application included in the DSA examples.

Related

What type of database for storing ML experiments

So I'm thinking of writing a small piece of software that runs ML experiments on a cluster or an arbitrary abstracted executor and then saves the results so that I can view them efficiently in real time. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code, or an archive of code, that can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command-line arguments. The experiment will also include a YAML schema defining the command-line arguments. The code itself will specify what gets logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns: first an int iteration, second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When someone submits an experiment, they will need to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
For me, the easiest way to implement this is a flat file system. Each project will have a unique name. Each new experiment gets a unique ID and a folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be a delimited CSV file, with a separate schema file describing what types of values are stored in it so I can load them back. The final parameters can also be copied into the folder.
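A minimal sketch of that flat-file layout, to make the description concrete. Everything here is an assumption made for illustration: the function names, the `params.json` and `<channel>.schema.json` file names, and the small set of supported column types are all hypothetical.

```python
import csv
import json
import tempfile
import uuid
from pathlib import Path

# Supported column types for this sketch (an assumption, not a full list).
CASTS = {"int": int, "float": float, "str": str}


def create_experiment(root, project, params):
    """Create <root>/<project>/<unique id>/ and store the parameters there."""
    exp_dir = Path(root) / project / uuid.uuid4().hex
    exp_dir.mkdir(parents=True)
    (exp_dir / "params.json").write_text(json.dumps(params))
    return exp_dir


def create_channel(exp_dir, name, columns):
    """Register a channel; columns is a list of (column_name, type_name)."""
    schema = [{"name": col, "type": typ} for col, typ in columns]
    (exp_dir / f"{name}.schema.json").write_text(json.dumps(schema))
    with open(exp_dir / f"{name}.csv", "w", newline="") as f:
        csv.writer(f).writerow([col for col, _ in columns])  # header row


def log_row(exp_dir, name, row):
    """Append one row of values to the channel's CSV file."""
    with open(exp_dir / f"{name}.csv", "a", newline="") as f:
        csv.writer(f).writerow(row)


def load_channel(exp_dir, name):
    """Read a channel back, casting each column according to its schema."""
    schema = json.loads((exp_dir / f"{name}.schema.json").read_text())
    casts = [CASTS[col["type"]] for col in schema]
    with open(exp_dir / f"{name}.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(c(v) for c, v in zip(casts, row)) for row in reader]


if __name__ == "__main__":
    exp = create_experiment(tempfile.mkdtemp(), "demo_project", {"lr": 0.1})
    create_channel(exp, "loss", [("iteration", "int"), ("error", "float")])
    log_row(exp, "loss", [0, 0.5])
    print(load_channel(exp, "loss"))
```

Because each channel carries its own schema file, a web front end could read any experiment folder without knowing in advance which channels it contains.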
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Additionally, maybe I'm overlooking something very obvious, or maybe not; if you have any experience with this, any suggestions or advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
The data for ML would primarily be unstructured data. That kind of data does not naturally fit into an RDBMS. A document database like MongoDB is far better suited to such cases.

Achieving high-performance transactions when extending PostgreSQL with C-functions

My goal is to achieve the highest performance available for copying a block of data from the database into a C-function to be processed and returned as the result of a query.
I am new to PostgreSQL and I am currently researching possible ways to move the data. Specifically, I am looking for PostgreSQL-specific nuances or keywords for moving big data fast.
NOTE:
My ultimate goal is speed, so I am willing to accept answers outside the exact question I have posed as long as they give big performance gains. For example, I have come across the COPY keyword (PostgreSQL only), which moves data quickly from tables to files and vice versa. I am trying to stay away from processing that is external to the database, but if it provides a performance improvement that outweighs the obvious drawback of external processing, then so be it.
It sounds like you probably want to use the server programming interface (SPI) to implement a stored procedure as a C language function running inside the PostgreSQL back-end.
Use SPI_connect to set up the SPI.
Now SPI_prepare_cursor a query, then SPI_cursor_open it. SPI_cursor_fetch rows from it and SPI_cursor_close it when done. Note that SPI_cursor_fetch allows you to fetch batches of rows.
SPI_finish to clean up when done.
You can return the result rows into a tuplestore as you generate them, avoiding the need to build the whole table in memory. See examples in any of the set-returning functions in the PostgreSQL source code. You might also want to look at the SPI_returntuple helper function.
See also: C language functions and extending SQL.
If maximum speed is of interest, your client may want to use the libpq binary protocol via libpqtypes so it receives the data produced by your server-side SPI-using procedure with minimal overhead.

Is there a way to translate database table rows into Prolog facts?

After doing some research, I was amazed with the power of Prolog to express queries in a very simple way, almost like telling the machine verbally what to do. This happened because I've become really bored with Propel and PHP at work.
So, I've been wondering if there is a way to translate database table rows (Postgres, for example) into Prolog facts. That way, I could stop using so many boring joins and using ORM, and instead write something like this to get what I want:
mantenedora_ies(ID_MANTENEDORA, ID_IES) :-
    papel_pessoa(ID_PAPEL_MANTENEDORA, ID_MANTENEDORA, 1),
    papel_pessoa(ID_PAPEL_IES, ID_IES, 6),
    relacionamento_pessoa(_, ID_PAPEL_IES, ID_PAPEL_MANTENEDORA, 3).
To see why I've become bored, look at this post. The code there would be replaced by the few simple lines above, which are much easier to read and understand. I'm just curious about this, since it will be impossible to replace things around here.
It would also be cool if something like that could be done in PHP. Does anyone know of something like that?
Check the ODBC interface of SWI-Prolog (maybe there is something equivalent for other Prolog implementations too):
http://www.swi-prolog.org/pldoc/doc_for?object=section%280,%270%27,swi%28%27/doc/packages/odbc.html%27%29%29
I can think of a few approaches to this:
1. On initialization, call a predicate that selects all data from a table and asserts it into the Prolog database. Do this for each db. You will need to declare the shape of each row, as in :- dynamic ies_row/4 etc.
2. You could modify load_files by overriding user:prolog_load_file/2. From this hook you could do something similar to #1. This has the benefit of looking like a load_files call. http://www.swi-prolog.org/pldoc/man?predicate=prolog_load_file%2F2 ... This documentation mentions library(http_load), but I cannot find this anywhere (I was interested in this recently)!
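As an offline variation on the first approach, rows fetched from the database can be written out as a file of facts that Prolog then consults. The sketch below is only an illustration under some assumptions: `prolog_term` and `rows_to_facts` are hypothetical helper names, integers are emitted as numbers, SQL NULL (Python `None`) is mapped to the atom `null`, and every other value becomes a quoted atom. Python is used here only as a scripting language; the rows would come from whatever database driver is at hand.

```python
def prolog_term(value):
    """Render one column value as a Prolog term."""
    if value is None:
        return "null"  # assumption: represent SQL NULL as the atom null
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    # Everything else becomes a quoted atom; escape backslashes and quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def rows_to_facts(functor, rows):
    """Turn an iterable of row tuples into Prolog fact clauses, one per line."""
    clauses = []
    for row in rows:
        args = ", ".join(prolog_term(v) for v in row)
        clauses.append(f"{functor}({args}).")
    return "\n".join(clauses)


if __name__ == "__main__":
    # Rows as they might come back from a SELECT on papel_pessoa.
    rows = [(10, 20, 1), (11, 21, 6)]
    print(rows_to_facts("papel_pessoa", rows))
```

The resulting text can be saved to a .pl file and loaded with consult/1, after which rules such as mantenedora_ies/2 from the question work directly on the dumped facts.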
There is Draxler's Prolog-to-SQL compiler, which translates some patterns (like the conjunction you wrote) into the more verbose SQL joins. You can find more info in the related post (Prolog to SQL converter).
But beware that Prolog has its weaknesses too, especially regarding aggregates. Without a library, getting sums, counts and the like is not very easy, and such libraries are neither common nor easy to use.
I think you could try to specialize the PHP DB interface for equijoins, using the built-in features that allow you to shorten the query text (when this results in more readable code). Working in SWI-Prolog / ODBC, where (like in PHP) you need to compose SQL, I effectively found myself working that way to handle something very similar to what you have shown in the other post.
Another approach I found useful: I wrote a parser for the subset of SQL used by the MySQL backup interface (PHPMyAdmin, really). So I routinely dump my CMS's DB locally, load it into memory, apply whatever task I need, computing and writing (or applying) the insert/update/delete statements, then upload these. This is possible because the DB is small enough to fit in memory. I developed, and now maintain, a small e-commerce site with this naive approach.
Writing Prolog from PHP should not be too difficult: I'd try to modify an existing interface, like the awesome Adminer, which already offers a choice among basic serialization formats.

Delayed Table Initialization

Using the net-snmp API and using mib2c to generate the skeleton code, is it possible to support delayed initialization of tables? What I mean is, the table would not be initialized until one of its members is queried directly. The reason for this is that the member data is obtained from another server, and I'd like to be able to start the snmpd daemon without requiring the other server to be online/ready for requests. I thought of maybe initializing the table with dummy data that gets updated with the real values when a member is queried, but I'm not sure if this is the best way.
The table also has only one row of entries, so using mib2c.iterate.conf to generate table iterators and dealing with all of that just seems unnecessary. I thought of maybe just implementing the sequence defined in the MIB and not the actual table, but that's not usually how it's done in the examples I've seen. I looked at /mibgroup/examples/delayed_instance.c, but that's not quite what I'm looking for. Using mib2c with the mib2c.create-dataset.conf config file was the closest I got to getting this to work easily, but this config file assumes the data is static and not external (neither of which is true in my case), so it won't work. If it's not easily done, I'll probably just implement the sequence and not the table, but I'm hoping there's an easy way. Thanks in advance.
The iterator method will work just fine. It won't load any data until it calls your _first and _next routines. So it's up to you, in those routines and in the _handler routine, to request the data from the remote server. In fact, by default, it doesn't cache data at all so it'll make you query your remote server for every request. That can be slow if you have a lot of data in the table, so adding a cache to store the data for N seconds is advisable in that case.
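The "cache for N seconds" idea can be sketched independently of net-snmp. The following is a language-neutral illustration written in Python, not net-snmp code: `TimedCache` and its `fetch` callback are hypothetical names, and `fetch` stands in for the request to the remote server. Nothing is fetched until the first query, which also gives the delayed initialization asked about.

```python
import time


class TimedCache:
    """Fetch data lazily and reuse it until it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the remote server
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value = None
        self._expires_at = None      # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First query, or cached data too old: go to the remote server.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


if __name__ == "__main__":
    cache = TimedCache(lambda: {"row": "data from remote server"}, ttl_seconds=30)
    print(cache.get())  # triggers the first fetch
    print(cache.get())  # served from the cache
```

In the agent, the _first, _next and _handler routines would all read through one such cache, so a single SNMP walk does not trigger a remote request per column.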

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20Mb per TU.
Reading the following SO answer made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, i.e. a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute or relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write-once, read-many-times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.

Resources