Achieving high-performance transactions when extending PostgreSQL with C-functions

My goal is to achieve the highest performance available for copying a block of data from the database into a C-function to be processed and returned as the result of a query.
I am new to PostgreSQL and I am currently researching possible ways to move the data. Specifically, I am looking for PostgreSQL-specific techniques or keywords for moving large amounts of data quickly.
NOTE:
My ultimate goal is speed, so I am willing to accept answers outside of the exact question I have posed as long as they get big performance results. For example, I have come across the COPY command (PostgreSQL only), which moves data from tables to files quickly, and vice versa. I am trying to stay away from processing that is external to the database, but if it provides a performance improvement that outweighs the obvious drawback of external processing, then so be it.

It sounds like you probably want to use the server programming interface (SPI) to implement a stored procedure as a C language function running inside the PostgreSQL back-end.
Use SPI_connect to set up the SPI.
Now SPI_prepare_cursor a query, then SPI_cursor_open it. SPI_cursor_fetch rows from it and SPI_cursor_close it when done. Note that SPI_cursor_fetch allows you to fetch batches of rows.
SPI_finish to clean up when done.
You can return the result rows into a tuplestore as you generate them, avoiding the need to build the whole table in memory. See examples in any of the set-returning functions in the PostgreSQL source code. You might also want to look at the SPI_returntuple helper function.
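To make the flow concrete, here is a minimal sketch of a materialize-mode set-returning function built on that pattern. The function name, query and batch size are placeholders, error handling is reduced to elog calls, the usual checks on rsinfo/allowedModes are omitted, and the query's result columns must match the function's declared return type.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "executor/spi.h"
#include "utils/tuplestore.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(fetch_big_table);

Datum
fetch_big_table(PG_FUNCTION_ARGS)
{
    ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc        tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext    oldcontext;
    SPIPlanPtr       plan;
    Portal           portal;

    /* The tuplestore must live in the per-query memory context so the
     * executor can still read it after this function returns. */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    /* Prepare and open a cursor over the data to be processed. */
    plan = SPI_prepare_cursor("SELECT id, payload FROM big_table", 0, NULL, 0);
    portal = SPI_cursor_open(NULL, plan, NULL, NULL, true);

    for (;;)
    {
        uint64 i;

        SPI_cursor_fetch(portal, true, 1000);       /* fetch a batch of rows */
        if (SPI_processed == 0)
            break;

        for (i = 0; i < SPI_processed; i++)
        {
            /* ... process SPI_tuptable->vals[i] here ... */
            tuplestore_puttuple(tupstore, SPI_tuptable->vals[i]);
        }
        SPI_freetuptable(SPI_tuptable);
    }

    SPI_cursor_close(portal);
    SPI_finish();

    return (Datum) 0;
}

You would register it with something like CREATE FUNCTION fetch_big_table() RETURNS SETOF big_table AS 'MODULE_PATHNAME' LANGUAGE C STRICT; and then call SELECT * FROM fetch_big_table();.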
See also: C language functions and extending SQL.
If maximum speed is of interest, your client may want to use the libpq binary protocol via libpqtypes so it receives the data produced by your server-side SPI-using procedure with minimal overhead.
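As an illustration (not part of the original answer), a plain-libpq client can request binary results by passing resultFormat = 1 to PQexecParams; libpqtypes layers typed getters over the same mechanism. The connection string, the query and the assumption that column 0 is an int4 are placeholders here.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>      /* ntohl */
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=mydb");
    PGresult *res;
    int       row;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* resultFormat = 1 (last argument) requests binary output, skipping the
     * text conversion on both server and client. */
    res = PQexecParams(conn, "SELECT id FROM fetch_big_table()",
                       0, NULL, NULL, NULL, NULL, 1);
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        for (row = 0; row < PQntuples(res); row++)
        {
            uint32_t raw;                       /* int4 arrives big-endian */
            memcpy(&raw, PQgetvalue(res, row, 0), sizeof(raw));
            printf("id = %u\n", ntohl(raw));
        }
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}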

Related

What type of database for storing ML experiments

So I'm thinking of writing a small piece of software to run/execute ML experiments on a cluster or an arbitrary abstracted executor and then save the results such that I can view them in real time efficiently. The executor software will have write access to the database and will push metrics live. Now, I have not worked much with databases, so I'm not sure what the correct approach for this is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code (or archive of code) such that it can be executed on the remote machine; for now we will assume all dependencies etc. are installed there. The code will accept command line arguments, and the experiment will also include a YAML schema defining those command line arguments. The code itself will specify what gets logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, the first an int iteration, the second a float error). The code will also provide a copy of the final parameters at the end of the experiment.
When one submits an experiment, one will need to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
Implementing this would be easiest for me with a flat file system. Each project would have a unique name. Each new experiment gets a unique id and a folder inside the project, where I can store the code. Each channel gets a file, which for simplicity can be a delimited CSV, with a schema file describing what types of values are stored there so I can load them back. The final parameters can also be copied into the folder.
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Additionally, maybe I'm overlooking something very obvious; if you have any experience with this, any suggestions/advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
The data for ML would primarily be unstructured. That kind of data will not fit naturally into an RDBMS; essentially, a document database like MongoDB is far better suited for such cases.

Tool for manipulating results of large set of computational experiments

I am a researcher and my primary interest is improving sparse kernels for high-performance computing. I investigate a large number of parameters on many sparse matrices, and I wonder whether there is a tool to manage the results. The problems that I encounter are:
Combining the results of several experiments for each matrix
Versioning the results
Taking the average and finding the minimum/maximum/standard deviation of the results
There are hundreds of metrics that describe the performance improvement. I want to select a couple of the metrics easily and try to find which metric correlates with the performance improvement.
Here is a small sample instance of my much larger problem. There are three types of parameters and two values for each parameter: Row/Column, Cyclic/Block, HeuristicA/HeuristicB. So there must be 8 files for the combinations of these parameters. Here are the contents of two of them:
Contents of the file RowCyclicHeuristicA.txt
a.mtx#3#5.1#10#2%#row#cyclic#heuristicA#1
a.mtx#7#4.1#10#4%#row#cyclic#heuristicA#2
b.mtx#4#6.1#10#3%#row#cyclic#heuristicA#1
b.mtx#12#5.7#10#7%#row#cyclic#heuristicA#2
b.mtx#9#3.1#10#10%#row#cyclic#heuristicA#3
Contents of the file ColumnCyclicHeuristicA.txt
a.mtx#3#5.1#10#5%#column#cyclic#heuristicA#1
a.mtx#1#5.3#10#6%#column#cyclic#heuristicA#2
b.mtx#4#7.1#10#5%#column#cyclic#heuristicA#1
b.mtx#3#5.7#10#9%#column#cyclic#heuristicA#2
b.mtx#5#4.1#10#3%#column#cyclic#heuristicA#3
I have a schema file describing the contents of these files. It has a line giving the type and meaning of each column in the result files:
str MatrixName
int Speedup
double Time
int RepetationCount
double Imbalance
str Parameter1
str Parameter2
str Parameter3
int ExperimentId
I need to display the average Time broken down by two of the parameters, as follows (the numbers in the table below are random):
           Parameter1      Parameter2
Matrix     row     col     cyclic  block
a.mtx      4.3     5.2     4.2     5.4
b.mtx      2.1     6.3     8.4     3.3
Is there an advanced and sophisticated tool that takes the schema of the table above and generates this table automatically? Currently I have a tool written in Java to process the raw files and LaTeX code to manipulate and display the table using pgfplotstable. However, I need a single tool that is more professional. I do not want pivot tables in MS Excel.
A similar question is here.
Manipulating large amounts of data in an unknown format is...challenging for a generic program. Your best bet is probably similar to what you're doing already. Use a custom program to reformat your results into something easier to handle (backend), and a visualisation program of your choice to let you view and play around with the data (frontend).
Backend
For your problem I'd suggest a relational database (e.g. MySQL). It has a longer setup time than the other options, but if this is an ongoing problem it should be worthwhile, as it allows you to easily pull out the fields of interest.
SELECT AVG(Speedup) FROM results WHERE Parameter1="column" AND Parameter2="cyclic" for example. You'll then still need a simple script to insert your data in the first place, and then to pull the results of interest in a useful format you can stick into your viewer. Or if you so desire you can just run queries directly against the db.
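For instance, here is a rough sketch of that insert-then-query workflow using SQLite's C API (chosen purely to keep the example self-contained; the same SQL applies to MySQL), with columns taken from the schema file in the question and error checking omitted:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3      *db;
    sqlite3_stmt *stmt;

    sqlite3_open("results.db", &db);
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS results("
        " MatrixName TEXT, Speedup INT, Time REAL, RepetationCount INT,"
        " Imbalance REAL, Parameter1 TEXT, Parameter2 TEXT, Parameter3 TEXT,"
        " ExperimentId INT)", NULL, NULL, NULL);

    /* One INSERT per line of the '#'-separated result files
     * (first line of RowCyclicHeuristicA.txt, with the '%' stripped). */
    sqlite3_exec(db,
        "INSERT INTO results VALUES"
        " ('a.mtx', 3, 5.1, 10, 2.0, 'row', 'cyclic', 'heuristicA', 1)",
        NULL, NULL, NULL);

    /* Average Time per matrix for one parameter combination. */
    sqlite3_prepare_v2(db,
        "SELECT MatrixName, AVG(Time) FROM results"
        " WHERE Parameter1 = 'row' AND Parameter2 = 'cyclic'"
        " GROUP BY MatrixName", -1, &stmt, NULL);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s  %.2f\n", (const char *) sqlite3_column_text(stmt, 0),
               sqlite3_column_double(stmt, 1));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}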
Alternatively, what I usually use is just Python or Perl. Read in your data files, strip the data you don't want, rearrange into the desired structure, and write out to some standard format your frontend can use. Replace Python/Perl with the language of your choice.
Frontend
Personally, I almost always use Excel. The backend does most of the heavy lifting, so I get a csv file with the results I care about all nicely ordered already. Excel then lets me play around with the data, doing stuff like taking averages, plotting, reordering, etc fairly simply.
Other tools I use to display data, which are probably not useful for you but are included for completeness:
Weka - Mostly targeted at machine learning, but it provides tools for searching for trends or correlations. Useful for playing around with data, looking for things of interest.
Python/IDL/etc - For when I need data that can't be represented by a spreadsheet. These programs can, in addition to doing the backend's job of extracting and bulk manipulations, generate difference images, complicated graphs, or whatever else I need.

Is there a way to translate database table rows into Prolog facts?

After doing some research, I was amazed with the power of Prolog to express queries in a very simple way, almost like telling the machine verbally what to do. This happened because I've become really bored with Propel and PHP at work.
So, I've been wondering if there is a way to translate database table rows (Postgres, for example) into Prolog facts. That way, I could stop using so many boring joins and using ORM, and instead write something like this to get what I want:
mantenedora_ies(ID_MANTENEDORA, ID_IES) :-
    papel_pessoa(ID_PAPEL_MANTENEDORA, ID_MANTENEDORA, 1),
    papel_pessoa(ID_PAPEL_IES, ID_IES, 6),
    relacionamento_pessoa(_, ID_PAPEL_IES, ID_PAPEL_MANTENEDORA, 3).
To see why I've become bored, look at this post. The code there could be replaced by the simple lines above, which are much easier to read and understand. I'm just curious about this, since it will be impossible to actually replace things around here.
It would also be cool if something like that was possible to be done in PHP. Does anyone know something like that?
Check the ODBC interface of SWI-Prolog (maybe there is something equivalent for other Prolog implementations too):
http://www.swi-prolog.org/pldoc/doc_for?object=section%280,%270%27,swi%28%27/doc/packages/odbc.html%27%29%29
I can think of a few approaches to this -
On initialization, call a predicate that selects all data from a table and asserts it into the Prolog database. Do this for each table. You will need to declare the shape of each row, e.g. :- dynamic ies_row/4. (A sketch of this is given after these approaches.)
You could modify load_files by overriding user:prolog_load_file. From this hook you could do something similar to #1. This has the benefit of looking like a load_files call. http://www.swi-prolog.org/pldoc/man?predicate=prolog_load_file%2F2 ... This documentation mentions library(http_load), but I cannot find it anywhere (I was interested in it recently)!
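As a rough sketch of approach #1 (not taken from either answer), a small libpq program can dump a table as Prolog facts for the initialization step to consult or assert. The connection string and column names are invented for the example, error checking is minimal, and string columns would additionally need quoting as Prolog atoms.

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=mydb");
    PGresult *res;
    int       row, col;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Emit one fact per row, e.g.  papel_pessoa(10, 20, 1). */
    res = PQexec(conn, "SELECT id_papel, id_pessoa, tipo FROM papel_pessoa");
    for (row = 0; row < PQntuples(res); row++)
    {
        printf("papel_pessoa(");
        for (col = 0; col < PQnfields(res); col++)
            printf("%s%s", col ? ", " : "", PQgetvalue(res, row, col));
        printf(").\n");
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}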
There is the Draxler Prolog-to-SQL compiler, which translates a pattern (like the conjunction you wrote) into the more verbose SQL joins. You can find more info in the related post (Prolog to SQL converter).
But beware that Prolog has its weaknesses too, especially regarding aggregates. Without a library, getting sums, counts and the like is not very easy, and such libraries aren't very common or easy to use.
I think you could try to specialize the PHP DB interface for equijoins, using the built-in features that allow you to shorten the query text (when this results in more readable code). Working in SWI-Prolog / ODBC, where (like in PHP) you need to compose SQL, I effectively found myself working that way, to handle something very similar to what you have shown in the other post.
Another approach I found useful: I wrote a parser for the subset of SQL used by the MySQL backup interface (PHPMyAdmin, really). So I routinely dump my CMS's DB locally, load it into memory, apply whatever task I need, computing and writing (or applying) the insert/update/delete statements, and then upload these. This can be done because of the limited size of the DB, which fits in memory. I developed, and now maintain, a small e-commerce site with this naive approach.
Writing Prolog from PHP should not be too difficult: I'd try to modify an existing interface, like the awesome Adminer, which already offers a choice among basic serialization formats.

Delayed Table Initialization

Using the net-snmp API and using mib2c to generate the skeleton code, is it possible to support delayed initialization of tables? What I mean is, the table would not be initialized until any of its members were queried directly. The reason for this is that the member data is obtained from another server, and I'd like to be able to start the snmpd daemon without requiring the other server to be online/ready for requests. I thought of maybe initializing the table with dummy data that gets updated with the real values when a member is queried, but I'm not sure if this is the best way.
The table also has only one row of entries, so using mib2c.iterate.conf to generate table iterators and dealing with all of that just seems unnecessary. I thought of maybe just implementing the sequence defined in the MIB and not the actual table, but that's not usually how it's done in all the examples I've seen. I looked at /mibgroup/examples/delayed_instance.c, but that's not quite what I'm looking for. Using mib2c with the mib2c.create-dataset.conf config file was the closest I got to getting this to work easily, but this config file assumes the data is static and not external (both of which are not true in my case), so it won't work. If it's not easily done, I'll probably just implement the sequence and not the table, but I'm hoping there's an easy way. Thanks in advance.
The iterator method will work just fine. It won't load any data until it calls your _first and _next routines. So it's up to you, in those routines and in the _handler routine, to request the data from the remote server. In fact, by default, it doesn't cache data at all so it'll make you query your remote server for every request. That can be slow if you have a lot of data in the table, so adding a cache to store the data for N seconds is advisable in that case.
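Roughly, the lazily populated iterator hooks end up looking like the sketch below. The names follow the conventions that mib2c.iterate.conf generates, and fetch_remote_row() is a hypothetical helper that contacts the other server (returning NULL when it is unreachable, so the table simply appears empty rather than blocking snmpd startup).

#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <net-snmp/agent/net-snmp-agent-includes.h>

/* Hypothetical helper: asks the remote server for the single row of data
 * and returns a pointer to it, or NULL if the server is not reachable. */
extern void *fetch_remote_row(void);

netsnmp_variable_list *
myTable_get_first_data_point(void **my_loop_context,
                             void **my_data_context,
                             netsnmp_variable_list *put_index_data,
                             netsnmp_iterator_info *mydata)
{
    /* Nothing is loaded at registration time; the remote server is only
     * contacted when the table is actually queried. */
    void *row = fetch_remote_row();

    if (row == NULL)
        return NULL;                    /* no data => table appears empty */

    *my_loop_context = row;
    *my_data_context = row;
    snmp_set_var_typed_integer(put_index_data, ASN_INTEGER, 1); /* one row */
    return put_index_data;
}

netsnmp_variable_list *
myTable_get_next_data_point(void **my_loop_context,
                            void **my_data_context,
                            netsnmp_variable_list *put_index_data,
                            netsnmp_iterator_info *mydata)
{
    return NULL;                        /* only one row, so stop here */
}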

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per translation unit (TU).
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but we will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide, I understand that HDF5 is similar to XML or a DB in that information is associated with a tag/column, so a tool built to read an older structure will just ignore the fields it is not concerned with. Is my understanding of this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, i.e. a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
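Here is a small sketch of the path-as-attribute idea using the HDF5 C API; the file, group and attribute names are invented for the example. (HDF5 does also provide object references via the H5R API if a real "pointer" is preferred over a path string.)

#include <string.h>
#include "hdf5.h"

int main(void)
{
    const char *parent_path = "/scopes/global";

    hid_t file   = H5Fcreate("scopes.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t scopes = H5Gcreate2(file, "/scopes", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t parent = H5Gcreate2(file, "/scopes/global", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t child  = H5Gcreate2(file, "/scopes/global/foo", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Store the parent's path on the child as a string attribute; a reader
     * resolves the "pointer" by opening that path with H5Gopen2. */
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, strlen(parent_path) + 1);
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(child, "parent", strtype, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, strtype, parent_path);

    H5Aclose(attr);
    H5Sclose(space);
    H5Tclose(strtype);
    H5Gclose(child);
    H5Gclose(parent);
    H5Gclose(scopes);
    H5Fclose(file);
    return 0;
}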
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know of a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
