Should I unescape bytea field in C-function for Postgresql and if so - how to do it? - c

I write my own C function for Postgresql which have bytea parameter. This function is defined as followed
CREATE OR REPLACE FUNCTION putDoc(entity_type int, entity_id int,
doc_type text, doc_data bytea) RETURNS text
AS 'proj_pg', 'call_putDoc'
LANGUAGE C STRICT;
My function call_putDoc, written on C, reads doc_data and pass its data to another function like file_magic to determine mime-type of the data and then pass data to appropriate file converter.
I call this postgresql function from php script which loads file content to last parameter. So, I should pass file contents with pg_escape_bytea.
When data are passed to call_putDoc C function, does its data already unescaped and if not - how to unescape them?
Edit: As I found, no, data, passed to C function, is not unescaped. How to unescape it?

When it comes to programming C functions for PostgreSQL, the documentation explains some of the basics, but for the rest it's usually down to reading the source code for the PostgreSQL server.
Thankfully the code is usually well structured and easy to read. I wish it had more doc comments though.
Some vital tools for navigating the source code are either:
A good IDE; or
The find and git grep commands.
In this case, after having a look I think your bytea argument is being decoded - at least in Pg 9.2, it's possible (though rather unlikely) that 8.4 behaved differently. The server should automatically do that before calling your function, and I suspect you have a programming error in how you are calling your putDoc function from SQL. Without sources it's hard to say more.
Try calling it putDoc from psql with some sample data you've verified is correctly escape encoded for your 8.4 server
Try setting a breakpoint in byteain to make sure it's called before your function
Follow through the steps below to verify that what I've said applies to 8.4.
Set a breakpoint in your function and step through with gdb, using the print function as you go to examine the variables. There are lots of gdb tutorials that'll teach you the required break, backtrace, cont, step, next, print, etc commands, so I won't repeat all that here.
As for what's wrong: You could be double-encoding your data - for example, given your comments I'm wondering if you've base64 encoded data and passed it to Pg with bytea_output set to escape. Pg would then decode it ... giving you a bytea containing the bytea representation of the base64 encoding of the bytes, not the raw bytes themselves. (Edit Sounds like probably not based on comments).
For correct use of bytea see:
http://www.postgresql.org/docs/current/static/runtime-config-client.html
http://www.postgresql.org/docs/current/static/datatype-binary.html
To say more I'd need source code.
Here's what I did:
A quick find -name bytea\* in the source tree locates src/include/utils/bytea.h. A comment there notes that the function definitions are in utils/adt/varlena.c - which turns out to actually be src/backend/util/adt/varlena.c.
In bytea.h you'll also notice the definition of the bytea_output GUC parameter, which is what you see when you SHOW bytea_output or SET bytea_output in psql.
Let's have a look at a function we know does something with bytea data, like bytea_substr, in varlena.c. It's so short I'll include one of its declarations here:
Datum
bytea_substr(PG_FUNCTION_ARGS)
{
PG_RETURN_BYTEA_P(bytea_substring(PG_GETARG_DATUM(0),
PG_GETARG_INT32(1),
PG_GETARG_INT32(2),
false));
}
Many of the public functions are wrappers around private implementation, so the private implementation can be re-used with functions that have differing arguments, or from other private code too. This is such a case; you'll see that the real implementation is bytea_substring. All the above does is handle the SQL function calling interface. It doesn't mess with the Datum containing the bytea input at all.
The real implementation bytea_substring follows directly below the SQL interface wrappers in this partcular case, so read on in varlena.c.
The implementation doesn't seem to refer to the bytea_output GUC, and basically just calls DatumGetByteaPSlice to do the work after handling some border cases. git grep DatumGetByteaPSlice shows us that DatumGetByteaPSlice is in src/include/fmgr.h, and is a macro defined as:
#define DatumGetByteaPSlice(X,m,n) ((bytea *) PG_DETOAST_DATUM_SLICE(X,m,n))
where PG_DETOAST_DATUM_SLICE is
#define PG_DETOAST_DATUM_SLICE(datum,f,c) \
pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
(int32) (f), (int32) (c))
so it's just detoasting the datum and returning a memory slice. This leaves me wondering: has the decoding been done elsewhere, as part of the function call interface? Or have I missed something?
A look at byteain, the input function for bytea, shows that it's certainly decoding the data. Set a breakpoint in that function and it should trip when you call your function from SQL, showing that the bytea data is really being decoded.
For example, let's see if byteain gets called when we call bytea_substr with:
SELECT substring('1234'::bytea, 2,2);
In case you're wondering how substring(bytea) gets turned into a C call to bytea_substr, look at src/catalog/pg_proc.h for the mappings.
We'll start psql and get the pid of the backend:
$ psql -q regress
regress=# select pg_backend_pid();
pg_backend_pid
----------------
18582
(1 row)
then in another terminal connect to that pid with gdb, set a breakpoint, and continue execution:
$ sudo -u postgres gdb -q -p 18582
Attaching to process 18582
... blah blah ...
(gdb) break bytea_substr
Breakpoint 1 at 0x6a9e40: file varlena.c, line 1845.
(gdb) cont
Continuing.
In the 1st terminal we execute in psql:
SELECT substring('1234'::bytea, 2,2);
... and notice that it hangs without returning a result. Good. That's because we tripped the breakpoint in gdb, as you can see in the 2nd terminal:
Breakpoint 1, bytea_substr (fcinfo=0x1265690) at varlena.c:1845
1845 PG_RETURN_BYTEA_P(bytea_substring(PG_GETARG_DATUM(0),
(gdb)
A backtrace with the bt command doesn't show bytea_substr in the call path, it's all SQL function call machinery. So Pg is decoding the bytea before it's passing it to bytea_substr.
You can now detach the debugger with quit. This won't quit the Pg backend, only detach and quit the debugger.

Related

How to find all C functions starting with a prefix in a library

For a small test framework, I want to do automatic test discovery. Right now, my plan is that all tests just have a prefix, which could basically be implemented like this
#define TEST(name) void TEST_##name(void)
And be used like this (in different c files)
TEST(one_eq_one) { assert(1 == 1); }
The ugly part is that you would need to list all test-names again in the main function.
Instead of doing that, I want to collect all tests in a library (say lib-my-unit-tests.so) and generate the main function automatically, and then just link the generated main function against the library. All of this internal action can be hidden nicely with cmake.
So, I need a script that does:
1. Write "int main(void) {"
2. For all functions $f starting with 'TEST_' in lib-my-unit-tests.so do
a) write "extern void $f(void);"
b) write "$f();
3. Write "}"
Most parts of that script are easy, but I am unsure how to reliably get a list of all functions starting with the prefix.
On POSIX systems, I can try to parse the output of nm. But here, I am not sure if the names will always be the same (on my MacBook, all names start with an additional '_'). To me, it looks like it might be OS/architecture-dependent which names will be generated for the binary. For windows, I do not yet have an idea on how to do that.
So, my questions are:
Is there a better way to implement test-discovery in C? (maybe something like dlsym)
How do I reliably get a list of all function-names starting with a certain prefix on a MacOS/Linux/Windows
A partial solution for the problem is parsing nm with a regex:
for line in $(nm $1) ; do
# Finds all functions starting with "TEST_" or "_TEST_"
if [[ $line =~ ^_?(TEST_.*)$ ]] ; then
echo "${BASH_REMATCH[1]}"
fi
done
And then a second script consumes this output to generate a c file that calls these functions. Then, cmake calls the second script to create the test executable
add_executable(test-executable generated_source.c)
target_link_libraries(test-executable PRIVATE library_with_test_functions)
add_custom_command(
OUTPUT generated_source.c
COMMAND second_script.sh library_with_test_functions.so > generated_source.c
DEPENDS second_script.sh library_with_test_functions)
I think this works on POSIX systems, but I don't know how to solve it for Windows
You can write a shell script using the nm or objdump utilities to list the symbols, pipe through awk to select the appropriate name and output the desired source lines.

How can I check if a certain function could be indirectly called by another certain function?

Assuming that in a project written by C, there is a function named A and a function named B.
How can I verify if the function A could be in the call tree of function B? Just like B->C->D->...->A .
This question came when I was thinking about which libvirt API may invoke the qemu qmp "query-block". Since qmp "query-block" is only called by function qemuMonitorJSONQueryBlock. So this specific question becomes: How can I find which libvirt API may invoke qemuMonitorJSONQueryBlock?
I think dynamic analysis is hard to answer that question because lots of tests are required. It should be a question of static analysis. But I could find proper tools or methods to solve it. At last I summarize the question as the first paragraph.
You can try CppDepend and its code query language to create some advanced queries about the dependencies, In you case you can use a query like this one
from m in Methods
let depth0 = m.DepthOfIsUsedBy("__Globals.B()")
where depth0 >= 0 && m.SimpleName=="A" orderby depth0
select new { m, depth0 }
You can use GNU cflow utility which analyzes a collection of source files written in C programming language and outputs a graph charting dependencies between various functions
I think dynamic analysis is hard to answer that question because lots of tests are required. It should be a question of static analysis. But I could find proper tools or methods to solve it. At last I summarize the question as the first paragraph.
That's true, basically because you can call functions that you have never linked in your program. with the dlopen(3) function and friends, you can dynamically link to your program a completely unknown function and be able to call it. There's no way to check if the pointer to a function is actually storing a valid pointer and to see if, as a result, it will be called or not (or if it is in the call graph of some initial function)
I find cscope can help solve the question. It is a
is a developer's tool for browsing source code. It can get the caller of a function by following:
1. Change to the source code directory, then generate the cscope database file named cscope.out
cd libvirt
cscope -bR
Find the callers of func1 by cscope:
cscope -d -f cscope.out -L3 func1, then 2nd column is the callers of this function. For example:
cscope -d -f./cscope.out -L3 qemuMigrationDstPrepareDirect
The result:
src/qemu/qemu_driver.c ATTRIBUTE_NONNULL 12487 ret = qemuMigrationDstPrepareDirect(driver, dconn,
src/qemu/qemu_driver.c qemuDomainMigratePrepare2 12487 ret = qemuMigrationDstPrepareDirect(driver, dconn,
src/qemu/qemu_driver.c qemuDomainMigratePrepare3 12722 ret = qemuMigrationDstPrepareDirect(driver, dconn,
src/qemu/qemu_driver.c qemuDomainMigratePrepare3Params 12809 ret = qemuMigrationDstPrepareDirect(driver, dconn,
Note that: cscope will mistakenly regard function attribute declarement ATTRIBUTE_* as callers. We should skip them.
Then recursively find the caller of the a function. At last select the target B->...->A call trace.
doxygen can generate call graphs and caller graphs. If you configure it for an unlimited number of calls in a graph, you will be able to get the information you need.

Frama-c: save plugin analysis results in c file

I'am new in frama-c. So I apologize in advance for my question.
I would like to make a plugin that will modify the source code, clone some functions, insert some functions calls and I would like my plugin to generate a second file that will contain the modified version of the input file.
I would like to know if it is possible to generate a new file c with frama-c. For example, the results of the Sparecode and Semantic constant folding plugins are displayed on the terminal directly and not in a file. So I would like to know if Frama-c has the function to write to a file instead of sending the result of the analysis to the standard output.
Of course we can redirect the output of frama-c to a file.c for example, but in this case, for the plugin scf for example, the results of value is there and I found that frama-c replaces for example the "for" loops by while.
But what I would like is that frama-c can generate a file that will contain my original code plus the modifications that I would have inserted.
I looked in the directory src / kernel_services / ast_printing but I have not really found functions that can guide me.
Thanks.
On the command line, option -ocode <file> indicates that any subsequent -print will be done in <file> instead of the standard output (use -ocode "" after that if you want to print on stdout again). Note that -print prints the code corresponding to the current project. You can use -then-on <prj> to change the project you're interested in. More information is of course available in the user manual.
All of this is of course available programmatically. In particular, File.pretty_ast by defaults pretty-prints (i.e. output a C program) the AST of the current project on stdout, but takes two optional argument for changing the project or the formatter to which the output should be done.

Retrieving Global Variable Values from Command Line

In one particular project, we're trying to embed version information into shared object files. We'd like to be able to use some standard linux tool to parse the shared object to determine the version for automated testing.
Currently I have "const int plugin_version = 14;". I can use 'nm' and 'objdump' and verify that it's there:
00000000000dcfbc r plugin_version
I can't, however, seem to be able to get the value of that variable easily from command line. I figured there'd be a POSIX tool for showing the initialized values for globals. I have contemplated using a format for the variable as the information itself, ie, plugin_version_14, but that seems like a huge hack. Embedding the information in the filename unfortunately is NOT an option. Any other suggestions welcome.
You could embed it as a string
"MAGIC MARKER STRING VERSION: 4.56 END OF MAGIC" then just look for "MAGIC MARKER STRING" in the file and extract the version information that comes after it.
if you make it a standard, you could easily make command line tool to find these embeded strings on all your software.
if you require it also to be an int, a little macro magic will construct both the int and magic string to make sure they are never out of synch.
There's a couple of options I think.
My first instinct is to make sure the version information lives in its own section in the ELF file. You can use objdump -s -j name of section /bin/whatever.
This rather relies on objdump being available of course.
Alternatively you can do what Keith suggested, and just use 'strings', along with a magical marker string. This feels a little hackish, but should work quite well.
Finally, why don't you just add a --version command line option? You can then store the version information however you like, and trivially retrieve it using the one tool which is certain to be installed on any system which has your software.
A terrible hack that I've used in the past is to embed the version information in a variable name, so nm will show:
00000000000dcfbc r plugin_version_14
Why not writing your own tool to get that version in C/C++ ? You could Use dlopen, then dlsym to get the symbol and print its value to standard output. This way you also verify if the symbol is already there. It looks like 20 ~ 30 lines of code to me and about 20 minutes of your life :)
I know that the question is about command line, but writing such a tool yourself should be easy (especially if such a command line tool does not exist).
If the binary is not stripped, you could use gdb to print the variable. (I just tried to script gdb, but it seems to refuse work if stdin is not a tty, maybe expect will do the job ? )
If you can accept using python, this might help:
import struct
import sys
import subprocess
if __name__ == '__main__':
so = sys.argv[1]
sym = sys.argv[2]
addr = subprocess.check_output('nm %s | grep %s' % (so, sym), shell=True)
addr = int(addr.split()[0], 16)
so_file = open(so)
so_file.seek(addr)
data = so_file.read(4)
print struct.unpack('#i', data)[0]
Disclaimer: This script doesn't do any error checking (if you like it I'm sure you can come up with some ;)). It also assumes you're reading a 4-byte native int value.
$ cat global.c
const int plugin_version = 14;
$ python readsym.py global.so plugin_version
14

Make GDB print control flow of functions as they are called

How do I make gdb print functions of interest as they are called, indented according to how deep in the stack they are?
I want to be able to say something like (made up):
(gdb) trace Foo* Bar* printf
And have gdb print all functions which begin with Foo or Bar, as they are called. Kind of like gnu cflow, except using the debugging symbols and only printing functions which actually get called, not all possible call flows.
Tools which won't help include cachegrind, callgrind and oprofile, which order the results by which functions were called most often. I need the order of calling preserved.
The wildcarding (or equivalent) is essential, as there are a lot of Foo and Bar funcs. Although I would settle for recording absolutely every function. Or, perhaps telling gdb to record all functions in a particular library.
Some GDB wizard must have a script for this common job!
In your case I would turn to the define command in gdb, which allows you to define a function, which can take up to 10 arguments.
You can pass in the names of functions to "trace" as arguments to the function you define, or record them all in the function itself. I'd do something like the following
define functiontrace
if $arg0
break $arg0
commands
where
continue
end
end
if $arg1
...
Arguments to a user-defined function in gdb are referenced as $arg0-$arg9. Alternatively, you could just record every function you wanted to trace in the function, instead of using $arg0-9.
Note: this will not indent as to depth in the stack trace, but will print the stack trace every time the function is called. I find this approach more useful than strace etc... because it will log any function you want, system, library, local, or otherwise.
There's rbreak cmd accepting regular expression for setting breakpoints. You can use:
(gdb) rbreak Foo.*
(gdb) rbreak Bar.*
(gdb) break printf
See this for details on breakpoints.
Then use commands to print every function as it's called. E.g. let α = the number of the last breakpoint (you can check it with i br if you missed), then do:
(gdb) commands 1-α
Type commands for breakpoint(s) 1-α, one per line.
End with a line saying just "end".
>silent
>bt 1
>c
>end
(gdb)
Some elaboration: silent suppresses unnecessary informational messages, bt 1 prints the last frame of backtrace (i.e. it's the current function), c is a shortcut for continue, to continue execution, and end is just the delimiter of command list.
NB: if you trace library functions, you may want to wait for lib to get loaded. E.g. set a break to main or whatever function, run app until that point, and only then set breakpoints you wanted.
Use the right tool for the job ;)
How to print the next N executed lines automatically in GDB?
Did you see litb's excellent anwser to a similar post here ?
He uses readelf to get interesting symbols, gdb commands to get the trace, and awk to glue all that.
Basically what you have to change is to modify his gdb command script to remove the 1 depth from backtrace to see the stack and filter specific functions, and reformat the output with an awk/python/(...) script to present it as a tree. (I admit I'm too lazy to do it now...)
You may call gdb in batch mode (using -x option), break where you need and ask for backtrace (bt), then you filter the result using grep or egrep.
Indents are more difficult, however bt output is ordered so you have current function at the top of the trace and main at very bottom.
So you create file with commands:
br <function name where to break>
run
bt
kill
quit
then run gdb <program> -x<command file>
Filter strings that starts with #<digit> - you get stack trace.

Resources