I've seen some binary files where the developer, it seems, was a bit paranoid and obfuscated all the text in the binary. I hadn't seen anything like it before and didn't find any obvious options for compiling an ELF with hidden text. Even standard OS API strings were hidden, which was strange given that they are usually visible.
These programs wouldn't have any text that isn't exposed anyway when they run, except for unknown text. Hiding the whole lot just raises red flags and makes the binary look suspicious.
Are there easy ways to hide text that is compiled into an ELF, ideally with simple compiler/linker options? I imagine a decoder could be inserted at main(), but how could the text itself be easily encoded?
I can imagine a custom approach: build an implicit decoder with a key into the code, then use that key to encode the text of the ELF at build time, so that it is easily encoded.
You must have been looking at compressed executable files.
There are various tools available to compress executable files and decompress them at load time, such as UPX for Linux. Most text in the binary file will become unreadable to the naked eye, but be aware that this is a very ineffective method of hiding sensitive data, as hackers will have no difficulty decompressing the executable to gain access to the actual data.
Using encrypted strings in your executable, whose contents are produced by a script during the build process, is a better approach, but the code to decrypt them must still be present somewhere in the executable, just harder to locate. If the data is sufficiently valuable (database password, bitcoin keys...), hackers will get it.
I guess that by "text" you mean human-readable text (and not the code segment, a.k.a. the text segment).
You could just encrypt or obfuscate it into a read-only array:
const char encrypted_text[] = {
    // a lot of encrypted bytes like 0x01, 0x43, etc.
    // the C file containing this array would be generated by some script
};
Then you'll use your de-obfuscation or decryption routines to get the real (deciphered) text back at run time.
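For illustration, a minimal sketch of such a routine, assuming a simple repeating XOR key (the key, the bytes, and the names here are invented; a build script would generate the real array):

#include <stddef.h>
#include <stdio.h>

static const unsigned char key[] = { 0x13, 0x37, 0x42 };

/* "Hi" XORed with the key above; normally generated by the build script. */
static const char encrypted_text[] = { 0x5b, 0x5e };

/* Decode len encrypted bytes into out (out must hold len + 1 bytes). */
static void decode_text(const char *enc, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = enc[i] ^ key[i % sizeof key];
    out[len] = '\0';
}

int main(void)
{
    char plain[sizeof encrypted_text + 1];
    decode_text(encrypted_text, sizeof encrypted_text, plain);
    puts(plain);   /* prints "Hi" */
    return 0;
}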
I'm not sure it is worth the trouble. Life is too short.
I've normally seen this when analyzing malware. The authors do this to prevent static analysis tools like strings from working. Additionally, such authors might use dlopen and dlsym to load the functions they need.
For example, consider the code snippet below:
printf("Hello World");
I would see the string "Hello World" in the output of strings, and by looking at the import section of the ELF file I'd see that the program makes use of printf. So without running the program it is possible to get a sense of what it is doing.
Now let's assume that the author wrote a function char* decrypt(int). This function takes an index into a string table (in which each string is encrypted) and returns the decrypted string. The one line of code above would now notionally look like
void* pfile = dlopen(decrypt(3), RTLD_LAZY);
void (*pfunct)(char*) = (void (*)(char*))dlsym(pfile, decrypt(15));
pfunct(decrypt(5));
Again, remember that the above is closer to pseudo-code than actually compilable code. Now in this case, using static analysis tools, we would see neither the strings nor the function names (in the import section).
Additionally, if we were attempting to reverse engineer the code, we would need to take time to decrypt the strings and work through the logic to determine which functions are being called. It's not that this can't be done, but it will slow the analyst down, which means it will take longer until a mitigation for the malware is created.
And now to your question:
Are there easy ways to hide text that is compiled into an ELF? Be that
with easy compiler/linking options. I imagine a decoder could be
inserted at main() but how could the text section be easily encoded?
There is no compiler/linker option that does this. The author would need to choose to do it, write the appropriate functions (i.e. decrypt above), and write a utility to produce the encrypted forms of the strings. Additionally, as others have suggested, once this is done the entire application can be encrypted/compressed (think of a self-extracting zip file), so that the only thing you initially see with static analysis tools is the stub that decrypts or decompresses the file.
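As a rough sketch of the utility half, here is a hypothetical build-time helper that XOR-obfuscates a list of strings and prints the C table for decrypt() to index (the key, the names, and the scheme are invented for illustration; real malware uses stronger schemes):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Plaintext strings to hide; a real tool would read these from a
       file, and would also emit lengths, since XOR can produce
       embedded NUL bytes. */
    const char *plain[] = { "libc.so.6", "printf", "Hello World" };
    const unsigned char key = 0x5a;   /* hypothetical key */

    printf("static const char *string_table[] = {\n");
    for (size_t i = 0; i < sizeof plain / sizeof *plain; i++) {
        printf("    \"");
        for (size_t j = 0; j < strlen(plain[i]); j++)
            printf("\\x%02x", (unsigned char)plain[i][j] ^ key);
        printf("\",\n");
    }
    printf("};\n");
    return 0;
}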
See https://www.ioactive.com/pdfs/ZeusSpyEyeBankingTrojanAnalysis.pdf for an example of this. (Granted, this is Windows based, but the techniques for encryption and for dynamically loading functions are the same. Look at the section on API calls.)
If interested, you can also see https://www.researchgate.net/publication/224180021_On_the_analysis_of_the_Zeus_botnet_crimeware_toolkit and https://arxiv.org/pdf/1406.5569.pdf
I have an Orange Pi with Ubuntu connected to an ATmega328P through a USBasp programmer.
I developed a program in C, compiled it, translated it to hex and uploaded it to the ATmega, but because of some strange behavior the file.c is lost.
How can I get my program back from the ATmega?
The good news: It is possible, definitively.
The bad news: it's a lot of work, depending on the size of your application. I did this more than once with AVR code written in C, BASCOM, or C++ (Arduino). It takes several hours, for example some 20 hours for a 100-liner in BASCOM.
The approach is:
Disassemble the HEX file. Use this output as reference. You might need some options to have all constant data in the output.
Start with the best approximation of the source that your memory still holds.
Compile, link and convert it into a HEX file, too.
Disassemble this HEX file, and compare the output with the reference.
Repeat editing your source until both disassemblies are equal.
Notes:
You need deep understanding about the translation from C into machine code.
The names of functions and variables can't be reconstructed exactly. These names are gone after compiling and linking.
Be aware that the order of functions in the resulting code need not match their order of appearance in the source. Most compilers do keep the source order, though.
Be aware that the order of variables in memory might not depend on their order of appearance in the source, but on their names. Additionally, they are commonly not sorted lexically; for example, I found GCC using some kind of hashing algorithm. However, members of structs keep their order, because the standard demands that.
In a first phase, ignore differences of variable placement.
Try to identify functions of the C library, and ignore them. Especially the printf() family drags a lot of other functions in with it. When your own code is finished, the library functions will most probably be there, too.
Final note: If you happen to have the ELF file, use this for disassembling and looking up names. You will be much faster.
I have reviewed the questions/answers asking whether directory/file names are case sensitive in a Windows environment, as well as those discussing a need for case-sensitive searching (usually in Python, not C), so I think I understand the essential facts, but none of those postings cover my particular application architecture or the problem I am having.
So, let me briefly explain the application architecture of which I am speaking. The heart of the application is built using Adobe AIR. Yes, that means that much of the U/I involves the Flex framework, but the file-handling problem I need help with has no dependency upon the Flex U/I part of the application.
As I am trying to process a very large list of recursive directory structures, I am using the low-level C runtime API via a well-behaved mechanism which AIR provides for cases where access to the host's native environment is needed.
The suite of functions I am using is FindFirstFile, FindNextFile and FindClose. If I write a stand-alone test program, it nicely lists the directories, sub-directories and files. The case of the directory and file names is correctly shown, just as it is in Windows Explorer or in the output of the dir command.
If, however, I launch precisely the same function via the Adobe ANE interface, I receive exactly the same output, except that all directory names are reduced to lower case.
Now, I should clarify that when this code is being executed as a Native Extension, it is not passing data back to AIR, it is directly outputting the results in a file that is opened and closed entirely in the CRT world, so we are not talking about any sort of communication confusion via the passing of either text or byte arrays between two different worlds.
Without kludging up this forum with lots and lots of extraneous code, I think what will help anyone who is able to help me is these snippets:
// This is where the output gets written.
FILE* textFile = _wfopen (L"Peek.txt", L"wt,ccs=UTF-8");
WIN32_FIND_DATAW fdf;
HANDLE hFind = NULL;
wchar_t fullPath[2048];
// I am just showing the third argument as a literal to exemplify
// what, in reality is passed into the recursively-called function as
// a variable.
wsprintf (fullPath, L"\\\\?\\%ls\\*.*", L"F:\\");
hFind = FindFirstFile (fullPath, &fdf);
// After checking for success there appears a do..while loop
// inside which there is the expected check for the "." and ".."
// pseudo directories and a test of fdf.dwFileAttributes for
// file versus sub-directory.
// When the next entry is a file, a function is called to format
// the output into textFile, like this:
fwprintf (textFile, L"%ls\t%ls\t%2.2x\t%4d/%02d/%02d/%02d/%02d/%02d\t%9ld.\n",
parentPath, fdf.cFileName,
(fdf.dwFileAttributes & 0x0f),
st.wYear, st.wMonth, st.wDay,
st.wHour, st.wMinute, st.wSecond,
fSize);
At that point parentPath will be a concatenated wide character string and
the other file attributes will be of the types shown.
So, to summarize: all of this code works perfectly when I build it as a stand-alone test. When, however, the code runs as a task called from an Adobe ANE, the names of all the sub-directory parts are reduced to lower case. I have tested every combination of file type attribute (binary and text) and encoding (UTF-8 and UTF-16LE), but no matter what configuration I choose the result remains the same: standalone, the API delivers case-correct strings; running as a task in a DLL invoked from AIR, the same API delivers only lower-case strings.
First, my thanks to Messrs Ogilvie and Passant for helpful suggestions.
Second, I apologize for not really knowing the protocol here as a very infrequent visitor. If I am supposed to flag either response as helpful and therefore correct, let these words at least reflect that fact.
I am providing an answer which was discovered by taking the advice above.
A. I discovered several tools that helped me inspect the contents of the .exe and .dll files. I should add some detail that was not part of the original posting: I have purposely been using the mingw-w64 toolchain rather than Visual Studio for this development work. So, as it turns out, both ldd and dumpbin helped me determine whether the two slightly different build environments were perhaps leaving me with different dependencies.
B. When I saw that one output included a reference to FindFirstFileExW, a function I had once tried in order to solve what I thought was the problem, I thought I had perhaps found a reason for the different results. In the event, that was just a red herring, and I do not mean to waste the forum's time with my low level of experience and understanding, but it seems useful to note this sort of troubleshooting methodology as a possible assist to others.
C. So what was the problem? There was, indeed, a small difference in the code between the stand-alone and the ANE-integrated implementations of the recursive directory search. In the production ANE use case, there is logic to apply a level of filtering to the returned results. The actual application allows the user to qualify a search for duplicate files by interrogating parts of the parent string in addition to the filename string itself.
In one corner condition the filter may be case-sensitive or case-insensitive, and I was using _wcslwr in the mistaken belief that it behaved in the nice, Unicode-compliant way that string-handling methods do in AIR/ActionScript 3. I did not notice that the function actually lowercases the original string in place.
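A minimal sketch of the fix this implies (the helper name, buffer, and comparison are hypothetical; the point is to lowercase a copy rather than the original):

#include <windows.h>
#include <wchar.h>

/* Hypothetical filter helper: compare case-insensitively without
   clobbering the case-correct name that is printed later. */
static int name_matches(const WIN32_FIND_DATAW *fdf, const wchar_t *filter)
{
    wchar_t nameCopy[MAX_PATH];        /* cFileName fits in MAX_PATH */
    wcscpy(nameCopy, fdf->cFileName);  /* work on a copy...          */
    _wcslwr(nameCopy);                 /* ...so only the copy is lowercased */
    return wcscmp(nameCopy, filter) == 0;  /* filter assumed lowercase */
}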
User error, not any untoward linking of non-standard CRT kernel functions by Adobe's Native Extension interoperability, was the culprit.
Is there a facility for the C language that allows run-time struct introspection?
The context is this:
I've got a daemon that responds to external events, and for each event we carry around an execution context struct (the "context"). The context is big and messy, and contains references to all sorts of state.
Once the event has been handled, I would like to be able to run the context through a filter, and if it matches some set of criteria, drop a log message to help with debugging. However, since I hope to use this for field debugging, I won't know what criteria will be useful to filter on until run time.
My ideal solution would allow the user to, essentially, write a C-style boolean expression and have the program use that. Something like:
activate_filter context.response_time > 4.2 && context.event.event_type == foo_event
Ideas that have been tossed around so far include:
Providing a limited set of fields that we know how to access.
Wrapping all the relevant structs in some sort of macro that generates introspection tools at run time (see the sketch after this list).
Writing a python script that knows where (versioned) headers live, generates C code and compiles it to a dll, which the daemon then loads and uses as a filter. Obviously this approach has some extra security considerations.
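For the second idea, here is a rough X-macro sketch (the field names are hypothetical, borrowed from the filter example above):

#include <stdio.h>
#include <stddef.h>

/* Define the struct's fields once; reuse the list for both the
   declaration and the introspection code. */
#define CONTEXT_FIELDS(X)      \
    X(double, response_time)   \
    X(int,    event_type)

#define DECLARE_FIELD(type, name) type name;
struct context { CONTEXT_FIELDS(DECLARE_FIELD) };

#define PRINT_FIELD(type, name) \
    printf("  %s (offset %zu)\n", #name, offsetof(struct context, name));

int main(void)
{
    puts("struct context fields:");
    CONTEXT_FIELDS(PRINT_FIELD)   /* expands to one printf per field */
    return 0;
}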
Before I start off on some crazy wild-goose chase of a design, does anyone know of examples of this sort of thing in the wild? I've done some googling but haven't come up with much.
I would also suggest tackling this issue from another angle. The key words in your question are:
The context is big and messy
And that's where the issue is. Once you clean this up, you'll probably be able to come up with a clean logging facility.
Consider redefining all the fields in your context struct in some easy, pliable format, like XML: a simple XML schema that lists all the members of the struct, their types, and maybe some other metadata, even a comment documenting each field.
Then, throw together a quick and dirty stylesheet that reads the XML file and generates a compilable C struct that your code actually uses. Then, a different stylesheet that cranks out robo-generated code that enumerates each field in the struct and generates the code to convert each field into a string.
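A hedged sketch of what such robo-generated output might look like, assuming a two-field context (the names are borrowed from the question's filter example):

#include <stdio.h>

/* Generated from the XML description of the context. */
struct context {
    double response_time;
    int    event_type;
};

/* Generated per-field dump: converts each member to text. */
void context_dump(const struct context *c, FILE *out)
{
    fprintf(out, "response_time=%g\n", c->response_time);
    fprintf(out, "event_type=%d\n", c->event_type);
}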
From that, bolting on a logging facility of some kind, with a user-provided filtering string becomes an easier task. You do have to come up with some way of parsing an arbitrary filtering string. Knowledge of lex and yacc would come in handy.
Things of this nature have been done before.
The XCB library is a C client library for the X11 protocol. The protocol defines various kinds of binary messages, which are essentially simple structs that the client and the server toss to each other over a socket. The way libxcb is implemented is that all X11 messages and all datatypes inside them are described in an XML definition, and a stylesheet robo-generates the C struct definitions, the code to parse them, and a fairly clean C API for parsing and generating X11 messages.
You are probably approaching this problem from the wrong side.
Logging is typically used to facilitate debugging: the program writes all sorts of events to a log file, and filtering is then applied to the log file to extract the interesting entries.
Sometimes a program generates just too many events; logging libraries usually address this issue by offering verbosity control. Basically, a logging function takes an additional parameter telling the verbosity level of the current message; if the value is above the globally configured threshold, the message gets discarded. Some libraries even allow controlling the verbosity level on a per-module basis (e.g. Google's glog).
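As a minimal sketch of such verbosity control (the macro and threshold variable are made up):

#include <stdio.h>

static int g_verbosity = 2;   /* globally configured threshold */

/* Discard messages whose level exceeds the threshold. */
#define LOG(level, ...)                        \
    do {                                       \
        if ((level) <= g_verbosity)            \
            fprintf(stderr, __VA_ARGS__);      \
    } while (0)

/* Usage: LOG(3, "response_time=%f\n", ctx->response_time); */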
Another possible approach is to leverage the power of a debugger, since the debugger has access to all sorts of meta information. One can create a conditional breakpoint testing the variables in scope for arbitrary conditions; once the program stops, any information can be extracted from the scope. This can be automated using the scripting facilities a debugger provides (gdb has great ones).
Finally, there are tools that generate glue code for using C libraries from scripting languages. One example is SWIG. It analyzes a header file and generates code allowing a scripting language to invoke functions, access structure fields, etc.
Your filter expression would then become a program in, say, Lua (other scripting languages are supported as well). You invoke this program, passing in the pointer to the execution context struct (the "context"); thanks to the accessors generated by SWIG, the Lua program can examine any field in the structure.
I generated introspection data using the SWIG-CSV parser.
Suppose the C++ code contains a class like the following,
class Bike {
public:
int color; // color of the bike
int gearCount; // number of configurable gear
Bike() {
// bla bla
}
~Bike() {
// bla bla
}
void operate() {
// bla bla
}
};
Then it will generate the following CSV metadata,
Bike|color|int|variable|public|
Bike|gearCount|int|variable|public|
Bike|operate|void|function|public|f().
Now it is easy to parse the CSV file with Python or C/C++ if needed.
import csv

with open('bike.csv', 'r', newline='') as csvfile:
    bike_metadata = csv.reader(csvfile, delimiter='|')
    # do your thing
A small program I made contains a lot of small bitmaps and sound clips that I would prefer to include into the binary itself (they need to be memory mapped anyway). In the MS PE/COFF standard, there is a specific description on how to include resources (the .rsrc section) that has a nice file system-like hierarchy. I have not found anything like that in the Linux ELF specification, thus I assume one is free to include these resources as seemed fit.
What I want to achieve is that I can include all resources in only one ELF section with a symbolic name on the start of each resource (so that I can address them from my C code). What I am doing now is using a small NASM file that has the following layout:
SECTION .rsrc
_resource_1:
incbin "../rsrc/file_name_1"
_resource_1_length:
dw $-_resource_1
_resource_2:
incbin "../rsrc/file_name_2"
_resource_2_length:
dw $-_resource_2
...
I can easily assemble this to an ELF object that can be linked with my C code. However, I dislike the use of assembly as that makes my code platform-dependent.
What would be a better way to achieve the same result?
This question has already been asked on Stack Overflow, but the proposed solutions are not applicable to my case:
The solution proposed over here: C/C++ with GCC: Statically add resource files to executable/library
Including the resources as hex arrays in C code is not really useful, as that mixes the code and the data in one section. (Besides, it's not practical either, as I can't preview the resources once they are converted to arrays)
Using objcopy --add-section on every resource works, but then every resource gets its own section (including header and all that). That seems a little wasteful as I have around 120 files to include (each of +/- 4K).
You're wrong in saying that using hex arrays mixes data and code: ELF files will split them by default; in particular, if you define the hex array as a constant array, it'll end up in .rodata. See an old post of mine for more details on .rodata.
Adding resources with objcopy should create multiple sections in the object file, but they should all be merged in the output executable, although you would almost certainly end up with some extra padding. Another post on a related topic.
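For reference, data embedded this way is conventionally reached from C through the symbols objcopy derives from the input path (a sketch, assuming an input file rsrc/file_name_1; the exact flags and symbol names depend on the objcopy invocation):

#include <stddef.h>

/* Symbols conventionally generated by objcopy from the input path
   rsrc/file_name_1 (slashes and dots become underscores): */
extern const unsigned char _binary_rsrc_file_name_1_start[];
extern const unsigned char _binary_rsrc_file_name_1_end[];

static size_t resource_1_size(void)
{
    /* The resource is the byte range between the two symbols. */
    return (size_t)(_binary_rsrc_file_name_1_end
                    - _binary_rsrc_file_name_1_start);
}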
Your other alternative, if you want to go from the actual binary file (say a PNG) to ELF, is to use ldscripts, which allow you to build ELF files with arbitrary sections/symbols, reading the data from files. You'll still need custom rules to build your ELF file.
I'm actually surprised this kind of resource management is not used more often for ELF, especially since, for many small files, it'll improve filesystem performance quite quickly, as then you only have one file to map rather than many.
If your resources are not too large, you can translate them into C/C++ source code, for example as an unsigned char array. Then you can access them as global variables, and compile & link them like normal source code.
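For example, a tool like xxd -i rsrc/file_name_1 emits source along these lines (shown here lightly edited to add const and sizeof; the symbol name derives from the input path):

/* Generated array holding the file's bytes; it will typically be
   placed in .rodata and memory-mapped with the executable. */
const unsigned char rsrc_file_name_1[] = {
    0x89, 0x50, 0x4e, 0x47 /* ... the rest of the file's bytes ... */
};
const unsigned int rsrc_file_name_1_len = sizeof rsrc_file_name_1;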
Is it possible to modify an environment variable's name inside a library with some sort of editor? I'm thinking maybe a hex editor?
I wish to modify the name without altering its length:
envfoobar (9 chars)
yellowbar (9 chars)
Obviously, recompilation would be perfect, but I do not know what exact flags were used to compile this library.
What's stopping you? You can even use a text editor (as long as it's a decent editor that knows how to handle binary data, like vim does). If the library refers to the name of the environment variable through a string, and the string is in the library's data segment (i.e. it's not a string built at runtime), then it's trivial to edit the library this way. Just don't delete or introduce characters. I've done this under Linux. Some other OSes may digitally sign binaries and prevent this from working; some use a standard checksum or hash, in which case you'll have to recompute it.
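For illustration, a minimal C sketch of such a same-length, in-place patch (the file name and strings are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Overwrite the first occurrence of `from` with the same-length `to`. */
int patch_string(const char *path, const char *from, const char *to)
{
    size_t n = strlen(from);
    if (n != strlen(to))
        return -1;                      /* length must not change */

    FILE *f = fopen(path, "r+b");
    if (!f)
        return -1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    char *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        fclose(f);
        return -1;
    }

    int rc = -1;
    for (long i = 0; i + (long)n <= size; i++) {
        if (memcmp(buf + i, from, n) == 0) {
            fseek(f, i, SEEK_SET);
            fwrite(to, 1, n, f);        /* patch in place */
            rc = 0;
            break;
        }
    }
    free(buf);
    fclose(f);
    return rc;
}

/* e.g. patch_string("libfoo.so", "envfoobar", "yellowbar"); */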
If you can find the name with the strings command on the library, it might work. You could load the library up in your favorite hex editor, change the string, and give it a shot.
It's a hacky thing to do but it could work. Let us know.