Hide string in binary at compile time? - c

I want to obfuscate a particular string in the binary of a C program to make it harder to analyze. I know this will not prevent someone from seeing the string if running it in a debugger. Yes, this is merely obfuscation.
Every instance of obfuscation triggers a discussion saying it has no value whatsoever. So did this one! I am aware that a capable and determined attacker will be able to recover the string. For the sake of the argument let's say I'm writing a game for X year olds and the string to be hidden is a URL to be called only once they beat the game and their name will be added to the hall of fame. It's reasonable to assume that most X year olds will not have skills that go beyond opening the binary file in a hex editor. Thanks!
Is there some elegant way to do the hiding at compile time, perhaps using the C preprocessor and a macro?
What I have seen so far is a suggestion by Yuri Slobodyanyuk, resulting in this:
#define HIDE_LETTER(a) (a) + 0x50
#define UNHIDE_STRING(str) do { char * ptr = str ; while (*ptr) *ptr++ -= 0x50; } while(0)
...
char str1[] = { HIDE_LETTER('s'), HIDE_LETTER('e'), HIDE_LETTER('c'), HIDE_LETTER('r'), HIDE_LETTER('e'),
HIDE_LETTER('t'), '\0' };
UNHIDE_STRING(str1); // unmangle the string in-place
It works but it's a bit ugly. 🙂 Perhaps someone knows a better solution?
I'm fine with something that is gcc specific.
PS: For C++ there is a solution by Adam Yaxley on github but I'm looking for C, not C++. And there's a solution with a little helper program at https://github.com/TwizzyIndy/hkteam_obfuscator

First, be aware that your issue is probably better covered by some legal approach (a contract reviewed by a paid lawyer) than by technical means.
Your approach is similar to a Caesar cipher (which was broken over a thousand years ago; the insight: compute letter frequencies, since in English text e is the most frequent letter). Even the German Enigma machine did far better in WW2. Read about the work of Alan Turing during WW2; his team broke the Enigma encryption.
Is there some elegant way to do it at compile time, perhaps using the C preprocessor and a macro?
No, there is not
(and mathematical proofs of that exist in the literature, covered by books related to Frama-C, cybersecurity, or the Coq proof assistant; be aware of Rice's theorem; read also Bertot and Castéran's book Interactive Theorem Proving and Program Development, ISBN 3-540-20854-2)
The argument of such a proof is based on cardinality. You could also use a probabilistic approach: store some cryptic hash code in your program (e.g. computed by crypt(3) at build time) and ask the user for a secret key, etc.
Any professional hacker will be technically able (perhaps after weeks of work) to find your "secret" string. Or colleagues working on or with BinSec.
However, you could write some metaprogram generating your obfuscated string as C code (to be #include-d at compile time), and add into your program some deobfuscation routine.
I'm fine with something that is gcc specific.
On large programs, consider developing your GCC plugin (perhaps starting with Bismon). See also the DECODER project.
Be however aware of Rice's theorem. Read about P vs NP problem.
Consider also generating some C code (maybe some #include-d header) with tools like GPP.
Code obfuscation is a topic with dedicated conferences; many papers exist in ACM venues. Did you attend any of them?
There could also be legal issues (perhaps related to the GDPR). You should contact your lawyer. In France, see article 323 of the Code pénal.
If your code runs on a computer connected to the Internet and interacting with a user, consider a SaaS approach: you could charge money (e.g. by credit card) at every run, or once a month. Your bank will sell you the appropriate software and permissions.
I'm writing a game for 8 year olds and the string to be hidden is a URL to be called only once they beat the game and their name will be added to the hall of fame. It's reasonable to assume that most 8 year olds will not have skills that go beyond opening the binary file in a hex editor.
I know of no 8-year-old kid able to do that, and those who can deserve to be added to your hall of fame. If you are indeed coding a game, I recommend putting the URL in clear text.
NB. The old XPM program could be inspirational, and so can RefPerSys and Jacques Pitrat's last book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018). Feel free to contact me by email at basile#starynkevitch.net (home) or basile.starynkevitch#cea.fr (office, at CEA LIST) for more.
PS. Consider of course starting a PhD on that topic! In France, at ENS or École Polytechnique; there are interesting related talks at the Collège de France. In Germany, the Fraunhofer cybersecurity labs; the Bundeswehr will probably fund your research there (but I have no connections), and so might ITEA4. Of course, you will spend three or four years full-time to find a good enough solution. Please publish papers on arXiv.

How about this:
#define STRING "Obfuscated"
#define Makestr(i) string[i] = STRING[i]
char string[11];
Makestr(6); Makestr(5);
Makestr(9); Makestr(7);
Makestr(0); Makestr(3);
Makestr(2); Makestr(4);
Makestr(1); Makestr(8);
Makestr(10);
This will typically compile to the equivalent of
string[6] = 97; string[5] = 99;
string[9] = 100; string[7] = 116;
string[0] = 79; string[3] = 117;
string[2] = 102; string[4] = 115;
string[1] = 98; string[8] = 101;
string[10] = 0;
If you look at the object file using strings or a hex editor, it won't even be obvious that there's a string at all. (But if you step through the code in a debugger, you'd be able to suss out what it was doing soon enough. No way around that, really.)
You could also perturb the individual characters, as in your original question:
#define Makestr(i) string[i] = STRING[i] + 0x50
Me, I'd worry about overflow, so I'd probably do
#define Makestr(i) string[i] = STRING[i] ^ 0x55
Now you get the equivalent of
string[6] = 177;
or
string[6] = 52;
, etc.
In these cases you additionally have to unhide the constructed string at run time, of course.
With clang I had to use -O to force it to collapse the constants and not emit the original string in the object file; with gcc it worked right away.
If your string is longer, the randomly-shuffled sequence of Makestr calls could get pretty unwieldy, though.

I changed the obfuscation to just flip bit 7.
Also I couldn't find a pretty way to do the encoding in the C preprocessor (cpp) at compile time.
I ended up encoding the string using this shell one-liner
tr \\000-\\377 \\200-\\377\\0-\\177|od -t x1 -A none|sed -e 's/ /\\x/g'
and sticking the result into the C source:
#include <stdio.h>
#include <string.h>
/* flip bit 7 in string using shell commands
tr \\000-\\377 \\200-\\377\\0-\\177|od -t x1 -A none|sed -e 's/ /\\x/g'
*/
int main() {
    char secret[] = "\xce\xef\xf4\xa0\xf5\xf3\xe9\xee\xe7\xa0\xf4\xe8"
                    "\xe5\xa0\xf0\xf2\xe5\xf0\xf2\xef\xe3\xe5\xf3\xf3\xef\xf2"
                    "\xa0\xba\xad\xa8";
    for (int i = 0; secret[i]; i++)
        secret[i] ^= 1 << 7; // flip bit 7
    printf("%s\n", secret);
}
I will leave this question as unanswered for now in the hope that someone finds a one-step solution instead of this two-step approach.


I need help filtering bad words in C?

As you can see, I am trying to filter various bad words. I have some code to do so. I am using C, and also this is for a GTK application.
char LowerEnteredUsername[EnteredUsernameLen];
for(unsigned int i = 0; i < EnteredUsernameLen; i++) {
LowerEnteredUsername[i] = tolower(EnteredUsername[i]);
}
LowerEnteredUsername[EnteredUsernameLen+1] = '\0';
if (strstr(LowerEnteredUsername, (char[]){LetterF, LetterU, LetterC, LetterK})||strstr(LowerEnteredUsername, (char[]){LetterF, LetterC, LetterU, LetterK})) {
gtk_message_dialog_set_markup((GtkMessageDialog*)Dialog, "This username seems to be innapropriate.");
UsernameErr = 1;
}
My issue is that it will only filter the last bad word specified in the if statement; in this example, "fcuk". If I input "fuck", the code passes it as clean. How can I fix this?
(char[]){LetterF, LetterU, LetterC, LetterK}
(char[]){LetterF, LetterC, LetterU, LetterK}
You’ve forgotten to terminate your strings with a '\0'. This approach doesn’t seem to me to be very effective at keeping bad words out of source code, so I’d really suggest just writing regular string literals:
if (strstr(LowerEnteredUsername, "fuck") || strstr(LowerEnteredUsername, "fcuk")) {
Much clearer. If this is really, truly a no-go, then some other indirect but less error-prone ways are:
"f" "u" "c" "k"
or
#define LOWER_F "f"
#define LOWER_U "u"
#define LOWER_C "c"
#define LOWER_K "k"
and
LOWER_F LOWER_U LOWER_C LOWER_K
Doing human-language text processing in C is painful because C's concept of strings (i.e. char*/char[] and wchar_t*/wchar_t[]) is very low-level and not expressive enough to easily represent Unicode text, let alone locate word boundaries in text and match words against a known dictionary (also consider things like inflection, declension, plurals, and the use of diacritics to evade naive string matching).
For example, your program would need to handle George Carlin's famous Seven Dirty Words quote:
https://www.youtube.com/watch?v=vbZhpf3sQxQ
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shit, piss, fuck, cunt, cocksucker, motherfucker, and tits!
This could be slightly modified to evade a naive filter, like so:
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shít, pis$, phuck, c​unt, сocksucking, motherfúcker, and títs!
Above, some of the words have simple replacements done, like s to $; others had diacritics added, like u to ú; and some are just homophones. However, some of the other words above look the same but actually contain homographs or "invisible" characters like Unicode's zero-width space, so they would evade naive text-matching systems.
So in short: avoid doing this in C. If you must, then use a robust and fully-featured Unicode handling library (i.e. do not use the C standard library's string functions like strstr, strtok, strlen, etc.).
Here's how I would do it:
Read in input to a binary blob containing Unicode text (presumably UTF-8).
Use a Unicode library to:
Normalize the encoded Unicode text data (see https://en.wikipedia.org/wiki/Unicode_equivalence )
Identify word boundaries (assuming we're dealing with European-style languages that use sentences comprised of words).
Use a linguistics library and database (English alone is full of special-cases) to normalize each word to some singular canonical form.
Then lookup each morpheme in a case-insensitive hash-set of known "bad words".
Now, there are a few shortcuts you can take:
You can use regular-expressions to identify word-boundaries.
There exist Unicode-aware regular-expression libraries for C, for example PCRE2: http://www.pcre.org/current/doc/html/pcre2unicode.html
You can skip normalizing each word's inflections/declensions if you're happy with having to list those in your "bad word" list.
I would write working code for this example, but I'm short on time tonight (and it would be a LOT of code), so hopefully this answer provides you with enough information to figure out the rest yourself.
(Pro-tip: don't match strings in a list by checking each character - it's slow and inefficient. This is what hashtables and hashsets are for!)

Is there a c library for better handling of string? [duplicate]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I recently got inspired to start up a project I've been wanting to code for a while. I want to do it in C, because memory handling is key in this application. I was searching around for a good implementation of strings in C, since I know doing it myself could lead to some messy buffer overflows, and I expect to be dealing with a fairly large number of strings.
I found this article which gives details on each, but each of them seems to have a fair number of cons (don't get me wrong, the article is EXTREMELY helpful, but it still worries me that even if I were to choose one of those, I wouldn't be using the best I can get). I also don't know how up to date the article is, hence my current plea.
What I'm looking for is something that can hold a large amount of characters and simplifies the process of searching through the string. If it allows me to tokenize the string in any way, even better. Also, it should have some pretty good I/O performance. Printing, and formatted printing, isn't quite a top priority. I know I shouldn't expect a library to do all the work for me, but I was just wondering if there was a well-documented string library out there that could save me some time and some work.
Any help is greatly appreciated. Thanks in advance!
EDIT: I was asked about the license I prefer. Any sort of open source license will do, but preferably GPL (v2 or v3).
EDIT 2: I found the betterString (bstring) library and it looks pretty good. Good documentation, a small yet versatile set of functions, and easy to mix with C strings. Anyone have any good or bad stories about it? The only downside I've read about is that it lacks Unicode support (again, I've only read about this, haven't seen it face to face just yet), but everything else seems pretty good.
EDIT 3: Also, preferably pure C.
It's an old question; I hope you have already found a useful solution. In case you didn't, please check out the Simple Dynamic String library on GitHub. I copy & paste the author's description here:
SDS is a string library for C designed to augment the limited libc string
handling functionalities by adding heap allocated strings that are:
Simpler to use.
Binary safe.
Computationally more efficient.
But yet... Compatible with normal C string functions.
This is achieved using an alternative design in which instead of using a C
structure to represent a string, we use a binary prefix that is stored
before the actual pointer to the string that is returned by SDS to the user.
+--------+-------------------------------+-----------+
| Header | Binary safe C alike string... | Null term |
+--------+-------------------------------+-----------+
|
`-> Pointer returned to the user.
Because of meta data stored before the actual returned pointer as a prefix,
and because every SDS string implicitly adds a null term at the end of
the string regardless of the actual content of the string, SDS strings work
well together with C strings and the user is free to use them interchangeably
with functions that access the string read-only.
I would suggest not using any library aside from malloc, free, strlen, memcpy, and snprintf. These functions give you all of the tools for powerful, safe, and efficient string processing in C. Just stay away from strcpy, strcat, strncpy, and strncat, all of which tend to lead to inefficiency and exploitable bugs.
Since you mentioned searching, whatever choice of library you make, strchr and strstr are almost certainly going to be what you want to use. strspn and strcspn can also be useful.
If you really want to get it right from the beginning, you should look at ICU, i.e. Unicode support, unless you are sure your strings will never hold anything but plain ASCII-7... Searching, regular expressions, tokenization is all in there.
Of course, going C++ would make things much easier, but even then my recommendation of ICU would stand.
Please check milkstrings.
Sample code :
int main(int argc, char * argv[]) {
tXt s = "123,456,789" ;
s = txtReplace(s,"123","321") ; // replace 123 by 321
int num = atoi(txtEat(&s,',')) ; // pick the first number
printf("num = %d s = %s \n",num,s) ;
s = txtPrintf("%s,%d",s,num) ; // printf in new string
printf("num = %d s = %s \n",num,s) ;
s = txtConcat(s,"<-->",txtFlip(s),NULL) ; // concatenate some strings
num = txtPos(s,"987") ; // find position of substring
printf("num = %d s = %s \n",num,s) ;
if (txtAnyError()) { //check for errors
printf("%s\n",txtLastError()) ;
return 1 ; }
return 0 ;
}
I also found a need for an external C string library, as I find the <string.h> functions very inefficient, for example:
strcat() can be very expensive in performance, as it has to find the '\0' char each time you concatenate a string
strlen() is expensive, as again, it has to find the '\0' char instead of just reading a maintained length variable
The char array is of course not dynamic and can cause very dangerous bugs (a crash from a segmentation fault can be the good scenario when you overflow your buffer)
The solution should be a library that provides not only functions but also a struct that wraps the string and stores important fields such as length and buffer size
I looked for such libraries over the web and found the following:
GLib String library (should be best standard solution) - https://developer.gnome.org/glib/stable/glib-Strings.html
http://locklessinc.com/articles/dynamic_cstrings/
http://bstring.sourceforge.net/
Enjoy
I faced this problem recently: the need to append a string with millions of characters. I ended up writing my own.
It is simply a C array of characters, encapsulated in a class that keeps track of array size and number of allocated bytes.
In the benchmark below it is 10 times faster than SDS and std::string; see
https://github.com/pedro-vicente/table-string
Benchmarks
For Visual Studio 2015, x86 debug build:
| API                   | Seconds |
|-----------------------|---------|
| SDS                   | 19      |
| std::string           | 11      |
| std::string (reserve) | 9       |
| table_str_t           | 1       |
clock_gettime_t timer;
const size_t nbr = 1000 * 1000 * 10;
const char* s = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb";
size_t len = strlen(s);
timer.start();
table_str_t table(nbr * len);
for (size_t idx = 0; idx < nbr; ++idx)
{
    table.add(s, len);
}
timer.now("end table");
timer.stop();
EDIT
Maximum performance is achieved by allocating the string all at start (constructor parameter size). If a fraction of total size is used, performance drops. Example with 100 allocations:
std::string benchmark append string of size 33, 10000000 times
end str: 11.0 seconds 11.0 total
std::string reserve benchmark append string of size 33, 10000000 times
end str reserve: 10.0 seconds 10.0 total
table string benchmark with pre-allocation of 330000000 elements
end table: 1.0 seconds 1.0 total
table string benchmark with pre-allocation of ONLY 3300000 elements, allocation is MADE 100 times...patience...
end table: 9.0 seconds 9.0 total

Can I programmatically detect changes in a sketch?

At work we have an Arduino sketch that gets changed periodically. In a nutshell, it communicates back and forth on a Serial port. For the most part our software development team controls the code; however, there are some other teams at our company that periodically make last minute changes to the sketch in order to accommodate specific client needs.
This has obviously been quite problematic because it means we might have different versions of our sketch deployed in different places without realizing it. Our software developers are very good at using source control but the other teams are not quite so disciplined.
One idea that was proposed was hard-coding a version number, so that a certain serial command would respond by reporting back the predefined version number. The trouble however is that our other teams might likewise fail to have the discipline to update the version number if they decide to make other changes.
Obviously the best solution involves cutting off the other team from making updates, but assuming that isn't possible for office politics reasons, I was wondering if there's any way to programmatically "reflect" on an Arduino sketch. Obviously a sketch is going to take up a certain number of bytes, and that sketch file is going to have a unique file hash. I was thinking if there was some way to either get the byte count, the file hash, or the last modified time as a preprocessor directive that can be injected into code that would be ideal. Something like this:
// pseudocode
const String SKETCH_FILE_HASH = #filehash;
const int SKETCH_FILE_SIZE = #filesize;
const int SKETCH_LAST_UPDATED = #modified;
But that's about as far as my knowledge goes with this. Is there any way to write custom preprocessor directives, or macros, for Arduino code? Specifically ones that can examine the sketch file itself? Is that even possible? Or is there some way that already exists to programmatically track changes in one way or another?
Risking an answer.
SKETCH_FILE_HASH: you would have to precompute it externally and pass it as a compiler flag. I guess you're using the Arduino IDE, so this is not doable.
SKETCH_FILE_SIZE: same answer.
SKETCH_LAST_UPDATED: you can use __TIME__ to get a string containing the compilation time.
What I would do, taking the political parts into account:
embed a keyword linked to your version control (e.g. the $Id$ keyword for Subversion; almost all VCSs provide this)
embed the compilation time
change the official build (the one the SW team controls) to use the actual toolchain instead of the IDE, and put it on a Jenkins server: you'll be able to use compilation flags!
embed code like
#ifndef BUILD_TYPE
#define BUILD_TYPE "Unsupported"
#endif
On your continuous build process, use -DBUILD_TYPE="HEAD" or "Release"
I'm sorry, I don't see a magic wand solving your problem. I'd invest a lot in training on why version control can save you (it seems you already have the war stories).
I was looking at this issue myself, and found this:
https://gist.github.com/jcw/1985789#file-bootcheck-ino
This is to look up the bootloader; but I'm thinking that something like this could be used for determining a signature of some sort for the code as a whole.
I did a quick experiment, where I added something like:
Serial.print("Other...");
Serial.println(CalculateChecksum(0, 2048));
in void setup(), and was able to get different values for the CRC, based on changing a tiny bit of code (a string).
This is not an explicit solution; I tried CalculateChecksum(0, 32767), and so on, and if I defined an integer like int a=101; and changed it to int a=102; the checksum was the same. Only when I changed a string (i.e., add a space) did this value change.
I'm not crystal clear on the way memory is allocated in the Arduino; I do know there is program memory (32,256 bytes) and global variable memory (2048 bytes), so I'm sure there is some way of doing this.
In another experiment, I used the pgm_read_byte() function, and if I create a simple memory dump function:
void MemoryDump (word addr, word size) {
  word dataval = ~0;
  // prog_uint8_t* p = (prog_uint8_t*) addr;
  uint8_t* p = (uint8_t*) addr;
  for (word i = 0; i < size; ++i)
  {
    dataval = pgm_read_byte(p++);
    Serial.print(i);
    Serial.print(" ->");
    Serial.print(dataval, HEX);
    Serial.print(" ");
    Serial.print(dataval);
    Serial.print(" ");
    if (dataval > 32)
    {
      Serial.print(char(dataval));
    }
    else
    {
      Serial.print("***");
    }
    Serial.print("\n");
  }
}
... and I put in a line like:
Serial.println(F("12345fghijklmnopqrstuvwxyz"));
because the F() puts the string in program memory, you will see it there.
Reading the SRAM is a bit of an issue, as noted here:
http://forum.arduino.cc/index.php?topic=220125.0
I'm not a compiler god, so I don't know how stuff like a=101; looks to the compiler/IDE, or why this doesn't look different to the program memory area.
One last note:
http://playground.arduino.cc/Code/AvailableMemory
Those functions access SRAM, so perhaps, with a bit of tweaking, you could do a CRC on that memory, but it would seem a bit of an issue, since you have to be doing a computation with a variable... in SRAM! But if the code was identical, even if doing a computation like that, it might be possible. Again, I'm in deep water here, so if an AVR god has issue with this, please destroy this theory with an ugly fact!

How to implement C code with pointers in Prolog?

I am new to Prolog. I have learned that, though it is a declarative language, Prolog can be used as a general-purpose programming language, just like C. So whatever problems you can solve in C, you can solve in Prolog as well, even though its run time may not be as good. Since there are no pointers in Prolog (as far as I know), I am wondering if I can write an equivalent program in Prolog for the following code written in C:
#include <stdio.h>

int main()
{
    int a = 5;
    int *p;
    p = &a;
    printf("The address of a is %p.", (void *)p);
    return 0;
}
You're trying to drive in a nail using a screwdriver, to use a popular analogy. Prolog is not C and solving problems in Prolog is fundamentally different from solving them in C.
Printing the value of a variable is easy to do, for example:
main :-
    X = 5,
    format("X = ~w~n", [X]).
but you can't get the address of X like you can in C. And why would you want to? The address could be different next time since Prolog has automatic garbage collection.
If you want to learn Prolog, forget about trying to write Prolog programs which look like C programs, and try to solve actual problems instead. You could try out the Project Euler series of problems, for example.
Apart from the comments and the existing answer, here is more:
Ask yourself: what is the use of the C program that you have shown? What problem does it solve? I can't answer this question, and I suspect you can't answer it either. In isolation, this program has no useful application whatsoever! So despite C being a general purpose programming language, you can write programs without any purpose, general or domain-specific.
The same, of course, is true of Prolog.
To pointers in particular: they are a very thin abstraction over absolute memory addresses. You can use (and abuse) pointers in many ways, and, if your algorithms are correct for the problem you are currently solving, the compiler can generate very efficient machine code. The same, however, is true of Prolog. The paradigms, however, will be very different.
In summary, you have managed to write a question so devoid of meaning that you provoked me to answer it without any code.
P.S. Or you have just trolled us with moderate success.
Well, since you tagged swi-prolog your question, I can show the code I used to exchange Qt GUI objects (just pointers, you know...) with the Prolog engine.
/** get back an object passed by pointer to Prolog */
template<typename Obj> Obj* pq_cast(T ptr) {
return static_cast<Obj*>(static_cast<void*>(ptr));
}
to be used, for instance in swipl-win, where _read_f is really a C callback:
/** fill the buffer */
ssize_t Swipl_IO::_read_f(void *handle, char *buf, size_t bufsize) {
    auto e = pq_cast<Swipl_IO>(handle);
    return e->_read_(buf, bufsize);
}
swipl-win has found its way as the new console in SWI-Prolog.

Is the RLE algorithm flawed?

I was looking at a recent Code Golf on the removal of duplicate characters in a string. I mulled it over and thought that the RLE algorithm would solve it; in fact, I believed it would handle removing duplicates. I wrote an implementation in C to see how far I could go with it:
char *rle(const char *src) {
    char *p = (char *)src;
    char *q = (char *)src + 1;
    char *rle_enc = NULL, *tmp_rle, buf[10];
    int run = 1;
    while (*p) {
        while (*q) {
            if (*p == *q++) run++, p++;
        }
        sprintf(buf, "%d%c", run, *(p - 1));
        p++;
        if (!rle_enc) {
            if ((rle_enc = malloc(strlen(buf) + 1)) != NULL) {
                strcpy(rle_enc, buf);
            }
        } else {
            if ((tmp_rle = realloc(rle_enc, strlen(rle_enc) + strlen(buf) + 1)) != NULL) {
                rle_enc = tmp_rle;
                strcat(rle_enc, buf);
            }
        }
        q = (p + 1);
        run = 1;
    }
    return rle_enc;
}
Sure enough, here's the main for this:
int main(int argc, char **argv) {
    char *test1 = "HHHHHHeeeeeelllllloooooooo";
    char *test2 = "nbHHkRvrXbvkn";
    char *p = rle(test1);
    printf("s = %s\n", test1);
    printf("p = %s\n", p);
    if (p) free(p);
    return 0;
}
According to Code Golf on Meta, it should be reusable and solve a set of problems, but in the shortest number of characters. Fair enough, I thought; I'd just rename the variables to one letter and compact the code to make it small. But something wasn't quite right with it, and that led me to think about the RLE algorithm itself. Here's a page on Wikipedia about it, with an implementation in Java.
The code does appear to be doing what it should. So I thought: now it's just a matter of going through the encoded string returned by rle, looking for runs that have a 1 followed by the letter.
I did, however, notice the limitation of the RLE algorithm: it is only suitable for sets of repeated characters that are adjacent to each other. It failed the Code Golf test case, which looks deceptively simple, and that leads me to this question:
Is the RLE algorithm flawed? Where would it be used nowadays? Gathering dust, I presume, because with the volume of data and information flowing around, RLE no longer fits a purpose...
Edit: Thanks to Moonshadow, John and Steve for posting their answers.
There is a fundamental lesson that I've still failed to learn: never ever go OTT and think complex when it comes to this kind of thing. That is a fallacy on my part, and it shows how big thinking can get in the way; I can get sucked into it too deeply and get carried away without looking at it from the right angle! Thanks again! :)
RLE will not solve that code golf problem for you.
The code golf problem requires you to strip out all characters that occur more than once in the input, regardless of where the occurrences are. However, RLE, "run length encoding", encodes "runs" - repeated sequences of the same character; multiple runs of the same character can occur in a string, and RLE will encode these separately, by design.
RLE is intended to encode sequences of repeated data elements more compactly by replacing the sequence with just one element followed by the number of times it is repeated. For this purpose it is perfectly adequate. Any "flaw" is not in the algorithm, but rather in the decision to use it for a purpose for which it is poorly suited.
RLE was commonly used for 8-bit bitmaps, as they would commonly have long runs of the same character. Windows still supports an RLE video codec that was used in a similar fashion. Nowadays, LZW + Huffman encoding has superseded RLE as the "simple" compression algorithm.
RLE has been used for years, so it is pretty hard to say that it is "flawed", but it certainly isn't efficient.
Most RLE formats will have an "escape character", so that there can be no confusion as to the output.
For example, if we use "e" as an escape character...
This would produce a literal "e":
ee
This would be the letter "a" repeated twice:
ea2
Why do you think it is flawed? RLE works for compressing repeated characters. It isn't intended to do anything else and won't help compress data without run lengths > 1.
In the context of this problem I would say that RLE was just not the right answer, it is not flawed, it is just not helpful in this case.