Looking for patterns in binary files

I'm working on a small project in C where I have to parse a binary file in an undocumented format. As I'm quite new to C, I have two questions for more experienced programmers.
The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array? Basically I am looking for a simple implementation of the strings program in C.
When I open the binary file in any text editor I get a lot of rubbish with some readable strings mixed in. I can extract these strings using strings on the command line. Now I'd like to do something similar in C, as in the pseudocode below:
while (!EOF) {
    if (string found) {
        put it into array[i]
        i++
    }
}
return i;
The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing. When I look at the file in a hex editor, it's easy to notice some patterns. For example, before each string there is a byte of value 02 (0x02), followed by the length of the string and then the string itself. For example, in 02 18 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 the string part is 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 (ROOTKiWiKaKi).
Now the function I'm trying to create would work like this:
while(!EOF) {
    for(i=0; i<buffer_size; ++i) {
        if(buffer[i] hex value == 02) {
            int n = read the next byte;
            string = read the next n bytes as char;
            put string into array;
        }
    }
}
Thanks for any pointers. :)

The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array?
Figure out what character range represents printable ASCII characters (32 to 126). Iterate across the file, checking whether each byte is printable and counting runs of adjacent printable characters. By default, strings treats sequences of four or more printable characters as strings; when you reach the next non-printable byte, check whether the run reached that minimum length; if it did, output the string. Some bookkeeping is necessary.
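The scan described above can be sketched as follows. This is a minimal sketch, not the actual strings implementation: the function name extract_strings, the MIN_RUN constant and the idea of returning heap-allocated copies are assumptions made for illustration, and the whole file is assumed to have been read into a buffer first (for example with fread).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MIN_RUN 4   /* assumed threshold: runs shorter than this are ignored */

/* Scans buf[0..len) and stores a heap-allocated copy of every run of at
 * least MIN_RUN printable characters in out[]. Returns the number of
 * strings stored (at most max_out). The caller frees each out[i]. */
size_t extract_strings(const unsigned char *buf, size_t len,
                       char **out, size_t max_out)
{
    size_t count = 0;
    size_t start = 0;   /* index where the current run began */

    /* i runs one past the end so that a run touching the end is closed too */
    for (size_t i = 0; i <= len && count < max_out; i++) {
        if (i < len && isprint(buf[i]))
            continue;               /* still inside a printable run */

        size_t run = i - start;     /* the run [start, i) just ended */
        if (run >= MIN_RUN) {
            char *s = malloc(run + 1);
            if (s == NULL)
                break;
            memcpy(s, buf + start, run);
            s[run] = '\0';
            out[count++] = s;
        }
        start = i + 1;              /* next run starts after this byte */
    }
    return count;
}
```

Called on the bytes 02 'R' 'O' 'O' 'T' 00 'a' 'b' FF 'K' 'i' 'W' 'i' 's', it would store "ROOT" and "KiWis" and skip "ab" as too short. Note that isprint depends on the current locale; in the default "C" locale it accepts exactly 32 to 126.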
The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing.
Your pseudocode is essentially correct. You can compare the contents of buffer[i] directly with an integer (e.g. 2). Reading the next byte is as simple as incrementing i. Make sure you don't overrun the buffer, and make sure the array you're reading the string into is big enough (if the length field is only one byte, the string is at most 255 bytes long, so a 256-byte buffer leaves room for a terminating null character).
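A hedged sketch of that loop with the bounds checks spelled out. The record layout (tag byte 0x02, one length byte, then the payload) is taken from the question; the function name next_tagged_string and its parameters are hypothetical. A 0x02 byte can of course also occur by chance inside other data, so this only works if the format really is as regular as it looks in the hex editor.

```c
#include <stddef.h>
#include <string.h>

#define TAG_STRING 0x02   /* tag byte observed in the question's file */

/* Looks for the next record of the form: 0x02, length byte, payload,
 * starting at offset pos. On success copies the payload into dst
 * (at least 256 bytes), null-terminates it, stores the offset just past
 * the record in *next and returns 1. Returns 0 if no complete record
 * is found. */
int next_tagged_string(const unsigned char *buf, size_t len, size_t pos,
                       char dst[256], size_t *next)
{
    for (size_t i = pos; i + 1 < len; i++) {
        if (buf[i] != TAG_STRING)
            continue;
        size_t n = buf[i + 1];        /* one length byte: 0..255 */
        if (n > len - (i + 2))        /* payload would overrun the buffer */
            continue;
        memcpy(dst, buf + i + 2, n);
        dst[n] = '\0';
        *next = i + 2 + n;
        return 1;
    }
    return 0;
}
```

Calling it repeatedly, feeding the returned offset back in as pos, walks through all records in the buffer.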

I'm not sure your solution will work: what if you find a string that is 350 characters long?
Can numbers be part of a string, or do you consider them "rubbish"?
I think the safest way is:
Define what you consider a string and what you consider "rubbish". For instance, are ":,!?" "string" or "rubbish"?
Define a minimum length for a "readable" string.
Parse the file looking for every group of characters with length >= minimum.
I know it's boring, but I think it's the only safe way. Good luck!

Related

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The lengths of the two strings are the same.
#include <stdio.h>
#include <string.h>
int main()
{
    printf("Enter your string to be encrypted ");
    char a[50];
    char b[50];
    scanf("%s",a);
    printf("Enter the key ");
    scanf("%s",b);
    char c[50];
    int q=strlen(a);
    int i=0;
    for(i=0;i<q;i++)
    {
        c[i]=(char)(a[i]^b[i]);
    }
    printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.

Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
The ^ operator operates at the bit level.
There is a limited range of values that are printable (quoted from an ASCII reference):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the printer and are based on telex technology. With these characters, you can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters nor numbers. These include punctuation or technical, mathematical characters. ASCII also includes the space (a non-visible but printable character), and, therefore, does not belong to the control characters category, as one might suspect.
Numbers (48–57): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second group containing the lowercase.
Using the following two strings:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
The following code demonstrates XORing the elements of the strings:
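#include <stdio.h>
int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};
    /* sizeof includes the terminating null, so the last pair printed is 0 ^ 0 */
    for (size_t i = 0; i < sizeof(str1) / sizeof(str1[0]); i++)
    {
        printf("%d ^ %d: %d\n", str1[i], str2[i], str1[i] ^ str2[i]);
    }
    getchar();
    return 0;
}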
While all of the input characters are printable (except the terminating null character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.
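One way to make the result visible regardless of which byte values come out is to print it as hex digits instead of passing it to %s. The following is only a sketch of plain byte-wise XOR (not of the ITA2-based Vernam cipher); the function name xor_to_hex is hypothetical, and the key is assumed to be at least as long as the message.

```c
#include <stdio.h>
#include <string.h>

/* XORs msg with key byte by byte and writes the result as lowercase hex
 * digits into out, which must hold at least 2 * strlen(msg) + 1 bytes.
 * Assumes key is at least as long as msg.
 * Returns the number of bytes that were XORed. */
size_t xor_to_hex(const char *msg, const char *key, char *out)
{
    size_t n = strlen(msg);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)msg[i] ^ (unsigned char)key[i];
        sprintf(out + 2 * i, "%02x", (unsigned)c);   /* two digits per byte */
    }
    out[2 * n] = '\0';
    return n;
}
```

For the strings "asdf" and "jkl;" used above, out ends up as "0b18085d", which matches the values 11, 24, 8 and 93 in the listing. Because the output is always printable and never contains an embedded null, it can be printed safely with %s.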

How to create array of fixed-length "strings" in C?

I am trying to create an array of fixed-length "strings" in C, but have been having a little trouble. The problem I am having is that I am getting a segmentation fault.
Here is the objective of my program: I would like to set the array's strings by index using data read from a text file. Here is the gist of my current code (I apologize that I couldn't add my entire code, but it is quite lengthy and would likely just cause confusion):
//"n" is set at run time, and 256 is the length I would like the individual strings to be
char (*stringArray[n])[256];
char currentString[256];
//"inputFile" is a pointer to a FILE object (a .txt file)
fread(&currentString, 256, 1, inputFile);
//I would like to set the string at index 0 to the data that was just read in from the inputFile
strcpy(stringArray[i], &currentString);

Note that if your string can be 256 characters long, you need its container to be 257 bytes long, in order to add the final \0 null character.
typedef char FixedLengthString[257];
FixedLengthString stringArray[N];
FixedLengthString currentString;
The rest of the code should behave the same, although some casting might be necessary to please functions expecting char* or const char* instead of FixedLengthString (which can be considered a different type depending on compiler flags).
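As a rough sketch of how the typedef can be used when the number of strings is only known at run time: the helper names alloc_strings and set_string are hypothetical, and the copy uses memcpy with an explicit length rather than strcpy, because data filled in by fread is not null-terminated.

```c
#include <stdlib.h>
#include <string.h>

#define STR_LEN 256
typedef char FixedLengthString[STR_LEN + 1];   /* +1 for the final \0 */

/* Allocates n zero-initialized fixed-length strings in one block.
 * Returns NULL on failure; release with free(). */
FixedLengthString *alloc_strings(size_t n)
{
    return calloc(n, sizeof(FixedLengthString));
}

/* Copies at most STR_LEN bytes from src into arr[index] and always
 * null-terminates the result. src need not be null-terminated. */
void set_string(FixedLengthString *arr, size_t index,
                const char *src, size_t len)
{
    if (len > STR_LEN)
        len = STR_LEN;
    memcpy(arr[index], src, len);
    arr[index][len] = '\0';
}
```

With this, the question's fragment would become something like set_string(stringArray, 0, currentString, bytesRead), where bytesRead is the byte count obtained from fread.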

Manipulating C-strings with multiple null characters in memory

I need to search through a chunk of memory for a string of characters, but several of these strings have every character null separated, like this:
"I. .a.m. .a. .s.t.r.i.n.g"
with all of the '.'s being null characters. My problem comes from actually getting this into memory. I've tried several ways, for instance:
char* str2;
str2 = (char*)malloc(sizeof(char)*40);
memcpy((void*)str2, "123\0567\09abc", 12);
Will put the following into the memory that str2 points to: 123.7.9abc..
Something like
str2 = "123456789\0abcde\054321";
Will have str2 pointing to a block of memory that looks like 123456789.abcde,321 , wherein the '.' is a null character, and the ',' is an actual comma.
So clearly inserting null characters into cstrings doesn't work as easily as I thought it did, like inserting a newline character. I encountered similar difficulties trying this with the string library as well. I could do separate assignments, something like:
char* str;
str = (char*)malloc(sizeof(char)*40);
strcpy(str, "123");
strcpy(str+4, "abc");
strcpy(str+8, "ABC");
But that is certainly not preferable, and I believe the problem lies in my understanding of how c-style strings are stored in memory. Clearly "abc\0123" doesn't actually go into memory as 61 62 63 00 31 32 33 (in hex). How is it stored, and how can I store what I need to?
(I also apologize for not having set the code in blocks, this is my first time posting a question, and somehow "four spaced" is more difficult than I can handle apparently. Thank you, Luchian. I see more newlines were needed.)

If every other char contains a null, then almost certainly you actually have UTF-16 encoded strings. Process them accordingly and your problems will disappear.
Assuming you are on Windows, where UTF-16 is common, you would use wchar_t* rather than char* to hold such strings. And you would use wide char string processing functions to operate on such data. For example, use wcscpy rather than strcpy and so on.
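If the code also has to run where wchar_t is not 16 bits wide (it is typically 32 bits on Linux), the wide-character functions will not line up with UTF-16 data. A byte-wise search is one alternative. The sketch below swaps in that technique instead of wcsstr; it assumes little-endian UTF-16 and a pure-ASCII search string, and the function name find_utf16le_ascii is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns the byte offset of the first occurrence of the ASCII string
 * needle, encoded as UTF-16LE (each character followed by a 0x00 byte),
 * inside mem[0..len), or -1 if it does not occur. */
long find_utf16le_ascii(const unsigned char *mem, size_t len,
                        const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || 2 * n > len)
        return -1;

    for (size_t i = 0; i + 2 * n <= len; i++) {
        size_t k = 0;
        /* compare the low byte with the character, the high byte with 0 */
        while (k < n && mem[i + 2 * k] == (unsigned char)needle[k]
                     && mem[i + 2 * k + 1] == 0x00)
            k++;
        if (k == n)
            return (long)i;
    }
    return -1;
}
```

Because it works on raw bytes and an explicit length, it does not care about the embedded nulls that make the strxxx() functions stop early.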

\0 is the start of an octal escape sequence, not just a "null character" (even though using it on its own does produce one).
The easiest way to define a string containing a null character followed by something that could also be read as part of an octal escape (such as "\012", see note 1) is to split it up using C's concatenation of adjacent string literals:
char const * p = "123456789" "\0" "abcde" "\0" "54321";
1. "\012" results in a single character with the value 0x0A, not three characters (0x00, '1' and '2').

First off, every second character being a null is a clear hallmark of a wide string: a string composed of two-byte characters, really an array of unsigned shorts. Depending on your compiler and settings, you might be better off using the wchar_t data type instead of char and the wcsxxx() family of functions instead of strxxx().
On Windows, 2-byte wide strings (UTF-16, technically) are the native string format of the OS, so they're all over the place.
That said, strxxx() functions all assume that the string is null-terminated. So plan accordingly. Sometimes memxxx() will come to the rescue.
"abc\0123" does not go into memory the way you expect because \012 is interpreted by the compiler as a single octal escape sequence: the character with octal code 12 (that's 0a hex). To avoid this, use one of the following literals:
"abc\000123"
"abc\0""123"
"abc\x00""123"
An octal escape takes at most three digits, so \000 ends before the 1. A hex escape has no such limit and consumes every hex digit that follows it, so "abc\x00123" would not work; the literal has to be split after \x00.
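The snippet where you build a string from chunks is mostly correct. For the second chunk, I'd rather use
strcpy(str+strlen(str)+1, "123");
which guarantees that the chunk is written just past the null character of the first chunk. Note that strlen(str) stops at the first null, so for later chunks you need to keep track of the running offset yourself.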
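To see what the compiler actually stores, the bytes of a literal can be dumped in hex. This is a small sketch under the assumption of an ASCII execution character set; hex_dump and the two array names are made up for the example.

```c
#include <stdio.h>

/* \012 is read as ONE octal escape: bytes are 61 62 63 0a 33 00 */
static const char octal_trap[] = "abc\0123";

/* Splitting the literal ends the escape: 61 62 63 00 31 32 33 00 */
static const char split_ok[] = "abc\0" "123";

/* Writes the bytes of buf as space-separated hex pairs into out,
 * which must hold at least 3 * len + 1 bytes. */
void hex_dump(const char *buf, size_t len, char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < len; i++) {
        if (i > 0)
            out[pos++] = ' ';
        pos += (size_t)sprintf(out + pos, "%02x",
                               (unsigned)(unsigned char)buf[i]);
    }
    out[pos] = '\0';
}
```

Passing sizeof octal_trap (6) or sizeof split_ok (8) as len includes the terminating null in the dump. Using sizeof rather than strlen matters here, since strlen would stop at the first embedded null.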

I am a bit confused by your question.
But let me guess what is going on. You are looking at a 16-bit wchar_t string and not a normal C string.
A wide string holding ASCII characters looks like letters separated by nulls, but this is normal.
Simply cast with (wchar_t *)XXX, where XXX is a pointer to that region of memory, and look up the wchar_t functions such as wcscpy. As for the nulls between strings, this may be a known method of passing multiple strings in one block. You can simply iterate, reading each string in turn, until you encounter two consecutive nulls.
Hope I have answered your question.
Good luck!

figure out 2 strings similar or not

Rules:
2 strings, a and b, both of them consist of ASCII chars and non-ASCII chars (say, Chinese Characters gbk-encoded).
If the non-ASCII chars contained in b also show up in a and no less than the times they appear in b, then we say b is similar with a.
For example:
a = "ab中ef日jkl中本" //non-ASCII chars:'中'(twice), '日'(once), '本'(once)
b = "bej中中日" //non-ASCII chars:'中'(twice), '日'(once)
c = 'lk日日日' //non-ASCII chars:'日'(3 times, but only once in a)
according to the rule, b is similar with a, but c is not.
Here is my question:
We don't know how many non-ASCII chars are there in a and b, probably many.
So to find out how many times a non-ASCII char appears in a and b, am I supposed to use a Hash-Table to store their appearing-times?
Take string a as an example:
[non-ASCII's hash-value]:[times]
中's hash-val : 2
日's hash-val : 1
本's hash-val : 1
Check string b, if we encounter a non-ASCII char in b, then hash it and check a's hash-table, if the char is present in a's hash-table, then its appearing-times decrements by 1.
If the appearing-times is less than 0 (-1), then we say b is not similar with a.
Or is there any better way?
PS:
I read string a byte by byte; if the byte is less than 128, I take it as an ASCII char, otherwise I take it as part of a non-ASCII (multi-byte) char.
This is what I am doing to find out the non-ASCII chars.
Is it right?

You have asked two questions:
Can we count the non-ASCII characters using a hashtable? Answer: sure. As you read the characters (not the bytes), examine the codepoints. For any codepoint greater than 127, put it into a counting hashtable. That is, for a character c, add (c,1) if c is not in the table, and update (c,x) to (c, x+1) if c is in the table already.
Is there a better way to solve this problem than your approach of incrementing counts in a and decrementing as you run through b? If your hashtable implementation gives nearly O(1) access, then I suspect not. You are looking at each character in the string exactly once, and for each character you are doing either a hashtable insert or lookup, an addition or subtraction, and a check against 0. With unsorted strings, you have to look at all the characters in both strings anyway, so you've given, I think, the best solution.
The interviewer might be looking for you to say things like, "Hmmmmm, if these strings were actually massive files that could not fit in memory, what would I do?" Or for you to ask "Well are the string sorted? Because if they are, I can do it faster...".
But now let's say the strings are massive. The only thing you are storing in memory is the hashtable. Unicode has only around 1 million codepoints and you are storing an integer count for each, so even if you are getting data from gigabyte sized files you only need around 4MB or so for your hash table (or a small multiple of this, as there will be overhead).
In the absence of any other conditions, your algorithm is nice. Sorting the strings beforehand isn't good; it takes up more memory and isn't a linear-time operation.
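The count-up, count-down idea can be sketched in C as below. Two things are swapped in and should be read as assumptions: the input is taken to be well-formed UTF-8 rather than GBK, and a flat array indexed by codepoint (roughly the 4MB mentioned above) stands in for a hashtable. The names next_codepoint and is_similar are hypothetical.

```c
#include <stdlib.h>

#define MAX_CODEPOINT 0x110000UL   /* one past the last Unicode codepoint */

/* Decodes one UTF-8 sequence starting at *s, advances *s past it and
 * returns the codepoint. Assumes well-formed input; nothing is validated. */
static unsigned long next_codepoint(const unsigned char **s)
{
    const unsigned char *p = *s;
    unsigned long cp;
    int extra;   /* number of continuation bytes that follow */

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && *p != '\0')
        cp = (cp << 6) | (*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Returns 1 if every non-ASCII character of b occurs in a at least as
 * often as in b, 0 if not, and -1 on allocation failure. */
int is_similar(const char *a, const char *b)
{
    int *count = calloc(MAX_CODEPOINT, sizeof *count);
    int similar = 1;
    if (count == NULL)
        return -1;

    const unsigned char *p = (const unsigned char *)a;
    while (*p != '\0') {                       /* count up over a */
        unsigned long cp = next_codepoint(&p);
        if (cp > 127 && cp < MAX_CODEPOINT)
            count[cp]++;
    }
    p = (const unsigned char *)b;
    while (similar && *p != '\0') {            /* count down over b */
        unsigned long cp = next_codepoint(&p);
        if (cp > 127 && cp < MAX_CODEPOINT && --count[cp] < 0)
            similar = 0;
    }
    free(count);
    return similar;
}
```

With the strings from the question encoded as UTF-8, is_similar(a, b) gives 1 and is_similar(a, c) gives 0. For short strings a real hashtable would avoid clearing 4MB on every call; the flat array is used here only to keep the sketch short.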
ADDENDUM
Since your original comments mentioned the type char as opposed to wchar_t, I thought I'd show an example of using wide strings. See http://codepad.org/B3MXOgqc
Hope that helps.
ADDENDUM 2
Okay here is a C program that shows exactly how to go through a widestring and work at the character level:
http://codepad.org/QVX3QPat
It is a very short program so I will also paste it here:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
const char *s1 = "abd中日";
const wchar_t *s2 = L"abd中日";
int main(void) {
    size_t i, n;
    /* strlen counts bytes: 3 ASCII bytes plus 3 bytes per CJK character in UTF-8 */
    printf("length of s1 is %zu\n", strlen(s1));
    /* wcslen counts wide characters */
    printf("length of s2 using wcslen is %zu\n", wcslen(s2));
    printf("The codepoints of the characters of s2 are\n");
    for (i = 0, n = wcslen(s2); i < n; i++) {
        printf("%02lx\n", (unsigned long)s2[i]);
    }
    return 0;
}
Output:
length of s1 is 9
length of s2 using wcslen is 5
The codepoints of the characters of s2 are
61
62
64
4e2d
65e5
What can we learn from this? A couple things:
If you use plain old char for CJK characters then strlen will report the number of bytes, not the number of characters.
To use Unicode characters in C, use wchar_t
String literals have a leading L for wide strings
In this example I defined a string with CJK characters and used wchar_t and a for-loop with wcslen. Please note here that I am working with real characters, NOT BYTES, so I get the correct count of characters, which is 5. Now I print out each codepoint. In your interview question, you will be looking to see if the codepoint is >= 128. I showed them in Hex, as is the culture, so you can look for > 0x7F. :-)
ADDENDUM 3
A few notes in http://tldp.org/HOWTO/Unicode-HOWTO-6.html are worth reading. There is a lot more to character handling than the simple example above shows. In the comments below J.F. Sebastian gives a number of other important links.
One of the things that needs to be addressed is normalization. For example, does your interviewer care whether two strings, one containing just a Ç and the other a C followed by a COMBINING CEDILLA (U+0327), are treated as the same? They represent the same character, but one uses one codepoint and the other uses two.

What's the easiest way to parse a string in C?

I have to parse this string in C:
XFR 3 NS 207.46.106.118:1863 0 207.46.104.20:1863\r\n
And be able to get the 207.46.106.118 part and 1863 part (the first ip address).
I know I could go char by char and eventually find my way through it, but what's the easiest way to get this information, given that the IP address in the string could change to a different format (with fewer digits)?

You can use sscanf() from the C standard lib. Here's an example of how to get the ip and port as strings, assuming the part in front of the address is constant:
#include <stdio.h>
int main(void)
{
    const char *input = "XFR 3 NS 207.46.106.118:1863 0 207.46.104.20:1863\r\n";
    const char *format = "XFR 3 NS %15[0-9.]:%5[0-9]";
    char ip[16] = { 0 }; // ip4 addresses have max len 15
    char port[6] = { 0 }; // port numbers are 16bit, ie 5 digits max
    if(sscanf(input, format, ip, port) != 2)
        puts("parsing failed");
    else printf("ip = %s\nport = %s\n", ip, port);
    return 0;
}
The important parts of the format strings are the scanset patterns %15[0-9.] and %5[0-9], which will match a string of at most 15 characters composed of digits or dots (ie ip addresses won't be checked for well-formedness) and a string of at most 5 digits respectively (which means invalid port numbers above 2^16 - 1 will slip through).
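If the port is needed as a number, the range problem mentioned above can be closed with a small check after the sscanf call. The helper below is a sketch; parse_port is a hypothetical name and not part of the original answer.

```c
#include <errno.h>
#include <stdlib.h>

/* Converts a string of decimal digits to a port number.
 * Returns the port (0..65535), or -1 if text is empty, contains
 * anything other than digits, or is out of range. */
long parse_port(const char *text)
{
    char *end;

    /* reject empty input, signs and leading whitespace up front */
    if (text[0] < '0' || text[0] > '9')
        return -1;

    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > 65535)
        return -1;
    return value;
}
```

Here parse_port("1863") gives 1863, while "65536" and "99999" are rejected even though they pass the %5[0-9] scanset.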

Depends on what defines the format of the document. In this case, it may be as simple as tokenizing the string and looking through the tokens for what you want. Simply use strtok and split on spaces to grab the 207.46.106.118:1863 and then you can tokenize that again (or simply scan for the : manually) to get the proper components.
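A minimal sketch of that tokenizing approach, assuming the first token containing a ':' is the address of interest. The function name first_endpoint and the 256-byte working copy are assumptions made for the example. The copy is there because strtok writes null characters into the string it splits; strtok also keeps hidden state, so this is not safe to call from several threads at once.

```c
#include <string.h>

/* Splits line on spaces and line endings, finds the first token of the
 * form host:port and copies both parts out. Returns 1 on success, 0 if
 * no such token exists or something does not fit. */
int first_endpoint(const char *line,
                   char *host, size_t host_size,
                   char *port, size_t port_size)
{
    char copy[256];   /* assumed upper bound on the line length */

    if (strlen(line) >= sizeof copy)
        return 0;
    strcpy(copy, line);   /* strtok modifies its input, so work on a copy */

    for (char *tok = strtok(copy, " \r\n"); tok != NULL;
         tok = strtok(NULL, " \r\n")) {
        char *colon = strchr(tok, ':');
        if (colon == NULL)
            continue;             /* not a host:port token */
        *colon = '\0';            /* split the token at the colon */
        if (strlen(tok) >= host_size || strlen(colon + 1) >= port_size)
            return 0;
        strcpy(host, tok);
        strcpy(port, colon + 1);
        return 1;
    }
    return 0;
}
```

For the line in the question, with a 16-byte host buffer and a 6-byte port buffer, this yields "207.46.106.118" and "1863". Unlike the sscanf version it does not depend on the "XFR 3 NS " prefix, but it also does not check that the host part looks like an IP address.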

You could use strtok to tokenize breaking on space, or you could use one of the scanf family to pull out data as well.
There is a big caveat in all of this though: these functions are notorious for security problems and for mishandling bad input. YMMV.

Loop forward until you reach the first '.', then loop back until you find ' '. Then loop forward until you find ':', building substrings every time you meet '.' or ':'. You can check the number of substrings and their lengths as simple error checking. Then loop until you find a ' ' and you have the 1863 part.
This would be robust if the beginning of the string doesn't vary much. And also very easy. You could make it even simpler if the string always begins with "XFR 3 NS ".

In this case, strtok() is trivial to use and would be my choice. For safety, you might count the ':' in your string and proceed only if there is exactly one ':'.

If the strings to be parsed are well-formatted then I'd go with Daniel and Ukko's suggestion to use strtok().
A word of warning though: strtok() modifies the string that it parses. Not always what you want.

This may be overkill, since you said you didn't want to use a regex library, but the re2c program will give you regex parsing without the library: it generates the DFSM for a regular expression as C code. The regexps are specified in comments embedded in C code.
And what seems like overkill now may become a comfort to you later should you have to parse the rest of the string; it is a lot easier to modify a few regexps to adjust or add new syntax than to modify a bunch of ad hoc tokenizing code. And it makes the structure of what you are parsing a lot clearer in your code.
