Converting ascii hex string to byte array - c

I have a char array say char value []={'0','2','0','c','0','3'};
I want to convert this into a byte array like unsigned char val[]={'02','0c','03'}
This is in an embedded application, so I can't use string.h functions. How can I do this?

Since you talk about an embedded application, I assume that you want to save the numbers as values and not as strings/characters. So if you just want to store your character data as numbers (for example in an integer), you can use sscanf.
This means you could do something like this:
char source_val[] = {'0','A','0','3','B','7','\0'}; // Represents the numbers 0x0A, 0x03 and 0xB7; NUL-terminated because sscanf expects a string
uint8_t dest_val[3]; // We want to save 3 numbers
for(int i = 0; i<3; i++)
{
sscanf(&source_val[i*2],"%2hhx",&dest_val[i]); // Every time we read exactly two hex digits into one byte (C99 %hhx)
}
// Now dest_val contains 0x0A, 0x03 and 0xB7
However if you want to store it as a string (like in your example), you can't use unsigned char
since this type is also just 8-Bit long, which means it can only store one character. Displaying 'B3' in a single (unsigned) char does not work.
edit: OK, according to the comments, the goal is to save the passed data as a numerical value. Unfortunately the asker's compiler does not support sscanf, which would be the easiest way to do so. Since this is (in my opinion) the simplest approach, I will leave this part of the answer as it is and add a more custom approach in this edit.
Regarding the data type, it actually doesn't matter if you have uint8. Even though I would advise using some kind of integer data type, you can also store your data in an unsigned char. The problem here is that the data you get passed is a character/letter that you want to interpret as a numerical value. However, the internal storage of a character differs from the value it depicts. You can look up the internal value of every character in the ASCII table.
For example:
char letter = 'A'; // Internally 0x41
char number = 0x61; // Internally 0x61 - represents the letter 'a'
As you can see, there is also a difference between upper and lower case.
If you do something like this:
int myVal = letter; // myVal is 0x41, not 0xA
myVal won't represent the value 0xA (decimal 10), it will have the value 0x41.
The fact that you can't use sscanf means you need a custom function. So first of all we need a way to convert one letter into an integer:
int charToInt(char letter)
{
int myNumerical;
// First we want to check if it's 0-9, A-F, or a-f --> See ASCII Table
if(letter > 47 && letter < 58)
{
// 0-9
myNumerical = letter-48;
// The Letter "0" is in the ASCII table at position 48 -> meaning if we subtract 48 we get 0 and so on...
}
else if(letter > 64 && letter < 71)
{
// A-F
myNumerical = letter-55;
// The Letter "A" (dec 10) is at Pos 65 --> 65-55 = 10 and so on..
}
else if(letter > 96 && letter < 103)
{
// a-f
myNumerical = letter-87;
// The Letter "a" (dec 10) is at Pos 97--> 97-87 = 10 and so on...
}
else
{
// Not supported letter...
myNumerical = -1;
}
return myNumerical;
}
Now we have a way to convert every single character into a number. The other problem is to combine two characters into one value, but this is rather easy:
int appendNumbers(int higherNibble, int lowerNibble)
{
int myNumber = higherNibble << 4;
myNumber |= lowerNibble;
return myNumber;
// Example: higherNibble = 0x0A, lowerNibble = 0x03; -> myNumber = 0xA3
// Of course you have to ensure that the parameters are not bigger than 0x0F
}
Now everything together would be something like this:
char source_val[] = {'0','A','0','3','B','7'}; // Represents the numbers 0x0A, 0x03 and 0xB7
int dest_val[3]; // We want to save 3 numbers
int temp_low, temp_high;
for(int i = 0; i<3; i++)
{
temp_high = charToInt(source_val[i*2]);
temp_low = charToInt(source_val[i*2+1]);
dest_val[i] = appendNumbers(temp_high , temp_low);
}
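Note that the loop above stores whatever charToInt returns, including -1 for an unsupported character. The following is only a sketch of a checked variant, not part of the original answer: the names hex_digit_value and hex_to_bytes are made up for illustration, and it assumes the caller supplies the output buffer (no malloc, no string.h).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit, or -1 if it is not one. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Converts src_len hex characters (no terminator needed) into
 * src_len / 2 bytes. Returns the number of bytes written, or -1 on
 * an odd length, an invalid digit, or a destination that is too small. */
int hex_to_bytes(const char *src, size_t src_len,
                 uint8_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL);
    if (src_len % 2 != 0 || dst_len < src_len / 2)
        return -1;
    for (size_t i = 0; i < src_len / 2; i++) {
        int hi = hex_digit_value(src[2 * i]);
        int lo = hex_digit_value(src[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;                       /* reject instead of storing garbage */
        dst[i] = (uint8_t)((hi << 4) | lo);  /* high nibble first */
    }
    return (int)(src_len / 2);
}
```

Called with the six characters of source_val and a 3-byte buffer, this should yield 0x0A, 0x03 and 0xB7 and return 3.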
I hope that I understood your problem right, and this helps..

If you have a "proper" array, like value as declared in the question, then you loop over the size of it to get each character. If you're on a system which uses the ASCII alphabet (which is most likely), then you can convert a hexadecimal digit in character form to its numeric value by subtracting '0' for digits (see the linked ASCII table to understand why), or by subtracting 'A' or 'a' for letters (make sure no letters are higher than 'F' of course) and adding ten.
When you have the value of the first hexadecimal digit, convert the second hexadecimal digit the same way. Multiply the first value by 16 and add the second value. You now have a single byte value corresponding to two hexadecimal digits in character form.
Time for some code examples:
/* Function which converts a hexadecimal digit character to its integer value */
int hex_to_val(const char ch)
{
if (ch >= '0' && ch <= '9')
return ch - '0'; /* Simple ASCII arithmetic */
else if (ch >= 'a' && ch <= 'f')
return 10 + ch - 'a'; /* Because hex-digit a is ten */
else if (ch >= 'A' && ch <= 'F')
return 10 + ch - 'A'; /* Because hex-digit A is ten */
else
return -1; /* Not a valid hexadecimal digit */
}
...
/* Source character array */
char value []={'0','2','0','c','0','3'};
/* Destination "byte" array */
unsigned char val[3]; /* unsigned so values above 0x7f print correctly */
/* `i < sizeof(value)` works because `sizeof(char)` is always 1 */
/* `i += 2` because there are two digits per value */
/* NOTE: This loop can only handle an array of even number of entries */
for (size_t i = 0, j = 0; i < sizeof(value); i += 2, ++j)
{
int digit1 = hex_to_val(value[i]); /* Get value of first digit */
int digit2 = hex_to_val(value[i + 1]); /* Get value of second digit */
if (digit1 == -1 || digit2 == -1)
continue; /* Not a valid hexadecimal digit */
/* The first digit is multiplied with the base */
/* Cast to the destination type */
val[j] = (unsigned char) (digit1 * 16 + digit2);
}
for (size_t i = 0; i < 3; ++i)
printf("Hex value %zu = %02x\n", i + 1, val[i]);
The output from the code above is
Hex value 1 = 02
Hex value 2 = 0c
Hex value 3 = 03
A note about the ASCII arithmetic: The ASCII value for the character '0' is 48, and the ASCII value for the character '1' is 49. Therefore '1' - '0' will result in 1.

It's easy with strtol():
#include <stdlib.h>
#include <assert.h>
void parse_bytes(unsigned char *dest, const char *src, size_t n)
{
/** size 3 is important to make sure tmp is \0-terminated and
the initialization guarantees that the array is filled with zeros */
char tmp[3] = "";
while (n--) {
tmp[0] = *src++;
tmp[1] = *src++;
*dest++ = strtol(tmp, NULL, 16);
}
}
int main(void)
{
unsigned char d[3];
parse_bytes(d, "0a1bca", 3);
assert(d[0] == 0x0a);
assert(d[1] == 0x1b);
assert(d[2] == 0xca);
return EXIT_SUCCESS;
}
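The strtol() version above trusts its input. As a rough sketch only (the name parse_byte_checked is hypothetical and not from the answer), one way to reject bad input is to test both characters with isxdigit() first, since strtol() on its own would also accept leading whitespace or a sign:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical checked variant: parses exactly two hex digits from src.
 * Returns 0 and stores the byte in *out on success, -1 otherwise. */
int parse_byte_checked(const char *src, unsigned char *out)
{
    char tmp[3];
    char *end;
    long v;

    assert(src != NULL && out != NULL);
    /* strtol() would also accept whitespace or a sign, so check first;
     * the short-circuit keeps us from reading past a terminator */
    if (!isxdigit((unsigned char)src[0]) || !isxdigit((unsigned char)src[1]))
        return -1;
    tmp[0] = src[0];
    tmp[1] = src[1];
    tmp[2] = '\0';
    v = strtol(tmp, &end, 16);
    if (end != tmp + 2)          /* defensive: both digits must be consumed */
        return -1;
    *out = (unsigned char)v;
    return 0;
}
```

parse_bytes() could call this once per pair and stop at the first -1.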
If that is not available (even though it is NOT from string.h), you could do something like:
int ctohex(char c)
{
if (c >= '0' && c <= '9') {
return c - '0';
}
switch (c) {
case 'a':
case 'A':
return 0xa;
case 'b':
case 'B':
return 0xb;
/**
* and so on
*/
}
return -1;
}
void parse_bytes(unsigned char *dest, const char *src, size_t n)
{
while (n--) {
*dest = ctohex(*src++) * 16;
*dest++ += ctohex(*src++);
}
}
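The switch in ctohex() above leaves the remaining cases out. If spelling out every case is unappealing, a different technique is to search a small digit table, where the index of the match is the digit's value. This is only a sketch; the name ctohex_table is made up here and it needs nothing from string.h:

```c
/* Hypothetical alternative to the switch: look the character up in a
 * table of hex digits. Returns 0..15, or -1 if c is not a hex digit. */
int ctohex_table(char c)
{
    static const char digits[] = "0123456789abcdef";
    int i;

    if (c >= 'A' && c <= 'F')          /* fold upper case to lower case */
        c = (char)(c - 'A' + 'a');
    for (i = 0; i < 16; i++) {         /* 16 entries, terminator excluded */
        if (digits[i] == c)
            return i;
    }
    return -1;
}
```

The linear search costs at most 16 comparisons per character, which is usually acceptable for short inputs.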

Assuming 8-bit bytes (not actually guaranteed by the C standard, but ubiquitous), the range of `unsigned char` is 0..255, and the range of `signed char` is -128..127. ASCII was developed as a 7-bit code using values in the range 0-127, so the same value can be represented by both `char` types.
For the now discovered task of converting a counted hex-string from ascii to unsigned bytes, here's my take:
unsigned int atob(char a){
register int b;
b = a - '0'; // subtract '0' so '0' goes to 0 .. '9' goes to 9
if (b > 9) b = b - ('A' - '0') + 10; // too high! try 'A'..'F'
if (b > 15) b = b - ('a' - 'A'); // too high! try 'a'..'f'
return b;
}
void myfunc(const char *in, int n){
int i;
unsigned char *ba;
ba=malloc(n/2);
for (i=0; i < n; i+=2){
ba[i/2] = (atob(in[i]) << 4) | atob(in[i+1]);
}
// ... do something with ba
}

Related

Compress string using bits Operation in C

I want to reduce the memory used to store chars using bitwise operations,
for example
input: {"ACGT"}
output: {"0xE4"}
where we represent the number in binary and then in hexadecimal
if A=00, C=01, G=10, T=11
so ACGT = 0xE4 = 11100100b
I can't figure out the whole way, so here is what I did so far
enum NucleicAc {
A = 0,
C = 1,
G = 2,
T = 3,
} ;
struct _DNA {
char * seq ;
};
DNAString DSCreate(char * mseq) {
DNAString dna = malloc(sizeof(struct _DNA));
if (!dna){
exit(-1);
}
const int length = strlen(mseq);
// left shift will create the base , in the case of ACGT --> 0000,0000
int base = 0 << length * 2;
//loop and set bits
int i = 0 ;
int k = 0 ; // the counter where i want to modify the current bit
//simple looping over the string
for ( ; i < length ; i++ ) {
switch (*(mseq+i)) {
case 'A': // 00
k++;
break;
case 'C': // 0 1
modifyBit(&base, k, 1);
k++;
break;
case 'G': //10
k++;
modifyBit(&base,k , 1);
break;
case 'T': // 11
modifyBit(&base, k, 1);
k++;
modifyBit(&base, k,1);
break;
default:
break;
} //end of switch
k++;
}//end of for
char * generatedSeq ;
//convert the base to hex ??
return dna;
}
void bin(unsigned n){
unsigned i;
for (i = 1 << 7; i > 0; i = i / 2){
(n & i) ? printf("1") : printf("0");
}
}
and if we print the base, the value is 11100100b as expected.
How do I store the hexadecimal representation as a string in the char * seq of the struct?
Any direction, exact solution, or better approach is welcome.
Later I also want to get a letter using only its index, for example
DSGet(dsStruct , '0') --> will return 'A', since dsStruct contained "ACGT" before encoding.
There are several ways you can approach encoding the sequence. Your enum is fine, but for your encoded sequence, a struct that captures the bytes as unsigned char, the original sequence length and the encoded size in bytes will allow an easy decoding. You will get 4-to-1 compression (plus 1 byte if your sequence isn't evenly divisible by 4). The enum and struct could be:
enum { A, C, G, T };
typedef struct {
unsigned char *seq;
size_t len, size;
} encoded;
To map characters in your string to the encoded values a simple function that returns the enum value matching the character is all you need (don't forget to handle any errors)
/* convert character to encoded value */
unsigned char getencval (const char c)
{
if (c == 'A')
return A;
else if (c == 'C')
return C;
else if (c == 'G')
return G;
else if (c == 'T')
return T;
/* exit on anything other than A, C, G, T */
fprintf (stderr, "error: invalid sequence character '%c'\n", c);
exit (EXIT_FAILURE);
}
To encode the original sequence, you populate the encoded struct with the original length (len) and the number of bytes (size) needed to hold the encoded string. Don't forget that even a single character left over after the last full group of 4 requires another byte of storage. You can use a simple add-and-divide to account for any partial 4-character group at the end of the sequence, e.g.
/* encode sequence of characters as 2-bit pairs (4-characters per-byte)
* returns encoded struct with allocated .seq member, on failure the .seq
* member is NULL. User is responsible for freeing .seq member when done.
*/
encoded encode_seq (const char *seq)
{
size_t len = strlen(seq),
size = (len + 3) / 4; /* integer division intentional */
encoded enc = { .seq = calloc (1, size), /* encoded sequence struct */
.len = len,
.size = size };
if (!enc.seq) { /* validate allocation */
perror ("calloc-enc.seq");
return enc;
}
/* loop over each char (i) with byte index (ndx)
* shifting each 2-bit pair by (shift * 2) amount.
*/
for (int i = 0, ndx = 0, shift = 0; seq[i] && seq[i] != '\n'; i++, shift++) {
if (shift == 4) /* reset to 0 */
shift = 0;
if (i && i % 4 == 0) /* after each 4th char, increment ndx */
ndx += 1;
/* shift each encoded value (multiply by 2 for shift of 0, 2, 4, 6) */
enc.seq[ndx] |= getencval (seq[i]) << shift * 2;
}
return enc; /* return encoded struct with allocated .seq member */
}
To get the original sequence back from your encoded struct, the use of a lookup table (shown below with the full code) makes it a breeze. You simply loop over all stored byte values, copying the corresponding strings from the lookup table until the final byte. For the final byte, you need to determine if it is a partial string and, if so, how many characters remain to copy (that's why you store the original sequence length in your struct). Then simply use memcpy to copy that many characters from the final byte's entry, e.g.
/* decodes encoded sequence. Allocates storage for decoded sequence
* and loops over each encoded byte using lookup-table to obtain
* original 4-character string from byte value. User is responsible
* for freeing returned string when done. Returns NULL on allocation
* failure.
*/
char *decode_seq (encoded *eseq)
{
char *seq = malloc (eseq->len + 1); /* allocate storage for sequence */
size_t i = 0, offset = 0, remain;
if (!seq) { /* validate allocation */
perror ("malloc-seq");
return NULL;
}
/* loop appending strings from lookup table for all but last byte */
for (; i < eseq->size - 1; i++) {
memcpy (seq + offset, lookup[eseq->seq[i]], 4);
offset += 4; /* increment offset by 4 */
}
/* determine the number of characters in last byte */
remain = eseq->len - (eseq->size - 1) * 4;
memcpy (seq + offset, lookup[eseq->seq[i]], remain);
seq[offset + remain] = 0; /* nul-terminate seq */
return seq; /* return allocated sequence */
}
Adding the lookup table and putting all the pieces together, one way to approach this problem is:
(edit: lookup table reordered to match your byte-value encoding, optimized decode_seq() to not scan for end-of-string on copy)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
enum { A, C, G, T };
typedef struct {
unsigned char *seq;
size_t len, size;
} encoded;
const char lookup[][5] = {
"AAAA","CAAA","GAAA","TAAA","ACAA","CCAA","GCAA","TCAA",
"AGAA","CGAA","GGAA","TGAA","ATAA","CTAA","GTAA","TTAA",
"AACA","CACA","GACA","TACA","ACCA","CCCA","GCCA","TCCA",
"AGCA","CGCA","GGCA","TGCA","ATCA","CTCA","GTCA","TTCA",
"AAGA","CAGA","GAGA","TAGA","ACGA","CCGA","GCGA","TCGA",
"AGGA","CGGA","GGGA","TGGA","ATGA","CTGA","GTGA","TTGA",
"AATA","CATA","GATA","TATA","ACTA","CCTA","GCTA","TCTA",
"AGTA","CGTA","GGTA","TGTA","ATTA","CTTA","GTTA","TTTA",
"AAAC","CAAC","GAAC","TAAC","ACAC","CCAC","GCAC","TCAC",
"AGAC","CGAC","GGAC","TGAC","ATAC","CTAC","GTAC","TTAC",
"AACC","CACC","GACC","TACC","ACCC","CCCC","GCCC","TCCC",
"AGCC","CGCC","GGCC","TGCC","ATCC","CTCC","GTCC","TTCC",
"AAGC","CAGC","GAGC","TAGC","ACGC","CCGC","GCGC","TCGC",
"AGGC","CGGC","GGGC","TGGC","ATGC","CTGC","GTGC","TTGC",
"AATC","CATC","GATC","TATC","ACTC","CCTC","GCTC","TCTC",
"AGTC","CGTC","GGTC","TGTC","ATTC","CTTC","GTTC","TTTC",
"AAAG","CAAG","GAAG","TAAG","ACAG","CCAG","GCAG","TCAG",
"AGAG","CGAG","GGAG","TGAG","ATAG","CTAG","GTAG","TTAG",
"AACG","CACG","GACG","TACG","ACCG","CCCG","GCCG","TCCG",
"AGCG","CGCG","GGCG","TGCG","ATCG","CTCG","GTCG","TTCG",
"AAGG","CAGG","GAGG","TAGG","ACGG","CCGG","GCGG","TCGG",
"AGGG","CGGG","GGGG","TGGG","ATGG","CTGG","GTGG","TTGG",
"AATG","CATG","GATG","TATG","ACTG","CCTG","GCTG","TCTG",
"AGTG","CGTG","GGTG","TGTG","ATTG","CTTG","GTTG","TTTG",
"AAAT","CAAT","GAAT","TAAT","ACAT","CCAT","GCAT","TCAT",
"AGAT","CGAT","GGAT","TGAT","ATAT","CTAT","GTAT","TTAT",
"AACT","CACT","GACT","TACT","ACCT","CCCT","GCCT","TCCT",
"AGCT","CGCT","GGCT","TGCT","ATCT","CTCT","GTCT","TTCT",
"AAGT","CAGT","GAGT","TAGT","ACGT","CCGT","GCGT","TCGT",
"AGGT","CGGT","GGGT","TGGT","ATGT","CTGT","GTGT","TTGT",
"AATT","CATT","GATT","TATT","ACTT","CCTT","GCTT","TCTT",
"AGTT","CGTT","GGTT","TGTT","ATTT","CTTT","GTTT","TTTT"};
/* convert character to encoded value */
unsigned char getencval (const char c)
{
if (c == 'A')
return A;
else if (c == 'C')
return C;
else if (c == 'G')
return G;
else if (c == 'T')
return T;
/* exit on anything other than A, C, G, T */
fprintf (stderr, "error: invalid sequence character '%c'\n", c);
exit (EXIT_FAILURE);
}
/* encode sequence of characters as 2-bit pairs (4-characters per-byte)
* returns encoded struct with allocated .seq member, on failure the .seq
* member is NULL. User is responsible for freeing .seq member when done.
*/
encoded encode_seq (const char *seq)
{
size_t len = strlen(seq),
size = (len + 3) / 4; /* integer division intentional */
encoded enc = { .seq = calloc (1, size), /* encoded sequence struct */
.len = len,
.size = size };
if (!enc.seq) { /* validate allocation */
perror ("calloc-enc.seq");
return enc;
}
/* loop over each char (i) with byte index (ndx)
* shifting each 2-bit pair by (shift * 2) amount.
*/
for (int i = 0, ndx = 0, shift = 0; seq[i] && seq[i] != '\n'; i++, shift++) {
if (shift == 4) /* reset to 0 */
shift = 0;
if (i && i % 4 == 0) /* after each 4th char, increment ndx */
ndx += 1;
/* shift each encoded value (multiply by 2 for shift of 0, 2, 4, 6) */
enc.seq[ndx] |= getencval (seq[i]) << shift * 2;
}
return enc; /* return encoded struct with allocated .seq member */
}
/* decodes encoded sequence. Allocates storage for decoded sequence
* and loops over each encoded byte using lookup-table to obtain
* original 4-character string from byte value. User is responsible
* for freeing returned string when done. Returns NULL on allocation
* failure.
*/
char *decode_seq (encoded *eseq)
{
char *seq = malloc (eseq->len + 1); /* allocate storage for sequence */
size_t i = 0, offset = 0, remain;
if (!seq) { /* validate allocation */
perror ("malloc-seq");
return NULL;
}
/* loop appending strings from lookup table for all but last byte */
for (; i < eseq->size - 1; i++) {
memcpy (seq + offset, lookup[eseq->seq[i]], 4);
offset += 4; /* increment offset by 4 */
}
/* determine the number of characters in last byte */
remain = eseq->len - (eseq->size - 1) * 4;
memcpy (seq + offset, lookup[eseq->seq[i]], remain);
seq[offset + remain] = 0; /* nul-terminate seq */
return seq; /* return allocated sequence */
}
/* short example program that takes string to encode as 1st argument
* using "ACGT" if no argument is provided by default
*/
int main (int argc, char **argv) {
char *seq = NULL;
encoded enc = encode_seq(argc > 1 ? argv[1] : "ACGT");
if (!enc.seq) /* validate encoded allocation */
return 1;
/* output original string, length and encoded size */
printf ("encoded str : %s\nencoded len : %zu\nencoded size : %zu\n",
argc > 1 ? argv[1] : "ACGT", enc.len, enc.size);
/* loop outputting byte-values of encoded string */
fputs ("encoded seq :", stdout);
for (size_t i = 0; i < enc.size; i++)
printf (" 0x%02x", enc.seq[i]);
putchar ('\n');
seq = decode_seq (&enc); /* decode seq from byte values */
printf ("decoded seq : %s\n", seq); /* output decoded string */
free (seq); /* don't forget to free what you allocated */
free (enc.seq);
}
In most cases a lookup-table provides a great deal of efficiency advantage compared to computing and building each 4-character string during decoding. This is enhanced by the lookup table staying resident in cache for most cases.
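If typing out 256 entries by hand feels error-prone, the same table can be generated at startup. The sketch below is not from the answer: the function name build_lookup is hypothetical, and it assumes the bit layout used above (first character of each group in the two lowest bits).

```c
#include <assert.h>
#include <stddef.h>

/* Fills table[b] with the 4-character string encoded by byte value b,
 * first character in the two lowest bits. Each row is NUL-terminated. */
void build_lookup(char table[256][5])
{
    static const char bases[4] = { 'A', 'C', 'G', 'T' };

    assert(table != NULL);
    for (unsigned b = 0; b < 256; b++) {
        for (unsigned pos = 0; pos < 4; pos++)
            table[b][pos] = bases[(b >> (2 * pos)) & 3u];  /* 2 bits per base */
        table[b][4] = '\0';
    }
}
```

The trade-off is 1280 bytes of writable memory and a short initialization pass, versus a const table that can live in read-only storage.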
The length of the DNA sequence you can encode and decode is limited only by the amount of virtual memory you have available.
Example Use/Output
The program takes the sequence to encode and decode as the first argument (default "ACGT"). So the default output is:
$ ./bin/dnaencodedecode
encoded str : ACGT
encoded len : 4
encoded size : 1
encoded seq : 0xe4
decoded seq : ACGT
4 characters encoded in 1 byte. Note the byte value is 0xe4, the same as in your question, because the first character of each group is stored in the two lowest bits.
A longer example:
$ ./bin/dnaencodedecode ACGTGGGTCAGACTTA
encoded str : ACGTGGGTCAGACTTA
encoded len : 16
encoded size : 4
encoded seq : 0xe4 0xea 0x21 0x3d
decoded seq : ACGTGGGTCAGACTTA
16-character encoded in 4-bytes.
Finally, what of a sequence that isn't divisible by 4 so you have a partial number of characters in the last encoded byte? That is handled as well, e.g.
$ ./bin/dnaencodedecode ACGTGGGTCAGACTTAG
encoded str : ACGTGGGTCAGACTTAG
encoded len : 17
encoded size : 5
encoded seq : 0xe4 0xea 0x21 0x3d 0x02
decoded seq : ACGTGGGTCAGACTTAG
17 characters encoded in 5-bytes. (not the pure 4-to-1 compression, but as the sequence size increases, the significance of any partial group of characters in the last byte becomes negligible)
As far as performance goes, for a sequence of 100,000 characters, with the output of the byte values and strings replaced by a simple loop that compares the decoded seq to the original argv[1], it only takes a few thousandths of a second (on an old i7 Gen2 laptop with SSD) to encode, decode and validate, e.g.
$ time ./bin/dnaencodedecodet2big $(< dat/dnaseq100k.txt)
encoded len : 100000
encoded size : 25000
all tests passed
real 0m0.014s
user 0m0.012s
sys 0m0.003s
There are a lot of ways to do this, but given your description, this was what came to my mind that you were trying to accomplish. There is a lot here, so take your time going through it.
Look things over (the code is commented), and let me know if you have further questions. Just drop a comment below.
This should do what you want. I thought the compression was a pretty cool idea so I wrote this real quick. As mentioned by @kaylum, hex encoding is just a way to read the underlying data in memory, which is always just bits. So, you only need to worry about that on print statements.
Let me know if this works or you have any questions about what I did.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef struct {
unsigned char *bits;
unsigned long length; // use this to store the number of letters encoded, for safety
} DNA;
typedef enum {
A = 0,
C = 1,
G = 2,
T = 3
} NucleicAc;
This returns the base at a given index with some bounds checking
char base_at_index(DNA *dna, unsigned long index) {
if (index >= dna->length) {
fputs("error: base_at_index: index out of range", stderr);
exit(EXIT_FAILURE);
}
// offset is index / 4, this gives us the correct byte
// shift amount is index % 4 to give us the correct 2 bits within the byte.
// This must then be multiplied by 2 because
// each base takes 2 bits to encode
// then we have to bitwise-and this value with
// 3 (0000 0011 in binary) to retrieve the bits we want.
// so, the formula we need is
// (dna->bits[index / 4] >> (2 * (index % 4))) & 3
switch((dna->bits[index / 4] >> 2 * (index % 4)) & 3) {
case A: return 'A';
case C: return 'C';
case G: return 'G';
case T: return 'T';
default:
fputs("error: base_at_index: invalid encoding", stderr);
exit(EXIT_FAILURE);
}
}
This encodes a string of bases to bytes
/* you can fit four 2-bit DNA codes in each byte (unsigned char).
len is the maximum number of characters to read. result must be at least len bytes long
*/
void encode_dna(unsigned char *result, char *sequence, unsigned long len) {
// keep track of what byte we are on in the result
unsigned result_index = 0;
// our shift for writing to the correct position in the byte
unsigned shift = 0;
// first clear result or else bitwise operations will produce errors
// this could be removed if you were certain result parameter was zero-filled
memset(result, 0, len);
// iterate through characters of the sequence
while(*sequence) {
switch (*sequence) {
// do nothing for 'A' since it is just zero
case 'A': break;
case 'C':
// we are doing a bitwise or with the current byte
// and C (1) shifted to the appropriate position within
// the byte, and then assigning the byte with the result
result[result_index] |= C << shift;
break;
case 'G':
result[result_index] |= G << shift;
break;
case 'T':
result[result_index] |= T << shift;
break;
default:
fputs("error: encode_dna: invalid base pair", stderr);
exit(EXIT_FAILURE);
}
// increase shift amount by 2 to the next 2-bit slot in the byte
shift += 2;
// on every 4th iteration, reset our shift to zero since the byte is now full
// and move to the next byte in our result buffer
if (shift == 8) {
shift = 0;
result_index++;
}
// advance sequence to next nucleotide character
sequence++;
}
}
And here's a test
int main(int argc, char **argv) {
// allocate some storage for encoded DNA
unsigned char encoded_dna[32];
const unsigned long sample_length = 15;
// encode the given sample sequence
encode_dna(encoded_dna, "ACGTAGTCGTCATAG", sample_length);
// hh here means half of half word, which is a byte
// capital X for capitalized hex output
// here we print some bytes
printf("0x%hhX\n", encoded_dna[0]); // should output 0xE4
printf("0x%hhX\n", encoded_dna[1]); // should be 0x78
printf("0x%hhX\n", encoded_dna[2]); // should be 0x1E
printf("0x%hhX\n", encoded_dna[3]); // should be 0x23
DNA test_dna; // allocate a sample DNA structure
test_dna.bits = encoded_dna;
test_dna.length = sample_length; // length of the sample sequence above
// test some indices and see if the results are correct
printf("test_dna index 4: %c\n", base_at_index(&test_dna, 4));
printf("test_dna index 7: %c\n", base_at_index(&test_dna, 7));
printf("test_dna index 12: %c\n", base_at_index(&test_dna, 12));
return 0;
}
Output:
0xE4
0x78
0x1E
0x23
test_dna index 4: A
test_dna index 7: C
test_dna index 12: T
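To get the whole sequence back rather than one base at a time, the same shift-and-mask formula can be applied in a loop. This is a minimal sketch assuming the layout produced by encode_dna() above; the function decode_dna is hypothetical and not part of the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical companion to encode_dna(): unpacks `count` bases from
 * `bits` into `out`, which must have room for count + 1 chars. */
void decode_dna(char *out, const unsigned char *bits, unsigned long count)
{
    static const char letters[4] = { 'A', 'C', 'G', 'T' };
    unsigned long i;

    assert(out != NULL && bits != NULL);
    for (i = 0; i < count; i++) {
        /* byte i / 4, 2-bit slot i % 4, masked with 3 (binary 11) */
        out[i] = letters[(bits[i / 4] >> (2 * (i % 4))) & 3];
    }
    out[count] = '\0';
}
```

With the four bytes printed above and a count of 15, this should reproduce the sample sequence ACGTAGTCGTCATAG.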
Assuming you really do want to encode your DNA string into a hex string, and that you want to read the input string from left to right but output hex chars right to left, here's a simple, but slightly slow, implementation.
First, your DNAString needs to keep track of whether it holds an even or odd number of acids. This makes appending more acids later easier.
#include <stdbool.h> /* bool */
#include <stdio.h>
#include <stdlib.h> /* strtol, calloc, free */
#include <string.h> /* strlen, strcpy */
struct DNAString
{
char* seq;
bool odd; // if odd bit is set, then the front char is already allocated and holds one acid
};
And now let's introduce a little helper function to convert ACGT into 0,1,2,3.
char acid_to_value(char c)
{
switch (c)
{
case 'A': return 0;
case 'C': return 1;
case 'G': return 2;
case 'T': return 3;
}
// ASSERT(false)
return 0;
}
Then the core implementation is to keep "prepending" new hex chars onto the string you are building. If the string already holds an odd number of acids, the code will just "fix up" the front character by converting it from hex to integer, then shifting the new acid value into it, then converting it back to a hex char.
char fixup(char previous, char acid)
{
char hexchars[16] = { '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F' };
char tmp[2] = { previous, '\0' };
unsigned long asnumber = strtol(tmp, NULL, 16);
asnumber = asnumber & 0x3; // last two bits
asnumber = asnumber | ((unsigned long)acid_to_value(acid) << 2);
return hexchars[asnumber];
}
void prepend_nucleic_acid_to_hexstring(struct DNAString* dnastring, char acid)
{
if (dnastring->odd)
{
// find the first char in the string and fix it up hexwise
dnastring->seq[0] = fixup(dnastring->seq[0], acid);
dnastring->odd = false;
}
else
{
size_t currentlength = dnastring->seq ? strlen(dnastring->seq) : 0;
const char* currentstring = dnastring->seq ? dnastring->seq : "";
char* newseq = (char*)calloc(currentlength + 2, sizeof(char)); // +1 for new char and +1 for null char
newseq[0] = acid_to_value(acid) + '0'; // prepend the next hex char
strcpy(newseq + 1, currentstring); // copy the old string into the new string space
free(dnastring->seq);
dnastring->seq = newseq;
dnastring->odd = true;
}
}
Then your DSCreate function is really simple:
struct DNAString DSCreate(const char* mseq)
{
struct DNAString dnastring = { 0 };
while (*mseq)
{
prepend_nucleic_acid_to_hexstring(&dnastring, *mseq);
mseq++;
}
return dnastring;
}
I don't claim this approach is efficient, since it literally keeps reallocating memory as the string grows. But it does give you the flexibility to invoke the prepend function later for additional sequencing.
And then to test it:
int main()
{
struct DNAString dnastring = DSCreate("ACGT");
printf("0x%s\n", dnastring.seq);
return 0;
}

Converting a HEX value to DECimal value in C

I have tried out many ideas from SO. One of them worked (output was DEC 49 for HEX 31) when I tested it on onlinegdb.
But when I implemented it in my application, it didn't produce the same results (output was 31 for 31 again).
The idea is to use a string value (full of HEX pairs);
An example: 313030311b5b324a1b5b324a534f495f303032371b
I need to convert each pair of hex digits into its equivalent decimal value.
e.g.
HEX => DEC
31 => 49
30 => 48
I will then send the DEC values over UART, value by value.
The code I used to test the behavior is below.
But it doesn't have to be that code; I am open to all suggestions as long as it does the job.
#include <stdio.h>
int isHexaDigit(char p) {
return (( '0' <= p && p <= '9' ) || ( 'A' <= p && p <= 'F'));
}
int main(int argc, char** argv)
{
char * str = "31";
char t[]="31";
char* p = t;
char val[3]; // 2 hexa digit
val[2] = 0; //and the final \0 for a string
int number;
while (isHexaDigit(*p) && isHexaDigit(*(p+1))) {
val[0] = *p;
val[1] = *(p+1);
sscanf(val,"%X", &number); // <---- Read hexa string into number
printf("\nNum=%i",number); // <---- Display number to decimal.
p++;
//p++;
if (!*p) break;
p++;
}
return 0;
}
EDIT
I minimized the code.
Odd-length strings are ignored for the time being.
The code sends out the data byte by byte. In the terminal application,
I get the values as HEX, e.g. HEX 31 instead of DEC 49. They are actually the same value. But a device I use requires the DEC 49 version of the value (which is ASCII '1').
Any pointers are highly appreciated.
You can use the strtol function to convert your hex string to a binary value and then convert it to a decimal string in a single line:
snprintf(str_dec, 4, "%ld", strtol(str_hex, NULL, 16));
Your code becomes:
#include <stdio.h>
#include <stdlib.h>
int isHexaDigit(char p) {
return (( '0' <= p && p <= '9' ) || ( 'A' <= p && p <= 'F'));
}
int main(int argc, char** argv)
{
char * str = "31";
char t[]="31";
char* p = t;
char str_hex[3] = {0,};
char str_dec[4] = {0,};
while (isHexaDigit(*p) && isHexaDigit(*(p+1))) {
str_hex[0] = *p;
str_hex[1] = *(p+1);
/* Convert hex string to decimal string */
snprintf(str_dec, 4, "%ld", strtol(str_hex, NULL, 16));
printf("str_dec = %s\n", str_dec);
/* Send the decimal string over UART1 */
if (str_dec[0]) UART1_Write(str_dec[0]);
if (str_dec[1]) UART1_Write(str_dec[1]);
if (str_dec[2]) UART1_Write(str_dec[2]);
/* Reset str_dec variable */
str_dec[0] = 0;
str_dec[1] = 0;
str_dec[2] = 0;
p++;
if (!*p) break;
p++;
}
return 0;
}
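Keep in mind that 0x31 and 49 are the same byte value; only the text representation differs. On a target where snprintf is too heavy or unavailable, the decimal digits of a byte can be produced by hand. The following is only a sketch; byte_to_dec is a hypothetical helper, not something from the answer or from any UART library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of v (0..255) into out, which must hold at
 * least 4 chars, NUL-terminates it, and returns the number of digits. */
int byte_to_dec(uint8_t v, char out[4])
{
    int n = 0;

    assert(out != NULL);
    if (v >= 100)
        out[n++] = (char)('0' + v / 100);        /* hundreds */
    if (v >= 10)
        out[n++] = (char)('0' + (v / 10) % 10);  /* tens */
    out[n++] = (char)('0' + v % 10);             /* ones, always written */
    out[n] = '\0';
    return n;
}
```

The returned count tells the caller how many characters to pass to its own transmit routine.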

C. Checksum calculation of array of ASCII hex values

I'm trying to calculate the checksum of an array of ASCII hex values.
Say I have the following
char exArray[] = "3030422320303030373830434441453141542355";
It's 20 pairs of hex values that each represent an ASCII character (eg. 0x41 = A).
How can I split them up to calculate a checksum?
Alternatively, how can I merge two values in an array to be one value?
(eg. '4', '1' -> '41')
@pmg:
First step would be converting the string representation (in hex) to an integer.
For the 2nd part, try ('4' - '0') * 16 + ('1' - '0')
This ultimately did the trick; I love how simple it is too.
My implementation now looks somewhat like this.
uint8_t t = 0, tem, tem2, sum;
uint32_t chksum = 0;
void checkSum(void)
{
while (t < 40)
{
asciiToDec(exArray[t]);
tem = global.DezAscii[0];
t++;
asciiToDec(exArray[t]);
tem2 = global.DezAscii[0];
t++;
sum = (tem) * 16 + (tem2);
chksum += sum;
}
}
void asciiToDec(uint8_t value)
{
if (value == 'A')
global.DezAscii[0] = 10;
else if (value == 'B')
global.DezAscii[0] = 11;
else if (value == 'C')
global.DezAscii[0] = 12;
else if (value == 'D')
global.DezAscii[0] = 13;
else if (value == 'E')
global.DezAscii[0] = 14;
else if (value == 'F')
global.DezAscii[0] = 15;
else
global.DezAscii[0] = value - '0'; /* digits '0'..'9' map to 0..9 */
}
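The same idea can be written without global state. This is a sketch under assumptions: the checksum is taken to be a plain sum of the decoded bytes (the question does not name an algorithm), and the names hex_nibble and hex_checksum are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit, or -1 if invalid. */
int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Sums the bytes encoded as hex pairs in a NUL-terminated string.
 * Returns 0 and stores the sum in *sum_out, or -1 on an invalid digit
 * or an odd number of characters. */
int hex_checksum(const char *hex, uint32_t *sum_out)
{
    uint32_t sum = 0;

    assert(hex != NULL && sum_out != NULL);
    while (hex[0] != '\0') {
        int hi = hex_nibble(hex[0]);
        int lo = hex_nibble(hex[1]);   /* at worst the terminator, which is -1 */
        if (hi < 0 || lo < 0)
            return -1;
        sum += (uint32_t)(hi * 16 + lo);
        hex += 2;
    }
    *sum_out = sum;
    return 0;
}
```

If the protocol expects an 8-bit checksum, the caller can truncate the result, for example with sum & 0xFF.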
uint16_t exArray[] = "3030422320303030373830434441453141542355";
I don't think this does what you are trying to do. A string literal can only initialize a character array, not a uint16_t array, so it does not even compile for me. What you want here is something like this:
const char * exArray = "3030422320303030373830434441453141542355";
It's 20 pairs of hex values that each represent an ASCII character
(eg. 0x41 = A). How can I split them up to calculate a checksum?
You could loop through the array, doing what you want to do with the two chars inside the loop:
for (int i = 0; exArray[i]; i+=2) {
printf("my chars are %c and %c\n", exArray[i], exArray[i+1]);
// do the calculations you need here using exArray[i] and exArray[i+1]
}
Alternatively, how can I merge two values in an array to be one value?
(eg. '4', '1' -> '41')
I'm not sure what you mean here. Do you mean "41", as in the string representing 41? To do that, allocate three chars, then copy over those two chars and a null terminator. Something like
char hexByte[3];
hexByte[2] = 0; // setting the null terminator
for (int i = 0; exArray[i] && exArray[i+1]; i+=2) { // stop if the last pair is incomplete
hexByte[0] = exArray[i];
hexByte[1] = exArray[i+1];
printf("the string \"hexByte\" is: %s\n", hexByte);
// do something with hexByte here
}
If you want to convert it to its integer representation, use strtol:
printf("int value: %ld\n", strtol(hexByte, 0, 16));
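Putting the pieces of this thread together without strtol (which may not be available on an embedded target), the conversion and the checksum could be sketched as below. The names hex_digit_value and hex_checksum are made up for this illustration, and the checksum is assumed to be a plain sum of the decoded bytes, as in the question's own loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of one ASCII hex digit, or -1 if c is not one. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: sums the bytes encoded as hex digit pairs in s.
   Returns 0 on success, -1 on a non-hex character or an odd digit count. */
static int hex_checksum(const char *s, uint32_t *sum_out)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; s[i] != '\0'; i += 2) {
        int hi = hex_digit_value(s[i]);
        /* If s[i + 1] is the terminator this yields -1 and we bail out
           before stepping past the end of the string. */
        int lo = hex_digit_value(s[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        sum += (uint32_t)(hi * 16 + lo);
    }
    *sum_out = sum;
    return 0;
}
```

For the 40-character string from the question, hex_checksum stores 1119 (0x45F) in *sum_out.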

C string to int without any libraries

I'm trying to write my first kernel module so I'm not able to include libraries for atoi, strtol, etc. How can I convert a string to int without these built-in functions? I tried:
int num;
num = string[0] - '0';
which works for the first character, but if I remove the [0] to try and convert the full string it gives me a warning: assignment makes integer from pointer without a cast. So what do I do?
When creating your own string to int function, make sure you check and protect against overflow. For example:
/* an atoi replacement performing the conversion in a single
pass and incorporating 'err' to indicate a failed conversion.
passing NULL as error causes it to be ignored */
int strtoi (const char *s, unsigned char *err)
{
char *p = (char *)s;
int nmax = (1ULL << 31) - 1; /* INT_MAX */
int nmin = -nmax - 1; /* INT_MIN */
long long sum = 0;
char sign = *p;
if (*p == '-' || *p == '+') p++;
while (*p >= '0' && *p <= '9') {
sum = sum * 10 - (*p - '0');
if (sum < nmin || (sign != '-' && -sum > nmax)) goto error;
p++;
}
if (sign != '-') sum = -sum;
return (int)sum;
error:
fprintf (stderr, "strtoi() error: invalid conversion for type int.\n");
if (err) *err = 1;
return 0;
}
You can't remove the [0]. That means that you are subtracting '0' from the pointer string, which is meaningless. You still need to dereference it:
num = string[i] - '0';
A string is an array of characters, represented by an address (a.k.a. a pointer).
A pointer has a value that might look something like 0xa1de2bdf. This value tells you where the array starts.
Subtracting a character from a pointer does not give you a number you can use (e.g. 0xa1de2bdf - 'b' does not really make sense).
To convert a string to a number, you could try this:
//Find the length of the string
int len = 0;
while (str[len] != '\0') {
len++;
}
//Loop through the string
int num = 0, i = 0, digit;
for (i=0; i<len; i++) {
//Extract the digit
digit = str[i] - '0';
//Multiply the digit with its correct position (ones, tens, hundreds, etc.)
num += digit * pow(10, (len-1)-i);
}
Of course if you are not allowed to use math.h library, you could write your own pow(a,b) function which gives you the value of a^b.
int mypowfunc(int a, int b) {
int i=0, ans=1;
//multiply the value a for b number of times
for (i=0; i<b; i++) {
ans *= a;
}
return ans;
}
I have written the code above in a way that is simple to understand. It assumes that your string has a null character ('\0') right behind the last useful character (which is good practice).
Also, you might want to check that the string is actually a valid string with only digits (e.g. '0', '1', '2', etc.). You could do this with an if ... else statement inside the loop.
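The pow() call can also be avoided altogether by multiplying the running total by 10 before adding each digit, which needs only one pass and no length count. A minimal sketch, assuming non-negative decimal input: parse_decimal is a hypothetical name, and INT_MAX is taken from <limits.h>, which only defines macros. In a kernel build the equivalent constant would have to come from the kernel's own headers, which is an assumption here.

```c
#include <limits.h>

/* Hypothetical helper: parses an unsigned decimal string in a single pass.
   Returns 0 on success, -1 on empty input, a non-digit, or overflow. */
static int parse_decimal(const char *str, int *out)
{
    int num = 0;
    int i;

    if (str[0] == '\0') return -1;

    for (i = 0; str[i] != '\0'; i++) {
        int digit;

        if (str[i] < '0' || str[i] > '9') return -1;
        digit = str[i] - '0';

        /* Reject before num * 10 + digit could exceed INT_MAX. */
        if (num > (INT_MAX - digit) / 10) return -1;

        num = num * 10 + digit;
    }
    *out = num;
    return 0;
}
```

With "436" the running total goes 4, 43, 436, which is the same result as the sum of digit * 10^position above.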
In modern kernels you want to use kstrto*:
http://lxr.free-electrons.com/source/include/linux/kernel.h#L274
/**
 * kstrtoul - convert a string to an unsigned long
 * @s: The start of the string. The string must be null-terminated, and may also
 * include a single newline before its terminating null. The first character
 * may also be a plus sign, but not a minus sign.
 * @base: The number base to use. The maximum supported base is 16. If base is
 * given as 0, then the base of the string is automatically detected with the
 * conventional semantics - If it begins with 0x the number will be parsed as a
 * hexadecimal (case insensitive), if it otherwise begins with 0, it will be
 * parsed as an octal number. Otherwise it will be parsed as a decimal.
 * @res: Where to write the result of the conversion on success.
 *
 * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
 * Used as a replacement for the obsolete simple_strtoull. Return code must
 * be checked.
 */
This function skips leading and trailing whitespace, handles one optional + / - sign, and returns 0 on invalid input:
// Convert standard null-terminated string to an integer
// - Skips leading whitespaces.
// - Skips trailing whitespaces.
// - Allows for one, optional +/- sign at the front.
// - Returns zero if any non-+/-, non-numeric, non-space character is encountered.
// - Returns zero if digits are separated by spaces (eg "123 45")
// - Range is checked against Overflow/Underflow (INT_MAX / INT_MIN), and returns 0.
int StrToInt(const char* s)
{
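int minInt = INT_MIN; // from <limits.h>, like CHAR_BIT; 1 << (bits-1) shifts into the sign bit, which is undefined
int maxInt = INT_MAX;
const char* w;
do { // Skip any leading whitespace
for(w=" \t\n\v\f\r"; *w && *s != *w; ++w) ;
if (*w && *s == *w) ++s; else break; // the *w test stops an empty string from matching the terminator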
} while(*s);
int sign = 1;
if ('-' == *s) sign = -1;
if ('+' == *s || '-' == *s) ++s;
long long i=0;
while('0' <= *s && *s <= '9')
{
i = 10*i + *s++ - '0';
if (sign*i < minInt || maxInt < sign*i)
{
i = 0;
break;
}
}
while (*s) // Skip any trailing whitespace
{
for(w=" \t\n\v\f\r"; *w && *s != *w; ++w) ;
if (*w && *s == *w) ++s; else break;
}
return (int)(!*s*sign*i);
}
" not able to include libraries" --> Unclear if code is allowed access to INT_MAX, INT_MIN. There is no way to determine the minimum/maximum signed integer in a completely portable fashion without using the language provided macros like INT_MAX, INT_MIN.
Use INT_MAX, INT_MIN if available. Otherwise we could guess that the char width is 8. We could guess there are no padding bits. We could guess that integers are 2's complement. With these reasonable assumptions, minimum and maximum are defined below.
Note: Shifting into the sign bit is undefined behavior (UB), so don't do that.
Let us add another restriction: make a solution that works for any signed integer from signed char to intmax_t. This disallows code from using a wider type, as there may not be a wider type.
typedef int Austin_int;
#define Austin_INT_MAXMID ( ((Austin_int)1) << (sizeof(Austin_int)*8 - 2) )
#define Austin_INT_MAX (Austin_INT_MAXMID - 1 + Austin_INT_MAXMID)
#define Austin_INT_MIN (-Austin_INT_MAX - 1)
int Austin_isspace(int ch) {
const char *ws = " \t\n\r\f\v";
while (*ws) {
if (*ws == ch) return 1;
ws++;
}
return 0;
}
// *endptr points to where parsing stopped
// *errorptr indicates overflow
Austin_int Austin_strtoi(const char *s, char **endptr, int *errorptr) {
int error = 0;
while (Austin_isspace(*s)) {
s++;
}
char sign = *s;
if (*s == '-' || *s == '+') {
s++;
}
Austin_int sum = 0;
while (*s >= '0' && *s <= '9') {
int ch = *s - '0';
if (sum <= Austin_INT_MIN / 10 &&
(sum < Austin_INT_MIN / 10 || -ch < Austin_INT_MIN % 10)) {
sum = Austin_INT_MIN;
error = 1;
} else {
sum = sum * 10 - ch;
}
s++;
}
if (sign != '-') {
if (sum < -Austin_INT_MAX) {
sum = Austin_INT_MAX;
error = 1;
} else {
sum = -sum;
}
}
if (endptr) {
*endptr = (char *) s;
}
if (errorptr) {
*errorptr = error;
}
return sum;
}
The above depends on C99 or later, where integer division truncates toward zero, in the Austin_INT_MIN / 10 and Austin_INT_MIN % 10 part.
This is the cleanest and safest way I could come up with
int str_to_int(const char * str, size_t n, int * int_value) {
int i;
int cvalue;
int value_muliplier = 1;
int res_value = 0;
int neg = 1; // -1 for negative and 1 for whole.
size_t str_len; // String length.
int end_at = 0; // Where loop should end.
if (str == NULL || int_value == NULL || n <= 0)
return -1;
// Get string length
str_len = strnlen(str, n);
if (str_len <= 0)
return -1;
// Is negative.
if (str[0] == '-') {
neg = -1;
end_at = 1; // If negative 0 item in 'str' is skipped.
}
// Do the math.
for (i = str_len - 1; i >= end_at; i--) {
cvalue = char_to_int(str[i]);
// Character not a number.
if (cvalue == -1)
return -1;
// Do the same math that is down below.
res_value += cvalue * value_muliplier;
value_muliplier *= 10;
}
/*
* "436"
* res_value = (6 * 1) + (3 * 10) + (4 * 100)
*/
*int_value = (res_value * neg);
return 0;
}
int char_to_int(char c) {
int cvalue = (int)c;
// Not a number.
// 48 to 57 is 0 to 9 in ascii.
if (cvalue < 48 || cvalue > 57)
return -1;
return cvalue - 48; // 48 is the value of zero in ascii.
}

Data types conversion (unsigned long long to char)

Can anyone tell me what is wrong with the following code?
__inline__
char* ut_byte_to_long (ulint nb) {
char* a = malloc(sizeof(nb));
int i = 0;
for (i=0;i<sizeof(nb);i++) {
a[i] = (nb>>(i*8)) & 0xFF;
}
return a;
}
This string is then concatenated as part of a larger one using strcat. The string prints fine but for the integers which are represented as character symbols. I'm using %s and fprintf to check the result.
Thanks a lot.
EDIT
I took one of the comments below into account (I was adding the terminating \0 separately, before calling fprintf but after strcat) and modified my initial function:
__inline__
char* ut_byte_to_long (ulint nb) {
char* a = malloc(sizeof(nb) + 1);
int i = 0;
for (i=0;i<sizeof(nb);i++) {
a[i] = (nb>>(i*8)) & 0xFF;
}
a[sizeof(nb)] = '\0'; // index by the byte count, not by the value nb
return a;
}
This sample code still isn't printing out a number...
char* tmp;
tmp = ut_byte_to_long(start->id);
fprintf(stderr, "Value of node is %s \n ", tmp);
strcat is expecting a null byte terminating the string.
Change your malloc size to sizeof(nb) + 1 and append '\0' to the end.
You have two problems.
The first is that the character array a contains numbers, such as 2, instead of ASCII codes representing those numbers, such as '2' (=50 on ASCII, might be different in other systems). Try modifying your code to
a[i] = ((nb>>(i*8)) & 0xFF) + '0'; // parentheses needed: + binds tighter than &
The second problem is that the result of the above computation can be anything between 0 and 255, or in other words, a number which requires more than one digit to print.
If you want to print hexadecimal numbers (0-9, A-F), two digits per such computation will be enough, and you can write something like
a[2*i + 0] = int2hex( (nb>>(i*8)) & 0x0F ); //right hexa digit
a[2*i + 1] = int2hex( (nb>>(i*8+4)) & 0x0F ); //left hexa digit
where
char int2hex(int n) {
if (n <= 9 && n >= 0)
return n + '0';
else
return (n-10) + 'A';
}
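As a complete version of the hexadecimal idea, the sketch below writes all digits into a caller-supplied buffer, most significant digit first, and terminates the string so it can be passed to strcat or printed with %s. The name ull_to_hex is hypothetical, the type unsigned long long stands in for the question's ulint, and 8-bit bytes are assumed (two hex digits per byte).

```c
#include <stddef.h>

/* Hypothetical helper: writes nb as hexadecimal text, most significant
   digit first, into buf, which must hold 2 * sizeof nb + 1 chars.
   Assumes 8-bit bytes. */
static void ull_to_hex(unsigned long long nb, char *buf)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t ndigits = 2 * sizeof nb;
    size_t i;

    for (i = 0; i < ndigits; i++) {
        /* Digit i counted from the left sits (ndigits - 1 - i) * 4 bits up. */
        unsigned shift = (unsigned)((ndigits - 1 - i) * 4);
        buf[i] = digits[(nb >> shift) & 0x0F];
    }
    buf[ndigits] = '\0';
}
```

Letting the caller own the buffer avoids the malloc in the original function, so nothing has to be freed afterwards.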
If you don't want to use sprintf(target_string,"%lu",source_int) or the non-standard itoa(), here is a version of the function that transforms a long into a string:
__inline__
char* ut_byte_to_long (ulint nb) {
char* a = (char*) malloc(22*sizeof(char));
int i=21;
int j;
do
{
i--;
a[i] = nb % 10 + '0';
nb = nb/10;
}while (nb > 0);
// the number is stored from a[i] to a[20]
// shifting the string to a[0] : a[20-i]
for(j = 0 ; j < 21 && i < 21 ; j++ , i++)
{
a[j] = a[i];
}
a[j] = '\0';
return a;
}
I assumed that an unsigned long contains at most 20 digits (the biggest 64-bit value is 18,446,744,073,709,551,615, which equals 2^64 − 1: 20 digits).

Resources