I have a piece of hardware that I'm trying to control via my computer's built-in SPI driver. The SPI driver is controlled via ioctl.
I can successfully drive the hardware from a small C program; but when I try to duplicate the C program in Ruby I run into problems.
Using IO#ioctl to set basic registers (with u32 and u8 ints) works fine (I know because I can also use ioctl to read back the values I set); but as soon as I try to set a complex struct, the program fails with
small.rb:51:in 'ioctl': Connection timed out # rb_ioctl - /dev/spidev32766.0 (Errno::ETIMEDOUT)
I might be running into trouble because the spi_ioc_transfer struct has two pointers to byte buffers but the pointers are typed as unsigned 64-bit ints even on 32-bit platforms -- necessitating a cast to (unsigned long) in C. I'm trying to replicate that in Ruby but am quite unsure of myself.
Below are the C program which works and the Ruby port which doesn't work. The do_latch functions are necessary so I can see the result in my hardware; but are probably not germane to this problem.
C (which works):
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>
int do_latch() {
int fd = open("/sys/class/gpio/gpio1014/value", O_RDWR);
write(fd, "1", 1);
write(fd, "0", 1);
close(fd);
}
int do_transfer(int fd, uint8_t *bytes, size_t len) {
uint8_t *rx_bytes = malloc(sizeof(uint8_t) * len);
struct spi_ioc_transfer transfer = {
.tx_buf = (unsigned long)bytes,
.rx_buf = (unsigned long)rx_bytes,
.len = len,
.speed_hz = 100000,
.delay_usecs = 0,
.bits_per_word = 8,
.cs_change = 0,
.tx_nbits = 0,
.rx_nbits = 0,
.pad = 0
};
if(ioctl(fd, SPI_IOC_MESSAGE(1), &transfer) < 1) {
perror("Could not send SPI message");
exit(1);
}
free(rx_bytes);
}
int main() {
int fd = open("/dev/spidev32766.0", O_RDWR);
uint8_t mode = 0;
ioctl(fd, SPI_IOC_WR_MODE, &mode);
uint8_t lsb_first = 0;
ioctl(fd, SPI_IOC_WR_LSB_FIRST, &lsb_first);
uint32_t speed_hz = 100000;
ioctl(fd, SPI_IOC_WR_MAX_SPEED_HZ, &speed_hz);
size_t data_len = 36;
uint8_t *tx_data = malloc(sizeof(uint8_t) * data_len);
memset(tx_data, 0xFF, data_len);
do_transfer(fd, tx_data, data_len);
do_latch();
sleep(2);
memset(tx_data, 0x00, data_len);
do_transfer(fd, tx_data, data_len);
do_latch();
free(tx_data);
close(fd);
return 0;
}
Ruby (which fails on the ioctl line in do_transfer):
SPI_IOC_WR_MODE = 0x40016b01
SPI_IOC_WR_LSB_FIRST = 0x40016b02
SPI_IOC_WR_BITS_PER_WORD = 0x40016b03
SPI_IOC_WR_MAX_SPEED_HZ = 0x40046b04
SPI_IOC_WR_MODE32 = 0x40046b05
SPI_IOC_MESSAGE_1 = 0x40206b00
def do_latch()
File.open("/sys/class/gpio/gpio1014/value", File::RDWR) do |file|
file.write("1")
file.write("0")
end
end
def do_transfer(file, bytes)
##########################################################################################
#begin spi_ioc_transfer struct (cat /usr/include/linux/spi/spidev.h)
#pack bytes into a buffer; create a new buffer (filled with zeroes) for the rx
tx_buff = bytes.pack("C*")
rx_buff = (Array.new(bytes.size) { 0 }).pack("C*")
#on 32-bit, the struct uses a zero-extended pointer for the buffers (so it's the same
#byte layout on 64-bit as well) -- so do some trickery to get the buffer addresses
#as 64-bit strings even though this is running on a 32-bit computer
tx_buff_pointer = [tx_buff].pack("P").unpack("L!")[0] #u64 (zero-extended pointer)
rx_buff_pointer = [rx_buff].pack("P").unpack("L!")[0] #u64 (zero-extended pointer)
buff_len = bytes.size #u32
speed_hz = 100000 #u32
delay_usecs = 0 #u16
bits_per_word = 8 #u8
cs_change = 0 #u8
tx_nbits = 0 #u8
rx_nbits = 0 #u8
pad = 0 #u16
struct_array = [tx_buff_pointer, rx_buff_pointer, buff_len, speed_hz, delay_usecs, bits_per_word, cs_change, tx_nbits, rx_nbits, pad]
struct_packed = struct_array.pack("QQLLSCCCCS")
#in C, I pass a pointer to the structure; so mimic that here
struct_pointer_packed = [struct_packed].pack("P")
#end spi_ioc_transfer struct
##########################################################################################
file.ioctl(SPI_IOC_MESSAGE_1, struct_pointer_packed)
end
File.open("/dev/spidev32766.0", File::RDWR) do |file|
file.ioctl(SPI_IOC_WR_MODE, [0].pack("C"));
file.ioctl(SPI_IOC_WR_LSB_FIRST, [0].pack("C"));
file.ioctl(SPI_IOC_WR_MAX_SPEED_HZ, [0].pack("L"));
data_bytes = Array.new(36) { 0x00 }
do_transfer(file, data_bytes)
do_latch()
sleep(2)
data_bytes = []
data_bytes = Array.new(36) { 0xFF }
do_transfer(file, data_bytes)
do_latch()
end
I pulled the magic number constants out by having C print them (they're macros in C). I can validate that most of them work; I'm a little unsure about the ioctl message that fails (SPI_IOC_MESSAGE_1) since that doesn't work and it's a complicated macro. Still, I have no reason to think that it's incorrect and it's always the same when I look at it from C.
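For anyone who wants to double-check the constant without printing it from C, here is a minimal sketch of how the number is put together. It assumes the generic Linux ioctl encoding (the one used on ARM and x86; a few architectures differ) and uses a hand-written mirror of the struct as laid out in this question. The names spi_transfer_mirror, ioc_write and spi_ioc_message are made up for illustration and are not part of spidev.h.

```c
#include <stdint.h>

/* Field-for-field mirror of the spi_ioc_transfer layout used in this
 * question (ASSUMPTION: your kernel headers may name the padding bytes
 * differently, but the total size should still be 32 bytes). */
struct spi_transfer_mirror {
    uint64_t tx_buf;
    uint64_t rx_buf;
    uint32_t len;
    uint32_t speed_hz;
    uint16_t delay_usecs;
    uint8_t  bits_per_word;
    uint8_t  cs_change;
    uint8_t  tx_nbits;
    uint8_t  rx_nbits;
    uint16_t pad;
};

/* Generic _IOW() encoding: direction (1 = write) in bits 30-31, argument
 * size in bits 16-29, type character in bits 8-15, number in bits 0-7. */
static uint32_t ioc_write(uint32_t type, uint32_t nr, uint32_t size)
{
    return (1u << 30) | (size << 16) | (type << 8) | nr;
}

/* SPI_IOC_MESSAGE(n): a write ioctl of type 'k', number 0, whose size
 * field is the size of n transfer structs. */
static uint32_t spi_ioc_message(uint32_t n)
{
    return ioc_write('k', 0, n * (uint32_t)sizeof(struct spi_transfer_mirror));
}
```

With a 32-byte struct, spi_ioc_message(1) comes out as 0x40206b00, and ioc_write('k', 1, 1) and ioc_write('k', 4, 4) reproduce SPI_IOC_WR_MODE and SPI_IOC_WR_MAX_SPEED_HZ from the Ruby listing, so the constants above look consistent.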
When I print out the structure in C and then print it out in Ruby, the only differences are in the buffer addresses, so if something's going wrong, that feels like the right place to look. But I've run out of things to try.
I can also print out the addresses in both versions and they look like what I would expect, 32 bits extended to 64 bits, and match the values in the structure (although the structure is little-endian -- this is an ARM).
Structure in C (that works):
60200200 00000000 a8200200 00000000 24000000 40420f00 00000800 00000000
Structure in Ruby (that fails):
a85da27f 00000000 08399b7f 00000000 24000000 40420f00 00000800 00000000
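To make the dumps above easier to read, here is a small sketch of how a 32-bit pointer ends up as a zero-extended, little-endian 64-bit field. The helper names ptr_to_u64 and store_le64 are hypothetical; the only assumption is a little-endian target, as on this ARM board.

```c
#include <stdint.h>
#include <stddef.h>

/* Zero-extend a pointer to 64 bits, as the u64 buffer fields expect.
 * Going through uintptr_t avoids sign extension on 32-bit targets. */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Write a 64-bit value least-significant byte first, which is the layout a
 * little-endian CPU produces for a uint64_t struct field. */
static void store_le64(uint8_t out[8], uint64_t v)
{
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}
```

For example, the address 0x00022060 is stored as the bytes 60 20 02 00 00 00 00 00, which is exactly how the first field of the C dump reads.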
Is there an obvious mistake that I'm making when I lay out the struct in Ruby? Is there something else that I'm missing?
My next step is to write a library in C and use FFI to access it from Ruby. But that seems like giving up; and using the native ioctl function feels like the better approach if I can ever make it work.
Update
Above, I'm doing
struct_array = [tx_buff_pointer, rx_buff_pointer, buff_len, speed_hz, delay_usecs, bits_per_word, cs_change, tx_nbits, rx_nbits, pad]
struct_packed = struct_array.pack("QQLLSCCCCS")
#in C, I pass a pointer to the structure; so mimic that here
struct_pointer_packed = [struct_packed].pack("P")
file.ioctl(SPI_IOC_MESSAGE_1, struct_pointer_packed)
because I have to pass a pointer to the struct in C. But that's what's causing the error!
Instead, it needs to be
struct_array = [tx_buff_pointer, rx_buff_pointer, buff_len, speed_hz, delay_usecs, bits_per_word, cs_change, tx_nbits, rx_nbits, pad]
struct_packed = struct_array.pack("QQLLSCCCCS")
file.ioctl(SPI_IOC_MESSAGE_1, struct_packed)
I guess Ruby is automatically making it an array when it marshals it over?
Unfortunately, now it only intermittently works. The second call never works and the first call doesn't work if I pass in all zeros. It's very mysterious.
Forgetting to flush the buffer is a common cause of problems like this; it is worth checking and trying.
Flush:
Flushes any buffered data within ios to the underlying operating system (note that this is Ruby internal buffering only; the OS may buffer the data as well).
rb_io_flush(VALUE io)
{
return rb_io_flush_raw(io, 1);
}
[Update] I am offering a bonus for this. Frankly, I don't care which encryption method is used. Preferably something simple like XTEA, RC4, BlowFish ... but you choose.
I want minimum effort on my part, preferably just drop the files into my projects and build.
Ideally you should already have used the code to en/de-crypt a file in Delphi and C (I want to trade files between an Atmel UC3 microprocessor (coding in C) and a Windows PC (coding in Delphi), encrypting and decrypting in both directions).
I have a strong preference for a single .PAS unit and a single .C/.H file. I do not want to use a DLL or a library supporting dozens of encryption algorithms, just one (and I certainly don't want anything with an install program).
I hope that I don't sound too picky here, but I have been googling & trying code for over a week and still can't find two implementations which match. I suspect that only someone who has already done this can help me ...
Thanks in advance.
As a follow-up to my previous post, I am still looking for some very simple code with which I can, with minimal effort, en/de-crypt a file and exchange it between Delphi on a PC and C on an Atmel UC3 microprocessor.
It sounds simple in theory, but in practice it's a nightmare. There are many possible candidates and I have spent days googling and trying them out, to no avail.
Some are humongous libraries, supporting many encryption algorithms, and I want something lightweight (especially on the C / microprocessor end).
Some look good, but one set of source offers only block manipulation, the other strings (I would prefer whole file en/de-crypt).
Most seem to be very poorly documented, with meaningless parameter names and no example code to call the functions.
Over the past weekend (plus a few more days), I have burned my way through a slew of XTEA, XXTEA and BlowFish implementations, but while I can encrypt, I can't reverse the process.
Now I am looking at AES-256. Does anyone know of an implementation in C which is a single AES.C file? (plus AES.H, of course)
Frankly, I will take anything that will do whole file en/de-crypt between Delphi and C, but unless anyone has actually done this themselves, I expect to hear only "any implementation that meets the standard should do", which is a nice theory but just not working out for me :-(
Any simple AES-256 in C out there? I have some reasonable looking Delphi code, but won't be sure until I try them together.
Thanks in advance ...
I would suggest using the .NET Micro Framework on a secondary microcontroller (e.g. Atmel SAM7X) as a crypto coprocessor. You can test this out on a Netduino, which you can pick up for around $35 / £30. The framework includes an AES implementation within it, under the System.Security.Cryptography namespace, alongside a variety of other cryptographic functions that might be useful for you. The benefit here is that you get a fully tested and working implementation, and increased security via type-safe code.
You could use SPI or I2C to communicate between the two microcontrollers, or bit-bang your own data transfer protocol over several I/O lines in parallel if higher throughput is needed.
I did exactly this with an Arduino and a Netduino (using the Netduino to hash blocks of data for a hardware BitTorrent device) and implemented a rudimentary asynchronous system using various commands sent between the devices via SPI and an interrupt mechanism.
Arduino is SPI master, Netduino is SPI slave.
A GPIO pin on the Netduino is set as an output, and tied to another interrupt-enabled GPIO pin on the Arduino that is set as an input. This is the interrupt pin.
Arduino sends 0xF1 as a "hello" initialization message.
Netduino sends back 0xF2 as an acknowledgement.
When Arduino wants to hash a block, it sends 0x48 (ASCII 'H') followed by the data. When it is done sending data, it sets CS low. It must send whole bytes; setting CS low when the number of received bits is not divisible by 8 causes an error.
The Netduino receives the data, and sends back 0x68 (ASCII 'h') followed by the number of received bytes as a 2-byte unsigned integer. If an error occurred, it sends back 0x21 (ASCII '!') instead.
If it succeeded, the Netduino computes the hash, then sets the interrupt pin high. During the computation time, the Arduino is free to continue its job whilst waiting.
The Arduino sends 0x52 (ASCII 'R') to request the result.
The Netduino sets the interrupt pin low, then sends 0x72 (ASCII 'r') and the raw hash data back.
Since the Arduino can service interrupts via GPIO pins, it allowed me to make the processing entirely asynchronous. A variable on the Arduino side tracks whether we're currently waiting on the coprocessor to complete its task, so we don't try to send it a new block whilst it's still working on the old one.
You could easily adapt this scheme for computing AES blocks.
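As a rough illustration of the master side of the exchange described above, here is a C sketch of the opcodes and of parsing the reply to a hash request. The enum and function names are made up for this post, and the byte order of the 2-byte count is an assumption (big-endian), since the description does not specify it.

```c
#include <stdint.h>
#include <stddef.h>

/* Opcodes taken from the description above. */
enum {
    OP_HELLO       = 0xF1, /* master -> coprocessor: initialise */
    OP_HELLO_ACK   = 0xF2, /* coprocessor -> master: acknowledge */
    OP_HASH        = 0x48, /* 'H': master sends a block to hash */
    OP_HASH_ACK    = 0x68, /* 'h': block received, 2-byte count follows */
    OP_ERROR       = 0x21, /* '!': the transfer failed */
    OP_RESULT      = 0x52, /* 'R': master requests the result */
    OP_RESULT_DATA = 0x72  /* 'r': raw hash bytes follow */
};

/* Parse the reply to OP_HASH.
 * Returns the byte count reported by the coprocessor, or -1 for an error
 * reply, a short buffer, or an unexpected opcode.
 * ASSUMPTION: the 2-byte count is sent most significant byte first. */
static long parse_hash_ack(const uint8_t *reply, size_t len)
{
    if (len < 1 || reply[0] == OP_ERROR)
        return -1;
    if (reply[0] != OP_HASH_ACK || len < 3)
        return -1;
    return ((long)reply[1] << 8) | (long)reply[2];
}
```

The master would compare the returned count with the number of bytes it sent before waiting for the interrupt pin and issuing OP_RESULT.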
Small C library for AES-256 by Ilya Levin. Short implementation, asm-less, simple usage. Not sure how it would work on your current micro CPU, though.
[Edit]
You've mentioned having a Delphi implementation, but in case the two don't work together, try this or this.
I've also found an Arduino (AVR-based) module using Ilya's library, so it should also work on your micro CPU.
Can you compile C code from Delphi? (You can compile Delphi code from C++ Builder; not sure about vice versa.) Or maybe use the free Borland command-line C++ compiler or even another C compiler.
The idea is to use the same C code in your Windows app as you use on your microprocessor. That way you can be reasonably sure that the code will work in both directions.
[Update] See
http://www.drbob42.com/examines/examin92.htm
http://www.hflib.gov.cn/e_book/e_book_file/bcb/ch06.htm (Using C++ Code in Delphi)
http://edn.embarcadero.com/article/10156#H11
It looks like you need to use a DLL, but you can statically link it if you don't want to distribute it.
Here is RC4 code. It is very lightweight.
The C has been used in a production system for five years.
I have added lightly tested Delphi code. The Pascal is a line-by-line port with unsigned char going to Byte. I have only run the Pascal in Free Pascal with the Delphi option turned on, not Delphi itself. Both C and Pascal have simple file processors.
Scrambling the ciphertext gives the original cleartext back.
No bugs reported to date. Hope this solves your problem.
rc4.h
#ifndef RC4_H
#define RC4_H
/*
* rc4.h -- Declarations for a simple rc4 encryption/decryption implementation.
* The code was inspired by libtomcrypt. See www.libtomcrypt.org.
*/
typedef struct TRC4State_s {
int x, y;
unsigned char buf[256];
} TRC4State;
/* rc4.c */
void init_rc4(TRC4State *state);
void setup_rc4(TRC4State *state, char *key, int keylen);
unsigned endecrypt_rc4(unsigned char *buf, unsigned len, TRC4State *state);
#endif
rc4.c
#include <stdio.h>
#include <string.h>
#include "rc4.h"
void init_rc4(TRC4State *state)
{
int x;
state->x = state->y = 0;
for (x = 0; x < 256; x++)
state->buf[x] = x;
}
void setup_rc4(TRC4State *state, char *key, int keylen)
{
unsigned tmp;
int x, y;
// use only first 256 characters of key
if (keylen > 256)
keylen = 256;
for (x = y = 0; x < 256; x++) {
y = (y + state->buf[x] + key[x % keylen]) & 255;
tmp = state->buf[x];
state->buf[x] = state->buf[y];
state->buf[y] = tmp;
}
state->x = 255;
state->y = y;
}
unsigned endecrypt_rc4(unsigned char *buf, unsigned len, TRC4State *state)
{
int x, y;
unsigned char *s, tmp;
unsigned n;
x = state->x;
y = state->y;
s = state->buf;
n = len;
while (n--) {
x = (x + 1) & 255;
y = (y + s[x]) & 255;
tmp = s[x]; s[x] = s[y]; s[y] = tmp;
tmp = (s[x] + s[y]) & 255;
*buf++ ^= s[tmp];
}
state->x = x;
state->y = y;
return len;
}
int endecrypt_file(FILE *f_in, FILE *f_out, char *key)
{
TRC4State state[1];
unsigned char buf[4096];
size_t n_read, n_written;
init_rc4(state);
setup_rc4(state, key, strlen(key));
do {
n_read = fread(buf, 1, sizeof buf, f_in);
endecrypt_rc4(buf, n_read, state);
n_written = fwrite(buf, 1, n_read, f_out);
} while (n_read == sizeof buf && n_written == n_read);
return (n_written == n_read) ? 0 : 1;
}
int endecrypt_file_at(char *f_in_name, char *f_out_name, char *key)
{
int rtn;
FILE *f_in = fopen(f_in_name, "rb");
if (!f_in) {
return 1;
}
FILE *f_out = fopen(f_out_name, "wb");
if (!f_out) {
fclose(f_in);
return 2;
}
rtn = endecrypt_file(f_in, f_out, key);
fclose(f_in);
fclose(f_out);
return rtn;
}
#ifdef TEST
// Simple test.
int main(void)
{
char *key = "This is the key!";
endecrypt_file_at("rc4.pas", "rc4-scrambled.c", key);
endecrypt_file_at("rc4-scrambled.c", "rc4-unscrambled.c", key);
return 0;
}
#endif
Here is lightly tested Pascal. I can scramble the source code in C and descramble it with the Pascal implementation just fine.
type
RC4State = record
x, y : Integer;
buf : array[0..255] of Byte;
end;
KeyString = String[255];
procedure initRC4(var state : RC4State);
var
x : Integer;
begin
state.x := 0;
state.y := 0;
for x := 0 to 255 do
state.buf[x] := Byte(x);
end;
procedure setupRC4(var state : RC4State; var key : KeyString);
var
tmp : Byte;
x, y : Integer;
begin
y := 0;
for x := 0 to 255 do begin
y := (y + state.buf[x] + Integer(key[1 + x mod Length(key)])) and 255;
tmp := state.buf[x];
state.buf[x] := state.buf[y];
state.buf[y] := tmp;
end;
state.x := 255;
state.y := y;
end;
procedure endecryptRC4(var buf : array of Byte; len : Integer; var state : RC4State);
var
x, y, i : Integer;
tmp : Byte;
begin
x := state.x;
y := state.y;
for i := 0 to len - 1 do begin
x := (x + 1) and 255;
y := (y + state.buf[x]) and 255;
tmp := state.buf[x];
state.buf[x] := state.buf[y];
state.buf[y] := tmp;
tmp := (state.buf[x] + state.buf[y]) and 255;
buf[i] := buf[i] xor state.buf[tmp]
end;
state.x := x;
state.y := y;
end;
procedure endecryptFile(var fIn, fOut : File; key : KeyString);
var
nRead, nWritten : Longword;
buf : array[0..4095] of Byte;
state : RC4State;
begin
initRC4(state);
setupRC4(state, key);
repeat
BlockRead(fIN, buf, sizeof(buf), nRead);
endecryptRC4(buf, nRead, state);
BlockWrite(fOut, buf, nRead, nWritten);
until (nRead <> sizeof(buf)) or (nRead <> nWritten);
end;
procedure endecryptFileAt(fInName, fOutName, key : String);
var
fIn, fOut : File;
begin
Assign(fIn, fInName);
Assign(fOut, fOutName);
Reset(fIn, 1);
Rewrite(fOut, 1);
endecryptFile(fIn, fOut, key);
Close(fIn);
Close(fOut);
end;
{$IFDEF TEST}
// Very small test.
const
key = 'This is the key!';
begin
endecryptFileAt('rc4.pas', 'rc4-scrambled.pas', key);
endecryptFileAt('rc4-scrambled.pas', 'rc4-unscrambled.pas', key);
end.
{$ENDIF}
It looks like the easiest route would be to take a reference AES implementation (which works on blocks) and add some code to handle CBC (or CTR) encryption.
This would only require you to add ~30-50 lines of code, something like the following (for CBC):
aes_expand_key();
first_block = iv;
for (i = 0; i < filesize / 16; i++)
{
data_block = read(file, 16);
data_block = (data_block ^ iv);
iv = encrypt_block(data_block);
write(outputfile, iv);
}
// if filesize % 16 != 0, then you also need to add some padding and encrypt the last block
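To make the chaining concrete, here is a self-contained C sketch of CBC encryption and decryption over a buffer. The block cipher in it (toy_encrypt_block / toy_decrypt_block) is a placeholder invented so the example compiles on its own; it is NOT AES and NOT secure, and you would swap in the block functions of whichever AES implementation you pick. Padding of a final partial block is left out.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 16

/* PLACEHOLDER block cipher so the sketch compiles on its own. It is NOT AES
 * and NOT secure; replace with a real AES block encrypt/decrypt. */
static void toy_encrypt_block(uint8_t b[BLOCK], const uint8_t key[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++)
        b[i] = (uint8_t)((b[i] ^ key[i]) + i);
}

static void toy_decrypt_block(uint8_t b[BLOCK], const uint8_t key[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++)
        b[i] = (uint8_t)((b[i] - i) ^ key[i]);
}

/* CBC encryption in place: XOR each plaintext block with the previous
 * ciphertext block (the IV for the first block), then encrypt it.
 * len must be a multiple of BLOCK; padding is the caller's job. */
static void cbc_encrypt(uint8_t *data, size_t len,
                        const uint8_t key[BLOCK], const uint8_t iv[BLOCK])
{
    uint8_t prev[BLOCK];
    memcpy(prev, iv, BLOCK);
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK) {
        for (size_t i = 0; i < BLOCK; i++)
            data[off + i] ^= prev[i];
        toy_encrypt_block(data + off, key);
        memcpy(prev, data + off, BLOCK);
    }
}

/* CBC decryption in place: decrypt each block, then XOR it with the previous
 * ciphertext block, which must be saved before it is overwritten. */
static void cbc_decrypt(uint8_t *data, size_t len,
                        const uint8_t key[BLOCK], const uint8_t iv[BLOCK])
{
    uint8_t prev[BLOCK], cur[BLOCK];
    memcpy(prev, iv, BLOCK);
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK) {
        memcpy(cur, data + off, BLOCK);
        toy_decrypt_block(data + off, key);
        for (size_t i = 0; i < BLOCK; i++)
            data[off + i] ^= prev[i];
        memcpy(prev, cur, BLOCK);
    }
}

/* Usage example: round-trip two identical blocks and check that chaining
 * made their ciphertexts differ. Returns 1 on success. */
static int cbc_round_trip_ok(void)
{
    uint8_t key[BLOCK], iv[BLOCK], msg[2 * BLOCK], orig[2 * BLOCK];
    for (size_t i = 0; i < BLOCK; i++) {
        key[i] = (uint8_t)(i * 7 + 1);
        iv[i]  = (uint8_t)(0xA0 + i);
    }
    memset(msg, 0x42, sizeof msg);
    memcpy(orig, msg, sizeof msg);

    cbc_encrypt(msg, sizeof msg, key, iv);
    int chained = memcmp(msg, msg + BLOCK, BLOCK) != 0;
    cbc_decrypt(msg, sizeof msg, key, iv);
    return chained && memcmp(msg, orig, sizeof msg) == 0;
}
```

For whole-file use you would read the file in multiples of 16 bytes, carry the last ciphertext block over as the "previous block" between reads, and agree on the same IV and padding rule on both the Delphi and the C side.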
Assuming the encryption strength isn't an issue, as in satisfying an organization's Chinese Wall requirement, the very simple "Sawtooth" encryption scheme of adding (i++ % 256) to the result of fgetc(), for each byte, starting at the beginning of the file, might work just fine.
Declaring i as a UCHAR will eliminate the modulo requirement, as the single byte integer cannot help but cycle through its 0-255 range.
The code is so simple it's not worth posting. A little imagination, and you'll have some embellishments that can add a lot to the strength of this cypher. The primary vulnerability of this cypher is large blocks of identical characters. Rectifying this is a good place to start improving its strength.
This cypher works on every possible file type, and is especially effective if you've already 7Zipped the file.
Performance is phenomenal. You won't even know the code is there. Totally I/O bound.
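For readers who want to see it anyway, a minimal sketch of the scheme as described, working on a buffer rather than directly on fgetc() output (that change, and the function names, are my own choices for illustration). It is obfuscation only: anyone who knows the scheme can undo it without a key.

```c
#include <stddef.h>
#include <stdint.h>

/* Add a wrapping one-byte counter to every byte of the buffer. */
static void sawtooth_encode(uint8_t *buf, size_t len)
{
    uint8_t i = 0;                  /* a single byte wraps 255 -> 0 by itself */
    for (size_t n = 0; n < len; n++)
        buf[n] = (uint8_t)(buf[n] + i++);
}

/* Undo sawtooth_encode by subtracting the same counter sequence. */
static void sawtooth_decode(uint8_t *buf, size_t len)
{
    uint8_t i = 0;
    for (size_t n = 0; n < len; n++)
        buf[n] = (uint8_t)(buf[n] - i++);
}
```

A run of identical input bytes becomes a visible ramp in the output, which is the weakness with large blocks of identical characters mentioned above.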
Wikipedia says it's called a quine and someone gave the code below:
char*s="char*s=%c%s%c;main(){printf(s,34,s,34);}";main(){printf(s,34,s,34);}
But, obviously you have to add
#include <stdio.h> //corrected from #include <stdlib.h>
so that the printf() could work.
Literally, since the above program did not print #include <stdio.h>, it is not a solution (?)
I am confused about the literal requirement of "print its own source code", and any purpose of this kind of problems, especially at interviews.
The main purpose of interview questions about quine programs is usually to see whether you've come across them before. They are almost never useful in any other sense.
The code above can be upgraded modestly to make a C99-compliant program (according to GCC), as follows:
Compilation
/usr/bin/gcc -O3 -g -std=c99 -Wall -Wextra -Wmissing-prototypes \
-Wstrict-prototypes -Wold-style-definition quine.c -o quine
Code
#include <stdio.h>
char*s="#include <stdio.h>%cchar*s=%c%s%c;%cint main(void){printf(s,10,34,s,34,10,10);}%c";
int main(void){printf(s,10,34,s,34,10,10);}
Note that this assumes a code set where " is code point 34 and newline is code point 10. This version prints out a newline at the end, unlike the original. It also contains the #include <stdio.h> that is needed, and the lines are almost short enough to work on SO without a horizontal scroll bar. With a little more effort, it could undoubtedly be made short enough.
Test
The acid test for the quine program is:
./quine | diff quine.c -
If there's a difference between the source code and the output, it will be reported.
An almost useful application of "quine-like" techniques
Way back in the days of my youth, I produced a bilingual "self-reproducing" program. It was a combination of shell script and Informix-4GL (I4GL) source code. One property that made this possible was that I4GL treats { ... } as a comment, but the shell treats that as a unit of I/O redirection. I4GL also has #...EOL comments, as does the shell. The shell script at the top of the file included data and operations to regenerate the complex sequence of validation operations in a language that does not support pointers. The data controlled which I4GL functions we generated and how each one was generated. The I4GL code was then compiled to validate the data imported from an external data source on a weekly basis.
If you ran the file (call it file0.4gl) as a shell script and captured the output (call that file1.4gl), and then ran file1.4gl as a shell script and captured the output in file2.4gl, the two files file1.4gl and file2.4gl would be identical. However, file0.4gl could be missing all the generated I4GL code and as long as the shell script 'comment' at the top of the file was not damaged, it would regenerate a self-replicating file.
The trick here is that most compilers will compile without requiring you to include stdio.h.
They will usually just throw a warning.
A quine has deep roots in the fixed-point semantics of programming languages and of execution in general. Quines have some importance in theoretical computer science, but in practice they have no purpose.
They are a sort of challenge or trick.
The literal requirement is just as you said, literal: you have a program, and its execution produces the program itself as output. Nothing more, nothing less. That's why it's considered a fixed point: executing the program under the language semantics yields the program itself as its output.
So if you express the computation as a function you'll have that
f(program, environment) = program
In the case of a quine the environment is considered empty (you have no input and nothing precomputed beforehand).
You can also define printf's prototype by hand.
const char *a="const char *a=%c%s%c;int printf(const char*,...);int main(){printf(a,34,a,34);}";int printf(const char*,...);int main(){printf(a,34,a,34);}
Here's a version that will be accepted by C++ compilers:
#include<stdio.h>
const char*s="#include<stdio.h>%cconst char*s=%c%s%c;int main(int,char**){printf(s,10,34,s,34);return 0;}";int main(int,char**){printf(s,10,34,s,34);return 0;}
test run:
$ /usr/bin/g++ -o quine quine.cpp
$ ./quine | diff quine.cpp - && echo 'it is a quine' || echo 'it is not a quine'
it is a quine
The string s contains mostly a copy of the source, except for the content of s itself - instead it has %c%s%c there.
The trick is that in the printf call, the string s is used both as the format and as the replacement for the %s. This causes printf to also put it into the definition of s (in the output text, that is).
The additional 10 and 34s correspond to the linefeed and the " string delimiter. They are inserted by printf as replacements for the %cs, because writing them literally would require an additional \ in the format string, which would cause the format string and the replacement string to differ, so the trick wouldn't work anymore.
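The substitution step can be seen in isolation with a small sketch that renders a template into a buffer instead of printing it. This is not itself a quine, just the mechanism; the names tmpl and render are made up for the illustration, and it assumes an ASCII code set where 34 is the double quote.

```c
#include <stdio.h>
#include <stddef.h>

/* The template: a copy of the declaration, with %c%s%c standing where its
 * own quoted text belongs. */
static const char *tmpl = "char*s=%c%s%c;";

/* Render the template using itself as the %s argument; 34 is '"' in ASCII.
 * Returns the number of characters snprintf produced (excluding the NUL). */
static int render(char *out, size_t size)
{
    return snprintf(out, size, tmpl, 34, tmpl, 34);
}
```

Rendering into a large enough buffer gives char*s="char*s=%c%s%c;"; which is the declaration of a string whose content is the template again. A full quine only adds the code that performs the printf to both the template and the program.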
Quine (basic self-replicating code in C++)
[http://www.nyx.net/~gthompso/quine.htm#links]
[https://pastebin.com/2UkGbRPF#links]
// Self replicating basic code
#include <iostream> //1 line
#include <string> //2 line
using namespace std; //3 line
//4 line
int main(int argc, char* argv[]) //5th line
{
char q = 34; //7th line
string l[] = { //8th line ---- code will pause here and will resume later in 3rd for loop
" ",
"#include <iostream> //1 line ",
"#include <string> //2 line ",
"using namespace std; //3 line ",
" //4 line ",
"int main(int argc, char* argv[]) //5th line ",
"{",
" char q = 34; //7th line ",
" string l[] = { //8th line ",
" }; //9th resume printing end part of code ", //3rd loop starts printing from here
" for(int i = 0; i < 9; i++) //10th first half code ",
" cout << l[i] << endl; //11th line",
" for(int i = 0; i < 18; i++) //12th whole code ",
" cout << l[0] + q + l[i] + q + ',' << endl; //13th line",
" for(int i = 9; i < 18; i++) //14th last part of code",
" cout << l[i] << endl; //15th line",
" return 0; //16th line",
"} //17th line",
}; //9th resume printing end part of code
for(int i = 0; i < 9; i++) //10th first half code
cout << l[i] << endl; //11th line
for(int i = 0; i < 18; i++) //12th whole code
cout << l[0] + q + l[i] + q + ',' << endl; //13th line
for(int i = 9; i < 18; i++) //14th last part of code
cout << l[i] << endl; //15th line
return 0; //16th line
} //17th line
Not sure if you were wanting the answer on how to do this. But this works:
#include <cstdio>
int main () {char n[] = R"(#include <cstdio>
int main () {char n[] = R"(%s%c"; printf(n, n, 41); })"; printf(n, n, 41); }
If you are a golfer, this is a more minified version:
#include<cstdio>
int main(){char n[]=R"(#include<cstdio>
int main(){char n[]=R"(%s%c";printf(n,n,41);})";printf(n,n,41);}
My version without using %c:
#include <stdio.h>
#define S(x) #x
#define P(x) printf(S(S(%s)),x)
int main(){char y[5][300]={
S(#include <stdio.h>),
S(#define S(x) #x),
S(#define P(x) printf(S(S(%s)),x)),
S(int main(){char y[5][300]={),
S(};puts(y[0]);puts(y[1]);puts(y[2]);puts(y[3]);P(y[0]);putchar(',');puts(S());P(y[1]);putchar(',');puts(S());P(y[2]);putchar(',');puts(S());P(y[3]);putchar(',');puts(S());P(y[4]);puts(S());fputs(y[4],stdout);})
};puts(y[0]);puts(y[1]);puts(y[2]);puts(y[3]);P(y[0]);putchar(',');puts(S());P(y[1]);putchar(',');puts(S());P(y[2]);putchar(',');puts(S());P(y[3]);putchar(',');puts(S());P(y[4]);puts(S());fputs(y[4],stdout);}
/* C/C++ code that shows its own source code without and with File line number and C/C++ code that shows its own file path name of the file. With Line numbers */
#include<stdio.h>
#include<string.h>
#include<iostream>
using namespace std;
#define SHOW_SOURCE_CODE
#define SHOW_SOURCE_FILE_PATH
/// Above two lines are user defined Macros
int main(void) {
/* shows source code without File line number.
#ifdef SHOW_SOURCE_CODE
// We can append this code to any C program
// such that it prints its source code.
char c;
FILE *fp = fopen(__FILE__, "r");
do
{
c = fgetc(fp);
putchar(c);
}
while (c != EOF);
fclose(fp);
// We can append this code to any C program
// such that it prints its source code.
#endif
*/
#ifdef SHOW_SOURCE_FILE_PATH
/// Prints the location of this C source file.
printf("%s \n",__FILE__);
#endif
#ifdef SHOW_SOURCE_CODE
/// We can append this code to any C program
/// such that it prints its source code with line number.
unsigned long ln = 0;
FILE *fp = fopen(__FILE__, "r");
int prev = '\n';
int c; // Use int here, not char
while((c=getc(fp))!=EOF) {
if (prev == '\n'){
printf("%05lu ", ++ln);
}
putchar(c);
prev = c;
}
if (prev != '\n') {
putchar('\n'); /// print a \n for input that lacks a final \n
}
printf("lines num: %lu\n", ln);
fclose(fp);
/// We can append this code to any C program
/// such that it prints its source code with line number.
#endif
return 0;
}
main(a){printf(a="main(a){printf(a=%c%s%c,34,a,34);}",34,a,34);}
We are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very inconvenient, as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.
Here is a very artificial example that shows the effect:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, skip is an example of a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.
Let's look at another version of the program:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
vs.
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?
Any help is greatly appreciated!
Comments:
As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.
You can take a look at libdivide, an open source library for optimizing integer division, which makes division fast when the divisor isn't known until runtime.
If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.
I ran your program_variable code on my system to get a baseline of performance:
$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500
real 0m55.531s
user 0m55.484s
sys 0m0.033s
If I compile test1.c with -O3, then I get:
$ time ./a.out 1000 10
50004989999950500
real 0m54.305s
user 0m54.246s
sys 0m0.030s
In a third test, I manually set the values of limit and skip:
int limit = 1000, skip = 10;
I then re-run the test:
$ gcc -Wall test2.c
$ time ./a.out
50004989999950500
real 0m54.312s
user 0m54.282s
sys 0m0.019s
Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:
$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500
real 0m26.756s
user 0m26.724s
sys 0m0.020s
Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:
#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
for (int i = iLen - 1; i >= 0; --i) \
{ iOut += ((zString[i] - 48) * j); \
j = j*10;}}
...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);
And testing:
$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500
real 0m53.514s
user 0m53.473s
sys 0m0.025s
The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.
Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:
#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
/* per-limit, per-skip test */
If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.
You could try the GCC likely/unlikely macros (built on __builtin_expect) or profile-guided optimization. Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence than +, so your code as written tests the latter.
I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to preprocessor statements is much more involved. In the end, it is not just about executing a modulo operation or a simple division. It is about keeping or skipping certain procedure calls, executing an operation between two other operations, defining the size of an array, etc.
All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.
Dirk
If you used C++ instead of C, you could use templates so that things are calculated at compile time; even recursion is possible.
Please have a look at C++ template metaprogramming.
A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter:
#!/bin/sh
# Build one binary per skip value (only if it does not exist yet), then time it.
skip=$1
out=./program_skip$skip
if [ ! -x "$out" ]; then
gcc -O3 -Dskip="$skip" -o "$out" test.c
fi
time "$out" 1000
I also got roughly a 2× slowdown between program_define and program_variable: 26.2s vs. 49.0s. I then tried
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j, r;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0, r = 0; j < limit; ++j, ++r) {
if (r == skip) r = 0;
if (i + r == 0) {
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.
Another possibility would be to eliminate using the modulus operator:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
int current = 0;
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (++current == skip) {
current = 0;
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
If that is the actual code, you have a few ways to optimize it:
(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).
You can also have all possible function implementations already in the program, and at runtime set a function pointer to select the function you are actually using.
You can use macros to avoid writing duplicate code:
#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}
MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)
If you need too many of these macros (hundreds?), you can write a code generator that writes the source file automatically for a range of values.
I have neither compiled nor tested the code, but maybe you see the principle.
You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use
register int skip;
Aren't misaligned pointers supposed to slow down performance in the best possible case, and crash your program in the worst case (assuming the compiler was nice enough to compile your invalid C program)?
Well, the following code doesn't seem to have any performance differences between the aligned and misaligned versions. Why is that?
/* brutality.c */
#ifdef BRUTALITY
xs = (unsigned long *) ((unsigned char *) xs + 1);
#endif
...
/* main.c */
#include <stdio.h>
#include <stdlib.h>
#define size_t_max ((size_t)-1)
#define max_count(var) (size_t_max / (sizeof var))
int main(int argc, char *argv[]) {
unsigned long sum, *xs, *itr, *xs_end;
size_t element_count = max_count(*xs) >> 4;
xs = malloc(element_count * (sizeof *xs));
if(!xs) exit(1);
xs_end = xs + element_count - 1; sum = 0;
for(itr = xs; itr < xs_end; itr++)
*itr = 0;
#include "brutality.c"
itr = xs;
while(itr < xs_end)
sum += *itr++;
printf("%lu\n", sum);
/* we could free the malloc-ed memory here */
/* but we are almost done */
exit(0);
}
Compiled and tested on two separate machines using
gcc -pedantic -Wall -O0 -std=c99 main.c
for i in {0..9}; do time ./a.out; done
I tested this some time in the past on Win32 machines and did not notice much of a penalty on 32-bit machines. On 64-bit, though, it was significantly slower. For example, I ran the following bit of code. On a 32-bit machine, the times printed were hardly changed. But on a 64-bit machine, the times for the misaligned accesses were nearly twice as long. The times follow the code.
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif
int main(int argc, char *argv[])
{
LARGE_INTEGER startCount, endCount, freq;
int i;
int offset;
int iters = atoi(argv[1]);
char *p = (char*)malloc(16);
double *d;
for ( offset = 0; offset < 9; offset++ )
{
d = (double*)( p + offset );
printf( "Address alignment = %u\n", (unsigned int)d % 8 );
*d = 0;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&startCount);
for(i = 0; i < iters; ++i)
*d = *d + 1.234;
QueryPerformanceCounter(&endCount);
printf( "Time: %lf\n",
(double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART );
}
}
Here are the results on a 64-bit machine. I compiled the code as a 32-bit application.
[P:\t]pointeralignment.exe 100000000
Address alignment = 0
Time: 0.484156
Address alignment = 1
Time: 0.861444
Address alignment = 2
Time: 0.859656
Address alignment = 3
Time: 0.861639
Address alignment = 4
Time: 0.860234
Address alignment = 5
Time: 0.861539
Address alignment = 6
Time: 0.860555
Address alignment = 7
Time: 0.859800
Address alignment = 0
Time: 0.484898
The x86 architecture has always been able to handle misaligned accesses, so you'll never get a crash. Other processors might not be as lucky.
You're probably not seeing any time difference because the loop is memory-bound; it can only run as fast as data can be fetched from RAM. You might think that the misalignment will cause the RAM to be accessed twice, but the first access puts it into cache, and the second access can be overlapped with getting the next value from RAM.
You're assuming either x86 or x64 architectures. On MIPS, for example, your code may result in a SIGBUS (bus fault) signal being raised. On other architectures, non-aligned accesses will typically be slower than aligned accesses, although this is very much architecture dependent.
x86 or x64?
Misaligned pointers were a killer on 32-bit x86, whereas 64-bit architectures are not nearly as prone to crashing, or even to slowing down at all.
It is probably because malloc of that many bytes is returning NULL. At least that's what it does for me.
You never defined BRUTALITY in your posted code. Are you sure you are testing in 'brutal' mode?
Maybe, in order to malloc such a huge buffer, the system is paging memory to and from disk. That could swamp small differences. Try a much smaller buffer and a large in-program loop count around it.
I made the mods I suggested here and in the comments and tested them on my system (a tired, 4-year-old, 32-bit laptop). The code is shown below. I do get a measurable difference, but only around 3%. I maintain that my changes are a success because your question indicates you get no difference at all, correct?
Sorry, I am using Windows, so I used the Windows-specific GetTickCount() API, which I am familiar with because I often do timing tests and enjoy the simplicity of that misnamed API (it actually returns milliseconds since system start).
/* main.cpp */
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#define BRUTALITY
int main(int argc, char *argv[]) {
unsigned long i, begin, end;
unsigned long sum, *xs, *itr, *xs_begin, *xs_end;
size_t element_count = 100000;
xs = (unsigned long *)malloc(element_count * (sizeof *xs));
if(!xs) exit(1);
xs_end = xs + element_count - 1;
#ifdef BRUTALITY
xs_begin = (unsigned long *) ((unsigned char *) xs + 1);
#else
xs_begin = xs;
#endif
begin = GetTickCount();
for( i=0; i<50000; i++ )
{
for(itr = xs_begin; itr < xs_end; itr++)
*itr = 0;
sum = 0;
itr = xs_begin;
while(itr < xs_end)
sum += *itr++;
}
end = GetTickCount();
printf("sum=%lu elapsed time=%lumS\n", sum, end-begin );
free(xs);
exit(0);
}