MD5 hashing of Sequential files and dynamic arrays in UniData


I am creating a sequential file which requires a digital signature (MD5 hash). While I am creating the sequential file I am also creating a dynamic array with the same data.
If I perform an MD5 hash on both the sequential file and the dynamic array, can I expect the result to be the same or different?

No, generally they won't be the same.
When you add to the dynamic array, you are probably introducing attribute mark (@AM) delimiters for each new line, whereas the sequential file keeps the native newline characters.
If you are running UniData on a UNIX system, you can do a CONVERT @AM TO CHAR(10) IN MYARRAY and the two should be equivalent.
If you are running UniData on a Windows system, you can do a SWAP @AM WITH CHAR(13):CHAR(10) IN MYARRAY and the two should be equivalent.
Disclaimer: Above code has not been tested.

Related

How to avoid reading a very large Array in Matlab multiple times?

I have a large Array/Matrix with 5899091 rows and 11 columns. I am storing it in a text file.
I read it with dlmread() in MATLAB every time I need it. However, this takes a lot of time (more than 1 minute), and I need to read the file again and again. I am stuck in this situation. My questions are:
1) Is there any way to read the file just once and save it in any kind of global/persistent Matrix?
2) Is there a better way to read a text file and convert it into a matrix in a more efficient way?
Thanks in advance.

You might get the performance you want from a memory-mapped file. Investigate the Matlab function memmapfile. It's not something I use much so won't offer any further advice which is likely to be wrong.

The best option is almost certainly to simply read the file once in a script or control function and then pass it as a variable to any subsequent functions which require that data. This is just as much work as adding the global declarations and is cleaner, more maintainable and more flexible.
You can also save the variable to a MAT file. If each element in your file is of type double, the matrix is about 520 MB in size (5899091 x 11 x 8 bytes, a bit over 4 gigabits). The MAT format is efficient, but the major benefit comes from storing your numbers as numbers instead of text. With 5 or 8 significant digits, the same numbers in ASCII take roughly 0.8 or 1.2 GB (6.2 or 9.3 gigabits) respectively.
If for some reason you really don't want to pass the data as a variable, I would recommend nested functions over global variables:
function aResult = aFunction(var)
    data = dlmread(...);              % read the file once
    aResult = bFunction(var);
    function bResult = bFunction(var)
        % nested function: 'data' is visible here without being passed in
        bResult = cFunction(data);
    end
end
Of course at this point you are still wrapping the business functions in something. The scoping rules are helpful.
Now, if the real problem is just the size of this file - that is, it's too big for memory and you are using range arguments to dlmread to access the file in chunks - then you should probably take the time to design a format for use with memmapfile. This Wikipedia page explains the potential benefits.
Then there is the brute force solution.

You want to use global variables. Declare the global at the top of the function and it will be shared by the functions it is declared in: see http://www.mit.edu/people/abbe/matlab/globals.html
Use a .mat file. It will be slightly faster. Also, if the matrix is easy to create (a large identity or eye matrix), it may be quicker to generate it on the fly. Lastly, if your matrix is sparse, use the sparse matrix operations.

You can read the file once and save it to MATLAB's MAT file. Then you can access the saved variables fully or partially (basically as any variable in MATLAB workspace) directly from the file using MATFILE. I have answered a similar question about it here. Please have a look.

Alternative to Hash Map for Small Data set in C

I am currently working on a command line interface for a particle simulator. Its parser reads input in the following format:
[command] [argument]* (-[flag] [flag argument])
Currently, the command is sent through a conditional block, compared to various known commands and its corresponding data packet is sent to the matching function. This, however, seems clunky, inefficient and inelegant.
I am thinking about using a hashmap instead, with a string representation of a command as the key and a function pointer as the value. The function referenced would then be sent a data packet containing arguments, flags, etc.
Is a hash map overkill in this situation? Does the extra infrastructure required to implement one outweigh the potential benefits? I am aiming for speed, elegance, function, and, since this is an open-source project, extensibility.
Thanks for the help.

You might want to consider the Ternary Search Tree. It has good performance and makes efficient use of storage, and you don't need a hash function or a collision strategy.
The linked Bentley/Sedgwick article is a very thorough-yet-readable explanation of the accompanying C source.
I've been using a TST for name-lookup in the past 3 versions of my postscript interpreter. The only changes that have been needed have been due to changes in memory management. Here's a version I modified (lightly) to use explicit pointers. I use yet another version in my postscript interpreter, any of the xpost2*.zip versions, in the file core.c, which uses byte-offsets for pointers (have to be added to the user-memory byte-pointer to yield a real pointer).

Speed gained will probably be minimal, but you could hash the command to convert it to a number and then use a switch statement. Faster than a hash map.
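For a handful of commands, the "string to function pointer" idea from the question does not strictly need a hash map. The following is only a sketch of one alternative: a small table kept sorted by name and searched with the standard bsearch. The handler names (cmd_add, cmd_quit, cmd_step), the handler signature and the dispatch helper are all made up for illustration; they are not part of the simulator described above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handler signature: each command receives its argument string. */
typedef int (*command_fn)(const char *args);

static int cmd_add(const char *args)  { printf("add %s\n", args);  return 1; }
static int cmd_quit(const char *args) { (void)args;                return 2; }
static int cmd_step(const char *args) { printf("step %s\n", args); return 3; }

struct command {
    const char *name;
    command_fn  fn;
};

/* Must stay sorted by name (strcmp order) so that bsearch can be used. */
static const struct command commands[] = {
    { "add",  cmd_add  },
    { "quit", cmd_quit },
    { "step", cmd_step },
};

/* bsearch passes the key first and the table element second. */
static int compare_command(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct command *)elem)->name);
}

/* Returns the handler's result, or -1 if the command is unknown. */
int dispatch(const char *name, const char *args)
{
    const struct command *c = bsearch(name, commands,
                                      sizeof commands / sizeof commands[0],
                                      sizeof commands[0], compare_command);
    return c ? c->fn(args) : -1;
}
```

Adding a command then means writing one handler and adding one table row in sorted position, which may be enough for extensibility in a small open-source project.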

C data structure to disk

How can I make a copy of a tree data structure in memory to disk in C programming language?

You need to serialize it, i.e. figure out a way to go through it serially that includes all nodes. These are often called traversal methods.
Then figure out a way to store the representation of each node, together with references to other nodes, so that it can all be loaded in again.
One way of representing the references is implicitly, by nesting like XML does.
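As a rough illustration of the nesting approach, here is a minimal sketch for a binary tree of ints, written in pre-order with a marker for empty subtrees so that no pointers are stored. The node layout, the "N"/"#" text format and the function names tree_write and tree_read are assumptions made for this example, not anything the question specifies.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Pre-order write: "N <value>" for a node, "#" for an empty subtree. */
void tree_write(const struct node *n, FILE *out)
{
    if (n == NULL) {
        fputs("# ", out);
        return;
    }
    fprintf(out, "N %d ", n->value);
    tree_write(n->left, out);
    tree_write(n->right, out);
}

/* Reads back what tree_write produced. Returns NULL for an empty subtree
   (and also on malformed input or allocation failure, to keep this short). */
struct node *tree_read(FILE *in)
{
    char tag;
    int value;

    if (fscanf(in, " %c", &tag) != 1 || tag != 'N')
        return NULL;
    if (fscanf(in, "%d", &value) != 1)
        return NULL;

    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->value = value;
    n->left  = tree_read(in);   /* children follow in the same order */
    n->right = tree_read(in);   /* in which tree_write emitted them   */
    return n;
}
```

A tree with root 2 and children 1 and 3 is written as "N 2 N 1 # # N 3 # # ". The structure is implied entirely by the order of the tokens, which is why the file can be loaded again on a later run.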

The basic pieces here are:
The C file I/O routines are fopen, fwrite, fprintf, etc.
Copying pointers to disk is useless, since the next time you run all those pointer values will be crap. So you'll need some alternative to pointers that still somehow refers disk records to each other. One sensible alternative would be file indexes (the kind used by your C I/O routines like fseek and ftell).
That should be about all the info you need to do the job.
Alternatively, if you use an array-based tree (with array indexes instead of pointers, or with the links implied by their position in the array) you could just save and load the whole shebang without any further logic required.

Come up with a serialization (and deserialization) function. Then run it and send the output to a file.

Determining string uniqueness in a large file

In C, I want to process a file that contains 10^8 16-digit alphanumeric strings and determine if each one is unique in the file. How can I do that?

As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.
If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.

Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.
I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.

Do a bucket sort(Hash function) into multiple files, one file for each bucket. Then process each bucket's file to determine if all strings are unique within the bucket.

You'll need to sort the file.
Just load it into a single memory block, run qsort from the C runtime library on the memory block, and then finally run sequentially over all strings to check for two consecutive strings that are the same.
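A minimal sketch of the sort-then-scan step, assuming the strings are already in memory as back-to-back 16-byte records with no separators or terminators. Reading the file into the block is left out, and the name all_unique is invented for this example.

```c
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16  /* each record is exactly 16 characters, no terminator */

static int compare_keys(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);
}

/* Sorts the block in place, then returns 1 if all `count` keys are
   distinct and 0 if at least two of them are equal. */
int all_unique(char *block, size_t count)
{
    qsort(block, count, KEY_LEN, compare_keys);

    /* After sorting, equal keys must be neighbours. */
    for (size_t i = 1; i < count; i++) {
        if (memcmp(block + (i - 1) * KEY_LEN, block + i * KEY_LEN, KEY_LEN) == 0)
            return 0;
    }
    return 1;
}
```

Sorting costs O(n log n) comparisons and the scan is a single linear pass, so the memory needed is essentially the size of the file itself.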

Use a library that provides set/map functions.

Hash function for short strings

I want to send function names from a weak embedded system to the host computer for debugging purpose. Since the two are connected by RS232, which is short on bandwidth, I don't want to send the function's name literally. There are some 15 chars long function names, and I sometimes want to send those names at a pretty high rate.
The solution I thought about, was to find a hash function which would hash those function names to a single byte, and send this byte only. The host computer would scan all the functions in the source, compute their hash using the same function, and then would translate the hash to the original string.
The hash function must be
Collision free for short strings.
Simple (since I don't want too much code in my embedded system).
Fit in a single byte.
Obviously, it does not need to be secure by any means, only collision free. So I don't think using cryptography-related hash function is worth their complexity.
An example code:
int myfunc() {
sendToHost(hash("myfunc"));
}
The host would then be able to present me with list of times where the myfunc function was executed.
Is there some known hash function which holds the above conditions?
Edit:
I assume I will use much less than 256 function-names.
I can use more than a single byte, two bytes would have me pretty covered.
I prefer to use a hash function instead of using the same function-to-byte map on the client and the server, because (1) I have no map implementation on the client, and I'm not sure I want to add one for debugging purposes, and (2) it requires another tool in my build chain to inject the function-name table into my embedded system code. A hash is better in this regard, even if that means I'll have a collision once in a while.

Try minimal perfect hashing:
Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
C code is included.

Hmm, with only 256 possible values, and since you will parse your source code to find all the functions anyway, maybe the best way to do it would be to assign a number to each of your functions?
A real hash function probably won't work, because you have only 256 possible hashes but want to map at least 26^15 possible values (assuming letter-only, case-insensitive function names).
Even if you restricted the number of possible strings (by applying some mandatory formatting) you would be hard pressed to get both meaningful names and a valid hash function.

You could use a Huffman tree to abbreviate your function names according to the frequency they are used in your program. The most common function could be abbreviated to 1 bit, less common ones to 4-5, very rare functions to 10-15 bits etc. A Huffman tree is not very hard to implement but you will have to do something about the bit alignment.

No, there isn't.
You can't make a collision free hash code, or even close to it, with just an eight bit hash. If you allow strings that are longer than one character, you have more possible strings than there are possible hash codes.
Why not just extract the function names and give each function name an id? Then you only need a lookup table on each side of the wire.
(As others have shown you can generate a hash algorithm without collisions if you already have all the function names, but then it's easier to just assign a number to each name to make a lookup table...)

If you have a way to track the functions within your code (i.e. a text file generated at run-time) you can just use the memory locations of each function. Not exactly a byte, but smaller than the entire name and guaranteed to be unique. This has the added benefit of low overhead. All you would need to 'decode' the address is the text file that maps addresses to actual names; this could be sent to the remote location or, as I mentioned, stored on the local machine.

In this case you could just use an enum to identify functions. Declare function IDs in some header file:
typedef enum
{
FUNC_ID_main,
FUNC_ID_myfunc,
FUNC_ID_setled,
FUNC_ID_soundbuzzer
} FUNC_ID_t;
Then in functions:
int myfunc(void)
{
sendFuncIDToHost(FUNC_ID_myfunc);
...
}
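The host still needs to turn a received ID back into a name. One possible way to keep the enum and the name table in step is the "X-macro" idiom sketched below. FUNC_LIST, AS_ENUM, AS_STRING, FUNC_ID_COUNT and func_name_from_id are names invented for this sketch; only the FUNC_ID_* identifiers come from the answer above.

```c
#include <stddef.h>

/* Single list of traced functions; both tables below are generated from it. */
#define FUNC_LIST(X) X(main) X(myfunc) X(setled) X(soundbuzzer)

/* Expands to: FUNC_ID_main, FUNC_ID_myfunc, ..., FUNC_ID_COUNT */
#define AS_ENUM(name) FUNC_ID_##name,
typedef enum { FUNC_LIST(AS_ENUM) FUNC_ID_COUNT } FUNC_ID_t;

/* Expands to: "main", "myfunc", ... in the same order as the enum. */
#define AS_STRING(name) #name,
static const char *const func_names[] = { FUNC_LIST(AS_STRING) };

/* Host side: translate a received byte back to a function name,
   or NULL if the byte is not a known ID. */
const char *func_name_from_id(unsigned char id)
{
    return id < FUNC_ID_COUNT ? func_names[id] : NULL;
}
```

Only the enum needs to be compiled into the embedded target. The string table can live in the host tool, as long as both are built from the same list.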

If sender and receiver share the same set of function names, they can build identical hash tables from them. You can then identify an entry by the path taken to reach it, for example {starting position + number of hops}, which would take 2 bytes of bandwidth. For a fixed-size table (linear probing), only the final index is needed to address an entry.
NOTE: when building the two "synchronous" hash tables, the order of insertion is important ;-)

Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html
Here is a snippet from the post:
It derives its inspiration from the way binary numbers are decoded and converted to decimal number format. Each binary string representation uniquely maps to a number in the decimal format.
Say we have a character set of capital English letters. The length of the character set is then 26, where A could be represented by the number 0, B by the number 1, C by the number 2, and so on up to Z by the number 25. Now, whenever we want to map a string of this character set to a unique number, we perform the same conversion as we did in the case of the binary format.
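The conversion described in the snippet amounts to reading the string as a base-26 number. A minimal sketch, assuming an ASCII-compatible character set and a 64-bit result (the function name base26_encode and the 13-letter limit are choices made for this example, not taken from the linked post):

```c
#include <stdint.h>
#include <stddef.h>

/* Interprets an upper-case string as a base-26 number (A=0 ... Z=25).
   Sets *ok to 0 and returns 0 on an invalid character, or if the string
   is longer than 13 letters, the most that always fits in 64 bits. */
uint64_t base26_encode(const char *s, int *ok)
{
    uint64_t value = 0;
    size_t len = 0;

    *ok = 1;
    for (; *s != '\0'; s++, len++) {
        if (*s < 'A' || *s > 'Z' || len >= 13) {
            *ok = 0;
            return 0;
        }
        value = value * 26 + (uint64_t)(*s - 'A');
    }
    return value;
}
```

For example, "BA" maps to 1 * 26 + 0 = 26 and "ZZ" to 25 * 26 + 25 = 675. Two caveats follow directly from the arithmetic: a leading A behaves like a leading zero, so "A" and "AA" both map to 0 and uniqueness only holds among strings of the same length, and the result grows with the string length, so it does not fit in a single byte beyond one letter.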