How do I store 262144 variables efficiently? Multi-dimensional array? - arrays

I'm trying to write a program that reads a .txt file containing several thousand strings (each one is exactly 9 letters long) made up only of the letters A,C,G and T (i.e. DNA sequences).
Now, there are of course 4^9 possible combinations of A,C,G and T in a 9-letter string. I need to know how often each of these 262144 combinations appears in my .txt file.
My problem is that I (obviously) don't want to initialize 262144 individual variables, increment each when a match is found and then print them all individually, because that would be crazy.
So, my idea was to create either some kind of tree which goes down the branches according to the letter encountered at each node and stores the number of times each branch was 'run down' (i.e. each possible 9-letter combination) at the last node.
Or an array of 262144 positions where I can store the number of appearances of each possible combination. For that, however, I would need some kind of non-redundant system that chooses a unique position in the array (to store the number of times that combination has been encountered) based on which letters have been encountered in which sequence in the 9-letter string.
For example: For each 'A' encountered in the 9-letter string, I increment my 'pointer variable' (which points to the position in the big array) by 0, so every time the sequence AAAAAAAAA is encountered, position [0] of my array is incremented by 1. For every 'T' I increment the pointer by 1, so TTTTTTTTT would increment position [9] of my array by 1 and so on.
This, however, gives me the problem that both sequences AAAAAAAAT and TAAAAAAAA (and all other combinations of 8 As and 1T) will increment position [1] of the array. So I would have to use some kind of system where the pointer can actually reach each value between 0 and 262143 exactly once?
I'm sure there is some better way? Multi-dimensional arrays or something like that?
Best regards,
rokyo

You want to store this as a tree of depth 9; each node can have 4 children, one for each of the 4 possibilities of the next letter. Each leaf holds a counter. When you have built your tree, go through all the leaves and that gives you the counts.
So it would work like this:
Read in a sequence.
For each character in the sequence, select the proper child; if it does not exist, create the node, then move to that child.
If you are at the end of your string, update the count in that node.
Loop back to read the next sequence.
Once all sequences are read and the tree is built, iterate through the tree; whenever you reach a leaf (a node with no children), output its count.
The benefit of this approach is that it still works if the size of the data or the length of each sequence changes. This is a typical use for a tree.

Why multidimensional? If you want counts, just encode each string into an integer and increment the corresponding slot in an array of 262144 integers.
How to encode your string: think of each of the 4 letters as a two-bit binary number, so you need 18 bits to represent one 9-letter combination.
A - 00
C - 01
G - 10
T - 11
AAAAAAAAA - 000000000000000000
ACACACACA - 000100010001000100 - 17476
GAAAAAAAA - 100000000000000000 - 131072
TAAAAAAAA - 110000000000000000 - 196608
AAAAAAAAT - 000000000000000011 - 3
The size of the array in memory depends on the maximum count you need to handle. If about 4 billion is enough (a 32-bit counter per slot), the whole "counter" array takes about a megabyte of memory.
Each counting access would be O(1).

Related

Suggestions on to make my compressor faster

I have some data which I'm compressing with a custom compressor, the compressed data is fine but the compressor takes ages, and I'm seeking advice on how I could make that faster. Let me give you all the details.
The input data is an array of bytes, at most 2^16 of them. Since the bytes in the array NEVER take values between 0x08 and 0x37 (inclusive), I decided to exploit that for a simple LZ-like compression scheme. It replaces any sequence of 4 to 51 bytes that has already appeared at a "lower address" (i.e. closer to the array's beginning) with a single byte in the 0x08 to 0x37 range, followed by two bytes giving the low and high byte of the index where the sequence begins. That gives the decompressor the length (in the single byte) and the address of the original data, so it can rebuild the original array.
The compressor works this way: for any sequence of any length from 51 down to 4 bytes (I test longer sequences first), starting from any index (left to right), I check whether there's a match 'left' of that, meaning at an index lower than the starting point I'm checking. If there is more than one match, I choose the one that 'saves' the most: the longest match starting at the leftmost position.
The results are just perfect... but of course this is overkill: it's 4 nested 'for' loops with a memcmp() inside, and it takes minutes on a modern workstation to compress some 20 KB of data. That's why I'm seeking help.
Code is accessible here, if you need to sneak a peek. The 'job' starts at line 44.
Of course I can give you any detail you need, there's nothing secret here (BTW, just in case... I'm not going to change compression scheme for this reason, as this one works exactly as I need it!)
Thank you in advance.
A really obvious one is that you don't have to loop over the lengths; just find the longest match at that position. That's not a "search": keep extending the match by one for every matching pair of characters, and when it stops you have the longest match at that position (naturally you can force it to stop at 51 too, so it doesn't overrun).
Another typical trick is keeping a hash map that maps keys of 3 or 4 characters to a list of offsets where they can be found. That way you only need to try positions that have some hope of producing a match. This is also described in the DEFLATE RFC, all the way at the bottom.

Array filled with how frequently letters appeared in text file. How to sort (descending) in C and still know which letter went with which number?

Totally a homework assignment and I've been stuck here for a couple days. Simple substitution cipher, encryption program works fine and I think I've got most of decryption figured out except this one part.
It creates an array of 26 int's (one for each letter) and loops through the text, increasing the appropriate index each time it encounters the associated letter. Right now the index is how I know what slot is what letter (a is [0], b is [1], c[2], etc).
How do I sort this array by size (so that I can tease apart frequencies) while still knowing what letter the number is associated with? (ie, 'a' appears 600 times, 'b' appears 30 times, 'c' appears 82 times, etc)
You need to store structs with char-code and count fields, then sort those. Note that in C you can use structs like any other variables: pass them around and assign them. You don't have to use pointers or anything like that, unlike when you are dealing with arrays.
If for some reason you don't want to use structs (say, you haven't covered them yet), you could keep two arrays of the same size, one for char codes and one for counts. Then sort by the count array, but perform each swap in both arrays so they stay in sync.

Find the first non repetitive character in a string

I was asked this question in an interview. My answer: "Create an array of size 26. Traverse the string character by character and count the characters in the array. Then find the first non-repetitive character by traversing the string again and checking in the array whether each character is repeated." What if the string contains a large set of characters, like 10000 kinds of characters instead of 26 letters?
You can implement your original algorithm using less memory. Since you don't care about how many times the character repeated above 1, you only need 2 bits per character in the alphabet. When incrementing a value that is already above 1, just leave the value alone. The rest of your algorithm remains unchanged.
If memory is not a restriction, there is a faster algorithm that doesn't require another pass over the string. Allow each letter in the alphabet to be represented by a ListNode. Then have two lists, list1 and list2, that start out as empty. list1 contains letters that have only occurred once, and list2 contains letters that have occurred more than once. For each letter in the input string, get the corresponding ListNode, say node. If node is not in either list, put it at the end of list1. If node is already in list1, take it out, and put it in list2. After the input string is processed, if list1 is empty, there are no non-repeating characters. Otherwise, the character that corresponds with the first node in list1 is the first non-repeating character.
Follow this link to IDEONE for an implementation of the list based algorithm.
You gave the brute-force answer. A clever solution keeps an array of the size of the alphabet, we'll call it x, with each item initially -1. Then pass through the string once; if the current character has an x-value of -1, change it to the index of the current character, otherwise (its x-value is already a valid index) change it to -2. At the end of the string, examine each location in the x array, keeping track of the smallest non-negative value and the associated character. I implement that algorithm in Scheme at my blog.
You could use a tree-like data structure instead of an array.
If your strings are not too long, loop through the string character by character and check whether you can find that character again. Drawback: O(n^2) runtime in the worst case.
Build a hash of each character in a first pass and then check each bucket separately. This should be combined with the method above, and I'm not sure whether it reduces runtime significantly; it depends on your real-world data.
Start reading the string character by character.
Put each character in a HashMap.
Return the first character that causes a conflict.
Pros:
You don't need to create a BitMap/bit array in advance.
Cons:
The HashMap can grow to as many entries as there are characters in the string if it never encounters a repeating character (or if there is no repeating character).

how to find the most frequent number in 1T numbers?

How to find the most frequent number(int type) in 1T (i.e. 10^12) numbers?
My premises are:
My memory is limited to 4G (i.e. 4·10^9) bytes.
All the numbers are stored in a file as the input.
The output is just one number.
All numbers (int type) are stored in one or several files.
The file format is either binary or one number per line.
Edited at : 2013.04.22 17:08
Thanks for your comments:
Plus:
- External Storage is not limited.
First note that the problem is at least as hard as the element distinctness problem.
Thus, the solutions should follow the same approaches:
sort (using an external sort), then iterate while counting the occurrences of each number, looking for the maximum.
Hashing solution: hash the numbers into buckets that fit in memory (note that all occurrences of the same number hash to the same bucket). For each bucket, find the most frequent number and store it; then go through the candidates from all buckets and choose the best.
Here you can either sort each bucket (in memory) and find its most frequent number, or build a histogram (using a hash map with a different hash function) of the frequency of each item in the bucket.
Note that the buckets are written on disk, and loaded into memory one after the other, at each time only a small part of the data is stored on RAM.
Another, more scalable approach could be map-reduce: a simple map-reduce step counts the occurrences per number, and then you just find the maximum of those:
map(number):
    emit(number, 1)
reduce(number, list):
    emit(number, size(list))
All that is left is to find the number with the highest value, which can be done in a linear scan.
What about using the filesystem to store counters for the numbers?
For example, if your numbers are uint32, you can create 65536 directories with 65536 files in each.
The directory name is the two high bytes of the number, the file name the two low bytes. When you meet number X, split it into those two parts to get the filename, open that file and increment the counter inside it (or write 1 there if the file is absent).
After filling that file structure, scan your tree recursively for the file with the greatest value.
That would be very slow, but it would use almost no RAM.
Use a hash table: the key is the number, the value is the count. O(n) to insert all the numbers into the hash table, O(unique numbers) to find the most frequent.
brute force:
remember = 0;
repeat:
take the first unmarked number and count its occurrences in the file (n1).
mark each occurrence of that number as read (e.g. overwrite it with blanks).
if (n1 > remember) remember = n1;

How can I find all possible combinations of a string? Using CUDA

I am trying to speed up my algorithm by using CUDA to find all possible combination of a string. What is the best way I can achieve this?
example:
abc
gives:
a
b
c
ab
ac
bc
I have nothing so far. I'm not asking for code, just for the best way to do it: an algorithm? Pseudocode? Maybe a discussion?
The advantage to using CUDA is massive parallelism with potentially thousands of threads with little overhead. To that end, you have to figure out a way to divide the problem into small chunks without relying too much on communication between the threads. In this problem you have n characters and each can be either present or absent in each output string. This yields 2^n total output strings. (You've left off the empty string and the original string from your list...if that's the desired result then you have 2^n - 2 total output strings.)
In any event, one way you can divide up the work of creating the strings is to assign each potential output string a number and have each thread compute the output strings for a certain range of numbers. The mapping from number to output string is easy if you look at the binary representation of each number. Each binary digit in an n-bit number corresponds to a character in the string of length n. Thus, for your example, the number 5 or 101 in binary maps to the string "ac". The strings you listed would be created by computing the mappings for numbers from 1 to 6 as follows:
1 c
2 b
3 bc
4 a
5 ac
6 ab
You could compute 7 to get abc or 0 to get the empty string if desired.
Unless you're doing this for words longer than a dozen or so characters, I'm not sure this will be that much faster though. If you're doing it for words longer than 25 or so characters you might start running into memory constraints since you'll be wrangling hundreds of megabytes.
I will be very, very surprised if CUDA is the right solution to this problem.
However, I would write a kernel to find all substrings of length n, and launch the kernel in a loop for each value of n from 0 to the length of the string. Thus, each thread in a kernel will have exactly the same instructions (no threads will sit around idle while others finish).
Each thread will "find" one substring, so you might as well have thread i find the substring starting at index i in the string. Note that each substring length requires a different number of threads.
so, for n=1:
thread 0: a
thread 1: b
thread 2: c
and for n=2:
thread 0: ab
thread 1: bc
