How to efficiently store 1 million words and query them by starts_with, contains, or ends_with? - database

How do sites like this store tens of thousands of words "containing c", or like this, "words with d and c", or even further, "unscramble" a word like CAUDK and find that the database has duck? I'm curious, from an algorithms/efficiency perspective, how they would accomplish this:
Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database was used (and each word was a record), how would you make these sorts of queries (with PostgreSQL for example, contains, starts_with, ends_with, and unscrambles)?
I guess the easiest thing to do would be to store all words in memory (sorted?) and just traverse the whole list of a million or fewer words to find the matches? But how about the unscramble one?
Basically wondering the efficient way this would be done.

"Containing C" amounts to count(C) > 0. Unscrambling CAUDC amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1. So both queries could be efficiently answered by a database with 26 indices, one for the count of each letter in the alphabet.
Here is a quick and dirty python sqlite3 demo:
from collections import defaultdict, Counter
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
alphabet = [chr(ord('A')+i) for i in range(26)]
alphabet_set = set(alphabet)

columns = ['word TEXT'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)
for c in alphabet:
    cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')

def insert(word):
    counts = Counter(word)
    columns = ['word'] + [f'{c}_count' for c in counts.keys()]
    values = [f'"{word}"'] + [f'{n}' for n in counts.values()]
    var_str = f'({", ".join(columns)})'
    val_str = f'({", ".join(values)})'
    insert_cmd = f'INSERT INTO abc {var_str} VALUES {val_str}'
    cur.execute(insert_cmd)

def unscramble(text):
    counts = {a: 0 for a in alphabet}
    for c in text:
        counts[c] += 1
    where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
    select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
    cur.execute(select_cmd)
    return sorted(tup[0] for tup in cur.fetchall())

print('Building sqlite table...')
with open('/usr/share/dict/words') as f:
    word_set = set(line.strip().upper() for line in f)
    for word in word_set:
        if all(c in alphabet_set for c in word):
            insert(word)
print('Table built!')

d = defaultdict(list)
for word in unscramble('CAUDK'):
    d[len(word)].append(word)
print("unscramble('CAUDK'):")
for n in sorted(d):
    print(' '.join(d[n]))
Output:
Building sqlite table...
Table built!
unscramble('CAUDK'):
A C D K U
AC AD AK AU CA CD CU DA DC KC UK
AUK CAD CUD
DUCK

I don't know for sure what they're doing, but I suggest this algorithm for contains and unscramble (which, I think, can be trivially extended to starts_with or ends_with):
1. The user submits a set of letters in the form of a string. Say, the user submits bdsfa.
2. The algorithm sorts the string from step 1, so the query becomes abdfs.
3. To find all words with those letters in them, the algorithm simply accesses the directory database/a/b/d/f/s/ and lists all the words there. If it finds the directory to be empty, it goes one level up, to database/a/b/d/f/, and shows the results there.
So, now, the question is how to index the database of millions of words as done in step 3. The database/ directory will have 26 directories inside it, for a to z, each of which will have 26-1 directories for all letters except its parent's. E.g.:
database/a/{b,c,...,z}
database/b/{a,c,...,z}
...
database/z/{a,b,...,y}
This tree structure will be at most 26 levels deep, and each branch will have no more than 26 children, so browsing this directory structure is scalable.
Words will be stored in the leaves of this tree: the path for a word is its sorted set of distinct letters. So the word apple will be stored in database/a/e/l/p/leaf_apple. In that place, you will also find other words such as leap. More specifically:
database/
    a/
        e/
            l/
                p/
                    leaf_apple
                    leaf_leap
                    leaf_peal
                    ...
This way, you can efficiently reach the subset of target words in O(log n), where n is the total number of words in your database.
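As an illustration, here is a minimal Python sketch of the path computation this scheme implies (the word_path and store helpers are hypothetical names; the directory layout and the leaf_ file convention follow the description above, and empty marker files are just one possible realization):

import os

def word_path(word, root='database'):
    # the path is the word's sorted set of distinct letters
    letters = sorted(set(word.lower()))
    return os.path.join(root, *letters)

def store(word, root='database'):
    path = word_path(word, root)
    os.makedirs(path, exist_ok=True)
    # an empty file named leaf_<word> marks the word in its leaf directory
    open(os.path.join(path, 'leaf_' + word), 'w').close()

store('apple')  # creates database/a/e/l/p/leaf_apple
store('leap')   # lands in the same directory: database/a/e/l/p/leaf_leap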
You can further optimise this by adding additional indices. For example, there are too many words containing a, and the website won't display them all (at least not on the first page). Instead, the website may say there are 500,000 words in total containing 'a', and here are 100 examples. To obtain that count of 500,000 efficiently, the number of children at every level can be recorded during indexing. E.g.:
database/
    {a,b,...,z}/
        {num_children, ...}/
            {num_children, ...}/
            ...
Here, num_children is just a leaf node, like leaf_WORD. All leaves are files.
Depending on the load this website has, it may not need to load this database into memory. It can simply leave it to the operating system to decide which portions of its file system to cache in memory, as a read-time optimisation.
Personally, as a general criticism of applications, I think developers tend to jump to requiring RAM too quickly, even when a simple file-system trick can do the job with no noticeable difference to the end user.

Related

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, the problem is, of course, that these data are huge. Still, I will only need a subset of the data from certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none seem particularly, well, good.
-Idea 1- initially was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
    # (write the same line x times according to the match_count tag)
    remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- Came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer with a broken link and a comment about MapReduce (neither of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas of something completely different; R preferred but not necessary)
I will give an idea of how you can do this, though it can be improved in several places. I deliberately wrote it in a "spaghetti style" for better interpretability, but it can be generalized to more than tri-grams.
library(data.table)
library(magrittr)  # for the %>% pipe
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
# vocab here is vocabulary from chunk, but you can be interested first
# to create vocabulary from whole corpus of ngrams and filter non
# interesting/rare words
vocab = unique(tokens_matrix)
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
dt = rbindlist(list(dt_12, dt_13, dt_23))
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = T,
giveCsparse = F, check = F, dimnames = list(vocab, vocab))

How to, given a predetermined set of keys, reorder the keys such that the minimum number of nodes are used when inserting into a B-Tree?

So I have a problem which I'm pretty sure is solvable, but after many, many hours of thinking and discussion, only partial progress has been made.
The issue is as follows. I'm building a B-tree of, potentially, a few million keys. When searching the B-tree, it is paged on demand from disk into memory, and each page-in operation is relatively expensive. This effectively means that we want to traverse as few nodes as possible (once a node has been paged in, traversing within it costs nothing). As a result, we don't want to waste space by having lots of nodes near minimum capacity. In theory, this should be preventable (within reason), since the structure of the tree depends on the order in which the keys were inserted.
So, the question is how to reorder the keys such that after the BTree is built the fewest number of nodes are used. Here's an example:
I did stumble on this question, In what order should you insert a set of known keys into a B-Tree to get minimal height?, which unfortunately asks a slightly different question. The answers also don't seem to solve my problem. It is also worth adding that we want the mathematical guarantees that come from not building the tree manually and only using the insert operation. We don't want to build a tree manually, make a mistake, and then find it is unsearchable!
I've also stumbled upon 2 research papers which are so close to solving my question but aren't quite there!
Time- and Space-Optimality in B-Trees and Optimal 2,3-Trees (where I took the above image from, in fact) discuss and quantify the differences between space-optimal and space-pessimal B-trees, but don't go as far as describing how to design an insert order, as far as I can see.
Any help on this would be greatly, greatly appreciated.
Thanks
Research papers can be found at:
http://www.uqac.ca/rebaine/8INF805/Automne2007/Sujets2007Automne/p174-rosenberg.pdf
http://scholarship.claremont.edu/cgi/viewcontent.cgi?article=1143&context=hmc_fac_pub
EDIT: I ended up filling a B-tree skeleton constructed as described in the above papers with the FILLORDER algorithm. As previously mentioned, I was hoping to avoid this; however, I ended up implementing it before the two excellent answers were posted!
The algorithm below should work for B-trees with the minimum number of keys per node = d and the maximum = 2*d. I suppose it can be generalized to 2*d + 1 max keys if the way of selecting the median is known.
The algorithm below is designed to minimize the number of nodes, not just the height of the tree.
The method is based on the idea of putting keys into any non-full leaf or, if all leaves are full, putting the key under the lowest non-full node.
More precisely, the tree generated by the proposed algorithm meets the following requirements:
It has the minimum possible height;
It has no more than two non-full nodes on each level. (It's always the two rightmost nodes.)
Since we know that the number of nodes on any level except the root is strictly equal to the sum of the node count and the total key count on the level above, we can prove that there is no valid rearrangement of nodes between levels which decreases the total number of nodes. Increasing the number of keys inserted above any certain level will increase the number of nodes on that level, and consequently the total number of nodes, while any attempt to decrease the number of keys above a certain level will decrease the node count on that level and fail to fit all keys on that level without increasing the tree height.
It is also obvious that the arrangement of keys on any certain level is one of the optimal ones.
Using the reasoning above, a more formal proof through mathematical induction may also be constructed.
The idea is to hold a list of counters (the size of the list is no bigger than the height of the tree) to track how many keys were added on each level. Once d keys have been added to some level, it means a half-filled node was created on that level; if there are enough keys to fill the other half of this node, we should skip those keys and add a root for the higher level. This way, the root will be placed exactly between the first half of the previous subtree and the first half of the next subtree; it will cause a split, in which the root takes its place and the two halves of the subtrees become separated. The places for the skipped keys remain safe while we go through bigger keys, and can be filled later.
Here is nearly working (pseudo)code; the array needs to be sorted:
PushArray(BTree bTree, int d, key[] Array)
{
    int N = Array.Length;
    List<int> counters = new List<int>{0};
    // skip list will contain numbers of nodes to skip
    // after filling a node of some order in half
    List<int> skip = new List<int>();
    List<Pair<int,int>> skipList = new List<Pair<int,int>>();
    int i = -1;
    while(true)
    {
        int order = 0;
        while(counters[order] == d) order += 1;
        for(int j = order - 1; j >= 0; j--) counters[j] = 0;
        if (counters.Count <= order + 1) counters.Add(0);
        counters[order] += 1;
        if (skip.Count <= order)
            skip.Add(i + 2);
        if (order > 0)
            skipList.Add({i, order}); // list of skipped parts that will be needed later
        i += skip[order];
        if (i >= N) break;
        bTree.Push(Array[i]);
    }
    // now we need to add all skipped keys in correct order
    foreach(Pair<int,int> p in skipList)
    {
        for(int i = p.2; i > 0; i--)
            PushArray(bTree, d, Array.SubArray(p.1 + skip[i - 1], skip[i] - 1));
    }
}
Example:
Here is how the keys and the corresponding counters should be arranged for d = 2 during the first pass through the array. I marked keys pushed into the B-tree during the first pass (before the loop with recursion) with 'o' and skipped ones with 'x'.
24
4 9 14 19 29
0 1 2 3 5 6 7 8 10 11 12 13 15 16 17 18 20 21 22 23 25 26 27 28 30 ...
o o x x o o o x x o o o x x x x x x x x x x x x o o o x x o o ...
1 2 0 1 2 0 1 2 0 1 2 0 1 ...
0 0 1 1 1 2 2 2 0 0 0 1 1 ...
0 0 0 0 0 0 0 0 1 1 1 1 1 ...
skip[0] = 1
skip[1] = 3
skip[2] = 13
Since we don't iterate through the skipped keys, we have O(n) time complexity (excluding the B-tree insertions themselves) for a sorted array.
In this form it may be unclear how the algorithm works when there are not enough keys to fill the second half of a node after a skipped block, but we can avoid skipping all skip[order] keys when the total length of the array is less than ~ i + 2 * skip[order], and skip skip[order - 1] keys instead. A line like the following could be added after changing the counters but before changing the variable i:
while(order > 0 && i + 2*skip[order] > N) --order;
This is correct because, if the total count of keys on the current level is less than or equal to 3*d, they are still split correctly when added in their original order. It leads to a slightly different rearrangement of keys between the two last nodes on some levels, but does not break any of the described requirements, and it may make the behavior easier to understand.
Maybe it's reasonable to find some animation and watch how it works; here is the sequence which should be generated for the 0..29 range: 0 1 4 5 6 9 10 11 24 25 26 29 /end of first pass/ 2 3 7 8 14 15 16 19 20 21 12 13 17 18 22 23 27 28
The algorithm below attempts to prepare the order of the keys so that you don't need to have power over, or even knowledge of, the insertion procedure. The only assumption is that overfilled tree nodes are split either at the middle or at the position of the last inserted element; otherwise the B-tree can be treated as a black box.
The trick is to trigger node splits in a controlled way. First you fill a node exactly: the left half with keys that belong together and the right half with another range of keys that belong together. Finally you insert a key that falls in between those two ranges but belongs with neither; the two subranges are split into separate nodes, and the last inserted key ends up in the parent node. After splitting off in this fashion you can fill the remainder of both child nodes to make the tree as compact as possible. This also works for parent nodes with more than two child nodes; just repeat the trick with one of the children until the desired number of child nodes is created. Below, I use what is conceptually the rightmost child node as the "splitting ground" (steps 5 and 6.1).
Apply the splitting trick recursively, and all elements should end up in their ideal place (which depends on the number of elements). I believe the algorithm below guarantees that the height of the tree is always minimal and that all nodes except for the root are as full as possible. However, as you can probably imagine it is hard to be completely sure without actually implementing and testing it thoroughly. I have tried this on paper and I do feel confident that this algorithm, or something extremely similar, should do the job.
Implied tree T with maximum branching factor M.
Top procedure with keys of length N:
1. Sort the keys.
2. Set minimal-tree-height to ceil(log(N+1)/log(M)).
3. Call insert-chunk with chunk = keys and H = minimal-tree-height.
Procedure insert-chunk with chunk of length L, subtree height H:
1. If H is equal to 1:
   1.1. Insert all keys from the chunk into T.
   1.2. Return immediately.
2. Set the ideal subchunk size S to pow(M, H - 1).
3. Set the number of subtrees T to ceil((L + 1) / S).
4. Set the actual subchunk size S' to ceil((L + 1) / T).
5. Recursively call insert-chunk with chunk' = the last floor((S - 1) / 2) keys of chunk and H' = H - 1.
6. For each of the ceil(L / S') subchunks (of size S') except for the last, with index I:
   6.1. Recursively call insert-chunk with chunk' = the first ceil((S - 1) / 2) keys of subchunk I and H' = H - 1.
   6.2. Insert the last key of subchunk I into T (this insertion purposefully triggers a split).
   6.3. Recursively call insert-chunk with chunk' = the remaining keys of subchunk I (if any) and H' = H - 1.
7. Recursively call insert-chunk with chunk' = the remaining keys of the last subchunk and H' = H - 1.
Note that the recursive procedure is called twice for each subtree; that is fine, because the first call always creates a perfectly filled half subtree.
Here is a way which would lead to minimum height in any BST (including a B-tree); a Python sketch follows the example below:
1. Sort the array.
2. Say a node can hold m keys in the B-tree.
3. Recursively divide the array into m+1 equal parts, using m keys as the parent.
4. Construct each child tree from its n/(m+1) sorted keys by recursion.
Example:
m = 2, array = [1 2 3 4 5 6 7 8 9 10]
Divide the array into three parts:
root = [4,8]
Recursively solve:
child1 = [1 2 3]
    root1 = [2]
    left1 = [1]
    right1 = [3]
Similarly, solve recursively for all children.
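Here is a minimal Python sketch of this ordering (the insertion_order name is hypothetical; it assumes the keys are already sorted, emits the m separators of each range first, then recurses into the m+1 parts):

def insertion_order(keys, m):
    # keys: sorted list; m: number of keys in each parent node
    n = len(keys)
    if n <= m:
        return list(keys)
    q, r = divmod(n - m, m + 1)
    sizes = [q + (1 if i < r else 0) for i in range(m + 1)]  # near-equal parts
    order, parts, pos = [], [], 0
    for i, size in enumerate(sizes):
        parts.append(keys[pos:pos + size])
        pos += size
        if i < m:                      # the key after each part is a separator
            order.append(keys[pos])
            pos += 1
    for part in parts:                 # recurse into each child range
        order += insertion_order(part, m)
    return order

print(insertion_order(list(range(1, 11)), 2))
# [4, 8, 1, 2, 3, 5, 6, 7, 9, 10] -- root keys first, then each child's keys

Note that feeding this order to a real B-tree's insert routine does not by itself pin the separators at the root; the sketch only reproduces the ordering this answer describes.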
So is this about optimising the creation procedure, or optimising the tree?
You can clearly create a maximally efficient B-Tree by first creating a full Balanced Binary Tree, and then contracting nodes.
At any level in a binary tree, the gap in numbers between two nodes contains all the numbers between those two values by the definition of a binary tree, and this is more or less the definition of a B-Tree. You simply start contracting the binary tree divisions into B-Tree nodes. Since the binary tree is balanced by construction, the gaps between nodes on the same level always contain the same number of nodes (assuming the tree is filled). Thus the BTree so constructed is guaranteed balanced.
In practice this is probably quite a slow way to create a BTree, but it certainly meets your criteria for constructing the optimal B-Tree, and the literature on creating balanced binary trees is comprehensive.
=====================================
In your case, where you might take an off-the-shelf "better" over a constructed optimal version, have you considered simply changing the number of children nodes can have? Your diagram looks like a classic 2-3 tree, but it's perfectly possible to have a 3-4 tree or a 3-5 tree, which means that every node will have at least three children.
Your question is about B-tree optimization. It is unlikely that you do this just for fun, so I can only assume that you would like to optimize data accesses, maybe as part of database programming or something like that. You wrote "When searching the BTree, it is paged on demand from disk into memory", which means that you either do not have enough memory to do any sort of caching, or you have a policy to use as little memory as possible. Either way, this may be the root cause of why any answer to your question will not be satisfying. Let me explain why.
When it comes to data access optimization, memory is your friend. Whether you do read or write optimization, you need memory. Any sort of write optimization always works on the assumption that it can read information quickly (from memory); sorting needs data. If you do not have enough memory for read optimization, you will not have it for write optimization either.
As soon as you are willing to accept at least some memory utilization, you can rethink your statement "When searching the BTree, it is paged on demand from disk into memory", which makes room for balancing between read and write optimization. A B-tree optimized to the maximum is maximal write optimization. In most data access scenarios I know, you get one write for every 10-100 reads, which means that maximal write optimization is likely to give poor performance in terms of overall data access. That is why databases accept restructuring cycles, key space waste, unbalanced B-trees and things like that...

One-way flight trip problem

You are going on a one-way indirect flight trip that includes an unknown, very large number of transfers (billions, say).
You are not stopping twice in the same airport.
You have 1 ticket for each part of your trip.
Each ticket contains src and dst airport.
All the tickets you have are randomly sorted.
You forgot the original departure airport (very first src) and your destination (last dst).
Design an algorithm to reconstruct your trip with minimum big-O complexity.
Attempting to solve this problem I have started to use a symmetric difference of two sets, Srcs and Dsts:
1) Sort all src keys into array Srcs
2) Sort all dst keys into array Dsts
3) Take the symmetric difference of the two arrays to find the non-duplicates; they are your first src and last dst
4) Now, having the starting point, traverse both arrays using binary search.
But I suppose there must be another more effective method.
Construct a hashtable and add each airport into the hash table.
<key,value> = <airport, count>
The count for an airport increases when the airport appears as either the source or the destination. So every airport will have a count of 2 (1 for src and 1 for dst), except for the source and the destination of your trip, which will each have a count of 1.
You need to look at each ticket at least once. So complexity is O(n).
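A quick Python sketch of this counting idea (the find_endpoints name and the (src, dst) tuple format for tickets are assumptions):

from collections import Counter

def find_endpoints(tickets):
    # tickets: list of (src, dst) pairs
    count = Counter()
    for src, dst in tickets:
        count[src] += 1
        count[dst] += 1
    # exactly two airports have count 1: the trip's endpoints;
    # the one appearing as a src is the origin, the other is the destination
    a, b = [airport for airport, c in count.items() if c == 1]
    srcs = {src for src, _ in tickets}
    return (a, b) if a in srcs else (b, a)

print(find_endpoints([('b', 'c'), ('a', 'b'), ('c', 'd')]))  # ('a', 'd')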
Summary: below, a single-pass algorithm is given. (I.e., not just linear: it looks at each ticket exactly once, which is of course the optimal number of visits per ticket.) I added the summary because there are many seemingly equivalent solutions, and it would be hard to spot why I added another one. :)
I was actually asked this question in an interview. The concept is extremely simple: each ticket is a singleton list, with conceptually two elements, src and dst.
We index each such list in a hashtable, using its first and last elements as keys, so we can find in O(1) whether a list starts or ends at a particular element (airport). For each ticket, when we see it starts where another list ends, we just link the lists (O(1)). Similarly, if it ends where another list starts, we do another list join. Of course, when we link two lists, we basically destroy the two and obtain one. (The chain of N tickets will be constructed after N-1 such links.)
Care is needed to maintain the invariant that the hashtable keys are exactly the first and last elements of the remaining lists.
All in all, O(N).
And yes, I answered that on the spot :)
Edit: Forgot to add an important point. Everyone mentions two hashtables, but one does the trick as well, because the algorithm's invariant includes that at most one ticket list starts or ends in any single city (if there are two, we immediately join the lists at that city, and remove that city from the hashtable). Asymptotically there is no difference; it's just simpler this way.
Edit 2: Also of interest: compared to solutions using 2 hashtables with N entries each, this solution uses one hashtable with at most N/2 entries (which happens if we see the tickets in an order of, say, 1st, 3rd, 5th, and so on). So it uses about half the memory as well, apart from being faster.
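For illustration, here is a rough Python sketch of this single-pass linking idea (the reconstruct name is hypothetical; Python lists stand in for linked lists, so the joins here are not truly O(1)):

def reconstruct(tickets):
    chains = {}  # endpoint airport -> the chain (list of airports) starting or ending there

    def detach(chain):
        # drop both endpoints of a chain from the index
        for key in (chain[0], chain[-1]):
            if chains.get(key) is chain:
                del chains[key]

    def attach(chain):
        chains[chain[0]] = chain
        chains[chain[-1]] = chain

    for s, d in tickets:
        chain = [s, d]
        left = chains.get(s)
        if left is not None and left[-1] == s:   # a chain ends where we start
            detach(left)
            chain = left + chain[1:]
        right = chains.get(d)
        if right is not None and right[0] == d:  # a chain starts where we end
            detach(right)
            chain = chain + right[1:]
        attach(chain)

    return next(iter(chains.values()))  # the single remaining chain: the full trip

print(reconstruct([('b', 'c'), ('d', 'e'), ('a', 'b'), ('c', 'd')]))
# ['a', 'b', 'c', 'd', 'e']

Keeping only chain endpoints in the dict is what maintains the at-most-N/2-entries property mentioned above.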
Construct two hash tables (or tries), one keyed on src and the other on dst. Choose one ticket at random and look up its dst in the src-hash table. Repeat that process for the result until you hit the end (the final destination). Now look up its src in the dst-keyed hash table. Repeat the process for the result until you hit the beginning.
Constructing the hash tables takes O(n) and constructing the list takes O(n), so the whole algorithm is O(n).
EDIT: You only need to construct one hash table, actually. Let's say you construct the src-keyed hash table. Choose one ticket at random and, like before, construct the list that leads to the final destination. Then choose another random ticket from the tickets that have not yet been added to the list. Follow its destination until you hit the ticket you initially started with. Repeat this process until you have constructed the entire list. It's still O(n), since in the worst case you choose the tickets in reverse order.
Edit: got the table names swapped in my algorithm.
It's basically a dependency graph where every ticket represents a node and the src and dst airport represents directed links, so use a topological sort to determine the flight order.
EDIT: Although since this is an airline ticket and you know you actually made an itinerary you could physically perform, sort by departure date and time in UTC.
EDIT2: Assuming each airport you have a ticket for uses a three-character code, you can use the algorithm described here (Find three numbers appeared only once) to determine the two unique airports by XORing all the airports together.
EDIT3: Here's some C++ to actually solve this problem using the xor method. The overall algorithm is as follows, assuming a unique encoding from airport to an integer (either assuming a three letter airport code or encoding the airport location in an integer using latitude and longitude):
First, XOR all the airport codes together. This should be equal to the initial source airport XOR the final destination airport. Since we know that the initial airport and the final airport are unique, this value should not be zero. Since it's not zero, there will be at least one bit set in that value. That bit corresponds to a bit that is set in one of the airports and not set in the other; call it the designator bit.
Next, set up two buckets, each with the XORed value from the first step. Now, for every ticket, bucket each airport according to whether it has the designator bit set or not, and xor the airport code with the value in the bucket. Also keep track for each bucket how many source airports and destination airports went to that bucket.
After you process all the tickets, pick one of the buckets. The number of source airports sent to that bucket should be one greater or less than the number of destination airports sent to that bucket. If the number of source airports is less than the number of destination airports, that means the initial source airport (the only unique source airport) was sent to the other bucket. That means the value in the current bucket is the identifier for the initial source airport! Conversely, if the number of destination airports is less than the number of source airports, the final destination airport was sent to the other bucket, so the current bucket is the identifier for the final destination airport!
#include <cassert>
#include <cstdlib>

struct ticket
{
    int src;
    int dst;
};

int get_airport_bucket_index(
    int airport_code,
    int discriminating_bit)
{
    return (airport_code & discriminating_bit)==discriminating_bit ? 1 : 0;
}

void find_trip_endpoints(const ticket *tickets, size_t ticket_count, int *out_src, int *out_dst)
{
    int xor_residual= 0;
    for (const ticket *current_ticket= tickets, *end_ticket= tickets + ticket_count; current_ticket!=end_ticket; ++current_ticket)
    {
        xor_residual^= current_ticket->src;
        xor_residual^= current_ticket->dst;
    }
    // now xor_residual will be equal to the starting airport xor ending airport
    // since starting airport!=ending airport, they have at least one bit that is not in common
    int discriminating_bit= xor_residual & (-xor_residual);
    assert(discriminating_bit!=0);
    int airport_codes[2]= { xor_residual, xor_residual };
    int src_count[2]= { 0, 0 };
    int dst_count[2]= { 0, 0 };
    for (const ticket *current_ticket= tickets, *end_ticket= tickets + ticket_count; current_ticket!=end_ticket; ++current_ticket)
    {
        int src_index= get_airport_bucket_index(current_ticket->src, discriminating_bit);
        airport_codes[src_index]^= current_ticket->src;
        src_count[src_index]+= 1;
        int dst_index= get_airport_bucket_index(current_ticket->dst, discriminating_bit);
        airport_codes[dst_index]^= current_ticket->dst;
        dst_count[dst_index]+= 1;
    }
    assert((airport_codes[0]^airport_codes[1])==xor_residual);
    assert(abs(src_count[0]-dst_count[0])==1); // all airports with the bit set/unset are accounted for, as well as either the source or destination
    assert(abs(src_count[1]-dst_count[1])==1);
    assert((src_count[0]-dst_count[0])==-(src_count[1]-dst_count[1]));
    int src_index= src_count[0]-dst_count[0]<0 ? 0 : 1;
    // if src count < dst count, the initial source went into the other bucket,
    // which means this bucket's residual value equals the initial source!
    assert(get_airport_bucket_index(airport_codes[src_index], discriminating_bit)!=src_index);
    *out_src= airport_codes[src_index];
    *out_dst= airport_codes[!src_index];
    return;
}

int main()
{
    ticket test0[]= { { 1, 2 } };
    ticket test1[]= { { 1, 2 }, { 2, 3 } };
    ticket test2[]= { { 1, 2 }, { 2, 3 }, { 3, 4 } };
    ticket test3[]= { { 2, 3 }, { 3, 4 }, { 1, 2 } };
    ticket test4[]= { { 2, 1 }, { 3, 2 }, { 4, 3 } };
    ticket test5[]= { { 1, 3 }, { 3, 5 }, { 5, 2 } };
    int initial_src, final_dst;
    find_trip_endpoints(test0, sizeof(test0)/sizeof(*test0), &initial_src, &final_dst);
    assert(initial_src==1);
    assert(final_dst==2);
    find_trip_endpoints(test1, sizeof(test1)/sizeof(*test1), &initial_src, &final_dst);
    assert(initial_src==1);
    assert(final_dst==3);
    find_trip_endpoints(test2, sizeof(test2)/sizeof(*test2), &initial_src, &final_dst);
    assert(initial_src==1);
    assert(final_dst==4);
    find_trip_endpoints(test3, sizeof(test3)/sizeof(*test3), &initial_src, &final_dst);
    assert(initial_src==1);
    assert(final_dst==4);
    find_trip_endpoints(test4, sizeof(test4)/sizeof(*test4), &initial_src, &final_dst);
    assert(initial_src==4);
    assert(final_dst==1);
    find_trip_endpoints(test5, sizeof(test5)/sizeof(*test5), &initial_src, &final_dst);
    assert(initial_src==1);
    assert(final_dst==2);
    return 0;
}
Create two data structures:
Route
{
    start
    end
    list of flights where flight[n].dest = flight[n+1].src
}
List of Routes
And then:
foreach (flight in random set)
{
    added_to_route = false;
    foreach (route in list of routes)
    {
        if (flight.src = route.end)
        {
            if (!added_to_route)
            {
                add flight to end of route
                added_to_route = true
            }
            else
            {
                merge routes
                next flight
            }
        }
        if (flight.dest = route.start)
        {
            if (!added_to_route)
            {
                add flight to start of route
                added_to_route = true
            }
            else
            {
                merge routes
                next flight
            }
        }
    }
    if (!added_to_route)
    {
        create route
    }
}
Put in two Hashes:
to_end = src -> des;
to_beg = des -> src
Pick any airport as a starting point S.
while (to_end[S] != null)
    S = to_end[S];
S is now your final destination. Repeat with the other map to find your starting point.
Without properly checking, this feels O(N), provided you have a decent Hash table implementation.
A hash table won't work for large sizes (such as the billions in the original question); anyone who has worked with them knows that they're only good for small sets. You could instead use a binary search tree, which would give you complexity O(n log n).
The simplest way is with two passes: The first adds them all to the tree, indexed by src. The second walks the tree and collects the nodes into an array.
Can we do better? We can, if we really want to: we can do it in one pass. Represent each ticket as a node in a linked list. Initially, each node has null values for the next pointer. For each ticket, enter both its src and dest in the index. If there's a collision, that means we already have the adjacent ticket; connect the nodes and delete the match from the index. When you're done, you'll have made only one pass, and have an empty index and a linked list of all the tickets in order.
This method is significantly faster: it's only one pass, not two; and the store is significantly smaller (worst case: n/2 ; best case: 1; typical case: sqrt(n)), enough so that you might be able to actually use a hash instead of a binary search tree.
Each airport is a node. Each ticket is an edge. Make an adjacency matrix to represent the graph. This can be done as a bit field to compress the edges. Your starting point will be the node that has no path into it (its column will be empty). Once you know this, you just follow the paths that exist.
Alternately, you could build a structure indexable by airport. For each ticket you look up its src and dst. If either is not found, then you need to add new airports to your list. When each is found, you set the departure airport's exit pointer to point to the destination, and the destination's arrival pointer to point to the departure airport. When you are out of tickets, you must traverse the entire list to determine which airport has no path in.
Another way would be to keep a variable-length list of mini-trips that you connect together as you encounter each ticket. Each time you add a ticket, you see if the ends of any existing mini-trip match either the src or dest of your ticket. If not, then your current ticket becomes its own mini-trip and is added to the list. If so, the new ticket is tacked onto the end(s) of the existing trip(s) it matches, possibly splicing two existing mini-trips together, in which case it shortens the list of mini-trips by one.
This is the simple case of a single path state machine matrix.
Sorry for the pseudo-code being in C# style, but it was easier to express the idea with objects.
First, construct a turnpike matrix.
Read my description of what a turnpike matrix is (don't bother with the FSM answer, just the explanation of a turnpike matrix) at What are some strategies for testing large state machines?.
However, the restrictions you describe make the case a simple single path state machine. It is the simplest state machine possible with complete coverage.
For a simple case of 5 airports,
vert nodes=src/entry points,
horiz nodes=dst/exit points.
     A1  A2  A3  A4  A5
A1           x
A2   x
A3               x
A4                   x
A5       x
Notice that for each row, as well as for each column, there should be no more than one transition.
To get the path of the machine, you would sort the matrix into
     A1  A2  A3  A4  A5
A2   x
A1           x
A3               x
A4                   x
A5       x
Or sort into a diagonal square matrix - an eigen vector of ordered pairs.
     A1  A2  A3  A4  A5
A2   x
A5       x
A1           x
A3               x
A4                   x
where the ordered pairs are the list of tickets:
a2:a1, a5:a2, a1:a3, a3:a4, a4:a5.
or in more formal notation,
<a2,a1>, <a5,a2>, <a1,a3>, <a3,a4>, <a4,a5>.
Hmmm .. ordered pairs huh? Smelling a hint of recursion in Lisp?
<a2,<a1,<a3,<a4,a5>>>>
There are two modes of the machine:
- trip planning: you don't know how many airports there are, and you need a generic trip plan for an unspecified number of airports
- trip reconstruction: you have all the turnpike tickets of a past trip, but they are all in one big stack in your glove compartment/luggage bag
I am presuming your question is about trip reconstruction. So, you pick one ticket after another randomly from that pile of tickets.
We presume the ticket pile is of indefinite size.
     tak  mnx  cda
bom   0
daj        0
phi             0
Where a 0 value denotes an unordered ticket. Let us define an unordered ticket as one whose dst has not been matched with the src of another ticket.
The following next ticket finds that mnx(dst) = kul(src) match.
     tak  mnx  cda  kul
bom   0
daj        1
phi             0
mnx                  0
At any moment you pick the next ticket, there is a possibility that it connects two sequential airports. If that happens, you create a cluster node out of those two nodes:
<bom,tak>, <daj,<mnx,kul>>
and the matrix is reduced,
     tak  cda  kul
bom   0
daj             L1
phi        0
where
L1 = <daj,<mnx,kul>>
which is a sublist of the main list.
Keep on picking the next random tickets.
     tak  cda  kul  svn  xml  phi
bom   0
daj             L1
phi        0
olm                  0
jdk                       0
klm                            0
Match either existent.dst to new.src
or existent.src to new.dst:
     tak  cda  kul  svn  xml
bom   0
daj             L1
olm                  0
jdk                       0
klm        L2
<bom,tak>, <daj,<mnx,kul>>, <<klm,phi>, cda>
The above topological exercise is for visual comprehension only. The following is the algorithmic solution.
The concept is to cluster ordered pairs into sublists to reduce the burden on the hash structures we will use to house the tickets. Gradually, there will be more and more pseudo-tickets (formed from merged matched tickets), each containing a growing sublist of ordered destinations. Finally, there will remain one single pseudo-ticket containing the complete itinerary vector in its sublist.
As you see, perhaps, this is best done with Lisp.
However, as an exercise of linked lists and maps ...
Create the following structures:
class Ticket : MapEntry<src, Vector<dst>> {
    src, dst
    Vector<dst> dstVec; // sublist of mergers
    // constructor
    Ticket(src, dst){
        this.src = src;
        this.dst = dst;
        this.dstVec.append(dst);
    }
}

class TicketHash<x> {
    x -> TicketMapEntry;
    void add(Ticket t){
        super.put(t.x, t);
    }
}

So that effectively,

TicketHash<src> {
    src -> TicketMapEntry;
    void add(Ticket t){
        super.put(t.src, t);
    }
}

TicketHash<dst> {
    dst -> TicketMapEntry;
    void add(Ticket t){
        super.put(t.dst, t);
    }
}
TicketHash<dst> mapbyDst = hash of map entries(dst->Ticket), key=dst
TicketHash<src> mapbySrc = hash of map entries(src->Ticket), key=src
When a ticket is randomly picked from the pile,
void pickTicket(Ticket t){
    // does t.dst exist in mapbyDst?
    // i.e. attempt to match src of next ticket to dst of an existent ticket.
    Ticket zt = dstExists(t);
    // check if the merged ticket also matches the other end.
    if (zt != null)
        t = zt;
    // attempt to match dst of next ticket to src of an existent ticket.
    if (srcExists(t) != null) return;
    // otherwise if unmatched either way, add the new ticket
    else {
        // add t.dst to the list of existing dst
        mapbyDst.add(t);
        mapbySrc.add(t);
    }
}
Check for existent dst:
Ticket dstExists(Ticket t){
    // find existing ticket whose dst matches t.src
    Ticket zt = mapbyDst.getEntry(t.src);
    if (zt == null) return null; // no match
    // an ordered pair is matched...
    // merge new ticket into existent ticket;
    // retain existent ticket and discard new ticket.
    Ticket xt = mapbySrc.getEntry(t.src);
    // append sublist of new ticket to sublist of existent ticket
    xt.dstVec.join(t.dstVec); // join the two linked lists
    // remove the matched dst ticket from mapbyDst
    mapbyDst.remove(zt);
    // replace it with the merged ticket from mapbySrc
    mapbyDst.add(zt);
    return zt;
}
Ticket srcExists(Ticket t){
    // find existing ticket whose src matches t.dst
    Ticket zt = mapbySrc.getEntry(t.dst);
    if (zt == null) return null; // no match
    // an ordered pair is matched...
    // merge new ticket into existent ticket;
    // retain existent ticket and discard new ticket.
    Ticket xt = mapbyDst.getEntry(t.dst);
    // append sublist of new ticket to sublist of existent ticket
    xt.dstVec.join(t.dstVec); // join the two linked lists
    // remove the matched src ticket from mapbySrc
    mapbySrc.remove(zt);
    // replace it with the merged ticket
    mapbySrc.add(zt);
    return zt;
}
Check for existent src:
Ticket srcExists(Ticket t){
    // find existing ticket whose src matches t.dst
    Ticket zt = mapbySrc.getEntry(t.dst);
    if (zt == null) return null;
    // if an ordered pair is matched,
    // remove the entry from mapbySrc
    mapbySrc.remove(zt);
    // merge new ticket into existent ticket;
    // reinsert existent ticket and discard new ticket.
    mapbySrc.getEntry(zt);
    // append sublist of new ticket to sublist of existent ticket
    zt.dstVec.append(t.dstVec);
    return zt;
}
I have a feeling the above has quite a few typos, but the concept should be right. If anyone finds a typo, please help correct it.
The easiest way is with hash tables, but that doesn't have the best worst-case complexity (O(n^2)).
Instead:
Create a bunch of nodes containing (src, dst) O(n)
Add the nodes to a list and sort by src O(n log n)
For each (destination) node, search the list for the corresponding (source) node O(n log n)
Find the start node (for instance, using a topological sort, or marking nodes in step 3) O(n)
Overall: O(n log n)
(For both algorithms, we assume the length of the strings is negligible, i.e. comparison is O(1).)
No need for hashes or anything of the kind.
The real input size here is not necessarily the number of tickets (say n), but the total 'size' (say N) of the tickets: the total number of characters needed to encode them.
If we have an alphabet of k characters (here k is roughly 42), we can use bucket-sort techniques to sort an array of n strings of total size N, encoded with an alphabet of k characters, in O(n + N + k) time. The following works if n <= N (trivial) and k <= N (well, N is billions, isn't it?):
1. In the order the tickets are given, extract all airport codes from the tickets and store them in structs that hold the code as a string and the ticket index as a number.
2. Bucket-sort that array of structs by code.
3. Run through the sorted array and assign an ordinal number (starting from 0) to each newly encountered airport code. For all elements with the same code (they are consecutive), go to the ticket (we stored its index with the code) and change the code (picking the right one, src or dst) of the ticket to the ordinal number.
4. During this run through the array we may identify the original source src0.
5. Now all tickets have src and dst rewritten as ordinal numbers, and the tickets may be interpreted as one list starting at src0.
6. Do a list ranking (= topological sort, keeping track of the distance from src0) on the tickets.
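A rough Python sketch of steps 1-3 (the to_ordinals name is hypothetical, and Python's built-in sort stands in for the O(n + N + k) bucket sort):

def to_ordinals(tickets):
    # tickets: list of (src, dst) airport-code strings
    occurrences = []                        # (code, ticket index, src/dst slot)
    for i, (src, dst) in enumerate(tickets):
        occurrences.append((src, i, 0))
        occurrences.append((dst, i, 1))
    occurrences.sort(key=lambda t: t[0])    # stand-in for the bucket sort
    renamed = [[None, None] for _ in tickets]
    ordinal, prev = -1, None
    for code, i, slot in occurrences:
        if code != prev:                    # newly encountered airport code
            ordinal += 1
            prev = code
        renamed[i][slot] = ordinal          # rewrite the ticket in place
    return [tuple(pair) for pair in renamed]

print(to_ordinals([("JFK", "LHR"), ("LHR", "CDG")]))  # [(1, 2), (2, 0)]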
If you assume a joinable list structure that can store everything (probably on disk):
1. Create 2 empty hash tables S and D
2. Grab the first element
3. Look up its src in D
4. If found, remove the associated node from D and link it to the current node
5. If not found, insert the node into S keyed on src
6. Repeat from 3 the other way around (src <-> des, S <-> D)
7. Repeat from 2 with the next node
O(n) time. As for space, the birthday paradox (or something much like it) will keep your data set a lot smaller than the full set. In the unlucky case where it still gets too large (worst case is O(n)), you can evict random runs from the hash tables and insert them at the end of the processing queue. Your speed could go to pot, but as long as you far exceed the threshold for expecting collisions (~O(sqrt(n))), you should expect to see your dataset (the tables and input queue combined) regularly shrink.
It seems to me that a graph-based approach is best here.
Each airport is a node, each ticket is an edge. Let's make every edge undirected for now.
In the first stage you are building the graph: for each ticket, you lookup the source and destination and build an edge between them.
Now that the graph is constructed, we know that it is acyclic and that there is a single path through it. After all, you only have tickets for trips you took, and you never visited the same airport twice.
In the second stage, you are searching the graph: pick any node, and initiate a search in both directions until you find you cannot continue. These are your source and destination.
If you need to specifically say which was the source and which was the destination, add a direction property to each edge (but keep it an undirected graph). Once you have the candidate source and destination, you can tell which is which based on the edge connected to them.
The complexity of this algorithm depends on the time it takes to look up a particular node. If you could achieve O(1), the time would be linear. You have n tickets, so it takes O(N) steps to build the graph, then O(N) to search, and O(N) to reconstruct the path. Still O(N). An adjacency matrix will give you that.
If you can't spare the space, you could do a hash for the nodes, which would give you O(1) under optimal hashing and all that crap.
Note that if the task were only to determine the source and destination airports (instead of reconstructing the whole trip), the puzzle would probably become more interesting.
Namely, assuming that airport codes are given as integers, the source and destination airports can be determined using O(1) passes of the data and O(1) additional memory (i.e. without resorting to hashtables, sorting, binary search, and the like).
Of course, once you find the source, it also becomes a trivial matter to index and traverse the full route, but from that point on the whole thing will require at least O(n) additional memory anyway (unless you can sort the data in place, which, by the way, allows solving the original task in O(n log n) time with O(1) additional memory).
Let's forget the data structures and graphs for a moment.
First I need to point out that everybody has made the assumption that there are no loops. If the route goes through one airport twice, then it's a much larger problem.
But let's keep the assumption for now.
The input data is in fact an ordered set already. Every ticket is an element of the relation that introduces an order on the set of airports. (English is not my mother tongue, so these might not be the correct math terms.)
Every ticket holds information like this: airportX < airportY. So, while doing one pass through the tickets, an algorithm can recreate the ordered list starting from just any airport.
Now let's drop the "linear assumption". No order relation can be defined out of that kind of input. The input data then has to be treated as production rules for a formal grammar, where the grammar's vocabulary set is the set of airport names.
A ticket like that:
src: A
dst: B
is in fact a pair of productions:
A->AB
B->AB
from which you can only keep one.
Now you have to generate every possible sentence, but you can use every production rule only once. The longest sentence, which uses each of its productions exactly once, is a correct solution.
Prerequisites
First of all, create some kind of subtrip structure that contains a part of your route.
For example, if your complete trip is a-b-c-d-e-f-g, a subtrip could be b-c-d, i.e. a connected subpath of your trip.
Now, create two hashtables that map a city to the subtrip containing it. One hashtable is keyed by the city each subtrip begins with, the other by the city each subtrip ends with. That means one city can occur at most once in each of the hashtables.
As we will see later, not every city needs to be stored, but only the beginning and the end of each subtrip.
Constructing subtrips
Now, take the tickets one after another. Assume the ticket goes from x to y (represented by (x,y)). Check whether x is the end of some subtrip s (since every city is visited only once, it cannot be the end of two subtrips at once). If it is, just add the current ticket (x,y) at the end of subtrip s. If there is no subtrip ending with x, check whether there is a subtrip t beginning with y. If so, add (x,y) at the beginning of t. If there is also no such subtrip t, just create a new subtrip containing only (x,y).
Dealing with subtrips should be done with some special bookkeeping:
Creating a new subtrip s containing (x,y) should add x to the hashtable for "subtrip beginning cities" and add y to the hashtable for "subtrip ending cities".
Adding a new ticket (x,y) at the beginning of the subtrip s=(y,...), should remove y from the hashtable of beginning cities and instead add x to the hashtable of beginning cities.
Adding a new ticket (x,y) at the end of the subtrip s=(...,x), should remove x from the hashtable of ending cities and instead add y to the hashtable of ending cities.
With this structure, finding the subtrip corresponding to a city can be done in amortized O(1).
After this is done for all tickets, we have some subtrips. Note the fact that we have at most (n-1)/2 = O(n) such subtrips after the procedure.
Concatenating subtrips
Now, we just consider the subtrips one after another. If we have a subtrip s=(x,...,y), we just look in our hashtable of ending cities to see whether there's a subtrip t=(...,x) ending with x. If so, we concatenate t and s into a new subtrip. If not, we know that s is our first subtrip; then we look whether there's another subtrip u=(y,...) beginning with y. If so, we concatenate s and u. We do this until just one subtrip is left (this subtrip is then our whole original trip).
I hope I didn't overlook something, but this algorithm should run as follows:
Constructing all subtrips (at most O(n) of them) can be done in O(n), if we implement adding tickets to a subtrip in O(1). This should be no problem if we have some nice pointer structure (implementing subtrips as linked lists). Changing two values in a hashtable is also (amortized) O(1). Thus, this phase consumes O(n) time.
Concatenating the subtrips until just one is left can also be done in O(n). To see this, we just need to look at what is done in the second phase: hashtable lookups, which need amortized O(1), and subtrip concatenation, which can be done in O(1) with pointer concatenation or something similar.
Thus, the whole algorithm takes O(n) time, which might be the optimal bound, since every ticket needs to be looked at at least once.
I have written a small Python program that uses two hash tables, one for counts and another for the src-to-dst mapping.
The complexity depends on the implementation of the dictionary: if the dictionary is O(1), the complexity is O(n); if the dictionary is O(lg n), as with an STL map, the complexity is O(n lg n).
import random

# actual journey: a -> b -> c -> ... -> g -> h
journey = [('a','b'), ('b','c'), ('c','d'), ('d','e'), ('e','f'), ('f','g'), ('g','h')]

# shuffle the journey
random.shuffle(journey)
print("shuffled journey : ", journey)

# hashmap with the count of each place
map_count = {}
# hashmap for finding the route, contains src to dst mapping
map_route = {}

# fill the hashtables
for j in journey:
    source = j[0]; dest = j[1]
    map_route[source] = dest
    map_count[source] = map_count.get(source, 0) + 1
    map_count[dest] = map_count.get(dest, 0) + 1

# find the start point: the map entry with count = 1 whose
# key also exists in map_route
start = ''
for (key, val) in map_count.items():
    if val == 1 and key in map_route:
        start = key
        break
print("journey started at : %s" % start)

route = []           # the route
n = len(journey)     # number of tickets
while n:
    route.append((start, map_route[start]))
    start = map_route[start]
    n -= 1
print(" Route : ", route)
I provide here a more general solution to the problem:
You can stop several times at the same airport, but you have to use every ticket exactly one time.
You can have more than 1 ticket for each part of your trip.
Each ticket contains src and dst airport.
All the tickets you have are randomly sorted.
You forgot the original departure airport (very first src) and your destination (last dst).
My method returns a list of cities (a vector) containing all the specified cities if such a chain exists, and an empty list otherwise. When there are several ways to travel the cities, the method returns the lexicographically smallest list.
#include<vector>
#include<string>
#include<unordered_map>
#include<unordered_set>
#include<set>
#include<map>
using namespace std;
struct StringPairHash
{
size_t operator()(const pair<string, string> &p) const {
return hash<string>()(p.first) ^ hash<string>()(p.second);
}
};
void calcItineraryRec(const multimap<string, string> &cities, string start,
vector<string> &itinerary, vector<string> &res,
unordered_set<pair<string, string>, StringPairHash> &visited, bool &found)
{
if (visited.size() == cities.size()) {
found = true;
res = itinerary;
return;
}
if (!found) {
auto pos = cities.equal_range(start);
for (auto p = pos.first; p != pos.second; ++p) {
if (visited.find({ *p }) == visited.end()) {
visited.insert({ *p });
itinerary.push_back(p->second);
calcItineraryRec(cities, p->second, itinerary, res, visited, found);
itinerary.pop_back();
visited.erase({ *p });
}
}
}
}
vector<string> calcItinerary(vector<pair<string, string>> &citiesPairs)
{
if (citiesPairs.size() < 1)
return {};
multimap<string, string> cities;
set<string> uniqueCities;
for (auto entry : citiesPairs) {
cities.insert({ entry });
uniqueCities.insert(entry.first);
uniqueCities.insert(entry.second);
}
for (const auto &startCity : uniqueCities) {
vector<string> itinerary;
itinerary.push_back(startCity);
unordered_set<pair<string, string>, StringPairHash> visited;
bool found = false;
vector<string> res;
calcItineraryRec(cities, startCity, itinerary, res, visited, found);
if (res.size() - 1 == cities.size())
return res;
}
return {};
}
Here is an example of usage:
int main()
{
    vector<pair<string, string>> cities = { {"Y", "Z"}, {"W", "X"}, {"X", "Y"}, {"Y", "W"}, {"W", "Y"} };
    vector<string> itinerary = calcItinerary(cities); // { "W", "X", "Y", "W", "Y", "Z" }
    // another route is possible, {W Y W X Y Z}, but the route above is lexicographically smaller
    cities = { {"Y", "Z"}, {"W", "X"}, {"X", "Y"}, {"W", "Y"} };
    itinerary = calcItinerary(cities); // empty, no way to travel all cities using each ticket exactly one time
}

Subset Sum TI Basic Programming

I'm trying to program my TI-83 to do a subset-sum search. So, given a list of length N, I want to find all lists of a given length L that sum to a given value V.
This is a little different from the regular subset-sum problem, because I am only searching for subsets of a given length, not all lengths, and recursion is not necessarily the first choice, since I can't call the program I'm working in.
I am able to accomplish the task easily with nested loops, but that is becoming cumbersome for values of L greater than 5. I'm trying for dynamic solutions, but am not getting anywhere.
Really, at this point, I am just trying to get the list references correct, so that's what I'm looking at. Let's go with an example:
L1={p,q,r,s,t,u}
so
N=6
let's look for all subsets of length 3 to keep it relatively short, so L = 3 (6c3 = 20 total outputs).
Ideally the list references that would be searched are:
{1,2,3}
{1,2,4}
{1,2,5}
{1,2,6}
{1,3,4}
{1,3,5}
{1,3,6}
{1,4,5}
{1,4,6}
{1,5,6}
{2,3,4}
{2,3,5}
{2,3,6}
{2,4,5}
{2,4,6}
{2,5,6}
{3,4,5}
{3,4,6}
{3,5,6}
{4,5,6}
Obviously accomplished by:
FOR A,1,N-2
  FOR B,A+1,N-1
    FOR C,B+1,N
      display {A,B,C}
    END
  END
END
I initially sort the data descending, which lets me use criteria that shorten the search, but using FOR loops screws that up a little at various places when I increment the values of A, B and C within the loops.
I am also looking for better dynamic solutions. I've done some research on the web, but I can't seem to adapt what is out there to my particular situation.
Any help would be appreciated. I am trying to keep it brief enough as to not write a novel but explain what I am trying to get at. I can provide more details as needed.
For optimisation, you simply want to skip those sub-trees of the search where you already know they'll exceed the value V. Recursion is the way to go, but since you've already ruled that out, you're going to be best off setting an upper limit on the allowed depths.
I'd go for something like this (for a depth of 3):
N is the total number of array elements.
L is the desired length (3).
V is the desired sum
Y[] is the array
Z is the total
Z = 0
IF Z <= V
  FOR A,1,N-L+1
    Z = Z + Y[A]
    IF Z <= V
      FOR B,A+1,N-L+2
        Z = Z + Y[B]
        IF Z <= V
          FOR C,B+1,N
            Z = Z + Y[C]
            IF Z = V
              DISPLAY {A,B,C}
            END
            Z = Z - Y[C]
          END
        END
        Z = Z - Y[B]
      END
    END
    Z = Z - Y[A]
  END
END
Now, that's pretty convoluted, but it basically checks at every stage whether you've already exceeded the desired value, and refuses to descend into lower sub-trees as an efficiency measure. It also keeps a running total for the current level, so that it doesn't have to do a large number of additions when checking at lower levels; that's the adding and subtracting of the array values against Z.
It's going to get even more complicated when you modify it to handle more depth, by using the variables from D to K for 11 levels (more if you're willing to move N and L down to W and X, or if TI-BASIC allows more than one character in a variable name).
The only other non-recursive way I can think of is to use an array of value groups to emulate recursion with iteration, and that will look only slightly less hairy (although the code should be less nested).
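For comparison, here is a minimal Python sketch of the same pruned search (the find_subsets name is hypothetical; it is recursive, since Python allows it, and assumes non-negative values so that a partial sum exceeding V can be cut off):

def find_subsets(y, L, V):
    # y: list of values, L: subset length, V: target sum
    results = []
    def rec(start, chosen, total):
        if total > V:                  # prune: this subtree can't sum to V
            return
        if len(chosen) == L:
            if total == V:
                results.append(list(chosen))
            return
        # leave enough elements for the remaining picks
        for i in range(start, len(y) - (L - len(chosen)) + 1):
            chosen.append(i + 1)       # 1-based indices, like the TI lists
            rec(i + 1, chosen, total + y[i])
            chosen.pop()
    rec(0, [], 0)
    return results

print(find_subsets([5, 4, 3, 2, 1], 3, 9))  # [[1, 3, 5], [2, 3, 4]]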

Help with a special case of permutations algorithm (not the usual)

I have always been interested in algorithms, sort, crypto, binary trees, data compression, memory operations, etc.
I read Mark Nelson's article about permutations in C++ with the STL function next_perm(); very interesting and useful. After that, I wrote a class method to get the next permutation in Delphi, since that is the tool I presently use most. This function works in lexicographic order; I got the idea for the algorithm from an answer in another topic here on Stack Overflow. But now I have a big problem: I'm working with permutations of a vector with repeated elements, and there are lots of permutations that I don't need. For example, I have this first permutation of 7 elements in lexicographic order:
6667778 (6 = 3 times consecutively, 7 = 3 times consecutively)
For my work, I consider valid only those perms with at most 2 elements repeated consecutively, like this:
6676778 (6 = 2 times consecutively, 7 = 2 times consecutively)
In short, I need a function that returns only permutations that have at most N consecutive repetitions, according to the parameter received.
Does anyone know if there is some algorithm that already does this?
Sorry for any mistakes in the text, I still don't speak English very well.
Thank you so much,
Carlos
My approach is a recursive generator that doesn't follow branches that contain illegal sequences.
Here's the python 3 code:
def perm_maxlen(elements, prefix = "", maxlen = 2):
    if not elements:
        yield prefix + elements
        return
    used = set()
    for i in range(len(elements)):
        element = elements[i]
        if element in used:
            # already searched this path
            continue
        used.add(element)
        suffix = prefix[-maxlen:] + element
        if len(suffix) > maxlen and len(set(suffix)) == 1:
            # would exceed maximum run length
            continue
        sub_elements = elements[:i] + elements[i+1:]
        for perm in perm_maxlen(sub_elements, prefix + element, maxlen):
            yield perm

for perm in perm_maxlen("6667778"):
    print(perm)
The implementation is written for readability, not speed, but the algorithm should be much faster than naively filtering all permutations.
print(len(list(perm_maxlen("a"*100 + "b"*100, "", 1))))
For example, it runs this in milliseconds, where the naive filtering solution would take millennia or something.
So, in the homework-assistance kind of way, I can think of two approaches.
Work out all permutations that contain 3 or more consecutive repetitions (which you can do by treating the three-in-a-row as just one pseudo-digit and feeding it to a normal permutation-generation algorithm). Make a lookup table of all of these. Now generate all permutations of your original string, and check each one against the lookup table before adding it to the result.
Use a recursive permutation-generating algorithm (select each possibility for the first digit in turn, then recurse to generate permutations of the remaining digits), but in each recursion pass along the last two digits generated so far. Then, in the recursively called function, if the two values passed in are the same, don't allow the first digit to be the same as those.
Why not just make a wrapper around the normal permutation function that skips values that have N consecutive repetitions? something like:
(pseudocode)
function custom_perm(int max_rep)
    do
        p := next_perm()
    while count_max_reps(p) > max_rep
    return p
Krusty, I'm already doing that at the end of the function, but it doesn't solve the problem, because it still has to generate all the permutations and check each one:
consecutive := 1;
IsValid := True;
for n := 0 to len - 2 do
begin
  if anyVector[n] = anyVector[n + 1] then
    consecutive := consecutive + 1
  else
    consecutive := 1;
  if consecutive > MaxConsecutiveRepeats then
  begin
    IsValid := False;
    Break;
  end;
end;
Since I start from the first permutation in lexicographic order, this approach ends up generating a lot of unnecessary perms.
This is easy to make, but rather hard to make efficient.
If you need to build a single piece of code that only considers valid outputs, and thus doesn't bother walking over the entire combination space, then you're going to have some thinking to do.
On the other hand, if you can live with the code internally producing all combinations, valid or not, then it should be simple.
Make a new enumerator, one which you can call that next_perm method on, and have this internally use the other enumerator, the one that produces every combination.
Then simply make the outer enumerator run in a while loop asking the inner one for more permutations until you find one that is valid, then produce that.
Pseudo-code for this:
generator1:
    when called, yield the next combination

generator2:
    internally keep a generator1 object
    when called, keep asking generator1 for a new combination
        check the combination
        if valid, then yield it
