Deterministic Finite Automata with 6 states - dfa

I am really new to this and not sure how to start. I am trying to do this question for practice:
Let Σ = {a,b}. Consider the set of all strings in Σ* that have an odd
number of occurrences of the substring "ab" but do not have "bb" as a substring.
Give a DFA with six states accepting the set.
I did change my solution.

Here is an idea of how to approach this question. Think about the conditions you need to meet.
First, you need an odd number of occurrences of "ab", so your DFA has to track the parity of that count. Whenever the number of "ab" seen so far is odd, the DFA should be in a state that is accepting or can still lead to acceptance. Conversely, whenever the count is even, the DFA should be in a state from which it cannot accept until it encounters another "ab".
Second, the string cannot contain "bb". This means that right after reading a "b", another "b" drives your DFA to a sink state that it can never leave, so the string is rejected.
It may help to label each edge with the character that triggers it, so that each state records what has been seen so far: the parity of the "ab" count and the last character read.
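As a rough illustration of the idea above, here is one possible six-state table, simulated in Python. The state names (START, EVEN_A, and so on) are made up for this sketch and are not from the original exercise; each state records the parity of the "ab" count and the last character read, plus one sink for "bb".

```python
# One possible six-state DFA (state names are illustrative assumptions).
# EVEN/ODD = parity of "ab" occurrences so far, _A/_B = last character read.
START, EVEN_A, EVEN_B, ODD_A, ODD_B, SINK = range(6)

DELTA = {
    START:  {"a": EVEN_A, "b": EVEN_B},
    EVEN_A: {"a": EVEN_A, "b": ODD_B},   # "a" then "b" completes one "ab"
    EVEN_B: {"a": EVEN_A, "b": SINK},    # "bb" is forbidden
    ODD_A:  {"a": ODD_A,  "b": EVEN_B},  # another "ab" flips parity back to even
    ODD_B:  {"a": ODD_A,  "b": SINK},    # "bb" is forbidden
    SINK:   {"a": SINK,   "b": SINK},    # once rejected, always rejected
}
ACCEPTING = {ODD_A, ODD_B}


def accepts(s):
    """Run the DFA on a string over {a, b} and report acceptance."""
    state = START
    for ch in s:
        state = DELTA[state][ch]
    return state in ACCEPTING
```

For instance, `accepts("ab")` and `accepts("bab")` hold, while `accepts("abab")` (even count) and `accepts("abb")` (contains "bb") do not.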


How to force the sorting of values in certain logic

In SQL Server, how do I force values to sort according to a particular logic? As far as I understand, SQL puts special characters first, then numbers, and then letters when sorting values.
Now, I need the underscore to come after the letters. For example,
I have a value of OA_G and a range between MRI and OL5.
Currently SQL puts OA_G inside the range, but I need to force it to be outside the range.
Your premise seems incorrect. Regardless of how SQL Server sorts string data, which, by the way, is controlled by the specification of the database's collation, if one were to sort this manually, the results you seek could not be achieved. To wit:
Given the three strings, "MRI", "OL5", and "OA_G", sort these in ascending order by string value.
Of course, this would be done character-by-character, comparing each character from left to right.
Since "M" comes before "O", the first member of the sorted set would be "MRI". Next, compare "OL5" and "OA_G". The letter "O" is the same, so check the next position. "L" is greater than "A" because it is the 12th letter of the alphabet and "A" is the first, so the next member would be "OA_G", leaving "OL5" as the final member.
There is no SQL Server collation order that would make this evaluation give you the results you're seeking.
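The character-by-character comparison can be checked quickly in Python. This is only an illustration: Python compares strings by code point, which is not the same thing as a SQL Server collation, but here the outcome is decided at the second character ("A" versus "L"), before the underscore is ever looked at.

```python
values = ["OL5", "OA_G", "MRI"]

# Plain lexicographic sort: "M" < "O", then "A" < "L" at the second position.
ordered = sorted(values)

# The underscore sits at position 3 of "OA_G" and never takes part
# in the comparison with "OL5".
in_range = "MRI" <= "OA_G" <= "OL5"
```

Here `ordered` is `["MRI", "OA_G", "OL5"]` and `in_range` is `True`.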
If the part values (MRI, OL5, OA_G, etc.) are constant or rarely change, you can create a lookup table like this:
Part | Priority
MRI | 1
OL5 | 2
OA_G | 3
Join it with your table and ORDER BY Priority
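A minimal sketch of the priority-table idea, using Python's built-in sqlite3 module so that it is runnable without a server. SQLite is not SQL Server, and the table and column names (`items`, `part_priority`) are invented for this example; the join and ORDER BY are the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (part TEXT);
    CREATE TABLE part_priority (part TEXT PRIMARY KEY, priority INTEGER);
    INSERT INTO items VALUES ('OA_G'), ('MRI'), ('OL5');
    INSERT INTO part_priority VALUES ('MRI', 1), ('OL5', 2), ('OA_G', 3);
""")

# Sort by the explicit priority instead of by the string value itself.
rows = conn.execute("""
    SELECT i.part
    FROM items AS i
    JOIN part_priority AS p ON p.part = i.part
    ORDER BY p.priority
""").fetchall()
ordered_parts = [row[0] for row in rows]
conn.close()
```

With these priorities, `ordered_parts` places OA_G after OL5, which plain string ordering does not do.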

Is it more efficient to use a linked list and delete nodes or use an array and do a small computation to a string to see if element can be skipped?

I am writing a program in C that reads a file. Each line of the file is a string of characters on which a computation will be done. The result of the computation on a particular string may imply that strings later on in the file do not need any computation done on them. Also, if the reverse of a string comes alphabetically before the (current, non-reversed) string, then it does not need to be checked.
My question is: would it be better to put each string in a linked list and delete nodes once I find that particular strings don't need to be checked, or to use an array, check the last few characters of each string, and skip it if it comes alphabetically after the string in the previous element? Either way, the list or array only needs to be iterated through once.
A rule of thumb is that if you are dealing with small objects (< 32 bytes), std::vector is better than a linked list for most general operations.
But for larger objects (say, 1K bytes), you generally need to consider lists.
There is an article that details the comparison; the link is here.
Without further details about what are your needs is a bit difficult to tell you which one would fit more with your requirements.
Arrays are easy to access, especially if you are going to do it in a non-sequential way, but they are hard to maintain if you need to perform deletions on them or if you don't have a good approximation of the final number of elements.
Lists are good if you plan to access them sequentially, but terrible if you need to jump between its elements. Also deletion over them can be done in constant time if you are already in the node you want to delete.
I don't quite understand how you plan to access them since you say that either one would be iterated just once, but if that is the case then either structure would give you the similar performance since you are not really taking advantage of their key benefits.
It's really difficult to understand what you are trying to do, but it sounds like you should create an array of records, with each record holding one of your strings and a boolean flag to indicate whether it should be processed.
You set each record's flag to true as you load the array from the file.
You use one pointer to scan the array once, processing only the strings from records whose flags are still true.
For each record processed, you use a second pointer to scan from the first pointer + 1 to the end of the array, identify strings that won't need processing (in light of the current string), and set their flags to false.
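The flag-based scan described above might look roughly like this. It is a sketch in Python rather than C, and the callbacks `process` and `makes_redundant` are hypothetical placeholders for the real computation and the real "this later string can be skipped" rule, neither of which is specified in the question.

```python
def scan_with_flags(strings, process, makes_redundant):
    """Process each string once, skipping those flagged as unnecessary.

    process(s) stands in for the real computation (hypothetical);
    makes_redundant(current, later) stands in for the rule deciding that
    `later` no longer needs processing (hypothetical).
    """
    active = [True] * len(strings)      # one flag per record, all set on load
    results = []
    for i, current in enumerate(strings):
        if not active[i]:
            continue                    # cleared by an earlier string
        results.append(process(current))
        # Second index: scan the rest of the array and clear flags.
        for j in range(i + 1, len(strings)):
            if active[j] and makes_redundant(current, strings[j]):
                active[j] = False
    return results


# Toy rule for demonstration only: a later string is skipped
# when it starts with the current string.
demo = scan_with_flags(
    ["ab", "abc", "cd", "abd", "cde"],
    process=str.upper,
    makes_redundant=lambda current, later: later.startswith(current),
)
```

Nothing is ever deleted or shifted, so the array stays contiguous; the inner scan makes this quadratic in the worst case.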

fast indexing for slow cpu?

I have a large document that I want to build an index of for word searching. (I hear this type of index is really called a concordance.) Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph, and if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array. Any time I encounter that same word again, I add the paragraph number to the index.
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large that it won't work to include it in a program file, even after removing stop words, and it would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding one word at a time from your text, and keeping for each word's final node a list of the locations where you can find that word.
This would not only save you some space, since storing words with similar prefixes requires far less space, but the search will be faster too. Search time is O(M), where M is the maximum string length, and insert time is O(n), where n is the length of the key you are inserting.
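A minimal sketch of such a trie, assuming paragraph numbers are the "locations" and that words are added in reading order. The class and method names are invented for illustration.

```python
class TrieNode:
    __slots__ = ("children", "paragraphs")

    def __init__(self):
        self.children = {}     # next character -> child node
        self.paragraphs = []   # paragraph numbers for the word ending here


class TrieIndex:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, paragraph):
        """Insert one occurrence; cost is proportional to len(word)."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        # Paragraphs arrive in order, so a repeat can only be the last entry.
        if not node.paragraphs or node.paragraphs[-1] != paragraph:
            node.paragraphs.append(paragraph)

    def lookup(self, word):
        """Return the paragraph numbers for an exact word, or []."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.paragraphs
```

Note that a word's list lives on the node where the word ends, which is not necessarily a leaf ("car" and "cart" share a path).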
Since the obvious alternative is a hash table, here you can find some more comparison between the two.
I would use a HashMap<String, List<Occurrency>>. This way you can check whether a word is already in your index in about O(1).
At the end, when you have collected all the words and want to search them very often, you might try to find a hash function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).
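In Python terms, the hash-map index could be sketched as a dict of lists. Here an "occurrence" is assumed to be a (paragraph number, word position) pair, and the tokenizer is a simple regular expression; both are choices made for this example, not something the question specifies.

```python
import re
from collections import defaultdict


def build_index(paragraphs):
    """Map each word to a list of (paragraph number, word position) pairs."""
    index = defaultdict(list)
    for para_num, text in enumerate(paragraphs):
        words = re.findall(r"\w+", text.lower())
        for pos, word in enumerate(words):
            index[word].append((para_num, pos))   # average O(1) per word
    return index


idx = build_index(["The cat sat", "A cat and the dog"])
```

The whole build is a single pass over the text, so its cost is linear in the number of words.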
Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array? Is your input in the form of an array of paragraphs, or do you mean the concordance entries per word, or what?
It is also unclear why your program is so slow; there is probably something inefficient in it. I suspect it is your check for "if I find a word I have not encountered before". I presume you look up the word in the dictionary and then iterate through the array of occurrences to see if the paragraph number is there? That is a slow linear search; you would be better served by a set there (think hash/dictionary where you care only about the keys), something like
concord = {
    'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
    'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. Actually, assuming you keep the list of occurrences in an array AND scan the text from the first to the last paragraph, the arrays will be built in sorted order, so you only need to check the very last element: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python, but you can translate them to your language.)
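Put together, the "check only the last element" trick could look like the following sketch. It assumes paragraphs are scanned in order and that whitespace splitting is an acceptable tokenizer; the function name is made up.

```python
def build_concordance(paragraphs):
    """Map each word to the sorted list of paragraph numbers containing it."""
    concord = {}
    for para_num, text in enumerate(paragraphs):
        for word in text.lower().split():
            hits = concord.setdefault(word, [])
            # Paragraphs are visited in increasing order, so a duplicate
            # can only be the most recently appended entry.
            if not hits or hits[-1] != para_num:
                hits.append(para_num)
    return concord


concord = build_concordance(["tea tea cake", "cake", "tea"])
```

Each word costs one dictionary lookup and one comparison, with no scan through the occurrence list.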

What is the best data structure and algorithm for comparing a list of strings?

I want to find the longest possible sequence of words that match the following rules:
Each word can be used at most once
All words are Strings
Two strings sa and sb can be concatenated if the LAST two characters of sa match the first two characters of sb.
In the case of concatenation, it is performed by overlapping those characters. For example:
sa = "torino"
sb = "novara"
sa concat sb = "torinovara"
For example, I have the following input file, "input.txt":
And, the output of the above file according to the above rules should be:
since the longest possible concatenation is:
Can anyone please help me out with this? What would be the best data structure for this?
This can be presented as a directed graph problem: the nodes are words, two words are connected by an edge if they overlap (and the smallest overlap is chosen to get the longest length), and the goal is to find the highest-weight path that visits no node twice.
(Actually, you need to extend the graph a bit to handle beginning and ending at a word. Add a "starting node" with an edge to each word of weight length(word) / 2.
Between two words, the edge weight is -overlap + (length(start) + length(finish)) / 2, and between each word and an "ending node" it is length(word) / 2. It might be easier to double all the weights.)
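To make the graph view concrete, here is a brute-force sketch: build the successor lists, then try every simple path with a depth-first search. It assumes "longest" means the longest concatenated string and that the word list is small, since the search is exponential in the worst case. The function names and the sample words beyond "torino" and "novara" are invented for the example.

```python
def join_chain(chain):
    """Concatenate a chain of words, overlapping two characters per join."""
    if not chain:
        return ""
    return chain[0] + "".join(word[2:] for word in chain[1:])


def longest_chain(words):
    words = list(dict.fromkeys(words))   # drop duplicates, keep order
    # Edge a -> b when the last two characters of a equal the first two of b.
    successors = {
        a: [b for b in words
            if b != a and len(a) >= 2 and len(b) >= 2 and a[-2:] == b[:2]]
        for a in words
    }
    best = []

    def extend(chain, used):
        nonlocal best
        if len(join_chain(chain)) > len(join_chain(best)):
            best = list(chain)
        for nxt in successors[chain[-1]]:
            if nxt not in used:          # each word at most once
                chain.append(nxt)
                used.add(nxt)
                extend(chain, used)
                used.remove(nxt)
                chain.pop()

    for word in words:
        extend([word], {word})
    return best


chain = longest_chain(["asti", "novara", "torino", "rapallo"])
```

For this sample, `join_chain(chain)` gives "torinovarapallo". Anything beyond a few dozen words would need pruning or a heuristic instead of exhaustive search.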
I would start really simple. Make 2 vectors of strings, one sorted normally, one sorted by the last 2 letters. Create an index (vector of ints) for the second vector that points out its position in the first.
To find the longest, first remove the orphans: words that don't match anything at either end. Then you want to build a neighbor-joining tree, which is where you determine which words can possibly reach each other. If you have 2 or more trees, you should try the largest tree first.
Now, with a tree, your job is to find ends that are rare, bind them to other ends, and repeat. This should get you a pretty nice solution. If it uses all the words, you're golden; skip the other trees. If it doesn't, then you're into a whole slew of algorithms to make this efficient.
Some items to consider:
If you have 3+ unique ends you are guaranteed to drop 1+ words.
This can be used to prune your trees down while hunting for an answer.
recalculate unique ends often.
Odd numbers of a given end ensure that one must be dropped (you get 2 freebies at the ends).
Segregate words that can self-match; you can always throw them in last, and they muck up the math otherwise.
You may be able to create small self-matching rings. You can treat these like self-matching words, as long as you don't orphan them when you create them. This can make the performance fantastic, but with no guarantee of a perfect solution.
The search space is O(N!), so for a list of millions of elements it may be hard to prove that an answer is exact. Of course, I could be overlooking something.

How do I find common elements from n arrays

I am thinking of sorting and then doing binary search. Is that the best way?
I advocate for hashes in such cases: you'll have time proportional to the combined size of both arrays.
Since most major languages offer a hash table in their standard libraries, I hardly need to show you how to implement such a solution.
Iterate through each one and use a hash table to store counts. The key is the value of the integer and the value is the count of appearances.
It depends. If one set is substantially smaller than the other, or for some other reason you expect the intersection to be quite sparse, then a binary search may be justified. Otherwise, it's probably easiest to step through both at once. If the current element in one is smaller than in the other, advance to the next item in that array. When/if you get to equal elements, you send that as output and advance to the next item in both arrays. (This assumes that, as you advocated, you've already sorted both, of course.)
This is an O(N+M) operation, where N is the size of one array and M the size of the other. Using a binary search, you get O(N log2 M) instead, which can be lower complexity if one array is a lot smaller than the other, but is likely to be a net loss if they're close to the same size.
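The step-through-both-at-once approach for two already sorted arrays can be sketched like this (the function name is invented for the example):

```python
def intersect_sorted(a, b):
    """Return elements common to two sorted lists in O(N + M) time."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1              # a is behind, advance it
        elif a[i] > b[j]:
            j += 1              # b is behind, advance it
        else:
            common.append(a[i])  # equal: output and advance both
            i += 1
            j += 1
    return common
```

For example, `intersect_sorted([1, 3, 4, 7, 9], [2, 3, 7, 8])` yields `[3, 7]`. If both inputs contain the same value more than once, it is reported once per matching pair.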
Depending on what you need/want, the versions that attempt to just count occurrences can cause a pretty substantial problem: if there are multiple occurrences of a single item in one array, they will still count that as two occurrences of that item, indicating an intersection that doesn't really exist. You can prevent this, but doing so renders the job somewhat less trivial -- you insert items from one array into your hash table, but always set the count to 1. When that's finished, you process the second array by setting the count to 2 if and only if the item is already present in the table.
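A small sketch of that safeguard for two arrays: the first pass always stores a count of 1, and the second pass promotes it to 2 only when the item is already in the table. The function name is made up for illustration.

```python
def intersect_two(first, second):
    """Intersect two unsorted lists without being fooled by duplicates."""
    seen = {}
    for item in first:
        seen[item] = 1            # always 1, however often it repeats
    common = []
    for item in second:
        if seen.get(item) == 1:   # present in first, not yet reported
            seen[item] = 2
            common.append(item)
    return common
```

With this, `intersect_two([5, 5], [1])` is empty, whereas a naive "count reached 2" check would wrongly report 5.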
Define "best".
If you want to do it fast, you can do it O(n) by iterating through each array and keeping a count for each unique element. Details of how to count the unique elements depend on the alphabet of things that can be in the array, eg, is it sparse or dense?
Note that this is O(n) in the number of arrays, but O(nm) overall for n arrays of length m.
The best way is probably to hash all the values and keep a count of occurrences, culling all that have not occurred i times when you examine array i where i = {1, 2, ..., n}. Unfortunately, no deterministic algorithm can get you less than an O(n*m) running time, since it's impossible to do this without examining all the values in all the arrays if they're unsorted.
A faster algorithm would need to either accept some probability of error (Monte Carlo), or rely on some known condition of the lists to examine only a subset of elements (i.e. you only care about elements that have occurred in all i-1 previous lists when considering the ith list, but in an unsorted list it's non-trivial to search for elements).
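The count-and-cull idea for n arrays might be sketched as follows. Each array is deduplicated with `set()` before counting, which is an implementation choice made here so that repeats within one array cannot inflate the count; the function name is invented.

```python
def common_elements(arrays):
    """Return the set of values that appear in every array."""
    if not arrays:
        return set()
    counts = {value: 1 for value in arrays[0]}
    for i, array in enumerate(arrays[1:], start=2):
        for value in set(array):          # ignore repeats within one array
            if counts.get(value) == i - 1:
                counts[value] = i
        # Cull everything that has not been seen in all i arrays so far.
        counts = {v: c for v, c in counts.items() if c == i}
    return set(counts)
```

Every element of every array is examined once, matching the O(n*m) bound, and the table shrinks as soon as a candidate is missing from any array.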
