I've been trying to solve this for some hours now, but something is still missing. Maybe I'm approaching it the wrong way, but I think it is a fairly complex problem:
I have three lists with items in a fixed order. To explain the problem, assume they contain items A to Z, mostly in the same order, with some exceptions where items are in different positions. Only one list contains all items; the others contain subsets and are missing certain items. A solution for this case would be sufficient, but ideally it would also work when no list contains all items and the sets only partly overlap. Even better would be an algorithm that handles more than three lists.
So here's the example:
List 1: A B C D E F G H I J
List 2: A C D B F G
List 3: B C D E H F G
What I want is to align these three lists so that I can see where the sort order differs and which items are missing. So the result should be:
List 1: A B C D E F G H I J
List 2: A C D B F G
List 3: B C D E H F G
That way I can see immediately that List 2 has B in the wrong position and that A is missing from List 3, which also has H in the wrong position.
I was thinking about storing the result in a CSV to import into Excel. So the rows are:
A,A,
B,,B
C,C,C
...
Now my question is: how do I match the lists that way to generate the CSV output? The language I use is Java. So far I have been stuck on the case where a list other than the reference list contains an item earlier than it appears in the reference list.
This is by the way a real-world problem.
Any suggestions are appreciated.
There are off-the-shelf tools for solving this problem, such as the Unix tool diff3. Trying to solve it for arbitrary numbers of lists is not advisable unless you are willing to invest a lot of time in developing heuristics, as you are then dealing with the NP-hard general case of the longest common subsequence problem.
If I understand your question correctly, you are essentially trying to solve a multiple sequence alignment problem, which is a well-researched topic within bioinformatics. There are several algorithms for it, some of which are based on the concept of Levenshtein distance (which would solve a two-array version of your problem) - I suggest you start there.
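As a starting point for the two-list case, here is a minimal sketch of an alignment based on the longest common subsequence. It is written in Python for brevity rather than Java, and the function name `lcs_align` and the row format are assumptions made for illustration, not part of any library. Items outside the common subsequence get a row of their own, so a moved item such as B appears twice: once as missing and once as extra.

```python
def lcs_align(ref, other):
    """Align two sequences on their longest common subsequence.

    Returns a list of (ref_item, other_item) rows; an empty string
    marks a gap on that side.
    """
    n, m = len(ref), len(other)
    # dp[i][j] = length of the LCS of ref[i:] and other[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ref[i] == other[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    rows = []
    i = j = 0
    while i < n and j < m:
        if ref[i] == other[j]:
            rows.append((ref[i], other[j]))   # item is in order in both lists
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            rows.append((ref[i], ""))         # missing (or moved) in other
            i += 1
        else:
            rows.append(("", other[j]))       # extra (or moved) in other
            j += 1
    rows.extend((x, "") for x in ref[i:])
    rows.extend(("", y) for y in other[j:])
    return rows
```

Each row can then be written out with `",".join(row)` to get CSV lines. Extending this to three or more lists means aligning each further list against the merged result, which is where the heuristics mentioned above come in.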
I have homework and am very new to R. I skipped an introductory course, which I now realise was maybe not a good idea, so I just need help with these questions.
Use the “mtcars” dataset in R to answer the following questions. For each question you
need to provide the exact R code used in RStudio. You should provide a copy of the plots.
(make a screenshot of the R plot)
a) Create a scatterplot of any chosen variables with an appropriate title and labels. (you
need to add the R code here)
b) Use a loop structure to show 4 different plots with different colours and symbols.
All 4 plots should appear in one window.
No idea where to even begin
I was asked an interesting question in an interview recently.
You have 1 million users
Each user has 1 thousand friends
Your system should efficiently answer the question "Do I know him?" for any pair of users. A user "knows" another one if they are connected through 6 levels of friends.
E.g. A is friend of B, B is a friend of C, C is friend of D, D is a friend of E, E is a friend of F. So we can say that, A knows F.
Obviously you can't solve this problem efficiently using BFS or another standard traversal technique. The question is: how do you store this data structure in a DB, and how do you perform this search quickly?
What's wrong with BFS?
Execute three steps of BFS from the first node, marking accessible users with flag 1. This takes on the order of 10^9 steps (1000^3, with a thousand friends per user).
Execute three steps of BFS from the second node, marking accessible users by flag 2. If we meet mark 1 - bingo.
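The two half-depth searches described above can be sketched as follows. This is only an illustration in Python: the adjacency dictionary `graph`, the function names, and the `half_depth` parameter are assumptions, and a real system would read the friend lists from the database rather than from memory.

```python
def reachable_within(graph, start, max_steps):
    """Return every user reachable from start in at most max_steps hops."""
    seen = {start}
    frontier = {start}
    for _ in range(max_steps):
        next_frontier = set()
        for user in frontier:
            for friend in graph.get(user, ()):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.add(friend)
        frontier = next_frontier
        if not frontier:
            break
    return seen


def knows(graph, a, b, half_depth=3):
    """True if a and b are joined by a path of at most 2 * half_depth hops."""
    side_a = reachable_within(graph, a, half_depth)   # "flag 1" users
    if b in side_a:
        return True
    side_b = reachable_within(graph, b, half_depth)   # "flag 2" users
    # Any user carrying both flags lies on a short enough path.
    return not side_a.isdisjoint(side_b)
```

Searching three levels from each side touches roughly 2 * 1000^3 users in the worst case, instead of 1000^6 for a single six-level search.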
What about storing the data as 1 million x 1 million matrix A where A[i][j] is the minimum number of steps to reach from user i to user j. Then you can query it almost instantly. The update however is more costly.
I'm looking for a non-VBA solution to this problem.
Say I have a graph (in the computer science sense) in a spreadsheet as follows:
     A        B           C        D
1    Vertex   Neighbors   Degree   Avg Nghbr Deg
2    A        B,C         2        2.5
3    B        A,C         2        2.5
4    C        A,B,D       3        1.666666667
5    D        C           1        3
I've entered columns C and D by hand but I want them to be calculated automatically. I've found reasonable solutions for column C that essentially count the commas and add 1. But for column D, I can't find a solution. I've found countless articles that explain how to lookup one value multiple times in one column, and countless articles that explain how to look up multiple values once in multiple columns, but I can't figure out how to look up multiple values in ONE column, get back an array of values, and then take the average of that array. I'm sure this can be done in VBA but I'd prefer a native Excel solution if one exists.
Obviously I'd like to extend this so that I can do other analyses of a vertex's neighbors. Presumably once I know the method to analyze a "looked-up array" I will be able to use it in other functions as well.
Any help is greatly appreciated.
To get column C:
=LEN(B2)-LEN(SUBSTITUTE(B2,",",""))+1
To get column D use SUMPRODUCT with SEARCH:
=SUMPRODUCT((ISNUMBER(SEARCH("," & $A$2:$A$5 & ",","," & B2 & ",")))*$C$2:$C$5)/C2
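To sanity-check the two formulas outside Excel, the same calculation can be sketched in Python. This is only a cross-check, not a replacement for the native formulas; the dictionary `neighbors` simply mirrors columns A and B above, and the helper names are made up for illustration.

```python
# Columns A and B of the sheet: vertex -> comma-separated neighbor list.
neighbors = {"A": "B,C", "B": "A,C", "C": "A,B,D", "D": "C"}


def degree(vertex):
    """Column C: count the commas and add 1."""
    return neighbors[vertex].count(",") + 1


def avg_neighbor_degree(vertex):
    """Column D: sum the degrees of the listed neighbors, divide by the count."""
    names = neighbors[vertex].split(",")
    return sum(degree(name) for name in names) / len(names)
```

Note that the Excel version wraps both the vertex name and the neighbor list in commas before calling SEARCH, so a vertex named "A" cannot accidentally match inside a longer name such as "AB".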
I desperately need help understanding Boyce-Codd normal form and finding candidate keys.
I found a link here http://djitz.com/neu-mscs/how-to-find-candidate-keys/ which I understood for the most part, but I get stuck.
E.g.
(A B C D E F)
A B → C D E
B C D → A
B C E → A D
B D → E
As far as I understand from the link, you find the attributes common to all left sides, which is only B, and the attributes common to all right sides, of which there are none.
Where do I go from here? I know all candidate keys will have B in them, but I need guidance on finding the candidate keys after that. Could someone explain in simple language?
The linked article isn't written particularly well. (That's an observation, not a criticism. The author's first language isn't English.) I'll try to rewrite the algorithm. This isn't me telling you how to do this. It's my interpretation of how the original author is telling you to do this.
1. Identify the attributes that are on neither the left side nor right side of any FD.
2. Identify the attributes that are only on the right side of any FD.
3. Identify the attributes that are only on the left side of any FD.
4. Combine the attributes from steps 1 and 3.
5. Compute the closure of the attributes from step 4. If the closure comprises all the attributes, then the attributes from step 4 make up the only candidate key. (No matter how many candidate keys there are, every one of them must contain these attributes.)
6. Identify the attributes not included in step 4 and step 2.
7. Compute the closure of the attributes from step 4 plus every possible combination of attributes from step 6.
So for the FDs you posted, you'd end up with this.
1. {F}
2. {}
3. {B}
4. {BF}
5. The closure of {BF} is {BF}. That's not all the attributes. (But every candidate key must contain {BF}.)
6. {ACDE}
7. Compute the closure of these sets of attributes.
{ABF}
{CBF}
{DBF}
{EBF}
{ACBF}
{ADBF}
{AEBF}
{CDBF}
{CEBF}
{DEBF}
{ACDBF}
{ACEBF}
{ADEBF}
{CDEBF}
{ACDEBF}
If I got those combinations right, every candidate key will be found among the possibilities in step 7. In your example, there are 3 candidate keys.
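For checking the result by machine, here is a small brute-force sketch in Python. It does not follow the seven steps literally: it simply tries attribute sets in order of increasing size and keeps those whose closure covers the whole relation, skipping supersets of keys already found. The names `closure` and `candidate_keys` are my own, and the FDs are the ones from the question.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: keep applying FDs until nothing new is added."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    """Return all minimal attribute sets whose closure is the full relation."""
    all_attrs = set(all_attrs)
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(frozenset(candidate))
    return keys


# FDs from the question, written as (left side, right side).
FDS = [("AB", "CDE"), ("BCD", "A"), ("BCE", "AD"), ("BD", "E")]
KEYS = sorted("".join(sorted(key)) for key in candidate_keys("ABCDEF", FDS))
```

Here `KEYS` comes out as ABF, BCDF and BCEF, which are the three candidate keys, and each of them contains {BF} as expected.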
http://www.sroede.nl/projects/fdhelper.aspx
This would help: just put in your relation and FDs,
then click Generate at the bottom.
The problem: "Consider a relation R with five attributes ABCDE. You are given the following dependencies:
A->B
BC->E
ED->A
List all the keys for R."
The teacher gave us the keys, which are ACD, BCD and CDE,
and we need to show the work to get to them.
The first two I solved.
For BCD, I applied transitivity to dependencies 2 and 3 to get (BC->E)D->A => BCD->A.
For ACD, I did the same with dependency 1 and the result above (BCD->A) to get (A->B)CD->A => ACD->A.
But I can't figure out how to get CDE.
It seems I did it wrong. After googling, I found this answer:
methodology to find keys:
consider attribute sets α containing: a. the determinant attributes of F (i.e. A, BC,
ED) and b. the attributes NOT contained in the determined ones (i.e. C,D). Then
do the attribute closure algorithm:
if α+ superset R then α -> R
Three keys: CDE, ACD, BCD
Source
From what I can tell, since C and D are not on the right side of any dependency, the keys are the left sides with CD prepended to them. Can anyone explain in more detail why this works?
To get the keys, you start with one of the dependencies and use inference to extend the set.
Let me have a go in simple English; you can easily find the formal definitions on the net.
e.g. start with 3).
ED -> A
(knowing E and D, I know A)
ED ->AB
(knowing E and D, I know A, by knowing A, I know B as well)
Still, C cannot be known, and I have now used all the rules except BC->E, which needs C on the left-hand side.
So I add C to the left-hand side, i.e.
CDE ->AB
so, by knowing C,D and E, you will know A and B as well,
Hence CDE is a key for your relation ABCDE. You repeat the same process, starting with other rules until exhausted.
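The "keep adding what you now know" process above is the attribute closure, and it can be sketched in a few lines of Python. The function name `closure_with_steps` is my own, and the list of applied rules is only there to mirror the reasoning in this answer.

```python
def closure_with_steps(start, fds):
    """Compute the closure of start and record which rules were applied."""
    known = set(start)
    steps = []
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # A rule fires when its whole left side is known
            # and it still adds something new.
            if set(lhs) <= known and not set(rhs) <= known:
                known |= set(rhs)
                steps.append(f"{lhs}->{rhs}")
                changed = True
    return known, steps


# The three dependencies from the question.
FDS = [("A", "B"), ("BC", "E"), ("ED", "A")]

ED_KNOWN, ED_STEPS = closure_with_steps("ED", FDS)     # C is never reached
CDE_KNOWN, CDE_STEPS = closure_with_steps("CDE", FDS)  # reaches all of ABCDE
```

Starting from ED the closure stops at {A, B, D, E}, so ED alone is not a key. Starting from CDE it covers ABCDE, and since no two-attribute subset of CDE does, CDE is a key.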