Maximal Number of Functional Dependencies (including trivial) - database

In a relationship R of N attributes how many functional dependencies are there (including trivial)?
I know that trivial dependencies are those where the right hand side is a subset of the left hand side but I'm not sure how to calculate the upper bound on the dependencies.
Any information regarding the answer and the approach to it would be greatly appreciated.
-

The maximum possible number of functional dependencies is
the number of possible left-hand sides * the number of possible right-hand sides
We're including trivial functional dependencies, so the number of possible left-hand sides equals the number of possible right-hand sides. So this simplifies to
(the number of possible left-hand sides)2
Let's say you have R{∅AB}. There are three attributes.1 The number of possible left-hand sides is
combinations of 3 attributes taken 1 at a time, plus
combinations of 3 attributes taken 2 at a time, plus
combinations of 3 attributes taken 3 at a time
which equals 3+3+1, or 7. So there are at most 72 possible functional dependencies for any R having three attributes: 49. The order of attributes doesn't matter, so we use formulas for combinations, not for permutations.
If you start with R{∅ABC}, you have
combinations of 4 attributes taken 1 at a time, plus
combinations of 4 attributes taken 2 at a time, plus
combinations of 4 attributes taken 3 at a time, plus
combinations of 4 attributes taken 4 at a time
which equals 4+6+4+1, or 15. So there are at most 152 possible functional dependencies for any R having four attributes: 225.
Once you know this formula, these calculations are simple using a spreadsheet. It's also pretty easy to write a program to generate every possible functional dependency using a scripting language like Ruby or Python.
The Wikipedia article on combinations has examples of how to count the combinations, with and without using factorials.
All possible combinations from R{∅AB}:
A->A A->B A->∅ A->AB A->A∅ A->B∅ A->AB∅
B->A B->B B->∅ B->AB B->A∅ B->B∅ B->AB∅
∅->A ∅->B ∅->∅ ∅->AB ∅->A∅ ∅->B∅ ∅->AB∅
AB->A AB->B AB->∅ AB->AB AB->A∅ AB->B∅ AB->AB∅
A∅->A A∅->B A∅->∅ A∅->AB A∅->A∅ A∅->B∅ A∅->AB∅
B∅->A B∅->B B∅->∅ B∅->AB B∅->A∅ B∅->B∅ B∅->AB∅
AB∅->A AB∅->B AB∅->∅ AB∅->AB AB∅->A∅ AB∅->B∅ AB∅->AB∅
Most people ignore the empty set. They'd say R{∅AB} has only two attributes, A and B, and they'd write it as R{AB}.

Related

DFA of two simple languages and then product construction of those two languages

The language below is the intersection of two simpler languages. First, identify the simpler languages and give the state diagrams of the DFAs that recognize them. Then, use the product construction to build a DFA that recognizes the language specified below; give its state diagram before and after simplification if there are any unneeded states or states that can be combined.
Language: {w is a member of {0,1}* | w contains an odd number of 0s and the sum of its 0s and 1s is equal to 1}
This is my proposed solution: https://imgur.com/a/lh5Hwfr Should the bottom two states be connected with 0s?
Also, what would be the DFA if it was OR instead of AND?
Here's a drawing I hope will help understand how to do this:
Language A is "odd number of zeros". States are labeled Z0 and Z1 indicating even or odd number of zeros.
Language B is "exactly one one" (which is equivalent to "sum of digits equals one"). States are labeled I0, I1 and I2 indicaing zero, one or more ones.
Language A+B can be interpreted as A∩B (ignoring the dotted circles) or AUB (counting the dotted circles). If building A∩B, states Z0I2 and Z1I2 can be joined together.
I hope this gives not only an answer to the exact problem in the question, but also an idea how to build similar answers for similar problems.

Given a sorted array find number of all unique pairs which sum is greater than X

Is there any solution which can be achieved with complexity better than
O(n2)? Linear complexity would be best but O(nlogn) is also great.
I've tried to use binary search for each element which would be O(nlogn) but Im always missing something.
For example, given the sorted array: 2 8 8 9 10 20 21
and the number X=18
[2,20][2,21][8,20][8,21][8,20][8,21][9,10][9,20][9,21][10,20][10,21][20,21]
The function should return 12.
For your problem, for any given set of numbers, there may be some which, included with any other number, are guaranteed to be a solution because in themselves they are greater than the target amount for the summing.
Identify those. Any combination of those is a solution. With your example-data, 21 and 20 satisfy this, so all combinations of 21, which is six, and all remaining combinations of 20, which is five, satisfy your requirement (11 results so far).
Treat your sorted array now as a subset, which excludes the above set (if non-empty).
From your new subset, find all numbers which can be included with another number in that subset to satisfy your requirement. Starting from the highest (for "complexity" it may not matter, but for ease of coding, knowing early that you have none often helps) number in your subset, identify any, remembering that there need to be at least two such numbers for any more results to be added to those identified.
Take you data in successive adjacent pairs. Once you reach of sum which does not meet your requirement, you know you have found the bottom limit for your table.
With this sequence:
1 2 12 13 14 15 16 17 18 19
19 already meets the criteria. Subset is of nine numbers. Take 18 and 17. Meets the criteria. Take 17 and 16. Meets. 16 and 15. Meets... 13 and 12. Meets. 12 and 2. Does not meet. So 12 is the lowest value in the range, and the number of items in the range are seven.
This process yields 10 and 9 with your original data. One more combination, giving your 12 combinations that are sought.
Determining the number of combinations should be abstracted out, as it is simple to calculate and, depending on actual situation, perhaps faster to pre-calculate.
A good rule-of-thumb is that if you, yourself, can get the answer by looking at the data, it is pretty easy for a computer. Doing it the same way is a good starting point, though not always the end.
Find the subset of the highest elements which would cause the sum to be greater than the desired value when paired with any other member of the array.
Taking the subset which excludes those, use the highest element and work backwards to find the lowest element which gives you success. Have that as your third subset.
Your combinations are all those of the entire group for members of the first subset, plus solely those combinations of the third subset.
You are the student, you'll have to work out how complex that is.

How to find pattern groups in boolean array?

Given a 2D array of Boolean values I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.
In the example below green cells represent "true" bits, greys are "false". Pattern 1 contains cols 1,3,4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2,3,4.
Business idea behind this is finding similarity patterns among various groups of social network users. In real world number of rows can go up to 3E7, and the number of columns up to 300.
Can't really figure out a solution other than brute force matching.
Please advice the proper name of the problem, so I could read more, or advice an elegant solution.
This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u \in A and v \in B whenever the cell at row u, column v is green.
Although you say that you want to find all patterns, you probably only want to find only maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)
But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.

How to normalize multiple array of different size in matlab

I use set of images for image processing in which each image generates unique code (Freeman chain code). The size of array for each image varies. However the value ranges from 0 to 7. For e.g. First image creates array of 3124 elements. Second image creates array of 1800 elements.
Now for further processing, I need a fixed size of those array. So, is there any way to Normalize it ?
There is a reason why you are getting different sized arrays when applying a chain code algorithm to different images. This is because the contours that represent each shape are completely different. For example, the letter C and D will most likely contain chain codes that are of a different length because you are describing a shape as a chain of values from a starting position. The values ranging from 0-7 simply tell you which direction you need to look next given the current position of where you're looking in the shape. Usually, chain codes have the following convention:
3 2 1
4 x 0
5 6 7
0 means to move to the east, 1 means to move north east, 2 means to move north and so on. Therefore, if we had the following contour:
o o x
o
o o o
With the starting position at x, the chain code would be:
4 4 6 6 0 0
Chain codes encode how we should trace the perimeter of an object given a starting position. Now, what you are asking is whether or not we can take two different contours with different shapes and represent them using the same number of values that represent their chain code. You can't because of the varying length of the chain code.
tl;dr
In general, you can't. The different sized arrays mean that the contours that are represented by those chain codes are of different lengths. What you are actually asking is whether or not you can represent two different and unrelated contours / chain codes with the same amount of elements.... and the short answer is no.
What you need to think about is why you want to try and do this? Are you trying to compare the shapes between different contours? If you are, then doing chain codes is not the best way to do that due to how sensitive chain codes are with respect to how the contour changes. Adding the slightest bit of noise would result in an entirely different chain code.
Instead, you should investigate shape similarity measures instead. An authoritative paper by Remco Veltkamp talks about different shape similarity measures for the purposes of shape retrieval. See here: http://www.staff.science.uu.nl/~kreve101/asci/smi2001.pdf . Measures such as the Hausdorff distance, Minkowski distance... or even simple moments are some of the most popular measures that are used.

Algorithm - How to select one number from each column in an array so that their sum is as close as possible to a particular value

I have an m x n matrix of real numbers. I want to choose a single value from each column such that the sum of my selected values is as close as possible to a pre-specified total.
I am not an experienced programmer (although I have an experienced friend who will help). I would like to achieve this using Matlab, Mathematica or c++ (MySQL if necessary).
The code only needs to run a few times, once every few days - it does not necessarily need to be optimised. I will have 16 columns and about 12 rows.
Normally I would suggest dynamic programming, but there are a few features of this situation suggesting an alternative approach. First, the performance demands are light; this program will be run only a couple times, and it doesn't sound as though a running time on the order of hours would be a problem. Second, the matrix is fairly small. Third, the matrix contains real numbers, so it would be necessary to round and then do a somewhat sophisticated search to ensure that the optimal possibility was not missed.
Instead, I'm going to suggest the following semi-brute-force approach. 12**16 ~ 1.8e17, the total number of possible choices, is too many, but 12**9 ~ 5.2e9 is doable with brute force, and 12**7 ~ 3.6e7 fits comfortably in memory. Compute all possible choices for the first seven columns. Sort these possibilities by total. For each possible choice for the last nine columns, use an efficient search algorithm to find the best mate among the first seven. (If you have a lot of memory, you could try eight and eight.)
I would attempt a first implementation in C++, using std::sort and std::lower_bound from the <algorithm> standard header. Measure it; if it's too slow, then try an in-memory B+-tree (does Boost have one?).
I spent some more time thinking about how to implement what I wrote above in the simplest way possible. Here's an approach that will work well for a 12 by 16 matrix on a 64-bit machine with roughly 4 GB of memory.
The number of choices for the first eight columns is 12**8. Each choice is represented by a 4-byte integer between 0 and 12**8 - 1. To decode a choice index i, the row for the first column is given by i % 12. Update i /= 12;. The row for the second column now is given by i % 12, et cetera.
A vector holding all choices requires roughly 12**8 * 4 bytes, or about 1.6 GB. Two such vectors require 3.2 GB. Prepare one for the first eight columns and one for the last eight. Sort them by sum of the entries that they indicate. Use saddleback search to find the best combination. (Initialize an iterator into the first vector and a reverse iterator into the second. While neither iterator is at its end, compare the current combination against the current best and update the current best if necessary. If the current combination sums to than the target, increment the first iterator. If the sum is greater than the target, increment the second iterator.)
I would estimate that this requires less than 50 lines of C++.
Without knowing the range of values that might fill the arrays, how about something generic like this:
divide the target by the number of remaining columns.
Pick the number from that column closest to that value.
Repeat from 1. Until each column picked.

Resources