Venn diagram notation for all other sets except one - dataset

I'm trying to find a Venn diagram notation that can illustrate data that is only in a single set.
If I can select data from all the other sets, without knowing how many there are, then I can take the intersection of their complements to select the data that is only in the target set.
My current solution looks like this, but it assumes the existence of sets B and C.
I expect the eventual diagram to look like this:

One way to do it would be to use a system based on regions rather than sets. In your case, it would be the region that belongs to set A but does not belong to any other set. You can find the rationale for doing that here. The idea is to express the region as a binary string where 1 means "belongs to set n" and 0 means "does not belong to set n", where n is determined by the ordering of the sets.
In your example, you might define A as the last set, and therefore as the last bit. With three sets CBA, your region would be 001. The nice thing about this is that the leading zeroes can be naturally disregarded: your region would be 1b, no matter how many sets there are (the b is for "binary").
You might even extend the idea by translating the number to another base. For instance, say that you want to express the region of elements belonging to set B only. With the same ordering as before, it would be 010 or 10b. But you can also express it as a decimal number and say "region 2". This expression would be valid if sets A and B exist, independently of the presence of any other set.
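A minimal sketch of the region-numbering idea in Python (the function name region_of and the sample sets are made up for illustration):

# Sketch: encode a Venn region as an integer bit mask.
# Sets are taken in a fixed order; the LAST set maps to the
# LEAST significant bit, so "A only" is region 1 (binary 1)
# no matter how many other sets exist.
def region_of(element, ordered_sets):
    mask = 0
    for i, s in enumerate(reversed(ordered_sets)):
        if element in s:
            mask |= 1 << i
    return mask

A = {1, 2, 3}
B = {2, 4}
C = {3, 5}

# Elements in A only: region 001 in binary, i.e. region 1.
only_a = [x for x in A | B | C if region_of(x, [C, B, A]) == 1]
print(only_a)  # [1]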


Bayesian Network creating conditional probability table (CPT)

I have trouble understanding where the numbers in the P(A|B,E) table come from in the alarm/burglary example. I understand that P(B) and P(E) are chosen from knowledge about the domain. But I do not understand which of the values in the CPT can be chosen freely and which have to be calculated in order to make the tables valid. I assume that P(J|A) and P(J|¬A) are chosen by expert knowledge? And then it must be the same for P(J|M)... or would these also have to be calculated from given values?
In the binary example given in the table on page 7 of https://cseweb.ucsd.edu/~elkan/250A/bayesnets.pdf they use the same numbers, but how did they calculate the values 0.95, 0.94, 0.29 and 0.001?
All the values in CPTs must come from somewhere, and cannot be calculated from other CPTs. There are two major approaches to get the numbers:
1. Have a domain expert specify the numbers.
2. Have a data set that contains joint realizations of the random variables. Then the numbers within the CPTs can be calculated from the respective frequencies within the data set. Note that this procedure becomes more complicated when not all variables are observed within the data set.
In addition, it is possible to mix approaches 1 and 2.
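As a rough sketch of approach 2 in Python (the records below are invented purely for illustration, and all variables are assumed fully observed):

from collections import Counter

# Each record is a joint realization (B, E, A).
data = [(1, 0, 1), (1, 0, 0), (0, 0, 0), (0, 1, 1),
        (0, 0, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0)]

joint = Counter((b, e, a) for b, e, a in data)   # counts of (B, E, A)
parents = Counter((b, e) for b, e, _ in data)    # counts of (B, E)

for (b, e), n in sorted(parents.items()):
    p = joint[(b, e, 1)] / n                     # relative frequency
    print(f"P(A=1 | B={b}, E={e}) = {p:.2f}")

Numbers like 0.95, 0.94, 0.29 and 0.001 in the alarm example are of this kind: either read off from data frequencies as above or supplied by an expert; they cannot be derived from the other CPTs.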

General Big-Data principles for finding pairs of similar objects - "fuzzy inner join"

Firstly, sorry for the vague title and if this question has been asked before, but I was not entirely sure how to phrase it.
I am looking for general design principles for finding pairs of 'similar' objects from two different data sources.
Let's, for simplicity, say that we have two databases, A and B, both containing large volumes of objects, each with a timestamp and a geo-location, along with some other data that we don't care about here.
Now I want to perform a search along these lines:
Within a certain time-frame and location, given as search terms, find pairs of objects from A and B respectively, ordered by some similarity score: for example, some scalar 'time/space distance' function, distance(a,b), that calculates the distance in time and space between the objects.
I am expecting to get a (potentially ginormous) set of results where the first result is a pair of data points which has the minimum 'distance'.
I realize that the full search space is cardinality(A) x cardinality(B).
Are there any general guidelines on how to do this in a reasonably efficient way? I assume that I would need to replicate the two databases into a common repository like Hadoop? But then what? I am not sure how to perform such a query in Hadoop either.
What is this type of query called?
To me, this is some kind of "fuzzy inner join" that I struggle to wrap my head around how to construct, let alone efficiently at scale.
SQL joins don't have to be based on equality. You can use ">", "<", "BETWEEN".
You can even do something like this:
select a.val aval, b.val bval, a.val - b.val diff
from A join B on abs(a.val - b.val) < 100
What you need is a way to divide your objects into buckets in advance, without comparing them (or at least making a linear, rather than square, number of comparisons). That way, at query time, you will only be comparing a small number of items.
There is no "one-size-fits-all" way to bucket your items. In your case the bucketing can be based on time, geolocation, or both. Time-based bucketing is very natural, and can also scales elastically (increase or decrease the bucket size). Geo-clustering buckets can be based on distance from a particular point in space (if the space is abstract), or on some finite division of the space (for example, if you divide the entire Earth's world map into tiles, which can also scale nicely if done right).
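A minimal sketch of this in Python (the bucket sizes, the toy distance function and the neighbor-checking scheme are assumptions, not recommendations):

import math
from collections import defaultdict

TIME_BUCKET = 3600   # seconds; arbitrary choice
TILE = 0.5           # degrees of lat/lon; arbitrary choice

def bucket(obj):
    t, lat, lon = obj
    return (int(t // TIME_BUCKET), int(lat // TILE), int(lon // TILE))

def distance(a, b):
    # Toy time/space distance; a real one would weight the terms.
    return abs(a[0] - b[0]) + math.hypot(a[1] - b[1], a[2] - b[2])

def fuzzy_join(A, B):
    index = defaultdict(list)
    for b in B:
        index[bucket(b)].append(b)            # one linear pass over B
    pairs = []
    for a in A:
        tb, xb, yb = bucket(a)
        # Check the bucket and its immediate neighbors so pairs
        # straddling a bucket edge are not missed.
        for dt in (-1, 0, 1):
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for b in index.get((tb + dt, xb + dx, yb + dy), ()):
                        pairs.append((distance(a, b), a, b))
    return sorted(pairs)                      # smallest distance first

This is roughly what a distributed join on a pre-partitioned key does; in the Hadoop world the bucket key would play the role of the shuffle key.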
A good question to ask is "if my data starts growing rapidly, can I handle it by just adding servers?" If not, you might need to rethink the design.

VDM-SL notation for a single, finite subset

Not sure if this is within the realm of SO but:
Using VDM-SL, I have been looking around for the 'best' way of describing a single, finite subset of ℕ. In my travels I have found several ways that people convey this, but I wonder which is the most accepted.
I initially thought that F(ℕ) would do but I believe that this is the set of finite subsets of ℕ, rather than a single subset.
Would it be enough to say, "Let S be finite: S ⊂ ℕ?"
Or does such a notation exist?
All sets in the VDM languages are finite by definition, so I believe there is no need to specify that part explicitly; see section 3.2.1 of the language manual: http://wiki.overturetool.org/images/c/cb/VDM10_lang_manV2.pdf
Now, to model a type that is a subset of a set s2, one of the ways is to use an invariant on that type, such as "inv s1 == s1 subset s2".

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be entirely loaded into memory. So let's say it has three parts, A, B and C, and each one can be loaded into memory, but not all at the same time.
I also have random entries coming in from time to time, and I can barely tell which part an entry might belong to. So one approach would be to load A first and make a check, then B, then C. But the next entry could belong to B, so I would have to unload C, then load A, then B... Hopefully this makes sense.
This would clearly be very slow, so I wonder: is there a better way to do it (if using a DB is not an alternative)?
I suspect that you don't currently use any criterion to decide whether an entry goes into A, B, or C; in other words, A, B and C are just the result of dividing the whole data set into three equal parts. Am I right? If so, I recommend applying some criterion when you add a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you then know a priori whether you have to search in A, in B, or in C.
If your entries are words, the same solution applies, but the criterion is now the first letter, and it may be better to use not 3 sets but 26, the size of the English alphabet. Note that you still have to keep one of the sets in memory at any time, but you gain one advantage: you perform at most one load/unload operation, and you don't need to check all the sets, because you know which of them can actually contain your value. This idea is widely used in databases under the name partitioning. And if your sets store neither numbers nor words but complex objects, you can still invent some simple criterion.
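A small sketch of that idea in Python (the part count, file layout and pickle storage are made-up illustration; the point is the stable criterion):

import pickle
import zlib

NUM_PARTS = 26   # e.g. one part per letter of the English alphabet

def part_of(entry):
    # Stable criterion deciding which part an entry belongs to.
    # (Python's built-in hash() is salted per process, so CRC32
    # is used here to keep the mapping stable across runs.)
    return zlib.crc32(str(entry).encode()) % NUM_PARTS

def build(entries, path="parts"):
    # Shown naively; stream in chunks if the data does not fit in memory.
    parts = [set() for _ in range(NUM_PARTS)]
    for e in entries:
        parts[part_of(e)].add(e)
    for i, s in enumerate(parts):
        with open(f"{path}/part_{i}.pkl", "wb") as f:
            pickle.dump(s, f)

def contains(entry, path="parts"):
    with open(f"{path}/part_{part_of(entry)}.pkl", "rb") as f:
        subset = pickle.load(f)   # only ONE part is ever loaded
    return entry in subset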

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have:
Account
----------------
123
234
3456
28390
The pseudocode might look like:
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see, like:
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop is done, accounts will look like 123,234,3456,28390, (with a trailing comma). The string is later used to test whether a specific instance exists, like so:
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would one want to store integer values within a string of comma-separated values? In every language I've come across, it's almost always more expensive to perform string-based operations than integer-based ones.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice, or is it just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
What I put in the comment still holds, but my real answer is: it's probably a design decision made for compatibility/portability. In your integer-array case (and at a low enough level of the API) you'd typically find yourself asking questions like: what is a safe guess for the size of an integer on today's machines? What about endianness?
The most portable and most flexible of all data formats always has been, and always will be, printed representation. It may not be as fast to process, but that's where adapters/converters kick in. I wouldn't be surprised to find human-readable printed representations especially in database APIs like the one you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing, and convert it back.
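A sketch of that convert-process-convert-back pattern in Python (illustrative only; the values come from the question):

csv = "123,234,3456,28390,"
numbers = [int(x) for x in csv.split(",") if x]      # to an efficient internal format
numbers.sort()                                       # ...do your processing...
csv_again = ",".join(str(n) for n in numbers) + ","  # back to the printed format
print(csv_again)  # 123,234,3456,28390,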
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access the n-th element of such a collection, but if such random access is not needed, then there's no penalty for it, right?
As far as I know, Oracle DB stores NUMBER values as strings (and, if my memory is correct, DATEs as well) for very practical reasons.
In your specific example, using strings looks like overkill for passing data around without crossing process boundaries. But could it be that the choice of the string data type makes more sense when sending data over the wire or storing it on disk?
