Bayesian Network Probability For Child Node - artificial-intelligence

Given the following Bayesian network, determine the probabilities.
On the network shown in Figure 1, suppose that:
P("alternator broken"=true) = 0.02
P("no charging"=true | "alternator broken"=true) = 0.95
P("no charging"=true | "alternator broken"=false) = 0.01.
What is P("no charging"=false)? How is it derived?
How would you go about determining "no charging" without having information about "fanbelt broken"?
Would the following be true:
P("no charging"=false) =
P("alternator broken"=true) * P("no charging"=true | "alternator broken"=true) + P("alternator broken"= false) * P("no charging"=true | "alternator broken"= false)

It's not possible
To calculate P("no charging") for the given BN, you are missing the prior for "fanbelt broken". The CPT for "no charging" is also underspecified, because "no charging" depends on "fanbelt broken".
But you might want to
The best you can do with the information you have is to simply ignore "fanbelt broken". If the values for P("no charging" | "alternator broken") were obtained by taking the correct expectation over "fanbelt broken", then the result is correct. If that is the case, "fanbelt broken" has already been eliminated (summed out), and its influence is incorporated into the CPT for "no charging".
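Under that assumption, the answer follows from the law of total probability. A quick check in Python, using only the figures given above:

p_ab = 0.02               # P("alternator broken" = true)
p_nc_given_ab = 0.95      # P("no charging" = true | "alternator broken" = true)
p_nc_given_not_ab = 0.01  # P("no charging" = true | "alternator broken" = false)

# sum over both states of "alternator broken"
p_nc = p_ab * p_nc_given_ab + (1 - p_ab) * p_nc_given_not_ab
print(round(p_nc, 4))      # 0.0288
print(round(1 - p_nc, 4))  # 0.9712 = P("no charging" = false)

Note that the expression proposed in the question computes P("no charging"=true); taking its complement gives the =false value, 1 - 0.0288 = 0.9712.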

Related

How to validate two data sets coming from an algorithm to check its effectiveness

I have two data sets:
1st (AKA "OLD") [smaller - just a sample]:
Origin  | Alg.Result | Score
Star123 | Star123    | 100
Star234 | Star200    | 90
Star421 | Star420    | 98
Star578 | Star570    | 95
...     | ...        | ...
2nd (AKA "NEW") [bigger - uses all real data]:
Origin  | Alg.Result | Score
Star123 | Star120    | 90
Star234 | Star234    | 100
Star421 | Star423    | 98
Star578 | Star570    | 95
...     | ...        | ...
Those DFs are the results of two different algorithms; let's call them "OLD" and "NEW".
The logic of both algorithms is as follows:
each takes a value from some table (represented in the column 'Origin') and tries to match this value against some different table (outcome represented in the column 'Alg.Result'). It also calculates a score for the match based on some internal logic (column 'Score').
Additional important information:
The 1st DF ("OLD") is the smaller sample.
The 2nd DF ("NEW") is the bigger sample.
Values in Origin are the same for both datasets, except that the OLD dataset has fewer of them than the NEW set.
Values in Alg.Result can:
be exactly the same as in Origin
be similar
be something else entirely
In the solution where those algorithms are used, matches are accepted based on a SCORE threshold; for OLD it's Score > 90, and for NEW it's the same.
What I want to achieve:
Determine how accurate the new algorithm is:
validate how accurately the new approach ("NEW") matches values against the Origin values.
Determine the discrepancies between the OLD and NEW sets:
which cases OLD has that NEW doesn't have
which cases NEW has that OLD doesn't have
What kind of comparison would you do to achieve those goals?
I thought about checking:
True positive => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score == 100
False positive => by taking the NEW dataset with the condition NEW.Origin != NEW.Alg.Result and NEW.Score == 100
False negative => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score != 100
I don't see the point of counting true negatives, since the algorithm always generates some match; I'm not sure what a true negative would even look like.
What else would you suggest? How would you compare the OLD and NEW values? Do you have any ideas?
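For concreteness, here is a minimal pandas sketch of the checks proposed above. It assumes the two datasets are loaded as DataFrames named old and new with the columns shown; the names and CSV paths are placeholders, not from the question:

import pandas as pd

# old = pd.read_csv("old.csv")  # columns: Origin, Alg.Result, Score
# new = pd.read_csv("new.csv")

def classify(df):
    # the proposed conditions, expressed as boolean masks
    exact = df["Origin"] == df["Alg.Result"]
    perfect = df["Score"] == 100
    return {"TP": (exact & perfect).sum(),
            "FP": (~exact & perfect).sum(),
            "FN": (exact & ~perfect).sum()}

# discrepancies, keyed on Origin: cases present in one set but not the other
# only_in_old = old[~old["Origin"].isin(new["Origin"])]
# only_in_new = new[~new["Origin"].isin(old["Origin"])]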

Using cells as output table of while loop in octave

So I'm implementing a while loop in my code that just does some simple calculations. The thing is, I want an output that shows not only the final values but all of them, from each step. The best I could do was use cell arrays, with the following code:
i = 1; p = (a+b)/2;
valores = cell(n, 3);
while (i <= n && f(p) != 0)
  if f(a)*f(p) < 0
    b = p;
  else
    a = p;
  endif
  i = i+1; p = (a+b)/2;
  valores(i, :) = {i-1 p f(p)};
  fprintf('%d %d %d \n', valores{i, :});
endwhile
An example output would be:
1 1.25 -1.40998
2 1.125 -0.60908
3 1.0625 -0.266982
4 1.03125 -0.111148
5 1.01562 -0.0370029
But I have two main issues with this method. The first is that I couldn't find a way to get some text as a title in the first line, so I have to explain what each column means in a sentence later. The second is that I don't know how to make all the columns stay at the same distance from each other, instead of each piece of text staying at the same distance. I assume this last issue has something to do with the way I used the fprintf line, since I'm not too familiar with it.
In case it helps to understand what I want to get from this algorithm, I'm trying to calculate the root of a function with the bisection method. And sorry if this was too long or unclear; feel free to give me advice, I'm kinda new here :)
An open-source package called Tablicious can take care of cell, row, and column alignment. Using print statements and whitespace gets tedious and leads to unmaintainable code.
Tablicious is a package for GNU Octave that provides relational data structures for Octave. It includes implementations of table arrays, datetime, string, categorical, and some other related stuff. You can think of it as “pandas for Octave”.
Installation
pkg install https://github.com/apjanke/octave-tablicious/releases/download/v0.3.6/tablicious-0.3.6.tar.gz
Example
pkg load tablicious
Forename = {"Tom"; "Dick"; "Harry"};
Age = [21; 63; 38];
Salary = {"$1"; "$2"; "$3"};
tab = table(Forename, Age, Salary);
prettyprint (tab)
Result
---------------------------
| Forename | Age | Salary |
---------------------------
| Tom      | 21  | $1     |
| Dick     | 63  | $2     |
| Harry    | 38  | $3     |
---------------------------
Documentation can be found on the project's GitHub page.
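Since the answer describes Tablicious as "pandas for Octave", here is the same table in actual pandas (Python), purely for comparison:

import pandas as pd

tab = pd.DataFrame({"Forename": ["Tom", "Dick", "Harry"],
                    "Age": [21, 63, 38],
                    "Salary": ["$1", "$2", "$3"]})
print(tab.to_string(index=False))  # prints an aligned text table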

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|--------|--------|----|----|----
1     |DateA1| X    | Y    | X    |DateB1.1|DateB1.2|DateB1.3| 1  | 0  | 1
2     |DateA2| X    | Y    | X    |DateB2.1|DateB2.2|DateB2.3| 1  | 0  | 1
3     |DateA3| Y    | Z    | X    |DateB3.1|DateB3.2|DateB3.3| 0  | 0  | 1
4     |DateA4| Z    | Z    | Z    |DateB4.1|DateB4.2|DateB4.3| 0  | 0  | 0
For each case, there are linked variables, i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable 1); Drug.2 is linked with DateB.2 and IV.2; etc.
The variable IV.1 = 1 only if Drug.1 is the drug I want to analyze (in this example, each receipt of Drug "X"), and likewise for the other IV variables; otherwise IV = 0 when the drug in that slot is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA2_B2.1 = DateA2 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA2_B2.3 = DateA2 - DateB2.3
DateDiffA3_B3.3 = DateA3 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT using the IV variables, but I am relatively new to SPSS and its syntax, and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually, there is no need for the vector and the loop; everything you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax will first put a zero in the count variable N2W. Then it will loop through all the dates and, only where the matching IV is 1, compare them to DateA, adding 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs are 1, then instead of compute N2W=0. start the syntax with:
If any(1, IV.1 to IV.3) N2W=0.
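To sanity-check the logic outside SPSS, here is the same count as a small Python sketch (the data layout is hypothetical: one case with parallel lists for the linked DateB and IV variables):

from datetime import date

# hypothetical case: one DateA plus linked lists of DateB dates and IV flags
case = {
    "DateA": date(2020, 1, 1),
    "DateB": [date(2020, 1, 5), date(2020, 3, 1), date(2020, 1, 10)],
    "IV":    [1, 0, 1],
}

# count instances where the linked IV is 1 and the gap is <= 14 days,
# mirroring the DO REPEAT above (abs() makes the order of the two dates
# irrelevant; drop it to mirror SPSS's signed datediff exactly)
n2w = sum(1 for b, iv in zip(case["DateB"], case["IV"])
          if iv == 1 and abs((case["DateA"] - b).days) <= 14)
print(n2w)  # 2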

Split array into chunks based on timestamp in Haskell

I have an array of records (a custom data type) in Haskell which I want to aggregate based on each record's timestamp. In very general terms, each record looks like this:
data Record = Record { event :: String,
                       time  :: Double,
                       from  :: Int,
                       to    :: Int
                     } deriving (Show, Eq)
I used a Double for the timestamp since that is the same format used in the tracefile.
And I parse them from a CSV file into an array of records: [Record]
Now I'm looking to get an approximation of instantaneous events per unit time. So I want to split the array into several arrays based on the timestamp (say, every 1 second) and then fold across each smaller array.
The problem is I can't figure out how to split an array based on the value of a record. Looking on Hoogle I found several functions like splitEvery and splitWhen, but I'm lost. I considered using splitWhen to break up the list when, say, (mod time 0.1) == 0, but even if that worked it would remove the elements it splits on (which I don't want).
I should note that the records are NOT evenly spaced in time, i.e. the timestamps of sequential records do not differ by a fixed amount.
I am more than willing to store the data in a different format if you can suggest one that would make this sort of work easier.
A quick sample of the data I'm parsing (from a ns2 simulation):
r 0.114 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.240 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.914 2 1 tcp 1000 ________ 2 5.0 1.0 0 3
If you have [Record] and you want to group them by a specific condition, you can use Data.List.groupBy. I'm assuming that for your time :: Double, 1 second is the base unit (so time = 1 is 1 second, time = 100 is 100 seconds, etc.); adjust this to whatever scale you're actually using. Note that groupBy only merges adjacent elements, so this assumes the records are already sorted by time, as they typically are in a trace file:
import Data.List
import Data.Function (on)
isInSameClockSecond :: Record -> Record -> Bool
isInSameClockSecond = (==) `on` (floor . time :: Record -> Integer)
-- The type signature is given for floor . time to remove any ambiguity
-- due to floor's polymorphic type signature.
groupBySameClockSecond :: [Record] -> [[Record]]
groupBySameClockSecond = groupBy isInSameClockSecond
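The same adjacent-grouping idea, sketched in Python for comparison (itertools.groupby has the same only-adjacent semantics as Data.List.groupBy); the records are hypothetical (event, time, from, to) tuples using the timestamps from the sample above, plus one made-up record in the next second:

import itertools
import math

records = [("r", 0.114, 1, 2), ("r", 0.240, 1, 2), ("r", 0.914, 2, 1), ("r", 1.2, 1, 2)]

# group records that fall into the same clock second
by_second = [list(group)
             for _, group in itertools.groupby(records, key=lambda r: math.floor(r[1]))]
print([len(g) for g in by_second])  # events per second: [3, 1]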

Folder searching algorithm

Not sure if this is the usual sort of question that gets asked around here, or if I'll get any answers to this one, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files.
I have a set of folders, structured as follows:
+-make_1/
| +--model_1/
| +-default_version/
| | +--1999
| | +--2000
| | | +--image_01.jpg
| | | +--image_02.jpg
| | | +--image_03.jpg
| | | ...
| | +--2001
| | +--2002
| | +--2003
| | ...
| | +--2009
| +--version_1/
| | +--1999
| | ...
| | +--2009
| +--version_2/
| | +--1999
| | +--2000
| | +--2001
| | | +--image_04.jpg
| | | +--image_05.jpg
| | | +--image_06.jpg
| | | ...
| | +--2002
| | +--2003
| | | +--image_07.jpg
| | | +--image_08.jpg
| | | +--image_09.jpg
| | ...
| | +--2009
... ... ...
In essence, it represents possible images for vehicles, by year starting in 1999.
Makes and models (e.g. Make: Alfa Romeo, Model: 145) come in various trims or versions. Each trim or version may be found in a number of vehicles which look the same but differ in, say, fuel type or engine capacity.
To save duplication, the folder structure above makes use of a default folder, and images appear for the default version from 2000 onwards. I need to produce the links table for each version, based on whether they have their own overriding images or whether they make use of the default version's...
So, for example, version_1 has no image files, so I need to make links to the default images, starting in 2000 and continuing until 2009.
version_2, on the other hand, starts out using the default images in 2000, but then uses two new sets: first for 2001-2002, and then for 2003-2009. The list of links required is therefore...
version    start  end   file_name
=========  =====  ====  ============
version_1  2000   2009  image_01.jpg
version_1  2000   2009  image_02.jpg
version_1  2000   2009  image_03.jpg
...
version_2  2000   2001  image_01.jpg
version_2  2000   2001  image_02.jpg
version_2  2000   2001  image_03.jpg
version_2  2001   2003  image_04.jpg
version_2  2001   2003  image_05.jpg
version_2  2001   2003  image_06.jpg
version_2  2003   2009  image_07.jpg
version_2  2003   2009  image_08.jpg
version_2  2003   2009  image_09.jpg
...
(Default is just that - a place holder, and no links are required for it.)
At the moment I'm running through the folders, building arrays, and then trimming the fat at the end. I was just wondering if there was a short cut, using some sort of text-processing approach? There are about 45,000 folders, most of which are empty :-)
Here's some Python pseudocode, pretty close to executable (it still needs a def for a writerow function that will do the actual writing -- be it to an intermediate file, DB, CSV, whatever):
import os

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (only for non-empty years)
tree = dict()
for root, dirs, files in os.walk('make_1/model_1'):
    head, tail = os.path.split(root)
    if dirs:
        # here, tail is a version
        tree[tail] = dict()
    elif files:
        # here, tail is a year folder name; store it as an int so the
        # min/max and sentinel arithmetic below work
        tree[os.path.basename(head)][int(tail)] = files

# now special-case default_version
default_version = tree.pop('default_version')

# determine range of years; the rule is quite asymmetrical:
# for min, only years with files in them count
min_year = min(d for d in default_version if default_version[d])
# for max, all years count, even if empty
max_year = max(default_version)

for version, years in tree.items():
    current_files = default_version[min_year]
    # sentinel: pretend a change happens just after the last year
    years[max_year + 1] = None
    y = min_year
    while years:
        next_change = min(years)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change - 1, f)
        y = next_change
        current_files = years.pop(y)
One ambiguity in the spec and example is whether it's possible for the default_version to change the set of files in some years - here, I'm assuming that doesn't happen (only specific versions change that way, the default version always carries one set of files).
If this is not the case, what happens if the default version changes in, say, 1999 and 2003, and version_1 changes in 2001 and 2005 -- what files should version_1 use for 2003 and 2004: the new ones in the default version, or those it specified in 2001?
In the most complicated version of the spec (where both default_version and a specific one can change, with the most recent change taking precedence, and if both the specific and the default change in the same year then the specific taking precedence), one needs to build the whole "next change year" sequence for each specific version by careful "priority merging" of the change-year sequences of the default and the specific version, instead of just using years (the sequence of changes in the specific version) as I do here -- and each change year placed in the sequence must, of course, be associated with the appropriate set of files.
So if the exact spec can please be expressed, down to the corner cases, I can show how to do the needed merging by modifying this pseudocode -- I'd rather not do the work until the exact specs are clarified, because, if the specs are indeed simpler, the work would be unneeded!-)
Edit: as a new comment clarified, the exact spec is indeed the most complex one, so we have to do the merging appropriately. So the loop at the end of the simplistic answer above changes to:
for version, years_dict in tree.items():
    # have years_dict override default_version where they coincide
    merged = dict(default_version)
    merged.update(years_dict)
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change - 1, f)
        y = next_change
        current_files = merged.pop(y)
The key change is the two merged = ... lines: in Python, these build merged as a new dict (a dict is a generic mapping, typically called a hashmap in other languages) which is the merge of default_version and years_dict; where a key is present in both, the value from years_dict takes precedence -- which meets the key condition for a year that's present (i.e., is a year with a change in files) in both.
After that it's plain sailing: anydict.pop(somekey) returns the value corresponding to the key (and also removes it from anydict); min(anydict) returns the minimum key in the dictionary. Note the "sentinel" idiom at merged[max_year + 1] = None: this says that the year "one after the max one" is always deemed to be a change-year (with a dummy placeholder value of None), so that the last set of rows is always written properly (with a maximum year of max_year + 1 - 1, that is, exactly max_year, as desired).
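A tiny illustration of that override behavior, with made-up years and file names:

default_version = {2000: ['a.jpg'], 2003: ['b.jpg']}
years_dict = {2003: ['c.jpg']}

merged = dict(default_version)
merged.update(years_dict)  # years_dict wins where the keys coincide
print(merged)              # {2000: ['a.jpg'], 2003: ['c.jpg']}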
This algorithm is not maximally efficient, just simplest! We're doing min(merged) over and over, making it O(N squared) -- I think we can afford that, because each merged should have a few dozen change-years at most, but a purist would wince. We can of course present an O(N log N) solution: just sort the years once and for all, and walk that sequence to get the successive values of next_change. Just for completeness...:
# the sentinel now goes in default_version, once, outside the loop
default_version[max_year + 1] = None
for version, years_dict in tree.items():
    merged = dict(default_version)
    merged.update(years_dict)
    y = min_year
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change - 1, f)
        y = next_change
Here sorted gives a list of the keys of merged in sorted order, and I've switched to a for statement to walk that list from beginning to end (with an if statement so nothing is output the first time through). The sentinel is now put in default_version, so it's added outside the loop, for another slight optimization. It's funny to see that this optimized version (essentially because it works at a slightly higher level of abstraction) turns out to be smaller and simpler than the previous ones ;-)
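For completeness, a minimal stand-in for the writerow function all three snippets assume (hypothetical; its signature is inferred from the calls above, and you would swap in real DB or CSV output as needed):

def writerow(version, start_year, end_year, file_name):
    # hypothetical sink: emit one link record per (version, year range, image)
    print(f"{version},{start_year},{end_year},{file_name}")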
