How to validate two data sets coming from an algorithm to check its effectiveness - dataset

I have two data sets:
1st (AKA "OLD") [smaller - just a sample]:
Origin  | Alg.Result | Score
Star123 | Star123    | 100
Star234 | Star200    | 90
Star421 | Star420    | 98
Star578 | Star570    | 95
...     | ...        | ...
2nd (AKA "NEW") [bigger - uses all the real data]:
Origin  | Alg.Result | Score
Star123 | Star120    | 90
Star234 | Star234    | 100
Star421 | Star423    | 98
Star578 | Star570    | 95
...     | ...        | ...
Those DFs are the results of two different algorithms; let's call them "OLD" and "NEW".
The logic of both algorithms is the following:
each takes a value from one table (represented in the column 'Origin') and tries to match it against a value from a different table (the outcome is represented in the column Alg.Result). It also calculates a score for the match based on some internal logic (column Score).
Additional important information:
The 1st DF (OLD) is a smaller sample.
The 2nd DF (NEW) is a bigger sample.
The values in Origin are the same for both datasets, except that the OLD dataset contains fewer of them than the NEW one.
A value in Alg.Result can:
be exactly the same as in Origin,
be similar to it,
or be something completely different.
In the solution where these algorithms are used, a threshold is applied to the Score: a match is accepted when Score > 90. The threshold is the same for OLD and NEW.
What I want to achieve:
Validate how accurate the new approach ("NEW") is at matching values with the Origin values, i.e. how accurate the new algorithm is.
Find the discrepancies between the OLD and NEW sets:
which cases OLD has that NEW doesn't have,
which cases NEW has that OLD doesn't have.
What kind of comparison would you do to achieve those goals?
I thought about checking:
True positive => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score == 100
False positive => by taking the NEW dataset with the condition NEW.Origin != NEW.Alg.Result and NEW.Score == 100
False negative => by taking the NEW dataset with the condition NEW.Origin == NEW.Alg.Result and NEW.Score != 100
I don't see the point of counting true negatives if the algorithm always generates some match; I'm not sure what that would even look like.
What else would you suggest? How would you compare the OLD and NEW values? Do you have any ideas?
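To make the comparison concrete, here is a minimal pandas sketch, assuming both DataFrames carry the columns Origin, Alg.Result and Score (the small frames below just mirror the sample rows above and stand in for the real data). It computes the exact-match rate of NEW, the Origins accepted by only one of the two algorithms under the Score > 90 threshold, and a side-by-side view of where the two algorithms disagree:

import pandas as pd

# stand-ins for the real OLD and NEW result sets
old = pd.DataFrame({"Origin":     ["Star123", "Star234", "Star421", "Star578"],
                    "Alg.Result": ["Star123", "Star200", "Star420", "Star570"],
                    "Score":      [100, 90, 98, 95]})
new = pd.DataFrame({"Origin":     ["Star123", "Star234", "Star421", "Star578"],
                    "Alg.Result": ["Star120", "Star234", "Star423", "Star570"],
                    "Score":      [90, 100, 98, 95]})

# 1. accuracy of NEW: share of rows where the algorithm reproduced the Origin value
new["exact_match"] = new["Origin"] == new["Alg.Result"]
accuracy = new["exact_match"].mean()

# 2. discrepancies under the Score > 90 acceptance threshold
old_hits = set(old.loc[old["Score"] > 90, "Origin"])
new_hits = set(new.loc[new["Score"] > 90, "Origin"])
only_in_old = old_hits - new_hits   # cases OLD accepts that NEW does not
only_in_new = new_hits - old_hits   # cases NEW accepts that OLD does not

# 3. side-by-side comparison on the Origins the two sets share
side_by_side = old.merge(new, on="Origin", suffixes=("_old", "_new"))
disagree = side_by_side[side_by_side["Alg.Result_old"] != side_by_side["Alg.Result_new"]]

print(f"NEW exact-match rate: {accuracy:.0%}")
print("Accepted only by OLD:", only_in_old)
print("Accepted only by NEW:", only_in_new)
print(disagree)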

Related

Using cells as an output table of a while loop in Octave

So I'm implementing a while loop in my code that just does some simple calculations. The thing is that I want an output that not only shows the final values but all of them from each step. The best I could do was to use cell arrays with the following code:
i = 1; p = (a+b)/2;
valores = cell(n, 3);
while (i <= n && f(p) != 0)
  if f(a)*f(p) < 0
    a = a; b = p;
  else
    a = p; b = b;
  endif
  i = i+1; p = (a+b)/2;
  valores(i, :) = {i-1 p f(p)};
  fprintf('%d %d %d \n', valores{i, :});
endwhile
An example output would be:
1 1.25 -1.40998
2 1.125 -0.60908
3 1.0625 -0.266982
4 1.03125 -0.111148
5 1.01562 -0.0370029
But I have two main issues with this method. The first is that I couldn't find a way to get some text as a title in the first line, so I have to explain what each column is in a sentence later. The second is that I don't know how to make all the columns keep the same distance from each other, instead of each piece of text keeping the same distance from the previous one. I assume this last issue has something to do with the way I used the fprintf line, since I'm not too familiar with it.
In case it helps to understand what I want to get from this algorithm, I'm trying to calculate the root of a function with the bisection method. And sorry if this was too long or unclear, feel free to give me advice, I'm kinda new here :)
An open-source package called Tablicious can take care of cell, row, and column alignment. Using print statements and whitespace gets tedious and leads to unmaintainable code.
Tablicious is a package for GNU Octave that provides relational data structures for Octave. It includes implementations of table arrays, datetime, string, categorical, and some other related stuff. You can think of it as “pandas for Octave”.
Installation
pkg install https://github.com/apjanke/octave-tablicious/releases/download/v0.3.6/tablicious-0.3.6.tar.gz
Example
pkg load tablicious
Forename = {"Tom"; "Dick"; "Harry"};
Age = [21; 63; 38];
Salary = {"$1"; "$2"; "$3"};
tab = table(Forename, Age, Salary);
prettyprint (tab)
Result
---------------------------
| Forename | Age | Salary |
---------------------------
| Tom      |  21 | $1     |
| Dick     |  63 | $2     |
| Harry    |  38 | $3     |
---------------------------
Documentation can be found on the Tablicious GitHub project.
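If you would rather stay with plain fprintf, the usual fix for both issues is to give every column a fixed width in the format string and to print a header line using the same widths. The sketch below uses Python's printf-style formatting only to illustrate the idea; the width specifiers (%6s, %6d, %12.6f, %14.6g and so on) carry over to Octave's fprintf, and the column widths are just assumed values:

# a header row printed with the same fixed widths as the data rows keeps columns aligned
print('%6s %12s %14s' % ('step', 'p', 'f(p)'))
rows = [(1, 1.25, -1.40998), (2, 1.125, -0.60908), (3, 1.0625, -0.266982)]
for i, p, fp in rows:
    print('%6d %12.6f %14.6g' % (i, p, fp))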

Use the TQuery.Locate() function to find a match other than the first

Locate moves the cursor to the first row matching a specified set of search criteria.
Let's say that q is a TQuery component connected to a database table with two columns, TAG and TAGTEXT. With the following code I get the letter a, but I would like to use the Locate() function to get the letter d.
if q.Locate('TAG', '1', [loPartialKey]) then
begin
  tag60 := q.FieldByName('TAGTEXT');
end;
For example, if I have a table like this:
+-----+---------+
| TAG | TAGTEXT |
+-----+---------+
|  1  |    a    |
|  2  |    b    |
|  3  |    c    |
|  1  |    d    |
|  4  |    e    |
|  1  |    f    |
+-----+---------+
is it possible to locate the second occurrence of the number 1 in the table?
EDIT
My job is to find a particular occurrence of TAG with value 1 (which occurrence I need depends on a parameter I receive). I need to iterate through the table and collect the values from all the TAGTEXT fields until the value in the TAG field is 1 again. The number 1 in this case marks the start of a new segment, and everything between two 1s belongs to one segment. The segments do not have to contain the same number of rows. Also, I am not allowed to make any changes to the table.
What I thought I could do is create a counter variable that is increased by one every time a TAG with the value 1 appears. When the counter equals the parameter that represents the occurrence, I know that I am in the right segment, and I can iterate through that segment and get the values I need.
But this might be a slow solution, and I wanted to know if there is a faster one.
You need to be a bit wary of using Locate for a purpose like this, because some TDataSet descendants' implementations of Locate (or the underlying db-access layer) construct a temporary index on the dataset, which may be discarded immediately afterwards, so repeatedly calling Locate to iterate the rows of a given segment may be a lot less efficient than one might expect.
Also, TClientDataSet constructs, uses and then discards an expression parser for each invocation of Locate (in its internal call to LocateRecord), which is a lot of overhead for repeated calls, especially when they are entirely avoidable.
In any case, the best way to do this is to ensure that your table records which segment a given row belongs to, adding a column like the SegmentID below if your table does not already have one:
+-----+---------+-----------+
| TAG | TAGTEXT | SegmentID |
+-----+---------+-----------+
|  1  |    a    |     1     |
|  2  |    b    |     1     |
|  3  |    c    |     1     |
|  1  |    d    |     2     |  // btw, what happened to the 2 missing rows after this one?
|  4  |    e    |     2     |
|  1  |    f    |     3     |
+-----+---------+-----------+
Then, you could use code like this to iterate the rows of a segment:
procedure IterateSegment(Query : TSomeTypeOfQueryComponent; SegmentID : Integer);
var
  Sql : String;
begin
  Sql := Format('select * from mytable where SegmentID = %d order by Tag', [SegmentID]);
  if Query.Active then
    Query.Close;
  Query.Sql.Text := Sql;
  Query.Open;
  Query.DisableControls;
  try
    while not Query.Eof do begin
      // process row here
      Query.Next;
    end;
  finally
    Query.EnableControls;
  end;
end;
Once you have the SegmentID column in the table, if you don't want to open a new query to iterate a block, you can set up a local index (by SegmentID then Tag), assuming your dataset type supports it, set a filter on the dataset to restrict it to a given SegmentID, and then iterate over it.
You have several options for doing this.
If your component doesn't provide a LocateNext, you can write your own LocateNext function: compare the value and call Next until you find the next match.
You can also issue the SQL with an ORDER BY, use Locate for the first value, and then test whether the next row matches the comparison.
If you use a TClientDataSet you can filter via the component's Filter property, or set IndexFieldNames to order the values instead of the ORDER BY in the SQL from the previous suggestion.
You can also filter it in the SQL WHERE clause.

Excel Lookup IP addresses in multiple ranges

I am trying to find a formula for column A that will check the IP address in column B and determine whether it falls into a range (i.e. between two addresses) given in the two other columns, C and D.
E.g.
   A          B             C             D
+--------+-------------+-------------+------------+
| valid? | address     | start       | end        |
+--------+-------------+-------------+------------+
| yes    | 10.1.1.5    | 10.1.1.0    | 10.1.1.31  |
| Yes    | 10.1.3.13   | 10.1.2.16   | 10.1.2.31  |
| no     | 10.1.2.7    | 10.1.1.128  | 10.1.1.223 |
| no     | 10.1.1.62   | 10.1.3.0    | 10.1.3.127 |
| yes    | 10.1.1.9    | 10.1.4.0    | 10.1.4.255 |
| no     | 10.1.1.50   | …           | …          |
| yes    | 10.1.1.200  |             |            |
+--------+-------------+-------------+------------+
This represents an Excel table with 4 columns, a heading row, and 7 data rows as an example.
I can do a lateral check with
=IF(AND((B3>C3),(B3 < D3)),"yes","no")
which only checks 1 address against the range next to it.
I need something that will check one IP address against all of the ranges, i.e. rows 1 to 100.
This is for checking access-list rules against routes to see if I can eliminate redundant rules, but it has other uses if I can get it going.
To make it extra special, I can not use VBA macros to get it done.
I'm thinking of some kind of INDEX/MATCH to look it up in an array, but I'm not sure how to apply it, or whether it can even be done. Good luck.
Ok, so I've been tracking this problem since my initial comment, but had not taken the time to answer because, just like Lana B:
I like a good puzzle, but it's not a good use of time if I have to keep guessing
+1 to Lana for her patience and effort on this question.
However, IP addressing is something I deal with regularly, so I decided to tackle this one for my own benefit. Also, no offense, but taking the MIN of the starts and the MAX of the ends is wrong: it does not account for gaps in the IP white-list. As I mentioned, this required 15 helper columns, and my result is simply 1 or 0, corresponding to In or Out respectively. The formulas for the helper columns are listed below.
The formulas in F2:J2 are:
=NUMBERVALUE(MID(B2,1,FIND(".",B2)-1))
=NUMBERVALUE(MID(B2,FIND(".",B2)+1,FIND(".",B2,FIND(".",B2)+1)-1-FIND(".",B2)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2)+1)+1,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)-1-FIND(".",B2,FIND(".",B2)+1)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)+1,LEN(B2)))
=F2*256^3+G2*256^2+H2*256+I2
Yes, I used formulas instead of "Text to Columns" to automate the process of adding more information to a "living" worksheet.
The formulas in L2:P2 are the same, but replace B2 with C2.
The formulas in R2:V2 are also the same, but replace B2 with D2.
The formula for X2 is
=SUMPRODUCT(--($P$2:$P$8<=J2)*--($V$2:$V$8>=J2))
I also copied your original "valid" set in column A, which you'll see matches my result.
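For anyone who wants to sanity-check the arithmetic outside Excel, here is a small Python sketch of the same idea: convert the dotted quad to a 32-bit integer (the equivalent of the F2*256^3+G2*256^2+H2*256+I2 formula) and then test it against every start/end pair (the equivalent of the SUMPRODUCT row). The white-list below is made up purely for illustration:

def ip_to_int(ip):
    # 10.1.1.5 -> 10*256**3 + 1*256**2 + 1*256 + 5
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

ranges = [("10.1.1.0", "10.1.1.31"), ("10.1.2.16", "10.1.2.31")]   # made-up white-list
ranges_int = [(ip_to_int(lo), ip_to_int(hi)) for lo, hi in ranges]

def in_any_range(ip):
    value = ip_to_int(ip)
    # mirrors SUMPRODUCT(--(start <= value) * --(end >= value)) being non-zero
    return any(lo <= value <= hi for lo, hi in ranges_int)

print(in_any_range("10.1.1.5"))    # True:  inside 10.1.1.0 - 10.1.1.31
print(in_any_range("10.1.3.13"))   # False: not in any listed range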
You will need helper columns.
Organise your data as follows:
Split address, start and end into columns on the "." delimiter (ribbon menu Data => Text To Columns).
Above the start/end parts, calculate MIN FOR START and MAX FOR END over all the split parts (e.g. MIN(K5:K1000)).
FORMULAS:
VALIDITY formula - copy into cell D5, and drag down:
=IF(AND(B6>$I$1,B6<$O$1),"In",
IF(OR(B6<$I$1,B6>$O$1),"Out",
IF(B6=$I$1,
IF(C6<$J$1, "Out",
IF( C6>$J$1, "In",
IF( D6<$K$1, "Out",
IF( D6>$K$1, "In",
IF(E6>=$L$1, "In", "Out"))))),
IF(B6=$O$1,
IF(C6>$P$1, "Out",
IF( C6<$P$1, "In",
IF( D6>$Q$1, "Out",
IF( D6<$Q$1, "In",
IF(E6<=$R$1, "In", "Out") )))) )
)))

Loop across many datasets to get one summary table

I have about 100 datasets in Stata. I want to loop across all of them to get one summary table for the proportion of people across all datasets who are taking a drug aceinhib. I can write code which produces a table for each dataset, but what I want is a summary of all these tables in one table.
Here is an example using just 5 datasets:
forval i=1/5 {
    capture use "FILEADDRESS\FILENAME`i'", clear
    table aceinhib
    capture save "FILEADDRESS\NEW_FILENAME`i'", replace
}
This gives me:
----------------------
aceinhib | Freq.
----------+-----------
0 | 1578935
1 | 138,961
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 5671774
1 | 421,732
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 2350391
1 | 198,875
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 884,660
1 | 51,087
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 1470388
1 | 130,614
----------------------
What I want is:
----------------------
aceinhib | Freq.
----------+-----------
0 | 11956148
1 | 941269
----------------------
-- namely, the combined results of the 5 tables above.
Consider this pattern:
scalar a = 0
scalar b = 0
quietly forval i = 1/1000 {
    sysuse auto, clear
    count if foreign
    scalar a = scalar(a) + r(N)
    count if !foreign
    scalar b = scalar(b) + r(N)
}
gen double count = cond(_n == 1, scalar(a), cond(_n == 2, scalar(b), .))
gen which = cond(_n == 1, "Foreign", cond(_n == 2, "Domestic", ""))
list which count in 1/2
Just cumulate the counts from one file to the next. For the real problem, don't read in the same dataset repeatedly; read different files in the loop.
Perhaps this will point you in a useful direction.
clear
tempfile working
save `working', emptyok
forval i=1/5 {
    quietly use "FILEADDRESS\FILENAME`i'", clear
    * replace "somevariable" with the name of a variable that is never missing
    collapse (count) N=somevariable, by(aceinhib)
    append using `working'
    quietly save `working', replace
}
use `working', clear
collapse (sum) N, by(aceinhib)
list
If all files have the same structure, you could append them into one file before your table command. The following solutions also rely on aceinhib being coded as 0/1. If the files are not too large to append, it could be as simple as:
use "FILEADDRESS\FILENAME1", clear
forvalues i = 2/100 {
    append using "FILEADDRESS\FILENAME`i'"
}
table aceinhib
If the resulting data file from append is too large, and there are no weights involved, you may continue as you have and employ the replace option for table:
forvalues i = 1/100 {
    use "FILENAME`i'", clear
    table aceinhib, replace
    rename table1 freq
    save "NEW_FILENAME`i'"
}
use "NEW_FILENAME1", clear
forvalues i = 2/100 {
    append using "NEW_FILENAME`i'"
}
collapse (sum) freq, by(aceinhib)
list
Note that this approach will create data files containing the individual frequency tables. A third approach relies on storing the results of tab into a matrix for each iteration of the loop, and adding them to another matrix to store the cumulative freq of 0/1 values for aceinhib in each dataset:
mat aceinhib = (0\0)
forvalues i = 1/100 {
    use "FILENAME`i'", clear
    tab aceinhib, matcell(aceinhib`i')
    mat aceinhib = aceinhib + aceinhib`i'
}
mat list aceinhib
This is how I would approach the problem, although there may be cleaner solutions leveraging user written packages or other base Stata functionality that I haven't included here.

Folder searching algorithm

Not sure if this is the usual sort of question that gets asked around here, or if I'll get any answers to this one, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files.
I have a set of folders, structured as follows:
+-make_1/
| +--model_1/
| +-default_version/
| | +--1999
| | +--2000
| | | +--image_01.jpg
| | | +--image_02.jpg
| | | +--image_03.jpg
| | | ...
| | +--2001
| | +--2002
| | +--2003
| | ...
| | +--2009
| +--version_1/
| | +--1999
| | ...
| | +--2009
| +--version_2/
| | +--1999
| | +--2000
| | +--2001
| | | +--image_04.jpg
| | | +--image_05.jpg
| | | +--image_06.jpg
| | | ...
| | +--2002
| | +--2003
| | | +--image_07.jpg
| | | +--image_08.jpg
| | | +--image_09.jpg
| | ...
| | +--2009
... ... ...
In essence, it represents possible images for vehicles, by year, starting in 1999.
Makes and models (e.g. Make: Alfa Romeo, Model: 145) come in various trims or versions. Each trim, or version, may be found in a number of vehicles which look the same but differ in, say, fuel type or engine capacity.
To save duplication, the folder structure above makes use of a default folder... and images appear for the default version from 2000 onwards. I need to produce the links table for each version, based on whether the versions have their own overriding images or whether they make use of the default version's...
So, for example, version_1 has no image files, so I need to make links to the default images, starting in 2000 and continuing until 2009.
Version_2, on the other hand, starts out using the default images in 2000, but then uses two new sets: first for 2001-2002, and then for 2003-2009. The list of links required is therefore...
version    start  end    file_name
=========  =====  =====  ============
version_1  2000   2009   image_01.jpg
version_1  2000   2009   image_02.jpg
version_1  2000   2009   image_03.jpg
...
version_2  2000   2001   image_01.jpg
version_2  2000   2001   image_02.jpg
version_2  2000   2001   image_03.jpg
version_2  2001   2003   image_04.jpg
version_2  2001   2003   image_05.jpg
version_2  2001   2003   image_06.jpg
version_2  2003   2009   image_07.jpg
version_2  2003   2009   image_08.jpg
version_2  2003   2009   image_09.jpg
...
(Default is just that, a placeholder, and no links are required for it.)
At the moment I'm running through the folders, building arrays, and then trimming the fat at the end. I was just wondering if there is a shortcut, using some sort of text-processing approach? There are about 45,000 folders, most of which are empty :-)
Here's some Python pseudocode, pretty close to executable (it needs a def for a writerow function that will do the actual writing, be it to an intermediate file, DB, CSV, whatever):
import os

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (only for non-empty years)
tree = dict()
top = 'make_1/model_1'
for root, dirs, files in os.walk(top):
    if root == top:
        continue  # skip the model directory itself
    head, tail = os.path.split(root)
    if dirs:
        # here, tail is a version
        tree[tail] = dict()
    elif files:
        # here, tail is a year: store it as an int so we can do arithmetic on it
        tree[os.path.basename(head)][int(tail)] = files
# now specialcase default_version
default_version = tree.pop('default_version')
# determine range of years; rule is quite asymmetrical:
# for min, only years with files in them count
min_year = min(d for d in default_version if default_version[d])
# for max, all years count, even if empty
max_year = max(default_version)
for version, years in tree.iteritems():
    current_files = default_version[min_year]
    years[max_year + 1] = None  # sentinel: one past the last year
    y = min_year
    while years:
        next_change = min(years)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change-1, f)
        y = next_change
        current_files = years.pop(y)
One ambiguity in the spec and example is whether it's possible for the default_version to change the set of files in some years - here, I'm assuming that doesn't happen (only specific versions change that way, the default version always carries one set of files).
If this is not the case, what happens if the default version changes in years (say) 1999 and 2003, and version1 changes in 2001 and 2005 -- what files should version 1 use for 03 and 04, the new ones in the default version, or those it specified in 01?
In the most complicated version of the spec (where both default_version and a specific one can change, with the most recent change taking precedence, and if both specific and default change in the same year then the specific taking precedence) one needs to get all the "next change year" sequence, for each specific version, by careful "priority merging" of the sequences of change years for default and specific version, instead of just using years (the sequence of changes in the specific version) as I do here -- and each change year placed in the sequence must be associated with the appropriate set of files of course.
So if the exact spec can please be expressed, down to the corner cases, I can show how to do the needed merging by modifying this pseudocode -- I'd rather not do the work until the exact specs are clarified, because, if the specs are indeed simpler, the work would be unneeded!-)
Edit: as a new comment clarified, the exact spec is indeed the most complex one, so we have to do the merging appropriately. The loop at the end of the simplistic answer above changes to:
for version, years_dict in tree.iteritems():
    # have years_dict override default_version when coincident
    merged = dict(default_version, **years_dict)
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change-1, f)
        y = next_change
        current_files = merged.pop(y)
The key change is the merged = dict(... line: in Python, this means to make merged a new dict (a dict is a generic mapping, would be typically called a hashmap in other languages) which is the sum, or merge, of default_version and years_dict, but when a key is present in both of those, the value from years_dict takes precedence -- which meets the key condition for a year that's present (i.e., is a year with a change in files) in both.
After that it's plain sailing: anydict.pop(somekey) returns the value corresponding to the key (and also removes it from anydict); min(anydict) returns the minimum key in the dictionary. Note the "sentinel" idiom at merged[max_year + 1] = None: this says that the year "one after the max one" is always deemed to be a change-year (with a dummy placeholder value of None), so that the last set of rows is always written properly (with a maximum year of max_year + 1 - 1, that is, exactly max_year, as desired).
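As a tiny illustration of that precedence rule (with made-up years and file names), and noting that on Python 3 the dict(a, **b) trick only accepts string keyword keys, the copy-and-update spelling below is the portable equivalent:

default_version = {2000: ['image_01.jpg'], 2003: ['image_02.jpg']}
years_dict = {2003: ['image_07.jpg']}   # the specific version changes its files in 2003

merged = dict(default_version)   # start from the default version's change-years
merged.update(years_dict)        # the specific version wins on coincident years
print(merged)                    # {2000: ['image_01.jpg'], 2003: ['image_07.jpg']}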
This algorithm is not maximally efficient, just simplest! We're doing min(merged) over and over, making it O(N squared) -- I think we can afford that because each merged should have a few dozen change-years at most, but a purist would wince. We can of course present an O(N logN) solution -- just sort the years once and for all and walk that sequence to get the successive values for next_change. Just for completeness...:
default_version[max_year + 1] = None
for version, years_dict in tree.iteritems():
    merged = dict(default_version, **years_dict)
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change-1, f)
        y = next_change
Here sorted gives a list with the keys of merged in sorted order, and I've switched to the for statement to walk that list from beginning to end (with an if statement to output nothing the first time through). The sentinel is now put in default_version (so it's outside the loop, for another slight optimization). It's funny to see that this optimized version (essentially because it works at a slightly higher level of abstraction) turns out to be smaller and simpler than the previous ones;-).
