Create a binary matrix where columns equal ratings - logistic-regression

I am trying to run a regression of a list of bonds' values against the credit (S&P) rating of each bond. For that I am trying to create a binary matrix whose columns are the available S&P credit ratings (AAA, AA+, ..., BBB-, etc.). My code takes several hours to run, and I was wondering if there is a faster way to create a binary matrix for the later regression than the code below.
```python
ratg = ['AAA', 'AA+', 'AA', 'AA-', 'A+', 'A', 'A-', 'BBB+', 'BBB', 'BBB-',
        'BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC']
sizefile = len(datafile)
binaryrat = []
for i in range(sizefile):
    s = []
    for k in range(len(ratg)):
        x = datafile['RatingGrp'].iloc[i] == ratg[k]
        s.append(x)
    binaryrat.append(s)
```
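(Not part of the original post: one common vectorized alternative is to one-hot encode the rating column with pandas.get_dummies, which avoids the Python-level double loop entirely; `datafile` and its `RatingGrp` column are taken from the question.)

```python
import pandas as pd

# One-hot encode the rating column in a single vectorized call,
# then reindex against ratg so every rating gets a column even if unused.
dummies = pd.get_dummies(datafile['RatingGrp'])
binaryrat = dummies.reindex(columns=ratg, fill_value=0).to_numpy()
```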

Related

Applying transform_lookup on datasets with different number of rows

I am currently learning Altair's maps feature, and while looking into one of the examples (https://altair-viz.github.io/gallery/airport_connections.html), I noticed that the datasets (airports.csv and flights-airport.csv) have different numbers of rows. Is it possible to apply transform_lookup even if that's the case?
Yes, it is possible to apply transform_lookup to datasets with different numbers of rows. The lookup transform amounts to a one-sided join based on a specified key column: regardless of how many rows each dataset has, for each row of the main dataset, the first match in the lookup data is joined to the data.
A simple example to demonstrate this:
```python
import altair as alt
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'x': [1, 2, 3]
})

df2 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'y': [1, 2, 3, 4]
})

alt.Chart(df1).transform_lookup(
    lookup='key',
    from_=alt.LookupData(df2, key='key', fields=['y'])
).mark_bar().encode(
    x='x:Q',
    y='y:O',
    color='key:N'
)
```
More information is available in the Lookup transform docs.

The matrix created by two arrays does not have expected dimensions

This is my first question and I am totally new to Python, so bear with me!
I am developing some code, and at this step I am trying to create a matrix with 2 rows and a certain number of columns. The first row is one array and the second row is another array of the same length, UaP and UbP, as can be seen in the code below.
As can be seen, UaP and UbP both have shape (1, 400), but when I try to create an array by combining the two, the resulting matrix has dimensions (2, 1, 400) instead of the expected 2 x 400.
I have tried different things but I don't get what I expect. Maybe there is a simple trick to solve it? Thanks in advance.
```python
import numpy as np
# some code here
UaP = 0.5*(Ua - Ub90)
UbP = 0.5*(Ub + Ua90)

# shapes of the arrays
UaP.shape    # (1, 400)
UbP.shape    # (1, 400)

UabP = np.array([UaP, UbP])
UabP.shape   # (2, 1, 400)
```
That's because your arrays have shape (1, 400) instead of (400,).
You could try this:
```python
import numpy as np

UaP = np.random.rand(1, 400)
UbP = np.random.rand(1, 400)

# first solution: drop the leading axis of each array before stacking
UabP = np.array([UaP[0], UbP[0]])
print(UabP.shape)   # (2, 400)

# second solution: stack first, then reshape away the extra axis
UabP = np.array([UaP, UbP])
UabP = UabP.reshape(1, 2, 400)
UabP = UabP[0]
print(UabP.shape)   # (2, 400)
```
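As a side note (not in the original answer), the same (2, 400) result can also be obtained without indexing or reshaping by stacking along the first axis, for example with np.vstack:

```python
import numpy as np

UaP = np.random.rand(1, 400)
UbP = np.random.rand(1, 400)

# vstack concatenates along axis 0, so two (1, 400) arrays become (2, 400)
UabP = np.vstack([UaP, UbP])
print(UabP.shape)   # (2, 400)
```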

Binning then sorting arrays in each bin but keeping their indices together

I have two arrays and the indices of these arrays are related. So x[0] is related to y[0], so they need to stay organized. I have binned the x array into two bins as shown in the code below.
```python
import numpy as np

x = [1, 4, 7, 0, 5]
y = [.1, .7, .6, .8, .3]
binx = [0, 4, 9]
index = np.digitize(x, binx)
```
Giving me the following:
```python
In [1]: index
Out[1]: array([1, 2, 2, 1, 2])
```
So far so good. (I think)
The y array is a parameter telling me how well measured the x data point is (.9 is better than .2), so I'm using the following code to keep the best-measured part of the y array:
```python
y.sort()
ysorted = y[int(len(y) * .5):]
```
which gives me:
```python
In [2]: ysorted
Out[2]: [0.6, 0.7, 0.8]
```
giving me the last 50% of the array. Again, this is what I want.
My question is how do I combine these two operations? From each bin, I need to get the best 50% and put these new values into a new x and new y array. Again, keeping the indices of each array organized. Or is there an easier way to do this? I hope this makes sense.
Many numpy functions have arg... variants that don't operate "by value" but rather "by index". In your case argsort does what you want:
```python
order = np.argsort(y)
# order is an array of indices such that y[order] is sorted
top50 = order[len(order) // 2:]
top50x = np.asarray(x)[top50]   # np.asarray so fancy indexing works even if x is a plain list
# now top50x are the x values corresponding 1-to-1 to the best 50% of y
```
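The snippet above does not show the per-bin part of the question; a minimal sketch (not from the original answer) of one way to apply the same argsort selection inside each bin returned by np.digitize, reusing x, y, and binx from the question:

```python
import numpy as np

x = np.array([1, 4, 7, 0, 5])
y = np.array([.1, .7, .6, .8, .3])
binx = [0, 4, 9]
index = np.digitize(x, binx)

new_x, new_y = [], []
for b in np.unique(index):
    in_bin = np.where(index == b)[0]       # indices of the points in this bin
    order = in_bin[np.argsort(y[in_bin])]  # same indices, ordered by y
    best = order[len(order) // 2:]         # keep the best-measured half of the bin
    new_x.extend(x[best])
    new_y.extend(y[best])

# new_x[i] and new_y[i] stay paired, bin by bin
print(new_x, new_y)
```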
You should make a list of pairs from your x and y lists
It can be achieved with the zip function:
```python
x = [1, 4, 7, 0, 5]
y = [.1, .7, .6, .8, .3]
values = list(zip(x, y))   # wrap in list() because zip returns an iterator in Python 3
values
# [(1, 0.1), (4, 0.7), (7, 0.6), (0, 0.8), (5, 0.3)]
```
To sort such a list of pairs by a specific element of each pair you may use sort's key parameter:
```python
values.sort(key=lambda pair: pair[1])
values
# [(1, 0.1), (5, 0.3), (7, 0.6), (4, 0.7), (0, 0.8)]
```
Then you may do whatever you want with this sorted list of pairs.
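For example (not part of the original answer), a short continuation of the snippet above that keeps the best half of the pairs and splits them back into separate lists:

```python
# take the best-measured half of the sorted pairs,
# then unzip them back into paired x and y lists
best = values[len(values) // 2:]
new_x, new_y = map(list, zip(*best))
print(new_x)  # [7, 4, 0]
print(new_y)  # [0.6, 0.7, 0.8]
```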

Finding an element of a structure based on a field value

I have a 1x10 structure array with plenty of fields and I would like to remove from the struct array the element with a specific value on one of the field variables.
I know the value I'm looking for and the field I should be looking in, and I also know how to delete the element from the struct array once I find it. The question is how (if possible) to elegantly identify it without a brute-force solution, i.e. a for-loop that goes through the elements of the struct array and compares them with the value I'm looking for.
Sample code: buyers is a 1x10 struct array with fields id, n, Budget, and the value to find among the id values is, for example, id_test = 12.
You can use the fact that if you have an array of structs, and you use the dot referencing, this creates a comma-separated list. If you enclose this in [] it will attempt to create an array and if you enclose it in {} it will be coerced into a cell array.
```matlab
a(1).value = 1;
a(2).value = 2;
a(3).value = 3;

% Into an array
[a.value]
%   1   2   3

% Into a cell array
{a.value}
%   [1]   [2]   [3]
```
So to do your comparison, you can convert the field you care about into either an array or a cell array and compare against that. The comparison then yields a logical array which you can use to index into the original structure.
For example
```matlab
% Some example data
s = struct('id', {1, 2, 3}, 'n', {'a', 'b', 'c'}, 'Budget', {100, 200, 300});

% Remove all entries with id == 2
s = s([s.id] ~= 2);

% Remove entries that have an id of 2 or 3
s = s(~ismember([s.id], [2 3]));

% Find ones with an `n` of 'a' (uses a cell array since the values are strings)
s = s(ismember({s.n}, 'a'));
```

Algorithm balanced K-D tree with O(kn log n)

I tried to implement a balanced k-d tree in O(kn log n): I used k presorted arrays (one sorted array per coordinate) to get O(kn log n), and the median to get a balanced tree.
The problem I faced is that the median value at some level, for example the median for the x axis, may be chosen again at a subsequent level, for example for the y axis.
I tried to solve this by dividing the y-sorted array into two arrays using the chosen x value as a pivot, but this way wouldn't yield a balanced tree.
Any idea how to get a balanced k-d tree in O(kn log n)?
EDIT
Quoted from wiki
https://en.wikipedia.org/wiki/K-d_tree
Alternative algorithms for building a balanced k-d tree presort the
data prior to building the tree. They then maintain the order of the
presort during tree construction and hence eliminate the costly step
of finding the median at each level of subdivision. Two such
algorithms build a balanced k-d tree to sort triangles in order to
improve the execution time of ray tracing for three-dimensional
computer graphics. These algorithms presort n triangles prior to
building the k-d tree, then build the tree in O(n log n) time in the
best case.[5][6] An algorithm that builds a balanced k-d tree to sort
points has a worst-case complexity of O(kn log n).[7] This algorithm
presorts n points in each of k dimensions using an O(n log n) sort
such as Heapsort or Mergesort prior to building the tree. It then
maintains the order of these k presorts during tree construction and
thereby avoids finding the median at each level of subdivision.
Could anyone provide the algorithm described above?
EDIT
I came up with a way, but it doesn't work if the median value on a specific axis is duplicated.
For example:
x1 = [ (0, 7), (1, 3), (3, 0), (3, 1), (6, 2) ]
y1 = [ (3, 0), (3, 1), (6, 2), (1, 3), (0, 7) ]
The median on the x axis is 3.
So when we want to split y1 into the arrays y11 and y12, we have to use > and < to distribute the y array left and right, with the pivot as the delimiter.
There is no guarantee this is correct if the median on a specific axis is duplicated.
Consider the partition on the x axis (there is no problem with the x1 array); continuing the example above for the first partition step:
```
median = (3, 0)
pivot = 3   // the median of the x axis
y11 = [], y12 = []
for (i = 0; i < y1.size; i++)
    if (y1[i].getX() < pivot)
        y11.add(y1[i])
    else if (y1[i].getX() > pivot)
        y12.add(y1[i])
```
This results in y11 = [ (1, 3), (0, 7) ] and y12 = [ (6, 2) ]: the duplicate (3, 1) is dropped because it is neither smaller nor greater than the pivot.
Any idea how to handle such a case?
Or is there another presorting k-d tree algorithm in O(kn log n)?
To elaborate on my comment (and Anony-Mousse's answer, probably):
The key idea with pre-sorting in constructing k-d trees is to keep the order during the split. The overhead looks quite high; a comparative benchmark against re-sorting (and k-select) seems in order.
Some proof-of principle Java source code:
```java
package net.*.coder.greybeard.sandbox;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** finger exercise: pre-sorting & split for KD-tree construction
 *  (re. https://stackoverflow.com/q/35225509/3789665) */
public class KDPreSort {

    /** K-dimensional key, dimensions fixed
     *  by number of coordinates in construction */
    static class KKey {
        public static KKey[] NONE = {};
        final Comparable[] coordinates;

        public KKey(Comparable... coordinates) {
            this.coordinates = coordinates;
        }

        /** @return {@code Comparator<KKey>} for coordinate {@code n} */
        static Comparator<KKey> comparator(int n) { // could be cached
            return new Comparator<KDPreSort.KKey>() {
                @Override
                public int compare(KKey l, KKey r) {
                    return l.coordinates[n]
                        .compareTo(r.coordinates[n]);
                }
            };
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder(
                Arrays.deepToString(coordinates));
            sb.setCharAt(0, '(');
            sb.setCharAt(sb.length() - 1, ')');
            return sb.toString();
        }
    }

    // static boolean trimLists = true; // introduced when ArrayList was used in interface

    /** @return two arrays of {@code KKey}s: comparing smaller than
     *  or equal to {@code pivot} (according to {@code comp}),
     *  and greater than pivot -
     *  in the same order as in {@code keys}. */
    static KKey[][] split(KKey[] keys, KKey pivot, Comparator<KKey> comp) {
        int length = keys.length;
        ArrayList<KKey>
            se = new ArrayList<>(length),
            g = new ArrayList<>(length);
        for (KKey k : keys) {
            // pick the list to add to
            List<KKey> d = comp.compare(k, pivot) <= 0 ? se : g;
            d.add(k);
        }
        // if (trimLists) { se.trimToSize(); g.trimToSize(); }
        return new KKey[][] { se.toArray(KKey.NONE), g.toArray(KKey.NONE) };
    }

    /** @return two arrays of <em>k</em> arrays of {@code KKey}s:
     *  comparing smaller than or equal to {@code pivot}
     *  (according to {@code comp}), and greater than pivot,
     *  in the same order as in {@code keysByCoordinate}. */
    static KKey[][][]
    splits(KKey[][] keysByCoordinate, KKey pivot, Comparator<KKey> comp) {
        final int length = keysByCoordinate.length;
        KKey[][]
            se = new KKey[length][],
            g = new KKey[length][],
            splits;
        for (int i = 0; i < length; i++) {
            splits = split(keysByCoordinate[i], pivot, comp);
            se[i] = splits[0];
            g[i] = splits[1];
        }
        return new KKey[][][] { se, g };
    }

    // demo
    public static void main(String[] args) {
        // from https://stackoverflow.com/q/17021379/3789665
        Integer[][] coPairs = { // {0, 7}, {1, 3}, {3, 0}, {3, 1}, {6, 2},
            {12, 21}, {13, 27}, {19, 5}, {39, 5}, {49, 63}, {43, 45},
            {41, 22}, {27, 7}, {20, 12}, {32, 11}, {24, 56},
        };
        KKey[] someKeys = new KKey[coPairs.length];
        for (int i = 0; i < coPairs.length; i++) {
            someKeys[i] = new KKey(coPairs[i]);
        }
        // presort
        Arrays.sort(someKeys, KKey.comparator(0));
        List<KKey> x = new ArrayList<>(Arrays.asList(someKeys));
        System.out.println("by x: " + x);
        KKey pivot = someKeys[someKeys.length / 2];
        Arrays.sort(someKeys, KKey.comparator(1));
        System.out.println("by y: " + Arrays.deepToString(someKeys));
        // split by x
        KKey[][] allOrdered = new KKey[][] { x.toArray(KKey.NONE), someKeys };
        KKey[][][] xSplits = splits(allOrdered, pivot, KKey.comparator(0));
        for (KKey[][] c : xSplits)
            System.out.println("split by x of " + pivot + ": "
                + Arrays.deepToString(c));
        // split "higher x" by y
        pivot = xSplits[1][1][xSplits[1][1].length / 2];
        KKey[][][] ySplits = splits(xSplits[1], pivot, KKey.comparator(1));
        for (KKey[][] c : ySplits)
            System.out.println("split by y of " + pivot + ": "
                + Arrays.deepToString(c));
    }
}
```
(I didn't find a suitable answer/implementation on SE without investing too much energy. The output was unconvincing with your example; with the longer one, I had to re-format it to believe it.
The code looks ugly, in all likelihood because it is: if so inclined, re-read about the licence of code posted on SE, and visit Code Review.)
(Consider that there's voting, accepting, and awarding bounties, and re-visit Anony-Mousse's answer.)
When splitting the data, you need to retain the sort order.
E.g. using data (x,y) we build
x1 = [ (0, 7), (1, 3), (3, 0), (4, 2), (6, 1) ]
y1 = [ (3, 0), (6, 1), (4, 2), (1, 3), (0, 7) ]
If we split at x now, we need to filter both sets by the record at x=3,y=0.
I.e. split both lists, removing (3,0), all items with x<3 go to the first list each, all with x>3 go to the second (order unchanged):
x1 -> filter to x11 = [ (0, 7), (1, 3) ] x12 = [ (4, 2), (6, 1) ]
y1 -> filter to y11 = [ (1, 3), (0, 7) ] y12 = [ (6, 1), (4, 2) ]
The point is to filter each sorted list by the x values, while keeping the sorting order (so this is in O(n*k) in each of O(log n) levels). If you use only x1, and reconstruct y11 and y12 from x1, then you would need to sort again. By necessity, it is the same as if you sort once by x, once by y. Except that we did not sort again, only select.
I do not think this is much better in practice. Sorting is cheaper than the extra memory.
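Not part of the original answers, but to make the order-preserving split concrete, here is a minimal Python sketch of one level of the construction for 2-d points, reusing the example data above; ties on the split axis are sent to the right side so no point is silently dropped:

```python
# Presort the points once per axis; each level then only filters,
# preserving the presorted order instead of re-sorting.
points = [(0, 7), (1, 3), (3, 0), (4, 2), (6, 1)]
by_axis = [sorted(points, key=lambda p: p[d]) for d in range(2)]

def split(by_axis, axis):
    """Split every presorted list around the median of `axis`,
    keeping each list's order (one level of k-d tree construction)."""
    median = by_axis[axis][len(by_axis[axis]) // 2]
    left = [[p for p in lst if p != median and p[axis] < median[axis]]
            for lst in by_axis]
    right = [[p for p in lst if p != median and p[axis] >= median[axis]]
             for lst in by_axis]
    return median, left, right

median, left, right = split(by_axis, axis=0)
print(median)  # (3, 0)
print(left)    # x- and y-ordered lists of the points left of the median
print(right)   # x- and y-ordered lists of the points right of the median
```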
