NameError: name 'sequence' is not defined - arrays

I am using a Jupyter notebook. I am trying to split a sequence into multiple samples, but running the following code gives me a NameError.
# univariate data preparation
from numpy import array

# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

# define input sequence
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 3
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
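(A note on the error, since the code as posted runs cleanly top to bottom: in a Jupyter notebook, a NameError for sequence usually means the cell defining split_sequence was not executed before the cell that calls it, or a cell references sequence directly instead of raw_seq; re-running the definition cell first should clear it. Once everything has run, the expected output is:)

[10 20 30] 40
[20 30 40] 50
[30 40 50] 60
[40 50 60] 70
[50 60 70] 80
[60 70 80] 90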


How to find out if an arithmetic sequence exists in an array

If there is an array that contains random integers in ascending order, how can I tell if this array contains an arithmetic sequence (length > 3) with the common difference x?
Example:
Input: Array=[1,2,4,5,8,10,17,19,20,23,30,36,40,50]
x=10
Output: True
Explanation of the example: the array contains [10,20,30,40,50], which is an arithmetic sequence (length = 5) with the common difference 10.
Thanks!
I apologize that I have not tried any code to solve this, since I have no clue yet.
After reading the answers, I tried it in Python.
Here is my code:
df = [1, 10, 11, 20, 21, 30, 40]
i = 0
common_difference = 10
df_len = len(df)

for position_1 in range(df_len):
    for position_2 in range(df_len):
        if df[position_1] + common_difference == df[position_2]:
            position_1 = position_2
            i = i + 1

print(i)
However, it returns 9 instead of 4.
Is there any way to prevent the repetitive counting within one sequence [10,20,30,40], and also to stop i from accumulating counts from other sequences [1,11,21]?
You can solve your problem by using two loops: one to run through every element, and the other to check whether some element equals currentElement + x; if you find one that does, you can continue from there.
With the added rule of the sequence being more than 2 elements long, I have recreated your problem in FreeBASIC:
DIM array(13) As Integer = {1, 2, 4, 5, 8, 10, 17, 19, 20, 23, 30, 36, 40, 50}
DIM x As Integer = 10
DIM arithmeticArrayMinLength As Integer = 3
DIM index As Integer = 0

FOR position As Integer = LBound(array) To UBound(array)
    FOR position2 As Integer = LBound(array) To UBound(array)
        IF (array(position) + x = array(position2)) THEN
            position = position2
            index = index + 1
        END IF
    NEXT
NEXT

IF (index <= arithmeticArrayMinLength) THEN
    PRINT false
ELSE
    PRINT true
END IF
Hope it helps
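(An aside on this translation, my note rather than the answerer's: assigning position = position2 really does advance the FOR counter in FreeBASIC, but the equivalent assignment in Python has no effect on the loop, because for position_1 in range(df_len) rebinds position_1 at the top of every iteration. That is why the Python attempt above re-counts the same run from each of its members. A minimal demonstration:)

for i in range(3):
    i = 99  # rebinding the loop variable does not skip iterations
    print(i)
# prints 99 three times, once per iteration of range(3)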
Edit:
After reviewing your edit, I have come up with a solution in Python that returns all arithmetic sequences, keeping the order of the list:
def arithmeticSequence(A, n):
    SubSequence = []
    ArithmeticSequences = []
    # Create array of pairs from array A
    for index, item in enumerate(A[:-1]):
        for index2, item2 in enumerate(A[index+1:]):
            SubSequence.append([item, item2])
    # finding arithmetic sequences
    for index, pair in enumerate(SubSequence):
        if (pair[1] - pair[0] == n):
            found = [pair[0], pair[1]]
            for index2, pair2 in enumerate(SubSequence[index+1:]):
                if (pair2[0] == found[-1] and pair2[1] - pair2[0] == n):
                    found.append(pair2[1])
            if (len(found) > 2): ArithmeticSequences.append(found)
    return ArithmeticSequences

df = [1, 10, 11, 20, 21, 30, 40]
common_difference = 10
arseq = arithmeticSequence(df, common_difference)
print(arseq)
Output: [[1, 11, 21], [10, 20, 30, 40], [20, 30, 40]]
This is how you can get all the arithmetic sequences out of df for you to do whatever you want with them.
Now, if you want to remove the sub-sequences of already existing arithmetic sequences, you can try running it through:
def distinct(A):
    DistinctArithmeticSequences = A
    for index, item in enumerate(A):
        for index2, item2 in enumerate([x for x in A if x != item]):
            if (set(item2) <= set(item)):
                DistinctArithmeticSequences.remove(item2)
    return DistinctArithmeticSequences
darseq=distinct(arseq)
print(darseq)
Output: [[1, 11, 21], [10, 20, 30, 40]]
Note: Not gonna lie, this was fun figuring out!
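(A small caveat on distinct() as written, my note rather than the answerer's: DistinctArithmeticSequences = A copies only the reference, so the function removes items from the very list it is iterating over, which in general can skip elements. Starting from a copy, e.g. DistinctArithmeticSequences = list(A), is the safer variant; the version above happens to work on the output shown.)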
Try from 1: check the presence of 11, 21, 31... (you can stop immediately)
Try from 2: check the presence of 12, 22, 32... (you can stop immediately)
Try from 4: check the presence of 14, 24, 34... (you can stop immediately)
...
Try from 10: check the presence of 20, 30, 40... (bingo !)
You can use linear searches, but for a large array, a hash map will be better. If you can stop as soon as you have found a sequence of length > 3, this procedure takes linear time.
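A minimal Python sketch of this idea (my own illustration, not the answerer's code): keep the values in a set for constant-time lookups, and only start walking a run at an element that has no predecessor in the run. The question's "length > 3" is read here as at least four terms:

def has_arithmetic_run(arr, x, min_len=4):
    values = set(arr)
    for v in arr:
        if v - x in values:       # v is not the start of a run
            continue
        length, cur = 1, v
        while cur + x in values:  # walk the run v, v+x, v+2x, ...
            cur += x
            length += 1
        if length >= min_len:
            return True
    return False

print(has_arithmetic_run([1, 2, 4, 5, 8, 10, 17, 19, 20, 23, 30, 36, 40, 50], 10))  # True

Because each element is visited a constant number of times (once as a candidate start, once during a walk), the whole scan is linear in the array length.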
Scan the list increasingly and for every element v, check if the element v + 10 is present and draw a link between them. This search can be done in linear time as a modified merge operation.
E.g. from 1, search 11; you can stop at 17; from 2, search 12; you can stop at 17; ... ; from 8, search 18; you can stop at 19...
Now you have a graph, the connected components of which form arithmetic sequences. You can traverse the array in search of a long sequence (or a longest), also in linear time.
In the given example, the only links are 10 -> 20 -> 30 -> 40 -> 50.

Check matches with 3's in a set of 6 numbers across 49 number draws

I am using select within Sidekiq:
require 'set'
require 'benchmark'

all_numbers = (1..49).to_a.combination(6)
needle = [1,2,3,4,5,6].to_set

Benchmark.bm do |x|
  x.report { all_numbers.select { |z| (needle & z).count == 3 } }
end
#        user     system      total        real
#   74.200000   3.040000  77.240000 ( 78.901259)
I want to check thousands of such needles quickly. Is there a different way to find out this information? Is converting to C an option?
Note:
all_numbers does not change; it is always as above.
The goal is to display all the sets which have exactly 3 matches.
Example needles can be generated with:
(1..49).to_a.shuffle.first(6).sort
Assuming needle itself is one of the combinations in all_numbers (six distinct numbers drawn from 1..49):
For three to be correct, that's a 3-combination of 6:
(6*5*4) / (1*2*3) = 20
For the remaining three to be incorrect, that's a 3-combination of the remaining (49-6):
(43*42*41) / (1*2*3) = 12341
Thus, the total number of combinations is
12341 * 20 = 246820
In code:
require 'benchmark'

size_all = 49
size_needle = 6
required = 3

def binomial(n, k)
  ((n - k + 1)..n).inject(&:*) / (1..k).inject(&:*)
end

Benchmark.bm do |x|
  x.report {
    binomial(size_needle, required) * binomial(size_all - size_needle, required)
  }
end
#       user     system      total        real
#   0.000026   0.000006   0.000032 (  0.000030)
Slightly faster.
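(As a quick cross-check in Python, my addition since Python is used elsewhere on this page: the standard library's math.comb reproduces the count.)

import math

# choose 3 of the 6 needle numbers, and 3 of the other 43
print(math.comb(6, 3) * math.comb(43, 3))  # 246820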
EDIT: after the requirements were changed:
class Array
  def semimatching_combination(needle, num_total, num_needle)
    unless block_given?
      return to_enum(__method__, needle, num_total, num_needle) do
        binomial(needle.size, num_needle) *
          binomial(self.size - needle.size, num_total - num_needle)
      end
    end
    needle.combination(num_needle) do |needle_comb|
      (self - needle).combination(num_total - num_needle) do |other_comb|
        yield (needle_comb + other_comb).sort
      end
    end
  end
end
(1..49).to_a.semimatching_combination((1..6).to_a, 6, 3).size
# => 246820
(1..49).to_a.semimatching_combination((1..6).to_a, 6, 3).to_a
# => [[1, 2, 3, 7, 8, 9], ...]
You can replace sort with to_set (with require 'set') if you want, depending on what you want to generate.

Fitting a linear regression with scipy.stats; error in array shapes

I have written some code to read a data file using pandas and process the data with numpy. This results in some NaNs in the numpy array. I mask those out so that I can apply a linear regression fit with scipy.stats:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def makeArray(band):
    """
    Takes as argument a string as the name of a wavelength band.
    Converts the list of magnitudes in that band into a numpy array,
    replacing invalid values (where invalid == -999) with NaNs.
    Returns the array.
    """
    array_name = band + '_mag'
    array = np.array(df[array_name])
    array[array == -999] = np.nan
    return array

# Read data file
fields = ['no', 'NED', 'z', 'obj_type', 'S_21', 'power', 'SI_flag',
          'U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
          'W2_mag', 'W3_mag', 'W4_mag', 'L_UV', 'Q', 'flag_uv']
magnitudes = ['U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
              'W2_mag', 'W3_mag', 'W4_mag']
df = pd.read_csv('todo.dat', sep=' ',
                 names=fields, index_col=False)

# Define axes for processing
redshifts = np.array(df['z'])
y = np.log(makeArray('K'))
mask = np.isnan(y)
plt.scatter(redshifts, y, label=('K'), s=2, color='r')
slope, intercept, r_value, p_value, std_err = stats.linregress(redshifts, y[mask])
fit = slope*redshifts + intercept
plt.legend()
plt.show()
but the lines where I calculate the stats parameters and the fit line (third- and fourth-to-last lines) give me the following error:
Traceback (most recent call last):
File "<ipython-input-77-ec9f43cdfa9b>", line 1, in <module>
runfile('C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs/read_csv.py', wdir='C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs')
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs/read_csv.py", line 35, in <module>
slope, intercept, r_value, p_value, std_err = stats.linregress(redshifts, y[mask])
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_stats_mstats_common.py", line 92, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2865, in cov
X = np.vstack((X, y))
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 234, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
The variables' shapes were shown in a screenshot (omitted here), so I'm not sure what the error means, or how to fix it. Is there a way around this? Or perhaps another module I can use instead of scipy.stats that will allow me to fit a linear regression?
The problem is that y[mask] is a different length to redshifts.
Below is a simple piece of example code to show the issue:
import numpy as np
na = np.array
y = na([np.nan, 4, 5, 6, 7, 8, np.nan, 9, 10, np.nan])
mask = np.isnan(y)
print(len(y), len(y[mask]))
You will have to substitute values for the NaN values in y, with something like:
print('old y: ', y)
for idx, m in enumerate(mask):
    if m:
        y[idx] = 1000  # or whatever value you decide on
print('new y: ', y)
Full example code...
import numpy as np

na = np.array
y = na([np.nan, 4, 5, 6, 7, 8, np.nan, 9, 10, np.nan])
mask = np.isnan(y)
print(len(y), len(y[mask]))
print('old y: ', y)
for idx, m in enumerate(mask):
    if m:
        y[idx] = 1000  # or whatever value you decide on
print('new y: ', y)
print(len(y))
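(An alternative worth noting, my addition rather than part of the original answer: if the goal is to drop the NaN points instead of overwriting them with a sentinel, invert the mask and apply it to both arrays so their lengths stay equal. In the question, mask = np.isnan(y) selects the NaNs, so y[mask] is exactly the invalid points.)

valid = ~np.isnan(y)
slope, intercept, r_value, p_value, std_err = stats.linregress(redshifts[valid], y[valid])

This fits the regression on the valid points only.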

Divide an array into subarrays as equally as possible for core-mapping

What algorithms are used to map an image array to multiple cores for processing? I've been trying to come up with something that will return a list of (disjoint) ranges over which to iterate in an array, and so far I have the following.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np

def divider(arr_dims, coreNum=1):
    """ Get a bunch of iterable ranges;
    Example input: [[[0, 24], [15, 25]]]
    """
    if (coreNum == 1):
        return arr_dims
    elif (coreNum < 1):
        raise ValueError(
            'partitioner expected a positive number of cores, got %d'
            % coreNum
        )
    elif (coreNum % 2):
        raise ValueError(
            'partitioner expected an even number of cores, got %d'
            % coreNum
        )

    total = []

    # Split each coordinate in arr_dims in _half_
    for arr_dim in arr_dims:
        dY = arr_dim[0][1] - arr_dim[0][0]
        dX = arr_dim[1][1] - arr_dim[1][0]

        if ((coreNum,)*2 > (dY, dX)):
            coreNum = max(dY, dX)
            coreNum -= 1 if (coreNum % 2 and coreNum > 1) else 0

        new_c1, new_c2 = [], []

        if (dY >= dX):
            # Subimage height is greater than its width
            half = dY // 2
            new_c1.append([arr_dim[0][0], arr_dim[0][0] + half])
            new_c1.append(arr_dim[1])
            new_c2.append([arr_dim[0][0] + half, arr_dim[0][1]])
            new_c2.append(arr_dim[1])
        else:
            # Subimage width is greater than its height
            half = dX // 2
            new_c1.append(arr_dim[0])
            new_c1.append([arr_dim[1][0], half])
            new_c2.append(arr_dim[0])
            new_c2.append([arr_dim[1][0] + half, arr_dim[1][1]])

        total.append(new_c1), total.append(new_c2)

    # If the number of cores is 1, we get back the total; Else,
    # we split each in total, etc.; it's turtles all the way down
    return divider(total, coreNum // 2)

if __name__ == '__main__':
    X = np.random.randn(25 - 1, 36 - 1)
    dims = [zip([0, 0], list(X.shape))]
    dims = [list(j) for i in dims for j in dims[0] if type(j) != list]
    print(divider([dims], 2))
It's incredibly limited, however, because it only accepts a number of cores that's some power of 2, and then I'm certain there's edge cases I'm overlooking. Running it returns [[[0, 24], [0, 17]], [[0, 24], [17, 35]]], and then using pathos I've mapped the first set to one core in my laptop and the second to another.
I guess I just don't know how to geometrically walk my way through partitioning an image into segments that are as similar in size as possible, so that each core on a given machine has the same amount of work to do.
I'm not too sure what you're trying to achieve, but if you want to split an array (of whatever dimensions) into multiple parts, you can look into the numpy.array_split method.
It partitions an array into sub-arrays of almost equal size, so it works even when the number of partitions cannot cleanly divide the array.
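For instance, a minimal sketch (my addition, with dimensions chosen to match the question's 24 x 35 example):

import numpy as np

X = np.random.randn(24, 35)
row_chunks = np.array_split(X, 4, axis=0)  # four (6, 35) blocks
col_chunks = np.array_split(X, 4, axis=1)  # shapes (24, 9), (24, 9), (24, 9), (24, 8)
print([c.shape for c in col_chunks])

Each chunk (or the index ranges recovered from the chunk shapes) can then be handed to one core.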

How to write binary data into SQLite with R DBI's dbWriteTable()?

For instance, how to execute the equivalent following SQL (which inserts into a BINARY(16) field)
INSERT INTO Table1 (MD5) VALUES (X'6717f2823d3202449201145073ab871A'),(X'6717f2823d3202449301145073ab371A')
using dbWriteTable()? Doing
dbWriteTable(db, "Table1", data.frame(MD5 = "X'6717f2823d3202449201145073ab871A'", ...), append = T, row.names = F)
doesn't seem to work - it writes the values as text.
In the end, I'm going to have a big data.frame of hashes that I want to write, and so perfect for using dbWriteTable. But I just can't figure out how to INSERT the data.frame into binary database fields.
So here are two possibilities that seem to work. The first uses dbSendQuery(...) in a loop (you've probably thought of this already...).
db.WriteTable = function(con, table, df) {   # no error checking whatsoever...
  require(DBI)
  field <- colnames(df)[1]
  for (i in 1:nrow(df)) {
    query <- sprintf("INSERT INTO %s (%s) VALUES (X'%s')", table, field, df[i,1])
    rs <- dbSendQuery(con, statement=query)
  }
  return(nrow(df))
}
library(DBI)
drv <- dbDriver("SQLite")
con <- dbConnect(drv)
rs <- dbSendQuery(con, statement="CREATE TABLE hash (MD5 BLOB)")
df <- data.frame(MD5=c("6717f2823d3202449201145073ab871A",
                       "6717f2823d3202449301145073ab371A"))
rs <- db.WriteTable(con,"hash",df)
result.1 <- dbReadTable(con,"hash")
result.1
# MD5
# 1 67, 17, f2, 82, 3d, 32, 02, 44, 92, 01, 14, 50, 73, ab, 87, 1a
# 2 67, 17, f2, 82, 3d, 32, 02, 44, 93, 01, 14, 50, 73, ab, 37, 1a
If your data frame of hashes is very large, then db.WriteFast(...) does the same thing as db.WriteTable(...), only it should be faster.
db.WriteFast = function(con, table, df) {
  require(DBI)
  field <- colnames(df)[1]
  lapply(unlist(df[,1]), function(x) {
    dbSendQuery(con,
                statement=sprintf("INSERT INTO %s (%s) VALUES (X'%s')",
                                  table, field, x))})
}
Note that result.1 is a data frame, and if we use it in a call to dbWriteTable(...) we can successfully write the hashes to a BLOB. So it is possible.
str(result.1)
# 'data.frame': 2 obs. of 1 variable:
# $ MD5:List of 2
# ..$ : raw 67 17 f2 82 ...
# ..$ : raw 67 17 f2 82 ...
The second approach takes advantage of R's raw data type to create a data frame structured like result.1, and passes that to dbWriteTable(...). You'd think this would be easy, but no.
h2r = function(x) {
  bytes <- substring(x, seq(1, nchar(x)-1, 2), seq(2, nchar(x), 2))
  return(list(as.raw(as.hexmode(bytes))))
}
hash2raw = Vectorize(h2r)
df.raw=data.frame(MD5=list(1:nrow(df)))
colnames(df.raw)="MD5"
df.raw$MD5 = unname(hash2raw(as.character(df$MD5)))
dbWriteTable(con, "newHash",df.raw)
result.2 <- dbReadTable(con,"newHash")
result.2
all.equal(result.1$MD5,result.2$MD5)
# [1] TRUE
In this approach, we create a data frame df.raw which has one column, MD5, wherein each element is a list of raw bytes. The utility function h2r(...) takes a character representation of the hash, breaks it into a vector of char(2) (the bytes), then interprets each of those as hex (as.hexmode(...)), converts the result to raw (as.raw(...)), and finally returns the result as a list. Vectorize(...) is a wrapper that allows hash2raw(...) to take a vector as its argument.
Personally, I think you're better off using the first approach: it takes advantage of SQLite's internal mechanism for writing hex to BLOBs, and it's much easier to understand.
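(An aside, my note rather than part of the original answer: newer versions of DBI also support parameterized statements via dbSendStatement()/dbBind(), which avoid building SQL strings by hand with sprintf(). Whether raw vectors bind as BLOBs this way depends on the driver version, so treat that as something to verify against your RSQLite install.)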
