Script Failure because of Large Arrays

My current script reads data from a PDB file, stores it in arrays, and then uses those arrays for the rest of the script. It runs very well on a small PDB file, but when I use a real PDB file I end up using all of the computer's memory on just one file. I have 2000 PDB files I need calculations done on.
This is my full current script with a few notes.
Full script:
#!/usr/bin/perl
use warnings;
use strict;

#my $inputfile = $ARGV[0];
#my $inputfile = '8ns_emb_alt_test.pdb';
my $inputfile = '8ns_emb_alt_101.pdb';
open( INPUTFILE, "<", $inputfile ) or die $!;
my @array = <INPUTFILE>;

### Protein
my $protein = 'PROT';
my @protx;
my @proty;
my @protz;
for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$protein\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @protx, $splitline[5];    # this has 2083 x-coordinates
        push @proty, $splitline[6];    # this has 2083 y-coordinates
        push @protz, $splitline[7];    # this has 2083 z-coordinates
    }
}

### Lipid
my $lipid1 = 'POPS';
my $lipid2 = 'POPC';
my @lipidx;
my @lipidy;
my @lipidz;
for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$lipid1\s+/ ) || ( $array[$line] =~ m/\s+$lipid2\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @lipidx, $splitline[5];    # this has approximately 35,000 x-coordinates
        push @lipidy, $splitline[6];    # same as above for y
        push @lipidz, $splitline[7];    # same as above for z
    }
}

### Calculation
my @deltaX = map {
    my $diff = $_;
    map { $diff - $_ } @lipidx
} @protx;    # so this has 2083*35,000 x-coordinate differences
my @squared_deltaX = map { $_ * $_ } @deltaX;    # all of the values from @deltaX, squared
my @deltaY = map {
    my $diff = $_;
    map { $diff - $_ } @lipidy
} @proty;
my @squared_deltaY = map { $_ * $_ } @deltaY;
my @deltaZ = map {
    my $diff = $_;
    map { $diff - $_ } @lipidz
} @protz;
my @squared_deltaZ = map { $_ * $_ } @deltaZ;
my @distance;
for ( my $ticker = 0; $ticker <= $#array; ++$ticker ) {
    my $distance_calc = sqrt( ( $squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker] ) );
    push @distance, $distance_calc;
}    # this runs the final calculation and computes all the distances between the atoms

### The Hunt
my $limit = 5;
my @DistU50;
my @resid_tagger;
for ( my $tracker = 0; $tracker <= $#array; ++$tracker ) {
    my $dist = $distance[$tracker];
    if ( ( $dist < $limit ) && ( $array[$tracker] =~ m/\s+$protein\s+/ ) ) {
        my @splitline = ( split /\s+/, $array[$tracker] );
        my $LT50 = $dist;
        push @resid_tagger, $splitline[4];    # stores a selected residue index number
        push @DistU50,      $LT50;            # stores the values within the $limit
    }
}    # this 'for' loop searches all the elements in @distance, pushes the matches to the final arrays, and also collects certain index numbers into another array

### Le'Finali
print "@resid_tagger = resid \n";
print "5 > @DistU50 \n";
close INPUTFILE;
One of my lab friends said that I could store some of the data in files so that it takes up less memory. I think that is a fine idea, but I am not sure where the most efficient place to do that would be, or how many times I would have to do it. I used arrays because that is the best way I knew how to do this.
If anyone could show me an example of how to write an array out to a file and then use the data in that file again, that would be really helpful. Otherwise, ideas I can look up, things to try, or just suggestions would at least give me somewhere to start.

You're trying to store tens of millions of results in Perl arrays (2083 protein atoms x ~35,000 lipid atoms is roughly 73 million values per coordinate), which as you've noticed is both slow and memory intensive. Plain Perl arrays are not great for massive numerical work like this, but PDL is.
The core of your problem is to calculate the distances between a large number of pairs of 3D coordinates. Let's do this for a simplified data set first, just to prove we can do it:
Start End
--------- ---------
(0, 0, 0) (1, 2, 3)
(1, 1, 1) (1, 1, 1)
(4, 5, 6) (7, 8, 9)
We can represent this data set in PDL like this:
use PDL;

#                    x          y          z
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];
We now have two sets of 3D coordinates. To compute the distances, first we subtract our start coordinates from our end coordinates:
my $diff = $end - $start;
print $diff;
This outputs
[
[1 0 3]
[2 0 3]
[3 0 3]
]
where the differences in the x-coordinates are in the first row, the differences in the y-coordinates are in the second row, and the differences in the z-coordinates are in the third row.
Next we have to square the differences:
my $squared = $diff**2;
print $squared;
which gives us
[
[1 0 9]
[4 0 9]
[9 0 9]
]
Finally we need to sum the square of the differences for each pair of points and take the square root:
foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}
(There's probably a better way to do this, but I haven't used PDL much.)
This prints out
3.74165738677394
0
5.19615242270663
which are our distances.
Putting it all together:
use strict;
use warnings;
use 5.010;

use PDL;
use PDL::NiceSlice;

my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

my $diff    = $end - $start;
my $squared = $diff**2;

foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}
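For what it's worth, here is a sketch of a more compact way to get the same distances (untested beyond this toy data; it assumes the same piddle layout as above, and uses the standard PDL methods xchg and sumover). It sums the squared differences across the coordinate dimension and takes the square root in one vectorized step, so there is no per-point Perl loop at all:
use strict;
use warnings;
use PDL;

my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

my $squared = ($end - $start)**2;

# xchg(0,1) swaps the point and coordinate dimensions so that sumover
# (which sums along the first dimension) adds x^2 + y^2 + z^2 per point
my $dist = sqrt( $squared->xchg(0,1)->sumover );

print $dist, "\n";    # [3.7416574 0 5.1961524] -- same three distances as the loop above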
It takes ~35 seconds on my desktop to calculate the distance between one million pairs of coordinates and write the results to a file. When I try with ten million pairs, I run out of memory, so you'll probably have to split your data set into pieces.
Reading data from files
Here's an example that reads data in from two files, using sample input you included in an earlier question:
use strict;
use warnings;
use 5.010;

use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;

my $start_file = 'start.txt';
my $end_file   = 'end.txt';

my $start = rcols $start_file, [ 5..7 ];
my $end   = rcols $end_file,   [ 5..7 ];

my $diff    = $end - $start;
my $squared = $diff**2;

foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}
start.txt
ATOM 1 N GLU 1 -19.992 -2.816 36.359 0.00 0.00 PROT
ATOM 2 HT1 GLU 1 -19.781 -1.880 35.958 0.00 0.00 PROT
ATOM 3 HT2 GLU 1 -19.713 -2.740 37.358 0.00 0.00 PROT
ATOM 4 HT3 GLU 1 -21.027 -2.910 36.393 0.00 0.00 PROT
ATOM 5 CA GLU 1 -19.344 -3.944 35.652 0.00 0.00 PROT
ATOM 6 HA GLU 1 -19.817 -4.852 35.998 0.00 0.00 PROT
ATOM 7 CB GLU 1 -19.501 -3.795 34.119 0.00 0.00 PROT
end.txt
ATOM 2084 N POPC 1 -44.763 27.962 20.983 0.00 0.00 MEM1
ATOM 2085 C12 POPC 1 -46.144 27.379 20.551 0.00 0.00 MEM1
ATOM 2086 C13 POPC 1 -44.923 28.611 22.367 0.00 0.00 MEM1
ATOM 2087 C14 POPC 1 -43.713 26.889 21.099 0.00 0.00 MEM1
ATOM 2088 C15 POPC 1 -44.302 29.004 20.059 0.00 0.00 MEM1
ATOM 2089 H12A POPC 1 -46.939 28.110 20.555 0.00 0.00 MEM1
ATOM 2090 H12B POPC 1 -46.486 26.769 21.374 0.00 0.00 MEM1
Output
42.3946824613654
42.2903357636233
42.9320321205507
40.4541893133455
44.1770768272415
45.3936402704167
42.7174829080553
The rcols function comes from PDL::IO::Misc and can be used to read specific columns from a file into a PDL object (in this case, columns 5 through 7, zero-indexed).
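To keep memory under control on the full-size data, one option is to loop over the protein atoms one at a time, compute that atom's distances to every lipid atom in a single vectorized step, and stream the results to disk, so the full 2083 x ~35,000 matrix never has to exist in memory. This is only a rough sketch (not tested on your data): protein.txt and lipid.txt are hypothetical files holding just the PROT and POPS/POPC lines, read with rcols exactly as above.
use strict;
use warnings;
use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;

# hypothetical inputs: the PROT lines and the POPS/POPC lines, split into
# separate files so rcols can grab columns 5-7 as before
my $prot  = rcols 'protein.txt', [ 5..7 ];
my $lipid = rcols 'lipid.txt',   [ 5..7 ];

open my $out, '>', 'distances.txt' or die $!;
foreach my $i (0 .. $prot->dim(0) - 1) {
    # distances from protein atom $i to every lipid atom, in one step;
    # $prot($i,:) is broadcast against the whole lipid piddle
    my $d = sqrt( (($lipid - $prot($i,:))**2)->xchg(0,1)->sumover );
    print {$out} join(' ', $d->list), "\n";    # one line of distances per protein atom
}
close $out;
Each output line then holds one protein atom's distances to all of the lipid atoms, which you can post-process against your 5 Angstrom cutoff one line at a time.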

edit...
You guys... we should have checked first. You might want to look into the Perl modules that already exist for processing and manipulating PDB data:
http://search.cpan.org/~rulix/Bio-PDB-Structure-0.02/lib/Bio/PDB/Structure.pm
http://www.iu.a.u-tokyo.ac.jp/~tterada/softwares/pdb.html
http://www.perlmol.org/pod/Chemistry/File/PDB.html
http://comp.chem.nottingham.ac.uk/parsepdb/
http://www.perl.com/pub/2001/11/16/perlbio2.html
https://www.biostars.org/p/89300/ (forum post, not library)
Okay... so Perl is not my first language and I don't know exactly what your data looks like.
edit: there was a weird row in my test data, so there are two sets of code here... one splits on whitespace, and the other uses expected/known column positions and lengths to determine the values.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $db   = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';

my $dbh = DBI->connect("dbi:mysql:$db:$host", $user, $pass)
    or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\folder\with\datums';
my @filenames = ('glucagon');   # I am using this as my table name, too
my $colnum    = 12;             # number of columns in the data; I assumed this was fixed

my @placeholders;
for (1..$colnum) { push @placeholders, '?'; }
my $placeholders = join(',', @placeholders);   # builds a string like: ?,?,?,?,...
                                               # for our query that uses binding

foreach my $file (@filenames) {
    my $filename = "$localpath\\$file.txt";
    if (open(my $fh => $filename)) {
        # the null at the start of the insert is because my first column is an
        # auto_increment primary key that will be generated on insert by the db
        my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
        while (my $line = <$fh>) {
            $line =~ s/\s+$//;                    # trim trailing whitespace
            if ($line ne q{}) {                   # if not totally blank
                my @row = split(/ +/, $line);     # split on whitespace
                for my $index (1..$colnum) {
                    # placeholders are 1-based, @row from split is 0-based
                    $stmt->bind_param($index, $row[$index - 1]);
                }
                $stmt->execute();
            }
        }
        close $fh;
    }
    else { print "$file not opened\n"; }
}
-- i didn't know appropriate names for any of it
create table glucagon (
row_id int unsigned auto_increment primary key,
name varchar(10),
seq int,
code1 varchar(5),
code2 varchar(5),
code3 varchar(5),
code4 int,
val1 decimal(10,2),
val2 decimal(10,2),
val3 decimal(10,2),
val4 decimal(10,2),
val5 decimal(10,2),
code5 varchar(5)
)
the following is found in C:\path\to\folder\with\datums\glucagon.txt
ATOM 1058 N ARG A 141 -6.466 12.036 -10.348 7.00 19.11 N
ATOM 1059 CA ARG A 141 -7.922 12.248 -10.253 6.00 26.80 C
ATOM 1060 C ARG A 141 -8.119 13.499 -9.393 6.00 28.93 C
ATOM 1061 O ARG A 141 -7.112 13.967 -8.853 8.00 28.68 O
ATOM 1062 CB ARG A 141 -8.639 11.005 -9.687 6.00 24.11 C
ATOM 1063 CG ARG A 141 -8.153 10.551 -8.308 6.00 19.20 C
ATOM 1064 CD ARG A 141 -8.914 9.319 -7.796 6.00 21.53 C
ATOM 1065 NE ARG A 141 -8.517 9.076 -6.403 7.00 20.93 N
ATOM 1066 CZ ARG A 141 -9.142 8.234 -5.593 6.00 23.56 C
ATOM 1067 NH1 ARG A 141 -10.150 7.487 -6.019 7.00 19.04 N
ATOM 1068 NH2 ARG A 141 -8.725 8.129 -4.343 7.00 25.11 N
ATOM 1069 OXT ARG A 141 -9.233 14.024 -9.296 8.00 40.35 O
TER 1070 ARG A 141
HETATM 1071 FE HEM A 1 8.128 7.371 -15.022 24.00 16.74 FE
HETATM 1072 CHA HEM A 1 8.617 7.879 -18.361 6.00 17.74 C
HETATM 1073 CHB HEM A 1 10.356 10.005 -14.319 6.00 18.92 C
HETATM 1074 CHC HEM A 1 8.307 6.456 -11.669 6.00 11.00 C
HETATM 1075 CHD HEM A 1 6.928 4.145 -15.725 6.00 13.25 C
end result...
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)
Ohh... look... this row makes it dirty: TER 1070 ARG A 141. I can easily fix this if you go my route, but if you use the other answer/approach, I'm not going to bother to update this.
Okay... for the stupid row: I went through and counted the starting position and length of each value in my test dataset. I didn't know whether that information changes for you when you load different files, so I made it so it can be set for each file you use.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $db   = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';

my $dbh = DBI->connect("dbi:mysql:$db:$host", $user, $pass)
    or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\datums';

# first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [ [0,6],    # 'name'
                                 [7,4],    # 'seq'
                                 [12,4],   # 'code1'
                                 [17,3],   # 'code2'
                                 [21,1],   # 'code3'
                                 [23,3],   # 'code4'
                                 [27,12],  # 'val1'
                                 [39,7],   # 'val2'
                                 [47,7],   # 'val3'
                                 [55,5],   # 'val4'
                                 [61,5],   # 'val5'
                                 [69,10],  # 'code5'
                               ],
                 # 'second_file' => [ [0,5],  # col1
                 #                    [6,5],  # col2
                 #                  ],
               };   # I am using the hash keys as my table names, too

foreach my $file (keys %$fileinfo) {
    my $filename = "$localpath\\$file.txt";
    if (open(my $fh => $filename)) {
        my $colnum = scalar @{ $fileinfo->{$file} };
        my @placeholders;
        for (1..$colnum) { push @placeholders, '?'; }
        my $placeholders = join(',', @placeholders);   # builds a string like: ?,?,?,?,...
                                                       # for our query that uses binding
        # the null at the start of the insert is because my first column is an
        # auto_increment primary key that will be generated on insert by the db
        my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
        while (my $line = <$fh>) {
            $line =~ s/\s+$//;       # trim trailing whitespace
            if ($line ne q{}) {      # if not totally blank
                my @row;
                my $index = 1;
                # foreach column's starting position & length
                foreach my $col (@{ $fileinfo->{$file} }) {
                    my $value;
                    if ($col->[0] <= length($line)) {
                        $value = substr($line, $col->[0], $col->[1]);
                        $value =~ s/^\s+|\s+$//g;               # trim leading & trailing whitespace
                        if ($value eq q{}) { undef $value; }    # I like null values vs blank
                    }
                    $row[$index] = $value;
                    $index++;
                }
                for my $index (1..$colnum) {
                    $stmt->bind_param($index, $row[$index]);
                }
                $stmt->execute();
            }
        }
        close $fh;
    }
    else { print "$file not opened\n"; }
}
new data:
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | NULL | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
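Once the rows are loaded, getting the data back out is just a SELECT, so only one file's worth of coordinates has to live in Perl at a time. A rough sketch of the read side (assuming the glucagon table above; selectall_arrayref is standard DBI, and for your PDB files the x/y/z values land in val1/val2/val3):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:test:localhost', 'root', '')
    or die "Connection Error: $DBI::errstr\n";

# grab only the coordinate columns for the atom rows
my $coords = $dbh->selectall_arrayref(
    "select val1, val2, val3 from glucagon where name = 'ATOM'"
);

foreach my $row (@$coords) {
    my ($x, $y, $z) = @$row;
    # ... feed these into whatever distance calculation you settle on ...
    printf "%.2f %.2f %.2f\n", $x, $y, $z;
}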

Related

Having trouble with Postgres unnest array syntax

I am looking for guidance on the best way to do this insert. I am trying to create 11 entries for role_id 58385 while looping through the values of each of these arrays. I am new to PostgreSQL and need some guidance as to what I am doing wrong in this instance.
INSERT INTO public.acls (role_id, acl_id, update, can_grant, retrieve, create, archive) VALUES (
'58385',
unnest(array[1,14,20,21,22,24,25,26,36,300,302]),
unnest(array[f,f,t,t,f,f,f,t,f,t,t]),
unnest(array[f,f,f,f,f,f,f,f,f,f,f]),
unnest(array[t,t,t,t,t,t,t,t,t,t,t]),
unnest(array[f,f,t,t,f,f,f,t,f,t,t]),
unnest(array[f,f,f,f,f,f,f,f,f,f,f])
)
Do I need a SELECT subquery for each of the arrays? Or could I make one array from the six and insert them?
A single select will do it for you (you can then wrap it in your INSERT INTO ... SELECT), but t and f will need to be true and false:
select '58385',
unnest(array[1,14,20,21,22,24,25,26,36,300,302]),
unnest(array[false,false,true,true,false,false,false,true,false,true,true]),
unnest(array[false,false,false,false,false,false,false,false,false,false,false]),
unnest(array[true,true,true,true,true,true,true,true,true,true,true]),
unnest(array[false,false,true,true,false,false,false,true,false,true,true]),
unnest(array[false,false,false,false,false,false,false,false,false,false,false])
;
?column? | unnest | unnest | unnest | unnest | unnest | unnest
----------+--------+--------+--------+--------+--------+--------
58385 | 1 | f | f | t | f | f
58385 | 14 | f | f | t | f | f
58385 | 20 | t | f | t | t | f
58385 | 21 | t | f | t | t | f
58385 | 22 | f | f | t | f | f
58385 | 24 | f | f | t | f | f
58385 | 25 | f | f | t | f | f
58385 | 26 | t | f | t | t | f
58385 | 36 | f | f | t | f | f
58385 | 300 | t | f | t | t | f
58385 | 302 | t | f | t | t | f
(11 rows)

Multi-dimensional data structure management in R

I have a concern about data organisation and the best approach to simplify some multi-layered data. Simply, I have 10 replicates of small wood beams (BeamID, ~10) subjected to 10 different treatments (TreatID, ~10), and each beam is load tested, which produces a series of Load and consequent Displacement data (ranging from 10 to 50 rows per test; I have code that corrects for disparities in row length). Each wood beam is tested multiple times (Rep, ~10).
My plan was to lump all this data into a 5-D array:
Array[Load, Deflection, BeamID, TreatID, Rep]
This way, I should be able to plot the load~deflection curves for a given BeamID, TreatID, for all Reps by using Array[ , ,1,1, ], right? So the hypothetical output for Array[ , ,1,1,1], would be:
+------------+--------+-----+
| Deflection | Load | Rep |
+------------+--------+-----+
| 0 | 0 | 1 |
| 6.35 | 10.5 | 1 |
| 12.7 | 20.8 | 1 |
| 19.05 | 45.3 | 1 |
| 25.4 | 75.2 | 1 |
+------------+--------+-----+
And Array[ , ,1,1,2] would be:
+------------+--------+-----+
| Deflection | Load | Rep |
+------------+--------+-----+
| 0 | 0 | 2 |
| 7.3025 | 12.075 | 2 |
| 14.605 | 23.92 | 2 |
| 21.9075 | 52.095 | 2 |
| 29.21 | 86.48 | 2 |
+------------+--------+-----+
Or I think I could keep it as a simpler, 'melted' dataframe, which would have columns for Load and Deflection, and BeamID, TreatID, and Rep would be repeated for each row of the test output.
+------------+--------+-----+--------+---------+
| Deflection | Load | Rep | BeamID | TreatID |
+------------+--------+-----+--------+---------+
| 0 | 0 | 1 | 1 | 1 |
| 6.35 | 10.5 | 1 | 1 | 1 |
| 12.7 | 20.8 | 1 | 1 | 1 |
| 19.05 | 45.3 | 1 | 1 | 1 |
| 25.4 | 75.2 | 1 | 1 | 1 |
| 0 | 0 | 2 | 1 | 1 |
| 7.3025 | 12.075 | 2 | 1 | 1 |
| 14.605 | 23.92 | 2 | 1 | 1 |
| 21.9075 | 52.095 | 2 | 1 | 1 |
| 29.21 | 86.48 | 2 | 1 | 1 |
+------------+--------+-----+--------+---------+
However, with the latter, I'm not sure how I could easily and discretely pull out all the Rep test values for a specific BeamID and TreatID, especially since I use a linear model to fit a 3rd-order polynomial to a specific test to extract the slope of the curves. Having it as one continuous dataframe means I'd have to specify starting and stopping points for the linear model, correct?
Thoughts, suggestions? Am I headed in the right direction in using a 5-D array? R is a new programming language for me, so please pardon my misunderstandings.

Building index for specific value

I have a table that keeps inventory information for products in stores on daily basis. It is like:
|------------|-----------|---------|-----------------|
| Date | ProductId | StoreId | InventoryOnHand |
|------------|-----------|---------|-----------------|
| 2017-10-11 | 348 | 121 | 2 |
| 2017-10-11 | 110 | 200 | 0 |
| 2017-10-11 | 254 | 587 | -2 |
| 2017-10-12 | 311 | 875 | 26 |
| 2017-10-12 | 954 | 364 | 15 |
| 2017-10-12 | 348 | 121 | 0 |
| 2017-10-12 | 441 | 121 | 7 |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
|------------|-----------|---------|-----------------|
Most of my queries have a condition like WHERE InventoryOnHand > 0, and I need to speed these queries up.
Therefore, I want to build an index that separates the values in column InventoryOnHand by whether or not they are greater than 0.
A filtered index does not solve my problem, because it would index every value greater than 0, which increases the index size; I only need to know whether a value is greater than 0 or not.
i.e. I want to build an index that only works when the condition is InventoryOnHand > 0. Is there any way to do this in SQL Server?

How to make a SQL "IF-THEN-ELSE" statement

I've seen other questions about SQL If-then-else stuff, but I'm not seeing how to relate it to what I'm trying to do. I've been using SQL for about a year now but only basic stuff and never this.
If I have a SQL table that looks like this
| Name | Version | Category | Value | Number |
|:-----:|:-------:|:--------:|:-----:|:------:|
| File1 | 1.0 | Time | 123 | 1 |
| File1 | 1.0 | Size | 456 | 1 |
| File1 | 1.0 | Final | 789 | 1 |
| File2 | 1.0 | Time | 312 | 1 |
| File2 | 1.0 | Size | 645 | 1 |
| File2 | 1.0 | Final | 978 | 1 |
| File3 | 1.0 | Time | 741 | 1 |
| File3 | 1.0 | Size | 852 | 1 |
| File3 | 1.0 | Final | 963 | 1 |
| File1 | 1.1 | Time | 369 | 2 |
| File1 | 1.1 | Size | 258 | 2 |
| File1 | 1.1 | Final | 147 | 2 |
| File2 | 1.1 | Time | 741 | 2 |
| File2 | 1.1 | Size | 734 | 2 |
| File2 | 1.1 | Final | 942 | 2 |
| File3 | 1.1 | Time | 997 | 2 |
| File3 | 1.1 | Size | 997 | 2 |
| File3 | 1.1 | Final | 985 | 2 |
How can I write a SQL IF, ELSE statement that creates a new column called "Replication" that follows this rule:
A = B + 1 when x = 1
else
A = B
where A = the number we will use for the next Number
B = Max(Number)
x = Replication count (this is the number of times that a loop is executed. x=i)
The results table will look like this:
| Name | Version | Category | Value | Number | Replication |
|:-----:|:-------:|:--------:|:-----:|:------:|:-----------:|
| File1 | 1.0 | Time | 123 | 1 | 1 |
| File1 | 1.0 | Size | 456 | 1 | 1 |
| File1 | 1.0 | Final | 789 | 1 | 1 |
| File2 | 1.0 | Time | 312 | 1 | 1 |
| File2 | 1.0 | Size | 645 | 1 | 1 |
| File2 | 1.0 | Final | 978 | 1 | 1 |
| File1 | 1.0 | Time | 369 | 1 | 2 |
| File1 | 1.0 | Size | 258 | 1 | 2 |
| File1 | 1.0 | Final | 147 | 1 | 2 |
| File2 | 1.0 | Time | 741 | 1 | 2 |
| File2 | 1.0 | Size | 734 | 1 | 2 |
| File2 | 1.0 | Final | 942 | 1 | 2 |
| File1 | 1.1 | Time | 997 | 2 | 1 |
| File1 | 1.1 | Size | 997 | 2 | 1 |
| File1 | 1.1 | Final | 985 | 2 | 1 |
| File2 | 1.1 | Time | 438 | 2 | 1 |
| File2 | 1.1 | Size | 735 | 2 | 1 |
| File2 | 1.1 | Final | 768 | 2 | 1 |
| File1 | 1.1 | Time | 786 | 2 | 2 |
| File1 | 1.1 | Size | 486 | 2 | 2 |
| File1 | 1.1 | Final | 135 | 2 | 2 |
| File2 | 1.1 | Time | 379 | 2 | 2 |
| File2 | 1.1 | Size | 943 | 2 | 2 |
| File2 | 1.1 | Final | 735 | 2 | 2 |
EDIT: Based on the answer by Sean Lange, this is my 2nd attempt at a solution:
SELECT COALESCE(MAX(Number) + CASE WHEN Replication = 1 THEN 1 ELSE 0 END, 1) FROM Table
The COALESCE is in there for when there is no value yet in the Number column.
The IF/ELSE construct is used to control the flow of statements in T-SQL. You want a CASE expression, which is used to conditionally return values in a column.
https://msdn.microsoft.com/en-us/library/ms181765.aspx
Yours would be something like:
case when x = 1 then A else B end as A
As SeanLange pointed out, in this case it would be better to use a CASE expression, but to illustrate how to use IF/ELSE, the way to do it in SQL is like this:
if x = 1
BEGIN
---Do something
END
ELSE
BEGIN
--Do something else
END
I would say the best way to know which to use is this: if you are writing a query and want a different value to appear in a column based on a certain condition, use CASE/WHEN. If a certain condition should cause a different series of steps to happen, use IF/ELSE.

SQL Server : Islands And Gaps

I'm struggling with an "Islands and Gaps" issue. This is for SQL Server 2008 / 2012 (we have databases on both).
I have a table which tracks "available" Serial-#'s for a Pass Outlet; i.e., Bus Passes, Admissions Tickets, Disneyland Tickets, etc. Those Serial-#'s are VARCHAR, and can be any combination of numbers and characters... any length, up to the max value of the defined column... which is VARCHAR(30). And this is where I'm mightily struggling with the syntax/design of a VIEW.
The table (IM_SER) which contains all this data has a primary key consisting of:
ITEM_NO...VARCHAR(20),
SERIAL_NO...VARCHAR(30)
In many cases... particularly with different types of the "Bus Passes" involved, those Serial-#'s could easily track into the TENS of THOUSANDS. What is needed... is a simple view in SQL Server... which simply outputs the CONSECUTIVE RANGES of Available Serial-#'s...until a GAP is found (i.e. a BREAK in the sequences). For example, say we have the following Serial-#'s on hand, for a given Item-#:
123
124
125
139
140
ABC123
ABC124
ABC126
XYZ240003
XYY240004
In my example above, the output would be displayed as follows:
123 -to- 125
139 -to- 140
ABC123 -to- ABC124
ABC126 -to- ABC126
XYZ240003 to XYZ240004
In total, there would be 10 Serial-#'s...but since we're outputting the sequential ranges...only 5-lines of output would be necessary. Does this make sense? Please let me know...and, again, THANK YOU!...Mark
This should get you started... the fun part will be determining if there are gaps or not. You will have to handle each serial format a little bit differently to determine if there are gaps or not...
select x.item_no, x.s_format, x.s_length, x.serial_no,
       LAG(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                              ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) PreviousValue,
       LEAD(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                               ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) NextValue
from
(
    select item_no, serial_no,
           len(serial_no) as S_LENGTH,
           case
               WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND
                    PATINDEX('%[a-z]%', serial_no) = 0 THEN 'NUMERIC'
               WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND
                    PATINDEX('%[a-z]%', serial_no) > 0 THEN 'ALPHANUMERIC'
               ELSE 'ALPHA'
           end as S_FORMAT
    from table1
) x
order by item_no, s_format, s_length, serial_no
http://sqlfiddle.com/#!3/5636e2/7
| item_no | s_format | s_length | serial_no | PreviousValue | NextValue |
|---------|--------------|----------|-----------|---------------|-----------|
| 1 | ALPHA | 4 | ABCD | (null) | ABCF |
| 1 | ALPHA | 4 | ABCF | ABCD | (null) |
| 1 | ALPHANUMERIC | 6 | ABC123 | (null) | ABC124 |
| 1 | ALPHANUMERIC | 6 | ABC124 | ABC123 | ABC126 |
| 1 | ALPHANUMERIC | 6 | ABC126 | ABC124 | (null) |
| 1 | ALPHANUMERIC | 9 | XYY240004 | (null) | XYZ240003 |
| 1 | ALPHANUMERIC | 9 | XYZ240003 | XYY240004 | (null) |
| 1 | NUMERIC | 3 | 123 | (null) | 124 |
| 1 | NUMERIC | 3 | 124 | 123 | 125 |
| 1 | NUMERIC | 3 | 125 | 124 | 139 |
| 1 | NUMERIC | 3 | 139 | 125 | 140 |
| 1 | NUMERIC | 3 | 140 | 139 | (null) |

Resources