I cannot for the life of me figure out how to get a weighted ranking for scores across X categories. For example, a student needs to answer 10 questions across 3 categories (both the number of questions and the number of categories will eventually be variable). To get a total score, the top score in each of the X (3) categories is counted first, and then the highest remaining scores are added until 10 question scores have been counted in total.
Here is the data. I used ROW_NUMBER() inside a CASE WHEN to derive the TopInCat column.
http://sqlfiddle.com/#!6/e6e9f/1
The fiddle has more students.
| Question | Student | Category | Score | TopInCat |
|----------|---------|----------|-------|----------|
| 120149 | 125 | 6 | 1 | 1 |
| 120127 | 125 | 6 | 0.9 | 0 |
| 120124 | 125 | 6 | 0.8 | 0 |
| 120125 | 125 | 6 | 0.7 | 0 |
| 120130 | 125 | 6 | 0.6 | 0 |
| 120166 | 125 | 6 | 0.5 | 0 |
| 120161 | 125 | 6 | 0.4 | 0 |
| 120138 | 125 | 4 | 0.15 | 1 |
| 120069 | 125 | 4 | 0.15 | 0 |
| 120022 | 125 | 4 | 0.15 | 0 |
| 120002 | 125 | 4 | 0.15 | 0 |
| 120068 | 125 | 2 | 0.01 | 1 |
| 120050 | 125 | 3 | 0.05 | 1 |
| 120139 | 125 | 2 | 0 | 0 |
| 120156 | 125 | 2 | 0 | 0 |
This is how I envision it needs to look, but it doesn't have to be exactly this. I just need the 10-questions-by-3-categories detail data in a form that lets me sum and average the Sort 1-10 rows below. The 999s could be NULL or anything else, as long as I can sum what's important and still present the details.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 125 | 6 | 1 | 1 | 1 |
| 120138 | 125 | 4 | 0.15 | 1 | 2 |
| 120068 | 125 | 2 | 0.01 | 1 | 3 |
| 120127 | 125 | 6 | 0.9 | 0 | 4 |
| 120124 | 125 | 6 | 0.8 | 0 | 5 |
| 120125 | 125 | 6 | 0.7 | 0 | 6 |
| 120130 | 125 | 6 | 0.6 | 0 | 7 |
| 120166 | 125 | 6 | 0.5 | 0 | 8 |
| 120161 | 125 | 6 | 0.4 | 0 | 9 |
| 120069 | 125 | 4 | 0.15 | 0 | 10 |
| 120022 | 125 | 4 | 0.15 | 0 | 999 |
| 120002 | 125 | 4 | 0.15 | 0 | 999 |
| 120050 | 125 | 3 | 0.05 | 1 | 999 |
| 120139 | 125 | 2 | 0 | 0 | 999 |
| 120156 | 125 | 2 | 0 | 0 | 999 |
One last thing: the category no longer matters once the X (3) threshold is met. So a 4th category would just sort normally.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 126 | 6 | 1 | 1 | 1 |
| 120138 | 126 | 4 | 0.75 | 1 | 2 |
| 120068 | 126 | 2 | 0.50 | 1 | 3 |
| 120127 | 126 | 6 | 0.9 | 0 | 4 |
| 120124 | 126 | 6 | 0.8 | 0 | 5 |
| 120125 | 126 | 6 | 0.7 | 0 | 6 |
| 120130 | 126 | 6 | 0.6 | 0 | 7 |
| 120166 | 126 | 6 | 0.5 | 0 | 8 |
| 120050 | 126 | 3 | 0.45 | 1 | 9 |
| 120161 | 126 | 6 | 0.4 | 0 | 10 |
| 120069 | 126 | 4 | 0.15 | 0 | 999 |
| 120022 | 126 | 4 | 0.15 | 0 | 999 |
| 120002 | 126 | 4 | 0.15 | 0 | 999 |
| 120139 | 126 | 2 | 0 | 0 | 999 |
| 120156 | 126 | 2 | 0 | 0 | 999 |
I really appreciate any help. I've been banging my head against this for a few days.
With such matters I like to take a 'building blocks' approach, following the maxim: first make it work, then, if you need to, make it fast. This first step is often enough.
So, given
CREATE TABLE WeightedScores
([Question] int, [Student] int, [Category] int, [Score] dec(3,2))
;
and your sample data
INSERT INTO WeightedScores
([Question], [Student], [Category], [Score])
VALUES
(120161, 123, 6, 1), (120166, 123, 6, 0.64), (120138, 123, 4, 0.57), (120069, 123, 4, 0.5),
(120068, 123, 2, 0.33), (120022, 123, 4, 0.18), (120061, 123, 6, 0), (120002, 123, 4, 0),
(120124, 123, 6, 0), (120125, 123, 6, 0), (120137, 123, 6, 0), (120154, 123, 6, 0),
(120155, 123, 6, 0), (120156, 123, 6, 0), (120139, 124, 2, 1), (120156, 124, 2, 1),
(120050, 124, 3, 0.88), (120068, 124, 2, 0.87), (120161, 124, 6, 0.87), (120138, 124, 4, 0.85),
(120069, 124, 4, 0.51), (120166, 124, 6, 0.5), (120022, 124, 4, 0.43), (120002, 124, 4, 0),
(120130, 124, 6, 0), (120125, 124, 6, 0), (120124, 124, 6, 0), (120127, 124, 6, 0),
(120149, 124, 6, 0), (120149, 125, 6, 1), (120127, 125, 6, 0.9), (120124, 125, 6, 0.8),
(120125, 125, 6, 0.7), (120130, 125, 6, 0.6), (120166, 125, 6, 0.5), (120161, 125, 6, 0.4),
(120138, 125, 4, 0.15), (120069, 125, 4, 0.15), (120022, 125, 4, 0.15), (120002, 125, 4, 0.15),
(120068, 125, 2, 0.01), (120050, 125, 3, 0.05), (120139, 125, 2, 0), (120156, 125, 2, 0),
(120149, 126, 6, 1), (120138, 126, 4, 0.75), (120068, 126, 2, 0.50), (120127, 126, 6, 0.9),
(120124, 126, 6, 0.8), (120125, 126, 6, 0.7), (120130, 126, 6, 0.6), (120166, 126, 6, 0.5),
(120050, 126, 3, 0.45), (120161, 126, 6, 0.4), (120069, 126, 4, 0.15), (120022, 126, 4, 0.15),
(120002, 126, 4, 0.15), (120139, 126, 2, 0), (120156, 126, 2, 0)
;
let's proceed.
The complicated part here is identifying the top three top-in-category questions; the rest of the ten questions of interest per student are simply sorted by score, which is easy. So let's start with identifying the top three top-in-category questions.
First, assign to each row a row number giving the ordering of that score within the category, for the student:
;WITH Numbered1 ( Question, Student, Category, Score, SeqInStudentCategory ) AS
(
SELECT Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student, Category ORDER BY Score DESC) SeqInStudentCategory
FROM WeightedScores
)
Now we are only interested in rows where SeqInStudentCategory is 1. Considering only such rows, let's order them by score within student, and number those rows:
-- within the preceding WITH
, Numbered2 ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
Numbered1
WHERE
SeqInStudentCategory = 1
)
Now we are only interested in rows where SeqInStudent is at most 3. Let's pull them out, so that we know to include them (and exclude them from the simple sort by score that will make up the remaining seven rows):
-- within the preceding WITH
, TopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT Question, Student, Category, Score, SeqInStudent FROM Numbered2 WHERE SeqInStudent <= 3
)
Now we have the three top-in-category questions for each student. We now need to identify and order by score the not top-in-category questions for each student:
-- within the preceding WITH
, NotTopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
WeightedScores WS
WHERE
NOT EXISTS ( SELECT 1 FROM TopInCat T WHERE T.Question = WS.Question AND T.Student = WS.Student )
)
Finally we combine TopInCat with NotTopInCat, applying an appropriate offset and restriction to NotTopInCat.SeqInStudent - we need to add 3 to the raw value, and take the top 7 (which is 10 - 3):
-- within the preceding WITH
, Combined ( Question, Student, Category, Score, CombinedSeq ) AS
(
SELECT
Question, Student, Category, Score, SeqInStudent AS CombinedSeq
FROM
TopInCat
UNION
SELECT
Question, Student, Category, Score, SeqInStudent + 3 AS CombinedSeq
FROM
NotTopInCat
WHERE
SeqInStudent <= 10 - 3
)
To get our final results:
SELECT * FROM Combined ORDER BY Student, CombinedSeq
;
You can see the results on sqlfiddle.
Note that here I have assumed that every student will always have answers from at least three categories. Also, the final output doesn't have a TopInCat column, but hopefully you can see how to regain it if you want it.
Also, "(both # of questions and # of categories will be variable eventually)" should be relatively straightforward to deal with here. But watch out for my assumption that (in this case) 3 categories will definitely be present in the answers of each student.
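For what it's worth, the ranking rule can be sanity-checked outside SQL. Below is a plain-Python sketch of the same logic the CTEs above implement (`rank_rows` is a hypothetical helper, and ties among equal scores are broken by input order, much as ROW_NUMBER breaks them arbitrarily). Note that when the category-bests are ranked purely by score, category 3's best (0.05) takes the third slot ahead of category 2's (0.01) for student 125; that is what the SQL produces, though it differs from the third row of the asker's first example table.

```python
# Sketch: take the best score of each category, keep the top 3 of those,
# then fill to 10 with the best remaining rows regardless of category.
def rank_rows(rows, top_cats=3, total=10):
    """rows: list of (question, category, score) for one student."""
    ordered = sorted(rows, key=lambda r: -r[2])        # score descending
    picked, seen = [], set()
    for q, cat, score in ordered:
        # the first row seen for each category is that category's best;
        # the first `top_cats` distinct categories are the top category-bests
        if cat not in seen and len(picked) < top_cats:
            seen.add(cat)
            picked.append((q, cat, score))
    rest = [r for r in ordered if r not in picked]     # everything else,
    return picked + rest[: total - len(picked)]        # already score-sorted

# student 125's rows from the question
student_125 = [
    (120149, 6, 1.0), (120127, 6, 0.9), (120124, 6, 0.8), (120125, 6, 0.7),
    (120130, 6, 0.6), (120166, 6, 0.5), (120161, 6, 0.4), (120138, 4, 0.15),
    (120069, 4, 0.15), (120022, 4, 0.15), (120002, 4, 0.15), (120068, 2, 0.01),
    (120050, 3, 0.05), (120139, 2, 0.0), (120156, 2, 0.0),
]
top10 = rank_rows(student_125)
```

The list position of each returned row (1-based) corresponds to the Sort column; rows not returned would get the 999 / NULL treatment.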
I've been trying to learn about minimum spanning trees and the algorithms associated with them, namely Prim's, Kruskal's and Dijkstra's algorithms.
I understand how these algorithms work and have seen them in action, but there is one thing about Prim's algorithm I don't understand: an array whose purpose and workings I cannot figure out.
So here is the situation:
I have to do an exercise in which I am given an adjacency table and have to run Prim's algorithm to create a minimum spanning tree.
The table looks like this:
  |  0 |  1 |  2 |  3 |  4 |  5
0 |  0 | 73 |  4 | 64 | 40 | 74
1 | 73 |  0 | 46 | 26 | 30 | 70
2 |  4 | 46 |  0 | 77 | 86 | 14
3 | 64 | 26 | 77 |  0 | 20 | 85
4 | 40 | 30 | 86 | 20 |  0 | 22
5 | 74 | 70 | 14 | 85 | 22 |  0
The numbers separating the rows and columns are the vertices and the numbers in the table are the edge weights. Simple: I run the algorithm (on this website, for example: http://www.jakebakermaths.org.uk/maths/primsalgorithmsolverv10.html ) or just jot it down on paper, draw the minimum spanning tree, and I get a tree with a minimal cost of 86, using the edges 4, 26, 20, 22 and 14.
Now here comes the problem: apparently just solving it wasn't enough. I need to find the values of an array called closest[0,...,5]. I know it is used in the algorithm, but I don't know its purpose, what I should do with it, or how to get its values.
I have searched the internet for it and found this link about Prim's algorithm:
http://lcm.csa.iisc.ernet.in/dsa/node183.html
Which defines the array "closest" as "For i in V - U, closest[i] gives the vertex in U that is closest to i".
I still don't understand what it is, what it is used for, or what values it holds.
All I know is that the answer to my exercise is
closest[1] = 3
closest[2] = 0
closest[3] = 4
closest[4] = 5
closest[5] = 2
Thank you in advance.
When doing an MST with Prim's algorithm, it is important to keep track of four things per vertex: the vertex itself, whether it has been visited, the minimal known distance to it, and which vertex precedes it (this last one, the Parent column below, is exactly your closest[] array).
You start at vertex 0 and see that the closest vertex to 0 is 2. You could reach every other vertex from 0 as well, just at greater distances. Since the nearest vertex is 2, vertex 2 becomes visited and its parent is set to vertex 0. All the other vertices are still unvisited, but for now each has its parent set to 0, with its respective distance. In general: mark the unvisited vertex with the smallest distance as visited, and make it the next vertex to consider.
Vertex | Visited | Distance | Parent
0 | T | - | -
1 | F | 73 | 0
2 | T | 4 | 0
3 | F | 64 | 0
4 | F | 40 | 0
5 | F | 74 | 0
We then check the distances from 2 to all unvisited vertices. For each one, we compare the distance via 2 against its previously recorded distance, and update it if the new distance is smaller. (Vertex 1 is updated this way: 46 via 2 beats 73 via 0.) We now see that the distance from 2 to 5 (14) is shorter than from 0 to 5 (74), so vertex 5 becomes the next vertex visited, with its parent set to vertex 2.
Vertex | Visited | Distance | Parent
0 | T | - | -
1 | F | 46 | 2
2 | T | 4 | 0
3 | F | 64 | 0
4 | F | 40 | 0
5 | T | 14 | 2
Now we visit 5. One thing to note is that if a node is visited, we do not consider it in our distance calculations. I have simulated the rest, and hopefully you can see how you get the answer you're looking for.
Vertex | Visited | Distance | Parent
0 | T | - | -
1 | F | 46 | 2
2 | T | 4 | 0
3 | F | 64 | 0
4 | T | 22 | 5
5 | T | 14 | 2
Now visit 4
Vertex | Visited | Distance | Parent
0 | T | - | -
1 | F | 30 | 4
2 | T | 4 | 0
3 | T | 20 | 4
4 | T | 22 | 5
5 | T | 14 | 2
And now visit 3
Vertex | Visited | Distance | Parent
0 | T | - | -
1 | T | 26 | 3
2 | T | 4 | 0
3 | T | 20 | 4
4 | T | 22 | 5
5 | T | 14 | 2
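The walkthrough above can be condensed into a short Python sketch (an illustration, not your course's reference code). The `closest` array it maintains is the Parent column from the tables:

```python
# Prim's algorithm on the question's 6x6 adjacency matrix, tracking the
# "closest" (parent) array the exercise asks for.
INF = float("inf")

graph = [
    [0, 73, 4, 64, 40, 74],
    [73, 0, 46, 26, 30, 70],
    [4, 46, 0, 77, 86, 14],
    [64, 26, 77, 0, 20, 85],
    [40, 30, 86, 20, 0, 22],
    [74, 70, 14, 85, 22, 0],
]

def prim(adj, start=0):
    n = len(adj)
    visited = [False] * n
    dist = [INF] * n          # cheapest known edge into the tree
    closest = [None] * n      # closest[i]: tree vertex nearest to i
    dist[start] = 0
    total = 0
    for _ in range(n):
        # pick the unvisited vertex with the cheapest connecting edge
        u = min((v for v in range(n) if not visited[v]), key=lambda v: dist[v])
        visited[u] = True
        total += dist[u]
        # relax the edges from u to every unvisited vertex
        for v in range(n):
            if not visited[v] and 0 < adj[u][v] < dist[v]:
                dist[v] = adj[u][v]
                closest[v] = u
    return closest, total

closest, total = prim(graph)
print(closest, total)   # closest[1..5] = 3, 0, 4, 5, 2; total cost 86
```

Running it reproduces both the expected closest[] values and the MST cost of 86 from the question.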
My current script reads data from a PDB file and stores it in arrays, and those arrays are used for the rest of the script. The script runs very well on a small PDB file, but with a real PDB file I end up using all the computer's memory on just one file. I have 2000 PDB files that need these calculations.
This is my full current script with a few notes.
Full script:
#!/usr/bin/perl
use warnings;
use strict;
#my $inputfile = $ARGV[0];
#my $inputfile = '8ns_emb_alt_test.pdb';
my $inputfile = '8ns_emb_alt_101.pdb';
open( INPUTFILE, "<", $inputfile ) or die $!;
my @array = <INPUTFILE>;
### Protein
my $protein = 'PROT';
my @protx;
my @proty;
my @protz;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( ( $array[$line] =~ m/\s+$protein\s+/ ) ) {
chomp $array[$line];
my @splitline = ( split /\s+/, $array[$line] );
push @protx, $splitline[5]; # this has 2083 x-coordinates
push @proty, $splitline[6]; # this has 2083 y-coordinates
push @protz, $splitline[7]; # this has 2083 z-coordinates
}
}
### Lipid
my $lipid1 = 'POPS';
my $lipid2 = 'POPC';
my @lipidx;
my @lipidy;
my @lipidz;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( ( $array[$line] =~ m/\s+$lipid1\s+/ ) || ( $array[$line] =~ m/\s+$lipid2\s+/ ) ) {
chomp $array[$line];
my @splitline = ( split /\s+/, $array[$line] );
push @lipidx, $splitline[5]; # this has approximately 35,000 x coordinates
push @lipidy, $splitline[6]; # same as above for y
push @lipidz, $splitline[7]; # same as above for z
}
}
### Calculation
my @deltaX = map {
    my $diff = $_;
    map { $diff - $_ } @lipidx
} @protx; # so this has 2083*35000 x-coordinate differences
my @squared_deltaX = map { $_ * $_ } @deltaX; # all the x-differences from @deltaX, squared
my @deltaY = map {
    my $diff = $_;
    map { $diff - $_ } @lipidy
} @proty;
my @squared_deltaY = map { $_ * $_ } @deltaY;
my @deltaZ = map {
    my $diff = $_;
    map { $diff - $_ } @lipidz
} @protz;
my @squared_deltaZ = map { $_ * $_ } @deltaZ;
my @distance;
for ( my $ticker = 0; $ticker <= $#array; ++$ticker ) {
    my $distance_calc = sqrt( $squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker] );
    push @distance, $distance_calc;
} # this runs the final calculation and computes all the distances between the atoms
### The Hunt
my $limit = 5;
my @DistU50;
my @resid_tagger;
for ( my $tracker = 0; $tracker <= $#array; ++$tracker ) {
    my $dist = $distance[$tracker];
    if ( ( $dist < $limit ) && ( $array[$tracker] =~ m/\s+$protein\s+/ ) ) {
        my @splitline = ( split /\s+/, $array[$tracker] );
        my $LT50 = $dist;
        push @resid_tagger, $splitline[4]; # stores a selected index number
        push @DistU50, $LT50; # stores the values within the $limit
    }
} # this 'for' loop searches '@distance', pushes the matching values to the final arrays, and also collects certain index numbers into another array.
### Le'Finali
print "@resid_tagger = resid \n";
print "5 > @DistU50 \n";
close INPUTFILE;
One of my lab friends said that I could store some of the data in files so that it takes up less memory. I think that is a fine idea, but I am not sure where the most efficient place to do that would be, or how many times I would have to do it. I did this with arrays because that is the best way I knew.
If anyone could show me an example of taking an array, writing it to a file, and then using the data from that file again, that would be really helpful. Otherwise, any ideas I can look up, things to try, or suggestions would at least get me started somewhere.
You're trying to store ~66 million results in an array, which as you've noticed is both slow and memory intensive. Perl arrays are not great for massive calculations like this, but PDL is.
The core of your problem is to calculate the distance between a number of 3D coordinates. Let's do this for a simplified data set first just to prove we can do it:
Start End
--------- ---------
(0, 0, 0) (1, 2, 3)
(1, 1, 1) (1, 1, 1)
(4, 5, 6) (7, 8, 9)
We can represent this data set in PDL like this:
use PDL;
# x # y # z
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];
We now have two sets of 3D coordinates. To compute the distances, first we subtract our start coordinates from our end coordinates:
my $diff = $end - $start;
print $diff;
This outputs
[
[1 0 3]
[2 0 3]
[3 0 3]
]
where the differences in the x-coordinates are in the first row, the differences in the y-coordinates are in the second row, and the differences in the z-coordinates are in the third row.
Next we have to square the differences:
my $squared = $diff**2;
print $squared;
which gives us
[
[1 0 9]
[4 0 9]
[9 0 9]
]
Finally we need to sum the square of the differences for each pair of points and take the square root:
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}
(There's probably a better way to do this, but I haven't used PDL much.)
This prints out
3.74165738677394
0
5.19615242270663
which are our distances.
Putting it all together:
use strict;
use warnings;
use 5.010;
use PDL;
use PDL::NiceSlice;
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];
my $diff = $end - $start;
my $squared = $diff**2;
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}
It takes ~35 seconds on my desktop to calculate the distance between one million pairs of coordinates and write the results to a file. When I try with ten million pairs, I run out of memory, so you'll probably have to split your data set into pieces.
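As a sketch of that splitting, here's one way to process the pairs a block at a time so only one block of distances is ever in memory. (Plain Python rather than PDL, since the chunking idea itself doesn't depend on the library; the chunk size is arbitrary, and this follows the same element-wise start/end pairing as the example above.)

```python
from math import sqrt

def distances_chunked(start_pts, end_pts, chunk=1000):
    """Yield lists of Euclidean distances, `chunk` coordinate pairs at a time."""
    for i in range(0, len(start_pts), chunk):
        block = []
        for (x1, y1, z1), (x2, y2, z2) in zip(start_pts[i:i + chunk],
                                              end_pts[i:i + chunk]):
            block.append(sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2))
        yield block  # caller can write each block to disk and discard it
```

Each yielded block can be appended to an output file, so peak memory is bounded by the chunk size rather than the full data set.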
Reading data from files
Here's an example that reads data in from two files, using sample input you included in an earlier question:
use strict;
use warnings;
use 5.010;
use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;
my $start_file = 'start.txt';
my $end_file = 'end.txt';
my $start = rcols $start_file, [ 5..7 ];
my $end = rcols $end_file, [ 5..7 ];
my $diff = $end - $start;
my $squared = $diff**2;
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}
start.txt
ATOM 1 N GLU 1 -19.992 -2.816 36.359 0.00 0.00 PROT
ATOM 2 HT1 GLU 1 -19.781 -1.880 35.958 0.00 0.00 PROT
ATOM 3 HT2 GLU 1 -19.713 -2.740 37.358 0.00 0.00 PROT
ATOM 4 HT3 GLU 1 -21.027 -2.910 36.393 0.00 0.00 PROT
ATOM 5 CA GLU 1 -19.344 -3.944 35.652 0.00 0.00 PROT
ATOM 6 HA GLU 1 -19.817 -4.852 35.998 0.00 0.00 PROT
ATOM 7 CB GLU 1 -19.501 -3.795 34.119 0.00 0.00 PROT
end.txt
ATOM 2084 N POPC 1 -44.763 27.962 20.983 0.00 0.00 MEM1
ATOM 2085 C12 POPC 1 -46.144 27.379 20.551 0.00 0.00 MEM1
ATOM 2086 C13 POPC 1 -44.923 28.611 22.367 0.00 0.00 MEM1
ATOM 2087 C14 POPC 1 -43.713 26.889 21.099 0.00 0.00 MEM1
ATOM 2088 C15 POPC 1 -44.302 29.004 20.059 0.00 0.00 MEM1
ATOM 2089 H12A POPC 1 -46.939 28.110 20.555 0.00 0.00 MEM1
ATOM 2090 H12B POPC 1 -46.486 26.769 21.374 0.00 0.00 MEM1
Output
42.3946824613654
42.2903357636233
42.9320321205507
40.4541893133455
44.1770768272415
45.3936402704167
42.7174829080553
The rcols function comes from PDL::IO::Misc and can be used to read specific columns from a file into a PDL object (in this case, columns 5 through 7, zero-indexed).
edit.....
you guys... we should have checked first. you might want to look into the perl modules that already exist for processing and manipulating PDB data.
http://search.cpan.org/~rulix/Bio-PDB-Structure-0.02/lib/Bio/PDB/Structure.pm
http://www.iu.a.u-tokyo.ac.jp/~tterada/softwares/pdb.html
http://www.perlmol.org/pod/Chemistry/File/PDB.html
http://comp.chem.nottingham.ac.uk/parsepdb/
http://www.perl.com/pub/2001/11/16/perlbio2.html
https://www.biostars.org/p/89300/ (forum post, not library)
Okay... so perl is not my first language and I don't know exactly what your data looks like.
edit: there was a weird row in my test data. There are two sets of code here: one splits on whitespace, and the other uses expected/known column positions and lengths to determine values.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\folder\with\datums';
my @filenames = ('glucagon'); # i am using this as my table name, too
my $colnum = 12; # number of columns in data, I assumed this was fixed
my @placeholders;
for (1..$colnum) { push @placeholders, '?'; }
my $placeholders = join(',', @placeholders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
foreach my $file (@filenames) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
# the null at the start of the insert is because my first column is an
# auto_increment primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim whitespace
if ($line ne q{}) { # if not totally blank
my @row = split(/ +/, $line); # split on whitespace; $row[0] is the first field
for my $index (1..$colnum) {
    $stmt->bind_param($index, $row[$index - 1]);
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
}
-- i didn't know appropriate names for any of it
create table glucagon (
row_id int unsigned auto_increment primary key,
name varchar(10),
seq int,
code1 varchar(5),
code2 varchar(5),
code3 varchar(5),
code4 int,
val1 decimal(10,2),
val2 decimal(10,2),
val3 decimal(10,2),
val4 decimal(10,2),
val5 decimal(10,2),
code5 varchar(5)
)
the following is found in C:\path\to\folder\with\datums\glucagon.txt
ATOM 1058 N ARG A 141 -6.466 12.036 -10.348 7.00 19.11 N
ATOM 1059 CA ARG A 141 -7.922 12.248 -10.253 6.00 26.80 C
ATOM 1060 C ARG A 141 -8.119 13.499 -9.393 6.00 28.93 C
ATOM 1061 O ARG A 141 -7.112 13.967 -8.853 8.00 28.68 O
ATOM 1062 CB ARG A 141 -8.639 11.005 -9.687 6.00 24.11 C
ATOM 1063 CG ARG A 141 -8.153 10.551 -8.308 6.00 19.20 C
ATOM 1064 CD ARG A 141 -8.914 9.319 -7.796 6.00 21.53 C
ATOM 1065 NE ARG A 141 -8.517 9.076 -6.403 7.00 20.93 N
ATOM 1066 CZ ARG A 141 -9.142 8.234 -5.593 6.00 23.56 C
ATOM 1067 NH1 ARG A 141 -10.150 7.487 -6.019 7.00 19.04 N
ATOM 1068 NH2 ARG A 141 -8.725 8.129 -4.343 7.00 25.11 N
ATOM 1069 OXT ARG A 141 -9.233 14.024 -9.296 8.00 40.35 O
TER 1070 ARG A 141
HETATM 1071 FE HEM A 1 8.128 7.371 -15.022 24.00 16.74 FE
HETATM 1072 CHA HEM A 1 8.617 7.879 -18.361 6.00 17.74 C
HETATM 1073 CHB HEM A 1 10.356 10.005 -14.319 6.00 18.92 C
HETATM 1074 CHC HEM A 1 8.307 6.456 -11.669 6.00 11.00 C
HETATM 1075 CHD HEM A 1 6.928 4.145 -15.725 6.00 13.25 C
end result...
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)
ohh... look... this row makes it dirty... TER 1070 ARG A 141. i can easily fix this if you go my route but if you use the other answer/approach, i'm not going to bother to update this.
Okay... for the stupid row: I went through and counted the starting position and length of each value in my test dataset. I don't know whether that information changes when you load different files, so I made it settable for each file you use.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\datums';
# first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [[0,6], # 'name'
[7,4], # 'seq'
[12,4], # 'code1'
[17,3], # 'code2'
[21,1], # 'code3'
[23,3], # 'code4'
[27,12], # 'val1'
[39,7], # 'val2'
[47,7], # 'val3'
[55,5], # 'val4'
[61,5], # 'val5'
[69,10] # 'code5'
]
# 'second_file' => [ [0,5], # col1
# [6,5], # col2
# ]
}; # i am using this as my table name, too
foreach my $file (keys %$fileinfo) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
my $colnum = scalar @{ $fileinfo->{$file} };
my @placeholders;
for (1..$colnum) { push @placeholders, '?'; }
my $placeholders = join(',', @placeholders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
# the null at the start of the insert is because my first column is an
# auto_increment primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim trailing whitespace
if ($line ne q{}) { # if not totally blank
my @row;
my $index = 1;
# foreach value column position & length
foreach my $col (@{ $fileinfo->{$file} }) {
my $value;
if ($col->[0] <= length($line)) {
$value = substr($line,$col->[0],$col->[1]);
$value =~ s/^\s+|\s+$//g; # trim trailing & leading whitespace
if ($value eq q{}) { undef $value; } # i like null values vs blank
}
$row[$index] = $value;
$index++;
}
for my $index (1..$colnum) {
$stmt->bind_param($index, $row[$index]);
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
}
new data:
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | NULL | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
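As an aside, the fixed-position idea translates directly to other languages. Here's a minimal Python sketch using the same (start, length) specs counted from the test data above (these are guesses from that data, not the official PDB column definitions):

```python
# Fixed-width field extraction: one value per (start, length) pair, with None
# for fields that fall past the end of a short row (like the "TER ..." line).
SPEC = [(0, 6), (7, 4), (12, 4), (17, 3), (21, 1), (23, 3),
        (27, 12), (39, 7), (47, 7), (55, 5), (61, 5), (69, 10)]

def parse_fixed(line, spec=SPEC):
    """Return a list of stripped field values; None for missing/blank fields."""
    out = []
    for start, length in spec:
        if start <= len(line):
            field = line[start:start + length].strip()
            out.append(field if field else None)
        else:
            out.append(None)   # field starts past the end of this row
    return out
```

Like the Perl version, this is robust against short rows where splitting on whitespace would shift every remaining column.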