how to sort a database - database

I have a database that i now combined using this function
def ReadAndMerge():
library1=input("Enter 1st filename to read and merge:")
with open(library1, 'r') as library1names:
library1contents = library1names.read()
library2=input("Enter 2nd filename to read and merge:")
with open(library2, 'r') as library2names:
library2contents = library2names.read()
print(library1contents)
print(library2contents)
combined_contents = library1contents + library2contents # concatenate text
print(combined_contents)
return(combined_contents)
The two databases originally looked like this
Bud Abbott 51 92.3
Mary Boyd 52 91.4
Hillary Clinton 50 82.1
and this
Don Adams 51 90.4
Jill Carney 53 76.3
Randy Newman 50 41.2
After being combined they now look like this
Bud Abbott 51 92.3
Mary Boyd 52 91.4
Hillary Clinton 50 82.1
Don Adams 51 90.4
Jill Carney 53 76.3
Randy Newman 50 41.2
if i wanted to sort this database by last names how would i go about doing that?
is there a sort function built in to python like lists? is this considered a list?
or would i have to use another function that locates the last name then orders them alphabetically

You sort with the sorted() method. But you can't sort just a big string, you need to have the data in a list or something similar. Something like this (untested):
def get_library_names(): # Better name of function
library1 = input("Enter 1st filename to read and merge:")
with open(library1, 'r') as library1names:
library1contents = library1names.readlines()
library2=input("Enter 2nd filename to read and merge:")
with open(library2, 'r') as library2names:
library2contents = library2names.readlines()
print(library1contents)
print(library2contents)
combined_contents = sorted(library1contents + library2contents)
print(combined_contents)
return(combined_contents)

Related

How to return an array where a column matches a value

I am trying to create the column _Type_below which return the type value for a matching name AND a matching interval. I know I can use VLOOKUP for individual names, but lets say I have thousands of names and I can specify an array for VLOOKUP for all of them. Cheers!!
Name position _Type_ Name Range_From Range_To Type
bob 0 A bob 0 30 A
bob 5 A bob 30 100 B
bob 10 A doug 0 40 C
bob 15 A doug 40 200 A
bob 20 A
bob 30 B
bob 40 B
bob 80 B
doug 0 C
doug 20 C
doug 40 A
...
If yu have the dynamic array formula you can use FILTER():
=VLOOKUP(B2,FILTER(E:G,A2=D:D),3)
If not then your data must be sorted on D then E:
=VLOOKUP(B2,INDEX(E:E,MATCH(A2,D:D,0)):INDEX(G:G,MATCH(A2,D:D,0)+COUNTIF(D:D,A2)-1),3)
This should be relatively quick, but it requires the data sorted.
You can go super old fashioned and get the row using SumProduct() and pass that into Index(). It's not going to be speedy though.
=INDEX($I$1:$I$5, SUMPRODUCT(($F$1:$F$5=A2)*($G$1:$G$5<=B2)*($H$1:$H$5>B2)*ROW($F$1:$F$5)), 1)

How do I use a dictionary to find the average of the grades in the list using a file?

The file grades.txt contains random students' grades in a fictional course. Each line of the file is a student's name, followed by a space, followed by an integer grade. I want to write a program that reads in the data from the file, and then prints out each student's name and average score.
I have a general sense of what I want to do, and I’m working to make it more precise, but I am a bit confused. I know that I should read the file and use the dictionary to store the grades and average them, but I'm not sure how to use a file and a dictionary at the same time.
inFile = open('grades.txt','r')
lines = inFile.read()
fix_list = lines.replace('\n',' ') #Replace '\n' with spaces before getting rid of them.
new_list = fix_list.split(' ') #List without spaces or '\n'
length_list = len(new_list) - 1
print(new_list)
Here is what is in the grades.txt file specifically:
Cleese 80
Gilliam 78
Jones 69
Jones 90
Cleese 90
Chapman 90
Chapman 100
Palin 80
Gilliam 82
Cleese 85
Gilliam 80
Gilliam 75
Idle 91
Jones 90
Palin 90
Cleese 88
This is what my code is supposed to spit out (not in a new file, but just to print):
Gilliam 78.75
Jones 83.0
Cleese 85.75
Chapman 95.0
Idle 91.0
Palin 85.0
I am not sure what you are doing in your code snippet.
What you want to do is:
Instantiate a dictionary
Use readlines() instead of read() on the file so that you get the lines in a list
For each line, split it and store the grade in the dictionary such that the name is the key, and the value is all the grades in a list.
Loop over the dictionary and print the name and average. For average you can do sum(list)/len(list)

How to load mixed record type fixed width file with two headers into two separate files

I got a task to load a strangely formatted text file. The file contains unwanted data too. It contains two headers back to back and data for each header is specified on alternate lines. Header rows start after ------. I need to read both the header along with its corresponding data and dump it into some Excel/table destination using. Let me know how to solve this using any transformation in SSIS or maybe with a script.
Don't know how to use a script task for this.
Right now I am reading the file in one column and using a derived column manually trying to split it using substring function. But that works for only one header and it is too hard coded type. I need some dynamic approach to read header rows as well as data rows directly.
Input file:
A1234-012 I N F O R M A T I C S C O M P A N Y 08/23/17
PAGE 2 BATCH ABC PAYMENT DATE & DUE DATE EDIT PAGE 481
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
1-3 4-6 7-13/90-102 14-19 20-25 26-31 32-34 35-37 38-46 47-48 49 50-51 52-61 62 63 64-72 73 4-5 76 77 8-80
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
1-3 4-6 7-13/90-102 14 15 20-23 24-29 30-34 35-37 38-42 43 44 49 60 61-63 64-69
USED-ID:
------------------------------------------------------------------------------------------------------------------------------------
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Expected output should be:
FILE 1:
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
FILE 2:
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Ignore first 3 rows
To ignore first 3 rows you can simply configure the flat file connection manager to ignore them, similar to:
Split file and remove bad rows
1. Configure connection managers
In addition, in the flat file connection manager, go to the advanced tab and delete all columns except one and change its data type to DT_STR and the MaxLength to 4000.
Add two connection managers , one for each destination file where you must define only one column with max length = 4000:
2. Configure Data flow task
Add a Data Flow Task, And add a Flat File Source inside. Select the Source File connection manager.
Add a conditional split with the following expressions:
File1
FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19
File2
FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10
The expressions above are created based on the expected output you mentioned in the question, i tired to search for unique keywords inside each header and splitted the data rows based on the number of space occurrence.
Finally Map each output to a destination flat file component:
Experiments
The execution result is shown in the following screenshots:
Update 1 - Remove duplicates
To remove duplicates you must you can refer to the following link:
How to remove duplicate rows from flat file using SSIS?
Update 2 - Remove only duplicates headers + Replace spaces with Tab
If you need only to remove duplicate headers then you can do this in two steps:
Add a script component after each conditional split output to flag unwanted rows
Add a conditional split to filter rows based on the script component output
In addition, because the columns values does not contains spaces you can use regular expression to replace spaces with single Tab to make the file consistent.
Script Component
In the Script Component add an output column of type DT_BOOL and name it outFlag also add a output column outColumn0 of type DT_STR and length equal to 4000 and select Column0 as Input Column.
Then write the following script in the Script Editor (C#):
First make sure that you add the RegularExpressions namespace
using System.Text.RegularExpressions;
Script Code
int SEOCount = 0;
int NOMCount = 0;
Regex regex = new Regex("[ ]{2,}", RegexOptions.None);
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (Row.Column0.Trim().StartsWith("SEO"))
{
if (SEOCount == 0)
{
SEOCount++;
Row.outFlag = true;
}
else
{
Row.outFlag = false;
}
}
else if (Row.Column0.Trim().StartsWith("NOM"))
{
if (NOMCount == 0)
{
NOMCount++;
Row.outFlag = true;
}
else
{
Row.outFlag = false;
}
}
else if (Row.Column0.Trim().StartsWith("PAGE"))
{
Row.outFlag = false;
}
else
{
Row.outFlag = true;
}
Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
}
Conditional Split
Add a conditional split after each Script Component and use the following expression to filter duplicate header:
[outFlag] == True
And connect the conditional split to the destination. Make Sure to map outColumn0 to the destination column.
Package link
https://www.dropbox.com/s/d936u4xo3mkzns8/Package.dtsx?dl=0

Trying to sort two columns of an array in R

Assume my data (df) looks something like this:
Rank Student Points Type
3 Liz 60 Junior
1 Sarah 100 Junior
10 John 40 Senior
2 Robert 70 Freshman
13 Jackie 33 Freshman
11 Stevie 35 Senior
I want to sort the data according to the Points, followed by Rank column in descending and ascending order, respectively, so that it looks like this:
Rank Student Points Type
1 Sarah 100 Junior
2 Robert 70 Freshman
3 Liz 60 Junior
10 John 40 Senior
11 Stevie 35 Senior
13 Jackie 33 Freshman
So I did this:
df[order(df[, "Points"], df[, "Rank"]), ]
Resulted in this:
Rank Student Points Type
1 Sarah 100 Junior
10 John 40 Senior
11 Stevie 35 Senior
13 Jackie 33 Freshman
2 Robert 70 Freshman
3 Liz 60 Junior
Question: How do I fix this?
I'm trying to use the column headers because the column length/width may change which can affect my sorting if I use physical locations.
FYI: I've tried so many suggestions and none seems to work:
one, two, three and four...
Try this:
df[order(df$Points,decreasing=T,df$Rank),]
Rank Student Points Type
2 1 Sarah 100 Junior
4 2 Robert 70 Freshman
1 3 Liz 60 Junior
3 10 John 40 Senior
6 11 Stevie 35 Senior
5 13 Jackie 33 Freshman
going back to the basics :) http://www.statmethods.net/management/sorting.html
so, your code should be:
df <- df[order(-Points, Rank),]
Like m0nhawk pointed out in his comment, you probably have the data as strings. String characters are ordered one at a time.
You need to convert them to numeric first. Also, for decreasing order you need the argument decreasing = TRUE.
df[, "Rank"] <- as.numeric(df[, "Rank"])
df[, "Points"] <- as.numeric(df[, "Points"])
df[order(df[, "Points"], decreasing = TRUE, df[, "Rank"]), ]
If the data type is 'factor' this will not work, though. You can try the following:
df <- as.data.frame(df, stringsAsFactors = FALSE)
And then the three lines above will work.

Sorting Awk array by last name

In my script, I start with a file of campaign contributors and anyone who donates a collective $500 is eligible for a contest. Anyone who meets that criteria I add to an array with an incrementing index to adjust the size as needed. Each index is formatted as outlined below, with the X's being a phone number. In the END portion of the script, I need to sort this array by last name($2) for printing. I've done some searching but come up empty handed. I'm not asking for someone to type the script for me, merely to point me in a better direction of search or offer advice. I need help sorting the array contestants as currently it will be filled properly with the string values the way I need them for the assignment.
Where v1,2, & 3 are the campaign contributions, I am using -F'[ :]' in my command to get both spaces and colons as field separators.
Input File lab4.data
Fname Lname:Phone__Number:v1:v2:v3
Mike Harrington:(510) 548-1278:250:100:175
Christian Dobbins:(408) 538-2358:155:90:201
Susan Dalsass:(206) 654-6279:250:60:50
Archie McNichol:(206) 548-1348:250:100:175
Jody Savage:(206) 548-1278:15:188:150
Guy Quigley:(916) 343-6410:250:100:175
Dan Savage:(406) 298-7744:450:300:275
Nancy McNeil:(206) 548-1278:250:80:75
John Goldenrod:(916) 348-4278:250:100:175
Chet Main:(510) 548-5258:50:95:135
Tom Savage:(408) 926-3456:250:168:200
Elizabeth Stachelin:(916) 440-1763:175:75:300
Array to hold anyone > $500, $8 is created and holds the value $5+$6+$7:
the array is initialized and filled in for loop given below
$8 = $5+$6+$7;
contestants[len++]
Loop to check add people to contestant array.
name and number are arrays that hold their respective values for later use.
for(i=0;i<=NR;i++)if(contrib[i]>500){contestants[len++]= name[i]" "number[i] }
Formatting of indexes(desired array values for contestant[len++]):
[0] Mike Harrington (510) 548-1278
[1] Archie McNichol (206) 548-1348
[2] Guy Quigley (916) 343-6410
[3] Dan Savage (406) 298-7744
[4] John Goldenrod (916) 348-4278
[5] Tom Savage (408) 926-3456
[6] Elizabeth Stachelin (916) 440-1763
Loop to print/check that array has been correctly filled(it is)
for (i=0; i <len; i++) {print contestants[i]}
Output:
Mike Harrington (510) 548-1278
Archie McNichol (206) 548-1348
Guy Quigley (916) 343-6410
Dan Savage (406) 298-7744
John Goldenrod (916) 348-4278
Tom Savage (408) 926-3456
Elizabeth Stachelin (916) 440-1763
Desired Final Output: Ignore formatting as it correctly displays in my terminal I just hard a hard time getting it all nice in here.
***FIRST QUARTERLY REPORT***
***CAMPAIGN 2004 CONTRIBUTIONS***
Name Phone Jan | Feb | Mar | Total Donated
Mike Harrington (510)548-1278 $ 250 $ 100 $ 175 $ 525
Christian Dobbins (408)538-2358 $ 155 $ 90 $ 201 $ 446
Susan Dalsass (206)654-6279 $ 250 $ 60 $ 50 $ 360
Archie McNichol (206)548-1348 $ 250 $ 100 $ 175 $ 525
Jody Savage (206)548-1278 $ 15 $ 188 $ 150 $ 353
Guy Quigley (916)343-6410 $ 250 $ 100 $ 175 $ 525
Dan Savage (406)298-7744 $ 450 $ 300 $ 275 $ 1025
Nancy McNeil (206)548-1278 $ 250 $ 80 $ 75 $ 405
John Goldenrod (916)348-4278 $ 250 $ 100 $ 175 $ 525
Chet Main (510)548-5258 $ 50 $ 95 $ 135 $ 280
Tom Savage (408)926-3456 $ 250 $ 168 $ 200 $ 618
Elizabeth Stachelin (916)440-1763 $ 175 $ 75 $ 300 $ 550
-----------------------------------------------------------------------------
SUMMARY
-----------------------------------------------------------------------------
The campaign received a total of $6137.00 for this quarter.
The average donation for the 12 contributors was $511.42.
The highest total contribution was $1025.00 made by Dan Savage.
***Thank you Dan Savage***
The following people donated over $500 to the campaign.
They are eligible for the quarterly drawing!!
Listed are their names(sorted by last names) and phone numbers.
John Goldenrod (916) 348-4278
Mike Harrington (510) 548-1278
Archie McNichol (206) 548-1348
Guy Quigley (916) 343-6410
Dan Savage (406) 298-7744
Tom Savage (408) 926-3456
Elizabeth Stachelin (916) 440-1763
Thank you all for your continued support!!
Using gawk, this is straightforward to do with the in-built sort functions, e.g.
BEGIN {
data["Jane Doe (123) 456-7890"] = 600;
data["Fred Adams (123) 456-7891"] = 800;
data["John Smith (123) 456-7892"] = 900;
exit;
}
END {
for (i in data) {
split(i,x," ")
data1[x[2] " " x[1] " " x[3] " " x[4]] = i;
}
asorti(data1,sdata1);
for (i in sdata1) {
print data1[sdata1[i]],"\t",data[data1[sdata1[i]]];
}
}
... which produces:
Fred Adams (123) 456-7891 800
Jane Doe (123) 456-7890 600
John Smith (123) 456-7892 900
In plain awk, the same result can be achieved by writing the array indices to a file, sorting that file and then reading the file back using getline.
The way to approach this is to produce the pre-SUMMARY output as you read the data so you don't need to store all of your data in an array, just the people who contributed more than $500 and just insert them into the array in the desired order using an insertion sort algorithm.
You would do it something like this:
awk -F':' '
NR==1 {
print "header stuff"
next
}
{
tot = $3 + $4 + $5
printf "%-20s%10s $%5s $%5s $%5s $%5s\n", $1, $2, $3, $4, $5, tot
}
tot > 500 {
split($1,name,/ /)
surname = name[2]
numContribs++
# insertion sort, check the algorithm:
for (i=1; i<=numContribs; i++) {
if (surname > surnames[i]) {
for (j=numContribs; j>i; j--) {
surnames[j+1] = surnames[j]
contribs[j+1] = contribs[j]
}
surnames[i] = surname
contribs[i] = $1 " " $2
break
}
}
}
END {
print "SUMMARY and text below it and then the list of $500+ contributors:"
for (i=1; i<=numContribs; i++) {
print contribs[i]
}
}
' lab4.data
The above is not a fully functional program. It's just intended to show you the right approach per your request.

Resources