Matching and replacing a selection of data from two different dataframes - loops

(First time posting, so please bear with me.) I have two different dataframes, one of which contains a column of replacement data for a selection of the data in the first dataframe.
#dataframe 1
df<-data.frame(site= rep(1:4,3), landings = rep("val",12),
harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"))
#dataframe 2
new_site4<-data.frame(harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"),
sub_site = c("x","x","y","x","y","y","y","x","y","x","y","y") )
I want to replace the "site" in dataframe 1 with the "sub_site" in dataframe 2, matched on "harbour", but I only need to do it for records where site is 4.
Is there a neat way to select only site 4 and then replace the site number with the sub_site, ideally without merging or creating a whole new dataframe? My real dataset is large, but the key is small because it only covers the small selection of the data that needs the sub_site added.
I tried using match() on my main dataset, but for some reason it only matched some of the required data, not all of it, and this code won't work on my sample data either.
#df$site[match(df$harbour, new_site4$harbour)] <- new_site4$sub_site[match(df$harbour, df$harbour)]

Related

No other way than dropping the duplicates, if ValueError: Index contains duplicate entries, cannot reshape?

Hi everyone, this is my first question.
I'm working on a dataset from patients who underwent urine analysis.
Every row refers to a single Patient ID, and every Request ID can cover different types of urine analysis (aspect, colour, number of erythrocytes, bacteria and so on).
I've added an image to show what my dataset looks like.
I'd like to reshape it so that one request = one row, with all the tests done in the same request on the same row.
After that I want to merge it with another df, which I reshaped by Request ID (because the first was missing a "long result" column, which I downloaded from another software package in use in our hospital).
I've tried:
df_pivot = df.pivot(index='Id Richiesta', columns = 'Nome Analisi Elementare', values = 'Risultato')
df_pivot.reset_index(inplace=True)
After that I want to do --> df_merge = pd.merge(df_pivot, df, how='left', on='Id Richiesta')
I tried this once with another dataset, where I had to call drop_duplicates for another purpose, and it worked.
But this time I have to analyse all the features.
What can I do? Is there no other way than dropping the duplicates?
Thank you for any help! :)
I studied my data further and discovered 1 duplicate of bacteria for the same request ID (1 in almost 8 million entries...):
df[df[['Id Richiesta', 'Id Analisi Elementare', 'Risultato']].duplicated()]
Then I looked at all the rows for that "Id Richiesta" and kept the last one (they were identical).
Thank you and sorry.
Please tell me if I should delete this question.
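The resolution described above (find the duplicated rows, keep one copy, then pivot) can be sketched as follows. This is a minimal illustration on invented data, not the asker's dataset: the three rows for request 1 and their values are assumptions, and the duplicate is identified here by request ID plus analysis name, which is the pair that pivot() requires to be unique.

```python
import pandas as pd

# Hypothetical miniature of the long-format data; values are invented.
df = pd.DataFrame({
    "Id Richiesta": [1, 1, 1, 2, 2],
    "Nome Analisi Elementare": ["Colour", "Bacteria", "Bacteria", "Colour", "Bacteria"],
    "Risultato": ["yellow", "absent", "absent", "amber", "present"],
})

# pivot() needs each (index, columns) pair to be unique.
key = ["Id Richiesta", "Nome Analisi Elementare"]

# Show every row that takes part in a duplicated pair.
dupes = df[df.duplicated(subset=key, keep=False)]

# Keep only the last copy of each pair, then reshape to one row per request.
deduped = df.drop_duplicates(subset=key, keep="last")
df_pivot = deduped.pivot(
    index="Id Richiesta",
    columns="Nome Analisi Elementare",
    values="Risultato",
).reset_index()
```

Inspecting dupes first shows whether the duplicated rows really are identical before any of them are discarded.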

Google Sheets Filtering by 2 separate conditions

I'm trying to create a way to most effectively see who is working at a certain day and block of time. So far, I've created my data lists that have the Name of the reps in A, the day available in B, the reps names again in D, and the blocks of time available in E. I've been trying to use FILTER in order for it to pull based on 2 dropdowns. I've tried using all 3 of these:
=filter(Sheet2!A2:A999,Sheet2!B2:B999=A2,Sheet2!E2:E999=B2)
=FILTER(Sheet2!A2:A,(Sheet2!B1:b=A2)+(Sheet2!B2:B=B2)
=Filter(Filter(Sheet2!A2:A999, Sheet2!B2:B999=A2), Sheet2!B2:B999=B2)
But I can't crack it. The last nested formula seems to be working correctly, except that the inner filter returns a different number of rows each time, so I don't know how to avoid the mismatched range error it gets. To be fair, I'm basically trying to create this from scratch, having known nothing about the FILTER function before Saturday. Any ideas on how I can accomplish this with filters would be super helpful.
https://docs.google.com/spreadsheets/d/1WhBSQy4OZFtJvheNHd5ZfIRvcDxvUb-WhdY0mUci3O0/edit?usp=sharing
I found the function =filter(Sheet2!A2:A136, Sheet2!B2:B136 =G2) on your sheet and, based on that, added a condition for the time, and it works.
The working formula is:
=filter(Sheet2!A2:A136, Sheet2!B2:B136 =A2, Sheet2!E2:E136 = B2)
Based on the FILTER documentation:
FILTER(range, condition1, [condition2, ...])
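As a rough illustration of what FILTER does with two conditions, here is a minimal Python sketch: a row is kept only when every condition holds for that same row. The schedule rows, names and the filter_reps helper are invented for the example and are not taken from the linked sheet.

```python
# Hypothetical schedule rows: (rep name, day available, time block available).
schedule = [
    ("Ann", "Monday", "7-730"),
    ("Bob", "Monday", "730-8"),
    ("Cy", "Tuesday", "7-730"),
    ("Ann", "Tuesday", "730-8"),
    ("Dee", "Monday", "7-730"),
]


def filter_reps(rows, day, block):
    """Return the names whose row matches BOTH conditions, row by row."""
    return [
        name
        for name, row_day, row_block in rows
        if row_day == day and row_block == block
    ]


selected = filter_reps(schedule, "Monday", "7-730")
print(selected)  # ['Ann', 'Dee']
```

Because the conditions are checked row by row, all ranges passed to FILTER have to cover the same rows, which is why nesting one FILTER inside another leads to mismatched range sizes.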
First, in your data validation, fix 7-7:30 to 7-730. Then use this:
=ARRAYFORMULA(UNIQUE(QUERY(TO_TEXT({Sheet2!A2:B; Sheet2!D2:E}),
"select Col1 where Col2 matches '"&TEXTJOIN("|", 1, A2:B2)&"'")))

How to find a MoveTo destination filled by database?

I could use some help with an AnyLogic model.
Model (short): Manufacturing scenario in which orders move along an individual route. The workplaces (WP) are created dynamically at simulation start. Their names, quantity and other parameters are stored in a database (Excel import). The orders are also created from an import. The agent population "order" has a collection routing which contains the workplaces it has to stop at, in the specific order.
Target: I want a moveTo block in Main which finds the next destination of the agent order.
Problem and solution paths:
I set the destination type to Agent and in the Agent field I typed a function call, agent.getDestination(). This function is in order and returns the next entry of the collection: WP destinationName = routing.get(i). With this I get a data type error (at run time, not when compiling). I guess it's because the database does not store the entries as type WP but only as String.
Is there a possibility to create a collection with agents from an Excel file?
After this I tried to have the same getDestination return a String and then find, via findFirst, the WP matching the returned name and return it as a WP: WP targetWP = findFirst(wps, w->w.name == destinationName);
Of course wps (the population of workplaces) couldn't be found.
How can I search the population?
Maybe with an Agentlink?
I think it is not that difficult, but I can't find an answer or a solution. As you can tell, I'm a beginner... I hope the description is good and someone can help me or give me a hint :)
Thanks
Is there a possibility to create a collection with agents from an Excel file?
Not directly using the collection's properties and, as you've seen, you can't have database (DB) column types which are agent types. [1]
But this is relatively simple to do directly via Java code (and you can use the Insert Database Query wizard to construct the skeleton code for you).
After this I tried to use the same getDestination as String and so find via findFirst the WP matching the returned name and return it as WP
Yes, this is one approach. If your order details are in Excel/the database, they are presumably referring to workplaces via some String ID (which will be a parameter of the workplace agents you've created from a separate Excel worksheet/database table). You need to use the Java equals method to compare strings though, not == (which is for comparing numbers or whether two objects are the same object).
I want a moveTo block in main which finds the next destination of the agent order
So the general overall solution is
Create a population of Workplace agents (let's say called workplaces in Main) from the DB, each with a String parameter id or similar mapped from a DB column.
Create a population of Order agents (let's say called orders in Main) from the DB and then, in their on-startup action, set up their collection of workplace IDs (type ArrayList, element class String; let's say called workplaceIDsList) using data from another DB table.
Order probably also needs a working variable storing the next index in the list that it needs to go to (so let's say an int variable nextWorkplaceIndex which starts at 0).
Write a function in Main called getWorkplaceByID that has a single String argument id and returns a Workplace. This gets the workplace from the population that matches the ID; a one-line way similar to yours is findFirst(workplaces, w -> w.id.equals(id)).
The MoveTo block (which I presume is in Main) needs to move the Order to an agent defined by getWorkplaceByID(agent.workplaceIDsList.get(agent.nextWorkplaceIndex++)). (The index variable belongs to the Order, hence the agent. prefix. The ++ bit increments the index after evaluating the expression so it is ready for the next workplace to go to.)
For populating the collection, you'd have two tables, something like the below (assuming using strings as IDs for workplaces and orders):
orders table: columns for parameters of your orders (including some String id column) other than the workplace-list. (Create one Order agent per row.)
order_workplaces table: columns order_id, sequence_num and workplace_id (so with multiple rows specifying the sequence of workplace IDs for an order ID).
In the On startup action of Order, set up the skeleton query code via the Insert Database Query wizard as below (where we want to loop through all rows for this order's ID and do something --- we'll change the skeleton code to add entries to the collection instead of just printing stuff via traceln like the skeleton code does).
Then we edit the skeleton code to look like the below. (Note we add an orderBy clause to the initial query so we ensure we get the rows in ascending sequence number order.)
List<Tuple> rows = selectFrom(order_workplaces)
    .where(order_workplaces.order_id.eq(id))
    .orderBy(order_workplaces.sequence_num.asc())
    .list();
for (Tuple row : rows) {
    workplaceIDsList.add(row.get(order_workplaces.workplace_id));
}
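For readers who want to check the logic outside AnyLogic, the steps above can be sketched in plain Python. This is not AnyLogic code: the rows, the Order class and the helper names are invented stand-ins for the order_workplaces table, the Order agent and its nextWorkplaceIndex variable.

```python
# Hypothetical rows of the order_workplaces table:
# (order_id, sequence_num, workplace_id), deliberately out of order.
order_workplaces = [
    ("order1", 2, "wpB"),
    ("order2", 1, "wpC"),
    ("order1", 1, "wpA"),
    ("order1", 3, "wpC"),
]


def workplace_ids_for(order_id, rows):
    """Select this order's rows, sort by sequence number, return the workplace IDs."""
    matching = [row for row in rows if row[0] == order_id]
    matching.sort(key=lambda row: row[1])
    return [row[2] for row in matching]


class Order:
    """Stand-in for the Order agent: a route plus the index of the next stop."""

    def __init__(self, order_id, rows):
        self.workplace_ids = workplace_ids_for(order_id, rows)
        self.next_workplace_index = 0

    def next_workplace_id(self):
        # Same effect as list.get(index++) in Java: read first, then advance.
        workplace_id = self.workplace_ids[self.next_workplace_index]
        self.next_workplace_index += 1
        return workplace_id


order = Order("order1", order_workplaces)
first_stop = order.next_workplace_id()
second_stop = order.next_workplace_id()
```

The sort step plays the role of the orderBy clause: without it, the route would follow whatever order the rows happen to come back in.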
[1] The AnyLogic database is a normal relational database --- HSQLDB in fact --- and databases only understand their own specific data types like VARCHAR, with AnyLogic and the libraries it uses translating these to Java types like String. In the user interface, AnyLogic makes it look like you set the column types as int, String, etc. but these are really the Java types that the columns' contents will ultimately be translated into.
AnyLogic does support columns which have option list types (and the special Code type column for columns containing executable Java code) but these are special cases using special logic under the covers to translate the column data (which is ultimately still a string of characters) into the appropriate option list instance or (for Code columns) into compiled-on-the-fly-and-then-executed Java.
Welcome to Stack Overflow :) To create a population via Excel import, you have to create a function and call code like this. You also need an empty population.
int n = excelFile.getLastRowNum(YOUR_SHEET_NAME);
for (int i = FIRST_ROW; i <= n; i++) {
    String name = excelFile.getCellStringValue(YOUR_SHEET_NAME, i, 1);
    double SEC_PARAMETER_TO_READ = excelFile.getCellNumericValue(YOUR_SHEET_NAME, i, 2);
    WP workplace = add_wps(name, SEC_PARAMETER_TO_READ);
}
Now, if you want to get a workplace by name, you have to create a function similar to your attempt.
Function body:
WP workplaceToFind = wps.findFirst(w -> w.name.equals(destinationName));
if (workplaceToFind != null) {
    // do whatever you want
}

Determining Difference Between Items On-Hand and Items Required per Project in Access 2003

I'm usually a PHP programmer, but I'm currently working on a project in MS Access 2003 and I'm a complete VBA newbie. I'm trying to do something that I could easily do in PHP but I have no idea how to do it in Access. The facts are as follows:
Tables and relevant fields:
tblItems: item_id, on_hand
tblProjects: project_id
tblProjectItems: project_id, item_id
Goal: Determine which projects I could potentially do, given the items on-hand.
I need to find a way to compare each project's required items against the items on-hand to determine if there are any items missing. If not, add the project to the list of potential projects. In PHP I would compare an array of on-hand items with an array of project items required, using the array_diff function; if no difference, add project_id to an array of potential projects.
For example, if...
$arrItemsOnHand = 1,3,4,5,6,8,10,11,15
$arrProjects[1] = 1,10
$arrProjects[2] = 8,9,12
$arrProjects[3] = 7,13
$arrProjects[4] = 1,3
$arrProjects[5] = 2,14
$arrProjects[6] = 2,5,8,10,11,15
$arrProjects[7] = 2,4,5,6,8,10,11,15
...the result should be:
$arrPotentialProjects = 1,4
Is there any way to do this in Access?
Consider a single query to reach your goal: "Determine which projects I could potentially do, given the items on-hand."
SELECT
    pi.project_id,
    Count(pi.item_id) AS NumberOfItems,
    Sum(IIf(i.on_hand='yes', 1, 0)) AS NumberOnHand
FROM
    tblProjectItems AS pi
    INNER JOIN tblItems AS i
    ON pi.item_id = i.item_id
GROUP BY pi.project_id
HAVING Count(pi.item_id) = Sum(IIf(i.on_hand='yes', 1, 0));
That query computes the number of required items for each project and the number of those items which are on hand.
When those two numbers don't match, that means at least one of the required items for that project is not on hand.
So the HAVING clause excludes those rows from the query result set, leaving only rows where the two numbers match --- those are the projects for which all required items are on hand.
I realize my description was not great. (Sorry.) I think it should make more sense if you run the query both with and without the HAVING clause ... and then read the description again.
Anyhow, if that query gives you what you need, I don't think you need VBA array handling for this. And if you can use that query as your form's RecordSource or as the RowSource for a list or combo box, you may not need VBA at all.
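To see the counting logic of that query in isolation, here is a minimal Python sketch that uses the example data from the question. It is an illustration only, not Access code; the variable names are invented, and the on-hand items are held as a set of IDs instead of an on_hand column.

```python
# Example data from the question.
items_on_hand = {1, 3, 4, 5, 6, 8, 10, 11, 15}
project_items = {
    1: [1, 10],
    2: [8, 9, 12],
    3: [7, 13],
    4: [1, 3],
    5: [2, 14],
    6: [2, 5, 8, 10, 11, 15],
    7: [2, 4, 5, 6, 8, 10, 11, 15],
}

potential_projects = []
for project_id, required in project_items.items():
    number_of_items = len(required)  # Count(pi.item_id)
    number_on_hand = sum(1 for item in required if item in items_on_hand)  # Sum(IIf(...))
    # HAVING clause: keep the project only when nothing is missing.
    if number_of_items == number_on_hand:
        potential_projects.append(project_id)

print(potential_projects)  # [1, 4]
```

The two counters correspond to the NumberOfItems and NumberOnHand columns, and the final comparison is the HAVING condition.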

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie at query optimisation. Showing the result in the browser takes a full minute, even though only 45,000 records have to be accessed. Could you please suggest ways to reduce the time needed to show the results?
I wrote the following query to find the call duration for people in an age group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
           for i in range(popn)]
for card in card_list:
    dic=Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma+=dic['duration__sum']
avgDur=sigma/popn
The above code is inside a for loop that iterates over the age groups.
The model is as follows:
class Demo(models.Model):
    card_no=models.CharField(max_length=20,primary_key=True)
    gender=models.IntegerField()
    age=models.IntegerField()
    age_group=models.IntegerField()

class Fact_table(models.Model):
    pri_key=models.BigIntegerField(primary_key=True)
    card_no=models.CharField(max_length=20)
    duration=models.IntegerField()
    time_8bit=models.CharField(max_length=8)
    time_of_day=models.IntegerField()
    isBusinessHr=models.IntegerField()
    Day_of_week=models.IntegerField()
    Day=models.IntegerField()
Thanks
Try this:
from django.db.models import Sum

sigma = 0
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # One
card_list = demo_by_age.values_list('card_no', flat=True)  # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # Three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
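The cost difference between one query per card and a single aggregate query can be shown outside Django with the standard-library sqlite3 module. This is a sketch under assumptions: the table and column names mirror the models in the question, while the rows and the two helper functions are invented for illustration.

```python
import sqlite3

# In-memory database with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demo (card_no TEXT PRIMARY KEY, age_group INTEGER);
CREATE TABLE fact_table (pri_key INTEGER PRIMARY KEY, card_no TEXT, duration INTEGER);
INSERT INTO demo VALUES ('c1', 1), ('c2', 1), ('c3', 2);
INSERT INTO fact_table (card_no, duration)
VALUES ('c1', 10), ('c1', 20), ('c2', 30), ('c3', 40);
""")


def avg_duration_many_queries(age_group):
    """One query for the cards, then one more query per card (N + 1 in total)."""
    cards = [row[0] for row in conn.execute(
        "SELECT card_no FROM demo WHERE age_group = ?", (age_group,))]
    total = 0
    for card in cards:
        (card_sum,) = conn.execute(
            "SELECT COALESCE(SUM(duration), 0) FROM fact_table WHERE card_no = ?",
            (card,)).fetchone()
        total += card_sum
    return total / len(cards)


def avg_duration_one_query(age_group):
    """A single query: join both tables and aggregate inside the database."""
    total, popn = conn.execute(
        """
        SELECT COALESCE(SUM(f.duration), 0), COUNT(DISTINCT d.card_no)
        FROM demo AS d
        LEFT JOIN fact_table AS f ON f.card_no = d.card_no
        WHERE d.age_group = ?
        """,
        (age_group,)).fetchone()
    return total / popn


slow_result = avg_duration_many_queries(1)
fast_result = avg_duration_one_query(1)
```

Both functions return the same number; the second gets there with one round trip to the database regardless of how many cards are in the age group.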
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
from django.db.models import Avg

age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print("Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg']))
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.
