How Can I Find Unique Values in this Dataframe with pandas?

I have a dataframe (image attached for reference), which is a list of venues in neighborhoods of Toronto.
For each venue the neighborhood name is listed, as well as the venue type (I got rid of everything else).
I need to find a way to grab the total number of unique venue types in each neighborhood. So for example, if there are 8 coffee shops and 2 restaurants, the value returned should be 2. If there's 1 coffee shop, 1 restaurant and 1 laundromat, the value should be 3, etc.
Does anyone know how to do this?

Try using 'groupby' and 'nunique':
df.groupby('Neighbourhood')['Venue Category'].nunique()
This returns the number of distinct 'Venue Category' values for each 'Neighbourhood'.
Hope this solves your question :)
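To make that concrete, here is a minimal sketch on made-up data. The neighbourhood names and venue rows are invented for illustration, and the column names are simply the ones used in the answer above, so adjust them to whatever your dataframe actually uses.

```python
import pandas as pd

# Hypothetical sample data; column names follow the answer above.
df = pd.DataFrame({
    "Neighbourhood": ["Downtown", "Downtown", "Downtown",
                      "Midtown", "Midtown", "Midtown"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Restaurant",
                       "Coffee Shop", "Restaurant", "Laundromat"],
})

# Count the distinct venue categories within each neighbourhood.
unique_counts = df.groupby("Neighbourhood")["Venue Category"].nunique()
print(unique_counts)
```

The result is a Series indexed by neighbourhood, so unique_counts["Downtown"] gives the count for a single neighbourhood.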

Try using pandas' unique() function.
To list the unique values in the df['neighborhood'] column:
df.neighborhood.unique()
This returns an array of the distinct values in that column.

Related

In MongoDB/React what is the best practice for filtering data?

I came here today with a theoretical question. (Hint: it will be long and tough, but to fully explain the problem I think I have to write down every important detail. If you read it to the end, huge thanks to you; you're not the hero we deserved but the hero we needed.)
Story time: I'm currently building an online shop from scratch. It works on the same principle as eBay: users can create advertisements for their used products. The problem is that I want to create a filtering feature.
What is my MongoDB data structure?
My page has products with different attributes; by this I mean that the products have varying categories and values. Here is an example:
Product A:
Creator:User1
Category:Car
Type:BMW
Color:Red
Product B:
Creator:UserB
Category:Electronics
Type:Phone
Producer:Apple
To make it more complex, each user can define up to 3 extra categories and values for each product. So, for example, User1 adds 2 new categories and the final product will be:
Product A:
Creator:User1
Category:Car
Type:BMW
Color:Red
Number of seats:4
Fuel type: Gasoline
Because of this, when a user adds a new product there will be two types of categories: the static ones, which are predefined by me (Category, Type, Color in the car's case), and the dynamic ones, which the user adds (Number of seats, Fuel type or anything else).
Overall: my final data structure in MongoDB is not static, since there are some user-added categories. Because of this I have a Product collection, and each document looks like the example above.
How are the items shown?
I have a main page. When I populate it I make a call with $skip and with $limit set to 8, so the first time I only query 8 products. If a user clicks the Load More button it loads another 8 products, and so on.
FINALLY: My actual question ...
So at this point I guess you understand everything related to the business logic, so it's time for my question: I want to filter these dynamic products, but I don't know what the best practice for it is.
My idea:
First, create a MongoDB collection named Categories. Each main category will be a document in it, storing both the static and the dynamic categories and their values,
ex:
category:car
predefined:[{type:[BMW,Mustang,Ferrari]},{color:[red,green,blue]}]
userdefined:[{number of seats:[2,4,5,6]},{fuel type:[Gasoline,Air,Diesel]}]
We load the values on the main page. If a user clicks a specific value, e.g. BMW, we set a limit of 8, go through our Product collection and get the 8 items which have Type:BMW. If they select another option, e.g. Color:Red, we query the collection again, but now with two criteria: Type:BMW and Color:Red.
Idea2: Create a Category collection again with this structure
categoryType:predefined
mainCategory:Car
categoryName:Type
BMW:[prodA, prodC,prodD]
Ferrari:[prodD,prodE]
...values:products which contains
categoryType:userdefined
mainCategory:Car
categoryName:Number of seats
4:[prodA, prodD],
5:[prodE]
If a user selects BMW from the Type category, we load the products from the BMW field: [prodA,prodC,prodD]. If the user then selects the Number of seats category with a value of 4, we load [prodA,prodD], and on the webpage we filter against the products we already have, so only [prodA,prodD] remains. From that final list we use findById to fetch the specific products.
I don't think these are the best options from any perspective, but I am really confused.
How do you think I should structure my categories/products to get efficient read/write/update complexity?
Anyway, thank you for reading this, and if you made it this far I'm curious about your ideas. Have a nice day.
UPDATE:
The filtering functionality
To avoid any confusion, this is my filtering idea: when a user selects a main category, for example Car or Electronics, I want to show only the relevant filtering categories and options. The filtering categories in the car's case are Type and Color.
I want these filtering categories to have pre-populated options. By this I mean that if a filtering category is Type, and there are 2 products which have Type:BMW and Type:Ferrari, I want to show these values as options for filtering. I don't want to hardcode these options; for example, I don't want a hardcoded Type:Lamborghini when I have no products with type Lamborghini.
In the end, if a user clicks on Type:BMW, I will filter all of my products based on that criterion.
My filtering side menu will look like this:
Type: BMW,Ferrari (these values exist in my database)
Color:Red,Black,Grey,Yellow
For user-added categories I will build a search bar. If a user selects a user-defined category, I want to add it to the filtering categories, so the overall menu would look like this:
Type: BMW,Ferrari (these values exist in my database)
Color:Red,Black,Grey,Yellow
Number of seats:4,6,7 (the Number of seats category is added by a user; 4,6,7 are the existing values for this category)

You could structure your data as a single generic Products collection, holding both
Product A:
Creator:User1
Category:Car
Type:BMW
Color:Red
Product B:
Creator:UserB
Category:Electronics
Type:Phone
Producer:Apple
rows. Whenever you show the filter component, you can select the available categories by using an Aggregate (https://stackoverflow.com/a/43570730/1859959)
This would generate search boxes like "Creator", "Category", "Type", "Color", "Producer".
The data itself would be as generic as possible.
When a user wants to add a new product, they start from a template, like "Car" or "Electronics". The Templates collection gives them the initial attributes and values that should be included. So it would be like:
{Car: [{type:[BMW,Mustang,Ferrari]},{color:[red,green,blue]}],
Electronics: ... }
Selecting a Car would generate the "type" and "color" input boxes. Saving the form would insert the new row into Products.
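As a rough illustration of deriving the filter options from the products themselves, here is a small in-memory sketch in plain Python. It only stands in for the server-side aggregation linked above; the sample documents, the second car, and the filter_options helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical product documents, shaped like the examples above.
products = [
    {"Creator": "User1", "Category": "Car", "Type": "BMW", "Color": "Red",
     "Number of seats": 4},
    {"Creator": "User2", "Category": "Car", "Type": "Ferrari", "Color": "Black"},
    {"Creator": "UserB", "Category": "Electronics", "Type": "Phone",
     "Producer": "Apple"},
]

def filter_options(docs, category, skip=("Creator", "Category")):
    """Collect, per attribute, the distinct values present in one main category."""
    options = defaultdict(set)
    for doc in docs:
        if doc.get("Category") != category:
            continue
        for key, value in doc.items():
            if key not in skip:
                options[key].add(value)
    # Sort by string form so mixed value types stay comparable.
    return {key: sorted(values, key=str) for key, values in options.items()}

car_filters = filter_options(products, "Car")
print(car_filters)
```

Because the options are computed from existing documents, a value with no products never shows up in the menu, and user-added attributes such as "Number of seats" appear without any hardcoding.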

Generate Distinct List and Sum On Conditional VLookUp

Sample Data Sets
I'm trying to write a formula that can fill in revenue amounts in Table3. What I need is for it first to search through Table1 and return an array with each unique ID referenced by the domain.
So for a.com it would return {1,2,3}.
It would then find the revenue value in Table2 that is associated with that ID.
For the array above it would be {100,200,400}.
It would then sum those values to arrive at {700} which would populate cell H3 or [#[Sum of Revenue]] for "a.com".
I've tried using sumproduct and combining it with variations of
{=INDEX(Table1[ID],MATCH(0,COUNTIF($H$2:H2,Table1[ID])+(Table1[Domains]<>$G$3),0))}
But I'm having trouble arriving at the full array rather than just the first, second, etc. unique value in it. I also don't know how to end up with the revenue values in the array instead of just the IDs. Any help is appreciated.

This is what ended up working:
{=SUM(SUMIF(Table2[ID],IF(FREQUENCY(IF(Table1[Domain]=[#Domain],Table1[ID]),Table1[ID]),Table1[ID]),Table2[Revenue]))}
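For anyone who wants to check the logic outside Excel, here is a minimal Python sketch of the same steps: collect the unique IDs for a domain, look up each ID's revenue, and sum. The sample rows are made up to match the a.com example in the question, and revenue_for is a hypothetical helper name.

```python
# Hypothetical stand-ins for Table1 (Domain, ID) and Table2 (ID -> Revenue).
table1 = [("a.com", 1), ("a.com", 2), ("a.com", 2), ("a.com", 3), ("b.com", 3)]
table2 = {1: 100, 2: 200, 3: 400}

def revenue_for(domain):
    # Step 1: unique IDs referenced by the domain (duplicates collapse in a set).
    unique_ids = {id_ for dom, id_ in table1 if dom == domain}
    # Step 2 and 3: look up each ID's revenue and add them up.
    return sum(table2.get(id_, 0) for id_ in unique_ids)

print(revenue_for("a.com"))
```

With this data a.com resolves to IDs {1, 2, 3} and a total of 700, the same figure as in the question.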

Index match function with multiple criteria and duplicate values

I am trying to work this out using the INDEX and MATCH functions (I am relatively new to them).
I have a database with multiple criteria that also contains duplicate values. I have 3 criteria for the INDEX/MATCH:
salesman name (with duplicates in the field), month, and value greater than 0. I have to select store names based on the salesperson's name, the month and the value.
Outlet Name | Salesman Name | Month | Value
Outlet ABC | Tom | Jan | 1
Outlet BCD | Tom | jan | 2
Outlet XYZ | Marc | Feb | 1
Outlet UTR | Tom | Mar | 0
How can I use INDEX/MATCH to select the rows with salesman "Tom", month "Jan" and value > 0?
Note: the salesman name and month are input fields entered on each row, whereas the value criterion is fixed at ">0".

Assuming that you just want to select the first outlet in the case of duplicates...
You can use INDEX/MATCH/INDEX to select based on multiple criteria:
=INDEX($A$2:$A$5,MATCH(1,INDEX(($B$2:$B$5=$F2)*($C$2:$C$5=$G2)*($D$2:$D$5>0),0),0))
This is essentially the same as a normal INDEX/MATCH; however, it uses a second INDEX to find the position where all the required criteria are true.
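The multiply-the-conditions trick can be sketched in Python to show why MATCH(1, ...) lands on the first row that satisfies everything. This is only an illustration of the logic, not Excel itself; the rows come from the question and first_outlet is a hypothetical helper name. Text is compared case-insensitively here because Excel's = comparison ignores case.

```python
# Rows from the question: (outlet, salesman, month, value).
rows = [
    ("Outlet ABC", "Tom", "Jan", 1),
    ("Outlet BCD", "Tom", "jan", 2),
    ("Outlet XYZ", "Marc", "Feb", 1),
    ("Outlet UTR", "Tom", "Mar", 0),
]

def first_outlet(salesman, month):
    for outlet, name, row_month, value in rows:
        # Each test is True/False (1/0); the product is 1 only if all three hold.
        flag = ((name.lower() == salesman.lower())
                * (row_month.lower() == month.lower())
                * (value > 0))
        if flag == 1:
            return outlet  # like MATCH(1, ..., 0): stop at the first hit
    return None  # no row satisfies every criterion

print(first_outlet("Tom", "Jan"))
```

As in the formula, only the first matching outlet is returned, so duplicates further down the list are ignored.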
If you would like to return both outlets in the case of duplicates, I would suggest adding a helper column ("Outlet ID") and use a MINIFS formula to select each. Let me know if you need this suggestion clarifying.
EDIT: UPDATE BASED ON FURTHER COMMENTS
If I understand correctly, you would like a table where you can input the salesman and month, and have it return a list of outlets where the value is >0.
Based on this, I think that your best bet would be to add a helper column to the list of data. This helper column will give each distinct outlet a unique ID:
=IF(
MINIFS($A$2:$A2,$B$2:$B2, $B3)=0,
MAX($A$2:$A2)+1,
MINIFS($A$2:$A2,$B$2:$B2,$B3)
)
This formula will check if an ID has already been assigned to the current outlet. If it has, it will use that one. If it hasn't, it will check what the MAX ID used so far is, and add 1 to it.
Now, we can use these IDs to produce a dynamic list of IDs associated with a particular Salesman/Month/Value:
The main formula here is:
MINIFS($A$3:$A$7,$C$3:$C$7,$G$3,$D$3:$D$7,$H$3,$E$3:$E$7,">"&0,$A$3:$A$7,">"&MAX($G$6:$G6))
The rest is just to remove the zeros which are returned when we do not have any more IDs in our list.
The MINIFS formula selects the lowest ID where the salesman is the one we want AND the month is the one we want AND the value is >0 AND the ID is greater than any which already appear in our list.
Once we have our list of IDs, it is a simple matter to retrieve the name of the outlets using an INDEX/MATCH. We can then hide the IDs entirely if we wish.
=IFNA(INDEX($B$3:$B$7,MATCH($G7,$A$3:$A$7,0)),"")
The IFNA is just to prevent errors showing on empty lines. If you preferred, you could do something like this instead:
=IF($G7="","",INDEX($B$3:$B$7,MATCH($G7,$A$3:$A$7,0)))
Please let me know if you need me to expand any more.

Filling in an array from a list with uneven categories

I have data in the format of an ID number, and a geographic area (MSOA) which is in the format in the image below.
List of Data:
As you can see, there are differing numbers of MSOA for each ID, hence the list is not evenly spaced.
I need to use the data to fill in an array with IDs in the columns and MSOAs in the rows. I have all MSOAs in the rows (not just the ones matched with IDs) and need the array to display a 1 or a 0 depending on whether the ID has a match for that MSOA.
I've attached an image of what sort of output I need:
Array output:
Can anybody help with some code to achieve this?
Cheers

You can use a formula:
=IF(AND(E$1=$A2,$D2=$B2),1,0)
or a pivot table.
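For comparison, the target grid (1 where an ID has a given MSOA, 0 otherwise) can be sketched in a few lines of Python. The ID numbers and MSOA codes below are invented for illustration; the point is only that the full MSOA list drives the rows, so unmatched MSOAs still appear with all zeros.

```python
# Hypothetical (ID, MSOA) pairs in the uneven list format from the question.
pairs = [(1, "E02000001"), (1, "E02000002"), (2, "E02000002"), (3, "E02000003")]

# Full list of MSOAs for the rows, including one that no ID matches.
all_msoas = ["E02000001", "E02000002", "E02000003", "E02000004"]

ids = sorted({id_ for id_, _ in pairs})  # column headers
present = set(pairs)                     # fast membership test

# One row per MSOA, one 0/1 entry per ID column.
matrix = {
    msoa: [1 if (id_, msoa) in present else 0 for id_ in ids]
    for msoa in all_msoas
}

for msoa, row in matrix.items():
    print(msoa, row)
```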

couchdb map / reduce multiple keys filtering by date

I have a view setup with a map reduce. Right now this code works great:
function(doc) {
  if (doc.type == 'test') {
    if (doc.trash != 1) {
      for (var id in doc.items) {
        emit([id, doc.items[id].name], 1);
      }
    }
  }
}
function(keys, prices) {
  return sum(prices);
}
I get a return and when using the group parameter, it condenses everything just fine.
My issue/question: I want to add a third key, DATE, so that I can reduce only the records from certain dates. So for example:
function(doc) {
  if (doc.type == 'test') {
    if (doc.trash != 1) {
      for (var id in doc.items) {
        emit([date, id, doc.items[id].name], 1);
      }
    }
  }
}
My issue is that since date is at the beginning of the array, the reduce groups by date, id etc. I know I can use group_level and say just take the first key from the array or the first 2 keys, but that doesn't help either because, as far as I know, group_level goes from left to right in the array. I could put the date at the end of the emit array, but that doesn't help either because I need to have values at the beginning of my startkey and endkey to search on.
Here is an example of the output of data:
{"key":["2012-03-13","356752b8a5f6871f3","Apple"],"value":1},
{"key":["2012-03-20","123752b8a76986857","Pear"],"value":1},
{"key":["2012-04-12","3013531de05871194","Grapefruit"],"value":1},
{"key":["2012-04-12","356752b8a5f6871f3","Apple"],"value":1},
I want APPLE to be added up in one row, but here it's adding up apples by date first. I was able to add up all the apples successfully when I removed DATE as the first key in the array, but then I can't search by date range.
Any ideas on how to accomplish this?

If I understand correctly what you want to do, then you'd want to put the date as the first element of your array, and use group_level as well as startkey and endkey.
Eg. startkey=[1, "someid"] endkey=[1,"someid",{}] group_level=2
Will get you all items from date 1 (obviously choose your own format here), with id "someid" and any name. It seems funny that you emit id's before names, and without having more information about what you're actually trying to accomplish, it's hard to advise your general data model. If ID is a "type" id meaning that many items share the same ID then this makes sense. If ID is a unique per item ID, then it does not. In that case, you'd want to emit "name" before ID...
Edit 1
As per your comment, to do a range of dates you do this:
startkey=[1] endkey=[5,{}] group_level=2
You will get everything from date 1 to date 5 grouped by id, i.e. apples, oranges etc. I use this exact technique in a very large scale production application. I actually formatted the dates as easily human-readable integers of the format yyyymmdd, so 20140624 would sort to the top. If I want everything from the start of the month till now grouped by my group ids, I call
startkey=[20140601] endkey=[20140624,{}] group_level=2
It works perfectly and as far as I can tell that's what you're looking to do. I also have a third key layer "detail" which allows me to provide a deeper level of grouping for items that need it. I can then call
startkey=[20140601, "someid"] endkey=[20140624, "someid",{}] group_level=3
To drill to the detail level for a particular id, or just use the previous query with group_level=3 if I want the details for every id. I'm certain you can make this work - I've solved this exact problem in a production application using the techniques described.
Edit 2
If you want to group all apples regardless of date, then you'll need to let apples be the first element in the key. You can then get all apples over all time as a single row in the view result using group_level=1, and Apples over a date range using group_level=2. The difference here is that you'll only be able to do the group_level=2 query on a single item type at a time. If you want the best of both worlds, you unfortunately just need to make 2 views. That's just how key ordering works... If you need fast response times for both types of queries, all item types over a date range, and all of a particular item not grouped by date, I believe 2 views is the only way to achieve that.
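To see why the key order matters, here is a rough in-memory model in Python of a summing reduce with group_level over sorted keys. It is not CouchDB: reduce_view and HIGH are hypothetical names, HIGH merely plays the role of {} as a "sorts after everything" sentinel, and the item-first key order [id, date, name] is just one possible layout. The rows are the ones from the question.

```python
from itertools import groupby

# Rows as the map function emits them: (key, value), data from the question.
rows = [
    (["2012-03-13", "356752b8a5f6871f3", "Apple"], 1),
    (["2012-03-20", "123752b8a76986857", "Pear"], 1),
    (["2012-04-12", "3013531de05871194", "Grapefruit"], 1),
    (["2012-04-12", "356752b8a5f6871f3", "Apple"], 1),
]

HIGH = "\uffff"  # stand-in for {} in an endkey: sorts after ordinary strings

def reduce_view(rows, group_level, startkey=None, endkey=None):
    """Sum values per key prefix of length group_level within a key range."""
    selected = sorted(
        (key, value) for key, value in rows
        if (startkey is None or key >= startkey)
        and (endkey is None or key <= endkey)
    )
    return [
        (list(prefix), sum(value for _, value in group))
        for prefix, group in groupby(
            selected, key=lambda row: tuple(row[0][:group_level]))
    ]

# Date first: grouping runs left to right, so Apple stays split by date.
date_first = reduce_view(rows, group_level=2)

# Item id first: one row per item over all time with group_level=1.
item_first_rows = [([k[1], k[0], k[2]], v) for k, v in rows]
per_item = reduce_view(item_first_rows, group_level=1)

# Item id first, one item within a date range.
apple_id = "356752b8a5f6871f3"
apple_april = reduce_view(
    item_first_rows, group_level=1,
    startkey=[apple_id, "2012-04-01"], endkey=[apple_id, "2012-04-30", HIGH])

print(date_first, per_item, apple_april, sep="\n")
```

In this model the date-first view yields four rows of 1, while the item-first view collapses Apple into a single row of 2 but can only take a date range for one item id at a time, which is the trade-off described above.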
Note
Another thing to note concerns your reduce function. Wherever possible, it is highly recommended that you use the built-in reduce functions. They're implemented in Erlang and are highly optimized compared to custom JavaScript reduce functions.
In your case, just replace your reduce function with this
_sum
Easy hey?
If you post more info about your application, data model etc. then I'd be happy to help out more with your database design.