Groupby and count() with alias and 'normal' dataframe: python pandas versus mssql - sql-server

Coming from a SQL environment, I am learning some things in Python Pandas. I have a question regarding grouping and aggregates.
Say I group a dataset by Age Category and count the different categories. In MSSQL I would write this:
SELECT AgeCategory, COUNT(*) AS Cnt
FROM TableA
GROUP BY AgeCategory
ORDER BY 1
The result set is a 'normal' table with two columns, the second column I named Count.
When I want to do the equivalent in Pandas, the groupby object is different in format. So now I have to reset the index and rename the column in a following line. My code would look like this:
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index()
grouped.columns = ['AgeCategory', 'Count']
grouped
My question is if this can be accomplished in one go. Seems like I am over-doing it, but I lack experience.
Thanks for any advise.
Regards, M.

Use parameter name in DataFrame.reset_index:
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index(name='Count')
Or:
grouped = df.groupby('AgeCategory').size().reset_index(name='Count')
Difference is GroupBy.count exclude missing values, GroupBy.size not.
More information about aggregation in pandas.

Related

SQL Select on Records that Meet Criteria from Another Table

I've got a very complex query and trying to give a simple example of one of the sub-tables I'm having problems with, if you need more information or context please let me know.
I've posed a CSV file with some sample data here:
https://drive.google.com/open?id=0B4xdnV0LFZI1dzE5S29QSFhQSmM
We make cakes, and 99% of our cakes are made by us. The 1% is when we have a cake delivered to us from a subcontractor and we 'Receive' and 'Audit' it.
What I wanted to do was to write something like this:
SELECT
Cake.Cake
Instruction.Cake_Instruction_Key
Steps
FROM
Cake
Join Instruction
ON Cake.Cake_Key = Instruction.Cake_Key
JOIN Steps
ON Instruction.Step_Key = Steps.Step_Key
WHERE
MIN(Steps.Step_Key) = 1
This fails because you can't have an aggregate in the WHERE clause.
The desired results would be:
Cake C 13 Receive
Cake C 14 Audit
Cake D 15 Receive
Cake D 16 Audit
Thank you in advance for your help!
Take a look at the HAVING keyword:
https://msdn.microsoft.com/en-us/library/ms180199.aspx
It works more or less the same as the WHERE clause but for aggregate functions after the GROUP BY clause.
Beware however this can be slow. You should try filtering down the number of records as much as possible in the WHERE and even consider using a tempory table to aggregate the data into first.
What you're talking about is the GROUP BY/HAVING clause, so in your case you would need to add something like
GROUP BY Cake.Cake, Instruction.Cake_Instruction_Key, Steps
HAVING MIN(Steps.Step_Key) = 1

Determining Difference Between Items On-Hand and Items Required per Project in Access 2003

I'm usually a PHP programmer, but I'm currently working on a project in MS Access 2003 and I'm a complete VBA newbie. I'm trying to do something that I could easily do in PHP but I have no idea how to do it in Access. The facts are as follows:
Tables and relevant fields:
tblItems: item_id, on_hand
tblProjects: project_id
tblProjectItems: project_id, item_id
Goal: Determine which projects I could potentially do, given the items on-hand.
I need to find a way to compare each project's required items against the items on-hand to determine if there are any items missing. If not, add the project to the list of potential projects. In PHP I would compare an array of on-hand items with an array of project items required, using the array_diff function; if no difference, add project_id to an array of potential projects.
For example, if...
$arrItemsOnHand = 1,3,4,5,6,8,10,11,15
$arrProjects[1] = 1,10
$arrProjects[2] = 8,9,12
$arrProjects[3] = 7,13
$arrProjects[4] = 1,3
$arrProjects[5] = 2,14
$arrProjects[6] = 2,5,8,10,11,15
$arrProjects[7] = 2,4,5,6,8,10,11,15
...the result should be:
$arrPotentialProjects = 1,4
Is there any way to do this in Access?
Consider a single query to reach your goal: "Determine which projects I could potentially do, given the items on-hand."
SELECT
pi.project_id,
Count(pi.item_id) AS NumberOfItems,
Sum(IIf(i.on_hand='yes', 1, 0)) AS NumberOnHand
FROM
tblProjectItems AS pi
INNER JOIN tblItems AS i
ON pi.item_id = i.item_id
GROUP BY pi.project_id
HAVING Count(pi.item_id) = Sum(IIf(i.on_hand='yes', 1, 0));
That query computes the number of required items for each project and the number of those items which are on hand.
When those two numbers don't match, that means at least one of the required items for that project is not on hand.
So the HAVING clause excludes those rows from the query result set, leaving only rows where the two numbers match --- those are the projects for which all required items are on hand.
I realize my description was not great. (Sorry.) I think it should make more sense if you run the query both with and without the HAVING clause ... and then read the description again.
Anyhow, if that query gives you what you need, I don't think you need VBA array handling for this. And if you can use that query as your form's RecordSource or as the RowSource for a list or combo box, you may not need VBA at all.

Django Query: Annotate with `count` of a *window*

I search for a query which is pretty similar to this one. But as an extension, I do not want to count all objects, but just over the ones, that are fairly recent.
In my case, there are two models. Let one be the Source and one be the Data. As result I'd like to get a list of all Sources ordered by the number of data records, that has been collected during the last week.
For me it is not iteresting, how many data records have been collected in total, but if there is a recent activity of that source.
Using the following code snippet from the above link, I cannot make up how to subquery the Data Table before.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
.order_by('-count_data_records')
The only ways I came up with, would be to write native SQL or to process this in a loop and individual queries. Is there a Django-Query version?
(I use a MySQL database and Django 1.5.4)
Checkout out the docs on the order of annotate and filter: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
activity_per_source = Source.objects.\
filter(data__date__gte=one_week_ago).\
annotate(count_data_records=Count('Data')).\
order_by('-count_data_records').distinct()
There is a way of doing that mixing Django queries with SQL via extra:
start_date = datetime.date.today() - 7
activity_per_source = (
Source.objects
.extra(where=["(select max(date) from app_data where source_id=app_source.id) >= '%s'"
% start_date.strftime('%Y-%m-%d')])
.annotate(count_data_records=Count('Data'))
.order_by('-count_data_records'))
The where part will filter the Sources by its Data last date.
Note: replace table and field names with actual ones.

Equivalent of "IN" that uses AND instead of OR logic?

I know I'm breaking some rules here with dynamic SQL but I still need to ask. I've inherited a table that contains a series of tags for each ticket that I need to pull records from.
Simple example... I have an array that contains "'Apples','Oranges','Grapes'" and I am trying to retrieve all records that contain ALL items contained within the array.
My SQL looks like this:
SELECT * FROM table WHERE basket IN ( " + fruitArray + " )
Which of course would be the equivalent of:
SELECT * FROM table WHERE basket = 'Apples' OR basket = 'Oranges' OR basket = 'Grapes'
I'm curious if there is a function that works the same as IN ( array ) except that it uses AND instead of OR so that I can obtain the same results as:
SELECT * FROM table WHERE basket LIKE '%Apples%' AND basket LIKE '%Oranges%' AND basket LIKE '%Grapes%'
I could probably just generate the entire string manually, but would like a more elegant solution if at all possible. Any help would be appreciated.
This is a very common problem in SQL. There are basically two solutions:
Match all rows in your list, group by a column that has a common value on all those rows, and make sure the count of distinct values in the group is the number of elements in your array.
SELECT basket_id FROM baskets
WHERE basket IN ('Apples','Oranges','Grapes')
GROUP BY basket_id
HAVING COUNT(DISTINCT basket) = 3
Do a self-join for each distinct value in your array; only then you can compare values from multiple rows in one WHERE expression.
SELECT b1.basket_id
FROM baskets b1
INNER JOIN baskets b2 USING (basket_id)
INNER JOIN baskets b3 USING (basket_id)
WHERE (b1.basket, b2.basket, b3.basket) = ('Apples','Oranges','Grapes')
There may be something like that in full text search, but in general, I sincerely doubt such an operator would be very useful, outside the conjunction with LIKE.
Consider:
SELECT * FROM table WHERE basket ='Apples' AND basket = 'Oranges'
it would always match zero rows.
If basket is a string, like your example suggests, then the closest you could get would be to use LIKE '%apples%oranges%grapes%', which could be built easily with '%'.implode('%', $tags).'%'
The issue with this is if some of 'tags' might be contained in other tags, e.g. 'supercalifragilisticexpialidocious' LIKE '%super%' will be true.
If you need to do LIKE comparisons, I think you're out of luck. If you are doing exact comparisons invovling matching sets in arbitrary order, you should look into the INTERSECT and EXCEPT options of the SELECT statement. They're a bit confusing, but can be quite powerful. (You'd have to parse your delimited strings into tabular format, but of course you're doing that anyway, aren't you?)
Are the items you're searching for always in the same order within the basket? If yes, a single LIKE should suffice:
SELECT * FROM table WHERE basket LIKE '%Apples%Oranges%Grapes%';
And concatenating your array into a string with % separators should be trivial.

Datastore Query filtering on list

Select all records, ID which is not in the list
How to make like :
query = Story.all()
query.filter('ID **NOT IN** =', [100,200,..,..])
There's no way to do this efficiently in App Engine. You should simply select everything without that filter, and filter out any matching entities in your code.
This is now supported via GQL query
The 'IN' and '!=' operators in the Python runtime are actually
implemented in the SDK and translate to multiple queries 'under the
hood'.
For example, the query "SELECT * FROM People WHERE name IN ('Bob',
'Jane')" gets translated into two queries, equivalent to running
"SELECT * FROM People WHERE name = 'Bob'" and "SELECT * FROM People
WHERE name = 'Jane'" and merging the results. Combining multiple
disjunctions multiplies the number of queries needed, so the query
"SELECT * FROM People WHERE name IN ('Bob', 'Jane') AND age != 25"
generates a total of four queries, for each of the possible conditions
(age less than or greater than 25, and name is 'Bob' or 'Jane'), then
merges them together into a single result set.
source: appengine blog
This is an old question, so I'm not sure if the ID is a non-key property. But in order to answer this:
query = Story.all()
query.filter('ID **NOT IN** =', [100,200,..,..])
...With ndb models, you can definitely query for items that are in a list. For example, see the docs here for IN and !=. Here's how to filter as the OP requested:
query = Story.filter(Story.id.IN([100,200,..,..])
We can even query for items that in a list of repeated keys:
def all(user_id):
# See if my user_id is associated with any Group.
groups_belonged_to = Group.query().filter(user_id == Group.members)
print [group.to_dict() for group in belong_to]
Some caveats:
There's docs out there that mention that in order to perform these types of queries, Datastore performs multiple queries behind the scenes, which (1) might take a while to execute, (2) take longer if you searching in repeated properties, and (3) will up your costs with more operations.

Resources