Is it possible to read categorical columns with pandas' read_csv?

I have tried passing the dtype parameter to read_csv as dtype={n: pandas.Categorical}, but this does not work properly (the resulting dtype is object). The manual is unclear on this.

Since version 0.19.0 you can pass dtype='category' to read_csv:
import pandas as pd

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
df = pd.read_csv(pd.compat.StringIO(data), dtype='category')
print (df)
  col1 col2 col3
0    a    b    1
1    a    b    2
2    c    d    3

print (df.dtypes)
col1    category
col2    category
col3    category
dtype: object
If you want to make only certain columns categorical, pass a dictionary to dtype:
df = pd.read_csv(pd.compat.StringIO(data), dtype={'col1':'category'})
print (df)
  col1 col2 col3
0    a    b    1
1    a    b    2
2    c    d    3

print (df.dtypes)
col1    category
col2      object
col3       int64
dtype: object

Related

How to show the value in Col3 only where Col1 meets a condition, and NULL in Col3 otherwise

SELECT Col1, Col2, Col3
FROM Table
Result set (sample table):
Col1        Col2       Col3
----------- ---------- -----------
Value       Value      Value
Value       Value      Value
Value       Value      Value
Value       Value      Value
Value       Value      Value
Show NULL/empty values in Col3 (assume Col3 allows NULL if needed), except for the rows where the condition on Col1 is true; keep the values in Col2 for all rows.
The condition could be anything, e.g. WHERE Col1 > 2 or WHERE CHARINDEX('x', Col1) > 0, etc.
Desired end result (the condition here is true for rows 2 and 5):
    Col1        Col2       Col3
    ----------- ---------- -----------
1   Value       Value      NULL
2   Value       Value      Value
3   Value       Value      NULL
4   Value       Value      NULL
5   Value       Value      Value
A more concrete way of expressing the question:
SELECT EmployeeID, FirstName, LastName
FROM Employees
Result set:
EmployeeID  FirstName  LastName
----------- ---------- --------------------
1           Nancy      Davolio
2           Andrew     Fuller
3           Janet      Leverling
4           Margaret   Peacock
5           Steven     Buchanan
6           Michael    Suyama
7           Robert     King
Let's say there is a condition on EmployeeID in the query above, and EmployeeIDs 3 and 6 satisfy that condition.
I'm looking to achieve:
EmployeeID  FirstName  LastName
----------- ---------- --------------------
1           Nancy      NULL
2           Andrew     NULL
3           Janet      Leverling
4           Margaret   NULL
5           Steven     NULL
6           Michael    Suyama
7           Robert     NULL
What condition(s), and written how, would achieve this result set?
You may change the "base code" completely.
You don't know the values in the LastName column (or Col3).
You must keep all rows, and the values in the other columns, for the rows where the condition is false.
The table is big.
Another way of putting the question (based on its first paragraph): for a row where the value in Col1 satisfies the condition(s), show the value in Col3; otherwise show NULL/empty. Keep the Col2 values for all rows.
I'm not sure I have understood exactly what you want to do, but could a CASE WHEN in the SELECT solve your problem?
Here's an example:
SELECT
    Col1,
    Col2,
    CASE WHEN (Condition) THEN NULL ELSE Col3 END AS Col3
FROM Table
It would return this output:
Col1        Col2       Col3
----------- ---------- -----------
5           Value      Value
21          Value      NULL
7           Value      Value
8           Value      Value
40          Value      NULL
This way, you conditionally select either the column's value or NULL; swap the THEN and ELSE branches depending on whether you want to keep the value when the condition is true or when it is false.
EDIT: for an explanation of CASE ... WHEN, you can find explanations and examples here:
https://www.w3schools.com/sql/sql_case.asp
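Applied to the Employees example from the question, where the condition is true for EmployeeIDs 3 and 6 and the value should be kept only for those rows, the branches go the other way round. A minimal sketch, using EmployeeID IN (3, 6) as a stand-in for whatever the real condition is:
SELECT
    EmployeeID,
    FirstName,
    CASE WHEN EmployeeID IN (3, 6) THEN LastName ELSE NULL END AS LastName
FROM Employees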
Try this:
Select col1, col2,
Case When col1 (condition) Then col3 Else null End As col3
From Table

Group By and list not matched

I have a table as under:
Col1  Col2              Col3   Col4  Col5
1     50.9499411799115  Point  imp   A
1     109.69487431133   Point  exp   1
1     107.69487431133   Point  exp   2
1     1019.69487431133  Point  exp   B
2     51.5403193833315  Point  imp   0
2     50.5403193833315  Point  exp   3
I want to group by Col1 and select all the Col1 groups where no row has 'A' or 'B' in Col5.
I used the query below to generate the output in MSSQL but didn't get the correct result; can someone point out my mistake?
SELECT Col1
FROM table1
WHERE
Col5 NOT LIKE('%A%')
or Col5 NOT LIKE('%B%')
GROUP BY Col1;
Therefore my output should be
Col1
2
The problem is you're trying to eliminate one row based on data in another, so the only way to do that is to check the other rows, e.g.
select Col1
from table1 D1
where not exists (
    select 1
    from table1 D2
    where D2.Col1 = D1.Col1
      and (D2.Col5 like '%A%' or D2.Col5 like '%B%')
)
group by Col1
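An equivalent approach (not from the answer above, just a common alternative) is conditional aggregation in a HAVING clause: count the offending rows per Col1 group and keep only the groups where that count is zero. A sketch against the same table1:
SELECT Col1
FROM table1
GROUP BY Col1
HAVING SUM(CASE WHEN Col5 LIKE '%A%' OR Col5 LIKE '%B%' THEN 1 ELSE 0 END) = 0;
For the sample data this returns only Col1 = 2.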

SQL Server - Best way to sum 2 fields based on different groupings

Suppose I have a table in SQL Server that looks as follows:
MyTable:
Col1: Col2: Col3: Val1: Val2:
1     2     a     1     1
1     2     a     1     1
1     2     b     1     1
1     2     b     1     1
1     2     c     1     1
1     2     c     1     1
So, I am looking to create a query that returns:
Sum(Val1) based on the values in Col1, Col2 and Col3
AND
Sum(Val2) based on the values in Col1, Col2 only
So, I came up with the following query that can accomplish this:
with MyData as
(
    select distinct
        Col1
        , Col2
        , Col3
        , sum(Val1) over (partition by Col1, Col2, Col3) as Value1
        , sum(Val2) over (partition by Col1, Col2) as Value2
    from
        MyTable
)
select * from MyData
Yielding something along the lines of:
Result:
Col1: Col2: Col3: Value1: Value2:
1     2     a     2       6
1     2     b     2       6
1     2     c     2       6
This works, but it seems terribly inefficient due to the Distinct - Is there a better way to accomplish this?
Thanks!
I wouldn't characterize the DISTINCT as "terribly inefficient", and no, I don't think there's a "better way" to accomplish your desired results.
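That said, if the DISTINCT bothers you, one equivalent rewrite (not necessarily faster, just a different formulation) is to GROUP BY all three columns and apply the window function to the already-grouped sum of Val2:
SELECT Col1,
       Col2,
       Col3,
       SUM(Val1) AS Value1,
       SUM(SUM(Val2)) OVER (PARTITION BY Col1, Col2) AS Value2
FROM MyTable
GROUP BY Col1, Col2, Col3;
On the sample data this yields the same result (Value1 = 2, Value2 = 6 for each row).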

Case Statements in group by, Why doesn't this work? Dynamic Aggregation in SQL Server

I'm trying to create a query which aggregates some data by one of several columns. (I only ever expect one, but I'm very interested in hearing multi-column solutions too!)
I've created the following query, but it doesn't work. For each of the columns it complains that the column
is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I'm very interested in hearing why this is the case, as when the query is run the correct column would always be grouped. Obviously I'm also interested in solutions to my problem too, but I will press on with figuring it out myself.
DECLARE @AggregateColumn int = 1

SELECT
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE 'ERROR'
    END AS ResultName,
    SUM(a.ResultValue) AS ResultValue
FROM
    #Results a
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE NULL
    END
Thank you for your time.
Edit: Thanks for the responses... I'd added an extra typo to my question. Basically I've fixed this issue myself. SQL Server returns very unhelpful error messages if you've done what I did above: ELSE 'ERROR' needs to be ELSE NULL, so that the CASE expression in the SELECT list exactly matches the one in the GROUP BY. You'd never realise that was the problem from the error message SQL Server gives!
Remove the alias in the group by:
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE NULL
    END -- AS ResultName
Basically I've fixed this issue myself. SQL Server returns very unhelpful error messages if you've done what I've done.
Namely, ELSE 'ERROR' needs to be ELSE NULL: the CASE expression in the SELECT list must be identical to the one in the GROUP BY, and with ELSE 'ERROR' in one and ELSE NULL in the other they are different expressions. You'd never realise that was the problem from the error message SQL Server gives, though; it complains about all the other columns.
DECLARE @AggregateColumn int = 1

SELECT
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE NULL
    END AS ResultName,
    SUM(a.ResultValue) AS ResultValue
FROM
    #Results a
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE NULL
    END
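If the aggregation column genuinely needs to be dynamic, another option (plain dynamic SQL, not part of the answers above) is to build the statement with the chosen column name and run it via sp_executesql. A sketch, assuming @AggregateColumn always maps to one of Col1..Col5 in #Results:
DECLARE @AggregateColumn int = 1;
DECLARE @col sysname = CONCAT(N'Col', @AggregateColumn);  -- resolves to Col1 .. Col5
DECLARE @sql nvarchar(max) = N'
    SELECT ResultID, ' + QUOTENAME(@col) + N' AS ResultName, SUM(ResultValue) AS ResultValue
    FROM #Results
    GROUP BY ResultID, ' + QUOTENAME(@col) + N';';
EXEC sys.sp_executesql @sql;
QUOTENAME guards the injected column name, and #Results is still visible inside sp_executesql because it runs in a child scope of the same session.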

Getting number of records against each row using SQL server

I have a table:
col1 col2
---- ----
a    rrr
a    fff
b    ccc
b    zzz
b    xxx
I want a query that returns the number of occurrences of col1 against each row, like:
rows col1 col2
---- ---- ----
2    a    rrr
2    a    fff
3    b    ccc
3    b    zzz
3    b    xxx
since a is repeated 2 times and b is repeated 3 times.
You can try COUNT with an OVER (PARTITION BY ...) clause. PARTITION BY divides the result set produced by the FROM clause into partitions to which the function is applied; without it, the function treats all rows of the query result set as a single group.
Try this...
select count(col1) over (partition by col1) as [rows], col1, col2
from tablename
You can use the OVER clause with aggregate functions like COUNT:
SELECT rows = COUNT(*) OVER(PARTITION BY col1),
       col1,
       col2
FROM dbo.TableName
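If you'd rather avoid window functions entirely, a hedged alternative (assuming the same dbo.TableName) is to aggregate once in a derived table and join it back:
SELECT c.cnt AS [rows], t.col1, t.col2
FROM dbo.TableName AS t
JOIN (
    SELECT col1, COUNT(*) AS cnt
    FROM dbo.TableName
    GROUP BY col1
) AS c
    ON c.col1 = t.col1;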
