PySpark inner join and intersect on two DataFrames from the same source is broken

I am using Spark 2.0.0. When I try to intersect two DataFrames from the same source, the answer is not as expected. The same happens with an inner join.
df = sqlContext.read.parquet("path to hdfs file").select('col1','col2')
w1 = df.filter(df.col1 == 1)
w2 = df.filter(df.col2 == 2)
w1.select('col1').intersect(w2.select('col1')).show()
#Answer is just w1
w2.select('col1').intersect(w1.select('col1')).show()
#Answer is just w2
w1.intersect(w2).show()
#This should give empty data frame but output is w1
w1.join(w2, w1.col1 == w2.col2)
#this gives output as if it were w1.join(w1, 'col1')
This behavior happens only when I read through HDFS. If I construct a DataFrame in the program itself, it behaves as expected.
I also tried using an alias, as suggested in this link, as a temporary fix, but it still does not work:
http://search-hadoop.com/m/q3RTtOaknEYeLLi&subj=Re+Join+on+DataFrames+from+the+same+source+Pyspark+

Related

How to check df rows that have a difference between 2 columns and then send them to another table to verify information

I'm very new to Python and have been trying hard these last few days to figure out how to go through a df row by row and check each row that has a difference between columns dQ and dCQ. I just used != 0 since the value could be positive or negative. If this is true, I would like to check in another table whether certain criteria are met. I'm used to working in R, where I could store the df in a variable and refer to columns by name; I can't seem to find a way to do that in Python. I've posted all of the code I've been playing with. I know it's messy, but any help would be appreciated. Thank you!
I've tried installing different packages that wouldn't work, and I tried making a for loop (I failed miserably); maybe a function? I'm not sure where to even look. I've never learned Python; I'm really doing my best watching videos online and reading on here.
import pyodbc
import pymysql
import pandas as pd
import numpy as np

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=***-***-***.****.***.com;"
                      "Database=****;"
                      "Trusted_Connection=no;"
                      "UID=***;"
                      "PWD=***")

# cur = conn.cursor()
# cur.execute("SELECT TOP 1000 tr.dQ, po.dCQ, tr.dQ - po.dCQ as diff "
#             "FROM [IP].[dbo].[vT] tr (nolock) "
#             "JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
#             "WHERE tr.dQ != po.dCQ")
# query = cur.fetchall()

query = ("SELECT TOP 100 tr.dQ, po.dCQ /*, tr.dQ - po.dCQ as diff */ "
         "FROM [IP].[dbo].[vT] tr (nolock) "
         "INNER JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
         "WHERE tr.dQ != po.dCQ")
df = pd.read_sql(query, conn)
#print(df[2,])

# note: pyodbc cursors don't take a pymysql cursorclass argument
cursor = conn.cursor(pymysql.cursors.DictCursor)
cursor.execute("SELECT TOP 100 tr.dQ, po.dCQ /*, tr.dQ - po.dCQ as diff */ "
               "FROM [IP].[dbo].[vT] tr (nolock) "
               "INNER JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
               "WHERE tr.dQ != po.dCQ")
result_set = cursor.fetchall()
for row in result_set:
    print("%s, %s" % (row["name"], row["category"]))
# if df[3] != 0:
# diff = df[1]-df[2]
# print(diff)
# else:
# exit
# cursor = conn.cursor()
# for row in cursor.fetchall():
# print(row)
#
# for record in df:
# if record[1] != record[2]:
# print(record[3])
# else:
# record[3] = record[1]
# print(record)
# df['diff'] = np.where(df['dQ'] != df["dCQ"])
I expect some sort of notification that there's a difference in row xx, and then it will check in table vP to verify we received this data's details. I believe I can get to this point if I can get the first part working. Any help is appreciated. I'm sorry if this question is not clear; I will do my best to answer any questions someone may have. Thank you!
One solution could be to make a new column where you store the result of the diff between df[1] and df[2]. One note first: it might be more precise either to name your columns when you make the df and then reference them with df['name1'] and df['name2'], or to use df.iloc[:,1] and df.iloc[:,2]. Also note that column numbers start with zero, so these refer to the second and third columns in the df. The reason to use iloc with the colon is to state explicitly that you want all rows of columns 1 and 2. Otherwise, with df[1] or df[2], if your df were transposed that may actually refer to what you think of as the index. Now, on to a solution.
You could try
df['diff'] = df.iloc[:,1] - df.iloc[:,2]
df['diff_bool'] = np.where(df['diff'] == 0, False, True)
or you could combine this into one method
df['diff_bool'] = np.where(df.iloc[:,1] - df.iloc[:,2] == 0, False, True)
This will create a column in your df that says if there is a difference between columns one and two. You don't actually need to loop through row by row because pandas functions work like matrix math, so df.iloc[:,1]-df.iloc[:,2] will apply the subtraction row by row automatically.
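As a quick, self-contained illustration of the same idea (the frame and column names are invented for the example; the answer's iloc positions are shifted to match this two-column frame):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the SQL result; 'dQ' and 'dCQ' mirror the columns above.
df = pd.DataFrame({"dQ": [10, 5, 7], "dCQ": [10, 3, 9]})

df["diff"] = df.iloc[:, 0] - df.iloc[:, 1]               # row-wise subtraction, no loop needed
df["diff_bool"] = np.where(df["diff"] == 0, False, True) # True wherever the columns differ

print(df)
```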

Is there a way to compare more than one value of dataset against a single value of another dataset in left outer join(Flink)

I am trying to find out whether two values of a dataset can be checked against one value of another dataset using a Flink left outer join.
final DataSet<type> finalDataSet = dataSet1
.leftOuterJoin(dataSet2)
.where("value1")
.equalTo("value2")
.with(new FunctionNameToBeImplemented())
.name("StepName");
This works fine for a one to one check.
Can there be a way to do something similar:
final DataSet<type> finalDataSet = dataSet1
.leftOuterJoin(dataSet2)
.where(["value1","value2"]) // List of values
.contains("value2")
.with(new FunctionNameToBeImplemented())
.name("StepName");
I expect the output to check value1 and then value2, and if either (or both) matches, pass it to the function "FunctionNameToBeImplemented()" for further processing.
The outer joins in Flink's DataSet API are strictly equality joins.
You can implement your use case with two separate joins and union the results. In order to avoid duplicates, one of the join functions should check whether the other condition applies as well and only produce a result if it does not.
left -\
       > JOIN(l.val1 == r.val2) [emit result] ---------------------\
right -/                                                            \
                                                                     > UNION
left -\                                                             /
       > JOIN(l.val2 == r.val2) [emit result if l.val1 != r.val2] -/
right -/
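The same two-joins-plus-union idea, sketched with plain Python lists rather than the Flink API (record shapes and field names are invented for the illustration):

```python
# Each left record has two candidate keys; each right record has one key.
left = [{"id": "a", "val1": 1, "val2": 2},
        {"id": "b", "val1": 3, "val2": 4}]
right = [{"key": 2}, {"key": 3}]

# First join: match on val1.
join1 = [(l, r) for l in left for r in right if l["val1"] == r["key"]]

# Second join: match on val2, but skip pairs the first join already emitted
# (the "only produce a result if the other condition does not apply" rule).
join2 = [(l, r) for l in left for r in right
         if l["val2"] == r["key"] and l["val1"] != r["key"]]

union = join1 + join2
for l, r in union:
    print(l["id"], "matched", r["key"])
```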

Using SQL to find entries that were originally X and later changed to Y

I recently started using SQL for work and don't have much experience with it, so I'm sorry if this is a ridiculous question.
I'm looking for an entry that was originally listed as X but was later changed to Y. I figure that a nested subquery is the way to go, but the one I'm trying doesn't seem to use the nested bit.
Here is the code I'm trying
SELECT *
FROM [HOME].[dba].[ARCHIVE]
where FRIE like 'AR8%'
and RESULT = 'X'
and EXISTS(SELECT FRIE, RESULT
FROM [HOME].[dba].[ARCHIVE]
where RESULT = 'Y');
Everything up to the EXISTS works, but after that it just ignores the nested query.
Your EXISTS subquery isn't correlated with the outer query, so it is true whenever any 'Y' row exists at all. Correlate it on FRIE; I think this will work for you:
SELECT *
FROM [HOME].[dba].[ARCHIVE] a
WHERE a.FRIE like 'AR8%'
  AND a.RESULT = 'X'
  AND EXISTS(SELECT TOP 1 1
             FROM [HOME].[dba].[ARCHIVE] b
             WHERE b.FRIE = a.FRIE AND b.RESULT = 'Y');
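The correlated-EXISTS pattern can be exercised end to end against an in-memory SQLite database (standard SQL in place of the SQL Server TOP/bracket syntax; the data is invented for the illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ARCHIVE (FRIE TEXT, RESULT TEXT);
INSERT INTO ARCHIVE VALUES
    ('AR801', 'X'),   -- changed X -> Y: should match
    ('AR801', 'Y'),
    ('AR802', 'X');   -- never Y: should not match
""")

rows = conn.execute("""
    SELECT a.FRIE
    FROM ARCHIVE a
    WHERE a.FRIE LIKE 'AR8%' AND a.RESULT = 'X'
      AND EXISTS (SELECT 1 FROM ARCHIVE b
                  WHERE b.FRIE = a.FRIE AND b.RESULT = 'Y')
""").fetchall()

print(rows)  # only the entry that was X and later Y
```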
I'd recommend using an INNER JOIN to a subquery rather than using an EXISTS statement. Something like this:
SELECT *
FROM [HOME].[dba].[ARCHIVE] a
INNER JOIN (SELECT FRIE
FROM [HOME].[dba].[ARCHIVE]
WHERE RESULT = 'Y') t1 ON a.FRIE = t1.FRIE
WHERE
FRIE like 'AR8%'
and RESULT = 'X'
That would return all rows from ARCHIVE where there is a row with the same FRIE with a RESULT of X and a RESULT of Y.
Hopefully that helps.
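The join-to-subquery approach can likewise be checked with an in-memory SQLite database (table and column names follow the question; standard SQL replaces the SQL Server-specific bracket syntax, and the data is invented for the illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ARCHIVE (FRIE TEXT, RESULT TEXT);
INSERT INTO ARCHIVE VALUES
    ('AR801', 'X'),   -- was X and later changed to Y -> should match
    ('AR801', 'Y'),
    ('AR802', 'X'),   -- only ever X -> should not match
    ('AR803', 'Y');   -- only ever Y -> should not match
""")

rows = conn.execute("""
    SELECT a.FRIE
    FROM ARCHIVE a
    INNER JOIN (SELECT FRIE FROM ARCHIVE WHERE RESULT = 'Y') t1
        ON a.FRIE = t1.FRIE
    WHERE a.FRIE LIKE 'AR8%' AND a.RESULT = 'X'
""").fetchall()

print(rows)  # only the entry that appears with both X and Y
```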

MDX, CellSet contains meta data?

So I have the following MDX query which runs against the AdventureWorks example cube.
SELECT NON EMPTY ([Date].[Fiscal Semester of Year].[Fiscal Semester of Year],{[Measures].[Internet Order Count], [Measures].[Average Unit Price]}) ON 0,
NON EMPTY([Product].[Status].[Status] ,[Customer].[Country].[Country]) ON 1
FROM [Adventure Works]
This gives a result which looks a bit like:
H1 H2 H1 H2
X1 X2 X1 X2 X1 X2 X1 X2
A A1
A A2
A A3
B A1 numbers
B A2
B A3
The CellSet has various bits of metadata, but none of it describes that structure, as far as I can tell. It doesn't seem to include useful things like Measure names or Dimension names, etc.
Any clues? If I execute the above query in Management Studio it formats it nicely for me. Is that because it is being clever with its parsing of the MDX, or because it is accessing some metadata I haven't found yet?
As an aside, I can query the cube via the connection and get lots of cube related info that way, but I need a way to relate that back to my query result.
The chances are I will have a CellSet, but not the source MDX that generated it.
Thanks
Try something like this:
CellSet cs = cmd.ExecuteCellSet();

Axis columns = cs.Axes[0];
TupleCollection columnTuples = columns.Set.Tuples;
for (int i = 0; i < columnTuples.Count; i++)
{
    string caption = columnTuples[i].Members[0].Caption;
}

Axis rows = cs.Axes[1];
TupleCollection rowTuples = rows.Set.Tuples;
foreach (Position rowPos in rows.Positions)
{
    string caption = rowTuples[rowPos.Ordinal].Members[0].Caption;
}

"if, then, else" in SQLite

Without using custom functions, is it possible in SQLite to do the following? I have two tables, which are linked via common id numbers. In the second table there are two variables. What I would like to do is return a list of results consisting of the row id, and: NULL if all instances of those two variables (and there may be more than two) are NULL, 1 if they are all 0, and 2 if one or more is 1.
What I have right now is as follows:
SELECT
a.aid,
(SELECT count(*) from W3S19 b WHERE a.aid=b.aid) as num,
(SELECT count(*) FROM W3S19 c WHERE a.aid=c.aid AND H110 IS NULL AND H112 IS NULL) as num_null,
(SELECT count(*) FROM W3S19 d WHERE a.aid=d.aid AND (H110=1 or H112=1)) AS num_yes
FROM W3 a
So what this requires is to step through each result as follows (rough Python pseudocode):
if row['num_yes'] > 0:
    out[aid] = 2
elif row['num_null'] == row['num']:
    out[aid] = 'NULL'
else:
    out[aid] = 1
Is there an easier way? Thanks!
Use CASE...WHEN, e.g.
CASE x WHEN w1 THEN r1 WHEN w2 THEN r2 ELSE r3 END
Read more in the SQLite syntax manual (see the section "The CASE expression").
There's another way, for numeric values, which might be easier in certain specific cases.
It's based on the fact that a boolean value is 1 or 0, and an "if"-style condition gives a boolean result
(this will work only for an "or"-type condition, depending on the usage):
SELECT (w1=TRUE)*r1 + (w2=TRUE)*r2 + ...
Of course, @evan's answer is the general-purpose, correct answer.
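Applied to the question's aggregation, the CASE route can be folded into a single query. Here is a runnable sketch against an in-memory SQLite database, with minimal made-up data in the spirit of the W3/W3S19 tables above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE W3 (aid INTEGER);
CREATE TABLE W3S19 (aid INTEGER, H110 INTEGER, H112 INTEGER);
INSERT INTO W3 VALUES (1), (2), (3);
INSERT INTO W3S19 VALUES
    (1, NULL, NULL),   -- all NULL        -> NULL
    (2, 0, 0),         -- all zero        -> 1
    (3, 0, 1);         -- at least one 1  -> 2
""")

rows = conn.execute("""
    SELECT a.aid,
           CASE
               WHEN SUM(b.H110 = 1 OR b.H112 = 1) > 0 THEN 2
               WHEN COUNT(b.H110) = 0 AND COUNT(b.H112) = 0 THEN NULL
               ELSE 1
           END AS flag
    FROM W3 a JOIN W3S19 b ON a.aid = b.aid
    GROUP BY a.aid
    ORDER BY a.aid
""").fetchall()

print(rows)
```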
