So I have the following MDX query which runs against the AdventureWorks example cube.
SELECT NON EMPTY ([Date].[Fiscal Semester of Year].[Fiscal Semester of Year],{[Measures].[Internet Order Count], [Measures].[Average Unit Price]}) ON 0,
NON EMPTY([Product].[Status].[Status] ,[Customer].[Country].[Country]) ON 1
FROM [Adventure Works]
This gives a result which looks a bit like:
H1 H2 H1 H2
X1 X2 X1 X2 X1 X2 X1 X2
A A1
A A2
A A3
B A1 numbers
B A2
B A3
The CellSet has various bits of metadata, but none of it describes that structure, as far as I can tell. It doesn't seem to include useful things like Measure names or Dimension names, etc.
Any clues? If I execute the above query in Management Studio it formats it nicely for me. Is that because it is being clever with its parsing of the MDX, or because it is accessing some metadata I haven't found yet?
As an aside, I can query the cube via the connection and get lots of cube related info that way, but I need a way to relate that back to my query result.
The chances are I will have a CellSet, but not the source MDX that generated it.
Thanks
Try something like this:
CellSet cs = cmd.ExecuteCellSet();
Axis columns = cs.Axes[0];
TupleCollection columnTuples = columns.Set.Tuples;
for (int i = 0; i < columnTuples.Count; i++)
{
string caption = columnTuples[i].Members[0].Caption;
}
Axis rows = cs.Axes[1];
TupleCollection rowTuples = rows.Set.Tuples;
for (int i = 0; i < rowTuples.Count; i++)
{
string caption = rowTuples[i].Members[0].Caption;
}
I have data for a large number of group IDs, and each group ID has anywhere from 4 to 30 observations. I would like to solve a (linear or nonlinear, depending on approach) system of equations using this data in Matlab. I want to solve a system of three equations in three unknowns, but also load in data for known variables. I need observations 2 through 4 in order to solve this, but would also like to move on to the next set of 3 observations (if it exists) to see how the solutions change. I would like to record these calculations as well.
What is the best way to accomplish this? I have a standard idea of how to solve the system using fsolve, but what is the best way to loop through group IDs with varying amounts of observations?
Here is some sample code I have written when thinking about this issue:
%%Load Data
Data = readtable('dataset.csv'); % Full Dataset
%Define Variables
%Main Data
groupID = Data{:,1};
Known1 = Data{:,7};
Known2 = Data{:,8};
Known3 = Data{:,9};
%%%%%% Function %%%%%
% Unknowns: f(1) = A, f(2) = B, f(3) = C
% Define the function handle for the system of equations
fun = @(f) [f(1)^2 + f(2)*Known3 - 2*f(3)*Known1 + 1/Known2 - D2;
            f(1) + (f(2)^2)*Known3 - f(3)*Known1 + 1/Known2 - D3;
            f(1) - f(2)*Known3 + f(3)^2*Known1 + 1/Known2 - D4];
% Define the initial guess for the solution
f0 = [0; 0; 0];
% Solve the nonlinear system of equations
f = fsolve(fun, f0)
%%%% Create Loop %%%%%%
% Set the number of observations to load at a time
numObservations = 3;
% Set the initial group ID
groupID = 1;
% Set the maximum number of groups
maxGroups = 100;
% Start from the initial guess and loop through the groups of data
x = f0;
while groupID <= maxGroups
    % Load the data for the current group (loadData is a placeholder here)
    data = loadData(groupID, numObservations);
    % Update the solution using the new data
    x = fsolve(fun, x);
    % Print the updated solution
    disp(x);
    % Move on to the next group of data
    groupID = groupID + 1;
end
What are the pitfalls with writing the code like this, and how can I improve it?
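For the looping structure itself, one common pattern is to group the data by ID and slide a fixed-size window over each group's observations, however many there are. Below is a rough sketch of that pattern in Python/pandas; the data, column names, and the linear 3x3 stand-in for the nonlinear fsolve call are all invented for illustration (in MATLAB the same shape falls out of a loop over unique group IDs, or findgroups/splitapply):

```python
import numpy as np
import pandas as pd

# Invented sample data: two group IDs with 4 and 5 observations
df = pd.DataFrame({
    "groupID": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Known1":  [1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.5, 3.5, 1.0],
    "Known2":  [2.0, 1.0, 3.0, 1.5, 1.0, 2.0, 4.0, 0.5, 2.5],
    "Known3":  [0.5, 2.5, 1.0, 2.0, 2.0, 0.5, 1.5, 1.0, 3.0],
})

def solve_window(window):
    """Solve one 3x3 system built from three consecutive observations.
    (A linear solve stands in for the nonlinear fsolve call here.)"""
    A = window[["Known1", "Known2", "Known3"]].to_numpy()
    b = np.ones(3)
    return np.linalg.solve(A, b)

results = []
for gid, g in df.groupby("groupID"):
    g = g.reset_index(drop=True)
    # Slide over consecutive windows of 3 observations, however many exist
    for start in range(len(g) - 2):
        sol = solve_window(g.iloc[start:start + 3])
        results.append({"groupID": gid, "window_start": start, "solution": sol})

for r in results:
    print(r["groupID"], r["window_start"], r["solution"])
```

Recording each window's solution in a list of records like this also makes it easy to compare how the solutions change as the window advances within a group.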
So I have this formula, and it's working as intended but I would like to further refine the data.
Formula:
=QUERY('Users'!A1:Q, "Select A,B,C,F,G,O,Q where Q >= 180 and Q < 44223")
I have tried:
=QUERY('Users'!A1:Q, "Select A,B,C,F,G,O,Q where Q >= 180 and Q < 44223 and F not like '*text*'")
But that caused an error. Ideally, I would like to omit any results that match partial text in columns C and F. Any ideas?
EDIT:
Included a link to an example sheet
Ideally, I want the query to match everything in column F except 'Archived' (but needs to be wildcarded) and everything in column C except Delta (again, needs to be wildcarded)
try:
=QUERY(Sheet1!A1:Q6,
"select A,B,C,F,G,O,Q
where Q >= 180
and Q < 44223
and not lower(O) matches '.*archived.*'
and not lower(C) matches '.*delta.*'")
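The reason the pattern needs the leading and trailing `.*` is that QUERY's `matches` operator does whole-string regular-expression matching, and `lower()` sidesteps case sensitivity. The same semantics can be illustrated in Python with `re.fullmatch` (the sample cell values are invented):

```python
import re

# QUERY's `matches` does whole-string regex matching, so a bare 'archived'
# would only match cells that are exactly "archived"; the .* wildcards
# allow the word to appear anywhere in the cell.
pattern = re.compile(r".*archived.*")

cells = ["Archived 2021", "Active", "pre-archived", "archive"]
kept = [c for c in cells if not pattern.fullmatch(c.lower())]
print(kept)
```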
I'm very new to Python and have been trying hard these last few days to go through a df row by row and check each row that has a difference between columns dQ and dCQ. I just used != 0 since the value could be positive or negative. If this is true, I would like to check in another table whether certain criteria are met. I'm used to working in R, where I could store the df in a variable and refer to columns by name; I can't seem to find a way to do that in Python. I posted all of the code I've been playing with. I know it's messy, but any help would be appreciated. Thank you!
I've tried installing different packages that wouldn't work, and I tried making a for loop (I failed miserably); maybe a function? I'm not sure where to even look. I've never learned Python; I'm really doing my best watching videos online and reading on here.
import pyodbc
import pandas as pd
import numpy as np

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=***-***-***.****.***.com;"
                      "Database=****;"
                      "Trusted_Connection=no;"
                      "UID=***;"
                      "PWD=***")

# cur = conn.cursor()
# cur.execute("SELECT TOP 1000 tr.dQ, po.dCQ, tr.dQ - po.dCQ as diff "
#             "FROM [IP].[dbo].[vT] tr (nolock) "
#             "JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
#             "WHERE tr.dQ != po.dCQ")
# query = cur.fetchall()

query = ("SELECT TOP 100 tr.dQ, po.dCQ /*, tr.dQ - po.dCQ as diff */ "
         "FROM [IP].[dbo].[vT] tr (nolock) "
         "INNER JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
         "WHERE tr.dQ != po.dCQ")
df = pd.read_sql(query, conn)
#print(df[2,])

cursor = conn.cursor()
cursor.execute("SELECT TOP 100 tr.dQ, po.dCQ /*, tr.dQ - po.dCQ as diff */ "
               "FROM [IP].[dbo].[vT] tr (nolock) "
               "INNER JOIN [IP].[dbo].[vP] po ON tr.vchAN = po.vchCustAN "
               "WHERE tr.dQ != po.dCQ")
result_set = cursor.fetchall()
for row in result_set:
    # pyodbc rows support attribute access by column name
    print("%s, %s" % (row.dQ, row.dCQ))
# if df[3] != 0:
# diff = df[1]-df[2]
# print(diff)
# else:
# exit
# cursor = conn.cursor()
# for row in cursor.fetchall():
# print(row)
#
# for record in df:
# if record[1] != record[2]:
# print(record[3])
# else:
# record[3] = record[1]
# print(record)
# df['diff'] = np.where(df['dQ'] != df["dCQ"])
I expect some sort of notification that there's a difference in row xx, and then it will check in table vP to verify we received this data's details. I believe I can get to this point if I can get the first part working. Any help is appreciated. I'm sorry if this question is not clear; I will do my best to answer any questions someone may have. Thank you!
One solution could be to make a new column where you store the result of the diff between df[1] and df[2]. One note first: it might be more precise to either name your columns when you make the df and then reference them with df['name1'] and df['name2'], or use df.iloc[:,1] and df.iloc[:,2]. Also note that column numbers start at zero, so these refer to the second and third columns in the df. The reason to use iloc and the colon is to state explicitly that you want all rows and column numbers 1 and 2. Otherwise, with df[1] or df[2], if your df were transposed that might actually refer to what you think of as the index. Now, on to a solution.
You could try
df['diff']=df.iloc[:,1]-df.iloc[:,2]
df['diff_bool']=np.where(df['diff']==0,False, True)
or you could combine this into one method
df['diff_bool']=np.where(df.iloc[:,1]-df.iloc[:,2]==0,False, True)
This will create a column in your df that says if there is a difference between columns one and two. You don't actually need to loop through row by row because pandas functions work like matrix math, so df.iloc[:,1]-df.iloc[:,2] will apply the subtraction row by row automatically.
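As a runnable illustration of that vectorised approach on a toy frame (the column names are invented to mirror the query):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SQL result: dQ and dCQ quantities per row
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "dQ":  [10, 5, 7, 3],
                   "dCQ": [10, 4, 7, 9]})

# Vectorised diff and flag -- no row-by-row loop needed
df["diff"] = df["dQ"] - df["dCQ"]
df["diff_bool"] = np.where(df["diff"] == 0, False, True)

# Rows that would need a follow-up check in the other table
mismatches = df.loc[df["diff_bool"], "id"].tolist()
print(mismatches)
```

The boolean column can then drive whatever follow-up lookup is needed, without ever iterating over individual rows.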
I am working on some software where I need to do large aggregation of data using SQL Server. The software is helping people play poker better. The query I am using at the moment looks like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
where (H IN (1164, 1165, 1166) ) AND
(V IN (1260, 1311))
Group by H;
This works fine and is the fastest way I have found to do what I am trying to achieve. The problem is I need to enhance the functionality in a way that allows the aggregation to include multiple instances of V. So for example in the above query instead of it just including the data from 1260 and 1311 once it may need to include 1260 twice and 1311 three times. But obviously just saying
V IN (1260, 1260, 1311, 1311, 1311)
won't work because each unique value is only counted once in an IN clause.
I have come up with a solution to this problem which works, but it seems rather clunky. I have created another lookup table which takes the values between 0 and 1325 and assigns them to a field called V1, and for each V1 there are 100 V2 values; e.g. for V1 = 1260 the V2 values run from 126000 through 126099. Then in the main query I join to this table and do the lookup like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join [FlopLookup].[dbo].[VillainJoinTable] on [FlopLookup].[dbo].[VillainJoinTable].[V1] = [FlopLookup].[dbo].[_5c3c3d].[V]
where (H IN (1164, 1165, 1166) ) AND
(V2 IN (126000, 126001, 131100, 131101, 131102) )
Group by H;
So although it works it is quite slow. It feels inefficient because it is adding data multiple times when what would probably be more appropriate is a way of doing this using multiplication, i.e. instead of passing in 126000, 126001, 126002, 126003, 126004, 126005, 126006, 126007 I instead pass in 1260 in the original query and then multiply it by 8. But I have not been able to work out a way to do this.
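One alternative that captures the multiplication idea is to keep the original V values and join a small weights table, multiplying inside the SUM instead of joining duplicated rows. The sketch below demonstrates the pattern against SQLite from Python; the mini-tables and column names (`results`, `weights`, `wins`, `w`) are made up for illustration, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-in for the real tables: per-(H, V) win counts
cur.execute("CREATE TABLE results (H INTEGER, V INTEGER, wins INTEGER)")
cur.executemany("INSERT INTO results VALUES (?, ?, ?)",
                [(1164, 1260, 10), (1164, 1311, 4), (1165, 1260, 7)])

# Weight table: how many times each V should be counted
cur.execute("CREATE TABLE weights (V INTEGER, w INTEGER)")
cur.executemany("INSERT INTO weights VALUES (?, ?)", [(1260, 2), (1311, 3)])

# Multiply by the weight instead of scanning duplicated rows
cur.execute("""
    SELECT r.H, SUM(r.wins * wt.w) AS total
    FROM results r
    JOIN weights wt ON wt.V = r.V
    GROUP BY r.H
    ORDER BY r.H
""")
rows = cur.fetchall()
print(rows)
```

If the real query's `Count(WinStraightFlush)` denominator should stay consistent with the weighting, it would likewise become `SUM(wt.w)` so the result remains a weighted average.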
Any help would be appreciated. Thanks.
EDIT - Added more information at the request of Livius in the comments
H stands for "Hero" and is in the table _5c3c3d as a smallint representing the two cards the player is holding (e.g. AcKd, Js4h etc.). V stands for "Villain" and is similar to Hero but represents the cards the opponent is holding similarly encoded. The encoding and decoding takes place in the code. These two fields form the clustered index for the _5c3c3d table. The remaining field in this table is Id which is another smallint which is used to join with the table Lookup5c3c3d which contains all the equity information for the hero's hand against the villain's hand for the flop 5c3c3d.
V2 is just a field in a table I have created to try and resolve the problem described by having a table called VillainJoinTable which has V1 (which maps directly to V in _5c3c3d via a join) and V2 which can potentially contain 100 numbers per V1 (e.g. when V1 is 1260 it could contain 126000, 126001 ... 126099). This is in order to allow me to create an "IN" clause that can effectively have multiple lookups to equity information for the same V multiple times.
Here are some screenshots:
Structure of the three tables
Some data from _5c3c3d
Some data from Lookup5c3c3d
Some data from VillainJoinTable
I am using spark-2.0.0. When I try to intersect two dataframes from the same source, the answer is not as expected. It's the same with inner join.
df = sqlContext.read.parquet("path to hdfs file").select('col1','col2')
w1 = df.filter(df.col1 == 1)
w2 = df.filter(df.col2 == 2)
w1.select('col1').intersect(w2.select('col1')).show()
#Answer is just w1
w2.select('col1').intersect(w1.select('col1')).show()
#Answer is just w2
w1.intersect(w2).show()
#This should give empty data frame but output is w1
w1.join(w2, w1.col1 == w2.col2)
#this gives output as if it were w1.join(w1, 'col1')
This behavior happens only when I am reading through HDFS. If I construct a dataframe in the program itself, it behaves as expected.
I tried using an alias as well, as given in this link, to temporarily work around the problem, but it is still not working.
http://search-hadoop.com/m/q3RTtOaknEYeLLi&subj=Re+Join+on+DataFrames+from+the+same+source+Pyspark+