Difference between 2 datasets - dataset

Hi, I have two datasets called A and B. A has 100 records and B has 99 records. How can I find the missing record in Dataiku?

There are multiple ways to do that. The best method depends on several factors: can you code, do you need to do this once or repeatedly, and so on.
Here is one solution I'm thinking of:
Click on the dataset with 100 records
Create a "Join with..." recipe
Select the dataset with 99 records
Under the "Join" tab, make sure the join type is "Left join" and that the join key is correct
Under the "Selected Columns" tab, select your join key
Add a prefix to the columns of the second dataset
Run the recipe
Open the resulting dataset and find the row for which the prefixed column is empty
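If you can code, the same left-join idea can be sketched outside the visual recipe. The snippet below is only an illustration, assuming both datasets share a key column (here a hypothetical `id`) and using an in-memory SQLite database as a stand-in for the two datasets. It does not use the Dataiku API.

```python
import sqlite3


def find_missing_keys(keys_a, keys_b):
    """Return the keys present in dataset A but absent from dataset B."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO a VALUES (?)", [(k,) for k in keys_a])
    conn.executemany("INSERT INTO b VALUES (?)", [(k,) for k in keys_b])
    # Left join A to B: rows of A without a match get NULL in B's column,
    # which is the "empty prefixed column" from the recipe steps.
    rows = conn.execute(
        "SELECT a.id FROM a LEFT JOIN b ON a.id = b.id "
        "WHERE b.id IS NULL ORDER BY a.id"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


# Hypothetical data: A has 100 records, B lacks record 42.
dataset_a = list(range(1, 101))
dataset_b = [k for k in dataset_a if k != 42]
missing = find_missing_keys(dataset_a, dataset_b)
print(missing)
```

The `WHERE b.id IS NULL` filter plays the role of the last manual step: it keeps only the rows for which the second dataset contributed nothing.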

Related

How can I merge duplicate rows in a spreadsheet while summing the numerical columns?

My project reports come from two different sources because we changed our project management tool in the middle of the year. I have therefore prepared two tabs within a Google Spreadsheet holding the data from the two systems under the same set of headings. I then combined the two sheets into one using the following query:
=QUERY({'Sheet1'!A1:I1000;'Sheet2'!A2:I1000},"select * where Col1 <>''")
Some of my projects are present in both lists because they were started early in the year. To avoid duplicates, I need to merge the two rows representing the same project into one. The project names are identical. However, I need the sum of some columns, such as 'Time Spent', to get the total value for the whole period. Other columns, such as 'Project Owner', are identical in the two rows.
How can I combine these duplicate rows into single rows while summing the selected columns?
Thank you in advance for your support!
syntax is:
=QUERY({Sheet1!A1:B; Sheet2!A2:B},
"select Col1,sum(Col2)
where Col1 is not null
group by Col1
label sum(Col2)''")
where column A is text and column B holds numeric values
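To make clear what the QUERY formula computes, here is a rough equivalent of its logic: stack both sheets, skip rows whose first column is empty, then group by the first column and sum the second. The sheet contents and the function name `merge_and_sum` are made up for illustration.

```python
from collections import defaultdict


def merge_and_sum(*sheets):
    """Stack (name, value) rows from all sheets and sum the values per name."""
    totals = defaultdict(float)
    for sheet in sheets:
        for name, value in sheet:
            if name:  # mirrors "where Col1 is not null"
                totals[name] += value  # mirrors "sum(Col2) ... group by Col1"
    return dict(totals)


# Hypothetical rows: (project name, time spent)
sheet1 = [("Project Alpha", 10.0), ("Project Beta", 4.5), ("", 0.0)]
sheet2 = [("Project Alpha", 2.5), ("Project Gamma", 7.0)]
merged = merge_and_sum(sheet1, sheet2)
print(merged)
```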

Storing processed results of connection in RDBMS

A CSV file contains the following two columns: admission_number, project_name.
The relationship between the two entities is many-to-many: a specific admission_number can work on multiple projects, and a specific project may have multiple admission_numbers.
The data looks like the sample below. Initially there are 1,000 million rows; the table is updated daily and will grow up to 1,300 million rows.
admission_number,project_name
1234567890,ABC1234567
1234567890,ABC1234568
1234567891,ABC1234569
1234567892,ABC1234569
1234567893,ABC1234570
1234567894,ABC1234567
1234567895,ABC1234567
For a specific admission number (let's say 1234567890), I want to know all the admission_numbers that are working on the same projects (ABC1234567, ABC1234568). The output of the above query will be
1234567894,1234567895.
Explanation: the projects for admission number '1234567890' are 'ABC1234567' and 'ABC1234568'. The other admission_numbers working on these two projects are '1234567894' and '1234567895'.
I came up with two solutions. An RDBMS will be used to store the data.
Approach 1: Use two retrieval queries. The first query returns all the project_names for a specific admission_number, and the second query returns all the admission_numbers for those project_names.
select admission_number from table where project_name IN (select project_name from table where admission_number='1234567890') and admission_number <> '1234567890';
Approach 2: Preprocess the data before loading and store the results directly in the database, so that only the connected admission_numbers are stored.
E.g. for project_name 'ABC1234567', the 3 admission_numbers '1234567890', '1234567894' and '1234567895' are working. I want to store all connected admission_numbers in a table with two columns (number, connected_number), like ('1234567890','1234567894'), ('1234567890','1234567895'), ('1234567894','1234567895'), and the query will work on both columns (number and connected_number).
But this approach produces many rows: if a specific project_name 'p' has n admission_numbers, then the total number of rows for it will be n(n-1)/2.
How can I store all the connected admission_numbers in an RDBMS? Loading the data can be slow, but retrieval should be fast.
Do not try to optimize the data structure by precomputing the pairs. It would only cause problems.
Create a simple table with the two ID columns (admission_number and project_name) and create an index on each column.
The RDBMS will build and maintain the indexes on the column values, which enables fast lookup of specific records.
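A minimal sketch of that suggestion, using SQLite from Python as a stand-in for whichever RDBMS is chosen. The table name `assignment`, the index names, and the function `connected_admissions` are assumptions for illustration; the rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assignment ("
    " admission_number TEXT NOT NULL,"
    " project_name TEXT NOT NULL)"
)
# One index per column, so lookups in either direction avoid a full scan.
conn.execute("CREATE INDEX idx_assignment_admission ON assignment (admission_number)")
conn.execute("CREATE INDEX idx_assignment_project ON assignment (project_name)")

conn.executemany(
    "INSERT INTO assignment VALUES (?, ?)",
    [
        ("1234567890", "ABC1234567"),
        ("1234567890", "ABC1234568"),
        ("1234567891", "ABC1234569"),
        ("1234567892", "ABC1234569"),
        ("1234567893", "ABC1234570"),
        ("1234567894", "ABC1234567"),
        ("1234567895", "ABC1234567"),
    ],
)


def connected_admissions(conn, admission_number):
    """Return other admission numbers sharing at least one project."""
    sql = """
        SELECT DISTINCT other.admission_number
        FROM assignment AS mine
        JOIN assignment AS other
          ON other.project_name = mine.project_name
        WHERE mine.admission_number = ?
          AND other.admission_number <> ?
        ORDER BY other.admission_number
    """
    return [r[0] for r in conn.execute(sql, (admission_number, admission_number))]


print(connected_admissions(conn, "1234567890"))
```

The self-join computes the connections at query time, so the n(n-1)/2 pair rows of Approach 2 never need to be stored.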

Access tables multiple join issue

Can I use two joins between two tables in Access database?
I have a customer database in which customer names appear in two different fields. I want to convert the customer names into short names and return them in a query as one single field.
To solve this, I created a second table with all the customer names and their abbreviations, then linked its "CustName" field with the "Customer_Name" field in my main table, so that my query returns the short names of my customers. The struggle is that some customer names, e.g. Toyota, appear in the "Customer_Plant" field instead of the "Customer_Name" field (please see picture). I want to use a different Toyota short name for each plant location. Another difficulty is that the "Customer_Plant" field in my original table is not always populated, except for Toyota.
Is there any way I can use multiple relations between two different tables, so that the Access query can return short names not just by "Customer_Name" but also by "Customer_Plant" at the same time?
Access does not allow me to join "Customer_Plant" with "custPlant" if one join is present between the tables. Is there any other way I can achieve this?
Tbl_claimdata & tbl_custShortName:
Join between the tables:
Current Output:
If Plant name is not provided in either or both tables, consider:
Query1: Claims_ADJ
SELECT tblCustClaimData.Customer_Name,
       tblCustClaimData.Customer_Plant,
       Nz([Customer_Plant],[Customer_Name]) AS LinkNameClaim
FROM tblCustClaimData;
Query2: Short_ADJ
SELECT tblShortCustName.CustName,
       tblShortCustName.PlantName,
       Nz([PlantName],[CustName]) AS LinkNameShort,
       tblShortCustName.ShortName
FROM tblShortCustName;
Query3:
SELECT Customer_Name, Customer_Plant, ShortName
FROM Short_ADJ RIGHT JOIN Claims_ADJ
ON Short_ADJ.LinkNameShort = Claims_ADJ.LinkNameClaim;
Query3 is not updatable, so it is probably useful only for a report.
An alternative is a DLookup expression in the query (Query2 and Query3 are then not needed) or in a textbox:
DLookUp("ShortName","tblShortCustName","Nz([PlantName],[CustName])='"
& Nz([Customer_Plant],[Customer_Name]) & "'")
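The linking idea behind the three queries can be tried outside Access. The sketch below uses SQLite from Python, with `COALESCE` standing in for Access's `Nz` and a `LEFT JOIN` from the claims table standing in for the `RIGHT JOIN`. The rows and short names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblCustClaimData (Customer_Name TEXT, Customer_Plant TEXT);
    CREATE TABLE tblShortCustName (CustName TEXT, PlantName TEXT, ShortName TEXT);
    """
)
# Hypothetical rows: the plant is filled in only for Toyota.
conn.executemany(
    "INSERT INTO tblCustClaimData VALUES (?, ?)",
    [("Toyota", "Toyota Kentucky"), ("Honda", None), ("Ford", None)],
)
conn.executemany(
    "INSERT INTO tblShortCustName VALUES (?, ?, ?)",
    [
        ("Toyota", "Toyota Kentucky", "TKY"),
        ("Toyota", "Toyota Texas", "TTX"),
        ("Honda", None, "HON"),
    ],
)

# The link key is the plant when present, otherwise the customer name.
sql = """
    SELECT c.Customer_Name, c.Customer_Plant, s.ShortName
    FROM tblCustClaimData AS c
    LEFT JOIN tblShortCustName AS s
      ON COALESCE(s.PlantName, s.CustName)
       = COALESCE(c.Customer_Plant, c.Customer_Name)
    ORDER BY c.Customer_Name
"""
short_names = conn.execute(sql).fetchall()
print(short_names)
```

Claims without a matching short name (Ford here) are still returned, with an empty ShortName, which matches the intent of keeping every row of Claims_ADJ.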

Is it possible to query two access tables where you want to know if a value/range in the first table is between two fields in the second table?

I'm trying to query two tables, ASSAYS and LITHO, in a diamond drillhole database.
I was given values (SAMPLE_NO) to search for in the ASSAYS table, to return values such as HOLE-ID, FROM, and TO. So each sample that we take has a HOLE-ID, SAMPLE_NO, FROM AND TO. One hole-id can have multiple sample numbers, but each sample number is unique. The from and to will be unique in each hole-id. This I can find no problem.
My coworker also wanted to know what rock type was associated with each sample. This info is located in another table so I'll need to figure out how to query for this. The information that this table holds is HOLE-ID, FROM, TO, and ROCKTYPE.
You're looking for what is called a JOIN. It lets you combine data from multiple tables based on matching column values.
This could be your starting point:
SELECT a.*, l.*
FROM ASSAYS a LEFT JOIN LITHO l ON a.[HOLE-ID] = l.[HOLE-ID]
WHERE a.sample_no = 'XXXX'
Please google for JOIN and SQL to find out about the exact syntax.
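The starting point above matches on the hole only, so each sample is paired with every LITHO interval in its hole. To pick the rock type for the sample's depth range, the join condition can also compare the intervals. The sketch below shows one way to do this with SQLite from Python. The column names (`hole_id`, `from_m`, `to_m`), the overlap rule, and the data are assumptions for illustration; in Access the original names such as [HOLE-ID], [FROM] and [TO] would need square brackets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assays (hole_id TEXT, sample_no TEXT, from_m REAL, to_m REAL);
    CREATE TABLE litho  (hole_id TEXT, from_m REAL, to_m REAL, rocktype TEXT);
    """
)
conn.executemany(
    "INSERT INTO assays VALUES (?, ?, ?, ?)",
    [
        ("DH-001", "S1001", 10.0, 11.0),
        ("DH-001", "S1002", 49.5, 50.5),  # straddles a litho boundary
        ("DH-002", "S2001", 30.0, 31.0),
    ],
)
conn.executemany(
    "INSERT INTO litho VALUES (?, ?, ?, ?)",
    [
        ("DH-001", 0.0, 50.0, "Granite"),
        ("DH-001", 50.0, 120.0, "Basalt"),
        ("DH-002", 0.0, 80.0, "Schist"),
    ],
)


def rock_types(conn, sample_no):
    """Return the rock types whose interval overlaps the sample's interval."""
    sql = """
        SELECT l.rocktype
        FROM assays AS a
        LEFT JOIN litho AS l
          ON l.hole_id = a.hole_id
         AND a.from_m < l.to_m   -- sample starts before the litho unit ends
         AND a.to_m > l.from_m   -- sample ends after the litho unit starts
        WHERE a.sample_no = ?
        ORDER BY l.from_m
    """
    return [r[0] for r in conn.execute(sql, (sample_no,))]


print(rock_types(conn, "S1001"))
print(rock_types(conn, "S1002"))
```

A sample that crosses a lithology contact returns more than one rock type under this overlap rule. Requiring the sample to lie entirely inside one unit would use `a.from_m >= l.from_m AND a.to_m <= l.to_m` instead.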

Possible to search multiple tables with a single query? [MSAccess/SQL Server]

So my goal here is to have a single search field in an application that will be able to search multiple tables and return results.
For example, two of these tables are "performers" and "venues" and there are the following performers: "John Andrews","Andrew Smith","John Doe" and the following venues: "St. Andrew's Church","City Hall". Is there a way to somehow return the first two performers and the first venue for a search of "Andrew"?
My first thought was to somehow get all the tables aggregated into a single table with three columns: "SearchableText", "ResultType", "ResultID". The first column would contain whatever I want searched (e.g. performer name), the second would say what is being shown (e.g. Performer) and the third would give the item's ID (note: all my tables have auto-incrementing primary keys for ease). My question about this idea is whether it is possible to do this dynamically, or whether I have to add code to maintain a table that is filled automatically whenever a row is added, updated or deleted in the performers and venues tables (perhaps via a trigger?).
My application is written in MSAccess (I know, I know, but I have no choice) on top of a SQL Server backend. I'd prefer this happen through MSAccess so I don't have to have a "searchme" table sitting on my SQL Server but any good result is acceptable :)
I think you are looking for the "union" sql keyword
I'd use full-text indexing in SQL Server, with a single table holding your searchable text and foreign keys in your main tables that link to the search table. This way you can order your results by relevance.
I think you have a schema problem. Querying a UNION is almost always evidence of that (though not in all cases).
The question to me is:
What are you returning as your result?
If you find a person, are you displaying a list of people?
Or if you find a venue, a list of venues?
Or a mix of both?
I would say that if you want to return a list of both, then you'd want something like this:
SELECT tblPerson.PersonID, tblPerson.LastName & ", " & tblPerson.FirstName, "Person"
FROM tblPerson
WHERE tblPerson.LastName LIKE "Andrew*"
OR tblPerson.FirstName LIKE "Andrew*"
UNION
SELECT tblVenue.Venue, tblVenue.Venue, "Venue"
FROM tblVenue
WHERE tblVenue.Venue LIKE "Andrew*"
ORDER BY Venue
This will give a list of the matches indicating which is a person and which a venue, and allow you to then select one of those and open a detail view (by checking the value in the third field).
What you definitely don't want to do is this:
SELECT tblPerson.PersonID, tblPerson.LastName & ", " & tblPerson.FirstName, "Person"
FROM tblPerson
UNION
SELECT tblVenue.Venue, tblVenue.Venue, "Venue"
FROM tblVenue
then saving that and trying to query it on the 2nd column. That will be extremely inefficient. You want your WHERE clause to be on fields that can be searched via the index, and that means each subquery of your UNION needs to have an appropriate WHERE clause.
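Here is a small sketch of the recommended shape, with a WHERE clause inside each branch of the UNION, run against SQLite from Python rather than Access or SQL Server. The table layout and the output column names follow the question's "SearchableText", "ResultType", "ResultID" idea and are assumptions. Note that this sketch uses a contains match (`%term%`) so that "St. Andrew's Church" is found; a pattern with a leading wildcard generally cannot use an ordinary index, unlike the prefix match `"Andrew*"` shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE performers (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL);
    CREATE TABLE venues     (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO performers (name) VALUES (?)",
    [("John Andrews",), ("Andrew Smith",), ("John Doe",)],
)
conn.executemany(
    "INSERT INTO venues (name) VALUES (?)",
    [("St. Andrew's Church",), ("City Hall",)],
)


def search(conn, term):
    """Search both tables; each UNION branch filters its own table."""
    pattern = f"%{term}%"
    sql = """
        SELECT name AS searchable_text, 'Performer' AS result_type, id AS result_id
        FROM performers
        WHERE name LIKE ?
        UNION ALL
        SELECT name, 'Venue', id
        FROM venues
        WHERE name LIKE ?
        ORDER BY result_type, result_id
    """
    return conn.execute(sql, (pattern, pattern)).fetchall()


for row in search(conn, "Andrew"):
    print(row)
```

The result type and ID in each row are enough for the application to open the right detail view, without maintaining a separate search table.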
