I am trying to remove duplicates from the ID field so that only the earliest retention academic period is associated with each unique ID. For example:
Status | ID | Profile Period | Retention Period | Academic Period
Retained | 654321 | 200610 | 200620 | 200620
Retained | 654321 | 200610 | 200710 | 200710
Retained | 654321 | 200610 | 200720 | 200720
CODO IN | 123456 | 200510 | 200520 | 200520
CODO IN | 123456 | 200510 | 200610 | 200610
CODO IN | 123456 | 200510 | 200620 | 200620
So what I want is to keep the repeated terms for Retained IDs; however, I want the repeated CODO IN rows to collapse to the earliest retention period:
Retained | 654321 | 200610 | 200620 | 200620
Retained | 654321 | 200610 | 200710 | 200710
Retained | 654321 | 200610 | 200720 | 200720
CODO IN | 123456 | 200510 | 200520 | 200520
I have tried {FIXED [ID] : MIN([Retention Period])}, but it doesn't seem to remove all the repeats for CODO IN (i.e. I still see the counts for 200610 and 200620).
Any suggestions on how to approach this task?
As other commenters have mentioned, this kind of work is usually best done before the data is loaded into Tableau: if the duplicate rows are not required for visualisation, it is best practice to remove them upstream to keep the dashboard performant and simple.
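For example, here is a minimal sketch of that upstream clean-up in SQL, assuming a hypothetical source table named retention whose columns mirror your sample data (status, id, retention_period, and so on):
-- Keep every row for Retained IDs, but only the earliest
-- retention-period row for each CODO IN ID.
-- Table and column names are hypothetical.
SELECT *
FROM retention AS r
WHERE r.status <> 'CODO IN'
   OR r.retention_period = (SELECT MIN(r2.retention_period)
                            FROM retention AS r2
                            WHERE r2.id = r.id
                              AND r2.status = 'CODO IN');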
However, if you require a solution in Tableau, you were on the right track with an LOD expression. If you nest it in an IF statement so that only the CODO IN duplicates are removed, you should get your desired result.
So create the following as a calculated field:
// Keep a row if it is not CODO IN, or if it holds the earliest
// retention period for its ID
IIF([Status] = 'CODO IN',
    [Retention Period] = {FIXED [ID] : MIN([Retention Period])},
    TRUE)
Then drag this calculated field to the Filters shelf and select 'True' to remove the duplicates.
Let me know how it goes.
I have a zillion sensors, each ticking every minute with some floating-point data, and I want to save the data in QuestDB. I see two options:
Option 1 is to create a wide table with a zillion columns and one row per minute:
| Time | Sensor1 | Sensor2 | .... | Sensor1232123 |
| 10:01 | 3.4 | 0.0 | .... | 23.4 |
| 10:02 | 5.46 | 23.987 | .... | 0.0 |
...
And option 2:
| Time | Id | Value |
| 10:01 | 1 | 3.4 |
| 10:01 | 2 | 0.0 |
...
| 10:01 | 123123 | 23.4 |
| 10:02 | 1 | 5.46 |
| 10:02 | 2 | 23.987 |
...
| 10:02 | 3 | 0.0 |
...
Since my data comes from individual sensors independently, I'm inclined to use option 2, but QuestDB requires the designated timestamp column to be ascending, so I assume I cannot have duplicate values in the Time column.
This sounds like a pretty common use case, but I cannot figure out how to store my sensor data in one table.
You should use option 2 as you described, with a timestamp, sensor id, and value saved per row.
Repeated timestamps are allowed; the designated timestamp only has to be non-decreasing, so it is valid to store all the sensors as individual rows at 10:01, 10:02, etc.
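For reference, here is a minimal sketch of the option 2 table in QuestDB SQL; the table and column names are only illustrative:
-- One row per sensor reading; ts is the designated timestamp.
-- Repeated ts values are fine as long as the column stays
-- in non-decreasing order.
CREATE TABLE sensor_data (
    ts TIMESTAMP,
    sensor_id INT,
    value DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY;
Partitioning by day is optional, but it keeps queries over recent data cheap as the table grows.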
I have a question related to a kind of duplication I see in databases from time to time. To ask this question, I need to set the stage a bit:
Let's say I have a database of TV shows. Its primary table Content stores information at various levels of granularity (Show -> Season -> Episode), using a parent column to denote hierarchy:
+----+---------------------------+-------------+----------+
| ID | ContentName               | ContentType | ParentId |
+----+---------------------------+-------------+----------+
| 1  | Friends                   | Show        | [null]   |
| 2  | Season 1                  | Season      | 1        |
| 3  | The Pilot                 | Episode     | 2        |
| 4  | The One with the Sonogram | Episode     | 2        |
+----+---------------------------+-------------+----------+
Maybe this isn't ideal, but let's say it's good enough to work with and we're not looking to change it.
Now let's say we need to build a table that defines air dates. We can set these at any level, and they must apply down the hierarchy (e.g., if set at the Season level, it applies to all episodes within that season; if set at the Show level, it applies to all seasons and episodes).
So the original air dates might look like this:
+-------+-----------+------------+
| airId | ContentId | AirDate    |
+-------+-----------+------------+
| 71    | 3         | 1994-09-22 |
| 72    | 4         | 1994-09-29 |
+-------+-----------+------------+
Whereas the air date for a streaming service might look like:
+-------+-----------+------------+
| airId | ContentId | AirDate    |
+-------+-----------+------------+
| 91    | 1         | 2015-01-01 |
+-------+-----------+------------+
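(As an aside, reading the air dates back while respecting the hierarchy could look roughly like the sketch below. This is Postgres-style recursive-CTE SQL, SQL Server spells it plain WITH, and AirDates is my own hypothetical name for the air-date tables shown above.)
-- Resolve the effective air date of every content item by walking
-- up the hierarchy and taking the nearest row that has an air date.
WITH RECURSIVE ancestry (ItemId, AncestorId, Depth) AS (
    SELECT ID, ID, 0 FROM Content
    UNION ALL
    SELECT a.ItemId, c.ParentId, a.Depth + 1
    FROM ancestry AS a
    JOIN Content AS c ON c.ID = a.AncestorId
    WHERE c.ParentId IS NOT NULL
)
SELECT a.ItemId, ad.AirDate
FROM ancestry AS a
JOIN AirDates AS ad ON ad.ContentId = a.AncestorId
WHERE a.Depth = (SELECT MIN(a2.Depth)
                 FROM ancestry AS a2
                 JOIN AirDates AS ad2 ON ad2.ContentId = a2.AncestorId
                 WHERE a2.ItemId = a.ItemId);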
Cool. Everything's fine so far; we're adhering to 4NF (I think!) and we can proceed to our business logic.
Now we get to my question. If we implement our business logic in a way that disregards the referential hierarchy and instead duplicates the air dates down the hierarchy, what is this anti-pattern called? For example, let's say I set an air date at the Show level as above, but the business logic finds all child elements and creates an entry for each one, resulting in:
+-------+-----------+------------+
| airId | ContentId | AirDate    |
+-------+-----------+------------+
| 91    | 1         | 2015-01-01 |
| 92    | 2         | 2015-01-01 |
| 93    | 3         | 2015-01-01 |
| 94    | 4         | 2015-01-01 |
+-------+-----------+------------+
There are some pretty clear problems with this, but please note that my question is not how to fix it. Just: is there a specific term for it? I want to call it something like "disregarding data relationships" or "ignoring referential context". Maybe it's not strictly a database anti-pattern, since in my example there's an external actor inserting the excess rows.
I'm currently struggling to get decent performance for a date range query on a ~18M-document core with NRT indexing, in Solr 4.10 (CDH 5.14).
I tried multiple strategies but everything seems to fail.
Each document has multiple versions (10 to 100) valid over different, non-overlapping time periods (startTime/endTime).
The query pattern is the following: query on the referenceNumber (or other criteria) but only return documents valid at a referenceDate (day precision). 75% of queries select a referenceDate within the last 30 days.
If we query without the referenceDate, we get very good performance, but a 100x slowdown with the additional referenceDate filter, even when forcing it as a postfilter.
Here are some perf tests from a Python script executing HTTP queries and recording the QTime for 100 distinct referenceNumber values.
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 1  | q=referenceNumber:{referenceNumber} | 100 calls in <10ms   | Performance OK           |
+----+-------------------------------------+----------------------+--------------------------+
| 2  | q=referenceNumber:{referenceNumber} | 99 calls in <10ms    | 1 call to warm up        |
|    | &fq=startDate:[* TO NOW/DAY]        | 1 call in >=1000ms   | the cache then all       |
|    | AND endDate:[NOW/DAY TO *]          |                      | queries hit the filter   |
|    |                                     |                      | cache. Problem: as       |
|    |                                     |                      | soon as new documents    |
|    |                                     |                      | come in, they invalidate |
|    |                                     |                      | the cache.               |
+----+-------------------------------------+----------------------+--------------------------+
| 3  | q=referenceNumber:{referenceNumber} | 99 calls in >=500ms  | The average of           |
|    | &fq={!cache=false cost=200}         | 1 call in >=1000ms   | calls is 734.5ms.        |
|    | startDate:[* TO NOW/DAY]            |                      |                          |
|    | AND endDate:[NOW/DAY TO *]          |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
How is it possible that the additional date range filter creates a 100x slowdown? From this blog post, I would have expected performance with the date range filter to be similar to performance without it: http://yonik.com/advanced-filter-caching-in-solr/
Or is the only option to change the softCommit/hardCommit delays, create 30 warmup fq entries for the past 30 days, and tolerate poor performance on the remaining 25% of our queries?
Edit 1: Thanks for the answer. Unfortunately, using integers instead of tdate does not seem to provide any performance gains; it can only leverage caching, like query ID 2 above. That means we would still need a warmup strategy of 30+ fq entries.
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 4  | fq={!cache=false}                   | 35 calls in <10ms    |                          |
|    | referenceNumber:{referenceNumber}   | 65 calls in >10ms    |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 5  | fq={!cache=false}                   | 9 calls in >100ms    |                          |
|    | referenceNumber:{referenceNumber}   | 6 calls in >500ms    |                          |
|    | AND versionNumber:[2 TO *]          | 85 calls in >1000ms  |                          |
+----+-------------------------------------+----------------------+--------------------------+
Edit 2: It seems that moving my referenceNumber from q to fq and setting different costs improves the query time (not perfect, but better). What's weird, though, is that an fq with cost >= 100 is supposed to be executed as a postfilter, yet changing the cost from 20 to 200 does not seem to impact performance at all. Does anyone know how to see whether an fq param is executed as a postfilter?
+----+-------------------------------------+----------------------+--------------------------+
| 6  | fq={!cache=false cost=0}            | 89 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 11 calls in >500ms   |                          |
|    | &fq={!cache=false cost=200}         |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 7  | fq={!cache=false cost=0}            | 36 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 64 calls in >500ms   |                          |
|    | &fq={!cache=false cost=20}          |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
Hi, I have another solution for you; it should give good performance for the same query in Solr.
My suggestion is to store the dates in int format; please see the example below.
Your start date: 2017-03-01
Your end date: 2029-03-01
Suggested int format:
Start date: 20170301
End date: 20290301
When you fire the same query with integers instead of dates, it works faster, as expected.
So your query will be:
q=referenceNumber:{referenceNumber}
&fq=startNewDate:[* TO {YYYYMMDD}]
AND endNewDate:[{YYYYMMDD} TO *]
where {YYYYMMDD} is the reference date as an 8-digit integer.
Hope it helps.
Let's say I have a report with the following table in the body:
ID | Name | FirstName | GivenCity | RecordedCity | IsCorrectCity
1 | Gates | Bill | London | New York | No
2 | McCain | John | Brussels | Brussels | Yes
3 | Bullock | Lili | London | London | Yes
4 | Bravo | Johnny | Paris | Las Vegas | No
The column IsCorrectCity basically contains an expression that checks GivenCity against RecordedCity and returns No if they differ or Yes if they are equal.
Is it possible to add a report filter on the IsCorrectCity column (and how), so that users can simply select all records with No or Yes? I know this can be done with a parameter in the SQL query, but I would like to base it on the expression rather than adding more calculations to the query.
Here's a tutorial that explains how you can do it:
Filtering Data Without Changing Dataset [SSRS]
So I'm currently working on a stored procedure that returns a table of user data (user ID) and performance metrics (field, metric_obtained). I was originally returning all the data, but I figured it would be more efficient to return only the people who meet the minimum to be recognized. I've already filtered out rows below the minimum requirement, but people can qualify based on a combination of metrics: if I have 3 metrics A, B, and C, one recognition might be for A and B, while another is just for C. I'm also not limited to a maximum of 3, or this wouldn't be a problem.
My tables look like this:
|Employee | Metric | Obtained|
|_________|________|_________|
| John    | Email  | .98     |
| Sue     | Email  | .99     |
| Sue     | Phone  | .82     |
| Larry   | Email  | .93     |
| Larry   | Phone  | .83     |
| Jess    | Phone  | .9      |
| Jess    | Email  | .94     |
| Bob     | Phone  | .99     |
So if I need to get back both Phone AND Email, my results would look like this:
|Employee | Metric | Obtained|
|_________|________|_________|
| Sue     | Email  | .99     |
| Sue     | Phone  | .82     |
| Larry   | Email  | .93     |
| Larry   | Phone  | .83     |
| Jess    | Phone  | .9      |
| Jess    | Email  | .94     |
Like I said, this would be easy if I had a guaranteed number of metrics, but I don't. Any thoughts?
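For what it's worth, this is the classic relational-division problem, and one common approach is a GROUP BY ... HAVING count check. Here is a minimal sketch, assuming your data lives in a hypothetical EmployeeMetrics table and the required combination arrives as a list of metric names:
-- Employees qualify only if they have a row for every required metric.
-- The required list can be any length, so nothing is hard-coded to 3.
-- EmployeeMetrics and RequiredMetrics are hypothetical names.
WITH RequiredMetrics AS (
    SELECT 'Email' AS Metric
    UNION ALL
    SELECT 'Phone'
)
SELECT em.Employee, em.Metric, em.Obtained
FROM EmployeeMetrics AS em
WHERE em.Metric IN (SELECT Metric FROM RequiredMetrics)
  AND em.Employee IN (SELECT Employee
                      FROM EmployeeMetrics
                      WHERE Metric IN (SELECT Metric FROM RequiredMetrics)
                      GROUP BY Employee
                      HAVING COUNT(DISTINCT Metric) =
                             (SELECT COUNT(*) FROM RequiredMetrics));
In a stored procedure, the RequiredMetrics CTE could be replaced by a table-valued parameter or a split of a delimited string, so each recognition's combination of metrics can be passed in.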