Tricky duplicate control: meeting criteria in Excel array formulas - arrays

A bit of a tricky question - I might just have to do it through VBA with a proper script, but if someone has an answer, even a complicated one (let's be honest, I don't think there's a super simple formula for this), I'll take it. I'd rather do as much as I can through formulas. I've attached a sample.
The data: I have data that relates to countries. In each country, you can have multiple sites. For each site, you may or may not have different distributions. When those distributions meet a given criterion, I want to tally that up as a "break" and count how many there are by country, site, etc.
How it works: I'm using array formulas with SUMPRODUCT() for this. The nice thing is that you can easily add criteria: each criterion returns a 0/1 array, so when you multiply them together you get the array you need to sum up to see how many breaks you have.
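For illustration (these column names are made up, not from my actual file), two criteria multiplied together look like this:
=SUMPRODUCT((tdata[country]="FR")*(tdata[break]=1))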
The problem: I am unable to write the formula so that each site is counted only once when the same site has 2 different distribution types and both meet the break criteria. If both distributions meet the break criteria, I don't want to record that as 2 breaks; otherwise I may end up with more sites with breaks recorded than there are sites. Part of the problem is how I account for the uniqueness of sites:
(tdata[siteid]>"")/COUNTIF(tdata[siteid],tdata[siteid] &"")
This is actually a bit of a hack, in the sense that unlike the other criteria it doesn't return 0/1 but possibly fractions. They do add up correctly and allow me to, say, count the number of sites correctly, but since the array isn't formatted as 0/1, multiplying it with the other 0/1 arrays messes up the results.
I control the data, so I have some leeway. I work with tables (as can be seen) and VBA is already used. I could sort the source tables if that helps. Source data:
1 row = 1 distribution for 1 site in 1 month
The summary table per country that I linked is based on that source data.
Any idea?
EDIT - Filtering by distribution is not really an option. I already have event-based filters for the source data, and I can already calculate the indicator correctly for data filtered by distribution. But I also need to display global data (which is currently not working). Also, there are other indicators to be calculated that won't work if I filter the data (it's a big dashboard).
EDIT2: In other words, I need some way to account for the fact that if the break criteria are met by 2 rows with the same siteid but 2 different distributions, I want to count that as 1 break only. While keeping in mind that if one distribution has a break (and the other does not), I still want to record it as 1 site with a break in that country.
EDIT3: I've decided to make a new table that summarizes the data for each site individually (each of which may have more than one distribution). Then I can calculate the global figures from that.
My take-home message from this: when you have many levels of data in Excel formulas (e.g. countries, then sites, then a sub-level with distributions), it's difficult NOT to summarize the data in intermediate tables for the level of analysis you want to focus on. E.g. in my case, I am interested in country-level analysis, which is 2 "levels" above the distribution level, so there is "duplication" of data from a site-level perspective. You may be able to navigate around this, but I think by far the simpler solution is to suck it up and make an intermediate table. It also shortens your formulas significantly.
I haven't marked this as a solution because it's not what I was looking for. Still open to better suggestions that work with formulas only...
File: https://www.dropbox.com/sh/4ofctha6qhfgtqw/AAD0aPJXr__tononRTpKc1oka?dl=0

Maybe the following can help.
First, filter out the entries which don't meet the distribution criteria.
In a second step, sort the table from A to Z by the siteid column.
Then add an extra column after the last one with the formula =C3<>C4, where column C contains the siteid entries. That way, every duplicate is marked with a FALSE value in the helper column.
After that, filter out the FALSE values in this column.
You are left with unique site ids.
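Alternatively, if you'd rather not sort or filter at all, a single SUMPRODUCT can count each site only once by dividing by a COUNTIFS that repeats the same criteria. A sketch (tdata[country] and tdata[break] stand in for your actual criteria columns):
=SUMPRODUCT(((tdata[country]="FR")*(tdata[break]=1))/COUNTIFS(tdata[siteid],tdata[siteid]&"",tdata[country],tdata[country]&"",tdata[break],tdata[break]&""))
Each qualifying row is divided by the number of rows sharing its siteid and the same criteria values, so a site with two breaking distributions contributes 1/2 + 1/2 = 1.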
In case I've misunderstood your question, I'd be glad to see an update so I can try again.

Related

Is there a Google Apps Script for Fuzzy Lookups?

I have a list of companies on a spreadsheet that is rarely updated. I'll call it List A.
I also have a constantly updating weekly list of companies (List B) that should have entries that match some entries on List A.
The reality is that the company names extracted into List B are often inconsistent due to various business abbreviations (e.g. The Company, Company Ltd., Company Accountants Limited). Sometimes these companies are under different trading names or have various misspellings.
My initial, not very intelligent reaction was to construct a table of employer alias names, with the first column being the true employer name and the following columns holding aliases, something like this: [https://i.stack.imgur.com/2cmYv.png]
On the left is a sample table, and the far right is a column where I am using the following array formula template:
=ArrayFormula(INDEX(A30:A33,MATCH(1,MMULT(--(B30:E33=H30),TRANSPOSE(COLUMN(B30:E33)^0)),0)))
I realized soon after that I needed to create a new entry for every single exact match variation (Ltd., Ltd, and Limited), so I looked into fuzzy lookups. I was really impressed by Alan's Fuzzy Matching UDFs, but my needs heavily lean towards using Google Spreadsheets rather than VBA.
Sorry for the long post, but I would be grateful if anyone has any good suggestions for fuzzy lookups or can suggest an alternative solution.
The comments weren't exactly what I was looking for, but they did provide some inspiration for me to come up with a bandaid solution.
My original array formula needed exact matches, but the problem was that there were simply too many company suffixes and alternate names, so I looked into fuzzy lookups.
My current answer is to abandon the fuzzy lookup proposal and instead focus on editing the original data string (i.e. the company name) into a more simplified substring. Borrowing from a few snippets floating around, I came up with a custom formula built around two lines of Google Apps Script:
var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
var alphanumericalmin = companysuffixremoval.replace(/[^A-Za-z0-9]/g, "");
The first line is simply my attempt at removing popular company suffixes and the word "the" from the string.
The second line removes all non-alphanumeric characters, including spaces and periods.
This ensures "The First Company Limited." and "First Company Ltd" both become "FirstCompany", which should work for returning the same values from the original array formula in the OP. Of course, I also implemented a trimming and cleaning line for any trailing/leading/extra spaces in the initial string, but that's probably unnecessary given the second line.
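For reference, here is how those two lines fit into a complete custom function (just a sketch; the function name SIMPLIFYNAME and the \b word boundaries, which keep "the" from being stripped out of the middle of longer words, are my own additions):
function SIMPLIFYNAME(str) {
  // Strip common company suffixes and the standalone word "the".
  var nosuffix = str.toString().replace(/\b(limited|ltd|group|holdings|plc|llp|sons|the)\b/ig, "");
  // Drop everything that is not a letter or digit (spaces, periods, etc.).
  return nosuffix.replace(/[^A-Za-z0-9]/g, "");
}
It can then be called from a cell as =SIMPLIFYNAME(A1).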
If someone can come up with a better idea, please do tell. As it stands, I'm just tinkering with a script with minimal experience.

Is it possible to use LookupSet/Lookup with RunningValue in SSRS

This is my first question on StackOverflow so apologies if there is not enough appropriate information.
Rather than having four different tables that I try to position 'just so' so that they look like one table, I was hoping to have all of my data in one visible table and hide the rest.
To do this I was trying to use LookupSet/Lookup with Running Value (I need a cumulative figure for each fortnight from a start date).
I have used the following code which supplies me with figures in the table - however the figures seem to be nearly double what they actually are.
=Lookup(Fields!StartFortnightDate.Value, Fields!StartFortnightDate.Value,
Fields!RowIdentifier.Value, "KPI004")
Is it possible to use Lookup with RunningValue? It won't let me use ReportItems either; that obviously only pulls from the first box and therefore just repeats the first figure again and again.
Any help, guidance, or even a simple "it's not possible" would be appreciated.
Edited to add more information as suggested:
It's difficult to add example data without worrying about data protection etc.
Report design is currently: [image: ReportDesign]
Each table has its own dataset - I'm trying to get them all into one table.
Let's say the first dataset is the number of cars sold in each fortnight.
The second dataset (table) is number of meetings held.
The third dataset is number of days weather was sunny/cloudy/rainy etc.
(This obviously isn't what the datasets really are, but I'm trying to show that they don't actually relate to each other much and therefore can't all come from the same query.)
All datasets have a table of the fortnightly dates within that quarter, my hope was to get one table that showed the cumulative figures of each item even though they're not in the same dataset - the tables are all grouped by the StartOfFortnightDate.
The expression =RunningValue(Fields!NumberOfFordCarsSold.Value, Count, Nothing) and similar work fine in the separate tables; however, if I add a row to the top table and try to use RunningValue with Lookup, it doesn't work.
When I use the Lookup expression mentioned at the top, I get inflated figures (top row of this image) compared to the expected figures (bottom row): [image: IncorrectAndCorrectFigures]
Apologies if this doesn't make sense, it's likely that my complete confusion in trying to find the answer is coming across in the question.
If the resulting datasets are all similar then why can you not combine them?
From the output they seem to be just Indicator & Date.
Add an extra column to indicate which set of data each row belongs to (Cars, Meetings, etc.); this might help with grouping rows in the report.

Using an array of strings that have the same meaning as a "lookup_value" in the Excel MATCH() function

I have a large table of market values with rows labeled with each asset name and each column representing each month between 2000 and 2014.
The table I have is currently blank, and I want to use an INDEX/MATCH function to search for the data corresponding to each date/asset combo in data that was submitted to me. My problem is that this submitted data is slightly inconsistent in how it names assets at different points in time. One year may have an asset called Goldman Sachs Strategic and another year may label the same asset as GS Strategic Income.
I would like to allow the user of the spreadsheet to specify names that are equivalent to each other. On a sheet called "equivalent names", I would like cell A1 to be Goldman Sachs Strategic and B1 to be GS Strategic Income. I was hoping there was a way to make A1:B1 an array that could then be used as a lookup value within my INDEX/MATCH function.
I realize this probably is not the way to approach this problem, but I have a very limited dictionary of solutions because I have limited experience coding and using Excel. I was hoping someone could point me in the direction of a solution that would actually work, because I am assuming inconsistent data is a problem many people have dealt with before. Thanks a lot for any help you can offer!
I created some random data to test with
I then created the following table on the second worksheet.
The last 5 columns (2010, 2011, 2012, 2013, 2014) hold the year-to-year names. Using these names (which may or may not be blank), we simply chain together a series of SUMIF() functions.
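Something along these lines (the ranges are made up, since I don't know your actual layout; Data!A:B holds the submitted names and values, and C2:G2 are the five year-name cells for one asset):
=SUMIF(Data!$A:$A,$C2,Data!$B:$B)+SUMIF(Data!$A:$A,$D2,Data!$B:$B)+SUMIF(Data!$A:$A,$E2,Data!$B:$B)+SUMIF(Data!$A:$A,$F2,Data!$B:$B)+SUMIF(Data!$A:$A,$G2,Data!$B:$B)
If an alias cell can be blank, wrap each piece in an IF($C2<>"",SUMIF(...),0) so the blank criterion doesn't match anything unintended.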
Because you didn't provide specific data, I understand that this answer might not fully fit your question, so if there is anything that is incorrect, let me know. Otherwise, I hope this helps.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say: based on your content, you're most like Record X, as opposed to all the other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach, where we determine results for a given query, we're trying to classify a giant repository and group its records by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see whether each record has a match.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
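A minimal sketch of that idea (the regexes and names are purely illustrative):
var knownTypes = [
  /^Change Transaction \w+ Assigned To Server \w+$/
  // ...more types, added as they are discovered...
];
function matchType(record) {
  // Return the index of the first matching known type, or -1 for
  // "no match", i.e. a candidate new type to add to the list.
  for (var i = 0; i < knownTypes.length; i++) {
    if (knownTypes[i].test(record)) return i;
  }
  return -1;
}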
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare a query in the same way; the search engine will give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
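To illustrate the term extraction (a sketch, not tied to any particular search engine):
function bigrams(line) {
  // Split a log line on whitespace and join each consecutive pair of
  // words with "_" to form bigram terms for indexing.
  var words = line.trim().split(/\s+/);
  var terms = [];
  for (var i = 0; i + 1 < words.length; i++) {
    terms.push(words[i] + "_" + words[i + 1]);
  }
  return terms;
}
// bigrams("Change Transaction XYZ789 Assigned To Server GB47") returns
// ["Change_Transaction", "Transaction_XYZ789", ..., "Server_GB47"]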
Two main strategies come to mind here:
the ad-hoc one. Use an information retrieval approach: build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record, and the text search engine will (hopefully) return some related log entries to compare it with. Usually, however, the "information retrieval" approach is only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (which may, however, be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so it doesn't, e.g., cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return you some similar existing log lines, without much quantitative information about how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
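A bare-bones sketch of the TF-IDF step (illustrative only; real implementations normalize and tokenize far more carefully):
function tfidfVectors(docs) {
  // Document frequency: in how many log lines does each term appear?
  var df = {};
  var tokenized = docs.map(function (d) { return d.split(/\s+/); });
  tokenized.forEach(function (terms) {
    var seen = {};
    terms.forEach(function (t) {
      if (!seen[t]) { seen[t] = true; df[t] = (df[t] || 0) + 1; }
    });
  });
  // Sparse vector per line: term -> tf * log(N / df), so rare shared
  // terms weigh more than boilerplate like "Assigned To Server".
  return tokenized.map(function (terms) {
    var vec = {};
    terms.forEach(function (t) {
      vec[t] = (vec[t] || 0) + Math.log(docs.length / df[t]);
    });
    return vec;
  });
}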
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there, you can train a classifier or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().

Flagging possible identical users in an account management system

I am working on a possible architecture for an abuse detection mechanism in an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records within this table. There might be a certain pattern in which this user has created his/her accounts. What would it take to mine this table and flag records that may be possible duplicates? Another concern is scale. If we have, let's say, a million users, taking one user and matching it against the remaining users is computationally unrealistic. What if these records are distributed across various machines in various geographic locations?
What are some of the techniques, that I can use, to solve this problem? I have tried to pose this question in a technologically agnostic manner with the hopes that people can provide me with multiple perspectives.
Thanks
The answer really depends upon how you model your users and what constitutes a duplicate.
There could be a user that takes names from all the Harry Potter characters. Good luck finding that pattern :)
If you are looking for records that are approximately similar, try this simple approach:
Hash each word in the record and keep the minimum hash value. Do this for k different hash functions. Concatenate these min hashes. What you have is a near-duplicate signature.
To be clear, let's say a record has words w1...wn and your hash functions are h1...hk.
let m_i = min_j h_i(w_j)
and the signature is S = m1.m2.m3...mk
The cool thing about this signature is that if two documents contain 90% of the same words, then each min hash agrees with roughly 90% probability, so the whole k-part signature matches with probability about 0.9^k. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want to increase the number of matches, decrease k; if you are getting too many false positives, increase k.
Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.
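A toy sketch of the signature computation (the seeded string hash is illustrative, not a production-quality hash family):
function hashWord(word, seed) {
  // Simple seeded 32-bit string hash; one seed = one "hash function".
  var h = seed;
  for (var i = 0; i < word.length; i++) {
    h = (h * 31 + word.charCodeAt(i)) >>> 0;
  }
  return h;
}
function signature(record, k) {
  var words = record.toLowerCase().split(/\s+/);
  var mins = [];
  for (var seed = 1; seed <= k; seed++) {
    var m = Infinity;
    for (var j = 0; j < words.length; j++) {
      m = Math.min(m, hashWord(words[j], seed)); // m_i = min_j h_i(w_j)
    }
    mins.push(m);
  }
  return mins.join("."); // S = m1.m2...mk
}
// Records with mostly the same words will, with good probability, get
// exactly the same signature string, which can then be grouped on.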
