How to do a t-test after matching in MatchThem

Using matchthem(), I have obtained matched datasets. Now I want to check the balance of
covariates between the two matched groups using a t-test or p-values, but the cobalt package does not show p-values or t-test results.

This cannot be done and should not be done: it is inappropriate to compute p-values to assess or report balance. The reasons are explained in the cobalt vignette and the references therein. cobalt provides ample appropriate statistics for assessing balance.

Related

Combining the impact of absolute value and YoY change

I am currently working on a problem where we are ranking a customer demographic by absolute value of sales or by the YoY (year-over-year) change in sales.
Obviously, looking at both metrics at the same time tells us whether the growth we are seeing is actually substantial.
For example, say a state had YoY growth of 200% by the end of 2021; if the absolute value of sales only changed from $100 to $300, that is not real growth compared with the hundred-thousand-dollar figures we usually see for other states.
I am looking for ways to combine the effect of both these metrics (absolute value and YoY change) and create a composite metric which can be used to rank my customer demographic.
A common way to combine two metrics is to normalise them to the same range, e.g. 0..1, and then take a weighted average. However, there is always a question of what the weights should be and, more importantly, what such a metric means. In my experience, I'd advise against trying to combine metrics this way.
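For illustration, here is a minimal sketch of the normalise-then-weight approach described above (the data, column names and the 0.7/0.3 weights are all made up; the point is that the ranking shifts as soon as you change the weights):

```python
import pandas as pd

# Hypothetical data: absolute sales and YoY change per state.
df = pd.DataFrame({
    "state": ["A", "B", "C"],
    "sales": [120_000, 300, 95_000],
    "yoy_change": [0.05, 2.00, 0.12],
})

# Min-max normalise each metric to the 0..1 range.
def min_max(s):
    return (s - s.min()) / (s.max() - s.min())

df["sales_norm"] = min_max(df["sales"])
df["yoy_norm"] = min_max(df["yoy_change"])

# Weighted average; the 0.7/0.3 weights are arbitrary, which is exactly
# the objection raised above: the choice of weights is subjective.
w_sales, w_yoy = 0.7, 0.3
df["composite"] = w_sales * df["sales_norm"] + w_yoy * df["yoy_norm"]

print(df.sort_values("composite", ascending=False))
```

Re-running the last few lines with different weights reorders the ranking, which is the arbitrariness the answer warns about.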

Database anonymization : Using additive noise

I want to do an experiment involving the use of additive noise for protecting a database from inference attacks.
My experiment will begin by generating a list of values with a mean of 25. Then I will anonymize these values by adding a random noise value that is designed to have an expected value of 0.
For example:
I can use uniformly distributed noise in the range [-1,1] or use a Normal (Gaussian) noise with mean 0.
I will test this anonymization method on databases of 100, 1,000, and 10,000 values with different kinds of noise.
I am not sure which platform to use or how, so I started with 10 values in Excel: for uniformly distributed noise I use RAND() and add it to the actual value; for normal noise I use NORM.INV with mean 0 and add that to the actual value.
But I don't know how to interpret the data from the hacker's side. When I add noise to the dataset, how can I interpret its effect on privacy as the dataset becomes larger?
Also, should I use a database tool to handle this problem?
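If Excel becomes unwieldy, the same experiment is straightforward to script. A rough sketch, keeping the mean of 25, the uniform [-1, 1] and Gaussian noise options, and the dataset sizes from the question (the spread of the original values is an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def anonymize(values, noise="uniform", scale=1.0):
    """Add zero-mean noise to every value."""
    if noise == "uniform":
        # Uniform on [-scale, scale]; in Excel this would be 2*scale*RAND() - scale.
        return values + rng.uniform(-scale, scale, size=values.shape)
    # Gaussian with mean 0 and standard deviation `scale`,
    # the Excel equivalent being NORM.INV(RAND(), 0, scale).
    return values + rng.normal(0, scale, size=values.shape)

for n in (100, 1_000, 10_000):
    # The question only fixes the mean at 25; the standard deviation of 5 is an assumption.
    true_values = rng.normal(25, 5, size=n)
    noisy = anonymize(true_values, noise="uniform", scale=1.0)
    print(n, round(float(true_values.mean()), 3), round(float(noisy.mean()), 3))
```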
From what I understand, you're trying to protect your "experimental" database from inference attacks.
Attackers try to steal information from a database, using queries that are already allowed for public use. First, try to decide on your identifiers, quasi-identifiers and sensitive values.
Consider a student management system that stores each student's GPA. We know that GPA is sensitive information. The identifier is "student_id", and the quasi-identifiers are "standing" and, let's say, "gender". In most cases, the administrator of the RDBMS allows aggregate queries such as "Get average GPA of all students" or "Get average GPA of senior students". Attackers try to infer from these aggregate queries. If, somehow, there is only one student who is a senior, then the query "Get average GPA of senior students" returns the GPA of one specific person.
There are two main ways to protect the database from this kind of attack: de-identification and anonymization. De-identification means removing any identifier and quasi-identifier from the database, but this does not work in some cases. Consider one student who takes a make-up exam after grades are announced. If you get the average GPA of all students before and after he takes the exam and compare the results, you'd see a small change (say, from 2.891 to 2.893). The attacker can infer the make-up exam score of that one particular student from this 0.002 difference in the aggregate GPA.
The other approach is anonymization. With k-anonymity, you divide the database into groups that have at least k entities. For example, 2-anonymity ensures that there are no single-entity groups, so aggregate queries on single-entity groups no longer leak private information.
Unless you are one of the two entities in a group.
If there are two senior students in a class and you want to know the average grade of seniors, 2-anonymity allows you to have that information. But if you are one of those seniors and you already know your own grade, you can infer the other student's grade.
Adding noise to sensitive values is a way to deal with those attacks, but if the noise is too low it has almost no effect on the information leaked (e.g. for grades, knowing someone scored 57 out of 100 instead of 58 makes almost no difference), and if it's too high it results in a loss of utility.
You've also asked how to interpret the effect on privacy as the dataset becomes larger. If you take the average of an extremely large noisy dataset, the noise averages out and the result approaches the true aggregate of everyone's sensitive values (this may sound complex, but think of the dataset as infinite and of the sensitive attribute as taking finitely many values, then work out the probabilities). Adding noise with zero mean works, but the range of the noise should get wider as the dataset gets larger.
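A small simulation illustrates that last point (the original values and the fixed noise range are assumptions): with zero-mean noise of a constant range, the attacker's estimate of the aggregate gets more precise as the dataset grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 1_000, 10_000, 100_000):
    true_values = rng.normal(25, 5, size=n)
    noisy = true_values + rng.uniform(-1, 1, size=n)   # fixed noise range [-1, 1]
    # The attacker sees only `noisy`, yet its mean deviates from the true mean
    # by roughly 0.58 / sqrt(n), so the aggregate leaks more precisely as n grows.
    print(n, round(float(abs(noisy.mean() - true_values.mean())), 4))
```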
Lastly, if you are using Excel, which is a spreadsheet rather than an RDBMS, I suggest you come up with equivalents of SQL queries, define the identifiers and quasi-identifiers of the dataset, and decide which public queries can be performed by anyone.
Also, in addition to k-anonymity, take a look at l-diversity and t-closeness of a dataset and their use in database anonymization.
I hope that answers your question.
If you have further questions, please ask.

Tricky duplicate control: meeting criteria in Excel array formulas

A bit of a tricky question. I might just have to do it through VBA with a proper script, but if someone actually has an answer, even a complicated one (let's be honest, I don't think there's a super simple formula for this), I'm a taker. I'd rather do as much as I can through formulas. I've attached a sample.
The data: I have data that relates to countries. In each country you can have multiple sites, and each site may or may not have different distributions. When a distribution meets a given criterion, I want to tally that as a "break" and count how many breaks there are by country, site, etc.
How it works: I'm using array formulas with SUMPRODUCT() for this. The nice thing is that you can easily add criteria: each criterion returns 0/1, so when you multiply them you get the array you need to sum up to see how many breaks you have.
The problem: I am unable to write the formula so that each site is counted only once when the same site has two different distribution types and both meet the break criterion. If both distributions meet it, I don't want to record that as two breaks, otherwise I may end up with more sites with breaks than there are sites. Part of the problem is how I account for the uniqueness of sites:
(tdata[siteid]>"")/COUNTIF(tdata[siteid],tdata[siteid] &"")
This is actually a bit of a hack: unlike the other terms, it doesn't return 0/1 but possibly fractions. They do add up correctly and let me, say, count the number of sites correctly, but because the array isn't 0/1, multiplying it with other 0/1 arrays messes up the results.
I control the data, so I have some leeway. I work with tables (as can be seen) and VBA is already used. I could sort the source tables if that helps. Source data:
1 row = 1 distribution for 1 site on 1 month
The summary table per country I linked is based on those source data.
Any idea?
EDIT - Filtering by distribution is not really an option. I already have event-based filters for the source data, and I can already calculate the indicator correctly for data filtered by distribution. But I also need to display global figures (which currently don't work), and there are other indicators to calculate that won't work if I filter the data (it's a big dashboard).
EDIT2: In other words, if the break criterion is met for the same siteid under two different distributions, I want to count that as one break only, while keeping in mind that if only one distribution has a break (and the other doesn't), I still want to record it as one site with a break in that country.
EDIT3: I've decided to make a new table that summarizes the data for each site individually (each of which may have more than one distribution). Then I can calculate the global figures from that.
My take-home message: when you have many levels of data in Excel formulas (e.g. countries, sites, with a sub-level of distributions), it's difficult NOT to summarize the data in intermediate tables at the level of analysis you want to focus on. In my case I'm interested in country-level analysis, which is two "levels" above the distribution level, so there will be "duplication" of data from a site-level perspective. You may be able to work around this, but by far the simpler solution is to accept it and build an intermediate table. It also shortens your formulas significantly.
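Outside Excel, this intermediate-table idea maps to a plain group-by. A rough pandas sketch with invented column names (country, siteid, distribution, is_break), just to show the logic:

```python
import pandas as pd

# 1 row = 1 distribution for 1 site (column names are assumptions).
rows = pd.DataFrame({
    "country":      ["FR", "FR", "FR", "DE"],
    "siteid":       ["S1", "S1", "S2", "S3"],
    "distribution": ["A",  "B",  "A",  "B"],
    "is_break":     [True, True, False, True],
})

# Intermediate table: one row per site; a site counts as "in break"
# if ANY of its distributions meets the break criterion.
per_site = (rows.groupby(["country", "siteid"])["is_break"]
                .any()
                .reset_index())

# Country-level summary: S1 contributes only one break even though
# two of its distributions break.
per_country = per_site.groupby("country")["is_break"].sum()
print(per_country)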
I don't mark this as the solution because it's not what I was looking for. Still open to better suggestions that allow working only with formulas.
File: https://www.dropbox.com/sh/4ofctha6qhfgtqw/AAD0aPJXr__tononRTpKc1oka?dl=0
Maybe the following can help.
First, you filter the entries which don't meet the criteria regarding the distribution.
In a second step, you sort the table from A to Z based on the column siteid.
Then you add an extra column after the last one with the formula =C3<>C4, where column C contains the siteid entries. That way every duplicate is marked with a FALSE value in the helper column.
After that you filter out the FALSE values in this column.
You are then left with unique site ids.
If I got your question wrong, I'd be glad about an update so I can try to help further.

Merging two datasets with unclear identifier in Excel

I'm trying to merge two datasets on Mergers & Acquisitions. They both consist of c.10'000 observations with c.50-100 variables each. One contains information about the actual M&A deal whereas the other one contains info on how a deal was financed.
The problem is that there is no clear and unique identifier. For example, I could use the date that the deal was announced but that wouldn't be unique because on some days 10 deals were announced. Using company names is difficult since they mostly aren't identical in both datasets. For example if in one dataset I find "Ebay", in the other the same company could be called "eBay", "Ebay Inc", or "Ebay, Inc."
I've been working with the Fuzzy Lookup add-in for Excel, as well as concatenating various identifiers that are not unique on their own but become useful in combination (e.g. Date & Country & SIC Industry Classification Code, etc.). However, I haven't been able to generate as many matches as I had hoped.
I'd be grateful for any ideas or pointers towards resources that would help me merge the datasets more efficiently.
This is probably a process that requires a set of tactics applied over several iterations. No single fuzzy match will catch them all the first time. An optimal strategy narrows the possibilities enough for manual matching. Once records are matched, exclude them from further fuzzy matching, and keep cycling through the remaining records with various tactics to whittle the unmatched entries down to as few as possible.
As for the tactics: you mention that on a given date there could be 10 mergers. Now you only have to manually match up those 10, which is a manageable chunk.
Using company names is difficult...
Yes, but you can combine the date with the Levenshtein distance (or an equivalent algorithm) to rank the possible choices and narrow the options. Names with eBay in them will all rank closer together when sorted by their Levenshtein distance.
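A rough sketch of that tactic in Python, with made-up field names and a hand-rolled edit distance (any Levenshtein implementation or fuzzy add-in would do the same job):

```python
from datetime import date

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (case-insensitive)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical records from the two datasets.
deals = [{"announced": date(2020, 3, 1), "acquirer": "eBay Inc"}]
financing = [{"announced": date(2020, 3, 1), "acquirer": "Ebay, Inc."},
             {"announced": date(2020, 3, 1), "acquirer": "Siemens AG"}]

for deal in deals:
    # Block on the announcement date, then rank the survivors by edit distance.
    candidates = [f for f in financing if f["announced"] == deal["announced"]]
    ranked = sorted(candidates,
                    key=lambda f: levenshtein(deal["acquirer"], f["acquirer"]))
    print(deal["acquirer"], "->", [c["acquirer"] for c in ranked])
```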
Other text comparison algorithms besides Levenshtein include Gotoh, Jaro, Soundex, Chapman, etc. Some of these techniques are decades old so the chances of finding Excel add-ins are high. There used to be a vibrant open source group working on these solutions on Sourceforge.net.
...their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.).
Watch out for SIC codes. They are neither consistent nor accurate: depending on who maintained them, the values may not be reliable beyond the 4-digit level. The SIC codes themselves have also been updated over time, while companies were under no obligation to revise theirs accordingly. Lastly, SIC codes have been superseded by NAICS codes, which in turn have had several versions; with each revision new industries are added, such as social-media companies that did not exist in the old SIC codes. However, SIC/NAICS codes can still be useful for eliminating duplicates through self-matching.
...any ideas or pointers towards resources...
SQL Server's text indexing features have ready-made algorithms for finding matches. Depending on your resources, that may be something to explore.
If this matching process is going to be a routine task, then you can explore specialized ETL products (extract, transform, load) offered through various data integration services. Some of these ensure bi-directional updates. But no matter how sophisticated the solution, it's all based on a few simple tactics as shown above and nearly all of them contain manual overrides.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group similar items together and say that, with XX% certainty, Record X and Record Y are probably of the same nature. Or perhaps a better way of putting it: the system would look at Record Y and say that, based on its content, it is most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match with high confidence, a match with lower confidence, or very likely no match at all. In the last case, it's likely you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each matched record, you can also go back to low-scoring matches and see whether a better match shows up later in your processing.
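To make the regex idea concrete, here is a tiny sketch that masks identifier-looking tokens so that structurally identical records collapse to the same "known type" (the masking rules are assumptions and would need tuning to your actual log formats):

```python
import re

def to_type_pattern(line: str) -> str:
    """Replace identifier-looking tokens with a wildcard so that
    structurally identical lines collapse to the same 'known type'."""
    return re.sub(r"\b[A-Z]{2,}\d+\b|\b\d+\b", "<ID>", line)

known_types = {}
for line in ["Change Transaction ABC123 Assigned To Server US91",
             "Change Transaction XYZ789 Assigned To Server GB47",
             "User 4711 Logged Out"]:
    pattern = to_type_pattern(line)
    known_types.setdefault(pattern, []).append(line)

for pattern, members in known_types.items():
    print(pattern, "->", len(members), "records")
```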
I would suggest indexing your data with a text search engine like Lucene to split your log entries into terms. Since your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in the same way; the search engine will then return the most similar entries. You may need to tweak the similarity function a bit to get the best results, but I believe this is a good start.
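A quick sketch of the n-gram idea in plain Python (in Lucene itself this kind of word n-gram is usually produced at analysis time, e.g. with a shingle filter):

```python
def ngrams(tokens, n):
    """Return the underscore-joined n-grams of a token list."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

line = "Change Transaction XYZ789 Assigned To Server GB47"
tokens = line.split()

print(ngrams(tokens, 2))   # bigrams, as listed above
print(ngrams(tokens, 3))   # trigrams
```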
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach: build an index of the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. However, the information-retrieval approach is usually only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (which may be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so that it doesn't, e.g., cluster on the server ID; a rough sketch is given at the end of this answer.
Both strategies have their ups and downs. The first one is quite fast, but it will always just return some similar existing log lines, without much quantitative information on how common that kind of line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
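Here is that sketch: a minimal TF-IDF-plus-clustering example with scikit-learn (the cluster count, token pattern, and toy log lines are all assumptions; at the scale of hundreds of millions of rows you would work on a sample or switch to something like MiniBatchKMeans):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "User 4711 Logged Out",
    "User 4712 Logged Out",
]

# Word bigrams help keep the phrase structure of machine-generated text;
# restricting tokens to purely alphabetic words is one crude way to keep
# the clustering from keying on transaction or server IDs.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b[A-Za-z]+\b")
X = vectorizer.fit_transform(logs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for line, label in zip(logs, kmeans.labels_):
    print(label, line)
```

With the IDs stripped out of the features, the two "Change Transaction" lines and the two "Logged Out" lines end up in separate clusters.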
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). From there you can train a classifier or just use one of its clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
