How many principal component analyses should I conduct for each construct? - logistic-regression

I would like to conduct PCA to reduce the total of 14 items. After getting the principal components, I will use these components as independent variables in a logistic regression.
These 14 items belong to two constructs: resources (5 items) and cultural (9 items).
My question is whether I should conduct one PCA on all 14 items OR two separate PCAs, one for each construct, and then extract the components. In the end, I would like to be able to interpret the two constructs (resources and cultural) separately in the logistic regression output.
Thank you in advance.

I found the answer. Only one PCA should be conducted, on all variables. For a more detailed interpretation of the analysis, you need to rotate the components and subset the variables, then use them in the logistic regression. The book below is a good reference:
George H. Dunteman - Principal Components Analysis (1989)
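For anyone who wants to see the skeleton of that workflow, here is a minimal sketch in Python with scikit-learn (my own illustration, not from Dunteman; the data, the number of retained components and the rotation step are placeholders):

```python
# Minimal sketch of the workflow above (illustrative only): one PCA on all
# 14 items, then the retained components as predictors in a logistic
# regression. X, y and n_components are placeholders for your own data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))        # stand-in for the 14 items
y = rng.integers(0, 2, size=200)      # stand-in for the binary outcome

model = make_pipeline(
    StandardScaler(),                 # PCA is scale-sensitive, so standardize first
    PCA(n_components=4),              # keep however many components you decide to retain
    LogisticRegression(),
)
model.fit(X, y)

# The loadings show which items drive each component; rotating them
# (e.g. varimax) can make the components easier to label as "resources"
# vs "cultural" before interpreting the regression coefficients.
loadings = model.named_steps["pca"].components_
print(loadings.shape)                 # (n_components, 14)
```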

Related

Recommendation for mailing address matching scenario?

My SQL Server database contains 2 tables with a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change that). The sets of fields in the two tables are similar though not identical - most fields exist in both tables, some only in one or the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) v Address 2 (other table) --adds complexity
Zip code (length varies in each table 5 v 5+ digits)
Legal description
Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source is highly preferred by the client.
Among options such as SOUNDEX, DIFFERENCE, and Levenshtein distance (all SQL), and usaddress and dedupe (Python), none stands out as a good-fit solution.
Ideally I'd like a simple way to call a "function" which returns
either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.
Let M be the length of the shorter of the two strings, N the length of the longer, and D your distance metric; then you can measure string similarity as (M-D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity by dividing LCS/N.
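As a rough illustration, here is a small pure-Python sketch of that (M-D)/N similarity using plain Levenshtein distance (my own toy code, not the CLR functions mentioned below):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + (ca != cb)))     # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """(M - D) / N: M = shorter length, D = edit distance, N = longer length."""
    if not a and not b:
        return 1.0
    m, n = sorted((len(a), len(b)))
    d = levenshtein(a, b)
    return max(0.0, (m - d) / n)

print(similarity("MAIN ST", "MAIN STREET"))  # 4 edits, so (7 - 4) / 11 ≈ 0.27
```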
If you can use CLRs I HIGHLY recommend mdq.similarity which you can get from here. It will give a similarity metric using these algorithms:
The Damerau-Levenshtein distance (the documentation only says "Levenshtein" but they are mistaken)
The Jaccard Similarity coefficient algorithm.
A form of the Jaro-Winkler distance algorithm.
A longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.). Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
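For example, here is a rough Python analogue of an n-gram based metric (an illustration of the idea only; it does not reproduce the exact NGrams8K T-SQL implementation): Jaccard overlap of character bigrams.

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of s (case-folded)."""
    s = s.upper()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two strings' character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

print(ngram_similarity("123 N Main St", "123 North Main Street"))
```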
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Coefficients and significance of lasso/ridge

I had 628 predictors after creating dummies for all categorical variables. After many iterations of traditional logistic regression, I arrived at 15 variables that gave me a pretty good model, with good ROC, recall and precision (for a certain cut-off) on test data, and all variables were significant (at p<=0.05). But since that took a lot of time, I tried lasso, which gave me 50 variables with non-zero coefficients after taking the best lambda value from 10-fold cross-validation. However, only 5 variables were common between the 15 variables from the traditional method and the 50 from lasso.
Moreover, when I tried to calculate their SEs and t-stats, I found that many variables were insignificant (low t-stats and high p-values). In addition, the AUC for the ROC was lower than with the traditional method, and the ROC dropped even more when I ran a traditional logistic regression on the 50 variables selected by lasso. Can someone help me understand the dynamics of this, and how I can justify the coefficients of the lasso model given that they are penalized? (I normalized all the variables before using lasso.)
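For concreteness, here is a sketch of the kind of lasso workflow described above (my own sketch using scikit-learn; the question does not say which package was actually used, and the data below is a placeholder):

```python
# Standardize, fit an L1-penalized logistic regression with cross-validation,
# then look at which coefficients survive at the chosen penalty.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 628))       # placeholder for the 628 predictors
y = rng.integers(0, 2, size=1000)      # placeholder binary target

X_std = StandardScaler().fit_transform(X)
lasso_lr = LogisticRegressionCV(
    Cs=20, cv=10, penalty="l1", solver="saga", scoring="roc_auc", max_iter=5000
)
lasso_lr.fit(X_std, y)

selected = np.flatnonzero(lasso_lr.coef_.ravel())
print(len(selected), "non-zero coefficients at the chosen penalty")
# Note: penalized coefficients are shrunk toward zero by design, so the usual
# SE / t-stat machinery from an unpenalized logistic fit does not apply to
# them directly.
```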

Data mining and weka

Hi, I've been asked to search for at least 20 different datasets (with a maximum of 40 datasets). I need to apply the following classification techniques using the WEKA software on the chosen datasets:
(1) Decision tree (SimpleCart),
(2) Naïve Bayes, and
(3) K-NN (IBk) (with K taking the value of 1 up to the number of class labels in the dataset)
Once you have applied WEKA on all the datasets, it is required to accomplish the following tasks:
Compare the performance of the applied techniques you have achieved through WEKA.
Analyse the results with regards to the dataset properties.
I've never used Weka before, and I'm unsure how to apply the classification techniques and what I am actually comparing, but I am quick at learning. I'm not really sure what I am required to do... I just need some direction or an example please, anyone?
To find datasets, you can use
https://archive.ics.uci.edu/ml/datasets.html
To compare the performance of classifiers, there are many measures, like AUC (Area Under the Curve), the ROC curve, accuracy, precision and recall. Weka has the ability to generate these measures. I recommend using AUC and accuracy.
To learn how to use Weka, there are many online tutorials like http://www.ibm.com/developerworks/library/os-weka2/
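Weka's Explorer/Experimenter will produce those numbers for you directly; just to show what the comparison amounts to, here is a rough scikit-learn analogue (an assumption on my part, since the assignment itself is about Weka's own implementations):

```python
# Cross-validated accuracy and AUC for a decision tree, naive Bayes and k-NN,
# on one stand-in dataset; repeat over your 20+ datasets and tabulate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for one of your datasets

models = {
    "Decision tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "1-NN": KNeighborsClassifier(n_neighbors=1),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```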

Getting win percentages for Texas hold'em poker without monte carlo/exhaustive enumeration

Sorry, I'm just starting this project and don't have any ideas or code; I'm asking more of a theoretical question than a programming one.
It seems that every Google search provides the same responses and it's very hard to find an answer to this question:
Is there a way to calculate win percentages for Texas hold'em poker (the same way they do on Poker After Dark or other televised poker events) without using Monte Carlo/exhaustive enumeration methods? Assume all cards are face up and we know every card in the deck.
Every response on other forums just seems to be "use pokerstove" or something similar; I'm looking for the theory to write the code.
Thanks.
Is there a way to calculate win percentages for Texas hold'em poker
(the same way they do on Poker After Dark or other televised poker
events) without using Monte Carlo/exhaustive enumeration methods?
In specific instances it is possible...
You can use a perfect lookup table for two-player heads-up preflop matchups: note that the "typical" 169 vs 169 approximation ain't good enough (say Jh Th vs 9h 8h ain't really "JTs vs 98s": I mean, that would be quite a gross approximation).
Besides that, if you have a lot of memory and if you can live with gigantic cache misses, you technically could precompute gigantic lookup tables (say on the server side) and do lookups for other cases (e.g. for every possible three-player all-in matchup preflop), but you'd really need a lot of memory : )
Note that "full enumeration" at the flop and turn ain't an issue: at the flop there are only 2 more cards to come, so there are typically only C(45,2) [two players all-in at the flop, we know 2*2 holecards + 3 community cards -- hence leaving 990 possibilities] or C(43,2) [three players all-in at the flop, we know 3*2 holecards + 3 community cards].
So an actual evaluator would not use one but several methods. For example:
lookup table for two players all-in preflop (the fastest)
full enum for any number of players all-in at flop or turn because it's tiny (max 990 possibilities) -- very fast
monte-carlo or full enum for three players or more all-in preflop -- incredibly slower
It is interesting to see here that in the most typical cases you'll get the result very, very fast: most actual all-ins involve two players, not three or more.
So you're either looking up in a "1 vs 1 preflop" lookup table or doing a full C(45,2) or C(46,1) enum (which are, in both cases, amazingly fast).
It's really only the "three players or more all-in preflop" case which does take time.
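To make the "full enum at the flop is tiny" point concrete, here is a small Python sketch of the C(45,2) enumeration for a two-player all-in on the flop. `best_rank` is a hypothetical 7-card hand evaluator you would supply (higher return value wins), not a real library call:

```python
from itertools import combinations

def flop_equity(hero, villain, board, deck, best_rank):
    """Exact two-player equity for an all-in on the flop.

    hero, villain: 2-card hole cards; board: the 3 flop cards;
    deck: the remaining 45 unseen cards; best_rank: a hypothetical 7-card
    evaluator returning a comparable rank (higher is better).
    Enumerates all C(45, 2) = 990 turn/river runouts; at the turn the same
    idea needs only C(46, 1) = 46 evaluations.
    """
    wins = ties = total = 0
    for turn, river in combinations(deck, 2):
        full_board = board + [turn, river]
        h = best_rank(hero + full_board)
        v = best_rank(villain + full_board)
        if h > v:
            wins += 1
        elif h == v:
            ties += 1
        total += 1
    return (wins + ties / 2) / total   # hero's equity, ties split evenly
```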
The answer is no.
There is no closed form computation that you can do to compute poker equities. Using combinatorics, you can identify and solve many subproblems, which speeds up computation.
For example, if you are considering all five card hands, there are 52 choose 5 = 2,598,960 different hands. But knowing that suits are equivalent and using combinatorial methods (either analytic or computational), you can reduce the space of all hands to 134,459 classes each weighted according to the number of different hands in each equivalent class.
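As a concrete illustration of that suit-equivalence reduction, here is a brute-force Python sketch that maps any 5-card hand to a canonical representative of its class (my own toy code; counting the distinct keys over all C(52,5) hands is how you would recover the 134,459 figure above):

```python
from itertools import permutations

SUITS = "cdhs"

def canonical(hand):
    """Canonical representative of a hand's suit-equivalence class.

    hand: iterable of (rank, suit) pairs, e.g. [(14, 'h'), (13, 'h'), ...].
    Tries all 4! suit relabelings and keeps the lexicographically smallest
    sorted hand, so two hands that differ only by a suit permutation map to
    the same key.
    """
    best = None
    for perm in permutations(SUITS):
        relabel = dict(zip(SUITS, perm))
        candidate = tuple(sorted((r, relabel[s]) for r, s in hand))
        if best is None or candidate < best:
            best = candidate
    return best

# Example: these two hands are the same up to suit renaming.
h1 = [(14, 'h'), (13, 'h'), (12, 'h'), (11, 'h'), (10, 'h')]  # royal flush, hearts
h2 = [(14, 's'), (13, 's'), (12, 's'), (11, 's'), (10, 's')]  # royal flush, spades
assert canonical(h1) == canonical(h2)
```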
There are also various ways of using exhaustive evaluations tailored to your application. If you need to perform some subset of evaluations repeatedly, you can use caches or precomputed lookup tables targeted to your specific needs.

Solving the problem of finding parts which work well with each other

I have a database of items. They are car parts, and some parts (e.g. cams/pistons) work better in some combinations than in others (e.g. one product will work well with another, while a different combination of 2 parts may not).
There are so many possible permutations, what solutions apply to this problem?
So far, I feel that these are possible approaches (where I have question marks, something tells me these are solutions, but I am not 100% confident they are):
Neural networks (?)
Collection-based approach (selection of parts in a collection for cam, and likewise for pistons in another collection, all work well with each other)
Business rules engine (?)
What are good ways to tackle this sort of problem?
Thanks
The answer largely depends on how you calculate 'works better'.
1) Independent values
Assuming that 'works better' is a function f of a combination of items x = (a, b, c, d, ...), and(!) that there are no regularities that could be used to decide whether f(x') is bigger or smaller than f(x) knowing only x, f(x) and x' (which could allow you to find the maximizing x faster), you will have to calculate f for all combinations at least once.
Once you calculate it for all combinations, you can sort. If you need to look up the data in a partitioned way, using SQL/an RDBMS might be a good approach (for example, finding the top 5 best solutions but without such and such part).
For extra points, after calculating all of the results and storing them, you could analyze them statistically and try to establish patterns.
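A brute-force sketch of this in Python (the part lists and the scoring function are placeholders for your own data and measurements):

```python
# Score every cam/piston combination with a compatibility function f and keep
# the best ones; store the full table in an RDBMS if you need partitioned
# lookups later (e.g. "top 5 without cam_B").
from itertools import product

cams = ["cam_A", "cam_B", "cam_C"]
pistons = ["piston_X", "piston_Y"]

# Placeholder "works better" scores; in practice these come from your data.
example_scores = {
    ("cam_A", "piston_X"): 72, ("cam_A", "piston_Y"): 40,
    ("cam_B", "piston_X"): 55, ("cam_B", "piston_Y"): 90,
    ("cam_C", "piston_X"): 63, ("cam_C", "piston_Y"): 20,
}

def f(cam, piston):
    """Hypothetical 'works better' score for one combination."""
    return example_scores[(cam, piston)]

scored = sorted(((f(c, p), c, p) for c, p in product(cams, pistons)), reverse=True)
for score, cam, piston in scored[:5]:
    print(score, cam, piston)
```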
2) Dependent values
If you can establish some regularities (and maybe you can) regarding the values, the search for the max value can be simplified and sped up.
For example, if you know that the function you are trying to maximize is a linear combination of all the parameters, then you could look into linear programming (see the toy sketch below).
If it is not...
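As a toy illustration of why linearity helps (a simplification, not a full linear program, and the scores are made up): if the overall score is just a sum of independent per-part scores, the best combination is simply the best-scoring part in each category, with no exhaustive search needed.

```python
# Linear (additive) case: pick the argmax in each category independently.
per_part_score = {
    "cam":    {"cam_A": 3.1, "cam_B": 4.7, "cam_C": 2.2},
    "piston": {"piston_X": 2.0, "piston_Y": 1.4},
}
best_combo = {cat: max(scores, key=scores.get) for cat, scores in per_part_score.items()}
print(best_combo)   # {'cam': 'cam_B', 'piston': 'piston_X'}
```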
