I'm trying to merge two datasets on Mergers & Acquisitions. Both consist of roughly 10,000 observations with about 50-100 variables each. One contains information about the actual M&A deal, whereas the other contains information on how the deal was financed.
The problem is that there is no clear and unique identifier. For example, I could use the date that the deal was announced but that wouldn't be unique because on some days 10 deals were announced. Using company names is difficult since they mostly aren't identical in both datasets. For example if in one dataset I find "Ebay", in the other the same company could be called "eBay", "Ebay Inc", or "Ebay, Inc."
I've been working with the Fuzzy Lookup add-on for Excel, as well as concatenating various identifiers that are not unique but in their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.). However, I haven't been able to generate as many matches as I had hoped.
I'd be grateful for any ideas or pointers towards resources that would help me merge the datasets more efficiently.
This is probably a process that requires a combination of tactics applied over several iterations. No single fuzzy match will catch everything the first time. An optimal strategy should narrow the possibilities for manual matching: once records are matched, exclude them from further fuzzy matching, and keep going through the remaining records with various tactics to whittle the unmatched entries down to as few as possible.
As for tactics, you mention the announcement date and that there could be 10 mergers on a given day. Now you only have to manually match up those 10, which becomes a manageable chunk.
Using company names is difficult...
Yes, but you can combine date with Levenshtein Distance (or an equivalent algorithm) to rank the possible choices to narrow the options. So names with eBay in them will all appear closer when you rank them by their Levenshtein Distance.
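For instance, here's a minimal Python sketch of that idea, assuming each record is a dict with hypothetical `announce_date` and `company` fields: block the finance records by announcement date, then rank the same-day candidates by edit distance on the company name.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def candidates_for(deal, finance_rows, top_n=5):
    """Rank same-day finance records by name distance, for manual review."""
    same_day = [r for r in finance_rows if r["announce_date"] == deal["announce_date"]]
    return sorted(same_day,
                  key=lambda r: levenshtein(deal["company"].lower(),
                                            r["company"].lower()))[:top_n]
```

Ranked like this, "eBay", "Ebay Inc" and "Ebay, Inc." all end up near the top of the list for an "Ebay" deal, and a human only has to confirm the pick.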
Other text comparison algorithms besides Levenshtein include Gotoh, Jaro, Soundex, Chapman, etc. Some of these techniques are decades old so the chances of finding Excel add-ins are high. There used to be a vibrant open source group working on these solutions on Sourceforge.net.
...their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.).
Watch out for SIC codes. They are neither consistent nor accurate. Depending on who was maintaining those codes, you may not get very accurate values beyond the 4-digit level. Also, the SIC codes themselves have been updated over time, while companies were under no obligation to update/revise theirs accordingly. Lastly, SIC codes have been replaced by the newer NAICS codes, which in turn have had several versions. With each change, new industries were added, such as social media companies that did not exist in the old SIC codes. However, SIC/NAICS codes can still be useful for eliminating duplicates through self-matching.
...any ideas or pointers towards resources...
SQL Server's text indexing features have ready-made algorithms for finding matches. Depending on your resources, that may be something to explore.
If this matching process is going to be a routine task, then you can explore specialized ETL products (extract, transform, load) offered through various data integration services. Some of these ensure bi-directional updates. But no matter how sophisticated the solution, it's all based on a few simple tactics as shown above and nearly all of them contain manual overrides.
Hope you are safe and well!
I have a question about common ways of pair-matching when there is a database of users: say there are a few properties for each user, and when matching, each user can change the filtering options to only match those who fit their own requirements (so there is mutual selection between users), and we want to efficiently match 1000 users as precisely as possible.
For example, let's say there are 3 properties for every user: gender (female/male/other), study level (elementary/intermediate/advanced), and grade (freshman/sophomore/junior/senior). When matching, each user can choose to only match with people of their selected gender, study level and grade.
When focusing on one user, I guess that, from the database's perspective, we could use filtering options in the queries and get a list of those who satisfy both "my requirement" and "I fit their requirement". However, I think this would be slow and could cause concurrency problems when there are 1000+ users in the matching phase at the same time.
I saw another post here that discussed the blossom algorithm and greedy algorithms, which seem promising if you look at this as a graph problem. Are they doable in this case? I guess that if two users mutually fit each other's requirements, there would be an edge between the two nodes, and the value of the edge could be a combined matching score over all 3 properties?
Anyway, I'm wondering whether there is a common way to do this kind of pair matching precisely with 1000+ users at the same time?
Thank you so much!
If the requirement is that each match has to have the exact same properties, then the solution is fairly simple; just do a multiple criteria sort (ex. first sort by gender, then within each gender category sort by study level, etc.) and pair the identical users.
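As a rough illustration (the property names below are made up), the exact-match case really is just a sort-and-pair pass:

```python
from itertools import groupby

def pair_exact_matches(users):
    """Pair users whose (gender, study_level, grade) are identical."""
    key = lambda u: (u["gender"], u["study_level"], u["grade"])
    pairs, leftovers = [], []
    for _, group in groupby(sorted(users, key=key), key=key):
        bucket = list(group)
        pairs.extend(zip(bucket[0::2], bucket[1::2]))   # pair within each identical bucket
        if len(bucket) % 2:                             # an odd bucket leaves one user over
            leftovers.append(bucket[-1])
    return pairs, leftovers
```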
However, in a random dataset you're very unlikely to have perfect matches for all users. In that case you would want to score pairs by how closely each category matches and use a more complex algorithm to maximize your overall matches. What you would do depends heavily on your use case and userbase size. Honestly, 1000 users is a very small number for modern computers; pretty much any polynomial time method (including blossom, as you mentioned) would work fine.
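For the imperfect-match case, here is a sketch of the graph idea you describe, using networkx's blossom-based maximum weight matching; the scoring function and user fields are toy assumptions, not a recommendation:

```python
import networkx as nx

def score(a, b):
    """Toy score: one point per property where both users accept each other.
    Each user dict has e.g. 'gender' plus a set 'wants_gender' of acceptable values."""
    return sum(int(a[prop] in b[want] and b[prop] in a[want])
               for prop, want in [("gender", "wants_gender"),
                                  ("level", "wants_level"),
                                  ("grade", "wants_grade")])

def best_pairing(users):
    G = nx.Graph()
    for i in range(len(users)):
        for j in range(i + 1, len(users)):
            s = score(users[i], users[j])
            if s > 0:                      # connect pairs with at least one mutually accepted property
                G.add_edge(i, j, weight=s)
    # Maximum weight matching (blossom algorithm under the hood); returns index pairs.
    return nx.max_weight_matching(G, maxcardinality=True)
```

With 1000 users this builds roughly 500,000 candidate edges, which is still small, although the pure-Python matching is not instantaneous.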
I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text composed of non-standard nouns, and I want to classify it into a target category. I have labels for about 40% of the entire dataset to use as training data; the remaining 60% we would like to classify as accurately as possible. The following are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few tokens that are meaningful, like 'development', 'lead', 'rep'. Others seem random without any semantics, but they may appear multiple times in the data. Another thing is that some tokens like 'rep' and 'account' can appear across multiple categories. I think that will make weighting/similarity a challenging task.
My first question is "is it worth automating this kind of classification?"
Second: "Is it a good problem for learning machine learning classification?" There are only 30k such entries and a handful of target categories. I could find someone to do it manually, which would also be more accurate.
Here's my take on this problem so far:
Full-text engine: something like Solr to build an index and query rules that draw matches based on tokens - words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
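To get a quick feel for the machine-learning route, a minimal Naive Bayes baseline with scikit-learn might look like the sketch below; the titles and category labels are placeholders, and character n-grams are just one way to cope with non-standard tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: the ~40% of rows that already carry a category.
titles = ["Business Development Representative MFG",
          "Account Development Rep",
          "New Business Rep",
          "Branch Staff"]
labels = ["LEAD_GENERATION_REPRESENTATIVE",
          "LEAD_GENERATION_REPRESENTATIVE",
          "LEAD_GENERATION_REPRESENTATIVE",
          "OTHER"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams tolerate odd tokens
    MultinomialNB(),
)
model.fit(titles, labels)
print(model.predict(["Lead Gen, New Business Development"]))
```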
I have tried out Solr for this with a reverse lookup, since I don't have a taxonomy available at the moment. It seems like I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may lead to overfitting and won't scale.
I am aware that people usually try multiple modelling techniques to see which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.
I have 2 tables of UK postal addresses (around 300,000 rows each) and need to match one set to another in order to return, for each address, a unique ID contained in the first set.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped; different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
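In today's terms, the core of that candidate-list step can be sketched in a few lines of Python; the `text` field, threshold and scoring function here are illustrative assumptions, not what I actually used back then.

```python
import difflib
import re

def normalise(value: str) -> str:
    """Lower-case, strip punctuation and squeeze whitespace (tune to your data)."""
    value = re.sub(r"[^\w\s]", " ", value.lower())
    return re.sub(r"\s+", " ", value).strip()

def candidate_matches(record, reference_rows, threshold=0.75, top_n=3):
    """Return the best-scoring reference rows for one record, for a human to confirm."""
    target = normalise(record["text"])
    scored = []
    for ref in reference_rows:
        score = difflib.SequenceMatcher(None, target, normalise(ref["text"])).ratio()
        if score >= threshold:
            scored.append((score, ref))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```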
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list, with as many exceptions as it'll generate. Don't try to automate it perfectly; automate the easy stuff and put a human in the loop for the uncertain cases. Much easier and safer.
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that, with XX% certainty, Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however, in our case we have two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
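A bare-bones sketch of that loop, assuming the known types are kept as regular expressions (the pattern and type name below are made up from your example):

```python
import re

# Seed list of known types; in practice this list grows as new shapes are discovered.
KNOWN_TYPES = [
    ("server-assignment", re.compile(r"^Change Transaction \S+ Assigned To Server \S+$")),
]

def classify(line, known_types=KNOWN_TYPES):
    """Return the name of the first matching known type, or None for a candidate new type."""
    for name, pattern in known_types:
        if pattern.match(line):
            return name
    return None

print(classify("Change Transaction XYZ789 Assigned To Server GB47"))  # server-assignment
print(classify("User session expired for host US91"))                 # None -> new type?
```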
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry prepare queries in a similar way; the search engine should give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
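Generating those n-grams is straightforward; a small Python sketch (the underscore joining is only a convention):

```python
def word_ngrams(text, n=2):
    """Word n-grams joined with underscores; n=2 gives the bigrams shown above."""
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

line = "Change Transaction XYZ789 Assigned To Server GB47"
print(word_ngrams(line, 2))
# ['Change_Transaction', 'Transaction_XYZ789', 'XYZ789_Assigned',
#  'Assigned_To', 'To_Server', 'Server_GB47']
print(word_ngrams(line, 3))  # trigrams, and so on for higher orders
```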
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach: build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. However, the "information retrieval" approach is usually only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (which may, however, be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return you some similar existing log lines, without much quantitative information on how common that kind of line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
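As a small illustration of the second strategy, here is a sketch with scikit-learn; the extra "Disk usage" lines are invented just to give the clustering something to separate, and the token pattern that drops alphanumeric IDs is one crude way to avoid clustering on things like the server ID.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "Disk usage warning on volume /var",
    "Disk usage warning on volume /home",
]

# Keep purely alphabetic tokens only, which crudely discards IDs like ABC123 or US91.
vectoriser = TfidfVectorizer(token_pattern=r"(?u)\b[A-Za-z]+\b")
X = vectoriser.fit_transform(logs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: the two "Assigned To Server" lines land in the same cluster
```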
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
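To get a feel for what SOUNDEX() does, here is a simplified Python version of the algorithm; real DBMS implementations differ slightly in edge cases (e.g. the H/W rules).

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus three digits encoding consonant groups."""
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    codes = {}
    for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                           ("4", "L"), ("5", "MN"), ("6", "R")]:
        for letter in letters:
            codes[letter] = digit
    encoded = [codes.get(c, "0") for c in word]   # "0" marks vowels and H/W/Y
    result, prev = [], encoded[0]
    for code in encoded[1:]:
        if code != prev and code != "0":
            result.append(code)
        prev = code
    return (word[0] + "".join(result) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))   # S530 S530 -- spelling variants collide
```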
I am in the architecture stage of an academic project involving billions of records. The project should be very lightweight in terms of computing power and highly scalable.
The information structure is very simple: I need to store a list of items, each one with different features. The features are integers, decimals, dates, strings, etc. When the data is imported, the types of the features are known. Also, features can be used to reference other items.
I need to be able to get and sort a list of items by their features (more than one at a time) - possibly using queries such as >, <, =, regexes, and string functions like length, left, right and mid, both between feature values and against arbitrary user input.
Reporting in the sense of sums, averages and grouping is also necessary, but the demands for that are more relaxed - there is no need for full cube capabilities, though more would be better.
I am very new to the whole NoSQL world. What would you recommend?
If you check out the tutorials for MongoDB, they have, in my opinion, the best introduction to the Map/Reduce system that is used to query and aggregate.
I do wonder, though, why you have concluded in advance that NoSQL is the route to go. Although different items may have different schemas, are there a fixed number of entities and attributes? And why have you (if you have) ruled out SQL, which, after all, has decades of accumulated features for storing and querying data?
If you are going to use aggregates then you could use map reduce to populate aggregate tables and then serve that data.
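A minimal sketch of that idea with MongoDB, using the aggregation pipeline (the modern alternative to classic map/reduce); the collection and field names are made up for illustration:

```python
from pymongo import MongoClient

db = MongoClient()["mydb"]           # hypothetical database

pipeline = [
    {"$group": {
        "_id": "$category",           # group items by one feature
        "total": {"$sum": "$amount"},
        "average": {"$avg": "$amount"},
        "count": {"$sum": 1},
    }},
    {"$out": "category_summary"},     # materialise the aggregate as its own collection
]
db.items.aggregate(pipeline)

# Reports can then be served straight from the pre-computed summary collection.
for row in db.category_summary.find().sort("total", -1):
    print(row)
```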
Writing map/reduce for every query may be cumbersome, so you could also have a look at Apache Pig and Hive. This is especially helpful for the kind of ad-hoc queries you are talking about.