I've recently revived my C interest, which means I have a lot of interest in certain articles and questions covering topics within the language.
Over the years I've grown accustomed to using search engines for this, by entering a query like "how to use [library] in [programming language]". This usually works very well, but it frequently fails for C.
Most likely this is due to it being a single letter, and some websites and search engines probably treat it as an insignificant part of the query (like "a" or "to").
When searching on specific websites such as SO, I can use tags, but overall I still experience a lack of content compared to other programming languages.
Is there any "standard" way to include C in queries or inputs like this? With C++ for example, a lot of content can be found using "cpp", so maybe there's a comparable format-friendly term for C.
To do that in Google: if you want the search engine to treat a string as important, put it in double quotation marks, as follows.
"important"
"c"
I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text made up of non-standard nouns, and I want to classify it into a target category. About 40% of the entire dataset is labeled and can serve as training data. We would like to classify the remaining 60% as accurately as possible. The following are some input values, from multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few meaningful tokens here, like 'development', 'lead', 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another thing is that some tokens, like 'rep' and 'account', can appear in multiple categories. I think that will make weighting/similarity a challenging task.
My first question is "is it worth automating this kind of classification?"
Second: "is it a good problem for learning machine learning classification?". There are only 30k such entries and a handful of target categories. I could find someone to do that manually, which would also be more accurate.
Here's my take on this problem so far:
Full-text engine: something like Solr, to build an index and query rules that draw matches based on tokens: words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried out Solr for this, with reverse lookup though, since I don't have a taxonomy available at the moment. It seems like I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may overfit and won't scale.
I am aware that people usually try multiple modeling techniques to find which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.
I am in the latter phases of productionizing an app that uses Azure Search.
After many tests I just saw something that I can't figure out.
My indexed data are job descriptions, mostly for technical jobs. When I searched for C# or C++ alone, I was getting some resumes from non-technical people who had clearly never done programming before. When I started digging into it, I realized that these are people who have a middle initial C, or something like that, in the resume text.
Is there a way to tell Azure Search that I literally want "C#" or "C++", and not to treat # and ++ as word breaks?
Thanks
There's currently no direct way to do this. This is something we need to add; it gets in the way particularly in these cases with programming language names.
In the meantime, one thing you can do is replace (both at indexing and search time) each such name with a known string (e.g. C++ -> cplusplus, C# -> csharp). Not the most elegant, but it should help for this case.
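As a rough sketch of that workaround (Python is my choice here, and the function name `normalize_terms` is made up for illustration, not part of any Azure Search API), the same replacement can be applied to documents before indexing and to the query text before searching:

```python
import re

# Replacement tokens suggested above; extend the mapping as needed (assumption).
REPLACEMENTS = {
    "c++": "cplusplus",
    "c#": "csharp",
}

# Match a term only when it is not glued to other word characters, '+' or '#',
# so a middle initial like "C." is left alone.
_PATTERN = re.compile(
    r"(?<![\w+#])(" + "|".join(re.escape(k) for k in REPLACEMENTS) + r")(?![\w+#])",
    re.IGNORECASE,
)

def normalize_terms(text: str) -> str:
    """Rewrite C++/C# into plain tokens; call on indexed text and on queries."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)

print(normalize_terms("Senior C++ and C# developer"))
# Senior cplusplus and csharp developer
```

The important part is that exactly the same function runs on both sides; otherwise the query token and the indexed token will not match.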
I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or storing Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and cpu usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc) as long as the querying can be done ignoring the formatting. File size and count doesn't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
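To make the markup idea concrete, here is a small sketch in Python, outside of any particular search framework. It assumes the convention just described (single asterisks for italics, double asterisks for red text); the helper names and the sample string are made up for illustration, and the placement of the markers in the sample is not taken from any real edition.

```python
import re

RED = re.compile(r"\*\*(.+?)\*\*")   # **red-letter text**
ITALIC = re.compile(r"\*(.+?)\*")    # *italic text*

def to_plain(formatted: str) -> str:
    """Drop the markers but keep the wrapped words; this is what gets indexed."""
    without_red = RED.sub(r"\1", formatted)
    return ITALIC.sub(r"\1", without_red)

def red_phrases(formatted: str) -> list:
    """Return the red-letter spans (still carrying any italic markers)."""
    return RED.findall(formatted)

# Hypothetical marked-up text, for illustration only.
sample = "He said, **I am *the* way**"
print(to_plain(sample))
# He said, I am the way
```

The formatted string is kept for display, and only the output of `to_plain` is handed to the indexer.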
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
I didn't know the bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you got a table with books, a table with chapters and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles, so they are actually just a number as well. In that case it is silly to store them separately, so you get just your table of books and a table of verses, in which each verse has a chapter number, a verse number and a verse text. That text is plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and creating a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
PS: 5 MB of text is nothing really. If you got a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?
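For what it's worth, here is a sketch of that in-memory idea in Python (my choice of language, since I don't know yours). It also handles phrases that run across verse boundaries: all verses are joined into one normalized string, and the start offset of each verse is remembered so a match can be mapped back to verse references. The class and function names are made up, and the verse text is expected to be the plain, unformatted version.

```python
import re
from bisect import bisect_right

def normalize(text):
    """Lowercase and keep only word tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

class VerseIndex:
    """All verses as one normalized string, plus the offset where each starts."""

    def __init__(self, verses):
        # verses: iterable of (reference, plain_text) pairs in reading order
        self.refs, self.starts, parts = [], [], []
        pos = 1  # offset 0 is a leading space, so every word has a space on both sides
        for ref, text in verses:
            norm = normalize(text)
            self.refs.append(ref)
            self.starts.append(pos)
            parts.append(norm)
            pos += len(norm) + 1  # +1 for the space that joins verses
        self.text = " " + " ".join(parts) + " "

    def search(self, phrase):
        """Return one list of verse references per match, in text order."""
        needle = normalize(phrase)
        if not needle:
            return []
        padded = " " + needle + " "  # forces whole-word matching
        hits = []
        idx = self.text.find(padded)
        while idx != -1:
            first_char = idx + 1           # first character of the matched phrase
            last_char = idx + len(needle)  # last character of the matched phrase
            first = bisect_right(self.starts, first_char) - 1
            last = bisect_right(self.starts, last_char) - 1
            hits.append(self.refs[first:last + 1])
            idx = self.text.find(padded, idx + 1)
        return hits

index = VerseIndex([
    ("John 3:16", "For God so loved the world, that he gave his only begotten Son, "
                  "that whosoever believeth in him should not perish, "
                  "but have everlasting life."),
    ("John 3:17", "For God sent not his Son into the world to condemn the world; "
                  "but that the world through him might be saved."),
])
print(index.search("but have everlasting life for God sent not his Son"))
# [['John 3:16', 'John 3:17']]
```

This is exact phrase matching only, with no ranking or stemming, so it is a starting point rather than a replacement for a real full-text index.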
Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are on the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features, namely tokenize the unicode text of the pages, and use a dictionary to translate every new token to a number, and simply consider the existence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
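The first and third steps can be sketched in a few lines of plain Python (the function names and the toy data are mine, just for illustration; Orange or any other package will have its own way of doing this):

```python
import re
from collections import Counter

def tokenize(text):
    # \w is Unicode-aware for str patterns in Python 3, so accented letters survive
    return re.findall(r"\w+", text.lower())

def build_vocabulary(documents):
    """Map every new token to the next free integer id."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def presence_features(doc, vocab):
    """Set of token ids present in the document; unseen tokens are ignored."""
    return {vocab[t] for t in tokenize(doc) if t in vocab}

def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    label, _ = Counter(train_labels).most_common(1)[0]
    accuracy = sum(1 for y in test_labels if y == label) / len(test_labels)
    return label, accuracy

vocab = build_vocabulary(["Noticias de fútbol", "football news today"])
print(presence_features("News de football hoy", vocab))
print(majority_baseline(["sport", "sport", "news"], ["sport", "news", "sport", "sport"]))
# ('sport', 0.75)
```

Any classifier you try afterwards should beat the accuracy that `majority_baseline` reports; if it doesn't, the features are not carrying much signal yet.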
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say lexical features (bag-o'-words style) then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs then try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results. So, I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons is the number of random factors involved in clustering. For example, the k-means algorithm relies on randomly selected points when starting the process, which can lead to very different results across runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements produce well-separated areas in the feature space.
One approach to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary like this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses a vector space model to calculate the distance between words and/or documents. In contrast to other methods, it can efficiently treat synonymy and polysemy. The disadvantages of this method are computational inefficiency and a lack of implementations. One of the phases of LSI makes use of a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation for it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by the user 'faithlessfriend' on github (I won't post a direct link to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" must be used. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: my experience tells me that the count of words really does matter.
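To show what "multinomial" means in practice, here is a bare-bones sketch in plain Python with add-one (Laplace) smoothing. The class name and the toy data are invented for illustration; a real project would more likely use an existing implementation from a machine learning package.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is classic add-one smoothing

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token occurrences per class
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        # Repeated tokens are counted, which is the multinomial part.
        counts = Counter(t for t in tokens if t in self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token, n in counts.items():
                likelihood = (self.token_counts[label][token] + self.alpha) / denom
                score += n * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = TinyMultinomialNB().fit(
    [["goal", "match", "goal"], ["match", "team"],
     ["election", "vote"], ["vote", "party", "vote"]],
    ["sport", "sport", "politics", "politics"],
)
print(model.predict(["vote", "vote", "match"]))
# politics
```

Because every occurrence contributes to the score, two occurrences of "vote" outweigh a single "match" here.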
SVM is, at its core, a technique for classifying linearly separable data, and text data is almost always not linearly separable (at least several common words appear in any pair of documents). It doesn't mean that SVM cannot be used for text classification - you still should try it, but results may be much lower than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on a specific topic.
I'm developing: http://www.buscatiendas.com.mx
I've seen people entering text for queries with lots of typos.
What kind of search could I implement so that similar words are found?
Something more or less like what Google does would be neat.
I'm using SQL Server Full Text search.
Why don't you have Google/Bing index it for you and just use that, via the site: feature they provide?
If that is not an option, you might have to have a 'spell checker' of your own (either implement it yourself or just use an existing one), trained on the data you have. Note that spell checking is not deterministic (e.g. latel: is it label? later?). You can only make a 'best' guess based on the actual data you have on your site.
There are probabilistic models where you can 'train' your spell guesser/checker to come up with a 'best' guess.
The following page seems pretty useful. It has a description on how to write one yourself, and also has good links (including a survey paper) and links to implementations in different languages:
http://norvig.com/spell-correct.html
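A cut-down sketch along the lines of that page, in Python: it only considers candidates one edit away from the typed word and picks the one seen most often in your own data. The function names are mine, and taking the alphabet from the corpus itself (so accented letters and ñ work) is my own assumption, not something the page prescribes.

```python
import re
from collections import Counter

def words(text):
    # [^\W\d_]+ matches runs of letters, including accented ones
    return re.findall(r"[^\W\d_]+", text.lower())

def edits1(word, alphabet):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def make_corrector(corpus_text):
    """'Train' on the site's own text: word frequencies decide the best guess."""
    counts = Counter(words(corpus_text))
    alphabet = sorted(set("".join(counts)))

    def correct(word):
        word = word.lower()
        if word in counts:
            return word  # already a known word
        candidates = [w for w in edits1(word, alphabet) if w in counts]
        if not candidates:
            return word  # no guess available, leave the input as typed
        return max(candidates, key=counts.get)

    return correct

correct = make_corrector("label label label later printer")
print(correct("latel"))
# label
```

With this toy corpus, "latel" could be "label" or "later", and "label" wins only because it is more frequent in the training text, which is exactly the 'best guess' behaviour described above.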
There are two ways to solve this:
Buy a 3rd party product, like a Google Search Appliance, or one of Microsoft's search servers.
Log all queries, and have someone review these, making a table which links the bad queries to what they should be. (It's possible you could buy a component library which does this, much like a spelling checker.)
If you want to roll your own, first you need to filter out noise words before you even start searching, because they may just impose unnecessary load on your database. Should "a good book" be the same as searching for "the good book" or "his good book" or "good and bad reviews on a book"? Obviously, "a", "the", "an", "and", etc. do not qualify at all as "useful" search keywords. Once you have the noise filtered out, you start the real searching. Again, you should consider database performance: is it wise to search a dynamic database or a pre-processed database? Figure out a way to filter the noise words out of the searched data too.
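A minimal sketch of that filtering step in Python (the word list beyond "a", "an", "and", "the" is my own guess, and a Spanish-language site would need its own list):

```python
# "a", "an", "and", "the" come from the text above; the rest are assumptions.
NOISE_WORDS = frozenset({"a", "an", "and", "the", "his", "her", "on", "of"})

def filter_noise(query):
    """Lowercase the query and drop noise words, keeping the order of the rest."""
    return [w for w in query.lower().split() if w not in NOISE_WORDS]

print(filter_noise("a good book"))
# ['good', 'book']
print(filter_noise("The good book"))
# ['good', 'book']
```

Running the same function over the searched text when it is pre-processed keeps the query side and the data side consistent.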