FAQ section - Algorithm for showing - angularjs

I'm building a website with an FAQ section. I've been fiddling a bit with how the different questions are scored, so that the questions most people wonder about are displayed at the top.
At the moment, I have a counter that is incremented each time someone clicks a question, and the list of questions is sorted by that counter in descending order.
My question is: Is there a better way to score these FAQ-questions? For example a combined counter and rating or something like that?

The rating should play no role in the FAQ algorithm. The rating is feedback for you: if a question/answer pair is repeatedly rated poorly, you should improve its quality.
The FAQ questions should be the ones that your visitors ask most often, so a simple hit counter is a good place to start. You should also consider ways to include questions that you often receive via phone or email or other feedback methods.
A lot of sites rely on manual maintenance of the FAQ, but I prefer an automated or partially automated approach for several reasons:
Staff is not always involved in the question/answer process, especially if you have a robust set of questions/answers online.
Manually-maintained lists become tedious and therefore receive less attention than they should.
Automated metrics often reveal customer behavior that isn't noticed or expected by staff.
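A minimal sketch of the hit-counter approach, assuming each question carries a click count plus an optional staff-entered tally of phone/email occurrences. The names `FaqEntry`, `offline_reports` and `offline_weight` are made up for illustration. Ratings are deliberately left out of the score, as suggested above.

```python
from dataclasses import dataclass


@dataclass
class FaqEntry:
    question: str
    clicks: int = 0           # incremented each time a visitor opens the question
    offline_reports: int = 0  # hypothetical: tally of phone/email occurrences entered by staff


def score(entry, offline_weight=1.0):
    # Popularity only; ratings are feedback for the editors, not a ranking input.
    return entry.clicks + offline_weight * entry.offline_reports


def ranked(entries, offline_weight=1.0):
    # Highest score first, which is the "descending order" from the question.
    return sorted(entries, key=lambda e: score(e, offline_weight), reverse=True)
```

Setting `offline_weight=0` gives the plain click-count ordering described in the question.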

What if you let the support staff arrange the FAQs? They get the extra phone calls and know the most popular questions.
So I would like the FAQs to be arranged manually by the support staff.

Related

Logic for recommender application

I am developing an application that would have users answer maybe 10 questions, with 3-4 options for each question. At the end of the 10th question, based on the responses, it would need to suggest a certain solution. Since there are hundreds of permutations and combinations, what logic would be required, and what database design?
thanks
EDIT some more detailed explanation
Suppose my application is used to recommend a data plan from various mobile operators, based on the user answering questions like the time spent on the internet, the type of files being downloaded and so on. If the response to question 1 was a and question 2 was c, etc., then it would be a certain plan. If the response to question 1 was b and question 2 was c, then it would recommend a certain plan. With 10 questions, the number of combinations can be quite large. Is there a certain algorithm that can handle this?
I. What would be the logic?
If I understand correctly, you would define "rules" such as
If the answer to question 5 is either A or B, then the suggested plan would be planB; otherwise execute the rest of the rules.
So you would use a rule engine, e.g. http://www.jboss.org/drools/
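To make the idea of ordered rules concrete without a full rule engine, here is a small sketch in plain Python. It is not Drools. The first rule is the one from the example above, while the second rule and the name `planA` are invented for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answers = Dict[int, str]                      # question number -> chosen option
Rule = Tuple[Callable[[Answers], bool], str]  # (condition, plan to suggest)

RULES: List[Rule] = [
    # Rule from the text: question 5 answered A or B -> planB
    (lambda a: a.get(5) in {"A", "B"}, "planB"),
    # Hypothetical second rule, only here to show the fall-through
    (lambda a: a.get(1) == "a" and a.get(2) == "c", "planA"),
]


def recommend(answers: Answers, rules: List[Rule] = RULES,
              default: Optional[str] = None) -> Optional[str]:
    # The first matching rule wins; otherwise the rest of the rules are tried.
    for condition, plan in rules:
        if condition(answers):
            return plan
    return default
```

With rules like this, you only write one entry per decision that matters instead of enumerating every combination of the 10 answers.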
II. what would be the database design?
This is quite simple:
USERS table,
QUESTIONS table and
ANSWERS table which would refer to the two others
Possibly there would be a QUESTIONNAIRE table as well, and the QUESTIONS table would refer to it.
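As a rough sketch of that layout, here are the four tables in SQLite via Python's standard `sqlite3` module. The column names and the composite primary key on the answers table are my own assumptions, not something the answer above specifies.

```python
import sqlite3

SCHEMA = """
CREATE TABLE questionnaire (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER NOT NULL REFERENCES questionnaire(id),
    text TEXT NOT NULL
);
-- answers refers to both users and questions; one answer per user per question
CREATE TABLE answers (
    user_id INTEGER NOT NULL REFERENCES users(id),
    question_id INTEGER NOT NULL REFERENCES questions(id),
    choice TEXT NOT NULL,
    PRIMARY KEY (user_id, question_id)
);
"""


def make_db():
    # In-memory database, just for trying the schema out.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def answers_for(conn, user_id):
    # Returns {question_id: choice} for one user.
    rows = conn.execute(
        "SELECT question_id, choice FROM answers WHERE user_id = ? ORDER BY question_id",
        (user_id,),
    )
    return dict(rows.fetchall())
```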
Just a 'quick' comment: consider letting the user see how the company they would be recommended changes as they answer each question.
For example, if I am most interested in price that would be the question I would answer first and immediately see the 3 cheapest plans/products recommended to me.
The second question could be coverage and if I then could see the 3 plans with best coverage (in my area) that would be interesting too.
When I answer the third question about smartphone features and say I want internet, then the first question should spit out the 3 cheapest plans/products that include internet; obviously they could change.
And so on...
Maybe it also could be a good idea to let the user "dive into" each question and see the full range of options for that answer. As a user I would appreciate that.
The above comments are just how I would like a form to be made for me. I don't want to answer 10 questions about stuff I don't really put any value on; each user is different and will prefer to make their choice based on their own questions.
So, based on the above, it would be like a checklist where the top answers would be the plans/products with the most fitting check marks. To give immediate responses (as the user answers or alters each question), AJAX would probably be your choice.
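The checklist idea can be sketched as follows: count how many of the questions answered so far each plan satisfies, and show the best few. The plan names and criteria in the usage note are invented, and ties are broken alphabetically only to keep the output stable.

```python
def top_plans(plans, answers, limit=3):
    """plans: name -> {criterion: value}; answers: criterion -> wanted value.

    Only questions the user has actually answered appear in `answers`,
    so the ranking can be recomputed after every single answer.
    """
    def check_marks(features):
        return sum(1 for key, wanted in answers.items() if features.get(key) == wanted)

    # Most check marks first; name as a tie-breaker.
    return sorted(plans, key=lambda name: (-check_marks(plans[name]), name))[:limit]
```

For example, with plans `basic` (low price, no internet), `smart` (low price, internet) and `max` (high price, internet), answering only the price question ranks `basic` and `smart` first. Adding "internet: yes" moves `smart` to the top.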

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to map every new token to a number, and simply treat the presence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run an SVM, that is also nice.
Compare your results with some baseline - say, assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
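Here is a toy version of that pipeline in plain Python, to show the moving parts before reaching for a library. It is a simplified Naive Bayes that only scores tokens present in the page (absent tokens are ignored), and the regex tokenizer assumes the language separates words with spaces or punctuation, which will not hold for every language.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    # \w is Unicode-aware in Python 3; a page becomes the set of tokens it contains.
    return set(re.findall(r"\w+", text.lower()))


def train(docs):
    """docs: list of (text, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)  # label -> token -> number of docs containing it
    vocab = set()
    for text, label in docs:
        toks = tokens(text)
        token_counts[label].update(toks)
        vocab |= toks
    return class_counts, token_counts, vocab


def predict(model, text):
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    toks = tokens(text) & vocab  # tokens never seen in training are dropped
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for tok in toks:
            # Laplace-smoothed estimate of P(token present | label)
            score += math.log((token_counts[label][tok] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def majority_baseline(docs):
    # The baseline to beat: always answer with the most frequent class.
    return Counter(label for _, label in docs).most_common(1)[0][0]
```

Measure `predict` on held-out pages and compare it against `majority_baseline`; if the classifier is not clearly better, that is the signal to start iterating on features and algorithms.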
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-o'-words style) then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you then try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
Main answer is: try different approaches. Without actual testing it's very hard to predict what method will give best results. So, I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons is the number of random factors involved in clustering. For example, the k-means algorithm starts from randomly selected points, which can lead to very different results on different runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements produce well-separated areas in the feature space.
One approach to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary like this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI (also known as LSA). It uses a vector space model to calculate the distance between words and/or documents. In contrast to other methods, it can efficiently treat synonymy and polysemy. The disadvantages of this method are its computational inefficiency and the lack of implementations. One of the phases of LSI makes use of a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation for it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by the user 'faithlessfriend' on GitHub (I won't post a direct link to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" should be applied. From this point of view, Naive Bayes is a good place to start. The only note here is that the multinomial version of Naive Bayes is preferable: in my experience, word counts really do matter.
SVM is a technique for classifying linearly separable data, and text data is almost always not linearly separable (at least several common words appear in any pair of documents). That doesn't mean that SVM cannot be used for text classification - you should still try it, but the results may be much worse than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on a specific topic.

Detecting an online poker cheat

It recently emerged on a large poker site that some players were possibly able to see all their opponents' cards as they played, by exploiting a security vulnerability that was discovered.
A naïve cheater would win at an incredibly fast rate. These cheats are usually caught very quickly, and if not, they are easy to detect through a quick scan of their hand histories.
The more difficult problem occurs when the cheater exhibits intelligence: bluffing in spots where they are bound to be called, or calling river bets with the worst hands. The basic premise is that they lose pots on purpose to disguise their ability to see other players' cards, and they win at a reasonably realistic rate.
Given:
A data set of millions of verified and complete information hand histories
Theoretically unlimited computing power
Assume the game No Limit Hold'em, although suggestions on Omaha or limit poker may be beneficial
How could we reasonably accurately classify these cheaters? The original 2+2 thread appeals for ideas, and I thought that the SO community might have some useful suggestions.
It's an interesting problem also because it is current, and has real application in bettering the world if someone finds a creative solution, as there is a good chance genuine players will have funds refunded to them when identified cheaters are discovered.
Plot VPIP versus winrate for all players with a statistically significant number of hands played. You should see outliers with the naked eye. I think that's the basic thing to do first.
Then you can plot WTSD vs winrate, winrate at showdown vs winrate without showdown, %won at showdown vs VPIP.
The stats you choose must be significant statistically. If you know poker, the above choices make sense.
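The per-player points for such a plot could be prepared roughly like this. The record layout `(player, put_money_in_voluntarily, net_big_blinds)` is an assumption about how the hand histories have been parsed, and `min_hands` is a crude stand-in for "statistically significant".

```python
from collections import defaultdict


def player_stats(records, min_hands=1000):
    """records: iterable of (player, put_money_in_voluntarily, net_big_blinds), one per hand.

    Returns player -> (VPIP in percent, winrate in big blinds per 100 hands),
    keeping only players with at least `min_hands` hands.
    """
    hands = defaultdict(int)
    vpip_hands = defaultdict(int)
    net = defaultdict(float)
    for player, put_money_in, net_bb in records:
        hands[player] += 1
        vpip_hands[player] += bool(put_money_in)
        net[player] += net_bb
    return {
        p: (100.0 * vpip_hands[p] / n, 100.0 * net[p] / n)
        for p, n in hands.items()
        if n >= min_hands
    }
```

Each value is one (x, y) point for the scatter plot; the same loop can be extended with WTSD or showdown counters for the other plots mentioned above.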
This is not a job for a machine; outliers are detected by eye.
EDIT: Omaha is much tougher, since it has much higher variance. There are cases of unbelievable streaks made by weak players who were not cheating.
I hate to be so blunt, but all the answers on this page with the exception of @Erwin Smout's are worthless.
Statistical analysis is a joke for identifying poker cheats
I realize the question allows for millions of hands' worth of history being available to the system. I'm sure there are players with hand histories this large; hell, I've probably played this many online hands. But I've also been playing online for over 10 years. That's not a small amount of time, and it is my understanding that two conflicting things are true when it comes to identifying online poker cheaters: it needs to happen in a small amount of time, and like any good thief, an online poker cheat is going to take his stash elsewhere immediately after the taking.
There was a great example of the variance in poker in this paper, which was generated by matching an always-raise player against an always-call player (page 13 of the PDF). Over the course of 100,000 hands, wayyyy more than I think most people would be willing to play against someone who could see their cards, the always-call player won on average .026 small blinds per hand. I know this does not sound like much, but assuming stakes of $5-10, that comes out to $6,500. Maybe someone can help me find the link, but the measured professional win rate is not too much larger than this. Please note, NEITHER of these players was cheating, and the statistically expected difference over this number of hands is significantly less than what actually transpired.
What online poker players need to understand
Poker is gambling. It is a game of skill, because some players are able to elicit more information from their opponents than their opponents are able to gather, and that extra information is often as useful as seeing other people's cards. Even players who are better than their typical opponents will end up long-term losers. If you do not understand this, you're just searching for witches with statistics in the arbitrarily small number of hands you'll be playing against any opponent.
What can be done?
Keeping in mind that the question states that cheaters are able to see the other players' cards, you don't need statistical analysis to identify them. There are only three ways in which that is possible.
First is that the server is sending the information intentionally to clients, which is an obvious security issue and should not be implemented (IMO, even for moderators). If a site was found allowing this to happen, it is the player's responsibility to move their funds elsewhere, or refuse to play on the site until that terrible design decision is rectified. It should also be the responsibility of the sites to inform their players of the exact steps that take place during hands played on the site, so that players can base their decision on that when choosing a site in the first place. Security by obscurity is unacceptable. As for catching the thieves, this information should be sitting in log files on their servers, which should be regularly audited for this type of behavior.
Second is that the user has hacked the poker server, and they would know about that in a hurry; or else, once it is exposed, it is again the players' responsibility to determine where to play. In this case, the cheater can be prosecuted in most countries.
Lastly, it is possible the dealing algorithm has been cracked. This one was a major problem in the past with companies that used naive methods to deal hands, but most of the major shops solved this problem by taking random inputs from players logged into their system as well as using entropy-generating hardware to seed their random number generator. That's not to say it cannot be cracked, however. If this is the case, the only option is for the company to engineer a new random number generator.
Well. IT people get fascinated by all kinds of wrong questions.
A better question is "how is cheating even possible?". There is no need whatsoever to send the opponents' hands over the wire until showdown. If that data isn't sent to the client, then how could they cheat?
They'd need to break into the server. Don't tell me that isn't preventable.
I think if they cheat intelligently, i.e. without winning too many rounds, it won't be detectable. I don't believe you could see the difference between luck and cheating here.
But I would like to know at which online poker provider the cheating is possible, because I can't imagine how to do this if the poker software is coded properly. If I were asked to program online poker software, the users wouldn't be able to see their opponents' cards, because there would be no way for them to get this information. This is how I would do it:
Every connection between users and server is encrypted
No communication between users; the users can only talk to the server.
The server tells every user only the cards the user should see, and no other cards, unless the round is finished and the users open their cards.
The only way users could cheat here is to get together with other players, or to impersonate multiple players with different accounts and IPs, and open another channel to communicate between the players. This way the group has a big advantage because they know more than their own cards, but there's still no way they can see other cards. And because it's now a group that is cheating, it is even harder to detect, because they can share their earnings among multiple players, and this group could even have a player that loses more than (s)he gains and still win overall.
I doubt you can say with any certainty if someone is cheating or if they are just good at Poker, past a certain point.
You could, however, narrow down the candidates who you think might be cheating by looking at the users who benefited overall during your time period. This will remove the vast majority of users, allowing you to focus your resources better. (This of course will include users who are skilled at Poker.)
Once you've done that, you can compare the history of play from while the cheat was possible to the history afterwards or before, and see if the user's success decreases or increases.
That should give you a list of users who you need to investigate more carefully, possibly by analyzing specific games.
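A rough sketch of that before/during/after comparison, assuming each hand has been reduced to `(player, timestamp, net_big_blinds)` and that the period in which the vulnerability was exploitable is known. The function names and `min_drop` are illustrative, and a big drop is only a reason to look closer, not proof of anything.

```python
from collections import defaultdict


def winrate_shift(hands, window_start, window_end):
    """hands: iterable of (player, timestamp, net_big_blinds), one per hand.

    Returns player -> (rate_inside, rate_outside, difference) in big blinds
    per hand, for players who have hands both inside and outside the window.
    """
    inside, outside = defaultdict(list), defaultdict(list)
    for player, ts, net_bb in hands:
        bucket = inside if window_start <= ts < window_end else outside
        bucket[player].append(net_bb)
    result = {}
    for player in inside.keys() & outside.keys():
        rate_in = sum(inside[player]) / len(inside[player])
        rate_out = sum(outside[player]) / len(outside[player])
        result[player] = (rate_in, rate_out, rate_in - rate_out)
    return result


def suspects(shift, min_drop):
    # Players whose results were better inside the window by at least min_drop,
    # largest difference first.
    flagged = [p for p, (_, _, diff) in shift.items() if diff >= min_drop]
    return sorted(flagged, key=lambda p: -shift[p][2])
```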
Enjoy, it's a nice problem.
For all of you expressing disbelief that this is even possible: the community on the poker forums linked in OP were similarly awestruck, but the site in question has confirmed that such a security vulnerability was present. Quite simply, the site was using very basic and insecure crypto to transmit hole card data to its players. Theoretically, it would have been possible for anyone aware of this to intercept transmissions from the site to a specific victim (eg. by being physically nearby and intercepting wireless data), and to cheat that player using the intercepted knowledge.
The question is about how to detect whether this vulnerability was actually exploited (before it was fixed), and if so by whom, given the resources outlined.
Oh, and also some of you seem to be assuming we're talking about a hypothetical scenario, and/or play-money poker; we're not. The site is real, the vulnerability was real, the investigation is really happening (see link in OP), and the games under investigation are real-money games with normal buyins of $200 and above.
I'm by no means a data-mining expert, and my grasp of statistical analysis of large data sets is pretty weak as well (and I'm not very good at poker, even though I love it) so take everything I say here with a grain of salt.
Weed out the junk data. You are going to only really care about players that fit into two categories: (1) players who win more hands than they lose, (2) players who win more money than they lose. Who cares about a cheater who loses a lot? Heh.
With this pared-down list of players to actually analyze, I would take a look at their style of play. Assuming you have a lot of historical data, I would build a player skill profile and attempt to normalize their betting strategy. As a poor poker player, I normally will back up weaker cards that no decent player would back simply because they feel good. For example, any time I am dealt a face card with another low card (2, 3, 4, 5), if they're suited, I'll almost ALWAYS call any bets made by other players before the turn, even though this strategy is not very successful. Pre-turn raises above the Big Blind often indicate a player has a pocket pair, yet my love of playing won't let me fold a suited hand pre-flop.
So for me, your analysis of my play would say that me matching aggressive calls pre-flop when I have anything suited would be normal. But a different player who only occasionally calls large pre-flop bets would be an indication that something might be out of whack.
I don't know what sort of system you'd need to build to make a profile of different users' styles of play, but I imagine you could use some machine learning algorithms to "learn" a person's style of play with pretty decent accuracy.
You mentioned that a smart user would throw hands to minimize his appearance as a cheater. I think this is a GREAT opportunity for more profiling. Would an experienced, winning player play through an awful hand? Probably not, ever. If I was dealt a 4S, 7H, and saw 9D, JC, AH on the flop, I would know that my chances of winning were really, really small. It also tells us that the cards given on the flop aren't very strong for anyone, so anyone at the table betting probably has a Jack or Ace paired, two pair, or three of a kind. Since you know your 4S, 7H is worthless, you'd either bet hard to bluff the pot or fold outright. Not very many good players (who would have been found in your winning players shortened list) would ever stick around on a hand like that.
Anyway, those are the things I've thought of. Now actually implementing them, I have no idea where to even begin so I'm afraid I can't be of much help there. This is a very interesting academic problem though, so please do us a favor and keep us informed of what you end up going with. If you want to take this conversation offline, feel free to email me at stackoverflow#ericharrison.info.
Could you not look for simple indicators initially before trying to do anything too complex?
E.g. pre-flop: a player folds pocket kings with no raise before him, and someone else had pocket aces.
This MIGHT indicate that the player knows his starting kings (pretty good) are not as good as someone else's pocket aces. However, that assumes he makes the decision pre-flop and not post-flop; it depends, really.
Ignore this, just thinking out loud..
To be perfectly honest, I very much doubt that the players who could see opponents' hands were random. There must be some sort of crossover in the code that generates the card view that was selecting some users but not others. I would recommend running tests on this code and trying to find a trend in the "viewers" and "non-viewers". If you find a strong trend, then the trend could be applied to the actual dataset to see which users, which hands, or whatever else was triggering the code fault.
The answer to your question is simple. There is no way to detect that type of cheater with just hand histories. You need the information that is not public in order to correlate multiple characteristics to find a suspected cheater.
Oh yeah, and obviously the companies that provide these games do everything possible to set up shop in a low-tax, non-regulated country. Until they are regulated and enforce strict code compliance and testing, this will continue to happen.
The most likely cheating situation would seem to be people working together. Three guys at the same table knowing each other's cards should be able to make some betting adjustments that would allow the pool of bettors to come out ahead.
What stops are in place to prevent collusion?

Dynamic Multiple Choice (Like a Wizard) - How would you design it? (e.g. Schema, AI model, etc.)

This question can probably be broken up into multiple questions, but here goes...
In essence, I'd like to allow users to type in what they would like to do and provide a wizard-like interface to ask for information which is missing to complete a requested query. For example, let's say a user types: "What is the weather like in Springfield?"
We recognize the user is interested in weather, but it could be Springfield, IL or Springfield in another state. A follow-up question would be:
Which Springfield did you want weather for?
1 - Springfield, IL
2 - Springfield, WI
You can probably think of a million examples where a request is missing key data or is ambiguous. Make the assumption that the gist of what the user wants can be understood, but there are missing pieces of data required to complete the request.
Perhaps you can take it as far back as asking what the user wants to do and "leading" them to a query.
This is not AI in the sense of taking any input and truly understanding it. I'm not referring to having some way to hold a conversation with a user. It's about inferring what a user wants, checking to see if there is an applicable service to be provided, identifying the inputs needed and overlaying that on top of what's missing from the request, then asking the user for the remaining information. That's it! :-)
How would you want to store the information about services? How would you go about determining what was missing from the input data?
My thoughts:
Use regular expressions to identify clear pieces of information. These will be matched to the parameters of a service. Figure out which parameters do not have matching data and look up the associated question for those parameters. Ask those questions and capture answers. Re-run the service passing in the newly captured data. These would be more free-form questions.
For multiple choice, identify the ambiguity and search for potential matches ranked in order of likelihood (add in user history/preferences to help decide). Provide the top 3 as choices.
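To make my thoughts a bit more concrete, here is a rough sketch of how the service definitions and the "what's missing" check might look. Everything in it (the `SERVICES` layout, the patterns, the wording of the follow-up questions) is made up for illustration, and the patterns are far too naive for real input.

```python
import re

# Hypothetical service registry: a trigger pattern, plus for each parameter
# a pattern to extract it and the question to ask when it is missing.
SERVICES = {
    "weather": {
        "trigger": re.compile(r"\bweather\b", re.I),
        "params": {
            "city": (re.compile(r"\bin ([A-Z][a-z]+)"),
                     "Which city did you want weather for?"),
            "state": (re.compile(r",\s*([A-Z]{2})\b"),
                      "Which state is that in?"),
        },
    },
}


def analyse(request):
    """Returns (service name, parameters found, follow-up questions still to ask)."""
    for name, service in SERVICES.items():
        if service["trigger"].search(request):
            found, questions = {}, []
            for param, (pattern, question) in service["params"].items():
                match = pattern.search(request)
                if match:
                    found[param] = match.group(1)
                else:
                    questions.append((param, question))
            return name, found, questions
    return None, {}, []
```

For "What is the weather like in Springfield?" this finds the city and comes back with one open question for the state. The multiple-choice list would then come from a lookup of known Springfields, which is not shown here.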
Thoughts appreciated.
Cheers,
Henry
"This is not AI in the sense of taking any input and truly understanding it."
It most certainly is! You follow this up by stating exactly that:
"I'm not referring to having some way to hold a conversation with a user. It's about inferring what a user wants, checking to see if there is an applicable service to be provided, identifying the inputs needed and overlaying that on top of what's missing from the request, then asking the user for the remaining information. That's it! :-)"
Inference is at the heart of many topics in AI. What did the user mean? What did the user want? What information should I fetch? How do I parse that information and decide what the answer is?
You're essentially trying to design a state-of-the-art AI system that uses a combination of NLP techniques to parse natural language queries and then (maybe) a learning algorithm to determine how to perform the search, possibly hitting a knowledge base, or maybe Google (which also requires a process to parse the returned data to find the answer).
If there is any way you can constrain how input is entered (i.e. how the query is asked), that will help. But then you'll essentially be building a Web form... which has been done a million times over.
In short, you're attempting to create a very complex system but you explicitly don't want to use any of the relevant techniques. If you're attempting to use regexes to do all of this, good luck to you. Because that's one heck of a deep and dark rabbit hole into which I would not want to fall.
But if you insist, start by finding a good book on NLP, because that's where you'll have to start anyway.

How does Wikipedia avoid duplicate entries?

How can websites as big as Wikipedia sort duplicated entries out?
I need to know the exact procedure from the moment a user creates the duplicate entry onward. If you don't know it but you know a method, please share it.
----update----
Suppose there is wikipedia.com/horse and somebody afterward creates wikipedia.com/the_horse: this is a duplicate entry! It should be deleted or maybe redirected to the original page.
It's a manual process
Basically, sites such as wikipedia and also stackoverflow rely on their users/editors not to make duplicates or to merge/remove them when they have been created by accident. There are various features that make this process easier and more reliable:
Establish good naming conventions ("the horse" is not a well-accepted name, one would naturally choose "horse") so that editors will naturally give the same name to the same subject.
Make it easy for editors to find similar articles.
Make it easy to flag articles as duplicates or delete them.
Make sensible restrictions so that vandals can't mis-use these features to remove genuine content from your site.
Having said this, you still find a lot of duplicate information on wikipedia --- but the editors are cleaning this up as quickly as it is being added.
It's all about community (update)
Community sites (like wikipedia or stackoverflow) develop their procedures over time. Take a look at Wikipedia:about, Stackoverflow:FAQ or meta.stackoverflow. You can spend weeks reading about all the little (but important) details of how a community builds a site together and how they deal with the problems that arise. Much of this is about rules for your contributors --- but as you develop your rules, many of their details will be put into the code of your site.
As a general rule, I would strongly suggest starting a site with a simple system and a small community of contributors who agree on a common goal, are interested in reading the content of your site, like to contribute, and are willing to compromise and to correct problems manually. At this stage it is much more important to have an "identity" of your community and mutual help than to have many visitors or contributors. You will have to spend much time and care to deal with problems as they arise and delegate responsibility to your members. Once the site has a basis and a commonly agreed direction, you can slowly grow your community. If you do it right, you will gain enough supporters to share the additional work amongst the new members. If you don't care enough, spammers or trolls will take over your site.
Note that Wikipedia grew slowly over many years to its current size. The secret is not "get big" but "keep growing healthily".
Having said that, stackoverflow seems to have grown at a faster rate than wikipedia. You may want to consider the different trade off decisions that were made here: stackoverflow is much more restricted in allowing one user to change the contribution of another user. Bad information is often simply pushed down to the bottom of a page (low ranking). Hence, it will not produce articles like wikipedia. But it's easier to keep problems out.
I can add one to Yaakov's list:
* Wikipedia makes sure that after merging the information, "The Horse" points to "Horse", so that the same wrong title cannot be used a second time.
EBAGHAKI, responding to your last question in the comments above:
If you're trying to design your own system with these features, the key one is:
Make the namespace itself editable by the community that is identifying duplicates.
In MediaWiki's case, this is done with the special "#REDIRECT" command -- an article created with only "#REDIRECT [[new article title]]" on its first line is treated as a URL redirect.
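If you are building your own system, the lookup side of such a redirect could be sketched like this. This is not MediaWiki's code: the `pages` dict stands in for your article store, and following a chain of redirects up to `max_hops` is my own simplification.

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of the page text and captures the target title.
REDIRECT = re.compile(r"^#REDIRECT\s*\[\[([^\]|#]+)", re.I)


def resolve(title, pages, max_hops=5):
    """pages: title -> page text. Returns the title the reader should end up on."""
    seen = set()
    while title in pages and title not in seen and len(seen) < max_hops:
        seen.add(title)  # guards against redirect loops
        match = REDIRECT.match(pages[title])
        if not match:
            return title  # an ordinary article
        title = match.group(1).strip()
    return title
```

Because the redirect is just page text, any editor who spots a duplicate can fix the namespace without touching the code.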
The rest of the editorial system used in MediaWiki is depressingly simple -- every page is essentially treated as a block of text, with no structure, and with a single-stream revision history that any reader can add a new revision to. Nothing automatic about any of this.
When you try to create a main page, you are shown a long message encouraging you to search for the page title in various ways to see whether an existing page is already there -- many sites have similar processes. Digg is a typical example of one with an aggressive, automated search to try to convince you not to post duplicates -- you have to click through a screen listing potential duplicates and affirm that yours is different, before you are allowed to post.
I assume they have a procedure that removes extraneous words such as 'the' to create a canonical title, and if it matches an existing page, they do not allow the entry.
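Such a procedure might look something like the sketch below. It is a guess at the idea, not what Wikipedia actually does, and the list of articles to strip is only an example. Since stripping 'the' can also produce false matches (e.g. a band called "The Who"), it is probably safer to show the match as a suggestion than to block the entry outright.

```python
import re

LEADING_ARTICLES = ("the ", "a ", "an ")  # example list, English only


def canonical_title(title):
    # Treat underscores as spaces, collapse whitespace, ignore case.
    t = re.sub(r"[\s_]+", " ", title).strip().lower()
    for article in LEADING_ARTICLES:
        if t.startswith(article):
            t = t[len(article):]
            break
    return t


def find_duplicate(new_title, existing_titles):
    # Returns the existing title that collides with new_title, or None.
    key = canonical_title(new_title)
    for existing in existing_titles:
        if canonical_title(existing) == key:
            return existing
    return None
```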

Resources