Optimization of different combinations to retrieve a goal - artificial-intelligence

I have a list of multiple options, each with a limited set of values that occur in a random order, and a score that belongs to each combination.
Option 1 : [value1.1, value1.2, value1.3]
Option 2 : [value2.1, value2.2]
Option 3 : [value3.1, value3.2, value3.3, value3.4]
So the data structure would look like the following:

Option 1 | Option 2 | Option 3 | Score
value1.1 | value2.1 | value3.4 | 1.0
value1.1 | value2.2 | value3.1 | 0.5
value1.3 | value2.2 | value3.1 | 0.5
value1.2 | value2.2 | value3.2 | 0.1
...      | ...      | ...      | ...
The same combination of values can occur multiple times with different scores.
How do I find the combination of option values that yields the highest score? Is there a traditional approach to calculate the optimum, or would I need an AI technique for this (and if so, which one)?
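If the whole option space can be enumerated (here only 3 × 2 × 4 = 24 combinations), no AI technique is needed: group the observed scores per combination (since the same combination can occur with different scores), aggregate them, e.g. by mean, and take the argmax. A minimal Python sketch with hypothetical observation data:

```python
from collections import defaultdict
from statistics import mean

# hypothetical observations: ((option1, option2, option3), observed score)
observations = [
    (("value1.1", "value2.1", "value3.4"), 1.0),
    (("value1.1", "value2.2", "value3.1"), 0.5),
    (("value1.1", "value2.1", "value3.4"), 0.8),  # same combo, different score
    (("value1.2", "value2.2", "value3.2"), 0.1),
]

# group the noisy scores by combination
scores = defaultdict(list)
for combo, score in observations:
    scores[combo].append(score)

# rate each combination by its mean observed score and pick the best
best_combo = max(scores, key=lambda c: mean(scores[c]))
print(best_combo)  # -> ('value1.1', 'value2.1', 'value3.4')
```

Only if the option space were too large to enumerate, or if scores for unseen combinations had to be predicted, would a learned model (e.g. regression, bandits, or Bayesian optimization) be the next step.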

Related

How to model arbitrarily ordering items in database?

I accepted a new feature request to re-order some items via a drag-and-drop UI and to save each user's preference to the database. What's the best way to do so?
After reading some questions on StackOverflow, I found this solution.
Solution 1: Use decimal numbers to indicate order
For example,
id | item | order
1  | a    | 1
2  | b    | 2
3  | c    | 3
4  | d    | 4
If I insert item 4 between item 1 and 2, the order becomes,
id | item | order
1  | a    | 1
4  | d    | 1.5
2  | b    | 2
3  | c    | 3
In this way, every new order = (order[i-1] + order[i+1]) / 2.
If I need to save the preference for every user, then I need another relationship table like this,
user_id | item_id | order
1       | 1       | 1
1       | 2       | 2
1       | 3       | 3
1       | 4       | 1.5
I need num_of_users * num_of_items records to save this preference.
However, there's another solution I can think of.
Solution 2: Save the order preference in a column in the User table
This is straightforward: add a column to the User table to record the order. Each value would be parsed as an array of item_ids, ranked by the index of the array.
user_id | item_order
1       | [1,4,2,3]
2       | [1,2,3,4]
Is there any limitation of this solution? Or is there any other ways to solve this problem?
Usually, an explicit ordering deals with the presentation or some specific processing of data. Hence, it's a good idea to separate entities from their presentation/processing. For example:
users
-----
user_id (PK)
user_login
...
user_lists
----------
list_id, user_id (PK)
item_index
item_index can be a simple integer value:
ordered continuously (1, 2, ... N): a DELETE/INSERT of the whole list is normally required to change the order
ordered discretely with some seed (10, 20, ... N): you can insert new items without reordering the whole list
Another reason to separate entity data from lists: reordering a list should be done in a transaction, which may lead to row/table locks. With separated tables, only the data in the list table is affected.
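The discrete-seed variant can be sketched as follows (plain Python standing in for the SQL, helper names hypothetical):

```python
SEED = 10  # gap between consecutive item_index values

def initial_indexes(n):
    """Assign item_index 10, 20, ... n*10 to a list of n items."""
    return [SEED * (i + 1) for i in range(n)]

def index_between(prev_index, next_index):
    """item_index for an item inserted between two neighbours.
    Returns None when the gap is exhausted and the list must be renumbered."""
    candidate = (prev_index + next_index) // 2
    if candidate in (prev_index, next_index):
        return None  # no room left: reindex the whole list
    return candidate

print(initial_indexes(4))     # -> [10, 20, 30, 40]
print(index_between(10, 20))  # -> 15
```

Unlike the decimal-order scheme, integer gaps never run into floating-point precision; they just occasionally require a renumbering pass when a gap is used up.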

Applying a range filter only for a particular field with specific value in SOLR

I have data indexed into Solr with fields like:
name:Apples weight:5kg
name:Grapes weight:2kg
name:papaya weight:7kg
name:Apples weight:3kg
name:Grapes weight:3kg
I want my results to be shown in such a way that all results except Apples come back as usual, and the results for Apples are shown at the end, limited to a weight range of 4-8 kg only.
First you'll have to limit the documents to your criteria - i.e. you want all documents except those that are Apples and outside of 4-8 kg (this assumes that your weight field is an integer - if it isn't, make it an integer field so that you can do proper range searches):
q=(*:* NOT name:Apples) OR (name:Apples AND weight:[4 TO 8])
Then you can apply a negative boost to Apples (which you do by boosting everything that doesn't match by a large factor):
bq=(*:* -name:Apples)^1000
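Put together, the request parameters would look like this (a sketch that only builds the parameter strings, assuming an (e)dismax parser since bq is a dismax-family parameter; no live Solr core is contacted):

```python
from urllib.parse import urlencode

# request parameters for the filter-then-demote approach described above
params = {
    # keep everything except Apples, plus Apples in the 4-8 kg range
    "q": "(*:* NOT name:Apples) OR (name:Apples AND weight:[4 TO 8])",
    # boost everything that is NOT an Apple, which pushes Apples to the end
    "bq": "(*:* -name:Apples)^1000",
    "defType": "edismax",  # bq is interpreted by the (e)dismax parser
}

print(urlencode(params))  # query string ready to append to /select?
```

Note that the range bracket syntax `[4 TO 8]` is inclusive on both ends; use `{4 TO 8}` for an exclusive range.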

SPSS Matching Case Control 1:n

We want to match our cases and controls using SPSS 23.
We already matched our cases and controls on age in a 1:3 ratio with a tolerance of 1 month, as follows:
DATASET ACTIVATE DataSet1.
FUZZY BY=Age SUPPLIERID=Databasenr NEWDEMANDERIDVARS=MatchID1 MatchID2 MatchID3 GROUP=Case FUZZ=1
EXACTPRIORITY=TRUE
MATCHGROUPVAR=Matchgroupvariable
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=FALSE SHUFFLE=TRUE.
Now we have 2 questions:
We want to use different tolerances for our cases. For example, cases aged under 1 year should be matched with a tolerance of 1 month, and cases older than 1 year with a tolerance of 6 months. How can we do that?
We want to distribute the controls equally over the cases. We have 60 cases and 300 controls. First, every case should get a control if possible; then we want to distribute the remaining controls equally over the cases, so that every case has at least one control and as many as possible.
Thanks for your help.
In order to have a fuzz that is more complex than a simple difference, you need to use a custom fuzz function written in Python that computes the match criteria. I have shown a function for this below. Save it in a file named customfuzz.py somewhere Python can find it, such as the python\lib\site-packages directory under your Statistics installation.
Then use CUSTOMMATCH='customfuzz.custommatch' in the FUZZY syntax instead of FUZZ.
Here is the function. The indentation shown is important. The code assumes that the ages are in years (which can include a fractional part).
def custommatch(demander, supplier):
    """calculate match for one variable and return 0 or 1
    demander and supplier are assumed to be (lists of) ages in years"""
    # check for missing values
    if demander[0] is None or supplier[0] is None:
        return 0                            # no match
    delta = abs(demander[0] - supplier[0])  # difference in years
    if demander[0] < 1:                     # demander age < 1 year
        if delta <= .08333:                 # difference <= 1 month
            return 1                        # ok match
        else:
            return 0
    else:
        if delta <= .5:                     # difference <= half a year
            return 1                        # ok match
        else:
            return 0
For the second problem, you would have to do the first match round, then remove all the controls that were actually used, and repeat the process with the reduced dataset of controls.
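The "repeat until the controls are used up" idea amounts to a round-robin assignment, which can be sketched outside of FUZZY like this (hypothetical IDs; the eligibility check is a stub standing in for the age-tolerance rule above):

```python
def distribute(cases, controls, is_match):
    """Round-robin: in each pass, give every case one eligible control,
    until no control can be placed anywhere."""
    assigned = {case: [] for case in cases}
    remaining = list(controls)
    placed = True
    while remaining and placed:
        placed = False
        for case in cases:
            for control in remaining:
                if is_match(case, control):
                    assigned[case].append(control)
                    remaining.remove(control)  # safe: we break right after
                    placed = True
                    break
    return assigned

# toy example: every control matches every case
result = distribute(["c1", "c2"], ["k1", "k2", "k3"], lambda a, b: True)
print(result)  # -> {'c1': ['k1', 'k3'], 'c2': ['k2']}
```

With 60 cases and 300 compatible controls this gives every case its controls one pass at a time, so no case gets a second control before every case has had the chance to get a first.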

SQL Invoice Query Performance converting Credits to negative numbers

I have a 3rd-party database that contains invoice data I need to report on. The Quantity and Amount fields are stored as positive numbers regardless of whether the "invoice" is a credit memo or an actual invoice. A single character field contains the type: "I" = Invoice, "R" = Credit.
In a report that aggregates 1.4 million records, I need to sum this data so that credits subtract from the total and invoices add to it, and I need to do this for 8 different columns in the report (CurrentYear, PreviousYear, etc.).
My problem is the performance of the many different ways to achieve this.
The best-performing approach seems to be a CASE expression inside the calculation, like so:
Case WHEN ARH.AccountingYear - 2 = #iCurrentYear THEN ARL.ShipQuantity * (CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END) ELSE 0 END as PPY_INVOICED_QTY
But readability-wise this is super ugly, since I have to do it for 8 different columns. Performance is good: it runs against all 1.4M records in 16 seconds.
Using a scalar UDF kills performance:
Case WHEN ARH.AccountingYear - 2 = #iCurrentYear THEN ARL.ShipQuantity * dbo.fn_GetMultiplier(ARH.InvoiceType) ELSE 0 END as PPY_INVOICED_QTY
That takes almost 5 minutes, so that's not an option.
Other options I can think of would be:
Multiple levels of Views, use a new view to add a Multiplier column, then SELECT from that and do the multiplication using the new column
Build a table that has 2 columns and 2 records (R, -1 and I, 1) and join it based on InvoiceType - but this seems excessive.
Any other ideas I am missing, or suggestions on best practice for this sort of thing? I cannot change the stored data, that is established by the 3rd party application.
I decided to go with the multiple views as Igor suggested, actually using the nested version; even though readability is lower, maintenance is easier with only 1 named view instead of 2. Performance is similar to the 8 different CASE expressions, so overall it runs in just under 20 seconds.
Thanks for the insights.
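For reference, the arithmetic that all of these variants perform - negate the quantity on 'R' rows, then sum - comes down to a two-entry sign lookup, sketched here with hypothetical rows (Python only to illustrate the logic, not the SQL):

```python
# (InvoiceType, ShipQuantity) pairs; 'I' = invoice, 'R' = credit memo
rows = [("I", 10), ("R", 4), ("I", 7), ("R", 2)]

# the 2-column, 2-record multiplier table suggested in the question
SIGN = {"I": 1, "R": -1}

total = sum(SIGN[inv_type] * qty for inv_type, qty in rows)
print(total)  # -> 11  (10 - 4 + 7 - 2)
```

Whether that lookup lives in a CASE expression, a view, or a joined 2-row table only changes how the optimizer handles it, not the result.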

Vector Space Model query - searching a set of documents

I'm trying to write code for VSM search in C. From a collection of documents I built a hashtable (inverted index) in which each slot holds a word along with its df and a pointer to a list; each node of that list holds the name of a document (in which the word appeared at least once) along with the tf (how many times it appeared in that document).
The user writes a query (and also chooses the weighting scheme qqq.ddd and the comparison method, but that doesn't matter for my question), and I have to print the documents that are relevant to it, from most relevant to least relevant.
The examples I've seen show the steps using only one document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare
1 document: car insurance auto insurance
with the query: best car insurance
So in the example it creates an array like this:
Term      | Query tf | Document tf
auto      |    0     |     1
best      |    1     |     0
car       |    1     |     1
insurance |    1     |     2
The example also gives the df for each term, so using these clues and the weighting and comparison methods it's easy to compare them: turn both into vectors by finding the 4 coordinates (1 for each word in the array).
So in this example there are 1,000,000 documents, and to see how relevant the document is to the query we use each of the 4 words that occur in the query or in the document once; we have to find 4 coordinates and then compare.
In what I'm trying to do there are about 8000 documents, each having from 3 to 50 words. So how am I supposed to compare how relevant a query is to each document? If I have
a query: ping pong
document 1: this is ping kong
document 2: i am ping tongue
To compare the query with document 1, do I use the words this is ping kong pong (so 5 coordinates), and to compare the query with document 2 the words i am ping tongue pong (5 coordinates), and then, since I use the same comparison method, the one with the highest score is the most relevant? OR do I have to use, for both, the union of all the words this is ping kong i am tongue pong (8 coordinates)? So my question is: which is the right way to compare all these 8000 documents with the query? I hope I succeeded in making my question easy to understand. Thank you for your time!
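One way to see it: terms that are absent from the query contribute zero to the dot product, so in practice you only iterate over the query's terms and accumulate a score per document via the inverted index; only the length normalization needs each document's full vector. A minimal cosine-scoring sketch over raw term frequencies (toy data from the question; a real qqq.ddd scheme would use tf-idf weights instead of raw tf):

```python
import math
from collections import Counter

docs = {
    "doc1": "this is ping kong".split(),
    "doc2": "i am ping tongue".split(),
}
query = "ping pong".split()

def cosine(query_tf, doc_tf):
    """Dot product over the query's terms only (others contribute 0),
    normalized by both full vector lengths."""
    dot = sum(query_tf[t] * doc_tf.get(t, 0) for t in query_tf)
    q_norm = math.sqrt(sum(v * v for v in query_tf.values()))
    d_norm = math.sqrt(sum(v * v for v in doc_tf.values()))
    return dot / (q_norm * d_norm) if dot else 0.0

query_tf = Counter(query)
for name, words in docs.items():
    # both toy documents score the same here: each shares only "ping"
    # with the query and both have 4 distinct words
    print(name, round(cosine(query_tf, Counter(words)), 4))
```

So either choice of vocabulary gives the same ranking; the per-pair union and the global union differ only in how many zero coordinates you carry around.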
