Azure Search scoring - azure-cognitive-search

I have sets of 3 identical (in text) items in Azure Search, varying only in Price and Points. Cheaper products with higher points are boosted higher. (Price is boosted more than Points, and is boosted inversely.)
However, I keep seeing search results similar to this.
Search is on ‘john milton’.
I get
Product="Id = 2-462109171829-1, Price=116.57, Points= 7, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=32.499783
Product="Id = 2-462109171829-2, Price=116.40, Points= 9, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=32.454872
Product="Id = 2-462109171829-3, Price=115.64, Points= 9, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=32.316270
I expect the scoring order to be something like this, with the lowest price first.
Product="Id = 2-462109171829-3, Price=115.64, Points= 9, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=
Product="Id = 2-462109171829-2, Price=116.40, Points= 9, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=
Product="Id = 2-462109171829-1, Price=116.57, Points= 7, Name=Life of Schamyl / John Milton Mackie, Description=.", Score=
What am I missing, or are minor scoring variations acceptable?
The index is defined as
let ProductDataIndex =
    let fields =
        [|
            new Field (
                "id",
                DataType.String,
                IsKey = true,
                IsSearchable = true);
            new Field (
                "culture",
                DataType.String,
                IsSearchable = true);
            new Field (
                "gran",
                DataType.String,
                IsSearchable = true);
            new Field (
                "name",
                DataType.String,
                IsSearchable = true);
            new Field (
                "description",
                DataType.String,
                IsSearchable = true);
            new Field (
                "price",
                DataType.Double,
                IsSortable = true,
                IsFilterable = true);
            new Field (
                "points",
                DataType.Int32,
                IsSortable = true,
                IsFilterable = true)
        |]
    let weightsText =
        new TextWeights(
            Weights = ([|
                ("name", 4.);
                ("description", 2.)
            |] |> dict))
    let priceBoost =
        new MagnitudeScoringFunction(
            new MagnitudeScoringParameters(
                BoostingRangeStart = 1000.0,
                BoostingRangeEnd = 0.0,
                ShouldBoostBeyondRangeByConstant = true),
            "price",
            10.0)
    let pointsBoost =
        new MagnitudeScoringFunction(
            new MagnitudeScoringParameters(
                BoostingRangeStart = 0.0,
                BoostingRangeEnd = 10000000.0,
                ShouldBoostBeyondRangeByConstant = true),
            "points",
            2.0)
    let scoringProfileMain =
        new ScoringProfile (
            "main",
            TextWeights = weightsText,
            Functions =
                new List<ScoringFunction>(
                    [
                        priceBoost :> ScoringFunction
                        pointsBoost :> ScoringFunction
                    ]),
            FunctionAggregation = ScoringFunctionAggregation.Sum)
    new Index
        (Name = ProductIndexName
        ,Fields = fields
        ,ScoringProfiles = new List<ScoringProfile>(
            [
                scoringProfileMain
            ]))

All indexes in Azure Search are split into multiple shards, allowing for quick scale up and scale down. When a search request is issued, it's issued against each of the shards independently. The result sets from each of the shards are then merged and ordered by score (if no other ordering is defined). It is important to know that the scoring function weighs query term frequency in each document against its frequency in all documents, within the shard!
This means that in your scenario, in which you have three instances of every document, even with scoring profiles disabled, if one of those documents lands on a different shard than the other two, its score will be slightly different. The more data in your index, the smaller the differences will be (more even term distribution). It's not possible to predict which shard any given document will be placed on.
In general, the document score is not the best attribute for ordering documents. It should only give you a general sense of a document's relevance relative to the other documents in the result set. In your scenario, it would be possible to order the results by price and/or points if you marked the price and/or points fields as sortable. You can find more information on how to use the $orderby query parameter here: https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
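If ordering strictly by price (and then points) is what you actually want, marking those fields sortable and passing $orderby does exactly that. As a minimal sketch, here is what the query could look like against the REST API from Python; the service name, index name, key and api-version below are placeholders, not values from the question:

import requests

url = "https://<your-service>.search.windows.net/indexes/<your-index>/docs"
params = {
    "search": "john milton",
    "$orderby": "price asc, points desc",  # requires the fields to be sortable
    "api-version": "2015-02-28",           # placeholder; use your service's version
}
headers = {"api-key": "<your-query-key>"}

# Documents come back ordered by price, then points, regardless of score.
for doc in requests.get(url, params=params, headers=headers).json()["value"]:
    print(doc["id"], doc["price"], doc["points"])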

Related

Bybit API - How do I calculate qty in USDT Perpetual with leverage

I'm using bybit-api to create a conditional order but don't know how to calculate the quantity. Is it based on the leveraged amount or the original?
For example, I have a balance of 50 USDT and want to use 100% per trade with the following conditions:
BTC at price 44,089.50 with 50x leverage.
SHIB at price 0.030810 with 50x leverage.
How do I calculate the qty parameter?
https://bybit-exchange.github.io/docs/linear/#t-placecond
I trade Bitcoin through USDT Perpetuals (BTCUSDT). I've set up my own Python API and created my own function to calculate quantity for cross margin:
import math

def order_quantity(self, price: float, currency: str = 'USDT', leverage: float = 50.0):
    margin = self.get_wallet_balance(currency)
    instrument = Instrument(self.query_instrument()[0], 'bybit')
    if not price:  # Market orders
        last_trade = self.ws_get_last_trade()  # private function to get last trade
        lastprice = float(last_trade[-1]['price'])
    else:  # Limit orders
        lastprice = price
    totalbtc = float(margin[currency]['available_balance']) * (1 - instrument.maker_fee * leverage)
    rawbtc = totalbtc / lastprice
    btc = math.floor(rawbtc / instrument.lot_size) * instrument.lot_size
    return min(btc, instrument.max_lot_size)
It is based on the leverage amount.
Your quantity should be:
qty = 50 USDT * 50 (leverage) / 44089 (BTC price) = 0.0567 BTC
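Put differently, the quantity is margin * leverage / entry price, rounded down to the contract's lot size before submitting. A quick sketch of that arithmetic with the numbers from the question (lot-size rounding omitted):

def conditional_qty(balance_usdt: float, leverage: float, entry_price: float) -> float:
    # position size in the base coin = margin * leverage / price
    return balance_usdt * leverage / entry_price

print(conditional_qty(50, 50, 44089.50))  # ~0.0567 BTC
print(conditional_qty(50, 50, 0.030810))  # ~81142 SHIB (round down to the lot size in practice)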

How can I slide over values from specific columns to specific columns within the same dataset?

I have a post-join dataset where the columns are identical, except that the right side has new and corrected data and a .TODROP suffix appended to the end of each column's name.
So the dataset looks something like this:
df = spark.createDataFrame(
    [
        (1, "Mary", 133, "Pizza", "Mary", 13, "Pizza"),
        (2, "Jimmy", 8, "Hamburger", None, None, None),
        (3, None, None, None, "Carl", 6, "Cake")
    ],
    ["guid", "name", "age", "fav_food", "name.TODROP", "age.TODROP", "fav_food.TODROP"]
)
I'm trying to slide the right-side columns over to the left-side columns when they contain a value:
if df['name.TODROP'].isNotNull():
    df['name'] = df['name.TODROP']
if df['age.TODROP'].isNotNull():
    df['age'] = df['age.TODROP']
if df['fav_food.TODROP'].isNotNull():
    df['fav_food'] = df['fav_food.TODROP']
However, the problem is that the brute-force solution will take a lot longer with my real dataset because it has a lot more columns than this example. And I'm also getting this error so it wasn't working out anyway...
"pyspark.sql.utils.AnalysisException: Can't extract value from
name#1527: need struct type but got string;"
Another attempt where I try to do it in a loop:
col_list = []
suffix = ".TODROP"
for x in df.columns:
    if x.endswith(suffix) == False:
        col_list.append(x)

for x in col_list:
    df[x] = df[x + suffix]
Same error as above.
Goal:
Can someone point me in the right direction? Thank you.
First of all, the dots in your column names make Spark treat them as struct-type field access, which is what causes the confusion. Be aware of that. I have wrapped the column names in backticks, which prevents the misleading column-type interpretation.
from pyspark.sql import functions as f

suffix = '.TODROP'
cols1 = list(filter(lambda c: not c.endswith(suffix), df.columns))
cols2 = list(filter(lambda c: c.endswith(suffix), df.columns))

for c in cols1[1:]:
    df = df.withColumn(c, f.coalesce(f.col(c), f.col('`' + c + suffix + '`')))

df = df.drop(*cols2)
df.show()
+----+-----+---+---------+
|guid| name|age| fav_food|
+----+-----+---+---------+
| 1| Mary|133| Pizza|
| 2|Jimmy| 8|Hamburger|
| 3| Carl| 6| Cake|
+----+-----+---+---------+
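If the column count makes a loop of withColumn calls awkward, the same coalesce idea can be built as one list of expressions and applied in a single select. A sketch under the same assumptions as above (same df and suffix; dotted names still need backticks):

from pyspark.sql import functions as f

suffix = ".TODROP"
base_cols = [c for c in df.columns if not c.endswith(suffix)]

# one coalesce expression per base column that has a .TODROP counterpart
df_clean = df.select([
    f.coalesce(f.col(c), f.col("`" + c + suffix + "`")).alias(c)
    if c + suffix in df.columns else f.col(c)
    for c in base_cols
])
df_clean.show()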

Applying different multiplier to an integer depending on threshold

I'm having to build a program which calculates the annual cost of minutes used for phone providers, and it depends on different rates.
As an example one phone operator may have the following rates:
"rates": [
{"price": 15.5, "threshold": 150},
{"price": 12.3, "threshold": 100},
{"price": 8}
],
Operators can have multiple tariffs, with the last tariff always having no threshold.
So in the example above, the first 150 minutes will be charged at 15.5p per minute, the next 100 minutes at 12.3p per minute, and all subsequent minutes at 8p.
Therefore if:
AnnualUsage = 1000
the total cost would be 95.55.
I'm struggling to visualise a method which would accommodate the multiple tariffs an operator could have and multiply a value by a different price depending on the threshold.
Please help!
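For reference, the 95.55 comes from walking the tiers: 150 minutes at 15.5p, 100 minutes at 12.3p and the remaining 750 minutes at 8p. A quick Python sketch of that tier-by-tier arithmetic (the answers below do the same in Ruby):

tiers = [(15.5, 150), (12.3, 100), (8.0, None)]  # (price in pence, threshold in minutes)
minutes = 1000
total_pence = 0.0
for price, threshold in tiers:
    used = minutes if threshold is None else min(minutes, threshold)
    total_pence += used * price
    minutes -= used
print(total_pence / 100)  # 2325 + 1230 + 6000 = 9555p -> 95.55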
Just another option; I think it's self-explanatory:
rates = [
  {price: 15.5, threshold: 150},
  {price: 12.3, threshold: 100},
  {price: 8}
]
annual_usage = 1000

res = rates.each_with_object([]) do |h, ary|
  if h.has_key?(:threshold) && annual_usage > h[:threshold]
    annual_usage -= h[:threshold]
    ary << h[:threshold] * h[:price] / 100
  else
    ary << annual_usage * h[:price] / 100
  end
end

res     #=> [23.25, 12.3, 60]
res.sum #=> 95.55
Take a look at Enumerable#each_with_object.
def tot_cost(rate_tbl, minutes)
  rate_tbl.reduce(0) do |tot, h|
    mins = [minutes, h[:threshold] || Float::INFINITY].min
    minutes -= mins
    tot + h[:price] * mins
  end
end

rate_tbl = [
  { price: 15.5, threshold: 150 },
  { price: 12.3, threshold: 100 },
  { price: 8 }
]

tot_cost(rate_tbl, 130) #=> 2015.0 (130*15.5)
tot_cost(rate_tbl, 225) #=> 3247.5 (150*15.5 + (225-150)*12.3)
tot_cost(rate_tbl, 300) #=> 3955.0 (150*15.5 + 100*12.3 + (300-250)*8)
If desired, h[:threshold] || Float::INFINITY can be replaced by
h.fetch(:threshold, Float::INFINITY)
RATES = [
  {price: 15.5, threshold: 150},
  {price: 12.3, threshold: 100},
  {price: 8}
]

def total_cost(annual_usage)
  rate_idx = 0
  idx_in_threshold = 1
  1.upto(annual_usage).reduce(0) do |memo, i|
    threshold = RATES[rate_idx][:threshold]
    if threshold && (idx_in_threshold > RATES[rate_idx][:threshold])
      idx_in_threshold = 1
      rate_idx += 1
    end
    idx_in_threshold += 1
    memo + RATES[rate_idx][:price]
  end
end

puts total_cost(1000).to_i
# => 9555
The key concepts:
using an enumerable method such as reduce to incrementally build the solution. You could alternatively use each but reduce is more idiomatic.
tracking progress through the rates list via the rate_idx and idx_in_threshold variables. These give you all the information you need to determine whether you should advance to the next tier.
Also, avoid writing your hash keys like "price": 15.5 - just remove the quotations, it's more idiomatic.
With an object-oriented approach you can remove explicit if..else statements and maybe make the code more self-explanatory.
class Total
  attr_reader :price

  def initialize(usage)
    @usage = usage
    @billed_usage = 0
    @price = 0
  end

  def apply(rate)
    applicable_usage = [@usage - @billed_usage, 0].max
    usage_to_apply = [applicable_usage, rate.fetch(:threshold, applicable_usage)].min
    @price += usage_to_apply * rate[:price]
    @billed_usage += usage_to_apply
  end
end
Simple usage with the each method:
rates = [
  {:price => 15.5, :threshold => 150},
  {:price => 12.3, :threshold => 100},
  {:price => 8}
]

total = Total.new(1000)
rates.each { |rate| total.apply(rate) }
puts "Total: #{total.price}" # Total: 9555.0 (95.55)

Sort by a key, but value has more than one element using Scala

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop (see below), but I'm wondering if there is a shorter way to do this that utilizes the Spark and Scala duality (i.e. can I decrease computation time?).
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length - 1) {
  val county = uniqueCounty(i).toString;
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(4))).reduceByKey((a, b) => a + b).sortBy(-_._2);
  println("County:" + county + eachCounty.first)
}
Here is a solution using RDDs. I am assuming you need the top-occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50), (2000, "BOB", "KINGS", 40), (2000, "MARY", "NASSAU", 60), (2001, "JOHN", "KINGS", 14), (2001, "JANE", "KINGS", 30), (2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
// Sum the numbers per county/name combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3, x._2), x._4)).reduceByKey(_ + _)
// Group names per county
val countyNameRdd = uniqNamePerCountyRdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
// Sort descending and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
You could use spark-csv and the DataFrame API. If you are using the newer version of Spark (2.0), it is slightly different: Spark 2.0 has a native CSV data source based on spark-csv.
Use spark-csv to load your CSV file into a DataFrame.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)

df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to obtain your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the agg() function.
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")

// there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County, count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
  "count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+

GAE datastore: filter by date interval

I have this model:
class Vehicle(db.Model):
    ...
    start_production_date = db.DateProperty()
    end_production_date = db.DateProperty()
I need to filter, for example, all vehicles in production within, say, 2010:
I thought I could do:
q = (Vehicle.all()
     .filter('start_production_date >=', datetime(2010, 1, 1))
     .filter('end_production_date <', datetime(2011, 1, 1)))
but I get BadFilterError:
BadFilterError: invalid filter: Only one property per query may have inequality filters (<=, >=, <, >)..
So, how do I achieve this? Moreover, this seems to me quite a common task.
One approach is to change your model to something like this:
class Vehicle(db.Model):
    ...
    start_production_date = db.DateProperty()
    start_production_year = db.IntegerProperty()
    start_production_month = db.IntegerProperty()
    start_production_day = db.IntegerProperty()
    end_production_date = db.DateProperty()
    end_production_year = db.IntegerProperty()
    end_production_month = db.IntegerProperty()
    end_production_day = db.IntegerProperty()
Update these new values on every put (you could override put, as sketched below) and then simply:
# specific year
q = (Vehicle.all()
     .filter('start_production_year =', 2010))

# specific year, different months
q = (Vehicle.all()
     .filter('start_production_year =', 2010)
     .filter('start_production_month IN', [1, 2, 3, 4]))

# different years
q = (Vehicle.all()
     .filter('start_production_year IN', [2010, 2011, 2012]))
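A minimal sketch of the "override put" idea mentioned above, showing only the start_* fields for brevity (the end_* fields would be kept in sync the same way):

class Vehicle(db.Model):
    start_production_date = db.DateProperty()
    start_production_year = db.IntegerProperty()
    start_production_month = db.IntegerProperty()
    start_production_day = db.IntegerProperty()

    def put(self, **kwargs):
        # keep the derived fields in sync with the date on every save
        if self.start_production_date:
            self.start_production_year = self.start_production_date.year
            self.start_production_month = self.start_production_date.month
            self.start_production_day = self.start_production_date.day
        return super(Vehicle, self).put(**kwargs)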
I went with this solution:
I set a ListProperty on the model containing all the years the vehicle was in production:
vhl.production_years = range(start_production_date.year, end_production_date.year + 1)
Then test:
q = (Vehicle.all()
     .filter('production_years =', 2010))
