How is tf calculated? - solr

I want to know how the term frequency factor, i.e. tf, is calculated.
Specifically, I want to know the tf of the content field. The result for the following query:
curl -g 'http://localhost:8983/solr/nutch/select?indent=on&q=python&wt=json&fl=title,score,[features%20efi.query=python%20store=myfeature_store]',content
is:
...
{
"title":"Raspberry Pi Stack Exchange",
"content":"Raspberry Pi Stack Exchange\nStack Exchange Network\nStack Exchange network consists of 175 Q&A communities including Stack Overflow , the largest, most trusted online community for developers to learn, share their knowledge, and build their careers.\nVisit Stack Exchange\nLoading…\n0\n+0\nTour Start here for a quick overview of the site\nHelp Center Detailed answers to any questions you might have\nMeta Discuss the workings and policies of this site\nAbout Us Learn more about Stack Overflow the company\nBusiness Learn more about hiring developers or posting ads with us\nLog in\nSign up\ncurrent community\nRaspberry Pi\nhelp\nchat\nRaspberry Pi Meta\nyour communities\nSign up or log in to customize your list.\nmore stack exchange communities\ncompany blog\nBy using our site, you acknowledge that you have read and understand our Cookie Policy , Privacy Policy , and our Terms of Service .\nRaspberry Pi Stack Exchange is a question and answer site for users and developers of hardware and software for Raspberry Pi. 
It only takes a minute to sign up.\nSign up to join this community\nAnybody can ask a question\nAnybody can answer\nThe best answers are voted up and rise to the top\nHome\nQuestions\nTags\nUsers\nUnanswered\nExplore our Questions\nAsk Question\nraspbian pi-3 gpio python networking wifi pi-2 usb boot ssh\nmore tags\nActive\nHot\nWeek\nMonth\n0\nvotes\n0\nanswers\n3\nviews\nHostname on router and pi do not match\nheadless\nasked 4 mins ago\nJoseph\n1\n2\nvotes\n0\nanswers\n49\nviews\nAndroid won't connect to RasPi access point\nandroid\naccess-point\nsystemd-networkd\nwpa-supplicant\nmodified 6 mins ago\nThePunisher\n121\n2\nvotes\n3\nanswers\n53\nviews\napt-get update errors after copying Raspbian to new SD card\nraspbian\napt\nmodified 17 mins ago\nifschleife\n121\n1\nvote\n5\nanswers\n444\nviews\nWifi cuts out after a few hours, have to restart Pi\nraspbian\nnetworking\nwifi\nssh\nminecraft\nmodified 53 mins ago\nCommunity ♦\n1\n2\nvotes\n2\nanswers\n369\nviews\nCan't SSH by name on stretch; can on jessie\nssh\nraspbian-stretch\nputty\nmodified 1 hour ago\nCommunity ♦\n1\n0\nvotes\n0\nanswers\n8\nviews\nHow to use only 3 GPIO pins for a JSN-SR04T waterproof ultrasonic sensor\ngpio\nsensor\nasked 2 hours ago\nPeter bill\n191\n1\nvote\n2\nanswers\n52\nviews\nGPIO Not changing its value in a particular code section\ngpio\npython\nrelay\nmodified 2 hours ago\ntlfong01\n2,465\n0\nvotes\n0\nanswers\n1\nview\nMakes OpenVPN a local Apache Webserver accessable from outside?\nweb-server\nvpn\napache-httpd\nweb-browsers\nweb\nasked 2 hours ago\nJakob\n113\n0\nvotes\n1\nanswer\n15\nviews\nsainsmart relay - switches on when pi shuts down\npi-3\nboot-issues\nanswered 2 hours ago\npir8ped\n79\n0\nvotes\n1\nanswer\n301\nviews\nRaspberry Pi Matchbox virtual keyboard missing colon\ndisplay\nmodified 2 hours ago\nCommunity ♦\n1\n-1\nvotes\n0\nanswers\n27\nviews\nHow to fix ssh connection that's been broken by dhcpcd service\nlinux\nnetworking\nssh\ndhcp\nmodified 3 hours 
ago\nBelserich\n1\n4\nvotes\n2\nanswers\n8k\nviews\nHow can I use OpenCV with Python 3 on a Raspberry Pi?\nopencv\npython-3\nanswered 3 hours ago\nIngo\n19.1k\n2\nvotes\n0\nanswers\n14\nviews\nRPi-Zero, HID keyboard gadget for BIOS keyboard\nusb\nkeyboard\nhid\nlibcomposite\nmodified 3 hours ago\nEphemeral\n1,561\n0\nvotes\n0\nanswers\n13\nviews\nHow do I go about auto-mounting my NTFS hard drive at boot?\nboot\nmount\nfstab\nntfs\nasked 3 hours ago\nHasake\n11\nBrowse more Questions\nHot Network Questions\nTriple Approx Symbol\nBest ways to invest for a planned house purchase in 1 year?\nVariable selection in logistic regression model\nShould rooms be designed to minimize waste of sheet goods?\nWhy is Perihelion and Shortest day in North Hemisphere different?\nHow can I estimate the speed of this code section for this microcontroller?\nShell - Navigate up 'n' directories\nLooking for an effective pattern to cope with switch statements in C#\n",
"score":0.00982895,
"[features]":"tf=2.0"},
...
Where does the value 2.0 come from? The word python occurs 4 times and there are 330 words in the content.

Solr now uses the BM25 scorer rather than TF/IDF directly. The tf value reported here is not the raw count of times the term occurs, but its square root, sqrt(TF):
sqrt(4) == 2.0
Raw TF    TF Score
1         1.0
2         1.414
4         2.0
8         2.828
16        4.0
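A minimal sketch of that square-root damping, assuming the reported tf feature is simply sqrt of the raw term count as the answer states. The function name `damped_tf` is hypothetical and is not part of Solr.

```python
import math


def damped_tf(raw_count: int) -> float:
    """Square root of the raw term count (assumed from the answer above)."""
    return math.sqrt(raw_count)


# Print the raw count next to its damped value for a few sample counts.
for raw in (1, 2, 4, 8, 16):
    print(f"{raw:>6}  {damped_tf(raw):.3f}")
```

With a raw count of 4, as in the question, this gives the 2.0 seen in the `[features]` output.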

Related

echoprint server - Fingerprint search taking 2-3 seconds

We are seeing high Solr query times for fingerprint matching. Our setup is as follows:
echonest/echoprint-server running on a single node (solr 1.0) on an Amazon EC2 m3.2xlarge instance with 30G RAM and 8 cores.
2.5 million tracks (segment count 19933333) ingested with solr 1.0; index size is around 91G.
We applied this optimization to HashQueryComponent.java: https://github.com/playax/echoprint-server/commit/706d26362bbe9141203b2b6e7846684e7a417616#diff-f9e19e870c128c0d64915f304cf43677
We also captured stats for the eval method: some loop iterations over the index reader's sequential subreaders took more than 1 second to iterate over all the terms.
Any suggestions or pointers in the right directions will be very helpful.

Why is my Google Cloud SQL instance being billed every hour?

It seems like I'm being overbilled but I want to make sure I am not misunderstanding how Per Use billing works. Here are the details:
I'm running a small test PHP application on Google App Engine with no visitors other than myself every once in a while.
I periodically reset the database via cron: originally every hour, then every 3 hours last month, now every 6 hours.
Pricing plan: Per Use
Storage Used: 0.1% of 250 GB
Type: First Generation
IPv4 address: None
File system replication: Synchronous
Tier: D0
Activation Policy: On demand
Here's the billing through the first 16 days of this month:
Google SQL Service D0 usage - hour 383 hour(s) $9.57
16 days * 24 hours = 384 hours * $.025 = $9.60 . So it appears I've been charged every hour this month. This also happened last month.
I understand that I am charged the full hour for every part of an hour that the SQL instance is active.
Still, with the minimal app usage and the database reset 4 times a day, I would expect the charges (even allowing for a couple extra hours of usage each day) to be closer to:
16 days * 6 hours = 96 hours * $.025 = $2.40.
Any explanation for the discrepancy?
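The arithmetic above can be checked with a short sketch. The $0.025 hourly rate and the 16-day window come from the question; the helper name `charge` is hypothetical, and 6 billed hours per day is the asker's own estimate.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.025")  # D0 per-use rate in USD, as quoted in the question
DAYS = 16


def charge(hours: int, rate: Decimal = HOURLY_RATE) -> Decimal:
    """Cost in USD for a number of billed instance-hours."""
    return hours * rate


always_on_hours = DAYS * 24  # instance billed for every hour of every day
expected_hours = DAYS * 6    # asker's estimate of billed hours per day

print(always_on_hours, charge(always_on_hours))  # hours and cost if always on
print(expected_hours, charge(expected_hours))    # hours and cost as expected
```

The billed 383 hours is within one hour of the always-on figure, which is why it looks like the instance never went idle.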
The logs are the source of truth usually. Check them to see if you are being visited by an aggressive crawler, a stuck task that keeps retrying etc.
Or you may have a cron job that is running and performing work. You can view that in the "task queue/cron jobs" section in the control panel.
You might have assigned an IPv4 address to your instance. The Google Developer Console clearly states:
You will be charged $0.01 each hour the instance is inactive and has an IPv4 address assigned.
This might be the reason for your extra charges.

How to store mathematical expressions/explanations into database

I have been given the task of developing a website for maths students with questions and their explanations. The site will have around 20,000 questions, and I need an effective way (easy storage, fast querying and fast rendering) to store those questions in the database.
Sample Question
In the first 10 overs of a cricket game, the run rate was only 3.2. What should be the run rate in the remaining 40 overs to reach the target of 282 runs?
Required run rate = (282 - (3.2 x 10)) / 40 = 250 / 40 = 6.25
The question text is a simple string and can easily be stored. The real problem is how to store expressions with brackets and division in the database.
You could store the expressions in LaTeX in the database.
Edit:
You can use libraries like http://www.mathjax.org/ for client-side rendering of the equations.
You have several options to store a string representation of mathematical expressions: MathML, LaTeX or ASCIIMathML.
For displaying it in a web browser I recommend MathJax.
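As a rough sketch of the LaTeX approach, the sample explanation could be stored as an ordinary text column and handed to MathJax unchanged at render time. The table and column names below are hypothetical, and an in-memory SQLite database stands in for whatever database the site actually uses.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions ("
    "id INTEGER PRIMARY KEY, question TEXT, explanation_latex TEXT)"
)

question = (
    "In the first 10 overs of a cricket game, the run rate was only 3.2. "
    "What should be the run rate in the remaining 40 overs to reach the "
    "target of 282 runs?"
)
# Raw string so the backslashes of the LaTeX markup are stored as-is.
explanation = (
    r"\text{Required run rate} = \frac{282 - (3.2 \times 10)}{40} "
    r"= \frac{250}{40} = 6.25"
)

# Parameterized insert: the LaTeX source is just a string to the database.
conn.execute(
    "INSERT INTO questions (question, explanation_latex) VALUES (?, ?)",
    (question, explanation),
)
conn.commit()

stored = conn.execute(
    "SELECT explanation_latex FROM questions WHERE id = 1"
).fetchone()[0]
print(stored)  # the LaTeX source, ready to embed in a page for client-side rendering
```

Because the expression is plain text, querying and indexing work as for any other string column; only the browser needs to understand the markup.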

Increase Cakephp 1.3 form security token expiration time

I'm deep into developing an application in Cakephp 1.3 so while I'd like to upgrade to version 2 I'd rather leave that as a last resort.
My problem is that the Token expiration time for forms when the security component is used is too low. I expect people to sit on this form for a little while, but after 10 minutes using it will result in a blackhole.
Is there any way to increase the time before the token expires? Cakephp 2 has this option as detailed here: cakephp 2 security csrf-configuration
It doesn't seem to work in 1.3, is there a way?
Yes, this is perfectly doable: look for the Session.timeout number in your config/core.php and increase it. The number assigned is the base number of seconds; the actual timeout also depends on your security level:
High: Session.timeout * 10
Medium: Session.timeout * 100
Low: Session.timeout * 300.
As you want the features of high security, I would keep it on high but increase Session.timeout to your needs.
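A small sketch of the multipliers listed above, applied to the session timeout only. The function name and the base value of 120 seconds are assumptions chosen for illustration; check config/core.php for the value actually in use.

```python
# Multipliers per security level, as listed in the answer above.
SECURITY_MULTIPLIERS = {"high": 10, "medium": 100, "low": 300}


def effective_session_timeout(session_timeout: int, security_level: str) -> int:
    """Effective session timeout in seconds for a base timeout and level."""
    return session_timeout * SECURITY_MULTIPLIERS[security_level.lower()]


base = 120  # hypothetical Session.timeout value, in seconds
for level in ("high", "medium", "low"):
    seconds = effective_session_timeout(base, level)
    print(f"{level:>6}: {seconds} s ({seconds / 60:.0f} min)")
```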
Further reading.
Update
It actually turns out that the form token timeout depends only on the security level and, for some reason, doesn't take Session.timeout into account (see here), so the only way to increase it is to lower the security level or alter Cake :/

Apache2: server-status reported value for "requests/sec" is wrong. What am I doing wrong?

I am running Apache2 on Linux (Ubuntu 9.10).
I am trying to monitor the load on my server using mod_status.
There are 2 things that puzzle me (see cut-and-paste below):
1) The CPU load is reported as a ridiculously small number, whereas "uptime" reports a number between 0.05 and 0.15 at the same time.
2) The "requests/sec" is also ridiculously low (0.06) when I know there are at least 10 requests coming in per second right now.
(You can see there are close to a quarter million "accesses" - this sounds right.)
I am wondering whether this is a bug (if so, is there a fix/workaround),
or maybe a configuration error (but I can't imagine how).
Any insights would be appreciated.
-- David Jones
- - - - -
Current Time: Friday, 07-Jan-2011 13:48:09 PST
Restart Time: Thursday, 25-Nov-2010 14:50:59 PST
Parent Server Generation: 0
Server uptime: 42 days 22 hours 57 minutes 10 seconds
Total accesses: 238015 - Total Traffic: 91.5 MB
CPU Usage: u2.15 s1.54 cu0 cs0 - 9.94e-5% CPU load
.0641 requests/sec - 25 B/second - 402 B/request
11 requests currently being processed, 2 idle workers
- - - - -
After I restarted my Apache server, I realized what is going on. The "requests/sec" is calculated over the lifetime of the server. So if your Apache server has been running for 3 months, this tells you nothing at all about the current load on your server. Instead, it reports the total number of requests divided by the total number of seconds of uptime.
It would be nice if there was a way to see the current load on your server. Any ideas?
Anyway, ... answered my own question.
-- David Jones
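The lifetime averaging can be reproduced from the status output above. This is only a sketch using the figures from the paste; it assumes the CPU load percentage is total CPU seconds (u + s + cu + cs) divided by uptime, which does reproduce the reported value.

```python
# Figures copied from the server-status output in the question.
uptime_seconds = ((42 * 24 + 22) * 60 + 57) * 60 + 10  # 42 d 22 h 57 min 10 s
total_accesses = 238015
cpu_seconds = 2.15 + 1.54 + 0 + 0  # u + s + cu + cs

# Both figures are averages over the whole uptime, not the current load.
requests_per_sec = total_accesses / uptime_seconds
cpu_load_percent = cpu_seconds / uptime_seconds * 100

print(f"{requests_per_sec:.4f} requests/sec")
print(f"{cpu_load_percent:.2e}% CPU load")
```

Both numbers match the report (.0641 requests/sec and 9.94e-5% CPU load), which confirms they describe the average since the last restart.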
The Apache status value "Total accesses" is the total access count since the server started; its change per second is what we actually mean by "requests per second".
Here is one way to get it:
1) Apache monitor script for zabbix
https://github.com/lorf/zapache/blob/master/zapache
2) Install and configure the zabbix agentd
UserParameter=apache.status[*],/bin/bash /path/apache_status.sh $1 $2
3) Zabbix - Create apache template - Create Monitor item
Key: apache.status[{$APACHE_STATUS_URL}, TotalAccesses]
Type: Numeric(float)
Update interval: 20
Store value: Delta (speed per second) --this is the key option
Zabbix will calculate the increase in the Apache request count between polls and store the per-second delta, which is the current "requests per second".
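The same delta calculation can be sketched without Zabbix. The function below is hypothetical and only does the arithmetic on two samples of the "Total accesses" counter; how the samples are collected (for example by polling the status page every 20 seconds) is left out.

```python
from typing import Optional


def requests_per_second(
    prev_total: int, prev_time: float, curr_total: int, curr_time: float
) -> Optional[float]:
    """Current request rate from two samples of the total-accesses counter.

    Returns None when the counter went backwards, which is what happens
    after a server restart resets it.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_total - prev_total
    if delta < 0:
        return None  # counter reset; skip this interval
    return delta / elapsed


# Illustrative samples taken 20 seconds apart (values are made up).
print(requests_per_second(238015, 0.0, 238215, 20.0))
```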
