How to search two different things in two different fields? - solr

I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my website. On the front end I use the PHP library Solarium to query Solr. I search the following fields by default:
content -> of type Text
title -> of type Text
ur-> of type url
I want to search for a keyword and, at the same time, exclude some of the results based on a URL pattern, without affecting the total number of results returned. (For example, I always want to show 20 results.)
If anyone knows a way of doing this with Solarium, that would be great. If not, I am curious how it can be done in Solr directly.
I have already looked at faceted search, but I couldn't wrap my head around it. If someone can explain it in detail, I would really appreciate it.

I can't help you with Solarium, but your Solr query should be relatively straightforward:
q=+keyword -ur:exclude&rows=20
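When this query is sent over HTTP, the + and the space have to be URL-encoded, because a raw + in a query string is decoded as a space. The following is a minimal sketch of building the encoded parameter string; the function name build_solr_query and its arguments are hypothetical, and the field name ur is taken from the question as written.

```python
from urllib.parse import urlencode


def build_solr_query(keyword, excluded_url_term, rows=20, start=0):
    """Build an encoded Solr query string: require keyword, exclude a URL term."""
    params = {
        # "+" marks a required term, "-" a prohibited one (Lucene syntax).
        "q": f"+{keyword} -ur:{excluded_url_term}",
        "rows": rows,    # number of results to return
        "start": start,  # zero-based offset of the first result
        "wt": "json",    # response format
    }
    # urlencode percent-encodes "+" and ":" and turns the space into "+".
    return urlencode(params)


query_string = build_solr_query("keyword", "exclude")
print(query_string)
```

Because the exclusion is part of the query itself, rows=20 still returns up to 20 matching documents rather than 20 minus the excluded ones.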

http://{url_endpoint}/?wt=json&rows=20&start=0&q=content:contentText OR title:titleText OR ur:url
wt=json returns the result in JSON format
rows=20 returns at most 20 records per page
start=0 is the zero-based offset of the first result to return (for page N, use (N-1)*rows)
q= is the query to run (make sure to escape user input properly; the * wildcard matches anything before or after the term)
In PHP, using cURL:
$solr_end_point = ''; //enter endpoint
$search_term = '';
$url_type = '';
$start = 0;
$ch = curl_init();
$query = urlencode("content:*{$search_term}* OR title:*{$search_term}* OR ur:*{$url_type}*");
curl_setopt($ch, CURLOPT_URL, "http://{$solr_end_point}/?wt=json&rows=20&start={$start}&q={$query}");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
$result = curl_exec($ch);
curl_close($ch);
print_r($result); //output result (json)
$json_result = json_decode($result,true);
print_r($json_result); //output result as an array
exit();

Related

libcurl post request multiple lines

I am currently working on an application that needs to communicate between two web servers. To do that I am using libcurl in C. I am perfectly OK with making GET requests, but I'm finding the POST ones a bit more tricky.
For instance with curl in this case I'd do:
curl --location --request POST '%URL%' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'scope=%scope%' \
--data-urlencode 'client_id=%client_id%' \
--data-urlencode 'client_secret=%client_secret%'
Reading the libcurl documentation I understand I need to curl_easy_setopt(curl, CURLOPT_POST, 1L); to let libcurl know I'm posting.
The only problem I have is how exactly do I make the different lines?
The fact that the CURLOPT_POSTFIELDS is in fact called "fields" rather than "field" makes me think it should support multiple fields natively, so I instinctively think
curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long) strlen(first_line));
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, first_line);
curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long) strlen(second_line));
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, second_line);
curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long) strlen(third_line));
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, third_line);
...and so on.
But that doesn't work and the documentation goes on saying
To make multipart/formdata posts, check out the CURLOPT_MIMEPOST option combined with curl_mime_init.
Which, since I know very little about, looks sort of scary especially looking at the example under this page.
Can anybody help me with the request I need to make or at least explain the MIME thing a little simpler?
From the mime page on curl.se I get the feeling that I should already know the things I don't know and the research I have done hasn't really shed any more light.
Since the server is expecting the data in application/x-www-form-urlencoded format, MIME doesn't apply, so ignore that.
Despite its name, CURLOPT_POSTFIELDS is not treated in a plural manner. It cannot be used in multiple calls to curl_easy_setopt() to concatenate multiple pieces of data into a single POST body; each call simply replaces the previous value. It expects the entire POST body to be specified in a single call, as documented in CURLOPT_POSTFIELDS explained:
Pass a char * as parameter, pointing to the full data to send in an HTTP POST operation. You must make sure that the data is formatted the way you want the server to receive it. libcurl will not convert or encode it for you in any way. For example, the web server may assume that this data is URL encoded.
curl.exe concatenates multiple --data... parameters together into a single POST body:
If any of these options is used more than once on the same command line, the data pieces specified will be merged with a separating &-symbol. Thus, using '-d name=daniel -d skill=lousy' would generate a post chunk that looks like 'name=daniel&skill=lousy'.
So, you will have to do the same in your own code, e.g.:
// build up this string however you need to...
const char *post_body = "grant_type=client_credentials"
"&scope=<url-encoded value>"
"&client_id=<url-encoded value>"
"&client_secret=<url-encoded value>";
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, post_body);
// no need for CURLOPT_POSTFIELDSIZE in this case,
// libcurl will handle that internally...
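To make the "single body joined with &" format concrete, here is a small sketch that URL-encodes each value and joins the pairs the way curl does for repeated --data-urlencode options. The field values are hypothetical placeholders, and it uses Python's standard library only to show the resulting string; in C, libcurl's curl_easy_escape() can be used to encode each value before concatenating.

```python
from urllib.parse import urlencode

# Hypothetical values, used only to show how the body is assembled.
fields = [
    ("grant_type", "client_credentials"),
    ("scope", "https://example.com/.default"),
    ("client_id", "my-client-id"),
    ("client_secret", "s3cr3t&value"),
]

# Each value is percent-encoded, then the name=value pairs are joined with "&".
post_body = urlencode(fields)
print(post_body)
```

Note that the & inside the last value is encoded as %26, so it cannot be mistaken for a field separator.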

Batch fetching messages performance

I need to get the last 100 messages in the INBOX (headers only). For that I'm currently using the IMAP extension to search and then fetch the messages. This is done with two requests (SEARCH and then UID FETCH).
What's the Gmail API equivalent to fetching multiple messages in one request?
All I could find is a batch API, which seems way more cumbersome (composing a long list of messages:get requests wrapped in plain HTTP code).
It's pretty much the same in the Gmail API as in IMAP. Two requests: first messages.list to get the message IDs, then a (batched) messages.get to retrieve the ones you want. Depending on what language you're using, the client libraries may help with constructing the batch request.
A batch request is a single standard HTTP request containing multiple Google Cloud Storage JSON API calls, using the multipart/mixed content type. Within that main HTTP request, each of the parts contains a nested HTTP request.
From: https://developers.google.com/storage/docs/json_api/v1/how-tos/batch
It's really not that hard, took me about an hour to figure it out in python even without the python client libraries (just using httplib and mimelib).
Here's a partial code snippet of doing it, again with direct Python. Hopefully it makes clear that there's not too much involved:
msg_ids = [msg['id'] for msg in body['messages']]
headers['Content-Type'] = 'multipart/mixed; boundary=%s' % self.BOUNDARY
post_body = []
for msg_id in msg_ids:
    post_body.append(
        "--%s\n"
        "Content-Type: application/http\n\n"
        "GET /gmail/v1/users/me/messages/%s?format=raw\n"
        % (self.BOUNDARY, msg_id))
post_body.append("--%s--\n" % self.BOUNDARY)
post = '\n'.join(post_body)
(headers, body) = _conn.request(
    SERVER_URL + '/batch',
    method='POST', body=post, headers=headers)
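The body-building part of that snippet can be pulled out into a standalone function, which makes the multipart/mixed layout easier to see. This is only a sketch of the same string assembly: the function name build_batch_body, the boundary value, and the message IDs are hypothetical, and nothing is sent over the network.

```python
def build_batch_body(msg_ids, boundary):
    """Assemble a multipart/mixed batch body with one nested GET per message ID."""
    parts = []
    for msg_id in msg_ids:
        parts.append(
            f"--{boundary}\n"
            "Content-Type: application/http\n\n"
            f"GET /gmail/v1/users/me/messages/{msg_id}?format=raw\n"
        )
    # The closing delimiter is the boundary followed by two extra dashes.
    parts.append(f"--{boundary}--\n")
    return "\n".join(parts)


body = build_batch_body(["a1", "b2"], "batch_boundary")
print(body)
```

Each part opens with the boundary line and carries one nested HTTP request; the same boundary string must also appear in the outer Content-Type header.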
Great reply!
If somebody wants to use a raw function in php to make batch requests for fetching emails corresponding to message ids, please feel free to use mine.
function perform_batch_operation($auth_token, $gmail_api_key, $email_id, $message_ids, $BOUNDARY = "gmail_data_boundary"){
    // Build one nested GET request per message ID.
    $post_body = "";
    foreach ($message_ids as $message_id) {
        $post_body .= "--$BOUNDARY\n";
        $post_body .= "Content-Type: application/http\n\n";
        $post_body .= 'GET https://www.googleapis.com/gmail/v1/users/'.$email_id.
            '/messages/'.$message_id.'?metadataHeaders=From&metadataHeaders=Date&format=metadata&key='.urlencode($gmail_api_key)."\n\n";
    }
    $post_body .= "--$BOUNDARY--\n";
    $headers = [ 'Content-type: multipart/mixed; boundary='.$BOUNDARY, 'Authorization: OAuth '.$auth_token ];
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, 'https://www.googleapis.com/batch');
    curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $post_body);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    // Note: the next two options disable TLS certificate verification.
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    $tmp_response = curl_exec($curl);
    curl_close($curl);
    return $tmp_response;
}
FYI, the above function gets just the headers for the emails, in particular the From and Date fields. Please adjust it according to the API documentation: https://developers.google.com/gmail/api/v1/reference/users/messages/get
In addition to MaK's answer, you can perform multiple batch requests using the google-api-php-client and Google_Http_Batch():
$optParams = [];
$optParams['maxResults'] = 5;
$optParams['labelIds'] = 'INBOX'; // Only show messages in Inbox
$optParams['q'] = 'subject:hello'; // search for hello in subject
$messages = $service->users_messages->listUsersMessages($email_id,$optParams);
$list = $messages->getMessages();
$client->setUseBatch(true);
$batch = new Google_Http_Batch($client);
foreach($list as $message_data){
$message_id = $message_data->getId();
$optParams = array('format' => 'full');
$request = $service->users_messages->get($email_id,$message_id,$optParams);
$batch->add($request, $message_id);
}
$results = $batch->execute();
Here is the Python version, using the official Google API client. Note that I did not use the callback here, because I need to handle the responses synchronously.
from apiclient.http import BatchHttpRequest
import json
batch = BatchHttpRequest()
# assume we got messages from the Gmail query API
for message in messages:
    batch.add(service.users().messages().get(userId='me', id=message['id'],
                                             format='raw'))
batch.execute()
for request_id in batch._order:
    resp, content = batch._responses[request_id]
    message = json.loads(content)
    # handle your message here, like a regular email object
The solution from Walty Yeung worked only partially for my use case.
If you tried the code and nothing happens, create the batch like this instead:
batch = service.new_batch_http_request()

cURL error 26 couldn't open file

I need some help.
A note up front: the code works, but it misbehaves under load testing.
Overall, the PHP code receives a file from the user and saves it on a cluster (4 nodes) as three parts.
A simplified version of the code:
$user_file = get_full_filename_of_userfile_from_post_array('user_file');
$fin = fopen($user_file,'rb');
$fout1 = fopen(get_uniq_name('p1'),'wb');
$fout2 = fopen(get_uniq_name('p2'),'wb');
$fout3 = fopen(get_uniq_name('p3'),'wb');
while ($part = fread($fin, 8192)) // fread() requires a length; 8192 is an illustrative chunk size
{
fwrite($fout1,get_1_part($part));
fwrite($fout2,get_2_part($part));
fwrite($fout3,get_3_part($part));
}
fclose($fin);
fclose($fout1);
fclose($fout2);
fclose($fout3);
$location = get_random_nodes(3,$array_of_existing_nodes);
foreach($location as $key => $node)//key 1..3
{
if(is_local_node($node))
{
save_local_file(get_part_file_name($key));
}
else
{
save_remote_file(get_part_file_name($key),$node);
}
}
//delete tmp files, logs,...etc
save_remote_file() uses cURL to send the file with a POST request, as in the docs:
$post_data = array(
'md5sum' => $md5sum,
'ctime' => $ctime,
.....
'file' => '@'.$upload_file_full_name,
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, Config::get('connect_curl_timeout'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, $URL.'/index.php');
curl_setopt($ch, CURLOPT_PORT, Config::get('NODES_PORT'));
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
During the test I upload 14000 files, one file per request, with 10 parallel requests per node.
The PHP code checks the file and returns a response, then saves the file on the cluster in the background.
(Yes, I know it would be better to create a daemon for saving; that is a task for the future.)
So at times there may be about 100, or even 200, PHP processes running in the background on a node
(using the php-fpm function fastcgi_finish_request()):
ignore_user_abort(true);
set_time_limit(0);
session_commit();
if(function_exists('fastcgi_finish_request'))
fastcgi_finish_request();
A few calculations:
14000 files = 14000 * 3 = 42000 parts, each file's parts saved on a random 3 of the 4 nodes, so 25% of the parts are saved locally and 75% remotely
= 0.75 * 42000 = 31500 remote saves
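The arithmetic above can be checked with a few lines, which also give the failure rate implied by roughly 100 errors. This is only a sketch that takes the question's numbers and its 25%/75% local/remote split as given; the variable names are hypothetical.

```python
files = 14000
parts_per_file = 3
nodes = 4

total_parts = files * parts_per_file               # parts written overall
remote_fraction = 1 - 1 / nodes                    # 75% of parts go to another node
remote_saves = int(total_parts * remote_fraction)  # uploads performed via cURL

errors = 100                                       # approximate count from the test
error_rate = errors / remote_saves

print(total_parts, remote_saves, f"{error_rate:.2%}")
```

So the failures affect roughly one remote upload in every three hundred, which fits an intermittent problem rather than a systematic one.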
During the test I get about 100 errors from curl across all nodes:
errno = 26
error = couldn't open file "##ІИP_zOж
лЅHж"
This is strange, because the original file name is about 124 characters long, for example:
/var/vhosts/my.domains.com/www/process/r_5357bc33f3686_h1398258739.9968.758df087f8db9b340653ceb1abf160ae8512db03.chunk0.part2
Before the cURL code I added checks with file_exists($upload_file_full_name)
and is_readable($upload_file_full_name), logging any failure.
The checks pass, but curl still returns the error (about 100 times out of 31500).
I also added retry code: on error, wait 10 seconds and try again, up to three times.
If the first attempt fails, all subsequent attempts fail too, yet according to the logs, other PHP processes saving other files at the same time send their parts via curl successfully.
So I don't understand how to find the cause and fix it.

How read and write to cloudant using PHP

I have been trying to use the cURL examples on Cloudant from PHP. However, nothing I try works. All I want to do is read, write, and search data on Cloudant using PHP, but there doesn't seem to be a simple way for a beginning developer to do this.
Code here:
//Get DB's
$returned_content = get_data('https://**InstanceName**:**Password**@**InstanceName**.cloudant.com/_all_dbs');
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
The error I get is:
{"error":"unauthorized","reason":"Name or password is incorrect"}
According to How do I make a request using HTTP basic authentication with PHP curl?, you need to set the basic auth credentials outside the URL. Adapting their example to your variable names:
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
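For reference, HTTP basic authentication boils down to sending an Authorization header containing the base64 of "username:password". The sketch below builds that header by hand to show what ends up on the wire; the account name, password, and helper name basic_auth_header are hypothetical, and the request object is only constructed, not sent.

```python
import base64
from urllib.request import Request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Hypothetical account name; the credentials stay out of the URL entirely.
request = Request(
    "https://example-account.cloudant.com/_all_dbs",
    headers={"Authorization": basic_auth_header("user", "pass")},
)
print(request.get_header("Authorization"))
```

Keeping the credentials in a header also avoids problems with special characters in the password, which would otherwise need escaping inside a URL.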
If you're using PHP to interact with your Cloudant database, you may want to check out the Sag API library. It makes things fairly easy, and there are some great examples of how to use it on their site. Hope this helps.
You may also want to check out IBM's new Bluemix environment, where Cloudant is one of the available services. The integration between Cloudant and Bluemix is excellent.
http://www.saggingcouch.com/
http://www.bluemix.net
I have used Node.js & NodeExpress with Cloudant. You could do similar stuff for PHP. See if these 2 posts help
A Cloud Medley with IBM Bluemix, Cloudant & Node.js
Rock N'Roll with Bluemix, Cloudant & NodeExpress
// this is how I get cloudant api keys in PHP
$username='yourusername';
$password='yourpassword';
$URL='https://yourusername.cloudant.com/_api/v2/api_keys';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$URL);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
//curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
//$status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE); //get status code
$response=curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
//$httpCode = 303; // test a fail code
if($httpCode == 201) {
$json = json_decode($response, true);
if($json['ok'] == 1) {
$cloudantkey = $json['key'];
$cloudantpassword = $json['password'];
echo "Obtained Cloud Api Keys.. <br />";
} else {
echo("error $httpCode");
die();
}
} else {
echo("error $httpCode");
die();
}
if(curl_error($ch)){
echo("curl error");
die();
}
curl_close ($ch);

How to display time according to country timezone in Cakephp

I am building a web app on CakePHP 2.2. I am new to CakePHP and have never worked with time handling. I want to show the time for the place where the website is being viewed. For example, if someone views my website in Australia, they should see the time for their country. I also want to show the date and time separately, not combined.
Here is what I am doing right now.
In AppController I have done this:
public function beforeRender(){
if ($this->Auth->login() || $this->Auth->loggedIn()) {
App::uses('CakeTime', 'Utility');
}}
I also don't know how to echo it.
We can find the client's location without using a database by calling an API; for example, I used the hostip.info API.
Controller code:
$clientIpAddress = $this->request->clientIp();
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://api.hostip.info/get_html.php?ip=$clientIpAddress&position=true");
curl_setopt($ch, CURLOPT_HEADER,0);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
$data will contain something like this: Country: UNITED STATES (US) City: Aurora, TX Latitude: 33.0582 Longitude: -97.5159 IP: 12.215.42.19
So you get the location from $data.
Use it in the view file as follows. Note that the last argument must be a valid PHP time zone identifier, so a city such as Aurora, TX has to be mapped to its zone (here America/Chicago) first:
$this->Time->format('F jS, Y h:i A',date('M d, Y h:i:s'), null,'America/Chicago');
But don't forget to store the city name and IP address in the session, so there is no need to send the curl request on every page load for the same user. Look up the location once, write it to the session, and reuse it.
An easy way to manage the time is to save all dates and times as GMT+0 or UTC. Uncomment the line date_default_timezone_set('UTC'); in app/Config/core.php to ensure your application’s time zone is set to GMT+0.
Next add a time zone field to your users table and make the necessary modifications to allow your users to set their time zone. Now that we know the time zone of the logged in user we can correct the date and time on our posts using the Time Helper:
echo $this->Time->format('F jS, Y h:i A', $post['Post']['created'], null, $user['User']['time_zone']);
The precondition is that you know the user's time zone, and it is best to store the time zone for each registered user.
If you don't want to require registration, you can also use a third-party service such as ip2location to look up the time zone from the IP address.
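The "store everything in UTC, convert on display" approach can be illustrated outside CakePHP as well. The following is a minimal sketch in Python, assuming the user's time zone is stored as an IANA identifier and that the system time zone database is available; the function name split_local_date_time and the sample timestamp are hypothetical. It also returns the date and the time as separate strings, as the question asks.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def split_local_date_time(utc_dt, tz_name):
    """Convert a UTC datetime to the user's zone; return (date, time) strings."""
    local_dt = utc_dt.astimezone(ZoneInfo(tz_name))
    return local_dt.strftime("%B %d, %Y"), local_dt.strftime("%I:%M %p")


# A timestamp stored in UTC, displayed for a user whose zone is Australia/Sydney.
created_utc = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
date_part, time_part = split_local_date_time(created_utc, "Australia/Sydney")
print(date_part)
print(time_part)
```

Using named zones rather than fixed offsets means daylight saving changes are handled by the time zone database instead of by application code.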
