several questions about using vespa - vespa

1. I followed the instructions at http://docs.vespa.ai/documentation/vespa-quick-start.html and issued a YQL query via curl (curl -s http://localhost:8080/search/?yql=select%20%2A%20from%20sources%20%2A%3B), but got the error "message": "Could not instantiate query from YQL". Could anyone point out whether I missed starting some service?
2. I want to store all documents in physical memory for fast queries. Is there any configuration that lets me achieve that? By the way, are documents compressed by default? (I would also like to avoid disk I/O when feeding documents.)
3. I would appreciate it if anyone could share some internal architecture/design documentation for the content/search node. Thanks.
Update: point 1 now works, per comment #1 below.

I will let others respond to points 2 and 3, but my guess for point 1 is that you are missing the "where" clause in your YQL query, hence the failure.
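Something like this should work once a where clause is added (an untested sketch using Python's requests library; the quick-start container on localhost:8080 and the "default contains" example term are assumptions based on the tutorial's sample data):

import requests

# A YQL query with a where clause; the field and term are just examples
# matching the quick-start sample documents.
yql = 'select * from sources * where default contains "bad";'

response = requests.get("http://localhost:8080/search/", params={"yql": yql})
print(response.status_code)
print(response.json())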

3) We don't have updated design documentation, sorry about that. If you ask more concrete questions I can provide answers or find something for you (but I suppose you already did that in https://github.com/vespa-engine/vespa/issues/5434).

On 2): you can define your fields as attributes and use a custom document summary that references only attribute fields. See http://docs.vespa.ai/documentation/attributes.html and http://docs.vespa.ai/documentation/document-summaries.html.
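For example, something along these lines (untested; attributesonly is a hypothetical summary class you would have to define in your search definition, and I am assuming the search API's summary request parameter selects which summary class to return, per the document-summaries page above):

import requests

# Hypothetical: request results using a custom summary class that contains only
# attribute fields, so hits can be filled from the in-memory attribute store.
params = {
    "yql": 'select * from sources * where default contains "bad";',
    "summary": "attributesonly",  # assumed summary class defined in the .sd file
}
response = requests.get("http://localhost:8080/search/", params=params)
print(response.json())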

Blockfrost.io get transaction data with addresses (from/to)

I am looking at using the Blockfrost.io API in order to read Cardano transactions. I want to get the bare minimum, which is:
Address from
Address to
Assets transferred (type + amount)
Fees
So far I cannot find out how to retrieve a transaction's from and to addresses while using:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/~1txs~1%7Bhash%7D/get
Am I missing something?
So, to answer my own question:
Cardano uses something called "UTXOs" (unspent transaction outputs) to handle transactions, and I would invite everyone to read about these.
Regarding Blockfrost.io, this means you need to have a look at the transactions API:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/%7E1txs%7E1%7Bhash%7D/get
and also combine it with the utxos api:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/~1txs~1{hash}~1utxos/get
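For what it's worth, a minimal Python sketch of combining the two endpoints (untested; the mainnet base URL, the project_id header, and the response field names such as inputs, outputs, address, amount and fees are my reading of the docs linked above):

import requests

BASE = "https://cardano-mainnet.blockfrost.io/api/v0"  # assumed mainnet base URL
HEADERS = {"project_id": "YOUR_PROJECT_ID"}            # Blockfrost API key header

def transaction_summary(tx_hash):
    # /txs/{hash} carries the fee, /txs/{hash}/utxos carries inputs and outputs.
    tx = requests.get(f"{BASE}/txs/{tx_hash}", headers=HEADERS).json()
    utxos = requests.get(f"{BASE}/txs/{tx_hash}/utxos", headers=HEADERS).json()
    return {
        "from": [i["address"] for i in utxos["inputs"]],
        "to": [o["address"] for o in utxos["outputs"]],
        # each entry in "amount" is assumed to look like {"unit": ..., "quantity": ...}
        "assets": [a for o in utxos["outputs"] for a in o["amount"]],
        "fees": tx["fees"],
    }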

Django Model: Best Approach to update?

I am trying to update hundreds of objects in my job, which is scheduled every two hours.
I have an articles table in my model. Every article is parsed and then different attributes are saved for each article.
First I query for all unparsed articles, then parse the URL saved against each article and save the received attributes.
Below is my code:
articles = Articles.objects.filter(status=0)  # hundreds of unparsed articles
for art in articles:
    try:
        url = art.link
        result = ArticleParser(url)  # custom function which does all the parsing
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1  # parsed successfully
        art.save()
    except Exception as e:
        # on any parsing failure, blank the fields and mark the article as failed
        art.author = ""
        art.description = ""
        art.imageurl = ""
        art.status = 2
        art.save()
The thing is, when this job is running, both CPU utilization and DB process utilization are very high. I am trying to pinpoint when and where it spikes.
Question: Is this the right way to update multiple objects, or is there a better way to do it? Any suggestions are appreciated.
Appreciate your help.
Regards
Edit 1: Sorry for the confusion; some explanation is needed. Fields like author, description, etc. will be different for every article; they are returned after I parse the URL. The reason I am updating in a loop is that these fields differ on every iteration according to the URL. I have updated the code; I hope that helps clear up the confusion.
You are doing 100s of DB operations in a relatively tight loop, so it is expected that there is some load on the DB.
If you have a lot of articles, make sure you have an index on the status column to avoid a table scan.
You can try disabling autocommit and wrapping the whole update in one transaction instead.
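For illustration, a rough sketch of that suggestion (untested; ArticleParser and the field names are taken from the question):

from django.db import transaction

articles = Articles.objects.filter(status=0)

with transaction.atomic():  # one commit at the end instead of one per save()
    for art in articles:
        try:
            result = ArticleParser(art.link)  # custom parser from the question
            art.author = result.articleauthor
            art.description = result.articlecontent[:5000]
            art.imageurl = result.articleImage
            art.status = 1
        except Exception:
            art.author = art.description = art.imageurl = ""
            art.status = 2
        art.save()

On Django 2.2 or newer you could also drop the per-object save() calls, collect the modified objects in a list, and write them back in batches with Articles.objects.bulk_update(articles, ["author", "description", "imageurl", "status"], batch_size=100), which reduces the number of round-trips further.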
From my understanding, you do NOT want to set the fields author, description and imageurl to the same value on all articles, so QuerySet.update won't work for you.
Django recommends QuerySet.update() and delete() when you want to update or delete many objects at once: https://docs.djangoproject.com/en/1.6/topics/db/optimization/#use-queryset-update-and-delete
1. It is better not to catch the broad Exception; specify concrete exceptions instead: KeyError, IndexError, etc.
2. The data can be created once. Something like this:
data = dict(
    author=articleauthor,
    description=articlecontent[:5000],
    imageurl=articleImage,
    status=1,
)
Articles.objects.filter(status=0).update(**data)
To Edit 1: You probably want to set up periodic tasks with Celery, i.e. a separate task for each article. For help, see the Celery documentation.
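A rough sketch of that idea, assuming Celery is already configured for the project (the import paths and task names below are placeholders):

from celery import shared_task

# Placeholder import paths; Articles and ArticleParser come from the question.
from myapp.models import Articles
from myapp.parsers import ArticleParser

@shared_task
def parse_article(article_id):
    # One article per task, so a slow or failing URL only affects its own task.
    art = Articles.objects.get(pk=article_id)
    try:
        result = ArticleParser(art.link)
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
    except Exception:
        art.author = art.description = art.imageurl = ""
        art.status = 2
    art.save()

@shared_task
def enqueue_unparsed_articles():
    # The every-two-hours job (e.g. via celery beat) only fans out ids;
    # the actual parsing happens in the workers.
    for article_id in Articles.objects.filter(status=0).values_list("id", flat=True):
        parse_article.delay(article_id)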

Extracting the qualifiers (WHERE clause) - postgresql

I'm working on the PostgreSQL 8.4 source code. I need to extract the qualifiers (the WHERE part) from a query.
For example if the query is: select name from student where age > 18
I need to know "age" and "18".
I've already extracted the target list and the range table in this way:
Query *query_idr = (Query *) linitial(querytree_list);
ListCell *l;
ListCell *tl;

/* range table: one RangeTblEntry per relation in the query */
foreach(l, query_idr->rtable)
{
    Oid tab_id = ((RangeTblEntry *) lfirst(l))->relid;
}

/* target list: one TargetEntry per output column */
foreach(tl, query_idr->targetList)
{
    TargetEntry *tle = (TargetEntry *) lfirst(tl);
    Oid col_id = tle->resorigtbl;
}
and it works: I get the id of the student table (with the first foreach) and the id of the name column (with the second foreach), but I can't work out how to get the qualifiers.
Here is the navigable Query structure http://doxygen.postgresql.org/structQuery.html
I doubt you are going to get an answer here. In general, a question about hacking the PostgreSQL source code is unlikely to find enough people on a general site like this who can answer it helpfully. However, rather than leave this without any resources, I would like to reply with some pointers for answering questions like this one, as well as my reading of the docs as someone with quite a bit of experience building things on Pg.
In essence what you are trying to do is navigate through the parse tree of the query. It looks to me like the setOperations member might be the place to look just because I can't think of anywhere else and because this might help with both join conditions and where clause filters (remember these are considered interchangeable by the planner). However I have little experience in this area and so I could be wrong.
I would entirely second the suggestion that the pgsql-hackers list is likely to be the best place to ask this sort of question. You will probably get a better answer there.

How to make datastore keys mapreduce-friendly(-er)?

Edit: See my answer. The problem was in our code. MR works fine; it may have a status-reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup"; when creating new entities of this model, we use the same id that AdWords returns (it's an integer), but we store it as a string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us; I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more)
the mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried running this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, the mapper-calls counter had a different value, ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader, mapper-calls should be equal to the number of entities, so it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, my questions. Does it matter how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find and map all of my entities?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we found that the error was actually in our code, so mapreduce works as expected (the mapper is called for every single datastore entity).
Our code was calling some Google services functions that were sometimes failing (with the wonderfully cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we had set a limit on taskqueue retries. MR did not detect or report this in any way; it was still showing "success" on the status page for all shards. That is why we thought that everything was fine with our code and that there was something wrong with the input reader.

EXCEEDED_ID_LIMIT: emptyRecycleBin id limit reached: 200

I'm just wondering if anyone else has seen this and if so, can you confirm that this is correct? The documentation claims, as you might expect, that 10,000 is the record limit for the system call:
Database.emptyRecycleBin(records);
not 200. Yet it's throwing an error at 200. The only thing I can think of is that this call occurs from within a batch Apex process.
It took a little over a week and me supplying a failing test case to salesforce support but the issue is now being reported as a salesforce known issue suggesting it may get addressed in the platform.
My workaround for now is to wrap the call in a Database.Batchable with the batch size set to 200.
This is the only reference I could find to there being a limit of 200 on emptyRecycleBin(), so I dare say you are correct:
http://www.salesforce.com/us/developer/docs/api/Content/sforce_api_calls_emptyrecyclebin.htm
Adam, if you got shut down when attempting to log a case regarding this due to the whole Premier Support thing, you should definitely escalate your case, as it was handled incorrectly and SFDC needs to know about it. I had the exact same issue myself.
SOQL for loops may be a helpful option for working around this limit, as the 'for (Account[] accounts : [SELECT Id FROM Account WHERE IsDeleted = true ALL ROWS])' format provides batches of 200.
