Freebase: What data dump file contains the "imdb_id"? - database

I run IMDbAPI.com and have been using Bing's Search API to find IMDb IDs from title searches. Bing is moving its API over to the Azure Marketplace (August 1st), after which it will no longer be available for free. I started testing my API using Freebase to resolve these IDs and hit their 100k limit in the first 8 hours (my site currently gets about 3 million requests a day, but only 200-300k are title searches).
This is exactly why they offer the data dump files. I downloaded most of the files in the Film folder but cannot find where they store the "/authority/imdb/title" IMDb ID namespace data.
This is how I'm currently accessing the ID:
https://www.googleapis.com/freebase/v1/mqlread?query={"type":"/film/film","name":"True%20Grit","imdb_id":null,"initial_release_date>=":"1969-01","limit":1}
Does anyone know which file contains this information, and how to link back to it from the film title/id?

That imdb_id property is backed by a key in the /authority/imdb/title namespace, so you're looking for the line:
/m/015gxt /type/object/key /authority/imdb/title tt0065126
in the file http://download.freebase.com/datadumps/latest/freebase-datadump-quadruples.tsv.bz2
That's a 4 GB file, so be prepared to wait a little while for the download. Note that everything is keyed by MID, so you'll need to figure that out first if you don't have it in your database.
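A minimal sketch of pulling those keys out of the dump, assuming each line holds four tab-separated fields in the order shown above (MID, property, namespace, value). The function names and the streaming approach are my own, not something Freebase provides:

```python
import bz2

IMDB_NAMESPACE = "/authority/imdb/title"


def iter_imdb_keys(lines):
    """Yield (mid, imdb_id) pairs from quadruple-dump lines.

    Assumes four tab-separated fields per line; other lines are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        mid, prop, namespace, value = fields
        if prop == "/type/object/key" and namespace == IMDB_NAMESPACE:
            yield mid, value


def imdb_keys_from_dump(path):
    """Stream the bz2-compressed dump without unpacking it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from iter_imdb_keys(handle)
```

Loading the resulting pairs into a table keyed by MID would let you join them against whatever film records you already hold.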
The equivalent query using MQL instead of the data dumps is https://www.googleapis.com/freebase/v1/mqlread?query=%7B%22type%22%3a%22/film/film%22,%22name%22%3a%22True%20Grit%22,%22imdb_id%22%3anull,%22initial_release_date%3E=%22%3a%221969-01%22,%22mid%22:null,%22key%22:[{%22namespace%22:%22/authority/imdb/title%22}],%22limit%22:1%7D&indent=1
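If you would rather not hand-encode that URL, a sketch along these lines builds the same query with the standard library. The function name and its arguments are hypothetical; the query keys are the ones from the URL above:

```python
import json
from urllib.parse import urlencode

MQLREAD_URL = "https://www.googleapis.com/freebase/v1/mqlread"


def build_mqlread_url(title, released_after):
    """Build an mqlread URL asking for the IMDb ID, MID and IMDb key of a film."""
    query = {
        "type": "/film/film",
        "name": title,
        "imdb_id": None,  # None is serialised as JSON null, meaning "fill this in"
        "initial_release_date>=": released_after,
        "mid": None,
        "key": [{"namespace": "/authority/imdb/title"}],
        "limit": 1,
    }
    # urlencode takes care of percent-encoding the JSON text.
    return MQLREAD_URL + "?" + urlencode({"query": json.dumps(query), "indent": 1})
```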
EDIT: p.s. I'm pretty sure the files in the Browse directory are going away, so I wouldn't depend on them even if you could find the info there.

The previous answer works fine; a snappier version of such a query could be:
query = [{
'type': '/film/film',
'name': 'prometheus',
'imdb_id': null,
...
}];
The rest of the MQL request isn't mentioned, as it doesn't differ from the one above. Hope that helps.

Related

Solr - Bringing back snippets from indexed data

I have a Solr/Lucene set up where I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However I would like to return a snippet from within the content of the document which shows where the matching line (+/- 5 words from the match term) is. I have tried to follow a range of Google hits but my indexing does not seem to have a direct access to the "content".
Can anyone give me some basic and simple pointers to where I might have made any errors on this - I have based all my work so far on the guidance and examples of the Solr Reference Guide - so I am not sure if the issue is in the search parameters or the original index.
I am doing this to create a clear set of user requirements for building an end solution rather than creating the end solution myself, so I am no expert on the tools and do not need to become one, just need to evidence what is possible with this tool set.
As MatsLindh noted above, the issue was that the config was not mapping the actual content of the Tika parse into a specific field, so there was no full text content to display and highlight.
To resolve this I followed the link (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) to the guidance documents and reviewed the part on fmap and used the example given for Last Modified Date as a guide on what to apply.
I then went to my solrconfig.xml file in the relevant core folder and added in the following line in the code beneath an already present fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field under the Solr web interface in my core. I then re-ran my indexing line via a command prompt and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis.
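For anyone wanting to reproduce the snippet request, here is a rough sketch of the query URL, assuming the testcontent field is stored. The base URL, core name and function name are placeholders of mine; hl, hl.fl, hl.snippets and hl.fragsize are the standard Solr highlighting parameters:

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, core, term, field="testcontent"):
    """Build a /select URL that asks Solr for highlighted snippets of `field`."""
    params = {
        "q": f"{field}:{term}",
        "hl": "true",        # switch highlighting on
        "hl.fl": field,      # field to take snippets from
        "hl.snippets": 3,    # maximum snippets per document
        "hl.fragsize": 100,  # approximate snippet length in characters
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"
```

The snippets come back in the separate highlighting section of the response rather than inside each document.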
Thanks to all for the input on this - there is still a lot more I want to test to help develop a clear requirement set, but this really helps prove some of the basics are not complicated.

Error 400 when Creating Custom Classifier in Watson Visual Recognition

I am currently having trouble creating my own classifiers. I have tried building a NodeJS application and also using IBM's demo, but every time I submit my ZIP folders I receive the following error message:
Cannot execute learning task. : need at least 2 _positive_examples fields, (or 1 _positive_examples and 1 negative_examples field) to train a classifier. null specified.
However, when I tried the IBM Demo webapp with the .zip files they provide (husky, beagle and cats.zip), the classifier was created successfully.
I currently have 2 zips (1 positive and 1 negative), each containing 50 files named from 1.jpg to 50.jpg.
Have any of you guys ever gone through this issue and found a way to handle it?
Thanks for the attention.
Best regards,
Enrico Bergamo
As per the discussion on the DeveloperWorks Forum ( https://developer.ibm.com/answers/questions/377690/error-400-when-trying-to-create-a-custom-classifie/ ), your zip files are not really zips. It looks like all you have done is name your folder positive.zip.
Right click on each of the folders in turn and select create zip or compress as zip.
What this message indicates is that the service is not receiving any POST fields that end in '_positive_examples' which is necessary for a training request. So double check your form field name parameters.
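If you prefer to script the compression step, a minimal sketch with Python's zipfile module is below. The function name and paths are hypothetical, and it assumes the images sit directly inside the folder:

```python
import zipfile
from pathlib import Path


def zip_folder(folder, zip_path):
    """Compress every file directly inside `folder` into a real zip archive.

    Returns True if the result is recognised as a zip file.
    Keep `zip_path` outside `folder` so the archive does not include itself.
    """
    folder = Path(folder)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for image in sorted(folder.iterdir()):
            if image.is_file():
                # Store only the file name, not the full directory path.
                archive.write(image, arcname=image.name)
    return zipfile.is_zipfile(zip_path)
```

Running zipfile.is_zipfile on the files you are currently uploading is also a quick way to confirm whether they are genuine archives or just renamed folders.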

wikipedia dump all page titles and pageIDs

I'm trying to find a Wikipedia dump containing page IDs and titles. I don't want to request them at runtime or 2000 per request; I want it ALL. I want to make a long list of all the page IDs and the titles belonging to them and put them into my own database, so that I can use it in an application that requests the data from my own database.
Does anybody know which dumps contain that information? It doesn't matter if they also contain more information than what I need - I can just write an app that picks out the info I need.
I did try to request it ... it would have taken 140 days, and they put up some limit of 2700 requests ... so it would take forever to get the whole thing. Instead I want to download a file dump, clean the data and upload a file to my own database containing only the info I need.
Ok found it myself after getting multiple dumps, in short the answer is:
enwiki-latest-page.sql.gz
It contains pageids and Titles.
Entries look like this:
(1217768,0,'Black_River_(South_Carolina)','',0,0,0,0.6285160577990001,'20161001141146','20161001142916',738899573,1654,'wikitext')
First number is pageId. Third entry is title.
The rest I don't know what is - but no matter :D Thanks to myself I solved this issue and will close it :D Big pat on the back
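A rough sketch of picking the IDs and titles out of such rows, assuming every row starts with the page ID, a numeric second field (the namespace, going by the MediaWiki page table) and a single-quoted title as in the entry above. The regex and function name are my own:

```python
import re

# Matches the start of a row: (page_id,namespace,'title'
# Backslash-escaped characters inside the title are kept as-is.
ROW_PATTERN = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")


def iter_pages(sql_text):
    """Yield (page_id, namespace, title) for each row found in `sql_text`."""
    for match in ROW_PATTERN.finditer(sql_text):
        page_id, namespace, title = match.groups()
        yield int(page_id), int(namespace), title
```

For the real file you would feed it one line at a time from gzip.open(path, "rt", encoding="utf-8", errors="replace"), since the INSERT statements are far too large to hold in memory all at once.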

Putting large number of documents: Google App Engine and Search API

This is a question about the limit on putting a large number of documents into the Search API. I intend to put 2057 documents (paragraphs from a text file). When I parse each paragraph from the text file, create a document for each paragraph and put it into the index, the app seems to run forever and does not respond at all. What can be the reason for such behavior?
With regards
I researched the documentation and found the following method:
put(java.lang.Iterable<Document> documents)
My way of importing is like this:
1. I collect the documents to be put into the index in a collector (ArrayList, List) until it holds 200 documents (the limit imposed by GAE).
2. I pass this collector to the method above.
In my case, this made putting the documents about 100 times faster.
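The two steps above can be sketched roughly as follows. This is written in Python for illustration only; put_batch is a hypothetical stand-in for whatever call actually writes one batch to the index, and 200 is the limit mentioned above:

```python
def batched(documents, batch_size=200):
    """Yield lists of at most `batch_size` documents, preserving order."""
    batch = []
    for document in documents:
        batch.append(document)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch.
        yield batch


def put_in_batches(documents, put_batch, batch_size=200):
    """Call `put_batch` once per batch and return the number of calls made."""
    calls = 0
    for batch in batched(documents, batch_size):
        put_batch(batch)
        calls += 1
    return calls
```

For 2057 documents this makes 11 calls instead of 2057, which is presumably where the speed-up comes from.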

Difficulty with filename and filemime when using Migrate module

I am using the Drupal 7 Migrate module to create a series of nodes from JPG and EPS files. I can get them to import just fine. But I notice that when I am done importing them if I look at the nodes it creates, none of the attached filefield and thumbnail files contain filename information.
Upon inspecting the file_managed table I see that both the filename and filemime fields are empty for ONLY the files that I attached via the migrate module. This also creates an issue with downloading the files.
Now I think the problem has to do with the fact that I am using "file_link" instead of "file_copy" as the file operation I specify. The problem is that I am importing around 2TB (that's terabytes) of image files. We had to put in a special request with Rackspace just to get access to that much disk space on our server, so I can't go around copying from one directory to the next because of space issues. "file_link" therefore seems like the obvious choice.
Now you probably want to see how I am doing this exactly, so here is the code snippet:
$jpg_arguments = MigrateFileFieldHandler::arguments(NULL,
'file_link', FILE_EXISTS_RENAME, 'en', array('source_field' => 'jpg_name'),
array('source_field' => 'jpg_filename'), array('source_field' => 'jpg_filename'));
$this->addFieldMapping('field_image', 'jpg_uri')
->arguments($jpg_arguments);
As you can see I am specifying no base path (just like the beer.inc example file does). I have set file_link, the language, and the source fields for the description, title, and alt.
It is able to generate thumbnails from the JPGs. But still missing those columns of data in the db table. I traced through the functions the best I could but I don't see what is causing this. I tried running the uri in the table through the functions that generate the filename and the filemime and they output just fine. It is like something is removing just those segments of data.
Does anyone have any idea what this could be? I am using the Drupal 7 Migrate module version 2.2. It is running on Drupal 7.8.
Thanks,
Patrick
Ok, so I have found the answer to yet another question of mine. This is actually an issue with the migrate module itself. The issue is documented here. I will be repealing this bounty (as soon as I figure out how).
