Wikipedia dump with all page titles and pageIDs - database

I'm trying to find a Wikipedia dump containing pageIds and titles. I don't want to request them at runtime or fetch 2000 per request; I want it ALL. I want to build one long list of all the pageIds and the titles belonging to them and put them into my own database, so that I can use it in an application that requests the data from my own database.
Anybody know which dumps contain that information? It doesn't matter if they also contain more information than what I need - I can just write an app that picks out the info I need.
I did try to request it ... it would have taken 140 days, and they put up a limit of 2700 requests, so it would take forever to get the whole thing. Instead I want to download a dump file, clean the data, and upload a file to my own database containing only the info I need.

OK, I found it myself after going through multiple dumps. In short, the answer is:
enwiki-latest-page.sql.gz
It contains pageIds and titles.
Entries look like this:
(1217768,0,'Black_River_(South_Carolina)','',0,0,0,0.6285160577990001,'20161001141146','20161001142916',738899573,1654,'wikitext')
The first number is the pageId and the third entry is the title. The second field is the namespace (0 for regular articles); the remaining fields are the other columns of MediaWiki's page table (restrictions, redirect/new flags, a random sort key, timestamps, latest revision ID, page length, content model), but they don't matter here :D Thanks to myself, I solved this issue and will close it :D Big pat on the back
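A minimal sketch of how the dump could be parsed (assuming the INSERT statements contain rows shaped like the sample above; the regex and function names are illustrative, not an official tool):

import gzip
import itertools
import re

# Matches the first three columns of each row: (page_id, page_namespace, 'page_title', ...
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',")

def iter_pages(path):
    # Yield (pageId, title) pairs for main-namespace pages
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                for page_id, namespace, title in ROW.findall(line):
                    if namespace == "0":  # namespace 0 = regular articles
                        yield int(page_id), title.replace("\\'", "'")

# Example: print the first five pairs
for page_id, title in itertools.islice(iter_pages("enwiki-latest-page.sql.gz"), 5):
    print(page_id, title)

From there, each (pageId, title) pair can be bulk-inserted into your own database instead of printed.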

Related

Image gallery - Express/MongoDB/React

Can someone help me with my issue? I just started learning Express.
I have a folder structure of images on the server (the screenshots showing the structure and how I currently store and fetch them on the front end are no longer available).
I would like to store them on the Express server and make a GET request from the front end to 'https:// ...... /gallery',
get a list of folder names as an array like [beruta16, borov, ..... etc], then iterate over the array and make a new request for each element.
I think I can return the list of folder names as JSON with app.use(URL, (req, res) => res.json(array)), and also create a list of items (1, 2, 3 ... jpeg) for another request and send those, but I don't know how to handle creating all of that.
Another way: fs.readdir(), but I'm not sure if this is the right approach, and whenever a new file is added I would have to do it again.
Maybe the best way is to store all the URL links in MongoDB, but then I'd have to handle the DB as well.
Sorry, I couldn't find similar information on this.

Can't find a referenced R File on the SEC Website?

I am attempting to understand the 2020q1 data set found here: https://www.sec.gov/dera/data/financial-statement-data-sets.html, and am using the reference documentation inside the 2020q1 folder as a "readme" file. The reference documentation specifies that within the Presentation (pre) data set, the "report" field is a numeric (integer) whose value "refers to the 'R file' as posted on the EDGAR Web site." I have found no such file after extensive searching, and am left with no way of interpreting the "report" field and all its associated data. Please link to the appropriate R file or point me in the right direction. Thanks!
A point of clarification upfront, because this confused me as well: the "R file" in question is not a script file in the R language. Instead, it simply seems to be a report file that holds the formatted data.
After digging deeper into the readme, I found the following detail in the description of the SUB.txt data:
Note: To access the complete submission files for a given filing, please see the SEC EDGAR website. The SEC website folder HTTP(s)://www.sec.gov/Archives/edgar/data/{cik}/{accession}/ will always contain all the data sets for a given submission. To assemble the folder address to any filing referenced in the SUB data set, simply substitute {cik} with the cik field and replace {accession} with the adsh field (after removing the dash character). The following sample SQL Query provides an example of how to generate a list of addresses for filings contained in the SUB data set:
select name, form, period,
       'http(s)://www.sec.gov/Archives/edgar/data/' + ltrim(str(cik,10)) + '/' + replace(adsh,'-','') + '/' + instance as url
from SUBM subm
order by period desc, name
Therefore, it looks like we have to correlate each "adsh" submission ID with the "cik" company ID in order to get the link we are looking for.
Doing this for the first entry of pre.txt, we get an adsh value of "0001032208-20-000006". I simply searched through sub.txt with Notepad and found its associated cik of "1032208", which belongs to "SEMPRA ENERGY". From those we generate the following link: http://www.sec.gov/Archives/edgar/data/1032208/000103220820000006
There we find a directory of files associated with the given submission, including a collection of files prefixed with "R". Simply clicking on one opens it in the browser; using the "report" and "line" fields, we can then work out which file we want. Note that we can append "/R{number}.htm" to the link we generated to open a given report number directly.
If you know what you are looking for, doing this by hand with "ctrl+f" should be fine. Otherwise, you may want to open these files in Excel to generate the links for you.
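A small sketch of the same join in code (assuming sub.txt and pre.txt are the tab-separated files from the quarterly archive, using the column names from the readme):

import csv

BASE = "https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/R{report}.htm"

def report_urls(sub_path="sub.txt", pre_path="pre.txt"):
    # Map each submission's adsh to its cik
    with open(sub_path, newline="", encoding="utf-8") as f:
        cik_by_adsh = {row["adsh"]: row["cik"] for row in csv.DictReader(f, delimiter="\t")}
    # For every presentation row, build the R-file link described above
    with open(pre_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield BASE.format(cik=cik_by_adsh[row["adsh"]],
                              accession=row["adsh"].replace("-", ""),
                              report=row["report"])

for url in report_urls():
    print(url)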

Where can I find basic information on the format for uploading questions?

Situation:
I want to train and do a simple configuration of the Retrieve and Rank service.
I have uploaded some PDFs and now I want to upload some questions.
In the documentation I cannot find a simple description of how the CSV file must be structured, or of which fields are required and which are optional.
Something like: "[YOUR QUESTION (REQUIRED)]", [DOCUMENT ID (REQUIRED)], [RANKING (OPTIONAL)]
The document ID can be found in xyz in section xyz.
I cannot find this kind of help in the documentation:
https://www.ibm.com/watson/developercloud/doc/retrieve-rank/training_data.shtml#script
Impact:
There is no way to get "real" documentation of the configuration outside the tutorial.
Possible solution:
Provide additional documentation.
Maybe I was just not able to find it and someone can point me to the right place?
OK, I found the solution myself by trial and error. The following steps work for me:
1) You need a plain text file, and the extension should be *.txt
2) Inside the file, write your questions like this:
What is the best place to be?
Why should I travel to the USA?
-> Don't write them like this:
"What is the best place to be?"
For me the help was misleading, because it talks about CSV files.
Also take a look at the comment from #dalelane; he is right, and he highlights the entry text for the upload of the file.

Freebase: What data dump file contains the "imdb_id"?

I run IMDbAPI.com and have been using Bing's Search API to find IMDb IDs from title searches. Bing is currently moving its API over to the Azure Marketplace (August 1st), where it is no longer available for free. I started testing my API using Freebase to resolve these IDs and hit their 100k limit in the first 8 hours (my site currently gets about 3 million requests a day, but only 200-300k are title searches).
This is exactly why they offer the data dump files.
I downloaded most of the files in the Film folder but cannot find where they store the "/authority/imdb/title" imdb id namespace data.
https://www.googleapis.com/freebase/v1/mqlread?query={"type":"/film/film","name":"True%20Grit","imdb_id":null,"initial_release_date>=":"1969-01","limit":1}
This is how I'm currently accessing the ID.
Does anyone know which file contains this information, and how to link back to it from the film title/id?
That imdb_id property is backed by a key in the /authority/imdb/title namespace, so you're looking for the line:
/m/015gxt /type/object/key /authority/imdb/title tt0065126
in the file http://download.freebase.com/datadumps/latest/freebase-datadump-quadruples.tsv.bz2
That's a 4 GB file, so be prepared to wait a little while for the download. Note that everything is keyed by MID, so you'll need to figure that out first if you don't have it in your database.
The equivalent query using MQL instead of the data dumps is https://www.googleapis.com/freebase/v1/mqlread?query=%7B%22type%22%3a%22/film/film%22,%22name%22%3a%22True%20Grit%22,%22imdb_id%22%3anull,%22initial_release_date%3E=%22%3a%221969-01%22,%22mid%22:null,%22key%22:[{%22namespace%22:%22/authority/imdb/title%22}],%22limit%22:1%7D&indent=1
EDIT: p.s. I'm pretty sure the files in the Browse directory are going away, so I wouldn't depend on them even if you could find the info there.
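If you want to pull every IMDb mapping out of the dump in one pass, a rough sketch (assuming the four-column tab-separated layout shown above) could look like this:

import bz2

def imdb_ids(path="freebase-datadump-quadruples.tsv.bz2"):
    # Yield (mid, imdb_id) pairs, e.g. ('/m/015gxt', 'tt0065126')
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if (len(fields) == 4
                    and fields[1] == "/type/object/key"
                    and fields[2] == "/authority/imdb/title"):
                yield fields[0], fields[3]

You could load those pairs into a table keyed by MID and join them against the film names you already have.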
The previous answer works fine, it's just that a snappier version of such a query could be:
query = [{
    'type': '/film/film',
    'name': 'prometheus',
    'imdb_id': null,
    ...
}];
The rest of the MQL request isn't mentioned, as it doesn't differ from the aforementioned. Hope that helps.

Receive a multi-file POST with Google App Engine

I want to receive a multi-file POST from an image uploader (I use this).
Most examples show how to receive one image from a POST.
I tried many ways but never got the results I wanted.
For example,
self.request.POST['Filename']
gives only the first filename.
What should I do when there are multiple files/images in the POST?
The reason for this is to resize images before upload, as they are too big for Google App Engine to upload.
EDIT:
self.request.POST.multi.__dict__
shows
{'_items':
[('Filename', 'camila1.jpg'),
('Filedata[]', FieldStorage('Filedata[]', 'camila1.jpg')),
('Upload', 'Submit Query\r\n--negpwjpcenudkacqrxpleuuubfqqftwm----negpwjpcenudkacqrxpleuuubfqqftwm\r\nContent-Disposition: form-data; name="Filename"\r\n\r\nbornToBeWild1.jpg'),
('Filedata[]', FieldStorage('Filedata[]', 'bornToBeWild1.jpg')),
('Upload', 'Submit Query')]}
Your Flash uploader is designed to work with PHP and sends multiple Filedata[] fields (PHP interprets these as an array for easy access),
so you need to iterate and get them all:
def post(self):
    # Each 'Filedata[]' entry is a FieldStorage holding one uploaded file
    for file_data in self.request.POST.getall('Filedata[]'):
        logging.info(file_data.filename)
The file data itself is in file_data.value.
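Since the goal is to shrink oversized images, a minimal sketch of the same loop using App Engine's images API (assuming a webapp2-style handler; the 800px bound is an arbitrary example, and the storage step is left open):

import logging
import webapp2
from google.appengine.api import images

class UploadHandler(webapp2.RequestHandler):
    def post(self):
        for file_data in self.request.POST.getall('Filedata[]'):
            data = file_data.value  # raw bytes of the uploaded image
            # Scale anything larger than 800x800 down, preserving aspect ratio
            resized = images.resize(data, width=800, height=800)
            logging.info('resized %s to %d bytes', file_data.filename, len(resized))
            # ... store `resized` in the datastore or blobstore as needed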
Are you using the Django libraries available to you? If so, check this out.
Call self.request.POST.getall('Filedata[]') to get a list of FieldStorage objects; each one contains one file. You can access the file data with .value, the filename with .filename, and the mimetype with .type.
I have no idea how that multi-uploader works; I made one in the past, however, and I just added a number to the end of the input field name, then hid it, then added a new file input field for the next file. The reason for this is that browsers don't let you play around with file input fields too much, because you could make them upload files the user didn't intend to upload.
Using my convention, the 2 files in your example would be "Filename0" and "Filename1". You could also use Firebug to see what the uploader renames the input file fields to.
Edit: I had a look, and it's using Flash, so I have no idea how it works.
