How to get a dataset of 10,000 static HTML pages from Wikipedia

I am working on a classification algorithm. For that I need a dataset containing about 10,000 static HTML pages from Wikimedia, something like
page-title-1.html .... page-title-10000.html
I tried Google and found that my best option was to download the dump from http://dumps.wikimedia.org/other/static_html_dumps/2008-06/en/.
However, I do not know how to use it to get what I want.
The dump directory contains the following files:
html.lst 2008-Jun-19 17:25:05 692.2M application/octet-stream
images.lst 2008-Jun-19 18:02:09 307.4M application/octet-stream
skins.lst 2008-Jun-19 17:25:06 6.0K application/octet-stream
wikipedia-en-html.tar.7z 2008-Jun-21 16:44:22 14.3G application/x-7z-compressed
I want to know what to do with the *.lst files and what is inside wikipedia-en-html.tar.7z.

You might want to read the section "Static HTML tree dumps for mirroring or CD distribution" of Database download on Wikipedia (and in fact that whole page, which points you to 7zip for unpacking the main archive).
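For reference, here is a minimal sketch of how you might unpack the archive and pull out 10,000 pages on a Unix-like system, assuming p7zip and GNU tar are installed; the .lst files appear to be plain-text listings that you can simply inspect with head or less:
# extract the 7z wrapper, then the tar archive inside it (this needs a lot of disk space)
7z x wikipedia-en-html.tar.7z
tar -xf wikipedia-en-html.tar
# peek at one of the listing files
head html.lst
# copy the first 10,000 HTML pages into a flat dataset directory outside the extracted tree
mkdir -p ../html-dataset
find . -name '*.html' | head -n 10000 | while read -r f; do cp "$f" ../html-dataset/; done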

Related

Solr extract text from image and imagePdf files

I am working with Solr 6.5.1 and I want to extract text from image files and image-based PDF files. For this I installed Tesseract OCR and configured it with Solr in two ways:
1. I set the environment variable TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR and used the /update/extract request handler to index an image with its content.
2. I modified the TesseractOCRConfig.properties file inside the tika-parsers-1.13 jar in the Solr lib directory to tesseractPath=C:/Program Files (x86)/Tesseract-OCR and used the /update/extract request handler to index the image/image PDF with its content.
In both cases I get no content; the response only contains attr_x_parsed_by=org.apache.tika.parser.ocr.TesseractOCRParser.
Is there any other configuration I need to set for Solr and Tesseract OCR to extract content from image/image-PDF files?
Thanks in advance.
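For context, this is a minimal sketch of the kind of /update/extract request being described, assuming a core named mycore on the default port 8983; the document id, field mappings and file name are only illustrative:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=img1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "file=@scanned-page.png"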

Copy static files from one Jekyll post to published site

I would like to copy any static file (image, PDF, etc.) found in a post folder inside _posts to the folder in which the HTML version of the post will be, inside _site.
Let's say I have this structure:
_posts/
  2016/
    06/
      09-so-long-cloudflare/
        2016-06-09-so-long-cloudflare-and-thanks-for-all-the-fissh.md
        cloudflare-logo.png
        performance-report-sample.pdf
My Jekyll settings for permalinks are:
# Permalinks
permalink: /:year/:month/:day/:title/
I would like to generate the site like this:
2016/
  06/
    09/
      so-long-cloudflare-and-thanks-for-all-the-fissh/
        index.html
        cloudflare-logo.png
        performance-report-sample.pdf
I've found this plugin that should do this, but I can't make it work. I get this error:
jekyll 3.1.6 | Error: undefined method `name' for #<Jekyll::Document:0x007fb7a0892b50>
Any idea?
Thanks!
Well, despite having no Ruby knowledge, I managed to build a plugin out of this old Gist! \o/
https://nhoizey.github.io/jekyll_post_files/
I hope this will help people with the same needs.
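For anyone who wants to try it, here is a rough sketch of how a single-file Jekyll plugin like this is usually installed; the plugin file name and path below are placeholders, so grab the actual file from the link above:
# copy the plugin's .rb file into the site's _plugins directory and rebuild
mkdir -p _plugins
cp /path/to/jekyll_post_files.rb _plugins/
jekyll build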

Loading .owl files in marklogic

Is it possible to load .owl files using mlcp?
I tried with -input_file_type rdf but it gives the error below:
bin/mlcp.sh import -host localhost -port 9010 -username uname
-password pwd -mode local -input_file_path /home/user/semantics/data -input_file_type rdf -input_file_pattern '.*.owl'
FATAL contentpump.RDFReader: dbpedia1.owl: Element or attribute do not match QName production: QName::=(NCName':')?NCName.
FATAL contentpump.RDFReader: dbpedia2.owl: Element or attribute do not match QName production: QName::=(NCName':')?NCName.
What am I missing here?
MarkLogic documentation lists the supported triples file formats:
.rdf
.ttl
.json
.n3
.nt
.nq
.trig
Maybe you could convert your .owl file to one of those formats, at which point you could use MLCP to load it. I tried plugging your example into a format converter, but that didn't work; perhaps that's because we only have a snippet here.
MarkLogic should be able to process .owl files, but I think Joshua is right that MarkLogic expects .owl files to contain RDF/XML. You can also see that from the list of mimetypes in the Admin interface: it lists the .owl extension as 'application/owl+xml', and RDF/XML is the more common serialization of OWL.
It might just be that if you rename the file to .nt, it works.
HTH!
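A minimal sketch of that rename approach, reusing the paths and options from the question and assuming the files actually contain N-Triples:
# give the files an extension that matches their actual serialization
for f in /home/user/semantics/data/*.owl; do cp "$f" "${f%.owl}.nt"; done
# re-run the import against the renamed files
bin/mlcp.sh import -host localhost -port 9010 -username uname -password pwd \
  -mode local -input_file_path /home/user/semantics/data \
  -input_file_type rdf -input_file_pattern '.*\.nt'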

Solr 4: disable compression on stored fields: how to actually configure custom codec?

The short question is: I want to disable stored-field compression on a Solr 4.3.0 index. After reading:
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
http://wiki.apache.org/solr/SimpleTextCodecExample
http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/
I've decided to follow the path described there and make my own codec. I'm pretty sure I've followed all the steps; however, when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
From the output I gather that Solr is not picking up the jar with my custom codec, and I don't get why.
Here are all the horrific details:
I've created a class like this:
public class UncompressedStorageCodec extends FilterCodec {
    // use the old, uncompressed Lucene40 stored fields format
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    protected UncompressedStorageCodec() {
        // delegate everything else to the default Lucene42 codec
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}
in package: "fr.company.project.solr.transformers.utils"
The fully qualified name of "FilterCodec" is "org.apache.lucene.codecs.FilterCodec".
I've created a basic jar file out of this (exported it as jar from Eclipse).
The Solr installation I'm using to test this is the stock Solr 4.3.0 unzipped, started via its embedded Jetty server and using the example core.
I've placed my jar with the codec in [solrDir]\dist
In:
[solrDir]\example\solr\myCore\conf\solrconfig.xml
I've added the line:
<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />
Then in the schema.xml file, I've declared some fieldTypes that should use this codec like so:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>
Now, if I use the DataImportHandler component to import some data into Solr, at commit time it tells me:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
What I find strange is that the above-mentioned codec jar also contains some Transformers for the DataImportHandler component, and those are picked up fine. Also, other jars placed in the dist folder (and declared in the same way in solrconfig.xml), like the JDBC driver, are picked up fine. I'm guessing that for the codec there's this SPI mechanism which loads things differently, and there's something it's missing...
I've also tried placing the codec jar in:
[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\
as well as inside the WEB-INF\lib folder of the solr.war file, which is found in:
[solrDir]\example\webapps\
but I'm still getting the same error.
So basically, my question is, what's missing so that my codec jar is picked up by Solr?
Thanks
I'm going to answer this question myself, since it has sort of become moot due to some benchmarks I've made. Long story short: I had arrived at the (wrong) conclusion that for really large stored fields, Solr 3.x and 4.0 (without field compression) is faster than Solr 4.1 and above (with field compression). However, that was mostly due to some errors in my benchmarks. After repeating them, I've obtained results showing that going from non-compressed to compressed fields, even for very large stored fields, makes indexing between 0% and 15% slower, which is really not bad at all, considering that afterwards queries on the compressed-field indexes are 10-20% faster (for the document-fetching part).
Also, here's some remarks on how to speed up indexing:
Use the DataImportHandler plugin. It bypasses the Solr REST (HTTP-based) API and writes directly to the Lucene index.
Check out said plugin's sources to see how it accomplishes this, and write your own plugin if the DataImportHandler doesn't meet your needs.
If for whatever reason you want to stick to the Solr REST API, use ConcurrentUpdateSolrServer and play around with the queue size and number of threads parameters. It will normally be a lot faster (up to 200% in my case) than the basic HttpSolrServer.
Don't forget to enable the javabin data serialization like this:
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());
I'm explicitly showing the code because I believe there might be a small bug here:
If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already sets the request writer to binary:
// the ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
public HttpSolrServer(String baseURL, HttpClient client) {
    this(baseURL, client, new BinaryResponseParser());
}
However, after debugging I've noticed that if you don't explicitly call setRequestWriter with the binary writer argument, it will still use the XML serializer.
Going from XML to binary serialization reduces the size of my documents by about a factor of 3 as they are being sent to the server, which makes my indexing for this case about 150-200% faster.
I have recently tried, and succeeded, to get something very similar to work. The only difference is that I want to enable the best compression instead of no compression, as Solr defaults to the fastest compression. I also got the "SPI class [...] does not exist" error at some point, and here is what I have found out from various articles, including the ones you have linked to.
Lucene uses SPI to find the codec classes to load. Lucene requires the list of codec classes to be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file: when you create your JAR file "myJarWithCodec-1.10.1.jar", make sure that it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one fully qualified class name per line, like this:
org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
And in solrconfig.xml, replace:
<codecFactory class="solr.SchemaCodecFactory" />
with:
<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />
You might also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml if Solr complains. I think this particular parameter is for specifying the postings format, not the codec. Hope it helps.
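As a quick sanity check (a sketch, assuming the jar name from the question), you can verify that the services entry actually made it into the jar and that it lists your codec class:
# list the jar and confirm the services file is present
jar tf myJarWithCodec-1.10.1.jar | grep META-INF/services
# print the services file; it should include fr.company.project.solr.transformers.utils.UncompressedStorageCodec
unzip -p myJarWithCodec-1.10.1.jar META-INF/services/org.apache.lucene.codecs.Codec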

Using Fat-Free PHP for Backbone.js routing with external Model.php files

I'm very new to Fat-Free and Backbone.js. I've been searching and reading articles trying to find a way to route to individual PHP files containing the database communications. The code below works, and I can use it, but it seems hackish. Is there a way to call an external PHP file (in the server/models/ directory) and a specific method from the $f3->route(...) line?
<?php
// File: /index.php
define("PATH",1);
$f3 = require('server/fatfree/lib/base.php');
$uri = explode('/', $_SERVER["REQUEST_URI"]);
require_once "server/models/{$uri[PATH]}.php";
$f3->route('GET /hello/#file', 'HelloModel->doSomething');
$f3->route('GET /project/#file', 'ProjectModel->doSomething');
$f3->route('GET /book/#file', 'BookModel->doSomething');
$f3->run();
?>
Thanks a lot for your advice.
You should add the server/models directory to F3's autoloader:
$f3->set('AUTOLOAD','server/models/');
That way, the required source files of your classes will be loaded on demand. However, note that each file must be named after the class it defines, i.e. class Foo has to be defined in foo.php or Foo.php; the case of the filename does not matter.
