Indexing Zip Files with Apache Solr

I am trying to index zip files via Apache Solr.
My Zip files only contain one CSV file.
My CSV-Files look like this:
"N_NATIONKEY","N_NAME","N_REGIONKEY","N_COMMENT"
0,"ALGERIA ",0,"04.07.11"
1,"ARGENTINA ",1,"04.07.11"
2,"BRAZIL ",1,"04.07.11"
…
I was already able to index the zip file, with the following result:
post http://localhost:8983/solr/first/update/extract?literal.id=zip2&commit=true&captureAttr=true&uprefix=attr_&fmap.content=attr_content
"ignored_":["stream_size",
"461",
"X-Parsed-By",
"org.apache.tika.parser.DefaultParser",
"X-Parsed-By",
"org.apache.tika.parser.pkg.PackageParser",
"stream_content_type",
"text/plain",
"Content-Type",
"application/zip"],
"div":["embedded",
"NATION.csv",
"package-entry"],
"id":"zip2",
"stream_size":[461],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pkg.PackageParser"],
"stream_content_type":["text/plain"],
"content_type":["application/zip"],
"attr_content":[" \n \n \n \n \n \n \n \n \n \n NATION.csv \n \"N_NATIONKEY\",\"N_NAME\",\"N_REGIONKEY\",\"N_COMMENT\"\r\n0,\"ALGERIA \",0,\"04.07.11\"\r\n1,\"ARGENTINA \",1,\"04.07.11\"\r\n2,\"BRAZIL \",1,\"04.07.11\"\r\n3,\"CANADA \",1,\"04.07.11\"\r\n4,\"EGYPT \",4,\"04.07.11\"\r\n5,\"ETHIOPIA \",0,\"04.07.11\"\r\n6,\"FRANCE \",3,\"04.07.11\"\r\n7,\"GERMANY \",3,\"04.07.11\"\r\n8,\"INDIA \",2,\"04.07.11\"\r\n9,\"INDONESIA \",2,\"1\"\r\n10,\"IRAN \",4,\"04.07.11\"\r\n11,\"IRAQ \",4,\"04.07.11\"\r\n12,\"JAPAN \",2,\"04.07.11\"\r\n13,\"JORDAN \",4,\"04.07.11\"\r\n14,\"KENYA \",0,\"04.07.11\"\r\n15,\"MOROCCO \",0,\"04.07.11\"\r\n16,\"MOZAMBIQUE \",0,\"1\"\r\n17,\"PERU \",1,\"04.07.11\"\r\n18,\"CHINA \",2,\"04.07.11\"\r\n19,\"ROMANIA \",3,\"1\"\r\n20,\"SAUDI ARABIA \",4,\"04.07.11\"\r\n21,\"VIETNAM \",2,\"1\"\r\n22,\"RUSSIA \",3,\"04.07.11\"\r\n23,\"UNITED KINGDOM \",3,\"04.07.11\"\r\n24,\"UNITED STATES \",1,\"04.07.11\"\r\n \n\n \n "],
"_version_":1615098997961129984}]
What I want is this:
"N_NATIONKEY":0,
"N_NAME":"ALGERIA ",
"N_REGIONKEY":0,
"N_COMMENT":"04.07.11",
"id":"84f3e0f3-8b13-47d8-818f-52504f79d91a",
"_version_":1615098850670804992
That way I am able to search on specific columns.
How can I index zipped files like this?
The documentation says it should be possible with Tika, but I don't really get it.

Something like this is being done for .gz files in the upcoming Solr 7.6, see SOLR-10981. That does not cover zip, though.
In general, you probably just want to unzip the file and stream its contents directly to Solr. The bin/post command can take file content from standard input; you just need to make sure the content type is correct. Check bin/post -h for details.
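If the goal is one Solr document per CSV row (the second output shown in the question), another option is to unpack the zip yourself and send the rows to Solr's JSON or CSV update handler instead of /update/extract. A minimal sketch under that assumption; the core name "first" and the field names come from the question, and the actual HTTP posting step is left to bin/post or curl:

```python
import csv
import io
import uuid
import zipfile

def zip_csv_to_docs(zip_path):
    """Return one Solr document (dict) per CSV row found in the zip archive."""
    docs = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with zf.open(name) as raw:
                reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
                for row in reader:
                    row["id"] = str(uuid.uuid4())  # one unique id per row
                    docs.append(row)
    return docs

# The resulting list can be serialized with json.dumps() and posted to
# /solr/first/update with Content-Type: application/json.
```

Note that every field arrives as a string this way; numeric typing is then up to the schema.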

Related

Solr Data Config Error, Open quote is expected for attribute "driver"

I have a postgres data-config file.
<dataConfig>
<dataSource driver=”org.postgresql.Driver” url=”jdbc:postgresql://127.0.0.1:5432/mydb” user=”user” password=”pw” />
...
</dataConfig>
But when I run it, it shows error
Data Config problem: Open quote is expected for attribute "driver" associated with an element type "dataSource".
What's the problem here? Is the driver information I put in wrong?
Your quotes are wrong.
” and " are not the same kind of quotes (see the different presentation). Only " is a valid double quote in an XML file (and in most/all programming contexts).
The examples in your config file seem to have been mangled by a blog or a text editor on the way.
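With straight ASCII quotes, the same line would be:

```xml
<dataSource driver="org.postgresql.Driver"
            url="jdbc:postgresql://127.0.0.1:5432/mydb"
            user="user" password="pw" />
```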

Solr extract text from image and imagePdf files

I am working with Solr 6.5.1 and want to extract text from image and image-PDF files. For this I installed Tesseract OCR and configured it with Solr in two ways:
1. I set the environment variable TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR and used the /update/extract request handler to index an image with its content.
2. I modified the TesseractOCRConfig.properties file inside the tika-parsers-1.13 jar in the Solr lib to tesseractPath=C:/Program Files (x86)/Tesseract-OCR and used the /update/extract request handler to index image/image-PDF files with their content.
Either way I get no content; the response contains only attr_x_parsed_by=org.apache.tika.parser.ocr.TesseractOCRParser.
Is there any other configuration I need to set for Solr or Tesseract OCR to extract content from image/image-PDF files?
Thanks in advance.

How to get a dataset of 10,000 static HTML pages from Wikipedia

I am working on a classification algorithm. In order to do that I need a dataset that contains about 10,000 static HTML pages from Wikimedia. Something like
page-title-1.html .... page-title-10000.html
I tried Google and found that my best option was to download it from http://dumps.wikimedia.org/other/static_html_dumps/2008-06/en/.
However, I do not know how to use it to get what I want.
There are some files as following
html.lst 2008-Jun-19 17:25:05 692.2M application/octet-stream
images.lst 2008-Jun-19 18:02:09 307.4M application/octet-stream
skins.lst 2008-Jun-19 17:25:06 6.0K application/octet-stream
wikipedia-en-html.tar.7z 2008-Jun-21 16:44:22 14.3G application/x-7z-compressed
I want to know what to do with the *.lst files and what is inside wikipedia-en-html.tar.7z.
You might want to read the section "Static HTML tree dumps for mirroring or CD distribution" of Database download on Wikipedia (and in fact that whole page, which points you to 7zip for unpacking the main archive).
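Once the 7z archive is unpacked (roughly: 7z x wikipedia-en-html.tar.7z, then tar -xf wikipedia-en-html.tar), selecting the first 10,000 pages is just a directory walk. A sketch, demonstrated on a tiny synthetic tree since the real dump is ~14 GB:

```python
import tempfile
from pathlib import Path

def sample_html_pages(root, limit=10000):
    """Return up to `limit` .html file paths under `root`, sorted for reproducibility."""
    return sorted(Path(root).rglob("*.html"))[:limit]

# Demo on a toy tree standing in for the unpacked dump:
root = tempfile.mkdtemp()
for i in range(3):
    Path(root, f"page-title-{i}.html").write_text("<html></html>")
print(len(sample_html_pages(root)))  # 3 files in this toy tree
```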

How to understand log files downloaded from App Engine?

What is the format of Google App Engine log files, as downloaded by appcfg.sh request_logs?
As far as I've been able to determine, the format of the log file is as follows:
CLIENT_IP_ADDRESS - USERNAME [DATE:TIME TIMEZONE] "METHOD URL HTTP/VERSION" RESPONSE_CODE ??? URL USER_AGENT
LOG_LEVEL:TIMESTAMP MESSAGE
:
LOG_LEVEL:TIMESTAMP MESSAGE
:
LOG_LEVEL:TIMESTAMP MESSAGE
:
Each of the indented lines is associated with the non-indented line above it - that's how you determine which request each of them relates to, I think. The solitary colons are used to separate one log message from the next.
The non-indented lines are in reverse chronological order, but the groups of indented lines below them are in chronological order.
A strange client IP address like 0.1.0.2 indicates a within-App-Engine post from your app to your app's task queue. Actually, though, you can also see that it is within-App-Engine by looking at the user agent for that request.
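The layout described above can be parsed with a small state machine: a non-indented line opens a new request record, indented lines attach log messages to it, and a lone colon merely separates messages. A rough sketch under those assumptions (the sample line below is made up, not real App Engine output):

```python
def parse_request_log(text):
    """Group indented app-log lines under the preceding request line."""
    requests = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == ":":
            continue  # blank lines and lone-colon separators carry no data
        if line[0] in " \t":
            if requests:  # indented: a log message for the last request seen
                requests[-1]["messages"].append(stripped)
        else:
            requests.append({"request": line, "messages": []})
    return requests

# Made-up sample in the layout described above:
sample = (
    '1.2.3.4 - user [01/Jan/2010:00:00:00 -0800] "GET / HTTP/1.1" 200 0 - agent\n'
    "  INFO:1262304000.0 starting up\n"
    "  :\n"
    "  INFO:1262304000.1 ready\n"
)
```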

libxml2 SAX query

I am trying to parse an XML file using the SAX interface of libxml2 in C.
My problem is that whitespace characters between the end of one tag and the start of the next cause the
"characters" callback to be executed.
i.e.
<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
</doc>
produces these events:
start document
start element: doc
start element: para
characters: Hello, world!
end element: para
characters:
end element: doc
characters:
end document
It would be really nice if somehow these whitespaces don't get recognized as "characters".
Does anybody have an idea why this is happening, or how it can be prevented?
This is, of course, happening since whitespace between elements is significant in XML. So it's just operating according to specification.
See, for instance, this discussion.
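Since the parser is behaving to spec, the usual fix is to drop character events that are whitespace-only inside your own callback. The logic is the same in any SAX parser; illustrated here with Python's xml.sax rather than libxml2 (in C you would apply the equivalent isspace() check inside your characters handler):

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.texts = []

    def characters(self, content):
        # Ignore whitespace-only character events between elements.
        if content.strip():
            self.texts.append(content)

handler = TextCollector()
xml.sax.parseString(
    b"<?xml version='1.0'?>\n<doc>\n<para>Hello, world!</para>\n</doc>",
    handler,
)
print("".join(handler.texts))
```

Note that a SAX parser may split one text node across several characters() events, so only whole-event whitespace can safely be discarded this way.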
