Solr: extract text from image and image PDF files

I am working with Solr 6.5.1 and want to extract text from image files and image-based PDF files. For this I installed Tesseract OCR and configured it with Solr in two ways:
1. Set the TESSDATA_PREFIX environment variable to C:\Program Files (x86)\Tesseract-OCR and used the /update/extract request handler to index an image with its content.
2. Modified the tesseractOCRConfig.properties file in the tika-parsers-1.13 jar in the Solr lib directory to tesseractPath=C:/Program Files (x86)/Tesseract-OCR and used the /update/extract request handler to index an image/image PDF with its content.
Either way I get no content; the response contains only attr_x_parsed_by=org.apache.tika.parser.ocr.TesseractOCRParser.
Is there any other configuration I need to set for Solr or Tesseract OCR to extract content from image/image PDF files?
Thanks in advance.
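One way to check whether OCR produces any text at all is to send the image straight to the extract handler with extractOnly=true, which returns the extracted content without indexing it. A minimal sketch, assuming Python's requests library and hypothetical collection/file names:

import requests

# Hypothetical collection and file names; adjust to your setup.
url = "http://localhost:8983/solr/mycollection/update/extract"

with open("sample.png", "rb") as f:
    # extractOnly=true returns whatever text Tika/Tesseract extracted
    # instead of indexing it, which makes OCR failures visible.
    resp = requests.post(
        url,
        params={"extractOnly": "true", "wt": "json"},
        files={"file": ("sample.png", f, "image/png")},
    )

print(resp.json())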

Related

PutSFTP is taking a wrong path on the SFTP server in NiFi

I have a flow that fetches a file from an SFTP server, renames it, and puts it back on the server in the same location.
My flow:
ListSFTP -> FetchSFTP -> UpdateAttribute -> PutSFTP
My file location is on the D drive, and I have set that location in the Remote Path property of PutSFTP, but it takes the path as
c:/users/myname/d:/file/location
and of course it gives me an error.
Is there any solution for this?
Thanks in advance.
You can use the SFTP processors only if you are connecting to a server with a host, port, etc.
If you want to get files from your local disk (C:/ for example), you can use the GetFile processor instead.
An example flow could be this:
GetSFTP with the property Keep Source File set to false
UpdateAttribute
new property -> filename -> new_file_test.example
PutSFTP
In general, you can use GetSFTP/GetFile and PutSFTP/PutFile depending on whether the location is remote or local.

Image Extractor by AI Habitat produces a configuration error when importing Matterport dataset

I need help understanding the error message, which is along the lines of changing the file name to .json because the configuration fails. I have a long error message but have pasted the part that is mostly repeated throughout:
/Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.stage_config.json
I0412 19:04:17.735939 42397184 AttributesManagerBase.h:296] AttributesManager::createFromJsonOrDefaultInternal (Stage) : Proposing JSON name : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.stage_config.json from original name : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply | This file does not exist.
I0412 19:04:17.736085 42397184 AbstractObjectAttributesManagerBase.h:182] AbstractObjectAttributesManager::createObject (Stage) : Done making attributes with handle : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply
I0412 19:04:17.736093 42397184 AbstractObjectAttributesManagerBase.h:189] File (/Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply) exists but is not a recognized config filename extension, so new default Stage attributes created and registered.
I0412 19:04:17.736124 42397184 SceneDatasetAttributes.cpp:46]
What I did: I ran the image extractor after activating the Conda env. I modified the image extractor to change the file path to point to a .ply file in the Matterport dataset.
Setup:
1) Facebook's AI Habitat-sim built from source,
2) MacBook Air M1,
3) Conda environment with the dependencies (installed using pip install -r requirements.txt), but habitat-sim is not installed by Conda,
4) Matterport3D dataset (downloaded one house).
Thank you.

Configuration and searching for Solr gettingstarted collection

I'm going through the Solr Quick Start (version 6.2.0), which creates the gettingstarted collection and then ingests the docs/ folder, but I cannot find more explanation about two questions.
First, the collection is created with this line from the console:
Creating new collection 'gettingstarted' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=gettingstarted&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=gettingstarted
Where are the schema and solrconfig.xml files for this collection?
Second, after the documents are ingested there are 4405 of them in the index, all with a title field. But when I enter title:Solr in the q input field, I get this response:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":63,
    "params":{
      "q":"title:Solr",
      "indent":"on",
      "wt":"json",
      "_":"1480494738956"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]}
}
No documents are found, no matter what I enter for title. Is it possible to search the index by words in the title field?
Thanks
It looks like you are using the SolrCloud example, in which case the configuration is loaded into ZooKeeper. So the live version is not on the filesystem, as it would be with the non-cloud examples. You can look at it via the Admin UI instead.
If you just want to see the bootstrapped configuration, it is located in server/solr/configsets/, and the specific configuration depends on what you chose when creating the example.
Config files will be in server/solr/gettingstarted/conf.
Whenever you create a new collection, a folder with the collection name is created inside server/solr/. Inside that folder there will be a conf folder, which contains the config files (schema.xml, solrconfig.xml, etc.), and a data folder, which contains the index.
You should commit after you index documents into the collection.
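For example, an explicit commit can be sent to the update handler after posting documents; a minimal sketch using Python's requests library (the collection name is the one from the quick start):

import requests

# An explicit commit makes newly indexed documents visible to searches.
resp = requests.get(
    "http://localhost:8983/solr/gettingstarted/update",
    params={"commit": "true"},
)
print(resp.status_code, resp.text)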

Google App Engine and ttf font not working

I've got a small problem where Google App Engine is complaining about my ttf file. This is what it says:
Could not guess mimetype for css/fonts/Pacifico.ttf. Using application/octet-stream.
Now I've followed this link and changed my yaml file appropriately (or so I think):
- url: /css/fonts/(.*\.ttf)
static_files: css/fonts/\1
upload: css/fonts/(.*\.ttf)
mime_type: application/x-font-ttf
But when I do this I get the following:
appcfg.py: error: Error parsing C:\Users\Roberto\Desktop\bootstrap\app.yaml: mapping values are not allowed here
in "C:\Users\Roberto\Desktop\bootstrap\app.yaml", line 25, column 17.
2014-01-16 23:22:16 (Process exited with code 2)
Any help in this matter?
I have done a test with glyphicons-halflings-regular.ttf from the Bootstrap project, with the same app.yaml handler that you use (save for the indentation change as per the comments), and can verify that it works as expected.
This leads me to believe that you may be using an older version of the GAE SDK (I use 1.8.8) or that something else is wrong with your installation.
You can try this: appcfg.py uses Python's mimetypes module to guess the type from the file extension, so in any case you should be able to solve the issue by adding the application/x-font-ttf MIME type to your OS.
You're on Windows, so you need to edit your registry: add an application/x-font-ttf key under HKEY_CLASSES_ROOT\MIME\Database\Content Type, then add a string value called Extension with the value .ttf under the new key.
Extended procedure for adding the MIME type on Windows:
Open the registry editor: hit Winkey + R, type regedit, and hit Enter.
Navigate through the registry to the desired location: open HKEY_CLASSES_ROOT, inside it open MIME, inside that open Database, and inside that open Content Type. It's like a folder structure.
Right-click on Content Type and select New > Key; give it the name application/x-font-ttf.
Right-click on the key you just created and select New > String Value; give it the name Extension.
Double-click on the value you just created, assign it the Value data .ttf, and hit OK.
Exit regedit and you're done!
Final note: I don't think it can be anything to do with the file itself, because the mimetypes module uses only the file extension to work out the MIME type, unless there is some crazy unprintable character in the filename. You could try using the glyphicons-halflings-regular font I linked to in order to eliminate this possibility.
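Since appcfg.py relies on Python's mimetypes module, the registry change can be verified from a Python shell before redeploying; a small sketch (the expected output is an assumption based on the registry fix above):

import mimetypes

# On Windows the mimetypes module consults the registry's MIME database,
# so after adding the key this should report the new type.
print(mimetypes.guess_type("css/fonts/Pacifico.ttf"))
# Expected after the fix: ('application/x-font-ttf', None)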

Indexing PDF with Solr

Can anyone point me to a tutorial?
My main experience with Solr is indexing CSV files, but I cannot find any simple instructions/tutorial that tells me what I need to do to index PDFs.
I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler
But it makes very little sense to me. Do I need to install Tika?
I'm lost - please help.
With Solr 4.9 (the latest version as of this writing), extracting data from rich documents like PDFs, spreadsheets (the xls, xlsx family), presentations (ppt, pptx), and documents (doc, txt, etc.) has become fairly simple.
The sample code examples provided in the downloaded archive from here contain a basic Solr template project to get you started quickly.
The necessary configuration changes are as follows:
1. Change solrconfig.xml to include the following lines:
<lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
<lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />
create a request handler as follows:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults" />
</requestHandler>
2. Add the necessary jars from the solrExample to your project.
3. Define the schema as per your needs and fire a query like:
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@testDocToExtractFrom.txt"
Then go to the GUI portal and query to see the indexed contents.
Let me know if you face any problems.
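If curl is inconvenient, the same extract request can be issued from a script. A rough Python equivalent of the curl command above, using the requests library (the file name and literals are just the ones from the example):

import requests

url = "http://localhost:8983/solr/collection1/update/extract"
params = {
    "literal.id": "1",
    "literal.filename": "testDocToExtractFrom.txt",
    "commit": "true",
}

# Send the file as multipart form data, like curl's -F option does.
with open("testDocToExtractFrom.txt", "rb") as f:
    resp = requests.post(url, params=params, files={"myfile": f})

print(resp.status_code)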
You could use the DataImportHandler. The DataImportHandler is defined in solrconfig.xml, and its configuration lives in a separate XML config file (data-config.xml).
For indexing PDFs you could:
1.) crawl the directory to find all the PDFs using the FileListEntityProcessor, or
2.) read the PDFs from a "content/index" XML file using the XPathEntityProcessor.
Once you have the list of related PDFs, use the TikaEntityProcessor to extract their content.
Look at this http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ (an example with ppt) and this: Solr: data import handler and solr cell.
The hardest part of this is getting the metadata from the PDFs; using a tool like Aperture simplifies this. There must be tonnes of these tools.
Aperture is a Java framework for extracting and querying full-text content and metadata from PDF files.
Aperture grabbed the metadata from the PDFs and stored it in XML files.
I parsed the XML files using lxml and posted them to Solr.
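A rough sketch of that pipeline in Python (the metadata XML layout and field names here are made up for illustration; the real tags depend on what Aperture emits):

import requests
from lxml import etree

# Hypothetical metadata layout: <doc><title>...</title><author>...</author></doc>
tree = etree.parse("metadata/some-pdf.xml")
doc = {
    "id": "some-pdf",
    "title": tree.findtext("title"),
    "author": tree.findtext("author"),
}

# Post the parsed fields to Solr's JSON update endpoint and commit.
resp = requests.post(
    "http://localhost:8983/solr/collection1/update",
    params={"commit": "true"},
    json=[doc],
)
print(resp.status_code)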
Use the Solr ExtractingRequestHandler. This uses Apache Tika to parse the PDF file. I believe it can pull out the metadata etc., and you can also pass through your own metadata.
Extracting Request Handler:
import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams; // from the solr-cell jar

public class SolrCellRequestDemo {
    public static void main(String[] args) throws IOException, SolrServerException {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build();
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("my-file.pdf"), "application/pdf");
        // extractOnly=true returns the extracted content without indexing it.
        req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
        NamedList<Object> result = client.request(req);
        System.out.println("Result: " + result);
    }
}
This may help.
Apache Solr can now index all sorts of binary files like PDF, Word, etc. Check out this doc:
https://lucene.apache.org/solr/guide/8_5/uploading-data-with-solr-cell-using-apache-tika.html
