In CKAN, a dataset may have multiple resources in various file formats (e.g. CSV, JSON, etc.).
However, when I query Solr through CKAN with a request that filters on a specific format (e.g. <CKAN_SITE_URL>/dataset/?res_format=CSV), the result contains datasets whose formats are anything but CSV (say I'm querying with CSV as the requested format), whereas what I expect is a result containing only datasets that have CSV resource(s).
Does anyone know which field in the Solr schema I should adjust to solve this problem?
Thanks!
I want to index the entire content of a CSV file into one field.
I am having trouble with this: in a schemaless setup, Solr wants to create an enormous number of fields to index the CSV file, and in the standard manual schema setup (which I would prefer), it produces many "field not found" errors.
I thought that the answer would lie in the CSV update parameters (https://solr.apache.org/guide/8_11/uploading-data-with-index-handlers.html#csv-update-parameters), but I haven't had success.
Any help is appreciated!
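For reference, one way to sidestep the CSV update handler entirely is to read the whole file in client code and post it as the value of a single text field. Below is a minimal SolrJ sketch of that idea; the core name and the field names (id, content_txt) are assumptions, not anything Solr requires.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CsvAsSingleField {
    public static void main(String[] args) throws Exception {
        // Core name and field names are assumptions; adjust to your schema.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // Read the whole CSV as one string instead of letting Solr split it per column.
            String csv = new String(Files.readAllBytes(Paths.get("data/input.csv")), "UTF-8");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "input.csv");   // one Solr document per file
            doc.addField("content_txt", csv);  // entire CSV content in a single text field

            solr.add(doc);
            solr.commit();
        }
    }
}
```

Because the file never goes through the CSV handler, no per-column fields are created; only content_txt (or a matching dynamic field) needs to exist in the schema.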
The "requirements" are something like this: as a user I want to upload a dataset in CSV, JSON, or other supported text formats and be able to do basic REST queries against it such as selecting all first names in the dataset or select the first 10 rows.
I'm struggling to think of the "best" way to store this data. While, off the bat I don't think this will generate millions of datasets, it seems generally bad to create a new table for every dataset for a user as, eventually, I would hit the inode limit. I could store as flat files in something like S3 that's cached but then it still does require opening and parsing the file to query it.
Is this a use case for the JSON type in Postgres? If not, what would be the "right" format and place to store this data?
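For what it's worth, here is a rough JDBC sketch of the jsonb route, assuming a single table that holds one row per uploaded dataset with the parsed rows stored as a JSON array (table and column names are made up). The "select all first names" query unnests the array with jsonb_array_elements.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class JsonbDatasetStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {

            // One row per uploaded dataset; the parsed rows live in a jsonb column.
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS datasets ("
                        + "  id serial PRIMARY KEY,"
                        + "  owner text NOT NULL,"
                        + "  payload jsonb NOT NULL)");
            }

            // Store the uploaded CSV/JSON after converting it to a JSON array of objects.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO datasets (owner, payload) VALUES (?, ?::jsonb)")) {
                ins.setString(1, "alice");
                ins.setString(2, "[{\"first_name\":\"Ada\"},{\"first_name\":\"Grace\"}]");
                ins.executeUpdate();
            }

            // "Select all first names": unnest the array and project one key.
            try (PreparedStatement sel = conn.prepareStatement(
                    "SELECT elem->>'first_name' AS first_name "
                    + "FROM datasets, jsonb_array_elements(payload) AS elem "
                    + "WHERE owner = ?")) {
                sel.setString(1, "alice");
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("first_name"));
                    }
                }
            }
        }
    }
}
```

The "first 10 rows" case follows the same pattern, unnesting with WITH ORDINALITY and adding a LIMIT.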
I have data in two formats: CSV and TEXT.
1) The CSV file contains metadata, i.e. ModifyScore, Size, fileName, etc.
2) The actual text is in text folders containing files like a.txt, b.txt, etc.
Is it possible to index such data into a single Solr core through DIH, or in some other way?
Given your use case, I would proceed with a custom indexing app.
Apparently you want to build your Solr document by fetching some fields from the CSV and another field (the content) from the TXT files.
Using Java, for example, it is going to be quite simple:
You can use SolrJ, fetch the data from the CSV and TXT files, build each Solr document and then index it.
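A bare-bones sketch of that approach might look like the following; the CSV column order, field names, folder layout and core URL are all assumptions, and a real app should use a proper CSV parser rather than a naive split.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CsvPlusTxtIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
             BufferedReader csv = new BufferedReader(new FileReader("metadata.csv"))) {

            String line = csv.readLine(); // skip header: fileName,ModifyScore,Size
            while ((line = csv.readLine()) != null) {
                String[] cols = line.split(",");   // naive split; use a CSV library for real data
                String fileName = cols[0];
                String modifyScore = cols[1];
                String size = cols[2];

                // Fetch the actual text from the TXT folder for this row.
                String content = new String(
                        Files.readAllBytes(Paths.get("text", fileName)), "UTF-8");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", fileName);
                doc.addField("modify_score_f", modifyScore);
                doc.addField("size_l", size);
                doc.addField("content_txt", content);
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```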
I would use the DIH if I could move the data into a DB (even 2 tables are fine, as DIH supports joins).
Out of the box, you may be interested in using the ScriptTransformer [1].
Using it in combination with your different data sources could work.
You need to play a little bit with it as it's not a direct solution to your problem.
[1] https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-TheScriptTransformer
Just to mention a couple more possibilities:
Use DIH to index the txt files into collectionA, use the /update handler to ingest the csv directly into collectionB, then use Streaming Expressions to merge both into a third collection, which is the one you want to keep. The main advantage is that everything stays in Solr, with no external code.
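If you go that route, the merge step might look roughly like the streaming expression below; the collection names, field names and join key are assumptions, and both search streams need to be sorted on the join key and served by the /export handler. The expression is sent to the /stream handler.

```
commit(collectionC,
  update(collectionC, batchSize=250,
    innerJoin(
      search(collectionA, q="*:*", fl="id,content_txt",           sort="id asc", qt="/export"),
      search(collectionB, q="*:*", fl="id,modify_score_f,size_l", sort="id asc", qt="/export"),
      on="id"
    )
  )
)
```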
Use DIH to index the files (or /update to index the csv) and write an Update Request Processor that intercepts docs before they are indexed, looks up the info from the other source, and adds it to the doc.
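A skeleton of such a processor, assuming the docs come from the txt files and the extra fields come from the CSV (class names and the lookup are placeholders), could look like this; it would then be registered in an updateRequestProcessorChain in solrconfig.xml and selected via the update.chain parameter.

```java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CsvMetadataLookupProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                String fileName = (String) doc.getFieldValue("id");

                // Placeholder: look up the row for this file in the CSV metadata
                // (e.g. from a map loaded at startup) and enrich the document.
                doc.setField("modify_score_f", lookupModifyScore(fileName));

                super.processAdd(cmd);  // pass the enriched doc down the chain
            }
        };
    }

    // Hypothetical helper; a real processor would read the CSV once and cache it.
    private static String lookupModifyScore(String fileName) {
        return "0.0";
    }
}
```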
Yes, it is possible. For information and code on how to index data from multiple heterogeneous data sources, see "Why does the TikaEntityProcessor not index the text field in the following data-config file?"
I need to get values from a certain column in an xlsx spreadsheet that was uploaded to my database into an image (blob) field. I would like to step through the rows, get the values from, say, column 4, and insert those values into another table using SQL Server. I can do it with CSV files by casting the image field to varbinary, casting that again to varchar, and searching for the ','s.
Can openrowset work on a blob field?
I doubt that this can work. Even though the data in the XLSX is stored in Microsoft's Office Open XML format (http://en.wikipedia.org/wiki/Office_Open_XML), the XML is then zipped, which means that your XLSX file is a binary file. So if you want to access data in the xlsx (can't you use csv instead?), I think you need to do so programmatically. Depending on the programming language of your choice, there are various open-source projects allowing you to access xlsx files.
Java: Apache POI http://poi.apache.org/spreadsheet/
C++: http://sourceforge.net/projects/xlslib/?source=directory
...
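For the Java option, a minimal Apache POI sketch that walks the first sheet and reads column 4 might look like this; the file path is a placeholder, and the input stream could just as well wrap the bytes read from the blob column.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ReadColumnFour {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        try (InputStream in = new FileInputStream("upload.xlsx");  // or a stream over the blob bytes
             Workbook wb = new XSSFWorkbook(in)) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                Cell cell = row.getCell(3);          // column 4 (0-based index)
                if (cell != null) {
                    String value = fmt.formatCellValue(cell);
                    System.out.println(value);       // here you would INSERT into the other table
                }
            }
        }
    }
}
```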
I have stored a number of binary files in a SQL Server table. I created a full-text index on that table which also indexes the binary field containing the documents. I installed the appropriate iFilters so that SQL Server can also read .doc, .docx and .pdf files.
Using the DATALENGTH function I can retrieve the length/size of the complete document, but this also includes layout and other useless information. I want to know the length of just the text of the documents.
Using the iFilters, SQL Server is able to retrieve only the text of such "complicated" documents, but can it also be used to determine the length of just the text?
As far as I know (which isn't much), there is no way to query document properties via FTS. I would get the word count before inserting the document into the database, then store the count alongside it in another column of the table. For Word documents, you can use the Document.Words.Count property; I don't know what the equivalent mechanism is for PDF documents.
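If Word automation is not an option, a hedged alternative for .docx files is Apache POI's XWPFWordExtractor: extract the plain text and count the whitespace-separated tokens before the INSERT. The file name below is a placeholder, and the count is only as good as the whitespace splitting.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxWordCount {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("report.docx");
             XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {

            String text = extractor.getText();
            int wordCount = text.trim().isEmpty()
                    ? 0
                    : text.trim().split("\\s+").length;

            // Store wordCount in the extra column when inserting the binary into SQL Server.
            System.out.println("Word count: " + wordCount);
        }
    }
}
```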