I upload a TXT file with 100,000 rows to my database through a process that loads it into my information system.
My question is: why is it better to use a TXT file rather than a CSV file?
What is the difference?
The database is SQL Server and the information system is SharePoint.
Thank you for your help.
CSV files are just text files. The only difference is that the name states explicitly that the values are separated by commas. If you use TSV instead of CSV, it states explicitly that the values are separated by tabs.
When you load the data, you specify the delimiter in the format file accordingly. In other words, there is no difference between TSV and CSV behind the scenes. Both are text files, i.e. TXT files.
Say you have two fields in your file. You can still keep it as a TXT file.
If you have comma-separated values (CSV), your format file will look like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="40" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="40" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="Column1" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="Column2" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
If you have tab-separated values (TSV), your format file will look like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="40" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="Column1" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="Column2" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
So, CSV is basically a TXT file with values separated by commas.
TSV is basically a TXT file with values separated by tabs.
Related
I need to query a SQL table and output the data in this XML format. Can someone help out or point me in the right direction?
Note: I need to use T-SQL to achieve this result.
<account>
<field name="A" value="aaaaaaaa" type="string"/>
<field name="B" value="bbbbbbbbb" type="string"/>
<field name="I" value="11111111" type="int"/>
</account>
I am using DataImportHandler to index data in Solr.
I am retrieving three columns from the AUTO table in the database. Two of them, TOPIC and PARTS, hold data of type CLOB, and the DATE column holds an Oracle timestamp with the creation date.
The problem is in my data-config file, where I need to convert the CLOB data to strings and the date to the UTC format that Solr uses.
So I need two transformers, i.e. ClobTransformer and DateFormatTransformer.
I am wondering how to use both transformers in a single entity.
Here is my data-config file:
<dataConfig>
<dataSource name="ds1" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="....."
user="....."
password="...."/>
<document name="doc">
<entity name="ent"
query="Select
auto.ID,
auto.Topic as Topic,
auto.Parts as Parts,
to_date(to_char(auto.Date, 'yyyy-MM-dd HH:MI:SS'), 'YYYY-MM-DD HH:MI:SS') AS Date
From auto
order by auto.Date DESC"
dataSource="ds1" transformer="DateFormatTransformer">
<field column="ID" name="id"/>
<field column="TOPIC" name="topic" clob="true"/>
<field column="PARTS" name="parts" clob="true"/>
<field column="DATE" name="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
</entity>
</document>
</dataConfig>
Above, I have used only DateFormatTransformer.
Any help would be much appreciated.
OK, I found out how it is done. Just list the transformers, separated by commas, in the 'transformer' attribute of the entity tag, like this:
<dataConfig>
<dataSource name="ds1" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="....."
user="....."
password="...."/>
<document name="doc">
<entity name="ent"
query="Select
auto.ID,
auto.Topic as Topic,
auto.Parts as Parts,
to_date(to_char(auto.Date, 'yyyy-MM-dd HH:MI:SS'), 'YYYY-MM-DD HH:MI:SS') AS Date
From auto
order by auto.Date DESC"
dataSource="ds1" transformer="ClobTransformer,DateFormatTransformer">
<field column="ID" name="id"/>
<field column="TOPIC" name="topic" clob="true"/>
<field column="PARTS" name="parts" clob="true"/>
<field column="DATE" name="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
</entity>
</document>
</dataConfig>
I have used two transformers, transformer="ClobTransformer,DateFormatTransformer"
How do I index only documents that contain a specific string in Solr? This is my current DataImportHandler configuration:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page"
processor="XPathEntityProcessor"
stream="true"
forEach="/mediawiki/page/"
url="pages.xml"
transformer="RegexTransformer"
>
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="text" regex="\{\{PersonData" xpath="/mediawiki/page/revision/text" />
</entity>
</document>
</dataConfig>
I only want to index a document if its text field contains {{PersonData, but the configuration above imports everything. Should this be specified in the import handler or in the schema?
You need to do something like this:
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
In this case, documents matching the specified regex are skipped, i.e. articles that are "redirects" to other articles are skipped here.
Detailed documentation is here:
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
So in your case, you need to find a way to skip all documents where "PersonData" is NOT in the "text" column.
Look specifically at the "Example: Indexing wikipedia" part of http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
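One way to express "skip everything that does NOT contain the marker" is DataImportHandler's ScriptTransformer rather than a negated regex. The sketch below is only an illustration under a few assumptions: the function name skipWithoutPersonData is made up for this example, the JVM running Solr must provide a JavaScript engine, and the rest of the configuration is copied from the question.

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <!-- Hypothetical helper: the function name is an assumption for this sketch. -->
  <script><![CDATA[
    function skipWithoutPersonData(row) {
      var text = row.get('text');
      // Flag the row for skipping when the marker is absent from the text.
      if (text == null || String(text).indexOf('{{PersonData') < 0) {
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="pages.xml"
            transformer="script:skipWithoutPersonData">
      <field column="id" xpath="/mediawiki/page/id" />
      <field column="title" xpath="/mediawiki/page/title" />
      <field column="text" xpath="/mediawiki/page/revision/text" />
    </entity>
  </document>
</dataConfig>
```

A plain substring check avoids running a negative-lookahead pattern over very large article bodies, which is why this sketch uses a script instead of RegexTransformer.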
I am trying to parse the binary content stored in the file_data column of the document_attachment table and index it, so that its content becomes searchable in Solr.
When I run the indexer, it fetches twice as many rows as the query in the entity named "dcs" returns, and it throws no errors or exceptions. However, it does not index the binary content (the field that I associate with Tika), despite fetching it from the table.
I am using apache-solr-3.6.1 and Tika 1.0.
My configuration files look something like this:
data-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/espritkm_1?zeroDateTimeBehavior=convertToNull"
user="root"
password=""
autoCommit="true" batchSize="-1"
convertType="false"
name="test"
/>
<dataSource name="fieldReader" type="FieldStreamDataSource" />
<document name="items">
<entity name="dcs"
query="SELECT 222000000000000000+d.id AS common_id_attr,d.id AS id,UNIX_TIMESTAMP(d.created_at) AS date_added,d.file_name as common1, d.description as common2, d.file_mime_type as common3, 72 as common4,(Select group_concat(trim(tags) ORDER BY trim(tags) SEPARATOR ' | ') from tags t where t.type_id = 72 AND t.feature_id = d.id group by t.feature_id) as common5,d.created_by as common6, df.name as common7,CONCAT(d.file_name,'.',d.file_mime_type) as common8,'' as common9,(Select da.file_data from document_attachment da where da.document_id = d.id) as text FROM document d LEFT JOIN document_folder df ON df.id = d.document_folder_id WHERE d.is_deleted = 0 and d.parent_id = 0 " dataSource="test" transformer="TemplateTransformer">
<field column="common_id_attr" name="common_id_attr" />
<field column="id" name="id" />
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="dcs.text" format="text" pk="dcs.id" >
<field column="text" name="text" />
</entity>
</entity>
</document>
</dataConfig>
schema.xml
<schema>
<fields>
<field name="common_id_attr" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>common_id_attr</uniqueKey>
<solrQueryParser defaultOperator="OR"/>
<defaultSearchField>text</defaultSearchField>
</schema>
The number of rows it fetches is double the number of documents, as if the Tika rows were counted separately (I don't understand why), and it does not index the binary content.
I have been stuck on this problem for a long time. Can someone please help?
I was able to index the documents using Apache Solr version 3.6.2. I have described the steps here:
http://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/
I think it should be doable in 3.6.1 as well. I simply did not have the patience to search for a 3.6.1 tarball when only 3.6.2 was available from the official site.
I hope that helps.
I am new to Solr. While creating the indexes, I am prepending a string to the database table id.
My field in schema.xml is as follows:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>
I am passing 'GROUP1' as the id, but it is stored as [B@1e090ee.
How can I store the actual value (GROUP1) instead of [B@1e090ee?
Please help.
Is group_id a string or a numeric data type?
If it is not a string, you need to cast it to char, with the appropriate encoding, before the concatenation.
Also add an encoding parameter (matching your MySQL database encoding) to the dataSource tag, like this:
<dataSource
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://host/dbname"
batchSize="-1"
user="username"
password="password"
readOnly="true"
autoCommit="true"
encoding="UTF-8" />
Which DB are you using?
Do you see the correct values when you execute your query directly in your DB?
IMHO, the problem is either with DataImportHandler, or you actually have values like that ([B@1e090ee) in your group_id field.
Have you checked that the encoding param in your dataConfig's dataSource is the same as your DB's encoding?
Can you post your dataConfig file?
@mbonaci
I am using a MySQL database.
When I execute the same query directly, the results come back fine.
The following is my data config file:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://host/dbname" batchSize="-1" user="username" password="password" readOnly="true" autoCommit="true" />
<document name="products">
<entity name="item" query="select group_id,group_title,description,DATE_FORMAT(created_date, '%Y-%m-%dT%H:%i:%sZ') as createdDate,group_status,CONCAT('GROUP',group_id) as id,'GROUP' as itemtype from collaboration_groups where group_status=1 ">
<field column="id" name="id" />
<field column="group_id" name="itemid" />
<field column="itemtype" name="itemtype" />
<field column="group_title" name="fullName" />
<field column="description" name="description"/>
<field column="createdDate" name="createdDate"/>
</entity>
</document>
</dataConfig>
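Applied to the query above, the cast suggested earlier could look like the sketch below. It assumes group_id is a numeric column and that, in the MySQL version in use, CONCAT('GROUP', group_id) comes back as a binary string. The value stored in Solr looks like Java's default string form of a byte array, which would be consistent with that. Only the id expression changes; everything else is copied from the posted config.

```xml
<!-- Sketch only: casts group_id to CHAR before CONCAT so that id is returned as text. -->
<entity name="item"
        query="select group_id, group_title, description,
                      DATE_FORMAT(created_date, '%Y-%m-%dT%H:%i:%sZ') as createdDate,
                      group_status,
                      CONCAT('GROUP', CAST(group_id AS CHAR)) as id,
                      'GROUP' as itemtype
               from collaboration_groups
               where group_status=1">
  <field column="id" name="id" />
  <field column="group_id" name="itemid" />
  <field column="itemtype" name="itemtype" />
  <field column="group_title" name="fullName" />
  <field column="description" name="description"/>
  <field column="createdDate" name="createdDate"/>
</entity>
```

If the value is still stored as a byte array after this change, the encoding attribute on the dataSource tag is the next thing to check.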