Why does the Solr Data Import Handler hashes the uniqueKey?

Why does the Solr Data Import Handler hashes the uniqueKey? - solr

I have a very strange problem with Solr 4.6.0.
The uniqueKey field "id" contains a hash for every document instead of my string value. If add just one custom document with the update request handler in the Solr admin I get for example the ID value "book_45" that I specified, so that is correct.
But when I do a full import with the DIH (data import handler) then the id field contains a hash for every document like "[B#53bd370f" instead of my custom value. So the problem must be in the DIH.
My import script:
<dataConfig>
<dataSource
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://host/database"
user="user"
password="password" />
<document name="project">
<entity name="document" transformer="RegexTransformer"
query="SELECT CONCAT('book_', b.id) AS book_id, b.slug, b.title, b.isbn,
b.publisher, b.releaseYear AS release_year, b.language, b.pageCount AS page_count, b.description,
b.print, b.addedBy_id AS added_by_id, b.dt AS created,
GROUP_CONCAT(a.name SEPARATOR ';') AS authors
FROM Book b
LEFT JOIN author_book ab ON ab.book_id = b.id
LEFT JOIN Author a ON a.id = ab.author_id
GROUP BY b.id
">
<field column="book_id" name="id" />
<field column="slug" name="book_slug" />
<field column="title" name="book_title" />
<field column="isbn" name="book_isbn" />
<field column="publisher" name="book_publisher" />
<field column="release_year" name="book_release_year" />
<field column="language" name="book_language" />
<field column="page_count" name="book_page_count" />
<field column="description" name="book_description" />
<field column="print" name="book_print" />
<field column="added_by_id" name="book_added_by_id" />
<field column="created" name="book_created" />
<field column="authors" splitBy=";" name="authors" />
</entity>
</document>
The id field in my schema.xml (which is the same as in the default shipped core collection1):
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
Does anyone know what I am missing?

the [B#53bd370f is not a hash, but the result of a byte[].toString(). Whatever Mysql is returning is being treated as a byte[] instead of a String.
Try casting the id to varchar or char like this:
SELECT cast(CONCAT('book_', b.id) as CHAR) AS book_id...

Related

How to separate the values of a multivalued field into dynamic fields

I have 1 multivalued date type field, its definition in the schema.xml is shown below:
<field name="fecha_referencia" type="pdates" uninvertible="true" indexed="true" stored="true"/>
The total of values it can take are three, here is an example where it is already indexed:
fecha_referencia:["2015-12-04T00:00:00Z",
"2014-12-15T00:00:00Z",
"2014-02-03T00:00:00Z"]
I want to know is if you can divide the values at the time of indexing (I am indexing via DIH) into other dynamic fields or separate fields.
Example of what you are looking for:
fecha_referencia:["2015-12-04T00:00:00Z",
"2014-12-15T00:00:00Z",
"2014-02-03T00:00:00Z"],
fecha1:2015-12-04T00:00:00Z,
fecha2:2014-12-15T00:00:00Z,
fecha3:2014-02-03T00:00:00Z
Note: I have tried to test regex but have had no luck.
Any contribution would be of great help and well received by your server...
This is my data.config.xml structure:
<dataConfig>
<dataSource type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://10.152.11.47:5433/meta" user="us" password="ntm" URIEncoding="UTF-8" />
<document >
<entity name="tr_ident" query="SELECT id_ident, titulo,proposito,descripcion,palabra_cve
FROM ntm_p.tr_ident">
<field column="id_ident" name="id_ident" />
<field column="titulo" name="titulo" />
<field column="proposito" name="proposito" />
<field column="descripcion" name="descripcion" />
<field column="palabra_cve" name="palabra_cve" />
<entity name="tr_fecha_insumo" query="select fecha_creacion,fech_ini_verif,
fech_fin_verif from ntm_p.tr_fecha_insumo where id_fecha_insumo='${tr_ident.id_ident}'">
<field name="fecha_creacion" column="fecha_creacion" />
<field name="fech_ini_verif" column="fech_ini_verif" />
<field name="fech_fin_verif" column="fech_fin_verif" />
</entity>
<entity name="ti_fecha_evento"
query="select tipo_fecha,fecha_referencia from ntm_p.ti_fecha_evento where id_fecha_evento='${tr_ident.id_ident}'">
<field column="fecha_referencia" name="fecha_referencia" />
<entity name="tc_tipo_fecha" query="select des_tipo_fecha,id_tipo_fecha from ntm_p.tc_tipo_fecha where id_tipo_fecha='${ti_fecha_evento.tipo_fecha}'">
<field column="des_tipo_fecha" name="des_tipo_fecha" />
<field column="id_tipo_fecha" name="id_tipo_fecha" />
</entity>
</entity>
</entity>
</document>
</dataConfig>

Searching a nested structure. Can not retrieve parents with all corresponding childs

The data is imported with the data import handler:
<dataConfig>
<dataSource
...
/>
<!-- product import -->
<document>
<!-- entity = table -->
<entity name="skn" pk="SKN" rootEntity="true" query="select * from skn">
<field column="SKN" name="id" />
<field column="root" name="root" />
<field column="SEARCHDESCRIPTION" name="SEARCHDESCRIPTION" />
<entity name="sku" child="true" query="select * from sku where SKN = '${skn.SKN}'">
<field column="SKU" name="id" />
<field column="variant1" name="variant1" />
<field column="variant2" name="variant2" />
<field column="v1_long" name="v1_long" />
<field column="v2_long" name="v2_long" />
<field column="v1_type" name="v1_type" />
<field column="v2_type" name="v2_type" />
</entity>
</entity>
<propertyWriter
dateFormat="yyyy-MM-dd HH:mm:ss"
type="SimplePropertiesWriter"
directory="conf"
filename="dataimport.properties"
locale="de-DE"
/>
</document>
</dataConfig>
I can get all childs for a certain parent or all parents for a certain child (so the nested structure is working). But I cannot retrive parents with the corresponding childs.
I tried the following query:
q={!parent which="id:1"}&fl=*,[child]&rows=200
It returns the parent document but not the corresponding child. I dont't get any error message. I also checked the log file.
Can anybody help?

DeltaImport fetches all the data

I'm indexing data from database. I'm using delta import to fetch the recently updated data. However, I find that it is fetching the whole data twice and processing it once though the changes are applicable to only one row.
My config.xml where deltaquery is given:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.github.cassandra.jdbc.CassandraDriver" url="jdbc:c*://127.0.0.1:9042/test" autoCommit="true" rowLimit = '-1' batchSize="-1"/>
<document name="content">
<entity name="test" query="SELECT * from person" deltaImportQuery="select * from person where seq=${dataimporter.delta.seq}" deltaQuery="select seq from person where last_modified > '${dataimporter.last_index_time}' ALLOW FILTERING" autoCommit="true">
<field column="seq" name="id" />
<field column="last" name="last_s" />
<field column="first" name="first_s" />
<field column="city" name="city_s" />
<field column="zip" name="zip_s" />
<field column="street" name="street_s" />
<field column="age" name="age_s" />
<field column="state" name="state_s" />
<field column="dollar" name="dollar_s" />
<field column="pick" name="pick_s" />
</entity>
</document>
</dataConfig>
There are about 2100000 rows. So it always cause a large memory consumption resulting in Running Out of Memory. What could be the problem? Or does it work in this way only?

If solr is running out of memory then it is time to add more memory to the solr box. Adding more RAM will help alleviate the issue.

handling null value in nested entity of solr data importer

I'm using the Solr Data Importer to import some category data. I didn't want to use a left join in the parent query because it's too complicated, I preferred to use nested object queries in the configuration to keep it simple.
I've got 3 one to one relationships for feature images of a category. My question is though, how can I handle it when the value in mediaItemX_id field is null? I've tried the nested configuration below, but when the value is null it's reporting invalid sql because the nested query doesn't print null - it prints blank....
<entity name="category" query="SELECT concat('CATEGORY_', c.id) as docId, c.id, externalIdentifier, name, description, shortDescription, mediaItem1_id, mediaItem2_id, mediaItem3_id, created, lastUpdated, keywords, 'CATEGORY' as docType,
name as autoSuggestField
FROM categories c inner join base_content bc where c.id = bc.id">
<field column="id" name="categoryId" />
<field column="externalIdentifier" name="externalIdentifier" />
<field column="docType" name="docType" />
<field column="name" name="name" />
<field column="description" name="description" />
<field column="shortDescription" name="shortDescription" />
<field column="created" name="created" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
<field column="lastUpdated" name="lastUpdated" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
<field column="publishDate" name="publishDate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
<field column="archiveDate" name="archiveDate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
<field column="autoSuggestField" name="suburbSuggest" />
<field column="keywords" name="keywords" />
<entity name="mediaItem1" query="SELECT uri, title, altText from media where ${category.mediaItem1_id} is not null and id = ${category.mediaItem1_id}">
<field column="uri" name="featureImage1Url" />
<field column="title" name="featureImage1Title" />
<field column="altText" name="featureImage1AltText" />
</entity>
<entity name="mediaItem2" query="SELECT uri, title, altText from media where ${category.mediaItem2_id} is not null and id = ${category.mediaItem2_id}">
<field column="uri" name="featureImage2Url" />
<field column="title" name="featureImage2Title" />
<field column="altText" name="featureImage2AltText" />
</entity>
<entity name="mediaItem1" query="SELECT uri, title, altText from media where ${category.mediaItem3_id} is not null and id = ${category.mediaItem3_id}">
<field column="uri" name="featureImage3Url" />
<field column="title" name="featureImage3Title" />
<field column="altText" name="featureImage3AltText" />
</entity>
</entity>

Solr supports the notion ${value:default} as replacements in other locations, so I'd try that at least:
${category.mediaItem1_id} IS NOT NULL AND id = ${category.mediaItem1_id:0}
I were unable to find a decent way to skip the entities whole if the current value is false.

Full-import failing when using CachedSqlEntityProcessor giving OutOfMemoryError Exception

Full-import failing when using CachedSqlEntityProcessor giving Exception
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can i resolve this Issue.......
Without using CachedSqlEntityProcessor it is taking 15 hrs to index
and My products-data-config.xml is
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/localbazaar" user="root" password="sa" batchSize="100" />
<document name="products">
<entity name="domainProduct" query="SELECT p.PRODUCT_ID, p.NAME, LOWER(REPLACE(REPLACE(p.NAME,' ','-'),'/','-')) AS purl, p.description, p.BRAND_ID, p.CATEGORY_ID, p.GROUP_ID, p.MIN_PRICE, p.MAX_PRICE, p.AUTHOR, p.ISBN10, p.ISBN13, p.OLID, p.EAN13, p.UPCA, p.SKU, p.LANGUAGE, p.FORMAT, p.PUBLISHER, p.SUBJECT, c.NAME AS cname, c.URL_NAME, b.NAME AS bname, LOWER(REPLACE(REPLACE(b.NAME,' ','-'),'/','-')) AS bUrl, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',i.IMAGE_NAME) AS productImage FROM product_t p LEFT OUTER JOIN category_t c ON (c.CATEGORY_ID=p.CATEGORY_ID) LEFT OUTER JOIN brand_t b ON (b.BRAND_ID=p.BRAND_ID) LEFT OUTER JOIN image_t i ON (i.ASSET_ID=p.PRODUCT_ID AND i.ASSET_TYPE_ID = 4 AND i.IMAGE_TYPE_ID = 0)">
<field column="PRODUCT_ID" name="productId" />
<field column="NAME" name="productName" />
<field column="purl" name="productUrlName" />
<field column="description" name="productDescription" />
<field column="BRAND_ID" name="brandId" />
<field column="CATEGORY_ID" name="categoryId" />
<field column="GROUP_ID" name="groupId" />
<field column="MIN_PRICE" name="minPrice" />
<field column="MAX_PRICE" name="maxPrice" />
<field column="AUTHOR" name="author" />
<field column="ISBN10" name="isbn10" />
<field column="ISBN13" name="isbn13" />
<field column="OLID" name="olid" />
<field column="EAN13" name="ean13" />
<field column="UPCA" name="upca" />
<field column="SKU" name="sku" />
<field column="LANGUAGE" name="language" />
<field column="FORMAT" name="format" />
<field column="PUBLISHER" name="publisher" />
<field column="SUBJECT" name="subject" />
<field column="cname" name="categoryName" />
<field column="URL_NAME" name="categoryUrlName" />
<field column="bname" name="brandName" />
<field column="bUrl" name="brandUrlName" />
<field column="productImage" name="productImage" />
<entity name="specifications" query="select PRODUCT_ID, CONCAT(PROPERTY_NAME,':::',property_value) as specifications FROM product_properties_t " processor="CachedSqlEntityProcessor" where="PRODUCT_ID=domainProduct.PRODUCT_ID" />
</entity>
</document>
</dataConfig>
and My store-products-data-config.xml is
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/localbazaar" user="root" password="sa" batchSize="100" />
<document name="products">
<entity name="domainStoreProduct" query="SELECT sp.STORE_PRODUCT_ID, sp.STORE_ID, sp.PRODUCT_ID, sp.MIN_PRICE, sp.MAX_PRICE, sp.STORE_TYPE_ID, sp.BUY_X, sp.GET_Y, s.NAME AS sname, LOWER(REPLACE(REPLACE(s.NAME,' ','-'),'/','-')) AS sUrl, s.DESCRIPTION AS sdesc, s.WEB_SITE_UTL, s.EMAIL, s.PHONE, s.MOBILE, s.ACTIVE AS act, a.ADDRESS_ID, a.location, LOWER(REPLACE(REPLACE(a.location,' ','-'),'/','-')) AS urlLoc, a.ADDRESS_LINE1, a.ADDRESS_LINE2, a.LATITUDE, a.LONGITUDE, a.zipcode, a.LANDMARK, a.CITY, CONCAT(a.LATITUDE,',',a.LONGITUDE) AS ll, p.NAME AS pname, LOWER(REPLACE(REPLACE(p.NAME,' ','-'),'/','-')) AS purl, p.description AS pdesc, p.BRAND_ID, p.CATEGORY_ID, p.GROUP_ID, p.AUTHOR, p.ISBN10, p.ISBN13, p.OLID, p.EAN13, p.UPCA, p.SKU, p.LANGUAGE, p.FORMAT, p.PUBLISHER, p.SUBJECT, c.NAME AS cname, c.URL_NAME, b.NAME AS bname, LOWER(REPLACE(REPLACE(b.NAME,' ','-'),'/','-')) AS bUrl, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',ip.IMAGE_NAME) AS pImage, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',ist.IMAGE_NAME) AS sImage, ci.CITY_ID FROM store_products_t sp LEFT OUTER JOIN store_t s ON (sp.STORE_ID=s.STORE_ID) LEFT OUTER JOIN address_t a ON (a.ASSET_TYPE_ID=3 AND a.ASSET_ID=sp.STORE_ID) LEFT OUTER JOIN product_t p ON (p.PRODUCT_ID=sp.PRODUCT_ID) LEFT OUTER JOIN category_t c ON (c.CATEGORY_ID=p.CATEGORY_ID) LEFT OUTER JOIN brand_t b ON (b.BRAND_ID=p.BRAND_ID) LEFT OUTER JOIN image_t ip ON (ip.ASSET_ID=sp.PRODUCT_ID AND ip.ASSET_TYPE_ID=4 AND ip.IMAGE_TYPE_ID=0) LEFT OUTER JOIN image_t ist ON (ist.ASSET_ID=sp.STORE_ID AND ist.ASSET_TYPE_ID=3 AND ist.IMAGE_TYPE_ID=0) LEFT OUTER JOIN city_t ci ON (ci.NAME=a.CITY)">
<field column="STORE_PRODUCT_ID" name="storeProductId" />
<field column="STORE_ID" name="storeId" />
<field column="PRODUCT_ID" name="productId" />
<field column="MIN_PRICE" name="storeMinPrice" />
<field column="MAX_PRICE" name="storeMaxPrice" />
<field column="STORE_TYPE_ID" name="storeTypeId" />
<field column="BUY_X" name="buyX" />
<field column="GET_Y" name="getY" />
<field column="sname" name="storeName" />
<field column="sUrl" name="storeUrlName" />
<field column="sdesc" name="description" />
<field column="WEB_SITE_UTL" name="webSiteUrl" />
<field column="EMAIL" name="email" />
<field column="PHONE" name="phone" />
<field column="MOBILE" name="mobile" />
<field column="act" name="active" />
<field column="ADDRESS_ID" name="addressId" />
<field column="location" name="location" />
<field column="urlLoc" name="urlLocation" />
<field column="ADDRESS_LINE1" name="addressLine1" />
<field column="ADDRESS_LINE2" name="addressLine2" />
<field column="LATITUDE" name="latitude" />
<field column="LONGITUDE" name="longitude" />
<field column="zipcode" name="zipcode" />
<field column="LANDMARK" name="landmark" />
<field column="CITY" name="city" />
<field column="ll" name="latlong" />
<field column="pname" name="productName" />
<field column="purl" name="productUrlName" />
<field column="pdesc" name="productDescription" />
<field column="BRAND_ID" name="brandId" />
<field column="CATEGORY_ID" name="categoryId" />
<field column="GROUP_ID" name="groupId" />
<field column="AUTHOR" name="author" />
<field column="ISBN10" name="isbn10" />
<field column="ISBN13" name="isbn13" />
<field column="OLID" name="olid" />
<field column="EAN13" name="ean13" />
<field column="UPCA" name="upca" />
<field column="SKU" name="sku" />
<field column="LANGUAGE" name="language" />
<field column="FORMAT" name="format" />
<field column="PUBLISHER" name="publisher" />
<field column="SUBJECT" name="subject" />
<field column="cname" name="categoryName" />
<field column="URL_NAME" name="categoryUrlName" />
<field column="bname" name="brandName" />
<field column="bUrl" name="brandUrlName" />
<field column="pImage" name="productImage" />
<field column="sImage" name="storeImage" />
<field column="CITY_ID" name="cityId" />
<entity name="specifications" query="select PRODUCT_ID, CONCAT(PROPERTY_NAME,':::',property_value) as specifications FROM product_properties_t " processor="CachedSqlEntityProcessor" WHERE="PRODUCT_ID= domainStoreProduct.PRODUCT_ID" />
<entity name="storeProperties" query="select STORE_ID, CONCAT(PROPERTY_ID,':::',PROPERTY_VALUE) as storeProperties FROM store_properties_t " processor="CachedSqlEntityProcessor" WHERE="STORE_ID=domainStoreProduct.STORE_ID" />
</entity>
</document>
</dataConfig>

You can try different things:
Try setting the batchSize property. If you tune it correctly, you can increase the performance of your datasource.
SELECT * is ALWAYS slower than providing the columns you need (even if you need all columns). I would suggest using SELECT PRODUCT_ID, NAME, ... in stead of using *
Why do you have the entities b, i and s? You don't use the fields from it, so I don't think they're very useful
Try using the CachedSqlEntityProcessor for your sub-entities. It will only retrieve the data once and re-use it for each subenttiy.
Can your product belong to more than 1 category (is it a multivalued field?), if not, then writing one query using JOINS is faster than writing multiple entities.
EDIT: I suggest seperating this thing into 2 questions because now it's really weird for other people to read your new question with my old answer.
I don't think you can choose where the CachedSqlEntityProcessor will put his cache (it's always in memory I think). The problem with your 8 hours of data import is that, because we're talking about a lot of records, a lot of queries will be used (every subentity uses its own query).
The solution to your problem is to remove the subentity and in your parent entity add the query of your subentity as a comma seperated list. I suggest looking at this answer.
If you do this, all your specifications (for examples) will be stored inside one column as a comma speerated list. You can then use a Solr ScriptTransformer to split the values and create multiple values.
This limits the number of queries to 1 big query and will also limit the use of RAM since it will parse each query individually. I have no clue what the performance will be, because you will have to parse each entity individually.
If this doesn't work I don't think there is a better solution than to wait 8 hours for the data import to complete. You can't expect that Solr will index it all in 1 2 3. You can try using a cronjob to run this task over night.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Why does the Solr Data Import Handler hashes the uniqueKey? - solr

the [B#53bd370f is not a hash, but the result of a byte[].toString(). Whatever Mysql is returning is being treated as a byte[] instead of a String. Try casting the id to varchar or char like this: SELECT cast(CONCAT('book_', b.id) as CHAR) AS book_id...

Related

How to separate the values of a multivalued field into dynamic fields

Searching a nested structure. Can not retrieve parents with all corresponding childs

DeltaImport fetches all the data

handling null value in nested entity of solr data importer

Full-import failing when using CachedSqlEntityProcessor giving OutOfMemoryError Exception

Categories

Resources