Solr file layout with dynamic fields - solr

I'm a newbie to Solr and am trying to understand dynamic fields.
Assume I have the following schema,
If document-1, contains
id = 1, author = "Tom", title = "Python", text = "Book", first_name_string = "Tom" and last_name_string = "Dan"
and If document-2, contains
id = 2, author = "Brain" , title = "Java" , text = "Java"
How would the values be stored?
Will document-1 and document-2 be stored as shown above? What will the values of first_name_string and last_name_string be for document-2?
If I query both documents, how will the Solr results look?
<?xml version='1.0' ?>
<schema name='simple' version='1.1'>
<types>
<fieldType name='string' class='solr.StrField' />
<fieldType name='long' class='solr.TrieLongField' />
</types>
<fields>
<field name='id' type='long' required='true' />
<field name='author' type='string' multiValued='true' />
<field name='title' type='string' />
<field name='text' type='string' />
<dynamicField name='*_string' type='string'
multiValued='true' indexed='true' stored='true' />
<copyField source='*' dest='fullText' />
<field name='fullText' type='string' multiValued='true' indexed='true' />
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>fullText</defaultSearchField>
<solrQueryParser defaultOperator='OR' />
</schema>

If you don't provide data for a field, Solr will simply omit that field for that document. If you want all fields to appear in all documents, specify a default value for those fields in your schema.
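For illustration, without defaults a query over both documents (q=*:*) would return document-2 with only the fields it actually has, roughly (exact response formatting varies by version):
<doc>
<long name="id">2</long>
<arr name="author"><str>Brain</str></arr>
<str name="title">Java</str>
<str name="text">Java</str>
</doc>
If you want first_name_string and last_name_string to appear in every document, one hedged option is to declare them as regular fields with a default (the explicit declaration and the value "unknown" are assumptions, not part of the schema above), e.g.:
<field name='last_name_string' type='string' default='unknown' indexed='true' stored='true' />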

Related

Apache Solr Query Parse Error during data import when using SolrEntityProcessor

When I try to do an import of the schooLocationDetails Solr core, I get the error below. I am using Solr 5.3.1.
Exception while processing: opportunityDetails document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://:<solr_pwd>#<solr_server>:<solr_port>/solr/locationCore: org.apache.solr.search.SyntaxError: Cannot parse 'locationId:': Encountered "" at line 1, column 22.
Below is my data-config.xml for the Solr core schooLocationDetails.
<dataConfig>
<document>
<entity name="school" dataSource="datasource" query="select * from school_table" transformer="RegexTransformer">
<field column="recordKey" name="recordKey" />
<field column="name" name="name" />
<field column="location" name="location" />
<field column="title" name="title" />
</entity>
<entity name="locationDetail" processor="SolrEntityProcessor" url="http://<solr-user>:<solr_pwd>#<solr_server>:<solr_port>/solr/locationCore" query="locationId:${school.location}"
fl="*,old_version:_version_">
<field column="locationId" name="locationId" />
<field column="city" name="city" />
<field column="state" name="state" />
<field column="old_version" name="old_version" />
</entity>
</document>
</dataConfig>
You have to nest the entity that references the value inside the other entity. When they are two separate entities, they can't reference each other's values (and they'll simply be imported one after the other instead).
<entity name="school" dataSource="datasource" query="select * from school_table" transformer="RegexTransformer">
<field column="recordKey" name="recordKey" />
<field column="name" name="name" />
<field column="location" name="location" />
<field column="title" name="title" />
<entity name="locationDetail" processor="SolrEntityProcessor" url="" query="locationId:${school.location}"
fl="*,old_version:_version_">
<field column="locationId" name="locationId" />
<field column="city" name="city" />
<field column="state" name="state" />
<field column="old_version" name="old_version" />
</entity>
</entity>

Configuring a Nested Entity using DIH in Solr

I want to create a nested entity with DIH using Solr 6.x.
I read
Defining nested entities in Solr Data Import Handler
and the JIRA issue https://issues.apache.org/jira/browse/SOLR-5147.
What I did:
Schema.xml
<fields>
<field name="variantList" type="string" indexed="true" stored="true" />
<field name="variantList.variants" type="string" multiValued="false" required="false"/>
<field name="variantList.stockMinimum" type="int" multiValued="false" required="false"/>
<field name="variantList.stockOnHand" type="int" multiValued="false" required="false"/>
<field name="variantList.stockVariantId" type="long" multiValued="false" required="false"/>
</fields>
data-config.xml
<dataConfig>
<dataSource />
<document>
<entity name="PARENT" rootEntity='true' query="*" >
<field column="ID" name="id" />
<field column="BRAND_ID" name="brandId" />
<field column="PRODUCT_ID" name="productId" />
<field column="MERCHANT_PRODUCT_ID" name="merchantProductId" />
<field column="MERCHANT_ID" name="merchantId" />
<field column="SALES_REGION" name="salesRegion" />
<field column="LOCAL_DIRECT_DELIVERY" name="localDirectDelivery" />
<field column="NORMAL_SELLINGPRICE" name="normalSellingPrice" />
<field column="NEW_PRODUCT" name="newProduct" />
<field column="BEST_SELLER" name="bestSeller" />
<field column="CATEGORY1_ID" name="category1Id" />
<field column="CATEGORY2_ID" name="category2Id" />
<field column="CATEGORY3_ID" name="category3Id" />
<field column="CATEGORY4_ID" name="category4Id" />
<field column="DISPLAY_IMAGE_PATH" name="displayImagePath" />
<field column="MERCHANT_NAME" name="merchantName" />
<field column="PRODUCT_NAME" name="productName" />
<field column="CATEGORY1_NAME" name="category1Name" />
<field column="CATEGORY2_NAME" name="category2Name" />
<field column="CATEGORY3_NAME" name="category3Name" />
<field column="CATEGORY4_NAME" name="category4Name" />
<entity name="variantList" child="true" query="select VARIANT , STOCK_MINIMUM , STOCK_ONHAND , ID from SIF_MERCHANT_CATALOG_VARIANT
where MERCHANT_CATALOG_ID = '${PARENT.ID}'">
<field column="VARIANT" name="variantList.variants_s" />
<field column="STOCK_MINIMUM" name="variantList.stockMinimum" />
<field column="STOCK_ONHAND" name="variantList.stockOnHand" />
<field column="ID" name="variantList.stockVariantId" />
</entity>
</entity>
</document>
</dataConfig>
The result I want:
<doc parent_1/>
<doc child_1/>
<doc child_1/>
<doc parent_2/>
<doc child_1/>
and what I get:
<doc child_1/>
<doc child_1/>
<doc parent_1/>
<doc child_2/>
<doc parent_2/>
I also saw aheryan's answer; it should be right that I can use child=true.
Am I missing something?
Thanks.
The child docs are returned together with the parent docs, as a flat list, if you just do a general query. So that's probably what you are seeing.
The easiest way to check whether you got nested documents is to look at the value of the _root_ field, as the value will be the same for all documents in the parent/child hierarchy block.
You could also search for parent documents only and use the Child Document Transformer to list their children.
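A hedged sketch of both checks (the doc_type field with value parent is an assumption not present in the config above; any query that matches only parent documents works as the parentFilter):
To see the block structure, look at the _root_ values:
q=*:*&fl=id,_root_
To return only parents, each with its children attached:
q=doc_type:parent&fl=*,[child parentFilter=doc_type:parent]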

Why does the Solr Data Import Handler hash the uniqueKey?

I have a very strange problem with Solr 4.6.0.
The uniqueKey field "id" contains a hash for every document instead of my string value. If I add just a single custom document via the update request handler in the Solr admin, I get, for example, the ID value "book_45" that I specified, so that is correct.
But when I do a full import with the DIH (Data Import Handler), the id field contains a hash for every document, like "[B#53bd370f", instead of my custom value. So the problem must be in the DIH.
My import script:
<dataConfig>
<dataSource
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://host/database"
user="user"
password="password" />
<document name="project">
<entity name="document" transformer="RegexTransformer"
query="SELECT CONCAT('book_', b.id) AS book_id, b.slug, b.title, b.isbn,
b.publisher, b.releaseYear AS release_year, b.language, b.pageCount AS page_count, b.description,
b.print, b.addedBy_id AS added_by_id, b.dt AS created,
GROUP_CONCAT(a.name SEPARATOR ';') AS authors
FROM Book b
LEFT JOIN author_book ab ON ab.book_id = b.id
LEFT JOIN Author a ON a.id = ab.author_id
GROUP BY b.id
">
<field column="book_id" name="id" />
<field column="slug" name="book_slug" />
<field column="title" name="book_title" />
<field column="isbn" name="book_isbn" />
<field column="publisher" name="book_publisher" />
<field column="release_year" name="book_release_year" />
<field column="language" name="book_language" />
<field column="page_count" name="book_page_count" />
<field column="description" name="book_description" />
<field column="print" name="book_print" />
<field column="added_by_id" name="book_added_by_id" />
<field column="created" name="book_created" />
<field column="authors" splitBy=";" name="authors" />
</entity>
</document>
</dataConfig>
The id field in my schema.xml (which is the same as in the default shipped core collection1):
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
Does anyone know what I am missing?
The value [B#53bd370f is not a hash, but the result of a byte[].toString(). Whatever MySQL is returning is being treated as a byte[] instead of a String.
Try casting the id to varchar or char like this:
SELECT cast(CONCAT('book_', b.id) as CHAR) AS book_id...
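If you'd rather not change the SQL, a hedged alternative is the convertType switch on the JdbcDataSource, which asks DIH to convert each column to the target schema field's type (worth verifying on your 4.6.0 setup before relying on it):
<dataSource
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://host/database"
user="user"
password="password"
convertType="true" />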

Solr: only index content whose text has a specified minimum length/size

I'm trying to import the Alemannic Wikipedia XML dump. I specified some regex rules to ignore Wikipedia pages like categories, files, templates, and so on. This configuration works without any problems.
But then I wanted to restrict the indexing to documents whose contents field is at least 200 characters long, and I cannot think of any way to do it. I tried some regexes, but then the indexing would always fail instantly (something like (.*){5} doesn't seem to be supported?).
Does anyone know a regex that is supported by Solr to skip documents with 200 or fewer characters? Or is there any other way to achieve this behaviour?
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/home/patrick/Desktop/alswiki-20130413-pages-articles.xml" transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer,TemplateTransformer">
<field column="origid" xpath="/mediawiki/page/id" />
<field column="id" regex="^(.*)$" replaceWith="als-$1" sourceColName="origid" />
<field column="name" xpath="/mediawiki/page/title" />
<field column="revision_id" xpath="/mediawiki/page/revision/id" />
<field column="user" xpath="/mediawiki/page/revision/contributor/username" />
<field column="contents" xpath="/mediawiki/page/revision/text" stripHTML="true" />
<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="source" template="Swiss Wiki"/>
<field column="$skipDoc" regex="^#REDIRECT.*" replaceWith="true" sourceColName="contents"/>
<field column="$skipDoc" regex="^#WEITERLEITUNG.*" replaceWith="true" sourceColName="contents"/>
<field column="$skipDoc" regex="^#Redirect.*" replaceWith="true" sourceColName="contents"/>
<field column="$skipDoc" regex="^Wikipedia:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^MediaWiki:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^Vorlage:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^Datei:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^Hilfe:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^Portal:.*" replaceWith="true" sourceColName="name"/>
<field column="$skipDoc" regex="^Kategorie:.*" replaceWith="true" sourceColName="name"/>
</entity>
</document>
</dataConfig>
Firstly, .{5} will match exactly 5 characters and .{5,} will match five or more; so for 200 it would be .{200,}. (Put the quantifier on the character class rather than on a repeated group like (.*){5}.)
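For example, a hedged sketch of how a length check could be wired into the existing $skipDoc rules (the [\s\S] class is used so the match also spans newlines, and the 0,200 bound assumes you want to skip documents of 200 characters or fewer; untested against your dump):
<field column="$skipDoc" regex="^[\s\S]{0,200}$" replaceWith="true" sourceColName="contents"/>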
If this doesn't work, it is reasonably easy to write a custom transformer:
http://wiki.apache.org/solr/DIHCustomTransformer

Full-import failing when using CachedSqlEntityProcessor giving OutOfMemoryError Exception

A full-import fails when using CachedSqlEntityProcessor, giving the exception
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I resolve this issue?
Without using CachedSqlEntityProcessor, it takes 15 hours to index.
My products-data-config.xml is:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/localbazaar" user="root" password="sa" batchSize="100" />
<document name="products">
<entity name="domainProduct" query="SELECT p.PRODUCT_ID, p.NAME, LOWER(REPLACE(REPLACE(p.NAME,' ','-'),'/','-')) AS purl, p.description, p.BRAND_ID, p.CATEGORY_ID, p.GROUP_ID, p.MIN_PRICE, p.MAX_PRICE, p.AUTHOR, p.ISBN10, p.ISBN13, p.OLID, p.EAN13, p.UPCA, p.SKU, p.LANGUAGE, p.FORMAT, p.PUBLISHER, p.SUBJECT, c.NAME AS cname, c.URL_NAME, b.NAME AS bname, LOWER(REPLACE(REPLACE(b.NAME,' ','-'),'/','-')) AS bUrl, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',i.IMAGE_NAME) AS productImage FROM product_t p LEFT OUTER JOIN category_t c ON (c.CATEGORY_ID=p.CATEGORY_ID) LEFT OUTER JOIN brand_t b ON (b.BRAND_ID=p.BRAND_ID) LEFT OUTER JOIN image_t i ON (i.ASSET_ID=p.PRODUCT_ID AND i.ASSET_TYPE_ID = 4 AND i.IMAGE_TYPE_ID = 0)">
<field column="PRODUCT_ID" name="productId" />
<field column="NAME" name="productName" />
<field column="purl" name="productUrlName" />
<field column="description" name="productDescription" />
<field column="BRAND_ID" name="brandId" />
<field column="CATEGORY_ID" name="categoryId" />
<field column="GROUP_ID" name="groupId" />
<field column="MIN_PRICE" name="minPrice" />
<field column="MAX_PRICE" name="maxPrice" />
<field column="AUTHOR" name="author" />
<field column="ISBN10" name="isbn10" />
<field column="ISBN13" name="isbn13" />
<field column="OLID" name="olid" />
<field column="EAN13" name="ean13" />
<field column="UPCA" name="upca" />
<field column="SKU" name="sku" />
<field column="LANGUAGE" name="language" />
<field column="FORMAT" name="format" />
<field column="PUBLISHER" name="publisher" />
<field column="SUBJECT" name="subject" />
<field column="cname" name="categoryName" />
<field column="URL_NAME" name="categoryUrlName" />
<field column="bname" name="brandName" />
<field column="bUrl" name="brandUrlName" />
<field column="productImage" name="productImage" />
<entity name="specifications" query="select PRODUCT_ID, CONCAT(PROPERTY_NAME,':::',property_value) as specifications FROM product_properties_t " processor="CachedSqlEntityProcessor" where="PRODUCT_ID=domainProduct.PRODUCT_ID" />
</entity>
</document>
</dataConfig>
My store-products-data-config.xml is:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/localbazaar" user="root" password="sa" batchSize="100" />
<document name="products">
<entity name="domainStoreProduct" query="SELECT sp.STORE_PRODUCT_ID, sp.STORE_ID, sp.PRODUCT_ID, sp.MIN_PRICE, sp.MAX_PRICE, sp.STORE_TYPE_ID, sp.BUY_X, sp.GET_Y, s.NAME AS sname, LOWER(REPLACE(REPLACE(s.NAME,' ','-'),'/','-')) AS sUrl, s.DESCRIPTION AS sdesc, s.WEB_SITE_UTL, s.EMAIL, s.PHONE, s.MOBILE, s.ACTIVE AS act, a.ADDRESS_ID, a.location, LOWER(REPLACE(REPLACE(a.location,' ','-'),'/','-')) AS urlLoc, a.ADDRESS_LINE1, a.ADDRESS_LINE2, a.LATITUDE, a.LONGITUDE, a.zipcode, a.LANDMARK, a.CITY, CONCAT(a.LATITUDE,',',a.LONGITUDE) AS ll, p.NAME AS pname, LOWER(REPLACE(REPLACE(p.NAME,' ','-'),'/','-')) AS purl, p.description AS pdesc, p.BRAND_ID, p.CATEGORY_ID, p.GROUP_ID, p.AUTHOR, p.ISBN10, p.ISBN13, p.OLID, p.EAN13, p.UPCA, p.SKU, p.LANGUAGE, p.FORMAT, p.PUBLISHER, p.SUBJECT, c.NAME AS cname, c.URL_NAME, b.NAME AS bname, LOWER(REPLACE(REPLACE(b.NAME,' ','-'),'/','-')) AS bUrl, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',ip.IMAGE_NAME) AS pImage, CONCAT('http://partnercenter.localbazaar.com/image?imageId=',ist.IMAGE_NAME) AS sImage, ci.CITY_ID FROM store_products_t sp LEFT OUTER JOIN store_t s ON (sp.STORE_ID=s.STORE_ID) LEFT OUTER JOIN address_t a ON (a.ASSET_TYPE_ID=3 AND a.ASSET_ID=sp.STORE_ID) LEFT OUTER JOIN product_t p ON (p.PRODUCT_ID=sp.PRODUCT_ID) LEFT OUTER JOIN category_t c ON (c.CATEGORY_ID=p.CATEGORY_ID) LEFT OUTER JOIN brand_t b ON (b.BRAND_ID=p.BRAND_ID) LEFT OUTER JOIN image_t ip ON (ip.ASSET_ID=sp.PRODUCT_ID AND ip.ASSET_TYPE_ID=4 AND ip.IMAGE_TYPE_ID=0) LEFT OUTER JOIN image_t ist ON (ist.ASSET_ID=sp.STORE_ID AND ist.ASSET_TYPE_ID=3 AND ist.IMAGE_TYPE_ID=0) LEFT OUTER JOIN city_t ci ON (ci.NAME=a.CITY)">
<field column="STORE_PRODUCT_ID" name="storeProductId" />
<field column="STORE_ID" name="storeId" />
<field column="PRODUCT_ID" name="productId" />
<field column="MIN_PRICE" name="storeMinPrice" />
<field column="MAX_PRICE" name="storeMaxPrice" />
<field column="STORE_TYPE_ID" name="storeTypeId" />
<field column="BUY_X" name="buyX" />
<field column="GET_Y" name="getY" />
<field column="sname" name="storeName" />
<field column="sUrl" name="storeUrlName" />
<field column="sdesc" name="description" />
<field column="WEB_SITE_UTL" name="webSiteUrl" />
<field column="EMAIL" name="email" />
<field column="PHONE" name="phone" />
<field column="MOBILE" name="mobile" />
<field column="act" name="active" />
<field column="ADDRESS_ID" name="addressId" />
<field column="location" name="location" />
<field column="urlLoc" name="urlLocation" />
<field column="ADDRESS_LINE1" name="addressLine1" />
<field column="ADDRESS_LINE2" name="addressLine2" />
<field column="LATITUDE" name="latitude" />
<field column="LONGITUDE" name="longitude" />
<field column="zipcode" name="zipcode" />
<field column="LANDMARK" name="landmark" />
<field column="CITY" name="city" />
<field column="ll" name="latlong" />
<field column="pname" name="productName" />
<field column="purl" name="productUrlName" />
<field column="pdesc" name="productDescription" />
<field column="BRAND_ID" name="brandId" />
<field column="CATEGORY_ID" name="categoryId" />
<field column="GROUP_ID" name="groupId" />
<field column="AUTHOR" name="author" />
<field column="ISBN10" name="isbn10" />
<field column="ISBN13" name="isbn13" />
<field column="OLID" name="olid" />
<field column="EAN13" name="ean13" />
<field column="UPCA" name="upca" />
<field column="SKU" name="sku" />
<field column="LANGUAGE" name="language" />
<field column="FORMAT" name="format" />
<field column="PUBLISHER" name="publisher" />
<field column="SUBJECT" name="subject" />
<field column="cname" name="categoryName" />
<field column="URL_NAME" name="categoryUrlName" />
<field column="bname" name="brandName" />
<field column="bUrl" name="brandUrlName" />
<field column="pImage" name="productImage" />
<field column="sImage" name="storeImage" />
<field column="CITY_ID" name="cityId" />
<entity name="specifications" query="select PRODUCT_ID, CONCAT(PROPERTY_NAME,':::',property_value) as specifications FROM product_properties_t " processor="CachedSqlEntityProcessor" WHERE="PRODUCT_ID= domainStoreProduct.PRODUCT_ID" />
<entity name="storeProperties" query="select STORE_ID, CONCAT(PROPERTY_ID,':::',PROPERTY_VALUE) as storeProperties FROM store_properties_t " processor="CachedSqlEntityProcessor" WHERE="STORE_ID=domainStoreProduct.STORE_ID" />
</entity>
</document>
</dataConfig>
You can try different things:
Try setting the batchSize property. If you tune it correctly, you can increase the performance of your datasource.
SELECT * is ALWAYS slower than naming the columns you need (even if you need all of them). I would suggest using SELECT PRODUCT_ID, NAME, ... instead of *.
Why do you have the entities b, i and s? You don't use the fields from them, so I don't think they're very useful.
Try using the CachedSqlEntityProcessor for your sub-entities (see the sketch after this list). It will only retrieve the data once and re-use it for each sub-entity.
Can your product belong to more than one category (is it a multi-valued field)? If not, writing one query using JOINs is faster than writing multiple entities.
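As a hedged sketch of the cached sub-entity form (the cacheKey/cacheLookup attributes key the in-memory cache on PRODUCT_ID; whether that cache fits in your heap depends on the size of product_properties_t):
<entity name="specifications"
processor="CachedSqlEntityProcessor"
cacheKey="PRODUCT_ID"
cacheLookup="domainProduct.PRODUCT_ID"
query="select PRODUCT_ID, CONCAT(PROPERTY_NAME,':::',property_value) as specifications FROM product_properties_t" />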
EDIT: I suggest separating this into two questions, because right now it's really confusing for other people to read your new question alongside my old answer.
I don't think you can choose where the CachedSqlEntityProcessor puts its cache (it's always in memory, I think). The problem with your 8 hours of data import is that, because we're talking about a lot of records, a lot of queries are issued (every sub-entity runs its own query).
The solution to your problem is to remove the sub-entity and, in your parent entity, add the query of your sub-entity as a comma-separated list (see the sketch at the end of this answer). I suggest looking at this answer.
If you do this, all your specifications (for example) will be stored inside one column as a comma-separated list. You can then use a Solr ScriptTransformer to split the values and create multiple values.
This limits everything to one big query and will also limit the use of RAM, since each row is processed individually. I have no clue what the performance will be, because you will have to parse each entity individually.
If this doesn't work, I don't think there is a better solution than to wait 8 hours for the data import to complete. You can't expect Solr to index it all in no time. You can try using a cron job to run this task overnight.
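As a hedged sketch of the one-big-query approach (the GROUP_CONCAT column, the ';' separator, and using the RegexTransformer's splitBy instead of a ScriptTransformer are all assumptions; the remaining columns and joins stay as in your original query):
<entity name="domainProduct" transformer="RegexTransformer"
query="SELECT p.PRODUCT_ID, p.NAME,
GROUP_CONCAT(CONCAT(pp.PROPERTY_NAME,':::',pp.property_value) SEPARATOR ';') AS specifications
FROM product_t p
LEFT OUTER JOIN product_properties_t pp ON (pp.PRODUCT_ID = p.PRODUCT_ID)
GROUP BY p.PRODUCT_ID">
<field column="PRODUCT_ID" name="productId" />
<field column="NAME" name="productName" />
<field column="specifications" splitBy=";" name="specifications" />
</entity>
Mind MySQL's group_concat_max_len setting if the concatenated property lists can get long.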
