Data generation: generate one entity with several date constraints - database

A quick one: I am looking for a tool for data generation. I have an entity with dates: the date it was made, a start date and an end date. I want the data generation to take care of these constraints:
made may be today or any day after
start may be equal to made but never before it
end may only be a day after start, or any later date after start
I looked at http://generatedata.com and http://mockaroo.com, but they didn't have a way I could maintain the constraints. I just need those constraints enforced, but I'm not sure which software to try. I just need quick data to test my application. Thanks.
And just as an aside: have you ever been in a situation where you couldn't find what you needed?

Benerator is the tool to use. It is very flexible, though one needs to learn it, and that goes pretty fast. With my situation above, in Benerator's XML descriptor file (that's what it uses), I just write the following and I'm good to go. In fact, I can even put ranges on the made, start and end dates. This is a section of a generate tag for 30 records of an entity (let's call it MY_ENTITY) with those dates:
<import class="org.databene.commons.TimeUtil"/>
<generate name="MY_ENTITY" count="30" consumer="ENTITY_OUT">
    <!-- MADE_DATE: today -->
    <attribute name="MADE_DATE" type="date" script="TimeUtil.today()"/>
    <!-- START_DATE: 0..10 days after MADE_DATE, so never before it -->
    <variable name="for_startDate" type="int" min="0" max="10"/>
    <attribute name="START_DATE" type="date"
               script="TimeUtil.addDays(this.MADE_DATE, for_startDate)" nullable="false"/>
    <!-- END_DATE: 1..10 days after START_DATE, so always strictly after it -->
    <variable name="for_endDate" type="int" min="1" max="10"/>
    <attribute name="END_DATE" type="date"
               script="TimeUtil.addDays(this.START_DATE, for_endDate)" nullable="false"/>
</generate>
Benerator supports many databases through JDBC, and it comes loaded with several JDBC drivers. Try it here: http://bergmann-it.de/test-software/index.php?lang=en. It's open source.
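If you want Benerator to write straight into a database rather than to a console or file consumer, the descriptor can declare a JDBC target. A minimal sketch (URL, driver and credentials are placeholders, and the exact attribute set may vary between Benerator versions):
<database id="db" url="jdbc:hsqldb:mem:testdb"
          driver="org.hsqldb.jdbcDriver" user="sa" password=""/>
<generate name="MY_ENTITY" count="30" consumer="db">
    <!-- same attributes and variables as above -->
</generate>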

Related

Bypass double quotes inside XML value

Check @lptr's comment for the solution.
I have this piece of XML from which I need to extract the values and ids using SQL Server:
<root>
<field id="1" value="gfjsdgfdjy duahsd "absdjsd"" />
<field id="37" value="ysgfdyua" />
<field id="13" value="asdas" />
<field id="73" value="fgdgfd" />
<field id="adsf" value="fdsa" />
</root>
This is what I use to extract the values and ids from that XML, which is stored in the variable @test, and insert them into a temp table:
insert into #tmp (field, val)
select field.value('@id', 'nvarchar(100)') as fieldID,
       field.value('@value', 'nvarchar(200)') as val
from @test.nodes('root/field') A(field)
That query works fine until there's a value containing double quotes, like in the example above, which throws the following error: XML parsing: line 1, character 108, whitespace expected
Any way of working around that?
I have to mention that I do not create these XMLs by hand; I get them from a DB, so any mistakes in their creation are not my fault.
Conformant XML parsers are required to report all well-formedness errors, so any conformant XML parser will reject this ill-formed XML.
You can sometimes get around this by using parsers (which are not conformant XML parsers) that attempt to repair errors. However, I don't know whether any of them are capable of handling this particular problem. Note that in the general case it cannot even be detected; consider:
<root>
<field id="1" value="some " inner " 3" />
<field id="2" value="some " inner= " 3" />
</root>
The second field is well-formed XML.
Whenever a supplier provides you with ill-formed XML, you need to complain, as you would with any other product defect. Usually I have found suppliers very responsive to such error reports. It's surprising how often the XML export capability was entrusted to some junior programmer with no XML knowledge or experience, even if the company concerned is a very professional outfit.
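For reference, a well-formed version of the offending row would have the inner quotes escaped as &quot; (or the attribute delimited with single quotes instead):
<root>
    <field id="1" value="gfjsdgfdjy duahsd &quot;absdjsd&quot;" />
</root>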

Dynamic TableName SOLR data import handler

I'm looking to configure SOLR to query a table based on certain data.
I unfortunately have to work with how the database is set up, but here's what I'm after.
I have a table named Company that will contain a certain "prefix" value.
I want to use that prefix value to determine what tables I should query for the DIH.
As a quick sample:
<entity name="company" query="Select top 1 prefix from Company">
<field name="prefix" column="prefix"/>
<entity name="item" query="select * from ${company.prefix}item">
<field column="ItemID" name="id"/>
<field column="Description" name="description/>
</entity>
</entity>
However, I only ever seem to get one document processed, despite that table containing over 200,000 rows.
What am I doing wrong?
I think you could achieve this by:
using a stored procedure. You can call an SP from DIH as seen here.
Inside the stored procedure, you can do the table lookup as needed, and then return the results from the real query.
Depending on how good you are with MSSQL's SQL, you might be able to put everything into a single SQL query and use that directly in DIH, but I'm not sure about that.
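A sketch of what such a stored procedure could look like (the procedure name is hypothetical; QUOTENAME guards the dynamically built table name):
CREATE PROCEDURE dbo.GetPrefixedItems
AS
BEGIN
    DECLARE @prefix nvarchar(50), @sql nvarchar(max);
    -- look up the prefix first
    SELECT TOP 1 @prefix = prefix FROM Company;
    -- then build and run the real query against the prefixed table
    SET @sql = N'SELECT ItemID, Description FROM ' + QUOTENAME(@prefix + N'item');
    EXEC sp_executesql @sql;
END
The DIH entity would then just call it, assuming your JDBC driver accepts EXEC in the query attribute:
<entity name="item" query="EXEC dbo.GetPrefixedItems">
    <field column="ItemID" name="id"/>
    <field column="Description" name="description"/>
</entity>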

solr: how to search for date ranges with at least X days?

Given: a list of consultants with a list of intervals when they are NOT available:
<consultant>
    <id>1</id>
    <not-available>
        <interval><from>2013-01-01</from><to>2013-01-10</to></interval>
        <interval><from>2013-01-20</from><to>2013-01-30</to></interval>
        ...
    </not-available>
</consultant>
...
I'd like to search for consultants that are available (!) for at least X days in a specific interval from STARTDATE to ENDDATE.
Example: Show me all consultants that are available for at least 5 days in the range 2013-01-01 - 2013-02-01 (this would match consultant 1 because he is free from 2013-01-11 to 2013-01-19).
Question 1: What should my Solr document look like?
Question 2: What should the query look like?
As general advice: precalculate as much as you can, and store the data that you are querying for rather than the data you are getting as input.
Also, use several indexes based on different entities - if you have the liberty to do so, and if the queries would become simpler and more straightforward.
Ok, generalities aside and on to your question.
From your example I take it that you currently store in the index whether a consultant is not available - probably because that is what you get as input. But what you want to query is when they are available. So you should think about storing the availability rather than the non-availability.
EDIT:
The most straightforward way to query this is to use the intervals themselves as entities, so that you do not have to resort to special Solr features to query the start and the end of an interval on two multi-valued fields.
Once you have stored the availability intervals you can also precalculate and store their lengths:
<!-- id of the interval -->
<field name="id" type="int" indexed="true" stored="true" multiValued="false" />
<field name="consultant_id" type="int" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_start" type="date" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_end" type="date" indexed="true" stored="true" multiValued="false" />
<field name="interval_length" type="int" indexed="true" stored="true" multiValued="false" />
Your query:
(1.) Optionally, retrieve all intervals that have at least the requested length:
fq=interval_length:[5 TO *]
This is an optional step. You might want to benchmark whether it improves the query performance.
Additionally, you could also filter on certain consultant_ids.
(2.) The essential query is for the interval (use q.alt in case of dismax handler):
q=interval_start:[2013-01-01T00:00:00.000Z TO 2013-02-01T00:00:00.000Z-5DAYS]
interval_end:[2013-01-01T00:00:00.000Z+5DAYS TO 2013-02-01T00:00:00.000Z]
(added linebreak for readability, the two components of the query should be separated by regular space)
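Put together as one request it would look roughly like this (against a default select handler; note that the + in the date math has to be URL-encoded as %2B when sent over HTTP, and spaces are shown unencoded for readability):
http://localhost:8983/solr/select?fq=interval_length:[5 TO *]
    &q=interval_start:[2013-01-01T00:00:00.000Z TO 2013-02-01T00:00:00.000Z-5DAYS]
    interval_end:[2013-01-01T00:00:00.000Z%2B5DAYS TO 2013-02-01T00:00:00.000Z]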
Make sure that you always set the time to the same value. Best is 00:00:00 because that is what /DAY does: http://lucene.apache.org/solr/4_4_0/solr-core/org/apache/solr/util/DateMathParser.html .
The fewer distinct values, the better the caching.
More info:
http://wiki.apache.org/solr/SolrQuerySyntax - Solr Range Query
http://wiki.apache.org/solr/SolrCaching#filterCache - caching of fq filter results
EDIT:
More info on q and fq parameters:
http://wiki.apache.org/solr/CommonQueryParameters
They are handled differently when it comes to caching. That's why I added the other link (see above) in the first place. Use fq for filters that you expect to appear often in your queries. You can combine multiple fq parameters, while you can only specify q once per request.
How can I "use several indexes based on different entities"?
Have a look at the multicore feature: http://wiki.apache.org/solr/CoreAdmin
Would it be overkill to save, for each available day, date;num_of_days_to_end_of_interval? That should make querying much simpler.
Depends a bit on how much more data you would be expecting in that case. I'm also not exactly sure that it would really help you with the query you posted. The date range queries are very flexible and fast; you don't need to avoid them. Just make sure you specify the time as broadly as you can to allow for caching.

Solr: How to distinguish between multiple entities imported through DIH

When using the DataImportHandler with SqlEntityProcessor, I want to have several entity definitions going into the same schema with different queries.
How can I search both types of entities while also distinguishing their source? Example:
<document>
<entity name="entity1" query="query1">
<field column="column1" name="column1" />
<field column="column2" name="column2" />
</entity>
<entity name="entity2" query="query2">
<field column="column1" name="column1" />
<field column="column2" name="column2" />
</entity>
</document>
How do I get data from both entity 1 and entity 2?
As long as your schema fields (e.g. column1, column2) are compatible between the different entities, you can just run the DataImportHandler and it will populate the Solr collection from both queries.
Then, when you query, you will see all entities combined.
If you want to mark which entity came from which source, I would recommend adding another field (e.g. type) and assigning different static values to it in each entity definition using TemplateTransformer.
Also beware of using the clean command. By default it deletes everything from the index. As you are populating the index from several sources, you need to make sure it does not delete too much. Use preImportDeleteQuery to delete only the entries whose type field matches the value you set for that entity.
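Putting both tips together, a sketch of the DIH config could look like this (TemplateTransformer fills the constant type field, and preImportDeleteQuery restricts what a clean run deletes; the names and queries are placeholders from the question):
<document>
    <entity name="entity1" query="query1" transformer="TemplateTransformer"
            preImportDeleteQuery="type:entity1">
        <field column="column1" name="column1" />
        <field column="column2" name="column2" />
        <field column="type" template="entity1" />
    </entity>
    <entity name="entity2" query="query2" transformer="TemplateTransformer"
            preImportDeleteQuery="type:entity2">
        <field column="column1" name="column1" />
        <field column="column2" name="column2" />
        <field column="type" template="entity2" />
    </entity>
</document>
At query time you can then restrict to one source with fq=type:entity1, or search both and read the type field to tell them apart.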

How to work with liquibase, a concrete example

Following the quick start on Liquibase, I've created a changeset (a very dumb one :) )
Code:
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
xmlns="http://www.liquibase.org/xml/ns/dbchangelog/1.6"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog/1.6
http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-1.6.xsd">
<changeSet id="1" author="me">
<createTable tableName="first_table">
<column name="id" type="int">
<constraints primaryKey="true" nullable="false"/>
</column>
<column name="name" type="varchar(50)">
<constraints nullable="false"/>
</column>
</createTable>
<createTable tableName="new_table">
<column name="id" type="int">
<constraints primaryKey="true" nullable="false"/>
</column>
</createTable>
</changeSet>
</databaseChangeLog>
I've created a clean schema and I've launched the migrate command.
Liquibase created the database, with the support tables databasechangelog and databasechangeloglock.
Now how can I track changes? I've modified the changeset by adding a new createTable element, but when I try the update command, Liquibase tells me this:
Migration Failed: Validation Failed:
1 change sets check sum
So I don't think I have understood the way to work with Liquibase.
Can someone point me in the right direction?
Thanks
You should never modify a <changeSet> that was already executed. Liquibase calculates checksums for all executed changeSets and stores them in the log. It will then recalculate that checksum, compare it to the stored ones and fail the next time you run it if the checksums differ.
What you need to do instead is to add another <changeSet> and put your new createTable element in it.
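So instead of editing changeSet 1, append something like this to the same changelog (the table name is only an example):
<changeSet id="2" author="me">
    <createTable tableName="third_table">
        <column name="id" type="int">
            <constraints primaryKey="true" nullable="false"/>
        </column>
    </createTable>
</changeSet>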
QuickStart is good readin' but it is indeed quick :-) Check out the full manual, particularly its ChangeSet section.
The currently accepted answer is slightly out of date based on changes in Liquibase 2.x. In 2.x, Liquibase will still fail if the MD5 checksum has changed for a changeSet, but you can specify the runOnChange attribute if you want to be able to modify it.
From the documentation:
runOnChange - Executes the change the first time it is seen and each time the change set has been changed
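For instance, a changeSet that you intend to keep editing, such as a view definition, could be declared like this (a sketch; createView with replaceIfExists is available in 2.x):
<changeSet id="3" author="me" runOnChange="true">
    <createView viewName="first_table_view" replaceIfExists="true">
        select id, name from first_table
    </createView>
</changeSet>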
If it's a change to a changeSet that has basically already been executed, you can manually modify the database so that its stored MD5 checksum for that changeSet matches the new one. That's good for minor textual changes. Or you can delete that changeSet's row from the log table.
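The least fragile variant of that manual fix is usually to null out the stored checksum, since Liquibase will then recompute and store it on the next run without re-executing the changeSet (assuming the default log table name):
-- clear the stored checksum for changeSet 1 so Liquibase recalculates it
UPDATE DATABASECHANGELOG
SET MD5SUM = NULL
WHERE ID = '1' AND AUTHOR = 'me';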
