I am trying to connect to SQL Server with spark-jdbc, using JDBC_SESSION_INIT_STATEMENT to create a temporary table and then load data from that temporary table in the main query.
I have the following code:
//df is org.apache.spark.sql.DataFrameReader
val s = """select * into #tmp_table from ( SELECT op.ID,
| op.Date,
| op.DocumentID,
| op.Amount,
| op.AmountCurr,
| op.CurrencyID,
| operson.ObjectTypeId AS PersonOT,
| op.PersonID,
| ocontract.ObjectTypeId AS ContractOT,
| op.ContractID,
| op.DocNum,
| op.MomentCreate,
| op.ObjectTypeID,
| op.OwnerObjectID
|FROM dbo.Operation op With (Index = IX_Operation_Date) -- Without the hint it sometimes falls back to a full table scan
|LEFT JOIN dbo.Object ocontract ON op.ContractID = ocontract.ID
|LEFT JOIN dbo.Object operson ON op.PersonID = operson.ID
|WHERE op.Date>='2019-01-01' and op.Date<'2020-01-01' AND 1=1
|) wrap_for_single_connect
|OPTION (LOOP JOIN, FORCE ORDER, MAX_GRANT_PERCENT=25)""".stripMargin
df
.option(JDBCOptions.JDBC_SESSION_INIT_STATEMENT, s)
.jdbc(
jdbcUrl,
"(select * from tempdb.#tmp_table) sub",
connectionProps)
I get com.microsoft.sqlserver.jdbc.SQLServerException: Invalid object name '#tmp_table'.
I have a feeling that JDBC_SESSION_INIT_STATEMENT is not being executed at all, because I deliberately broke the statement and still got the same Invalid object error.
How can I check whether the statement in JDBC_SESSION_INIT_STATEMENT is actually executed?
One way to know whether your JDBCOptions.JDBC_SESSION_INIT_STATEMENT is executed is to enable INFO logging level for org.apache.spark.sql.execution.datasources.jdbc logger.
That should trigger this line and print out the following message to the logs:
Executing sessionInitStatement: [sql]
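With the stock log4j configuration that Spark ships with, one way to do that (a sketch, assuming you configure logging through conf/log4j.properties) is to add:
log4j.logger.org.apache.spark.sql.execution.datasources.jdbc=INFO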
Given the comment in the source, I don't think you should use it to create a source table to load records from:
// This executes a generic SQL statement (or PL/SQL block) before reading
// the table/query via JDBC. Use this feature to initialize the database
// session environment, e.g. for optimizations and/or troubleshooting.
You should use the dbtable or query parameter instead.
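For example, a minimal sketch of the dbtable route, reusing the df, jdbcUrl and connectionProps from the question with an abbreviated column list: run the statement as a derived table instead of materialising #tmp_table first.
// pass the whole query as a parenthesised derived table with an alias
val querySub =
  """(SELECT op.ID, op.Date, op.DocumentID, op.Amount, op.CurrencyID
    |FROM dbo.Operation op
    |WHERE op.Date >= '2019-01-01' AND op.Date < '2020-01-01') sub""".stripMargin
val result = df.jdbc(jdbcUrl, querySub, connectionProps)
On Spark 2.4+ you can alternatively pass the bare SELECT (no wrapping parentheses or alias) through the query option, e.g. spark.read.format("jdbc").option("query", ...).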
Background:
I have a table with the following schema on a SQL Server. Updates to existing rows are possible, and new rows are also added to this table.
unique_id | user_id | last_login_date | count
123-111 | 111 | 2016-06-18 19:07:00.0 | 180
124-100 | 100 | 2016-06-02 10:27:00.0 | 50
I am using Sqoop to add incremental updates in lastmodified mode. My --check-column parameter is the last_login_date column. In my first run, I got the above two records into Hadoop - let's call this the current data. I noted that the last value (the max value of the check column from this first import) is 2016-06-18 19:07:00.0.
Assuming there is a change on the SQL Server side, I now have the following data there:
unique_id | user_id | last_login_date | count
123-111 | 111 | 2016-06-25 20:10:00.0 | 200
124-100 | 100 | 2016-06-02 10:27:00.0 | 50
125-500 | 500 | 2016-06-28 19:54:00.0 | 1
Row 123-111 has been updated with a more recent last_login_date value and its count column has also been updated. A new row 125-500 has also been added.
On my second run, Sqoop looks at all rows with a last_login_date greater than my known last value from the previous import - 2016-06-18 19:07:00.0
This gives me only the changed data, i.e. 123-111 and 125-500 records. Let's call this - new data.
Question
How do I do a merge join in Hadoop/Hive using the current data and the new data so that I end up with the updated version of 123-111, 124-100, and the newly added 125-500?
Changed data load using Sqoop is a two-phase process.
1st phase - load the changed data into a temporary (staging) table using the sqoop import utility.
2nd phase - merge the changed data with the old data using the sqoop-merge utility.
If the table is small (say, a few million records), just use a full load with sqoop import.
Sometimes it is possible to load only the latest partition - in that case use sqoop import with a custom query to load the partition, then instead of a merge simply INSERT OVERWRITE the loaded partition into the target table, or copy the files; this works faster than sqoop merge.
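If you do the merge in Hive itself, a hedged sketch could look like the following (users_current, users_delta and users_merged are placeholder table names, not taken from the question): union the existing data with the imported delta, keep the newest row per unique_id, and overwrite a merged target table.
INSERT OVERWRITE TABLE users_merged
SELECT unique_id, user_id, last_login_date, `count`
FROM (
  SELECT unique_id, user_id, last_login_date, `count`,
         ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY last_login_date DESC) AS rn
  FROM (
    SELECT unique_id, user_id, last_login_date, `count` FROM users_current
    UNION ALL
    SELECT unique_id, user_id, last_login_date, `count` FROM users_delta
  ) unioned
) ranked
WHERE rn = 1;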
You can change the existing Sqoop query (by specifying a new custom query) to get ALL the data from the source table instead of getting only the changed data. Refer to using_sqoop_to_move_data_into_hive. This would be the simplest way to accomplish it, i.e. doing a full data refresh instead of applying deltas.
https://github.com/markfink/dbslim
I'd like to execute stored procedures with DbSlim using FitNesse (Selenium, Xebium).
What I tried is:
!define dbQuerySelectCustomerbalance (
execute dbo.uspLogError
)
| script | Db Slim Select Query | !-${dbQuerySelectCustomerbalance}-! |
which gives a green indicator,
however SQL Server Profiler shows no actions/logging...
So what I'd like to know is: is it possible to use DbSlim to execute stored procedures, and if yes, what is the correct way to do it?
By the way, I have the database connection on one page, and on the query page I included the connection to the database. (Is that OK?)
Take out the !- ... -!. That syntax is used to escape wikified words, but in this case you want the variable to be translated to the actual query.
!define dbQuerySelectCustomerbalance ( execute dbo.uspLogError )
| script | Db Slim Select Query | ${dbQuerySelectCustomerbalance} |
| show | data by column index | 1 | and row index | 1 |
You can add the last line, which outputs the first column of the first row, for testing purposes if your SP returns a result (or you can create a simple SP just to test this out).
Specifying the connection anywhere before this block is fine, be it on the same page or in a SetUp/SuiteSetUp/normal page included or executed before it.
In an attempt to be as clear as possible: I have 4 tables in my database, as follows
Join_Contrato_Medidor
ID_Union (identity)
ID_Contrato
ID_Medidor
Omitido (filter ?)
Promedios
ID_Contrato
ID_Medidor
ID_Marchamo
{Info I want}
Medidores
ID_Medidor
ID_Dispensario (filter ?)
Marchamo
ID_Marchamo
My current SQL Statement...
SELECT {Promedios.LI_1, Promedios.LF_1, Promedios.Total_1, Promedios.Qva_1, ...}
FROM (((
Join_Contrato_Medidor LEFT OUTER JOIN
Promedios ON Join_Contrato_Medidor.ID_Contrato = Promedios.ID_Contrato)
LEFT OUTER JOIN
Medidores ON Join_Contrato_Medidor.ID_Medidor = Medidores.ID_Medidor)
LEFT OUTER JOIN
Marchamo ON Promedios.ID_Marchamo = Marchamo.ID_Marchamo)
WHERE (Join_Contrato_Medidor.ID_Contrato = ?) AND (Medidores.ID_Dispensario = ?) AND (Join_Contrato_Medidor.Omitido <> TRUE)
The output I'm obtaining:
Information Columns | Omitido | ID_Union
Info | False | 806
Info | False | 806
Info | False | 806
Info | False | 806
*I wanted to include an image but I cannot do so until I have more reputation :( *
I am joining those 4 tables. I am currently getting all the desired columns in the query output, but I only want the records where Join_Contrato_Medidor.Omitido <> TRUE, instead of getting ALL records that match the ID_Contrato and ID_Dispensario conditions.
As a sample, I am outputting ID_Union, which is the identity field of Join_Contrato_Medidor. It is marking all the records with a single ID_Union, which happens to be the only one of the 4 records that has Omitido <> TRUE. Also, the last 3 records have their Omitido field set to TRUE in the database, yet the query result shows FALSE.
If the question is unclear, please ask me for clarification.
Thanks in advance
After working on other things until I had to face this issue again, I am back checking it. Your comment led me to try whether switching the order of the tables would do the job, and it did! Thank you very much.
I started by querying the Promedios table first and then performed the rest of the joins. This gave me access to exactly the information I wanted. Moreover, I built all my subsequent queries following this order, which led to better, shorter queries.
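For reference, a hedged sketch of what the reordered statement might look like (the join conditions and column list are carried over from the original query as assumptions; this is not necessarily the poster's exact final statement):
SELECT Promedios.LI_1, Promedios.LF_1, Promedios.Total_1, Promedios.Qva_1
FROM (((
Promedios LEFT OUTER JOIN
Marchamo ON Promedios.ID_Marchamo = Marchamo.ID_Marchamo)
LEFT OUTER JOIN
Join_Contrato_Medidor ON Promedios.ID_Contrato = Join_Contrato_Medidor.ID_Contrato)
LEFT OUTER JOIN
Medidores ON Join_Contrato_Medidor.ID_Medidor = Medidores.ID_Medidor)
WHERE (Join_Contrato_Medidor.ID_Contrato = ?) AND (Medidores.ID_Dispensario = ?) AND (Join_Contrato_Medidor.Omitido <> TRUE)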
Here's my situation. I have a table with a bunch of URLs and crawl dates associated with them. When my program processes a URL, I want to INSERT a new row with a crawl date. If the URL already exists, I want to update the crawl date to the current datetime. With MS SQL or Oracle I'd probably use a MERGE command for this. With MySQL I'd probably use the ON DUPLICATE KEY UPDATE syntax.
I could do multiple queries in my program, which may or may not be thread safe. I could write a SQL function which has various IF...ELSE logic. However, for the sake of trying out Postgres features I've never used before, I'm thinking about creating an INSERT rule - something like this:
CREATE RULE Pages_Upsert AS ON INSERT TO Pages
WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
DO INSTEAD
UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;
This seems to actually work great. It probably loses some points on the "code readability" standpoint, as someone looking at my code for the first time would have to magically know about this rule, but I guess that could be solved with good code commenting and documentation.
Are there any other drawbacks to this idea, or maybe a "your idea sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if that matters.
UPDATE: Query plan since someone wanted it :)
"Insert (cost=2.79..2.81 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..0.01 rows=1 width=0)"
" One-Time Filter: ($0 IS NOT TRUE)"
""
"Update (cost=2.79..5.46 rows=1 width=111)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..2.67 rows=1 width=111)"
" One-Time Filter: $0"
" -> Seq Scan on pages (cost=0.00..2.66 rows=1 width=111)"
" Filter: ((url)::text = 'http://www.foo.com'::text)"
Ok, I managed to create a testcase. The result is that the update part is always executed, even on a fresh insert. COPY seems to bypass the rule system.
[For clarity I have put this into a separate reply]
DROP TABLE pages CASCADE;
CREATE TABLE pages
( url VARCHAR NOT NULL PRIMARY KEY
, html VARCHAR
, last TIMESTAMP
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE RULE Pages_Upsert AS ON INSERT TO pages
WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
DO INSTEAD (
UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;
COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX stdin 2000-09-18 23:30:00
\.
SELECT * FROM pages;
The result:
url | html | last
-------------------------------+------------+----------------------------
www.example.com://page1 | meuk1 | 2001-09-18 23:30:00
www.example.com://page2 | meuk2 | 2011-09-18 23:48:30.775373
www.example.com://page3 | meuk3 | 2011-09-18 23:48:30.783758
www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
www.example.com://pageX | stdin | 2000-09-18 23:30:00
(7 rows)
UPDATE: Just to prove it can be done:
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);
CREATE RULE Pages_Upsert AS ON INSERT TO vpages
DO INSTEAD (
UPDATE pages p0
SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
;
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
);
CREATE RULE Pages_Indate AS ON UPDATE TO vpages
DO INSTEAD (
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
;
UPDATE pages p0
SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
;
);
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;
-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX stdin 2000-09-18 23:30:00
-- \.
SELECT * FROM vpages;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
url | html | last
-------------------------------+------------+---------------------
www.example.com://page2 | meuk2 | 2002-09-18 23:30:00
www.example.com://page3 | meuk3 | 2003-09-18 23:30:00
www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
www.example.com://page1 | meuk1 | 1991-09-18 23:30:00
(6 rows)
The view is necessary to prevent the rewrite system from going into recursion.
Construction of a DELETE rule is left as an exercise to the reader.
Some good points from someone who should know it or be very near to someone like that ;-)
What are PostgreSQL RULEs good for?
Short story:
Do the rules work well with SERIAL and BIGSERIAL?
Do the rules work well with the RETURNING clauses of INSERT and UPDATE?
Do the rules work well with stuff like random()?
All of this boils down to the fact that the rule system is not row-driven but transforms your statements in ways you would never imagine.
Do yourself and your team mates a favour and stop using rules for things like that.
Edit: Your problem is well discussed in the PostgreSQL community. Search keywords are: MERGE, UPSERT.
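For readers on newer versions: PostgreSQL 9.5 and later have a native upsert (INSERT ... ON CONFLICT), which did not exist on the 9.0 release mentioned in the question. A sketch against the pages table from this question:
INSERT INTO pages(url, html, last)
VALUES ('http://www.foo.com', '<html>...</html>', NOW())
ON CONFLICT (url) DO UPDATE
SET html = EXCLUDED.html, last = NOW();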
I don't know if this gets too subjective, but here is what I think about your solution: it's all about semantics. When I do an insert, I expect an insert, not some fancy logic that maybe does an insert but maybe not. Indeed, that's what functions are for.
At first I'd try checking for the URL in your program and then choosing whether to insert or update. If that turned out to be too slow, I'd use a function. If you name it something like insert_or_update_url, you automatically get some documentation for free. The rewrite rule requires you to have some implicit knowledge, and I generally try to avoid that.
On the plus side: if someone copies the data but forgets rules and functions, your solution might break silently (though that may depend on other constraints), whereas a missing function goes down screaming. Don't get me wrong, I think your solution is very creative and smart. Just a bit too obscure for my taste.
There's an example of implementing upsert/merge using a simple function in the Postgres documentation.
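A hedged sketch modelled on that documentation example, adapted to the pages table from this question (the function and parameter names here are made up):
CREATE FUNCTION upsert_page(p_url VARCHAR, p_html VARCHAR) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- first try to update the existing row
        UPDATE pages SET html = p_html, last = NOW() WHERE url = p_url;
        IF FOUND THEN
            RETURN;
        END IF;
        -- not there, so try to insert; a concurrent insert of the same url
        -- raises unique_violation, in which case we loop and retry the UPDATE
        BEGIN
            INSERT INTO pages(url, html, last) VALUES (p_url, p_html, NOW());
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing; fall through and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;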
Never use rules — they're evil.
You cannot refer to tables other than OLD and NEW in the rule qualification.
You should instead do this in the rule body.
This is all because the rule is just a way to inform the rewrite system about what transformations it should and should not perform. Rules are not triggers, executing for every row, but they give the query planner a fine massage and ask it nicely to rewrite the plan.
From the docs:
What is a rule qualification? It is a restriction that tells when the actions of the rule should be done and when not. This qualification can only reference the pseudorelations NEW and/or OLD, which basically represent the relation that was given as object (but with a special meaning).