Fastest way to save large SQL Server table to S3 - sql-server

I am trying to save a 200GB table from a Microsoft SQL Server 2012 instance to S3.
At the moment I am trying to use PySpark on an EMR cluster.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # sc = SparkContext()
    spark = SparkSession \
        .builder \
        .appName("Spark camt52") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.sql.streaming.schemaInference", "true") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    jdbcDf = (spark.read
        .format("jdbc")
        .option("url", "jdbc:sqlserver://servr:1433;databaseName=XXX")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("dbtable", "XXX.dbo.MyTable")
        .option("user", "MyUser")
        .option("password", "MyPass")
        .load())

    print(jdbcDf.count())
    jdbcDf.write.mode("overwrite").parquet("s3://dump_bucket/tables/bigtable/")
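From what I have read, a single-connection JDBC read like the one above tends to be the bottleneck, and the usual advice is to partition the read. A rough sketch of what I think that would look like (Id is a placeholder for a numeric key column on MyTable; the bounds would normally come from MIN/MAX of that column):

# Hypothetical partitioned read: "Id" is a placeholder for a numeric key
# column on the source table, and the bounds/partition count are guesses.
partitionedDf = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://servr:1433;databaseName=XXX")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "XXX.dbo.MyTable")
    .option("user", "MyUser")
    .option("password", "MyPass")
    .option("partitionColumn", "Id")
    .option("lowerBound", 1)
    .option("upperBound", 100000000)
    .option("numPartitions", 64)
    .load())
partitionedDf.write.mode("overwrite").parquet("s3://dump_bucket/tables/bigtable/")

Is that the right direction (together with dropping the count(), which forces a second full pass over the source), or is the cluster just too small?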
Cluster Config:
'InstanceFleets': [
    {
        'Name': 'MASTER',
        'InstanceFleetType': 'MASTER',
        'TargetSpotCapacity': 1,
        'InstanceTypeConfigs': [
            {
                'InstanceType': 'm5.xlarge',
            },
        ]
    },
    {
        'Name': 'CORE',
        'InstanceFleetType': 'CORE',
        'TargetSpotCapacity': 8,
        'InstanceTypeConfigs': [
            {
                'InstanceType': 'r5.2xlarge',
            },
        ],
    },
],
It's been running for over 2 hours. I am still getting the hang of EMR and PySpark, so maybe I messed something up. Should I use a larger cluster?

Related

SQL Server: JSON formatting grouping possible?

I have a single SQL Server table of the form:
192.168.1.1, 80 , tcp
192.168.1.1, 443, tcp
...
I am exporting this to JSON like this:
SELECT
    hostIp AS 'host.ip',
    'ports' = (SELECT port AS 'port', protocol AS 'protocol'
               FOR JSON PATH)
FROM
    test
FOR JSON PATH
This query results in:
[
    {
        "host": {
            "ip": "192.168.1.1"
        },
        "ports": [
            {
                "port": 80,
                "protocol": "tcp"
            }
        ]
    },
    {
        "host": {
            "ip": "192.168.1.1"
        },
        "ports": [
            {
                "port": 443,
                "protocol": "tcp"
            }
        ]
    },
    ....
However, I want all the data for a single IP grouped together, like this:
[
    {
        "host": {
            "ip": "192.168.1.1"
        },
        "ports": [
            {
                "port": 80,
                "protocol": "tcp"
            },
            {
                "port": 443,
                "protocol": "tcp"
            }
        ]
    },
    ...
I have found that there seem to be aggregate functions for this, but they either don't work for me or are PostgreSQL-only; my example is SQL Server.
Any idea how to get this to work?
The only simple way to do this is to self-join:
SELECT
    tOuter.hostIp AS [host.ip]
    ,ports = (
        SELECT
            tInner.port
            ,tInner.protocol
        FROM test tInner
        WHERE tInner.hostIp = tOuter.hostIp
        FOR JSON PATH
    )
FROM test tOuter
GROUP BY
    tOuter.hostIp
FOR JSON PATH;
It is obviously inefficient to self-join, but SQL Server does not support JSON_AGG. You can simulate it with STRING_AGG and avoid the self-join:
SELECT
    t.hostIp AS [host.ip]
    ,ports = JSON_QUERY('[' + STRING_AGG(tInner.json, ',') + ']')
FROM test t
CROSS APPLY (
    SELECT
        t.port
        ,t.protocol
    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) tInner(json)
GROUP BY
    t.hostIp
FOR JSON PATH;
db<>fiddle
As you might imagine, it gets more complex as more and more levels of nesting are involved.
Notes:
- Do not use ' for quoting column names; use only [].
- There is no need to alias a column for JSON if it already has that name.

Error on Credentials handshake of Public Test Environment

I am seeing the following error on step 3 (Credentials handshake) of OCN Node registration (https://shareandcharge.atlassian.net/wiki/spaces/OCN/pages/265945122/OCN+Node+registration#3.-Credentials-handshake).
When I send the credentials handshake, I see the following error:
{
    "status_code": 3001,
    "status_message": "Failed to request from https://edrv-ocpi-dev.herokuapp.com/ocpi/versions: Cannot deserialize instance of `snc.openchargingnetwork.node.models.ocpi.OcpiResponse<java.util.List<snc.openchargingnetwork.node.models.ocpi.Version>>` out of START_ARRAY token\n at [Source: (String)\"[{\"status_code\":1000,\"data\":{\"versions\":[{\"version\":\"2.2\",\"url\":\"https://edrv-ocpi-dev.herokuapp.com/ocpi/2.2\"}]},\"timestamp\":\"2020-02-15T07:48:58.364Z\"}]\"; line: 1, column: 1]",
    "timestamp": "2020-02-15T07:48:58.390914Z"
}
This is the request I am sending:
curl -X POST \
    https://qa-client.emobilify.com/ocpi/2.2/credentials \
    -H 'Accept: */*' \
    -H 'Accept-Encoding: gzip, deflate' \
    -H 'Authorization: Token 21c158fc-****-****-****-************' \
    -H 'Cache-Control: no-cache' \
    -H 'Connection: keep-alive' \
    -H 'Content-Length: 297' \
    -H 'Content-Type: application/json' \
    -H 'Host: qa-client.emobilify.com' \
    -H 'Postman-Token: 8cb3bcba-2cc0-4923-9290-04be261fb686,fe8e2916-ee94-4676-a60a-ad64e3869dcc' \
    -H 'User-Agent: PostmanRuntime/7.19.0' \
    -H 'cache-control: no-cache' \
    -d '{
        "token": "9e9cf68b-****-****-****-dcb7fcaa341e",
        "url": "https://edrv-ocpi-dev.herokuapp.com/ocpi/versions",
        "roles": [{
            "party_id": "EDV",
            "country_code": "NL",
            "role": "EMSP",
            "business_details": {
                "name": "eDRV Technologies B.V."
            }
        }]
    }'
I have set up a quick server on Heroku as per step 2 in the tutorial. Here is what the versions endpoint code looks like:
app.get("/ocpi/versions", authorize, async (_, res) => {
    res.send([{
        status_code: 1000,
        data: {
            versions: [{
                version: "2.2",
                url: `${PUBLIC_URL}/ocpi/2.2`
            }]
        },
        timestamp: new Date()
    }])
})
Is there something I am missing in the request?
The original version of OCPI 2.2 contains an error. The bugfix branch contains the correct type: https://github.com/ocpi/ocpi/blob/release-2.2-bugfixes/version_information_endpoint.asciidoc
You can change your response to the following to make it work:
res.send({
    status_code: 1000,
    data: [{
        version: "2.2",
        url: `${PUBLIC_URL}/ocpi/2.2`
    }],
    timestamp: new Date()
})

How to get the table name and database name in the CDC event received from Debezium Kafka Connect

Setup: I have CDC enabled on MS SQL Server, and the CDC events are fed to Kafka using Debezium Kafka Connect (source). CDC events from more than one table are routed to a single topic in Kafka.
Question: Since I have data from more than one table in the Kafka topic, I would like to have the table name and the database name in the CDC data.
I am getting the table name and database name in MySQL CDC but not in MS SQL CDC.
Below is the Debezium source connector configuration for SQL Server:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
    "name": "cdc-user_profile-connector",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "tasks.max": "1",
        "database.hostname": "<<hostname>>",
        "database.port": "<<port>>",
        "database.user": "<<username>>",
        "database.password": "<<password>>",
        "database.server.name": "test",
        "database.dbname": "testDb",
        "table.whitelist": "schema01.table1,schema01.table2",
        "database.history.kafka.bootstrap.servers": "broker:9092",
        "database.history.kafka.topic": "digital.user_profile.schema.audit",
        "database.history.store.only.monitored.tables.ddl": true,
        "include.schema.changes": false,
        "event.deserialization.failure.handling.mode": "fail",
        "snapshot.mode": "initial_schema_only",
        "snapshot.locking.mode": "none",
        "transforms": "addStaticField,topicRoute",
        "transforms.addStaticField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.addStaticField.static.field": "source_system",
        "transforms.addStaticField.static.value": "source_system_1",
        "transforms.topicRoute.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.topicRoute.regex": "(.*)",
        "transforms.topicRoute.replacement": "digital.user_profile",
        "errors.tolerance": "none",
        "errors.log.enable": true,
        "errors.log.include.messages": true,
        "errors.retry.delay.max.ms": 60000,
        "errors.retry.timeout": 300000
    }
}'
I am getting the output below (demo data):
{
    "before": {
        "profile_id": 147,
        "email_address": "test@gmail.com"
    },
    "after": {
        "profile_id": 147,
        "email_address": "test_modified@gmail.com"
    },
    "source": {
        "version": "0.9.4.Final",
        "connector": "sqlserver",
        "name": "test",
        "ts_ms": 1556723528917,
        "change_lsn": "0007cbe5:0000b98c:0002",
        "commit_lsn": "0007cbe5:0000b98c:0003",
        "snapshot": false
    },
    "op": "u",
    "ts_ms": 1556748731417,
    "source_system": "source_system_1"
}
My requirement is to get the following:
{
    "before": {
        "profile_id": 147,
        "email_address": "test@gmail.com"
    },
    "after": {
        "profile_id": 147,
        "email_address": "test_modified@gmail.com"
    },
    "source": {
        "version": "0.9.4.Final",
        "connector": "sqlserver",
        "name": "test",
        "ts_ms": 1556723528917,
        "change_lsn": "0007cbe5:0000b98c:0002",
        "commit_lsn": "0007cbe5:0000b98c:0003",
        "snapshot": false,
        "db": "testDb",
        "table": "table1/table2"
    },
    "op": "u",
    "ts_ms": 1556748731417,
    "source_system": "source_system_1"
}
This is planned as part of issue https://issues.jboss.org/browse/DBZ-875
Debezium Kafka Connect generally puts the data from each table into a separate topic, and the topic name is of the format hostname.database.table. We generally use the topic name to distinguish the source table and database name.
If you are putting the data from all the tables into one topic manually, then you might have to add the table and database name manually as well.
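For example, with the default topic naming (i.e. without a RegexRouter transform collapsing everything into one topic), a consumer can recover the source coordinates from the topic name itself. A rough sketch in Python, assuming names of the form <server.name>.<database or schema>.<table>:

# Sketch only: derive source coordinates from a Debezium topic name.
# Assumes default naming <server.name>.<database-or-schema>.<table>;
# the example topic below is hypothetical.
def parse_debezium_topic(topic: str) -> dict:
    server_name, middle, table = topic.split(".", 2)
    return {"server": server_name, "database_or_schema": middle, "table": table}

print(parse_debezium_topic("test.dbo.table1"))
# {'server': 'test', 'database_or_schema': 'dbo', 'table': 'table1'}

With the question's RegexRouter rewriting every topic to digital.user_profile, that information is lost from the topic name, which is why it has to be carried in the payload instead (the DBZ-875 work mentioned above).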

How to import JSON into SQL 2016

I'm trying something new and could use some guidance. I have a JSON file full of data that I would like to import into SQL 2016. I can get the entire file into SQL, but breaking it apart into a readable format is where I am stuck. I'm new to SQL, so I struggle to comprehend the articles I've read on this subject. The following query is what I've used to perform the initial import:
SELECT *
FROM OPENROWSET (BULK 'c:\tmp\test.json', SINGLE_CLOB) as j
CROSS APPLY OPENJSON(BulkColumn)
Each row is populated, but the values contain subsections that I need to have expanded. Here is the test data I am using:
{
    "site_id":123456,
    "statusEnum":"fully_configured",
    "status":"fully-configured",
    "domain":"site.name.com",
    "account_id":111111,
    "acceleration_level":"standard",
    "site_creation_date":1410815844000,
    "ips":[
        "99.99.99.99"
    ],
    "dns":[
        {
            "dns_record_name":"site.name.com",
            "set_type_to":"CNAME",
            "set_data_to":[
                "frgt.x.wafdns.net"
            ]
        }
    ],
    "original_dns":[
        {
            "dns_record_name":"name.com",
            "set_type_to":"A",
            "set_data_to":[
                ""
            ]
        },
        {
            "dns_record_name":"site.name.com",
            "set_type_to":"A",
            "set_data_to":[
                "99.99.99.99"
            ]
        },
        {
            "dns_record_name":"site.name.com",
            "set_type_to":"CNAME",
            "set_data_to":[
                ""
            ]
        }
    ],
    "warnings":[
    ],
    "active":"active",
    "additionalErrors":[
    ],
    "display_name":"site.name.com",
    "security":{
        "waf":{
            "rules":[
                {
                    "action":"api.threats.action.block_ip",
                    "action_text":"Block IP",
                    "id":"api.threats.sql_injection",
                    "name":"SQL Injection"
                },
                {
                    "action":"api.threats.action.block_request",
                    "action_text":"Block Request",
                    "id":"api.threats.cross_site_scripting",
                    "name":"Cross Site Scripting"
                },
                {
                    "action":"api.threats.action.block_ip",
                    "action_text":"Block IP",
                    "id":"api.threats.illegal_resource_access",
                    "name":"Illegal Resource Access"
                },
                {
                    "block_bad_bots":true,
                    "challenge_suspected_bots":true,
                    "exceptions":[
                        {
                            "values":[
                                {
                                    "ips":[
                                        "99.99.99.99"
                                    ],
                                    "id":"api.rule_exception_type.client_ip",
                                    "name":"IP"
                                }
                            ],
                            "id":123456789
                        },
                        {
                            "values":[
                                {
                                    "ips":[
                                        "99.99.99.99"
                                    ],
                                    "id":"api.rule_exception_type.client_ip",
                                    "name":"IP"
                                }
                            ],
                            "id":987654321
                        }
                    ],
                    "id":"api.threats.bot_access_control",
                    "name":"Bot Access Control"
                },
                {
                    "activation_mode":"api.threats.ddos.activation_mode.auto",
                    "activation_mode_text":"Auto",
                    "ddos_traffic_threshold":1000,
                    "id":"api.threats.ddos",
                    "name":"DDoS"
                },
                {
                    "action":"api.threats.action.quarantine_url",
                    "action_text":"Auto-Quarantine",
                    "id":"api.threats.backdoor",
                    "name":"Backdoor Protect"
                },
                {
                    "action":"api.threats.action.block_ip",
                    "action_text":"Block IP",
                    "id":"api.threats.remote_file_inclusion",
                    "name":"Remote File Inclusion"
                },
                {
                    "action":"api.threats.action.disabled",
                    "action_text":"Ignore",
                    "id":"api.threats.customRule",
                    "name":"wafRules"
                }
            ]
        },
        "acls":{
            "rules":[
                {
                    "ips":[
                        "99.99.99.99"
                    ],
                    "id":"api.acl.whitelisted_ips",
                    "name":"Visitors from whitelisted IPs"
                },
                {
                    "geo":{
                        "countries":[
                            "BR",
                            "CN",
                            "DE",
                            "ES",
                            "GB",
                            "HK",
                            "IR",
                            "IT",
                            "KP",
                            "KR",
                            "KZ",
                            "NL",
                            "PL",
                            "RO",
                            "RU",
                            "TR",
                            "TW",
                            "UA"
                        ]
                    },
                    "id":"api.acl.blacklisted_countries",
                    "name":"Visitors from blacklisted Countries"
                }
            ]
        }
    },
    "sealLocation":{
        "id":"api.seal_location.none",
        "name":"No seal "
    },
    "ssl":{
        "origin_server":{
            "detected":true,
            "detectionStatus":"ok"
        },
        "generated_certificate":{
            "ca":"GS",
            "validation_method":"email",
            "validation_data":"administrator@site.name.com",
            "san":[
                "*.site.name.com"
            ],
            "validation_status":"done"
        }
    },
    "siteDualFactorSettings":{
        "specificUsers":[
        ],
        "enabled":false,
        "customAreas":[
        ],
        "allowAllUsers":true,
        "shouldSuggestApplicatons":true,
        "allowedMedia":[
            "ga",
            "sms"
        ],
        "shouldSendLoginNotifications":true,
        "version":0
    },
    "login_protect":{
        "enabled":false,
        "specific_users_list":[
        ],
        "send_lp_notifications":true,
        "allow_all_users":true,
        "authentication_methods":[
            "ga",
            "sms"
        ],
        "urls":[
        ],
        "url_patterns":[
        ]
    },
    "performance_configuration":{
        "advanced_caching_rules":{
            "never_cache_resources":[
            ],
            "always_cache_resources":[
            ]
        },
        "acceleration_level":"standard",
        "async_validation":true,
        "minify_javascript":true,
        "minify_css":true,
        "minify_static_html":true,
        "compress_jepg":true,
        "progressive_image_rendering":false,
        "aggressive_compression":false,
        "compress_png":true,
        "on_the_fly_compression":true,
        "tcp_pre_pooling":true,
        "comply_no_cache":false,
        "comply_vary":false,
        "use_shortest_caching":false,
        "perfer_last_modified":false,
        "accelerate_https":false,
        "disable_client_side_caching":false,
        "cache300x":false,
        "cache_headers":[
        ]
    },
    "extended_ddos":1000,
    "res":0,
    "res_message":"OK",
    "debug_info":{
        "id-info":"1234"
    }
}
I know enough about SQL to know that I am going to need multiple tables for these subsections. How do I select these subsections and expand them out into their own tables? If my explanation or question is not clear, please comment and I will do my best to be more precise.
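For example, this is the kind of parent/child split I have in mind, sketched in Python just to show the shape (the table names in the comments are made up, nothing that exists yet):

import json

# Rough illustration only: flatten the nested "dns" array into rows for a
# hypothetical child table keyed by site_id.
with open(r"c:\tmp\test.json") as f:
    site = json.load(f)

# Parent table row (e.g. a hypothetical dbo.Site)
site_row = (site["site_id"], site["domain"], site["status"])

# Child table rows (e.g. a hypothetical dbo.SiteDns), one row per dns entry per data value
dns_rows = [
    (site["site_id"], d["dns_record_name"], d["set_type_to"], data)
    for d in site["dns"]
    for data in d["set_data_to"]
]

print(site_row)
print(dns_rows)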
I was able to successfully import my JSON data into a SQL database using the following code:
###Getting the list of Site_IDs
$param1 = @{account_id='####'; api_id='######'; api_key='asdfjklmnopqrstuvwxyz'; page_size='1000'}
$sitelist = Invoke-WebRequest https://my.website.com/api/prov/v1/sites/list -Method Post -Body $param1
$sitelist = $sitelist.content
$sitelist = $sitelist | ConvertFrom-Json
$sitelist = $sitelist.sites
$sites = $sitelist | select -ExpandProperty Site_ID
###Setting the ConnectionString for the SQL Server and Database
$server = 'servername\sqlinstance'
$database = 'databasename'
$connstring = "server = $server; database = $database; trusted_connection = true;"
$conn = New-Object System.Data.SqlClient.SqlConnection
$conn.ConnectionString = $connstring
$cmd = New-Object System.Data.SqlClient.SqlCommand
$cmd.CommandTimeout = 60
$cmd.Connection = $conn
$cmd.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@Site_ID", [Data.SQLDBType]::BigInt, 20))) | Out-Null
$cmd.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@JSON_Source", [Data.SQLDBType]::NVARCHAR))) | Out-Null
###Setting the SQL Query to Insert the Site Data
$query = @"
INSERT INTO dbo.WebsiteSourceData (Site_ID, JSON_Source)
VALUES (@Site_ID, @JSON_Source);
"@
$cmd.CommandText = $query
###Opening the SQL Connection
Try {
    $conn.Open()
    ###Looping through each site to get its configuration and then inserting that data into SQL
    ForEach ($site in $sites) {
        $param2 = @{api_id='#####'; api_key='abcdefghijklmnopqrstuvwxyz'; site_id="$site"}
        $status = Invoke-WebRequest -URI https://my.website.com/api/prov/v1/sites/status -Method Post -Body $param2
        $content = $status.Content
        $adapter = New-Object System.Data.SqlClient.SqlDataAdapter
        $adapter.SelectCommand = $cmd
        $cmd.Parameters[0].Value = $site
        $cmd.Parameters[1].Value = $content
        $dataset = New-Object System.Data.DataSet
        $adapter.Fill($dataset)
    } #ForEach ($site in $sites)
} #Try
Catch {
    Write-Host "Exception thrown : $($error[0].exception.message)"
} #Catch
Finally {
    $conn.Close()
} #Finally
From there, using the extremely helpful advice from this thread: How to shred JSON data from within a SQL database, I was able to access and parse out the data into separate tables.

How to configure Opserver for SQL Server clusters

I am trying out Opserver to monitor SQL Server instances. I have no issues configuring standalone instances, but when I tried to configure SQL Server clusters using the method documented here: http://www.patrickhyatt.com/2013/10/25/setting-up-stackexchanges-opserver.html
I got confused about where to put the SQL Server cluster named instance and the Windows node server names.
In the JSON code below:
{
    "defaultConnectionString": "Data Source=$ServerName$;Initial Catalog=master;Integrated Security=SSPI;",
    "clusters": [
        {
            "name": "SDCluster01",
            "nodes": [
                { "name": "SDCluster01\\SDCluster01_01" },
                { "name": "SDCluster02\\SDCluster01_02" },
            ]
        },
    ],
I assume SDCluster01 is the instance DNS name and SDCluster01_01 and SDCluster01_02 are Windows node server names.
But what if I have a named instance (clustered) like SDCluster01\instance1?
I tried to configure it like this:
{
    "defaultConnectionString": "Data Source=$ServerName$;Initial Catalog=master;Integrated Security=SSPI;",
    "clusters": [
        {
            "name": "SDCluster01\instance1",
            "nodes": [
                { "name": "SDCluster01\\SDCluster01_01" },
                { "name": "SDCluster02\\SDCluster01_02" },
            ]
        },
    ],
But after deploying it to Opserver, I got this error message:
[NullReferenceException: Object reference not set to an instance of an object.]
Any ideas on how to configure the JSON file correctly for SQL Server clusters?
