Exporting SharePoint usage log files into a database using LogParser - sql-server

We have lots of SharePoint usage log files generated by our SharePoint 2007 site and we would like to make sense of them. Our plan is to read the log files and load them into a database with the appropriate columns. I was going to build an SSIS package to read all the text files and extract the data when I came across LogParser. Is there a way to use LogParser to load data into a SQL Server database, or is the SSIS way better? Or is there a better way altogether to use the SharePoint usage logs?

This is the script we use to load IIS log files in a SQL Server database:
LogParser "SELECT * INTO <TABLENAME> FROM <LogFileName>" -o:SQL -server:<servername> -database:<databasename> -driver:"SQL Server" -username:sa -password:xxxxx -createTable:ON
The <tablename>, <logfilename>, <servername>, <databasename> and sa password need to be changed according to your specs.
From my experience LogParser works really well to load data from IIS logs to SQL Server, so a mixed approach is the best:
Load raw data from IIS log to SQL Server using LogParser
Use SSIS to extract and manipulate data from the temporary table containing the raw data in the final table you'll use for reporting.
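The first step can be scripted. The following is a minimal sketch, assuming LogParser is on the PATH; the helper names `build_logparser_command` and `run_logparser` and the example values (`IISLogRaw`, `WebLogs`, `loader`) are hypothetical, and the flags are copied from the command above rather than verified against other LogParser versions.

```python
import subprocess


def build_logparser_command(table, log_file, server, database, username, password):
    """Return the argument list for loading one log file into a raw SQL Server table."""
    query = f"SELECT * INTO {table} FROM {log_file}"
    return [
        "LogParser",
        query,
        "-o:SQL",
        f"-server:{server}",
        f"-database:{database}",
        "-driver:SQL Server",      # passed as one argument, so no inner quotes are needed
        f"-username:{username}",
        f"-password:{password}",
        "-createTable:ON",
    ]


def run_logparser(command):
    """Run LogParser and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values; replace them with your own table, file, and server.
    cmd = build_logparser_command("IISLogRaw", "ex*.log", "MYSERVER", "WebLogs", "loader", "secret")
    print(cmd)
```

Calling `run_logparser(cmd)` on a machine with LogParser installed would perform the load; the SSIS package can then read from the raw table.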

You'll have to write a plugin to logparser. Here is what I did:
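using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
// ILogParserInputContext is assumed to come from the LogParser COM interop reference in your project.
[Guid("1CC338B9-4F5F-4bf2-86AE-55C865CF7159")]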
public class SPUsageLogParserPlugin : ILogParserInputContext
{
private FileStream stream = null;
private BinaryReader br = null;
private object[] currentEntry = null;
public SPUsageLogParserPlugin() { }
#region LogParser
protected const int GENERAL_HEADER_LENGTH = 300;
protected const int ENTRY_HEADER_LENGTH = 50;
protected string[] columns = {"TimeStamp",
"SiteGUID",
"SiteUrl",
"WebUrl",
"Document",
"User",
"QueryString",
"Referral",
"UserAgent",
"Command"};
protected string ReadString(BinaryReader br)
{
StringBuilder buffer = new StringBuilder();
char c = br.ReadChar();
while (c != 0) {
buffer.Append(c);
c = br.ReadChar();
}
return buffer.ToString();
}
#endregion
#region ILogParserInputContext Members
enum FieldType
{
Integer = 1,
Real = 2,
String = 3,
Timestamp = 4
}
public void OpenInput(string from)
{
stream = File.OpenRead(from);
br = new BinaryReader(stream);
br.ReadBytes(GENERAL_HEADER_LENGTH);
}
public int GetFieldCount()
{
return columns.Length;
}
public string GetFieldName(int index)
{
return columns[index];
}
public int GetFieldType(int index)
{
if (index == 0) {
// TimeStamp
return (int)FieldType.Timestamp;
} else {
// Other fields
return (int)FieldType.String;
}
}
public bool ReadRecord()
{
if (stream.Position < stream.Length) {
br.ReadBytes(ENTRY_HEADER_LENGTH); // Entry Header
string webappguid = ReadString(br);
DateTime timestamp = DateTime.ParseExact(ReadString(br), "HH:mm:ss", null);
string siteUrl = ReadString(br);
string webUrl = ReadString(br);
string document = ReadString(br);
string user = ReadString(br);
string query = ReadString(br);
string referral = ReadString(br);
string userAgent = ReadString(br);
string guid = ReadString(br);
string command = ReadString(br);
currentEntry = new object[] { timestamp, webappguid, siteUrl, webUrl, document, user, query, referral, userAgent, command };
return true;
} else {
currentEntry = new object[] { };
return false;
}
}
public object GetValue(int index)
{
return currentEntry[index];
}
public void CloseInput(bool abort)
{
br.Close();
stream.Dispose();
stream = null;
br = null;
}
#endregion
}
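To experiment with the file layout before building the COM plugin, the same reading logic can be sketched in Python. This assumes the layout the plugin above implies (a 300-byte file header, then a 50-byte header before each entry followed by eleven NUL-terminated strings) and assumes UTF-8 text; the function names are hypothetical.

```python
import io

GENERAL_HEADER_LENGTH = 300  # bytes skipped once per file (value taken from the plugin above)
ENTRY_HEADER_LENGTH = 50     # bytes skipped before every entry

# Field order as read by ReadRecord() in the plugin above.
FIELDS = ["webappguid", "timestamp", "siteUrl", "webUrl", "document", "user",
          "query", "referral", "userAgent", "guid", "command"]


def read_cstring(stream):
    """Read bytes up to (not including) the next NUL byte and decode them."""
    chunk = bytearray()
    while True:
        b = stream.read(1)
        if not b or b == b"\x00":
            break
        chunk += b
    return chunk.decode("utf-8", errors="replace")


def read_entries(stream):
    """Yield one dict per log entry until the stream runs out of entry headers."""
    stream.read(GENERAL_HEADER_LENGTH)
    while True:
        header = stream.read(ENTRY_HEADER_LENGTH)
        if len(header) < ENTRY_HEADER_LENGTH:
            return
        yield {name: read_cstring(stream) for name in FIELDS}


if __name__ == "__main__":
    # Synthetic one-entry file, built only to exercise the reader.
    values = ["app-guid", "12:34:56", "http://site", "/web", "doc.aspx", "DOMAIN\\user",
              "", "", "agent", "site-guid", "GET"]
    body = b"".join(v.encode("utf-8") + b"\x00" for v in values)
    data = b"\x00" * GENERAL_HEADER_LENGTH + b"\x00" * ENTRY_HEADER_LENGTH + body
    for entry in read_entries(io.BytesIO(data)):
        print(entry)
```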

If you want more in-depth reporting and have the cash and computer power you could look at Nintex Reporting. I've seen a demo of it and it's very thorough, however it needs to continuously run on your system. Looks cool though.

This is the blog post I used to get all the info needed.
It is not necessary to go to the length of custom code.
In brief, create table script:
CREATE TABLE [dbo].[STSlog](
[application] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[date] [datetime] NULL,
[time] [datetime] NULL,
[username] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[computername] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[method] [varchar](16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[siteURL] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[webURL] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[docName] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[bytes] [int] NULL,
[queryString] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[userAgent] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[referer] [varchar](2048) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[bitFlags] [smallint] NULL,
[status] [smallint] NULL,
[siteGuid] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
Calls to convert one log file to CSV and then have LogParser load it:
"C:\projects\STSLogParser\STSLogParser.exe" 2005-01-01 "c:\projects\STSlog\2005-01-01\00.log" c:\projects\logparsertmp\stslog.csv
"C:\Program Files\Log Parser 2.2\logparser.exe" "SELECT 'SharePointPortal' as application, TO_DATE(TO_UTCTIME(TO_TIMESTAMP(TO_TIMESTAMP(date, 'yyyy-MM-dd'), TO_TIMESTAMP(time, 'hh:mm:ss')))) AS date, TO_TIME( TO_UTCTIME( TO_TIMESTAMP(TO_TIMESTAMP(date, 'yyyy-MM-dd'), TO_TIMESTAMP(time, 'hh:mm:ss')))), UserName as username, 'SERVERNAME' as computername, 'GET' as method, SiteURL as siteURL, WebURL as webURL, DocName as docName, cBytes as bytes, QueryString as queryString, UserAgent as userAgent, RefURL as referer, TO_INT(bitFlags) as bitFlags, TO_INT(HttpStatus) as status, TO_STRING(SiteGuid) as siteGuid INTO STSlog FROM c:\projects\logparsertmp\stslog.csv WHERE (username IS NOT NULL) AND (TO_LOWERCASE(username) NOT IN (domain\serviceaccount))" -i:CSV -headerRow:ON -o:SQL -server:localhost -database:SharePoint_SA_IN -clearTable:ON

Sorry, I found out that SharePoint logs are not the same as IIS logs. How can we parse them?

Related

sqlalchemy.orm.exc.UnmappedInstanceError: Class 'builtins.dict' is not mapped AND using marshmallow-sqlalchemy

I don't get it. I'm trying to start a brand new table in MS SQL Server 2012 with the following:
In SQL Server:
CREATE TABLE [dbo].[Inventory](
[Index_No] [bigint] IDENTITY(1,1) NOT NULL,
[Part_No] [varchar](150) NOT NULL,
[Shelf] [int] NOT NULL,
[Bin] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
[Index_No] ASC
),
UNIQUE NONCLUSTERED
(
[Part_No] ASC
)
)
GO
NOTE: This is a BRAND NEW TABLE! There is no data in it at all
Next, this is the Database.py file:
import pymssql
from sqlalchemy import (create_engine, Table, MetaData, select, Column, Integer, Float, String, text,
                        func, desc, and_, or_, Date, insert)
from sqlalchemy.orm import Session  # main.py uses db.Session
from marshmallow_sqlalchemy import SQLAlchemyAutoSchema
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
USERNAME = "name"
PSSWD = "none_of_your_business"
SERVERNAME = "MYSERVER"
INSTANCENAME = r"\SQLSERVER2012"
DB = "Inventory"
engine = create_engine(f"mssql+pymssql://{USERNAME}:{PSSWD}@{SERVERNAME}{INSTANCENAME}/{DB}")
class Inventory(Base):
    __tablename__ = "Inventory"
    Index_No = Column('Index_No', Integer, primary_key=True, autoincrement=True)
    Part_No = Column("Part_No", String, unique=True)
    Shelf = Column("Shelf", Integer)
    Bin = Column("Bin", Integer)
    def __repr__(self):
        return f'Drawing(Index_No={self.Index_No!r},Part_No={self.Part_No!r}, Shelf={self.Shelf!r}, ' \
               f'Bin={self.Bin!r})'
class InventorySchema(SQLAlchemyAutoSchema):
    class Meta:
        model = Inventory
        load_instance = True
It's also worth noting that I'm using SQLAlchemy 1.4.3, if that helps.
and in the main.py
import Database as db
db.Base.metadata.create_all(db.engine)
data_list = [{"Part_No": "123A", "Shelf": 1, "Bin": 5},
             {"Part_No": "456B", "Shelf": 1, "Bin": 7},
             {"Part_No": "789C", "Shelf": 2, "Bin": 1}]
with db.Session(db.engine, future=True) as session:
    try:
        session.add_all(data_list)  # <--- FAILS HERE AND THROWS AN EXCEPTION
        session.commit()
    except Exception as e:
        session.rollback()
        print(f"Error! {e!r}")
        raise
    finally:
        session.close()
When I google "Class 'builtins.dict' is not mapped", most of the solutions point me to the marshmallow-sqlalchemy package, which I've tried, but I'm still getting the same error. I've tried moving the Base.metadata.create_all(engine) call from Database.py into main.py. I also tried implementing an __init__ function in the Inventory class and calling super().__init__, which doesn't work either.
So what's going on?? Why is it failing and is there a better solution to this problem?
Try creating Inventory objects:
data_list = [
Inventory(Part_No='123A', Shelf=1, Bin=5),
Inventory(Part_No='456B', Shelf=1, Bin=7),
Inventory(Part_No='789C', Shelf=2, Bin=1)
]
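If the rows already exist as dictionaries, they can be unpacked into mapped objects instead of being retyped. The sketch below assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite database as a stand-in for SQL Server so it can run anywhere; `insert_rows` is a hypothetical helper, not part of the question's code.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Inventory(Base):
    __tablename__ = "Inventory"
    Index_No = Column(Integer, primary_key=True, autoincrement=True)
    Part_No = Column(String(150), unique=True)
    Shelf = Column(Integer)
    Bin = Column(Integer)


def insert_rows(rows):
    """Insert a list of dicts as Inventory objects and return the resulting row count."""
    engine = create_engine("sqlite://")  # in-memory stand-in for the SQL Server engine
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        # session.add_all() needs mapped instances, so each dict is unpacked
        # into the default keyword constructor of the mapped class.
        session.add_all([Inventory(**row) for row in rows])
        session.commit()
        return session.query(Inventory).count()


data_list = [
    {"Part_No": "123A", "Shelf": 1, "Bin": 5},
    {"Part_No": "456B", "Shelf": 1, "Bin": 7},
    {"Part_No": "789C", "Shelf": 2, "Bin": 1},
]

if __name__ == "__main__":
    print(insert_rows(data_list))
```

The dictionary keys have to match the mapped attribute names exactly; a key the class does not define raises a TypeError.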

Cannot Insert into SQL using PySpark, but works in SQL

I have created a table below in SQL using the following:
CREATE TABLE [dbo].[Validation](
[RuleId] [int] IDENTITY(1,1) NOT NULL,
[AppId] [varchar](255) NOT NULL,
[Date] [date] NOT NULL,
[RuleName] [varchar](255) NOT NULL,
[Value] [nvarchar](4000) NOT NULL
)
NOTE the identity key (RuleId)
When inserting values into the table as below in SQL it works:
Note: I'm not inserting the primary key, since it autofills when the table is empty and then increments.
INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')
However, when I create a temp view on Databricks and run the same query through PySpark as below:
%python
driver = "<Driver>"
url = "jdbc:sqlserver:<URL>"
database = "<db>"
table = "dbo.Validation"
user = "<user>"
password = "<pass>"
#import the data
remote_table = spark.read.format("jdbc")\
.option("driver", driver)\
.option("url", url)\
.option("database", database)\
.option("dbtable", table)\
.option("user", user)\
.option("password", password)\
.load()
remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAMES")
sqlContext.sql("INSERT INTO YOUR_TEMP_VIEW_NAMES VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
I get the error below:
AnalysisException: 'unknown requires that the data to be inserted have the same number of columns as the target table: target table has 5 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;'
Why does it work on SQL but not when passing the query through databricks? How can I insert through pyspark without getting this error?
The most straightforward solution here is to use JDBC from a Scala cell, e.g.:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
val jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')"
stmt.execute(sql)
connection.close()
You could use pyodbc too, but the SQL Server ODBC drivers aren't installed by default, and the JDBC drivers are.
A Spark solution would be to create a view in SQL Server and insert against that. eg
create view Validation2 as
select AppId,Date,RuleName,Value
from Validation
then
tableName = "Validation2"
df = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProperties)
df.createOrReplaceTempView(tableName)
sqlContext.sql("INSERT INTO Validation2 VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
If you want to encapsulate the Scala and call it from another language (like Python), you can use a scala package cell.
eg
%scala
package example
import java.util.Properties
import java.sql.DriverManager
object JDBCFacade
{
def runStatement(url : String, sql : String, userName : String, password: String): Unit =
{
val connection = DriverManager.getConnection(url, userName, password)
val stmt = connection.createStatement()
try
{
stmt.execute(sql)
}
finally
{
connection.close()
}
}
}
and then you can call it like this:
jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
jdbcUrl = "jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
sql = "select 1 a into #foo from sys.objects"
sc._jvm.example.JDBCFacade.runStatement(jdbcUrl,sql, jdbcUsername, jdbcPassword)
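As a side note on the error in the question: SQL Server fills the IDENTITY column itself, so four values are enough there, whereas Spark compared the four values with all five columns of the view. The sketch below reproduces that kind of column-count check with SQLite from the Python standard library. SQLite is only a stand-in chosen so the example runs anywhere; it is not what Databricks uses, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Validation (
           RuleId   INTEGER PRIMARY KEY AUTOINCREMENT,
           AppId    TEXT NOT NULL,
           Date     TEXT NOT NULL,
           RuleName TEXT NOT NULL,
           Value    TEXT NOT NULL)"""
)

row = ("TestApp", "2020-05-15", "MemoryUsageAnomaly", "2300MB")


def insert_without_column_list():
    """Try a four-value INSERT with no column list; return the error text, or None on success."""
    try:
        conn.execute("INSERT INTO Validation VALUES (?, ?, ?, ?)", row)
        return None
    except sqlite3.Error as exc:
        return str(exc)


def insert_with_column_list():
    """Name the four non-key columns explicitly and return the stored key and AppId."""
    conn.execute(
        "INSERT INTO Validation (AppId, Date, RuleName, Value) VALUES (?, ?, ?, ?)", row
    )
    return conn.execute("SELECT RuleId, AppId FROM Validation").fetchall()


if __name__ == "__main__":
    print(insert_without_column_list())
    print(insert_with_column_list())
```

The view-based approach above works for the same reason: the view exposes only the four columns that the statement supplies.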

SQL Server DELETE TOP slows down over time

I am running the following code to move data from a staging table (rowstore) into production tables (columnstore). I have observed that the process slows down over time. I suspect it is because I am performing INSERT ... SELECT and then DELETE repeatedly until the table is empty. Is there a better way to do this, or what can I do to optimize this process? Should I rebuild the index partway through? I'd appreciate your input.
Database
Partitioned tables by month
Simple recovery model
8 CPU, 16GB RAM
Table definition
No constraints, no triggers on the table
On average ~2 billion rows in staging each month when the transfer starts
CREATE TABLE [dbo].[AnalogT](
[SourceId] [int] NOT NULL,
[WindFarmId] [smallint] NOT NULL,
[StationId] [int] NOT NULL,
[TDIID] [int] NOT NULL,
[CTDIID] [int] NOT NULL,
[LogTime] [datetime2](0) NOT NULL,
[MeanValue] [real] NULL,
[MinValue] [real] NULL,
[MaxValue] [real] NULL,
[Stddev] [real] NULL,
[EndValue] [real] NULL,
[Published] [datetime2](3) NULL,
[TimeZoneOffset] [tinyint] NULL
) ON [Data]
GO
CREATE UNIQUE CLUSTERED INDEX [CIDX_AnalogT] ON [dbo].[AnalogT]
(
[LogTime] ASC,
[CTDIID] ASC,
[StationId] ASC,
[WindFarmId] ASC
) ON [Data]
GO
Get dataset to transfer
public class AnalogDeltaTransfer2ProductionService : DeltaTransfer2ProductionService
{
public override TdiType GetTargetTdiType()
{
return TdiType.Analog;
}
public override string GetCommandText(int batchSize)
{
string commandText = $"SELECT TOP {batchSize} " +
"[SourceId]," +
"[WindFarmId]" +
",[StationId]" +
",[TDIID]" +
",[CTDIID]" +
",[LogTime]" +
",[MeanValue]" +
",[MinValue]" +
",[MaxValue]" +
",[Stddev]" +
",[EndValue]" +
",[Published]" +
",[TimeZoneOffset]" +
",GETUTCDATE() [DateInsertedUtc]" +
$" FROM [dbo].[{GetStagingTable()}] ORDER BY LogTime, CTDIID, StationId, WindFarmId ";
return commandText;
}
public override string GetStagingTable()
{
var appSettings = new AppSettings();
return appSettings.deltasync_staging_analog;
}
public override string GetProductionTable()
{
var appSettings = new AppSettings();
return appSettings.deltasync_destination_analog;
}
}
Bulk load data
//Bulk insert data from staging into production table
private void BulkTransferData(int batchSize, int batchNo)
{
var appSettings = new AppSettings();
var tdiTypeName = Enum.GetName(typeof (TdiType), GetTargetTdiType());
//source and destination databases are the same
var sourceAndDestinationConnectionString =
new StringBuilder(appSettings.deltasync_destination_connectionstring +
"Application Name=HPSDB DeltaSync DT2P {tdiType}")
.Replace("{tdiType}", tdiTypeName.ToUpper().Substring(0, 3))
.ToString();
using (var stagingConnection = new SqlConnection(sourceAndDestinationConnectionString))
{
stagingConnection.Open();
// get data from the source table as a stream datareader.
var commandSourceData = new SqlCommand(GetCommandText(batchSize), stagingConnection);
commandSourceData.CommandType = CommandType.Text;
commandSourceData.CommandTimeout = appSettings.deltasync_deltatransfer_prod_commandtimeout_secs.AsInt();
//prepare the fast reader
var reader = commandSourceData.ExecuteReader();
var transactionTimeOut =
TimeSpan.FromSeconds(appSettings.deltasync_deltatransfer_prod_trnxntimeout_secs.AsInt());
using (
var transactionScope = new TransactionScope(TransactionScopeOption.RequiresNew, transactionTimeOut))
{
//establish connection to destination production table
using (var destinationConnection = new SqlConnection(sourceAndDestinationConnectionString))
{
//grab connection from connection pool
destinationConnection.Open();
// set up the bulk copy object.
// note that the column positions in the source
// data reader match the column positions in the destination table so there is no need to map columns.
using (
var bulkCopy = new SqlBulkCopy(sourceAndDestinationConnectionString,
SqlBulkCopyOptions.TableLock))
{
try
{
// write from the source to the destination.
bulkCopy.DestinationTableName = GetProductionTable();
bulkCopy.BatchSize = appSettings.deltasync_deltatransfer_staging_batchsize.AsInt();
bulkCopy.BulkCopyTimeout =
appSettings.deltasync_deltatransfer_prod_commandtimeout_secs.AsInt();
bulkCopy.EnableStreaming = true;
bulkCopy.WriteToServer(reader);
}
catch (Exception ex)
{
//interrupt transaction and allow rollback process to commence
Log.Error(
$"DeltaTransfer2ProductionService.BulkTransferData/{tdiTypeName}/BulkInsert/{batchNo}: Faulted with with ff error: {ex}");
throw;
}
}
}
//establish connection to destination production table
using (var destinationConnection = new SqlConnection(sourceAndDestinationConnectionString))
{
//grab connection from connection pool
destinationConnection.Open();
//delete the rows that has been moved to destination table
try
{
//delete top x number of rows we just moved into prod
var selectTopXSqlText = GetCommandText(batchSize);
var sqlText = $"WITH cte AS({selectTopXSqlText}) DELETE FROM cte;";
var commandDropData = new SqlCommand(sqlText, destinationConnection);
commandDropData.CommandType = CommandType.Text;
commandDropData.CommandTimeout =
appSettings.deltasync_deltatransfer_prod_commandtimeout_secs.AsInt();
commandDropData.ExecuteNonQuery();
}
catch (Exception ex)
{
//interrupt transaction and allow rollback process to commence
Log.Error(
$"DeltaTransfer2ProductionService.BulkTransferData/{tdiTypeName}/DeleteTop/{batchNo}: Faulted with with ff error: {ex}");
throw;
}
}
//commit all changes
transactionScope.Complete();
transactionScope.Dispose();
}
}
}

Importing Large XML file into SQL 2.5Gb

Hi, I am trying to import a large XML file into a table on my SQL Server (2014).
I have used the code below for smaller files and thought it would be OK as this is a one-off. I kicked it off yesterday and the query was still running when I came into work today, so this is obviously the wrong route.
Here is the code:
CREATE TABLE files_index_bulk
(
Id INT IDENTITY PRIMARY KEY,
XMLData XML,
LoadedDateTime DATETIME
)
INSERT INTO files_index_bulk(XMLData, LoadedDateTime)
SELECT CONVERT(XML, BulkColumn, 2) AS BulkColumn, GETDATE()
FROM OPENROWSET(BULK 'c:\scripts\icecat\files.index.xml', SINGLE_BLOB) AS x;
SELECT * FROM files_index_bulk
Can anyone point out another way of doing this, please? I've looked around at importing large files and it keeps coming back to using bulk, which I already am.
Thanks in advance.
Here is the table I want to pull all the data into:
USE [ICECATtesting]
GO
/****** Object: Table [dbo].[files_index] Script Date: 28/04/2017 20:10:44 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[files_index](
[Product_ID] [int] NULL,
[path] [varchar](100) NULL,
[Updated] [varchar](50) NULL,
[Quality] [varchar](50) NULL,
[Supplier_id] [int] NULL,
[Prod_ID] [varchar](1) NULL,
[Catid] [int] NULL,
[On_Market] [int] NULL,
[Model_Name] [varchar](250) NULL,
[Product_View] [int] NULL,
[HighPic] [varchar](1) NULL,
[HighPicSize] [int] NULL,
[HighPicWidth] [int] NULL,
[HighPicHeight] [int] NULL,
[Date_Added] [varchar](150) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
And here is a snippet of the XML file:
<ICECAT-interface xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd">
<files.index Generated="20170427010009">
<file path="export/level4/EN/11.xml" Product_ID="11" Updated="20170329110432" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PS300E-03YNL-DU" Catid="151" On_Market="0" Model_Name="Satellite 3000-400" Product_View="587591" HighPic="" HighPicSize="0" HighPicWidth="0" HighPicHeight="0" Date_Added="20050627000000">
</file>
<file path="export/level4/EN/12.xml" Product_ID="12" Updated="20170329110432" Quality="ICECAT" Supplier_id="7" Prod_ID="91.42R01.32H" Catid="151" On_Market="0" Model_Name="TravelMate 740LF" Product_View="40042" HighPic="http://images.icecat.biz/img/norm/high/12-31699.jpg" HighPicSize="19384" HighPicWidth="170" HighPicHeight="192" Date_Added="20050627000000">
</file>
<file path="export/level4/EN/13.xml" Product_ID="13" Updated="20170329110432" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PP722E-H390W-NL" Catid="151" On_Market="0" Model_Name="Portégé 7220CT / NW2" Product_View="37021" HighPic="http://images.icecat.biz/img/norm/high/13-31699.jpg" HighPicSize="27152" HighPicWidth="280" HighPicHeight="280" Date_Added="20050627000000">
</file>
The max size of an XML column value in SQL Server is 2GB. It will not be possible to import a 2.5GB file into a single XML column.
UPDATE
Since your underlying objective is to transform XML elements within the file into table rows, you don't need to stage the entire file contents into a single XML column. You can avoid the 2GB limitation, reduce memory requirements, and improve performance by shredding the XML in client code and using a bulk insert technique to insert batches of multiple rows.
The example PowerShell script below uses an XmlTextReader to avoid reading the entire XML into a DOM and uses SqlBulkCopy to insert batches of many rows at once. The combination of these techniques should allow you to insert millions of rows in minutes rather than hours. The same techniques can be implemented in a custom app or an SSIS script task.
I noticed that a couple of the table columns are declared as varchar(1), yet the corresponding XML attribute values contain many characters. You'll need to either expand the length of those columns or transform the source values.
[String]$global:connectionString = "Data Source=YourServer;Initial Catalog=YourDatabase;Integrated Security=SSPI";
[System.Data.DataTable]$global:dt = New-Object System.Data.DataTable;
[System.Xml.XmlTextReader]$global:xmlReader = New-Object System.Xml.XmlTextReader("C:\FilesToImport\files.xml");
[Int32]$global:batchSize = 10000;
Function Add-FileRow() {
$newRow = $dt.NewRow();
$null = $dt.Rows.Add($newRow);
$newRow["Product_ID"] = $global:xmlReader.GetAttribute("Product_ID");
$newRow["path"] = $global:xmlReader.GetAttribute("path");
$newRow["Updated"] = $global:xmlReader.GetAttribute("Updated");
$newRow["Quality"] = $global:xmlReader.GetAttribute("Quality");
$newRow["Supplier_id"] = $global:xmlReader.GetAttribute("Supplier_id");
$newRow["Prod_ID"] = $global:xmlReader.GetAttribute("Prod_ID");
$newRow["Catid"] = $global:xmlReader.GetAttribute("Catid");
$newRow["On_Market"] = $global:xmlReader.GetAttribute("On_Market");
$newRow["Model_Name"] = $global:xmlReader.GetAttribute("Model_Name");
$newRow["Product_View"] = $global:xmlReader.GetAttribute("Product_View");
$newRow["HighPic"] = $global:xmlReader.GetAttribute("HighPic");
$newRow["HighPicSize"] = $global:xmlReader.GetAttribute("HighPicSize");
$newRow["HighPicWidth"] = $global:xmlReader.GetAttribute("HighPicWidth");
$newRow["HighPicHeight"] = $global:xmlReader.GetAttribute("HighPicHeight");
$newRow["Date_Added"] = $global:xmlReader.GetAttribute("Date_Added");
}
try
{
# init data table schema
$da = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM dbo.files_index WHERE 0 = 1;", $global:connectionString);
$null = $da.Fill($global:dt);
$bcp = New-Object System.Data.SqlClient.SqlBulkCopy($global:connectionString);
$bcp.DestinationTableName = "dbo.files_index";
$recordCount = 0;
while($xmlReader.Read() -eq $true)
{
if(($xmlReader.NodeType -eq [System.Xml.XmlNodeType]::Element) -and ($xmlReader.Name -eq "file"))
{
Add-FileRow -xmlReader $xmlReader;
$recordCount += 1;
if(($recordCount % $global:batchSize) -eq 0)
{
$bcp.WriteToServer($dt);
$dt.Rows.Clear();
Write-Host "$recordCount file elements processed so far";
}
}
}
if($dt.Rows.Count -gt 0)
{
$bcp.WriteToServer($dt);
}
$bcp.Close();
$xmlReader.Close();
Write-Host "$recordCount file elements imported";
}
catch
{
throw;
}
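The same streaming-plus-batching idea can be sketched in Python. This is an illustration under stated assumptions, not a port of the script above: it uses `xml.etree.ElementTree.iterparse` in place of XmlTextReader and SQLite's `executemany` as a stand-in for SqlBulkCopy so that it runs without a SQL Server connection. `import_file_elements` is a hypothetical name.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Attribute names on each <file> element, matching the files_index columns.
COLUMNS = [
    "Product_ID", "path", "Updated", "Quality", "Supplier_id", "Prod_ID",
    "Catid", "On_Market", "Model_Name", "Product_View", "HighPic",
    "HighPicSize", "HighPicWidth", "HighPicHeight", "Date_Added",
]


def import_file_elements(xml_source, conn, batch_size=10000):
    """Stream <file> elements from xml_source and insert them in batches; return the row count."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO files_index ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    batch, total = [], 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "file":
            continue
        batch.append(tuple(elem.get(col) for col in COLUMNS))
        elem.clear()  # drop the element's attributes and children once they are copied
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<ICECAT-interface><files.index Generated="20170427010009">'
        b'<file path="export/level4/EN/11.xml" Product_ID="11" Quality="SUPPLIER"></file>'
        b'</files.index></ICECAT-interface>'
    )
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE files_index ({', '.join(COLUMNS)})")
    print(import_file_elements(sample, db, batch_size=1))
```

Attributes missing from an element are stored as NULL. For a multi-gigabyte file, the emptied elements still attached to the root may also need to be released periodically.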
Try this. It's just another method that I have used for some time. It's pretty fast (could be faster). I pull a huge XML db from a gaming company every night. This is how I get it and import it.
$xml = new XMLReader();
$xml->open($xml_file); // file is your xml file you want to parse
while($xml->read() && $xml->name != 'game') { ; } // get past the header to your first record (game in my case)
while($xml->name == 'game') { // now while we are in this record
$element = new SimpleXMLElement($xml->readOuterXML());
$gameRec = $this->createGameRecord($element, $os); // this is my function to reduce some clutter - and I use it elsewhere too
/* this looks confusing, but it is not. There are over 20 fields, and instead of typing them all out, I just made a string. */
$sql = "INSERT INTO $table (";
foreach($gameRec as $field=>$game){
$sql .= " $field,";
}
$sql = rtrim($sql, ",");
$sql .=") values (";
foreach($gameRec as $field=>$game) {
$sql .= " :$field,";
}
$sql = rtrim($sql,",");
$sql .= ") ON DUPLICATE KEY UPDATE "; // online game doesn't have a gamerank - not my choice LOL, so I adjust that for here
switch ($os) {
case 'pc' : $sql .= "gamerank = ".$gameRec['gamerank'] ; break;
case 'mac': $sql .= "gamerank = ".$gameRec['gamerank'] ; break;
case 'pl' : $sql .= "playercount = ".$gameRec['playercount'] ; break;
case 'og' :
$playercount = $this->getPlayerCount($gameRec['gameid']);
$sql .= "playercount = ".$playercount['playercount'] ;
break;
}
try {
$stmt = $this->connect()->prepare($sql);
$stmt->execute($gameRec);
} catch (PDOException $e) {// Kludge
echo 'os: '.$os.'<br/>table: '.$table.'<br/>XML LINK: '.$comprehensive_xml.'<br/>Current Record:<br/><pre>'.print_r($gameRec, true).'</pre><br/>'.
'SQL: '.$sql.'<br/>';
die('Line:33<br/>Function: pullBFG()<BR/>Cannot add game record <br/>'.$e->getMessage());
}
/// VERY VERY VERY IMPORTANT do not forget these 2 lines, or it will go into a endless loop - I know, I've done it. locks up your system after a bit hahaah
$xml->next('game');
unset($element);
}// while there are games
This should get you started. Obviously, adjust the "game" to your xml records. Trim out the fat I have here.
Here is the createGameRecord($element, $type='pc')
Basically it turns the element into an array to use elsewhere, and makes it easier to add to the db with a single line, as seen above: $stmt->execute($gameRec); where $gameRec was returned from this function. PDO knows $gameRec is an array and will bind its values as you insert it. delHardReturns() is another of my functions; it gets rid of hard returns (\r, \n, etc.), which seem to mess up the SQL. I think SQL has a function for that, but I have not pursued it.
Hope you find this useful.
private function createGameRecord($element, $type='pc') {
if( ($type == 'pc') || ($type == 'og') ) { // player count is handled separately
$game = array(
'gamename' => strval($element->gamename),
'gameid' => strval($element->gameid),
'genreid' => strval($element->genreid),
'allgenreid' => strval($element->allgenreid),
'shortdesc' => $this->delHardReturns(strval($element->shortdesc)),
'meddesc' => $this->delHardReturns(strval($element->meddesc)),
'bullet1' => $this->delHardReturns(strval($element->bullet1)),
'bullet2' => $this->delHardReturns(strval($element->bullet2)),
'bullet3' => $this->delHardReturns(strval($element->bullet3)),
'bullet4' => $this->delHardReturns(strval($element->bullet4)),
'bullet5' => $this->delHardReturns(strval($element->bullet5)),
'longdesc' => $this->delHardReturns(strval($element->longdesc)),
'foldername' => strval($element->foldername),
'hasdownload' => strval($element->hasdownload),
'hasdwfeature' => strval($element->hasdwfeature),
'releasedate' => strval($element->releasedate)
);
if($type === 'pc') {
$game['hasvideo'] = strval($element->hasvideo);
$game['hasflash'] = strval($element->hasflash);
$game['price'] = strval($element->price);
$game['gamerank'] = strval($element->gamerank);
$game['gamesize'] = strval($element->gamesize);
$game['macgameid'] = strval($element->macgameid);
$game['family'] = strval($element->family);
$game['familyid'] = strval($element->familyid);
$game['productid'] = strval($element->productid);
$game['pc_sysreqos'] = strval($element->systemreq->pc->sysreqos);
$game['pc_sysreqmhz'] = strval($element->systemreq->pc->sysreqmhz);
$game['pc_sysreqmem'] = strval($element->systemreq->pc->sysreqmem);
$game['pc_sysreqhd'] = strval($element->systemreq->pc->sysreqhd);
if(empty($game['gamerank'])) $game['gamerank'] = 99999;
$game['gamesize'] = $this->readableBytes((int)$game['gamesize']);
}// dealing with PC type
if($type === 'og') {
$game['onlineiframeheight'] = strval($element->onlineiframeheight);
$game['onlineiframewidth'] = strval($element->onlineiframewidth);
}
$game['releasedate'] = substr($game['releasedate'],0,10);
} else {// not type = pl
$game['playercount'] = strval($element->playercount);
$game['gameid'] = strval($element->gameid);
}// no type = pl else
return $game;
}
Updated: Much faster. I did some research, and while the above post I made shows one (slow) method, I was able to find one that works even faster - for me it does.
I put this as a new answer due to the complete difference from my previous post.
LOAD XML LOCAL INFILE 'path/to/file.xml' INTO TABLE tablename ROWS IDENTIFIED BY '<xml-identifier>'
Example
<students>
<student>
<name>john doe</name>
<boringfields>bla bla bla......</boringfields>
</student>
</students>
Then, MYSQL command would be:
LOAD XML LOCAL INFILE 'path/to/students.xml' INTO TABLE tablename ROWS IDENTIFIED BY '<student>'
The ROWS IDENTIFIED BY value must include the single quotes and angle brackets.
When I switched to this method, I went from about 12 minutes to about 30 seconds!
A tip that worked for me: first run
DELETE FROM tablename
otherwise it will just append to your table.
Ref: https://dev.mysql.com/doc/refman/5.5/en/load-xml.html

An optimized stored procedure to replace this LINQ statement

I have the following two tables in SQL Server 2008
CREATE TABLE [JobUnit](
[idJobUnit] [int] IDENTITY(1,1) NOT NULL,
[Job_idJob] [int] NOT NULL, -- Foreign key here
[UnitStatus] [tinyint] NOT NULL -- can be (0 for unprocessed, 1 for processing, 2 for processed)
)
CREATE TABLE [Job](
[idJob] [int] IDENTITY(1,1) NOT NULL,
[JobName] [varchar](50) NOT NULL
)
Job : JobUnit is a one-to-many relationship
I am trying to write an efficient stored procedure that would replace the following LINQ statement:
public enum UnitStatus{
unprocessed,
processing,
processed,
}
int jobId = 10;
using(EntityFramework context = new EntityFramework())
{
if (context.JobUnits.Where(ju => ju.Job_idJob == jobId)
.Any(ju => ju.UnitStatus == (byte)UnitStatus.unprocessed))
{
// Some JobUnit is unprocessed
return 1;
}
else
{
// There is no unprocessed JobUnit
if (context.JobUnits.Where(ju => ju.Job_idJob == jobId) //
.Any(ju => ju.UnitStatus == (byte)UnitStatus.processing))
{
// JobUnit has some unit that is processing, but none is unprocessed
return 2;
}
else
{
// Every JobUnit is processed
return 3;
}
}
}
Thanks for reading
So really, you're just looking for the lowest state of all the units in a particular job?
CREATE PROCEDURE GetJobState @jobId int AS
SELECT MIN(UnitStatus)
FROM JobUnit
WHERE Job_idJob = @jobId
I should also say you could use this approach just as easily in LINQ.
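One detail to watch: the procedure returns the raw minimum status (0, 1 or 2), while the original code returns 1, 2 or 3, so the caller has to add 1. The sketch below checks that equivalence in Python. Both function names are hypothetical, and treating a job with no units as "all processed" is an assumption that mirrors the original branching; in SQL, MIN over zero rows yields NULL and would need handling.

```python
def job_state_branching(statuses):
    """Mirror of the original two-step Any() logic: 1, 2 or 3."""
    if any(s == 0 for s in statuses):
        return 1  # some unit is unprocessed
    if any(s == 1 for s in statuses):
        return 2  # some unit is processing, none unprocessed
    return 3      # every unit is processed (or there are no units)


def job_state_min(statuses):
    """Same result from the lowest status; an empty job defaults to 'processed' (2)."""
    return min(statuses, default=2) + 1


if __name__ == "__main__":
    for sample in ([0, 1, 2], [1, 2, 2], [2, 2], []):
        print(sample, job_state_branching(sample), job_state_min(sample))
```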

Resources