I'm trying to concatenate a number of characters corresponding to some ints (the first 15 ASCII characters for example):
;with cte as (
    select 1 nr
    union all
    select nr + 1
    from cte
    where nr <= 15)
select (
    select char(nr)
    from cte
    for xml path (''), type).value('.', 'nvarchar(max)')
option (maxrecursion 0)
but I'm getting an error saying:
Msg 6841, Level 16, State 1, Line 1
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x0001) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.
Even if I try to modify my CTE's seed from 1 to 10 for example, I still get the error but for a different character, 0x000B.
I have two possible solutions I'm looking for:
find a way to concatenate all the characters (any method other than FOR XML) - preferred solution
or
remove all characters that are not allowed in XML - I've tried this but it seems I just hit other non-allowed characters. I've also looked for a list of these non-allowed characters but I couldn't find one.
Any help is very much appreciated.
Update - context:
This is part of a bigger CTE where I'm trying to generate random character sets from random numbers by doing multiple divisions and modulus operations.
I take each number modulo 256, turn the result into its corresponding CHAR(), then divide the number by 256, and repeat until both the modulus and the quotient are 0 (for example, 16961 % 256 = 65 ('A'), then 16961 / 256 = 66 and 66 % 256 = 66 ('B'), so 16961 encodes 'AB').
In the end I want to concatenate all of these characters. I have everything in place; I'm just encountering this error, which prevents me from concatenating the strings generated from CHAR().
This might sound weird and you might say that it's not a SQL-task and you can do it in other languages, but I want to try and find a solution in SQL, no matter how low the performance is.
XML PATH is just one of the techniques used for grouped concatenation. Aaron Bertrand explains and compares all of them in Grouped Concatenation in SQL Server. Built-in support for this is coming in the next version of SQL Server in the form of STRING_AGG.
Bertrand's article explains that XML PATH can only work with XML-safe characters. Non-printable characters like 0x1 (SOH) and 0xB (vertical tab) won't work without XML-encoding the data first. Typically this isn't a problem, because real data doesn't contain non-printable characters - what would an SOH or a VT look like in a text box?
Perhaps the easiest way to solve your problem is to use NCHAR() instead of CHAR() to generate the characters, and to start from 32 instead of 0 or 1, so the non-printable range is never generated in the first place.
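For illustration, here is the original query rewritten along those lines - a minimal sketch, with the upper bound (126, the end of printable ASCII) chosen arbitrarily:

;with cte as (
    select 32 nr
    union all
    select nr + 1
    from cte
    where nr < 126)
select (
    select nchar(nr)
    from cte
    for xml path (''), type).value('.', 'nvarchar(max)')
option (maxrecursion 0)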
For now, the fastest and safest method to aggregate strings is a SQLCLR custom aggregate. If you don't use sloppy techniques like concatenating strings directly, it will also consume the least amount of memory. The various GROUP_CONCAT implementations shown in this project are small enough that you can copy and use them in your own projects. They will work with any Unicode character too, even non-printable ones.
BTW, SQL Server vNext brings STRING_AGG to aggregate strings. We'll just have to wait a year or two.
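For reference, once STRING_AGG is available (it ended up shipping in SQL Server 2017), the whole concatenation collapses to something like this sketch, with no XML-safety concerns at all:

;with cte as (
    select 32 nr
    union all
    select nr + 1
    from cte
    where nr < 126)
select string_agg(nchar(nr), '') within group (order by nr)
from cte
option (maxrecursion 0);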
The non-ordered version, GROUP_CONCAT is just 99 lines. It simply collects all strings in a dictionary and writes them out at the end:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.IO;
using System.Collections.Generic;
using System.Text;

namespace GroupConcat
{
    [Serializable]
    [SqlUserDefinedAggregate(Format.UserDefined,
                             MaxByteSize = -1,
                             IsInvariantToNulls = true,
                             IsInvariantToDuplicates = false,
                             IsInvariantToOrder = true,
                             IsNullIfEmpty = true)]
    public struct GROUP_CONCAT : IBinarySerialize
    {
        private Dictionary<string, int> values;

        public void Init()
        {
            this.values = new Dictionary<string, int>();
        }

        public void Accumulate([SqlFacet(MaxSize = 4000)] SqlString VALUE)
        {
            if (!VALUE.IsNull)
            {
                string key = VALUE.Value;
                if (this.values.ContainsKey(key))
                {
                    this.values[key] += 1;
                }
                else
                {
                    this.values.Add(key, 1);
                }
            }
        }

        public void Merge(GROUP_CONCAT Group)
        {
            foreach (KeyValuePair<string, int> item in Group.values)
            {
                string key = item.Key;
                if (this.values.ContainsKey(key))
                {
                    this.values[key] += Group.values[key];
                }
                else
                {
                    this.values.Add(key, Group.values[key]);
                }
            }
        }

        [return: SqlFacet(MaxSize = -1)]
        public SqlString Terminate()
        {
            if (this.values != null && this.values.Count > 0)
            {
                StringBuilder returnStringBuilder = new StringBuilder();
                foreach (KeyValuePair<string, int> item in this.values)
                {
                    for (int value = 0; value < item.Value; value++)
                    {
                        returnStringBuilder.Append(item.Key);
                        returnStringBuilder.Append(",");
                    }
                }
                return returnStringBuilder.Remove(returnStringBuilder.Length - 1, 1).ToString();
            }
            return null;
        }

        public void Read(BinaryReader r)
        {
            int itemCount = r.ReadInt32();
            this.values = new Dictionary<string, int>(itemCount);
            for (int i = 0; i <= itemCount - 1; i++)
            {
                this.values.Add(r.ReadString(), r.ReadInt32());
            }
        }

        public void Write(BinaryWriter w)
        {
            w.Write(this.values.Count);
            foreach (KeyValuePair<string, int> s in this.values)
            {
                w.Write(s.Key);
                w.Write(s.Value);
            }
        }
    }
}
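Once the assembly is built and registered, wiring it up and calling it looks roughly like this - a sketch, where the assembly name [GroupConcat] is an assumption based on the namespace above, and table/column names are placeholders:

-- assumes the compiled DLL was registered as an assembly named [GroupConcat]
CREATE AGGREGATE dbo.GROUP_CONCAT (@value nvarchar(4000))
RETURNS nvarchar(max)
EXTERNAL NAME GroupConcat.[GroupConcat.GROUP_CONCAT];
GO

-- then it can be called like any built-in aggregate:
SELECT dbo.GROUP_CONCAT(someColumn)
FROM someTable;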
Just another approach (works with non-printables too):
You are adding the characters one after another, so you do not need grouped concatenation at all. Your recursive (rather, iterative) CTE is a hidden RBAR on its own and will do this for you.
The following example uses a list of ints (considering your use case where you need to do this with random numbers) as input:
DECLARE @SomeInts TABLE(ID INT IDENTITY, intVal INT);
INSERT INTO @SomeInts VALUES(36),(33),(39),(32),(35),(37),(1),(2),(65);

WITH cte AS
(
    SELECT ID, intVal AS nr, CAST(CHAR(intVal) AS VARCHAR(MAX)) AS targetString
    FROM @SomeInts WHERE ID = 1

    UNION ALL

    SELECT si.ID, intVal + 1, targetString + CHAR(intVal)
    FROM @SomeInts AS si
    INNER JOIN cte ON si.ID = cte.ID + 1
)
SELECT targetString, CAST(targetString AS varbinary(max))
FROM cte
OPTION (maxrecursion 0);
The result (printed, and as a growing hex list) shows the string being built up row by row - beware of the 0x01 and 0x02 bytes, which are exactly the ones that break the XML approach.
Without using LINQ, here is a solution:
// store the words and their counts
Dictionary<string, int> data = new Dictionary<string, int>();
string inputString = "I love red color. He loves red color. She love red kit.";
// split the string on spaces; you could also exclude non-alphabetic characters here
var details = inputString.Split(' ');
foreach (var detail in details)
{
    // based on Ron's comment: skip empty entries in case the string contains multiple spaces
    if (string.IsNullOrEmpty(detail))
        continue;
    if (data.ContainsKey(detail))
        data[detail]++;
    else
        data.Add(detail, 1);
}
What I did was break the string into an array using the Split function, then loop through each element and check whether that element had been seen before. If yes, increment its count by 1; else add the element to the dictionary.
class Program
{
    static void Main(string[] args)
    {
        string inputString = "I love red color. He loves red color. She love red kit.";
        Dictionary<string, int> dict = new Dictionary<string, int>();
        var arr = inputString.Split(' ', '.', ',');
        foreach (string s in arr)
        {
            if (dict.ContainsKey(s))
                dict[s] += 1;
            else
                dict.Add(s, 1);
        }
        foreach (var item in dict)
        {
            Console.WriteLine(item.Key + "- " + item.Value);
        }
        Console.ReadKey();
    }
}
Try this way
string inputString = "I love red color. He loves red color. She love red kit.";
Dictionary<string, int> wordcount = new Dictionary<string, int>();
var words = inputString.Split(' ');
foreach (var word in words)
{
    if (!wordcount.ContainsKey(word))
        wordcount.Add(word, words.Count(p => p == word));
}
wordcount will have the output you are looking for. Note that it will have all entries for all words, so if you want for only a subset, then alter it to lookup against a master list.
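The same counting can also be done in one pass with LINQ's GroupBy, rather than re-scanning the array for every word - a sketch (requires using System.Linq):

var counts = inputString
    .Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(word => word)                      // one group per distinct word
    .ToDictionary(g => g.Key, g => g.Count());  // word -> number of occurrences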
Check the link given below.
Count Word
This example shows how to use a LINQ query to count the occurrences of a specified word in a string. Note that to perform the count, first the Split method is called to create an array of words. There is a performance cost to the Split method. If the only operation on the string is to count the words, you should consider using the Matches or IndexOf methods instead. However, if performance is not a critical issue, or you have already split the sentence in order to perform other types of queries over it, then it makes sense to use LINQ to count the words or phrases as well.
class CountWords
{
    static void Main()
    {
        string text = @"Historically, the world of data and the world of objects" +
            @" have not been well integrated. Programmers work in C# or Visual Basic" +
            @" and also in SQL or XQuery. On the one side are concepts such as classes," +
            @" objects, fields, inheritance, and .NET Framework APIs. On the other side" +
            @" are tables, columns, rows, nodes, and separate languages for dealing with" +
            @" them. Data types often require translation between the two worlds; there are" +
            @" different standard functions. Because the object world has no notion of query, a" +
            @" query can only be represented as a string without compile-time type checking or" +
            @" IntelliSense support in the IDE. Transferring data from SQL tables or XML trees to" +
            @" objects in memory is often tedious and error-prone.";

        string searchTerm = "data";

        // Convert the string into an array of words
        string[] source = text.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' }, StringSplitOptions.RemoveEmptyEntries);

        // Create and execute the query. It executes immediately
        // because a singleton value is produced.
        // Use ToLowerInvariant to match "data" and "Data"
        var matchQuery = from word in source
                         where word.ToLowerInvariant() == searchTerm.ToLowerInvariant()
                         select word;

        // Count the matches.
        int wordCount = matchQuery.Count();
        Console.WriteLine("{0} occurrences(s) of the search term \"{1}\" were found.", wordCount, searchTerm);

        // Keep console window open in debug mode
        Console.WriteLine("Press any key to exit");
        Console.ReadKey();
    }
}

/* Output:
   3 occurrences(s) of the search term "data" were found.
*/
I am importing an Excel file which is formatted like a report - that is, some columns are only populated once for each group of rows they belong to, such as:
CaseID |Date |Code
157207 | |
|8/1/2012 |64479
|8/1/2012 |Q9967
|8/1/2012 |99203
I need to capture one of these group headers (CaseID, in the example above) and use it for subsequent rows where the field is blank, then save the next value that I encounter.
I have added a variable (User::CurrentCaseId) and a Script transform, with the following code:
public class ScriptMain : UserComponent
{
    string newCaseId;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (!Row.CaseIDName_IsNull && Row.CaseIDName.Length > 0)
            newCaseId = Row.CaseIDName;
        else
            newCaseId = "DetailRow";
    }

    public override void PostExecute()
    {
        base.PostExecute();
        if (newCaseId != "DetailRow")
            Variables.CurrentCaseId = newCaseId;
    }
}
Basically, I am trying to read the value when present and save it in this variable. I use a conditional split to ditch the rows that only have the CaseID and then use a derived column to put the variable value into a new column to complete the detail row.
Alas, the value is always blank (I placed a data viewer after the derived column). I modified the script to always set the variable to a fixed string - the derived column is still blank.
This seemed like a good plan... I received some feedback in the MS forums that you can't set a variable value and use its new value within the same Data Flow Task. If that is so, the only solution I can think of is to write the CaseID out to a table when present and read it back in when absent. I really hate to do that with several million rows (multiple Excel worksheets). Any better ideas?
Best,
Scott
This can be a good starting point for you.
I used the following file as the source. Saved it into C:\Temp\5.TXT
CaseID |Date |Code
157207 | |
|8/1/2012 |64479
|8/1/2012 |Q9967
|8/1/2012 |99203
157208 | |
|9/1/2012 |77779
|9/2/2012 |R9967
|9/3/2012 |11203
1. Put a DFT on the Control Flow surface.
2. Put a Script Component, as a Source, on the DFT.
3. Configure the Script Component:
3.1. Go to the Inputs and Outputs section.
3.2. Add an Output. Change its name to MyOutput.
3.2.1. Add the following output columns - CaseID, Date, Code.
3.2.2. The data types are four-byte unsigned integer [DT_UI4], string [DT_STR], string [DT_STR].
Now go to Scripts // Edit Script. Put the following code. Make sure to add
using System.IO;
to the namespace area.
public override void CreateNewOutputRows()
{
    string[] lines = File.ReadAllLines(@"C:\temp\5.txt");
    int iRowCount = 0;
    string[] fields = null;
    int iCaseID = 0;

    foreach (string line in lines)
    {
        if (iRowCount == 0)
        {
            // skip the header row
            iRowCount++;
        }
        else
        {
            fields = line.Split('|');

            // trim the field values
            for (int i = 0; i < fields.Length; i++)
            {
                fields[i] = fields[i].Trim();
            }

            if (!fields[0].Equals(string.Empty))
            {
                // a group header row: remember the CaseID for the detail rows that follow
                iCaseID = Convert.ToInt32(fields[0]);
            }
            else
            {
                // a detail row: emit it together with the remembered CaseID
                MyOutputBuffer.AddRow();
                MyOutputBuffer.CaseID = iCaseID;
                MyOutputBuffer.Date = fields[1];
                MyOutputBuffer.Code = fields[2];
            }
        }
    }
}
Testing your code: add a Union All component right beneath where you put the Script component, connect the output of the Script component to the Union All component, and put a data viewer on the path between them.
Hopefully this should help you. Please let us know. I responded to a similar question today; please check that one out as well. That may help in solidifying the concept - IMHO.
The following is the C# code and generated SQL in a LINQ to SQL query for two cases.
Case 1
using (JulianDataContext dc = new JulianDataContext(this.CurrentConnectionString))
{
#if DEBUG
    dc.Log = new DebugTextWriter();
#endif
    IEnumerable<UserNewsfeedDeliveryTime> temp = dc.UserNewsfeedDeliveryTimes.Where(u => u.NewsfeedEmailPeriodicity > 0 && DateTime.Today >= u.NextNewsfeedDelivery.Value.Date);
    ids = temp.Select(p => p.Id).ToList();
}
SELECT [t0].[Id], [t0].[NewsfeedEmailPeriodicity], [t0].[LastSentNewsfeedEmail], [t0].[NextNewsfeedDelivery]
FROM [dbo].[UserNewsfeedDeliveryTimes] AS [t0]
WHERE ([t0].[NewsfeedEmailPeriodicity] > @p0) AND (@p1 >= CONVERT(DATE, [t0].[NextNewsfeedDelivery]))
-- @p0: Input Int (Size = -1; Prec = 0; Scale = 0) [0]
-- @p1: Input DateTime (Size = -1; Prec = 0; Scale = 0) [15-11-2012 00:00:00]
Case 2
using (JulianDataContext dc = new JulianDataContext(this.CurrentConnectionString))
{
#if DEBUG
    dc.Log = new DebugTextWriter();
#endif
    IEnumerable<UserNewsfeedDeliveryTime> temp = dc.GetTable<UserNewsfeedDeliveryTime>();
    temp = temp.Where(u => u.NewsfeedEmailPeriodicity > 0 && DateTime.Today >= u.NextNewsfeedDelivery.Value.Date);
    ids = temp.Select(p => p.Id).ToList();
}
SELECT [t0].[Id], [t0].[NewsfeedEmailPeriodicity], [t0].[LastSentNewsfeedEmail], [t0].[NextNewsfeedDelivery]
FROM [dbo].[UserNewsfeedDeliveryTimes] AS [t0]
The difference
The difference between these two LINQ queries:
dc.UserNewsfeedDeliveryTimes
and
dc.GetTable<UserNewsfeedDeliveryTime>()
Why? Could it be that, in Case 2, LINQ to SQL is retrieving all the data from the database and finishing the query by filtering the objects in memory?
If so, how can we keep this generic and still force the full T-SQL to be generated?
Solution
Both answers are correct, but I could only pick one, sorry! I think it is also interesting to add that, since I changed to work with an IQueryable (which inherits from IEnumerable), in this line:
temp = temp.Where(u => u.NewsfeedEmailPeriodicity > 0 && DateTime.Today >= u.NextNewsfeedDelivery.Value.Date);
I had two overloads to choose from, one from the IQueryable interface and another from the IEnumerable interface:
public static IQueryable<TSource> Where<TSource>(this IQueryable<TSource> source, Expression<Func<TSource, bool>> predicate);
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
So I had to convert my predicate explicitly to an Expression<Func<UserNewsfeedDeliveryTime, bool>>, otherwise the IEnumerable overload would have been picked at compile time and, if I am not mistaken, I would have gotten a dynamic SQL exception saying the T-SQL could not be generated.
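Concretely, the fix to Case 2 looks something like this sketch (entity and context names come from the question; requires using System.Linq.Expressions):

// Keeping the variable typed as IQueryable<T> makes the compiler pick
// Queryable.Where, whose Expression<Func<T, bool>> argument is
// translated to T-SQL instead of being executed in memory.
IQueryable<UserNewsfeedDeliveryTime> temp = dc.GetTable<UserNewsfeedDeliveryTime>();

Expression<Func<UserNewsfeedDeliveryTime, bool>> predicate =
    u => u.NewsfeedEmailPeriodicity > 0
      && DateTime.Today >= u.NextNewsfeedDelivery.Value.Date;

temp = temp.Where(predicate);           // still composing the query
ids = temp.Select(p => p.Id).ToList();  // SQL is generated and executed here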
From my understanding, IEnumerable does not transform the original query information that IQueryable holds. It's almost as if the cast freezes any changes to the IQueryable query at the point of casting. If you look at MSDN, it turns out that IQueryable inherits IEnumerable:
http://msdn.microsoft.com/en-us/library/system.linq.iqueryable.aspx
Hence, you see this behaviour. It is important with LINQ-SQL to work with IQueryable unless you want the query frozen at the point it is turned to an IEnumerable.
In your first example, the Where is composed into the original query, which is why it shows up in the generated SQL; the Select is not, hence the query you see.
In your second example, you capture the table itself into an IEnumerable. Anything built on top of that is done in memory, over the results of the original query.
When you think about it, the IEnumerable version of Where cannot transform the original query information of the IQueryable, due to the cast and how inheritance works.
When you also consider deferred loading and how LINQ works, this seems to make sense.
To me it is a big annoyance, as it can lead you into generating some terribly performing code.
Try using IQueryable instead of IEnumerable.
Weird, because in my examples I get the opposite results from you, i.e. with IEnumerable, Case 1 works fast and Case 2 retrieves all the data. But using IQueryable fixes the issue.
I'm trying to find a way to count the columns coming from a flat file. Actually, all my columns are concatenated in a single cell, separated with a '|';
after various attempts, it seems that only a script task can handle this.
Can anyone help me with that? I sadly have no experience with scripting in C# or VB.
Thanks a lot
Emmanuel
To better understand, below is the output of what I want to achieve, e.g. a single cell containing all the headers coming from a flat file. The thing is, to get to this result, in the previous step (derived column) I manually appended all the column names to one another in order to concatenate them with a '|' separator.
Now, if my flat file source layout changes, it won't work anymore because of this manual process. So I think I should instead use a script which returns the number of (header) columns in a variable, which would let me remove the hard-coded part in the derived column transformation, for instance.
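For what it's worth, the counting part of such a script is only a few lines. A sketch of a Script Task body follows; the file path and the User::ColumnCount variable are placeholders you would adapt to your package (the variable must be listed as a ReadWriteVariable of the task):

using System.IO;

// read only the header row and count its '|'-separated columns
string header;
using (StreamReader reader = new StreamReader(@"C:\MyFolder\MyFlatFile.txt"))
{
    header = reader.ReadLine();
}
int columnCount = header.Split('|').Length;

// hand the count to the rest of the package
Dts.Variables["User::ColumnCount"].Value = columnCount;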
This is a very old thread; however, I just stumbled on a similar problem: a flat file with a number of different record "formats" inside. Many different formats, not in any particular order, meaning you might have 57 fields in one line, then 59 in the next 1,000 lines, then 56 in the next 10,000, then back to 57... well, I think you get the idea.
For lack of better ideas, I decided to break that file up based on the number of commas in each line, and then import the different record types (now bunched together) using an SSIS package for each type.
So the answer to this question is below, with a bit more code to produce the files.
Hope this helps somebody with the same problem.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace OddFlatFile_Transformation
{
    class RedistributeLines
    {
        /*
         * This routine opens a text file and reads it line by line.
         * For each line the number of "," (commas) is counted,
         * and the line is then written into another text file
         * based on that number of commas found.
         * For example, if there are 15 commas in a given line,
         * the line is written to WhateverFileName_15.Ext,
         * where WhateverFileName and Ext are the same file name and
         * extension as the original file being read.
         * The application tests WhateverFileName_NN.Ext for existence
         * and creates the file in case it does not exist yet.
         * To better control the split records, a sequential identifier,
         * based on the number of lines read, is added to the beginning
         * of each line written, independently of the file and record number.
         */
        static void Main(string[] args)
        {
            // get the fully qualified file name from the console
            String strFileToRead;
            strFileToRead = Console.ReadLine();

            // create reader & open file
            StreamReader srTextFileReader = new StreamReader(strFileToRead);

            string strLineRead = "";
            string strFileToWrite = "";
            string strLineIdentifier = "";
            string strLineToWrite = "";
            int intCountLines = 0;
            int intCountCommas = 0;
            int intDotPosition = 0;
            const string strZeroPadding = "00000000";

            // Processing begins
            Console.WriteLine("Processing begins: " + DateTime.Now);

            /* Main Loop */
            while (strLineRead != null)
            {
                // read a line of text, count the commas and create the line identifier
                strLineRead = srTextFileReader.ReadLine();
                if (strLineRead != null)
                {
                    intCountLines += 1;
                    strLineIdentifier = strZeroPadding.Substring(0, strZeroPadding.Length - intCountLines.ToString().Length) + intCountLines;
                    intCountCommas = 0;
                    foreach (char chrEachPosition in strLineRead)
                    {
                        if (chrEachPosition == ',') intCountCommas++;
                    }

                    // Based on the number of commas determined above,
                    // the name of the file to be written to is established
                    intDotPosition = strFileToRead.IndexOf(".");
                    strFileToWrite = strFileToRead.Substring(0, intDotPosition) + "_";
                    if (intCountCommas < 10)
                    {
                        strFileToWrite += "0" + intCountCommas;
                    }
                    else
                    {
                        strFileToWrite += intCountCommas;
                    }
                    strFileToWrite += strFileToRead.Substring(intDotPosition, (strFileToRead.Length - intDotPosition));

                    // Using the file name established above, the line captured
                    // during the read phase is written to that file
                    StreamWriter swTextFileWriter = new StreamWriter(strFileToWrite, true);
                    strLineToWrite = "[" + strLineIdentifier + "] " + strLineRead;
                    swTextFileWriter.WriteLine(strLineToWrite);
                    swTextFileWriter.Close();
                    Console.WriteLine(strLineIdentifier);
                }
            }

            // close the stream
            srTextFileReader.Close();
            Console.WriteLine(DateTime.Now);
            Console.ReadLine();
        }
    }
}
Please refer to my answers in the following Stack Overflow questions. Those answers might give you an idea of how to load a flat file that contains a varying number of columns.
The example in the following question reads a file containing data separated by the special character Ç (c-cedilla). In your case, the delimiter is the vertical bar (|).
UTF-8 flat file import to SQL Server 2008 not recognizing {LF} row delimiter
The example in the following question reads an EDI file that contains different sections with varying numbers of columns. The package reads the file and loads it, with parent-child relationships, into an SQL table.
how to load a flat file with header and detail parent child relationship into SQL server
Based on the logic used in those answers, you can also count the number of columns by splitting the rows in the file on the column delimiter (vertical bar |).
Hope that helps.
My table has a timestamp column named "RowVer" which LINQ maps to type System.Data.Linq.Binary. This data type seems useless to me because (unless I'm missing something) I can't do things like this:
// Select all records that changed since the last time we inserted/updated.
IEnumerable<UserSession> rows = db.UserSessions.Where
( usr => usr.RowVer > ???? );
So, one of the solutions I'm looking at is to add a new "calculated column" called RowTrack which is defined in SQL like this:
CREATE TABLE UserSession
(
RowVer timestamp NOT NULL,
RowTrack AS (convert(bigint,[RowVer])),
-- ... other columns ...
)
This allows me to query the database like I want to:
// Select all records that changed since the last time we inserted/updated.
IEnumerable<UserSession> rows = db.UserSessions.Where
( usr => usr.RowTrack > 123456 );
Is this a bad way to do things? How performant is querying on a calculated column? Is there a better work-around?
Also, I'm developing against Sql Server 2000 for ultimate backwards compatibility, but I can talk the boss into making 2005 the lowest common denominator.
As Diego Frata outlines in this post, there is a hack that enables timestamps to be queryable from LINQ.
The trick is to define a Compare method that takes two System.Data.Linq.Binary parameters:
public static class BinaryComparer
{
    public static int Compare(this Binary b1, Binary b2)
    {
        throw new NotImplementedException();
    }
}
Notice that the function doesn't need to be implemented; only its name (Compare) is important.
And the query will look something like:
Binary lastTimestamp = GetTimeStamp();
var result = from job in c.GetTable<tblJobs>()
             where BinaryComparer.Compare(job.TimeStamp, lastTimestamp) > 0
             select job;
(This is the equivalent of job.TimeStamp > lastTimestamp.)
EDIT:
See Rory MacLeod's answer for an implementation of the method, if you need it to work outside of SQL.
SQL Server "timestamp" is only an indicator that the record has changed, its not actually a representation of Date/Time. (Although it is suppose to increment each time a record in the DB is modified,
Beware that it will wrap back to zero (not very often, admittedly), so the only safe test is if the value has changed, not if it is greater than some arbitrary previous value.
You could pass the TimeStamp column value to a web form, and then when it is submitted see if the TimeStamp from the form is different to the value in the current record - if its is different someone else has changed & saved the record in the interim.
// Select all records that changed since the last time we inserted/updated.
Is there a better work-around?
Why not have two columns, one for CreatedDate and another for LastModifiedDate? I would say that is the more traditional way to handle this scenario.
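A minimal sketch of that idea (names are illustrative; keeping LastModifiedDate current would live in application code or an UPDATE trigger):

CREATE TABLE UserSession
(
    -- ... other columns ...
    CreatedDate      datetime NOT NULL DEFAULT GETDATE(),
    LastModifiedDate datetime NOT NULL DEFAULT GETDATE()
);

-- "changed since we last looked" then becomes an ordinary range query:
SELECT *
FROM UserSession
WHERE LastModifiedDate > @LastCheckedTime;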
Following on from jaraics' answer, you could also provide an implementation of the Compare method that would allow it to work outside of a query:
using System;
using System.Data.Linq;
using System.Runtime.InteropServices;

public static class BinaryExtensions
{
    public static int Compare(this Binary b1, Binary b2)
    {
        if (b1 == null)
            return b2 == null ? 0 : -1;
        if (b2 == null)
            return 1;

        byte[] bytes1 = b1.ToArray();
        byte[] bytes2 = b2.ToArray();
        int len = Math.Min(bytes1.Length, bytes2.Length);
        int result = memcmp(bytes1, bytes2, len);
        if (result == 0 && bytes1.Length != bytes2.Length)
        {
            return bytes1.Length > bytes2.Length ? 1 : -1;
        }
        return result;
    }

    [DllImport("msvcrt.dll")]
    private static extern int memcmp(byte[] arr1, byte[] arr2, int cnt);
}
The use of memcmp was taken from this answer to a question on comparing byte arrays. If the arrays aren't the same length, but the longer array starts with the same bytes as the shorter array, the longer array is considered to be greater than the shorter one, even if the extra bytes are all zeroes.
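With the two answers combined, the same test works on either side of the query boundary; a sketch reusing the names from the question above:

Binary lastTimestamp = GetTimeStamp();
// in a LINQ to SQL query this is translated to a T-SQL comparison (per jaraics' hack);
// on objects already in memory it runs the memcmp-based implementation above
IEnumerable<UserSession> rows = db.UserSessions.Where(
    usr => usr.RowVer.Compare(lastTimestamp) > 0);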