I have Google Forms results from live workshops that I want to edit in Sheets. I then want to calculate averages, pivot etc. to explore the data for insights.
I want to clean and standardize the data. I'd like to replace a series of long strings and reduce them just to their leading number:
UPDATE: As requested, here is a view-only sample Sheet with the raw data on the 1st tab and the format I'd like on the 2nd tab:
https://docs.google.com/spreadsheets/d/1pP8YV3oJXWGt3-88qgzMuup1pm1IsIgD4SHDL09MXrc/edit?usp=sharing
Example:
I'm aware of certifications I should be starting.
Replace with
3
Example 2:
I'm currently progressing a powerful certification.
Replace with:
4
In Excel I would simply use * as wildcard for the rest of the string, but Sheets appears different. I've read documentation and posts about regular expressions etc. and I'm not sure if that's overkill or how to proceed.
I'd THEN like to create a macro which does that for the whole sheet:
All strings which begin with 1)*
--> Replace with 1
All strings which begin with 2)*
--> Replace with 2
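Try REGEXEXTRACT, which returns the leading run of digits (use a comma instead of the semicolon if your locale uses the comma as the argument separator):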
=REGEXEXTRACT(A1; "^\d+")
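For anyone who wants to check the pattern outside Sheets, here is a minimal Python sketch of the same idea. The sample strings, including their "3) " style prefixes, are assumptions based on the "begin with 1)*" description in the question, and `leading_number` is a hypothetical helper name.

```python
import re

def leading_number(text):
    """Return the run of digits at the start of text, or None if there is none."""
    match = re.match(r"\d+", text)  # same pattern as ^\d+; re.match anchors at the start
    return match.group(0) if match else None

# Hypothetical survey answers in the assumed "N) text" shape.
answers = [
    "3) I'm aware of certifications I should be starting.",
    "4) I'm currently progressing a powerful certification.",
    "No number here",
]
cleaned = [leading_number(a) for a in answers]
```

Like REGEXEXTRACT, this returns text rather than a number, so convert the result before averaging.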
How to add leading zeros in ADF data flow from the expression builder
For example, I have a column with a numeric value such as "000001", but it arrives as just 1 in the SQL DB. If I put the entire value in single quotes it comes through correctly, but I need a dynamic way to do this without hard-coding.
I agree with @Larnu's comments: even if we give 00001 to an int column, it will be stored as 1.
So we have to pass those values in single quotes ('00001') to keep the zeros, or import the incoming data as a string instead of an int.
As you are using an ADF data flow, if you want the 00001 form you can generate it with a derived column transformation on the SQL source. The exact expression depends on your requirement, in particular on how the number of leading zeros varies, so adapt it accordingly.
Sample demo:
concat('0000', toString(id))
Use that column as your requirement dictates; afterwards you can convert it back to the original value with toInteger(id).
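To make the "depends on how your leading zeros vary" point concrete, here is a small Python sketch of the logic (not ADF expression syntax). It contrasts a fixed prefix, as in the demo above, with padding to a fixed width; the function names and the width of 6 are assumptions for illustration.

```python
def pad_fixed_prefix(value):
    """Mirror of concat('0000', toString(id)): always prepends four zeros."""
    return "0000" + str(value)

def pad_to_width(value, width=6):
    """Pad with leading zeros until the text is `width` characters long."""
    return str(value).zfill(width)

ids = [1, 12, 345]
prefixed = [pad_fixed_prefix(i) for i in ids]   # lengths differ per value
fixed_width = [pad_to_width(i) for i in ids]    # every value has 6 characters
restored = [int(s) for s in fixed_width]        # back to the original integers
```

A fixed prefix only gives a consistent length when every id has the same number of digits; padding to a width handles ids of mixed length.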
I'm trying to parse an XML file in Excel, which is a Japanese dictionary. It contains several translations of each entry into different languages, and some entries have multiple translations per language. I want to write a formula that finds all of the translations by their language code, returns them as an array, and concatenates them using a TEXTJOIN formula. But I don't know how to go about this in Excel.
In Google Sheets, this would be easily solved by a FILTER function, but I can't use Sheets as there's too much data, and I haven't managed to get access to the beta FILTER function yet.
In the below picture, I'm trying to return the values in the <gloss xml:lang*> column, by searching for the values in the lang column. So for example, I want to return all values which have a "dut" next to them, and concatenate those into a single line using TEXTJOIN.
Any idea how I could go about doing this?
I fixed this by downloading the FILTER function. This is part of the Office Insider program, which releases Beta features if you choose to participate. You can access the Insider program by going to File > Account > Office Insider. Then to update your Office version go to File > Account > Office Updates to install the Insider update.
To filter the list by the "lang" column the formula looked like:
=FILTER([range in H column], [range in I column]=T$2)
I haven't specified either range because I used a formula-defined range using the INDIRECT function to avoid filtering through a million rows. The H range is what I want in the results of the filter, the I range is what I want to filter by - the "lang" code. T$2 represents the "lang" code, in this case "dut", and when I copy it across it will filter by each of the 8 lang codes in Row 2.
Then I used TEXTJOIN to combine the array result into a single cell, using a comma as the separator:
=TEXTJOIN(", ", TRUE, FILTER(...))
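For readers who want to see the filter-then-join logic outside Excel, here is a rough Python equivalent. The sample rows and the `join_by_lang` name are made up for illustration; only the "dut" code comes from the question.

```python
def join_by_lang(rows, lang, separator=", "):
    """Keep the glosses whose language code equals lang, then join them.

    Empty glosses are skipped, similar to TEXTJOIN with ignore_empty = TRUE.
    """
    return separator.join(gloss for code, gloss in rows if code == lang and gloss)

# Hypothetical (lang, gloss) pairs standing in for the I and H columns.
rows = [
    ("dut", "woord"),
    ("eng", "word"),
    ("dut", "term"),
    ("dut", ""),
]
dutch = join_by_lang(rows, "dut")
```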
I have a table containing postcodes but there is no validation built in to the entry form so there is no consistency in the way they are stored in the database, sample below:
ID Postcode
001742 B5
001745
001746
001748 DY3
001750
001751
001768 B276LL
001774 B339HY
001776 B339QY
001780 WR51DD
I want to use these postcode to map the distance from a central point but before I can do that I need to put them into a valid format and filter out any blanks or incomplete postcodes.
I had considered using
left(postcode,3) + ' ' + right(postcode,3)
To correct the formatting but this wouldn't work for postcodes like 'M6 8HD'
My aim is to get the list of postcodes in a valid format, but I don't know how to account for the different lengths of postcode. Is there a way to do this in SQL Server?
As discussed in the comments, sometimes looking at a problem the other way around presents a far simpler solution.
You have a list of arbitrary input provided by users, which frequently doesn't contain the correct spacing. You also have a list of valid postcodes which are correctly spaced.
You're trying to solve the problem of finding the correct place to insert spaces into your arbitrary inputs to make them match the list of valid codes, and this is extremely difficult to do in practice.
However, performing the opposite task - removing the spaces from the valid postcodes - is remarkably easy to do. So that is what I'd suggest doing.
In our most recent round of data modelling, we have modelled addresses with two postcode columns - PostCode containing the postcode as provided from whatever sources, and PostCodeNoSpace, a computed column which strips whitespace characters from PostCode. We use the latter column for e.g. searches based on user input. You may want to do something similar with your list of valid postcodes, if you're keeping it around permanently - so that you can perform easy matches/lookups and then translate those matches back into a version that has spaces - which is actually a solution to the original question posed!
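A minimal sketch of that lookup, written in Python rather than SQL just to show the idea. The list of valid postcodes is a hypothetical stand-in built from the values in the question, and upper-casing the input is my own assumption, not something the approach requires.

```python
def strip_spaces(postcode):
    """Remove all whitespace and upper-case the result (upper-casing is an assumption)."""
    return "".join(postcode.split()).upper()

# Hypothetical reference list of correctly spaced postcodes.
VALID_POSTCODES = ["B27 6LL", "B33 9HY", "B33 9QY", "WR5 1DD", "M6 8HD"]

# Key: the postcode without spaces. Value: the correctly spaced version.
LOOKUP = {strip_spaces(pc): pc for pc in VALID_POSTCODES}

def normalise(raw):
    """Return the correctly spaced postcode, or None for blank or unknown input."""
    return LOOKUP.get(strip_spaces(raw))

user_input = ["B276LL", "B5", "", "WR51DD", "m6 8hd"]
matched = [normalise(value) for value in user_input]
```

Blank and partial entries such as "B5" simply fail to match, which also takes care of filtering them out.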
I am using SQL Server 2008 and I have a column in a table, which has values like below. It basically shows departure and arrival information.
-->Heathrow/Dublin*Dublin/Heathrow
-->Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick
-->Heathrow/Dublin*Liverpool/Heathrow
(The 3rd example shown above is slightly different: the person did not depart from Dublin, but from Liverpool instead.)
This makes the column too lengthy, and I want to remove only the adjacent duplicates, so the information can be shown like below:
-->Heathrow/Dublin/Heathrow
-->Gatwick/Liverpool/Carlisle/Gatwick
-->Heathrow/Dublin***Liverpool/Heathrow
So, this would still show the correct travel route, but omit only the contiguous duplicates. Also, in the 3rd case, since the arrival location and the next departure location are not the same, I would like to show the break as ***.
I found a post here that removes all duplicates (Find and Remove Repeated Substrings) but this is slightly different from the solution that I need.
Could someone share any thoughts please?
The first step is to adapt the process defined in the following link so that it splits based on /:
T-SQL split string
This returns a table which you would then loop through checking if the value contains an *. In that case you would get the text values before and after the * and compare them. Use CHARINDEX to get the position of the *, and SUBSTRING to get the values before and after. Once you have those check both values and append to your output string accordingly.
So you have a database column that contains this text string? Is your concern to display the data to the user in a new format, or to update the data in your database table with a new value?
Do you have access to the original data from which this text string was built? It would probably be easier to re-create the string in the format you desire than it would be to edit the existing string programmatically.
If you don't have access to this data, it would probably be a lot simpler to update your data (or reformat it for display) if you do the string manipulation in a high-level language such as C# or Java.
If you're reformatting it for display, write the string manipulation code in whatever language is appropriate, right before displaying it. If you're updating your table, you could write a program to process the table, reading each record, building the replacement string, and updating the record before moving on to the next one.
The bottom line is that T-SQL is just not a good language for doing this sort of string examination and manipulation. If you can build a fresh string from the original data, or do your manipulation in a high-level language, you'll have an easier job of it and end up with more maintainable code.
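To illustrate how short this becomes in a high-level language, here is a sketch in Python. It assumes legs are separated by * and stops within a leg by /, as in the question's examples, and that stray spaces around names can be trimmed; `collapse_route` is a hypothetical name.

```python
def collapse_route(route):
    """Merge legs whose departure equals the previous arrival; mark gaps with ***."""
    # Each leg becomes a list of stops, e.g. "Dublin/Heathrow" -> ["Dublin", "Heathrow"].
    legs = [[stop.strip() for stop in leg.split("/")] for leg in route.split("*")]
    result = "/".join(legs[0])
    previous_stop = legs[0][-1]
    for leg in legs[1:]:
        if leg[0] == previous_stop:
            # Adjacent duplicate: skip the repeated departure.
            result += "".join("/" + stop for stop in leg[1:])
        else:
            # The traveller did not depart from where they arrived.
            result += "***" + "/".join(leg)
        previous_stop = leg[-1]
    return result

routes = [
    "Heathrow/Dublin*Dublin/Heathrow",
    "Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick",
    "Heathrow/Dublin*Liverpool/Heathrow",
]
collapsed = [collapse_route(r) for r in routes]
```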
I wrote some code for the first example you gave. You still need to extend it for the rest.
-- Handles a single '*' separator only; extend it for routes with more legs.
DECLARE @STR VARCHAR(50)='Heathrow/Dublin*Dublin/Heathrow'
-- Compare the arrival before '*' with the same number of characters after '*'.
IF (SELECT SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1)) =
(SELECT SUBSTRING(@STR,CHARINDEX('*',@STR)+1,LEN(SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1))))
BEGIN
-- Same place: remove the '*' and the repeated name.
SELECT STUFF(@STR,CHARINDEX('*',@STR),LEN(SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1))+1,'')
END
ELSE
BEGIN
-- Different places: replace only the '*' with '***'.
SELECT STUFF(@STR,CHARINDEX('*',@STR),1,'***')
END
I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line, thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
Problem with this is it matches the last element of one line and the 1st of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
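// Wrap any unquoted field that contains a line break in double quotes.
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}
// Match one logical row: five fields, each either quoted or free of commas.
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);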
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
If you want a programmatic solution in Java, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.