Mixed Service Codes/Descriptions database


I have a large database that I need to reorganize. It has two columns: the first holds the Service Code of an item and the second holds the Description of that item. Below is an example:
TSB Trim Booklet
LMN Loading Manual
GLM Grain Loading Manual
etc.
There are a total of 170 different items.
The problem is this: in a different Excel file, there is a column containing only the Descriptions of the items (around 16,000 entries in mixed order), without the 3-letter Service Code.
How can I link them quickly?

Assumptions: you want to take the service code from file 1 and apply it to the descriptions in file 2, and a single description always has the same service code.
Use the following formula in file 2 (the big one you want to add service codes to):
=INDEX([file1]Sheetname!$A:$A,MATCH([file2]Sheetname!A2,[file1]Sheetname!$B:$B,0))
Where
[file1]Sheetname!$A:$A
is the column with service codes in the file/sheet with both the code and the description
[file2]Sheetname!A2
is the cell with description in the file/sheet with just descriptions
and
[file1]Sheetname!$B:$B
is the column with descriptions in the file/sheet with both the code and the description
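For anyone doing the same join outside Excel, the exact-match lookup that INDEX/MATCH performs can be sketched in Python. This is only an illustration under the same assumption as above (one service code per description); the sample rows come from the question, while "Unknown Item", the function names and the "#N/A" marker are hypothetical.

```python
# Sample rows standing in for file 1 (code, description) and file 2 (descriptions only).
code_rows = [
    ("TSB", "Trim Booklet"),
    ("LMN", "Loading Manual"),
    ("GLM", "Grain Loading Manual"),
]
descriptions = [
    "Loading Manual",
    "Trim Booklet",
    "Grain Loading Manual",
    "Loading Manual",
    "Unknown Item",  # hypothetical description with no code in file 1
]


def build_lookup(rows):
    """Map each description to its service code (assumes one code per description)."""
    return {description.strip(): code for code, description in rows}


def attach_codes(items, lookup, missing="#N/A"):
    """Return (code, description) pairs; unmatched descriptions get the `missing` marker."""
    return [(lookup.get(item.strip(), missing), item) for item in items]


lookup = build_lookup(code_rows)
linked = attach_codes(descriptions, lookup)
for code, description in linked:
    print(code, description)
```

Note that this dictionary lookup is case-sensitive, whereas Excel's MATCH is not, so descriptions that differ only in capitalisation would need normalising first.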

Related

How can I Filter Data in Google Data Studio based on containing 2 or more input options?

I'm building a dashboard for a dataset that includes a Location column, and I would like the user to be able to filter based on location. The trouble is that some projects are happening in multiple locations, so the column is a comma-separated list of locations, which makes the normal dropdown filter option cumbersome ("NY, Paris" and "Paris, NY" are treated as different values). This can be overcome by adding a parameter dropdown box, allowing the user to select an option (say Paris) and then using a Contains function to filter the output, but the parameter dropdown box only allows 1 selection to be made. So a search for all projects happening in either Paris or New York seems like it would have to be done using 2 separate parameters. Is anyone aware of an elegant workaround that will allow multiple selections of locations within a single dropdown?
The inelegant solutions I've come up with are:
Use n parameter boxes and cap the locations that can be filtered in a single view at n.
Have users input a comma separated list as a parameter, parse that for locations and then show all REGEXP CONTAINS matches of that provided list.
Example dataset showing multiple locations per project in the locations column:
Edited to add a link to a sample report here. The problem, in a nutshell, is that I would like people to be able to select 2 or more location parameters so that they don't have to limit themselves to viewing 1 location at a time.

One way to filter CSVs (Comma Separated Values) is by using the CSV Filter Control Community Visualisation (click on the icon on the toolbar and select to view all):
Data Tab
Column to filter on: Location
Cross-filtering: Select (☑) (this ensures that the CSV Filter Control filters other charts based on the value(s) selected)
Style Tab
OR instead of AND behaviour: Select (☑)
Editable Google Data Studio Report (Embedded Google Sheets Data Source) and a GIF to elaborate:
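To make the OR versus AND behaviour concrete, here is a small Python sketch of the filtering logic on a comma-separated Location column. It is not how the community visualisation is implemented; the project rows and function names are hypothetical and only illustrate the idea of splitting each cell before matching.

```python
def split_locations(cell):
    """Split a comma-separated cell into a set of trimmed location names."""
    return {part.strip() for part in cell.split(",") if part.strip()}


def filter_projects(projects, selected, match_any=True):
    """Keep projects whose locations include any (OR) or all (AND) selected values."""
    wanted = set(selected)
    kept = []
    for project in projects:
        locations = split_locations(project["Location"])
        if match_any:
            matches = bool(locations & wanted)  # OR: at least one selected location
        else:
            matches = wanted <= locations       # AND: every selected location
        if matches:
            kept.append(project)
    return kept


# Hypothetical rows; "NY, Paris" and "Paris, NY" are treated as the same set.
projects = [
    {"Project": "A", "Location": "NY, Paris"},
    {"Project": "B", "Location": "Paris, NY"},
    {"Project": "C", "Location": "London"},
    {"Project": "D", "Location": "Paris"},
]
either = filter_projects(projects, ["Paris", "NY"])
both = filter_projects(projects, ["Paris", "NY"], match_any=False)
print([p["Project"] for p in either])
print([p["Project"] for p in both])
```

Because each cell is turned into a set, the order in which locations were typed no longer matters.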

Export SQL Data to Fixed Width Text File with Multiple Record Types

I need to create an export of data from SQL Server (multiple tables) into a fixed width text file. The text file will have rows that are different based on the record type.
Header Info (Customer, Address)
Line Item Info (Customer, Item, Qty)
Summary Info (Customer, Total Qty)
Any suggestions to accomplish this efficiently?
I'm currently re-casting all columns into char to create the "fixed width", then using SSIS to merge the tables before exporting as a ragged right text file. However, because not all the widths are the same, I'm having to concatenate the line item info into one column to make the merge work. Also, the header info is being merged AFTER the line item info, not before, so there's a sorting problem there. Not sure if I'm going down the right path?
Hope that made sense... this export is used to import into a COBOL type system.
Thanks,

Using SSIS, create three data flow tasks, each writing a single text file in the fixed-width format.
File 1: Header Info
File 2: Line Item Info
File 3: Summary Info
Then concatenate them together into a fourth file using the approach described in the following link:
How to concatenate 2 files in SSIS (Integration Services)?
Hope this helps.
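Outside SSIS, the underlying idea (pad every field to a fixed width, then order the rows so each customer's header comes before its line items and summary) can be sketched in Python. The record-type codes, field names, widths and zero-padding of numbers below are all assumptions for illustration; the real layout would come from the specification of the receiving system.

```python
# Hypothetical layouts: record type -> list of (field name, width).
LAYOUTS = {
    "H": [("customer", 10), ("address", 30)],
    "L": [("customer", 10), ("item", 15), ("qty", 6)],
    "S": [("customer", 10), ("total_qty", 8)],
}
# Desired order of record types within one customer (assumed: header, lines, summary).
TYPE_ORDER = {"H": 0, "L": 1, "S": 2}


def format_record(record_type, values):
    """Render one record as a fixed-width line prefixed with its type code."""
    parts = [record_type]
    for name, width in LAYOUTS[record_type]:
        value = values[name]
        if isinstance(value, int):
            text = str(value).zfill(width)   # numbers: right-aligned, zero-padded
        else:
            text = str(value).ljust(width)   # text: left-aligned, space-padded
        if len(text) > width:
            raise ValueError(f"{name}={value!r} does not fit in {width} characters")
        parts.append(text)
    return "".join(parts)


def build_export(records):
    """Sort by customer, then record type, and format every record."""
    ordered = sorted(records, key=lambda r: (r[1]["customer"], TYPE_ORDER[r[0]]))
    return [format_record(record_type, values) for record_type, values in ordered]


records = [
    ("L", {"customer": "ACME", "item": "WIDGET", "qty": 3}),
    ("S", {"customer": "ACME", "total_qty": 3}),
    ("H", {"customer": "ACME", "address": "1 Main St"}),
]
lines = build_export(records)
print("\n".join(lines))
```

Python's sort is stable, so line items for the same customer keep the order in which they were supplied. Each record type has its own total width, which corresponds to a ragged right file.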

For these sorts of problems, I reach for SSIS. It eats this kind of thing for lunch.

SSIS Expression to Isolate a String

I am building an SSIS package that will populate data from an Excel Spreadsheet into our Database for Reporting.
The customer did not provide a separate column for the City and, unfortunately, cannot update their export file to add it, so I am trying to build a city column from the Branch Names.
I need an SSIS Expression (or several) to use in a Derived Column Transformation to pull the names of the cities out of the Branch Name. The issue I have is that the spacing and placement of the names varies. I have tried TOKEN, SUBSTRING, RIGHT and LEFT combined with other expressions, and I always seem to cut something off.
Has anyone else run into this, and how can I fix it? (I am not familiar enough with C# to use a Script Component.)
Here is a Sample of the Data that I have.
Branch Name
JS OMAHA - 09
JS SIOUX FALLS - 48
JS DOWNINGTOWN - 53
JS ST PAUL - 70
JS BLOOMINGTON - 103
JS PITTSBURGH NORTH -149-
JS TINTON FALLS - 186
JS BLAINE - 337
JS ROCHESTER MN - 423

Do you have a list of valid cities sitting in a table? If so you can use a lookup transformation.
Let's say your list of cities is in a table called city.
On the General tab, pick No Cache.
On the Connection tab, pick the city table.
On the Columns tab, match the Branch Name column to the city column in your city table.
On the Advanced tab, tick Modify the SQL statement and change the end to where [Branch Name] Like '%' + ? + '%'
Now your lookup will find the closest match and pass it through as an extra column.
Another way is to load it all into a staging table and do an UPDATE, also using LIKE.
Whatever you do, it will help to have a list of valid cities in a table.
A third way is to make an assumption about the tokens in the data and use string functions in a derived column transformation to extract the city, but you can get some unexpected results.
I can expand further on these if you wish but I won't waste time if you're never going to return to the question.

Whilst you stated that you are not familiar with script components - they are the correct tool for the job. You will get much greater flexibility by using C# (or VB.Net) code to manipulate your strings. There are a number of good tutorials online to show you how to use a script task, and lots of information about string manipulation in C#.
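As a rough illustration of the string logic such a script would need, here is a Python sketch using a regular expression; the same pattern should translate to C# or VB.Net with little change. It assumes, based only on the sample rows in the question, that every branch name starts with "JS" and ends with a dash, a number and an optional trailing dash. Rows that do not fit return None rather than a guess.

```python
import re

# Assumed shape: "JS <city> - <number>" with variable spacing and an optional trailing dash.
BRANCH_PATTERN = re.compile(r"^\s*JS\s+(.*?)\s*-\s*\d+\s*-?\s*$")


def extract_city(branch_name):
    """Return the city part of a branch name, or None if the name does not fit the pattern."""
    match = BRANCH_PATTERN.match(branch_name)
    return match.group(1) if match else None


# Sample values taken from the question.
branches = [
    "JS OMAHA - 09",
    "JS SIOUX FALLS - 48",
    "JS ST PAUL - 70",
    "JS PITTSBURGH NORTH -149-",
    "JS ROCHESTER MN - 423",
]
cities = [extract_city(name) for name in branches]
print(cities)
```

Whether "ROCHESTER MN" or "PITTSBURGH NORTH" should be trimmed further depends on what counts as a city in the reporting database, which is where a table of valid cities still helps.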

How to extract all the IUPAC names mentioned in the data available from Pubchem(NCBI) into a text file?

I want to build lists of prefixes and suffixes of some length from all the IUPAC names in the PubChem database, so that I can use them as a feature in my project. So I want all the IUPAC chemical names in a text file, or in some format from which I can extract these lists.
Thanks.

Sounds like you need something like this NIST species list.
You can also search for most of them in the WebBook, but I failed to find a download link for the complete set.
In our lab we got a CD(?) with the mass spectral database, which contained the database as a text file (complete? Well, it had about 250,000 substances). Maybe you can get that through one of the vendors.

The PubChem site lets you download a dump of their data by FTP. Why not use that?

PubChem data can be downloaded via ftp from the PubChem site. A complete description of the available data can be obtained here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
Of particular interest for the question of IUPAC names, the data are downloadable from the "Compound Extras" section of the ftp site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
The README-Extras file in this location describes the data in detail. For the IUPAC names, the following information is provided:
CID-IUPAC.gz:
This is a listing of all CIDs with their computed IUPAC names.
It is a gzipped text file with CID, tab, IUPAC on each line. Note
that the names may contain UTF8 characters.
A download today (23-Apr-2020) contains 102,586,778 rows. An excerpt of the information is shown below.
> head CID-IUPAC
1 3-acetyloxy-4-(trimethylazaniumyl)butanoate
2 (2-acetyloxy-3-carboxypropyl)-trimethylazanium
3 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid
4 1-aminopropan-2-ol
5 (3-amino-2-oxopropyl) dihydrogen phosphate
6 1-chloro-2,4-dinitrobenzene
7 9-ethylpurin-6-amine
8 2,3-dihydroxy-3-methylpentanoic acid
9 (2,3,4,5,6-pentahydroxycyclohexyl) dihydrogen phosphate
11 1,2-dichloroethane
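Given that format (CID, tab, IUPAC name on each line), the prefix and suffix lists the question asks for could be built along these lines. This is a minimal sketch: the function names are made up for illustration, and the three sample lines are taken from the excerpt above.

```python
import gzip
from collections import Counter


def iter_names(lines):
    """Yield the IUPAC name from each 'CID<TAB>name' line, skipping malformed lines."""
    for line in lines:
        _cid, sep, name = line.rstrip("\n").partition("\t")
        if sep and name:
            yield name


def count_affixes(names, length):
    """Count prefixes and suffixes of the given length over all names."""
    prefixes, suffixes = Counter(), Counter()
    for name in names:
        if len(name) >= length:
            prefixes[name[:length]] += 1
            suffixes[name[-length:]] += 1
    return prefixes, suffixes


def count_affixes_in_file(path, length):
    """Stream a gzipped CID-IUPAC file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_affixes(iter_names(handle), length)


sample = [
    "4\t1-aminopropan-2-ol\n",
    "11\t1,2-dichloroethane\n",
    "7\t9-ethylpurin-6-amine\n",
]
prefixes, suffixes = count_affixes(iter_names(sample), 3)
print(prefixes.most_common())
print(suffixes.most_common())
```

Streaming matters here because the full file has over a hundred million rows; for the real download, something like count_affixes_in_file("CID-IUPAC.gz", 3) would be the entry point.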

Plain, computer parseable lists of common first names?

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there some free list of lots of known names, instead of me having to type them out? Something that I can easily parse with the programme to fill in an array for example?
I'm not worried about:
Knowing if a name is masculine or feminine (or both)
If the dataset has a whole pile of false positives
If there are names that aren't on it, obviously no dataset like this will be complete.
If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
I don't care about knowing the popularity of the name
I know Wikipedia has a list of most popular given names, but that's all in an HTML page and mangled up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen scrape Wikipedia?

A CSV from the General Register Office of Scotland with all the forenames registered there in 2007.
Another large set of first names in CSV format and SQL format too (but they didn't say which DB dumped the SQL).
GitHub page with the top 1000 baby names from 1880 to 2009, already parsed into a CSV for you from the Social Security Administration.
CSV of baby names and meanings from a Princeton CS page.
That ought to be enough to get you started, I'd think.

You can easily consume the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in a specific category. Category:Given names looks like a good place to start.
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names
The part of result from this URL looks like this:
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
Look at the API and select appropriate format and query parameters, and check categories.
P.S.
BTW, the wiki-text of the page you linked to contains names in a form that is easy to extract using a regexp. Also, the titles of links in the rendered HTML page have “(name)” attached to the name itself.
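A response like the excerpt above can be turned into a plain list with Python's standard XML parser. This sketch only covers the parsing step; the wrapper elements around the cm entries in the sample are an assumption about the full response, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def titles_from_xml(xml_text):
    """Collect the title attribute of every <cm> element in the response."""
    root = ET.fromstring(xml_text)
    return [cm.get("title") for cm in root.iter("cm")]


# The <cm> lines are from the excerpt above; the enclosing elements are assumed.
sample_response = """<api><query><categorymembers>
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
</categorymembers></query></api>"""

names = titles_from_xml(sample_response)
print(names)
```

Since the query is limited to 500 results per request, a real script would also need to follow the API's continuation mechanism to collect the whole category.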

Social Security Administration - Beyond the Top 1000 Names Data Files
The above is a comprehensive list of first names in use in the US. The zip files contain national and state-level data by year of birth in CSV format. It includes the number of occurrences (minimum 5) and gender. For example, the national file for 2010 includes 33,838 baby names.
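Reading such a file into an array of unique names might look like the sketch below. It assumes each row is name, gender, count in that order, which the description above does not state explicitly, and the sample rows and counts are invented for illustration.

```python
import csv
import io


def read_names(handle):
    """Return the sorted unique names from rows assumed to be name,gender,count."""
    names = set()
    for row in csv.reader(handle):
        if row:  # skip blank lines
            names.add(row[0].strip())
    return sorted(names)


# Hypothetical rows; a name listed under both genders is kept once.
sample_file = io.StringIO("Emma,F,100\nNoah,M,90\nJordan,F,10\nJordan,M,12\n")
names = read_names(sample_file)
print(names)
```

For a real file, the same function can be passed an open file handle instead of the in-memory sample.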