How to parse text over multiple lines with textfsm? - text-parsing

I understand that TextFSM is a good way to parse text files. However, the examples I have seen only parse data that sits on a single line. How can I parse data that is spread over multiple lines?
<Page>
CUSIP No. 123456 13G Page 2 of 10 Pages
-----------------------------------------------------------------------------
(1) NAMES OF REPORTING PERSONS
ABC Ltd.
-----------------------------------------------------------------------------
(2) CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP
(a) [ ]
(b) [X]
--------------------------------------------------------------------------------
(3) SEC USE ONLY
--------------------------------------------------------------------------------
(4) CITIZENSHIP OR PLACE OF ORGANIZATION
Bruny Islands
--------------------------------------------------------------------------------
NUMBER OF (5) SOLE VOTING POWER
0
SHARES -----------------------------------------------------------------
BENEFICIALLY (6) SHARED VOTING POWER
1,025,824 shares of Common Stock
OWNED BY --------------------------------------------------------------
EACH (7) SOLE DISPOSITIVE POWER
0
REPORTING --------------------------------------------------------------
PERSON WITH: (8) SHARED DISPOSITIVE POWER
1,025,824 shares of Common Stock
-----------------------------------------------------------------------------
(9) AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON
1,025,824 shares of Common Stock
-----------------------------------------------------------------------------
(10) CHECK BOX IF THE AGGREGATE AMOUNT
IN ROW (9) EXCLUDES CERTAIN SHARES
[ ]
-----------------------------------------------------------------------------
(11) PERCENT OF CLASS REPRESENTED
BY AMOUNT IN ROW (9)
4.15%
-----------------------------------------------------------------------------
(12) TYPE OF REPORTING PERSON
CO
-----------------------------------------------------------------------------
In the text above, I want to parse the names of reporting persons and the citizenship or place of organization. Each value sits on the line after its label rather than on the same line. What is the best way to approach this problem?

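You can do this with TextFSM state transitions: the Start state watches for a label line and switches to a state that captures the value on the following line, and the dashed separator line switches back to Start. The value rules start with "^ +", so they only match value lines that begin with at least one space. The NUMBER OF line triggers Record, which emits the row.
This template does what you need: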
Value REPORTING_PERSONS (\S+[\S ]+)
Value CITIZENSHIP (\S+[\S ]+)

Start
  ^.+NAMES OF REPORTING PERSONS -> Person
  ^.+CITIZENSHIP OR PLACE OF ORGANIZATION -> Citizenship
  ^ +NUMBER OF -> Record

Person
  ^ +${REPORTING_PERSONS}
  ^-+ -> Start

Citizenship
  ^ +${CITIZENSHIP}
  ^-+ -> Start
Result:
REPORTING_PERSONS    CITIZENSHIP
-------------------  -------------
ABC Ltd.             Bruny Islands
Here you can see a few examples:
https://github.com/google/textfsm/wiki/Code-Lab
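To make the state-transition idea concrete, here is a minimal plain-Python sketch of the same logic. It is not TextFSM and does not use its API. The names parse, HEADERS, SEPARATOR, VALUE and RECORD are made up for illustration, and the sample text is a shortened copy of the filing with the indentation assumed.

```python
import re

# Shortened sample; the leading spaces on value lines are an assumption.
SAMPLE = """\
(1) NAMES OF REPORTING PERSONS
    ABC Ltd.
-----------------------------------------------------------------------------
(3) SEC USE ONLY
-----------------------------------------------------------------------------
(4) CITIZENSHIP OR PLACE OF ORGANIZATION
    Bruny Islands
-----------------------------------------------------------------------------
 NUMBER OF (5) SOLE VOTING POWER
"""

# Label pattern -> field captured while the machine is in that state.
HEADERS = [
    (re.compile(r"NAMES OF REPORTING PERSONS"), "REPORTING_PERSONS"),
    (re.compile(r"CITIZENSHIP OR PLACE OF ORGANIZATION"), "CITIZENSHIP"),
]
SEPARATOR = re.compile(r"^-+")          # dashed line: go back to Start
VALUE = re.compile(r"^\s*(\S[\S ]*)")   # first non-blank text on the line
RECORD = re.compile(r"^\s*NUMBER OF")   # marks the end of one record


def parse(text):
    """Walk the lines once, switching state when a label line is seen."""
    records, row, state = [], {}, "Start"
    for line in text.splitlines():
        if state == "Start":
            for pattern, field in HEADERS:
                if pattern.search(line):
                    state = field  # capture this field from following lines
                    break
            else:
                if RECORD.match(line) and row:
                    records.append(row)
                    row = {}
        elif SEPARATOR.match(line):
            state = "Start"
        else:
            match = VALUE.match(line)
            if match:
                row[state] = match.group(1).rstrip()
    if row:  # emit a trailing record if the input ends without a marker
        records.append(row)
    return records


print(parse(SAMPLE))
```

Unlike the template, this sketch tests the separator before the value pattern, because its value pattern also accepts unindented lines and would otherwise capture the dashes.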

Value REPORTING_PERSON (\S+[\S ]+)
Value CITIZENSHIP (\S+[\S ]+)

Start
  ^.+NAMES\s+OF\s+REPORTING\s+PERSONS -> Person
  ^.+CITIZENSHIP\s+OR\s+PLACE\s+OF\s+ORGANIZATION -> Citizenship
  ^ NUMBER OF -> Record

Person
  ^(\s+)${REPORTING_PERSON} -> Start

Citizenship
  ^\s+${CITIZENSHIP} -> Start

Here's an example of a long and complicated line that I don't want to come up with a specific regex for.
LSBATCH: User input
/hps/nobackup2/production/metagenomics/assembly-pipeline/prod/venv/bin/python /hps/nobackup2/production/metagenomics/... -p DRP000303 -r DRR000714
Instead, I just match the complete line that follows a marker line containing User input:
# match entire line
Value job_command (.*)

Start
  # match line after line containing "User input"
  ^.*User input -> JobCommand
  # some more rules...

JobCommand
  ^${job_command} -> Start

Related

Build text statements based on number in Cell

Not sure I am saying this right but I need to build a list of statements based on a number in a cell. For example, in column A I have a list of room types: Office, Bathroom, Reception, Lobby, etc. and in column B I have the number of those room types in the building.
| COL A     | COL B |
| Office    | 5     |
| Bathroom  | 3     |
| Reception | 1     |
| Lobby     | 2     |
For Office, I put 5 in column B - Bathroom I have 3 in B, and so on.
Now what I need is a way to read the number of Offices and build a statement like:
Office 001
Office 002
Office 003
Office 004
Office 005
Of course, if I had put 6, then I would also see Office 006. I am not worried about getting all of the variable names into one column, as each room type will then have its own set of questions that I will figure out later.
Right now I am using messy IF statements and dragging them down the sheet.
try:
=ARRAYFORMULA(
TRANSPOSE(SPLIT(CONCATENATE(REPT(A:A&"♦", B:B)), "♦"))&TEXT(COUNTIFS(
TRANSPOSE(SPLIT(CONCATENATE(REPT(A:A&"♦", B:B)), "♦")),
TRANSPOSE(SPLIT(CONCATENATE(REPT(A:A&"♦", B:B)), "♦")),
ROW(INDIRECT("A1:A"&SUM(B:B))), "<="&
ROW(INDIRECT("A1:A"&SUM(B:B)))), " 000"))
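For readers who find the formula hard to follow, the logic it implements can be sketched in a few lines of Python. This is only an illustration of the expansion, not a replacement for the Sheets formula; the function name expand_rooms and the hard-coded rooms list are assumptions made for the example.

```python
def expand_rooms(rooms):
    """Return one numbered label per room, e.g. 'Office 001'."""
    labels = []
    for room_type, count in rooms:
        # Number each room of this type from 1 up to its count.
        for n in range(1, count + 1):
            labels.append(f"{room_type} {n:03d}")
    return labels


# Column A / column B pairs from the question.
rooms = [("Office", 5), ("Bathroom", 3), ("Reception", 1), ("Lobby", 2)]
for label in expand_rooms(rooms):
    print(label)
```

The formula does the same thing in two parts: REPT repeats each name as many times as its count, and COUNTIFS numbers each repetition by counting how many identical names appear at or above that row.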

Add label to node from a CSV file in NEO4J

I am trying to add some nodes to my graph database from a CSV file, which looks something like this:
  | city       continent  feature_1  ...
--|--------------------------------------------------
0 | Barcelona  Europe
1 | Stockholm  Europe
2 | New York   America
3 | Nairobi    Africa
4 | Tokyo      Asia
The first approach was to simply load this data as:
// Insert city nodes
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///city_data.csv' AS row
MERGE (city: City {name: row.city})
Next step was to incorporate the continent information, so I could have nodes of different colors. This means having two labels for each node, which is something I am not sure how to do. Anyway, for the moment I decided to simply have one label instead, which contained the continent information. Since this information is within the CSV file I believe apoc.create.node tool is the way to go. Hence, inspired by How to use apoc.load.csv in conjunction with apoc.create.node I tried the following:
CALL apoc.load.csv('file:///city_data.csv') YIELD row
CALL apoc.create.node(['row.continent'], {name:['row.continent']}) YIELD node
RETURN count(*)
This does not raise any error, but it does something different from what I was thinking of. It basically sets the column name ("row.continent") itself as the label...
The problem is that you wrapped the variable in quotes, so Cypher treats 'row.continent' as a string literal rather than as the value of that column. Try this:
CALL apoc.load.csv('file:///city_data.csv') YIELD row
CALL apoc.create.node([row.continent], {name: row.continent}) YIELD node
RETURN count(*)

SQL Server database design for evaluations

I'm designing this employee evaluation web page, and was wondering if my current database design is the correct one or if it could be improved.
This is my current design
Table Agenda:
+--------------+----------+----------+-----------+------+-------+-------+
| idEvaluation | Location | Employee | #Employee | Date | Date1 | Date2 |
+--------------+----------+----------+-----------+------+-------+-------+
Date is the date on which the evaluation is scheduled to be performed.
Date1 and Date2 define the period for which some metrics are retrieved from another database.
Table Evaluations:
+--------------+---------+------------+------+----------+
| idEvaluation | Manager | Department | Date | Comments |
+--------------+---------+------------+------+----------+
Table Scores:
+--------------+----------+-------+
| idEvaluation | idFactor | Score |
+--------------+----------+-------+
idFactor references another table that contains each factor and its description. As I said, is this a correct design?
My concern is this: currently there are 60 employees, 11 managers and 12 factors, and each employee is evaluated twice a year by every manager. The Agenda table is not a problem, since it holds only one record per evaluation (60 employees = 60 records). However, the Evaluations table holds 11 records for every evaluation, so it grows to 660 records (60 employees * 11 managers = 660). The Scores table grows even larger, since there are 12 factors for every evaluation: 7920 records (660 evaluations * 12 factors each = 7920).
Is this normal? Am I doing it wrong? Any input is appreciated.
EDIT
Location, Employee, #Employee, Manager and Department are loaded automatically by the VB.NET page. They are imported from Active Directory and checked before insertion, so duplicate names, misspelled names and similar problems are not an issue.
The main idea is that you don't want to repeat string literals.
So if you have
id  Department
1   Sales
2   IT
3   Admin
then instead of repeating Sales many times you store only the ID 1, which takes less space and usually makes comparisons and joins faster.
Second, if you have users
id  user
1   Jhon Alexander
2   Maria Jhonson
and Jhon decides to change his name, you would have to find and update the name in every table. There is also the problem that if two people have the same name, you won't know which one you are evaluating.
So use separate tables and reference them by ID.
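As a rough illustration of that advice, the sketch below keeps names in lookup tables and stores only IDs in the evaluation rows. It uses Python's built-in sqlite3 module rather than SQL Server so that it runs anywhere, and the table names Departments and Employees, the column names idEmployee and idDepartment, and the new name "Jhon A. Smith" are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Employees   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Evaluations stores only IDs, never the names themselves.
CREATE TABLE Evaluations (
    idEvaluation INTEGER PRIMARY KEY,
    idEmployee   INTEGER NOT NULL REFERENCES Employees(id),
    idDepartment INTEGER NOT NULL REFERENCES Departments(id)
);
INSERT INTO Departments VALUES (1, 'Sales'), (2, 'IT'), (3, 'Admin');
INSERT INTO Employees   VALUES (1, 'Jhon Alexander'), (2, 'Maria Jhonson');
INSERT INTO Evaluations VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")

# A rename touches exactly one row; every evaluation follows via the ID.
conn.execute("UPDATE Employees SET name = ? WHERE id = ?", ("Jhon A. Smith", 1))

# Names are looked up through joins when they are needed for display.
rows = conn.execute("""
    SELECT ev.idEvaluation, e.name, d.name
    FROM Evaluations ev
    JOIN Employees e   ON e.id = ev.idEmployee
    JOIN Departments d ON d.id = ev.idDepartment
    ORDER BY ev.idEvaluation
""").fetchall()
print(rows)
```

Two employees who share a name would still have different IDs, so each evaluation stays tied to the right person.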

how to see a difference between entity and a column

Sometimes I have a hard time telling an entity from a column when I start making a diagram. I don't know when something should be an entity and when it should be a column. For example, in a game a user can play alone or as part of a group. Would you make two different entities, User and GroupUser?
Also, suppose the User has levels, a status and badges they earn as part of the game. Would these be entities too, or would they just be part of the User entity?
An entity can be a person (e.g. Student), a place (e.g. Room Name), an object (e.g. Books) or an abstract concept (e.g. Course, Order) that is represented in your database; it normally becomes a table.
Columns, on the other hand, are the attributes of your entity.
So in your case you have a User entity whose possible columns (attributes, or fields) are
UserID, UserLevel, UserStatus, Badges, PlayStatus (whose values could be individual or group).
Badges, although a column, could turn into an entity if it violates the normalization rules.
For example, suppose you have this table for User:
Table: Users
UserID  UserName     UserStatus  PlayStatus  Badges
------  -----------  ----------  ----------  --------------------------------
1       Surefire     Active      Single      Private, Warrior, Platoon Leader
2       FastMachine  Active      Group       Private, Warrior
3       BeatTheGeek  Inactive    Group       Private
Here Badges violates first normal form (1NF), which says there should be no repeating groups, or in this case no multi-valued columns. So the design could be normalized like this:
Table: Users
UserID  UserName     UserStatus  PlayStatus
------  -----------  ----------  ----------
1       Surefire     Active      Single
2       FastMachine  Active      Group
3       BeatTheGeek  Inactive    Group
Table: Badges
BadgeID  BadgeName
-------  --------------
1        Private
2        Indie
3        Warrior
4        Platoon Leader
5        Colonel
6        1 Star General
7        2 Star General
8        3 Star General
9        4 Star General
10       5 Star General
11       Hero
Table: UserBadgesHistory
UserID  BadgeID  ReceiveDate
------  -------  -----------
1       1        12/01/2013
1       3        12/05/2013
1       4        1/5/2014
2       1        2/5/2014
2       3        2/10/2014
3       2        11/10/2013
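The normalized layout can be tried out with a small sketch like the one below, which rebuilds a user's badge list by joining through UserBadgesHistory. It uses Python's built-in sqlite3 module and only a subset of the rows. The helper name badges_for is hypothetical, and the dates are rewritten in ISO form on the assumption that the originals are month/day/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users  (UserID INTEGER PRIMARY KEY, UserName TEXT);
CREATE TABLE Badges (BadgeID INTEGER PRIMARY KEY, BadgeName TEXT);
-- One row per (user, badge) pair replaces the multi-valued Badges column.
CREATE TABLE UserBadgesHistory (
    UserID      INTEGER REFERENCES Users(UserID),
    BadgeID     INTEGER REFERENCES Badges(BadgeID),
    ReceiveDate TEXT,
    PRIMARY KEY (UserID, BadgeID)
);
INSERT INTO Users  VALUES (1, 'Surefire'), (2, 'FastMachine');
INSERT INTO Badges VALUES (1, 'Private'), (3, 'Warrior'), (4, 'Platoon Leader');
INSERT INTO UserBadgesHistory VALUES
    (1, 1, '2013-12-01'), (1, 3, '2013-12-05'), (1, 4, '2014-01-05'),
    (2, 1, '2014-02-05'), (2, 3, '2014-02-10');
""")


def badges_for(user_name):
    """Return the user's badge names in the order they were received."""
    rows = conn.execute("""
        SELECT b.BadgeName
        FROM UserBadgesHistory h
        JOIN Users u  ON u.UserID = h.UserID
        JOIN Badges b ON b.BadgeID = h.BadgeID
        WHERE u.UserName = ?
        ORDER BY h.ReceiveDate
    """, (user_name,)).fetchall()
    return [name for (name,) in rows]


print(badges_for("Surefire"))
```

Awarding a new badge then becomes a single INSERT into UserBadgesHistory instead of editing a comma-separated string.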
In general, an entity has multiple columns (i.e. attributes) of its own, and a column (or attribute) does not.
In your example, if the only data you're interested in storing is a User's current level, then level is unlikely to be an entity. This is because it would have only a single attribute of name/number. If you wanted to find all Users currently at level 4, you would simply do a query with level = 4.
On the other hand, if you had a reason to add additional data about the level, such as what abilities are associated with that level or the date a given User achieved the level, then you would want to make Level a separate entity.
A Level entity would have an ID, a number or name, and whatever other attributes you need as data.
 ID | Prerequisite | Ability
----+--------------+---------------
  1 | NULL         | May gain foos
  2 | Gain 10 foos | May gain bars
  3 | Gain 20 bars | 30 free foos
In a fully normalized state, you would have another entity called UserLevel in which you would store data about, for example, when a certain User gained a level.
The UserLevel entity would contain the LevelID and the UserID as foreign keys (links back to the other entities), and a DateAchieved column for when the User achieved the level.
 LevelID | UserID | DateAchieved
---------+--------+--------------
       1 |      1 | 2014-02-01
       1 |      2 | 2014-02-01
       2 |      1 | 2014-02-05
       3 |      1 | 2014-02-09
       2 |      2 | 2014-02-11
       4 |      1 | 2014-02-13
This shows User 1 and User 2 starting at Level 1 on the same day and leveling up at different rates.

Best way to store results data in database? [duplicate]

I have results data like this:
1. account, name, #, etc
2. account, name, #, etc
...
10. account, name, #, etc
I have approximately 1 set of results data generated each week.
Currently it's stored like so:
DATETIME DATA_BLOB
Which is annoying because I can't query any of the data without parsing the BLOB into a custom object. I'm thinking of changing this.
I'm thinking of having one giant table:
DATETIME  RANK  ACCOUNT  NAME  NUMBER  ...  ETC
date1     1     user1    nn    #
date1     2     user2    nn    #
...
date1     10    userN    nn    #
date2     1     user5    nn    #
date2     2     user12   nn    #
...
date2     10    userX    nn    #
I don't know anything about database design principles, so can someone give me feedback on whether this is a good approach or there might be a better one?
Thanks
I think it is fine to have a table like that as long as there are no one-to-many relationships. If there are, it would be more efficient to have multiple tables, as in my example below. Here are some general tips as well:
Tip: Good practice - My professor told me that it's always good to have an "ID" column, which is a unique number identifying each item in the table (1, 2, 3, etc.). (Perhaps that was the intent of your "Number" column.) I think SQLite gives every ordinary table an implicit rowid column anyway.
Tip: Saving storage space - Also, if there is a one-to-many relationship (example: one name has many accounts) then it might save space to have a separate table for the accounts, and then store the ID of the name in the first table- so that way you are storing many ints instead of duplicate strings.
Tip: Efficiency - Some databases have specific frameworks designed to handle relationships such as many-to-one or many-to-many, so if you use their framework for that (I don't remember exactly how to do it) it will probably work more efficiently.
Tip: Saving storage space - If you make your own ID column, it might be a waste if the database already adds an "ID" column automatically, so you might want to check for that possibility.
Conceptual Example: (Storing multiple accounts for the same name)
Poor Solution:
Storing everything in 1 table (inefficient, because it duplicates Bob's name, rank, and datetime):
ID  NAME  RANK  DATETIME  ACCOUNT
1   Bob   1     date1     bob_account_1
2   Joe   2     date2     user2_joe
3   Bob   1     date1     bob_account_2
4   Bob   1     date1     bobs_third_account
Better Solution: Having 2 tables to prevent duplicated information (Also demonstrates the usefulness of ID's). I named the 2 tables "Account" and "Name."
Table 1: "Account" (Note that NAME_ID refers to the ID column of Table 2)
ID  NAME_ID  ACCOUNT
1   1        bob_account_1
2   2        user2_joe
3   1        bob_account_2
4   1        bobs_third_account
Table 2: "Name"
ID  NAME  RANK  DATETIME
1   Bob   1     date1
2   Joe   2     date2
I'm not a database expert so this is just some of what I learned in my internet programming class. I hope this helps lead you in the right direction in further research.
