Parsing the Wikipedia Pagelink dataset

I downloaded the Wikipedia Pagelinks dataset (available on Wiki Dumps - http://dumps.wikimedia.org/enwiki/20140102/). I want to run the PageRank algorithm on the dataset, but I am unable to parse the data because it is not very well documented.
This is a sample of the downloaded dataset. The fields given are p1_from, p1_namespace, and p1_title. Looking online, p1_namespace is a number that denotes the type of article, but I do not know what p1_from is. To implement the PageRank algorithm, I want the number of articles that link to a particular article. By its name, p1_from sounds like the number of links that go away from that article, and not the other way around. Is this the case? And if it is, how can I reverse the graph given the data, so I can find the correct numbers?
DROP TABLE IF EXISTS `pagelinks`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT '0',
`pl_namespace` int(11) NOT NULL DEFAULT '0',
`pl_title` varbinary(255) NOT NULL DEFAULT '',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `pagelinks`
--
/*!40000 ALTER TABLE `pagelinks` DISABLE KEYS */;
INSERT INTO `pagelinks` VALUES (10,0,'Computer_accessibility'),(12,0,'-ism'),(12,0,'1848_Revolution'),(12,0,'1917_October_Revolution'),
(12,0,'1919_United_States_anarchist_bombings'),(12,0,'19th_century_philosophy'),
(12,0,'6_February_1934_crisis'),(12,0,'A._K._Press'),(12,0,'A._S._Neill'),(12,0,'AK_Press'),(12,0,'A_Greek–English_Lexicon'),(12,0,'A_Language_Older_Than_Words'),
(12,0,'A_Vindication_of_Natural_Society'),(12,0,'A_las_Barricadas'),(12,0,'Abbie_Hoffman'),(12,0,'Absolute_idealism'),(12,0,'Abstentionism'),(12,0,'Action_theory_(philosophy)'),
(12,0,'Adam_Smith'),(12,0,'Adolf_Brand'),(12,0,'Adolf_Hitler'),(12,0,'Adolphe_Thiers'),(12,0,'Aesthetic_emotions'),(12,0,'Aesthetics'),(12,0,'Affinity_group'),(12,0,'Affinity_groups'),
(12,0,'African_philosophy'),(12,0,'Against_Civilization:_Readings_and_Reflections'),(12,0,'Against_His-Story,_Against_Leviathan'),(12,0,'Age_of_Enlightenment'),(12,0,'Agriculturalism'),
(12,0,'Agriculture'),(12,0,'Al-Ghazali'),(12,0,'Alain_Badiou'),(12,0,'Alain_de_Benoist'),(12,0,'Albert_Camus'),(12,0,'Albert_Libertad'),(12,0,'Albert_Meltzer'),(12,0,'Aleister_Crowley'),
(12,0,'Alex_Comfort'),(12,0,'Alexander_Berkman'),(12,0,'Alexandre_Christoyannopoulos'),(12,0,'Alexandre_Skirda'),(12,0,'Alfredo_M._Bonanno')

"I am unable to parse the data because it is not very well documented."
The SQL dumps contain data taken directly from the MySQL tables that MediaWiki uses. Those tables are documented on mediawiki.org; in your case, it's the pagelinks table.
"The fields given are p1_from, p1_namespace, and p1_title."
No, that's not a 1 (the number one), it's an l (the letter L); pl is short for pagelinks.
"I do not know what p1_from is."
From the documentation, that's “Key to the page_id of the page containing the link.” To find out the name of the page where the link comes from, you will need the page table.
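Since pl_from identifies the page that contains the link, each row is one link pointing at the page named by (pl_namespace, pl_title), so counting rows per target gives the incoming-link counts the question asks for. The following is a minimal Python sketch of that idea, not a full dump parser: the regular expression, the helper names parse_rows and count_inlinks, and the extra row for page 25 are assumptions made for illustration. It also assumes that no tuple is split across two lines, and it leaves backslash escapes in titles as they appear in the dump.

```python
import re
from collections import Counter

# One (pl_from, pl_namespace, 'pl_title') tuple from an INSERT statement.
# The title part tolerates backslash-escaped characters such as \' .
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")

def parse_rows(line):
    """Yield (pl_from, pl_namespace, pl_title) for every tuple found on one line."""
    for m in ROW_RE.finditer(line):
        yield int(m.group(1)), int(m.group(2)), m.group(3)

def count_inlinks(lines):
    """Count rows per link target, i.e. per (pl_namespace, pl_title) pair."""
    counts = Counter()
    for line in lines:
        for _pl_from, namespace, title in parse_rows(line):
            counts[(namespace, title)] += 1
    return counts

sample = [
    "INSERT INTO `pagelinks` VALUES (10,0,'Computer_accessibility'),(12,0,'-ism'),",
    "(12,0,'Adam_Smith'),(25,0,'Adam_Smith')",  # the row for page 25 is made up
]
inlinks = count_inlinks(sample)
print(inlinks[(0, "Adam_Smith")])
```

For PageRank itself the edges are also needed, and turning pl_from into a title (or a target title into a page_id) still requires the page table, as noted above.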

Related

extracting certain size data from column

I have a table in MS SQL Server 2016. The table has a column called notes, of type varchar(255).
The notes column contains notes entered by end users.
Select ServiceDate, notes from my_table
ServiceDate, notes
--------------------------------------
9/1/2022 The order was called in AB13456736
9/1/2022 AB45876453 not setup
9/2/2022 Signature for AB764538334
9/2/2022 Contact for A0943847432
9/3/2022 Hold off on AB73645298
9/5/2022 ** Confirmed AB88988476
9/6/2022 /AB9847654 completed
I would like to extract the AB% value from the notes column. I know the ABxxxxxxxx value is always 10 characters long. Because it is entered at a different position each time, I can't tell the extraction where to look. I have tried the substring() and left() functions, but because the AB% value is always in a different position, I can't get them to extract it. Is there a method I can use to do this?
Thanks in advance.

Assuming there is only ONE AB{string} in notes, otherwise you would need a Table-Valued Function.
Note the nullif(). This is essentially a Fail-Safe if the string does not exist.
Example
Declare @YourTable Table ([ServiceDate] varchar(50),[notes] varchar(50))
Insert Into @YourTable Values
('9/1/2022','The order was called in AB13456736')
,('9/1/2022','AB45876453 not setup')
,('9/2/2022','Signature for AB764538334')
,('9/2/2022','Contact for A0943847432')
,('9/3/2022','Hold off on AB73645298')
,('9/5/2022','** Confirmed AB88988476')
,('9/6/2022','/AB9847654 completed')
Select *
,ABValue = substring(notes,nullif(patindex('%AB[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%',notes),0),10)
from @YourTable
Results
ServiceDate  notes                               ABValue
9/1/2022     The order was called in AB13456736  AB13456736
9/1/2022     AB45876453 not setup                AB45876453
9/2/2022     Signature for AB764538334           AB76453833
9/2/2022     Contact for A0943847432             NULL
9/3/2022     Hold off on AB73645298              AB73645298
9/5/2022     ** Confirmed AB88988476             AB88988476
9/6/2022     /AB9847654 completed                NULL
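To sanity-check the pattern outside SQL Server, the same rule ("AB" followed by eight digits, first occurrence only) can be sketched in Python. This is only an illustration and not part of the T-SQL answer: the names AB_RE and extract_ab are made up here. Python matching is case-sensitive, whereas PATINDEX follows the column's collation.

```python
import re

# "AB" followed by eight digits, mirroring the PATINDEX pattern above.
AB_RE = re.compile(r"AB[0-9]{8}")

def extract_ab(note):
    """Return the first 10-character AB value in the note, or None if absent."""
    match = AB_RE.search(note)
    return match.group(0) if match else None

notes = [
    "The order was called in AB13456736",
    "AB45876453 not setup",
    "Signature for AB764538334",
    "Contact for A0943847432",
    "Hold off on AB73645298",
    "** Confirmed AB88988476",
    "/AB9847654 completed",
]
for note in notes:
    print(note, "->", extract_ab(note))
```

When a note can hold several values (the case the answer reserves for a Table-Valued Function), AB_RE.findall(note) returns all of them as a list.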

Postgres - CRUD operations with arrays of composite types

One really neat feature of Postgres that I have only just discovered is the ability to define composite types, also referred to in their docs as ROWS and as RECORDS. Consider the following example
CREATE TYPE dow_id AS
(
tslot smallint,
day smallint
);
Now consider the following tables
CREATE SEQUENCE test_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1;
CREATE TABLE test_simple_array
(
id integer DEFAULT nextval('test_id_seq') NOT NULL,
dx integer []
);
CREATE TABLE test_composite_simple
(
id integer DEFAULT nextval('test_id_seq') NOT NULL,
dx dow_id
);
CREATE TABLE test_composite_array
(
id integer DEFAULT nextval('test_id_seq') NOT NULL,
dx dow_id[]
);
CRUD operations on the first two tables are relatively straightforward. For example
INSERT INTO test_simple_array (dx) VALUES ('{1,1}');
INSERT INTO test_composite_simple (dx) VALUES (ROW(1,1));
However, I have not been able to figure out how to perform CRUD ops when the table has an array of records/composite types as in test_composite_array. I have tried
INSERT INTO test_composite_array (dx) VALUES(ARRAY(ROW(1,1),ROW(1,2)));
which fails with the message
ERROR: syntax error at or near "ROW"
and
INSERT INTO test_composite_array (dx) VALUES("{(1,1),(1,2)}");
which fails with the message
ERROR: column "{(1,1),(1,2)}" does not exist
and
INSERT INTO test_composite_array (dx) VALUES('{"(1,1)","(1,2)"}');
which appears to work though it leaves me feeling confused since a subsequent
SELECT dx FROM test_composite_array
returns what appears to be a string result {"(1,1)","(1,2)"} although a further query such as
SELECT id FROM test_composite_array WHERE (dx[1]).tslot = 1;
works. I also tried the following
SELECT (dx[1]).day FROM test_composite_array;
UPDATE test_composite_array SET dx[1].day = 99 WHERE (dx[1]).tslot = 1;
SELECT (dx[1]).day FROM test_composite_array;
which works while
UPDATE test_composite_array SET (dx[1]).day = 99 WHERE (dx[1]).tslot = 1;
fails. I find that I am figuring out how to manipulate arrays of records/composite types in Postgres by trial and error and - although Postgres documentation is generally excellent - there appears to be no clear discussion of this topic in the documentation. I'd be much obliged to anyone who can point me to an authoritative discussion of how to manipulate arrays of composite types in Postgres.
That apart, are there any unexpected gotchas when working with such arrays?

You need square brackets with ARRAY:
ARRAY[ROW(1,1)::dow_id,ROW(1,2)::dow_id]
A warning: composite types are a great feature, but you will make your life harder if you overuse them. As soon as you want to use elements of a composite type in WHERE or JOIN conditions, you are doing something wrong, and you are going to suffer. There are good reasons for normalizing relational data.

I define a field as varchar(30) but I can insert character only 28

I don't understand. I already defined the column as VARCHAR(30), and I try to insert this data via a web page: STIFF COMP,R FR DOOR SKIN CTR
but the insert fails with
Error: string or binary data would be truncated

If you created your column correctly, then it should work.
Let's look at a simple example:
CREATE TABLE [dbo].[Table_1](
[stringTest] [varchar](30) NULL
) ON [PRIMARY]
GO
This is just a simple Table_1 with one column, [stringTest], of type [varchar](30).
Then I insert your string: insert into Table_1(stringTest) values('STIFF COMP,R FR DOOR SKIN CTR')
It works fine, which confirms that your original text fits.
So the other possible causes are:
You set up the database incorrectly (compare it with my simple table above).
You use an application (ASP.NET, perhaps) to add the value. Check in debug mode to see the value that is actually sent (it may be formatted or encoded, since I saw a comma in your string).
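The debug-mode check can be sketched in a few lines of Python. This is only an illustration of what to look at, not the application's actual code: the helper describe and the appended line break are hypothetical. The string as typed is 29 characters long, so anything that pushes it past 30 must be added somewhere between the web page and the database.

```python
def describe(value, limit=30):
    """Report the facts that matter when a value must fit a varchar(limit) column."""
    return {
        "repr": repr(value),        # exposes hidden whitespace and escape sequences
        "chars": len(value),
        "fits": len(value) <= limit,
    }

text = "STIFF COMP,R FR DOOR SKIN CTR"
print(describe(text))

# Hypothetical case: the web layer appended a line break before sending the value.
print(describe(text + "\r\n"))
```

Printing the repr of the value just before the insert makes trailing whitespace, line breaks, and encoded characters visible that a plain print would hide.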

Oracle SET DESCRIBE DEPTH is obsolete (replacement)

I cannot find in the Oracle docs any reference to the new version of the below command:
SET DESCRIBE DEPTH 3
line 89: "SET DESCRIBE DEPTH 3" is Obsolete.
How can this be achieved in newer versions of Oracle databases?
It should mimic the following behaviour for object types, e.g.
CREATE OR REPLACE TYPE ADDRESSES AS OBJECT (
street VARCHAR2 (25),
house_no NUMBER(2)
);
CREATE OR REPLACE TYPE PEOPLE AS OBJECT (
name VARCHAR2 (15),
address ADDRESSES,
MAP MEMBER FUNCTION Equals RETURN VARCHAR2,
MEMBER FUNCTION PeopleToString RETURN VARCHAR2,
PRAGMA RESTRICT_REFERENCES (PeopleToString, RNDS, WNDS, RNPS, WNPS)
)
NOT FINAL;
CREATE TABLE Locations (
pseudo VARCHAR2(15) CONSTRAINT pk_xyz_table PRIMARY KEY
CONSTRAINT fk_loc_xyz REFERENCES XYZ(pseudo),
person PEOPLE
);
SET DESC DEPTH 3
DESC Locations
PSEUDO NOT NULL VARCHAR2(15)
PEOPLE
PEOPLE IS NOT FINAL
NAME VARCHAR2(15)
ADDRESS ADDRESSES
STREET VARCHAR2(25)
HOUSE_NO NUMBER

SET DESC DEPTH n is not obsolete in SQL*Plus, according to the 12c manual and my tests.
The problem appears to be with Oracle SQL Developer's poor imitation of SQL*Plus. These bugs are why it's dangerous for integrated development environments to try to clone SQL*Plus.
SQL*Plus is not a great tool. Its main advantage is its compatibility across many platforms. There are so many ways to "run a script" that it's nice to have a method that you know will work the same for everyone.
Accept no imitation - if you need SQL*Plus, use the real thing.

Dynamic default values for table columns in Postgresql 9.1

I have a table called members
CREATE TABLE netcen.mst_member
(
mem_code character varying(8) NOT NULL,
mem_name text NOT NULL,
mem_cnt_code character varying(2) NOT NULL,
mem_brn_code smallint NOT NULL, -- The branch where the member belongs
mem_email character varying(128),
mem_cell character varying(11),
mem_address text,
mem_typ_code smallint NOT NULL,
CONSTRAINT mem_code PRIMARY KEY (mem_code ))
Each member type has a different sequence for the member code, i.e. for gold members the member codes will be
GLD0091, GLD0092,...
and platinum member codes will be
PLT00020, PLT00021,...
I would like the default value of the mem_code field to be a dynamic value depending on the member type selected. How can I use a check constraint to implement that?
Please help, I am using Postgresql 9.1.
I have created the following trigger function to construct the string, but I still get an error when I insert into the members table as Randy said.
CREATE OR REPLACE FUNCTION netcen.generate_member_code()
RETURNS trigger AS
$BODY$DECLARE
tmp_suffix text :='';
tmp_prefix text :='';
tmp_typecode smallint ;
cur_setting refcursor;
BEGIN
OPEN cur_setting FOR
EXECUTE 'SELECT typ_suffix,typ_prefix,typ_code FROM mst_member_type WHERE typ_code =' || NEW.mem_typ_code ;
FETCH cur_setting into tmp_suffix,tmp_prefix,tmp_typecode;
CLOSE cur_setting;
NEW.mem_code:=tmp_prefix || to_char(nextval('seq_members_'|| tmp_typecode), 'FM0000000') || tmp_suffix;
END$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION netcen.generate_member_code()
OWNER TO mnoma;
Where could I be going wrong?
I get the following error:
ERROR: relation "mst_member_type" does not exist
LINE 1: SELECT typ_suffix,typ_prefix,typ_code FROM mst_member_type W...
^
QUERY: SELECT typ_suffix,typ_prefix,typ_code FROM mst_member_type WHERE typ_code =1
CONTEXT: PL/pgSQL function "generate_member_code" line 7 at OPEN

I think this is a normalization problem.
The codes you provide are derivable from other information, so they really do not belong in an independent column.
You could just store the type in one column and the number in another, then append them together to make this combo-code in any query where it is needed.
If you want to persist this denormalized solution, then you could make a trigger to construct the string on any insert or update.
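Deriving the code on demand from the two stored values can be sketched as follows. This is a minimal illustration in Python instead of SQL: the function name member_code and the per-type width parameter are assumptions, with the widths read off the examples in the question (GLD0091, PLT00020).

```python
def member_code(prefix, number, width, suffix=""):
    """Build a display code from the stored type prefix and sequence number."""
    return f"{prefix}{number:0{width}d}{suffix}"

# Widths are assumed from the examples in the question.
print(member_code("GLD", 91, 4))
print(member_code("PLT", 20, 5))
```

Only the member type and the number are stored under this design, so the code cannot drift out of sync with them.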
