Processing lexical analyses of sentences using the Perl split function

I have two kinds of lexical analyses of sentences that I need to process. One type of data comes in a "tagged" format, and the other comes in a "parsed" format.

Tagged

The input (@subsentences) looks like:

5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP examined_VBN membrane_NN structure_NN and_CC how_WRB it_PRP is_VBZ used_VBN to_TO perform_VB one_CD membrane_NN function_NN :_: the_DT binding_JJ of_IN one_CD cell_NN to_TO another_DT ._.

Desired output

5.4 Passive Processes of Membrane Transport 85 We have examined membrane stru....

My code

@finalsentence = split(/_\S+/,$subsentences[$j]);
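
For comparison, here is a minimal sketch of what that split produces next to a plain substitution that deletes the tags in place. The sample string is a shortened copy of the input above, and the names `$tagged`, `$joined` and `$stripped` are made up for the illustration; the `/r` flag assumes Perl 5.14 or newer.

```perl
use strict;
use warnings;

# Shortened sample of the tagged input shown above.
my $tagged = '5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD';

# The split from the question: every piece after the first keeps its
# leading space, so joining with the empty string rebuilds the sentence.
my @finalsentence = split /_\S+/, $tagged;
my $joined = join '', @finalsentence;

# Alternative: remove every "_TAG" suffix directly.
# /g replaces all matches, /r returns the result instead of modifying $tagged.
my $stripped = $tagged =~ s/_\S+//gr;

print "$joined\n";
print "$stripped\n";
```

Both approaches should give the same text here; the substitution simply skips the intermediate array.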

Parsed

Parsing [sent. 1 len. 31]:
nsubj(85-7, Processes-3)
nn(Transport-6, Membrane-5)
prep_of(Processes-3, Transport-6)
nsubj(examined-10, We-8)
nsubjpass(used-17, it-15)
xsubj(perform-19, it-15)
conj_and(examined-10, used-17)
xcomp(used-17, perform-19)
dobj(perform-19, function-22)
prep_of(binding-25, cell-28) <- refer to this for examples below

Desired output (for the last line)

the sent. number (i.e. sent. 1)
the grammar function (i.e. prep_of)
the first dependency word (i.e. binding)
the second dependency word (i.e. cell)

My code

Here is how I do it now. It is pretty crude, and when I check for word boundaries (\b), the results are sometimes undefined:

For the sent. number:

@parsesentcounter = split (/.*sent\.\s/, $typeddependencies[$i]);
@parsesentcounter = split (/\s/, $typeddependencies[$i]);

This (crude method) leaves the sent. number (sent. 1) at $parsesentcounter[2]
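
One thing that may make a fixed index fragile: `split` with `/\s/` keeps empty leading fields, so any leading whitespace on the line shifts the position of the number. A capturing match does not depend on position. The sketch below assumes the header always looks like `[sent. N len. M]`; `$header` and `$sent_number` are names invented for the example.

```perl
use strict;
use warnings;

# Header line copied from the parsed input, including its leading spaces.
my $header = '   Parsing [sent. 1 len. 31]:';

# With /\s/ each leading space yields an empty field, so the number
# is no longer at index 2 for this particular line.
my @fields = split /\s/, $header;

# A match in list context returns the captured digits, wherever they sit.
my ($sent_number) = $header =~ /\[sent\.\s+(\d+)/;

print "sent. $sent_number\n" if defined $sent_number;
```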

For the grammar function:

@grammarfunction = split(/\(\S+\s\S+\s/,$typeddependencies[$i]);

This leaves the grammar function(prep_of) at $grammarfunction[0]

For the dependency words, I do it in a few steps (I think I get lost a bit here):

@dependencywords = split (/,\s+/,$typeddependencies[$i]); ## Take out all commas, there was also a space associated
@dependencywords = split (/-\S+\s+/,$typeddependencies[$i]); ## Take out all -digits and space

This leaves the second dependency word(cell) at $dependencywords[1].

Then for first dependency word:

@firstdependencyword = split(/.*subj\w*.|.*obj\w*.|.*prep\w*\(|.*xcomp\w*\(|.*agent\(|purpcl\(|.*conj_and\(/,$dependencywords[0]);

This leaves the first dependency word (binding) at $firstdependencyword[1]
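
As a possible alternative to the chain of splits, a single regex with capture groups can pull out the grammar function and both words at once. This is only a sketch: it assumes every dependency has the shape `function(word-N, word-N)`, and `$deps` and `@rows` are illustrative names, not part of the original code.

```perl
use strict;
use warnings;

# Two dependencies taken from the parsed input above.
my $deps = 'dobj(perform-19, function-22)        prep_of(binding-25, cell-28)';

# Collect one [function, first word, second word] triple per dependency.
my @rows;
while ( $deps =~ /(\w+)\((\S+)-\d+,\s+(\S+)-\d+\)/g ) {
    # $1 = grammar function, $2 = first word, $3 = second word;
    # the trailing "-digits" position markers are matched but not captured.
    push @rows, [ $1, $2, $3 ];
}

print join(' ', @$_), "\n" for @rows;
```

Because the pattern names the function generically with `\w+`, it does not need a separate alternative for each relation type (subj, obj, prep, and so on).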
