Channel: Processing lexical analyses of sentences using the Perl split function - Code Review Stack Exchange

Processing lexical analyses of sentences using the Perl split function


I have two kinds of lexical analyses of sentences that I need to process. One type of data comes in a "tagged" format, and the other comes in a "parsed" format.


Tagged

The input (@subsentences) looks like:

5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP examined_VBN membrane_NN structure_NN and_CC how_WRB it_PRP is_VBZ used_VBN to_TO perform_VB one_CD membrane_NN function_NN :_: the_DT binding_JJ of_IN one_CD cell_NN to_TO another_DT ._.

Desired output

5.4 Passive Processes of Membrane Transport 85 We have examined membrane stru....

My code

@finalsentence = split(/_\S+/,$subsentences[$j]);
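One thing to note about the split above: it removes the tags, but each resulting element keeps the whitespace that followed the previous tag (" Passive", " Processes", and so on). A minimal sketch of two alternatives follows, assuming every token has the form word_TAG and tokens are separated by whitespace. The variable names $tagged, $plain and @words are illustrative and not part of the original code.

```perl
use strict;
use warnings;

# Sample input copied from the question (shortened for the example).
my $tagged = '5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP';

# Option 1: delete every "_TAG" suffix and keep the sentence as one string.
(my $plain = $tagged) =~ s/_\S+//g;

# Option 2: split on whitespace first, then strip the final "_TAG" from each
# token, which yields a clean list of words without stray spaces.
my @words = map { /\A(.+)_[^_]+\z/ ? $1 : $_ } split ' ', $tagged;

print "$plain\n";
print join(' ', @words), "\n";
```

Option 1 suits the case where only the readable sentence is needed; option 2 keeps the words as separate list elements for further processing.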

Parsed

   Parsing [sent. 1 len. 31]:
        nsubj(85-7, Processes-3)
        nn(Transport-6, Membrane-5)
        prep_of(Processes-3, Transport-6)
        nsubj(examined-10, We-8)
        nsubjpass(used-17, it-15)
        xsubj(perform-19, it-15)
        conj_and(examined-10, used-17)
        xcomp(used-17, perform-19)
        dobj(perform-19, function-22)
        prep_of(binding-25, cell-28) <- refer to this for examples below

Desired output (for the last line)

  • the sent. number (i.e. sent. 1)
  • the grammar function (i.e. prep_of)
  • the first dependency word (i.e. binding)
  • the second dependency word (i.e. cell)

My code

Here is how I do it. When I check for word boundaries (\b), they are sometimes not defined, and on top of that the approach is pretty crude:

For the sent. number:

@parsesentcounter = split (/.*sent\.\s/, $typeddependencies[$i]);
@parsesentcounter = split (/\s/, $typeddependencies[$i]);

This (crude method) leaves the sent. number (sent. 1) at $parsesentcounter[2]
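As an alternative to counting split fields, a regex match with capture groups can pull the number out directly. This is only a sketch, assuming the header always has the shape "Parsing [sent. N len. M]:" shown above; $header, $sent_number and $length are illustrative names.

```perl
use strict;
use warnings;

# Header line in the shape shown in the question.
my $header = '   Parsing [sent. 1 len. 31]:';

# A match in list context returns the captured groups, so no array index
# needs to be remembered. Both variables stay undef if the line does not match.
my ($sent_number, $length) = $header =~ /\[sent\.\s+(\d+)\s+len\.\s+(\d+)\]/;

if (defined $sent_number) {
    print "sent. $sent_number (length $length)\n";
}
```

Because the pattern names what it expects, leading whitespace or extra fields on the line do not shift the position of the result.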

For the grammar function:

@grammarfunction = split(/\(\S+\s\S+\s/,$typeddependencies[$i]);

This leaves the grammar function (prep_of) at $grammarfunction[0].

For the dependency words, I do it in a few steps (I think I get lost a bit here):

@dependencywords = split (/,\s+/,$typeddependencies[$i]); ## Take out all commas, there was also a space associated
@dependencywords = split (/-\S+\s+/,$typeddependencies[$i]); ## Take out all -digits and space

This leaves the second dependency word (cell) at $dependencywords[1].

Then for first dependency word:

@firstdependencyword = split(/.*subj\w*.|.*obj\w*.|.*prep\w*\(|.*xcomp\w*\(|.*agent\(|purpcl\(|.*conj_and\(/,$dependencywords[0]);

This leaves the first dependency word (binding) at $firstdependencyword[1]
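The three pieces (grammar function, first word, second word) can also be captured in one pass instead of through several splits. The sketch below assumes each dependency has the form relation(word-index, word-index) with purely numeric indices. The names $parse and @dependencies and the hash keys are illustrative only.

```perl
use strict;
use warnings;

# A few dependencies copied from the question's example.
my $parse = 'nsubj(85-7, Processes-3) nn(Transport-6, Membrane-5) prep_of(binding-25, cell-28)';

my @dependencies;

# With /g in a while loop, each iteration matches the next dependency.
# $1 = grammar function, $2 = first word, $4 = second word;
# $3 and $5 are the numeric positions, which are ignored here.
while ($parse =~ /(\w+)\((\S+)-(\d+),\s+(\S+)-(\d+)\)/g) {
    push @dependencies, { relation => $1, first => $2, second => $4 };
}

my $last = $dependencies[-1];
print "$last->{relation} $last->{first} $last->{second}\n";
```

Since the pattern matches any relation name made of word characters, there is no need to enumerate subj, obj, prep, xcomp and the rest as alternatives.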

