I have two kinds of lexical analyses of sentences that I need to process. One type of data comes in a "tagged" format, and the other comes in a "parsed" format.
Tagged
The input (@subsentences) looks like:
5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP examined_VBN membrane_NN structure_NN and_CC how_WRB it_PRP is_VBZ used_VBN to_TO perform_VB one_CD membrane_NN function_NN :_: the_DT binding_JJ of_IN one_CD cell_NN to_TO another_DT ._.
Desired output
5.4 Passive Processes of Membrane Transport 85 We have examined membrane stru....
My code
@finalsentence = split(/_\S+/,$subsentences[$j]);
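One thing to watch with splitting on /_\S+/: the whitespace between tokens stays attached to the front of each following word (" Passive", " Processes", and so on). A minimal sketch of an alternative, assuming every token has the form word_TAG and that Perl 5.14 or later is available for the /r modifier. The helper name strip_tags is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: remove the trailing _TAG from every
# whitespace-separated token and rejoin the words with single spaces.
sub strip_tags {
    my ($tagged) = @_;
    # split ' ' breaks on runs of whitespace; the substitution deletes
    # the last underscore in the token and everything after it.
    my @words = map { s/_[^_\s]+\z//r } split ' ', $tagged;
    return join ' ', @words;
}

my $tagged = '5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD';
print strip_tags($tagged), "\n";
# prints: 5.4 Passive Processes of Membrane Transport 85
```

Punctuation tokens such as :_: and ._. come out as : and . under this pattern, since only the final _TAG part is removed.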
Parsed
Parsing [sent. 1 len. 31]: nsubj(85-7, Processes-3) nn(Transport-6, Membrane-5) prep_of(Processes-3, Transport-6) nsubj(examined-10, We-8) nsubjpass(used-17, it-15) xsubj(perform-19, it-15) conj_and(examined-10, used-17) xcomp(used-17, perform-19) dobj(perform-19, function-22) prep_of(binding-25, cell-28) <- refer to this for examples below
Desired output (for the last line)
- the sent. number (i.e. sent. 1)
- the grammar function (i.e. prep_of)
- the first dependency word (i.e. binding)
- the second dependency word (i.e. cell)
My code
Here is how I do it. It's pretty crude, and when I check for word boundaries (\b), the results are sometimes undefined:
For the sent. number:
@parsesentcounter = split (/.*sent\.\s/, $typeddependencies[$i]);
@parsesentcounter = split (/\s/, $typeddependencies[$i]);
This (crude) method leaves the sent. number (sent. 1) at $parsesentcounter[2]
For the grammar function:
@grammarfunction = split(/\(\S+\s\S+\s/,$typeddependencies[$i]);
This leaves the grammar function (prep_of) at $grammarfunction[0]
For the dependency words, I do it in a few steps (I think I get lost a bit here):
@dependencywords = split (/,\s+/,$typeddependencies[$i]); ## Take out all commas, there was also a space associated
@dependencywords = split (/-\S+\s+/,$typeddependencies[$i]); ## Take out all -digits and space
This leaves the second dependency word (cell) at $dependencywords[1].
Then for first dependency word:
@firstdependencyword = split(/.*subj\w*.|.*obj\w*.|.*prep\w*\(|.*xcomp\w*\(|.*agent\(|purpcl\(|.*conj_and\(/,$dependencywords[0]);
This leaves the first dependency word (binding) at $firstdependencyword[1]
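For comparison, the four pieces can probably be pulled out with regex captures instead of chained splits. This is only a sketch, assuming every dependency has the shape relation(word-N, word-N) as in the sample line above. The helper name parse_dependencies is invented here, and index suffixes such as a trailing prime are not handled:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sentence number and a reference to a
# list of [relation, first word, second word] triples for one parser line.
sub parse_dependencies {
    my ($line) = @_;
    # Sentence number from the "[sent. N len. M]" header.
    my ($sent) = $line =~ /\[sent\.\s+(\d+)/;
    my @deps;
    # Each /g iteration captures one relation(word-N, word-N) group;
    # the -N position index after each word is matched but not captured.
    while ($line =~ /(\w+)\(([^\s,()]+)-\d+,\s*([^\s,()]+)-\d+\)/g) {
        push @deps, [$1, $2, $3];
    }
    return ($sent, \@deps);
}

my $line = 'Parsing [sent. 1 len. 31]: nsubj(85-7, Processes-3) prep_of(binding-25, cell-28)';
my ($sent, $deps) = parse_dependencies($line);
print "sent. $sent\n";
print join("\t", @$_), "\n" for @$deps;
```

Because the relation is captured with \w+, new relation names do not have to be listed by hand the way the subj/obj/prep alternation requires, and the last dependency is simply $deps->[-1].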