ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде.{S} Достигали высоты ...
corpus-ru-dbpedia-short-dea-1000.csv from the Corpus subfolder: Text > Open...
Sentence.grf in MERGE mode
Replace.grf in REPLACE mode
Sentence.grf splits the text into sentences, adding an {S} tag before the next sentence (language dependent)
Replace.grf removes ¬ (soft hyphen) and converts no-break spaces to spaces
{S}
{STOP} to delimit texts
{ЮУрГУ,.N+ORG+gen(M)}
alphabet.txt
семью is assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
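The three entries above share one layout: inflected form, comma, canonical form, dot, grammatical code with optional `+` codes, then inflectional codes after each `:`. The following is a minimal Python sketch of reading that layout, not part of Unitex itself. The names `LexicalEntry` and `parse_entry` are hypothetical, escaped characters are not handled, and treating an empty canonical form as "same as the inflected form" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexicalEntry:
    inflected: str          # surface form found in the text
    lemma: str              # canonical form
    code: str               # main grammatical code, e.g. N, ADV, NUM
    traits: List[str]       # further codes joined with "+", e.g. anim(j)
    inflections: List[str]  # inflectional codes after each ":", e.g. aeF


def parse_entry(line: str) -> LexicalEntry:
    """Parse 'inflected,lemma.CODE+trait:infl' (no escaped characters)."""
    body = line.strip().strip("{}")  # also accept {form,lemma.CODE} tags
    inflected, rest = body.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    code, *traits = codes.split("+")
    # Assumption: an empty lemma means the lemma equals the inflected form.
    return LexicalEntry(inflected, lemma or inflected, code, traits, inflections)


entries = [parse_entry(s) for s in (
    "семью,семья.N+anim(j)+gen(F):aeF",
    "семью,.ADV",
    "семью,семь.NUM+plur:t",
)]
for e in entries:
    print(e.lemma, e.code, e.traits, e.inflections)
```

Seen this way, the single token семью is ambiguous between a noun, an adverb and a numeral, which is why later steps filter on codes.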
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов.
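The {S} markers in the text above come from the preprocessing graphs. As a rough illustration only, the two steps can be approximated in Python. This sketch is not how Sentence.grf or Replace.grf work internally: the function names are hypothetical, the soft hyphen is assumed to be U+00AD, and the boundary rule (., ! or ? followed by whitespace and a capital letter) is a simplifying assumption.

```python
import re

SOFT_HYPHEN = "\u00ad"     # assumption: the character shown as ¬ on the slide
NO_BREAK_SPACE = "\u00a0"


def normalize(text: str) -> str:
    """Rough analogue of the Replace.grf step."""
    return text.replace(SOFT_HYPHEN, "").replace(NO_BREAK_SPACE, " ")


# Assumption: a sentence ends at ., ! or ? followed by whitespace and a capital.
_BOUNDARY = re.compile(r"([.!?])(\s+)(?=[A-ZА-ЯЁ])")


def mark_sentences(text: str) -> str:
    """Insert {S} after each detected sentence end, before the next sentence."""
    return _BOUNDARY.sub(r"\1{S}\2", text)


raw = "живший в четвертичном пе\u00adриоде. Достигали высоты 5,5\u00a0метров."
marked = mark_sentences(normalize(raw))
print(marked)
```

A rule this simple breaks on abbreviations and initials, which is one reason the real sentence graph is language dependent.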
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
род into Regular expression
вымерший (participle) and широколиственных (adjective) can be omitted
<N> — <V:S>* род (<A>+<!DIC>)* <N>
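To make the pattern `<N> — <V:S>* род (<A>+<!DIC>)* <N>` more concrete, here is a small Python sketch that matches it against a token list. It only imitates the idea of lexical masks with optional repetition. The `LEXICON` contents are hypothetical (not real dictionary data), `V:S` is treated as an opaque label, and `match_at` is an illustrative helper, not a Unitex function.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical mini-dictionary: token -> set of codes (not real CISLEX data).
LEXICON: Dict[str, Set[str]] = {
    "Мамонты": {"N"},
    "вымерший": {"V:S"},
    "род": {"N"},
    "млекопитающих": {"N"},
}

Predicate = Callable[[str], bool]


def has(code: str) -> Predicate:
    return lambda tok: code in LEXICON.get(tok, set())


def literal(form: str) -> Predicate:
    return lambda tok: tok == form


def adj_or_unknown(tok: str) -> bool:
    # <A>+<!DIC>: an adjective, or a token absent from the dictionary
    return "A" in LEXICON.get(tok, set()) or tok not in LEXICON


# (predicate, repeatable) pairs for: <N> — <V:S>* род (<A>+<!DIC>)* <N>
PATTERN: List[Tuple[Predicate, bool]] = [
    (has("N"), False), (literal("—"), False), (has("V:S"), True),
    (literal("род"), False), (adj_or_unknown, True), (has("N"), False),
]


def match_at(tokens: List[str], i: int, j: int = 0) -> Optional[int]:
    """Return the end index of a match starting at tokens[i], or None."""
    if j == len(PATTERN):
        return i
    pred, repeatable = PATTERN[j]
    if repeatable:
        k = i
        while k < len(tokens) and pred(tokens[k]):
            k += 1
        for end in range(k, i - 1, -1):  # longest first, then backtrack
            result = match_at(tokens, end, j + 1)
            if result is not None:
                return result
        return None
    if i < len(tokens) and pred(tokens[i]):
        return match_at(tokens, i + 1, j + 1)
    return None


tokens = "Мамонты — вымерший род млекопитающих из семейства слоновых".split()
end = match_at(tokens, 0)
print(tokens[:end] if end is not None else "no match")
```

Because the starred elements may match zero tokens, the same pattern also accepts the shorter sequence without the participle.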
<род>: matches all the entries that have род as canonical form
<стать.V>: matches all entries having стать as canonical form and the grammatical code V
<V>: matches all entries having the grammatical code V
{стану,стать.V} or <стану,стать.V>: matches all the entries having стану as inflected form, стать as canonical form and the grammatical code V
<E>: the empty word or epsilon. Matches the empty string
<TOKEN>: matches any token, except the space; used by default for morphological filters
<MOT>: matches any token that consists of letters
<MIN>: matches any lower-case token
<MAJ>: matches any upper-case token
<PRE>: matches any token that starts with a capital letter
<DIC>: matches any word that is present in the dictionaries of the text
<SDIC>: matches any simple word in the text dictionaries
<CDIC>: matches any compound word in the dictionaries of the text
<TDIC>: matches any tagged token like {XXX,XXX.XXX}
<NB>: matches any contiguous sequence of digits (1234 is matched but not 1 234)
<#>: prohibits the presence of space
<N>, press Enter
— box, connected to the <N> box
род box, connected to the — box
<N> box, connected to the род box
<N> box, click on the final state (a circle with a square inside) to connect these 2 boxes
<V:S> box between the — and род boxes
<A>+<!DIC> box between the род and <N> boxes
Graphs/match-hyponyms.grf: FSGraph > Save
<N> box (hyponym) and change it to <N>/{[ to add {[ before the matched noun, when the graph is applied in the MERGE mode
<N>/{[ and click on the — box to disconnect these boxes
<E>/]=HYPONYM} box between the <N>/{[ and — boxes. It will add ]=HYPONYM} after the matched noun
<N> box for adding a HYPERNYM tag to it
tag-hyponyms.grf
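The effect of the two outputs in MERGE mode (`{[` before the matched noun, `]=HYPONYM}` after it) can be shown with a short Python sketch. It assumes the token positions of the two nouns are already known and that the HYPERNYM tag uses the same bracket layout as the HYPONYM one. The slides do not spell that layout out, so treat it as an assumption. `wrap` and `tag_match` are hypothetical helper names.

```python
from typing import List


def wrap(token: str, label: str) -> str:
    """Surround a matched token with the {[ ... ]=LABEL} markers."""
    return "{[" + token + "]=" + label + "}"


def tag_match(tokens: List[str], hyponym_at: int, hypernym_at: int) -> str:
    """Return the text with both matched nouns wrapped, other tokens unchanged."""
    out = list(tokens)
    out[hyponym_at] = wrap(out[hyponym_at], "HYPONYM")
    # Assumption: the HYPERNYM tag mirrors the HYPONYM layout.
    out[hypernym_at] = wrap(out[hypernym_at], "HYPERNYM")
    return " ".join(out)


tokens = ["Мамонты", "—", "вымерший", "род", "млекопитающих"]
tagged = tag_match(tokens, 0, 4)
print(tagged)
```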
tag-hyponyms.grf
concord.ind file in the corpus folder corpus-ru-dbpedia-short-dea-1000_snt
match-hyponyms.grf: FSGraph > Open...
<N> box, right-click on it and select
<N> box and change it to <N>/$hyponym$ to store the matched noun with all morphological information in the $hyponym$ variable
<N> box to store the matched noun in variable $hypernym$ in the morphological mode
<E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the final state
mine-hyponyms.grf
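The output `$hypernym.LEMMA$: $hyponym.LEMMA$` prints the canonical forms of the two stored nouns. A minimal Python sketch of that substitution follows. The `variables` dictionary and its `INFLECTED`/`LEMMA` keys are hypothetical stand-ins for what the graph variables hold, and `render` is an illustrative helper rather than a Unitex function.

```python
import re
from typing import Dict

# Hypothetical lexical information captured by the two variables.
variables: Dict[str, Dict[str, str]] = {
    "hyponym": {"INFLECTED": "Мамонты", "LEMMA": "мамонт"},
    "hypernym": {"INFLECTED": "млекопитающих", "LEMMA": "млекопитающее"},
}


def render(template: str, captured: Dict[str, Dict[str, str]]) -> str:
    """Replace each $name.FIELD$ reference with the captured value."""
    def lookup(m):
        return captured[m.group(1)][m.group(2)]
    return re.sub(r"\$(\w+)\.(\w+)\$", lookup, template)


line = render("$hypernym.LEMMA$: $hyponym.LEMMA$", variables)
print(line)
```

Emitting lemmas rather than inflected forms means the same pair is written identically whatever case or number the nouns had in the sentence.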
Dela/CISLEXru_igrok.bin
CISLEXru_igrok.bin
and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
mine-hyponyms.grf: to remove ambiguous outputs, change the first <N> box to <N~PN:n>
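The effect of `<N~PN:n>` on the four entries listed for Бук can be checked with a short Python sketch. It assumes that `~PN` rejects entries carrying the PN code and that `:n` requires at least one inflectional code containing `n`. The helper names are hypothetical, and the entry strings are copied from the list above.

```python
from typing import List, Tuple


def parse(line: str) -> Tuple[List[str], List[str]]:
    """Return (codes, inflectional codes) of an 'inflected,lemma.CODES:infl' entry."""
    info = line.split(",", 1)[1].split(".", 1)[1]
    codes, *inflections = info.split(":")
    return codes.split("+"), inflections


def matches_mask(line: str) -> bool:
    """Approximate <N~PN:n>: code N, no PN code, some inflectional code with n."""
    codes, inflections = parse(line)
    return (
        "N" in codes
        and "PN" not in codes
        and any("n" in infl for infl in inflections)
    )


entries = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    "Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]
kept = [e for e in entries if matches_mask(e)]
print(kept)
```

Under these assumptions only the common noun бук survives: the first two entries carry PN, and the third has no inflectional code containing `n`.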
Artem Lukanin