ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.
A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.
A finite-state transducer, or FST, is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton that recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape.
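The two-tape idea can be illustrated with a minimal sketch. The state names, symbols and arcs below are invented for illustration and are not taken from any Unitex graph; each arc pairs one input symbol with one output symbol.

```python
# Toy finite-state transducer: each arc carries an (input, output) symbol pair.
# States, symbols and arcs are illustrative assumptions.
ARCS = {
    ("q0", "a"): ("q1", "x"),
    ("q1", "b"): ("q1", "y"),
    ("q1", "c"): ("q2", "z"),
}
FINAL_STATES = {"q2"}


def transduce(symbols, start="q0"):
    """Return the output string for `symbols`, or None if the input is rejected."""
    state, output = start, []
    for symbol in symbols:
        arc = ARCS.get((state, symbol))
        if arc is None:          # no arc for this input symbol: reject
            return None
        state, out = arc
        output.append(out)
    # Accept only if the walk ends in a final state.
    return "".join(output) if state in FINAL_STATES else None
```

Calling `transduce("abbc")` walks the arcs and builds the output tape symbol by symbol; an input that stops outside a final state is rejected.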
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде.{S} Достигали высоты ...
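The effect of the sentence delimiter can be approximated outside Unitex with a rough sketch. The boundary rule below (terminal punctuation, whitespace, then an upper-case letter) is a simplifying assumption and is much cruder than a real sentence-splitting graph.

```python
import re

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace
# and an upper-case Latin or Cyrillic letter.
BOUNDARY = re.compile(r"([.!?])(\s+)(?=[A-ZА-ЯЁ])")


def mark_sentences(text):
    """Insert the {S} tag right after sentence-final punctuation."""
    return BOUNDARY.sub(r"\1{S}\2", text)
```

Decimal commas such as `5,5` are left alone because the rule only looks at `.`, `!` and `?`.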
corpus-ru-dbpedia-short-dea-1000.csv from the Corpus subfolder: Text > Open...
Sentence.grf in MERGE mode
Replace.grf in REPLACE mode
Sentence.grf splits the text into sentences, adding an {S} tag before the next sentence (language dependent)
Replace.grf removes ¬ (soft hyphen) and converts no-break spaces to spaces
{S} {STOP} to delimit texts
{ЮУрГУ,.N+ORG+gen(M)}
alphabet.txt
семью is assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
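The layout of these lexical tags (inflected form, lemma, codes, inflectional codes) can be made explicit with a small parser. This is a sketch under assumptions: the `LexicalTag` class and `parse_tag` function are hypothetical names, and escaped commas or dots inside word forms are not handled.

```python
from dataclasses import dataclass


@dataclass
class LexicalTag:
    inflected: str      # form as it appears in the text
    lemma: str          # canonical form (same as inflected when left empty)
    codes: list         # grammatical code followed by further codes
    inflections: list   # inflectional codes that follow ':'


def parse_tag(line):
    """Parse a line such as 'семью,семья.N+anim(j)+gen(F):aeF'."""
    inflected, rest = line.split(",", 1)
    lemma, rest = rest.split(".", 1)
    code_part, *inflections = rest.split(":")
    return LexicalTag(inflected, lemma or inflected, code_part.split("+"), inflections)
```

For `семью,.ADV` the empty lemma field is read as "same as the inflected form", which is why the three readings of семью end up with three different lemmas.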
Unlike synonymy and antonymy, which are lexical relations between word forms, hyponymy/hypernymy is a semantic relation between word meanings: e.g., {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}. Much attention has been devoted to hyponymy/hypernymy (variously called subordination/superordination, subset/superset, or the ISA relation)...
A concept represented by the synset {x, x′,...} is said to be a hyponym of the concept represented by the synset {y, y′,...} if native speakers of English accept sentences constructed from such frames as An x is a (kind of) y. The relation can be represented by including in {x, x′,...} a pointer to its superordinate, and including in {y, y′,...} pointers to its hyponyms.
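The pointer representation described above can be sketched as a small data structure. The `Synset` class is a hypothetical illustration, not WordNet's actual storage format; it assumes a single superordinate per synset and treats the relation as transitive when checking.

```python
class Synset:
    """A set of synonymous word forms with hypernym/hyponym pointers."""

    def __init__(self, *words):
        self.words = set(words)
        self.hypernym = None    # pointer to the superordinate synset
        self.hyponyms = []      # pointers to subordinate synsets

    def add_hyponym(self, other):
        other.hypernym = self
        self.hyponyms.append(other)

    def is_hyponym_of(self, other):
        # Follow hypernym pointers upward (assumes the relation is transitive).
        node = self.hypernym
        while node is not None:
            if node is other:
                return True
            node = node.hypernym
        return False


plant, tree, maple = Synset("plant"), Synset("tree"), Synset("maple")
plant.add_hyponym(tree)
tree.add_hyponym(maple)
```

With these three synsets, `maple.is_hyponym_of(plant)` follows two pointers upward, mirroring the {maple}, {tree}, {plant} chain from the quotation.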
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов.
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
род into Regular expression
вымерший (participle) and широколиственных (adjective) can be omitted
<N> — <V:S>* род (<A>+<!DIC>)* <N>
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
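To see how the pattern `<N> — <V:S>* род (<A>+<!DIC>)* <N>` picks out the two nouns in these sentences, here is a rough emulation. It assumes a tiny hand-written dictionary (`TOY_DIC`) in place of the real Unitex dictionaries, whitespace tokenization, and greedy matching without backtracking.

```python
# Hypothetical mini-dictionary: token -> tag; tokens not listed count as <!DIC>.
TOY_DIC = {
    "Мамонты": "N", "вымерший": "V:S", "род": "N", "млекопитающих": "N",
    "Бук": "N", "широколиственных": "A", "деревьев": "N",
}


def tag(token):
    return TOY_DIC.get(token)


def find_pair(tokens):
    """Return (hyponym, hypernym) for the first match, or None."""
    for i, tok in enumerate(tokens):
        # <N> followed by the dash
        if tag(tok) != "N" or tokens[i + 1:i + 2] != ["—"]:
            continue
        j = i + 2
        while j < len(tokens) and tag(tokens[j]) == "V:S":      # <V:S>*
            j += 1
        if j >= len(tokens) or tokens[j] != "род":              # literal род
            continue
        j += 1
        while j < len(tokens) and tag(tokens[j]) in ("A", None):  # (<A>+<!DIC>)*
            j += 1
        if j < len(tokens) and tag(tokens[j]) == "N":           # final <N>
            return tok, tokens[j]
    return None
```

The optional participle and adjective are skipped by the two `while` loops, so both example sentences yield a noun pair even though only one of them contains вымерший.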
<род>: matches all the entries that have род as canonical form
<стать.V>: matches all entries having стать as canonical form and the grammatical code V
<V>: matches all entries having the grammatical code V
{стану,стать.V} or <стану,стать.V>: matches all the entries having стану as inflected form, стать as canonical form and the grammatical code V
<E>: the empty word or epsilon. Matches the empty string
<TOKEN>: matches any token, except the space; used by default for morphological filters
<MOT>: matches any token that consists of letters
<MIN>: matches any lower-case token
<MAJ>: matches any upper-case token
<PRE>: matches any token that starts with a capital letter
<DIC>: matches any word that is present in the dictionaries of the text
<SDIC>: matches any simple word in the text dictionaries
<CDIC>: matches any compound word in the dictionaries of the text
<TDIC>: matches any tagged token like {XXX,XXX.XXX}
<NB>: matches any contiguous sequence of digits (1234 is matched but not 1 234)
<#>: prohibits the presence of space
<N>, press Enter
— box, connected to the <N> box
род box, connected to the — box
<N> box, connected to the род box
<N> box, click on the final state (a circle with a square inside) to connect these 2 boxes
<V:S> box between the — and род boxes
<A>+<!DIC> box between the род and <N> boxes
Graphs/match-hyponyms.grf: FSGraph > Save
match-hyponyms.grf, Search
<N> box (hyponym) and change it to <N>/{[ to add {[ before the matched noun, when the graph is applied in the MERGE mode
<N>/{[ and click on the — box to disconnect these boxes
<E>/]=HYPONYM} box between the <N>/{[ and — boxes. It will add ]=HYPONYM} after the matched noun
<N> box for adding a HYPERNYM tag to it
tag-hyponyms.grf
tag-hyponyms.grf
concord.ind file in the corpus folder corpus-ru-dbpedia-short-dea-1000_snt
{[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые
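Tagged lines like these are easy to post-process outside Unitex. The sketch below assumes one tagged sentence per line of text and does not attempt to read the actual concord.ind file layout; `extract_pair` is a hypothetical helper name.

```python
import re

# Matches {[word]=HYPONYM} and {[word]=HYPERNYM} tags.
TAGGED = re.compile(r"\{\[(.+?)\]=(HYPONYM|HYPERNYM)\}")


def extract_pair(line):
    """Return (hyponym, hypernym) if both tags occur in the line, else None."""
    found = {label: word for word, label in TAGGED.findall(line)}
    if "HYPONYM" in found and "HYPERNYM" in found:
        return found["HYPONYM"], found["HYPERNYM"]
    return None
```

The extracted words are still inflected forms (for example the genitive plural млекопитающих); reducing them to lemmas needs dictionary information.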
match-hyponyms.grf: FSGraph > Open...
<N> box, right-click on it and select
<N> box and change it to <N>/$hyponym$ to store the matched noun with all morphological information in the $hyponym$ variable
<N> box to store the matched noun in variable $hypernym$
in the morphological mode
<E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the final state
mine-hyponyms.grf
Dela/CISLEXru_igrok.bin
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
mine-hyponyms.grf to remove ambiguous outputs: change the first <N> box to <N~PN:n>
млекопитающее: мамонт
дерево: бук
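Why `<N~PN:n>` leaves a single output can be checked against the four dictionary entries for Бук/бук listed earlier. The sketch assumes the mask means "has code N, lacks code PN, and has at least one inflectional code containing n"; this is a simplified reading, and `matches_mask` is a hypothetical helper.

```python
ENTRIES = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    "Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]


def matches_mask(entry):
    """Simplified check for <N~PN:n> on one dictionary line."""
    _, rest = entry.split(",", 1)
    _, rest = rest.split(".", 1)
    code_part, *inflections = rest.split(":")
    codes = code_part.split("+")
    return ("N" in codes
            and "PN" not in codes                      # ~PN: not a proper noun
            and any("n" in c for c in inflections))    # :n, read here as nominative


def lemma(entry):
    inflected, rest = entry.split(",", 1)
    return rest.split(".", 1)[0] or inflected   # empty lemma means same as inflected


remaining = [lemma(e) for e in ENTRIES if matches_mask(e)]
```

Both proper-noun entries are rejected by `~PN`, and the бука entry has no inflectional code containing `n`, so only the lemma бук remains.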
Artem Lukanin