Text Processing with Finite State Transducers in Unitex

Artem Lukanin

ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS

April 9-11, 2015, Yekaterinburg

This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.

What is Unitex?

What is a corpus?

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.

Sinclair 2005

What is a Finite State Transducer (FST)?

An FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton that recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape.

Jurafsky 2000

Simple sentence splitting FST

		
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде.{S} Достигали высоты ...
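
The splitting shown above can be sketched as a small three-state transducer in Python. This is only an illustration, not the sentence graph shipped with Unitex: the function name split_sentences and the rule it encodes (a period, one space, then an uppercase letter) are assumptions made for this example.

```python
def split_sentences(text: str) -> str:
    """Copy the input to the output, inserting {S} after sentence-final periods."""
    out = []
    state = 0  # 0: default, 1: just read ".", 2: read "." and then one space
    for ch in text:
        if state == 2:
            # Emit the space that was held back, with {S} if a sentence starts here.
            out.append("{S} " if ch.isupper() else " ")
            state = 0
        if state == 1 and ch == " ":
            state = 2  # hold the space until the next symbol is known
            continue
        out.append(ch)
        state = 1 if ch == "." else 0
    if state == 2:
        out.append(" ")  # flush a held space at the end of the input
    return "".join(out)


print(split_sentences("... в четвертичном периоде. Достигали высоты ..."))
```

Holding back the space is what lets the output tape differ from the input tape: the machine decides what to write only after it has seen the next input symbol.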
		
		

Get your corpus from a text file in Unitex

  1. Run Unitex
    • If you are working on Windows, the program will ask you to choose a personal working directory, which you can change later in Info > Preferences... > Directories.

Get your corpus from a text file in Unitex

  1. Open corpus-ru-dbpedia-short-dea-1000.csv from the Corpus subfolder: Text > Open...

Preprocessing

Tokenization

Applying dictionaries

	семью,семья.N+anim(j)+gen(F):aeF
	семью,.ADV
	семью,семь.NUM+plur:t
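
Each line above follows the DELAF layout: inflected form, comma, lemma (left empty when it equals the inflected form), period, grammatical and semantic codes joined by +, then inflectional codes, each introduced by a colon. A minimal Python sketch of reading such lines is given here. The names DelafEntry and parse_delaf are hypothetical, and escaped commas or periods, which real dictionaries may contain, are not handled.

```python
from dataclasses import dataclass


@dataclass
class DelafEntry:
    form: str          # inflected form as it appears in the text
    lemma: str         # canonical form; equals `form` when the field is empty
    codes: list        # grammatical and semantic codes, e.g. ["N", "anim(j)", "gen(F)"]
    inflections: list  # inflectional codes, e.g. ["aeF"]


def parse_delaf(line: str) -> DelafEntry:
    form, rest = line.strip().split(",", 1)
    lemma, info = rest.split(".", 1)
    codes_part, *inflections = info.split(":")
    return DelafEntry(form, lemma or form, codes_part.split("+"), inflections)


entries = [parse_delaf(line) for line in (
    "семью,семья.N+anim(j)+gen(F):aeF",
    "семью,.ADV",
    "семью,семь.NUM+plur:t",
)]
for entry in entries:
    print(entry.form, "->", entry.lemma, entry.codes, entry.inflections)
```

The three analyses of the same token семью illustrate why applying dictionaries produces ambiguity that later patterns have to cope with.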

Hyponyms and hypernyms

Unlike synonymy and antonymy, which are lexical relations between word forms, hyponymy/hypernymy is a semantic relation between word meanings: e.g., {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}. Much attention has been devoted to hyponymy/hypernymy (variously called subordination/superordination, subset/superset, or the ISA relation)...

Hyponyms and hypernyms

A concept represented by the synset {x, x′,...} is said to be a hyponym of the concept represented by the synset {y, y′,...} if native speakers of English accept sentences constructed from such frames as An x is a (kind of) y. The relation can be represented by including in {x, x′,...} a pointer to its superordinate, and including in {y, y′,...} pointers to its hyponyms.

Miller 1993

Hyponym and hypernym mining from Russian texts

Мамонты — вымерший род млекопитающих из семейства 
слоновых, живший в четвертичном периоде.{S} Достигали 
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых 
крупных современных наземных млекопитающих — 
африканских слонов.

Indicators

Мамонты — вымерший род млекопитающих из семейства 
слоновых, живший в четвертичном периоде.{S}
  1. Text > Locate pattern...
  2. Type род into the Regular expression field
  3. Under Search limitation, select Index all utterances in text
  4. Click Search
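
What the search returns can be approximated outside Unitex with a few lines of Python. The sketch below is an assumption-laden stand-in, not Unitex's own algorithm: it works on the raw string rather than on tokens, and the function name concordance and the width parameter are invented for the example.

```python
import re


def concordance(text, word, width=30):
    """Return (left context, match, right context) for every whole-word occurrence."""
    rows = []
    for m in re.finditer(rf"\b{re.escape(word)}\b", text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(), right))
    return rows


text = ("Мамонты — вымерший род млекопитающих из семейства слоновых, "
        "живший в четвертичном периоде.")
for left, hit, right in concordance(text, "род"):
    print(f"{left:>30} [{hit}] {right}")
```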

Concordance

Patterns in Unitex

  1. Text > Locate pattern...
  2. Enter the regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
  3. Click Search
2 matches
				Мамонты — вымерший род млекопитающих из семейства слоновых
				Бук — род широколиственных деревьев семейства Буковые
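
To see how such a pattern is read, it can be translated into an ordinary Python regex over tokens that already carry a tag. In Unitex syntax + is alternation, * is repetition, and <!DIC> matches a word found in no dictionary. Everything else in this sketch is an assumption: the tags are assigned by hand, the inflectional code in <V:S> is ignored, and the names TAGGED, tok, and locate are hypothetical.

```python
import re

# Hand-assigned toy tags (an assumption): Unitex would read them from its dictionaries.
# "U" stands for a token found in no dictionary, which is what <!DIC> matches.
TAGGED = [
    ("Бук", "N"), ("—", "DASH"), ("род", "N"),
    ("широколиственных", "A"), ("деревьев", "N"),
    ("семейства", "N"), ("Буковые", "U"),
]


def tok(tag, form=r"\S+"):
    """Regex fragment for one token encoded as 'TAG/form '."""
    return rf"{tag}/{form} "


# <N> — <V:S>* род (<A>+<!DIC>)* <N>
PATTERN = re.compile(
    r"(?<!\S)"                        # start at a token boundary
    + tok("N")
    + tok("DASH", "—")
    + f"(?:{tok('V')})*"              # inflectional code S is ignored in this sketch
    + tok("N", "род")
    + f"(?:{tok('A')}|{tok('U')})*"
    + tok("N")
)


def locate(tagged):
    """Return the first matched token sequence as a string, or None."""
    encoded = "".join(f"{tag}/{form} " for form, tag in tagged)
    m = PATTERN.search(encoded)
    if m is None:
        return None
    return " ".join(item.split("/", 1)[1] for item in m.group().split())


print(locate(TAGGED))
```

The match stops at деревьев because the pattern ends with a single <N>; the rest of the line in the concordance is right context, not part of the match.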
			

Lexical masks

Lexical masks. Special symbols

Graphs in Unitex

  1. FSGraph > New
  2. Click the initial state (the arrow), then hold Ctrl and click an empty area to create a new box connected to the initial state; type <N> and press Enter

A graph for matching text

  1. Create a box, connected to the <N> box

A graph for matching text

  1. Text > Locate pattern...
  2. Under Locate pattern in the form of, select Graph
  3. Click Set and choose match-hyponyms.grf
  4. Click Search

Transducers in Unitex

  1. Click the first <N> box (the hyponym) and change it to <N>/{[ so that {[ is inserted before the matched noun when the graph is applied in MERGE mode

Transducers in Unitex

  1. Save the graph as tag-hyponyms.grf

Tagging hyponyms and hypernyms

  1. Text > Locate pattern...

Tagging hyponyms and hypernyms

{[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые
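
The effect of MERGE mode, where the transducer's outputs are inserted into the text it matches, can be imitated in Python. This is a simplified sketch under several assumptions: tags are assigned by hand, each box carries both an opening and a closing output (in a Unitex graph the closing output would sit in its own box), and the matcher is greedy without backtracking, which is enough for this one path. BOXES, accepts, and apply_merge are hypothetical names.

```python
# One path through the graph: (accepted tags, required form, repeatable, output before, output after)
BOXES = [
    ({"N"}, None, False, "{[", "]=HYPONYM}"),
    ({"DASH"}, None, False, "", ""),
    ({"V"}, None, True, "", ""),
    ({"N"}, "род", False, "", ""),
    ({"A", "U"}, None, True, "", ""),
    ({"N"}, None, False, "{[", "]=HYPERNYM}"),
]

# Hand-tagged tokens (an assumption; Unitex would consult its dictionaries).
MAMMOTH = [
    ("Мамонты", "N"), ("—", "DASH"), ("вымерший", "V"), ("род", "N"),
    ("млекопитающих", "N"), ("из", "PREP"), ("семейства", "N"), ("слоновых", "A"),
]


def accepts(box, token):
    tags, form = box[0], box[1]
    return token[1] in tags and (form is None or token[0] == form)


def apply_merge(tagged, boxes=BOXES):
    """Insert the box outputs around the first match; return the text unchanged if none."""
    words = [form for form, _ in tagged]
    for start in range(len(tagged)):
        out, i, matched = [], start, True
        for box in boxes:
            _, _, repeatable, before, after = box
            if repeatable:
                while i < len(tagged) and accepts(box, tagged[i]):
                    out.append(words[i])
                    i += 1
            elif i < len(tagged) and accepts(box, tagged[i]):
                out.append(before + words[i] + after)
                i += 1
            else:
                matched = False
                break
        if matched:
            return " ".join(words[:start] + out + words[i:])
    return " ".join(words)


print(apply_merge(MAMMOTH))
```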

Mining hyponyms and hypernyms

  1. Open match-hyponyms.grf: FSGraph > Open...

Mining hyponyms and hypernyms

  1. Modify the second <N> box to store the matched noun in variable $hypernym$ in the morphological mode

Mining hyponyms and hypernyms

  1. Set this graph in Text > Locate pattern...
4 outputs, some of them ambiguous
				млекопитающее: мамонт
				дерево: бук
				дерево: бука
				дерево: Бук

Mining hyponyms and hypernyms

  1. Why are there so many outputs for Бук? Look the word up in the dictionary: DELA > Lookup..., select CISLEXru_igrok.bin, and enter Бук
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM

Mining hyponyms and hypernyms

  1. Let's modify mine-hyponyms.grf to remove ambiguous outputs: change the first <N> box to <N~PN:n>
2 outputs
				млекопитающее: мамонт
				дерево: бук
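
Why <N~PN:n> leaves a single lemma can be checked against the four dictionary entries found for Бук. The Python sketch below rests on assumptions: all four entries are treated as candidate analyses of the same token, ~PN is read as "must not carry the code PN", and :n is read as "some inflectional code contains the character n". The names ENTRIES, parse, and lemmas are hypothetical.

```python
ENTRIES = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    "Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm:aom:tm:qm",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]


def parse(line):
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    return {"lemma": lemma or form, "codes": set(codes.split("+")), "inflections": inflections}


def lemmas(entries, required=(), forbidden=(), inflection=""):
    """Distinct lemmas of the entries that satisfy a lexical-mask-like filter."""
    result = []
    for entry in map(parse, entries):
        if not set(required) <= entry["codes"]:
            continue  # a required code is missing
        if set(forbidden) & entry["codes"]:
            continue  # a forbidden code is present
        if inflection and not any(set(inflection) <= set(code) for code in entry["inflections"]):
            continue  # no inflectional code contains all requested characters
        if entry["lemma"] not in result:
            result.append(entry["lemma"])
    return result


print(lemmas(ENTRIES, required=["N"]))                                    # like <N>
print(lemmas(ENTRIES, required=["N"], forbidden=["PN"], inflection="n"))  # like <N~PN:n>
```

Under these assumptions the first call yields three lemmas, one per ambiguous output, and the second keeps only бук: the two PN entries are excluded and the бука entry has no inflectional code containing n.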
			

References

  1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
  2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4), 235-244.
  3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est Marne-la-Vallée. January 15, 2015.
    http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
  4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books: 1-16. Available online from
    http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].

Slides: artyom.ice-lc.com/slides/unitextutorial
