MARC: hrWaC and slWac: compiling web corpora for Croatian and Slovene

Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the lim...

Permalink:	http://skupnikatalog.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:312924/Details
Host publication:	Text, Speech and Dialogue : 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011. : Proceedings Lecture Notes in Computer Science
Main authors:	Ljubešić, Nikola, computer scientist (-), Erjavec, Tomaž (Author)
Material type:	Article
Language:	eng
Online access:	http://link.springer.com/book/10.1007/978-3-642-23538-2


LEADER	01967naa a2200253uu 4500
008	131111s2011 xx eng|d
020			|a 978-3-642-23537-5
035			|a (CROSBI)552901
040			|a HR-ZaFF |b hrv |c HR-ZaFF |e ppiak
100	1		|9 445 |a Ljubešić, Nikola, |c computer scientist
245	1	0	|a hrWaC and slWac: compiling web corpora for Croatian and Slovene / |c Ljubešić, Nikola ; Erjavec, Tomaž.
246	3		|i Title in English: |a hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene
300			|a 395-402 |f pp.
520			|a Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the limited amount of available web data. The modifications are described in the paper, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall. The paper also investigates text-types of the acquired corpora using topic modeling, comparing the two corpora among themselves and with ukWaC.
536			|a MZOS project |f 130-1301679-1380
546			|a ENG
690			|a 5.04
693			|a web corpus, Croatian, Slovene, topic modeling |l hrv |2 crosbi
693			|a web corpus, Croatian, Slovene, topic modeling |l eng |2 crosbi
700	1		|a Erjavec, Tomaž |4 aut
773	0		|t Text, Speech and Dialogue : 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011. : Proceedings |d Berlin / Heidelberg : Springer, 2011 |k Lecture Notes in Computer Science |n Ivan Habernal and Vaclav Matousek |z 978-3-642-23537-5 |g pp. 395-402 |a International Conference, TSD 2011(14 ; 2011; Pilsen, Czech Republic)
856			|u http://link.springer.com/book/10.1007/978-3-642-23538-2
942			|c RZB |t 1.08 |u 2 |z Znanstveni |v MeđRecenzija
999			|c 312924 |d 312922
