This toolkit is a set of Perl scripts for CWB corpus generation and processing.

makefile - try "make t1" or "make t2"

cqp.cgi - CGI for querying all the corpora (those with read permission for all).
    This script also guesses whether a corpus is a parallel one.

txt2cqp - makes a CWB corpus from a list of txt, HTML, or XML files
    Ex: txt2cqp -lema -html corpusname ~/public_html/musica/html/*.html
    and see http://natura.di.uminho.pt/jjbin/cqp5 (corpus mmm)

addradical - (used in txt2cqp) uses jspell to add lemma and POS to a CQP corpus

html2pml - transforms HTML into a PML file (= html with almost just tags)

html2pml -listofpair - takes a file whose lines are pairs of filenames and
    produces a pair of files, each holding the concatenation of the
    corresponding PML. Useful for aligning HTML files.

my.tmx - a translation memory in TMX format, used by test t1

quebraxmlsent - (used in txt2cqp) sentence splitting in texts or in a
    specific XML element

tmx2cqp - builds a CQP parallel corpus from a TMX

tmxsplit - splits a TMX into XML files (one for each language)

xmlalign2cqp - builds the CWB corpora and aligns them
    (tags: f for synchronization, p for alignment)

align2tmx - see cqpalign2tmx

cqpalign2tmx - makes a TMX (Translation Memory eXchange format) from an
    align file. The align files are created with EasyAlign (CQP).

filealigner - aligns a pair of files
    uses html2pml          (to convert to PML)
         xmlalign2cqp ...  (to align with EasyAlign)
         align2tmx         (to build a TMX)

pdfaligner

htmlaligner

mkbitextra - see mkterminum

mkterminum
       dir                directory
    -> paths    .paths    list of files
    -> blocks   .blocks   list of blocks
    -> _pairs   ._pairs   list of bitext candidate pairs
    -> pairs    .pairs    list of bitexts
    -> tmx      .tmxdir   directory with the TMXs

Files used in tests:
    makefile
    listofpairs.ex
    listofurls.ex
    my.tmx

Installation needs:
    cwb
    jspell and the jspell.pt dictionary (or jspell.en, but that one is poor)
    Lingua::PT::PLN
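Illustration: what a per-language TMX split amounts to

The tmxsplit entry describes splitting a TMX into one XML file per language. The sketch below shows the idea in Python using only the standard library. It is not the toolkit's Perl implementation. The function names, the sample TMX, and the output element names (text, s) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data, not the my.tmx shipped with the toolkit.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pt"><seg>Bom dia.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="pt"><seg>Obrigado.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def split_tmx_by_language(tmx_text):
    """Return {language: [segment text, ...]} in translation-unit order."""
    root = ET.fromstring(tmx_text)
    segments = defaultdict(list)
    for tu in root.iter("tu"):
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # itertext() also collects text nested inside inline markup.
            segments[lang].append("".join(seg.itertext()))
    return dict(segments)


def to_language_xml(lang, segs):
    """Serialize one language's segments as a small XML document.

    The <text>/<s> element names are assumptions for this sketch.
    """
    root = ET.Element("text", {"lang": lang})
    for seg_text in segs:
        ET.SubElement(root, "s").text = seg_text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for lang, segs in split_tmx_by_language(SAMPLE_TMX).items():
        print(lang, to_language_xml(lang, segs))
```

Segment n in one language lines up with segment n in the other only if every translation unit contains every language. A translation unit that lacks one language breaks the line-by-line correspondence, so a real splitter would have to skip such units or emit an empty segment for the missing side.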