%% -*- latex -*-
\documentclass[10pt]{beamer}

\usepackage{url}

\usepackage{hyperref}
\usepackage[portuges]{babel}
\usepackage[T1]{fontenc}
%\usepackage[latin1]{inputenc}
\usepackage[mathletters]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage{aeguill}
\usepackage{cam2tex}
\usepackage{fancyvrb}
\usepackage{graphicx}

\def\red#1{\color{red}#1\color{black}}
\def\green#1{\color{green}#1\color{black}}
\def\blue#1{\color{blue}#1\color{black}}

\usetheme{Nat}
\usecolortheme{crane}
\setbeamertemplate{navigation symbols}{} %no nav symbols
\setbeamercovered{transparent}
\beamertemplateheadempty
\usepackage{beamercolorthemeorange}

\def\alg#1{\mbox{}\\[0.4em]
 \begin{raggedright} \(#1\) \end{raggedright}\\[0.4em]
}
\def\lcom#1{\Acom{\hfill ...#1}}
\def\bex#1{\fbox{#1}}
\def\bexx#1{\begin{block}{}#1\end{block}}

\begin{document}

\title[gwb ]{Per-Fide}
\author[jj]{J.J. Almeida \and {\small Alberto Simões}}
\date{\today}
\frame{\titlepage}


=.= {\scriptsize some} Tools tied to the Per-Fide life cycle

\_makefileg[rankdir=0,scale=0.3]{

deliverables: stardict PTD TMX bitextos 

bitextos : WWW  
	getwebbitext

TMX : bitextos
	mkterminum

PTD : TMX
	nattools

stardict: PTD TMX
	geraStarDict
}


=. Menu

\tableofcontents

=.= Get Web Bitexts


Bitext = a text and its translation

\:
.1 Searching for bitexts on the web:
   . in a controlled way
   #
.1 Building corpora {\scriptsize thematic, useful, ...}:
   . aligned (etc.)
   #
.2 Generating {whatever is possible} 
#

And, while we are at it, some {\scriptsize (underground)} goals
:
. Modularity / functional programming
. workflow methods
. types + API + scripting oriented
. quick feedback
. try to build something I can use!
#

=.== Motivation

\mbox{}\\[20em] \includegraphics[width=.1\textwidth]{demo2.jpg}

=.== Corpora from the Web

Several WACi\:
. STRAND (Resnik):
   . starting from pages containing \fbox{\emph{Português} - \emph{English}}
   . build a database of biurls (bitext candidates)
   #
. BootCaT (Baroni):
   . starting from a set of seed words (domain specific)...\\
   {\normalsize
     \(
      \Ab{
         \Afor{\Aat{a}{\Acom{some subgroups of seeds}}}
           { text_c ← text_c ∪ websearch(a) } \\
         \Aat{texts}{\Acom{Reject bad texts}(text_c)}\\
         \Aat{cor}{\Acom{Build a corpus with \emph{texts}}}\\
         \Acom{Do miracles with \emph{cor}}
      }
     \)
    }
   . WebBootCaT - ... plus a Web interface
   #
. Parguess: (URL* ) → (biurl candidates)*
. MkTerminum: pipelines of tools
#

=.== Problems

Quality / noise
\:
. content:
 . Richness (terminological)
 . Reliability of the information
 . literal/creative translation
 #
. Structure:
 . Format
 . Quality of the correspondences:
    . missing or partial translations?
    #
 #
. machine translations
# 

\begin{block}{Claim}
 We need control over the sources!
\end{block}


=. GetWebBitext

:
. Control and choose the sources
. Pipeline for extracting and building:
   . translation memories (=parallel corpora)
   . terminology
   . ...
   #
. Modular
. Reusing:
   . Open-Corpus-Workbench, 
   . Easy-Align, 
   . NATools, 
   . \textit{Yahoo!} API {\scriptsize or Google::Search Ajax}
   . StarDict
   . ...
   #
#

=.== Algorithm and Pipelines

\includegraphics[width=\columnwidth]{flow}

=. Algorithm and Pipelines

\:
.1
 get the URLs of the documents that contain all the keywords $K$ and
  are cataloged by the search engine as being in language $L_1$:
  $$Docs_{L_1} = yahoo(K, site:D, lang: L_1)$$

.1 for each URL in $Docs_{L_1}$ try to guess the corresponding URL
  with the document in language $L_2$:
  $$Docs_{L_2} = urlguess(Docs_{L_1}, L_2)$$

.1 retrieve all documents pointed to by the obtained URLs (if they
  exist), and convert them to a textual format (PML):
  $$Bitexts = retrieve(Docs_{L_1},Docs_{L_2})$$

.1 build a parallel corpus $PC$ by aligning the retrieved documents
  at the sentence level. Note that this is done for each document pair.
  $$PC = align(Bitexts)$$
#

=. Algorithm (continuation)

\:
.1 filter the parallel corpus, discarding translation units or
  documents with low alignment quality:
  $$PC = filter(PC)$$

.1 extract probabilistic translation dictionaries from the aligned
  corpus:
  $$PTD = extractPTD(PC)$$

.1 extract bilingual terminology using the probabilistic
  dictionaries and a set of alignment patterns:
  $$Terms = terms(PC, PTD, Patterns)$$

.1 create a StarDict dictionary for off-line usage based on the
  bilingual terminology and dictionaries extracted:
  $$StarDict = mkSD(PTD, Terms)$$
#
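
The eight steps above form a linear pipeline that can be stopped early, as the \texttt{-until} option suggests. A minimal Python sketch of that idea follows; the stage names come from the dependency graph on the first slide, while \texttt{run\_pipeline} and the shape of the stage functions are assumptions made for illustration, not the actual GetWebBitext code.

```python
# Stage order follows the dependency graph: bitexts -> TMX -> PTD -> StarDict.
STAGES = ["bitexts", "tmx", "ptd", "stardict"]


def run_pipeline(stage_funcs, keywords, until="stardict"):
    """Run each stage in order and stop after the stage named `until`.

    `stage_funcs` maps a stage name to a callable that receives the dict
    of results so far (hypothetical interface, for illustration only).
    """
    if until not in STAGES:
        raise ValueError(f"unknown stage: {until}")
    results = {"keywords": keywords}
    for name in STAGES:
        results[name] = stage_funcs[name](results)
        if name == until:
            break  # sub-pipeline: later stages are not run
    return results
```

Calling it with \texttt{until="tmx"} runs only the first two stages; omitting \texttt{until} runs the pipeline to the end.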


=. URLs transformation (url-guesser)

\begin{Verbatim}
 http://dom.ini.o/en/file.html

                       ↓ split

 http://dom.ini.o/ +  en/file         + .html

                       ↓ url_guesser 

 ...               +  pt/file         + ...
                      por/file  
                      portuguese/file 

                       ↓ join

 http://dom.ini.o/pt/file.html
\end{Verbatim}
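
A minimal Python sketch of the split / guess / join steps above. The marker table is an assumption: the slide only lists pt, por and portuguese, and the English markers are made up for illustration. The function \texttt{url\_guess} is a hypothetical stand-in, not the real url-guesser.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed table of path markers per language (only the "pt" row is from the slide).
LANG_MARKERS = {
    "en": ["en", "eng", "english"],
    "pt": ["pt", "por", "portuguese"],
}


def url_guess(url, src, dst):
    """Return candidate URLs in language `dst` for a URL in language `src`."""
    parts = urlsplit(url)                 # split
    segments = parts.path.split("/")
    candidates = []
    for i, seg in enumerate(segments):
        if seg in LANG_MARKERS[src]:
            # guess: swap the language marker for each known alternative
            for marker in LANG_MARKERS[dst]:
                new_path = "/".join(segments[:i] + [marker] + segments[i + 1:])
                candidates.append(urlunsplit(parts._replace(path=new_path)))  # join
    return candidates
```

Each candidate still has to be fetched to find out whether it exists, which is what the retrieve step does.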

== Pipeline elements

=. Bitexts --> TMX

\alg{
\Afun{mkterminum(pairs:bitext*):TMX}{
   \Afor{(t1,t2) \in pairs}{
      \Acom{Verify language and compatibility(t1,t2)}\\
      \Aif{\Apar{\Acom{not suitable}}{remove(t1,t2)}}     
%                             \qquad \qquad \qquad \lcom{correspondence}
   } \\
   \Afor{(t1,t2) \in pairs}{
      tmx_i = align(t1,t2)\\
      \Aif{\Apar{\Acom{found many non 1:1 corresp}}{remove(tmx_i)}}  
%                            \lcom{align}
   } \\
   \Aret \sum_i tmx_i
 }
}
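
The pseudo-code above could be sketched in Python as follows. The \texttt{compatible} and \texttt{align} callables and the 0.3 threshold are assumptions chosen for illustration; the slide does not say how "many" non 1:1 correspondences are measured.

```python
def mkterminum(pairs, compatible, align, max_non_1to1=0.3):
    """Filter bitext pairs, align them, and drop noisy alignments.

    `compatible(t1, t2)` and `align(t1, t2)` are caller-supplied
    (hypothetical); `align` returns a list of translation units, each a
    (source_segments, target_segments) pair of lists.
    """
    tmx = []
    for t1, t2 in pairs:
        if not compatible(t1, t2):
            continue  # not suitable: remove the pair
        units = align(t1, t2)
        if not units:
            continue
        non_1to1 = sum(1 for s, d in units if len(s) != 1 or len(d) != 1)
        if non_1to1 / len(units) > max_non_1to1:
            continue  # too many non 1:1 correspondences: remove this TMX
        tmx.extend(units)  # the final TMX is the union of the kept ones
    return tmx
```

The two loops of the pseudo-code are merged into one pass here, which gives the same result because each pair is handled independently.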

=.== Examples

\:
.1 Save some micro TMX (bilingual Call for papers)

 getwebbitext -s "ciawi-conf.org" -l pt:es -until tmx trabalhos

.2 Build a small corpus of alcoholic drinks

 getwebbitext -l pt:en vodka cerveja

% mkterminum
% natools -tmx -id ...
% geraStarDict _

#

=.== Example 1: CFP

Extraction of some topics (from a CFP)

 getwebbitext 
   -s "ciawi-conf.org"              → the source
   -l pt:es                         → languages
   -until tmx                       → sub pipeline 
   trabalhos                        → keywords

metrics:

 800 translation units  (TMX format)
 50  s
 10k words

=. xpdf-tmx

\mbox{}\\[20em] \includegraphics[width=.1\textwidth]{demo2.jpg}
 
=.== Example 2: Drinks Corpus

 getwebbitext -l pt:en vodka cerveja
      → default source = european eur-lex
      → default until  = ... the end of pipeline

\:
.  3m21s to execute this task,
.  37~MB of bitext candidates (12 bitexts in HTML format, and 22 in PDF)
.  Excluding the PDF documents the final TMX:
    .  32~941 TU (about 9MB).  
    #
. Including the PDF documents (several of them were rejected
  due to format or alignment problems) the final TMX:
   . had 81~844 translation units 
   . (about 22MB of text, about 1~300~000 tokens)
   #
#

=.

$$cervejas \hbox{ (29) }
\left\{ \begin{matrix}
    beer  & \rightarrow & 98\% \cr
    actual & \rightarrow & 2\%
  \end{matrix}\right. $$
$$cerveja \hbox{ (53) }
\left\{ \begin{matrix}
    beer  & \rightarrow & 62\% \cr
    brewing & \rightarrow & 24\% \cr
    distilling & \rightarrow & 3\% \cr
    coloured & \rightarrow & 1\%
  \end{matrix}\right. $$
$$vodka \hbox{ (139) }
\left\{ \begin{matrix}
    vodka  & \rightarrow & 94\% \cr
    flavoured & \rightarrow & 2\% \cr
    vodkas & \rightarrow & 1\%
  \end{matrix}\right. $$
$$licor \hbox{ (73) }
\left\{ \begin{matrix}
    liqueur  & \rightarrow & 95\% \cr
    licor & \rightarrow & 2\% \cr
    liqueurs & \rightarrow & 1\%
  \end{matrix}\right. $$

=.

$$rum \hbox{ (99) }
\left\{ \begin{matrix}
    rum  & \rightarrow & 96\% \cr
    produced & \rightarrow & 1\% \cr
    word & \rightarrow & 1\% \cr
    solbaerrom & \rightarrow & 1\%
  \end{matrix}\right. $$
$$vinho \hbox{ (271) }
\left\{ \begin{matrix}
    wine  & \rightarrow & 81\% \cr
    vinho & \rightarrow & 7\% \cr
    aromatised & \rightarrow & 2\% \cr
    wines & \rightarrow & 2\% \cr
    wine-based & \rightarrow & 1\%
  \end{matrix}\right. $$
$$vinagres \hbox{ (38) }
\left\{ \begin{matrix}
    vinegar & \rightarrow & 96\%
  \end{matrix}\right. $$
$$malte \hbox{ (208) }
\left\{ \begin{matrix}
    malt & \rightarrow & 95\%
  \end{matrix}\right. $$
$$aguardente \hbox{ (226) }
\left\{ \begin{matrix}
    spirit  & \rightarrow & 70\% \cr
    aguardente & \rightarrow & 14\% \cr
    spirits & \rightarrow & 13\% \cr
    diluted & \rightarrow & 1\% \cr
    distilled & \rightarrow & 1\%
  \end{matrix}\right. $$

=. Example 2 conclusion:

The full process took nearly 30 minutes...

In the end we obtained:
. an 81K translation unit TMX file (22MB);
. a pair of probabilistic translation dictionaries PTD;
. a StarDict dictionary
#

=. Stardict + PTD

\mbox{}\\[20em] \includegraphics[width=.1\textwidth]{demo2.jpg}
 
=.= Some paternalistic suggestions


\:
 .1 whenever possible, choose sources that you know;
 .1 if you build (small) TMX files with good-quality translations,
you can consult and query them now and merge them in the future;
 .1 try to select seed terms that exist in only one of the
languages;
 .1 if you suspect that the sources contain appendices, choose
a set of extreme seed terms (e.g. "ácido sulfúrico"
and "nitrato de prata") to find those terminologically rich documents.
#

=. The End

\includegraphics[width=1.05\textwidth]{gil.jpg}

=.= Post Scriptum TMX

\_makefileg[rankdir=0,scale=0.3]{

t: Tmx-view Tmx-grep Tmx-API Tmx-CGI Tmx-tools

Tmx-view: prince xpdf css TMX
	xpdf-tmx

Tmx-grep: po-grep

Tmx-API: XML-TMX

Tmx-tools: tmxclean tmxuniq tmx2tmx tmx2html

}


=.= Post Scriptum PTD

\_makefileg[rankdir=0,scale=0.3]{

t: dic-equiv ptd-browser

dic-equiv: PTD
	jj-trans-equi

ptd-browser: PTD

}

=. PS NATools 


  nat-create -id=name -lang=pt..en -tmx file.tmx

→ fork Natools 33

=.

\end{document}


=.= Dictionaries, rules and engine

Separation of the engine from the dictionaries:
 . \emph{jSpell engine}:
   . C engine (iSpell++);
   . interface modules in Perl;
   #
 . \emph{several dictionaries} (+ affixes):
   . Portuguese (reasonable)
   . English    (average)
   . Spanish    (just started)
   . Latin
   . ...
   #
#

=.== Dictionaries

$$
\begin{matrix}
  \hfill Dic    & \equiv & Rules \times WordInfo^\star \hfill \cr
  \hfill Rules  & \equiv & RuleId \mapsto Rule^\star \hfill \cr
    WordInfo    & \equiv & Word \times Class \times RuleId^\star
\end{matrix}
$$

\emph{Dictionary}
\begin{Verbatim}[frame=single]
 #vt = /CAT=v,T=inf,TR=t/

 avaliar/#vt/DLMPRXYcu/
\end{Verbatim}

\emph{Rules}
\begin{Verbatim}[frame=single,fontsize=\small]
flag *D:             ; "CAT=v,T=inf"               # lavrar =>
[AEI] R  > -R,DOR    ; "CAT=a_nc,G=m,N=s,FSEM=dor" #  ->lavrador
[AEI] R  > -R,DORA   ; "CAT=a_nc,G=f,N=s,FSEM=dor" #  ->lavradora
[AEI] R  > -R,DORES  ; "CAT=a_nc,G=m,N=p,FSEM=dor" #  ->lavradores
[AEI] R  > -R,DORAS  ; "CAT=a_nc,G=f,N=p,FSEM=dor" #  ->lavradoras
\end{Verbatim}

=.= Modes of operation

\:
. \emph{spell checker} \\ inherited from iSpell
. \emph{interpreter}\\ allows direct interaction with the
  user via the command line to look up words (with or
  without \emph{near-misses}) and to query their morphological
  properties;
. \textbf{\emph{Morphological programming library (Perl)}}\\
  allows morphological analysis to be used as one of the first
  building blocks of NLP applications.
#


=.== Coverage issues

Concern with the active zone of the language...

\begin{Verbatim}
 aliazar::
 alibânia::
 álibi::
 alibilidade::
 alíbil::
 álica::
 alicaído::
 alicanso::
 alicante::
 alicantinador::
 alicantina::
 alicantineiro::
 alicário::
 alicatão::
 alicate::
 alicece::
\end{Verbatim}

=. Coverage issues

We do not want every word...

\begin{Verbatim}
 aliazar::grupo de lezírias circundadas de água
 alibânia::tecido de algodão das Índias Orientais
 álibi::justificação do réu, que consiste em p...
 alibilidade::qualidade do que é alíbil
 alíbil::próprio para a nutrição
 álica::espécie de trigo ou de cevada de que os...
 alicaído::de asa caída
 alicanso::licranço
 alicante::casta de uva algarvia e andaluza
 alicantinador::alicantineiro
 alicantina::trapaça no jogo ou nos negócios
 alicantineiro::que ou aquele que faz ou vive de alicanti...
 alicário::o que fabrica ou vende álica
 alicatão::grande tenaz para segurar a peça que se pret...
 alicate::ferramenta formada por duas barras ou p...
 alicece::alicerce
\end{Verbatim}


%=.
%
%\:
%. abreviaturas = id _(-->)_ classificação
%. dic = pal _(-->)_ classificação * esquemas
%. esquemas = idregra*
%. afixos = idregra _(-->)_ regra
%#
%

=. Via the command line

Invoking jSpell with:
\begin{Verbatim}[frame=single]
      jspell -d port -a -J -y  
\end{Verbatim}

Allows \emph{interactive analysis}:

\begin{Verbatim}[frame=single,fontsize=\footnotesize]
International Jspell Version 1.62

avaliação
* avaliação 0 :lex(avaliar,[CAT=nc,T=inf,TR=t,G=f,N=s,FSEM=cao])

avaliei
* avaliei 0 :lex(avaliar,[CAT=v,T=pp,TR=t,P=1,N=s])

avaliódromo
& avaliódromo 0 :avaliódromo=lex(avaliar,[CAT=nc,T=inf,TR=t,FSEM=odromo])
\end{Verbatim}

Analyses are marked with * or with \&:
 . word in the dictionary (or a derivation predicted by the dictionary);
 . unrecognized word;
#

=.== Jspell Perl API

\begin{Verbatim}[frame= single,fontsize=\small]
use Lingua::Jspell;

# Initialise a dictionary
my $dict = Lingua::Jspell->new("port");

# get the list of stems of the word "pode"
@l1 = $dict->rad("pode");
# @l1 = ('poder', 'podar')

# get the list of analyses of the word "porto"
@l2 = $dict->fea("porto");
# @l2 = ( {rad=>'porto', CAT=>'nc', G=>'m',  N=>'s'},
#         {rad=>'Porto', CAT=>'np', LA=>1, SEM=>'cid', G=>'m', N=>'s'},
#         {rad=>'portar',CAT=>'v',  T=>'p',  TR=>'t',  P=>1, N=>'s'})

# get the list of words derived from "gato"
@l3 = $dict->der("gato");     

# get the list of lemmas and flags for "gato"
@l4 = $dict->flags("gato");         
\end{Verbatim}

=. Jspell Perl API (continued)

\begin{Verbatim}[frame= single,fontsize=\small]
 fea(word, restriction): FS*
      @l2 = $dict->fea("poder",{CAT=>"v"});

 onethat(FS , FS*) : FS
      %f  = onethat({...restriction} , @l)

 nlgrep(options, pattern, file*): line*
      @lin=$dict->nlgrep({max=>100, sep =>"\n"}, patt, files);

 setmode  -- defines the behaviour when facing unknown words
      setmode({nm=>"full",flags=>1})

 featags         
      @l=$pt->featags(lindas)
             JFS, ...

 featagsrad
      @l=$pt->featagsrad(lindas)
             JFS:lindo, ...

 mkradtxt

 isguess
\end{Verbatim}

=.= Example: tagging simple compound tenses

\begin{Verbatim}[frame=single,label=\emph{Original}]
      O João tem comido muito e a Joana tem comida.
\end{Verbatim}
\vskip 2mm
\begin{Verbatim}[frame=single,label=\emph{Annotated text}]
      O João tem_comido muito e a Joana tem comida.
\end{Verbatim}

That is: look for ter+v-pp-ms

\begin{Verbatim}[frame= single,fontsize=\small]
use Lingua::Jspell;
$pt=Lingua::Jspell->new("port");

@ter   = (rad=>"ter");
@vppms = (CAT=>"v",T=>"ppa",G=>"m",N=>"s");

while (<>) {
  s{(\w+) (?=(\w+))}{ 
    if ($pt->fea($1,{@ter}) and $pt->fea($2,{@vppms}))
       { "$1_"}     else  { "$1 "}
   }eg;
  print;
}
\end{Verbatim}

=.= WebJspell: web interface

(Rui Vilela)
\:
 . Creation of a web interface on top of jspell
 . uses the Perl library (Lingua::Jspell)
 . also links to other public external resources:
     . (dicionário aberto)
     #
 . links to other jspell-based tools:
     . htspell: spell checker for HTML pages
     #
 #

=.== Htspell

\:
 . Spell checking of HTML pages 
 . CGI
 . Works as an intrusive proxy that annotates 
     unknown words
#

\[
\Afun{htspell\_cgi(url)}{
 \Aat{pag}{get\_url(url)}\\
 \Afor{word: p \in pcdata(pag)}{
   \Aif{ \Apar{p \notin dictionary_{PT,EN}}{replace(p,red(p))} } }\\
 \Afor{link: l \in pag}{ 
   replace(l, htspell\_cgi(l))} \\
 \Aret{pag}
}
\]


=.= Lingua::Jspell::DicManager

\:
 . processing dictionaries:
     . create new ones
     . insert words
     . install dictionaries
     . modify (attributes, flags, comments, ...)
     #
 . higher-order functions to:
     . visit every word
     . modify every word
     #
 #

=.= Conclusions

Worth highlighting:
 . inflection but also derivation
 . ability to program morphological processors
 . ability to program over jspell dictionaries
 . links to external elements (scripting + morphological analysis)
 . open micro-structure
 . (zigzag grammar)
#

=. The End

=. PS. Some proposed exercises

\begin{block}{A - Using existing dictionaries}
\:
.1 Given a file with one word per line, compute two more 
columns with \emph{pos} and \emph{lemma}, using the Jspell library
.2 Grep by lemma (mkradtxt)
.3 Add a set of multiword terms to a dictionary
#
\end{block}

\begin{block}{B - Building new dictionaries}
\:
.4 Build a new Portuguese Jspell dictionary; add syllable 
    division and the number of syllables to it
.5 Add a frequency attribute to the Portuguese dictionary
.6 A dictionary validator based on several constraints
#
\end{block}

\begin{block}{D - Others}
\:
.7  mark unknown words in bilingual dictionaries
#
\end{block}

%=. Abordagens clássicas