\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{sepln}
\usepackage{epsf}
\usepackage[english]{babel} 


\title{Text to speech --- a rewriting system approach}
\begin{document}
\maketitle
\parskip .1cm


\begin{abstract}
 In this document we present an open source Portuguese text-to-speech
system. Our first goal is to make it easy to extend: generic rules
convert Portuguese words into SAMPA phonemes, and dictionaries are
consulted only for exceptions.

 The text-to-speech system is composed of five layers, each based on
simple rules so that it can be easily tuned. To that end, we
wrote a generic text rewriting system, which is presented in
section two.

 The result of this work is a tool that can be used as an independent
Text-to-Speech system or as a Natural Language Processing library for
various tasks. We present some examples of how it can be used in the
\emph{Application Examples} section.
\end{abstract}

% \tableofcontents

\section{Introduction}

Text-to-Speech (TTS) is, as we know, a difficult area. Romance languages,
like Portuguese, are very hard to transform into sound because of the
large number of exceptions.

We set out to build a Perl~\cite{PPerl,PCook} module, as generic as
possible, that converts Portuguese texts to sound, using rules to transform
words into SAMPA~\cite{SAMPA} phonemes and dictionaries for exceptional cases.

Our approach is based on rewriting rules. We take a text, divide it into
sentences and, based on the punctuation, classify each sentence as
exclamatory, interrogative or other. This classification is used
later by the prosodic transformer to make sentences easier to
understand.

Each sentence is divided into words, and each word is transformed to
SAMPA. This process is based on dictionary lookup and on rewriting
rules. Later, the words are joined and processed again with rules
that produce better word junctions.

The resulting SAMPA sentence is passed to a prosodic transformer that makes
the phrase sound more human (turning a constant-frequency sound into a
melodic one). This is done with rewriting rules, too.

Scheme~\ref{esquema} summarises this cycle.

All these rewriting systems and functions can be used as a complete
program, or can be called independently. In the latter case, we are
talking about a Perl module or library.

This Perl module has many functions that can be helpful without the
full TTS system, such as the text to SAMPA converter, the word to SAMPA
converter (which differs from the previous one in the
data type), and the text to MBROLA~\cite{MBROLA} file format converter.

Others, less closely tied to the TTS system, can be helpful for
other purposes, like the number to text conversion, or the converter of
e-mail addresses and Internet URLs to text.

Viewed as an application, the module gives us a program that reads
text from the standard input, or creates a wave file to play later.

This system was not built alone, but in conjunction with \texttt{pt::pln},
a Perl module for Portuguese natural language processing that implements
some basic functions such as splitting sentences into words, dividing words
into syllables, finding the tonic syllable and so on.


\twocolumn[
%\begin{figure}
\epsfxsize=\textwidth
\epsfbox{tts.dia.eps}   
%\caption{pt::speaker structure}
\label{esquema}
%\end{figure}
]


\section{Rewriting system}

For this and other purposes, we built a rewriting system: a system
that, given a set of rules, scans a text and
rewrites all matching patterns.

Each set of rules, once compiled, yields a function
that accepts the text to be rewritten and returns the rewritten text. These
functions can be easily composed, giving composed rewriting systems.

There are various kinds of rules: simple substitution rules,
rules whose replacement string is evaluated, rules that are evaluated
only once when the system starts, rules guarded by a
context condition, and rules that make the system quit.

Unless one of the rules makes the system exit, the system keeps
processing the text until no rule pattern matches.
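
The behaviour described above can be sketched in plain Perl. The helper
below is only an illustration and is not the code our compiler generates:
the name \texttt{make\_rewriter} and the representation of a rule as a
pattern paired with a replacement subroutine are assumptions made for this
sketch. It tries the rules in order, restarts after every successful
substitution, and stops when no pattern matches.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Hypothetical helper (illustration only):
# each rule is [ pattern, replacement sub ].
sub make_rewriter {
  my @rules = @_;
  return sub {
    my ($text) = @_;
    my $changed = 1;
    while ($changed) {
      $changed = 0;
      for my $rule (@rules) {
        my ($re, $rhs) = @$rule;
        # the sub receives the first capture
        if ($text =~ s/$re/$rhs->($1)/e) {
          $changed = 1;
          last;  # restart from the first rule
        }
      }
    }
    return $text;
  };
}

my $abbr = make_rewriter(
  [ qr/\bTTS\b/ => sub { 'text-to-speech' } ],
  [ qr/\bNLP\b/ => sub { 'language processing' } ],
);
my $trim = make_rewriter(
  [ qr/\s\s+/ => sub { ' ' } ],
);
# composition of two rewriting functions
my $both = sub { $trim->($abbr->($_[0])) };
print $both->("TTS  and   NLP"), "\n";
\end{verbatim}
}

As with any rewriting to a fixed point, the loop ends only if the
replacements eventually stop matching the patterns.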

In this particular system, we define the rules in a file and
compile it using the \texttt{mktextrr} command, which transforms the rules
into a Perl script that does the real job.

The source file is a Perl file that accepts rule definitions between the
\texttt{RULES} and \texttt{ENDRULES} strings; each set of rules defines
a function.

Rules have the following syntax:

{\footnotesize
\begin{verbatim}
left hand side ==> right hand side
left hand side =e=> right hand side
left hand side ==> right hand side !! condition
\end{verbatim}
}

Because of the column width of the article, the examples below use a
more compact, \LaTeX{}-style notation with a vertical arrow.

Let's see a first example: a rewriting system that expands an e-mail
address into an HTML link to that address:

$$\hbox{\verb!RULES email_expand!}$$
$$\hbox{\verb!(\w+(\.\w+)+\@\w+(\.\w+)+)!}$$
$$\Downarrow$$
$$\hbox{\verb!<a href="mailto:$1">$1</a>!}$$


Note that the regular expression is plain Perl syntax.

Saving this file under the name \texttt{email\_expand} and processing
it with $$\hbox{\texttt{mktextrr email\_expand email\_expand.pl},}$$ 
we get a function named \texttt{email\_expand} that accepts a text and
performs the required transformations.

Note that we didn't use the \texttt{ENDRULES} command. This
is because nothing follows the rule set.
Now suppose we
edit the file and add another function, this one expanding
\texttt{http} URLs:

$$\hbox{\verb!RULES email_expand!}$$
$$\hbox{\verb!(\w+(\.\w+)+\@\w+(\.\w+)+)!}$$
$$\Downarrow$$
$$\hbox{\verb!<a href="mailto:$1">$1</a>!}$$
\vskip .5cm
$$\hbox{\verb!RULES http_expand!}$$
$$\hbox{\verb!(http://\w+(\.\w+)+)!}$$
$$\Downarrow$$
$$\hbox{\verb!<a href="$1">$1</a>!}$$

Once again, we didn't use \texttt{ENDRULES}, because the first rule set
is followed directly by another rule set. Compiling the
rule sets and calling \texttt{http\_expand(email\_expand(\$text))}, we
can replace all e-mail addresses and URLs.

Because \texttt{RULES} defines a set of rules, we can replace the example
with the following code:

$$\hbox{\verb!RULES expand!}$$
$$\hbox{\verb!(\w+(\.\w+)+\@\w+(\.\w+)+)!}$$
$$\Downarrow$$
$$\hbox{\verb!<a href="mailto:$1">$1</a>!}$$
\vskip .1cm
$$\hbox{\verb!(http://\w+(\.\w+)+)!}$$
$$\Downarrow$$
$$\hbox{\verb!<a href="$1">$1</a>!}$$

\noindent and a single call to this function replaces all e-mail
addresses and URLs.
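
For comparison, a hand-written Perl function with the same effect might
look as follows. It is a sketch, not the script produced by the compiler,
and the name \texttt{expand\_links} is invented here. It applies each
substitution in a single global pass (the \texttt{/g} modifier) instead of
looping, because the generated link itself contains text that the pattern
would match again.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Hypothetical hand-written equivalent of the
# two rules above (single pass per rule).
sub expand_links {
  my ($text) = @_;
  $text =~ s{(\w+(\.\w+)+\@\w+(\.\w+)+)}
            {<a href="mailto:$1">$1</a>}g;
  $text =~ s{(http://\w+(\.\w+)+)}
            {<a href="$1">$1</a>}g;
  return $text;
}

print expand_links(
  "see http://www.example.org"), "\n";
\end{verbatim}
}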

Sometimes we want to evaluate the right hand side of the substitution.
We can do so using the rewriting system's \texttt{=e=>} arrow:

$$\hbox{\verb!RULES arithmetic!}$$
$$\hbox{\verb!\s*\d+\s*[-+*/]\s*\d+\s*!}$$
$$\Downarrow_e$$
$$\hbox{\verb!$&!}$$

This simple example scans a text and evaluates every formula it
finds. This is, indeed, a fast and easy way to
change texts.

The \texttt{=b=>} arrow has no left hand
side: its right hand side is evaluated once, when the system starts.
So, if we need a dot at
the end of the string, we can write:

$$\Downarrow_b$$
$$\hbox{\verb!$_.="."!}$$

Note that this is different from

$$\hbox{\verb!$==>.!}$$

because \verb!$! still matches after every substitution, so there would
be an endless loop. It works fine, however, if
you write:

$$\hbox{\verb!([^.])$==>$1.!}$$

Finally, we can attach a condition to each rule, along with the
matching pattern. As an example, suppose we have defined
a user hash (\texttt{\%user}) that maps user names to
full names. We can write:

$$\hbox{\footnotesize\verb#\b(\w+)\b==>$user{$1} !! defined($user{$1})#}$$

This way, each word is checked, but only those that
match a user name are substituted.
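
The effect of such a conditional rule can be sketched in plain Perl with an
evaluated replacement. The function name \texttt{full\_names} and the
sample entry in \texttt{\%user} are invented for this illustration; the
substitution is applied in one global pass rather than by the rewriting
system.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Invented sample data for the %user hash.
my %user = (jsmith => 'John Smith');

# Hypothetical helper: replace a word only
# when the condition defined($user{...}) holds.
sub full_names {
  my ($text) = @_;
  $text =~ s{\b(\w+)\b}
    {defined($user{$1}) ? $user{$1} : $1}ge;
  return $text;
}

print full_names("mail from jsmith"), "\n";
\end{verbatim}
}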

\section{Text to words}

To read a text, we must divide it into smaller tokens: first
sentences and, later, words.

Text can't be divided straight into words because we need the sentence
delimiters to determine the sentence type (interrogative, exclamatory or
imperative) for the prosodic processing.
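
A minimal sketch of this first step, written in plain Perl, is given
below. The functions \texttt{sentences} and \texttt{classify} are
hypothetical stand-ins introduced only for this illustration (they are not
the functions of the module), and the classification looks at nothing but
the final punctuation mark.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub sentences {
  my ($text) = @_;
  # split at blanks that follow ., ! or ?
  return split /(?<=[.!?])\s+/, $text;
}

sub classify {
  my ($s) = @_;
  return 'interrogative' if $s =~ /\?\s*$/;
  return 'exclamatory'   if $s =~ /!\s*$/;
  return 'other';
}

for my $s (sentences(
    "Bom dia! Como te chamas? Eu falo.")) {
  print classify($s), ": $s\n";
}
\end{verbatim}
}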

Each sentence is then split into words at spaces and commas.
These words are lowercased and looked up in a phoneme
dictionary. This dictionary can contain full translations from
Portuguese to SAMPA, or partial translations that are processed
further by the transformation rules.

The dictionary has the following syntax:

{\scriptsize
\begin{verbatim}
  dic :: line dic
  line :: word '=' SAMPA
        | word '=' '!' SEMI-SAMPA
        | prefix '*' '=' SEMI-SAMPA '*'
\end{verbatim}
}

We explain how this dictionary is used in the next section.

\section{Words to SAMPA}

To translate a word to SAMPA, we first check
whether it exists in the dictionary file:

\begin{itemize}
\item If the word is found together with
its full SAMPA translation (first case), that translation is returned;

\item If the word is found but its entry has an exclamation mark (second case),
the word is replaced by the SEMI-SAMPA code and the process continues
with the rewrite rules;

\item If neither entry exists, the last letter is replaced by an
asterisk (\texttt{*}) and the result is looked up in the dictionary. If it
exists, the prefix (the text before the asterisk) is replaced by its
SEMI-SAMPA code, the rest of the word is appended, and the process
continues with the rewrite rules. If it does not exist, we remove
the last letter before the asterisk and check again, until
no letters are left.

The asterisk signals that there is an exception for words
starting that way. So, with a single such entry we can handle
many words (verbal forms, and so on).

\item If no prefix of the word is found, the word goes unchanged to
the rewrite rules.

\end{itemize}
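
The lookup steps above can be sketched as follows. This is an
illustration under stated assumptions, not the module's code: the function
name \texttt{lookup}, the three in-memory hashes and their entries are
invented, and the transcriptions are placeholders rather than verified
SAMPA. The function returns the text together with a flag telling whether
it is already final.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Invented entries (placeholders, not real
# SAMPA), one hash per dictionary case.
my %full   = (taxi  => 'taksi');  # word=SAMPA
my %semi   = (exame => 'ezame');  # word=!SEMI
my %prefix = (trans => 'tranz');  # prefix*=..*

# Returns (text, done); done is true when the
# text needs no further rewriting.
sub lookup {
  my ($w) = @_;
  return ($full{$w}, 1) if exists $full{$w};
  return ($semi{$w}, 0) if exists $semi{$w};
  # try ever shorter prefixes of the word
  for (my $n = length($w) - 1; $n > 0; $n--) {
    my $p = substr($w, 0, $n);
    return ($prefix{$p} . substr($w, $n), 0)
      if exists $prefix{$p};
  }
  return ($w, 0);  # no entry: rules only
}

my ($text, $done) = lookup('transatlantico');
print "$text\n";
\end{verbatim}
}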

This rewriting system uses two functions. The first one tries to
convert simple letter sequences to their respective phonemes (SAMPA
and some pseudo-SAMPA ones), and the second one tries to convert
everything to SAMPA. Some pseudo-SAMPA is kept so that certain
sounds can be made to last longer.

Some letters have two or more different sounds, depending on the
letters around them. So, we have rules like

$$\hbox{\texttt{(\$vg)x(\$vg)==>\$1z\$2}}$$

where the \texttt{\$vg} variable contains all the letters and SAMPA
phonemes that should be considered vowels.
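
In plain Perl, the rule above corresponds to a substitution repeated until
it no longer matches. In the sketch below the contents of \texttt{\$vg}
are an assumption (the actual set is not listed here), and the function
name \texttt{x\_between\_vowels} is invented for the illustration.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Assumed vowel class, for illustration only.
my $vg = qr/[aeiou6\@EO]/;

sub x_between_vowels {
  my ($w) = @_;
  # repeat until the rule no longer matches
  1 while $w =~ s/($vg)x($vg)/$1z$2/;
  return $w;
}

print x_between_vowels("exame"), "\n";
\end{verbatim}
}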

There are other cases where a letter at the beginning of a
word is read differently from the same letter in the
middle of a word.

These rewriting rules also try to find the tonic syllable. This is
done by checking whether certain letter sequences occur at the end of
the word. 

\section{Transformation of adjacent words}

When speaking, people tend to merge sounds across words. Just as in English we can
write ``aren't'' instead of ``are not'', Portuguese speakers make several
oral contractions.

For this purpose, we wrote a new set of rules that rewrite sentences by
merging vowels of adjacent words, like

$$\hbox{este elefante} \longrightarrow \hbox{est'elefante}$$
$$ \downarrow$$
$$\hbox{eSt@ elefant@} \longrightarrow \hbox{eStelefant@}$$

or concatenating some words

$$\hbox{és esperto} \longrightarrow \hbox{é\textbf{z}esperto}$$
$$\downarrow$$
$$\hbox{ES 6jSpertu} \longrightarrow \hbox{Ez6jSpertu}$$

In these rules a slash marks the word boundary, so the two examples
shown above would be:

\begin{verbatim}
   @/([ea])==>$1
   S/6==>z/e
\end{verbatim} %% Correct $ emacs hightlight

We should make it clear that these rules join SAMPA words,
not Portuguese words. This is why we use an uppercase \texttt{S}
in the second rule, instead of a lowercase one.
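
As an illustration, the first of these rules can be applied in plain Perl
as sketched below. The function name \texttt{join\_words} is invented, and
turning the remaining slashes back into spaces is an assumption of this
sketch; only the first rule is implemented.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Hypothetical helper: joins SAMPA words using
# only the first rule shown above.
sub join_words {
  my @words = @_;
  my $s = join '/', @words;
  # @/([ea]) ==> $1 : drop a final @ before a
  # word starting with e or a
  1 while $s =~ s{\@/([ea])}{$1};
  # remaining word boundaries become spaces
  $s =~ s{/}{ }g;
  return $s;
}

print join_words('eSt@', 'elefant@'), "\n";
\end{verbatim}
}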

\section{Prosodic transformer}

This is another rewriting system, and probably the most complicated
one.

First, we define a set of letters and their respective durations. Then
we match the tonic syllable, so that the sound frequency goes
up or down around it. For this, we define a set of commands,
such as \texttt{=Sub}, \texttt{=Sup} and \texttt{=Pausa}, which respectively
lower the frequency, raise the frequency and pause the sound for some time. 

Here are some examples of frequency variations for interrogative sentences:

{\footnotesize
\begin{verbatim}
($vg):\? ==>$1=Sub=Sup=Pausa500
($vg):($vg)\?==>$1=Sub $2=Sup=Pausa500
($vg):($con)($vg)\?==>$1=Sub $2 $3=Sup=Pausa500
\end{verbatim}
}

Note that the colon marks the first vowel of the tonic syllable.
Thus, there are rules that make the sound frequency go up at the colon
and then go down slightly, letter by letter, until the end of the word:

{\footnotesize
\begin{verbatim}
($vg): ==>$1=Acen
($vg)=Acen ==> \n$1-dur=$durac{$1}-30-130
\end{verbatim}}

At the end, we replace these commands with their respective frequency
values and write them to the MBROLA~\cite{MBROLA} phoneme file for
later conversion to wave and playback.

The monotone version of this system was hard to understand. After
applying a random transformation that made the sound vary randomly, it
became easier to understand. Finally, using our simple prosodic transformer,
we can even hear the difference between interrogative and
imperative sentences. Of course, this system is very incomplete, and
further changes will make it work better.

\section{Non words to words}

Before tokenizing text into words, we thought it would be useful to
translate numbers, e-mail addresses and URLs to a readable form. This is an
addition to the basic Text-to-Speech system that makes it more
suitable for real use. 

\subsection{Numbers to Words}

The first one, for numbers, takes a number, decomposes it into its
components (units, tens, and so on) and translates each of them
to the corresponding text form. The example shows only a small
piece of the rewriting system because it is very long:

\begin{verbatim}
RULES number
10==>dez 
11==>onze 
12==>doze 
 [...]
18==>dezoito 
19==>dezanove 
20==>vinte 
2(\d)==>vinte e $1 
30==>trinta 
3(\d)==>trinta e $1 
 [...]
70==>setenta 
7(\d)==>setenta e $1 
80==>oitenta 
8(\d)==>oitenta e $1 
90==>noventa 
9(\d)==>noventa e $1 
1==>um 
2==>dois 
 [...]
8==>oito 
9==>nove 
0$==>zero 
\end{verbatim} % yeah.. $ emacs highlight

The complete set of rules to translate numbers from zero
to 999 999 takes about 80 lines.
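
To show how the order of the rules drives the result, the sketch below
applies a handful of the rules listed above in plain Perl. Only rules that
are visible in the excerpt are used (the elided ones are left out), and
the function name \texttt{number\_to\_text} and the rule representation are
assumptions of this illustration. The first matching rule is applied and
the search restarts, so the more specific patterns must come first.

{\footnotesize
\begin{verbatim}
use strict;
use warnings;

# Only rules visible in the excerpt are used;
# the elided ones are deliberately left out.
my @rules = (
  [ qr/19/    => sub { 'dezanove' } ],
  [ qr/20/    => sub { 'vinte' } ],
  [ qr/2(\d)/ => sub { "vinte e $_[0]" } ],
  [ qr/90/    => sub { 'noventa' } ],
  [ qr/9(\d)/ => sub { "noventa e $_[0]" } ],
  [ qr/1/     => sub { 'um' } ],
  [ qr/2/     => sub { 'dois' } ],
  [ qr/9/     => sub { 'nove' } ],
);

# Hypothetical driver: apply the first rule
# that matches, then start over.
sub number_to_text {
  my ($t) = @_;
  my $changed = 1;
  while ($changed) {
    $changed = 0;
    for my $rule (@rules) {
      my ($re, $rhs) = @$rule;
      if ($t =~ s/$re/$rhs->($1)/e) {
        $changed = 1;
        last;
      }
    }
  }
  return $t;
}

print number_to_text("21"), "\n";
\end{verbatim}
}

With these rules, \texttt{21} is first rewritten to \texttt{vinte e 1} and
then to \texttt{vinte e um}.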

\subsection{URLs to Words}

The second rewriting system takes e-mail addresses and URLs and
\textit{textifies} them. In e-mail addresses, words shorter than four letters
are spelt out, and the others are read normally. The \textit{at} symbol is
read, as are the dots. In addition, we insert a pause after
each dot to make the result easier to understand.

This example is a bit more complicated. Because the output of the
translation is itself made of words, a naive rule set is likely to
loop forever. We want words of three or fewer characters to be spelt
out and longer words read normally.

For this, we defined an associative array (we could have written another rewriting
system for this) that maps each letter to its pronunciation:

\begin{verbatim}
%letters = (
            'a' => 'á',
            'b' => 'bê',
            'c' => 'cê',
            ...
            'z' => 'zê' );
\end{verbatim}

If we write a rule like

$$\hbox{\verb!([a-zA-Z]{1,3}?)\b!}$$
$$\Downarrow_e$$
$$\hbox{\footnotesize\verb!join("",map {$letters{lc($_)}} split(//,$1))!}$$

%$
 then, once the letters have been replaced by their pronunciation, the
result is matched and replaced again and again, forever.

The solution we found was to place a token that
moves from the beginning to the end of the string, so that the text
behind it is never rewritten again. The result is:

{\footnotesize
\begin{verbatim}
=b=> $_ = "_$_"
_\. ==> , ponto _
_\@ ==> , arroba _
_([a-zA-Z]{1,3}?)\b =e=>
    join("", map {$letters{lc($_)}} 
                  split (//,$1))."_"
_(.+?)\b ==> $1 _
_$==>
\end{verbatim}
}

Explaining these rules:
\begin{itemize}
\item Place the underscore token (\texttt{\_}) at the beginning of the text (this token is
not the best choice, because e-mail addresses can contain underscores, but it makes the explanation easier);
\item Replace an underscore followed by a dot with the word \textit{ponto} followed by the underscore;
\item Do the same with the \textit{at} symbol;
\item Words of one to three letters preceded by the underscore are spelt out: each letter
is translated to its word form, the results are joined together, and an underscore is placed
after them so they won't be processed again;
\item Move the token past words with more than three letters;
\item If the token is at the end of the string, remove it.
\end{itemize}

For a better understanding, look at this example:

{\footnotesize 
\begin{verbatim}
 1. cj@di.servidor.pt
 2. _cj@di.servidor.pt
 3. cê jota _@di.servidor.pt
 4. cê jota,arroba _di.servidor.pt
 5. cê jota,arroba dê í _.servidor.pt
 6. cê jota,arroba dê í,ponto _servidor.pt
 7. cê jota,arroba dê í,ponto servidor _.pt
 8. cê jota,arroba dê í,ponto servidor,ponto _pt
 9. cê jota,arroba dê í,ponto servidor,ponto pê tê_
10. cê jota,arroba dê í,ponto servidor,ponto pê tê
\end{verbatim}
}


\section{Application Examples}

In this section we provide two simple examples of how to use
\texttt{pt::speaker} for real applications.

\subsection{Telephone Numbers}

Now that we are in the era of mobile phones that recognize our voice and
connect directly to the person we want, we can do the opposite
with a simple system. Imagine a database file with nicknames, full names
and the respective telephone numbers. Given a nickname, we want the
program to read out the full name and the telephone number. 

Consider a simple database file:

\begin{verbatim}
maria:Maria Alice:999222323
manuel:Manuel João:999323222
\end{verbatim}

The Perl program, sketched below, loads the dictionary, looks up the
nickname we want, and reads the entry aloud:

\begin{verbatim}
use pt::speaker;

# Load the dictionary
open(DIC, "dic") or die "dic: $!";
while(<DIC>) {
 chomp;  # drop the trailing newline
 ($nick,$name,$num)=split /:/;
 $dic{$nick}=[$name,$num];
}
close DIC;

# Read a nick (without its newline)
chomp($nick = <>);
if (defined($n = $dic{$nick})) {
  # build the sentence to read:
  # "the telephone number of XXX is xxx"
  $s = "O telefone do $n->[0] é ";
  # split the number into groups of three
  # digits to be more easily understandable
  $n->[1]=~/^(...)(...)(...)$/;
  $s.="$1 $2 $3";
  pt::speaker::speak($s);
} else {
  # say that we didn't find it
  pt::speaker::speak("não encontrei");
}
\end{verbatim}

\subsection{HTML Table Of Contents}

Let's look at another use of this module.  We have
an XHTML\footnote{Not HTML, so that we can use an XML tool like XML::DT~\cite{DT}.} file
and want to read its headings aloud:

{\footnotesize
\begin{verbatim}
use XML::DT;
use pt::speaker;

%handler = (
  '-default' => sub{},
  'h1' => sub {
    $h2=0;
    $h1++;
    # Say that it is a chapter
    pt::speaker::speak("Capítulo $h1: $c");
  },
  'h2' => sub {
    $h2++;
    # Say that it is a section
    pt::speaker::speak("Secção $h1 ponto $h2: $c");
  }
);

dt(shift,%handler);
\end{verbatim}
}

This example may look a little odd at first, but the idea is
easy to understand: treat each level-1 heading as a chapter and
each level-2 heading as a section, number them, and read the table of
contents aloud.

\section{Conclusions}

From this work, we conclude that the transformation of text to sound
can be done with simple substitutions and a few language processing
techniques.

This framework can be extended and made more powerful. Simply
adding a rule to any of the rewriting systems makes a real difference to
the sound generated. It would be possible to write an application that
checks, against a phonetic dictionary, the percentage of words
we transcribe correctly. This would help us add or remove rules from
the rewriting system while knowing how much that change
affects the results.

It is also possible to add functions that translate numbers, e-mail addresses,
URLs, times of day or acronyms.  It is simple to add an XML~\cite{XML} parser to
apply different transformations according to the tag we are
looking at. For example, we can have some tags spelt out, others
emphasized with a higher frequency, telephone numbers read in groups of two
or three digits, and so on.

The main strength of this framework is the rewriting
system, which does almost all of the text-to-speech work.

Further development may include adapting the system to the
Festival~\cite{Festival} Speech Synthesis System, which is more powerful
than the MBROLA system.

The phoneme symbols produced from words can, in turn, be
rewritten to \LaTeX~\cite{LComp}, yielding the usual phonetic
notation found in dictionaries.

\bibliographystyle{plain}
\bibliography{bibliography}

\end{document}