XML-TX : type-based XML Validation

Classic XML validation

  • document is well-formed XML
    • syntax (format) validation
  • document follows a DTD
    • Elements and attributes are valid
    • Grammar is valid
  • document follows a specified Schema
    • values of some elements or attributes have a specific type

but...

Example: what is wrong?

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <text>
   <body>
     <entry>
       <url>http://natura.di.uminho.pt</url>
       <url>http://naturaaaa.di.uminho.pt</url>
       <url2>www.di.uminho.pt</url2>
       <orth>aaron</orth>
       <domain xml:lang="pt">gato cat 33</domain>
       <translation>Aaron, Aarão (nome próprio)</translation>
     </entry>
     <entry>
       <orth>aback</orth>
       <domain xml:lang="en">gato cat 33</domain>
       <pos>z.</pos>
     </entry>
     <entry>
       <orth>abaft</orth>
       <pos>adv.</pos>
       <translation>à popa, à ré</translation>
     </entry>
   </body>
 </text>

Example (continuation)

What does 'to be valid' mean?

  • day
    • 1..31
  • element pos
    • value in an enumerated set (from file POS)
  • element url
    • is a URL
    • the URL exists (it can be reached)
  • element translation
    • is Portuguese text
    • spellchecked
  • element domain
    • text in the language given by “xml:lang”

Operational Semantics...

Types

  • url is an aliveurl
    • aliveurl
  • translation is a Portuguese text
    • text(PT)
  • domain is a text in the language given by “xml:lang”
    • text(@xml:lang)
  • day is a value in 1..31
    • [1..31]
  • pos is an enumeration (from file POS)

Elements have types

  • types are not sets: they have functions
    • is-valid
    • fix-it
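
The idea that a type is a pair of functions rather than a set of values can be sketched as follows. The hash layout and the names %type, is_valid, fix_it and check are assumptions made for this illustration; they are not the XML::TX API.

```perl
use strict;
use warnings;

# Hypothetical type table: each type is a record of functions, not a set of values.
my %type = (
    day => {
        # valid when the value is an integer between 1 and 31
        is_valid => sub { my ($v) = @_; $v =~ /^\d+$/ && $v >= 1 && $v <= 31 },
        # repair attempt: strip surrounding whitespace and leading zeros
        fix_it   => sub { my ($v) = @_; $v =~ s/^\s+|\s+$//g; $v =~ s/^0+(?=\d)//; $v },
    },
);

# Validate a value; if it fails, try the type's fix and validate again.
sub check {
    my ($name, $value) = @_;
    my $t = $type{$name} or die "unknown type $name";
    return ($value, 'ok') if $t->{is_valid}->($value);
    my $fixed = $t->{fix_it}->($value);
    return $t->{is_valid}->($fixed) ? ($fixed, 'fixed') : ($value, 'error');
}

my @results = map { [ check(day => $_) ] } ('7', ' 07 ', '42');
printf "%-6s %s\n", "'$_->[0]'", $_->[1] for @results;
```

Here ' 07 ' is not valid as written but becomes valid after the fix, while '42' cannot be repaired and stays an error.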

Design goals

  • Pragmatics
    • help in marking errors
    • help in fixing errors
  • syntax
    • as simple as possible
    • as powerful as possible
  • semantics
    • type based
      • dynamic types
      • function is-valid
      • function fix-it
    • builtin types
    • user defined types

Design goals (2)

  • validity process can see the world
    • a type of an element may depend on the value of an attribute
    • function “is-valid” may depend on anything it needs
  • Partial
    • partial validation, typing, visiting
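
A type that depends on an attribute value can be pictured as a function from the element's attributes to a validator. In this minimal sketch the tiny %lexicon stands in for a real spellchecker, and unknown_words is a hypothetical helper; only the text(lang) idea and the "pt" default come from these slides.

```perl
use strict;
use warnings;

# Hypothetical tiny lexicons standing in for real spellchecker dictionaries.
my %lexicon = (
    pt => { map { $_ => 1 } qw(gato popa) },
    en => { map { $_ => 1 } qw(cat stern) },
);

# text($lang) returns a validator closure for that language.
sub text {
    my ($lang) = @_;
    return sub {
        my ($content) = @_;
        # return the words that are not in the lexicon of $lang
        return grep { !$lexicon{$lang}{ lc $_ } } split ' ', $content;
    };
}

# The type of <domain> is chosen at validation time from its xml:lang attribute.
my %types = (
    domain => sub { my (%attr) = @_; text( $attr{'xml:lang'} || 'pt' ) },
);

# Resolve the type for this element and return the words it rejects.
sub unknown_words {
    my ($element, $attr, $content) = @_;
    my $validator = $types{$element}->(%$attr);
    return [ $validator->($content) ];
}

my $pt = unknown_words( domain => { 'xml:lang' => 'pt' }, 'gato cat 33' );
my $en = unknown_words( domain => { 'xml:lang' => 'en' }, 'gato cat 33' );
print "pt: @$pt\n";
print "en: @$en\n";
```

The same content is judged differently depending on xml:lang, which is the point of letting the validity process "see the world".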

Module XML::TX

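 use XML::TX;
 use LWP::Simple;    # provides head(), used by urlActive below
 
 # Map each element name to a type; a sub computes the type at run time.
 my $types={ sentencePt => text("pt"),
             sentenceEn => text("en"),
             # %v holds the attributes of the current element
             domain     => sub{text($v{'xml:lang'} || "pt")},
             url        => "urlActive",
           };
 
 # $c is the content of the current element: mark it unless the URL answers
 addType(
    urlActive =>
     { markit => sub{ $c = markAsErr($c) 
                   unless (LWP::Simple::head($c));
                      toxml()}, } );
 
 markit($filename,$types);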

... and also

 
 fixit( $filename, $types );    # like markit, but applies each type's fixit function
 isvalid( $value, $type );      # checks a single value against a type

tx DSL (try to hide details...)

tx tx-file x.xml
 url2         url
 href         urlActive
 pos          enumFromFile("POS")
 orth         text("en")
 translation  text("pt")
 domain       text(@xml:lang)
 fig@url      urlActive
  
 %%
  
 use LWP::Simple;
  
 addType(
   urlActive =>
      { markit => sub{ $c = markAsErr($c) 
                            unless (LWP::Simple::head($c));
                       toxml()}, } );
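
How such a file could be read is sketched below. It assumes, from the example above only, that a tx file has a mapping section of "selector type" lines and a Perl section, separated by a line containing just %%. The function parse_tx is hypothetical and not part of the tx tool.

```perl
use strict;
use warnings;

# Split a tx specification into its mapping section and its Perl section,
# assuming the two are separated by a line containing only "%%".
sub parse_tx {
    my ($spec) = @_;
    my ($mapping, $perl) = split /^[ \t]*%%[ \t]*$/m, $spec, 2;
    my %type_of;
    for my $line (split /\n/, $mapping) {
        next unless $line =~ /\S/;                  # skip blank lines
        my ($selector, $type) = $line =~ /^\s*(\S+)\s+(.+?)\s*$/
            or die "cannot parse line: $line";
        $type_of{$selector} = $type;                # e.g. fig@url => urlActive
    }
    return (\%type_of, defined $perl ? $perl : '');
}

my $spec = <<'TX';
 url2         url
 pos          enumFromFile("POS")
 domain       text(@xml:lang)
 fig@url      urlActive

 %%

 use LWP::Simple;
TX

my ($type_of, $perl) = parse_tx($spec);
print "$_ => $type_of->{$_}\n" for sort keys %$type_of;
```

The type expressions are kept as strings here; turning them into validators is left out of the sketch.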

Available types

  • email
  • date
  • text(L1)
  • enumFromFile(t,F)
  • enum(day, [1..31])
  • fromRegExp(type, regexp)

New types are added with addType:

 addType(
   typename => { markit => sub {...},
                 fixit  => ....  },
 )
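
Enumeration and regular-expression types can both be pictured as closures stored under a type name. The sketch below is an independent illustration: %registry, define_enum, define_regexp and is_valid are hypothetical names, and the list of pos values is a stand-in for the contents of file POS.

```perl
use strict;
use warnings;

my %registry;    # hypothetical registry: type name => validator closure

# Define a type whose values must belong to a fixed list.
sub define_enum {
    my ($name, $values) = @_;
    my %allowed = map { $_ => 1 } @$values;
    $registry{$name} = sub { exists $allowed{ $_[0] } };
    return;
}

# Define a type whose values must match a regular expression from start to end.
sub define_regexp {
    my ($name, $re) = @_;
    $registry{$name} = sub { $_[0] =~ /\A(?:$re)\z/ };
    return;
}

# Return 1 if $value belongs to the named type, 0 otherwise.
sub is_valid {
    my ($value, $name) = @_;
    my $check = $registry{$name} or die "unknown type $name";
    return $check->($value) ? 1 : 0;
}

define_enum( 'day', [ 1 .. 31 ] );
define_enum( 'pos', [ 'adv.', 'n.', 'v.' ] );    # stand-in for the contents of file POS
define_regexp( 'year', qr/\d{4}/ );

print "$_->[0] as $_->[1]: ", is_valid(@$_), "\n"
    for [ 31, 'day' ], [ 32, 'day' ], [ 'z.', 'pos' ], [ '2008', 'year' ];
```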

User defined types

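 addType(
   url => { # markit: flag anything that does not start with http:// or file://
            markit => sub{ 
              $c = markAsErr($c) unless $c =~ m{^(http|file)://};
              toxml()},
            # fixit: prefix bare "www." hosts with http://, then flag what is still wrong
            fixit  => sub{
              $c = "http://" . $c   if $c =~ /^www\./;
              $c = markAsErr($c) unless $c =~ m{^(http|file)://};
              toxml()}, },
 );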
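
The difference between marking and fixing can be tried on plain strings, outside any XML traversal. In this sketch mark_as_err is a hypothetical stand-in for markAsErr (its real output format is not shown in these slides), and mark_url and fix_url are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for markAsErr: wrap the offending content in an <error> element.
sub mark_as_err { my ($c) = @_; return "<error>$c</error>" }

my $url_ok = qr{^(http|file)://};

# markit: flag content that does not look like a URL.
sub mark_url { my ($c) = @_; return $c =~ $url_ok ? $c : mark_as_err($c) }

# fixit: repair the common "www." case first, then flag what is still wrong.
sub fix_url {
    my ($c) = @_;
    $c = "http://$c" if $c =~ /^www\./;
    return mark_url($c);
}

for my $c ('http://natura.di.uminho.pt', 'www.di.uminho.pt', 'natura') {
    printf "%-28s mark: %-40s fix: %s\n", $c, mark_url($c), fix_url($c);
}
```

The value www.di.uminho.pt is an error for markit but is silently repaired by fixit; natura is an error for both.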

Type date

 use Date::Manip;
 addType(
   date => { markit => sub{ $c = markAsErr($c) unless ....
                            toxml()},
             fixit  => sub{
                my $aux = ParseDate($c);        # false if $c cannot be parsed
                if ($aux){ $c = pp($aux); }     # rewrite the date in normalized form
                else     { $c = markAsErr($c); }
                toxml()}}, 
 );

Fixit

 tx -correct x.tx y.xml > output

  • simple corrections for specific situations
    • color ⇒ colour
    • correct common mistakes
  • interactive corrections
  • validator EPR

Validator EPR

  extract + process + rebuild
  Correction = ext_proc_rec(CorrectorInteractivo, ....)
  XML-DT based validators
  final post-processor
  facet-oriented processor
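
The extract + process + rebuild pattern can be sketched on a string. This is only an illustration under simplifying assumptions: elements are found with a regular expression (no attributes, no nesting) instead of an XML parser, and ext_proc_rebuild is a hypothetical name, not the ext_proc_rec function mentioned above.

```perl
use strict;
use warnings;

# Extract the contents of every <$tag> element, process them as one batch,
# and rebuild the document with the processed values in the original places.
sub ext_proc_rebuild {
    my ($process, $tag, $xml) = @_;
    my @items;
    # extract: replace each content with a numbered placeholder
    (my $skeleton = $xml) =~ s{(<$tag>)(.*?)(</$tag>)}
                              { push @items, $2; $1 . "\0" . $#items . "\0" . $3 }gse;
    # process: the callback sees all extracted values together
    my @done = $process->(@items);
    # rebuild: put the processed values back where the placeholders are
    $skeleton =~ s{\0(\d+)\0}{ $done[$1] }ge;
    return $skeleton;
}

my $xml     = '<e><t>color</t><pos>n.</pos><t>center</t></e>';
my %british = ( color => 'colour', center => 'centre' );
my $out     = ext_proc_rebuild( sub { map { $british{$_} // $_ } @_ }, 't', $xml );
print "$out\n";
```

Because the processor receives all values at once, it could equally be an interactive corrector or an external tool run over a temporary file.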

Micro Demo

ferramentas/xml-tx.txt · Last modified: 2008/07/18 18:13 by 127.0.0.1