XML-TX: type-based XML validation
Classic XML validation
- document is well-formed XML
- format validation
- document follows a DTD
- Elements and attributes are valid
- Grammar is valid
- document follows a specified Schema
- values of some elements or attributes have a specific type
but...
Example: what is wrong?
<?xml version="1.0" encoding="ISO-8859-1"?>
<text>
 <body>
  <entry>
   <url>http://natura.di.uminho.pt</url>
   <url>http://naturaaaa.di.uminho.pt</url>
   <url2>www.di.uminho.pt</url2>
   <orth>aaron</orth>
   <domain xml:lang="pt">gato cat 33</domain>
   <translation>Aaron, Aarão (nome próprio)</translation>
  </entry>
  <entry>
   <orth>aback</orth>
   <domain xml:lang="en">gato cat 33</domain>
   <pos>z.</pos>
  </entry>
  <entry>
   <orth>abaft</orth>
   <pos>adv.</pos>
   <translation>à popa, à ré</translation>
  </entry>
Example (continuation)
What is 'to be valid?'
- day
- 1..31
- element pos
- value in an enumerated set (from file POS)
- element url
- is a URL
- that URL exists
- element translation
- is Portuguese text
- spellchecked
- element domain
- text in the language given by “xml:lang”
Operational Semantics...
Types
- url is an aliveurl
- aliveurl
- translation is Portuguese text
- text(PT)
- domain is text in the “xml:lang” language
- text(@xml:lang)
- day is in 1..31
- [1..31]
- pos is an enumeration (from file POS)
Elements have types
- types are not just sets: they carry functions
- is-valid
- fix-it
- …
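The idea that a type carries functions rather than just a set of values can be sketched in plain Perl. This is a minimal illustration, not the XML::TX implementation: the registry `%TYPE` and the helpers `is_valid` and `fix_it` are hypothetical names, and the two types only mirror the `day` and `url` examples from the slides.

```perl
use strict;
use warnings;

# Hypothetical registry (not the XML::TX API):
# type name => { is_valid => predicate, fix_it => repair function }.
my %TYPE = (
    day => {
        # valid when the value is an integer between 1 and 31
        is_valid => sub { my ($v) = @_; $v =~ /^\d+$/ && $v >= 1 && $v <= 31 },
        # repair: trim surrounding spaces and drop leading zeros
        fix_it => sub {
            my ($v) = @_;
            $v =~ s/^\s+|\s+$//g;
            $v =~ s/^0+(?=\d)//;
            return $v;
        },
    },
    url => {
        # valid when the value starts with an http:// or file:// scheme
        is_valid => sub { my ($v) = @_; $v =~ m{^(?:http|file)://} },
        # repair: add the missing scheme to values starting with "www."
        fix_it => sub {
            my ($v) = @_;
            $v = "http://$v" if $v =~ /^www\./;
            return $v;
        },
    },
);

# Normalise the predicate result to 1 or 0.
sub is_valid { my ( $type, $v ) = @_; return $TYPE{$type}{is_valid}->($v) ? 1 : 0 }

# Return the repaired value (unchanged when nothing can be repaired).
sub fix_it { my ( $type, $v ) = @_; return $TYPE{$type}{fix_it}->($v) }

for my $case ( [ day => "31" ], [ day => " 07 " ], [ day => "42" ], [ url => "www.di.uminho.pt" ] ) {
    my ( $type, $value ) = @$case;
    my $fixed = fix_it( $type, $value );
    printf "%-3s %-20s valid=%d  fixed=%-26s valid_after_fix=%d\n",
        $type, "'$value'", is_valid( $type, $value ), "'$fixed'", is_valid( $type, $fixed );
}
```

Keeping the predicate and the repair function side by side means a value such as `42` can be reported as invalid even when no repair is possible, while `www.di.uminho.pt` becomes valid after the fix.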
Design goals
- Pragmatics
- help in marking errors
- help in fixing errors
- syntax
- as simple as possible
- as powerful as possible
- semantics
- type based
- dynamic types
- function is-valid
- function fix-it
- builtin types
- user defined types
Design goals (2)
- validity process can see the world
- a type of an element may depend on the value of an attribute
- function “is-valid” may depend on whatever it needs
- Partial
- partial validation, typing, visiting
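A type that depends on an attribute value can be sketched as a function from attributes to a validator. The code below is an illustration under stated assumptions, not the module's code: `%LEXICON` is a toy word list standing in for a real spell checker, and `text`, `%TYPE_OF` and `unknown_words` are hypothetical names chosen to echo the slides.

```perl
use strict;
use warnings;

# Toy lexicons standing in for a real spell checker (assumption).
my %LEXICON = (
    pt => { map { $_ => 1 } qw(gato popa) },
    en => { map { $_ => 1 } qw(cat aback) },
);

# text($lang) builds a validator that returns the words
# of the content that are not in the lexicon for $lang.
sub text {
    my ($lang) = @_;
    return sub {
        my ($content) = @_;
        return grep { !$LEXICON{$lang}{ lc $_ } } split ' ', $content;
    };
}

# Each element maps to a resolver: attributes in, validator out.
my %TYPE_OF = (
    translation => sub { text("pt") },    # static type
    domain      => sub {                  # dynamic type, chosen by xml:lang
        my ($attr) = @_;
        text( $attr->{'xml:lang'} || "pt" );
    },
);

# Resolve the type for this element instance, then validate its content.
sub unknown_words {
    my ( $element, $attr, $content ) = @_;
    my $validator = $TYPE_OF{$element}->($attr);
    return $validator->($content);
}

my @bad_pt = unknown_words( domain => { 'xml:lang' => 'pt' }, "gato cat 33" );
my @bad_en = unknown_words( domain => { 'xml:lang' => 'en' }, "gato cat 33" );
print "xml:lang=pt rejects: @bad_pt\n";
print "xml:lang=en rejects: @bad_en\n";
```

The same content `gato cat 33` is judged differently in each entry because the validator is only chosen once the `xml:lang` attribute of that element is known.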
Module XML::TX
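use XML::TX;
use LWP::Simple;

my $types = {
  sentencePt => text("pt"),
  sentenceEn => text("en"),
  # dynamic type: depends on the xml:lang attribute (default "pt")
  domain     => sub { text($v{'xml:lang'} || "pt") },
  url        => "urlActive",
};

addType(
  urlActive => {
    # mark the content as an error unless a HEAD request succeeds
    markit => sub {
      $c = markAsErr($c) unless (LWP::Simple::head($c));
      toxml();
    },
  },
);

markit($filename, $types);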
... and also
fixit( $filename, $types );
isvalid( value, type )
tx DSL (try to hide details...)
tx tx-file x.xml
url2         url
href         urlActive
pos          enumFromFile("POS")
orth         text("en")
translation  text("pt")
domain       text(@xml:lang)
fig@url      urlActive
%%
use LWP::Simple;
addType(
  urlActive => {
    markit => sub {
      $c = markAsErr($c) unless (LWP::Simple::head($c));
      toxml();
    },
  },
);
Available types
- email
- date
- text(L1)
- enumFromFile(t, F)
- enum(day, [1..31])
- fromRegExp(type, regexp)
addType(
  typename => {
    markit => sub {...},
    fixit  => ....
  },
)
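How enumeration and regular-expression types might be built can be sketched with closures. The constructors below (`enum_type`, `regexp_type`, `enum_from_fh`) are hypothetical stand-ins, not the signatures of the `enum`, `fromRegExp` and `enumFromFile` types listed above, and the part-of-speech list and the email pattern are made up for the example.

```perl
use strict;
use warnings;

# enum_type(@allowed): predicate that is true for values in the list.
sub enum_type {
    my %allowed = map { $_ => 1 } @_;
    return sub { my ($v) = @_; exists $allowed{$v} ? 1 : 0 };
}

# regexp_type($re): predicate that is true when the whole value matches.
sub regexp_type {
    my ($re) = @_;
    return sub { my ($v) = @_; $v =~ /\A(?:$re)\z/ ? 1 : 0 };
}

# enum_from_fh($fh): read one allowed value per line from a filehandle.
sub enum_from_fh {
    my ($fh) = @_;
    chomp( my @values = <$fh> );
    return enum_type( grep { length } @values );
}

my $is_day   = enum_type( 1 .. 31 );
my $is_email = regexp_type(qr/[^\s\@]+\@[^\s\@]+\.[^\s\@]+/);    # deliberately loose pattern

# In-memory stand-in for a file such as POS (contents are made up).
my $pos_list = "adv.\nadj.\nn.\nv.\n";
open my $fh, '<', \$pos_list or die "cannot open in-memory file: $!";
my $is_pos = enum_from_fh($fh);
close $fh;

for my $check (
    [ 'day "31"'                  => $is_day->("31") ],
    [ 'day "32"'                  => $is_day->("32") ],
    [ 'email "user@example.org"'  => $is_email->('user@example.org') ],
    [ 'pos "adv."'                => $is_pos->("adv.") ],
    [ 'pos "z."'                  => $is_pos->("z.") ],
    )
{
    printf "%-28s %s\n", $check->[0], $check->[1] ? "ok" : "invalid";
}
```

Each constructor returns a predicate, so a handler such as `markit` could call it and mark the content as an error when the predicate fails.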
User defined types
addType(
  url => {
    markit => sub {
      $c = markAsErr($c) unless $c =~ m{^(http|file)://};
      toxml();
    },
    fixit => sub {
      $c = "http://" . $c if $c =~ /^www\./;
      $c = markAsErr($c) unless $c =~ m{^(http|file)://};
      toxml();
    },
  },
);
Type date
use Date::Manip;
addType(
  date => {
    markit => sub {
      $c = markAsErr($c) unless ....
      toxml()
    },
    fixit => sub {
      my $aux = ParseDate($c);
      if ($aux) {
        $c = pp($aux);
      } else {
        $c = markAsErr($c);
      }
      toxml()
    },
  },
);
Fixit
tx -correct x.tx y.xml > output
- simple corrections for specific situations
- color ⇒ colour
- correct common mistakes
- interactive corrections
- validator EPR
Validator EPR
extract + process + rebuild
Correction = ext_proc_rec(CorrectorInteractivo, ....)
XML-DT based validators
final post-processor
facet-oriented processor
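The extract + process + rebuild idea can be sketched in a few lines of Perl. This is an illustration only: `epr_sketch` is a hypothetical function, not the `ext_proc_rec` mentioned above, and it uses a regular expression that only handles attribute-free, non-nested elements where a real validator would use an XML parser.

```perl
use strict;
use warnings;

# epr_sketch($process, $tag, $xml):
#   extract the contents of every <$tag> element,
#   process them as a plain list,
#   rebuild the document with the processed contents.
sub epr_sketch {
    my ( $process, $tag, $xml ) = @_;

    # Extract: save each content and leave a numbered placeholder behind.
    my @values;
    ( my $skeleton = $xml ) =~ s{(<\Q$tag\E>)(.*?)(</\Q$tag\E>)}{
        push @values, $2;
        $1 . "\0" . $#values . "\0" . $3
    }ges;

    # Process: the corrector sees only the extracted strings.
    my @fixed = map { $process->($_) } @values;

    # Rebuild: put the processed strings back in place.
    $skeleton =~ s{\0(\d+)\0}{$fixed[$1]}g;
    return $skeleton;
}

# A non-interactive corrector for one specific situation.
my $corrector = sub { ( my $t = shift ) =~ s/\bcolor\b/colour/g; $t };

my $xml = "<doc><p>the color red</p><note>color</note><p>no change</p></doc>";
print epr_sketch( $corrector, "p", $xml ), "\n";
```

Because the corrector works on a list of strings, it could just as well be interactive or batch several values together before the document is rebuilt.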
Micro Demo
ferramentas/xml-tx.txt · Last modified: 2008/07/18 18:13 by 127.0.0.1