sNews Forum

Website Talk => Web standards => Topic started by: Sven on May 23, 2008, 04:08:08 pm

Title: Semantic data extraction issue
Post by: Sven on May 23, 2008, 04:08:08 pm

Hi all

I tried to extract semantic content with the w3.org online service.
So I pointed it at the default landing page of my website, where a page called "accueil" is set as the homepage, and it gave an error saying:
Quote
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: Illegal HTML character: decimal 137
Illegal HTML character: decimal 137

I searched for this bloody char and couldn't find it in the output code.
Link to result: NO GOOD (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%253A%252F%252Fwww.hiseo.fr%252F&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)
Hum...hum...
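
A possible way to hunt for that character, as a minimal sketch: decimal 137 is U+0089, a C1 control character that most editors display as nothing at all, so a visual search will miss it. One way it can appear is when an uppercase "É" stored as UTF-8 (bytes C3 89) is decoded as Latin-1. Whether that is what happens on this page is an assumption; the helper name `find_c1_controls` and the sample string are hypothetical.

```python
def find_c1_controls(text):
    """Return (index, code point, context) for each C1 control char (U+0080 to U+009F)."""
    hits = []
    for i, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            # Keep a little surrounding text so the spot is easy to locate in the page.
            hits.append((i, ord(ch), text[max(0, i - 10):i + 10]))
    return hits


# Hypothetical sample: UTF-8 bytes of "RÉDACTEUR" wrongly decoded as Latin-1.
sample = "RÉDACTEUR".encode("utf-8").decode("latin-1")

for index, code, context in find_c1_controls(sample):
    print(index, code, repr(context))
```

Running the same function over the HTML that the server actually sends (not the template file) would show where the offending character sits, if it is there.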

Then I ran another test with the same page, but with its complete URL (/home/accueil)
Link to result: GOOD (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%253A%252F%252Fwww.hiseo.fr%252Fhome%252Faccueil&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)
and there it said: :o
Quote
extracted data
Generic metadata

Title
    Hiseo : un rédacteur Web pour votre communication
Author
    Philippe Le Mesle - hiseo.fr
Description
    Site d'un rédacteur Web. Le référencement naturel de vos sites : audit, création, étude concurrentielle, optimisation de code et rédaction de contenus
Language code
    fr
Explicit language annotations within the document

        * fr
2 pages which should be identical gave 2 different results!  ???

What's more, why are my French chars not recognized? >:(
Does anyone have an idea?
Title: Re: Semantic data extraction issue
Post by: funlw65 on May 23, 2008, 04:41:36 pm
Mine is good. For default link: http://www.morisca.net
And the same for the page designated as home page http://morisca.net/home/recent-articles-by-categories/
I think it's about index.php...

Here (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%253A%252F%252Fwww.morisca.net&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)


--------------------------------
But there are problems with the Blog page ("Content is not allowed in prolog"): this link is not working: http://www.morisca.net/blog but this one is OK: http://www.morisca.net/blog/
Title: Re: Semantic data extraction issue
Post by: Sven on May 23, 2008, 05:14:28 pm
Helloooo M Fun
nice to see you.

Yeap, you're right: yours is okay, mine is just garbage! Did you notice that the "Outline of the document" isn't extracted? Where the f*ck is it? :-[
Title: Re: Semantic data extraction issue
Post by: funlw65 on May 23, 2008, 06:05:48 pm
Outline of the document isn't extracted

What is that? The description?
Title: Re: Semantic data extraction issue
Post by: Sven on May 23, 2008, 06:18:59 pm
at the bottom of your link:
Quote
Outline of the document

    * Alternative
          o - My projects about free energy, windmills, cnc routers, etc.
          o Recent articles by categories
                + News
                + Photography
                + Art
                + Web design
                + Software
                + Daily
                + Religion
                + Electronics
                + Windmills
                + CNC routers
                + Solar panels
                + 3D graphics
          o Welcome
          o Search Box
          o Categories:
          o Temple Talk
          o Tech Links
          o Other Links
          o Wise Words
          o Latest Articles
          o Latest Comments
I don't have any. :-\

Maybe the char issue is due to the format when saving the index file.
What format are you using for encoding?
- ANSI/UTF-8 without BOM
- ANSI/UTF-8 with BOM
- UTF-8
- UCS2 Big-Endian
- UCS2 Little-Endian
 ???
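
To see which of these formats a saved file really uses, one option is to look at its first bytes. This is only a sketch: the function name `describe_encoding` and the labels it returns are made up for illustration, and a file without a BOM can only be guessed at (here it is simply tested for valid UTF-8).

```python
import codecs

# Check the 4-byte UTF-32 marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UCS-2/UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UCS-2/UTF-16 big-endian"),
]


def describe_encoding(data: bytes) -> str:
    """Guess how a file was saved from its raw bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8 (possibly ANSI / Latin-1)"
    return "UTF-8 without BOM (or plain ASCII)"


# Usage with a file on disk (the path is hypothetical):
# with open("index.php", "rb") as f:
#     print(describe_encoding(f.read()))
```

Note that a BOM at the top of a PHP file is sent to the browser before any markup, so "UTF-8 without BOM" is usually the safer choice for templates.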

Title: Re: Semantic data extraction issue
Post by: funlw65 on May 23, 2008, 06:32:57 pm
Well, utf8_general_ci is used for Collation in tables.

And this in html page code:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
   <meta name="Generator" content="sNews 1.6" />
Title: Re: Semantic data extraction issue
Post by: Sven on May 23, 2008, 07:56:08 pm
Well, utf8_general_ci is used for Collation in tables.
Same. Chars in the DB are O.K.
And this in html page code:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
   <meta name="Generator" content="sNews 1.6" />
Same too. I tried UTF-8 in lower case. No change.
As for the encoding used when saving the file, I tried all the combinations above without any improvement.
Title: Re: Semantic data extraction issue
Post by: funlw65 on May 23, 2008, 08:05:47 pm
Maybe your title with French chars must come after this:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
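
A quick way to check that ordering on the rendered page, as a rough sketch: the function name `charset_declared_before_title` is hypothetical, and the regular expressions are a simplification that only looks for the first `<meta ... charset=` and the first `<title`. It needs the HTML as delivered to the browser, since a template containing `<?php title(); ?>` has no `<title>` tag until PHP has run.

```python
import re


def charset_declared_before_title(html: str) -> bool:
    """True if a <meta ... charset=...> tag appears before the <title> tag."""
    meta = re.search(r"<meta[^>]+charset\s*=", html, re.IGNORECASE)
    title = re.search(r"<title\b", html, re.IGNORECASE)
    if meta is None:
        return False  # no charset declaration in the markup at all
    return title is None or meta.start() < title.start()


# Hypothetical rendered head sections for illustration.
title_first = (
    "<head><title>Hiseo</title>"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" /></head>'
)
meta_first = (
    '<head><meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    "<title>Hiseo</title></head>"
)

print(charset_declared_before_title(title_first))
print(charset_declared_before_title(meta_first))
```

Ruling this out is cheap, but it may well not be the cause of the problem.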
Title: Re: Semantic data extraction issue
Post by: Sven on May 24, 2008, 08:54:19 am
Moving it down gives the same result with those bloody chars: é instead of
What does that mean?
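
One common explanation for accented letters turning into two odd characters, sketched below under assumptions: text stored as UTF-8 is read somewhere along the way as Latin-1 (ISO-8859-1), so each accented letter becomes a pair like "Ã©". The thread does not show the exact output, so whether this applies here is a guess; both function names are hypothetical.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being decoded with the wrong (Latin-1) charset."""
    return text.encode("utf-8").decode("latin-1")


def repair_misread_text(text: str) -> str:
    """Undo the misreading; raises an error if the text was not damaged this way."""
    return text.encode("latin-1").decode("utf-8")


broken = misread_utf8_as_latin1("rédacteur")
print(broken)
print(repair_misread_text(broken))
```

If the repair function restores the accents on a sample copied from the page, the data is being decoded with the wrong charset at some step (file, database connection, or HTTP header), which narrows down where to look.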
Prologue and header seem OK:
Quote
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
<?php title(); ?>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="content-language" content="fr" />
Don't get it at all.???

RE-EDITION: nope. Still having the same problem in an article (and in any page). See result there (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%253A%252F%252Fwww.hiseo.fr%252Foptimisations%252Fmemento-du-referencement%252F&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)

Title: Re: Semantic data extraction issue
Post by: Sven on May 24, 2008, 12:13:26 pm
I'm gonna open a new thread for this char issue.
I'm not sure but I believe it comes from the DB. :(
Title: Re: Semantic data extraction issue
Post by: philmoz on May 25, 2008, 02:23:03 am
http://www.lehtml.com/xhtml/docmini.html

I don't know if the doctype change is legit, but it might work.. ??
Title: Re: Semantic data extraction issue
Post by: Sven on May 25, 2008, 09:22:49 am
Phil, the subject has been split: see this thread (http://snewscms.com/forum/index.php?topic=7404.new#new) for the char issue.