Author Topic: Semantic data extraction issue  (Read 10994 times)

Sven

  • ULTIMATE member
  • ******
  • Karma: 88
  • Posts: 2029
  • Chasing MY bugs!
    • hiseo.fr - rédacteur Web
Semantic data extraction issue
« on: May 23, 2008, 04:08:08 PM »


Hi all

I tried to extract semantic content with the w3.org online service.
So I pointed it at the default landing page of my website, where a page called "accueil" is set as the homepage, and it gave an error saying:
Quote
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: Illegal HTML character: decimal 137
Illegal HTML character: decimal 137

I've searched for this bloody char and couldn't find it in the output code.
Link to result: NO GOOD
Hum...hum...
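
One possible explanation for "decimal 137", offered as a guess rather than a diagnosis: the UTF-8 encoding of "É" is the two bytes 0xC3 0x89, and 0x89 is 137 in decimal. If a UTF-8 page is read as ISO-8859-1 somewhere along the way, that second byte becomes the control character U+0089, which HTML does not allow and which is invisible in the source view. The Python sketch below reproduces the effect on a sample word and locates such characters; `find_c1_controls` is a hypothetical helper name, not part of sNews or the W3C service.

```python
def find_c1_controls(text):
    """Return (index, code point) for every C1 control character (128-159)."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if 128 <= ord(ch) <= 159]

# Assumption: the page is UTF-8 but gets decoded as ISO-8859-1 somewhere.
utf8_bytes = "Édition".encode("utf-8")      # starts with the bytes C3 89
misread = utf8_bytes.decode("iso-8859-1")   # 'Ã' followed by U+0089

hits = find_c1_controls(misread)
print(hits)  # [(1, 137)]
```

Running the same function over the downloaded text of the failing page would show whether, and where, a character in that range is present.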

Then I ran another test with the same page, but using its complete URL (/home/accueil)
Link to result: GOOD
and there it said: :o
Quote
extracted data
Generic metadata

Title
    Hiseo : un rédacteur Web pour votre communication
Author
    Philippe Le Mesle - hiseo.fr
Description
    Site d'un rédacteur Web. Le référencement naturel de vos sites : audit, création, étude concurrentielle, optimisation de code et rédaction de contenus
Language code
    fr
Explicit language annotations within the document

        * fr
2 pages which should be identical gave 2 different results!  ???

What's more, why are my French chars not recognized? >:(
Does anyone have an idea?

funlw65

  • Hero Member
  • *****
  • Karma: 96
  • Posts: 771
    • Country Lab
Re: Semantic data extraction issue
« Reply #1 on: May 23, 2008, 04:41:36 PM »

Mine is good, both for the default link: http://www.morisca.net
and for the page designated as the home page: http://morisca.net/home/recent-articles-by-categories/
I think it is about index.php...

Here


--------------------------------
But there is a problem with the Blog page: "Content is not allowed in prolog". This link does not work: http://www.morisca.net/blog but this one is OK: http://www.morisca.net/blog/
« Last Edit: May 23, 2008, 06:12:17 PM by funlw65 »

Sven

Re: Semantic data extraction issue
« Reply #2 on: May 23, 2008, 05:14:28 PM »

Helloooo M Fun
nice to see you.

Yep, you're right: yours is okay, mine is just garbage! Did you notice that "Outline of the document" isn't extracted? What the f*ck? :-[

funlw65

Re: Semantic data extraction issue
« Reply #3 on: May 23, 2008, 06:05:48 PM »

Quote
Outline of the document isn't extracted

What is that? The description?

Sven

Re: Semantic data extraction issue
« Reply #4 on: May 23, 2008, 06:18:59 PM »

At the bottom of your link:
Quote
Outline of the document

    * Alternative
          o - My projects about free energy, windmills, cnc routers, etc.
          o Recent articles by categories
                + News
                + Photography
                + Art
                + Web design
                + Software
                + Daily
                + Religion
                + Electronics
                + Windmills
                + CNC routers
                + Solar panels
                + 3D graphics
          o Welcome
          o Search Box
          o Categories:
          o Temple Talk
          o Tech Links
          o Other Links
          o Wise Words
          o Latest Articles
          o Latest Comments
I don't have any. :-\

Maybe the char issue is due to the encoding used when saving the index file.
Which encoding are you using?
- ANSI/UTF-8 without BOM
- ANSI/UTF-8 with BOM
- UTF-8
- UCS-2 Big-Endian
- UCS-2 Little-Endian
 ???
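
A quick way to see which of these a saved file actually uses is to look at its first bytes. This is a minimal Python sketch; `detect_bom` and `detect_bom_in_file` are hypothetical helper names, and the file name used afterwards is only an example. A BOM placed before the doctype or before `<?php` is sent to the client as content, which is a commonly reported cause of the "Content is not allowed in prolog" error in XML parsers, though it may not be the cause here.

```python
import codecs

# Byte order marks for the encodings listed above.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_BE, "UCS-2/UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UCS-2/UTF-16 little-endian"),
]

def detect_bom(data: bytes) -> str:
    """Name the BOM at the start of data, or report that there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "no BOM (ANSI or UTF-8 without BOM)"

def detect_bom_in_file(path: str) -> str:
    """Read only the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))
```

For example, `detect_bom_in_file("index.php")` would report whether the editor wrote a BOM at the top of that file. Without a BOM the bytes alone cannot tell ANSI from UTF-8, so that case is reported as a single answer.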

funlw65

Re: Semantic data extraction issue
« Reply #5 on: May 23, 2008, 06:32:57 PM »

Well, utf8_general_ci is used for the collation in the tables.

And this is in the HTML page code:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
   <meta name="Generator" content="sNews 1.6" />

Sven

Re: Semantic data extraction issue
« Reply #6 on: May 23, 2008, 07:56:08 PM »

Quote
Well, utf8_general_ci is used for Collation in tables.
Same. Chars in the DB are O.K.
Quote
And this in html page code:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
   <meta name="Generator" content="sNews 1.6" />
Same too. I tried UTF-8 in lower case. No change.
For the file encoding, I tried all the combinations above without any improvement.
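
The meta tag is only one of the places where a charset can be declared. If the server sends a charset in its HTTP Content-Type header, browsers give that value priority over the meta tag, so a mismatch between the two is worth ruling out. The Python sketch below compares the two declarations; the helper names and the sample header value are assumptions made for illustration, not values taken from either site.

```python
import re

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset\s*=\s*"?([\w-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html):
    """Extract the charset declared in a <meta http-equiv="content-type"> tag."""
    match = re.search(
        r'<meta[^>]+content\s*=\s*["\'][^"\']*charset\s*=\s*([\w-]+)',
        html,
        re.IGNORECASE,
    )
    return match.group(1).lower() if match else None

# Hypothetical values: a server announcing ISO-8859-1 for a page that claims UTF-8.
header = "text/html; charset=ISO-8859-1"
meta = '<meta http-equiv="content-type" content="text/html; charset=UTF-8" />'

declared_by_server = charset_from_header(header)
declared_by_page = charset_from_meta(meta)
if declared_by_server and declared_by_server != declared_by_page:
    print("mismatch:", declared_by_server, "vs", declared_by_page)
```

The real header value could be read with `urllib.request.urlopen(url).headers["Content-Type"]` and passed to `charset_from_header`. If the header carries no charset at all, the function returns `None` and only the meta declaration applies.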

funlw65

Re: Semantic data extraction issue
« Reply #7 on: May 23, 2008, 08:05:47 PM »

Maybe your title with French chars must come after this:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Sven

Re: Semantic data extraction issue
« Reply #8 on: May 24, 2008, 08:54:19 AM »

Moving it down gives the same result with those bloody chars: Ã© instead of é
What does that mean?
The prologue and header seem OK:
Quote
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
<?php title(); ?>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="content-language" content="fr" />
I don't get it at all. ???
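
A sequence such as "Ã©" showing up where "é" was expected is the usual signature of UTF-8 bytes being decoded as Windows-1252 or ISO-8859-1 at some step (file, database connection, or HTTP header). The Python sketch below reproduces that mistake and reverses it; `mojibake` and `repair` are hypothetical helper names, and the reversal assumes exactly one wrong decode took place, which may not match what happens on the site.

```python
def mojibake(text, wrong_encoding="cp1252"):
    """Show how UTF-8 text looks when its bytes are decoded with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(garbled, wrong_encoding="cp1252"):
    """Reverse the mistake, assuming exactly one wrong decode happened."""
    return garbled.encode(wrong_encoding).decode("utf-8")

garbled = mojibake("rédacteur")
print(garbled)          # rÃ©dacteur
print(repair(garbled))  # rédacteur
```

If the repaired text reads correctly, the stored data is probably fine and the wrong charset is being applied on output. Text that was mangled twice, or that mixes encodings, will not round-trip this cleanly.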

EDIT: nope. I still have the same problem in an article (and on any page). See the result there

« Last Edit: May 24, 2008, 11:13:17 AM by Sven »

Sven

Re: Semantic data extraction issue
« Reply #9 on: May 24, 2008, 12:13:26 PM »

I'm gonna open a new thread for this char issue.
I'm not sure but I believe it comes from the DB. :(

philmoz

  • High flyer
  • ULTIMATE member
  • ******
  • Karma: 161
  • Posts: 1988
    • fiddle 'n fly
Re: Semantic data extraction issue
« Reply #10 on: May 25, 2008, 02:23:03 AM »

http://www.lehtml.com/xhtml/docmini.html

I don't know if the doctype change is legit, but it might work. ??
Of all the things I have lost, it is my mind that I miss the most.

Sven

Re: Semantic data extraction issue
« Reply #11 on: May 25, 2008, 09:22:49 AM »

Phil, the subject has been split: see this thread for the char issue.