GIDForums  

  #1  
27-Mar-2004, 06:16
JdS
Senior Member
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371

Building a search engine and handling HTML entities.


I am trying to build a little search engine of my own for my next site. I am currently working out how to store the data that my little crawler will extract from my own web pages, for all my present and future sites.

The question I am asking myself, and which I hope you will help me decide, is this: what am I supposed to do with HTML entities such as &amp;, &quot;, &nbsp;, &trade;, &gt; or &lt;? Do I remove them from the content, or do I translate them into the characters they represent before saving the content to the search index?
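To make the two options concrete, here is a minimal Python sketch (the thread itself does not name a language). The function names `strip_entities` and `decode_entities` and the sample string are illustrative assumptions, not part of any existing crawler; `html.unescape` is the standard-library call that does the translation.

```python
import html
import re

# Matches named (&amp;), decimal (&#169;) and hexadecimal (&#xA9;) references.
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def strip_entities(text: str) -> str:
    """Option 1: remove every entity reference, leaving a space in its place."""
    return ENTITY_RE.sub(" ", text)


def decode_entities(text: str) -> str:
    """Option 2: translate entity references into the characters they stand for."""
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0; fold it into an ordinary space for indexing.
    return decoded.replace("\u00a0", " ")


# Hypothetical snippet of crawled text.
sample = "Use &lt;b&gt; for bold &amp; &quot;strong&quot; text&nbsp;&trade;"

print(strip_entities(sample))   # entities are gone, along with their meaning
print(decode_entities(sample))  # entities become <, >, &, ", space and ™
```

Stripping loses information (the `<b>` in the sample turns into a bare `b`), while decoding keeps the text searchable as a reader would see it.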
  #2  
27-Mar-2004, 09:39
BobbyDouglas
Regular Member
Join Date: Aug 2003
Posts: 789
If you are going to offer cached copies of pages, you will need to leave the entities in place.

Can you explain a bit more about why you would not want them?
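If cached pages were wanted, one way to satisfy both needs would be to store two copies of each page: the markup as fetched, and a decoded copy used only for matching. The sketch below is an assumption about how that could look in Python; `PageRecord`, `make_record` and the example URL are hypothetical names, and tag stripping is deliberately left out.

```python
import html
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    cached_html: str  # markup exactly as fetched, entities intact
    search_text: str  # entity-decoded copy, used only for matching


def make_record(url: str, fetched_html: str) -> PageRecord:
    """Keep the original markup for the cache and a decoded copy for search."""
    return PageRecord(
        url=url,
        cached_html=fetched_html,
        search_text=html.unescape(fetched_html),
    )


record = make_record("http://example.com/", "<p>Fish &amp; chips</p>")
print(record.cached_html)
print(record.search_text)
```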
  #3  
27-Mar-2004, 16:21
JdS
No, no cached pages. I was asking because I wasn't sure. I guess I need them stored already translated, since a user might search using a snippet of code. If I stored them as HTML entities instead, wouldn't a search term like "<html>" fail to match, since the index would hold "&lt;html&gt;" rather than "<html>"?
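The mismatch described above can be checked with a small Python sketch. The `matches` function is a naive substring search standing in for a real index lookup, and the sample text is made up for illustration.

```python
import html

# Hypothetical page text as the crawler would extract it from the source.
raw = "Every page starts with &lt;html&gt; and ends with &lt;/html&gt;."

index_raw = raw.lower()                      # entities stored untouched
index_decoded = html.unescape(raw).lower()   # entities translated first


def matches(index_text: str, query: str) -> bool:
    """Naive case-insensitive substring search (stand-in for a real index)."""
    return query.lower() in index_text


query = "<html>"
print(matches(index_raw, query))      # False: the index holds "&lt;html&gt;"
print(matches(index_decoded, query))  # True: the index holds "<html>"
```

An alternative would be to encode the user's query instead of decoding the content, but then every query has to be encoded in exactly the same way the pages were, which is easier to get wrong.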
 
 
