GIDForums  

  #1  
Old 29-May-2005, 01:52
JdS
Senior Member
 
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371

If you had to write a search engine...


Some of you may know this already, but off and on in the past I have been playing with the idea of writing a simple little search engine for myself -- it's like a hobby of mine, you could say...

Over the last few days, I have been looking at my notes again and I made some progress writing some new code today. However, I am stuck trying to figure out something I have not even thought about before today.

Let's say the code has successfully grabbed the contents off a web page... and this is what the text looks like after the HTML markup is removed...

bitval = 0 reverse loop thru array (from the end to beginning) 'OR' the current array location into bitval Shift bitval 1 bit left using <<

What do you do with the non-word characters (and HTML entities) like "= ( ) ' <"? Would you save them? Would you ignore them? Or would they figure in your algorithm in some other way?

I realise this is not a C or C++ programming question at all -- it's more of an algorithm discussion that only a computer programmer can appreciate (not a webmaster, for sure), which is why I have posted it here.
  #2  
Old 29-May-2005, 08:14
dsmith
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
J, you know much more about search engines than I will ever hope to know. So disregard my response if it seems silly.

My first thought was to simply throw out symbols. However, the more I thought about it, the more I think the symbols (at least some of them) could be used to determine weight.

Generally, things in quotation marks would be more "significant".

Things in parentheses tend to be off topic (I have a habit of using these quite a bit).

Not sure if that helps, but that is where my thoughts drifted...
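
To make the weighting idea above concrete, here is a minimal sketch in Python. It only illustrates the suggestion (quoted words count a little more, parenthesised words a little less); the function name `weigh_words` and the two multipliers are made-up values for illustration, not anything a real search engine is known to use.

```python
import re

# A token is either a run of word characters or one single symbol.
TOKEN_RE = re.compile(r'(?P<word>\w+)|(?P<sym>[^\w\s])')

QUOTE_BOOST = 1.5    # hypothetical multiplier for words inside "double quotes"
PAREN_PENALTY = 0.5  # hypothetical multiplier for words inside (parentheses)

def weigh_words(text):
    """Return (word, weight) pairs, using symbols only as weighting hints."""
    weights = []
    in_quotes = False
    paren_depth = 0
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if m.lastgroup == "word":
            weight = 1.0
            if in_quotes:
                weight *= QUOTE_BOOST
            if paren_depth > 0:
                weight *= PAREN_PENALTY
            weights.append((tok, weight))
        elif tok == '"':
            in_quotes = not in_quotes          # toggle on every double quote
        elif tok == "(":
            paren_depth += 1
        elif tok == ")":
            paren_depth = max(0, paren_depth - 1)
    return weights

print(weigh_words('the "significant" part (an aside)'))
```

Unbalanced quotes would skew every word after them, so a real implementation would need some way to reset the state, for example at paragraph boundaries.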
  #3  
Old 29-May-2005, 08:25
JdS
Senior Member
 
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371
Great! Just the man I have been waiting to hear from...

I suppose, at this point, there's no right or wrong answer; just thoughts and opinions.

Even so, I don't think it's wise for me to "weight" any word(s) based on their being enclosed within quotes, for example. That would be too much... Still, I don't know what to do with these non-word characters...

From my own observations of Google's results and the snippets that they return with each link, it does appear that they 'store' these special characters after all. Whether these "characters" play any part in their search algorithm is anybody's guess.

Anyway, I hope you can think about it a bit more, and I hope some others will offer their opinions too.
  #4  
Old 30-May-2005, 00:42
WaltP
Outstanding Member
 
Join Date: Feb 2004
Location: Midwest US
Posts: 3,258
My initial reaction is to treat them as whitespace. IMO the only time you would consider non-alphanumerics as significant is when they are typed into the search string.

Slightly more difficult is to treat them as a special type of whitespace: ignored, but not a space. That way, if you search for "program book" you won't get a hit on
Quote:
after you finish the program, book your flight
This is assuming your engine can deal with phrases, which none of the others seem to do.
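
One possible reading of the "special type of whitespace" idea, sketched in Python: symbols are not stored, but each run of symbols still uses up a position number, so the words on either side of it are no longer adjacent. The function names `index_positions` and `phrase_match` are invented for this sketch, and it assumes a phrase search means "these words at consecutive positions".

```python
import re

# Group 1: a word. Group 2: a run of symbols (neither word characters nor spaces).
TOKEN_RE = re.compile(r"(\w+)|([^\w\s]+)")

def index_positions(text):
    """Return (word, position) pairs; a run of symbols leaves a gap in positions."""
    pairs = []
    pos = 0
    for m in TOKEN_RE.finditer(text.lower()):
        if m.group(1) is not None:
            pairs.append((m.group(1), pos))
        # Words and symbol runs both consume a position; only words are stored.
        pos += 1
    return pairs

def phrase_match(text, phrase):
    """True if the words of phrase occur in text at consecutive positions."""
    pairs = index_positions(text)
    words = [w for w, _ in index_positions(phrase)]
    if not words:
        return False
    n = len(words)
    for i in range(len(pairs) - n + 1):
        window = pairs[i:i + n]
        same_words = [w for w, _ in window] == words
        adjacent = all(window[k + 1][1] - window[k][1] == 1 for k in range(n - 1))
        if same_words and adjacent:
            return True
    return False

print(phrase_match("after you finish the program, book your flight", "program book"))
print(phrase_match("I bought a program book yesterday", "program book"))
```

The comma takes a position between "program" and "book" in the first sentence, so only the second sentence matches. A side effect of this simple rule is that apostrophes and hyphens also break phrases, which may or may not be what you want.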
  #5  
Old 30-May-2005, 03:46
JdS
Senior Member
 
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371
When I searched for, say, "=>'apple'", it appeared that the non-alphanumeric characters were not taken into consideration at all.

But I am not there (handling a search query) yet -- I am at the stage of "analysing" a recently crawled page, i.e. extracting all the words off a web page and storing them. I just don't know what to do with these "symbols" and non-word characters.

After thinking about it, I have decided to take your advice and store them, just like words, but only because I don't know what to do otherwise. I don't think they will figure at all in the final "search algorithm" that I will, hopefully, write someday.

So if I have to deal with an entity like &quot;, I will translate it to " and save it in the list of words. All other non-word characters will be stored individually in this same list.
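
A minimal sketch of that approach in Python, assuming the HTML markup has already been stripped and only entities remain in the text. `html.unescape` from the standard library does the entity translation; the function name `tokenize` and the sample string are made up for illustration.

```python
import html
import re

# Either a run of word characters, or exactly one non-word, non-space character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Decode HTML entities, then list words and individual symbols in order."""
    decoded = html.unescape(text)   # e.g. &quot; becomes " and &lt; becomes <
    return TOKEN_RE.findall(decoded)

sample = "Shift bitval 1 bit left using &lt;&lt; (&quot;OR&quot; it)"
print(tokenize(sample))
# ['Shift', 'bitval', '1', 'bit', 'left', 'using', '<', '<', '(', '"', 'OR', '"', 'it', ')']
```

Because each symbol is stored individually, "<<" ends up as two separate "<" entries. Keeping the list in document order means the original text can still be roughly reconstructed for result snippets.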

Thank you for your input guys.

Placing this thread back in the gsearch forum.