#1
If you had to write a search engine...

Some of you may know this already, but off and on in the past I have been playing with the idea of writing a simple little search engine for myself -- it's like a hobby of mine, you could say...

Over the last few days, I have been looking at my notes again, and I made some progress writing some new code today. However, I am stuck on something I had not even thought about before today.

Let's say the code has successfully grabbed the contents of a web page, and this is what the text looks like after the HTML markup is removed:

bitval = 0
reverse loop thru array (from the end to beginning)
'OR' the current array location into bitval
Shift bitval 1 bit left using <<

What do you do with the non-word characters (and HTML entities) like "= ( ) ' <"? Would you save them? Would you ignore them? Or would they even figure in your algorithm in other ways?

I realise this is not a C or C++ programming question at all -- it's more of an algorithm discussion that only a computer programmer can appreciate (not a webmaster, for sure), which is why I have posted it here.
__________________
J de Silva Learning Journal | GIDForums™ | GIDNetwork™ | GIDWebhosts™ | GIDSearch™
#2
J, you know much more about search engines than I will ever hope to know. So disregard my response if it seems silly.

My first thought was to simply throw out symbols. However, as I thought about it, I think the symbols (at least some of them) may be used to determine weight.

Generally, things in quotation marks would be more "significant". Things in parentheses tend to be off topic (I have a habit of using these quite a bit).

Not sure if that helps, but that is where my thoughts drifted...
__________________
The best damn Sports Blog period.
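
The weighting idea in the post above could look something like the following rough sketch. The weight values (`BASE`, `QUOTED`, `PARENTHETICAL`) and the function name `weighted_words` are made up for illustration; the thread gives no numbers, and the sketch only understands straight double quotes and round brackets.

```python
import re

# Hypothetical weights: the thread suggests quoted text may matter more and
# parenthetical text less, but does not say by how much.
BASE = 1.0
QUOTED = 2.0
PARENTHETICAL = 0.5


def weighted_words(text):
    """Return (word, weight) pairs; symbols only adjust the weight."""
    result = []
    in_quotes = False
    paren_depth = 0
    # Split into runs of word characters and single symbols.
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok == '"':
            in_quotes = not in_quotes          # toggle on every double quote
        elif tok == "(":
            paren_depth += 1
        elif tok == ")":
            paren_depth = max(0, paren_depth - 1)
        elif re.fullmatch(r"\w+", tok):
            weight = BASE
            if in_quotes:
                weight *= QUOTED
            if paren_depth:
                weight *= PARENTHETICAL
            result.append((tok.lower(), weight))
        # Any other symbol is ignored in this sketch.
    return result


pairs = weighted_words('the "search engine" idea (a hobby)')
```

Here the symbols never reach the word list themselves; they only change the weight attached to neighbouring words, which is one way of "using" them without storing them.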
#3
Great! Just the man I have been waiting to hear from...

I suppose, at this point, there's no right or wrong answer; just thoughts and opinions. Even so, I don't think it's wise for me to "weight" any word(s) based on whether they are enclosed within quotes, for example. That would be too much...

Still, I don't know what to do with these non-word characters... From my own observations of Google's results and the snippets that they return with each link, it does appear that they 'store' these special characters after all. Whether these "characters" play any part in their search algorithm is anybody's guess.

Anyway, I hope you will think about it a bit more, and I hope some others will offer their opinions too.
#4
My initial reaction is to treat them as whitespace. IMO the only time you would consider non-alphanumerics significant is when they are typed into the search string.

Slightly more difficult is to treat them as a special type of whitespace: ignored, but not a space. That way, if you search for "program book", you won't get a hit on Quote:
__________________
Got a cough? Go home tonight and eat a whole box of Ex-Lax. Tomorrow, you'll be afraid to cough. -- Pearl Williams
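
The two options in the post above can be compared with a small sketch. Everything here is an assumption made for illustration: the `BREAK` marker, the function names, the ASCII-only notion of "alphanumeric", and the sample sentence, which is invented because the quoted example did not survive in the thread.

```python
import re

# Hypothetical marker meaning "a symbol was here". It is not a string,
# so it can never equal a query word.
BREAK = None


def tokens_symbols_as_space(text):
    """Option 1: every non-alphanumeric character acts as plain whitespace."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokens_symbols_as_break(text):
    """Option 2: symbols are ignored as words but leave a BREAK marker."""
    out = []
    for m in re.finditer(r"([a-z0-9]+)|[^a-z0-9\s]+", text.lower()):
        if m.group(1):
            out.append(m.group(1))
        elif not out or out[-1] is not BREAK:
            out.append(BREAK)      # collapse runs of symbols into one marker
    return out


def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear next to each other in `tokens`."""
    words = tokens_symbols_as_space(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


# Invented sample text: "program" and "book" are separated by a full stop.
page = "Write the program. Book a flight later."

hit_plain = contains_phrase(tokens_symbols_as_space(page), "program book")
hit_break = contains_phrase(tokens_symbols_as_break(page), "program book")
```

With plain whitespace handling the phrase "program book" matches across the full stop; with the break marker it does not, while a phrase such as "book a flight" still matches in both cases.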
#5
When I searched for, say, "=>'apple'", it appeared that the non-alphanumeric characters were not taken into consideration at all.

But I am not there (handling a search query) yet -- I am at the stage of "analysing" a recently crawled page, i.e. extracting all the words off a web page and storing them. I just don't know what to do with these "symbols" and non-word characters.

After thinking about it, I have decided to take your advice and store them, just like a word, but only because I don't know what to do otherwise. I don't think they will figure at all in the final "search algorithm" that I will, hopefully, someday write.

So if I have to deal with an entity like &quot;, I will translate it to " and save it in the list of words. All other non-word characters will be stored individually in this same list.

Thank you for your input, guys. Placing this thread back in the gsearch forum.
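
The plan in the last post (translate entities, then store words and individual symbols in one list) might be sketched as below. This is only an illustration, not the author's code: the function name `page_tokens` is hypothetical, and `\w` is assumed to be an acceptable definition of a "word character".

```python
import html
import re


def page_tokens(text):
    """Decode HTML entities, then split into words and single symbols."""
    decoded = html.unescape(text)   # e.g. "&quot;" becomes '"'
    # \w+ grabs a run of word characters; [^\w\s] grabs exactly one
    # non-word, non-space character, so "<<" is stored as two "<" tokens.
    return re.findall(r"\w+|[^\w\s]", decoded)


# Part of the sample text from the first post, with "<<" written as entities.
sample = "bitval = 0 Shift bitval 1 bit left using &lt;&lt;"
tokens = page_tokens(sample)
```

Because every symbol ends up in the same ordered list as the words, the original text (and a snippet with its punctuation) can be rebuilt later, even if the symbols never take part in matching.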