#1
Google searchengine programming

Howdy,

I want to know how searchengines such as Google, WebCrawler etc. really work, as I am learning PHP programming and want to write a searchengine. I have read around 10 websites, found on Google, about “how searchengines work”, and not a single one of them makes it clear whether it is the spider, the index or the search software that does the ranking according to the searchengine’s ranking algorithm.

All they ever say is that a searchengine has 3 pieces of software: a) the spider, b) the index, c) the search system (search-box, template, etc.). The spiders crawl the web collecting webpages and forward them to the index, and then the search software searches the index for the sought keywords/phrases. Also, some say that the spiders copy the whole website into the index. So, in other words, there are 2 copies of a website: one residing on the website owner’s webserver and the other residing in the index of the searchengine.

So now, from all this, I can only assume 3 possibilities for how a searchengine works:

1. The spider does not do any ranking. All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index. The index is nothing but a big text file (.txt, .html) on the searchengine’s webserver that keeps a full copy (HTML code) of each website. The search system, when searching and finding links in the index, does the ranking according to the searchengine’s ranking algorithm. This means neither the spider nor the index is responsible for the ranking, because these 2 parts of the searchengine are not taught the ranking algorithm.

OR

2. The spider does the ranking according to the searchengine’s ranking algorithm. It visits a website, grabs all its HTML code (copies the website) and then dumps the HTML code into the index. When it dumps the copies of websites, it ranks them according to the searchengine’s algorithm. The index is nothing but a big text file (.txt, .html) on the searchengine’s webserver that keeps a full copy (HTML code) of each website. The search system, when searching and finding links in the index, does not do any ranking, because that has already been done by the spider when dumping the data into the index. This means the spider is responsible for the ranking, and neither the index nor the search system is, because these 2 parts of the searchengine are not taught the ranking algorithm.

OR

3. The spider does not do any ranking. All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index. The index is not only a big text file (.txt, .html) on the searchengine’s webserver that keeps a full copy (HTML code) of each website, but also the system that does the ranking. When it receives data from the spider, it ranks the links in its database according to the searchengine’s ranking algorithm. The search system, when searching and finding links in the index, does not do any ranking. Frankly, all it does is output a copy of certain parts of the index onto the searcher’s screen. This means neither the spider nor the search system is responsible for the ranking, because these 2 parts of the searchengine are not taught the ranking algorithm.

So, which of the 3 assumptions above is correct?
#2
I would guess you can put the ranking in any area that makes sense for the system. There's no requirement that it be in one module or another, so just figure out the best place for it in your opinion and try it there. If it doesn't work, move it.
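
To make the "put it wherever it makes sense" point concrete, here is a minimal sketch in which ranking happens at query time and the scoring function is just a parameter, so it can be swapped or moved without touching the rest. The `PAGES` data, the `term_frequency` scorer and the `search` function are all made up for illustration; real engines use far more elaborate scoring.

```python
from collections import Counter

# Hypothetical in-memory "index": URL -> plain text of the page.
PAGES = {
    "http://example.com/a": "php search engine tutorial",
    "http://example.com/b": "search search search tips",
    "http://example.com/c": "cooking recipes",
}


def term_frequency(text, query_terms):
    """Score a page by how often the query terms occur in it."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query_terms)


def search(pages, query, score=term_frequency):
    """Rank at query time; `score` can be replaced by any other function."""
    terms = query.lower().split()
    scored = [(score(text, terms), url) for url, text in pages.items()]
    # Keep only pages that matched, best score first (ties broken by URL).
    hits = [(s, url) for s, url in scored if s > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in hits]


results = search(PAGES, "search")
print(results)
```

If query-time scoring turns out to be too slow, the same `score` function could instead be called once while the pages are being indexed, with only the resulting order stored.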
#3
Maybe you don't know this, but I am working on creating my own little search engine, though the development has been on hold for a while.

It's hard to simply pick one of the 3 "assumptions" you offer in your post and say that's what they're doing. We don't know whether that's what any of the search engines you mention does with the web pages it collects off the WWW.

I can however say that a spider or search engine bot will certainly not be involved in the "ranking" process. It simply doesn't make sense for it to involve itself in such a complex task. A spider will simply collect the web page off a URL you feed to it - that's it! I suppose you may then want to pass the collected "web page" on to another, separate module (that you will write) to handle the analysing, processing and storing of the data from the retrieved document.

Unless they offer a cached version of a web page (in their index), it would be useless for search engines to store HTML/markup bits in their database. So it is safe to assume that markup is usually filtered out and discarded after a crawled document has been analysed and processed.
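
As a rough illustration of such a separate "analysing" module, the sketch below takes the HTML a spider has already fetched, keeps the visible text and the outgoing links, and throws the markup away. It uses Python's standard `html.parser`; the class name `PageExtractor`, the function `analyse` and the sample page are invented for the example, and the fetching step itself (for instance with `urllib.request.urlopen`) is left out so the sketch runs offline.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style contents and whitespace-only fragments.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def analyse(html):
    """Return (plain_text, links) for one already-fetched document."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


sample = (
    "<html><head><title>Hi</title><style>p{}</style></head>"
    "<body><p>Hello <a href='/next'>world</a></p></body></html>"
)
text, links = analyse(sample)
print(text)
print(links)
```

The extracted links are what a spider would be fed next, while the plain text is what would go on to be indexed.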
J de Silva Learning Journal | GIDForums™ | GIDNetwork™ | GIDWebhosts™ | GIDSearch™ |
#4
Quote:
So, what is this separate module that does the ranking called? I guess the spider collects pages into the index, and the index is nothing but a cache. Then some other agent does the ranking and creates a 2nd index that ranks the links for each keyword, and the query interface only grabs the data from this 2nd index.
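
The "2nd index" described here resembles what is usually called an inverted index: a mapping from each keyword to the pages that contain it. Below is a minimal sketch of that idea, assuming ranking is done once at index time by plain term count; the names `build_ranked_index` and `lookup` and the sample data are hypothetical, and this is not a claim about how any particular search engine does it.

```python
from collections import Counter, defaultdict


def build_ranked_index(cache):
    """cache: URL -> stored page text (the 'first index' acting as a cache)."""
    postings = defaultdict(list)
    for url, text in cache.items():
        for term, count in Counter(text.lower().split()).items():
            postings[term].append((count, url))
    # Ranking happens here, once, at index time:
    # highest term count first, ties broken by URL.
    return {
        term: [url for _, url in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for term, entries in postings.items()
    }


def lookup(ranked_index, keyword):
    """Query interface: no ranking, just read the precomputed list."""
    return ranked_index.get(keyword.lower(), [])


cache = {
    "http://example.com/a": "spider spider index",
    "http://example.com/b": "spider index index index",
}
ranked = build_ranked_index(cache)
print(lookup(ranked, "spider"))
print(lookup(ranked, "index"))
```

With this layout the query interface stays trivial, at the cost of rebuilding the ranked lists whenever pages or the scoring rule change.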
#5
If you seriously want to get somewhere with this endeavour, I can suggest the 3 documents that I looked at when I was figuring out this same subject.
All these articles help you quickly understand the basics and how search engines do what they do; ultimately, it's up to you to add your own ingredients to your own search engine.
#6
Quote:
CHHHHEEEEEEERRRRSSS !!!!!!!!!