#1
Read a .html file, check that file for links [was: I am at a loss....]
I am at a real loss, and I hope I've come to the right place. I have to write a program in C which reads a .html file, checks that file for links, then reads those links for new links, and so forth. So far I have opened the file, read it, and checked whether links exist. The problem I'm having is that once I've found a link, I need to somehow store it to use later. I've looked at strstr and strtok; neither seems to be what I want, or they just plain don't work at all. Any suggestions would be insanely helpful, as I'm completely clued out. The links will be in the form <a href="link.html">. I'm guessing I should use the "" as a delimiter to grab the link.html, though like I said, I have no idea how.
Thanks, salem
Last edited by JdS : 13-May-2004 at 11:38.
Reason: Please use a better title in your thread
#2
Hello salem.
Interestingly enough, I am playing with the same type of problem. This isn't as easy as it may first appear, either, because HTML is such a loosely defined language! My point being that:
HTML Code:
HTML Code:
will both work in HTML. So searching for the quote may not be good enough. My approach is going to be something like this:
If it does you any good, here are some half-baked functions that I have already written to do some things with a file that may (or may not) help:
CPP / C++ / C Code:
I would really like to hear how you come along with this if you get a chance. It is one of the many projects I have that is taking a back seat to the normal routine of life.
Good luck, d
#3
Wow, that's some insane code there.
Luckily for me, though, it has been specified that the format will be <a href="link.html">. I managed to work out how to use strtok. It was a real pain in the ****, since I had tried it before, given up, posted here, then gone back to it. It turned out to be the simplest of things, though I swear I had it that way before, and yet it was giving me segmentation faults that time. Now it's on to actually recursively checking THOSE links that I have just taken out of the original file. C is a weird language. Thanks for the help, though. I hope you succeed in your mission.
Cheers, salem
#4
Quote:
Formats to look for:
http://web.addr/pagename -- possibly 2 or more '.'s
file://c:/path/web.addr -- possibly 2 or more '.'s
ftp://web.addr/path -- possibly 2 or more '.'s, maybe ignore the link
mailto: -- ignore this link
? -- terminates the webpage and starts a query string
: -- after the web address designates a port, so also terminates the webpage
There may be more formats to deal with, but they aren't coming to mind. I believe the only valid characters for a webpage after the http:// are: Alphanumerics . - /
__________________
Got a cough? Go home tonight and eat a whole box of Ex-Lax. Tomorrow, you'll be afraid to cough. -- Pearl Williams
#5
WOW! Way to make this harder for me, Walt. Thanks for the help, though. I am an HTML idiot, so I totally forgot about that. One question: does HTML allow whitespace between the "<" and the "a" in this case:
Quote:
I appreciate the input.
#6
Quote:
#7
Is there no Regular Expression matching in C or C++? If I were doing this and if it's available to the language, I'd be looking around to see if I could use REGEXs to extract stuff like this instead.
__________________
J de Silva Learning Journal | GIDForums™ | GIDNetwork™ | GIDWebhosts™ | GIDSearch™
#8
Quote:
It is not native. I have done a bit of research and found this interesting little tidbit that explains how to get regex functionality out of C, and also why C does not natively "like" regular expressions: http://www.linuxgazette.com/issue55/tindale.html
The article is Linux/GNU related, but I thought it was very good information. Also, for anyone wanting regex support, the GNU C library provides it through the header regex.h. I will be looking into it. As a C programmer, I always think first in terms of string manipulation, and maybe in this case that is not the best approach. Thanks for the input, J.
#9
Re: Read a .html file, check that file for links
I have a similar task to do. I need to search for a word in an HTML file and also store the forward links of that page. Can you help me with this? There is no constraint on language or platform.
#10
Re: Read a .html file, check that file for links
Hey,
I need some basic help. I tried reading a .html file (a webpage I saved from the internet) in C, the same way we read simple text files.
CPP / C++ / C Code:
As fp comes back NULL, I guess .html files are not read this way. Is there any other way to read HTML files in C? Please help!