GIDForums  

  #1  
Old 13-May-2004, 06:25
salemite salemite is offline
New Member
 
Join Date: May 2004
Posts: 2

Read a .html file, check that file for links [was:I am at a loss....]


y0

I am at a real loss and I hope I've come to the right place. I have to write a program in C which reads a .html file, checks that file for links, then reads those links for new links, and so forth.

So far I have opened the file, read it, and checked whether links exist. The problem I'm having is that once I've found a link, I need to somehow store it for later use.

I've looked at strstr and strtok. Neither seems to be what I want, or they just plain don't work at all.

Any suggestions would be insanely helpful, as I'm completely clued out.

The links would be in the form <a href="link.html">. I'm guessing I could use the double quotes as delimiters to grab link.html, though like I said, I have no idea how.

Thanks,
salem
Last edited by JdS : 13-May-2004 at 11:38. Reason: Please use a better title in your thread
  #2  
Old 13-May-2004, 08:47
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
Hello salem.

Interestingly enough, I am playing with the same type of problem. This isn't as easy as it may first appear either, because HTML is such a loosely defined language! My point being that:

HTML Code:
<a href="link.html">
and
HTML Code:
< a href = link.html >

Will both work in html. So searching for the quote may not be good enough.

My approach is going to be something like this:
  1. Search for "<".
  2. When I find "<", search for href keyword. This has to occur before the "=" sign though.
  3. When I find the href keyword read up to and including the "=" sign.
  4. Read everything after the equal sign up to but not including the ">"
  5. Parse through my new string and remove leading and trailing space as well as any quotation marks.
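
Step 5 of the list above could look roughly like the sketch below. The function name clean_value is hypothetical, and the sketch assumes the raw text between the "=" and the ">" holds nothing but the link itself; a tag with further attributes after the href would need more work.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: trims leading/trailing whitespace from s in place,
   then removes one pair of matching surrounding quotes (" or ').
   Returns s for convenience. */
char *clean_value(char *s)
{
    size_t len;
    char *start = s;

    assert(s != NULL);

    /* Skip leading whitespace. */
    while (*start != '\0' && isspace((unsigned char)*start))
        start++;

    /* Drop trailing whitespace. */
    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;

    /* Strip one pair of matching quotes, if present. */
    if (len >= 2 && (start[0] == '"' || start[0] == '\'') &&
        start[len - 1] == start[0]) {
        start++;
        len -= 2;
    }

    /* Shift the cleaned text to the front and terminate it. */
    memmove(s, start, len);
    s[len] = '\0';
    return s;
}
```

The buffer is edited in place, so it has to be a writable array (for example one returned by malloc), not a string literal.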


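If it does you any good, here are some half-baked functions that I have already written to do some things with a file that may (or may not) help.
CPP / C++ / C Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*** Growth step for the read buffers.  256 is an assumed value. ***/
#define BUFSIZE 256
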
/****************************************************************
*** is_match                                                  ***
***   Parameters:                                             ***
***     FILE* fp - file pointer to preopened file             ***
***     char* string - The string to search for               ***
***     char rewind - Flag indicating whether to rewind upon  ***
***                   finding the string                      ***
***   Returns:                                                ***
***     0: As soon as the match string is not found           ***
***     1: Only if the search string is found at the position ***
***   Notes:                                                  ***
***     The caller has already read the first character of    ***
***     string; this function steps back one byte and then    ***
***     compares the file against string from that point.  It ***
***     returns 0 at the first mismatch.  On a mismatch, or   ***
***     when rewind is set, the file position is restored to  ***
***     where it was on entry.                                ***
****************************************************************/
int is_match(FILE* fp, char* string, char rewind)
{
	int 	index = 0;
	int		read;						/*** int, not char, so EOF can be told apart from data ***/
	fpos_t	save_pos;
	
	fgetpos(fp,&save_pos);
	fseek(fp,-1,SEEK_CUR);				/*** Step back to re-read the caller's last character ***/
	while( ( read = fgetc(fp) )!= EOF){
		if(read == (unsigned char)string[index])
			index++;					/*** Go to next search letter ***/
		else
			break;
		if(string[index] == 0)  		/*** If the index is at the end of the search string it is a match ***/
			break;						/*** Exit loop upon success ***/
	}
	if( string[index] || rewind)
		fsetpos(fp,&save_pos);
	if(string[index] == 0)
		return(1);						/*** Found the word ***/
	else
		return(0);						/*** Never found it. ***/
}


/****************************************************************
*** find_word                                                 ***
***   Parameters:                                             ***
***     FILE* fp - file pointer to preopened file             ***
***     char* search - The string to search for               ***
***   Returns:                                                ***
***     0: If the search string is not found                  ***
***     1: If the search string is found                      ***
***   Notes:                                                  ***
***     This function will search the file (fp) and will try  ***
***     to find the entire string (search).  If the string is ***
***     found, the file will be positioned just after the     ***
***     found string.  If it is not found, the file pointer   ***
***     will be at the EOF                                    ***
****************************************************************/
int find_word(FILE* fp, char *search)
{
	int		read;							/*** int, not char, so EOF can be told apart from data ***/
	int		match = 0;

	do{
		read=fgetc(fp);
		if(read == (unsigned char)*search){
			if((match = is_match(fp,search,0)) != 0)
				break;						/*** Exit loop upon success ***/
		}
	}while(read!=EOF);
	if(match)
		return(1);						/*** Found the word ***/
	else
		return(0);						/*** Never found it. ***/
}



/****************************************************************
*** read_until                                                ***
***   Parameters:                                             ***
***     FILE* fp - file pointer to preopened file             ***
***     char tag - character to read up to                    ***
***   Returns:                                                ***
***     A pointer to the properly allocated string            ***
***   Notes:                                                  ***
***     This function will start from the current file        ***
***     position of the file (fp) and will copy the contents  ***
***     from that point to the first occurrence of the        ***
***     character (tag) or until the eof is found.  Caution:  ***
***     this routine does not indicate whether the eof        ***
***     occurred or not.                                      ***
****************************************************************/
char* read_until(FILE* fp, char tag)
{
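	int		bufsize = BUFSIZE;
	int		count = 0;					/*** Number of characters copied so far ***/
	char*	string;
	char* 	buffer = (char*) malloc(bufsize * sizeof(char) );
	char*	grown;
	int		cread;						/*** int, not char, so EOF can be told apart from data ***/

	if(buffer == NULL)
		return(NULL);					/*** Out of memory ***/
	while( ( (cread = fgetc(fp) ) != EOF) && (cread != (unsigned char)tag) ){
		*(buffer+count) = (char)cread;
		count++;
		if( count >= bufsize){
			bufsize += BUFSIZE;		
			grown = (char*) realloc(buffer, bufsize * sizeof(char) );
			if(grown == NULL){			/*** Keep the old block so it can be freed ***/
				free(buffer);
				return(NULL);
			}
			buffer = grown;
		}
	}
	*(buffer+count) = 0;							/*** Terminate string ***/
	string = (char*) malloc( (count + 1) * sizeof(char) );
	if(string != NULL)
		memcpy(string,buffer,count+1);
	free(buffer);
	if(cread != EOF)
		fseek(fp,-1,SEEK_CUR);			/*** Back up by one so the tag stays unread; not needed at EOF ***/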

	return(string);
}



/****************************************************************
*** read_until_string                                         ***
***   Parameters:                                             ***
***     FILE* fp - file pointer to preopened file             ***
***     char* search - exact string to read up to             ***
***   Returns:                                                ***
***     A pointer to the properly allocated string            ***
***   Notes:                                                  ***
***     This function will start from the current file        ***
***     position of the file (fp) and will copy the contents  ***
***     from that point to the first occurrence of the exact  ***
***     string (search) or until the eof is found.  Caution:  ***
***     this routine does not indicate whether the eof        ***
***     occurred or not.                                      ***
****************************************************************/
char* read_until_string(FILE* fp, char* search)
{
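	int		bufsize = BUFSIZE;
	int		count = 0;					/*** Number of characters copied so far ***/
	char*	string;
	char* 	buffer = (char*) malloc(bufsize * sizeof(char) );
	char*	grown;
	int		cread;						/*** int, not char, so EOF can be told apart from data ***/
	
	if(buffer == NULL)
		return(NULL);					/*** Out of memory ***/
	while( (cread = fgetc(fp) ) != EOF){								/*** This is set to be a binary file ***/
		if(cread == (unsigned char)*search)	/*** Is this the start of the end?  ***/			
			if(is_match(fp,search,1))
				break;				
		*(buffer+count) = (char)cread;
		count++;
		if( count >= bufsize){
			bufsize += BUFSIZE;
			grown = (char*) realloc(buffer, bufsize * sizeof(char) );
			if(grown == NULL){			/*** Keep the old block so it can be freed ***/
				free(buffer);
				return(NULL);
			}
			buffer = grown;
		}
	}
	*(buffer+count) = 0;								/*** Terminate string ***/
	string = (char*) malloc( (count+1) * sizeof(char));
	if(string != NULL)
		memcpy(string,buffer,count+1);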
	free(buffer);

	return(string);
}

I would really like to hear how you come along with this if you get a chance. It is one of the many projects that I have that is taking a back seat to the normal routine of life.

Good luck,
d
  #3  
Old 13-May-2004, 08:55
salemite salemite is offline
New Member
 
Join Date: May 2004
Posts: 2
Wow, that's some insane code there.

Though lucky for me, it has been specified that the format will be <a href="link.html">.

I managed to work out how to use strtok. It was a real pain in the ****, since I had tried it before, given up, posted here, then went back to it. Turns out to have been the simplest of things. Though I swear I had it that way before, and yet it was giving me Segmentation Faults that time.
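
For readers following along, here is one way the strtok approach might look. This is only a sketch under assumptions: collect_links, MAX_LINKS and MAX_LINK_LEN are made-up names, and it relies on every link being written exactly as href="..." with a non-empty, double-quoted value. Note that strtok writes into the string it is given, so passing it a string literal is undefined behaviour and a common source of segmentation faults; whether that was the cause here is a guess.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINKS 32
#define MAX_LINK_LEN 256

/* Hypothetical helper: copies every href="..." value found in line into
   links[], returning how many were stored.  line is modified by strtok. */
int collect_links(char *line, char links[][MAX_LINK_LEN], int max_links)
{
    int count = 0;
    char *token;

    assert(line != NULL && links != NULL);

    /* Split on the double quote, so quoted values become their own tokens. */
    token = strtok(line, "\"");
    while (token != NULL && count < max_links) {
        size_t len = strlen(token);

        /* A token ending in href= means the next token is the link. */
        if (len >= 5 && strcmp(token + len - 5, "href=") == 0) {
            token = strtok(NULL, "\"");
            if (token == NULL)
                break;
            /* snprintf truncates safely if the link is too long. */
            snprintf(links[count], MAX_LINK_LEN, "%s", token);
            count++;
        }
        token = strtok(NULL, "\"");
    }
    return count;
}
```

Because the links are copied into their own array, they remain available after the line buffer is reused for the next read.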

Now it's onto actually recursively checking THOSE links that I have just taken out of the original file.

C is a weird language. Thanks for the help though. I hope you succeed in your mission.

Cheers,
salem
  #4  
Old 13-May-2004, 21:00
WaltP WaltP is offline
Outstanding Member
 
Join Date: Feb 2004
Location: Midwest US
Posts: 3,258
Quote:
Originally Posted by dsmith
My approach is going to be something like this:
  1. Search for "<".
  2. When I find "<", search for href keyword. This has to occur before the "=" sign though.
  3. When I find the href keyword read up to and including the "=" sign.
  4. Read everything after the equal sign up to but not including the ">"
  5. Parse through my new string and remove leading and trailing space as well as any quotation marks.
That's also not quite it. Try this:
  • Search for "<a" -- remember the location after the 'a'
  • Search for ">" -- remember this location
  • Between these look for "href". An anchor tag may not have an href attribute:
    <a name="forumloc"> sets a location you can jump to with
    <a href="pagename.html#forumloc">
  • Look for '='
  • Look for nonwhitespace
  • --If this char is ", search for second "
  • --If not, search for the first character that is not valid in a web address (see below)

Formats to look for
http://web.addr/pagename -- possibly 2 or more '.'s
file://c:/path/web.addr -- possibly 2 or more '.'s
ftp://web.addr/path -- possibly 2 or more '.'s, maybe ignore the link
mailto: -- ignore this link
? -- terminates the webpage and starts a query string
: -- after the web address designates a port, so also terminates the webpage

There may be more formats to deal with, but they aren't coming to mind. I believe the only valid characters for a webpage after the http:// are:
Alphanumerics . - /
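
The search steps in this post can be sketched on an in-memory string as below. The name next_href is hypothetical. The sketch simplifies in several places: it is case-sensitive, it assumes no ">" appears inside an attribute value, it does not check that "href" is a separate attribute name, and for unquoted values it simply stops at whitespace or the end of the tag instead of testing against a list of valid address characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next <a ...> tag at or after html that
   has an href attribute, copies the value into out, and returns a pointer
   just past the tag's '>'.  Returns NULL when no such tag remains. */
const char *next_href(const char *html, char *out, size_t outsize)
{
    assert(html != NULL && out != NULL && outsize > 0);

    while ((html = strstr(html, "<a")) != NULL) {
        const char *p = html + 2;             /* location after the 'a' */
        const char *end = strchr(p, '>');     /* end of this tag */
        const char *val;
        size_t len;

        if (end == NULL)
            return NULL;                      /* unterminated tag */
        html = end + 1;
        if (!isspace((unsigned char)*p))
            continue;                         /* <abbr>, <address>, ... */

        /* Look for "href" between the two remembered locations. */
        while (end - p >= 4 && strncmp(p, "href", 4) != 0)
            p++;
        if (end - p < 4)
            continue;                         /* e.g. <a name="..."> */
        p += 4;

        /* Look for '=' and then the first non-whitespace character. */
        while (p < end && isspace((unsigned char)*p))
            p++;
        if (p == end || *p != '=')
            continue;
        p++;
        while (p < end && isspace((unsigned char)*p))
            p++;

        if (*p == '"') {                      /* quoted: up to second " */
            val = ++p;
            while (p < end && *p != '"')
                p++;
        } else {                              /* unquoted: up to whitespace */
            val = p;
            while (p < end && !isspace((unsigned char)*p))
                p++;
        }
        len = (size_t)(p - val);
        if (len >= outsize)
            len = outsize - 1;                /* truncate to fit */
        memcpy(out, val, len);
        out[len] = '\0';
        return html;
    }
    return NULL;
}
```

Calling it in a loop, feeding the returned pointer back in as html, walks through every link in the buffer. Filtering out mailto: or ftp: links would be a separate check on the copied value.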
__________________

Got a cough? Go home tonight and eat a whole box of Ex-Lax. Tomorrow, you'll be afraid to cough.
-- Pearl Williams
  #5  
Old 13-May-2004, 21:06
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
WOW! Way to make this harder for me, Walt. Thanks for the help though. I am an HTML idiot, so I totally forgot about that. One question: does HTML allow whitespace between "<" and the "a" in this case?

Quote:
Originally Posted by WaltP
Search for "<a" -- remember the location after the 'a'

I appreciate the input.
  #6  
Old 13-May-2004, 21:18
WaltP WaltP is offline
Outstanding Member
 
Join Date: Feb 2004
Location: Midwest US
Posts: 3,258
Quote:
Originally Posted by dsmith
One question: does HTML allow whitespace between "<" and the "a" in this case?
No. At least that makes it easier.
  #7  
Old 14-May-2004, 09:21
JdS JdS is offline
Senior Member
 
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371
Is there no Regular Expression matching in C or C++? If I were doing this and if it's available to the language, I'd be looking around to see if I could use REGEXs to extract stuff like this instead.
  #8  
Old 14-May-2004, 09:37
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
Quote:
Originally Posted by JdS
Is there no Regular Expression matching in C or C++? If I were doing this and if it's available to the language, I'd be looking around to see if I could use REGEXs to extract stuff like this instead.

It is not native. I have done a bit of research and found this interesting little tidbit that tells how to get regex functionality out of C and also explains why C does not natively "like" regular expressions.

http://www.linuxgazette.com/issue55/tindale.html

The article is Linux/GNU related, but I thought it was very good information. Also, for anyone wanting regex support, the GNU C library provides the POSIX regular expression functions through the regex.h header. I will be looking into it.
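
As a rough illustration of the regex.h route, the sketch below uses the POSIX calls regcomp, regexec and regfree. The function name regex_href and the pattern are assumptions made for this example; the pattern only handles double-quoted values, and the code needs a POSIX system.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the first double-quoted href value found in
   html into out.  Returns 1 on a match, 0 otherwise. */
int regex_href(const char *html, char *out, size_t outsize)
{
    /* Group 1 captures the text between the quotes. */
    const char *pattern =
        "<a[[:space:]][^>]*href[[:space:]]*=[[:space:]]*\"([^\"]*)\"";
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    assert(html != NULL && out != NULL && outsize > 0);

    /* REG_ICASE lets <A HREF=...> match as well. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return 0;
    if (regexec(&re, html, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outsize)
            len = outsize - 1;                /* truncate to fit */
        memcpy(out, html + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

To collect every link, the search can be repeated starting at html + m[0].rm_eo after each match; compiling the pattern once outside the loop would avoid repeated regcomp calls.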

As a C programmer, I always think first in terms of string manipulation and maybe in this case that is not the best approach.

Thanks for the input J.
  #9  
Old 04-Dec-2007, 04:50
ankit07 ankit07 is offline
New Member
 
Join Date: Dec 2007
Posts: 2

Re: Read a .html file, check that file for links


I have a similar task to do...
I need to search for a word in an HTML file and also store the forward links of that page. Can you help me do this?
There is no constraint on language or platform.
  #10  
Old 16-Jan-2008, 07:27
nikhil nikhil is offline
New Member
 
Join Date: Jan 2008
Posts: 1

Re: Read a .html file, check that file for links


Hey,
I need some basic help.
I tried reading a .html file (a web page I saved from the internet)
in C, the same way we read simple text files.
CPP / C++ / C Code:
FILE * fp;
fp = fopen("filename.html","r");

Since fp comes back NULL, I guess .html files are not read this way.
Is there any other way to read HTML files in C?

Please help!
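
On the fopen question: an .html file is ordinary text as far as fopen is concerned, so a NULL return usually points to a wrong path, a different working directory, or missing permissions. A small sketch that reports the reason follows. The name open_html is made up, and it relies on fopen setting errno, which POSIX systems do but the C standard itself does not require.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: opens path for reading and, on failure, prints the
   reason reported by the C library before returning NULL. */
FILE *open_html(const char *path)
{
    FILE *fp;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
    return fp;
}
```

If the message says the file does not exist, trying the full path to the saved page is a reasonable first check.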
 
 
