GIDForums  

  #1  
Old 11-Dec-2003, 08:09
markov
Join Date: Dec 2003
Posts: 1

Parsing an HTML document in C++


Hi all.
I need to parse an HTML document in C++.
The goal is to take an HTML file as input and output the links, titles, and body text separately.
Can somebody tell me how to do it?
  #2  
Old 20-Jan-2004, 15:59
dsmith
Senior Member
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
Parsing an HTML file is like parsing any text file. You just need to go through it, look for the proper keywords, and then extract the information that follows them. I have written a couple of small functions that I can post if you would like. They don't do exactly what you are after, but they extract needed information from an HTML file.
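
As a rough illustration of "look for the keyword, then extract what follows", here is a minimal C++ sketch that works on a page already loaded into a std::string. The helper name text_between and its parameters are made up for this example and are not from the thread. It assumes the markers appear literally in the text (same case, no extra whitespace), which real-world HTML does not guarantee.

```cpp
#include <string>

// Hypothetical helper (not from the thread): returns the text between the
// first occurrence of `open` and the next occurrence of `close`.
// Returns an empty string if either marker is missing.
std::string text_between(const std::string& html,
                         const std::string& open,
                         const std::string& close)
{
    std::string::size_type start = html.find(open);
    if (start == std::string::npos)
        return "";
    start += open.size();                        // skip past the keyword itself
    std::string::size_type end = html.find(close, start);
    if (end == std::string::npos)
        return "";
    return html.substr(start, end - start);
}
```

Under those assumptions, calling text_between(page, "<title>", "</title>") would pull out the page title, and the same call with "<body>" and "</body>" would pull out the raw body markup.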
  #3  
Old 20-Jan-2004, 16:11
JdS
Senior Member
Join Date: Aug 2001
Location: KUL, Malaysia
Posts: 3,371
Hello dsmith,

Please go ahead and post your functions if you like. Many readers find this page through search engines and may find any information you can share useful.
  #4  
Old 20-Jan-2004, 19:01
dsmith
Okay, here are a few basic functions. They are kind of hacked together, but they work. I wrote them for an office football pick pool tracking program, where I used them to automatically extract the games and spreads each week. They worked for my specific purpose, but they obviously need some work depending upon your application.

CPP / C++ / C Code:
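#include <stdio.h>
#include <stdlib.h>

//Reads forward from the current file position until `search` has been read.
//Returns 1 if found (the file position is then just past the match), 0 at end of file.
int find_word(FILE* fp, const char search[])
{
	int		index = 0;
	int		read;		//int, not char, so EOF can be told apart from real data

	while((read = fgetc(fp)) != EOF){
		if(read == (unsigned char)search[index])
			index++;	//Go to next search letter
		else			//Mismatch - start over, re-checking this character
			index = (read == (unsigned char)search[0]) ? 1 : 0;
		if(search[index] == 0)	//Reached the end of the search string: match
			return 1;	//Found the word
	}
	return 0;			//Hit end of file without finding it.
}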



//This function is identical to find_word, but it also stops searching at
//the end of the table.  A better implementation would take the "stop"
//string as another parameter.
int find_table_word(FILE* fp, const char search[])
{
	int		index = 0;
	int		read;		//int, so EOF can be detected
	const char	stop[]="/table";
	int		stop_index = 0;

	while((read = fgetc(fp)) != EOF){
		if(read == (unsigned char)stop[stop_index])
			stop_index++;
		else
			stop_index = (read == (unsigned char)stop[0]) ? 1 : 0;
		if(read == (unsigned char)search[index])
			index++;
		else
			index = (read == (unsigned char)search[0]) ? 1 : 0;
		if(search[index] == 0)
			return 1;	//Found the word
		if(stop[stop_index]==0)
			return 0;	//Reached the end of the table first
	}
	return 0;			//Hit end of file
}


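//Reads characters up to (but not including) `tag` and returns them as a
//malloc'd string that the caller must free().  Returns NULL if out of memory.
//Requires <stdio.h> and <stdlib.h>.
char* read_until(FILE* fp, char tag)
{
	size_t	size = 50;		//Initial capacity; doubled whenever it fills up
	size_t	len = 0;
	char* 	string = (char*) malloc(size);
	int		cread;			//int, so EOF can be detected

	if(string == NULL)
		return NULL;
	while( (cread = fgetc(fp)) != EOF && cread != (unsigned char)tag){
		if(len + 1 >= size){	//Keep room for the terminator
			char* bigger = (char*) realloc(string, size * 2);
			if(bigger == NULL){
				free(string);
				return NULL;
			}
			string = bigger;
			size *= 2;
		}
		string[len++] = (char)cread;
	}
	string[len] = 0;			//Terminate string
	if(cread != EOF)
		ungetc(cread, fp);		//Put the tag character back for the next read

	return string;
}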
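
The comment above find_table_word suggests taking the "stop" string as a parameter. A minimal sketch of that idea follows. The name find_word_before is hypothetical, and both strings are assumed to be non-empty. Like the functions above, it restarts matching in a simple way, so a pattern whose beginning repeats inside itself (for example "aab") can be missed; a full string-search algorithm such as KMP would be needed for that.

```cpp
#include <cstdio>

// Hypothetical generalization of find_table_word: the stop string is passed in.
// Returns 1 if `search` is read before `stop` or end of file, otherwise 0.
int find_word_before(std::FILE* fp, const char* search, const char* stop)
{
    int index = 0;        // progress through `search`
    int stop_index = 0;   // progress through `stop`
    int ch;               // int, so EOF is distinguishable from data

    while ((ch = std::fgetc(fp)) != EOF) {
        // Advance each matcher, or restart it on a mismatch while
        // re-checking the current character against the first letter.
        if (ch == static_cast<unsigned char>(search[index]))
            index++;
        else
            index = (ch == static_cast<unsigned char>(search[0])) ? 1 : 0;

        if (ch == static_cast<unsigned char>(stop[stop_index]))
            stop_index++;
        else
            stop_index = (ch == static_cast<unsigned char>(stop[0])) ? 1 : 0;

        if (search[index] == '\0')
            return 1;     // found the word first
        if (stop[stop_index] == '\0')
            return 0;     // reached the stop string first
    }
    return 0;             // end of file
}
```

With this, find_table_word(fp, word) would be roughly equivalent to find_word_before(fp, word, "/table"), and other boundaries such as "/tr" or "/div" need no extra function.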

So, to parse a page for links, you could use something like:
CPP / C++ / C Code:

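   //Assumes `file` was opened with fopen(); MAX_LINKS is an arbitrary cap.
   enum { MAX_LINKS = 100 };
   char* link[MAX_LINKS];
   char* name[MAX_LINKS];
   int   x = 0;

   while(x < MAX_LINKS && find_word(file,"<a href=\"")){
       link[x] = read_until(file,'"');
       find_word(file,">");
       name[x] = read_until(file,'<');
       x++;
   }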

I didn't test that, but it should go through an HTML file that has been opened with fopen and store all of the link locations and their associated link text.
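
For pages small enough to hold in memory, the same loop can be sketched with std::string::find instead of the FILE-based helpers. This is a different technique from the functions above, not a drop-in replacement. The name extract_links is hypothetical, and the sketch only recognizes anchors written exactly as <a href="...">: lowercase, double quotes, and href as the first attribute.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory variant of the loop above: collects (href, text)
// pairs for every anchor that starts with the literal text <a href="
std::vector<std::pair<std::string, std::string>>
extract_links(const std::string& html)
{
    const std::string open = "<a href=\"";
    std::vector<std::pair<std::string, std::string>> links;
    std::string::size_type pos = 0;

    while ((pos = html.find(open, pos)) != std::string::npos) {
        pos += open.size();                                   // start of URL
        std::string::size_type quote = html.find('"', pos);   // end of URL
        if (quote == std::string::npos) break;
        std::string::size_type gt = html.find('>', quote);    // end of the tag
        if (gt == std::string::npos) break;
        std::string::size_type lt = html.find('<', gt + 1);   // next tag
        if (lt == std::string::npos) break;
        links.emplace_back(html.substr(pos, quote - pos),
                           html.substr(gt + 1, lt - gt - 1));
        pos = lt;                                             // continue after the text
    }
    return links;
}
```

Because the vector grows as needed, there is no fixed limit on the number of links, and nothing has to be freed by hand. Link text that contains nested tags (such as <b>) is cut off at the first nested tag, just as in the loop above.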

If these are of any use, feel free to use them. They obviously come with no guarantee that they won't burn down your house or kill your dog...
  #5  
Old 20-Jan-2004, 19:33
JdS
Thank you. I have just edited the bbcode in your post so that the code samples are immediately recognizable as code.

Instead of using [code] to surround your C/C++ code examples, you can use [c++] or even simply [c].
  #6  
Old 21-Jan-2004, 07:38
dsmith
JdS:

Thanks, that bbcode sure makes a difference. I have used editors whose syntax highlighting isn't that nice.
 
 
