GIDForums  

  #1  
22-Nov-2006, 21:46
aeroboy86
New Member
 
Join Date: Nov 2006
Posts: 1

MS Word File Format


Hello all,

I am simply trying to extract text from an MS Word document using a C program. Does anyone have information on the MS .doc file format? I have tried wotis, but for some reason I cannot access the information. Any help on this issue would be much appreciated.

Kind Regards

James

  #2  
23-Nov-2006, 17:34
mathematician
Member
 
Join Date: Nov 2006
Location: Shrewsbury Uk
Posts: 131

Re: MS Word File Format


I haven't got much to contribute, apart from what you possibly know already. What you may or may not know is that Word uses Unicode rather than ASCII characters. The first 128 characters are the same in both, except for one crucial difference: ASCII characters are stored in 8 bits each, whereas Word's Unicode characters are 16 bits wide.
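
To make the width difference concrete, here is a minimal sketch in C that narrows 16-bit characters to plain 8-bit ones. It assumes the 16-bit characters are stored low byte first (UTF-16LE); the function name utf16le_to_ascii is made up for this illustration and is not part of any Word library.

```c
#include <stddef.h>

/* Copy a little-endian UTF-16 buffer into a plain char string.
   Code units in the ASCII range are kept; everything else becomes '?'.
   Returns the number of characters written, not counting the final NUL.
   (Hypothetical helper for illustration only.) */
size_t utf16le_to_ascii(const unsigned char *src, size_t src_len,
                        char *dst, size_t dst_cap)
{
    size_t n = 0;
    size_t i;

    if (dst_cap == 0)
        return 0;

    /* Each character occupies two bytes: low byte first, then high byte. */
    for (i = 0; i + 1 < src_len && n + 1 < dst_cap; i += 2) {
        unsigned int unit = src[i] | ((unsigned int)src[i + 1] << 8);
        dst[n++] = (unit < 0x80) ? (char)unit : '?';
    }
    dst[n] = '\0';
    return n;
}
```

So the word "Hi" takes two bytes as ASCII (48 69) but four bytes in the 16-bit form (48 00 69 00), which is why a plain strstr() or printf() on the raw file contents stops at the first zero byte.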
  #3  
23-Nov-2006, 21:14
davekw7x
Outstanding Member
 
Join Date: Feb 2004
Location: Left Coast, USA
Posts: 4,703

Re: MS Word File Format


Quote:
Originally Posted by aeroboy86
Hello all,

I am simply trying to extract text from an MS Word document using a C program. Does anyone have information on the MS .doc file format? I have tried wotis, but for some reason I cannot access the information.

Maybe you meant wotsit? http://www.wotsit.org/

On the opening page you can see a reference to a couple of versions of Microsoft Word.

Also, there was/is a SourceForge project to create a library of functions to access Microsoft Word documents: http://wvware.sourceforge.net/. I also remember a couple of "rtf" reader projects for the LaTeX text processing system. "Rich Text Format" is closely related to Microsoft Word. They are not exactly the same, but RTF and the Microsoft Word format do have something in common: they are controlled ("owned") by Microsoft. Microsoft can and does change them with each release of new Office products. They are rarely forward compatible, and sometimes not even backward compatible with previous versions. The changes always introduce new features that weren't available in older products, and sometimes it requires some inside knowledge of both old and new formats in order to make a conversion. So all of the good stuff you learn that worked for Version 6 is pretty much useless for Word 2003, for example.

My point is that people who undertake projects to reverse-engineer the commercial products (sometimes in violation of specific End User License Agreements that you click on when you install the product) are usually only partially successful, and by the time that something almost useful comes out of it there is a new version that is nearly-but-not-quite compatible with all of their work. I found the RTF programs not suitable (for my needs, anyhow) for current versions of Microsoft products.

From Wikipedia: http://en.wikipedia.org/wiki/Microso...d#File_formats

"The DOC format of Word 97 was publicly documented by Microsoft, but later versions have been kept private, available only to partners, governments and institutions"


So, your request may seem "simple" to you, but the solution may not be so simple.

Since various open-source products (OpenOffice.org, AbiWord, and others) can read some Microsoft Office documents, you could even try to gain some understanding from them, but I haven't had much success in understanding even small parts of most programs like that. On the other hand, I haven't really tried very hard (That's my story and I am sticking with it!)

It's possible that extracting text (and not worrying about all of the formatting stuff) might not be too tough, but I don't know of any "easy" way to find out how to do it.
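
For what it's worth, one crude way to try this without understanding the format at all is to scan the raw bytes for runs of 16-bit printable characters, much like the Unix strings utility does. The sketch below is only that: it assumes the text is stored as little-endian 16-bit characters with a carriage return at the end of each paragraph, it will miss any text stored in 8-bit form, and it may pick up non-document strings such as font or style names. The names extract_utf16_runs and MIN_RUN are invented for this example.

```c
#include <stddef.h>

#define MIN_RUN 4  /* hypothetical threshold: shorter runs are treated as noise */

/* True if (lo, hi) is a little-endian 16-bit code unit holding
   printable ASCII, a tab, or a carriage return. */
static int is_text_unit(unsigned char lo, unsigned char hi)
{
    return hi == 0 && ((lo >= 0x20 && lo <= 0x7E) || lo == '\t' || lo == '\r');
}

/* Scan buf for runs of at least MIN_RUN text units and copy them to out,
   one run per line, turning each carriage return into a newline.
   Output is silently truncated if out is too small.
   Returns the number of runs found. */
size_t extract_utf16_runs(const unsigned char *buf, size_t len,
                          char *out, size_t out_cap)
{
    size_t runs = 0, n = 0, i = 0;

    if (out_cap == 0)
        return 0;

    while (i + 1 < len) {
        size_t start = i, k;

        /* Extend the run two bytes at a time while it still looks like text. */
        while (i + 1 < len && is_text_unit(buf[i], buf[i + 1]))
            i += 2;

        if ((i - start) / 2 >= MIN_RUN) {
            for (k = start; k < i && n + 1 < out_cap; k += 2)
                out[n++] = (buf[k] == '\r') ? '\n' : (char)buf[k];
            if (n + 1 < out_cap)
                out[n++] = '\n';
            runs++;
        }
        if (i == start)
            i++;  /* no text here; step one byte because alignment is unknown */
    }
    out[n] = '\0';
    return runs;
}
```

To use it, read the whole file into memory with fopen(name, "rb") and fread(), pass the buffer in, and print the result. Expect to clean up the output by hand; this is a guess at where the text is, not a parser.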

The main point is that it (the .doc file format) is a moving target. If you have some Microsoft Word documents and you know what specific version was used to create them, then you might be able to find some source code that works for that version (like the SourceForge project). If you just want to look at the documents, then you can try some of the open source office suites. (Microsoft even has a free viewer that you can download to let you look at them.)
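
If you don't know what you have, a first step could be to look at the first few bytes of the file. As a rough sketch: binary .doc files from the Word 97 to 2003 era are wrapped in an OLE2 compound file, which starts with the eight bytes D0 CF 11 E0 A1 B1 1A E1, while RTF files start with the characters {\rtf. The enum and function names below are invented for this example, and the check only identifies the container. The actual Word version is recorded further inside the file, which this sketch does not attempt to read.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification for illustration only. */
enum doc_kind { DOC_UNKNOWN, DOC_OLE2, DOC_RTF };

/* Classify a file by its leading bytes (head points at the start of the file,
   len is how many bytes were read). */
enum doc_kind sniff_doc_kind(const unsigned char *head, size_t len)
{
    /* Signature of an OLE2 compound file, the container for binary .doc. */
    static const unsigned char ole2_magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    if (len >= sizeof ole2_magic &&
        memcmp(head, ole2_magic, sizeof ole2_magic) == 0)
        return DOC_OLE2;

    /* RTF is plain text and begins with the control word {\rtf */
    if (len >= 5 && memcmp(head, "{\\rtf", 5) == 0)
        return DOC_RTF;

    return DOC_UNKNOWN;
}
```

Reading the first 8 bytes with fread() is enough to call it. Note that a file named .doc can turn out to be RTF, since Word will open either, so checking the contents is safer than trusting the extension.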

Regards,

Dave
 
 
