#1
MS Word File Format

Hello all,

I am simply trying to extract text from an MS Word document using a C program. Does anyone have information on the MS .doc file format? I have tried wotis, but for some reason I cannot access the information. Any help on this issue would be much appreciated.

Kind Regards,
James
#2
Re: MS Word File Format

I haven't got much to contribute, apart from what you possibly know already. What you may or may not know is that Word uses Unicode rather than ASCII characters. The first 128 characters are the same in both, except for one crucial difference: ASCII characters are stored in 8-bit bytes, whereas Word's Unicode characters are 16 bits wide.
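To make the width difference concrete, here is a minimal C sketch. It assumes the text bytes have already been located and are stored as 16-bit little-endian values (low byte first); finding those bytes inside a .doc file is a separate problem that this does not address. The name `utf16le_to_ascii` is made up for illustration and does not come from any Word library.

```c
#include <stddef.h>

/* Narrow a buffer of 16-bit little-endian characters to plain ASCII.
 *   src, src_len : raw bytes, two per character (a trailing odd byte is ignored)
 *   dst, dst_cap : output buffer and its capacity, including the final NUL
 * Characters outside the ASCII range are replaced with '?'.
 * Returns the number of characters written, not counting the NUL. */
size_t utf16le_to_ascii(const unsigned char *src, size_t src_len,
                        char *dst, size_t dst_cap)
{
    size_t out = 0;

    if (dst_cap == 0)
        return 0;

    for (size_t i = 0; i + 1 < src_len && out + 1 < dst_cap; i += 2) {
        /* Rebuild the 16-bit value: low byte first, then high byte. */
        unsigned int unit = src[i] | ((unsigned int)src[i + 1] << 8);

        /* Values below 0x80 are identical to their ASCII codes. */
        dst[out++] = (unit < 0x80) ? (char)unit : '?';
    }
    dst[out] = '\0';
    return out;
}
```

For example, the bytes `'H', 0, 'i', 0` come out as the string "Hi". Anything beyond the ASCII range is lost as '?', so this is only a starting point for English-only text.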
#3
Re: MS Word File Format
Maybe you meant wotsit? http://www.wotsit.org/

On the opening page you can see a reference to a couple of versions of Microsoft Word.

Also, there was/is a SourceForge project to create a library of functions to access Microsoft Word documents: http://wvware.sourceforge.net/. I also remember a couple of "rtf" reader projects for the LaTeX text processing system.

"Rich Text Format" is closely related to Microsoft Word. Not exactly the same, but RTF and the Microsoft Word format do have something in common: they are controlled ("owned") by Microsoft. Microsoft can and does change them with each release of new Office products. They are rarely forward compatible, and sometimes not even backward compatible with previous versions. The changes always introduce new features that weren't available in older products, and sometimes it requires some inside knowledge of both old and new formats in order to make a conversion. So all of the good stuff you learn that worked for Version 6 is pretty much useless for Word 2003, for example.

My point is that people who undertake projects to reverse-engineer the commercial products (sometimes in violation of specific End User License Agreements that you click on when you install the product) are usually only partially successful, and by the time that something almost useful comes out of it there is a new version that is nearly-but-not-quite compatible with all of their work. I found the RTF programs not suitable (for my needs, anyhow) for current versions of Microsoft products.

From Wikipedia: http://en.wikipedia.org/wiki/Microso...d#File_formats

"The DOC format of Word 97 was publicly documented by Microsoft, but later versions have been kept private, available only to partners, governments and institutions"

So, your request may seem "simple" to you, but the solution may not be so simple.

Since various open-source products (OpenOffice, AbiWord, and others) can read some Microsoft Office documents, you could even try to gain some understanding from their source code, but I haven't had much success in understanding even small parts of most programs like that. On the other hand, I haven't really tried very hard. (That's my story and I am sticking with it!)

It's possible that extracting text (and not worrying about all of the formatting stuff) might not be too tough, but I don't know of any "easy" way to find out how to do it.

The main point is that it (the .doc file format) is a moving target. If you have some Microsoft Word documents and you know what specific version was used to create them, then you might be able to find some source code that works for that version (like the SourceForge project). If you just want to look at the documents, then you can try some of the open-source office suites. (Microsoft even has a free viewer that you can download to let you look at them.)

Regards,
Dave
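If all that is wanted is the raw text, one crude approach that avoids the file format entirely is to scan the file's bytes for runs of printable characters, much as the Unix `strings` tool does. The C sketch below is only that heuristic, not a .doc parser: it assumes the text appears either as plain 8-bit characters or as 16-bit little-endian characters with a zero high byte, it will miss non-ASCII text, and it may pick up metadata or junk along with the document body. The names `extract_text_runs` and `MIN_RUN`, and the threshold of 4, are made up for illustration.

```c
#include <stddef.h>

#define MIN_RUN 4   /* arbitrary: shortest run treated as text */

/* Printable ASCII plus tab, carriage return and line feed. */
static int is_text_byte(unsigned char c)
{
    return (c >= 0x20 && c < 0x7F) || c == '\t' || c == '\r' || c == '\n';
}

/* Append one character if there is room, always leaving space for the NUL. */
static size_t put_char(char *out, size_t out_cap, size_t pos, char c)
{
    if (pos + 1 < out_cap)
        out[pos++] = c;
    return pos;
}

/* Copy every run of at least MIN_RUN text characters from buf into out,
 * one run per line. Output is truncated if out is too small.
 * Returns the number of characters written, not counting the NUL. */
size_t extract_text_runs(const unsigned char *buf, size_t len,
                         char *out, size_t out_cap)
{
    size_t pos = 0, i = 0;

    if (out_cap == 0)
        return 0;

    while (i < len) {
        /* First guess: 16-bit characters (text byte followed by a zero byte). */
        size_t k = 0;
        while (i + 2 * k + 1 < len &&
               is_text_byte(buf[i + 2 * k]) && buf[i + 2 * k + 1] == 0)
            k++;
        if (k >= MIN_RUN) {
            for (size_t j = 0; j < k; j++)
                pos = put_char(out, out_cap, pos, (char)buf[i + 2 * j]);
            pos = put_char(out, out_cap, pos, '\n');
            i += 2 * k;
            continue;
        }

        /* Second guess: ordinary 8-bit characters. */
        k = 0;
        while (i + k < len && is_text_byte(buf[i + k]))
            k++;
        if (k >= MIN_RUN) {
            for (size_t j = 0; j < k; j++)
                pos = put_char(out, out_cap, pos, (char)buf[i + j]);
            pos = put_char(out, out_cap, pos, '\n');
            i += k;
            continue;
        }

        i++;   /* neither guess matched here; move on one byte */
    }
    out[pos] = '\0';
    return pos;
}
```

To try it, read the whole file into memory with `fopen` in "rb" mode and `fread`, then pass the buffer to `extract_text_runs`. For anything beyond a rough look at the contents, a version-specific library such as the wvWare project mentioned above is likely the better route.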