Cleaning Up Word HTML

by Heather Floyd March 30, 2009 20:31

If you have spent any time doing web development or updating your blog or website using some sort of Content Management System, you have likely come across the problem of converting MS Word files into HTML code. It seems like it should be a simple operation. Word does include a “Save As Webpage…” option, but if you look at the HTML it generates, you will be disappointed to see what a mess it is.

Cleaning up Word-junked content before using it online is very important for code compliance and decent, consistent display. Sure, the simplest way to strip out Word garbage is to copy and paste the text from Word into a basic text editor, then copy and paste it from the text editor into your email, blog, or CMS interface. The only problem is that this strips out ALL formatting, which you will need to painstakingly recreate for your online publishing. If you have long formatted documents, this quickly becomes tedious and error-prone.

The other option is to seek out a “cleaning” or conversion utility, which either takes a regular Word Doc and converts it to compliant HTML, or takes a “Save As Webpage…” Word-generated HTML file and strips out the Word-only HTML crap. In general, I have found that these tools do a decent job of generating clean code that still includes the basic formatting tags that are necessary for proper display.

As a web developer who has been dealing with this issue for over a decade, I have certainly tried many solutions and have yet to find my “holy grail”. The main problem I have found with conversion/cleanup programs is that they aren’t smart enough to convert Word-styled bulleted lists into properly formatted <ul>/<li> code. Believe me, the utility that can do THAT will be the winner in my book.
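To make the list problem concrete, here is a minimal Python sketch of what such a conversion might look like. It assumes that Word saved each bullet as a <p> tag whose class name starts with "MsoList" and that the literal bullet glyph sits inside a "supportLists" conditional block; both are assumptions about the saved file, not guarantees, and the function name word_lists_to_ul is made up for illustration. It only builds flat <ul> lists, so nested or numbered lists would need more work.

```python
import re

# Assumption: list items are saved as <p> tags whose class starts with
# "MsoList". Adjust the pattern to match the class names in your own file.
LIST_P = re.compile(
    r'<p\b[^>]*\bclass\s*=\s*["\']?MsoList\w*["\']?[^>]*>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)
# Assumption: Word wraps the literal bullet character in this conditional block.
BULLET = re.compile(r'<!\[if !supportLists\]>.*?<!\[endif\]>', re.DOTALL)


def word_lists_to_ul(html):
    """Wrap runs of adjacent Word list paragraphs in <ul> and emit <li> items."""
    out = []
    pos = 0
    in_list = False
    for m in LIST_P.finditer(html):
        between = html[pos:m.start()]
        # Anything other than whitespace between two items ends the current list.
        if in_list and between.strip():
            out.append("</ul>")
            in_list = False
        out.append(between)
        if not in_list:
            out.append("<ul>\n")
            in_list = True
        item = BULLET.sub("", m.group(1)).strip()
        out.append(f"<li>{item}</li>")
        pos = m.end()
    if in_list:
        out.append("\n</ul>")
    out.append(html[pos:])
    return "".join(out)


# Hypothetical sample, shaped like a fragment of a "Save As Webpage" file.
SAMPLE = (
    '<p class=MsoNormal>Intro</p>\n'
    '<p class=MsoListParagraphCxSpFirst><![if !supportLists]>'
    '<span>&#183;</span><![endif]>One</p>\n'
    '<p class=MsoListParagraphCxSpLast><![if !supportLists]>'
    '<span>&#183;</span><![endif]>Two</p>\n'
    '<p class=MsoNormal>End</p>'
)

print(word_lists_to_ul(SAMPLE))
```

Regular expressions are fragile on real-world HTML, so treat this as an illustration of the idea rather than a replacement for the tools reviewed here.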

So, here are a handful of options for your Word-to-HTML projects.

Online Utilities

Recommended

Textism.com Word HTML Cleaner
http://www.textism.com/wordcleaner/
COST: Word files up to 20 KB are free, larger files require an inexpensive subscription (€5 - €20)
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then upload to the website
NOTES: Does a good job, but doesn't fix converted lists.

WordOff
http://wordoff.org/
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it in Notepad, copy & paste the HTML to the form on the website
NOTES: Does a good job, but doesn't fix converted lists.

Not Recommended

HTML Tidy Online
http://infohound.net/tidy/
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then upload to website, or paste in some HTML from the saved Word doc
NOTES: For the "Tidy Settings" check "Clean" and "Word 2000" for best results. Doesn't remove Word styles (class="MsoBodyText", etc.), doesn't fix converted lists.

Microsoft Word 2000 HTML Mess Cleaner
http://www.algotech.dk/word-html-cleaner-input.htm
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it in Notepad, copy & paste the HTML to the form on the website
NOTES: Converts paragraphs using <BR> tags, which isn't ideal.

Desktop Installed Programs

Somewhat Recommended

Firefox Add-on: Html Validator
https://addons.mozilla.org/en-US/firefox/addon/249
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it with Firefox. Go to Edit > View Source..., click the "Clean up this page..." button
NOTES: Requires that you have Firefox web browser installed. Doesn't remove Word styles (class="MsoBodyText", etc.), doesn't fix converted lists.

Zapadoo Word Cleaner
http://www.zapadoo.com/wordcleaner/
COST: $99
HOW-TO: Drag-n-Drop or open Word Docs into the program, choose the appropriate conversion template and click a button
NOTES: Can convert many documents at once and is very full-featured, including the ability to customize your own "templates" for cleaning, though I was disappointed that the included templates don’t handle lists the way I want. I haven’t been able to configure a custom one to my standards, even after spending quite some time on it.

RTF to XHTML Converter
http://rtftohtml.com/
COST: $34.50 (€29)
HOW-TO: In Word, save as RTF file, browse to it in the program, set an output file path, click "Convert" button
NOTES: This program did properly convert lists to <li> tags, but it also added all sorts of extra <div> and <span> tags with useless style info. There aren't any options to exclude this sort of formatting, which would have made this program a winner. Unfortunately, it just doesn't strip out enough junk.

WordHTML CV
http://www.technoriversoft.com/wordtohtmlconverter.html
COST: free
HOW-TO: Drag-n-Drop your Word Doc onto the program window
NOTES: Doesn't remove Word styles (class="MsoBodyText", etc.), doesn't fix converted lists

Not Recommended

Web Code Converter
http://www.web-code-converter.com/
COST: $19.95
NOTES: I couldn't test this, since it opened with an error message. Re-installing didn’t help.

Atrise ToHTML
http://www.atrise.com/to-html/
COST: $25
HOW-TO: Drag & Drop your Word Doc onto the little program window
NOTES: Easy to use, but not recommended because it strips out ALL formatting, leaving only paragraph breaks. I would expect more functionality for $25.

Word2html LT
http://www.wordcnv.com/word2html-lt.html
COST: €40
HOW-TO: Browse to your file, click Open.
NOTES: Even though their website claims "Full support of bullets and numbered lists" I found that it wasn't the case. No <li> in sight. I was also unimpressed with its inability to figure out heading tags.

Convert Doc
http://www.softinterface.com/Convert-Doc/Features/Convert-DOC-To-HTML.htm
COST: free, as far as I could tell
HOW-TO: Browse to your file, set some options, click Convert.
NOTES: Unfortunately, it doesn't seem to do much differently than Word's own "Save As HTML" option. If you have other file conversion needs, though (PDFs, etc.), you might find this a useful program.

WordToWeb 2.5
http://www.solutionsoft.com/w2w.htm
COST: $299
HOW-TO: Uses a Wizard-like interface to browse to your file, set a gazillion options and finally Convert.
NOTES: This has a lot of options to create webpages from your Word docs, but as far as I can tell, it does a terrible job at cleaning the HTML produced. If anything, it seems to ADD extra junk.
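Several of the tools above leave Word's own class names (class="MsoBodyText", etc.) or useless inline style info in the output. As a rough idea of what removing them involves, here is a small Python sketch. The function name strip_word_markup is hypothetical, and discarding every style attribute wholesale is an assumption that may be too aggressive if you rely on some inline styles.

```python
import re

# Matches class="MsoBodyText", class='MsoNormal' and unquoted class=MsoNormal.
MSO_CLASS = re.compile(r'\s+class\s*=\s*(["\']?)Mso\w+\1', re.IGNORECASE)
# Assumption: quoted inline style attributes from Word can all be thrown away.
STYLE_ATTR = re.compile(r'\s+style\s*=\s*("[^"]*"|\'[^\']*\')', re.IGNORECASE)


def strip_word_markup(html):
    """Remove Mso* class attributes and quoted inline style attributes."""
    html = MSO_CLASS.sub("", html)
    html = STYLE_ATTR.sub("", html)
    return html


# Hypothetical sample input.
sample = (
    '<p class="MsoBodyText" style="margin:0in">Hello '
    "<span style='color:black'>world</span></p>"
)
print(strip_word_markup(sample))
```

The patterns run over the whole text rather than only inside tags, which is fine for a quick cleanup but is one reason a real cleaner would use an HTML parser instead.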

 

If you have a favorite, feel free to post a link in the comments.

 

Comments

4/8/2009 4:49:48 PM #

Johan

I totally agree on this "issue". The Word formatting can really mess things up (your clean accessible XHTML code) and it's sometimes hard to explain to "the clients" that it's no good to copy text directly from Word... I usually tell our clients that the best way is to copy the text to "anteckningar" (I don't know, but I think it's called Notes in English), and then into the CMS. I will check out your suggestions =).

Johan Sweden |

5/22/2009 5:59:52 AM #

cleaning franchises

Hey - nice blog, just looking around some blogengine.net sites, seems a pretty nice platform.  I'm currently using Wordpress for a few of my sites but looking to change one of them over to blogengine.net as a trial run.  Anything in particular you would recommend about it?  Cheers,  Matthew

cleaning franchises United Kingdom |

6/11/2009 2:54:09 AM #

Eping Wang

Hi, I am the developer of a Word HTML cleaning tool, HTML Cleaner for Word.
The tool was released just a month after your article.
I think it will do a better job than most other tools.
I hope you will give it a try.

I also wrote a comparison of most cleaning tools at
http://www.wonderstudio.cn/soft/cleanW/index.htm (click
Cleaning Tools in the content list).
There is also a good article written five years ago at www.informit.com/articles/article.aspx?p=359433

About the list problem: I'm not sure, but in my experience most of these tools only clean the HTML that Word exports; they do not translate DOC into HTML directly. If Word exports with <li>, then <li> will not be stripped by these tools. I am afraid it's not a problem with the cleaning tools. Does your Word HTML have <li> tags or not?

Finally, I would like to put a link to this article and your name on my page
with other referenced articles. May I have the honor?

Eping Wang |

6/11/2009 10:54:54 AM #

HeatherFloyd

@Eping: When I have a chance I'd like to take a look at your tool and possibly write an update to this post. Regarding <li> - If you are formatting your text in Word, you can use the "bullets & lists" formatting, but when it is Saved as HTML, it saves it as (I believe) <p> tags with a style called "list" or something. A very smart Word cleaner program would be able to replace <p class="liststyle"> with <li>.

Of course, you are always welcome to link to these posts. Thanks very much.  

HeatherFloyd United States |

8/26/2009 2:11:16 PM #

Dave N

Just what I was looking for…thanks so much for your help!

Dave N United States |

8/26/2009 4:11:09 PM #

HeatherFloyd

Dave,
Glad to be of service. I should also mention that the new version of "Zapadoo Word Cleaner" is much improved and is what I have settled on using regularly. I have been meaning to write a new review, but haven't gotten around to it yet.

HeatherFloyd United States |

9/26/2009 6:19:54 AM #

translate pdf

Hi. I have no problem with Word formatting IF I copy the content of the Word document and paste it directly into Adobe Dreamweaver. Try it, or maybe with some other simple web design software; I'm sure the formatting will be preserved.

Note: Paste the content into the design area, not into the source code. It will automatically add <p>, <br /> tags, etc.

Laters,
Dejan

translate pdf United States |


Who Is Arachne's Sister?

I consider myself a spiritual sister to Arachne, of ancient Greek origin, who was a priestess of the Goddess Athene and an exceptional weaver. Though I have an interest in the fabric arts, these days most of my weaving happens online, in the form of website development and online marketing, and in building connections and relationships.


My real name is Heather Floyd and for over a decade I have been involved in web and software design and development. Now I help solopreneurs/independent professionals and micro-businesses who are overwhelmed with website options and costs to have a website that gets traffic and generates business with less aggravation and expense.

I also have interests in environmentalism & sustainability issues, personal development, and productivity.
