HTML to docx Converter

Project website

Project website on Codeplex

Introduction

This converter turns HTML into Word documents (docx format). The code is written in PHP and works with PHPWord. It is designed for simple HTML - the kind typically produced by WYSIWYG editors (such as TinyMCE) or included in a blog post - and converts it into a docx Word document. The aim is a Word document in a form that is familiar to most Word users and therefore easy to work with. It is not intended as a way to recreate complex web page layouts in a Word document.

This converter requires SimpleHTMLDom and PHPWord to function. Copies of both are included in this release, although you might want to download the latest versions.

Note: this is alpha code, so future changes may not be compatible with code you have built on top of it.

Setting up

See example.php for an example of how to use this converter; the Word document created by this example is available for download. Note that you do not need to include the documentation directory on your live production server.

Creating a "style sheet"

This converter uses a style sheet in the form of a PHP array, which allows you to assign PHPWord styles to HTML elements, classes and inline styles. One example style sheet is used to create the Word document in example.php, and another is used to convert this page to a Word document.

Note that all the attribute-values in these arrays are PHPWord attribute-values. Refer to the PHPWord documentation for more information on these: see PHPWord_Docs_0.6.2.docx in the phpword directory.
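As a rough illustration of the idea, a style sheet of this kind might look like the sketch below. The function name example_styles() and the top-level keys ('default', 'elements', 'classes', 'inline') are assumptions made for this sketch, as is the 'warning' class; check the style sheet shipped with example.php for the exact structure the converter expects. The attribute names used (size, bold, italic, color) are PHPWord font attributes.

```php
<?php
// Hypothetical style sheet sketch. The key names below are assumptions;
// compare with the example style sheet shipped with the converter.
function example_styles() {
    $styles = array();

    // Applied first, whatever the element is.
    $styles['default'] = array('size' => 11);

    // HTML tag => PHPWord attribute-values.
    $styles['elements'] = array(
        'h1'     => array('bold' => true, 'size' => 20),
        'strong' => array('bold' => true),
        'em'     => array('italic' => true),
    );

    // CSS class name => PHPWord attribute-values ('warning' is a made-up class).
    $styles['classes'] = array(
        'warning' => array('color' => 'FF0000'),
    );

    // Inline CSS declaration => PHPWord attribute-values.
    $styles['inline'] = array(
        'font-style: italic' => array('italic' => true),
    );

    return $styles;
}
```

Keeping the three kinds of selector in separate sub-arrays mirrors the description above: one mapping each for elements, classes and inline styles.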

Measurements are generally in TWIPs (as described in the PHPWord documentation). You can add a width in pixels directly to an HTML cell tag, e.g. <td width=200>, and this will be converted into TWIPs automatically at 15 TWIPs per pixel. Image widths and heights are specified in pixels for PHPWord.
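The pixel conversion described above is simple arithmetic. The helper below is a hypothetical sketch (example_pixels_to_twips() is not a function in the converter) that applies the 15 TWIPs per pixel ratio:

```php
<?php
// Hypothetical helper, not part of htmltodocx: convert a pixel width
// to TWIPs using the 15 TWIPs-per-pixel ratio.
function example_pixels_to_twips($pixels) {
    return (int) round($pixels * 15);
}

// A cell declared as <td width=200> would be 3000 TWIPs wide.
echo example_pixels_to_twips(200); // prints 3000
```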

Elements

htmltodocx currently processes the following elements:

Element   Allowed child elements
body      p, ul, ol, table, div, h1, h2, h3, h4, h5, h6
h1        a, em, i, strong, b, br, span, code, u, sup, text
h2        a, em, i, strong, b, br, span, code, u, sup, text
h3        a, em, i, strong, b, br, span, code, u, sup, text
h4        a, em, i, strong, b, br, span, code, u, sup, text
h5        a, em, i, strong, b, br, span, code, u, sup, text
h6        a, em, i, strong, b, br, span, code, u, sup, text
p         a, em, i, strong, b, ul, ol, img, table, br, span, code, u, sup, text, div, p
div       a, em, i, strong, b, ul, ol, img, table, br, span, code, u, sup, text, div, p, h1, h2, h3, h4, h5, h6
a         text
em        a, strong, b, br, span, code, u, sup, text
i         a, strong, b, br, span, code, u, sup, text
strong    a, em, i, br, span, code, u, sup, text
b         a, em, i, br, span, code, u, sup, text
sup       a, em, i, br, span, code, u, text
u         a, em, strong, b, i, br, span, code, sup, text
ul        li
ol        li
li        a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text
img
table     tbody, tr
tbody     tr
tr        td, th
td        p, a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text, table
th        p, a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text, table
br
code
span      a, em, i, strong, b, img, br, span, code, sup, text
text

Inheritance

Attributes which can be inherited follow standard CSS recommendations for inheritance. See the function htmltodocx_inheritable_props(). The following attributes can be inherited:

size
name
bold
italic
superScript
subScript
underline
color
fgColor
align
spacing
listType
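To illustrate how the list above might be applied, the sketch below filters a parent element's style down to the inheritable attributes. The helper example_inherited_style() is hypothetical and is not the converter's own htmltodocx_inheritable_props(), whose signature is not shown here; 'borderSize' is used only as an example of an attribute that is not in the list.

```php
<?php
// Hypothetical helper, not the converter's own code: keep only the
// attributes from a parent style that the list above marks as inheritable.
function example_inherited_style(array $parent_style) {
    $inheritable = array(
        'size', 'name', 'bold', 'italic', 'superScript', 'subScript',
        'underline', 'color', 'fgColor', 'align', 'spacing', 'listType',
    );
    // array_flip() turns the names into keys so they can be matched by key.
    return array_intersect_key($parent_style, array_flip($inheritable));
}

$parent = array('bold' => true, 'size' => 12, 'borderSize' => 6);
$child = example_inherited_style($parent);
// $child is array('bold' => true, 'size' => 12)
```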

Special characters

All HTML entities listed here are supported. For example: © (&copy;), £ (&pound;), ® (&reg;), & (&amp;).
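For reference, PHP's built-in html_entity_decode() turns the entities in the example above into their characters. This is only an illustration of the mapping; whether the converter uses this function internally is an assumption not confirmed here.

```php
<?php
// Decode named HTML entities to UTF-8 characters with PHP's built-in function.
$decoded = html_entity_decode('&copy; &pound; &reg; &amp;', ENT_QUOTES, 'UTF-8');
// $decoded is "© £ ® &"
```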

Language support

Note that PHPWord does not support UTF-8 character encoding. The version of PHPWord shipped with the htmltodocx converter is patched to deal with this: all instances of utf8_encode() have been replaced with a new function, utf8encode_dummy(), which simply returns its string argument.

For example:

Russian

привет!

Bengali

আরে!
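The patch described in this section amounts to a pass-through function. A minimal sketch follows, assuming the shipped function does nothing beyond returning its argument (the exact body in the patched PHPWord may differ). Because utf8_encode() converts from ISO-8859-1 to UTF-8, applying it to text that is already UTF-8 would encode it a second time; the pass-through avoids that.

```php
<?php
// Sketch of the replacement described above: it returns its string argument
// unchanged, so text that is already UTF-8 is not encoded twice.
function utf8encode_dummy($string) {
    return $string;
}

$greeting = 'привет!';
// The bytes pass through untouched.
var_dump(utf8encode_dummy($greeting) === $greeting); // bool(true)
```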

Tables

Note that tables cannot be nested in PHPWord. Nested tables will be displayed as text.

In the bundled copy of PHPWord, the function at line 236 of Document.php has been changed to a public function so that lists can be used in a table cell. Note, however, that lists are not currently enabled in the htmltodocx converter; "pseudo lists" are used instead, which allow styling to be applied within each list element.

Images

You can align images left, middle, or right, but PHPWord does not appear to allow placing them side by side on one line: each image appears on its own line. A possible workaround is to insert the images into the cells of a table.