This converter turns HTML into Word documents (docx format). The code is written in PHP and works with PHPWord. It is designed to take simple HTML - the kind typically produced by WYSIWYG editors (such as TinyMCE) or included in a blog post - and convert it into a docx Word document. The intention is that the resulting document is in a form familiar to most people who use Word, and is therefore easy to work with. It is not intended as a way to recreate complex web page layouts in a Word document.
This converter requires SimpleHTMLDom and PHPWord to function. Copies of both are included in the release here, although you might want to download the latest versions.
Note: this is alpha code, so future changes may not be compatible with code you have built on top of it.
See example.php for an example of how to use this converter. Download the Word document created by this example. Note you do not need to include the documentation directory on your live production server.
This converter uses a style sheet in the form of a PHP array, which allows you to assign PHPWord styles to HTML elements, classes and inline styles. This is an example style sheet used to create the Word document at example.php, and this is the style sheet used to convert this page to a Word document.
Note that all the attribute-values in these arrays are PHPWord attribute-values - you should refer to the PHPWord documentation for more information on these - see: PHPWord_Docs_0.6.2.docx in the phpword directory.
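As a rough illustration of the idea, a style sheet array might look like the sketch below. The function name `example_style_sheet()` and the top-level keys (`default`, `elements`, `classes`, `inline`) are assumptions made for this sketch, not a statement of the converter's actual format, so compare it with the style sheet used by example.php before relying on it. The attribute names and values inside each entry are intended to be PHPWord attribute-values.

```php
<?php
// Hypothetical style sheet: the function name and the four top-level keys
// are assumptions for illustration only. Check example.php for the real layout.
function example_style_sheet()
{
    $styles = array();

    // Values applied before any element-specific style.
    $styles['default'] = array(
        'size' => 11,
    );

    // HTML element name => PHPWord attribute-values.
    $styles['elements'] = array(
        'h1'     => array('bold' => true, 'size' => 20),
        'strong' => array('bold' => true),
        'em'     => array('italic' => true),
        'sup'    => array('superScript' => true),
    );

    // CSS class name => PHPWord attribute-values.
    $styles['classes'] = array(
        'highlight' => array('color' => 'FF0000'),
    );

    // Inline CSS declaration => PHPWord attribute-values.
    $styles['inline'] = array(
        'text-decoration: underline' => array('underline' => 'single'),
        'text-align: center'         => array('align' => 'center'),
    );

    return $styles;
}
```

Keeping the three kinds of selector in separate sub-arrays makes it easy to see which PHPWord attributes a given tag, class or inline declaration maps to.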
Measurements are generally in TWIPs (as described in the PHPWord documentation). You can add a width in pixels directly onto an HTML cell tag, e.g. <td width=200>, and this will be converted into TWIPs automatically at a rate of 15 TWIPs per pixel. Image widths and heights are specified in pixels for PHPWord.
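The pixel-to-TWIP conversion is simple arithmetic. A minimal sketch, using the 15 TWIPs per pixel factor stated above (the constant and function names are hypothetical and are not part of the converter):

```php
<?php
// Conversion factor stated in the text: 15 TWIPs per pixel.
const EXAMPLE_TWIPS_PER_PIXEL = 15;

// Hypothetical helper: convert a pixel width to TWIPs.
function example_pixels_to_twips($pixels)
{
    return (int) round($pixels * EXAMPLE_TWIPS_PER_PIXEL);
}

// <td width=200> corresponds to 200 * 15 = 3000 TWIPs.
echo example_pixels_to_twips(200), "\n";
```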
htmltodocx currently processes the following elements:
| Element | Allowed child elements |
|---|---|
| body | p, ul, ol, table, div, h1, h2, h3, h4, h5, h6 |
| h1 | a, em, i, strong, b, br, span, code, u, sup, text |
| h2 | a, em, i, strong, b, br, span, code, u, sup, text |
| h3 | a, em, i, strong, b, br, span, code, u, sup, text |
| h4 | a, em, i, strong, b, br, span, code, u, sup, text |
| h5 | a, em, i, strong, b, br, span, code, u, sup, text |
| h6 | a, em, i, strong, b, br, span, code, u, sup, text |
| p | a, em, i, strong, b, ul, ol, img, table, br, span, code, u, sup, text, div, p |
| div | a, em, i, strong, b, ul, ol, img, table, br, span, code, u, sup, text, div, p, h1, h2, h3, h4, h5, h6 |
| a | text |
| em | a, strong, b, br, span, code, u, sup, text |
| i | a, strong, b, br, span, code, u, sup, text |
| strong | a, em, i, br, span, code, u, sup, text |
| b | a, em, i, br, span, code, u, sup, text |
| sup | a, em, i, br, span, code, u, text |
| u | a, em, strong, b, i, br, span, code, sup, text |
| ul | li |
| ol | li |
| li | a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text |
| img | |
| table | tbody, tr |
| tbody | tr |
| tr | td, th |
| td | p, a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text, table |
| th | p, a, em, i, strong, b, ul, ol, img, br, span, code, u, sup, text, table |
| br | |
| code | |
| span | a, em, i, strong, b, img, br, span, code, sup, text |
| text | |
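One way to picture the table above is as a lookup from a parent element to the child elements it may contain. The sketch below encodes a subset of the rows and checks a parent/child pair against it. The function names are hypothetical, and this is not necessarily how the converter stores or enforces these rules internally.

```php
<?php
// Subset of the table above: parent element => allowed child elements.
function example_allowed_children()
{
    $heading = array('a', 'em', 'i', 'strong', 'b', 'br', 'span', 'code', 'u', 'sup', 'text');

    return array(
        'body'  => array('p', 'ul', 'ol', 'table', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'),
        'h1'    => $heading,
        'h2'    => $heading,
        'a'     => array('text'),
        'ul'    => array('li'),
        'ol'    => array('li'),
        'table' => array('tbody', 'tr'),
        'tbody' => array('tr'),
        'tr'    => array('td', 'th'),
        'img'   => array(), // no children allowed
    );
}

// Hypothetical helper: true if $child is listed as an allowed child of $parent.
function example_child_is_allowed($parent, $child)
{
    $map = example_allowed_children();
    return isset($map[$parent]) && in_array($child, $map[$parent], true);
}

var_dump(example_child_is_allowed('ul', 'li'));    // bool(true)
var_dump(example_child_is_allowed('h1', 'table')); // bool(false)
```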
Attributes which can be inherited follow standard CSS recommendations for inheritance. See the function htmltodocx_inheritable_props(). The following attributes can be inherited:
size, name, bold, italic, superScript, subScript, underline, color, fgColor, align, spacing, listType.
All HTML entities listed here are supported. For example: &copy; (©), &pound; (£), &reg; (®), &amp; (&).
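To illustrate the inheritance rule for the attributes listed above: a child element takes over only the inheritable attributes of its parent, and its own attributes take precedence. The sketch below is an assumption about how such a merge could be written. The helper names are hypothetical, and in the converter itself the list comes from htmltodocx_inheritable_props(). `borderSize` is used here only as an example of an attribute that is not in the list.

```php
<?php
// Copy of the attribute names listed above, kept local so the sketch is
// self-contained. The converter provides these via htmltodocx_inheritable_props().
function example_inheritable_props()
{
    return array(
        'size', 'name', 'bold', 'italic', 'superScript', 'subScript',
        'underline', 'color', 'fgColor', 'align', 'spacing', 'listType',
    );
}

// Hypothetical helper: derive a child's style from its parent's style.
function example_inherit_style(array $parent_style, array $own_style)
{
    // Keep only the parent attributes that are allowed to be inherited.
    $inherited = array_intersect_key(
        $parent_style,
        array_flip(example_inheritable_props())
    );

    // The child's own attributes override the inherited ones.
    return array_merge($inherited, $own_style);
}

$parent = array('bold' => true, 'size' => 11, 'borderSize' => 6);
$child  = example_inherit_style($parent, array('size' => 14));

// 'bold' is inherited, 'size' is overridden, 'borderSize' is dropped.
print_r($child);
```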
Note that PHPWord does not support UTF-8 character encoding. The version of PHPWord shipped with the htmltodocx converter is patched to deal with this: all instances of utf8_encode() have been replaced with a new function, utf8encode_dummy(), which simply returns its string argument.
For example:
привет!
আরে!
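The effect of the patch can be sketched as follows. PHP's utf8_encode() converts from ISO-8859-1 to UTF-8, so applying it to text that is already UTF-8 re-encodes every multi-byte character a second time. The pass-through function below follows the description above. The comparison using mb_convert_encoding() is an illustration added here (it needs the mbstring extension) and is not part of the converter.

```php
<?php
// Pass-through replacement as described above: returns its argument unchanged.
function utf8encode_dummy($string)
{
    return $string;
}

// Already UTF-8, provided this file itself is saved as UTF-8.
$text = 'привет!';

// The dummy leaves the bytes untouched.
var_dump(utf8encode_dummy($text) === $text);

// Illustration only: treating the UTF-8 bytes as ISO-8859-1 and encoding
// them again (the same conversion utf8_encode() performs) changes the bytes.
if (function_exists('mb_convert_encoding')) {
    $double_encoded = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    var_dump($double_encoded === $text);
}
```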
Note that tables cannot be nested in PHPWord. Nested tables will be displayed as text.
The function at line 236 of Document.php in PHPWord has been changed to a public function so that lists can be used in a table cell. Note that lists are in any case not currently enabled in the htmltodocx converter, and "pseudo lists" are used instead, which allow styling to be applied within each list element.
You can align images left, middle, or right, but PHPWord does not appear to allow placing them side by side on one line - they will appear on separate lines. A way around this could be to insert them into cells in a table.
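The table workaround could be applied on the HTML side, before conversion, by wrapping each image in its own cell of a one-row table. The sketch below builds such a fragment using only elements from the supported list (table, tbody, tr, td, img). The helper name and the image paths are hypothetical, and the output has not been checked against the converter.

```php
<?php
// Hypothetical helper: wrap images in the cells of a one-row table.
// Each image is array('src' => path, 'width' => pixels, 'height' => pixels).
function example_images_row_html(array $images)
{
    $cells = '';
    foreach ($images as $image) {
        $src    = htmlspecialchars($image['src'], ENT_QUOTES, 'UTF-8');
        $width  = (int) $image['width'];
        $height = (int) $image['height'];

        // Cell width is given in pixels, like the image dimensions.
        $cells .= '<td width="' . $width . '">'
            . '<img src="' . $src . '" width="' . $width . '" height="' . $height . '">'
            . '</td>';
    }

    return '<table><tbody><tr>' . $cells . '</tr></tbody></table>';
}

echo example_images_row_html(array(
    array('src' => 'images/a.png', 'width' => 100, 'height' => 80),
    array('src' => 'images/b.png', 'width' => 100, 'height' => 80),
)), "\n";
```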

