Microsoft Word and Web Development

Microsoft Word is the most widely used word-processing application across the University, and files created with Word are linked to across our websites.

We can link from web pages to downloadable Word documents with ease. However, when publishing to the web as HTML, text copied from Microsoft Word documents can cause a few problems.

Character Sets and Encoding

Every text-based file is written using a particular character set, which determines how the words we write are stored and displayed. Character set choices often reflect support for different languages, or for additional typographic characters such as 'curly quotes'.

Character sets provide codes to describe each letter, number and symbol in our documents. In most cases, characters are converted from one encoding to another without incident, but in some cases the codes cannot be matched and the chosen character set cannot describe the characters. This results in strange strings of letters and symbols appearing amongst otherwise readable text.
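To illustrate how such strange strings arise, here is a minimal Python sketch. The function name `misread` and the sample text are hypothetical, chosen only for this example; `cp1252` is Python's name for the Windows-1252 character set.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one character set, then decode it with another.

    Bytes that cannot be decoded are replaced with U+FFFD, the Unicode
    replacement character, instead of raising an error.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


# Hypothetical sample: "It's" written with a curly apostrophe (U+2019).
sample = "It\u2019s"

# Saved as UTF-8 but read as Windows-1252: one character becomes three.
utf8_read_as_windows = misread(sample, "utf-8", "cp1252")

# Saved as Windows-1252 but read as UTF-8: the apostrophe is unreadable.
windows_read_as_utf8 = misread(sample, "cp1252", "utf-8")

# Saved and read with the same character set: the text survives intact.
matched = misread(sample, "utf-8", "utf-8")
```

In this sketch the first result contains the three characters â, € and ™ where the apostrophe should be, the second contains a single replacement character, and only the third matches the original text.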

We can eliminate this by making sure the content we use in our web pages uses the same character set as the page it appears in.

The LJMU Website now publishes HTML pages in the character encoding 8-bit Unicode Transformation Format, or UTF-8 for short. Doing so allows us to publish in a range of languages that use non-Western character sets, such as Chinese. Without UTF-8, we could not publish our Chinese and Arabic International School pages.

Microsoft Word uses a proprietary Windows-based character set, with an extended range of characters for styling documents. This is excellent for creating easily readable printed matter, but the character set it uses to do this is not fully compatible with UTF-8 web pages.

Seemingly harmless characters such as apostrophes, long dashes, hyphens and double quotes can sometimes display as unreadable characters and can ruin your web page.
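The difference is easiest to see in the stored bytes. The sketch below assumes the Windows character set in question is Windows-1252 (Python's `cp1252`); the list of characters is an illustrative selection, not a complete inventory of what Word inserts, and the names `WORD_PUNCTUATION` and `byte_forms` are hypothetical.

```python
# Punctuation that Word's automatic formatting commonly produces.
# Illustrative selection only; written as escapes to keep this file ASCII.
WORD_PUNCTUATION = {
    "left single quote": "\u2018",
    "right single quote or apostrophe": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}


def byte_forms(char: str) -> tuple:
    """Return the hex bytes for char in Windows-1252 and in UTF-8."""
    return char.encode("cp1252").hex(" "), char.encode("utf-8").hex(" ")


# Build a small report: one line per character.
report = [
    f"{name}: windows-1252={byte_forms(char)[0]} utf-8={byte_forms(char)[1]}"
    for name, char in WORD_PUNCTUATION.items()
]
```

Each of these characters is a single byte in Windows-1252 but three bytes in UTF-8, so a page that stores them one way and declares the other cannot display them correctly.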

Exporting 'as web page' from Microsoft Word

Microsoft Word can create HTML pages for you, simply by exporting your document in the chosen format (see our related page What software can I use?). The web page that is created is written in the default character encoding (for example Windows-1252), so its characters display correctly when the page is viewed on its own. However, if you are creating content for use on the main LJMU Website, wish to use multiple languages in one page, or want to ensure greater accessibility, this character set is not suitable. Because of this, creating or editing a web page from a Microsoft Word source requires some additional steps to ensure your page is fully readable.
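If you already have a page exported from Word, one possible approach is to re-encode it with a short script. The sketch below is not an LJMU tool: the function name `word_html_to_utf8` is hypothetical, and it assumes the exported file really is Windows-1252 (`cp1252` in Python) and that any `charset=` text in the file is an encoding declaration.

```python
import re


def word_html_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode exported HTML bytes as UTF-8 and update the charset label.

    Assumption: the input was saved in source_encoding. If it was not,
    decoding may fail or produce the wrong characters.
    """
    text = data.decode(source_encoding)
    # Rewrite the declared charset so the label matches the new encoding.
    # This simple pattern changes every "charset=..." it finds in the file.
    text = re.sub(
        r"(charset\s*=\s*[\"']?)[\w-]+",
        r"\g<1>utf-8",
        text,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")
```

A typical use would be to read the exported file in binary mode, pass its contents through the function, and write the result to a new file, keeping the original until the converted page has been checked in a browser.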

Preparing your Web Page for UTF-8

Any web page that uses the LJMU Header include file, seen at the top of almost all of our web pages, will need to be UTF-8 encoded. UTF-8 is widely used and most HTML editors will accommodate it by default, or allow you to select it.

In addition to this, our web pages inform web browsers (Internet Explorer, Firefox, Opera, et cetera) about the encoding of the page, with the following line of HTML:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

As browsers can display pages in a range of different character sets, this line ensures they choose the right one.

The RedDot text editor encodes all web pages as UTF-8. Both the encoding declaration and the RedDot editor are only as good as the content they are given, so the following steps will ensure that copying and pasting from Microsoft Word into a web page or editor produces the correct result:
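To check which encoding a page declares, a small script can read the declaration in much the same way a browser would. This is a simplified sketch using Python's standard `html.parser` module; the names `CharsetFinder` and `declared_charset` are hypothetical, and real browsers apply further rules (HTTP headers, for instance) that are not modelled here.

```python
import re
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first character set declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form, as used in the line shown above.
            match = re.search(
                r"charset\s*=\s*([\w-]+)",
                attributes.get("content") or "",
                re.IGNORECASE,
            )
            if match:
                self.charset = match.group(1).lower()


def declared_charset(html: str):
    """Return the declared character set in lower case, or None."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset
```

Given the line of HTML shown above, `declared_charset` returns `utf-8`; for a page with no declaration it returns `None`.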

Pasting content from Microsoft Word into a UTF-8 Web Page, or the RedDot Text Editor

  • Once your editing in Word is complete, choose File->Save As...
  • From the format drop-down menu, choose the option 'Plain Text (*.txt)'.
  • Save the file to a known location, your desktop for example.
  • Before the file saves, a dialog box will appear asking you about encoding: choose 'other encoding'.
  • Then make sure you check the 'Allow Character Substitution' box.
  • Your document is then previewed, and you will see that characters such as 'curly quotes' are replaced with 'safe' ones.

You can then open the saved .txt file and safely copy the contents you require into a web page that uses UTF-8 encoding.

With these simple steps you can ensure that your pasted text is suitable for use across the LJMU Website. 
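For text that has already been copied out of Word, the substitution step above can be approximated in a script. The sketch below is only an approximation: the mapping is our own assumption about sensible 'safe' replacements, not a record of exactly what Word substitutes, and the names `SAFE_SUBSTITUTIONS` and `substitute_characters` are hypothetical.

```python
# Assumed replacements for common Word punctuation (not Word's own table).
SAFE_SUBSTITUTIONS = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote or apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
})


def substitute_characters(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(SAFE_SUBSTITUTIONS)
```

Any character not listed in the mapping is left unchanged, so accented letters and non-Western text pass through untouched and still rely on the page being saved as UTF-8.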



Page last modified 16 September 2011.
