How to clean HTML pasted from Word and Google Docs

Copy from Word, paste into CMS, get 400 lines of MSO tags. Here is why and how to fix it.

You copy a paragraph from Word. You paste it into your CMS. You switch to the HTML view and find 400 lines of garbage you did not write.

This is one of those problems that has existed for twenty years and still catches people out. It happens because Word and Google Docs were never designed to produce clean HTML. They were designed to produce documents that look right in their own editors, and they use HTML as a serialisation format with zero regard for what that HTML looks like.

What Word leaves behind

When you copy from Microsoft Word, the clipboard contains Rich Text and HTML. The HTML version is not web HTML. It is a dialect designed for round-tripping back into Word and rendering in Outlook.

Here is what you will typically find:

MsoNormal and friends. Every paragraph gets a class like MsoNormal, MsoListParagraph, or MsoTitle. These are Word's internal style names dumped straight into class attributes. They mean nothing to a browser.

mso- style properties. Word injects CSS properties like mso-bidi-font-size, mso-fareast-font-family, and mso-ansi-language. These are not real CSS. No browser recognises them. They exist so Word can reconstruct the original formatting if you paste the HTML back.

Conditional comments. Blocks wrapped in <!--[if gte mso 9]> and <![endif]--> contain XML markup intended exclusively for Microsoft Office. This includes embedded font declarations, list definitions, and namespace bindings.

Embedded XML namespaces. Word's HTML output often includes xmlns:o, xmlns:w, and xmlns:m namespace declarations. These reference Office-specific XML schemas. In a web page, they are dead weight.

Inline styles on everything. Even a simple bold paragraph arrives with inline font-family, font-size, line-height, and margin declarations. Word does not trust stylesheets. It inline-styles every element individually.

A single paragraph that reads "Quarterly revenue increased by 12%" can easily produce 30 lines of HTML. Most of it is noise.
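
The artifacts listed above are regular enough that a few patterns catch most of them. What follows is a minimal sketch in Python using only the standard library, not the implementation of any particular tool. The function name strip_word_markup and the helper names are made up for illustration, and regular expressions are a shortcut here: a production cleaner would more likely walk a parsed DOM.

```python
import html
import re

# Hidden conditional comments: <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if\b.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL)
# Office namespace declarations such as xmlns:o="..." (quoted values only)
XMLNS_ATTR = re.compile(r'\sxmlns:\w+\s*=\s*(["\']).*?\1', re.IGNORECASE | re.DOTALL)
# class attributes, quoted or unquoted (Word often omits the quotes)
CLASS_ATTR = re.compile(r'''\sclass\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''', re.IGNORECASE)
# style attributes with a quoted value
STYLE_ATTR = re.compile(r'\sstyle\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)


def _clean_class(match):
    """Drop class names that start with 'Mso'; drop the attribute if none remain."""
    value = next(group for group in match.groups() if group is not None)
    kept = [name for name in value.split() if not name.startswith("Mso")]
    return ' class="' + " ".join(kept) + '"' if kept else ""


def _clean_style(match):
    """Drop declarations whose property starts with 'mso-'; keep the rest."""
    kept = []
    for declaration in html.unescape(match.group(2)).split(";"):
        name, _, value = declaration.partition(":")
        name = name.strip().lower()
        if name and value.strip() and not name.startswith("mso-"):
            kept.append(name + ":" + " ".join(value.split()))
    if not kept:
        return ""
    return ' style="' + html.escape(";".join(kept), quote=True) + '"'


def strip_word_markup(markup):
    """Remove Office-only comments, namespaces, classes and style properties."""
    markup = CONDITIONAL_COMMENT.sub("", markup)
    markup = XMLNS_ATTR.sub("", markup)
    markup = CLASS_ATTR.sub(_clean_class, markup)
    return STYLE_ATTR.sub(_clean_style, markup)


sample = '<p class=MsoNormal style="margin-bottom:0cm;mso-ansi-language:EN-AU">Hi</p>'
print(strip_word_markup(sample))  # <p style="margin-bottom:0cm">Hi</p>
```

This only removes what is provably Office-specific. Ordinary inline styles such as margin-bottom survive, so a second pass is still needed if you want them gone too.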

What Google Docs leaves behind

Google Docs is cleaner than Word but still messy in its own way.

docs-internal-guid IDs. Every pasted block gets wrapped in a <b> tag (yes, bold) with an id attribute like docs-internal-guid-a1b2c3d4. The bold tag is not for emphasis. Google Docs uses it as a generic container. The ID is an internal reference that means nothing outside Google's editor.

Nested spans for basic formatting. A sentence with one bold word might arrive as three nested <span> elements, each with inline styles. Google Docs builds formatting by layering spans rather than using semantic elements.

The bold-with-font-weight-normal absurdity. Google Docs sometimes wraps content in a <b> tag and then applies font-weight: normal via inline style on a child span. The visual result is normal-weight text wrapped in a bold tag. It renders correctly because the explicit font-weight on the span overrides the bold it would otherwise inherit from the <b>, but the markup is nonsensical.

Excessive style attributes. Line-height, font-family, font-size, color, and background-color appear as inline styles on nearly every element. Google Docs does not use classes at all. Everything is inline.

Orphaned list markup. Lists sometimes arrive as sequences of paragraphs with manual numbering or bullet characters rather than proper <ol> or <ul> elements.
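
Orphaned lists are the one item above that needs rebuilding rather than stripping. Here is one possible way to do it, sketched in Python. It assumes inline wrappers have already been removed, so each paragraph's text starts directly after its <p> tag, and it treats a leading bullet character or "1." style number as a list marker. Both the heuristics and the name rebuild_lists are illustrative assumptions, and a paragraph that legitimately starts with "- " would be misread as a list item.

```python
import re

PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
BULLET = re.compile(r"^\s*[•●·\-*]\s+(.*)$", re.DOTALL)   # "• item", "- item"
NUMBER = re.compile(r"^\s*\d+[.)]\s+(.*)$", re.DOTALL)    # "1. item", "2) item"


def rebuild_lists(markup):
    """Wrap runs of bullet-like or numbered paragraphs in <ul> or <ol>."""
    out = []
    current = None  # "ul", "ol", or None while outside a list
    pos = 0

    def switch(kind):
        # Close the open list and/or open a new one when the kind changes.
        nonlocal current
        if kind != current:
            if current:
                out.append("</" + current + ">")
            if kind:
                out.append("<" + kind + ">")
            current = kind

    for match in PARAGRAPH.finditer(markup):
        gap = markup[pos:match.start()]
        pos = match.end()
        if gap.strip():
            switch(None)  # any other markup interrupts the list
            out.append(gap)
        body = match.group(1)
        bullet, number = BULLET.match(body), NUMBER.match(body)
        kind = "ul" if bullet else "ol" if number else None
        switch(kind)
        if kind:
            out.append("<li>" + (bullet or number).group(1).strip() + "</li>")
        else:
            out.append(match.group(0))

    switch(None)
    out.append(markup[pos:])
    return "".join(out)


print(rebuild_lists("<p>Intro</p><p>• One</p><p>• Two</p><p>Done</p>"))
# <p>Intro</p><ul><li>One</li><li>Two</li></ul><p>Done</p>
```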

The manual approach

The most common workaround is to paste as plain text. In most editors, that is Ctrl+Shift+V (or Cmd+Shift+V on macOS).

It works. It also strips everything. Bold, italic, headings, links, lists. You get raw text and have to rebuild all the formatting by hand.

For a two-line email signature, that is fine. For a 3,000-word document with headings, subheadings, bullet lists, and links, it is a waste of time.

The smart approach

A better solution is to strip the garbage and keep the formatting. That means:

  1. Detect the source. Word and Google Docs leave distinct signatures. MsoNormal classes and mso- properties mean Word. docs-internal-guid IDs mean Google Docs. The cleanup strategy differs for each.
  2. Remove proprietary attributes. Strip every class that starts with Mso, every CSS property that starts with mso-, every id that starts with docs-internal-guid. These are source-specific and serve no purpose on the web.
  3. Remove conditional comments. Everything between <!--[if and <![endif]--> can go. It is Office-only markup.
  4. Collapse redundant nesting. A <b> tag wrapping a <span> with font-weight: normal should resolve to just the content. Nested spans with identical styles should merge.
  5. Preserve semantic structure. Keep <h1> through <h6>, <p>, <ul>, <ol>, <li>, <a>, <strong>, <em>. These are the elements that carry meaning. Everything else is presentation noise.
  6. Clean inline styles. Remove font-family (your site has its own), remove font-size (your stylesheet handles that), remove line-height declarations. Keep intentional formatting like colour only if relevant.
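
The steps above can be sketched as a small allowlist cleaner built on Python's standard html.parser module. Treat it as an illustration under stated assumptions rather than how any specific cleaner works: the names detect_source, target_tag and clean are invented, <b> and <i> are assumed to map to <strong> and <em>, and every attribute except a link's href is discarded (a stricter reading of the last step, which would let you keep colour).

```python
import re
from collections import defaultdict
from html import escape
from html.parser import HTMLParser

KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li", "a", "strong", "em"}
SKIP_CONTENT = {"style", "script", "title"}  # text inside these is never content


def detect_source(markup):
    """Guess where the markup came from, based on its signature."""
    if "docs-internal-guid" in markup:
        return "google-docs"
    if re.search(r"class\s*=\s*[\"']?Mso", markup) or re.search(r"\bmso-[a-z-]+\s*:", markup):
        return "word"
    return "unknown"


def parse_style(value):
    """Split an inline style string into a {property: value} dict."""
    props = {}
    for declaration in (value or "").split(";"):
        name, _, prop_value = declaration.partition(":")
        if name.strip() and prop_value.strip():
            props[name.strip().lower()] = prop_value.strip().lower()
    return props


def target_tag(tag, attrs):
    """Pick the tag to emit, or None to drop the tag but keep its children."""
    attr_map = dict(attrs)
    props = parse_style(attr_map.get("style"))
    weight = props.get("font-weight")
    if tag == "b":
        is_wrapper = (attr_map.get("id") or "").startswith("docs-internal-guid")
        return None if is_wrapper or weight in ("normal", "400") else "strong"
    if tag == "i":
        return "em"
    if tag == "span":  # spans survive only when they carry bold or italic
        if weight in ("bold", "700"):
            return "strong"
        return "em" if props.get("font-style") == "italic" else None
    return tag if tag in KEEP else None


class SemanticCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open = defaultdict(list)  # source tag -> emitted names still open
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skipping += 1
        emitted = target_tag(tag, attrs)
        self.open[tag].append(emitted)
        if emitted is None:
            return
        href = dict(attrs).get("href") if emitted == "a" else None
        if href:
            self.out.append('<a href="' + escape(href, quote=True) + '">')
        else:
            self.out.append("<" + emitted + ">")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT and self.skipping:
            self.skipping -= 1
        if self.open[tag]:
            emitted = self.open[tag].pop()
            if emitted is not None:
                self.out.append("</" + emitted + ">")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def clean(markup):
    parser = SemanticCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Called as clean('<p class="MsoNormal"><b><span style="font-size:14.0pt">Quarterly Update</span></b></p>'), this returns <p><strong>Quarterly Update</strong></p>. An allowlist is the conservative choice here: anything the cleaner does not recognise is dropped by default. Conditional comments disappear as a side effect, because the parser treats them as ordinary comments and the sketch never re-emits comments.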

The result: clean, semantic HTML that works in any CMS, any email client, any website. The formatting you care about survives. The 400 lines of Office markup do not.

Before and after

Here is what a simple bold heading and paragraph look like when pasted from Word, before and after cleaning.

Before (from Word):

<p class="MsoNormal" style="margin-bottom:0cm;line-height:normal">
  <b>
    <span style="font-size:14.0pt;font-family:'Calibri',sans-serif;
    mso-fareast-font-family:'Times New Roman';mso-bidi-font-family:
    'Times New Roman';mso-ansi-language:EN-AU">
      Quarterly Update
    </span>
  </b>
</p>
<p class="MsoNormal" style="margin-bottom:0cm;line-height:normal">
  <span style="font-size:11.0pt;font-family:'Calibri',sans-serif;
  mso-fareast-font-family:'Times New Roman';mso-bidi-font-family:
  'Times New Roman';mso-ansi-language:EN-AU">
    Revenue increased by 12%.
  </span>
</p>

After (cleaned):

<p><strong>Quarterly Update</strong></p>
<p>Revenue increased by 12%.</p>

Same content. Same formatting. Roughly 90% less markup.
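
The size claim is easy to sanity-check by character count. The snippet below is a quick sketch in Python that compares the two samples from this section. The helper name reduction is made up, and the exact percentage depends on whitespace and on how the markup is wrapped.

```python
BEFORE = """<p class="MsoNormal" style="margin-bottom:0cm;line-height:normal">
  <b>
    <span style="font-size:14.0pt;font-family:'Calibri',sans-serif;
    mso-fareast-font-family:'Times New Roman';mso-bidi-font-family:
    'Times New Roman';mso-ansi-language:EN-AU">
      Quarterly Update
    </span>
  </b>
</p>
<p class="MsoNormal" style="margin-bottom:0cm;line-height:normal">
  <span style="font-size:11.0pt;font-family:'Calibri',sans-serif;
  mso-fareast-font-family:'Times New Roman';mso-bidi-font-family:
  'Times New Roman';mso-ansi-language:EN-AU">
    Revenue increased by 12%.
  </span>
</p>"""

AFTER = """<p><strong>Quarterly Update</strong></p>
<p>Revenue increased by 12%.</p>"""


def reduction(before, after):
    """Fraction of characters removed, e.g. 0.75 means 75% smaller."""
    return 1 - len(after) / len(before)


print(f"{reduction(BEFORE, AFTER):.0%} less markup")
```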

Other cleanup presets

Word and Google Docs are the most common offenders, but they are not the only ones. Our HTML cleaner includes presets for several other sources.

Email HTML. Outlook, Gmail, and Apple Mail each produce their own flavour of bloated HTML. The classic Outlook desktop app for Windows still uses Word's rendering engine, so email HTML written in it often contains the same mso- properties as Word documents.

AI output. ChatGPT and similar tools produce surprisingly clean HTML in some cases, but they also introduce characteristic patterns: smart quotes, specific dash styles, and predictable structure. Our AI cleanup preset handles those (see our post on cleaning ChatGPT output for the full breakdown).

Shopify. Shopify's rich text editor and product description fields accumulate stale inline styles over time, especially after copy-pasting from other sources. The Shopify preset strips those while preserving Liquid-compatible markup.

Try it

The HTML cleaner runs entirely in your browser. Paste your messy HTML, pick a preset or let it auto-detect the source, and copy the clean output. No uploads, no server processing. Your content stays on your device.

It handles the tedious work so you can focus on what the content says rather than fighting with what Word decided to do to it.