Office 2007 documents are zipped XML

Microsoft Office 2007 documents are zipped XML files. For example, you can change a Word document’s extension from .docx to .zip and unzip it. Apparently this isn’t widely known; most people I talk to are surprised when I mention this.

I’ve found a couple of uses for the zip/XML format. One is that you can unzip a document and grab all the embedded content. For example, .jpeg images are simply files zipped up inside the Office document.
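Here is a minimal sketch of pulling the embedded images out with Python’s standard zipfile module. The function name extract_media is my own invention, and the sketch assumes the images sit in a folder named "media" inside the archive (word/media/ for Word, xl/media/ for Excel, ppt/media/ for PowerPoint).

```python
import zipfile
from pathlib import Path


def extract_media(doc_path, out_dir):
    """Copy embedded media files out of an Office 2007+ document.

    Returns the list of file names that were written to out_dir.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(doc_path) as archive:
        for name in archive.namelist():
            # Assumption: embedded images live in a "media" folder,
            # e.g. word/media/image1.jpeg.
            if "/media/" in name and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    return saved
```

Calling extract_media("report.docx", "images") would then leave the pictures from a (hypothetical) report.docx in an images folder, with no need to rename the file to .zip first.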

Another use is that you can crack open a document’s underlying XML to search for something you can’t find via the user interface. You can unzip Office documents, tweak them, and zip them back up. I don’t recommend this, but I’ve done it when I was desperate. (Microsoft publishes an API for manipulating Office files. Using the official API is safer and, in the long run, easier, but I haven’t looked into it.)
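For the desperate, the search-and-tweak routine can be scripted. This is a rough sketch using only the standard zipfile module, not the official API mentioned above. The names find_in_document and rewrite_part are hypothetical, and the plain substring search is an assumption that will miss text Word has split across several XML elements.

```python
import zipfile


def find_in_document(doc_path, needle):
    """Return the names of XML parts in the document that contain needle."""
    hits = []
    with zipfile.ZipFile(doc_path) as archive:
        for name in archive.namelist():
            if name.endswith((".xml", ".rels")):
                text = archive.read(name).decode("utf-8", errors="replace")
                if needle in text:
                    hits.append(name)
    return hits


def rewrite_part(src_path, dst_path, part_name, transform):
    """Copy the document to dst_path, passing one part through transform.

    transform takes the part's bytes and returns the new bytes.
    The original file is left untouched.
    """
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == part_name:
                data = transform(data)
            dst.writestr(name, data)
```

Writing to a new file rather than editing in place means a botched tweak costs nothing: the original is still there if the result refuses to open.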


Related posts:

I owe Microsoft Word an apology
Contrasting Microsoft Word and LaTeX

Posted in Computing
5 comments on “Office 2007 documents are zipped XML”
  1. I agree. I also had the experience where one of my Excel files became corrupted. I was able to extract the relevant material by changing the file extension to zip and exploring the components. I blogged about it here (http://bit.ly/27be6l).

  2. This actually may be one of the coolest things about Office 2007+, particularly for anyone who’s ever suffered the agony of BIFF files.

  3. wcoenen says:

    This is also good to know because it has an impact on the efficiency of version control systems. Compression basically sabotages the ability of Subversion (or any other version control system) to generate small deltas.

    I have answered a question about this on Stack Overflow: http://stackoverflow.com/questions/1320654/will-subversion-efficiently-store-openxml-office-documents

  4. I believe this storage format was initiated by OpenOffice (actually Sun Microsystems).

  5. Tan says:

    can the content be put into one xml file? (to a single sheet)
