Microsoft Office 2007 documents are zipped XML files. For example, you can change a Word document’s extension from
.docx to .zip and unzip it. Apparently this isn’t widely known; most people I talk to are surprised when I mention this.
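You don’t strictly need to rename the file to look inside. Here is a minimal sketch using Python’s standard zipfile module; the function name list_parts and the file name report.docx are made up for illustration.

```python
import zipfile


def list_parts(path):
    """Return the names of the files stored inside an Office 2007+ document."""
    # zipfile looks at the file's contents, not its extension,
    # so a .docx, .xlsx, or .pptx can be opened directly.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


if __name__ == "__main__":
    # "report.docx" is a placeholder; point this at any document you have.
    for name in list_parts("report.docx"):
        print(name)
```

For a Word document, the listing typically includes an entry such as word/document.xml, which holds the main body text.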
I’ve found a couple of uses for the zip/XML format. One is that you can unzip a document and grab all the embedded content. For example, embedded .jpeg images are simply files zipped up inside the Office document.
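Grabbing the embedded images can be sketched in a few lines of Python. This assumes the images live in a folder named media inside the archive (word/media for Word, with similar folders for Excel and PowerPoint); extract_media and the paths in the usage section are hypothetical names.

```python
import zipfile
from pathlib import Path


def extract_media(doc_path, out_dir):
    """Copy every file found under a media/ folder of the document into out_dir.

    Returns the list of paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(doc_path) as archive:
        for name in archive.namelist():
            # Assumption: embedded images sit under ".../media/".
            if "/media/" in name and not name.endswith("/"):
                # Keep only the file name so nothing is written outside out_dir.
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved


if __name__ == "__main__":
    # Placeholder paths for illustration.
    for path in extract_media("report.docx", "extracted_images"):
        print("saved", path)
```

Because the images are stored as ordinary files, the bytes come out exactly as they went in; nothing is re-encoded.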
Another use is that you can crack open a document’s underlying XML to search for something you can’t find via the user interface. You can unzip Office documents, tweak them, and zip them back up. I don’t recommend this, but I’ve done it when I was desperate. (Microsoft publishes an API for manipulating Office files. Using the official APIs is safer and in the long run easier, but I haven’t looked into it.)
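Searching the underlying XML doesn’t require unzipping to disk either. The sketch below reads each XML part in memory and reports which ones contain a given string. The name find_in_parts and the sample search term are my own inventions. One caveat: Word can split a run of visible text across several XML elements, so a phrase you see on screen may not appear as one contiguous string in the raw XML.

```python
import zipfile


def find_in_parts(doc_path, needle):
    """Return the names of XML parts whose raw text contains needle."""
    hits = []
    with zipfile.ZipFile(doc_path) as archive:
        for name in archive.namelist():
            # Only look at XML parts (including relationship files),
            # skipping binary content such as images.
            if name.endswith((".xml", ".rels")):
                text = archive.read(name).decode("utf-8", errors="replace")
                if needle in text:
                    hits.append(name)
    return hits


if __name__ == "__main__":
    # Placeholder file name and search term.
    for part in find_in_parts("report.docx", "Confidential"):
        print(part)
```

This only reads the document. It never writes anything back, so it avoids the risk that comes with tweaking the XML and zipping it up again.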
5 thoughts on “Office 2007 documents are zipped XML”
I agree. I also had the experience where one of my Excel files became corrupted. I was able to extract the relevant material by changing the file extension to zip and exploring the components. I blogged about it here (http://bit.ly/27be6l).
This actually may be one of the coolest things about Office 2007+, particularly for anyone who’s ever suffered the agony of BIFF files.
This is also good to know because it has an impact on the efficiency of version control systems. Compression basically sabotages the ability of Subversion (or any other version control system) to generate small deltas.
I have answered a question about this on Stack Overflow: http://stackoverflow.com/questions/1320654/will-subversion-efficiently-store-openxml-office-documents
I believe this storage format was initiated by OpenOffice (actually Sun Microsystems).
Can the content be put into one XML file? (to a single sheet)