Wednesday, July 24, 2013

SharePoint, Exporting, and Ditching the Manifest


While working at one of my area's schools, I was tasked with getting all the pictures out of a SharePoint site. Specifically, it is a MOSS 2007 installation. I used the stsadm.exe program that comes with SharePoint to export everything from the site. When that runs, it creates a series of CMP files.

Turns out CMP files are just Windows CAB files with a different extension. You can rename a CMP to CAB and then simply double-click it to see what's inside. And what is inside? In most of the CABs, you will find DAT files. The DAT files are actual usable files (for the most part), but you have to know what they were to start with. For example, you could rename the appropriate DAT file as DOC (assuming that specific file had been a Word doc) and it would open in Word. In one of the CABs, though, you will find a Manifest.xml file. This lists every file the system backed up, what its "real" name is, and what the DAT file was called. You put two and two together and come up with the correct filename and extension for the files.
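If you want to confirm the "CMP is really a CAB" claim before renaming anything, a quick check is to look at the first four bytes of the file. Cabinet files begin with the ASCII signature MSCF. The Python sketch below is my own illustration, not part of the original process, and the function name looks_like_cab is made up for the example:

```python
CAB_SIGNATURE = b"MSCF"  # the four bytes a Microsoft Cabinet file starts with


def looks_like_cab(path):
    """Return True if the file at `path` starts with the cabinet signature."""
    with open(path, "rb") as handle:
        return handle.read(len(CAB_SIGNATURE)) == CAB_SIGNATURE
```

Calling looks_like_cab on one of the exported CMP files should return True if it really is a cabinet under the hood.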

That's great if you have a small set of files. In my case, the Manifest.xml alone was 380MB. Yes, that is 380 Megabytes! Ain't nobody got time for that!  Instead, I took the cheater's route: Rename every DAT file as JPG and see what happens.

Now, before we start patting backs here, let me explain what that entailed: Each CMP has multiple (from about a dozen to more than 2000, depending) files in it. In my case, there were 32 CMP files. I needed a way to extract the information from each of those CMPs and then convert the thousands of files to JPG. Here's another kick: The DAT files all start over in naming with each new CMP file. So, you can't simply extract every file into the same folder because files would overwrite each other.

Enter command line fun:
The first thing I did was rename the CMP to CAB. That was easy:
ren *.cmp *.cab
I created a new folder on the computer called SchoolName (I used the actual name, of course):
md c:\SchoolName
To keep things easy, I made sure I was working in the SchoolName folder on the C:\ drive:
c: (then press Enter), then type: cd\schoolName (and press Enter)
Next, I knew I would have to extract each CAB into its own folder. So, let's create folders:
for /L %a in (1,1,32) do call md schoolname%a
This ran a loop that created directories called schoolname1, schoolname2, etc., up to schoolname32, inside the SchoolName folder. So, from the root, they would be C:\SchoolName\schoolname1\, C:\SchoolName\schoolname2\, etc...
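For anyone more comfortable in a scripting language, here is roughly the same folder loop as a Python sketch. The one detail worth noticing is that for /L (start,step,end) includes the end value, so the Python range has to run to count + 1. The function name and its parameters are my own choices for illustration:

```python
from pathlib import Path


def make_numbered_folders(root, prefix, count):
    """Create prefix1 .. prefixN under root, like `for /L %a in (1,1,N)`."""
    created = []
    # for /L includes the end value, so range() needs count + 1.
    for number in range(1, count + 1):
        folder = Path(root) / f"{prefix}{number}"
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Something like make_numbered_folders(r"C:\SchoolName", "schoolname", 32) would mirror the command above.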

I switched back to my flash drive where the renamed CAB files resided:
e: (then press Enter, where "e" is the letter of your flash drive)
Now, I needed to extract those files from the CABs and put them in the correct folders:
for /L %I in (1,1,32) do call expand -F:* schoolname%I.cab c:schoolname%I 
Notice that I did *NOT* put a backslash after the c: in that line! By excluding the backslash, I am telling Windows to use the c: drive, but start in the last directory I accessed on that drive. That was SchoolName, remember? So, this will extract the files inside the CABs to the appropriate subdirectory in that SchoolName folder. That is, schoolname1.cab files will go into the schoolname1 subfolder, etc.
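The missing backslash is easy to overlook, so here is a small Python sketch that shows how Windows path rules treat the two spellings. It uses the standard ntpath module, which applies Windows rules on any operating system. The helper name describe is just for this example:

```python
import ntpath  # Windows path rules, importable on any platform


def describe(path):
    """Return (drive, rest, is_absolute) for a Windows-style path."""
    drive, rest = ntpath.splitdrive(path)
    return drive, rest, ntpath.isabs(path)


# No backslash after the drive: relative to the current folder on C:
print(describe("c:schoolname1"))
# Backslash after the drive: anchored at the root of C:
print(describe("c:\\schoolname1"))
```

The first path has a drive but is not absolute, which is why it lands under whatever folder was last current on C:. The second is absolute and would have put everything in the root of the drive.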

Okay, the next shortcut? I was just looking for pictures. I decided to use JPG as the extension of choice. So, if I rename all the files I just extracted to JPG, then I could use the "Medium Icons" view to see which files actually show images! Again, I don't have time to delve into every single folder to rename thousands of files, so let's have a FOR loop do the work. The extracted folders live in C:\SchoolName, so switch back there first (type c: and press Enter), then run:
for /L %I in (1,1,32) do ren schoolname%I\*.dat *.jpg
 Once that ran through, I opened each folder, deleted everything that wasn't a valid picture and kept the images. Now, I do realize that there are other picture formats. Ideally, I would have run the REN command to change the extension to PNG or GIF or whatever. But, I knew that the majority of the images they wanted to keep were JPG.
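If eyeballing thumbnails gets old, another option is to peek at the first few bytes of each DAT file, since common formats start with well-known signatures (JPEG with FF D8 FF, PNG with an 8-byte header, GIF with GIF87a or GIF89a, PDF with %PDF-). The Python sketch below is not what I actually ran. The function names are invented for the example, and it only knows the handful of formats listed:

```python
from pathlib import Path

# Leading bytes for a few common formats, paired with the extension to use.
SIGNATURES = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"%PDF-", ".pdf"),
]


def guess_extension(header):
    """Return an extension for the given leading bytes, or None if unknown."""
    for magic, extension in SIGNATURES:
        if header.startswith(magic):
            return extension
    return None


def rename_by_content(folder):
    """Rename each recognised .dat file in `folder`; return (old, new) names."""
    renamed = []
    for dat in sorted(Path(folder).glob("*.dat")):
        with open(dat, "rb") as handle:
            extension = guess_extension(handle.read(8))
        if extension:
            target = dat.with_suffix(extension)
            dat.rename(target)
            renamed.append((dat.name, target.name))
    return renamed
```

Files that match nothing in the list keep their DAT extension, so whatever is left over afterward is the pile that still needs a human look.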

Was this the "best" way to accomplish the task? Maybe not. Did it get the files? Yes. We ended up with 250MB worth of images that someone will have to sift through. I'm just glad THAT isn't my job.

Note: The above steps would work for any files. Need to pull all your PDFs? Just rename all the DATs as PDF and look for PDF thumbnails. DOC, XLS, PPT, etc. might be a bit trickier, and for somebody it might be worth trying to open a 380MB XML file. Not for me in the scope of this project.
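For that somebody: a 380MB XML file does not have to be opened in an editor. A streaming parser can walk through it one element at a time. The Python sketch below is untested against a real export and makes loud assumptions: that each file is described by an element named File, with the DAT name in a FileValue attribute and the real name in a Name attribute. Check those against your own Manifest.xml and adjust the parameters. Also, since the DAT names start over in each CMP, a name-only lookup like this one may not be enough on its own.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def map_dat_to_names(source, file_tag="File",
                     dat_attr="FileValue", name_attr="Name"):
    """Stream through a manifest and return {dat_name: real_name}.

    The default tag and attribute names are ASSUMPTIONS about the manifest
    layout, not something confirmed here; pass your own if they differ.
    """
    mapping = {}
    for _event, element in ET.iterparse(source, events=("end",)):
        if local_name(element.tag) == file_tag:
            dat_name = element.get(dat_attr)
            real_name = element.get(name_attr)
            if dat_name and real_name:
                mapping[dat_name] = real_name
        # Drop the text, attributes, and children of elements already seen,
        # so the whole document is never held in memory at once.
        element.clear()
    return mapping
```

The source argument can be a filename or an open file, so map_dat_to_names("Manifest.xml") would be the typical call.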
