When I decided to migrate my blog from Blogger.com to my own blogging software on my own domain, I did not want to leave my old posts behind. As I had more than 150 of them, I needed some automation to get that done within my lifetime.
The first step is to get the data out of the Blogger.com database to somewhere you can access it in full. You will need FTP or SFTP access to a computer with a real (publicly reachable) IP address. The idea is to go to Blogger.com and specify that you want to switch storage of your blog pages from their blogspot.com service to your own FTP or SFTP server. You will have to provide the server address, URL, folder, username and password. The URL does not matter at this point, but the other parameters define where your data will be dumped.
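For example, the settings could look something like this (server name, folder and credentials here are made up, use your own):

    Server address:     ftp.example.com
    URL:                http://www.example.com/oldblog/   (ignored for this export)
    Folder:             public_html/oldblog/
    Username/password:  your account on that server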
Save the changes and republish the blog. After a few dozen minutes all your data will be on your server as a set of complete HTML files. Now we need something that will parse those HTML files and write out something that your blogging software can understand. As I use Mnemosyne, my target format is a set of mail message files.
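To make the target format concrete: each entry ends up as a small mail-style message file, and the import script shown further down writes files that look roughly like this (the values here are invented for illustration):

    Date: Sat May 13 10:30:00 2006
    Subject: Example post title
    X-URL: http://oldblog.blogspot.com/2006/05/example-post.html
    X-Tags: untagged
    X-Format: html

    <p>The post body as an xHTML fragment.</p>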
To allow me to use HTML-formatted messages in my blog (and not only reST), I added this to my config.py:
# config.py additions; these imports are needed by the snippet below
import re, time, xml.dom.minidom, xml.parsers.expat
import docutils.core

class EntryMixin:
    def _init_content(self):
        """Read in the message's body, strip any signature, and format
        using reStructuredText unless X-Format == 'html'."""
        s = self.msg.get_payload(decode=True)
        if not s:
            return ''
        try:
            s = s[:s.rindex('-- ')]
        except ValueError:
            pass
        body = False
        try:
            if self.msg['X-Format'] == "html":
                # Make the fragment digestible for the XML parser: drop &nbsp;
                # and escape bare ampersands that are not part of an entity.
                body = s.replace("&nbsp;", " ")
                body = re.sub(r'&(?!\w{1,10};)', r'&amp;', body)
                body = xml.dom.minidom.parseString("<div>" + body + "</div>").toxml()
        except KeyError:
            pass
        except xml.parsers.expat.ExpatError, e:
            print "W: Parse failed for " + self.msg['Subject'] + " at " + self.msg['Date'] \
                + " from " + str(int(time.mktime(time.strptime(self.msg["Date"],
                                                               "%a %b %d %H:%M:%S %Y"))))
            print xml.parsers.expat.ErrorString(e.code), e.lineno, e.offset
        if not body:
            # Fall back to reStructuredText for non-HTML or invalid entries.
            parts = docutils.core.publish_parts(s, writer_name='html')
            body = parts['body']
        self.cache('content', body)
        return body
This will try to parse a message as pure xHTML if the custom header "X-Format" is set to "html" in the blog entry. There is one problem with this approach: the xHTML must be valid, otherwise the XML parser in the Kid templating engine will fail and there will be no end of trouble. That is why, even in HTML mode, we reparse the body to XML and back to a string again. If we get a parsing error at that point, we fall back to the reST parser.
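If you want to check a single post by hand before feeding it to the blog, the same parse-or-fail idea can be reduced to a tiny helper. This is just a sketch of mine, not part of Mnemosyne:

import xml.dom.minidom, xml.parsers.expat

def is_valid_xhtml(fragment):
    # Well-formedness check only: wrap the fragment in a <div> and try to parse it.
    try:
        xml.dom.minidom.parseString("<div>" + fragment + "</div>")
        return True
    except xml.parsers.expat.ExpatError:
        return False

print is_valid_xhtml('<p>fine</p>')            # True
print is_valid_xhtml('<img src="smile.png">')  # False, the tag is not closed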
Now we need something that will analyse the HTML files generated by Blogger.com and pull our content out of them. Here is the script that I used:
#!/usr/bin/python
import os, os.path, time, sys, glob, re, xml.dom.minidom

id = 1
host = "old"
# Regexps for the pieces we need from Blogger's generated pages:
# the date header, the post body and the permalink/time footer.
mdate = re.compile(r'\">(\d+) (\D+) (\d+)<')
mbody = re.compile(r'</div>(.+)<div style=')
mfooter = re.compile(r'<a href=\"http://example.com/(.+)\" title=\"permanent link\">(\d+):(\d+)</a></em>')

files = glob.glob("old/*/*/*.html")
for file in files:
    subject = ""
    oldurl = ""
    body = ""
    date = ""
    title = ""
    footer = ""
    f = open(file, "r")
    # Simple state machine: 0 = looking for the date header, 1 = looking for
    # the title or body, 2/3 = reading the title, 4 = accumulating the body,
    # 5 = looking for the "posted by" footer.
    status = 0
    for l in f:
        if status == 0:
            if l.find('class="date-header"') > 0:
                date = l
                status = 1
        elif status == 1:
            if l.find('class="post-title"') > 0:
                status = 2
            if l.find('class="post-body"') > 0:
                status = 4
        elif status == 2:
            if len(l.strip()) > 0:
                title = l
                status = 3
        elif status == 3:
            if l.find('class="post-body"') > 0:
                status = 4
        elif status == 4:
            if len(l.strip()) > 5:
                body += l
            if l.find('padding-bottom: 0.25em;') > 0:
                status = 5
        elif status == 5:
            if l.find('posted by ') > 0:
                footer = l
                break
    f.close()

    rdate = mdate.search(date)
    rbody = mbody.search(body)
    rfooter = mfooter.search(footer)
    year = rdate.groups()[2].strip()
    month = rdate.groups()[1].strip()
    day = rdate.groups()[0].strip()
    subject = title.strip()
    body = rbody.groups()[0]
    body = "<p>" + body + " </p>"
    # Close img tags that Blogger left unclosed so the xHTML stays valid.
    body = re.sub(r'<img([^>]*?[^/])>', r'<img\1/>', body)
    if subject == "":
        # No explicit title: use the first line of the body, capped at 40 chars.
        subject = re.sub(r'<br.*?>', '\n', body)
        subject = re.sub(r'</p.*?>', '\n', subject)
        subject = re.sub(r'<.*?>', '', subject)
        subject = subject.strip()
        line = subject.find('\n')
        if line > 45:
            subject = subject[:40] + "..."
        else:
            subject = subject[:line]
    oldurl = "http://oldblog.blogspot.com/" + rfooter.groups()[0].strip()
    hour = rfooter.groups()[1].strip()
    minute = rfooter.groups()[2].strip()
    mtime = time.mktime(time.strptime(day + " " + month + " " + year + " " + hour + ":" + minute,
                                      "%d %B %Y %H:%M"))
    # Bump the id until the output file name is unique in the output directory.
    outname = str(int(mtime))
    while os.path.exists("entries/new/" + outname + "." + str(id) + "." + host):
        id += 1
    outname = outname + "." + str(id) + "." + host
    out = open("entries/new/" + outname, "w")
    out.write("Date: " + time.ctime(mtime))
    out.write("\nSubject: " + subject)
    out.write("\nX-URL: " + oldurl)
    out.write("\nX-Tags: untagged")
    out.write("\nX-Format: html")
    out.write("\n\n")
    out.write(body)
    out.close()
Here I parse the HTML files using a simple state machine and then assemble all the data we need, such as the timestamp in the filename and the Date field. The script assumes that the current directory contains an "old/" directory with year subdirectories ("2005", "2006", ...), each of which holds month subdirectories containing the xHTML files of individual posts. The output is written to "entries/new/$timestamp.$id.$host" files.
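To illustrate, a page that Blogger stored as (names and numbers made up)

    old/2006/05/example-post.html

gets its date and time extracted from its contents and comes out as something like

    entries/new/1147507800.1.old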
After running this script and then running Mnemosyne you will see a bunch of warnings about messages that failed to parse. They show you where your xHTML is not valid. The usual problems are HTML entities (like &euro;) that the XML parser does not recognise, and img or br tags that are not closed:
This will fail because of &euro;, and <img src="smile.png"> will also fail; it must be <img src="smile.png"/>.
My import script fixes the img and br tags, but for other problems there is not much choice but to go through the entries and fix them up manually. Also, later on, if you want to paste some custom HTML into a post, you will have to mark the whole post as an HTML-mode post and manually check that the pasted HTML is valid XML.
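If a handful of named entities keeps breaking the parser, one way to cut down the manual work is to rewrite the common ones as numeric character references before importing. This is only a sketch of that idea (it is not part of the import script above, and the entity table is obviously incomplete):

# Named entities that a plain XML parser does not know, mapped to the
# numeric character references it does accept. Extend as needed.
ENTITIES = {
    '&nbsp;': '&#160;',
    '&euro;': '&#8364;',
    '&hellip;': '&#8230;',
}

def fix_entities(html):
    for name, ref in ENTITIES.items():
        html = html.replace(name, ref)
    return html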
There are some other fun things in this blog, but I will go into that in later posts.