View Part 1
Beardy Geek Git Hub Repository
Welcome back. In part 2 I’ll get to the meat of the issue: retrieving the data from an existing WordPress blog and feeding it into my own models.
Models
I’ve decided to start the models from scratch rather than copy the way WordPress lays them out, which should keep things simpler. I’ll be using the contrib comments system for the post comments, and I’ll also be creating three models of my own: Category, Tag, and Post. Here’s the code for the models:
from django.db import models
from django.contrib.auth.models import User
from datetime import datetime

POST_STATUS = (
    ('P', 'Published'),
    ('U', 'Unpublished'),
)

class Tag(models.Model):
    text = models.CharField(max_length=75)
    slug = models.CharField(max_length=75)

    def __unicode__(self):
        return self.text

class Category(models.Model):
    text = models.CharField(max_length=75)
    slug = models.CharField(max_length=75)

    def __unicode__(self):
        return self.text

    class Meta:
        verbose_name_plural = "categories"

class Post(models.Model):
    title = models.CharField(max_length=75)
    slug = models.CharField(max_length=75)
    content = models.TextField()
    author = models.ForeignKey(User)
    post_date = models.DateTimeField(default=datetime.now)
    status = models.CharField(max_length=1, choices=POST_STATUS)
    categories = models.ManyToManyField(Category)
    tags = models.ManyToManyField(Tag)

    def __unicode__(self):
        return self.title
WordPress Export
Just a quick note: I am using WordPress version 2.6.1. Things may be different in other versions, but the code here works on this one. I’ll go through the code step by step, so you should be able to see where the problem is if it doesn’t work with your version.
OK, if you didn’t already know, you can export all your WordPress data into an XML file. Go to your dashboard, click ‘Manage’, and then ‘Export’. Select which authors to restrict the export to (if any) and hit the download button. You should now have the data in an XML file.
XML File Editing
When I first tried to parse this file, I got an error because one of the namespaces used in the document is undeclared. To rectify this, open the file in a text editor (not an XML editor, which may give you the same parse error). Near the top of the file, you should see something like this:
The missing namespace is ‘excerpt’, and we need to add this. It doesn’t matter what value you give it. I did this:
Save the file and we’re ready to start parsing the data.
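If you’d rather not edit the file by hand, the same fix can be scripted. This is only a rough sketch: the function name declare_namespace is made up for illustration, it assumes the root element of the export is an rss tag, and the namespace URI is a placeholder (as noted above, the value doesn’t matter).

```python
import re
from xml.etree import ElementTree

def declare_namespace(xml_text, prefix, uri, root_tag='rss'):
    """Add xmlns:<prefix>="<uri>" to the root start tag if it is missing.

    Hypothetical helper; assumes the root tag name is known in advance.
    """
    if 'xmlns:%s=' % prefix in xml_text:
        return xml_text  # already declared, nothing to do
    # Insert the attribute straight after the root tag name (first match only).
    pattern = r'<%s\b' % re.escape(root_tag)
    replacement = '<%s xmlns:%s="%s"' % (root_tag, prefix, uri)
    return re.sub(pattern, replacement, xml_text, count=1)

# Tiny stand-in for an export that uses the undeclared 'excerpt' prefix.
sample = ('<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">'
          '<channel><item><excerpt:encoded>hi</excerpt:encoded></item></channel>'
          '</rss>')

# Placeholder URI, chosen only for this example.
fixed = declare_namespace(sample, 'excerpt', 'http://example.com/excerpt/')
root = ElementTree.fromstring(fixed)  # parses now that the prefix is declared
```

For a real export you would read the file into a string, run it through the helper, and write the result back out before parsing.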
XML Parsing
To parse the XML I have used ElementTree. This is included in Python 2.5 and later (as xml.etree.ElementTree), but if you’re using an earlier version, you can get ElementTree from Effbot.org.
I won’t include all the code here, just some snippets as examples. To view the full source, please check out the BeardyGeek Github Repository.
First we need to load our xml file.
from xml.etree import ElementTree  # Python 2.5 and later
tree = ElementTree.parse('c:/wordpress.xml')
Then find the top level under which our data resides, which is the ‘channel’ tag.
chan = tree.find('channel')
Now I want to create some shortcut variables for the namespaces that we will use when finding tags.
wp_ns = '{http://wordpress.org/export/1.0/}'
content_ns = '{http://purl.org/rss/1.0/modules/content/}'
Now we can get all the category, tag and item(post) entries:
cats = chan.findall(wp_ns + 'category')
tags = chan.findall(wp_ns + 'tag')
items = chan.findall('item')
This gives us three lists of elements: one for the categories, one for the tags, and one for the items.
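To try the lookups without a real export file, here is a small self-contained sketch that runs the same find and findall calls against an in-memory document. The sample XML is invented for illustration, and the tag_slug and tag_name element names are my assumption about what a tag entry looks like, so check them against your own export.

```python
from xml.etree import ElementTree

# ElementTree addresses namespaced tags as '{namespace-uri}localname'.
WP_NS = '{http://wordpress.org/export/1.0/}'

# Invented, heavily trimmed stand-in for a WordPress export.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <wp:category>
      <wp:category_nicename>django</wp:category_nicename>
      <wp:cat_name>Django</wp:cat_name>
    </wp:category>
    <wp:tag>
      <wp:tag_slug>xml</wp:tag_slug>
      <wp:tag_name>XML</wp:tag_name>
    </wp:tag>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>"""

root = ElementTree.fromstring(SAMPLE)    # root is the <rss> element
chan = root.find('channel')              # 'channel' has no namespace prefix
cats = chan.findall(WP_NS + 'category')  # namespaced children need the prefix
tags = chan.findall(WP_NS + 'tag')
items = chan.findall('item')
```

Note that findall only looks at direct children of chan, which is what we want here: the category elements nested inside each item are not picked up.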
Finding and Saving Data
I’ll give an example of saving the data using the category tag.
for cat in cats:
    c = Category(text=cat.find(wp_ns + 'cat_name').text,
                 slug=cat.find(wp_ns + 'category_nicename').text)
    c.save()
The tag data is saved in the same way, using the Tag model.
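As a rough illustration of the tag case, the sketch below pulls the text and slug out of a single tag element. The helper name text_and_slug is hypothetical, and the tag_name and tag_slug element names are assumptions on my part, so compare them with your export before relying on them. It uses findtext so that a missing or empty child gives an empty string instead of raising an error.

```python
from xml.etree import ElementTree

WP_NS = '{http://wordpress.org/export/1.0/}'

def text_and_slug(element, text_tag, slug_tag):
    """Return (text, slug) for one element; '' for missing or empty children."""
    text = element.findtext(WP_NS + text_tag) or ''
    slug = element.findtext(WP_NS + slug_tag) or ''
    return text, slug

# Invented sample; the child element names are assumed, not taken from the post.
tag_el = ElementTree.fromstring(
    '<wp:tag xmlns:wp="http://wordpress.org/export/1.0/">'
    '<wp:tag_slug>xml-parsing</wp:tag_slug>'
    '<wp:tag_name>XML parsing</wp:tag_name>'
    '</wp:tag>')

tag_text, tag_slug = text_and_slug(tag_el, 'tag_name', 'tag_slug')
# With the models above, this pair would feed Tag(text=tag_text, slug=tag_slug).
```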
The item data is a bit more complex. If you look at the XML you’ve exported, you’ll see that the item data includes both posts and pages. It also includes all previous revisions of each post, including any drafts saved whilst writing it. So we need to find all those with a status of ‘publish’ and a post type of ‘post’. We’ll deal with the ‘page’ data another time, using Flatpages.
for item in items:
    # Skip pages, drafts and revisions: keep only published posts.
    if (item.find(wp_ns + 'status').text == 'publish' and
            item.find(wp_ns + 'post_type').text == 'post'):
        i = Post(title=item.find('title').text,
                 slug=item.find(wp_ns + 'post_name').text,
                 content=item.find(content_ns + 'encoded').text,
                 author=u,
                 post_date=item.find(wp_ns + 'post_date').text,
                 status='P')
        i.save()
The ‘u’ (the value for author) is a User object created earlier in the code, which I’ve used as the default author of each post (see source).
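The filtering step can be tried on its own, without Django. This is a minimal sketch on invented sample data; the helper names are hypothetical, and the 'YYYY-MM-DD HH:MM:SS' date format is an assumption about the export that you should check against your own file.

```python
from datetime import datetime
from xml.etree import ElementTree

WP_NS = '{http://wordpress.org/export/1.0/}'

def is_published_post(item):
    """True only for items that are published and have a post type of 'post'."""
    return (item.findtext(WP_NS + 'status') == 'publish' and
            item.findtext(WP_NS + 'post_type') == 'post')

def parse_wp_date(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into a datetime."""
    return datetime.strptime(text, '%Y-%m-%d %H:%M:%S')

# Invented sample: one live post, one draft, one page.
SAMPLE = """<channel xmlns:wp="http://wordpress.org/export/1.0/">
  <item><title>Live post</title><wp:post_date>2008-09-01 10:30:00</wp:post_date>
    <wp:status>publish</wp:status><wp:post_type>post</wp:post_type></item>
  <item><title>Draft</title><wp:post_date>2008-09-02 11:00:00</wp:post_date>
    <wp:status>draft</wp:status><wp:post_type>post</wp:post_type></item>
  <item><title>About</title><wp:post_date>2008-09-03 12:00:00</wp:post_date>
    <wp:status>publish</wp:status><wp:post_type>page</wp:post_type></item>
</channel>"""

chan = ElementTree.fromstring(SAMPLE)
posts = [item for item in chan.findall('item') if is_published_post(item)]
titles = [item.findtext('title') for item in posts]
first_date = parse_wp_date(posts[0].findtext(WP_NS + 'post_date'))
```

Only the first item survives the filter; the draft fails the status check and the page fails the post type check.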
Post Categories
Now we have to find out which categories and tags this post has. Within each ‘item’ we have ‘category’ elements. A bit confusingly, these also include the tags, so you need to look at the ‘domain’ attribute to see which is which. We also only need the entries that have a ‘nicename’ attribute (the slugified name).
# Still inside the item loop; i is the Post saved above.
post_cats = item.findall('category')
for pc in post_cats:
    # Only the entries that carry a 'nicename' (slug) attribute are needed.
    if pc.get('nicename'):
        if pc.attrib['domain'] == 'category':
            c2 = Category.objects.get(slug=pc.attrib['nicename'])
            i.categories.add(c2)
        elif pc.attrib['domain'] == 'tag':
            t2 = Tag.objects.get(slug=pc.attrib['nicename'])
            i.tags.add(t2)
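To see the ‘domain’ and ‘nicename’ logic in isolation, here is a small sketch that sorts the slugs into two lists instead of touching the database. The function name and the sample item are invented for illustration; the sample just mimics the mix of entries described above.

```python
from xml.etree import ElementTree

def split_categories_and_tags(item):
    """Return (category_slugs, tag_slugs) for one item element."""
    cat_slugs, tag_slugs = [], []
    for el in item.findall('category'):
        slug = el.get('nicename')
        if not slug:
            continue  # entry without a slug, so nothing to look up
        domain = el.get('domain')
        if domain == 'category':
            cat_slugs.append(slug)
        elif domain == 'tag':
            tag_slugs.append(slug)
    return cat_slugs, tag_slugs

# Invented sample: each name appears with and without a nicename attribute.
item = ElementTree.fromstring(
    '<item>'
    '<category>Django</category>'
    '<category domain="category" nicename="django">Django</category>'
    '<category domain="tag">xml</category>'
    '<category domain="tag" nicename="xml">xml</category>'
    '</item>')

cat_slugs, tag_slugs = split_categories_and_tags(item)
```

Using el.get('domain') rather than el.attrib['domain'] means an entry with a nicename but no domain is skipped quietly instead of raising a KeyError.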
Comments
The last section deals with comments. I am using the django.contrib.comments module for this.
# Still inside the item loop; ct and site are not defined in this snippet (see source).
comments = item.findall(wp_ns + 'comment')
for comm in comments:
    # Empty elements have a text of None, so fall back to an empty string.
    comm_email = comm.find(wp_ns + 'comment_author_email').text or ''
    comm_url = comm.find(wp_ns + 'comment_author_url').text or ''
    db_comm = Comment(comment=comm.find(wp_ns + 'comment_content').text,
                      ip_address=comm.find(wp_ns + 'comment_author_IP').text,
                      object_pk=i.id,
                      submit_date=comm.find(wp_ns + 'comment_date').text,
                      user_email=comm_email,
                      user_name=comm.find(wp_ns + 'comment_author').text[:50],
                      user_url=comm_url,
                      content_type=ct, site=site)
    db_comm.save()
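The field extraction can also be pulled into a helper that is easy to test away from Django. This is only a sketch: comment_fields is a hypothetical name, the sample comment is invented, and falling back to an empty string for every field (not just the email and URL) is my own choice rather than what the code above does.

```python
from xml.etree import ElementTree

WP_NS = '{http://wordpress.org/export/1.0/}'

def comment_fields(comm):
    """Collect the comment values into a dict keyed by Comment field name."""
    def text(tag):
        # findtext gives None for a missing child and '' for an empty one.
        return comm.findtext(WP_NS + tag) or ''
    return {
        'comment': text('comment_content'),
        'ip_address': text('comment_author_IP'),
        'submit_date': text('comment_date'),
        'user_email': text('comment_author_email'),
        'user_name': text('comment_author')[:50],  # same 50 character cut as above
        'user_url': text('comment_author_url'),
    }

# Invented sample with an empty email and no URL element at all.
comm = ElementTree.fromstring(
    '<wp:comment xmlns:wp="http://wordpress.org/export/1.0/">'
    '<wp:comment_author>Jo</wp:comment_author>'
    '<wp:comment_author_email></wp:comment_author_email>'
    '<wp:comment_author_IP>127.0.0.1</wp:comment_author_IP>'
    '<wp:comment_date>2008-09-01 12:00:00</wp:comment_date>'
    '<wp:comment_content>Nice post</wp:comment_content>'
    '</wp:comment>')

fields = comment_fields(comm)
```

The resulting dict could then be passed along with the other arguments, for example Comment(object_pk=i.id, content_type=ct, site=site, **fields).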
Conclusion
Well, that wraps it up for this post. I'll cover extracting the data for Flatpages in the next post, but this should give you enough to get started. You can see how to extract the required data from the XML document, so if you want to extend the models beyond what I have, you shouldn't have any problems. Again, check out the full code, plus the other file changes (url.py etc) at the Beardy Geek Git Hub Repository. Have fun.