Export or Print
Is there a smart way to export all stories (including their comments) for one month, or to print them out?
Even though I am a net addict, I would like to keep a paper copy at some point.
nex
I don't know of any way to export the raw data directly from the database and, for example, import it into your local Antville installation. You could, however, create a local static mirror of the HTML pages using wget, or you could do something similar with your RSS feed. I'm not aware of an RSS reader that allows you to put all stories of one month on one page, e.g. for printing them out, but maybe you can find one at blogspace or so. And personally, I think you should keep the format electronic (full text search, perfect copies possible, no coffee spills) and let the poor trees alone :-)
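If you want to try the wget route, a single command like this already gives you a browsable copy on your disk (a rough, untested sketch; replace your.antville.org with your own blog's address):

wget -r -np -k http://your.antville.org/

-r follows the links, -np keeps wget from climbing up out of your blog, and -k rewrites the links so they work locally.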
Update: It turns out I lied above. I never actually used RSS, much less one of the various newsreaders. But now I checked out the link I provided above :-) Turns out that RSS is only about delivering headlines; if you want the full story, you get the HTML version anyway. Of the clients listed there, I think the only one I don't hate is peerkat. It allows you to use a MySQL database as its data store, which means that you could easily export the data from there (see the mysqldump line below) and process it further. But you won't get the whole stories, so I guess that's not what you want. The RSS people are working on a module that delivers all the content of a story, so once Antville supports this, we should really be able to suck a whole month's content into a local database in a bandwidth-efficient manner (spidering with wget downloads quite some redundant data).

By the way, I suggested wget several times already, but never wrote a HOWTO, because no one ever asked. If you need one, tell me.
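That mysqldump step would be nothing fancy, by the way; roughly something like this, where 'peerkat' just stands for whatever the database is actually called in your setup:

mysqldump -u yourusername -p peerkat > blog-export.sql

That leaves you with everything as plain SQL in blog-export.sql, which you can then massage into whatever format you like.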
sturmfisch
wow - that was a lengthy reply. thanks a lot for it.
why would I want a paper copy?
sometimes I think I am very conservative. say in 40 years I'm dead. wouldn't it be interesting for my children to know that I was maintaining a log, and to actually be able to read what was going on during that time? I can guarantee that my papers will still exist in 40 years; can antville do that as well?
I was also writing a "2002 review" and found it rather difficult to go through all the stories. A paper copy would still be easier and more user-friendly [needless to say I still love antville!!]
rss and wget
not sure if I am able to manage any of this, but I would give it a try if somebody would hold my hand or help a dummy ...
nex
I also don't rely on Antville to keep data that's important to me; I mirror my blog and save it with my own backups. I keep my backups in different places, so they'd survive even if the whole house burnt down, so this method is even safer than keeping a hardcopy. And I think for my children, paging through hypertext will be as natural as paging through folders of hardcopy is to us. However, this method requires regular care of the backups: CD-Rs won't last 40 years the way laser prints do, so you have to copy the data from time to time.
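In practice that just means copying the mirror somewhere else every now and then, e.g. something like this (both paths are only placeholders; the first is the folder wget creates when it mirrors your site):

cp -r your.antville.org /mnt/backup/blog-2003-01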
Anyway, about wget: This is a non-interactive spider/download tool, which is rather well known on Linux, but also available under DOS/Windows. You specify options in a command line or in a file that tell it what files to retrieve from the net (http or ftp) and it gets them. Just download it, skim over the help file and try it out.
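A first test can be as small as this; it only fetches your front page into the current directory:

wget http://your.antville.org/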
A perfect set of options would ensure that no redundant or unneeded pages are loaded; for example, you wouldn't need an edit form for every story in a static copy of your blog, since they wouldn't work anyway. It would also send a cookie with every request so your private offline-stories are also retrieved. I don't have the time to do this tonight, mainly because I'm tidying up my room and I have to finish that job before I go to bed, because all the stuff I'm moving around is temporarily stored on my bed :-)
But I can provide a starting point: Create a batch file/shell script named 'backup-blog', which executes this command:
wget http://your.antville.org --dot-style=binary -r -l 3 -np -k -t 1
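Under Linux, the whole 'backup-blog' file could literally be just this (on Windows, you would put the same wget line into a backup-blog.bat instead):

#!/bin/sh
# mirrors the blog into a folder named your.antville.org in the current directory
wget http://your.antville.org --dot-style=binary -r -l 3 -np -k -t 1

Make it executable with 'chmod +x backup-blog' and run it whenever you want a fresh copy.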
Explanation:
--dot-style=binary: just makes sure you get some neat feedback on what it's doing; it isn't important at all but looks cool.

-r: makes the retrieval recursive, i.e. it follows all links to other pages in the page located at the initial URL, and if these pages contain links, it follows them further and so on. You don't have to be afraid that it does any harm in your blog or trashes anything: firstly, wget is not logged in, so it doesn't get any edit or delete links, and it wouldn't be allowed to do something like that anyway. Secondly, everything that changes anything, like editing or commenting, isn't activated through a link, but through a button. wget will follow all 'comment' links, but it will get the 'login' page every time as a result.

-l 3: restricts the level of recursion to 3; i.e. if wget follows a link from your front page, then follows another link, and then another link again, it will stop there instead of going on forever. You might want to increase this number to cause all stories to be downloaded. Setting it to the number of months you have will make sure that every story will be available through the calendar, but not necessarily through the 'previous stories' links. E.g., suppose your blog is 4 months old and you set -l 4, but in one topic you have so many stories that they span 10 pages; then wget won't follow the 'previous' links all the way back to the 10th page, so they won't work in your local mirror. However, you will have saved all stories and can explore them by looking at the calendar or the folders on your disk.

-np: means "no parent" and is very, very important if you don't want to download the whole www. It restricts wget to only go deeper in the directory hierarchy and never up. This means, from your front page, it will advance into your topics and pages of single days, but it will never go up to www.antville.org and it won't follow links to yahoo.com or any other site, which would be quite a catastrophe.

-k: is a nifty option that converts absolute links to relative links, e.g. "your.antville.org/topics/bierdeckelsammeln" would be converted to something like "../../topics/bierdeckelsammeln". This is cool, because when you click the link in your local mirror, you won't be sent to antville, but to the local copy of that page. Of course, the link will only be converted if you really have a local copy of this page. In practice, this means you can click through the local copy of your blog and the page of every day will come from your disk in fractions of a second.

-t 1: specifies that if a page or file cannot be reached, wget will retry to get it only once. If your connection is unreliable, you might want to increase this number.

So there you are, a starting point. What this doesn't do is skip the redundant pages (like the useless edit forms), and it doesn't send a login cookie, so your private offline stories won't end up in the mirror.
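If you want to close those two gaps, something along these lines might work. It's untested: -R tells wget to skip pages whose name matches the listed patterns (assuming the edit and delete pages really are called 'edit' and 'delete'), and the cookie name and value are placeholders you'd have to copy out of your browser after logging in:

wget http://your.antville.org --dot-style=binary -r -l 3 -np -k -t 1 -R edit,delete --header="Cookie: <cookie-name>=<cookie-value>"

Be a bit careful with the cookie variant, though: with a login cookie, wget will also see the edit and delete links, so skipping them with -R is a good idea there.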
Sorry for writing such a long-winded story again! I'm suffering from geek syndrome and always have to explain everything I know; hope this helps :-)