A new kind of wiki
What is a wiki? It has to be more than just a program to transform _light_ *weight* text markup into valid HTML. Beyond the markup, a wiki is a platform that helps many people work on a shared document. Being a programmer, I'm familiar with this concept since we use similar tools to work on shared programming projects: revision control systems.
The state of the art in revision control systems is Git and Mercurial. What is it that makes them so good? Some people will tell you that it's because they are distributed. That's a good point, but there is more to it. They manage to work without a centralized authority by merging concurrent changes. To do that, they have to detect which changes you need to merge and which ones you already have, and the key to that is keeping the full family lineage of every revision of every file for all the people that you work with. The main problem with Subversion is not that it's centralized; it's that it flattens the history into a linear series of revisions, and that destroys the hope for smart merges.
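To make the lineage argument concrete, here is a minimal Python sketch of how a full history lets a tool work out which revisions still need merging. It is only an illustration, not how Git or Mercurial are implemented; the `parents` mapping and the revision names are made up for the example.

```python
def ancestors(parents, rev):
    """Return rev plus every revision reachable through parent links."""
    seen = set()
    stack = [rev]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def missing_revisions(parents, mine, theirs):
    """Revisions in the other lineage that are not yet part of mine."""
    return ancestors(parents, theirs) - ancestors(parents, mine)


# Hypothetical history: "a" is the root, "b" and "c" are concurrent edits
# of "a", and "d" is a further edit on top of "c".
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["c"]}
to_merge = missing_revisions(parents, mine="b", theirs="d")
```

With a flattened, linear history the parent links are gone, so there is nothing to compute this set from.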
Those tools that make programmers incredibly more productive, for some reason, did not make it into the wiki world. Why is that so? I studied the internals of revision control tools and even though there are a few tricky parts, as long as you keep the full history lineage, doing a smart merge is really not that hard. My conclusion is that there is absolutely no reason why a wiki could not do smart branch and merge.
Now I am proud to announce the first public release of Gazest, my attempt at solving this problem. Gazest is a wiki engine with a smart storage model. The full branch and merge is yet to be implemented, but I already see much improvement in conflict resolution during concurrent edits and in undoing old revisions. Don't take my word for it, try the demo site. I want to tell more about the internals, but first I have to implement branch and merge, so stay tuned.
Update: I have to mention that Ikiwiki supports smart branch and merge by using Git as the storage system. It differs in scope from Gazest though. It is more oriented towards developers since you need to use Git directly if you want to branch.
Comments
As I understand it, Gazest uses a relational database (via SQLAlchemy) to store the DAG of revisions. Although I might agree that it is better to use a database for a web server backend, please read the email from Linus Torvalds (author of Linux and the Git version control system) about why SCMs do not use a database for the backend:
- http://permalink.gmane.org/gmane.comp.version-control.git/42057
Excerpt below:
Just curious why won't you use something like PostgreSQL for data storage at this point, but, then I know nothing about git internals
I can pretty much guarantee that if we used a "real" database, we'd have
- really really horrendously bad performance
- total inability to actually recover from errors.
Other SCM projects have used databases, and it always boils down to that. Most either die off, or decide to just do their own homegrown database (eg switching to FSFS for SVN).
That's a very interesting post; thanks for the link. As you guessed, Gazest uses a relational database to store the history DAG. It was simply easier to implement that way.
You might wonder why I didn't use Git or Mercurial directly. Well, the interface that those provide assumes that each user will have their own repository and that most actions are on a whole tree of sources. Am I to internally clone the whole repo when a user clicks edit? Most actions on a wiki touch a single page. This is not to say that there would be no gain in using the low-level Git storage, and I might implement it one of these days, but doing so is far from trivial.
Until the profiler tells me otherwise, a database is fast enough. At the moment, I'm looking more for a fast renderer for Markdown.
About working on the whole tree of sources: True, most SCMs are geared more towards software projects than websites / web portals. Deciding to go for per-file history has its appeal... until you consider changes like moving a chunk of content from one file (web page) to another, or renaming a file. For example, on some MoinMoin wiki I created a redirect page with the correctly spelled name before I had permissions to move and delete pages. Later I wanted to rename the page to the correct spelling, removing the redirect page. But because MoinMoin has per-file history (never mind nonlinear history), it is not possible.
About cloning the repository on edit: It is not necessary. Even distributed SCMs like Git or Mercurial can work with a single central repository. It would be better / simpler to just create a branch on edit, and after saving merge it into "trunk". If it is a fast-forward, it succeeds. If it fails, the author has to solve merge conflicts (and merge again), perhaps with an option to rebase instead.
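The branch-on-edit flow described above could be sketched roughly as follows. This is a hypothetical illustration in Python, not code from Git, Mercurial, or Gazest; `parents`, `is_ancestor`, and `save_edit` are names invented for the example.

```python
def is_ancestor(parents, old, new):
    """True if `old` is reachable from `new` by following parent links."""
    stack, seen = [new], set()
    while stack:
        node = stack.pop()
        if node == old:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False


def save_edit(parents, trunk, edit_head):
    """Return the new trunk head, or None when a real merge is needed."""
    if is_ancestor(parents, trunk, edit_head):
        # Fast-forward: nothing happened on trunk since the edit started,
        # so trunk simply advances to the edited revision.
        return edit_head
    # Trunk moved in the meantime; the author has to merge or rebase.
    return None


# Hypothetical history: "b" and "c" are two edits that both started from "a".
parents = {"a": [], "b": ["a"], "c": ["a"]}
first_save = save_edit(parents, trunk="a", edit_head="b")
second_save = save_edit(parents, trunk="b", edit_head="c")
```

Here the first save fast-forwards trunk to "b", while the second one cannot, because "c" was written without knowing about "b".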
About storing the DAG of revisions in a database: Perhaps for a small wiki it would be enough. You probably wouldn't need the performance of SCMs, with their ability to handle a very large number of possibly huge files with very long history.
About the Markdown renderer: The one you use doesn't seem to understand '>' for quoting. You do cache results, don't you?
I track content, not pages. Pages are just pointers in the DAG where revnodes hold content. I don't have copy or redirect yet but it will just be pointer manipulations on a shared DAG.
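As a rough illustration of what "pages are just pointers" could mean, here is a small Python sketch. It is not Gazest's actual schema or code; `RevNode`, `pages`, `save`, `copy_page`, and `rename_page` are hypothetical names, and the in-memory dict stands in for whatever the database really stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevNode:
    """A node in the history DAG: some content plus links to its parents."""
    content: str
    parents: tuple = ()


# Page names are only pointers to the current revnode.
pages = {}


def save(name, content):
    """Record a new revision whose parent is the page's current revnode."""
    prev = pages.get(name)
    node = RevNode(content, (prev,) if prev is not None else ())
    pages[name] = node
    return node


def copy_page(src, dst):
    """Copy: a second pointer to the same revnode, so history is shared."""
    pages[dst] = pages[src]


def rename_page(src, dst):
    """Rename: move the pointer; the revnodes themselves are untouched."""
    pages[dst] = pages.pop(src)


save("Hom", "first draft")
save("Hom", "second draft")
rename_page("Hom", "Home")  # fix the misspelled name, history comes along
copy_page("Home", "Start")
```

Because the history hangs off the revnodes rather than the page name, the renamed page keeps its full lineage.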
When you edit, each time you hit preview, Gazest tries to merge. If there is a merge, you end up in a micro-branch. The most straightforward way to do this with the storage of a revision control system would be to branch. Right? The bzr devs just told me that bzrlib has per-file branching.
I just switched to python-markdown2 for a 30-fold speedup. You read that right. Who would care if I spent a week hacking a better storage model when I can't spit out pages at a decent rate? I do not rule out file-based storage, but for now, there are far more pressing issues, mostly on the usability front.
I'm really open minded on that, if someone comes up with a working patch, I will merge it.
File-based storage: Git does not store history per file. In the loose format it stores file contents ('blob' objects), directory contents ('tree' objects) holding file names (for a web page this would be the URL or part of the URL) and parts of the file mode, and points of history with descriptions ('commit' objects), all referenced by their contents (or rather the sha1 of type+contents), and all zlib-compressed. A commit object points to other commit objects (the commit's parents), creating the DAG of revisions, and to a top-level tree object storing the whole state of the project at the given revision. There is a reason it is called an object database.
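The content addressing described above can be reproduced in a few lines of Python. This sketch follows Git's loose object layout (a header of type, space, byte length, and a NUL byte, followed by the data); the function name `git_object_id` is made up for the example.

```python
import hashlib
import zlib


def git_object_id(obj_type, data):
    """Return (sha1 hex id, zlib-compressed bytes) for a loose Git object."""
    # The id is the SHA-1 of the header plus the raw contents.
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    store = header + data
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# The empty blob has the well-known id e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.
oid, compressed = git_object_id("blob", b"")
```

Two pages with identical text therefore end up as the same blob, whatever their names are.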
In the packed format it stores everything in one file (the pack file), delta-compressed using modified LibXDiff code, with the results also individually zlib-compressed, and uses an index file for access.
Git has a collection of low-level tools called plumbing for accessing the object database; it now has the beginnings of Python bindings thanks to the GSoC libification project.
By the way, a fast-forward merge means that there were no changes in one branch, and all that is left to do is to advance history. Note that, for example, Bazaar creates useless empty merges depending on the direction of the merge.
As to hacking on Gazest: I would have to learn Python first...
Why don't you drop me an email so we can elaborate using a medium that sucks less than my comment system?

This really looks promising. But why don't you use git, hg or even bzr directly as a storage backend? That would allow branching the wiki, editing it with a regular text editor, and then pushing your changes back to the live branch published by Pylons.
Any plans to support the reStructuredText syntax?
File-based storage doesn't scale well across multiple web heads and sucks for queries. How do you get the revisions with at least 10 spam reports? It can be done and others have done it, but that turned into tools that only power users could handle. I might add an Hg backend one of these days, but that didn't seem like the easiest solution.
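For comparison, here is roughly what the spam-report question looks like against a relational store. This is a sketch using Python's built-in `sqlite3`; the `revision` and `spam_report` tables and the `heavily_reported` helper are hypothetical and are not Gazest's real schema.

```python
import sqlite3

# Hypothetical schema, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (id INTEGER PRIMARY KEY, page TEXT);
    CREATE TABLE spam_report (
        id INTEGER PRIMARY KEY,
        revision_id INTEGER REFERENCES revision(id)
    );
""")
conn.executemany("INSERT INTO revision (id, page) VALUES (?, ?)",
                 [(1, "Home"), (2, "Home"), (3, "About")])
# Revision 2 receives 12 reports and revision 3 receives 3.
conn.executemany("INSERT INTO spam_report (revision_id) VALUES (?)",
                 [(2,)] * 12 + [(3,)] * 3)


def heavily_reported(conn, threshold=10):
    """Ids of revisions with at least `threshold` spam reports."""
    rows = conn.execute(
        """SELECT revision_id FROM spam_report
           GROUP BY revision_id
           HAVING COUNT(*) >= ?
           ORDER BY revision_id""",
        (threshold,))
    return [row[0] for row in rows]


flagged = heavily_reported(conn)
```

The whole question is one `GROUP BY ... HAVING` clause, whereas a file-based store would need its own index to answer it.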
There is reStructuredText support in the extra_macros packages.