Some Confluence instances are public and allow anonymous comments. Doing so exposes you to one major problem: spam.
One solution is to use the Confluence CAPTCHA to identify human users, but CAPTCHAs are not really accessible.
Another solution would be to use a spam filter, and I decided to give it a go with two goals in mind:
- Develop the spam filter as a plugin
- Make sure that the filter can learn from comments marked as spam and/or marked as not being spam.
To process comments I went with Bayesian filtering, using Classifier4J. This library doesn’t seem to be actively developed, but it has a nice, simple API for Bayesian filtering. Developing against a nice, simple interface will allow me to change libraries quite easily.
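To make the idea concrete, here is a minimal sketch of how a Bayesian filter can combine per-word spam probabilities into a single score, hidden behind a small interface so that the backing library can be swapped. The names `SpamClassifier` and `FixedTableClassifier`, the fixed word table, the 0.5 default for unknown words and the clamping bounds are all assumptions made for illustration; this is not Classifier4J's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: the plugin would depend only on this, not on a library.
interface SpamClassifier {
    /** Returns a score between 0.0 (very likely not spam) and 1.0 (very likely spam). */
    double classify(String text);
}

// Illustrative implementation backed by a fixed table of per-word spam probabilities.
class FixedTableClassifier implements SpamClassifier {
    private final Map<String, Double> wordSpamProbability = new HashMap<>();

    FixedTableClassifier(Map<String, Double> table) {
        wordSpamProbability.putAll(table);
    }

    @Override
    public double classify(String text) {
        double spam = 1.0; // running product of p(word)
        double ham = 1.0;  // running product of (1 - p(word))
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token when the text starts with punctuation
            }
            // Unknown words are treated as neutral (assumption).
            double p = wordSpamProbability.getOrDefault(word, 0.5);
            // Clamp so that a single word can never force the score to exactly 0 or 1.
            p = Math.min(0.99, Math.max(0.01, p));
            spam *= p;
            ham *= 1.0 - p;
        }
        // Naive Bayes combination of the individual word probabilities.
        return spam / (spam + ham);
    }
}
```

With a table mapping "confluence" to 0.1, `classify("confluence")` returns 0.1, and a comment made only of unknown words scores 0.5. A caller would then compare the score against a cutoff of its choosing.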
Plug into Confluence
I needed to process every comment that is created and/or updated. This was done by developing a simple event listener that would delegate to the spam filter.
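The listener itself can stay very thin. The sketch below uses stand-in types (`CommentEvent`, `SpamFilter`, `CommentSpamListener`) instead of Confluence's real event classes, so every name in it is hypothetical; it only illustrates the delegation from listener to filter.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a comment event; this is NOT a Confluence class.
class CommentEvent {
    enum Kind { CREATED, UPDATED, REMOVED }

    final Kind kind;
    final long commentId;
    final String body;

    CommentEvent(Kind kind, long commentId, String body) {
        this.kind = kind;
        this.commentId = commentId;
        this.body = body;
    }
}

// Hypothetical filter interface that the listener delegates to.
interface SpamFilter {
    boolean isSpam(String text);
}

// Reacts to created and updated comments and records the ones the filter rejects.
class CommentSpamListener {
    private final SpamFilter filter;
    private final List<Long> flagged = new ArrayList<>();

    CommentSpamListener(SpamFilter filter) {
        this.filter = filter;
    }

    void handleEvent(CommentEvent event) {
        boolean relevant = event.kind == CommentEvent.Kind.CREATED
                || event.kind == CommentEvent.Kind.UPDATED;
        if (!relevant) {
            return; // other events are ignored
        }
        if (filter.isSpam(event.body)) {
            flagged.add(event.commentId);
        }
    }

    /** Returns a copy of the ids flagged so far, in the order they were seen. */
    List<Long> flaggedCommentIds() {
        return new ArrayList<>(flagged);
    }
}
```

Keeping the decision inside `SpamFilter` means the listener does not need to change if the filtering technique does.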
Then I had to be able to mark a comment as spam in Confluence. I decided to rely on the status field of comments (as per the ContentEntityObject). Unfortunately this field is not used for comments, so I had to work around it by detaching a comment from its parent page. Once we have developed the ContentType within Confluence I should be able to do without the workaround…
Integration into the UI was made easy by using web-items.
Any comment can be reported as spam.
In the administration section, one can have a look at all comments that have been reported or directly filtered as spam. They can then be marked as not spam or removed completely.
Whenever a comment is marked as spam or not spam, the system learns and builds its database of words which constitute spam and words which do not. This means the more one uses the spam filter, the better it is going to be.
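As a rough illustration of that learning step, the sketch below keeps two counters per word and derives a per-word spam estimate from them. `WordDatabase` and its methods are hypothetical names, and the add-one smoothing is my own assumption rather than what the library does.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative word database: counts how often each word occurred in spam and in non-spam comments.
class WordDatabase {
    // word -> {occurrences in spam, occurrences in non-spam}
    private final Map<String, int[]> counts = new HashMap<>();

    void teachSpam(String text) {
        teach(text, 0);
    }

    void teachNotSpam(String text) {
        teach(text, 1);
    }

    private void teach(String text, int slot) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // skip the empty token produced by leading punctuation
            }
            counts.computeIfAbsent(word, w -> new int[2])[slot]++;
        }
    }

    /** Smoothed fraction of this word's occurrences that were in spam; 0.5 for unseen words. */
    double spamProbability(String word) {
        int[] c = counts.get(word.toLowerCase());
        if (c == null) {
            return 0.5;
        }
        // Add-one smoothing keeps the estimate away from exactly 0 or 1.
        return (c[0] + 1.0) / (c[0] + c[1] + 2.0);
    }
}
```

Each call to `teachSpam` or `teachNotSpam` shifts the estimates for the words in that comment, which is why the filter gets better the more it is used.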
First, it would need a bit of polish implementation-wise, and to be tested of course.
Making it completely pluggable would require a bit of work in Confluence as well, most of which should happen with the new ContentType architecture.
As it is installed at the moment, the spam filter starts with an empty database and its state is not persisted. Creating a database of what is and is not spam on a Confluence instance should be difficult.