KOPI - An Online Plagiarism Search and Information Portal

by Máté Pataki

Due to the growing number of digitally available documents, the problem of plagiarism is becoming increasingly serious. The aim of the KOPI project, carried out at the Department of Distributed Systems (DSD) of SZTAKI, was to develop an online plagiarism search portal, which would help digital libraries to protect their documents, and teachers to identify copied texts or publications.

The work was performed as part of a joint project between SZTAKI and Monash University in Melbourne, which carried out a series of investigations into document chunking and overlap detection, the two techniques on which plagiarism detection is based. Continuing this work, development of the portal KOPI commenced in 2003, funded by the Hungarian Government. The portal will become available to users by the end of June 2004.

There are two different approaches to fighting plagiarism. The first is the protection of the document by preventing it from being copied or misused, and the second is the recognition of plagiarism. Protection is an important issue but it can cause difficulties for legal users. Moreover, any kind of protection will eventually be cracked. In our view, the most effective technique in fighting plagiarism is the fast detection of document overlap: in other words, there is no sense in copying a digital document if the copy can be detected within minutes. This method is used to protect documents that are part of the KOPI system from illegal use.

As a portal site, KOPI includes common services such as a forum, context-sensitive help, FAQ, and static documents including information on plagiarism and university laws. In addition, two system-specific services are offered to users of the portal: a document upload and management service, and the plagiarism search engine. The first can be used to upload documents (html, rtf, doc, pdf, txt) or a batch of documents (zip), and to attach meta-information to them. The metadata are stored in Dublin Core format to allow future interoperability with other systems.
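To illustrate what a Dublin Core description of an uploaded document might look like, here is a minimal Python sketch. The choice of elements, the wrapper element name `metadata`, the helper name `dublin_core_xml` and the sample values are all assumptions made for illustration; the article does not specify how KOPI serialises its records.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_xml(fields):
    """Serialise a dict of Dublin Core element names and values to XML.

    The wrapper element <metadata> is a hypothetical container chosen
    for this sketch, not something defined by KOPI.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record for an uploaded PDF.
record = dublin_core_xml({
    "title": "Sample thesis",
    "creator": "Example Author",
    "date": "2004-06-01",
    "format": "application/pdf",
    "language": "en",
})
print(record)
```

Because the elements carry the standard Dublin Core namespace, any other system that understands Dublin Core could read such a record without knowing anything about the portal that produced it.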

Figure: The structure of the KOPI system.

The uploaded documents can then be compared with each other, with previously uploaded documents, with all documents uploaded by users, or with collections of documents gathered from the Web or documents in digital libraries. The comparison is made offline, reducing waiting time and costs for the user. When the job is finished, the message handler unit sends the results to the user via e-mail.

The heart of the similarity search engine is the chunking method, which is used to chunk the given text into smaller pieces. This task and the conversion of the document to plain text are performed by the document converter subsystem. When comparing documents, only these chunks or their so-called compressed fingerprints are examined to determine how many common parts the documents have. The KOPI system uses a combination of word chunking and overlapping word chunking to chunk the documents. This new algorithm provides a fast and accurate search, while keeping the size of the database small. (For more information on the chunking methods see the links below.)
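The two chunking methods and the fingerprint comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the KOPI implementation: the chunk size of three words, the truncated MD5 fingerprint, and the way the two methods are combined (stored documents indexed with non-overlapping chunks, the suspect document scanned with overlapping chunks) are choices made for this sketch, and all function names are hypothetical.

```python
import hashlib
import re


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def word_chunks(words, size):
    """Word chunking: consecutive, non-overlapping chunks of `size` words.
    A trailing remainder shorter than `size` is dropped."""
    return [tuple(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]


def overlapping_word_chunks(words, size):
    """Overlapping word chunking: one chunk starting at every word."""
    return [tuple(words[i:i + size])
            for i in range(len(words) - size + 1)]


def fingerprint(chunk, hex_digits=8):
    """Compressed fingerprint: a truncated MD5 digest of the chunk
    (the hash function and length are assumptions of this sketch)."""
    digest = hashlib.md5(" ".join(chunk).encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def index_document(text, size=3):
    """Fingerprints stored for a document: non-overlapping chunks only,
    which keeps the stored set small."""
    return {fingerprint(c) for c in word_chunks(tokenize(text), size)}


def matched_fraction(stored_fingerprints, suspect_text, size=3):
    """Fraction of stored fingerprints that also occur in the suspect text,
    which is scanned with overlapping chunks so that alignment is irrelevant."""
    if not stored_fingerprints:
        return 0.0
    suspect = {fingerprint(c)
               for c in overlapping_word_chunks(tokenize(suspect_text), size)}
    return len(stored_fingerprints & suspect) / len(stored_fingerprints)


stored = index_document("The quick brown fox jumps over the lazy dog again")
score = matched_fraction(
    stored, "He wrote: quick brown fox jumps over the lazy dog.")
print(f"{score:.2f}")
```

In this sketch the stored side holds roughly one fingerprint per three words, while the suspect side is scanned at every word position, so a copied passage is still found when it starts at a different offset from the chunk boundaries of the original.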

In order to perform an efficient plagiarism search, the KOPI system needs to collect as many documents as possible. Four possible sources exist:

documents on the Internet
digital library collections
publications and theses from schools, universities, or conference organisers
material uploaded by the users of the KOPI system.

Documents from the Internet are collected using a Web crawler. Digital libraries with an open interface to the Internet (eg OAI) can also be easily harvested. In the future, it is likely that university students will be requested to submit their theses in digital form, and so within a couple of years a large set of documents will have been collected.
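As an illustration of harvesting through an open interface, the following Python sketch builds request URLs for the OAI Protocol for Metadata Harvesting (OAI-PMH), assuming that this is the interface meant by OAI above. The base URL is a placeholder and the function name is hypothetical; the sketch only constructs the URLs and performs no network access.

```python
from urllib.parse import urlencode


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    The first request asks for unqualified Dublin Core records
    (metadataPrefix=oai_dc). Follow-up requests carry only the
    resumption token returned by the repository, since the protocol
    treats resumptionToken as an exclusive argument.
    """
    if resumption_token is None:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    else:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    return base_url + "?" + urlencode(params)


# Placeholder repository address, not a real KOPI source.
first_request = list_records_url("http://example.org/oai")
next_request = list_records_url("http://example.org/oai", "token-123")
print(first_request)
print(next_request)
```

A harvester would fetch the first URL, store the returned records, and repeat with each new resumption token until the repository stops returning one.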

KOPI is currently a stand-alone portal application. Future developments within the framework of a PhD project will target the creation of a distributed KOPI architecture. In such a system, institutes would use their own local copy of the KOPI engine, but could initiate a plagiarism search involving documents over the whole distributed KOPI system.

Links:
KOPI portal: http://kopi.sztaki.hu
Department of Distributed Systems of SZTAKI: http://dsd.sztaki.hu
Plagiarism Detection and Document Chunking Methods, The Twelfth International World Wide Web Conference: http://www2003.org/cdrom/papers/poster/p186/p186-Pataki.html
Match Detect Reveal Project at Monash University Melbourne: http://www.csse.monash.edu.au/projects/MDR/
The Dublin Core Metadata Initiative: http://dublincore.org/

Please contact:
László Kovács, SZTAKI, Hungary
Tel: +36 1 279 6212
E-mail: laszlo.kovacs@sztaki.hu