What is the Hardest Content to Classify?

The first theme that came to mind as I thought about topics to blog about is the whole area of classification of different types of content: text, sound, video and images. I often speak to clients who have a range of item types stored in a number of repositories. They’re often looking to classify new content, or to work on older content in order to improve its findability. They are always looking to get more value from their content.

In these circumstances a content audit is often called for, to answer the ‘What do you have?’ question. This then leads to a general discussion of the content types and the ways in which they can be classified, usually using a controlled vocabulary applied by a machine, by a person, or by a mixture of the two.

I fairly frequently assert that images are easily the hardest item type to deal with, and this often prompts questions.

Why are Images the Hardest Content to Classify?

  • Textual items contain text. Auto-categorising software, free text storage and access, etc. make organising and finding textual items relatively easy.
  • Sound can be digitised and turned into text.
  • Video often has an audio track that can be turned into text too. Computers can be used to identify scenes. Breaking a video into scenes and linking them to a synched and indexed soundtrack can provide pretty good access for many people (though there’s a whole blog post on the many access points to video that these processes don’t provide).

Images, on the other hand, have no text and no scenes. All you have are individual images, with the meaning and access points held in the visuals.

Some will say that this is really not a problem: all you need to do is use content-based image retrieval software to identify colours, textures and shapes in your images, and you’ll soon be searching for images without any manual indexing. However, whilst this technology is promising, it leaves a lot to be desired.
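To make the colour part of that idea concrete, here is a minimal sketch of one common approach, colour histogram comparison. It is an illustration only, not how any particular retrieval product works. The function names and the tiny four-pixel ‘images’ are hypothetical, and real software would read pixels from image files and look at texture and shape as well.

```python
from collections import Counter


def colour_histogram(pixels, levels=4):
    """Return the share of pixels falling into each coarse RGB bucket."""
    step = 256 // levels  # width of each bucket per colour channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity between 0 and 1: the overlap summed over shared buckets."""
    return sum(min(h1[b], h2[b]) for b in h1.keys() & h2.keys())


# Hypothetical images, each just a list of (red, green, blue) pixels.
sky_over_sea = [(30, 90, 200), (30, 90, 200), (20, 60, 150), (20, 60, 150)]
blue_car = [(20, 60, 150), (30, 90, 200), (20, 60, 150), (30, 90, 200)]
red_sunset = [(220, 40, 30)] * 4

similar = histogram_intersection(
    colour_histogram(sky_over_sea), colour_histogram(blue_car)
)
different = histogram_intersection(
    colour_histogram(sky_over_sea), colour_histogram(red_sunset)
)
```

In this sketch the seascape and the blue car come out as a perfect colour match, because a histogram ignores where the pixels sit and says nothing about what the picture depicts or is about.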

Today, the way to provide a wide and deep level of access to still images continues to be by using people to view images, write captions and assign keywords or tags to each image based on image ‘depictions’ and ‘aboutness and attributes’. This manual process often requires the use of a controlled vocabulary to improve consistency and application.
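As a rough illustration of how a controlled vocabulary improves consistency, the sketch below maps the variant terms an indexer might type onto a single preferred term, and sets aside anything the vocabulary does not recognise for review. The vocabulary entries and the function name are made up for this example; a real vocabulary would be far larger and usually hierarchical.

```python
# Hypothetical controlled vocabulary: variant term -> preferred term.
PREFERRED = {
    "car": "car",
    "cars": "car",
    "automobile": "car",
    "child": "child",
    "children": "child",
    "kids": "child",
}


def normalise_tags(raw_tags, vocabulary=PREFERRED):
    """Split raw tags into preferred terms and terms needing review."""
    accepted, rejected = [], []
    for tag in raw_tags:
        term = vocabulary.get(tag.strip().lower())
        if term is None:
            rejected.append(tag)  # not in the vocabulary: flag for a person
        elif term not in accepted:
            accepted.append(term)  # keep one copy of each preferred term
    return accepted, rejected


accepted, rejected = normalise_tags(["Cars", "automobile", "Kids", "happiness"])
```

Here ‘Cars’ and ‘automobile’ collapse into the single term ‘car’, ‘Kids’ becomes ‘child’, and ‘happiness’ is held back for a person to decide on.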

How this indexing is done, and what structures support it, will be the subject of further posts.

Ian Davis

September 2008
