Indexing Metadata and Discovery in 2016

Now that the festive season is over, what are we up to? This year's remarkably mild weather (12-14 degrees in January) is unheard of. There is no excuse to jet off to a warm climate this month!

Providing e-Books to Africa

The first thing I have to report is that we have a new in-house database of some 734,000 eBooks from over 6,000 publishers. Soutron LMS is being used as the staging server to prepare data for publishing to an online e-commerce eBooks service in Africa. This is a production-level service, a critical part of selling eBooks complete with Adobe DRM, using Soutron technology.

The interesting thing about this is not the e-commerce website, although we have built up a vast amount of knowledge in preparing the site and delivering it complete with dynamic currency conversion and affiliate marketing. Rather it is utilising the power of Soutron LMS directly on content that we are responsible for and a very large dataset at that. For the first time we have our own library of some significance to manage in-house.

The catalogue database itself is some 37 GB in size and includes metadata with abstracts. We hold the ePUB files separately, together with the book cover images (about 22 GB worth). It is a usefully large database for exercising and testing functions around the catalogue, thesaurus and export.

There are several automated processes on the server that perform continual loading of new titles and removal of titles that have been withdrawn. All of the loading and management is performed using automated scripts, as are the fulfilment profiles, so the manpower needed to operate the systems is really minimal. De-duplication is a critical factor: publishers push out separate records for each format, but our service provider only wants to work with ePUB files. Checks and rules are applied to validate the quality of the metadata, especially the pricing data.
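The de-duplication step described above can be sketched in a few lines. This is a minimal illustration only, assuming simple dictionary records with hypothetical "isbn" and "format" fields; it is not Soutron's actual schema or loading script.

```python
# Sketch of the de-duplication rule: publishers send one record per
# format, but only ePUB records are wanted, and only one per ISBN.
# Field names ("isbn", "format") are hypothetical for illustration.

def dedupe_epub_records(records):
    """Keep a single ePUB record per ISBN, dropping all other formats."""
    seen = set()
    kept = []
    for rec in records:
        if rec.get("format", "").lower() != "epub":
            continue  # the service provider only works with ePUB files
        isbn = rec.get("isbn")
        if isbn and isbn not in seen:
            seen.add(isbn)
            kept.append(rec)
    return kept

# Example: three supplier records for one title collapse to one ePUB record.
records = [
    {"isbn": "9780306406157", "format": "ePub", "title": "Example Title"},
    {"isbn": "9780306406157", "format": "PDF", "title": "Example Title"},
    {"isbn": "9780306406157", "format": "ePub", "title": "Example Title"},
]
print(dedupe_epub_records(records))  # a single ePUB record survives
```

In practice the rules are richer (pricing checks, withdrawal handling), but keying on the ISBN and filtering on format is the core of the duplicate problem the publishers' per-format records create.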

Metadata & Indexing

It’s a big surprise to find that so much is wrong in the metadata. Starting with ISBNs: hardback ISBNs are often included and presented as eISBNs. The bigger issue, though, is the indexing of titles. Subject categories based on BISAC are provided in the ONIX data feed from suppliers, but these are so flat that building a meaningful hierarchy to explore and display is impossible. It wouldn’t be so bad, but the term descriptions often bear no resemblance to the content of the eBook itself, and a large number of travel books have abstracts describing places quite different from where the author has been. This has given us reason to really use the thesaurus and global edit in a way that we rarely see when testing our usual data set. The result is a “living” thesaurus that addresses just about every topic published, and a growing respect for indexers.
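One of the basic quality checks that can be automated is the ISBN-13 check digit, which catches mistyped or truncated ISBNs before they reach the catalogue. A minimal sketch (this is the standard ISBN-13 checksum, not Soutron's validation code; note it cannot tell a hardback ISBN from an eISBN, which requires cross-checking against the publisher's format data):

```python
def is_valid_isbn13(isbn):
    """Validate an ISBN-13 check digit.

    Digits are weighted alternately 1, 3, 1, 3, ...; the weighted
    sum must be a multiple of 10. Hyphens and spaces are ignored.
    """
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13("978-0-306-40615-7"))  # True: checksum holds
print(is_valid_isbn13("9780306406158"))      # False: bad check digit
```

A check like this only guards the form of the identifier; catching a print ISBN masquerading as an eISBN still needs rules that compare the identifier against the format metadata in the feed.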


The driver for all this back room work is Discovery. Making the content more easily discoverable and useful for those who are selling and marketing and introducing the texts to users. It is leading me to think that we need a more intelligent approach to indexing and I am keen to hear from anyone with ideas on how we might take that. There are one or two companies out there that are using AI to do this and it would be great to hear what you think of such approaches.

Do they work, or is the nuance of language too important to be left to algorithms?

Setting up these services requires custom bibliographies using the Export function (paper is still important in Africa). This has been very easy to set up in Soutron, and other than the banner header I have pretty much total control over the output without pushing it into Word. I am surprised no one has asked for this to be more flexible, given that the banner background is hard-coded right now. I could also do with extra selectors to bring data out of the database, such as all the records that are published. I am pushing to build out a Dashboard to make filtering output and reporting simpler.

New records are continually coming into the database at about 5,000 a month from the existing list of publishers; more will be added as we bring other publishers into the mix. Here’s looking to a million titles before the end of 2016.


Graham Beastall – Senior Consultant and Managing Director. Graham’s background is in Accountancy, Public Administration and Organisational Theory with a deep technical understanding of databases and web technologies. More posts by Graham.
