British Newspaper Archive: Digitising the nation's memory
A look at the technology behind the British Library's project to put 300 years of newspapers online
Consuming content in digital form has become the norm for many of us. We watch videos on smartphones, we skim the news on tablets, we share photos on social networks and we read books on e-readers. But in a world where digital rules, where does a traditional organisation like The British Library fit in?
The first thing most people find out about The British Library is that it holds at least one copy of every book produced in the United Kingdom and the Republic of Ireland. The Library adds some three million volumes every year, occupying roughly 11 kilometres of new shelf space.
It also owns an almost complete collection of British and Irish newspapers since 1840. Housed in its own building in Colindale, North London, the collection consists of more than 660,000 bound volumes and 370,000 reels of microfilm, containing tens of millions of newspapers.
It may have come as a surprise, therefore, when The British Library – an organisation that places such high value on paper objects – announced in May 2010 that it was teaming up with online publisher brightsolid to digitise a large portion of the British Newspaper Archive and make it available via a dedicated website.
The British Newspaper Archive
By the time The British Newspaper Archive website went live in November 2011, it offered access to up to 4 million fully searchable pages, featuring more than 200 newspaper titles from every part of the UK and Ireland. Since then, the Library has been scanning between 5,000 and 10,000 pages every day, and the digital archive now contains around 197TB of data.
The newspapers – which mainly date from the 19th century, but which include runs dating back to the first half of the 18th century – cover every aspect of local, regional and national news. The archive also offers a wealth of material for people researching family history, including family notices, announcements and obituaries.
According to Nick Townend, head of digital operations at The British Library, the idea of the project is to ensure the stability of the collection and make it available to as many people as possible.
“The library has traditionally had quite an academic research focus, but the definition of research has maybe broadened to mean everybody who's interested in doing research, and I think the library's trying to respond to that and make the collections more accessible,” said Townend.
The British Library and brightsolid have set themselves a minimum target of scanning 40 million pages over ten years. “That's actually a relatively small percentage of the total collection,” said Townend. The entire collection consists of 750 million pages.
“The digitisation project gives us a really good audit of the physical condition of the collection items,” he added. “Some of the earlier collections were made on very thin paper and it's just naturally degraded over time, so they've effectively become 'at risk' collection items. Making a digital surrogate is part of the longer term preservation of the collection.”
Eight thousand pages a day
The fragility of some items in the collection is the reason why the scanning process has to take place on-site at Colindale, according to Malcolm Dobson, chief technology officer at brightsolid. He explained that the company set up a scanning facility there at the start of the project, with five very high-spec scanners from Zeutschel.
“We do fairly high resolution scanning – 400 DPI, 24-bit colour. The full-res image sizes vary from 100MB up to 600MB per page,” said Dobson. “At 400 DPI these can be 12,000 pixels by 10,000 pixels – very large bitmaps. So even compressed, they are massive.”
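The figures Dobson quotes are mutually consistent, as a rough calculation shows: a 12,000 × 10,000 pixel bitmap at 24-bit colour (3 bytes per pixel) works out at about 360MB uncompressed, squarely inside the 100MB–600MB range he gives. A minimal sketch of that arithmetic:

```python
# Rough size of one uncompressed page scan at the quoted figures:
# 12,000 x 10,000 pixels, 24-bit colour (3 bytes per pixel).
width_px = 12_000
height_px = 10_000
bytes_per_pixel = 3  # 24-bit colour

raw_bytes = width_px * height_px * bytes_per_pixel
raw_mb = raw_bytes / 1_000_000  # decimal megabytes

# 12,000 px at 400 DPI is a 30-inch page edge - broadsheet territory.
page_width_inches = width_px / 400

print(f"Uncompressed page: {raw_mb:.0f} MB, {page_width_inches:.0f} inches wide")
# -> Uncompressed page: 360 MB, 30 inches wide
```

The 100MB–600MB spread then simply reflects pages of different physical sizes passing through the same 400 DPI, 24-bit pipeline.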
The pages are scanned in TIF format, and then converted into JPEG 2000 files. According to Dobson, JPEG 2000 provides a good quality of compression and retains a much better representation of the image than standard JPEG.
“We throw away the TIF files because they're just too big to keep,” said Dobson. “To put it into perspective, we've probably got something like 250TB of JPEG 2000, and we have 3 copies of each file, so it's a lot of data. If we'd just been going with the uncompressed TIF, that would probably be something in excess of a petabyte and a half.”
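Reading Dobson's figures as roughly 250TB of JPEG 2000 against 1.5PB of uncompressed TIF for a single copy of the archive (the article does not say this explicitly, so it is an assumption), the implied compression ratio is about 6:1. A quick sketch:

```python
# Back-of-the-envelope: implied JPEG 2000 compression ratio, reading
# the quoted figures as one copy each (an assumption - the article
# does not spell out whether 1.5PB covers one copy or three).
jp2_tb = 250       # "something like 250TB of JPEG 2000"
tif_tb = 1_500     # "in excess of a petabyte and a half" of TIF

ratio = tif_tb / jp2_tb
copies = 3
total_stored_tb = jp2_tb * copies

print(f"Implied compression ratio: ~{ratio:.0f}:1")      # ~6:1
print(f"Stored across {copies} copies: {total_stored_tb} TB")  # 750 TB
```

On that reading, keeping three JPEG 2000 copies (750TB) still costs only half as much storage as a single uncompressed TIF copy would have.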
Once scanned, the images are transported over a Gigabit Ethernet connection to brightsolid's data centre in Dundee. The transfer happens overnight, and usually takes around five to six hours.
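Those transfer times line up with the scanning volumes quoted earlier. Assuming the link sustains around 80% of its nominal gigabit rate (a typical figure; the article gives no actual throughput), five and a half hours moves roughly 2TB, which is about what 8,000 pages a day at a few hundred megabytes each would produce:

```python
# Sanity check on the nightly transfer: how much data fits through a
# Gigabit Ethernet link in five to six hours? The 80% efficiency
# figure is an assumption, not from the article.
line_rate_mbps = 1_000   # Gigabit Ethernet, megabits per second
efficiency = 0.8         # assumed usable fraction of the link
hours = 5.5              # middle of the "five to six hours" window

throughput_mb_s = line_rate_mbps * efficiency / 8   # megabytes per second
total_gb = throughput_mb_s * hours * 3600 / 1_000

print(f"~{total_gb:.0f} GB per night")  # ~1980 GB, i.e. roughly 2TB
```

At an average of around 250MB per scanned page, 2TB per night corresponds to roughly 8,000 pages, in the middle of the 5,000–10,000 pages-a-day range given above.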
The scanned images are entered into an OCR workflow system, where they are cropped, de-skewed, and turned into searchable text using optical character recognition. They are also “zoned” using an offshore arrangement in Cambodia. This means that areas of the page are manually catalogued by content – such as births, marriages, adverts or photographs – and referenced to coordinates.
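The article does not publish the archive's actual metadata schema, but the zoning step described above — content types tied to page coordinates — might produce records along these lines. The field names here are purely illustrative:

```python
# Hypothetical sketch of a zoned-page record: each manually catalogued
# area carries a content type and its pixel coordinates on the scanned
# page. These field names are illustrative, not the archive's schema.
page_record = {
    "page_id": "example-page-0001",
    "zones": [
        {"type": "birth_notice",
         "bbox": {"x": 420, "y": 1830, "width": 900, "height": 350}},
        {"type": "advert",
         "bbox": {"x": 6200, "y": 400, "width": 2400, "height": 1600}},
    ],
}

# Searching by content type then reduces to filtering zones:
adverts = [z for z in page_record["zones"] if z["type"] == "advert"]
print(len(adverts))  # -> 1
```

Pairing each zone's OCR text with coordinates like these is what lets a search hit be highlighted on the original page image rather than just returned as plain text.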
“We end up with quite a comprehensive metadata package that accompanies the image, and it's that metadata package along with the OCR information that forms the basis of the material that's then searchable,” said Townend.