Since work on the NBK began in 2016 we have had many conversations with stakeholders, including current and potential contributing libraries, data suppliers, and those with an interest in using the services we are building. These conversations have addressed a wide range of issues, including complex questions around data ownership. We have also been spending time learning more about the available technology. This input has helped us clarify our thinking and resulted in the multi-database model of the NBK presented at the end of last year. This approach offers much greater flexibility in how we manage the data and in the way we can develop and support the increasing range of services and facilities that the NBK will underpin.
All incoming data flows into the NBK database (the data lake), with MARC data being stored as supplied. From here all relevant data is passed to the other databases. This multi-database approach brings significant benefits: it reflects the varied use cases for the different services being created, the differing data sources for each service, data licensing considerations, technical considerations, and the increasing integration between Jisc services.
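To make the flow from the data lake to the downstream databases concrete, here is a minimal sketch of the routing logic described above. The field names and the licence flag are illustrative assumptions for demonstration, not the actual NBK schema.

```python
# Hypothetical sketch of multi-database routing from the data lake.
# The "format" and "shareable" fields are illustrative assumptions.

def route_record(record):
    """Decide which downstream databases a data-lake record feeds."""
    targets = ["discover"]  # Discover includes all contributed records
    # Only MARC records whose licence permits sharing go to Cataloguing
    if record.get("format") == "MARC" and record.get("shareable", False):
        targets.append("cataloguing")
    return targets

records = [
    {"id": "r1", "format": "MARC", "shareable": True},
    {"id": "r2", "format": "MARC", "shareable": False},       # licence-restricted
    {"id": "r3", "format": "Dublin Core", "shareable": True}, # non-MARC source
]

routed = {r["id"]: route_record(r) for r in records}
```

In this sketch only shareable MARC records reach the cataloguing database, while every record reaches Discover, mirroring the separation of concerns the multi-database model is designed to support.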
The NBK data lake
The data lake allows us to manage the data flow: records are held as supplied by each contributor, before any of the data standardisation activity undertaken for the Library Hub Cataloguing service. This will support flexible development as the data management evolves. For example, as and when we change the record processing for the Library Hub Cataloguing service, we will be able to extract and reprocess sub-sets of data without having to re-request data loads from contributors.
There has been a lot of community interest in the potential for tools to support libraries in upgrading data where appropriate. We will be exploring developments in this area using the unprocessed data from the NBK data lake, working with the Elasticsearch technology we’re using for the Library Hub Discover service. Any work we can do to support libraries in enhancing record quality will then feed into both the Cataloguing and Discover services, as well as benefiting each library’s local catalogue users.
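As one example of the kind of data-quality exploration this could support, the sketch below checks unprocessed records for missing fields. The record structure is an assumption for illustration; the field tags follow MARC conventions (245 = title, 100 = main author, 020 = ISBN), but this is not an actual NBK tool.

```python
# Illustrative completeness check over MARC-like records from the
# data lake. Tags: 245 = title, 100 = main author, 020 = ISBN.
# The record structure here is an assumption, not the NBK schema.

EXPECTED_FIELDS = {"245", "100", "020"}

def missing_fields(record):
    """Return the expected MARC field tags absent from a record."""
    present = {f["tag"] for f in record["fields"]}
    return sorted(EXPECTED_FIELDS - present)

lake_records = [
    {"id": "a", "fields": [{"tag": "245"}, {"tag": "100"}, {"tag": "020"}]},
    {"id": "b", "fields": [{"tag": "245"}]},  # sparse record
]

# Map each record id to the fields it is missing
report = {r["id"]: missing_fields(r) for r in lake_records}
```

Reports like this, run over the as-supplied data, could help identify where record enhancement would be most valuable to a contributing library.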
Whilst the NBK data lake will initially feed into two databases, it is possible that there will be other variants in future. For example, we will be working with libraries and other Jisc services to explore problems in the area of eResource management, where subsets of NBK records may be combined with data from other sources to offer support in this complex area.
The Library Hub Cataloguing Service
By having a dedicated cataloguing database we can focus on data quality. We have heard strong views on a number of issues, for example the merger of RDA and AACR2 source records during the deduplication process. So as the Cataloguing service develops we will be consulting on all aspects, in particular issues relating to data quality, data deduplication and merged record creation.
Having a cataloguing database also gives us flexibility in data management. We will be including records from sources that are of specific value for cataloguing, for example Library of Congress data, whilst excluding data from contributors that do not use MARC, to help maintain the overall quality of the database. In addition, data licensing restrictions mean that not all records can be made available for shared cataloguing. Completely excluding such records from the cataloguing database simplifies the data management and assures contributors and data suppliers that their records are not being shared inappropriately.
The Library Hub Discover service
The Library Hub Discover service will build on the work of Copac, where the focus is on coverage. An end-user must trust that they are seeing the full picture of a library’s holdings, so we will continue to include all records from a contributing library, regardless of quality. We will also be including records from all data suppliers as we will not be making any of the data available in MARC format.
For Discover we also need to maximise deduplication, whilst having the flexibility to show the original contributed records. The emphasis in the deduplication and record creation is on creating the best, most complete record for resource discovery purposes. And as we do for Copac, we will continue to buy in data, such as table-of-contents data and book cover images, to enhance the value of the records to the end-user.
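The approach described above, matching duplicate records and building the most complete composite while retaining the originals, can be sketched as follows. The match key (ISBN plus case-folded title) and the merge rule (keep the fullest value per field) are illustrative assumptions, not the actual Discover algorithm.

```python
# A minimal deduplication sketch. Matching on ISBN + case-folded title
# and keeping the longest value per field are illustrative choices,
# not the actual Discover matching or merge algorithm.

from collections import defaultdict

def match_key(record):
    """Cluster records by ISBN plus case-folded title."""
    return (record.get("isbn"), record.get("title", "").casefold())

def merge_cluster(cluster):
    """Build one composite record, preferring the most complete values."""
    merged = {"sources": [r["source"] for r in cluster]}  # keep provenance
    for field in ("isbn", "title", "publisher", "toc"):
        values = [r[field] for r in cluster if r.get(field)]
        if values:
            merged[field] = max(values, key=len)  # keep the fullest value
    return merged

records = [
    {"source": "lib1", "isbn": "9780000000001", "title": "Data Lakes",
     "publisher": "OUP"},
    {"source": "lib2", "isbn": "9780000000001", "title": "data lakes",
     "publisher": "Oxford University Press", "toc": "1. Introduction"},
]

clusters = defaultdict(list)
for r in records:
    clusters[match_key(r)].append(r)

merged_records = [merge_cluster(c) for c in clusters.values()]
```

Because the originals are kept alongside the composite (via the `sources` list here), the merged record can drive discovery while the contributed records remain available for display.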
Coverage and deduplication are also the essential elements for supporting the Library Hub Compare service (CCM tools). This will also benefit from the flexibility and analytics capabilities of the Elasticsearch open-source technology being used for the Discover and Compare database, and we will be exploring with the Collection Management community how we can use this to best effect to create a more flexible and interactive service.
The NBK service development work is taking place within the context of wider Jisc development activity that is focused on bringing related services together to provide more effective service presentation and improved user workflows. The Library Hub Discover interface is being developed within this context, and working with Elasticsearch will facilitate flexible search development. It also provides consistency of underlying technology with other Jisc services, increasing the potential for effective service integration where appropriate.
For the future
The new NBK system model has emerged from working with community members, data suppliers and others. This approach offers us the best way to give each service a clear focus and enable services to evolve over time in the ways that best support their core users, as well as supporting ongoing experimentation and new service development. This does create additional work in the short term, but we feel it is an important investment in the long-term sustainability of the services being built on the NBK, both now and in the future.