Giuseppe Sollazzo: What's the biggest enabler of transport innovation today?

To answer that question - and get our brains ticking over with many more - we welcomed Giuseppe Sollazzo, Head of Data at the DfT, to sit down with Matt and Tom for the final episode of Making Passenger series 1.

Giuseppe broke down NaPTAN and NAP, and threw a few more acronyms at us for good measure! We talked about why data is gathered and why it's so important to the public transport sector. We looked at cases where data had been gathered before there was even a user need for it and considered how we link data structures to user need.

Giuseppe's passion for data and open data is possibly unparalleled, so he was the perfect person to join us in pondering how we can use data to enable engagement and innovation across the public transport sector and beyond.

Matt : 0:04

Hello, and welcome to this week's Making Passenger podcast. I'm Matt... ...and I'm Tom. This week we're speaking with Giuseppe Sollazzo, Head of Data at the DfT.

Tom : 0:13

Two years ago Passenger embarked on an R&D project which led us to meeting Giuseppe. The dataset in question is undergoing review and potential redesign by his team. In this episode, we catch up with him to find out how he's getting on.

Matt : 0:24

I hope you enjoy! Hi Giuseppe, thank you so much for joining us here today.

Giuseppe Sollazzo : 0:34

Thank you for inviting me.

Matt : 0:36

So we wanted to talk to you a little bit around some of the work that the DfT is doing on all sorts of data sets, but most specifically around NaPTAN and things that are included in that. For around 10 years we've been working with datasets that the DfT is responsible for and, as with any long-standing big dataset, there are a lot of stakeholders. And so there are bound to be some kind of inaccuracies and things like that. Back in 2018, we began to do some R&D work with a transport data set called NaPTAN. And it became apparent that it wasn't quite as accurate as we needed it to be. And so when we pulled that thread, we began to learn a lot about some of the very real challenges that a technology firm like ours has when working with open data. So it's quite important, I think, that everyone understands exactly what NaPTAN is and why it's important. So could you perhaps give us a bit of detail around that?

Giuseppe Sollazzo : 1:23

Yeah, so that is pretty much the national dataset of any point used by people to access public transport, so railway stations, airports, ports, and predominantly bus stops. So there's about 450,000 bus stops in NaPTAN, which constitute 95% of what NaPTAN is. So it's basically any point from which your passengers can join or leave public transport, together with information about the location of that point. And it's a bit of a, you know, relic of the past in many ways. So NaPTAN was first set up when a website called Transport Direct was alive. So this was a website set up by government to enable journey planning, and this was switched off in 2013, when, basically, things moved on to saying, okay, we shouldn't be providing journey planning, but we'll still maintain this dataset for anyone. So it's still used for a number of functions, predominantly for electronic bus registration. It's also used by, you know, Google and OpenStreetMap as a base for their geographical databases, for Google Maps and things like that. Some local authorities also use it as a master record for their bus stop management systems. So it's got quite a broad set of users.

Tom : 2:41

So Giuseppe, following some experimental research back in October 2018 that we did as Passenger with the data science unit at Bournemouth University, as Matt said, we've done a lot of work with NaPTAN in the past, but we started to bring some work together with those guys and we were looking at how NaPTAN had potential errors in it, particularly around the bearings, which way some of the bus stops were pointing, and how that might impact some of the users that were seeing those in the apps that we're building. We started to sort of research, as part of that, how we might be able to take the reports we were getting from users that things were inaccurate, and automate that across the UK dataset to see whether that was sort of in the regions that were being reported or whether that was actually something that was kind of more widespread. Before we go into that, could you tell us a bit about some of the work you're doing at the DfT on NaPTAN, and particularly around the ownership of that dataset and the processes around it?

Giuseppe Sollazzo : 3:37

Oh, I love that question. It's very complicated to begin with. So DfT plays a leading role in everything having to do with NaPTAN, but NaPTAN is fundamentally a collaborative effort. So at DfT, there are at least three teams involved in what we call the management of NaPTAN. To begin with, NaPTAN is formed by different elements. So on one side, it is a data standard. So it's a data model, based on something bigger called Transmodel, I'm not going to get into details about that. And it's also a data exchange format, which is basically an XML schema explaining how to share information about transport concepts, I would say. And then there's a data process, and the data process involves a number of cycles. So we run a central service to aggregate data that is actually generated and sent to us by local transport authorities or their representatives. So some local transport [illegible] will have an internal function to send us their data, some others will use a company to do that. There's an ecosystem, actually, of tools around NaPTAN. So that's the way it works. LAs provide data, we put that data together and we publish it. My team at the DfT is sort of the service owner for anything having to do with NaPTAN. So we took ownership of this last year, and we administer it, we respond to queries, we are working to investigate the data quality in NaPTAN, and we engage with the market. And I mean, last week I was on this group called p tech, so it's a group of people who are working in the space of trustworthy information. I'm talking to you guys and other companies we are engaging with. So yes, at the moment, there's a service, which is run by our digital service. And that's pretty much the process, and we do some work to refresh that service.

Tom : 5:28

That's really interesting, Giuseppe. I mean, going back to what we discovered, which was around the 400,000 bus stops throughout the UK, somewhere in that region. And the automation that we put in around the checking of those bearings gave us a reading of around kind of 4% of inaccurate data. The biggest frustration, I suppose, from outside as a tech company was really not understanding how we could go about updating that. So we've kind of done this research, we've done this piece of work as a tech company that uses it, puts it into apps, that has this customer-facing quality check. And we didn't know the process around how that would work. I mean, how is that done now? How has it evolved over the last couple of years since we did that work, in terms of some of those things?
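
[Editor's sketch] The episode does not say how Passenger's automated bearing check actually worked. As one hedged illustration of the idea, the Python below compares a stop's recorded compass bearing with the direction of travel along the road next to it and flags large disagreements. The 8-point compass encoding, the two road points, the function names and the 45-degree tolerance are all assumptions made for this example, not details from the conversation.

```python
import math

# Assumed encoding: bearings stored as 8-point compass values.
COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def initial_bearing(lat1, lon1, lat2, lon2):
    """Great-circle initial bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360


def angular_difference(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a - b + 180) % 360 - 180)


def bearing_is_suspect(recorded, travel_from, travel_to, tolerance=45.0):
    """Flag a stop whose recorded bearing disagrees with the observed travel direction.

    recorded    -- compass point stored for the stop, e.g. "NE" (hypothetical input)
    travel_from -- (lat, lon) of a point on the road before the stop (hypothetical input)
    travel_to   -- (lat, lon) of a point on the road after the stop (hypothetical input)
    tolerance   -- allowed disagreement in degrees (an arbitrary choice for this sketch)
    """
    recorded_deg = COMPASS.index(recorded) * 45.0
    observed_deg = initial_bearing(*travel_from, *travel_to)
    return angular_difference(recorded_deg, observed_deg) > tolerance


if __name__ == "__main__":
    # A stop on a road running due north, recorded as facing south.
    print(bearing_is_suspect("S", (50.72, -1.88), (50.73, -1.88)))
```

Run over every stop in a dataset, a check of this shape would produce a list of candidates for human review rather than a list of confirmed errors, since the road geometry used as the reference can itself be wrong.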

Giuseppe Sollazzo : 6:14

So first of all, let me praise the work you've done, because it's really brilliant and, to be honest, it gives us a lot of interesting ideas on how to work together with the sector to fix NaPTAN. But I have to take a step back, which is about, once again, NaPTAN was created for a specific purpose. And that specific purpose is no more and things have evolved over time. I mean, if you work with NaPTAN, you're probably aware that there are some very old web pages with a CSS style that hasn't seen any refresh probably for 10 years. So support was kept to the bare minimum because there's clearly a duty for the DfT to maintain a data set, but there wasn't any thinking into "What should we do with this in the future?" and things change. When my team was created, I mean, part of the reason for creating my team was actually to bring thinking around all the things we do with data at DfT. And NaPTAN was clearly one of the things on the radar. I have to say, I've been a data geek before. And I've always been quite keen on NaPTAN. NaPTAN to me is a great example of a national data set, where there's a process, as much as it is, you know, as I say, a convoluted process. So there wasn't much work done to it, but a lot of intention towards doing some work with it. So I'd say at the moment, as we, you know, started taking over it, we also started engaging with the market, we also engaged with a number of teams at the DfT itself, like the Bus Open Data team, which are heavy users of NaPTAN, and we're starting to ask the question. So of course, there has been little evolution over time, but something we want to explore is actually how to make sure that we can keep NaPTAN evolving over time. And once again, NaPTAN was created for journey planning, for our own provided journey planning. It's something we no longer provide. And therefore, we need to ask the question, what's the current user need for NaPTAN and how do we adapt it to emerging user needs?

Tom : 8:12

Giuseppe, given the implications of errors in the data, one of the most important aspects will be how to get those corrected. As I mentioned before, you know, we're a tech company that builds apps for users. One of the most important things for us was really how we resolve that. Apps in app stores have a rating system, which is pretty brutal, to be quite frank, and when data presented isn't accurate, users of those apps are scathing. And I think there's almost an implication, you know, where bits of data are inaccurate, that the whole thing is not as good as it can be. And so when you're not the owner of that data, as an app provider, or an app developer or a technology company, you need some published guidance on where you go and how you get that resolved so that it's at the standard that you need it to be to deliver your product and your service. You know, in the work that you're doing, what are you working on that might encourage this transparency around ownership and governance and almost, to some degree, an SLA around the quality of the data that is being provided as open data?

Giuseppe Sollazzo : 9:17

So let me first talk about NaPTAN. And I'll give you a sort of a civil service-y answer, which is "we're working on it". So until now, NaPTAN didn't have any form of real engagement with the wider transport sector, though there were actually some bits of engagement with local transport authorities. We have a number of contacts in local authorities. Whenever there's a problem with it, we go back to them, but probably we need to make that broader. Now, one thing I didn't say is that we are working on a refresh of the NaPTAN service and redevelopment of it, thanks to the stick of legislation. So in September, there's new accessibility legislation that will come into force, creating a set of new requirements for government websites. So that's giving us a bit of an impetus to doing a redevelopment of NaPTAN. So at the moment, we're running a NaPTAN refresh project together with our digital service. And we just finished a discovery. We've gone straight into alpha, and the plan is, first of all, to, you know, bring what there is of NaPTAN today into the 21st century. So first of all, onto our current website, styling, content guides, you know, comply with accessibility regulation, and also bring the current NaPTAN platform into the current set of services run by our digital services so that it can be maintained and supported as an ongoing service. As part of that, we've also started to ask questions to the sector as to what's the best way forward to engage with them and how do we get things fixed? Now, as I said, there is a complex set of stakeholders that form the NaPTAN ecosystem. Many local authorities have their own data point assurance processes. There's a number of companies out there who provide data quality tools, software services to local authorities. And on the other hand, we also are working on data quality for NaPTAN because we have some requirements in that respect.
So one thing we're doing at the moment as part of this refresh is also to investigate the deployment of that, and we will be releasing some of our data quality checks as an open source library, to encourage the sector to do their own checks, but also to potentially support the market with some extra thinking around how to provide a baseline of data to check. So that's about NaPTAN. And you were asking about, you know, transparency, ownership and governance. And I think that's a broader problem. It's about open data in general. And I think it's important to look at where data is successful, what is achieved and how it was made successful. My current obsession in that space is what I call data curation, which is basically taking care of datasets like libraries do with their collection. So that means having accurate records on the provenance of data, it means having the right policies to manage those items effectively and safely. And it means putting particular items on display for a specific reason when there's an exhibition, or to a common thread or to a common discussion. So doing this with data basically translates into thinking a bit more about why the data is created, why it is needed, what the users are doing with the data. So basically, it's not just about creating structures that do governance, and in the civil service we're very good at creating boards and committees and all of that, and we have good reasons to do so. But at the same time, it's also about linking all these structures to the user need and making sure that, you know, the process is solid, but also enabling engagement and innovation. So that's pretty much my overview of how to deal with data sets like this.

Tom : 13:04

That certainly makes a lot of sense to me. I mean, this idea of curation and showing how it's being used, I think, would inherently improve the understanding of what's being done with the data when it's being input and being managed. I think you're right, to a large extent, that, you know, without sight of that, the people that are responsible, potentially at the local authority level, for inputting this data, without seeing how it's being used at the sharp end of innovation, then there's always a chance that perhaps the importance of the role that they are doing is not necessarily seen as as important as it really is. And that's interesting.

Giuseppe Sollazzo : 13:37

Yeah, for some of them it's going to be a burden. But, you know, as we engage with local authorities, we're actually finding out that there are many officers in local authorities who are incredibly passionate about data and about data quality. And there are questions. I mean, they've been asking us questions around NaPTAN. You know, NaPTAN now is meant to include, for example, information about accessibility, and different local authorities have different processes and different opinions about that. And I think that's fascinating. I mean, this is the data set. And at the same time, it's providing an insight into how local authorities run the services for their own population. And that's very important.

Tom : 14:15

That's an interesting one as well. I think the geographical differences and the requirements that different areas have, I'd be interested to hear about how you build that into a kind of a national standard, because we've seen that ourselves, you know, not every region is the same in terms of accessibility in Scotland, perhaps, you know, compared to some areas in England.

Giuseppe Sollazzo : 14:34

Yeah, absolutely, there are huge differences in how, for example, local authorities like to have the name of the stop displayed on, you know, a bus stop, which is one of the ways NaPTAN has been used in the country. There are also different ways in which different local authorities group together bus stops, for example. So an interesting question a few colleagues at DfT asked is: can we use NaPTAN to identify bus stations? Well, bus station is a concept that makes sense in common parlance, but which doesn't have a strong definition in the data. So my question is, should we have that strong definition in the future? But how a person perceives what a bus station is might be different from, you know, local authorities. In London, for example, we have something like Victoria Coach Station, which is a big coach station, it's not a bus station. For some people, the difference between the coach station and the bus station is meaningless, but if you're an operator it's actually very meaningful. So we need to think about different uses and the language used to explain this concept to professional data users but also to the end users.

Matt : 15:51

I think that's really interesting. I think you're right, it comes down to that sort of local level and about what meaning people give to the data. It's going to make this a very interesting problem to solve for you. And I personally cannot wait to see how you solve it. But when we first asked you to come on the podcast, you referenced NAP - N-A-P - now that's an acronym, I'm gonna be honest, I had to go and look up, and I thought I knew them all. So for anyone who didn't know, could you perhaps tell us, what is NAP? Is that different, and where does it come from?

Giuseppe Sollazzo : 16:21

So it's interesting because one of the criticisms we have is: are you calling everything starting with the letters NAP? So that was just a coincidence! NAP is an acronym that means National Access Point. And it is something that derives from a European Union directive. So there is a European Union directive called Intelligent Transport Systems, which mandated member countries to create a roads metadata catalogue to represent things like safety statistics on roads and other concepts around roads. Now, we exited the European Union. There were questions around whether we should be complying or not with this directive, and fundamentally what we said, together with the policy team in charge of this project, we said, well, actually, the concept of NAP, regardless of compliance with the directive, is actually quite useful because there is a growing need of data about roads, and I have to say this growing need has just been strengthened by the Coronavirus crisis, when it was important to access very quickly statistics about the usage of roads in different areas of the country, data about pavements, data about cycling lanes. So we said, you know what, we should be doing this as a full piece of research. We started with a discovery and alpha. And the idea is to try and see what the user need is for a roads metadata catalogue and evolve it over time. But yes, fundamentally, just to summarise, it is a catalogue of purely metadata about roads, enabling both public sector and private sector players to, I would say, advertise the existence of their data sets, even if the data set is not publicly accessible.

Matt : 18:05

Okay, and so the idea behind this is just to create the data set? You haven't necessarily got an end use in mind? You're just creating or cultivating this data set and then putting it out to the market to use and innovate with?

Giuseppe Sollazzo : 18:17

Yeah, pretty much. But once again, this is literally just about discoverability. So yeah, the idea is that we need to improve the level of discoverability of roads data in the country. And when I say discoverability, that means sometimes just being aware that a certain data set exists, it doesn't mean that the data set needs to be accessible, of course, I'm a big fan of open data, and I will support open data as much as possible. But there are certain data sets that might be owned by private companies, which are still useful to form a view about roads. And therefore we are exploring whether the NAP should be a good platform to have records around those data and see how they can be connected to users in the market.
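
[Editor's sketch] To make the "discoverability without access" idea concrete, here is a small Python illustration of a metadata-only catalogue. Nothing here reflects the actual NAP design, which was still being explored at the time of the episode: the record fields, the example entries and the `discover` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata about a dataset; the data itself is never stored in the catalogue."""
    title: str
    publisher: str
    description: str
    publicly_accessible: bool  # False for datasets held privately
    keywords: Tuple[str, ...] = ()


def discover(catalogue: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return every record tagged with the keyword, whether or not the data is open."""
    wanted = keyword.lower()
    return [r for r in catalogue if wanted in (k.lower() for k in r.keywords)]


if __name__ == "__main__":
    # Illustrative entries only; neither describes a real dataset.
    catalogue = [
        DatasetRecord("Cycle lane survey", "Example Council",
                      "Locations of cycling lanes", True, ("cycling", "roads")),
        DatasetRecord("Fleet telemetry", "Example Logistics Ltd",
                      "Road usage derived from vehicle traces", False, ("roads", "usage")),
    ]
    for record in discover(catalogue, "roads"):
        status = "open" if record.publicly_accessible else "on request"
        print(f"{record.title} ({record.publisher}): {status}")
```

The point of the sketch is that a search returns the privately held dataset too: a user learns that it exists and who holds it, even though the catalogue gives no access to the data.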

Tom : 19:02

So Giuseppe, I mean, in terms of what's included in that catalogue, we talked a little bit about sort of governance and curation, I mean, are those parts of the NAP research or the NAP platform as well?

Giuseppe Sollazzo : 19:14

They are. So, at the moment, as I say, we are in alpha, we just completed alpha, there was a show and tell a few weeks back, we are about to discuss the procurement of beta, and these questions are basically going to be explored by beta. So, one key question is how we operate this platform, whether it should be us, the DfT, or whether it should be an external entity. And part of the discussion is also around how do we perform those curation functions and how do we assess the quality of metadata (because metadata itself will have to be quality assessed), and, you know, basically the look and feel of the service around this. So these are questions that we are keen to explore in detail.

Tom : 20:00

So, we talked a little bit about data strategy when we were setting up the podcast. And I think it was last September when the DfT spoke about how it was working towards its own data strategy. As Head of Data at DfT, can you give us an overview of what having a data strategy means? And what did your research reveal up until this point?

Giuseppe Sollazzo : 20:20

So data strategy is pretty much two things. So first, it's the realisation that there's a lot of good work happening at the DfT and the need for us to have a coherent approach to bringing all these different work streams together. So some of them would be official statistics. Some of them will be in the Bus Open Data programme, Street Manager, NaPTAN itself, and all these things are very good in their own right. But we started to work towards the understanding of how they fit in a longer term vision of how we use and publish data at DfT. The second aspect, equally important, is a growing appreciation that data is probably the biggest enabler of transport innovation in this time and age. The DfT, as a policy department, has a role to play to facilitate that innovation, by helping the sector work with data more effectively. So the strategy basically wants to do that. So part of that is engagement with the community, part of that is actually learning and understanding how we make policy in the context of data driven transport innovation.

Tom : 21:22

Would you agree with the statement "future mobility is hindered by current standards, current data standards"? Apparently, that was a phrase that was used by a business development manager at the BSI group.

Giuseppe Sollazzo : 21:34

Interesting. I'm a big fan of data standards. Actually, one of my first jobs years ago involved something called HL7, which was a data standard used to transfer medical data in healthcare software. You can't imagine something geekier than that! But there's always, I think, a risk of falling prey to that famous XKCD comic where we have 14 standards. Someone comes in saying "let's create a universal standard to cover those 14 use cases", and boom, you've ended with 15 competing standards. So I think that always the key here is understanding the user need. So in transport, there are benefits to representing common concepts like, you know, timetables, fares, routes, and clearly there is a benefit in doing this in a standard way. Now, there are multiple standards covering slightly different angles, to mention a few you are aware of - TransXChange, Transmodel, NeTEx, GTFS, MDS, you know, all fantastic acronyms! But I would say in transport the proliferation of all these standards hasn't really hindered mobility apps so far. So if anything it's actually fostered the ongoing debate about, you know, how can we provide better data as an enabler to innovation? So I'm not negative about, you know, these competing standards, but of course, there is a benefit to say, you know, if the same concept is used across a variety of standards, then why not standardise that concept, and things like, you know, routes can be standardised. Things like timetables can be standardised. It's harder to standardise things, you know, like fares, for example, where there are so many different models. Yeah, I think there is an understanding on my side that there isn't a catch-all standard to cover all current use cases, mostly because these use cases evolve quicker than a standard can be developed. But yes, we can work together to bring some of these concepts together.

Matt : 23:23

You mentioned there about fares not working to a standardised format, which I completely agree with. It's something that we've been looking at for a very long time, how we structure fare data, and how we can build systems that will allow us to work with lots of different operators and not have to rewrite our importer every time and things like that. Are there any other data sets that you think specifically would be good to standardise that are very difficult?

Giuseppe Sollazzo : 23:46

That's a very good question. So on fares specifically, I think there's very good work being done by the Bus Open Data Service team and I think, you know, that's probably a conversation to be had with them. In terms of standardising data, there was a discussion recently that made me think about better standardising cycling data, for example. There was a question that came to me about the turning of cycling lanes from advisory into mandatory during the Coronavirus crisis. And there was a question as to, do we have any data set to represent that? And someone told me, well, actually, that concept is represented in a standard way in OpenStreetMap. So, I went to look at that, and unfortunately, although it is represented in a standard way, there was simply not enough data to make that, you know, statement about the whole country. So, that's opened a question to me about yes, and that {inaudible} but also data availability, and these are two concepts that probably need to be discussed together. Because, you know, if we were to ask the community to provide data about cycling lanes, if that's important, then of course, having a standard will simplify the data collection. That's important, but at the same time, we need to be, you know, careful not to create excessive burden on anyone, because otherwise you will never see that data being made available.

Tom : 25:11

It just goes to show how amazing a tool OpenStreetMap is, though. I mean, to not realise that that particular, you know, data was there, or the framework for it, and to go there to find it and just not have enough of the data, but to know that there's a platform there that it could be added to.

Giuseppe Sollazzo : 25:27

It does. At the same time, there's clearly a question, especially for, you know, a public authority, around, you know, the authoritativeness of data sets. And clearly there are different views in the community about what is really an authoritative data set for a number of things. So there are also competing interests in the market about certain data sets, and there are interesting different approaches to that. So yeah, that's probably a question to be explored in its own podcast.

Matt : 26:00

Quite possibly. Just a quick question before we go, a question that's more than likely going to be cut out, but this is a question that Beth really wanted me to ask. So Beth, our marketing manager, big on maps, and she wanted to know if you have a favourite map, because her favourite map is of the - what was it Tom? Was it the London Underground laid over the roads that it would follow in London?

Tom : 26:18

That's the one.

Matt : 26:19

Something like that.

Giuseppe Sollazzo : 26:20

Oh, that's nice!

Matt : 26:21

Do you have a favourite map?

Giuseppe Sollazzo : 26:23

Oh, I have so many! I mean, maps are crazily nice. So actually, let me just blow my own trumpet for now. I did this crazy thing using OpenStreetMap of basically colouring the roads by the name, so, you know, you colour something called 'the road' in pink, something called 'the street' in green, something called 'a lane' in another colour, and you get this, you know, nice looking geeky map. Now a guy I know called Duncan Geere, who is a data journalist, took that map, and he filmed his plotter drawing that map live, and I thought that was just the most mesmerising thing I've ever seen. Hypnotic, like seeing a plotter drawing the map live. That was good. But aside from the personal stuff, I like an entire category of maps called the figure-ground maps. So it's when you, actually it's not a map. It's basically just a picture of all the built spaces in an area. So you have all the building shapes. And what you have is this sort of comparison between what's black - so the buildings - and what's white, which is basically the space, and it's something both really aesthetically pleasing, but also pretty informative about the density of buildings in an area.
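
[Editor's sketch] The road-colouring idea Giuseppe describes can be illustrated in a few lines of Python. This is not his code: the function name is hypothetical, the rule of looking at the last word of the name is a simplifying assumption, and only pink for roads and green for streets come from the conversation (the colour for lanes and the fallback colour are made up here).

```python
# Colour per road-name suffix. "road" and "street" follow the episode;
# "lane" is only described as "another colour", so blue is an assumption.
SUFFIX_COLOURS = {
    "road": "pink",
    "street": "green",
    "lane": "blue",
}
DEFAULT_COLOUR = "grey"  # assumed fallback for any other kind of name


def colour_for(road_name: str) -> str:
    """Pick a drawing colour from the last word of a road's name."""
    words = road_name.strip().lower().split()
    if not words:
        return DEFAULT_COLOUR
    return SUFFIX_COLOURS.get(words[-1], DEFAULT_COLOUR)


if __name__ == "__main__":
    for name in ["Abbey Road", "High Street", "Penny Lane", "The Mall"]:
        print(name, "->", colour_for(name))
```

Applied to every named way in a map extract, a lookup like this gives each line segment its colour before the map is drawn.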

Matt : 27:32

Brilliant. Thank you so much for answering that question. And all of the others. Really appreciate you giving us some time today to sit down and talk about this information. Hopefully, we'll have you on again in future perhaps. But for now, thank you so much for joining us.

Giuseppe Sollazzo : 27:46

Thank you very much.

Matt : 27:52

Next week, we'll be taking a break, but we'll be back for another series in a few weeks' time. We'd like to massively thank everyone who took part in series one of Making Passenger. It was a great opportunity for us to learn from experts from many different pieces of the transport technology puzzle. We hope the podcast has contributed in some way to keeping the conversation going during lockdown. It certainly taught us more than we ever thought we'd know about podcasting. Questions, comments or suggestions - we'd love to hear from you. Please do tweet us @makingpassenger and if you want to know what we're up to between now and when you next hear from us, do sign up for our newsletter at www.discoverpassenger.com Until next time!
