Open source solution webXray to help protect public websites


On 4 February 2020, the UK’s Centre for Data Ethics and Innovation (CDEI) published a report on online targeting and ad-tracking processes on public websites. Several additional studies were conducted on the topic and the majority of them used the open source software webXray to collect data and conduct analysis on the websites’ level of compliance with current data protection regulation.

The CDEI’s report, ‘Review of online targeting: Final report and recommendations’, was published after the release of several studies on third-party advertising on UK local council websites by the BBC, the browser company Brave and the ePrivacy company Cookiebot. These studies highlighted the presence of third-party advertising and ad tech trackers. For instance, Cookiebot found that 52% of EU public health service web pages contained commercial trackers. The BBC Shared Data Unit analysed the benefits pages of more than 400 local councils over two days and found that they contained more than 950 cookies and that more than two-thirds of councils did not appear to ask for the correct form of consent under privacy laws.

To collect the data for these studies, the BBC, Brave and Cookiebot used webXray to analyse third-party advertising on public websites. WebXray is open source software for analysing third-party content on webpages and identifying the companies which collect user data. It is operated through a command line interface that is intended to be accessible to non-programmers. It is designed for academic researchers, privacy compliance officers, regulators, and users who are curious about hidden data flows on the web.

Thanks to webXray, users can access reports with the following information: the average number of third parties and cookies per website, the most commonly occurring third-party domains and elements, the volume of data transferred, and the use of SSL encryption. WebXray uses a custom library of domain ownership to trace a flow of data from a given third-party domain to its corporate owner and, if applicable, to parent companies. The data schema used for this library is flexible, allowing for the generation of custom reports as well as the authoring of extensions that add further data sources.
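To illustrate how a domain ownership library can link a third-party domain to its owner and parent companies, here is a minimal Python sketch. The record layout, the owner identifiers and the example domains are all hypothetical and chosen for illustration; they do not reproduce webXray's actual library format or code.

```python
# Hypothetical ownership records, NOT webXray's actual library format.
# Each owner lists the domains it controls and, optionally, a parent owner.
OWNERS = {
    "tracker-co": {
        "name": "Tracker Co",
        "parent": "ad-holdings",
        "domains": ["tracker.example"],
    },
    "ad-holdings": {
        "name": "Ad Holdings",
        "parent": None,
        "domains": ["adholdings.example"],
    },
}


def build_domain_index(owners):
    """Map every known domain to the id of the owner that lists it."""
    return {
        domain: owner_id
        for owner_id, record in owners.items()
        for domain in record["domains"]
    }


def ownership_chain(domain, owners, index):
    """Return owner names from the direct owner up to the top-level parent.

    An unknown domain yields an empty list. The `seen` set guards
    against accidental cycles in the parent links.
    """
    chain = []
    seen = set()
    owner_id = index.get(domain)
    while owner_id is not None and owner_id not in seen:
        seen.add(owner_id)
        chain.append(owners[owner_id]["name"])
        owner_id = owners[owner_id]["parent"]
    return chain


index = build_domain_index(OWNERS)
print(ownership_chain("tracker.example", OWNERS, index))
```

Keeping the parent link on each record, rather than storing a flattened chain per domain, means that a change of corporate ownership only has to be recorded in one place.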

The public version of webXray uses Chrome to load pages and stores data in an SQLite database. The software is written in Python and is available on GitHub under the GNU General Public License v3.0.
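Because the collected data sits in SQLite, report figures such as the average number of third parties per website can in principle be computed with plain SQL. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table name, column names and sample rows are assumptions made for illustration and do not reflect webXray's real schema.

```python
import sqlite3

# Hypothetical table and sample rows, NOT webXray's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_request (page TEXT, third_party_domain TEXT);
INSERT INTO page_request VALUES
  ('council-a.example', 'tracker.example'),
  ('council-a.example', 'ads.example'),
  ('council-b.example', 'tracker.example');
""")


def average_third_parties_per_page(conn):
    """Average count of distinct third-party domains per page.

    Pages with no third-party requests have no rows in this
    hypothetical table, so they are not included in the average.
    """
    row = conn.execute("""
        SELECT AVG(n) FROM (
            SELECT COUNT(DISTINCT third_party_domain) AS n
            FROM page_request
            GROUP BY page
        )
    """).fetchone()
    return row[0]


def most_common_third_parties(conn, limit=5):
    """Third-party domains ranked by the number of pages they appear on."""
    return conn.execute("""
        SELECT third_party_domain, COUNT(DISTINCT page) AS pages
        FROM page_request
        GROUP BY third_party_domain
        ORDER BY pages DESC, third_party_domain ASC
        LIMIT ?
    """, (limit,)).fetchall()


print(average_third_parties_per_page(conn))
print(most_common_third_parties(conn))
```

A real analysis would point `sqlite3.connect` at the database file produced by the tool and adapt the queries to the tables it actually contains.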

Going further than data collection

Municipalities and owners of public websites can use webXray to check their compliance with current EU legislation, namely the General Data Protection Regulation (GDPR), and to monitor which companies run third-party advertising and ad-tracking processes on their platforms.

In light of those findings, the UK’s CDEI made the following recommendations in its report:

The online harms regulator should be required to provide regulatory oversight of targeting;
A code should be drafted for public sector use of online targeting;
Platforms should be required to host publicly accessible archives for online political advertising, “opportunity” advertising (jobs, credit and housing), and adverts for age-restricted products;
Regulation should encourage platforms to provide users with more information and control;
“Data intermediaries” could improve data governance and rebalance power towards users.