TAS Data Collector helps make the spending of public money more transparent. Precognox developed a service for Átlátszó.hu (‘Transparent’), called Üvegzsebfigyelő (‘Glass Pocket Observer’), which makes available all the public information about how national institutions spend public money.
With the release of Glass Pocket Observer, a dream came true for dozens of Hungarians who are curious about the public documents of national institutions and want access to public information. It can also make the daily work of professionals easier, e.g. journalists who want to process and analyze the accessible data to understand it more deeply.
What makes this application unique? It saves users a lot of time and effort by collecting data from numerous separate web pages, so they do not need to visit the websites one by one. They are also spared from searching for bits and pieces of information scattered across the pages of national institutions. The task is performed by TAS Data Collector, which collects circa 80 thousand documents from 240 national organizations. This means that it visits hundreds of websites, as each institution's data is hosted on more than one subpage.
Once data is collected by a small client program, it is sent to the central server of TAS, where it is safely stored. Our customer does not need to deal with data gathering and harvesting, since we carry out the whole process.
It is a familiar experience that on websites that are not user-friendly you can spend hours looking for certain pieces of information. Our developers found all the necessary fragments of information on the websites of public institutions and set up configurations that are able to do the same. Machines are taught to execute tasks that demand human knowledge, but much more quickly and in much greater quantity than humans can. What is more, once a configuration is set up, it repeats exactly the same task at a set frequency. TAS Data Collector is set to identify and categorize the information on the webpage that is relevant to the user (e.g. document, date of publication, category of document).
What if the website is modified? What if one site is replaced by a new one? Change does not hinder our work at all, as TAS Data Collector is adjusted whenever the content of the website changes. Data is traced and collected regardless of how its source is transformed.
Another issue that is easy for humans, but needs to be taught to machines, is recognizing the various forms data can take. Dates, for example, tend to show up in different formats: the order of year, month and day can vary. The configuration is set to convert each date to a unified form and present it to users as a clear, understandable and easily searchable piece of information. A date in any format is first identified on the given website, then sent to the central server of TAS, where it is processed, transformed and unified.
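To illustrate the date unification described above, here is a minimal sketch in Python. It is not the actual TAS configuration, which is not shown in this article; the function name normalize_date, the list of candidate formats and the choice of ISO 8601 (YYYY-MM-DD) as the unified form are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of input formats, tried in order.
# The formats the real configuration handles are not described in the article.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2014-03-15
    "%Y.%m.%d.",  # 2014.03.15. (trailing dot, common in Hungarian dates)
    "%Y.%m.%d",   # 2014.03.15
    "%d/%m/%Y",   # 15/03/2014
    "%d.%m.%Y",   # 15.03.2014
]


def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    # Remove all whitespace so "2014. 03. 15." and "2014.03.15." look the same.
    cleaned = "".join(raw.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    return None


if __name__ == "__main__":
    for sample in ["2014. 03. 15.", "15/03/2014", "2014-03-15", "no date here"]:
        print(repr(sample), "->", normalize_date(sample))
```

Trying year-first formats before day-first ones is a design choice of this sketch. A format list like this cannot resolve truly ambiguous inputs such as 03/04/2014 without knowing the convention of the source website.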
Finally, like humans, TAS Data Collector is capable of reading between the lines. It recognizes not only pre-labelled information, but also data hidden in the running text of the website. Our configuration is able to cope with data of any degree of complexity. What makes this possible is data enrichment, which is executed as the first step of data processing.
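As a simple illustration of picking data out of running text, the sketch below finds year-first dates inside a sentence using a regular expression. It only stands in for the idea; the article does not describe how the data enrichment step in TAS works, and the pattern and the function name find_dates are assumptions made for the example.

```python
import re

# Hypothetical pattern: a four-digit year, then month and day, separated by
# ".", "-" or "/", with an optional space after each separator.
DATE_IN_TEXT = re.compile(r"\b(\d{4})[.\-/]\s?(\d{1,2})[.\-/]\s?(\d{1,2})\b")


def find_dates(text):
    """Return every year-first date found in the text as YYYY-MM-DD strings."""
    found = []
    for year, month, day in DATE_IN_TEXT.findall(text):
        # Zero-pad month and day; no calendar validation is done in this sketch.
        found.append(f"{int(year):04d}-{int(month):02d}-{int(day):02d}")
    return found


if __name__ == "__main__":
    sentence = "The contract was published on 2014. 03. 15. by the ministry."
    print(find_dates(sentence))
```

The pattern does not check that the month and day are within valid ranges, so a real pipeline would validate each match before storing it.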
TAS is an open system, which means that one can start and finish work outside it. For Glass Pocket Observer, the first part of the job was done by Data Collector: collecting, processing and displaying data. However, at the customer’s request, work did not stop there. Because Glass Pocket Observer is a complex web application, additional functions were developed to help users better understand and efficiently handle data, as well as to encourage the institutions to provide citizens with transparent data. Examples include notifications for users and for the public organizations, and an evaluation of the institutions' performance.
If you need help with getting clean data from the web, please contact us.