Mastering Big Data

Whether on the internet, at the stock exchange or in medicine, the volume of data people have to deal with keeps growing. Former staff members of the Helmholtz Zentrum München have developed software capable of handling large amounts of unstructured data. It works on principles similar to those of the human brain.

We all know it from our daily Google searches: the flood of data confronting us keeps rising. Whether stock brokers are predicting share prices or physicians are trying to understand diseases, the crucial step is always to separate the few important pieces of information from the seemingly endless mass of unimportant ones. Today's search technologies can search data by keyword, but they cannot recognise redundant or overlapping information, let alone context. The question of what is really important and what is not therefore remains a riddle for the search engines currently in use.

Answering that question is precisely what the software developed by Clueda, a company founded by former scientists from the Helmholtz Zentrum München (HZM), attempts to do. The founders derived the basic principles of their idea from the human brain. "Our software associates and learns", says Volker Stümpflen, Managing Director of Clueda. "This enables it to extract from large amounts of data exactly those pieces of information that are relevant from a particular point of view."

Volker Stümpflen and his colleagues Mara Hartsperger and Benedikt Wachinger worked at the HZM Institute of Bioinformatics and Systems Biology (IBIS), where they jointly researched complex diseases with genetic causes. It is no longer rare for scientists like them to found companies, as figures published by the Stifterverband für die Deutsche Wissenschaft illustrate. According to its "Gründungsradar", German universities alone generated 1,145 companies in 2012. Extramural research organisations such as the strongly application-oriented Fraunhofer-Gesellschaft, the Max Planck Society and the Helmholtz Association are not even included in these figures.

At the Helmholtz Zentrum München, one of the issues Stümpflen and his colleagues looked into was the flood of publications: at the time, there were some 400,000 publications in the field of diabetes alone. Anyone who actually set out to read them all would be occupied for about 200 years. The approaches the scientists developed for having a computer sort information and identify complementary relationships gave rise to the idea of developing further products and founding a company.
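To put the two figures in proportion, the reading pace they imply can be worked out in a few lines. This is only a back-of-the-envelope sketch: the 400,000 publications and the 200 years come from the article, while the assumption of reading on all 365 days of the year is ours.

```python
# Figures quoted in the article
PUBLICATIONS = 400_000
READING_YEARS = 200

# Assumption (not from the article): reading every day of the year
DAYS_PER_YEAR = 365

per_year = PUBLICATIONS / READING_YEARS
per_day = per_year / DAYS_PER_YEAR

print(f"{per_year:.0f} publications per year, about {per_day:.1f} per day")
# prints "2000 publications per year, about 5.5 per day"
```

In other words, the estimate corresponds to a reader working through five to six papers every single day for two centuries.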

"Our software is capable of making decisions", says Stümpflen. He explains that first, the documents are "semantically processed". This means: the software not only analyses the grammatical structure of sentences, but also the context deriving from this, which it then stores in a kind of associative knowledge network. "With this knowledge it then can, for example, determine that the sentence "VW - buys - Porsche" is of relevance to share prices", says Stümpflen.

To achieve this, the software first has to learn a large number of terms and their meanings, just like a child. In the medical field, for instance, this involves some 600,000 terms that are fed into the software from a knowledge base. Beyond a certain number of terms, the system knows enough to work out the meaning of further terms on its own.

Clueda recently received the "Best in Big Data Award" from the magazine Computerwoche for its product "Real Time Analytics", which the company developed in co-operation with Baader Bank. The software helps investors and stock exchange traders to filter the news items relevant to stock market prices out of the mass of available information. The former Helmholtz researchers have also developed a software application for the medical professions: it helps physicians to identify indications of particular disease patterns and their causes in the mass of patient files, medical reports and findings. Further applications are planned for recruiting "suitable" patients for clinical studies and for identifying hitherto unknown correlations between medication, genes and disease patterns.

25.11.2013, Martin Trinkaus