New AI system to extract data from the internet


Scientists have developed a new artificial intelligence system that can more effectively extract data from the vast wealth of information present on the internet.

The data needed to answer myriad questions – about, say, the correlation between the industrial use of certain chemicals and incidents of disease, or between patterns of news coverage and voter-poll results – may all be online in the form of plain text. However, extracting data from plain text and organising it for quantitative analysis can be prohibitively time-consuming.

Researchers from the Massachusetts Institute of Technology (MIT) in the US have developed a new approach to information extraction. Most machine-learning systems work by combing through training examples and looking for patterns that correspond to classifications provided by human annotators. For instance, humans might label parts of speech in a set of texts, and the machine-learning system will try to identify patterns that resolve ambiguities, such as when “her” is a direct object and when it is an adjective.

Typically, computer scientists try to feed their machine-learning systems as much training data as possible, which generally increases the chances that a system will be able to handle difficult problems. In the new research, by contrast, the scientists trained their system on scant data.

“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” said Regina Barzilay, professor at MIT.

A machine-learning system will generally assign each of its classifications a confidence score, which is a measure of the statistical likelihood that the classification is correct, given the patterns discerned in the training data. With the new system, if the confidence score is too low, the system automatically generates a web search query designed to pull up texts likely to contain the data it is trying to extract. It then attempts to extract the relevant data from one of the new texts and reconciles the results with those of its initial extraction. If the confidence score remains too low, it moves on to the next text pulled up by the search string, and so on.
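The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the researchers' implementation: the names `Extraction`, `reconcile`, `extract_with_fallback`, the threshold of 0.8 and the cap on retrieved texts are all assumptions made for the example, and the reconciliation step here simply keeps the more confident result, whereas the article says the real system learns how to fuse results.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Extraction:
    """One extracted value plus the extractor's confidence in it (0.0 to 1.0)."""
    value: Optional[str]
    confidence: float


def reconcile(current: Extraction, new: Extraction) -> Extraction:
    # Assumption: keep whichever result is more confident.
    # The system in the article learns this fusion step instead.
    return new if new.confidence > current.confidence else current


def extract_with_fallback(
    article: str,
    extract: Callable[[str], Extraction],    # hypothetical extractor
    make_query: Callable[[str], str],        # hypothetical query generator
    search: Callable[[str], Iterable[str]],  # hypothetical web search
    threshold: float = 0.8,                  # assumed confidence cut-off
    max_texts: int = 5,                      # assumed limit on extra texts
) -> Extraction:
    best = extract(article)
    if best.confidence >= threshold:
        return best  # confident enough, no search needed

    # Confidence too low: look for other texts likely to hold the same data.
    query = make_query(article)
    for i, text in enumerate(search(query)):
        if i >= max_texts:
            break
        best = reconcile(best, extract(text))
        if best.confidence >= threshold:
            break  # stop as soon as the fused result is confident enough
    return best
```

In this sketch the extractor, query generator and search function are passed in as plain callables, so each could be swapped for a learned component without changing the loop itself.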

Every decision the system makes is the result of machine learning. The system learns how to generate search queries, gauge the likelihood that a new text is relevant to its extraction task, and determine the best strategy for fusing the results of multiple attempts at extraction. The researchers compared their system’s performance to that of several extractors trained using more conventional machine-learning techniques. For every data item extracted in both tasks, the new system outperformed its predecessors, usually by about 10 per cent.
