
13 Jun 18 06:23

Have you ever wondered whether an encyclopedia could help shape re/insurance products? Or thought about using the world's largest knowledge bank to better estimate risks and understand exposures in the re/insurance industry? That is just the tip of the iceberg of what we re/insurers can derive from an openly available data source. The intention behind this article is to tell readers about some smart solutions we devised and sculpted to fit our purpose. The Swiss Re Casualty R&D team, under the leadership of Filippo Salghetti-Drioli, organized a hackathon that brought together colleagues from various technical and analytical backgrounds to find as many different approaches as possible to gather data from open sources.

Swiss Re, as one of the oldest players in the reinsurance industry, owns an enormous amount of data. At the same time, we still depend on external sources, both free and paid, for additional information. The challenge with open data sources is that they are few in number, and the formats in which their data is published demand a lot of time to convert into useful information. Paid data sources, by contrast, are generally well formatted and of high quality, but they are expensive and can only be accessed for a limited period. Dependency on these expensive vendors cannot be eliminated completely; it can, however, be reduced by making better use of open data sources. We decided to start our research with one of the largest open sources, and one hugely popular with the internet community: Wikipedia. Apart from being a free databank, Wikipedia has rapidly become one of the most useful reference works online, providing internet users with millions of articles on a wide range of topics. The breadth, depth and quality of the information available on Wikipedia compare very favourably with other existing encyclopedias.
Mining Wikipedia and connected databases:
We found some smart ways to extract useful information from Wikipedia and other linked data sources:

1) DBpedia (using SPARQL)
DBpedia, together with SPARQL (SPARQL Protocol and RDF Query Language), lets you extract structured and semi-structured data from Wikipedia. For example, you can open the public DBpedia SPARQL endpoint, paste the query below into the 'Query Text' box and hit 'Run Query':

SELECT * WHERE {?Person rdfs:label "Roger Federer"@en.}

The output will contain a couple of links from which you can get information about the subject you queried.
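The same query can also be submitted programmatically instead of through the browser form. The sketch below builds a GET URL against DBpedia's public SPARQL endpoint; the parameter names shown (`query`, `format`) are the ones the endpoint commonly accepts, and the actual network fetch is left to the reader:

```python
from urllib.parse import urlencode

# The public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

def build_sparql_request(label: str) -> str:
    """Build a GET URL asking DBpedia for resources with the given English label."""
    query = f'SELECT * WHERE {{?Person rdfs:label "{label}"@en.}}'
    params = urlencode({"query": query, "format": "application/sparql-results+json"})
    return f"{ENDPOINT}?{params}"

url = build_sparql_request("Roger Federer")
# Fetching this URL (e.g. with urllib.request.urlopen) returns a JSON document
# whose results/bindings entries list the matching DBpedia resource URIs.
```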

2) Rest API
The Wikimedia REST API offers access to Wikimedia's content and metadata in machine-readable formats. The MediaWiki action API is a web service that provides convenient access to wiki features, data and metadata over HTTP, via a URL usually ending in api.php.
E.g. an action API URL can tell English Wikipedia's web service to send you the content of the main page. You can use any programming language to make an HTTP GET request for that URL (or just visit it in your browser) to get a JSON document.
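As a sketch of how such a request URL is assembled, the snippet below targets English Wikipedia's standard api.php entry point and uses the action API's `parse` action to request the main page as JSON (parameter choices here are illustrative; the API supports many other actions and options):

```python
from urllib.parse import urlencode

# English Wikipedia's action API entry point (the usual api.php path).
API = "https://en.wikipedia.org/w/api.php"

def main_page_url() -> str:
    """Build the GET URL that asks the action API for the main page as JSON."""
    params = {
        "action": "parse",    # parse a page and return its content
        "page": "Main Page",  # the page title to fetch
        "format": "json",     # machine-readable response
    }
    return f"{API}?{urlencode(params)}"

url = main_page_url()
# To actually fetch it over the network:
#   import urllib.request, json
#   data = json.loads(urllib.request.urlopen(url).read())
```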

3) Wikidata
Wikidata is a free source that acts as the central data repository for the structured data of the other Wikimedia sister projects, including Wikipedia, Wikivoyage, Wikisource and others. Users can download specific content or entire dumps related to various topics of interest.
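Entities in Wikidata dumps are JSON records keyed by language-tagged labels, descriptions and sitelinks, among other fields. The fragment below is a heavily simplified, illustrative stand-in for one such record (real dump entries carry many more fields, such as full claims and references):

```python
import json

# Simplified fragment in the shape of a Wikidata entity record (illustrative only).
entity_json = """
{
  "id": "Q1426",
  "labels": {"en": {"language": "en", "value": "Roger Federer"}},
  "descriptions": {"en": {"language": "en", "value": "Swiss tennis player"}},
  "sitelinks": {"enwiki": {"site": "enwiki", "title": "Roger Federer"}}
}
"""

entity = json.loads(entity_json)

def english_label(entity: dict) -> str:
    """Pull the English label out of a Wikidata entity record."""
    return entity["labels"]["en"]["value"]

print(english_label(entity))  # -> Roger Federer
```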

4) Web scraping
Web scraping is a well-known technique for extracting information from HTML pages. It lets users download the content of HTML pages (in unstructured form) and offers a wide range of functions to clean and format the downloaded text corpora. The advantage of this technique is that it is not restricted to Wikipedia and its related databases; it can also be used to mine other open sources (as permitted by the Swiss Re legal team).
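A minimal scraping sketch using only Python's standard library is shown below: it collects the text of every paragraph element on a page. The embedded markup stands in for HTML downloaded from a cleared source; in practice richer libraries (such as BeautifulSoup in Python or rvest in R) make this more convenient:

```python
from html.parser import HTMLParser

class ParagraphScraper(HTMLParser):
    """Collect the text content of every <p> element on a page."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Invented page standing in for downloaded HTML.
html = "<html><body><h1>Losses</h1><p>Flood in 2005.</p><p>Fire in 2010.</p></body></html>"
scraper = ParagraphScraper()
scraper.feed(html)
print(scraper.paragraphs)  # -> ['Flood in 2005.', 'Fire in 2010.']
```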
From unstructured to structured:
Using the above techniques we extracted a significant amount of information related to large liability losses from Wikipedia. However, the information fetched was in unstructured form, like any newspaper article. We used some smart libraries from R and Python to extract useful information (such as dates, loss locations, type of disaster, numbers of dead and injured, and economic and insured loss amounts) from the unstructured text corpora.
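To give a flavour of that extraction step, here is a deliberately simple regex-based sketch over an invented snippet written in the style of a disaster article. The event, figures and patterns are all illustrative; the actual pipeline used richer NLP tooling in R and Python:

```python
import re

# Invented snippet in the style of a Wikipedia disaster article (illustrative).
text = ("The explosion occurred on 23 April 2004 in Texas, killing 15 people "
        "and injuring 170. Economic losses were estimated at USD 1.5 billion.")

# Toy patterns for a few of the targeted fields.
date = re.search(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b", text).group()
dead = int(re.search(r"killing (\d+)", text).group(1))
injured = int(re.search(r"injuring (\d+)", text).group(1))
loss = re.search(r"USD [\d.]+ \w+", text).group()

print(date, dead, injured, loss)  # -> 23 April 2004 15 170 USD 1.5 billion
```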

The paradigm shift:
We envision that Wikipedia and other open data sources contain a significant amount of information that can be leveraged across many streams and domains. In the re/insurance industry alone, open data sources can contribute to enhanced risk engineering, claims processing, underwriting, client solutions, economic research and more. This additional information can not only enrich your existing datasets, but also serve as supporting evidence for your analyses and model outputs. The techniques described above can be tweaked to best serve your purposes. In today's ever-changing market conditions, we understand the importance of tech-driven solutions and are therefore taking our tech activities to the next level, steadily adding new tools and technologies to our toolkit. Mining open data sources is just one of them.

(Disclaimer: All the necessary approvals were taken from the legal team before accessing/downloading Wikipedia data)
Authors: Ashok Shetty, Markus Reitschuster

Category: Other

Location: Bangalore, Karnataka, India

