VKontakte’s local friendship networks: identifying the missing residence of users in profile data
Keywords: social network analysis, online communities, VKontakte, network topology, big data, using R for data analysis, network homophily, missing data
Online social networks (e.g. VKontakte, the most popular Russian website) are a readily available source of information about users owing to their open data policies. Researchers therefore have ample opportunity to study the topology of interaction networks in the online environment using social network analysis. However, the personal data that users provide in their public profiles are often incomplete: fields such as gender, age or city may be left blank inadvertently or skipped intentionally. At the same time, these characteristics are essential attributes of the network’s nodes (i.e. users) and help single out clusters of similar agents and their behavior patterns. Missing data can significantly affect network metrics (e.g. network size, average path length between two participants, the distribution of the number of connections per participant) and distort the results. Hence there is a need to fill the gaps in the data. The paper presents a case study on the design and application of a classifier that determines whether a VKontakte user whose location is not specified in the profile is a resident of a particular city. The classifier was created and tested on the user network of the city of Izhevsk. It is based on the decision tree method, which filters accounts step by step through a series of questions. The paper explains the choice of the main indicators that help the classifier determine a user’s city, describes the algorithm, and shows how the network topology changes as the missing data on users’ locations are added.
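As an illustration only, the idea of filtering accounts through a series of questions can be sketched as a small rule cascade. The abstract does not list the indicators or thresholds used, so everything here is an assumption made for the example: the `Account` structure, the function `is_probable_resident`, the use of friends' declared cities as the indicator (motivated by network homophily), and both threshold values. The sketch is written in Python for brevity, although the keywords indicate that the analysis itself was carried out in R.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Account:
    """Hypothetical, simplified view of a public profile."""
    user_id: int
    city: Optional[str]                      # None when the city field is blank
    friends: List[int] = field(default_factory=list)


def is_probable_resident(account: Account,
                         accounts: Dict[int, Account],
                         target_city: str,
                         min_known_friends: int = 3,
                         min_local_share: float = 0.5) -> bool:
    """Illustrative question cascade; the thresholds are assumed, not from the paper."""
    # Question 0: a declared city settles the matter directly.
    if account.city is not None:
        return account.city == target_city
    # Question 1: do enough friends declare a city to make a judgment at all?
    known = [accounts[f].city for f in account.friends
             if f in accounts and accounts[f].city is not None]
    if len(known) < min_known_friends:
        return False
    # Question 2: do at least min_local_share of those friends declare the target city?
    local_share = sum(c == target_city for c in known) / len(known)
    return local_share >= min_local_share


# Toy data: two accounts (5 and 6) leave the city field blank.
accounts = {
    1: Account(1, "Izhevsk", [2, 3, 4, 5]),
    2: Account(2, "Izhevsk", [1, 4, 5]),
    3: Account(3, "Moscow", [1, 5, 6]),
    4: Account(4, "Izhevsk", [1, 2, 5]),
    5: Account(5, None, [1, 2, 3, 4]),
    6: Account(6, None, [3]),
}

declared = [u for u, a in accounts.items() if a.city == "Izhevsk"]
recovered = [u for u, a in accounts.items()
             if a.city is None and is_probable_resident(a, accounts, "Izhevsk")]

print("declared residents:", declared)
print("recovered residents:", recovered)
print("network size before/after:", len(declared), len(declared) + len(recovered))
```

In this toy network, adding the recovered account enlarges the city network by one node and the ties that come with it, which is the kind of change in network metrics the abstract refers to.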