Can you measure museum visits with social media data and machine learning?
Museums need to understand their visitors: who they are, how many there are and, crucially, why they came. At a local level, this offers insight into the museum’s engagement with communities and its impact on the surrounding economy. At a national level, it informs our understanding of museums' role, both cultural and economic, within society - which is especially relevant given the Department for Culture, Media and Sport's (DCMS) current review of museums.
However, many museums only collect limited information about visits and the data they do collect is not standardised. This is partly because collecting this information can be costly; for example, museums with free entry can't always readily count how many people enter their buildings. 
In this post, we explore whether new digital data sources can start to address this by estimating visitor numbers using information from social media (such as data from FourSquare) and techniques from machine learning. FourSquare is an online location-based social network (LBSN) that allows people to virtually 'check-in' to venues, to let their friends know where they currently are. Machine learning is a set of techniques, developed largely in computer science, that allows patterns in complicated data sets to be identified.
We will be studying the 1,303 Arts Council England (ACE) accredited museums (museums in England that have successfully applied for accreditation from ACE, and met its set of standards). Although this doesn't include every museum in England, it's a significant proportion which includes the largest national museums. The list also contains several ACE-accredited art galleries.
The analysis is part of a wider trend of using machine learning and new data sources, such as the data generated by social networks and mobile phones, to help with the hard task of tracking people’s movements. For example, data generated by mobile phones and collected by network operators has been used to estimate how many people take part in a given event. Twitter and FourSquare data has been used for similar purposes.
For each museum, ACE has information on how it's run (whether by the local authority or independently), its accreditation status (full or provisional), its geographic location and the associated Index of Multiple Deprivation (IMD) information, which offers insight into the state of the local economy and environment. For 254 of these museums (around a fifth of the group), information on the number of annual visits is also provided as part of the ACE accreditation data or through the Visit Britain visitor survey data, which we have matched across the group.
We will use this information, along with FourSquare check-ins, to develop a model that estimates the approximate number of visits. This model will then be used to estimate the approximate number of visits for those venues where no information was available in the data. 
There are challenges in doing this. First, it is hard to deal with smaller venues that might not even be in the FourSquare database, or venues that people who use FourSquare are just less likely to visit. Second, the online popularity of a venue may not always correlate with its offline popularity, for reasons other than the self-selecting nature of FourSquare users. For example, in the case of museums with open-air sites or museums on sites which have multiple uses, what counts as checking in on FourSquare may not correspond cleanly to the act of visiting the museum. Among FourSquare users, there may also be varying patterns of behaviour that affect the results and, while we have information on the number of check-ins, this is not linked to individual users.
In order to make the task simpler, instead of predicting a raw count for the number of museum visits, we divide museums into three categories according to their number of visits: small, medium and large. In particular, we classify 'small' as the bottom 25 per cent (i.e. when you sort museums by number of visits, they are in the quarter with the lowest visitor numbers), 'large' as the top 25 per cent and 'medium' as what falls in between.
The distribution is shown in the figure below. Note that the x-axis is in logarithmic scale (logarithms compress larger numbers, e.g. the base-10 logarithm of 10 is 1; the base-10 logarithm of 100 is 2). We do this to visibly fit the data on the graph, as there is a lot of variation in museum visits, ranging from a few thousand visits to millions.
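The quartile split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the visit counts are made up, the function name `size_categories` is hypothetical, and the treatment of museums that fall exactly on a cut point is an assumption.

```python
from statistics import quantiles

def size_categories(visits):
    """Label each museum 'small', 'medium' or 'large' by visit quartile."""
    # quantiles(n=4) returns the three cut points that split the data into quarters
    q1, _, q3 = quantiles(visits, n=4, method="inclusive")
    labels = []
    for v in visits:
        if v <= q1:          # bottom quarter (assumption: ties count as small)
            labels.append("small")
        elif v >= q3:        # top quarter (assumption: ties count as large)
            labels.append("large")
        else:                # the middle half
            labels.append("medium")
    return labels

# Made-up annual visit counts, for illustration only
example_visits = [1_200, 3_500, 8_000, 15_000, 40_000, 90_000, 250_000, 1_500_000]
example_labels = size_categories(example_visits)
print(list(zip(example_visits, example_labels)))
```

With eight museums, the two with the fewest visits are labelled small, the two with the most are labelled large and the remaining four are medium.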
Since the plot’s x-axis is shown in logarithmic scale, the absolute range of visit numbers for the large category is much wider than the medium category, which is in turn wider than the small category. 
As shown in the picture below, places which had more FourSquare check-ins also tended to have higher numbers of visits.
We then use two different machine learning algorithms (giving three models in total, as one of the algorithms is used in two variants) that take data on a set of input variables, namely the museum type, the local authority’s population and IMD index, the number of FourSquare check-ins to the venue and accreditation status, to predict the size group that a museum is in.
The first algorithm we use is a k-nearest-neighbours (k-NN) classifier. This predicts which size group a museum should be allocated to by taking the museum in question and seeing which is the most common classification among the k museums that are closest to it - closest in the mathematical sense of having the most similar characteristics in the input data (e.g. similar FourSquare check-ins, deprivation scores etc). For example, if we want to predict the size group of a museum, and two out of the three museums with the most similar characteristics to our museum are large museums and one is a medium museum, then a k-NN classifier with k=3 would predict our museum to be in the large category, as that is the most common category among its neighbours.
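The worked example in the previous paragraph can be written out as a short Python sketch. It assumes Euclidean distance and two made-up numeric features per museum; the function name `knn_predict` and all the values are hypothetical, and the distance measure actually used in the analysis may differ.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Rank the training museums by Euclidean distance to the query museum
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up features: (log10 of check-ins, rescaled deprivation score)
train_points = [(3.0, 0.2), (2.8, 0.3), (2.5, 0.4), (1.0, 0.8), (0.8, 0.7)]
train_labels = ["large", "large", "medium", "small", "small"]

prediction = knn_predict(train_points, train_labels, query=(2.9, 0.3), k=3)
print(prediction)
```

Here the three nearest neighbours are two large museums and one medium museum, so the prediction is large. In practice the features would need to be put on comparable scales first, otherwise the variable with the largest numbers dominates the distance.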
The second model is a support vector machine (SVM). This uses an alternative strategy to classify the data. In principle, one would expect museums within the same classification to have similar characteristics e.g. large museums would have more FourSquare check-ins, and be more likely to be in London than medium and small museums. The SVM tries to create a set of mathematical boundaries that separates the data for the different kinds of museum as cleanly as possible. When the boundaries are fed the data on a museum’s characteristics they return whether the museum falls within the large, small or medium 'areas' that the boundaries define as a prediction of the museum’s size. The SVM tries to choose these boundaries to make the predictions as accurate as possible. The third model we use is a variation on the SVM which allows for more complicated boundaries to separate the three kinds of museum e.g. curved borders as opposed to straight lines. 
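To make the idea of boundaries concrete, the sketch below shows only the prediction step of a linear, one-boundary-per-category classifier: each category gets a score from a weighted sum of the museum's characteristics, and the highest score wins. The weights here are hand-picked for illustration and are not fitted to any data; a real SVM would learn them from the training set, and the names `predict_size` and `boundaries` are hypothetical.

```python
def predict_size(features, boundaries):
    """Return the category whose linear score is highest for these features."""
    scores = {
        label: sum(w * x for w, x in zip(weights, features)) + offset
        for label, (weights, offset) in boundaries.items()
    }
    return max(scores, key=scores.get)

# Hypothetical, hand-picked (weights, offset) pairs, NOT learned from data.
# Features are assumed to be (log10 of check-ins, 1 if in London else 0).
boundaries = {
    "small":  ((-1.0, -0.5), 2.0),
    "medium": ((0.0, 0.0), 0.5),
    "large":  ((1.0, 0.5), -2.5),
}

print(predict_size((1.0, 0), boundaries))  # few check-ins, outside London
print(predict_size((2.0, 0), boundaries))  # moderate check-ins
print(predict_size((4.0, 1), boundaries))  # many check-ins, in London
```

With these made-up weights the three example museums fall in the small, medium and large areas respectively. The curved boundaries of the third model cannot be written as a single weighted sum like this.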
As is customary in machine learning, we separate the part of the dataset where we know the visit numbers into two subsets: the training data to develop the model, and the test data to test how it’s performing. Once the test dataset is chosen, no part of it can ever be used to train the model. The test dataset can be thought of as 'future' data points, where we pretend we don’t know the true labels and use them to test the model; in reality, we of course know the labels (i.e. how many visits the museum got, and which of the three groups it fell into) and we evaluate the performance of the machine learning model by comparing the known labels to the estimated labels. For this work we choose a test dataset composed of 81 museums chosen randomly from the sample.
In general we find that the FourSquare check-ins data lets us predict the approximate number of museum visits much better than we otherwise could. The model that performs best of the three in predicting a museum’s size using both the FourSquare and non-FourSquare data is the k-NN with k=15, i.e. each museum’s size is predicted based on the size of the 15 museums which are most similar to it in terms of their known characteristics. This correctly determines 46 (56.79%) of the size categories in the test sample. Interestingly, all the errors are due to the model overestimating the size of the venues: 17 venues are estimated as large when they are smaller in reality; 17 of the small venues are estimated as medium. Here is the full breakdown of the predicted size versus the real size.
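A random hold-out split of this kind can be sketched as follows. The sketch assumes that all 254 museums with known visit numbers are used, which would leave 173 for training; the identifiers are placeholders, and the function name `train_test_split` and the seed are hypothetical choices.

```python
import random

def train_test_split(records, test_size, seed=0):
    """Shuffle the records and hold out `test_size` of them as test data."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = records[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Placeholder identifiers for the museums with known visit numbers
museum_ids = list(range(254))
train_ids, test_ids = train_test_split(museum_ids, test_size=81)
print(len(train_ids), len(test_ids))
```

Fixing the seed matters here: if the split changed between runs, museums from the test set could leak into training and make the model look better than it is.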
Using the FourSquare data has allowed us to predict the general size of the museum much better than by chance - by comparison a prediction based on randomly allocating museums to the three categories according to the share that they constitute of the training sample would on average be right 38% of the time.
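The 38% baseline can be reproduced with a short calculation. If a guesser assigns each category with probability equal to its share, the chance of being right is the sum of the squared shares. The sketch below assumes the training shares are exactly 25%, 50% and 25%, as implied by the quartile definition; the real shares in the training sample may differ slightly, and the function name is hypothetical.

```python
def proportional_guess_accuracy(shares):
    """Expected accuracy when guessing each class with probability equal to its share."""
    # A guess is right when the true class and the guessed class coincide,
    # which happens with probability p * p for a class of share p
    return sum(p * p for p in shares.values())

# Assumed shares, following the bottom-quarter / top-quarter definition
shares = {"small": 0.25, "medium": 0.50, "large": 0.25}

baseline = proportional_guess_accuracy(shares)
model_accuracy = 46 / 81   # correct predictions out of the test museums

print(f"baseline: {baseline:.1%}, model: {model_accuracy:.2%}")
```

Under these assumptions the baseline is 37.5%, which rounds to the 38% quoted above, against 56.79% for the model.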
If we extend the model to assess the distribution of museum visit numbers across all of the accredited museums for which we have FourSquare data but no visits data (997 museums), we find that 201 museums (20%) are predicted to be in the small category, 609 museums (61%) in the medium category and 187 museums (19%) are in the large category.
To test the degree to which the FourSquare data is providing us with additional information, we estimate the models using just the FourSquare data and compare this with the models when the FourSquare data is excluded. Using FourSquare data only, the best model, which is the k-nearest neighbours (k=5), manages to achieve 65.43% accuracy, while the linear SVM has an accuracy of 60.49%. Finally, using non-FourSquare data only, we get a maximum accuracy of 59.25%. The full results are shown below.
This is still quite a simple model and we are not using that much data on the museums themselves. We have admittedly made the task easier, by looking at three size categories, but only four pieces of information are used to classify the museum’s size, which is fairly limited when you consider the diversity of museums. These models were estimated on the data that we currently have, and there is room for improvement by using richer information on other factors that affect the number of visits such as location and transport links, opening times and whether the museum has free entry or not. Another factor that would probably help improve the accuracy is if we had more visits data for the sites where we don’t have this information, increasing the data available to train the model on.
It is early days for the use of new digital data sources in the museums sector, and there is much still to learn. But our analysis shows they have great potential. New data sources based on social media or new sensing technologies are starting to provide fresh opportunities for museums to understand their activities. There are also likely to be increasing incentives for museums to engage with this agenda. As in other sectors, data is becoming more pervasive. Things that previously have been anecdotal can now start to be measured and, in the long-term, organisations that take advantage of this will have a head-start. The museums sector is unlikely to be an exception to this. However, in the short-term, there will be capacity challenges and, given the immaturity of the area, risks to individual museums in engaging. Government can usefully help the sector develop in this area.
 This is not to say that methods don’t exist. For example, a leg counter that uses a beam and divides the count by 2 (i.e. one person = 2 legs) is a simple way for museums which don’t charge for entry to collect information on the number of times people enter or leave a museum building. The use of RFID tagging or Bluetooth beacons may also become more common in future.
 FourSquare was reported last year to have 50 million monthly active users worldwide, and the demographics of its typical users are skewed towards western countries, urban populations and young, well-educated profiles. Although this bias needs to be kept in mind during the analysis, the data has been used to uncover mobility patterns in several projects.
 It is possible for museums to have provisional (as distinct from full) accreditation status where they have demonstrated that they meet the majority of the Accreditation Standard, and are actively resolving any actions required by the accreditation panel.
 This data was obtained by querying the FourSquare Application Programming Interface (API) for the check-ins that corresponded to the museum’s name and address.
 While the range of “medium” museums looks smaller than that of the small museums on the plot, it represents a range of roughly 10^5 - 10^4 = 90,000 visits, compared to roughly 10^4 - 10^2 = 9,900 visits.
 Technically speaking it is an SVM with a radial basis function as a kernel.
 For each of the machine learning algorithms we need to choose a certain set of parameters. For example, for the “k nearest neighbours” (also called k-NN) we need to specify the “k”, i.e. how many points we consider as neighbours. Depending on these parameters, the performance of the algorithm can improve or deteriorate. So, how do we choose the best parameters? To do this we use another technique that is customary in machine learning: cross-validation. Instead of choosing a single set of parameters, we search among the possible values of k (and also try different measures of distance); but rather than training each of these on the original training dataset, we create many “synthetic datasets” within the initial training data, each made up of an actual training set and a validation set. (For example, by randomly partitioning the training dataset, using say 75% of the data for training and the remaining 25% for validation, and doing this repeatedly, we can generate a large number of training sets to estimate the model on - this is akin to bootstrapping in statistics.) By iterating the different parameter choices over these many synthetic datasets, averaging the model performance over all these partitions of the data and choosing the set of parameters that performs best, we can generalise the model’s results and try to minimise the risk that they are too dependent on the initial dataset.
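 The parameter search described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the data are randomly generated, the distance is Euclidean only, and the function names, the number of splits and the candidate values of k are all hypothetical.

```python
import random
from collections import Counter
from math import dist

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points nearest to the query."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def mean_validation_accuracy(points, labels, k, n_splits=20,
                             val_fraction=0.25, seed=0):
    """Average accuracy of k-NN over repeated random 75/25 splits."""
    rng = random.Random(seed)   # same seed, so every k sees the same splits
    indices = list(range(len(points)))
    n_val = max(1, int(len(indices) * val_fraction))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        val, train = indices[:n_val], indices[n_val:]
        train_points = [points[i] for i in train]
        train_labels = [labels[i] for i in train]
        correct = sum(
            knn_predict(train_points, train_labels, points[i], k) == labels[i]
            for i in val
        )
        scores.append(correct / n_val)
    return sum(scores) / len(scores)

def choose_k(points, labels, candidate_ks):
    """Pick the candidate k with the best average validation accuracy."""
    return max(candidate_ks,
               key=lambda k: mean_validation_accuracy(points, labels, k))

# Randomly generated stand-in for the training data (not real museum data)
data_rng = random.Random(1)
points, labels = [], []
for label, centre in [("small", 1.0), ("medium", 2.5), ("large", 4.0)]:
    for _ in range(20):
        points.append((data_rng.gauss(centre, 0.5), data_rng.gauss(0.0, 1.0)))
        labels.append(label)

candidate_ks = [1, 3, 5, 9, 15]
best_k = choose_k(points, labels, candidate_ks)
print(best_k, mean_validation_accuracy(points, labels, best_k))
```

 Only the training data is ever passed to `choose_k`; the held-out test museums play no part in choosing the parameters.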