Kostas Stathoulopoulos | June 1, 2017
Tech meetups provide an excellent way for people from different disciplines to meet, exchange ideas, learn from each other and network. These connections can lead to potential collaborations which impact innovation in an ecosystem. Meetup.com is a platform where people can organise and attend events. In Arloesiadur, we used natural language processing and graph theory to analyse this data source, understand the structure and interests of the Welsh tech community and identify emerging technologies. We answered the following questions:
Meetup was established in 2002. We used its API to collect information on tech groups, organised events and registered users in UK cities only. Specifically, we retrieved data for 2,878 groups, 972,413 users and 38,929 unique events. Although the dataset spans 2006 to 2016, we focused on the period between 2012 and 2016, since there was no Meetup activity in Wales before then. It is worth noting that 80% of the Meetup groups in the UK were created in that timeframe. Meetup groups use tags to specify their areas of focus and we identified 3,725 unique tags. Even though tags provide a granular way of examining a tech community, their large number complicates the recognition of trends and wider networks of collaboration. This is why we aggregated them into higher level topics.
In this blog post, we describe how we created the hierarchy of tags, topics and broad categories. This hierarchy associates Meetup groups with areas of activity to answer the questions above.
We used natural language processing and unsupervised learning to cluster the tags that appeared in a similar context and then assigned Meetup groups to these clusters according to the tags they used.
Initially, we lowercased the tags and separated them using regular expressions. Then, we created N-grams for tags that consisted of more than one word. For instance, the tag “big data” was not split into two words; instead, the bigram “big_data” was created. Finally, we formed artificial sentences, where each sentence contained the preprocessed tags that a Meetup group had been labelled with by its owner.
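The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the code used in Arloesiadur: the group names, tags, function names and the regular expression are all assumptions made for the example.

```python
import re


def preprocess_tag(tag):
    """Lowercase a tag and join its words with underscores.

    Assumption: words are runs of letters and digits, so a tag such as
    "C++" would lose its punctuation with this simple pattern.
    """
    words = re.findall(r"[a-z0-9]+", tag.lower())
    return "_".join(words)


def build_sentences(group_tags):
    """Turn each group's tag list into one 'sentence' of tokens."""
    sentences = {}
    for group, tags in group_tags.items():
        tokens = [preprocess_tag(tag) for tag in tags]
        # Drop tags that became empty after cleaning.
        sentences[group] = [token for token in tokens if token]
    return sentences


# Hypothetical groups and tags, for illustration only.
group_tags = {
    "Cardiff Data Science": ["Big Data", "Machine Learning", "Python"],
    "Swansea Web Devs": ["Web Development", "JavaScript"],
}
sentences = build_sentences(group_tags)
print(sentences["Cardiff Data Science"])
```

Each value in `sentences` is a list of tokens that can be passed to a word embedding model as a single sentence.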
We fed this collection of tokens sentence by sentence to word2vec, a shallow neural network that learns dense vector representations of words. As a result, words that are used in similar contexts appear closer in the vector space in a way that we can measure. The following table provides an example of tokens that were found in the same context.
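Closeness between word vectors is commonly measured with cosine similarity, although the post does not state which measure was used. The sketch below uses made-up three-dimensional vectors, not output from the trained model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(token, vectors):
    """Rank all other tokens by their similarity to `token`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[token], vec))
        for other, vec in vectors.items()
        if other != token
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors invented for this example; real word2vec vectors are much longer.
toy_vectors = {
    "big_data": [0.9, 0.1, 0.3],
    "hadoop": [0.8, 0.2, 0.4],
    "javascript": [0.1, 0.9, 0.2],
}
ranking = most_similar("big_data", toy_vectors)
print(ranking)
```

In this toy setup, "hadoop" ranks above "javascript" for the query "big_data" because its vector points in a more similar direction.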
Even though the trained model can capture the similarity of the given terms, the output vectors lie in a 350-dimensional space, which makes the results hard to interpret and visualise directly. We overcame this issue by using t-SNE, a dimensionality reduction technique that is well suited to projecting the vectors onto a two-dimensional space while keeping similar points close to each other. The outcome can be drawn as a scatterplot where similar points are found closer together than dissimilar ones.
We then used a Gaussian Mixture Model (GMM) to cluster the points of the 2D space by learning the probability distribution of the different sub-populations that exist in the set. Every resulting cluster is a collection of tags and represents a topic that the Meetup groups are active in. Overall, 42 topics were identified; since their number was relatively small, we labelled them manually.
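The projection and clustering steps could look roughly like the sketch below. It assumes scikit-learn and NumPy, which the post does not name, and it runs on synthetic vectors rather than the trained embeddings. The number of components, the perplexity and the random seeds are illustrative choices, not the settings used in Arloesiadur.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for word2vec output: three synthetic groups of 30 vectors each,
# in 350 dimensions to match the size mentioned in the text.
centres = rng.normal(scale=5.0, size=(3, 350))
vectors = np.vstack([c + rng.normal(size=(30, 350)) for c in centres])

# Project the vectors onto two dimensions with t-SNE.
points_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(vectors)

# Fit a Gaussian Mixture Model on the 2D points and assign each point a cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(points_2d)
labels = gmm.predict(points_2d)

# Soft assignments: probability of each point belonging to each cluster.
probabilities = gmm.predict_proba(points_2d)
print(points_2d.shape, labels[:10])
```

In practice the number of mixture components has to be chosen, for example by comparing models with an information criterion such as BIC; the post does not say how the 42 topics were arrived at.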
Not all tags provide insight into the activities of a group. When a tag appears in many groups, it is usually a very broad term that does not help us distinguish between what different Meetup groups do. Tokens such as new_technology, web_development, web_technology and software_development belong to this generic category. We used the TF-IDF (Term Frequency, Inverse Document Frequency) weight to measure the importance of a token in a document, where a document is the set of tags of a Meetup group. Tokens with a high TF-IDF weight can help us characterise the activities of a group, while those with a low TF-IDF weight are found in many documents and are therefore not very helpful for segmenting groups.
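To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency). The post does not state which variant was used, and the groups and tags below are invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for every token in every document.

    `documents` maps a group name to its list of tokens.
    """
    n_docs = len(documents)
    # Document frequency: in how many groups does each token appear?
    doc_freq = Counter(tok for doc in documents.values() for tok in set(doc))
    weights = {}
    for name, doc in documents.items():
        counts = Counter(doc)
        weights[name] = {
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        }
    return weights


# Hypothetical groups, for illustration only.
documents = {
    "group_a": ["web_development", "javascript", "react"],
    "group_b": ["web_development", "python", "machine_learning"],
    "group_c": ["web_development", "big_data", "python"],
}
weights = tf_idf(documents)
print(weights["group_a"])
```

Here web_development appears in every group, so its weight is zero, while javascript appears in a single group and receives the highest weight in group_a.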
On the flip side, when a tag appears in only a handful of groups, it is too rare to help identify broad topics. In Arloesiadur, we removed 1,752 tags that were used by fewer than five groups.
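The rare-tag filter can be sketched as below. The threshold of five groups comes from the text; the function name and the toy data are assumptions made for illustration.

```python
from collections import Counter


def frequent_tags(sentences, min_groups=5):
    """Return the set of tags used by at least `min_groups` groups."""
    # Count each tag once per group, even if a group lists it twice.
    group_counts = Counter(tok for tokens in sentences for tok in set(tokens))
    return {tok for tok, n in group_counts.items() if n >= min_groups}


# Toy data: five groups share two tags, one group adds a rare tag.
sentences = [["python", "big_data"]] * 5 + [["python", "quantum_computing"]]
kept = frequent_tags(sentences)

# Remove the rare tags from every group's sentence.
filtered = [[tok for tok in tokens if tok in kept] for tokens in sentences]
print(sorted(kept))
```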
So far, we have clustered the Meetup tags and selected those that will be used to classify Meetup groups into tech topics. The final part of our analysis was to find the topic weights for each Meetup group. We tackled this by measuring the number of tokens with the same topic label in a group as a share of all its tokens. As a result, every Meetup group is described by a distribution of topics, where each topic is a collection of tags.
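As a rough illustration of the topic weights, the sketch below counts a group's tokens per topic and divides by the number of its tokens that carry a topic label. The tag-to-topic mapping and the topic names are hypothetical, and treating unlabelled tokens as already removed is an assumption.

```python
from collections import Counter


def topic_weights(tokens, tag_to_topic):
    """Share of a group's labelled tokens that fall under each topic."""
    # Assumption: tokens without a topic label were filtered out earlier,
    # so they are ignored here as well.
    topics = [tag_to_topic[tok] for tok in tokens if tok in tag_to_topic]
    if not topics:
        return {}
    counts = Counter(topics)
    return {topic: n / len(topics) for topic, n in counts.items()}


# Hypothetical mapping from tags to manually labelled topics.
tag_to_topic = {
    "python": "Programming languages",
    "javascript": "Programming languages",
    "machine_learning": "Data science",
    "big_data": "Data science",
    "hadoop": "Data science",
}
group_tokens = ["python", "machine_learning", "big_data", "hadoop"]
weights = topic_weights(group_tokens, tag_to_topic)
print(weights)
```

The weights of a group sum to one, so they can be read as a distribution over topics.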
Some things to be aware of:
Since word2vec has to be retrained, the vector representation of the tags will differ. Therefore, dimensionality reduction with t-SNE and clustering with GMM will have to be trained and optimised again.