Creating topics for the Meetup groups

Kostas Stathoulopoulos | June 1, 2017

Tech meetups provide an excellent way for people from different disciplines to meet, exchange ideas, learn from each other and network. These connections can lead to potential collaborations which impact innovation in an ecosystem. Meetup.com is a platform where people can organise and attend events. In Arloesiadur, we used natural language processing and graph theory to analyse this data source, understand the structure and interests of the Welsh tech community and identify emerging technologies. We answered the following questions:

  • What are the tech networking trends in Wales?
  • What is the structure of the tech community network in Wales?
  • How have these tech networks changed over time?
  • What are the connections between tech communities in Wales and other parts of the UK?

Data Collection

Meetup was established in 2002. We used its API to collect information on tech groups, organised events and registered users in UK cities only. Specifically, we retrieved data for 2,878 groups, 972,413 users and 38,929 unique events. Although the dataset covers 2006 to 2016, we focused on the period between 2012 and 2016 because Wales had no activity on the platform before then. It is worth noting that 80% of the Meetup groups in the UK were created in that period. Meetup groups use tags to specify their areas of focus and we identified 3,725 unique tags. Although tags provide a granular way of examining a tech community, their large number makes it harder to recognise trends and wider networks of collaboration. This is why we aggregated them into higher-level topics.

In this blog, we describe how we created the hierarchy of tags, topics and broad categories. This hierarchy associates Meetup groups with areas of activity to answer the questions above.

Data preparation

We used natural language processing and unsupervised learning to cluster the tags that appeared in a similar context and then assigned Meetup groups to these clusters according to the tags they used.

First, we lowercased the tags and separated them using regular expressions. Then, we created n-grams for tags that consisted of more than one word. For instance, the tag “big data” was not split into two words; instead, the bigram “big_data” was created. Finally, we formed artificial sentences, where each sentence contained the preprocessed tags that a Meetup group had been labelled with by its owner.
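The preprocessing steps above could be sketched as follows. This is a minimal illustration rather than the code used in Arloesiadur: the function names, the regular expression and the sample groups are all assumptions.

```python
import re

def tag_to_token(tag):
    """Lowercase a tag and join its words with underscores,
    so that a multi-word tag becomes a single n-gram token."""
    # Assumed rule: words are separated by whitespace, hyphens or slashes.
    words = re.split(r"[\s\-/]+", tag.strip().lower())
    return "_".join(word for word in words if word)

def group_to_sentence(tags):
    """Build one artificial 'sentence' (a list of tokens) from a group's tags."""
    tokens = [tag_to_token(tag) for tag in tags]
    return [token for token in tokens if token]  # drop empty tags

# Hypothetical groups and tags, for illustration only.
groups = {
    "group_a": ["Big Data", "Machine Learning", "Python"],
    "group_b": ["Web Development", "JavaScript"],
}
sentences = [group_to_sentence(tags) for tags in groups.values()]
print(sentences)
```

Each inner list plays the role of one sentence, so tags used by the same group share a context when the sentences are passed to a model such as word2vec.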

From tags to topics

We fed this collection of tokens, sentence by sentence, to word2vec, a shallow neural network that learns a dense vector representation for each word. As a result, words that are used in similar contexts appear closer together in the vector space, in a way that we can measure. The following table provides an example of tokens that were found in the same context.
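As an illustration of how closeness in the vector space can be measured, the sketch below ranks tokens by cosine similarity, a common choice for word vectors. It is not confirmed to be the measure used in this analysis, and the three-dimensional vectors are made up for the example rather than taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(token, vectors):
    """Rank every other token by its similarity to `token`, closest first."""
    scores = [
        (other, cosine_similarity(vectors[token], vec))
        for other, vec in vectors.items()
        if other != token
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors; real word2vec vectors have hundreds of dimensions.
vectors = {
    "big_data": [0.9, 0.1, 0.3],
    "hadoop": [0.8, 0.2, 0.4],
    "ux_design": [-0.5, 0.9, 0.0],
}
for other, score in most_similar("big_data", vectors):
    print(other, round(score, 3))
```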

<table>

Although the trained model captures the similarity between the given terms, the output vectors live in a 350-dimensional space, which makes the results hard to interpret and impossible to visualise directly. We overcame this issue by using t-SNE, a dimensionality reduction technique that is well suited to projecting the vectors onto a two-dimensional space while keeping similar points close to each other. The outcome can be plotted as a scatterplot in which similar points lie closer together than dissimilar ones.

We then used a Gaussian Mixture Model (GMM) to cluster the points of the 2D space by learning the probability distribution of the different sub-populations that exist in the set. Every resulting cluster is a collection of tags that represents a topic the Meetup groups are active in. Overall, 42 topics were identified and, since their number was relatively small, labelled manually.

Two dimensional representation of the Meetup tags. Tags with the same colour belong to the same cluster, all of which were manually labelled. In Arloesiadur, the 42 identified topics were manually aggregated to 9 broad categories to reduce the complexity of the interactive visualisations.

Assigning Meetup groups to topics

Not all tags provide insight into the activities of a group. When a tag appears in many groups, it is usually a very broad term that does not help us distinguish between what different Meetup groups do. Tokens such as new_technology, web_development, web_technology and software_development belong to this generic category. We used the TF-IDF (Term Frequency, Inverse Document Frequency) weight to measure the importance of a token in a document, where a document is the set of tags of a Meetup group. Tokens with a high TF-IDF weight help us characterise the activities of a group, while those with a low TF-IDF weight are found in many documents and are therefore not very helpful for segmenting groups.
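The sketch below shows one common TF-IDF variant, with term frequency as a share of the document and the inverse document frequency as log(N / df). The exact formula used in the analysis is not stated here, so this is an assumption, and the sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: weight} dict per document.

    tf  = occurrences of the token / number of tokens in the document
    idf = log(number of documents / number of documents with the token)
    """
    n_docs = len(documents)
    # Count each token once per document to get its document frequency.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights

# Hypothetical documents: each one is the token list of a Meetup group.
documents = [
    ["web_development", "javascript", "react"],
    ["web_development", "python", "django"],
    ["web_development", "big_data", "hadoop"],
]
for doc_weights in tf_idf(documents):
    print(doc_weights)
```

In this toy example web_development appears in every document, so its weight is zero, while the tokens that are specific to one group receive a positive weight.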

On the flip side, when a tag appears in only a handful of groups, it is considered very rare and is not helpful in identifying broad topics. In Arloesiadur, we removed 1,752 tags that were used by fewer than five groups.
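Filtering rare tags by the number of groups that use them could look like the sketch below. The function name and the sample data are hypothetical; only the threshold of five groups comes from the text.

```python
from collections import Counter

def frequent_tokens(sentences, min_groups=5):
    """Return the tokens that are used by at least `min_groups` groups."""
    # set(sentence) ensures a token is counted once per group.
    group_counts = Counter(
        token for sentence in sentences for token in set(sentence)
    )
    return {token for token, n in group_counts.items() if n >= min_groups}

# Hypothetical token lists, one per Meetup group.
sentences = [
    ["python", "data_science"],
    ["python", "django"],
    ["python"],
    ["python", "flask"],
    ["python", "pandas"],
    ["data_science", "quantum_computing"],
]
kept = frequent_tokens(sentences, min_groups=5)
filtered = [[t for t in sentence if t in kept] for sentence in sentences]
print(kept)
print(filtered)
```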

So far, we have clustered the Meetup tags and selected those that are used to classify Meetup groups into tech topics. The final part of our analysis was to find the topic weights for each Meetup group. We did this by measuring the number of a group's tokens that share a topic label as a share of all its tokens. As a result, every Meetup group is described by a distribution over topics, where each topic is a collection of tags.
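A minimal sketch of the topic weight calculation is shown below. The topic names and the token-to-topic mapping are hypothetical, and the sketch assumes that "all its tokens" means the tokens that survived the selection steps above, which the text does not spell out.

```python
from collections import Counter

def topic_weights(tokens, token_to_topic):
    """Share of a group's labelled tokens that falls under each topic."""
    # Assumption: tokens without a topic label were removed earlier
    # and do not count towards the denominator.
    topics = [token_to_topic[t] for t in tokens if t in token_to_topic]
    if not topics:
        return {}
    counts = Counter(topics)
    return {topic: n / len(topics) for topic, n in counts.items()}

# Hypothetical mapping from tokens to manually labelled topics.
token_to_topic = {
    "big_data": "Data",
    "hadoop": "Data",
    "machine_learning": "Data",
    "javascript": "Web",
}
group_tokens = ["big_data", "hadoop", "machine_learning", "javascript"]
print(topic_weights(group_tokens, token_to_topic))
```

The weights of a group sum to one, so each group can be read as a distribution over topics.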

Some things to be aware of:

  • Word2vec was trained using the tags of all the UK Meetup groups. We did this because the model scales very well with the size of the dataset.
  • We ignored the tags that appeared in fewer than five groups during the training of word2vec. Specifically, we set the min_count hyperparameter, which makes the model ignore all words that occur fewer times than its value, to five.
  • The number of clusters in GMM was selected using Bayesian Information Criterion (BIC) and silhouette score.
  • Initially, the clusters of tags do not have a label. Even though there is a rich literature on how to tackle topic labelling, we decided to label them manually because their number was small.
  • In Fig. 1, there are three clusters where we could not find a shared label. We removed them, and their tags, from the analysis.
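For reference, the Bayesian Information Criterion mentioned above is usually defined as follows, where lower values indicate a better trade-off between fit and model complexity. The parameter count assumes a two-dimensional GMM with a full covariance matrix per component; the covariance structure used in the analysis is not stated here, so that part is an assumption.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
% n       : number of points being clustered (here, the projected tags)
% \hat{L} : maximised likelihood of the fitted model
% k       : number of free parameters

% Assumed case: K components in 2 dimensions, full covariance matrices.
% Each component has 2 mean parameters and 3 covariance parameters,
% and the K mixing weights sum to one:
k = K\,(2 + 3) + (K - 1) = 6K - 1
```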

Limitations

  • Word2vec is trained on a fixed vocabulary of tags, which captures how they were used at the time of training. To add new tags to the vocabulary, or to reflect changes in how often tags are used, the model has to be retrained.

  • Once word2vec is retrained, the vector representation of the tags will differ. Therefore, the dimensionality reduction with t-SNE and the clustering with GMM will have to be fitted and optimised again.

 
