A Shiny app for Exploratory Qualitative Analysis

This post shares my first R Shiny app, which grew out of a qualitative analysis project I carried out for the Office of the Police and Crime Commissioner (OPCC), Norfolk, UK. The app lets users either load one of the sample corpora shipped with it or upload a custom corpus. It then enables users to:

  • Check word distributions from corpora
  • Cluster documents and perform Topic modelling
  • Cluster words and produce word clouds
  • Generate networks of words

The following sections elaborate on the different aspects of this app.

Phase Selection & The User Guide

userGuide

The Phase Selection drop-down list lets users choose which phase of the qualitative analysis they want to enter. Although the visible tabs may naturally prompt users to pick a phase by clicking on them, they are not meant for this purpose. The tabs only keep the different phases of the app separate, while the Phase Selection list is how a phase is actually selected.

phaseSelection

Importing a Corpus

importingCorpus

Users have the option of uploading either their custom corpus (which may be a single text file with a new line as a delimiter, or a collection of similar text files), or a sample corpus. Currently, three sample corpora are available for users:

  1. UAE Expat Forum
  2. UAE Trip Advisor
  3. Middle East Politics

These corpora were built by scraping discussion forums. The code for scraping these websites can be found here. For the purpose of demonstration, we shall select the Middle East Politics data set.

After selection, clicking on the ‘Upload Corpus’ button uploads the selected corpus, with a message being displayed in the right panel.
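
The two custom-corpus layouts described above can be sketched as follows. The app itself is written in R, so this Python snippet is only an illustration of the idea, not the app's code. The function name `load_corpus` is hypothetical, and treating each file in a directory as one document is an assumption of this sketch.

```python
from pathlib import Path
import tempfile


def load_corpus(path):
    """Return a list of documents from a .txt file or a directory of .txt files.

    Assumptions of this sketch: in a single file every non-empty line is
    one document; in a directory every .txt file is one document.
    """
    p = Path(path)
    if p.is_dir():
        return [f.read_text(encoding="utf-8") for f in sorted(p.glob("*.txt"))]
    lines = p.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line.strip()]


# Usage with a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    single = Path(tmp) / "corpus.txt"
    single.write_text("first document\nsecond document\n", encoding="utf-8")
    docs_from_file = load_corpus(single)  # two documents, one per line
    docs_from_dir = load_corpus(tmp)      # one document, the whole file
```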

Pre-processing

preProcessing

The Pre-processing phase enables users to perform several basic pre-processing procedures, such as removal of punctuation, numbers, and stopwords. The app also accepts custom stopwords and custom thesauri, each entered as a comma-separated list of words.

The ‘Apply Pre-processing’ button applies the selected pre-processing procedures to our corpus data.
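
As a rough illustration of these pre-processing steps, here is a minimal Python sketch (the app itself does this in R). The helper names `parse_comma_list` and `preprocess` are hypothetical, the stopword list is a tiny made-up one, and modelling a thesaurus as a synonym-to-canonical-term mapping is an assumption about how such a list might be used.

```python
import re
import string


def parse_comma_list(text):
    """Split a comma-separated field such as 'norfolk, uk' into clean terms."""
    return [t.strip().lower() for t in text.split(",") if t.strip()]


def preprocess(doc, stopwords, thesaurus=None):
    """Lower-case, drop punctuation and digits, map synonyms, remove stopwords."""
    thesaurus = thesaurus or {}
    doc = doc.lower()
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    doc = re.sub(r"\d+", " ", doc)
    tokens = [thesaurus.get(tok, tok) for tok in doc.split()]
    return [tok for tok in tokens if tok not in stopwords]


# Illustrative stopwords plus user-supplied extras from a comma-separated field.
stopwords = {"the", "a", "in", "of"} | set(parse_comma_list("norfolk, uk"))
# Assumed thesaurus shape: synonym -> canonical term.
thesaurus = {"cops": "police", "officers": "police"}

tokens = preprocess("The 2 cops in Norfolk, UK!", stopwords, thesaurus)
```

Here only `police` survives: the number and punctuation are stripped, `cops` is mapped to `police`, and the remaining words are stopwords.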

Feature Generation

featureGeneration

The Feature Generation phase is provided for users to select weighting criteria for words, documents, and a normalisation scheme. Once selected, the ‘Generate Features’ button applies the selected criteria to words and documents.

For this demonstration, I have chosen the TF scheme, with no normalisation.

Feature Selection

featureSelection

The Feature Selection phase is an integral part of the app (given how it enables us to cut down on memory being used – memory limitations apply to this app, of course). Users may determine a lower bound for word frequency, or set the allowed sparsity level. The ‘Select Features’ button calculates the new Term Document matrix. We shall set our frequency lower bound to 33, and leave the default value for sparsity level.
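
The two filters can be pictured with a small Python sketch (not the app's R code). The names `term_document_counts` and `select_features` are hypothetical, and defining sparsity as the share of documents in which a term does not occur is the assumption made here.

```python
from collections import Counter


def term_document_counts(tokenised_docs):
    """Map each term to its per-document counts: {term: [doc0, doc1, ...]}."""
    per_doc = [Counter(doc) for doc in tokenised_docs]
    vocab = sorted(set().union(*per_doc))
    return {term: [c[term] for c in per_doc] for term in vocab}


def select_features(tdm, min_freq=1, max_sparsity=1.0):
    """Keep terms whose total count is at least min_freq and whose share
    of zero entries across documents is at most max_sparsity."""
    kept = {}
    for term, counts in tdm.items():
        sparsity = counts.count(0) / len(counts)
        if sum(counts) >= min_freq and sparsity <= max_sparsity:
            kept[term] = counts
    return kept


docs = [["police", "crime", "police"], ["crime", "survey"], ["police", "crime"]]
tdm = term_document_counts(docs)
selected = select_features(tdm, min_freq=3, max_sparsity=0.5)
```

With these toy documents, `survey` is dropped because it occurs only once, while `crime` and `police` pass both filters. Fewer rows in the matrix is what saves memory in the later phases.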

Initial Analysis

initialAnalysis

In the Initial Analysis phase, users have the option to draw a Rank-Frequency plot or a Word-Frequency plot, both of which are downloadable as PDF files.

Cluster Documents

clusterDocs

To cluster documents, users select a lower bound for word frequency, the number of documents they want to identify in the clustering procedure, and the number of topics they wish to identify from the clustered documents. The resulting graph is downloadable as a PDF file.

Clustering Words

clusterWords

To cluster words, users select the quantile for word frequency (which serves the same purpose as selecting a lower bound for word frequency), the number of groups they want to form, and finally one of the two force-directed graph-drawing algorithms. This graph can also be downloaded as a PDF file.

In addition to the above, users may also create a dendrogram of words, from relevant options made available in the app.

Words Networks

wordsNetwork

This phase can be used to identify relations between words: the association of words with each other is represented by connecting lines, as in a network. To generate a network of words, users first select the quantile for word frequency, just as in the Word Clustering phase, and then select one of the two force-directed graph-drawing algorithms. The ‘Generate Words Network’ button plots the graph, which can also be downloaded as a PDF file.

This is also the most memory-intensive phase. I realise that, at times, the plotted network may not be intelligible. I hope to improve it soon.


That completes the walk-through of this app! The app can be accessed from this link.

I would be glad to hear any criticism, or your thoughts in general, about this app.

Many thanks for your time reading through this post!

Exploratory Qualitative Analysis — a Shiny App

Background

While studying for my MSc in Knowledge Discovery and Data Mining, I was fortunate to work on a project for the Office of the Police and Crime Commissioner, Norfolk, UK. The project involved analysis of textual data, which was generated from a local policing survey from the relatively serene county of Norfolk.
Towards the end of the project, I considered how I could shift the power of R and Qualitative analyses into the hands of the OPCC, Norfolk. I realised that the OPCC were neither Programmers nor Statisticians, so expecting them to master R and Statistical concepts was unfair. At this point, I stumbled upon the Shiny package from R, and decided to use it and transform most of my analyses into an interactive web application for use in future surveys.

The Application
The present application is a modified version of the original developed for the OPCC, Norfolk. It contains the following functionalities:

1. Importing custom corpus
Users may import their own corpus, provided it consists of .txt files: either a single .txt file or a directory of several .txt files.

2. Importing sample corpus
Users not having a corpus may like to use any of the three corpora included with the application. These include i) the UAE Expat forum corpus, ii) the UAE TripAdvisor corpus, and iii) the Middle East Politics’ forum corpus. The code used in scraping these can be accessed by clicking here.

3. Importing custom stopwords and thesauri
In addition to the stopwords shipped with the tm package in R, users may include their own corpus-specific stopwords or thesauri into the pre-processing phase.

4. SMART term-weighting scheme
As provided in the tm package, the SMART notation used in assigning importance to words in our corpus can be utilised. For example, users may wish to assign importance to words simply based on their frequency of occurrence; the terms occurring most often in the corpus would have the highest importance. Alternatively, users may wish to assign greater importance to words that occur rarely in the corpus. In such instances, the SMART weighting notation can be used to achieve our objectives.
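To make the two weighting choices concrete, here is a minimal Python sketch of a small subset of the SMART notation (the app relies on the tm package in R for this; the function `smart_weight` below is hypothetical and is not tm's implementation). A three-letter code selects the term-frequency, document-frequency, and normalisation components; only raw counts, an optional idf factor, and optional cosine normalisation are covered, and the base-2 logarithm is a choice of this sketch.

```python
import math


def smart_weight(tdm, spec="nnn"):
    """Weight a {term: [count per document]} mapping.

    Supported codes in this sketch:
      term frequency     'n' raw count
      document frequency 'n' none, 't' idf = log2(N / df)
      normalisation      'n' none, 'c' cosine (unit-length documents)
    """
    tf_code, df_code, norm_code = spec
    if tf_code != "n" or df_code not in ("n", "t") or norm_code not in ("n", "c"):
        raise ValueError(f"unsupported spec in this sketch: {spec}")
    n_docs = len(next(iter(tdm.values())))
    weighted = {}
    for term, counts in tdm.items():
        df = sum(1 for c in counts if c > 0)  # documents containing the term
        idf = math.log2(n_docs / df) if df_code == "t" and df else 1.0
        weighted[term] = [c * idf for c in counts]
    if norm_code == "c":
        for j in range(n_docs):
            length = math.sqrt(sum(w[j] ** 2 for w in weighted.values()))
            if length:
                for w in weighted.values():
                    w[j] /= length
    return weighted


tdm = {"crime": [1, 1, 1], "police": [2, 0, 1], "survey": [0, 1, 0]}
by_frequency = smart_weight(tdm, "nnn")  # frequent terms score highest
by_rarity = smart_weight(tdm, "ntn")     # rare terms are boosted by idf
```

Under `nnn` the weights are just the counts. Under `ntn` the word `crime`, which appears in every document, drops to zero, while `survey`, which appears in only one, receives the largest idf factor.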

5. Calculating Word Frequency Distributions
Users may wish to see whether the word frequency distribution of their corpus follows Zipf’s law, or simply to record the most frequent words in the corpus. For these purposes, the Rank-Frequency and Word-Frequency plots are provided, which may also be downloaded by users.
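The numbers behind a Rank-Frequency plot can be sketched in a few lines of Python (an illustration only; the app produces its plots in R). The function names are hypothetical. Zipf's law says that frequency is roughly proportional to 1/rank, so on log-log axes the points fall near a straight line with a slope close to -1.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, term, frequency) rows, most frequent term first."""
    ranked = Counter(tokens).most_common()
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]


def zipf_slope(rows):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in rows]
    ys = [math.log(freq) for _, _, freq in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic tokens whose frequencies are exactly 12 / rank: 12, 6, 4, 3.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
rows = rank_frequency(tokens)
slope = zipf_slope(rows)
```

For this synthetic input the slope is -1 up to floating-point error; a real corpus will only approximate it.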

6. Clustering Documents
It is often the case that in a corpus, certain documents can be identified as belonging to the same topic, while others belong to a different one. It can be quite helpful to identify such groups of documents, as we can then differentiate and assign the documents to their respective groups and get a rough idea of our corpus.
In the present application, such clustering can be performed by making use of words and their frequencies. That is, it is assumed that documents containing the same words belong to the same group. Thus, the presence, absence, and frequency of words are taken into account to form clusters of documents.

The resulting plot is downloadable.
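
The assumption that documents sharing the same words belong together can be illustrated with a small Python sketch. This is not the clustering algorithm used in the app, which is not described in this post; the sketch swaps in cosine similarity on word counts with a simple average-linkage merge, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_documents(tokenised_docs, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    vectors = [Counter(doc) for doc in tokenised_docs]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


docs = [
    ["police", "patrol", "police"],
    ["hotel", "beach", "holiday"],
    ["police", "patrol", "crime"],
    ["beach", "holiday", "flight"],
]
groups = cluster_documents(docs, k=2)
```

The two policing documents (indices 0 and 2) end up in one group and the two travel documents (indices 1 and 3) in the other, purely on the basis of shared words and their counts.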

7. Clustering Words
Similar to clustering documents, we may also wish to identify groups of words by assuming that words which occur together — that is, occur in similar contexts — belong to the same group.
This application supports two forms of clustering, i) Partitional, and ii) Hierarchical, where the former identifies ‘flat’ groups and the latter identifies hierarchies of groups.

The graphs produced for these purposes (Associative Word cloud and the Dendrogram) are downloadable.
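
To show what a hierarchy of word groups looks like in data form, here is a minimal Python sketch of the hierarchical variant. It is an illustration under stated assumptions rather than the app's R implementation: each word is described by the set of documents it occurs in, similarity is the Jaccard index, merging is single-linkage, and the function names are hypothetical. The recorded merge history is the information a dendrogram draws.

```python
def jaccard(a, b):
    """Share of the documents containing either word that contain both."""
    return len(a & b) / len(a | b) if a | b else 0.0


def word_dendrogram(tokenised_docs):
    """Single-linkage agglomerative clustering of words.

    Returns the merge history as (left_words, right_words, similarity).
    """
    occurs = {}
    for doc_id, doc in enumerate(tokenised_docs):
        for word in doc:
            occurs.setdefault(word, set()).add(doc_id)

    clusters = [(word,) for word in sorted(occurs)]
    merges = []

    def similarity(c1, c2):
        return max(jaccard(occurs[u], occurs[v]) for u in c1 for v in c2)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j], similarity(clusters[i], clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


docs = [
    ["police", "patrol"],
    ["police", "patrol", "crime"],
    ["beach", "holiday"],
    ["beach", "holiday"],
]
merges = word_dendrogram(docs)
```

Reading the history from the top gives the hierarchy; stopping before the final merge, whose similarity is 0.0, leaves two flat groups: the travel words and the policing words.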

8. A Words’ network: an experimental plot
As an experiment, I have included another graph in the application: the Words’ network. This graph only seeks to complement the Associative word cloud from the ‘Clustering Words’ section, in that it links words to each other and assigns colours to the links based on their weights. The greater the number of times a word occurs with another, the darker its link to that word.
I realise that such graphs can sometimes be confusing due to the fact that each word would have some link with another. However, in my personal analyses, I have witnessed cases where disparate groups of words were identified. This is the reason why I take the opportunity to include the Words’ network graph in this application.

The Words’ network graph is also downloadable.
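
The link weights behind such a network can be sketched as follows in Python (an illustration, not the app's R code). Counting co-occurrence at the document level and scaling shades linearly by the heaviest link are both assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tokenised_docs):
    """Count, for every pair of distinct words, the documents containing both."""
    edges = Counter()
    for doc in tokenised_docs:
        for u, v in combinations(sorted(set(doc)), 2):
            edges[(u, v)] += 1
    return edges


def edge_shades(edges):
    """Scale weights into (0, 1]; 1.0 stands for the darkest link."""
    if not edges:
        return {}
    heaviest = max(edges.values())
    return {pair: weight / heaviest for pair, weight in edges.items()}


docs = [
    ["police", "crime"],
    ["police", "crime", "survey"],
    ["survey", "police"],
]
edges = cooccurrence_edges(docs)
shades = edge_shades(edges)
```

In this toy corpus `police` co-occurs twice with both `crime` and `survey`, so those links get the darkest shade, while the single `crime` and `survey` co-occurrence gets a shade half as dark.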


Presently, I am endeavouring to push the Shiny App to the Shiny Server, a great initiative by the RStudio team to enable useRs to host their Shiny Apps.
As soon as I succeed, I shall share the link to the online App here. In the meantime, I share the GitHub page for the App, and hope that useRs find it interesting.

Cheers!