Text analysis with NVivo

Analyse the text of a set of documents, using company reports as an example

Specialist Library Support
Oct 3, 2019

Please note this resource is no longer being updated

NVivo is a pioneering mixed methods data analysis tool with text analysis capabilities available for English, Spanish, French, German, Portuguese, Japanese, and Chinese Simplified. There are several different versions of NVivo, which vary slightly in their focus and functionality. In this post we will explore NVivo’s text analysis capabilities using NVivo 12 Plus, working with an example data set comprising 15 British Petroleum (BP plc) annual reports extracted from the Thomson ONE financial database. Throughout, our analysis will adopt a quantitative approach, looking at word frequency, word co-occurrence and the clustering of documents based on word similarity.

  1. The ‘Explore’ tab
  2. Data
  3. Analysis
  4. Glossary
  5. Summary

1. The ‘Explore’ tab

The ‘Explore’ tab menu

The ‘Explore’ tab in NVivo includes a wide range of tools designed to help you interrogate your data set.

NVivo can be used to manage different types of data including structured and unstructured data. It is a powerful tool which enables you to store and organise data by different categories such as theme, sentiment or attributes. Once organised, these data sets can be analysed using the various functions available in NVivo such as clustering analysis, thematic reviews and network analysis. The software also allows for easy cross-tabulation of mixed methods data, visualisation of word occurrence and co-occurrence, and the creation of mind maps.

2. Data

NVivo allows you to manage data from a wide range of different sources such as social media analytics, surveys, spreadsheets, images, text and video data.

The ‘Import’ tab menu
The ‘Import’ tab allows you to import from many different sources.

In this post we will explore how to analyse text data in the form of PDF files. Assuming the relevant documents are stored in one folder, import your data using the following steps:

  1. Open the application and click on ‘NVivo > Blank Project’. Then fill in the fields ‘Title’ and ‘Description of project’.
  2. Click on ‘Import > Files’ then ‘Folder of choice’. Highlight all relevant files then choose ‘Open > Import’.

These imported files can be classified by theme, month of publication or author.

3. Analysis

If you are not using NVivo, you might have to “clean” the data before analysing the text. This means removing unnecessary characters such as punctuation and trailing whitespace, removing stop words (words that do not provide insight into the body of text, such as the/and/or) and changing everything to upper case or lower case. NVivo automatically performs text cleaning when queries are run; however, it does this using a default set of rules for what constitutes an “unnecessary” character. You can review the default stop words and modify them to suit your preference.
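
For readers working outside NVivo, the cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the stop word list and the function name `clean_text` are assumptions made for this example, and they do not reproduce NVivo's own default rules.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# NVivo's default list is longer and is not reproduced here.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation and drop stop words."""
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    lowered = text.lower().translate(table)
    # split() with no arguments also discards leading and trailing whitespace.
    return [word for word in lowered.split() if word not in stop_words]

sample = "  The climate is changing, and the cost of energy is rising.  "
print(clean_text(sample))
# prints ['climate', 'changing', 'cost', 'energy', 'rising']
```

Returning a list of words rather than a single string makes the output easy to pass on to counting steps later in an analysis.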

Analysing word occurrence

NVivo allows you to search for and analyse the occurrences of specific words within the documents you upload. It allows you to observe trends in the occurrence of these words and consider contextual information such as the words which are commonly used in close proximity. For example, to identify BP’s engagement with the issue of climate change, the occurrence of the word ‘climate’ can be checked in all 15 PDFs.

Use the steps below to perform a word occurrence analysis (red boxes show these options in the following images):

  1. Click on ‘Explore > Text Search > Selected Items > Select All’.
  2. In the ‘Search for’ field, type ‘climate’, then select ‘With synonyms’ and click ‘Run Query’.

There are several ways to display the results of your query, which you can move between using the vertical tabs on the bottom-right (marked with a black box in the following image). The results can be viewed as a ‘Summary’ table, as a ‘Reference’ list (showing all sentences the word appears in), as a ‘PDF’ (showing all pages and locations where the word appears) and as a ‘Word Tree’ (a visualisation of the searched word and the words that occur before and after it).

Specific word and its synonyms
Word tree of results
Search for occurrences of the word ‘climate’ in all documents. Count the occurrences and display them in a word tree.

The summary table shows that the word ‘climate’ appeared in 9 files with a coverage of 0.019%, which indicates a low rate of occurrence. The word tree shows the co-occurrence of ‘climate’ with other words and sentences. As expected, it often co-occurs with the word ‘change’.
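
The idea behind the text search and word tree can be sketched outside NVivo as a simple keyword-in-context search. The example below is an assumption-laden illustration: the document names and texts are hypothetical stand-ins for text extracted from the reports, the function `find_in_context` is invented for this sketch, and it matches the exact word only (it does not replicate NVivo's ‘With synonyms’ option or its coverage calculation).

```python
import re

def find_in_context(text, keyword, window=3):
    """Return (words_before, keyword, words_after) for each exact match."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            hits.append((" ".join(before), word, " ".join(after)))
    return hits

# Hypothetical stand-ins for the text extracted from two reports.
documents = {
    "report_a": "We recognise that climate change is a challenge. "
                "Our climate policy is clear.",
    "report_b": "Revenue rose by five million this year.",
}

for name, text in documents.items():
    hits = find_in_context(text, "climate", window=2)
    print(name, len(hits))
    for before, word, after in hits:
        print(f"  {before} [{word}] {after}")
```

Each hit lists the words immediately before and after the keyword, which is the same information a word tree arranges visually.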

Analysing word frequency

NVivo also allows you to identify the most commonly occurring words in a set of documents. These results can be explored further in order to identify themes across a collection of files or to compare the use of terms within different files in the collection. Use the steps below to analyse the data set for word frequency (red boxes show these options in the following images):

  1. Click on ‘Explore > Word Frequency > Selected Items > Select All’.
  2. In the ‘Display words’ field, use the default value of 1000 most frequent (this can be changed as required).
  3. For ‘Grouping’, select ‘With stemmed words’ (this reduces words to their root form) then ‘Run Query’.
Display 1000 most used words, stemmed
Word cloud of most used words
NVivo allows you to visualise the most common words in a word cloud or tree map.

The different ways to view the results of a word frequency query can be selected from a vertical list of tabs in the bottom-right of the screen (marked with a black box in the previous image). The results can be viewed as a ‘Summary’ table, ‘Word Cloud’, ‘Tree Map’ and by ‘Cluster Analysis’. The summary and word cloud show that ‘million’ is the most frequently occurring word.
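
A word frequency count with stemming can be approximated in Python as shown below. This is a rough sketch under stated assumptions: `crude_stem` is a hypothetical suffix-stripping helper written for this example and is far simpler than the stemming NVivo applies, and the sample sentence is invented.

```python
from collections import Counter
import re

def crude_stem(word):
    """Very rough stemmer that strips a few common English suffixes.

    A hypothetical simplification for illustration, not NVivo's algorithm.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def word_frequencies(text, top_n=1000):
    """Count stemmed words and return the top_n most frequent."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_stem(w) for w in words)
    return counts.most_common(top_n)

sample = ("Costs increased. The cost of drilling increased "
          "while drilling costs fell.")
for stem, count in word_frequencies(sample, top_n=3):
    print(stem, count)
```

Because ‘cost’ and ‘costs’ reduce to the same stem, they are counted together, which is the effect the stemmed grouping option is intended to achieve.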

Cluster analysis

NVivo’s cluster analysis function allows you to explore how themes vary across the files in a collection, based on word similarity, using an unsupervised approach. Note that cluster analysis can also be performed on words (i.e. clustering co-occurring words) using the word frequency query.

The clustering of documents is done by calculating the word similarity across the files.

  1. Click on ‘Explore > Cluster Analysis > Files, Externals & Memos > Select > Select All’.
  2. For ‘Clustered by’ and ‘Using similarity metric’, select ‘Word similarity’ and ‘Pearson correlation coefficient’ respectively. Word similarity is used because this is an unsupervised analysis in which no coding has been done. Then click on ‘Finish’.
Items clustered by similarity, tree structure
Items clustered by similarity, graphical view
Clustering by word similarity results can be visualised in different ways.

The software’s default number of clusters is ten. The ideal number can be identified using various methods. Here, we have visualised the documents clustered by word similarity, and the 2D cluster map above suggests that the documents fall into three groups. (Details of clustering techniques are beyond the scope of this post.)
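
To give a sense of what a word similarity measure based on the Pearson correlation coefficient involves, the sketch below computes pairwise correlations between simple word-count vectors. It is an illustration under assumptions, not a reproduction of NVivo's procedure: the way the vectors are built, the function names and the three short sample documents are all hypothetical, and no clustering step is performed on the resulting similarities.

```python
from collections import Counter
from math import sqrt
import re

def word_counts(text):
    """Count lower-cased words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)

def similarity_matrix(documents):
    """Pairwise Pearson correlation over a shared vocabulary."""
    counts = {name: word_counts(text) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {name: [c[w] for w in vocabulary] for name, c in counts.items()}
    names = list(documents)
    return {(a, b): pearson(vectors[a], vectors[b])
            for a in names for b in names}

# Hypothetical miniature "documents" for illustration.
documents = {
    "doc1": "oil production cost rose",
    "doc2": "oil production cost fell",
    "doc3": "climate change policy review",
}
similarities = similarity_matrix(documents)
print(round(similarities[("doc1", "doc2")], 2))  # 0.55
print(round(similarities[("doc1", "doc3")], 2))  # -0.8
```

Documents that share many words receive a higher coefficient than documents with no words in common, and a clustering algorithm can then group documents using these pairwise values.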

These clusters can then be further explored using additional word frequency queries. The word clouds of the three clusters show that cluster 2 is one document with a collection of numbers and symbols. Manually inspecting the document, it is found that this was not written in English. Cluster 3 is different from cluster 1 as it focuses on ‘cost’.

(from left to right) Cluster 1, 2 and 3

4. Glossary

  1. Structured and unstructured data: Structured data is highly organised information, typically stored as a table with columns and rows. Unstructured data is information that has not been or cannot be organised in a pre-defined manner (typically text heavy).
  2. Mixed methods: This is a research methodology that involves integrating quantitative and qualitative data.
  3. Thematic analysis: Used in qualitative research to explore, examine and record themes within data.
  4. Themes: These are patterns across data sets that describe a phenomenon.
  5. Sentiment analysis: This is used to categorise opinion expressed as positive, negative or neutral.
  6. Clustering analysis: Used to group items into similar categories, usually determined using a similarity measure. The help page ‘How are cluster analysis diagrams generated?’ by the makers of NVivo explains how this similarity is measured.
  7. Word occurrence and co-occurrence: Word occurrence is simply the number of times a word appears in a body of text while co-occurrence is the frequency of occurrence of two words alongside each other in a certain order.
  8. Mind mapping: This is a visual representation of information that starts with a central idea surrounded by connected branches of related topics. It is typically an iterative process used as a brainstorming tool, as explained in the help page ‘About mind maps’ by the makers of NVivo.
  9. Social network analysis: This is the process of investigating social structures through the use of networks and graph theory. This is also explained in a help page ‘About social network analysis’ by the makers of NVivo.
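
The glossary definition of word co-occurrence (two words appearing alongside each other in a certain order) can be illustrated by counting ordered pairs of adjacent words. The following is a minimal sketch: the function name `cooccurrence_counts` and the sample text are assumptions made for this example, and the sketch ignores sentence boundaries, so a pair can span the end of one sentence and the start of the next.

```python
from collections import Counter
import re

def cooccurrence_counts(text):
    """Count ordered pairs of adjacent words (sentence breaks ignored)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))

sample = ("Climate change matters. Climate change policy "
          "addresses climate risk.")
pairs = cooccurrence_counts(sample)
print(pairs[("climate", "change")])  # 2
print(pairs[("change", "climate")])  # 0
```

Because the pairs are ordered, ‘climate change’ and ‘change climate’ are counted separately, matching the "in a certain order" part of the definition.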

5. Summary

NVivo provides an easy graphical user interface to perform data mining analysis. The articles below show further functionality of the software:

  1. NVivo has a range of tutorials, demonstrated in the various versions of the software.
  2. The University at Buffalo has a range of tutorials on qualitative analysis in NVivo.
  3. The University of Edinburgh published an in-depth introduction to NVivo.
