Luckily for us, there are many tools we can use to perform exploratory data analysis (EDA) on the Bible. PyCaret is a multipurpose Python platform for machine learning and data analysis, and its natural language processing (NLP) module is what we will use on our Bible dataset to perform our EDA.
Initial Data
We will start with the dataset we generated in our blog post Transforming the Bible into a Dataframe. As noted there, the dataframe consists of several columns: ‘book’, the book of the Bible; ‘chapter’, the chapter number; ‘verse_number’, the verse number; and ‘verse’, the text of the verse itself.
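As a sketch of that structure, here is a miniature stand-in built by hand (the column names match the dataframe described above; the two rows are placeholders for the full 31,102-verse dataframe from the earlier post):

```python
import pandas as pd

# Miniature stand-in for the dataframe built in the earlier post;
# the real one has 31,102 rows, one per verse of the Bible.
df = pd.DataFrame({
    "book": ["Genesis", "Genesis"],
    "chapter": [1, 1],
    "verse_number": [1, 2],
    "verse": [
        "In the beginning God created the heaven and the earth.",
        "And the earth was without form, and void; and darkness was upon the face of the deep.",
    ],
})
print(df.columns.tolist())
```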
Initial Setup
Initial setup of the PyCaret model involves importing the pycaret.nlp package and then pointing it at our dataframe, specifying which column contains the text to process.
# init setup
from pycaret.nlp import *
s = setup(df, target = 'verse')
The initial setup outputs some basic descriptive information about our Bible corpus. There are 31,102 documents in our data, one per verse of the Bible, and the corpus has a vocabulary of 6,999 distinct words.
Model Selection
pycaret.nlp gives us several NLP models to deploy, built on a combination of the gensim and sklearn packages. We can list the available models with the following command:
models()
We will pick LDA for our first run and can try other models later. LDA, short for Latent Dirichlet Allocation, was published in 2003 by the machine learning researchers David Blei, Andrew Ng, and Michael Jordan, and was one of the first widely adopted topic modeling frameworks. For the curious, we’ve included the paper’s abstract below:
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
PyCaret makes the creation of the model easy; we will see just how easy in the next section.
Running the model
Once the pycaret.nlp package has been initialized and pointed at the relevant corpus, a model can be created with a single line of code.
lda = create_model('lda')
print(lda)
LdaModel(num_terms=6999, num_topics=4, decay=0.5, chunksize=100)
The print statement outputs several relevant features of the model created using LDA: 6,999 distinct tokens (“words”) were considered during model creation, and the default number of topics is 4.
Visualizing the Topics
In subsequent posts, we will explore additional visualizations and optimize topic generation using target parameters such as the book of the Bible. For now, we will examine the default 4 topics generated by LDA to inform that future work.
Topic 0
plot_model(lda, plot='frequency', topic_num='Topic 0')
plot_model(lda, plot='distribution', topic_num='Topic 0')
Interpretation
Topic 0 consists of a moderate variety of words, typically under 10–11 characters in length.
Topic 1
plot_model(lda, plot='frequency', topic_num='Topic 1')
plot_model(lda, plot='distribution', topic_num='Topic 1')
Interpretation
Topic 1 consists of a fairly large variety of words, with the length distribution centered around 8–9 characters.
Topic 2
plot_model(lda, plot='frequency', topic_num='Topic 2')
plot_model(lda, plot='distribution', topic_num='Topic 2')
Interpretation
Topic 2 consists of a fairly large variety of words, with the length distribution centered around 12–13 characters.
Topic 3
plot_model(lda, plot='frequency', topic_num='Topic 3')
plot_model(lda, plot='distribution', topic_num='Topic 3')
Interpretation
Topic 3 consists of a fairly large variety of words, with the length distribution centered around 8–9 characters.
Initially we note a curious structure to the topics: they vary considerably in the count of their most frequent word. For topic 0, ‘give’ is in the number 1 position with ~150 references; for topic 1, ‘say’ leads with ~1,400 references; for topic 2, ‘let’ leads with 11 references; and for topic 3, ‘shall’ leads with ~9,000 references.
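A natural next step, which we will take up in later posts, is to label each verse with its dominant topic so these counts can be examined per book or chapter. In PyCaret’s NLP module this is done with assign_model; the sketch below continues the session from the setup above, and the ‘Dominant_Topic’ column name is the one PyCaret 2.x produces (an assumption worth checking against your installed version):

```python
# Continues the session above: lda is the model from create_model('lda').
# assign_model returns the original dataframe with topic weights appended.
results = assign_model(lda)

# 'Dominant_Topic' holds the highest-weight topic for each verse.
print(results[['book', 'chapter', 'verse_number', 'Dominant_Topic']].head())
```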
This post is one of several posts as a part of our Bible NLP Analysis.