NLP - Better Biblos

Bible EDA with PyCaret

Ryan Dominguez — Fri, 24 Jun 2022 04:02:36 +0000

Luckily for us, there are many tools we can use to perform exploratory data analysis (EDA) on the Bible. PyCaret is a multipurpose Python platform for machine learning and data analysis. It includes a natural language processing (NLP) module, which we will use on our Bible dataset to perform Bible EDA with PyCaret.

Initial Data

We will start with the dataset we generated in our blog post Transforming the Bible into a Dataframe. As noted there, this dataframe consists of four columns: ‘book’ for the books of the Bible, ‘chapter’ for chapters, ‘verse_number’ for verse numbers, and ‘verse’ for the text of each verse.

Initial Setup

Initial setup of the PyCaret model involves importing the pycaret.nlp module and then calling setup on our dataframe, naming the column that holds the text to process as the target.

# init setup
from pycaret.nlp import *
s = setup(df, target = 'verse')

The initial setup outputs some basic descriptive information about our Bible corpus. For one, there are 31,102 documents in our data, representing the verses of the Bible. Our corpus has a vocabulary of 6,999 distinct words.

Model Selection

pycaret.nlp gives us several NLP models to deploy, built on a combination of the gensim and sklearn packages. We can list the available models using the command below:

models()

We will pick LDA for our first run with the model, and can try other models later. LDA, short for Latent Dirichlet Allocation, was published in 2003 by the machine learning researchers David Blei, Andrew Ng, and Michael Jordan. It was one of the first widely adopted topic modeling frameworks. For the curious, we’ve included the abstract below:

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
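To make the abstract's "mixture over topics" idea more concrete, here is a minimal sketch of the generative story LDA assumes, written with only the Python standard library. The vocabulary, the two topics, and the alpha values are made up for illustration and are not fitted to the Bible; fitting an LDA model is the reverse problem of inferring such topics from the verses.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy vocabulary and two hand-written topics (illustration only).
vocab = ["give", "say", "let", "shall", "king", "land"]
topic_word_probs = [
    [0.40, 0.40, 0.10, 0.05, 0.03, 0.02],  # topic 0 favors "give" and "say"
    [0.02, 0.03, 0.05, 0.30, 0.30, 0.30],  # topic 1 favors "shall", "king", "land"
]

rng = random.Random(0)
theta, words = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5], length=8, rng=rng
)
print("topic mixture:", [round(t, 3) for t in theta])
print("document:", " ".join(words))
```

Each generated document has its own topic mixture, which is why a single verse can lean partly on one topic and partly on another.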

PyCaret makes creating the model easy. We will see just how easy in the next section.

Running the model

One can create a model with one line of code, once they have initialized the pycaret.nlp package and pointed it to the relevant corpus.

lda = create_model('lda')
print(lda)

LdaModel(num_terms=6999, num_topics=4, decay=0.5, chunksize=100)

The print statement outputs several features relevant to understanding the LDA model. In total, 6,999 distinct tokens (“words”) were considered in creating the model, and the default number of topics is 4.

Visualizing the Topics

In subsequent posts, we will explore additional visualizations and optimization of topic generation using various target parameters such as Book of the Bible. For now we will examine the default 4 topics generated by LDA to improve our future work.

Topic 0

plot_model(lda, plot = 'frequency',topic_num='Topic 0')

plot_model(lda, plot = 'distribution', topic_num='Topic 0')

Interpretation

Topic 0 consists of a moderate variety of words, typically with lengths under 10-11 characters.

Topic 1

plot_model(lda, plot = 'frequency',topic_num='Topic 1')

plot_model(lda, plot = 'distribution', topic_num='Topic 1')

Interpretation

Topic 1 consists of a medium-to-large variety of words, with the length distribution centered around 8-9 characters.

Topic 2

plot_model(lda, plot = 'frequency',topic_num='Topic 2')

plot_model(lda, plot = 'distribution', topic_num='Topic 2')

Interpretation

Topic 2 consists of a medium-to-large variety of words, with the length distribution centered around 12-13 characters.

Topic 3

plot_model(lda, plot = 'frequency',topic_num='Topic 3')

plot_model(lda, plot = 'distribution', topic_num='Topic 3')

Interpretation

Topic 3 consists of a medium-to-large variety of words, with the length distribution centered around 8-9 characters.

Initially we note a curious structure to the topics: they vary considerably in the count of their single most frequent word. In topic 0, ‘give’ holds the number 1 position with ~150 references; in topic 1, ‘say’ with ~1400 references; in topic 2, ‘let’ with just 11 references; and in topic 3, ‘shall’ with ~9000 references.

This post is one of several in our Bible NLP Analysis series.

Additional resources

https://pycaret.gitbook.io/docs/

The post Bible EDA with PyCaret appeared first on Better Biblos.

Generating a Bible Semantic Space

Ryan Dominguez — Wed, 11 May 2022 03:42:40 +0000

In this post, we will discuss how to generate a Bible Semantic Space.

When we left off in Part 1 of Diving deeper into Bible NLP Visualization, we had used Python to create a dataframe from the King James Bible text file. It looked like this, with books, chapters, verse numbers, and verse text separated into their own columns:

book	chapter	verse_number	verse
Gen	1	1	In the beginning God created the heaven and t…
Gen	1	2	And the earth was without form, and void; and…
Gen	1	3	And God said, Let there be light: and there w…
Gen	1	4	And God saw the light, that it was good: and …
Gen	1	5	And God called the light Day, and the darknes…

The dataframe is now a helpful tool for efficient information retrieval. We can ask questions such as “how many verses are contained in Genesis?” by combining Pandas filters and groupby statements with suitable aggregations. However, we need to do more work on the dataframe in order to visualize semantic meaning effectively. We next need to generate a semantic space.
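As a rough sketch of such a query, the snippet below counts verses with a filter and with groupby. The five rows are a made-up stand-in for the full dataframe (the verse text is placeholder text), so the counts here are not the real Bible totals; only the column names come from the dataframe above.

```python
import pandas as pd

# A tiny stand-in for the full dataframe (same column names as above).
df = pd.DataFrame(
    {
        "book": ["Gen", "Gen", "Gen", "Exo", "Exo"],
        "chapter": [1, 1, 2, 1, 1],
        "verse_number": [1, 2, 1, 1, 2],
        "verse": ["text a", "text b", "text c", "text d", "text e"],
    }
)

# Filter: how many verses are in Genesis (in this sample)?
genesis_verses = int((df["book"] == "Gen").sum())

# Groupby + aggregation: verses per book, and per chapter within each book.
verses_per_book = df.groupby("book")["verse"].count()
verses_per_chapter = df.groupby(["book", "chapter"])["verse"].count()

print(genesis_verses)
print(verses_per_book)
print(verses_per_chapter)
```

Run against the full dataframe, the same three statements would answer the question for every book at once.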

Bible Semantic Space to enable Bible NLP

We feed the text of the verse column into a word embedding model to generate a semantic space. Word embedding models are natural language processing tools that translate words and/or their context into vector representations, giving our computer a mathematical object to analyze. Here’s an analogy to explain the concept. In Einsteinian physics, position is not determined by an absolute space-time grid; it is relative to the positions and movement of other objects.

Word embedding models function somewhat similarly: a word’s meaning is located relative to the words around it. These models are initially trained on massive corpora (bodies of linguistic data). Corpora are datasets that combine sources such as Wikipedia, the set of all tweets, an agglomeration of the last 20 years’ worth of several publishers’ news articles, etc. From this information, meaning is extracted by way of the co-occurrence of words and phrases.
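To illustrate what co-occurrence means in practice, here is a small standard-library sketch that counts how often two words fall within a fixed window of each other. The function name and the window size of 2 are assumptions made for this example. Word2vec does not build such a table explicitly (it learns vectors by predicting nearby words), but the sliding-window idea is the same.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right so each pair in a window is counted once.
            for neighbor in tokens[i + 1 : i + 1 + window]:
                if neighbor != word:
                    counts[tuple(sorted((word, neighbor)))] += 1
    return counts

# Two short phrases from Genesis 1, already lowercased and tokenized.
sentences = [
    ["god", "created", "the", "heaven", "and", "the", "earth"],
    ["god", "said", "let", "there", "be", "light"],
]

counts = cooccurrence_counts(sentences, window=2)
print(counts[("created", "god")])
print(counts.most_common(3))
```

Words that repeatedly share windows, such as ‘heaven’ and ‘the’ here, end up with related representations once a model is trained on enough text.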

Word2vec

We have chosen word2vec, as implemented in the gensim package, as the embedding model for our initial analysis. Word2vec was created in 2013 by a team of Google researchers, including Tomas Mikolov.

Depending upon our choice of semantic embedding model, we must utilize different preprocessing techniques to get our dataframe ready. We will use gensim to preprocess our strings into a format most advantageous for word2vec. Preprocessing consists of a number of techniques for manipulating your text for better machine representation, such as:

Removal of punctuation
Removal of stop words
Change case to lower
Stemming
Lemmatizing
Tokenization

There is room for choice here. For example, our selection of stop words to eliminate will alter the semantic space generated. Stop words typically eliminated include ‘and’, ‘or’, ‘the’, and ‘a’. With modern embedding models it is also possible to keep all stop words in the corpus. Each remaining word is then assigned a numerical token representation.
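The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the gensim pipeline itself: the stop word list is a deliberately tiny assumption, stemming and lemmatizing are left out, and the helper names are invented for this example.

```python
import string

# Assumed, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"and", "or", "the", "a", "in", "was", "of"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stop_words]

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

verses = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]
tokenized = [preprocess(v) for v in verses]
vocab = build_vocab(tokenized)
encoded = [[vocab[tok] for tok in tokens] for tokens in tokenized]

print(tokenized)
print(vocab)
print(encoded)
```

Changing STOP_WORDS changes which tokens survive, which is exactly the choice that alters the resulting semantic space.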

Lastly, each tokenized word is assigned a point in an N-dimensional space. One can choose the number of dimensions to use; the general thought is that more dimensions can capture more sophisticated linguistic meaning. We will use an N of 100 for our initial runs with word2vec. For comparison, several modern language models extend their parameter counts into the billions to achieve state-of-the-art performance. The model output is one 100-dimensional vector for each word kept in the vocabulary.

We can apply dimension reduction to these vectors to bring them down to 2 dimensions for interpretable plotting. We will cover that and more in a future post.
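Once words are points in a vector space, closeness in meaning can be measured as closeness in direction. The sketch below computes cosine similarity between hand-made 3-dimensional vectors. The numbers are invented purely for illustration and are not output from a trained model, which would supply the 100-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration only; a trained model
# would supply 100-dimensional vectors learned from the corpus.
vectors = {
    "heaven": [0.9, 0.1, 0.3],
    "earth": [0.8, 0.2, 0.4],
    "said": [0.1, 0.9, 0.2],
}

sim_heaven_earth = cosine_similarity(vectors["heaven"], vectors["earth"])
sim_heaven_said = cosine_similarity(vectors["heaven"], vectors["said"])
print(round(sim_heaven_earth, 3))
print(round(sim_heaven_said, 3))
```

In this toy setup ‘heaven’ and ‘earth’ point in nearly the same direction while ‘said’ does not, which is the kind of structure a semantic space is meant to expose.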


Transforming the Bible into a Dataframe

Ryan Dominguez — Sun, 08 May 2022 05:37:42 +0000

Introduction to Transforming the Bible

In this first post of a series of posts about our Bible NLP Analysis, we will be transforming the Bible into a dataframe. Part 1 will focus on the ingestion of the file into a broadly usable Pandas dataframe.

I used a .txt file of the King James Bible from the website sacred-texts.com. The formatting of their file made it much easier to ingest the text in a logical way than other sources I have found.

Python & Bible NLP

My language of choice for NLP is Python. Python has many excellent modules to aid in a wide range of tasks, NLP included. Pandas is one such module in the data science realm. Links to many of the resources I use are on the Better Biblos resources page. Among its many other functions, Pandas allows for easy ingestion of the .txt file:

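# Install pandas (notebook shell command)
!pip install pandas
import pandas as pd
# header=None keeps the first verse as data and labels the columns 0, 1, ...
bible = pd.read_csv(r'...\kjvdat.txt', header=None)
bible.head()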

Inspecting the object using bible.head() we observe the following tabular structure:

0	1
Gen|1|1| In the beginning God created the heav…	NaN
Gen|1|2| And the earth was without form, and v…	NaN
Gen|1|3| And God said, Let there be light: and…	NaN
Gen|1|4| And God saw the light, that it was go…	NaN
Gen|1|5| And God called the light Day, and the…	NaN

Not too bad, but we can use the Bible’s logical structure of books, chapters, and verses to create a more accessible object. In particular, we will split each row on the pipe (“|”) delimiter character nicely included in the file (line 1 below), and then name the automatically generated columns (line 2):

bible = bible[0].str.split('|', n=-1, expand=True)
bible.columns = ['book','chapter','verse_number','verse']
bible.head()

book	chapter	verse_number	verse
Gen	1	1	In the beginning God created the heaven and t…
Gen	1	2	And the earth was without form, and void; and…
Gen	1	3	And God said, Let there be light: and there w…
Gen	1	4	And God saw the light, that it was good: and …
Gen	1	5	And God called the light Day, and the darknes…

We can now very easily query subsets of data by books, chapters, and verses.
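For readers without the file at hand, here is a minimal sketch of the same split on three in-memory rows, followed by a query. It makes two choices that go beyond the code above and are assumptions of this example: n=3 caps the split at four columns in case a verse ever contains a pipe, and the chapter and verse numbers are converted to integers because the split returns strings.

```python
import pandas as pd

# Stand-in for the single raw column read from the file (three rows only).
raw = pd.DataFrame(
    {
        0: [
            "Gen|1|1| In the beginning God created the heaven and the earth.",
            "Gen|1|2| And the earth was without form, and void;",
            "Gen|1|3| And God said, Let there be light: and there was light.",
        ]
    }
)

# Split on the pipe delimiter into at most four columns, then name them.
bible = raw[0].str.split("|", n=3, expand=True)
bible.columns = ["book", "chapter", "verse_number", "verse"]

# The split yields strings, so convert the numeric columns and trim the text.
bible["chapter"] = bible["chapter"].astype(int)
bible["verse_number"] = bible["verse_number"].astype(int)
bible["verse"] = bible["verse"].str.strip()

# Query: Genesis, chapter 1, verse 3.
match = bible[
    (bible["book"] == "Gen")
    & (bible["chapter"] == 1)
    & (bible["verse_number"] == 3)
]
print(match["verse"].iloc[0])
```

With integer chapter and verse columns, range queries such as `bible["verse_number"] <= 2` also behave numerically rather than alphabetically.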
