In this post, we will discuss how to generate a Bible Semantic Space.
Where we last left off in Part 1 of Diving deeper into Bible NLP Visualization, we had used Python to create a dataframe from the King James Bible text file, with books, chapters, verse numbers, and verse text separated into their own columns:
book | chapter | verse_number | verse |
Gen | 1 | 1 | In the beginning God created the heaven and t… |
Gen | 1 | 2 | And the earth was without form, and void; and… |
Gen | 1 | 3 | And God said, Let there be light: and there w… |
Gen | 1 | 4 | And God saw the light, that it was good: and … |
Gen | 1 | 5 | And God called the light Day, and the darknes… |
The dataframe is now a helpful tool for efficient information retrieval. We can answer questions such as “How many verses are contained in Genesis?” with a simple filter, or a groupby with the appropriate aggregation. However, we still need to do more work on the dataframe before we can visualize semantic meaning effectively. The next step is to generate a semantic space.
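As a quick illustration, here is a minimal sketch of that kind of query, assuming the dataframe is named df and its columns match the table above (both the variable name and column names are assumptions, since the original code from Part 1 is not repeated here):

```python
import pandas as pd

# Assumes a dataframe `df` with columns: book, chapter, verse_number, verse
# (names are assumptions based on the table above).

# Count the verses in Genesis with a simple boolean filter.
genesis_verse_count = (df["book"] == "Gen").sum()
print(f"Verses in Genesis: {genesis_verse_count}")

# Or count verses per book with a groupby and an aggregation.
verses_per_book = df.groupby("book")["verse"].count()
print(verses_per_book.head())
```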
Bible Semantic Space to enable Bible NLP
We feed the text of the verse column into a word embedding model to generate a semantic space. Word embedding models are natural language processing tools that translate words, and the contexts in which they appear, into vector representations, giving the computer a mathematical object to analyze. Here’s an analogy to explain the concept: in Einsteinian physics, position is not determined by an absolute space-time grid; position is relative to the positions of other objects and their movement.
Word embedding models function somewhat similarly. They are initially trained on massive corpora (bodies of linguistic data). These corpora are datasets built from combined sources such as Wikipedia, the set of all tweets, or an agglomeration of the last 20 years’ worth of several publishers’ news articles. From this set of information, meaning is extracted by way of the co-occurrence of words and phrases.
Word2vec
We have chosen word2vec, as implemented in the gensim package, as the embedding model for our initial analysis. Word2vec was created in 2013 by a team of Google researchers, including Tomas Mikolov.
Depending upon our choice of embedding model, we must use different preprocessing techniques to get our dataframe ready. We will use gensim to preprocess our strings into a format best suited to word2vec. Preprocessing consists of a number of techniques for manipulating text into a better machine representation, such as:
- Removal of punctuation
- Removal of stop words
- Change case to lower
- Stemming
- Lemmatizing
- Tokenization
There is room for choice here; for example, our selection of stopwords to eliminate will alter the semantic space generated. Typical stopwords include ‘and’, ‘or’, ‘the’, ‘a’, and so on. It is also possible with modern embedding models to include all stopwords in the corpus. Each word is then assigned a numerical token representation, like below:
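The original post shows the token output here; as a stand-in, the following is a minimal sketch of how this preprocessing and tokenization can be done with gensim (the variable names df and tokenized_verses are assumptions, and this is not necessarily the post’s exact pipeline):

```python
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords
from gensim.corpora import Dictionary

# Lowercase each verse, remove common English stopwords, strip punctuation,
# and split into tokens. `df` is the dataframe from Part 1 (assumed name).
tokenized_verses = [
    simple_preprocess(remove_stopwords(verse.lower())) for verse in df["verse"]
]

# Map each unique token to an integer id.
dictionary = Dictionary(tokenized_verses)
print(dictionary.token2id)  # prints a mapping from each token to its numeric id
```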
Lastly, each tokenized word is assigned a point in an N-dimensional space. We can choose the number of dimensions we wish to use; the general thought is that more dimensions can capture more sophisticated linguistic meaning. We will be using an N of 100 for our initial runs with word2vec. (Several modern models extend their parameter counts into the billions to achieve state-of-the-art performance.) We can see the model output below.
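The original post displays the trained model’s output here; the following is a minimal sketch of training the model with gensim, assuming the tokenized_verses list from the previous snippet (the window, min_count, and other settings are assumptions, not the post’s exact configuration):

```python
from gensim.models import Word2Vec

# Train word2vec on the tokenized verses with 100-dimensional vectors.
# In gensim 4.x the dimensionality argument is `vector_size`
# (it was `size` in earlier versions).
model = Word2Vec(
    sentences=tokenized_verses,
    vector_size=100,   # N = 100 dimensions, as chosen above
    window=5,          # context window size (assumed setting)
    min_count=1,       # keep even rare words in this relatively small corpus
    workers=4,
)

# Each word now has a 100-dimensional vector, and we can query
# the semantic space for nearest neighbours.
print(model.wv["light"].shape)          # (100,)
print(model.wv.most_similar("light"))   # words closest to "light" in the space
```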
We can use dimensionality reduction to project the above object down to 2 dimensions for interpretable plotting. We will cover that and more in a future post.
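As a preview, here is a minimal sketch of one possible approach, PCA via scikit-learn (the choice of technique is an assumption on my part; the post does not specify which reduction method will be used):

```python
from sklearn.decomposition import PCA

# Project the 100-dimensional word vectors down to 2 dimensions for plotting.
# PCA is only one option among several (t-SNE, UMAP, etc.).
coords_2d = PCA(n_components=2).fit_transform(model.wv.vectors)
print(coords_2d.shape)  # (vocabulary_size, 2)
```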