<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NLP - Better Biblos</title>
	<atom:link href="https://betterbiblos.com/category/analysis/nlp/feed/" rel="self" type="application/rss+xml" />
	<link>https://betterbiblos.com</link>
	<description>Welcome</description>
	<lastBuildDate>Tue, 13 Sep 2022 08:09:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/cropped-cropped-BB_2-e1651905174701.png?fit=32%2C32&#038;ssl=1</url>
	<title>NLP - Better Biblos</title>
	<link>https://betterbiblos.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">207373970</site>	<item>
		<title>Bible EDA with PyCaret</title>
		<link>https://betterbiblos.com/2022/06/24/bible-eda-with-pycaret/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bible-eda-with-pycaret</link>
					<comments>https://betterbiblos.com/2022/06/24/bible-eda-with-pycaret/#comments</comments>
		
		<dc:creator><![CDATA[Ryan Dominguez]]></dc:creator>
		<pubDate>Fri, 24 Jun 2022 04:02:36 +0000</pubDate>
				<category><![CDATA[Analysis]]></category>
		<category><![CDATA[LDA]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[PyCaret]]></category>
		<category><![CDATA[Bible]]></category>
		<guid isPermaLink="false">https://betterbiblos.com/?p=315</guid>

					<description><![CDATA[<p>Luckily for us there are many tools we can use to perform exploratory data analysis [&#8230;]</p>
<p>The post <a href="https://betterbiblos.com/2022/06/24/bible-eda-with-pycaret/">Bible EDA with PyCaret</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Luckily for us, there are many tools we can use to perform exploratory data analysis (EDA) on the Bible.  <a href="https://pycaret.gitbook.io/docs/" target="_blank" rel="noreferrer noopener">PyCaret</a> is a multipurpose Python platform for machine learning and data analysis.  It includes a natural language processing (NLP) module that we will use on our Bible dataset to perform Bible EDA with PyCaret.</p>



<h2 class="wp-block-heading">Initial Data</h2>



<p>We will start with the dataset we generated in our blog post <a href="https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/" target="_blank" rel="noreferrer noopener" title="Transforming the Bible into a Dataframe">Transforming the Bible into a Dataframe</a>.  As noted there, this dataframe consists of four columns: &#8216;<strong>book</strong>&#8217;, the book of the Bible; &#8216;<strong>chapter</strong>&#8217;, the chapter number; &#8216;<strong>verse_number</strong>&#8217;, the verse number; and &#8216;<strong>verse</strong>&#8217;, the text of the verse itself.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="458" height="315" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/bb_data.png?resize=458%2C315&#038;ssl=1" alt="Bible EDA with PyCaret" class="wp-image-317" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/bb_data.png?w=458&amp;ssl=1 458w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/bb_data.png?resize=300%2C206&amp;ssl=1 300w" sizes="(max-width: 458px) 100vw, 458px" /></figure>



<h2 class="wp-block-heading">Initial Setup</h2>



<p>Initial setup involves importing the pycaret.nlp module and then calling setup on our dataframe, naming the column that holds the text to process as the target.</p>



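<pre class="wp-block-code"><code># init setup (pycaret.nlp ships with PyCaret 2.x; it was removed in PyCaret 3.0)
from pycaret.nlp import *
# df is the Bible dataframe; 'verse' names the column holding the text to model
s = setup(df, target = 'verse')</code></pre>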



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="175" height="142" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_init_setup.png?resize=175%2C142&#038;ssl=1" alt="" class="wp-image-318"/></figure>



<p>The initial setup outputs some basic descriptive information about our Bible corpus.  For one, there are 31,102 documents in our data, one for each verse of the Bible.  Our corpus has a vocabulary of 6,999 distinct words.</p>
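
<p>To make these two figures concrete, here is a minimal sketch of how a document count and a vocabulary size can be derived from raw text.  It uses two sample verses and a simple regular-expression tokenizer of our own (<code>tokenize</code> is a hypothetical helper, not part of PyCaret).  PyCaret applies its own preprocessing during setup, so its vocabulary count will not match a naive count like this one.</p>

<pre class="wp-block-code"><code>import re

# Two sample verses standing in for the full 'verse' column.
verses = [
    "In the beginning God created the heaven and the earth.",
    "And God said, Let there be light: and there was light.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Each verse is one document; the vocabulary is the set of distinct tokens.
vocabulary = set()
for verse in verses:
    vocabulary.update(tokenize(verse))

num_documents = len(verses)
vocab_size = len(vocabulary)
print(num_documents, vocab_size)</code></pre>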



<h2 class="wp-block-heading">Model Selection</h2>



<p>The pycaret.nlp module gives us several NLP models to deploy, built on the gensim and sklearn packages.  We can list the available models using the command below:</p>



<pre class="wp-block-code"><code>models()</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="403" height="188" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_nlp_model_selection.png?resize=403%2C188&#038;ssl=1" alt="" class="wp-image-320" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_nlp_model_selection.png?w=403&amp;ssl=1 403w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_nlp_model_selection.png?resize=300%2C140&amp;ssl=1 300w" sizes="(max-width: 403px) 100vw, 403px" /></figure>



<p>We will pick LDA for our first run with the model, and can try other models later.  LDA, short for <a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf" target="_blank" rel="noreferrer noopener" title="Latent Dirichlet Allocation">Latent Dirichlet Allocation</a>, was published in 2003 by the machine learning researchers David Blei, Andrew Ng, and Michael Jordan.  It was one of the first widely adopted topic modeling frameworks.  For the curious, we&#8217;ve included the abstract below:</p>



<p class="has-background" style="background-color:#c1e9fa">We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
text modeling, the topic probabilities provide an explicit representation of a document. We present
efficient approximate inference techniques based on variational methods and an EM algorithm for
empirical Bayes parameter estimation. We report results in document modeling, text classification,
and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI
model.</p>
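
<p>The three levels in the abstract can be read as a recipe for generating a document: corpus-wide parameters, a topic mixture drawn once per document, and a topic choice drawn once per word.  The sketch below walks through that recipe with made-up topics and a hypothetical <code>generate_document</code> helper.  It only illustrates the generative story; it does not perform the variational inference that actually fits an LDA model.</p>

<pre class="wp-block-code"><code>import random

def sample_dirichlet(alpha, rng):
    """Draw one point from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Draw a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-level topic mixture
    words = []
    for _ in range(length):
        # Word level: pick a topic according to the document's mixture...
        topic_index = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[topic_index])
        probs = [topics[topic_index][w] for w in vocab]
        # ...then pick a word according to that topic's word probabilities.
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Made-up topics for illustration: each maps words to probabilities.
topics = [
    {"light": 0.5, "heaven": 0.3, "earth": 0.2},
    {"king": 0.6, "house": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)</code></pre>

<p>Fitting LDA runs this story in reverse: given only the words, it estimates the topics and each document&#8217;s mixture.</p>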



<p>PyCaret makes creating the model easy; we will see just how easy in the next section.</p>



<h2 class="wp-block-heading">Running the model</h2>



<p>One can create a model with one line of code, once they have initialized the pycaret.nlp package and pointed it to the relevant corpus.</p>



<pre class="wp-block-code has-white-background-color has-background"><code>lda = create_model('lda')
print(lda)</code></pre>



<pre class="wp-block-code"><code>LdaModel(num_terms=6999, num_topics=4, decay=0.5, chunksize=100)</code></pre>



<p>The print statement outputs several features that help in understanding the LDA model.  In total, 6,999 different tokens (&#8220;words&#8221;) were considered in the model creation, and the default number of topics for the model is 4.</p>



<h2 class="wp-block-heading">Visualizing the Topics</h2>



<p>In subsequent posts, we will explore additional visualizations and optimization of topic generation using various target parameters such as the book of the Bible.  For now we will examine the default 4 topics generated by LDA to inform our future work.</p>



<h3 class="wp-block-heading">Topic 0</h3>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'frequency',topic_num='Topic 0')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="949" height="485" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0.png?resize=949%2C485&#038;ssl=1" alt="" class="wp-image-342" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0.png?w=949&amp;ssl=1 949w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0.png?resize=300%2C153&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0.png?resize=768%2C392&amp;ssl=1 768w" sizes="auto, (max-width: 949px) 100vw, 949px" /></figure>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'distribution', topic_num='Topic 0')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="924" height="477" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0_dist.png?resize=924%2C477&#038;ssl=1" alt="" class="wp-image-351" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0_dist.png?w=924&amp;ssl=1 924w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0_dist.png?resize=300%2C155&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic0_dist.png?resize=768%2C396&amp;ssl=1 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>



<h4 class="wp-block-heading">Interpretation</h4>



<p>Topic 0 contains a moderate variety of words, typically with lengths under 10-11 characters.</p>



<h3 class="wp-block-heading">Topic 1</h3>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'frequency',topic_num='Topic 1')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="942" height="474" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1.png?resize=942%2C474&#038;ssl=1" alt="" class="wp-image-343" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1.png?w=942&amp;ssl=1 942w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1.png?resize=300%2C151&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1.png?resize=768%2C386&amp;ssl=1 768w" sizes="auto, (max-width: 942px) 100vw, 942px" /></figure>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'distribution', topic_num='Topic 1')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="928" height="480" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1_dist.png?resize=928%2C480&#038;ssl=1" alt="" class="wp-image-352" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1_dist.png?w=928&amp;ssl=1 928w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1_dist.png?resize=300%2C155&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic1_dist.png?resize=768%2C397&amp;ssl=1 768w" sizes="auto, (max-width: 928px) 100vw, 928px" /></figure>



<h4 class="wp-block-heading">Interpretation</h4>



<p>Topic 1 contains a moderately large variety of words, with the length distribution centered around 8-9 characters.</p>



<h3 class="wp-block-heading">Topic 2</h3>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'frequency',topic_num='Topic 2')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="957" height="487" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2.png?resize=957%2C487&#038;ssl=1" alt="" class="wp-image-344" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2.png?w=957&amp;ssl=1 957w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2.png?resize=300%2C153&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2.png?resize=768%2C391&amp;ssl=1 768w" sizes="auto, (max-width: 957px) 100vw, 957px" /></figure>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'distribution', topic_num='Topic 2')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="916" height="476" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2_dist.png?resize=916%2C476&#038;ssl=1" alt="" class="wp-image-353" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2_dist.png?w=916&amp;ssl=1 916w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2_dist.png?resize=300%2C156&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic2_dist.png?resize=768%2C399&amp;ssl=1 768w" sizes="auto, (max-width: 916px) 100vw, 916px" /></figure>



<h4 class="wp-block-heading">Interpretation</h4>



<p>Topic 2 contains a moderately large variety of words, with the length distribution centered around 12-13 characters.</p>



<h3 class="wp-block-heading">Topic 3</h3>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'frequency',topic_num='Topic 3')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="938" height="484" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3.png?resize=938%2C484&#038;ssl=1" alt="" class="wp-image-346" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3.png?w=938&amp;ssl=1 938w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3.png?resize=300%2C155&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3.png?resize=768%2C396&amp;ssl=1 768w" sizes="auto, (max-width: 938px) 100vw, 938px" /></figure>



<pre class="wp-block-code"><code>plot_model(lda, plot = 'distribution', topic_num='Topic 3')</code></pre>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="924" height="481" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3_dist.png?resize=924%2C481&#038;ssl=1" alt="" class="wp-image-354" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3_dist.png?w=924&amp;ssl=1 924w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3_dist.png?resize=300%2C156&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/06/pycaret_bible_nlp_topic3_dist.png?resize=768%2C400&amp;ssl=1 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>



<h4 class="wp-block-heading">Interpretation</h4>



<p>Topic 3 contains a moderately large variety of words, with the length distribution centered around 8-9 characters.</p>



<p>We note a curious structure in the topics: they vary considerably in the count of their single most frequent word.  For topic 0, &#8216;give&#8217; is in the number 1 position with ~150 references; for topic 1, &#8216;say&#8217; with ~1400 references; for topic 2, &#8216;let&#8217; with 11 references; and for topic 3, &#8216;shall&#8217; with ~9000 references.</p>
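
<p>A comparison like this can be sketched in a few lines, assuming the frequency plot counts tokens across the verses assigned to each topic.  The topic labels and tokens below are made up for illustration, and <code>top_word_per_topic</code> is a hypothetical helper rather than a PyCaret function.</p>

<pre class="wp-block-code"><code>from collections import Counter

# Hypothetical tokenized verses, each paired with a made-up dominant topic label.
assigned = [
    ("Topic 1", ["say", "lord", "say"]),
    ("Topic 1", ["say", "man"]),
    ("Topic 3", ["shall", "come", "shall"]),
    ("Topic 3", ["shall", "lord"]),
]

def top_word_per_topic(assigned_docs):
    """Return {topic: (word, count)} for the most frequent token in each topic."""
    counts = {}
    for topic, tokens in assigned_docs:
        counts.setdefault(topic, Counter()).update(tokens)
    return {topic: counter.most_common(1)[0] for topic, counter in counts.items()}

top_words = top_word_per_topic(assigned)
print(top_words)</code></pre>

<p>Under that assumption, a large gap between the top counts may partly reflect how many verses were assigned to each topic, so it is worth checking topic sizes before reading too much into the raw numbers.</p>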



<p>This post is one of several posts as a part of our <a href="https://betterbiblos.com/analyses/bible-nlp/" target="_blank" rel="noreferrer noopener" title="Bible NLP">Bible NLP Analysis</a>.</p>



<p>Additional resources</p>



<p><a href="https://pycaret.gitbook.io/docs/" target="_blank" rel="noopener">https://pycaret.gitbook.io/docs/</a></p><p>The post <a href="https://betterbiblos.com/2022/06/24/bible-eda-with-pycaret/">Bible EDA with PyCaret</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://betterbiblos.com/2022/06/24/bible-eda-with-pycaret/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">315</post-id>	</item>
		<item>
		<title>Generating a Bible Semantic Space</title>
		<link>https://betterbiblos.com/2022/05/11/bible-semantic-space/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bible-semantic-space</link>
					<comments>https://betterbiblos.com/2022/05/11/bible-semantic-space/#comments</comments>
		
		<dc:creator><![CDATA[Ryan Dominguez]]></dc:creator>
		<pubDate>Wed, 11 May 2022 03:42:40 +0000</pubDate>
				<category><![CDATA[Analysis]]></category>
		<category><![CDATA[Data Wrangling]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Bible]]></category>
		<guid isPermaLink="false">https://betterbiblos.com/?p=141</guid>

					<description><![CDATA[<p>In this post, we will discuss how to generate a Bible Semantic Space. Where we [&#8230;]</p>
<p>The post <a href="https://betterbiblos.com/2022/05/11/bible-semantic-space/">Generating a Bible Semantic Space</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this post, we will discuss how to generate a Bible Semantic Space.</p>



<p>When we left off in <a href="https://betterbiblos.com/2022/05/08/diving-deeper-into-bible-nlp-visualization-part-1/" target="_blank" rel="noreferrer noopener">Part 1 of Diving deeper into Bible NLP Visualization</a>, we had used Python to turn the <a href="https://betterbiblos.com/king-james-bible/" target="_blank" rel="noreferrer noopener" title="King James Bible">King James Bible text file</a> into a dataframe that looked like this, with books, chapters, verse numbers, and verse text separated into their own columns:</p>



<figure class="wp-block-table"><table><tbody><tr><td>book</td><td>chapter</td><td>verse_number</td><td>verse</td></tr><tr><td>Gen</td><td>1</td><td>1</td><td>In the beginning God created the heaven and t&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>2</td><td>And the earth was without form, and void; and&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>3</td><td>And God said, Let there be light: and there w&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>4</td><td>And God saw the light, that it was good: and &#8230;</td></tr><tr><td>Gen</td><td>1</td><td>5</td><td>And God called the light Day, and the darknes&#8230;</td></tr></tbody></table></figure>



<p>The dataframe is now a helpful tool for efficient information retrieval.  We can ask questions such as &#8220;how many verses are contained in Genesis?&#8221; using pandas filters and groupby statements with suitable aggregations.  However, we need to do more work on the dataframe in order to visualize semantic meaning effectively.  We next need to generate a semantic space.</p>
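
<p>As a rough illustration of that kind of question, the sketch below counts verses per book on a handful of sample rows using only the standard library, so it runs without the dataset.  The rows, the &#8216;Exo&#8217; abbreviation, and the <code>verses_per_book</code> helper are assumptions for the example.  On the real dataframe the same count is typically written as a pandas groupby on the book column.</p>

<pre class="wp-block-code"><code>from collections import Counter

# Sample rows shaped like the dataframe: (book, chapter, verse_number, verse).
rows = [
    ("Gen", 1, 1, "In the beginning God created the heaven and the earth."),
    ("Gen", 1, 2, "And the earth was without form, and void; ..."),
    ("Gen", 1, 3, "And God said, Let there be light: and there was light."),
    ("Exo", 1, 1, "Now these are the names of the children of Israel, ..."),
]

def verses_per_book(records):
    """Group rows by book and count them, like a groupby followed by a size."""
    return Counter(book for book, _chapter, _verse_number, _text in records)

# Aggregation: how many verses does each book contain?
counts = verses_per_book(rows)

# Filter: keep only the rows that belong to Genesis.
genesis_sample = [row for row in rows if row[0] == "Gen"]

print(counts["Gen"], len(genesis_sample))</code></pre>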



<h2 class="wp-block-heading">Bible Semantic Space to enable Bible NLP</h2>



<p>We feed the text of the verse column into a word embedding model to generate a semantic space.  Word embedding models are natural language processing tools that translate words and/or their context into vector representations, giving our computer a mathematical object to analyze.  Here&#8217;s an analogy to explain the concept.  In Einsteinian physics, position is not determined by an absolute space-time grid; position is <em>relative</em> to the position of other objects and their movement.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/pexels-mikhail-nilov-7663144.jpg?resize=515%2C772&#038;ssl=1" alt="Bible NLP" class="wp-image-144" width="515" height="772"/></figure>



<p>Word embedding models function somewhat similarly.  They are initially trained on massive corpora (bodies of linguistic data).  Corpora are datasets of combined sources such as Wikipedia, the set of all tweets, an agglomeration of the last 20 years&#8217; worth of several publishers&#8217; news articles, etc.  From this information, meaning is extracted by way of the co-occurrence of words and phrases.</p>
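
<p>To show what co-occurrence means in practice, here is a minimal sketch that counts how often two words appear within a small window of each other.  The toy sentences, the window size, and the <code>cooccurrence_counts</code> helper are our own assumptions.  Word2vec does not store a table like this; it learns vectors by predicting nearby words, and the counts here only illustrate the signal it learns from.</p>

<pre class="wp-block-code"><code>from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` positions of each other."""
    pairs = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look ahead only, so each pair in a sentence is counted once.
            for neighbor in tokens[i + 1 : i + 1 + window]:
                if neighbor != word:
                    pairs[tuple(sorted((word, neighbor)))] += 1
    return pairs

# Toy tokenized sentences, already lowercased and stripped of stop words.
sentences = [
    ["god", "created", "heaven", "earth"],
    ["god", "said", "light"],
    ["god", "saw", "light"],
]
pairs = cooccurrence_counts(sentences, window=2)
print(pairs[("god", "light")])</code></pre>

<p>Words that keep showing up near the same neighbors end up close together in the semantic space.</p>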



<h2 class="wp-block-heading">Word2vec</h2>



<p>We have chosen <a href="https://radimrehurek.com/gensim/models/word2vec.html" target="_blank" rel="noreferrer noopener">word2vec</a>, as implemented in the <a href="https://radimrehurek.com/gensim/" target="_blank" rel="noreferrer noopener">gensim</a> package, as the embedding model for our initial analysis. Word2vec was created in 2013 by a team of Google researchers, including Tomas Mikolov.</p>



<p>Depending upon our choice of semantic embedding model, we must utilize different preprocessing techniques to get our dataframe ready.  We will use gensim to preprocess our strings into a format most advantageous for word2vec.  Preprocessing consists of a number of techniques for manipulating your text for better machine representation, such as:</p>



<ul class="wp-block-list"><li>Removal of punctuation</li><li>Removal of stop words</li><li>Change case to lower</li><li>Stemming</li><li>Lemmatizing</li><li>Tokenization</li></ul>
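
<p>Several of the steps above can be sketched with the standard library alone.  The example below lowercases, strips punctuation, tokenizes on whitespace, and drops stop words; the short stop word list and the <code>preprocess</code> helper are assumptions for illustration.  Stemming and lemmatizing are left out because they need a dedicated language resource, and this is not the exact pipeline gensim applies.</p>

<pre class="wp-block-code"><code>import string

# A small illustrative stop word list; real lists are much longer.
STOP_WORDS = {"and", "or", "the", "a", "in", "there", "be", "was"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in stop_words]

tokens = preprocess("And God said, Let there be light: and there was light.")
print(tokens)  # ['god', 'said', 'let', 'light', 'light']</code></pre>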



<p>There is room for choice here.  For example, our selection of <a href="https://en.wikipedia.org/wiki/Stop_word" target="_blank" rel="noreferrer noopener" title="stopwords">stopwords</a> to eliminate will alter the semantic space generated.  Typical stopwords eliminated include &#8216;and&#8217;, &#8216;or&#8217;, &#8216;the&#8217;, and &#8216;a&#8217;.  It is also possible with modern embedding models to keep all stopwords in the corpus.  Each word is assigned a numerical token representation like below:</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="975" height="259" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=975%2C259&#038;ssl=1" alt="" class="wp-image-176" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?w=975&amp;ssl=1 975w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=300%2C80&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=768%2C204&amp;ssl=1 768w" sizes="auto, (max-width: 975px) 100vw, 975px" /></figure>
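
<p>The idea of a numerical token representation can be sketched as a simple lookup table.  The ids below are assigned in order of first appearance by a hypothetical <code>build_vocabulary</code> helper; gensim builds its own vocabulary, so the actual ids in the figure above will differ.</p>

<pre class="wp-block-code"><code>def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

# Toy tokenized verses for illustration.
docs = [
    ["god", "created", "heaven", "earth"],
    ["god", "said", "light"],
]
token_to_id = build_vocabulary(docs)

# Replace every token with its id; repeated words share the same id.
encoded = [[token_to_id[t] for t in tokens] for tokens in docs]
print(token_to_id)
print(encoded)</code></pre>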



<p>Lastly, each tokenized word is assigned a point in an N-dimensional space.  One can choose the number of dimensions to use.  The general thought is that more dimensions can capture more sophisticated linguistic meaning.  We will be using an N of 100 for our initial runs with word2vec.  Several modern models extend their count of parameters into the billions to achieve state-of-the-art performance.  We can see the model output below.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1010" height="500" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=1010%2C500&#038;ssl=1" alt="" class="wp-image-177" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?w=1010&amp;ssl=1 1010w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=300%2C149&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=768%2C380&amp;ssl=1 768w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
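
<p>Once every word is a point in the space, closeness between points can be measured.  A common choice is cosine similarity, sketched below with made-up 3-dimensional vectors standing in for the 100-dimensional ones; the numbers are illustrative and are not output from our model.</p>

<pre class="wp-block-code"><code>import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
vectors = {
    "light": [0.9, 0.1, 0.0],
    "day": [0.8, 0.2, 0.1],
    "sword": [0.0, 0.2, 0.9],
}
light_day = cosine_similarity(vectors["light"], vectors["day"])
light_sword = cosine_similarity(vectors["light"], vectors["sword"])
print(light_day, light_sword)</code></pre>

<p>A value near 1 means two vectors point in nearly the same direction, while a value near 0 means they are close to unrelated.</p>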



<p>We can use dimension reduction on the above object to reduce it to 2 dimensions for interpretable plotting.  We will cover that and more in a future post.</p><p>The post <a href="https://betterbiblos.com/2022/05/11/bible-semantic-space/">Generating a Bible Semantic Space</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://betterbiblos.com/2022/05/11/bible-semantic-space/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">141</post-id>	</item>
		<item>
		<title>Transforming the Bible into a Dataframe</title>
		<link>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=transforming-the-bible-into-a-dataframe</link>
					<comments>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Ryan Dominguez]]></dc:creator>
		<pubDate>Sun, 08 May 2022 05:37:42 +0000</pubDate>
				<category><![CDATA[Analysis]]></category>
		<category><![CDATA[Data Wrangling]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Bible]]></category>
		<guid isPermaLink="false">https://betterbiblos.com/?p=125</guid>

					<description><![CDATA[<p>Introduction to Transforming the Bible In this first post of a series of posts about [&#8230;]</p>
<p>The post <a href="https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/">Transforming the Bible into a Dataframe</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></description>
										<content:encoded><![CDATA[<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/pexels-symeon-ekizoglou-2105937.jpg?ssl=1" alt="" class="wp-image-127"/></figure>



<h2 class="wp-block-heading">Introduction to Transforming the Bible</h2>



<p>In this first post of a series of posts about our <a href="https://betterbiblos.com/analyses/bible-nlp/" target="_blank" rel="noreferrer noopener" title="Bible NLP">Bible NLP Analysis</a>, we will be transforming the Bible into a dataframe. Part 1 will focus on the ingestion of the file into a broadly usable Pandas dataframe.</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1024" height="256" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1024%2C256&#038;ssl=1" alt="" class="wp-image-310" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1024%2C256&amp;ssl=1 1024w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=300%2C75&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=768%2C192&amp;ssl=1 768w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1536%2C384&amp;ssl=1 1536w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1568%2C392&amp;ssl=1 1568w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?w=1600&amp;ssl=1 1600w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>



<p>I utilized a .txt file of the King James Bible from the website <a href="https://sacred-texts.com/bib/osrc/index.htm" target="_blank" rel="noreferrer noopener" title="sacred-texts.com">sacred-texts.com</a>. Their formatting of the file made it much easier to ingest the text in a logical way than other sources I have found.</p>



<h2 class="wp-block-heading">Python &amp; Bible NLP</h2>



<p>My language of choice for NLP is Python.  Python has many excellent modules to aid in a wide range of tasks, NLP included.  Pandas is one such module in the data science realm.  One can find links to many of the resources I utilize here on the Better Biblos <a href="https://betterbiblos.com/resources/" target="_blank" rel="noreferrer noopener" title="Resources">resources</a> page.  In addition to many other functions, Pandas allows for easy ingestion of the .txt file using:</p>



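<pre class="wp-block-code has-white-background-color has-background"><code># Notebook shell command; run "pip install pandas" in a terminal otherwise
!pip install pandas
import pandas as pd
# Assumption: each line of kjvdat.txt ends with '~', so sep='~' with header=None
# yields the unnamed columns 0 and 1 shown in the output below
bible = pd.read_csv(r'...\kjvdat.txt', sep='~', header=None)
bible.head()</code></pre>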



<p>Inspecting the object using bible.head() we observe the following tabular structure:</p>



<figure class="wp-block-table is-style-regular" style="font-size:15px"><table class="has-white-background-color has-background has-fixed-layout"><tbody><tr><td>0</td><td>1</td></tr><tr><td>Gen|1|1| In the beginning God created the heav..</td><td>NaN</td></tr><tr><td>Gen|1|2| And the earth was without form, and v&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|3| And God said, Let there be light: and&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|4| And God saw the light, that it was go&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|5| And God called the light Day, and the&#8230;</td><td>NaN</td></tr></tbody></table></figure>



<p>Not too bad, but we can use the Bible&#8217;s logical structure of books, chapters, and verses to create a more accessible object. In particular, we will split on the pipe (&#8220;|&#8221;) delimiter character nicely included in the file, and then name the automatically generated columns produced by the .split command in line 1 below:</p>



<pre class="wp-block-code has-white-background-color has-background"><code>bible = bible&#91;0].str.split('|', n=-1, expand=True)  # split each line on the pipe delimiter
bible.columns = &#91;'book','chapter','verse_number','verse']  # name the four resulting columns
bible.head()</code></pre>



<figure class="wp-block-table" style="font-size:15px"><table class="has-white-background-color has-background has-fixed-layout"><tbody><tr><td>book</td><td>chapter</td><td>verse_number</td><td>verse</td></tr><tr><td>Gen</td><td>1</td><td>1</td><td>In the beginning God created the heaven and t&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>2</td><td>And the earth was without form, and void; and&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>3</td><td>And God said, Let there be light: and there w&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>4</td><td>And God saw the light, that it was good: and &#8230;</td></tr><tr><td>Gen</td><td>1</td><td>5</td><td>And God called the light Day, and the darknes&#8230;</td></tr></tbody></table></figure>
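
<p>For readers who want to see the splitting logic without pandas, here is a minimal standard-library sketch of the same idea applied to a single line.  The <code>parse_line</code> helper is hypothetical, and the trailing &#8216;~&#8217; on the sample line is an assumption about the file format.</p>

<pre class="wp-block-code"><code>def parse_line(line):
    """Split one 'book|chapter|verse_number|text' line into a record."""
    # Assumption: a trailing '~' may terminate each line; strip it if present.
    cleaned = line.strip().rstrip("~")
    # maxsplit=3 keeps any later '|' characters inside the verse text.
    book, chapter, verse_number, verse = cleaned.split("|", 3)
    return {
        "book": book,
        "chapter": int(chapter),
        "verse_number": int(verse_number),
        "verse": verse.strip(),
    }

sample = "Gen|1|1| In the beginning God created the heaven and the earth.~"
record = parse_line(sample)
print(record)</code></pre>

<p>Unlike the pandas split above, which leaves every column as a string, this sketch converts the chapter and verse numbers to integers so they sort and compare numerically.</p>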



<p>We can now very easily query subsets of data by books, chapters, and verses.</p><p>The post <a href="https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/">Transforming the Bible into a Dataframe</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">125</post-id>	</item>
	</channel>
</rss>
