<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Wrangling - Better Biblos</title>
	<atom:link href="https://betterbiblos.com/category/analysis/data-wrangling/feed/" rel="self" type="application/rss+xml" />
	<link>https://betterbiblos.com</link>
	<description>Welcome</description>
	<lastBuildDate>Tue, 13 Sep 2022 08:09:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/cropped-cropped-BB_2-e1651905174701.png?fit=32%2C32&#038;ssl=1</url>
	<title>Data Wrangling - Better Biblos</title>
	<link>https://betterbiblos.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">207373970</site>	<item>
		<title>Generating a Bible Semantic Space</title>
		<link>https://betterbiblos.com/2022/05/11/bible-semantic-space/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bible-semantic-space</link>
					<comments>https://betterbiblos.com/2022/05/11/bible-semantic-space/#comments</comments>
		
		<dc:creator><![CDATA[Ryan Dominguez]]></dc:creator>
		<pubDate>Wed, 11 May 2022 03:42:40 +0000</pubDate>
				<category><![CDATA[Analysis]]></category>
		<category><![CDATA[Data Wrangling]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Bible]]></category>
		<guid isPermaLink="false">https://betterbiblos.com/?p=141</guid>

					<description><![CDATA[<p>In this post, we will discuss how to generate a Bible Semantic Space. Where we [&#8230;]</p>
<p>The post <a href="https://betterbiblos.com/2022/05/11/bible-semantic-space/">Generating a Bible Semantic Space</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this post, we will discuss how to generate a Bible Semantic Space.</p>



<p>When we left off in <a href="https://betterbiblos.com/2022/05/08/diving-deeper-into-bible-nlp-visualization-part-1/" target="_blank" rel="noreferrer noopener">Part 1 of Diving deeper into Bible NLP Visualization</a>, we had used Python to build a dataframe from the <a href="https://betterbiblos.com/king-james-bible/" target="_blank" rel="noreferrer noopener" title="King James Bible">King James Bible text file</a>, with books, chapters, verse numbers, and verse text separated into their own columns:</p>



<figure class="wp-block-table"><table><tbody><tr><td>book</td><td>chapter</td><td>verse_number</td><td>verse</td></tr><tr><td>Gen</td><td>1</td><td>1</td><td>In the beginning God created the heaven and t&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>2</td><td>And the earth was without form, and void; and&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>3</td><td>And God said, Let there be light: and there w&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>4</td><td>And God saw the light, that it was good: and &#8230;</td></tr><tr><td>Gen</td><td>1</td><td>5</td><td>And God called the light Day, and the darknes&#8230;</td></tr></tbody></table></figure>



<p>The dataframe is now a helpful tool for efficient information retrieval. We can answer questions such as &#8220;how many verses are contained in Genesis?&#8221; by combining pandas filters and groupby statements with suitable aggregations. However, the dataframe needs more work before we can visualize semantic meaning effectively. The next step is to generate a semantic space.</p>
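<p>As a minimal sketch of that kind of query, assuming a dataframe with the four columns shown above, a filter and a groupby count might look like this. The three sample rows are illustrative stand-ins (with truncated verse text) for the full file.</p>

```python
import pandas as pd

# Tiny illustrative sample; the real dataframe holds every verse of the KJV.
bible = pd.DataFrame(
    {
        "book": ["Gen", "Gen", "Exo"],
        "chapter": [1, 1, 1],
        "verse_number": [1, 2, 1],
        "verse": [
            "In the beginning God created the heaven and the earth.",
            "And the earth was without form, and void;",
            "Now these are the names of the children of Israel,",
        ],
    }
)

# Filter: how many verses are contained in Genesis?
genesis_count = int((bible["book"] == "Gen").sum())

# Groupby with an aggregation: verse counts for every book at once.
verses_per_book = bible.groupby("book")["verse"].count()

print(genesis_count)  # 2
print(verses_per_book.to_dict())
```

<p>On the full dataframe the same two statements would return the verse count for Genesis and for every other book.</p>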



<h2 class="wp-block-heading">Bible Semantic Space to enable Bible NLP</h2>



<p>We feed the text of the verse column into a word embedding model to generate a semantic space. Word embedding models are natural language processing tools that translate words, and/or the contexts in which words appear, into vector representations, giving the computer a mathematical object to analyze. Here&#8217;s an analogy to explain the concept. In Einsteinian physics, position is not determined by an absolute space-time grid; position is <em>relative</em> to the position and movement of other objects.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img data-recalc-dims="1" fetchpriority="high" decoding="async" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/pexels-mikhail-nilov-7663144.jpg?resize=515%2C772&#038;ssl=1" alt="Bible NLP" class="wp-image-144" width="515" height="772"/></figure>



<p>Word embedding models function somewhat similarly: the meaning of a word is located relative to the other words it appears alongside. Word embedding models are initially trained on massive corpora (bodies) of linguistic data. Corpora are datasets of combined sources such as Wikipedia, the set of all tweets, an agglomeration of the last 20 years&#8217; worth of several publishers&#8217; news articles, etc. From this set of information, meaning is extracted by way of the co-occurrence of words and phrases.</p>
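<p>To make the idea of co-occurrence concrete, here is a small sketch that counts how often two words share a sentence. The two token lists are illustrative. Models such as word2vec learn from sliding context windows and do not build an explicit table like this, so treat it only as a picture of the raw signal.</p>

```python
from collections import Counter
from itertools import combinations

# Two short phrases from Genesis 1, already lowercased and tokenized.
sentences = [
    ["god", "created", "the", "heaven"],
    ["god", "saw", "the", "light"],
]

# Count each unordered pair of distinct words appearing in the same sentence.
cooccurrence = Counter()
for tokens in sentences:
    for pair in combinations(sorted(set(tokens)), 2):
        cooccurrence[pair] += 1

print(cooccurrence[("god", "the")])    # 2
print(cooccurrence[("god", "light")])  # 1
```

<p>Pairs that occur together more often, like &#8216;god&#8217; and &#8216;the&#8217; here, receive higher counts. Words that never share a sentence stay at zero.</p>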



<h2 class="wp-block-heading">Word2vec</h2>



<p>For our initial analysis, we have chosen <a href="https://radimrehurek.com/gensim/models/word2vec.html" target="_blank" rel="noreferrer noopener">word2vec</a>, as implemented in the <a href="https://radimrehurek.com/gensim/" target="_blank" rel="noreferrer noopener">gensim</a> package, as the embedding model. Word2vec was created in 2013 by a team of Google researchers, including Tomas Mikolov.</p>



<p>Depending upon our choice of semantic embedding model, we must utilize different preprocessing techniques to get our dataframe ready.  We will use gensim to preprocess our strings into a format most advantageous for word2vec.  Preprocessing consists of a number of techniques for manipulating your text for better machine representation, such as:</p>



<ul class="wp-block-list"><li>Removal of punctuation</li><li>Removal of stop words</li><li>Change case to lower</li><li>Stemming</li><li>Lemmatizing</li><li>Tokenization</li></ul>
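<p>Several of the steps above can be sketched with the Python standard library alone. This is a simplified stand-in, not the gensim pipeline itself: the stop word list is a hypothetical, deliberately tiny one, and stemming and lemmatizing are left out. Gensim ships comparable helpers, for example gensim.utils.simple_preprocess.</p>

```python
import string

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"and", "or", "the", "a", "in"}


def preprocess(verse):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = verse.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("In the beginning God created the heaven and the earth.")
print(tokens)  # ['beginning', 'god', 'created', 'heaven', 'earth']
```

<p>Applied to every row of the verse column, a function like this would yield one token list per verse.</p>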



<p>There is room for choice here. For example, our selection of <a href="https://en.wikipedia.org/wiki/Stop_word" target="_blank" rel="noreferrer noopener" title="stopwords">stopwords</a> to eliminate will alter the semantic space generated. Typical stopwords eliminated include &#8216;and&#8217;, &#8216;or&#8217;, &#8216;the&#8217;, and &#8216;a&#8217;. It is also possible with modern embedding models to keep all stopwords in the corpus. After preprocessing, each word is assigned a numerical token representation, as shown below:</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="975" height="259" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=975%2C259&#038;ssl=1" alt="" class="wp-image-176" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?w=975&amp;ssl=1 975w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=300%2C80&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image.png?resize=768%2C204&amp;ssl=1 768w" sizes="(max-width: 975px) 100vw, 975px" /></figure>
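<p>One simple way to assign such numerical tokens is to give each distinct word the next unused integer. The sketch below assumes tokenized verses as input. The numbering scheme is an illustration and will not necessarily match the IDs in the screenshot above.</p>

```python
# Tokenized verses (illustrative output of an earlier preprocessing step).
tokenized_verses = [
    ["beginning", "god", "created", "heaven", "earth"],
    ["god", "said", "let", "light"],
]

# Assign each distinct word the next unused integer, in order of first appearance.
token_to_id = {}
for tokens in tokenized_verses:
    for token in tokens:
        token_to_id.setdefault(token, len(token_to_id))

# Replace every word with its integer ID.
encoded = [[token_to_id[token] for token in tokens] for tokens in tokenized_verses]
print(encoded)  # [[0, 1, 2, 3, 4], [1, 5, 6, 7]]
```

<p>Note that &#8216;god&#8217; receives the same ID in both verses, which is what lets a model treat the two occurrences as the same word.</p>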



<p>Lastly, each tokenized word is assigned a point in an N-dimensional space. One can choose the number of dimensions to use. The general thought is that more dimensions are able to capture more sophisticated linguistic meaning. We will be using an N of 100 for our initial runs with word2vec. For comparison, several modern language models extend their parameter counts (a different measure from embedding dimensions) into the billions to achieve state-of-the-art performance. We can see the model output below.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="1010" height="500" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=1010%2C500&#038;ssl=1" alt="" class="wp-image-177" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?w=1010&amp;ssl=1 1010w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=300%2C149&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/image-1.png?resize=768%2C380&amp;ssl=1 768w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>



<p>We can use dimension reduction on the above object to reduce it to 2 dimensions for interpretable plotting.  We will cover that and more in a future post.</p><p>The post <a href="https://betterbiblos.com/2022/05/11/bible-semantic-space/">Generating a Bible Semantic Space</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://betterbiblos.com/2022/05/11/bible-semantic-space/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">141</post-id>	</item>
		<item>
		<title>Transforming the Bible into a Dataframe</title>
		<link>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=transforming-the-bible-into-a-dataframe</link>
					<comments>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Ryan Dominguez]]></dc:creator>
		<pubDate>Sun, 08 May 2022 05:37:42 +0000</pubDate>
				<category><![CDATA[Analysis]]></category>
		<category><![CDATA[Data Wrangling]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Bible]]></category>
		<guid isPermaLink="false">https://betterbiblos.com/?p=125</guid>

					<description><![CDATA[<p>Introduction to Transforming the Bible In this first post of a series of posts about [&#8230;]</p>
<p>The post <a href="https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/">Transforming the Bible into a Dataframe</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></description>
										<content:encoded><![CDATA[<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/pexels-symeon-ekizoglou-2105937.jpg?ssl=1" alt="" class="wp-image-127"/></figure>



<h2 class="wp-block-heading">Introduction to Transforming the Bible</h2>



<p>In this first post of a series about our <a href="https://betterbiblos.com/analyses/bible-nlp/" target="_blank" rel="noreferrer noopener" title="Bible NLP">Bible NLP Analysis</a>, we will be transforming the Bible into a dataframe. Part 1 focuses on ingesting the file into a broadly usable Pandas dataframe.</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1024" height="256" src="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1024%2C256&#038;ssl=1" alt="" class="wp-image-310" srcset="https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1024%2C256&amp;ssl=1 1024w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=300%2C75&amp;ssl=1 300w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=768%2C192&amp;ssl=1 768w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1536%2C384&amp;ssl=1 1536w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?resize=1568%2C392&amp;ssl=1 1568w, https://i0.wp.com/betterbiblos.com/wp-content/uploads/2022/05/bibletodata.png?w=1600&amp;ssl=1 1600w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>



<p>I used a .txt file of the King James Bible from the website <a href="https://sacred-texts.com/bib/osrc/index.htm" target="_blank" rel="noreferrer noopener" title="sacred-texts.com">sacred-texts.com</a>. The formatting of their file made it much easier to ingest the text in a logical way than other sources I have found.</p>



<h2 class="wp-block-heading">Python &amp; Bible NLP</h2>



<p>My language of choice for NLP is Python. Python has many excellent modules to aid in a wide variety of tasks, NLP included. Pandas is one such module in the data science realm. Links to many of the resources I use can be found on the Better Biblos <a href="https://betterbiblos.com/resources/" target="_blank" rel="noreferrer noopener" title="Resources">resources</a> page. In addition to many other functions, Pandas allows for easy ingestion of the .txt file using:</p>



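<pre class="wp-block-code has-white-background-color has-background"><code># "!" is a notebook shell escape; run "pip install pandas" in a terminal otherwise
!pip install pandas
import pandas as pd
# Assumption: every line of kjvdat.txt ends with "~". Splitting on it gives the
# two columns shown below, and header=None keeps Genesis 1:1 as data.
bible = pd.read_csv(r'...\kjvdat.txt', sep='~', header=None)
bible.head()</code></pre>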



<p>Inspecting the object using bible.head(), we observe the following tabular structure:</p>



<figure class="wp-block-table is-style-regular" style="font-size:15px"><table class="has-white-background-color has-background has-fixed-layout"><tbody><tr><td>0</td><td>1</td></tr><tr><td>Gen|1|1| In the beginning God created the heav&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|2| And the earth was without form, and v&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|3| And God said, Let there be light: and&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|4| And God saw the light, that it was go&#8230;</td><td>NaN</td></tr><tr><td>Gen|1|5| And God called the light Day, and the&#8230;</td><td>NaN</td></tr></tbody></table></figure>



<p>Not too bad, but we can use the logical structure of the Bible (books, chapters, and verses) to create a more accessible object. In particular, we split each row on the pipe (&#8220;|&#8221;) delimiter character conveniently included in the file, using the .split command in line 1 below, and then explicitly name the automatically generated columns:</p>



<pre class="wp-block-code has-white-background-color has-background"><code>bible = bible&#91;0].str.split('|', n=3, expand=True)  # n=3 caps the split at four fields
bible.columns = &#91;'book','chapter','verse_number','verse']  # name the generated columns 0-3
bible.head()</code></pre>



<figure class="wp-block-table" style="font-size:15px"><table class="has-white-background-color has-background has-fixed-layout"><tbody><tr><td>book</td><td>chapter</td><td>verse_number</td><td>verse</td></tr><tr><td>Gen</td><td>1</td><td>1</td><td>In the beginning God created the heaven and t&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>2</td><td>And the earth was without form, and void; and&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>3</td><td>And God said, Let there be light: and there w&#8230;</td></tr><tr><td>Gen</td><td>1</td><td>4</td><td>And God saw the light, that it was good: and &#8230;</td></tr><tr><td>Gen</td><td>1</td><td>5</td><td>And God called the light Day, and the darknes&#8230;</td></tr></tbody></table></figure>
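<p>The steps above can also be tried without the original file. The sketch below uses two in-memory lines in the layout I am assuming for kjvdat.txt (pipe-separated fields with a trailing &#8220;~&#8221;), with the second verse shortened for brevity. The type conversion and whitespace trimming are optional extras, not part of the original steps.</p>

```python
import io

import pandas as pd

# Two lines in the assumed kjvdat.txt layout: pipe-separated fields, "~" at line end.
sample = io.StringIO(
    "Gen|1|1| In the beginning God created the heaven and the earth.~\n"
    "Gen|1|2| And the earth was without form, and void.~\n"
)

# Splitting on "~" leaves the whole verse record in column 0.
raw = pd.read_csv(sample, sep="~", header=None)

# n=3 caps the split at four pieces, so a stray pipe in a verse cannot add columns.
bible = raw[0].str.split("|", n=3, expand=True)
bible.columns = ["book", "chapter", "verse_number", "verse"]

# The split leaves strings; convert the numeric columns and trim the leading space.
bible["chapter"] = bible["chapter"].astype(int)
bible["verse_number"] = bible["verse_number"].astype(int)
bible["verse"] = bible["verse"].str.strip()

# Query a single verse by book, chapter, and verse number.
is_gen_1_2 = (
    (bible["book"] == "Gen")
    & (bible["chapter"] == 1)
    & (bible["verse_number"] == 2)
)
match = bible[is_gen_1_2]
print(match["verse"].iloc[0])
```

<p>Converting chapter and verse_number to integers makes numeric comparisons and sorting behave as expected, since the split itself produces text.</p>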



<p>We can now very easily query subsets of data by books, chapters, and verses.</p><p>The post <a href="https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/">Transforming the Bible into a Dataframe</a> appeared first on <a href="https://betterbiblos.com">Better Biblos</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://betterbiblos.com/2022/05/08/transforming-the-bible-into-a-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">125</post-id>	</item>
	</channel>
</rss>
