Overview
An idea that I have been trying to wrap my head around for some time is the automatic recognition and connection of abstract concepts. This project served as a preliminary test of this idea.
The goal was to take a website that teaches mathematics and build an abstraction of each page using document embeddings. Each embedding is a condensed representation of all of the content on one web page. The pages cover algebra and calculus from beginning to advanced material and then move on to ordinary differential equations. To evaluate whether I could accurately recreate the relationships between these abstract concepts using the embeddings, I tracked the links created by the site’s author and aimed to predict them. The assumption was that the site is set up to link each lesson sequentially to other topics. Lessons are also referenced directly through links within each page.
The embeddings were created by parsing each page, removing all links, and building the embedding from the remaining content with Doc2Vec. From these embeddings, connections between concepts were produced by comparing the similarity of each pair of embeddings. The experiment resulted in a rough recreation of the original site's structure, built entirely from the site’s content.
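The comparison step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three-dimensional vectors are invented stand-ins for real Doc2Vec output, the page names are hypothetical, and the rule "link each page to its k most similar pages" is only one plausible way to turn similarities into predicted links. The original project may have used a different measure or threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_links(embeddings, k=1):
    """Predict a directed link from each page to its k most similar pages."""
    predicted = set()
    for page, vec in embeddings.items():
        scored = sorted(
            (
                (cosine(vec, other_vec), other)
                for other, other_vec in embeddings.items()
                if other != page
            ),
            reverse=True,
        )
        for _, other in scored[:k]:
            predicted.add((page, other))
    return predicted


def precision_recall(predicted, actual):
    """Score predicted links against the links the site's author created."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy embeddings; real ones would come from Doc2Vec.
embeddings = {
    "limits": [0.9, 0.1, 0.0],
    "derivatives": [0.8, 0.3, 0.1],
    "integrals": [0.2, 0.9, 0.1],
    "separable_odes": [0.1, 0.7, 0.6],
}

# Hypothetical author links, one per sequential lesson.
actual_links = {
    ("limits", "derivatives"),
    ("derivatives", "integrals"),
    ("integrals", "separable_odes"),
}

predicted_links = predict_links(embeddings, k=1)
precision, recall = precision_recall(predicted_links, actual_links)
print(sorted(predicted_links))
print(precision, recall)
```

Treating links as directed pairs keeps the scoring simple, but it penalizes a prediction that points "backwards" along a lesson sequence. Comparing unordered pairs instead would be a reasonable alternative if only the existence of a connection matters.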
The original site was also recreated a second way, using the site's existing links and graph extension algorithms. The resulting entities were added to GraphDB to support this process and are visualized in the image shown.
Technology
The embeddings were produced with Doc2Vec from data parsed from PaulCalculusNotes.com with BeautifulSoup. NLTK and GraphDB were also used, the latter to visualize the entity relationships.
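The parsing step, which separates a page's text from its links, might look roughly like the sketch below. It deliberately swaps in Python's standard-library html.parser for BeautifulSoup so that it runs without extra dependencies, so it is not the project's actual parser. The sample HTML, the class name PageParser, and the function name split_page are all hypothetical, and dropping the anchor text along with the link target is an assumed reading of "removing all links".

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect link targets and the page text that lies outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values, kept as the ground truth to predict
        self.text_parts = []   # text fragments found outside anchor elements
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Assumption: anchor text is dropped so it cannot leak link information.
        if self._anchor_depth == 0:
            self.text_parts.append(data)


def split_page(html):
    """Return (link-free text, list of hrefs) for one HTML page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    return text, parser.links


# Hypothetical page fragment, not taken from the real site.
sample_html = (
    "<p>The derivative measures rate of change. "
    'See <a href="/limits.html">Limits</a> first.</p>'
)
text, links = split_page(sample_html)
print(text)
print(links)
```

The link-free text would then be tokenized and passed to the embedding model, while the collected hrefs serve as the reference links for evaluation.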