Analyzing Hierarchical Data with a Graph Database and Displaying the Results

Loading

SWIPE ↑ or

Introductory Video

Here is a talk I am proposing for PyData warsaw and elsewhere. 

Jupiter Notebooks, Pandas, and PySpark are great at analyzing data organized as  a table or an array, but what if your data is hierarchical, or worse yet a graph?   Yes, Apache Arrow now supports statically bound parent pointers, but that is a far cry from a persistent graph of objects with a dynamically changing collection of attributes. Something simpler and more powerful is needed, particularly when the data exceeds the size of the available memory.  

PythonLinks.info organizes Python videos into a tree of categories, using a graph database written in Python and optimized in C.    A database is needed because data from multiple sources is imported, merged, categorized and edited. A hierarchical database is best for managing a tree of categories.   A graph database not only supports hierarchies,  but also allows for  bidirectional links between talks, and their conferences and authors. Videos can be accessed by traversing the tree of categories, or by using the canonical url index, the YouTube video id index, or the Twitter hashtag index.

First the videos are imported from youtube.    If needed, the talk descriptions are retrieved by crawling the conference website.   Next  the 710 videos are  sorted into a tree of  100  categories.  A basic principle in human factors is that there should be no  more than about 7 videos in each category.  A hierarchical analysis is used to calculate the number of videos in each branch of the tree.  The data is served as a json tree, the client javascript reassembles it into a graph, and then the client javascript calculates the best videos in each branch of the tree based on up votes versus total votes.   A desktop interface allows the user to see where he is in the taxonomy.  A cell phone interface lets the the user navigate the tree  by swiping left right up or down.    The user can also search, and display the results as a tree. 

Using a graph database led to a huge decrease in system complexity.

Comment On Twitter