Here is a talk I am proposing for PyData Warsaw and elsewhere.
Jupyter Notebooks, Pandas, and PySpark are great at analyzing data organized as a table or an array, but what if your data is hierarchical or, worse yet, a graph? Yes, Apache Arrow now supports statically bound parent pointers, but that is a far cry from a persistent graph of objects with a dynamically changing collection of attributes. Something simpler and more powerful is needed, particularly when the data exceeds the size of the available memory.
PythonLinks.info organizes Python videos into a tree of categories, using a graph database written in Python and optimized in C. A database is needed because data from multiple sources is imported, merged, categorized, and edited. A hierarchical database is best for managing a tree of categories. A graph database not only supports hierarchies but also allows bidirectional links between talks and their conferences and authors. Videos can be accessed by traversing the tree of categories, or by using the canonical URL index, the YouTube video ID index, or the Twitter hashtag index.
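To make the data model above concrete, here is a minimal in-memory sketch in plain Python. It is not the PythonLinks.info code and it does not use the site's actual database; all class, method, and attribute names (Category, Video, Catalog, add_video, and so on) are hypothetical, and persistence is left out entirely. It only illustrates the three ideas described: a tree of categories, bidirectional links between talks, conferences, and authors, and lookup indexes by URL, YouTube ID, and hashtag.

```python
class Category:
    """A node in the category tree (hypothetical name)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # link up the tree
        self.children = {}        # links down the tree, keyed by name
        self.videos = []

    def add_child(self, name):
        child = Category(name, parent=self)
        self.children[name] = child
        return child

    def path(self):
        # Walk the parent links back to the root.
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


class Conference:
    def __init__(self, name):
        self.name = name
        self.talks = []           # back-links to videos


class Author:
    def __init__(self, name):
        self.name = name
        self.talks = []           # back-links to videos


class Video:
    def __init__(self, title, url, youtube_id, hashtag=None):
        self.title = title
        self.url = url
        self.youtube_id = youtube_id
        self.hashtag = hashtag
        self.category = None
        self.conference = None
        self.authors = []


class Catalog:
    """Holds the root of the tree plus the three lookup indexes."""

    def __init__(self):
        self.root = Category("root")
        self.by_url = {}
        self.by_youtube_id = {}
        self.by_hashtag = {}      # assumption: one hashtag, many videos

    def add_video(self, video, category, conference, authors):
        # Every link is stored in both directions.
        video.category = category
        category.videos.append(video)
        video.conference = conference
        conference.talks.append(video)
        for author in authors:
            video.authors.append(author)
            author.talks.append(video)
        # Keep the indexes in step with the graph.
        self.by_url[video.url] = video
        self.by_youtube_id[video.youtube_id] = video
        if video.hashtag:
            self.by_hashtag.setdefault(video.hashtag, []).append(video)
```

With this sketch, the same video object can be reached by walking catalog.root.children, by following conference.talks or author.talks, or by a direct lookup such as catalog.by_youtube_id[some_id]. A persistent graph database would keep the same object shape while storing the objects on disk, so the whole graph need not fit in memory.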
Using a graph database led to a huge decrease in system complexity.