Python in the Archive: Computational Data Mining of Historical Records from South Asia

I was recently awarded two research grants to begin a project that combines techniques from computer science with data and questions from the fields of History & South Asia Studies. The idea grew out of the Computational Data Exploration (CIS-105) course I took with Dr. Arvind Bhusnurmath in the Computer Science Department here at Penn. The Center for the Advanced Study of India and the Price Lab for Digital Humanities supported the project with generous research grants that allowed me to gather a team of researchers and workers across the ranks at the University of Pennsylvania. We built a relational database, wrote basic code in Python, generated key visuals using packages like Pandas and NumPy, and most importantly, tested and proved the merits of combining two radically different disciplines into a broader humanistic pursuit. This project is in its initial stages, and I hope to carry it forward beyond my graduate student career here at Penn. Additional information about the endeavor, including a two-page summary of the pilot project and a link to some visualizations we generated, can be found below.

Python in the Archives: Computational data mining and visualization of historical records from Mughal India, ca. 1352-1850

Project Abstract: Our world is producing information faster than we can analyze it. In fact, ninety percent of today’s data has been generated in the past two years alone. The challenges of managing and making sense of endless facts and figures have provided a catalyst to the growing field of computational data science. In addition, more and more non-specialists are partaking in algorithmic work because programming languages like Python are user-friendly and relatively easy to learn. For the most part, computational analysis remains oriented towards predictive modeling and optimization for the benefit of business and politics. My project moves away from this trend by synthesizing approaches in data science with historical questions and primary sources from early-modern India (AD 1352-1850). My initial dataset will be generated from one of six detailed archival catalogs compiled by the National Archives of India between 1982 and 2011. The data comprises 627 descriptive entries that will be assembled into a relational database, analyzed using statistical tools in Python, and visualized according to key research questions. I believe that my method will allow me to discover broader patterns, trends, and associations between constituent elements of the archive that a more traditional reading of selected documents cannot provide. Finally, I hope to demonstrate ways that scholars in the humanities can incorporate computational methods into their research, and how data scientists can benefit from the interesting issues and problems that historical sources present, such as working with uneven datasets and the multiple representational forms that our evidence takes. Our intended audience comprises researchers in the social sciences and humanities along with those working in the emerging fields of data science, visualization, and digital humanities. Our main objective is to demonstrate the value of computational tools for both creating and analyzing unconventional datasets.
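To give a rough sense of what assembling catalog entries into a relational database and summarizing them in Python might look like, here is a minimal sketch. It is not the project's actual code or schema: the table names, column names, and the four placeholder rows are all invented for illustration, and it uses only Python's built-in sqlite3 module rather than the packages named above.

```python
import sqlite3

# Hypothetical placeholder data, invented for illustration only.
# These rows are NOT entries from the actual archival catalog.
SAMPLE_TYPES = [(1, "land grant"), (2, "petition")]
SAMPLE_ENTRIES = [
    # (entry id, year of document, document type id)
    (1, 1605, 1),
    (2, 1612, 1),
    (3, 1748, 2),
    (4, 1751, 1),
]


def build_database(types, entries):
    """Create an in-memory database with two related tables (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document_types (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE entries ("
        "id INTEGER PRIMARY KEY, "
        "year INTEGER, "
        "type_id INTEGER REFERENCES document_types(id))"
    )
    conn.executemany("INSERT INTO document_types VALUES (?, ?)", types)
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def count_by_century(conn):
    """Count entries per century; integer division drops the last two digits."""
    rows = conn.execute(
        "SELECT (year / 100) * 100 AS century, COUNT(*) "
        "FROM entries GROUP BY century ORDER BY century"
    ).fetchall()
    return dict(rows)


def count_by_type(conn):
    """Count entries per document type by joining the two tables."""
    rows = conn.execute(
        "SELECT t.name, COUNT(*) "
        "FROM entries AS e JOIN document_types AS t ON e.type_id = t.id "
        "GROUP BY t.name ORDER BY t.name"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_database(SAMPLE_TYPES, SAMPLE_ENTRIES)
    print(count_by_century(conn))
    print(count_by_type(conn))
```

Counts like these are the kind of summary that could then be handed to a plotting or statistics package. Keeping document types in their own table is one possible design choice, made here only so that the same label is not retyped, and possibly misspelled, across hundreds of entries.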

brief summary of results

visualizations generated through Python

Data Analysts:

  • Sudev J Sheth, Project Lead
    Doctoral Candidate, School of Arts & Sciences (South Asia Studies, History)
  • Jennifer Sui, Undergraduate Student
    School of Arts & Sciences ’17 (Economics, Statistics)

Faculty Mentors:

  • Dr. Ramya Sreenivasan (Department of South Asia Studies)
  • Dr. Arvind Bhusnurmath (Department of Computer and Information Science)
  • Dr. Devesh Kapur (Department of Political Science)
  • Dr. Sayan Bhattacharyya (Price Lab for Digital Humanities)