Data science is a growing field facing a major challenge: industry demand for students trained in data science is outstripping the supply. Furthermore, many educators lack the resources to create courses in data science.
Susan Davidson, Weiss Professor of Computer and Information Science, raised this issue in a keynote address at February’s Women in Data Science Philadelphia@Penn conference, a regional offshoot of a global initiative to “inspire and educate data scientists worldwide, regardless of gender, and to support women in the field.” Data scientists devise new ways of generating knowledge and insights from the massive datasets enabled by the digital revolution, and so are intimately familiar with the challenges that come with scale. Fittingly, Davidson closed her talk with a proposal for how to meet the growing demand for data science education.
Davidson and Zachary Ives, Adani President’s Distinguished Professor and Chair of the Department of Computer and Information Science, have teamed up with IBM and the Linux Foundation to create the OpenDS4All project, an open-source data science curriculum that will provide educators with access to the tools and materials needed to establish data science courses at institutions across the world.
The release of OpenDS4All is especially timely in light of the COVID-19 pandemic, as data science is well suited to a virtual classroom setting. Now that it is live, educators around the world can access its learning modules, design their own lessons, and even deliver content remotely using the growing number of tools for video conferencing, scheduling, office hours management, and remote collaboration.
The project began when Davidson and Ives were approached by IBM about creating an open-source platform for sharing data science curricula. Several years ago, Davidson and Ives created CIS 545: Big Data Analytics for Penn students, and have devoted themselves to fine-tuning the course over successive offerings. OpenDS4All expands on the course materials for CIS 545 and divides the information into eight modules with topics such as “data wrangling and integration,” “exploratory data analysis,” “scalable data processing,” “machine learning,” “ethics,” and more.
OpenDS4All will provide Creative Commons-licensed slides, instructor notes, hands-on examples in Python Jupyter notebooks, and homework assignments to educators, and will encourage those drawing on the curriculum materials to develop robust courses for their students.
Davidson says that OpenDS4All was an ambitious project that presented interesting challenges. To make the curriculum accessible to many different communities, she and Ives worked to ensure that educators could use it for short courses, segments of a longer course, quarter-system courses, or semester-long courses.
“We had to make sure we were giving different paths to use the materials,” Davidson says.
The team also worked to ensure that institutions with fewer resources could still access the curriculum. Many universities have reliable internet access and cloud-based software, but a school in a remote location may not. For such schools, Davidson and Ives built in pathways to download the materials needed for lectures, hands-on examples, and assignments, and to run the Jupyter notebooks locally, so that all interested institutions could use the OpenDS4All materials.
Davidson also noted that data privacy became an issue when building the project. When incorporating datasets into the hands-on examples and homework assignments, Davidson and Ives drew on publicly available datasets. Their work hit a roadblock, however, when a dataset they pulled from Kaggle, Google’s online repository of public data, was taken down. The dataset was originally sourced from LinkedIn, but due to growing privacy concerns for its members, LinkedIn removed it from public use.
“We couldn’t, in good conscience, use the data that had been taken down,” Davidson says. “One of the OpenDS4All modules is about this very thing — data science and ethics. One of the key issues in this work is ‘how do you protect the privacy of others?’ and we wrestled with that first-hand when we developed this curriculum.”
The team chose to replace the removed dataset with synthetic data. While it wasn’t their first choice for the curriculum, Davidson says that this is exactly the kind of issue data scientists must contend with and thus an important object lesson on ethics for students.
Davidson and Ives hope that others will grow the project, since OpenDS4All’s open-source model allows users to add content. ODPi, a nonprofit organization that is supported by the Linux Foundation and committed to “standardizing the big data ecosystem,” will consult with a Technical Steering Committee to manage added content.
While Davidson and Ives are proud of OpenDS4All and the information it will share with the world, they are more excited for the future of the project, as they see their work as laying the foundation for a much larger endeavor.
“This initial release represents a first step towards a broader data science curriculum,” Ives says. “We look forward to working with our colleagues on the OpenDS4All Technical Steering Committee and contributors from across the world to expand and enrich it, and ideally to translate it into other languages and audiences.”
OpenDS4All is available here. Read more about Davidson and Ives’ work on the Computer and Information Science webpage.