Data Science
“Data Science” has become a buzzword everybody have heard of – but not many really know what it designates. It is a global term representing all the methods to extract information from data. The methods can be of various type – scientific, algorithmic, the data can come from various source and under various format.
Data Science is a broad term. Within the project “Data Quality Explored” (QuaXP), we offer a course to give the tools for beginners and advanced learners to understand and study a part of the field: Data Quality for Machine Learning. Machine Learning is the part of Data Science which deals with solving problems from data using algorithms. It is an automated task, and the main issue for the scientist is to understand how the algorithm works, meaning how it uses the data that are fed to it and how it can be improved. To get a better idea about the course, visit the announcement post on the HOOU blog.
Why should Data Science be accessible for everybody? An example
We now got an idea of Data Science as a high scientific topic, mainly aimed to researchers, mathematicians and computer scientists. However, the numerous applications of Data Science in our daily life makes it a concern to everyone.
Many current societal applications exploit the potential of data science and the big amounts of data available. Sometimes, those applications can lead to weird or even shocking results.
In 2016, an international beauty contest took place online: Beauty.AI had the particularity to let machine learning algorithms decide on the winners. The idea was to get free of any human bias in the perception of beauty, the project being used later for gauging health from a picture of the face.
This is a collection of the 44 winners of this international beauty contest (source: beauty.ai):
As we can see, despite the contest being international, a vast majority of the winners is light-skinned. Not so much of a judgment “free of bias”.
This happened because the data the “judges” were trained on did not have enough dark-skinned photos to develop a pattern over them, and considered them as an exception.
It is important for the society to understand that those kind of mistakes can happen, and why they happen, to try and avoid building a technology unreflective of nowadays societal struggles.
Data and code literacy
It is possible to study the underlying concepts of Data Science without a strong scientific knowledge, if the goal is not to be able to run experiments, but only to understand what impacts the results. This separation in skills needed for data science can be explained with two concepts: data literacy and code literacy (2).
Data literacy represents the ability to understand and make decision from data in their context (3). This ability does not necessarily mean an ability to code experiments, and the ways to understand, visualize and manipulate the data do not have to involve coding, as many sofwares exist to get a grasp on a dataset without running actual code.
Code literacy, on the other hand, represents the ability to understand the underlying concepts of technology (4), and in our particular case, some data science functions and libraries.
A separation between those two skills makes it clear that it is possible to have data literacy without necessarily having code literacy. This is the approach we take in the project QuaXP, where we propose two levels of difficulty:
- Beginner level: the student is taught mainly data literacy with the help of widgets and graphs to visualize the data.
- Advanced level: the student is taught data and code literacy, in the form of Python code and use of libraries.
Combining both levels in one course: discussion
Besides allowing a broader audience to follow the course, the development of two levels of difficulty can be useful for the learners themselves. One point to note is that the course is globally similar for both levels, in that the text does not change much between the beginner and the advanced level, the only exception being the text about the code itself. The visualization cells are the main point differing between the two versions.
The learners following the beginner level have the possibility to reveal some advanced content, if their curiosity makes them brave enough to look at the code content. They also beneficiate from the fact that the content needs to be quite deep to interest advanced learners, therefore giving more precise explanations, even for the beginner level.
For the advanced learners, the advantages are richer: they can use the beginner content as a summary content, to have a preview of the course, or to remember the essential points. The explanations are low level, but do not lack precision, making them understandable for everybody: a person with code literacy does not necessarily have enough data literacy to understand a high-level content.
On the other hand, an unclear separation between both levels could lead to frustration. One can think of a learner who does not feel like a complete beginner, but who also cannot follow the advanced level. This learner might find him/herself frustrated by the low-difficulty of the beginner level, but also not feel brave enough to follow the advanced level. Though, this pitfall can be avoided by adjusting both difficulty levels and making transitions smoother. Also, tutorials are presented in the introduction of the course to be able to learn the coding knowledge necessary to follow the advanced level.
Another difficulty could be a beginner who already finds him/herself too challenged by the beginner level. We try to avoid that by testing our course material before the release, but this may always happen. A solution to that is to create a lighter content and propose a summary of the most important points of the course under a video format. This can be thought of as an additional level, lighter than the beginner level, where the learner watches a few videos to get the most important points raised by the course.
Lastly, the course can be used in class as a support, either to introduce a topic to the students, or as a complete part of the teaching material. If it is used on the side, it can allow to flatten the starting difference of skills and understanding between the students.
How would you feel about following a course targeted for different difficulty levels?
References
- (1) beauty.ai
- (2) Kastner et al., 2020: Teaching Machine Learning and Data Literacy to Students of Logistics using Jupyter Notebooks, Die 18. Fachtagung Bildungstechnologien der Gesellschaft für Informatik e.V.
- (3) Koltay, T.: Data literacy: in search of a name and identity. In Journal of Documentation, 2015, 71; pp. 401–415.
- (4) Donick, M.: Die Unschuld der Maschinen. Springer Fachmedien Wiesbaden, Wiesbaden, 2019.
That is the HOOU project QuaXP at the TUHH: