“Data Science” has become a buzzword everyone has heard of – but few really know what it actually means. It is an umbrella term for all the methods used to extract information from data. These methods can be of various types – scientific, algorithmic – and the data can come from various sources and in various formats.
Data Science is a broad term. Within the project “Data Quality Explored” (QuaXP), we offer a course that gives beginners and advanced learners the tools to understand and study one part of the field: Data Quality for Machine Learning. Machine Learning is the part of Data Science that deals with solving problems from data using algorithms. It is an automated task, and the main issue for the scientist is to understand how the algorithm works, meaning how it uses the data that are fed to it and how it can be improved. To get a better idea of the course, visit the announcement post on the HOOU blog.
We now have an idea of Data Science as a highly scientific topic, mainly aimed at researchers, mathematicians and computer scientists. However, the numerous applications of Data Science in our daily life make it a concern for everyone.
Many current societal applications exploit the potential of data science and the large amounts of data available. Sometimes, those applications can lead to strange or even shocking results.
In 2016, an international beauty contest took place online: Beauty.AI had the particularity of letting machine learning algorithms decide on the winners. The idea was to be free of any human bias in the perception of beauty, with the project later intended to be used for gauging health from a picture of the face.
This is a collection of the 44 winners of this international beauty contest (source: beauty.ai):
As we can see, despite the contest being international, a vast majority of the winners are light-skinned. Not so much a judgment “free of bias” after all.
This happened because the data the “judges” were trained on did not contain enough photos of dark-skinned people for the algorithms to learn a pattern for them, so they were treated as exceptions.
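To see why an imbalanced training set produces this kind of failure, here is a minimal sketch with toy data (this is not the actual Beauty.AI pipeline; the labels and counts are invented for illustration). The “model” simply predicts the majority label it saw during training – an extreme caricature of what happens when one group dominates the data:

```python
# Toy demonstration of class imbalance (labels and counts are hypothetical).
from collections import Counter

# Hypothetical training set: 95 light-skinned examples, only 5 dark-skinned.
train_labels = ["light"] * 95 + ["dark"] * 5

# A naive "model": always predict the most common label seen in training.
majority = Counter(train_labels).most_common(1)[0][0]

# On a test set with the same skewed distribution, overall accuracy looks fine...
test_labels = ["light"] * 95 + ["dark"] * 5
accuracy = sum(majority == y for y in test_labels) / len(test_labels)
print(accuracy)  # 0.95 -- looks like a good model

# ...but every single example from the minority group is misclassified.
dark_recall = sum(majority == y for y in test_labels if y == "dark") / 5
print(dark_recall)  # 0.0
```

A real classifier is more subtle than a majority vote, but the mechanism is the same: with too few minority examples, the model can score well overall while systematically failing that group.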
It is important for society to understand that these kinds of mistakes can happen, and why they happen, in order to avoid building technology that ignores today’s societal struggles.
It is possible to study the underlying concepts of Data Science without a strong scientific background, if the goal is not to run experiments but only to understand what impacts the results. This separation of the skills needed for data science can be explained with two concepts: data literacy and code literacy (2).
Data literacy is the ability to understand and make decisions from data in their context (3). This ability does not necessarily imply an ability to code experiments, and the ways to understand, visualize and manipulate data do not have to involve coding, as many software tools exist to get a grasp of a dataset without running actual code.
Code literacy, on the other hand, is the ability to understand the underlying concepts of technology (4) – in our particular case, some data science functions and libraries.
A separation between those two skills makes it clear that it is possible to have data literacy without necessarily having code literacy. This is the approach we take in the project QuaXP, where we propose two levels of difficulty:
Besides allowing a broader audience to follow the course, the development of two levels of difficulty can be useful for the learners themselves. One point to note is that the course is globally similar for both levels: the text does not change much between the beginner and the advanced level, the only exception being the text about the code itself. The visualization cells are the main difference between the two versions.
Learners following the beginner level can reveal some advanced content, if their curiosity makes them brave enough to look at the code. They also benefit from the fact that the content needs to be quite deep to interest advanced learners, which results in more precise explanations, even at the beginner level.
For the advanced learners, the advantages are richer: they can use the beginner content as a summary, to preview the course or to recall the essential points. The explanations are accessible but do not lack precision, making them understandable for everybody: a person with code literacy does not necessarily have enough data literacy to understand high-level content.
On the other hand, an unclear separation between the two levels could lead to frustration. Consider a learner who does not feel like a complete beginner, but who also cannot follow the advanced level. This learner might be frustrated by the low difficulty of the beginner level, yet not feel brave enough to attempt the advanced one. However, this pitfall can be avoided by adjusting both difficulty levels and making transitions smoother. In addition, tutorials are provided in the introduction of the course so that learners can acquire the coding knowledge necessary to follow the advanced level.
Another difficulty could be a beginner who already finds the beginner level too challenging. We try to avoid this by testing our course material before release, but it can always happen. One solution is to create lighter content and propose a summary of the most important points of the course in video format. This can be thought of as an additional level, lighter than the beginner level, where the learner watches a few videos covering the most important points raised by the course.
Lastly, the course can be used in class as supporting material, either to introduce a topic to the students or as a full part of the teaching material. If it is used on the side, it can help flatten the initial differences in skills and understanding between students.
How would you feel about following a course targeted for different difficulty levels?
That is the HOOU project QuaXP at the TUHH:
The topic of the course is data quality applied to machine learning: we use machine learning problems and datasets to compare the effects of differences in data quality on the performance of the models. The problem of data quality appears in a step called “preprocessing”, which comes before the development of the machine learning model. This preprocessing step is the one we focus on in the course; the machine learning task is only there as a verification step, to understand how changes in data quality can affect the performance of the model.
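The preprocessing-then-verify workflow can be sketched in a few lines of Python (the numbers and the quality issue here are invented, not taken from the course datasets). The “model” is deliberately trivial – it predicts the mean of the training values – so that the only thing changing between the two runs is the data quality:

```python
# Toy sketch: how one preprocessing decision changes model performance.
# Values are hypothetical ship speeds in knots; 999.0 plays a sensor error.
raw_speeds = [10.2, 11.0, 9.8, 10.5, 999.0]

# Preprocessing step: drop implausible outliers before training.
clean_speeds = [v for v in raw_speeds if v < 100]

def mean(xs):
    return sum(xs) / len(xs)

def mae(prediction, truth):
    # Mean absolute error of a constant prediction against true values.
    return sum(abs(prediction - t) for t in truth) / len(truth)

# Verification step: evaluate the same trivial "model" on held-out values.
test_speeds = [10.0, 10.4, 10.1]

error_raw = mae(mean(raw_speeds), test_speeds)
error_clean = mae(mean(clean_speeds), test_speeds)
print(error_raw)    # huge error, dominated by the single outlier
print(error_clean)  # small error after the preprocessing step
```

A real course exercise would train an actual model, but the pattern is the same: the model stays fixed, the preprocessing varies, and the evaluation reveals the effect of data quality.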
We want the learner to understand the work done on a real problem, giving meaning to the whole presented process. Examples have to be found, in the form of machine learning problems and their associated datasets. The first step of designing this course was to find the right problems to work on. We identified four main requirements for the ideal problem and the related dataset:
For the first chapter concerned by this issue, we have already limited our range of possibilities: we want a problem and a dataset dealing mostly with numerical data, and we want to use logistics data because we already have experience with it. Specifically, we previously worked with Automatic Identification System (AIS) data, spatio-temporal data collected from ships traveling at sea. The data contain the ship’s characteristics (identification code, name, destination, size, …) and specific data about the current trip (position, speed, orientation, …). We therefore looked for problems using AIS data, and openly accessible AIS datasets.
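To make the structure of AIS data more concrete, here is an illustrative sketch of a single record, split into the two families of fields described above. The field names and values are simplified placeholders – the actual columns in a provider’s files (e.g. the U.S. Marine Cadastre CSVs) have their own names and formats:

```python
# Hypothetical AIS record (field names simplified for illustration).
record = {
    # Static ship characteristics
    "mmsi": 211234560,          # identification code
    "name": "EXAMPLE SHIP",
    "destination": "HAMBURG",
    "length_m": 180.0,          # size
    "width_m": 28.0,
    # Dynamic data about the current trip
    "lat": 53.54,               # position
    "lon": 9.99,
    "speed_knots": 12.3,
    "course_deg": 87.0,         # orientation
}

static_fields = {"mmsi", "name", "destination", "length_m", "width_m"}
dynamic = {k: v for k, v in record.items() if k not in static_fields}
print(sorted(dynamic))  # ['course_deg', 'lat', 'lon', 'speed_knots']
```

The static fields describe the ship itself and rarely change, while the dynamic fields are re-broadcast continuously during a trip – which is exactly what makes AIS data spatio-temporal.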
AIS data can be obtained openly from a few organizations, which often differ in their formats, so we had several options to look at (e.g.: U.S. Marine Cadastre, aprs.fi, sailwx.info). For the last two of these websites, downloading the datasets is not very easy for a novice, so the best option is the open provider of the U.S. Marine Cadastre. Many more AIS providers can be found, but they are unfortunately all commercial.
As expected, the set of requirements presented earlier is not easy to combine into a usable problem and dataset. Finding a problem that is simple enough for the beginner level, yet interesting with regard to its research significance, is rather hard. Obviously, current research is not made of problems that are easy to solve. We could try to adapt a current research problem into an easier task and build less efficient models, but even for that we would need a more complex model than the simplest ones in the machine learning libraries. In our case, it is hard to create a very simple model with spatio-temporal data. Furthermore, given the specific focus on data quality, we need the collected data to be as raw as possible.
We analyzed two problems:
To make sure that beginner learners are able to understand the core of the problem and the ways to solve it, it is necessary to settle on tasks that are not a current hot research topic. As this could mean less interesting content for advanced learners, we need to make sure that the tasks we give are connected to the real world and plausible: this can be achieved with storytelling, by creating an engaging experience for the learner in addition to the pure academic knowledge.
In the end, the tasks we propose to solve are rather simple (predict the length of the ship from its width, predict the mean speed from the type of vessel, …), but we can wrap them in a believable story: the learners are data scientists in charge of analyzing the AIS data they receive and solving small problems for a coastal station that needs information about ships. We also have an introductory part about the data themselves, so that the learner engages with the topic of maritime logistics at the beginning of the course and gains more interest in the problems solved later.
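The first of these tasks – predicting a ship’s length from its width – can be sketched with a one-variable least-squares regression. The widths and lengths below are invented toy values, not real AIS measurements, and the course itself uses library implementations rather than this by-hand version:

```python
# One-variable linear regression by hand (toy ship dimensions in meters).
widths  = [20.0, 25.0, 30.0, 32.0, 28.0]
lengths = [140.0, 170.0, 200.0, 215.0, 190.0]

n = len(widths)
mean_w = sum(widths) / n
mean_l = sum(lengths) / n

# Least-squares slope: covariance of (width, length) over variance of width.
slope = (sum((w - mean_w) * (l - mean_l) for w, l in zip(widths, lengths))
         / sum((w - mean_w) ** 2 for w in widths))
intercept = mean_l - slope * mean_w

# Predict the length of a ship that is 27 m wide.
predicted = slope * 27.0 + intercept
print(round(predicted, 1))
```

The fitted line captures the simple intuition behind the task: wider ships tend to be longer, so a single feature already gives a usable prediction – which is exactly the level of difficulty the beginner track aims for.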
In addition, we use other datasets in quizzes and practical tasks to ensure variety in the topics treated and reduce the risk of boredom or weariness from a single topic. The learners can also see other examples and understand that they can use their new knowledge in many different ways. For example, we introduce datasets of UFO sighting reports and wine quality analyses. Each dataset leaves room for imagination: the students can imagine themselves as wine producers trying to improve the quality of their wine, or scientists on a Moon base trying to predict the arrival of aliens, …
The first chapter of the course will be tested internally at the TUHH in the summer semester of 2020, so we will have a first idea of whether the concept works by the summer. The chapter will be openly available in September / October 2020.
The next steps of development are the design of the next two chapters: image and text quality. We expect to follow the same idea of having a believable problem to work on, where learners can identify themselves as data scientists solving real problems.
This article was written by Anna Lainé.