Considerations for integrating AI with Citizen Science in Rubin
Advice, resources, and best practices for the integration of machine learning techniques into Citizen Science projects to be added here.
Overview
The goal of this page is to provide background, references, and advice for Rubin users who would like to use artificial intelligence (AI), a term used here interchangeably with machine learning (ML), within or alongside their citizen science (CS) research workflows. AI and CS are overlapping and complementary approaches for rapidly assessing (typically, classifying) scientific data. They share some strengths and weaknesses and differ in others, and there are multiple ways to use the two in concert to improve the quality of your scientific results. In addition to this advice, Rubin CS offers tutorials for CS in Rubin and expedited access to the Zooniverse CS platform, and Rubin CST offers tutorials on AI. We recommend studying AI, including performing your own small computational studies, before embarking on a CS project that integrates AI.
Almost all of the discussion on this page is in terms of classification tasks; we will use the specific task of classifying ‘stars’ and ‘galaxies’ as an example when needed. Additionally, our discussion will focus on classifying images, as this is the most widespread application of both AI and CS. When referring to
Comparing AI and CS
AI and CS have multiple similarities and differences. Identifying which of these apply to your project is important for determining the best use of AI in your Rubin CS project.
Similarities
The similarities between AI and CS include the following:
Both methods take large sets of input data and output a classification score for each object in the set. The output scores are not inherently probabilistic, but they can be calibrated to be nearly probabilistic.
To a first approximation, both can be treated as "black boxes": interpreting why an algorithm or a human assigns a particular classification to an object remains an active area of research. Statistical metrics help in both cases, but definitively determining the reasons remains elusive.
Even though both are black boxes, the problem or question posed to an AI algorithm or a CS project must be carefully and specifically designed.
The data must be carefully prepared, for example, by normalizing images.
Though they do it in different ways, both methods can classify large numbers of complex inputs (like images) more efficiently and more reliably than an individual performing a manual investigation of the data.
Both approaches require training. An AI model must be numerically trained on example data that has pairs of images and classifications. In a CS project, volunteers must be taught about the subject matter and provided with example pairs of images and their corresponding true classifications.
Both AI and CS are established fields of research with their own experts; drawing on that expertise is very important for a successful Rubin CS project.
Both methods can be applied to a wide variety of data across many areas of research and industry – natural images, climate, biology, medicine, etc.
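The point above about scores not being inherently probabilistic can be checked in the same way for both methods. The following is a minimal sketch, not a prescribed Rubin or Zooniverse workflow: it assumes a binary star/galaxy task, treats the fraction of volunteer votes for 'galaxy' as the CS score, and compares scores against known labels in a few score bins. The function names (`vote_fraction`, `reliability_table`, `expected_calibration_error`) and the bin count are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def vote_fraction(votes: Sequence[str], target: str = "galaxy") -> float:
    """Turn one object's volunteer votes into a score in [0, 1]."""
    return sum(v == target for v in votes) / len(votes)


def reliability_table(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group objects by score and compare mean score to the observed positive rate.

    `labels` holds 1 for the target class (e.g. galaxy) and 0 otherwise.
    Returns one (mean_score, observed_fraction, count) tuple per non-empty bin.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    table = []
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_score, observed, len(members)))
    return table


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> float:
    """Count-weighted average gap between mean score and observed rate per bin."""
    total = len(scores)
    return sum(
        count / total * abs(mean_score - observed)
        for mean_score, observed, count in reliability_table(scores, labels, n_bins)
    )
```

The same functions apply whether `scores` come from a trained model or from `vote_fraction` applied to each object's volunteer votes. A value near zero suggests the scores can be read as approximate probabilities; a larger value suggests a calibration step is needed before the scores are interpreted that way.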
Differences
CS volunteers typically have a personal investment in their classification task. As a result, interaction between scientists and volunteers is critical for keeping volunteers motivated.
AI models can be rapidly retrained and redeployed on the same dataset. This means that an AI model can be refined multiple times on a training set before being applied to a new dataset. Relatedly, the probabilities and uncertainties of the classification can be refined.
CS volunteers rarely repeat their work on the same dataset. As a result, a CS project must be deployed and shared with the public at a carefully chosen time.
AI models require access to significant computational resources, typically graphics processing units (GPUs).
AI requires a relatively large amount of data to train the algorithm, while a CS approach needs very little.
CS volunteers are typically more skilled than AI models in cases where it is difficult to discern between two classes.
Three Typical Pathways for Integrating AI and CS
There are three typical pathways for combining AI and CS.
Human classifiers prepare data for AI classifiers. When there is not enough training data for an AI model, human classifiers can perform the initial labeling. This is a common tactic in industry applications.
AI classifiers prepare data for humans. When there is a wealth of data for supervised training of an AI model, but the model persistently struggles to discern certain classes (resulting in false positives and false negatives), it may be appropriate to send the ambiguous objects to CS volunteers for their more nuanced perspective. If there isn’t much training data, an unsupervised method (e.g., k-means clustering, or clustering on features learned by an autoencoder) can be used to provide an initial, coarse-grained grouping of the data, which can then be sent to CS volunteers.
Active Learning Hybrid. Coming soon.
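The first two pathways can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a Rubin or Zooniverse API: it assumes a binary star/galaxy task, volunteer votes stored per subject ID, and a model that outputs a galaxy score between 0 and 1. All function names and thresholds (`min_votes`, `min_agreement`, `low`, `high`) are hypothetical and would need tuning for a real project.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def consensus_label(
    votes: List[str], min_votes: int = 3, min_agreement: float = 0.8
) -> Optional[str]:
    """Return the majority label if enough volunteers agree, otherwise None."""
    if len(votes) < min_votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


def build_training_labels(votes_by_subject: Dict[str, List[str]]) -> Dict[str, str]:
    """Pathway 1: keep only subjects with a clear volunteer consensus as training labels."""
    labels = {}
    for subject_id, votes in votes_by_subject.items():
        label = consensus_label(votes)
        if label is not None:
            labels[subject_id] = label
    return labels


def route_by_confidence(
    galaxy_scores: Dict[str, float], low: float = 0.2, high: float = 0.8
) -> Tuple[Dict[str, str], List[str]]:
    """Pathway 2: accept confident model outputs and queue the rest for volunteers."""
    confident: Dict[str, str] = {}
    for_volunteers: List[str] = []
    for subject_id, score in galaxy_scores.items():
        if score >= high:
            confident[subject_id] = "galaxy"
        elif score <= low:
            confident[subject_id] = "star"
        else:
            # The model is unsure; these subjects go to the CS project.
            for_volunteers.append(subject_id)
    return confident, for_volunteers
```

For example, `build_training_labels({"a": ["star"] * 5, "b": ["star", "galaxy", "star", "galaxy"]})` keeps only subject `"a"`, because the votes for `"b"` are split. Discarding low-agreement subjects trades training-set size for label quality; whether that trade is worthwhile depends on how much labeled data the model needs.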