Artificial Intelligence and Computer-Supported Collaborative Learning in Programming: A Systematic Mapping Study

Objective: The Computer-Supported Collaborative Learning (CSCL) approach integrates artificial intelligence (AI) to enhance the learning process through collaboration and information and communication technologies (ICTs). In this sense, innovative and effective strategies could be designed for learning computer programming. This paper presents a systematic mapping study from 2009 to 2021, which shows how the integration of CSCL and AI supports the learning process in programming courses. Methodology: This study was conducted by reviewing data from different bibliographic sources such as Scopus, Web of Science (WoS), ScienceDirect


INTRODUCTION
In Computer Science (CS), programming courses have a higher attrition rate compared with other courses (Figueiredo & García-Peñalvo, 2018, Munson & Zitovsky, 2018, Zingaro et al., 2018). The use of collaborative approaches in programming courses has shown satisfactory results by providing skills, aptitudes, and good practices, among other benefits (Suárez et al., 2021). The Computer-Supported Collaborative Learning (CSCL) approach integrates artificial intelligence (AI) to improve the learning process through collaboration and information and communication technology (ICT) (Kozma, 2000, Loizzo & Ertmer, 2016, Docq & Daele, 2001, Solarte-Pabón & Machuca-Villegas, 2019, Thomas et al., 2002). According to Magnisalis et al. (2011), the CSCL approach is based on the following processes: a) Group Formation (GF support): forming student groups to obtain the best results from collaboration; b) Learning Domain (LD support): evaluating student learning activities with the help of computers; c) Path of Interaction (PI support): providing students with comments on learning outcomes. The literature shows some projects that integrate CSCL processes with AI techniques, such as Machine Learning to automatically identify groups of students according to their profiles or natural language processing for the automatic assessment of code (Casamayor et al., 2009, Costaguta & de los Angeles Menini, 2014, Black & Wiliam, 1998, Varier et al., 2017). However, in the literature, there is no state-of-the-art review that shows how the integration of CSCL and AI impacts and improves the learning process in programming courses. This paper presents a systematic mapping study based on tech mining (Mohammadi, 2012), which was conducted by reviewing data from scientific documents such as journals and conference proceedings from Scopus, WoS, and ScienceDirect, in addition to data extraction and analysis from repositories in the GitHub platform.
The final corpus contains 316 records: 90 papers and 226 repositories. The aim is to understand the evolution of programming languages, technologies, tools, and strategies based on CSCL and AI to support teaching in programming courses. This paper has three sections. The first section presents the methodology in three parts related to the elaboration of the mapping study: research questions and expected results, data protocol, and the construction of technological map visualizations. The second section shows the expected results, classified into four aspects: the commonly used programming languages in the development of technologies and learning tools based on CSCL and AI; the timeline of the evolution of projects, tools, and technologies; the strategies for CSCL processes based on AI; and the classification of the reference corpus according to the ACM Computing Classification System (CCS) (ACM, n.d.). Finally, based on the results, the third section discusses and concludes what is needed to improve the learning process in programming courses.

METHODOLOGY
This section details the process carried out to elaborate the mapping study of the learning programming process as supported by CSCL and AI. It includes the sources of information, the search functions, and the selection criteria used to construct the final reference corpus and the procedures for analysis and technological mapping.
This mapping study is based on tech mining, which aims to generate practical intelligence using analytical and visualization applications for data analysis (Choi et al., 2012, Mohammadi, 2012, Barab et al., 1997, Antonenko et al., 2012). Thus, future technologies and innovations can be anticipated, ensuring long-term competitiveness and supporting decision-making processes (Jonassen, 2012, Capelo & Dias, 2009, Fadde, 2009). This systematic review aims to find the technological wave and cutting-edge research of CSCL and AI in the programming learning process by solving four questions of interest, as shown in Table 1.
The data sources also include GitHub software repositories. GitHub is the most popular open-source code platform among developers. GitHub contains information on more than 3.5 million software developers and more than 23 million repositories since 2008 (GitHub, 2018).
Search and data filters. Table 2 shows ten search functions specific to the Scopus, ScienceDirect, Web of Science (WoS), and GitHub data sources. The search functions included keywords taken from experts in this field, as well as from researchers and programming professors. In addition, the table shows how many documents were obtained for each data source and the total records per query. After filtering the information (Figure 1), the final corpus contained 316 records: 90 papers and 226 repositories. Table 3 shows the records filtered by each data source.

As seen in Figure 1, the implementation of PRISMA is divided into four sections, as follows:
• Identification: the number of records obtained from the queries to the different data sources is determined.
• Screening: all the information is collected, forming a data corpus.
• Eligibility: according to the exclusion criteria mentioned above, the data are filtered to obtain a final corpus. In this section, the records that are not qualitative are also identified.
• Included: a dataset with structured information is obtained, which can be processed for analysis.
Figure 2 shows the workflow of this study, which involved three processes. The first one, the corpus of references, contains information from repositories and scientific documents. On the one hand, the features of the repositories were extracted: citations, year, author, copies, forks, stars, and updates. On the other hand, the features of the scientific documents were extracted: year, keywords, abstract, DOI, authors, and organizations.
In the second process, the extracted features were processed by means of named entity recognition (NER), a natural language processing (NLP) technique that extracts specific entities from text (e.g., cities, names, and others). NER-Spacy (Vasiliev, 2020) was used to extract the following features from the text fields of the corpus of references: methods, type of technology, type of software, strategies, and programming languages. The index construction task aimed to divide and organize the reference corpus information by years, types of software, and software categories.
The third process involved constructing technological maps using three types of visualizations: treemap, radial tidy tree, and fishbone. The treemap is used to display large amounts of hierarchically structured (tree-structured) data. The space in the visualization is divided into rectangles that are sized and sorted by a quantitative variable. The levels in the hierarchy are displayed as rectangles that contain other rectangles. Each set of rectangles on the same level in the hierarchy represents a column or an expression in a data table. Each individual rectangle at a level in the hierarchy represents a category in a column.
The radial tidy tree is a node-link tree diagram of classes in a package hierarchy, positioned in polar coordinates using Vega's tree transform. Adjusting the parameters produces layouts suitable for general trees or cluster dendrograms.
The fishbone diagram is a representation tool for categorizing the potential causes of a problem, or the evolution of a problem, in order to identify its root causes. Typically used for progress analysis, a fishbone diagram combines the practice of organization charts with a mind map template.

Construction of technological maps
The technological map of programming languages represents the most used programming languages and the development trends in technologies for learning programming. This technological map was constructed by means of the cooccurrence process, in which the repositories that belong to the same programming language are grouped. In the process, the TAGs or labels of the NER are included in a vector. The set of vectors constitutes a matrix of programming language cooccurrences.
Thus, if the label contains information about the programming language, a value of 1 is assigned to the vector; otherwise, it is 0. The cooccurrence matrix is transformed into a square matrix, and it becomes the input of the treemap. Each rectangle that represents a programming language consists of rectangles that represent repositories. The largest rectangle at the top left corner represents the most important programming language, and the smallest rectangle at the bottom right corner represents the least important one. The orange tone represents a repository in the upper hierarchy, while a green tone represents a repository in the lower hierarchy, based on the most cited, copied, and cloned studies, as well as on forks.
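The cooccurrence construction described above can be illustrated with a minimal sketch. The repository names and labels below are hypothetical, and the square matrix is computed as the product of the binary matrix with its transpose, which is an assumption, since the text does not state the exact transformation used.

```python
from typing import Dict, List

# Hypothetical NER output: repository name -> programming-language labels.
# These repositories and labels are illustrative, not taken from the corpus.
REPO_LABELS: Dict[str, List[str]] = {
    "repo_a": ["Python", "JavaScript"],
    "repo_b": ["Python"],
    "repo_c": ["Java", "JavaScript"],
}


def binary_vectors(labels: Dict[str, List[str]], languages: List[str]) -> List[List[int]]:
    """Build one vector per repository: 1 if the language label is present, else 0."""
    return [
        [1 if lang in labels[repo] else 0 for lang in languages]
        for repo in sorted(labels)
    ]


def square_cooccurrence(vectors: List[List[int]]) -> List[List[int]]:
    """Language-by-language matrix X^T X (assumed form of the square matrix)."""
    size = len(vectors[0])
    return [
        [sum(row[i] * row[j] for row in vectors) for j in range(size)]
        for i in range(size)
    ]


LANGUAGES = sorted({lang for labs in REPO_LABELS.values() for lang in labs})
VECTORS = binary_vectors(REPO_LABELS, LANGUAGES)
COOCCURRENCE = square_cooccurrence(VECTORS)

# The diagonal counts repositories per language; it could size the treemap rectangles.
TREEMAP_SIZES = {lang: COOCCURRENCE[i][i] for i, lang in enumerate(LANGUAGES)}

print(LANGUAGES)
print(COOCCURRENCE)
print(TREEMAP_SIZES)
```

Under this assumption, each diagonal entry gives the number of repositories labeled with a language, and each off-diagonal entry gives the number of repositories in which two languages appear together.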
The evolution timeline depicts the technological evolution of the subject under study. It presents the dynamics of the most relevant projects, of the type of technologies, and of the programming languages used in each period. To summarize, the map shows the evolution of the technologies for learning programming in one decade. The technological map is elaborated by performing the following tasks:
a) The reference corpus is organized by the number of stars, citations, and forks in the repositories, as well as by the number of citations in the papers. To this effect, the fishbone presents the most relevant projects over time.
b) The reference corpus is grouped using a data ontology that contains categories and subcategories of the Computing Classification System (CCS) (ACM, n.d.). The fishbone shows the type of technologies extracted from the CCS.
c) The cooccurrence of the programming languages is determined. For each year, the languages are grouped into three categories: more relevant, relevant, and less relevant. The relevance is associated with the number of repositories that use a programming language for a given year.
d) Finally, the extracted information is manually organized into the fishbone.
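Task c) can be sketched as follows. The records are hypothetical, and splitting each year's ranked languages into three roughly equal parts is an assumption, since the text does not give the thresholds that separate the three categories.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical (year, language) pairs, one per repository.
RECORDS: List[Tuple[int, str]] = [
    (2019, "Python"), (2019, "Python"), (2019, "Python"),
    (2019, "JavaScript"), (2019, "JavaScript"),
    (2019, "Ruby"),
    (2020, "Java"),
]


def relevance_by_year(records: List[Tuple[int, str]]) -> Dict[int, Dict[str, List[str]]]:
    """Group the languages of each year into three relevance categories."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, language in records:
        counts[year][language] += 1

    tiers: Dict[int, Dict[str, List[str]]] = {}
    for year, counter in counts.items():
        # Rank by repository count (descending), then alphabetically for ties.
        ranked = [lang for lang, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]
        # Assumed cut: three roughly equal parts (ceiling division).
        size = -(-len(ranked) // 3)
        tiers[year] = {
            "more relevant": ranked[:size],
            "relevant": ranked[size:2 * size],
            "less relevant": ranked[2 * size:],
        }
    return tiers


TIERS = relevance_by_year(RECORDS)
print(TIERS[2019])
```

With fewer than three languages in a year, the lower categories are simply left empty; a study with explicit thresholds would replace the assumed cut.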
As for the technological map according to the Computing Classification System (CCS), the reference corpus was grouped using a data ontology of the Association for Computing Machinery (ACM, n.d.). The cooccurrence was determined in order to identify which repository or paper belonged to [...]. If a reference meets the three criteria, it is stored in a segment of a vector. Then, its cooccurrence is determined. Thus, the vector is represented as a coincidence matrix, which is then transformed into a square matrix, which will in turn be the input for the radial tidy tree graph.

RESULTS
In the repositories, it was found that there is affinity between programming languages for a specific development. For Python, JavaScript, and Java, most of the projects found are focused on online learning platforms (nsoojin, n.d., Leocardoso94, n.d., Haghighatlari et al., 2020). The languages were grouped according to the computational tasks in which they are most used: for data processing, R, Python, Node, and Java; for scaling applications, R, C++, Ruby, and Python; for data visualization, JavaScript and Python with HTML and CSS; for multiple processes (threads), Python, Ruby, and Node; and for adaptive and responsive tasks, JavaScript and Python.

Evolution timeline
The technological evolution of programming learning from 2009 to 2021 was studied in the corpus of repositories. For each year, the most relevant repositories, programming languages, and technologies were identified. Figure 4 describes the projects over one decade. In the first years, the projects were related to online platforms, online courses, content management systems, web apps for learning, and massive open online courses (MOOCs). In the last years, the projects focused on cloud platforms and intelligent learning management systems. [...] and the Stanford Machine Learning Course (Pathrabe et al., 2019). Interactive guides are resources designed to reinforce student learning in a virtual environment through audio and video. Some of the strategies used were the intelligent tutoring system (Triantafillou et al., 2002) and the rule-based system (Magnisalis et al., 2011).
In 2012, improvements were made to interactive guides via games with 3D graphics, wikis, and learning assistants, where the interaction between teacher and student is supported by the com- [...] ClassroomWiki (Khandaker, 2010), and JavaStud (yrojha4ever, 2015). Other platforms, such as tpot (EpistasisLab, n.d.) and Dex (johnlee175, ), use machine learning to improve students' writing style in a programming language. For these years, different uses of the strategies were reported: the Coalition Formation Algorithm (Yang & Luo, 2007) [...]
By reviewing the most relevant projects within the timeline, the projects based on CSCL and AI to improve the learning process in programming courses were identified. According to GitHub, these projects fall into six software categories: software methodologies, compilation errors, software design, help with the style of the source code, artificial intelligence platforms, and virtual judges. Table 4 shows the most important projects of this review, grouped by the categories of the GitHub software, computational techniques, and CSCL processes. The relevance of a project in GitHub depends on the number of stars, copies, contributors, and forks.
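As an illustration of how such a relevance ordering could be computed, the sketch below combines the four counts into a single score. The repositories are hypothetical, and the equally weighted sum is an assumption: the text names the four indicators but not how they are combined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Repository:
    """Hypothetical record holding the four indicators named in the text."""
    name: str
    stars: int
    copies: int
    contributors: int
    forks: int


def relevance_score(repo: Repository, weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the indicators (equal weights are an assumption)."""
    w_stars, w_copies, w_contributors, w_forks = weights
    return (
        w_stars * repo.stars
        + w_copies * repo.copies
        + w_contributors * repo.contributors
        + w_forks * repo.forks
    )


def rank(repos: List[Repository]) -> List[Repository]:
    """Order repositories from most to least relevant."""
    return sorted(repos, key=relevance_score, reverse=True)


REPOS = [
    Repository("alpha", stars=120, copies=10, contributors=5, forks=30),
    Repository("beta", stars=80, copies=40, contributors=12, forks=60),
    Repository("gamma", stars=200, copies=5, contributors=2, forks=10),
]
RANKED = rank(REPOS)
print([(repo.name, relevance_score(repo)) for repo in RANKED])
```

Different weights would change the ordering, so any ranking obtained this way should be reported together with the weights used.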
Below are some works found in the review of the GitHub repositories. These projects apply some strategies based on CSCL (Table 4).
Minerva (dmlc, 2019) forecasts code and, based on this, plans code to keep students at an optimal level. Maintaining this manually can take a long time and be surprisingly complex, so Minerva makes people with an optimal level periodically review their code.
GreedExCol (Debdi et al., 2015): a CSCL system designed to support collaboration in experimental optimality results for greedy algorithms. GreedExCol supports the discussion between the members of a small group of up to four students. The methodology for working with GreedExCol is as follows: first, each student performs their individual work (experimental research); then, the results obtained by each one are shared and discussed, so that they can propose the functions that are considered to be optimal.
ClassroomWiki (Khandaker, 2010): a collaborative web-based Wiki writing platform. For students, ClassroomWiki provides a web interface to write and review their group's Wiki, as well as a thematic forum to discuss their ideas during collaboration. When students collaborate, ClassroomWiki tracks all student activities and builds detailed models of the students who present their contributions to their groups. ClassroomWiki is based on a multiagent framework that uses student models to form groups of students and improve collaborative learning. Through CSCL processes, ClassroomWiki supports the formation of active learning groups and rubrics. In addition, the environment offers formative comments through different mechanisms that support the student in the improvement of the proposed programming solutions.

Technological map according to the ACM Computing Classification System (CCS)
In the classification of the corpus of references, three results were found. First, according to the categories and subcategories of the CCS (Figure 5), 20 % of the repositories and papers focus on collaborative learning with multiple agents based on cooperative methods, virtual education, and virtual classes; 16 % focus on data analysis through tools, social software, and network analysis; and 15.3 % are e-learning and b-learning systems. For the remaining percentage, the projects are divided into the formation of groups, roles, and wikis. Second, in the analysis of projects and documents based on CSCL processes with artificial intelligence support, 15 % are data mining, 12 % involve natural language processing, 20 % use neural networks, and 7 % use predictive and statistical methods. Third, the references were classified as follows: 23 % are academic courses, 12 % are Application Programming Interfaces (APIs), 9 % are software platforms, 13 % are learning games, and 36 % are platforms and learning tools.

Classification of strategies according to CSCL processes and types of software
Ten strategies based on CSCL and AI were found. Each of the strategies has a specific purpose within GF, LD, and PI support. Figure 6 shows the grouping between the corpus of references and the strategies found. Each strategy is detailed below:
i) VALCAM algorithm. This algorithm makes use of a virtual currency (V) in the following manner. The system agent works as the provider and accountant of the virtual currency. Every time the user agents form a coalition and perform the required task, their individual and group performances are evaluated by the system and group agents, respectively. After the evaluation, the system agent rewards each user agent's individual performance, while each group agent rewards each user agent's performance as a group member (Soh et al., 2006a, Soh et al., 2008).
ii) JigSaw learning. The JigSaw technique is a method for organizing classroom activities that makes students dependent on each other to succeed. It breaks classes into groups and breaks assignments into pieces that the group assembles to complete (a jigsaw puzzle) (Gutwin et al., 2013, Magnisalis et al., 2011).
iii) Adaptive educational system (AES). An AES is mainly a system that aims to adapt some of its key functional characteristics (for example, content presentation and/or navigation support) to the learner's needs and preferences. Thus, an adaptive system operates differently for different learners, considering information accumulated in individual or group learner models (Debdi et al., 2015, Triantafillou et al., 2002).
iv) Intelligent tutoring system (ITS). It aims to provide learner-tailored support during the problem-solving process, as a human tutor would do. To achieve this, ITS designers apply techniques from the broader field of Artificial Intelligence (AI) and implement extensive modeling of the problem-solving process in the specific domain of application (Triantafillou et al., 2002).
v) Adaptive hypermedia systems (AHS). An AHS builds a user model of the goals, preferences, and knowledge of the individual user, and it uses this model to adapt the content of pages and the links between them to his/her needs. The variables that user models include can be classified as user-dependent, which are those directly related to the user and define him/her as an individual, and as user-independent, which affect the user indirectly and are mainly related to the context of a user's work with a hypermedia application (Yang & Luo, 2007).
vi) Rule-based system. It uses the Constraint-Based Modeling (CBM) approach (i.e., it represents the domain knowledge as a set of constraints and a rule-based system) and offers adaptive/intelligent support by providing learners with hints during individual and group problem-solving processes, as well as feedback on peer interaction based on individual student contributions. Results show that CBM is an effective technique for both modeling and supporting students in developing collaboration skills (the participants both acquired declarative knowledge about good collaboration and collaborated more effectively) (Magnisalis et al., 2011).
vii) Coalition Formation Algorithm. In the initial state, all agents are mutually independent and not cooperative. Thereafter, as the agents continuously acquire more knowledge from the system and the environment, every agent may form a coalition on the basis of certain principles by consulting and comparing. Each coalition is considered as an independent entity. All members in the coalition cooperate fully, so the coalition can draw support from the abilities and resources of other members to complete tasks more efficiently than a single agent (Yang & Luo, 2007).
viii) E-learning systems. E-learning can be thought of as the learning process created by interaction with digitally delivered content, services, and support. It involves the intensive use of information and communication technologies (ICTs) to serve, facilitate, and revolutionize the learning process (Debdi et al., 2015).
ix) Predict student behavior. A prediction model whose main objectives are automatically predicting students' performance and helping to measure and improve their goals (Abdulwahhab & Abdulwahab, 2017, Qiu et al., 2016).
x) Team Syntegrity Model (TSM). It is a process developed by Stafford Beer to allow groups to work together in a democratic, nonhierarchical fashion in order to capture their best thinking. It is a particularly appropriate process to use when groups are characterized by high levels of diversity (Asproth et al., 2011, Leonard, 2011).
Table 5 groups each strategy according to the type of implementation (manual or technological), the specific process of the CSCL to which it points, and the number of papers that explain its documentation or implementation.

Strategy                      CSCL process   Number of papers
Team syntegrity model         GF             9
JigSaw learning               LD             14
Adaptive education system     LD             18
Intelligent tutoring system   LD             19
Adaptive hypermedia system    LD             5

Regarding the GF support process, strategies are currently being implemented to group students by numerical categories, abilities and skills, student profile, and weighted maxima and minima. Moreover, these strategies have been supported by AI to automatically identify and group students using Machine Learning methods (Thomas et al., 2002, Bennedsen et al., 2008, Wiggins et al., 2015, Soh et al., 2006a, Soh et al., 2006b, Costa et al., 2017) and probability and statistical methods (Salcedo & Idrobo, 2011, Hazzan et al., 2003). However, this study did not find a CSCL strategy or AI-supported tool that groups computer programming students. Thus, the following question arises: How should CSCL and AI strategies be implemented to form groups in programming courses?
As for the LD support process, automatic code assessment appears in the form of virtual judges that are used in programming competitions, encouraging their integration into academic programming courses. However, virtual judges implemented as a method of teaching programming are not enough to foster logic and abstraction skills. Therefore, different learning platforms have appeared, such as I-MINDS (Khandaker, 2010), UNCode (Restrepo-Calle et al., 2018), and INGInious (luvoain, n.d.), among others, which allow automatically assessing code, controlling the learning process, and obtaining an analysis of the results. However, these tools fall short in the evaluation of syntactic and semantic code, code style, multiparadigm compilers, and plagiarism identification. This paper identified some AI-supported strategies that could improve the code assessment process, but there are still questions to be answered: What would be the best method for automatic code assessment? How can students achieve programming proficiency through automatic code assessment?
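To make the idea of a virtual judge concrete, the following minimal sketch runs a submission against input/output test cases and reports the failures. It is not the implementation of any of the platforms cited above: the `judge` function, the test cases, and the buggy `student_max` submission are all hypothetical, and a real judge would also sandbox the code and enforce time and memory limits.

```python
from typing import Any, Callable, Dict, List, Tuple

# A test case is (positional arguments, expected result).
TestCase = Tuple[tuple, Any]


def judge(solution: Callable[..., Any], cases: List[TestCase]) -> Dict[str, Any]:
    """Run a submission against every test case and collect a report."""
    passed = 0
    failures: List[Tuple[tuple, str]] = []
    for args, expected in cases:
        try:
            actual = solution(*args)
        except Exception as exc:  # a crash counts as a failed case
            failures.append((args, f"raised {type(exc).__name__}"))
            continue
        if actual == expected:
            passed += 1
        else:
            failures.append((args, f"expected {expected!r}, got {actual!r}"))
    return {"passed": passed, "total": len(cases), "failures": failures}


def student_max(values: List[int]) -> int:
    """Hypothetical submission with a bug: it is wrong for all-negative lists."""
    best = 0
    for value in values:
        if value > best:
            best = value
    return best


CASES: List[TestCase] = [
    (([1, 5, 3],), 5),
    (([7],), 7),
    (([-4, -2],), -2),
]
REPORT = judge(student_max, CASES)
print(REPORT)
```

A pass/fail report of this kind checks functional correctness only, which is consistent with the limitation noted above: style, semantics, and plagiarism require additional analyses.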
In the case of PI support, programming content feedback is a difficult question to answer, since each student has a different programming style, but there are currently tools such as GreedExCol (Debdi et al., 2015), ClassroomWiki (Khandaker, 2010), and I-MINDS (Soh et al., 2008), which identify the syntactic structure tree of a programming language in order to compare it with the student's code. This form of feedback has proven to work well. However, when evaluating by competencies or with a numerical system, this feedback is not so effective, as it needs to be more precise. AI and CSCL could improve this process, helping to solve questions such as: How does AI improve the process of providing feedback from the source code? How could CSCL strategies be implemented in programming learning tools?
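The comparison of syntactic structure trees mentioned above can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions, not the method of the cited tools: the reference solution, the submission, and the set of tracked node types are hypothetical, and counting node types is a deliberately coarse stand-in for a full tree comparison.

```python
import ast
from collections import Counter
from typing import List

# Assumed set of structural node types worth reporting to a student.
TRACKED = ("FunctionDef", "For", "While", "If", "Return")

REFERENCE = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
SUBMISSION = (
    "def total(xs):\n"
    "    s = 0\n"
    "    i = 0\n"
    "    while i < len(xs):\n"
    "        s += xs[i]\n"
    "        i += 1\n"
    "    return s\n"
)


def node_profile(source: str) -> Counter:
    """Count the node types that appear in the syntax tree of the source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def structural_feedback(reference: str, submission: str) -> List[str]:
    """Report tracked node types that differ between reference and submission."""
    ref, sub = node_profile(reference), node_profile(submission)
    notes: List[str] = []
    for name in TRACKED:
        diff = sub[name] - ref[name]
        if diff > 0:
            notes.append(f"{diff} extra {name} node(s)")
        elif diff < 0:
            notes.append(f"{-diff} missing {name} node(s)")
    return notes


FEEDBACK = structural_feedback(REFERENCE, SUBMISSION)
print(FEEDBACK)
```

Here both programs are correct, yet the feedback flags the different loop construct, which illustrates why structure-based feedback depends on each student's programming style.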
In the development and documentation of tools that support the learning of programming, there is still work to be done. On the one hand, there are no tools that implement all CSCL processes.
With respect to the use of AI techniques, out of the tools found, 7 % implement AI techniques, but only 1 % are documented. On the other hand, there is no model that integrates the CSCL approach with AI techniques, which would allow implementing learning activities and observing and analyzing the evolution of the system and how its users (students) improve their skills. In addition, the different tools found in this paper could be explored by professors and institutions, or new technologies could be developed from them.
In further studies, new alternatives will be explored to improve the process of learning programming, including aspects related to the implementation of tools, statistical methods, and learning analyses based on CSCL and AI. These should enrich and allow monitoring students' training process, improve good programming practices and soft skills, and foster greater collaboration and individual and group abstraction.