
DATA601


# DATA601-Introduction to Data Science

Welcome to Data 601 - Introduction to Data Science. The latest class syllabus can be found in the [Data 601 main GitHub repository](https://github.com/fgonzaleumbc/DATA601). The syllabus covers contact information, the class schedule, class links, grading information, university policies, and university resources.

This repository also contains various data science and Python resources, including:

- [Data 601 - Data Science - Reference Guide.docx](https://github.com/fgonzaleumbc/DATA601/blob/main/DATA%20601%20-%20Data%20Science%20-%20Reference%20Guide.docx): This document contains a curated set of references covering Python installation, important data science libraries, important datasets, and other websites.
- [Data_Science_Introduction_Topics_Index.xlsx](https://github.com/fgonzaleumbc/DATA601/blob/main/Data_Science_Introduction_Topics_Index.xlsx): This schedule contains a detailed list of the lectures, a summary of the topics discussed in each lecture, and the tentative schedule for homework and project deliverables.

# Applications of Data Science and Related Technologies

In the last decade, improvements in computer processing power have allowed approaches from data science and the related subfields of artificial intelligence (AI), machine learning (ML), natural language processing (NLP), and generative AI (Gen-AI) to make significant contributions and efficiency gains in all fields of study, from engineering to business processes to data analysis. The class material and discussions (see "[List of Class Repositories](#List-of-Class-Repositories)" below) cover the data science lifecycle end to end. These include but are not limited to:

- Common programming languages used for data science (e.g., Python),
- Data collection and storing (e.g., relational databases, SQL, etc.)
  and data querying,
- Data cleaning/munging/wrangling/preparation considerations and processes,
- Data transformation, data augmentation, and data pipelines,
- Exploratory data analyses (EDA),

Applications of artificial intelligence algorithms, including but not limited to:

- Supervised machine learning (e.g., regression, classification, time series)
- Unsupervised machine learning (e.g., clustering, dimensionality reduction)
- NLP (e.g., stopwords, TFIDF, bag of words model, similarity ranking techniques, text classification, text clustering)
- Advanced NLP and generative AI (e.g., large language models)

Other related topics include but are not limited to:

- Probability and statistics
- Robotic Process Automation (RPA)
- Ethics

### Data Science Lifecycle:

The following diagram shows an overview of the data science lifecycle:

![Data_Science_Lifecycle](https://github.com/user-attachments/assets/c6e7e916-461f-4a72-92db-7804ee517e47)

The lifecycle shown in the figure above includes the following steps:

- Defining the scope, formulating the question, and identifying funding
- Planning, designing, procuring, and developing the system
- Data cleaning: may also be referred to as data preparation, data wrangling, data transformation, data profiling, or data pipeline. This step may also include defining the data transformations needed (e.g., deriving features, scaling, and/or normalization).
- Exploratory Data Analysis (EDA) and data visualization
- Defining the task to be performed or automated, or the task that the data allows to be performed
- Defining the data variables and features to be used in models and algorithms (e.g., independent vs.
  dependent variables)
- Applying AI/ML algorithms: training, testing, evaluating, verification, and validation of model and algorithm performance
- EDA, documentation, and model deployment
- Monitoring the performance of the models, application, and/or process

Although the focus is on data science, data, and model deployment, other important software development practices and operations (e.g., Agile methodologies, testing, DevOps, Continuous Integration/Continuous Deployment (CI/CD)) are applied depending on the application.

### AI Tasks, Approaches and Uses

Data science and related AI and non-AI fields, such as statistics, data visualization, ML, NLP, and GenAI, can serve many end goals. These include:

- Measuring relationships between features (e.g., measuring how one or more attributes increase based on other attributes)
- Ranking (e.g., ranking records based on similarity or importance)
- Predicting a numerical feature based on other features (e.g., predicting a future value)
- Predicting and assigning labels or classes
- Creating groups to analyze group statistics

These AI and non-AI approaches can be used in combination, and there is overlap between fields. For example, many statistical models are considered part of machine learning, and data visualization is commonly used to show the outputs of AI algorithms. There are many common tasks that can be performed with these approaches:

1. Non-AI Tasks from Traditional Statistics and Analytic Approaches:
   - Calculate descriptive statistics (e.g., mean, median, maximum, minimum, etc.)
   - Visualize data features to identify trends, find patterns, and tell a story
   - Measure relationships and find patterns between variables and features
   - Validate conclusions using hypothesis testing techniques
   - Data science combines all tasks to extract meaning and insights from data.
2. Supervised ML:
   - Predictive Analytics: the model can predict a future value.
     Example applications include recommendation systems, predictive maintenance, anomaly detection, and image detection/recognition.
   - Labeling/Classification: given a training dataset, the model can label new data. Example applications include email classification and image recognition.
3. Unsupervised ML:
   - Clustering/Grouping: algorithms can group records based on feature similarity. Example applications include customer segmentation, anomaly detection, and text clustering.
4. NLP:
   - Search systems: information retrieval and ranking
   - Named entity recognition: the system or model recognizes entities including but not limited to person names, organizations, and geographic locations
   - Text summarization: makes a text shorter while keeping the original meaning and accuracy.
   - Question answering: based on an input or prompt, a system provides a response or output (e.g., chatbots)
   - Generative AI: given a prompt, the tool provides a human-like output (e.g., using large language models such as ChatGPT, Gemini, Llama, Mistral).

# List of Class Repositories

The lecture material is divided into various repositories, which include sample Python code and datasets. The following table lists each lecture repository, its link, and a summary of the topics discussed.

| Repository | Description |
|------------|------------|
| [DATA601_L00-HW_Projects](https://github.com/fgonzaleumbc/DATA601_L00-HW_Projects) | This repository contains homework and project material due throughout the semester. See the syllabus or "Data_Science_Introduction_Topics_Index.xlsx" for the tentative due dates. |
| [DATA601_L01-DS_Python_Overview](https://github.com/fgonzaleumbc/DATA601_L01-DS_Python_Overview) | This repository contains various PowerPoint presentations that provide an overview of the class, the tools used in the class (e.g., Python, Jupyter Notebooks, Anaconda, Visual Studio Code), installation instructions, and a high-level overview of data science and Python. |
| [DATA601_L02-Jupyter_Notebook_Python_Overview](https://github.com/fgonzaleumbc/DATA601_L02-Jupyter_Notebook_Python_Overview) | This repository contains an introduction to Jupyter Notebooks and an overview of Python. |
| [DATA601_L03-Python_Collections_Statements_Functions](https://github.com/fgonzaleumbc/DATA601_L03-Python_Collections_Statements_Functions) | This repository contains class material with examples on Python data collections, logical operators, if/else statements, while and for loops, and functions. |
| [DATA601_L04-OOP_Markdown_RegEx](https://github.com/fgonzaleumbc/DATA601_L04-OOP_Markdown_RegEx) | This repository contains discussions on object-oriented programming (OOP), the Markdown language within Jupyter Notebooks, and regular expressions (RegEx). |
| [DATA601_L05-Numpy_Pandas](https://github.com/fgonzaleumbc/DATA601_L05-Numpy_Pandas) | This repository discusses using the NumPy library for mathematical operations and the pandas library for working with data. |
| [DATA601_L06-Data_Clean_Transform_Analysis](https://github.com/fgonzaleumbc/DATA601_L06-Data_Clean_Transform_Analysis) | This repository contains class material on working with date-time objects and using pandas for data cleaning, data analysis, and data transformation. |
| [DATA601_L07-Data_Visualization](https://github.com/fgonzaleumbc/DATA601_L07-Data_Visualization) | This repository discusses data visualization libraries, how to create various types of charts, and when to use them. |
| [DATA601_L08-DS_Example_Discussion](https://github.com/fgonzaleumbc/DATA601_L08-DS_Example_Discussion) | Lecture 8 discusses a data science example and use case using a movie dataset. |
| [DATA601_L09-Databases_Files_APIs](https://github.com/fgonzaleumbc/DATA601_L09-Databases_Files_APIs) | This repository contains material on working with various types of files (e.g., csv, txt, PDF, etc.), web data and web crawling, application programming interfaces (APIs), an introduction to relational databases and Structured Query Language (SQL), and software version control systems (e.g., Git). |
| [DATA601_L10-Statistics](https://github.com/fgonzaleumbc/DATA601_L10-Statistics) | This repository discusses probability and statistics in the context of data science. |
| [DATA601_L11-Supervised_ML](https://github.com/fgonzaleumbc/DATA601_L11-Supervised_ML) | This repository provides an introduction to supervised machine learning, regression models, classification models, and feature selection. |
| [DATA601_L12-Unsupervised_ML](https://github.com/fgonzaleumbc/DATA601_L12-Unsupervised_ML) | This repository provides an introduction to unsupervised machine learning and clustering algorithms. |
| [DATA601_L13-NLP](https://github.com/fgonzaleumbc/DATA601_L13-NLP) | This repository provides an introduction to natural language processing, token vectorization (e.g., Term Frequency Inverse Document Frequency (TFIDF)), bag of words models, similarity ranking techniques, text clustering, and text classification. |
| [DATA601_L14-Dashboarding](https://github.com/fgonzaleumbc/DATA601_L14-Dashboarding) | This repository has class material on dashboard creation. |
| [DATA601_L15-Ethics](https://github.com/fgonzaleumbc/DATA601_L15-Ethics) | This repository contains various presentations on data science and artificial intelligence ethics. |
| [DATA601_L16-Image_Classification](https://github.com/fgonzaleumbc/DATA601_L16-Image_Classification) | This repository contains examples on image classification. |
| [DATA601_L17-Timeseries_Financial_Data_Analysis](https://github.com/fgonzaleumbc/DATA601_L17-Timeseries_Financial_Data_Analysis) | This repository contains discussions on time series data analysis. |
| [DATA601_L18-RPA_Autotrending](https://github.com/fgonzaleumbc/DATA601_L18-RPA_Autotrending) | This repository contains discussions on robotic process automation (RPA) and auto trending calculations. |
| [DATA601_L19-NLP_LLMs](https://github.com/fgonzaleumbc/DATA601_L19-NLP_LLMs) | This repository contains discussions and an introduction to advanced NLP, generative AI, large language models (LLMs), the Ollama library, and agentic AI. |

Various commercial and platform-specific tools, such as data visualization tools (e.g., MS Power BI, Tableau) and cloud platforms (e.g., MS Azure, Amazon Web Services, Databricks), are discussed only in concept, as each has its own learning curve, costs, certifications, and other requirements. The following spreadsheet provides recommended certification paths and training resources for data analyst and data scientist positions: [Certification Paths Spreadsheet](https://docs.google.com/spreadsheets/d/1e8vvSrtGjtrkO5OrNAR8QYaCeowtCTX2/edit?usp=drive_link&ouid=103437557434463890960&rtpof=true&sd=true).

For questions, please contact me at:

Felix Gonzalez, P.E.
Adjunct Instructor,
Division of Professional Studies
Computer Science and Electrical Engineering
University of Maryland Baltimore County
fgonzale@umbc.edu
