MCS-226: Data Science and Big Data
Study Material

MCS-226: Data Science and Big Data

Complete Exam Answer Guide (2022-2025)


MCS-226: DATA SCIENCE AND BIG DATA

Complete Exam Answer Guide


Course Code: MCS-226
Programme: MCA (Master of Computer Applications)
University: Indira Gandhi National Open University (IGNOU)
Block Coverage: Block 1-4 (Units 1-16)
Exam Sessions Covered: June 2022 - June 2025
Total Questions: 205 Unified Question Families


Importance Legend

Symbol Meaning Frequency
🔴 Most Important Asked 4+ times
🟡 Very Important Asked 2-3 times
🟢 Important Asked 1 time

UNIT 1: DATA SCIENCE - INTRODUCTION


Q1. 🟡 What is Data Science? Define Data Science and explain it with the help of its applications.

[Asked: Jun 2023, Jun 2022, Dec 2024 | Frequency: 3]

Answer

Definition of Data Science: Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract meaningful knowledge and insights from structured and unstructured data. It combines expertise from statistics, mathematics, computer science, and domain knowledge to analyze complex data and solve real-world problems.

Key Components of Data Science:

Component Description
Statistics Foundation for data analysis and inference
Machine Learning Algorithms for pattern recognition and prediction
Data Engineering Data collection, storage, and processing
Domain Expertise Industry-specific knowledge application
Visualization Presenting insights in understandable formats


Applications of Data Science:

  1. Healthcare: Disease prediction, drug discovery, patient outcome analysis

  2. Finance: Fraud detection, risk assessment, algorithmic trading

  3. E-commerce: Recommendation systems, customer segmentation, demand forecasting

  4. Transportation: Route optimization, autonomous vehicles, traffic prediction

  5. Social Media: Sentiment analysis, content recommendation, trend detection

  6. Manufacturing: Predictive maintenance, quality control, supply chain optimization


Q2. 🟡 What are the applications/advantages of Data Science in an organization?

[Asked: Jun 2022, Jun 2023 | Frequency: 2]

Answer

Advantages of Data Science in Organizations:

Advantage Description
Informed Decision Making Data-driven insights replace guesswork
Predictive Capabilities Forecast trends and customer behavior
Cost Reduction Identify inefficiencies and optimize operations
Competitive Advantage Leverage data for market differentiation
Customer Understanding Deep insights into preferences and needs
Risk Management Early detection of potential issues
Process Automation Automate repetitive analytical tasks

Key Applications:

  1. Marketing Optimization: Target right customers with personalized campaigns

  2. Product Development: Data-driven feature prioritization

  3. Operational Efficiency: Streamline processes using analytics

  4. Revenue Growth: Identify new revenue opportunities

  5. Human Resources: Talent acquisition and retention analytics

  6. Supply Chain: Demand forecasting and inventory optimization


Q3. 🟡 What are the different types of data in Data Science? Briefly explain each type.

[Asked: Jun 2025 | Frequency: 2]

Answer

Types of Data in Data Science:


1. Based on Structure:

Type Description Examples
Structured Organized in predefined format (rows/columns) Databases, spreadsheets, SQL tables
Semi-Structured Partially organized with tags/markers JSON, XML, HTML, emails
Unstructured No predefined format Images, videos, audio, social media posts

2. Based on Nature:

Type Description Examples
Qualitative Descriptive, non-numeric Colors, names, categories
Quantitative Numeric, measurable Age, salary, temperature

3. Data Streams:

  • Continuous flow of data generated in real-time

  • Examples: Stock market feeds, sensor data, social media streams


Q4. 🟢 What is Structured Data? Explain with suitable example.

[Asked: Dec 2023 | Frequency: 1]

Answer

Structured Data is highly organized data that follows a predefined schema and can be easily stored in relational databases with rows and columns.

Characteristics:

  • Follows strict data model

  • Easily searchable using SQL queries

  • Stored in RDBMS (MySQL, PostgreSQL, Oracle)

  • Consistent format across records

Example - Employee Database:

EmpID Name Department Salary JoinDate
101 John IT 50000 2020-01-15
102 Mary HR 45000 2019-06-20
103 Alex Finance 55000 2021-03-10

Query Example:

SELECT Name, Salary FROM Employees WHERE Department = 'IT';


Q5. 🟢 Discuss how structured data is different from semi-structured data.

[Asked: Dec 2024 | Frequency: 1]

Answer
Aspect Structured Data Semi-Structured Data
Schema Strict, predefined schema Flexible, self-describing
Format Tables with rows/columns Tags, markers, hierarchies
Storage RDBMS NoSQL, document stores
Examples SQL databases, spreadsheets JSON, XML, HTML
Query Language SQL XPath, JSONPath
Flexibility Low - schema changes are complex High - easy to modify
Analysis Easy with traditional tools Requires parsing


Q6. 🟡 What is Semi-structured data? Explain with suitable example.

[Asked: Dec 2023, Dec 2022 | Frequency: 2]

Answer

Semi-structured Data is data that doesn't conform to rigid tabular structure but contains tags, markers, or other elements to separate semantic elements and enforce hierarchies.

Characteristics:

  • Self-describing with tags/markers

  • Flexible schema

  • Hierarchical organization

  • Stored in NoSQL databases

Examples:

1. JSON Format:

{
  "student": {
    "id": "S001",
    "name": "Rahul Kumar",
    "courses": ["MCS-226", "MCS-221"],
    "grades": {
      "MCS-226": "A",
      "MCS-221": "B+"
    }
  }
}

2. XML Format:

<student>
  <id>S001</id>
  <name>Rahul Kumar</name>
  <courses>
    <course>MCS-226</course>
    <course>MCS-221</course>
  </courses>
</student>

Use Cases: Web APIs, configuration files, log files, email data
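
Parsing such a record needs no fixed schema; a minimal Python sketch using the standard json module (field names taken from the JSON example above):

```python
import json

# Semi-structured record: self-describing keys, nested hierarchy, no rigid schema
raw = '''
{
  "student": {
    "id": "S001",
    "name": "Rahul Kumar",
    "courses": ["MCS-226", "MCS-221"],
    "grades": {"MCS-226": "A", "MCS-221": "B+"}
  }
}
'''

record = json.loads(raw)              # parse text into nested dicts/lists
student = record["student"]
print(student["name"])                # Rahul Kumar
print(student["grades"]["MCS-226"])   # A
print(len(student["courses"]))        # 2
```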


Q7. 🟡 What is Unstructured data? Explain with suitable example.

[Asked: Dec 2023, Dec 2022 | Frequency: 2]

Answer

Unstructured Data is data that has no predefined format or organization, making it difficult to store in traditional databases.

Characteristics:

  • No predefined data model

  • Difficult to search and analyze with traditional methods

  • Requires specialized tools for processing

  • Constitutes ~80% of all enterprise data

Examples:

Category Examples
Text Emails, documents, social media posts
Multimedia Images, videos, audio files
Web Content HTML pages, blogs, forums
Sensor Data IoT device readings


Processing Methods: Natural Language Processing (NLP), Computer Vision, Deep Learning


Q8. 🟢 What is Qualitative data? Explain with example.

[Asked: Dec 2022 | Frequency: 1]

Answer

Qualitative Data (also called Categorical Data) represents characteristics or qualities that cannot be measured numerically but can be categorized.

Types:

Type Description Example
Nominal Categories without order Gender (Male/Female), Blood Type (A, B, O, AB)
Ordinal Categories with meaningful order Education Level (High School < Bachelor < Master < PhD)

Characteristics:

  • Non-numeric in nature

  • Describes attributes or properties

  • Can be counted but not measured

  • Analyzed using mode, frequency distribution

Examples:

  • Eye color: Blue, Brown, Green

  • Customer satisfaction: Poor, Average, Good, Excellent

  • Product categories: Electronics, Clothing, Food
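
Because qualitative values can be counted but not measured, analysis reduces to frequencies and the mode; a short sketch with collections.Counter (the response list is illustrative):

```python
from collections import Counter

# Qualitative (categorical) responses: counted, not measured
satisfaction = ["Good", "Poor", "Good", "Excellent", "Average",
                "Good", "Average", "Excellent", "Good", "Poor"]

freq = Counter(satisfaction)          # frequency distribution
mode, count = freq.most_common(1)[0]  # mode = most frequent category

print(freq)
print(mode, count)                    # Good 4
```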


Q9. 🟡 What is Quantitative data? Explain with example.

[Asked: Dec 2022, Jun 2022 | Frequency: 2]

Answer

Quantitative Data represents numerical values that can be measured and expressed using numbers.

Types:

Type Description Example
Discrete Countable, whole numbers Number of students (25, 30, 45)
Continuous Any value in a range Height (5.5 ft), Temperature (36.7°C)

Characteristics:

  • Numeric in nature

  • Can be measured precisely

  • Supports mathematical operations

  • Analyzed using mean, median, standard deviation

Examples:

  • Age: 25, 30, 45 years

  • Salary: ₹50,000, ₹75,000

  • Temperature: 25.5°C, 30.2°C

  • Distance: 100.5 km

Comparison with Qualitative:

Aspect Qualitative Quantitative
Nature Descriptive Numerical
Measurement Categories Exact values
Analysis Frequency, Mode Mean, Std Dev
Examples Colors, Grades Age, Salary

Q10. 🟢 Compare qualitative data with quantitative data.

[Asked: Jun 2023 | Frequency: 1]

Answer
Aspect Qualitative Data Quantitative Data
Definition Describes qualities/characteristics Describes quantities/amounts
Nature Non-numeric, categorical Numeric, measurable
Types Nominal, Ordinal Discrete, Continuous
Examples Gender, Color, Opinion Age, Height, Income
Collection Methods Surveys, Interviews, Observations Measurements, Counts, Experiments
Analysis Techniques Thematic analysis, Content analysis Statistical analysis, Regression
Central Tendency Mode Mean, Median, Mode
Visualization Pie charts, Bar graphs Histograms, Line graphs, Scatter plots
Flexibility Subjective interpretation Objective measurement
Sample Size Usually smaller Usually larger

Q11. 🟢 What is categorical data? Explain with example.

[Asked: Jun 2022 | Frequency: 1]

Answer

Categorical Data represents data that can be divided into distinct groups or categories. It is a type of qualitative data.

Types of Categorical Data:

  1. Nominal Data: Categories without inherent order. Examples: Blood type (A, B, AB, O), Country names, Colors

  2. Ordinal Data: Categories with meaningful order. Examples: Education level, Customer rating (1-5 stars)

Example Dataset:

Student Gender Grade City
A Male A Delhi
B Female B Mumbai
C Male A Chennai

Here, Gender, Grade, and City are all categorical variables.

Analysis Methods:

  • Frequency distribution

  • Mode calculation

  • Chi-square test

  • Bar charts and pie charts


Q12. 🟢 What is Measurement Scale of Data? What do you understand by this term?

[Asked: Jun 2023 | Frequency: 1]

Answer

Measurement Scale refers to the classification system used to categorize and quantify data based on the nature of information it represents and the mathematical operations that can be performed on it.


Purpose:

  • Determines appropriate statistical analysis

  • Guides data collection methodology

  • Defines mathematical operations possible

  • Helps in choosing visualization techniques


Q13. 🟡 Explain the characteristics of measurement scales of data.

[Asked: Jun 2023, Dec 2022 | Frequency: 2]

Answer

Four Measurement Scales (NOIR):

Scale Characteristics Operations Examples
Nominal Categories without order =, ≠ Gender, Blood Type, City
Ordinal Categories with order =, ≠, <, > Grades, Rankings, Ratings
Interval Equal intervals, no true zero +, - Temperature (°C), IQ Scores
Ratio Equal intervals, true zero +, -, ×, ÷ Height, Weight, Age, Income

Detailed Characteristics:

1. Nominal Scale:

  • Classifies data into mutually exclusive categories

  • No ranking or ordering

  • Mode is the only measure of central tendency

2. Ordinal Scale:

  • Categories have meaningful order

  • Differences between values are not uniform

  • Median can be calculated

3. Interval Scale:

  • Equal distances between values

  • No absolute zero point

  • Mean, median, mode all applicable

4. Ratio Scale:

  • Has true zero (absence of attribute)

  • All mathematical operations valid

  • Most informative scale


Q14. 🟡 List and define various measurement scales of data with suitable examples.

[Asked: Jun 2023, Dec 2022 | Frequency: 2]

Answer

1. Nominal Scale:

  • Definition: Classification without any order

  • Examples:

  • Gender: Male, Female

  • Marital Status: Single, Married, Divorced

  • Blood Group: A, B, AB, O

2. Ordinal Scale:

  • Definition: Classification with meaningful order but unequal intervals

  • Examples:

  • Education: Primary < Secondary < Graduate < Postgraduate

  • Satisfaction: Very Dissatisfied < Dissatisfied < Neutral < Satisfied < Very Satisfied

  • Military Ranks: Private < Corporal < Sergeant < Lieutenant

3. Interval Scale:

  • Definition: Ordered with equal intervals but no true zero

  • Examples:

  • Temperature in Celsius: 0°C doesn't mean no temperature

  • Calendar Years: Year 0 is not "beginning of time"

  • IQ Scores: 0 IQ doesn't mean no intelligence

4. Ratio Scale:

  • Definition: Ordered with equal intervals and absolute zero

  • Examples:

  • Height: 0 cm means no height

  • Weight: 0 kg means no weight

  • Income: ₹0 means no income

  • Age: 0 years means just born

Summary Table:

Scale Order Equal Interval True Zero Example
Nominal No No No Colors
Ordinal Yes No No Rankings
Interval Yes Yes No Temperature
Ratio Yes Yes Yes Weight

Q15. 🟢 What is Descriptive Analysis? Explain.

[Asked: Jun 2024 | Frequency: 1]

Answer

Descriptive Analysis is a statistical method that summarizes and describes the main features of a dataset, providing simple summaries about the sample and measures.

Key Components:

Component Description Examples
Central Tendency Average/typical value Mean, Median, Mode
Dispersion Spread of data Range, Variance, Std Dev
Distribution Shape of data Skewness, Kurtosis
Position Relative standing Percentiles, Quartiles

Techniques Used:

  1. Numerical Summaries: Mean, median, mode, standard deviation

  2. Graphical Representations: Histograms, bar charts, pie charts, box plots

  3. Frequency Tables: Count and percentage distributions

Example: For exam scores: [75, 80, 85, 90, 95]

  • Mean = 85

  • Median = 85

  • Range = 20

  • Standard Deviation = 7.07

Purpose: Understand "what happened" in the data without making predictions or inferences.
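
The summary figures above can be reproduced with Python's statistics module; note that pstdev (population standard deviation, dividing by N) is what yields the 7.07 value:

```python
import statistics

scores = [75, 80, 85, 90, 95]

mean = statistics.mean(scores)        # 85
median = statistics.median(scores)    # 85
rng = max(scores) - min(scores)       # 20
# Population standard deviation (divides by N), matching the 7.07 above
sd = statistics.pstdev(scores)

print(mean, median, rng, round(sd, 2))   # 85 85 20 7.07
```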


Q16. 🟢 What is Exploratory Analysis? Explain.

[Asked: Jun 2024 | Frequency: 1]

Answer

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods, to discover patterns, spot anomalies, and check assumptions.

Key Objectives:

  1. Understand data structure and content

  2. Detect outliers and anomalies

  3. Identify patterns and relationships

  4. Generate hypotheses for further testing

  5. Check assumptions for statistical models

Techniques:

Technique Purpose
Summary Statistics Understand central tendency and spread
Visualization Identify patterns visually
Correlation Analysis Find relationships between variables
Missing Value Analysis Identify data quality issues
Outlier Detection Find unusual observations

Common Visualizations:

  • Histograms and density plots

  • Scatter plots and pair plots

  • Box plots

  • Heat maps (correlation matrix)

Difference from Descriptive Analysis:

  • More visual and interactive

  • Focuses on discovery rather than just summarization

  • May involve transformations and feature engineering


Q17. 🟢 What is Inferential Analysis? Explain.

[Asked: Jun 2024 | Frequency: 1]

Answer

Inferential Analysis uses sample data to make generalizations, predictions, or decisions about a larger population.

Key Concepts:

Concept Description
Population Entire group of interest
Sample Subset of population
Hypothesis Testing Testing assumptions about population
Confidence Intervals Range of plausible values
p-value Probability of results if null hypothesis true

Common Techniques:

  1. Hypothesis Testing: t-test, chi-square test, ANOVA

  2. Confidence Intervals: Estimating population parameters

  3. Regression Analysis: Predicting relationships

  4. Correlation Analysis: Measuring association strength

Example:

  • Sample: 100 students' exam scores

  • Inference: Average score of all MCA students is between 70-80 with 95% confidence

Inference Flow:

Population → Sample → Sample Statistics → Population Parameters → Conclusions
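
A minimal sketch of the inference step: estimating a 95% confidence interval for a population mean from a sample using the normal approximation (z = 1.96); the sample values below are hypothetical, not from the course:

```python
import math
import statistics

sample = [72, 75, 78, 80, 70, 74, 77, 79, 76, 73]  # hypothetical exam scores

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (divides by n-1)

z = 1.96                       # 95% confidence, normal approximation
margin = z * s / math.sqrt(n)
ci = (mean - margin, mean + margin)

print(f"mean = {mean:.1f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```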

Q18. 🟡 What is Predictive Analysis? Explain.

[Asked: Jun 2024, Jun 2023 | Frequency: 2]

Answer

Predictive Analysis uses historical data, statistical algorithms, and machine learning techniques to forecast future outcomes.

Key Components:

Component Description
Historical Data Past observations used for training
Statistical Algorithms Regression, time series
Machine Learning Classification, clustering, neural networks
Validation Testing model accuracy

Common Techniques:

  1. Regression Models: Linear, logistic, polynomial

  2. Classification: Decision trees, random forests, SVM

  3. Time Series: ARIMA, exponential smoothing

  4. Neural Networks: Deep learning models

Applications:

Domain Application
Finance Credit scoring, fraud detection
Healthcare Disease prediction, patient outcomes
Retail Demand forecasting, churn prediction
Marketing Customer lifetime value, response prediction

Process Flow:

Historical Data → Data Preparation → Model Training → Model Validation → Predictions → Business Decisions

Q19. 🟢 Define the different methods for collecting, analysing and interpreting numerical information.

[Asked: Jun 2024 | Frequency: 1]

Answer

Methods for Numerical Data:

1. Data Collection Methods:

Method Description Example
Surveys Questionnaires with numeric responses Rating scales 1-10
Experiments Controlled data collection Lab measurements
Observations Recording numerical events Traffic counts
Secondary Sources Existing databases Census data, financial reports
Sensors/IoT Automated collection Temperature, pressure readings

2. Analysis Methods:

Type Techniques
Descriptive Mean, median, mode, standard deviation
Inferential t-tests, ANOVA, chi-square
Predictive Regression, machine learning
Exploratory Visualization, correlation

3. Interpretation Methods:

  • Statistical significance testing

  • Confidence interval construction

  • Effect size calculation

  • Trend analysis

  • Comparative analysis


Q20. 🟢 What are the common misconceptions of data science?

[Asked: Jun 2024 | Frequency: 1]

Answer

Common Misconceptions in Data Analysis:

Misconception Reality
Correlation = Causation Correlation shows relationship, not cause-effect
Bigger Sample = Better Quality matters more than quantity
Data Never Lies Data can be biased, incomplete, or manipulated
More Variables = Better Model Can lead to overfitting
AI/ML Solves Everything Requires clean data and proper problem framing

Key Fallacies:

1. Correlation vs Causation:

  • Ice cream sales and drowning deaths both increase in summer

  • They're correlated but ice cream doesn't cause drowning

2. Simpson's Paradox:

  • Trend appears in groups but reverses when groups are combined

  • Example: Treatment A may be better in each group but B appears better overall

3. Data Dredging:

  • Mining data for patterns without hypothesis

  • Leads to false discoveries due to multiple comparisons


Q21. 🟢 What is Simpson's Paradox? Explain with the help of an example.

[Asked: Dec 2024 | Frequency: 1]

Answer

Simpson's Paradox is a phenomenon where a trend appears in different groups of data but disappears or reverses when the groups are combined.

Example - University Admission:

By Department:

Department Male Applied Male Admitted Female Applied Female Admitted
Engineering 800 480 (60%) 100 70 (70%)
Arts 100 10 (10%) 400 80 (20%)

Combined:

Gender Total Applied Total Admitted Rate
Male 900 490 54.4%
Female 500 150 30%

Paradox: Females have higher admission rates in EACH department, but lower OVERALL admission rate.

Explanation: More women applied to the harder-to-get-into department (Arts).


Key Lesson: Always consider confounding variables and stratify data appropriately.
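
The reversal can be verified directly from the table's counts; a short sketch:

```python
# Admissions data from the tables above: (applied, admitted)
data = {
    "Engineering": {"Male": (800, 480), "Female": (100, 70)},
    "Arts":        {"Male": (100, 10),  "Female": (400, 80)},
}

def rate(applied, admitted):
    return admitted / applied

# Within every department, the female rate is higher...
for dept, groups in data.items():
    m = rate(*groups["Male"])
    f = rate(*groups["Female"])
    print(dept, f"male={m:.0%} female={f:.0%}")
    assert f > m

# ...yet the pooled (combined) rates reverse the trend
male_total = [sum(v["Male"][i] for v in data.values()) for i in (0, 1)]
female_total = [sum(v["Female"][i] for v in data.values()) for i in (0, 1)]
print(f"overall male={rate(*male_total):.1%} female={rate(*female_total):.1%}")
```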


Q22. 🟢 What is Dredging? Explain with the help of an example.

[Asked: Dec 2024 | Frequency: 1]

Answer

Data Dredging (also called p-hacking or data fishing) is the misuse of data analysis to find patterns that can be presented as statistically significant when in fact there is no underlying effect.

Characteristics:

  • Testing multiple hypotheses without correction

  • Cherry-picking favorable results

  • Ignoring negative findings

  • Post-hoc hypothesis generation

Example: A researcher tests 100 different foods for cancer correlation:

  • At 5% significance level, expect ~5 false positives

  • Publishing only "chocolate causes cancer" without mentioning 99 other tests

Problems:

  1. Inflated false positive rate

  2. Non-reproducible results

  3. Misleading conclusions

  4. Wasted resources on false leads

Prevention:

  • Pre-register hypotheses

  • Apply multiple testing corrections (Bonferroni)

  • Report all tests conducted

  • Replicate findings independently
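
The inflation of false positives can be illustrated by simulation: under a true null hypothesis, p-values are uniform on [0, 1], so roughly 5 of 100 tests clear α = 0.05 by chance alone, while the Bonferroni-corrected threshold α/m suppresses them (a sketch with seeded illustrative data):

```python
import random

random.seed(42)                    # reproducible illustration

m = 100                            # number of hypotheses ("foods" tested)
alpha = 0.05

# Under the null hypothesis, every p-value is uniform on [0, 1]
p_values = [random.random() for _ in range(m)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / m for p in p_values)   # threshold 0.0005

print(f"expected false positives ≈ {m * alpha:.0f}")
print(f"naive 'discoveries': {naive_hits}")
print(f"after Bonferroni correction: {bonferroni_hits}")
```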


Q23. 🟡 What is Data Science Life Cycle? Explain the different stages with the help of a diagram.

[Asked: Jun 2024, Dec 2023 | Frequency: 2]

Answer

Data Science Life Cycle is a systematic approach to solving data problems through iterative phases.


Stages Explained:

Stage Description Activities
1. Business Understanding Define problem and objectives Stakeholder meetings, goal setting
2. Data Collection Gather relevant data APIs, databases, surveys, web scraping
3. Data Preparation Clean and transform data Missing values, normalization, encoding
4. Exploratory Analysis Understand data patterns Visualization, statistics, correlations
5. Data Modeling Build analytical models ML algorithms, feature engineering
6. Model Evaluation Assess model performance Accuracy, precision, recall, F1-score
7. Deployment Implement in production APIs, dashboards, automation
8. Monitoring Track performance over time Drift detection, retraining

Key Characteristics:

  • Iterative, not linear

  • Requires cross-functional collaboration

  • Documentation at each stage is crucial


UNIT 2: PROBABILITY AND STATISTICS FOR DATA SCIENCE


Q24. 🟡 What is Conditional Probability? Explain with the help of a diagram.

[Asked: Jun 2025, Jun 2024, Dec 2023 | Frequency: 3]

Answer

Conditional Probability is the probability of an event occurring given that another event has already occurred.

Formula:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Where:

  • $P(A|B)$ = Probability of A given B has occurred

  • $P(A \cap B)$ = Probability of both A and B occurring

  • $P(B)$ = Probability of B occurring


Example:

  • Box contains 6 red and 4 blue balls

  • P(2nd ball is red | 1st ball was red and not replaced)

  • P(Red₂|Red₁) = 5/9


Q25. 🟡 Write the equation for conditional probability and describe its components with a suitable example.

[Asked: Jun 2025, Dec 2023, Jun 2024 | Frequency: 3]

Answer

Conditional Probability Equation:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Components:

Component Symbol Meaning
Conditional Probability P(A|B) Probability of A happening given B occurred
Joint Probability P(A ∩ B) Probability of both A and B happening together
Marginal Probability P(B) Overall probability of event B

Example - Medical Diagnosis:

Disease (D) No Disease (D') Total
Positive Test (T) 95 50 145
Negative Test (T') 5 850 855
Total 100 900 1000

Calculate P(Disease | Positive Test):

$$P(D|T) = \frac{P(D \cap T)}{P(T)} = \frac{95/1000}{145/1000} = \frac{95}{145} = 0.655$$

Interpretation: If a person tests positive, there's a 65.5% chance they have the disease.
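
The same calculation in Python, using exact fractions (counts taken from the table above; the per-1000 factors cancel):

```python
from fractions import Fraction

# Counts from the 2x2 table (per 1000 people)
disease_and_positive = 95   # P(D ∩ T) numerator
positive_total = 145        # P(T) numerator

# P(D | T) = P(D ∩ T) / P(T)
p_d_given_t = Fraction(disease_and_positive, positive_total)

print(p_d_given_t, float(p_d_given_t))   # 19/29 0.6551724137931034
```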


Q26. 🟡 What is Bayes Theorem?

[Asked: Dec 2024, Jun 2024, Jun 2023 | Frequency: 3]

Answer

Bayes' Theorem is a mathematical formula for determining conditional probability, allowing us to update the probability of a hypothesis based on new evidence.

Formula:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Components:

Term Name Description
P(A|B) Posterior Updated probability after evidence
P(A) Prior Initial probability before evidence
P(B|A) Likelihood Probability of evidence given hypothesis
P(B) Marginal Likelihood Total probability of evidence

Extended Form (Total Probability):

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A') \cdot P(A')}$$

Key Applications:

  • Spam filtering

  • Medical diagnosis

  • Machine learning classification

  • Recommendation systems


Q27. 🟡 Explain Bayes Theorem with suitable equation and example.

[Asked: Dec 2024, Jun 2023, Jun 2024 | Frequency: 3]

Answer

Bayes' Theorem Equation:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Example - Disease Screening:

Given:

  • P(Disease) = 1% = 0.01 (Prior)

  • P(Positive | Disease) = 99% = 0.99 (Sensitivity)

  • P(Positive | No Disease) = 5% = 0.05 (False Positive Rate)

Find: P(Disease | Positive Test)

Solution:

Step 1: Calculate P(Positive) using total probability

$$P(+) = P(+|D) \cdot P(D) + P(+|D') \cdot P(D')$$
$$P(+) = 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$$

Step 2: Apply Bayes' Theorem

$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+)} = \frac{0.99 \times 0.01}{0.0594} = \frac{0.0099}{0.0594} = 0.167$$

Result: Only 16.7% chance of having disease even with positive test!
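
The two steps can be checked numerically:

```python
# Bayes' theorem with the screening numbers above
p_disease = 0.01         # prior P(D)
p_pos_given_d = 0.99     # sensitivity P(+|D)
p_pos_given_nd = 0.05    # false positive rate P(+|D')

# Step 1: total probability of a positive test
p_pos = p_pos_given_d * p_disease + p_pos_given_nd * (1 - p_disease)

# Step 2: posterior P(D|+)
posterior = p_pos_given_d * p_disease / p_pos
print(round(p_pos, 4), round(posterior, 3))   # 0.0594 0.167
```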


Q28. 🟡 What is a Random Variable? Explain the concept of random variable.

[Asked: Jun 2023, Jun 2024 | Frequency: 2]

Answer

Random Variable is a variable whose value is determined by the outcome of a random phenomenon. It maps outcomes of a random experiment to numerical values.

Types:

Type Description Example
Discrete Takes countable values Number of heads in 10 coin tosses
Continuous Takes any value in a range Height, weight, temperature

Notation:

  • X, Y, Z (capital letters) = Random variable

  • x, y, z (lowercase) = Specific value

Example - Dice Roll:

  • Random experiment: Rolling a fair die

  • Random variable X = Number shown on die

  • Possible values: X ∈ {1, 2, 3, 4, 5, 6}

  • P(X = 3) = 1/6

Properties:

  • Has a probability distribution

  • Can calculate expected value E(X)

  • Has variance Var(X) and standard deviation

Concept Flow:

Random Experiment → Outcome → Random Variable X → Numerical Value → Probability P(X = x)
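
A short sketch computing E(X) and Var(X) for the die example with exact fractions:

```python
from fractions import Fraction

# Random variable X = number shown on a fair die
values = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                    # P(X = x) for each face

expected = sum(x * p for x in values)                     # E(X) = 7/2
variance = sum(x**2 * p for x in values) - expected**2    # Var(X) = E(X²) − E(X)²

print(expected, variance)   # 7/2 35/12
```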

Q29. 🟢 Differentiate between Discrete Random Variable and Continuous Random Variable.

[Asked: Jun 2023 | Frequency: 1]

Answer
Aspect Discrete Random Variable Continuous Random Variable
Values Countable, finite or infinite Uncountable, any value in range
Gaps Has gaps between values No gaps, continuous spectrum
Probability P(X = x) > 0 for specific values P(X = x) = 0 for any single point
Distribution Probability Mass Function (PMF) Probability Density Function (PDF)
Examples Coin tosses, dice rolls, counts Height, weight, time, temperature
Graphical Bar chart Smooth curve
Calculation Sum of probabilities Integral of density function
Notation P(X = x) f(x) or P(a ≤ X ≤ b)

Examples:

Discrete:

  • X = Number of students in a class (0, 1, 2, ...)

  • Y = Number of defects in a product (0, 1, 2, ...)

Continuous:

  • X = Waiting time at a bus stop (0 to ∞)

  • Y = Height of students (any value like 5.67 ft)


Q30. 🟢 What is Binomial Distribution?

[Asked: Dec 2023 | Frequency: 1]

Answer

Binomial Distribution is a discrete probability distribution that models the number of successes in a fixed number of independent trials, where each trial has the same probability of success.

Conditions (BINS):

  • Binary outcomes (success/failure)

  • Independent trials

  • Number of trials is fixed

  • Same probability for each trial

Parameters:

  • n = number of trials

  • p = probability of success

  • q = 1 - p = probability of failure

Notation: X ~ Binomial(n, p)

Characteristics:

  • Mean: μ = np

  • Variance: σ² = npq

  • Standard Deviation: σ = √(npq)


Q31. 🟢 Write the formula for binomial probability distribution.

[Asked: Dec 2023 | Frequency: 1]

Answer

Binomial Probability Formula:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Where:

  • $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ = Number of ways to choose k successes from n trials

  • n = Total number of trials

  • k = Number of successes (0, 1, 2, ..., n)

  • p = Probability of success in each trial

  • (1-p) = Probability of failure

Example: Probability of getting exactly 3 heads in 5 coin tosses:

$$P(X = 3) = \binom{5}{3} (0.5)^3 (0.5)^2 = 10 \times 0.125 \times 0.25 = 0.3125$$

Q32. 🟡 Apply the binomial probability distribution formula to produce the probability distribution for the coin toss problem.

[Asked: Dec 2023, Jun 2022 | Frequency: 2]

Answer

Problem: Find probability distribution for number of heads in 4 coin tosses.

Given: n = 4, p = 0.5 (fair coin)

Formula: $P(X = k) = \binom{4}{k} (0.5)^k (0.5)^{4-k} = \binom{4}{k} (0.5)^4$

Calculations:

X (Heads) $\binom{4}{k}$ Calculation P(X)
0 1 1 × (0.5)⁴ 0.0625
1 4 4 × (0.5)⁴ 0.2500
2 6 6 × (0.5)⁴ 0.3750
3 4 4 × (0.5)⁴ 0.2500
4 1 1 × (0.5)⁴ 0.0625
Total 1.0000


Statistics:

  • Mean = np = 4 × 0.5 = 2

  • Variance = npq = 4 × 0.5 × 0.5 = 1

  • Std Dev = 1
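
The table above can be generated directly from the formula (math.comb requires Python 3.8+):

```python
from math import comb

n, p = 4, 0.5   # tosses, P(head)

# P(X = k) = C(n, k) · p^k · (1-p)^(n-k)
dist = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for k, prob in dist.items():
    print(k, prob)             # 0.0625, 0.25, 0.375, 0.25, 0.0625

mean = n * p                   # 2.0
variance = n * p * (1 - p)     # 1.0
assert abs(sum(dist.values()) - 1) < 1e-12   # probabilities sum to 1
```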


Q33. 🟡 What kind of probability distribution is binomial? Explain the characteristics of binomial distribution.

[Asked: Jun 2022, Jun 2024 | Frequency: 2]

Answer

Binomial Distribution is a discrete probability distribution.

Characteristics:

Characteristic Description
Discrete X takes only whole number values (0, 1, 2, ..., n)
Fixed Trials Number of trials n is predetermined
Binary Outcomes Each trial has only two outcomes (success/failure)
Independence Trials are independent of each other
Constant Probability P(success) = p remains same for all trials

Mathematical Properties:

Property Formula
Mean (Expected Value) ฮผ = E(X) = np
Variance σ² = Var(X) = np(1-p)
Standard Deviation σ = √(np(1-p))
Mode ⌊(n+1)p⌋ or ⌈(n+1)p⌉ - 1

Shape Characteristics:

  • Symmetric when p = 0.5

  • Right-skewed when p < 0.5

  • Left-skewed when p > 0.5

  • Approaches normal distribution for large n (np > 5 and n(1-p) > 5)

Applications:

  • Quality control (defective items)

  • Medical trials (patient recovery)

  • Marketing (customer response)

  • Finance (default probability)


Q34. 🟡 What is Normal Distribution? Explain the characteristics of normal distribution.

[Asked: Jun 2024, Dec 2022 | Frequency: 2]

Answer

Normal Distribution (Gaussian Distribution) is a continuous probability distribution that is symmetric and bell-shaped, described by mean (μ) and standard deviation (σ).

Formula (PDF):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Notation: X ~ N(μ, σ²)

Characteristics:

Property Description
Symmetry Symmetric around mean μ
Bell-shaped Peak at mean, tails extend infinitely
Mean = Median = Mode All central tendency measures equal
Total Area = 1 Under the curve
Asymptotic Curve never touches x-axis

Empirical Rule (68-95-99.7):

Range Percentage
μ ± 1σ 68.27%
μ ± 2σ 95.45%
μ ± 3σ 99.73%


Standard Normal Distribution: Z ~ N(0, 1) where Z = (X - μ)/σ
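As a quick check of the empirical rule, the standard normal CDF can be written with the standard library's `math.erf` — a sketch, not a production routine:

```python
from math import sqrt, pi, exp, erf

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x), straight from the PDF formula."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def std_norm_cdf(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Empirical rule: P(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3
coverage = {k: std_norm_cdf(k) - std_norm_cdf(-k) for k in (1, 2, 3)}
# ~0.6827, ~0.9545, ~0.9973
```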


Q35. 🟢 What is probability distribution of continuous random variable? Explain with the help of a diagram.

[Asked: Dec 2022 | Frequency: 1]

Answer

Probability Distribution of Continuous Random Variable is described using a Probability Density Function (PDF), where probability is calculated as area under the curve.

Key Properties:

  1. f(x) ≥ 0 for all x

  2. Total area under curve = 1

  3. P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx

  4. P(X = specific value) = 0

Common Continuous Distributions:

Distribution Use Case
Normal Natural phenomena, errors
Exponential Waiting times
Uniform Equal probability in range
Chi-square Hypothesis testing


Example - Uniform Distribution:

  • X ~ Uniform(0, 10)

  • PDF: f(x) = 1/10 for 0 ≤ x ≤ 10

  • P(2 ≤ X ≤ 5) = (5-2)/10 = 0.3
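The uniform example can be checked directly, since probability is just the area of a rectangle under f(x) = 1/(hi − lo) — a minimal sketch:

```python
def uniform_prob(a, b, lo=0.0, hi=10.0):
    """P(a <= X <= b) for X ~ Uniform(lo, hi)."""
    a, b = max(a, lo), min(b, hi)       # clip to the support
    return max(b - a, 0) / (hi - lo)    # area under the flat PDF

p = uniform_prob(2, 5)       # (5 - 2) / 10 = 0.3
total = uniform_prob(0, 10)  # total area under the PDF = 1.0
point = uniform_prob(4, 4)   # P(X = 4) = 0 for a continuous variable
```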


Q36. 🟢 How does sampling differ from population?

[Asked: Dec 2023 | Frequency: 1]

Answer
Aspect Population Sample
Definition Entire group of interest Subset of population
Size Usually large (N) Smaller, manageable (n)
Data Collection Census (complete enumeration) Sampling techniques
Parameters Fixed values (μ, σ) Estimates (x̄, s)
Cost High Lower
Time Time-consuming Faster
Accuracy True values Subject to sampling error
Feasibility Often impractical Practical
Notation Greek letters (μ, σ, N) Latin letters (x̄, s, n)

Example:

  • Population: All MCA students in India

  • Sample: 500 randomly selected MCA students


Q37. 🟢 Discuss the relation of the terms 'statistic' and 'parameter' with sampling and population respectively.

[Asked: Dec 2023 | Frequency: 1]

Answer

Relationship:

Term Associated With Description Notation
Parameter Population Fixed, unknown value describing population μ, σ, π
Statistic Sample Calculated value from sample data x̄, s, p̂

Key Differences:

Aspect Parameter Statistic
Source Population Sample
Value Fixed Varies by sample
Known? Usually unknown Calculated
Purpose What we want to know Estimate parameter

Common Pairs:

Measure Parameter (Population) Statistic (Sample)
Mean μ (mu) x̄ (x-bar)
Standard Deviation σ (sigma) s
Proportion π or p p̂ (p-hat)
Variance σ² s²
Size N n

Relationship:

  • Statistics are estimators of parameters

  • Multiple samples → Multiple statistics → Sampling distribution

  • As n → N, statistic → parameter


Q38. 🟡 What is Sampling? What is Sampling Distribution? Explain with the help of an example.

[Asked: Dec 2024, Jun 2022 | Frequency: 2]

Answer

Sampling is the process of selecting a subset (sample) from a population to make inferences about the entire population.

Sampling Distribution is the probability distribution of a statistic (like sample mean) obtained from all possible samples of a given size from a population.

Types of Sampling:

Type Method
Simple Random Each member has equal chance
Stratified Divide into groups, sample from each
Cluster Randomly select clusters
Systematic Select every kth member

Example - Sampling Distribution of Mean:

Population: {2, 4, 6, 8, 10}, μ = 6

All possible samples of size 2 (with replacement):

Sample Values Mean (x̄)
1 2, 2 2
2 2, 4 3
3 2, 6 4
... ... ...
25 10, 10 10

Sampling Distribution:

  • Mean of x̄ values = μ = 6

  • Standard Error = σ/√n
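The example above can be verified exhaustively in Python by enumerating all 25 samples — a sketch using `itertools.product`:

```python
from itertools import product

population = [2, 4, 6, 8, 10]
N = len(population)

mu = sum(population) / N                             # 6.0
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance = 8.0

# All 5 x 5 = 25 samples of size n = 2, drawn with replacement
sample_means = [(a + b) / 2 for a, b in product(population, repeat=2)]

mean_of_means = sum(sample_means) / len(sample_means)   # equals mu = 6
var_of_means = sum((m - mean_of_means) ** 2
                   for m in sample_means) / len(sample_means)
# equals sigma^2 / n = 8 / 2 = 4, i.e. Standard Error = sigma / sqrt(n)
```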


Q39. 🟢 What are the two measures to define the central tendencies of quantitative data? Explain with example.

[Asked: Dec 2024 | Frequency: 1]

Answer

Two Main Measures of Central Tendency:

1. Mean (Arithmetic Average):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
  • Sum of all values divided by count

  • Affected by outliers

  • Uses all data points

Example: Data: 10, 20, 30, 40, 50. Mean = (10+20+30+40+50)/5 = 150/5 = 30

2. Median (Middle Value):

  • Middle value when data is sorted

  • Not affected by outliers

  • Better for skewed distributions

Example: Data: 10, 20, 30, 40, 100. Median = 30 (middle value); Mean = 40 (pulled up by the outlier 100)

Comparison:

Aspect Mean Median
Calculation Sum/Count Middle value
Outlier sensitivity High Low
Best for Symmetric data Skewed data
Uses all values Yes No

Third Measure - Mode: Most frequently occurring value
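The outlier example is easy to reproduce with the standard library's `statistics` module:

```python
from statistics import mean, median, mode

data = [10, 20, 30, 40, 100]   # 100 is an outlier

m1 = mean(data)    # 40 — pulled up by the outlier
m2 = median(data)  # 30 — unaffected by the outlier
m3 = mode([10, 20, 20, 30])  # 20 — most frequently occurring value
```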


Q40. 🟡 What are the different measures for defining the spread or variability of a quantitative variable? Explain with examples.

[Asked: Jun 2022, Dec 2024 | Frequency: 2]

Answer

Measures of Spread/Variability:

Measure Formula Description
Range Max - Min Simplest measure
Variance σ² = Σ(xᵢ - μ)²/N Average squared deviation
Standard Deviation σ = √Variance Spread in original units
IQR Q3 - Q1 Range of middle 50%
Coefficient of Variation CV = (σ/μ) × 100% Relative variability

Example: Data: 5, 10, 15, 20, 25

1. Range: Range = 25 - 5 = 20

2. Variance:

  • Mean (μ) = 15

  • Deviations: -10, -5, 0, 5, 10

  • Squared deviations: 100, 25, 0, 25, 100

  • Variance = 250/5 = 50

3. Standard Deviation: σ = √50 = 7.07

4. IQR:

  • Q1 = 7.5, Q3 = 22.5

  • IQR = 22.5 - 7.5 = 15

When to Use:

  • Range: Quick overview

  • Std Dev: Most common, comparable data

  • IQR: Skewed data, with outliers

  • CV: Comparing variability of different units
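All of the measures from the worked example can be computed with the `statistics` module; its default `quantiles` method happens to match the Q1 = 7.5, Q3 = 22.5 convention used above:

```python
from statistics import mean, pvariance, pstdev, quantiles

data = [5, 10, 15, 20, 25]

rng = max(data) - min(data)        # Range = 20
var = pvariance(data)              # population variance = 50
sd = pstdev(data)                  # sqrt(50) ≈ 7.07
q1, q2, q3 = quantiles(data, n=4)  # 7.5, 15.0, 22.5
iqr = q3 - q1                      # 15.0
cv = sd / mean(data) * 100         # coefficient of variation ≈ 47.1 %
```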


Q41. 🟢 Explain the steps of significance testing with the help of an example.

[Asked: Dec 2022 | Frequency: 1]

Answer

Steps of Significance Testing (Hypothesis Testing):

Step 1: State Hypotheses

  • H₀ (Null Hypothesis): No effect/difference

  • H₁ (Alternative Hypothesis): Effect/difference exists

Step 2: Choose Significance Level (α)

  • Typically α = 0.05 or 0.01

  • Probability of rejecting H₀ when it's true (Type I error)

Step 3: Select Test Statistic

  • t-test, z-test, chi-square, F-test, etc.

Step 4: Calculate Test Statistic and p-value

Step 5: Make Decision

  • If p-value < α: Reject H₀

  • If p-value ≥ α: Fail to reject H₀

Example - Testing Mean Score:

Claim: Average exam score is 75

Sample: n = 36, x̄ = 78, s = 12

Solution:

  1. H₀: μ = 75, H₁: μ ≠ 75

  2. α = 0.05

  3. t-test (unknown σ)

  4. t = (78-75)/(12/√36) = 3/2 = 1.5

  5. p-value ≈ 0.14 > 0.05

  6. Fail to reject H₀ - Insufficient evidence that mean differs from 75
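The test statistic from the example is a one-line computation; the p-value itself needs the t distribution (e.g. from SciPy), so only the statistic is sketched here:

```python
from math import sqrt

n, xbar, s = 36, 78, 12
mu0 = 75                 # hypothesised mean under H0

se = s / sqrt(n)         # standard error = 12 / 6 = 2.0
t = (xbar - mu0) / se    # (78 - 75) / 2 = 1.5

# For df = 35 the two-sided critical value at alpha = 0.05 is about 2.03,
# so |t| = 1.5 does not reach it and we fail to reject H0.
```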


Q42. 🟢 Write short note on Chi-square test.

[Asked: Jun 2023 | Frequency: 1]

Answer

Chi-Square Test (χ²) is a statistical test used to determine if there is a significant association between categorical variables.

Types:

Type Purpose
Goodness of Fit Compare observed vs expected frequencies
Test of Independence Check if two variables are related
Test of Homogeneity Compare distributions across groups

Formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • O = Observed frequency

  • E = Expected frequency

Degrees of Freedom:

  • Goodness of fit: df = k - 1

  • Independence: df = (r-1)(c-1)

Example - Test of Independence:

Like Coffee Don't Like Total
Male 30 20 50
Female 20 30 50
Total 50 50 100

Expected (if independent): Each cell = 25

χ² = (30-25)²/25 + (20-25)²/25 + (20-25)²/25 + (30-25)²/25 = 1 + 1 + 1 + 1 = 4

df = (2-1)(2-1) = 1; critical value at α = 0.05: 3.84

Since 4 > 3.84, reject H₀ - Gender and coffee preference are related.
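The χ² computation for the 2×2 table reduces to a short loop:

```python
observed = [30, 20, 20, 30]   # table cells, row by row
expected = [25, 25, 25, 25]   # row_total * col_total / grand_total = 25 each

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # 4.0

critical = 3.84               # chi-square critical value for df = 1, alpha = 0.05
reject_h0 = chi2 > critical   # True: gender and preference are associated
```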


UNIT 3: DATA PREPARATION FOR ANALYSIS


Q43. 🟡 What is Data Preprocessing? Explain with the help of an example.

[Asked: Dec 2022, Jun 2022 | Frequency: 2]

Answer

Data Preprocessing is the technique of transforming raw data into a clean, understandable format suitable for analysis. Raw data is often incomplete, inconsistent, and contains errors that must be corrected before analysis.

Why Preprocessing is Needed:

  • Real-world data is messy and incomplete

  • Contains noise, outliers, and missing values

  • Different formats and scales need standardization

  • Irrelevant features need to be removed

Key Steps in Data Preprocessing:

Raw Data → Data Cleaning → Data Integration → Data Transformation → Data Reduction → Clean Data

Example - Customer Dataset:

Raw Data (Before Preprocessing):

CustomerID Name Age Income City
101 John 25 50000 Delhi
102 NULL -5 75000 mumbai
103 Mary 30 NULL Delhi
104 Alex 999 60000 DELHI

After Preprocessing:

CustomerID Name Age Income City
101 John 25 50000 Delhi
102 Unknown 28 (mean) 75000 Mumbai
103 Mary 30 61667 (mean) Delhi
104 Alex 28 (replaced outlier) 60000 Delhi

Issues Fixed:

  • NULL replaced with defaults/mean values

  • Invalid age (-5, 999) corrected

  • City names standardized (case consistency)
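The fixes above can be sketched in plain Python; the records and cleaning rules mirror the toy table, and the field names are illustrative:

```python
# Toy records mirroring the table above (None marks a missing value).
rows = [
    {"id": 101, "name": "John", "age": 25,  "income": 50000, "city": "Delhi"},
    {"id": 102, "name": None,   "age": -5,  "income": 75000, "city": "mumbai"},
    {"id": 103, "name": "Mary", "age": 30,  "income": None,  "city": "Delhi"},
    {"id": 104, "name": "Alex", "age": 999, "income": 60000, "city": "DELHI"},
]

valid_ages = [r["age"] for r in rows if 0 < r["age"] < 120]
mean_age = round(sum(valid_ages) / len(valid_ages))           # 28

known_incomes = [r["income"] for r in rows if r["income"] is not None]
mean_income = round(sum(known_incomes) / len(known_incomes))  # 61667

for r in rows:
    r["name"] = r["name"] or "Unknown"   # fill missing names
    if not 0 < r["age"] < 120:           # invalid ages (-5, 999) -> mean
        r["age"] = mean_age
    if r["income"] is None:              # mean imputation
        r["income"] = mean_income
    r["city"] = r["city"].title()        # standardise case -> "Delhi", "Mumbai"
```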


Q44. 🟢 Why is data preprocessing important in data science and big data applications? Discuss with suitable diagram.

[Asked: Dec 2024 | Frequency: 1]

Answer

Importance of Data Preprocessing:

Reason Explanation
Data Quality Garbage in = Garbage out; clean data → accurate results
Model Performance ML models perform better with preprocessed data
Consistency Standardizes formats across different sources
Efficiency Reduces storage and computation requirements
Accuracy Removes noise and errors that affect analysis
Compatibility Makes data compatible with analysis tools

Impact on Big Data:

Challenge How Preprocessing Helps
Volume Data reduction techniques
Variety Format standardization
Velocity Stream preprocessing pipelines
Veracity Data validation and cleaning


Without Preprocessing:

  • Models give inaccurate predictions

  • Analysis results are misleading

  • Storage and processing are inefficient

  • Integration of multiple sources fails


Q45. 🟢 Discuss different phases of data preprocessing.

[Asked: Dec 2024 | Frequency: 1]

Answer

Phases of Data Preprocessing:

Phase 1: Data Cleaning → Phase 2: Data Integration → Phase 3: Data Transformation → Phase 4: Data Reduction → Preprocessed Data

Phase 1: Data Cleaning

Task Description
Missing Values Fill with mean/median/mode or remove
Noise Removal Smooth out random errors
Outlier Detection Identify and handle extreme values
Inconsistency Fix contradictory data

Phase 2: Data Integration

Task Description
Schema Integration Combine schemas from multiple sources
Entity Resolution Match same entities across sources
Redundancy Removal Eliminate duplicate attributes
Conflict Resolution Handle different values for same entity

Phase 3: Data Transformation

Technique Purpose
Normalization Scale values to 0-1 range
Standardization Transform to mean=0, std=1
Aggregation Summarize data (daily → monthly)
Discretization Convert continuous to categorical
Encoding Convert categorical to numerical

Phase 4: Data Reduction

Technique Purpose
Dimensionality Reduction Reduce number of features (PCA)
Numerosity Reduction Reduce data volume (sampling)
Data Compression Encode data efficiently
Feature Selection Keep only relevant features
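The two scaling techniques listed under Phase 3 can be sketched as:

```python
from statistics import mean, pstdev

def min_max(values):
    """Normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: transform to mean = 0, std = 1 (population std)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

data = [10, 20, 30, 40, 50]
scaled = min_max(data)        # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = z_score(data)  # mean 0, standard deviation 1
```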

Q46. 🟡 What is Data Cleaning?

[Asked: Jun 2025, Dec 2023 | Frequency: 2]

Answer

Data Cleaning (also called Data Cleansing or Data Scrubbing) is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset.

Definition: Data cleaning involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty data.

Common Data Quality Issues:

Issue Example Solution
Missing Values Age = NULL Imputation or deletion
Duplicate Records Same customer twice Deduplication
Inconsistent Formats Date: 10/12/2024 vs 2024-12-10 Standardization
Typos/Errors "Delih" instead of "Delhi" Correction
Outliers Age = 999 Statistical methods
Invalid Data Age = -5 Validation rules

Data Cleaning Process:

Identify Issues → Define Rules → Apply Corrections → Validate Results → Document Changes

Importance:

  • Ensures data accuracy and reliability

  • Improves analysis and model performance

  • Reduces errors in decision-making

  • Saves time in downstream processing


Q47. 🔴 What are the methods of data cleaning? List and briefly discuss the best practices used for data cleaning and data preparation.

[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]

Answer

Methods of Data Cleaning:

1. Handling Missing Values:

Method When to Use
Deletion When missing data is random and small (<5%)
Mean/Median/Mode Imputation Numerical data with few missing values
Forward/Backward Fill Time series data
Predictive Imputation Use ML to predict missing values
Constant Value Replace with default (e.g., "Unknown")

2. Handling Duplicates:

# Assuming df is an existing pandas DataFrame

# Identify duplicate rows (Boolean mask)
duplicates = df.duplicated()

# Remove duplicate rows
df_clean = df.drop_duplicates()

3. Handling Outliers:

Method Description
Z-Score Remove if |z| > 3
IQR Method Remove if < Q1-1.5×IQR or > Q3+1.5×IQR
Capping Replace with threshold values
Transformation Log transform to reduce impact
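The IQR method from the table (Tukey's fences) can be sketched with the standard library:

```python
from statistics import quantiles

def iqr_fences(values):
    """Return (lower, upper) limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [23, 34, 45, 56, 67, 67, 78, 88, 89, 90, 999]  # 999 is an outlier
lo, hi = iqr_fences(data)
clean = [v for v in data if lo <= v <= hi]   # 999 falls outside and is dropped
```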

4. Standardization & Normalization:

Technique Formula Range
Min-Max Normalization (x - min)/(max - min) [0, 1]
Z-Score Standardization (x - ฮผ)/ฯƒ Unbounded

5. Data Type Conversion:

  • Convert strings to dates

  • Convert categories to numbers

  • Parse structured text fields

Best Practices:

Practice Description
Profile First Understand data before cleaning
Document Everything Keep log of all changes
Preserve Original Keep backup of raw data
Automate Create reusable cleaning scripts
Validate Check results after each step
Iterative Approach Clean in multiple passes


Q48. 🟢 What is Data Curation? Explain with the help of an example.

[Asked: Dec 2022 | Frequency: 1]

Answer

Data Curation is the process of organizing, integrating, and maintaining data throughout its lifecycle to ensure it remains accessible, reliable, and valuable for current and future use.

Definition: Data curation involves the active management of data from creation through its entire lifecycle, including organization, validation, preservation, and ensuring long-term accessibility.

Key Activities in Data Curation:

Activity Description
Collection Gathering data from various sources
Organization Structuring and categorizing data
Validation Ensuring accuracy and quality
Preservation Storing for long-term access
Documentation Adding metadata and context
Access Control Managing who can use the data


Example - Research Data Curation:

A university research project on climate change:

Stage Curation Activity
Collection Gather temperature data from 100 weather stations
Organization Structure by location, date, measurement type
Validation Cross-check readings, flag anomalies
Documentation Add metadata: sensor type, calibration date, location coordinates
Preservation Store in institutional repository with backups
Access Publish dataset with DOI for citation

Before Curation:

  • Scattered files in different formats

  • No documentation of collection methods

  • Missing context for interpretation

After Curation:

  • Unified dataset with consistent format

  • Complete metadata for reproducibility

  • Accessible to other researchers

  • Preserved for future studies

Difference from Data Cleaning:

Data Cleaning Data Curation
Fixes errors and inconsistencies Manages entire data lifecycle
One-time process Ongoing activity
Technical focus Governance focus
Prepares for analysis Ensures long-term value

UNIT 4: DATA VISUALIZATION


Q49. 🟢 What is a Histogram?

[Asked: Jun 2023 | Frequency: 1]

Answer

Histogram is a graphical representation of the distribution of numerical data, showing the frequency of data points falling within specified ranges (bins).

Characteristics:

  • X-axis: Data ranges (bins)

  • Y-axis: Frequency/count

  • Bars are adjacent (no gaps)

  • Shows distribution shape


Use Cases:

  • Understanding data distribution

  • Identifying skewness

  • Detecting outliers

  • Comparing distributions


Q50. 🟢 How does Histogram differ from Bar Graph?

[Asked: Jun 2023 | Frequency: 1]

Answer
Aspect Histogram Bar Graph
Data Type Continuous/Numerical Categorical/Discrete
Bar Spacing No gaps (adjacent bars) Gaps between bars
X-axis Ranges/Bins Categories
Purpose Show distribution Compare categories
Bar Order Fixed (numerical order) Can be rearranged
Bar Width Meaningful (represents range) Arbitrary


Q51. 🟢 Briefly discuss the utility of Histogram in Data Science.

[Asked: Jun 2023 | Frequency: 1]

Answer

Utilities of Histogram in Data Science:

Utility Description
Distribution Analysis Understand how data is spread
Outlier Detection Identify extreme values
Skewness Detection Determine if data is symmetric or skewed
Binning Decisions Help decide discretization strategy
Feature Engineering Guide transformation decisions
Data Quality Identify data issues


Applications:

  • EDA (Exploratory Data Analysis)

  • Feature selection

  • Model assumption validation

  • Data preprocessing decisions


Q52. 🟡 How to create a Histogram in R? Write the syntax and explain with example.

[Asked: Jun 2023, Dec 2024 | Frequency: 2]

Answer

Basic Syntax:

hist(x, main, xlab, ylab, col, border, breaks)

Parameters:

Parameter Description
x Vector of values
main Title of histogram
xlab X-axis label
ylab Y-axis label
col Fill color
border Border color
breaks Number of bins

Example:

# Create sample data
marks <- c(45, 67, 89, 34, 78, 56, 90, 23, 67, 88, 
           54, 76, 82, 39, 71, 63, 95, 48, 72, 85)

# Create histogram
hist(marks,
     main = "Distribution of Student Marks",
     xlab = "Marks",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 5)

Output:

Distribution of Student Marks
Frequency
    │
  6 │        ████
    │        ████
  4 │  ████  ████  ████
    │  ████  ████  ████
  2 │  ████  ████  ████  ████
    │  ████  ████  ████  ████
  0 └──────────────────────────
     20-40 40-60 60-80 80-100
              Marks


Q53. 🔴 What is a Box Plot? What do you mean by Box Plot?

[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]

Answer

Box Plot (also called Box-and-Whisker Plot) is a standardized way of displaying the distribution of data based on five key statistics: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

Five-Number Summary:

Statistic Description
Minimum Smallest value (excluding outliers)
Q1 (25th percentile) Lower quartile
Median (Q2) Middle value (50th percentile)
Q3 (75th percentile) Upper quartile
Maximum Largest value (excluding outliers)

Diagram:

                  Q1   Median   Q3
                   ┌──────┬──────┐
    ├──────────────┤      │      ├──────────────┤     ○   ○
                   └──────┴──────┘
   Min                                          Max  Outliers

IQR (Interquartile Range): Q3 - Q1

Outlier Detection:

  • Lower outliers: < Q1 - 1.5 × IQR

  • Upper outliers: > Q3 + 1.5 × IQR
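The five-number summary behind a box plot can be computed with a short helper — a sketch using Tukey's hinges, where the quartiles are the medians of the lower and upper halves of the sorted data:

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using Tukey's hinges."""
    xs = sorted(values)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

scores = [23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120]
summary = five_number_summary(scores)   # (23, 61.5, 75.0, 88.5, 120)
```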


Q54. 🟡 What is the utility of Box Plot in Data Science? Briefly discuss.

[Asked: Jun 2025, Dec 2023 | Frequency: 2]

Answer

Utilities of Box Plot:

Utility Description
Distribution Summary Quick overview of data spread
Outlier Detection Clearly shows extreme values
Comparison Compare multiple groups side-by-side
Skewness Detection Asymmetric box indicates skew
Central Tendency Shows median clearly
Variability IQR shows data spread

Applications in Data Science:

  1. EDA: Initial data exploration

  2. Feature Analysis: Compare feature distributions

  3. Data Quality: Identify anomalies

  4. Group Comparison: Compare across categories

  5. Model Diagnostics: Check residual distributions


Q55. 🟡 How to create a Box Plot in R? Write the syntax or list the commands.

[Asked: Dec 2022, Jun 2023, Dec 2024 | Frequency: 3]

Answer

Basic Syntax:

boxplot(x, main, xlab, ylab, col, border, horizontal, notch)

Parameters:

Parameter Description
x Vector or formula
main Title
xlab, ylab Axis labels
col Fill color
horizontal TRUE for horizontal plot
notch TRUE for notched box

Example 1: Single Box Plot

# Sample data
scores <- c(45, 67, 89, 34, 78, 56, 90, 23, 67, 88, 120)

# Create box plot
boxplot(scores,
        main = "Student Scores Distribution",
        ylab = "Scores",
        col = "lightgreen",
        border = "darkgreen")

Example 2: Grouped Box Plot

# Create data frame
data <- data.frame(
  scores = c(75, 80, 85, 70, 90, 65, 70, 75, 60, 80),
  group = c("A","A","A","A","A","B","B","B","B","B")
)

# Grouped box plot
boxplot(scores ~ group, 
        data = data,
        main = "Scores by Group",
        xlab = "Group",
        ylab = "Scores",
        col = c("lightblue", "lightpink"))


Q56. 🟢 What are whiskers in a Box Plot?

[Asked: Dec 2023 | Frequency: 1]

Answer

Whiskers are the lines extending from the box in a box plot to the minimum and maximum values within a defined range.

Definition:

  • Lower Whisker: Extends from Q1 to the smallest value ≥ Q1 - 1.5×IQR

  • Upper Whisker: Extends from Q3 to the largest value ≤ Q3 + 1.5×IQR

Diagram:

                      ┌──────┬──────┐
    ├─────────────────┤      │      ├─────────────────┤
                      └──────┴──────┘
   Min                Q1   Median   Q3                Max
    ◄─ lower whisker ─►              ◄─ upper whisker ─►

Whisker Calculation:

Component Formula
IQR Q3 - Q1
Upper Limit Q3 + 1.5 × IQR
Lower Limit Q1 - 1.5 × IQR
Upper Whisker Max value ≤ Upper Limit
Lower Whisker Min value ≥ Lower Limit

Purpose:

  • Show data range excluding outliers

  • Help identify outliers (points beyond whiskers)

  • Indicate data variability


Q57. 🟢 Explain clearly how the Box Plot differs from Scatter Plot.

[Asked: Jun 2025 | Frequency: 1]

Answer
Aspect Box Plot Scatter Plot
Purpose Show distribution of ONE variable Show relationship between TWO variables
Variables Univariate (single variable) Bivariate (two variables)
Data Points Summarized (5-number summary) Individual points shown
Outliers Explicitly marked Visible but not marked
Comparison Compare distributions across groups Identify correlations
Best For Distribution, spread, outliers Correlation, patterns, trends


When to Use:

Scenario Use
Analyze single variable distribution Box Plot
Compare groups Box Plot
Find relationship between 2 variables Scatter Plot
Identify clusters Scatter Plot
Detect outliers in one variable Box Plot
Predict one variable from another Scatter Plot

Q58. 🟢 Draw a sample box plot and explain it.

[Asked: Jun 2022 | Frequency: 1]

Answer

Sample Data: Test scores: 23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120

Calculations:

  • Sorted: 23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120

  • Q1 = 61.5

  • Median (Q2) = 75

  • Q3 = 88.5

  • IQR = 88.5 - 61.5 = 27

  • Lower Limit = 61.5 - 1.5(27) = 21

  • Upper Limit = 88.5 + 1.5(27) = 129

Box Plot:

                     ┌───────┬──────┐
     ├───────────────┤       │      ├───────────────┤
                     └───────┴──────┘
    23              61.5     75    88.5            120
    Min              Q1    Median   Q3             Max

Interpretation:

  • Median = 75: Half the students scored above 75

  • IQR = 27: Middle 50% of scores span 27 points

  • Symmetric: Median is roughly centered in box

  • No extreme outliers: All values within whisker range


Q59. 🟡 What is a Scatter Plot?

[Asked: Dec 2023, Dec 2024 | Frequency: 2]

Answer

Scatter Plot is a type of graph that displays values for two variables as a collection of points, showing the relationship or correlation between them.

Characteristics:

  • X-axis: Independent variable

  • Y-axis: Dependent variable

  • Each point represents one observation

  • Pattern reveals relationship type

Types of Relationships:

 Positive             Negative             No Correlation
 │         ··         │··                  │  ·    ·
 │       ··           │  ··                │    · ·  ·
 │     ··             │    ··              │  ·   ·
 │   ··               │      ··            │ ·  ·    ·
 └──────────          └──────────          └──────────

Use Cases:

  • Correlation analysis

  • Regression modeling

  • Outlier detection

  • Cluster identification


Q60. 🟡 What is the use of scatter plot? Give uses and best practices.

[Asked: Dec 2024, Dec 2023 | Frequency: 2]

Answer

Uses of Scatter Plot:

Use Description
Correlation Detection Identify positive/negative/no correlation
Trend Analysis Observe patterns in data
Outlier Detection Spot unusual data points
Regression Basis Foundation for linear regression
Cluster Identification Find natural groupings
Hypothesis Testing Validate assumptions about relationships

Best Practices:

Practice Guideline
Clear Labels Label both axes with units
Appropriate Scale Start axis at 0 when meaningful
Point Size Keep consistent, not too large
Color Coding Use for categorical grouping
Trend Line Add regression line if relevant
Avoid Overplotting Use transparency for large datasets

Example Interpretation:

Height vs Weight (Positive Correlation)

Weight │                    · ·
 (kg)  │                · · ·
       │            · · ·
       │        · · ·
       │    · · ·
       │· · ·
       └────────────────────────
                Height (cm)

Interpretation: As height increases, weight tends to increase
Correlation: Strong positive (r ≈ 0.8)


Q61. 🟡 How to draw a Scatter Plot in R? Write the syntax and explain with example.

[Asked: Dec 2024, Jun 2023, Jun 2024 | Frequency: 3]

Answer

Basic Syntax:

plot(x, y, main, xlab, ylab, col, pch, cex)

Parameters:

Parameter Description
x X-axis values
y Y-axis values
main Title
xlab, ylab Axis labels
col Point color
pch Point shape (1-25)
cex Point size

Example:

# Sample data
height <- c(150, 160, 165, 170, 175, 180, 185, 190)
weight <- c(50, 55, 60, 65, 70, 75, 80, 85)

# Create scatter plot
plot(height, weight,
     main = "Height vs Weight",
     xlab = "Height (cm)",
     ylab = "Weight (kg)",
     col = "blue",
     pch = 16,
     cex = 1.5)

# Add trend line
abline(lm(weight ~ height), col = "red", lwd = 2)

Point Shapes (pch values):

1: ○  2: △  3: +  4: ×  5: ◇
16: ●  17: ▲  18: ◆  19: ●  20: •


Q62. 🟢 What is a Heat Map?

[Asked: Jun 2023 | Frequency: 1]

Answer

Heat Map is a data visualization technique that uses color intensity to represent the magnitude of values in a matrix or table format.

Characteristics:

  • Uses color gradients (e.g., blue→red)

  • Displays 2D data matrix

  • Darker/brighter colors = higher values

  • Often includes clustering (dendrograms)

Diagram:

         Feature1  Feature2  Feature3
Sample1  ██████    ░░░░░░    ████
Sample2  ░░░░░░    ██████    ██
Sample3  ████      ████      ██████
Sample4  ██        ██        ░░░░░░

Color Scale: ░ Low ─────────── █ High

Components:

  1. Color Scale: Legend showing value-to-color mapping

  2. Cells: Individual data points

  3. Dendrograms: Optional clustering trees

  4. Labels: Row and column identifiers


Q63. 🟢 Give uses and best practices for Heat Maps.

[Asked: Jun 2023 | Frequency: 1]

Answer

Uses of Heat Maps:

Use Application
Correlation Matrix Visualize variable relationships
Gene Expression Compare expression across samples
Website Analytics User click patterns
Geographic Data Population density, temperature
Time Series Activity patterns by hour/day
Clustering Results Show group similarities

Best Practices:

Practice Guideline
Color Choice Use intuitive colors (blue=cold, red=hot)
Color Blindness Avoid red-green combinations
Normalization Scale data for fair comparison
Clustering Group similar rows/columns
Labels Keep readable, rotate if needed
Legend Always include color scale
Annotation Add values in cells if few

R Code Example:

# Create matrix
data <- matrix(runif(25), nrow=5, ncol=5)

# Create heatmap
heatmap(data,
        main = "Sample Heat Map",
        col = heat.colors(10))


Q64. 🔴 What is the use of Bar Chart? How to draw a Bar Chart in R?

[Asked: Jun 2022, Jun 2023, Dec 2024, Jun 2024 | Frequency: 4]

Answer

Use of Bar Chart:

Use Description
Comparison Compare values across categories
Ranking Show highest to lowest
Composition Parts of a whole (stacked)
Trends Changes over discrete periods
Distribution Frequency of categories

Types:

  • Vertical bar chart

  • Horizontal bar chart

  • Grouped bar chart

  • Stacked bar chart

R Syntax:

barplot(height, names.arg, main, xlab, ylab, col, border, horiz)

Parameters:

Parameter Description
height Vector of bar heights
names.arg Labels for bars
col Bar colors
horiz TRUE for horizontal
beside TRUE for grouped bars

Example:

# Data
sales <- c(250, 180, 320, 280, 150)
products <- c("A", "B", "C", "D", "E")

# Create bar chart
barplot(sales,
        names.arg = products,
        main = "Product Sales Comparison",
        xlab = "Product",
        ylab = "Sales (units)",
        col = c("red", "blue", "green", "orange", "purple"),
        border = "black")

Output:

Sales │
 320  │        ████
 280  │        ████  ████
 250  │  ████  ████  ████
 180  │  ████  ████  ████  ████
 150  │  ████  ████  ████  ████  ████
      └──────────────────────────────
          A     B     C     D     E
               Products


Q65. 🟡 How to create Line Graphs in R? Write the syntax and explain with example.

[Asked: Jun 2023, Dec 2024 | Frequency: 2]

Answer

Basic Syntax:

plot(x, y, type = "l", main, xlab, ylab, col, lwd, lty)

Parameters:

Parameter Description
type "l"=line, "b"=both, "o"=overplotted
lwd Line width
lty Line type (1=solid, 2=dashed, etc.)

Example:

# Data - Monthly sales
months <- 1:12
sales <- c(100, 120, 150, 180, 200, 220, 210, 190, 170, 150, 130, 140)

# Create line graph
plot(months, sales,
     type = "o",
     main = "Monthly Sales Trend",
     xlab = "Month",
     ylab = "Sales (units)",
     col = "blue",
     lwd = 2,
     pch = 16)

# Add grid
grid()

Multiple Lines:

# Second product
sales2 <- c(80, 100, 130, 150, 170, 180, 175, 160, 140, 120, 100, 110)

# Add to existing plot
lines(months, sales2, col = "red", lwd = 2, type = "o", pch = 17)

# Add legend
legend("topright", 
       legend = c("Product A", "Product B"),
       col = c("blue", "red"),
       lwd = 2,
       pch = c(16, 17))


Q66. 🟢 What is the use of Pair Plot? Explain how to read a pair plot.

[Asked: Dec 2024 | Frequency: 1]

Answer

Pair Plot (also called Scatter Plot Matrix) displays pairwise relationships between multiple variables in a dataset.

Uses:

Use Description
EDA Quick overview of all relationships
Correlation Identify correlated variables
Patterns Spot non-linear relationships
Outliers Detect multivariate outliers
Feature Selection Choose relevant features

How to Read a Pair Plot:

  1. Diagonal: Shows distribution (histogram/density)

  2. Upper/Lower Triangle: Scatter plots (often mirrored)

  3. Strong Correlation: Points form line pattern

  4. No Correlation: Random scatter

  5. Clusters: Grouped points suggest categories

R Example:

# Using pairs function
pairs(iris[,1:4], 
      main = "Iris Dataset Pair Plot",
      col = iris$Species,
      pch = 19)


Q67. ๐ŸŸข List the key characteristics of various types of plots for data visualization.

[Asked: Jun 2024 | Frequency: 1]

Answer
| Plot Type | Variables | Best For | Key Characteristics |
|---|---|---|---|
| Histogram | 1 numerical | Distribution | Bins, frequency, no gaps |
| Bar Chart | 1 categorical | Comparison | Gaps between bars, categories |
| Box Plot | 1 numerical | Summary stats | 5-number summary, outliers |
| Scatter Plot | 2 numerical | Correlation | Points, trends, clusters |
| Line Graph | Time series | Trends | Connected points, time-based |
| Heat Map | Matrix | Patterns | Color intensity, 2D grid |
| Pie Chart | 1 categorical | Proportions | Circular, percentages |
| Pair Plot | Multiple | Relationships | Matrix of scatter plots |
| Violin Plot | 1 numerical | Distribution | Box plot + density |
| Area Chart | Time series | Cumulative | Filled under line |

Selection Guide:

graphviz diagram

UNIT 5: BIG DATA ARCHITECTURE


Q68. ๐ŸŸก What is Big Data?

[Asked: Jun 2025, Jun 2022 | Frequency: 2]

Answer

Big Data refers to extremely large and complex datasets that cannot be processed, stored, or analyzed using traditional data processing tools and techniques.

Definition: Big Data is characterized by high volume, velocity, and variety of data that requires advanced technologies and analytical methods to extract meaningful insights.

Key Characteristics (5 Vs):

| V | Description | Example |
|---|---|---|
| Volume | Massive amount of data | Petabytes, Exabytes |
| Velocity | Speed of data generation | Real-time streaming |
| Variety | Different data types | Text, images, videos |
| Veracity | Data quality/accuracy | Trustworthiness |
| Value | Business insights | Actionable decisions |

Sources of Big Data:

  • Social media (Facebook, Twitter)

  • IoT sensors

  • E-commerce transactions

  • Scientific experiments

  • Healthcare records

  • Financial markets


Q69. ๐Ÿ”ด What are the characteristics of Big Data? Explain the four V's with examples.

[Asked: Jun 2025, Dec 2022, Jun 2022, Jun 2024 | Frequency: 4]

Answer

The 4 V's of Big Data:

graphviz diagram

1. VOLUME (Size)

Aspect Description
Definition Massive scale of data
Scale Terabytes → Petabytes → Exabytes
Example Facebook generates 4+ PB of data daily
Challenge Storage and processing infrastructure

2. VELOCITY (Speed)

Aspect Description
Definition Speed of data generation and processing
Types Batch, Near real-time, Real-time
Example Stock market: millions of trades per second
Challenge Real-time processing requirements

3. VARIETY (Types)

Type Examples
Structured Databases, spreadsheets
Semi-structured JSON, XML, logs
Unstructured Images, videos, emails
Example Hospital: patient records + X-rays + doctor notes

4. VERACITY (Quality)

Aspect Description
Definition Accuracy and trustworthiness
Issues Missing data, inconsistencies, bias
Example Social media sentiment may be manipulated
Challenge Ensuring data quality at scale

5. VALUE (Insight)

Aspect Description
Definition Business value extracted from data
Goal Turn raw data into actionable insights
Example Netflix recommendations drive 80% of viewing
Challenge Deriving meaningful insights cost-effectively

Summary Table:

| V | Question Answered | Key Metric |
|---|---|---|
| Volume | How much? | Size (TB, PB) |
| Velocity | How fast? | Speed (records/sec) |
| Variety | What types? | Format diversity |
| Veracity | How accurate? | Data quality % |
| Value | How useful? | Business impact |

Q70. ๐ŸŸข Differentiate between Big Data and Data Warehouse.

[Asked: Jun 2025 | Frequency: 1]

Answer
| Aspect | Big Data | Data Warehouse |
|---|---|---|
| Data Type | Structured + Unstructured | Primarily Structured |
| Volume | Petabytes to Exabytes | Terabytes |
| Processing | Distributed (Hadoop, Spark) | Centralized (SQL Server, Oracle) |
| Schema | Schema-on-read | Schema-on-write |
| Data Source | Multiple heterogeneous sources | Integrated enterprise sources |
| Storage | HDFS, NoSQL | RDBMS |
| Query Type | Exploratory, ML | Predefined reports, BI |
| Latency | Real-time possible | Typically batch |
| Cost | Lower (commodity hardware) | Higher (specialized hardware) |
| Flexibility | High | Limited |

Diagram:

d2 diagram

Q71. ๐ŸŸข How does Big Data differ from relational data?

[Asked: Dec 2022 | Frequency: 1]

Answer
| Aspect | Big Data | Relational Data |
|---|---|---|
| Volume | Massive (PB+) | Limited (GB-TB) |
| Structure | Any (structured, unstructured) | Structured only |
| Schema | Flexible, schema-on-read | Fixed, schema-on-write |
| Scaling | Horizontal (add nodes) | Vertical (bigger server) |
| Processing | Distributed (MapReduce) | Single server (SQL) |
| ACID | Eventual consistency (BASE) | Full ACID compliance |
| Query Language | Various (Hive, Pig, etc.) | SQL |
| Storage | HDFS, NoSQL | RDBMS tables |
| Cost | Commodity hardware | Expensive specialized |
| Use Case | Analytics, ML, exploration | Transactions, reports |

Key Differences:

  1. Scale: Big Data handles internet-scale; RDBMS handles enterprise-scale

  2. Flexibility: Big Data accepts any format; RDBMS requires predefined schema

  3. Speed: Big Data can process in real-time; RDBMS typically batch
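The schema-on-read versus schema-on-write contrast above can be sketched in a few lines of Python. This is a toy illustration, not any specific database API: the RDBMS model rejects records that do not match a fixed schema before storing them, while the big-data model stores raw records as-is and interprets only the fields each query needs.

```python
import json

# Schema-on-write (RDBMS style): a record must match the fixed schema
# exactly before it may be stored.
def validate_on_write(record, schema):
    return set(record) == set(schema)

# Schema-on-read (Big Data style): raw records are stored as-is; each
# query interprets only the fields it needs and tolerates missing ones.
raw_lines = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": 25, "coupon": "X1"}',  # extra field: still usable
    '{"user": "c"}',                                # missing field: still usable
]

def query_total_amount(lines):
    # The "schema" exists only at read time; missing amounts default to 0.
    return sum(json.loads(line).get("amount", 0) for line in lines)

schema = {"user", "amount"}
print([validate_on_write(json.loads(l), schema) for l in raw_lines])  # [True, False, False]
print(query_total_amount(raw_lines))  # 35
```

Under schema-on-write, only the first record would be accepted; under schema-on-read, all three contribute to the query.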


Q72. ๐ŸŸข What is Big Data Analysis?

[Asked: Dec 2024 | Frequency: 1]

Answer

Big Data Analysis is the process of examining large and varied datasets to uncover hidden patterns, correlations, market trends, customer preferences, and other useful business information.

Components:

Component Description
Data Collection Gathering from multiple sources
Data Storage HDFS, NoSQL databases
Data Processing MapReduce, Spark
Data Analysis Statistical and ML techniques
Visualization Dashboards, reports

Types of Big Data Analysis:

| Type | Purpose | Example |
|---|---|---|
| Descriptive | What happened? | Sales reports |
| Diagnostic | Why did it happen? | Root cause analysis |
| Predictive | What will happen? | Demand forecasting |
| Prescriptive | What should we do? | Recommendation engines |

Tools Used:

  • Apache Hadoop

  • Apache Spark

  • Apache Kafka

  • MongoDB, Cassandra

  • Tableau, Power BI


Q73. ๐ŸŸข What is Distributed File System? Explain in the context of big data.

[Asked: Dec 2024 | Frequency: 1]

Answer

Distributed File System (DFS) is a file system that stores data across multiple machines (nodes) in a network, providing the illusion of a single unified file system to users.

Definition: A DFS allows files to be stored on multiple servers and accessed as if they were on a local disk, enabling scalable storage and parallel processing.

Key Concepts:

Concept Description
Nodes Individual machines in the cluster
Blocks Files split into fixed-size chunks
Replication Each block copied to multiple nodes
Namespace Unified view of distributed files

Diagram:

blockdiag diagram

Example - HDFS:

  • File "data.txt" (384 MB)

  • Split into 3 blocks of 128 MB each

  • Each block replicated 3 times

  • Stored across multiple DataNodes
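Plugging the example numbers into a short calculation (a toy sketch; real HDFS tracks this metadata in the NameNode) shows how block count and raw storage footprint follow from file size, block size, and replication factor:

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    # Number of fixed-size blocks the file is split into
    blocks = math.ceil(file_mb / block_mb)
    # Raw storage consumed: every byte is stored `replication` times
    # (an HDFS block only occupies its actual size, so no padding)
    raw_mb = file_mb * replication
    return blocks, raw_mb

print(hdfs_footprint(384))  # the 384 MB example: (3, 1152)
print(hdfs_footprint(300))  # last block is partial: (3, 900)
```

So the 384 MB file from the example occupies 3 blocks and 1152 MB of raw cluster storage.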


Q74. ๐ŸŸข Explain the different features of Distributed File System.

[Asked: Dec 2024 | Frequency: 1]

Answer

Features of Distributed File System:

Feature Description
Scalability Add nodes to increase capacity
Fault Tolerance Data replicated across nodes
Transparency Users see single file system
High Availability No single point of failure
Parallel Access Multiple clients access simultaneously
Data Locality Process data where it's stored

Detailed Features:

1. Scalability:

  • Horizontal scaling (add more machines)

  • Linear increase in capacity

  • No downtime for expansion

2. Fault Tolerance:

  • Data replication (typically 3 copies)

  • Automatic recovery from node failure

  • Continuous health monitoring

3. Transparency Types:

Type Description
Location Users don't know physical location
Access Same access method everywhere
Failure System handles failures invisibly
Replication Multiple copies appear as one

4. Data Locality:

  • Move computation to data

  • Reduces network bandwidth

  • Improves processing speed


Q75. ๐ŸŸข What is HDFS (Hadoop Distributed File System)?

[Asked: Jun 2025 | Frequency: 1]

Answer

HDFS (Hadoop Distributed File System) is a distributed, scalable, and fault-tolerant file system designed to store very large files across machines in a Hadoop cluster.

Key Features:

  • Stores files across commodity hardware

  • Handles petabytes of data

  • Fault-tolerant through replication

  • Optimized for large sequential reads

Architecture:

blockdiag diagram

Components:

Component Role
NameNode Master - manages metadata, namespace
DataNode Slave - stores actual data blocks
Secondary NameNode Checkpoint backup (not hot standby)

Block Storage:

  • Default block size: 128 MB

  • Each block replicated (default: 3)

  • Blocks distributed across DataNodes


Q76. ๐ŸŸก What are the characteristics of HDFS?

[Asked: Dec 2022, Jun 2022 | Frequency: 2]

Answer

Characteristics of HDFS:

Characteristic Description
Distributed Storage Data spread across multiple nodes
Fault Tolerance 3x replication by default
Scalability Scale to thousands of nodes
High Throughput Optimized for batch processing
Large Files Designed for GB-TB sized files
Write-Once Append-only, no random writes
Data Locality Move compute to data

Detailed Characteristics:

1. Large Block Size:

  • 128 MB default (vs 4 KB in traditional FS)

  • Reduces metadata overhead

  • Efficient for large sequential reads

2. Replication:

File: report.txt
  ↓
Block 1 → Node A, Node B, Node C
Block 2 → Node B, Node D, Node E
Block 3 → Node A, Node C, Node D

3. Rack Awareness:

  • Replicas placed in different racks

  • Survives rack-level failures

  • Optimizes network bandwidth

4. Write-Once, Read-Many:

  • Files written once

  • Appends supported (Hadoop 2.x+)

  • No random updates


Q77. ๐ŸŸก Why is HDFS used for Big data processing? What are the advantages of HDFS?

[Asked: Dec 2022, Jun 2022 | Frequency: 2]

Answer

Why HDFS for Big Data:

Reason Explanation
Scale Handles petabytes across thousands of nodes
Cost Runs on commodity hardware
Reliability Automatic replication and recovery
Performance High throughput for large files
Integration Works with Hadoop ecosystem

Advantages of HDFS:

Advantage Description
Fault Tolerance Node failure doesn't lose data
Scalability Add nodes without downtime
Cost-Effective Uses cheap commodity hardware
High Throughput Parallel data access
Data Locality Moves computation to data
Streaming Access Efficient for batch jobs

Comparison with Traditional FS:

| Aspect | HDFS | Traditional FS |
|---|---|---|
| Scale | PB+ | TB |
| Hardware | Commodity | Enterprise |
| Failure Handling | Automatic | Manual |
| Access Pattern | Sequential | Random |
| Block Size | 128 MB | 4 KB |

Q78. ๐ŸŸข Explain how Master/Slave process works in HDFS architecture.

[Asked: Dec 2024 | Frequency: 1]

Answer

Master/Slave Architecture in HDFS:

plantuml diagram

NameNode (Master):

Function Description
Namespace Management Maintains directory tree
Block Mapping Tracks which blocks on which nodes
Replication Ensures adequate copies exist
Client Coordination Directs clients to DataNodes

DataNode (Slave):

Function Description
Block Storage Stores actual data blocks
Heartbeat Sends health status every 3 seconds
Block Report Lists all blocks periodically
Data Transfer Serves read/write requests

Communication Flow:

  1. Heartbeat: DataNode → NameNode (every 3 sec)

     • Confirms node is alive

     • Receives commands (replicate, delete blocks)

  2. Block Report: DataNode → NameNode (every 6 hours)

     • Complete list of blocks on node

     • NameNode updates block mapping

  3. Read Operation:

     • Client → NameNode: "Where is file X?"

     • NameNode → Client: "Blocks on nodes A, B, C"

     • Client → DataNode: Direct data transfer
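The read flow can be mimicked with plain dictionaries (all names here — file_X, blk_1, nodes A–E — are hypothetical): the NameNode answers only the metadata question, and the client then pulls the bytes from DataNodes directly, which is why the NameNode never becomes a data-transfer bottleneck.

```python
# NameNode holds only metadata: file -> ordered list of (block_id, replica nodes)
namenode = {
    "file_X": [("blk_1", ["A", "B", "C"]), ("blk_2", ["B", "D", "E"])],
}
# DataNodes hold the actual bytes: node -> {block_id: content}
datanodes = {
    "A": {"blk_1": "Hello "},
    "B": {"blk_1": "Hello ", "blk_2": "World"},
    "C": {"blk_1": "Hello "},
    "D": {"blk_2": "World"},
    "E": {"blk_2": "World"},
}

def read_file(name):
    data = []
    for block_id, replicas in namenode[name]:  # ask NameNode: "Where is file X?"
        node = replicas[0]                     # pick one replica (e.g. the closest)
        data.append(datanodes[node][block_id]) # transfer data from DataNode directly
    return "".join(data)

print(read_file("file_X"))  # Hello World
```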


Q79. ๐ŸŸข Write steps to load data into HDFS format.

[Asked: Jun 2025 | Frequency: 1]

Answer

Steps to Load Data into HDFS:

Step 1: Start Hadoop Services

start-dfs.sh
start-yarn.sh

Step 2: Create Directory in HDFS

hdfs dfs -mkdir /user/data
hdfs dfs -mkdir -p /user/data/input

Step 3: Upload File to HDFS

# Single file
hdfs dfs -put localfile.txt /user/data/input/

# Multiple files
hdfs dfs -put *.csv /user/data/input/

# From local directory
hdfs dfs -copyFromLocal /local/path/ /hdfs/path/

Step 4: Verify Upload

# List files
hdfs dfs -ls /user/data/input/

# Check file size
hdfs dfs -du -h /user/data/input/

# View file content
hdfs dfs -cat /user/data/input/file.txt | head

Common HDFS Commands:

Command Description
-put Upload local file to HDFS
-get Download from HDFS to local
-ls List directory contents
-cat Display file contents
-rm Delete file
-mkdir Create directory
-copyFromLocal Same as -put
-copyToLocal Same as -get

Workflow Diagram:

blockdiag diagram

Q80. ๐ŸŸข Differentiate between Apache Hadoop-1 and Hadoop-2 using suitable diagram.

[Asked: Dec 2024 | Frequency: 1]

Answer

Comparison Table:

| Aspect | Hadoop 1.x | Hadoop 2.x |
|---|---|---|
| Resource Management | JobTracker | YARN (ResourceManager) |
| Processing | MapReduce only | Multiple frameworks |
| Scalability | ~4000 nodes | ~10000+ nodes |
| Single Point of Failure | Yes (NameNode) | No (HA NameNode) |
| Cluster Utilization | Fixed slots | Dynamic containers |
| Multi-tenancy | Limited | Full support |

Hadoop 1.x Architecture:

svgbob diagram

Hadoop 2.x Architecture (YARN):

blockdiag diagram

Key Improvements in Hadoop 2.x:

Feature Benefit
YARN Separates resource management from processing
HA NameNode Eliminates single point of failure
Federation Multiple namespaces for scalability
Containers Dynamic resource allocation
Multi-framework Supports Spark, Tez, Storm, etc.

UNIT 6: PROGRAMMING USING MAPREDUCE


Q81. ๐ŸŸก What is MapReduce? What is Hadoop MapReduce?

[Asked: Dec 2023, Jun 2023, Jun 2022 | Frequency: 3]

Answer

MapReduce is a programming model and processing framework for distributed computing on large datasets across a cluster of computers.

Definition: MapReduce divides a task into two phases - Map (transforms data into key-value pairs) and Reduce (aggregates values by key) - enabling parallel processing of massive datasets.

Core Concepts:

Phase Function
Map Processes input → (key, value) pairs
Shuffle & Sort Groups values by key
Reduce Aggregates values for each key

Diagram:

Input Data → Split → Map → Shuffle & Sort → Reduce → Output

Key Characteristics:

  • Parallel processing

  • Fault tolerance

  • Data locality

  • Scalable to thousands of nodes

Example - Word Count:

Input: "hello world hello"
Map Output: (hello,1), (world,1), (hello,1)
After Shuffle: (hello,[1,1]), (world,[1])
Reduce Output: (hello,2), (world,1)
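The word-count flow above can be simulated end-to-end in a few lines of Python (a minimal single-machine sketch of the programming model, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # Shuffle & Sort: group values by key, keys in sorted order
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Reduce: aggregate the value list for each key
    return {key: sum(values) for key, values in groups.items()}

pairs = map_phase("hello world hello")
print(pairs)                          # [('hello', 1), ('world', 1), ('hello', 1)]
print(shuffle(pairs))                 # {'hello': [1, 1], 'world': [1]}
print(reduce_phase(shuffle(pairs)))   # {'hello': 2, 'world': 1}
```

In real MapReduce the three functions run on different machines; the logic per phase is exactly this.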


Q82. ๐ŸŸก Explain the Map function and Reduce function with a suitable block diagram and example.

[Asked: Dec 2023, Jun 2022 | Frequency: 2]

Answer

Map Function:

  • Input: (key, value) pair

  • Output: List of (intermediate_key, intermediate_value) pairs

  • Processes each record independently

Reduce Function:

  • Input: (key, list of values)

  • Output: (key, aggregated_value)

  • Combines values for same key

Block Diagram:

d2 diagram

Example - Word Count:

Input File:

Hello World
Hello Hadoop
World of Big Data

Map Phase:

Mapper 1: "Hello World" → (Hello,1), (World,1)
Mapper 2: "Hello Hadoop" → (Hello,1), (Hadoop,1)
Mapper 3: "World of Big Data" → (World,1), (of,1), (Big,1), (Data,1)

Shuffle & Sort:

(Big, [1])
(Data, [1])
(Hadoop, [1])
(Hello, [1,1])
(of, [1])
(World, [1,1])

Reduce Phase:

Reducer: (Hello, [1,1]) → (Hello, 2)
Reducer: (World, [1,1]) → (World, 2)
Reducer: (Hadoop, [1]) → (Hadoop, 1)
...


Q83. ๐ŸŸข Give advantages of Hadoop MapReduce.

[Asked: Jun 2023 | Frequency: 1]

Answer

Advantages of Hadoop MapReduce:

Advantage Description
Scalability Process petabytes across thousands of nodes
Fault Tolerance Automatic task retry on failure
Cost-Effective Runs on commodity hardware
Parallel Processing Distributed computation
Data Locality Moves code to data, not data to code
Simplicity Simple programming model
Flexibility Works with any data type

Detailed Benefits:

1. Scalability:

  • Linear scalability with nodes

  • Add machines to increase capacity

  • No code changes needed

2. Fault Tolerance:

Node Failure → Detect → Reschedule Task → Continue

  • Tasks automatically rerun on other nodes

  • Data replicated for reliability

3. Data Locality:

Traditional: Move data → Process
MapReduce: Move code → Process locally

  • Reduces network traffic

  • Improves performance

4. Cost Savings:

  • No expensive specialized hardware

  • Open-source software

  • Commodity server clusters


Q84. ๐ŸŸข Discuss how key-value pair mechanism facilitates MapReduce programming.

[Asked: Jun 2023 | Frequency: 1]

Answer

Key-Value Pair Mechanism:

The key-value pair is the fundamental data structure in MapReduce, enabling:

  • Parallel processing

  • Data grouping

  • Distributed computation

How It Works:

| Stage | Input | Output |
|---|---|---|
| Map | (K1, V1) | List of (K2, V2) |
| Shuffle | (K2, V2) pairs | (K2, [V2, V2, ...]) |
| Reduce | (K2, [V2...]) | (K3, V3) |

Benefits:

Benefit Explanation
Parallelization Each key-value processed independently
Grouping Same keys automatically grouped
Distribution Keys distributed across reducers
Flexibility Any data can be key or value
Sorting Keys sorted automatically

Example:

Document: "apple banana apple cherry"

Map Output (K,V pairs):
(apple, 1)
(banana, 1)
(apple, 1)
(cherry, 1)

After Shuffle (grouped by key):
apple → [1, 1]
banana → [1]
cherry → [1]

Reduce Output:
(apple, 2)
(banana, 1)
(cherry, 1)

Why Keys Matter:

  • Determine which reducer processes the data

  • Enable aggregation and joining

  • Allow parallel processing of different keys
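The routing itself is typically just a hash: Hadoop's default HashPartitioner computes key.hashCode() % numReducers. A Python sketch (using Python's built-in hash in place of Java's hashCode) shows why every copy of a key reaches the same reducer:

```python
def partition(key, num_reducers):
    # Stand-in for Hadoop's default HashPartitioner:
    # the same key always hashes to the same reducer index.
    return hash(key) % num_reducers

keys = ["apple", "banana", "apple", "cherry", "apple"]
routed = [(k, partition(k, 3)) for k in keys]

# Every occurrence of "apple" lands on one reducer, so that reducer
# sees the complete value list [1, 1, 1] for the key.
apple_targets = {r for k, r in routed if k == "apple"}
print(len(apple_targets))  # 1
```

(Python randomizes string hashes between runs, so the reducer index varies across runs but is stable within one, which is all the shuffle requires.)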


Q85. ๐ŸŸข Explain Splitting operation of MapReduce.

[Asked: Jun 2023 | Frequency: 1]

Answer

Splitting is the first phase where input data is divided into fixed-size chunks called Input Splits for parallel processing.

Process:

blockdiag diagram

Characteristics:

Aspect Description
Split Size Typically equals HDFS block size (128 MB)
Logical Division Splits are logical, not physical
Record Boundary Respects record boundaries
Parallelism One mapper per split

Example:

Input File: 384 MB
HDFS Block Size: 128 MB

Splits Created:
- Split 1: 0-128 MB → Mapper 1
- Split 2: 128-256 MB → Mapper 2
- Split 3: 256-384 MB → Mapper 3

InputFormat Types:

Format Description
TextInputFormat Line-by-line (key=offset, value=line)
KeyValueInputFormat Tab-separated key-value
SequenceFileInputFormat Binary format
NLineInputFormat Fixed N lines per split
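The logical split computation from the example can be sketched as follows (byte ranges simplified to MB; real InputFormats also adjust boundaries so records are never cut in half):

```python
def input_splits(file_size_mb, split_size_mb=128):
    # One logical split per block-sized range; one mapper per split.
    splits = []
    start = 0
    while start < file_size_mb:
        end = min(start + split_size_mb, file_size_mb)
        splits.append((start, end))
        start = end
    return splits

print(input_splits(384))  # [(0, 128), (128, 256), (256, 384)]
print(input_splits(300))  # last split is smaller: [(0, 128), (128, 256), (256, 300)]
```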

Q86. ๐ŸŸข Explain Mapping operation of MapReduce.

[Asked: Jun 2023 | Frequency: 1]

Answer

Mapping is the phase where user-defined map function processes each input record and emits intermediate key-value pairs.

Process:

Input Split → RecordReader → Map Function → Intermediate Key-Value Pairs → Partitioner

Map Function Signature:

map(K1 key, V1 value, Context context) {
    // Transform input
    context.write(K2, V2);
}

Characteristics:

Aspect Description
Input One record at a time
Output Zero or more K-V pairs
Parallel Multiple mappers run concurrently
Stateless Each record processed independently

Example - Word Count Map:

Input Record: (0, "Hello World Hello")

Map Function:
for each word in value:
    emit(word, 1)

Output:
(Hello, 1)
(World, 1)
(Hello, 1)

Map Tasks:

  • Number of mappers = Number of input splits

  • Each mapper processes one split

  • Output written to local disk (not HDFS)


Q87. ๐ŸŸก What is the role of shuffling and sorting in MapReduce? Explain with word count example.

[Asked: Jun 2024, Jun 2022, Jun 2023 | Frequency: 3]

Answer

Shuffle and Sort is the intermediate phase between Map and Reduce that transfers, groups, and sorts data by key.

Roles:

Phase Role
Shuffle Transfer map outputs to reducers
Sort Sort data by keys
Merge Merge sorted data from multiple mappers

Process:

plantuml diagram

Word Count Example:

After Map Phase:

Mapper 1: (Hello,1), (World,1), (Hello,1)
Mapper 2: (Big,1), (Data,1), (Hello,1)
Mapper 3: (World,1), (Data,1)

After Shuffle & Sort:

Reducer 1 receives:
  (Big, [1])
  (Data, [1,1])
  (Hello, [1,1,1])

Reducer 2 receives:
  (World, [1,1])

Key Points:

  1. Partitioner decides which reducer gets which key

  2. Combiner can reduce data before shuffle (optional optimization)

  3. Sort ensures reducer gets sorted key order

  4. Merge combines data from all mappers

Importance:

  • Ensures same keys go to same reducer

  • Enables aggregation in reduce phase

  • Sorted order helps efficient processing
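Key point 2, the optional combiner, is easy to demonstrate: running the reducer's logic locally on each mapper's output shrinks what the shuffle must transfer over the network. A toy sketch using the mapper outputs from the example above:

```python
from collections import Counter

# Mapper outputs from the word-count example (each value is 1)
mapper_outputs = [
    [("Hello", 1), ("World", 1), ("Hello", 1)],  # Mapper 1
    [("Big", 1), ("Data", 1), ("Hello", 1)],     # Mapper 2
]

def combine(pairs):
    # Combiner: local per-mapper aggregation, same logic as the reducer here
    return list(Counter(k for k, _ in pairs).items())

shuffled_without = sum(len(p) for p in mapper_outputs)        # pairs sent: 6
shuffled_with = sum(len(combine(p)) for p in mapper_outputs)  # pairs sent: 5
print(shuffled_without, shuffled_with)  # 6 5
print(combine(mapper_outputs[0]))       # [('Hello', 2), ('World', 1)]
```

The saving grows with data volume: a mapper emitting millions of (word, 1) pairs ships only one pair per distinct word.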


Q88. ๐ŸŸข Explain Reducing operation of MapReduce.

[Asked: Jun 2023 | Frequency: 1]

Answer

Reducing is the final phase where user-defined reduce function aggregates all values for each key into final output.

Process:

Shuffled Data → Merge Sort → Reduce Function → Final Output → HDFS

Reduce Function Signature:

reduce(K2 key, Iterable<V2> values, Context context) {
    // Aggregate values
    context.write(K3, V3);
}

Characteristics:

Aspect Description
Input Key and list of all values for that key
Output Aggregated result per key
Sorting Keys arrive in sorted order
Parallelism Multiple reducers run concurrently

Example - Word Count Reduce:

Input: (Hello, [1, 1, 1])

Reduce Function:
sum = 0
for each count in values:
    sum += count
emit(key, sum)

Output: (Hello, 3)

Reducer Tasks:

  • Number configurable by user

  • Each reducer handles subset of keys

  • Output written to HDFS

  • One output file per reducer


Q89. ๐ŸŸก Explain word count problem with suitable example. Give pseudo-code for word count problem in MapReduce.

[Asked: Dec 2023, Dec 2022 | Frequency: 2]

Answer

Word Count Problem: Count the frequency of each word in a large collection of documents.

Input:

Document 1: "Hello World"
Document 2: "Hello Hadoop World"
Document 3: "Big Data World"

Expected Output:

Big     1
Data    1
Hadoop  1
Hello   2
World   3

Pseudo-code:

Mapper:

function MAP(key, value):
    // key: document ID
    // value: document content

    words = TOKENIZE(value)

    for each word in words:
        EMIT(word, 1)

Reducer:

function REDUCE(key, values):
    // key: word
    // values: list of counts [1, 1, 1, ...]

    total = 0

    for each count in values:
        total = total + count

    EMIT(key, total)

Execution Flow:

d2 diagram

Java Implementation (Simplified):

// Mapper Class
public void map(LongWritable key, Text value, Context context) {
    String[] words = value.toString().split("\\s+");
    for (String word : words) {
        context.write(new Text(word), new IntWritable(1));
    }
}

// Reducer Class
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}


UNIT 7: OTHER BIG DATA ARCHITECTURES AND TOOLS


Q90. ๐ŸŸก What is Apache Spark? In context of Data Science, what is Apache SPARK?

[Asked: Jun 2025, Dec 2023, Jun 2023 | Frequency: 3]

Answer

Apache Spark is an open-source, distributed computing framework designed for fast, large-scale data processing and analytics. It provides an interface for programming clusters with implicit data parallelism and fault tolerance.

Definition: Spark is a unified analytics engine that supports batch processing, real-time streaming, machine learning, and graph processing, all in a single framework.

Key Features:

  • In-memory computing (100x faster than Hadoop MapReduce)

  • Supports multiple languages (Scala, Python, Java, R)

  • Unified platform for diverse workloads

  • Lazy evaluation for optimization

Diagram:

plantuml diagram

Core Components:

Component Purpose
Spark Core Basic functionality, RDD operations
Spark SQL Structured data processing
Spark Streaming Real-time data processing
MLlib Machine learning library
GraphX Graph processing

Q91. ๐Ÿ”ด What are the main features/characteristics of Apache Spark framework?

[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]

Answer

Key Features of Apache Spark:

Feature Description
Speed 100x faster than Hadoop (in-memory)
Ease of Use APIs in Python, Scala, Java, R
Generality SQL, streaming, ML, graph in one platform
Fault Tolerance Automatic recovery from failures
Lazy Evaluation Optimizes execution plan
In-Memory Computing Caches data in RAM

Detailed Features:

1. In-Memory Processing:

d2 diagram

2. Resilient Distributed Datasets (RDD):

  • Immutable distributed collection

  • Fault-tolerant through lineage

  • Parallel operations

3. DAG Execution Engine:

graphviz diagram

4. Multiple Workload Support:

| Workload | Component | Use Case |
|---|---|---|
| Batch | Spark Core | ETL jobs |
| Interactive | Spark SQL | Ad-hoc queries |
| Real-time | Streaming | Live dashboards |
| ML | MLlib | Predictions |
| Graph | GraphX | Social networks |

Q92. ๐ŸŸก How does Apache Spark differ from Hadoop?

[Asked: Jun 2025, Jun 2023 | Frequency: 2]

Answer

Comparison Table:

| Aspect | Apache Spark | Hadoop MapReduce |
|---|---|---|
| Processing | In-memory | Disk-based |
| Speed | 100x faster (memory) | Slower (disk I/O) |
| Ease of Use | High-level APIs | Low-level Java code |
| Real-time | Yes (Spark Streaming) | No (batch only) |
| Iterations | Excellent (ML) | Poor (writes to disk) |
| Cost | Higher RAM needs | Lower hardware cost |
| Languages | Scala, Python, Java, R | Primarily Java |
| Caching | In-memory caching | No caching |

Diagram:

ditaa diagram

When to Use:

  • Spark: Iterative algorithms, real-time processing, interactive queries

  • Hadoop: Cost-sensitive batch processing, very large cold data


Q93. ๐ŸŸข Explain big data processing using Spark ecosystem.

[Asked: Dec 2024 | Frequency: 1]

Answer

Spark Ecosystem for Big Data Processing:

Diagram:

plantuml diagram

Processing Flow:

| Step | Component | Activity |
|---|---|---|
| 1 | Data Ingestion | Load from HDFS, S3, Kafka |
| 2 | Spark Core | Distribute across cluster |
| 3 | Transformation | Filter, map, join operations |
| 4 | Analysis | SQL queries, ML models |
| 5 | Output | Write to storage, serve APIs |

Example Pipeline:

# Read data
df = spark.read.parquet("hdfs://data/sales")

# Transform
cleaned = df.filter(df.amount > 0) \
            .groupBy("region") \
            .sum("amount")

# ML Model
from pyspark.ml.clustering import KMeans
model = KMeans(k=5).fit(cleaned)

# Output
model.write.save("hdfs://models/customer_segments")


Q94. ๐ŸŸข Briefly discuss the purpose of Spark Core.

[Asked: Dec 2023 | Frequency: 1]

Answer

Spark Core is the foundational component of Apache Spark that provides:

Purpose Description
Task Scheduling Distributes tasks across cluster
Memory Management In-memory data caching
Fault Recovery RDD lineage for recovery
I/O Operations Reading/writing data
Basic Operations Map, reduce, filter, join

Key Concept - RDD (Resilient Distributed Dataset):

d2 diagram

RDD Properties:

  • Resilient: Recovers from node failures

  • Distributed: Data spread across nodes

  • Dataset: Collection of partitioned data


Q95. ๐ŸŸข Briefly discuss the purpose of Spark SQL.

[Asked: Dec 2023 | Frequency: 1]

Answer

Spark SQL enables structured data processing using SQL queries and DataFrame API.

Purpose:

Feature Description
SQL Interface Query data using SQL syntax
DataFrames Structured API with schema
Optimization Catalyst optimizer for queries
Integration Connect to Hive, JDBC, Parquet
Performance Optimized execution plans

Example:

# Create DataFrame
df = spark.read.json("customers.json")

# SQL Query
df.createOrReplaceTempView("customers")
result = spark.sql("""
    SELECT region, SUM(sales) as total
    FROM customers
    GROUP BY region
    ORDER BY total DESC
""")

# DataFrame API (equivalent; also needs:
# from pyspark.sql.functions import sum, desc)
result = df.groupBy("region") \
           .agg(sum("sales").alias("total")) \
           .orderBy(desc("total"))


Q96. ๐ŸŸข Briefly discuss the purpose of Spark Streaming.

[Asked: Dec 2023 | Frequency: 1]

Answer

Spark Streaming processes real-time data streams using micro-batch architecture.

Purpose:

Feature Description
Real-time Processing Process live data streams
Micro-batching Small batches (seconds)
Fault Tolerance Exactly-once semantics
Integration Kafka, Flume, Kinesis
Unified API Same code for batch and stream

Diagram:

actdiag diagram

Example:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)  # 1-second batches
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
counts = words.countByValue()
counts.pprint()
ssc.start()
ssc.awaitTermination()  # keep the streaming context running


Q97. ๐ŸŸข Briefly discuss the purpose of MLlib.

[Asked: Dec 2023 | Frequency: 1]

Answer

MLlib is Spark's scalable machine learning library for distributed ML algorithms.

Purpose:

Feature Description
Scalable ML Train on clusters
Algorithms Classification, regression, clustering
Pipelines ML workflow automation
Feature Engineering Transformers and extractors
Model Persistence Save/load models

Supported Algorithms:

Category Algorithms
Classification Logistic Regression, Decision Trees, Random Forest, SVM
Regression Linear, Ridge, Lasso, Decision Tree
Clustering K-Means, Gaussian Mixture, LDA
Recommendation ALS (Collaborative Filtering)
Dimensionality PCA, SVD

Example Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1","f2","f3"], 
                            outputCol="features")
rf = RandomForestClassifier(numTrees=100)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)


Q98. ๐ŸŸข Briefly discuss the purpose of GraphX.

[Asked: Dec 2023 | Frequency: 1]

Answer

GraphX is Spark's API for graph-parallel computation.

Purpose:

Feature Description
Graph Processing Analyze graph structures
Algorithms PageRank, Connected Components
Graph Construction From RDDs or files
Property Graphs Vertices and edges with properties
Pregel API Iterative graph algorithms

Diagram:

graphviz diagram

Built-in Algorithms:

  • PageRank: Vertex importance

  • Connected Components: Graph clusters

  • Triangle Counting: Network density

  • Shortest Paths: Distance calculation

Example:

import org.apache.spark.graphx._

// Create graph
val graph = Graph(vertices, edges)

// Run PageRank
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)


Q99. ๐ŸŸข What is HIVE? Explain the components of HIVE architecture.

[Asked: Jun 2025 | Frequency: 1]

Answer

Apache Hive is a data warehouse infrastructure built on Hadoop for data summarization, querying, and analysis using SQL-like language (HiveQL).

Definition: Hive provides SQL interface to query data stored in HDFS, converting queries to MapReduce/Spark jobs.

Architecture Diagram:

plantuml diagram

Components:

| Component | Purpose |
| --- | --- |
| Metastore | Stores schema, table definitions |
| Driver | Manages query lifecycle |
| Compiler | Parses and compiles HiveQL |
| Optimizer | Optimizes execution plan |
| Executor | Runs the query plan |
| CLI/UI | User interfaces |

Q100. ๐ŸŸก Write short note on HIVE and its utility in Data Science.

[Asked: Jun 2023, Dec 2022 | Frequency: 2]

Answer

Apache Hive provides SQL-based data warehouse capabilities on Hadoop.

Key Features:

| Feature | Description |
| --- | --- |
| HiveQL | SQL-like query language |
| Schema on Read | Define schema at query time |
| Scalability | Process petabytes of data |
| Extensibility | Custom UDFs, SerDes |
| Integration | Works with Hadoop ecosystem |

Utility in Data Science:

| Use Case | How Hive Helps |
| --- | --- |
| Data Exploration | SQL queries on big data |
| ETL | Transform large datasets |
| Data Warehousing | Structured analysis |
| Reporting | Business intelligence |
| Ad-hoc Queries | Quick data investigation |

Example:

-- Create table
CREATE TABLE sales (
    id INT,
    product STRING,
    amount DOUBLE,
    date DATE
) PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;

-- Query
SELECT product, SUM(amount) as total
FROM sales
WHERE year = 2024
GROUP BY product
ORDER BY total DESC
LIMIT 10;


Q101. ๐ŸŸข Write short note on HBase and its utility in Data Science.

[Asked: Dec 2023 | Frequency: 1]

Answer

Apache HBase is a distributed, scalable NoSQL database built on HDFS for real-time read/write access to big data.

Key Features:

| Feature | Description |
| --- | --- |
| Column-oriented | Wide-column store |
| Real-time Access | Low-latency reads/writes |
| Scalability | Billions of rows, millions of columns |
| Consistency | Strong consistency model |
| Auto-sharding | Automatic data distribution |

HBase Data Model:

svgbob diagram

Utility in Data Science:

| Use Case | Application |
| --- | --- |
| Time Series | Sensor data, logs |
| User Profiles | Real-time personalization |
| Messaging | Chat, notifications |
| Metrics | System monitoring |
| Search Indexing | Fast lookups |

UNIT 8: NoSQL DATABASES


Q102. ๐Ÿ”ด What is NoSQL? What are NoSQL databases?

[Asked: Jun 2025, Jun 2024, Dec 2022, Jun 2022 | Frequency: 4]

Answer

NoSQL (Not Only SQL) refers to non-relational databases designed for distributed data storage with flexible schemas, horizontal scaling, and high performance for specific use cases.

Definition: NoSQL databases store data in formats other than traditional relational tables, optimized for large-scale, distributed environments with varied data types.

Types of NoSQL Databases:

d2 diagram

Comparison with RDBMS:

| Aspect | RDBMS | NoSQL |
| --- | --- | --- |
| Schema | Fixed | Flexible |
| Scaling | Vertical | Horizontal |
| ACID | Full support | Eventual consistency |
| Data Model | Tables | Various (doc, graph, etc.) |
| Joins | Supported | Limited/None |
| Use Case | Complex queries | High volume, velocity |

Q103. ๐Ÿ”ด Explain the features of NoSQL databases. How are NoSQL databases different from RDBMS?

[Asked: Jun 2025, Jun 2024, Dec 2022, Jun 2022 | Frequency: 4]

Answer

Features of NoSQL:

| Feature | Description |
| --- | --- |
| Schema Flexibility | No fixed schema, dynamic structure |
| Horizontal Scaling | Add nodes to scale (sharding) |
| High Availability | Built-in replication |
| High Performance | Optimized for specific access patterns |
| Distributed | Data across multiple servers |
| BASE Model | Basically Available, Soft state, Eventually consistent |

ACID vs BASE:

ditaa diagram

Detailed Comparison:

| Aspect | RDBMS | NoSQL |
| --- | --- | --- |
| Data Model | Relational tables | Key-value, Document, Graph, Column |
| Schema | Rigid, predefined | Dynamic, flexible |
| Scalability | Vertical (bigger server) | Horizontal (more servers) |
| Transactions | ACID compliant | BASE model |
| Joins | Complex joins supported | Limited or none |
| Query Language | SQL | Database-specific |
| Consistency | Strong | Eventual |
| Best For | Complex relationships | Big Data, real-time |

Q104. ๐ŸŸข What is key-value pair based NoSQL? List the benefits.

[Asked: Dec 2024 | Frequency: 1]

Answer

Key-Value Store is the simplest NoSQL database type that stores data as a collection of key-value pairs.

Structure:

svgbob diagram

Benefits:

| Benefit | Description |
| --- | --- |
| Simplicity | Easy to understand and use |
| Speed | O(1) lookups by key |
| Scalability | Easy horizontal scaling |
| Flexibility | Value can be any data type |
| Caching | Perfect for cache layer |
| High Throughput | Millions of ops/second |

Popular Databases:

  • Redis: In-memory, caching, sessions

  • DynamoDB: AWS managed, serverless

  • Riak: Distributed, fault-tolerant

Use Cases:

  • Session storage

  • User preferences

  • Shopping carts

  • Caching

  • Real-time leaderboards


Q105. ๐ŸŸข Explain when to use key-value NoSQL database with example.

[Asked: Dec 2024 | Frequency: 1]

Answer

When to Use Key-Value Stores:

| Scenario | Why Key-Value Works |
| --- | --- |
| Simple lookups | Direct access by key |
| High speed needed | In-memory performance |
| Caching | Fast data retrieval |
| Session management | Quick session access |
| No complex queries | Only key-based access |

Example - Session Management:

User logs in → Generate session ID → Store in Redis

Key: "session:abc123def456"
Value: {
    "user_id": 12345,
    "username": "john_doe",
    "login_time": "2024-12-10T10:30:00",
    "cart_items": 3,
    "preferences": {"theme": "dark"}
}

Operations:
- SET session:abc123 {...}   → Store session
- GET session:abc123         → Retrieve session
- EXPIRE session:abc123 3600 → Auto-delete after 1 hour
- DEL session:abc123         → Logout
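The SET/GET/EXPIRE/DEL operations above can be mimicked in a few lines of Python, using a plain dict in place of Redis (a minimal sketch; the `KVStore` class and its method names are invented for illustration):

```python
import time

class KVStore:
    """Toy in-memory key-value store: keys map to (value, expiry or None)."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = (value, None)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]          # lazy expiry, like Redis
            return None
        return value

    def expire(self, key, seconds):
        if key in self._data:
            value, _ = self._data[key]
            self._data[key] = (value, time.time() + seconds)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("session:abc123", {"user_id": 12345, "cart_items": 3})
store.expire("session:abc123", 3600)              # auto-delete after 1 hour
print(store.get("session:abc123")["user_id"])     # 12345
store.delete("session:abc123")                    # logout
print(store.get("session:abc123"))                # None
```

A real store adds persistence, replication, and eviction policies; the lookup pattern, however, is exactly this dict-by-key access.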

When NOT to Use:

  • Complex relationships between data

  • Need for joins or aggregations

  • Range queries required

  • Data has complex structure


Q106. ๐ŸŸก What is Graph based NoSQL? Explain when do we need graph database.

[Asked: Jun 2024, Dec 2022 | Frequency: 2]

Answer

Graph Database stores data as nodes (entities) and edges (relationships), optimized for traversing connected data.

Structure:

graphviz diagram

Components:

| Component | Description |
| --- | --- |
| Nodes | Entities (people, products) |
| Edges | Relationships between nodes |
| Properties | Attributes on nodes/edges |
| Labels | Node types |

When to Use Graph Database:

| Use Case | Why Graph |
| --- | --- |
| Social Networks | Friend connections, followers |
| Recommendations | "People who bought X also..." |
| Fraud Detection | Identify suspicious patterns |
| Knowledge Graphs | Connected information |
| Network Analysis | IT infrastructure, routing |
| Access Control | Permission hierarchies |

Example Query (Cypher - Neo4j):

// Find friends of friends
MATCH (user:Person {name: 'Alice'})-[:FRIENDS]->(friend)-[:FRIENDS]->(fof)
WHERE NOT (user)-[:FRIENDS]->(fof) AND user <> fof
RETURN fof.name AS Recommendation


Q107. ๐ŸŸข List the features of Column-based databases.

[Asked: Dec 2022 | Frequency: 1]

Answer

Column-Family Database (Wide-Column Store) stores data in column families rather than rows.

Structure:

Row-Oriented (RDBMS):        Column-Oriented (NoSQL):
┌────┬──────┬─────┬─────┐   ┌──────────────────────┐
│ ID │ Name │ Age │City │   │ ID:    1, 2, 3, 4    │
├────┼──────┼─────┼─────┤   │ Name:  A, B, C, D    │
│ 1  │  A   │ 25  │ NYC │   │ Age:   25,30,28,35   │
│ 2  │  B   │ 30  │ LA  │   │ City:  NYC,LA,CHI,SF │
│ 3  │  C   │ 28  │ CHI │   └──────────────────────┘
│ 4  │  D   │ 35  │ SF  │   
└────┴──────┴─────┴─────┘   Better for analytics
Better for transactions      (read specific columns)

Features:

| Feature | Description |
| --- | --- |
| Column Families | Related columns grouped |
| Sparse Storage | Only stores non-null values |
| High Write Throughput | Append-only writes |
| Time-Series Friendly | Efficient time-stamped data |
| Horizontal Scaling | Easy sharding |
| Compression | Same-type data compresses well |

Popular Databases:

  • Apache Cassandra

  • Apache HBase

  • Google Bigtable

Best For:

  • Time-series data

  • IoT sensor data

  • Event logging

  • Analytics workloads


UNIT 9: MINING BIG DATA - SIMILARITY


Q108. ๐ŸŸก Define the term Similarity.

[Asked: Jun 2024, Jun 2022 | Frequency: 2]

Answer

Similarity is a measure that quantifies how alike or close two data objects are based on their features or attributes.

Definition: Similarity is a numerical measure (typically between 0 and 1) where 1 indicates identical objects and 0 indicates completely different objects.

Key Concepts:

| Concept | Description |
| --- | --- |
| Similarity | How alike two objects are (0 to 1) |
| Distance | How different two objects are |
| Relationship | Similarity = 1 - Normalized Distance |

Types of Similarity Measures:

d2 diagram

Applications:

  • Document similarity (plagiarism detection)

  • Recommendation systems

  • Clustering

  • Near-duplicate detection

  • Search engines


Q109. ๐ŸŸข Explain the Jaccard similarity of sets with the help of an example.

[Asked: Jun 2022 | Frequency: 1]

Answer

Jaccard Similarity measures the similarity between two sets as the ratio of their intersection to their union.

Formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Diagram:

svgbob diagram

Example:

Set A = {apple, banana, orange, mango} Set B = {banana, orange, grape, kiwi}

| Operation | Result |
| --- | --- |
| A ∩ B | {banana, orange} |
| A ∪ B | {apple, banana, orange, mango, grape, kiwi} |
| \|A ∩ B\| | 2 |
| \|A ∪ B\| | 6 |

$$J(A, B) = \frac{2}{6} = 0.333$$

Interpretation: The sets are 33.3% similar.

Properties:

  • Range: 0 ≤ J(A,B) ≤ 1

  • J(A,A) = 1 (identical sets)

  • J(A,B) = 0 when A ∩ B = ∅
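The worked example above translates directly into code (a minimal sketch; sets of any hashable items work):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

A = {"apple", "banana", "orange", "mango"}
B = {"banana", "orange", "grape", "kiwi"}
print(round(jaccard(A, B), 3))  # 0.333
```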


Q110. ๐ŸŸข What do you understand by the term 'Finding Similar Documents'?

[Asked: Jun 2025 | Frequency: 1]

Answer

Finding Similar Documents is the process of identifying documents that share significant content, structure, or meaning with a given document or each other.

Why It Matters:

| Application | Use Case |
| --- | --- |
| Plagiarism Detection | Identify copied content |
| Search Engines | Find relevant results |
| News Aggregation | Group related stories |
| Recommendation | Suggest similar articles |
| Deduplication | Remove near-duplicates |

Challenge with Big Data:

  • Comparing every pair: O(n²) comparisons

  • For 1 million documents: 500 billion comparisons

  • Need efficient approximate methods

Solution Pipeline:

plantuml diagram

Q111. ๐ŸŸข What are the various concepts of document similarity analysis?

[Asked: Jun 2025 | Frequency: 1]

Answer

Key Concepts in Document Similarity:

1. Shingling (k-grams): Convert document to set of overlapping substrings.

Document: "the quick brown"
3-shingles: {"the", "he ", "e q", " qu", "qui", ...}

2. MinHashing: Create compact signatures that estimate Jaccard similarity.

| Property | Description |
| --- | --- |
| Input | Set of shingles |
| Output | Fixed-size signature |
| Property | Pr(h(A) = h(B)) = J(A,B) |

3. Locality Sensitive Hashing (LSH): Hash similar documents to same buckets with high probability.

ditaa diagram

4. Similarity Measures:

| Measure | Formula | Best For |
| --- | --- | --- |
| Jaccard | \|A∩B\|/\|A∪B\| | Sets |
| Cosine | A·B/(\|A\|\|B\|) | Vectors |
| Edit Distance | Min edits to transform | Strings |
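The shingling and MinHashing concepts above can be combined in one short sketch (illustrative only: salting Python's built-in `hash` stands in for properly independent hash functions):

```python
import random

def shingles(text: str, k: int = 3) -> set:
    """Overlapping character k-grams, e.g. "the quick" -> {"the", "he ", ...}."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, hash_funcs):
    """Signature = minimum hash of the set under each hash function."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates J(A, B)."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# 100 salted hash functions (an illustrative shortcut, not production hashing)
rng = random.Random(42)
hash_funcs = [lambda s, salt=salt: hash((salt, s))
              for salt in (rng.getrandbits(32) for _ in range(100))]

A = shingles("the quick brown fox")
B = shingles("the quick brown dog")
sig_a = minhash_signature(A, hash_funcs)
sig_b = minhash_signature(B, hash_funcs)
print(len(A & B) / len(A | B))         # exact Jaccard similarity
print(estimate_jaccard(sig_a, sig_b))  # MinHash estimate, close to exact
```

With more hash functions the estimate tightens; the point is that two 100-number signatures replace arbitrarily large shingle sets.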

Q112. ๐ŸŸก Explain how the similarity between two documents can be found.

[Asked: Jun 2024, Dec 2022 | Frequency: 2]

Answer

Step-by-Step Document Similarity:

Step 1: Preprocessing

  • Remove stopwords (the, is, a)

  • Convert to lowercase

  • Stemming/Lemmatization

Step 2: Representation

| Method | Description |
| --- | --- |
| Bag of Words | Word frequency vector |
| TF-IDF | Weighted word importance |
| Shingles | Set of k-grams |

Step 3: Calculate Similarity

Example - Cosine Similarity:

Doc1: "data science is fun"
Doc2: "science of data analysis"

Vocabulary: {data, science, is, fun, of, analysis}

Vector1: [1, 1, 1, 1, 0, 0]
Vector2: [1, 1, 0, 0, 1, 1]

Cosine = (1×1 + 1×1 + 1×0 + 1×0 + 0×1 + 0×1) / (√4 × √4)
       = 2 / 4 = 0.5

Diagram:

graphviz diagram
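The hand computation above can be checked with a small helper (a minimal sketch; documents are assumed to be already encoded as term-count vectors over a shared vocabulary):

```python
import math

def cosine_similarity(u, v):
    """cos θ = (u·v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Vocabulary: {data, science, is, fun, of, analysis}
doc1 = [1, 1, 1, 1, 0, 0]   # "data science is fun"
doc2 = [1, 1, 0, 0, 1, 1]   # "science of data analysis"
print(cosine_similarity(doc1, doc2))  # 0.5
```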

Q113. ๐ŸŸข Compare Minhashing and Locality Sensitive Hashing for document similarity.

[Asked: Jun 2025 | Frequency: 1]

Answer

Comparison:

| Aspect | MinHashing | LSH |
| --- | --- | --- |
| Purpose | Compress set signatures | Find candidate pairs |
| Input | Set of shingles | MinHash signatures |
| Output | Fixed-size signature | Candidate similar pairs |
| Complexity | O(n × k) per doc | O(n) for all docs |
| Preserves | Jaccard similarity | Similarity threshold |

MinHashing Process:

Shingle Set → Apply h hash functions → Signature (h values)

Signature preserves: Pr(sig[i] matches) ≈ Jaccard(A,B)

LSH Process:

Signatures → Divide into b bands of r rows
           → Hash each band
           → Similar docs hash to same bucket

Diagram:

plantuml diagram

Trade-off in LSH:

  • More bands (b): More false positives, fewer misses

  • More rows (r): Fewer false positives, more misses

  • Threshold ≈ (1/b)^(1/r)
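The banding step of LSH can be sketched as follows (assumed inputs: precomputed MinHash signatures of length b × r; docs whose band values collide become candidate pairs):

```python
from collections import defaultdict

def lsh_candidates(signatures, b, r):
    """Band each signature into b bands of r rows; signatures that agree on
    any whole band hash to the same bucket and become candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # one band of r rows
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates

sigs = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],   # shares the first band with d1
    "d3": [7, 7, 7, 7, 7, 7],   # shares no band with anyone
}
print(lsh_candidates(sigs, b=2, r=3))  # {('d1', 'd2')}
```

Only candidate pairs are then compared exactly, which is what avoids the O(n²) all-pairs comparison.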


Q114. ๐ŸŸข What is a Euclidean distance measure?

[Asked: Jun 2025 | Frequency: 1]

Answer

Euclidean Distance is the straight-line distance between two points in n-dimensional space.

Formula (2D):

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

Formula (n-dimensional):

$$d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$$

Diagram:

svgbob diagram

Example:

  • Point P = (1, 2, 3)

  • Point Q = (4, 6, 8)

$$d = \sqrt{(4-1)^2 + (6-2)^2 + (8-3)^2} = \sqrt{9 + 16 + 25} = \sqrt{50} ≈ 7.07$$

Properties:

  • Always ≥ 0

  • d(p,q) = 0 iff p = q

  • Symmetric: d(p,q) = d(q,p)

  • Triangle inequality: d(p,r) ≤ d(p,q) + d(q,r)
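The n-dimensional formula is one line of code; here is a minimal sketch reproducing the worked example:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean((1, 2, 3), (4, 6, 8)), 2))  # 7.07  (= sqrt(50))
```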


Q115. ๐ŸŸข How does Euclidean distance differ from cosine distance?

[Asked: Jun 2025 | Frequency: 1]

Answer

Key Differences:

| Aspect | Euclidean Distance | Cosine Distance |
| --- | --- | --- |
| Measures | Magnitude of difference | Angle between vectors |
| Formula | √Σ(pᵢ - qᵢ)² | 1 - cos(θ) |
| Range | 0 to ∞ | 0 to 2 |
| Sensitive to | Magnitude | Direction only |
| Best for | Actual distances | Text similarity |

Diagram:

svgbob diagram

Example:

A = (1, 0)
B = (2, 0)
C = (0, 1)

Euclidean:
  d(A,B) = 1    (B is closer to A)
  d(A,C) = √2 ≈ 1.41

Cosine:
  cos(A,B) = 1 → distance = 0  (same direction)
  cos(A,C) = 0 → distance = 1  (perpendicular)
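The contrast in the example can be verified directly (a small sketch; 2-D points as tuples):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norms

A, B, C = (1, 0), (2, 0), (0, 1)
print(euclidean(A, B), round(euclidean(A, C), 2))    # 1.0 1.41 (B closer in magnitude)
print(cosine_distance(A, B), cosine_distance(A, C))  # 0.0 1.0  (B identical in direction)
```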

When to Use:

| Use Case | Recommended |
| --- | --- |
| Text documents | Cosine (ignores doc length) |
| Geographic points | Euclidean |
| High-dimensional sparse | Cosine |
| Dense numerical data | Euclidean |

Q116. ๐ŸŸข What is the purpose of a distance measure?

[Asked: Dec 2024 | Frequency: 1]

Answer

Purpose of Distance Measures:

| Purpose | Description |
| --- | --- |
| Quantify Difference | Numerical measure of dissimilarity |
| Clustering | Group similar objects |
| Classification | k-NN algorithm |
| Anomaly Detection | Identify outliers |
| Search | Find nearest neighbors |

Applications:

plantuml diagram

Common Distance Measures:

| Measure | Formula | Use Case |
| --- | --- | --- |
| Euclidean | √Σ(xᵢ-yᵢ)² | General purpose |
| Manhattan | Σ\|xᵢ-yᵢ\| | Grid-based, outlier robust |
| Cosine | 1 - cos(θ) | Text, sparse data |
| Jaccard | 1 - J(A,B) | Sets, binary data |
| Hamming | Count of differences | Binary strings |
| Edit | Min edits | Strings |

Q117. ๐ŸŸข Differentiate between cosine distance and edit distance with example.

[Asked: Dec 2024 | Frequency: 1]

Answer

Comparison:

| Aspect | Cosine Distance | Edit Distance |
| --- | --- | --- |
| Input Type | Vectors | Strings |
| Measures | Angular difference | Character operations |
| Operations | Dot product | Insert, Delete, Replace |
| Range | 0 to 2 | 0 to max(len(s1), len(s2)) |
| Use Case | Document similarity | Spell checking, DNA |

Cosine Distance Example:

Doc1: "the cat sat" → Vector: [1, 1, 1, 0, 0]
Doc2: "the dog ran" → Vector: [1, 0, 0, 1, 1]
Vocabulary: [the, cat, sat, dog, ran]

Cosine Similarity = (1×1 + 1×0 + 1×0 + 0×1 + 0×1) / (√3 × √3)
                  = 1/3 = 0.33

Cosine Distance = 1 - 0.33 = 0.67

Edit Distance Example:

String1: "kitten"
String2: "sitting"

Operations:
1. kitten → sitten  (replace k with s)
2. sitten → sittin  (replace e with i)
3. sittin → sitting (insert g)

Edit Distance = 3
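The operation count above is normally computed with dynamic programming (the standard Levenshtein recurrence, sketched here):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming (insert/delete/replace)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace or match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```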

Diagram:

d2 diagram

UNIT 10: MINING DATA STREAMS


Q118. ๐ŸŸก What are Data Streams? Explain Data Streams.

[Asked: Jun 2025, Jun 2023, Jun 2022 | Frequency: 3]

Answer

Data Stream is a continuous, unbounded sequence of data elements generated at rapid rates that must be processed in real-time or near real-time.

Definition: A data stream is an ordered sequence of data items that arrive continuously over time, often too fast and voluminous to store entirely.

Characteristics:

| Characteristic | Description |
| --- | --- |
| Continuous | Never-ending flow of data |
| High Velocity | Rapid arrival rate |
| Unbounded | Potentially infinite |
| Time-Sensitive | Must process quickly |
| Single Pass | Cannot re-read easily |
| Evolving | Patterns change over time |

Diagram:

seqdiag diagram

Examples of Data Streams:

| Domain | Stream Type |
| --- | --- |
| Finance | Stock tickers, transactions |
| Social Media | Tweets, posts, likes |
| IoT | Sensor readings |
| Telecom | Call records, network logs |
| Web | Clickstreams, search queries |

Q119. ๐ŸŸก Why is Data Stream mining/processing a challenging task in Data Science?

[Asked: Jun 2025, Jun 2023 | Frequency: 2]

Answer

Challenges in Data Stream Processing:

| Challenge | Description |
| --- | --- |
| Volume | Massive amounts of data |
| Velocity | High arrival rate |
| Single Pass | Cannot store all data |
| Memory Limits | Limited RAM for processing |
| Real-time | Must respond quickly |
| Concept Drift | Patterns change over time |

Diagram:

ditaa diagram

Technical Challenges:

  1. Memory Constraint: Can't store entire stream

  2. One-Pass Processing: Each item seen once

  3. Approximate Algorithms: Must sacrifice accuracy

  4. Concept Drift: Model becomes outdated

  5. Out-of-Order Data: Events may arrive late

  6. Load Spikes: Sudden bursts of data


Q120. ๐ŸŸข Explain the characteristics of data streams.

[Asked: Dec 2022 | Frequency: 1]

Answer

Key Characteristics:

| Characteristic | Description |
| --- | --- |
| Continuous | Endless flow, no defined end |
| Rapid | High data arrival rate |
| Unbounded | Potentially infinite size |
| Temporal | Time is crucial dimension |
| Ordered | Sequence matters |
| Ephemeral | Old data may be discarded |

Formal Model:

Stream S = (s₁, s₂, s₃, ..., sโ‚™, ...)

Where:
- sแตข arrives at time tแตข
- tแตข < tแตข₊₁ (ordered)
- n → ∞ (unbounded)

Diagram:

nwdiag diagram

Processing Constraints:

  • Limited memory

  • Limited processing time per element

  • Approximate answers acceptable

  • Single pass over data


Q121. ๐ŸŸก How do Data Streams differ from Databases?

[Asked: Jun 2023, Jun 2022 | Frequency: 2]

Answer

Comparison:

| Aspect | Database (DBMS) | Data Stream (DSMS) |
| --- | --- | --- |
| Data | Persistent, stored | Transient, flowing |
| Size | Finite | Potentially infinite |
| Access | Random, multiple | Sequential, once |
| Query | On-demand | Continuous |
| Answer | Exact | Approximate |
| Processing | Any time | Real-time |
| Update | Insert, Update, Delete | Append only |
| Storage | Disk-based | Memory-based |

Diagram:

plantuml diagram

Query Model Difference:

  • DBMS: "Find all sales > $1000" → Run once, get answer

  • DSMS: "Alert when sale > $1000" → Runs continuously


Q122. ๐ŸŸก Differentiate between DSMS and DBMS with diagram.

[Asked: Dec 2023, Jun 2024 | Frequency: 2]

Answer

DSMS vs DBMS:

| Feature | DBMS | DSMS |
| --- | --- | --- |
| Data Model | Relations/Tables | Streams |
| Query Type | One-time | Continuous |
| Data Arrival | Static or slow | Rapid, continuous |
| Storage | Persistent | Transient windows |
| Processing | Pull-based | Push-based |
| Results | Complete, exact | Incremental, approximate |

Architecture Diagram:

d2 diagram

Query Execution:

DBMS:
  Query → Execute Once → Return All Results → Done

DSMS:
  Query → Register → Execute Continuously → 
        → Stream Results → Never Ends


Q123. ๐ŸŸข Discuss the issues and challenges of data stream.

[Asked: Jun 2024 | Frequency: 1]

Answer

Major Issues and Challenges:

| Category | Issue | Description |
| --- | --- | --- |
| Resource | Memory | Can't store all data |
| Resource | CPU | High processing demand |
| Data | Volume | Massive data amounts |
| Data | Velocity | Rapid arrival |
| Data | Quality | Missing/noisy data |
| Processing | Single Pass | One chance to process |
| Processing | Real-time | Strict time constraints |
| Analytics | Concept Drift | Patterns change |
| Analytics | Approximation | Exact answers impossible |

Diagram:

blockdiag diagram

Q124. ๐ŸŸข What do you mean by data stream processing?

[Asked: Dec 2024 | Frequency: 1]

Answer

Data Stream Processing is the continuous computation and analysis of data as it flows through a system, without storing it permanently.

Key Concepts:

| Concept | Description |
| --- | --- |
| Event | Single data item in stream |
| Window | Subset of stream for analysis |
| Operator | Transformation on stream |
| Pipeline | Chain of operators |
| Sink | Output destination |

Processing Models:

| Model | Description |
| --- | --- |
| Record-at-a-time | Process each event individually |
| Micro-batch | Small batches (Spark Streaming) |
| True Streaming | Continuous (Flink, Storm) |

Diagram:

Stream pipeline: Source (Kafka/Sensor/API) → Ingest (Parse/Validate) → Process (Filter/Transform/Enrich) → Analyze (Aggregate/Detect/Predict) → Output (Dashboard/Alert/Storage)

Q125. ๐ŸŸข Which data stream window model is most useful for stock market trend analysis? Justify.

[Asked: Dec 2024 | Frequency: 1]

Answer

Sliding Window Model is most useful for stock market trend analysis.

Why Sliding Window:

| Reason | Explanation |
| --- | --- |
| Recent Data | Latest data most relevant |
| Continuous Update | Trends update in real-time |
| Fixed Size | Consistent analysis period |
| Forget Old | Outdated data discarded |

Types of Windows:

svgbob diagram

Stock Market Example:

Window Size: 5 minutes
Slide: 1 minute

Time 10:00 - Window: [09:55 - 10:00] → Moving Average = $150.25
Time 10:01 - Window: [09:56 - 10:01] → Moving Average = $150.40
Time 10:02 - Window: [09:57 - 10:02] → Moving Average = $150.55
...continues...

Use Cases:

  • Moving averages

  • Trend detection

  • Volume analysis

  • Anomaly detection (flash crashes)
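The moving-average example above can be sketched with a count-based sliding window (a minimal illustration; real systems usually window by time, not by tick count):

```python
from collections import deque

class SlidingAverage:
    """Moving average over the last `size` ticks of a price stream."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, price):
        self.window.append(price)
        self.total += price
        if len(self.window) > self.size:        # forget the oldest tick
            self.total -= self.window.popleft()
        return self.total / len(self.window)

ma = SlidingAverage(size=3)
for price in [150.10, 150.30, 150.35, 150.55]:
    print(round(ma.add(price), 2))   # average over the last 3 ticks
```

Each new element updates the running total in O(1), which is what makes the model viable at stream rates.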


Q126. ๐ŸŸก Compare Ad-hoc Queries and Standing Queries of data streams.

[Asked: Dec 2023 | Frequency: 3]

Answer

Comparison:

| Aspect | Ad-hoc Query | Standing Query |
| --- | --- | --- |
| Execution | Once | Continuous |
| Duration | Finite | Indefinite |
| Result | Single answer | Stream of answers |
| Trigger | User-initiated | Event-driven |
| Data Scope | Historical + Current | Current + Future |
| Storage | Needs history | Window-based |

Diagram:

d2 diagram

Examples:

| Type | Example |
| --- | --- |
| Ad-hoc | "What was average temperature yesterday?" |
| Standing | "Alert me when temperature > 40°C" |
| Ad-hoc | "Show sales report for Q3" |
| Standing | "Notify on transactions > $10,000" |

Q127. ๐ŸŸข Compare Land Mark Model and Sliding Windows Model.

[Asked: Jun 2023 | Frequency: 1]

Answer

Comparison:

| Aspect | Landmark Model | Sliding Window Model |
| --- | --- | --- |
| Start Point | Fixed timestamp | Moves with time |
| Data Included | From landmark to now | Last n items/time |
| Memory | Grows over time | Fixed size |
| Use Case | Cumulative stats | Recent trends |

Diagram:

svgbob diagram

Example:

| Model | Query | Result |
| --- | --- | --- |
| Landmark | "Total sales since store opening" | Cumulative sum |
| Sliding | "Sales in last 7 days" | Recent total |

When to Use:

| Use Landmark | Use Sliding |
| --- | --- |
| Cumulative statistics | Recent trends |
| Growing aggregates | Moving averages |
| Historical analysis | Real-time monitoring |
| Audit trails | Anomaly detection |

Q128. ๐ŸŸข Explain any one mechanism of filtering of data streams.

[Asked: Dec 2022 | Frequency: 1]

Answer

Bloom Filter - Efficient mechanism for filtering data streams.

Purpose: Quickly test whether an element is a member of a set with minimal memory.

Properties:

  • Space-efficient probabilistic data structure

  • No false negatives (if says "no", definitely not in set)

  • Possible false positives (if says "yes", might be in set)

How It Works:

1. Create bit array of size m (all 0s)
2. Use k hash functions
3. To ADD element:
   - Hash element k times
   - Set those bit positions to 1
4. To CHECK element:
   - Hash element k times
   - If ALL positions are 1 → "Probably in set"
   - If ANY position is 0 → "Definitely not in set"

Diagram:

svgbob diagram

Use Cases:

  • Spam filtering

  • Cache checking

  • Database lookups

  • Network routing


Q129. ๐ŸŸก What is Bloom Filtering? Explain with example.

[Asked: Jun 2024, Jun 2022 | Frequency: 2]

Answer

Bloom Filter is a space-efficient probabilistic data structure for set membership testing.

Components:

| Component | Description |
| --- | --- |
| Bit Array | m bits, initially all 0 |
| Hash Functions | k independent hash functions |
| Insert | Set k bits to 1 |
| Query | Check if all k bits are 1 |

Example:

Setup: m = 10 bits, k = 3 hash functions

Initial Array: [0][0][0][0][0][0][0][0][0][0]
                0  1  2  3  4  5  6  7  8  9

Insert "cat":
  h1("cat") = 1
  h2("cat") = 4
  h3("cat") = 7

Array:         [0][1][0][0][1][0][0][1][0][0]

Insert "dog":
  h1("dog") = 2
  h2("dog") = 4  (already 1)
  h3("dog") = 9

Array:         [0][1][1][0][1][0][0][1][0][1]

Query "cat": Check 1,4,7 → All 1 → "Probably in set" ✓
Query "bird": Check 3,6,8 → Position 3 is 0 → "Not in set" ✓
Query "rat": Check 1,4,9 → All 1 → "Probably in set" 
            (FALSE POSITIVE - rat was never added!)

Diagram:

graphviz diagram

Trade-off:

  • Smaller array → More false positives

  • More hash functions → Better accuracy but slower
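The add/check procedure above fits in a small class (a minimal sketch; salted SHA-256 digests stand in for k independent hash functions):

```python
import hashlib

class BloomFilter:
    """m-bit array with k salted hash functions; no false negatives."""
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions by salting the item with the function index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any zero bit => definitely absent; all ones => probably present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=10, k=3)
bf.add("cat")
bf.add("dog")
print(bf.might_contain("cat"))   # True ("probably in set")
# A "False" answer is always definite; a "True" answer may be a false positive.
```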


UNIT 11: LINK ANALYSIS


Q130. ๐ŸŸก What is Link Analysis?

[Asked: Jun 2024, Dec 2023 | Frequency: 2]

Answer

Link Analysis is a technique that examines the relationships (links) between objects to extract meaningful information about their structure, importance, and connectivity.

Definition: Link analysis studies the hyperlink structure of the web or any network to understand relationships, determine importance of nodes, and discover patterns.

Key Concepts:

| Concept | Description |
| --- | --- |
| Node | Entity (webpage, person) |
| Edge/Link | Connection between nodes |
| In-links | Links pointing to a node |
| Out-links | Links from a node to others |
| Anchor Text | Text describing the link |

Diagram:

graphviz diagram

Applications:

| Domain | Application |
| --- | --- |
| Search Engines | PageRank, HITS |
| Social Networks | Influence analysis |
| Fraud Detection | Suspicious patterns |
| Citation Analysis | Research impact |
| Counter-terrorism | Network mapping |

Q131. ๐ŸŸก What is the purpose of Link Analysis? Explain link analysis for the WWW.

[Asked: Dec 2023, Dec 2022 | Frequency: 3]

Answer

Purpose of Link Analysis:

| Purpose | Description |
| --- | --- |
| Rank Pages | Determine page importance |
| Discover Structure | Understand web topology |
| Find Communities | Cluster related pages |
| Detect Spam | Identify manipulation |
| Improve Search | Better relevance ranking |

Link Analysis for WWW:

The web can be viewed as a directed graph where:

  • Nodes = Web pages

  • Edges = Hyperlinks

Key Insight: A link from page A to page B is like a "vote" of confidence for B.

PageRank Computation Using Links:

d2 diagram

PageRank Formula:

$$PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}$$

Where:

  • d = Damping factor (typically 0.85)

  • N = Total number of pages

  • $B_p$ = Set of pages linking to p

  • L(q) = Number of outbound links from q

Algorithm Steps:

  1. Initialize all pages with PR = 1/N

  2. Iterate: redistribute PR through links

  3. Repeat until convergence


Q132. ๐ŸŸข What is PageRank?

[Asked: Jun 2024 | Frequency: 1]

Answer

PageRank is an algorithm developed by Google founders Larry Page and Sergey Brin to rank web pages based on their importance determined by the link structure.

Core Principle: A page is important if many important pages link to it.

Key Properties:

| Property | Description |
| --- | --- |
| Recursive | Importance depends on linkers' importance |
| Democratic | Each page gets equal vote initially |
| Iterative | Computed through repeated calculations |
| Probabilistic | Based on random surfer model |

Random Surfer Model:

Imagine a person randomly browsing:

  • With probability d (0.85): Follow a random link

  • With probability 1-d (0.15): Jump to random page

PageRank = Probability surfer ends up on that page

Simple Example:

svgbob diagram

Q133. ๐Ÿ”ด Explain PageRank algorithm with suitable example.

[Asked: Jun 2024, Jun 2023, Jun 2022 | Frequency: 3]

Answer

PageRank Algorithm:

Step 1: Build the Web Graph

Pages: A, B, C
Links: A→B, A→C, B→C, C→A

graphviz diagram

Step 2: Initialize

  • N = 3 pages

  • Initial PR = 1/N = 0.33 for each page

  • Damping factor d = 0.85

Step 3: Iterate

$$PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}$$

Iteration 1:

| Page | Calculation | New PR |
| --- | --- | --- |
| A | (1-0.85)/3 + 0.85 × (0.33/1) | 0.05 + 0.28 = 0.33 |
| B | (1-0.85)/3 + 0.85 × (0.33/2) | 0.05 + 0.14 = 0.19 |
| C | (1-0.85)/3 + 0.85 × (0.33/2 + 0.33/1) | 0.05 + 0.42 = 0.47 |

After Several Iterations (Converged):

| Page | Final PageRank |
| --- | --- |
| A | 0.39 |
| B | 0.21 |
| C | 0.40 |

Interpretation: Page C has the highest rank because both A and B link to it; A is a close second because C passes its entire rank to A.
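The three steps above can be automated with a short power-iteration script (a minimal sketch; the graph is passed as a plain dict of out-links):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank: PR(p) = (1-d)/N + d * sum over in-links of PR(q)/L(q)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}             # Step 2: initialize to 1/N
    for _ in range(iters):                     # Step 3: iterate to convergence
        new = {p: (1 - d) / n for p in pages}
        for q, outs in links.items():
            for target in outs:                # q passes PR(q)/L(q) to each target
                new[target] += d * pr[q] / len(outs)
        pr = new
    return pr

# Step 1: the web graph A→B, A→C, B→C, C→A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, rank in sorted(ranks.items()):
    print(page, f"{rank:.2f}")   # converges to roughly A 0.39, B 0.21, C 0.40
```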


Q134. ๐ŸŸข Explain the rank computation using MapReduce.

[Asked: Jun 2024 | Frequency: 1]

Answer

PageRank with MapReduce:

PageRank computation is iterative and can be parallelized using MapReduce.

Data Structure: Each page stores: (PageID, CurrentRank, [OutLinks])

Map Phase:

For each page P with rank R and outlinks [L1, L2, ...]:
  - Emit (P, [L1, L2, ...])     // Preserve structure
  - For each outlink Li:
      Emit (Li, R/num_outlinks)  // Distribute rank

Reduce Phase:

For page P receiving:
  - Outlinks list [L1, L2, ...]
  - Rank contributions [r1, r2, ...]

  NewRank = (1-d)/N + d × sum(contributions)
  Emit (P, NewRank, [L1, L2, ...])

Diagram:

plantuml diagram

Iterations: Run multiple MapReduce jobs until PageRank converges.
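The two phases above can be simulated in-process to see the data flow (a sketch only: `map_phase` emits the pseudocode's two record types and `reduce_phase` regroups them, standing in for Hadoop's shuffle):

```python
from collections import defaultdict

def map_phase(graph, ranks):
    """Emit (page, outlinks) to preserve structure, plus (target, rank share)."""
    for page, outs in graph.items():
        yield page, ("LINKS", outs)
        for target in outs:
            yield target, ("RANK", ranks[page] / len(outs))

def reduce_phase(pairs, n, d=0.85):
    """Group emitted pairs by page, then apply NewRank = (1-d)/N + d * sum."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    new_ranks, graph = {}, {}
    for page, values in grouped.items():
        total = sum(v for tag, v in values if tag == "RANK")
        new_ranks[page] = (1 - d) / n + d * total
        for tag, v in values:
            if tag == "LINKS":
                graph[page] = v
    return new_ranks, graph

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {p: 1 / 3 for p in graph}
ranks, graph = reduce_phase(map_phase(graph, ranks), n=3)  # one MapReduce round
print(ranks)   # one round reproduces Iteration 1 of the previous question
```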


Q135. ๐ŸŸข Write short note on Different mechanisms of finding PageRank.

[Asked: Jun 2023 | Frequency: 1]

Answer

Mechanisms for Computing PageRank:

1. Power Iteration Method:

  • Most common approach

  • Iteratively multiply rank vector by transition matrix

  • Stop when ranks converge

r(k+1) = M × r(k)
Repeat until ||r(k+1) - r(k)|| < ฮต

2. Matrix Formulation:

  • Solve: r = M × r (eigenvector problem)

  • PageRank is principal eigenvector of transition matrix

3. MapReduce Computation:

  • Distributed computation for large graphs

  • Parallel processing across clusters

4. Monte Carlo Simulation:

  • Simulate random walks

  • Count visit frequency to each page

  • Approximate PageRank from frequencies

5. Algebraic Methods:

  • Gaussian elimination

  • LU decomposition

  • Suitable for small graphs only

Comparison:

| Method | Scale | Accuracy | Speed |
| --- | --- | --- | --- |
| Power Iteration | Large | High | Medium |
| MapReduce | Very Large | High | Fast (parallel) |
| Monte Carlo | Large | Approximate | Fast |
| Algebraic | Small | Exact | Slow |

Q136. ๐ŸŸข Write short note on Sensitive PageRank.

[Asked: Dec 2023 | Frequency: 1]

Answer

Sensitive PageRank (also called Topic-Sensitive PageRank) is a variation that computes personalized rankings based on user interests or specific topics.

Motivation: Standard PageRank gives same ranking for all users, but relevance varies by context.

How It Works:

| Aspect | Standard PR | Sensitive PR |
| --- | --- | --- |
| Teleportation | Random page | Topic-related pages |
| Bias | None | Toward preferred topics |
| Result | One ranking | Multiple rankings |

Formula Modification:

Standard: Random jump to any page with probability 1-d

Sensitive: Random jump to topic-specific pages with probability 1-d

$$PR_T(p) = \frac{(1-d)}{|T|} \cdot I_T(p) + d \sum_{q \in B_p} \frac{PR_T(q)}{L(q)}$$

Where $I_T(p) = 1$ if page p is in topic T, else 0.

Applications:

  • Personalized search results

  • Topic-specific recommendations

  • User preference modeling

Diagram:

ditaa diagram

Q137. ๐ŸŸข Explain the spider trap problem in PageRank.

[Asked: Dec 2024 | Frequency: 1]

Answer

Spider Trap occurs when a group of pages only link to each other, trapping the PageRank and absorbing all the rank over iterations.

Problem Description:

A spider trap is a set of pages where:

  • All outlinks stay within the set

  • No outlinks lead outside

  • PageRank flows in but never out

Diagram:

graphviz diagram

Effect on PageRank:

| Iteration | A | B | T1 | T2 |
| --- | --- | --- | --- | --- |
| Initial | 0.25 | 0.25 | 0.25 | 0.25 |
| After many | 0.0 | 0.0 | 0.5 | 0.5 |

All rank gets absorbed by the trap!

Solution: Taxation/Teleportation (damping factor)

  • With probability 1-d, jump to random page

  • Prevents complete absorption


Q138. ๐ŸŸข Explain the dead-end problem in PageRank.

[Asked: Dec 2024 | Frequency: 1]

Answer

Dead-End (Dangling Node) is a page with no outgoing links, causing PageRank to leak out of the system.

Problem Description:

When a random surfer reaches a dead-end:

  • No links to follow

  • PageRank has nowhere to go

  • Total PageRank decreases over iterations

Diagram:

d2 diagram

Effect on PageRank:

| Iteration | A | B | Dead-End | Total |
| --- | --- | --- | --- | --- |
| Initial | 0.33 | 0.33 | 0.33 | 1.00 |
| Next | 0.17 | 0.17 | 0.28 | 0.62 |
| Later | 0.08 | 0.08 | 0.15 | 0.31 |
| ... | 0.0 | 0.0 | 0.0 | 0.0 |

PageRank leaks out and eventually becomes zero!

Solutions:

| Solution | Description |
| --- | --- |
| Teleportation | Dead-end teleports to random page |
| Self-link | Add link from dead-end to itself |
| Remove | Eliminate dead-ends from graph |
| Redistribute | Distribute dead-end's PR equally |

Q139. ๐ŸŸข Discuss the solutions for spider trap and dead-end problem in PageRank.

[Asked: Dec 2024 | Frequency: 1]

Answer

Combined Solution: Random Teleportation (Damping Factor)

The Solution Formula:

$$PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}$$

How It Solves Both Problems:

Problem How Teleportation Helps
Spider Trap With prob 1-d, jump OUT of trap
Dead-End With prob 1-d, jump to random page

Diagram:

plantuml diagram

Dead-End Specific Solutions:

  1. Prune dead-ends: Remove recursively

  2. Redistribute: Dead-end's PR split equally to all pages

  3. Self-loop: Dead-end links to itself

Spider Trap Specific Solutions:

  1. Taxation: Force some PR to leave (damping)

  2. Trust pages: Only count trusted links

  3. TrustRank: Propagate trust from seed set

Typical Parameters:

  • d = 0.85 (follow link)

  • 1-d = 0.15 (teleport)


Q140. ๐ŸŸข What is Link Spamming?

[Asked: Jun 2025 | Frequency: 1]

Answer

Link Spamming is the practice of creating artificial or manipulative links to boost a page's search engine ranking unfairly.

Definition: Deliberate creation of link structures to deceive search engine algorithms into giving higher rankings than deserved.

Types of Link Spam:

Type Description
Link Farms Networks of pages linking to each other
Paid Links Buying links for PageRank
Comment Spam Adding links in blog comments
Hidden Links Invisible links on pages
Reciprocal Links "You link me, I link you"

Diagram:

svgbob diagram

Goal of Spammers:

  • Artificially inflate PageRank

  • Appear higher in search results

  • Drive traffic to low-quality content


Q141. ๐ŸŸข Explain Link Spamming with the help of an example.

[Asked: Jun 2025 | Frequency: 1]

Answer

Link Spam Example: Link Farm Attack

Scenario: A spam website wants to rank #1 for "cheap phones"

Setup:

graphviz diagram

How It Works:

Step Action
1 Create thousands of dummy pages
2 All pages link to target spam site
3 Farm pages link to each other (boost each other)
4 Try to get legitimate sites to link in
5 Target page gains artificial PageRank

Result Before Detection:

  • Spam page appears in top results

  • Users click and see low-quality content

  • Spammer profits from ads/scams


Q142. ๐ŸŸข Discuss the solutions to combat Link Spam.

[Asked: Jun 2025 | Frequency: 1]

Answer

Solutions to Combat Link Spam:

1. TrustRank Algorithm:

  • Start with trusted seed pages (manually verified)

  • Propagate trust through links

  • Spam pages get low trust scores

2. Spam Mass:

  • Calculate how much PageRank comes from spam

  • Penalize pages with high spam contribution

3. Link Analysis:

Technique Detection Method
Graph Analysis Detect unusual link patterns
Temporal Analysis Sudden link spikes
Anchor Text Unnatural keyword stuffing
Link Velocity Too many links too fast

4. NoFollow Attribute:

  • <a rel="nofollow"> tells search engines to ignore link

  • Used for user-generated content (comments)

5. Machine Learning:

  • Train classifiers on known spam

  • Detect spam patterns automatically
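A minimal TrustRank sketch (toy graph and seed set invented for illustration): it is ordinary PageRank except that teleportation goes only to the manually verified seed pages, so trust decays with link distance from the seeds and spam pages score low.

```python
links = {"seed": ["good"], "good": ["seed", "spam"], "spam": []}
seeds = {"seed"}
d = 0.85
pages = sorted(links)
t = {p: (1.0 if p in seeds else 0.0) for p in pages}

for _ in range(100):
    t = {p: ((1 - d) / len(seeds) if p in seeds else 0.0)  # teleport to seeds only
            + d * sum(t[q] / len(links[q])
                      for q in pages if links[q] and p in links[q])
         for p in pages}

print(sorted(t, key=t.get, reverse=True))  # → ['seed', 'good', 'spam']
```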

Diagram:

ditaa diagram

Modern Approach: Search engines use a combination of all these techniques, plus regular algorithm updates (e.g., Google Penguin), to penalize spam.


UNIT 12: WEB AND SOCIAL NETWORK ANALYSIS


Q143. ๐ŸŸข Explain how social networks can be represented using a graph.

[Asked: Dec 2022 | Frequency: 1]

Answer

Graph Representation of Social Networks:

A social network is naturally modeled as a graph where:

  • Nodes (Vertices) = People/Users

  • Edges (Links) = Relationships/Connections

Types of Social Graph:

Type Direction Example
Undirected Mutual Facebook friends
Directed One-way Twitter follow
Weighted Has strength Interaction frequency
Bipartite Two types Users & Groups

Diagram:

graphviz diagram

Key Graph Properties:

Property Meaning
Degree Number of connections
Path Route between two nodes
Clustering How connected neighbors are
Centrality Node importance
Components Connected subgraphs

Example Data Structure:

Adjacency List:
Alice: [Bob, Charlie]
Bob: [Alice, Diana]
Charlie: [Alice, Diana]
Diana: [Bob, Charlie, Eve]
Eve: [Diana]
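The adjacency list above maps directly onto a dictionary. A small sketch computing node degree and verifying that every friendship edge is mutual (as required in an undirected graph):

```python
# The undirected friendship graph from the example, as an adjacency list.
graph = {
    "Alice":   ["Bob", "Charlie"],
    "Bob":     ["Alice", "Diana"],
    "Charlie": ["Alice", "Diana"],
    "Diana":   ["Bob", "Charlie", "Eve"],
    "Eve":     ["Diana"],
}

degree = {person: len(friends) for person, friends in graph.items()}
print(degree["Diana"])  # → 3  (Diana is the best-connected node)

# In an undirected graph every edge must appear in both endpoints' lists.
assert all(a in graph[b] for a, friends in graph.items() for b in friends)
```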


Q144. ๐ŸŸข Discuss the issues in mining social network graphs.

[Asked: Jun 2022 | Frequency: 1]

Answer

Issues in Social Network Mining:

Category Issue Description
Scale Massive Size Billions of nodes and edges
Scale Dynamic Constantly changing
Data Noise Fake accounts, spam
Data Incompleteness Missing connections
Privacy Sensitive Data Personal information
Privacy Anonymization Hard to truly anonymize
Technical Heterogeneity Multiple relationship types
Technical Semantics Context matters

Diagram:

plantuml diagram

Specific Challenges:

  1. Community Detection: Finding groups is NP-hard

  2. Influence Propagation: Predicting spread patterns

  3. Link Prediction: Guessing future connections

  4. Sybil Attacks: Fake identity networks

  5. Filter Bubbles: Echo chamber detection


Q145. ๐ŸŸข What is Web Analytics?

[Asked: Jun 2024 | Frequency: 1]

Answer

Web Analytics is the collection, measurement, analysis, and reporting of website data to understand and optimize web usage.

Definition: Web analytics helps businesses understand how users interact with their websites to improve user experience and achieve goals.

Key Metrics:

Metric Description
Page Views Total pages viewed
Unique Visitors Distinct users
Bounce Rate Single-page visits
Session Duration Time on site
Conversion Rate Goal completions
Traffic Sources Where users come from
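How two of these metrics fall out of raw session data can be shown with a toy log (invented for illustration):

```python
# Toy session log: each entry is one visit to the site.
sessions = [
    {"pages_viewed": 1, "converted": False},   # bounce
    {"pages_viewed": 4, "converted": True},
    {"pages_viewed": 2, "converted": False},
    {"pages_viewed": 1, "converted": False},   # bounce
    {"pages_viewed": 6, "converted": True},
]

# Bounce rate: share of single-page visits; conversion rate: share of goals met.
bounce_rate = sum(s["pages_viewed"] == 1 for s in sessions) / len(sessions)
conversion_rate = sum(s["converted"] for s in sessions) / len(sessions)
print(f"Bounce rate: {bounce_rate:.0%}, Conversion rate: {conversion_rate:.0%}")
# → Bounce rate: 40%, Conversion rate: 40%
```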

Process:

graph LR subgraph Analytics [Analytics Process] Collect[Tracking Code/Logs] --> Process[Clean/Aggregate] Process --> Analyze[Patterns/Trends] Analyze --> Report[Dashboards/Reports] Report --> Act[Optimize/Decide] end style Collect fill:#4a90d9,stroke:#333,stroke-width:2px,color:white style Process fill:#7ab8f5,stroke:#333,stroke-width:2px,color:white style Analyze fill:#90c695,stroke:#333,stroke-width:2px,color:white style Report fill:#ffe66d,stroke:#333,stroke-width:2px,color:black style Act fill:#ff6b6b,stroke:#333,stroke-width:2px,color:white

Popular Tools:

  • Google Analytics

  • Adobe Analytics

  • Mixpanel

  • Hotjar

Applications:

  • User behavior analysis

  • Marketing campaign tracking

  • A/B testing

  • Conversion optimization

  • Content performance


Q146. ๐ŸŸข Explain the issues in online advertising.

[Asked: Jun 2024 | Frequency: 1]

Answer

Issues in Online Advertising:

Category Issue Description
Fraud Click Fraud Fake clicks to exhaust budgets
Fraud Bot Traffic Non-human impressions
Fraud Ad Injection Unauthorized ad placement
Privacy Tracking User surveillance concerns
Privacy Data Collection Personal data harvesting
UX Ad Blockers Users block ads
UX Banner Blindness Users ignore ads
Quality Brand Safety Ads on inappropriate content
Quality Viewability Ads not actually seen

Click Fraud Example:

svgbob diagram

Solutions:

Issue Solution
Click Fraud Machine learning detection
Bot Traffic CAPTCHA, behavior analysis
Privacy Consent frameworks (GDPR)
Ad Blockers Native advertising
Brand Safety Content verification
Viewability Viewability standards (MRC)

Diagram:

ditaa diagram

Q147. ๐ŸŸข What is Data Lake? Explain the term Data Lake.

[Asked: Jun 2023 | Frequency: 1]

Answer

Data Lake is a centralized repository that stores all structured, semi-structured, and unstructured data at any scale in its native format.

Definition: A data lake stores raw data in its original format until it's needed for analysis, unlike data warehouses that require predefined schemas.

Key Characteristics:

Characteristic Description
Schema-on-Read Define structure when reading, not storing
Raw Format Store data as-is
Any Data Type Structured, semi-structured, unstructured
Scalable Handles petabytes of data
Cost-Effective Uses commodity storage
Flexible Adapt to changing needs

Diagram:

d2 diagram

Data Lake vs Data Warehouse:

Aspect Data Lake Data Warehouse
Schema Schema-on-Read Schema-on-Write
Data Type All types Structured only
Processing ELT ETL
Cost Lower Higher
Users Data Scientists Business Analysts

Q148. ๐ŸŸข Briefly discuss the key capabilities of data lake.

[Asked: Jun 2023 | Frequency: 1]

Answer

Key Capabilities of Data Lake:

Capability Description
Data Ingestion Collect from any source
Storage Store any data type at any scale
Processing Batch and real-time processing
Governance Data quality, security, compliance
Discovery Catalog and search data
Analytics ML, BI, advanced analytics

Detailed Capabilities:

1. Universal Data Ingestion:

  • Batch uploads

  • Real-time streaming

  • CDC (Change Data Capture)

  • API integrations

2. Scalable Storage:

  • Petabyte scale

  • Cost-effective object storage

  • Data compression

  • Lifecycle management

3. Data Processing:

  • ETL/ELT pipelines

  • Spark, Hadoop processing

  • SQL queries

  • Stream processing

4. Data Governance:

  • Access control

  • Data lineage

  • Quality monitoring

  • Compliance (GDPR, HIPAA)

5. Advanced Analytics:

  • Machine learning

  • Predictive analytics

  • Real-time dashboards

  • Ad-hoc queries


Q149. ๐ŸŸข What is Collaborative Filtering?

[Asked: Jun 2022 | Frequency: 1]

Answer

Collaborative Filtering is a recommendation technique that predicts user preferences based on the collective behavior of many users.

Core Principle: "Users who agreed in the past will agree in the future"

Types:

Type Description
User-Based Find similar users, recommend their items
Item-Based Find similar items, recommend to users
Matrix Factorization Decompose user-item matrix

How It Works:

svgbob diagram

Key Insight:

  • Don't need to know content

  • Uses patterns from user behavior

  • "People like you also liked..."


Q150. ๐ŸŸข Explain Collaborative filtering with the help of an example.

[Asked: Jun 2022 | Frequency: 1]

Answer

Collaborative Filtering Example - Movie Recommendations:

Step 1: User-Item Matrix

User Avengers Titanic Inception Notebook
Alice 5 3 5 ?
Bob 5 2 4 1
Carol 2 5 2 5
Dave ? 4 ? 4

Step 2: Find Similar Users (for Alice)

Calculate similarity (Cosine/Pearson):

  • Alice vs Bob: 0.95 (very similar - both like action)

  • Alice vs Carol: 0.25 (different - Carol likes romance)

Step 3: Predict Alice's Rating for "Notebook"

Since Alice ≈ Bob:

  • Bob rated Notebook = 1

  • Predict Alice's rating ≈ 1-2 (low)

Since Alice ≠ Carol:

  • Carol's high rating less relevant

Step 4: Recommendation

Alice's predicted ratings:
- Notebook: 1.5 (Don't recommend)
- Other action movies: High (Recommend!)
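The similarity step can be made concrete with a short sketch using Pearson correlation over co-rated items. (The 0.95/0.25 figures above are rounded illustrations; the exact values depend on the similarity measure chosen.)

```python
from math import sqrt

# User-item ratings from the example (missing entries are simply omitted).
ratings = {
    "Alice": {"Avengers": 5, "Titanic": 3, "Inception": 5},
    "Bob":   {"Avengers": 5, "Titanic": 2, "Inception": 4, "Notebook": 1},
    "Carol": {"Avengers": 2, "Titanic": 5, "Inception": 2, "Notebook": 5},
}

def pearson(u, v):
    """Pearson correlation over the items both users rated."""
    common = [m for m in ratings[u] if m in ratings[v]]
    a = [ratings[u][m] for m in common]
    b = [ratings[v][m] for m in common]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a)) * sqrt(sum((y - mb) ** 2 for y in b))
    return num / den

sim_bob, sim_carol = pearson("Alice", "Bob"), pearson("Alice", "Carol")
print(round(sim_bob, 2), round(sim_carol, 2))
# Bob is far more similar to Alice, so his low rating (1) for "Notebook"
# dominates the prediction and the movie is not recommended.
```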

Diagram:

plantuml diagram

Q151. ๐ŸŸข What is a Recommender System?

[Asked: Dec 2024 | Frequency: 1]

Answer

Recommender System is an information filtering system that predicts and suggests items a user might be interested in based on various data sources.

Purpose:

  • Reduce information overload

  • Personalize user experience

  • Increase engagement and sales

Types of Recommender Systems:

Type Method Example
Content-Based Item features "Similar to what you liked"
Collaborative User behavior "Users like you also liked"
Hybrid Combination Netflix, Amazon
Knowledge-Based User requirements "Based on your needs"

Applications:

Platform Recommendation
Netflix Movies, TV shows
Amazon Products
Spotify Music, playlists
YouTube Videos
LinkedIn Jobs, connections

Architecture:

ditaa diagram

Q152. ๐ŸŸก Explain the concept of Recommendation System with diagram.

[Asked: Dec 2022, Dec 2024 | Frequency: 2]

Answer

Recommendation System Concepts:

1. Content-Based Filtering: Recommends items similar to what user liked before.

User likes: Action movies with Tom Cruise
System finds: Movies with similar attributes
Recommends: Mission Impossible series

2. Collaborative Filtering: Recommends based on similar users' preferences.

User A likes: Avengers, Iron Man
User B likes: Avengers, Iron Man, Thor
Recommend to A: Thor (because B liked it)

3. Hybrid Approach: Combines both methods for better accuracy.

Architecture Diagram:

graphviz diagram

Evaluation Metrics:

Metric Description
Precision Relevant / Recommended
Recall Relevant recommended / Total relevant
RMSE Prediction error
Coverage Items that can be recommended
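The first two metrics can be computed directly from a recommendation list and a ground-truth set (toy data, invented for illustration):

```python
# Evaluating a top-5 recommendation list against the items the user liked.
recommended = ["m1", "m2", "m3", "m4", "m5"]
relevant = {"m2", "m4", "m7", "m9"}          # ground-truth liked items

hits = [m for m in recommended if m in relevant]
precision = len(hits) / len(recommended)     # relevant ∩ recommended / recommended
recall = len(hits) / len(relevant)           # relevant ∩ recommended / all relevant
print(precision, recall)  # → 0.4 0.5
```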

UNIT 13: BASICS OF R PROGRAMMING


Q153. ๐ŸŸข Define Complex data type in R programming with example.

[Asked: Jun 2022 | Frequency: 1]

Answer

Complex Data Type in R is used to store complex numbers with real and imaginary parts.

Syntax:

z <- complex(real = a, imaginary = b)
# OR
z <- a + bi

Examples:

# Creating complex numbers
z1 <- 3 + 2i
z2 <- complex(real = 5, imaginary = -3)

# Check type
class(z1)  # "complex"

# Operations
z3 <- z1 + z2  # (8-1i)
z4 <- z1 * z2  # (21+1i)

# Extract parts
Re(z1)   # 3 (real part)
Im(z1)   # 2 (imaginary part)
Mod(z1)  # 3.606 (modulus: sqrt(3²+2²))
Conj(z1) # 3-2i (conjugate)

Use Cases:

  • Signal processing

  • Electrical engineering

  • Quantum mechanics simulations


Q154. ๐ŸŸก What are Strings in R? Explain with example.

[Asked: Jun 2024, Jun 2022 | Frequency: 2]

Answer

Strings (Character type) in R are sequences of characters enclosed in single or double quotes.

Creating Strings:

# Single or double quotes
str1 <- "Hello World"
str2 <- 'R Programming'

# Check type
class(str1)  # "character"

Common String Functions:

Function Purpose Example
nchar() Length nchar("Hello") → 5
paste() Concatenate paste("a", "b") → "a b"
substr() Substring substr("Hello", 1, 3) → "Hel"
toupper() Uppercase toupper("hi") → "HI"
tolower() Lowercase tolower("HI") → "hi"
strsplit() Split strsplit("a-b", "-") → ["a","b"]

Example:

name <- "Data Science"
print(nchar(name))           # 12
print(toupper(name))         # "DATA SCIENCE"
print(substr(name, 1, 4))    # "Data"
print(paste(name, "2024"))   # "Data Science 2024"


Q155. ๐ŸŸข Define %% operator in R programming with example.

[Asked: Jun 2022 | Frequency: 1]

Answer

%% Operator is the modulus operator that returns the remainder after division.

Syntax:

result <- dividend %% divisor

Examples:

# Basic modulus
10 %% 3   # Returns 1 (10 = 3×3 + 1)
15 %% 5   # Returns 0 (15 = 5×3 + 0)
7 %% 2    # Returns 1 (7 = 2×3 + 1)

# Check if even or odd
x <- 8
if (x %% 2 == 0) {
  print("Even")
} else {
  print("Odd")
}
# Output: "Even"

# Vector operation
c(10, 15, 22) %% 3  # Returns c(1, 0, 1)

Use Cases:

  • Check even/odd numbers

  • Circular array indexing

  • Time calculations (hours, minutes)

  • Divisibility tests


Q156. ๐ŸŸข Define <- or <<- operator in R programming with example.

[Asked: Jun 2022 | Frequency: 1]

Answer

Assignment Operators in R:

Operator Scope Description
<- Local Assigns value in current environment
<<- Global Assigns value in parent/global environment

Local Assignment (<-):

x <- 10        # Assign 10 to x
y <- "Hello"   # Assign string to y
z <- c(1,2,3)  # Assign vector to z

# Same as = but preferred in R
x = 10  # Also works but <- is convention

Global Assignment (<<-):

# Used inside functions to modify global variables
x <- 5  # Global

test_func <- function() {
  x <- 10     # Creates LOCAL x (doesn't affect global)
  x <<- 20    # Modifies GLOBAL x
}

test_func()
print(x)  # 20 (global x was changed by <<-)

Diagram:

svgbob diagram

Q157. ๐ŸŸข Explain different types of data structures in R-language.

[Asked: Dec 2024 | Frequency: 1]

Answer

R Data Structures:

Structure Dimension Data Types Example
Vector 1D Homogeneous c(1,2,3)
Matrix 2D Homogeneous matrix(1:6, 2, 3)
Array nD Homogeneous array(1:24, c(2,3,4))
List 1D Heterogeneous list(1, "a", TRUE)
Data Frame 2D Heterogeneous columns data.frame(...)
Factor 1D Categorical factor(c("M","F"))

Diagram:

d2 diagram

Examples:

# Vector
v <- c(1, 2, 3, 4)

# Matrix
m <- matrix(1:6, nrow=2, ncol=3)

# List
l <- list(name="John", age=25, scores=c(90,85))

# Data Frame
df <- data.frame(
  Name = c("A", "B"),
  Age = c(20, 25)
)

# Factor
f <- factor(c("Low", "High", "Medium"))


Q158. ๐Ÿ”ด What is a Vector in R programming? Describe with example.

[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]

Answer

Vector is the most basic data structure in R, a one-dimensional array that holds elements of the same data type.

Creating Vectors:

# Using c() function
numeric_vec <- c(1, 2, 3, 4, 5)
char_vec <- c("a", "b", "c")
logical_vec <- c(TRUE, FALSE, TRUE)

# Using sequences
seq_vec <- 1:10           # 1 to 10
seq_vec2 <- seq(1, 10, 2) # 1, 3, 5, 7, 9

# Using rep()
rep_vec <- rep(5, 3)      # c(5, 5, 5)

Vector Operations:

v <- c(10, 20, 30, 40, 50)

# Accessing elements
v[1]       # 10 (first element)
v[2:4]     # c(20, 30, 40)
v[c(1,5)]  # c(10, 50)

# Arithmetic (element-wise)
v + 5      # c(15, 25, 35, 45, 55)
v * 2      # c(20, 40, 60, 80, 100)

# Functions
length(v)  # 5
sum(v)     # 150
mean(v)    # 30
max(v)     # 50
min(v)     # 10

Diagram:

svgbob diagram

Q159. ๐ŸŸก What is a List in R programming? Describe with example.

[Asked: Dec 2023, Dec 2022 | Frequency: 2]

Answer

List is a data structure that can contain elements of different types (heterogeneous), including other lists.

Creating Lists:

# Basic list
my_list <- list(
  name = "Alice",
  age = 25,
  scores = c(85, 90, 78),
  passed = TRUE
)

# Unnamed list
l <- list(1, "hello", TRUE, c(1,2,3))

Accessing Elements:

# Using $ (named elements)
my_list$name      # "Alice"
my_list$scores    # c(85, 90, 78)

# Using [[ ]] (by index or name)
my_list[[1]]      # "Alice"
my_list[["age"]]  # 25

# Using [ ] (returns sub-list)
my_list[1]        # List with name element

List Operations:

# Add element
my_list$city <- "Mumbai"

# Modify element
my_list$age <- 26

# Remove element
my_list$passed <- NULL

# Length
length(my_list)   # Number of elements

# Names
names(my_list)    # c("name", "age", "scores", "city")

Diagram:

ditaa diagram

Q160. ๐ŸŸข Explain Matrices in R programming with example.

[Asked: Jun 2024 | Frequency: 1]

Answer

Matrix is a two-dimensional data structure with elements of the same type arranged in rows and columns.

Creating Matrices:

# Using matrix() function
m <- matrix(1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# By row (default is by column)
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# With row/column names
m3 <- matrix(1:4, nrow = 2,
             dimnames = list(c("R1","R2"), c("C1","C2")))

Matrix Operations:

m <- matrix(1:6, nrow = 2, ncol = 3)

# Accessing elements
m[1, 2]     # Element at row 1, col 2 → 3
m[1, ]      # First row → c(1, 3, 5)
m[, 2]      # Second column → c(3, 4)

# Dimensions
dim(m)      # c(2, 3)
nrow(m)     # 2
ncol(m)     # 3

# Arithmetic
m + 10      # Add 10 to all elements
m * 2       # Multiply all by 2

# Matrix multiplication
a <- matrix(1:4, 2, 2)
b <- matrix(5:8, 2, 2)
a %*% b     # Matrix multiplication


Q161. ๐ŸŸก What are Dataframes in R programming? Explain with example.

[Asked: Jun 2023, Dec 2022 | Frequency: 2]

Answer

Data Frame is a two-dimensional table where each column can have different data types, similar to a spreadsheet or SQL table.

Creating Data Frames:

# Using data.frame()
students <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 22, 21),
  Grade = c("A", "B", "A"),
  Passed = c(TRUE, TRUE, TRUE)
)

#      Name Age Grade Passed
# 1   Alice  20     A   TRUE
# 2     Bob  22     B   TRUE
# 3 Charlie  21     A   TRUE

Accessing Data:

# Column access
students$Name          # Vector of names
students[, "Age"]      # Age column
students[, 2]          # Second column

# Row access
students[1, ]          # First row
students[1:2, ]        # First two rows

# Cell access
students[1, "Name"]    # "Alice"
students$Name[1]       # "Alice"

Common Operations:

# Dimensions
nrow(students)         # 3
ncol(students)         # 4
dim(students)          # c(3, 4)

# Add column
students$City <- c("NYC", "LA", "CHI")

# Add row
new_student <- data.frame(Name="Diana", Age=23, 
                          Grade="A", Passed=TRUE, City="SF")
students <- rbind(students, new_student)

# Summary
summary(students)
str(students)


Q162. ๐ŸŸข Give characteristics of Dataframes in R programming.

[Asked: Jun 2023 | Frequency: 1]

Answer

Characteristics of Data Frames:

Characteristic Description
2D Structure Rows and columns (like table)
Heterogeneous Columns Each column can have different type
Homogeneous Rows Each row has same structure
Named Columns Columns must have names
Equal Length All columns same length
Indexable Row/column indexing

Key Properties:

df <- data.frame(
  ID = 1:3,
  Name = c("A", "B", "C"),
  Score = c(85.5, 90.0, 78.5)
)

# Properties
class(df)        # "data.frame"
typeof(df)       # "list" (internally a list)
names(df)        # Column names
rownames(df)     # Row names (default: 1,2,3...)

Comparison:

Feature Matrix Data Frame
Data types Same Different per column
Columns Optional names Required names
Use case Math operations Data analysis

Diagram:

svgbob diagram

Q163. ๐ŸŸข What are factors in R programming?

[Asked: Jun 2025 | Frequency: 1]

Answer

Factor is a data structure used to represent categorical (nominal or ordinal) variables with a fixed set of possible values called levels.

Creating Factors:

# Basic factor
gender <- factor(c("Male", "Female", "Male", "Female"))
print(gender)
# [1] Male   Female Male   Female
# Levels: Female Male

# Ordered factor (ordinal)
size <- factor(c("Small", "Large", "Medium"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)
# [1] Small Large Medium
# Levels: Small < Medium < Large

Factor Properties:

# Get levels
levels(gender)    # c("Female", "Male")

# Number of levels
nlevels(gender)   # 2

# Underlying integers
as.integer(gender)  # c(2, 1, 2, 1)

# Summary
summary(gender)
# Female   Male 
#      2      2

Use Cases:

  • Survey responses (Agree, Disagree, Neutral)

  • Categories (Product types, Regions)

  • Ordinal data (Low, Medium, High)

  • Statistical modeling (ANOVA, regression)


Q164. ๐ŸŸข Give characteristics of factors in R programming.

[Asked: Jun 2025 | Frequency: 1]

Answer

Factor Characteristics:

Characteristic Description
Levels Fixed set of allowed values
Storage Stored as integers internally
Labels Human-readable level names
Ordering Can be ordered or unordered
Memory Efficient Integer storage saves space
Statistical Used in modeling

Diagram:

d2 diagram

Ordered vs Unordered:

# Unordered (nominal)
color <- factor(c("Red", "Blue", "Green"))
# No inherent order

# Ordered (ordinal)
rating <- factor(c("Poor", "Good", "Excellent"),
                 levels = c("Poor", "Good", "Excellent"),
                 ordered = TRUE)
# Poor < Good < Excellent

rating[1] < rating[3]  # TRUE (comparison works)

Common Operations:

f <- factor(c("A", "B", "A", "C"))

table(f)           # Frequency table
droplevels(f)      # Remove unused levels
relevel(f, "B")    # Change reference level


Q165. ๐Ÿ”ด Write R program for matrix operations.

[Asked: Dec 2022, Jun 2022 | Frequency: 4]

Answer

Matrix Operations in R:

# Create two 3×3 matrices
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3)
B <- matrix(c(9, 8, 7, 6, 5, 4, 3, 2, 1), nrow = 3, ncol = 3)

print("Matrix A:")
print(A)
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

print("Matrix B:")
print(B)
#      [,1] [,2] [,3]
# [1,]    9    6    3
# [2,]    8    5    2
# [3,]    7    4    1

# Addition
C <- A + B
print("A + B:")
print(C)
#      [,1] [,2] [,3]
# [1,]   10   10   10
# [2,]   10   10   10
# [3,]   10   10   10

# Subtraction
D <- A - B
print("A - B:")
print(D)

# Element-wise multiplication
E <- A * B
print("A * B (element-wise):")
print(E)

# Matrix multiplication
F <- A %*% B
print("A %*% B (matrix multiplication):")
print(F)

# Transpose
print("Transpose of A:")
print(t(A))

# Determinant
print("Determinant of A:")
print(det(A))  # ≈ 0: this particular A is singular

# Inverse (only for invertible matrices)
# print(solve(A))  # Would fail here, since det(A) = 0


Q166. ๐ŸŸข How is R matrix multiplication different from C program?

[Asked: Dec 2022 | Frequency: 1]

Answer

Comparison: R vs C Matrix Multiplication

Aspect R C
Syntax Single operator %*% Nested loops
Code Length 1 line 10+ lines
Memory Automatic Manual allocation
Indexing 1-based 0-based
Vectorization Built-in Manual

R Code:

# Matrix multiplication in R
A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)
C <- A %*% B  # One line!

C Code:

// Matrix multiplication in C
int A[2][2] = {{1,3}, {2,4}};
int B[2][2] = {{5,7}, {6,8}};
int C[2][2];

// Triple nested loop required
for(int i = 0; i < 2; i++) {
    for(int j = 0; j < 2; j++) {
        C[i][j] = 0;
        for(int k = 0; k < 2; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Diagram:

ditaa diagram

Q167. ๐ŸŸข Write R code to concatenate strings.

[Asked: Jun 2025 | Frequency: 1]

Answer

String Concatenation in R:

# Using paste() - adds space by default
str1 <- "Hello"
str2 <- ","
str3 <- "Learning is Fun"

result <- paste(str1, str2, str3)
print(result)
# Output: "Hello , Learning is Fun"

# Using paste0() - no separator
result2 <- paste0(str1, str2, " ", str3)
print(result2)
# Output: "Hello, Learning is Fun"

# Custom separator
result3 <- paste(str1, str2, str3, sep = "")
print(result3)
# Output: "Hello,Learning is Fun"

# Collapse vector elements
words <- c("Hello", "World", "R")
collapsed <- paste(words, collapse = "-")
print(collapsed)
# Output: "Hello-World-R"

Functions Comparison:

Function Default Separator Example
paste() Space (" ") paste("a","b") → "a b"
paste0() None ("") paste0("a","b") → "ab"

Using sprintf():

name <- "Alice"
age <- 25
msg <- sprintf("My name is %s and I am %d years old", name, age)
print(msg)
# Output: "My name is Alice and I am 25 years old"


UNIT 14: DATA INTERFACING AND VISUALIZATION IN R


Q168. ๐ŸŸข What is JSON File in R?

[Asked: Jun 2025 | Frequency: 1]

Answer

JSON (JavaScript Object Notation) is a lightweight data interchange format that R can read and write using the jsonlite package.

JSON Structure:

{
  "name": "Alice",
  "age": 25,
  "courses": ["Data Science", "Machine Learning"],
  "active": true
}

Working with JSON in R:

# Install package
install.packages("jsonlite")
library(jsonlite)

# Read JSON file
data <- fromJSON("data.json")

# Read JSON from string
json_str <- '{"name": "Bob", "age": 30}'
data <- fromJSON(json_str)

# Write to JSON
toJSON(data)
write_json(data, "output.json")

Why JSON with R:

Purpose Description
Web APIs Most APIs return JSON
Data Exchange Universal format
Configuration Store settings
Lightweight Human-readable

Q169. ๐ŸŸข How to convert JSON into a data frame in R?

[Asked: Jun 2025 | Frequency: 1]

Answer

JSON to Data Frame Conversion:

# Load library
library(jsonlite)

# JSON string with array of objects
json_data <- '[
  {"name": "Alice", "age": 25, "city": "NYC"},
  {"name": "Bob", "age": 30, "city": "LA"},
  {"name": "Charlie", "age": 28, "city": "CHI"}
]'

# Convert to data frame
df <- fromJSON(json_data)
print(df)
#      name age city
# 1   Alice  25  NYC
# 2     Bob  30   LA
# 3 Charlie  28  CHI

# From file
df <- fromJSON("data.json")

# Check structure
class(df)  # "data.frame"
str(df)

Handling Nested JSON:

# Nested JSON
nested_json <- '{
  "company": "TechCorp",
  "employees": [
    {"name": "Alice", "dept": "IT"},
    {"name": "Bob", "dept": "HR"}
  ]
}'

data <- fromJSON(nested_json)
employees_df <- data$employees  # Extract nested data frame

Diagram:

svgbob diagram

Q170. ๐Ÿ”ด How to draw a Bar Chart in R?

[Asked: Jun 2024, Jun 2023, Jun 2022, Dec 2024 | Frequency: 4]

Answer

Bar Chart in R using barplot():

Syntax:

barplot(height, names.arg, main, xlab, ylab, col)

Example:

# Data
categories <- c("A", "B", "C", "D", "E")
values <- c(25, 40, 30, 55, 45)

# Basic bar chart
barplot(values,
        names.arg = categories,
        main = "Sales by Category",
        xlab = "Category",
        ylab = "Sales",
        col = "steelblue")

# Horizontal bar chart
barplot(values,
        names.arg = categories,
        main = "Sales by Category",
        horiz = TRUE,
        col = rainbow(5))

# Grouped bar chart
data <- matrix(c(10, 20, 15, 25, 30, 35), nrow = 2)
barplot(data,
        names.arg = c("Q1", "Q2", "Q3"),
        beside = TRUE,
        col = c("red", "blue"),
        legend = c("2023", "2024"))

Parameters:

Parameter Description
height Vector of bar heights
names.arg Labels for bars
main Chart title
col Bar colors
horiz Horizontal if TRUE
beside Grouped bars if TRUE

Q171. ๐Ÿ”ด How to create a Box Plot in R?

[Asked: Dec 2022, Jun 2023, Dec 2024 | Frequency: 3]

Answer

Box Plot in R using boxplot():

Syntax:

boxplot(data, main, xlab, ylab, col)

Example:

# Single box plot
data <- c(23, 25, 28, 30, 32, 35, 38, 40, 42, 100)
boxplot(data,
        main = "Distribution of Values",
        ylab = "Value",
        col = "lightblue")

# Multiple box plots
group1 <- c(10, 12, 14, 15, 18, 20)
group2 <- c(20, 22, 24, 26, 28, 30)
group3 <- c(15, 18, 20, 22, 25, 28)

boxplot(group1, group2, group3,
        names = c("A", "B", "C"),
        main = "Comparison of Groups",
        col = c("red", "green", "blue"))

# From data frame
df <- data.frame(
  value = c(10,12,15,20,22,25,30,32,35),
  group = c("A","A","A","B","B","B","C","C","C")
)
boxplot(value ~ group, data = df,
        main = "Values by Group",
        col = "orange")

Box Plot Anatomy:

svgbob diagram

Q172. ๐ŸŸก How to create a Histogram in R?

[Asked: Jun 2023, Dec 2024 | Frequency: 2]

Answer

Histogram in R using hist():

Syntax:

hist(x, breaks, main, xlab, ylab, col)

Example:

# Generate sample data
data <- rnorm(100, mean = 50, sd = 10)

# Basic histogram
hist(data,
     main = "Distribution of Values",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen")

# Custom breaks (bins)
hist(data,
     breaks = 20,
     main = "Histogram with 20 Bins",
     col = "steelblue",
     border = "white")

# Probability density instead of frequency
hist(data,
     probability = TRUE,
     main = "Density Histogram",
     col = "coral")
lines(density(data), col = "blue", lwd = 2)

Parameters:

Parameter Description
x Numeric vector
breaks Number of bins or breakpoints
probability TRUE for density
col Fill color
border Border color

Q173. ๐ŸŸก How to create Line Graphs in R?

[Asked: Jun 2023, Dec 2024 | Frequency: 2]

Answer

Line Graph in R using plot() with type="l":

Syntax:

plot(x, y, type = "l", main, xlab, ylab, col)

Example:

# Data
months <- 1:12
sales <- c(100, 120, 140, 130, 150, 180, 200, 190, 170, 160, 140, 150)

# Basic line graph
plot(months, sales,
     type = "l",
     main = "Monthly Sales",
     xlab = "Month",
     ylab = "Sales ($)",
     col = "blue",
     lwd = 2)

# Line with points
plot(months, sales,
     type = "b",  # both line and points
     main = "Monthly Sales",
     col = "red",
     pch = 16)

# Multiple lines
sales2024 <- c(110, 130, 150, 140, 160, 190, 210, 200, 180, 170, 150, 160)
plot(months, sales, type = "l", col = "blue", ylim = c(80, 220))
lines(months, sales2024, col = "red")
legend("topleft", legend = c("2023", "2024"), 
       col = c("blue", "red"), lty = 1)

Type Options:

Type Description
"l" Line only
"p" Points only
"b" Both (with gap)
"o" Overplotted
"s" Steps

Q174. ๐Ÿ”ด How to draw a Scatter Plot in R?

[Asked: Dec 2024, Jun 2024, Jun 2023 | Frequency: 3]

Answer

Scatter Plot in R using plot():

Syntax:

plot(x, y, main, xlab, ylab, pch, col)

Example:

# Data
height <- c(150, 160, 165, 170, 175, 180, 185, 190)
weight <- c(50, 55, 60, 65, 70, 75, 80, 85)

# Basic scatter plot
plot(height, weight,
     main = "Height vs Weight",
     xlab = "Height (cm)",
     ylab = "Weight (kg)",
     pch = 16,
     col = "blue")

# Add trend line
abline(lm(weight ~ height), col = "red", lwd = 2)

# Different point styles
plot(height, weight,
     pch = 19,        # Solid circle
     col = "darkgreen",
     cex = 1.5)       # Point size

# Color by category
gender <- c("M", "M", "F", "F", "M", "F", "M", "F")
colors <- ifelse(gender == "M", "blue", "red")
plot(height, weight, col = colors, pch = 16)
legend("topleft", legend = c("Male", "Female"),
       col = c("blue", "red"), pch = 16)

Common pch values:

pch Symbol
1 Circle
16 Solid circle
17 Triangle
18 Diamond
19 Solid circle

UNIT 15: DATA ANALYSIS AND R


Q175. 🟡 What is Linear Regression?

[Asked: Jun 2025, Jun 2022 | Frequency: 2]

Answer

Linear Regression is a statistical method to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation.

Simple Linear Regression Formula:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:

  • y = Dependent variable (predicted)

  • x = Independent variable (predictor)

  • β₀ = Intercept

  • β₁ = Slope

  • ε = Error term

Diagram:

svgbob diagram

Assumptions:

  1. Linear relationship

  2. Independence of errors

  3. Homoscedasticity (constant variance)

  4. Normally distributed errors

Use Cases:

  • Predicting sales from advertising spend

  • Estimating house prices

  • Forecasting demand
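
For reference, the coefficients above are estimated by ordinary least squares (this is what R's lm() computes); in the simple one-variable case the closed-form estimates are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the predictor and the response.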


Q176. 🔴 Explain Linear Regression using R-language.

[Asked: Dec 2024, Jun 2022 | Frequency: 3]

Answer

Linear Regression in R using lm():

# Sample data
height <- c(150, 160, 170, 180, 190)
weight <- c(50, 60, 70, 80, 90)

# Create linear model
model <- lm(weight ~ height)

# View model summary
summary(model)
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) -100.0000    ...
# height         1.0000    ...

# Get coefficients
coefficients(model)
# (Intercept)      height 
#    -100.00         1.00

# Predict new values
new_height <- data.frame(height = c(155, 175))
predict(model, new_height)
# [1] 55 75

# Plot regression line
plot(height, weight, main = "Height vs Weight",
     xlab = "Height", ylab = "Weight", pch = 16)
abline(model, col = "red", lwd = 2)

Key Functions:

| Function | Purpose |
|----------|---------|
| lm() | Create linear model |
| summary() | Model statistics |
| coefficients() | Get coefficients |
| predict() | Make predictions |
| residuals() | Get residuals |
| abline() | Draw regression line |

Q177. 🟢 Differentiate between Linear Regression and Multiple Regression.

[Asked: Jun 2023 | Frequency: 1]

Answer

Comparison:

| Aspect | Linear Regression | Multiple Regression |
|--------|-------------------|---------------------|
| Variables | 1 independent | 2+ independent |
| Formula | y = β₀ + β₁x | y = β₀ + β₁x₁ + β₂x₂ + ... |
| Complexity | Simple | More complex |
| Use Case | Single factor analysis | Multi-factor analysis |

Simple Linear Regression Example:

# One predictor
model1 <- lm(price ~ area)
# price = β₀ + β₁ * area

Multiple Regression Example:

# Multiple predictors
model2 <- lm(price ~ area + bedrooms + age)
# price = β₀ + β₁*area + β₂*bedrooms + β₃*age

Diagram:

d2 diagram

When to Use:

  • Linear: One clear predictor

  • Multiple: Multiple factors influence outcome


Q178. 🟢 What is Multiple Regression?

[Asked: Dec 2022 | Frequency: 1]

Answer

Multiple Regression extends linear regression to include two or more independent variables to predict a dependent variable.

Formula:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon$$

Example: Predicting house price based on:

  • Area (x₁)

  • Number of bedrooms (x₂)

  • Age of house (x₃)

# Multiple regression in R
model <- lm(price ~ area + bedrooms + age, data = houses)

Advantages:

  • Models real-world complexity

  • Controls for confounding variables

  • Better predictions

Assumptions:

  • No multicollinearity (predictors not highly correlated)

  • Linear relationship with each predictor

  • Independence of observations


Q179. 🟢 Write steps for Multiple Regression in R.

[Asked: Dec 2022 | Frequency: 1]

Answer

Steps for Multiple Regression in R:

# Step 1: Load data
data <- read.csv("housing.csv")

# Step 2: Explore data
head(data)
summary(data)
cor(data)  # Check correlations

# Step 3: Build model
model <- lm(price ~ area + bedrooms + age, data = data)

# Step 4: View summary
summary(model)

# Step 5: Check coefficients
coefficients(model)

# Step 6: Check significance (p-values)
# p < 0.05 means variable is significant

# Step 7: Check R-squared
# Higher = better fit (0 to 1)

# Step 8: Make predictions
new_data <- data.frame(area = 2000, bedrooms = 3, age = 10)
predicted_price <- predict(model, new_data)

# Step 9: Validate model
# Check residuals
plot(model)

# Step 10: Improve if needed
# Remove non-significant variables
model2 <- lm(price ~ area + bedrooms, data = data)


Q180. 🔴 What is Logistic Regression?

[Asked: Dec 2024, Dec 2023, Dec 2022 | Frequency: 3]

Answer

Logistic Regression is a statistical method for binary classification that predicts the probability of an outcome being in a particular category.

Key Characteristics:

| Aspect | Description |
|--------|-------------|
| Output | Probability (0 to 1) |
| Use Case | Binary classification |
| Function | Sigmoid/Logistic |
| Threshold | Usually 0.5 |

Formula:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

Sigmoid Function:

svgbob diagram

Examples:

  • Spam vs Not Spam (email)

  • Disease vs Healthy (medical)

  • Pass vs Fail (education)

  • Buy vs Not Buy (marketing)

Difference from Linear:

| Linear | Logistic |
|--------|----------|
| Continuous output | Probability (0-1) |
| Predicts values | Classifies |
| y = mx + c | y = 1/(1+e^-z) |
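
The sigmoid mapping above can be sketched in a few lines of R (the sigmoid helper below is our own name, not a built-in):

```r
# Sigmoid squashes any real number z into the interval (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)    # 0.5 -- the decision boundary
sigmoid(4)    # ~0.98 -- confident class 1
sigmoid(-4)   # ~0.02 -- confident class 0

# Applying the usual 0.5 threshold
ifelse(sigmoid(c(-2, 0.3, 5)) > 0.5, 1, 0)   # 0 1 1
```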

Q181. 🟢 Give the utility of Logistic Regression.

[Asked: Dec 2023 | Frequency: 1]

Answer

Utility of Logistic Regression:

| Application | Use Case |
|-------------|----------|
| Healthcare | Disease prediction (diabetes, cancer) |
| Finance | Credit risk, fraud detection |
| Marketing | Customer churn prediction |
| Email | Spam classification |
| HR | Employee attrition |
| Education | Student pass/fail prediction |

Why Use Logistic Regression:

  1. Interpretable: Coefficients show feature importance

  2. Probabilistic: Gives confidence in prediction

  3. Efficient: Fast training and prediction

  4. Robust: Works well with smaller datasets

  5. Baseline: Good starting point for classification

Output Interpretation:

  • P > 0.5 → Class 1 (Positive)

  • P ≤ 0.5 → Class 0 (Negative)


Q182. 🔴 How to implement Logistic Regression in R?

[Asked: Dec 2024, Dec 2023, Dec 2022, Jun 2024 | Frequency: 4]

Answer

Logistic Regression in R using glm():

# Step 1: Prepare data
data <- data.frame(
  hours_studied = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  passed = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Step 2: Build logistic model
model <- glm(passed ~ hours_studied, 
             data = data, 
             family = binomial)

# Step 3: View summary
summary(model)

# Step 4: Get coefficients
coefficients(model)

# Step 5: Predict probabilities
new_data <- data.frame(hours_studied = c(3, 5, 8))
probabilities <- predict(model, new_data, type = "response")
print(probabilities)
# probabilities rise with hours studied (from near 0 toward 1)

# Step 6: Convert to class
predicted_class <- ifelse(probabilities > 0.5, 1, 0)

# Step 7: Evaluate accuracy
actual <- c(0, 1, 1)
accuracy <- mean(predicted_class == actual)
print(paste("Accuracy:", accuracy))

# Step 8: Confusion matrix
table(Predicted = predicted_class, Actual = actual)

Key Function:

glm(formula, data, family = binomial)

  • glm() = Generalized Linear Model

  • family = binomial = Logistic regression

  • type = "response" = Get probabilities


UNIT 16: ADVANCED ANALYSIS USING R


Q183. 🟡 What is a Decision Tree?

[Asked: Dec 2022, Jun 2023 | Frequency: 2]

Answer

Decision Tree is a supervised learning algorithm that makes predictions by learning decision rules from data, represented as a tree structure.

Structure:

| Component | Description |
|-----------|-------------|
| Root Node | Top node, first split |
| Internal Node | Decision point |
| Branch | Outcome of decision |
| Leaf Node | Final prediction |

Diagram:

graphviz diagram

Advantages:

  • Easy to understand and interpret

  • Handles both numerical and categorical data

  • No need for feature scaling

  • Visual representation

Disadvantages:

  • Prone to overfitting

  • Unstable (small changes affect tree)

  • Biased toward features with more levels


Q184. 🟡 Write steps for Decision Tree in R.

[Asked: Dec 2022, Dec 2024 | Frequency: 2]

Answer

Decision Tree in R using rpart:

# Step 1: Install and load package
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# Step 2: Prepare data
data <- data.frame(
  Age = c(25, 30, 35, 40, 45, 50, 55, 60),
  Income = c(30, 40, 50, 60, 70, 80, 90, 100),
  Buy = c("No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes")
)

# Step 3: Build decision tree
tree_model <- rpart(Buy ~ Age + Income, 
                    data = data, 
                    method = "class")

# Step 4: View tree
print(tree_model)

# Step 5: Plot tree
rpart.plot(tree_model, main = "Decision Tree")

# Step 6: Make predictions
new_data <- data.frame(Age = 38, Income = 55)
prediction <- predict(tree_model, new_data, type = "class")
print(prediction)

# Step 7: Evaluate
# Using confusion matrix
predicted <- predict(tree_model, data, type = "class")
table(Predicted = predicted, Actual = data$Buy)

Parameters:

| Parameter | Description |
|-----------|-------------|
| method="class" | Classification tree |
| method="anova" | Regression tree |
| cp | Complexity parameter |
| minsplit | Min observations for split |

Q185. 🟢 Explain the role of entropy in decision trees.

[Asked: Jun 2023 | Frequency: 1]

Answer

Entropy measures the impurity or randomness in a dataset, used to decide the best split in decision trees.

Formula:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where $p_i$ = proportion of class i in the set

Interpretation:

  • Entropy = 0: Pure node (all same class)

  • Entropy = 1: Maximum impurity (50-50 split)

Example:

Dataset: 5 Yes, 5 No
p(Yes) = 0.5, p(No) = 0.5
Entropy = -0.5×log₂(0.5) - 0.5×log₂(0.5)
        = -0.5×(-1) - 0.5×(-1)
        = 0.5 + 0.5 = 1.0 (maximum impurity)

Dataset: 8 Yes, 2 No
p(Yes) = 0.8, p(No) = 0.2
Entropy = -0.8×log₂(0.8) - 0.2×log₂(0.2)
        ≈ 0.72 (less impure)

Diagram:

svgbob diagram
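
The worked values above can be reproduced with a short R helper (entropy is our own function name, not a built-in):

```r
# Entropy (base 2) from a vector of class counts
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)   # drop zero classes: 0*log(0) := 0
  -sum(p * log2(p))
}

entropy(c(5, 5))    # 1.0  -- maximum impurity (50-50)
entropy(c(8, 2))    # ~0.72 -- less impure
entropy(c(10, 0))   # 0    -- pure node
```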

Q186. 🟢 Explain the role of information gain in decision trees.

[Asked: Jun 2023 | Frequency: 1]

Answer

Information Gain measures the reduction in entropy after a split, used to select the best attribute.

Formula:

$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)$$

Process:

  1. Calculate entropy of parent node

  2. Calculate weighted average entropy of children

  3. Information Gain = Parent Entropy - Children Entropy

  4. Choose attribute with highest IG

Example:

Parent: 6 Yes, 4 No
Entropy(Parent) = 0.97

After split on "Age":
  - Age ≤ 30: 2 Yes, 3 No → Entropy = 0.97
  - Age > 30: 4 Yes, 1 No → Entropy = 0.72

Weighted Entropy = (5/10)×0.97 + (5/10)×0.72 = 0.845

Information Gain = 0.97 - 0.845 = 0.125

Best Split: Attribute with highest Information Gain
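
The same calculation can be scripted in R — a small self-contained sketch reproducing the numbers in the worked example (rounding aside):

```r
# Entropy (base 2) from class counts
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

parent   <- c(6, 4)                    # 6 Yes, 4 No
children <- list(c(2, 3), c(4, 1))     # after the split on "Age"

weights  <- sapply(children, sum) / sum(parent)        # 0.5 0.5
weighted <- sum(weights * sapply(children, entropy))   # ~0.846
ig <- entropy(parent) - weighted                       # ~0.125
```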


Q187. 🟢 What are categorical and continuous variables?

[Asked: Jun 2023 | Frequency: 1]

Answer

Categorical Variables:

  • Discrete categories or groups

  • No numerical meaning

  • Examples: Gender, Color, Product Type

Continuous Variables:

  • Numerical values in a range

  • Can take any value

  • Examples: Age, Income, Temperature

Comparison:

| Aspect | Categorical | Continuous |
|--------|-------------|------------|
| Values | Finite set | Infinite range |
| Type | Qualitative | Quantitative |
| Example | Low/Medium/High | 23.5, 45.2, 67.8 |
| Statistics | Mode, frequency | Mean, std dev |

In R:

# Categorical (Factor)
gender <- factor(c("Male", "Female", "Male"))

# Continuous (Numeric)
age <- c(25.5, 30.2, 45.8)

# Check types
is.factor(gender)   # TRUE
is.numeric(age)     # TRUE


Q188. 🟡 Explain Partitioning and Pruning in Decision Trees.

[Asked: Dec 2023, Jun 2025 | Frequency: 2]

Answer

Partitioning (Splitting): The process of dividing data at each node based on a feature.

Pruning: The process of removing branches to prevent overfitting.

Comparison:

| Aspect | Partitioning | Pruning |
|--------|--------------|---------|
| Phase | Tree building | Tree optimization |
| Goal | Create splits | Remove branches |
| Effect | Grows tree | Shrinks tree |
| Prevents | Underfitting | Overfitting |

Types of Pruning:

| Type | When | Description |
|------|------|-------------|
| Pre-pruning | During growth | Stop early (max depth) |
| Post-pruning | After growth | Remove weak branches |

Diagram:

plantuml diagram

In R:

# Control pruning with cp (complexity parameter)
tree <- rpart(y ~ x, data, cp = 0.01)

# Prune to optimal cp
pruned_tree <- prune(tree, cp = 0.05)


Q189. 🟢 What is a Random Forest?

[Asked: Dec 2023 | Frequency: 1]

Answer

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions.

Key Concepts:

| Concept | Description |
|---------|-------------|
| Ensemble | Multiple models combined |
| Bagging | Bootstrap sampling of data |
| Feature Randomness | Random subset of features per tree |
| Voting | Classification by majority vote |
| Averaging | Regression by average |

Diagram:

d2 diagram

Advantages:

  • Reduces overfitting

  • Handles high-dimensional data

  • Works with missing values

  • Provides feature importance


Q190. 🟢 How does Random Forest differ from Decision Tree?

[Asked: Dec 2023 | Frequency: 1]

Answer

Comparison:

| Aspect | Decision Tree | Random Forest |
|--------|---------------|---------------|
| Number | Single tree | Many trees (forest) |
| Data | Full dataset | Bootstrap samples |
| Features | All features | Random subset |
| Overfitting | High risk | Lower risk |
| Accuracy | Lower | Higher |
| Interpretability | Easy | Harder |
| Speed | Faster | Slower |

Diagram:

d2 diagram

When to Use:

  • Decision Tree: Interpretability needed, small data

  • Random Forest: Accuracy matters, large data


Q191. 🔴 Explain Random Forest algorithm in R.

[Asked: Dec 2024, Jun 2024 | Frequency: 3]

Answer

Random Forest in R:

# Step 1: Install and load package
install.packages("randomForest")
library(randomForest)

# Step 2: Prepare data
data(iris)  # Example dataset

# Step 3: Split data
set.seed(123)
train_idx <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

# Step 4: Build Random Forest model
rf_model <- randomForest(Species ~ ., 
                         data = train_data, 
                         ntree = 100,
                         mtry = 2)

# Step 5: View model
print(rf_model)

# Step 6: Feature importance
importance(rf_model)
varImpPlot(rf_model)

# Step 7: Predict
predictions <- predict(rf_model, test_data)

# Step 8: Evaluate
confusion_matrix <- table(predictions, test_data$Species)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))

Parameters:

| Parameter | Description |
|-----------|-------------|
| ntree | Number of trees |
| mtry | Features per split |
| importance | Calculate importance |
| nodesize | Minimum node size |

Q192. 🟡 What is Clustering?

[Asked: Jun 2022, Dec 2023 | Frequency: 2]

Answer

Clustering is an unsupervised learning technique that groups similar data points together without predefined labels.

Types of Clustering:

| Type | Algorithm | Description |
|------|-----------|-------------|
| Partitioning | K-Means | Divide into k clusters |
| Hierarchical | Agglomerative | Build tree of clusters |
| Density-based | DBSCAN | Group by density |
| Model-based | GMM | Probabilistic models |

K-Means Process:

svgbob diagram

Applications:

  • Customer segmentation

  • Image compression

  • Anomaly detection

  • Document clustering


Q193. 🟡 Write steps for K-Means Clustering in R.

[Asked: Jun 2022, Jun 2024 | Frequency: 2]

Answer

K-Means Clustering in R:

# Step 1: Prepare data
data(iris)
# Use only numeric columns
data <- iris[, 1:4]

# Step 2: Scale data (important for K-Means)
data_scaled <- scale(data)

# Step 3: Determine optimal k (Elbow method)
wss <- sapply(1:10, function(k) {
  kmeans(data_scaled, k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within-cluster SS")

# Step 4: Apply K-Means
set.seed(123)
kmeans_result <- kmeans(data_scaled, centers = 3, nstart = 25)

# Step 5: View results
print(kmeans_result$cluster)     # Cluster assignments
print(kmeans_result$centers)     # Cluster centroids
print(kmeans_result$size)        # Cluster sizes

# Step 6: Visualize clusters
library(cluster)
clusplot(data_scaled, kmeans_result$cluster, 
         color = TRUE, shade = TRUE)

# Step 7: Add cluster to data
iris$Cluster <- kmeans_result$cluster

# Step 8: Compare with actual species
table(iris$Cluster, iris$Species)


Q194. 🟢 What is Confusion Matrix?

[Asked: Jun 2022 | Frequency: 1]

Answer

Confusion Matrix is a table showing the performance of a classification model by comparing predicted vs actual values.

Structure (Binary Classification):

| | Predicted Positive | Predicted Negative |
|---|--------------------|--------------------|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |

Metrics:

| Metric | Formula | Description |
|--------|---------|-------------|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correctness |
| Precision | TP/(TP+FP) | Positive predictive value |
| Recall | TP/(TP+FN) | Sensitivity |
| F1 Score | 2×(P×R)/(P+R) | Harmonic mean |

Example in R:

actual <- c(1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Confusion matrix
table(Predicted = predicted, Actual = actual)
#          Actual
# Predicted 0 1
#         0 3 1
#         1 1 3

# TP=3, TN=3, FP=1, FN=1
# Accuracy = (3+3)/8 = 75%
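
Continuing the same example, the metrics in the table follow directly from the four counts:

```r
actual    <- c(1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Count the four confusion-matrix cells
TP <- sum(predicted == 1 & actual == 1)   # 3
TN <- sum(predicted == 0 & actual == 0)   # 3
FP <- sum(predicted == 1 & actual == 0)   # 1
FN <- sum(predicted == 0 & actual == 1)   # 1

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                # 0.75
precision <- TP / (TP + FP)                                 # 0.75
recall    <- TP / (TP + FN)                                 # 0.75
f1        <- 2 * precision * recall / (precision + recall)  # 0.75
```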


Q195. 🟢 Define Classification.

[Asked: Jun 2022 | Frequency: 1]

Answer

Classification is a supervised learning task that assigns predefined labels to data based on training examples.

Characteristics:

| Aspect | Description |
|--------|-------------|
| Input | Features (predictors) |
| Output | Discrete class label |
| Learning | Supervised (labeled data) |
| Examples | Spam/Not Spam, Disease/Healthy |

Common Algorithms:

| Algorithm | Type |
|-----------|------|
| Logistic Regression | Linear |
| Decision Tree | Tree-based |
| Random Forest | Ensemble |
| SVM | Kernel-based |
| k-NN | Instance-based |
| Naive Bayes | Probabilistic |

Process:

svgbob diagram

Q196. 🟢 Write steps for Classification in R.

[Asked: Jun 2022 | Frequency: 1]

Answer

Classification Steps in R:

# Step 1: Load data
data(iris)

# Step 2: Split into train/test
set.seed(123)
train_idx <- sample(1:nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Step 3: Build classifier (using Random Forest)
library(randomForest)
model <- randomForest(Species ~ ., data = train)

# Step 4: Predict on test data
predictions <- predict(model, test)

# Step 5: Evaluate with confusion matrix
conf_matrix <- table(Predicted = predictions, 
                     Actual = test$Species)
print(conf_matrix)

# Step 6: Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))

# Step 7: Other metrics
library(caret)
confusionMatrix(predictions, test$Species)


Q197. 🟢 Write short note on Support Vector Machines.

[Asked: Jun 2025 | Frequency: 1]

Answer

Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal hyperplane to separate classes.

Key Concepts:

| Concept | Description |
|---------|-------------|
| Hyperplane | Decision boundary |
| Support Vectors | Points closest to boundary |
| Margin | Distance between classes |
| Kernel | Transform non-linear data |

Diagram:

svgbob diagram

Kernels:

  • Linear: For linearly separable data

  • RBF: For non-linear data

  • Polynomial: Higher-dimensional mapping

SVM in R:

library(e1071)

# Train SVM
model <- svm(Species ~ ., data = train, kernel = "radial")

# Predict
predictions <- predict(model, test)

# Accuracy
mean(predictions == test$Species)


Q198. 🟡 What is Time Series Analysis?

[Asked: Jun 2025, Dec 2023 | Frequency: 2]

Answer

Time Series Analysis is the study of data points collected over time to identify patterns, trends, and make forecasts.

Components:

| Component | Description |
|-----------|-------------|
| Trend | Long-term increase/decrease |
| Seasonality | Regular periodic patterns |
| Cyclic | Non-fixed period fluctuations |
| Noise | Random variations |

Diagram:

svgbob diagram

Applications:

  • Stock price prediction

  • Weather forecasting

  • Sales forecasting

  • Economic indicators

Common Models:

  • ARIMA (AutoRegressive Integrated Moving Average)

  • Exponential Smoothing

  • LSTM (Deep Learning)


Q199. 🟢 Write steps for Time Series Analysis in R.

[Asked: Jun 2024 | Frequency: 1]

Answer

Time Series Analysis in R:

# Step 1: Create time series object
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 
          136, 119, 104, 118, 115, 126, 141, 135)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)

# Step 2: Plot time series
plot(ts_data, main = "Monthly Sales", 
     xlab = "Time", ylab = "Sales")

# Step 3: Decompose into components
decomposed <- decompose(ts_data)
plot(decomposed)

# Step 4: Check stationarity
library(tseries)
adf.test(ts_data)

# Step 5: Build ARIMA model
library(forecast)
model <- auto.arima(ts_data)
summary(model)

# Step 6: Forecast
forecast_result <- forecast(model, h = 6)  # 6 periods ahead
plot(forecast_result)

# Step 7: Evaluate accuracy
accuracy(model)

Key Functions:

| Function | Purpose |
|----------|---------|
| ts() | Create time series |
| decompose() | Extract components |
| auto.arima() | Automatic ARIMA |
| forecast() | Future predictions |

Q200. 🟢 Write short note on Association Rules.

[Asked: Implied from syllabus | Frequency: 1]

Answer

Association Rules discover relationships between items in transactional datasets (Market Basket Analysis).

Key Metrics:

| Metric | Formula | Description |
|--------|---------|-------------|
| Support | P(A∩B) | Frequency of itemset |
| Confidence | P(B\|A) | Conditional probability |
| Lift | Confidence/P(B) | Strength of rule |

Example Rule:

{Bread, Butter} → {Milk}
Support = 30%  (30% of transactions have all three)
Confidence = 80%  (80% of Bread+Butter buyers also buy Milk)
Lift = 2.5  (2.5x more likely than random)
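
These metrics come straight from transaction counts — a toy sketch with hypothetical counts chosen to match the example rule above:

```r
n_total        <- 200   # total transactions (hypothetical)
n_bread_butter <- 75    # transactions containing Bread and Butter
n_all_three    <- 60    # transactions containing Bread, Butter and Milk
n_milk         <- 64    # transactions containing Milk

support    <- n_all_three / n_total             # 0.30
confidence <- n_all_three / n_bread_butter      # 0.80
lift       <- confidence / (n_milk / n_total)   # 2.5
```

A lift above 1 means the items co-occur more often than independence would predict.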

Apriori Algorithm in R:

library(arules)

# Load transaction data
data("Groceries")

# Generate rules
rules <- apriori(Groceries, 
                 parameter = list(support = 0.01, 
                                  confidence = 0.5))

# View top rules
inspect(head(sort(rules, by = "lift"), 10))

# Visualize
library(arulesViz)
plot(rules, method = "graph")


Q201. 🟢 Explain the role of pruning in decision trees.

[Asked: Dec 2023 | Frequency: 1]

Answer

Pruning is the process of removing branches from a fully grown decision tree to reduce overfitting and improve generalization.

Types of Pruning:

| Type | When Applied | Description |
|------|--------------|-------------|
| Pre-pruning | During building | Stop growth early (max depth, min samples) |
| Post-pruning | After building | Remove weak branches from full tree |

Why Pruning is Needed:

| Without Pruning | With Pruning |
|-----------------|--------------|
| Overfits training data | Better generalization |
| Complex tree | Simpler tree |
| Memorizes noise | Captures patterns |
| Poor test accuracy | Better test accuracy |

Cost-Complexity Pruning (in R):

# Build full tree
full_tree <- rpart(y ~ ., data = train, cp = 0)

# Find optimal cp
printcp(full_tree)
plotcp(full_tree)

# Prune tree
optimal_cp <- full_tree$cptable[which.min(full_tree$cptable[,"xerror"]),"CP"]
pruned_tree <- prune(full_tree, cp = optimal_cp)


Q202. 🟢 Explain the role of tree selection process in decision trees.

[Asked: Dec 2023 | Frequency: 1]

Answer

Tree Selection Process involves choosing the best tree from multiple candidates based on validation performance.

Steps:

| Step | Description |
|------|-------------|
| 1 | Build multiple trees with different parameters |
| 2 | Evaluate each on validation set |
| 3 | Select tree with best performance |
| 4 | Test on held-out test set |

Selection Criteria:

| Criterion | Description |
|-----------|-------------|
| Accuracy | Classification correctness |
| AUC-ROC | Discrimination ability |
| Cross-validation error | Average across folds |
| Complexity | Prefer simpler trees |

Process:

svgbob diagram

Q203. 🟢 How do categorical and continuous variables relate to decision trees?

[Asked: Jun 2023 | Frequency: 1]

Answer

Decision trees handle both categorical and continuous variables differently:

Categorical Variables:

  • Splits based on category membership

  • Binary split: One category vs rest

  • Multi-way split: Each category gets branch

Continuous Variables:

  • Splits based on threshold values

  • Binary split: ≤ threshold vs > threshold

  • Find best threshold by trying all values

Comparison:

| Aspect | Categorical | Continuous |
|--------|-------------|------------|
| Split Type | Category groups | Threshold |
| Question | "Is color = Red?" | "Is age ≤ 30?" |
| Finding Split | Try category combos | Try all thresholds |
| Encoding | Not needed | Not needed |

Example Tree:

graphviz diagram

In this tree:

  • Age is continuous (threshold split)

  • Education is categorical (category split)


Q204. 🟢 What is continuous variable?

[Asked: Jun 2023 | Frequency: 1]

Answer

Continuous Variable is a numerical variable that can take any value within a range, including decimals.

Characteristics:

| Characteristic | Description |
|----------------|-------------|
| Infinite values | Any value in range possible |
| Measurable | Can be measured precisely |
| Ordered | Has natural ordering |
| Arithmetic | Math operations meaningful |

Examples:

| Variable | Possible Values |
|----------|-----------------|
| Height | 150.5 cm, 175.23 cm |
| Temperature | 36.5°C, 98.6°F |
| Salary | $50,000.00 |
| Time | 2.5 hours |
| Distance | 10.75 km |

Continuous vs Discrete:

| Continuous | Discrete |
|------------|----------|
| Any value | Specific values only |
| Measured | Counted |
| Decimals possible | Usually integers |
| Temperature, weight | Number of children |

In R:

# Continuous
age <- 25.5
temperature <- 98.6
is.numeric(age)  # TRUE

# Summary statistics apply
mean(c(25.5, 30.2, 28.7))  # 28.13


Q205. 🟢 Write short note on Association Rules using R.

[Asked: Jun 2024 | Frequency: 1]

Answer

Association Rules in R using arules package:

Installation:

install.packages("arules")
install.packages("arulesViz")
library(arules)
library(arulesViz)

Steps:

# Step 1: Load data
data("Groceries")  # Built-in transaction data

# Step 2: Explore data
summary(Groceries)
itemFrequencyPlot(Groceries, topN = 10)

# Step 3: Generate rules using Apriori
rules <- apriori(Groceries, 
                 parameter = list(
                   support = 0.001,
                   confidence = 0.5,
                   minlen = 2
                 ))

# Step 4: View rules
summary(rules)
inspect(head(rules, 10))

# Step 5: Sort by metrics
rules_sorted <- sort(rules, by = "lift")
inspect(head(rules_sorted, 5))

# Step 6: Visualize
plot(rules, method = "scatter")
plot(rules[1:20], method = "graph")

Key Parameters:

| Parameter | Description |
|-----------|-------------|
| support | Min frequency of itemset |
| confidence | Min conditional probability |
| minlen | Minimum items in rule |
| maxlen | Maximum items in rule |

Q206. 🟢 What is the purpose of Central Limit Theorem?

[Asked: From Book Chapter 2 | Frequency: 1]

Answer

Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population distribution.

Statement: For samples of size n from a population with mean μ and standard deviation σ:

  • Sample means follow N(μ, σ/√n) as n → ∞

Purpose:

| Purpose | Description |
|---------|-------------|
| Inference | Make conclusions about population |
| Hypothesis Testing | Use normal distribution for tests |
| Confidence Intervals | Calculate error bounds |
| Estimation | Estimate population parameters |

Diagram:

svgbob diagram

Importance:

  • Works for any population distribution

  • Larger n → better approximation

  • n ≥ 30 usually sufficient

  • Foundation for statistical inference
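
A quick simulation sketch illustrates the theorem: even for a strongly skewed exponential population, the means of repeated samples pile up in a bell shape around μ with spread σ/√n (the sample size, repetition count, and seed below are arbitrary choices):

```r
set.seed(42)
n    <- 50      # observations per sample
reps <- 5000    # number of repeated samples

# Exponential(rate = 1) population: mean 1, sd 1, heavily right-skewed
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141

hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 50)")
```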


END OF MCS-226 COMPREHENSIVE ANSWER BOOK


QUICK REVISION SUMMARY

Key Formulas

| Topic | Formula |
|-------|---------|
| Jaccard Similarity | \|A∩B\| / \|A∪B\| |
| Euclidean Distance | √Σ(xᵢ-yᵢ)² |
| Cosine Similarity | A·B / (\|A\|×\|B\|) |
| PageRank | (1-d)/N + d×Σ(PR(q)/L(q)) |
| Entropy | -Σpᵢ×log₂(pᵢ) |
| Information Gain | Entropy(parent) - Weighted Entropy(children) |

Important R Functions

| Task | Function |
|------|----------|
| Linear Regression | lm() |
| Logistic Regression | glm(family=binomial) |
| Decision Tree | rpart() |
| Random Forest | randomForest() |
| K-Means | kmeans() |
| Time Series | ts(), arima() |

Big Data Technologies

| Technology | Purpose |
|------------|---------|
| Hadoop | Distributed storage & processing |
| MapReduce | Parallel computation paradigm |
| Spark | Fast in-memory processing |
| Hive | SQL on Hadoop |
| HBase | NoSQL column store |
| NoSQL | Flexible, scalable databases |

Best of Luck for Your Exam! 🎓