MCS-226: Data Science and Big Data
This content is optimized for web viewing.
For the PDF Version and Micro Version (Exam Notes), please contact:
Suraj
WhatsApp: +91 86389 03328
https://wa.me/918638903328
MCS-226: DATA SCIENCE AND BIG DATA
Complete Exam Answer Guide
Course Code: MCS-226
Programme: MCA (Master of Computer Applications)
University: Indira Gandhi National Open University (IGNOU)
Block Coverage: Block 1-4 (Units 1-16)
Exam Sessions Covered: June 2022 - June 2025
Total Questions: 205 Unified Question Families
Importance Legend
| Symbol | Meaning | Frequency |
|---|---|---|
| 🔴 | Most Important | Asked 4+ times |
| 🟡 | Very Important | Asked 2-3 times |
| 🟢 | Important | Asked 1 time |
UNIT 1: DATA SCIENCE - INTRODUCTION
Q1. 🟡 What is Data Science? Define Data Science and explain it with the help of its applications.
[Asked: Jun 2023, Jun 2022, Dec 2024 | Frequency: 3]
Definition of Data Science: Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract meaningful knowledge and insights from structured and unstructured data. It combines expertise from statistics, mathematics, computer science, and domain knowledge to analyze complex data and solve real-world problems.
Key Components of Data Science:
| Component | Description |
|---|---|
| Statistics | Foundation for data analysis and inference |
| Machine Learning | Algorithms for pattern recognition and prediction |
| Data Engineering | Data collection, storage, and processing |
| Domain Expertise | Industry-specific knowledge application |
| Visualization | Presenting insights in understandable formats |
Applications of Data Science:
- Healthcare: Disease prediction, drug discovery, patient outcome analysis
- Finance: Fraud detection, risk assessment, algorithmic trading
- E-commerce: Recommendation systems, customer segmentation, demand forecasting
- Transportation: Route optimization, autonomous vehicles, traffic prediction
- Social Media: Sentiment analysis, content recommendation, trend detection
- Manufacturing: Predictive maintenance, quality control, supply chain optimization
Q2. 🟡 What are the applications/advantages of Data Science in an organization?
[Asked: Jun 2022, Jun 2023 | Frequency: 2]
Advantages of Data Science in Organizations:
| Advantage | Description |
|---|---|
| Informed Decision Making | Data-driven insights replace guesswork |
| Predictive Capabilities | Forecast trends and customer behavior |
| Cost Reduction | Identify inefficiencies and optimize operations |
| Competitive Advantage | Leverage data for market differentiation |
| Customer Understanding | Deep insights into preferences and needs |
| Risk Management | Early detection of potential issues |
| Process Automation | Automate repetitive analytical tasks |
Key Applications:
- Marketing Optimization: Target the right customers with personalized campaigns
- Product Development: Data-driven feature prioritization
- Operational Efficiency: Streamline processes using analytics
- Revenue Growth: Identify new revenue opportunities
- Human Resources: Talent acquisition and retention analytics
- Supply Chain: Demand forecasting and inventory optimization
Q3. 🟡 What are the different types of data in Data Science? Briefly explain each type.
[Asked: Jun 2025 | Frequency: 2]
Types of Data in Data Science:
1. Based on Structure:
| Type | Description | Examples |
|---|---|---|
| Structured | Organized in predefined format (rows/columns) | Databases, spreadsheets, SQL tables |
| Semi-Structured | Partially organized with tags/markers | JSON, XML, HTML, emails |
| Unstructured | No predefined format | Images, videos, audio, social media posts |
2. Based on Nature:
| Type | Description | Examples |
|---|---|---|
| Qualitative | Descriptive, non-numeric | Colors, names, categories |
| Quantitative | Numeric, measurable | Age, salary, temperature |
3. Data Streams:
- Continuous flow of data generated in real-time
- Examples: Stock market feeds, sensor data, social media streams
Q4. 🟢 What is Structured Data? Explain with suitable example.
[Asked: Dec 2023 | Frequency: 1]
Structured Data is highly organized data that follows a predefined schema and can be easily stored in relational databases with rows and columns.
Characteristics:
- Follows a strict data model
- Easily searchable using SQL queries
- Stored in RDBMS (MySQL, PostgreSQL, Oracle)
- Consistent format across records
Example - Employee Database:
| EmpID | Name | Department | Salary | JoinDate |
|---|---|---|---|---|
| 101 | John | IT | 50000 | 2020-01-15 |
| 102 | Mary | HR | 45000 | 2019-06-20 |
| 103 | Alex | Finance | 55000 | 2021-03-10 |
Query Example:
SELECT Name, Salary FROM Employees WHERE Department = 'IT';
Q5. 🟢 Discuss how structured data is different from semi-structured data.
[Asked: Dec 2024 | Frequency: 1]
| Aspect | Structured Data | Semi-Structured Data |
|---|---|---|
| Schema | Strict, predefined schema | Flexible, self-describing |
| Format | Tables with rows/columns | Tags, markers, hierarchies |
| Storage | RDBMS | NoSQL, document stores |
| Examples | SQL databases, spreadsheets | JSON, XML, HTML |
| Query Language | SQL | XPath, JSONPath |
| Flexibility | Low - schema changes are complex | High - easy to modify |
| Analysis | Easy with traditional tools | Requires parsing |
Q6. 🟡 What is Semi-structured data? Explain with suitable example.
[Asked: Dec 2023, Dec 2022 | Frequency: 2]
Semi-structured Data is data that doesn't conform to rigid tabular structure but contains tags, markers, or other elements to separate semantic elements and enforce hierarchies.
Characteristics:
- Self-describing with tags/markers
- Flexible schema
- Hierarchical organization
- Stored in NoSQL databases
Examples:
1. JSON Format:
{
  "student": {
    "id": "S001",
    "name": "Rahul Kumar",
    "courses": ["MCS-226", "MCS-221"],
    "grades": {
      "MCS-226": "A",
      "MCS-221": "B+"
    }
  }
}
2. XML Format:
<student>
  <id>S001</id>
  <name>Rahul Kumar</name>
  <courses>
    <course>MCS-226</course>
    <course>MCS-221</course>
  </courses>
</student>
Use Cases: Web APIs, configuration files, log files, email data
Q7. 🟡 What is Unstructured data? Explain with suitable example.
[Asked: Dec 2023, Dec 2022 | Frequency: 2]
Unstructured Data is data that has no predefined format or organization, making it difficult to store in traditional databases.
Characteristics:
- No predefined data model
- Difficult to search and analyze with traditional methods
- Requires specialized tools for processing
- Constitutes ~80% of all enterprise data
Examples:
| Category | Examples |
|---|---|
| Text | Emails, documents, social media posts |
| Multimedia | Images, videos, audio files |
| Web Content | HTML pages, blogs, forums |
| Sensor Data | IoT device readings |
Processing Methods: Natural Language Processing (NLP), Computer Vision, Deep Learning
Q8. 🟢 What is Qualitative data? Explain with example.
[Asked: Dec 2022 | Frequency: 1]
Qualitative Data (also called Categorical Data) represents characteristics or qualities that cannot be measured numerically but can be categorized.
Types:
| Type | Description | Example |
|---|---|---|
| Nominal | Categories without order | Gender (Male/Female), Blood Type (A, B, O, AB) |
| Ordinal | Categories with meaningful order | Education Level (High School < Bachelor < Master < PhD) |
Characteristics:
- Non-numeric in nature
- Describes attributes or properties
- Can be counted but not measured
- Analyzed using mode, frequency distribution
Examples:
- Eye color: Blue, Brown, Green
- Customer satisfaction: Poor, Average, Good, Excellent
- Product categories: Electronics, Clothing, Food
Q9. 🟡 What is Quantitative data? Explain with example.
[Asked: Dec 2022, Jun 2022 | Frequency: 2]
Quantitative Data represents numerical values that can be measured and expressed using numbers.
Types:
| Type | Description | Example |
|---|---|---|
| Discrete | Countable, whole numbers | Number of students (25, 30, 45) |
| Continuous | Any value in a range | Height (5.5 ft), Temperature (36.7°C) |
Characteristics:
- Numeric in nature
- Can be measured precisely
- Supports mathematical operations
- Analyzed using mean, median, standard deviation
Examples:
- Age: 25, 30, 45 years
- Salary: ₹50,000, ₹75,000
- Temperature: 25.5°C, 30.2°C
- Distance: 100.5 km
Comparison with Qualitative:
| Aspect | Qualitative | Quantitative |
|---|---|---|
| Nature | Descriptive | Numerical |
| Measurement | Categories | Exact values |
| Analysis | Frequency, Mode | Mean, Std Dev |
| Examples | Colors, Grades | Age, Salary |
Q10. 🟢 Compare qualitative data with quantitative data.
[Asked: Jun 2023 | Frequency: 1]
| Aspect | Qualitative Data | Quantitative Data |
|---|---|---|
| Definition | Describes qualities/characteristics | Describes quantities/amounts |
| Nature | Non-numeric, categorical | Numeric, measurable |
| Types | Nominal, Ordinal | Discrete, Continuous |
| Examples | Gender, Color, Opinion | Age, Height, Income |
| Collection Methods | Surveys, Interviews, Observations | Measurements, Counts, Experiments |
| Analysis Techniques | Thematic analysis, Content analysis | Statistical analysis, Regression |
| Central Tendency | Mode | Mean, Median, Mode |
| Visualization | Pie charts, Bar graphs | Histograms, Line graphs, Scatter plots |
| Flexibility | Subjective interpretation | Objective measurement |
| Sample Size | Usually smaller | Usually larger |
Q11. 🟢 What is categorical data? Explain with example.
[Asked: Jun 2022 | Frequency: 1]
Categorical Data represents data that can be divided into distinct groups or categories. It is a type of qualitative data.
Types of Categorical Data:
- Nominal Data: Categories without inherent order
  - Examples: Blood type (A, B, AB, O), Country names, Colors
- Ordinal Data: Categories with meaningful order
  - Examples: Education level, Customer rating (1-5 stars)
Example Dataset:
| Student | Gender | Grade | City |
|---|---|---|---|
| A | Male | A | Delhi |
| B | Female | B | Mumbai |
| C | Male | A | Chennai |
Here, Gender, Grade, and City are all categorical variables.
Analysis Methods:
- Frequency distribution
- Mode calculation
- Chi-square test
- Bar charts and pie charts
Q12. 🟢 What is Measurement Scale of Data? What do you understand by this term?
[Asked: Jun 2023 | Frequency: 1]
Measurement Scale refers to the classification system used to categorize and quantify data based on the nature of information it represents and the mathematical operations that can be performed on it.
Purpose:
-
Determines appropriate statistical analysis
-
Guides data collection methodology
-
Defines mathematical operations possible
-
Helps in choosing visualization techniques
Q13. 🟡 Explain the characteristics of measurement scales of data.
[Asked: Jun 2023, Dec 2022 | Frequency: 2]
Four Measurement Scales (NOIR):
| Scale | Characteristics | Operations | Examples |
|---|---|---|---|
| Nominal | Categories without order | =, ≠ | Gender, Blood Type, City |
| Ordinal | Categories with order | =, ≠, <, > | Grades, Rankings, Ratings |
| Interval | Equal intervals, no true zero | +, - | Temperature (°C), IQ Scores |
| Ratio | Equal intervals, true zero | +, -, ×, ÷ | Height, Weight, Age, Income |
Detailed Characteristics:
1. Nominal Scale:
- Classifies data into mutually exclusive categories
- No ranking or ordering
- Mode is the only measure of central tendency
2. Ordinal Scale:
- Categories have meaningful order
- Differences between values are not uniform
- Median can be calculated
3. Interval Scale:
- Equal distances between values
- No absolute zero point
- Mean, median, mode all applicable
4. Ratio Scale:
- Has true zero (absence of attribute)
- All mathematical operations valid
- Most informative scale
Q14. 🟡 List and define various measurement scales of data with suitable examples.
[Asked: Jun 2023, Dec 2022 | Frequency: 2]
1. Nominal Scale:
- Definition: Classification without any order
- Examples:
  - Gender: Male, Female
  - Marital Status: Single, Married, Divorced
  - Blood Group: A, B, AB, O
2. Ordinal Scale:
- Definition: Classification with meaningful order but unequal intervals
- Examples:
  - Education: Primary < Secondary < Graduate < Postgraduate
  - Satisfaction: Very Dissatisfied < Dissatisfied < Neutral < Satisfied < Very Satisfied
  - Military Ranks: Private < Corporal < Sergeant < Lieutenant
3. Interval Scale:
- Definition: Ordered with equal intervals but no true zero
- Examples:
  - Temperature in Celsius: 0°C doesn't mean no temperature
  - Calendar Years: Year 0 is not "the beginning of time"
  - IQ Scores: 0 IQ doesn't mean no intelligence
4. Ratio Scale:
- Definition: Ordered with equal intervals and absolute zero
- Examples:
  - Height: 0 cm means no height
  - Weight: 0 kg means no weight
  - Income: ₹0 means no income
  - Age: 0 years means just born
Summary Table:
| Scale | Order | Equal Interval | True Zero | Example |
|---|---|---|---|---|
| Nominal | ✗ | ✗ | ✗ | Colors |
| Ordinal | ✓ | ✗ | ✗ | Rankings |
| Interval | ✓ | ✓ | ✗ | Temperature |
| Ratio | ✓ | ✓ | ✓ | Weight |
Q15. 🟢 What is Descriptive Analysis? Explain.
[Asked: Jun 2024 | Frequency: 1]
Descriptive Analysis is a statistical method that summarizes and describes the main features of a dataset, providing simple summaries about the sample and measures.
Key Components:
| Component | Description | Examples |
|---|---|---|
| Central Tendency | Average/typical value | Mean, Median, Mode |
| Dispersion | Spread of data | Range, Variance, Std Dev |
| Distribution | Shape of data | Skewness, Kurtosis |
| Position | Relative standing | Percentiles, Quartiles |
Techniques Used:
- Numerical Summaries: Mean, median, mode, standard deviation
- Graphical Representations: Histograms, bar charts, pie charts, box plots
- Frequency Tables: Count and percentage distributions
Example: For exam scores: [75, 80, 85, 90, 95]
- Mean = 85
- Median = 85
- Range = 20
- Standard Deviation = 7.07
Purpose: Understand "what happened" in the data without making predictions or inferences.
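The summary measures above can be reproduced with Python's standard `statistics` module (a minimal sketch using the same scores):

```python
import statistics

scores = [75, 80, 85, 90, 95]

mean = statistics.mean(scores)            # 85
median = statistics.median(scores)        # 85
data_range = max(scores) - min(scores)    # 20
# pstdev = population standard deviation (divides by N, matching the 7.07 above)
std_dev = statistics.pstdev(scores)

print(mean, median, data_range, round(std_dev, 2))
```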
Q16. 🟢 What is Exploratory Analysis? Explain.
[Asked: Jun 2024 | Frequency: 1]
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods, to discover patterns, spot anomalies, and check assumptions.
Key Objectives:
- Understand data structure and content
- Detect outliers and anomalies
- Identify patterns and relationships
- Generate hypotheses for further testing
- Check assumptions for statistical models
Techniques:
| Technique | Purpose |
|---|---|
| Summary Statistics | Understand central tendency and spread |
| Visualization | Identify patterns visually |
| Correlation Analysis | Find relationships between variables |
| Missing Value Analysis | Identify data quality issues |
| Outlier Detection | Find unusual observations |
Common Visualizations:
- Histograms and density plots
- Scatter plots and pair plots
- Box plots
- Heat maps (correlation matrix)
Difference from Descriptive Analysis:
- More visual and interactive
- Focuses on discovery rather than just summarization
- May involve transformations and feature engineering
Q17. 🟢 What is Inferential Analysis? Explain.
[Asked: Jun 2024 | Frequency: 1]
Inferential Analysis uses sample data to make generalizations, predictions, or decisions about a larger population.
Key Concepts:
| Concept | Description |
|---|---|
| Population | Entire group of interest |
| Sample | Subset of population |
| Hypothesis Testing | Testing assumptions about population |
| Confidence Intervals | Range of plausible values |
| p-value | Probability of results if null hypothesis true |
Common Techniques:
- Hypothesis Testing: t-test, chi-square test, ANOVA
- Confidence Intervals: Estimating population parameters
- Regression Analysis: Predicting relationships
- Correlation Analysis: Measuring association strength
Example:
- Sample: 100 students' exam scores
- Inference: Average score of all MCA students is between 70 and 80 with 95% confidence
Q18. 🟡 What is Predictive Analysis? Explain.
[Asked: Jun 2024, Jun 2023 | Frequency: 2]
Predictive Analysis uses historical data, statistical algorithms, and machine learning techniques to forecast future outcomes.
Key Components:
| Component | Description |
|---|---|
| Historical Data | Past observations used for training |
| Statistical Algorithms | Regression, time series |
| Machine Learning | Classification, clustering, neural networks |
| Validation | Testing model accuracy |
Common Techniques:
- Regression Models: Linear, logistic, polynomial
- Classification: Decision trees, random forests, SVM
- Time Series: ARIMA, exponential smoothing
- Neural Networks: Deep learning models
Applications:
| Domain | Application |
|---|---|
| Finance | Credit scoring, fraud detection |
| Healthcare | Disease prediction, patient outcomes |
| Retail | Demand forecasting, churn prediction |
| Marketing | Customer lifetime value, response prediction |
Q19. 🟢 Define the different methods for collecting, analysing and interpreting numerical information.
[Asked: Jun 2024 | Frequency: 1]
Methods for Numerical Data:
1. Data Collection Methods:
| Method | Description | Example |
|---|---|---|
| Surveys | Questionnaires with numeric responses | Rating scales 1-10 |
| Experiments | Controlled data collection | Lab measurements |
| Observations | Recording numerical events | Traffic counts |
| Secondary Sources | Existing databases | Census data, financial reports |
| Sensors/IoT | Automated collection | Temperature, pressure readings |
2. Analysis Methods:
| Type | Techniques |
|---|---|
| Descriptive | Mean, median, mode, standard deviation |
| Inferential | t-tests, ANOVA, chi-square |
| Predictive | Regression, machine learning |
| Exploratory | Visualization, correlation |
3. Interpretation Methods:
- Statistical significance testing
- Confidence interval construction
- Effect size calculation
- Trend analysis
- Comparative analysis
Q20. 🟢 What are the common misconceptions of data science?
[Asked: Jun 2024 | Frequency: 1]
Common Misconceptions in Data Analysis:
| Misconception | Reality |
|---|---|
| Correlation = Causation | Correlation shows relationship, not cause-effect |
| Bigger Sample = Better | Quality matters more than quantity |
| Data Never Lies | Data can be biased, incomplete, or manipulated |
| More Variables = Better Model | Can lead to overfitting |
| AI/ML Solves Everything | Requires clean data and proper problem framing |
Key Fallacies:
1. Correlation vs Causation:
- Ice cream sales and drowning deaths both increase in summer
- They're correlated, but ice cream doesn't cause drowning
2. Simpson's Paradox:
- A trend appears in groups but reverses when the groups are combined
- Example: Treatment A may be better in each group, but B appears better overall
3. Data Dredging:
- Mining data for patterns without a hypothesis
- Leads to false discoveries due to multiple comparisons
Q21. 🟢 What is Simpson's Paradox? Explain with the help of an example.
[Asked: Dec 2024 | Frequency: 1]
Simpson's Paradox is a phenomenon where a trend appears in different groups of data but disappears or reverses when the groups are combined.
Example - University Admission:
By Department:
| Department | Male Applied | Male Admitted | Female Applied | Female Admitted |
|---|---|---|---|---|
| Engineering | 800 | 480 (60%) | 100 | 70 (70%) |
| Arts | 100 | 10 (10%) | 400 | 80 (20%) |
Combined:
| Gender | Total Applied | Total Admitted | Rate |
|---|---|---|---|
| Male | 900 | 490 | 54.4% |
| Female | 500 | 150 | 30% |
Paradox: Females have higher admission rates in EACH department, but lower OVERALL admission rate.
Explanation: More women applied to the harder-to-get-into department (Arts).
Key Lesson: Always consider confounding variables and stratify data appropriately.
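The reversal can be verified directly from the admission table above:

```python
# Admission counts from the table: (admitted, applied) per gender and department
data = {
    "Engineering": {"male": (480, 800), "female": (70, 100)},
    "Arts":        {"male": (10, 100),  "female": (80, 400)},
}

# Per-department rates: the female rate is higher in each department
for dept, g in data.items():
    m_rate = g["male"][0] / g["male"][1]
    f_rate = g["female"][0] / g["female"][1]
    print(dept, f"male={m_rate:.0%} female={f_rate:.0%}")
    assert f_rate > m_rate

# Combined rates: the direction reverses
m_adm = sum(g["male"][0] for g in data.values())      # 490
m_app = sum(g["male"][1] for g in data.values())      # 900
f_adm = sum(g["female"][0] for g in data.values())    # 150
f_app = sum(g["female"][1] for g in data.values())    # 500
print(f"overall: male={m_adm/m_app:.1%} female={f_adm/f_app:.1%}")
assert m_adm / m_app > f_adm / f_app   # paradox: males higher overall
```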
Q22. 🟢 What is Dredging? Explain with the help of an example.
[Asked: Dec 2024 | Frequency: 1]
Data Dredging (also called p-hacking or data fishing) is the misuse of data analysis to find patterns that can be presented as statistically significant when in fact there is no underlying effect.
Characteristics:
- Testing multiple hypotheses without correction
- Cherry-picking favorable results
- Ignoring negative findings
- Post-hoc hypothesis generation
Example: A researcher tests 100 different foods for cancer correlation:
- At a 5% significance level, expect ~5 false positives by chance
- Publishing only "chocolate causes cancer" without mentioning the 99 other tests
Problems:
- Inflated false positive rate
- Non-reproducible results
- Misleading conclusions
- Wasted resources on false leads
Prevention:
- Pre-register hypotheses
- Apply multiple testing corrections (e.g., Bonferroni)
- Report all tests conducted
- Replicate findings independently
Q23. 🟡 What is Data Science Life Cycle? Explain the different stages with the help of a diagram.
[Asked: Jun 2024, Dec 2023 | Frequency: 2]
Data Science Life Cycle is a systematic approach to solving data problems through iterative phases.
Stages Explained:
| Stage | Description | Activities |
|---|---|---|
| 1. Business Understanding | Define problem and objectives | Stakeholder meetings, goal setting |
| 2. Data Collection | Gather relevant data | APIs, databases, surveys, web scraping |
| 3. Data Preparation | Clean and transform data | Missing values, normalization, encoding |
| 4. Exploratory Analysis | Understand data patterns | Visualization, statistics, correlations |
| 5. Data Modeling | Build analytical models | ML algorithms, feature engineering |
| 6. Model Evaluation | Assess model performance | Accuracy, precision, recall, F1-score |
| 7. Deployment | Implement in production | APIs, dashboards, automation |
| 8. Monitoring | Track performance over time | Drift detection, retraining |
Key Characteristics:
- Iterative, not linear
- Requires cross-functional collaboration
- Documentation at each stage is crucial
UNIT 2: PROBABILITY AND STATISTICS FOR DATA SCIENCE
Q24. 🟡 What is Conditional Probability? Explain with the help of a diagram.
[Asked: Jun 2025, Jun 2024, Dec 2023 | Frequency: 3]
Conditional Probability is the probability of an event occurring given that another event has already occurred.
Formula:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Where:
- $P(A|B)$ = Probability of A given B has occurred
- $P(A \cap B)$ = Probability of both A and B occurring
- $P(B)$ = Probability of B occurring
Example:
- A box contains 6 red and 4 blue balls
- Find P(2nd ball is red | 1st ball was red and not replaced)
- P(Red₂|Red₁) = 5/9
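The same result follows from the definition, computed exactly with Python's `fractions` module:

```python
from fractions import Fraction

# Box with 6 red and 4 blue balls; two balls drawn without replacement
p_red1 = Fraction(6, 10)                             # P(1st ball red)
p_red1_and_red2 = Fraction(6, 10) * Fraction(5, 9)   # P(both balls red)

# Conditional probability: P(Red2 | Red1) = P(Red1 and Red2) / P(Red1)
p_red2_given_red1 = p_red1_and_red2 / p_red1
print(p_red2_given_red1)   # 5/9
```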
Q25. 🟡 Write the equation for conditional probability and describe its components with a suitable example.
[Asked: Jun 2025, Dec 2023, Jun 2024 | Frequency: 3]
Conditional Probability Equation:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
Components:
| Component | Symbol | Meaning |
|---|---|---|
| Conditional Probability | P(A|B) | Probability of A happening given B occurred |
| Joint Probability | P(A ∩ B) | Probability of both A and B happening together |
| Marginal Probability | P(B) | Overall probability of event B |
Example - Medical Diagnosis:
| | Disease (D) | No Disease (D') | Total |
|---|---|---|---|
| Positive Test (T) | 95 | 50 | 145 |
| Negative Test (T') | 5 | 850 | 855 |
| Total | 100 | 900 | 1000 |
Calculate P(Disease | Positive Test):
$$P(D|T) = \frac{P(D \cap T)}{P(T)} = \frac{95}{145} \approx 0.655$$
Interpretation: If a person tests positive, there's a 65.5% chance they have the disease.
Q26. 🟡 What is Bayes Theorem?
[Asked: Dec 2024, Jun 2024, Jun 2023 | Frequency: 3]
Bayes' Theorem is a mathematical formula for determining conditional probability, allowing us to update the probability of a hypothesis based on new evidence.
Formula:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
Components:
| Term | Name | Description |
|---|---|---|
| P(A|B) | Posterior | Updated probability after evidence |
| P(A) | Prior | Initial probability before evidence |
| P(B|A) | Likelihood | Probability of evidence given hypothesis |
| P(B) | Marginal Likelihood | Total probability of evidence |
Extended Form (Total Probability):
$$P(B) = P(B|A) \cdot P(A) + P(B|A') \cdot P(A')$$
Key Applications:
- Spam filtering
- Medical diagnosis
- Machine learning classification
- Recommendation systems
Q27. 🟡 Explain Bayes Theorem with suitable equation and example.
[Asked: Dec 2024, Jun 2023, Jun 2024 | Frequency: 3]
Bayes' Theorem Equation:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
Example - Disease Screening:
Given:
- P(Disease) = 1% = 0.01 (Prior)
- P(Positive | Disease) = 99% = 0.99 (Sensitivity)
- P(Positive | No Disease) = 5% = 0.05 (False Positive Rate)
Find: P(Disease | Positive Test)
Solution:
Step 1: Calculate P(Positive) using total probability:
$$P(+) = (0.99)(0.01) + (0.05)(0.99) = 0.0099 + 0.0495 = 0.0594$$
Step 2: Apply Bayes' Theorem:
$$P(\text{Disease}|+) = \frac{0.0099}{0.0594} \approx 0.167$$
Result: Only a 16.7% chance of having the disease even with a positive test!
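The calculation can be checked in a few lines of Python:

```python
# Bayes' theorem for the disease-screening example above
p_disease = 0.01              # prior: P(Disease)
p_pos_given_disease = 0.99    # sensitivity: P(Positive | Disease)
p_pos_given_healthy = 0.05    # false positive rate: P(Positive | No Disease)

# Total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))   # 0.0594

# Posterior: P(Disease | Positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))   # 0.167
```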
Q28. 🟡 What is a Random Variable? Explain the concept of random variable.
[Asked: Jun 2023, Jun 2024 | Frequency: 2]
Random Variable is a variable whose value is determined by the outcome of a random phenomenon. It maps outcomes of a random experiment to numerical values.
Types:
| Type | Description | Example |
|---|---|---|
| Discrete | Takes countable values | Number of heads in 10 coin tosses |
| Continuous | Takes any value in a range | Height, weight, temperature |
Notation:
- X, Y, Z (capital letters) = Random variable
- x, y, z (lowercase) = Specific value
Example - Dice Roll:
- Random experiment: Rolling a fair die
- Random variable X = Number shown on die
- Possible values: X ∈ {1, 2, 3, 4, 5, 6}
- P(X = 3) = 1/6
Properties:
- Has a probability distribution
- Can calculate expected value E(X)
- Has variance Var(X) and standard deviation
Q29. 🟢 Differentiate between Discrete Random Variable and Continuous Random Variable.
[Asked: Jun 2023 | Frequency: 1]
| Aspect | Discrete Random Variable | Continuous Random Variable |
|---|---|---|
| Values | Countable, finite or infinite | Uncountable, any value in range |
| Gaps | Has gaps between values | No gaps, continuous spectrum |
| Probability | P(X = x) > 0 for specific values | P(X = x) = 0 for any single point |
| Distribution | Probability Mass Function (PMF) | Probability Density Function (PDF) |
| Examples | Coin tosses, dice rolls, counts | Height, weight, time, temperature |
| Graphical | Bar chart | Smooth curve |
| Calculation | Sum of probabilities | Integral of density function |
| Notation | P(X = x) | f(x) or P(a ≤ X ≤ b) |
Examples:
Discrete:
- X = Number of students in a class (0, 1, 2, ...)
- Y = Number of defects in a product (0, 1, 2, ...)
Continuous:
- X = Waiting time at a bus stop (0 to ∞)
- Y = Height of students (any value like 5.67 ft)
Q30. 🟢 What is Binomial Distribution?
[Asked: Dec 2023 | Frequency: 1]
Binomial Distribution is a discrete probability distribution that models the number of successes in a fixed number of independent trials, where each trial has the same probability of success.
Conditions (BINS):
- Binary outcomes (success/failure)
- Independent trials
- Number of trials is fixed
- Same probability for each trial
Parameters:
- n = number of trials
- p = probability of success
- q = 1 - p = probability of failure
Notation: X ~ Binomial(n, p)
Characteristics:
- Mean: μ = np
- Variance: σ² = npq
- Standard Deviation: σ = √(npq)
Q31. 🟢 Write the formula for binomial probability distribution.
[Asked: Dec 2023 | Frequency: 1]
Binomial Probability Formula:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Where:
- $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ = Number of ways to choose k successes from n trials
- n = Total number of trials
- k = Number of successes (0, 1, 2, ..., n)
- p = Probability of success in each trial
- (1-p) = Probability of failure
Example: Probability of getting exactly 3 heads in 5 coin tosses:
$$P(X = 3) = \binom{5}{3} (0.5)^3 (0.5)^2 = 10 \times 0.03125 = 0.3125$$
Q32. 🟡 Apply binomial probability distribution formula to produce the probability distribution for coin toss problem.
[Asked: Dec 2023, Jun 2022 | Frequency: 2]
Problem: Find probability distribution for number of heads in 4 coin tosses.
Given: n = 4, p = 0.5 (fair coin)
Formula: $P(X = k) = \binom{4}{k} (0.5)^k (0.5)^{4-k} = \binom{4}{k} (0.5)^4$
Calculations:
| X (Heads) | $\binom{4}{k}$ | Calculation | P(X) |
|---|---|---|---|
| 0 | 1 | 1 × (0.5)⁴ | 0.0625 |
| 1 | 4 | 4 × (0.5)⁴ | 0.2500 |
| 2 | 6 | 6 × (0.5)⁴ | 0.3750 |
| 3 | 4 | 4 × (0.5)⁴ | 0.2500 |
| 4 | 1 | 1 × (0.5)⁴ | 0.0625 |
| Total | | | 1.0000 |
Statistics:
- Mean = np = 4 × 0.5 = 2
- Variance = npq = 4 × 0.5 × 0.5 = 1
- Std Dev = 1
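The whole distribution table above can be generated with `math.comb`:

```python
from math import comb

n, p = 4, 0.5

# P(X = k) = C(n, k) * p^k * (1-p)^(n-k) for each possible number of heads
dist = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for k, prob in dist.items():
    print(k, prob)           # 0.0625, 0.25, 0.375, 0.25, 0.0625

print(sum(dist.values()))    # 1.0 (the probabilities sum to one)
mean = n * p                 # 2.0
variance = n * p * (1 - p)   # 1.0
```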
Q33. 🟡 What kind of probability distribution is binomial? Explain the characteristics of binomial distribution.
[Asked: Jun 2022, Jun 2024 | Frequency: 2]
Binomial Distribution is a discrete probability distribution.
Characteristics:
| Characteristic | Description |
|---|---|
| Discrete | X takes only whole number values (0, 1, 2, ..., n) |
| Fixed Trials | Number of trials n is predetermined |
| Binary Outcomes | Each trial has only two outcomes (success/failure) |
| Independence | Trials are independent of each other |
| Constant Probability | P(success) = p remains same for all trials |
Mathematical Properties:
| Property | Formula |
|---|---|
| Mean (Expected Value) | μ = E(X) = np |
| Variance | σ² = Var(X) = np(1-p) |
| Standard Deviation | σ = √(np(1-p)) |
| Mode | ⌊(n+1)p⌋ or ⌈(n+1)p⌉ - 1 |
Shape Characteristics:
- Symmetric when p = 0.5
- Right-skewed when p < 0.5
- Left-skewed when p > 0.5
- Approaches the normal distribution for large n (np > 5 and n(1-p) > 5)
Applications:
- Quality control (defective items)
- Medical trials (patient recovery)
- Marketing (customer response)
- Finance (default probability)
Q34. 🟡 What is Normal Distribution? Explain the characteristics of normal distribution.
[Asked: Jun 2024, Dec 2022 | Frequency: 2]
Normal Distribution (Gaussian Distribution) is a continuous probability distribution that is symmetric and bell-shaped, described by its mean (μ) and standard deviation (σ).
Formula (PDF):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Notation: X ~ N(μ, σ²)
Characteristics:
| Property | Description |
|---|---|
| Symmetry | Symmetric around mean μ |
| Bell-shaped | Peak at mean, tails extend infinitely |
| Mean = Median = Mode | All central tendency measures equal |
| Total Area = 1 | Under the curve |
| Asymptotic | Curve never touches x-axis |
Empirical Rule (68-95-99.7):
| Range | Percentage |
|---|---|
| μ ± 1σ | 68.27% |
| μ ± 2σ | 95.45% |
| μ ± 3σ | 99.73% |
Standard Normal Distribution: Z ~ N(0, 1) where Z = (X - μ)/σ
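The empirical rule and z-score probabilities can be verified with `math.erf`; the example X ~ N(70, 10²) is hypothetical:

```python
import math

def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Empirical rule: probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    mass = normal_cdf(k) - normal_cdf(-k)
    print(f"mean ± {k} sd: {mass:.2%}")   # 68.27%, 95.45%, 99.73%

# z-score standardization: for X ~ N(70, 10²), find P(X ≤ 85)
z = (85 - 70) / 10
print(round(normal_cdf(z), 4))            # P(Z ≤ 1.5) ≈ 0.9332
```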
Q35. 🟢 What is probability distribution of continuous random variable? Explain with the help of a diagram.
[Asked: Dec 2022 | Frequency: 1]
Probability Distribution of Continuous Random Variable is described using a Probability Density Function (PDF), where probability is calculated as area under the curve.
Key Properties:
- f(x) ≥ 0 for all x
- Total area under curve = 1
- P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
- P(X = specific value) = 0
Common Continuous Distributions:
| Distribution | Use Case |
|---|---|
| Normal | Natural phenomena, errors |
| Exponential | Waiting times |
| Uniform | Equal probability in range |
| Chi-square | Hypothesis testing |
Example - Uniform Distribution:
- X ~ Uniform(0, 10)
- PDF: f(x) = 1/10 for 0 ≤ x ≤ 10
- P(2 ≤ X ≤ 5) = (5-2)/10 = 0.3
Q36. 🟢 How does sampling differ from population?
[Asked: Dec 2023 | Frequency: 1]
| Aspect | Population | Sample |
|---|---|---|
| Definition | Entire group of interest | Subset of population |
| Size | Usually large (N) | Smaller, manageable (n) |
| Data Collection | Census (complete enumeration) | Sampling techniques |
| Parameters | Fixed values (μ, σ) | Estimates (x̄, s) |
| Cost | High | Lower |
| Time | Time-consuming | Faster |
| Accuracy | True values | Subject to sampling error |
| Feasibility | Often impractical | Practical |
| Notation | Greek letters (μ, σ, N) | Latin letters (x̄, s, n) |
Example:
- Population: All MCA students in India
- Sample: 500 randomly selected MCA students
Q37. 🟢 Discuss the relation of the terms 'statistic' and 'parameter' with sampling and population respectively.
[Asked: Dec 2023 | Frequency: 1]
Relationship:
| Term | Associated With | Description | Notation |
|---|---|---|---|
| Parameter | Population | Fixed, unknown value describing population | μ, σ, π |
| Statistic | Sample | Calculated value from sample data | x̄, s, p̂ |
Key Differences:
| Aspect | Parameter | Statistic |
|---|---|---|
| Source | Population | Sample |
| Value | Fixed | Varies by sample |
| Known? | Usually unknown | Calculated |
| Purpose | What we want to know | Estimate parameter |
Common Pairs:
| Measure | Parameter (Population) | Statistic (Sample) |
|---|---|---|
| Mean | μ (mu) | x̄ (x-bar) |
| Standard Deviation | σ (sigma) | s |
| Proportion | π or p | p̂ (p-hat) |
| Variance | σ² | s² |
| Size | N | n |
Relationship:
- Statistics are estimators of parameters
- Multiple samples → Multiple statistics → Sampling distribution
- As n → N, statistic → parameter
Q38. 🟡 What is Sampling? What is Sampling Distribution? Explain with the help of an example.
[Asked: Dec 2024, Jun 2022 | Frequency: 2]
Sampling is the process of selecting a subset (sample) from a population to make inferences about the entire population.
Sampling Distribution is the probability distribution of a statistic (like sample mean) obtained from all possible samples of a given size from a population.
Types of Sampling:
| Type | Method |
|---|---|
| Simple Random | Each member has equal chance |
| Stratified | Divide into groups, sample from each |
| Cluster | Randomly select clusters |
| Systematic | Select every kth member |
Example - Sampling Distribution of Mean:
Population: {2, 4, 6, 8, 10}, μ = 6
All possible samples of size 2 (with replacement):
| Sample | Values | Mean (x̄) |
|---|---|---|
| 1 | 2, 2 | 2 |
| 2 | 2, 4 | 3 |
| 3 | 2, 6 | 4 |
| ... | ... | ... |
| 25 | 10, 10 | 10 |
Sampling Distribution:
- Mean of x̄ values = μ = 6
- Standard Error = σ/√n
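The toy example above is small enough to enumerate exhaustively; a sketch in plain Python that confirms the mean of all 25 sample means equals μ and their spread equals the standard error σ/√n:

```python
from itertools import product
from statistics import mean, pstdev

population = [2, 4, 6, 8, 10]
mu, sigma = mean(population), pstdev(population)   # 6 and sqrt(8)

# All 5^2 = 25 samples of size n = 2, drawn with replacement
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

print(mean(sample_means))    # 6.0 — equals the population mean mu
print(pstdev(sample_means))  # 2.0 — equals sigma / sqrt(2), the standard error
```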
Q39. 🟢 What are the two measures to define the central tendencies of quantitative data? Explain with example.
[Asked: Dec 2024 | Frequency: 1]
Two Main Measures of Central Tendency:
1. Mean (Arithmetic Average):
- Sum of all values divided by count
- Affected by outliers
- Uses all data points
Example: Data: 10, 20, 30, 40, 50; Mean = (10+20+30+40+50)/5 = 150/5 = 30
2. Median (Middle Value):
- Middle value when data is sorted
- Not affected by outliers
- Better for skewed distributions
Example: Data: 10, 20, 30, 40, 100; Median = 30 (middle value); Mean = 40 (pulled up by outlier 100)
Comparison:
| Aspect | Mean | Median |
|---|---|---|
| Calculation | Sum/Count | Middle value |
| Outlier sensitivity | High | Low |
| Best for | Symmetric data | Skewed data |
| Uses all values | Yes | No |
Third Measure - Mode: Most frequently occurring value
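The outlier effect described above is easy to demonstrate with Python's statistics module; a minimal sketch using the same data:

```python
from statistics import mean, median

data = [10, 20, 30, 40, 100]   # 100 is an outlier

print(mean(data))    # 40 — pulled upward by the outlier
print(median(data))  # 30 — unaffected
```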
Q40. 🟡 What are the different measures for defining the spread or variability of a quantitative variable? Explain with examples.
[Asked: Jun 2022, Dec 2024 | Frequency: 2]
Measures of Spread/Variability:
| Measure | Formula | Description |
|---|---|---|
| Range | Max - Min | Simplest measure |
| Variance | σ² = Σ(xᵢ - μ)²/N | Average squared deviation |
| Standard Deviation | σ = √Variance | Spread in original units |
| IQR | Q3 - Q1 | Range of middle 50% |
| Coefficient of Variation | CV = (σ/μ) × 100% | Relative variability |
Example: Data: 5, 10, 15, 20, 25
1. Range: Range = 25 - 5 = 20
2. Variance:
- Mean (μ) = 15
- Deviations: -10, -5, 0, 5, 10
- Squared deviations: 100, 25, 0, 25, 100
- Variance = 250/5 = 50
3. Standard Deviation: σ = √50 = 7.07
4. IQR:
- Q1 = 7.5, Q3 = 22.5
- IQR = 22.5 - 7.5 = 15
When to Use:
- Range: Quick overview
- Std Dev: Most common, comparable data
- IQR: Skewed data, with outliers
- CV: Comparing variability of different units
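The worked example can be reproduced with Python's statistics module (whose default quantile method matches the Q1 = 7.5, Q3 = 22.5 convention used above):

```python
from statistics import pvariance, pstdev, quantiles

data = [5, 10, 15, 20, 25]

data_range = max(data) - min(data)             # 20
variance = pvariance(data)                     # population variance: 50
std_dev = pstdev(data)                         # sqrt(50) ≈ 7.07
q1, q2, q3 = quantiles(data, n=4)              # 7.5, 15.0, 22.5
iqr = q3 - q1                                  # 15
cv = std_dev / (sum(data) / len(data)) * 100   # ≈ 47.1%
```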
Q41. 🟢 Explain the steps of significance testing with the help of an example.
[Asked: Dec 2022 | Frequency: 1]
Steps of Significance Testing (Hypothesis Testing):
Step 1: State Hypotheses
- H₀ (Null Hypothesis): No effect/difference
- H₁ (Alternative Hypothesis): Effect/difference exists
Step 2: Choose Significance Level (α)
- Typically α = 0.05 or 0.01
- Probability of rejecting H₀ when it's true (Type I error)
Step 3: Select Test Statistic
- t-test, z-test, chi-square, F-test, etc.
Step 4: Calculate Test Statistic and p-value
Step 5: Make Decision
- If p-value < α: Reject H₀
- If p-value ≥ α: Fail to reject H₀
Example - Testing Mean Score:
Claim: Average exam score is 75
Sample: n = 36, x̄ = 78, s = 12
Solution:
- H₀: μ = 75, H₁: μ ≠ 75
- α = 0.05
- t-test (unknown σ)
- t = (78-75)/(12/√36) = 3/2 = 1.5
- p-value ≈ 0.14 > 0.05
- Fail to reject H₀ - Insufficient evidence that mean differs from 75
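The test statistic is simple to compute directly (the p-value itself needs a t-table or a stats library, so only the statistic is sketched here):

```python
from math import sqrt

# One-sample t statistic for H0: mu = 75
n, x_bar, s, mu0 = 36, 78, 12, 75
t = (x_bar - mu0) / (s / sqrt(n))
print(t)  # 1.5

# With df = 35, the two-tailed critical value at alpha = 0.05 is about 2.03,
# so |t| = 1.5 < 2.03 and we fail to reject H0.
```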
Q42. 🟢 Write short note on Chi-square test.
[Asked: Jun 2023 | Frequency: 1]
Chi-Square Test (χ²) is a statistical test used to determine if there is a significant association between categorical variables.
Types:
| Type | Purpose |
|---|---|
| Goodness of Fit | Compare observed vs expected frequencies |
| Test of Independence | Check if two variables are related |
| Test of Homogeneity | Compare distributions across groups |
Formula: χ² = Σ (O - E)² / E
Where:
- O = Observed frequency
- E = Expected frequency
Degrees of Freedom:
- Goodness of fit: df = k - 1
- Independence: df = (r-1)(c-1)
Example - Test of Independence:
| | Like Coffee | Don't Like | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 20 | 30 | 50 |
| Total | 50 | 50 | 100 |
Expected (if independent): Each cell = 25
χ² = (30-25)²/25 + (20-25)²/25 + (20-25)²/25 + (30-25)²/25 = 1 + 1 + 1 + 1 = 4
df = (2-1)(2-1) = 1; Critical value at α = 0.05: 3.84
Since 4 > 3.84, reject H₀ - Gender and coffee preference are related.
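The χ² computation above generalizes to any contingency table; a minimal sketch for the gender/coffee example:

```python
# Test of independence for the 2x2 gender / coffee-preference table
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
grand_total = sum(row_totals)                      # 100

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total  # 25 here
        chi2 += (o - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, df)  # 4.0 1 — since 4.0 > 3.84, reject H0
```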
UNIT 3: DATA PREPARATION FOR ANALYSIS
Q43. 🟡 What is Data Preprocessing? Explain with the help of an example.
[Asked: Dec 2022, Jun 2022 | Frequency: 2]
Data Preprocessing is the technique of transforming raw data into a clean, understandable format suitable for analysis. Raw data is often incomplete, inconsistent, and contains errors that must be corrected before analysis.
Why Preprocessing is Needed:
- Real-world data is messy and incomplete
- Contains noise, outliers, and missing values
- Different formats and scales need standardization
- Irrelevant features need to be removed
Key Steps in Data Preprocessing:
Example - Customer Dataset:
Raw Data (Before Preprocessing):
| CustomerID | Name | Age | Income | City |
|---|---|---|---|---|
| 101 | John | 25 | 50000 | Delhi |
| 102 | NULL | -5 | 75000 | mumbai |
| 103 | Mary | 30 | NULL | Delhi |
| 104 | Alex | 999 | 60000 | DELHI |
After Preprocessing:
| CustomerID | Name | Age | Income | City |
|---|---|---|---|---|
| 101 | John | 25 | 50000 | Delhi |
| 102 | Unknown | 28 (mean) | 75000 | Mumbai |
| 103 | Mary | 30 | 61667 (mean) | Delhi |
| 104 | Alex | 28 (replaced outlier) | 60000 | Delhi |
Issues Fixed:
- NULL replaced with defaults/mean values
- Invalid age (-5, 999) corrected
- City names standardized (case consistency)
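The fixes in the tables above can be sketched in a few lines of plain Python (pandas would do the same more concisely); the validity rules and mean-imputation choices are illustrative assumptions, not a fixed recipe:

```python
rows = [
    {"id": 101, "name": "John", "age": 25,  "income": 50000, "city": "Delhi"},
    {"id": 102, "name": None,   "age": -5,  "income": 75000, "city": "mumbai"},
    {"id": 103, "name": "Mary", "age": 30,  "income": None,  "city": "Delhi"},
    {"id": 104, "name": "Alex", "age": 999, "income": 60000, "city": "DELHI"},
]

# Impute from the valid values only (assumed rule: a plausible age is 0-120)
valid_ages = [r["age"] for r in rows if 0 < r["age"] < 120]
mean_age = round(sum(valid_ages) / len(valid_ages))            # 28
incomes = [r["income"] for r in rows if r["income"] is not None]
mean_income = round(sum(incomes) / len(incomes))               # 61667

for r in rows:
    if r["name"] is None:
        r["name"] = "Unknown"          # fill missing name with a default
    if not 0 < r["age"] < 120:
        r["age"] = mean_age            # replace invalid age with the mean
    if r["income"] is None:
        r["income"] = mean_income      # mean imputation for income
    r["city"] = r["city"].capitalize() # standardize case: mumbai -> Mumbai
```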
Q44. 🟢 Why is data preprocessing important in data science and big data applications? Discuss with suitable diagram.
[Asked: Dec 2024 | Frequency: 1]
Importance of Data Preprocessing:
| Reason | Explanation |
|---|---|
| Data Quality | Garbage in = Garbage out; clean data → accurate results |
| Model Performance | ML models perform better with preprocessed data |
| Consistency | Standardizes formats across different sources |
| Efficiency | Reduces storage and computation requirements |
| Accuracy | Removes noise and errors that affect analysis |
| Compatibility | Makes data compatible with analysis tools |
Impact on Big Data:
| Challenge | How Preprocessing Helps |
|---|---|
| Volume | Data reduction techniques |
| Variety | Format standardization |
| Velocity | Stream preprocessing pipelines |
| Veracity | Data validation and cleaning |
Diagram - Preprocessing Pipeline:
Without Preprocessing:
- Models give inaccurate predictions
- Analysis results are misleading
- Storage and processing are inefficient
- Integration of multiple sources fails
Q45. 🟢 Discuss different phases of data preprocessing.
[Asked: Dec 2024 | Frequency: 1]
Phases of Data Preprocessing:
Phase 1: Data Cleaning
| Task | Description |
|---|---|
| Missing Values | Fill with mean/median/mode or remove |
| Noise Removal | Smooth out random errors |
| Outlier Detection | Identify and handle extreme values |
| Inconsistency | Fix contradictory data |
Phase 2: Data Integration
| Task | Description |
|---|---|
| Schema Integration | Combine schemas from multiple sources |
| Entity Resolution | Match same entities across sources |
| Redundancy Removal | Eliminate duplicate attributes |
| Conflict Resolution | Handle different values for same entity |
Phase 3: Data Transformation
| Technique | Purpose |
|---|---|
| Normalization | Scale values to 0-1 range |
| Standardization | Transform to mean=0, std=1 |
| Aggregation | Summarize data (daily → monthly) |
| Discretization | Convert continuous to categorical |
| Encoding | Convert categorical to numerical |
Phase 4: Data Reduction
| Technique | Purpose |
|---|---|
| Dimensionality Reduction | Reduce number of features (PCA) |
| Numerosity Reduction | Reduce data volume (sampling) |
| Data Compression | Encode data efficiently |
| Feature Selection | Keep only relevant features |
Q46. 🟡 What is Data Cleaning?
[Asked: Jun 2025, Dec 2023 | Frequency: 2]
Data Cleaning (also called Data Cleansing or Data Scrubbing) is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset.
Definition: Data cleaning involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty data.
Common Data Quality Issues:
| Issue | Example | Solution |
|---|---|---|
| Missing Values | Age = NULL | Imputation or deletion |
| Duplicate Records | Same customer twice | Deduplication |
| Inconsistent Formats | Date: 10/12/2024 vs 2024-12-10 | Standardization |
| Typos/Errors | "Delih" instead of "Delhi" | Correction |
| Outliers | Age = 999 | Statistical methods |
| Invalid Data | Age = -5 | Validation rules |
Data Cleaning Process:
Importance:
- Ensures data accuracy and reliability
- Improves analysis and model performance
- Reduces errors in decision-making
- Saves time in downstream processing
Q47. 🔴 What are the methods of data cleaning? List and briefly discuss the best practices used for data cleaning and data preparation.
[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]
Methods of Data Cleaning:
1. Handling Missing Values:
| Method | When to Use |
|---|---|
| Deletion | When missing data is random and small (<5%) |
| Mean/Median/Mode Imputation | Numerical data with few missing values |
| Forward/Backward Fill | Time series data |
| Predictive Imputation | Use ML to predict missing values |
| Constant Value | Replace with default (e.g., "Unknown") |
2. Handling Duplicates:
import pandas as pd

# Example frame with one duplicated row
df = pd.DataFrame({"id": [1, 2, 2], "city": ["Delhi", "Mumbai", "Mumbai"]})

# Identify duplicates (Boolean Series, True for repeated rows)
duplicates = df.duplicated()

# Remove duplicates
df_clean = df.drop_duplicates()
3. Handling Outliers:
| Method | Description |
|---|---|
| Z-Score | Remove if \|z\| > 3 |
| IQR Method | Remove if < Q1-1.5×IQR or > Q3+1.5×IQR |
| Capping | Replace with threshold values |
| Transformation | Log transform to reduce impact |
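The IQR method from the table can be sketched as a reusable filter (the sample data here is made up for illustration):

```python
from statistics import quantiles

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]  # 95 looks suspicious

q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lo or x > hi]
clean = [x for x in data if lo <= x <= hi]
print(outliers)  # [95]
```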
4. Standardization & Normalization:
| Technique | Formula | Range |
|---|---|---|
| Min-Max Normalization | (x - min)/(max - min) | [0, 1] |
| Z-Score Standardization | (x - μ)/σ | Unbounded |
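Both formulas take only a few lines; a sketch using the same data as the spread example earlier:

```python
data = [5, 10, 15, 20, 25]
mu = sum(data) / len(data)                                      # 15
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5   # sqrt(50)

# Min-max normalization: rescales to [0, 1]
lo, hi = min(data), max(data)
minmax = [(x - lo) / (hi - lo) for x in data]   # [0.0, 0.25, 0.5, 0.75, 1.0]

# Z-score standardization: mean 0, standard deviation 1
zscores = [(x - mu) / sigma for x in data]
```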
5. Data Type Conversion:
- Convert strings to dates
- Convert categories to numbers
- Parse structured text fields
Best Practices:
| Practice | Description |
|---|---|
| Profile First | Understand data before cleaning |
| Document Everything | Keep log of all changes |
| Preserve Original | Keep backup of raw data |
| Automate | Create reusable cleaning scripts |
| Validate | Check results after each step |
| Iterative Approach | Clean in multiple passes |
Data Cleaning Workflow:
Q48. 🟢 What is Data Curation? Explain with the help of an example.
[Asked: Dec 2022 | Frequency: 1]
Data Curation is the process of organizing, integrating, and maintaining data throughout its lifecycle to ensure it remains accessible, reliable, and valuable for current and future use.
Definition: Data curation involves the active management of data from creation through its entire lifecycle, including organization, validation, preservation, and ensuring long-term accessibility.
Key Activities in Data Curation:
| Activity | Description |
|---|---|
| Collection | Gathering data from various sources |
| Organization | Structuring and categorizing data |
| Validation | Ensuring accuracy and quality |
| Preservation | Storing for long-term access |
| Documentation | Adding metadata and context |
| Access Control | Managing who can use the data |
Example - Research Data Curation:
A university research project on climate change:
| Stage | Curation Activity |
|---|---|
| Collection | Gather temperature data from 100 weather stations |
| Organization | Structure by location, date, measurement type |
| Validation | Cross-check readings, flag anomalies |
| Documentation | Add metadata: sensor type, calibration date, location coordinates |
| Preservation | Store in institutional repository with backups |
| Access | Publish dataset with DOI for citation |
Before Curation:
- Scattered files in different formats
- No documentation of collection methods
- Missing context for interpretation
After Curation:
- Unified dataset with consistent format
- Complete metadata for reproducibility
- Accessible to other researchers
- Preserved for future studies
Difference from Data Cleaning:
| Data Cleaning | Data Curation |
|---|---|
| Fixes errors and inconsistencies | Manages entire data lifecycle |
| One-time process | Ongoing activity |
| Technical focus | Governance focus |
| Prepares for analysis | Ensures long-term value |
UNIT 4: DATA VISUALIZATION
Q49. 🟢 What is a Histogram?
[Asked: Jun 2023 | Frequency: 1]
Histogram is a graphical representation of the distribution of numerical data, showing the frequency of data points falling within specified ranges (bins).
Characteristics:
- X-axis: Data ranges (bins)
- Y-axis: Frequency/count
- Bars are adjacent (no gaps)
- Shows distribution shape
Use Cases:
- Understanding data distribution
- Identifying skewness
- Detecting outliers
- Comparing distributions
Q50. 🟢 How does Histogram differ from Bar Graph?
[Asked: Jun 2023 | Frequency: 1]
| Aspect | Histogram | Bar Graph |
|---|---|---|
| Data Type | Continuous/Numerical | Categorical/Discrete |
| Bar Spacing | No gaps (adjacent bars) | Gaps between bars |
| X-axis | Ranges/Bins | Categories |
| Purpose | Show distribution | Compare categories |
| Bar Order | Fixed (numerical order) | Can be rearranged |
| Bar Width | Meaningful (represents range) | Arbitrary |
Visual Comparison:
Q51. 🟢 Briefly discuss the utility of Histogram in Data Science.
[Asked: Jun 2023 | Frequency: 1]
Utilities of Histogram in Data Science:
| Utility | Description |
|---|---|
| Distribution Analysis | Understand how data is spread |
| Outlier Detection | Identify extreme values |
| Skewness Detection | Determine if data is symmetric or skewed |
| Binning Decisions | Help decide discretization strategy |
| Feature Engineering | Guide transformation decisions |
| Data Quality | Identify data issues |
Distribution Patterns:
Applications:
- EDA (Exploratory Data Analysis)
- Feature selection
- Model assumption validation
- Data preprocessing decisions
Q52. 🟡 How to create a Histogram in R? Write the syntax and explain with example.
[Asked: Jun 2023, Dec 2024 | Frequency: 2]
Basic Syntax:
hist(x, main, xlab, ylab, col, border, breaks)
Parameters:
| Parameter | Description |
|---|---|
| x | Vector of values |
| main | Title of histogram |
| xlab | X-axis label |
| ylab | Y-axis label |
| col | Fill color |
| border | Border color |
| breaks | Number of bins |
Example:
# Create sample data
marks <- c(45, 67, 89, 34, 78, 56, 90, 23, 67, 88,
54, 76, 82, 39, 71, 63, 95, 48, 72, 85)
# Create histogram
hist(marks,
main = "Distribution of Student Marks",
xlab = "Marks",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 5)
Output:
Distribution of Student Marks
Frequency
│
6 │ ████
│ ████
4 │ ████ ████ ████
│ ████ ████ ████
2 │ ████ ████ ████ ████
│ ████ ████ ████ ████
0 └──────────────────────────
20-40 40-60 60-80 80-100
Marks
Q53. 🔴 What is a Box Plot? What do you mean by Box Plot?
[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]
Box Plot (also called Box-and-Whisker Plot) is a standardized way of displaying the distribution of data based on five key statistics: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Five-Number Summary:
| Statistic | Description |
|---|---|
| Minimum | Smallest value (excluding outliers) |
| Q1 (25th percentile) | Lower quartile |
| Median (Q2) | Middle value (50th percentile) |
| Q3 (75th percentile) | Upper quartile |
| Maximum | Largest value (excluding outliers) |
IQR (Interquartile Range): Q3 - Q1
Outlier Detection:
- Lower outliers: < Q1 - 1.5 × IQR
- Upper outliers: > Q3 + 1.5 × IQR
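The fence rule takes only a few lines to apply; a sketch using a small score list (note how 120, despite looking extreme, sits inside the upper fence of 89 + 1.5×44 = 155 and so is not flagged):

```python
from statistics import quantiles

scores = sorted([45, 67, 89, 34, 78, 56, 90, 23, 67, 88, 120])

q1, q2, q3 = quantiles(scores, n=4)   # 45.0, 67.0, 89.0
iqr = q3 - q1                         # 44.0
lower_fence = q1 - 1.5 * iqr          # -21.0
upper_fence = q3 + 1.5 * iqr          # 155.0

outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(outliers)  # [] — even 120 is inside the fences here
```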
Q54. 🟡 What is the utility of Box Plot in Data Science? Briefly discuss.
[Asked: Jun 2025, Dec 2023 | Frequency: 2]
Utilities of Box Plot:
| Utility | Description |
|---|---|
| Distribution Summary | Quick overview of data spread |
| Outlier Detection | Clearly shows extreme values |
| Comparison | Compare multiple groups side-by-side |
| Skewness Detection | Asymmetric box indicates skew |
| Central Tendency | Shows median clearly |
| Variability | IQR shows data spread |
Applications in Data Science:
- EDA: Initial data exploration
- Feature Analysis: Compare feature distributions
- Data Quality: Identify anomalies
- Group Comparison: Compare across categories
- Model Diagnostics: Check residual distributions
Interpreting Skewness:
Q55. 🟡 How to create a Box Plot in R? Write the syntax or list the commands.
[Asked: Dec 2022, Jun 2023, Dec 2024 | Frequency: 3]
Basic Syntax:
boxplot(x, main, xlab, ylab, col, border, horizontal, notch)
Parameters:
| Parameter | Description |
|---|---|
| x | Vector or formula |
| main | Title |
| xlab, ylab | Axis labels |
| col | Fill color |
| horizontal | TRUE for horizontal plot |
| notch | TRUE for notched box |
Example 1: Single Box Plot
# Sample data
scores <- c(45, 67, 89, 34, 78, 56, 90, 23, 67, 88, 120)
# Create box plot
boxplot(scores,
main = "Student Scores Distribution",
ylab = "Scores",
col = "lightgreen",
border = "darkgreen")
Example 2: Grouped Box Plot
# Create data frame
data <- data.frame(
scores = c(75, 80, 85, 70, 90, 65, 70, 75, 60, 80),
group = c("A","A","A","A","A","B","B","B","B","B")
)
# Grouped box plot
boxplot(scores ~ group,
data = data,
main = "Scores by Group",
xlab = "Group",
ylab = "Scores",
col = c("lightblue", "lightpink"))
Q56. 🟢 What are whiskers in a Box Plot?
[Asked: Dec 2023 | Frequency: 1]
Whiskers are the lines extending from the box in a box plot to the minimum and maximum values within a defined range.
Definition:
- Lower Whisker: Extends from Q1 to the smallest value ≥ Q1 - 1.5×IQR
- Upper Whisker: Extends from Q3 to the largest value ≤ Q3 + 1.5×IQR
Whisker Calculation:
| Component | Formula |
|---|---|
| IQR | Q3 - Q1 |
| Upper Limit | Q3 + 1.5 × IQR |
| Lower Limit | Q1 - 1.5 × IQR |
| Upper Whisker | Max value ≤ Upper Limit |
| Lower Whisker | Min value ≥ Lower Limit |
Purpose:
- Show data range excluding outliers
- Help identify outliers (points beyond whiskers)
- Indicate data variability
Q57. 🟢 Explain clearly how the Box Plot differs from Scatter Plot.
[Asked: Jun 2025 | Frequency: 1]
| Aspect | Box Plot | Scatter Plot |
|---|---|---|
| Purpose | Show distribution of ONE variable | Show relationship between TWO variables |
| Variables | Univariate (single variable) | Bivariate (two variables) |
| Data Points | Summarized (5-number summary) | Individual points shown |
| Outliers | Explicitly marked | Visible but not marked |
| Comparison | Compare distributions across groups | Identify correlations |
| Best For | Distribution, spread, outliers | Correlation, patterns, trends |
Visual Comparison:
When to Use:
| Scenario | Use |
|---|---|
| Analyze single variable distribution | Box Plot |
| Compare groups | Box Plot |
| Find relationship between 2 variables | Scatter Plot |
| Identify clusters | Scatter Plot |
| Detect outliers in one variable | Box Plot |
| Predict one variable from another | Scatter Plot |
Q58. 🟢 Draw a sample box plot and explain it.
[Asked: Jun 2022 | Frequency: 1]
Sample Data: Test scores: 23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120
Calculations:
- Sorted: 23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120
- Q1 = 61.5
- Median (Q2) = 75
- Q3 = 88.5
- IQR = 88.5 - 61.5 = 27
- Lower Limit = 61.5 - 1.5(27) = 21
- Upper Limit = 88.5 + 1.5(27) = 129
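The quartiles above use the median-of-halves convention (Q1 is the median of the lower half, Q3 of the upper half); a short sketch that reproduces the calculations:

```python
from statistics import median

scores = sorted([23, 45, 56, 67, 67, 72, 78, 85, 88, 89, 90, 120])
n = len(scores)                            # 12

q2 = median(scores)                        # 75.0
lower, upper = scores[:n // 2], scores[(n + 1) // 2:]
q1, q3 = median(lower), median(upper)      # 61.5 and 88.5

iqr = q3 - q1                              # 27.0
limits = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # (21.0, 129.0)
```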
Box Plot:
Interpretation:
- Median = 75: Half the students scored above 75
- IQR = 27: Middle 50% of scores span 27 points
- Symmetric: Median is roughly centered in box
- No extreme outliers: All values within whisker range
Q59. 🟡 What is a Scatter Plot?
[Asked: Dec 2023, Dec 2024 | Frequency: 2]
Scatter Plot is a type of graph that displays values for two variables as a collection of points, showing the relationship or correlation between them.
Characteristics:
- X-axis: Independent variable
- Y-axis: Dependent variable
- Each point represents one observation
- Pattern reveals relationship type
Types of Relationships:
Use Cases:
- Correlation analysis
- Regression modeling
- Outlier detection
- Cluster identification
Q60. 🟡 What is the use of scatter plot? Give uses and best practices.
[Asked: Dec 2024, Dec 2023 | Frequency: 2]
Uses of Scatter Plot:
| Use | Description |
|---|---|
| Correlation Detection | Identify positive/negative/no correlation |
| Trend Analysis | Observe patterns in data |
| Outlier Detection | Spot unusual data points |
| Regression Basis | Foundation for linear regression |
| Cluster Identification | Find natural groupings |
| Hypothesis Testing | Validate assumptions about relationships |
Best Practices:
| Practice | Guideline |
|---|---|
| Clear Labels | Label both axes with units |
| Appropriate Scale | Start axis at 0 when meaningful |
| Point Size | Keep consistent, not too large |
| Color Coding | Use for categorical grouping |
| Trend Line | Add regression line if relevant |
| Avoid Overplotting | Use transparency for large datasets |
Example Interpretation:
Height vs Weight (Positive Correlation)
Weight │ · ·
(kg) │ · · ·
│ · · ·
│ · · ·
│ · · ·
│· · ·
└────────────────────────
Height (cm)
Interpretation: As height increases, weight tends to increase
Correlation: Strong positive (r ≈ 0.8)
Q61. 🟡 How to draw a Scatter Plot in R? Write the syntax and explain with example.
[Asked: Dec 2024, Jun 2023, Jun 2024 | Frequency: 3]
Basic Syntax:
plot(x, y, main, xlab, ylab, col, pch, cex)
Parameters:
| Parameter | Description |
|---|---|
| x | X-axis values |
| y | Y-axis values |
| main | Title |
| xlab, ylab | Axis labels |
| col | Point color |
| pch | Point shape (1-25) |
| cex | Point size |
Example:
# Sample data
height <- c(150, 160, 165, 170, 175, 180, 185, 190)
weight <- c(50, 55, 60, 65, 70, 75, 80, 85)
# Create scatter plot
plot(height, weight,
main = "Height vs Weight",
xlab = "Height (cm)",
ylab = "Weight (kg)",
col = "blue",
pch = 16,
cex = 1.5)
# Add trend line
abline(lm(weight ~ height), col = "red", lwd = 2)
Point Shapes (pch values):
1: ○ 2: △ 3: + 4: × 5: ◇
16: ● 17: ▲ 18: ◆ 19: ● 20: •
Q62. 🟢 What is a Heat Map?
[Asked: Jun 2023 | Frequency: 1]
Heat Map is a data visualization technique that uses color intensity to represent the magnitude of values in a matrix or table format.
Characteristics:
- Uses color gradients (e.g., blue→red)
- Displays 2D data matrix
- Darker/brighter colors = higher values
- Often includes clustering (dendrograms)
Diagram:
Feature1 Feature2 Feature3
Sample1 ██████ ░░░░░░ ████
Sample2 ░░░░░░ ██████ ██
Sample3 ████ ████ ██████
Sample4 ██ ██ ░░░░░░
Color Scale: ░ Low ─────────── █ High
Components:
- Color Scale: Legend showing value-to-color mapping
- Cells: Individual data points
- Dendrograms: Optional clustering trees
- Labels: Row and column identifiers
Q63. 🟢 Give uses and best practices for Heat Maps.
[Asked: Jun 2023 | Frequency: 1]
Uses of Heat Maps:
| Use | Application |
|---|---|
| Correlation Matrix | Visualize variable relationships |
| Gene Expression | Compare expression across samples |
| Website Analytics | User click patterns |
| Geographic Data | Population density, temperature |
| Time Series | Activity patterns by hour/day |
| Clustering Results | Show group similarities |
Best Practices:
| Practice | Guideline |
|---|---|
| Color Choice | Use intuitive colors (blue=cold, red=hot) |
| Color Blindness | Avoid red-green combinations |
| Normalization | Scale data for fair comparison |
| Clustering | Group similar rows/columns |
| Labels | Keep readable, rotate if needed |
| Legend | Always include color scale |
| Annotation | Add values in cells if few |
R Code Example:
# Create matrix
data <- matrix(runif(25), nrow=5, ncol=5)
# Create heatmap
heatmap(data,
main = "Sample Heat Map",
col = heat.colors(10))
Q64. 🔴 What is the use of Bar Chart? How to draw a Bar Chart in R?
[Asked: Jun 2022, Jun 2023, Dec 2024, Jun 2024 | Frequency: 4]
Use of Bar Chart:
| Use | Description |
|---|---|
| Comparison | Compare values across categories |
| Ranking | Show highest to lowest |
| Composition | Parts of a whole (stacked) |
| Trends | Changes over discrete periods |
| Distribution | Frequency of categories |
Types:
- Vertical bar chart
- Horizontal bar chart
- Grouped bar chart
- Stacked bar chart
R Syntax:
barplot(height, names.arg, main, xlab, ylab, col, border, horiz)
Parameters:
| Parameter | Description |
|---|---|
| height | Vector of bar heights |
| names.arg | Labels for bars |
| col | Bar colors |
| horiz | TRUE for horizontal |
| beside | TRUE for grouped bars |
Example:
# Data
sales <- c(250, 180, 320, 280, 150)
products <- c("A", "B", "C", "D", "E")
# Create bar chart
barplot(sales,
names.arg = products,
main = "Product Sales Comparison",
xlab = "Product",
ylab = "Sales (units)",
col = c("red", "blue", "green", "orange", "purple"),
border = "black")
Output:
Sales │
320 │ ████
280 │ ████ ████
250 │ ████ ████ ████
180 │ ████ ████ ████ ████
150 │ ████ ████ ████ ████ ████
└──────────────────────────────
A B C D E
Products
Q65. 🟡 How to create Line Graphs in R? Write the syntax and explain with example.
[Asked: Jun 2023, Dec 2024 | Frequency: 2]
Basic Syntax:
plot(x, y, type = "l", main, xlab, ylab, col, lwd, lty)
Parameters:
| Parameter | Description |
|---|---|
| type | "l"=line, "b"=both, "o"=overplotted |
| lwd | Line width |
| lty | Line type (1=solid, 2=dashed, etc.) |
Example:
# Data - Monthly sales
months <- 1:12
sales <- c(100, 120, 150, 180, 200, 220, 210, 190, 170, 150, 130, 140)
# Create line graph
plot(months, sales,
type = "o",
main = "Monthly Sales Trend",
xlab = "Month",
ylab = "Sales (units)",
col = "blue",
lwd = 2,
pch = 16)
# Add grid
grid()
Multiple Lines:
# Second product
sales2 <- c(80, 100, 130, 150, 170, 180, 175, 160, 140, 120, 100, 110)
# Add to existing plot
lines(months, sales2, col = "red", lwd = 2, type = "o", pch = 17)
# Add legend
legend("topright",
legend = c("Product A", "Product B"),
col = c("blue", "red"),
lwd = 2,
pch = c(16, 17))
Q66. 🟢 What is the use of Pair Plot? Explain how to read a pair plot.
[Asked: Dec 2024 | Frequency: 1]
Pair Plot (also called Scatter Plot Matrix) displays pairwise relationships between multiple variables in a dataset.
Uses:
| Use | Description |
|---|---|
| EDA | Quick overview of all relationships |
| Correlation | Identify correlated variables |
| Patterns | Spot non-linear relationships |
| Outliers | Detect multivariate outliers |
| Feature Selection | Choose relevant features |
How to Read a Pair Plot:
Reading Tips:
- Diagonal: Shows distribution (histogram/density)
- Upper/Lower Triangle: Scatter plots (often mirrored)
- Strong Correlation: Points form line pattern
- No Correlation: Random scatter
- Clusters: Grouped points suggest categories
R Example:
# Using pairs function
pairs(iris[,1:4],
main = "Iris Dataset Pair Plot",
col = iris$Species,
pch = 19)
Q67. 🟢 List the key characteristics of various types of plots for data visualization.
[Asked: Jun 2024 | Frequency: 1]
| Plot Type | Variables | Best For | Key Characteristics |
|---|---|---|---|
| Histogram | 1 numerical | Distribution | Bins, frequency, no gaps |
| Bar Chart | 1 categorical | Comparison | Gaps between bars, categories |
| Box Plot | 1 numerical | Summary stats | 5-number summary, outliers |
| Scatter Plot | 2 numerical | Correlation | Points, trends, clusters |
| Line Graph | Time series | Trends | Connected points, time-based |
| Heat Map | Matrix | Patterns | Color intensity, 2D grid |
| Pie Chart | 1 categorical | Proportions | Circular, percentages |
| Pair Plot | Multiple | Relationships | Matrix of scatter plots |
| Violin Plot | 1 numerical | Distribution | Box plot + density |
| Area Chart | Time series | Cumulative | Filled under line |
Selection Guide:
UNIT 5: BIG DATA ARCHITECTURE
Q68. 🟡 What is Big Data?
[Asked: Jun 2025, Jun 2022 | Frequency: 2]
Big Data refers to extremely large and complex datasets that cannot be processed, stored, or analyzed using traditional data processing tools and techniques.
Definition: Big Data is characterized by high volume, velocity, and variety of data that requires advanced technologies and analytical methods to extract meaningful insights.
Key Characteristics (5 Vs):
| V | Description | Example |
|---|---|---|
| Volume | Massive amount of data | Petabytes, Exabytes |
| Velocity | Speed of data generation | Real-time streaming |
| Variety | Different data types | Text, images, videos |
| Veracity | Data quality/accuracy | Trustworthiness |
| Value | Business insights | Actionable decisions |
Sources of Big Data:
- Social media (Facebook, Twitter)
- IoT sensors
- E-commerce transactions
- Scientific experiments
- Healthcare records
- Financial markets
Q69. 🔴 What are the characteristics of Big Data? Explain the four V's with examples.
[Asked: Jun 2025, Dec 2022, Jun 2022, Jun 2024 | Frequency: 4]
The 4 V's of Big Data (with Value often added as a fifth):
1. VOLUME (Size)
| Aspect | Description |
|---|---|
| Definition | Massive scale of data |
| Scale | Terabytes → Petabytes → Exabytes |
| Example | Facebook generates 4+ PB of data daily |
| Challenge | Storage and processing infrastructure |
2. VELOCITY (Speed)
| Aspect | Description |
|---|---|
| Definition | Speed of data generation and processing |
| Types | Batch, Near real-time, Real-time |
| Example | Stock market: millions of trades per second |
| Challenge | Real-time processing requirements |
3. VARIETY (Types)
| Type | Examples |
|---|---|
| Structured | Databases, spreadsheets |
| Semi-structured | JSON, XML, logs |
| Unstructured | Images, videos, emails |
| Example | Hospital: patient records + X-rays + doctor notes |
4. VERACITY (Quality)
| Aspect | Description |
|---|---|
| Definition | Accuracy and trustworthiness |
| Issues | Missing data, inconsistencies, bias |
| Example | Social media sentiment may be manipulated |
| Challenge | Ensuring data quality at scale |
5. VALUE (Insight)
| Aspect | Description |
|---|---|
| Definition | Business value extracted from data |
| Goal | Turn raw data into actionable insights |
| Example | Netflix recommendations drive 80% of viewing |
| Challenge | Deriving meaningful insights cost-effectively |
Summary Table:
| V | Question Answered | Key Metric |
|---|---|---|
| Volume | How much? | Size (TB, PB) |
| Velocity | How fast? | Speed (records/sec) |
| Variety | What types? | Format diversity |
| Veracity | How accurate? | Data quality % |
| Value | How useful? | Business impact |
Q70. 🟢 Differentiate between Big Data and Data Warehouse.
[Asked: Jun 2025 | Frequency: 1]
| Aspect | Big Data | Data Warehouse |
|---|---|---|
| Data Type | Structured + Unstructured | Primarily Structured |
| Volume | Petabytes to Exabytes | Terabytes |
| Processing | Distributed (Hadoop, Spark) | Centralized (SQL Server, Oracle) |
| Schema | Schema-on-read | Schema-on-write |
| Data Source | Multiple heterogeneous sources | Integrated enterprise sources |
| Storage | HDFS, NoSQL | RDBMS |
| Query Type | Exploratory, ML | Predefined reports, BI |
| Latency | Real-time possible | Typically batch |
| Cost | Lower (commodity hardware) | Higher (specialized hardware) |
| Flexibility | High | Limited |
Q71. ๐ข How does Big Data differ from relational data?
[Asked: Dec 2022 | Frequency: 1]
| Aspect | Big Data | Relational Data |
|---|---|---|
| Volume | Massive (PB+) | Limited (GB-TB) |
| Structure | Any (structured, unstructured) | Structured only |
| Schema | Flexible, schema-on-read | Fixed, schema-on-write |
| Scaling | Horizontal (add nodes) | Vertical (bigger server) |
| Processing | Distributed (MapReduce) | Single server (SQL) |
| ACID | Eventual consistency (BASE) | Full ACID compliance |
| Query Language | Various (Hive, Pig, etc.) | SQL |
| Storage | HDFS, NoSQL | RDBMS tables |
| Cost | Commodity hardware | Expensive specialized |
| Use Case | Analytics, ML, exploration | Transactions, reports |
Key Differences:
- Scale: Big Data handles internet-scale; RDBMS handles enterprise-scale
- Flexibility: Big Data accepts any format; RDBMS requires predefined schema
- Speed: Big Data can process in real-time; RDBMS typically batch
Q72. ๐ข What is Big Data Analysis?
[Asked: Dec 2024 | Frequency: 1]
Big Data Analysis is the process of examining large and varied datasets to uncover hidden patterns, correlations, market trends, customer preferences, and other useful business information.
Components:
| Component | Description |
|---|---|
| Data Collection | Gathering from multiple sources |
| Data Storage | HDFS, NoSQL databases |
| Data Processing | MapReduce, Spark |
| Data Analysis | Statistical and ML techniques |
| Visualization | Dashboards, reports |
Types of Big Data Analysis:
| Type | Purpose | Example |
|---|---|---|
| Descriptive | What happened? | Sales reports |
| Diagnostic | Why did it happen? | Root cause analysis |
| Predictive | What will happen? | Demand forecasting |
| Prescriptive | What should we do? | Recommendation engines |
Tools Used:
- Apache Hadoop
- Apache Spark
- Apache Kafka
- MongoDB, Cassandra
- Tableau, Power BI
Q73. ๐ข What is Distributed File System? Explain in the context of big data.
[Asked: Dec 2024 | Frequency: 1]
Distributed File System (DFS) is a file system that stores data across multiple machines (nodes) in a network, providing the illusion of a single unified file system to users.
Definition: A DFS allows files to be stored on multiple servers and accessed as if they were on a local disk, enabling scalable storage and parallel processing.
Key Concepts:
| Concept | Description |
|---|---|
| Nodes | Individual machines in the cluster |
| Blocks | Files split into fixed-size chunks |
| Replication | Each block copied to multiple nodes |
| Namespace | Unified view of distributed files |
Example - HDFS:
- File "data.txt" (384 MB)
- Split into 3 blocks of 128 MB each
- Each block replicated 3 times
- Stored across multiple DataNodes
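The block/replica bookkeeping above can be sketched in pure Python (an illustrative model, not HDFS itself; the node names are hypothetical):

```python
from itertools import cycle

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    assert replication <= len(nodes), "need at least as many nodes as replicas"
    placement = {}
    node_cycle = cycle(nodes)
    for block_id in range(num_blocks):
        placement[block_id] = [next(node_cycle) for _ in range(replication)]
    return placement

# A 384 MB file with 128 MB blocks yields 3 blocks; each gets 3 replicas.
placement = place_replicas(3, ["NodeA", "NodeB", "NodeC", "NodeD", "NodeE"])
for block_id, replicas in placement.items():
    print(f"Block {block_id} -> {replicas}")
```

Real HDFS placement also applies rack awareness (see Q76); round-robin here only captures the "spread replicas across distinct nodes" idea.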
Q74. ๐ข Explain the different features of Distributed File System.
[Asked: Dec 2024 | Frequency: 1]
Features of Distributed File System:
| Feature | Description |
|---|---|
| Scalability | Add nodes to increase capacity |
| Fault Tolerance | Data replicated across nodes |
| Transparency | Users see single file system |
| High Availability | No single point of failure |
| Parallel Access | Multiple clients access simultaneously |
| Data Locality | Process data where it's stored |
Detailed Features:
1. Scalability:
- Horizontal scaling (add more machines)
- Linear increase in capacity
- No downtime for expansion
2. Fault Tolerance:
- Data replication (typically 3 copies)
- Automatic recovery from node failure
- Continuous health monitoring
3. Transparency Types:
| Type | Description |
|---|---|
| Location | Users don't know physical location |
| Access | Same access method everywhere |
| Failure | System handles failures invisibly |
| Replication | Multiple copies appear as one |
4. Data Locality:
- Move computation to data
- Reduces network bandwidth
- Improves processing speed
Q75. ๐ข What is HDFS (Hadoop Distributed File System)?
[Asked: Jun 2025 | Frequency: 1]
HDFS (Hadoop Distributed File System) is a distributed, scalable, and fault-tolerant file system designed to store very large files across machines in a Hadoop cluster.
Key Features:
- Stores files across commodity hardware
- Handles petabytes of data
- Fault-tolerant through replication
- Optimized for large sequential reads
Architecture Components:
| Component | Role |
|---|---|
| NameNode | Master - manages metadata, namespace |
| DataNode | Slave - stores actual data blocks |
| Secondary NameNode | Checkpoint backup (not hot standby) |
Block Storage:
- Default block size: 128 MB
- Each block replicated (default: 3)
- Blocks distributed across DataNodes
Q76. ๐ก What are the characteristics of HDFS?
[Asked: Dec 2022, Jun 2022 | Frequency: 2]
Characteristics of HDFS:
| Characteristic | Description |
|---|---|
| Distributed Storage | Data spread across multiple nodes |
| Fault Tolerance | 3x replication by default |
| Scalability | Scale to thousands of nodes |
| High Throughput | Optimized for batch processing |
| Large Files | Designed for GB-TB sized files |
| Write-Once | Append-only, no random writes |
| Data Locality | Move compute to data |
Detailed Characteristics:
1. Large Block Size:
- 128 MB default (vs 4 KB in traditional FS)
- Reduces metadata overhead
- Efficient for large sequential reads
2. Replication:
File: report.txt
↓
Block 1 → Node A, Node B, Node C
Block 2 → Node B, Node D, Node E
Block 3 → Node A, Node C, Node D
3. Rack Awareness:
- Replicas placed in different racks
- Survives rack-level failures
- Optimizes network bandwidth
4. Write-Once, Read-Many:
- Files written once
- Appends supported (Hadoop 2.x+)
- No random updates
Q77. ๐ก Why is HDFS used for Big data processing? What are the advantages of HDFS?
[Asked: Dec 2022, Jun 2022 | Frequency: 2]
Why HDFS for Big Data:
| Reason | Explanation |
|---|---|
| Scale | Handles petabytes across thousands of nodes |
| Cost | Runs on commodity hardware |
| Reliability | Automatic replication and recovery |
| Performance | High throughput for large files |
| Integration | Works with Hadoop ecosystem |
Advantages of HDFS:
| Advantage | Description |
|---|---|
| Fault Tolerance | Node failure doesn't lose data |
| Scalability | Add nodes without downtime |
| Cost-Effective | Uses cheap commodity hardware |
| High Throughput | Parallel data access |
| Data Locality | Moves computation to data |
| Streaming Access | Efficient for batch jobs |
Comparison with Traditional FS:
| Aspect | HDFS | Traditional FS |
|---|---|---|
| Scale | PB+ | TB |
| Hardware | Commodity | Enterprise |
| Failure Handling | Automatic | Manual |
| Access Pattern | Sequential | Random |
| Block Size | 128 MB | 4 KB |
Q78. ๐ข Explain how Master/Slave process works in HDFS architecture.
[Asked: Dec 2024 | Frequency: 1]
Master/Slave Architecture in HDFS:
NameNode (Master):
| Function | Description |
|---|---|
| Namespace Management | Maintains directory tree |
| Block Mapping | Tracks which blocks on which nodes |
| Replication | Ensures adequate copies exist |
| Client Coordination | Directs clients to DataNodes |
DataNode (Slave):
| Function | Description |
|---|---|
| Block Storage | Stores actual data blocks |
| Heartbeat | Sends health status every 3 seconds |
| Block Report | Lists all blocks periodically |
| Data Transfer | Serves read/write requests |
Communication Flow:
1. Heartbeat: DataNode → NameNode (every 3 sec)
   - Confirms node is alive
   - Receives commands (replicate, delete blocks)
2. Block Report: DataNode → NameNode (every 6 hours)
   - Complete list of blocks on node
   - NameNode updates block mapping
3. Read Operation:
   - Client → NameNode: "Where is file X?"
   - NameNode → Client: "Blocks on nodes A, B, C"
   - Client → DataNode: Direct data transfer
Q79. ๐ข Write steps to load data into HDFS format.
[Asked: Jun 2025 | Frequency: 1]
Steps to Load Data into HDFS:
Step 1: Start Hadoop Services
start-dfs.sh
start-yarn.sh
Step 2: Create Directory in HDFS
hdfs dfs -mkdir /user/data
hdfs dfs -mkdir -p /user/data/input
Step 3: Upload File to HDFS
# Single file
hdfs dfs -put localfile.txt /user/data/input/
# Multiple files
hdfs dfs -put *.csv /user/data/input/
# From local directory
hdfs dfs -copyFromLocal /local/path/ /hdfs/path/
Step 4: Verify Upload
# List files
hdfs dfs -ls /user/data/input/
# Check file size
hdfs dfs -du -h /user/data/input/
# View file content
hdfs dfs -cat /user/data/input/file.txt | head
Common HDFS Commands:
| Command | Description |
|---|---|
| -put | Upload local file to HDFS |
| -get | Download from HDFS to local |
| -ls | List directory contents |
| -cat | Display file contents |
| -rm | Delete file |
| -mkdir | Create directory |
| -copyFromLocal | Same as -put |
| -copyToLocal | Same as -get |
Q80. ๐ข Differentiate between Apache Hadoop-1 and Hadoop-2 using suitable diagram.
[Asked: Dec 2024 | Frequency: 1]
Comparison Table:
| Aspect | Hadoop 1.x | Hadoop 2.x |
|---|---|---|
| Resource Management | JobTracker | YARN (ResourceManager) |
| Processing | MapReduce only | Multiple frameworks |
| Scalability | ~4000 nodes | ~10000+ nodes |
| Single Point of Failure | Yes (NameNode) | No (HA NameNode) |
| Cluster Utilization | Fixed slots | Dynamic containers |
| Multi-tenancy | Limited | Full support |
Hadoop 1.x Architecture:
Hadoop 2.x Architecture (YARN):
Key Improvements in Hadoop 2.x:
| Feature | Benefit |
|---|---|
| YARN | Separates resource management from processing |
| HA NameNode | Eliminates single point of failure |
| Federation | Multiple namespaces for scalability |
| Containers | Dynamic resource allocation |
| Multi-framework | Supports Spark, Tez, Storm, etc. |
UNIT 6: PROGRAMMING USING MAPREDUCE
Q81. ๐ก What is MapReduce? What is Hadoop MapReduce?
[Asked: Dec 2023, Jun 2023, Jun 2022 | Frequency: 3]
MapReduce is a programming model and processing framework for distributed computing on large datasets across a cluster of computers.
Definition: MapReduce divides a task into two phases - Map (transforms data into key-value pairs) and Reduce (aggregates values by key) - enabling parallel processing of massive datasets.
Core Concepts:
| Phase | Function |
|---|---|
| Map | Processes input → (key, value) pairs |
| Shuffle & Sort | Groups values by key |
| Reduce | Aggregates values for each key |
Key Characteristics:
- Parallel processing
- Fault tolerance
- Data locality
- Scalable to thousands of nodes
Example - Word Count:
Input: "hello world hello"
Map Output: (hello,1), (world,1), (hello,1)
After Shuffle: (hello,[1,1]), (world,[1])
Reduce Output: (hello,2), (world,1)
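The three phases above can be simulated in a few lines of plain Python (a teaching sketch of the model, not Hadoop):

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    """Shuffle & sort: group all values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Reduce: sum the list of counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

pairs = map_phase("hello world hello")
grouped = shuffle_phase(pairs)
result = reduce_phase(grouped)
print(result)
```

Running the pipeline on `"hello world hello"` reproduces the flow shown above: shuffle yields `hello → [1, 1]`, `world → [1]`, and reduce collapses each list to a count.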
Q82. ๐ก Explain the Map function and Reduce function with a suitable block diagram and example.
[Asked: Dec 2023, Jun 2022 | Frequency: 3]
Map Function:
- Input: (key, value) pair
- Output: List of (intermediate_key, intermediate_value) pairs
- Processes each record independently
Reduce Function:
- Input: (key, list of values)
- Output: (key, aggregated_value)
- Combines values for same key
Example - Word Count:
Input File:
Hello World
Hello Hadoop
World of Big Data
Map Phase:
Mapper 1: "Hello World" → (Hello,1), (World,1)
Mapper 2: "Hello Hadoop" → (Hello,1), (Hadoop,1)
Mapper 3: "World of Big Data" → (World,1), (of,1), (Big,1), (Data,1)
Shuffle & Sort:
(Big, [1])
(Data, [1])
(Hadoop, [1])
(Hello, [1,1])
(of, [1])
(World, [1,1])
Reduce Phase:
Reducer: (Hello, [1,1]) → (Hello, 2)
Reducer: (World, [1,1]) → (World, 2)
Reducer: (Hadoop, [1]) → (Hadoop, 1)
...
Q83. ๐ข Give advantages of Hadoop MapReduce.
[Asked: Jun 2023 | Frequency: 1]
Advantages of Hadoop MapReduce:
| Advantage | Description |
|---|---|
| Scalability | Process petabytes across thousands of nodes |
| Fault Tolerance | Automatic task retry on failure |
| Cost-Effective | Runs on commodity hardware |
| Parallel Processing | Distributed computation |
| Data Locality | Moves code to data, not data to code |
| Simplicity | Simple programming model |
| Flexibility | Works with any data type |
Detailed Benefits:
1. Scalability:
- Linear scalability with nodes
- Add machines to increase capacity
- No code changes needed
2. Fault Tolerance:
Node Failure → Detect → Reschedule Task → Continue
- Tasks automatically rerun on other nodes
- Data replicated for reliability
3. Data Locality:
Traditional: Move data → Process
MapReduce: Move code → Process locally
- Reduces network traffic
- Improves performance
4. Cost Savings:
- No expensive specialized hardware
- Open-source software
- Commodity server clusters
Q84. ๐ข Discuss how key-value pair mechanism facilitates MapReduce programming.
[Asked: Jun 2023 | Frequency: 1]
Key-Value Pair Mechanism:
The key-value pair is the fundamental data structure in MapReduce, enabling:
- Parallel processing
- Data grouping
- Distributed computation
How It Works:
| Stage | Input | Output |
|---|---|---|
| Map | (K1, V1) | List of (K2, V2) |
| Shuffle | (K2, V2) pairs | (K2, [V2, V2, ...]) |
| Reduce | (K2, [V2...]) | (K3, V3) |
Benefits:
| Benefit | Explanation |
|---|---|
| Parallelization | Each key-value processed independently |
| Grouping | Same keys automatically grouped |
| Distribution | Keys distributed across reducers |
| Flexibility | Any data can be key or value |
| Sorting | Keys sorted automatically |
Example:
Document: "apple banana apple cherry"
Map Output (K,V pairs):
(apple, 1)
(banana, 1)
(apple, 1)
(cherry, 1)
After Shuffle (grouped by key):
apple → [1, 1]
banana → [1]
cherry → [1]
Reduce Output:
(apple, 2)
(banana, 1)
(cherry, 1)
Why Keys Matter:
- Determine which reducer processes the data
- Enable aggregation and joining
- Allow parallel processing of different keys
Q85. ๐ข Explain Splitting operation of MapReduce.
[Asked: Jun 2023 | Frequency: 1]
Splitting is the first phase where input data is divided into fixed-size chunks called Input Splits for parallel processing.
Characteristics:
| Aspect | Description |
|---|---|
| Split Size | Typically equals HDFS block size (128 MB) |
| Logical Division | Splits are logical, not physical |
| Record Boundary | Respects record boundaries |
| Parallelism | One mapper per split |
Example:
Input File: 384 MB
HDFS Block Size: 128 MB
Splits Created:
- Split 1: 0-128 MB → Mapper 1
- Split 2: 128-256 MB → Mapper 2
- Split 3: 256-384 MB → Mapper 3
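The split arithmetic in the example can be expressed directly (a sketch of the logical-split computation, not Hadoop's actual InputFormat code):

```python
def compute_splits(file_size_mb, split_size_mb=128):
    """Return (start, end) offsets in MB of each logical input split."""
    splits = []
    start = 0
    while start < file_size_mb:
        end = min(start + split_size_mb, file_size_mb)
        splits.append((start, end))
        start = end
    return splits

# One mapper is launched per split.
for i, (start, end) in enumerate(compute_splits(384), 1):
    print(f"Split {i}: {start}-{end} MB -> Mapper {i}")
```

Note that a file size that is not a multiple of the split size simply yields a shorter final split; Hadoop additionally adjusts split ends to respect record boundaries.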
InputFormat Types:
| Format | Description |
|---|---|
| TextInputFormat | Line-by-line (key=offset, value=line) |
| KeyValueInputFormat | Tab-separated key-value |
| SequenceFileInputFormat | Binary format |
| NLineInputFormat | Fixed N lines per split |
Q86. ๐ข Explain Mapping operation of MapReduce.
[Asked: Jun 2023 | Frequency: 1]
Mapping is the phase where user-defined map function processes each input record and emits intermediate key-value pairs.
Map Function Signature:
map(K1 key, V1 value, Context context) {
// Transform input
context.write(K2, V2);
}
Characteristics:
| Aspect | Description |
|---|---|
| Input | One record at a time |
| Output | Zero or more K-V pairs |
| Parallel | Multiple mappers run concurrently |
| Stateless | Each record processed independently |
Example - Word Count Map:
Input Record: (0, "Hello World Hello")
Map Function:
for each word in value:
emit(word, 1)
Output:
(Hello, 1)
(World, 1)
(Hello, 1)
Map Tasks:
- Number of mappers = Number of input splits
- Each mapper processes one split
- Output written to local disk (not HDFS)
Q87. ๐ก What is the role of shuffling and sorting in MapReduce? Explain with word count example.
[Asked: Jun 2024, Jun 2022, Jun 2023 | Frequency: 3]
Shuffle and Sort is the intermediate phase between Map and Reduce that transfers, groups, and sorts data by key.
Roles:
| Phase | Role |
|---|---|
| Shuffle | Transfer map outputs to reducers |
| Sort | Sort data by keys |
| Merge | Merge sorted data from multiple mappers |
Word Count Example:
After Map Phase:
Mapper 1: (Hello,1), (World,1), (Hello,1)
Mapper 2: (Big,1), (Data,1), (Hello,1)
Mapper 3: (World,1), (Data,1)
After Shuffle & Sort:
Reducer 1 receives:
(Big, [1])
(Data, [1,1])
(Hello, [1,1,1])
Reducer 2 receives:
(World, [1,1])
Key Points:
- Partitioner decides which reducer gets which key
- Combiner can reduce data before shuffle (optional optimization)
- Sort ensures reducer gets sorted key order
- Merge combines data from all mappers
Importance:
- Ensures same keys go to same reducer
- Enables aggregation in reduce phase
- Sorted order helps efficient processing
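How the partitioner routes keys to reducers can be sketched in Python. The `partition` function below mimics hash-partitioning using an MD5 digest for determinism (an assumption for illustration; Hadoop's own `HashPartitioner` uses the key's `hashCode`):

```python
import hashlib

def partition(key, num_reducers):
    """Hash-partition a key to a reducer index, deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

num_reducers = 2
map_output = [("Hello", 1), ("World", 1), ("Hello", 1),
              ("Big", 1), ("Data", 1), ("Hello", 1),
              ("World", 1), ("Data", 1)]

# Route each pair; every occurrence of a key lands on the same reducer.
reducer_input = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    reducer_input[partition(key, num_reducers)].append((key, value))
print(reducer_input)
```

Because the partition of a key depends only on the key, all mappers agree on which reducer receives it - which is exactly what makes aggregation in the reduce phase possible.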
Q88. ๐ข Explain Reducing operation of MapReduce.
[Asked: Jun 2023 | Frequency: 1]
Reducing is the final phase where user-defined reduce function aggregates all values for each key into final output.
Reduce Function Signature:
reduce(K2 key, Iterable<V2> values, Context context) {
// Aggregate values
context.write(K3, V3);
}
Characteristics:
| Aspect | Description |
|---|---|
| Input | Key and list of all values for that key |
| Output | Aggregated result per key |
| Sorting | Keys arrive in sorted order |
| Parallelism | Multiple reducers run concurrently |
Example - Word Count Reduce:
Input: (Hello, [1, 1, 1])
Reduce Function:
sum = 0
for each count in values:
sum += count
emit(key, sum)
Output: (Hello, 3)
Reducer Tasks:
- Number configurable by user
- Each reducer handles subset of keys
- Output written to HDFS
- One output file per reducer
Q89. ๐ก Explain word count problem with suitable example. Give pseudo-code for word count problem in MapReduce.
[Asked: Dec 2023, Dec 2022 | Frequency: 3]
Word Count Problem: Count the frequency of each word in a large collection of documents.
Input:
Document 1: "Hello World"
Document 2: "Hello Hadoop World"
Document 3: "Big Data World"
Expected Output:
Big 1
Data 1
Hadoop 1
Hello 2
World 3
Pseudo-code:
Mapper:
function MAP(key, value):
// key: document ID
// value: document content
words = TOKENIZE(value)
for each word in words:
EMIT(word, 1)
Reducer:
function REDUCE(key, values):
// key: word
// values: list of counts [1, 1, 1, ...]
total = 0
for each count in values:
total = total + count
EMIT(key, total)
Java Implementation (Simplified):
// Mapper Class
public void map(LongWritable key, Text value, Context context) {
String[] words = value.toString().split("\\s+");
for (String word : words) {
context.write(new Text(word), new IntWritable(1));
}
}
// Reducer Class
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
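For comparison, the same mapper/reducer logic in Hadoop-Streaming style Python, working over text lines the way streaming jobs exchange data (a sketch of the per-phase functions, not a complete streaming job):

```python
import itertools

def streaming_mapper(lines):
    """Mapper: emit tab-separated 'word<TAB>1' lines, one per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Reducer: input arrives sorted by key, as the shuffle guarantees."""
    def key_of(line):
        return line.split("\t", 1)[0]
    for word, group in itertools.groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{word}\t{total}"

# sorted() here stands in for Hadoop's shuffle & sort phase.
mapped = sorted(streaming_mapper(["Hello World", "Hello Hadoop World"]))
print(list(streaming_reducer(mapped)))
```

In an actual Hadoop Streaming run, the mapper and reducer would be separate scripts reading stdin and writing stdout, with the framework performing the sort between them.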
UNIT 7: OTHER BIG DATA ARCHITECTURES AND TOOLS
Q90. ๐ก What is Apache Spark? In context of Data Science, what is Apache SPARK?
[Asked: Jun 2025, Dec 2023, Jun 2023 | Frequency: 3]
Apache Spark is an open-source, distributed computing framework designed for fast, large-scale data processing and analytics. It provides an interface for programming clusters with implicit data parallelism and fault tolerance.
Definition: Spark is a unified analytics engine that supports batch processing, real-time streaming, machine learning, and graph processing, all in a single framework.
Key Features:
- In-memory computing (100x faster than Hadoop MapReduce)
- Supports multiple languages (Scala, Python, Java, R)
- Unified platform for diverse workloads
- Lazy evaluation for optimization
Core Components:
| Component | Purpose |
|---|---|
| Spark Core | Basic functionality, RDD operations |
| Spark SQL | Structured data processing |
| Spark Streaming | Real-time data processing |
| MLlib | Machine learning library |
| GraphX | Graph processing |
Q91. ๐ด What are the main features/characteristics of Apache Spark framework?
[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]
Key Features of Apache Spark:
| Feature | Description |
|---|---|
| Speed | 100x faster than Hadoop (in-memory) |
| Ease of Use | APIs in Python, Scala, Java, R |
| Generality | SQL, streaming, ML, graph in one platform |
| Fault Tolerance | Automatic recovery from failures |
| Lazy Evaluation | Optimizes execution plan |
| In-Memory Computing | Caches data in RAM |
Detailed Features:
1. In-Memory Processing: Intermediate results are cached in RAM rather than written to disk between stages, sharply reducing I/O.
2. Resilient Distributed Datasets (RDD):
- Immutable distributed collection
- Fault-tolerant through lineage
- Parallel operations
3. DAG Execution Engine: Builds a directed acyclic graph of stages from the transformations and optimizes the whole plan before executing it.
4. Multiple Workload Support:
| Workload | Component | Use Case |
|---|---|---|
| Batch | Spark Core | ETL jobs |
| Interactive | Spark SQL | Ad-hoc queries |
| Real-time | Streaming | Live dashboards |
| ML | MLlib | Predictions |
| Graph | GraphX | Social networks |
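Lazy evaluation can be illustrated with Python generators: transformations merely build a pipeline, and nothing executes until an "action" consumes it (a loose analogy to Spark's model, not Spark itself):

```python
def lazy_map(func, data):
    """Yield func(item) on demand - builds a plan, runs nothing yet."""
    for item in data:
        yield func(item)

def lazy_filter(pred, data):
    """Yield items passing pred, also on demand."""
    for item in data:
        if pred(item):
            yield item

numbers = range(1, 1_000_000)
# Build the pipeline: no element has been processed at this point.
plan = lazy_map(lambda x: x * x, lazy_filter(lambda x: x % 2 == 0, numbers))
# "Action": only now does computation run, and only as far as needed.
first_five = [next(plan) for _ in range(5)]
print(first_five)
```

As in Spark, deferring execution lets the whole chain run element-by-element instead of materializing each intermediate collection.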
Q92. ๐ก How does Apache Spark differ from Hadoop?
[Asked: Jun 2025, Jun 2023 | Frequency: 2]
Comparison Table:
| Aspect | Apache Spark | Hadoop MapReduce |
|---|---|---|
| Processing | In-memory | Disk-based |
| Speed | 100x faster (memory) | Slower (disk I/O) |
| Ease of Use | High-level APIs | Low-level Java code |
| Real-time | Yes (Spark Streaming) | No (batch only) |
| Iterations | Excellent (ML) | Poor (writes to disk) |
| Cost | Higher RAM needs | Lower hardware cost |
| Languages | Scala, Python, Java, R | Primarily Java |
| Caching | In-memory caching | No caching |
When to Use:
- Spark: Iterative algorithms, real-time processing, interactive queries
- Hadoop: Cost-sensitive batch processing, very large cold data
Q93. ๐ข Explain big data processing using Spark ecosystem.
[Asked: Dec 2024 | Frequency: 1]
Spark Ecosystem for Big Data Processing:
Processing Flow:
| Step | Component | Activity |
|---|---|---|
| 1 | Data Ingestion | Load from HDFS, S3, Kafka |
| 2 | Spark Core | Distribute across cluster |
| 3 | Transformation | Filter, map, join operations |
| 4 | Analysis | SQL queries, ML models |
| 5 | Output | Write to storage, serve APIs |
Example Pipeline:
# Read data
df = spark.read.parquet("hdfs://data/sales")
# Transform
cleaned = df.filter(df.amount > 0) \
.groupBy("region") \
.sum("amount")
# ML Model
from pyspark.ml.clustering import KMeans
model = KMeans(k=5).fit(cleaned)
# Output
model.write.save("hdfs://models/customer_segments")
Q94. ๐ข Briefly discuss the purpose of Spark Core.
[Asked: Dec 2023 | Frequency: 1]
Spark Core is the foundational component of Apache Spark that provides:
| Purpose | Description |
|---|---|
| Task Scheduling | Distributes tasks across cluster |
| Memory Management | In-memory data caching |
| Fault Recovery | RDD lineage for recovery |
| I/O Operations | Reading/writing data |
| Basic Operations | Map, reduce, filter, join |
Key Concept - RDD (Resilient Distributed Dataset):
- Resilient: Recovers from node failures
- Distributed: Data spread across nodes
- Dataset: Collection of partitioned data
Q95. ๐ข Briefly discuss the purpose of Spark SQL.
[Asked: Dec 2023 | Frequency: 1]
Spark SQL enables structured data processing using SQL queries and DataFrame API.
Purpose:
| Feature | Description |
|---|---|
| SQL Interface | Query data using SQL syntax |
| DataFrames | Structured API with schema |
| Optimization | Catalyst optimizer for queries |
| Integration | Connect to Hive, JDBC, Parquet |
| Performance | Optimized execution plans |
Example:
# Create DataFrame
df = spark.read.json("customers.json")
# SQL Query
df.createOrReplaceTempView("customers")
result = spark.sql("""
SELECT region, SUM(sales) as total
FROM customers
GROUP BY region
ORDER BY total DESC
""")
# DataFrame API (equivalent)
result = df.groupBy("region") \
.agg(sum("sales").alias("total")) \
.orderBy(desc("total"))
Q96. ๐ข Briefly discuss the purpose of Spark Streaming.
[Asked: Dec 2023 | Frequency: 1]
Spark Streaming processes real-time data streams using micro-batch architecture.
Purpose:
| Feature | Description |
|---|---|
| Real-time Processing | Process live data streams |
| Micro-batching | Small batches (seconds) |
| Fault Tolerance | Exactly-once semantics |
| Integration | Kafka, Flume, Kinesis |
| Unified API | Same code for batch and stream |
Example:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1) # 1-second batches
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
counts = words.countByValue()
counts.pprint()
ssc.start()
Q97. ๐ข Briefly discuss the purpose of MLlib.
[Asked: Dec 2023 | Frequency: 1]
MLlib is Spark's scalable machine learning library for distributed ML algorithms.
Purpose:
| Feature | Description |
|---|---|
| Scalable ML | Train on clusters |
| Algorithms | Classification, regression, clustering |
| Pipelines | ML workflow automation |
| Feature Engineering | Transformers and extractors |
| Model Persistence | Save/load models |
Supported Algorithms:
| Category | Algorithms |
|---|---|
| Classification | Logistic Regression, Decision Trees, Random Forest, SVM |
| Regression | Linear, Ridge, Lasso, Decision Tree |
| Clustering | K-Means, Gaussian Mixture, LDA |
| Recommendation | ALS (Collaborative Filtering) |
| Dimensionality | PCA, SVD |
Example Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["f1","f2","f3"],
outputCol="features")
rf = RandomForestClassifier(numTrees=100)
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
Q98. ๐ข Briefly discuss the purpose of GraphX.
[Asked: Dec 2023 | Frequency: 1]
GraphX is Spark's API for graph-parallel computation.
Purpose:
| Feature | Description |
|---|---|
| Graph Processing | Analyze graph structures |
| Algorithms | PageRank, Connected Components |
| Graph Construction | From RDDs or files |
| Property Graphs | Vertices and edges with properties |
| Pregel API | Iterative graph algorithms |
Built-in Algorithms:
- PageRank: Vertex importance
- Connected Components: Graph clusters
- Triangle Counting: Network density
- Shortest Paths: Distance calculation
Example:
import org.apache.spark.graphx._
// Create graph
val graph = Graph(vertices, edges)
// Run PageRank
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
Q99. ๐ข What is HIVE? Explain the components of HIVE architecture.
[Asked: Jun 2025 | Frequency: 1]
Apache Hive is a data warehouse infrastructure built on Hadoop for data summarization, querying, and analysis using SQL-like language (HiveQL).
Definition: Hive provides SQL interface to query data stored in HDFS, converting queries to MapReduce/Spark jobs.
Architecture Components:
| Component | Purpose |
|---|---|
| Metastore | Stores schema, table definitions |
| Driver | Manages query lifecycle |
| Compiler | Parses and compiles HiveQL |
| Optimizer | Optimizes execution plan |
| Executor | Runs the query plan |
| CLI/UI | User interfaces |
Q100. ๐ก Write short note on HIVE and its utility in Data Science.
[Asked: Jun 2023, Dec 2022 | Frequency: 2]
Apache Hive provides SQL-based data warehouse capabilities on Hadoop.
Key Features:
| Feature | Description |
|---|---|
| HiveQL | SQL-like query language |
| Schema on Read | Define schema at query time |
| Scalability | Process petabytes of data |
| Extensibility | Custom UDFs, SerDes |
| Integration | Works with Hadoop ecosystem |
Utility in Data Science:
| Use Case | How Hive Helps |
|---|---|
| Data Exploration | SQL queries on big data |
| ETL | Transform large datasets |
| Data Warehousing | Structured analysis |
| Reporting | Business intelligence |
| Ad-hoc Queries | Quick data investigation |
Example:
-- Create table
CREATE TABLE sales (
id INT,
product STRING,
amount DOUBLE,
date DATE
) PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;
-- Query
SELECT product, SUM(amount) as total
FROM sales
WHERE year = 2024
GROUP BY product
ORDER BY total DESC
LIMIT 10;
Q101. ๐ข Write short note on HBase and its utility in Data Science.
[Asked: Dec 2023 | Frequency: 1]
Apache HBase is a distributed, scalable NoSQL database built on HDFS for real-time read/write access to big data.
Key Features:
| Feature | Description |
|---|---|
| Column-oriented | Wide-column store |
| Real-time Access | Low-latency reads/writes |
| Scalability | Billions of rows, millions of columns |
| Consistency | Strong consistency model |
| Auto-sharding | Automatic data distribution |
HBase Data Model: Tables are indexed by a row key; columns are grouped into column families, and each cell value is versioned by a timestamp.
Utility in Data Science:
| Use Case | Application |
|---|---|
| Time Series | Sensor data, logs |
| User Profiles | Real-time personalization |
| Messaging | Chat, notifications |
| Metrics | System monitoring |
| Search Indexing | Fast lookups |
UNIT 8: NoSQL DATABASES
Q102. ๐ด What is NoSQL? What are NoSQL databases?
[Asked: Jun 2025, Jun 2024, Dec 2022, Jun 2022 | Frequency: 4]
NoSQL (Not Only SQL) refers to non-relational databases designed for distributed data storage with flexible schemas, horizontal scaling, and high performance for specific use cases.
Definition: NoSQL databases store data in formats other than traditional relational tables, optimized for large-scale, distributed environments with varied data types.
Types of NoSQL Databases:
| Type | Model | Examples |
|---|---|---|
| Key-Value | Key → value pairs | Redis, DynamoDB |
| Document | JSON-like documents | MongoDB, CouchDB |
| Column-Family | Wide-column store | Cassandra, HBase |
| Graph | Nodes and edges | Neo4j |
Comparison with RDBMS:
| Aspect | RDBMS | NoSQL |
|---|---|---|
| Schema | Fixed | Flexible |
| Scaling | Vertical | Horizontal |
| ACID | Full support | Eventual consistency |
| Data Model | Tables | Various (doc, graph, etc.) |
| Joins | Supported | Limited/None |
| Use Case | Complex queries | High volume, velocity |
Q103. ๐ด Explain the features of NoSQL databases. How are NoSQL databases different from RDBMS?
[Asked: Jun 2025, Jun 2024, Dec 2022, Jun 2022 | Frequency: 4]
Features of NoSQL:
| Feature | Description |
|---|---|
| Schema Flexibility | No fixed schema, dynamic structure |
| Horizontal Scaling | Add nodes to scale (sharding) |
| High Availability | Built-in replication |
| High Performance | Optimized for specific access patterns |
| Distributed | Data across multiple servers |
| BASE Model | Basically Available, Soft state, Eventually consistent |
ACID vs BASE:
- ACID (RDBMS): Atomicity, Consistency, Isolation, Durability - strict transactional guarantees.
- BASE (NoSQL): Basically Available, Soft state, Eventually consistent - prioritizes availability and scale over immediate consistency.
Detailed Comparison:
| Aspect | RDBMS | NoSQL |
|---|---|---|
| Data Model | Relational tables | Key-value, Document, Graph, Column |
| Schema | Rigid, predefined | Dynamic, flexible |
| Scalability | Vertical (bigger server) | Horizontal (more servers) |
| Transactions | ACID compliant | BASE model |
| Joins | Complex joins supported | Limited or none |
| Query Language | SQL | Database-specific |
| Consistency | Strong | Eventual |
| Best For | Complex relationships | Big Data, real-time |
Q104. ๐ข What is key-value pair based NoSQL? List the benefits.
[Asked: Dec 2024 | Frequency: 1]
Key-Value Store is the simplest NoSQL database type that stores data as a collection of key-value pairs.
Structure: Key (unique identifier) → Value (opaque to the database: string, JSON, binary, etc.)
Benefits:
| Benefit | Description |
|---|---|
| Simplicity | Easy to understand and use |
| Speed | O(1) lookups by key |
| Scalability | Easy horizontal scaling |
| Flexibility | Value can be any data type |
| Caching | Perfect for cache layer |
| High Throughput | Millions of ops/second |
Popular Databases:
- Redis: In-memory, caching, sessions
- DynamoDB: AWS managed, serverless
- Riak: Distributed, fault-tolerant
Use Cases:
- Session storage
- User preferences
- Shopping carts
- Caching
- Real-time leaderboards
Q105. ๐ข Explain when to use key-value NoSQL database with example.
[Asked: Dec 2024 | Frequency: 1]
When to Use Key-Value Stores:
| Scenario | Why Key-Value Works |
|---|---|
| Simple lookups | Direct access by key |
| High speed needed | In-memory performance |
| Caching | Fast data retrieval |
| Session management | Quick session access |
| No complex queries | Only key-based access |
Example - Session Management:
User logs in → Generate session ID → Store in Redis
Key: "session:abc123def456"
Value: {
"user_id": 12345,
"username": "john_doe",
"login_time": "2024-12-10T10:30:00",
"cart_items": 3,
"preferences": {"theme": "dark"}
}
Operations:
- SET session:abc123 {...} → Store session
- GET session:abc123 → Retrieve session
- EXPIRE session:abc123 3600 → Auto-delete after 1 hour
- DEL session:abc123 → Logout
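The SET/GET/EXPIRE/DEL semantics above can be mimicked with a small in-memory store. This is a hedged Python sketch of the behaviour, not the real Redis client API; the class and method names are invented for illustration:

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with Redis-like SET/GET/EXPIRE/DEL semantics."""
    def __init__(self):
        self._data = {}     # key -> value
        self._expiry = {}   # key -> absolute expiry timestamp (seconds)

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        exp = self._expiry.get(key)
        if exp is not None and time.time() >= exp:
            self.delete(key)          # lazy expiration on read
        return self._data.get(key)

    def expire(self, key, seconds):
        if key in self._data:
            self._expiry[key] = time.time() + seconds

    def delete(self, key):
        self._data.pop(key, None)
        self._expiry.pop(key, None)

store = KeyValueStore()
store.set("session:abc123", {"user_id": 12345, "cart_items": 3})
store.expire("session:abc123", 3600)            # auto-delete after 1 hour
print(store.get("session:abc123")["user_id"])   # 12345
store.delete("session:abc123")                  # logout
print(store.get("session:abc123"))              # None
```

Real key-value databases add persistence, replication, and sharding on top of exactly this access pattern: every operation is addressed by key alone.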
When NOT to Use:
- Complex relationships between data
- Need for joins or aggregations
- Range queries required
- Data has complex structure
Q106. 🟡 What is Graph based NoSQL? Explain when do we need graph database.
[Asked: Jun 2024, Dec 2022 | Frequency: 2]
Graph Database stores data as nodes (entities) and edges (relationships), optimized for traversing connected data.
Structure: (Node) -[:RELATIONSHIP]→ (Node), e.g. (Alice) -[:FRIENDS]→ (Bob).
Components:
| Component | Description |
|---|---|
| Nodes | Entities (people, products) |
| Edges | Relationships between nodes |
| Properties | Attributes on nodes/edges |
| Labels | Node types |
When to Use Graph Database:
| Use Case | Why Graph |
|---|---|
| Social Networks | Friend connections, followers |
| Recommendations | "People who bought X also..." |
| Fraud Detection | Identify suspicious patterns |
| Knowledge Graphs | Connected information |
| Network Analysis | IT infrastructure, routing |
| Access Control | Permission hierarchies |
Example Query (Cypher - Neo4j):
// Find friends of friends
MATCH (user:Person {name: 'Alice'})-[:FRIENDS]->(friend)-[:FRIENDS]->(fof)
WHERE NOT (user)-[:FRIENDS]->(fof) AND user <> fof
RETURN fof.name AS Recommendation
Q107. 🟢 List the features of Column-based databases.
[Asked: Dec 2022 | Frequency: 1]
Column-Family Database (Wide-Column Store) stores data in column families rather than rows.
Structure:
Row-Oriented (RDBMS): Column-Oriented (NoSQL):
┌────┬──────┬─────┬─────┐ ┌──────────────────────┐
│ ID │ Name │ Age │City │ │ ID: 1, 2, 3, 4 │
├────┼──────┼─────┼─────┤ │ Name: A, B, C, D │
│ 1 │ A │ 25 │ NYC │ │ Age: 25,30,28,35 │
│ 2 │ B │ 30 │ LA │ │ City: NYC,LA,CHI,SF │
│ 3 │ C │ 28 │ CHI │ └──────────────────────┘
│ 4 │ D │ 35 │ SF │
└────┴──────┴─────┴─────┘ Better for analytics
Better for transactions (read specific columns)
Features:
| Feature | Description |
|---|---|
| Column Families | Related columns grouped |
| Sparse Storage | Only stores non-null values |
| High Write Throughput | Append-only writes |
| Time-Series Friendly | Efficient time-stamped data |
| Horizontal Scaling | Easy sharding |
| Compression | Same-type data compresses well |
Popular Databases:
- Apache Cassandra
- Apache HBase
- Google Bigtable
Best For:
- Time-series data
- IoT sensor data
- Event logging
- Analytics workloads
UNIT 9: MINING BIG DATA - SIMILARITY
Q108. 🟡 Define the term Similarity.
[Asked: Jun 2024, Jun 2022 | Frequency: 2]
Similarity is a measure that quantifies how alike or close two data objects are based on their features or attributes.
Definition: Similarity is a numerical measure (typically between 0 and 1) where 1 indicates identical objects and 0 indicates completely different objects.
Key Concepts:
| Concept | Description |
|---|---|
| Similarity | How alike two objects are (0 to 1) |
| Distance | How different two objects are |
| Relationship | Similarity = 1 - Normalized Distance |
Types of Similarity Measures: Jaccard similarity (for sets), Cosine similarity (for vectors), and measures derived from distances such as Euclidean and edit distance.
Applications:
- Document similarity (plagiarism detection)
- Recommendation systems
- Clustering
- Near-duplicate detection
- Search engines
Q109. 🟢 Explain the Jaccard similarity of sets with the help of an example.
[Asked: Jun 2022 | Frequency: 1]
Jaccard Similarity measures the similarity between two sets as the ratio of their intersection to their union.
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Example:
Set A = {apple, banana, orange, mango} Set B = {banana, orange, grape, kiwi}
| Operation | Result |
|---|---|
| A ∩ B | {banana, orange} |
| A ∪ B | {apple, banana, orange, mango, grape, kiwi} |
| |A ∩ B| | 2 |
| |A ∪ B| | 6 |
Interpretation: J(A, B) = 2/6 ≈ 0.33, so the sets are about 33% similar.
Properties:
- Range: 0 ≤ J(A,B) ≤ 1
- J(A,A) = 1 (identical sets)
- J(A,B) = 0 when A ∩ B = ∅
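The same computation in Python, using set operations (a minimal sketch):

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0   # convention: two empty sets are identical
    return len(a & b) / len(a | b)

A = {"apple", "banana", "orange", "mango"}
B = {"banana", "orange", "grape", "kiwi"}
print(round(jaccard(A, B), 3))   # 2/6 = 0.333
```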
Q110. 🟢 What do you understand by the term 'Finding Similar Documents'?
[Asked: Jun 2025 | Frequency: 1]
Finding Similar Documents is the process of identifying documents that share significant content, structure, or meaning with a given document or each other.
Why It Matters:
| Application | Use Case |
|---|---|
| Plagiarism Detection | Identify copied content |
| Search Engines | Find relevant results |
| News Aggregation | Group related stories |
| Recommendation | Suggest similar articles |
| Deduplication | Remove near-duplicates |
Challenge with Big Data:
- Comparing every pair: O(n²) comparisons
- For 1 million documents: about 500 billion comparisons
- Need efficient approximate methods
Solution Pipeline: Documents → Shingling → MinHash signatures → LSH bucketing → Candidate pairs → Verify similarity
Q111. 🟢 What are the various concepts of document similarity analysis?
[Asked: Jun 2025 | Frequency: 1]
Key Concepts in Document Similarity:
1. Shingling (k-grams): Convert document to set of overlapping substrings.
Document: "the quick brown"
3-shingles: {"the", "he ", "e q", " qu", "qui", ...}
2. MinHashing: Create compact signatures that estimate Jaccard similarity.
| Property | Description |
|---|---|
| Input | Set of shingles |
| Output | Fixed-size signature |
| Key Property | Pr(minhash(A) = minhash(B)) = J(A,B) |
3. Locality Sensitive Hashing (LSH): Hash similar documents to same buckets with high probability.
4. Similarity Measures:
| Measure | Formula | Best For |
|---|---|---|
| Jaccard | |A∩B|/|A∪B| | Sets |
| Cosine | A·B/(|A||B|) | Vectors |
| Edit Distance | Min edits to transform | Strings |
Q112. 🟡 Explain how the similarity between two documents can be found.
[Asked: Jun 2024, Dec 2022 | Frequency: 2]
Step-by-Step Document Similarity:
Step 1: Preprocessing
- Remove stopwords (the, is, a)
- Convert to lowercase
- Stemming/Lemmatization
Step 2: Representation
| Method | Description |
|---|---|
| Bag of Words | Word frequency vector |
| TF-IDF | Weighted word importance |
| Shingles | Set of k-grams |
Step 3: Calculate Similarity
Example - Cosine Similarity:
Doc1: "data science is fun"
Doc2: "science of data analysis"
Vocabulary: {data, science, is, fun, of, analysis}
Vector1: [1, 1, 1, 1, 0, 0]
Vector2: [1, 1, 0, 0, 1, 1]
Cosine = (1×1 + 1×1 + 1×0 + 1×0 + 0×1 + 0×1) / (√4 × √4)
= 2 / 4 = 0.5
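The calculation above can be checked with a few lines of Python (vectors written against the same vocabulary):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Vocabulary: {data, science, is, fun, of, analysis}
v1 = [1, 1, 1, 1, 0, 0]   # "data science is fun"
v2 = [1, 1, 0, 0, 1, 1]   # "science of data analysis"
print(cosine_similarity(v1, v2))   # 2 / (2 * 2) = 0.5
```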
Q113. 🟢 Compare Minhashing and Locality Sensitive Hashing for document similarity.
[Asked: Jun 2025 | Frequency: 1]
Comparison:
| Aspect | MinHashing | LSH |
|---|---|---|
| Purpose | Compress set signatures | Find candidate pairs |
| Input | Set of shingles | MinHash signatures |
| Output | Fixed-size signature | Candidate similar pairs |
| Complexity | O(n × k) per doc | O(n) for all docs |
| Preserves | Jaccard similarity | Similarity threshold |
MinHashing Process:
Shingle Set → Apply h hash functions → Signature (h values)
Signature preserves: Pr(sig[i] matches) ≈ Jaccard(A,B)
LSH Process:
Signatures → Divide into b bands of r rows
→ Hash each band
→ Similar docs hash to same bucket
Trade-off in LSH:
- More bands (b): More false positives, fewer misses
- More rows (r): Fewer false positives, more misses
- Threshold ≈ (1/b)^(1/r)
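A compact Python sketch of both ideas. The hash family here is a hypothetical salting of Python's built-in hash(), chosen for brevity rather than matching any textbook scheme:

```python
import random

def minhash_signature(shingles, num_hashes=100, seed=42):
    """One MinHash value per hash function: the minimum hash over all shingles."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    # Hypothetical hash family: salt Python's built-in hash() per function
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_threshold(b, r):
    """Approximate similarity threshold for b bands of r rows: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

docA = {"the", "he ", "e q", " qu"}   # shingle sets
docB = {"the", "he ", "e q", "xyz"}
sigA = minhash_signature(docA)
sigB = minhash_signature(docB)
est = estimate_jaccard(sigA, sigB)    # estimates true Jaccard = 3/5 = 0.6
print(round(est, 2), round(lsh_threshold(20, 5), 2))
```

With 100 hash functions the estimate is close to the true Jaccard value of 0.6; banding the 100-value signatures into b = 20 bands of r = 5 rows gives an LSH similarity threshold of roughly 0.55.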
Q114. 🟢 What is a Euclidean distance measure?
[Asked: Jun 2025 | Frequency: 1]
Euclidean Distance is the straight-line distance between two points in n-dimensional space.
Formula (2D): d(p, q) = √((x₂-x₁)² + (y₂-y₁)²)
Formula (n-dimensional): d(p, q) = √(Σᵢ(pᵢ-qᵢ)²)
Example:
- Point P = (1, 2, 3)
- Point Q = (4, 6, 8)
- d(P, Q) = √((4-1)² + (6-2)² + (8-3)²) = √(9 + 16 + 25) = √50 ≈ 7.07
Properties:
- Always ≥ 0
- d(p,q) = 0 iff p = q
- Symmetric: d(p,q) = d(q,p)
- Triangle inequality: d(p,r) ≤ d(p,q) + d(q,r)
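The n-dimensional formula translates directly into Python; the example points give √50 ≈ 7.07:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

P = (1, 2, 3)
Q = (4, 6, 8)
print(round(euclidean(P, Q), 2))   # sqrt(9 + 16 + 25) = sqrt(50) = 7.07
```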
Q115. 🟢 How does Euclidean distance differ from cosine distance?
[Asked: Jun 2025 | Frequency: 1]
Key Differences:
| Aspect | Euclidean Distance | Cosine Distance |
|---|---|---|
| Measures | Magnitude of difference | Angle between vectors |
| Formula | √Σ(pᵢ - qᵢ)² | 1 - cos(θ) |
| Range | 0 to ∞ | 0 to 2 |
| Sensitive to | Magnitude | Direction only |
| Best for | Actual distances | Text similarity |
Example:
A = (1, 0)
B = (2, 0)
C = (0, 1)
Euclidean:
d(A,B) = 1 (B is closer to A)
d(A,C) = √2 ≈ 1.41
Cosine:
cos(A,B) = 1 → distance = 0 (same direction)
cos(A,C) = 0 → distance = 1 (perpendicular)
When to Use:
| Use Case | Recommended |
|---|---|
| Text documents | Cosine (ignores doc length) |
| Geographic points | Euclidean |
| High-dimensional sparse | Cosine |
| Dense numerical data | Euclidean |
Q116. 🟢 What is the purpose of a distance measure?
[Asked: Dec 2024 | Frequency: 1]
Purpose of Distance Measures:
| Purpose | Description |
|---|---|
| Quantify Difference | Numerical measure of dissimilarity |
| Clustering | Group similar objects |
| Classification | k-NN algorithm |
| Anomaly Detection | Identify outliers |
| Search | Find nearest neighbors |
Common Distance Measures:
| Measure | Formula | Use Case |
|---|---|---|
| Euclidean | √Σ(xᵢ-yᵢ)² | General purpose |
| Manhattan | Σ|xᵢ-yᵢ| | Grid-based, outlier robust |
| Cosine | 1 - cos(θ) | Text, sparse data |
| Jaccard | 1 - J(A,B) | Sets, binary data |
| Hamming | Count of differences | Binary strings |
| Edit | Min edits | Strings |
Q117. 🟢 Differentiate between cosine distance and edit distance with example.
[Asked: Dec 2024 | Frequency: 1]
Comparison:
| Aspect | Cosine Distance | Edit Distance |
|---|---|---|
| Input Type | Vectors | Strings |
| Measures | Angular difference | Character operations |
| Operations | Dot product | Insert, Delete, Replace |
| Range | 0 to 2 | 0 to max(len(s1), len(s2)) |
| Use Case | Document similarity | Spell checking, DNA |
Cosine Distance Example:
Doc1: "the cat sat" → Vector: [1, 1, 1, 0, 0]
Doc2: "the dog ran" → Vector: [1, 0, 0, 1, 1]
Vocabulary: [the, cat, sat, dog, ran]
Cosine Similarity = (1×1 + 1×0 + 1×0 + 0×1 + 0×1) / (√3 × √3)
= 1/3 = 0.33
Cosine Distance = 1 - 0.33 = 0.67
Edit Distance Example:
String1: "kitten"
String2: "sitting"
Operations:
1. kitten → sitten (replace k with s)
2. sitten → sittin (replace e with i)
3. sittin → sitting (insert g)
Edit Distance = 3
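The same result falls out of the standard dynamic-programming (Levenshtein) algorithm, sketched here in Python:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via a (m+1) x (n+1) dynamic-programming table."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                            # delete all of s1's prefix
    for j in range(n + 1):
        dp[0][j] = j                            # insert all of s2's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # 3
```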
UNIT 10: MINING DATA STREAMS
Q118. 🟡 What are Data Streams? Explain Data Streams.
[Asked: Jun 2025, Jun 2023, Jun 2022 | Frequency: 3]
Data Stream is a continuous, unbounded sequence of data elements generated at rapid rates that must be processed in real-time or near real-time.
Definition: A data stream is an ordered sequence of data items that arrive continuously over time, often too fast and voluminous to store entirely.
Characteristics:
| Characteristic | Description |
|---|---|
| Continuous | Never-ending flow of data |
| High Velocity | Rapid arrival rate |
| Unbounded | Potentially infinite |
| Time-Sensitive | Must process quickly |
| Single Pass | Cannot re-read easily |
| Evolving | Patterns change over time |
Examples of Data Streams:
| Domain | Stream Type |
|---|---|
| Finance | Stock tickers, transactions |
| Social Media | Tweets, posts, likes |
| IoT | Sensor readings |
| Telecom | Call records, network logs |
| Web | Clickstreams, search queries |
Q119. 🟡 Why is Data Stream mining/processing a challenging task in Data Science?
[Asked: Jun 2025, Jun 2023 | Frequency: 2]
Challenges in Data Stream Processing:
| Challenge | Description |
|---|---|
| Volume | Massive amounts of data |
| Velocity | High arrival rate |
| Single Pass | Cannot store all data |
| Memory Limits | Limited RAM for processing |
| Real-time | Must respond quickly |
| Concept Drift | Patterns change over time |
Technical Challenges:
- Memory Constraint: Can't store entire stream
- One-Pass Processing: Each item seen once
- Approximate Algorithms: Must sacrifice accuracy
- Concept Drift: Model becomes outdated
- Out-of-Order Data: Events may arrive late
- Load Spikes: Sudden bursts of data
Q120. 🟢 Explain the characteristics of data streams.
[Asked: Dec 2022 | Frequency: 1]
Key Characteristics:
| Characteristic | Description |
|---|---|
| Continuous | Endless flow, no defined end |
| Rapid | High data arrival rate |
| Unbounded | Potentially infinite size |
| Temporal | Time is crucial dimension |
| Ordered | Sequence matters |
| Ephemeral | Old data may be discarded |
Formal Model:
Stream S = (s₁, s₂, s₃, ..., sₙ, ...)
Where:
- sᵢ arrives at time tᵢ
- tᵢ < tᵢ₊₁ (ordered)
- n → ∞ (unbounded)
Processing Constraints:
- Limited memory
- Limited processing time per element
- Approximate answers acceptable
- Single pass over data
Q121. 🟡 How do Data Streams differ from Databases?
[Asked: Jun 2023, Jun 2022 | Frequency: 2]
Comparison:
| Aspect | Database (DBMS) | Data Stream (DSMS) |
|---|---|---|
| Data | Persistent, stored | Transient, flowing |
| Size | Finite | Potentially infinite |
| Access | Random, multiple | Sequential, once |
| Query | On-demand | Continuous |
| Answer | Exact | Approximate |
| Processing | Any time | Real-time |
| Update | Insert, Update, Delete | Append only |
| Storage | Disk-based | Memory-based |
Query Model Difference:
- DBMS: "Find all sales > $1000" → Run once, get answer
- DSMS: "Alert when sale > $1000" → Runs continuously
Q122. 🟡 Differentiate between DSMS and DBMS with diagram.
[Asked: Dec 2023, Jun 2024 | Frequency: 2]
DSMS vs DBMS:
| Feature | DBMS | DSMS |
|---|---|---|
| Data Model | Relations/Tables | Streams |
| Query Type | One-time | Continuous |
| Data Arrival | Static or slow | Rapid, continuous |
| Storage | Persistent | Transient windows |
| Processing | Pull-based | Push-based |
| Results | Complete, exact | Incremental, approximate |
Query Execution:
DBMS:
Query → Execute Once → Return All Results → Done
DSMS:
Query → Register → Execute Continuously →
→ Stream Results → Never Ends
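The continuous-query idea can be mimicked with a Python generator: the predicate is registered once and produces a stream of answers as events arrive. This is a toy sketch, not a real DSMS API:

```python
def standing_query(stream, predicate):
    """A standing query: registered once, emits an answer for every matching event."""
    for event in stream:
        if predicate(event):
            yield event

# DSMS-style: "Alert when sale > $1000" runs for the life of the stream
sales = iter([500, 1200, 800, 15000, 300])
alerts = standing_query(sales, lambda amount: amount > 1000)
print(list(alerts))   # [1200, 15000]
```

The key contrast with a DBMS query is visible in the control flow: results are pushed out incrementally as events arrive, rather than pulled once from stored tables.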
Q123. 🟢 Discuss the issues and challenges of data stream.
[Asked: Jun 2024 | Frequency: 1]
Major Issues and Challenges:
| Category | Issue | Description |
|---|---|---|
| Resource | Memory | Can't store all data |
| Resource | CPU | High processing demand |
| Data | Volume | Massive data amounts |
| Data | Velocity | Rapid arrival |
| Data | Quality | Missing/noisy data |
| Processing | Single Pass | One chance to process |
| Processing | Real-time | Strict time constraints |
| Analytics | Concept Drift | Patterns change |
| Analytics | Approximation | Exact answers impossible |
Q124. 🟢 What do you mean by data stream processing?
[Asked: Dec 2024 | Frequency: 1]
Data Stream Processing is the continuous computation and analysis of data as it flows through a system, without storing it permanently.
Key Concepts:
| Concept | Description |
|---|---|
| Event | Single data item in stream |
| Window | Subset of stream for analysis |
| Operator | Transformation on stream |
| Pipeline | Chain of operators |
| Sink | Output destination |
Processing Models:
| Model | Description |
|---|---|
| Record-at-a-time | Process each event individually |
| Micro-batch | Small batches (Spark Streaming) |
| True Streaming | Continuous (Flink, Storm) |
Q125. 🟢 Which model of data stream processing is useful in finding stock market trends?
[Asked: Dec 2024 | Frequency: 1]
Sliding Window Model is most useful for stock market trend analysis.
Why Sliding Window:
| Reason | Explanation |
|---|---|
| Recent Data | Latest data most relevant |
| Continuous Update | Trends update in real-time |
| Fixed Size | Consistent analysis period |
| Forget Old | Outdated data discarded |
Types of Windows: Sliding (fixed size, moves with time), Tumbling (fixed size, non-overlapping), Landmark (fixed start point, growing).
Stock Market Example:
Window Size: 5 minutes
Slide: 1 minute
Time 10:00 - Window: [09:55 - 10:00] → Moving Average = $150.25
Time 10:01 - Window: [09:56 - 10:01] → Moving Average = $150.40
Time 10:02 - Window: [09:57 - 10:02] → Moving Average = $150.55
...continues...
Use Cases:
- Moving averages
- Trend detection
- Volume analysis
- Anomaly detection (flash crashes)
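A sliding-window moving average can be sketched in a few lines of Python. For brevity this version is count-based (last n prices) rather than time-based like the 5-minute window above, but the mechanics are the same:

```python
from collections import deque

class SlidingAverage:
    """Count-based sliding window: the deque automatically drops the oldest price."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, price):
        self.window.append(price)
        return sum(self.window) / len(self.window)

ma = SlidingAverage(size=3)
for price in [150.00, 150.50, 150.25, 151.00]:
    avg = ma.update(price)
print(round(avg, 2))   # (150.50 + 150.25 + 151.00) / 3 = 150.58
```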
Q126. 🟡 Compare Ad-hoc Queries and Standing Queries of data streams.
[Asked: Dec 2023 | Frequency: 3]
Comparison:
| Aspect | Ad-hoc Query | Standing Query |
|---|---|---|
| Execution | Once | Continuous |
| Duration | Finite | Indefinite |
| Result | Single answer | Stream of answers |
| Trigger | User-initiated | Event-driven |
| Data Scope | Historical + Current | Current + Future |
| Storage | Needs history | Window-based |
Examples:
| Type | Example |
|---|---|
| Ad-hoc | "What was average temperature yesterday?" |
| Standing | "Alert me when temperature > 40°C" |
| Ad-hoc | "Show sales report for Q3" |
| Standing | "Notify on transactions > $10,000" |
Q127. 🟢 Compare Land Mark Model and Sliding Windows Model.
[Asked: Jun 2023 | Frequency: 1]
Comparison:
| Aspect | Landmark Model | Sliding Window Model |
|---|---|---|
| Start Point | Fixed timestamp | Moves with time |
| Data Included | From landmark to now | Last n items/time |
| Memory | Grows over time | Fixed size |
| Use Case | Cumulative stats | Recent trends |
Example:
| Model | Query | Result |
|---|---|---|
| Landmark | "Total sales since store opening" | Cumulative sum |
| Sliding | "Sales in last 7 days" | Recent total |
When to Use:
| Use Landmark | Use Sliding |
|---|---|
| Cumulative statistics | Recent trends |
| Growing aggregates | Moving averages |
| Historical analysis | Real-time monitoring |
| Audit trails | Anomaly detection |
Q128. 🟢 Explain any one mechanism of filtering of data streams.
[Asked: Dec 2022 | Frequency: 1]
Bloom Filter - Efficient mechanism for filtering data streams.
Purpose: Quickly test whether an element is a member of a set with minimal memory.
Properties:
- Space-efficient probabilistic data structure
- No false negatives (if it says "no", the element is definitely not in the set)
- Possible false positives (if it says "yes", the element might be in the set)
How It Works:
1. Create bit array of size m (all 0s)
2. Use k hash functions
3. To ADD element:
- Hash element k times
- Set those bit positions to 1
4. To CHECK element:
- Hash element k times
- If ALL positions are 1 → "Probably in set"
- If ANY position is 0 → "Definitely not in set"
Use Cases:
- Spam filtering
- Cache checking
- Database lookups
- Network routing
Q129. 🟡 What is Bloom Filtering? Explain with example.
[Asked: Jun 2024, Jun 2022 | Frequency: 2]
Bloom Filter is a space-efficient probabilistic data structure for set membership testing.
Components:
| Component | Description |
|---|---|
| Bit Array | m bits, initially all 0 |
| Hash Functions | k independent hash functions |
| Insert | Set k bits to 1 |
| Query | Check if all k bits are 1 |
Example:
Setup: m = 10 bits, k = 3 hash functions
Initial Array: [0][0][0][0][0][0][0][0][0][0]
0 1 2 3 4 5 6 7 8 9
Insert "cat":
h1("cat") = 1
h2("cat") = 4
h3("cat") = 7
Array: [0][1][0][0][1][0][0][1][0][0]
Insert "dog":
h1("dog") = 2
h2("dog") = 4 (already 1)
h3("dog") = 9
Array: [0][1][1][0][1][0][0][1][0][1]
Query "cat": Check 1,4,7 → All 1 → "Probably in set" ✓
Query "bird": Check 3,6,8 → Position 3 is 0 → "Not in set" ✓
Query "rat": Check 1,4,9 → All 1 → "Probably in set"
(FALSE POSITIVE - rat was never added!)
Trade-off:
- Smaller array → More false positives
- More hash functions → Better accuracy but slower
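The worked example above can be reproduced with a tiny Python sketch. The hash family here is a hypothetical salting of Python's built-in hash(), so the exact bit positions differ from the worked example, but the probably-in-set / definitely-not-in-set behaviour is identical:

```python
class BloomFilter:
    """Space-efficient probabilistic set-membership test."""
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Hypothetical hash family: salt Python's built-in hash() with the index
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # All k bits set -> "probably in set"; any bit 0 -> "definitely not in set"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=100, k=3)
bf.add("cat")
bf.add("dog")
print(bf.might_contain("cat"))   # True: added items are never missed
print(bf.might_contain("dog"))   # True
```

Queries for items that were never added usually return False, but can occasionally return True (a false positive) when their k positions happen to collide with bits set by other items.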
UNIT 11: LINK ANALYSIS
Q130. 🟡 What is Link Analysis? Explain the term.
[Asked: Jun 2024, Dec 2023 | Frequency: 2]
Link Analysis is a technique that examines the relationships (links) between objects to extract meaningful information about their structure, importance, and connectivity.
Definition: Link analysis studies the hyperlink structure of the web or any network to understand relationships, determine importance of nodes, and discover patterns.
Key Concepts:
| Concept | Description |
|---|---|
| Node | Entity (webpage, person) |
| Edge/Link | Connection between nodes |
| In-links | Links pointing to a node |
| Out-links | Links from a node to others |
| Anchor Text | Text describing the link |
Applications:
| Domain | Application |
|---|---|
| Search Engines | PageRank, HITS |
| Social Networks | Influence analysis |
| Fraud Detection | Suspicious patterns |
| Citation Analysis | Research impact |
| Counter-terrorism | Network mapping |
Q131. 🔴 What is the purpose of link analysis? How can link analysis be used for WWW and to compute PageRank?
[Asked: Dec 2023, Dec 2022 | Frequency: 3]
Purpose of Link Analysis:
| Purpose | Description |
|---|---|
| Rank Pages | Determine page importance |
| Discover Structure | Understand web topology |
| Find Communities | Cluster related pages |
| Detect Spam | Identify manipulation |
| Improve Search | Better relevance ranking |
Link Analysis for WWW:
The web can be viewed as a directed graph where:
- Nodes = Web pages
- Edges = Hyperlinks
Key Insight: A link from page A to page B is like a "vote" of confidence for B.
PageRank Computation Using Links:
PageRank Formula: PR(p) = (1-d)/N + d × Σ_{q ∈ B_p} PR(q)/L(q)
Where:
- d = Damping factor (typically 0.85)
- N = Total number of pages
- $B_p$ = Set of pages linking to p
- L(q) = Number of outbound links from q
Algorithm Steps:
1. Initialize all pages with PR = 1/N
2. Iterate: redistribute PR through links
3. Repeat until convergence
Q132. 🟢 What is PageRank?
[Asked: Jun 2024 | Frequency: 1]
PageRank is an algorithm developed by Google founders Larry Page and Sergey Brin to rank web pages based on their importance determined by the link structure.
Core Principle: A page is important if many important pages link to it.
Key Properties:
| Property | Description |
|---|---|
| Recursive | Importance depends on linkers' importance |
| Democratic | Each page gets equal vote initially |
| Iterative | Computed through repeated calculations |
| Probabilistic | Based on random surfer model |
Random Surfer Model:
Imagine a person randomly browsing:
- With probability d (0.85): Follow a random link
- With probability 1-d (0.15): Jump to a random page
PageRank = Probability surfer ends up on that page
Q133. 🔴 Explain PageRank algorithm with suitable example.
[Asked: Jun 2024, Jun 2023, Jun 2022 | Frequency: 3]
PageRank Algorithm:
Step 1: Build the Web Graph
Pages: A, B, C
Links: A→B, A→C, B→C, C→A
Step 2: Initialize
- N = 3 pages
- Initial PR = 1/N = 0.33 for each page
- Damping factor d = 0.85
Step 3: Iterate
Iteration 1:
| Page | Calculation | New PR |
|---|---|---|
| A | (1-0.85)/3 + 0.85 × (0.33/1) | 0.05 + 0.28 = 0.33 |
| B | (1-0.85)/3 + 0.85 × (0.33/2) | 0.05 + 0.14 = 0.19 |
| C | (1-0.85)/3 + 0.85 × (0.33/2 + 0.33/1) | 0.05 + 0.42 = 0.47 |
After Several Iterations (Converged):
| Page | Final PageRank |
|---|---|
| A | 0.39 |
| B | 0.21 |
| C | 0.40 |
Interpretation: Page C ends up with the highest rank because both A and B link to it; A is almost as high because C passes its entire rank to A, and B is lowest because it receives only half of A's rank.
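The three steps above can be sketched as a short power iteration in Python (an illustration, not production code); it solves the PageRank equations for this A, B, C graph directly:

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration over a web graph given as {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}       # Step 2: initialize every page to 1/N
    for _ in range(iterations):          # Step 3: iterate until (near) convergence
        pr = {p: (1 - d) / n
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

# Step 1: the web graph from the example (A->B, A->C, B->C, C->A)
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
```

Because the graph has no dead ends, the ranks always sum to 1; the iteration converges geometrically at rate d.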
Q134. 🟢 Explain the rank computation using MapReduce.
[Asked: Jun 2024 | Frequency: 1]
PageRank with MapReduce:
PageRank computation is iterative and can be parallelized using MapReduce.
Data Structure: Each page stores: (PageID, CurrentRank, [OutLinks])
Map Phase:
For each page P with rank R and outlinks [L1, L2, ...]:
- Emit (P, [L1, L2, ...]) // Preserve structure
- For each outlink Li:
Emit (Li, R/num_outlinks) // Distribute rank
Reduce Phase:
For page P receiving:
- Outlinks list [L1, L2, ...]
- Rank contributions [r1, r2, ...]
NewRank = (1-d)/N + d × sum(contributions)
Emit (P, NewRank, [L1, L2, ...])
Iterations: Run multiple MapReduce jobs until PageRank converges.
Q135. 🟢 Write short note on Different mechanisms of finding PageRank.
[Asked: Jun 2023 | Frequency: 1]
Mechanisms for Computing PageRank:
1. Power Iteration Method:
- Most common approach
- Iteratively multiply rank vector by transition matrix
- Stop when ranks converge
r(k+1) = M × r(k)
Repeat until ||r(k+1) - r(k)|| < ε
2. Matrix Formulation:
- Solve r = M × r (an eigenvector problem)
- PageRank is the principal eigenvector of the transition matrix
3. MapReduce Computation:
- Distributed computation for large graphs
- Parallel processing across clusters
4. Monte Carlo Simulation:
- Simulate random walks
- Count visit frequency to each page
- Approximate PageRank from frequencies
5. Algebraic Methods:
- Gaussian elimination
- LU decomposition
- Suitable for small graphs only
Comparison:
| Method | Scale | Accuracy | Speed |
|---|---|---|---|
| Power Iteration | Large | High | Medium |
| MapReduce | Very Large | High | Fast (parallel) |
| Monte Carlo | Large | Approximate | Fast |
| Algebraic | Small | Exact | Slow |
Q136. 🟢 Write short note on Sensitive PageRank.
[Asked: Dec 2023 | Frequency: 1]
Sensitive PageRank (also called Topic-Sensitive PageRank) is a variation that computes personalized rankings based on user interests or specific topics.
Motivation: Standard PageRank gives same ranking for all users, but relevance varies by context.
How It Works:
| Aspect | Standard PR | Sensitive PR |
|---|---|---|
| Teleportation | Random page | Topic-related pages |
| Bias | None | Toward preferred topics |
| Result | One ranking | Multiple rankings |
Formula Modification:
- Standard: random jump to any page with probability 1-d
- Sensitive: random jump restricted to topic pages: PR_T(p) = (1-d) × I_T(p)/|T| + d × Σ_{q ∈ B_p} PR_T(q)/L(q)
Where I_T(p) = 1 if page p is in topic T, else 0, and |T| is the number of pages in topic T.
Applications:
- Personalized search results
- Topic-specific recommendations
- User preference modeling
Q137. 🟢 Explain the spider trap problem in PageRank.
[Asked: Dec 2024 | Frequency: 1]
Spider Trap occurs when a group of pages only link to each other, trapping the PageRank and absorbing all the rank over iterations.
Problem Description:
A spider trap is a set of pages where:
- All outlinks stay within the set
- No outlinks lead outside
- PageRank flows in but never out
Effect on PageRank:
| Iteration | A | B | T1 | T2 |
|---|---|---|---|---|
| Initial | 0.25 | 0.25 | 0.25 | 0.25 |
| After many | 0.0 | 0.0 | 0.5 | 0.5 |
All rank gets absorbed by the trap!
Solution: Taxation/Teleportation (damping factor)
- With probability 1-d, jump to a random page
- Prevents complete absorption
Q138. 🟢 Explain the dead-end problem in PageRank.
[Asked: Dec 2024 | Frequency: 1]
Dead-End (Dangling Node) is a page with no outgoing links, causing PageRank to leak out of the system.
Problem Description:
When a random surfer reaches a dead-end:
- No links to follow
- PageRank has nowhere to go
- Total PageRank decreases over iterations
Effect on PageRank:
| Iteration | A | B | Dead-End | Total |
|---|---|---|---|---|
| Initial | 0.33 | 0.33 | 0.33 | 1.00 |
| Next | 0.17 | 0.17 | 0.28 | 0.62 |
| Later | 0.08 | 0.08 | 0.15 | 0.31 |
| ... | 0.0 | 0.0 | 0.0 | 0.0 |
PageRank leaks out and eventually becomes zero!
Solutions:
| Solution | Description |
|---|---|
| Teleportation | Dead-end teleports to random page |
| Self-link | Add link from dead-end to itself |
| Remove | Eliminate dead-ends from graph |
| Redistribute | Distribute dead-end's PR equally |
Q139. 🟢 Discuss the solutions for spider trap and dead-end problem in PageRank.
[Asked: Dec 2024 | Frequency: 1]
Combined Solution: Random Teleportation (Damping Factor)
The Solution Formula: PR(p) = (1-d)/N + d × Σ_{q ∈ B_p} PR(q)/L(q), where the (1-d)/N teleportation term guarantees every page keeps receiving some rank.
How It Solves Both Problems:
| Problem | How Teleportation Helps |
|---|---|
| Spider Trap | With prob 1-d, jump OUT of trap |
| Dead-End | With prob 1-d, jump to random page |
Dead-End Specific Solutions:
- Prune dead-ends: Remove recursively
- Redistribute: Dead-end's PR split equally to all pages
- Self-loop: Dead-end links to itself
Spider Trap Specific Solutions:
- Taxation: Force some PR to leave (damping)
- Trust pages: Only count trusted links
- TrustRank: Propagate trust from a seed set
Typical Parameters:
- d = 0.85 (follow link)
- 1-d = 0.15 (teleport)
Q140. 🟢 What is Link Spamming?
[Asked: Jun 2025 | Frequency: 1]
Link Spamming is the practice of creating artificial or manipulative links to boost a page's search engine ranking unfairly.
Definition: Deliberate creation of link structures to deceive search engine algorithms into giving higher rankings than deserved.
Types of Link Spam:
| Type | Description |
|---|---|
| Link Farms | Networks of pages linking to each other |
| Paid Links | Buying links for PageRank |
| Comment Spam | Adding links in blog comments |
| Hidden Links | Invisible links on pages |
| Reciprocal Links | "You link me, I link you" |
Goal of Spammers:
- Artificially inflate PageRank
- Appear higher in search results
- Drive traffic to low-quality content
Q141. 🟢 Illustrate link spam with a suitable example.
[Asked: Jun 2025 | Frequency: 1]
Link Spam Example: Link Farm Attack
Scenario: A spam website wants to rank #1 for "cheap phones"
Setup:
How It Works:
| Step | Action |
|---|---|
| 1 | Create thousands of dummy pages |
| 2 | All pages link to target spam site |
| 3 | Farm pages link to each other (boost each other) |
| 4 | Try to get legitimate sites to link in |
| 5 | Target page gains artificial PageRank |
Result Before Detection:
- Spam page appears in top results
- Users click and see low-quality content
- Spammer profits from ads/scams
Q142. 🟢 What are the possible solutions to combat link spamming?
[Asked: Jun 2025 | Frequency: 1]
Solutions to Combat Link Spam:
1. TrustRank Algorithm:
- Start with trusted seed pages (manually verified)
- Propagate trust through links
- Spam pages get low trust scores
2. Spam Mass:
- Calculate how much PageRank comes from spam
- Penalize pages with high spam contribution
3. Link Analysis:
| Technique | Detection Method |
|---|---|
| Graph Analysis | Detect unusual link patterns |
| Temporal Analysis | Sudden link spikes |
| Anchor Text | Unnatural keyword stuffing |
| Link Velocity | Too many links too fast |
4. NoFollow Attribute:
- The `rel="nofollow"` link attribute (`<a rel="nofollow">`) tells search engines to ignore the link
- Used for user-generated content (comments)
5. Machine Learning:
- Train classifiers on known spam
- Detect spam patterns automatically
Modern Approach: Search engines use combination of all techniques plus regular algorithm updates (Google Penguin) to penalize spam.
UNIT 12: WEB AND SOCIAL NETWORK ANALYSIS
Q143. 🟢 Explain how social networks can be represented using a graph.
[Asked: Dec 2022 | Frequency: 1]
Graph Representation of Social Networks:
A social network is naturally modeled as a graph where:
- Nodes (Vertices) = People/Users
- Edges (Links) = Relationships/Connections
Types of Social Graph:
| Type | Direction | Example |
|---|---|---|
| Undirected | Mutual | Facebook friends |
| Directed | One-way | Twitter follow |
| Weighted | Has strength | Interaction frequency |
| Bipartite | Two types | Users & Groups |
Key Graph Properties:
| Property | Meaning |
|---|---|
| Degree | Number of connections |
| Path | Route between two nodes |
| Clustering | How connected neighbors are |
| Centrality | Node importance |
| Components | Connected subgraphs |
Example Data Structure:
Adjacency List:
Alice: [Bob, Charlie]
Bob: [Alice, Diana]
Charlie: [Alice, Diana]
Diana: [Bob, Charlie, Eve]
Eve: [Diana]
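The adjacency list above translates directly into a Python dictionary, and graph properties such as degree fall straight out of the structure. A minimal sketch:

```python
graph = {
    "Alice":   ["Bob", "Charlie"],
    "Bob":     ["Alice", "Diana"],
    "Charlie": ["Alice", "Diana"],
    "Diana":   ["Bob", "Charlie", "Eve"],
    "Eve":     ["Diana"],
}

def degree(g, node):
    """Degree = number of connections (undirected graph)."""
    return len(g[node])

print(degree(graph, "Diana"))   # 3: Diana is the best-connected user
```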
Q144. 🟢 Explain the issues related to mining of social networks.
[Asked: Jun 2022 | Frequency: 1]
Issues in Social Network Mining:
| Category | Issue | Description |
|---|---|---|
| Scale | Massive Size | Billions of nodes and edges |
| Scale | Dynamic | Constantly changing |
| Data | Noise | Fake accounts, spam |
| Data | Incompleteness | Missing connections |
| Privacy | Sensitive Data | Personal information |
| Privacy | Anonymization | Hard to truly anonymize |
| Technical | Heterogeneity | Multiple relationship types |
| Technical | Semantics | Context matters |
Specific Challenges:
- Community Detection: Finding groups is NP-hard
- Influence Propagation: Predicting spread patterns
- Link Prediction: Guessing future connections
- Sybil Attacks: Fake identity networks
- Filter Bubbles: Echo chamber detection
Q145. 🟢 What is Web Analytics?
[Asked: Jun 2024 | Frequency: 1]
Web Analytics is the collection, measurement, analysis, and reporting of website data to understand and optimize web usage.
Definition: Web analytics helps businesses understand how users interact with their websites to improve user experience and achieve goals.
Key Metrics:
| Metric | Description |
|---|---|
| Page Views | Total pages viewed |
| Unique Visitors | Distinct users |
| Bounce Rate | Single-page visits |
| Session Duration | Time on site |
| Conversion Rate | Goal completions |
| Traffic Sources | Where users come from |
Popular Tools:
-
Google Analytics
-
Adobe Analytics
-
Mixpanel
-
Hotjar
Applications:
- User behavior analysis
- Marketing campaign tracking
- A/B testing
- Conversion optimization
- Content performance
Q146. 🟢 Explain the issues in online advertising.
[Asked: Jun 2024 | Frequency: 1]
Issues in Online Advertising:
| Category | Issue | Description |
|---|---|---|
| Fraud | Click Fraud | Fake clicks to exhaust budgets |
| Fraud | Bot Traffic | Non-human impressions |
| Fraud | Ad Injection | Unauthorized ad placement |
| Privacy | Tracking | User surveillance concerns |
| Privacy | Data Collection | Personal data harvesting |
| UX | Ad Blockers | Users block ads |
| UX | Banner Blindness | Users ignore ads |
| Quality | Brand Safety | Ads on inappropriate content |
| Quality | Viewability | Ads not actually seen |
Solutions:
| Issue | Solution |
|---|---|
| Click Fraud | Machine learning detection |
| Bot Traffic | CAPTCHA, behavior analysis |
| Privacy | Consent frameworks (GDPR) |
| Ad Blockers | Native advertising |
| Brand Safety | Content verification |
| Viewability | Viewability standards (MRC) |
Q147. 🟢 What is Data Lake? Explain the term Data Lake.
[Asked: Jun 2023 | Frequency: 1]
Data Lake is a centralized repository that stores all structured, semi-structured, and unstructured data at any scale in its native format.
Definition: A data lake stores raw data in its original format until it's needed for analysis, unlike data warehouses that require predefined schemas.
Key Characteristics:
| Characteristic | Description |
|---|---|
| Schema-on-Read | Define structure when reading, not storing |
| Raw Format | Store data as-is |
| Any Data Type | Structured, semi-structured, unstructured |
| Scalable | Handles petabytes of data |
| Cost-Effective | Uses commodity storage |
| Flexible | Adapt to changing needs |
Data Lake vs Data Warehouse:
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-Read | Schema-on-Write |
| Data Type | All types | Structured only |
| Processing | ELT | ETL |
| Cost | Lower | Higher |
| Users | Data Scientists | Business Analysts |
Q148. 🟢 Briefly discuss the key capabilities of data lake.
[Asked: Jun 2023 | Frequency: 1]
Key Capabilities of Data Lake:
| Capability | Description |
|---|---|
| Data Ingestion | Collect from any source |
| Storage | Store any data type at any scale |
| Processing | Batch and real-time processing |
| Governance | Data quality, security, compliance |
| Discovery | Catalog and search data |
| Analytics | ML, BI, advanced analytics |
Detailed Capabilities:
1. Universal Data Ingestion:
- Batch uploads
- Real-time streaming
- CDC (Change Data Capture)
- API integrations
2. Scalable Storage:
- Petabyte scale
- Cost-effective object storage
- Data compression
- Lifecycle management
3. Data Processing:
- ETL/ELT pipelines
- Spark, Hadoop processing
- SQL queries
- Stream processing
4. Data Governance:
- Access control
- Data lineage
- Quality monitoring
- Compliance (GDPR, HIPAA)
5. Advanced Analytics:
- Machine learning
- Predictive analytics
- Real-time dashboards
- Ad-hoc queries
Q149. 🟢 What is Collaborative Filtering?
[Asked: Jun 2022 | Frequency: 1]
Collaborative Filtering is a recommendation technique that predicts user preferences based on the collective behavior of many users.
Core Principle: "Users who agreed in the past will agree in the future"
Types:
| Type | Description |
|---|---|
| User-Based | Find similar users, recommend their items |
| Item-Based | Find similar items, recommend to users |
| Matrix Factorization | Decompose user-item matrix |
Key Insight:
- No need to know item content
- Uses patterns from user behavior
- "People like you also liked..."
Q150. 🟢 Explain Collaborative filtering with the help of an example.
[Asked: Jun 2022 | Frequency: 1]
Collaborative Filtering Example - Movie Recommendations:
Step 1: User-Item Matrix
| User | Avengers | Titanic | Inception | Notebook |
|---|---|---|---|---|
| Alice | 5 | 3 | 5 | ? |
| Bob | 5 | 2 | 4 | 1 |
| Carol | 2 | 5 | 2 | 5 |
| Dave | ? | 4 | ? | 4 |
Step 2: Find Similar Users (for Alice)
Calculate similarity (Cosine/Pearson):
- Alice vs Bob: 0.95 (very similar - both like action)
- Alice vs Carol: 0.25 (different - Carol likes romance)
Step 3: Predict Alice's Rating for "Notebook"
Since Alice ≈ Bob:
- Bob rated Notebook = 1
- Predict Alice's rating ≈ 1-2 (low)
Since Alice ≠ Carol:
- Carol's high rating less relevant
Step 4: Recommendation
Alice's predicted ratings:
- Notebook: 1.5 (Don't recommend)
- Other action movies: High (Recommend!)
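The similarity scores in Step 2 (0.95, 0.25) are illustrative. One standard way to compute them is Pearson correlation over the co-rated movies; a Python sketch over the three movies that Alice, Bob, and Carol all rated:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length rating lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (math.sqrt(sum((x - ma) ** 2 for x in a))
           * math.sqrt(sum((y - mb) ** 2 for y in b)))
    return num / den

# Ratings for Avengers, Titanic, Inception (rated by all three users)
alice, bob, carol = [5, 3, 5], [5, 2, 4], [2, 5, 2]
print(round(pearson(alice, bob), 2))    # ≈ 0.94: very similar tastes
print(round(pearson(alice, carol), 2))  # negative: opposite tastes
```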
Q151. 🟢 What is a Recommender System?
[Asked: Dec 2024 | Frequency: 1]
Recommender System is an information filtering system that predicts and suggests items a user might be interested in based on various data sources.
Purpose:
- Reduce information overload
- Personalize user experience
- Increase engagement and sales
Types of Recommender Systems:
| Type | Method | Example |
|---|---|---|
| Content-Based | Item features | "Similar to what you liked" |
| Collaborative | User behavior | "Users like you also liked" |
| Hybrid | Combination | Netflix, Amazon |
| Knowledge-Based | User requirements | "Based on your needs" |
Applications:
| Platform | Recommendation |
|---|---|
| Netflix | Movies, TV shows |
| Amazon | Products |
| Spotify | Music, playlists |
| YouTube | Videos |
| LinkedIn | Jobs, connections |
Q152. 🟡 Explain the concept of Recommendation System with diagram.
[Asked: Dec 2022, Dec 2024 | Frequency: 2]
Recommendation System Concepts:
1. Content-Based Filtering: Recommends items similar to what user liked before.
User likes: Action movies with Tom Cruise
System finds: Movies with similar attributes
Recommends: Mission Impossible series
2. Collaborative Filtering: Recommends based on similar users' preferences.
User A likes: Avengers, Iron Man
User B likes: Avengers, Iron Man, Thor
Recommend to A: Thor (because B liked it)
3. Hybrid Approach: Combines both methods for better accuracy.
Evaluation Metrics:
| Metric | Description |
|---|---|
| Precision | Relevant / Recommended |
| Recall | Relevant recommended / Total relevant |
| RMSE | Prediction error |
| Coverage | Items that can be recommended |
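The Precision and Recall metrics from the table can be computed directly from sets. A short Python sketch with hypothetical item IDs:

```python
# Hypothetical example: 5 items recommended, 6 items actually relevant
recommended = {"A", "B", "C", "D", "E"}
relevant = {"A", "C", "E", "F", "G", "H"}

hits = recommended & relevant             # relevant items that were recommended
precision = len(hits) / len(recommended)  # Relevant / Recommended
recall = len(hits) / len(relevant)        # Relevant recommended / Total relevant

print(precision)  # 0.6
print(recall)     # 0.5
```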
UNIT 13: BASICS OF R PROGRAMMING
Q153. 🟢 Define Complex data type in R programming with example.
[Asked: Jun 2022 | Frequency: 1]
Complex Data Type in R is used to store complex numbers with real and imaginary parts.
Syntax:
z <- complex(real = a, imaginary = b)
# OR use an imaginary literal (a numeric constant with the suffix i)
z <- 3 + 2i
Examples:
# Creating complex numbers
z1 <- 3 + 2i
z2 <- complex(real = 5, imaginary = -3)
# Check type
class(z1) # "complex"
# Operations
z3 <- z1 + z2 # (8-1i)
z4 <- z1 * z2 # (21+1i)
# Extract parts
Re(z1) # 3 (real part)
Im(z1) # 2 (imaginary part)
Mod(z1) # 3.606 (modulus: sqrt(3²+2²))
Conj(z1) # 3-2i (conjugate)
Use Cases:
- Signal processing
- Electrical engineering
- Quantum mechanics simulations
Q154. 🟡 What are Strings in R? Explain with example.
[Asked: Jun 2024, Jun 2022 | Frequency: 2]
Strings (Character type) in R are sequences of characters enclosed in single or double quotes.
Creating Strings:
# Single or double quotes
str1 <- "Hello World"
str2 <- 'R Programming'
# Check type
class(str1) # "character"
Common String Functions:
| Function | Purpose | Example |
|---|---|---|
| nchar() | Length | nchar("Hello") → 5 |
| paste() | Concatenate | paste("a", "b") → "a b" |
| substr() | Substring | substr("Hello", 1, 3) → "Hel" |
| toupper() | Uppercase | toupper("hi") → "HI" |
| tolower() | Lowercase | tolower("HI") → "hi" |
| strsplit() | Split | strsplit("a-b", "-") → ["a","b"] |
Example:
name <- "Data Science"
print(nchar(name)) # 12
print(toupper(name)) # "DATA SCIENCE"
print(substr(name, 1, 4)) # "Data"
print(paste(name, "2024")) # "Data Science 2024"
Q155. 🟢 Define %% operator in R programming with example.
[Asked: Jun 2022 | Frequency: 1]
%% Operator is the modulus operator that returns the remainder after division.
Syntax:
result <- dividend %% divisor
Examples:
# Basic modulus
10 %% 3 # Returns 1 (10 = 3×3 + 1)
15 %% 5 # Returns 0 (15 = 5×3 + 0)
7 %% 2 # Returns 1 (7 = 2×3 + 1)
# Check if even or odd
x <- 8
if (x %% 2 == 0) {
print("Even")
} else {
print("Odd")
}
# Output: "Even"
# Vector operation
c(10, 15, 22) %% 3 # Returns c(1, 0, 1)
Use Cases:
- Check even/odd numbers
- Circular array indexing
- Time calculations (hours, minutes)
- Divisibility tests
Q156. 🟢 Define <- or <<- operator in R programming with example.
[Asked: Jun 2022 | Frequency: 1]
Assignment Operators in R:
| Operator | Scope | Description |
|---|---|---|
| <- | Local | Assigns value in current environment |
| <<- | Global | Assigns value in parent/global environment |
Local Assignment (<-):
x <- 10 # Assign 10 to x
y <- "Hello" # Assign string to y
z <- c(1,2,3) # Assign vector to z
# Same as = but preferred in R
x = 10 # Also works but <- is convention
Global Assignment (<<-):
# Used inside functions to modify global variables
x <- 5 # Global
test_func <- function() {
x <- 10 # Creates LOCAL x (doesn't affect global)
x <<- 20 # Modifies GLOBAL x
}
test_func()
print(x) # 20 (global x was changed by <<-)
Q157. 🟢 Explain different types of data structures in R-language.
[Asked: Dec 2024 | Frequency: 1]
R Data Structures:
| Structure | Dimension | Data Types | Example |
|---|---|---|---|
| Vector | 1D | Homogeneous | c(1,2,3) |
| Matrix | 2D | Homogeneous | matrix(1:6, 2, 3) |
| Array | nD | Homogeneous | array(1:24, c(2,3,4)) |
| List | 1D | Heterogeneous | list(1, "a", TRUE) |
| Data Frame | 2D | Heterogeneous columns | data.frame(...) |
| Factor | 1D | Categorical | factor(c("M","F")) |
Examples:
# Vector
v <- c(1, 2, 3, 4)
# Matrix
m <- matrix(1:6, nrow=2, ncol=3)
# List
l <- list(name="John", age=25, scores=c(90,85))
# Data Frame
df <- data.frame(
Name = c("A", "B"),
Age = c(20, 25)
)
# Factor
f <- factor(c("Low", "High", "Medium"))
Q158. 🔴 What is a Vector in R programming? Describe with example.
[Asked: Jun 2025, Dec 2023, Dec 2022, Jun 2022 | Frequency: 4]
Vector is the most basic data structure in R, a one-dimensional array that holds elements of the same data type.
Creating Vectors:
# Using c() function
numeric_vec <- c(1, 2, 3, 4, 5)
char_vec <- c("a", "b", "c")
logical_vec <- c(TRUE, FALSE, TRUE)
# Using sequences
seq_vec <- 1:10 # 1 to 10
seq_vec2 <- seq(1, 10, 2) # 1, 3, 5, 7, 9
# Using rep()
rep_vec <- rep(5, 3) # c(5, 5, 5)
Vector Operations:
v <- c(10, 20, 30, 40, 50)
# Accessing elements
v[1] # 10 (first element)
v[2:4] # c(20, 30, 40)
v[c(1,5)] # c(10, 50)
# Arithmetic (element-wise)
v + 5 # c(15, 25, 35, 45, 55)
v * 2 # c(20, 40, 60, 80, 100)
# Functions
length(v) # 5
sum(v) # 150
mean(v) # 30
max(v) # 50
min(v) # 10
Q159. 🟡 What is a List in R programming? Describe with example.
[Asked: Dec 2023, Dec 2022 | Frequency: 2]
List is a data structure that can contain elements of different types (heterogeneous), including other lists.
Creating Lists:
# Basic list
my_list <- list(
name = "Alice",
age = 25,
scores = c(85, 90, 78),
passed = TRUE
)
# Unnamed list
l <- list(1, "hello", TRUE, c(1,2,3))
Accessing Elements:
# Using $ (named elements)
my_list$name # "Alice"
my_list$scores # c(85, 90, 78)
# Using [[ ]] (by index or name)
my_list[[1]] # "Alice"
my_list[["age"]] # 25
# Using [ ] (returns sub-list)
my_list[1] # List with name element
List Operations:
# Add element
my_list$city <- "Mumbai"
# Modify element
my_list$age <- 26
# Remove element
my_list$passed <- NULL
# Length
length(my_list) # Number of elements
# Names
names(my_list) # c("name", "age", "scores", "city")
Q160. 🟢 Explain Matrices in R programming with example.
[Asked: Jun 2024 | Frequency: 1]
Matrix is a two-dimensional data structure with elements of the same type arranged in rows and columns.
Creating Matrices:
# Using matrix() function
m <- matrix(1:6, nrow = 2, ncol = 3)
# [,1] [,2] [,3]
# [1,] 1 3 5
# [2,] 2 4 6
# By row (default is by column)
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 4 5 6
# With row/column names
m3 <- matrix(1:4, nrow = 2,
dimnames = list(c("R1","R2"), c("C1","C2")))
Matrix Operations:
m <- matrix(1:6, nrow = 2, ncol = 3)
# Accessing elements
m[1, 2] # Element at row 1, col 2 → 3
m[1, ] # First row → c(1, 3, 5)
m[, 2] # Second column → c(3, 4)
# Dimensions
dim(m) # c(2, 3)
nrow(m) # 2
ncol(m) # 3
# Arithmetic
m + 10 # Add 10 to all elements
m * 2 # Multiply all by 2
# Matrix multiplication
a <- matrix(1:4, 2, 2)
b <- matrix(5:8, 2, 2)
a %*% b # Matrix multiplication
Q161. 🟡 What are Dataframes in R programming? Explain with example.
[Asked: Jun 2023, Dec 2022 | Frequency: 2]
Data Frame is a two-dimensional table where each column can have different data types, similar to a spreadsheet or SQL table.
Creating Data Frames:
# Using data.frame()
students <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(20, 22, 21),
Grade = c("A", "B", "A"),
Passed = c(TRUE, TRUE, TRUE)
)
# Name Age Grade Passed
# 1 Alice 20 A TRUE
# 2 Bob 22 B TRUE
# 3 Charlie 21 A TRUE
Accessing Data:
# Column access
students$Name # Vector of names
students[, "Age"] # Age column
students[, 2] # Second column
# Row access
students[1, ] # First row
students[1:2, ] # First two rows
# Cell access
students[1, "Name"] # "Alice"
students$Name[1] # "Alice"
Common Operations:
# Dimensions
nrow(students) # 3
ncol(students) # 4
dim(students) # c(3, 4)
# Add column
students$City <- c("NYC", "LA", "CHI")
# Add row
new_student <- data.frame(Name="Diana", Age=23,
Grade="A", Passed=TRUE, City="SF")
students <- rbind(students, new_student)
# Summary
summary(students)
str(students)
Q162. 🟢 Give characteristics of Dataframes in R programming.
[Asked: Jun 2023 | Frequency: 1]
Characteristics of Data Frames:
| Characteristic | Description |
|---|---|
| 2D Structure | Rows and columns (like table) |
| Heterogeneous Columns | Each column can have different type |
| Homogeneous Rows | Each row has same structure |
| Named Columns | Columns must have names |
| Equal Length | All columns same length |
| Indexable | Row/column indexing |
Key Properties:
df <- data.frame(
ID = 1:3,
Name = c("A", "B", "C"),
Score = c(85.5, 90.0, 78.5)
)
# Properties
class(df) # "data.frame"
typeof(df) # "list" (internally a list)
names(df) # Column names
rownames(df) # Row names (default: 1,2,3...)
Comparison:
| Feature | Matrix | Data Frame |
|---|---|---|
| Data types | Same | Different per column |
| Columns | Optional names | Required names |
| Use case | Math operations | Data analysis |
Q163. 🟢 What are factors in R programming?
[Asked: Jun 2025 | Frequency: 1]
Factor is a data structure used to represent categorical (nominal or ordinal) variables with a fixed set of possible values called levels.
Creating Factors:
# Basic factor
gender <- factor(c("Male", "Female", "Male", "Female"))
print(gender)
# [1] Male Female Male Female
# Levels: Female Male
# Ordered factor (ordinal)
size <- factor(c("Small", "Large", "Medium"),
levels = c("Small", "Medium", "Large"),
ordered = TRUE)
# [1] Small Large Medium
# Levels: Small < Medium < Large
Factor Properties:
# Get levels
levels(gender) # c("Female", "Male")
# Number of levels
nlevels(gender) # 2
# Underlying integers
as.integer(gender) # c(2, 1, 2, 1)
# Summary
summary(gender)
# Female Male
# 2 2
Use Cases:
- Survey responses (Agree, Disagree, Neutral)
- Categories (Product types, Regions)
- Ordinal data (Low, Medium, High)
- Statistical modeling (ANOVA, regression)
Q164. 🟢 Give characteristics of factors in R programming.
[Asked: Jun 2025 | Frequency: 1]
Factor Characteristics:
| Characteristic | Description |
|---|---|
| Levels | Fixed set of allowed values |
| Storage | Stored as integers internally |
| Labels | Human-readable level names |
| Ordering | Can be ordered or unordered |
| Memory Efficient | Integer storage saves space |
| Statistical | Used in modeling |
Ordered vs Unordered:
# Unordered (nominal)
color <- factor(c("Red", "Blue", "Green"))
# No inherent order
# Ordered (ordinal)
rating <- factor(c("Poor", "Good", "Excellent"),
levels = c("Poor", "Good", "Excellent"),
ordered = TRUE)
# Poor < Good < Excellent
rating[1] < rating[3] # TRUE (comparison works)
Common Operations:
f <- factor(c("A", "B", "A", "C"))
table(f) # Frequency table
droplevels(f) # Remove unused levels
relevel(f, "B") # Change reference level
Q165. 🔴 Write R program for matrix operations.
[Asked: Dec 2022, Jun 2022 | Frequency: 4]
Matrix Operations in R:
# Create two 3×3 matrices
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3)
B <- matrix(c(9, 8, 7, 6, 5, 4, 3, 2, 1), nrow = 3, ncol = 3)
print("Matrix A:")
print(A)
# [,1] [,2] [,3]
# [1,] 1 4 7
# [2,] 2 5 8
# [3,] 3 6 9
print("Matrix B:")
print(B)
# [,1] [,2] [,3]
# [1,] 9 6 3
# [2,] 8 5 2
# [3,] 7 4 1
# Addition
C <- A + B
print("A + B:")
print(C)
# [,1] [,2] [,3]
# [1,] 10 10 10
# [2,] 10 10 10
# [3,] 10 10 10
# Subtraction
D <- A - B
print("A - B:")
print(D)
# Element-wise multiplication
E <- A * B
print("A * B (element-wise):")
print(E)
# Matrix multiplication
F <- A %*% B
print("A %*% B (matrix multiplication):")
print(F)
# Transpose
print("Transpose of A:")
print(t(A))
# Determinant
print("Determinant of A:")
print(det(A)) # 0 (this particular A is singular)
# Inverse (only for invertible matrices; solve(A) would fail here since det(A) = 0)
# print(solve(A))
Q166. 🟢 How is R matrix multiplication different from C program?
[Asked: Dec 2022 | Frequency: 1]
Comparison: R vs C Matrix Multiplication
| Aspect | R | C |
|---|---|---|
| Syntax | Single operator %*% | Nested loops |
| Code Length | 1 line | 10+ lines |
| Memory | Automatic | Manual allocation |
| Indexing | 1-based | 0-based |
| Vectorization | Built-in | Manual |
R Code:
# Matrix multiplication in R
A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)
C <- A %*% B # One line!
C Code:
// Matrix multiplication in C
int A[2][2] = {{1,3}, {2,4}};
int B[2][2] = {{5,7}, {6,8}};
int C[2][2];
// Triple nested loop required
for(int i = 0; i < 2; i++) {
for(int j = 0; j < 2; j++) {
C[i][j] = 0;
for(int k = 0; k < 2; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Q167. 🟢 Write R code to concatenate strings.
[Asked: Jun 2025 | Frequency: 1]
String Concatenation in R:
# Using paste() - adds space by default
str1 <- "Hello"
str2 <- ","
str3 <- "Learning is Fun"
result <- paste(str1, str2, str3)
print(result)
# Output: "Hello , Learning is Fun"
# Using paste0() - no separator
result2 <- paste0(str1, str2, " ", str3)
print(result2)
# Output: "Hello, Learning is Fun"
# Custom separator
result3 <- paste(str1, str2, str3, sep = "")
print(result3)
# Output: "Hello,Learning is Fun"
# Collapse vector elements
words <- c("Hello", "World", "R")
collapsed <- paste(words, collapse = "-")
print(collapsed)
# Output: "Hello-World-R"
Functions Comparison:
| Function | Default Separator | Example |
|---|---|---|
| paste() | Space (" ") | paste("a","b") → "a b" |
| paste0() | None ("") | paste0("a","b") → "ab" |
Using sprintf():
name <- "Alice"
age <- 25
msg <- sprintf("My name is %s and I am %d years old", name, age)
print(msg)
# Output: "My name is Alice and I am 25 years old"
UNIT 14: DATA INTERFACING AND VISUALIZATION IN R
Q168. 🟢 What is JSON File in R?
[Asked: Jun 2025 | Frequency: 1]
JSON (JavaScript Object Notation) is a lightweight data interchange format that R can read and write using the jsonlite package.
JSON Structure:
{
"name": "Alice",
"age": 25,
"courses": ["Data Science", "Machine Learning"],
"active": true
}
Working with JSON in R:
# Install package
install.packages("jsonlite")
library(jsonlite)
# Read JSON file
data <- fromJSON("data.json")
# Read JSON from string
json_str <- '{"name": "Bob", "age": 30}'
data <- fromJSON(json_str)
# Write to JSON
toJSON(data)
write_json(data, "output.json")
Why JSON with R:
| Purpose | Description |
|---|---|
| Web APIs | Most APIs return JSON |
| Data Exchange | Universal format |
| Configuration | Store settings |
| Lightweight | Human-readable |
Q169. 🟢 How to convert JSON into a data frame in R?
[Asked: Jun 2025 | Frequency: 1]
JSON to Data Frame Conversion:
# Load library
library(jsonlite)
# JSON string with array of objects
json_data <- '[
{"name": "Alice", "age": 25, "city": "NYC"},
{"name": "Bob", "age": 30, "city": "LA"},
{"name": "Charlie", "age": 28, "city": "CHI"}
]'
# Convert to data frame
df <- fromJSON(json_data)
print(df)
# name age city
# 1 Alice 25 NYC
# 2 Bob 30 LA
# 3 Charlie 28 CHI
# From file
df <- fromJSON("data.json")
# Check structure
class(df) # "data.frame"
str(df)
Handling Nested JSON:
# Nested JSON
nested_json <- '{
"company": "TechCorp",
"employees": [
{"name": "Alice", "dept": "IT"},
{"name": "Bob", "dept": "HR"}
]
}'
data <- fromJSON(nested_json)
employees_df <- data$employees # Extract nested data frame
Q170. 🔴 How to draw a Bar Chart in R?
[Asked: Jun 2024, Jun 2023, Jun 2022, Dec 2024 | Frequency: 4]
Bar Chart in R using barplot():
Syntax:
barplot(height, names.arg, main, xlab, ylab, col)
Example:
# Data
categories <- c("A", "B", "C", "D", "E")
values <- c(25, 40, 30, 55, 45)
# Basic bar chart
barplot(values,
names.arg = categories,
main = "Sales by Category",
xlab = "Category",
ylab = "Sales",
col = "steelblue")
# Horizontal bar chart
barplot(values,
names.arg = categories,
main = "Sales by Category",
horiz = TRUE,
col = rainbow(5))
# Grouped bar chart
data <- matrix(c(10, 20, 15, 25, 30, 35), nrow = 2)
barplot(data,
names.arg = c("Q1", "Q2", "Q3"),
beside = TRUE,
col = c("red", "blue"),
legend = c("2023", "2024"))
Parameters:
| Parameter | Description |
|---|---|
| height | Vector of bar heights |
| names.arg | Labels for bars |
| main | Chart title |
| col | Bar colors |
| horiz | Horizontal if TRUE |
| beside | Grouped bars if TRUE |
Q171. 🔴 How to create a Box Plot in R?
[Asked: Dec 2022, Jun 2023, Dec 2024 | Frequency: 3]
Box Plot in R using boxplot():
Syntax:
boxplot(data, main, xlab, ylab, col)
Example:
# Single box plot
data <- c(23, 25, 28, 30, 32, 35, 38, 40, 42, 100)
boxplot(data,
main = "Distribution of Values",
ylab = "Value",
col = "lightblue")
# Multiple box plots
group1 <- c(10, 12, 14, 15, 18, 20)
group2 <- c(20, 22, 24, 26, 28, 30)
group3 <- c(15, 18, 20, 22, 25, 28)
boxplot(group1, group2, group3,
names = c("A", "B", "C"),
main = "Comparison of Groups",
col = c("red", "green", "blue"))
# From data frame
df <- data.frame(
value = c(10,12,15,20,22,25,30,32,35),
group = c("A","A","A","B","B","B","C","C","C")
)
boxplot(value ~ group, data = df,
main = "Values by Group",
col = "orange")
Q172. 🟡 How to create a Histogram in R?
[Asked: Jun 2023, Dec 2024 | Frequency: 2]
Histogram in R using hist():
Syntax:
hist(x, breaks, main, xlab, ylab, col)
Example:
# Generate sample data
data <- rnorm(100, mean = 50, sd = 10)
# Basic histogram
hist(data,
main = "Distribution of Values",
xlab = "Value",
ylab = "Frequency",
col = "lightgreen")
# Custom breaks (bins)
hist(data,
breaks = 20,
main = "Histogram with 20 Bins",
col = "steelblue",
border = "white")
# Probability density instead of frequency
hist(data,
probability = TRUE,
main = "Density Histogram",
col = "coral")
lines(density(data), col = "blue", lwd = 2)
Parameters:
| Parameter | Description |
|---|---|
| x | Numeric vector |
| breaks | Number of bins or breakpoints |
| probability | TRUE for density |
| col | Fill color |
| border | Border color |
Q173. 🟡 How to create Line Graphs in R?
[Asked: Jun 2023, Dec 2024 | Frequency: 2]
Line Graph in R using plot() with type="l":
Syntax:
plot(x, y, type = "l", main, xlab, ylab, col)
Example:
# Data
months <- 1:12
sales <- c(100, 120, 140, 130, 150, 180, 200, 190, 170, 160, 140, 150)
# Basic line graph
plot(months, sales,
type = "l",
main = "Monthly Sales",
xlab = "Month",
ylab = "Sales ($)",
col = "blue",
lwd = 2)
# Line with points
plot(months, sales,
type = "b", # both line and points
main = "Monthly Sales",
col = "red",
pch = 16)
# Multiple lines
sales2024 <- c(110, 130, 150, 140, 160, 190, 210, 200, 180, 170, 150, 160)
plot(months, sales, type = "l", col = "blue", ylim = c(80, 220))
lines(months, sales2024, col = "red")
legend("topleft", legend = c("2023", "2024"),
col = c("blue", "red"), lty = 1)
Type Options:
| Type | Description |
|---|---|
| "l" | Line only |
| "p" | Points only |
| "b" | Both (with gap) |
| "o" | Overplotted |
| "s" | Steps |
Q174. 🔴 How to draw a Scatter Plot in R?
[Asked: Dec 2024, Jun 2024, Jun 2023 | Frequency: 3]
Scatter Plot in R using plot():
Syntax:
plot(x, y, main, xlab, ylab, pch, col)
Example:
# Data
height <- c(150, 160, 165, 170, 175, 180, 185, 190)
weight <- c(50, 55, 60, 65, 70, 75, 80, 85)
# Basic scatter plot
plot(height, weight,
main = "Height vs Weight",
xlab = "Height (cm)",
ylab = "Weight (kg)",
pch = 16,
col = "blue")
# Add trend line
abline(lm(weight ~ height), col = "red", lwd = 2)
# Different point styles
plot(height, weight,
pch = 19, # Solid circle
col = "darkgreen",
cex = 1.5) # Point size
# Color by category
gender <- c("M", "M", "F", "F", "M", "F", "M", "F")
colors <- ifelse(gender == "M", "blue", "red")
plot(height, weight, col = colors, pch = 16)
legend("topleft", legend = c("Male", "Female"),
col = c("blue", "red"), pch = 16)
Common pch values:
| pch | Symbol |
|---|---|
| 1 | Circle |
| 16 | Solid circle |
| 17 | Triangle |
| 18 | Diamond |
| 19 | Solid circle |
UNIT 15: DATA ANALYSIS AND R
Q175. 🟡 What is Linear Regression?
[Asked: Jun 2025, Jun 2022 | Frequency: 2]
Linear Regression is a statistical method to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation.
Simple Linear Regression Formula:
y = β₀ + β₁x + ε
Where:
- y = Dependent variable (predicted)
- x = Independent variable (predictor)
- β₀ = Intercept
- β₁ = Slope
- ε = Error term
Assumptions:
- Linear relationship
- Independence of errors
- Homoscedasticity (constant variance)
- Normally distributed errors
Use Cases:
- Predicting sales from advertising spend
- Estimating house prices
- Forecasting demand
Q176. 🔴 Explain Linear Regression using R-language.
[Asked: Dec 2024, Jun 2022 | Frequency: 3]
Linear Regression in R using lm():
# Sample data
height <- c(150, 160, 170, 180, 190)
weight <- c(50, 60, 70, 80, 90)
# Create linear model
model <- lm(weight ~ height)
# View model summary
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -100.0000 ...
# height 1.0000 ...
# Get coefficients
coefficients(model)
# (Intercept) height
# -100.00 1.00
# Predict new values
new_height <- data.frame(height = c(155, 175))
predict(model, new_height)
# [1] 55 75
# Plot regression line
plot(height, weight, main = "Height vs Weight",
xlab = "Height", ylab = "Weight", pch = 16)
abline(model, col = "red", lwd = 2)
Key Functions:
| Function | Purpose |
|---|---|
| lm() | Create linear model |
| summary() | Model statistics |
| coefficients() | Get coefficients |
| predict() | Make predictions |
| residuals() | Get residuals |
| abline() | Draw regression line |
Q177. 🟢 Differentiate between Linear Regression and Multiple Regression.
[Asked: Jun 2023 | Frequency: 1]
Comparison:
| Aspect | Linear Regression | Multiple Regression |
|---|---|---|
| Variables | 1 independent | 2+ independent |
| Formula | y = β₀ + β₁x | y = β₀ + β₁x₁ + β₂x₂ + ... |
| Complexity | Simple | More complex |
| Use Case | Single factor analysis | Multi-factor analysis |
Simple Linear Regression Example:
# One predictor
model1 <- lm(price ~ area)
# price = β₀ + β₁ * area
Multiple Regression Example:
# Multiple predictors
model2 <- lm(price ~ area + bedrooms + age)
# price = β₀ + β₁*area + β₂*bedrooms + β₃*age
When to Use:
- Linear: One clear predictor
- Multiple: Multiple factors influence outcome
Q178. 🟢 What is Multiple Regression?
[Asked: Dec 2022 | Frequency: 1]
Multiple Regression extends linear regression to include two or more independent variables to predict a dependent variable.
Formula:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Example: Predicting house price based on:
- Area (x₁)
- Number of bedrooms (x₂)
- Age of house (x₃)
# Multiple regression in R
model <- lm(price ~ area + bedrooms + age, data = houses)
Advantages:
- Models real-world complexity
- Controls for confounding variables
- Better predictions
Assumptions:
- No multicollinearity (predictors not highly correlated)
- Linear relationship with each predictor
- Independence of observations
Q179. 🟢 Write steps for Multiple Regression in R.
[Asked: Dec 2022 | Frequency: 1]
Steps for Multiple Regression in R:
# Step 1: Load data
data <- read.csv("housing.csv")
# Step 2: Explore data
head(data)
summary(data)
cor(data) # Check correlations
# Step 3: Build model
model <- lm(price ~ area + bedrooms + age, data = data)
# Step 4: View summary
summary(model)
# Step 5: Check coefficients
coefficients(model)
# Step 6: Check significance (p-values)
# p < 0.05 means variable is significant
# Step 7: Check R-squared
# Higher = better fit (0 to 1)
# Step 8: Make predictions
new_data <- data.frame(area = 2000, bedrooms = 3, age = 10)
predicted_price <- predict(model, new_data)
# Step 9: Validate model
# Check residuals
plot(model)
# Step 10: Improve if needed
# Remove non-significant variables
model2 <- lm(price ~ area + bedrooms, data = data)
Q180. 🔴 What is Logistic Regression?
[Asked: Dec 2024, Dec 2023, Dec 2022 | Frequency: 3]
Logistic Regression is a statistical method for binary classification that predicts the probability of an outcome being in a particular category.
Key Characteristics:
| Aspect | Description |
|---|---|
| Output | Probability (0 to 1) |
| Use Case | Binary classification |
| Function | Sigmoid/Logistic |
| Threshold | Usually 0.5 |
Formula:
P(y = 1) = 1 / (1 + e^(-(β₀ + β₁x)))
Sigmoid Function:
σ(z) = 1 / (1 + e^(-z)), where z = β₀ + β₁x
Examples:
- Spam vs Not Spam (email)
- Disease vs Healthy (medical)
- Pass vs Fail (education)
- Buy vs Not Buy (marketing)
Difference from Linear:
| Linear | Logistic |
|---|---|
| Continuous output | Probability (0-1) |
| Predicts values | Classifies |
| y = mx + c | y = 1/(1+e^-z) |
Q181. 🟢 Give the utility of Logistic Regression.
[Asked: Dec 2023 | Frequency: 1]
Utility of Logistic Regression:
| Application | Use Case |
|---|---|
| Healthcare | Disease prediction (diabetes, cancer) |
| Finance | Credit risk, fraud detection |
| Marketing | Customer churn prediction |
| Email | Spam classification |
| HR | Employee attrition |
| Education | Student pass/fail prediction |
Why Use Logistic Regression:
- Interpretable: Coefficients show feature importance
- Probabilistic: Gives confidence in prediction
- Efficient: Fast training and prediction
- Robust: Works well with smaller datasets
- Baseline: Good starting point for classification
Output Interpretation:
- P > 0.5 → Class 1 (Positive)
- P ≤ 0.5 → Class 0 (Negative)
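The threshold rule can be checked numerically. A minimal, language-agnostic Python sketch of the sigmoid (the coefficients β₀ = -4, β₁ = 1 are illustrative, not from the text):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

b0, b1 = -4, 1  # illustrative coefficients: z = b0 + b1*x
for x in [2, 4, 6]:
    p = sigmoid(b0 + b1 * x)
    label = 1 if p > 0.5 else 0  # threshold rule: P > 0.5 -> Class 1
    print(x, round(p, 2), label)
```

Note that sigmoid(0) is exactly 0.5, so the decision boundary sits where β₀ + β₁x = 0.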
Q182. 🔴 How to implement Logistic Regression in R?
[Asked: Dec 2024, Dec 2023, Dec 2022, Jun 2024 | Frequency: 4]
Logistic Regression in R using glm():
# Step 1: Prepare data
data <- data.frame(
hours_studied = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
passed = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)
# Step 2: Build logistic model
model <- glm(passed ~ hours_studied,
data = data,
family = binomial)
# Step 3: View summary
summary(model)
# Step 4: Get coefficients
coefficients(model)
# Step 5: Predict probabilities
new_data <- data.frame(hours_studied = c(3, 5, 8))
probabilities <- predict(model, new_data, type = "response")
print(probabilities)
# [1] 0.25 0.50 0.82 (illustrative; exact values depend on the fitted model)
# Step 6: Convert to class
predicted_class <- ifelse(probabilities > 0.5, 1, 0)
# Step 7: Evaluate accuracy
actual <- c(0, 1, 1)
accuracy <- mean(predicted_class == actual)
print(paste("Accuracy:", accuracy))
# Step 8: Confusion matrix
table(Predicted = predicted_class, Actual = actual)
Key Function:
glm(formula, data, family = binomial)
- glm() = Generalized Linear Model
- family = binomial = logistic regression
- type = "response" = get probabilities
UNIT 16: ADVANCED ANALYSIS USING R
Q183. 🟡 What is a Decision Tree?
[Asked: Dec 2022, Jun 2023 | Frequency: 2]
Decision Tree is a supervised learning algorithm that makes predictions by learning decision rules from data, represented as a tree structure.
Structure:
| Component | Description |
|---|---|
| Root Node | Top node, first split |
| Internal Node | Decision point |
| Branch | Outcome of decision |
| Leaf Node | Final prediction |
Advantages:
- Easy to understand and interpret
- Handles both numerical and categorical data
- No need for feature scaling
- Visual representation
Disadvantages:
- Prone to overfitting
- Unstable (small changes in the data can change the tree)
- Biased toward features with more levels
Q184. ๐ก Write steps for Decision Tree in R.
[Asked: Dec 2022, Dec 2024 | Frequency: 2]
Decision Tree in R using rpart:
# Step 1: Install and load package
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
# Step 2: Prepare data
data <- data.frame(
Age = c(25, 30, 35, 40, 45, 50, 55, 60),
Income = c(30, 40, 50, 60, 70, 80, 90, 100),
Buy = c("No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes")
)
# Step 3: Build decision tree
tree_model <- rpart(Buy ~ Age + Income,
data = data,
method = "class")
# Step 4: View tree
print(tree_model)
# Step 5: Plot tree
rpart.plot(tree_model, main = "Decision Tree")
# Step 6: Make predictions
new_data <- data.frame(Age = 38, Income = 55)
prediction <- predict(tree_model, new_data, type = "class")
print(prediction)
# Step 7: Evaluate
# Using confusion matrix
predicted <- predict(tree_model, data, type = "class")
table(Predicted = predicted, Actual = data$Buy)
Parameters:
| Parameter | Description |
|---|---|
| `method = "class"` | Classification tree |
| `method = "anova"` | Regression tree |
| `cp` | Complexity parameter |
| `minsplit` | Min observations required for a split |
Q185. ๐ข Explain the role of entropy in decision trees.
[Asked: Jun 2023 | Frequency: 1]
Entropy measures the impurity or randomness in a dataset, used to decide the best split in decision trees.
Formula:
$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$
Where $p_i$ = proportion of class $i$ in the set
Interpretation:
- Entropy = 0: Pure node (all samples in the same class)
- Entropy = 1: Maximum impurity (50-50 split in the binary case)
Example:
Dataset: 5 Yes, 5 No
p(Yes) = 0.5, p(No) = 0.5
Entropy = -0.5×log₂(0.5) - 0.5×log₂(0.5)
= -0.5×(-1) - 0.5×(-1)
= 0.5 + 0.5 = 1.0 (maximum impurity)
Dataset: 8 Yes, 2 No
p(Yes) = 0.8, p(No) = 0.2
Entropy = -0.8×log₂(0.8) - 0.2×log₂(0.2)
≈ 0.72 (less impure)
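The two worked examples can be checked with a small base-R helper (a sketch, not part of the syllabus code):

```r
# Entropy of a vector of class labels: -sum(p_i * log2(p_i))
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  -sum(p * log2(p))
}

entropy(c(rep("Yes", 5), rep("No", 5)))   # 1 (maximum impurity)
entropy(c(rep("Yes", 8), rep("No", 2)))   # ~0.722 (less impure)
```

`table()` only lists classes that actually occur, so zero-probability terms never produce `NaN`.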
Q186. ๐ข Explain the role of information gain in decision trees.
[Asked: Jun 2023 | Frequency: 1]
Information Gain measures the reduction in entropy after a split, used to select the best attribute.
Formula:
$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
Process:
1. Calculate the entropy of the parent node
2. Calculate the weighted average entropy of the child nodes
3. Information Gain = Parent Entropy - Weighted Children Entropy
4. Choose the attribute with the highest Information Gain
Example:
Parent: 6 Yes, 4 No
Entropy(Parent) = 0.97
After split on "Age":
- Age ≤ 30: 2 Yes, 3 No → Entropy = 0.97
- Age > 30: 4 Yes, 1 No → Entropy = 0.72
Weighted Entropy = (5/10)×0.97 + (5/10)×0.72 = 0.845
Information Gain = 0.97 - 0.845 = 0.125
Best Split: Attribute with highest Information Gain
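The worked example above can be reproduced in base R (class counts taken from the example):

```r
# Entropy from a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

parent <- entropy(c(6, 4))                    # ~0.971
left   <- entropy(c(2, 3))                    # Age <= 30: ~0.971
right  <- entropy(c(4, 1))                    # Age > 30:  ~0.722
weighted  <- (5/10) * left + (5/10) * right   # ~0.846
info_gain <- parent - weighted                # ~0.125
```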
Q187. ๐ข What are categorical and continuous variables?
[Asked: Jun 2023 | Frequency: 1]
Categorical Variables:
- Discrete categories or groups
- No inherent numerical meaning
- Examples: Gender, Color, Product Type
Continuous Variables:
- Numerical values within a range
- Can take any value in that range
- Examples: Age, Income, Temperature
Comparison:
| Aspect | Categorical | Continuous |
|---|---|---|
| Values | Finite set | Infinite range |
| Type | Qualitative | Quantitative |
| Example | Low/Medium/High | 23.5, 45.2, 67.8 |
| Statistics | Mode, frequency | Mean, std dev |
In R:
# Categorical (Factor)
gender <- factor(c("Male", "Female", "Male"))
# Continuous (Numeric)
age <- c(25.5, 30.2, 45.8)
# Check types
is.factor(gender) # TRUE
is.numeric(age) # TRUE
Q188. ๐ก Explain Partitioning and Pruning in Decision Trees.
[Asked: Dec 2023, Jun 2025 | Frequency: 2]
Partitioning (Splitting): The process of dividing data at each node based on a feature.
Pruning: The process of removing branches to prevent overfitting.
Comparison:
| Aspect | Partitioning | Pruning |
|---|---|---|
| Phase | Tree building | Tree optimization |
| Goal | Create splits | Remove branches |
| Effect | Grows tree | Shrinks tree |
| Prevents | Underfitting | Overfitting |
Types of Pruning:
| Type | When | Description |
|---|---|---|
| Pre-pruning | During growth | Stop early (max depth) |
| Post-pruning | After growth | Remove weak branches |
In R:
# Control pruning with cp (complexity parameter)
tree <- rpart(y ~ x, data, cp = 0.01)
# Prune to optimal cp
pruned_tree <- prune(tree, cp = 0.05)
Q189. ๐ข What is a Random Forest?
[Asked: Dec 2023 | Frequency: 1]
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions.
Key Concepts:
| Concept | Description |
|---|---|
| Ensemble | Multiple models combined |
| Bagging | Bootstrap sampling of data |
| Feature Randomness | Random subset of features per tree |
| Voting | Classification by majority vote |
| Averaging | Regression by average |
Advantages:
- Reduces overfitting
- Handles high-dimensional data
- Works with missing values
- Provides feature importance
Q190. ๐ข How does Random Forest differ from Decision Tree?
[Asked: Dec 2023 | Frequency: 1]
Comparison:
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Number | Single tree | Many trees (forest) |
| Data | Full dataset | Bootstrap samples |
| Features | All features | Random subset |
| Overfitting | High risk | Lower risk |
| Accuracy | Lower | Higher |
| Interpretability | Easy | Harder |
| Speed | Faster | Slower |
When to Use:
- Decision Tree: when interpretability is needed or the dataset is small
- Random Forest: when accuracy matters and the dataset is large
Q191. ๐ด Explain Random Forest algorithm in R.
[Asked: Dec 2024, Jun 2024 | Frequency: 3]
Random Forest in R:
# Step 1: Install and load package
install.packages("randomForest")
library(randomForest)
# Step 2: Prepare data
data(iris) # Example dataset
# Step 3: Split data
set.seed(123)
train_idx <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]
# Step 4: Build Random Forest model
rf_model <- randomForest(Species ~ .,
data = train_data,
ntree = 100,
mtry = 2)
# Step 5: View model
print(rf_model)
# Step 6: Feature importance
importance(rf_model)
varImpPlot(rf_model)
# Step 7: Predict
predictions <- predict(rf_model, test_data)
# Step 8: Evaluate
confusion_matrix <- table(predictions, test_data$Species)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
Parameters:
| Parameter | Description |
|---|---|
| `ntree` | Number of trees |
| `mtry` | Features tried per split |
| `importance` | Calculate variable importance |
| `nodesize` | Minimum terminal node size |
Q192. ๐ก What is Clustering?
[Asked: Jun 2022, Dec 2023 | Frequency: 2]
Clustering is an unsupervised learning technique that groups similar data points together without predefined labels.
Types of Clustering:
| Type | Algorithm | Description |
|---|---|---|
| Partitioning | K-Means | Divide into k clusters |
| Hierarchical | Agglomerative | Build tree of clusters |
| Density-based | DBSCAN | Group by density |
| Model-based | GMM | Probabilistic models |
K-Means Process: choose k → initialize k centroids → assign each point to its nearest centroid → recompute centroids → repeat until assignments stop changing.
Applications:
- Customer segmentation
- Image compression
- Anomaly detection
- Document clustering
Q193. ๐ก Write steps for K-Means Clustering in R.
[Asked: Jun 2022, Jun 2024 | Frequency: 2]
K-Means Clustering in R:
# Step 1: Prepare data
data(iris)
# Use only numeric columns
data <- iris[, 1:4]
# Step 2: Scale data (important for K-Means)
data_scaled <- scale(data)
# Step 3: Determine optimal k (Elbow method)
wss <- sapply(1:10, function(k) {
kmeans(data_scaled, k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within-cluster SS")
# Step 4: Apply K-Means
set.seed(123)
kmeans_result <- kmeans(data_scaled, centers = 3, nstart = 25)
# Step 5: View results
print(kmeans_result$cluster) # Cluster assignments
print(kmeans_result$centers) # Cluster centroids
print(kmeans_result$size) # Cluster sizes
# Step 6: Visualize clusters
library(cluster)
clusplot(data_scaled, kmeans_result$cluster,
color = TRUE, shade = TRUE)
# Step 7: Add cluster to data
iris$Cluster <- kmeans_result$cluster
# Step 8: Compare with actual species
table(iris$Cluster, iris$Species)
Q194. ๐ข What is Confusion Matrix?
[Asked: Jun 2022 | Frequency: 1]
Confusion Matrix is a table showing the performance of a classification model by comparing predicted vs actual values.
Structure (Binary Classification):
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Metrics:
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correctness |
| Precision | TP/(TP+FP) | Positive predictive value |
| Recall | TP/(TP+FN) | Sensitivity |
| F1 Score | 2×(P×R)/(P+R) | Harmonic mean |
Example in R:
actual <- c(1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)
# Confusion matrix
table(Predicted = predicted, Actual = actual)
# Actual
# Predicted 0 1
# 0 3 1
# 1 1 3
# TP=3, TN=3, FP=1, FN=1
# Accuracy = (3+3)/8 = 75%
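Continuing the same example, all four metrics from the table can be computed directly from the cell counts:

```r
# Counts from the confusion matrix above
TP <- 3; TN <- 3; FP <- 1; FN <- 1

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                # 0.75
precision <- TP / (TP + FP)                                  # 0.75
recall    <- TP / (TP + FN)                                  # 0.75
f1        <- 2 * precision * recall / (precision + recall)   # 0.75
```

Here all four happen to equal 0.75 because the errors are symmetric (FP = FN = 1).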
Q195. ๐ข Define Classification.
[Asked: Jun 2022 | Frequency: 1]
Classification is a supervised learning task that assigns predefined labels to data based on training examples.
Characteristics:
| Aspect | Description |
|---|---|
| Input | Features (predictors) |
| Output | Discrete class label |
| Learning | Supervised (labeled data) |
| Examples | Spam/Not Spam, Disease/Healthy |
Common Algorithms:
| Algorithm | Type |
|---|---|
| Logistic Regression | Linear |
| Decision Tree | Tree-based |
| Random Forest | Ensemble |
| SVM | Kernel-based |
| k-NN | Instance-based |
| Naive Bayes | Probabilistic |
Q196. ๐ข Write steps for Classification in R.
[Asked: Jun 2022 | Frequency: 1]
Classification Steps in R:
# Step 1: Load data
data(iris)
# Step 2: Split into train/test
set.seed(123)
train_idx <- sample(1:nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]
# Step 3: Build classifier (using Random Forest)
library(randomForest)
model <- randomForest(Species ~ ., data = train)
# Step 4: Predict on test data
predictions <- predict(model, test)
# Step 5: Evaluate with confusion matrix
conf_matrix <- table(Predicted = predictions,
Actual = test$Species)
print(conf_matrix)
# Step 6: Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
# Step 7: Other metrics
library(caret)
confusionMatrix(predictions, test$Species)
Q197. ๐ข Write short note on Support Vector Machines.
[Asked: Jun 2025 | Frequency: 1]
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal hyperplane to separate classes.
Key Concepts:
| Concept | Description |
|---|---|
| Hyperplane | Decision boundary |
| Support Vectors | Points closest to boundary |
| Margin | Distance between classes |
| Kernel | Transform non-linear data |
Kernels:
- Linear: for linearly separable data
- RBF (Radial Basis Function): for non-linear data
- Polynomial: higher-dimensional mapping
SVM in R:
library(e1071)
# Train SVM
model <- svm(Species ~ ., data = train, kernel = "radial")
# Predict
predictions <- predict(model, test)
# Accuracy
mean(predictions == test$Species)
Q198. ๐ก What is Time Series Analysis?
[Asked: Jun 2025, Dec 2023 | Frequency: 2]
Time Series Analysis is the study of data points collected over time to identify patterns, trends, and make forecasts.
Components:
| Component | Description |
|---|---|
| Trend | Long-term increase/decrease |
| Seasonality | Regular periodic patterns |
| Cyclic | Non-fixed period fluctuations |
| Noise | Random variations |
Applications:
- Stock price prediction
- Weather forecasting
- Sales forecasting
- Economic indicators
Common Models:
- ARIMA (AutoRegressive Integrated Moving Average)
- Exponential Smoothing
- LSTM (deep learning)
Q199. ๐ข Write steps for Time Series Analysis in R.
[Asked: Jun 2024 | Frequency: 1]
Time Series Analysis in R:
# Step 1: Create time series object
data <- c(112, 118, 132, 129, 121, 135, 148, 148,
136, 119, 104, 118, 115, 126, 141, 135)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)
# Step 2: Plot time series
plot(ts_data, main = "Monthly Sales",
xlab = "Time", ylab = "Sales")
# Step 3: Decompose into components
decomposed <- decompose(ts_data)
plot(decomposed)
# Step 4: Check stationarity
library(tseries)
adf.test(ts_data)
# Step 5: Build ARIMA model
library(forecast)
model <- auto.arima(ts_data)
summary(model)
# Step 6: Forecast
forecast_result <- forecast(model, h = 6) # 6 periods ahead
plot(forecast_result)
# Step 7: Evaluate accuracy
accuracy(model)
Key Functions:
| Function | Purpose |
|---|---|
| `ts()` | Create a time series |
| `decompose()` | Extract components |
| `auto.arima()` | Automatic ARIMA fitting |
| `forecast()` | Future predictions |
Q200. ๐ข Write short note on Association Rules.
[Asked: Implied from syllabus | Frequency: 1]
Association Rules discover relationships between items in transactional datasets (Market Basket Analysis).
Key Metrics:
| Metric | Formula | Description |
|---|---|---|
| Support | P(A∩B) | Frequency of itemset |
| Confidence | P(B\|A) | Conditional probability |
| Lift | Confidence/P(B) | Strength of rule |
Example Rule:
{Bread, Butter} → {Milk}
Support = 30% (30% of transactions have all three)
Confidence = 80% (80% of Bread+Butter buyers also buy Milk)
Lift = 2.5 (2.5x more likely than random)
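Before reaching for `arules`, the three metrics can be illustrated by hand in base R on a tiny made-up transaction list (the numbers below come from this toy data, not from the 30%/80%/2.5 example above):

```r
# Five made-up transactions
tx <- list(c("Bread", "Butter", "Milk"),
           c("Bread", "Butter"),
           c("Milk"),
           c("Bread", "Butter", "Milk"),
           c("Bread"))

# Fraction of transactions containing all the given items
freq <- function(items) mean(sapply(tx, function(t) all(items %in% t)))

support    <- freq(c("Bread", "Butter", "Milk"))      # 0.4  = P(A ∩ B)
confidence <- support / freq(c("Bread", "Butter"))    # 0.667 = P(B|A)
lift       <- confidence / freq("Milk")               # 1.11  = confidence / P(B)
```

A lift above 1 means the rule {Bread, Butter} → {Milk} occurs more often than if the items were independent.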
Apriori Algorithm in R:
library(arules)
# Load transaction data
data("Groceries")
# Generate rules
rules <- apriori(Groceries,
parameter = list(support = 0.01,
confidence = 0.5))
# View top rules
inspect(head(sort(rules, by = "lift"), 10))
# Visualize
library(arulesViz)
plot(rules, method = "graph")
Q201. ๐ข Explain the role of pruning in decision trees.
[Asked: Dec 2023 | Frequency: 1]
Pruning is the process of removing branches from a fully grown decision tree to reduce overfitting and improve generalization.
Types of Pruning:
| Type | When Applied | Description |
|---|---|---|
| Pre-pruning | During building | Stop growth early (max depth, min samples) |
| Post-pruning | After building | Remove weak branches from full tree |
Why Pruning is Needed:
| Without Pruning | With Pruning |
|---|---|
| Overfits training data | Better generalization |
| Complex tree | Simpler tree |
| Memorizes noise | Captures patterns |
| Poor test accuracy | Better test accuracy |
Cost-Complexity Pruning (in R):
# Build full tree
full_tree <- rpart(y ~ ., data = train, cp = 0)
# Find optimal cp
printcp(full_tree)
plotcp(full_tree)
# Prune tree
optimal_cp <- full_tree$cptable[which.min(full_tree$cptable[,"xerror"]),"CP"]
pruned_tree <- prune(full_tree, cp = optimal_cp)
Q202. ๐ข Explain the role of tree selection process in decision trees.
[Asked: Dec 2023 | Frequency: 1]
Tree Selection Process involves choosing the best tree from multiple candidates based on validation performance.
Steps:
| Step | Description |
|---|---|
| 1 | Build multiple trees with different parameters |
| 2 | Evaluate each on validation set |
| 3 | Select tree with best performance |
| 4 | Test on held-out test set |
Selection Criteria:
| Criterion | Description |
|---|---|
| Accuracy | Classification correctness |
| AUC-ROC | Discrimination ability |
| Cross-validation error | Average across folds |
| Complexity | Prefer simpler trees |
Q203. ๐ข How do categorical and continuous variables relate to decision trees?
[Asked: Jun 2023 | Frequency: 1]
Decision trees handle both categorical and continuous variables differently:
Categorical Variables:
- Splits based on category membership
- Binary split: one category vs the rest
- Multi-way split: each category gets its own branch
Continuous Variables:
- Splits based on threshold values
- Binary split: ≤ threshold vs > threshold
- The best threshold is found by evaluating candidate split points
Comparison:
| Aspect | Categorical | Continuous |
|---|---|---|
| Split Type | Category groups | Threshold |
| Question | "Is color = Red?" | "Is age ≤ 30?" |
| Finding Split | Try category combos | Try all thresholds |
| Encoding | Not needed | Not needed |
Example Tree: (diagram omitted; the tree splits on both Age and Education)
In this tree:
- Age is continuous (threshold split, e.g. "Is age ≤ 30?")
- Education is categorical (category split)
Q204. ๐ข What is continuous variable?
[Asked: Jun 2023 | Frequency: 1]
Continuous Variable is a numerical variable that can take any value within a range, including decimals.
Characteristics:
| Characteristic | Description |
|---|---|
| Infinite values | Any value in range possible |
| Measurable | Can be measured precisely |
| Ordered | Has natural ordering |
| Arithmetic | Math operations meaningful |
Examples:
| Variable | Possible Values |
|---|---|
| Height | 150.5 cm, 175.23 cm |
| Temperature | 36.5°C, 98.6°F |
| Salary | $50,000.00 |
| Time | 2.5 hours |
| Distance | 10.75 km |
Continuous vs Discrete:
| Continuous | Discrete |
|---|---|
| Any value | Specific values only |
| Measured | Counted |
| Decimals possible | Usually integers |
| Temperature, weight | Number of children |
In R:
# Continuous
age <- 25.5
temperature <- 98.6
is.numeric(age) # TRUE
# Summary statistics apply
mean(c(25.5, 30.2, 28.7)) # 28.13
Q205. ๐ข Write short note on Association Rules using R.
[Asked: Jun 2024 | Frequency: 1]
Association Rules in R using arules package:
Installation:
install.packages("arules")
install.packages("arulesViz")
library(arules)
library(arulesViz)
Steps:
# Step 1: Load data
data("Groceries") # Built-in transaction data
# Step 2: Explore data
summary(Groceries)
itemFrequencyPlot(Groceries, topN = 10)
# Step 3: Generate rules using Apriori
rules <- apriori(Groceries,
parameter = list(
support = 0.001,
confidence = 0.5,
minlen = 2
))
# Step 4: View rules
summary(rules)
inspect(head(rules, 10))
# Step 5: Sort by metrics
rules_sorted <- sort(rules, by = "lift")
inspect(head(rules_sorted, 5))
# Step 6: Visualize
plot(rules, method = "scatter")
plot(rules[1:20], method = "graph")
Key Parameters:
| Parameter | Description |
|---|---|
| support | Min frequency of itemset |
| confidence | Min conditional probability |
| minlen | Minimum items in rule |
| maxlen | Maximum items in rule |
Q206. ๐ข What is the purpose of Central Limit Theorem?
[Asked: From Book Chapter 2 | Frequency: 1]
Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population distribution.
Statement: For samples of size n from a population with mean μ and standard deviation σ:
- The sample mean approximately follows N(μ, σ/√n) as n → ∞ (i.e., the standard error of the mean is σ/√n)
Purpose:
| Purpose | Description |
|---|---|
| Inference | Make conclusions about population |
| Hypothesis Testing | Use normal distribution for tests |
| Confidence Intervals | Calculate error bounds |
| Estimation | Estimate population parameters |
Importance:
- Works for any population distribution
- Larger n → better approximation
- n ≥ 30 is usually sufficient
- Foundation of statistical inference
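A quick base-R simulation (illustrative; the exponential population and sample sizes are chosen here for demonstration) shows the theorem in action for a strongly skewed population:

```r
set.seed(42)
# Skewed population: Exponential with mean 1 and sd 1
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to mu = 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(30) ~ 0.183
hist(sample_means, main = "Sampling distribution of the mean")  # roughly bell-shaped
```

Even though individual exponential draws are highly skewed, the histogram of the 5000 sample means is approximately normal, as the CLT predicts.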
END OF MCS-226 COMPREHENSIVE ANSWER BOOK
QUICK REVISION SUMMARY
Key Formulas
| Topic | Formula |
|---|---|
| Jaccard Similarity | \|A∩B\| / \|A∪B\| |
| Euclidean Distance | √Σ(xᵢ-yᵢ)² |
| Cosine Similarity | A·B / (\|A\|×\|B\|) |
| PageRank | (1-d)/N + d×Σ(PR(q)/L(q)) |
| Entropy | -Σ pᵢ log₂(pᵢ) |
| Information Gain | Entropy(parent) - Weighted Entropy(children) |
Important R Functions
| Task | Function |
|---|---|
| Linear Regression | lm() |
| Logistic Regression | glm(family=binomial) |
| Decision Tree | rpart() |
| Random Forest | randomForest() |
| K-Means | kmeans() |
| Time Series | ts(), arima() |
Big Data Technologies
| Technology | Purpose |
|---|---|
| Hadoop | Distributed storage & processing |
| MapReduce | Parallel computation paradigm |
| Spark | Fast in-memory processing |
| Hive | SQL on Hadoop |
| HBase | NoSQL column store |
| NoSQL | Flexible, scalable databases |
Best of Luck for Your Exam!