MI-AI MCQ: June 2018

Archive for June 2018

What are the two methods used for the calibration in Supervised Learning?

Answer: The two methods used for predicting good probabilities in Supervised Learning are

a) Platt Calibration

b) Isotonic Regression

These methods are designed for binary classification, and it is not trivial.

Which method is frequently used to prevent overfitting?

Answer: When there is sufficient data 'Isotonic Regression' is used to prevent an overfitting issue.

What is the difference between heuristic for rule learning and heuristics for decision trees?

Answer: The difference is that the heuristics for decision trees evaluate the average quality of a number of disjointed sets while rule learners only evaluate the quality of the set of instances that is covered with the candidate rule.

What is Perceptron in Machine Learning?

Answer: In Machine Learning, Perceptron is an algorithm for supervised classification of the input into one of several possible non-binary outputs.

Explain the two components of Bayesian logic program?

Answer: Bayesian logic program consists of two components. The first component is a logical one ; it consists of a set of Bayesian Clauses, which captures the qualitative structure of the domain. The second component is a quantitative one, it encodes the quantitative information about the domain.

What are Bayesian Networks (BN) ?

Answer: Bayesian Network is used to represent the graphical model for probability relationship among a set of variables .

Why instance based learning algorithm sometimes referred as Lazy learning algorithm?

Answer: Instance based learning algorithm is also referred as Lazy learning algorithm as they delay the induction or generalization process until classification is performed.

What are the two classification methods that SVM ( Support Vector Machine) can handle?

Answer:

a) Combining binary classifiers

b) Modifying binary to incorporate multiclass learning

What are the two paradigms of ensemble methods?

Answer: The two paradigms of ensemble methods are

a) Sequential ensemble methods

b) Parallel ensemble methods

What is ensemble learning?

Answer: To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.

Why ensemble learning is used?

Answer: Ensemble learning is used to improve the classification, prediction, function approximation etc of a model.

When to use ensemble learning?

Answer: Ensemble learning is used when you build component classifiers that are more accurate and independent from each other.

What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?

Answer: The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is a method in ensemble for improving unstable estimation or classification schemes. While boosting method are used sequentially to reduce the bias of the combined model. Boosting and Bagging both can reduce errors by reducing the variance term.

What is an Incremental Learning algorithm in ensemble?

Answer: Incremental learning method is the ability of an algorithm to learn from new data that may be available after classifier has already been generated from already available dataset.

What is bias-variance decomposition of classification error in ensemble method?

Answer: The expected error of a learning algorithm can be decomposed into bias and variance. A bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm's prediction fluctuates for different training sets.

What are support vector machines?

Answer: Support vector machines are supervised learning algorithms used for classification and regression analysis.

What is dimension reduction in Machine Learning?

Answer: In Machine Learning and statistics, dimension reduction is the process of reducing the number of random variables under considerations and can be divided into feature selection and feature extraction

What are the components of relational evaluation techniques?

Answer: The important components of relational evaluation techniques are

a) Data Acquisition

b) Ground Truth Acquisition

c) Cross Validation Technique

d) Query Type

e) Scoring Metric

f) Significance Test

What are the different methods for Sequential Supervised Learning?

Answer: The different methods to solve Sequential Supervised Learning problems are

a) Sliding-window methods

b) Recurrent sliding windows

c) Hidden Markow models

d) Maximum entropy Markow models

e) Conditional random fields

f) Graph transformer networks

What are the different categories you can categorized the sequence learning process?

a) Sequence prediction

b) Sequence generation

c) Sequence recognition

d) Sequential decision

What is batch statistical learning?

Answer: Statistical learning techniques allow learning a function or predictor from a set of observed data that can make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on the future unseen data based on a statistical assumption on the data generating process.

What are the areas in robotics and information processing where sequential prediction problem arises?

Answer: The areas in robotics and information processing where sequential prediction problem arises are

a) Imitation Learning

b) Structured prediction

c) Model based reinforcement learning

What is PAC Learning?

Answer: PAC (Probably Approximately Correct) learning is a learning framework that has been introduced to analyze learning algorithms and their statistical efficiency.

What is PCA, KPCA and ICA used for?

Answer: PCA (Principal Components Analysis), KPCA ( Kernel based Principal Component Analysis) and ICA ( Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.

Explain what is the function of 'Unsupervised Learning'?

a) Find clusters of the data

b) Find low-dimensional representations of the data

c) Find interesting directions in data

d) Interesting coordinates and correlations

e) Find novel observations/ database cleaning

Mention the difference between Data Mining and Machine learning?

Answer: Machine learning relates with the study, design and development of the algorithms that give computers the capability to learn without being explicitly programmed. While, data mining can be defined as the process in which the unstructured data tries to extract knowledge or unknown interesting patterns. During this process machine, learning algorithms are used.

How would you simulate the approach AlphaGo took to beat Lee Sedol at Go?

Answer: AlphaGo beating Lee Sedol, the best human player at Go, in a best-of-five series was a truly seminal event in the history of machine learning and deep learning. The Nature paper above describes how this was accomplished with "Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play."

How do you think Google is training data for self-driving cars?

Answer: Machine learning interview questions like this one really test your knowledge of different machine learning methods, and your inventiveness if you don't know the answer. Google is currently using recaptcha to source labelled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX — some of which was obtained by his grad students driving buggies on desert dunes!

Where do you usually source datasets?

Answer: Machine learning interview questions like these try to get at the heart of your machine learning interest. Somebody who is truly passionate about machine learning will have gone off and done side projects on their own, and have a good idea of what great datasets are out there. If you're missing any, check out Quandl for economic and financial data, and Kaggle's Datasets collection for another great list.

How would you approach the "Netflix Prize" competition?

Answer: The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better collaborative filtering algorithm. The team that won called BellKor had a 10% improvement and used an ensemble of different methods to win. Some familiarity with the case and its solution will help demonstrate you've paid attention to machine learning for a while.

What are your favorite use cases of machine learning models?

Answer: The Quora thread above contains some examples, such as decision trees that categorize people into different tiers of intelligence based on IQ scores. Make sure that you have a few examples in mind and describe what resonated with you. It's important that you demonstrate an interest in how machine learning is implemented.

Do you have research experience in machine learning?

Answer: Related to the last point, most organizations hiring for machine learning positions will look for your formal experience in the field. Research papers, co-authored or supervised by leaders in the field, can make the difference between you being hired and not. Make sure you have a summary of your research experience and papers ready — and an explanation for your background and lack of formal research experience if you don't.

What are the last machine learning papers you've read?

Answer: Keeping up with the latest scientific literature on machine learning is a must if you want to demonstrate interest in a machine learning position. This overview of deep learning in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can be a good reference paper and an overview of what's happening in deep learning — and the kind of paper you might want to cite.

What do you think of our current data process?

Answer: This kind of question requires you to listen carefully and impart feedback in a manner that is constructive and insightful. Your interviewer is trying to gauge if you'd be a valuable member of their team and whether you grasp the nuances of why certain things are set the way they are in the company's data process based on company- or industry-specific conditions. They're trying to see if you can be an intellectual peer. Act accordingly.

How can we use your machine learning skills to generate revenue?

Answer: This is a tricky question. The ideal answer would demonstrate knowledge of what drives the business and how your skills could relate. For example, if you were interviewing for music-streaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run.

The startup metrics Slideshare linked above will help you understand exactly what performance indicators are important for startups and tech companies as they think about revenue and growth.

How would you implement a recommendation system for our company's users?

Answer: A lot of machine learning interview questions of this type will involve implementation of machine learning models to a company's problems. You'll have to research the company and its industry in-depth, especially the revenue drivers the company has, and the types of users the company takes on in the context of the industry it's in.

Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?

Answer: What's important here is to define your views on how to properly visualize data and your personal preferences when it comes to tools. Popular tools include R's ggplot, Python's seaborn and matplotlib, and tools such as Plot.ly and Tableau.

Describe a hash table.

Answer: hash table is a data structure that produces an associative array. A key is mapped to certain values through the use of a hash function. They are often used for tasks such as database indexing.

What are some differences between a linked list and an array?

Answer: An array is an ordered collection of objects. A linked list is a series of objects with pointers that direct how to process them sequentially. An array assumes that every element has the same size, unlike the linked list. A linked list can more easily grow organically: an array has to be pre-defined or re-defined for organic growth. Shuffling a linked list involves changing which points direct where — meanwhile, shuffling an array is more complex and takes more memory.

Pick an algorithm. Write the pseudocode for a parallel implementation.

Answer: This kind of question demonstrates your ability to think in parallelism and how you could handle concurrency in programming implementations dealing with big data. Take a look at pseudocode frameworks such as Peril-L and visualization tools such as Web Sequence Diagrams to help you demonstrate your ability to write code that reflects parallelism.

Do you have experience with Spark or big data tools for machine learning?

Answer: You'll want to get familiar with the meaning of big data for different companies and the different tools they'll want. Spark is the big data tool most in demand now, able to handle immense datasets with speed. Be honest if you don't have experience with the tools demanded, but also take a look at job descriptions and see what tools pop up: you'll want to invest in familiarizing yourself with them.

How do you handle missing or corrupted data in a dataset?

Answer: You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.

What's the "kernel trick" and how is it useful?

Answer: The Kernel trick involves kernel functions that can enable in higher-dimension spaces without explicitly calculating the coordinates of points within that dimension: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This allows them the very useful attribute of calculating the coordinates of higher dimensions while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products. Using the kernel trick enables us effectively run algorithms in a high-dimensional space with lower-dimensional data.

How would you evaluate a logistic regression model?

Answer: A subsection of the question above. You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction etc.) and bring up a few examples and use cases.

What evaluation approaches would you work to gauge the effectiveness of a machine learning model?

Answer: You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then implement a choice selection of performance metrics: here is a fairly comprehensive list. You could use measures such as the F1 score, the accuracy, and the confusion matrix. What's important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations.

How do you ensure you're not overfitting with a model?

Answer: This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.

There are three main methods to avoid overfitting:

1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.

2- Use cross-validation techniques such as k-folds cross-validation.

3- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.

Name an example where ensemble techniques might be useful.

Answer: Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data).

You could list some examples of ensemble methods, from bagging to boosting to a "bucket of models" method and demonstrate how they could increase predictive power.

When should you use classification over regression?

Answer: Classification produces discrete values and dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.)

How would you handle an imbalanced dataset?

An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump:

1- Collect more data to even the imbalances in the dataset.

2- Resample the dataset to correct for imbalances.

3- Try a different algorithm altogether on your dataset.

What's important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.

What's the F1 score? How would you use it?

Answer: The F1 score is a measure of a model's performance. It is a weighted average of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don't matter much.

Big Data | True and False

Computerized support is only used for organizational decisions that are responses to external pressures, not for taking advantage of opportunities. T/F

Answer: False

The complexity of today's business environment creates many new challenges for organizations, such as global competition, but creates few new opportunities in return. T/F

Answer: False

In addition to deploying business intelligence (BI) systems, companies may also perform other actions to counter business pressures, such as improving customer service and entering business alliances. T/F

Answer: True

The overwhelming majority of competitive actions taken by businesses today feature computerized information system support. T/F

Answer: True

The access to data and ability to manipulate data (frequently including real-time data) are key elements of business intelligence (BI) systems. T/F

Answer: True

One of the four components of BI systems, business performance management, is a collection of source data in the data warehouse. T/F

Answer: False

Actionable intelligence is the primary goal of modern-day Business Intelligence (BI) systems vs. historical reporting that characterized Management Information Systems (MIS). T/F

Answer: True

Data warehouse and BI initiatives typically follow a process similar to that used in military intelligence initiatives. T/F

Answer: True

The two critical partnerships required for BI governance are (a) a partnership between functional area users and/or product/service area employees, and (b) a partnership between representatives of the marketing and vendor sides. T/F

Answer: False

The term intelligence in a BI context is used to describe clandestine operations dedicated to stealing corporate secrets, in the manner of the government's CIA and other covert agencies. T/F

Answer: False

Information systems that support such transactions as ATM withdrawals, bank deposits, and cash register scans at the grocery store represent transaction processing, a critical branch of BI. T/F

Answer: False

Many business users in the 1980s referred to their mainframes as "the black hole," because all the information went into it, but little ever came back and ad hoc real-time querying was virtually impossible. T/F

Answer: True

The success of BI is assured not because of which personnel would be the most likely to use it, but as a result of pervasive adoption across the organization. T/F

Answer: False

BI represents a bold new paradigm in which the company's business strategy must be aligned to its business intelligence analysis initiatives. T/F

Answer: False

Traditional BI systems use a large volume of static data that has been extracted, cleansed, and loaded into a data warehouse to produce reports and analyses. T/F

Answer: True

Almost all BI applications are constructed with shells provided by an outsourcing provider who may themselves create a custom solution for a vendor or work with another client. T/F

Answer: False

The use of dashboards and data visualizations is seldom effective in finding efficiencies in organizations, as demonstrated by the Seattle Children's Hospital Case Study. T/F

Answer: False

The use of statistics in baseball by the Oakland Athletics, as described in the Moneyball case study, is an example of the effectiveness of prescriptive analytics. T/F

Answer: True

Pushing programming out to distributed data is achieved solely by using the Hadoop Distributed File System or HDFS. T/F

Answer: False

Volume, velocity, and variety of data characterize the Big Data paradigm. T/F

Answer: True

In the Isle of Capri case, the only capability added by the new software was increased processing speed of processing reports. T/F

Answer: False

The "islands of data" problem in the 1980s describes the phenomenon of unconnected data being stored in numerous locations within an organization. T/F

Answer: True

Subject oriented databases for data warehousing are organized by detailed subjects such as disk drives, computers, and networks. T/F

Answer: False

Data warehouses are subsets of data marts. T/F

Answer: False

One way an operational data store differs from a data warehouse is the recency of their data. T/F

Answer: True

Organizations seldom devote a lot of effort to creating metadata because it is not important for the effective use of data warehouses. T/F

Answer: False

Without middleware, different BI programs cannot easily connect to the data warehouse. T/F

Answer: True

Two-tier data warehouse/BI infrastructures offer organizations more flexibility but cost more than three-tier ones. T/F

Answer: False

Moving the data into a data warehouse is usually the easiest part of its creation. T/F

Answer: False

The hub-and-spoke data warehouse model uses a centralized warehouse feeding dependent data marts. T/F

Answer: True

Because of performance and data quality issues, most experts agree that the federated architecture should supplement data warehouses, not replace them. T/F

Answer: True

Bill Inmon advocates the data mart bus architecture whereas Ralph Kimball promotes the hub-and-spoke architecture, a data mart bus architecture with conformed dimensions. T/F

Answer: False

The ETL process in data warehousing usually takes up a small portion of the time in a data-centric project. T/F

Answer: False

In the Starwood Hotels case, up-to-date data and faster reporting helped hotel managers better manage their occupancy rates. T/F

Answer: True

Large companies, especially those with revenue upwards of $500 million consistently reap substantial cost savings through the use of hosted data warehouses. T/F

Answer: False

OLTP systems are designed to handle ad hoc analysis and complex queries that deal with many data items. T/F

Answer: False

The data warehousing maturity model consists of six stages: prenatal, infant, child, teenager, adult, and sage. T/F

Answer: True

A well-designed data warehouse means that user requirements do not have to change as business needs change. T/F

Answer: False

Data warehouse administrators (DWAs) do not need strong business insight since they only handle the technical aspect of the infrastructure. T/F

Answer: False

Because the recession has raised interest in low-cost open source software, it is now set to replace traditional enterprise software. T/F

Answer: False

All of the following are true about in-database processing technology EXCEPT

A) it pushes the algorithms to where the data is.
B) it makes the response to queries much faster than conventional databases.
C) it is often used for apps like credit card fraud detection and investment risk management.
D) it is the same as in-memory storage technology.

Answer: D

How does the use of cloud computing affect the scalability of a data warehouse?

A) Cloud computing vendors bring as much hardware as needed to users' offices.
B) Hardware resources are dynamically allocated as use increases.
C) Cloud vendors are mostly based overseas where the cost of labor is low.
D) Cloud computing has little effect on a data warehouse's scalability.

Answer: B

Which of the following statements is more descriptive of active data warehouses in contrast with traditional data warehouses?

A) strategic decisions whose impacts are hard to measure
B) detailed data available for strategic use only
C) large numbers of users, including operational staffs
D) restrictive reporting with daily and weekly data currency

Answer: C

Active data warehousing can be used to support the highest level of decision making sophistication and power. The major feature that enables this in relation to handling the data is

A) country of (data) origin.
B) nature of the data.
C) speed of data transfer.
D) source of the data.

Answer: C

Which of the following online analytical processing (OLAP) technologies does NOT require the precomputation and storage of information?

A) MOLAP
B) ROLAP
C) HOLAP
D) SQL

Answer: B

When querying a dimensional database, a user went from summarized data to its underlying details. The function that served this purpose is

A) dice.
B) slice.
C) roll-up.
D) drill down.

Answer: D

When representing data in a data warehouse, using several dimension tables that are each connected only to a fact table means you are using which warehouse structure?

A) star schema
B) snowflake schema
C) relational schema
D) dimensional schema

Answer: A

All of the following are benefits of hosted data warehouses EXCEPT

A) smaller upfront investment.
B) better quality hardware.
C) greater control of data.
D) frees up in-house systems.

Answer: C

Data warehouses provide direct and indirect benefits to using organizations. Which of the following is an indirect benefit of data warehouses?

A) better and more timely information
B) extensive new analyses performed by users
C) simplified access to data
D) improved customer service

Answer: D

In which stage of extraction, transformation, and load (ETL) into a data warehouse are anomalies detected and corrected?

A) transformation
B) extraction
C) load
D) cleanse

Answer: D

In which stage of extraction, transformation, and load (ETL) into a data warehouse are data aggregated?

A) transformation
B) extraction
C) load
D) cleanse

Answer: A

Which approach to data warehouse integration focuses more on sharing process functionality than data across systems?

A) extraction, transformation, and load
B) enterprise application integration
C) enterprise information integration
D) enterprise function integration

Answer: B

Which data warehouse architecture uses a normalized relational warehouse that feeds multiple data marts?

A) independent data marts architecture
B) centralized data warehouse architecture
C) hub-and-spoke data warehouse architecture
D) federated architecture

Answer: C

Which data warehouse architecture uses metadata from existing data warehouses to create a hybrid logical data warehouse comprised of data from the other warehouses?

A) independent data marts architecture
B) centralized data warehouse architecture
C) hub-and-spoke data warehouse architecture
D) federated architecture

Answer: D

Which of the following BEST enables a data warehouse to handle complex queries and scale up to handle many more requests?

A) use of the web by users as a front-end
B) parallel processing
C) Microsoft Windows
D) a larger IT staff

Answer: B

A Web client that connects to a Web server, which is in turn connected to a BI application server, is reflective of a

A) one tier architecture.
B) two tier architecture.
C) three tier architecture.
D) four tier architecture.

Answer: C

All of the following statements about metadata are true EXCEPT

A) metadata gives context to reported data.
B) there may be ethical issues involved in the creation of metadata.
C) metadata helps to describe the meaning and structure of data.
D) for most organizations, data warehouse metadata are an unnecessary expense.

Answer: D

Which kind of data warehouse is created separately from the enterprise data warehouse by a department and not reliant on it for updates?

A) sectional data mart
B) public data mart
C) independent data mart
D) volatile data mart

Answer: C

Operational or transaction databases are product oriented, handling transactions that update the database. In contrast, data warehouses are

A) subject-oriented and nonvolatile.
B) product-oriented and nonvolatile.
C) product-oriented and volatile.
D) subject-oriented and volatile.

Answer: A

The "single version of the truth" embodied in a data warehouse such as Capri Casinos' means all of the following EXCEPT

A) decision makers get to see the same results to queries.
B) decision makers have the same data available to support their decisions.
C) decision makers get to use more dependable data for their decisions.
D) decision makers have unfettered access to all data in the warehouse.

Answer: D

Big Data often involves a form of distributed storage and processing using Hadoop and MapReduce. One reason for this is

A) centralized storage creates too many vulnerabilities.
B) the "Big" in Big Data necessitates over 10,000 processing nodes.
C) the processing power needed for the centralized model would overload a single computer.
D) Big Data systems have to match the geographical spread of social media.

Answer: C

Which of the following statements about Big Data is true?

A) Data chunks are stored in different locations on one computer.
B) Hadoop is a type of processor used to process Big Data applications.
C) MapReduce is a storage filing system.
D) Pure Big Data systems do not involve fault tolerance.

Answer: D

Prescriptive BI capabilities are viewed as more powerful than predictive ones for all the following reasons EXCEPT

A) prescriptive BI gives actual guidance as to actions.
B) understanding the likelihood of certain events often leaves unclear remedies.
C) only prescriptive BI capabilities have monetary value to top-level managers.
D) prescriptive models generally build on (with some overlap) predictive ones.

Answer: C

How are descriptive analytics methods different from the other two types?

A) They answer "what-if?" queries, not "how many?" queries.
B) They answer "what-is?" queries, not "what will be?" queries.
C) They answer "what to do?" queries, not "what-if?" queries.
D) They answer "what will be?" queries, not "what to do?" queries.

Answer: B

Today, many vendors offer diversified tools, some of which are completely preprogrammed (called s). How are these shells utilized?

A) They are used for customization of BI solutions.
B) All a user needs to do is insert the numbers.
C) The shell provides a secure environment for the organization's BI data.
D) They host an enterprise data warehouse that can assist in decision making.

Answer: B

What has caused the growth of the demand for instant, on-demand access to dispersed information?

A) the increasing divide between users who focus on the strategic level and those who are more oriented to the tactical level
B) the need to create a database infrastructure that is always online and contains all the information from the OLTP systems
C) the more pressing need to close the gap between the operational data and strategic objectives
D) the fact that BI cannot simply be a technical exercise for the information systems department

Answer: C

If a company's strategy is properly aligned with DW and BI initiatives, and if the company's IS organization can be made capable of playing its role in such a project, and if the requisite user community is in place and has the proper motivation, then

A) it is no longer necessary to start BI within the company.
B) it is wise to start BI and establish a BI Competency Center (BICC) within the company.
C) the organization is ready for the introduction of new data-generating technologies, such as radio-frequency identification (RFID).
D) business leaders are required to document their business processes and to sign off on the legitimacy of the information they rely on.

Answer: B

What can the BI users in an organization help guide and direct?

A) how to implement and deploy a BI initiative that can be lengthy, expensive, and failure prone
B) how the DW is structured and the types of BI tools and other supporting software that are needed
C) how to decompose the planning and execution into business, organization, functionality, and infrastructure components
D) how the DW is structured and the costs and the appreciation for different classes of potential users

Answer: B

The very design that makes an OLTP system efficient for transaction processing makes it inefficient for what?

A) end-user ad hoc reports, queries, and analysis
B) transaction processing systems that constantly update operational databases
C) the collection of reputable sources of intelligence
D) transactions such as ATM withdrawals, where we need to reduce a bank balance accordingly

Answer: A

Online transaction processing (OLTP) systems handle a company's routine ongoing business. In contrast, a data warehouse is typically

A) the end result of BI processes and operations.
B) a repository of actionable intelligence obtained from a data mart.
C) a distinct system that provides storage for data that will be made use of in analysis.
D) an integral subsystem of an online analytical processing (OLAP) system.

Answer: C

When middles look across an organization to ensure that project priorities reflect the needs of the entire business, what is their main concern?

A) that their proprietary BI methods are protected from industrial espionage
B) that additional information available through an enterprise data warehouse should assist in decision making
C) that a project does not just serve to sub-optimize one area over others
D) that return on investment (ROI) and total cost of ownership justify the cost—benefit ratio

Answer: C

Once a data warehouse is in place, the general process of intelligence creation begins with

A) end-user examinations of decision-making impacts.
B) identifying and prioritizing specific BI projects.
C) estimating the cost-benefit ratio of the ROI.
D) establishing the critical partnerships required for BI governance.

Answer: B

When Sabre developed their Enterprise Data Warehouse, they chose to use near-real time updating of their database. The main reason they did so was

A) to provide a 360 degree view of the organization.
B) to aggregate performance metrics in an understandable way.
C) to be able to assess internal operations.
D) to provide up-to-date executive insights.

Answer: D

In answering the question "Which customers are likely to be using fake credit cards?" you are most likely to use which of the following analytic applications?

A) channel optimization
B) customer segmentation
C) fraud detection
D) customer profitability

Answer: C

In answering the question "Which customers are most likely to click on my online ads and purchase my goods?" you are most likely to use which of the following analytic applications?

A) customer profitability
B) propensity to buy
C) customer attrition
D) channel optimization

Answer: B

Business intelligence (BI) can be characterized as a transformation of

A) data to information to decisions to actions.
B) Big Data to data to information to decisions.
C) actions to decisions to feedback to information.
D) data to processing to information to actions.

Answer: A

Organizations counter the pressures they experience in their business environments in multiple ways. Which of the following is NOT an effective way to counter these pressures?

A) reactive actions
B) anticipative actions
C) adaptive actions
D) retroactive actions

Answer: D

Which of the following is NOT an example that falls within the four major categories of business environment factors for today's organizations?

A) globalization
B) increased pool of customers
C) fewer government regulations
D) increased competition

Answer: C

In the Magpie Sensing case study, the automated collection of temperature and humidity data on shipped goods helped with various types of analytics. Which of the following is an example of predictive analytics?

A) real time reports of the shipment's temperature
B) warning of an open shipment seal
C) location of the shipment
D) optimal temperature setting

Answer: B

In the Magpie Sensing case study, the automated collection of temperature and humidity data on shipped goods helped with various types of analytics. Which of the following is an example of prescriptive analytics?

A) real time reports of the shipment's temperature
B) warning of an open shipment seal
C) location of the shipment
D) optimal temperature setting

Answer: D

Statistical Data Variable Type: A variable that contains the values of either Yes or No would best be categorized as which of the following variable types?

Nominal
Binary
Discrete
Ratio

Answer: Binary

Statistical Data Variable Type: A variable that contains a countable number of distinct values would best be categorized as which of the following variable types?

Ordinal
Discrete
Ratio
Interval

Answer: Discrete

Which of the following measures of central location would be best to use when the Skewness is approximately zero?

Mean
Median
Mode
2nd Quartile

Answer: Mean

Which of the following is a common way of visualizing the frequency distribution of data points over a range of possible values?

Quantitative Easing Chart
Histogram
Skewness chart
Scatterplot

Answer: Histogram

If you have survey results from 100 people and the average response is 40% with a standard deviation of 5. Which of the following can you approximate from the results

95% of the respondents think that there is a 30% - 50% chance that the FSU football team will win the ACC championship.
70% of the respondents think that there is a 30% - 50% chance that the FSU football team will win the ACC championship.
100% of the respondents think that there is a 30% - 50% chance that the FSU football team will win the ACC championship.
0% of the respondents think that there is a 30% - 50% chance that the FSU football team will win the ACC championship.

Answer: 95% of the respondents think that there is a 30% - 50% chance that the FSU football team will win the ACC championship.

Which of the following is NOT a qualitative data type?

Conversations
Surveys with numerical answers
Magazine articles
Media broadcasts

Answer: Surveys with numerical answers

There is a web-based survey that asks you, "On a rating of 1(hated it) to 5(loved it), how much did you like the movie." This value is stored in your database and you need to categorize the statistical variable type. Which of the following variable types would be best?c

There is a web-based survey that asks you, "On a rating of 1(hated it) to 5(loved it), how much did you like the movie." This value is stored in your database and you need to categorize the statistical variable type. Which of the following variable types would be best?

Ordinal
Discrete
Interval
Ratio

Answer: Interval

For a column in your dataset, your data analysis tool is telling you that the standard deviation is zero. What does this say about the data in that column?

all values are zero
all values are the same
all values are different
that statistics are worthless

Answer: all values are the same

Which is more important to you- model accuracy, or model performance?

Well, it has everything to do with how model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more accurate model would most likely predict no fraud at all if only a vast minority of cases were fraud. However, this would be useless for a predictive model — a model designed to find fraud that asserted there was no fraud at all.

How is a decision tree pruned?

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: replace each node. If it doesn't decrease predictive accuracy, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.

What cross-validation technique would you use on a time series dataset?

Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn't hold in earlier years!

You'll want to do something like forward chaining where you'll be able to model on past data then look at forward-facing data.

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]

What's the difference between a generative and discriminative model?

Answer: A generative model will learn categories of data while a discriminative model will simply learn the distinction between different categories of data. Discriminative models will generally outperform generative models on classification tasks.

What is deep learning, and how does it contrast with other machine learning algorithms?

Answer: Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. In that sense, deep learning represents an unsupervised learning algorithm that learns representations of data through the use of neural nets.

What's the difference between probability and likelihood?

Discrete Random Variables

Suppose that you have a stochastic process that takes discrete values (e.g., outcomes of tossing a coin 10 times, number of customers who arrive at a store in 10 minutes etc). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., probability of coin landing heads is pp and that coin tosses are independent).

Denote the observed outcomes by OO and the set of parameters that describe the stochastic process as θθ. Thus, when we speak of probability we want to calculate P(O|θ)P(O|θ). In other words, given specific values for θθ, P(O|θ)P(O|θ) is the probability that we would observe the outcomes represented by OO.

However, when we model a real life stochastic process, we often do not know θθ. We simply observe OO and the goal then is to arrive at an estimate for θθ that would be a plausible choice given the observed outcomes OO. We know that given a value of θθ the probability of observing OO is P(O|θ)P(O|θ). Thus, a 'natural' estimation process is to choose that value of θθ that would maximize the probability that we would actually observe OO. In other words, we find the parameter values θθ that maximize the following function:

L(θ|O)=P(O|θ)L(θ|O)=P(O|θ)
L(θ|O)L(θ|O) is called as the likelihood function. Notice that by definition the likelihood function is conditioned on the observed OO and that it is a function of the unknown parameters θθ.

Continuous Random Variables

In the continuous case the situation is similar with one important difference. We can no longer talk about the probability that we observed OO given θθ because in the continuous case P(O|θ)=0P(O|θ)=0. Without getting into technicalities, the basic idea is as follows:

Denote the probability density function (pdf) associated with the outcomes OO as: f(O|θ)f(O|θ). Thus, in the continuous case we estimate θθ given observed outcomes OO by maximizing the following function:

L(θ|O)=f(O|θ)L(θ|O)=f(O|θ)
In this situation, we cannot technically assert that we are finding the parameter value that maximizes the probability that we observe OO as we maximize the pdf associated with the observed outcomes OO.

What's a Fourier transform?

Answer: A Fourier transform is a generic method to decompose generic functions into a superposition of symmetric functions. Or as this more intuitive tutorial puts it, given a smoothie, it's how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes and phases to match any time signal. A Fourier transform converts a signal from time to frequency domain — it's a very common way to extract features from audio signals or other time series such as sensor data.

What's the difference between Type I and Type II error?

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn't, while Type II error means that you claim nothing is happening when in fact something is.

A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn't carrying a baby.

What's your favorite algorithm, and can you explain it to me in less than a minute?

Answer: This type of question tests your understanding of how to communicate complex and technical nuances with poise and the ability to summarize quickly and efficiently. Make sure you have a choice and make sure you can explain different algorithms so simply and effectively that a five-year-old could grasp the basics!

Explain the difference between L1 and L2 regularization.

L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, with many variables either being assigned a 1 or 0 in weighting. L1 corresponds to setting a Laplacean prior on the terms, while L2 corresponds to a Gaussian prior.

Why is "Naive" Bayes naive?

Despite its practical applications, especially in text mining, Naive Bayes is considered "Naive" because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components. This implies the absolute independence of features — a condition probably never met in real life.

As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that you liked pickles and ice cream would probably naively recommend you a pickle ice cream.

Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.

What is Bayes' Theorem? How is it useful in a machine learning context?

Bayes' Theorem gives you the posterior probability of an event given what is known as prior knowledge.

Mathematically, it's expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition. Say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after having a positive test?

Bayes' Theorem says no. It says that you have a (.6 0.05) (True Positive Rate of a Condition Sample) / (.60.05)(True Positive Rate of a Condition Sample) + (.5*0.95) (False Positive Rate of a Population) = 0.0594 or 5.94% chance of getting a flu.

Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.

Define precision and recall?

Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data.

Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims.

It can be easier to think of recall and precision in the context of a case where you've predicted that there were 10 apples and 5 oranges in a case of 10 apples.

You'd have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.

Explain how a ROC curve works.

Answer: The ROC curve is a graphical representation of the contrast between true positive rates and the false positive rate at various thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).

How is KNN different from k-means clustering?

Answer: K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm.

While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data you want to classify an unlabeled point into (thus the nearest neighbor part).

K-means clustering requires only a set of unlabeled points and a threshold: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by computing the mean of the distance between different points.

The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't — and is thus unsupervised learning.

What is the difference between supervised and unsupervised machine learning?

Answer: Supervised learning requires training labeled data. For example, in order to do classification (a supervised learning task), you'll need to first label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.

What's the trade-off between bias and variance?

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance — in order to get the optimally reduced amount of error, you'll have to tradeoff bias and variance. You don't want either high bias or high variance in your model.

Give a popular application of machine learning that you see on a day-to-day basis?

Answer: The recommendation engine implemented by major ecommerce websites uses Machine Learning.

What is Genetic Programming?

Answer: Genetic programming is one of the two techniques used in machine learning. The model is based on the testing and selecting the best choice among a set of results.

In what areas is pattern recognition used?

Pattern Recognition can be used in

a) Computer Vision

b) Speech Recognition

c) Data Mining

d) Statistics

e) Informal Retrieval

f) Bioinformatics

What are the advantages of Naive Bayes?

Answer: In Naïve Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. The main advantage is that it can't learn interactions between features.

What is a classifier in machine learning?

Answer: A classifier in a Machine Learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.

What is the difference between artificial learning and machine learning?

Answer: Designing and developing algorithms according to the behaviours based on empirical data are known as Machine Learning. While artificial intelligence in addition to machine learning, it also covers other aspects like knowledge representation, natural language processing, planning, robotics etc.

What is algorithm independent machine learning?

Answer: Machine learning in where mathematical foundations is independent of any particular classifier or learning algorithm is referred as algorithm independent machine learning.

Explain what is the function of 'Supervised Learning'?

Answer:

a) Classifications

b) Speech recognition

c) Regression

d) Predict time series

e) Annotate strings

What is the function of unsupervised learning?

Answer:

a) Find clusters of the data

b) Find low-dimensional representations of the data

c) Find interesting directions in data

d) Interesting coordinates and correlations

e) Find novel observations/ database cleaning

What is 'Training set' and 'Test set'?

Answer: In various areas of information science like machine learning, a set of data is used to discover the potentially predictive relationship known as 'Training Set'. Training set is an examples given to the learner, while Test set is used to test the accuracy of the hypotheses generated by the learner, and it is the set of example held back from the learner. Training set are distinct from Test set.

What is the standard approach to supervised learning?

Answer: The standard approach to supervised learning is to split the set of example into the training set and the test.

What are the three stages to build the hypotheses or model in machine learning?

Answer:

a) Model building

b) Model testing

c) Applying the model

What are the different Algorithm techniques in Machine Learning?

The different types of techniques in Machine Learning are

a) Supervised Learning

b) Unsupervised Learning

c) Semi-supervised Learning

d) Reinforcement Learning

e) Transduction

f) Learning to Learn

What are the five popular algorithms of Machine Learning?

a) Decision Trees

b) Neural Networks (back propagation)

c) Probabilistic networks

d) Nearest Neighbor

e) Support vector machines

What is inductive Machine Learning?

Answer: The inductive machine learning involves the process of learning by examples, where a system, from a set of observed instances tries to induce a general rule.

How can you avoid overfitting?

Answer: By using a lot of data overfitting can be avoided, overfitting happens relatively as you have a small dataset, and you try to learn from it. But if you have a small database and you are forced to come with a model based on that. In such situation, you can use a technique known as cross validation. In this method the dataset splits into two section, testing and training datasets, the testing dataset will only test the model while, in training dataset, the data points will come up with the model.

In this technique, a model is usually given a dataset of a known data on which training (training data set) is run and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model in the training phase.

Why does overfitting happen?

Answer: The possibility of overfitting exists as the criteria used for training the model is not the same as the criteria used to judge the efficacy of a model.

What is 'Overfitting' in Machine learning?

Answer: In machine learning, when a statistical model describes random error or noise instead of underlying relationship 'overfitting' occurs. When a model is excessively complex, overfitting is normally observed, because of having too many parameters with respect to the number of training data types. The model exhibits poor performance which has been overfit. In layman's terms the model fits too closely to the training set and does not generalize to test set.

Why overfitting happens?

Answer: The possibility of overfitting exists as the criteria used for training the model is not the same as the criteria used to judge the efficacy of a model.

How can you avoid overfitting ?

By using a lot of data overfitting can be avoided, overfitting happens relatively as you have a small dataset, and you try to learn from it. But if you have a small database and you are forced to come with a model based on that. In such situation, you can use a technique known as cross validation. In this method the dataset splits into two section, testing and training datasets, the testing dataset will only test the model while, in training dataset, the datapoints will come up with the model.

In this technique, a model is usually given a dataset of a known data on which training (training data set) is run and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model in the training phase.

Mention the difference between Data Mining and Machine learning?

What is Machine Learning?

Answer: Machine learning is a branch of computer science which deals with system programming in order to automatically learn and improve with experience. For example: Robots are programed so that they can perform the task based on data they gather from sensors. It automatically learns programs from data.

The simplest way to answer this question is - we give the data and equation to the machine. Ask the machine to look at the data and identify the coefficient values in an equation.

For example for the linear regression y=mx+c, we give the data for the variable x, y and the machine learns about the values of m and c from the data.

Modern formal definition of machine learning according to Tom Mitchell.

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Give a derivation of for a single example in batch gradient descent? (Gradient Descent For Linear Regression)

Derivation of for a single example in batch Gradient Descent For Linear Regression.

Derivation of a single variable in gradient descent.

What is the algorithm for implementing gradient descent for linear regression?

The algorithm for implementing gradient descent for linear regression

We can substitute our actual cost function and our actual hypothesis function.
m is the size of the training set, theta 0 a constant that will be changing simultaneously with theta 1 and x, y are values of the given training set (data).

How does gradient descent converge with a fixed step size alpha?

As we approach a local minimum, gradient descent will take smaller steps.
Thus no need to decrease alpha over time.

Why should we adjust the parameter alpha when using gradient descent?

To ensure that the gradient descent algorithm converges in a reasonable time.
Failure to converge or too much time to obtain the minimum value implies that our step size is wrong.

Why does gradient descent, regardless of the slope's sign, eventually converge to its minimum value?

Answer:

The following graph shows that:

• when the slope is negative, the value of theta 1 increases.
• when the slope is positive, the value of theta 1 decreases.

Depict the graphical implementation of minimizing the cost function using gradient descent.

Answer:

The graphical implementation of minimizing the cost function using gradient descent.

We put theta 0 on the x axis and theta 1 on the y axis, with the cost function on the vertical z axis.
The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.

State the algorithm for gradient descent.

Repeat until convergence, where j=0,1 represents the feature index number.

How do we implement an iteration step when calculating Gradient Descent in code?

Answer:

At each iteration j, one should simultaneously update the parameters.
Updating a specific parameter prior to calculating another one on the j iteration would yield to a wrong implementation.

What is the contour line of a two variable function?

A contour line of a two variable function has a constant value at all points of the same line.

What is a visual interpretation of the cost function?

Answer:

• The training data set is scattered on the X-Y plane.

• We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.

What are alternative terms of a Cost Function?

Answer:

Squared error function.
Mean squared error.

Give a pictorial representation of what the cost function of a supervised learning problem does.

Cost function of a supervised learning problem.

Give a pictorial representation of what the cost function of a supervised learning problem does.

What is the definition of a cost function of a supervised learning problem?

Definition of a cost function of a supervised learning problem.

Answer: Takes an average difference of all the results of the hypothesis with inputs from x's and the actual output y's.

How do we measure the accuracy of a hypothesis function?

Answer: By using a cost function, usually denoted by J.

What do we call a learning problem, if the target variable can take on only a small number of values?

Answer: When y can take on only a small number of discrete values, the learning problem is also called a classification problem.

What do we call a learning problem, if the target variable is continuous?

Answer: When the target variable that we're trying to predict is continuous, the learning problem is also called a regression problem.

What is Gradient Decent used for? What are the basic steps of Gradient Decent?

What is Gradient Decent used for?

Gradient decent is used to find the minimized values for a function, in which we simultaneously update our theta values so as not to skew our algorithm.

What are the basic steps of Gradient Decent?

1. Set our parameters equal to arbitrary values.
2. Change our thetas to reduce J(theta) until we hopefully end up a a minimum.
3.We then as which direction we can take a "baby step" in to take us "down hill" more quickly.
4. Repeat step 2 and 3 until we reach a minimum.

Give the pictorial process for a supervised learning problem. Explain Supervised Learning Problem.

Give the pictorial process for a supervised learning problem.

Supervised Learning Problem.

Give the pictorial process for a supervised learning problem.

What is supervised learning?

Supervised learning is when we are teaching a machine to learn using inputs that we know the correct outputs to.

What is unsupervised learning?

Unsupervised learning is when we are teaching a machine to learn using inputs that we do not know the correct outputs to. We can derive the structure of our given data by clustering it based upon relations among the given data.

What are two different types of supervised learning problems?

Regression Problems
Classification Problems

What is a Regression problem ?

A regression problem is a supervised machine learning problem in which we are trying to map inputs to a continuous function.

What is a Classification problem?

A classification problem is a supervised machine learning problem in which we are trying to map inputs to discrete outputs.

What are the necessary to develop a learning algorithm for a supervised machine learning problem?

1. Obtain the data set.

2. Feed the training set to the learning algorithm we have created.

3.The hypothesis takes an input and tries to output the estimated value of our output Y.

What is the purpose of our hypothesis in a supervised regression problem?

The purpose of our hypothesis in a supervised regression problem is to take an input and try to return the estimated value of our output y.