Data Requirements Analysis

Data Requirements Analysis

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

9.4 The Data Requirements Analysis Process

The data requirements analysis process employs a top-down approach that emphasizes business-driven needs, so the analysis is conducted to ensure the identified requirements are relevant and feasible. The process incorporates data discovery and assessment in the context of explicitly qualified business information consumer needs. Having identified the data requirements, candidate data sources are determined and their quality is assessed using the data quality assessment procedure described in chapter 11. Any inherent issues that can be resolved immediately are addressed using the approaches described in chapter 12, and those requirements can be used for instituting data quality control, as described in chapter 13.

The data requirements analysis process consists of these phases:

1. Identifying the business contexts

2. Conducting stakeholder interviews

3. Synthesizing expectations and requirements

4. Developing source-to-target mappings

Once these steps are completed, the resulting artifacts are reviewed to define data quality rules in relation to the dimensions of data quality described in chapter 8.
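As a minimal sketch (with invented rule names, data elements, and checks, not taken from the book), such rules might be recorded against their quality dimensions like this:

```python
# Hypothetical data quality rules tied to quality dimensions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataQualityRule:
    name: str
    dimension: str                   # e.g., completeness, validity, consistency
    data_element: str                # the data element the rule constrains
    check: Callable[[dict], bool]    # evaluates a single record

rules = [
    DataQualityRule(
        name="customer_id_present",
        dimension="completeness",
        data_element="customer_id",
        check=lambda rec: bool(rec.get("customer_id")),
    ),
    DataQualityRule(
        name="country_code_valid",
        dimension="validity",
        data_element="country_code",
        check=lambda rec: rec.get("country_code") in {"US", "CA", "MX"},
    ),
]

record = {"customer_id": "C-1001", "country_code": "GB"}
failures = [r.name for r in rules if not r.check(record)]
print(failures)  # ['country_code_valid']
```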

9.4.1 Identifying the Business Contexts

The business contexts associated with information consumption and reuse provide the scope for the determination of data requirements. Conferring with enterprise architects to understand where system boundaries intersect with lines of business will provide a good starting point for determining how (and under what circumstances) data sets are used.

Figure 9.2 shows the steps in this phase of the process:

Figure 9.2. Identifying the business contexts.

1. Identify relevant stakeholders: Stakeholders may be identified through a review of existing system documentation or may be identified by the data quality team through discussions with business analysts, enterprise analysts, and enterprise architects. The pool of relevant stakeholders may include business program sponsors, business application owners, business process managers, senior management, data consumers, system owners, as well as frontline staff members who are the beneficiaries of shared or reused information.

2. Acquire documentation: The data quality analyst must become familiar with the overall goals and objectives of the target information platforms to provide context for identifying and assessing specific information and data requirements. To do this, it is necessary to review existing artifacts that provide details about the consuming systems, requiring a review of project charters, project scoping documents, requirements, design, and testing documentation. At this stage, the analysts should accumulate any available documentation artifacts that can help in determining collective data use.

3. Document goals and objectives: Determining existing performance measures and success criteria provides a baseline representation of high-level system requirements for summarization and categorization. Conceptual data models may be available that can provide further clarification and guidance regarding the functional and operational expectations of the collection of target systems.

4. Summarize scope of capabilities: Create graphic representations that convey the high-level functions and capabilities of the targeted systems, as well as providing detail of functional requirements and target user profiles. When combined with other context knowledge, one may create a business context diagram or document that summarizes and illustrates the key data flows, functions, and capabilities of the downstream information consumers.

5. Document impacts and constraints: Constraints are conditions that affect or prevent the implementation of system functionality, whereas impacts are potential changes to characteristics of the environment to accommodate the implementation of system functionality. Identifying and understanding all relevant impacts and constraints to the target systems are critical, because the impacts and constraints often define, limit, and frame the data controls and rules that will be managed as part of the data quality environment. Not only that, source-to-target mappings may be affected by constraints or dependencies associated with the choice of candidate data sources.

The resulting artifacts describe the high-level functions of downstream systems and how organizational data is expected to meet those systems' needs. Any identified impacts or constraints of the targeted systems, such as legacy system dependencies, global reference tables, existing standards and definitions, and data retention policies, will be documented. In addition, this phase will provide a preliminary view of global reference data requirements that may impact source data element selection and transformation rules. Time stamps and organization standards for time, geography, availability and capacity of potential data sources, frequency and approaches for data extractions, and transformations are additional data points for identifying potential impacts and requirements.

9.4.2 Conduct Stakeholder Interviews

Reviewing existing documentation only provides a static snapshot of what may (or may not) be true about the state of the data environment. A more complete picture can be assembled by collecting what might be deemed "hard evidence" from the key individuals associated with the business processes that use data. Therefore, our next phase (shown in Figure 9.3) is to conduct conversations with the previously identified key stakeholders, note their critical areas of concern, and summarize those concerns as a way to identify gaps to be filled in the form of information requirements.

Figure 9.3. Conducting stakeholder interviews.

This stage of the process consists of these 5 steps:

1. Identify candidates and review roles: Review the general roles and responsibilities of the interview candidates to guide and focus the interview questions within their specific business process (and associated application) contexts.

2. Develop interview questions: The next step in interview preparation is to create a set of questions designed to elicit the business information requirements. The formulation of questions can be driven by the context information collected during the initial phase of the process. There are two broad categories of questions: directed questions, which are specific and aimed at gathering details about the functions and processes within a department or area, and open-ended questions, which are less specific and often lead to dialogue and conversation. The latter are more focused on trying to understand the information requirements for operational management and decision making.

3. Schedule and conduct interviews: Interviews with executive stakeholders should be scheduled earlier, because their time is difficult to secure. Information obtained during executive stakeholder interviews provides additional clarity regarding overall goals and objectives and may result in refinement of subsequent interviews. Interviews should be scheduled at a location where the participants will not be interrupted.

4. Summarize and identify gaps: Review and organize the notes from the interviews, including the attendee list, general notes, and answers to the specific questions. By considering the business definitions that were clarified related to various aspects of the business (especially in relation to known reference data dimensions, such as time, geography, and regulatory issues), one continues to formulate a fuller determination of system constraints and data dependencies.

5. Resolve gaps and finalize results: Completion of the initial interview summaries will identify additional questions or clarifications required from the interview candidates. At that point the data quality practitioner can cycle back with the interviewee to resolve outstanding issues.

Once any outstanding questions have been answered, the interview results can be combined with the business context information (as described in section 9.4.1) to enable the data quality analyst to define specific steps and processes for requesting and documenting business information requirements.

9.4.3 Synthesize Requirements

This next stage synthesizes the results of the documentation scan and the interviews to collect metadata and data expectations as part of the business process flows. The analysts will review the downstream applications' use of business information (as well as the questions to be answered) to identify named data concepts, types of aggregates, and associated data element characteristics.

Figure 9.4 shows the sequence of these steps:

Figure 9.4. Synthesizing the results.

1. Document data workflow: Create an information flow model that depicts the sequence, hierarchy, and timing of process activities. The goal is to use this workflow to identify locations within the business processes where data quality controls can be introduced for continuous monitoring and measurement.

2. Identify required data elements: Reviewing the business questions will help segregate the required (or commonly used) data concepts (party, product, agreement, etc.) from the characterizations or aggregation categories (e.g., grouped by geographic region). This drives the determination of required reference data and potential master data items.

3. Specify required facts: These facts represent specific pieces of business information that are tracked, managed, used, shared, or forwarded to a reporting and analytics facility in which they are counted or measured (such as quantity or volume). In addition, the data quality analyst must document any qualifying characteristics of the data that represent conditions or dimensions used to filter or organize the facts (such as time or location). The metadata for these data concepts and facts will be captured within a metadata repository for further analysis and resolution.

4. Harmonize data element semantics: A metadata glossary captures all the business terms associated with the business workflows and classifies the hierarchical composition of any aggregated or analyzed data concepts. Most glossaries may contain a core set of terms shared across similar projects along with additional project-specific terms. When possible, use existing metadata repositories to capture the approved organizational definition.

The use of common terms becomes a challenge in data requirements analysis, especially when common use precludes the existence of agreed-to definitions. These issues become acute when aggregations are applied to counts of objects that may share the same name but don't really share the same meaning. This situation will lead to inconsistencies in reporting, analyses, and operational activities, which in turn will lead to loss of trust in data. Harmonization and metadata resolution are discussed in greater detail in chapter 10.
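A minimal sketch of how glossary entries for facts and qualifying dimensions might be captured, with invented terms and field names; the small check at the end flags source terms still mapped to more than one approved definition:

```python
# Hypothetical metadata glossary entries for facts and dimensions.
from collections import defaultdict

glossary = {
    "order_amount": {
        "kind": "fact",
        "definition": "Monetary value of a single customer order",
        "unit": "USD",
        "qualifiers": ["order_date", "sales_region"],   # used to filter/organize the fact
        "source_terms": {"ORD_AMT", "order_total"},     # variants to harmonize
    },
    "sales_region": {
        "kind": "dimension",
        "definition": "Geographic region used for sales reporting",
        "reference_data": "region_codes",
        "source_terms": {"REGION_CD", "territory"},
    },
}

# Flag source terms that map to more than one approved glossary term,
# a simple signal that harmonization is still needed.
term_owners = defaultdict(set)
for approved, entry in glossary.items():
    for variant in entry["source_terms"]:
        term_owners[variant].add(approved)
conflicts = {t: owners for t, owners in term_owners.items() if len(owners) > 1}
print(conflicts)  # {} -- no conflicting variants in this example
```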

9.4.4 Source-to-Target Mapping

The goal of source-to-target mapping is to clearly specify the source data elements that are used in downstream applications. In most situations, the consuming applications may use similar data elements from multiple data sources; the data quality analyst must determine whether any consolidation and/or aggregation requirements (i.e., transformations) are required, and determine the level of atomic data needed for drill-down, if necessary. The transformations specify how upstream data elements are modified for downstream consumption and the business rules applied as part of the data flow. During this phase, the data analyst may identify the need for reference data sets. As we will see in chapter 10, reference data sets are often used by data elements that have low cardinality and rely on standardized values.

Figure 9.5 shows the sequence of these steps:

Figure 9.5. Source-to-target mapping.

1. Propose target models: Evaluate the catalog of identified data elements and look for those that are often created, referenced, or modified. By considering both the conceptual and the logical structures of these data elements and their enclosing data sets, the analyst can identify potential differences and anomalies inherent in the metadata, and then resolve any critical anomalies across data element sizes, types, or formats. These will form the core of a data sharing model, which represents the data elements to be taken from the sources, potentially transformed, validated, and then provided to the consuming applications.

2. Identify candidate data sources: Consult the data management teams to review the candidate data sources containing the identified data elements, and review the collection of information facts needed by the consuming applications. For each fact, determine whether it corresponds to a defined data concept or data element, exists in any data sets in the organization, or is a computed value (and if so, which data elements are used to compute that value), and then document each potential data source.

3. Develop source-to-target mappings: Because this analysis should provide enough input to specify which candidate data sources can be extracted, the next step is to consider how that information is to be transformed into a common representation that is then normalized in preparation for consolidation. The consolidation processes collect the sets of objects and prepare them for populating the consuming applications. During this step, the analysts enumerate which source data elements contribute to target data elements, specify the transformations to be applied, and note where the mapping relies on standardizations and normalizations revealed during earlier stages of the process; a sketch of such a mapping specification follows this list.
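A minimal sketch of a source-to-target mapping specification, with invented source systems, element names, and transformations:

```python
# Hypothetical source-to-target mappings with simple transformations.
mappings = [
    {
        "target": "customer.full_name",
        "sources": ["crm.cust_first_nm", "crm.cust_last_nm"],
        "transform": lambda first, last: f"{first.strip()} {last.strip()}".title(),
        "notes": "relies on name standardization agreed during synthesis",
    },
    {
        "target": "customer.country_code",
        "sources": ["billing.country"],
        "transform": lambda country: {"United States": "US", "Canada": "CA"}.get(country, "??"),
        "notes": "normalized against the country reference data set",
    },
]

source_record = {
    "crm.cust_first_nm": " alice ",
    "crm.cust_last_nm": "smith",
    "billing.country": "Canada",
}

# Apply each mapping to build the target record.
target_record = {
    m["target"]: m["transform"](*(source_record[s] for s in m["sources"]))
    for m in mappings
}
print(target_record)
# {'customer.full_name': 'Alice Smith', 'customer.country_code': 'CA'}
```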

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123737175000099

Data Requirements Analysis

David Loshin, in Business Intelligence (2nd Edition), 2013

Summary

This provides a good starting point in the data requirements analysis process that can facilitate the data selection process. By the end of these exercises (which may require multiple iterations), you may be able to identify source applications whose data subsystems contain instances that are suitable for integration into a business analytics environment. Yet there are still other considerations: just because the data sets are available and accessible does not mean they can satisfy the analytics consumers' needs, especially if the data sets are not of a high enough level of quality. It is therefore also critically important to assess the data quality expectations and use a validation process to determine whether the quality levels of candidate data sources can meet the collected downstream user needs; this will be covered in subsequent chapters.

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123858894000077

Bringing It All Together

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

20.2.4 Data Requirements Analysis

Organizational data quality management almost does not make sense outside of the context of growing information reuse, alternating opinions regarding centralization/decentralization of data, or increasing scrutiny from external parties. The fact that data sets are reused for purposes that were never intended implies a greater need for identifying, clarifying, and documenting the collected data requirements from across the application landscape, as well as instituting accountability for ensuring that the quality characteristics expected by all data consumers are met.

Inconsistencies due to intermediate transformations and cleansings have plagued business reporting and analytics, requiring recurring time investments for reviews and reconciliations. Yet attempts to impose restrictions upstream are often pushed back, resulting in a less than optimal situation. Data requirements analysis is a process intended to accumulate data requirements from across the spectrum of downstream information consumers. It demonstrates that all applications are accountable for making the best effort to ensure the quality of data for all downstream purposes, and that the organization benefits as a whole when those requirements are met.

Whereas traditional requirements analysis centers on functional needs, data requirements analysis complements the functional requirements process and focuses on the information needs, providing a standard set of procedures for identifying, analyzing, and validating data requirements and quality for data-consuming applications. Data requirements analysis helps in:

Articulating a clear understanding of the data needs of all consuming business processes,

Identifying relevant data quality dimensions associated with those data needs,

Assessing the quality and suitability of candidate data sources,

Aligning and standardizing the exchange of data across systems,

Implementing production procedures for monitoring conformance to expectations and correcting data as early as possible in the production flow, and

Continually reviewing to identify improvement opportunities in relation to downstream data needs.

Analysis of system goals, objectives, and stakeholder desires is conducted to elicit business information characteristics that drive the definition of data and information requirements that are relevant, add value, and can be observed. The data requirements analysis process employs a top-down approach that incorporates data discovery and assessment in the context of explicitly qualified business data consumer needs. Candidate data sources are determined, assessed, and qualified within the context of the requirements, and any inherent issues that can be resolved immediately are addressed using the approaches described in chapter 12. The data requirements analysis process consists of these phases:

1. Identifying the business contexts

2. Conducting stakeholder interviews

3. Synthesizing expectations and requirements

4. Developing source-to-target mappings

Data quality rules defined as a result of the requirements analysis process can be engineered into the organization's system development life cycle (SDLC) for validation, monitoring, and observance of agreed-to data quality standards.

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123737175000208

Domain modeling

Marco Brambilla, Piero Fraternali, in Interaction Flow Modeling Language, 2015

3.11.1 Designing the Core Subschema

The process of defining a core subschema from the description of the core concepts identified in the data requirements analysis is straightforward:

1. The core concept is represented by a class (called the core class).

2. Properties with a single, atomic value become attributes of the core class. The identifying properties become the primary key of the core class.

3. Properties with multiple or structured values become internal components of the core class.

Internal components are represented as classes connected to the core class via a part-of association. Two cases are possible, which differ in the multiplicity constraints of the association connecting the component to the core class:

1. If the connecting association has a 1:1 multiplicity constraint for the component, the component is a proper subpart of the core concept. In this case, no instance of the internal component can exist in the absence of the core class instance it belongs to, and multiple core objects cannot share the same instance of the internal component. Internal components of this kind are sometimes called "weak classes" in data modeling terminology, or "part-of components" in object-oriented terminology.

2. If the association between the core class and the component has 0:* multiplicity for the internal component, the notion of "component" is interpreted in a broader sense. The internal component is considered a part of the core concept, even if an instance of it may exist independently of the connection to a core class instance and can be shared among different core objects. Nonetheless, the internal component is not deemed an essential data asset of the application and thus is not elevated to the status of a core concept.

Figure 3.17 illustrates the typical domain model of a core subschema, including one core class, two proper nonshared internal components, and one shared component.

Figure 3.17. Typical core subschema.

Note that a shared component may be part of one or more concepts, but it is not treated as an independent object for the purpose of the application. Such a consideration is useful for building the front-end model, which should present or manage components as parts of their "enclosing" core concepts and not as standalone objects.
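The chapter expresses this pattern in class diagrams; as a minimal sketch in code (with invented class and field names, not taken from the chapter), the same structure might look like this:

```python
# Hypothetical rendering of a core subschema: one core class, one nonshared
# ("weak") internal component, and one shared component.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Comment:           # nonshared internal component: each instance belongs to one Product
    author: str
    text: str

@dataclass
class Category:          # shared component: may be referenced by many Products
    code: str
    label: str

@dataclass
class Product:           # core class
    sku: str             # identifying property: the primary key of the core class
    name: str
    price: float
    comments: List[Comment] = field(default_factory=list)   # part-of ownership
    category: Optional[Category] = None                     # shared, 0:* on the component side

electronics = Category(code="EL", label="Electronics")
p = Product(sku="P-42", name="Headphones", price=59.0, category=electronics)
p.comments.append(Comment(author="reviewer", text="Great sound"))
print(p.sku, p.category.label, len(p.comments))  # P-42 Electronics 1
```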

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780128001080000035

Inspection, Monitoring, Auditing, and Tracking

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

17.6 Putting It Together

Data quality incident management combines different technologies to enable proactive management of existing, known data quality rules derived from both the data requirements analysis and the data quality assessment processes, including data profiling, metadata management, and rule validation. The introduction of an incident management system provides a forum for collecting knowledge about emergent and outstanding data quality issues and can guide governance activities to ensure that data errors are prioritized, the right individuals are notified, and the actions taken are aligned with the expectations set out in the data quality service level agreement.

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123737175000178

Remediation and Improvement Planning

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

12.1 Triage

Limitations to staffing will influence the data quality team to consider the best allocation of resources to address issues. There will always be a backlog of problems for review and consideration, revealed either by direct reports from data consumers or by the results of data quality assessments. But in order to achieve the "best bang for the buck," and most effectively use the available staff and resources, one can prioritize the issues for review and potential remediation as a by-product of weighing the feasibility and cost effectiveness of a solution against the recognized business impact of the issue. In essence, one gets the optimal value when the lowest costs are incurred to resolve the issues with the greatest perceived negative impact.

When a data quality issue has been identified, the triage process will take into account these aspects of the identified issue:

Criticality: the degree to which the business processes are impaired by the existence of the issue

Frequency: how often the issue has appeared

Feasibility of correction: the likelihood of expending the effort to correct the results of the failure

Feasibility of prevention: the likelihood of expending the effort to eliminate the root cause or establish continuous monitoring to detect the issue

The triage process is performed to understand these aspects in terms of the business impact, the size of the problem, and the number of individuals or systems affected. Triage enables the data quality practitioner to review the general characteristics of the problem and the business impacts in preparation for assigning a level of severity and priority.

12.1.1 The Prioritization Matrix

By its very nature, the triage process must employ some protocols for immediate assessment of any issue that has been identified, as well as for prioritizing those issues in the context of existing issues. A prioritization matrix is a tool that can help provide clarity for deciding relative importance, getting agreement on priorities, and then determining the actions that are likely to provide the best results within appropriate time frames. Collecting data about the issue's criticality, frequency, and the feasibility of the corrective and preventative actions enables a more confident decision-making process for prioritization.

Different approaches can be taken to assemble a prioritization matrix, especially when determining weighting strategies and allocations. In one example, shown in Table 12.1, the columns of the matrix show the evaluation criteria, and there is one row for each data quality issue. In this example, weights are assigned to the criteria based on the degree to which the score would contribute to the overall prioritization; the highest weight is assigned to criticality. The data quality practitioner will gather information as input to the scoring process, and each of the criteria's weighted scores is calculated and summed in the total.

Table 12.1. Example Prioritization Matrix

Criteria (columns): Criticality (weight = 4), Frequency (weight = 1), Correction Feasibility (weight = 1), Prevention Feasibility (weight = 2), and Total.
Rows: one row per issue; for each criterion the matrix records a raw score and a weighted score, and the Total column sums the weighted scores.

The weights must be determined in relation to the business context and the expectations as directed by the results of the data requirements analysis process (as discussed in chapter 9). As these requirements are integrated into a data quality service level agreement (or DQ SLA, as is covered in chapter 13), the criteria for weighting and evaluation are adjusted accordingly. In addition, the organization's level of maturity in data quality and data governance may also inform the determination of scoring protocols as well as weightings.
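As a minimal sketch of the weighted scoring behind Table 12.1 (the weights follow the example in the text; the issue names and raw scores are invented for illustration):

```python
# Weighted prioritization of data quality issues, per the Table 12.1 example.
WEIGHTS = {
    "criticality": 4,
    "frequency": 1,
    "correction_feasibility": 1,
    "prevention_feasibility": 2,
}

def prioritize(issues):
    """Return (weighted total, issue name) pairs, highest priority first."""
    ranked = []
    for name, scores in issues.items():
        total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
        ranked.append((total, name))
    return sorted(ranked, reverse=True)

issues = {
    "duplicate customer records": {
        "criticality": 3, "frequency": 4,
        "correction_feasibility": 2, "prevention_feasibility": 3,
    },
    "missing postal codes": {
        "criticality": 1, "frequency": 5,
        "correction_feasibility": 4, "prevention_feasibility": 4,
    },
}

for total, name in prioritize(issues):
    print(total, name)
# 24 duplicate customer records
# 21 missing postal codes
```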

12.1.2 Gathering Knowledge

There may be little to no background information associated with any identified or reported data quality issue, so the practitioner will need to gather knowledge to evaluate the prioritization criteria, using guidance based on the data requirements. The assignment of points can be based on the answers to a sequence of questions intended to tease out the details associated with criticality and frequency, such as the following:

Have any business processes/activities been impacted by the data issue?

If so, how many business processes/activities are impacted by the data issue?

What business applications have failed as a result of the data issue?

If so, how many business processes have failed?

How many individuals are affected?

How many systems are affected?

What types of systems are affected?

How many records are affected?

How many times has this issue been reported? Within what time frame?

How long has this been an issue?

Then, based on the list of individuals and systems affected, the data quality analyst can review business impacts within the context of both known and newly discovered issues, asking questions such as these:

What are the potential business impacts?

Is this an issue that has already been anticipated based on the data requirements analysis process?

Has this issue introduced delays or halts in production data processing that must be performed within existing constraints?

Has this issue introduced delays in the development or deployment of critical business systems?

The next step is to evaluate what data sets have been affected and what, if any, immediate corrective actions need to be taken, such as whether any data sets need to be recreated, modified, or corrected, or whether any business processes need to be rolled back to a previous state. The following types of questions are used in this evaluation:

Are there short-term corrective measures that can be taken to restart halted processes?

Are there long-term measures that can be taken to identify when the issue occurs in the future?

Are there system modifications that can be performed to eliminate the issue's occurrence altogether?

The answers to these questions will present alternatives for correction as well as prevention, which can be assessed in terms of their feasibility.

12.1.3 Assigning Criticality

Having collected knowledge about each issue, the data quality analyst can synthesize the intentions of the data quality requirements with what has been learned during the triage process to determine the level of severity and assign a priority for resolution. The collected information can be used to populate the prioritization matrix, assign scores, and apply weights. Issues can be assigned a priority score based on the results of the weightings applied in the prioritization matrix. In turn, each issue can be prioritized from both a relative standpoint (i.e., which issues take precedence compared to others) and an absolute standpoint (i.e., is a specific issue high or low priority). This prioritization can also be assigned in the context of those issues identified during a finite time period ("this past week") or in relation to the full set of open data quality issues.

Data issue priority will be defined by the members of the various data governance groups. As an example, an organization may define four levels of priority, such as those shown in Table 12.2.

Table 12.2. Example Classifications of Severity or Criticality

Business critical: A business critical problem prevents necessary business activities from completing and must be resolved before those activities can continue. Implications: addressing the issue demands immediate attention and overrules activities associated with issues of a lower priority.

Serious: Serious problems pose measurably high impacts to the business, but the issue does not prevent critical business processes from completing. Implications: these issues require evaluation and must be addressed, but are superseded by business critical issues.

Tolerable: With tolerable issues, there are identified impacts to the business, but they require additional research to determine whether correction and elimination are economically viable. Implications: it is not clear whether the negative business impacts exceed the total costs of remediation; further investigation is necessary.

Acknowledged: Acknowledged issues are recognized and documented, but the scale of the business impact does not warrant additional investment in remediation. Implications: it is clear that the negative business impacts do not exceed the full costs of remediation; no further investigation is necessary.

Depending on the scoring process, the weighting, and the assessment, any newly reported issue can be evaluated and assigned a priority that should direct the initiation of specific remediation actions. Issues can be recategorized as well. For example, issues categorized as tolerable may be downgraded to acknowledged once the evaluation determines that the costs for remediation exceed the negative impact. Similarly, once a work-around has been determined for a business critical issue, that issue may no longer prevent necessary business activities from continuing, in which case it could be reclassified as a serious issue.
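As a rough illustration, and only under the assumption that the governance groups tie these severity classes to ranges of the weighted total (the numeric thresholds below are purely hypothetical), the initial classification could be automated:

```python
# Hypothetical mapping from a weighted priority total to a Table 12.2 class.
def classify(total_score: int) -> str:
    if total_score >= 30:
        return "business critical"
    if total_score >= 20:
        return "serious"
    if total_score >= 10:
        return "tolerable"
    return "acknowledged"

print(classify(24))  # serious
print(classify(8))   # acknowledged
```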

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123737175000129

Coordination

David Loshin, in Master Data Management, 2009

2.3 Stakeholders

Who are the players in an MDM environment? There are many potential stakeholders across the enterprise:

Senior management

Business clients

Application owners

Information architects

Data governance and data quality practitioners

Metadata analysts

System developers

Operations staff

Here we explore who the stakeholders are and what their expected participation should be over the course of program development.

2.3.1 Senior Management

Clearly, without the support of senior management, it would be difficult to execute any enterprise activity. At the senior level, managers are motivated to demonstrate that their (and their teams') performance has contributed to the organization's successful achievement of its business objectives. Transitioning to a master data environment should enable more nimbleness and agility, both in ensuring the predictable behavior of existing applications and systems and in rapidly developing support for new business initiatives. This core message drives senior-level engagement.

Senior management also plays a special role in ensuring that the rest of the organization remains engaged. Adopting a strategic view to oversee the long-term value of the transition and migration should trump short-term tactical business initiatives. In addition, the senior managers should also prepare the organization for the behavioral changes that will be required of the staff as responsibilities and incentives evolve from focusing on vertical business area success to how line-of-business triumphs contribute to overall organizational success.

2.3.2 Business Clients

For each of the defined lines of business, there are representative clients whose operations and success rely on the predictable, high availability of application data. For the most part, unless the business client is intricately involved in the underlying technology associated with the business processes, it almost doesn't matter how the system works, but rather that the system works. Presuming that the data used within the existing business applications meet the business user's expectations, incorporating the business client's data into a master repository is only relevant to the business client if the process degrades data usability.

Still, the business client may derive value from improvements in data quality as a by-product of data consolidation, and future application development will be made more efficient when facilitated through a service model that supports application integration with enterprise master data services. Supporting the business client implies a number of specific actions and responsibilities, two of which are especially relevant. First, the MDM program team must capture and document the business client's data expectations and application service-level expectations and assure the client that those expectations will be monitored and met. Second, because it is essential for the team to understand the global picture of master object use, it is important for the technical team to assess which data objects are used by the business applications and how those objects are used. Therefore, as subject matter experts, it is imperative that the business clients participate in the business process modeling and data requirements analysis process.

2.3.3 Application Owners

Any applications that involve the use of data objects to be consolidated within an MDM environment will need to be modified to conform to the use of master data instead of local versions or replicas. This means that the use of the master data asset must be carefully socialized with the application owners, because they become the "gatekeepers" to MDM success. As with the business owners, each application owner will be concerned with ensuring predictable behavior of the business applications and may even see master data management as a risk to continued predictable behavior, as it involves a significant transition from one underlying (production) data asset to a potentially unproven one.

The application owner is a key stakeholder, then, as the successful continued predictable operation of the application depends on the reliability and quality of the master repository. When identifying data requirements in preparation for developing a master data model, it will be necessary to engage the application owner to ensure that operational requirements are documented and incorporated into the model (and component services) design.

2.3.4 Information Architects

Underlying any organizational information initiative is a need for information models in an enterprise architecture. The models for master data objects must accommodate the current needs of the existing applications while supporting the requirements for future business changes. The information architects must collaborate to address both aspects of application needs and fold those needs into the data requirements process for the underlying models and the representation framework that will be employed.

2.3.5 Data Governance and Data Quality

An enterprise initiative introduces new constraints on the ways that individuals create, access and use, modify, and retire data. To ensure that these constraints are not violated, the data governance and data quality staff must introduce stewardship, ownership, and management policies as well as the means to monitor observance of these policies.

A success factor for MDM is its ubiquity; the value becomes apparent to the organization as more lines of business participate, both as data suppliers and as master data consumers. This suggests that MDM needs governance to encourage collaboration and participation across the enterprise, but it also drives governance by providing a single point of truth. Ultimately, the use of the master data asset as an acknowledged high-quality resource is driven by transparent adherence to defined information policies specifying the acceptable levels of data quality for shared information. MDM programs require some layer of governance, whether that means incorporating metadata analysis and registration, developing "rules of engagement" for collaboration, defining data quality expectations and rules, monitoring and managing the quality of data and changes to master data, providing stewardship to oversee automation of linkage and hierarchies, or offering processes for researching root causes and the subsequent elimination of sources of flawed data.

2.3.6 Metadata Analysts

Metadata represent a key component of MDM as well as of the governance processes that underlie it, and managing metadata must be closely linked to information and application architecture as well as information governance. Managing all types of metadata (not just technical or structural) will provide the "glue" to connect these together. In this environment, metadata incorporate the consolidated view of the data elements and their corresponding definitions, formats, sizes, structures, data domains, patterns, and the like, and they provide an excellent platform for metadata analysts to actualize the value proposed by a comprehensive enterprise metadata repository.

2.3.7 System Developers

Aspects of performance and storage change as replicated data instances are absorbed into the master data environment. Again, the determination of the underlying architecture approach will impact production systems as well as new development projects and will modify the way that the application framework uses the underlying data asset (as is discussed in Chapters 9, 11, and 12). System analysts and developers will need to restructure their views of systemic needs as the ability to formulate system services grows at the core level, at a level targeted at the ways that conceptual data objects are used, and at the application interface level.

2.3.eight Operations Staff

One of the hidden risks of moving toward a common repository for master data is the fact that frequently, to get the job done, operations staff may need to bypass the standard protocols for data access and modification. In fact, in some organizations, this approach to bypassing standard interfaces is institutionalized, with metrics associated with the number of times that "fixes" or modifications are applied to data using direct access (e.g., updates via SQL) instead of going through the preferred channels.

Alternatively, desktop applications are employed to supplement existing applications and as a way to gather the right amount of information to complete a business process. Bypassing standard operating procedures and desktop supplements pose an interesting challenge to the successful MDM program, in absorbing what might be termed "finely grained distributed information" into the master framework as well as in taming the behavior that essentially allows for leaks in the enterprise master data framework. In other words, the folks with their boots on the ground may need to modify their habits as key data entities are captured and migrated into a master environment.

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123742254000023

Data Consolidation and Integration

David Loshin, in Master Data Management, 2009

10.6 Consolidation

Consolidation is the result of the tasks applied to data integration. Identifying values are parsed, standardized, and normalized across the same data domains; subjected to classification and blocking schemes; and then submitted to the unique identification service to analyze duplicates, look for hierarchical groupings, locate an existing record, or determine that one does not exist. The existence of multiple instances of the same entity raises some critical questions, shown in the sidebar.

The answers to these questions frame the implementation and tuning of matching strategies and the resulting consolidation algorithms. The decision to merge records into a single repository depends on a number of different inputs, and these are explored in greater detail in Chapter 9.

Critical Questions about Multiple Instances of Entities

What are the thresholds that indicate when matches exist?

When are multiple instances merged into a single representation in a master repository as opposed to registration within a master registry?

If the decision is to merge into a single record, are there any restrictions or constraints on how that merging may be done?

At what points in the processing stream is consolidation performed?

If merges can be done in the hub, can consuming systems consume and apply that merge event?

Which business rules determine which values are forwarded into the master copy—in other words, what are the survivorship rules?

If the merge occurs and is later found to be incorrect, can you undo the action?

How do you apply transactions made against the merged entity to the subsequently unmerged entities after the merge is undone?

10.6.1 Similarity Thresholds

When performing approximate matching, what criteria are used for distinguishing a match from a nonmatch? With exact matching, it is clear whether or not two records refer to the same object. With approximate matching, however, there is frequently not a definitive answer, but rather some point along a continuum indicating the degree to which two records match. Therefore, it is up to the information architect and the business clients to define the point at which two values are considered to be a match, and this is specified using a threshold score.

When identifying values are submitted to the integration service, a search is made through the master index for potential matches, and then a pair-wise comparison is performed to determine the similarity score. If that similarity score is above the threshold, it is considered a match. We can be more precise and actually define three score ranges: a high threshold above which the pair is considered a match; a low threshold below which the pair is considered not a match; and any scores between those thresholds, which require manual review to determine whether the identifying values should be matched or not.

This process of incorporating people into the matching process can have its benefits, especially in a learning environment. The user may begin the matching process by specifying initial thresholds, but as the process integrates user decisions about which kinds of questionable similarity values indicate matches and which do not, a learning heuristic may automatically adjust both the thresholds and the similarity scoring to yield finer accuracy of similarity measurement.
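A minimal sketch of the three-band decision described above, using a generic string-similarity ratio as a stand-in for the real pair-wise comparison and hypothetical threshold values:

```python
# Three score ranges: auto-match, non-match, and manual review in between.
from difflib import SequenceMatcher

HIGH_THRESHOLD = 0.90
LOW_THRESHOLD = 0.70

def similarity(a: str, b: str) -> float:
    # Placeholder comparison; a real service would score parsed and
    # standardized identifying attributes, not raw strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_decision(a: str, b: str) -> str:
    score = similarity(a, b)
    if score >= HIGH_THRESHOLD:
        return "match"
    if score <= LOW_THRESHOLD:
        return "non-match"
    return "manual review"

print(match_decision("Jonathan Q. Smith", "Jonathan Q Smith"))  # match
print(match_decision("Jon Smith", "Jonathan Smith"))            # manual review
print(match_decision("Jon Smith", "Maria Garcia"))              # non-match
```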

10.6.2 Survivorship

Consolidation, to some extent, implies merging of information, and essentially there are two approaches. On the one hand, there is value in ensuring the existence of a "golden copy" of data, which suggests merging multiple instances as a cleansing process performed before persistence (if using a hub). On the other hand, different applications have different requirements for how data is used, and merging records early in the work streams may introduce inconsistencies for downstream processing, which suggests delaying the merging of data until the actual point of use. These questions help to drive the determination of the underlying architecture.

Either way, operational merging raises the concept of survivorship. Survivorship is the process applied when two (or more) records representing the same entity contain conflicting data, to determine which record's value survives in the resulting merged record. This process must incorporate information or business rules into the consolidation procedure, and these rules reflect the characterization of the quality of the data sources as determined during the source data analysis described in Chapter 2, the kinds of transactions being performed, and the business client data quality expectations as discussed in Chapter 5.

Ultimately, every master data attribute depends on the data values within the corresponding source data sets identified as candidates and validated through the data requirements analysis process. The master data attribute's value is populated as directed by a source-to-target mapping based on the quality and suitability of every candidate source. Business rules delineate valid source systems, their corresponding priority, qualifying conditions, transformations, and the circumstances under which these rules are applied. These rules are applied at different locations within the processing streams, depending on the business application requirements and how those requirements have directed the underlying system and service architectures.
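A minimal sketch of attribute-level survivorship driven by source priority; the source names, priorities, and records are assumptions for illustration, and real rules would also carry qualifying conditions and transformations:

```python
# Hypothetical survivorship: for each attribute, keep the non-empty value
# from the highest-priority source that supplies one.
SOURCE_PRIORITY = {"crm": 1, "billing": 2, "legacy": 3}   # lower = more trusted

def survive(instances: list[dict]) -> dict:
    """Merge matched records attribute by attribute."""
    merged = {}
    attributes = sorted({attr for rec in instances for attr in rec if attr != "source"})
    ranked = sorted(instances, key=lambda rec: SOURCE_PRIORITY[rec["source"]])
    for attr in attributes:
        for rec in ranked:
            value = rec.get(attr)
            if value not in (None, ""):
                merged[attr] = value
                break
    return merged

matched = [
    {"source": "legacy", "name": "J. Smith", "phone": "555-0100", "email": ""},
    {"source": "crm", "name": "Jonathan Smith", "phone": "", "email": "jsmith@example.com"},
]
print(survive(matched))
# {'email': 'jsmith@example.com', 'name': 'Jonathan Smith', 'phone': '555-0100'}
```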

Another key concept to remember with respect to survivorship is the retention policy for source data associated with the master view. Directed data cleansing and data value survivorship applied when each data instance is brought into the environment provide a benefit when those processes ensure the correctness of the single view at the point of entry. However, because not all data instances imported into the system are used, cleansing them may turn out to be additional work that might not have been immediately necessary. Cleansing the data on demand would limit the work to what is needed by the business process, but it introduces complexity in managing multiple instances and history regarding when the appropriate survivorship rules should have been applied.

A hybrid approach is to apply the survivorship rules to determine the record's standard form, yet always maintain a record of the original (unmodified) input data. The reason is that a variation in a name or address provides extra knowledge about the master object, such as an individual's nickname or a variation in product description that may occur in other situations. Reducing each occurrence of a variation into a single form removes knowledge associated with potentially aliased identifying data, which ultimately reduces your global knowledge of the underlying object. But if you can determine that the input data is just a variation of one (or more) records that are already known, storing the newly acquired versions linked to the cleansed form will provide greater knowledge moving forward, as well as enabling traceability.

10.6.3 Integration Errors

We have already introduced the two types of errors that may be encountered during data integration. The first type of error is called a false positive, and it occurs when two data instances representing two distinct real-life entities are incorrectly assumed to refer to the same entity and are inadvertently merged into a single master representation. False positives violate the uniqueness constraint that a master representation exists for every unique entity. The second type of error is called a false negative, and it occurs when two data instances representing the same real-world entity are not determined to match, with the possibility of creating a duplicate master representation. False negatives violate the uniqueness constraint that there is one and only one master representation for every unique entity.

Despite the program's best laid plans, it is likely that a number of both types of errors will occur during the initial migration of data into the master repository, as additional data sets are merged in, and as data come into the master environment from applications in production. Preparing for this eventuality is an important task:

Determine the risks and impacts associated with both types of errors and raise the level of awareness appropriately. For example, false negatives in a marketing campaign may lead to a prospective customer being contacted more than once, whereas a false negative for a terrorist screening may have a more devastating impact. False positives in product information management may lead to confused inventory management in some cases, whereas in other cases they may lead to missed opportunities for responding to customer requests for proposals.

Devise an impact assessment and resolution scheme. Provide a process for separating the unique identities from the merged instance upon identification of a false positive, in which two entities are incorrectly merged, and determine the distinguishing factors that can be reincorporated into the identifying attribute set, if necessary. Likewise, provide a means for resolving duplicated data instances and determining what prevented those two instances from being identified as the same entity.

These tasks both suggest maintaining historical information about the way that the identity resolution process was applied, what actions were taken, and ways to unravel these actions when either false positives or false negatives are identified.

10.6.4 Batch versus Inline

There are two operational paradigms for data consolidation: batch and inline. The batch approach collects static views of a number of data sets and imports them into a single location (such as a staging area, or loaded into a target database), and then the combined set of data instances is subjected to the consolidation tasks of parsing, standardization, blocking, and matching, as described in Section 10.4. The inline approach embeds the consolidation tasks within operational services that are available any time new data is brought into the organization. Inlined consolidation compares every new data instance with the existing master registry to determine whether an equivalent instance already exists within the environment. In this approach, newly acquired data instances are parsed and standardized in preparation for immediate comparison against the versions managed within the master registry, and any necessary modifications, corrections, or updates are applied as the new instance either is matched against existing data or is identified as an entity that has not yet been seen.

The approach taken depends on the selected base architecture and the application requirements for synchronization and for consistency. Batch consolidation is often applied as part of the migration process to accumulate the data from across the systems that are being folded into a master environment. The batch processing allows for the standardization of the collected records to seek out unique entities and resolve any duplicates into a single identity. Inlined consolidation is the approach used in operational mode to ensure that as data come into the environment, they are directly synchronized with the master.

10.6.5 History and Lineage

Knowing that both false positives and false negatives will occur directs the inclusion of a means to roll back modifications to master objects upon determination that an error has occurred. The most obvious way to enable this capability is to maintain a full history associated with every master data value. In other words, every time a modification is made to a value in a master record, the system must log the change that was made, the source of the modification (e.g., the data source and the set of rules triggered to modify the value), and the date and time that the modification was made. Using this information, when the existence of an error is detected, a lineage service can traverse the historical record for any master data object to determine at which point a change was made that introduced the error.
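A minimal sketch of the per-attribute change history such a lineage service could rely on; the record structure and helper functions are assumptions for illustration:

```python
# Hypothetical change log supporting rollback of a flawed master-data change.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ChangeLogEntry:
    master_id: str
    attribute: str
    old_value: object
    new_value: object
    source: str           # contributing data source
    rule: str             # survivorship/cleansing rule that fired
    changed_at: datetime

history: list[ChangeLogEntry] = []

def apply_change(master: dict, master_id: str, attribute: str,
                 new_value, source: str, rule: str) -> None:
    """Record the change before applying it to the master record."""
    history.append(ChangeLogEntry(master_id, attribute, master.get(attribute),
                                  new_value, source, rule,
                                  datetime.now(timezone.utc)))
    master[attribute] = new_value

def rollback_to(master: dict, master_id: str, attribute: str, bad_index: int) -> None:
    """Restore the value that was in place before the flawed change."""
    entry = history[bad_index]
    assert entry.master_id == master_id and entry.attribute == attribute
    master[attribute] = entry.old_value

record = {"phone": "555-0100"}
apply_change(record, "M-1", "phone", "555-9999", source="legacy", rule="overwrite_if_newer")
rollback_to(record, "M-1", "phone", bad_index=0)
print(record)  # {'phone': '555-0100'}
```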

Addressing the error is more complicated, because not only does the error need to be resolved through a data value rollback to the point in time when the error was introduced, but any additional modifications dependent on that flawed master record must also be identified and rolled back. The most comprehensive lineage framework will allow for backward tracking as well as forward tracking from the rollback point to seek out and resolve any possible errors that the identified flaw may have triggered. However, the forward tracking may be overkill if the business requirements do not insist on complete consistency—in this type of situation, the only relevant errors are the ones that prevent business tasks from successfully completing; proactively addressing potential issues may not be necessary until the impacted records are actually used.

Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123742254000102