Archive for the Technical Focus Category

Widgets

I’m new to the widget world.  As part of my build-out of www.360degreevendor.com, I’ve been experimenting with the Clearspring widget set.  I’ve tested both LaunchPad options: the In-Widget and On-Page methods.  The In-Widget method seems to be a better fit for my site, as there are a number of areas within the home page that I’d like to share, such as the vendor search box.  That said, I can see putting an On-Page button on the home page to allow people to share the entire site.

I floated a question about Clearspring on the LinkedIn Q&A board, and received some good feedback.  The majority of the respondents recommended using WidgetBox, so I think I’ll do a trial with both and see which works out best.  I don’t think this is necessarily an either/or situation, so I might end up using both depending on their relative strengths.

Data Warehouse Appliances - Apples to Apples (update)

I’ve updated the data warehouse appliance spreadsheet, adding two new solutions (SAND & Calpont).  I’ve also updated information on several existing vendor solutions.

Analytics in the Cloud

I attended a webinar today sponsored by Amazon and Vertica called “Data Analytics in the Cloud”.  The Vertica portion was mostly a duplicate of a prior webcast, but the Amazon portion on the cloud concept was very interesting.  The key points of the cloud concept as I see it are:

  • Pay-as-you-go model - you only pay for the disk space and processing you consume. No start-up costs, but you have to sign a contract. (They claimed the cost was half that of an in-house solution.)
  • Time to market - hours instead of weeks to stand up a terabyte-sized system, including hardware, OS, and the Vertica column-based database
  • On-demand scalability - seamlessly scales to meet your demand
  • Proven platform - hosted by Amazon on the same platform that hosts the Amazon.com site

I think the benefits of this approach are obvious, especially for a small but rapidly growing operation.  The infrastructure and software license costs of an in-house build alone could be prohibitive, and time to market is critical, especially when launching a new idea.

The downsides include:

  • Security concerns, especially for highly sensitive customer data
  • Performance - both in terms of loading large amounts of data and in real-time queries
  • Long-term cost - as with any usage-based cost model, the upfront savings could be surpassed by subsequent usage fees
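
To make the long-term cost point concrete, here is a minimal sketch of a break-even calculation.  The function name and every dollar figure in it are hypothetical and chosen purely for illustration; they are not numbers from the webinar or from any vendor's pricing.

```python
from typing import Optional


def break_even_month(upfront_cost: float,
                     in_house_monthly: float,
                     cloud_monthly: float,
                     horizon_months: int = 120) -> Optional[int]:
    """Return the first month in which cumulative cloud spend exceeds
    cumulative in-house spend, or None if that never happens in the horizon.

    All inputs are assumptions supplied by the caller.
    """
    for month in range(1, horizon_months + 1):
        in_house_total = upfront_cost + in_house_monthly * month
        cloud_total = cloud_monthly * month
        if cloud_total > in_house_total:
            return month
    return None


# Hypothetical figures: $100,000 up front plus $2,000/month in-house,
# versus $6,000/month in usage fees with no start-up cost.
month = break_even_month(100_000, 2_000, 6_000)
```

With these made-up inputs the usage fees overtake the in-house total in month 26.  If the monthly usage fee is no higher than the in-house running cost, the function returns None, meaning the upfront savings are never given back.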

Data Integration Architecture

In the “causes of failure” department, the lack of a sound architecture is a close second to the lack of a sound strategy.  Strategy takes first place only because the go/no-go decision should be made in the strategy phase, where a number of failed projects should have been weeded out before they ever got started.

I won’t bore you with the details of a data integration (or data management, data warehouse, etc…) architecture - you can download my white paper if you want to dive deeper.  Instead, I’ll touch on the key elements of a good architecture, then list the top 3 mistakes:

Key Elements:

  1. Data acquisition - this includes identification of sources, integration approach (virtual vs physical), data quality processing, and any other step needed to gather data and prepare it for use in an analytical capacity.
  2. Data storage - Physical storage includes the old standbys (operational data stores, data warehouse, data marts, etc…) although these distinctions are blurring.  Virtual storage includes enterprise information integration (EII) and other methods that provide a virtual view of data across disparate underlying physical stores.
  3. Delivery - this encompasses all forms of information dissemination, including traditional forms of business intelligence such as reporting, OLAP, and dashboards.  Also included is integration into other applications such as a marketing automation system.  Often overlooked is the integration back into transactional systems to provide real-time analytics - read The New Age of Innovation (if you haven’t already) to get a view of the value of analytics through the business lens.
  4. Metadata - this is the glue that holds the whole thing together.  When done properly, the first three categories are driven off an integrated metadata repository that supports the development, operations, and end user communities.
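
As a rough illustration of what "driven off" a metadata repository can mean, here is a minimal sketch in which the same mapping entries steer both acquisition and delivery.  The repository layout, the system names (crm, billing), the field names, and the sample values are all hypothetical; a real repository would hold far more (lineage, data quality rules, schedules).

```python
# Hypothetical metadata repository: one entry per target field, recording
# where the value comes from and how it is labeled for end users.
METADATA = {
    "customer_name": {"source": "crm", "source_field": "cust_nm",
                      "label": "Customer Name"},
    "total_spend": {"source": "billing", "source_field": "amt_total",
                    "label": "Total Spend"},
}

# Stand-in source records keyed by system name (made-up data).
SOURCES = {
    "crm": {"cust_nm": "Acme Corp"},
    "billing": {"amt_total": 1250.0},
}


def acquire(metadata, sources):
    """Acquisition: build one integrated record using the source mappings."""
    return {
        target: sources[rule["source"]][rule["source_field"]]
        for target, rule in metadata.items()
    }


def deliver(record, metadata):
    """Delivery: render the record using labels from the same metadata."""
    return [f"{metadata[field]['label']}: {value}"
            for field, value in record.items()]


record = acquire(METADATA, SOURCES)
report = deliver(record, METADATA)
```

Adding a field in this sketch means adding one metadata entry; neither function changes, which is the property the integrated repository is meant to provide.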

Top 3 mistakes:

  1. Not defining the architecture - just like building a house, you must plan out the solution end-to-end if you want all the pieces to fit together.
  2. Taking a product vendor driven approach - buying a tool does not translate into defining an architecture.  The technical architecture should precede and transcend the toolset.  Buy tools that fit into your architecture, not the other way around.
  3. Technology for technology’s sake - the term “business alignment” is overused and often misunderstood, but the technical implementation should operate within the parameters set during the strategy process.

Data Integration Strategy

Of the data management initiatives I’ve seen fail to meet expectations, most have resulted from a lack of strategic planning.  I’m not talking about big picture strategic planning akin to a McKinsey or Bain effort.  The strategy effort I’m talking about involves surveying the business to determine if, where, and how to deploy a data solution, then setting the parameters that will guide the technical implementation (assuming the decision is made to go forward).  My sense is that this step is skipped for one of two reasons:

  1. Management isn’t aware of this step in the process
  2. The perception of a “strategy process”, particularly within IT, is of a costly, long, drawn-out affair that produces zero value while killing numerous trees

Both of these issues can be overcome via education.  The most powerful message is that a data-centric strategy does not need to take months or cost hundreds of thousands of dollars.  For most small to midsized businesses, and even for business units of larger corporations, the strategic direction can be formulated with one working session between business and IT representatives, with follow-ups via e-mail and conference call/web meeting.  The most important decision, go/no-go, should be obvious to all participants at the conclusion of the working session based on the opportunities identified and the readiness of both business and technical groups to take on an initiative of this magnitude.  Assuming the decision is to go forward, the project team should have the data required to formulate a business plan and roadmap, which can be refined and finalized without the need for another time-consuming all-hands meeting.

For more details download my white paper (Data Integration Strategy).

Book Review: The New Age of Innovation (Part 2 - Analytics)

I ended Part 1 with the excerpt stating that CEOs and other senior management need to take an active role in defining the analytical and data warehouse strategy for their companies.  I thought this was an important statement in that I’ve never read a business book that has advocated that level of intimacy between the executive ranks and data management initiatives.  But it underscores the importance the authors give to analytics in driving innovation.

In the first post, I talked about the key elements of the “House of Innovation”, including the importance of flexible business processes that connect N=1 and R=G.  The key concept here is that it’s not enough for a company to define their business processes - they need to determine how their processes will change to satisfy the demands of N=1.  And the key driver to this change is real-time analytical capabilities.  According to the authors, it’s not enough to implement the current approach of collecting data periodically in a large repository to be analyzed at a later date.  Analytics need to be embedded within the business processes, providing insight to managers as they make decisions on what and how to produce, and how to align resources.

From a technical perspective, this is equivalent to business activity monitoring (BAM), which has been around for 8+ years (that I know of).  It involves putting sensors and other monitoring devices (both hardware and software) into machines and applications, and feeding that information flow directly into decision support systems used by line managers.  The authors also touch on the need to pull together (in a batch process) disparate data to form a 360 degree customer view.  This forms the framework for integrating and understanding the real-time information flow. 
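
To show the difference between embedded monitoring and after-the-fact batch analysis, here is a minimal sketch of a monitor that checks each reading as it arrives.  The class name, window size, tolerance, and sample readings are my own assumptions for illustration; this is not taken from the book or from any BAM product.

```python
from collections import deque


class RollingMonitor:
    """Keeps the last `window` readings and flags a new reading that deviates
    from their average by more than `tolerance` (a fraction, e.g. 0.2)."""

    def __init__(self, window: int = 5, tolerance: float = 0.2):
        self.readings = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Record a reading; return True if it should be raised as an alert."""
        alert = False
        # Only judge a reading once a full window of history exists.
        if len(self.readings) == self.readings.maxlen:
            average = sum(self.readings) / len(self.readings)
            alert = abs(value - average) > self.tolerance * abs(average)
        self.readings.append(value)
        return alert


# Hypothetical sensor feed: steady output, then a spike.
monitor = RollingMonitor(window=3, tolerance=0.2)
alerts = [monitor.observe(v) for v in [100, 102, 98, 101, 150]]
```

Only the final reading is flagged here, and it is flagged at the moment it arrives rather than when a nightly load runs, which is the point of embedding the check in the process.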

The problem I have with all of this is that the authors make two large and broad assumptions - that data quality is not a problem, and that integrating data into a repository is relatively easy.  On the one hand, I understand their not wanting to get into a technical discussion around cleansing and integrating data.  But in the majority of large companies, this is the number 1 technical issue (I’m conveniently lumping these two issues together, even though they’re two separate steps).  So in assuming away these two issues, they’re removing a prime contributor to both the effort and the risk associated with implementation.

All in all, I enjoyed reading the book, and gained a new perspective on the importance of data management in driving business innovation. Although I have issues with the implementation aspects of their theories, I agree with the business side of their argument.  And of course it was nice to get some props from the business world for the importance of analytics in driving business innovation.

Bottom line - this is a must read book for anyone in the data management arena (e.g., BI, DW, data integration).

Data Warehouse Appliances - Apples to Apples

I’ve listened to several appliance webcasts in the past 2 weeks, and have increased my knowledge of the growing data warehouse appliance market.  One thing that is clear to me is that the landscape has changed considerably in the past 2+ years, with the addition of several vendors providing a “full stack” similar to Teradata and Netezza.  There are also a number of vendors that are providing “software appliances” (to borrow a phrase from Philip Russom, TDWI analyst), by combining customized database, operating system, and integration software with either off-the-shelf hardware components (Intel, AMD) or hardware from partners such as Sun.  And while I’ve gotten bits and pieces of information on this topic from a number of sources, I haven’t seen a full side-by-side comparison of these products.

So I pulled together a data warehouse appliance spreadsheet that compares vendors along several categories, including applications, software, architecture (basically SMP or MPP) and hardware.  I’ve also attempted to group the vendors, using “full stack”, “database”, and “hardware” as my initial cut.  I intend this to be a work in progress, and will provide updates as I add new vendors and refine information for existing vendors.  Once I get a complete set of data, I will provide a summary of the dataset.

Please feel free to send me an e-mail if you see incorrect or missing information, or think I should add a vendor to the mix.

VDI (Virtual Data Integration)

What is virtual data integration, and when is it appropriate to use?  Virtual data integration (VDI) is the use of a software layer that interfaces with the reporting or end-user delivery application, in place of direct access to a data repository.  The software layer provides a mapping between the conceptual data model that the user interfaces with and the physical model(s) and underlying data stores.  The reason I say “model(s)” is that one of the most powerful aspects of VDI is the ability to make multiple data stores (whether they be transactional, ODS, data warehouse/marts, or others) look like one integrated repository.  Below are the pros and cons of using VDI versus physical data integration.

Pros:

  • Shortens implementation time since data does not have to be physically integrated
  • Provides short term benefit to the business while buying time for technical team to integrate/retire existing redundant and fragmented data architecture
  • Can reduce data storage requirements
  • Source data updates are available immediately to end-users
  • Eliminates cost and time associated with extraction and movement of data (e.g., no batch window)
  • Business rules can be modified “on the fly”

Cons:

  • Data is not accessible if source system is unavailable
  • Complex transformations and aggregations can impact report response time
  • Updates to source systems can result in out of sync data on reports
  • Complex match-merge routines may not be possible outside of batch processing
  • Accessing large data sets can impact report response time
  • Query processing can impact source system performance

In my mind, the key differentiator is the volume of data being analyzed.  If end users are accessing relatively few records, a VDI solution is viable.  But when large data volumes need to be accessed, such as in a data mining or long term trending exercise, performance may become an insurmountable issue, both from the end user and the source system owner perspective.
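
The mapping layer described above can be sketched in a few lines.  This is a toy illustration under stated assumptions, not how any particular EII product works: the two in-memory SQLite databases stand in for separate physical stores, and the class, table, and column names and the sample rows are all made up.

```python
import sqlite3


def make_store(ddl: str, insert_sql: str, rows: list) -> sqlite3.Connection:
    """Create an in-memory SQLite database standing in for one physical store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two separate "physical" stores with their own schemas (made-up data).
crm = make_store("CREATE TABLE cust (cust_id INTEGER, cust_nm TEXT)",
                 "INSERT INTO cust VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
billing = make_store("CREATE TABLE invoice (cust_id INTEGER, amount REAL)",
                     "INSERT INTO invoice VALUES (?, ?)",
                     [(1, 500.0), (1, 750.0), (2, 300.0)])


class VirtualCustomerView:
    """Presents one conceptual 'customer' entity; nothing is copied or stored."""

    def __init__(self, crm_conn, billing_conn):
        self.crm = crm_conn
        self.billing = billing_conn

    def get(self, customer_id: int) -> dict:
        # Each call goes to the sources at query time, so results are current,
        # and every call adds load to the source systems.
        name_row = self.crm.execute(
            "SELECT cust_nm FROM cust WHERE cust_id = ?",
            (customer_id,)).fetchone()
        total_row = self.billing.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM invoice WHERE cust_id = ?",
            (customer_id,)).fetchone()
        return {"customer_id": customer_id,
                "name": name_row[0] if name_row else None,
                "total_billed": total_row[0]}


view = VirtualCustomerView(crm, billing)
customer = view.get(1)
```

A new invoice inserted into the billing store shows up on the very next call to `view.get`, with no batch window.  The same design also makes the cons visible: every lookup runs queries against the source systems, and an aggregation over millions of invoices would be computed at request time.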