<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Guest</title>
    <link>http://www.itmatter.com/web/guest/blog/-/blogs/rss</link>
    <description>Guest</description>
    <item>
      <title>Playing with Bayesian Classifier</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/playing-with-bayesian-classifier</link>
      <description>&lt;h1&gt;&lt;a name="Introduction"&gt;&lt;/a&gt;Introduction&lt;/h1&gt;&lt;p&gt;I have just released the source code for&amp;nbsp;a simple Bayesian classifier based on the ci-bayes project. The classifier offers two working modes: basic db mode and direct db mode.&amp;nbsp;&lt;/p&gt;&lt;h1&gt;&lt;a name="Details"&gt;&lt;/a&gt;Details&lt;/h1&gt;&lt;p&gt;In 'basicdb' mode the classifier still uses in-memory hash table structures to store feature/token and category counts, but also implements classifier and classification listener interfaces for persisting information in the back-end database. This allows the in-memory data structures to be initialized from the database instead of being rebuilt each time before use.&lt;/p&gt;&lt;p&gt;In 'directdb' mode the classifier does not store any information in memory and instead works directly on the back-end database tables. This mode was designed for very large data sets with a large number of classification categories, where it is not feasible or desirable to store all data structures in memory.&lt;/p&gt;&lt;p&gt;Future enhancements, building on the functionality in directdb mode, will include integration with the memcached distributed cache to minimize the number of calls to the database. Alternatively, one can use ehcache as a second-level cache for the Hibernate O/R backend with only configuration changes required.&lt;/p&gt;&lt;h1&gt;&lt;a name="Details"&gt;&lt;/a&gt;Source Code&lt;/h1&gt;&lt;p&gt;Hosted at Google Code: &lt;a href="http://code.google.com/p/itmatter-bayes/"&gt;code.google.com/p/itmatter-bayes/&lt;/a&gt;&amp;nbsp;licensed under LGPLv3.&lt;/p&gt;</description>
      <pubDate>Sat, 04 Jun 2011 17:50:44 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/playing-with-bayesian-classifier</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2011-06-04T17:50:44Z</dc:date>
    </item>
    <item>
      <title>Google Pagerank for Wikipedia Topics</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/google-pagerank-for-wikipedia-topics</link>
      <description>&lt;p&gt;&amp;nbsp;Working on semantic extensions&amp;nbsp;for the&amp;nbsp;jamwiki software, I spend my time loading and parsing the Wikipedia dumps on a regular basis. My goal is to use NLP parsers and an NLP&amp;nbsp;co-reference linker to extract information on entities (people, places, dates, etc.), in addition to linking them with freebase and dbpedia entities.&lt;/p&gt;&lt;div&gt;Out of the 7M+ pages in the Wikipedia dumps, about 2.2M pages fall into the &amp;quot;topic&amp;quot; page type.&amp;nbsp;The other page types are category pages, templates, redirects and so forth.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;I realized that with 2.2M topics, which on average consist of anywhere from 50 to 100+ sentences, I am looking at 100-200M NLP POS sentence parse trees. Representing them as a semantic graph/network results in millions of nodes and over 1 billion edges (unverified estimates).&amp;nbsp;I/O architecture is key here, and a distributed hash table approach such as the one implemented by memcache turns out to be the best approach to take.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;To make this a more realistic&amp;nbsp;exercise,&amp;nbsp;I decided that I needed a way to limit the initial input size of the topics selected for analysis. To do&amp;nbsp;this, I needed some kind of topic rank as my input set selection criterion. 
I searched the internet for the top 100K or 200K topics but could not find a complete list - mostly just papers describing the scoring methodologies and their outcomes.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;Naturally, I came across the ideas of other great minds, so I decided to read up on the Google Pagerank white paper and implemented the algorithm as described here:&amp;nbsp;&lt;a target="_blank" style="color: rgb(42,93,176)" href="http://www.sirgroane.net/google-page-rank/"&gt;http://www.sirgroane.&lt;wbr&gt;&lt;/wbr&gt;net/google-page-rank/&lt;/a&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;It is exciting to watch the Pagerank algorithm work its way through the data and compute the Google PR for each topic based on the other topics linking into it. My challenge right now is to find a way to parallelize the algorithm, as it would take about a week or two on my current hardware configuration to compute sufficiently converged PRs on the 122M set of topic links - I am far too impatient to wait that long. :-)&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;As described in the white paper, the average of the pagerank values for all topics should converge to 1.0 over time. While running the computation I decided to take some sample average values across all topics:&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;div&gt;0.18007095210131&amp;nbsp;&lt;/div&gt;&lt;div&gt;0.19245859487535&amp;nbsp;&lt;/div&gt;&lt;div&gt;0.22674132451718&lt;/div&gt;&lt;div&gt;0.26364032294520&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;One can see that the average page rank value is starting to converge; the rate of convergence is determined by the damping factor - for details please refer to the Google white paper.&amp;nbsp;&lt;br /&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;I will post the actual source code and final results on the Wikipedia data set in the near future. 
If anyone is interested in the source code and the resulting data set please contact me at dfisla at itmatter.com.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Tue, 28 Sep 2010 15:14:59 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/google-pagerank-for-wikipedia-topics</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-09-28T15:14:59Z</dc:date>
    </item>
    <item>
      <title>IM X Conference</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/im-x-conference</link>
      <description>&lt;p&gt;At our annual Information Management Conference (&lt;a href="http://www.imconference.ca/default.asp"&gt;Adastra IM X&lt;/a&gt;) I had the pleasure of running into &lt;a href="http://semanta.cz/wiki/display/public/Semanta"&gt;Peter Hora from Semanta&lt;/a&gt; and &lt;a href="http://roman.stanek.org/"&gt;Roman Stanek&lt;/a&gt; from GoodData, as well as the usual keynote speakers such as &lt;a href="http://aiken.isy.vcu.edu/"&gt;Dr. Peter Aiken&lt;/a&gt;. As usual, the presentations were great, but the side discussions were even better!&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 17:18:44 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/im-x-conference</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T17:18:44Z</dc:date>
    </item>
    <item>
      <title>Open Source Projects</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/open-source-projects</link>
      <description>&lt;p&gt;Company blog of Daniel Fisla - the founder of ITMATTER Inc. - Software architect/technologist.&lt;/p&gt;&lt;p&gt;As a founder of ITMATTER Inc. I am involved in numerous open source software projects and plan to use this blog to communicate short insights into current and future projects of interest.&lt;/p&gt;&lt;p&gt;Current list of projects consists of:&lt;/p&gt;&lt;p&gt;1. &lt;strong&gt;DataCleaner &lt;/strong&gt;- Open Source - Data Quality Solution -&amp;nbsp;&lt;a href="http://datacleaner.eobjects.org/"&gt;http://datacleaner.eobjects.org&lt;/a&gt;&lt;/p&gt;&lt;p&gt;2. &lt;strong&gt;JAMWiki Java Wiki Engine&lt;/strong&gt;&amp;nbsp;&amp;nbsp;- Open Source - MediaWiki compatible Wiki Solution -&amp;nbsp;&lt;a href="http://jamwiki.org/"&gt;http://jamwiki.org&lt;/a&gt;&lt;/p&gt;&lt;p&gt;3.&amp;nbsp;&lt;strong&gt;The Java Wikipedia API (Bliki engine)&lt;/strong&gt;&amp;nbsp;&amp;nbsp;- Open Source - MediaWiki API and JAMWiki parser extension -&amp;nbsp;&lt;a href="http://code.google.com/p/gwtwiki/"&gt;http://code.google.com/p/gwtwiki/&lt;/a&gt;&lt;/p&gt;&lt;p&gt;4. &lt;strong&gt;ITMATTER/JAMWiki High Performance Extensions&lt;/strong&gt;&amp;nbsp;&amp;nbsp;- Open Source - Extensions to JAMWiki Java Wiki Engine&amp;nbsp;&amp;nbsp;-&amp;nbsp;&lt;a href="http://jamwiki.org/"&gt;http://jamwiki.org (dfisla branch)&lt;/a&gt;&lt;/p&gt;&lt;p&gt;5. &lt;strong&gt;uniBlogger.com&lt;/strong&gt; - English Wikipedia and Wikinews mirror built on&amp;nbsp;ITMATTER/JAMWiki High Performance Extensions - &lt;a href="http://www.uniblogger.com/"&gt;http://www.uniblogger.com&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Fri, 16 Apr 2010 17:59:22 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/open-source-projects</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-16T17:59:22Z</dc:date>
    </item>
    <item>
      <title>Distributed Data - I/O Bound</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/distributed-data-i-o-bound</link>
      <description>&lt;p&gt;As part of the JAMWiki Java Engine High Performance Extensions project, I have done extensive research&amp;nbsp;into minimizing disk access overhead. For validation purposes, I have loaded the Wikipedia articles into the JAMWiki solution - about 22GB+ in size, 8M+ articles.&lt;/p&gt;&lt;p&gt;With respect to parsing Wikipedia articles and other related content, I have decided to pre-render a significant number of articles with content size over a certain threshold (about 1 million articles in total - articles over 3KB of compressed blob data). This gave me a more or less constant rendering time for large articles. An analysis of article sizes showed that about 10% of all Wikipedia articles accounted for about 85% of&amp;nbsp;total content size. Under such a distribution, pre-rendering (caching) the top 10% of articles would greatly reduce request processing times.&lt;/p&gt;&lt;p&gt;First, I evaluated EhCache, where the idea was to push cached data into the edge nodes (web server tier) in order to avoid database access when retrieving topic, category, and other content. I looked at versions from 1.6.0 up to the latest 2.0.0 snapshot version of EhCache.&lt;/p&gt;&lt;p&gt;What became apparent is that this cache approach worked well for read operations, but did not work so well for frequent updates and pre-building of large persistent EhCache files. Any data structure that uses B-Tree indexes to store lookup values will not perform well under frequent updates. There are other data structures such as B+-Trees and Red-Black Trees that would do better (on average); however, none of these can match the performance of distributed hash tables.&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 15:38:31 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/distributed-data-i-o-bound</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T15:38:31Z</dc:date>
    </item>
    <item>
      <title>PIM-Related Questions on the Minds of Business Leaders</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/pim-related-questions-on-the-minds-of-business-leaders</link>
      <description>&lt;p&gt;Interview with Miko Roller from &lt;a href="http://www.heiler.com"&gt;Heiler Software AG&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I asked Adastra MDM/PIM consultant Daniel Fisla to elaborate on 10 topics about PIM based on his experience as a consultant in the industry. Mr. Fisla provides a valuable perspective coming from many years of interaction with companies and implementation of enterprise solutions including PIM.&lt;/p&gt;&lt;p&gt;For full article and PDF download visit &lt;a href="http://productinformationmanagement.ca/2009/04/17/10-pimrelated-questions-on-the-minds-of-business-leaders.aspx"&gt;Product Information Management Blog&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 23:16:26 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/pim-related-questions-on-the-minds-of-business-leaders</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T23:16:26Z</dc:date>
    </item>
    <item>
      <title>Speaking at Adastra IM 2009 Conference</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-adastra-im-2009-conference</link>
      <description>&lt;p&gt;&lt;span lang="EN-GB" style="font-family: &amp;quot;Verdana&amp;quot;,&amp;quot;sans-serif&amp;quot;; font-size: 9pt; mso-fareast-font-family: SimSun; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-GB; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA"&gt;I had the opportunity to present on PIM: Integrating People, Processes and Technologies &amp;ndash; Challenges and Solutions.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;h2&gt;ABSTRACT&lt;/h2&gt;&lt;div&gt;As Master Data Management gains more and more prominence in IT circles, the individual pieces that make up a comprehensive MDM strategy are sometimes not given the prominence they should.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;Product Information Management is one of those important pieces.&amp;nbsp; A subset of Master Data Management, Product Information Management (PIM) faces a unique challenge&lt;b&gt;: the integration of people, products, and processes in the face of the rising amount and complexity&lt;/b&gt; of product-related information.&amp;nbsp; Attend this session to learn why PIM is such an important part of an enterprise MDM strategy, how PIM differs from other MDM components such as CDI, and what IT practitioners need to know when considering the implementation of PIM tools.&amp;nbsp;&lt;/div&gt;</description>
      <pubDate>Sat, 17 Apr 2010 23:34:53 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-adastra-im-2009-conference</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T23:34:53Z</dc:date>
    </item>
    <item>
      <title>Successful Product Information Management Implementations: A Guide for PIM Tool Selection</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/successful-product-information-management-implementations:-a-guide-for-pim-tool-selection</link>
      <description>&lt;h2&gt;Abstract&lt;/h2&gt;&lt;p&gt;This paper, the first in a series aimed at Master Data Management (MDM)-related challenges, is geared towards a business audience and presents a high-level overview of the approach and methodology implemented by Adastra on numerous MDM and Product Information Management (PIM) projects. &amp;nbsp;Leveraging this experience as well as extensive Business Intelligence and Extract-Transform-Load tool selections, we discuss the selection methods that should be applied to the PIM tool-selection process in order to minimize risk for organizations as they plan their PIM technology investment. Subsequent papers will address PIM implementation-specific challenges and solutions.&lt;/p&gt;&lt;h2&gt;&lt;strong&gt;The Challenge &lt;/strong&gt;&lt;/h2&gt;&lt;p&gt;Given the rise of MDM, and PIM in particular, many organizations are challenged to find the software tools to meet their day-to-day and future operational needs. The larger software vendors have mainly acquired their technology by purchasing smaller vendors and are still in the process of integrating these solutions into their computing platforms. At the same time, smaller companies continue to challenge the established vendors by pushing innovation and staying on the leading edge of technology. &amp;nbsp;The result is a complex array of multiple PIM vendors, products and solutions in an evolving marketplace.&amp;nbsp;In this environment, a risk-averse and proven approach to PIM tool selection is essential to ensuring project success.&lt;/p&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;p&gt;The rest of the paper can be downloaded by registering at the Adastra Canada &lt;a href="http://www.adastracorp.com/document.aspx?menu_id=25&amp;amp;submenu_id=371&amp;amp;id=898"&gt;web site&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 23:07:12 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/successful-product-information-management-implementations:-a-guide-for-pim-tool-selection</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T23:07:12Z</dc:date>
    </item>
    <item>
      <title>Speaking at IRMAC</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-irmac</link>
      <description>&lt;p&gt;I had the opportunity to speak at the &lt;a href="http://www.irmac.ca/"&gt;Information Resource Management Association of Canada&lt;/a&gt;&amp;nbsp;- the same presentation as at the MDM Toronto Summit 2008.&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 23:31:22 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-irmac</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T23:31:22Z</dc:date>
    </item>
    <item>
      <title>Speaking at MDM Toronto Summit 2008</title>
      <link>http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-mdm-toronto-sumit-2008</link>
      <description>&lt;p&gt;&lt;span lang="EN-GB" style="font-family: 'Verdana','sans-serif'; font-size: 9pt; mso-fareast-font-family: SimSun; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-GB; mso-fareast-language: AR-SA; mso-bidi-language: AR-SA"&gt;&lt;span lang="EN-GB" style="font-family: 'Verdana','sans-serif'; font-size: 9pt; mso-fareast-font-family: SimSun; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-GB; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA"&gt;MDM Challenges and Solutions from the Real World&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span lang="EN-GB" style="font-family: 'Verdana','sans-serif'; font-size: 9pt; mso-fareast-font-family: SimSun; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-GB; mso-fareast-language: AR-SA; mso-bidi-language: AR-SA"&gt;&lt;span lang="EN-GB" style="font-family: 'Verdana','sans-serif'; font-size: 9pt; mso-fareast-font-family: SimSun; mso-bidi-font-family: 'Times New Roman'; mso-ansi-language: EN-GB; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA"&gt;For highlights see &lt;a href="http://www.adastracorp.com/document.aspx?id=146"&gt;Adastra page&lt;/a&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;font face="Verdana"&gt;Master Data&lt;/font&gt;&lt;/strong&gt;&lt;font face="Verdana"&gt; &amp;ndash; a core set of data critical to major business processes and functions&lt;/font&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;font face="Verdana"&gt;Master Data Management&lt;/font&gt;&lt;/strong&gt;&lt;font face="Verdana"&gt; &amp;ndash; organizational structures, business processes, culture and technical tools ensuring key (master) data in the enterprise is:&lt;/font&gt;&lt;/p&gt;&lt;p&gt;&lt;font face="Verdana"&gt;1. Reliable and Correct - reliable and stable data sources, managed and provided by reliable and stable systems&lt;br /&gt;2. Unified - in content and understanding &lt;br /&gt;3. 
Available - at the right place at the right time&lt;br /&gt;&lt;/font&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 17 Apr 2010 23:26:51 GMT</pubDate>
      <guid isPermaLink="false">http://www.itmatter.com/web/guest/blog/-/blogs/speaking-at-mdm-toronto-sumit-2008</guid>
      <dc:creator>Daniel Fisla</dc:creator>
      <dc:date>2010-04-17T23:26:51Z</dc:date>
    </item>
  </channel>
</rss>
