What is data mining, and does it encourage the creation of a specific kind of history?


Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. It can be applied in many different areas, from business to history. For those who, like myself, struggle to grasp the computing side of things, an example from everyday business shows data mining in action. One Midwest grocery chain used the data mining capacity of Oracle software to analyse local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.[1]

Data mining consists of five key elements: extracting, transforming and loading transaction data onto the data warehouse system; storing and managing the data in a multidimensional database system; providing data access to business analysts and ICT professionals; analysing the data with application software; and presenting the data in a useful format. Data mining also shows the difference between methodologies: a ‘keyword’ search highlights a specific piece of data (a word), whereas the Semantic Web highlights information through the meaning implied within the results.[2] This post will look at methods such as Ngram and Topic Modelling, evaluate how the results of data mining are presented, and show how historians interpret them.
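The diapers-and-beer finding is an association between items that turn up in the same basket. Here is a minimal sketch of how such a pattern could be measured. The baskets are invented for illustration, and the helper names `support` and `confidence` are my own assumptions; this is not the grocery chain's data or the method Oracle's software actually uses.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "crisps"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

def support(items, baskets):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# How often diapers and beer appear together, and how often
# a basket with diapers also holds beer.
print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
```

In this toy data two of the five baskets hold both items, and two of the three diaper baskets also hold beer. A retailer would look for pairs where both figures are high.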

The purpose of data mining is to extract useful knowledge from the data, and to put that knowledge to beneficial use. Data mining consists of many tools which analyse information from different perspectives. It is used particularly to summarise large databases which span different fields.[3] Data mining techniques can be used to filter many variables down to a vital few in order to build or improve predictive models. Specific examples fall into four categories: classification, regression, clustering, and association. One classification technique is a tree. In a tree, the data mining tool begins with a pool of all cases and then gradually divides and subdivides them based on selected variables. The tool can continue branching and branching until each subgroup contains very few (maybe as few as one) cases.[4] A limit is needed to prevent ‘overfitting’, where categories divide repeatedly until as little as one case is left within each; this is one of the dangers and disadvantages of the decision tree methodology. For analytical evaluation, the tree primarily highlights key variables.[5] The use of text mining for historians or researchers is debatable. The algorithms return more results when searching for a word, yet those results may not always be relevant, because a word may have more than one meaning, so some results will be unrelated to the question.
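The branching and the limit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are invented, the function `build_tree` and its `min_cases` parameter are hypothetical names, and the variables are simply used in the order given, whereas real data mining tools choose each split with a statistical criterion.

```python
from collections import Counter

# Hypothetical shopping records, invented for illustration: (variables, outcome).
cases = [
    ({"day": "Thursday", "basket": "small"}, "beer"),
    ({"day": "Thursday", "basket": "small"}, "beer"),
    ({"day": "Thursday", "basket": "large"}, "no beer"),
    ({"day": "Saturday", "basket": "large"}, "no beer"),
    ({"day": "Saturday", "basket": "large"}, "no beer"),
    ({"day": "Saturday", "basket": "small"}, "beer"),
]

def build_tree(cases, variables, min_cases):
    """Divide and subdivide the pool of cases, one variable at a time."""
    outcomes = [outcome for _, outcome in cases]
    # Stop branching when no variables remain, the group is uniform,
    # or the group has reached the size limit that guards against overfitting.
    if not variables or len(set(outcomes)) == 1 or len(cases) <= min_cases:
        return Counter(outcomes).most_common(1)[0][0]  # leaf: majority outcome
    first, rest = variables[0], variables[1:]
    groups = {}
    for record, outcome in cases:
        groups.setdefault(record[first], []).append((record, outcome))
    return {first: {value: build_tree(group, rest, min_cases)
                    for value, group in groups.items()}}

# A stricter limit gives a shallower tree; a looser one keeps dividing.
shallow = build_tree(cases, ["day", "basket"], min_cases=3)
deep = build_tree(cases, ["day", "basket"], min_cases=1)
print(shallow)
print(deep)
```

With the limit set to three cases the tree stops after dividing by day. With the limit set to one it goes on dividing by basket size until every subgroup is uniform, which is the behaviour that risks overfitting on real, noisier data.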

Topic modelling means discovering the topics that run through a body of data; Blei uses Wikipedia as his example. Once the topics are found, a file can be connected to them, which means ‘annotating’ the file, using algorithms to locate the different topics within it.[6] This is equivalent to the classification process in data mining, but uses a different method.

Stepwise regression is a type of multivariate regression in which variables are entered into the model one by one while, at each step, the variables already entered are tested for removal. It can be a good model to use when supposedly independent variables are correlated. Stepwise regression is one of the techniques that can help thin out the forest and find important predictive factors.[7] Despite the usefulness of this tool, humanities resources are equipped to function without it. This does not mean that they are equipped to deal with the ‘black box’ problem in data mining.[8] The ‘black box’ problem arises when some of the output data does not correspond with the input data and thus presents unsatisfactory results. In some ways it is similar to the Bayesian approach used in clustering, where on some occasions the output data is not relevant to the input data. This is particularly impractical for historians.

Cluster techniques detect groupings in the data. We can use this technique as a start on summarization and segmentation of the data for further analysis. Two common methods for clustering are K-Means and hierarchical clustering. K-Means iteratively moves from an initial set of cluster centres to a final set of centres. Each observation is assigned to the cluster with the nearest mean. Hierarchical clustering finds the pair of objects that most resemble each other, then iteratively adds objects until they are all in one cluster.[9] Historians have credited Inverse Document Frequency (IDF) while highlighting its experimental nature. However, when it is combined with Term Frequency (TF), its use has advanced from ‘text retrieval into methods for retrieval of other media, and into language processing techniques for other purposes.’[10] This proves to be a good way of correlating data to provide maximum results.
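To make the K-Means description concrete, here is a minimal one-dimensional sketch. The publication years and the starting centres are invented, and the function name `k_means` is my own; real tools work on many variables at once and pick their starting centres more carefully.

```python
def k_means(values, centres, max_rounds=100):
    """One-dimensional K-Means: returns the final centres and their clusters."""
    clusters = [[] for _ in centres]
    for _ in range(max_rounds):
        # Assign each observation to the cluster with the nearest centre.
        clusters = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # Move each centre to the mean of its cluster (unchanged if the cluster is empty).
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break  # the centres have stopped moving
        centres = new_centres
    return centres, clusters

# Hypothetical publication years of six documents, with two rough starting guesses.
years = [1701, 1705, 1710, 1850, 1855, 1860]
centres, clusters = k_means(years, [1700, 1750])
print(centres)
print(clusters)
```

The two starting guesses drift to the middle of each group of years, and the documents fall into an early eighteenth-century cluster and a mid nineteenth-century one.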

Association examines correlations between large numbers of quantitative variables by grouping the variables into factors. Each of the resulting factors can be interpreted by reviewing the meaning of the variables that were assigned to each factor. One benefit of association is that many variables can be summarized by just a few factors.[11]

Google N-Gram was created so that people can visualise the rise and fall of particular keywords across 5 million books and 500 years; it has so far covered about 4% of all the books ever published. Obvious correlations can be seen in the rise and fall of the lines on the graph, but further interpretation is quite difficult to deduce.[12] Take a specific question, for example: when was the Cold War? Searching ‘Cold’ cross-referenced against ‘War’ would analyse each word as a separate entity, counting how many times each was mentioned rather than capturing the significance the words have together. Aiden built a software tool called the n-grams viewer to chart the frequency of phrases across a corpus of 500 billion words. A ‘one-gram’ plots the frequency of a single word such as ‘feminism’ over time; a ‘two-gram’ shows the frequency of a contiguous phrase, such as ‘touch base’. Matthew Hurst also analysed it from a language perspective and compared words across American English and British English to see how they had changed over the years, which was very little. He also compared the same word written with a capital initial letter and with a lower-case one. However advanced this may seem, some humanities researchers in the traditional camp complain that their field can never be encapsulated by the frequency charts of words and phrases produced by the n-grams tool.[13] Other scholars have deep reservations about the digital humanities movement as a whole, especially if it will come at the expense of traditional approaches. “You can’t help but worry that this is going to sweep the deck of all money for humanities everywhere else,” says Anthony Grafton, a historian at Princeton.[14]
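The difference between counting ‘Cold’ and ‘War’ separately and counting the phrase ‘Cold War’ can be shown with a short sketch. The sample sentence is invented and the function name `ngram_counts` is my own; this only illustrates the counting idea and is not how Google's viewer is implemented.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n contiguous words in a text."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

# An invented sentence in which 'cold' and 'war' do not always appear together.
sample = "the cold war began after the war and a cold winter followed the cold war"

one_grams = ngram_counts(sample, 1)
two_grams = ngram_counts(sample, 2)

print(one_grams["cold"], one_grams["war"])  # each word counted on its own
print(two_grams["cold war"])                # the contiguous phrase only
```

Here each word appears three times on its own, but the phrase appears only twice, because one ‘cold’ belongs to ‘cold winter’ and one ‘war’ stands alone. A one-gram search cannot tell these cases apart, while a two-gram search can.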

It can be argued that digitising history has ushered in a new era in the way information is researched. Tim Hitchcock and some of his colleagues consider historians who discredit this type of history to be fairly old-fashioned people, isolated within their archives.[15]

Data mining methodologies have aided historians by categorising information and producing new and almost unthought-of perspectives on it. In relation to data mining, it could be argued that digital history makes research easier and more precise. On the other hand, books were the original form of information, and digital tools cannot replace the personal interaction and sentiment that one feels with the document. Many historians have been excited about the new digital ways of research and about Google N-Gram. N-Gram aims to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many historians have been doing for some time.[16] Overall, data mining is the analysis of large observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful for the data owner. Data mining products are taking the industry by storm, and the major database vendors have already taken steps to ensure that their platforms incorporate data mining techniques. In a historical sense there are still some improvements to be made, and not all historians may be won over by this new revolution, but it is the future and has definitely encouraged the creation of new and different forms of history.


[2] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, haystack, pp. 65-67

[3] Data Mining: What is Data Mining?, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm; consulted 10 April 2013

[5] Data mining for process improvement, http://www.crosstalkonline.org/storage/issue-archives/2011/201101/201101-Below.pdf; consulted 10 April 2013

[6] Topic models, http://videolectures.net/mlss09uk_blei_tm/; consulted 10 April 2013

[8] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, haystack, pp.67-78

[10] Stephen Robertson, Understanding Inverse Document Frequency: On theoretical arguments for IDF, Journal of Documentation, vol. 60, no. 5, pp. 503–520

[11] Ibid

[12] Sapping Attention, http://sappingattention.blogspot.co.uk/; consulted 10 April 2013

[14] Ibid

[15] With Criminal Intent, http://criminalintent.org/; consulted 10 April 2013
