At the 8th DITA lab session we learned that data mining is one of the modern techniques used in information systems. After an informative overview of the subject, we did an exercise exploring the relationship between data mining and text analysis, in particular by exporting results from the Old Bailey’s API to Voyant Tools and by looking at a text mining research project from the Utrecht University Digital Humanities Lab.
In this post I aim to cover three main points:
- The Old Bailey API structured.
- Exporting the API results to Voyant Tools.
- Data analysis that I have chosen from the Utrecht Text Mining Project.
The Old Bailey API structured
The Old Bailey Online API provides information on legal proceedings from 1674 to 1913 and gives access to over 197,000 trials held at London’s highest criminal court. I first ran a general search, which let me find documents in full text. I then searched for the same keyword in the Old Bailey API Demonstrator, which allowed my results to be queried and passed to Voyant Tools.
This time I wanted to search for animal theft committed by women between 1831 and 1931, to see how prevalent this type of theft was among women. To get a quick result I used the general search with the term “Animal Theft” as a keyword and, as you can see below, the results came up in ascending chronological order.
I then ran a search in the Old Bailey API Demonstrator to obtain results that I could export to Voyant Tools. During the exercise I noticed that neither the API Demonstrator nor the general search offers much help with mining or analysing the text, so I needed a tool such as Voyant that would let me look into the text itself.
Exporting the API results to Voyant Tools
The other exercise we did in the DITA lab session was to export the results to Voyant Tools for further analysis and visualization. The tool displayed the results as a word cloud showing the most frequently used words, out of a total of 4,761 words and 964 unique words.
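The two figures Voyant reports, total words and unique words, can be approximated in a few lines of Python. This is only a rough sketch: the sample sentence is invented, and the rule used here to split text into words is an assumption, so Voyant's own counts may well differ.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (total words, unique words, Counter of word frequencies)."""
    # Assumed rule: lowercase the text and treat runs of letters
    # or apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return len(words), len(counts), counts


# Invented sample text, not taken from the Old Bailey results.
sample = "The dog was stolen. The fowls were stolen near the station."
total, unique, counts = word_stats(sample)
print(total, unique)
print(counts.most_common(2))
```

The `most_common` list is the same information a word cloud shows, with the size of each word replaced by a number.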
Here the real exploration of text mining begins: analyzing the data to find links and common denominators. In the exercise, the tool revealed the animals most commonly stolen, which were dogs and fowls. It also helped me find the places where thefts most commonly occurred, which included stations. Gathering information like this also lets researchers draw conclusions that go beyond the crime itself, for example about the motives involved or the security measures available in stations.
Based on my past and current experience, I can say that analysing texts with Voyant takes special skills if you want to use it effectively and reach more detailed results, such as forecasting and foresight.
Data analysis that I have chosen from the Utrecht Text Mining Project
This was probably the most interesting part of our exercise. We had to look at the projects in the Utrecht Text Mining Project and choose one to talk about. I chose the Annotated Books Online Project, which analyses the annotations that readers left in the books they owned. The project allows experts and researchers to look through sixty copies from library collections.
I took a sample of these books and tried to analyse the annotations. I must say that this was difficult because of the poor image quality and blurred text. What caught my attention was that some of the annotations were so extensive that they read like a book inside another book, which may indicate the wealth of knowledge in that era. I wonder whether it is possible to export these annotations, together with the original texts of the books, to Voyant and look for similarities between the two.
In the last DITA lab we did text analysis using certain tools. These tools extract information from large amounts of text (I will show you the result of this analysis later in this post). In this exercise, we took words either written in computer scripts or published online. Imagine this process without a computer! Mission impossible.
Text analysis includes information retrieval, lexical analysis to study the frequency of words, identification of linguistic patterns, linguistic coding, data mining techniques such as link and association analysis, visualization and predictive analysis. The main objective of these operations is to turn text into data that can be analysed with natural language processing applications and analytical methods. The term also covers dedicated text analysis applications, used either on their own or in conjunction with query and data analysis (http://en.wikipedia.org/wiki/Text_mining).
See Through Your Text
The phrase above appears on the front page of http://voyant-tools.org/. With this tool I used a data set of #newyorksnow tweets archived using TAGS, which I wrote a post about (https://malzahranidita14.wordpress.com/2014/11/09/tags-and-linking-to-archives/).
As you can see above, this tool has several interesting features:
- The results are highlighted in yellow.
- You can see the trend results in a graph. This is a quick way to look at the differences and similarities in the results.
- You can exclude common words, and you can create your own list of excluded words to suit the objectives of the analysis.
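The idea of excluding general words can be sketched as a simple filter applied before counting. The stop-word list and the sample sentence below are invented for illustration, and the `extra` parameter is a hypothetical stand-in for a user's own exclusion list; Voyant's built-in lists will be different.

```python
from collections import Counter

# Hypothetical stop-word list, much shorter than any real one.
STOP_WORDS = {"the", "a", "in", "is", "it", "of", "and", "to"}


def frequent_terms(words, stop_words=STOP_WORDS, extra=()):
    """Count words, skipping stop words and any extra words the user excludes."""
    excluded = set(stop_words) | {w.lower() for w in extra}
    return Counter(w.lower() for w in words if w.lower() not in excluded)


# Invented sample, standing in for the text of archived tweets.
tweet_words = "Snow in New York and the snow is deep in the city".split()
counts = frequent_terms(tweet_words, extra=["city"])
print(counts.most_common(1))
```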
Analysing Arabic Text
During this week I searched for tools that can analyze Arabic text. Unfortunately, I did not find any, and as an Arabic speaker I would have liked to see tools that support Arabic text.
It appears to me that dealing with different subsets can be complex. For example, analyzing Arabic subsets can be quite difficult, especially when there are no tools to support that operation. Most of these problems relate to the symbols and characters used in the Arabic language, which differ entirely from those of European languages.
Little by little I am becoming aware of the importance of information science and the responsibility that falls on the shoulders of the experts in the field. Information science is not only about collecting information, but also about how information is used, the activities around it and how these activities can determine its quality. As I mentioned in my last post, the inflation of information and its fast growth require the creation of new applications in order to keep pace with recent developments.
In the last DITA lab we used a web application called Altmetric Explorer, which lets you track and collect the conversations about scientific articles on the Internet. It can also provide details of how these conversations affect the value of the articles.
To be able to track the conversations you must first register via your email. This application does not provide free services. However, luckily for us, the program provides free services for students in this course until mid-January 2015.
The basic idea of this application rests on two things: Combining and Rating.
Combining refers to collecting articles published in multiple sources.
Rating refers to evaluating an article based on the number of conversations about it in multiple places, such as news, journals, Facebook, Twitter and blogs.
In the lab we explored this app and obtained bibliographic and social media data from it for research. We used the Altmetric interface to filter the data, then saved the report and exported it as a CSV file.
Later, I created and exported two reports and I am going to share one report with you in this post.
I chose all articles published by Oxford University Press in the past month that were mentioned on Facebook, Twitter and blogs and in news outlets. I ended up with thousands of articles.
As you can see in the first picture above, in the “Articles” tab there are colored circles, each with a number on it, to the left of the articles. In the “Activity” tab we can see all the activity around the articles. In the second picture above, the “Journal” tab gives detailed information, in the form of a graph, about the journals that published the articles.
Let’s take the following example where you can see how to explore what you can obtain from an article.
This article is ranked second, with a score of 345, which is based on the conversations around it as follows:
Altmetric has seen 25 stories from 24 outlets, 18 posts from 18 blogs, 42 tweets from 42 accounts with an upper bound of 129,273 combined followers, 20 public wall posts from 20 accounts, 45 Google+ posts from 35 accounts and 2 videos from 2 accounts. The data shown below was also collected from the profiles of the tweeters who shared this article.
Between interface and CSV file
The Altmetric Explorer application can export the results as a CSV file that I can open in Excel, which immediately told me how many articles I had found: 4,719. At a quick glance, we can see below the difference between viewing the data in the interface and viewing it as a CSV file.
The CSV file concentrates on quantities and lets us build a table of the results, so that I can import large sets of data and organize, explore and analyze them. However, these results cannot be updated once exported. Only the data viewed in the interface is updated every minute. In addition, the interface lets us see the data as qualitative information.
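Excel is not the only way to explore an exported CSV file. As a minimal sketch, the same kind of counting and sorting can be done with Python's standard `csv` module. The column names and rows below are invented, because the real layout of an Altmetric Explorer export is not shown in this post.

```python
import csv
import io

# Invented sample standing in for an exported report; the column names
# "Title", "Journal" and "Score" are assumptions, not the real export format.
SAMPLE_CSV = """Title,Journal,Score
Article A,Journal X,90
Article B,Journal Y,45
Article C,Journal X,12
"""


def load_report(handle):
    """Read the CSV rows into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


rows = load_report(io.StringIO(SAMPLE_CSV))
print(len(rows))  # number of articles in the report

# Find the article with the highest score.
top = max(rows, key=lambda row: int(row["Score"]))
print(top["Title"])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open("report.csv", newline="")`, where the file name is again only a placeholder.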
In the previous exercise we learned how to archive tweets and in this application, we learned how to search for activities relating to scientific articles. Search and archiving are important in social networking. What has also become important is the ability to measure the quality of scientific articles.
Therefore, can we create new standards for evaluation by designing new social network programs?
In one of my previous posts, I talked about mashups, which are applications that combine APIs from two different programs. I also talked about how mashups could contribute to the rapid growth of information technology.
Considering the plethora of information published daily by millions of people, it is necessary to have an application like this that collects information and archives it in a text format so that it can later be analyzed and studied.
Twitter is one of the most widely used social networking services, with millions of users around the world. In the last DITA session my classmates and I learned how to use an application developed by Martin Hawksey. This application is a mashup that uses Google’s API and the Twitter Search API to archive tweets and their related metadata.
What caught my attention with this application is that we were able within a few minutes to get a picture of what is occurring in the world on a particular topic. Through this application, we can also measure what people are interested in, the time of posting tweets, the content and number of retweets and the number of followers.
As can be seen below, with this application, we ended up with a dynamic ball where you can see more than 5,000 tweeters and you can update it every hour. From this ball we can see who is posting a tweet with a specific tag, who has the most followers and so on.
Archiving Arabic Tweets
In the lab we used an English tag, #citylis. For those who are not familiar with hashtags, a hashtag is a word or unspaced phrase prefixed with the # symbol, forming a label that anyone can search to find all the posts containing that word or phrase. However, because Arabic is my native language, I used the application to archive and collect Arabic tweets, just to see whether it also supports the Arabic language.
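To illustrate that hashtags in Arabic can be handled in the same way as English ones, here is a small sketch that pulls hashtags out of a piece of text. The sample tweet is invented, and this is not how the archiving application itself works; it only shows that Python's regular expressions treat Arabic letters as word characters.

```python
import re


def extract_hashtags(text):
    """Return the hashtags in a tweet, without the leading # symbol."""
    # In Python 3, \w matches Unicode word characters, so Arabic letters count.
    return re.findall(r"#(\w+)", text)


# Invented sample tweet mixing an Arabic and an English hashtag.
tweet = "Final tonight #الهلال #Alhelal"
tags = extract_hashtags(tweet)
print(tags)
```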
I searched with the tags (#الهلال) and (#Alhelal), which is the name of a Saudi football team. Alhelal played against an Australian football team on 1st November 2014 in the AFC Champions League final. I collected all the tweets with the hashtag (Alhelal) from 1st November to 7th November. The number of tweets came to 5,261, and the peak of tweeting was between 10:00 pm and 12:30 am GMT. The analysis also shows that some tweeters are more popular than others. Below you can see the time scale and data.
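Finding the peak time in an archive of tweets comes down to grouping the timestamps by hour and counting them. The sketch below assumes each tweet has a timestamp in the format shown; both the timestamps and the format are invented for illustration and are not taken from my archive.

```python
from collections import Counter
from datetime import datetime

# Invented timestamps; a real archive would supply one per tweet.
timestamps = [
    "2014-11-01 21:15:00",
    "2014-11-01 22:05:00",
    "2014-11-01 22:40:00",
    "2014-11-01 23:10:00",
    "2014-11-01 22:55:00",
]


def tweets_per_hour(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count tweets in each clock hour."""
    hours = (
        datetime.strptime(stamp, fmt).replace(minute=0, second=0)
        for stamp in stamps
    )
    return Counter(hours)


per_hour = tweets_per_hour(timestamps)
peak_hour, peak_count = per_hour.most_common(1)[0]
print(peak_hour, peak_count)
```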
Also, with this application I was able to see who is the most popular user and which are the most frequent retweets.
Although this app is clearly useful for researchers who archive and analyze media databases, it can be argued that the privacy of Twitter users is not protected. It makes me think that everything published online, especially on social networking sites, is available to anybody not only to read but also to archive.
Information retrieval is very important not only for information sciences, but also for digital libraries. For example, digital libraries offer a list of books in alphabetical order. When the user types the name of a book that she or he wants to find, the system searches for the book by name and then retrieves all the books that bear that name.
Thus, an information retrieval system is convenient and time-saving for the user, because it spares him or her the daunting job of finding a file or document manually.
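The library example can be sketched in a few lines. The catalogue below is entirely hypothetical, and a real digital library would query a database rather than a list, but the principle of matching a typed name against stored titles is the same.

```python
# Hypothetical catalogue; the titles and shelf marks are invented.
CATALOGUE = [
    {"title": "Introduction to Information Retrieval", "shelf": "A1"},
    {"title": "Digital Libraries", "shelf": "B4"},
    {"title": "Digital Libraries", "shelf": "C2"},
]


def find_by_title(catalogue, name):
    """Return every book whose title matches the name, ignoring case."""
    wanted = name.strip().lower()
    return [book for book in catalogue if book["title"].lower() == wanted]


matches = find_by_title(CATALOGUE, "digital libraries")
print([book["shelf"] for book in matches])
```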
Every researcher knows that the information he or she is looking for exists somewhere on the internet. So before we start our search, it is important to do two things:
- Learn the various search sites.
- Understand the logical commands.
Learn the various search sites
Search sites are divided into two main types, namely directories and search engines. Directories organize websites by topic. Users can choose the subject in question and review the list of resources in a category, then narrow the search by descriptors or sub-categories. Directories are an excellent resource when you want to search for general information and browse lists of sites organized in the same category. Yahoo is an example of a search directory.
Search engines, on the other hand, are very large databases of web pages that let you search for pages containing the keywords you have typed. They find pages across the web using “link spiders” or “robots”. Because of their size, search engines are not well suited to general queries, as these can return thousands or even millions of results. However, most search engines include sophisticated search tools that, used properly, can help you find specific information quickly and easily. Google is an example of a search engine.
Understanding the logical commands
If you want to get the most out of search engines, you need to be familiar with their advanced search features. Most of these features rely on logical commands, known as Boolean operators, which are particular symbols or words that let you refine and control your search. The following are some of the most common commands:
– Match any word
This command searches for pages that contain any of the keywords you have typed, not necessarily all of them. It is usually the default setting in many search engines.
For example, if you type “mountains or rivers” the search engine will return all pages that include the word “mountains”, as well as all pages that contain the word “rivers”.
– Match All
This command searches for pages that contain all keywords combined.
For example, if you type “mountains and rivers”, the search engine will return only pages that contain both “mountains” and “rivers”.
This command excludes documents that contain certain words.
For example, if you type “Hudson and not River” or “Hudson not River” the engine will return pages that include the word “Hudson” but not the word “River”. This command is useful if you are interested in information about the Hudson car but do not want to browse pages about the Hudson River.
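The three commands above map neatly onto set operations: "or" is a union, "and" is an intersection and "not" is a difference. The sketch below builds a tiny word index over five invented pages to show this. It is a simplified illustration, not a description of how any real search engine is implemented.

```python
# Invented pages, identified by number.
PAGES = {
    1: "mountains of the north",
    2: "rivers and lakes",
    3: "mountains and rivers of Asia",
    4: "the Hudson river valley",
    5: "the Hudson car company",
}


def build_index(pages):
    """Map each lowercase word to the set of page numbers containing it."""
    index = {}
    for page_id, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(page_id)
    return index


INDEX = build_index(PAGES)


def pages_with(word):
    """Return the set of pages containing the word (empty if none)."""
    return INDEX.get(word.lower(), set())


any_match = pages_with("mountains") | pages_with("rivers")  # match any (OR)
all_match = pages_with("mountains") & pages_with("rivers")  # match all (AND)
excluded = pages_with("Hudson") - pages_with("river")       # exclude (NOT)

print(sorted(any_match), sorted(all_match), sorted(excluded))
```

Note that this toy index matches exact words only, so “river” and “rivers” are treated as different terms.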
Sometimes you will find that a search returns no results, or that the results are not relevant. This can be frustrating, especially if you know that the information you are looking for exists somewhere on the Internet. If this happens, try the following:
– Re-read the Help page to make sure you are using the correct rules.
– Check your spelling.
– Verify that you are using Boolean operators and the correct syntax.
– Try using a less specific query.
– Use synonyms.
– Go to another search engine and try the search again.
I hope you found my post this week useful.
In the DITA computer lab, we learned something very interesting: how to embed a map.
I chose Cambridge Central Library, which is near where I live and where I spend most of my time reading and browsing references.