Sunday, July 18, 2021

Importance of Organisation of Information

Analyze and evaluate the search engines and the metadata on each site. Here are a few questions and suggestions to stimulate your thinking:
• Try several types of searches: vague terms, specific terms, searches you know will bring up disparate records (like “tire”).
• In evaluating the results of the search(es), ask yourself: Did I get too many returned? Too few? Are they the images I expected to get? Why or why not? What things are similar about the images, and what are different?
• Evaluate the search engines themselves. What works? What doesn’t work? What features do you wish they offered? What features were completely useless to you? Was there an advanced search feature? Did it do what you expected it to do?
• Apart from a “search bar,” was there any other way to narrow the records that were returned to you? (Explore the entire site; sometimes these things are hidden in plain view…)
• What elements or entities are present in the metadata associated with each image? Are they the same for all the records you examined?
• Do the metadata values appear to be standardized? How do you know? In what way(s) are they standard?
• Would the website benefit from using standards (e.g., standards associated with Data Structure, Data Value, or Data Content)? If so, which one(s)? What would those standards allow you to do?
• What metadata elements would be the best access points? (i.e., which elements are the most important when you are looking for a digital image)
• Getty Images and Wikimedia Commons have slightly (or not-so-slightly) different purposes. What are the purposes of each site? How does the purpose of each website affect their respective search engines? How does the purpose affect their respective metadata schemata and/or standards?

Getty Images

Getty Images was founded by Mark Getty and Jonathan Klein in 1995. It is currently headquartered in Seattle, Washington. The company was formed with the aim of providing licensed images to customers for specific needs. The repository of Getty Images is very large and contains a vast archive of photographs and licensed videos. The major industries that Getty Images targets are advertising, publishing, graphic design and corporate communications. Getty Images provides access to millions of images, including a few of historical importance, which can be used for educational or commercial purposes.

Wikimedia Commons

Wikimedia Commons was the brainchild of Erik Möller and was launched in September 2004. It is a huge collection of digital media content such as photographs, audio and video files. The intent of Wikimedia Commons was to provide content, mainly for education, across all the projects of the Wikimedia Foundation. It was planned as a shared repository for all Wikimedia services, including Wikipedia, giving them a single place to upload and access media files.

The two main mechanisms for retrieving images from a repository are metadata-based image retrieval and Content-Based Image Retrieval (CBIR). In metadata-based retrieval, an image is usually found through keywords associated with it, which may be added as a description or as annotations. In CBIR, images are searched by features of the content itself, such as size, shape, color or texture. The main difference between the two methods is that metadata search depends heavily on the quality of the annotations, whereas CBIR operates entirely on the image's own attributes.
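To make the contrast concrete, here is a minimal Python sketch of both approaches. Everything in it is an assumption made for illustration: the toy catalogue, the file names, the keywords and the very coarse colour feature are hypothetical, and real CBIR systems use far richer features than this.

```python
from collections import Counter

# Hypothetical toy catalogue: each image has annotation keywords (metadata)
# and a tiny list of RGB pixels standing in for its visual content.
CATALOGUE = {
    "forest.jpg": {"keywords": {"forest", "trees", "nature"},
                   "pixels": [(20, 120, 30), (25, 130, 35), (30, 110, 40), (90, 60, 20)]},
    "meadow.jpg": {"keywords": {"grass", "field"},
                   "pixels": [(40, 160, 50), (35, 150, 45), (30, 140, 40), (120, 200, 90)]},
    "sunset.jpg": {"keywords": {"sunset", "sky", "nature"},
                   "pixels": [(230, 120, 40), (240, 100, 30), (220, 90, 50), (60, 40, 90)]},
}


def keyword_search(query):
    """Metadata-based retrieval: match the query against annotation keywords."""
    term = query.lower()
    return sorted(name for name, img in CATALOGUE.items() if term in img["keywords"])


def dominant_channel_histogram(pixels):
    """Coarse colour feature: the share of pixels dominated by R, G or B."""
    counts = Counter("RGB"[max(range(3), key=lambda i: p[i])] for p in pixels)
    return [counts[c] / len(pixels) for c in "RGB"]


def histogram_intersection(a, b):
    """Similarity in [0, 1]; 1.0 means the two histograms are identical."""
    return sum(min(x, y) for x, y in zip(a, b))


def content_search(query_pixels, top_n=2):
    """Content-based retrieval: rank images by colour similarity to a query image."""
    q = dominant_channel_histogram(query_pixels)
    scored = [(histogram_intersection(q, dominant_channel_histogram(img["pixels"])), name)
              for name, img in CATALOGUE.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:top_n]]


print(keyword_search("nature"))
print(content_search([(10, 200, 10), (20, 180, 30)]))
```

In this toy data the meadow image was never tagged "nature", so the keyword search misses it, while the content search ranks it first for a mostly green query image. That is the dependence on annotation quality described above.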
The main criterion for testing both websites is the quality of the search results. Several measures can be used to analyze them: the number of images returned, whether the results are specific or generic, the UI and user-friendliness of the website, and the features that distinguish the two websites. The metadata of the images is also analyzed. How specific are the annotations? What are the access points? What is the fundamental difference between the two websites?
First, Getty Images was searched using very generic terms such as "mountains", "people", "wildlife" and "books". Each search term returned results specific to that term. On average, close to 60 images per page and around 100 pages of similar content were displayed. The images were all of high quality, and the site offers filters to narrow the search further.
Next, vague terms such as "about", "it", "do" and "the" were used. These terms returned results with no apparent relation to the word searched. The number of results also varied greatly from term to term. What the vague terms had in common was that they mainly returned sports images and images of celebrities.
A few homographs, words with the same spelling but different meanings, were then entered in the search bar: tear, row, project, axes and bat. The results for each word were dominated by one of its possible meanings. For example, the word "bat" mainly displayed the animal rather than the sports equipment. Similarly, the word "row" returned a large number of images of a fight between people rather than the sport.
Metadata gives information about a piece of data, in this case an image file. Image metadata describes the image in a way that makes searching for it and working with it more efficient. Analysis of the metadata suggested that the images are not retrieved on the basis of the annotations given on the image; instead, the search appears to be based on the embedded metadata of the image.
The Wikimedia Commons site was first searched using common terms such as “people”, “animals”, “nature” and “vehicles”. Subcategories appeared for each search term, and these were divided further into smaller, more specific categories. The images were comparatively fewer in number and included many more illustrations and drawings than Getty Images.
As in the previous exercise, vague terms such as “about”, “it”, “do” and “the” were used. These terms returned fewer results than on Getty Images. The results on Wikimedia Commons offered further suggestions for the terms entered in the search box and seemed more streamlined, organized and specific.
The same homographs (tear, row, project, axes and bat) were then entered in the search bar. For the term “bat”, the results displayed only pictures of the animal and not the sports equipment. When “cricket bat” was entered specifically, a few results containing images of the bat were displayed. The results indicate that the search is carried out quite differently from Getty Images: the images appeared to be retrieved based on context.
There were fewer options for narrowing the search. An option on the left side restricts the search to a specified language. The search could also be narrowed by image quality, using the categories Featured pictures, Quality images and Valued images.
Image metadata includes information recorded at the time the image is captured. It can be used to gather information about the image and is used when searching for it. Image metadata can be categorized as technical, contextual or embedded. In Wikimedia Commons the search pattern differs from that of Getty Images: here the metadata analyzed is based on the text around the image, similar to a context-based image retrieval system.
The two websites retrieve images in entirely different ways. In Getty Images, the search works as in a conventional search engine, based on the keywords given at the time of the search. Getty Images is mainly a repository for stock photography that can be used for commercial purposes.
On evaluation, the search results from Getty Images were highly relevant only in specific scenarios. When the search terms were ambiguous, the results were often irrelevant. The number of images returned for each search term was also on the high side.
In some searches, relevant results did not come up. The myriad data and image files in the repository should be organized better. In particular, when a search involves homographs, the site should offer suggestions that help the user retrieve the intended results. Categorizing the files under appropriate headings would help the user search more efficiently.
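The suggestion of grouping homograph results under category headings could be sketched as follows. This is only an illustration under stated assumptions: the records, file names and category headings are hypothetical and do not reflect how either site actually stores its data.

```python
from collections import defaultdict

# Hypothetical index: (file name, keyword, category heading assigned by a cataloguer).
RECORDS = [
    ("bat_flying.jpg", "bat", "Animals"),
    ("bat_cave.jpg", "bat", "Animals"),
    ("cricket_bat.jpg", "bat", "Sports equipment"),
    ("boat_race.jpg", "row", "Sports"),
    ("street_argument.jpg", "row", "People"),
]


def suggest_categories(term):
    """Group the matches for a term by category so the user can pick a meaning."""
    groups = defaultdict(list)
    for filename, keyword, category in RECORDS:
        if keyword == term.lower():
            groups[category].append(filename)
    return dict(sorted(groups.items()))


for category, files in suggest_categories("bat").items():
    print(f"{category}: {files}")
```

Instead of one mixed list, a search for "bat" would then present an "Animals" group and a "Sports equipment" group, letting the user choose the intended meaning.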
Metadata is used to retrieve the results for a search query. One method is to search the manual annotations given to the image. For example, an image of a forest could be annotated as “jungle”, “woods” or “thicket”, and it would then be displayed for any of those search terms. Another method is to search the metadata based on context, where the image is retrieved by automatically detecting the text surrounding it.
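A minimal sketch of these two methods is given below. The records, field names and matching rule are assumptions chosen for illustration, not a description of either website's implementation.

```python
import re

# Hypothetical records: manual annotations plus the text surrounding the image on its page.
IMAGES = [
    {"file": "img_001.jpg",
     "annotations": {"jungle", "woods", "thicket"},
     "surrounding_text": "Dense vegetation photographed during the rainy season."},
    {"file": "img_002.jpg",
     "annotations": set(),
     "surrounding_text": "A cricket bat resting against the pavilion wall."},
]


def tokens(text):
    """Lower-case word tokens from free text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(query):
    """Return (file, matched_on) pairs; annotations are checked before context."""
    words = tokens(query)
    if not words:
        return []
    hits = []
    for image in IMAGES:
        if words <= image["annotations"]:
            hits.append((image["file"], "annotation"))
        elif words <= tokens(image["surrounding_text"]):
            hits.append((image["file"], "context"))
    return hits


print(search("woods"))
print(search("cricket bat"))
print(search("forest"))
```

In this sketch the forest image is found for "woods" but not for "forest", because nobody annotated it with that word, while the unannotated second image is still found through its surrounding text.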
On Wikimedia Commons, the search works differently from Getty Images. Most search terms have to be specific for the correct results to be displayed. When the search terms are vague or ambiguous, the results offer categorized suggestions as links.
The collection of images in Wikimedia Commons seemed smaller than that of Getty Images. The quality of the images, in terms of resolution and other specifications, was also not on par with Getty Images. The images were mostly intended for educational purposes. The UI was also less well suited than that of Getty Images to searching or categorizing the images.
The main drawback of Wikimedia Commons was the limited size of its image repository. The variety and sub-types of images were also limited. For example, a search for the term “root” did not return images of the types of root system. The number of related illustrations and videos was also limited.
The images in Wikimedia Commons contained Exif data, which records the date and time the photo was taken along with other specifications. The metadata also contained license information about the image. An image search analyzes the metadata based on either the context or the embedded information, both of which count as metadata for the image.
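As a small illustration of working with such embedded metadata, the sketch below parses a capture date in the Exif date format (YYYY:MM:DD HH:MM:SS). The dictionary is a hypothetical stand-in for values already read from a file; the camera model and license string are made-up examples.

```python
from datetime import datetime

# Hypothetical embedded metadata, as it might look after being read from an image file.
embedded = {
    "DateTimeOriginal": "2021:07:18 10:30:00",  # Exif date format: YYYY:MM:DD HH:MM:SS
    "Model": "ExampleCam 100",                  # made-up camera model
    "License": "CC BY-SA 4.0",                  # example license string
}


def taken_on(metadata):
    """Parse the Exif capture time, or return None when the field is absent."""
    raw = metadata.get("DateTimeOriginal")
    return datetime.strptime(raw, "%Y:%m:%d %H:%M:%S") if raw else None


print(taken_on(embedded))
```

Once parsed, the capture date can serve as an access point, for example to sort or filter images by when they were taken.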

Conclusion 

The importance of metadata in search engines, especially in image search, was analyzed using two websites with image content. Image search is mainly carried out on the basis of the different kinds of metadata attached to an image. There are algorithms and mechanisms to organize the search based on specific metadata attributes. The metadata can be contextual, technical or embedded, and the search can be carried out on any of these attributes.

