
Thursday, June 11, 2009

Review: Design Considerations for a Network of Information

In this position paper, the authors argue that the Internet should be reshaped around data rather than endpoints, yielding a data-centric network, or network of information.
According to Jacobson [1], the first generation of the network dealt with connecting wires. The second focused on end nodes hosting data, while the third generation should refocus on what humans care about most: information.

Their information model distinguishes between two main types of objects:
Data Objects (DO): the actual bit patterns that contain the information, such as a file, a phone call, a video, a web page, or a song (Beethoven's 9th symphony). Data objects can be divided into smaller pieces, or "chunks", to simplify transfer.
Information Objects (IO): hold semantic information and metadata related to data objects, e.g. that Beethoven's 9th symphony is an MP3 file encoded at 128 kbps. IOs can be composed of other IOs or point directly to one or more DOs. An IO can represent the Eiffel Tower and point to DOs such as pictures, a wiki page or a ticket-purchasing service.
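
A minimal sketch of how this DO/IO model could be represented; the class names and fields below are my own illustration, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class DataObject:
    """A concrete bit pattern (file, video, song...), split into chunks."""
    uid: str                                   # flat, globally unique identifier
    chunks: List[bytes] = field(default_factory=list)

@dataclass
class InformationObject:
    """Semantic metadata; points to DOs and/or other IOs."""
    uid: str
    metadata: dict                             # e.g. {"work": "Beethoven's 9th", "codec": "mp3"}
    refers_to: List[Union["InformationObject", DataObject]] = field(default_factory=list)

# The "Eiffel Tower" IO pointing to several DOs (pictures, a wiki page, ...)
eiffel = InformationObject(
    uid="io:eiffel-tower",
    metadata={"name": "Eiffel Tower", "type": "monument"},
    refers_to=[DataObject(uid="do:photo-001"), DataObject(uid="do:wiki-page")],
)
```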

Versioning and Revocation:
Some information changes frequently, such as newspapers. An IO can represent today's edition, but it should adapt dynamically by binding to another DO (web page) the next day, and similarly for the IO pointing to yesterday's news.
They suggest that objects invalidate themselves in order to preserve consistency: after a certain amount of time, an object must be recertified before it can be used. This technique maintains consistency in the face of, for example, disconnected operation while other replicas are being updated. A DO can be deleted the same way, simply by never recertifying it.
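
A rough sketch of the self-invalidation idea, with each object carrying an expiry time and needing recertification before use; the lifetime value and method names are my assumptions:

```python
import time

CERT_LIFETIME = 3600  # seconds; arbitrary choice for this sketch

class CertifiedObject:
    def __init__(self, uid):
        self.uid = uid
        self.valid_until = time.time() + CERT_LIFETIME

    def is_valid(self):
        # The object invalidates itself once its certification expires.
        return time.time() < self.valid_until

    def recertify(self):
        # Without periodic recertification the object effectively disappears,
        # which is also how deletion can be realised.
        self.valid_until = time.time() + CERT_LIFETIME

obj = CertifiedObject("do:todays-news")
assert obj.is_valid()
obj.recertify()
```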

Security considerations:
Security in today's architecture is based on trust in the host delivering the object (its encryption keys); they propose reshaping security conventions so that we handle secured data instead of secured tunnels.
Integrity and authenticity are tied directly to an object's name, meaning there is a cryptographic relation between the name and the object, as with self-certifying names. However, to enable off-line verification, DOs would have to carry private keys, which can be compromised. Another approach is to attach precomputed signatures to objects. This remains an open research area.
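
A small sketch of one common flavour of self-certifying naming, where the flat name is derived from a hash of the content so integrity can be checked without trusting the delivering host. This illustrates the general idea only, not the paper's exact scheme:

```python
import hashlib

def self_certifying_name(content: bytes) -> str:
    # The flat name is simply the hash of the bits it designates.
    return hashlib.sha256(content).hexdigest()

def verify(name: str, content: bytes) -> bool:
    # Anyone can recompute the hash and check it against the name,
    # regardless of which host delivered the object.
    return self_certifying_name(content) == name

data = b"Beethoven's 9th symphony (mp3 bytes)"
name = self_certifying_name(data)
assert verify(name, data)
assert not verify(name, b"tampered bytes")
```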

Name resolution (NR):
Data objects are retrieved based on their unique identifier (UID). NR starts by locating the object in the network; routing then forwards the retrieval query to the object's storage location, and finally the DO is sent back to the requesting client.
Name resolution resolves a UID into one or more locations and should work at both global and local scale, for example through cooperation between NR systems. They illustrate a side effect of adopting an ID/address split mechanism with the following example: if a laptop hosting numerous data objects moves, the location of all those data objects changes too, leading to a huge number of updates in the NR system.
The NR system will be shaped by the characteristics of the namespace. They would like to adopt flat names, which avoid tying names to ownership and have other desirable properties described in [2].
Of course, using flat names rules out hierarchical namespaces and systems like DNS.
DHT-based solutions are promising since they are decentralized, scalable, self-organizing, and need no central infrastructure. However, a global DHT uses flat, non-hierarchical names, which hinders cooperation with other (hierarchical) systems.
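
A toy sketch of flat-name resolution, with a plain dictionary standing in for a DHT that maps a UID to the current set of storage locations; the API is invented for illustration:

```python
class NameResolver:
    """Maps flat object UIDs to one or more current locations (a stand-in for a DHT)."""

    def __init__(self):
        self._table = {}   # uid -> set of locations

    def register(self, uid: str, location: str):
        self._table.setdefault(uid, set()).add(location)

    def unregister(self, uid: str, location: str):
        self._table.get(uid, set()).discard(location)

    def resolve(self, uid: str):
        return self._table.get(uid, set())

nr = NameResolver()
nr.register("do:photo-001", "node-42.example.net")
# If the hosting laptop moves, every object it hosts must be re-registered --
# the update burden the authors warn about with an ID/address split.
print(nr.resolve("do:photo-001"))
```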

Routing:
The number of addressable entities keeps growing and will reach millions or even billions within a few years with the emergence of sensor networks, the Internet of Things, ever-growing data, etc. They note that current routing research is not encouraging in this respect, according to [3] (to be reviewed later). Hence, they will investigate the efficiency of name-based routing, which integrates the resolution and retrieval paths: a DO is located by its ID, with the ID transformed directly into a path rather than going through an ID-to-address translation. Other techniques such as LLc and NodeID are also to be investigated (to be reviewed soon).
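
A very rough sketch of the contrast: instead of resolving an ID to an address and then routing on addresses, each node forwards the request by looking the ID up in its own table until the object is reached. The tables and topology below are entirely invented:

```python
# Each node keeps a table mapping object IDs it knows about to a next hop
# (or to local storage). A request is forwarded on the ID itself, with no
# intermediate ID-to-address resolution step.
ROUTING_TABLES = {
    "A": {"do:photo-001": "B"},          # A forwards requests for this DO to B
    "B": {"do:photo-001": "LOCAL"},      # B stores the object itself
}

def route_by_name(start_node: str, uid: str, max_hops: int = 8):
    node = start_node
    for _ in range(max_hops):
        next_hop = ROUTING_TABLES.get(node, {}).get(uid)
        if next_hop is None:
            return None                  # no route known for this name
        if next_hop == "LOCAL":
            return node                  # object found; the DO flows back along the path
        node = next_hop
    return None

print(route_by_name("A", "do:photo-001"))  # -> "B"
```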

Storage:
The information network can be implemented following two different models:
  • Network-based storage model, where storage resources are provided by the network infrastructure, such as dedicated storage servers.
  • Network-managed storage model, where network nodes control portions of the storage of users connected to the network. Users can decide which DOs go public, which are shared only with friends, and so on.

Search:
Search systems are expected to go far beyond text matching, towards semantic search or even search based on GPS position and location. For example, when a picture of the Eiffel Tower is taken, a search mechanism would identify the monument based on GPS or other techniques and point to related DOs such as a web page, its history, etc.

This position paper offers many ideas and predictions about the future Internet architecture and highlights weaknesses in the current addressing system. The authors distinguish between DOs and IOs and argue that a network of information needs a scalable naming system supported by an efficient routing system.

References:
1 - V. Jacobson, M. Mosko, D. Smetters, and J. Garcia-Luna-Aceves. Content-centric networking. Whitepaper, Palo Alto Research Center, Jan. 2007.
2 - M. Walfish, H. Balakrishnan, and S. Shenker. Untangling the web from DNS. In NSDI’04: Proc. 1st Symp. on Networked Systems Design and Implementation, San Francisco, CA, USA, 2004.

Link to the article

Wednesday, April 15, 2009

Review: Toward a search architecture for software components

This paper proposes the design of a component search engine for Grid applications.
With the development of the component-based programming model, applications are increasingly assembled dynamically from associated components. Developers should be able to reuse already developed components that match their needs, so a component search engine seems essential.
Component search for Grid applications offers two facilities:
  1. Developers will be able to find the best component for their needs.
  2. The framework can replace a malfunctioning or slow component dynamically (at run time). The application should be able to decide which component best replaces the malfunctioning one.
They assume that open-source Grid applications will appear and that software components will be available on portals. These components are ranked according to their usage: the more a component is used by applications, the more important it is considered. This ranking establishes a trust index, much like the approach Google uses to rank pages and improve search results.
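
A small sketch of what such usage-based ranking could look like, here a simplified PageRank over a "uses" graph between components; the graph, damping factor and iteration count are my choices, not the paper's:

```python
def rank_components(uses, damping=0.85, iterations=20):
    """Simplified PageRank: uses[a] is the list of components that a uses.
    A component used by many (highly ranked) components gets a higher rank."""
    nodes = set(uses) | {c for targets in uses.values() for c in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in uses.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Toy "uses" graph: three applications all rely on a common parser component.
usage_graph = {
    "app1": ["parser", "logger"],
    "app2": ["parser"],
    "app3": ["parser"],
}
print(sorted(rank_components(usage_graph).items(), key=lambda kv: -kv[1]))
```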

One piece of related work:
The Agora component search engine supports the location and indexing of components as well as their search and retrieval. Agora automatically discovers sites containing software components by crawling the web (with Google's web crawler); when it finds a page containing an Applet tag, it downloads and indexes the related component. Agora supports JavaBeans and CORBA components. The database search is keyword-based and refined by users.

Workflows:
A workflow can be described as a process description of how tasks are done, by whom, in what order, and how quickly.
Workflows are represented with low-level languages such as BPEL4WS, which require too much user effort even to describe a simple workflow.
Higher-level languages and graphical user interfaces are being built on top of BPEL4WS to generate the BPEL code.
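
As a rough illustration, the kind of abstract workflow such higher-level tools capture before generating BPEL could be described as ordered steps with placeholders and data dependencies; this structure is purely illustrative and is not the paper's notation:

```python
# An abstract workflow: named steps, a placeholder for the component that
# should perform each step, and which step's output feeds it.
workflow = [
    {"step": "fetch",     "placeholder": "data-source component",     "input_from": None},
    {"step": "transform", "placeholder": "unit-conversion component", "input_from": "fetch"},
    {"step": "publish",   "placeholder": "visualisation component",   "input_from": "transform"},
]

for step in workflow:
    print(f"{step['step']}: needs a '{step['placeholder']}', input from {step['input_from']}")
```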

Their approach is workflow-based: components can be adapted and coordinated through workflows. Applications should be able to choose and bind to components from different sources on the Grid. Such an application first searches its own local repository for components previously used or installed, and then uses a search engine to find suitable components.

The application development process can be divided into three stages:
  1. Application sketching: developers specify (1) an abstract workflow plan describing how information passes through the application's parts, and (2) place-holders describing the functions and operations to be carried out. This description helps produce a list of suitable components.
  2. Component discovery proceeds in two steps. First, the place-holder query is resolved by searching the local repository; if a suitable component is found locally, its identifier is returned to the application. Second, if no component is found, a query session is started on remote sites; a list of ranked components is returned and refined by user specifications (see the sketch after this list).
  3. Application assembly is the binding phase. Data or protocol conversions are often needed due to heterogeneous inputs and outputs between components (string to array-of-double conversion, etc.).
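
A sketch of the two-step discovery in stage 2, with a local lookup first and a remote ranked query as fallback; the repository and search interfaces are invented for illustration:

```python
def discover_component(placeholder_query, local_repo, remote_search, user_filter=None):
    """Resolve a place-holder: local repository first, remote ranked search otherwise."""
    # Step 1: look for a previously used/installed component.
    local_hit = local_repo.get(placeholder_query)
    if local_hit is not None:
        return local_hit                          # identifier of the local component

    # Step 2: query remote sites and get a ranked candidate list.
    candidates = remote_search(placeholder_query)  # assumed to return rank-ordered IDs
    if user_filter is not None:
        candidates = [c for c in candidates if user_filter(c)]
    return candidates[0] if candidates else None

# Toy usage:
local_repo = {"matrix-multiplication": "local:linalg-v1"}
remote = lambda q: ["remote:fast-linalg", "remote:basic-linalg"]
print(discover_component("matrix-multiplication", local_repo, remote))  # local hit
print(discover_component("fft", local_repo, remote))                    # remote, top-ranked
```
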
GRIDLE is their component search engine: a Google-like Ranking, Indexing and Discovery service for a Link-based Eco-system of software components. Its main modules are the following:
  1. The Component Crawler works like a web crawler: it retrieves new components, updates the links (bindings) between components, and passes the results to the indexer.
  2. The Indexer builds GRIDLE's index data structure. The characteristics and metadata associated with a component must be carefully selected for indexing, since this metadata is what allows the right component to be retrieved. Such metadata can include: (1) functional information, like interfaces (published methods, names, signatures) and the runtime environment; (2) non-functional information, such as QoS and a textual description; (3) linking information to other components (see the sketch after this list).
  3. The Query Analyzer resolves queries against the index; it uses a ranking module to retrieve the most relevant components, and the search can then be refined by the user.
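
A sketch of the kind of per-component index entry the Indexer might build, grouping the three classes of metadata listed above; the field names are my own guesses, not GRIDLE's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentIndexEntry:
    component_id: str
    # (1) Functional information
    methods: List[str] = field(default_factory=list)       # published method signatures
    runtime: str = ""                                       # required runtime environment
    # (2) Non-functional information
    qos: dict = field(default_factory=dict)                 # e.g. {"latency_ms": 5}
    description: str = ""                                   # free-text description
    # (3) Linking information
    linked_components: List[str] = field(default_factory=list)

def matches(entry: ComponentIndexEntry, keyword: str) -> bool:
    """Very naive keyword query against the textual metadata."""
    haystack = " ".join([entry.description] + entry.methods).lower()
    return keyword.lower() in haystack

entry = ComponentIndexEntry(
    component_id="comp:fft-1.2",
    methods=["double[] transform(double[] samples)"],
    runtime="JVM 1.5",
    qos={"latency_ms": 3},
    description="Fast Fourier transform component",
    linked_components=["comp:windowing-0.9"],
)
print(matches(entry, "fourier"))  # True
```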

At this stage I do not have advanced knowledge of such systems and search engines, but I find this approach interesting since the world of component development is emerging.
In the near future, thousands of components will be developed and ready to use. One of the main reasons for the wide adoption of the component-based programming model is the ability to reuse already developed components and save time during development. A search engine seems necessary in order to find and locate suitable components.
Some issues in their approach remain unexplained or unclear, such as:
  • Components will be updated, deleted, and added, so how should the crawler's iteration frequency be chosen in order to keep the index up to date?
  • The same question arises for component bindings. The model is inspired by web pages, but I think components are more dynamic where binding is concerned: bindings appear and disappear at run time when a component is replaced. How is a component's ranking maintained, and how often is the ranking algorithm run?
  • In their approach, the local repository is searched first. What if remote sites hold better-suited components with higher ranks than those already in the local repository? What policy should be used to keep the local repository up to date?
  • The crawling module searches for new components; do we need to deploy an agent on every repository?
  • How are heterogeneous component models, e.g. COM and CORBA components, handled?
  • Using the Semantic Web and ontologies might simplify mapping and querying, even though the GRIDLE designers see it as a disadvantage because it forces the use of a single unified taxonomy.

Link to the article
PS: According to the ranking algorithm, the rank of the page hosting the article increases while the rank of my blog decreases; in effect, I am giving away a portion of my page's rank.